
CHAPTER FOUR

UNDERSTANDING SVM HYPERPARAMETERS

Contents

4.1 Introduction
4.2 Role of RBF and linear kernel hyperparameters
4.3 SVM behavior across a large spectrum of hyperparameter values
    4.3.1 Linear kernels
    4.3.2 RBF kernels
4.4 Hyperparameters vs the number of training samples (N)
    4.4.1 C vs N
    4.4.2 γ vs N
    4.4.3 Algorithm for finding optimal hyperparameters on a subset of the data
    4.4.4 Experimental analysis
4.5 Conclusion

4.1 INTRODUCTION

Training SVM hyperparameters is a difficult problem and has received much attention, as mentioned in Section 2.2. This serves as an introductory chapter to understanding the SVM hyperparameters. An important relationship with regard to C and SVM accuracy was identified, which led to the work presented in Chapter 5. While these observations are not novel (Keerthi et al. [16] have made similar observations), we base our hypothesis that SVMs are in fact not LMCs for non-separable datasets on the relationship between C and SVM accuracy, which we observed.

A better understanding of the hyperparameters' behavior will also lead to better insight into how to select sensible hyperparameter values.

The chapter is organized as follows: In Section 4.2, we give a brief overview of the role of the different SVM hyperparameters, and then investigate SVM behavior over a wide range of hyperparameter values in Section 4.3. The relationship between the hyperparameters and the amount of training data, as well as a novel hyperparameter tuning strategy, is discussed in Section 4.4, followed by experiments in Section 4.4.4, testing the proposed tuning strategy.

4.2 ROLE OF RBF AND LINEAR KERNEL HYPERPARAMETERS

Two popular kernels and their corresponding hyperparameters will be discussed: the linear and RBF kernels.

The linear SVM has no kernel parameters, hence the only parameter to be tuned is C from Eq. 2.12. C penalizes samples that are either misclassified, or that fall within the margin surrounding the separating hyperplane. High values of C would thus give more weight to the misclassification term in Eq. 2.12 than the margin. In the extreme of C → ∞, the margin term becomes negligible; minimizing Eq. 2.12 hence entails minimizing the sum of all errors only. Small values of C give more weight to the margin term, with C → 0 ensuring that the margin is maximized (C = 0 is not sensible for non-separable problems).

The same arguments for C apply to the case where an RBF kernel is used. As discussed above, the RBF kernel

k(x_i, x_j) = exp(-γ ||x_i - x_j||^2)    (4.1)

allows the SVM to construct non-linear decision boundaries by mapping the data into an infinite-dimensional feature space. In addition to C, one has to search for optimal values of γ, which controls the kernel width.
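To make the role of γ concrete, the following sketch (ours, not from the thesis; it assumes only numpy) evaluates Eq. 4.1 for a small random sample at two values of γ, showing that a larger γ gives a narrower kernel whose off-diagonal entries approach zero.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """Compute K[i, j] = exp(-gamma * ||x_i - x_j||^2) for all pairs of rows of X."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

X = np.random.RandomState(0).randn(5, 3)          # five samples, three features
print(rbf_kernel_matrix(X, gamma=10 ** -0.5))     # wide kernel: entries well away from zero
print(rbf_kernel_matrix(X, gamma=10 ** 2))        # narrow kernel: close to the identity matrix
```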

Optimal hyperparameter values are problem-specific (see Fig. 4.1 for example, where the optimal value of C is clearly problem-dependent), as are the exact values of C and γ, which we consider to be large or small. Furthermore, C and γ are not independent from one another. In general, for non-separable datasets, we consider values of C to be large when (assuming the optimal value of γ at that value of C) (1) the number of SVs has stabilized close to or at the minimum number of possible SVs (as opposed to many SVs for small values of C) and (2) the optimal achievable accuracy has been reached. C is small until the number of SVs starts declining and the accuracy starts increasing. From Figure 4.4, C on the Thyroid dataset would thus be considered large above 10^0.5 and small below 10^-1.

4.3 SVM BEHAVIOR ACROSS A LARGE SPECTRUM OF HYPERPARAMETER VALUES

Keerthi et al. [16] investigated SVM behavior at very small and large values of the SVM hyperparameters in an attempt to understand the hyperparameter space better and come up with a heuristic approach to traversing it more efficiently than by grid search. In this section, we will extend and verify the relationship between large C and SVM accuracy they presented, with the aim of identifying reasonable boundaries within which one could expect to find the optimal hyperparameter values. For linear kernels specifically, it is found that large values of C lead to good classification accuracy. It also casts doubt on the LMC tag associated with SVMs (for linear kernels at least), given that it will be shown that the misclassification term dominates the margin term in the SVM error function (see Chapter 5).

4.3.1 LINEAR KERNELS

Keerthi et al. [16] observed that for linear SVMs, after a sufficiently high value of C > C*, the cross-validation accuracy converges to a value close to (if not at) optimal accuracy. This implies that as C → ∞, E_SVM → C Σ_i ξ_i. We repeated and confirmed the observation from [16] on a number of datasets, as displayed in Fig. 4.1. SVMs were trained and evaluated using 10-fold cross-validation, with C ranging from 10^-7 to 10^7 in increments of 10^0.5. The average 10-fold cross-validation error for the Banana, Titanic, Thyroid, Heart, Solar Flare, Diabetes, Image and German datasets is displayed in Fig. 4.1. In addition to cross-validation accuracy, the average number of SVs at each value of C is also shown in Fig. 4.2.
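The sweep just described can be sketched as follows, assuming scikit-learn and a dataset already loaded into numpy arrays X and y; the loading code, solver settings and exact datasets are not specified here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X, y: feature matrix and labels of one benchmark dataset (loading not shown).
C_grid = 10.0 ** np.arange(-7, 7.5, 0.5)   # 10^-7 ... 10^7 in steps of 10^0.5
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for C in C_grid:
    svm = SVC(kernel="linear", C=C)
    acc = cross_val_score(svm, X, y, cv=cv).mean()   # mean 10-fold CV accuracy
    n_sv = svm.fit(X, y).n_support_.sum()            # number of SVs when trained on all data
    print(f"log10(C)={np.log10(C):+.1f}  CV acc={acc:.3f}  #SVs={n_sv}")
```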

The results indicate that C → 0 leads to poor classification accuracy (for all datasets in Fig. 4.1; the lowest accuracy is achieved when C < 10^-3.5). Figs. 4.2 and 4.3 give some insight into why this is true in practice: the SVM learns very little for small values of C and assigns almost all training points as SVs (severe underfitting). As the value of C is increased, the SVM starts to approximate the true decision boundary. This can be seen (albeit for an SVM with an RBF kernel) from Fig. 4.2, in that the number of SVs is reduced significantly.

The reduction of SVs correlates with the corresponding increase in accuracy observed in Fig. 4.1 as the value of C is increased. It is also easy to see in Fig. 4.3, where the SVs are highlighted in colored circles. As C is increased from 10^-2 in Fig. 4.3(a) to 10^4 in Fig. 4.3(d), the corresponding accuracy increases from 94.4% to 96.8%, with the SVs visibly becoming fewer and more concentrated around the true decision boundary.

For the set of problems considered in this thesis, C = 10^-3.5 is clearly too small, as significant increases in accuracy are observed for all problems when C becomes larger. The range over which this increase in accuracy occurs is also problem-specific: for the Thyroid dataset in Fig. 4.1, a good estimation of the decision boundary only starts after C = 10^-1, with the optimal accuracy achieved after C = 10^4. For the Titanic dataset, on the other hand, accuracy starts increasing just after C = 10^-3.5 and optimal accuracy is attained at C = 10^-2. It is interesting to note that the accuracy for all datasets peaks and then either stays at the same level, or declines very slightly.


Figure 4.1: 10-fold cross-validation accuracy for linear SVMs against log(C). All functions converge after a sufficiently high value of C.

4.3.2 RBF KERNELS

The relationship between C and SVM accuracy, which was observed for the linear kernel, is also evident in the case of the RBF kernel. Specifically, an increase in C corresponded to an increase in the SVM accuracy when using a linear kernel. For the range of values of C that was considered, further increases in C did not result in a significant drop in accuracy after reaching optimal accuracy. This trend (increase in C ⇒ increase in SVM accuracy) can be seen in Fig. 4.4, where a color scale is used to indicate accuracy (low accuracy is indicated in dark blue, while high accuracy is indicated in dark red). The trend is not as simple as in the linear case, where only one hyperparameter has to be tuned, as there is clear dependence between C and γ. However, at very large values of C, there are still values of γ for which high accuracies are achievable¹.

Figure 4.2: Number of SVs vs log(C). The number of SVs is indicated as a fraction of the dataset size, as the dataset sizes (and hence the number of SVs) vary significantly, making presentation on a single graph difficult. It is clear that for small C, the algorithm does not learn much and assigns almost all points as SVs.

It is also clear that very large values of γ lead to severe overfitting and a steep corresponding drop in accuracy, even for large C. Small values of C lead to poor classification accuracy irrespective of the value of γ.

The optimal region within which to search for the hyperparameter values is evidently where C is large and γ is small.

¹ These contour plots are always generated with log(C) and log(γ) on the respective axes and accuracy indicated by color. This is useful because the dependence of C and γ is easy to see. It is also the conventional way in which these grid search results are represented in this field (see for example [16]).
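A rough sketch of how such a contour plot can be generated, again assuming scikit-learn and matplotlib with arrays X and y; the grid limits below mirror the plots in this chapter but are otherwise illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

log_C = np.arange(-3.0, 7.5, 0.5)       # grid of log10(C) values
log_g = np.arange(-5.0, 3.5, 0.5)       # grid of log10(gamma) values
acc = np.zeros((len(log_C), len(log_g)))

for i, lc in enumerate(log_C):
    for j, lg in enumerate(log_g):
        svm = SVC(kernel="rbf", C=10.0 ** lc, gamma=10.0 ** lg)
        acc[i, j] = cross_val_score(svm, X, y, cv=10).mean()

# Contour plot with log(gamma) on the x-axis and log(C) on the y-axis, accuracy as color.
plt.contourf(log_g, log_C, acc, levels=20, cmap="jet")
plt.xlabel("log(gamma)"); plt.ylabel("log(C)"); plt.colorbar(label="CV accuracy")
plt.show()
```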

[Figure 4.3 panels: (a) C = 10^-2, γ = 10^-0.5, accuracy 94.4; (b) C = 10^-1, γ = 10^-0.5, accuracy 94.9; (c) C = 10^0.5, γ = 10^-0.5, accuracy 95.55; (d) C = 10^4, γ = 10^-0.5, accuracy 96.8.]

Figure 4.3: An artificial two-class dataset is shown (class one samples are shown in red and class two samples are shown in blue). Support vectors for class one are highlighted in black, while SVs from class two are highlighted in green. The dataset was generated by randomly sampling points from a Gaussian mixture model (for details see Section 3.4.1). The data points that are retained as SVs after training an SVM, and the corresponding decrease of the number of such data points as C is increased from 10^-2 to 10^4, are shown. (γ is kept constant at 10^-0.5 in all cases.) It is clear that as C is increased, the SVM starts to approximate the true boundary between the classes, as the SVs become more concentrated on that boundary.


Figure 4.4: Contour plots depicting the CV accuracy over a wide range of log(C) and log(γ) for the UCI Diabetes, Thyroid, Heart and German datasets. Accuracy is indicated by color (see color bar next to each figure), with dark blue corresponding to the lowest accuracy achieved and dark red to the highest accuracy achieved.


4.4 HYPERPARAMETERS VS THE NUMBER OF TRAINING SAMPLES (N)

In this section, we will discuss the relationship between the SVM hyperparameters and N, the number of training samples. We will show that there is a useful relationship between C and N, which can be exploited in cases where there is too much training data to perform a normal grid search in an acceptable amount of time (an extensive grid search on the DFKI problem, for example, will take approximately four months if performed on a single state-of-the-art PC).

4.4.1 C VS N

Consider the SVM error function in Eq. 2.12: as the number of training samples is increased, the width of the optimal separating boundary, and thus the first term in that equation, will remain approximately constant. Since the fraction of marginal or misclassified samples will also depend only weakly on N for large enough N, the summation in the second term will grow linearly with N. Hence, C ∝ 1/N to maintain a constant balance between the two terms.
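As a small numeric illustration of this scaling rule (the helper name scale_C is ours, not from the thesis): keeping the product C·N roughly constant means that a C found on a 12.5% subset is divided by eight when moving to the full dataset.

```python
def scale_C(C_sub, N_sub, N_full):
    """Scale a C value found on a subset of size N_sub to the full dataset of size N_full,
    keeping the balance between the margin term and the C * sum-of-errors term constant."""
    return C_sub * N_sub / N_full

# Example: an optimal C of 100 found on 12.5% of the data maps to 100 / 8 = 12.5 on the full set.
print(scale_C(C_sub=100.0, N_sub=1000, N_full=8000))   # -> 12.5
```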

From Fig. 4.5, this relationship can be seen to hold on a sufficiently large dataset (in this case the image classification dataset from the UCI database). In particular, note how the lower range for C from within which one can obtain the best accuracy systematically increases as the amount of data is decreased. Tables 4.1, 4.2, 4.3, 4.4 and 4.5 show the exact values of the corresponding grid searches. Note how the contours start increasing at C = 10^-2 when all samples are used for training, but at C = 10^-1 when only 12.5% or 6.25% is used.

Care should be taken in cases where high classification accuracy can be obtained, though (hence a small percentage of misclassifications), since the probability of selecting a subset with a representative distribution of the samples that will be misclassified becomes smaller. This is evident if one considers the extreme case of selecting a 50% subset of a dataset with 100 samples and only one error: it is impossible to select a representative distribution.


Figure 4.5: Contour plots showing the results of a grid search with varying amounts of data. Fig. 4.5(a) shows a contour plot for all of the data, with every subsequent figure generated with half the amount of data of the previous plot. In this fashion, Fig. 4.5(d) is generated with an eighth of the amount of data used for Fig. 4.5(a). Accuracy is indicated by color (see color bar next to each figure), with dark blue corresponding to the lowest accuracy achieved and dark red to the highest accuracy achieved.

Table 4.1: Grid search results when using all (100%) of the training samples on the Image dataset.

[Table 4.1 body: 10-fold cross-validation accuracy (%) for each combination of log(C) from -3 to 7 and log(γ) from -5 to 3 in steps of 0.5; the full numeric grid is not reproduced here.]

Table 4.2: Grid search results when using 50% of the training samples on the Image dataset.

[Table 4.2 body: 10-fold cross-validation accuracy (%) over the same log(C) and log(γ) grid as Table 4.1; the full numeric grid is not reproduced here.]

Table 4.3: Grid search results when using 25% of the training samples on the Image dataset.

[Table 4.3 body: 10-fold cross-validation accuracy (%) over the same log(C) and log(γ) grid as Table 4.1; the full numeric grid is not reproduced here.]

Table 4.4: Grid search results when using 12.5% of the training samples on the Image dataset.

[Table 4.4 body: 10-fold cross-validation accuracy (%) over the same log(C) and log(γ) grid as Table 4.1; the full numeric grid is not reproduced here.]

Table 4.5: Grid search results when using 6.25% of the training samples on the Image dataset.

[Table 4.5 body: 10-fold cross-validation accuracy (%) over the same log(C) and log(γ) grid as Table 4.1; the full numeric grid is not reproduced here.]


4.4.2 γ VS N

The relationship between optimal kernel widths and N is known to be weak in well-studied problems such as kernel density estimation (see Section 3.4 of [67], where a relationship γ ∝ N^(1/5) is derived). While the exact relationship is of course problem-dependent, the fact that it has been found to be weak is of importance for the purpose of our argument. This can be understood from the weak dependence of nearest-neighbor distances on N. In kernel density estimation, one wants to choose the bandwidth to be as small as possible in order not to over-smooth the probability density estimate. The exact size of the bandwidth (γ) in the case of an RBF kernel is influenced most by the points closest to the point being considered. Asking how this distance changes as N is increased is thus similar to asking how nearest-neighbor distances change as N is increased. Typical examples are shown in Fig. 4.6, where we see that a 64-fold increase in N only changes the median of the nearest-neighbor distance by a factor of three (five dimensions) and a factor of two (ten dimensions). We consider this to be a weak relationship, as a considerable increase in dataset size does not have the same relative influence on the median nearest-neighbor distance. We therefore expect a similarly weak relationship between the optimal γ and N for SVM training². We consequently assume that a narrow line search around a value obtained on a subset of the data will suffice to obtain the optimal kernel width (we use narrow rather subjectively in this thesis, but do restrict narrow to mean at most 10^2 larger or smaller than the current value).

[Figure 4.6 panels: density estimates of the nearest-neighbor distance d_min in five and ten dimensions, for N = 1, 2, 4, 8, 16, 32 and 64.]

Figure 4.6: Density estimates for the distance to the nearest neighbor when randomly sampling N points from a five- and 10-dimensional normal distribution with zero mean and unit variance. Note the weak implied relationship between γ and N.

² This observation can be seen to hold on the Image dataset, where it is clear that the influence of the value of N on good values of γ is small (see Tables 4.1, 4.2, 4.3, 4.4 and 4.5).
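Fig. 4.6 can be approximated with the sketch below, assuming numpy; since the exact protocol behind the figure is not spelled out, the nearest-neighbor distance is taken here from a fresh query point to N reference points, which is one plausible reading rather than the thesis's exact procedure.

```python
import numpy as np

def median_nn_distance(N, dim, trials=2000, seed=0):
    """Median distance from a standard-normal query point to the nearest of N
    standard-normal reference points in `dim` dimensions."""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(trials):
        ref = rng.standard_normal((N, dim))
        query = rng.standard_normal(dim)
        dists.append(np.min(np.linalg.norm(ref - query, axis=1)))
    return float(np.median(dists))

for dim in (5, 10):
    d1, d64 = median_nn_distance(1, dim), median_nn_distance(64, dim)
    # A 64-fold change in N alters the median nearest-neighbor distance only by a small factor.
    print(f"{dim}d: N=1 -> {d1:.2f}, N=64 -> {d64:.2f}, ratio {d1 / d64:.2f}")
```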


We also investigated some measures related to the feature space to see if there are any indicators of the range within which the optimal γ occurs. We present some of these measures, and find that when the training samples are scaled to have zero mean and unit variance, the inverse of the dimensionality, 1/d, is a surprisingly accurate predictor of the value of γ for most datasets. This is shown for the datasets considered in this thesis in Section 7.3.4, as SVMs trained with this approximation to γ give statistically indistinguishable results from an SVM trained with a full grid search (see Tables 7.4 to 7.16).

The specific measures we considered are illustrated by histograms in Figs. 4.7 and 4.8, depicting the squared Euclidean distances for each instance to each of:

• the nearest neighbor (nn) within the same class
• the nearest neighbor in the other class
• the mean of the same class
• the mean of the other class
• all points in the same class
• all points in the other class.

These measures were chosen simply to explore the relationship between N and γ and the influence that class separability may have, if any, on the optimal value for γ.
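The six measures listed above could be computed along the following lines (our reconstruction, assuming numpy and scipy and a two-class dataset X, y; this is not the thesis code):

```python
import numpy as np
from scipy.spatial.distance import cdist

def class_distance_measures(X, y):
    """Mean squared Euclidean distance per instance to: the nearest neighbor in the same and
    the other class, all samples of the same and the other class, and the two class means.
    A two-class problem is assumed."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    D = cdist(X, X, metric="sqeuclidean")
    same = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)  # same class, self excluded
    other = y[:, None] != y[None, :]
    mean_of = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    own = np.array([mean_of[c] for c in y])                          # own-class mean per instance
    opp = np.array([mean_of[c2] for c in y for c2 in mean_of if c2 != c])
    return {
        "nn_same":    np.where(same, D, np.inf).min(axis=1).mean(),
        "nn_other":   np.where(other, D, np.inf).min(axis=1).mean(),
        "all_same":   (np.where(same, D, 0).sum(axis=1) / same.sum(axis=1)).mean(),
        "all_other":  (np.where(other, D, 0).sum(axis=1) / other.sum(axis=1)).mean(),
        "mean_same":  ((X - own) ** 2).sum(axis=1).mean(),
        "mean_other": ((X - opp) ** 2).sum(axis=1).mean(),
    }
```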

The remaining measures are presented in Table 4.6. It is interesting to note that the minimum value of γ has a high inverse correlation of -0.614 with the feature dimensionality (see Table 4.7). In order to test the effect of redundant features on a correlation with γ, feature reduction was also performed by selecting eigenvectors until 95% and 80% of the variance were accounted for. This did not, however, result in a higher correlation.


[Figure 4.7 panel statistics (German): (a) all samples, other class: mean 40.60, median 35.73; (b) all samples, same class: mean 39.40, median 34.52; (c) nn, other class: mean 11.23, median 10.37; (d) nn, same class: mean 9.80, median 8.35; (e) other class mean: mean 20.90, median 17.36; (f) same class mean: mean 20.84, median 17.25.]

Figure 4.7: German: histograms depicting squared Euclidean distances to all samples in the other 4.7(a) and the same 4.7(b) class, the nn in the other 4.7(c) and same class 4.7(d), and to the other class mean 4.7(e), as well as to the same class mean 4.7(f).


[Figure 4.8 panel statistics (Image): (a) all samples, other class: mean 38.47, median 25.69; (b) all samples, same class: mean 33.44, median 22.33; (c) nn, other class: mean 6.57, median 2.43; (d) nn, same class: mean 0.69, median 0.14; (e) other class mean: mean 19.01, median 12.72; (f) same class mean: mean 18.75, median 12.57.]

Figure 4.8: Image: histograms depicting squared Euclidean distances to all samples in the other 4.8(a) and the same 4.8(b) class, the nn in the other 4.8(c) and same class 4.8(d), and to the other class mean 4.8(e), as well as to the same class mean 4.8(f).

Table 4.6: Mean Euclidean distances (μ) between samples for several datasets. The subscripts o and s refer to samples from the other and same classes respectively. We also show the number of dimensions if the first 95% and 80% of the variance were explained respectively, considering eigenvalues and eigenvectors calculated on the data covariance matrix. BC refers to the Breast Cancer dataset.

[Table 4.6 body: per dataset (Banana, BC*, DFKI 1 & 4, DFKI 5 & 7, DFKI 2 & 5, Diabetes, Solar Flare, German, Heart, Image, Splice, Thyroid and Titanic), the optimal log(γ) with its minimum and maximum, the mean distances μ_o and μ_s to all samples, to the nearest neighbor and to the class means, the dimensionality (all features, and at 95% and 80% of explained variance) and the number of instances; the numeric entries are not reproduced here.]

Table 4.7: Correlation coefficients for the different measures from Table 4.6.

[Table 4.7 body: the full correlation matrix between γ, its maximum and minimum, the six distance measures and the three dimensionality measures; the numeric entries are not reproduced here, except to note that the minimum value of γ correlates with the full feature dimensionality at -0.614.]


4.4.3 ALGORITHM FOR FINDING OPTIMAL HYPERPARAMETERS ON A SUBSET OF THE DATA

4.4.3.1 SCALE C FOLLOWED BY LINE SEARCH FOR γ

Given the relationships mentioned in Sections 4.4.1 and 4.4.2, we propose the following strategy for finding the optimal hyperparameter values on very large datasets (a code sketch follows the list):

• Select a subset of the training data of size N_sub and find the optimal hyperparameter values using 10-fold cross-validation.

• Let C_full be the C which will be used with the full dataset. Calculate C_full = C_sub · N_sub / N_full, where C_sub is the optimal value of C found on the subset of the data, N_sub the number of samples in the subset used and N_full the number of samples in the full dataset.

• Using C_full, do a line search in a narrow region around γ_sub to find γ_full.
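A sketch of this procedure, assuming scikit-learn and numpy arrays X and y; the function name and grid ranges are ours and merely mirror the grids used earlier in the chapter.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

def tune_on_subset(X, y, subset_fraction=0.125, seed=0):
    rng = np.random.default_rng(seed)
    n_full = len(y)
    idx = rng.choice(n_full, size=int(subset_fraction * n_full), replace=False)
    Xs, ys = X[idx], y[idx]

    # Step 1: full grid search on the subset with 10-fold cross-validation.
    grid = {"C": 10.0 ** np.arange(-3, 7.5, 0.5), "gamma": 10.0 ** np.arange(-5, 3.5, 0.5)}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=10).fit(Xs, ys)
    C_sub, g_sub = search.best_params_["C"], search.best_params_["gamma"]

    # Step 2: scale C to the full dataset size (C is proportional to 1/N).
    C_full = C_sub * len(ys) / n_full

    # Step 3: narrow line search over gamma (within 10^2 of gamma_sub) on the full data.
    gammas = g_sub * 10.0 ** np.arange(-2, 2.5, 0.5)
    scores = [cross_val_score(SVC(kernel="rbf", C=C_full, gamma=g), X, y, cv=10).mean()
              for g in gammas]
    return C_full, gammas[int(np.argmax(scores))]
```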

4.4.3.2 FIX γ = 1/d FOLLOWED BY LINE SEARCH FOR C

We propose another algorithm for training the SVM hyperparameters more efficiently than an exhaustive search. This algorithm is based on the observations from Table 4.7, where we see indications that good values of γ correlate well with the inverse of the feature dimensionality. Hence, the following procedure for training the hyperparameters is proposed (a code sketch follows the list):

• Set γ = 1/d.

• Perform a line search for C, possibly on a subset of the data.

• Retrain the SVM with the value of C from the line search and γ = 1/d.
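The γ = 1/d variant can be sketched in the same way, again assuming scikit-learn and numpy arrays X and y.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def tune_with_gamma_one_over_d(X, y):
    gamma = 1.0 / X.shape[1]                      # fix gamma to the inverse dimensionality
    C_grid = 10.0 ** np.arange(-3, 7.5, 0.5)      # line search over C only
    scores = [cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=10).mean()
              for C in C_grid]
    best_C = C_grid[int(np.argmax(scores))]
    return SVC(kernel="rbf", C=best_C, gamma=gamma).fit(X, y)   # retrain with the best C
```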

4.4.4 EXPERIMENTAL ANALYSIS

All the experiments reported in this chapter were conducted on the IDA benchmark repository (available at http://www.fml.tuebingen.mpg.de/Members/raetsch/benchmark). Results from [37] and [62] are also displayed, but are not directly comparable, since we followed a different experimental setup. The reason for the different setup is that it is not clear that the training and test sets were fully independent. This shortcoming was also mentioned in [29]. Both measures are an approximation of the true generalization error, with ours being a stronger estimate than the results in [37, 62].


We have thus taken the complete dataset (concatenation of the first train and test sets) and divided it into 10 test sets, where for each test set, the rest of the data is considered the training set. For each such fold, the training set was then again partitioned into 10 folds, each of which was used to find the optimal parameters with which to evaluate the corresponding held-out test set. We thus performed 10 independent evaluations, with each evaluation possibly having different SVM hyperparameter values. Our 10-fold cross-validation approach also necessarily assigned more training data to each model than was the case in the original partitions from [62] (we thus have a 90-10 split whereas [62] aimed for 60-40 in general).

The encoding of some of the categorical features in the IDA benchmark repository is also not well suited to SVMs, in that some categories that are conceptually equidistant are encoded as ranked. We present our best results with proper encoding. (In particular, we preprocessed the Splice dataset so that categorical features which are conceptually equidistant from one another are numerically equidistant. This preprocessing was done by encoding each categorical feature as a four-bit feature, so that the Euclidean distance from any category to any other category is exactly one. This binary encoding leads to significantly better classification accuracy.) Since we cannot compare our results to those of [37, 62] directly because of the different experimental protocols followed, we decided to perform all our experiments using proper encoding. This same encoding will be used when we compare our results, using the techniques proposed in [3] (see Chapter 7).
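The evaluation protocol described above (an outer 10-fold split for testing, with an inner 10-fold grid search on each outer training set) can be sketched as follows, assuming scikit-learn; the hyperparameter grid is illustrative rather than the one actually used.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def nested_cv_error(X, y, seed=0):
    grid = {"C": 10.0 ** np.arange(-3, 7.5, 0.5), "gamma": 10.0 ** np.arange(-5, 3.5, 0.5)}
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    errors = []
    for train_idx, test_idx in outer.split(X, y):
        # Inner 10-fold grid search selects hyperparameters on the outer training set only.
        inner = GridSearchCV(SVC(kernel="rbf"), grid, cv=10)
        inner.fit(X[train_idx], y[train_idx])
        errors.append(1.0 - inner.score(X[test_idx], y[test_idx]))   # held-out error rate
    return float(np.mean(errors)), float(np.std(errors) / np.sqrt(len(errors)))  # mean, std error
```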

The SVM error rates reported in [37] and [62] were averaged over 100 partitions of the dataset and are represented as the mean error observed with the corresponding standard error in the first column of Table 4.8. All our results are presented as the mean error together with the corresponding standard error³.

Also in Table 4.8 are results obtained when using the algorithm proposed in Section 4.4.3 (i.e. a grid search on a subset of the data, followed by a line search on the full data set around appropriately scaled hyperparameter values). We tested this algorithm using randomly selected subsets of 50%, 25% and 12.5% of the full training set. The results obtained were encouraging in that the SVMs were trained in a fraction of the time required for the full grid search. This is especially useful where one has a large dataset (for example DFKI) where SVM hyperparameter training takes approximately 120 days on the full dataset and less than two days on 12.5% of the dataset. Another resource that is saved by reducing dataset size is memory. While no explicit measure of memory usage was performed, it will play an important role when very large datasets, which are too large to store in memory, are considered. However, for the problems considered in this thesis, the focus is on reducing the SVM hyperparameter training processing time.

³ Only four datasets were considered in this study because of the exploratory nature of the empirical investigation.


Table 4.8: The 10-fold cross-validation error rates obtained using SVMs with RBF kernels. Datasets marked with an asterisk show results for a γ line search without C adaptation. The DFKI dataset results are reported on the single accompanying test set. No cross-validation was thus performed and for that reason a single error rate is reported as opposed to a mean and standard error. Also note that none of the results are statistically significant (see Table 4.11; the results from Rätsch are excluded from the statistical analysis because of different experimental protocols).

Dataset      Rätsch        100%          50%           25%           12.5%         Total # samples
Image        2.7 ± 0.6     2.17 ± 0.2    1.95 ± 0.3    1.86 ± 0.4    2.34 ± 0.4    2310
Image*                     2.08 ± 0.3    1.95 ± 0.3    1.91 ± 0.3    1.73 ± 0.3    2310
Splice       10.9 ± 0.7    3.40 ± 0.4    3.56 ± 0.5    3.84 ± 0.6    4.00 ± 0.6    3175
Splice*                    3.53 ± 0.5    3.62 ± 0.4    3.55 ± 0.5    3.56 ± 0.5    3175
Waveform     9.9 ± 0.4     8.52 ± 0.4    8.64 ± 0.5    8.72 ± 0.5    8.44 ± 0.5    5000
Waveform*                  8.42 ± 0.4    8.54 ± 0.4    8.62 ± 0.5    8.52 ± 0.4    5000
DFKI                       54.98         55.88         55.05         54.78         34843
DFKI*                      55.19         55.08         54.95        55.00          34843

Table 4.9: Approximate total CPU time for performing the grid searches in Table 4.8. While the cluster on which these times were measured was used exclusively for the experiments in question, the times can only be indicative of general duration, since care was not taken to optimize for cache misses, for example, which could have a significant impact on run time performance.

Dataset      CV time in hh:mm
             100%       50%       25%      12.5%
Image        13:42      4:20      1:31     0:43
Splice       113:35     37:35     14:36    8:29
Waveform     233:41     40:02     7:11     3:12
DFKI         2883:41    430:09    97:45    38:06


Table 4.10: The 10-fold cross-validation mean and standard error when training SVMs with (1) a full grid search for C and γ (s(C, γ)) and (2) with γ = 1/d, followed by a line search over C (s(C, γ = 1/d)). The hyperparameter training time is also included. Paired Wilcoxon rank sum tests were performed and it was found that none of the results are statistically significantly different at the 0.01 significance level.

                 Mean & Std. Error                   Duration (DD:HH:MM:SS)
Dataset          s(C, γ)          s(C, γ = 1/d)      s(C, γ)          s(C, γ = 1/d)
Banana           9.53 ± 0.26      9.53 ± 0.31        03:03:13:47      00:02:50:39
Breast Cancer    27.75 ± 1.86     25.95 ± 1.85       00:00:10:53      00:00:00:20
DFKI 1 & 4       12.48 ± 0.20     12.51 ± 0.24       42:11:35:33      01:01:06:43
DFKI 2 & 5       2.06 ± 0.13      2.11 ± 0.15        26:00:27:34      00:07:16:46
DFKI 5 & 7       25.87 ± 0.40     26.57 ± 0.31       139:14:16:03     06:14:06:13
Diabetes         24.23 ± 1.90     23.58 ± 1.60       00:01:53:17      00:00:04:44
Solar Flare      32.36 ± 1.04     33.11 ± 1.09       00:01:35:41      00:00:02:04
German           24.10 ± 1.50     23.60 ± 1.33       00:01:33:48      00:00:04:04
Heart            14.81 ± 1.66     17.04 ± 2.22       00:00:03:45      00:00:00:12
Image            2.38 ± 0.15      2.12 ± 0.20        00:01:25:58      00:00:08:33
Splice           8.25 ± 0.47      8.25 ± 0.58        00:06:05:18      00:00:26:43
Thyroid          3.72 ± 1.36      3.27 ± 1.01        00:00:00:51      00:00:00:05
Titanic          20.99 ± 0.77     20.95 ± 0.76       00:02:20:07      00:00:02:54

The results without C adaptation also give an indication (albeit not a statistically significant one) that C adaptation may sometimes hurt performance compared to simply retaining the C value found with the smaller set. Specifically, the reader is referred to the Image result when using 12.5% of the dataset in Table 4.8. Even though the results with and without C adaptation are not significantly different, the big difference in accuracy (albeit statistically insignificant at the 0.01 significance level) indicates that care should be taken when using very small samples of the dataset, which result in a large corresponding change in C according to the scaling criterion C_full = C_sub · N_sub / N_full. The fact that none of the results in Table 4.8 are statistically significantly better than the others is interesting. It shows that as long as C is chosen in the correct region, performing a line search on the other parameter can find solutions competitive with the best solution found using a full grid search.

Results obtained with the algorithm proposed in Section 4.4.3.2, where γ = 1/d and a line search for C is performed, are very promising (see Table 4.10). The γ = 1/d setting gives results that are statistically indistinguishable from those obtained with a full grid search, while training the hyperparameters in a fraction of the time it takes to do an exhaustive grid search.

While the clusters on which these times were measured were used exclusively for the experiments in question, the times can only be indicative of general duration, since care was not taken to optimize for, among others, cache misses, which could have a significant impact on run time performance. Furthermore, only actual training time is measured; the time that it takes to create the different folds is negligible for the problems addressed in this thesis, but may consume a significant amount of time for much larger datasets.

Table 4.11: Statistical significance test results corresponding to the 10-fold cross-validation results presented in Table 4.8. In Table 4.8, results are presented when SVM hyperparameters are trained using the algorithm proposed in Section 4.4.3. The percentage for each method corresponds to the percentage of the total number of available training samples used to perform the initial grid search, while a * indicates results where no C scaling was performed. In this table, the independent two-sample t-test is used to test whether or not a particular method performs significantly better than another at the 0.01 significance level. Using the same notation as in [3], we indicate that a method in a particular row performs significantly better (<), worse (>) or statistically similar (no symbol) than the method in the corresponding column.

[Table 4.11 body: rows and columns are SVM 100%, 50%, 25% and 12.5%, with and without C scaling (*), for the Image, Splice and Waveform datasets; no cell contains a < or > symbol, i.e. no method performs significantly better or worse than any other.]


4.5 CONCLUSION

We presented theoretical and empirical arguments that give more insight as to how to make intelligent choices regarding the region within which to search for optimal hyperparameter values. We also presented a simple algorithm that finds the SVM hyperparameter values in much less time and with significantly fewer resources, by using scaling arguments derived from the SVM error function in Sections 4.4.1 and 4.4.2. The scaling arguments with regard to C and N are sensitive to underfitting in cases where subsets are selected from datasets that have little overlap, as shown in Table 4.8 for the case where 12.5% of the data was selected for the Image and Splice datasets respectively (note the low error rates achievable on both datasets). When performing K-fold cross-validation on each fold's training set in order to obtain the optimal hyperparameter values, a value of K that is too low may result in an underestimation of the value of C. Fig. 4.9 shows how this happened for the case of one of the folds of the Splice dataset. Note that when LOO cross-validation was performed, the optimal value of C was very large, whereas 10-fold cross-validation led to a complete underestimation of C (small peak before accuracy converges to a slightly lower value). Our results also indicate that a narrow line search over γ without C adaptation⁴ (given initial parameters from a grid search) is the safest approach to SVM training when one has large amounts of training data. The influence of noise on the selection of optimal C needs to be investigated further in order to exploit the relationship between C and N fully.

Another very interesting observation, which will be investigated further in Chapter 7, is the apparent correlation between a good⁵ value of γ and 1/d, where d is the number of dimensions (see the correlation between the minimum value of the optimal γ and the feature dimensionality in Table 4.7). This indicates that one can select γ without a line search and perform a corresponding line search on C.

⁴ Much work has been done on efficient line searches (see for example [68]). This is an interesting topic for further research, but is considered to be outside the scope of this thesis.

⁵ The optimal value of γ, as found with 10-fold cross-validation, is shown in Table 4.6. However, for several datasets, other values of γ give results which are statistically the same. The values of γ which fall in this category are indicated by the minimum and maximum values in the same table. This is indeed also the default value used for γ in LIBSVM [26].

Figure 4.9: Cross-section of the contour plot of hyperparameter values vs accuracy for both 10-fold cross-validation and LOO cross-validation. This particular cross-section was taken from one of the folds of the 12.5% Splice subset and depicts varying C vs classification accuracy with γ fixed at 0.01. It is interesting to note that C for the 10-fold cross-validation estimate has an apparent best accuracy at C ≈ 10^0, whereas the LOO CV estimate has no peak in accuracy; rather, the accuracy reaches an asymptote, after which further increases in C have no further visible effect on accuracy.

Our results, especially Figs. 4.1 and 4.9, also cast doubt on the contribution of the margin term in Eq. 2.12, since accuracies very close to or at the maximum achievable accuracy are obtained for very large values of C. For reasons that will be discussed in detail in Chapter 5, large values of C give more weight to the misclassification term in Eq. 2.12. This in turn makes the margin term less important in the minimization of Eq. 2.12. In cases with very large optimal C values, one can thus not conclude that the margin is being maximized, but rather that the sum of errors is being minimized. The relationship between large C and the size of the margin and misclassification terms will be explored further in Chapter 5.

