EMPIRICAL METHODS
Contents
3.1 Estimating the generalization error
3.2 Minimizing the generalization error
3.3 Statistical hypothesis testing
    3.3.1 Independent two-sample t-tests
    3.3.2 The Wilcoxon signed-rank test
3.4 Datasets
    3.4.1 Artificial dataset
    3.4.2 Banana
    3.4.3 Breast Cancer
    3.4.4 DFKI Speaker Age Recognition Database
    3.4.5 Diabetes
    3.4.6 German
    3.4.7 Heart
    3.4.8 Image
    3.4.9 MNIST Handwritten Digit Database
    3.4.10 Solar Flare
    3.4.11 Splice
    3.4.12 Thyroid
    3.4.13 Titanic
3.5 Mathematical notation
This chapter describes the empirical methods followed, as well as the datasets which were used in this thesis. Notation used throughout the thesis is also introduced.
3.1 ESTIMATING THE GENERALIZATION ERROR
In order to tune SVM hyperparameters, a good estimate of the generalization error is required [28]. A common approach is to approximate the generalization error with the k-fold cross-validation error [38, 39, 40, 30], of which a special case is the LOO cross-validation error, where k is equal to the total number of samples (each classifier is trained on all samples but one). The LOO cross-validation error is attractive because it is known to give an almost unbiased estimate of a classifier's generalization error [36, 37, 28], but it is also very expensive to compute for datasets with even a moderate number of samples [30]. For practical purposes, a lower value of k is typically used, and it has been found that k = 5 or k = 10 gives a good trade-off between bias and variance [3]. For this reason, we use k-fold cross-validation throughout this thesis to estimate a classifier's generalization error, with k = 10.
We perform 10-fold cross-validation as follows:
• Create 10 random folds of equal size. Where the number of samples is not divisible by 10, fold sizes are allowed to differ by at most one sample.
• Train a classifier on nine folds and evaluate on the remaining fold. This is repeated 10 times, so that each fold serves as the held-out set exactly once.
• Aggregate the test results to form an estimate of the generalization error.
The generalization error estimate is presented as the mean error over all the folds, together with the standard error of the mean (s/√n, where s is the sample standard deviation over the folds and n the number of folds). In Chapter 7 the generalization error estimate is presented as quantiles (25%, 50% and 75%) whenever results are directly compared to the likelihood validation function proposed in [3].
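The procedure above can be sketched as follows. This is a minimal illustration, not the thesis code: `train_fn` and `predict_fn` are hypothetical stand-ins for whatever classifier is being evaluated (in the thesis, an SVM).

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Create k random folds whose sizes differ by at most one sample."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return np.array_split(idx, k)  # array_split allows unequal fold sizes

def cross_val_error(train_fn, predict_fn, X, y, k=10, seed=0):
    """Mean k-fold error and the standard error of the mean, s/sqrt(k)."""
    errors = []
    for test_idx in kfold_indices(len(X), k, seed):
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        model = train_fn(X[train_idx], y[train_idx])     # train on k-1 folds
        y_hat = predict_fn(model, X[test_idx])           # evaluate on held-out fold
        errors.append(np.mean(y_hat != y[test_idx]))
    errors = np.asarray(errors)
    return errors.mean(), errors.std(ddof=1) / np.sqrt(k)
```

The held-out-fold errors are aggregated only at the end, so the same fold assignment can be reused when two techniques are compared on identical folds.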
Our decision to deviate from using the 100 splits of the intelligent data analysis (IDA) benchmark repository [62] (as was used in, among others, [62, 29, 3]) warrants a discussion. Splitting the data into 100 train/test splits, using five of those splits to tune the SVM hyperparameters and evaluating on all 100, gives only a weak indicator of the generalization error [29]. This is because the splits used to choose the optimal hyperparameter values and the
DEPARTMENT OF ELECTRICAL, ELECTRONIC AND COMPUTER ENGINEERING 20
splits used to evaluate the generalization performance overlap. For this reason, we chose instead to retrain and evaluate SVMs using grid search (Tables 4.8, 5.1 and 5.2) and to regenerate results using the software distributed with [3] (Chapter 7). In Tables 4.8, 5.1 and 5.2, results presented in [62] are displayed simply to show that our SVM results are in the same range. The results are thus not directly comparable; rather, the SVM and perceptron kernel (PK) results should be compared to each other.
3.2 MINIMIZING THE GENERALIZATION ERROR
The validation function used to minimize the generalization error is in all cases either a variant of traditional grid search, or the likelihood validation function proposed in [3]. A significant part of this thesis focuses on presenting more efficient ways to utilize a simple and well-understood validation function, namely grid search. Grid search is a baseline against which many authors compare their proposed techniques [2, 39, 43, 37]; the main disadvantages mentioned are that grid search is time-consuming and that it cannot be used to efficiently optimize more than two hyperparameters simultaneously, because of the combinatorial explosion of possible hyperparameter combinations. For a detailed discussion of the different validation functions, the reader is referred to Section 2.2.2.
Grid search is performed as follows:
• Create a linear grid on the log scale, with steps of one decade (i.e. multiplicative steps of 10).
• Select conservative start and end values for the grid (empirically, we have observed that a good choice of such a grid for C ranges from 10^-3 to 10^5, and for γ from 10^-3 to 10^1).
• Train a classifier on each such grid point and evaluate on the held out fold when using 10-fold cross-validation.
• The optimal value of the validation function is chosen to be that combination of hy- perparameter values which gives the lowest 10-fold cross-validation error.
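The steps above can be sketched as an exhaustive loop over the log-scale grid. In this sketch, `cv_error(C, gamma)` is a hypothetical function standing in for the 10-fold cross-validation error of an SVM trained with that hyperparameter pair:

```python
import numpy as np
from itertools import product

def grid_search(cv_error, C_exponents=(-3, 5), gamma_exponents=(-3, 1)):
    """Exhaustive search over a log10 grid with steps of one decade."""
    C_values = 10.0 ** np.arange(C_exponents[0], C_exponents[1] + 1)
    gamma_values = 10.0 ** np.arange(gamma_exponents[0], gamma_exponents[1] + 1)
    best = None
    for C, gamma in product(C_values, gamma_values):
        err = cv_error(C, gamma)  # e.g. 10-fold cross-validation error
        if best is None or err < best[0]:
            best = (err, C, gamma)
    return best  # (lowest CV error, optimal C, optimal gamma)
```

With the default ranges this evaluates 9 x 5 = 45 classifier trainings per fold, which is why the text emphasizes the cost of extending grid search beyond two hyperparameters.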
3.3 STATISTICAL HYPOTHESIS TESTING
In order to decide if results are statistically significant, two different statistical significance tests were employed in this thesis: the independent two-sample t-test is discussed in Section 3.3.1 and the Wilcoxon signed-rank test in Section 3.3.2. For both tests, the null hypothesis is that the means of the two samples being compared are equal. All results are evaluated at the 0.01 significance level.
NORTH-WEST UNIVERSITY
3.3.1 INDEPENDENT TWO-SAMPLE T-TESTS
The independent two-sample t-test is used to test the null hypothesis that the means of two independently drawn samples are equal, under the assumptions that the two sample sizes are equal and that the variances of the underlying distributions are equal. Since sample sizes are equal to 10 in all cases, and since the samples are drawn from the same distribution, both assumptions are valid1. The t-statistic is calculated as
t = (μ_1 - μ_2) / (s_p √(2/n))     (3.1)

where μ_i is the sample mean and s_i the sample standard deviation of sample i, and s_p = √((s_1^2 + s_2^2)/2) is the pooled standard deviation.
To perform a significance test, two-tailed p-values are then calculated using 2n - 2 degrees of freedom, with n = 10.
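Equation 3.1 can be computed directly, as sketched below for the equal-sample-size case; in practice the two-tailed p-value would then be looked up from a t-distribution table (or a statistics library) using 2n - 2 degrees of freedom.

```python
import math

def two_sample_t(sample1, sample2):
    """Equal-size, equal-variance two-sample t-statistic (Eq. 3.1)."""
    n = len(sample1)
    assert len(sample2) == n, "the test assumes equal sample sizes"
    mean = lambda s: sum(s) / n
    var = lambda s, m: sum((x - m) ** 2 for x in s) / (n - 1)  # sample variance
    m1, m2 = mean(sample1), mean(sample2)
    s_p = math.sqrt((var(sample1, m1) + var(sample2, m2)) / 2)  # pooled std dev
    return (m1 - m2) / (s_p * math.sqrt(2 / n))  # compare on 2n - 2 dof
```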
3.3.2 THE WILCOXON SIGNED-RANK TEST
The Wilcoxon signed-rank test was proposed by Frank Wilcoxon in 1945 [63]. It is a non-parametric statistical significance test, which is preferred over the paired t-test when the population cannot be assumed to be normally distributed, or when there are very few observations per sample.
To calculate the Wilcoxon test statistic W, given two paired samples A and B with n observations each, proceed as follows:
1. Calculate the difference between each pair of observations, C_i = A_i - B_i, with i = 1...n.
2. Discard the zero differences (C_i = 0), and let m be the number of non-zero differences.
3. Order the absolute differences |C_1|,...,|C_m| from smallest to largest.
4. Assign a rank to each value (smallest: rank(|C_i|) = 1; biggest: rank(|C_i|) = m).
5. For all values of |C_i| which are equal, assign the mean of their ranks to each equal value.
1 The same folds were used whenever two or more techniques were compared. Instances where independent two-sample t-tests are used for significance tests are cases where only the mean and standard deviation of the tests are available, and for which regeneration of results in order to perform a more stringent statistical significance test is prohibitively expensive.
• For example, if rank(|C_1| = 0.45) = 1 and rank(|C_2| = 0.45) = 2 from the previous step (and no other |C_i| = 0.45), reassign the ranks as rank(|C_1|) = 1.5 and rank(|C_2|) = 1.5.
6. Assign the original sign of each Ci to the updated rank.
• For example, continuing with the same example as above, if C_1 = -0.45, then rank(C_1) = -1.5.
7. Sum all positive ranks to form W_+ and all negative ranks to form W_-.
8. Set the Wilcoxon test statistic W = min(|W_-|, W_+).
9. Given W and m, calculate the p-value from a table, or exactly by, for example, following the algorithm proposed at http://comp9.psych.cornell.edu/Darlington/wilcoxon/wilcox4.htm.
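Steps 1 to 8 above can be implemented directly; a minimal sketch (the p-value lookup of step 9 is omitted):

```python
def wilcoxon_w(a, b):
    """Wilcoxon signed-rank statistic W for paired samples a and b."""
    # Steps 1-2: signed differences, with zero differences discarded.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    # Steps 3-5: rank |C_i| in ascending order, averaging the ranks of ties.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        tie_end = pos
        while (tie_end + 1 < len(order)
               and abs(diffs[order[tie_end + 1]]) == abs(diffs[order[pos]])):
            tie_end += 1
        mean_rank = (pos + tie_end) / 2 + 1  # ranks are 1-based
        for j in range(pos, tie_end + 1):
            ranks[order[j]] = mean_rank
        pos = tie_end + 1
    # Steps 6-8: reattach signs, sum positive and negative ranks.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)
```

For example, for A = (1, 2, 3) and B = (3, 1, 1) the differences are (-2, 1, 2); |C_2| gets rank 1, the tied |C_1| and |C_3| each get rank 2.5, so W_+ = 3.5, W_- = 2.5 and W = 2.5.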
3.4 DATASETS
Several datasets were used to evaluate and compare different classifiers2. Detailed information - and results where applicable - will be presented in this section. While the majority of the datasets considered are small, two moderate to large datasets are considered: the DFKI and MNIST datasets.
Table 3.1 shows the number of instances, dimensions and classes for several datasets.
The majority of the datasets described in this chapter are those from the intelligent data analysis (IDA) benchmark repository, as described in [62] and available at http://www.fml.tuebingen.mpg.de/Members/raetsch/benchmark. The IDA benchmark repository contains datasets from the University of California Irvine (UCI), data for evaluating learning in valid experiments (DELVE) and statistical and logical (STATLOG) benchmark repositories. The datasets included are: Banana, Breast Cancer, Diabetes, Solar Flare, German, Heart, Image, Ringnorm, Splice, Thyroid, Titanic, Twonorm and Waveform. For details on how these datasets were processed and split, see [62]. Alternative websites where the datasets are available, and where more information is provided, will be mentioned for each.
2 As this thesis was compiled over a period of years, not all datasets are used for all experiments. However, all datasets are used when all the approaches are compared in Chapter 7. In exploratory work, such as finding the optimal block size for Rprop, not all figures and results are shown. The results that are shown were carefully chosen to be representative of the trends observed.
Table 3.1: Number of instances, dimensions and classes of all data sets. For those data sets marked with an asterisk, the IDA benchmark repository version of the dataset is slightly different from UCI.
Dataset                          instances  dimensions  classes  Feature type
Artificial dataset 4096 2 2 real
Banana 5300 2 2 real
Breast Cancer 277* 9 2 categorical
DFKI Speaker Age Recognition 47578 42 7 real
DFKI (classes 1 & 4) 9514 42 7 real
DFKI (classes 2 & 5) 9341 42 7 real
DFKI (classes 5 & 7) 10733 42 7 real
Diabetes 768 8 2 real
German 1000 20 2 real & categorical
Heart 270 13 2 real & categorical
Image 2310 18* 2 real
MNIST Handwritten Digits 70000 784 10 real
Solar Flare 1066* 9 2 categorical
Splice 3175* 60 2 categorical
Thyroid 215 5 2 real
Titanic 2051 3 2 categorical
For the purposes of this thesis, only two-class problems are considered. While it is easy to extend any classifier to multi-class classification [14, 15], it is more difficult to make sense of the results (one has to consider a class confusion matrix). There are also various ways in which a multi-class classifier can be constructed: one-vs-all, n(n-1)/2 pairwise two-class problems with majority voting, and probabilistic outputs, to name a few.
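As an illustration of the pairwise construction, a majority-voting wrapper might look as follows; `binary_predict` is a hypothetical, already-trained two-class classifier for the class pair (i, j), not part of the thesis software:

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(binary_predict, classes, x):
    """Majority vote over the n(n-1)/2 pairwise two-class classifiers.

    binary_predict(i, j, x) must return either class i or class j for
    input x; the class collecting the most pairwise votes wins.
    """
    votes = Counter(binary_predict(i, j, x) for i, j in combinations(classes, 2))
    return votes.most_common(1)[0][0]
```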
A short description of each dataset is now provided.
3.4.1 ARTIFICIAL DATASET
An artificial dataset was created to illustrate graphically how SVMs create decision boundaries (see Fig. 4.3). The dataset was created by randomly sampling from two Gaussian mixture distributions, each with M = 3 mixture components. The d-dimensional probability density function of such a Gaussian mixture, with component means μ_m, covariances Σ_m and mixture weights w_m, is

p(x) = Σ_{m=1}^{M} w_m N(x | μ_m, Σ_m).

[Fixed values of the mixture weights w, the means μ for classes 1 and 2, and the covariances Σ shared by both classes.]

2048 samples were generated for each of classes 1 and 2.
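Sampling from such a Gaussian mixture can be sketched as follows. The weights, means and covariances below are illustrative placeholders only, not the values used to generate the thesis dataset:

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Draw n samples from a Gaussian mixture: pick a component by its
    weight, then sample from that component's Gaussian."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[m], covs[m]) for m in comp])

# Illustrative 2-D parameters only -- not the values used in the thesis.
weights = [0.3, 0.3, 0.4]
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0]), np.array([-2.0, 2.0])]
covs = [np.eye(2)] * 3
class1 = sample_gmm(2048, weights, means, covs)
```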
3.4.2 BANANA
The Banana dataset is a toy two-class dataset used and described in [62]. It was generated by sampling from non-linearly transformed Gaussian and uniform distributions. Four of these distributions were distorted by adding uniformly distributed noise.
3.4.3 BREAST CANCER
The Breast Cancer3 dataset has two classes, one with 201 instances and the other with 85 instances. The nine instances with missing feature values are not present in the IDA benchmark repository version.
The dataset is also available from the UCI machine learning repository [64], at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer.
3This breast cancer domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia.
3.4.4 DFKI SPEAKER AGE RECOGNITION DATABASE
This dataset consists of spoken audio samples from native German speakers. Samples are labeled by gender, as well as by age group (children, young males and females, adult males and females, and senior males and females). A comprehensive overview of the suggested experimental setup is given in [65], while the long-term features used in this thesis are described in [10].
3.4.5 DIABETES
The Pima Indians Diabetes dataset [64] consists of data from female patients (of Pima Indian heritage) who were at least 21 years of age. It can be downloaded from the UCI machine learning repository, at http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.
3.4.6 GERMAN
The German credit data dataset [64] (available at http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) has real and categorical features which are used to classify people as good or bad credit risks.
3.4.7 HEART
The Heart dataset [64] (available at http://archive.ics.uci.edu/ml/datasets/Statlog+(Heart)) classifies patients as having heart disease or not.
3.4.8 IMAGE
The image segmentation dataset [64] (available at http://archive.ics.uci.edu/ml/datasets/Statlog+(Image+Segmentation)) contains 3x3 pixel regions, with every region belonging to one of seven outdoor images, as labeled by humans.
3.4.9 MNIST HANDWRITTEN DIGIT DATABASE
Available at http://yann.lecun.com/exdb/mnist/, this database consists of handwritten digits which have been centered in a 28x28 image [66]. The training set consists of 60000 samples and the test set of 10000 samples. These sets were selected from the National Institute of Standards and Technology (NIST) Special Database 3 training and test sets (50% from the training and 50% from the testing sets).
3.4.10 SOLAR FLARE
The Solar Flare dataset [64] (available at http://archive.ics.uci.edu/ml/datasets/Solar+Flare) contains features describing active regions of the sun, for predicting three types of solar flares.
3.4.11 SPLICE
The molecular biology (splice-junction gene sequences) dataset [64] (available at http://archive.ics.uci.edu/ml/datasets/Molecular+Biology+(Splice-junction+Gene+Sequences)) contains features describing sequences of DNA, with the associated task being to recognize boundaries between exons and introns.
3.4.12 THYROID
The Thyroid gland dataset [64] (available at http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/) contains features describing normal, hypo- and hyper-functioning thyroid glands. The specific dataset used is named the "new-thyroid" dataset.
3.4.13 TITANIC
The Titanic survival dataset (available at http://www.cs.toronto.edu/~delve/data/datasets.html) contains three features (social class, sex and age) which are used to predict whether a person survived the disaster.
3.5 MATHEMATICAL NOTATION
This section contains mathematical notation used throughout this thesis (notation that is not often used is excluded and rather discussed in the text).
Table 3.2: Mathematical notation used throughout the thesis.

Notation      Description
y_i           ith class label
x_i           ith feature vector [1 x_i]
n             total number of samples
M             total number of mixtures
μ             sample mean
s             sample standard deviation
Σ             covariance matrix
T             transpose of a vector or matrix
i             (subscript) indicates the ith element of a range of values
ρ             SVM margin
w             vector normal to the separating hyperplane constructed by an SVM
w_0           bias term in the SVM error function
α_i, β_i      Lagrange multipliers
ᾱ_i           perceptron kernel support vector weights
C             SVM regularization parameter
ξ_i           SVM slack variables
γ             RBF kernel parameter (bandwidth)
K(a, b)       kernel function of a and b
L_P           primal form of the Lagrangian
L_D           dual form of the Lagrangian
T             a bound on the generalization error
R             radius of a sphere
k             as in k-fold cross-validation
S_p           span of support vectors
W             Wilcoxon test statistic
d             number of dimensions
SV            set of SVs satisfying 0 < α_i < C
S̄V̄            set of SVs satisfying 0 < α_i ≤ C
∞             infinity
η             step size in stochastic gradient descent