EMPIRICAL METHODS
Contents
3.1 Estimating the generalization error
3.2 Minimizing the generalization error
3.3 Statistical hypothesis testing
    3.3.1 Independent two-sample t-tests
    3.3.2 The Wilcoxon signed-rank test
3.4 Datasets
    3.4.1 Artificial dataset
    3.4.2 Banana
    3.4.3 Breast Cancer
    3.4.4 DFKI Speaker Age Recognition Database
    3.4.5 Diabetes
    3.4.6 German
    3.4.7 Heart
    3.4.8 Image
    3.4.9 MNIST Handwritten Digit Database
    3.4.10 Solar Flare
    3.4.11 Splice
    3.4.12 Thyroid
    3.4.13 Titanic
3.5 Mathematical notation
This chapter describes the empirical methods followed, as well as the datasets which were used in this thesis. Notation used throughout the thesis is also introduced.
3.1 ESTIMATING THE GENERALIZATION ERROR
In order to tune SVM hyperparameters, a good estimate of the generalization error is required [28]. A common approach is to approximate the generalization error with the k-fold cross-validation error [38, 39, 40, 30], of which a special case is the LOO cross-validation error, where k is equal to the total number of samples (each classifier is trained on all samples but one). The LOO cross-validation error is attractive because it is known to give an almost unbiased estimate of a classifier's generalization error [36, 37, 28], but it is also very expensive to compute for datasets with even a moderate number of samples [30]. For practical purposes, a lower value of k is typically used, and it has been found that k = 5 or k = 10 gives a good trade-off between bias and variance [3]. For this reason, we use k-fold cross-validation throughout this thesis to estimate a classifier's generalization error, with k = 10.
We perform 10-fold cross-validation as follows:
• Create 10 random folds of equal size. Where the number of samples is not divisible by 10, fold sizes are allowed to differ by at most one sample.
• Train a classifier on nine folds and evaluate on the remaining fold. This is repeated 10 times, so that each fold serves as the held-out set exactly once.
• Aggregate the test results to form an estimate of the generalization error.
The generalization error estimate is presented as the mean error over all the folds, together with the standard error of the mean (s/√n, where s is the sample standard deviation over the folds and n the number of folds). In Chapter 7 the generalization error estimate is presented as quantiles (25%, 50% and 75%) whenever results are directly compared to the likelihood validation function proposed in [3].
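The procedure above can be sketched as follows. This is a minimal illustration, not the thesis code: `train_fn` and `predict_fn` are hypothetical stand-ins for whatever classifier is being evaluated (in the thesis, an SVM).

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Create k random folds whose sizes differ by at most one sample."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return np.array_split(idx, k)  # array_split allows unequal fold sizes

def cross_val_error(train_fn, predict_fn, X, y, k=10, seed=0):
    """Mean k-fold error and the standard error of the mean, s/sqrt(k)."""
    errors = []
    for test_idx in kfold_indices(len(X), k, seed):
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        model = train_fn(X[train_idx], y[train_idx])     # train on k-1 folds
        y_hat = predict_fn(model, X[test_idx])           # evaluate on held-out fold
        errors.append(np.mean(y_hat != y[test_idx]))
    errors = np.asarray(errors)
    return errors.mean(), errors.std(ddof=1) / np.sqrt(k)
```

The held-out-fold errors are aggregated only at the end, so the same fold assignment can be reused when two techniques are compared on identical folds.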
Our decision to deviate from using the 100 splits of the intelligent data analysis (IDA) benchmark repository [62] (as was used in, among others, [62, 29, 3]) warrants a discussion. Splitting the data into 100 train/test splits, using five of those splits to tune the SVM hyperparameters and evaluating on all 100, gives only a weak indicator of the generalization error [29]. This is because the splits used to choose the optimal hyperparameter values and the
DEPARTMENT OF ELECTRICAL, ELECTRONIC AND COMPUTER ENGINEERING 20
splits used to evaluate the generalization performance overlap. For this reason, we chose instead to retrain and evaluate SVMs using grid search (Tables 4.8, 5.1 and 5.2) and to regenerate results using the software distributed with [3] (Chapter 7). In Tables 4.8, 5.1 and 5.2, results presented in [62] are displayed simply to show that our SVM results are in the same range. The results are thus not directly comparable; rather, the SVM and perceptron kernel (PK) results should be compared to each other.
3.2 MINIMIZING THE GENERALIZATION ERROR
The validation function used to minimize the generalization error is in all cases either a variant of traditional grid search, or the likelihood validation function proposed in [3]. A significant part of this thesis focuses on presenting more efficient ways to utilize a simple and well-understood validation function, namely grid search. Grid search is a baseline against which many authors compare their proposed techniques [2, 39, 43, 37]; the main disadvantages mentioned are that grid search is time-consuming and that it cannot be used to efficiently optimize more than two hyperparameters simultaneously, because of the combinatorial explosion of possible hyperparameter combinations. For a detailed discussion of the different validation functions, the reader is referred to Section 2.2.2.
Grid search is performed as follows:
• Create a linear grid on the log scale, with steps of one decade (i.e. multiplicative steps of 10).
• Select conservative start and end values for the grid (empirically, we have observed that a good choice of such a grid for C ranges from 10^-3 to 10^5, and for γ from 10^-3 to 10^1).
• Train a classifier on each such grid point and evaluate on the held out fold when using 10-fold cross-validation.
• The optimal value of the validation function is chosen to be that combination of hy- perparameter values which gives the lowest 10-fold cross-validation error.
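The steps above can be sketched as an exhaustive loop over the log-scale grid. In this sketch, `cv_error(C, gamma)` is a hypothetical function standing in for the 10-fold cross-validation error of an SVM trained with that hyperparameter pair:

```python
import numpy as np
from itertools import product

def grid_search(cv_error, C_exponents=(-3, 5), gamma_exponents=(-3, 1)):
    """Exhaustive search over a log10 grid with steps of one decade."""
    C_values = 10.0 ** np.arange(C_exponents[0], C_exponents[1] + 1)
    gamma_values = 10.0 ** np.arange(gamma_exponents[0], gamma_exponents[1] + 1)
    best = None
    for C, gamma in product(C_values, gamma_values):
        err = cv_error(C, gamma)  # e.g. 10-fold cross-validation error
        if best is None or err < best[0]:
            best = (err, C, gamma)
    return best  # (lowest CV error, optimal C, optimal gamma)
```

With the default ranges this evaluates 9 x 5 = 45 classifier trainings per fold, which is why the text emphasizes the cost of extending grid search beyond two hyperparameters.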
3.3 STATISTICAL HYPOTHESIS TESTING
In order to decide if results are statistically significant, two different statistical significance tests were employed in this thesis: the independent two-sample t-test is discussed in Section 3.3.1 and the Wilcoxon signed-rank test in Section 3.3.2. For both tests, the null hypothesis is that the means of the two samples being compared are equal. All results are evaluated at the 0.01 significance level.
NORTH-WEST UNIVERSITY
3.3.1 INDEPENDENT TWO-SAMPLE T-TESTS
The independent two-sample t-test is used to test the null hypothesis that the means of two independently drawn samples are equal, under the assumptions that the two sample sizes are equal and that the variances of the underlying distributions are equal. Since sample sizes are equal to 10 in all cases, and since the samples are drawn from the same distribution, both assumptions are valid1. The t-statistic is calculated as
t = (μ_1 - μ_2) / (s_p √(2/n))     (3.1)

where μ_i is the sample mean and s_i the sample standard deviation of sample i, and s_p = √((s_1^2 + s_2^2)/2) is the pooled standard deviation.
To perform a significance test, two-tailed p-values are then calculated using 2n - 2 degrees of freedom, with n = 10.
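Equation 3.1 can be computed directly, as sketched below for the equal-sample-size case; in practice the two-tailed p-value would then be looked up from a t-distribution table (or a statistics library) using 2n - 2 degrees of freedom.

```python
import math

def two_sample_t(sample1, sample2):
    """Equal-size, equal-variance two-sample t-statistic (Eq. 3.1)."""
    n = len(sample1)
    assert len(sample2) == n, "the test assumes equal sample sizes"
    mean = lambda s: sum(s) / n
    var = lambda s, m: sum((x - m) ** 2 for x in s) / (n - 1)  # sample variance
    m1, m2 = mean(sample1), mean(sample2)
    s_p = math.sqrt((var(sample1, m1) + var(sample2, m2)) / 2)  # pooled std dev
    return (m1 - m2) / (s_p * math.sqrt(2 / n))  # compare on 2n - 2 dof
```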
3.3.2 THE WILCOXON SIGNED-RANK TEST
The Wilcoxon signed-rank test was proposed by Frank Wilcoxon in 1945 [63]. It is a non-parametric statistical significance test, which is preferred over the paired t-test when the population cannot be assumed to be normally distributed, or when there are very few observations per sample.
To calculate the Wilcoxon test statistic W, given two paired samples A and B with n observations each, proceed as follows:
1. Calculate the difference between each pair of observations, C_i = A_i - B_i, with i = 1...n.
2. Discard the zero differences (C_i = 0), and let m be the number of non-zero differences.
3. Order the absolute differences |C_1|,...,|C_m| from smallest to largest.
4. Assign a rank to each value (smallest: rank(|C_i|) = 1; biggest: rank(|C_i|) = m).
5. For all values of |C_i| which are equal, assign the mean of their ranks to each equal value.
1 The same folds were used whenever two or more techniques were compared. Instances where independent two-sample t-tests are used for significance tests are cases where only the mean and standard deviation of the tests are available, and for which regeneration of results in order to perform a more stringent statistical significance test is prohibitively expensive.
• For example, if rank(|C_1| = 0.45) = 1 and rank(|C_2| = 0.45) = 2 from the previous step (and no other |C_i| = 0.45), reassign the ranks as rank(|C_1|) = 1.5 and rank(|C_2|) = 1.5.
6. Assign the original sign of each Ci to the updated rank.
• For example, continuing with the same example as above, if C_1 = -0.45, then rank(C_1) = -1.5.
7. Sum all positive ranks to form W_+ and all negative ranks to form W_-.
8. Set the Wilcoxon test statistic W = min(|W_-|, W_+).
9. Given W and m, calculate the p-value from a table, or exactly by, for example, following the algorithm proposed at http://comp9.psych.cornell.edu/Darlington/wilcoxon/wilcox4.htm.
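Steps 1 to 8 above can be implemented directly; a minimal sketch (the p-value lookup of step 9 is omitted):

```python
def wilcoxon_w(a, b):
    """Wilcoxon signed-rank statistic W for paired samples a and b."""
    # Steps 1-2: signed differences, with zero differences discarded.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    # Steps 3-5: rank |C_i| in ascending order, averaging the ranks of ties.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        tie_end = pos
        while (tie_end + 1 < len(order)
               and abs(diffs[order[tie_end + 1]]) == abs(diffs[order[pos]])):
            tie_end += 1
        mean_rank = (pos + tie_end) / 2 + 1  # ranks are 1-based
        for j in range(pos, tie_end + 1):
            ranks[order[j]] = mean_rank
        pos = tie_end + 1
    # Steps 6-8: reattach signs, sum positive and negative ranks.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)
```

For example, for A = (1, 2, 3) and B = (3, 1, 1) the differences are (-2, 1, 2); |C_2| gets rank 1, the tied |C_1| and |C_3| each get rank 2.5, so W_+ = 3.5, W_- = 2.5 and W = 2.5.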
3.4 DATASETS
Several datasets were used to evaluate and compare different classifiers2. Detailed information - and results where applicable - will be presented in this section. While the majority of the datasets considered are small, two moderate to large datasets are considered: the DFKI and MNIST datasets.
Table 3.1 shows the number of instances, dimensions and classes for several datasets.
The majority of the datasets described in this chapter are those from the intelligent data analysis (IDA) benchmark repository, as described in [62] and available at http://www.fml.tuebingen.mpg.de/Members/raetsch/benchmark. The IDA benchmark repository contains datasets from the University of California Irvine (UCI), data for evaluating learning in valid experiments (DELVE) and statistical and logical (STATLOG) benchmark repositories. The datasets included are: Banana, Breast Cancer, Diabetes, Solar Flare, German, Heart, Image, Ringnorm, Splice, Thyroid, Titanic, Twonorm and Waveform. For details on how these datasets were processed and split, see [62]. Alternative websites where the datasets are available, and where more information is provided, will be mentioned for each.
2 As this thesis was compiled over a period of years, not all datasets are used for all experiments. However, all datasets are used when all the approaches are compared in Chapter 7. In exploratory work, such as finding the optimal block size for Rprop, not all figures and results are shown. The results that are shown were carefully chosen to be representative of the trends observed.
Table 3.1: Number of instances, dimensions and classes of all data sets. For those data sets marked with an asterisk, the IDA benchmark repository version of the dataset is slightly different from UCI.
Dataset                          instances  dimensions  classes  Feature type
Artificial dataset 4096 2 2 real
Banana 5300 2 2 real
Breast Cancer 277* 9 2 categorical
DFKI Speaker Age Recognition 47578 42 7 real
DFKI (classes 1 & 4) 9514 42 7 real
DFKI (classes 2 & 5) 9341 42 7 real
DFKI (classes 5 & 7) 10733 42 7 real
Diabetes 768 8 2 real
German 1000 20 2 real & categorical
Heart 270 13 2 real & categorical
Image 2310 18* 2 real
MNIST Handwritten Digits 70000 784 10 real
Solar Flare 1066* 9 2 categorical
Splice 3175* 60 2 categorical
Thyroid 215 5 2 real
Titanic 2051 3 2 categorical
For the purposes of this thesis, only two-class problems are considered. While it is easy to extend any classifier to multi-class classification [14, 15], it is more difficult to make sense of the results (one has to consider a class confusion matrix). There are also various ways in which a multi-class classifier can be constructed: one-vs-all, n(n-1)/2 pairwise two-class problems with majority voting, and probabilistic outputs, to name a few.
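As an illustration of the pairwise construction, a majority-voting wrapper might look as follows; `binary_predict` is a hypothetical, already-trained two-class classifier for the class pair (i, j), not part of the thesis software:

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(binary_predict, classes, x):
    """Majority vote over the n(n-1)/2 pairwise two-class classifiers.

    binary_predict(i, j, x) must return either class i or class j for
    input x; the class collecting the most pairwise votes wins.
    """
    votes = Counter(binary_predict(i, j, x) for i, j in combinations(classes, 2))
    return votes.most_common(1)[0][0]
```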
A short description of each dataset is now provided.
3.4.1 ARTIFICIAL DATASET
An artificial dataset was created to illustrate graphically how SVMs create decision boundaries (see Fig. 4.3). The dataset was created by randomly sampling from two Gaussian mixture distributions, each with M = 3 mixture components. The d-dimensional probability density function of such a Gaussian mixture, with component means μ_m, covariances Σ_m and mixture weights w_m, is

p(x) = Σ_{m=1}^{M} w_m N(x | μ_m, Σ_m).

[Fixed values of the mixture weights w, the means μ for classes 1 and 2, and the covariances Σ shared by both classes.]

2048 samples were generated for each of classes 1 and 2.
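Sampling from such a Gaussian mixture can be sketched as follows. The weights, means and covariances below are illustrative placeholders only, not the values used to generate the thesis dataset:

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Draw n samples from a Gaussian mixture: pick a component by its
    weight, then sample from that component's Gaussian."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[m], covs[m]) for m in comp])

# Illustrative 2-D parameters only -- not the values used in the thesis.
weights = [0.3, 0.3, 0.4]
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0]), np.array([-2.0, 2.0])]
covs = [np.eye(2)] * 3
class1 = sample_gmm(2048, weights, means, covs)
```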
3.4.2 BANANA
The Banana dataset is a toy two-class dataset used and described in [62]. It was generated by sampling from non-linearly transformed Gaussian and uniform distributions. Four of these distributions were distorted by adding uniformly distributed noise.
3.4.3 BREAST CANCER
The Breast Cancer3 dataset has two classes, one with 201 instances and the other with 85 instances. The nine instances with missing feature values are not present in the IDA benchmark repository version.
The dataset is also available from the UCI machine learning repository [64], at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer.
3This breast cancer domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia.
3.4.4 DFKI SPEAKER AGE RECOGNITION DATABASE
This dataset consists of spoken audio samples from native German speakers. Samples are labeled by gender, as well as by age group (children, young males and females, adult males and females, and senior males and females). A comprehensive overview of the suggested experimental setup is given in [65], while the long-term features used in this thesis are described in [10].
3.4.5 DIABETES
The Pima Indians Diabetes dataset [64] consists of data from female patients (of Pima Indian heritage) who were at least 21 years of age. It can be downloaded from the UCI machine learning repository, at http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.
3.4.6 GERMAN
The German credit data dataset [64] (available at http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) has real and categorical features which are used to classify people as good or bad credit risks.
3.4.7 HEART
The Heart dataset [64] (available at http://archive.ics.uci.edu/ml/datasets/Statlog+(Heart)) classifies patients as having heart disease or not.
3.4.8 IMAGE
The image segmentation dataset [64] (available at http://archive.ics.uci.edu/ml/datasets/Statlog+(Image+Segmentation)) contains 3x3 pixel regions, with every region belonging to one of seven outdoor images, as labeled by humans.
3.4.9 MNIST HANDWRITTEN DIGIT DATABASE
Available at http://yann.lecun.com/exdb/mnist/, this database consists of handwritten digits which have been centered in a 28x28 image [66]. The training set consists of 60000 samples and the test set of 10000 samples. These sets were selected from the National Institute of Standards and Technology (NIST) Special Database 3 training and test sets (50% from the training and 50% from the testing sets).
3.4.10 SOLAR FLARE
The Solar Flare dataset [64] (available at http://archive.ics.uci.edu/ml/datasets/Solar+Flare) contains features describing active regions of the sun, for predicting three types of solar flares.
3.4.11 SPLICE
The molecular biology (splice-junction gene sequences) dataset [64] (available at http://archive.ics.uci.edu/ml/datasets/Molecular+Biology+(Splice-junction+Gene+Sequences)) contains features describing sequences of DNA, with the associated task being to recognize boundaries between exons and introns.
3.4.12 THYROID
The Thyroid gland dataset [64] (available at http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/) contains features describing normal, hypo- and hyper-functioning thyroid glands. The specific dataset used is named the "new-thyroid" dataset.
3.4.13 TITANIC
The Titanic survival dataset (available at http://www.cs.toronto.edu/~delve/data/datasets.html) contains three features (social class, sex and age) which are used to predict whether a person survived the disaster.
3.5 MATHEMATICAL NOTATION
This section contains mathematical notation used throughout this thesis (notation that is not often used is excluded and rather discussed in the text).
Table 3.2: Mathematical notation used throughout the thesis.

Notation      Description
y_i           ith class label
x_i           ith feature vector [1 x_i]
n             total number of samples
M             total number of mixtures
μ             sample mean
s             sample standard deviation
Σ             covariance matrix
T             transpose of a vector or matrix
i             (subscript) indicates the ith element of a range of values
ρ             SVM margin
w             vector normal to the separating hyperplane constructed by an SVM
w_0           bias term in the SVM error function
α_i, β_i      Lagrange multipliers
ᾱ_i           perceptron kernel support vector weights
C             SVM regularization parameter
ξ_i           SVM slack variables
γ             RBF kernel parameter (bandwidth)
K(a, b)       kernel function of a and b
L_P           primal form of the Lagrangian
L_D           dual form of the Lagrangian
T             a bound on the generalization error
R             radius of a sphere
k             as in k-fold cross-validation
S_p           span of support vectors
W             Wilcoxon test statistic
d             number of dimensions
SV            set of SVs satisfying 0 < α_i < C
S̄V̄            set of SVs satisfying 0 < α_i ≤ C
∞             infinity
η             step size in stochastic gradient descent