
CHAPTER ONE

INTRODUCTION

Contents

1.1 Support vector machines
1.2 Perceptron kernel criterion
1.3 Objectives, hypotheses and outline

Very large datasets are becoming increasingly common with modern data collection techniques and more affordable computer hardware [5, 6]. For example, in the image processing domain, MIT has recently released at least two particularly large image datasets. The "80 million tiny images" dataset [5] contains approximately 80 million 32x32 images from the Internet, which can be used for a wide range of applications, among others object classification. More recently, they released the Scene Understanding (SUN) database, which contains 130,519 images from 899 categories, the main objective being scene categorization.

In the speech recognition domain, Google has been rapidly training acoustic models for Voice Search in different languages, with American English for example having thousands of hours of speech [7]. Using applications such as those described in [8] and [9], smartphones are used to collect open-source corpora consisting of hundreds of hours of speech in traditionally under-resourced languages [9]. As a final example, speech has also recently been used to classify speakers according to their age; the "Deutsches Forschungszentrum für Künstliche Intelligenz" (DFKI) age classification dataset has approximately 34,000 training samples, with each sample having hundreds of dimensions, depending on how many mixtures are used to model each age class and also on whether or not mixture weights and variance updates are used in addition to adapted mixture means (see Section 3 of [10] for details on the feature extraction process).

1.1 SUPPORT VECTOR MACHINES

The current availability of such large datasets (and the certainty that these will only become bigger in the future) introduces a dilemma: on the one hand, it is widely believed that support vector machines (SVMs) represent the state of the art for accurate classification [11, 12, 3], which is an important component of pattern recognition [13]; on the other hand, as discussed below, SVMs are expensive to train on large datasets. While much more detail will be provided in Chapter 2, we briefly mention the SVM error function here in order to better contextualize the objectives and hypotheses of this work. Furthermore, all functions mentioned here assume only two-class problems. It is straightforward to construct multi-class classifiers from two-class classifiers (see [14, 15] and Section 3.4 for a discussion of why we decided to focus exclusively on two-class classifiers). The SVM error function can be written as:

$$E_{\text{svm}} = \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w} + C \sum_{i} \xi_{i} \tag{1.1}$$

where $C$ is referred to as the regularization parameter and the $\xi_{i}$ are slack variables. We refer to the first term in Eq. 1.1 as the margin term and the second one as the misclassification term.
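To make Eq. 1.1 concrete, here is a minimal sketch that evaluates the objective for a linear decision function; the hinge-loss form of the slack variables, $\xi_{i} = \max(0, 1 - y_{i}(\mathbf{w}^{\top}\mathbf{x}_{i} + b))$, is the standard soft-margin formulation rather than something stated in this chapter, and all names are illustrative:

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """Evaluate Eq. 1.1 for a linear decision function f(x) = w'x + b."""
    margins = y * (X @ w + b)                 # y_i * f(x_i)
    slacks = np.maximum(0.0, 1.0 - margins)   # xi_i = max(0, 1 - y_i f(x_i))
    return 0.5 * (w @ w) + C * slacks.sum()   # margin + misclassification terms
```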

For reasons that are elaborated on in Section 2.1.3, the margin term can be written as

$$\frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w} = \frac{1}{2} \sum_{i \in SV} \sum_{j \in SV} \alpha_{i} \alpha_{j} y_{i} y_{j}\, \mathbf{x}_{i}^{\top}\mathbf{x}_{j} \tag{1.2}$$

where $y_i$ is the class label for the $i$th SV, $\alpha_i$ is a Lagrange multiplier and $\mathbf{x}_i$ is a feature vector which has been selected as an SV. The dot product between the feature vectors, $\mathbf{x}_i^{\top}\mathbf{x}_j$, can be replaced with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$, of which a popular one is the radial basis function (RBF) kernel:

$$K(\mathbf{x}_{i}, \mathbf{x}_{j}) = \exp\!\left(-\gamma \lVert \mathbf{x}_{i} - \mathbf{x}_{j} \rVert^{2}\right) \tag{1.3}$$

where $\gamma$ is also a hyperparameter, which controls the kernel width.
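As an illustration of Eqs. 1.2 and 1.3, the following sketch computes the RBF kernel and the margin term from a set of support vectors; the function and variable names are illustrative, not from the text:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma):
    """Eq. 1.3: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    diff = xi - xj
    return np.exp(-gamma * (diff @ diff))

def margin_term(alphas, ys, X_sv, gamma):
    """Eq. 1.2 with x_i'x_j replaced by the RBF kernel: the margin
    term (1/2) w'w computed purely from the support vectors."""
    n = len(alphas)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += (alphas[i] * alphas[j] * ys[i] * ys[j]
                      * rbf_kernel(X_sv[i], X_sv[j], gamma))
    return 0.5 * total
```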

Training an SVM entails two processes: (1) training of the SVM hyperparameters, which entails optimization of some validation function to select good values for these hyperparameters, and (2) training of the Lagrange multipliers, which entails solving the constrained optimization problem (minimizing Eq. 1.1 subject to constraints, which will be discussed in Chapter 2).


Popular approaches to training SVM hyperparameters entail an expensive line or grid search for linear and RBF kernels respectively [1, 2], which can be very time-consuming on large datasets [10]. For this reason, we are interested in either training the SVM hyperparameter values more efficiently, or finding techniques that perform approximately as well as an SVM but are computationally cheaper to optimize.
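As a concrete picture of why this search is expensive, the sketch below runs a cross-validated grid search over C and γ; it assumes scikit-learn's SVC and GridSearchCV, which the text does not prescribe, and uses synthetic stand-in data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; X: (n_samples, n_features), y in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=200))

# Logarithmic grid over both RBF hyperparameters. Every (C, gamma)
# pair requires cv full SVM training runs, which is why this search
# dominates total training time on large datasets.
param_grid = {"C": 10.0 ** np.arange(-2, 5),      # 7 values
              "gamma": 10.0 ** np.arange(-4, 2)}  # 6 values
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Even this modest 7 x 6 grid costs 210 cross-validated SVM fits, which motivates the cheaper alternatives explored in this work.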

In Chapter 4, we explore the former approach to more efficient classifiers for large datasets. This is done by investigating the effect of the SVM hyperparameters for the linear kernel (which is simply the dot product of the two feature vectors in Eq. 1.2) and the RBF kernel, and determining how the hyperparameter values vary with different amounts of data (see Chapter 4, Section 4.4). Searching for these parameter values consumes most SVM training time, and searching in the wrong region of the parameter space can take significantly longer than searching closer to the optimal parameters.

An interesting result from Chapter 4 is the behavior of SVMs when one of the hyperparameters, the regularization parameter C (see Eq. 2.12), becomes very large. Our results, in agreement with those from [16], indicate that large values of C are preferable for non-separable datasets, which are typically of interest in pattern recognition. For reasons that are elaborated on in Chapter 5, this observation casts doubt on the large margin classifier (LMC) tag often associated with SVMs. The results of further analysis on this topic are presented in Chapter 5, with the conclusion that there is strong empirical evidence against the LMC tag for non-separable datasets. This is important from the perspective of efficient training of SVM hyperparameters for two reasons: (a) if one can assume a large (not necessarily optimal) value of C to be sufficient for a non-separable dataset, this significantly reduces the total SVM hyperparameter training time, as there is only one hyperparameter to be searched for (or at the very least the search space is reduced), and (b) it suggests that, for non-separable datasets, the margin term in Eq. 1.1 is redundant. This in turn implies that simpler error functions, which only minimize the sum of errors, will give similar results.

1.2 PERCEPTRON KERNEL CRITERION

One such criterion function, which only minimizes the sum of errors, is the perceptron criterion [17, 18]. A non-linear variant can be created as follows: the weight vector can be replaced with a sum of samples that are either misclassified, or that fall within a narrow region around the decision boundary. This can be written as a dot product with the current sample being evaluated, which can in turn be replaced with a kernel function. (For more details on the perceptron kernel (PK) expansion, the reader is referred to Section 5.2.3. More detail on the process of replacing a dot product of vectors with a kernel function is provided in Section 2.1.3.)
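A minimal sketch of this kernelized perceptron update follows, with the implicit weight vector stored as per-sample coefficients and an illustrative margin parameter defining the "narrow region"; the actual PK expansion is given in Section 5.2.3, so the names here are assumptions:

```python
import numpy as np

def kernel_perceptron_epoch(X, y, coeffs, kernel, margin=0.1):
    """One pass of a kernelized perceptron criterion.

    The weight vector is never formed explicitly: it is kept as the
    implicit sum sum_k coeffs[k] * phi(x_k), so the score of sample
    x_i is sum_k coeffs[k] * K(x_k, x_i).
    """
    n = len(X)
    for i in range(n):
        score = sum(coeffs[k] * kernel(X[k], X[i])
                    for k in range(n) if coeffs[k] != 0.0)
        # Update on misclassified samples and on samples falling in
        # the narrow region around the decision boundary.
        if y[i] * score <= margin:
            coeffs[i] += y[i]
    return coeffs
```

Starting from coeffs = np.zeros(len(X)), the rbf_kernel sketched earlier could be passed as the kernel argument.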


The PK is very similar to the second term in Eq. 1.1. In Chapter 6, stochastic gradient descent (SGD) is considered as an approach to optimizing the PK function efficiently. Intricacies associated with SGD, such as good initialization and stopping criteria, are also investigated.
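A linear sketch of the kind of SGD loop considered in Chapter 6 is shown below, with zero initialization and a validation-based stopping criterion; the learning rate, margin and patience values are illustrative assumptions, not the chapter's actual settings:

```python
import numpy as np

def sgd_perceptron(X, y, X_val, y_val, lr=0.01, margin=0.1,
                   max_epochs=100, patience=5):
    """SGD on a linear perceptron criterion with validation stopping."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])   # zero initialization: one illustrative choice
    best_err, best_w, stall = np.inf, w.copy(), 0
    for _ in range(max_epochs):
        for i in rng.permutation(len(X)):
            if y[i] * (w @ X[i]) <= margin:   # error or near the boundary
                w += lr * y[i] * X[i]         # stochastic update
        val_err = np.mean(np.sign(X_val @ w) != y_val)
        if val_err < best_err:                # track the best model so far
            best_err, best_w, stall = val_err, w.copy(), 0
        else:
            stall += 1
            if stall >= patience:             # simple stopping criterion
                break
    return best_w
```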

The observations from all the chapters are then combined in Chapter 7, where results from classifiers trained using SGD are compared to traditionally optimized SVMs, as well as other approaches that are suggested by our analysis.

Our experiments are performed on a variety of datasets, which are briefly summarized in Appendix 3.

1.3 OBJECTIVES, HYPOTHESES AND OUTLINE

Our main objectives in this study are (a) to gain a better understanding of the behavior of SVM training algorithms, especially as a function of the amount of training data available and (b) to develop algorithms that can be used to train SVMs efficiently. To that end, we investigate the following principal hypotheses:

1. Practical SVMs often do not function as LMCs.

2. More efficient methods than those currently considered state-of-the-art can be found: these can achieve competitive recognition accuracies using SVMs with much less computation than conventional grid searches.

3. Stochastic learning methods can achieve attractive trade-offs between accuracy and computation time for a class of classification problems characterized by a large number (10,000+) of training samples.

The remainder of this thesis is structured as follows: in Chapter 2 we provide background information on concepts relevant to our work, whereas Chapter 3 details the empirical methods we have employed. Chapters 4 and 5 contain investigations into the practical functioning of SVMs, the former being a general investigation into the role of SVM hyperparameters and the latter specifically exploring whether SVMs should be considered LMCs. In Chapter 6, a new stochastic training approach is described. A comprehensive empirical investigation into the relevance and performance of various standard, state-of-the-art and novel algorithms is provided in Chapter 7. Finally, our work is summarized in Chapter 8, which also lists the main contributions of this work as well as the most important unresolved issues arising from our research.
