
Master’s Thesis

A comparative analysis of classification algorithms on partly structured, multi-class imbalanced data

Faust van der Molen

Student number: 11401842

Date of final version: 14-08-2018

Master’s programme: Econometrics

Specialisation: Big Data Business Analytics

Supervisor: Prof. dr. M. Worring

Second reader: Dr. N. P. A. van Giersbergen

Statement of Originality

This document is written by Faust van der Molen who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

Contents

1 Introduction
2 Theoretical Framework
  2.1 Class Imbalance: The Binary Case
  2.2 Class Imbalance in a Multi-class Setting
  2.3 Background on Techniques Used
3 Overview of Classifiers
  3.1 Neural Network
  3.2 SVM
  3.3 Decision Trees
  3.4 Random Forest
  3.5 Logit
4 Data
  4.1 Sparkholder Accounting Data
  4.2 Kaggle.com Wine Reviews
5 Methodology
  5.1 Performance Measures
  5.2 Missing Data
6 Experiments
  6.1 Experiment 1: Selecting Textual Features
  6.2 Experiment 2: Dimension Reduction
  6.3 Experiment 3: Classifier Comparison
  6.4 Experiment 4: Cost-Sensitive Learning
7 Discussion
A Parameter tuning results


Chapter 1

Introduction

If you are active in the data mining field and have ever worked with imbalanced datasets, chances are high that you have come across examples of algorithms performing classification tasks on imbalanced data. These examples are often about fraud detection, churn behaviour or medical diagnosis. In the case of medical diagnosis the classifier often predicts whether a person is “healthy” or “diseased”, but what if we want the classifier to do more than binary classification?

Continuing with the medical example, it can be useful to predict the kind or type of disease a person has based on his medical data. In this setting we would train a classifier on characteristics of the patient like age, sex, weight and height, along with for example results from blood tests. Usually when a doctor makes a diagnosis, the medical history of the patient is considered as well, in the form of medical reports from earlier visits. In machine learning there are ways to include these textual features as input for the classification too, so that we have a dataset consisting of a number of columns with structured data and a medical history consisting of unstructured text fields, i.e. a partly structured dataset. In a medical scenario we expect the classes to be highly imbalanced, as some diseases are rare and occur in only one in a million people, whereas something like a cold or the flu occurs a lot more often. In such a classification problem, when only a few in a hundred observations belong to the minority classes, it does not hurt the performance of the classifier much to simply predict the best matching majority class and never predict the minority classes. That way it can still attain a high accuracy score.

The first thing to note, then, is that accuracy might not be the right performance metric for all classification problems. Secondly: can we do something about the imbalance in the data, and if so, what is the right approach? And last but not least: can we penalize mistakes on the minority classes more heavily, in order to reduce the bias of the trained classifier?

All of these questions have already been thoroughly researched in the past years. For example, Batista [5] did an extensive empirical analysis of different over- and undersampling schemes. This analysis includes the popular SMOTE algorithm by Chawla [12], which combines undersampling of the majority class with creating synthetic samples of the minority class to aid the training of the classifier. More recently Lopez [38] published a paper covering not only different data sampling schemes, but also the use of cost-sensitive learning and ensemble methods when working with imbalanced data. He provides some useful insights on the influence of the imbalance ratio on classification performance (under an appropriate performance measure, of course). He argues that the extent of the imbalance is of much lesser importance than the separability of the data. Other authors like Luengo [39] back this up by showing that the imbalance ratio is actually irrelevant, and that data complexity is a much more important measure for indicating the seriousness of the class imbalance.

A feature that all the above analyses have in common is that they were performed on binary classification problems. It is, however, not a given that these insights generalize easily to multi-class classification problems like the medical diagnosis setting we described. In an empirical study using 12 binary-class datasets and 9 multi-class datasets, Zhou [69] shows that many of the sampling techniques often used in binary problems are ineffective in a multi-class setting, sometimes even worsening performance. Similarly, Sun [57] writes that resampling techniques for multi-class problems are impractical and suggests boosting algorithms as a solution, an observation supported by Wang [62]. None of these papers used datasets that also contained textual data, though.

In this thesis we aim to research the topic of classifying imbalanced multi-class data further, on a specific type of dataset. The question we want to answer is: how do classification algorithms compare to each other in classifying partly structured, multi-class imbalanced datasets? To do so, we first research the performance of the traditional base classifiers and compare them to each other. Secondly, we conduct an experiment on cost-sensitive learning to see if that helps classification performance.

We use two different datasets for our experiments. One is provided by the company Sparkholder (https://www.sparkholder.com/) and contains accounting information from Small and Medium Enterprises (SMEs). The dataset contains a few columns of structured data, as well as a text field, which is responsible for the unstructured part. This dataset contains 34 classes, with an imbalance ratio of almost 100:1.

The second dataset we use is from Kaggle.com and concerns wines. Alongside a short wine review, which obviously is the unstructured part of the data, we also have a few columns containing the country and province of origin, as well as the price and score of the wine. From this we wish to predict the variety of the wine, i.e. is it a Sauvignon Blanc or a Pinot Grigio? The total number of varieties in the dataset is well over 600, but we will focus on the 34 most common varieties, so that this is similar to the number of classes in the Sparkholder dataset.

With this approach the two datasets share most characteristics. Both have only a few columns of structured data accompanied by a text field from which information needs to be extracted to help classification. In the first dataset the descriptions are made up of only a few words, in the latter a few sentences. The two datasets also share the fact that the text fields contain specialized topics only: accounting vocabulary for the first and wines for the second.

The setup is as follows. In Chapter 2 we take a closer look at class imbalance, first in a binary setting and then in the multi-class framework. We conclude that part by giving some background on the techniques used to prepare the data for input to the classifiers, for example how we treat the text fields. Then in Chapter 3 we provide a basic overview of the classifiers we use in this thesis. After that we describe the data and the feature engineering process for both datasets in Chapter 4. Next, we present our general methodology in Chapter 5. In Chapter 6 the experiments that we carry out are explained and the results are evaluated. We end with a discussion in Chapter 7.


Chapter 2

Theoretical Framework

We kick off with a short review of the current status of the field of classification of imbalanced data. First we consider the usual setting in which there are only two classes and describe some of the findings of papers on this topic. Next, we will consider a few papers that try to generalize the methods of the binary case to the multi-class setting. Lastly we consider the various techniques that we apply to the data that we use in this thesis.

2.1 Class Imbalance: The Binary Case

Class imbalance occurs in many datasets. In a binary setting often the more interesting class is also the class that has fewest observations. Recall for example the fraud case, or a problem of classifying medical data with “healthy” and “sick” labels. The imbalance in classes is usually a problem, because in a machine learning setting, the algorithms are always minimizing some cost function. More often than not, this results in bad performance on the minority classes.

Lopez [38] covers a few reasons why class imbalance may be a problem and cites several papers giving a more in-depth explanation of each specific topic. We briefly cover the most important ones and refer the reader to those papers for more detail.

Imbalance in the data can be a problem for several reasons. If the data is noisy and the minority group is small, then observations from the minority group are easily mistaken for noise. This problem is magnified when the classes of the data are overlapping. Denil [19] shows that the classification of imbalanced data can be performed well by an SVM, as long as the overlap between the classes is small. As the overlap between classes grows, having imbalance in the data also has a higher impact on performance. Napierala [41] argues that another problem stemming from class imbalance is the existence of small disjuncts, i.e. the few minority examples are split up in even smaller clusters, making it harder to learn them. He shows that in such a scenario a classifier’s performance can be greatly improved by non-random resampling, where noisy majority class samples are removed from the training set.

Many more examples exist of resampling the data. Batista [5] did a comparison between SMOTE, random undersampling, random oversampling and some more sophisticated nearest-neighbour based sampling schemes. He shows that most sampling schemes yield an improvement over the original data on some datasets, and that random oversampling almost always provides very competitive results despite being a relatively simple solution.

Resampling the data can be categorized as a data level approach to tackling the class imbalance problem, but another approach is also possible. The MetaCost algorithm [21] works by training an ensemble of classifiers on a set of bootstrap-samples from the training set. The training set is run through the ensemble and a distribution over the classes is returned by averaging the probability distributions returned by the members of the ensemble (these may also be degenerate distributions). These probabilities are weighted by a cost matrix and the classes with the highest probability are chosen as the new labels for the training data. This relabeled training set is used as input for the final classifier. Several modifications and improvements to MetaCost have been introduced afterwards, for example by Zadrozny [67]. She relaxes the assumption that all costs are known in advance and proposes a decision tree and Naive Bayes implementation for estimating the class probabilities that systematically outperforms MetaCost.
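To make the relabeling step concrete, the sketch below shows a stripped-down version of it: given class probabilities averaged over the bagged ensemble and a cost matrix, each training observation receives the label with the lowest expected cost, i.e. the cost-weighted choice described above. This is an illustrative sketch under those assumptions, not the full MetaCost algorithm; probs and costs are hypothetical inputs.

```r
# Simplified sketch of the MetaCost relabeling step (illustrative, not the full algorithm).
# probs: N x K matrix of class probabilities, averaged over the bagged ensemble
# costs: K x K matrix, costs[j, k] = cost of predicting class k when the true class is j;
#        its column names are assumed to hold the class labels
metacost_relabel <- function(probs, costs) {
  risk <- probs %*% costs          # expected cost of assigning each candidate label
  colnames(costs)[max.col(-risk)]  # per observation: label with the lowest expected cost
}
```

The relabeled vector then replaces the original training labels before the final classifier is fitted, as described above.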

Another way to try to resolve the class imbalance problem is to modify the classification algorithm itself; this is called cost-sensitive learning. One could think of implementing cost-sensitive loss functions in a neural network, implementing per-class misclassification costs in an SVM, or using boosted decision tree ensembles, which reportedly often outperform the different sampling techniques [50].

2.2 Class Imbalance in a Multi-class Setting

While many authors have devoted their time to researching the class imbalance problem in a binary setting, literature on the multi-class case is much sparser. According to Hoens [29], who wrote a paper on decision tree ensembles in a multi-class setting, this might be due to the fact that extending the definition of the binary imbalance problem to multi-class is non-trivial. In a binary setting the amount of imbalance can be represented by just the ratio between the majority and minority class. In the multi-class setting we could have a dataset consisting for the most part of one majority class and several very small minority classes, or a few large classes, then a majority of medium-sized classes and one minority class, and every variety in between. It is therefore clear that the analysis of the multi-class imbalance problem is not as straightforward as the binary case.

Intuitively, sampling techniques may sound easy to generalize to the multi-class setting, but this is not necessarily true. Even if we ignore for the moment the different ways in which a dataset can be imbalanced, it is still not clear what our target ratios for oversampling should be, as this can be done in many different ways.

Zhou [69] studied these techniques empirically using neural networks. He makes a clear distinction between performance on binary and multi-class data, and between more or less balanced datasets and imbalanced datasets. From his research we can conclude that sampling techniques have very mixed results. On multi-class datasets SMOTE and undersampling have a negative effect on performance, so they should be avoided. Soft ensembles are ensembles where not votes, but probabilities are used to determine the final class predictions. Threshold-moving is a technique where the predicted probabilities are recalculated using the prior class probabilities. Zhou concludes that threshold-moving and soft ensembles provide the most consistent improvements when working with multi-class imbalanced data.

Most papers on the topic of multi-class imbalanced data research one specific classifier and try to find improvements to that kind of classifier, e.g. specific decision tree improvements like Hoens [29] and Wang [62]. Zhou [69] uses solely neural networks and Chen [14] focuses on an improved SVM classifier. None of these studies include comparisons between different base classifiers when working with multi-class imbalanced data. Fernandez [24] did perform such a study, on 112 of the UCI datasets, comparing 179 different classifier implementations on several different attributes. Overall he found Random Forests and SVM to be the best classifiers. When weighing the accuracy scores by the number of classes in the classification, SVM and RF still come out on top. There is one issue with his study though: Fernandez only considers accuracy, which may be a bad performance measure for unbalanced datasets. Furthermore his datasets do not contain unstructured data. That is what the focus of this thesis will be: we research the performance of several base classifiers on partly structured, imbalanced multi-class data.

2.3 Background on Techniques Used

Before we can use our data as input for the classifiers, we need to prepare the text fields. Several techniques to do so are discussed below. After that we briefly cover dimension reduction techniques, as the feature vectors can grow very large and not every classifier handles this well.

2.3.1 Text processing

When using text fields as input for a classification algorithm, there are various techniques available to convert the texts to a usable (numeric) format. The methods we use in this thesis are Bag of Words, LDA and Word2Vec. We will explain these techniques briefly.

Bag of Words

The first of the textual features we consider is the Bag of Words approach, or BoW for short. This is the most basic textual feature we can imagine. We build a document-term matrix (DTM) in such a way that the columns of the matrix correspond to the words in the corpus and the rows correspond to the observations. A text field like “hello sir” would get a one in the column labelled “hello” and the column labelled “sir”, and zeros in all other columns. This may seem like a blunt approach for larger texts, but often it works surprisingly well.

When building a DTM it is common practice to transform all characters to lower case, to remove numbers and to remove punctuation. Furthermore we usually remove stop words and infrequent words. Normally the weight of a word in a text is just the frequency of its appearance in that text. A more sophisticated way to weigh the terms is to use the term frequency-inverse document frequency, or tf-idf. This method allows us to easily filter out non-discriminating words like prepositions and articles. Using the inverse document frequency as a weight was first suggested by Sparck Jones [51], after which Salton [46] summarized different measures to calculate the term frequency and tf-idf. The basic idea is that a word occurring in all documents does not have much discriminative power, so it gets a lower weight in the DTM. Conversely, words occurring very often in one text get higher weights. This can be summarized in the following formula:

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D), \qquad (2.1)

where tf(t, d) is the count of term t divided by the length of document d, and all documents d together make up the corpus D. The inverse document frequency is calculated as

\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}, \qquad (2.2)

where N is the total number of documents in the corpus.

We can use this tf-idf value both for weighting the terms in the DTM as well as filtering out words below a certain threshold.
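As an illustration of how this weighting and filtering could be carried out in practice, the sketch below builds a tf-idf-weighted DTM with the tm package (which is also used for the Sparkholder descriptions later on). The example texts and the threshold of 0.1 are made-up placeholders.

```r
library(tm)

docs <- c("hello sir", "invoice for rent payment", "rent payment bank account")  # toy corpus
corpus <- VCorpus(VectorSource(docs))

# Build the DTM with tf-idf weights instead of raw term frequencies
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

# Keep only terms whose maximum tf-idf weight exceeds a chosen threshold
m <- as.matrix(dtm)
dtm_filtered <- m[, apply(m, 2, max) > 0.1, drop = FALSE]
```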

LDA

A more sophisticated approach to text processing was introduced by Blei [7]. He explains that texts can be modelled as a collection of latent topics, which can be inferred using his LDA-algorithm. In a more general sense he regards the words in a document as exchangeable random variables and as such they can be modelled as a mixture distribution. He shows that the LDA algorithm performs very well on capturing the topics of and relations between different documents.

There are however situations where LDA performance suffers. Hong [30] shows several topic modelling schemes in which texts from the same author are aggregated to improve topic modelling on short texts.

It is hard to tell the appropriate number of topics in advance, but there are several ways to evaluate a topic model. For example, Griffiths [28] uses the posterior probability of the models given the observed data. Arun [4] views LDA as a matrix factorization of the DTM. Let V be the number of terms in the vocabulary, T the number of topics and D the corpus, as before, and let l be a 1 × |D| vector with the document lengths. The DTM is factorized into a T × V matrix denoted P and a |D| × T matrix denoted Q. He argues that under the right number of topics the singular value distribution of P is comparable to the distribution of the vector lQ. To that end he minimizes a measure of divergence between the two probability distributions, the symmetric KL-divergence, to choose the number of topics. Yet another performance measure for LDA is introduced by Cao [11], who argues that we should minimize the average cosine distance between topics to find the right number of topics. Lastly, Deveaud [20] argues that the number of topics should be selected by maximizing the information divergence between all pairs of LDA topics.
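All four of these criteria are implemented in the R package ldatuning, so a scan over candidate topic counts could look like the sketch below. That the thesis used this particular package is an assumption, and the topic grid and Gibbs settings are placeholders.

```r
library(ldatuning)

# dtm: a document-term matrix (e.g. built with tm); scan a grid of candidate topic counts
result <- FindTopicsNumber(
  dtm,
  topics  = seq(5, 100, by = 5),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 42)
)

# Arun2010 and CaoJuan2009 are to be minimised, Griffiths2004 and Deveaud2014 maximised
FindTopicsNumber_plot(result)
```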

Word2Vec

Mikolov [40] introduces another approach to extracting textual features. In his 2013 paper he describes the method known as word2vec, where he aims to find vector representations for words in a corpus. This is done using either a continuous bag of words (CBOW) model, or using the skip-gram model.

When training a word2vec model we try to find vector representations for the words in our corpus. We first decide on a window width, which corresponds to the number of surrounding words to consider. In the CBOW model we use the surrounding words in the window as input for the model and the word itself as output. These words are coded using 1-of-V coding, where V is the number of terms in the vocabulary. We do not account for word order, hence the bag-of-words part of the name. The skip-gram model works almost the same, with the difference that it uses the individual words as input and is trained to predict the surrounding words as output. For both models the number of nodes in the hidden layer equals the dimension of the vectors, and the vector representing a word is just its vector of weights to the hidden layer. The skip-gram model provides vectors of higher quality at the cost of time efficiency, so we opt to use the skip-gram model.

Several authors try to compare different word embeddings by how they perform on different NLP tasks, like Schnabel [48] and even Zhang [68], who also uses the word embeddings as input for a classification task. He shows that for different vector lengths all options provide good performance in word similarity. We will further discuss the method for choosing the vector lengths in Paragraph 6.1.2.
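For reference, training a skip-gram model in R could look like the sketch below, using the word2vec package. This is an assumption about tooling (the thesis does not state which implementation it relies on), and the vector dimension and window width shown are placeholders.

```r
library(word2vec)

# txt: character vector with one cleaned, lower-case document per element (assumed to exist)
model <- word2vec(x = txt, type = "skip-gram", dim = 100, window = 5, iter = 20)

embeddings <- as.matrix(model)            # one row per vocabulary word
predict(model, "wine", type = "nearest")  # sanity check: nearest neighbours of a word
```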

2.3.2 Dimension Reduction

Principal component analysis (PCA) can be useful when we want to reduce the dimension of the input for a classification algorithm. It is probably the most well-known technique in the field of factor analysis, according to Abdi [2]. In short, PCA is a matrix decomposition of the data, where we explain most of the variance in the data by the vectors with the largest singular values of the singular value decomposition. Novakovic [42] shows that on top of dimension reduction, PCA can improve classification accuracy, especially for algorithms that have no built-in feature extraction/selection mechanisms, such as neural networks. Subasi [54] and Howley [31] also show improvements in generalization performance using an SVM and PCA. The performance of tree-based classifiers, however, may suffer a lot: in most experiments by Howley the error of the C4.5 tree increased fourfold. There is another drawback: PCA only works on continuous numerical variables.

Many datasets also include categorical variables. To that end, several extensions to PCA are possible. MCA, or multiple correspondence analysis, works on datasets containing solely categorical variables, as explained by for example Abdi [1]. Multiple Factor Analysis (MFA) was introduced by Escofier [23] and allows for grouping of variables and those groups may either be numerical or categorical. The technique applies PCA to the groups of numerical variables and MCA to groups of categorical data. Then a global PCA is applied to the intermediate resulting matrices from the groups to get the final result. A special case of MFA is called Factor Analysis of Mixed Data (FAMD), where every variable is its own group. Using these techniques we will prepare our data for the classifiers that are introduced in the next chapter.
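A minimal sketch of FAMD on a mixed data frame is shown below, using the FactoMineR implementation. The choice of package and the number of retained components are assumptions for illustration.

```r
library(FactoMineR)

# df: data.frame mixing numeric columns and factors (categorical variables), assumed to exist
famd <- FAMD(df, ncp = 50, graph = FALSE)

# Coordinates of the observations on the retained components, usable as classifier input
reduced <- famd$ind$coord
```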


Chapter 3

Overview of Classifiers

Before we do any sort of comparison between the classifiers that we will use in this thesis, let us start by giving a general overview of them. We give a short walk-through of how the classifiers work and, where applicable, mention some use cases relevant to our problem that we found in the literature. We also consider problems with overfitting, the situation where a model performs extremely well on the training set but does not generalize well to the test data. We partly follow the work by Bishop [6] for the neural network and SVM, complemented by a few papers. The other paragraphs are loosely based on the papers mentioned in the text.

Let us introduce some notation. Suppose we have N observations in our dataset D = \{x_i, t_i\}_{i=1}^{N}, and let the vector of explanatory variables be made up of H features, x = (x_1, \dots, x_H). Furthermore, suppose that there are K different classes, denoted by y_1, \dots, y_K.

3.1 Neural Network

Chapter 5 of Bishop [6] gives an extensive overview of neural networks. We summarize the most important parts here, without leaving out too many essential details.

The type of neural network we consider here is the multi-layer perceptron (MLP). When we talk about neural networks in this thesis we implicitly mean MLP networks. These neural networks are structured as follows. There is always an input and an output layer and a number of hidden layers. The input layer has H nodes, corresponding to the features. In the case of classification with K classes, the output layer always consists of K nodes. The output nodes together give a probability distribution over the classes for the given input.

In between the input and output layer there are a number of hidden layers with M nodes. All nodes in the aforementioned layers are connected and these connections are given weights.

Now, the values of the input x determine the values of the hidden units z = (z_1, \dots, z_M) in the following way. First calculate the activations a_j, for j \in \{1, \dots, M\}, by the formula

a_j = \sum_{i=0}^{H} w^{(1)}_{ji} x_i, \qquad (3.1)

where w^{(1)}_{j0} corresponds to the bias (or intercept in econometric terminology). Then transform these activations using an activation function h, so that

z_j = h(a_j). \qquad (3.2)

Examples of these activation functions are the sigmoid function and the hyperbolic tangent, but also piece-wise linear functions like \max(0, a_j), which is called the ReLU function and is often used as the activation function for the hidden units, as it is much faster than the sigmoid function at a minor performance loss, as for example Dahl [17] mentions.

For multi-class classification, the probability that an observation is of class C_k is given by the softmax function:

p(C_k \mid x) = y_k(x, w) = \frac{\exp(a'_k)}{\sum_j \exp(a'_j)}, \qquad (3.3)

where a'_k = \sum_{j=0}^{M} w^{(2)}_{kj} z_j is the activation of output node y_k.

Now that we understand how to calculate class probabilities using a neural network, all that remains is to explain how to obtain the set of weights for the network. Training of a neural network works through error back-propagation, using some loss function. Common loss functions are the cross-entropy loss for softmax layers:

L(w) = -\sum_{i=1}^{N} \left[ t_i \ln(y_i) + (1 - t_i) \ln(1 - y_i) \right], \qquad (3.4)

and, for layers with piece-wise linear activation functions, often the mean squared error:

L(w) = \frac{1}{2} \sum_{i=1}^{N} \| y(x_i, w) - t_i \|^2, \qquad (3.5)

so that in both cases the derivative of the loss function with respect to the weights is a function of the difference between the prediction and the target value.

Using this derivative, we can apply the stochastic gradient descent (SGD) updating scheme to obtain estimates of the weights. We initialize w^{(0)} and update according to

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla L(w^{(\tau)}), \qquad (3.6)

where \eta is the (predefined) learning rate of the network. Note that we converge to a set of weights that satisfies \nabla L(w) = 0, but there is no guarantee that this is a global optimum; it could very well be a local minimum of the loss function.

Using SGD we process data sequentially or in mini-batches instead of all at once in a single large batch. This has the advantage that we can use very large datasets without running into issues with computational feasibility. Another advantage is that we can reuse the same observation multiple times when training the network. We call one pass of all data through the network an epoch. In practice it is common to use multiple epochs to get higher accuracy from the model. We should be wary of using too many epochs though, as that makes the model susceptible to overfitting.

When building neural networks there are a few other parameters we can adjust in order to get a model that is better suited to the data. Firstly we have to choose how many hidden layers to use in the model and how many nodes each should have. For simple networks one or two hidden layers is usually enough, as shown by Huang [32]. He shows mathematical bounds on the errors when using one and two layer neural networks.

The last phenomenon regarding neural networks we treat is the dropout rate. In addition to not using too many epochs to avoid overfitting, we can set a dropout rate per layer of the network. This works by randomly disconnecting a fraction of the nodes (i.e. setting them to zero) during the training phase. The nodes to disconnect are varied per batch of observations that is processed. Srivastava [52] argues that in general a dropout rate of 0.2 for the input layer and 0.5 for the hidden layers works well to reduce overfitting and thus increase the generalization performance of the model. In a sense, using dropout in a neural network can be interpreted as an ensemble method, as the weights in the network are a combination of weights obtained from s different training models, where s = N / (batch size).
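To make the above concrete, the sketch below specifies a network with one ReLU hidden layer, the dropout rates suggested by Srivastava, a softmax output over the K classes and SGD training, using the R interface to Keras. It is an illustrative configuration under those assumptions, not the tuned architecture used in the experiments; x_train and y_train are placeholders.

```r
library(keras)

# x_train: N x H feature matrix, y_train: N x K one-hot class matrix (assumed to exist)
model <- keras_model_sequential() %>%
  layer_dropout(rate = 0.2, input_shape = ncol(x_train)) %>%  # dropout on the input layer
  layer_dense(units = 64, activation = "relu") %>%            # hidden layer with ReLU units
  layer_dropout(rate = 0.5) %>%                               # dropout on the hidden layer
  layer_dense(units = ncol(y_train), activation = "softmax")  # K output nodes, cf. eq. (3.3)

model %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = "sgd",        # SGD updates as in eq. (3.6); the rate can be set via optimizer_sgd()
  metrics   = "accuracy"
)

model %>% fit(x_train, y_train, epochs = 20, batch_size = 32)  # several epochs, mini-batches
```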

3.2 SVM

For our treatment of Support Vector Machines we loosely follow Chapter 7 of Bishop [6]. We explain the SVM classifier on an intuitive level, leaving out a lot of the mathematics. We then explain how the SVM can be extended to do multi-class classification as well. For a more in-depth approach, we refer the reader to Chapter 7 of Bishop [6].

An SVM is a maximum margin classifier. This means that we want to find a decision line or (hyper)plane that separates the classes present in the data. Usually we transform the input data into a higher dimensional feature space. To do so we can use any kernel function, like for example the linear kernel, the Gaussian kernel or the polynomial kernel. While the data might not be linearly separable before this transformation, in many cases we have more luck after this transformation, although cases where data is exactly separable are rare in practice. The hyperplane we choose is the one that maximizes the distance between the nearest data points of the different classes, hence the name maximum margin.

To start, consider the following linear model for binary classification:

y(x) = w^T \phi(x) + b, \qquad (3.7)

where \phi(x) is some transformation of the input into the feature space, and b a bias parameter. Then, mathematically speaking, the above boils down to minimizing

\frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n, \quad \text{subject to} \quad t_n y(x_n) \ge 1 - \xi_n \ \text{and} \ \xi_n \ge 0, \qquad (3.8)

for all n. Here t_n is the label of the observation and C can be seen as a regularization parameter balancing the trade-off between a small margin and misclassification errors. Setting C too high will lead to overfitting on the training data and bad generalization performance. Finally, \xi_n is a slack variable introduced to allow data points to lie in the margin around the decision plane. The hyperplane is thus determined by the vectors (data points) that lie exactly on or inside the margin. We call these data points the support vectors. Now if we want to classify new data, we transform these into the higher dimensional feature space as well and compare them against the support vectors we found when fitting the model. Note that in contrast to, for example, neural networks, the SVM classifier only returns the predicted label of a new observation, but does not provide probabilities.

SVM was originally proposed as a binary classifier, but generalizations exist that extend it to multi-class classification. A straightforward implementation would be to train K different SVMs, where K is the number of classes: for the k-th SVM, class k is labelled as the positive outcome and the other K − 1 classes are labelled as negative. We then train K different one-vs-rest classifiers on these relabelled datasets. There are however a few problems with this approach. First there is the problem of selecting the label to choose from the K SVM outputs. A heuristic approach is to select the maximum over k, leading to

y(x) = \max_k \left( w_k^T \phi(x) + b_k \right), \qquad (3.9)

where w_k and b_k follow from the corresponding SVMs and \phi(x) denotes the transformation of the input vector x into the feature space. Bishop [6] argues that this heuristic offers no guarantee of being appropriately scaled, as the K classifiers have been trained on different tasks.

Apart from that there is another problem. Say we have data divided into 25 classes with 20 observations for each class, so that we have a perfectly balanced dataset of 500 observations. Using the one-vs-rest approach outlined above, we would train the models on samples of 20 positive and 480 negative observations, causing huge class imbalance.

To combat these shortcomings Weston and Watkins [63] propose a model that directly generalizes formula (3.8). They minimize:

\frac{1}{2} \sum_{m=1}^{K} \|w_m\|^2 + C \sum_{n=1}^{N} \sum_{m \ne t_n} \xi_n^m, \quad \text{subject to} \quad w_{t_n}^T \phi(x_n) + b_{t_n} \ge w_m^T \phi(x_n) + b_m + 2 - \xi_n^m, \quad \xi_n^m \ge 0, \quad m \in \{1, \dots, K\} \setminus t_n. \qquad (3.10)

While they argue that their model is theoretically more sound, the resulting models do not reduce error rates by much. They do, however, use far fewer support vectors. The major downside of their model is that it takes much longer to compute: instead of the O(KN^2) complexity of the one-vs-rest approach, the method of Weston and Watkins is of O(K^2 N^2) complexity.

For this reason Crammer and Singer [16] introduce a different way of generalizing the problem. Instead of adding constraints for every class, they provide a more elegant solution. Crammer and Singer's approach uses a generalization of separating hyperplanes, which induces a generalized notion of margins for multi-class problems. They use this notion to describe a much more compact quadratic optimization problem than Weston and Watkins did. Moreover, this problem is decomposable into multiple smaller problems, yielding memory and time gains in the implementation. Crammer and Singer conclude that their method achieves competitive results and running time on various datasets.

Lastly we mention Sun [55], who shows that an SVM works well for text classification on imbalanced data. They argue that the SVM suffers less from class imbalance due to its reliance on support vectors, so that the number of other instances does not matter that much. Wu [65], on the other hand, states that adding more samples to the majority class skews the decision boundary in favour of the majority class. There is thus no agreement on the performance of the SVM classifier on binary imbalanced data, and it will be interesting to see what conclusions we can draw for multi-class imbalanced data.
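As a point of reference, a multi-class SVM with per-class weights can be fitted in R with the e1071 package (a wrapper around libsvm), as sketched below. Note that libsvm decomposes the multi-class problem via pairwise one-vs-one voting, a different scheme from the one-vs-rest and Crammer-Singer formulations discussed above; the kernel, cost parameter and weighting rule are placeholders, not the settings used in our experiments.

```r
library(e1071)

# x: feature matrix, y: factor with K class labels (assumed to exist)
# Weight classes inversely proportional to their frequency, so minority classes count more
tab <- table(y)
w <- setNames(as.vector(nrow(x) / (length(tab) * tab)), names(tab))

fit  <- svm(x, y, kernel = "radial", cost = 1, class.weights = w)
pred <- predict(fit, x)
```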

3.3 Decision Trees

Of all the methods we apply in this thesis decision trees are by far the most intuitive. We will talk about CART, C4.5 and C5.0 decision trees. We chose C4.5, because it is one of the most popular decision trees according to Drummond [22]. C5.0 is a newer version of C4.5, and also includes boosting, a technique that often works well on imbalanced data, as several authors show [34] [56] [61]. We do not use CART as a separate classifier, but it is used in the Random Forest procedure that is explained in the next section.

CART stands for Classification And Regression Tree, like the homonymous book by Breiman [10]. The trees are trained by forming binary splits that are selected so that we reduce the impurity in the child nodes as much as possible. This impurity can be calculated by several different measures: the Gini index is used in the CART implementation, but Information Gain (based on entropy) or the less-used twoing criterion can also be used for selecting splits. A CART is first built to its maximum length, and afterwards a procedure called pruning is employed to cut off parts of the tree to improve generalization performance. Unpruned trees can be susceptible to overfitting.

Another heavyweight in the world of decision trees is Quinlan [44], who introduced the C4.5 decision tree. Wu [66] describes some of the differences between CART and C4.5. The first is the splitting criterion: the C4.5 tree uses Information Gain for choosing the variable to split on. Furthermore, C4.5 trees do not necessarily form binary splits, as their tests can have more than two outcomes. Lastly, the pruning is performed differently. For more details we refer the reader to one of the aforementioned papers.

When building a C4.5 tree there are several options to set, such as whether we wish to prune the tree or not. Furthermore we can enforce binary splits, set the minimum number of observations in final nodes and, if we do prune the tree, we can set the pruning threshold, which determines the aggressiveness of the pruning applied to the tree.

A few years after C4.5, Quinlan introduced a newer version called C5.0. Wu also mentions the changes from C4.5 to C5.0. Foremost, C5.0 introduces boosting. Boosting is a technique that creates an ensemble of learners: the learners are trained iteratively and the data points that are misclassified by the previous learner are given extra weight in training the next learner. In this way the bias of the final ensemble should be less than that of a single learner.

Another new feature of C5.0 is that it introduces the option to use rules instead of a tree structure. These rules are simple unordered if/else statements. The predicted class is calculated by weighted voting of the different rules, where the weight of a rule is given by the confidence of that rule.

A final new option is winnowing. If we suspect some features to hold very marginal predictive power, as could be the case when using hundreds or thousands of features, then we can invoke the winnowing procedure to select a subset of features that hold the most predictive power. Only this subset is then used to build the tree or construct the rules. A further explanation of the winnow procedure can be found in Littlestone [37], though from the C5.0 documentation [45] it is not clear whether this exact implementation is used.
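These options are exposed directly by the C50 package in R; the sketch below is a minimal illustration with placeholder settings rather than the tuned values used later.

```r
library(C50)

# x: data.frame of features, y: factor with the class labels (assumed to exist)
fit <- C5.0(x, y,
            trials  = 10,                         # boosting with 10 iterations
            rules   = FALSE,                      # set TRUE for a rule-based model instead of a tree
            control = C5.0Control(winnow = TRUE,  # pre-select features via winnowing
                                  CF = 0.25,      # pruning confidence factor
                                  minCases = 2))  # minimum observations per terminal node

pred <- predict(fit, newdata = x)
```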

3.4 Random Forest

Random forests were introduced by Breiman [9] a few years after his paper on bagging predictors [8]. He uses the concept of combining many weak learners into a strong ensemble classifier. In this ensemble the trees are trained on bootstrapped samples of the data, so that all (or at least most) trees are trained on a different training set. The decision trees in a random forest are simplified CART trees: at each split a random subset of the features is selected as candidates to split on. The class that is eventually predicted is determined by majority voting.

Random Forests have become very popular to use, not only due to their competitive per-formance on various problems, but also because of a few other important reasons, as Khalilia [33] argues in an empirical study comparing SVM and Random Forests on highly imbalanced (binary) datasets. The first is that Random Forests have no problems in handling datasets with missing data, and as we later on show, are very good at imputing them as well.

Secondly, a Random Forest can calculate an unbiased estimate of the prediction error, using the Out-of-Bag (OOB) performance. This OOB performance is the accuracy on the samples that were not in the bootstrapped training sample for that specific tree. By aggregating these error rates, the estimate of OOB performance is calculated. This OOB evaluation can also be used to assess variable importance: on a given sample we can permute one of the features and see what the impact is on the predictions. Then, using for example the loss of accuracy or the decrease in the Gini index, we can gain insight into which features have the most impact on the classification.

Lastly, the random forest has an interesting feature for working with unbalanced data. We can supply the random forest with class weights, where we could for example use the imbalance rate as weight vector. Chen [13] did a study using this method and concluded that it performed well for binary imbalanced data. It will be interesting to see if this generalizes to our specific setting.

Khalilia [33] also compares Random Forests to SVM in a binary imbalanced setting and concludes that they perform very well in this setting. In all of his 8 datasets Random Forests outperform SVM by varying, though not drastic, margins.

When building Random Forests there are two main parameters of interest that need tuning. The number of trees and the number of variables to consider at each split. Intuitively it is immediately clear that we should not set the number of trees too low. Breiman suggests that a forest should contain at least a thousand trees if you want a stable estimate of the variable importance. Oshiro [43] on the other hand suggests that the number of trees can be set in the 64-128 range, if classification performance is all you are concerned with. After that he finds no significant increase in performance.

This rule of thumb however assumes no other parameter tuning, as the number of trees needed to get the best performance also varies with the number of variables considered at each split. When this number is low, trees have low correlation and generally more trees are needed. On the contrary, when using more variables to split on, more correlation between the trees causes the number of trees to have a smaller effect on overall performance. We also see this behaviour during the performance tuning of the Random Forest in Paragraph 6.3.2.
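In the randomForest package these choices correspond to the ntree, mtry and classwt arguments; the sketch below shows the mapping with placeholder values, not the tuned settings of Paragraph 6.3.2.

```r
library(randomForest)

# x: data.frame of features, y: factor with the class labels (assumed to exist)
fit <- randomForest(x, y,
                    ntree      = 500,                      # number of trees (package default)
                    mtry       = floor(sqrt(ncol(x))),     # variables tried at each split
                    classwt    = as.numeric(1 / table(y)), # optional class weights against imbalance
                    importance = TRUE)                     # permutation variable importance

fit$err.rate[fit$ntree, "OOB"]   # out-of-bag error estimate
varImpPlot(fit)                  # which features matter most
```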

3.5 Logit

Logistic regression (logit) can be viewed as a more traditional model for classification. Its roots lie in the paper by Cox [15], where he introduces a way to do regression analysis on binary data. There have been numerous extensions to binary logistic regression. The method we use in this thesis is multinomial logistic regression via a penalized maximum likelihood model. For penalization we use the Lasso technique, as introduced by Tibshirani [58]. He argues that his method is superior to other regularization methods. It is more stable than feature subset selection. In comparison to Ridge regression, which shrinks the parameter values of correlated regressors, Lasso shrinks some parameters and sets others to zero, retaining the virtues of both methods.

In general, regularization can be seen as a trade-off between bias and variance. The main idea of penalizing the parameters and driving some of them to zero is to keep only the most important parameters and thus reduce the sensitivity of the classifier to noise in the training data. Introducing the penalty term may reduce the performance of the classifier on the training set, but it should improve generalization performance.

Using the notation introduced at the start of this chapter, the objective function to be minimized for the stated problem is

-\sum_{i=1}^{N} \sum_{k=1}^{K} t_{ik} \ln\!\left( \frac{\exp(w_k^T x_i)}{\sum_j \exp(w_j^T x_i)} \right) + \lambda \sum_{h=1}^{H} \|w_{*h}\|_1, \qquad (3.11)

where t_{ik} = 1\{t_i = k\}, so that the first part of the function corresponds to the multi-class cross-entropy loss. The second part is the penalty term. In the binary case we get one set of weights; in the multi-class setting we get K sets of weights. Say we stack all 1 × H weight vectors in a matrix W, then in the above w_j refers to the j-th row of W, while w_{*h} refers to the h-th column of W. Lastly we introduce the value \lambda in the function, which corresponds to the size of the penalty. Tuning the value of \lambda will be crucial for getting the best performance out of this classification method.
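In R this corresponds to a multinomial glmnet fit with alpha = 1 (the pure Lasso penalty), where λ is usually tuned by cross-validation. The sketch below assumes glmnet is the implementation of choice and uses placeholder settings.

```r
library(glmnet)

# x: numeric model matrix (N x H), y: factor with K class labels (assumed to exist)
cvfit <- cv.glmnet(x, y, family = "multinomial", alpha = 1,
                   type.measure = "class")  # pick lambda by cross-validated error rate

pred <- predict(cvfit, newx = x, s = "lambda.min", type = "class")
```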

This concludes the overview of the classifiers that we will compare in our experiments on partly structured, multi-class imbalanced data.


Chapter 4

Data

In this chapter we introduce the datasets used in this thesis. As mentioned in the introduction, one of them contains accounting data and the other is on the topic of wines. For both we list some descriptive information and discuss the imbalance in the classes. We also show the pattern of missing data and consider the feature engineering process.

4.1 Sparkholder Accounting Data

Sparkholder has a huge database containing information on the bookkeepings of many SMEs. These have been submitted by the companies themselves in order to gain insight into their finances and to see what the possibilities for extra company funding are. For this reason Sparkholder needs to be able to classify the accounts in a bookkeeping. Of the 171K observations, the ones that have manually checked labels have been selected and put into the dataset. The part of the dataset usable for supervised learning consists of 22,216 observations.

The features we use can roughly be split up into two groups. The first are account-specific features, such as the general ledger code, the description and the number of bookings in the account. The second group are company-specific features that describe some general information about the company, which we will further explain below.

4.1.1 Data description

Each observation is made up of a text field, where users put a description of the account, accompanied by some other features. In Figure 4.1 we present a few typical examples of observations.

Figure 4.1: Sparkholder data example bookings

First we see the externalCode, which corresponds to the general ledger code of the account. Next to it is the description given by the user. The columns branch and legal contain ids indicating what sector the company is active in and what type of legal entity the company has, with 16 and 15 levels respectively. The Balance column is the sum of all bookings on that account number, and type (4 levels) is the bookkeeping system used by the user. bp and dc denote whether the users themselves thought the account should be on the Balance or Profit/Loss (bp) and whether it is a debit or credit account (dc). #Credit and #Debit indicate the number of credit and debit bookings in the account, and finally isRGS is a binary variable that is 1 if the account was made using the RGS (“Referentie GrootboekSchema”) accounting standard.

4.1.2 Class Imbalance

We can describe the manual classification of the classes by a tree. The first level of the tree splits balance from profit/loss bookings; the second level is again binary and splits on debit and credit bookings. The third level is a combination of the four combinations above. Things start to get interesting at the fourth level of the tree, where we find 34 different classes. Classifying on the fourth level goes a long way in providing a correct balance sheet for a bookkeeping, but there is an even deeper level of the tree consisting of 161 classes. However, since the fourth level is sufficient to predict the balance sheet, we focus on this level and ignore the final level.

With this many classes, there is bound to be some class imbalance. Figure 4.2 displays the number of observations per class.

Put in numbers: the four majority classes make up more than a third of the observations, with 3258, 1940, 1375 and 1345 observations respectively. In contrast, the smallest classes contain only 33, 48 and 50 observations.

4.1.3 Missing Data

For three of the features introduced above we have missing data. We will give some more details on the missing data now and discuss how we are going to handle it in Section 5.2. The missing data is visualized in Figure 4.3. The top red bars show the fraction of the observations that have missing data for the corresponding feature. In the part below, the blue boxes indicate that the feature is available and the red boxes indicate that it is missing for the corresponding fraction of the observations. For example, in this case a fraction of 0.53 has no missing values, 0.29 has only dc missing, etc.

Figure 4.2: Class imbalance for Sparkholder Data

These three features have missing data because the user was not always obliged to fill out these fields. The percentages for the first two features seem reasonable, but more than a quarter of the observations have a missing value for “User DC”. This seems like a lot, but Van Buuren [59] explains that many leading authors in the field are wary of giving percentages up to which using variables with missing data is fine: the usability depends on a lot more than the percentage missing.

4.1.4 Feature Engineering

The Sparkholder dataset bundles a lot of information in just a few columns of data and a short text field. To extract the best predictive performance for our models, a lot of feature engineering has been done. We will shortly describe the steps that have been taken for both the textual and the non-textual features below.

Textual Features

The descriptions in the externalDescription column of our dataset usually contain only a few words. Often these descriptions are highly informative on how the observation should be classified. To feed all this information to our models we do the following:

First we replace all special vowels, like ‘ä’ or ‘é’, by their regular version. This is done because, for example, “Privé onttrekkingen” should be interpreted the same as “Prive onttrekkingen”. This also ensures we do not get unnecessarily many terms in our DTM later on.

Next we split off common endings of street names, so that for example “Roetersstraat 41” will be changed to “Roeters straat 41”. This helps the model recognize that the booking might have something to do with, for example, rent or mortgage payments. In this way street names that the model has never seen can still easily be classified.

Continuing on the previous example: the fact that there are numbers in the description is informative. This indicates that the description may contain a street address, a bank account number or maybe the licence plate of a company car. However, most of the time the actual numbers themselves are not of much value. There are only a few cases where the numbers are of interest, and that is when they are VAT percentages. So for all descriptions we count the number of numerical characters, and after that remove the numbers from the descriptions, unless they are a VAT percentage.

The VAT percentage nicely introduces us to the next processing step. When we remove the punctuation in a later step and replace it by whitespace, abbreviations like “B.t.w. r/c” would get reduced to “B t w r c”. We select our DTM not to contain words shorter than two letters, so the above description would then be empty. More importantly, we would like “Btw” to be mapped to the same word in our DTM. Hence, for the 25 most common abbreviations, we pattern match the punctuated versions to the non-punctuated version (also accounting for capital letters, of course) and keep the latter.
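As an illustration, this punctuation-aware mapping can be done with a small lookup table of regular expressions; the two patterns below are hypothetical examples, not the actual list of 25 abbreviations.

```r
# Hypothetical excerpt of the abbreviation lookup: punctuated pattern -> normalised term
abbrev <- c("\\bb\\.?t\\.?w\\.?" = "btw",  # "B.t.w.", "btw." -> "btw"
            "\\br/c\\b"          = "rc")   # "r/c" -> "rc"

normalise_abbrev <- function(x) {
  for (p in names(abbrev)) x <- gsub(p, abbrev[[p]], x, ignore.case = TRUE)
  x
}

normalise_abbrev("B.t.w. r/c")   # returns "btw rc"
```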

These descriptions are all transformed to lower case, the numbers are removed, and the words are put in a VCorpus, using the tm package in R. We set the lower limit of the word lengths in our dictionary at two, to also include various abbreviations. Then, finally, we make a DTM, keeping only terms that appear more than 5 times. This DTM is now ready to be used as input for our classifiers.
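Put together, the construction of this DTM with tm could look roughly like the sketch below. The exact control settings are not given in the text, so the values shown simply mirror the description (minimum word length of two, terms occurring in at least six descriptions).

```r
library(tm)

# descriptions: character vector of cleaned account descriptions (assumed to exist)
corpus <- VCorpus(VectorSource(descriptions))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)

dtm <- DocumentTermMatrix(corpus, control = list(
  wordLengths = c(2, Inf),               # keep words of at least two characters
  bounds      = list(global = c(6, Inf)) # keep terms occurring in at least six descriptions
))
```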

We are not finished with the textual features yet. Before having shortened the words in the description, we use another copy of the description to extract some more features. Along with the counting of numerical values, which we already mentioned, there are some more characters of interest to count. A licence plate, for example, always has two hyphens and is almost always written in capital letters. IBAN account numbers are most of the time written with a fixed number of dots and have the bank name in capital letters. Percentage signs often appear alongside interest rates or VAT numbers, and finally we also count the length of the words in the description.

In conclusion, we use the following lexical features beside the DTM (Table 4.1):

Table 4.1: Lexical features based on description
- Number of dots (“.”)
- Number of hyphens (“-”)
- Number of %’s
- Number of capital letters
- True/false all letters capital
- Number of numbers

Non-textual Features

Now let us move on to the non-textual features. For the most part, this is just a matter of converting the integers or strings to categorical variables. There are two cases where we take a different approach.

The externalCode variable is the general ledger account code, which usually lies between 0001 and 9999. There are no strict rules for using certain codes for certain types of bookings, but there are some generally accepted good practices. For instance, most accountants use codes starting with 0-3 for balance accounts and 4-9 for profit/loss accounts. The 8000 range is mostly used for turnover, and for example 4000-4200 is used for personnel costs. Clearly the codes should be converted to categorical variables, but in general the last digit of the code does not hold extra information. There are also cases where companies use ascending codes, for example 8001-8009, but then continue with 80010, 80011, etc., so a little care must be taken in selecting the right numbers. Another anomaly is that some bookings are coded with 0010 and others with just 10, but they should mean the same.

Taking all these considerations into account, we end up casting the integers to strings. Then we take the first three characters of the string and convert to categorical. In this way we already reduce the number of levels from almost 4000 to around 850. From a bookkeeping point of view we can reduce this number even further: as we mentioned a few lines above, a small range of codes is often used for the same type of bookings. We divide the range of 001 to 999 into 50 parts of 20 subsequent general ledger codes. Preliminary tests show that this aggregation indeed does not hurt the classification performance much.

Lastly we perform some processing on the data in the Balance, #Credit and #Debit columns. First we create a feature that saves the sign of the balance. Secondly, we group the balance and the number of debit/credit bookings using the empirical quantiles of the data. Per column, we make 10 quantiles of all bookings, excluding accounts with zero balance, and record the quantile each value falls into in yet another categorical feature, as sketched below.
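A compact sketch of this binning step is given below; data and Balance are placeholder names for the Sparkholder data frame and its balance column.

```r
# Turn a numeric column into a 10-level categorical feature using empirical deciles.
# Quantile boundaries are computed on the non-zero values, as described in the text;
# values outside that range (e.g. exact zeros) become NA and can be given their own level.
bin_by_quantile <- function(x, n = 10) {
  breaks <- quantile(x[x != 0], probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)
  cut(x, breaks = unique(breaks), include.lowest = TRUE)
}

data$balance_sign   <- factor(sign(data$Balance))     # sign of the balance
data$balance_decile <- bin_by_quantile(data$Balance)  # decile of the balance
```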

4.2 Kaggle.com Wine Reviews

The second dataset we use is from a kaggle.com competition and contains wine reviews. The data contains 280K instances of wine reviews accompanied by some other features of the wine, like the country and region of origin and the price and score that were given to the wine. To keep the comparison between classifiers on the two datasets independent of the sample size, we select just over 20K instances.

This selection is non-random for two reasons. The categorical variables in the full dataset contain a huge number of levels, and in our subset we select groups of instances based on the appearance of a maximum number of levels for some of the categorical variables. The first reason for this is to mimic the number of levels in the categorical variables of the Sparkholder dataset. Secondly, a few features had hundreds or even thousands of levels, and some of the classifiers cannot handle variables with that many levels (well).

In Table 4.2 the categorical variables and the number of levels in the full set and in the subset are displayed.

Table 4.2: Number of levels of the categorical variables in the full dataset and in the subset

Variable      Full data (280K)   Subset (20K)
Country                     50              6
Designation              47239           4677
Province                   490             18
Region 1                  1332             50
Region 2                    18             13
Winery                   19186            295
Variety                    756             34

Clearly the designation and winery variables will still pose a problem for some algorithms. We will describe how we handle this problem in the next chapter. First we will discuss some other aspects of the dataset.

4.2.1 Data description

Just like the Sparkholder data, the wine data contains a few columns of structured data and a text field. The text fields in the wine reviews are made up of a few sentences instead of only a few words, as can be seen below in a few typical data points of the wine dataset.

Figure 4.4: Typical observations of wine reviews dataset

The country and description columns are self-explanatory. The designation contains information on the vineyard the wine is from. The next two columns hold numerical data: the first is the number of points awarded to the wine by the taster, and the second is the price of the bottle. The next three contain geographical information on the origin of the wine. The variety of the wine is the variable we try to predict. The final column contains the winery the wine is from.

4.2.2 Class Imbalance

The target variable in the wine dataset, i.e. the variety, is heavily imbalanced. The three most frequent classes account for just over 9K of the 20K observations, whereas the smallest classes have only 54, 57, 64 and 72 observations. The distribution of the classes is summarized in Figure 4.5.


Figure 4.5: Class imbalance for wine reviews data

This dataset was chosen for its similarity to the other dataset in terms of mixed feature types and the number of classes to predict. Here we see an important difference, however. We already mentioned that there are many different ways for a multi-class dataset to be imbalanced. The Sparkholder imbalance from Figure 4.2 can roughly be described as one majority class, followed by a group of around 16 mid-size classes, with the remaining classes being minority classes. In contrast, the imbalance displayed in Figure 4.5 shows three clear majority classes, then seven mid-size classes, and again the remaining classes are minority classes.

We do not assume that all classifiers handle these different types of imbalance equally well, which makes this second dataset a good addition to the experiments we carry out.

4.2.3 Missing Data

Only three of the features have missing values: the designation, region2 and the price. We visualize the missing data in the same way as for the other dataset in the graph below.

The missing data on the region2 feature can be explained for the most part by the fact that this feature is only available for US wines, which are by far the most common in our dataset. To that end we introduce a new level for the feature: we set region2 to “Non-US” for all observations for which the country is not the US. Then only 2.5% of the values are missing instead of just over 15%. This also helps to avoid impossible combinations after imputation of missing values. Figure 4.6 displays the missing data distribution for this dataset. The graph should be read in the same way as described in the previous section.
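A short R sketch of this recoding (the data frame and column names are illustrative):

```r
# Sketch: add a "Non-US" level for region_2 before imputation.
wine$region_2 <- as.character(wine$region_2)
wine$region_2[which(wine$country != "US")] <- "Non-US"
wine$region_2 <- factor(wine$region_2)
```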


Figure 4.6: Missing data distribution for wine reviews data

4.2.4 Feature Engineering

The feature engineering process for the wine reviews is less elaborate than for the Sparkholder data. As for the Sparkholder data, we transform the points and price from numerical to categorical using empirical quantiles. We also count the number of percentage signs, as they can indicate that the wine is a blend. We no longer count dots and hyphens, however, as we do not consider them relevant features for this dataset; they would only introduce noise.
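A minimal R sketch of these two steps; the data frame and column names (description, points, price) are assumed to match the data description above.

```r
# Sketch: count percentage signs as a blend indicator and bin points/price by deciles.
decile_bin <- function(x) {
  cut(x, breaks = unique(quantile(x, probs = seq(0, 1, 0.1), na.rm = TRUE)),
      include.lowest = TRUE)
}

wine$n_percent  <- vapply(gregexpr("%", wine$description, fixed = TRUE),
                          function(m) sum(m > 0), integer(1))
wine$points_bin <- decile_bin(wine$points)
wine$price_bin  <- decile_bin(wine$price)
```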

For the textual features we do some cleaning of the texts: we remove punctuation, convert capital letters to lower case and replace special characters. This time we do not remove numbers, as for example the year a wine is from may have some explanatory power.
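A sketch of such a cleaning step in R; the exact regular expressions and the transliteration of special characters are assumptions for illustration.

```r
# Sketch: lower-case, transliterate special characters, strip punctuation but keep digits.
clean_text <- function(x) {
  x <- iconv(x, to = "ASCII//TRANSLIT", sub = " ")  # replace special characters
  x <- tolower(x)
  x <- gsub("[^a-z0-9 ]", " ", x)                   # remove punctuation, keep letters and digits
  gsub("\\s+", " ", trimws(x))                      # collapse repeated whitespace
}
wine$description_clean <- clean_text(wine$description)
```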

When building the DTM while keeping all terms, we obtain a matrix that is over 13,000 columns wide, which is clearly not desirable. We use tf-idf weighting to select the most discriminating words and keep only those in our DTM. If we set the tf-idf threshold to 0.55, we are left with the 1749 most discriminating words according to the tf-idf procedure. This DTM can be used as input for the classifiers.
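A sketch of how such a tf-idf based term selection can be carried out with the tm and slam R packages; the scoring rule (mean tf-idf over the documents containing a term) is an assumption, while the 0.55 threshold follows the text, and the cleaned descriptions from the step above are assumed to be available.

```r
# Sketch: build a DTM and keep terms whose mean tf-idf exceeds a threshold.
library(tm)
library(slam)

corpus <- VCorpus(VectorSource(wine$description_clean))
dtm    <- DocumentTermMatrix(corpus)

term_tfidf <- tapply(dtm$v / row_sums(dtm)[dtm$i], dtm$j, mean) *
  log2(nDocs(dtm) / col_sums(dtm > 0))
dtm_reduced <- dtm[, term_tfidf >= 0.55]
dim(dtm_reduced)   # documents x retained terms
```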


Chapter 5

Methodology

In this chapter we discuss our approach to evaluating classifier performance and describe how we handle missing data. Then we explain the setup of the experiments.

5.1 Performance Measures

For classification the most intuitive performance measure is accuracy. This is just the fraction of the test set that was classified correctly by the model used. As we already mentioned in the Introduction, accuracy can sometimes be a misleading performance measure. For binary classification several other measures have been introduced. Precision and recall are defined in terms of true/false positives (TP and FP) and true/false negatives (TN and FN). The formulas for both of them read:

Precision = TP / (TP + FP),    Recall = TP / (TP + FN).    (5.1)

This means that we can interpret precision as the fraction of instances predicted as “Positive” that are actually of the “Positive” class. Similarly, recall is the fraction of instances that are actually of the “Positive” class and are correctly predicted as “Positive”. The F1 measure combines both measures into a single statistic and is calculated as:

F1 = 2 · Precision · Recall / (Precision + Recall),    (5.2)

i.e. it is the harmonic mean of precision and recall.

The F1 measure can be extended for use with K classes (K > 2). We then obtain an F1 score per class: for class k, the “Positive” case is class k and the “Negative” case consists of the other K − 1 classes. Schutze [49] shows that there are two different ways to aggregate the K measures into a single statistic. The first is taking the so-called microaverage, where the scores are weighted by the relative class frequency. The second is taking the macroaverage, where just the simple mean is taken. Schutze argues that, because the F1 measure does not take true negatives into account, performance on large classes influences the microaverage too much in comparison to performance on the smaller classes. The macroaverage on the other hand puts equal weight on all classes present in the data. Wiener [64] expresses the same concerns about this difference between the two ways of averaging. The author of this thesis agrees with this view, so for the comparison of classifier performance we will use the macroaverage F1 score. Ferri [25] notes that using the macroaverage of any statistic might not be a good idea if there are very few examples of a class. However, all classes in our data contain at least 30 to 50 observations, hence this is not a problem. Predictive accuracy is also provided in the results, for comparative purposes.
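For completeness, a small R sketch of the macroaverage F1 computation; the convention of setting an undefined per-class F1 to zero is an assumption of the example.

```r
# Sketch: macro-averaged F1 for a multi-class problem, following (5.1)-(5.2) per class.
macro_f1 <- function(truth, pred) {
  per_class <- sapply(levels(factor(truth)), function(k) {
    tp <- sum(pred == k & truth == k)
    fp <- sum(pred == k & truth != k)
    fn <- sum(pred != k & truth == k)
    prec <- tp / (tp + fp)
    rec  <- tp / (tp + fn)
    if (!is.finite(prec) || !is.finite(rec) || prec + rec == 0) return(0)
    2 * prec * rec / (prec + rec)
  })
  mean(per_class)   # unweighted mean over all K classes
}
```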

5.2 Missing Data

There are several possibilities for handling missing data. The first is to simply leave out observations that have one or more features missing. This is obviously not what we are looking for, as more, and more diverse, data to train our models usually yields better results. On top of that, Allison [3] notes that deletion of data is only valid if the data is missing completely at random, which is a rather strong assumption.

A better solution is to impute the missing values. Given that the features to be imputed are categorical, there are a few options to choose from. Probably the most popular and well-known option is MICE, developed by Van Buuren [60]. MICE has a slight disadvantage, however: by design we have to make assumptions about the distribution of the data, or about the existence of some multivariate distribution over the data, as is the case with many other imputation methods.

That is why we take another route. Stekhoven [53] introduces MissForest, a non-parametric method to impute missing values of mixed data types, just like MICE. Apart from not having to make any assumptions about the data, Stekhoven argues that MissForest has even more advantages over MICE, which we elaborate on below.

MissForest, as the name suggests, is based on Random Forests and as such easily allows for non-linear and interaction effects between the variables, as opposed to MICE. Another advantage gained from the RF implementation is that the method is able to provide out-of-bag (OOB) error rates without the need for a test set. Stekhoven [53] shows that these OOB error rates perform well in practice and as such give a good indication of the quality of the imputation. He also shows that the performance of MissForest is always at least on par with other imputation methods and in some cases significantly better, while also being attractive in terms of computation time. When using solely non-textual features to impute, we obtain a PFC (proportion of falsely classified) OOB error rate of 0.0918 for the Sparkholder data. Including textual features in the imputation slows down the computation a little, but leads to a drastically improved PFC rate of only 0.0012, so we clearly use the second set of imputed values for our further analysis.

For the wine reviews we impute region2 and price using MissForest. In this case we obtain a PFC rate of 0.0084 for region2 and an NRMSE (normalized root MSE) of 0.1874 for price.
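A sketch of how the imputation can be run with the missForest R package; the column selection and the seed are illustrative, and $OOBerror is where the PFC/NRMSE estimates reported above come from.

```r
# Sketch: impute mixed-type missing values with missForest and inspect the OOB error.
library(missForest)

set.seed(1)
to_impute <- wine[, c("country", "province", "region_1", "region_2", "points", "price")]
to_impute[] <- lapply(to_impute, function(x) if (is.character(x)) factor(x) else x)

imp <- missForest(to_impute, ntree = 100)
wine_imputed <- imp$ximp   # completed data frame
imp$OOBerror               # NRMSE for continuous, PFC for categorical variables
```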


Chapter 6

Experiments

In this chapter we explain the experiments that we carry out in further detail. We show the results and discuss them briefly. The setup is as follows. First we perform an experiment in which we select the textual features to be used as input for the classifiers. Next we investigate the use of dimension reduction techniques for this kind of classification problem. The third experiment compares all base classifiers against each other. The final experiment determines the effect of cost-sensitive learning.

6.1 Experiment 1: Selecting Textual Features

In the previous parts we already described how we obtained document-term matrices for both our datasets. In this part we compare them to two more sophisticated methods of obtaining textual features and select which kind of textual features we will use as input to the classification algorithms. When selecting the textual features we look at both their performance and the number of dimensions needed to achieve that performance.

6.1.1 LDA

When selecting the number of topics for our LDA, we consider the performance measures mentioned in Paragraph 2.3.1 and pick the number of topics such that we satisfy as many of these measures as possible. If the measures are inconsistent with each other, we choose by observing which number of topics works best as input for classification.
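The four measures in question (Griffiths, Cao, Arun and Deveaud) are, for example, available in the R package ldatuning; the sketch below shows how such a grid of topic numbers could be scored. The package choice and settings are assumptions for illustration, not a record of what was actually run for this thesis.

```r
# Sketch: score a grid of candidate topic numbers with four LDA selection measures.
library(ldatuning)

scores <- FindTopicsNumber(
  dtm_reduced,                                  # a document-term matrix as built earlier
  topics  = seq(10, 130, by = 10),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 1)
)
FindTopicsNumber_plot(scores)                   # minima/maxima suggest candidate topic numbers
```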

To that end we use a neural network with 300 hidden nodes, a fixed random seed for the initialization of the weights, and 5-fold stratified cross-validation to compare the performance of LDA: firstly as an enrichment of the BOW, to provide extra semantic features, and secondly as a dimension reduction technique to replace the BOW.

A neural network was used as it is flexible with respect to its input, and its training and testing times are a lot shorter than those of, for example, an SVM or a Random Forest.

Using standard LDA on the Sparkholder data, the preliminary results were not very usable. This is however not surprising, given that the texts are so short. That is why we utilized a variant of Hong's [30] author topic model. In our case we did not have an author per observation, but we aggregated the texts in a different way: by their externalCode. The resulting topic model already showed an improvement over the non-aggregated topic model, but we needed to do some further tweaking.

The distribution of posterior topic probabilities for texts of only two or three words got pretty close to a uniform distribution: all topics received some probability, even if none of their words were observed. For larger texts this is not a problem, as the distribution diverges away from uniform once more words of a topic are observed. To simulate the same behaviour for our short texts, we implemented an ad hoc solution, where we set all randomly assigned topic probabilities to zero and then normalize the remaining probabilities to sum to one.
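One way to implement this renormalization is sketched below; the exact cutoff for "randomly assigned" probabilities is not specified in the text, so a cutoff slightly above the uniform level 1/K is an assumption of the example.

```r
# Sketch: zero out near-uniform (prior-only) topic probabilities and renormalize rows.
sharpen_theta <- function(theta, eps = 1e-3) {
  cutoff <- 1 / ncol(theta) + eps        # slightly above the uniform level 1/K
  theta[theta <= cutoff] <- 0
  rs <- rowSums(theta)
  rs[rs == 0] <- 1                       # guard: leave all-zero rows untouched
  theta / rs
}

# theta: documents x topics matrix of posterior probabilities,
# e.g. topicmodels::posterior(lda_model)$topics
```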

When trying to find the optimal number of topics for the Sparkholder data we get the following picture:

Figure 6.1: LDA performance measures for Sparkholder data

The first thing we notice is that Griffiths' measure has no clear maximum, but it does seem to suggest that a large number of topics works better for this data. Combining this with Cao's measure, we would pick 90 topics. In contrast, Arun's and Deveaud's measures attain their respective minimum and maximum at almost the same point, around 40 topics.

For the final decision of the number of topics, we compared the performance of both LDA topic models as input for a neural network and got the following results:

           DTM    40 topics  90 topics  DTM + 40  DTM + 90
F1-score   0.730  0.535      0.663      0.740     0.741
Accuracy   0.837  0.685      0.786      0.840     0.836

Table 6.1: Classification performance on Sparkholder data using LDA features

Clearly a lot of information is lost when not using the DTM along with the LDA topics. This confirms once more that LDA is not really suited for short texts. We do note however that when using LDA together with the DTM, we slightly increase the F1-score and accuracy when using 40 topics. Taking only performance without the DTM into account, the topic model with 90 topics is clearly better suited as classification input.

On the wine data the aforementioned performance measures behave even worse than on the Sparkholder data. Only one measure displays an extremum that is not on the border.

Figure 6.2: LDA performance measures for wine reviews data

Deveaud's measure again favours a low number of topics, indicating an optimum of 15 topics, while the other measures all point to a larger number. Because the measures are inconclusive, we again choose the number of topics by looking at which one works best as input for the classifier. We test 15, 40, 80 and 130 topics. On this data the LDA modelling works a lot better and we obtain better results than with the DTM.

           DTM    15     40     80     130    15+DTM  40+DTM  80+DTM  130+DTM
F1-score   0.351  0.391  0.395  0.394  0.404  0.404   0.423   0.429   0.423
Accuracy   0.451  0.547  0.558  0.563  0.572  0.532   0.566   0.568   0.572

As was the case above, the low number of topics suggested by Deveaud's measure does not provide good results. The other measures all indicate a larger number of topics and our results support that. We settle on 130 topics.

6.1.2 Word2Vec

In Paragraph 2.3.1 we wrote about the use of NLP tasks in evaluating a word2vec model. However, this does not make much sense for our current problem. For example, when choosing the length of our word vectors, a quick check on the Sparkholder data reveals that all vector lengths provide good performance when asking for words similar to "BMW": we get a list of mostly other car brands, just like Zhang [68] gets similar results with different vector lengths. This does not help us select the right length for the word vectors, firstly because it is very hard to devise a good test for the best word embeddings and secondly because our main goal is to provide good textual features for the classification and not necessarily to do well at other NLP tasks.

That is why we choose to test word vectors of different lengths as input for classification. The method here is the same as the one we described above for LDA, using a neural network and 5-fold cross-validation.
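A sketch of one such configuration using the word2vec R package; the package choice is an assumption (the text does not state which implementation was used), texts is assumed to be a character vector of cleaned descriptions, and document features are obtained here by averaging the word vectors of each text.

```r
# Sketch: train skip-gram vectors of a given dimension/window and average them per text.
library(word2vec)

model <- word2vec(x = texts, type = "skip-gram",
                  dim = 25, window = 2, iter = 10, min_count = 2)
emb <- as.matrix(model)                               # vocabulary x dim embedding matrix

doc_vector <- function(text, emb) {
  words <- intersect(strsplit(text, " ")[[1]], rownames(emb))
  if (length(words) == 0) return(numeric(ncol(emb)))
  colMeans(emb[words, , drop = FALSE])
}
w2v_features <- t(vapply(texts, doc_vector, numeric(ncol(emb)), emb = emb))
```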

The graph below shows the classification performance when using the word vectors with and without the DTM. We measure performance by accuracy and F1-score.

The dashed line indicates the performance using only the DTM as features for the words in the descriptions.


Figure 6.3: Classification performance using w2v features on Sparkholder data

We can clearly see that in all cases adding the word vectors on top of the DTM improves performance. If we go for pure performance, word vectors of length 25 with a window width of 2 on top of the DTM seem to work best. The difference is not very pronounced though, as all vector lengths perform within around 1% of each other on both performance measures. We also note that combining the word2vec features with the LDA features yielded better performance than using only the word2vec features.

If our main purpose were dimension reduction, and thus the DTM would not be used as input at all, then the graphs on the right indicate that using word vectors of length 150 with a window width of 4 is the way to go. The obtained F1-score is still higher than when using only the DTM.

We note however that the scale of these graphs is relatively small, and that even using 25-dimensional vectors already gives a pretty good result, while reducing the dimension of the word features from over 1100 to 25. This corresponds to a reduction in dimension by a factor of about 45, which is obviously huge.

For the wine reviews we see a similar picture. The DTM-only performance was relatively poor, at an F1-score of 0.35 and an accuracy of 0.45, so we excluded the dashed line with the DTM performance from the corresponding graph.
