A comparison between Support Vector Machines and Logistic Regression
based on prediction error rates
June 27th, 2014
Casper Burik (10001420)
Thesis Supervisor: Dr. N.P.A. van Giersbergen
Programme: Econometrie en Operationele Research
Track: Econometrie
Field: Big Data
Abstract
Support Vector Machines (SVMs) and logistic regression are compared on their prediction error rates. This is done for two datasets, with the size of the training set varied. Logistic regression performed better in the dataset containing many dummy variables. In the dataset with only continuous variables, SVMs perform better than logistic regression for the larger training set sizes.
1. Introduction
Nowadays computers process more data than ever before: most economic transactions pass through computers, for example, and Google alone receives 100 billion search queries per month. These large sets of data, often called big data, can be processed, manipulated and analysed. Conventional statistical techniques may work well in this situation, but there are new and different techniques that may perform better when working with big data (Varian, 2013, pp. 1-3). Not only econometricians concern themselves with analysing economic data, but also computer scientists. In particular, the field of machine learning concerns itself with prediction, and it has produced many techniques that can be used in econometrics and economics (Varian, 2013, p. 5).
An example of a technique developed by computer scientists specialized in machine learning is the Support Vector Machine (SVM), a classification method developed in the 1990s that has gained popularity ever since. SVMs perform well under different settings and are considered among the best out-of-sample classifiers (James et al., 2013, p. 337). Several studies have compared SVMs with other classification methods. Lee et al. (2004) conducted a very extensive study, comparing 21 classification methods, including logistic regression and SVMs, on seven different datasets involving gene selection; the SVM was among the best methods. SVMs have also been compared to other classification methods using economic data. For example, Min and Lee (2005) compared SVMs to three other classification methods by predicting bankruptcy for firms in Korea; the SVM had the best prediction accuracy in that study.
The aim of this paper is to compare the prediction strength of SVMs to the more conventional logistic regression, as it is one of the most widespread classification methods in econometrics. The two techniques will be compared on their prediction error rate for two different data sets.
In section two the theoretical background of the two techniques will be explained. This section also contains a paragraph on former research that has been done on this subject. The third section will explain the method of comparing the two models. It will also give a description of the data that is used in this study. In the fourth section the results will be discussed. In the fifth and final section a conclusion will be drawn.
2. Theory
This section contains a paragraph on the theory behind the two techniques and a paragraph on former research that has been done on comparing Support Vector Machines with logistic regression.
2.1. Support Vector Machines and Logistic Regression
The Support Vector Machine is a classification method developed in the 1990s that has gained popularity ever since. The idea behind SVMs is to create a boundary between two classes; in the case of a linear estimation this is a line, a plane or a hyperplane, depending on the number of explanatory variables. A binary response is predicted according to which side of the boundary each data point lies on. SVMs also deal with non-linear boundaries quite easily (James et al., 2013, p. 337), and it is possible to extend SVMs to multinomial prediction (James et al., 2013, p. 355).
The mathematics behind SVMs can be quite difficult and some of it goes beyond the scope of this thesis. The general idea, as summarized by James et al. (2013, pp. 337-355), is:

$$\max_{\beta_0, \beta_1, \dots, \beta_p,\; \epsilon_1, \dots, \epsilon_n,\; M} M$$

subject to:

$$y_i \cdot f(x_i \mid \beta_0, \beta_1, \dots, \beta_p) \geq M(1 - \epsilon_i), \qquad \sum_{i=1}^{n} \epsilon_i \leq C, \qquad \epsilon_i \geq 0, \qquad \sum_{j=1}^{p} \beta_j^2 = 1$$
where $y_i$ is a binary variable that is either -1 or 1, and $f(x_i \mid \beta_0, \beta_1, \dots, \beta_p)$ is a function representing the boundary between the two classes. $M$ is a margin around the boundary. $\epsilon_i$ is a slack variable that allows an observation to lie on the wrong side of the margin or even on the wrong side of the boundary: $\epsilon_i$ is bigger than 0 if the observation is on the wrong side of the margin, and bigger than 1 if the observation is on the wrong side of the boundary. $C$ is a tuning parameter, usually called the cost parameter, which bounds the number of observations that may lie on the wrong side of the margin or even of the boundary. When $C$ is 0, all observations have to lie on the right side of the margin; this usually results in a narrow margin and is only possible if the two classes are separable (James et al., 2013, p. 347). The bigger the value of $C$, the more violations of the margin are allowed. The best value of $C$ is often found via cross-validation. For out-of-sample predictions, if the value of $f(x_i \mid \beta_0, \beta_1, \dots, \beta_p)$ is bigger than 0 the prediction is 1; if it is smaller than 0 the prediction is -1. As it turns out, the values of $\beta_0, \beta_1, \dots, \beta_p$ depend only on the observations that lie on the margin or on the wrong side of the margin, and not on the other observations. Those observations are called the support vectors.
An example of a linear function where each observation is on the correct side of the boundary can be seen in figure 1. The SVM with a linear boundary has a boundary function of the form (James et al., 2013, p. 346):
$$f(x_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$$
An example of a polynomial boundary and a radial boundary can be seen in figure 2.
The boundary function of a polynomial boundary of degree $m$ has the following form (James et al., 2013, p. 350):

$$f(x_i) = \beta_0 + \sum_{j=1}^{p} \sum_{k=1}^{m} \beta_{jk} x_{ij}^k$$

The radial boundary function has the form (James et al., 2013, p. 352):
$$f(x_i) = \beta_0 + \sum_{i'=1}^{n} \beta_{i'} \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right)$$
The polynomial and radial boundaries have additional parameters ($m$ and $\gamma$ respectively in the equations above). Those can also be chosen via k-fold cross-validation.
Figure 1: An example of two classes being separated by a linear boundary. Source: James et al. (2013, p. 348)
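The boundary functions above translate directly into code. The following Python sketch is illustrative only: the coefficients and support vectors are made up for the example, not estimated values from any model in this thesis. It evaluates a linear and a radial boundary function and classifies an observation by the sign of $f(x)$:

```python
import math

def linear_boundary(x, beta0, beta):
    # f(x) = beta_0 + sum_j beta_j * x_j
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def radial_boundary(x, beta0, betas, support_vectors, gamma):
    # f(x) = beta_0 + sum_i beta_i * exp(-gamma * ||x - x_i||^2)
    value = beta0
    for b_i, x_i in zip(betas, support_vectors):
        sq_dist = sum((xj - xij) ** 2 for xj, xij in zip(x, x_i))
        value += b_i * math.exp(-gamma * sq_dist)
    return value

def classify(f_value):
    # the SVM prediction rule: 1 if f(x) > 0, otherwise -1
    return 1 if f_value > 0 else -1
```

In a fitted SVM only the support vectors contribute non-zero coefficients to the radial sum, which is what makes prediction cheap even for large training sets.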
Logistic regression is a more conventional method than SVMs and can be found in most econometric textbooks, see for instance Heij et al. (2004). The idea behind it is to estimate a model for a binary variable based on the cumulative distribution function of the logistic distribution. The advantage of cumulative distribution functions is that they always lie between 0 and 1, making them very suitable for estimating probabilities. The logistic regression model is specified as:
$$P(y_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})}}$$

Here $P$ is the probability that the dependent variable equals 1. The values of $\beta_0, \dots, \beta_p$ are found via
maximum likelihood estimation (Heij et al., 2004, p. 447). For prediction a value of 1 is predicted if P is bigger than 0.5. If P is smaller than 0.5 a value of 0 is predicted.
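The prediction rule can be sketched in a few lines of Python. The coefficients passed in here are hypothetical placeholders, not estimates from the thesis:

```python
import math

def logistic_probability(x, beta0, beta):
    # P(y = 1) = 1 / (1 + exp(-(beta_0 + beta_1*x_1 + ... + beta_p*x_p)))
    z = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, beta0, beta):
    # predict 1 when the estimated probability exceeds 0.5, else 0
    return 1 if logistic_probability(x, beta0, beta) > 0.5 else 0
```

Since the logistic function is monotone, thresholding the probability at 0.5 is equivalent to thresholding the linear index $z$ at 0, which mirrors the sign-based rule of the SVM.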
2.2 Literature Overview
Several studies have compared SVMs to other classification techniques, including logistic regression, using data from different fields. Lee et al. (2004) conducted a very extensive study, comparing twenty-one classification techniques on seven gene expression datasets, using both a linear SVM model and a radial SVM model. The SVMs were among the best prediction models in that study: the linear SVM outperformed logistic regression in every dataset, and the radial SVM came close to the linear SVM, tying with logistic regression in one dataset and beating it in the rest. The difference in prediction error rate between SVMs and logistic regression varied considerably per dataset: in some datasets the results were comparable, in others there were large differences, ranging from the same prediction error up to a gap of 30 percentage points, where the SVM had a prediction error rate of only 10 per cent against 40 per cent for logistic regression (Lee et al., 2004, pp. 876-878).
Figure 2: On the left, an example of a polynomial boundary separating two classes; on the right, an example of a radial boundary.
Min and Lee (2005) compared SVMs to three other methods, one of which was logistic regression. For their comparison they used bankruptcy data of Korean firms. In their comparison they used a radial SVM model. The SVM outperformed the three other methods; logistic regression had the largest prediction error rate.
Boyacioglu, Kara and Baykan (2009) used eight different classification methods to predict bank financial failure rates for Turkish banks. They used four SVM models with different boundary functions; the polynomial model worked best in this instance, with a prediction error rate of 0.091 against 0.182 for logistic regression.
Other examples of studies using SVMs, logistic regression and other classification techniques for predicting bankruptcy include Wu, Tzeng, Goo and Fang (2007) and Min, Lee and Han (2006), both of which found results similar to the studies above. Examples of studies comparing SVMs and other classification techniques in different fields include Caruana and Niculescu-Mizil (2006), where SVMs outperformed logistic regression in most datasets, and Abu-Nimeh et al. (2007), where logistic regression was a better predictor than the SVM. This last study used the techniques to predict whether e-mails were phishing or not, with the word counts of different words as variables. Concluding from former research: in most cases SVMs perform better as classification predictors than logistic regression, so it may be expected that SVMs will also perform better than logistic regression in this thesis.
3. Method and Data
SVMs and logistic regression will be compared on their prediction error rate. This will be done for the two datasets discussed in section 3.2. Each dataset is split into a training set and a test set. Each model is estimated on the training set; using the estimated model, predictions are then made for the test set. From the test set data and the predictions, a prediction error rate is computed.
Following the example of Perlich, Provost and Simonoff (2003), who place a large emphasis on varying the size of the training set when comparing tree-based methods with logistic regression, the size of the training set will be varied. How well a technique performs may depend on the amount of data used; both techniques may be expected to perform better with larger training sets, as there is more data to build the model from. Varying the size of the training set has not been done in former research comparing SVMs to other classification techniques. For the first dataset the training set sizes are 500, 1000, 2000, 4000, 6000 and 8000; for the second dataset they are 500, 1000, 2000, 3000 and 4000. The training sets are randomly drawn from the total dataset. For each training set size, five random samples are drawn, and the methods are compared on the average prediction error rate over those five samples, in order to reduce variance.
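The comparison criterion itself is straightforward to compute. A minimal Python sketch, with hypothetical label vectors standing in for the actual test sets:

```python
def prediction_error_rate(y_true, y_pred):
    # share of test observations that were classified incorrectly
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def average_error_rate(samples):
    # samples: list of (y_true, y_pred) pairs, one per random draw
    rates = [prediction_error_rate(t, p) for t, p in samples]
    return sum(rates) / len(rates)
```

Averaging over the five draws smooths out the sampling noise of any single split, at the cost of five estimations per training set size.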
For the SVMs both a radial and a polynomial model are estimated. The R package e1071 is used for the estimation. The parameters are chosen via a grid search algorithm provided in the same package, which uses ten-fold cross-validation (James et al., 2013, p. 361). The logistic regression model is estimated as specified in section 2.1.
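The grid search that e1071's tuning routine performs can be sketched generically. In this Python sketch, `fit_and_error` is a hypothetical placeholder standing in for estimating a model on the training folds and measuring its prediction error rate on the held-out fold; the fold construction is deliberately simplified (contiguous folds rather than randomized ones):

```python
def k_fold_indices(n, k):
    # split the indices 0..n-1 into k contiguous folds
    fold_size = n // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n
        folds.append(list(range(start, end)))
    return folds

def grid_search(data, param_grid, fit_and_error, k=10):
    # param_grid: candidate parameter settings (e.g. values of C and gamma);
    # fit_and_error(train, test, params) is a hypothetical placeholder that
    # estimates a model on `train` and returns its error rate on `test`
    folds = k_fold_indices(len(data), k)
    best_params, best_error = None, float("inf")
    for params in param_grid:
        fold_errors = []
        for fold in folds:
            held_out = set(fold)
            test = [data[i] for i in fold]
            train = [data[i] for i in range(len(data)) if i not in held_out]
            fold_errors.append(fit_and_error(train, test, params))
        avg_error = sum(fold_errors) / len(fold_errors)
        if avg_error < best_error:
            best_params, best_error = params, avg_error
    return best_params, best_error
```

The parameter setting with the lowest cross-validated error is then used to fit the final model on the full training set.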
Two different datasets are used in this thesis. The first dataset, employment, contains data on 20675 individuals with detailed descriptions of their employment status and especially their opinion towards self-employment. The data were collected via a survey by The Gallup Organization (2007) for 25 EU countries, Iceland, Norway and the United States. The survey contained questions on demographics, employment status, occupation of parents, education and opinion towards self-employment; the full survey can be found in the paper of The Gallup Organization. In this paper the employment status is taken as the dependent variable, with two classes: self-employed or employee. This dataset has also been used by Block, Hoogerheide and Thurik (2011) to predict whether a person is self-employed or not. In order to create a binary variable, the unemployed are taken out of the dataset. The list of explanatory variables can be found in Table 7 in Attachment A; most of them are dummy variables. The resulting dataset contains 8216 people, of whom 1777 are self-employed.
The second dataset, wines, contains data on the physicochemical properties and quality of 4898 white wines. The dataset was created by Cortez et al. (2009). The data are used to predict whether the quality of the wine is above average or not. Quality was scored on a scale of 0 to 10 in one-point increments, with an average of 5.88; 3258 wines are above average. Each wine was judged on quality by at least three assessors, and the quality variable is the median of the assessors' scores (Cortez et al., 2009, p. 548). The variables containing the physicochemical properties are all continuous. The list of explanatory variables can be found in Table 8 in Attachment A. The biggest difference between the two datasets is the number of dummy variables: the first dataset contains many, the second none. The first dataset also contains more variables and has almost twice as many entries as the second.
4. Results
The results of the estimations with logistic regression and Support Vector Machines for the first dataset can be found in Tables 1, 2 and 3, and Figure 3. The results for the second dataset can be found in Tables 4, 5 and 6, and Figure 4. Each table reports the prediction error rates of one method for the different training set sizes and the five samples per size, together with the average over the samples. The figures plot the average prediction error.
In the first dataset the three models perform similarly. For the polynomial SVM model a third-degree function is used. As Figure 3 shows, all methods improve when more data are used. The SVMs improve considerably between the training set sizes of 6000 and 8000, where they perform slightly better than logistic regression.
Training set size   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Average
500                 0.22965    0.23523    0.22525    0.22227    0.23950    0.23056
1000                0.21660    0.21134    0.21092    0.20967    0.21023    0.21054
2000                0.20238    0.20544    0.21042    0.20994    0.20914    0.20874
4000                0.19995    0.20351    0.20731    0.21513    0.20043    0.20659
6000                0.21119    0.20352    0.19856    0.20803    0.20397    0.20352
8000                0.21119    0.20370    0.15278    0.19444    0.19907    0.18750

Table 1: Prediction error rates with logistic regression for dataset 1 (employment)
Training set size   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Average
500                 0.21449    0.19827    0.22084    0.21177    0.22551    0.21417
1000                0.21660    0.21924    0.21522    0.21494    0.21591    0.21638
2000                0.21750    0.20898    0.21509    0.21284    0.21573    0.21403
4000                0.21466    0.21727    0.20991    0.21679    0.20944    0.21361
6000                0.21661    0.21209    0.21255    0.21706    0.21751    0.21516
8000                0.15741    0.21382    0.17130    0.19444    0.18056    0.18350

Table 2: Prediction error rates with a radial SVM model for dataset 1 (employment)
Training set size   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Average
500                 0.22356    0.21475    0.21799    0.21669    0.21540    0.21768
1000                0.21924    0.21882    0.22242    0.22228    0.21937    0.22043
2000                0.21606    0.21493    0.21525    0.22297    0.22136    0.21811
4000                0.21181    0.20565    0.21371    0.20944    0.20897    0.20991
6000                0.21661    0.20352    0.21435    0.21796    0.20623    0.21173
8000                0.16204    0.21296    0.14352    0.21296    0.17593    0.18148

Table 3: Prediction error rates with a polynomial SVM model for dataset 1 (employment)
In the second dataset the radial SVM model starts to outperform the logistic regression model when the training set is larger than 1000 observations. In this dataset both logistic regression and the radial SVM model perform better as the training set size increases, but the decrease in prediction error rate for logistic regression is small: the radial SVM model 'learns' faster from the data than logistic regression does. For the polynomial SVM model a linear function fit the data best, although the polynomial SVM does not seem to fit this data well at all.
The biggest difference between the two datasets is the number of variables, and especially the large number of dummy variables in the first dataset. It appears that SVMs fit datasets with continuous variables better, provided the right boundary function is used. When handling many dummy variables, SVMs seem to perform similarly to logistic regression.
Figure 3: Error rates of prediction with the logistic regression model and Support Vector Machines for dataset 1 (employment)
Training set size   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Average
500                 0.25466    0.25489    0.25330    0.26262    0.25807    0.25671
1000                0.25552    0.25936    0.25064    0.26167    0.25808    0.25705
2000                0.26259    0.23775    0.24189    0.24948    0.24776    0.24790
3000                0.24816    0.24710    0.25606    0.23867    0.24868    0.24773
4000                0.24610    0.23942    0.24053    0.25724    0.23497    0.24365

Table 4: Prediction error rates with logistic regression for dataset 2 (wines)
Training set size   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Average
500                 0.25261    0.26944    0.27876    0.29332    0.26853    0.27253
1000                0.25090    0.25962    0.24064    0.23833    0.23474    0.24484
2000                0.23050    0.22326    0.21670    0.21222    0.21705    0.21994
3000                0.21233    0.21338    0.21549    0.21444    0.22023    0.21517
4000                0.16481    0.20267    0.17595    0.17483    0.18820    0.18129

Table 5: Prediction error rates with a radial SVM model for dataset 2 (wines)
Training set size   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Average
500                 0.25261    0.26944    0.27876    0.29332    0.26853    0.27253
1000                0.25090    0.25962    0.24064    0.23833    0.23474    0.24484
2000                0.23050    0.22326    0.21670    0.21222    0.21705    0.21994
3000                0.21233    0.21338    0.21549    0.21444    0.22023    0.21517
4000                0.16481    0.20267    0.17595    0.17483    0.18820    0.18129

Table 6: Prediction error rates with a polynomial SVM model for dataset 2 (wines)
In an effort to explain the large difference in prediction error rates for the second dataset, Figure 5 was made. In this figure the three most significant variables, as estimated by logistic regression, are shown in six different graphs. The estimated logistic regression model for the full dataset is shown in Table 9 in Attachment B, where the asterisks indicate the most significant variables. The two colours in the figure depict the two classes. The two classes show some grouping that one can imagine is best captured by a radial function, keeping in mind that the actual boundary lies in an eleven-dimensional space. The images are, however, not clear enough to base strong conclusions on.
Figure 4: Prediction error rates with the logistic regression model and Support Vector Machines for dataset 2 (wines)
5. Conclusion
Logistic regression and Support Vector Machines were compared on their prediction error rate for two different datasets. Within these two datasets the size of the training set was varied and five different random samples were used for each training set size.
The first dataset contains information on workers based on a survey and includes many dummy variables. The different models predicted whether a worker was self-employed or an employee, and they performed similarly. The second dataset contains physicochemical information on white wines; the models predicted whether a wine was above average in quality or not. The radial SVM model performed best in this dataset: while it performed worse than logistic regression for the smallest training set size, it improved faster than logistic regression as the training set grew. The polynomial SVM model did not perform well in this dataset.

Figure 5: Six graphs showing various combinations of the three most significant variables when estimated by logistic regression, for dataset 2 (wines)
Concluding from the results, it may be the case that SVMs perform similarly to logistic regression when handling dummy variables, while with continuous variables SVMs seem to perform better, provided the right boundary function is used and the training set is sufficiently large.
In former research SVMs outperformed logistic regression in most datasets, but it may be concluded from this thesis that the relative performance of the two techniques depends on the type of variables and the size of the training set. Training set size and variable type have not been covered in former comparison studies and may play an important role in relative performance. It is important to note, however, that only two datasets were used here, so the observed differences may also be purely coincidental. Further research with other types of data is needed to confirm these conclusions.
Attachment A: Lists of variables
Dependent variable
  Employment status                                  Binary variable, 1 for self-employed (versus employee)

Explanatory variables
  Gender                                             Dummy variable, 1 for female
  Age
  Years of education
  Urban area                                         Three dummy variables: metropolitan area, urban area, rural area (versus no answer)
  Occupation father                                  Five dummy variables: self-employed, white collar, blue collar, civil servant, not employed (versus no answer)
  Occupation mother                                  Five dummy variables: self-employed, white collar, blue collar, civil servant, not employed (versus no answer)
  Preference towards job                             Two dummy variables: self-employed, employee (versus no answer)
  Role of education towards entrepreneurship         Scale of -2 to 2
  Opinion of entrepreneurs                           Eight dummy variables for four questions: agree, disagree (versus no answer)
  Opinion towards difficulty of starting a business  Scale of -2 to 2
  Country                                            25 country dummies

Table 7: List of variables in dataset 1 (employment)
Dependent variable
  Wine quality                             Binary variable, 1 for above average

Explanatory variables                      Min     Max     Mean
  Fixed acidity (g(tartaric acid)/dm3)     3.8     14.2    6.9
  Volatile acidity (g(acetic acid)/dm3)    0.1     1.1     0.3
  Citric acid (g/dm3)                      0.0     1.7     0.3
  Residual sugar (g/dm3)                   0.6     65.8    6.4
  Chlorides (g(sodium chloride)/dm3)       0.01    0.35    0.05
  Free sulphur dioxide (mg/dm3)            2       289     35
  Total sulphur dioxide (mg/dm3)           9       440     138
  Density (g/cm3)                          0.987   1.039   0.994
  pH                                       2.7     3.8     3.1
  Sulphates (g(potassium sulphate)/dm3)    0.2     1.1     0.5
  Alcohol (vol.%)                          8.0     14.2    10.4

Table 8: List of variables in dataset 2 (wines)
Source: Cortez et al. (2009, p. 549)
Attachment B: The coefficients of the logistic regression model
Variable                Estimate    Std. Error   P-value
(Intercept)             424.5000    74.2700      0.0000
Fixed acidity           0.1516      0.0731       0.0382
Volatile acidity        -6.5140     0.4144       0.0000 *
Citric acid             0.1337      0.3030       0.6590
Residual sugar          0.2207      0.0274       0.0000 *
Chlorides               1.1970      1.6740       0.4743
Free sulfur dioxide     0.0091      0.0028       0.0011
Total sulfur dioxide    -0.0006     0.0012       0.6314
Density                 -438.9000   75.2600      0.0000
pH                      1.5820      0.3667       0.0000
Sulphates               2.0090      0.3615       0.0000 *
Alcohol                 0.5386      0.0968       0.0000

Table 9: Results of logistic regression for the full dataset; the asterisks indicate the most significant variables.
References
Abu-Nimeh, S., Nappa, D., Wang, X., & Nair, S. (2007). A comparison of machine learning techniques for phishing detection. eCrime '07: Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, 60-69.

Boyacioglu, M. A., Kara, Y., & Baykan, Ö. K. (2009). Predicting bank financial failures using neural networks, support vector machines and multivariate statistical methods: A comparative analysis in the sample of savings deposit insurance fund (SDIF) transferred banks in Turkey. Expert Systems with Applications, 36(2, Part 2), 3355-3366.

Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. ICML '06: Proceedings of the 23rd International Conference on Machine Learning, 161-168.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.

Heij, C., de Boer, P., Franses, P. H., Kloek, T., & van Dijk, H. K. (2004). Econometric methods with applications in business and economics (First ed.). Oxford: Oxford University Press.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With applications in R. New York: Springer.

Lee, J. W., Lee, J. B., Park, M., & Song, S. H. (2005). An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48(4), 869-885.

Min, J. H., & Lee, Y. (2005). Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications, 28(4), 603-614.

Min, S., Lee, J., & Han, I. (2006). Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Systems with Applications, 31(3), 652-660.

The Gallup Organization (2007). Entrepreneurship survey of the EU (25 member states), United States, Iceland and Norway. Retrieved May 14, 2014, from http://ec.europa.eu/public_opinion/flash/fl_192_en.pdf

Wu, C., Tzeng, G., Goo, Y., & Fang, W. (2007). A real-valued genetic algorithm to optimize the parameters of support vector machine for predicting bankruptcy. Expert Systems with Applications.