
A comparison between Support Vector Machines and Logistic Regression based on prediction error rates

June 27th, 2014

Casper Burik (10001420)

Thesis Supervisor: Dr. N.P.A. van Giersbergen
Programme: Econometrie en Operationele Research
Track: Econometrie
Field: Big Data

Abstract  

Support Vector Machines (SVM) and logistic regression are compared on their prediction error rates. This is done for two datasets, where the size of the training set is varied. Logistic regression performed slightly better on the dataset containing many dummy variables. On the dataset with only continuous variables, SVM performed better than logistic regression for the larger training set sizes.


1. Introduction

Nowadays computers process more data than ever before. Most economic transactions, for example, are processed by computers. Another example is Google, which receives 100 billion search queries per month. These large sets of data, often called big data, can be processed, manipulated and analysed. Conventional statistical techniques may work well in this situation; however, there are new and different techniques that may perform better when working with big data (Varian, 2013, pp. 1-3).

Not only econometricians concern themselves with analysing economic data, but also computer scientists. In particular, the field of machine learning concerns itself with prediction from data. This field has brought many new techniques that can be used in econometrics and economics (Varian, 2013, p. 5). An example of a technique developed by computer scientists specialised in machine learning is the Support Vector Machine (SVM). This classification method was developed in the 1990s and has gained popularity ever since. SVMs perform well under different settings and are considered one of the best out-of-sample classifiers (James et al., 2013, p. 337).

Several comparison studies between SVMs and other classification methods have been done before. Lee et al. (2004) carried out a very extensive study, comparing 21 classification methods, including logistic regression and SVMs, on seven different datasets involving gene selection; SVM was among the best methods. SVMs have also been compared to other classification methods using economic data. An example is Min and Lee (2005), who compared SVMs to three other classification methods by predicting bankruptcy for firms in Korea. The SVM had the best prediction accuracy in that study.

The aim of this paper is to compare the prediction strength of SVMs to the more conventional logistic regression, which is one of the most widespread classification methods in econometrics. The two techniques will be compared on their prediction error rates for two different datasets.

In section two the theoretical background of the two techniques will be explained. This section also contains a paragraph on former research that has been done on this subject. The third section will explain the method of comparing the two models. It will also give a description of the data that is used in this study. In the fourth section the results will be discussed. In the fifth and final section a conclusion will be drawn.

 

2. Theory

This section contains a paragraph on the theory behind the two techniques and a paragraph on former research that has been done on comparing Support Vector Machines with logistic regression.


2.1. Support Vector Machines and Logistic Regression

The Support Vector Machine is a classification method developed in the 1990s that has gained popularity ever since. The idea behind SVMs is to create a boundary between two classes; in the case of a linear boundary this is a line, a plane or a hyperplane, depending on the number of explanatory variables. Depending on which side of the boundary a data point lies, a binary response is predicted. SVMs deal with non-linear boundaries quite easily (James et al., 2013, p. 337). It is also possible to extend SVMs to multinomial prediction (James et al., 2013, p. 355).

The mathematics behind the SVMs can be quite difficult and some of it goes beyond the scope of this thesis. The general idea is summarised below, as is done by James et al. (2013, pp. 337-355):

$$\max_{\beta_0,\,\beta_{11},\ldots,\beta_{pm},\,\epsilon_1,\ldots,\epsilon_n} M$$

subject to

$$y_i \cdot f(x_i \mid \beta_0, \beta_{11}, \ldots, \beta_{pm}) \geq M(1 - \epsilon_i), \qquad \epsilon_i \geq 0, \qquad \sum_{i=1}^{n} \epsilon_i \leq C, \qquad \sum_{j=1}^{p} \sum_{k=1}^{m} \beta_{jk}^2 = 1$$

where $y_i$ is a binary variable taking the value -1 or 1, and $f(x_i \mid \beta_0, \beta_{11}, \ldots, \beta_{pm})$ is a function representing the boundary between the two classes. $M$ is a margin around the boundary. $\epsilon_i$ is a slack variable that allows an observation to lie on the wrong side of the margin or even on the wrong side of the boundary: $\epsilon_i$ is bigger than 0 if the observation is on the wrong side of the margin, and bigger than 1 if it is on the wrong side of the boundary. $C$ is a tuning parameter, usually called the cost parameter, which bounds the number of observations that may lie on the wrong side of the margin or even of the boundary. When $C$ is 0, all observations have to lie on the right side of the margin; this usually results in a narrow margin and is only possible if the two classes are separable (James et al., 2013, p. 347). The bigger the value of $C$, the more violations of the margin are allowed. The best value of $C$ is often found via cross-validation. For out-of-sample predictions, if the value of $f(x_i \mid \beta_0, \beta_{11}, \ldots, \beta_{pm})$ is bigger than 0 the prediction is 1; if it is smaller than 0 the prediction is -1. As it turns out, the values of $\beta_0, \beta_{11}, \ldots, \beta_{pm}$ depend only on the observations that are on the margin or on the wrong side of the margin, and not on the other observations. Those observations are called the support vectors.

An example of a linear function where each observation is on the correct side of the boundary can be seen in Figure 1. The SVM with a linear boundary has a boundary function of the form (James et al., 2013, p. 346):

$$f(x_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$$

   


An example of a polynomial boundary and a radial boundary can be seen in Figure 2. The boundary function of a polynomial boundary of degree m has the following form (James et al., 2013, p. 350):

$$f(x_i) = \beta_0 + \sum_{j=1}^{p} \sum_{k=1}^{m} \beta_{jk}\, x_{ij}^{k}$$

The radial boundary function has the form (James et al., 2013, p. 352):

$$f(x_i) = \beta_0 + \sum_{i'=1}^{n} \beta_{i'} \exp\!\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right)$$

The polynomial and radial boundaries have additional parameters (m and γ respectively in the equations above). Those can also be chosen via k-fold cross-validation.

   

Figure 1: An example of two classes being separated by a linear boundary. Source: James et al. (2013, p. 348)


Logistic regression is a more conventional method than SVMs and can be found in most econometric textbooks, see for instance Heij et al. (2004). The idea behind it is to estimate a model for a binary variable based on the cumulative distribution function of the logistic distribution. The advantage of cumulative distribution functions is that they are always between 0 and 1, which makes them very suitable for estimating probabilities. The logistic regression model is specified as:

$$P(y_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})}}$$

Here $P$ is the probability that the dependent variable equals 1. The values of $\beta_0, \ldots, \beta_k$ are found via maximum likelihood estimation (Heij et al., 2004, p. 447). For prediction, a value of 1 is predicted if $P$ is bigger than 0.5; if $P$ is smaller than 0.5, a value of 0 is predicted.
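As a sketch of how this prediction rule might be applied in R (the names `train`, `test` and the 0/1 response `y` are placeholders, not the thesis data):

```r
# Minimal sketch: logistic regression fitted by maximum likelihood with glm(),
# followed by the 0.5 classification threshold described above.
fit <- glm(y ~ ., family = binomial, data = train)

p_hat <- predict(fit, newdata = test, type = "response")  # estimated P(y = 1 | x)
y_hat <- ifelse(p_hat > 0.5, 1, 0)                        # predict 1 if P > 0.5

mean(y_hat != test$y)                                     # prediction error rate
```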

 

2.2. Literature Overview

Several studies comparing SVMs to different classification techniques, including logistic regression, have been done with data from different fields. Lee et al. (2004) carried out a very extensive study, comparing twenty-one classification techniques on seven gene expression datasets. In this study they used a linear SVM model and a radial SVM model. The SVMs were among the best prediction models used: the linear SVM model outperformed logistic regression in every dataset, and the predictive power of the radial SVM model was close to that of the linear model, tying with logistic regression in one dataset and being better in the rest. The difference in prediction error rate between SVMs and logistic regression varied considerably per dataset. In some datasets the results were comparable; in others there were large differences, ranging from the same prediction error to a difference of 30 percentage points, where the SVM had a prediction error rate of only 10 per cent and logistic regression an error rate of 40 per cent (Lee et al., 2004, pp. 876-878).

Figure 2: On the left, an example of a polynomial boundary separating two classes; on the right, a radial boundary.


Min and Lee (2005) compared SVMs to three other methods, one of which was logistic regression. For their comparison they used bankruptcy data of Korean firms and a radial SVM model. The SVM outperformed the three other methods; logistic regression had the largest prediction error rate.

Boyacioglu, Kara and Baykan (2009) used eight different classification methods to predict bank financial failures for Turkish banks. They used four different SVM models with different boundary functions; the polynomial model worked best in this instance, with a prediction error rate of 0.091, while logistic regression had a prediction error rate of 0.182.

Other examples of studies using SVMs, logistic regression and other classification techniques for predicting bankruptcy include Wu, Tzeng, Goo and Fang (2007) and Min, Lee and Han (2006), both of which had results similar to the studies above. Examples of studies comparing SVMs and other classification techniques in other fields include Caruana and Niculescu-Mizil (2006), where SVMs outperformed logistic regression in most datasets, and Abu-Nimeh et al. (2007), where logistic regression was a better predictor than the SVM. This last study used the techniques to predict whether e-mails were phishing or not, with word counts of different words used as variables.

Concluding from former research: in most cases SVMs perform better as classification predictors than logistic regression, so it may be expected that SVMs will also perform better than logistic regression in this thesis.

 

3. Method and Data

SVMs and logistic regression will be compared on their prediction error rates. This will be done for two different datasets that will be discussed in section 3.2. Each dataset is separated into a training set and a test set. Each model will be estimated using the training set. Using the model from the training set, predictions will be made for the test set. With the data from the test set and the predictions, a prediction error rate is computed.

Following the example of Perlich, Provost and Simonoff (2003), who place a large emphasis on varying the size of the training set when comparing tree-based methods with logistic regression, the size of the training set will be varied here as well. How a technique performs may depend on the amount of data used; it may be expected that both techniques will perform better with larger training sets, as there is more data to build the model from. Varying the size of the training set has not been done in former research comparing SVMs to other classification techniques. For the first dataset (employment) the training set sizes are 500, 1000, 2000, 4000, 6000 and 8000; for the second dataset (wines) they are 500, 1000, 2000, 3000 and 4000. The training sets are randomly drawn from the total dataset. For each training set size, five random samples are drawn, and the methods are compared on the average prediction error rate over those five samples, in order to reduce variance.
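As a sketch of this scheme (`full_data` stands in for one of the two datasets with binary response `y`, and `fit_and_error()` is a hypothetical helper that fits one model on the training sample and returns its test error rate; the sizes shown are those used for the first dataset):

```r
# Sketch of the evaluation scheme: for each training set size, draw five random
# training samples, use the remaining observations as the test set, and average
# the five test error rates.
sizes <- c(500, 1000, 2000, 4000, 6000, 8000)   # training set sizes, dataset 1
n_rep <- 5                                      # five random samples per size

avg_error <- sapply(sizes, function(m) {
  errs <- replicate(n_rep, {
    idx   <- sample(nrow(full_data), m)   # random training sample of size m
    train <- full_data[idx, ]
    test  <- full_data[-idx, ]            # all remaining observations
    fit_and_error(train, test)            # returns mean(prediction != test$y)
  })
  mean(errs)                              # average error over the five samples
})
```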

 

For the SVMs both a radial and a polynomial model are estimated. The R package e1071 will be used for the estimation. The parameters will be chosen via a grid search algorithm provided in the same package, which uses ten-fold cross-validation (James et al., 2013, p. 361). The logistic regression model is estimated as specified in section 2.1.
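A sketch of how this grid search might look with e1071's `tune()` function, which performs ten-fold cross-validation by default; the parameter grids and the data frame names `train` and `test` are illustrative assumptions, not the values used in this thesis.

```r
# Sketch: choosing SVM parameters by a cross-validated grid search with e1071.
# The response column 'y' is assumed to be a factor, so svm() does classification.
library(e1071)

tuned_radial <- tune(svm, y ~ ., data = train, kernel = "radial",
                     ranges = list(cost  = c(0.1, 1, 10, 100),
                                   gamma = c(0.01, 0.1, 0.5, 1)))

tuned_poly <- tune(svm, y ~ ., data = train, kernel = "polynomial",
                   ranges = list(cost   = c(0.1, 1, 10, 100),
                                 degree = c(1, 2, 3)))

best_radial <- tuned_radial$best.model        # refit at the best parameter values
pred <- predict(best_radial, newdata = test)  # class predictions for the test set
mean(pred != test$y)                          # prediction error rate
```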

 

Two different datasets are used in this thesis. The first dataset, employment, contains data on 20675 individuals with detailed descriptions of their employment status and, in particular, their opinion towards self-employment. The data were collected via a survey by The Gallup Organization (2007) for 25 EU countries, Iceland, Norway and the United States. The survey contained questions on demographics, employment status, occupation of parents, education and opinion towards self-employment; the full survey can be found in the paper of The Gallup Organization. In this thesis the employment status is taken as the dependent variable; it has two classes: self-employed or employee. This dataset has also been used by Block, Hoogerheide and Thurik (2011) to predict whether a person is self-employed or not. In order to create a binary variable, the unemployed are taken out of the dataset. The list of explanatory variables can be found in Table 7 in Attachment A; most of them are dummy variables. After this selection the dataset contains 8216 people, of which 1777 are self-employed.

The second dataset, wines, contains data on the physicochemical properties and quality of 4898 white wines. The dataset was created by Cortez et al. (2009). The data will be used to predict whether the quality of a wine is above average or not. Quality was scored on a scale of 0 to 10 in one-point increments, with an average of 5.88; 3258 wines are above average. Each wine was judged on quality by at least three assessors, and the quality variable is the median of their scores (Cortez et al., 2009, p. 548). The variables containing the physicochemical properties are all continuous. The list of explanatory variables can be found in Table 8 in Attachment A.

The biggest difference between the two datasets is the number of dummy variables: the first dataset contains a lot of dummy variables, the second one contains none. The first dataset also contains more variables and has almost twice as many entries as the second one.
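As an illustration of how the binary response for the wines data might be constructed (the data frame name `wines` and the column name `quality` are assumptions):

```r
# Sketch: code the wine quality as 1 if the score is above the sample mean
# (about 5.88), 0 otherwise; the thesis reports 3258 wines above average.
wines$above_avg <- factor(ifelse(wines$quality > mean(wines$quality), 1, 0))
table(wines$above_avg)
```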


4. Results

The results of the estimations with logistic regression and Support Vector Machines for the first dataset can be found in Tables 1, 2 and 3 and Figure 3. The results for the second dataset can be found in Tables 4, 5 and 6 and Figure 4. Each table reports the prediction error rates of one method for the different training set sizes and the five samples per size, together with the average over the samples. In the figures the average prediction error is used.

On the first dataset the three models perform similarly. For the polynomial SVM model a third-degree function is used. As Figure 3 shows, all methods improve when more data are used. The SVMs improve considerably between the training set sizes of 6000 and 8000, where they perform slightly better than logistic regression.

 

Training set size | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Average
500  | 0,22965 | 0,23523 | 0,22525 | 0,22227 | 0,23950 | 0,23056
1000 | 0,21660 | 0,21134 | 0,21092 | 0,20967 | 0,21023 | 0,21054
2000 | 0,20238 | 0,20544 | 0,21042 | 0,20994 | 0,20914 | 0,20874
4000 | 0,19995 | 0,20351 | 0,20731 | 0,21513 | 0,20043 | 0,20659
6000 | 0,21119 | 0,20352 | 0,19856 | 0,20803 | 0,20397 | 0,20352
8000 | 0,21119 | 0,20370 | 0,15278 | 0,19444 | 0,19907 | 0,18750

Table 1: Prediction error rates with logistic regression for dataset 1 (employment)

   

Training set size | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Average
500  | 0,21449 | 0,19827 | 0,22084 | 0,21177 | 0,22551 | 0,21417
1000 | 0,21660 | 0,21924 | 0,21522 | 0,21494 | 0,21591 | 0,21638
2000 | 0,21750 | 0,20898 | 0,21509 | 0,21284 | 0,21573 | 0,21403
4000 | 0,21466 | 0,21727 | 0,20991 | 0,21679 | 0,20944 | 0,21361
6000 | 0,21661 | 0,21209 | 0,21255 | 0,21706 | 0,21751 | 0,21516
8000 | 0,15741 | 0,21382 | 0,17130 | 0,19444 | 0,18056 | 0,18350

Table 2: Prediction error rates with a radial SVM model for dataset 1 (employment)


 

Training set size | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Average
500  | 0,22356 | 0,21475 | 0,21799 | 0,21669 | 0,21540 | 0,21768
1000 | 0,21924 | 0,21882 | 0,22242 | 0,22228 | 0,21937 | 0,22043
2000 | 0,21606 | 0,21493 | 0,21525 | 0,22297 | 0,22136 | 0,21811
4000 | 0,21181 | 0,20565 | 0,21371 | 0,20944 | 0,20897 | 0,20991
6000 | 0,21661 | 0,20352 | 0,21435 | 0,21796 | 0,20623 | 0,21173
8000 | 0,16204 | 0,21296 | 0,14352 | 0,21296 | 0,17593 | 0,18148

Table 3: Prediction error rates with a polynomial SVM model for dataset 1 (employment)

   

   

On the second dataset the radial SVM model starts to outperform the logistic regression model when the training set is larger than 1000 observations. In this dataset both logistic regression and the radial SVM model perform better as the training set size increases, but the decrease in prediction error rate for logistic regression is small; the radial SVM model 'learns' faster from the data than logistic regression. For the polynomial SVM model a linear function fit the data best, although the polynomial SVM does not appear to fit this data well at all.

The biggest difference between the two datasets is the number of variables and especially the large number of dummy variables in the first dataset. It appears that SVMs fit datasets with continuous variables better, provided the right boundary function is used, while with many dummy variables SVMs seem to perform similarly to logistic regression.

   

Figure 3: Error rates of prediction with the logistic regression model and Support Vector Machines for dataset 1 (employment)


 

 

Training set size | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Average
500  | 0,25466 | 0,25489 | 0,25330 | 0,26262 | 0,25807 | 0,25671
1000 | 0,25552 | 0,25936 | 0,25064 | 0,26167 | 0,25808 | 0,25705
2000 | 0,26259 | 0,23775 | 0,24189 | 0,24948 | 0,24776 | 0,24790
3000 | 0,24816 | 0,24710 | 0,25606 | 0,23867 | 0,24868 | 0,24773
4000 | 0,24610 | 0,23942 | 0,24053 | 0,25724 | 0,23497 | 0,24365

Table 4: Prediction error rates with logistic regression for dataset 2 (wines)

 

 

Training set size | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Average
500  | 0,25261 | 0,26944 | 0,27876 | 0,29332 | 0,26853 | 0,27253
1000 | 0,25090 | 0,25962 | 0,24064 | 0,23833 | 0,23474 | 0,24484
2000 | 0,23050 | 0,22326 | 0,21670 | 0,21222 | 0,21705 | 0,21994
3000 | 0,21233 | 0,21338 | 0,21549 | 0,21444 | 0,22023 | 0,21517
4000 | 0,16481 | 0,20267 | 0,17595 | 0,17483 | 0,18820 | 0,18129

Table 5: Prediction error rates with a radial SVM model for dataset 2 (wines)

 

 

Training set size | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Average
500  | 0,25261 | 0,26944 | 0,27876 | 0,29332 | 0,26853 | 0,27253
1000 | 0,25090 | 0,25962 | 0,24064 | 0,23833 | 0,23474 | 0,24484
2000 | 0,23050 | 0,22326 | 0,21670 | 0,21222 | 0,21705 | 0,21994
3000 | 0,21233 | 0,21338 | 0,21549 | 0,21444 | 0,22023 | 0,21517
4000 | 0,16481 | 0,20267 | 0,17595 | 0,17483 | 0,18820 | 0,18129

Table 6: Prediction error rates with a polynomial SVM model for dataset 2 (wines)


   

In an effort to explain the large difference in the prediction error rates for the second dataset, Figure 5 was made. In this figure the three most significant variables, as estimated by logistic regression, are shown in six different graphs. The estimated logistic regression model for the full dataset is shown in Table 9 in Attachment B; the asterisks indicate the most significant variables. The two colours in the figure depict the two classes. The classes show some grouping that could plausibly be captured best by a radial function, keeping in mind that the actual boundary lies in an eleven-dimensional space. The images are, however, not clear enough to base strong conclusions on.

Figure 5: Six graphs showing various combinations of the three most significant variables when estimated by logistic regression (dataset 2, wines)
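A figure of this kind could be produced with a sketch like the one below; the column names are assumptions based on Table 9, and `above_avg` is the assumed binary response constructed in the earlier sketch.

```r
# Sketch: pairwise scatter plots of the three most significant predictors,
# coloured by class (similar in spirit to Figure 5). Column names are assumed.
vars <- c("volatile.acidity", "residual.sugar", "sulphates")
pairs(wines[, vars],
      col = ifelse(wines$above_avg == 1, "blue", "red"),  # two colours, two classes
      pch = 20)
```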

Figure 4: Prediction error rates with the logistic regression model and Support Vector Machines for dataset 2 (wines)


5. Conclusion

Logistic regression and Support Vector Machines were compared on their prediction error rates for two different datasets. Within these two datasets the size of the training set was varied and five different random samples were used for each training set size.

The first dataset contains information on workers, based on a survey, and includes many dummy variables. The models predicted whether a worker was self-employed or an employee, and the different models performed similarly. The second dataset contains physicochemical information on white wines, and the models predicted whether a wine was above average in quality or not. The radial SVM model performed best on this dataset: while it performed worse than logistic regression for the smallest training set size, it improved faster than logistic regression as the training set size grew. The polynomial SVM model did not perform well on this dataset.

Concluding from the results, it may be the case that SVMs perform similarly to logistic regression when handling dummy variables, while with continuous variables SVMs seem to perform better, provided the right boundary function is used and the training set is large enough.

In former research SVMs outperformed logistic regression in most datasets, but it may be concluded from this thesis that the relative performance of the two techniques depends on the type of variables and the size of the training set. Training set size and type of variables have not been covered in former comparison studies and may play an important role in relative performance. However, it is important to note that only two datasets were used here, so the differences in these two aspects may also be purely coincidental. Further research with other types of data is needed to confirm these conclusions.

   


Attachment A: Lists of variables

Dependent variable
Employment status | Binary variable, 1 for self-employed (versus employee)

Explanatory variables
Gender | Dummy variable, 1 for female
Age |
Years of education |
Urban area | Three dummy variables: Metropolitan area, Urban area, Rural area (versus no answer)
Occupation father | Five dummy variables: Self-employed, White collar, Blue collar, Civil servant, Not employed (versus no answer)
Occupation mother | Five dummy variables: Self-employed, White collar, Blue collar, Civil servant, Not employed (versus no answer)
Preference towards job | Two dummy variables: Self-employed, Employee (versus no answer)
Role of education towards entrepreneurship | Scale of -2 to 2
Opinion of entrepreneurs | Eight dummy variables for four questions: Agree, Disagree (versus no answer)
Opinion towards difficulty of starting a business | Scale of -2 to 2
Country | 25 country dummies

Table 7: List of variables in dataset 1 (employment)


Dependent variable
Wine quality | Binary variable, 1 for above average

Explanatory variables | Min | Max | Mean
Fixed acidity (g(tartaric acid)/dm³) | 3.8 | 14.2 | 6.9
Volatile acidity (g(acetic acid)/dm³) | 0.1 | 1.1 | 0.3
Citric acid (g/dm³) | 0.0 | 1.7 | 0.3
Residual sugar (g/dm³) | 0.6 | 65.8 | 6.4
Chlorides (g(sodium chloride)/dm³) | 0.01 | 0.35 | 0.05
Free sulphur dioxide (mg/dm³) | 2 | 289 | 35
Total sulphur dioxide (mg/dm³) | 9 | 440 | 138
Density (g/cm³) | 0.987 | 1.039 | 0.994
pH | 2.7 | 3.8 | 3.1
Sulphates (g(potassium sulphate)/dm³) | 0.2 | 1.1 | 0.5
Alcohol (vol.%) | 8.0 | 14.2 | 10.4

Table 8: List of variables in dataset 2 (wines)
Source: Cortez et al. (2009, p. 549)


Attachment B: The coefficients of the logistic regression model

Variable | Estimate | Std. Error | P-value
(Intercept) | 424,5000 | 74,2700 | 0,0000
Fixed acidity | 0,1516 | 0,0731 | 0,0382
Volatile acidity | -6,5140 | 0,4144 | 0,0000 *
Citric acid | 0,1337 | 0,3030 | 0,6590
Residual sugar | 0,2207 | 0,0274 | 0,0000 *
Chlorides | 1,1970 | 1,6740 | 0,4743
Free sulphur dioxide | 0,0091 | 0,0028 | 0,0011
Total sulphur dioxide | -0,0006 | 0,0012 | 0,6314
Density | -438,9000 | 75,2600 | 0,0000
pH | 1,5820 | 0,3667 | 0,0000
Sulphates | 2,0090 | 0,3615 | 0,0000 *
Alcohol | 0,5386 | 0,0968 | 0,0000

Table 9: Results of logistic regression for the full dataset (wines); the asterisks indicate the most significant variables.

 

References

Abu-Nimeh, S., Nappa, D., Wang, X., & Nair, S. (2007). A comparison of machine learning techniques for phishing detection. eCrime '07: Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, 60-69.

Boyacioglu, M. A., Kara, Y., & Baykan, Ö. K. (2009). Predicting bank financial failures using neural networks, support vector machines and multivariate statistical methods: A comparative analysis in the sample of savings deposit insurance fund (SDIF) transferred banks in Turkey. Expert Systems with Applications, 36(2, Part 2), 3355-3366.

Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. ICML '06: Proceedings of the 23rd International Conference on Machine Learning, 161-168.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.

Heij, C., de Boer, P., Franses, P. H., Kloek, T., & van Dijk, H. K. (2004). Econometric methods with applications in business and economics (First ed.). Oxford: Oxford University Press.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning with applications in R. New York: Springer.

Lee, J. W., Lee, J. B., Park, M., & Song, S. H. (2005). An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48(4), 869-885.

Min, J. H., & Lee, Y. (2005). Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications, 28(4), 603-614.

Min, S., Lee, J., & Han, I. (2006). Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Systems with Applications, 31(3), 652-660.

The Gallup Organization (2007). Entrepreneurship survey of the EU (25 member states), United States, Iceland and Norway. Retrieved 14 May 2014, from http://ec.europa.eu/public_opinion/flash/fl_192_en.pdf

Wu, C., Tzeng, G., Goo, Y., & Fang, W. (2007). A real-valued genetic algorithm to optimize the parameters of support vector machine for predicting bankruptcy. Expert Systems with Applications.
