
Predicting Whether Someone is Insured or Not: A Comparison Between Classification Trees and Logistic Regression

   

Abstract

 

Logistic regression and classification trees are compared on their prediction power, variable selection and interpretation when predicting whether someone is insured or not. This is done on three different datasets, in which the proportion of insured observations and the linearity of the data are varied. Logistic regression performs as well as classification trees on nonlinear data and better on linear data. Logistic regression tends to select dummy variables and classification trees continuous variables in their prediction models, but both select income as an important variable. Nonlinearity in the data is easier to interpret with the classification tree.

     

Author: Steyn Heskes
Student number: 6350399
Supervisor: Dr. N. P. A. van Giersbergen
Subject: Econometrics and Big Data
Bachelor of Science in Econometrics
University of Amsterdam


INDEX

1. Introduction
2. Data and methods
2.1 Data
2.2 Techniques
2.2.1 The classification tree
2.2.2 Cost complexity pruning
2.2.3 The logistic regression model
2.2.4 Bagging
2.2.5 Random forest
2.3 Method
2.3.1 Research method
2.3.2 Data generating process and data manipulation
3. Results
3.1 Results of the 83% dataset
3.1.1 Logistic regression models
3.1.2 Tree models
3.2 Results of the 50% dataset
3.2.1 Logistic regression models
3.2.2 Tree models
3.3 Results of the linear 50% dataset
3.3.1 Logistic regression models
3.3.2 Tree models
3.4 Comparison
3.4.1 Comparison of the models constructed with the 83% dataset
3.4.2 Comparison of models constructed with the 50% dataset
3.4.3 Comparison of models constructed with the linear 50% dataset
3.4.4 Interpretation
3.4.5 Overall comparison
4. Conclusion
5. Literature cited
Appendix 2A – Coefficients 83% dataset logistic regression model 2
Appendix 2B – Coefficients 50% dataset logistic regression model 2
Appendix 2C – Coefficients linear 50% dataset logistic regression model 2
Appendix 3A – Importance 83% dataset bagging and random forest
Appendix 3B – Importance 50% dataset bagging and random forest
Appendix 3C – Importance linear 50% dataset bagging and random forest

1.  Introduction  

Nowadays, computers are more and more involved in economic transactions and automatically save information about these transactions. Over the past three decades, datasets have consequently grown bigger and bigger, resulting in extraordinarily large datasets. For example, companies such as Google capture around 20 billion URLs a day and handle over 100 billion search queries each month; in total, they have captured around 30 trillion URLs. Econometricians, however, are used to dealing with datasets that fit in a single spreadsheet. Big datasets require a different approach for three main reasons. First, the datasets are so big that they require stronger manipulative techniques. Second, sometimes there are more variables available than preferred, causing the overfitting problem. Third, big datasets contain more flexible relations than just linear relations (Varian, 2013).

To deal with these datasets, new techniques have been developed, called machine learning techniques. These techniques differ from Ordinary Least Squares (OLS) in three main ways (Varian, 2013). First, machine learning techniques focus on summarizing nonlinearity in the data, whereas OLS focuses solely on linear relations. Second, OLS focuses on explaining the available data as well as possible, whereas machine learning techniques focus on forecasting outside the dataset. This is easily explained by recent developments: twenty years ago, datasets were not that big, so it was not possible to check a model with additional data. Today there is so much data that it is possible to select a sample from the data to create a model and then predict the remaining out-of-sample data to validate the model. One can then calculate how much of the out-of-sample data is predicted correctly. This technique is called k-fold cross-validation. Third, machine learning techniques also provide methods to deal with the overfitting problem. By selecting the most powerful variables and removing the superfluous ones, the model becomes less complex and its prediction power improves.

In this paper, the machine learning technique Classification Trees (CT) and the commonly used logistic regression model are compared for three main reasons. First, both models are popular with statisticians, machine learning researchers, data analysts and econometricians. Second, they are both classification methods, used for example to classify emails into the categories 'spam' and 'no spam'. Third, they are both quite easy to use. In addition, CTs are becoming more popular for several reasons. First, they are easy to interpret. Second, by pruning CTs, their prediction power is expected to increase to a higher level than that of logistic regression models, provided that the dataset is big enough (Candeviren, 2006; Perlich, 2003). Third, CTs are a good method for capturing nonlinearities in the data.

The aim of this paper is to examine which method, CT or logistic regression, makes better predictions by comparing the obtained results, using k-fold cross-validation. Both methods try to explain whether someone is insured or not on the basis of some demographic and socioeconomic characteristics and some information about health care expenditures.

The rest of the paper is structured as follows. The data, data manipulation, techniques and research method are described in Section 2. The results are discussed in Section 3, after which a conclusion is drawn in Section 4.

2.  Data  and  methods  

First, the data and the manipulation process used to obtain the results are described in Section 2.1. Second, the techniques used to create and improve the prediction models are described in Section 2.2. Third, the research method is described in Section 2.3.

2.1.  Data  

The Medical Expenditure Panel Survey (MEPS), set up in 1996 by the U.S. Department of Health and Human Services, is a nationally representative survey of the U.S. civilian non-institutionalized population: people of 16 years and older living in the 50 states and the District of Columbia who are not inmates of homes for the aged or of penal and mental facilities, and who are not on active duty in the Armed Forces (Shen, 2013). MEPS consists of surveys of households, employers and medical providers, collecting information about health care expenditures, health insurance coverage, and demographic and socioeconomic characteristics.


This study uses the MEPS data 'Basevar.xlsx', conducted in 2005, which consists of 33,964 observations. For this study, the data is manipulated in several ways. First, classification trees and logistic regression both require fully observed data, because missing values cause problems for these methods. Therefore, only the rows without missing values are included; rows with missing values are removed. Second, all data about the amount compensated by insurance companies is removed, because when this amount is greater than zero, the person is almost certainly insured. Last, two variables are recoded into two groups: hlthins is recoded into 'insured' and 'not insured', and race is recoded into 'White' and 'Other'. The variables that survive this manipulation process are described in Appendix 1.

2.2.  Techniques  

This section describes five techniques, which are used to estimate and improve the different models, namely the classification tree in Section 2.2.1, cost complexity pruning in Section 2.2.2, the logistic regression model in Section 2.2.3, bagging in Section 2.2.4, and random forest in Section 2.2.5.

2.2.1  The  classification  tree  

A CT consists of a network of nodes, where each node denotes a variable; a fictitious example is shown in Figure 1. At each node, two or more branches denote certain values or value ranges, after which a new node or a terminal node appears. The start node, or root node, is the first node of the tree, where the branches partition the whole dataset into classes. In Figure 1, the start node sends observations older than 45 to the left branch and observations of 45 years or younger to the right branch. Each partitioning is based on finding the most homogeneous subsets (branches) within the variable. This way, an observation travels top-down through the tree and ends in a terminal node. The terminal node contains a value of the dependent variable, such as 'Yes' or 'No' in Figure 1.

[Figure 1: fictitious example of a classification tree, with a start node splitting on age (older than 45 versus 45 or younger) and terminal nodes 'Yes' and 'No'.]

Classification means dividing data into several categories based on properties of the observations. The aim of the CT method is to construct a tree that predicts the out-of-sample data as well as possible (Candeviren, 2006). The method belongs to the family of Classification And Regression Trees (CART) methods. Trees prove to be good tools for summarizing important nonlinearity in the data, and they work well with large amounts of data (Varian, 2013).
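The thesis itself does not include code, but the procedure can be illustrated with a short sketch. The example below is not the author's implementation; it uses Python with scikit-learn on synthetic data as a stand-in for the MEPS variables, and all names in it are illustrative assumptions.

```python
# Minimal sketch (not from the thesis): fitting a classification tree and
# computing its prediction power on a held-out test set with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the cleaned MEPS data; y = 1 ('Insured'), 0 ('Not insured')
X, y = make_classification(n_samples=5000, n_features=10, n_informative=4,
                           random_state=0)

# 75% training set, 25% test set, mirroring the split used in Section 2.3.1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    shuffle=False)

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_train, y_train)

# Prediction power: fraction of test observations classified correctly
print("prediction power:", tree.score(X_test, y_test))

# Show the start (root) node and the first splits of the fitted tree
print(export_text(tree, max_depth=2))
```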

2.2.2  Cost  complexity  pruning  

CTs can grow large and become very complex. This is desirable for explaining the training data, but likely to overfit, resulting in poor prediction power. Cost complexity pruning offers a solution by introducing a penalty term on the number of nodes. The tree with the lowest Mean Squared Error (MSE), as a function of the number of nodes, is expected to be the tree with the best prediction power (Hastie et al., 2013). In addition, the pruned tree is less complex and therefore easier to interpret.
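As an illustration of the pruning idea (a sketch under assumptions, not the thesis's own code), scikit-learn exposes the node penalty as the parameter ccp_alpha; one can compute the pruning path and keep the penalty with the lowest cross-validated error, analogous to selecting the tree with the lowest MSE.

```python
# Sketch: cost complexity pruning by tuning the penalty ccp_alpha (scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

# Effective penalty values at which nodes of the full tree would be pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_err = 0.0, np.inf
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    # misclassification error plays the role of the error criterion here
    err = 1 - cross_val_score(tree, X, y, cv=5).mean()
    if err < best_err:
        best_alpha, best_err = alpha, err

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print("chosen penalty:", best_alpha, "number of leaves:", pruned.get_n_leaves())
```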


2.2.3  The  logistic  regression  model  

The logistic regression model divides observations, based on their properties, into several categories or classes of the dependent variable, just like CTs. There are no restrictions on the independent variables; they may be of categorical or numerical form. In this paper, the dependent variable consists of just two categories. The logistic regression model is therefore defined as follows (Heij et al., 2004):

$$g(x) = \log\frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \beta_0 + \sum_{j=1}^{k}\beta_j x_j$$

The coefficients $\beta_j$ are estimated by maximizing the log-likelihood function. The estimated model can then be interpreted in terms of the signs and significance of the estimated coefficients.
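A short sketch of how such a model could be estimated and screened on significance (an illustration, not the thesis's code; it assumes Python with statsmodels, and the variable names are made up):

```python
# Sketch: estimate a logistic regression by maximum likelihood and keep only
# the variables that are significant at the 5% level (statsmodels).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame({"income": rng.normal(50, 15, 2000),
                  "female": rng.integers(0, 2, 2000),
                  "noise": rng.normal(0, 1, 2000)})     # an irrelevant variable
index = 0.05 * X["income"] - 0.4 * X["female"] - 2.5    # true linear index
y = (rng.random(2000) < 1 / (1 + np.exp(-index))).astype(int)

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)      # model 1: all variables
pvals = full.pvalues.drop("const")
keep = pvals[pvals < 0.05].index                        # significant at 5%
reduced = sm.Logit(y, sm.add_constant(X[keep])).fit(disp=0)  # model 2

print("kept variables:", list(keep))
print(reduced.params)
```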

2.2.4  Bagging  

This technique is used to improve the performance of statistical learning methods such as decision trees. The decision trees of Sections 2.2.1 and 2.2.2 suffer from high variance (Hastie et al., 2013): the results for two different samples can be quite different. A decision tree with low variance is expected to provide similar results when applied repeatedly to distinct datasets. Bagging is a commonly used method for reducing variance.

Generally, bagging uses B separate training sets to build B prediction models and averages the resulting predictions. In other words, first $\hat{f}^{1}(x), \ldots, \hat{f}^{B}(x)$ are calculated using different training sets 1 to B. Second, the low-variance statistical learning model is defined as

$$\hat{f}_{avg}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{b}(x)$$

Unfortunately, multiple training sets are usually not available. Therefore, a technique called the bootstrap is used. First, the bootstrap takes repeated samples from the training data set; this way, B different bootstrapped training data sets are created. Second, the model is trained on the b-th bootstrapped training set in order to obtain $\hat{f}^{*b}(x)$. Third, the predictions are averaged to obtain

$$\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)$$

For decision trees, bagging works almost the same way. First, B trees are constructed on the basis of B bootstrapped training sets, and the resulting predictions are averaged to obtain the final model. The individual B trees suffer from high variance but have low bias, whereas the resulting averaged model has lower variance at the cost of some bias. In the case of classification trees, instead of taking the average, the most commonly occurring class among the B predictions (the majority vote) is used to obtain the final prediction.

The advantage of bagging is thus that it improves prediction accuracy, but it also has a disadvantage: when a large number of trees is bagged, it is no longer possible to interpret the results in a single tree, so it is no longer clear which variables are most important to the final prediction model. The Gini index offers a solution.

The Gini index is defined as

$$G = \sum_{k=1}^{K}\hat{p}_{mk}\,(1-\hat{p}_{mk})$$

where m denotes the m-th region (regions are split up by nodes), k the k-th class (in this case 1 or 2, i.e. 'Insured' or 'Not insured') and $\hat{p}_{mk}$ is the proportion of observations in the training set in the m-th region that belong to the k-th class. The Gini index measures the total variance across the K classes. By adding up the total amount by which the Gini index is decreased by splits over a given predictor, averaged over all B trees, the predictors with the largest mean decrease are identified as the most important variables. This measure is called the Mean Decrease in Gini index (MDG).
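The sketch below (again an illustration, not the thesis's code; it assumes scikit-learn, whose impurity-based importances correspond to the mean-decrease-in-Gini idea) grows 100 bagged trees and averages their Gini-based importances.

```python
# Sketch: bagging 100 classification trees and ranking variables by an
# MDG-style importance, averaged over the B trees (scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    shuffle=False)

# The default base estimator is a decision tree; all p variables are split candidates
bag = BaggingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# The ensemble prediction aggregates the votes of the 100 trees
print("prediction power:", bag.score(X_test, y_test))

# Average the Gini-based importances of the individual trees (MDG-style ranking)
importances = np.mean([t.feature_importances_ for t in bag.estimators_], axis=0)
print("five most important variables:", np.argsort(importances)[::-1][:5])
```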

2.2.5  Random  forest  

Bagging considers all p variables as split candidates at each split. This way, most of the trees will use the strongest predictor in their start node, resulting in B quite similar trees. The predictions from the bagged trees will therefore also be quite similar and thus highly correlated. Averaging over a large number of highly correlated bagged trees, however, does not lead to a strong variance reduction.


Random forest overcomes this problem by decorrelating the bagged trees. In contrast to bagging, random forest only considers a subset of m < p variables as split candidates at each split, where the m considered variables are randomly chosen at each split. The strongest predictor will therefore not even be considered in (p − m)/p of the splits. This way, other predictors get a better chance to enter the model, which decreases the variance of the average of the resulting trees and thereby makes it more reliable. In this paper, we use the recommended m = √p (Hastie et al., 2013). As with bagging, the variables with the largest MDG are the most important variables.
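A corresponding sketch for random forest (again an illustration, not the thesis's code; scikit-learn is assumed): the only substantive change relative to bagging is that each split considers a random subset of m = √p candidate variables.

```python
# Sketch: random forest with m = sqrt(p) candidate variables per split, ranked
# by the impurity-based (mean decrease in Gini) variable importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    shuffle=False)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X_train, y_train)

print("prediction power:", forest.score(X_test, y_test))
# feature_importances_ holds the normalized mean decrease in impurity (Gini)
print("importance ranking:", forest.feature_importances_.argsort()[::-1][:5])
```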

2.3  Method  

In this section, the research method is first described in Section 2.3.1; the data generating process and data manipulation are then described in Section 2.3.2.

2.3.1  Research  method  

To decide which of the two methods, logistic regression or classification tree, best predicts whether someone is insured or not, we first divide the data into two parts consisting of 75% and 25% of the data, called the training set and the test set, respectively. This is done in three different ways: using the first 75% as training set, the last 75%, and the middle 75%, named F75, L75 and M75, respectively. The training data is used to construct a prediction model with each of the two methods. The prediction power is then examined (validated) by predicting the test set: the number of correctly predicted test observations (successes) divided by the number of observations in the test set (total) gives the prediction power. The models are subsequently compared on their prediction power and on variable importance, by looking at the MDG and the p-values of the variables.

Second, we construct a CT using the training set, as described in Section 2.2.1. The CT is then used to predict the test data, and the prediction power is calculated to determine how well it predicts. Next, the CT is pruned as described in Section 2.2.2 by including a penalty for each node; the bigger the penalty, the smaller the resulting tree. We expect the prediction power to increase after pruning (Gareth James, 2013) and select the tree with the lowest MSE.


Third, to improve the results of the CT, new CTs are constructed by bagging, as described in Section 2.2.4. First, 100 bagged trees are grown, considering all p variables at each split. The final prediction model is given by the majority vote of the 100 bagged trees. The prediction power is again calculated by predicting the test set and validating the results.

Fourth, random forest is used in the same way as bagging, growing 100 trees, but instead of considering all p variables at each split, only m = √p variables are considered, as described in Section 2.2.5. The final prediction model again consists of the majority vote of the 100 trees, after which the prediction power is calculated.

Fifth, we predict whether someone is insured or not with the logistic regression model described in Section 2.2.3, using the same training data as for the CT. We construct two different models. The first model includes all variables; this model predicts the test data, and the prediction power is calculated in the same way as for the CT. For the second model, we look at the significance of the variables: variables that are not significant at the 5% level are removed. Afterwards, we calculate the prediction power again.

Last, the results are compared on the basis of their prediction power, the variables that are included in each model and which of them are most significant. To decide whether one of the prediction models predicts best, we test $H_0: p_1 = p_2$ by calculating the t-value as

$$t = \frac{(p_1 - p_2) - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim N(0,1), \qquad \hat{p} = \frac{p_1 + p_2}{2}$$

where $p_1$ and $p_2$ are the prediction powers of the different models (or the minimal prediction power) and $n_1$ and $n_2$ are the numbers of observations in the test sets. The critical value is 2.00.
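This comparison can be written out as a small calculation. The sketch below is an illustration only (not the thesis's code), and the prediction powers and test-set sizes in it are hypothetical numbers.

```python
# Sketch: t-value for H0: p1 = p2, with pooled p = (p1 + p2) / 2 as above.
import math

def prediction_power_test(p1, p2, n1, n2):
    """Two-proportion test statistic for comparing two prediction powers."""
    p_bar = (p1 + p2) / 2
    se = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical prediction powers on two test sets of 4000 observations each
t = prediction_power_test(p1=0.78, p2=0.75, n1=4000, n2=4000)
print("t-value:", round(t, 4), "significant at critical value 2.00:", abs(t) > 2.00)
```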

2.3.2  Data  generating  process  and  data  manipulation  

The data is manipulated in two ways: first by changing the proportion of insured and not insured observations in the data, and second by making the data linear.

In the original dataset, 83% of the people are insured. Consequently, all classification models should predict at least around 83% of the data correctly. This situation causes the logistic regression model to have a high intercept and the CTs to only give 'Insured' as an outcome. In other words, the models tend to give 'Insured' as an outcome, giving them a high prediction power in the 'Insured' segment but a low prediction power in the 'Not insured' segment. To prevent this problem, we construct a dataset of 50% 'Insured' and 50% 'Not insured' observations and again perform the research method described in Section 2.3.1 on this dataset.

We also have to take into account that the data may be nonlinear, and we expect the CT to perform better than the logistic regression model for nonlinear relations. Therefore, we generate linear data by using the second logistic regression model, constructed as described in Section 2.2.3. First, we simply fill in the values of the observations in the model and save the fitted values of whether someone is insured or not. Second, we add a random error term from the standard normal distribution to prevent the logistic regression model from obtaining a perfect fit; when the resulting value is positive, the person is insured. Third, the former values of whether someone is insured or not are replaced by the newly constructed values. Again, we perform the research method described in Section 2.3.1 on this newly generated dataset.
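One possible reading of this data generating process is sketched below (an illustration under assumptions, not the author's code: it takes the 'fitted values' to be the linear index of the second logistic regression model, and the data are synthetic stand-ins).

```python
# Sketch of the linear data generating process described above (one reading):
# take the linear index of a fitted logistic regression, add standard normal
# noise, and label an observation 'Insured' when the result is positive.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame({"income": rng.normal(50, 15, 3000),
                  "female": rng.integers(0, 2, 3000)})   # stand-in variables
y = (rng.random(3000) < 0.5).astype(int)                 # stand-in 50% outcome

# Stand-in for the 'second logistic regression model' of Section 2.2.3
model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

linear_index = model.fittedvalues            # X * beta_hat on the logit scale
noise = rng.standard_normal(len(X))          # standard normal error term
insured_new = (linear_index + noise > 0).astype(int)   # new 'Insured' labels

df_linear = X.assign(insured=insured_new)    # replaces the former outcome
print("share insured in generated data:", df_linear["insured"].mean())
```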

3.  Results  

In  this  chapter,  the  results  of  the  dataset  where  83%  of  the  observed  people  are   insured  are  described  in  Section  3.1.  The  results  of  the  dataset  where  50%  of  the   observed  people  are  insured  are  described  in  Section  3.2.  The  results  of  the   dataset  where  50%  of  the  observed  people  are  insured  and  the  data  has  a  linear   relation  with  whether  someone  is  insured  or  not  are  described  in  Section  3.3.   Finally,  in  Section  3.4,  a  comparison  of  the  logistic  regression  and  tree  models  is   made.  

The results are shown in Table 1, Table 5 and Table 9. The upper part of these tables gives the prediction power of the different models: first for the F75 training set, as described in Section 2.3, second for the L75 training set, third for the M75 training set, and finally, in the fifth column, the average over the three preceding sets. The minimal prediction power is established by calculating the prediction power of the most basic model, which only has 'Insured' as an outcome. It is calculated by dividing the number of 'Insured' observations in the test set by the total number of observations in the test set; it is simply the percentage of 'Insured' in the test set. The last column gives the t-value, calculated as described in Section 2.3.1, taking the minimal prediction power as reference.

3.1  Results  of  the  83%  dataset  

In this section, Section 3.1.1 discusses the results of the logistic regression models and Section 3.1.2 those of the classification tree, cost complexity pruning, bagging and random forest.

3.1.1  Logistic  regression  models  

The results of the two logistic regression models are shown in Table 1. The first model includes all available variables and the second only the significant variables from the first model (α = 0.05). The prediction power of the first model is on average 0.8419 and that of the second 0.8413, making the first the model with the highest prediction power.

 

Looking at the t-values, neither of the logistic regression models has significantly better prediction power than the minimal prediction power. Therefore, both logistic regression models fail to add significant prediction power to the most simple model, which only has 'Insured' as an outcome. This may be caused by the fact that 83% of the observed people are insured, which causes weak performance of classification models, as described in Section 2.3.2.

Table 1
Prediction power                 | F75    | L75    | M75    | Average | t-value
Tree 1: Classification tree      | 0.8285 | 0.8336 | 0.8294 | 0.8305  | 0
Tree 2: Cost complexity pruning  | 0.8285 | 0.8336 | 0.8294 | 0.8305  | 0
Tree 3: Random forest            | 0.8408 | 0.8314 | 0.8388 | 0.8370  | 0.8258
Tree 4: Bagging                  | 0.8478 | 0.8509 | 0.8482 | 0.8490  | 2.3795
Logistic regression model 1      | 0.8399 | 0.8455 | 0.8402 | 0.8419  | 1.4514
Logistic regression model 2      | 0.8413 | 0.8431 | 0.8395 | 0.8413  | 1.3740
Minimal prediction power         | 0.8285 | 0.8336 | 0.8294 | 0.8305  | 0


It is striking that, on average, the second model predicts worse than the first model. In the second model, all insignificant variables from the first model are removed, reducing the overfitting problem. Variables that are not significant are not very important for explaining the dependent variable or suffer from endogeneity, so one would expect the prediction power to increase without them. In these results, this is not the case, which could also be caused by the fact that 83% of the observed people are insured.

In Table 2, the five most significant variables of the second model constructed on the L75 training set are shown. The full table is depicted in Appendix 2A. Compared to the first model, all variables except health are still significant. The five most significant variables for predicting whether someone is insured or not, using the logistic regression, are female, live_with_spouse, employ, disab_pop, and income.

Table 2
Coefficients      | Estimate  | Std. Error | z value   | Pr(>|z|)
female            | -4.35E+02 | 4.70E+01   | -9.27E+00 | 2.00E-16
live_with_spouse  | -4.74E+02 | 4.77E+01   | -9.93E+00 | 2.00E-16
employ            | 7.05E+02  | 6.92E+01   | 1.02E+01  | 2.00E-16
disab_pop         | -1.71E+03 | 1.12E+02   | -1.52E+01 | 2.00E-16
income            | 2.91E-02  | 1.57E-03   | 1.85E+01  | 2.00E-16

3.1.2  Tree  models  

The results of the four tree models are also given in Table 1. The prediction power of the first model is on average 0.8305, of the second 0.8305, of the third 0.8370 and of the fourth 0.8490, making the fourth model (bagging) the best and the first two models the worst. Moreover, bagging is the only tree model that adds significant prediction power compared to the minimal prediction power (t-value = 2.38).

We find that cost complexity pruning does not improve the prediction power of the classification tree and that neither model has any distinctive power (t-value = 0): their prediction powers are exactly the same as the minimal prediction power. This is easy to explain when we look at the actual classification trees. Both the classification tree and the pruned tree only give 'Insured' as an outcome, and pruning a large tree with only the outcome 'Insured' results in a smaller tree with exactly the same outcome. Therefore, no increase in prediction power is obtained by pruning, and the trees are only as good as the minimal prediction power.

Before describing the results of bagging and random forest in Table 1, we start with the results shown in Table 3, where the MSEs of the four tree models are given. The results of the classification tree and cost complexity pruning are again the same. But when we look at random forest and bagging, we find a reduction of the MSE. We would expect random forest to reduce the MSE more than bagging, but bagging causes on average a bigger reduction than random forest. Therefore, we expect bagging to provide the best prediction power, which is confirmed by the prediction powers shown in Table 1.

 

Bagging   and   random   forest   both   improve   on   average   the   prediction   power   of   the   classification   tree   and,   as   expected   according   to   Table  3,   bagging   provides  the  best  results  with  an  average  prediction  power  of  0.8490.  

The following variables were included in the classification tree and the cost complexity pruning tree model: tot_exp, income, age and otp_exp. For the variables used by bagging and random forest, we look at Table 4, where the five most important variables are shown. The full table is depicted in Appendix 3A. The MDG is given for random forest and bagging, and the results are sorted from most to least important according to the MDG. The first nine variables with the highest MDG are the same for random forest and bagging. We find that the two most important variables are tot_exp and income, which are also included in the classification trees and the pruned trees.

Table 3
MSE                      | F75    | L75    | M75    | Average
Classification tree      | 0.1640 | 0.1658 | 0.1643 | 0.1647
Cost complexity pruning  | 0.1640 | 0.1658 | 0.1643 | 0.1647
Random Forest            | 0.1491 | 0.1470 | 0.1445 | 0.1469


Table 4
Importance | Random Forest | MDG   | Bagging | MDG
1          | tot_exp       | 547.3 | tot_exp | 435.3
2          | income        | 437.2 | income  | 406.7
3          | PERWT         | 349.6 | PERWT   | 335.7
4          | age           | 259.7 | age     | 245.9
5          | VARSTR        | 259.3 | VARSTR  | 243.5

3.2  Results  of  the  50%  dataset  

In this section, Section 3.2.1 discusses the results of the logistic regression models and Section 3.2.2 those of the classification tree, cost complexity pruning, bagging and random forest.

3.2.1  Logistic  regression  models  

The results of the two logistic regression models are shown in Table 5. The first model includes all available variables and the second only the significant variables from the first model (α = 0.05). The prediction power of the first model is on average 0.7620 and that of the second 0.7636, making the second model the best.

Table 5
Prediction power                 | F75    | L75    | M75    | Average | t-value
Tree 1: Classification tree      | 0.7466 | 0.7134 | 0.7425 | 0.7342  | 12.7193
Tree 2: Cost complexity pruning  | 0.7480 | 0.7134 | 0.7425 | 0.7346  | 12.7461
Tree 3: Bagging                  | 0.7703 | 0.7791 | 0.7839 | 0.7778  | 15.3467
Tree 4: Random Forest            | 0.7724 | 0.7798 | 0.7873 | 0.7798  | 15.4738
Logistic regression model 1      | 0.7696 | 0.7493 | 0.7669 | 0.7620  | 14.3831
Logistic regression model 2      | 0.7656 | 0.7534 | 0.7717 | 0.7636  | 14.4789
Minimal prediction power         | 0.5129 | 0.5020 | 0.5061 | 0.5070  | 0

Both models have significantly better prediction power than the minimal prediction power. Therefore, both logistic regression models add significant prediction power to the minimal prediction power.


The second model predicts slightly better than the first model, but the difference is not significant (t-value = 0.102). In the second model, all insignificant variables from the first model are removed, reducing the overfitting problem. Variables that are not significant are not very important for explaining the dependent variable or suffer from endogeneity, so one would expect the prediction power to increase without them. These results confirm this expectation, although, as mentioned, not significantly.

In Table 6, the five most significant variables of the second model constructed on the M75 set are shown. The full table is depicted in Appendix 2B. Compared to the first model, all variables are still significant, except for VARSTR. The five most significant variables for predicting whether someone is insured or not, using the logistic regression, are disab_pop, income, live_with_spouse, female, and blackorwhite.

Table 6
Coefficients      | Estimate  | Std. Error | z value | Pr(>|z|)
disab_pop         | -1.42E+00 | 1.54E-01   | -9.251  | 2.00E-16
income            | 2.13E-05  | 1.84E-06   | 11.611  | 2.00E-16
live_with_spouse  | -8.98E-01 | 1.67E-01   | -5.366  | 8.04E-08
female            | -3.52E-01 | 6.60E-02   | -5.326  | 1.00E-07
blackorwhite      | 3.57E-01  | 7.39E-02   | 4.835   | 1.33E-06

3.2.2  Tree  models  

The results of the four tree models are also given in Table 5. The prediction power of the first model is on average 0.7342, of the second 0.7346, of the third 0.7778 and of the fourth 0.7798, making the fourth model (random forest) the best and the first model (classification tree) the worst.

We find that cost complexity pruning improves the prediction power of the classification tree, but not significantly (t-value = 0.025). Both models have distinctive power compared to the minimal prediction power (t-value > 2).

Before describing the results of bagging and random forest in Table 5, we start with the results in Table 7, where the MSEs of the tree models are given. The results of the classification tree and cost complexity pruning are about the same. But when we look at random forest and bagging, we find a reduction of the average MSE from 0.2253 for bagging to 0.2172 for random forest. In line with Table 7, we expect random forest to reduce the MSE more than bagging and therefore to provide the best prediction power.

 

As shown in Table 5, bagging and random forest both improve on average the prediction power of the classification tree and, as expected according to Table 7, random forest provides the best results with an average prediction power of 0.7798.

The following variables are included in the classification tree and the cost complexity pruning tree model: tot_exp, income and age. For the five most important variables used by bagging and random forest, we look at Table 8. The full table is depicted in Appendix 3B. The MDG is given for random forest and bagging, and the results are sorted from most to least important according to the MDG. The four most important variables are the same for random forest and bagging, and the two most important variables are tot_exp and income.

Table 7
MSE                      | F75    | L75    | M75    | Average
Classification tree      | 0.2604 | 0.2438 | 0.2477 | 0.2506
Cost complexity pruning  | 0.2604 | 0.2479 | 0.2477 | 0.2520
Random Forest            | 0.2148 | 0.2183 | 0.2184 | 0.2172

Table 8
Importance | Random forest | MDG   | Bagging | MDG
1          | tot_exp       | 243.2 | tot_exp | 421.0
2          | income        | 220.6 | income  | 297.3
3          | otp_exp       | 171.6 | otp_exp | 200.1
4          | age           | 159.1 | age     | 180.1
5          | tot_otp_vis   | 158.7 | PERWT   | 175.7


3.3  Results  of  the  linear  50%  dataset  

In this section, Section 3.3.1 discusses the results of the logistic regression models and Section 3.3.2 those of the classification tree, cost complexity pruning, bagging and random forest.

3.3.1  Logistic  regression  models  

The results of the two logistic regression models are shown in Table 9. The first model includes all available variables and the second only the significant variables from the first model (α = 0.05). The prediction power of the first model is on average 0.9162 and that of the second 0.9214, making the second model the best.

Table 9
Prediction power                 | F75    | L75    | M75    | Average | t-value
Tree 1: Classification tree      | 0.8604 | 0.8584 | 0.8665 | 0.8618  | 14.4712
Tree 2: Cost complexity pruning  | 0.8557 | 0.8584 | 0.8665 | 0.8602  | 14.3581
Tree 3: Bagging                  | 0.8984 | 0.8869 | 0.8923 | 0.8925  | 16.7354
Tree 4: Random Forest            | 0.9038 | 0.8923 | 0.8970 | 0.8977  | 17.1298
Logistic regression model 1      | 0.9153 | 0.9085 | 0.9248 | 0.9162  | 18.5721
Logistic regression model 2      | 0.9228 | 0.9133 | 0.9282 | 0.9214  | 18.9862
Minimal prediction power         | 0.6504 | 0.6104 | 0.6287 | 0.6299  | 0

Both models have significantly better prediction power than the minimal prediction power. Therefore, both logistic regression models add significant prediction power to the most basic model, which only has 'Insured' as an outcome.

The second model predicts slightly better than the first model, but the difference is not significant (t-value = 0.517). In the second model, all insignificant variables from the first model are removed, reducing the overfitting problem. Variables that are not significant are not very important for explaining the dependent variable or suffer from endogeneity, so one would expect the prediction power to increase without them. The results above confirm this expectation, although not significantly.

In Table 10, the five most significant variables of the second model constructed on the M75 set are shown. The full table is depicted in Appendix 2C. All variables, except for marr and VARSTR, are still significant. The five most significant variables for predicting whether someone is insured or not, using the logistic regression, are female, age_grp, disab_pop, income, and blackorwhite.

 

Table 10
Coefficients   | Estimate  | Std. Error | z value | Pr(>|z|)
female         | -1.00E+00 | 1.21E-01   | -8.268  | 2.00E-16
age_grp        | 7.00E-01  | 8.03E-02   | 8.725   | 2.00E-16
disab_pop      | -4.38E+00 | 2.83E-01   | -15.441 | 2.00E-16
income         | 6.79E-05  | 3.53E-06   | 19.237  | 2.00E-16
blackorwhite   | 1.44E+00  | 1.32E-01   | 10.931  | 2.00E-16

3.3.2 Tree models

The results of the four tree models are also given in Table 9. The prediction power of the first model is on average 0.8618, of the second 0.8602, of the third 0.8925 and of the fourth 0.8977, making the fourth model (random forest) the best and the second model (cost complexity pruning) the worst.

We find that cost complexity pruning does not improve the prediction power of the classification tree. This may be caused by the fact that the data is now linear, whereas CTs work best with nonlinear data. Still, both models have distinctive power compared to the minimal prediction power (t-value > 2).

Before describing the results of bagging and random forest in Table 9, we start with the results in Table 11, where the MSEs of the tree models are given. The MSEs of the classification tree and cost complexity pruning are much higher than those of bagging and random forest. When we look at random forest and bagging, we find a reduction of the average MSE from 0.1052 for bagging to 0.1021 for random forest. In line with the results in Table 11, we expect random forest to reduce the MSE more than bagging and therefore to provide the best prediction power.

As shown in Table 9, bagging and random forest both improve on average the prediction power of the classification tree and, as expected according to Table 11, random forest provides the best results with an average prediction power of 0.8977.

The following variables are included in the classification tree and the cost complexity pruning tree model: tot_exp, income and age. For the variables used by bagging and random forest, we look at Table 12. The MDG is given for random forest and bagging, and the results are sorted from most to least important according to the MDG. The two most important variables, tot_exp and income, are the same for random forest and bagging.

Table 11
MSE                      | F75    | L75    | M75    | Average
Classification tree      | 0.1290 | 0.1346 | 0.1323 | 0.1320
Cost complexity pruning  | 0.1344 | 0.1644 | 0.1323 | 0.1437
Random Forest            | 0.0998 | 0.1009 | 0.1021 | 0.1009

Table 12
Importance | Bagging | MDG   | Random forest | MDG
1          | tot_exp | 803.8 | tot_exp       | 328.4
2          | income  | 345.5 | income        | 242.1
3          | age     | 228.9 | rx_exp        | 213.6
4          | PERWT   | 110.6 | age           | 203.9
5          | EDUCYR  | 96.8  | otp_exp       | 132.9

3.4 Comparison

First, the results of the 83% dataset are compared, second those of the 50% dataset, third those of the linear 50% dataset, and fourth a comparison of the model interpretation is made. Last, all results from the three datasets are compared with each other.

3.4.1  Comparison  of  the  models  constructed  with  the  83%  dataset  

The  logistic  regression  models  1  and  2  have  higher  prediction  power  than  the  CT   with  and  without  cost  complexity  pruning.  But  when  we  reduce  the  variance  of  


the prediction trees by bagging and random forest, we find better results compared to the logistic regression models. When we look at the significance of the models, only bagging adds significant prediction power to the minimal prediction power; the other three tree models and the two logistic regression models do not improve the minimal prediction power significantly. So the only model with any distinctive power is bagging, but this model is not significantly better than the best logistic regression model (t = 0.535).

When we look at the variables included in the models, some differences appear. All models include income and PERWT or categorize them among the most important variables, but the other highly significant variables from the logistic regression model, namely disab_pop, female, employ, live_with_spouse and blackorwhite, which are all dummy variables, do not appear in the tree models. Conversely, the highly important tree variables age, tot_exp, otp_exp and VARSTR, which are all continuous variables, do not appear in the logistic regression models. Logistic regression thus prefers dummy variables and CT prefers continuous variables.

3.4.2  Comparison  of  models  constructed  with  the  50%  dataset  

All models add significant prediction power to the minimal prediction power, with t-values of at least 12. Relative to their minimal prediction powers, these models therefore perform much better than the models constructed with the 83% dataset. The logistic regression models perform better than the classification tree with and without cost complexity pruning, but are beaten by bagging and random forest. The best tree model is obtained by using random forest and the best logistic regression model by including only the significant variables, but the difference between these two models is not significant (t = 1.05); they therefore perform equally well.

When we look at the variables included in the models, some differences again appear. All models include income or categorize it among the most important variables, but the other highly significant variables from the logistic regression model, namely disab_pop, female, employ, age_grp and blackorwhite, which are again all dummy variables, do not appear among the most important variables of the tree models. Conversely, the highly important tree variables age and VARSTR, which are both continuous variables, do not appear in the logistic regression models. It is logical that the tree models include continuous variables: they can construct dummy variables (branches) themselves by splitting the observations in the best way possible. This difference is best explained by looking at age, a continuous variable, and age_grp, a dummy variable. The tree models choose age so that they can make their own split instead of being bound to the predetermined splits of age_grp; consequently, the tree models create a split that is at least as good as the ones from age_grp. The logistic regression models choose age_grp. Apparently there exists some nonlinearity in the data, causing the logistic regression model to choose age_grp over age.

3.4.3  Comparison  of  models  constructed  with  the  linear  50%  dataset  

All models add significant prediction power to the minimal prediction power, with t-values of at least 14. These models therefore perform much better than the models constructed with the 83% dataset. Bagging and random forest perform better than the classification tree with and without cost complexity pruning, but this time the logistic regression models perform better than the tree models. The best tree model is obtained by using random forest and the best logistic regression model by including only the significant variables, and the difference between them is significant (t = 2.245). Therefore, the logistic regression model is the overall best model. This result was expected, because the data were generated using the logistic regression model.

When we look at the variables included in the models, some differences appear. All models include income or categorize it among the most important variables, but the other highly significant variables from the logistic regression model, namely disab_pop, female and live_with_spouse, which are again all dummy variables, do not appear in the tree models. Conversely, the highly important tree variables age and VARSTR, which are both continuous variables, do not appear in, or are not significant in, the logistic regression models. Again, we can explain these differences in the same way as in Section 3.4.2, with age and age_grp.
