Predicting Whether Someone is Insured or Not: A Comparison Between
Classification Trees and Logistic Regression
Abstract
Logistic regression and classification trees are compared on their prediction power, variable selection and interpretability when predicting whether someone is insured or not. This is done on three different datasets, in which the proportion of insured observations and the degree of linearity in the data are varied. Logistic regression performs as well as classification trees on nonlinear data and better on linear data. Logistic regression tends to select dummy variables and classification trees continuous variables in their prediction models, but both select income as an important variable. Nonlinearity in the data is easier to interpret with the classification tree.
Author: Steyn Heskes
Student number: 6350399
Supervisor: Dr. N. P. A. van Giersbergen
Subject: Econometrics and Big Data
Bachelor of Science in Econometrics, University of Amsterdam
INDEX
1. Introduction
2. Data and methods
2.1. Data
2.2. Techniques
2.2.1 The classification tree
2.2.2 Cost complexity pruning
2.2.3 The logistic regression model
2.2.4 Bagging
2.2.5 Random forest
2.3 Method
2.3.1 Research method
2.3.2 Data generating process and data manipulation
3. Results
3.1 Results of the 83% dataset
3.1.1 Logistic regression models
3.1.2 Tree models
3.2 Results of the 50% dataset
3.2.1 Logistic regression models
3.2.2 Tree models
3.3 Results of the linear 50% dataset
3.3.1 Logistic regression models
3.3.2 Tree models
3.4 Comparison
3.4.1 Comparison of the models constructed with the 83% dataset
3.4.2 Comparison of models constructed with the 50% dataset
3.4.3 Comparison of models constructed with the linear 50% dataset
3.4.4 Interpretation
3.4.5 Overall comparison
4. Conclusion
5. Literature cited
Appendix 2A – Coefficients 83% dataset logistic regression model 2
Appendix 2B – Coefficients 50% dataset logistic regression model 2
Appendix 2C – Coefficients linear 50% dataset logistic regression model 2
Appendix 3A – Importance 83% dataset bagging and random forest
Appendix 3B – Importance 50% dataset bagging and random forest
Appendix 3C – Importance linear 50% dataset bagging and random forest
1. Introduction
Nowadays, computers are more and more involved in economic transactions and automatically save information about these transactions. Over the past three decades, datasets have consequently grown bigger and bigger, resulting in extraordinarily large datasets. For example, companies such as Google capture around 20 billion URLs a day and over 100 billion search queries each month; in total they have captured around 30 trillion URLs. Econometricians, however, are used to dealing with datasets that fit in a single spreadsheet. Big datasets require a different approach for three main reasons. First, the datasets are so big that they require more powerful tools to manipulate the data. Second, sometimes more variables are available than preferred, causing the overfitting problem. Third, big datasets may contain more flexible relations than simple linear ones (Varian, 2013).
To deal with these datasets, new techniques have been developed, called machine learning techniques. These techniques differ from Ordinary Least Squares (OLS) in three main respects (Varian, 2013). First, machine learning techniques focus on summarizing nonlinearity in the data, in contrast to OLS, which focuses solely on linear relations. Second, OLS focuses on explaining the data as well as possible within the available sample, whereas machine learning techniques focus on forecasting outside the sample. This can easily be explained by recent developments. Twenty years ago, datasets were not that big and it was therefore not possible to check a model with additional data. Today, there is so much data that it is possible to select a sample from the data to create a model and then predict the remaining out-of-sample data to validate the model. One can then calculate how much of the out-of-sample data is correctly predicted; repeating this procedure k times, each time holding out a different part of the data, is known as k-fold cross-validation. Third, machine learning techniques also provide methods to deal with the overfitting problem. By selecting the most powerful variables and removing the superfluous ones, the model becomes less complex and its prediction power improves.
In this paper, the machine learning technique Classification Trees (CT) and the commonly used logistic regression model are compared for three main reasons. First, both models are popular with statisticians, machine learning researchers, data analysts and econometricians. Second, they are both classification methods, used for example to classify emails into the categories 'spam' and 'no spam'. Third, they are both quite easy to use. Besides, CTs are becoming more popular for several reasons. First, they are easy to interpret. Second, by pruning CTs, their prediction power is expected to increase beyond that of logistic regression models, provided the dataset is big enough (Candeviren, 2006; Perlich, 2003). Third, CTs are a good method for capturing nonlinearities in the data.
The aim of this paper is to examine which method, CT or logistic regression, makes better predictions, by comparing the obtained results using k-fold cross-validation. Both methods try to explain whether someone is insured or not on the basis of demographic and socioeconomic characteristics and information about health care expenditures.
The rest of the paper is structured as follows. The data, data manipulation, techniques and research method are described in Section 2. The results are discussed in Section 3, after which a conclusion is drawn in Section 4.
2. Data and methods
First, the data and the manipulation process used to obtain the results are described in Section 2.1. Second, the techniques used to create and improve the prediction models are described in Section 2.2. Third, the research method is described in Section 2.3.
2.1. Data
The Medical Expenditure Panel Survey (MEPS) is a nationally representative survey of the U.S. civilian non-institutionalized population, set up in 1996 by the U.S. Department of Health and Human Services (Shen, 2013). This population consists of people of 16 years and older living in the 50 states and the District of Columbia who are not inmates of homes for the aged or of penal and mental facilities, and who are not on active duty in the Armed Forces. MEPS consists of surveys of households, employers and medical providers, collecting information about health care expenditures, health insurance coverage, and demographic and socioeconomic characteristics.
This study uses the MEPS data 'Basevar.xlsx', collected in 2005, which consists of 33,964 observations. For this study, the data is manipulated in several ways. First, both the classification tree and the logistic regression require complete data, because missing values cause problems for both methods; therefore, rows with missing values are removed. Second, all data about the amounts compensated by insurance companies are removed, because when such an amount is greater than zero, the person is almost certainly insured. Last, two variables are recoded into two groups: hlthins is recoded into 'Insured' and 'Not insured' and race into 'White' and 'Other'. The variables that survive this manipulation process are described in Appendix 1.
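These manipulation steps can be sketched in a few lines of pandas. The sketch below is illustrative only: the column names, the coding of hlthins and race, and the name of the reimbursement variable are assumptions that are not reported in the text.

```python
import pandas as pd
import numpy as np

# Load the 2005 MEPS extract used in this study.
meps = pd.read_excel("Basevar.xlsx")

# Keep only complete cases: both the classification tree and the logistic
# regression require data without missing values.
meps = meps.dropna()

# Drop the variable(s) recording amounts reimbursed by insurers (placeholder
# column name), since a positive amount almost certainly implies coverage.
meps = meps.drop(columns=["ins_paid_total"], errors="ignore")

# Collapse hlthins and race into two categories each (codings assumed).
meps["hlthins"] = np.where(meps["hlthins"] == 1, "Insured", "Not insured")
meps["race"] = np.where(meps["race"] == "White", "White", "Other")
```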
2.2. Techniques
This section describes five techniques, which are used to estimate and improve the different models, namely the classification tree in Section 2.2.1, cost
complexity pruning in Section 2.2.2, the logistic regression model in Section 2.2.3, bagging in Section 2.2.4, and random forest in Section 2.2.5.
2.2.1 The classification tree
A CT consists of a network of nodes, where each node denotes a variable; a fictitious example is shown in Figure 1. At each node, two or more branches denote certain values or value ranges, after which a new node or a terminal node appears. The start node, or root node, is the first node of the tree, where the branches partition the whole dataset into classes. In Figure 1, the start node sends observations older than 45 into the left branch and observations of 45 years or younger into the right branch. Each partitioning is based on finding the most homogeneous subsets (branches) within the variable. This way, an observation travels top-down through the tree and ends in a terminal node. The terminal node contains a value of the dependent variable, such as 'Yes' or 'No' in Figure 1.
Classification means dividing data into several categories based on properties of the observations. The aim of the CT method is to construct a tree that predicts the out-of-sample data as well as possible (Candeviren, 2006). This method belongs to the family of Classification And Regression Trees (CART) methods. Trees prove to be good tools for summarizing important nonlinearity in the data, and they work well with large amounts of data (Varian, 2013).
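As an illustration, a classification tree of this kind can be grown with scikit-learn as sketched below. This is not necessarily the software used for the thesis; X_train and y_train stand for the training data of Section 2.1 and are assumed to be available.

```python
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# X_train: predictors (demographic, socioeconomic and expenditure variables),
# y_train: insurance status ('Insured' / 'Not insured').
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_train, y_train)

# Each internal node splits on the variable and cut-off that give the most
# homogeneous child nodes; terminal nodes carry the predicted class.
plt.figure(figsize=(10, 6))
plot_tree(tree, feature_names=list(X_train.columns),
          class_names=list(tree.classes_), filled=True)
plt.show()
```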
2.2.2 Cost complexity pruning
CTs can grow large and become very complex. This is desirable for explaining the training data, but likely to overfit the data, resulting in poor prediction power. Cost complexity pruning offers a solution by introducing a penalty term on the number of terminal nodes. The subtree with the lowest Mean Squared Error (MSE), which depends on the number of nodes, is expected to be the tree with the best prediction power (Hastie et al., 2013). Besides, the pruned tree is less complex and therefore easier to interpret.
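A minimal sketch of cost complexity pruning, again with scikit-learn and assuming the training data X_train, y_train from above: the penalty parameter alpha is chosen by cross-validated error, mirroring the idea of selecting the subtree with the lowest error.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Grow a large tree and compute the sequence of subtrees produced by cost
# complexity pruning (one subtree per value of the penalty alpha).
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Pick the alpha whose subtree has the highest cross-validated accuracy
# (i.e. the lowest cross-validated error).
scores = []
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    scores.append(cross_val_score(pruned, X_train, y_train, cv=5).mean())
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
```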
2.2.3 The logistic regression model
The logistic regression model divides observations, based on their properties, into several categories or classes of the dependent variable, just like CTs. There are no restrictions on the independent variables; they may be categorical or numerical. In this paper, the dependent variable consists of just two categories. The logistic regression model is therefore defined as follows (Heij et al., 2004):
$$ g(x) = \log\frac{P(y=1\mid x)}{P(y=0\mid x)} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j $$
The coefficients $\beta_j$ are estimated by maximizing the log-likelihood function. The estimated model can then be interpreted in terms of the signs and significance of the estimated coefficients.
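For illustration, such a model can be estimated with the statsmodels package as sketched below. The 0/1 coding of the outcome and the variable names are assumptions; the reported z-values and p-values are what the significance-based selection of the second model in Section 2.3.1 relies on.

```python
import statsmodels.api as sm

# y_train_binary: 1 = 'Insured', 0 = 'Not insured'; X_train: predictors.
X_const = sm.add_constant(X_train)
logit_model = sm.Logit(y_train_binary, X_const).fit()

# Coefficients are estimated by maximising the log-likelihood; the summary
# reports z-values and p-values for each regressor.
print(logit_model.summary())

# Keep only regressors that are significant at the 5% level (model 2).
significant = logit_model.pvalues[logit_model.pvalues < 0.05].index
```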
2.2.4 Bagging
This technique is used to improve the performance of statistical learning methods such as decision trees. The decision trees in Section 2.2.1 and Section 2.2.2 suffer from high variance (Hastie et al., 2013): the results obtained from two different samples can be quite different. A decision tree with low variance is expected to provide similar results when applied repeatedly to distinct datasets. Bagging is a commonly used method for reducing variance.
Generally, bagging uses B separate training sets to build B prediction models and averages the resulting predictions. In other words, first, $\hat{f}^{1}(x), \ldots, \hat{f}^{B}(x)$ are calculated using training sets 1 to B. Second, the low-variance statistical learning model is defined as follows:
$$ \hat{f}_{\mathrm{avg}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{b}(x) $$
Unfortunately, multiple training sets are usually not available. Therefore, a technique called the bootstrap is used. First, the bootstrap takes repeated samples (with replacement) from the training data set. This way, B different bootstrapped training data sets are created. Second, the model is trained on the $b$-th bootstrapped training set in order to obtain $\hat{f}^{*b}(x)$. Third, the predictions are averaged to obtain
$$ \hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x) $$
For decision trees, bagging works in almost the same way. First, B trees are constructed on the basis of B bootstrapped training sets, and the resulting predictions are averaged to obtain the final model. The B individual trees suffer from high variance but have low bias, whereas the averaged model has lower variance at the cost of some bias. In the case of classification trees, instead of taking the average, the most commonly occurring class among the B predictions (the majority vote) is used to obtain the final prediction.
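A sketch of this procedure with scikit-learn's BaggingClassifier, whose default base learner is a decision tree; X_train, y_train and X_test are the assumed training and test data.

```python
from sklearn.ensemble import BaggingClassifier

# 100 trees, each grown on a bootstrap sample of the training data; the
# prediction for a test observation is the majority vote over the 100 trees.
bagged = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bagged.fit(X_train, y_train)
y_pred = bagged.predict(X_test)
```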
So the advantage of bagging is that it improves prediction accuracy, but it also has a disadvantage: when we bag a large number of trees, it is no longer possible to display the result as a single tree, so it is no longer clear which variables are most important to the final prediction model. The Gini index offers a solution.
The Gini index is defined as follows,
$$ G = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}) $$
where m denotes the $m$-th region (the regions into which the nodes split the data), k the $k$-th class (in this case 1 or 2, i.e. 'Insured' or 'Not insured') and $\hat{p}_{mk}$ is the proportion of observations in the training set in the $m$-th region that belong to the $k$-th class. The Gini index measures the total variance across the K classes. By adding up the total amount by which the Gini index is decreased by splits over a given predictor, averaged over all B trees, the predictors with the largest mean decrease are identified as the most important variables. This measure is called the Mean Decrease in Gini index (MDG).
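As a small worked example: a region with 80% 'Insured' and 20% 'Not insured' observations has G = 0.8 × 0.2 + 0.2 × 0.8 = 0.32, and the index shrinks towards zero as the region becomes purer. The helper below (a sketch, not part of the original analysis) computes the same quantity.

```python
def gini(proportions):
    """Gini index G = sum_k p_mk * (1 - p_mk) for one region of a tree."""
    return sum(p * (1 - p) for p in proportions)

# A region with 80% 'Insured' and 20% 'Not insured' observations:
print(gini([0.8, 0.2]))  # 0.32; smaller values indicate a purer region
```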
2.2.5 Random forest
Bagging considers all p variables as split candidates at each split. This way, most of the trees will use the strongest predictor in their start node, resulting in B quite similar trees. Consequently, the predictions from the bagged trees will also be quite similar and thus highly correlated. But averaging over a large number of highly correlated bagged trees does not lead to a strong variance reduction.
Random forest overcomes this problem by decorrelating the bagged trees. In contrast to bagging, random forest considers only a subset of m < p variables as split candidates at each split, and the m considered variables are randomly chosen anew at each split. Therefore, the strongest predictor will not even be considered in (p − m)/p of the splits. This way, other predictors get a better chance to enter the model, which decreases the variance of the average of the resulting trees and thereby makes it more reliable. In this paper, we use the recommended $m = \sqrt{p}$ (Hastie et al., 2013). As with bagging, the variables with the largest MDG are the most important variables.
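The sketch below fits such a forest with scikit-learn, using max_features="sqrt" for m = √p; its feature_importances_ attribute reports the (normalised) mean decrease in Gini impurity, the counterpart of the MDG used in this paper. X_train and y_train are again assumed to be available.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# 100 trees, but only sqrt(p) randomly chosen variables are considered as
# split candidates at each split, which decorrelates the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# Variable importance ranking, analogous to the MDG tables in Section 3.
importance = pd.Series(forest.feature_importances_, index=X_train.columns)
print(importance.sort_values(ascending=False).head(5))
```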
2.3 Method
In this section, first the research method is described in Section 2.3.1, and then the data generating process and data manipulation are described in Section 2.3.2.
2.3.1 Research method
To decide which of the two methods, logistic regression or the classification tree, best predicts whether someone is insured or not, we first divide the data into two parts, consisting of 75% and 25% of the data. These sets are called the training set and the test set, respectively. This is done in three different ways: using the first 75%, the last 75% and the middle 75% of the data as training set, named F75, L75 and M75, respectively. The training data is used to construct a prediction model with each of the two methods. The prediction power is then examined (validated) by predicting the test set: the number of correctly predicted observations in the test set (successes) divided by the total number of observations in the test set gives the prediction power; a sketch of these splits and of the prediction power measure is given below. The models are subsequently compared on their prediction power and variable importance, by looking at the MDG and the p-values of the variables.
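The sketch below illustrates the three splits and the prediction power measure; the exact definition of the 'middle 75%' (dropping 12.5% of the observations at each end) is an assumption.

```python
import numpy as np

def split_indices(n):
    """Indices of the three 75% training sets: first, last and middle 75%."""
    cut = int(0.75 * n)
    return {
        "F75": np.arange(cut),                   # first 75%
        "L75": np.arange(n - cut, n),            # last 75%
        "M75": np.arange(n // 8, n // 8 + cut),  # middle 75% (assumed definition)
    }

def prediction_power(y_true, y_pred):
    """Share of correctly classified observations in the test set."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))
```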
Second, we construct a CT using the training set, as described in Section 2.2.1. The CT is then used to predict the test data, and to determine how well the CT predicts, the prediction power is calculated. Next, the CT is pruned as described in Section 2.2.2 by including a penalty for each node; the bigger the penalty, the smaller the tree will be. We expect the prediction power to increase after pruning (James et al., 2013) and select the tree with the lowest MSE.
Third, to improve the results of the CT, new CTs are constructed by bagging as in Section 2.2.4. First, 100 bagged trees are grown, considering all p variables at each split. The final prediction model is given by the majority vote of the 100 bagged trees. The prediction power is again calculated by predicting the test set and validating the results.
Fourth, random forest is used in the same way as bagging, growing 100 trees, but instead of considering all p variables at each split, only $m = \sqrt{p}$ variables are considered, as described in Section 2.2.5. The final prediction model again takes the majority vote of the 100 trees, after which the prediction power is calculated.
Fifth, we predict whether someone is insured or not using the logistic regression model as described in Section 2.2.3, on the same training data as used for the CT. We construct two different models. The first model includes all variables; it is used to predict the test data, and its prediction power is calculated in the same way as for the CT. For the second model, we look at the significance of the variables: variables that are not significant at the 5% level are removed. Afterwards, we calculate the prediction power again.
Last, the results are compared on the basis of their prediction power, the variables that are included in each model, and which of those are most significant. To decide whether one of the prediction models predicts best, we test $H_0\!: p_1 = p_2$ by calculating the t-value as follows,
$$ \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\bar{p}(1-\bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim N(0,1), \qquad \bar{p} = \frac{\hat{p}_1 + \hat{p}_2}{2} $$
where $\hat{p}_1$ and $\hat{p}_2$ are the prediction powers of two different models (or of a model and the minimal prediction power) and $n_1$ and $n_2$ are the numbers of observations in the test set. The critical value is 2.00, approximately the two-sided 5% critical value of the standard normal distribution.
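The test statistic can be computed with a few lines of code, as sketched below; the numbers in the usage line are hypothetical and do not refer to the tables in Section 3.

```python
from math import sqrt

def prediction_power_test(p1, p2, n1, n2):
    """z-statistic for H0: p1 = p2, with pooled proportion (p1 + p2) / 2 as in
    Section 2.3.1; n1 and n2 are the test-set sizes (equal in this study)."""
    p_bar = (p1 + p2) / 2
    return (p1 - p2) / sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))

# Hypothetical illustration: prediction powers 0.78 and 0.76 on test sets of
# 4000 observations each give a statistic of about 2.1, above the critical
# value of 2.00.
print(prediction_power_test(0.78, 0.76, 4000, 4000))
```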
2.3.2 Data generating process and data manipulation
The data is manipulated in two ways: first by changing the proportion of insured and not insured observations in the data, and second by generating data with a linear relation.
In the original dataset, 83% of the people are insured. Consequently, all classification models should predict at least around 83% of the data correctly. This situation causes the logistic regression model to have a high intercept and the CTs to give only 'Insured' as an outcome. In other words, the models tend to predict 'Insured', giving them a high prediction power in the 'Insured' segment, but a low prediction power in the 'Not insured' segment. Therefore, we construct a dataset of 50% 'Insured' and 50% 'Not insured' observations to prevent this problem. Again, we perform the research method on this dataset as described in Section 2.3.1.
We also have to take into account that the data may be nonlinear, and we expect the CT to perform better for nonlinear relations than the logistic regression model. Therefore, we generate linear data by using the second logistic regression model, constructed as described in Section 2.2.3. First, we simply fill in the values of the observations in the model and save the fitted values of whether someone is insured or not. Second, we add a random error term from the standard normal distribution to prevent the logistic regression model from producing a perfect fit; when the resulting value is positive, the person is coded as insured. Third, the former values of whether someone is insured or not are replaced by the newly constructed values. Again, we perform the research method described in Section 2.3.1 on this newly generated dataset.
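A sketch of this data generating step, assuming the fitted second logistic regression (logit_model2) and its design matrix including the constant (X2_const) are available from the earlier estimation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fitted linear predictor x'beta for every observation.
linear_index = np.asarray(X2_const) @ np.asarray(logit_model2.params)
# Standard normal noise prevents a perfect fit.
noise = rng.standard_normal(len(linear_index))

# Positive values of the noisy index are coded as 'Insured'; these values
# replace the original insurance indicator in the new linear 50% dataset.
insured_linear = np.where(linear_index + noise > 0, "Insured", "Not insured")
```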
3. Results
In this chapter, the results of the dataset where 83% of the observed people are insured are described in Section 3.1. The results of the dataset where 50% of the observed people are insured are described in Section 3.2. The results of the dataset where 50% of the observed people are insured and the data has a linear relation with whether someone is insured or not are described in Section 3.3. Finally, in Section 3.4, a comparison of the logistic regression and tree models is made.
The results are shown in Table 1, Table 5 and Table 9. The upper part of these tables gives the prediction power of the different models: first for the F75 training set as described in Section 2.3, second for the L75 training set, third for the M75 training set, and finally, in the fifth column, the average of the three preceding columns. The minimal prediction power is established by calculating the prediction power of the most basic model, which gives only 'Insured' as an outcome. It is calculated by dividing the number of 'Insured' observations in the test set by the total number of observations in the test set, i.e. it is simply the percentage of 'Insured' in the test set. The last column gives the t-value, calculated as described in Section 2.3.1, taking the minimal prediction power as reference.
3.1 Results of the 83% dataset
In this section, Section 3.1.1 discusses the results of the logistic regression models and Section 3.1.2 those of the classification tree, cost complexity pruning, bagging and random forest.
3.1.1 Logistic regression models
The results of the two logistic regression models are shown in Table 1. The first model includes all available variables and the second includes only the variables that are significant in the first model (α = 0.05). The prediction power of the first model is on average 0.8419 and that of the second 0.8413, making the first model the one with the highest prediction power.
Looking at the t-values, neither of the logistic regression models has significantly better prediction power than the minimal prediction power. Therefore, neither logistic regression model adds significant prediction power to the most simple model, which gives only 'Insured' as an outcome. This may be caused by the fact that 83% of the observed people are insured, which causes weak performance of classification models, as described in Section 2.3.2.
Table 1
Prediction power                  F75      L75      M75      Average   t-value
Tree 1: Classification tree       0.8285   0.8336   0.8294   0.8305    0
Tree 2: Cost complexity pruning   0.8285   0.8336   0.8294   0.8305    0
Tree 3: Random forest             0.8408   0.8314   0.8388   0.8370    0.8258
Tree 4: Bagging                   0.8478   0.8509   0.8482   0.8490    2.3795
Logistic regression model 1       0.8399   0.8455   0.8402   0.8419    1.4514
Logistic regression model 2       0.8413   0.8431   0.8395   0.8413    1.3740
Minimal prediction power          0.8285   0.8336   0.8294   0.8305    0
It is remarkable that, on average, the second model predicts worse than the first model. In the second model, all insignificant variables from the first model are removed, reducing the overfitting problem. Variables that are not significant are not very important for explaining the dependent variable or suffer from endogeneity. One would therefore expect the prediction power to increase without the insignificant variables, but in these results this is not the case. This, too, could be caused by the fact that 83% of the observed people are insured.
In Table 2, the five most significant variables of the second model, constructed on the L75 training set, are shown; the full table is depicted in Appendix 2A. Compared to the first model, all variables except health remain significant. The five most significant variables for predicting whether someone is insured or not, using the logistic regression, are female, live_with_spouse, employ, disab_pop and income.
Table 2
Coefficients       Estimate    Std. Error   z value     Pr(>|z|)
female             -4.35E+02   4.70E+01     -9.27E+00   2.00E-16
live_with_spouse   -4.74E+02   4.77E+01     -9.93E+00   2.00E-16
employ             7.05E+02    6.92E+01     1.02E+01    2.00E-16
disab_pop          -1.71E+03   1.12E+02     -1.52E+01   2.00E-16
income             2.91E-02    1.57E-03     1.85E+01    2.00E-16
3.1.2 Tree models
The results of the four tree models are also given in Table 1. The prediction power of the first model is on average 0.8305, of the second 0.8305, of the third 0.8370 and of the fourth 0.8490, making the fourth model (bagging) the best and the first two models the worst. Moreover, bagging is the only tree model that adds significant prediction power compared to the minimal prediction power (t-value = 2.38).
We find that cost complexity pruning does not improve the prediction power of the classification tree and that both models have no distinguishing power at all (t-value = 0): their prediction powers are exactly equal to the minimal prediction power. This is easy to explain when we look at the actual classification trees. Both the classification tree and the pruned tree give only 'Insured' as an outcome, and pruning a large tree that only predicts 'Insured' results in a smaller tree with exactly the same outcome. Therefore, no increase in prediction power is obtained by pruning, and the trees are only as good as the minimal prediction power.
Before describing the results of bagging and random forest in Table 1, we first describe the results shown in Table 3, where the MSEs of the tree models are given. The results of the classification tree and cost complexity pruning are again the same, but for random forest and bagging we find a reduction of the MSE. We would expect random forest to reduce the MSE more than bagging, yet bagging causes on average a bigger reduction than random forest. Therefore, we expect bagging to provide the best prediction power, which is confirmed by the prediction powers shown in Table 1.
Bagging and random forest both improve, on average, the prediction power of the classification tree and, as expected from Table 3, bagging provides the best results, with an average prediction power of 0.8490.
The following variables are included in the classification tree and the cost complexity pruning tree model: tot_exp, income, age and otp_exp. For the variables used by bagging and random forest, we look at Table 4, where the five most important variables are shown; the full table is depicted in Appendix 3A. The MDG is given for random forest and bagging, and the results are sorted from most to least important according to the MDG. The nine variables with the highest MDG are the same for random forest and bagging. We find that the two most important variables are tot_exp and income, which are also included in the classification tree and the pruned tree.
Table 3
MSE                       F75      L75      M75      Average
Classification tree       0.1640   0.1658   0.1643   0.1647
Cost complexity pruning   0.1640   0.1658   0.1643   0.1647
Random Forest             0.1491   0.1470   0.1445   0.1469
Table 4
Importance   Random Forest   MDG     Bagging   MDG
1            tot_exp         547.3   tot_exp   435.3
2            income          437.2   income    406.7
3            PERWT           349.6   PERWT     335.7
4            age             259.7   age       245.9
5            VARSTR          259.3   VARSTR    243.5
3.2 Results of the 50% dataset
In this section, Section 3.2.1 discusses the results of the logistic regression models and Section 3.2.2 those of the classification tree, cost complexity pruning, bagging and random forest.
3.2.1 Logistic regression models
The results of the two logistic regression models are shown in Table 5. The first model includes all available variables and the second includes only the variables that are significant in the first model (α = 0.05). The prediction power of the first model is on average 0.7620 and that of the second 0.7636, making the second model the best.
Table 5
Prediction power                  F75      L75      M75      Average   t-value
Tree 1: Classification tree       0.7466   0.7134   0.7425   0.7342    12.7193
Tree 2: Cost complexity pruning   0.7480   0.7134   0.7425   0.7346    12.7461
Tree 3: Bagging                   0.7703   0.7791   0.7839   0.7778    15.3467
Tree 4: Random Forest             0.7724   0.7798   0.7873   0.7798    15.4738
Logistic regression model 1       0.7696   0.7493   0.7669   0.7620    14.3831
Logistic regression model 2       0.7656   0.7534   0.7717   0.7636    14.4789
Minimal prediction power          0.5129   0.5020   0.5061   0.5070    0
Both models have significantly better prediction power than the minimal prediction power and thus add significant prediction power to the most basic model.
The second model predicts slightly better than the first model, but the difference is not significant (t-value = 0.102). In the second model, all insignificant variables from the first model are removed, reducing the overfitting problem. Variables that are not significant are not very important for explaining the dependent variable or suffer from endogeneity. One would therefore expect the prediction power to increase without the insignificant variables. These results confirm this expectation, although the improvement is not significant, as mentioned before.
In Table 6, the five most significant variables of the second model, constructed on the M75 set, are shown; the full table is depicted in Appendix 2B. Compared to the first model, all variables remain significant except VARSTR. The five most significant variables for predicting whether someone is insured or not, using the logistic regression, are disab_pop, income, live_with_spouse, female and blackorwhite.
Table 6
Coefficients       Estimate    Std. Error   z value   Pr(>|z|)
disab_pop          -1.42E+00   1.54E-01     -9.251    2.00E-16
income             2.13E-05    1.84E-06     11.611    2.00E-16
live_with_spouse   -8.98E-01   1.67E-01     -5.366    8.04E-08
female             -3.52E-01   6.60E-02     -5.326    1.00E-07
blackorwhite       3.57E-01    7.39E-02     4.835     1.33E-06
3.2.2 Tree models
The results of the four tree models are also given in Table 5. The prediction power of the first model is on average 0.7342, of the second 0.7346, of the third 0.7778 and of the fourth 0.7798, making the fourth model (random forest) the best and the first model (classification tree) the worst.
We find that cost complexity pruning improves the prediction power of the classification tree, but not significantly (t-value = 0.025). Both models have distinguishing power compared to the minimal prediction power (t-value > 2).
Before describing the results of bagging and random forest in Table 5, we first describe the results in Table 7, where the MSEs of the tree models are given. The results of the classification tree and cost complexity pruning are about the same. For random forest and bagging, however, we find a reduction of the average MSE, from 0.2253 for bagging to 0.2172 for random forest. As expected, random forest reduces the MSE more than bagging, so we expect random forest to provide the best prediction power.
As shown in Table 5, bagging and random forest both improve, on average, the prediction power of the classification tree and, as expected from Table 7, random forest provides the best results, with an average prediction power of 0.7798.
The following variables are included in the classification tree and the cost complexity pruning tree model: tot_exp, income and age. For the five most important variables used by bagging and random forest, we look at Table 8; the full table is depicted in Appendix 3B. The MDG is given for random forest and bagging, and the results are sorted from most to least important according to the MDG. The four most important variables are the same for random forest and bagging, and the two most important variables are tot_exp and income.
Table 8
Importance   Random forest   MDG     Bagging   MDG
1            tot_exp         243.2   tot_exp   421.0
2            income          220.6   income    297.3
3            otp_exp         171.6   otp_exp   200.1
4            age             159.1   age       180.1
5            tot_otp_vis     158.7   PERWT     175.7

Table 7
MSE                       F75      L75      M75      Average
Classification tree       0.2604   0.2438   0.2477   0.2506
Cost complexity pruning   0.2604   0.2479   0.2477   0.2520
Random Forest             0.2148   0.2183   0.2184   0.2172
3.3 Results of the linear 50% dataset
In this section, Section 3.3.1 discusses the results of the logistic regression models and Section 3.3.2 those of the classification tree, cost complexity pruning, bagging and random forest.
3.3.1 Logistic regression models
The results of the two logistic regression models are shown in Table 9. The first model includes all available variables and the second includes only the variables that are significant in the first model (α = 0.05). The prediction power of the first model is on average 0.9162 and that of the second 0.9214, making the second model the best.
Table 9
Prediction power                  F75      L75      M75      Average   t-value
Tree 1: Classification tree       0.8604   0.8584   0.8665   0.8618    14.4712
Tree 2: Cost complexity pruning   0.8557   0.8584   0.8665   0.8602    14.3581
Tree 3: Bagging                   0.8984   0.8869   0.8923   0.8925    16.7354
Tree 4: Random Forest             0.9038   0.8923   0.8970   0.8977    17.1298
Logistic regression model 1       0.9153   0.9085   0.9248   0.9162    18.5721
Logistic regression model 2       0.9228   0.9133   0.9282   0.9214    18.9862
Minimal prediction power          0.6504   0.6104   0.6287   0.6299    0
Both models have significantly better prediction power than the minimal prediction power and thus add significant prediction power to the most basic model, which gives only 'Insured' as an outcome.
The second model predicts slightly better than the first model, but the difference is not significant (t-value = 0.517). In the second model, all insignificant variables from the first model are removed, reducing the overfitting problem. Variables that are not significant are not very important for explaining the dependent variable or suffer from endogeneity. One would therefore expect the prediction power to increase without the insignificant variables. The results above confirm this expectation, although the improvement is not significant.
In Table 10, the five most significant variables of the second model, constructed on the M75 set, are shown; the full table is depicted in Appendix 2C. All variables, except marr and VARSTR, remain significant. The five most significant variables for predicting whether someone is insured or not, using the logistic regression, are female, age_grp, disab_pop, income and blackorwhite.
Table 10
Coefficients    Estimate    Std. Error   z value   Pr(>|z|)
female          -1.00E+00   1.21E-01     -8.268    2.00E-16
age_grp         7.00E-01    8.03E-02     8.725     2.00E-16
disab_pop       -4.38E+00   2.83E-01     -15.441   2.00E-16
income          6.79E-05    3.53E-06     19.237    2.00E-16
blackorwhite    1.44E+00    1.32E-01     10.931    2.00E-16
3.3.2 Tree models
The results of the four tree models are also given in Table 9. The prediction power of the first model is on average 0.8618, of the second 0.8602, of the third 0.8925 and of the fourth 0.8977, making the fourth model (random forest) the best and the second model (cost complexity pruning) the worst.
We find that cost complexity pruning does not improve the prediction power of the classification tree. This may be caused by the fact that the data is now linear, whereas CTs work best with nonlinear data. Still, both models have distinguishing power compared to the minimal prediction power (t-value > 2).
Before describing the results of bagging and random forest in Table 9, we first describe the results in Table 11, where the MSEs of the tree models are given. The MSEs of the classification tree and cost complexity pruning are much higher than those of bagging and random forest. For random forest and bagging, we find a reduction of the average MSE, from 0.1052 for bagging to 0.1021 for random forest. As expected, random forest reduces the MSE more than bagging, so we expect random forest to provide the best prediction power.
As shown in Table 9, bagging and random forest both improve, on average, the prediction power of the classification tree and, as expected from Table 11, random forest provides the best results, with an average prediction power of 0.8977.
The following variables are included in the classification tree and the cost complexity pruning tree model: tot_exp, income and age. For the variables used by bagging and random forest, we look at Table 12. The MDG is given for random forest and bagging, and the results are sorted from most to least important according to the MDG. The two most important variables, tot_exp and income, are the same for random forest and bagging.
Table 12
Importance   Bagging   MDG     Random forest   MDG
1            tot_exp   803.8   tot_exp         328.4
2            income    345.5   income          242.1
3            age       228.9   rx_exp          213.6
4            PERWT     110.6   age             203.9
5            EDUCYR    96.8    otp_exp         132.9

3.4 Comparison
First, the results of the 83% dataset are compared, second those of the 50% dataset, third those of the linear 50% dataset, and fourth a comparison of the model interpretation is made. Last, all results from the three datasets are compared with each other.
3.4.1 Comparison of the models constructed with the 83% dataset
The logistic regression models 1 and 2 have higher prediction power than the CT with and without cost complexity pruning. But when we reduce the variance of the prediction trees by bagging and random forest, we find better results than for the logistic regression models. When we look at the significance of the models, only bagging adds significant prediction power to the minimal prediction power; the other three tree models and the two logistic regression models do not improve on the minimal prediction power significantly. So the only model with any distinguishing power is bagging, but this model is not significantly better than the best logistic regression model (t = 0.535).

Table 11
MSE                       F75      L75      M75      Average
Classification tree       0.1290   0.1346   0.1323   0.1320
Cost complexity pruning   0.1344   0.1644   0.1323   0.1437
Random Forest             0.0998   0.1009   0.1021   0.1009
When we look at the variables included in the models, some differences appear. All models include income and PERWT or rank them among the most important variables, but the other highly significant variables from the logistic regression model, namely disab_pop, female, employ, live_with_spouse and blackorwhite, which are all dummy variables, do not appear in the tree models. Conversely, the highly important tree variables age, tot_exp, otp_exp and VARSTR, which are all continuous variables, do not appear in the logistic regression models. Apparently, logistic regression prefers dummy variables and the CT prefers continuous variables.
3.4.2 Comparison of models constructed with the 50% dataset
All models add significant prediction power to the minimal prediction power, with t-values of at least 12. Therefore, relative to their minimal prediction powers, these models perform much better than the models constructed with the 83% dataset. The logistic regression models perform better than the classification tree with and without cost complexity pruning, but are outperformed by bagging and random forest. The best tree model is obtained by random forest and the best logistic regression model by including only the significant variables, but the difference between these two models is not significant (t = 1.05). Therefore, they perform equally well.
When we look at the variables included in the models, again some differences appear. All models include income or rank it among the most important variables, but the other highly significant variables from the logistic regression model, namely disab_pop, female, employ, age_grp and blackorwhite, which are again all dummy variables, do not appear among the most important variables of the tree models. Conversely, the highly important tree variables age and VARSTR, which are both continuous variables, do not appear in the logistic regression models. It is logical that the tree models include continuous variables: they are able to construct their own dummy variables (branches) by splitting the observations in the best way possible. This difference is best explained by looking at age, a continuous variable, and age_grp, a dummy variable. The tree models choose age, so that they can make their own split instead of being bound to the predetermined splits of age_grp; consequently, the tree models create splits that are at least as good as those of age_grp. The logistic regression models choose age_grp. Apparently, there exists some nonlinearity in the data, causing the logistic regression model to choose age_grp over age.
3.4.3 Comparison of models constructed with the linear 50% dataset
All models add significant prediction power to the minimal prediction power, with t-values of at least 14. Therefore, these models perform much better than the models constructed with the 83% dataset. Bagging and random forest perform better than the classification tree with and without cost complexity pruning, but this time the logistic regression models perform better than the tree models. The best tree model is obtained by random forest and the best logistic regression model by including only the significant variables, and the difference between them is significant (t = 2.245). Therefore, the logistic regression model is the overall best model. This result was expected, because the data were generated by using the logistic regression model.
When we look at the variables included in the models, some differences appear. All models include income or rank it among the most important variables, but the other highly significant variables from the logistic regression model, namely disab_pop, female and live_with_spouse, which are again all dummy variables, do not appear in the tree models. Conversely, the highly important tree variables age and VARSTR, which are both continuous variables, do not appear in, or are not significant in, the logistic regression models. Again, we can explain these differences in the same way as in Section 3.4.2, using age and age_grp.