A comparison of logistic regression and artificial neural networks : predicting health care utilization


Faculty of Economics and Business

Amsterdam School of Economics


A Comparison of Logistic Regression and

Artificial Neural Networks

Predicting Health Care Utilization

Mats de Nijs

10757473

Master’s programme: Econometrics

Specialisation: Big Data Analytics

Date of final version: August 15, 2018

Supervisor: Dr. J. C. M. van Ophem

Second reader: Dr. N. P. A. van Giersbergen

Abstract

Logistic regression is the most commonly used method for developing predictive models for binary outcomes in medicine. Artificial Neural Networks is a more recent technique that has emerged as an alternative to logistic regression and other statistical techniques. This paper presents an overview of the features of logistic regression and neural networks. Both methods are applied in predicting health care utilization and compared.


Statement of Originality

This document is written by Mats de Nijs who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision and the completion of the work, not for the contents.


Introduction

Worldwide, insurers are struggling to keep health insurance affordable. Schieber and Poullier (1989) observed a tendency of increasing health expenditures in OECD countries, a trend that has strengthened ever since due to advances in medical technology, aging of the population and the increasing number of people with insurance (Mehrotra et al., 2003; CMS, 2017).

Primarily in the U.S., insurers cooperate with companies to tackle this problem. They mine data about the prescription drugs workers use, how they shop and even whether they vote, to predict their individual health needs and to recommend treatments (Silverman, 2016). Trying to contain rising health-care costs, some companies, including American retailer Wal-Mart Stores Inc., are paying employee wellness firms like Castlight Healthcare Inc. to collect and crunch employee data to identify, for example, which workers are at risk for diabetes, and target them with personalized messages nudging them towards a doctor or services such as weight-loss programs (Silverman, 2016). Companies state that the goal is to get employees to improve their own health at an early stage as a way to cut corporate health-care bills (Silverman, 2016).

To develop clinical prediction rules, insurers and employee wellness firms use a large number of techniques, including the clinical judgment of physicians and a range of statistical methods (Breiman et al., 1984; Watson et al., 1985). These statistical methods include linear and logistic regression, discriminant analysis and recursive partitioning. For predicting binary outcomes, logistic regression is commonly used (Hosmer and Lemeshow, 1989). A more recent technique that has emerged as a potential alternative to logistic regression and other statistical techniques is artificial neural networks (Guerriere and Detsky, 1991). Neural networks have the advantage of not being constrained by prescribed mathematical relationships between dependent and independent variables and hence have the ability to model any kind of nonlinear relationship.

A few studies suggest that neural networks offer significantly better predictive performance than other statistical methods for certain problems (Baxt, 1991; Buchman et al., 1994; Shi et al., 2012). Other studies find no statistically significant difference in the classification accuracy of logistic regression versus neural networks (Ottenbacher et al., 2001). The goal of this paper is to compare logistic regression and neural networks, especially the differences in their predictive performance. To examine the differences between these methods, they will be applied to predicting health care utilization in the U.S. The main research question is: “How do the predictive performances of logistic regression models and artificial neural networks compare when predicting health care utilization in the U.S.?”

There are many empirical challenges in studying health care utilization. As opposed to studies on health care expenditure (Schellhorn, 2001; Shen, 2013), the dependent variable is dichotomous. The dependent variable is either a 1 or a 0 which makes binary classification techniques such as logistic regression and neural networks applicable. Making the dependent variable binary also complicates things. Since high health expenditures are grouped with very low health expenditures, the classification methods could struggle to find the important variables in predicting health care utilization. Setting the threshold at $500 for example, instead of $0, could improve the predictive performance of the models. The outcome would also be valuable for insurers that work with deductibles. Different configurations and thresholds will be examined throughout this research.

Another empirical challenge lies in the time element. The data used is from the Medical Expenditure Panel Survey (MEPS, 2017), a national survey of the U.S. civilian non-institutionalized population. Between 1996 and 2015, a new panel of sample households was selected each year to construct the dataset. Because a new panel was selected each year, the data is unsuitable for panel data analysis; instead, the data from the different years can be combined and used as one big dataset. Predictive performance will be evaluated by ’training’ algorithms on the data from a certain time span and testing them on the data of the following year. Long-term trends in predictive performance will also be studied.

Yet another challenge is to fine-tune the logistic regression model and the artificial neural network. Both algorithms require proper predictor variables to operate well, thus methods of feature selection will be performed. Fine-tuning neural networks also involves defining the optimal structure. ’Hyperparameters’ specify the structure of a network and determine how the network is trained. Optimizing the neural network will therefore be done by testing different combinations of these hyperparameters.

Some minor challenges include sampling and performance measurement. In sampling, the optimal training set is constructed by including a balanced number of positive and negative outputs. Since the dataset is fairly balanced, sampling is less relevant than for other applications of binary classification. The goodness of fit of a certain algorithm is measured by its performance metrics. There are many performance metrics, including accuracy and Area Under Curve (AUC), which might result in different outcomes. Computation time is another relevant measure.

The remainder of this paper is organized as follows. In chapter 2, the relevant literature regarding the subject is discussed. The dataset that is used is described in chapter 3. Chapter 4 presents the econometric models. In chapter 5, the results are presented and interpreted, which leads to the conclusions of the paper in chapter 6.

(6)

Chapter 2

Literature Review

There is a great deal of theoretical and empirical literature concerning logistic regression, neural networks and their applications. To provide a background for the following chapters, the most relevant literature is provided. The related literature is divided into four different topics: logistic regression, neural networks, predicting health care utilization and other binary classification techniques.

2.1 Logistic Regression

Logistic regression, also called logit regression, is a widely used technique for relating a dichotomous dependent variable to one or more explanatory variables. In this research, the dependent variable $y_i$ takes one of the following two outcomes:

\[
y_i =
\begin{cases}
1 & \text{if individual } i \text{ utilized health care this year,} \\
0 & \text{if not.}
\end{cases}
\tag{2.1}
\]

In logistic regression, the probability of a positive outcome is related to a series of predictor variables by an equation of the form:

\[
P[y_i = 1] = F(x_i'\beta) = \sigma(x_i'\beta) = \frac{1}{1 + e^{-x_i'\beta}} \tag{2.2}
\]

where $x_i$ is a vector of predictor variables corresponding to individual $i$ and $\beta$ is a vector of coefficients associated with each predictor variable (Cameron and Trivedi, 2005). Logistic regression models use the maximization of a likelihood function as the estimation method. After the coefficients $\beta$ are estimated, the model can be tested by feeding new data and calculating $F(x_i'\beta)$. The model classifies an observation as positive if $F(x_i'\beta) > 0.5$ and negative otherwise. Probit regression is a very similar technique although it uses the standard normal cdf for $F(x_i'\beta)$, whereas logit regression uses the cdf of the logistic distribution. Both methods yield similar, though not identical, inferences. In health sciences, the most commonly used distribution function is the logistic distribution function, mainly because the logit model has a relatively simple form for the first-order conditions and asymptotic distribution. Therefore, we will stick to logistic regression.
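The prediction step of equation (2.2) and the 0.5 classification rule can be sketched in a few lines of Python. This is a minimal illustration, not the estimation procedure: the coefficient values and the three predictors (intercept, age in decades, an insurance dummy) are hypothetical and chosen only to make the arithmetic concrete.

```python
import math

def sigmoid(z):
    """Logistic cdf: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta):
    """P[y_i = 1] = sigma(x_i' beta); x and beta are equal-length lists."""
    return sigmoid(sum(xj * bj for xj, bj in zip(x, beta)))

def classify(x, beta, threshold=0.5):
    """Return 1 if the estimated probability exceeds the threshold, else 0."""
    return 1 if predict_proba(x, beta) > threshold else 0

# Hypothetical coefficients: intercept, age in decades, insured dummy.
beta = [-1.0, 0.3, 0.8]
x_i = [1.0, 4.5, 1.0]  # the leading 1.0 multiplies the intercept

p_i = predict_proba(x_i, beta)   # sigma(-1.0 + 1.35 + 0.8) = sigma(1.15)
label_i = classify(x_i, beta)
```

In practice the coefficients would come from maximum likelihood estimation rather than being set by hand.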

Logistic regression is the technique of choice for modeling in medicine when the outcome of interest is binary. Logistic regression models have been used for predicting diseases (Zhou et al., 2004) and predicting recovery (Sperandei, 2014) among many other applications (Bender et al., 2007; Shen, 2013; Sperandei, 2014).

Sperandei (2014) presents the results of several studies using logistic regression for modeling in medicine. Although he does not report exact predictive performances, he shows that with less specific variables (e.g. age and region), logistic regression is well capable of predicting the effectiveness of treatments and predicting diseases.

Zhou et al. (2004) developed a method using logistic regression to classify and predict the type of cancer in human cells. By using the variable selection methods Gibbs sampling and Markov chain Monte Carlo methods, they discovered important genes that related to different types of cancer. After the important genes were identified, the logistic regression model was used for cancer classification and prediction. They found only one error for the classifier in a dataset of 7129 cells. Their study demonstrated that with the right data and methods to identify the strong variables, logistic regression is capable of classifying and predicting diseases.

Shen (2013) used probit regression, which is similar to logistic regression, to model health care utilization. He also used a semiparametric estimation of the probit model. Shen (2013) studied a set of three equations to examine the effects of different factors on health care decisions: health insurance, utilization and expenditures. The second equation models what he names ’the decision to seek health care’. Shen (2013) uses the results of the first and second equation to ultimately model the level of expenditures. He did not report the predictive performance of the second equation because that was not his aim. He did, however, present the estimated coefficients of the utilization equation and their significance. Significant explanatory variables included the number of comorbidities, mental illnesses, gender, years of education, marital status and insurance coverage.

2.2 Neural Networks

Logistic regression could be used for multiple goals: determining causalities, measuring the weights of importance of predictors as well as predicting the outcome of the dependent variable. Neural networks, on the other hand, only do the latter. Neural networks, also called artificial neural networks, are machine learning methods that can be used for regression and classification type of problems (Paliwal and Kumar, 2009). They have been developed for all sorts of applications and have proven to be competitive in solving real-world problems (Prieto et al., 2016). Neural networks have also been developed for clinical applications, including diagnosing diseases (Baxt, 1991; Teramoto et al., 2017), predicting re-hospitalization (Ottenbacher et al., 2001), predicting mortality (Shi et al., 2012) and predicting length of stay in the intensive care unit (Tu and Guerriere, 1993). Although many different types of neural network are currently used, the focus of this paper is restricted to ’feed forward neural networks’. Other types such as ’recurrent neural networks’, in which values cycle, are only used for very specific applications.

Artificial neural networks were first developed around 1970, by researchers who were trying to recreate the learning process of the human brain (Guerriere and Detsky, 1991; Hinton, 1992). In 1986, the scientific community became more interested in this technique because of the discovery of back propagation (Rumelhart et al., 1986). Neural networks have the ability to ’learn’ mathematical relationships between predictor variables and the dependent variable. This is achieved by training the network with a training set consisting of the predictor variables as well as the corresponding dependent variable. The networks are programmed to change their internal weights based on the mathematical relationships identified between the inputs and outputs in the dataset. Once a network has been trained, it can be evaluated in a separate test, using a validation set. When the ideal network is constructed, it can be used for predictions, using a test set (Tu, 1996).

Figure 2.1 is the diagram of the neural network used by Tu (1996). The network is designed to predict the probability of a patient dying from a certain disease based on two predictor variables: sex and age. Neural networks are often represented using such diagrams. Each circle represents a neuron, often called a node, while each line represents a connection weight. The neurons of the network are arranged in three layers: input, hidden and output. The predictor variables enter the network through the input layer. The estimated probability leaves the network as the output. A network can have multiple outputs but in this study, only networks with a single output are considered. The neurons in the hidden layer contain intermediate values that are calculated by the network. Each of the hidden and output neurons also contains a function termed the ’activation function’. Each activation function is itself a non-linear function of a linear combination of the values of the previous layer. The coefficients in the linear combination are adaptive parameters. The sigmoid function used in logistic regression is an example of an activation function. Hidden neurons allow the network to model complex relationships. Neural networks can consist of multiple hidden layers and hidden layers can consist of multiple neurons. The input and hidden layers of neural networks usually include a bias unit. These neurons store the value of 1 and are not influenced by any of the previous layers. They act as an intercept term, allowing the internal relations to shift up or down.

The logistic regression model that was used by Tu (1996) to predict the same outcome using the same variables as before is presented in figure 2.2. As demonstrated by this figure, the logistic regression model uses considerably fewer parameters (three coefficients of β) than a neural network (nine connection weights w), even for a relatively small network. For the logistic regression model, calculating the probability of an individual dying is relatively simple. In contrast, the input variables of a neural network undergo three logistic transformations before the output is determined.
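The parameter count and the three logistic transformations can be made concrete with a small forward pass. The sketch below assumes two inputs, two hidden neurons and one output, each layer fed by a bias unit fixed at 1, which is consistent with the nine connection weights mentioned above. The weight values are hypothetical and are not taken from Tu (1996).

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_output):
    """Forward pass through one hidden layer with logistic activations.

    w_hidden: one weight list per hidden neuron, [bias, w_x1, w_x2, ...]
    w_output: weights of the output neuron, [bias, w_h1, w_h2, ...]
    """
    inputs = [1.0] + list(x)                 # prepend the input-layer bias unit
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, inputs)))
              for ws in w_hidden]            # one logistic transform per hidden neuron
    hidden_with_bias = [1.0] + hidden        # prepend the hidden-layer bias unit
    return sigmoid(sum(w * v for w, v in zip(w_output, hidden_with_bias)))

# Hypothetical weights for two inputs (sex, age) and two hidden neurons.
w_hidden = [[0.1, -0.4, 0.02], [-0.3, 0.5, 0.01]]
w_output = [-1.0, 1.2, 0.7]

n_weights = sum(len(ws) for ws in w_hidden) + len(w_output)  # 6 + 3 = 9
p_death = forward([1.0, 65.0], w_hidden, w_output)
```

With all weights set to zero the network returns 0.5 for any input, which is a quick sanity check on the wiring.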

Figure 2.2: Diagram of logistic regression model used by Tu (1996)

The neural network used by Tu (1996) is a rather small network. Many neural networks are far more complex, especially in medicine. Teramoto et al. (2017), for example, made use of eight hidden layers to classify lung cancer types. As more layers, neurons and predictor variables are added, the number of connection weights and the complexity of the network increase. A major risk that comes with increasing complexity is ’overfitting’. When the number of predictor variables, hidden layers or neurons increases, the prediction error on the training set is driven to a very small value. But when new data is presented to the network, the error is large. The network has memorized the training examples, but it has not learned to generalize to new situations. This phenomenon is demonstrated in figure 2.3. Depending on the application, researchers use many different kinds of techniques to prevent the model from overfitting. Teramoto et al. (2017) used augmentation of data by image manipulation, whereas commonly used techniques include cross-validation, adding regularization terms and ensembling. The ideal point on the diagram is where the error of the test set is minimal, which is indicated by the first dashed line. In cross-validation, a validation set is constructed on which distinct models with different degrees of complexity are applied. The model with the lowest validation error corresponds to the ideal point of figure 2.3 and is selected to be applied on the test set.

Figure 2.3: Diagram showing a typical relationship between the amount of error in the training and test set and the complexity of the neural network (Tu, 1996)

The neural network used by Tu (1996) is relatively simple in more ways. It only uses one hidden layer, of which the number of neurons is the same as the number of input variables, and the activation functions of choice are all logistic. Including more hidden layers and neurons per layer could make for a better performance. In some cases, other activation functions such as rectified linear unit (ReLU) or hyperbolic tangent give better results. The number of hidden layers, the number of neurons and the sort of activation functions are examples of hyperparameters. These are variables set before actually optimizing the model’s parameters. Hyperparameter selection can be seen as both an optimization problem and a generalization problem because overfitting is possible. Often, the selection process is influenced by computational costs: each hyperparameter setting requires training a new model, and model training typically takes a long time. That is why many studies do not focus much on hyperparameter selection. For classification problems with large datasets, the computational cost of neural networks is sometimes even used as a performance measurement.
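To see why computational cost limits hyperparameter selection, it helps to count how many networks a simple grid search would have to train. The sketch below enumerates every combination in a small search space. The candidate values are hypothetical and are not the ones used in this research.

```python
from itertools import product

# Hypothetical search space; the values are illustrative only.
grid = {
    "hidden_layers": [1, 2, 3],
    "neurons_per_layer": [8, 16, 32, 64],
    "activation": ["logistic", "tanh", "relu"],
}

def enumerate_settings(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

settings = list(enumerate_settings(grid))
n_models = len(settings)  # 3 * 4 * 3 = 36 networks to train
```

Adding one more hyperparameter with five candidate values would multiply the number of networks by five, which is why exhaustive searches quickly become expensive.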

Neural networks are machine learning methods that can be used for regression and classification type of problems. They have proven to be competitive in solving all kinds of real-world problems including clinical applications. Neural networks involve series of neurons, layers and activation functions. More detailed descriptions of these structures and mathematical clarifications are provided in chapter 4.

2.3 Predicting health care utilization

Logistic regression and neural networks are commonly used to predict specific binary outcomes in medicine such as the occurrence of a disease or the effectiveness of a treatment. Predicting health care utilization is a more complicated case because the outcome is very broad and hard to predict. There are different reasons for an individual to seek health care and not all reasons are predictable. Some risks, such as road accidents, are almost fully unpredictable. Other cases are highly predictable. People with high BMI and cholesterol often suffer from heart diseases and individuals diagnosed with asthma structurally use health care. The majority of health expenses, however, is only slightly predictable. In some circumstances, an individual is more likely to have medical issues and to seek access to health care than in others. These circumstances are reflected by the predictor variables in the model. The goal of this paper is to predict health care utilization as well as possible using the predictor variables. A number of problems specific to predicting health care utilization are discussed in this section, as well as ways to solve these problems.

The main dependent variable, health care utilization, is defined as 1 if the individual utilized health care and as 0 if not. This classification is derived from whether the total health expenditure of the individual is greater than 0. Logistic regression and neural networks might struggle to identify the important variables because individuals with high health expenditure are grouped with individuals with low health expenditure. Therefore, a series of experiments will be performed where health expenditures are separated into classes based on other thresholds. Instead of predicting health care utilization, the models then predict health care utilization with costs higher than a certain amount. By setting this threshold at $300 for example, the algorithm will ignore minor medical expenses and will only predict major expenses. First hospital visits will be ignored by the model and only subsequent visits are considered. According to Fishman et al. (2012), individuals are less likely to use health care if they have unmet deductibles. At the same time, healthy individuals are more likely to choose an insurance contract with high deductibles. Thus, having unmet deductibles is both a discouragement to use health care and a health indicator. Many individuals either do not use health care at all, or spend all their deductibles and are more likely to spend even more. It would therefore be interesting to check whether a threshold around the amount of deductibles makes much of a difference. Fishman et al. (2012) reported that in a U.S. dataset of 17,691 observations, 13.4 percent of the individuals have deductibles between $100 and $500 and 0.8 percent of the individuals have deductibles above $500. Among individuals with deductibles between $100 and $500, Fishman et al. (2012) found a significantly lower likelihood of making an initial doctor visit. Setting the threshold at $100 or $500 could make for interesting results which are valuable to insurers.
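How the binary target changes with the threshold can be shown with a small Python sketch. The expenditure amounts below are hypothetical and serve only to illustrate how individuals move from class 1 to class 0 as the threshold rises.

```python
def utilization_label(expenditure, threshold=0.0):
    """Return 1 if total expenditure exceeds the threshold, else 0."""
    return 1 if expenditure > threshold else 0

# Hypothetical annual expenditures in dollars for five individuals.
expenditures = [0.0, 80.0, 250.0, 450.0, 12000.0]

# One label vector per candidate threshold.
labels_by_threshold = {
    t: [utilization_label(e, t) for e in expenditures]
    for t in (0, 100, 300, 500)
}
```

At a threshold of $0 four of the five individuals are labelled 1, while at $500 only the individual with very high expenditure remains in the positive class.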

Another empirical challenge lies in the time element. Many studies on health care expendi-ture or health care utilization use some form of panel data analysis (Lammers and Mclaughlin, 2017; Mebrati et al., 2013). However, the data used is not suitable for panel data analysis since a new panel of sample households was selected each year. Instead, the data could be combined to make one big dataset. Also, the data from different years could be examined individually to examine the trend in predictive performance. Ultimately, algorithms will be ’trained’ by using the data from a certain time span and evaluated using the data from the next year, to examine whether the data could in fact be used to predict health care utilization in the future.
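The train-on-a-span, test-on-the-next-year design can be sketched as a simple split by survey year. The record layout and field names below are assumptions made for illustration and do not correspond to actual MEPS variable names.

```python
def temporal_split(records, train_years, test_year):
    """Split records into a training span and a single later test year."""
    if test_year <= max(train_years):
        raise ValueError("test year must come after every training year")
    train = [r for r in records if r["year"] in train_years]
    test = [r for r in records if r["year"] == test_year]
    return train, test

# Hypothetical records; field names are assumptions, not MEPS variables.
records = [
    {"year": 2012, "age": 34, "used_care": 1},
    {"year": 2013, "age": 51, "used_care": 1},
    {"year": 2013, "age": 23, "used_care": 0},
    {"year": 2014, "age": 47, "used_care": 1},
    {"year": 2015, "age": 60, "used_care": 0},
]

train, test = temporal_split(records, train_years={2012, 2013, 2014}, test_year=2015)
```

Keeping the test year strictly after the training span avoids evaluating a model on data that precedes what it was trained on.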

In both logistic regression and neural networks, it is important that all the predictor variables legitimately influence the dependent variable. Adding unrelated predictor variables would reduce the error of the training set but also causes overfitting. Fortunately, because of increasing health expenditures, there is a vast literature about features that relate to health expenditure. According to Shen (2013), health expenditures relate to three groups of explanatory variables: health characteristics, geographic and demographic variables and socioeconomic status. The first group of explanatory variables was studied by Schellhorn (2001). He showed that individuals with better health conditions have lower expected health expenditure and a lower chance to use health care. One of the major causes of health expenditures in the U.S. is obesity. Sturm (2002) confirmed that obesity increases health risks significantly. Another example of a highly important health characteristic is pregnancy since it is almost guaranteed to lead to health expenditure (Windmeijer and Silva, 1997). The second group of explanatory variables includes geographic and demographic variables. Some examples of demographic variables that Shen (2013) found to be significant in explaining health care utilization include marital status and sex. Females and married individuals had a higher chance to use health care. Schellhorn (2001) found that individuals living in metropolitan areas visit their primary physician 25 percent less often than individuals from rural areas. The third group of explanatory variables reflects socioeconomic status. These variables include income and education. Tang and Ch’Ng (2011) found that both income and education have a positive impact on health care utilization. Holly et al. (1998) found that the number of hours worked has a significant negative effect on the number of hospital visits. Lastly, there is a lot of literature regarding the relation between health insurance and health expenditures (Liu et al., 2012; Bajari et al., 2014), which is why insurance is one of the most important predictor variables.

The selected group of explanatory variables will likely still include a number of unrelated predictor variables. To get rid of the variables that are only confusing the model, feature selection will be performed. In logistic regression, it is usual to select variables for inclusion through backward or forward stepwise regression. Neural networks have no built-in technique. Therefore, a feature selection technique based on Random Forests will be performed. Random Forests are often used for feature selection (Shamsoddini et al., 2017). This is because the tree-based strategies used by Random Forests naturally rank variables by how well they improve the accuracy of the model. Feature selection also has the benefit of reducing computational costs.

A dataset is imbalanced if the classes (0 and 1) are not approximately equally represented. Imbalance on the order of 50 to 1 is prevalent in cancer detection (Woods et al., 1993) and imbalance of up to 100,000 to 1 has been reported in other applications (Provost and Fawcett, 2001). The dataset used has only a slight imbalance of around 4 to 1: about 80 percent of the individuals utilize health care whereas about 20 percent do not. The performance of machine learning algorithms is typically evaluated using predictive accuracy. However, this is not appropriate when the data is imbalanced and/or the costs of different errors vary significantly. As an example, consider the classification of pixels in mammogram images as possibly cancerous (Woods et al., 1993). A typical mammography dataset might contain 98 percent normal pixels and 2 percent abnormal pixels. A simple strategy of always guessing the majority class will result in a predictive accuracy of 98 percent. However, the nature of the application requires a high rate of correct detection of the minority class and tolerates a somewhat lower rate of correct detection of the majority class. In machine learning, three ways have been developed to address the issue of imbalanced data. One is to assign different costs to training samples (Domingos, 1999). The second is to assign different values to evaluation samples. The third is to re-sample the original dataset by oversampling the minority class and/or undersampling the majority class (Kubat and Matwin, 1997).
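The third approach, re-sampling, can be sketched as random undersampling of the majority class. This is one simple variant chosen for illustration, not necessarily the procedure used in this research; the function name, the fixed seed and the 80/20 toy labels are assumptions.

```python
import random

def undersample_majority(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equal in size."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy data mirroring the roughly 4-to-1 imbalance described above.
labels = [1] * 80 + [0] * 20
samples = list(range(100))

bal_samples, bal_labels = undersample_majority(samples, labels)
```

Undersampling discards information from the majority class, so with a mild imbalance such as 4 to 1 it may not be worth the loss of training data.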

Class imbalance is a serious issue in medical segmentation and one that requires special care with regard to the costs of training samples used and the costs of evaluation samples used. In machine learning, a loss function is a function that assigns costs to the error. The error is the difference between the actual and the output value:

\[
E = y_i - \hat{y}_i \tag{2.3}
\]

The loss function is minimized to optimize the model. The ideal loss function differs from application to application. The cross entropy loss, also named log-loss, is ubiquitous in modern neural networks.
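As an illustration of such a loss function, the binary cross entropy can be written out in a few lines of Python. The optional per-sample weights are an assumption added here to show how different costs could be assigned to training samples; with equal weights the function reduces to the ordinary mean log-loss.

```python
import math

def log_loss(y_true, y_prob, weights=None, eps=1e-12):
    """Weighted mean binary cross entropy.

    weights is an optional per-sample cost; all samples count equally by default.
    """
    if weights is None:
        weights = [1.0] * len(y_true)
    total = 0.0
    for y, p, w in zip(y_true, y_prob, weights):
        p = min(max(p, eps), 1.0 - eps)       # clip so that log() stays finite
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)

confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
```

Confident but wrong predictions are penalized far more heavily than confident and correct ones, which is the property that makes this loss suitable for training probabilistic classifiers.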

A performance metric is used to judge the performance of a model. It has nothing to do with the training process. Performance metrics often use the predicted class instead of the output value, that is, the output value rounded to 0 or 1 in case of classification. In that case, there are two types of errors, false positives (FP) and false negatives (FN), and two types of correct classifications, true positives (TP) and true negatives (TN), as displayed in figure 2.4. The costs of these errors differ per metric. Predictive accuracy is the performance measure generally associated with machine learning algorithms and is defined as:

\[
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2.4}
\]

In case of binary classification, accuracy is the percentage of correctly classified samples after rounding the estimated probabilities to 0 or 1. The decision boundary is set to 0.5. In the context of balanced datasets and equal error costs, it is reasonable to use error rate as a performance metric. Error rate is:

\[
\text{Error} = 1 - \text{Accuracy} \tag{2.5}
\]

In the presence of imbalanced datasets with unequal costs, it is more appropriate to use the ROC curve (Kubat and Matwin, 1997; Jiao and Du, 2016). ROC curves can be thought of as representing the family of best decision boundaries for relative costs of TP and FP. On an ROC curve the X-axis represents:

\[
FPR = \%FP = \frac{FP}{TN + FP} \tag{2.6}
\]

And the Y-axis represents:

\[
TPR = \%TP = \frac{TP}{TP + FN} \tag{2.7}
\]

The ideal point on the ROC curve would be (0, 100), that is, all positive samples are classified correctly and no negative samples are misclassified as positive. Figure 2.5 shows an illustration. The line y = x represents the scenario of randomly guessing the class. Area Under the ROC Curve (AUC) is a useful metric for classifier performance as it is independent of the decision criterion selected and prior probabilities. The AUC comparison can establish a dominance relationship between classifiers. If the ROC curves are intersecting, the total AUC is an average comparison between models (Chawla et al., 2002). However, for some specific cost and class distributions, the classifier having maximum AUC may in fact be suboptimal.
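The metrics in equations (2.4) to (2.7) can be computed directly from the confusion counts. The sketch below also computes the AUC through its rank interpretation, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is a standard equivalent of the area under the ROC curve. The 98/2 example mirrors the mammography case described earlier; all function names are chosen for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels and predicted classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fp + tn + fn)    # equation (2.4)

def false_positive_rate(tn, fp):
    return fp / (tn + fp)                     # equation (2.6)

def true_positive_rate(tp, fn):
    return tp / (tp + fn)                     # equation (2.7)

def auc(y_true, y_score):
    """Share of positive/negative pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Always guessing the majority class on a 98/2 split.
y_true = [1] * 2 + [0] * 98
y_pred = [0] * 100
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
majority_accuracy = accuracy(tp, tn, fp, fn)   # 0.98
majority_tpr = true_positive_rate(tp, fn)      # 0.0
```

The majority-class strategy reaches 98 percent accuracy while detecting none of the positive samples, which illustrates why accuracy alone is misleading on imbalanced data.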


Figure 2.5: Illustration of a ROC curve

2.4 Other binary classification techniques

The focus in this research is on logistic regression and neural networks. Other statistical methods were not examined, mainly due to the time limit of the research. Some of the most used alternatives are discussed briefly to provide a sense of how they work and what they are capable of.

Figure 2.6: Illustration of a support vector machine

"Support Vector Machine" (SVM) is a machine learning algorithm which can be used for both classification and regression problems. However, it is mostly used in classification problems. In this algorithm, each data point is plotted as a point in a k-dimensional space, where k is the number of variables, with the value of each variable being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes. The performance of SVM is highly dependent on the amount of noise in the data. If there is a clear margin of separation between the classes, it is very effective. However, if there is a lot of noise in the data, i.e. the classes overlap, the classifier loses its effectiveness. The expectation is that health care utilization is very random, which is why SVM appears inappropriate. Also, SVM is especially effective in cases with high dimensional spaces and limited data, neither of which applies here.

Figure 2.7: Illustration of a decision tree

A "decision tree" is another example of a machine learning algorithm used for prediction purposes. A decision tree consists of nodes and edges. Each node splits the data based on one of the input variables. The data is then moved through an edge to the next node. This process repeats until each sample has passed through a certain number of nodes, often equal to the number of input variables. A leaf is located at each end of the tree. If a sample ends up in a leaf, it is assigned a corresponding probability of being from a certain class, based on what the tree learned from the training set. There are several algorithms that define the structures of decision trees and the order of the nodes based on the importance of the variables. "Random forest" is the term assigned to machine learning algorithms that use multiple decision trees. Decision trees and random forests are simple to interpret and visualize. They perform well in large datasets and have the advantage of having in-built feature selection: irrelevant features are used less. The main drawback of random forests is the model size: the training and prediction processes are more time-consuming than those of other algorithms.

The "k-nearest neighbour" algorithm (k-NN) is a non-parametric method for classification and regression. The prediction for an individual depends on its nearest neighbours. The k nearest neighbours are defined as the k data points in the training sample that are closest in feature space to the individual. The predicted class of the individual is the class that is most common across the k nearest neighbours. The k-nearest neighbour algorithm can do well in practice with enough representative data. It is important that a meaningful distance function is chosen, because otherwise the variables with the largest range will single-handedly determine the relative distances.
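The effect of the distance function can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than a method used in this research: the data are made up, the function names are hypothetical, and min-max scaling is only one of several possible ways to make the variables comparable.

```python
from collections import Counter


def min_max_scale(rows):
    """Rescale every column to the interval [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(r, lows, spans)] for r in rows]


def knn_predict(train_x, train_y, query, k):
    """Majority class among the k training points closest to the query."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]


# Made-up observations: (age, income) with a binary class label.
train_x = [[25, 30000], [30, 32000], [70, 31000], [75, 29000]]
train_y = [0, 0, 1, 1]
query = [72, 32000]

# Without scaling, income dominates the distance and age is ignored.
raw_prediction = knn_predict(train_x, train_y, query, k=1)

# With scaling, the query is matched to the observation of similar age.
scaled = min_max_scale(train_x + [query])
scaled_prediction = knn_predict(scaled[:-1], train_y, scaled[-1], k=1)
print(raw_prediction, scaled_prediction)
```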


Chapter 3

Data

3.1 Data files

The Medical Expenditure Panel Survey is an ongoing nationally representative survey of the U.S. civilian noninstitutionalized population started in 1996 by the U.S. Department of Health and Human Services (MEPS, 2017). Surveys on households, employers and medical providers are conducted to collect information regarding estimates of respondents’ health status, health care expenditures, demographic and socioeconomic characteristics, employment, access to care, and satisfaction with care.

Starting in 1996, each year a new panel of sample households was selected to construct a dataset. These 'Full Year Consolidated Data Files' are publicly accessible and can be found on the MEPS website (MEPS, 2017). At the time of completing this research, the datasets between 1996 and 2015 were available. The datasets differ from one another in a number of ways. Apart from the fact that different datasets include data from other respondents, the number of respondents varies and the variables are different as well. This makes it more complicated to join the datasets to create a bigger training sample. The datasets of 2010 and earlier lack a few essential variables including the majority of health status variables. Therefore, this study focuses only on the years 2011 to 2015. For cross-sectional predictions, individual years as well as the complete dataset will be used. For analyses on predicting health care utilization in the upcoming years, the training and validation set contains the data from 2011 to 2014 and the test set contains the data from 2015.

The data is further reduced to individuals between 18 and 84 years old. For respondents outside this range, little data is available. Individuals under the age of 18 did not fill in many of the questionnaires from which the dataset was constructed. The same goes for individuals of age 85 or older. For this group of people, even their exact age is not reported. This filter reduces the size of the dataset significantly. The complete dataset is reduced from 181529 to 127393 observations. Because this filter only leaves out individuals not at a working age, the expectation is that employment-related variables will have a relatively weak predictive performance. Figure 3.1.a shows the distribution of ages in the dataset of 2015 after the filter.


3.2 Dependent variables

The key variable to predict is utilization of health care, which is defined as having positive health care expenditures. The source of this variable in the dataset is TOTEXP**, where the stars indicate the year. This variable is defined as the total amount of expenditure on all health services. Health services include office based visits, hospital outpatient visits, emergency room visits, inpatient hospital stays, prescription of medicine, dental visits, home health care and other medical expenses. Another variable of interest is TOTSLF**, which stands for the expenditure on all health services paid out of pocket. If an expense is paid out of pocket, it means that the patient or the patient's family paid for it and that it was not paid by Medicare, Medicaid, private insurance or other unclassified sources. A third variable of interest is TOTEXP** - TOTSLF**. This constructed variable stands for the total amount of expenditure on all health services not paid out of pocket. In figures 3.1.b to 3.1.d, three histograms show the distributions of the three kinds of health expenditures. As expected, the histograms show a clear peak when the expenditure is between 0 and 100. This peak is considerably higher when only expenditure on health services paid out of pocket is taken into account. The same goes for expenditure on health services not paid out of pocket.

As explained before, predicting the size of health expenditure is of serious relevance as well. To an insurer, it is especially relevant whether health care expenditures exceed a patient's deductibles. Fishman et al. (2012) reported 13.4 percent with unmet deductibles between $100 and $500 and 0.8 percent with unmet deductibles above $500. Based on those numbers, two more dependent variables are constructed. These variables are defined as 1 if the total expenditure exceeds $100 and $500, respectively, and as 0 otherwise. Table 3.1 presents the descriptive statistics of all of the output variables. Although figures 3.1.b, 3.1.c and 3.1.d show a large peak between 0 and 100, each of the dependent variables equals 1 for the majority of individuals.

| Dependent variable | Expenditure > 0 | Out of pocket expenditure > 0 | Not out of pocket expenditure > 0 | Expenditure > 100 | Expenditure > 500 |
|---|---|---|---|---|---|
| 0 | 29622 (23%) | 38933 (31%) | 38146 (30%) | 36166 (28%) | 57487 (45%) |
| 1 | 97771 (77%) | 88460 (69%) | 89247 (70%) | 91227 (72%) | 69896 (55%) |

Observations: 127393
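The construction of the five binary dependent variables from the two expenditure amounts can be sketched as follows. This is an illustrative Python sketch, not the code used in this research; the function name and the dictionary keys are hypothetical, and the arguments stand for the values of TOTEXP** and TOTSLF** of one respondent.

```python
def dependent_variables(totexp, totslf):
    """Binary outcomes from total (TOTEXP) and out-of-pocket (TOTSLF) expenditure."""
    not_pocket = totexp - totslf  # expenditure not paid out of pocket
    return {
        "expenditure>0": int(totexp > 0),
        "pocket>0": int(totslf > 0),
        "not_pocket>0": int(not_pocket > 0),
        "expenditure>100": int(totexp > 100),
        "expenditure>500": int(totexp > 500),
    }


# Made-up respondent: 250 dollars of expenditure, all paid out of pocket.
example = dependent_variables(totexp=250, totslf=250)
print(example)
```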

Figure 3.1: Histograms of age and expenditure. (a) Histogram of age. (b) Histogram of total amount of expenditure on all health services. (c) Histogram of total amount of expenditure on all health services paid out of pocket. (d) Histogram of total amount of expenditure on all health services not paid out of pocket.

3.3 Independent variables

The explanatory variables include health-related characteristics, socioeconomic status and geographic and demographic variables. The dataset contains a very large number of variables that represent these categories. An overview of the variables used is given in table 3.2.

Health-related characteristics are represented by a large number of diagnoses, medical check-ups and other characteristics. Many of these variables are dummy variables, meaning that they are represented by either a 1 or a 0. A general selection criterion on these variables is that they were measured at the beginning of or before the year of interest. It is useless to predict an individual's health care utilization based on his/her medical issues in that same year. The health status variables in table 3.2 of which the source ends with '31' were collected in the first interview round of the year of interest. The first interview round was always held in the first month. These variables include MNHLTH31, RTHLTH31, PREGNT31, ADSMOK31, WLKLIM31, ACTLIM31, UNABLE31 and BMINDX31, which stand for mental health, physical health, pregnant, smoker, limitation in physical functioning, limitation in work/housework/school, completely unable to do activities and BMI, respectively. The variables mental health and physical health were measured by what individuals filled in when they were asked for their perceived mental and physical status. The possible answers were: 'excellent', 'very good', 'good', 'fair' and 'poor'. For the 10 possible answers, dummy variables are constructed, leaving those who did not fill in a question as the omitted class for that question. The variable BMI is not measured for all respondents. The missing values of BMI are set to 0 and an additional dummy variable is created, which is set to 1 if the value of BMI was missing, to account for the missing values. Some diseases were diagnosed in the same year the database was constructed; these diagnoses should not be used in predicting that year's health care utilization. The dataset provides the age of the respondent at the time of the diagnosis for some diseases. For these diseases, the age of diagnosis is compared with the respondent's current age, and the diagnosis variable is set to 1 only if the two differ by at least a year.
The included diagnosis variables are: HIBPDX, CHDDX, ANGIDX, MIDX, OHRTDX, STRKDX, EMPHDX, CHOLDX, DIABDX, ARTHDX, ASTHDX and ADHDADDX. These variables stand for diagnoses of high blood pressure, coronary heart disease, angina, heart attack, other heart disease, stroke, emphysema, high cholesterol, diabetes, arthritis, asthma and ADHD, respectively. The dataset does not provide the age of the respondent at the time of the diagnosis for all diseases. Therefore, the variables CHBRON31 and CANCERDX, which stand for chronic bronchitis and cancer diagnosis, are excluded. Health-related characteristics are also represented by some medical check-ups including DSFB**53, DSEY**53, DSCH**53 and DSFL**53, where the stars indicate the year before the year of interest. These variables stand for: had feet checked, dilated eye exam, had cholesterol checked and flu vaccination, respectively.
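The three encoding rules described above (dummy variables for the perceived health answers, a missing-value indicator for BMI, and the one-year rule for diagnoses) can be sketched in Python as follows. This is a sketch under assumptions and not the code used in this research: the function names and dictionary keys are hypothetical, and a missing value is represented by `None` here, whereas the MEPS files use their own codes for missing answers.

```python
HEALTH_LEVELS = ["excellent", "very good", "good", "fair", "poor"]


def health_dummies(answer, prefix):
    """One dummy per possible answer; a missing answer gives all zeros (omitted class)."""
    return {f"{prefix}_{level}": int(answer == level) for level in HEALTH_LEVELS}


def bmi_features(bmi):
    """Missing BMI is set to 0 and flagged by an additional dummy variable."""
    missing = bmi is None
    return {"bmi": 0.0 if missing else float(bmi), "bmi_missing": int(missing)}


def prior_diagnosis(age_now, age_at_diagnosis):
    """1 only if the diagnosis was made at least one year before the current age."""
    if age_at_diagnosis is None:  # no diagnosis reported
        return 0
    return int(age_now - age_at_diagnosis >= 1)


# Made-up respondent for illustration.
features = {}
features.update(health_dummies("good", "mental"))
features.update(health_dummies(None, "physical"))
features.update(bmi_features(None))
features["diabetes"] = prior_diagnosis(age_now=50, age_at_diagnosis=49)
features["asthma"] = prior_diagnosis(age_now=50, age_at_diagnosis=50)
print(features)
```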

Geographic and demographic variables are represented by region, sex, age, not white, married and discharged. For REGION**, four dummy variables are constructed to represent the west, midwest, south and northeast region of the U.S. Race and marital status are coded as RACEV1X and MARRY**X, which are multiple category variables. Each of these variables is reduced to a single dummy variable of interest, indicating whether the individual has a skin colour other than white and whether he/she is currently married, respectively.

Just like with the health-related characteristics, the socioeconomic status variables like insurance should be reported before or at the beginning of the year. An individual is for example more likely to take insurance after he uses health care (Liu et al., 2012). The variable OCCCAT31 is represented by dummy variables for each of the 10 occupation categories, leaving missing values as the omitted class. The occupation categories are 'management, business and financial operations', 'professional and related occupations', 'service occupations', 'sales and related occupations', 'office and administrative support', 'farming, fishing and forestry', 'construction, extraction and maintenance', 'production, transportation and material moving', 'military specific occupations' and 'unclassified occupations'.

In total, there are 5 dependent and 55 independent variables, of which 51 are dummy variables. The only non-dummy variables are BMI, age, income and family income. A comparison of baseline characteristics of patients by expenditure is provided in table 3.3.


| Variable | Source variable in dataset | Description |
|---|---|---|
| Dependent variables: | | |
| utilization | TOTEXP** | Total health expenditures |
| utilization > $100 | TOTEXP** | Total health expenditures > $100 |
| utilization > $500 | TOTEXP** | Total health expenditures > $500 |
| pocket utilization | TOTSLF** | Out of pocket health expenditures |
| not pocket utilization | TOTEXP** - TOTSLF** | Not out of pocket health expenditures |
| Explanatory variables: | | |
| 1. Health Status Variables: | | |
| Mental health (a) | MNHLTH31 | Perceived mental health status |
| Physical health (a) | RTHLTH31 | Perceived health status |
| Pregnant | PREGNT31 | Pregnancy status |
| Smoker | ADSMOK31 | Smoking status |
| High blood pressure | HIBPDX | High blood pressure diagnosis |
| Coronary heart disease | CHDDX | Coronary HRT disease diagnosis |
| Angina | ANGIDX | Angina diagnosis |
| Heart attack | MIDX | Heart attack (MI) diagnosis |
| Other heart disease | OHRTDX | Other heart diseases diagnosis |
| Stroke | STRKDX | Stroke diagnosis |
| Emphysema | EMPHDX | Emphysema diagnosis |
| High cholesterol | CHOLDX | High cholesterol diagnosis |
| Diabetes | DIABDX | Diabetes diagnosis |
| Arthritis | ARTHDX | Arthritis diagnosis |
| Asthma | ASTHDX | Asthma diagnosis |
| Adhd | ADHDADDX | ADHD diagnosis |
| Limitation 1 | WLKLIM31 | Limitation in physical functioning |
| Limitation 2 | ACTLIM31 | Any limitation in work, housework and/or school |
| Limitation 3 | UNABLE31 | Completely unable to do activities |
| Had feet checked | DSFB**53 | Had feet checked |
| Dilated eye exam | DSEY**53 | Dilated eye exam |
| Had cholesterol checked | DSCH**53 | Cholesterol checked |
| Flu vaccination | DSFL**53 | Flu vaccination |
| BMI | BMINDX31 | Body Mass Index |
| 2. Geographic and Demographic Variables: | | |
| Region (a) | REGION** | Census region |
| Sex | SEX | Sex |
| Age | AGE**X | Age |
| Not white | RACEV1X (b) | Race of respondent |
| Married | MARRY**X (b) | Marital status |
| Discharged | HONRDC31 | Honorably discharged |
| 3. Socioeconomic Status Variables: | | |
| Income | TTLP**X | Individual total income |
| Family income | FAMINC** | Household total income |
| Employed | EMPST (b) | Employment state |
| Occupation (a) | OCCCAT31 | Occupation group |
| Insured | INSCOV** (b) | Insurance coverage |

(a) Represented by multiple dummy variables
(b) Multiple category variable reduced to a single dummy of interest
** Year dependent characters
*** Year dependent characters: the year before

The columns 'Total > 0', 'Pocket > 0' and 'Not pocket > 0' are grouped by type of expenditure; the columns 'Total > $100' and 'Total > $500' are grouped by threshold.

| | Count | Total > 0 | Pocket > 0 | Not pocket > 0 | Total > $100 | Total > $500 |
|---|---|---|---|---|---|---|
| Pregnant | 2142 | 2060 (96%) | 1648 (77%) | 2020 (94%) | 2018 (94%) | 1900 (89%) |
| Smoker | 18488 | 14043 (76%) | 12544 (68%) | 12584 (68%) | 13014 (70%) | 9928 (54%) |
| BMI | | | | | | |
| < 18.5 | 6096 | 3742 (61%) | 3315 (54%) | 3338 (55%) | 3433 (56%) | 2536 (42%) |
| 18.5 - 24.9 | 39439 | 29639 (75%) | 26464 (67%) | 26792 (68%) | 27514 (70%) | 20186 (51%) |
| 25 - 29.9 | 42111 | 32062 (76%) | 29058 (69%) | 29170 (69%) | 29834 (71%) | 22697 (54%) |
| 30 < | 39747 | 32328 (81%) | 29623 (75%) | 29947 (75%) | 30446 (77%) | 24477 (61%) |
| Male | 59585 | 41439 (70%) | 37311 (63%) | 37089 (63%) | 38193 (64%) | 27918 (47%) |
| Female | 67808 | 56332 (83%) | 51149 (75%) | 52158 (77%) | 53034 (78%) | 41978 (62%) |
| Age (years) | | | | | | |
| 18-24 | 17964 | 11400 (63%) | 8906 (50%) | 10033 (56%) | 10114 (56%) | 6192 (34%) |
| 25-39 | 37028 | 25103 (68%) | 21774 (59%) | 22074 (60%) | 22718 (61%) | 15299 (41%) |
| 40-64 | 54140 | 43987 (81%) | 41088 (76%) | 40136 (74%) | 41399 (76%) | 32739 (60%) |
| 65-84 | 18261 | 17281 (95%) | 16692 (91%) | 17004 (93%) | 16996 (93%) | 15666 (86%) |
| Not white | 33166 | 24908 (75%) | 21737 (66%) | 23050 (69%) | 23059 (70%) | 17269 (52%) |
| Married | 61187 | 49543 (81%) | 45989 (75%) | 45746 (75%) | 46638 (76%) | 36387 (59%) |
| Discharged | 7410 | 6608 (89%) | 6062 (82%) | 6362 (86%) | 6393 (86%) | 5501 (74%) |
| Income | | | | | | |
| 0 - 8999 | 34714 | 24996 (72%) | 21009 (61%) | 22592 (65%) | 22946 (66%) | 17040 (49%) |
| 9000 - 21499 | 31889 | 23578 (74%) | 21244 (67%) | 21132 (66%) | 21812 (68%) | 16769 (53%) |
| 21500 - 42499 | 31301 | 23822 (76%) | 22208 (71%) | 21634 (69%) | 22249 (71%) | 16793 (54%) |
| 42500 < | 29489 | 25375 (86%) | 23999 (81%) | 23889 (81%) | 24220 (82%) | 19294 (65%) |
| Employed | 74755 | 56605 (76%) | 51945 (69%) | 50889 (68%) | 52450 (70%) | 38147 (51%) |
| Insured | 101392 | 85695 (85%) | 77260 (76%) | 82072 (81%) | 81240 (80%) | 64300 (63%) |
| Full dataset | 127393 | 97771 (77%) | 88460 (69%) | 89247 (70%) | 91227 (72%) | 69896 (55%) |

Chapter 4

Models and method

The main research question of this report is: "How do the predictive performances of logistic regression models and artificial neural networks compare when predicting health care utilization in the U.S.?" In this chapter, the approach that was followed to answer the research question is explained. Applying logistic regression models and artificial neural networks involved a number of steps, most of which are similar or even identical for both methods. These steps are as follows:

Logistic regression:

1. Split the data into a training, evaluation and test set.
2. Train logistic regression models on the training set.
   (i) Select the predictor variables.
   (ii) Determine the parameters.
3. Choose the optimal model by evaluating the models on the validation set.
4. Apply the optimal model on the test set.

Neural networks:

1. Split the data into training, evaluation and test set.
2. Train neural networks on the training set.
   (i) Select the predictor variables.
   (ii) Select the number of hidden layers.
   (iii) Select the number of neurons in the hidden layers.
   (iv) Select the activation functions of the neurons.
   (v) Determine the parameters.
3. Choose the optimal network by evaluating the models on the validation set.
4. Apply the optimal network on the test set.

In step 1 of both methods, the data was split up. This step was performed only once, making sure that both classifiers used identical training, evaluation and test sets. Also, three synthetic training samples were constructed in this step by oversampling the minority class. In step 2, the logistic regression models and neural networks were trained. The distinct configurations of logistic regression models used different predictor variables. The distinct configurations of neural networks also used different predictor variables along with different functional forms. The optimal logistic regression model was chosen by evaluating the logistic regression models on the validation set in step 3. Likewise, the optimal neural network was selected. Step 4 is when the optimal logistic regression model and the optimal neural network were applied on the test set. In the following sections, the steps above are further explained. Step 4 is presented as part of the results of this paper (chapter 5).

4.1 Step 1: Training, evaluation and test sets

Before the data was split up into a training, evaluation and test set, a number of different data samples were constructed. The main data sample consists of all the data between 2011 and 2015. The corresponding results could be considered the most relevant since the most data was used. Another data sample uses all data between 2011 and 2014 for training and validation and performs predictions on 2015. The corresponding results are also very valuable since they best reflect the predictability of health care utilization for upcoming years. The other five data samples consist of the individual years, of which 2015 is the most recent and relevant. The trend in the predictive performance over these years was studied. Table 4.1 provides an overview of these data samples.

| Training set | Validation set | Test set |
|---|---|---|
| 2011 - 2015 (63697; 50%) | 2011 - 2015 (31848; 25%) | 2011 - 2015 (31848; 25%) |
| 2011 - 2014 (68225; 67%) | 2011 - 2014 (34113; 33%) | 2015 (25055; 100%) |
| 2015 (12527; 50%) | 2015 (6264; 25%) | 2015 (6264; 25%) |
| 2014 (12218; 50%) | 2014 (6109; 25%) | 2014 (6109; 25%) |
| 2013 (12937; 50%) | 2013 (6469; 25%) | 2013 (6468; 25%) |
| 2012 (13643; 50%) | 2012 (6822; 25%) | 2012 (6821; 25%) |
| 2011 (12371; 50%) | 2011 (6186; 25%) | 2011 (6185; 25%) |

Each cell gives the year(s) covered, followed in brackets by the number of observations and the separation size.

Table 4.1: Data samples and split sizes

The data samples were thereafter split up into a training set, an evaluation set (also referred to as the validation set) and a test set. Some researchers only distinguish a training set and a test set and create a validation split within the training set. The three-way setup is used here because it is clearer. The training set was used to fit the parameters of the classifier. The validation set was used to find the optimal models by testing distinct configurations of logistic regression models and neural networks on it. The test set was used for evaluating the performance of the final model. The majority of the data is used for training because that generally requires the most data. In case all three sets were from the same time span, the separation 50% / 25% / 25% was used. In case only the training and evaluation set were from the same time span, the separation 67% / 33% was used. Table 4.1 provides the exact sizes of the sets.
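A split into three sets of this kind can be sketched in Python as follows. This is only an illustration: the text does not state how observations were assigned to the sets, so the random shuffle, the fixed seed and the function name are assumptions, and the rounding used here will not necessarily reproduce the exact set sizes of table 4.1.

```python
import random


def split_indices(n, fractions=(0.50, 0.25, 0.25), seed=0):
    """Randomly split the indices 0..n-1 into training, validation and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # assumption: observations are assigned at random
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]      # the remainder forms the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Performing the split once and storing the indices ensures that both classifiers are trained, validated and tested on identical sets.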

In addition to the original training set, synthetic training sets were evaluated. Synthetic training sets were made by over-sampling the minority class in the training sets to account for class imbalance. Majority class under-sampling was not performed because there is no surplus of data. The SMOTE (Synthetic Minority Over-sampling Technique) algorithm, designed by Chawla et al. (2002), was used to create training sets in which the minority class is augmented by 100%, 200% and $\left(\frac{n-m}{m} - 1\right) \cdot 100\%$. Here, $n$ is the total number of observations in the training set and $m$ is the number of observations of the minority class in the training set. Augmenting the minority class by $\left(\frac{n-m}{m} - 1\right) \cdot 100\%$ results in perfect class balance, since the minority class then contains $m + (n - 2m) = n - m$ observations. Data was not augmented by 100% or 200% if the minority class would become the majority class as a result. The pseudo-code for SMOTE is as follows:

Algorithm SMOTE(S, m, P, k)

Input: Training set: S, number of minority class samples: m, amount of SMOTE: P% and number of nearest neighbours: k.

Output: (P/100) ∗ m synthetic minority samples

1. C ← ∅
2. if P < 100
3.     C ← randomly select (P/100) ∗ m minority samples
4. else
5.     C ← select all minority samples floor(P/100) times and randomly select
6.         (P/100 − floor(P/100)) ∗ m minority samples
7. end if
8. for all samples r in C:
9.     compute the k minority class nearest neighbours of r
10.    randomly select one of the k nearest neighbours and call it nn
11.    create a synthetic minority sample s with attributes equal to 0
12.    for all of the attributes att of r:
13.        compute dif: the value of att of nn minus the value of att of r
14.        compute gap: a random value between 0 and 1
15.        assign value to the attribute att of s equal to att of r + gap ∗ dif
16.    end for
17.    add s to the output
18. end for

Algorithm 4.1: SMOTE

In short, the SMOTE algorithm creates $\frac{P}{100} \cdot m$ synthetic minority samples and assigns to each of their attributes a value randomly chosen between that of a randomly selected minority sample and that of one of its nearest minority class neighbours. The number of nearest neighbours was set to 5. The distance between two samples is equal to the sum of the squared differences between the attributes:

$$\text{dist}(x_i, x_j) = \sum_{a=1}^{k} (x_{ia} - x_{ja})^2 \tag{4.1}$$

where $k$ in this equation denotes the number of attributes.

The four non-dummy variables, BMI, age, income and family income, were re-scaled for the nearest neighbour search because their distances are large relative to those of the dummy variables. The values of BMI were divided by 2 times the average difference between two observations of BMI, so that the average difference of BMI was 0.5. The same was done for age, income and family income.
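Algorithm 4.1 can be sketched in Python as follows. This is a minimal sketch that follows the pseudo-code above and is not the implementation used in this research; the function names and the toy data are hypothetical, the non-dummy attributes are assumed to be re-scaled beforehand, and at least two minority samples are assumed. As in the pseudo-code, every attribute is interpolated with its own random gap.

```python
import random


def squared_distance(a, b):
    """Equation 4.1: sum of squared attribute differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, percent, k=5, seed=0):
    """Create (percent / 100) * m synthetic samples from the m minority samples."""
    rng = random.Random(seed)
    m = len(minority)
    full, frac = divmod(percent, 100)
    # Every minority sample `full` times, plus a random part for the remainder.
    base = list(range(m)) * int(full) + rng.sample(range(m), round(frac / 100 * m))
    synthetic = []
    for i in base:
        r = minority[i]
        # The k nearest minority class neighbours of r (excluding r itself).
        neighbours = sorted(
            (j for j in range(m) if j != i),
            key=lambda j: squared_distance(r, minority[j]),
        )[:k]
        nn = minority[rng.choice(neighbours)]
        # Each attribute lies between the value of r and the value of nn.
        synthetic.append([a + rng.random() * (b - a) for a, b in zip(r, nn)])
    return synthetic


# Made-up minority class with two attributes.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doubled = smote(minority, percent=200, k=2)

# Perfect balance for n = 10 observations of which m = 4 are minority samples.
n, m = 10, len(minority)
balance_percent = ((n - m) / m - 1) * 100
balanced = smote(minority, percent=balance_percent, k=2)
print(len(doubled), len(balanced))
```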

In total, 7 different data samples were considered. For every data sample, one training set and a number of synthetic training sets were used to train models. For every data sample, there were also 5 different dependent variables. A total of 105 optimal models were trained for both classifiers. These models were evaluated by 5 ∗ 7 = 35 validation sets and tested by 5 ∗ 7 = 35 test sets.

4.2 Step 2a: Logistic Regression

In this research, the dependent variable $y_i$ takes one of two outcomes. As stated before, there are multiple dependent variables that will be predicted individually, but for now the key dependent variable is considered:

$$y_i = \begin{cases} 1 & \text{if individual } i \text{ utilized health care this year,} \\ 0 & \text{if not.} \end{cases} \tag{4.2}$$

In logistic regression, the probability of a positive outcome is related to a series of predictor variables by an equation of the form:

$$P[y_i = 1] = F(x_i'\beta) = \sigma(x_i'\beta) = \frac{1}{1 + e^{-x_i'\beta}} \tag{4.3}$$

where $x_i$ is a vector of predictor variables corresponding to individual $i$ and $\beta$ is a vector of coefficients associated with each predictor variable. The first element of $x_i$ is 1, so the first element of $\beta$ represents an intercept (Cameron and Trivedi, 2005).

To estimate the values of $\beta$, logistic regression models maximize a likelihood function. If the probability that an individual uses health care is equal to $p$, then the likelihood that all $n$ individuals $i$ have output $y_i$ is equal to:

$$L(p) = \prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i} \tag{4.4}$$

The log-likelihood is then given by:

$$\ln(L(p)) = \sum_{i=1}^{n} y_i \ln(p) + \sum_{i=1}^{n} (1 - y_i) \ln(1 - p) = \sum_{i,\, y_i = 1} \ln(p) + \sum_{i,\, y_i = 0} \ln(1 - p) \tag{4.5}$$

In logistic regression the probability that an individual uses health care is not equal to $p$ but related to a series of predictor variables by an equation of the form $\sigma(x_i'\beta)$. The log-likelihood function therefore equals:

$$\ln(L(\beta)) = \sum_{i,\, y_i = 1} \ln(\sigma(x_i'\beta)) + \sum_{i,\, y_i = 0} \ln(1 - \sigma(x_i'\beta)) \tag{4.6}$$

The vector of coefficients $\beta$ associated with the predictor variables was estimated by maximizing this likelihood function over $\beta$.

Stochastic gradient descent was used to maximize the likelihood. The same algorithm was also used to train the neural networks. Stochastic gradient descent minimizes the cross-entropy error function, which equals the negative of the log-likelihood:

$$E(\beta) = -\ln(L(\beta)) = -\sum_{i=1}^{n} \left\{ y_i \ln(\sigma(x_i'\beta)) + (1 - y_i) \ln(1 - \sigma(x_i'\beta)) \right\} \tag{4.7}$$

The stochastic gradient descent algorithm estimates the values of $\beta$ by updating them with the gradients of the error function:

$$\beta^{(\tau + 1)} = \beta^{(\tau)} - \eta \nabla E(\beta^{(\tau)}) \tag{4.8}$$

where $\eta$ is a positive step size, also called the learning rate. The gradients of the error function can be obtained by using the derivative of the sigmoid function:

$$\sigma(a) = \frac{1}{1 + e^{-a}} \tag{4.9}$$

$$\frac{\partial \sigma}{\partial a} = -1 \cdot (-1) \cdot e^{-a} \cdot (1 + e^{-a})^{-2} = \frac{1}{1 + e^{-a}} \cdot \frac{e^{-a}}{1 + e^{-a}} = \sigma(a) \cdot (1 - \sigma(a)) \tag{4.10}$$

Using the result above, the gradients of the error function are derived as follows:

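$$\begin{aligned}
\nabla E(\beta) = \frac{\partial E(\beta)}{\partial \beta} &= -\sum_{i=1}^{n} \left[ \frac{y_i}{\sigma(x_i'\beta)} \frac{\partial \sigma(x_i'\beta)}{\partial \beta} - \frac{1 - y_i}{1 - \sigma(x_i'\beta)} \frac{\partial \sigma(x_i'\beta)}{\partial \beta} \right] \\
\frac{\partial \sigma(x_i'\beta)}{\partial \beta} &= \sigma(x_i'\beta) \left( 1 - \sigma(x_i'\beta) \right) x_i \\
\frac{\partial E(\beta)}{\partial \beta} &= -\sum_{i=1}^{n} \left[ \frac{y_i}{\sigma(x_i'\beta)} - \frac{1 - y_i}{1 - \sigma(x_i'\beta)} \right] \sigma(x_i'\beta) \left( 1 - \sigma(x_i'\beta) \right) x_i \\
&= -\sum_{i=1}^{n} \left[ y_i \left( 1 - \sigma(x_i'\beta) \right) - (1 - y_i)\, \sigma(x_i'\beta) \right] x_i \\
&= -\sum_{i=1}^{n} \left[ y_i - y_i \sigma(x_i'\beta) - \sigma(x_i'\beta) + y_i \sigma(x_i'\beta) \right] x_i \\
\nabla E(\beta) &= \sum_{i=1}^{n} \left( \sigma(x_i'\beta) - y_i \right) x_i
\end{aligned} \tag{4.11}$$

Note that the contribution of each observation to the gradients is given by the error $\sigma(x_i'\beta) - y_i$. As the algorithm sweeps through the dataset, it performs the updating equation 4.8 for each observation. Several passes were made over the dataset until the algorithm converged. The data was shuffled for each pass to prevent cycles. Stochastic gradient descent can be presented as follows in pseudocode: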

Algorithm SGD(β, η)

Input: Initial vector of parameters: β and learning rate η
Output: Estimated parameters β

1. repeat the following steps until an approximate global minimum is obtained:
2.     randomly shuffle the dataset
3.     for i = 1, 2, ..., n do:
4.         β(τ+1) = β(τ) − η∇E_i(β(τ)), where E_i is the contribution of observation i to E
5.     end for

Algorithm 4.2: SGD
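Algorithm 4.2 can be sketched in Python for the logistic regression model as follows. This is a minimal sketch and not the code used in this research: the per-observation update uses the error term from equation 4.11, a fixed number of passes replaces the convergence check, and the function names, the learning rate and the toy data are assumptions made for illustration.

```python
import math
import random


def sigmoid(a):
    """Equation 4.9."""
    return 1.0 / (1.0 + math.exp(-a))


def cross_entropy(beta, X, y):
    """Equation 4.7: negative log-likelihood of the logistic model."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


def sgd_logistic(X, y, eta=0.05, epochs=200, seed=0):
    """Stochastic gradient descent with one update per observation."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):      # fixed number of passes instead of a convergence test
        rng.shuffle(order)       # shuffle for each pass to prevent cycles
        for i in order:
            error = sigmoid(sum(b * x for b, x in zip(beta, X[i]))) - y[i]
            beta = [b - eta * error * x for b, x in zip(beta, X[i])]
    return beta


# Made-up data: the first element of every x is the constant 1 (intercept).
X = [[1, -3], [1, -2], [1, -1], [1, 0], [1, 1], [1, 2], [1, 3], [1, -1], [1, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 0]

beta_hat = sgd_logistic(X, y)
loss_start = cross_entropy([0.0, 0.0], X, y)
loss_end = cross_entropy(beta_hat, X, y)
print(beta_hat, loss_start, loss_end)
```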

The stochastic gradient descent algorithm raises the question of how to set the learning rate $\eta$. Setting this parameter too high could cause the algorithm to diverge; setting it too low makes it slow to converge. Typical implementations use an adaptive learning rate to ensure that the algorithm converges. A simple extension of the algorithm is to make the learning rate a decreasing function $\eta_t$ of $t$ so that the first iterations cause large changes in the parameters while the later ones do fine-tuning. In this research, the 'Adam' optimization algorithm, designed by Kingma and Ba (2015), was used. Adam is an adaptation of stochastic gradient descent in which the learning rate $\eta$ is adapted for each of the parameters. This optimization algorithm also uses running averages of both the gradients and the second moments of the gradients. Given the parameters $\beta^{(\tau)}$ and loss function $E(\beta)$, Adam's parameter update is given by:

$$\begin{aligned}
m_\beta^{(\tau + 1)} &= \alpha_1 m_\beta^{(\tau)} + (1 - \alpha_1) \nabla E(\beta^{(\tau)}) \\
v_\beta^{(\tau + 1)} &= \alpha_2 v_\beta^{(\tau)} + (1 - \alpha_2) \left( \nabla E(\beta^{(\tau)}) \right)^2 \\
\hat{m}_\beta &= \frac{m_\beta^{(\tau + 1)}}{1 - \alpha_1^{\tau + 1}} \\
\hat{v}_\beta &= \frac{v_\beta^{(\tau + 1)}}{1 - \alpha_2^{\tau + 1}} \\
\beta^{(\tau + 1)} &= \beta^{(\tau)} - \eta \frac{\hat{m}_\beta}{\sqrt{\hat{v}_\beta} + \epsilon}
\end{aligned} \tag{4.12}$$

where $\epsilon$ is a small scalar used to prevent division by 0, and $\alpha_1$ and $\alpha_2$ are the forgetting factors for gradients and second moments of gradients, respectively. The values of $\eta$, $\epsilon$, $\alpha_1$ and $\alpha_2$ are set to 0.001, $10^{-8}$, 0.9 and 0.999, respectively, which are values suggested by Kingma and Ba (2015).
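A single Adam update as in equation 4.12 can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in this research; the function name and argument names are hypothetical, and `t` denotes the step counter τ + 1 that appears in the bias corrections.

```python
import math


def adam_step(beta, grad, m, v, t, eta=0.001, alpha1=0.9, alpha2=0.999, eps=1e-8):
    """One Adam update for every parameter; t is the 1-based step counter."""
    # Running averages of the gradients and of the squared gradients.
    m = [alpha1 * mj + (1 - alpha1) * g for mj, g in zip(m, grad)]
    v = [alpha2 * vj + (1 - alpha2) * g * g for vj, g in zip(v, grad)]
    # Bias corrections for the zero initialisation of m and v.
    m_hat = [mj / (1 - alpha1 ** t) for mj in m]
    v_hat = [vj / (1 - alpha2 ** t) for vj in v]
    beta = [
        b - eta * mh / (math.sqrt(vh) + eps)
        for b, mh, vh in zip(beta, m_hat, v_hat)
    ]
    return beta, m, v


# First step from zero-initialised averages with a made-up gradient.
beta0 = [1.0, -2.0]
new_beta, m1, v1 = adam_step(beta0, grad=[4.0, -0.5], m=[0.0, 0.0], v=[0.0, 0.0], t=1)
print(new_beta)
```

In the first step the corrected averages equal the gradient and its square, so every parameter moves by approximately η in the direction opposite to the sign of its gradient, regardless of the size of the gradient.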

Within logistic regression, there is little space for model optimization. The probability of a positive outcome is set to relate to the predictor variables through the cdf of the logistic distribution. The parameters $\beta$ are set to be optimized through likelihood maximization. As long as convergence is reached, the choice of optimization algorithm is not important. The only ways to optimize the model and to improve its performance are sample selection and variable selection. Sample selection was already discussed in the previous section and does not play a role in the actual training of the model. That leaves variable selection. When logistic regression is used as a prediction technique, variable selection includes adding non-linear terms of variables and removing weak variables.

Adding non-linear terms of variables could be useful to account for a possible non-linearity between the dependent and the independent variable. Generally, any non-linear term could increase the predictive performance of the model. However, instead of adding plenty of non-linear terms to the model, the non-linear terms were restricted to terms that are broadly used to model health care expenditure or utilization (Sturm, 2002; Shen, 2013). Income and family income were replaced by the logarithm of 1 plus their value. Also, a squared term of age was added.
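These transformations can be sketched in Python as follows. The sketch is illustrative only: the function name and dictionary keys are hypothetical, and incomes are assumed to be non-negative, since the logarithm of 1 plus the income is undefined for incomes of -1 or lower.

```python
import math


def add_nonlinear_terms(age, income, family_income):
    """Log-transform the income variables and add a squared term of age."""
    return {
        "age": age,
        "age_squared": age ** 2,
        "log_income": math.log(1 + income),            # replaces income
        "log_family_income": math.log(1 + family_income),  # replaces family income
    }


# Made-up respondent without individual income.
terms = add_nonlinear_terms(age=40, income=0.0, family_income=math.e - 1)
print(terms)
```

Adding 1 before taking the logarithm keeps respondents with zero income in the sample, because their transformed value is 0.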

Variables were selected for inclusion in these models through a variation of the backward stepwise regression technique (Derksen and Keselman, 1992). This technique starts with all candidate variables. In each step, the model is fitted and its performance is evaluated on the validation set. Then, a variable is deleted from the set of explanatory variables based on its significance. Unlike the regular backward stepwise regression technique, in which the deletion of variables stops when all remaining variables are statistically significant, this process is repeated until there is only one variable left. The set of variables with the highest performance on the validation set is then selected as the optimal set. The significance of a variable is based on its t-value. The variable with the lowest absolute t-value is deleted in each step. The constant term is always included. The absolute t-value is calculated by:

\[ |t_{\hat{\beta}}| = \frac{|\hat{\beta}|}{s_{\hat{\beta}}} \tag{4.13} \]

where

\[ s_{\hat{\beta}} = \frac{\sqrt{\frac{1}{n-2}\sum_{i}(y_i - \hat{y}_i)^2}}{\sqrt{\sum_{i}(x_i - \bar{x})^2}} \tag{4.14} \]

One thing to keep in mind when applying this method is that when the dataset has two or more correlated variables, any of these correlated variables can be used as the predictor, with little preference of one over the others. But when all of them are used, the significance of those variables is reported as much lower. This is not an issue in this case. Variable selection is performed to reduce overfitting and it makes sense to remove the variables that are mostly duplicated by other variables. As long as variables are removed one by one, there is no danger of removing important variables with low significance due to correlation.
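The elimination loop described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this research: the model is fitted by Newton-Raphson instead of Adam, the significance of a coefficient is measured by the absolute ratio of the estimate to a standard error taken from the inverse Hessian rather than by equation 4.14, and validation accuracy is assumed as the performance measure. All function names are hypothetical, and column 0 of the design matrix is assumed to hold the constant.

```python
import numpy as np

def fit_logit(X, y, n_iter=30):
    """Fit a logit model by Newton-Raphson (assumption: the data are not
    perfectly separable). Returns coefficients and standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(hessian)))
    return beta, se

def accuracy(X, y, beta):
    """Share of correct predictions with a decision boundary of 0.5."""
    return float(np.mean(((X @ beta) > 0.0) == (y == 1)))

def backward_selection(X_tr, y_tr, X_val, y_val):
    """Delete the weakest variable each step until one variable is left
    (next to the constant); return the best set on the validation data."""
    active = list(range(X_tr.shape[1]))
    best_score, best_set = -np.inf, None
    while True:
        beta, se = fit_logit(X_tr[:, active], y_tr)
        score = accuracy(X_val[:, active], y_val, beta)
        if score > best_score:
            best_score, best_set = score, list(active)
        if len(active) == 2:          # constant plus one variable
            break
        t_abs = np.abs(beta / se)
        t_abs[0] = np.inf             # the constant is never deleted
        del active[int(np.argmin(t_abs))]
    return best_set, best_score
```

Because the loop continues past the point where all remaining variables are significant, the returned set is the one that scored best on the validation data, not necessarily the last one visited.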


4.3 Step 2b: Neural Networks

In chapter 2, artificial neural networks were introduced as machine learning methods that can be used for both regression and classification problems. Neural networks consist of neurons, layers and activation functions. This section gives a more detailed description of their structure and mathematical relations and presents the types of neural networks that will be used to predict health care utilization.

Figure 4.1: Diagram of an example neural network

A basic neural network works as follows. At first, the input data is vectorized and fed into the network. In figure 4.1, the input data is represented by x1 to xk, with x0 as a constant to represent the bias term. Then, a series of matrix operations is performed on this input data layer by layer. In the simple case, for each layer, the input is multiplied by the weights, a bias is added and an activation function is applied (equation 4.15).

\[ p_b = \sum_{a=1}^{k} w_{ba}\, x_a + x_0, \qquad z_b = u(p_b) \tag{4.15} \]

where $k$ is the length of the input vector, $w_{ba}$ is the weight that relates the intermediate value $p_b$ to $x_a$, and $u(\cdot)$ is the activation function that transforms $p_b$ into the value stored in neuron $z_b$. The values $z_b$ then serve as the input of the next layer, and the same operations are repeated until the last layer is reached:

\[ q_c = \sum_{b=1}^{l} w_{cb}\, z_b + z_0, \qquad y_c = v(q_c) \tag{4.16} \]
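The forward pass through one hidden layer and one output layer can be sketched as below. This is a minimal illustration: the bias terms, written as the constants x0 and z0 in the text, are passed here as explicit bias arrays, and the choice of tanh for u and the logistic sigmoid for v is an assumption made for the example only. All names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, used here as the output activation v."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_hidden, b_hidden, W_out, b_out, u=np.tanh, v=sigmoid):
    """Forward pass for one input vector x.

    W_hidden has shape (l, k) and W_out has shape (m, l), where k is the
    number of inputs, l the number of hidden neurons, m the number of outputs.
    """
    p = W_hidden @ x + b_hidden   # weighted inputs plus bias
    z = u(p)                      # hidden neuron values
    q = W_out @ z + b_out         # weighted hidden values plus bias
    y = v(q)                      # network output
    return p, z, q, y
```

With a sigmoid output activation, y lies strictly between 0 and 1, so it can be read as a probability.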

The final output value $y_c$ is denoted $y$, since there is only one output variable. The prediction is then evaluated in the error function, often called the 'loss function'. Unlike logistic regression, neural networks can use different kinds of loss functions. In binary classification, however, it is very common to assume that the prediction $y$ equals the conditional probability of a positive outcome of the predicted variable $t$. The cross-entropy error function, which is also the error function used in logistic regression, follows from this assumption. This research adopts the assumption, which reads:

\[ y = y(x, w) = \mathrm{prob}(t = 1 \mid x) \tag{4.17} \]

where $x$ is a vector of input data and $w$ is the vector of all weights. The conditional distribution of the target $t$ then follows a Bernoulli distribution:

\[ \mathrm{prob}(t \mid x, w) = y(x, w)^{t}\,\bigl(1 - y(x, w)\bigr)^{1-t} \tag{4.18} \]

Assuming independent training samples, taking the negative logarithm of the likelihood function leads to the cross-entropy error function:

\[ E(w) = -\sum_{i=1}^{n} \bigl[\, t_i \ln(y_i) + (1 - t_i) \ln(1 - y_i) \,\bigr] \tag{4.19} \]
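As a small numerical illustration of the cross-entropy error function, the sketch below evaluates it for given targets and predictions. The clipping constant eps is an assumption added for numerical safety and is not part of the definition.

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    """Cross-entropy error summed over all observations.

    t: array of targets in {0, 1}; y: array of predicted probabilities.
    Predictions are clipped to [eps, 1 - eps] to avoid log(0).
    """
    t = np.asarray(t, dtype=float)
    y = np.clip(np.asarray(y, dtype=float), eps, 1.0 - eps)
    return float(-np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y)))
```

A prediction of 0.5 for every observation contributes ln 2 per observation, regardless of the target; confident wrong predictions are penalized much more heavily.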

After the prediction is evaluated in the error function, the partial derivatives of the error function with respect to the weights are computed layer by layer, going backwards recursively:

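\[
\begin{aligned}
\frac{\partial E}{\partial w_{cb}} &= \frac{\partial E}{\partial q_c}\,\frac{\partial q_c}{\partial w_{cb}} \equiv \delta_c z_b \\
\frac{\partial E}{\partial w_{ba}} &= \frac{\partial E}{\partial p_b}\,\frac{\partial p_b}{\partial w_{ba}} \equiv \delta_b x_a \\
\delta_b &= \sum_{c} \frac{\partial E}{\partial q_c}\,\frac{\partial q_c}{\partial p_b} = u'(p_b) \sum_{c} w_{cb}\,\delta_c
\end{aligned}
\tag{4.20}
\]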
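The backward recursion can be illustrated for a network with one hidden layer and a single output neuron. The sketch assumes a tanh hidden activation and a sigmoid output activation combined with the cross-entropy error, in which case the output delta reduces to y - t. Biases are passed as explicit arguments and all names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_and_grads(x, t, W1, b1, w2, b2):
    """Error and gradients for one observation (x, t).

    W1: (l, k) hidden weights, b1: (l,) hidden biases,
    w2: (l,) output weights, b2: scalar output bias.
    """
    # Forward pass
    p = W1 @ x + b1
    z = np.tanh(p)
    q = w2 @ z + b2
    y = sigmoid(q)
    E = -(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

    # Backward pass
    delta_c = y - t                            # dE/dq for sigmoid + cross-entropy
    grad_w2 = delta_c * z                      # dE/dw_cb = delta_c * z_b
    delta_b = (1.0 - z ** 2) * w2 * delta_c    # u'(p_b) * w_cb * delta_c
    grad_W1 = np.outer(delta_b, x)             # dE/dw_ba = delta_b * x_a
    return E, grad_W1, delta_b, grad_w2, delta_c
```

The returned deltas are also the gradients with respect to the biases. A finite-difference comparison on a single weight is a simple way to check such a derivation.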

The weights are updated with these values. The same stochastic gradient descent algorithm as in the logistic regression model, namely the Adam optimization algorithm, was used for updating the values. Adam's parameter update is given by equation 4.12, where the instances of β are to be replaced by w for neural networks. These steps, from feeding the input data to updating the weights, are repeated until the loss function is minimized. This process is called 'error back-propagation' (Rumelhart et al., 1986). This paper is restricted to this training algorithm because it has proven to be the most convenient training algorithm (Rumelhart et al., 1986). The most prominent alternative to back-propagation is 'direct feedback alignment'. This algorithm uses different weights for propagating the error backwards than the weights used for the forward matrix operations. Nokland (2016) showed that its test performance is almost as good as that obtained with back-propagation.
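A single Adam update of the weights can be sketched as follows. The sketch follows the standard form of the algorithm with its commonly used default hyperparameters; these defaults and the function name are assumptions, and the hyperparameters used in this research may differ.

```python
import numpy as np

def adam_step(w, grad, m, v, step, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update.

    w: weights, grad: gradient of the error with respect to w,
    m, v: running first and second moment estimates,
    step: update counter, starting at 1.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** step)   # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** step)   # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

In a training loop, m and v start at zero and the function is called once per batch with the gradients obtained from back-propagation.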

After a network was trained, it was evaluated on the validation set. To construct the confusion matrix or to calculate the accuracy of the model, the estimates of $y_i$ are rounded to 1 or 0. The decision boundary is set to 0.5.
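The rounding step and the resulting confusion matrix can be sketched as below. Assigning a prediction of exactly 0.5 to the positive class is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def confusion_and_accuracy(t, y_prob, boundary=0.5):
    """Round predicted probabilities and compare them with the targets.

    Returns a 2x2 matrix [[TN, FP], [FN, TP]] and the accuracy.
    """
    t = np.asarray(t).astype(int)
    pred = (np.asarray(y_prob, dtype=float) >= boundary).astype(int)
    tp = int(np.sum((pred == 1) & (t == 1)))
    tn = int(np.sum((pred == 0) & (t == 0)))
    fp = int(np.sum((pred == 1) & (t == 0)))
    fn = int(np.sum((pred == 0) & (t == 1)))
    matrix = np.array([[tn, fp], [fn, tp]])
    return matrix, (tp + tn) / t.size
```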

To optimize the neural network, several neural networks with different functional forms, also called structures, were trained on each training set. The network with the best performance on the validation set was then selected and evaluated once more on the test set. As opposed to logistic regression, there are many ways to customize the structure of a neural network, and therefore many ways to optimize the model and its performance. For each training set, the neural network was optimized over the number of hidden layers, the number of neurons in the hidden layers, the activation functions of the neurons and the set of predictor variables (in neural networks often called features).
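The search over structures can be sketched as a loop over all combinations of candidate settings. The callback validation_score is hypothetical: it stands for training a network with the given structure on the training set and returning its performance on the validation set. The selection of predictor variables is left out of this sketch.

```python
from itertools import product

def select_structure(hidden_layers, neurons, activations, validation_score):
    """Return (best score, best structure) over all candidate combinations.

    validation_score(n_layers, n_neurons, activation) is a user-supplied
    function (hypothetical here) that trains a network and scores it on
    the validation set; higher is better.
    """
    best = None
    for n_layers, n_neurons, act in product(hidden_layers, neurons, activations):
        score = validation_score(n_layers, n_neurons, act)
        if best is None or score > best[0]:
            best = (score, (n_layers, n_neurons, act))
    return best
```

The number of networks to train grows multiplicatively with the number of options per dimension, which is one reason to keep the candidate lists short.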

The structure of a neural network consists of three types of layers: input, hidden and output. Creating the neural network architecture therefore means choosing the number of layers of each type, the number of neurons in each of these layers and the type of activation functions used. The input layer is uncomplicated. Every neural network has exactly one, which contains the features and a constant. As long as at least one hidden layer is included in the network, it is unnecessary to include non-linear terms of features, such as squares, in the input layer. Like the input layer, every neural network has exactly one output layer. In this research, the output layer consists of one neuron: the dependent variable.

That leaves the hidden layers. Both the number of hidden layers and the number of neurons in each of these hidden layers must be carefully considered. If the data is linearly separable, which is not the case here, the network does not need hidden layers at all. Adding hidden layers allows the network to model non-linear relations. The universal approximation theorem states that a neural network with a single hidden layer containing a finite number of neurons can approximate any continuous mapping from one finite-dimensional space to another (Cybenko, 1989; Hornik, 1991). That is why one hidden layer is sufficient for the large majority of problems. While the universal approximation theorem proves that a network with a single hidden layer can represent such a mapping, it does not specify how easy it will be for that network to actually learn it. Neural networks with two hidden layers have the ability to approximate any smooth mapping to any accuracy and can represent an arbitrary decision boundary to arbitrary accuracy (Hornik, 1991). Recent research in deep neural network architectures shows that many hidden layers can be successful for difficult tasks such as cancer recognition (Teramoto et al., 2017).
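A classic illustration of why hidden layers matter when the data is not linearly separable is the XOR problem. The sketch below uses hand-set weights and a step activation, both chosen for illustration only and not taken from this research: no single linear boundary separates the XOR outcomes, but one hidden layer with two neurons does.

```python
import numpy as np

def step(a):
    """Step activation: 1 if the input is positive, else 0."""
    return (np.asarray(a) > 0).astype(float)

def xor_network(x):
    """One hidden layer with two neurons and hand-set weights.

    Hidden neuron 0 fires if at least one input is 1 (OR),
    hidden neuron 1 fires only if both inputs are 1 (AND).
    The output fires for OR-but-not-AND, which is XOR.
    """
    W_hidden = np.array([[1.0, 1.0], [1.0, 1.0]])
    b_hidden = np.array([-0.5, -1.5])
    z = step(W_hidden @ x + b_hidden)
    w_out = np.array([1.0, -1.0])
    b_out = -0.5
    return float(step(w_out @ z + b_out))
```

In practice the weights are learned rather than set by hand, and smooth activation functions are used so that back-propagation can compute gradients.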

As for the number of neurons, there are no clear rules on how many to include. Using too few neurons in the hidden layers will result in ’underfitting’. Underfitting occurs when there are too few neurons in the hidden layers to adequately detect the signals in a complicated data set. Using too many neurons in neural networks can result in several problems. Too many neurons
