
Faculty of Economics and Business

Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis, for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)

(c) Introduction

(d) Theoretical background

(e) Model

(f) Data

(g) Empirical Analysis

(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the heading of the sections. You have a free choice how to list your references but be consistent. References in the text should contain the names of the authors and the year of publication. E.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and year of publication in case of the first reference and use the first name and et al. and year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number

(d) Date of submission final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics

Master’s Thesis Big Data and Business Analytics

2016-2017

Predicting the County-level Adult Obesity Rate in the United States

using Linear Regression and Machine Learning Models

Author:

Alexandros Kakakis

Student Number: 10625127

Supervisor: prof. dr. Marcel Worring

Second Reader: dr. Noud van Giersbergen

MSc in Econometrics

University of Amsterdam


Statement of Originality

This document is written by Student Alexandros Kakakis who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1. Introduction . . . 1

2. Theoretical Background . . . 3

2.1 Factors Associated with Obesity Prevalence . . . 3

2.2 Food Environment and Obesity . . . 5

3. Models and Methodology . . . 8

3.1 Method . . . 8

3.2 Ordinary Least Squares . . . 10

3.3 Regression Trees . . . 11

3.4 Random Forest for Regression . . . 13

3.5 Support Vector Regression . . . 14

3.6 Single Layer Neural Network for Regression . . . 17

3.7 Recursive Feature Elimination . . . 19

4. Data. . . 21

4.1 Food Environment Atlas . . . 21

4.2 Data Preparation . . . 22

4.2.1 Excluded Variables . . . 22

4.2.2 Missing Observations . . . 23

4.3 Final Dataset . . . 25

4.3.1 Variable Definitions . . . 25

4.3.2 Descriptive Statistics . . . 28

5. Results . . . 29

5.1 Recursive Feature Elimination . . . 29

5.2 Ordinary Least Squares . . . 30

5.3 Regression Trees . . . 30

5.4 Random Forest for Regression . . . 33

5.5 Support Vector Regression . . . 35


5.7 Comparison . . . 38

6. Conclusion. . . 41


1. Introduction

Obesity is an alarming problem in the United States. A person is considered obese when his or her Body Mass Index (BMI) is at least thirty. Currently, around one third of American adults have a BMI that exceeds this level (Flegal et al., 2016). This is a significant increase compared to several decades ago, and no decline is expected in the upcoming years. This increasing trend has serious consequences. Obese adults have an increased risk of cancer, diabetes, strokes and other diseases (McGinnis & Nestle, 1989). Because of these elevated health risks, there is a link between growing health costs and obesity (Finkelstein et al., 2009). Health costs will continue to rise if obesity prevalence does not decline. Therefore, the U.S. government is constantly looking for policy interventions that could eventually lead to a reduction in obesity prevalence.

Multiple studies contribute to this by evaluating existing interventions or by providing suggestions for more effective ones. Improving the food environment is a recurring recommendation in those studies. As one would expect, the growing availability of unhealthy food establishments, like fast-food restaurants, has contributed to the increasing obesity trend (Spence et al., 2009). However, this also works the other way around. Improving the accessibility of grocery stores could stimulate healthier food consumption and consequently reduce obesity rates (Bodor et al., 2010). For these and other interventions to be most effective, they should be targeted at areas where the obesity rate is above average. A relevant question is therefore whether we can locate those areas automatically.

We contribute to existing studies that address this question in two ways. First, we try to predict the obesity rate at the county level, instead of the individual- or state-level predictions of other studies. We believe that this is an improvement, since interventions targeted at counties are expected to be more effective than ones focused on individuals (Salois, 2012). Also, county-level predictions are more detailed than state-level ones.

Second, we try to improve prediction accuracy by using machine learning models. This differs from current studies, which mainly use linear models for prediction. Machine learning methods are expected to be more accurate, since they are able to capture nonlinear relations in the data (Zhang et al., 2009). This improvement and the one mentioned above bring us to the following research question: How do machine learning methods compare to a linear regression model when predicting county-level obesity rates in the United States?


In total, we use five different models to answer this research question. The first model is Ordinary Least Squares (OLS) and its performance serves as a baseline for the other models. The machine learning models are: regression trees, random forest for regression, Support Vector Regression (SVR) and a single layer neural network. The first of these is easily interpretable; the other three have proven able to provide accurate predictions in multiple research fields.

We use data from the Food Environment Atlas published by the Economic Research Service (ERS) from the U.S. Department of Agriculture (USDA). This Food Atlas contains many variables involving the food environment in the United States. We believe that this data is suitable for our research, since most of the data is on county-level and it contains all the variables related to our research question. We will estimate the models with all of those relevant variables, but also with an optimal subset of variables determined by Recursive Feature Elimination (RFE). The data is partitioned into a train and test set, such that we can compute an unbiased measure for the prediction error. This measure is the Root Mean Squared Error (RMSE) and we will use this to compare the performances of the different models.

In the upcoming chapter, the theoretical background of this thesis is provided. There, we will discuss some of the earlier research into the factors that influence obesity prevalence. Further, we highlight the models used in earlier studies that aimed to predict obesity rates. In Chapter 3, we give a detailed description of our methodology. We discuss each model that we use, and explain how we tune the corresponding parameters. The subsequent chapter describes the source of our data and the type of variables that it contains. On top of that, we mention which of those variables we exclude and how we deal with missing data. The chapter ends with some descriptive statistics and an overview of the included variables. The results of this study are discussed in Chapter 5. For each of the models we will discuss their accuracy on the test set and display some model specific results. This thesis ends with a conclusion where we will summarize our findings. We will also indicate the complications of this research and suggest some further improvements.


2. Theoretical Background

The purpose of this chapter is to provide some theoretical background for this thesis. We start with an overview of the factors that are associated with obesity prevalence, according to existing literature. In the second section, we detail the methods used in earlier studies that aim to predict the obesity rate in the United States. This includes studies that obtain data from the Food Environment Atlas, the same data we will use.

2.1 Factors Associated with Obesity Prevalence

Many researchers have tried to explain the obesity prevalence. In their studies they usually find a relation between a set of variables and this prevalence. We can group those variables into one of the following categories: accessibility of food, availability of food stores and restaurants, the local food environment, physical activity and socioeconomic characteristics. In this section, we will elaborate on each of those categories and provide results from related studies.

A well known cause of obesity is an unhealthy diet. This type of diet typically consists of food that contains many calories, but little nutritional value. People who consume this kind of food are more likely to be obese (Ford & Dzewaltowski, 2008). Therefore, the literature pays considerable attention to the relation between food environment and obesity. The food environment refers to the public places in a community where food can be obtained (Chi et al., 2013). This environment could affect the diet of individuals in more than one way. First we look at the impact of the type of food stores and restaurants that are available in an area.

It is undeniable that fast-food restaurants and convenience stores are sources of unhealthy food. The results of the state-level study of Morland & Evenson (2009) indeed confirm that the availability of those food establishments increases the risk of obesity. However, the food environment can also reduce this risk. Spence et al. (2009) measure the relation between obesity and specialized food stores and supermarkets, both of which are considered to retail healthier food than the stores mentioned earlier. They find a significant reduction in the obesity risk for people living in areas with relatively more specialized food stores and supermarkets.

Finally, Welker et al. (2016) indicate that improving the food quality on schools might help to put a hold on the growing obesity rate in the United States. They provide evidence that school food policies can significantly influence the food intake of children and thus the obesity risk. According to Li & Hooker (2010), students that are eligible for free or reduced lunch are more likely to have a higher BMI compared to other students.

Besides availability, accessibility of healthy food might also play an important role in obesity prevalence. Michimi & Wimberly (2010) find that within metropolitan areas in the U.S., the distance to supermarkets is negatively correlated with the fruit and vegetable consumption and the obesity rate. It is therefore expected that households that live far from supermarkets have a higher risk of being obese. Similarly, Bodor et al. (2010) find that the addition of a supermarket within a two kilometer radius of an individual decreases the obesity risk for that individual. Further, the price of healthy food compared to unhealthy food could impact the food intake of consumers. Lakdawalla et al. (2005) state that increasing certain food prices through taxes could potentially decrease obesity rates. Those three studies confirm that making healthy food more accessible leads to better diets and consequently lower BMI levels.

So far we have only considered the effects of food on obesity. However, nutrition is not the only factor that is associated with its prevalence. As one would expect, a fair number of studies find a relation between insufficient physical activity and obesity. A study by Gordon-Larsen et al. (2006), for example, finds that the absence of recreational facilities like public tennis courts, beaches or sports clubs in an area increases the risk of obesity for the people living in that area. Further, Ghimire et al. (2017) find a negative correlation between the amount of green space in a county and BMI. They suggest that creating more outdoor recreation resources, like national parks, might contribute to a reduction of the obesity rate in the United States. Lastly, Tucker & Gilliland (2007) study the effect of weather on physical activity. Based on a literature review, they conclude that poor weather leads to a decrease in physical activity levels.

Lastly, we consider the variables related to socioeconomic characteristics. Obesity seems to be associated with the socioeconomic status (SES) of individuals. For instance, especially within lower SES groups, poverty is positively correlated with obesity (Wen et al., 2010). Powell et al. (2007) find similar results. They detect that there is less availability of healthy food in lower SES neighborhoods. Finally, Ford & Dzewaltowski (2008) point out that within ethnic minorities and other disadvantaged populations, the obesity rate is higher than in other populations. In their study, they provide an explanation for this finding. Their literature review leads to the conclusion that these populations have less access to healthy food and therefore face a higher obesity risk than populations in higher SES groups.

In this section, we have highlighted the factors that are potentially correlated with obesity. Those factors could be used to predict obesity rates. In the following section we will detail some studies that have tried this. In particular, we mention the models they use and discuss the explanatory variables included in those models.

2.2 Food Environment and Obesity

There are numerous studies that use the factors that we pointed out above to explain the obesity prevalence in the United States. In this section, we highlight a selection of those studies. We are mainly interested in the ones that show similarities with our research. For each of them, we detail the corresponding model, the data, the explanatory variables and the outcome.

We start with the county-level study of Salois (2012), which uses the same data as our research. Salois (2012) aims to find the effect of the local food economy on obesity and diabetes. To capture this economy, he uses variables like the percentage of farms with direct sales, direct farm sales per capita and the density of farmers’ markets. Those are markets where farmers can sell their products directly to customers. Besides these indicators for the local food economy, he also includes some of the features mentioned in our previous section. Salois (2012) measures the effect of these variables with a robust regression, since several tests rejected the hypothesis of normally distributed error terms. In case of non-normality of those terms, robust regression estimates are more efficient than the standard OLS estimates. The results following from this model indicate a negative correlation between the local food economy and both obesity and diabetes. Based on this, Salois (2012) states that investing in the local food economy can reverse the rising trend in the obesity rate.

Chi et al. (2013) use the same USDA data to explain the obesity prevalence on county-level. However, they believe that there is spatial heterogeneity in the relation between the obesity rate and its explanatory variables. Therefore, they use Geographically Weighted Regression (GWR) as their estimation technique. This allows the regression parameters to vary over space, which addresses their spatial heterogeneity assumption. The explanatory variables for the GWR model are based on food accessibility, socioeconomic status, physical environment and food environment. For the latter, they use two different ratios: the convenience-to-grocery store ratio and the fast-food-to-full-service restaurant ratio. For both of those ratios a higher value corresponds to a less healthy food environment. Chi et al. (2013) find that the convenience-to-grocery store ratio, poverty rate and urban environment are positively correlated with obesity. Their results also indicate that in some states GWR outperforms OLS and that the models could complement each other.

Similar to the two studies mentioned above, Jilcott et al. (2011) use USDA data to investigate the association between the food environment and obesity on county-level. They include densities of grocery stores, farmers’ markets and supercenters as indicators for this food environment. Jilcott et al. (2011) use a multilevel linear model to estimate the effects of these variables. They use this model because they believe that there is dependence among the individual counties within a state. Therefore, they use the state that a county belongs to as the second-level variable. According to the estimates of the multilevel model, all three food establishments stated above are negatively correlated with obesity rates.

Unlike the other studies, Michimi & Wimberly (2015) do not use USDA data. Instead, they obtain their information about obesity and food environment from the Behavioral Risk Factor Surveillance System (BRFSS). However, there is a relation between this source and the USDA data, since some of the variables from the Food Atlas are obtained from the BRFSS. Using this data, Michimi & Wimberly (2015) estimate a logistic regression model to classify individuals into obese or non-obese categories. As input for the model, they construct two factors. One represents healthy food options, while the other consists of unhealthy alternatives. As expected, the results indicate that the latter increases the odds of being obese.

Instead of using a linear model like the studies above, Yan & Griffin (2015) use a parametric nonlinear regression to estimate the obesity rate on county-level. Their data is obtained from several sources among which the USDA. From this data, they use food environment, food accessibility and demographics as their explanatory variables. The results following from the nonlinear model are in line with earlier research and expectations. Addition of a grocery store, for example, decreases the predicted obesity rate. The opposite can be said for a convenience store or a supercenter.

Finally, we discuss the study of Zhang et al. (2009). This study aims to predict childhood obesity using attributes from the Wirral child database. The prediction accuracy of a logistic regression model is compared with different machine learning techniques. Three of them are also used in this research, namely decision trees, neural networks and Support Vector Machines (SVM). However, Zhang et al. (2009) use these models for classification, while we solve a regression problem. They find that of the models mentioned above, SVM performs best on this classification task. On the other hand, the logistic regression and decision trees provide relatively poor predictions. Based on this, they suggest that certain machine learning methods could be used more often as an alternative to standard regression models.

In this section we have discussed some of the studies that investigate the relation between obesity and the food environment. Four of these use USDA data, since it includes many features and the data is on county-level. Although we use the same data, our methodology differs. In the next chapter we will discuss our method and provide some background of the models we estimate.


3. Models and Methodology

The purpose of this chapter is to explain the method of our research and explore the models that are part of it. In the first section, we will describe in general how we are going to predict the county-level obesity rates. It is followed by a number of sections in which the prediction models are detailed one by one. In the final section we address the feature selection method we employ.

3.1 Method

The aim of our research is to predict the county-level obesity rate in the United States. We assume that this obesity rate depends on a set of variables. For each county i, there is a k×1 vector of observations for those variables, denoted by $x_i$. These vectors are stored in the N×k matrix X, where N is the total number of counties in the data. Further, we denote the observed and predicted obesity rate for observation i by $y_i$ and $\hat{y}_i(x_i)$ respectively. Occasionally, we will simplify the notation of the latter to $\hat{y}_i$.

We will use five different models to forecast obesity. Those models are Ordinary Least Squares, Support Vector Regression, regression trees, random forest for regression and a single layer neural network. The performance of the OLS model will act as a baseline for the other models. In the remainder of this chapter we will discuss the models in more detail. In this section, we will explain our general procedure of predicting the county-level obesity rates.

We wish to obtain an indication of the accuracy of our models, so that we are able to compare them. Therefore, we split the data into a train and test set. All the training and validation of the models is performed on the train set. The test set is only used to measure the model performance. We decide to use 70 percent of the data for training, and the remaining 30 percent for testing. In this way, we have enough data to train, but also sufficient observations to be predicted. The data partitioning is identical for each of the five models.

We measure the model performance on the test set by the Root Mean Squared Error (RMSE). The general definition of this measure for n predictions is given by

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2}, \quad (3.1) \]


where $e_i$ is the difference between the real and predicted value for observation i. In case we want to evaluate the model performance on the test set, this prediction error equals $y_i - \hat{y}_i$.

It is undesirable that these errors are large, and we wish to detect the models that cause them. The RMSE is suitable for this, since the errors are squared; large inaccuracies are thereby amplified and more easily visible in the RMSE.
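To make this measure concrete, the following minimal sketch computes the RMSE of Equation 3.1; the thesis does not specify the software used, so Python with NumPy and the example values are assumptions for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error as in Equation 3.1."""
    e = np.asarray(y_true) - np.asarray(y_pred)  # prediction errors e_i = y_i - yhat_i
    return np.sqrt(np.mean(e ** 2))              # square, average, take the root

# Hypothetical observed and predicted county obesity rates (percent).
y_test = np.array([30.1, 27.5, 33.2, 29.0])
y_pred = np.array([29.4, 28.3, 31.8, 30.2])
print(rmse(y_test, y_pred))
```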

In order to obtain predictions with high accuracy, we optimize the models individually. This involves selecting the optimal model parameters. The process of checking for which parameter combination the model performs best is called parameter tuning. This cannot be done on the test set, since it would lead to underestimation of the RMSE. Therefore, we only use the train set to tune the parameters. For training sets with many observations, it would be possible to hold out a part of the training set for validation. However, our data is not sufficiently large (around 3000 observations in total).

Cross-validation is a good alternative for smaller data sets. For K-fold cross-validation, the data is divided into K equal parts. In step k of the cross-validation, the kth part is left out of the data. The model is trained using the K−1 other parts, and a prediction error is calculated on the kth part. There are K steps in total, such that each part is left out once. The size of K depends on the objective. Higher K decreases bias, but might instead increase variance. In general, five- or ten-fold cross-validation gives good results (Friedman et al., 2001). We use 10-fold cross-validation in this research. Those ten folds are generated randomly for each model.

We use the cross-validation for parameter tuning as follows. First, we specify a grid that contains all the combinations of the model parameters that we wish to evaluate. Subsequently, for each of those combinations we perform a 10-fold cross-validation and obtain an RMSE. The parameter combination corresponding to the lowest value for this measure is then considered optimal. Because this RMSE depends on the folds, it might be that the optimal combination changes when these folds differ slightly. This is especially the case when the standard deviations of the RMSE values over the folds are relatively large compared to the differences in RMSE values between parameter combinations. When this situation occurs during the parameter tuning, we will mention it and discuss its consequences.
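A minimal sketch of this tuning procedure, assuming Python with scikit-learn and synthetic placeholder data (the thesis does not state which software was used, and the Ridge estimator below is only a stand-in, not one of the five models), could look as follows:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic placeholder data: ~3000 "counties" with 42 standardized features.
rng = np.random.default_rng(0)
X = rng.normal(size=(3052, 42))
y = 30 + X[:, 0] + rng.normal(scale=2.0, size=3052)

# 70/30 train/test split; the test set is only touched for the final RMSE.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Each parameter combination in the grid is scored by 10-fold cross-validated
# RMSE on the training set; the combination with the lowest RMSE is kept.
search = GridSearchCV(
    Ridge(),                                   # stand-in estimator for illustration only
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=2),
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```

The score is negated because scikit-learn maximizes the scoring function, whereas we minimize the RMSE.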

The final part of our methodology is the scaling of the input data. The neural network and SVR models rely on functions that do not deal well with relatively large input values. Therefore, it is common to normalize the input variables of those models. We apply the same scaling to the other models, such that we are able to compare them with SVR and neural network regression. The normalization of variable $X_j$ is given by

\[ Z_j = \frac{X_j - E(X_j)}{\sqrt{\operatorname{var}(X_j)}} \quad (3.2) \]

It is purely based on the train set, which implies that we derive the mean and standard deviation solely from the observations in this set. Predictions on the test set are scaled back by using the same mean and standard deviation, such that we obtain realistic output values.
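The sketch below illustrates this train-set-only scaling, again assuming Python with scikit-learn and hypothetical arrays; the StandardScaler transformation corresponds to the normalization of Equation 3.2.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(2136, 42))   # hypothetical training features
X_test = rng.normal(loc=5.0, scale=2.0, size=(916, 42))     # hypothetical test features
y_train = rng.normal(loc=30.0, scale=4.0, size=2136)        # hypothetical training targets

# Mean and standard deviation are derived from the train set only (Equation 3.2).
x_scaler = StandardScaler().fit(X_train)
Z_train = x_scaler.transform(X_train)
Z_test = x_scaler.transform(X_test)           # test set scaled with the train-set statistics

# If the target is scaled as well, predictions are transformed back to
# obesity-rate units with the same train-set mean and standard deviation.
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))
y_train_z = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
scaled_predictions = np.zeros(916)            # placeholder for a model's scaled output
predictions = y_scaler.inverse_transform(scaled_predictions.reshape(-1, 1)).ravel()
```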

In short, we will estimate five different models. Those models will be optimized individually by using 10-fold cross-validation to determine their parameters. Below we will elaborate on which parameters must be tuned for each model. We start with the baseline model of our research: OLS.

3.2 Ordinary Least Squares

In this section we discuss the first of five models that we use in this research: the linear regression model. This model assumes that the output is a linear combination of the input variables. As a function, this is given by

\[ f(X) = X\beta, \quad (3.3) \]

where β is the vector of unknown weights that we wish to estimate. We include an intercept in our model, and therefore X contains an additional column of ones. Consequently, there are k+1 parameters to be estimated.

Ordinary Least Squares is an estimation method that solves the linear regression problem by minimizing the Sum of Squared Errors (SSE). This function is closely related to the RMSE defined in the previous section. In case we substitute f(X) into this function, we obtain

\[ \mathrm{SSE}(\beta) = \sum_{i=1}^{N} \left(y_i - x_i^T\beta\right)^2. \quad (3.4) \]

We seek to find a vector $\hat{\beta}$ that minimizes this function. Heij et al. (2004) show that this problem has the following solution:

\[ \hat{\beta} = (X^T X)^{-1} X^T y. \quad (3.5) \]

By substituting these optimal weights into Equation 3.3, we easily see that new observations can be predicted by

\[ \hat{y} = X\hat{\beta}. \quad (3.6) \]
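As an illustration of Equations 3.5 and 3.6, the following sketch computes the closed-form estimate on synthetic data, assuming Python with NumPy; the solve call avoids forming the inverse explicitly, which is numerically preferable but otherwise equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 200, 5
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])  # intercept column plus k regressors
beta_true = rng.normal(size=k + 1)
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Equation 3.5: beta_hat = (X'X)^{-1} X'y, computed via solve() instead of an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equation 3.6: predictions for (new) observations.
y_hat = X @ beta_hat
print(beta_hat.round(2))
```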

Because the OLS model assumes a simple linear relation, it is expected that we can improve prediction accuracy by allowing for nonlinear relations in the data. Therefore, we use four machine learning models that make this possible. We start with the regression tree model.

3.3 Regression Trees

The first machine learning model we discuss in this chapter is decision trees. Decision trees are easy to interpret and intuitive. There are two types of decision trees: classification trees and regression trees. Classification trees are suitable when the dependent variable is categorical. However, in this research the dependent variable is the obesity rate per county, which is a continuous variable. Therefore, only regression trees are considered in this research. In the first part of this section we explain the algorithm of this tree; next, we discuss the parameter tuning.

Figure 1: Node structure of a decision tree

In Figure 1, an example of a decision tree is shown. It consists of different types of nodes. The node at the top is called the root node. The bottom nodes are the terminal nodes or leaves (Grömping, 2009). Finally, the other nodes are referred to as internal nodes. A path in a decision tree starts at the root node, from where it passes through internal nodes and ends at one of the terminal nodes. At each node the data is partitioned according to an algorithm. New observations are predicted by following a path, determined by this algorithm. Below we will detail the Classification And Regression Trees (CART) algorithm, which we will use in our research.

The CART algorithm was first introduced by Breiman et al. (1984). It partitions the data by using an optimal splitting criterion. This criterion consists of a splitting variable $X_j$ and a split point s. The splitting variable is one of the k explanatory variables (e.g. the number of grocery stores per county). Based on the value of this variable and a corresponding split point, the data is divided into two groups. At the root node, this separation leads to the groups

\[ R_1(j, s) = \{X \mid X_j \ge s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j < s\}. \quad (3.7) \]

Those groups are determined by a combination (j, s). The CART algorithm finds the optimal combination by solving

\[ \min_{j,s}\left[ \min_{c_1} \sum_{i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{i \in R_2(j,s)} (y_i - c_2)^2 \right], \quad (3.8) \]

where $c_m$ is the predicted response for group m. The two summations in Equation 3.8 are the node impurities of groups 1 and 2 respectively. It can be shown that the response that minimizes this node impurity is equal to the average value of $y_i$ over all observations i that belong to the corresponding group (Friedman et al., 2001). Since this is an easy solution, the optimal combination can be found relatively fast. Once it is found, the algorithm repeats itself for each of the created groups.
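The following sketch illustrates the split search of Equation 3.8 on synthetic data. It is a naive exhaustive implementation, assuming Python with NumPy, intended purely for illustration rather than as the CART implementation used in the thesis.

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the splitting variable j and split point s of Equation 3.8."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = X[:, j] >= s, X[:, j] < s      # groups R1(j, s) and R2(j, s)
            if left.all() or right.all():
                continue                                  # skip splits that leave a group empty
            # The optimal c_m is the group mean, so the criterion reduces to the
            # within-group sums of squares (the node impurities).
            impurity = ((y[left] - y[left].mean()) ** 2).sum() + \
                       ((y[right] - y[right].mean()) ** 2).sum()
            if impurity < best[2]:
                best = (j, s, impurity)
    return best

# Hypothetical data in which variable 2 drives the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 2] >= 0.5).astype(float) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))   # typically recovers variable 2 with a split point near 0.5
```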

We will end up with a large tree if we let this repetition continue for a long time. While such a tree will have very accurate predictions on observations from the training data, it is highly unlikely to be accurate on out-of-training observations. This problem is known as overfitting (Hayes et al., 2015). In order to avoid this, we could prune the tree. This works as follows. First, we grow an unpruned tree until we reach a predetermined minimum number of observations per node. We follow Friedman et al. (2001) and fix this number at 5. Subsequently, we prune this tree by removing some of the nodes from below. Basically, we have now obtained a subtree of the unpruned tree.

In this research we use cost-complexity pruning to find the optimal subtree. This involves tuning a complexity parameter that controls a trade-off between tree size and risk of overfitting. If the parameter is chosen too large, the pruned tree will be too small to provide accurate predictions. On the other hand, decreasing the complexity parameter leads to a tree that might overfit on the train set. We use parameter tuning to find its optimal value, and thus the optimal pruned tree.
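A sketch of this tuning step, assuming Python with scikit-learn and synthetic data (scikit-learn's ccp_alpha plays the role of the complexity parameter described above), could look as follows:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                       # hypothetical standardized features
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=500)   # hypothetical response

# Candidate complexity parameters come from the unpruned tree's pruning path;
# min_samples_leaf=5 mirrors the minimum node size used above.
base = DecisionTreeRegressor(min_samples_leaf=5, random_state=1)
alphas = base.cost_complexity_pruning_path(X, y).ccp_alphas[::5]   # thinned for speed

# Pick the complexity parameter with the lowest 10-fold cross-validated RMSE.
cv = KFold(n_splits=10, shuffle=True, random_state=2)
scores = [
    -cross_val_score(
        DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=a, random_state=1),
        X, y, scoring="neg_root_mean_squared_error", cv=cv,
    ).mean()
    for a in alphas
]
print("optimal ccp_alpha:", alphas[int(np.argmin(scores))])
```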

In this section we have discussed the regression tree model. Although this model is simple to interpret and not hard to implement, a disadvantage is its high variance (Friedman et al., 2001). An extension of regression trees that addresses this problem, is random forest for regression. This method will be discussed in the next section.

3.4 Random Forest for Regression

Random forest is an extension to decision trees that was introduced to improve the model performance. In this section, this model will be explained. We will focus our attention on random forest models consisting of regression trees.

As mentioned before, we wish to reduce the variance of the regression tree model. Random forest does this by bootstrap aggregating (bagging) numerous unbiased trees. This implies drawing a number of B bootstrap samples, growing an unpruned regression tree for each of them and finally aggregating these trees. Friedman et al. (2001) show that the variance of the average of B identically distributed trees is given by

\[ \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2, \quad (3.9) \]

where ρ is the correlation between the trees and $\sigma^2$ the variance of a single tree. By increasing B, the second term of Equation 3.9 vanishes. Thus, reducing the variance boils down to reducing the correlation between the trees, as long as we keep their variance at the same level.

The random forest algorithm achieves this by making an alteration in the way the trees are grown. Instead of seeking the optimal splitting variable among all of the available variables, at each node only a subset of variables is available. We define the number of variables in this subset by m < k. The subset is then created by randomly drawing m columns from X. This way, it is possible that the optimal splitting variable is not available, forcing the algorithm to select another variable. In theory this will cause the B trees to differ, resulting in less mutual correlation. The amount of correlation can be controlled by varying m. We will tune this parameter to optimize the random forest model.

Besides finding the optimal number of variables available at each node, we must also decide how many trees we grow in our model. Although more trees could increase prediction accuracy, they also increase the training time. It is expected that the prediction error converges to a limit as B increases (Breiman, 2001). We would like to find this limit, so that the model accuracy is as high as possible. We will start with 500 trees, and from there we will increase this number until we observe no further serious reduction in the cross-validation error.

Having determined the optimal values for B and m, the random forest model can be used for prediction. Predicting a new observation with random forest works as follows. For observation i, we denote the corresponding prediction by tree j as $\hat{y}_{i,j}$. These predictions can be averaged over all the trees to obtain the prediction for observation i. This can be expressed as

\[ \hat{y}_i = \frac{1}{B}\sum_{j=1}^{B} \hat{y}_{i,j}. \quad (3.10) \]
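The sketch below, assuming Python with scikit-learn and synthetic data, fits a forest of B = 500 trees and verifies the averaging of Equation 3.10. The max_features argument corresponds to m and is set to an arbitrary value here, whereas in the thesis it is tuned by cross-validation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 42))                              # hypothetical features
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=1000)  # hypothetical response

# B = 500 trees; max_features (= m) is the number of variables tried at each split.
forest = RandomForestRegressor(n_estimators=500, max_features=14, random_state=1)
forest.fit(X, y)

# Equation 3.10: the forest prediction is the average of the individual tree predictions.
tree_preds = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
print(tree_preds.mean(axis=0))
print(forest.predict(X[:5]))   # the same values, averaged internally
```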

So far we have discussed three different models. Our fourth model is Support Vector Regression. In the next section we will detail how this model can be used to predict a continuous variable.

3.5 Support Vector Regression

Support Vector Machines is a class of machine learning techniques that is especially known for the usage of kernels and the sparseness of its solution. There are SVM algorithms for both classification and regression. The latter is often referred to as Support Vector Regression. In this section, we will detail this method. As usual, we start with the general idea of the model, followed by a discussion about the model parameters.

The SVR solution can be found by minimizing a certain error function. In the following we will discuss the steps that lead to this optimization problem. We begin with a linear regularized regression model. Just like in the OLS model, its solution is found by minimizing a quadratic error term. However, in this case we add a regularization term to it to avoid overfitting. We can express this regularized error function by

\[ \frac{1}{2}\sum_{i=1}^{N} \left(f(x_i) - y_i\right)^2 + \frac{\lambda}{2}\|\beta\|^2, \quad (3.11) \]

where f(·) is the same as in Equation 3.3 and λ is the regularization parameter.

One of the advantages of SVM models is that they have sparse solutions. To obtain this sparsity, we replace the quadratic error term in Equation 3.11 by a so-called ε-insensitive error function (Bishop, 2007). This function allows the errors to be zero when they do not exceed a certain bound, determined by ε > 0. We make this explicit by defining the following function:

\[ E_\varepsilon(x) = \begin{cases} 0 & \text{if } |x| < \varepsilon \\ |x| - \varepsilon & \text{otherwise} \end{cases} \quad (3.12) \]

By substituting this error function into (3.11) we obtain

\[ C\sum_{i=1}^{N} E_\varepsilon\left(f(x_i) - y_i\right) + \frac{1}{2}\|\beta\|^2, \quad (3.13) \]

where C is the cost parameter that is commonly used in SVM as regularization parameter. It controls the penalty for predictions that lie outside the ε-bound. For very small values of C, the effect of prediction errors on the minimization is small and errors are allowed more easily. On the other hand, if C is large, the predictions outside the ε-bound are penalized heavily and the model tends to overfit the data. In general, the cost parameter causes a trade-off between accuracy and overfitting.

The next step towards the SVR model involves slack variables. For each observation, we introduce two slack variables, $\xi_i$ and $\xi_i^*$, defined as follows. For each observation we have a model outcome $f(x_i)$. This outcome has a lower and upper bound defined by $f(x_i) - \varepsilon$ and $f(x_i) + \varepsilon$ respectively, where ε is the same as in Equation 3.12. The true outcome can be either above, below or inside the ε-bound. The latter corresponds to the case where the ε-insensitive error function equals zero. When the true observation exceeds the upper bound, we set $\xi_i > 0$. Similarly, for true observations below the lower bound, we set $\xi_i^* > 0$.

By using these slack variables, we can allow the errors to lie outside the ε-bound (Bishop, 2007). The following conditions illustrate this:

\[ \begin{aligned} & y_i \le f(x_i) + \varepsilon + \xi_i, \\ & y_i \ge f(x_i) - \varepsilon - \xi_i^*, \\ & \xi_i \ge 0, \quad \xi_i^* \ge 0, \end{aligned} \qquad \text{for all } i. \quad (3.14) \]

We can now rewrite Equation 3.13 in terms of the slack variables as

\[ C\sum_{i=1}^{N} (\xi_i + \xi_i^*) + \frac{1}{2}\|\beta\|^2. \quad (3.15) \]

This expression can be minimized under the restrictions (3.14) by using Lagrange optimization. It can be shown (Bishop, 2007, pp. 341-342) that this Lagrange optimization leads to the following formula for predicting new observations:

\[ \hat{y}(x_i) = \sum_{n=1}^{N} (a_n - a_n^*) K(x_i, x_n) + b, \quad (3.16) \]

where $K(x_i, x_n)$ is a kernel function, b denotes a bias term and $a_n$, $a_n^*$ are the Lagrange multipliers. It can be derived that for observations inside the ε-bound, both multipliers are equal to zero (Bishop, 2007). Therefore, we only have to consider those observations that lie on or outside the ε-bound. These observations are called the support vectors (Bishop, 2007). Since Equation 3.16 only involves those support vectors, we have obtained a sparse solution for the SVR model.

We have introduced a kernel function in Equation 3.16. The type of kernel function must be specified beforehand by the user. Below we state four of the most commonly used choices (Hsu et al., 2003).

• Linear: $K(x, x') = x^T x'$

• Polynomial: $K(x, x') = (\gamma \, x^T x' + c)^d$, $\gamma > 0$

• Radial: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, $\gamma > 0$

• Sigmoid: $K(x, x') = \tanh(\gamma \, x^T x' + c)$


All of these kernels, except for the linear kernel, have additional kernel parameters to be specified. These are, combined with C, the hyperparameters of the SVR model. We tune them together with the kernel type to find the optimal kernel and its corresponding parameters.
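As an illustration of this tuning, the following sketch, assuming Python with scikit-learn and synthetic standardized inputs, searches jointly over the kernel type and the hyperparameters C, ε and γ, and reports the number of support vectors of the selected model:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 42))                         # standardized inputs assumed
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=600)  # hypothetical response

# Kernel type, cost C, epsilon and the kernel parameter gamma are tuned jointly.
grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10], "epsilon": [0.05, 0.1, 0.5]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "epsilon": [0.05, 0.1, 0.5], "gamma": [0.01, 0.1, 1.0]},
]
search = GridSearchCV(SVR(), grid, scoring="neg_root_mean_squared_error",
                      cv=KFold(n_splits=10, shuffle=True, random_state=2))
search.fit(X, y)
svr = search.best_estimator_
print(search.best_params_)
print("number of support vectors:", len(svr.support_))   # sparseness of the solution
```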

In this section, we have shown how the regularized error function can be rewritten, such that we obtain the sparse solution of the SVR model. We also discussed the hyperparameters of this model and the kernel functions that are related to this. In the next section, we will detail our final machine learning model: the neural network.

3.6 Single Layer Neural Network for Regression

A neural network is a prediction model that uses one or more hidden layers to transform input data into an outcome. In this research we consider the neural network with a single hidden layer, since multiple hidden layers seem too complex for our relatively small data size (both in number of observations and variables). In this section we will describe how this neural network can be used to obtain predictions of a continuous output variable. We also address the hyperparameters of the model.

Figure 2: Structure of a single layer neural network

In Figure 2 we see the general structure of a neural network for regression with a single hidden layer. The variables $Z_1$ through $Z_M$ in the middle of this figure are called the hidden units of the neural network. They are called hidden, since they are not observed but instead learned from the data (Friedman et al., 2001). Each of them is a transformation of a linear combination of the input data, and the output $\hat{y}$ is in turn expressed as a linear combination of the hidden units. The corresponding set of equations can be written as

\[ Z_m = \sigma(X\alpha_m), \quad m = 1, \dots, M, \qquad \hat{y} = Z\beta, \quad (3.17) \]

where the first column of both X and Z contains ones. Further, σ(·) is the sigmoid function $\sigma(x) = \frac{1}{1 + \exp(-x)}$. This function is referred to as the activation function, and it leads to a nonlinear transformation of the data (Friedman et al., 2001). In Equation 3.17, there are M(k+2)+1 parameters in total that need to be trained. We define these parameters by

\[ \alpha = \{\alpha_{m0}, \alpha_{m1}, \alpha_{m2}, \dots, \alpha_{mk};\ m = 1, \dots, M\}, \qquad \beta = \{\beta_0, \beta_1, \beta_2, \dots, \beta_M\}. \quad (3.18) \]

Below we will explain how their training can be done.

Similar to the other models, we seek the parameters that minimize the SSE of the predictions. In this specific case, we denote this error function by R(α, β). A method that can be used to minimize the SSE is back-propagation. This is an iterative process where the weights are updated at each iteration. It consists of two stages. In the first stage, the input is fed forward through the neural network such that a value for R(α, β) is obtained. For the first iteration this requires a selection of starting values for the parameters; usually, they are chosen close to zero (Friedman et al., 2001). In the second stage, the weights are updated by gradient descent. At iteration r+1, these updates are given by

\[ \beta^{(r+1)} = \beta^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta^{(r)}}, \qquad \alpha_m^{(r+1)} = \alpha_m^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_m^{(r)}}, \quad (3.19) \]

where $\gamma_r$ is the learning rate. It can be shown that the derivatives of R(α, β) can be calculated conveniently with the chain rule, which yields the back-propagation equations (Friedman et al., 2001). The updating continues until some convergence occurs or until a maximum number of iterations is reached.

One of the tuning parameters in the neural network is the number of hidden units. Usually, this number lies between 5 and 100 (Friedman et al., 2001). It is expected that the needed number of hidden units increases with the number of input variables and observations. The intuition behind this is that more hidden units allow for more complex interactions in the data. However, too many of them could lead to overfitting.

Similar to the regularized linear regression problem we mentioned in the previous section, we could avoid overfitting by adding a regularization term to R(α, β). The regularized minimization problem is now given by

\[ R(\alpha, \beta) + \lambda \left( \sum_{i=1}^{M} \sum_{j=0}^{k} \alpha_{ij}^2 + \sum_{i=0}^{M} \beta_i^2 \right), \quad (3.20) \]

where λ≥0 is called the weight decay parameter. As becomes clear from the equation above, it penalizes the magnitude of the parameters α and β. Higher values for λ will therefore shrink those parameters towards zero, which reduces the risk of overfitting. The weight decay and the number of hidden units are the tuning parameters of the single layer neural network. We first determine how many hidden units are sufficient for our data. Subsequently, we tune the weight decay parameter while keeping M fixed. In this way, we expect to keep the risk of overfitting low.
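A sketch of this two-step tuning, assuming Python with scikit-learn and synthetic data, is given below. Note that scikit-learn's MLPRegressor uses its own optimizer rather than the plain gradient-descent updates of Equation 3.19, and its alpha argument plays the role of the weight decay λ, so this is only an approximation of the procedure described above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 42))                                    # standardized inputs assumed
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=800)

# One hidden layer with M = 10 units and logistic (sigmoid) activation; M is fixed
# first, then the weight-decay parameter alpha is tuned by 10-fold cross-validation.
grid = {"alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1.0]}
search = GridSearchCV(
    MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                 solver="lbfgs", max_iter=2000, random_state=1),
    grid,
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=2),
)
search.fit(X, y)
print(search.best_params_)
```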

So far we have discussed all the models that we use in our research. In the last section of this chapter we introduce a method that reduces the number of input variables for those models. We use Recursive Feature Elimination to determine which variables are most important for prediction, and which are redundant.

3.7 Recursive Feature Elimination

In total, we have 42 explanatory variables available to train the models described above. In each of them, we include all those variables. However, this might be harmful, since it increases the risk of overfitting (Guyon & Elisseeff, 2003). Feature selection is often used to address this problem. Besides reducing the risk of overfitting, it also provides insight into the underlying pattern of the data and it simplifies the model (Guyon & Elisseeff, 2003). Further, we might decrease training times, since we decrease the number of input variables. For those reasons, we will apply a feature selection method to see if our predictions improve. The method we use is Recursive Feature Elimination (RFE), which we will describe below.

RFE is a widely used feature selection method due to its good performance and relatively short computation time (Granitto et al., 2006). It aims to find a subset of variables that contains just as much explanatory power as the total set of variables. RFE achieves this by removing variables in order of importance. The least informative features are eliminated first. The optimal subset of variables is the one that gives the lowest cross-validation error.

In order to perform RFE, we must determine how to rank the variables in order of importance. There are several methods that can be used for feature ranking. In this research we use the random forest model to determine the variable importance. Earlier research has shown that this type of RFE performs well (Granitto et al., 2006). We have already discussed the random forest algorithm in Section 3.4, but some addition is needed to explain how it can be used for variable ranking. We will provide this explanation below.

For the variable ranking, we use the out-of-bag (OOB) samples from the random forest. The OOB sample for tree b in the random forest contains the observations that are excluded from bootstrap sample b. For this tree, the corresponding OOB sample can be used as a test set to obtain a prediction error. Using these errors, it is possible to measure the importance of each feature. For variable j, this can be determined as follows. For each OOB sample, shuffle the values for variable j and compute the prediction error on the OOB sample again. Then store the difference between the errors of the first and second prediction and average this difference over all the trees in the forest. By doing this for all the k input variables, it is possible to obtain the importance by comparing the relative prediction changes. It is expected that important variables lead to relatively more change, since they have more impact than redundant variables (Granitto et al., 2006).
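The following sketch, assuming Python with scikit-learn and synthetic data, performs recursive feature elimination with a random forest as the ranking model; the cross-validated RMSE determines the size of the retained subset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 42))                                   # hypothetical features
y = X[:, 0] + 2 * X[:, 5] - X[:, 12] + rng.normal(scale=0.3, size=600)

# Recursive feature elimination with a random forest as the ranking model.
# Note: scikit-learn ranks by impurity-based importances rather than the OOB
# permutation importance described above, so this only approximates that scheme.
selector = RFECV(
    RandomForestRegressor(n_estimators=100, random_state=1),
    step=1,
    cv=KFold(n_splits=10, shuffle=True, random_state=2),
    scoring="neg_root_mean_squared_error",
)
selector.fit(X, y)
print("selected variables:", np.where(selector.support_)[0])
```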

In this chapter, we have provided an overview of our methodology and the models we use. In short, we train five different models on a train set. We use 10-fold cross-validation to tune their parameters and find the optimal model. Then we test the optimal models on the test set and compute the RMSE of the predictions. Finally, we repeat this process with an optimal subset of variables, determined by the RFE algorithm. Before we discuss the results, a description of the data is given in the following chapter.


4. Data

In this chapter we will describe the data that we use in this research. In the first section, a general overview of this data is given. In the following section, we will explain why we exclude some of the variables and observations. In the final section we provide a table including all the variables that are part of the final dataset and discuss their descriptive statistics.

4.1 Food Environment Atlas

This research uses data that is made available by the Economic Research Service from the U.S. Department of Agriculture. In this section, we will describe this dataset and its purpose.

The ERS from the USDA has published a Food Environment Atlas, which will be referred to as Atlas from now on. It is made available to facilitate research into the relation between food environment and health. The USDA needs those studies to improve the overall effectiveness of policy interventions that aim to decrease the obesity prevalence. The Atlas provides over 200 variables, mainly related to the food environment in the United States. Most of those variables are measured on county level, and all the 3141 U.S. counties are included. The data in the Atlas is obtained from a variety of sources and years. When more recent data becomes available, the Atlas is updated. Over the years, the number of included variables has also increased. For this research, the most recent version of the Atlas is used. This version has 211 variables and was last updated in 2015.

To clarify the structure of the data, the variables can be organized into three categories: Food Choices, Health and Well-Being and Community Characteristics. The first category contains statistics on the access and the availability of food stores. Several types of stores are included, such as grocery stores, convenience stores and farmers’ markets. The category also includes variables that provide an indication of the accessibility of those stores.

The Health and Well-Being category includes the prevalence of adult obesity within a county. Adult obesity is measured as the percentage of persons aged 20 years and older who are obese. The estimates are provided by the Centers for Disease Control and Prevention (CDC). They have obtained the data from the Behavioral Risk Factor Surveillance System (BRFSS). A potential disadvantage of this source is that it includes self-reported measures. This might be harmful, since people tend to underestimate their weight, and hence underestimate their BMI (Salois, 2012). Other variables within the Health and Well-Being category are the adult diabetes rate and variables that capture the levels of physical activity.

The final category includes the factors that describe the composition and characteristics of a community. Those are among others statistics on the demographics of a county, the diversity in ethnicity, the income level and the number of recreational facilities.

Unfortunately, the data from the Atlas cannot be used directly for research. Some data is missing, while other variables are not suited for the purpose of this research. In the following section, we will describe which variables are excluded from the data. Furthermore, we will explain how we deal with missing values.

4.2 Data Preparation

In this section, the preparation of the data will be described. Some available variables will be left out of the final dataset. In the first part of this section we will provide the argumentation behind this. Further, some counties and/or variables have many missing values. How we handle this problem will be discussed in the second part.

4.2.1 Excluded Variables

There is a total number of 211 variables available in the Atlas. Here, we will explain which of those variables are not suitable for our research. They will be excluded from our final dataset.

Many indicators in the Atlas are measured in multiple ways, resulting in multiple variables. An example is the indicator of fast-food restaurants. The availability of fast-food restaurants is measured as a total amount per county, but also as the number of fast-food restaurants per thousand county residents. The first measure can be seen as a simple count, while the second is a density measure. When there are multiple variables available for the same indicator, the density measure is preferred for this research. We prefer this measure, because it does not depend on the size of the county as much as a count does. In total, there are 48 variables excluded because there was a more informative measure available for them.

Although the Atlas contains mostly county-level data, there are some indicators that are measured on state-level only. Since a single state consists of numerous counties, it is expected that within a state, the county-level measures would vary if they were available. On top of that, we are eventually interested in county-level predictions. Based on that, we will only use county-level data, and therefore exclude the 58 variables measured on state-level.

Besides multiple measures, there are also variables that contain statistics over multiple years. We follow Salois (2012) and von Hippel & Benson (2014) and include only the variable that is measured at the year closest to 2010, which is the most recent year that the obesity rate is measured. A total of 18 variables are excluded, because there is a closer year to 2010 available for them.

There are 25 variables that measure a percentage change of an indicator over two or more years. For example, the percentage change of grocery stores between 2007 and 2012. Because our dependent variable is not a percentage change, but a variable measured in a single year, we do not include those variables. It is not expected that important information is lost, since for each of the change variables, a single year measure is provided which we do include in the final dataset.

Finally, there are a few variables that are not suitable for this research. First, the diabetes indicators are removed from the data. This is for two reasons. In the first place it might be that obesity causes diabetes (Sakurai et al., 1999). Second, it is not likely that diabetes measures are available when obesity rates are unknown. This is because they are both health-related variables, and furthermore within the Atlas they are collected from the same source. The same holds for the child obesity rate, which is also excluded.

4.2.2 Missing Observations

There are 57 variables left that are suitable for our research purpose. However, some of these variables have missing values. This causes problems when training the models. Therefore, they must be either replaced or removed. When missing values occur, it is common in USDA studies to remove the counties with missing values. For instance, Chi et al. (2013) eliminate 33 counties that have missing values for the variables they use. However, in this research we have some variables for which a large proportion of the data is missing. In those cases it might be better to exclude those variables, since removing counties would cause a lot of observations to be lost. Below we will discuss the variables that have this problem and the possible alternatives for them.

A part of the Atlas includes variables that are indicators of food assistance. Food assistance mainly refers to programs that are intended to support low-income households, so that they can provide healthy food for their family. The Supplemental Nutrition Assistance Program (SNAP) and the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) are two examples of such programs. Unfortunately, four variables related to those programs have a fair proportion of missing values (at least ten percent of the observations are missing). Because there are other variables that measure food assistance (e.g. the percentage of students eligible for free/reduced lunch), we delete those four variables.

The second part of the Atlas that contains variables with a large share of missing values is the section that covers the local food environment. This environment is partly captured by farms in a county that sell directly to consumers. There are three main variables that involve farms with direct sales: the number of farms with direct sales, the percentage of farms that have direct sales and direct farm sales per capita. The first two have 177 and 234 missing values respectively, while the latter has 403 missing values. In earlier USDA studies, other variables were successfully used to measure the local food economy. Considering that these other variables are also available for this research, we do not include the three variables mentioned above.

Further, there are seven other variables related to farmers’ markets. Those variables specify what type of products the farmers’ markets sell (e.g. the percentage of farmers’ markets that sell animal products). More than a thousand observations are missing for each of those seven variables. We expect that this is because these are all specific categories, and it might not be clear to which of those categories the farmers’ markets belong. Since there are more general variables that capture the relation between farmers’ markets and obesity, we exclude the seven variables from our final data.

After excluding the variables with large proportions of missing data, there are still some counties for which some of the data is missing. However, those counties represent only a relatively small share of the total observations. Accordingly, we eliminate those counties. The final dataset includes 3052 observations on 43 variables, including the adult obesity rate per county. In the following section we will discuss those variables and consider important descriptive statistics.


4.3 Final Dataset

In this section we will display all our variables that are left after we performed the alterations mentioned in the previous section. First, we will provide a table consisting of all the variables in our final dataset, together with some descriptive statistics. Subsequently, some variable definitions are discussed in more detail. Finally, we provide some relevant statistics of the data.

4.3.1 Variable Definitions

Table 1 describes all variables used in this research, together with their definitions and the year in which they were measured. In addition, the corresponding mean, standard deviation, minimum and maximum are given. Some variables need additional explanation, which we provide below.

To start with the accessibility category: households in an urban area are classified as having low access if they live more than one mile from the nearest supermarket, supercenter or large grocery store; for households in a rural area, the threshold is 10 miles. Further, we need to specify the definitions of children and seniors. Children are aged 18 years or younger, while seniors are aged 65 years or older.

The second category contains indicators on different types of stores, categorized according to the definitions of the North American Industry Classification System (NAICS). Important is the distinction between grocery stores and convenience stores. The former retail a general line of food, including fresh fruits, prepared meats and canned foods; the latter primarily sell a limited line of goods such as milk, soda and snacks. In general, the products in convenience stores are considered unhealthier than those in grocery stores.

The difference between fast-food and full-service restaurants can be explained as follows. Fast-food restaurants are establishments where customers wait in line to order and pay before eating, whereas in full-service restaurants the customers are served while seated and pay afterwards. We have also included the ratio between low-fat milk and soda prices. These prices are regional prices relative to the national price; a region consists of 26 markets, and the regional price is the average price over those markets. This is the only variable in our data that is not measured at the county level, although it is still measured at a finer level of granularity than the national level.


In the local food category, a few variables need further explanation. To begin with, food hubs are organizations that connect farmers and customers by offering distribution. Further, CSA farms are farms that market through a Community Supported Agriculture (CSA) arrangement; such arrangements let customers invest in a farm and in return receive seasonal products during the farming season.1 Lastly, the farm-to-school indicator is a dichotomous variable that takes value 1 if there are one or more farm-to-school programs within a county, and zero otherwise. These programs educate students about the local food economy.

The Natural Amenity Index in the health category is an index created by the ERS of the USDA. It measures the amount of natural amenities within a county, relative to other counties. The index ranges from 1 to 7 and increases in the amount of natural amenities. For example, counties with more lakes and parks, or with higher temperatures, are more likely to score high on the index.

In the final category, there are three dummy variables which we will discuss below.

Persistent-poverty counties are counties where the poverty rate exceeded 20 percent in each of the years 1980, 1990, and 2000. In the Atlas, counties that satisfy this definition are indicated with value 1. Another binary variable is the indicator for metro counties. According to the USDA, these are counties that contain at least one urbanized area or are economically tied to central counties. The USDA assigns value 1 to counties that meet this definition. The same holds for population-loss counties which, following the definition from the Atlas, are counties whose number of residents declined both between 1980 and 1990 and between 1990 and 2000.
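These indicators come precomputed in the Atlas, but their construction follows directly from the definitions above. Purely as a sketch, with hypothetical column names for the historical poverty rates and population counts, the two dummies could be derived as:

```python
import pandas as pd

# Hypothetical historical columns; the Atlas supplies the finished indicators.
counties = pd.DataFrame({
    "poverty_1980": [25.3, 12.1],
    "poverty_1990": [22.8, 13.5],
    "poverty_2000": [21.4, 11.0],
    "pop_1980":     [52000, 81000],
    "pop_1990":     [50500, 86000],
    "pop_2000":     [49800, 91000],
})

# Persistent-poverty county: poverty rate above 20 percent in 1980, 1990 and 2000.
counties["persistent_poverty"] = (
    (counties[["poverty_1980", "poverty_1990", "poverty_2000"]] > 20).all(axis=1)
).astype(int)

# Population-loss county: fewer residents in 1990 than in 1980 and in 2000 than in 1990.
counties["population_loss"] = (
    (counties["pop_1990"] < counties["pop_1980"])
    & (counties["pop_2000"] < counties["pop_1990"])
).astype(int)
```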

Above, we have provided a brief description of a few variables. For a more detailed description we refer to the full data documentation of the USDA, which is available online2.

1. https://www.localharvest.org/csa/


Table 1: Descriptive Statistics

Category Variable Year Mean Std Min Max

Accessibility Population, low access to store (%) 2010 23.426 20.094 0.000 100

Low income & low access to store (%) 2010 8.326 8.110 0.000 72.274

Children, low access to store (%) 2010 5.469 4.808 0.000 34.016

Seniors, low access to store (%) 2010 3.929 4.241 0.000 29.209

Households, no car & low access to store (%) 2010 3.040 2.165 0.000 29.508

Stores Grocery stores/1,000 pop 2012 0.250 0.200 0.000 2.058

Supercenters & club stores/1,000 pop 2012 0.018 0.022 0.000 0.210

Convenience stores/1,000 pop 2012 0.600 0.301 0.000 4.132

Specialized food stores/1,000 pop 2012 0.050 0.068 0.000 0.865

SNAP-authorized stores/1,000 pop 2012 0.868 0.350 0.000 3.200

WIC-authorized stores/1,000 pop 2012 0.225 0.193 0.000 2.107

Restaurants Fast-food restaurants/1,000 pop 2012 0.572 0.282 0.000 5.797

Full-service restaurants/1,000 pop 2012 0.782 0.579 0.000 13.043

Assistance Students eligible for free lunch (%) 2010 42.959 16.248 0.000 99.402

Students eligible for reduced-price lunch (%) 2010 8.550 3.360 0.000 57.417

Prices Price of low-fat milk/price of sodas 2010 0.909 0.126 0.637 1.243

Local Farmers’ markets/1,000 pop 2009 0.036 0.070 0.000 1.020

Vegetable farms 2007 22.297 43.050 0 1100

Farms with vegetables harvested for fresh market 2007 20.111 40.213 0 1100

Orchard farms 2007 36.482 178.548 0 4685

Berry farms 2007 8.174 19.639 0 395

Greenhouse vegetable and fresh herb farms 2007 1.289 1.318 0 41

Food hubs 2012 0.045 0.208 0 1

CSA farms 2007 4.052 5.413 0 79

Farm to school program 2009 0.060 0.238 0 1

Health Adult obesity rate (%) 2010 30.579 4.238 13.100 47.900

Recreation & fitness facilities/1,000 pop 2012 0.069 0.073 0.000 0.770

ERS natural amenity index 1999 3.491 1.049 1 7

Socioeconomics White (%) 2010 78.866 19.401 2.804 99.163

Black (%) 2010 8.603 14.213 0.000 85.439

Hispanic (%) 2010 8.366 13.338 0.000 95.745

Asian (%) 2010 1.042 2.022 0.000 32.997

American Indian or Alaska Native (%) 2010 1.578 6.394 0.000 94.097

Hawaiian or Pacific Islander (%) 2010 0.049 0.111 0.000 2.019

Population 65 years or older (%) 2010 15.971 4.133 3.728 43.385

Population under age 18 (%) 2010 23.447 3.264 9.112 40.127

Median household income (thousand dollars) 2010 43.027 10.587 21.611 119.075

Poverty rate (%) 2010 16.760 6.207 3.200 50.100

Persistent-poverty counties 2010 0.113 0.316 0 1

Child poverty rate (%) 2010 24.199 9.013 3.200 61.100

Persistent-child-poverty counties 2010 0.227 0.419 0 1

Metro/nonmetro counties 2010 0.370 0.483 0 1
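For reference, the summary columns of Table 1 (mean, standard deviation, minimum, maximum) can be reproduced from the cleaned data with a short pandas snippet. This sketch is not taken from the thesis; the file name is hypothetical and stands for the 3052 x 43 frame from the data-preparation step.

```python
import pandas as pd

# `atlas` is the cleaned county-level frame; the file name is illustrative.
atlas = pd.read_csv("food_environment_atlas_clean.csv")

# Mean, std, min and max per variable, rounded to three decimals as in Table 1.
print(atlas.describe().T[["mean", "std", "min", "max"]].round(3))
```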


4.3.2 Descriptive Statistics

Now that we have clarified the definitions, we can highlight some relevant statistics of the variables in Table 1. To start with, the average obesity rate in a county is close to 31 percent. This is in line with our expectations. Its standard deviation is relatively small, suggesting that the obesity rate does not differ much from this average level for most counties.

This is not the case for the accessibility variables: their maxima are far higher than their means and their standard deviations are large. It therefore seems that there are counties where the accessibility of food stores is extremely low. We observe the same pattern for the features in the store category; the store densities do not appear evenly distributed. Another important finding is that on average there are more convenience stores than grocery stores, which might indicate that unhealthy food is more widely available than its healthy alternatives. This does not hold for restaurants, however: in general there are more full-service than fast-food restaurants in a county.

Finally, we discuss the variables in the socioeconomic category. On average, most county residents are of white ethnicity, followed by the black and Hispanic ethnicities, which have similar statistics. Further, we see that the poverty rates are fairly concentrated around the mean, as their standard deviation is relatively low. However, there is at least one county where the poverty rate exceeds 50 percent.

In this chapter we have described the data and its variables. We detailed which of them we include in our model and how we treat missing values. We combine this data with the methods provided in Chapter 3 to predict the county-level obesity rates. The results of those predictions are discussed in the upcoming chapter.


5. Results

In this chapter we discuss the results of our research. It starts with the outcome of the feature selection. Subsequently, in the following five sections we present the results of the prediction models. We consider their parameter tuning and interpret the results of the optimal models. Finally, we compare the prediction performance of the models before and after feature selection.

5.1 Recursive Feature Elimination

The results of the RFE are plotted in Figure 3a. The figure shows that the optimal subset contains the 21 most important variables. However, the RMSE values for subsets that include more variables do not differ much from the minimum reached at 21 variables. For those subsets, the RMSE values fluctuate around 2.5, while their variance over the folds is 0.1 on average. This suggests that the optimal subset might change when different folds are used. Nonetheless, we still consider the subset of 21 variables optimal, since its tuning error is at least as good as that of the larger subsets. Although we do not expect the accuracy of the model predictions to benefit much from the RFE, the models could become easier to interpret and the training times are expected to decrease.

(a) Cross-validation of the different subsets

(b) Variable importance (the 21 highest-ranked variables, from most to least important: NATURAL_AMENITY_INDEX, FULL_SERVICE_RESTAURANT, FREE_LUNCH, BLACK, INCOME, HISPANIC, ASIAN, LOW_ACCESS_NO_CAR, POVERTY, MILK_SODA_PRICE, CHILD_POVERTY, SNAP_STORE, YOUNGER18, WHITE, OLDER65, NATIVE, REDUCED_LUNCH, FITNESS_FACILITIES, FRESHVEG_FARM, ORCHARD_FARM, BERRY_FARM)

Figure 3: Results of the Recursive Feature Elimination
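The thesis does not report the software used for the RFE; Section 3.7 describes a random-forest-based backward elimination with cross-validation. As an illustration only, a roughly equivalent scikit-learn sketch, assuming a predictor matrix `X_train` and the county obesity rates `y_train`, is:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold

# Backward elimination: a random forest ranks the variables, and each candidate
# subset size is scored by cross-validated RMSE (cf. Figure 3a).
selector = RFECV(
    estimator=RandomForestRegressor(n_estimators=500, random_state=1),
    step=1,
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_root_mean_squared_error",
)
selector.fit(X_train, y_train)

print(selector.n_features_)                   # 21 in the thesis
selected_vars = X_train.columns[selector.support_]
```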

As mentioned in Section 3.7, we have used the random forest model to rank the variables based on their importance. The results of this ranking are shown in Figure 3b, where only the 21 highest-ranked variables are included. The importance measure on the horizontal axis is relative to the value of the most important feature, which in our case is the ERS Natural Amenity Index. It is followed by the density of full-service restaurants, the percentage of students eligible for free lunch, and the percentage of black residents in a county. These four variables stand out from the others, since their relative importance is considerably larger. The remaining variables include indicators of socioeconomic status, food availability, food accessibility and physical activity levels.
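The rescaling behind Figure 3b (importances expressed relative to the top-ranked feature) can be sketched as follows. This snippet is not taken from the thesis; `X_sel` (the predictors after RFE) and `y` (the obesity rates) are assumed to exist.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on the selected subset and rescale the importances so that
# the most important variable (the Natural Amenity Index in the thesis) scores 100.
rf = RandomForestRegressor(n_estimators=500, random_state=1).fit(X_sel, y)
importance = pd.Series(rf.feature_importances_, index=X_sel.columns)
relative_importance = (100 * importance / importance.max()).sort_values(ascending=False)
print(relative_importance.head(21))
```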

In the upcoming sections we will see how the feature selection impacts the model predictions. For each model, we show the results both before and after the RFE.

5.2 Ordinary Least Squares

As discussed before, OLS is the baseline of our research. Here we briefly discuss its performance on the test set. Before feature selection, the OLS model achieved an RMSE of 2.709. After RFE it became slightly less accurate, with an RMSE of 2.760.
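As a hedged illustration of how such a baseline figure can be obtained (the thesis does not show its code), assuming the train/test split `X_train, X_test, y_train, y_test`:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit OLS on the training counties and evaluate RMSE on the held-out test set.
ols = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_test, ols.predict(X_test)) ** 0.5
print(round(rmse, 3))  # the thesis reports 2.709 with all variables included
```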

Although the OLS solution is not complex, its predictions are quite accurate. Still, we expect that most of the other models can outperform it. One model for which this is not necessarily expected is the regression tree, which is mainly included for its interpretability rather than its predictive power. In the next section we will see whether these expectations hold.

5.3 Regression Trees

In this section we provide the results of the regression tree model. We start by discussing the outcome of the complexity tuning. Next, the tree corresponding to the optimal cost-complexity parameter is displayed. We end the section with an interpretation of this optimal tree.

We determine the optimal tree size via the complexity parameter. To get a first indication of its optimal value, we specify a coarse grid of values between 0 and 0.1. The results of this initial grid search are given in Figure 4a. The blue line in this figure indicates the tuning results with all variables included; the red line corresponds to the case where only the optimal subset of variables is used.
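The grid search over the complexity parameter can be sketched in scikit-learn as follows. Note that scikit-learn's `ccp_alpha` is only an analogue of the complexity parameter tuned in the thesis and is not numerically identical to it, so the grid below is purely illustrative; `X_train` and `y_train` are assumed.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Coarse grid over the cost-complexity parameter, scored by cross-validated RMSE.
param_grid = {"ccp_alpha": np.linspace(0.0, 0.1, 21)}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_)  # then refine the grid around this value, as in Figure 4b
```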

From the initial search we derive that the optimal complexity parameter lies somewhere between 0 and 0.01. Therefore, we perform a more detailed grid search between those two values.



(a) First Indication


(b) More detailed search

Figure 4: Complexity parameter tuning

The results of this tuning are shown in Figure 4b. They indicate that before feature selection, the optimal complexity parameter is 0.0025. This is slightly higher than after feature selection, where the minimal cross-validation error is reached at 0.0015. Because of this difference, the optimal regression tree after feature selection is larger than the tree that uses all variables. Given this size difference, we only discuss the latter tree in more detail.

The optimal regression tree before RFE is plotted in Figure 5. The tree has grown quite large, with a total of 39 nodes (terminal nodes excluded) and 17 unique splitting variables. However, this is still relatively modest compared to the unpruned tree, which has 177 nodes. Within each node in the figure we see a number and a percentage: the first is the predicted obesity rate for the observations in that node, the second is the share of those observations relative to the whole dataset. The colours of the nodes also carry meaning: they indicate the level of the obesity rate, ranging from dark green (relatively low) to dark blue (relatively high). Finally, the names beneath the nodes denote the splitting variables.

As mentioned before, a great advantage of regression trees is their interpretability, because we can observe directly how the splitting variables affect the predicted response at each node. This works as follows. As discussed earlier, at each node the data is partitioned according to a statement consisting of a splitting variable and a split point. When an observation satisfies the statement, it moves to the left branch of the tree; otherwise it moves to the right. The resulting node on the left always represents a group with lower obesity rates than the node on the right. From this, we can derive the impact of the splitting variables on the predicted obesity rate.
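This split-by-split reading of the tree can also be done programmatically. A small sketch, again assuming scikit-learn and the training data rather than the thesis's own rpart-style implementation, and using an illustrative pruning value instead of the exact complexity parameter of 0.0025, prints the first few levels of splitting rules:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Fit a pruned regression tree and print its splitting rules, so the effect of each
# split on the predicted obesity rate can be read off directly (cf. Figure 5).
tree = DecisionTreeRegressor(ccp_alpha=0.002, random_state=1).fit(X_train, y_train)
print(export_text(tree, feature_names=list(X_train.columns), max_depth=3))
```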
