
Faculty of Economics and Business, Amsterdam School of Economics
Bachelor Thesis, BSc Econometrics and Operations Research

Data analysis: past versus future

Comparing the predictive performance of XGBoost with OLS regression analysis

Caitlin Bruys (11034041) June 26, 2018

Supervisor 1

Dr Marco van der Leij

Supervisor 2


Statement of Originality

This document is written by Caitlin Bruys, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
2 Literature study
   2.1 Classification and regression trees
   2.2 Boosting
   2.3 Gradient boosting
   2.4 XGBoost
   2.5 XGBoost Tree SHAP
   2.6 Ordinary Least Squares
   2.7 Hypotheses
3 Data and Methodology
   3.1 Source of the OLS model and data
   3.2 Data
   3.3 Methodology
      3.3.1 Baseline models
      3.3.2 Models with missing values
      3.3.3 Models without additional variables
      3.3.4 Interpretation of the models
4 Results
   4.1 Optimal hyperparameter values
   4.2 Baseline models
   4.3 Models with missing values
   4.4 Models without additional variables
   4.5 Interpretation of the models
   4.6 SHAP values and feature importance
5 Simulation study
6 Conclusion

Appendices
A I. Summary statistics
B II. Variable description of OLS and XGBoost
C III. Variable description of the additional variables of XGBoost
D IV. Description estimated prediction Tree SHAP
E V. Dataset description simulation study
   E.1 Datasets with underlying linear relationships
   E.2 Datasets with underlying non-linear relationships

1 Introduction

Peter Sondergaard, senior vice president of Gartner Research, stated in 2011 that information is the oil of the 21st century, and analytics is the combustion engine (”Gartner Says Worldwide Enterprise IT Spending to Reach $2.7 Trillion in 2012”, 2011). Big data has become increasingly important, not only in academics but also in business (Chen, Chiang, & Storey, 2012). More and more companies collect data; however, not all of them know how to handle and use this data. When companies use only traditional statistical and econometric techniques, like regression analysis, they cannot use these datasets to their full extent. Furthermore, since handling big datasets raises unique issues, different tools are required (Varian, 2014). Therefore, it may be time to move from econometric methods to machine-learning methods that can handle this large amount of data.

One machine-learning method stands out: XGBoost. More than half of the challenge-winning solutions published at the machine-learning competition site Kaggle during 2015 used XGBoost. Not only at Kaggle, but also at the KDDCup 2015, XGBoost proved to be the most successful method to train the models: it was used in all top-10 solutions that year (Nielsen, 2016). The second most-used method at Kaggle, neural nets, was used in just over a third of the winning solutions, so on these benchmarks XGBoost outperforms neural nets in most cases. XGBoost has provided the best solutions when applied to a variety of problems, such as web-text prediction, motion detection, product categorisation, and store sales prediction (Chen & Guestrin, 2016).

XGBoost, or eXtreme Gradient Boosting in full, was created by Tianqi Chen in 2014. He implemented gradient boosting machines and, since the project is open source, it has benefited from contributions of many other developers. It is a machine-learning method that is optimized in terms of speed and memory resources. The algorithm is designed for performance and can be applied to find optimal regressions, classifications and rankings, and thus to analyze big data. It is based on gradient tree boosting, a machine-learning technique widely used in many applications (Chen & Guestrin, 2016). This technique produces a prediction model consisting of a combination of weak prediction models, which are typically decision trees. The gradient tree boosting method builds this prediction model in stages, just like other boosting methods do. It can be used as a stand-alone predictor, but it can also be incorporated in larger classification algorithms, for example (Chen & Guestrin, 2016). The eXtreme in the name XGBoost refers to the fact that it is much faster than ordinary gradient tree boosting. It is used for supervised learning problems, with training data containing multiple features to predict a target variable.

In econometrics, different regression techniques are used to estimate models and make predictions, ranging from a simple Ordinary Least Squares (OLS) regression to more complicated regression models. These methods have provided satisfactory estimations for years. An example of the application of an OLS regression analysis is described in a paper by Pope (2016), who tried to estimate the effect morning classes have on the productivity of students in grades 6 to 11. This thesis uses the data from Pope's paper, as the dataset contains a huge number of observations as well as many missing values, which can cause problems in OLS models. An estimation such as the one in Pope's paper could also be done by a machine-learning method, for instance XGBoost.

The aim of this paper is to establish to what extent the XGBoost technique performs better than regular econometric regression techniques and how this is affected by differences in model specification. In order to do this, first the literature on the theory behind the algorithm has been studied thoroughly to support the methodology used in this paper. Another aspect of the literature study is the introduction of a selection of econometric regression methods and an investigation into their mechanisms. Furthermore, differences in model specification between XGBoost and econometric regressions have been analysed. This is followed by different econometric regressions on the dataset used in Pope's paper (2016), and by the application of the XGBoost algorithm to that same dataset. Their performances are first compared when the data contain missing values, and the differences are explained. Next, XGBoost is limited to the variables of the OLS regression and their predictive performances are evaluated and compared. Then, both models are interpreted and these interpretations are analyzed and explained.

The results in this thesis show that the XGBoost model has an overall better predictive performance than the OLS model; that is, the RMSE of the XGBoost model is always lower than that of the OLS model. The XGBoost model performs relatively worse than the OLS model when the data contain missing values. This is shown by the significant but modest decrease in the difference between the average RMSEs of the two models, which falls from 1.82% at 10% to 1.54% at 70% of missing values. The second set of findings shows that the predictive performance of the XGBoost model, restricted to the OLS regression variables, is 0.6% lower than that of the OLS model. Finally, the results of interpreting the models show that OLS and XGBoost find approximately the same, nearly linear relationships in the data. Moreover, the six most important variables the two models find are the same, and are found to be of roughly equal importance in the two models. The difference between the predictions of the two models can therefore be explained by the less important variables, which have a different effect on the predictions of both models. A rather unexpected result is that the RMSEs of both models decrease when the data contain increasingly more missing values. This is likely due to outliers in the data becoming missing and thus stabilising the model predictions.

This thesis follows the order described above. It is divided into six chapters. The introduction is followed by Chapter 2, in which the literature study is covered. This provides fundamental knowledge on XGBoost, econometric regressions and their differences and similarities. In Chapter 3, based on the acquired knowledge, an econometric model and an XGBoost model are chosen and applied to the dataset. The results of the application of the models are described in Chapter 4. These results are compared, analysed and linked to the theory found earlier. In Chapter 5, a simulation study is performed to further investigate the differences between the predictive performances of the two models on datasets with different underlying relationships. In the final chapter, the most important findings are summarized, the research question is elaborated on, and possible future research is outlined.

2 Literature study

This literature review provides a basic understanding of the eXtreme Gradient Boosting algorithm. Firstly, trees are discussed, and in particular decision trees. Secondly, the term boosting is explained. Thirdly, these two concepts are combined and this combination is discussed, as they form the basis of gradient boosting, which in itself underpins the extreme gradient-boosting method. Additionally, the regression method OLS is briefly explained and compared to XGBoost to show the fundamental distinguishing factors between these methods. Finally, a summary of the most important findings is given, together with the hypotheses that follow from them.

2.1 Classification and regression trees

The collection of data has both benefits and disadvantages. The large amount of data available to data analysts makes the distinction between useful and useless data difficult. Simple regression techniques like OLS require the assumption of a linear model, and they are not sufficient for large amounts of data. Classification and regression trees, however, can handle non-linear models. These trees are machine-learning methods to construct (prediction) models from data.

In his paper, Loh (2011) explains these methods. In a classification tree, every single observation in the dataset belongs to a class. The tree tries to predict the class of the dependent variable using the independent variables. To find the class an observation belongs to, the tree recursively partitions the dataset into two subsets, one independent variable at a time. The goal is to create two smaller groups in which the dependent variable becomes more similar after each split. A simple prediction model within each split helps to obtain this. The models can, therefore, be represented as binary decision trees, consisting of a root node, decision nodes and leaf nodes. The root node is divided into two decision nodes, which are again split into decision sub-nodes. Each branch of the tree ends in a leaf node. All observations go through this tree and end up in exactly one leaf node. Figure 1 shows how a split is made and what a tree looks like after this split.

Figure 1: Left shows the original data, the middle shows the split in this data, and right the tree after this split.

The classification tree method used in XGBoost for binary dependent variables is a variant of the CART (classification and regression tree) algorithm. To measure the impurity of a node, the Gini index is used. The formula of this impurity function is as follows:

I_g(p) = \sum_{i=1}^{J} p_i (1 - p_i) = 1 - \sum_{i=1}^{J} p_i^2,   (1)

where p_i is the relative frequency of class i in the observation sample (Du & Zhan, 2002).

Loh explains that a split is found by searching over all independent variables and all sets S to minimize the impurity of the two child nodes. If the independent variable is ordered, S is an interval of the form (-∞, c]. If not, S is a subset of the values taken by this independent variable. This process is repeated and applied to all child nodes until the reduction in impurity falls below a given threshold (Loh, 2011). When the model is finished, it can predict the class of an observation based on the given independent variables. So, these classification trees are binary, and are applied when the dependent variable is categorical.

Another type of tree Loh (2011) discusses is the regression tree, which is similar to a classification tree, but differs for instance in that the dependent variable of a regression tree takes ordered and continuous values. This, consequently, changes the way the dataset is split. Since the dependent variable does not belong to a class, a regression model is fitted to each node using all independent variables. Regression trees use the same approach for tree construction as the classification trees described above, with the node predicting the sample mean of the dependent variable. However, instead of the Gini index, the CART regression tree uses the sample variance as node impurity:

\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2,   (2)

where n is the number of observations in the child node, y_i is the value of the dependent variable of observation i in the child node and \bar{y} is the mean of the dependent variable in the child node. This results in piecewise-constant and easy-to-interpret models, although their predictions are regularly less accurate than those of smoother models (Loh, 2011).
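
To make this split-selection procedure concrete, the sketch below (a hypothetical illustration, not code from the thesis) searches one ordered independent variable for the threshold c that minimizes the summed impurity of the two child nodes, using the sample variance of equation (2) as the impurity measure.

    import numpy as np

    def node_impurity(y):
        # Sample variance of the dependent variable in a node, cf. equation (2);
        # nodes with fewer than two observations get zero impurity.
        return y.var(ddof=1) if len(y) > 1 else 0.0

    def best_split(x, y):
        # Search all candidate thresholds c for one ordered variable and return
        # the split (-inf, c] that minimizes the weighted impurity of the children.
        best_c, best_score = None, np.inf
        for c in np.unique(x)[:-1]:          # splitting at the maximum would leave one side empty
            left, right = y[x <= c], y[x > c]
            score = len(left) * node_impurity(left) + len(right) * node_impurity(right)
            if score < best_score:
                best_c, best_score = c, score
        return best_c, best_score

    # Toy example: a single split recovers the jump in the data around x = 3.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([0.9, 1.1, 1.0, 3.0, 3.2, 2.9])
    print(best_split(x, y))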

2.2 Boosting

Not all trees constructed by the learning algorithms discussed above have a high accuracy. Some learning algorithms produce classifiers which perform only slightly better than random guessing. These are called weak classifiers or weak learners and do not appear to be very useful on their own. To obtain a better classifier, Freund and Schapire (1996) introduced a procedure called boosting (Friedman, Hastie, & Tibshirani, 2000). Dietterich explains that this is a way of combining many weak classifiers to create a strong one that performs much better than the individual weak classifiers. XGBoost uses a variant of this boosting algorithm. The algorithm maintains a collection of weights over the original training set. After each classifier has been fitted by the base learning algorithm, the weights are adjusted: if an observation is misclassified its weight increases, and if it is properly classified, its weight decreases (Dietterich, 2000). Therefore, observations that are difficult to classify are given relatively more weight (Sutton, 2005). After a weak classifier is trained, it is added to the strong classifier, which is the collection of previously trained classifiers. Adding the new weak learner to the strong learner improves the performance of the strong learner (Dietterich, 2000).
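
As an illustration of this reweighting idea, the sketch below implements a simplified AdaBoost-style loop (an assumption for illustration, not the exact procedure of Freund and Schapire nor of XGBoost): after each weak classifier is fitted, misclassified observations receive a higher weight and correctly classified ones a lower weight, and the weak learner is added to the ensemble with a contribution based on its weighted error.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def boost(X, y, n_rounds=10):
        # y is coded as -1/+1; every observation starts with equal weight.
        n = len(y)
        w = np.full(n, 1.0 / n)
        weak_learners, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.sum(w * (pred != y)) / np.sum(w)
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # contribution of this weak learner
            w *= np.exp(-alpha * y * pred)                       # up-weight misclassified observations
            w /= w.sum()
            weak_learners.append(stump)
            alphas.append(alpha)
        return weak_learners, alphas

    # The strong classifier is the weighted vote sign(sum_m alpha_m * h_m(x)).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
    learners, alphas = boost(X, y)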

2.3 Gradient boosting

Gradient boosting, as explained by Friedman, is a boosting method that starts with a strong classifier and sequentially generates weak learners, normally decision trees. Each weak learner is then added to the strong classifier. As described in the previous section, gradient boosting focuses on difficult-to-classify cases by training the weak classifier on the so-called pseudo-residuals of the strong classifier by least squares. When the sum of squared residuals is used as loss function, these pseudo-residuals are simply the ordinary residuals (Friedman, 2002). Because difficult-to-classify cases have higher residuals, gradient boosting focuses on these cases and is thus a form of boosting. At each iteration, a weak classifier is trained on these residuals of the whole dataset. The contribution of the weak learner to the strong learner is then determined using the gradient descent optimization process. In short, this process determines the contribution by how much the weak learner minimizes the objective function, and thus the total error of the strong learner. A more detailed explanation of this objective function can be found in the article 'Stochastic Gradient Boosting' by Friedman (2002).
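
The following simplified sketch (assuming a squared-error loss; an illustration of the principle rather than Friedman's full algorithm) shows this stage-wise procedure: each new regression tree is fitted to the residuals of the current strong learner and added with a small learning rate.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, n_trees=100, learning_rate=0.1):
        # Start from a constant prediction and repeatedly fit trees to the residuals.
        prediction = np.full(len(y), y.mean())
        trees = []
        for _ in range(n_trees):
            residuals = y - prediction                      # pseudo-residuals for squared-error loss
            tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
            prediction += learning_rate * tree.predict(X)   # add the weak learner to the strong learner
            trees.append(tree)
        return trees, prediction

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)
    _, fitted = gradient_boost(X, y)
    print(np.mean((y - fitted) ** 2))   # the training error shrinks as more trees are added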

2.4 XGBoost

Gradient boosting and boosted trees have been around for a while, but XGBoost was introduced only in 2014. XGBoost follows the principle of gradient boosting, but differs from it in its modelling details. To improve gradient boosting, Chen and Guestrin (2016) made only a few small adjustments to already existing methods, and this resulted in the better-performing algorithm XGBoost. First of all, XGBoost uses a more generalized model formalization to avoid overfitting and it has a specifically defined complexity. Secondly, it uses a sparsity-aware splitting method, which decreases the run time. Thirdly, it uses a theoretically justified weighted quantile sketch to find the best splits; this, however, is mainly useful for large systems. Moreover, XGBoost proposes an effective way to use computer memory. These last two improvements are not discussed in this paper; see Chen and Guestrin (2016) for a more detailed explanation. Just as other boosting methods, XGBoost recursively adds trees to the strong classifier, with every new tree trying to explain the residuals of the current strong classifier. So each new tree is constructed with the residuals of the current model as dependent variable.

The most important improvement Chen and Guestrin made is the regularization term in the objective function, which is meant to avoid overfitting. Overfitting occurs when the estimated model has been fitted too specifically to the training data, thus giving a high accuracy on the training set; however, when predicting on the test set or any other dataset, the accuracy is much lower. Less overfitting results in a better predictive and less complex model (Chen & Guestrin, 2016). Regular gradient boosting uses an objective function consisting of two parts, a training loss (L) and a regularization term (Ω):

Obj(θ) = L(θ) + Ω(θ), (3)

where θ represents the parameters of the model. The training loss determines how well the model predicts, and the regularization term controls the complexity of the model (Chen & Guestrin, 2016). An example of a widely used loss function is the sum of squared errors. XGBoost uses the following objective function:

L(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),   (4)

with

\Omega(f) = \gamma S + \frac{1}{2} \lambda \sum_{s=1}^{S} w_s^2 + \alpha \sum_{s=1}^{S} |w_s|.   (5)

In equation (4), f_k corresponds to an independent tree structure q with leaf weights w = (w_1, w_2, ..., w_S). S is the number of leaves in the tree and l(\hat{y}_i, y_i) represents the loss function. Because of these additional terms, the size of the tree is controlled by γ and the final learnt weights are smoothed by λ and α. The last two regularization terms are respectively called the L2 and L1 regularization terms. If L2 regularization is added to a regression, it is called a Ridge regression; if L1 regularization is used, it is called a LASSO regression. Adding both L1 and L2 to a regression results in another regularized regression method called elastic net. These types of regressions are often used in machine learning. Adding λ and α leads to less overfitting and creates a model that is complex enough yet not too complex (Chen & Guestrin, 2016).

Equation (4) is difficult to optimize, as it uses functions as parameters. Chen and Guestrin (2016) solve this by training the model by recursively adding new tree structures. Each new tree, t, tries to solve:

\tilde{L}^{(t)} = \sum_{i=1}^{n} l(\hat{y}_i^{(t-1)} + f_t(x_i), y_i) + \Omega(f_t),   (6)

This objective is simplified by approximating the loss function with a second-order Taylor polynomial. The simplified objective function is then given by:

\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),   (7)

where g_i and h_i are respectively the first- and second-order derivatives of the loss function with respect to the prediction, evaluated at \hat{y}^{(t-1)}. This function is rewritten further by expanding equation (5). Next, the optimal leaf weights are calculated and substituted into the expansion of equation (7). Because the regularization term contains an absolute value, a distinction must be made to handle the two possibilities for α. This results in the optimal objective function (Towards Data Science, 2017):

\tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( J_\alpha\left( \sum_{i \in I_j} g_i \right) \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T,   (8)

where J_\alpha is defined as:

J_\alpha(G) = \begin{cases} G + \alpha & \text{if } G < -\alpha, \\ G - \alpha & \text{if } G > \alpha, \\ 0 & \text{otherwise.} \end{cases}   (9)

In equation (8), I_j is defined as the observation set of leaf j. This equation can subsequently be used as a scoring function, which measures how good the tree structure is. This scoring function corresponds to the impurity score function of decision trees.

To find all possible tree structures, Chen and Guestrin (2016) employ an algorithm that starts from a single leaf and iteratively adds branches to the tree. At each iteration, it proposes a split and calculates the loss reduction for that split. The split that maximizes the loss reduction is chosen. This is repeated until the loss reduction is no longer positive, and thus smaller than the regularization term γ. Let I_L and I_R be the observation sets of the left and right nodes after the split. The loss reduction after the split is:

\Delta \tilde{L}^{(t)}(q) = \frac{1}{2} \left[ \frac{\left( J_\alpha\left( \sum_{i \in I_L} g_i \right) \right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left( J_\alpha\left( \sum_{i \in I_R} g_i \right) \right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left( J_\alpha\left( \sum_{i \in I} g_i \right) \right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma.   (10)

In short, the gain adds up the scores of the two groups created when the observations are split into a left and a right group, subtracts the score of the existing node before the split, and corrects the result with the regularization term γ.
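
Assuming a squared-error loss, the split gain of equation (10) can be computed directly from gradient and hessian sums. The helper below is a hypothetical illustration of that calculation, including the soft-thresholding function J_α of equation (9); it is a sketch, not the XGBoost implementation.

    import numpy as np

    def J_alpha(G, alpha):
        # Soft-thresholding of the gradient sum, cf. equation (9).
        if G < -alpha:
            return G + alpha
        if G > alpha:
            return G - alpha
        return 0.0

    def leaf_score(G, H, alpha, lam):
        # Score contribution of one leaf: (J_alpha(G))^2 / (H + lambda).
        return J_alpha(G, alpha) ** 2 / (H + lam)

    def split_gain(g, h, left_mask, alpha=0.0, lam=1.0, gamma=0.0):
        # Loss reduction of splitting observation set I into I_L and I_R, cf. equation (10).
        G, H = g.sum(), h.sum()
        G_L, H_L = g[left_mask].sum(), h[left_mask].sum()
        G_R, H_R = G - G_L, H - H_L
        return 0.5 * (leaf_score(G_L, H_L, alpha, lam)
                      + leaf_score(G_R, H_R, alpha, lam)
                      - leaf_score(G, H, alpha, lam)) - gamma

    # For squared-error loss, g_i = yhat_i - y_i and h_i = 1.
    y = np.array([1.0, 1.2, 3.1, 2.9])
    yhat = np.zeros(4)                     # current prediction of the strong learner
    g, h = yhat - y, np.ones(4)
    print(split_gain(g, h, left_mask=np.array([True, True, False, False])))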

The second improvement Chen and Guestrin made is the sparsity-aware splitting method. Simply put, this method makes smart use of missing values in the dataset. It does this by assigning all missing values to either the left or the right group when making a split, depending on which side results in the best improvement of the objective function. It can thus actually use values that are missing, which is a whole new way of handling missing data. See Chen and Guestrin (2016) for details.
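
In the Python implementation this behaviour requires no extra work from the user: rows with missing entries can be passed in directly, and the learned default directions route them through the tree at prediction time. A minimal sketch with made-up data (not the thesis dataset) could look as follows.

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

    # Mark roughly 20% of the entries of the first variable as missing; XGBoost treats
    # np.nan as missing and learns a default direction for each split.
    X[rng.random(1000) < 0.2, 0] = np.nan

    model = xgb.XGBRegressor(n_estimators=100, max_depth=2, missing=np.nan)
    model.fit(X, y)
    print(model.predict(X[:5]))   # rows containing NaN are predicted as well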

2.5 XGBoost Tree SHAP

Interpreting the prediction a model makes is highly important. However, for complex models like XGBoost, even experts struggle with this (Lundberg & Lee, 2017). In addition, the classic measure of variable importance in XGBoost does not give consistent estimates of the importance. To solve this, Lundberg and Lee proposed a new method called SHAP to be able to interpret the model’s predictions. Each variable in the XGBoost model is assigned a certain importance by SHAP, which is called the feature importance. This feature importance can be obtained using the Tree SHAP package in Python created by Lundberg and Lee (2017).

SHAP values find their basis in Shapley values (Shapley, 1953). The uniqueness of Shapley values follows from game-theoretic proofs on the fair allocation of profits. In their paper, Lundberg and Lee (2017) describe how these Shapley values are found in the Tree SHAP package. They define Shapley values as feature importances when the explanatory variables are multicollinear. To find the effect of including a specific variable k on the prediction of the model, the model is evaluated multiple times. Moreover, Lundberg and Lee (2017) state that, in order to evaluate the effect missing variables have on the original prediction model F, XGBoost in this case, a mapping m_x must be defined. m_x maps between the original input variables of the model and z', which is a binary pattern of missing variables (Lundberg, Erion & Lee, 2018a).

With this mapping m_x, the prediction function F(m_x(z')) can be evaluated, and thus the effect of either observing or not observing a variable can be found. For example, if variable k is observed in this function, z'_k = 1, and if it is not observed, z'_k = 0. In addition, Lundberg and Lee (2017) define

F_P(x_P) = E[F(x) | x_P],   (11)

where P represents the set of observed variables, and E[F(x) | x_P] is the estimated expected prediction of the XGBoost baseline model conditioned on the subset P of the input variables. This expectation is estimated using the estimated decision tree output from the XGBoost model. More details can be found in Appendix D.

Then, to estimate the impact of variable k on the prediction of the model, the observation first passes through the XGBoost model with all possible subsets of variables including variable k, P ∪ {k}. Second, it passes through with all possible subsets without variable k, P. Next, the difference between the expected values for the observation is calculated as follows:

F_{P \cup \{k\}}(x_{P \cup \{k\}}) - F_P(x_P),   (12)

where x_P is the vector of values for the variables in P. The weighted average of these differences represents the SHAP value of variable k for a single observation:

\phi_k(x) = \sum_{P \subseteq V \setminus \{k\}} \frac{|P|! \, (N - |P| - 1)!}{N!} \left[ F_x(P \cup \{k\}) - F_x(P) \right],   (13)

where N is the number of input variables and V the set of all variables. These Tree SHAP values represent the importance of that feature for a specific observation. Summing these values over all variables leads to the prediction output of the model for that specific observation, that is, F(x) = \sum_k \phi_k(x) (Lundberg et al., 2018a). The formation of the prediction of the model is visualized in Figure 2.

Figure 2: This figure shows a single ordering of how the output of a function F arises by adding up the effects \phi_k of each variable as it is incorporated in a conditional expectation. For XGBoost, the order in which the variables are added into the conditional expectation does not influence the value of \phi_k or the prediction output of the model, because the SHAP values are averaged over all possible orderings.
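
In practice these values do not have to be computed by hand: the shap package by Lundberg and Lee provides a TreeExplainer for XGBoost models. The sketch below uses synthetic data (not the thesis dataset) to show the typical workflow and the additivity property F(x) = base value + Σ_k φ_k(x).

    import numpy as np
    import shap
    import xgboost as xgb

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 4))
    y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

    model = xgb.XGBRegressor(n_estimators=200, max_depth=2).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)          # one phi_k per variable and observation

    # Additivity: the expected value of the model plus the SHAP values of an observation
    # approximately reconstructs the model's prediction for that observation.
    reconstruction = explainer.expected_value + shap_values.sum(axis=1)
    print(np.allclose(reconstruction, model.predict(X), atol=1e-3))

    # The mean absolute SHAP value per variable is a common feature-importance measure.
    print(np.abs(shap_values).mean(axis=0))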


2.6 Ordinary Least Squares

To be able to compare XGBoost to OLS, it is important to understand how an OLS regression model is constructed. OLS is a method to linearly model the relationship between a dependent variable and the independent variables. To determine which variables should be used in the OLS regression, first the economic theory behind possible relationships between the dependent and explanatory variables must be studied. If a relationship can be based on theoretical grounds, the variable can be used as a regressor. After all regressors have been found, the OLS model can be constructed. An example of such a model is:

y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i,   (14)

where y_i is the dependent variable and the x_{ij} are the independent variables. \varepsilon_i is the residual, which is the difference between the value of y_i estimated by the model and the true value. OLS finds the values for the coefficients \alpha and \beta that minimize the loss function; for OLS, the loss function is the sum of squared residuals. XGBoost can use this loss function as well, and this is done in the application in this thesis. After the OLS model has been estimated, the coefficients are analyzed, and their significance is evaluated and tested.

The coefficients give a clear impression of how the OLS predictions arise and, as with XGBoost, understanding why the OLS model makes the predictions it makes is very important. Besides the coefficients, SHAP values can also be calculated for OLS. This is less complicated for OLS than for XGBoost and uses the following formula:

\phi_k(x) = \beta_k (x_k - E[x_k]),   (15)

where E[x_k] is estimated by the sample mean \bar{x}_k. So the SHAP value is the OLS regression coefficient multiplied by the deviation from the mean of the variable corresponding to that coefficient. The prediction output of the OLS model results from adding all the SHAP values, just as with the SHAP values of the XGBoost model: F(x) = \sum_k \phi_k(x) (Lundberg & Lee, 2017).
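
Equation (15) is straightforward to compute once the OLS coefficients are available. The sketch below (synthetic data and hypothetical variable names; statsmodels is used here only for illustration) implements it and checks that each prediction equals the mean prediction plus the sum of that observation's SHAP values.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 3))
    y = 1.0 + X @ np.array([0.5, -0.3, 0.8]) + rng.normal(scale=0.1, size=200)

    X_const = sm.add_constant(X)
    ols = sm.OLS(y, X_const).fit()
    beta = ols.params[1:]                          # slope coefficients; the constant is excluded

    # phi_k(x) = beta_k * (x_k - mean(x_k)), cf. equation (15).
    shap_values_ols = (X - X.mean(axis=0)) * beta

    # Each prediction equals the average prediction plus the observation's SHAP values.
    pred = ols.predict(X_const)
    print(np.allclose(pred, pred.mean() + shap_values_ols.sum(axis=1)))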

Both XGBoost and OLS are used to analyze data and, as stated earlier in this thesis, the amount of data becoming available is increasing, not only in size but also in detail. This makes analyzing data using econometric regression techniques much more difficult, since these usually perform better on smaller datasets. Machine-learning methods like XGBoost have been developed to handle large data. This does not mean, however, that machine-learning techniques always provide better estimations than conventional econometric regression techniques. Hal Varian (2014), chief economist at Google, explains that there are several lessons the machine-learning community can draw from econometrics. One example is that observational data cannot determine causality, no matter the size. Another important lesson concerns the process of determining whether a causal relationship or causal inference actually exists (Varian, 2014). To support the fact that machine learning can draw lessons from econometrics, Varian gives an example of an econometric regression outperforming a machine-learning method. He claims that this difference is due to the fact that trees do not work as well if the underlying relationship is actually linear (Varian, 2014, p. 10). When the relationship is not linear, XGBoost is expected to outperform OLS. An example of this is given by Huang, Chen, Gu and Yang (2018).

Varian (2014) divides econometric data analysis into four groups: prediction, summarization, estimation and hypothesis testing. Machine learning mostly focuses on prediction and on developing high-performance algorithms to make useful predictions. In contrast, econometricians mainly look for insights or relationships in the data. Linear regression analyses like OLS are generally used for summarization (Varian, 2014). For OLS to be valid, a number of assumptions must be met; for instance, the error terms must be normally distributed and their conditional expectation equal to 0. XGBoost and other machine-learning algorithms do not need such assumptions. Moreover, a machine-learning method like XGBoost decides which variables are most important during the creation of the model and chooses the ones which give the best split. Thus it does not consider possible theories underlying the relationship between the variables and the dependent variable. This difference in approach can lead to the use of different variables in the two models (Huang, Chen, Gu, & Yang, 2018).

Varian (2014) describes some of the differences between OLS and machine learning. More specific differences between XGBoost and OLS are, for example, that XGBoost adds a regularization term to its loss function, which results in an objective function; OLS does not regularize its loss function. In addition, the evaluation and tests of the coefficients performed in OLS are skipped by XGBoost, so XGBoost uses the variables to explain the dependent variable without validating whether the estimated effect is significantly different from 0. A last example of a difference is that the validity of the estimated coefficients of OLS depends on the number of observations: the coefficients are only valid if the number of observations is sufficiently large. In some datasets, missing values can cause problems with the validity of the coefficients, as observations with missing values are deleted. OLS can therefore not use the missing values, whereas XGBoost can, using its sparsity-aware algorithm.

2.7 Hypotheses

The research in this thesis is twofold: an analysis of the theory of XGBoost and a comparison of this method with OLS. From the latter it becomes clear that XGBoost performs better on large datasets with an underlying non-linear relationship. One of the reasons for this is that XGBoost can use more variables than, and different variables from, OLS, without basing this on economic theory. In addition, the regularization term included in XGBoost ensures that the model does not overfit, while still allowing a more complex (non-linear) model than the linear OLS model. Moreover, XGBoost can use missing values in a dataset, whereas missing values can be a problem for OLS. These factors contribute to a better predictive performance of XGBoost. The factors described above apply to general datasets; however, the dataset used in this thesis (from Pope, 2016) has underlying relationships that are close to linear. This linearity possibly decreases the contribution of these factors to the predictive performance of XGBoost. Therefore, the following hypotheses arise, which are discussed in the next chapter:

• Hypothesis 1: XGBoost gives more accurate predictions than OLS, even on a dataset with relationships that are close to linear. This is because it can capture some non-linearity in nearly linear relationships.

• Hypothesis 2: XGBoost gives more accurate predictions than OLS when the data contain increasingly more missing values.

• Hypothesis 3: XGBoost gives better predictions than OLS when limited to the same variables as OLS.

• Hypothesis 4: XGBoost finds linear relations between the dependent and independent variables with its non-linear model, and these relationships are approximately the same as those found by OLS.

3 Data and Methodology

In this section, the research methodology is specified more precisely. This methodology is used to examine the hypotheses specified earlier and, combined with the literature review, supplies a better comprehension of the hypotheses needed to answer the research question. The source of the OLS model and the data used are introduced and described first. Then the baseline XGBoost and OLS models used to determine the difference in predictive performance are specified. These models are at the root of the comparison between the two prediction methods. First of all, the difference in performance is evaluated when the data contain missing values. Subsequently, the models are used to compare the explanatory variables used in XGBoost to those used in OLS. Second, the XGBoost model is estimated using only those variables available to the OLS regression, to again compare the performances. Finally, the relationships between the dependent and independent variables are analyzed, as well as the interpretation of the XGBoost and OLS models.

3.1 Source of the OLS model and data

In the previous chapter it is explained how OLS is constructed and how the linear regression works. These models can be applied to various problems. An example of the application of a simple OLS regression analysis is described in the paper ’How the time of day affects productivity: evidence from school schedules’ by Pope (2016). In Chapter 4, the data from this paper is used to explore the comparison between OLS and XGBoost.

In his paper, Pope estimates the time-of-day effect on students' GPA and test scores. He finds that having math class in the morning instead of the afternoon increases the math GPA by about 0.072 and the English GPA by about 0.032. From this, Pope (2016) concludes that students learn more in the morning than in the afternoon. For his research, he used the following equation to estimate the effect of morning classes on the math GPA of the student:

S_{i,t} = \beta_0 + \beta_1 Morning_{i,t} + \beta_2 C_{i,t-1} + \beta_3 D_{i,t} + \beta_4 G_{i,t} + \beta_5 T_{i,t} + \beta_6 F_t + \varepsilon_{i,t},   (16)

where S_{i,t} is the math GPA of student i in year t. Morning_{i,t} is a dummy variable for student i in year t that is equal to 1 if the student has class in the first two periods and 0 if in the last two periods. C_{i,t-1} is a vector consisting of the student's math and English CST scores and math GPA of the year before. The CST is a state-wide multiple-choice test that all students in grades 2 to 11 have to take each spring, and it shows how well a student performs in relation to the state content standard. It contains, among other things, an English and a math component. Besides the CST, the students' performances are graded with an A, B, C, D, or F for each subject class in both semesters. Just as in Pope's paper (2016), the CST scores are normalized and the GPA is left as is, ranging from 0 to 4. The vector D_{i,t} contains the student's demographic data, like gender, English Language Learner (ELL) status, age and their parents' educational level. Students with an English Language Learner status are not able to fluently communicate in English or learn effectively in English. T_{i,t} contains information about the teacher and the relevant class. G_{i,t} is a vector of binary variables, one for each grade. Like G_{i,t}, F_t consists of binary variables, taking year fixed effects into account. Finally, \varepsilon_{i,t} is the error term (Pope, 2016). This OLS model was estimated on a panel dataset of about 1.8 million students in Los Angeles in grades 6 to 11. This dataset is described in the following section. Furthermore, the OLS model of this thesis is very much based on the OLS model of Pope described above.

3.2 Data

Before further specifying the models, it is important to describe the data. The data used are very rich, so a simple analysis can uncover important relationships in the data (Pope, 2016). The dataset used in this paper is the same as the one in Pope's paper of 2016, and it is retrieved from the Harvard Dataverse.¹ It is student-level panel data on students from sixth to eleventh grade from the Los Angeles Unified School District (LAUSD). It contains 1.8 million observations from 2003 to 2009, with almost 75% of the students being Hispanic. The collected data concern, for example, their grades, gender, parents' educational level, ELL status, teacher, course name and course period, and these are used as explanatory variables. Furthermore, their California Standards Test (CST) English Language Achievement scores, CST math scores and individual course GPA are available as a measurement of academic ability. The summary statistics presented in Table 7 show that the math GPA is about 0.1 GPA point higher for students who attended classes in the first two periods of a school day than for those in the last two periods. The math and English CST scores are also higher for students attending classes during the first two periods, by 0.073 and 0.061 standard deviations respectively.

¹ Pope, Nolan, 2015, "Replication data for: How the Time of Day Affects Productivity: Evidence from School Schedules", Harvard Dataverse.

Since not all information on each student is complete, the dataset contains a lot of missing values. OLS is not able to use these missing values, so some restrictions must be imposed on the dataset. The schools included in the restricted sample have a six-period day schedule that starts at 8:00 am and ends at 3:10 pm. Furthermore, the sample is restricted to students who are enrolled in at most one math or English class and have the same teacher for that class in both semesters (Pope, 2016). A further sample restriction is that it contains only students who attend a math class in either periods 1 and 2 or periods 5 and 6. Pope (2016) states that omitting periods 3 and 4 emphasizes the difference between morning and afternoon classes, although he mentions that this choice is arbitrary. The last restriction imposed on the dataset is the omission of missing values. First, the explanatory variables containing more than 25 missing values are removed from the dataset. Subsequently, the observations with missing values are deleted. The remaining dataset contains 404,124 observations without any missing values.

3.3 Methodology

The data described above are analyzed using two different methods: OLS and XGBoost. Both methods first estimate a model and then make predictions with this model. When a model is estimated on the same data it has to predict, it can give unrealistically good predictions, because the model has already 'seen' the data it has to predict. This is called data leakage, and it can be prevented by splitting the data into two random, independent subsamples: the train set and the test set. These sets contain 75% and 25% of the dataset, respectively. The train set is used to create a model, whereas the test set is used to independently evaluate performance. Both methods use the exact same train and test set, so that the difference in performance does not depend on the data.
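
A minimal sketch of such a split is shown below; the column name math_gpa and the toy data are assumptions for illustration, not the thesis dataset. Fixing the random seed keeps both methods on exactly the same train and test sets.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the student-level data; "math_gpa" is a hypothetical column name.
    df = pd.DataFrame({"math_gpa":  [2.1, 3.4, 1.8, 3.9, 2.7, 3.1, 2.4, 3.8],
                       "morning":   [1, 0, 1, 0, 1, 1, 0, 0],
                       "prior_gpa": [2.0, 3.5, 2.2, 3.7, 2.5, 3.0, 2.6, 3.6]})

    X, y = df.drop(columns=["math_gpa"]), df["math_gpa"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)   # 75% train, 25% test, shared by OLS and XGBoost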

3.3.1 Baseline models

To get an estimate of the difference in predictive performance between the two methods, first the baseline model for both methods is found, starting with the OLS regression model, which has already been described in Chapter 3.1: for OLS, this is the model of equation (16). The explanatory variables have already been explained in the literature review, and a definition of these variables can be found in Table 9. This regression finds the values for the variable coefficients and the constant that minimize the RMSE of the regression.

The XGBoost baseline model, on the other hand, is constructed in a rather different way. XGBoost presumes a number of hyperparameters and, with those parameters, it builds the best model. In this paper, 8 of the hyperparameters of XGBoost are discussed. The first one, max depth, is the maximum depth allowed for a tree, that is, the maximum number of split levels from the root node to the end nodes of the tree. A higher value of max depth allows for more complex relationships in a tree, but this can cause a model to overfit. The second hyperparameter is min child weight. This is the minimum weight, which corresponds to the minimum number of samples, that is needed to create a new node. The lower the value, the fewer samples are needed in a child node. This allows for more complex trees and can thus cause the model to overfit. The third and fourth hyperparameters, subsample and colsample bytree, are respectively the fraction of observations and explanatory variables selected at each splitting step. When these parameters are set below 1, not all observations and variables are used when constructing a step in a tree, and this makes the model less likely to overfit to a single sample or explanatory variable ("XGBoost Parameters", n.d.).

Gamma is the fifth hyperparameter; it was introduced in equations (5) and (8) as a regularization term. It specifies the minimal loss reduction needed to make a split. When the loss reduction of a proposed split is larger than gamma, the split is made; if the loss reduction is smaller than gamma, the split is not added to the tree. This ultimately makes the model more conservative. The sixth and seventh hyperparameters, lambda and alpha, are regularization hyperparameters as well. Equations (5), (8) and (10) show how lambda is used in the objective function and the loss-reduction function. Both are regularization terms on the weights of the leaves. As explained earlier, lambda corresponds to the L2 regularization term in the Ridge regression and alpha to the L1 regularization term in the Lasso regression. Together, lambda and alpha correspond to elastic net regularization. They penalize high leaf weights, which makes the model more conservative. The eighth and last hyperparameter is the learning rate. After a new tree is constructed, the weights of this tree are multiplied by the learning rate to reduce the influence of each individual tree. This allows future trees to improve the model and prevents overfitting ("XGBoost Parameters", n.d.).


Most of these hyperparameters have a default value, so they do not need to be specified explicitly. Yet, deviating from the default value can lead to a better-performing model. It is, however, difficult to find the exact values of the parameters that produce the best model. The process of finding the best hyperparameters is called hyperparameter tuning. First, the four parameters max depth, min child weight, subsample and colsample bytree are tuned. To do this, a range of values for each parameter is selected. Next, using the cross-validation method described earlier, the train set is split into two subsamples and, for every possible combination of the parameter values, 5 XGBoost models are constructed on one subsample and scored on the other subsample. The model that produces the best average score is selected as the best model for that specific combination of parameter values.

After the best model for each combination has been estimated, the model that has the best score is selected as the overall best model, and the corresponding combination of parameter values is selected as the best parameter values. This mechanism of finding the best parameters by searching over all possible combinations is called a grid search. After the best values of the first 4 hyperparameters have been found, the regularization parameters gamma, lambda and alpha are tuned. Again, by specifying a range of values for these parameters and using the grid-search and cross-validation mechanisms, the best values for gamma, lambda and alpha are found. Finally, with the values of the other hyperparameters fixed, the learning rate is tuned using the same approach as for lambda and alpha. Once the best values of the eight hyperparameters have been found, these are used to find the best model on the whole train set. After the best model has been found, it is used to make predictions of the dependent variable using the test set.
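
A hypothetical sketch of this tuning step, using scikit-learn's GridSearchCV around an XGBoost regressor with abbreviated value ranges (the full ranges are those of Table 1) and synthetic data, is given below; the regularization parameters and the learning rate would be tuned in later, analogous searches.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBRegressor

    rng = np.random.default_rng(3)
    X = rng.normal(size=(2000, 5))                   # stand-in for the tuning sample
    y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=2000)

    param_grid = {                                   # abbreviated ranges for illustration
        "max_depth": [2, 3, 4],
        "min_child_weight": [1, 3, 5],
        "subsample": [0.6, 0.8, 1.0],
        "colsample_bytree": [0.6, 0.8, 1.0],
    }

    search = GridSearchCV(
        XGBRegressor(n_estimators=100),
        param_grid,
        scoring="neg_root_mean_squared_error",       # pick the combination with the lowest RMSE
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_)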

The best models of both methods are then compared with regard to their ability to predict. To do this, the RMSEs of the models are compared. Naturally, the lower the RMSE, the closer the out-of-sample predictions are to the real values, and thus the better the predictions are. Then, to test whether the predictions of the models differ significantly, a test described by Diebold and Mariano (1995) is used. This test was created to compare the prediction accuracy of two competing models and uses the difference between the prediction errors of both models for student i. This difference is called the loss differential and, as the predictive performance of the two methods is measured by a squared-error loss function (RMSE), the Diebold-Mariano test is applied to a squared-loss differential, namely

d_{i,t} = \varepsilon_{OLS,i,t}^2 - \varepsilon_{XGB,i,t}^2.   (17)

So the loss differential is the squared prediction error of OLS minus the squared prediction error of XGBoost for student i in year t. The null hypothesis of this test, H_0: E[d_{i,t}] = 0, is tested against the alternative, H_1: E[d_{i,t}] ≠ 0. In the original Diebold-Mariano test, a consistent estimate of the variance of the asymptotic mean sample loss differential is used. Instead, for the present test, the variance of the mean sample loss differential is used. Thus, under the null hypothesis, the Diebold-Mariano test statistic is

\frac{\bar{d}}{\sqrt{Var(d)/n}} \sim N(0, 1),   (18)

where \bar{d} = \frac{1}{n} \sum_{i,t} d_{i,t} is the mean loss differential and Var(d) = \frac{1}{n} \sum_{i,t} (d_{i,t} - \bar{d})^2 the variance of the loss differential. If the null hypothesis is rejected, the two competing models differ significantly in predictive performance. The present test can therefore determine whether one of the models has a better predictive performance or not, and for this reason it is used for the evaluation of the following models.
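
Under this simplification, the test statistic of equation (18) can be computed directly from the two vectors of out-of-sample prediction errors. The sketch below (with made-up error vectors) illustrates the calculation; it is an illustration under these assumptions, not the original Diebold-Mariano implementation.

    import numpy as np
    from scipy import stats

    def diebold_mariano(e_ols, e_xgb):
        # Squared-loss differential d = e_OLS^2 - e_XGB^2, cf. equation (17).
        d = e_ols ** 2 - e_xgb ** 2
        n = len(d)
        dm = d.mean() / np.sqrt(d.var() / n)          # test statistic of equation (18)
        p_value = 2 * (1 - stats.norm.cdf(abs(dm)))   # two-sided test against N(0, 1)
        return dm, p_value

    # e_ols and e_xgb would be the out-of-sample prediction errors of the two models.
    rng = np.random.default_rng(4)
    e_ols = rng.normal(scale=1.00, size=1000)
    e_xgb = rng.normal(scale=0.95, size=1000)
    print(diebold_mariano(e_ols, e_xgb))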

3.3.2 Models with missing values

In the literature review, the main differences between XGBoost and OLS have been discussed. One of these differences is the ability of XGBoost to handle missing values; in contrast to XGBoost, OLS omits the observations with missing values from the regression. The first point of comparison is how the predictive performance of both models is affected by missing values in the dataset. To evaluate this, the training data are made to gradually contain more missing values. This is done by randomly selecting a row and column in the training data and redefining the original value as missing, in such a way that every selected row contains at most 1 missing value. Moreover, the train and test set are split to contain exactly the same observations as the train and test set of the baseline models, to prevent differences in the datasets from influencing the difference in predictive performance of the two models. First, 10% of the data is made to be missing. Then, the baseline OLS model is estimated on these data and the baseline XGBoost procedure uses these data to find the best model. With these models, predictions are made on the test set, and the RMSE of both models is calculated. The process of creating missing values, finding the best model and predicting on the test set is repeated with 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of observations with missing data. Then, this whole process is repeated 5 times on datasets with different missing values at the same percentages. Because the performance of both models can vary when different values are missing, the average of the RMSEs at each percentage is calculated. In addition, the Diebold-Mariano test is applied to assess whether the predictions of the models differ significantly.
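
A hypothetical sketch of this missing-value injection (one NaN in each selected row of the training data, with the test set left untouched) is given below.

    import numpy as np

    def inject_missing(X_train, frac, rng):
        # Set one randomly chosen entry to NaN in a fraction `frac` of the training rows.
        X = X_train.astype(float).copy()
        n_rows, n_cols = X.shape
        rows = rng.choice(n_rows, size=int(frac * n_rows), replace=False)  # every row at most once
        cols = rng.integers(0, n_cols, size=len(rows))
        X[rows, cols] = np.nan
        return X

    rng = np.random.default_rng(5)
    X_train = rng.normal(size=(100, 4))
    X_missing = inject_missing(X_train, frac=0.10, rng=rng)   # repeated for 20%, ..., 90%
    print(np.isnan(X_missing).sum())                           # 10 rows now contain one NaN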

3.3.3 Models without additional variables

In contrast to XGBoost, OLS needs a careful theoretical investigation of possible explanatory variables, and not every arbitrary variable can be added to the regression. When one explanatory variable can be accurately linearly predicted from the other explanatory variables, the explanatory variables are said to be multicollinear. Multicollinearity does not affect the predictive performance of the OLS model, but the estimations of the individual regression coefficients are possibly not valid. With XGBoost, no such restrictions on the explanatory variables are needed. Hence, XGBoost can deal with many variables in the dataset, relevant or not, and might find a good model with irrelevant variables. To test whether the performance of XGBoost depends on the variables that are irrelevant according to OLS, the XGBoost model is estimated with only the explanatory variables used in the OLS regression. For this, all additional explanatory variables of XGBoost are removed from the train and test set of the baseline XGBoost model. The variables of the OLS regression are listed and described in Table 9 and the additional variables are listed and described in Table 10. The model is then trained on the train set and used to make predictions on the test set. With the predictions and real values the RMSE is calculated and compared to the RMSE of the baseline OLS model. Moreover, the Diebold-Mariano test is applied to see whether the predictions of the two models differ significantly.

3.3.4 Interpretation of the models

As a final point of comparison, the interpretability of both models is evaluated. Understanding why a model makes a certain prediction can be as crucial as the prediction's accuracy in many applications (Lundberg & Lee, 2017). First, the predictions of the OLS baseline model's dependent variable are plotted against the explanatory variables, which can help visualize the relationships found by OLS. Using the same explanatory variables as those of the OLS baseline model, this is also done for the baseline XGBoost model. The relationships found by OLS and XGBoost are then compared using these plots to see if the two methods find different relations between the dependent variable and the independent variables.

Second, the XGBoost model does not estimate coefficients for each explanatory variable used, but attributes a certain importance to the explanatory variables, which is called feature importance. This feature importance is obtained using the Tree SHAP package in Python created by Lundberg and Lee (2017), described in Chapter 2.5. The Tree SHAP package gives consistent and accurate estimations of the feature importance. Applying it to the XGBoost and OLS baseline models provides an estimation of the relationship between each explanatory variable and the dependent variable. Furthermore, the mean SHAP values of the variables are visualized for both the XGBoost and the OLS baseline models.

By using the models described above, the difference in predictive performance between OLS and XGBoost is evaluated. First, the baseline models are compared, followed by the baseline models in the presence of missing values, then the OLS and XGBoost models with only a selection of explanatory variables, and finally the interpretation of both baseline models. In the next chapter, the results of the models described above are shown and analyzed.

4 Results

This chapter consists of the analysis of the obtained results, following from the methodology and models described above. First of all, the optimal hyperparameters of the XGBoost baseline model are found by doing a grid search on the parameters. Second, the baseline OLS and XGBoost models are found and estimated. Their predictive performance is then compared using the RMSE, and the significance of the difference in predictive performance is analyzed with the Diebold-Mariano test. This is also done for the models with missing values and the models with a limited number of variables. Next, the relationships found by both models between the dependent variable and explanatory variables are plotted. And, lastly, the Tree SHAP package in Python is used to calculate and visualize the feature importances of the baseline XGBoost and OLS models.

4.1 Optimal hyperparameter values

In the previous chapter it was explained how to obtain the best values for the hyperparameters with a grid search. Table 1 contains the hyperparameters and their value ranges for the grid search. For each combination of parameters, a cross-validation is done, where the model is trained on 75% of the train set and tested on the other 25%, and the RMSE of this test is recorded. This is repeated 4 times, and the mean RMSE of these 5 tests is calculated.

Table 1: Hyperparameters and their grid-search ranges

Hyperparameter      Range
max depth           [0, 1, 2, 3, 4]
min child weight    [1, 2, 3, 4, 5]
subsample           [0.6, 0.7, 0.8, 0.9, 1]
colsample bytree    [0.6, 0.7, 0.8, 0.9, 1]

In the Python XGBoost implementation, the value 0 for max depth indicates that there is no limit on the maximum depth of a tree.

Based on the mean RMSE’s, the hyperparameter combination which yields the lowest mean RMSE is selected. This grid search can take a lot of time, as the number of models to be estimated increases polynomially with the amount of values in the range. These

(28)

ranges of the selected values, for example, result in 54 = 125 models to be cross-validated

on about 300,000 observations, totaling to 500 models. To reduce the amount of time to find the optimal hyperparameter values, the grid search and cross-validation is done on only a sample of 10.000 training-data observations. Furthermore, the best values of the hyperparameters are estimated only once, to be able to isolate the effect of missing values and a limited number of explanatory variables. This ensures the difference in performance between those models and the baseline XGBoost model does not depend on different hy-perparameter values.

The best hyperparameter values resulting from the grid search are 2 for max depth, 2 for min child weight, 1.0 for subsample and 0.6 for colsample bytree. This means that the trees in the best XGBoost model are not very deep and contain at most 3 leaves, and consequently, at most 1 split. A value of 2 for min child weight means that the minimum weight of a leaf needed to make a split is 2. The values of 1.0 and 0.6 for subsample and colsample bytree, respectively, indicate that it is optimal to let each new tree use all of the observations of the train data, but a subsample of 60% of the explanatory variables. These values are then used to find the best values for gamma, lambda and alpha, which are 1.3, 0.2 and 11, respectively. The value of 1.3 for gamma implies that a split is only made if it reduces the loss function by more than 1.3. The values for lambda and alpha signify a low L2 regularization, but a high L1 regularization on the leaf weights. Finally, the learning rate is tuned with a grid search, resulting in an optimal learning rate of 0.23. The next section discusses the baseline OLS and XGBoost models, of which the latter utilizes the obtained hyperparameter values.

4.2 Baseline models

Now that the best hyperparameter values have been found, the two baseline models can be estimated, starting with the OLS model. The resulting estimations of the regression analysis are presented in Table 2. This table shows that OLS found all but 2 relationships between the dependent and explanatory variables to be statistically highly significant. Moreover, the coefficient of the independent variable Morning is approximately as large as in Pope's paper. In this thesis, the coefficient of Morning was estimated to be 0.0603, whereas in Pope's paper the estimation of this coefficient was slightly higher, namely 0.072.²


Table 2: OLS regression results

Variable                    Coefficient    S.D.
Constant                     0.9372***     0.063
Morning Class                0.0603***     0.004
Prior Math CST score         0.3160***     0.003
Prior English CST score      0.1152***     0.003
Prior Math GPA               0.4307***     0.002
Female                       0.1123***     0.004
Less than HS                -0.0107**      0.005
HS Grad                     -0.0187***     0.006
Some College                -0.0094        0.007
College Grad                 0.0507***     0.007
Grad School                  0.0525***     0.010
ELL                          0.0372***     0.005
Grade FE                     Yes
Year FE                      Yes
Number of observations       301593
R²                           0.395

Significant at *** 1%, ** 5%. The omitted binary variable for parental education level is No Response.

Next, this model is used to make predictions on the test set; the RMSE of these predictions is shown in Table 3. After the baseline OLS model has been estimated, the baseline XGBoost model is estimated and also used to make predictions on the test set. The first column of Table 3 reports the in-sample RMSEs of the OLS regression and the XGBoost model, the second column the out-of-sample RMSEs, and the third column compares the two out-of-sample RMSEs. The table makes clear that the baseline XGBoost model yields a slightly lower RMSE both in sample and out of sample. This is a small but statistically significant decrease in out-of-sample RMSE, as shown by the Diebold-Mariano test.

[2] The difference in coefficient estimates arises because the OLS model in this thesis is estimated on only the train set of 301,593 observations, whereas in Pope (2016) the OLS model was estimated on the full sample of 402,124 observations.


Table 3: Prediction accuracy of the baseline models

             RMSE          RMSE          RMSE            Diebold-Mariano
             (train set)   (test set)    differential    t-value
OLS          0.965998      0.964808      -               -
XGBoost      0.945047      0.946643      1.88%***        -37.75

Significant at *** 1%. The number of train and test observations is 301593 and 100531, respectively.
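The Diebold-Mariano test used to compare the forecast errors throughout this chapter can be sketched as follows; this is a minimal one-step-ahead version under squared-error loss, with simulated error series as placeholders for the actual OLS and XGBoost test-set errors.

```python
# Minimal sketch of a Diebold-Mariano test for equal predictive accuracy under
# squared-error loss (one-step-ahead forecasts, so no autocorrelation correction).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
e_ols = rng.normal(scale=0.965, size=100_531)  # placeholder OLS test-set errors
e_xgb = rng.normal(scale=0.947, size=100_531)  # placeholder XGBoost test-set errors

d = e_xgb**2 - e_ols**2                        # loss differential per observation
dm_stat = d.mean() / np.sqrt(d.var(ddof=1) / len(d))
p_value = 2 * (1 - stats.norm.cdf(abs(dm_stat)))
print(f"DM statistic: {dm_stat:.2f}, p-value: {p_value:.4f}")
```

With this ordering of the loss differential, a negative statistic indicates that XGBoost has the lower average squared error, which appears consistent with the negative t-values reported in the tables.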

4.3 Models with missing values

The average RMSEs of the 5 OLS and XGBoost models estimated on data containing missing values are shown in the left panel of Figure 3, with the percentage of rows with missing values on the horizontal axis and the RMSE on the vertical axis. The figure shows not only the difference in average predictive performance on the test set between the two models, but also how the average performance of each model changes with the share of missing values. As with the baseline models, XGBoost performs better than OLS for every percentage of rows with missing values. Interestingly, the curve of the OLS model shows that, overall, the higher the percentage of missing values in the data, the lower the average RMSE of the OLS model, except at 50%, where the average RMSE increases slightly compared to 40%. From 10% to 40% the average RMSE of the OLS model gradually decreases, at 50% it shows a small increase, and from 50% onwards it decreases continuously.

It was expected that the more values are missing, the higher the RMSEs of the models, since it is intuitive that less data leads to less accurate predictions. However, this does not seem to be the case. The decreasing RMSEs might be explained by the fact that, when values become missing, not only 'regular' values but also outliers become missing. This results in less noise in the dataset and thus more accurate predictions. It also possibly explains why OLS is relatively less affected when the dataset contains more missing values. In the OLS regression, observations with at least one missing value are omitted from the analysis; XGBoost, however, keeps these observations. When the value of an outlier happens to become missing, this therefore has different consequences for OLS than for XGBoost: in the OLS regression the outlier is excluded from the analysis, but in XGBoost it is still included. Excluding these outliers can stabilize the model predictions, and including them can therefore lead to less accurate predictions. This effect is also visible in Table 4, which shows that the RMSE differential decreases as the percentage of rows with missing values increases, at least up to 70%.

Figure 3: The left figure shows the average RMSEs of the models when the data contain missing values, the right figure shows the RMSEs of the two models in the 5 simulations.

Moreover, this effect is visible in both panels of Figure 3: the right panel shows that not only the shapes of the average RMSE curves differ, but also the RMSEs of the two models in each of the 5 individual simulations. The average RMSE of the OLS model, like the OLS RMSEs in the 5 simulations, follows a regular pattern, whereas the average RMSE of the XGBoost model shows a more irregular shape. From 10% to 20% the average RMSE of the XGBoost model increases, followed by a relatively large decrease from 20% to 30%. This decrease is then counteracted by an increase from 30% to 40%. From 40% to 50% the average RMSE of the XGBoost model seems stable, with a minor decrease at 50%, and after 70% it decreases continuously. The largest difference between the shapes of the two curves is that the OLS curve is smoother than that of XGBoost.

The irregular shape of the XGBoost average RMSE might be attributed to the missing values in the train set, which ultimately lead to a different model, or to a similar model with different leaf weights. This can possibly be caused by the sparsity-aware splitting algorithm. As explained in Chapter 2, the sparsity-aware algorithm decides whether observations with a missing value for the variable on which a split is made should go to the left or the right child node. It makes this decision based on the decrease in the loss function: the child node whose direction decreases the loss function most becomes the default direction that all observations with a missing value for that variable follow. The weight of each child node is then calculated from the dependent-variable values of the observations belonging to that node. When the value of an observation for this variable suddenly becomes missing, an observation that originally belonged to the left child node may now end up in the right one, or vice versa. When the leaf weights are then calculated, this may result in different leaf weights, or even in a completely different model with different trees.
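The different treatment of incomplete rows can be illustrated with the sketch below, which injects missing values into 30% of the rows of a simulated dataset: the statsmodels OLS routine drops those rows, whereas XGBoost keeps them and routes them through the learned default directions. The data and settings are illustrative only.

```python
# Minimal sketch: OLS drops rows with missing values, while XGBoost's sparsity-aware
# splitting keeps them by learning a default direction at every split.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import xgboost as xgb

rng = np.random.default_rng(0)
n = 5_000
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
y = 0.5 * X["x1"] + 0.3 * X["x2"] + rng.normal(scale=0.5, size=n)

# Set x1 to missing in 30% of the rows.
X_miss = X.copy()
X_miss.loc[rng.random(n) < 0.3, "x1"] = np.nan

# OLS: incomplete rows are dropped before estimation.
ols = sm.OLS(y, sm.add_constant(X_miss), missing="drop").fit()
print("Rows used by OLS:", int(ols.nobs))        # roughly 70% of n

# XGBoost: all rows are used; NaNs follow the default direction learned per split.
booster = xgb.XGBRegressor(max_depth=2, n_estimators=200).fit(X_miss, y)
print("Rows used by XGBoost:", len(X_miss))      # all n rows
```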

For every percentage of rows with missing values, Table 4 shows the average RMSEs of the two models and the corresponding RMSE differential. The Diebold-Mariano test shows that these RMSE differentials are statistically significant. When 10% or 30% of the rows contain missing values, the decrease in RMSE of the XGBoost model relative to the OLS model is 1.82%. These are the largest decreases in Table 4, yet they are still smaller than the 1.88% difference between the RMSEs of the two baseline models.

For the other percentages, the decrease in average RMSE of the XGBoost model relative to the OLS model is also smaller than the 1.88% of the baseline models, which means that the difference in predictive performance between the two models shrinks. At 70%, the difference in RMSE between the two models is smallest: the decrease in average RMSE of XGBoost relative to that of OLS is 1.54%. Although this difference is not very big, it is significant, as demonstrated by the Diebold-Mariano test. Thus, an XGBoost model estimated on data with missing values does not perform as well as an XGBoost model without missing values, but it does perform significantly better than the OLS model for all percentages of missing values. However, instead of having a relatively better predictive performance than OLS when the data contain missing values, XGBoost's predictive performance is relatively more negatively affected by missing values. Hence, these results reject the second hypothesis of this thesis.

Table 4: Average prediction accuracy per percentage of rows with missing values

                          10%        20%        30%        40%        50%        60%        70%        80%        90%
OLS                       0.9649     0.9647     0.9644     0.9644     0.9645     0.9640     0.9630     0.9628     0.9625
XGBoost                   0.9473     0.9475     0.9468     0.9480     0.9477     0.9481     0.9482     0.9463     0.9455
RMSE differential         1.82%***   1.78%***   1.82%***   1.70%***   1.74%***   1.65%***   1.54%***   1.71%***   1.77%***
Diebold-Mariano t-value   -77.17     -71.35     -67.07     -59.61     -54.86     -47.85     -38.93     -34.42     -24.67

Significant at *** 1%.


4.4 Models without additional variables

The literature review explained that XGBoost can use more variables than, and different variables from, OLS. To examine how much the predictive performance of XGBoost benefits from these additional variables, the XGBoost model is estimated using only the variables from the OLS regression; the additional variables and their descriptions are provided in Table 10. The RMSEs of the baseline OLS model and of the XGBoost model without additional variables are shown in Table 5. The difference in RMSE between these two models is 0.647%, and it does not depend on differences in the data used: the XGBoost model is trained on and predicts exactly the same observations as the baseline OLS model, so the difference in predictive performance depends only on the variables XGBoost was able to use. The RMSE of this XGBoost model is 0.0117 higher than that of the baseline XGBoost model, as shown in Table 5. This implies that XGBoost does indeed use some of the additional variables in its model to make predictions.

Table 5: Prediction accuracy of the OLS model and the XGBoost model without additional variables

             RMSE          RMSE          RMSE (test set)    RMSE            Diebold-Mariano
             (train set)   (test set)    baseline model     differential    t-value
OLS          0.965998      0.964808      0.964808           -               -
XGBoost      0.958228      0.958338      0.946643           0.6%***         -19.09

Significant at *** 1%. The number of train and test observations is 301593 and 100531, respectively.

It is important to note that XGBoost still manages to predict more accurately than OLS, even with the same data. This is probably because XGBoost estimates non-linear models, unlike the linear models estimated by OLS. If the underlying relationship between the dependent and independent variables is not strictly linear, XGBoost is able to capture this non-linearity. OLS does not have this ability: for OLS to capture non-linear relationships, the variables have to be transformed manually before they are used, and even then OLS estimates a model that is linear in the transformed variables, such as the logarithm of an original variable. Compared to the difference in RMSE between the two baseline models, the difference between the RMSEs of the two models in Table 5 is considerably smaller, as shown by the RMSE differential.


The decrease in RMSE of the baseline XGBoost model relative to the baseline OLS model is 1.88%, whereas the decrease of the XGBoost model without additional variables relative to the baseline OLS model is 0.6%. To test whether the difference between the predictions of OLS and XGBoost is still statistically significant, the Diebold-Mariano test is applied again. This test shows that the difference is indeed highly significant, even though a 0.6% decrease does not seem large. So, the predictive performance of XGBoost is better than that of OLS even when XGBoost is limited to the same variables as OLS. Hence, the results are in line with the third hypothesis of this thesis.
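The design of this comparison can be sketched as follows on simulated data: two XGBoost models are fitted on exactly the same train rows, one with all features and one restricted to the "OLS" columns, and their test RMSEs are compared. The column names and settings are placeholders, not the variables of the thesis dataset.

```python
# Minimal sketch of the feature-restriction experiment: identical train/test rows,
# but one model sees all columns and the other only the OLS regressors.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame(rng.normal(size=(n, 6)),
                  columns=["gpa_prior", "cst_math", "cst_ela", "female", "extra1", "extra2"])
y = (0.4 * df["gpa_prior"] + 0.3 * df["cst_math"]
     + 0.1 * df["extra1"] ** 2 + rng.normal(scale=0.5, size=n))

ols_vars = ["gpa_prior", "cst_math", "cst_ela", "female"]   # placeholder OLS columns
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=0)

full = xgb.XGBRegressor(max_depth=2, n_estimators=300).fit(X_train, y_train)
restricted = xgb.XGBRegressor(max_depth=2, n_estimators=300).fit(X_train[ols_vars], y_train)

rmse_full = np.sqrt(mean_squared_error(y_test, full.predict(X_test)))
rmse_restricted = np.sqrt(mean_squared_error(y_test, restricted.predict(X_test[ols_vars])))
print(f"RMSE with all variables: {rmse_full:.4f}, with OLS variables only: {rmse_restricted:.4f}")
```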

4.5 Interpretation of the models

To assess the relationships between the dependent and independent variables of the baseline models, the relationship between the predictions of the models and some of the independent variables is plotted in Figure 4. This is done for the math GPA of the previous year and the math and English CST scores of the previous year. The mean predictions for the values 0 and 1 of the binary variables Morning, Female and ELL status are shown in Table 6. In Figure 4, the trend line of the OLS predictions is shown for each independent variable in the plots of both the baseline OLS model and the baseline XGBoost model, which further emphasizes the difference between the predictions of the two baseline models. The estimation results of the baseline models already showed that this difference is very small, albeit significant. As expected, the plots of the two models for the same independent variable look similar. The first notable difference is that the predictions of the XGBoost model stay more within the limits of the GPA scale, which runs from 0 to 4; this is indicated with the green circles in the figures.

Moreover, for the independent variable Math GPA average prior, the OLS predictions lie further apart for values between 3 and 4 than those of the XGBoost model. The opposite holds for the math and English CST scores: there the OLS predictions lie closer together, which makes the overall shape of the predictions look more linear. The XGBoost model is not restricted to a linear shape, which possibly causes the slightly different shapes of the predictions of the two models for these variables.

The similarity of the plots of the two models for the same independent variable indicates that the relationships, or trends, between the explanatory variable and the dependent variable found by the two models are approximately the same.


Figure 4: The relationship between the predictions of the baseline models and independent variables.

The underlying relationships between the dependent and independent variables are therefore practically linear. For the math and English CST scores and the math GPA of the previous year, the relationship with the dependent variable found by the models is positive; for OLS this is also reflected in the positive coefficients of these variables in Table 2. In addition, both the OLS model and the XGBoost model find a positive relationship between the Morning variable and the predicted math GPA, which can also be derived from the first row of Table 6.


Table 6: Prediction mean of binary variables

                 OLS                   XGBoost
Variable         0          1          0          1
Morning          1.9058     1.9911     1.9067     1.9888
Female           1.8223     2.0675     1.8230     2.0654
ELL status       2.0800     1.4930     2.0890     1.4934

Columns 2 and 4 show the mean prediction for observations with a binary variable value of 0; columns 3 and 5 show the mean for observations with a value of 1.

For both models, the mean predicted value of a student's math GPA is higher for students with a morning class. OLS finds a mean predicted math GPA of 1.91 for students with an afternoon math class and 1.99 for students with a morning math class, and XGBoost finds approximately the same means. This is in accordance with Pope's paper, although the estimated coefficient in this thesis is slightly lower. For the last two independent variables, Female and English Language Learner status, the OLS and XGBoost models also find roughly the same relationships and mean predicted math GPAs for both values of the binary variables.

So, the majority of the relationships between the dependent variable and the independent variables in Figure 4 can be reasonably represented by a linear model. The difference between the predictions of the two models was already known to be small, which is why the plots of the predictions against an independent variable look so similar.
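Plots such as those in Figure 4 can be produced along the lines of the sketch below, which overlays an OLS trend line on the XGBoost predictions for one explanatory variable; the data are simulated stand-ins for the thesis dataset.

```python
# Minimal sketch of a Figure 4-style plot: model predictions against one explanatory
# variable, with the OLS trend line overlaid for comparison. Simulated data only.
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb

rng = np.random.default_rng(0)
gpa_prior = rng.uniform(0, 4, size=2_000)               # stand-in for prior math GPA
gpa = np.clip(0.4 + 0.4 * gpa_prior + rng.normal(scale=0.5, size=2_000), 0, 4)

model = xgb.XGBRegressor(max_depth=2, n_estimators=200, learning_rate=0.23)
model.fit(gpa_prior.reshape(-1, 1), gpa)
pred = model.predict(gpa_prior.reshape(-1, 1))

slope, intercept = np.polyfit(gpa_prior, gpa, 1)         # simple OLS trend line
grid = np.linspace(0, 4, 100)

plt.scatter(gpa_prior, pred, s=4, label="XGBoost predictions")
plt.plot(grid, intercept + slope * grid, color="red", label="OLS trend line")
plt.xlabel("Prior math GPA")
plt.ylabel("Predicted math GPA")
plt.legend()
plt.show()
```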

4.6 SHAP values and feature importance

Figure 5 shows the mean absolute SHAP values of the independent variables of the XGBoost model. The first variable, the math GPA of the prior year (gpaavgprior), has the highest mean absolute SHAP value, which corresponds to the highest feature importance. The independent variables are shown in decreasing order of feature importance, so the second and third most important features in the XGBoost model are the math and English CST scores of the previous year (zcstscoremathprior and zcstscoreelaprior). Interestingly, the four most important variables in the XGBoost model are exactly the most significant variables with the highest OLS coefficients: 0.4307, 0.3160, 0.1152 and 0.1123, respectively.
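Mean absolute SHAP values such as those in Figure 5 can be computed with the shap package, which implements Tree SHAP for tree ensembles; the sketch below uses simulated data and generic feature names rather than the thesis dataset.

```python
# Minimal sketch: per-feature importance as the mean absolute (Tree) SHAP value.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 4))
y = 0.4 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(scale=0.5, size=1_000)

model = xgb.XGBRegressor(max_depth=2, n_estimators=200).fit(X, y)

explainer = shap.TreeExplainer(model)            # Tree SHAP for the fitted ensemble
shap_values = explainer.shap_values(X)           # one SHAP value per feature per observation

mean_abs_shap = np.abs(shap_values).mean(axis=0) # mean |SHAP| = feature importance
for name, value in zip(["x0", "x1", "x2", "x3"], mean_abs_shap):
    print(f"{name}: {value:.4f}")
```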

Besides these variables, the XGBoost model attributes a relatively high importance to some of the fixed-effects variables of the OLS regression, for instance the binary variables for grades 7 to 11. According to this figure, the binary variables representing the year
