
A Machine Learning Approach to Hyperparameter

Estimation

The case where XGBoost estimates its own optimal regularization parameters

Abstract

This study successfully targets the computational capacity issues of hyperparameter optimization of machine learning algorithms. By optimizing 350 randomly generated datasets, we found a relation between a dataset and its optimal regularization parameters. By estimating these optimal parameters, the algorithm is roughly 20 times faster than the original optimization algorithm whilst keeping the reduction in accuracy at a maximum of 20%. The unique methodology can give rise to a new research stream in understanding hyperparameter optimization of machine learning algorithms.

Name: Rick Halm

Student number: 10651098
Supervisor: M.J. van der Leij
Date: June 26, 2018

University of Amsterdam


Statement of Originality

This document is written by Rick Halm who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction 5

2 The Algorithm 7

2.1 Decision Trees . . . 7

2.1.1 Splitting the node . . . 8

2.1.2 Stopping the split . . . 9

2.2 Boosting . . . 10

2.3 Extreme Gradient Boosting . . . 11

2.3.1 Gradient Tree Boosting . . . 11

2.3.2 XGBoost: The Algorithm . . . 13

2.4 Regularization Estimation . . . 15

3 Methodology 17

3.1 Data generation . . . 18

3.1.1 Generating the independent variables . . . 18

3.1.2 Generating the dependent variable . . . 20

3.2 The model: general parameters . . . 21

3.3 Optimizing regularization parameters . . . 23

3.4 Dataset characteristics . . . 24

3.5 Model testing . . . 25

4 Results 27

4.1 Diversity of the generated datasets . . . 27

4.2 Determinants of optimal parameter values . . . 28

4.2.1 Feature Importance . . . 28

4.2.2 Interactions between features . . . 32

4.2.3 General thoughts . . . 34

4.3 Performance of the model . . . 35

4.3.1 General Models . . . 35

4.3.2 Additional Models . . . 38

4.3.3 Comparison of the model performance . . . 40

5 Discussion 42

6 Conclusion 44

7 Reference list 46

8 Appendices 48

8.1 Appendix A: Feature Importance . . . 48

8.2 Appendix B: Additional Model Performance . . . 49


1. Introduction

As one of the foremost database designers, Pat Helland, once said, "We can no longer pretend to live in a clean world" (Helland, 2011, p. 42). This statement refers to the rapidly increasing amount of data, and its messiness, that the digital age provides (Helland, 2011). Examples of this data explosion are the 24 petabytes of data that Google processes per day and the 400 million tweets posted on Twitter per day. However, "more data also comes with more noise" (Walker, 2014, p. 182), referring to the likelihood of errors that increases as datasets get larger (Viktor & Kenneth, 2013). Where econometric methods perform well on models which have theoretically proven functional forms, these methods underperform on the unconventional, large datasets where functional forms are unknown (Berk, 2016). Fortunately, this is exactly where machine learning techniques flourish. By extracting unknown patterns from the data, machine learning algorithms have the ability to predict outcomes from data that were previously perceived as worthless (Mullainathan & Spiess, 2017).

Of these machine learning techniques, one appears to be outperforming other algorithms: Extreme Gradient Boosting (XGBoost). Of the 29 Kaggle data science competitions, over half of the winning teams used XGBoost (Chen & Guestrin, 2016), with impressive results in e.g. differentiating the Higgs boson signals from background noise (Chen & He, 2015) or classifying patients with epilepsy based on an MRI scan of cerebral activity (Torlay, Perrone-Bertolotti, Thomas, & Baciu, 2017).

The inventors, Chen and Guestrin (2016), attribute their success to the updated Gradient Tree Boosting (GTB) learning algorithm that can manipulate the complexity of the model to prevent overfitting. Overfitting is a result of unnecessarily complicating the model and has devastating effects on the generalization of the model (Berk, 2016).

Chen and Guestrin balance the complexity and generalization of the model by introducing two regularization parameters that prevent the algorithm from using too many trees and high weight values. However, the values of these two parameters are set manually.


In order to find the optimal parameter values, the model is trained on all possible parameter combinations. This is highly inefficient and computationally intensive. The research question of this study is to what extent the optimal parameter setting can be empirically estimated and hence the computational intensity of XGBoost reduced.

To answer this question, this study simulates 350 random datasets. Each dataset is optimized using XGBoost. Then, a new dataset is created containing the optimal parameters and characteristics of the 350 optimized models. Next, XGBoost is used to find a relation between the characteristics of a dataset and the optimal values of the parameters. Lastly, the performance of the prediction model is tested against several other models.

The results of this study provide much promise. The proposed prediction model, built on a small number of observations, yields a model that is over 20 times faster than the optimization model. The speed advantage comes at a cost, as its performance in the simulation analysis is at most 20% worse than the optimal model. This result shows that the dataset indeed provides information about the optimal parameters.

This study starts by dissecting the mathematical foundations of the machine learning algorithms that are fundamental to gradient tree boosting and its regularization companion: XGBoost. After the mathematical review, the third chapter discusses how XGBoost is applied on generated datasets. This is followed by the fourth chapter, which elaborates on the results of this study. The fifth chapter, the discussion, critically reviews the performance of the machine learning techniques and provides future directions for the machine learning research stream. Finally, the thesis ends with a summary and conclusion of the paper.


2. The Algorithm

Breiman (2001), one of the founders of classification and regression trees (CART), wrote an article about the 'two cultures of statistical modeling'. In this article he stated that 'data modeling' is the culture that takes the assumption that there exists an underlying stochastic data model that explains the dependent variable. Econometrics is one of the research areas that contributes to this culture. On the other hand, the 'algorithmic modeling' culture contrastingly states that this underlying stochastic data model is complex, "mysterious", and unknown. Its goal is to find a function that can predict the dependent variable by exploring the effect of different explanatory variables and hence, not adhering to one theoretically motivated model.

The following chapter discusses one specific technique from the 'algorithmic modeling' culture that builds a model under the assumption that the underlying data model is too complex: Extreme Gradient Boosting (XGBoost). This chapter starts with a discussion of decision trees, which are the basis of XGBoost. Then, the boosting technique is discussed, which combines multiple trees into one strong model. This is followed by a dissection of the XGBoost algorithm, and the chapter finishes with a discussion of the regularization that prevents the model from overfitting.

2.1

Decision Trees

XGBoost is an algorithm that at its core uses decision trees to make promising predictions. The intuition behind decision trees is that partitions of the dataset with similar characteristics also have similar outcome values. For instance, by partitioning the workforce into different educational levels and gender, the salaries in the subgroups are closer to each other compared to the full dataset.

Figure 2.1 shows such a decision tree. The root node, consisting of the complete dataset, is split up by conditioning on a particular variable. The left node is not split up further and hence is called a leaf. However, the right node is split up further into two


Figure 2.1: Decision tree terminology

leaves and is called a decision node. It is important to state that there are two types of decision trees that are often referred to in the literature: classification trees, which have categorical dependent variables, and regression trees, which have numerical dependent variables.

There are two questions that arise when making a decision tree: where should the node be split, and when should the tree stop splitting (Sutton, 2005)? This is explained in the remainder of this section.

2.1.1

Splitting the node

In order to decide how to split the dataset into two subgroups, the objective is to split the data in such a way that the variance of the outcome variable in the subgroups is smaller than before the split. This is called an increase in node purity. Then, in order to determine where to split, the algorithm calculates the node purity of all possible splits in each variable. The split which leads to the greatest increase in node purity is proposed. There are different measures for node purity. For classification trees, two often used impurity functions are the Gini index of diversity and the entropy function. These functions use the proportion of the subgroup that is correctly classified (Sutton, 2005). On the other hand, regression trees use the sum of squares or absolute differences between the actual value and the prediction as a measure of impurity. Altogether, a node is split with the variable and accompanying value that increase the purity of the tree the most.
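For a regression tree, this procedure amounts to scanning every candidate threshold of every feature and keeping the one with the largest impurity decrease. The following minimal sketch (not the thesis code; X and y are arbitrary numpy arrays) illustrates this with the sum-of-squares impurity:

```python
import numpy as np

def sse(y):
    """Sum of squared deviations from the node mean (regression-tree impurity)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(X, y):
    """Return (feature index, threshold, purity increase) of the best split."""
    best_j, best_t, best_gain = None, None, 0.0
    parent_impurity = sse(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:              # candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            gain = parent_impurity - sse(left) - sse(right)
            if gain > best_gain:
                best_j, best_t, best_gain = j, t, gain
    return best_j, best_t, best_gain
```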


2.1.2

Stopping the split

The best decision tree is a tree that has a perfect balance between model complexity and generalization. A large tree with many nodes and leaves can handle the complex relations in the sample and would, for example, look like the left graph in Figure 2.2. However, this endangers the generalization of the model and results in inaccurate predictions of future observations; this is called overfitting. On the other hand, a small tree underutilizes the information in the training sample and also results in misclassification of future observations; this is called underfitting. The goal is to create a right-sized tree that maximizes the information of the training set while maintaining the accuracy of future predictions (Sutton, 2005). This "perfectly balanced" model would, for example, look like the right graph in Figure 2.2.

Figure 2.2: Finding the perfect balance between overfitting and generalizability

In order to find the perfect balance between complexity and generalization, the following objective function is minimized.

$$L_\alpha(T) = l(T) + \alpha|T| \tag{2.1}$$

In this equation, l(T) is called the loss function and encompasses the misclassification rate or prediction error of the tree on the training set. By increasing the number of leaves |T|, the loss l(T) will decrease. On the other hand, by increasing the number of leaves |T|, the term α|T| will increase. The function L_α(T) is minimized and finds an optimum where the error is as small as possible whilst restraining the final number of leaves in the tree.

The parameter α is a user-chosen value called the regularization parameter, where higher values of α penalize the addition of leaves and lead to simpler models. Hence, α corrects for the model complexity and prevents overfitting. Therefore, the perfect balance between complexity and generalization is found by determining the optimal value of the regularization term α (Sutton, 2005). In general, additional regularization parameters can be added that penalize other characteristics, such as the weights of the leaves.

The optimal value of α is typically found with a process called cross-validation. It starts by dividing the dataset into a training set and a test set. A tree is built on the training set with varying values of α. Then, the tree is tested on the test set to judge its generalizability. The α with the best performing tree is chosen (Sutton, 2005).
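As an aside, the α|T| penalty of Equation 2.1 corresponds to cost-complexity pruning, which scikit-learn exposes as ccp_alpha. A hedged sketch of choosing α by cross-validation (on a synthetic dataset, since the thesis code is not available) could look as follows:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

alphas = np.logspace(-3, 2, 20)          # candidate regularization values
cv_scores = []
for a in alphas:
    tree = DecisionTreeRegressor(ccp_alpha=a, random_state=0)
    # out-of-fold error judges the generalizability of each alpha
    cv_scores.append(cross_val_score(tree, X, y, cv=4,
                                     scoring="neg_mean_squared_error").mean())
best_alpha = alphas[int(np.argmax(cv_scores))]
```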

2.2

Boosting

Up to this point, decision trees have been presented as easily interpretable. However, small changes in the training set can sometimes lead to significantly different trees and hence result in completely different predictions. Therefore, a single decision tree can be inaccurate (Sutton, 2005).

Boosting is a method that decreases this inaccuracy without increasing the bias of the model. It combines small trees, or weak models, into a strong model. Each new tree gives relatively more weight to observations misclassified by the current estimations (Sutton, 2005; Berk, 2016). Boosting has promising results and significantly improves the accuracy of the model (Sutton, 2005). The drawback of boosting is that the combination of multiple trees, called a tree ensemble, results in a highly complex model with low interpretability. There are multiple ways in which trees can be boosted. This paper focuses on gradient tree boosting, which is the boosting technique that is used in the XGBoost algorithm.

2.3

Extreme Gradient Boosting

The following section discusses the main contribution of Chen and Guestrin’s XGBoost; the addition of two regularization terms in the gradient tree boosting algorithm. The section starts with a visual interpretation of the original gradient tree boosting algorithm without regularization terms. This is followed by a formal elaboration of the algorithm including the regularization terms that encompass XGBoost.

2.3.1

Gradient Tree Boosting

Gradient tree boosting is a simple technique where the gradient of the loss function of the current tree ensemble is used as dependent variable in a new tree. To make it more intuitive, take for example the sum of squared residuals. In this case, the gradient and hessian of Equation 2.6 are:

$$l(\hat{y}_i^{(t-1)}, y_i) = (y_i - \hat{y}_i^{(t-1)})^2 \;\Rightarrow\; g_i = \frac{\partial l}{\partial y_i} = 2(y_i - \hat{y}_i^{(t-1)}) \;\Rightarrow\; h_i = \frac{\partial^2 l}{\partial y_i^2} = 2 \tag{2.2}$$

The derivative with respect to y_i shows that the gradient is equal to the residuals of the previous tree ensemble predictions. As the gradients (residuals) are used as the dependent variable in the next tree, the algorithm gives extra weight to the observations which are inaccurately predicted, meaning the observations with high residuals.
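This boosting-on-residuals loop can be illustrated with a minimal sketch (not the thesis implementation; the synthetic data and the small trees are placeholders): each new tree is fitted to the residuals of the current ensemble and added with a shrinkage factor.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

eta, n_trees = 0.2, 50                      # learning rate and number of boosted trees
prediction = np.full(y.shape, y.mean())     # start from a constant prediction
trees = []
for _ in range(n_trees):
    residuals = y - prediction              # gradient of squared loss (up to a constant)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += eta * tree.predict(X)     # shrink each tree's contribution
    trees.append(tree)
```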

The following figures show how gradient tree boosting works. The first graph in Figure 2.3 shows the data points. Then, a split is found that minimizes the loss and is given a prediction ŷ_i that equals the average of the subsample. The final image visualizes the resulting first tree.


Figure 2.3: from clean data to a first tree

Figure 2.4: residuals (left) and the two following proposed splits


Figure 2.5: final model as graph and tree

Clearly, the model in Figure 2.3 shows that the current predictions deviate from the actual values, which form the residuals. Figure 2.4 shows that in the next iteration, these residuals are actually used to find a next split in either the left or right node.


The final step is the determination of the best split. The split with the highest reduction in loss is chosen, which results in the final model and tree seen in Figure 2.5. For exemplary purposes, assume that further splitting of leaves does not improve the model significantly.

Although Figure 2.5 shows the optimal tree, it is still not a perfect prediction. Figure 2.6 shows how the gradient boosting technique tries to explain the residuals of the current model by building a new tree with the residuals as dependent variable. In this way, the model continuously attempts to reduce the residuals in order to increase prediction accuracy whilst reducing the interpretability of the tree ensemble. The following section explains the algorithm in more detail and introduces the regularization parameters that prevent the algorithm from endlessly splitting nodes and creating trees (Berk, 2016). With the regularization parameters, the previous algorithm becomes extreme gradient boosting, commonly referred to as XGBoost.

Figure 2.6: Boosted tree ensemble

2.3.2

XGBoost: The Algorithm

The previous section visually showed how the gradient tree boosting algorithm works; the following section describes the algorithm mathematically. Chen and Guestrin (2016) define a boosted ensemble of K trees that predicts y as follows:

$$\hat{y}_i = \eta \sum_{k=1}^{K} f_k(x_i) \tag{2.3}$$


In this equation, f_k is a single tree containing the weights or predictions of the leaves. Additionally, a learning rate (η) can be included which shrinks the effect that each individual tree has on the total ensemble. In their paper, Chen and Guestrin (2016) set this value to 1 and hence exclude the term from the discussion. As in Section 2.1.2, a tree is built on the basis of a loss function, such as absolute or squared loss, that calculates the residuals of the current tree ensemble. These residuals are then used as dependent variable in the new tree.

XGBoost improves gradient tree boosting by introducing two regularization terms that prevent the model from overfitting. It adds the parameter γ to reduce the number of leaves in the current tree (T) and the parameter λ to reduce the individual contribution of each leaf by constraining its weight (w_j). Then the following objective function is minimized:

$$L^{(t)}(\phi) = \underbrace{\sum_{i=1}^{n} l(\hat{y}_i, y_i)}_{\text{loss function}} + \underbrace{\gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2}_{\text{regularization}} \tag{2.4}$$

With a second-order Taylor approximation, Chen and Guestrin (2016) are able to determine how the objective function depends on the gradient (g_i) and hessian (h_i) of the loss function. By optimizing this approximation, the optimal weights of the leaves are determined, resulting in the following scoring function that measures the quality of the tree. This function has the same role as the impurity functions described in Section 2.1.1.

$$\tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T \tag{2.5}$$

In order to find the right split variable and value, the algorithm evaluates all possible split values within all variables. The following equation is used to determine whether the proposed split should actually be made, by comparing the quality of the tree before and after the split.

$$\tilde{L}^{(t)}_{split} = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_{left}} g_i\right)^2}{\sum_{i \in I_{left}} h_i + \lambda} + \frac{\left(\sum_{i \in I_{right}} g_i\right)^2}{\sum_{i \in I_{right}} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \tag{2.6}$$

The proposed split of I into I_left and I_right with the highest loss reduction is used. The algorithm will stop splitting when the loss reduction is negative, meaning that the regularization term γ is larger than the reduction in loss scaled by λ.
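Equation 2.6 is straightforward to evaluate once the per-observation gradients and hessians are known; the following hedged numpy sketch (names and arguments are illustrative, not the library's internals) computes the gain of one candidate split:

```python
import numpy as np

def split_gain(g, h, left_mask, lam, gamma):
    """Loss reduction of splitting node I into I_left / I_right (Equation 2.6)."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + lam)
    gain = 0.5 * (score(g[left_mask], h[left_mask]) +
                  score(g[~left_mask], h[~left_mask]) -
                  score(g, h)) - gamma
    return gain                                   # split only if gain > 0
```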

2.4

Regularization Estimation

The discussion on the XGBoost algorithm shows that the two regularization parameters λ and γ , also called hyperparameters, are crucial determinants of the balance between model complexity and generalization as explained in Section 2.1.2. Unfortunately, it is unknown beforehand which values of the hyperparameters are optimal and hence a parameter-search must be conducted.

As stated before, the goal is to identify values of γ and λ such that the model accurately predicts new data. A parameter search that uses brute force is grid search. All different combinations in the proposed grid are tested and the model with the best out-of-sample accuracy is chosen. As the parameter grid grows, the computational intensity of the algorithm grows exponentially, resulting in highly exhaustive algorithms (Hsu, Chang, & Lin, 2003). For example, with only five values per parameter, 5² = 25 different models are built.
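Such a brute-force search can be sketched with the XGBoost scikit-learn wrapper and GridSearchCV (a hedged illustration on synthetic data; the grid values below are arbitrary, not the thesis grid): with five candidate values per regularization parameter, 25 models are fitted and compared by cross-validated error.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

param_grid = {"gamma": [0, 1, 10, 100, 1000],        # 5 x 5 = 25 combinations
              "reg_lambda": [0, 1, 10, 100, 1000]}
search = GridSearchCV(XGBRegressor(n_estimators=250, learning_rate=0.2),
                      param_grid, cv=4, scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)     # best parameters and their RMSE
```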

A solution that reduces computational intensity is to estimate the hyperparameters prior to the analysis. However, theoretically derived estimations require strong assumptions on the underlying distribution of the error (Lukas, 1992). For instance, Bayesian optimization assumes that the model f(x) is drawn from a Gaussian process where the dependent variable is normally distributed (Snoek, Larochelle, & Adams, 2012). This goes against the 'algorithmic modeling' culture, which states that the underlying data model is too complex to make such assumptions (Breiman, 2005).


Another solution is gradient-based optimization of the parameters. It is a recursive algorithm which starts with a random parameter value. Then, a model is built and the parameter value is updated using the gradient of the model with respect to the parameter value. This continues until the optimal parameter value converges (Kingma & Ba, 2014). An issue with this optimization method is that it assumes that the loss function is convex. Also, what is often overlooked is that this still requires the creation of multiple models and hence is computationally intensive.

Altogether, the common intuition of the optimization methods is that the optimal hyperparameter values depend on the dataset. Characteristics of the dataset such as the average, range or variance of the dependent variable, or the covariances, intuitively impact the optimal values of the hyperparameters (Snoek, Larochelle, & Adams, 2012; Kingma & Ba, 2014). To what extent those dataset characteristics determine the optimal parameter values remains unknown, as there is no exploratory empirical research in this area. This is the starting point of this thesis, which explores the relationship between the characteristics of the dataset and the optimal hyperparameters. Therefore, the hypothesis of this study is that estimating the hyperparameters on the basis of the characteristics of the dataset significantly increases the speed of the algorithm while the accuracy of the algorithm does not change significantly.


3. Methodology

The previous chapter dissected the XGBoost algorithm in detail and came to the conclusion that the algorithm could possibly be improved by estimating the optimal regularization parameters. Currently, optimization strategies are either only slightly faster than grid search or make strong assumptions on underlying data distributions, which goes against the exploratory nature of machine learning algorithms. Also, there are no simulation studies that aim to estimate the parameters empirically. This literature gap is filled in this study.

The main question of this paper is whether a dataset is somehow related to its optimal regularization parameters. An answer is found by following two stages, as visualized in Figure 3.1. The first stage, called optimization, starts with the generation of random datasets with independent and dependent variables. This is discussed in Section 3.1. Then, Sections 3.2 and 3.3 discuss how each dataset is optimized by XGBoost in order to explain the dependent variable with the independent variables of these generated datasets. The balance between complexity and generalization of the model is found by optimizing its regularization parameters (γ and λ).

Each time a model is created in the first stage, the optimal parameters and particular characteristics of each generated dataset form a new observation in the training set. The specific characteristics of the dataset are explained in Section 3.4. In the second stage, called estimation, XGBoost is used to create a model that explains the optimal parameters (dependent variable) from the dataset characteristics (independent variables). Section 3.5 discusses multiple models that are used to optimize this relationship and how their predictive performance is tested. To clarify, XGBoost is used twice: once to create the best model to explain each generated dataset, and a second time to optimize the relationship between the dataset characteristics and the optimal parameters γ and λ.


Figure 3.1: visual explanation of methodology

3.1

Data generation

As the study requires many datasets with accompanying optimal regularization parameters, these datasets are generated. The data is generated in three steps. First, features are generated by drawing samples from varying distributions. Secondly, in order to make the dataset more realistic, pseudo-correlations between the features are added. Lastly, the dependent variable is created using the features.

3.1.1

Generating the independent variables

Several characteristics of the dataset are generated to ensure diversity of datasets. Firstly, a different number of samples and features are generated for each dataset. Also the number of important features, meaning features that influence the dependent variable, is varied as well as the number of highly correlated features resulting in multicollinearity. Lastly,


because XGBoost can handle missing observations, a percentage of the total observations is removed. More details can be seen in Table 3.1.

To ensure diversity between features, each feature is drawn from eleven possible distributions with randomly drawn parameter settings. Also, sampling distributions that have similar values independent of the parameter settings are multiplied by a random value between -1000 and 1000, so that even those distributions differ in location and variance.

number of samples DU(500,3000)‡

number of features U(3,20)†

number of highly correlated* features DU(2,features/2)

percentage missing U(0,1)

* correlation between 0.5 and 0.8

†Drawn from uniform distribution, ‡Drawn from discrete uniform distribution

Table 3.1: Varying parameter settings to generate different datasets

Distribution Parameter settings

Beta(α, β) α, β U(0,10)†

Binomial(n, p) n DU(10,10000)‡, p U(0,1)

Chi-squared(df) df DU(1,100)

Exponential(λ) λ U(0.2,10)

F(df1, df2) df1, df2 DU(1,100)

Gamma(k, θ) k U(0.1,10), θ U(0.1,2)

Logistic(μ, s) μ DU(-1000,1000), s DU(1,10)*μ

Lognormal(μ, σ) μ DU(-3,3), σ DU(1,5)*μ

Normal(μ, σ) μ DU(-1000,1000), σ DU(1,10)*μ

Poisson(λ) λ DU(1,1000)

Uniform(a, b) a DU(-1000,0), b DU(a,1000)

†Drawn from uniform distribution, ‡Drawn from discrete uniform distribution

Table 3.2: Different sampling distributions with varying parameter settings

Initially, the features are generated independently of each other and hence do not contain any correlation. However, in reality, a dataset often has correlated features. Hence, to make the dataset more realistic, pseudo-correlations (ρ) are added to the features, as can be seen in Equation 3.1. First, a synthetic pseudo-correlation matrix is created with values between 0 and 0.1. Then, a random number of high correlations (see Table 3.1) between 0.5 and 0.8 are entered into the correlation matrix.

Then, the correlations are scaled such that the features have the assigned pseudo-correlation. By scaling with the ratio of the standard deviations (S_{x_i}/S_{x_j}), the new feature keeps the same variance as the targeted feature. Hence, pseudo-correlation is added as follows: x_1^{new} = x_1 + ρ_{2,1}(S_{x_1}/S_{x_2}) x_2 + ... + ρ_{k,1}(S_{x_1}/S_{x_k}) x_k. Finally, the feature matrix (X) is multiplied with the scaled correlation matrix. This results in a feature matrix with correlated features.

$$X_{new} = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1k} \\ x_{21} & & & x_{2k} \\ \vdots & & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nk} \end{bmatrix} \begin{bmatrix} 1 & \rho_{1,2}\frac{S_{x_2}}{S_{x_1}} & \dots & \rho_{1,k}\frac{S_{x_k}}{S_{x_1}} \\ \rho_{2,1}\frac{S_{x_1}}{S_{x_2}} & \ddots & & \vdots \\ \vdots & & \ddots & \rho_{k-1,k}\frac{S_{x_k}}{S_{x_{k-1}}} \\ \rho_{k,1}\frac{S_{x_1}}{S_{x_k}} & \dots & \rho_{k,k-1}\frac{S_{x_{k-1}}}{S_{x_k}} & 1 \end{bmatrix} \tag{3.1}$$
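A hedged numpy sketch of this construction (with illustrative sizes and one strong correlation; not the exact generator of the thesis) shows how the scaled correlation matrix mixes the independent features:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 4
X = rng.normal(size=(n, k)) * rng.uniform(1, 10, size=k)   # independent features
R = rng.uniform(0.0, 0.1, size=(k, k))                      # weak pseudo-correlations
R[0, 1] = R[1, 0] = 0.7                                     # one high correlation
np.fill_diagonal(R, 1.0)

S = X.std(axis=0)
scale = np.outer(1 / S, S)        # element (i, j) equals S_xj / S_xi
X_new = X @ (R * scale)           # feature matrix with correlated features
```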

3.1.2

Generating the dependent variable

The dependent variable is created by generating a random relation between the feature matrix and the dependent variable. First of all, the complexity of the model is defined in the following way. The algorithm generates a random number that defines how many features are informative. Also, random values of the coefficients are drawn for each informative feature. Additionally, a random number is drawn that defines how many informative features contain interaction terms. Table 3.3 shows which values these parameters can take. Finally, for each informative feature that contains interaction terms, a random number of interactions is drawn. In this way, if the jth informative feature contains three interactions, it means that y = ... + b_j x_k x_l x_m + .... Together, these randomly drawn parameters define the relation between the features and the dependent variable.

(21)

CHAPTER 3. METHODOLOGY

number of informative features DU(1, number of features)‡

coefficients U(-10,10)†

number of interactions DU(1, number of informative features)

†Drawn from uniform distribution, ‡Drawn from discrete uniform distribution

Table 3.3: parameter creation for dependent variable

Secondly, to prevent one variable from dominating the dependent variable, each informative feature is scaled before it is multiplied with its randomly generated coefficient, such that all values of the variable lie between 0 and 1: x_i^{scaled} = (x_i - min(x_i)) / (max(x_i) - min(x_i)). After scaling, the variable is multiplied by the accompanying coefficient and a random error (ε_i ~ N(0,1)) is added for each feature.

Lastly, to increase the diversity of the generated dependent variables, the dependent variable is multiplied by a random value between 1 and 100. This ensures that different ranges of y are generated and increases the heterogeneity of the datasets. An example of a generated dependent variable is: y = b_1 x_i^{scaled} + ε_1 + b_2 (x_j x_k)^{scaled} + ε_2.
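The generation of the dependent variable can be sketched as follows (a hedged illustration with arbitrary sizes and probabilities; the thesis draws these quantities from the distributions in Table 3.3):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 6
X = rng.normal(size=(n, k))                       # stand-in feature matrix

def minmax(v):
    return (v - v.min()) / (v.max() - v.min())

informative = rng.choice(k, size=rng.integers(1, k + 1), replace=False)
y = np.zeros(n)
for j in informative:
    b = rng.uniform(-10, 10)                      # random coefficient
    term = X[:, j]
    if rng.random() < 0.5:                        # optionally add an interaction term
        term = term * X[:, rng.integers(k)]
    y += b * minmax(term) + rng.normal(0, 1, n)   # scaled term plus noise
y *= rng.uniform(1, 100)                          # diversify the range of y
```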

3.2

The model: general parameters

Over the past two years, the XGBoost package has updated its algorithm and now includes many more parameters compared to the simple algorithm described in Chapter 2. To ensure the continuity of the paper and the simplicity of the research, the parameters of the algorithm are set in such a way that the algorithm is as similar as possible to the original. These settings can be seen in Table 3.4 and are explained next.


Parameter Value Explanation

Maximum depth of tree 25 Determines how far a tree can grow

Number of boosted trees 250 Determines how many trees are maximally boosted

Learning rate 0.2 Shrinkage factor that corrects the addition of a new tree

Minimum weight of a leaf 0 Minimum number of observations in a leaf

Maximum delta steps 0 Places an absolute cap on the weight

Subsample 100% Percentage of the whole training set that is used

Objective linear Uses squared loss as loss function

Booster gbtree The tree booster explained in this paper

Number of cross validations 4 Number of times a parameter setting is validated on different train-test sets

Table 3.4: Parameter settings for XGBoost

The parameter maximum depth of a tree is a cap that is placed on how far an individual tree can grow. However, this parameter can overwrite the regularization effect of λ and γ which also determine how far a tree can grow. Hence, the default of this parameter is increased from 3 to 25, such that this parameter does not affect the optimization of the regularization parameters. The same argument is used for the minimum weight of a leaf and maximum delta steps and the chosen values ensure that these parameters do not affect the optimal regularization parameters.

The number of boosted trees and the learning rate should generally also be optimized in the grid search. These parameters set the number of trees that are boosted and the coefficient by which each tree is multiplied, as explained in Section 2.3.2. However, due to capacity constraints, these parameters are set to 250 and 0.2. These parameters do not overwrite lambda and gamma but do result in a non-optimal model. Additionally, 250 trees is not always the right-sized ensemble; hence the grid-search function allows an early stopping criterion. When the model has not improved in the last five trees, the model stops before the maximum is reached. This also increases the speed of the algorithm.

Lastly, as the generated dependent variable is continuous, the squared-loss function is used in this paper, and the booster setting is gbtree, which is the boosting algorithm explained in Chapter 2.
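Using the XGBoost scikit-learn API, these fixed settings might look as in the hedged sketch below; the 'linear' objective of the paper maps to reg:squarederror in current releases, and the early stopping after five unimproved rounds is applied during the grid search rather than shown here.

```python
from xgboost import XGBRegressor

base_model = XGBRegressor(
    max_depth=25,            # large cap so tree depth does not override gamma/lambda
    n_estimators=250,        # maximum number of boosted trees
    learning_rate=0.2,       # shrinkage of each newly added tree
    min_child_weight=0,      # minimum leaf weight switched off
    max_delta_step=0,        # no absolute cap on leaf weights
    subsample=1.0,           # use 100% of the training set
    objective="reg:squarederror",
    booster="gbtree",
)
```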


3.3

Optimizing regularization parameters

In order to find the optimal values, the algorithm initializes the model with five suggested values for each parameter and hence has 25 combinations of parameter settings. The algorithm determines the optimal parameters using 4 cross-validations on the 25 different combinations of the parameters, resulting in 100 tree ensembles. Each tree ensemble is evaluated using the root mean squared error (RMSE) and the optimal parameters of the model with the smallest RMSE are chosen. Then a new model is run with five new parameter values that are in the neighborhood of the optimal parameters. The following section explains this optimization in more detail.

Before the algorithm can search in the neighborhood of the optimal values, the maximum should be found. The parameter values of γ and λ are bounded by a minimum of 0, whilst no maximum value is specified for the parameters. Hence, simply searching in the neighborhood of the initial parameter values is not enough; first, the maximum should be found.

When an iteration ends with an optimal γ or λ at a maximum value, the boundary value is multiplied by 5. Then again, five values are chosen as shown in Figure 3.2. This continues until an iteration does not end at the maximum values of the parameter grid. This means that the current parameter grid contains the neighborhood of the optimal value and the narrowing process can be started.

Figure 3.2: Parameter optimization when a maximum is found, empty dot visualizes the optimal value found in the first round


Once the neighborhood of the optimal parameters is found, the algorithm narrows down the search by taking the two neighboring parameter values and creating five new values between these neighbors of the optimal parameter, as shown in Figure 3.3.

Figure 3.3: Narrowing process of parameter optimization

Altogether, the algorithm works as follows. First, the algorithm determines the maximum value of each of the parameters by the process exemplified in Figure 3.2. Once these are found, the optimal parameter value is narrowed down by the process shown in Figure 3.3. This narrowing process continues until either the distance in both parameter grids is smaller than 0.1 or the narrowing process has run five times.
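A hedged sketch of this expand-then-narrow search is given below; evaluate(grid) stands in for the 4-fold cross-validated grid search over all parameter combinations and is assumed to return the grid value with the lowest RMSE.

```python
import numpy as np

def refine(grid, evaluate, max_rounds=5, tol=0.1):
    """Expand the grid while the optimum sits on its boundary, then zoom in."""
    grid = np.asarray(grid, dtype=float)
    best = evaluate(grid)
    while best == grid[-1]:                       # optimum at the boundary: expand
        grid = np.linspace(0, grid[-1] * 5, 5)
        best = evaluate(grid)
    for _ in range(max_rounds):                   # narrow around the current optimum
        i = int(np.where(grid == best)[0][0])
        lo, hi = grid[max(i - 1, 0)], grid[min(i + 1, len(grid) - 1)]
        if hi - lo < tol:
            break
        grid = np.linspace(lo, hi, 5)
        best = evaluate(grid)
    return best
```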

3.4

Dataset characteristics

The previous section explained how each dataset is optimized in order to find the optimal parameters of the dataset. In order to answer the question whether there is a relationship between the dataset and these optimal values, multiple characteristics of the dataset are used. The characteristics are shown in Table 3.5. As XGBoost is a machine learning algorithm, it only selects the features that actually explain the dependent variable. Hence, as many characteristics as possible are used that could explain the optimal parameter setting. The exploratory nature of machine learning algorithms is truly different from econometric models, where the addition of a new independent variable should come with theoretical arguments. Therefore, this paper elaborates upon important features that are found after the analysis rather than before.


Note that some characteristics are only available because of prior knowledge of the data generating process. For instance, the parameter 'number of important features' that was used to create the dependent variable is known due to this prior knowledge. Therefore, generally, the number of important variables should not be used, and instead covariances, correlations and the rank of the feature matrix should be used. However, the goal of this paper is to find a relation between the dataset and the optimal parameters. Therefore, parameters such as the number of important features are used as dataset characteristics.

Dataset characteristics Name in algorithm

Number of observations in training set training_size

Number of features in dataset feature_size

Number of important features nr_imp_feat

Percentage missing of the total observations pct_missing

Range, variance, skewness and kurtosis of the dependent variable range_y, var_y, …

Mean, absolute mean, variance, skewness and kurtosis of the covariance between independent and dependent variable(s) average_cov_y, var_cov_y, …

Mean, variance, skewness and kurtosis of the correlation between independent and dependent variable(s) average_corr_y, var_corr_y, …

25th, 50th, 75th, 90th percentile of covariance between independent and dependent variable(s) 25th_percentile_cov, …

25th, 50th, 75th, 90th percentile of correlations between independent and dependent variable(s) 25th_percentile_corr, …

Rank of the feature matrix rank_X

Ratio rank of feature matrix and total number of features rank_div_features

Table 3.5: Dataset characteristics that could possibly determine the optimal parameters
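For illustration, a subset of these characteristics could be extracted for one generated dataset as in the hedged sketch below (only a few of the columns of Table 3.5 are shown; missing values would need additional handling):

```python
import numpy as np
from scipy.stats import kurtosis, skew

def characteristics(X, y):
    """Compute a subset of the dataset characteristics of Table 3.5."""
    cov = np.array([np.cov(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return {
        "training_size": X.shape[0],
        "feature_size": X.shape[1],
        "range_y": float(y.max() - y.min()),
        "var_y": float(y.var()),
        "skew_y": float(skew(y)),
        "kurt_y": float(kurtosis(y)),
        "average_cov_y": float(cov.mean()),
        "variance_corr_y": float(corr.var()),
        "25th_percentile_corr": float(np.percentile(corr, 25)),
        "rank_X": int(np.linalg.matrix_rank(X)),
        "rank_div_features": np.linalg.matrix_rank(X) / X.shape[1],
    }
```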

3.5

Model testing

The previous sections explained the first stage of this methodology where the optimal parameters are found by optimizing the generated dataset. The first stage resulted in a training dataset containing optimal parameters and characteristics of each dataset. The following section explains the second stage of the methodology where the training set is used to predict the optimal parameters and how its performance is evaluated.


In the second stage, two prediction models are built: one for the optimal γ and one for the optimal λ. The models are built using exactly the same XGBoost algorithm as explained in the previous sections, but with a learning rate of 0.1 and a maximum number of 500 boosted trees in order to increase the accuracy of the prediction model.

The performance of the predictions is evaluated by comparing the root mean squared error. First, random datasets are created and the parameters are predicted. Then, a model is built with the predicted parameters. In order to benchmark the performance of the predicted parameters, the performance is compared with multiple models. Firstly, the performance is compared with a model without any regularization, where gamma is set to zero and lambda is set to one; this is the default setting of XGBoost. Secondly, the dataset generation process discussed in Section 3.1 is not completely random and could have optimal parameters with a central tendency. Therefore, the performance of the predicted parameters is also compared with a second benchmark that uses the average value of the optimal parameters of the training set. Thirdly, in order to measure the extent to which the prediction models capture patterns between the dataset and its optimal parameters, the prediction model is compared to a random model that draws its parameters from a discrete uniform distribution between 0 and 1000. Finally, the performance of the predictions is compared to the RMSE of the optimal model.

In addition to these general models, six models are created that predict the optimal γ and λ differently. Possibly, the relation between the dataset and its optimal parameters is non-linear. Therefore, the first two additional models predict the logarithm of γ and λ. The last four models are conditional models. As discussed in the second chapter, γ and λ interact, and therefore a model is created in which the prediction of one parameter is used to predict the other parameter. To clarify, the prediction model of γ is trained on the training set including the optimal λ. When testing the performance of the prediction model, γ is given the prediction of λ. This is done for γ|λ̂ and λ|γ̂ and for the logarithmic models. In total, eight prediction models are built; together with the default setting, the mean and the random model, this ends up with eleven models.


4. Results

The following section discusses the results of the analysis. First, the diversity of the generated datasets is discussed. Secondly, the XGBoost models that predict γ and λ are discussed including an elaboration on the important features and interactions that influence the values of γ and λ . This is followed by a discussion on the performance of the predictor models compared to default, mean, random and optimal parameters. Lastly, the additional models are combined into a strong performing model.

4.1

Diversity of the generated datasets

In total, 350 generated datasets were analyzed with XGBoost, each resulting in a tree ensemble with optimal regularization parameters γ and λ.

Figure 4.1: Optimal regularization parameters

The diversity of the generated datasets can be seen from the different optimal regularization parameters in Figure 4.1. The optimal parameter λ mostly ranges between 0 and 1200 whilst γ ranges between 0 and 17500, with most values lower than 2500 and 600 for γ and λ respectively. Interestingly, the graph suggests an absence of correlation between γ and λ; however, as expected, there is an interaction. When one of the parameters has a high value, most of the time the other value remains relatively low. Looking at Equation 2.6, this makes sense. When, for instance, λ is high, it reduces the effect of the residuals and hence low values of γ already prevent the split from happening.

4.2

Determinants of optimal parameter values

Tree boosting is a successful technique to improve the predictive accuracy of a single decision tree. However, the increased prediction accuracy is traded off against a reduced interpretability of the proposed model. This reduced interpretability is also apparent in this study, as can be seen in Appendix C, where a part of one of the 500 trees is shown. The following section uses importance and interaction plots to explain the complex tree ensembles created to predict the optimal regularization parameters γ and λ.

4.2.1

Feature Importance

As stated in the methodology, many features were included in the model that could possibly estimate the optimal γ and λ. The exploratory nature of machine learning uses only the features that increase the purity of the model, or in this case, reduce the RMSE. However, it is currently unclear how the importance of each feature is defined. Therefore, the importance of each feature, called a feature score (F-score), is calculated on three dimensions: weight, gain, and coverage. Weight is the total number of times the feature is used to split a node. Gain is the average reduction of RMSE when the feature is used to split a node. Coverage is the average number of samples affected by a split using the feature (Distributed Machine Learning Community, 2016). This paper defines an important feature as a feature that scores high on all three dimensions, that is, has high average gain, is used frequently and covers many observations. The feature importance graphs of the general prediction models of λ and γ are shown in Figure 4.2. The feature importance graphs for the logarithmic models and the conditional models are shown in Appendix A.
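These three dimensions can be read directly from a fitted XGBoost booster, as in the hedged sketch below (fitted on synthetic data purely for illustration):

```python
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)
model = XGBRegressor(n_estimators=50).fit(X, y)

booster = model.get_booster()
importance = {
    "weight": booster.get_score(importance_type="weight"),  # times used to split
    "gain": booster.get_score(importance_type="gain"),      # average loss reduction
    "cover": booster.get_score(importance_type="cover"),    # average samples affected
}
```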


The following section discusses the most important features that determine the regularization parameters. It also offers a few suggestions of why and how these features are related to the optimal parameters. These suggestions can be used by future scholars who dive into the exact relationship.

[Figure 4.2: Feature importance of the γ and λ prediction models, measured by gain, weight, and cover]


General model

From the graphs with the feature importance for the prediction model of γ, there are a few interesting features. First of all, the range and skewness of the dependent variable have high gain, weight and coverage. As the prediction of a leaf is the average of all observations in that leaf, dependent variables with a wide range and large skewness require many splits of leaves to increase predictive accuracy. As the number of leaves is penalized by γ, the range and skewness of the dependent variable determine the optimal value of γ. The same argument goes for the feature containing the number of product interactions, which has high gain and coverage but is not used often. This feature is a measure of the complexity of the model. More complex models require more leaves, which are penalized by γ. Interestingly, the percentiles of the correlations and covariances between the features and the dependent variable are used most frequently. However, they do not cover many observations and do not lead to a high gain. This is an indication that these features are used at the end of the trees, near the leaves.

In contrast to γ, λ's most important features are different. Firstly, the percentage of missing values is an important feature. Equation 2.6 shows that λ scales the sum of squared residuals. When a dataset contains many missing values, the sum of squared residuals is relatively low and hence requires lower regularization compared to similarly sized datasets without missing data. Secondly, the optimal value of λ is determined by the number of observations in the dataset. Logically, the sum of squared residuals grows with the number of observations. As λ can scale these residuals, a dataset with more observations is related to higher values of λ. Lastly, the variance of the correlation between the features and the dependent variable is the most important determinant of λ. A high variance suggests that there are some features that have a very high correlation with the dependent variable. By using such a feature to split, the residuals of the prediction can quickly be reduced, and hence little regularization by λ is required.


Logarithmic model

Compared to the prediction of γ and λ, the logarithmic models have two different important features, as can be seen in Appendix B. By the logarithmic transformation, the attention to large optimal parameters is shifted to more subtle differences. For γ this means that the range of the dependent variable is used significantly less and the variance of the dependent variable becomes more important. This shift suggests that large values of γ are affected by the range of the dependent variable. In this case, large variances require many leaves to improve the predictive accuracy. As the number of leaves is penalized by γ, the variance of the dependent variable determines the optimal value of γ.

On the other hand, the prediction of λ is now shifted from the variance of the correlation towards the percentiles of the correlation and covariances. This suggests that the high values of λ that are squeezed by the logarithmic transformation reduce the effect of the variance of the correlation on the optimal parameter. In this case, the kurtosis of the dependent variable, the fatness of its tails, determines λ. Fat tails result in higher residuals and, as λ scales the residuals, fat tails require regularization.

Conditional models

In the conditional models, one of the parameters is given the prediction of the other parameter. In the prediction model of γ given λ̂, the most important determinant after the variance of the dependent variable becomes λ̂. However, the coverage and frequency of λ̂ are very low. This suggests the same as Figure 4.1: for small values, λ and γ are unrelated, but very high values of λ are related to small values of γ. This result is similar for the model where the prediction model of λ is given the value of γ̂.

Contrasting the previous result, in the conditional versions of the logarithmic models, both the γ given λ̂ and the λ given γ̂ models are strongly determined by the given parameter. The given parameter has high gain, is often used and covers many observations. Due to the logarithmic transformation, the extremely high values give way to more subtle effects of the other parameter.


4.2.2

Interactions between features

The previous section discussed important features that have high gain, are frequently used, and capture many observations. However, this result does not describe neighboring or interacting features, that is, features that are often split after each other in a decision tree. The following graphs show which percentage of the splits in a feature is followed by a split in the directed feature. For example, the top graph in Figure 4.3 shows that a split on the range of the dependent variable is followed by a split on the variance of the dependent variable 21% of the time. To improve the visibility of the graphs, both include only percentages higher than 10%. The size of the nodes indicates how many times a feature is used to split in the full tree ensemble. The following section discusses the findings on the basis of these two network graphs.
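Such split-interaction counts can be derived from a fitted booster by dumping the ensemble to a table and counting parent-to-child feature transitions; the hedged sketch below (again on synthetic data, with the network plotting itself omitted) illustrates the idea.

```python
from collections import Counter

from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)
model = XGBRegressor(n_estimators=50).fit(X, y)

trees = model.get_booster().trees_to_dataframe()   # one row per node of the ensemble
by_id = trees.set_index("ID")
pairs = Counter()
for _, node in trees[trees.Feature != "Leaf"].iterrows():
    for child_id in (node.Yes, node.No):            # the two children of this split
        child = by_id.loc[child_id]
        if child.Feature != "Leaf":
            pairs[(node.Feature, child.Feature)] += 1   # split followed by split
```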

For the prediction model of γ, the previous discussion showed that the range and skewness of the dependent variable are highly important features to predict γ, whilst the percentiles are used often but are of limited importance. The top graph of Figure 4.3 shows three interesting remarks about the interactions of the splits. Firstly, most of the feature splits are followed by a split in the 25th percentile of the correlation. A split of this feature is followed 29% of the time by a split in the 25th percentile of the covariance. Interestingly, all other percentiles of correlations and covariances point towards the 25th percentiles of the correlation and covariance. Despite the limited importance of both features, this explains why both features are used so often. A split in a more important feature, the range of the dependent variable, leads 21% of the time to a split in the variance of the dependent variable. This explains the strong relation between the range and variance of the dependent variable also seen in the logarithmic models in the discussion of feature importance. The last important feature, the skewness of the dependent variable, leads 10% of the time to a split in the kurtosis of the covariance of the dependent variable.

The previous discussion showed that λ's most important determinants are the percentage of missing observations, the size of the training set and the variance of the correlation with the dependent variable. Firstly, the bottom graph in Figure 4.3 shows again that many splits are followed by splits in the 25th percentile of the correlation, which in turn leads to a split in the training size 11% of the time. Another interesting finding is that many feature splits result in a split in the training size. However, a split in the training size is followed by a dispersed set of features (each lower than 10%) and hence is not shown in this graph. Thirdly, the variance of the correlation with the dependent variable most often leads to a split in the percentage of missing observations, which in turn leads to splits in the kurtosis of the correlation with the dependent variable.

Altogether, both network graphs reveal some interesting interactions between splits of features. Still, these "important" interactions only occur 10-20% of the time. This leads to the most important finding of this section: even though we seem to understand important features and interactions, the relationship between the dataset and its optimal parameters is highly complex and the understanding of the complete tree ensemble is limited.


Figure 4.3: Relations between splits for the prediction models of γ (top) and λ (bottom)

4.2.3

General thoughts

Machine learning techniques are designed to explore patterns in datasets that are theoretically unknown. The models that are built are famous for their predictive accuracy, but they are too complex to be interpretable. This is also the case for the relationship between the dataset and its optimal regularization parameters. A simple correlation or interaction between the optimal parameters and the dataset does not exist. The previous section used two methods that aim to explain the complex relationship. The feature importance generally favored characteristics of the dependent variable, the size of the dataset and its number of missing values. The network analysis of the feature splits uncovered a few interesting interactions, although these occur relatively infrequently. The following section shows the upside of tree ensembles: even though the relationship between the dataset and its optimal parameters is highly complex, its predictive performance is quite promising.

4.3 Performance of the model

The following section discusses the performance of the proposed models. Firstly, the performance of the general model is compared to several benchmarks. Then, several models are combined to create a strong-performing model. Lastly, the additional models are compared to increase the understanding of the model performance under certain circumstances.

4.3.1 General Models

After the creation of the models that predict the optimal values of λ and γ, their performance is tested on 116 newly generated datasets. From each dataset, the required characteristics are extracted and used to predict the optimal λ and γ of the dataset. Then, the tree ensemble is built with the predicted parameter values. The model is benchmarked against the default, mean, random and optimal models using the root mean squared error (RMSE) of the model.
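To make this benchmarking step concrete, the sketch below scores one newly generated dataset; it is a minimal sketch under stated assumptions, not the thesis code. The meta-models meta_model_lambda and meta_model_gamma, the helper extract_characteristics, and the objects X_new, y_new and optimal_params are hypothetical names; only the XGBoost and scikit-learn calls are real library APIs.

```python
# Benchmarking sketch: compare predicted, default and optimized regularization terms
# on one fresh dataset (assumed names: meta_model_lambda, meta_model_gamma,
# extract_characteristics, X_new, y_new, optimal_params).
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_with_params(X, y, reg_lambda, gamma):
    """Fit one tree ensemble with the given regularization terms and return its test RMSE."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = xgb.XGBRegressor(n_estimators=100, reg_lambda=reg_lambda, gamma=gamma)
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te)) ** 0.5


# Predict the regularization terms from the dataset characteristics.
chars = extract_characteristics(X_new, y_new).reshape(1, -1)
lam_hat = float(meta_model_lambda.predict(chars)[0])
gam_hat = float(meta_model_gamma.predict(chars)[0])

rmse_pred = rmse_with_params(X_new, y_new, lam_hat, gam_hat)
rmse_default = rmse_with_params(X_new, y_new, reg_lambda=1.0, gamma=0.0)  # XGBoost defaults
rmse_optimal = rmse_with_params(X_new, y_new, *optimal_params)            # from the slow search

# "Percentage RMSE compared to optimal", the quantity plotted in Figure 4.4.
relative_error = (rmse_pred - rmse_optimal) / rmse_optimal
print(f"predicted parameters: {relative_error:+.1%} RMSE relative to optimal")
```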

Figure 4.4 shows the performance of the predicted model compared to the default model without regularization and the mean model of the trained dataset. The performance of the mean and default models varies a lot but is still quite centered around 20% higher error compared to the optimal model. The result of the mean model implies that the data generating process is diverse but has a slight tendency to produce similar datasets. Interestingly, compared to the default model, the predicted parameters generally perform better, but when they perform worse, they perform drastically worse. This implies that if the model predicts the wrong regularization terms, it can perform drastically worse than a model without regularization.

The last benchmark is the comparison of the predicted model against the random model. Interestingly, the model with random parameters does not perform very badly and is sometimes even better than the prediction model. Still, the model with the predicted parameters performs better 75% of the time and sometimes results in a 5 times lower RMSE. Also, the model with the predicted parameters generally performs better than the mean and default models, as respectively 84 and 82 of the 116 datasets have a lower RMSE. Altogether, these results suggest that the dataset provides information about the optimal regularization parameters. However, keep in mind that the parameter predictions are far from flawless.


Figure 4.4: RMSE of the predicted parameters (upper left), default (upper right) and mean (lower left) relative to the optimal RMSE. Bottom right shows predictive model compared to randomly guessed parameters

Next to these benchmarks, the upper left graph in Figure 4.4 shows the performance of the predicted parameters compared to the performance with optimized parameters. It shows that 76 of the 116 models with predicted parameters have less than 20% higher RMSE than with the optimal parameters. Also, 80% of the predicted models have less than 40% higher RMSE. Interestingly, there are two cases in which the model with the predicted parameters performs better than the one with the optimal values. This illustrates the non-convexity of the loss function discussed in Equation 2.4 and suggests that the "optimal" values that are found could actually be a local minimum. Still, there are quite some cases in which the predicted model has twice the RMSE of the optimal model.


4.3.2 Additional Models

In order to capture these terribly performing cases, five additional models are built to predict γ and λ. As Figure 4.1 shows, the relationship between the optimal parameters is not linear; hence, a model is built that predicts the logarithms of γ and λ. Additionally, the figure also shows how the regularization terms interact: small values of one parameter lead to high values of the other. Therefore, four models are built that predict one of the parameters conditional on the prediction of the other. An overview of all models and their explanations is given in Figure 4.5. The results of each individual model are shown in Appendix B.

model                               explanation
λ ; γ                               prediction with knowledge of the dataset
log(λ+1) ; log(γ+1)                 prediction of the logarithm of the parameters
λ | γ ; γ                           prediction of lambda given the prediction of gamma
γ | λ ; λ                           prediction of gamma given the prediction of lambda
log(λ+1) | log(γ+1) ; log(γ+1)      prediction of log-transformed lambda given the log prediction of gamma
log(γ+1) | log(λ+1) ; log(λ+1)      prediction of log-transformed gamma given the log prediction of lambda
λ = 1 ; γ = 0                       default model, without regularization

Figure 4.5: Overview of all models used to predict the optimal parameters
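One possible construction of these additional models is sketched below. The thesis does not spell out the implementation, so treating the conditioning as "append the predicted parameter to the dataset characteristics" is an assumption, and X_chars, lam_opt and gam_opt are hypothetical arrays holding the characteristics of the 350 training datasets and their optimized parameter values.

```python
# Sketch of the log-transformed and conditional meta-models (assumed inputs:
# X_chars with one row of characteristics per training dataset, and the
# optimized parameter vectors lam_opt and gam_opt).
import numpy as np
import xgboost as xgb

# Log-transformed targets, mirroring log(lambda+1) and log(gamma+1) in Figure 4.5.
log_lam_model = xgb.XGBRegressor(n_estimators=100).fit(X_chars, np.log1p(lam_opt))
log_gam_model = xgb.XGBRegressor(n_estimators=100).fit(X_chars, np.log1p(gam_opt))

# Conditional model lambda | gamma: first predict gamma from the characteristics,
# then predict lambda from the characteristics extended with that prediction.
gam_model = xgb.XGBRegressor(n_estimators=100).fit(X_chars, gam_opt)
gam_hat_train = gam_model.predict(X_chars)
lam_given_gam = xgb.XGBRegressor(n_estimators=100).fit(
    np.column_stack([X_chars, gam_hat_train]), lam_opt
)


def predict_lambda_given_gamma(chars):
    """Apply the conditional chain at prediction time: gamma first, then lambda | gamma."""
    gam = gam_model.predict(chars)
    lam = lam_given_gam.predict(np.column_stack([chars, gam]))
    return lam, gam
```

The mirror-image models (γ given λ and their logarithmic variants) follow the same pattern with the roles of the two parameters swapped.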

The final result of this paper is the performance of a combined model. It combines all prediction models of Figure 4.5 and picks in each case the model with the lowest RMSE. In this way, it removes the drastically large errors that occur when the regularization parameters are wrongly predicted. The computational speed of this combined model is still impressive, as it only requires 28 iterations while the optimization method requires at least 500. The left histogram in Figure 4.6 shows that the combination of seven simple models significantly reduces the predictive error, to on average 12% higher RMSE. Also, 50% of the models have less than 7.5% higher RMSE and 90% of the models have less than 20% higher RMSE.
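A minimal sketch of this selection step is given below, assuming a hypothetical list candidate_params of the seven (λ, γ) pairs proposed by the models of Figure 4.5; for each pair one ensemble is fitted and the pair with the lowest validation RMSE is kept.

```python
# Combined-model sketch: evaluate each proposed (lambda, gamma) pair with a single
# tree ensemble and keep the best one (candidate_params is an assumed input).
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def best_of_candidates(X, y, candidate_params):
    """Return (rmse, lambda, gamma) of the best-performing candidate setting."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    scored = []
    for lam, gam in candidate_params:  # a handful of fits instead of a 500+ point grid search
        model = xgb.XGBRegressor(n_estimators=100, reg_lambda=lam, gamma=gam)
        model.fit(X_tr, y_tr)
        rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
        scored.append((rmse, lam, gam))
    return min(scored)  # tuples compare on RMSE first
```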

On the other hand, in order to assess the performance of the combined seven prediction models, it is benchmarked against a random model which combines seven random models and uses the one with the lowest RMSE. The right histogram in Figure 4.6 shows the performance of the combined model against this benchmark. Again, the combined random models do not perform very badly, as the prediction model mostly has between 0% and 20% lower RMSE than the random model. However, oddly, sometimes the random model performs better than the prediction model, indicating that the prediction model is far from flawless. On average the prediction model performs 10% better than random guessing and 12% worse than the optimal model. The prediction model halves the RMSE of the seven random models and hence captures some dataset characteristics that are related to the optimal parameters.


Figure 4.6: RMSE of the combined model compared to the optimal model and random guessing

Therefore, the result of this paper shows that there is a connection between the dataset and its optimal parameters. However, using the parameter predictions leads to a trade-off between the speed and accuracy of the final tree ensemble. As explained in the methodology section, the optimization method mostly runs at least five times and each time it builds 100 tree ensembles. Hence, the combined model results in an algorithm that is roughly 20 times faster than the optimization method. However, the speed improvement comes at a cost: on average, the model based on parameter predictions leads to 12% higher RMSE, with a maximum of 20% in 90% of the cases.
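As a quick sanity check on the speed claim, using only the counts stated above (at least five optimization runs of 100 tree ensembles each, versus 28 ensembles for the combined model):

\[
\frac{5 \times 100}{28} \approx 18,
\]

which is consistent with the "roughly 20 times faster" figure.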

Taking a closer look at the performance of the combined model, it shows that every model sometimes performs best. However, it appears that the model of γ given λ performs best almost 30% of the time. Also, together, the conditional models perform best 75% of the time. This again shows that γ and λ interact. Next, a pattern is sought that explains which model performs best under which circumstances.

model                          number of times best
prediction                     10
default settings               11
prediction given gamma         19
prediction given lambda        35
log prediction                  7
log prediction given gamma     17
log prediction given lambda    18

Table 4.1: Number of times each model performs best

4.3.3 Comparison of the model performance

What Table 4.1 does not explain is which optimal values of γ and λ are best predicted by which model. Hence, Figure 4.7 shows the optimal parameters as dots, where the color indicates which model estimates the optimal parameters best. The top graph shows all optimal values and the bottom graph zooms in on the smaller optimal values.

There are four suggested prediction patterns. First of all, it becomes clear that the general model given λ and the logarithmic model given λ capture a linear relationship where lower values of λ require larger values of γ. Also, the logarithmic conditional model captures smaller optimal values than the simple conditional model. Secondly, the general model given γ and the logarithmic model given γ capture a relation where the optimal γ and λ are unrelated and only restrict λ to a certain range. Thirdly, the default model predicts best when the optimal model has λ around zero and γ lower than 100. This suggests that the combined model does not predict the optimal parameters well in this area. Lastly, the logarithmic prediction model gives the best results for values where γ is close to zero and λ is lower than ten.

These two graphs show that each model captures different relations between λ and γ. In most cases there is a linear relationship where high values of λ come with low values of γ. On the other hand, sometimes there appears to be no relation, and a value of λ is optimal no matter the optimal value of γ. This is also the case when either λ or γ is close to zero. The following chapter discusses the implications of this outcome, the limitations of the study and possible future directions for research in this area.


5. Discussion

This thesis tackles today's greatest constraint of machine learning: computational capacity. Most of the time, the hyperparameters of a machine learning model are optimized by building a model for each combination of the hyperparameters. The exponential nature of the hyperparameter grid results in computationally intensive machine learning algorithms. This paper tackles the computational intensity with a unique methodology. By using the power of machine learning, the study estimates the optimal parameters of the machine learning algorithm from characteristics of the dataset that is analyzed.

The main contribution of this study is the evidence that there is a relationship between the dataset and its optimal parameters. With a sample of 350 datasets, using the machine learning algorithm XGBoost, multiple models are created that predict two optimal regularization parameters. The results are promising, as in 90% of the cases the estimated model performs at most 20% worse than the optimal model. On the one hand, this implies a trade-off between accuracy and speed. On the other hand, this implies a trade-off between accuracy and interpretability: it also shows that there is indeed a complex and "mysterious" black-box function that explains how the dataset is related to its optimal parameters.

By estimating the relationship between the dataset and its optimal parameters, the XGBoost algorithm creates a highly complex model to predict the optimal parameters. By dissecting the model we discovered that characteristics of the dependent variable are important determinants for the optimal parameter γ, which punishes the addition of leaves. Also, characteristics of the dataset such as the training size and the percentage of missing observations are determinants for the optimal parameter λ, which punishes the weight of the leaves. We also discovered that the different models capture different patterns between the dataset and the optimal parameters. Still, the complexity of the model results in a relationship that is hardly interpretable, and only simple suggestions about feature importance and interrelations can be made.
