

The use of regression trees for prediction in misspecified linear models

Eric Dignum (10246312)

June 2014, University of Amsterdam

Abstract

This study compares regression-tree-based methods with linear regression based on their prediction performance, and finds that in some misspecified linear models boosting produces better out-of-sample forecasts than regression. However, the other methods (single regression trees, bagging and random forests) are less competitive and fail to outperform regression in all instances.


Contents

1 Introduction
2 Theoretical Framework
2.1 Growing Method
2.2 Ensemble Methods
2.3 More Theory
3 Experimental Setup
3.1 Simulation Methods
3.2 Model Tuning
3.3 Performance Measures
3.4 General Setup
4 Results
5 Conclusion
6 Further Research
References


1 Introduction

Nowadays computers are becoming increasingly important in our society, creating the opportunity to store huge amounts of data and leading to bigger and bigger data sets. If data sets are so large and complex that they cannot be handled by normal programs (such as Microsoft Excel), they are often called "Big Data".

This also affects the data sets used by econometricians, which are traditionally small compared to Big Data. According to Varian (2013, p. 1): "conventional statistical and econometric techniques such as regression often work well but there are issues unique to big data sets that may require different tools". These data sets may allow for more flexible and complex relationships, increasing the probability of misspecification. Self-learning algorithms from the area of machine learning, such as decision trees, neural nets and support vector machines, may be more effective here than (linear) regression.

A particularly interesting approach to analysing data is that of decision trees. This tree method partitions the feature space, based on a set of if-then statements, into smaller (more homogeneous) regions, with each region having its own prediction for observations falling in its space. If a tree is used to classify observations it is known as a classification tree; if the response is continuous it is called a regression tree.

There is a reasonable amount of research comparing various classification algorithms, including logistic regression and classification trees, such as Perlich, Provost & Simonoff (2003) and Caruana & Niculescu-Mizil (2006). But when it comes to regression trees there is almost no research comparing these algorithms with standard regression methods such as linear regression. Especially with upcoming ensemble methods like bagging, boosting and random forests, decision trees are gaining popularity in different areas. Ensemble methods are techniques that improve prediction accuracy by using multiple trees. These techniques show significant improvements in prediction accuracy compared to a single tree; boosting in particular is a promising algorithm.

This being said, it may be valuable to search for circumstances where regression trees might be a better tool to model relationships than ordinary regression. For example, in data sets with highly complex relationships regression trees might be more suitable.

In this study regression trees are compared to standard regression tools on prediction performance, with the use of simulated data. This allows for flexibility and a comprehensive comparison of the two approaches. Furthermore, the ensemble methods described above will be used to improve the accuracy of a single regression tree and will be included in the comparison.

After this brief introduction there is a theoretical overview of the different methods and techniques that will be used. Then the experimental setup is described, including how the data sets are simulated. Next the results of the comparisons are presented and discussed, followed by the conclusion and some suggestions for further study.


2 Theoretical Framework

In this section the theory behind regression trees, the "growing" method and ensemble methods is presented, together with an overview of the existing literature in this specific area. Although there are good books on machine learning that contain very good explanations of the techniques used here, this is a thesis from the econometric point of view, so readers are probably not familiar with these methods. A (basic) description of these algorithms is therefore given in this section.

Even though there are different approaches to growing trees, their basic outline is the same. Basic regression trees successively partition the data into smaller, more homogeneous regions. The point where a split is made is called a node, and where a split is made depends on the method used for constructing the tree; at every split each predictor (in regression: explanatory variable) is considered and the best one, according to the criterion used, is chosen. This splitting of the data (creating nodes) continues until some stopping criterion is reached, which again may differ depending on the growing method used. Often a tree that is too large for optimal out-of-sample prediction is grown, because of over-fitting (more on this in the next section). For this reason trees are "pruned": insignificant splits are discarded in order to create a smaller tree, which mostly results in better forecasts.

When the tree is constructed, the goal is to make predictions, which are made in the nodes at the end of the tree (terminal nodes). These predictions can be the simple average of the training data falling in the same region as the observation, but other, more complex prediction equations exist.

2.1 Growing Method

The oldest and most widely implemented method for growing regression trees, and the one used in this study, is CART (Classification And Regression Trees), first introduced by Breiman et al. (1984). For regression this approach begins with the entire data set and searches every value of every predictor to find the split that divides the data into two groups, S1 and S2, such that the sum of squared errors (SSE) is minimised. After this, the same process continues for each of the two groups created by the previous split. This procedure is called binary recursive partitioning.

SSE = Σ_{i∈S1} (yi − ȳ1)² + Σ_{i∈S2} (yi − ȳ2)²

where ȳ1 and ȳ2 are the mean responses in the two groups.

The splitting stops when the number of data points left in a region falls below a preset threshold, when a predetermined number of nodes is reached, or when the decrease in the SSE falls below a preset limit. In the CART growing methodology this preset limit is called the complexity parameter and is often a small value. These criteria can be used at the same time or on their own, depending on the preference of the user. Furthermore, the prediction equation is simply the average of the training data falling in that particular region.
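The split search just described can be sketched in a few lines. The thesis itself uses R (rpart); the Python fragment below is only an illustration of the idea, with made-up data and function names.

```python
def sse(ys):
    """Sum of squared deviations from the group mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(X, y):
    """Exhaustively search every value of every predictor for the
    split that minimises the combined SSE of the two groups."""
    best = None  # (combined SSE, predictor index, threshold)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # a split must create two non-empty groups
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, j, t)
    return best

X = [[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]]
y = [10.0, 11.0, 20.0, 21.0]
print(best_split(X, y))  # (1.0, 0, 2.0): split on predictor 0 at 2.0
```

In a full CART implementation this search is then applied recursively to each of the two resulting groups.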

Figure 1: An illustration of how a regression tree could divide the feature space. Here the log(salary) of a major league baseball player is explained with use of the number of hits in the previous year and the number of years the player is present in the major league. Hastie et al. (2013, p. 305)

2.2 Ensemble Methods

A major problem with single trees is that they have high variance. For example, a small change in the data could lead to a different split somewhere high up in the tree. This "error" affects all the splits below it, which may lead to an entirely different series of splits; a single tree therefore suffers from high variance. The ensemble methods discussed here try to tackle this problem by producing multiple trees and taking a simple or weighted average to reduce variance.

Figure 2: The corresponding regression tree with the averages in the terminal nodes (go left in a node if the statement is true). For example, if a player has played 5 years in the major league and has 150 hits, this model predicts a salary of e^6.74 (a log(salary) of 6.74). Hastie et al. (2013, p. 304)

Bagging is such a procedure: it produces multiple trees using bootstrapping. Bootstrapping involves drawing multiple samples (with replacement!) of size n from a data set of total size n, so one observation can appear in a sample more than once. For each bootstrapped sample a tree is grown, until each sample has a corresponding tree. In the case of regression trees, the prediction for an observation is the simple average of the values the separate trees predict. Because every bootstrapped sample has a different composition, each tree is grown on a "different" data set, ensuring that the variance decreases compared to just one single tree.

A problem with bagging is that it evaluates every predictor at every considered split. The potential harm in this is that although "different" data sets (bootstrapped samples) are used to construct the trees, the series of splits in the trees can look very similar. This is known as tree correlation and is caused by the underlying relationship in the data. While bagging reduces some of the variance of the predicted values, tree correlation prevents an even bigger reduction; there are, however, techniques that can reduce it further.


The random forests algorithm tries to de-correlate the trees by taking a random sample of m predictors out of the full set of p predictors at every considered split. This forces the algorithm to make different splits compared to other trees, resulting in less tree correlation. The rest of the procedure is the same as in bagging, with bagging being a special case of a random forest with m equal to p.
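A minimal sketch of bagging and the random-subset idea, using depth-one trees for brevity (the thesis uses the randomForest package; everything below, including the toy data, is purely illustrative):

```python
import random

def _sse(ys):
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)

def fit_stump(X, y, m):
    """Depth-one regression tree that, as in random forests, considers
    only a random sample of m of the p predictors at the split."""
    p = len(X[0])
    best = None
    for j in random.sample(range(p), m):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = _sse(left) + _sse(right)
            if best is None or score < best[0]:
                best = (score, j, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # degenerate bootstrap sample: fall back to the mean
        mean = sum(y) / len(y)
        best = (0.0, 0, float("inf"), mean, mean)
    return best[1:]  # (predictor, threshold, left mean, right mean)

def random_forest(X, y, n_trees, m):
    """Grow each tree on a bootstrap sample (drawn WITH replacement);
    m = p reduces the procedure to plain bagging."""
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], m))
    return forest

def predict(forest, row):
    """Ensemble prediction: the simple average over all trees."""
    preds = [lm if row[j] <= t else rm for j, t, lm, rm in forest]
    return sum(preds) / len(preds)

random.seed(0)
X = [[float(i)] for i in range(8)]
y = [0.0] * 4 + [10.0] * 4          # a step function in one predictor
forest = random_forest(X, y, n_trees=25, m=1)
low, high = predict(forest, [0.0]), predict(forest, [7.0])
```

The averaging over bootstrap samples is the bagging step; restricting each split to m randomly chosen predictors is the only addition random forests make.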

Another, quite different, approach is the boosting algorithm. There are multiple implementations of this technique, but their basic idea for regression is the same and is explained first. Boosting constructs, just as bagging and random forests do, an ensemble of trees, but the trees are grown sequentially: each tree uses information from the previously grown trees. This information consists of the residuals: the differences between the actual values and the predictions of the previous trees. The new tree is grown to fit these residuals, and its predictions are added to the predictions of the previously grown trees. This way the algorithm "learns" gradually.

In regression the (Stochastic) Gradient Boosting algorithm developed by Friedman (2002) is the most popular, and it is also used in this study. The procedure in the scheme below is taken from Kuhn & Johnson (2013) and is referred to as Simple Gradient Boosting.

1. Select tree depth (number of successive nodes), D, and number of iterations (trees), K.

2. Compute the average response, ¯y, and use this as the initial predicted value for each sample.

3. For k = 1, 2,..., K:

(a) Compute the residual, the difference between the observed value and the current predicted value, for each sample.

(b) Fit a regression tree of depth D, using the residuals as the response.

(c) Predict each sample using the regression tree fit in the previous step.

(d) Update the predicted value of each sample by adding the previous iteration's predicted value to the predicted value generated in the previous step.
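The steps above can be sketched directly, with depth-one trees on a single predictor for brevity and a shrinkage (learning-rate) parameter lam already included; this is an illustration of the idea, not the gbm implementation, and all names and data are made up.

```python
def fit_stump(x, y):
    """Depth-one regression tree on one predictor: choose the threshold
    that minimises the sum of squared errors."""
    best = None
    for t in sorted(set(x))[:-1]:          # exclude the max so both sides are non-empty
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        s = sum((yi - lm) ** 2 for yi in left) + sum((yi - rm) ** 2 for yi in right)
        if best is None or s < best[0]:
            best = (s, t, lm, rm)
    return best[1:]                        # (threshold, left mean, right mean)

def boost(x, y, K, lam):
    """Simple gradient boosting for squared error: start from the mean
    response, then repeatedly fit a stump to the residuals and add a
    fraction lam (the learning rate) of its predictions."""
    pred = [sum(y) / len(y)] * len(x)                        # step 2
    model = []
    for _ in range(K):                                       # step 3
        resid = [yi - pi for yi, pi in zip(y, pred)]         # (a) residuals
        t, lm, rm = fit_stump(x, resid)                      # (b) fit tree to residuals
        update = [lm if xi <= t else rm for xi in x]         # (c) predict
        pred = [pi + lam * ui for pi, ui in zip(pred, update)]  # (d) shrunken update
        model.append((t, lm, rm))
    return pred, model

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]
pred, model = boost(x, y, K=200, lam=0.05)  # pred converges towards y
```

With lam = 1 this reduces to step (d) as written in the scheme; the fraction lam is exactly the shrinkage discussed next.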


Next to the tuning parameters D and K, another tuning parameter, λ, also known as the "learning rate", needs to be set. Lambda is a shrinkage parameter that is set to avoid over-fitting the data. Over-fitting occurs when a very large tree fits the training data very well but gives poor out-of-sample predictions, while good out-of-sample prediction is exactly the objective. In step (d) only a fraction λ of the current predicted value is added to the predicted value of the previous iterations, which forces the algorithm to learn slowly. Small values of the parameter (≤ 0.05) often seem to work best. However, this increases computation time, because more iterations are needed to create a good model.

Stochastic Gradient Boosting differs slightly from Simple Gradient Boosting and uses a random sampling scheme to reduce prediction variance. For each iteration a random sample is drawn from the training data, and the tree is constructed on that sample. The fraction of the data used is known as the bagging fraction; if used at all, Kuhn & Johnson (2013, p. 206) suggest a fraction of 0.5.

There is quite some research comparing these ensemble methods, albeit with classification trees instead of regression trees. For example, Maclin & Opitz (1999) and Bauer & Kohavi (1999) find that boosted classification trees are among the best performing classification algorithms known today, although for some data sets boosting leads to lower performance (which is, according to them, due to a relatively high level of "noise" (irrelevant data)). Bagging, however, produces steady increases in performance for every data set, making it a more reliable method.

2.3 More Theory

As brought up before, there is not much research on the performance of regression trees compared to standard regression techniques. However, there is research comparing classification trees with logistic regression and other classification algorithms. Although classification trees are not the same as regression trees, they have the same structure and growing process, so it may be meaningful to look at comparisons of classification algorithms.


Caruana & Niculescu-Mizil (2006) compare a number of supervised learning algorithms by means of eleven binary classification problems. They find that boosting is the best learning algorithm, closely followed by random forests and bagging. Logistic regression belongs to the poorly performing algorithms, together with single decision trees and boosted stumps (a maximum tree depth of one in every iteration).

Perlich, Provost & Simonoff (2003) compare the performance of (single) trees and (ridge) logistic regression on several binary classification problems from different domains. They find that trees tend to perform better for large data sets and logistic regression for small ones, even for data sets from the same domain. Furthermore, trees tend to perform well when signal separability is high (the data set is relatively easy to divide into classes) and logistic regression performs well when separability is low.

A fundamental difference between classification trees and regression trees is that the regression variant predicts a continuous response, whereas the classification variant only needs to assign observations to classes. This could influence the performance of boosting, bagging and decision trees in general, so good performance in the classification setting gives no guarantees for the regression variant.

An obvious advantage of learning algorithms in general is that they try to specify the underlying relationship themselves, whereas in standard regression the modeller needs to spot the relevant patterns in the data. This human intervention could lead to a misspecified model, resulting in biased or inefficient estimates.

On the other hand, linear regression would surely outperform a single regression tree if the true relationship is indeed linear and the model is correctly specified, but this is certainly not always the case. As Varian (2013, p. 1-2) says: "larger data sets may allow for more flexible and complex relationships", which may create a need for different regression techniques. With promising ensemble methods such as bagging, random forests and especially boosting, decision trees are an up-and-coming learning algorithm and may outperform standard regression methods in certain areas.


3 Experimental Setup

In this study the open-source program R is used. The advantage of R is that anyone can create packages and upload them to be used by others. This can also be seen as a problem, because there is no guarantee that the code provided in a package is well implemented. To account for this, only packages that appear in studies or in books on machine learning are used.

In Hastie et al. (2013) the tree package is used to construct classification and regression trees with the CART growing method. Bagging and random forests are created with the randomForest package and boosting with the gbm package. The book by Kuhn & Johnson (2013) uses the same packages for bagging, random forests and boosting as Hastie et al. (2013), but for the CART implementation they use rpart. To train all these models and select the best parameters using k-fold cross validation, the train function in the caret package is used.

Most of these packages have been available for quite some time now, making it reasonable to assume that potential problems are either fixed or not important. The fact that they are used in books and studies is also reassuring. Given this, rpart, randomForest, gbm and caret are used to carry out this empirical study on regression trees and standard regression techniques, with the standard methods already implemented in R. The code used for the simulation of the models and the comparison of the various algorithms can be requested, so that the code can be checked for errors and the results validated.

3.1 Simulation Methods

The advantage of simulation is that the data generating process is known; in practice it is not. So when a model is created based on real-life data, it could end up misspecified, for example by not modelling a non-linearity, by adding an unimportant term or by leaving a relevant variable out. This can lead to biased estimates or coefficients with high variance.


In every simulation the coefficients of the data generating process are manually set. Because of this, the conclusions made in this thesis are formally applicable only to these specific models. However, the values of the coefficients have no significant influence on the algorithms, so the conclusions can be made more general.

The values of the predictors are drawn from a multivariate normal distribution with mean zero (a vector) and a variance-covariance matrix with all ones on the diagonal and 0.2 elsewhere (except when stated otherwise). This is to avoid high correlations between predictors, which can lead to multicollinearity, resulting in high standard errors of the estimated coefficients, and which can worsen any bias already present.

For each simulated model a normally distributed noise term ε with mean zero and variance one is incorporated, which is not very realistic but good enough to compare the performance of the different techniques.
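The thesis does not show how these draws are implemented; one standard construction for an equicorrelated covariance matrix (unit variances, 0.2 off-diagonal) is a shared common factor. The sketch below uses illustrative names and, as an example, the six-predictor coefficients of data generating process (1).

```python
import math
import random

def draw_predictors(p, rho):
    """One draw from N(0, Sigma), where Sigma has unit variances and
    correlation rho between every pair of predictors: a shared factor
    with weight sqrt(rho) plus an idiosyncratic factor with weight
    sqrt(1 - rho) reproduces exactly that covariance matrix."""
    common = random.gauss(0.0, 1.0)
    return [math.sqrt(rho) * common + math.sqrt(1.0 - rho) * random.gauss(0.0, 1.0)
            for _ in range(p)]

def simulate(n, intercept, beta, rho=0.2, sigma=1.0):
    """Simulate n observations from y = intercept + x'beta + eps,
    with eps ~ N(0, sigma^2)."""
    X, y = [], []
    for _ in range(n):
        x = draw_predictors(len(beta), rho)
        X.append(x)
        y.append(intercept + sum(b * xi for b, xi in zip(beta, x))
                 + random.gauss(0.0, sigma))
    return X, y

# the six-predictor version of data generating process (1)
X, y = simulate(3000, intercept=2.5, beta=[5, 0.5, -1, -2, 0.25, 0.75])
```

The common-factor trick avoids a full Cholesky decomposition, which would be the general-purpose alternative for arbitrary covariance matrices.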

In the simulations where a rightly specified model is used, the data is checked by means of statistical tests to affirm that the assumptions of linear regression hold. The tests used are the Breusch-Pagan test for heteroskedasticity, the Breusch-Godfrey test for serial correlation, the Ramsey RESET test for functional form (a general misspecification test) and the Jarque-Bera test for normality of the errors. Only if all these tests have a p-value greater than 0.05, which implies that the null hypothesis is not rejected (or a type I error is made), is the data set used. A more technical overview of the statistical tests used is given in Heij et al. (2004).

The first data generating process is simply data with a linear relationship, with no dummy variables, interaction effects or higher order terms. This is done to see how regression trees and ensemble methods compare to standard regression in a setting where linear regression is the Best Linear Unbiased Estimator (BLUE), while the other techniques are non-linear methods. To avoid confusion: each data generating process can provide data for multiple simulations, which in their turn provide the presented results.

This process is used to investigate the effect of the training set size (keeping the validation set at a fixed size) on all techniques, and the influence of a growing set of predictors (explanatory variables). For the training set, sizes of 1000, 2000, 3000, 4000 and 5000 observations are used to train the models, keeping the validation set at a fixed size of 1000 and the number of predictors constant at 6. This simulation is used to find a suitable sample size for the remaining simulations, keeping computation time and performance in mind.

As a second simulation the number of predictors is varied, because dependent variables in real-life problems are often influenced by many predictors. The self-learning algorithms should be able to model relationships with a considerable number of variables. Sets of 6, 8, 10, 12 and 14 predictors are used to test the effect of this increasing number. For each set the outcome variable y is determined by exactly that number of regressors, otherwise the model would be misspecified. Although the only process displayed below is the one with 14 predictors, the other processes are generated with the first q variables.

Data generating process (1):

y = 2.5 + 5x1 + 0.5x2 − x3 − 2x4 + 0.25x5 + 0.75x6 + 0.3x7 + 2.5x8 − 0.7x9 + 0.9x10 − 0.8x11 + 0.63x12 + x13 − x14 + ε

Estimated model:

y = c + b1x1 + b2x2 + ... + bqxq + e

for q = 6, 8, ..., 14.

The second process includes an interaction effect (a combined effect of two variables: the impact of one variable depends on the level of the other) and a quadratic term (a quadratic relationship between the predictor concerned and y) to see how the algorithms handle these extensions of the normal model. Linear regression should show no significant differences, because the model is still linear.

Data generating process (2):

y = 2.5 + 5x1 + 0.5x2 − x3 − 2x4 + 0.25x5 + 0.75x6 + 0.2x1x2 − 0.5x3^2 + ε

Estimated model:


The next data generating process is the same as the first, but additional simulated terms which have nothing to do with the data generating process ("noise") are added to the regression model. If these terms are incorporated in the regression model, the variance of the estimates increases, resulting in an inefficient estimator and a bigger mean squared error (MSE).

The third simulation first includes one irrelevant term in the regression model, then two irrelevant regressors and finally three. The reason that no more than three redundant variables are used is that more would be unrealistic, since it would lead to an obviously misspecified model (read: statistical tests or theory would deem regressors insignificant if this is not already the case with fewer irrelevant regressors).

Data generating process (3):

y = 2.5 + 5x1 + 0.5x2 − x3 − 2x4 + 0.25x5 + 0.75x6 + ε

Estimated model:

y = c + b1x1 + b2x2 + ... + bqxq + e

for q = 7, 8, 9.

The fourth simulation leaves one relevant regressor out. Only one is needed to estimate the effect of this on the methods, because leaving out more would only worsen the bias. The value of the coefficient is varied, because this effect "runs" through the other variables; a greater absolute value has a greater influence. The value ranges from 0.1 to 0.6 with steps of 0.1.

The next simulation raises the correlation between all predictors to 0.4, 0.6 and 0.8 to create a multicollinearity problem. If there is no bias present this can lead to high standard errors of the coefficients, but it does not influence the prediction accuracy of linear regression as long as the model is rightly specified. However, a high correlation between predictors may influence the tree-based methods. Both of these simulations are performed with data simulated from the first process.


The last data generating process includes a (subtle) non-linearity, because characteristics of real-life data are certainly not always linear. Next to the trees and ensemble methods, this model is estimated by linear regression, which is then a misspecified model. In the process x2 is incremented by 5 to ensure the probability of negative values is almost zero, because these may lead to strange results.

Data generating process (4):

y = 2.5 + 5x1 + (x2 + 5)^α − x3 − 2x4 + 0.25x5 + 0.75x6 + ε

Estimated model:

y = c + b1x1 + b2x2 + ... + b6x6 + e

for α = 1.1, 1.2, ..., 1.8.

3.2 Model Tuning

The different learning algorithms all have parameters which need to be "tuned" to achieve the optimal prediction accuracy or structure. For example, the number of subsequent nodes (tree depth) in a single tree is a tuning parameter. Too many nodes might over-fit the data, but too few might miss the important characteristics of the data. It is therefore important to find the best parameters. A way to detect which value of a parameter is best without simulating additional data, which is often very hard in the real world, is k-fold cross validation. This procedure splits the data set into k roughly equal sized subsets and consists of k iterations. For each iteration one subset is used as validation set and the rest is used to train the model. After this, the model and the corresponding parameters with the lowest MSE are chosen. The procedure for k-fold cross validation in this study is taken from Hastie et al. (2013, p. 181-182) and is as follows:

1. Split the data into k (roughly) equal sized subsets.

2. For j = 1, 2, ..., k:

(a) Set the tuning parameter.

(b) Train the model on the k − 1 subsets other than subset j.

(c) Compute the MSE for the validation set (subset j).

3. Choose the model and parameter value with the lowest MSE.
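The procedure can be sketched generically; here a one-dimensional k-nearest-neighbour predictor stands in for the tree-based models, and all names and data are illustrative only.

```python
def knn_predict(train, x, k):
    """Average response of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(resp for _, resp in nearest) / k

def cross_validate(data, k_folds, params, predictor):
    """k-fold cross validation: for every candidate tuning parameter,
    average the validation MSE over the k folds and return the
    parameter with the lowest average MSE."""
    folds = [data[i::k_folds] for i in range(k_folds)]   # k roughly equal subsets
    scores = {}
    for param in params:                                 # (a) set the tuning parameter
        mse = 0.0
        for j in range(k_folds):                         # fold j is the validation set
            train = [pt for i, f in enumerate(folds) if i != j for pt in f]
            mse += sum((resp - predictor(train, x, param)) ** 2
                       for x, resp in folds[j]) / len(folds[j])   # (c) validation MSE
        scores[param] = mse / k_folds
    return min(scores, key=scores.get), scores           # lowest average MSE wins

data = [(i / 10, (i / 10) ** 2) for i in range(50)]      # y = x^2, no noise
best_k, scores = cross_validate(data, 5, params=[1, 3, 25], predictor=knn_predict)
```

Over-smoothing (here, a very large neighbourhood) shows up directly as a higher cross-validated MSE, which is exactly how the procedure detects over- and under-fitting tuning values.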

Picking an appropriate value for k leads to a bias-variance trade-off. A high value of k results in large, strongly overlapping training sets: the large training sets give low bias, but they look very similar, leading to nearly identical (correlated) models across folds, and the small validation sets lead to high variance in the different MSEs. Moreover, if k is big, many iterations must be performed, which for a slow algorithm leads to long computation times. A small value of k results in few iterations, but its relatively small training sets can lead to a biased tree. Hastie et al. (2013, p. 181) and Kuhn & Johnson (2013, p. 70) suggest that with k = 5 or k = 10 the MSE suffers neither from excessively high bias nor from high variance.

For the growing method, ten-fold cross validation determines whether trees need to be pruned or not, without losing the important aspects of the underlying relationship. This is done by selecting an appropriate value for the cost-complexity parameter. Next to tree size, the approaches that produce multiple trees have another tuning parameter, namely ensemble size (the number of trees). The ensemble needs to be big enough that more trees do not lead to a significant reduction in the MSE; in other words, the MSE needs to be relatively constant after a while. This parameter is estimated by trial and error: the moment the MSE stays relatively constant marks the appropriate size.

To create random forests, the number of predictors evaluated at a split (m) needs to be set. In Kuhn & Johnson (2013) a value of m = p/3 is found to work very well, and according to them tuning this parameter does not lead to a significant reduction in the errors, so this value is also used here, plus another four values evenly spaced between one and the total number of predictors.

The boosting algorithm has two additional parameters: the learning rate, λ, and the bagging fraction. Some believe that the power of boosting is partly due to its slow learning approach; the learning rate determines how fast boosting learns. Small values of λ are typically used, usually smaller than 0.05 and sometimes as small as 0.001. Note that small values of the learning rate require more iterations, as explained in the previous section. For the bagging fraction, 1 (no randomness) is used due to limitations of the train function. For λ, the values 0.05 and 0.01 are used, which might be a bit short-sighted, but more values would increase computation time significantly.

An important note in this section is that only a subset of values for the tuning parameters is searched for the optimal ones. This is due to computation constraints: multiple cross validations involve a great amount of computation, so with a k-fold cross validation a limited number of models (with different tuning parameters) is tested and the best one among these is the "optimal" one. Although the actual optimal values are not likely to be found, this procedure still gives a considerable increase in performance.

3.3 Performance Measures

Traditionally, econometricians look at measures such as the MSE and R² to evaluate model performance. However, with the growing size of data sets these in-sample measures of fit, which often overestimate the true performance, become less interesting when out-of-sample data (not used to train the model) is available. The advantage of simulation is that the real relationship between the variables is known and the true MSE can be estimated by simulating extra data to evaluate the out-of-sample performance of all models (including the standard techniques).

MSE = (1/|V|) Σ_{i∈V} (yi − ŷi)²,   where V is the validation set and ŷi the predicted value.

To obtain a good estimate of the true MSE, the created model is validated on extra data (consisting of a thousand "new" observations), and the simple average of the various MSEs gives a reasonable indication of the performance of the different algorithms.


3.4 General Setup

Each of the simulated data sets will be estimated by a (misspecified) standard regression model. This technique is then compared with a tuned single regression tree, the bagged and random forest variants (with their tuned parameters) and the boosting algorithm with its optimal values. These methods are constructed on a data set containing a number of observations that is determined in the first simulation. The comparisons are based on out-of-sample performance by means of the mean squared error on a validation set of size 1000. This process is repeated a hundred times, and a simple average of all MSEs is taken to obtain a good estimate of the true MSE.
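End to end, the setup amounts to the loop sketched below. For brevity the sketch uses a correctly specified one-predictor model, a closed-form OLS fit and 20 instead of 100 iterations; it only illustrates how the repeated validation MSEs are averaged, and is not the thesis code.

```python
import random

def ols_1d(xs, ys):
    """Closed-form simple regression y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def one_iteration(n_train=3000, n_valid=1000):
    """Simulate fresh training and validation data from
    y = 2.5 + 5x + eps, fit on the training set and return the
    out-of-sample MSE on the validation set."""
    def draw(n):
        xs = [random.gauss(0.0, 1.0) for _ in range(n)]
        ys = [2.5 + 5.0 * x + random.gauss(0.0, 1.0) for x in xs]
        return xs, ys
    xtr, ytr = draw(n_train)
    xv, yv = draw(n_valid)
    a, b = ols_1d(xtr, ytr)
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xv, yv)) / n_valid

random.seed(42)
mses = [one_iteration() for _ in range(20)]
estimate = sum(mses) / len(mses)   # close to Var(eps) = 1 for a correct model
```

For a correctly specified model the averaged validation MSE converges to the noise variance, which is the benchmark the result tables below are read against.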


4 Results

In this section the most important results are presented and discussed. Because the data is simulated and the real values are already known, there is no need for a thorough and extensive analysis of the data. Also, all the techniques used (except normal regression) are self-learning, so they do not require much input from the modeller (only the tuning of the parameters). For this reason only tables showing all results of the simulations and the corresponding models are presented.

For the tables, one should know that CART stands for a single tree grown with the CART methodology, RF denotes the random forests algorithm and DGP stands for the data generating process(es) used for the particular simulations in the table.

Varying sample size

           Regression  CART   Bagging  RF     Boosting
1000 obs   1.010*      3.609  1.932    1.928  1.251
2000 obs   0.988*      3.151  1.705    1.698  1.143
3000 obs   1.010*      2.942  1.589    1.554  1.105
4000 obs   1.011*      2.810  1.551    1.520  1.103
5000 obs   1.008*      2.676  1.471    1.452  1.080

Table 1: Number of predictors: 6, Validation size: 1000, Iterations: 100, DGP: y = 2.5 + 5x1 + 0.5x2 − x3 − 2x4 + 0.25x5 + 0.75x6 + ε.

With a training set of 1000 observations the MSE of regression already comes very close to the true value of 1 (the variance of ε), so a bigger training set will not lead to significant performance increases. The other techniques do benefit from an increasing size, and boosting comes quite close to the performance of regression, albeit with a training set of 5000. Furthermore, the performance of a single tree is poor, but bagging and random forests do a good job of reducing the variance associated with one tree. Random forests does slightly better than bagging in all cases, but not by enough to say that there is evidence of tree correlation (which random forests should partially eliminate if present).

For the rest of the simulations a training set of 3000 observations and a validation set of 1000 will be used. A bigger sample leads to more computation time (several hours), while a smaller one results in a significant drop in performance. For the ensemble size of bagging, boosting and random forests, 2000 trees are chosen. More trees sometimes lead to significant reductions in MSE, but also to more computation time. For this reason 2000 is reasonable and gives a good reflection of the performance.

Varying number of predictors

                Regression   CART    Bagging   RF      Boosting
6 predictors    1.006*       4.976   2.624     2.575   1.189
8 predictors    1.006*       4.976   2.335     2.311   1.183
10 predictors   0.983*       6.205   2.954     2.925   1.243
12 predictors   1.008*       7.691   3.645     3.572   1.315
14 predictors   1.033*       9.951   4.753     4.670   1.398

Table 2: Training size: 3000, Validation size: 1000, Iterations: 100, DGP: different for every row; see simulation methods section.

With an increasing number of predictors, linear regression is again an excellent performer (as expected) and shows no efficiency loss as the number of explanatory variables increases. The tree-based methods, however, show a significant increase in MSE. This increase is also present if only the function rpart is used (without the train function), as well as with other implementations of the CART methodology from different packages, which makes an error in the functions or in the simulation itself less likely. One thing single trees, bagging and random forests have in common is a maximum tree depth of thirty nodes (due to 32-bit programs); it could be that this is simply not enough to model the underlying relationship in the data, resulting in an incomplete model. But boosting also shows significant (albeit less dramatic) increases in MSE, and it grows its subsequent trees with a manually set maximum depth of nine nodes. Raising this depth could be a solution, enabling boosting to better capture the relationship in the data. Another possible improvement for boosting is a bagging fraction below 1, which forces the algorithm to fit each tree on a different random subsample of the training data; the currently used fraction of 1 could lead to similar trees in every iteration and thus lower performance. When implemented, however, these changes in the parameters of the boosting procedure brought no improvement, leaving the issue unresolved.
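The re-tuning attempted here can be illustrated with a hypothetical scikit-learn configuration (the thesis used R's gbm, where the bagging fraction is called bag.fraction; scikit-learn calls the equivalent parameter subsample):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = 2.5 + 5 * X[:, 0] + rng.normal(size=500)

# Deeper trees than the depth of nine used above, combined with a bagging
# fraction below 1 so each boosting iteration sees a different random half
# of the training rows.
gbm = GradientBoostingRegressor(max_depth=12, subsample=0.5,
                                n_estimators=100, random_state=1)
gbm.fit(X, y)
```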

Another thing to notice: with 6 predictors and a training set of 3000 observations, a single tree now has an MSE of 4.976, while in the previous simulation it was 2.942. This strengthens the statement that trees have high variance and shows the need for ensemble methods to reduce it.
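The variance argument can be made concrete with a small, hypothetical experiment (Python/scikit-learn sketch; the thesis used R): refit a single tree and a bagged ensemble on many independent training sets and compare how much their predictions at fixed test points fluctuate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(2)
BETA = np.array([5, 0.5, -1, -2, 0.25, 0.75])

def draw(n):
    X = rng.normal(size=(n, 6))
    return X, 2.5 + X @ BETA + rng.normal(size=n)

X_test = rng.normal(size=(200, 6))       # fixed evaluation points
tree_preds, bag_preds = [], []
for _ in range(20):                      # 20 independent training sets
    X, y = draw(1000)
    tree_preds.append(DecisionTreeRegressor().fit(X, y).predict(X_test))
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50)
    bag_preds.append(bag.fit(X, y).predict(X_test))

# Variance over training sets, averaged over the test points
tree_var = np.var(tree_preds, axis=0).mean()
bag_var = np.var(bag_preds, axis=0).mean()
print(tree_var, bag_var)
```

Averaging over bootstrap trees leaves the bagged predictor with a markedly smaller prediction variance than the single tree.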

Included interaction and quadratic term

                Regression   CART    Bagging   RF      Boosting
complex model   1.006*       2.828   1.571     1.557   1.168

Table 3: Training size: 3000, Validation size: 1000, Iterations: 100, DGP: y = 2.5 + 5x_1 + 0.5x_2 − x_3 − 2x_4 + 0.25x_5 + 0.75x_6 + 0.2x_1x_2 − 0.5x_3^2 + ε.

With the extra non-linear terms the tree methods seem to perform just a bit worse, though this could also be due to the limited number of iterations of the simulation. The ranking stays the same, with linear regression up front, followed by boosting, random forests, bagging and single trees, in that order.

Redundant variable(s)

                        Regression   CART    Bagging   RF      Boosting
1 redundant variable    1.020*       2.990   1.604     1.581   1.133
2 redundant variables   0.996*       2.939   1.630     1.592   1.104
3 redundant variables   1.008*       3.045   1.690     1.664   1.125

Table 4: Training size: 3000, Validation size: 1000, Iterations: 100, DGP: y = 2.5 + 5x_1 + 0.5x_2 − x_3 − 2x_4 + 0.25x_5 + 0.75x_6 + ε.

All methods seem relatively unaffected by the growing number of noise variables (Table 4). The tree-based algorithms, except boosting, show a slight increase in MSE as the amount of noise increases, but overall they are able to separate the important variables from the rest. For boosting this is contrary to what Bauer & Kohavi (1999) and Maclin & Opitz (1999) report about its sensitivity to noise. For regression the coefficients may have higher standard errors, but the MSEs barely change when redundant variables are added. This is probably due to the low correlation between the predictors and the noise variables, which determines the variance of the estimated coefficients.

Omitted variable

           Regression   CART    Bagging   RF      Boosting
β6 = 0.1   0.995*       2.317   1.389     1.379   1.066
β6 = 0.2   1.033*       2.459   1.461     1.455   1.101
β6 = 0.3   1.068*       2.525   1.461     1.447   1.079
β6 = 0.4   1.170        2.673   1.600     1.509   1.111*
β6 = 0.5   1.205        2.713   1.524     1.503   1.095*
β6 = 0.6   1.319        2.750   1.524     1.501   1.106*

Table 5: Training size: 3000, Validation size: 1000, Iterations: 100, DGP: y = 2.5 + 5x_1 + 0.5x_2 − x_3 − 2x_4 + 0.25x_5 + β_6x_6 + ε.

When a bias is introduced (Table 5), the performance of regression deteriorates. Boosting even surpasses linear regression once the coefficient of the omitted variable is 0.4 or higher. Single trees, bagging and random forests cannot take advantage of this and stay behind.
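The omitted-variable mechanism can be sketched as follows (a hypothetical Python/scikit-learn analogue, assuming, as the results suggest, that the regression model leaves x_6 out while the tree methods receive all six predictors):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)

def draw(n, b6):
    X = rng.normal(size=(n, 6))
    beta = np.array([5, 0.5, -1, -2, 0.25, b6])
    return X, 2.5 + X @ beta + rng.normal(size=n)

results = {}
for b6 in (0.1, 0.6):
    X_tr, y_tr = draw(3000, b6)
    X_val, y_val = draw(1000, b6)
    # Misspecified regression: x6 is deliberately left out of the design
    reg = LinearRegression().fit(X_tr[:, :5], y_tr)
    mse_reg = mean_squared_error(y_val, reg.predict(X_val[:, :5]))
    # Boosting receives all six predictors and can pick up x6 on its own
    gbm = GradientBoostingRegressor(n_estimators=200, random_state=3)
    gbm.fit(X_tr, y_tr)
    mse_gbm = mean_squared_error(y_val, gbm.predict(X_val))
    results[b6] = (mse_reg, mse_gbm)
```

Since x_6 is drawn independently of the other predictors, omitting it inflates the regression MSE by roughly β_6^2, which is why the gap only becomes decisive for larger coefficients.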

Correlation between predictors

                    Regression   CART    Bagging   RF      Boosting
Cor(x_i, x_j) = 0.4 1.033*       2.641   1.606     1.593   1.158
Cor(x_i, x_j) = 0.6 1.000*       2.254   1.424     1.414   1.126
Cor(x_i, x_j) = 0.8 1.015*       1.866   1.291     1.286   1.145

Table 6: Training size: 3000, Validation size: 1000, Iterations: 100, DGP: y = 2.5 + 5x_1 + 0.5x_2 − x_3 − 2x_4 + 0.25x_5 + 0.75x_6 + ε.

As stated before, the prediction accuracy of linear regression is not influenced by multicollinearity, and boosting also shows little or no increase in MSE. The remaining methods, however, show a significant decrease in MSE when the correlation between all predictors rises from 0.4 to 0.8. An explanation could be that whatever split is made, it is more beneficial for modelling the outcome y, because the predictors are highly correlated with one another.
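One common way to generate such equicorrelated predictors is a multivariate normal draw with an equicorrelation covariance matrix; whether the thesis used exactly this construction is an assumption, but the sketch below shows the idea.

```python
import numpy as np

rho = 0.8                                   # target pairwise correlation
cov = np.full((6, 6), rho)                  # off-diagonal entries = rho
np.fill_diagonal(cov, 1.0)                  # unit variances on the diagonal
rng = np.random.default_rng(4)
X = rng.multivariate_normal(np.zeros(6), cov, size=3000)

emp = np.corrcoef(X, rowvar=False)          # empirical correlation matrix
print(emp[0, 1])
```

With 3000 observations the empirical pairwise correlations land very close to the target ρ.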

Included non-linearity

          Regression   CART    Bagging   RF      Boosting
α = 1.1   1.022*       3.618   1.836     1.777   1.147
α = 1.2   0.998*       3.815   1.834     1.804   1.118
α = 1.3   1.002*       4.282   1.940     1.895   1.148
α = 1.4   1.010*       4.475   2.058     2.008   1.140
α = 1.5   1.025*       4.899   2.224     2.189   1.145
α = 1.6   1.095*       5.244   2.313     2.285   1.175
α = 1.7   1.232*       5.606   2.466     2.434   1.232*
α = 1.8   1.540        6.079   2.649     2.627   1.325*

Table 7: Training size: 3000, Validation size: 1000, Iterations: 100, DGP: y = 2.5 + 5x_1 + (x_2 + 5)^α − x_3 − 2x_4 + 0.25x_5 + 0.75x_6 + ε.

In the last simulation the machine learning techniques clearly have trouble modelling the non-linearity, and for small values of α the misspecified linear model still outperforms the rest. Only for α of 1.8 or higher does boosting do better than regression. The remaining methods have far bigger MSEs and are not competitive with linear regression and boosting.
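The growing misspecification error of the linear model can be reproduced in a small sketch (a hypothetical Python analogue of the DGP in Table 7): since (x_2 + 5)^α is nearly linear in x_2 for α close to 1, the linear fit only breaks down as the curvature grows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)

def draw(n, alpha):
    X = rng.normal(size=(n, 6))
    y = (2.5 + 5 * X[:, 0] + (X[:, 1] + 5) ** alpha - X[:, 2]
         - 2 * X[:, 3] + 0.25 * X[:, 4] + 0.75 * X[:, 5]
         + rng.normal(size=n))
    return X, y

mses = {}
for alpha in (1.1, 1.8):
    X_tr, y_tr = draw(3000, alpha)
    X_val, y_val = draw(1000, alpha)
    reg = LinearRegression().fit(X_tr, y_tr)
    mses[alpha] = mean_squared_error(y_val, reg.predict(X_val))
print(mses)
```

At α = 1.1 the linear fit stays close to the irreducible error of 1, while at α = 1.8 the unmodelled curvature shows up as a clearly larger MSE.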


5 Conclusion

Although this is a fairly simple study comparing these techniques, some conclusions can be drawn. If the relationship in the data is truly linear, then linear regression is undoubtedly the best method, with efficient estimates (even for small sample sizes) and almost no computation time. Single trees, bagging and random forests implementing the CART growing process, however, are not a good tool for modelling linear relationships, with mean squared errors often more than twice as big. The boosting algorithm does a better job and comes surprisingly close, with a difference in MSE of only 0.072 (at 5000 training observations), in a setting where linear regression is at its best.

A disappointing finding is the behaviour of the tree-based algorithms when the number of predictors grows. Where regression remains consistent in its estimates, the other algorithms clearly have trouble specifying the relationship in the data, resulting in noticeably worse performance. The least affected method is boosting, which remains the best-performing algorithm besides regression. The cause of this increase in MSE might be the tree-depth limit of thirty nodes, but for boosting it remains unclear.

The only method that sometimes outperforms linear regression is boosting, but only if the omitted-variable bias is severe enough or α is 1.8 or higher. Such high values of α are not subtle, and a modeller using linear regression has a good chance of noticing the non-linearity. The bias, however, is somewhat more realistic, because variables are sometimes omitted due to wrong theories or failing statistical tests. Single trees, bagging and random forests seem unsuited for (linear) regression problems, and the good performance of these algorithms in the classification setting, as shown by Bauer & Kohavi (1999) and Caruana & Niculescu-Mizil (2006), does not carry over here.

An obvious advantage of the machine learning techniques is that they are self-learning, so the modeller does not have to specify the relationship himself. However, this study shows that single trees, bagging and random forests based on the CART growing methodology are not (yet) adequate for linear regression problems. Boosting shows some impressive results and surpasses misspecified linear models that are known to have a bias.

The algorithms are compared solely on prediction performance in the form of the mean squared error, whereas linear regression also provides point estimates of the influence of the explanatory variables on the dependent variable. The machine learning techniques only offer a variable importance plot, which produces an ordered list of the relative influence of the variables. Since boosting can produce these plots and show which variables matter, it can reveal which variables are missing from a linear regression model, helping to specify that model correctly and obtain unbiased estimates. That said, it may be valuable for an econometrician to study these (and more) techniques from the area of machine learning, as they provide new ways of analysing data, building classification models and perhaps deepening the understanding of the data set at hand. Nevertheless, linear regression, when specified correctly, remains the best-performing and most flexible method as far as this study goes.
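As a hypothetical illustration of such a variable importance ranking (scikit-learn's feature_importances_ plays the role of gbm's relative-influence summary in R):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 6))
beta = np.array([5, 0.5, -1, -2, 0.25, 0.75])
y = 2.5 + X @ beta + rng.normal(size=2000)

gbm = GradientBoostingRegressor(n_estimators=200, random_state=6).fit(X, y)
# Indices of the predictors, most influential first; x1 (coefficient 5)
# should dominate the ranking. A variable that ranks high here but is
# absent from a linear model is a candidate omitted variable.
ranking = np.argsort(gbm.feature_importances_)[::-1]
print(ranking)
```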


6 Further Research

The regression trees used in this study prove to be no replacement for linear regression, but there are growing processes other than CART, such as C5.0 or Conditional Inference Trees (CIT), which are also widely used (mostly for classification trees). Another point is that the prediction in each terminal node is just a simple average, which can be crude, especially for linear relationships. To tackle this, there are model trees that fit (linear) models in the terminal nodes; they often need fewer nodes to make good predictions and may perform better when more predictors are available.

If growing processes better suited for regression problems exist, bagging and random forests are excellent methods to increase their forecasting performance. They might even outperform regression on real data sets, but this is subject to further study.

Another point of interest, and the most important one for further research, is a comparison of the models on real data sets, because the ultimate goal is predicting outcomes for real-life problems. A study of the behaviour of these techniques on real data will therefore reveal more about the performance of the different methods than simulated data alone.


References

Bauer, E., R. Kohavi (1999). An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 36, 105-139.

Breiman, L., et al. (1984). Classification and Regression Trees. Wadsworth Statistics/Probability Series. Boca Raton: CRC Press.

Caruana, R., Niculescu-Mizil, A. (2006). An Empirical Comparison of Supervised Learning Algorithms. Proceedings of the 23rd International Conference on Machine Learning, 161-168.

Friedman, J.H. (2002). Stochastic Gradient Boosting. Computational Statistics and Data Analysis, 38, 367-378.

James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York: Springer.

Heij, C., et al. (2004). Econometric Methods with Applications in Business and Economics. Oxford: Oxford University Press.

Kuhn, M., Johnson, K. (2013). Applied Predictive Modelling. New York: Springer.

Maclin, R., Opitz, D. (1999). Popular Ensemble Methods: An Empirical Comparison. Journal of Artificial Intelligence Research, 11, 169-198.

Perlich, C., Provost, F., Simonoff, J.S. (2003). Tree Induction vs. Logistic Regression: A Learning-Curve Analysis. Journal of Machine Learning Research, 4, 211-255.
