Predicting Market Prices of Residential Real Estate

(1)

Predicting Market Prices of Residential Real Estate

(2)

Master’s Thesis Econometrics, Operations Research and Actuarial Studies

(3)

Predicting Market Prices of Residential Real Estate

Carlo Schimmel

Abstract

(4)

1 Introduction

The purchase of a house is the biggest expense in the lives of many people in developed nations. For many modern families, the home they live in is often also their main capital asset. As such, the housing market forms a key part of the economy of most countries. According to an estimate by Savills, a global real estate services provider listed on the FTSE 250 Index, the total value of global residential real estate was $168.5 trillion in 2017. This is more than twice the total value of global GDP during that year, which was about $80 trillion according to the world bank.

Due to this enormous importance in both a micro and macro economic sense, the question of how to determine a fair price for a house has drawn significant academic and commercial attention in recent history. Nonetheless, it has proven to be rather difficult to accurately predict house prices, even on relatively short timescales. It is well known that house prices can show large fluctuations in relatively short periods of time. Cities such as London, Amsterdam and San Francisco have seen such dramatic increases in real estate prices in recent years that some experts fear that new housing bubbles are forming. It is clear that the price of a house at any given time is not solely determined by its characteristics such as size, type, age or location. Among many other factors, it also depends on the demographics of its neighbourhood, trends in consumer preferences and macroeconomic factors that influence the housing market.

(7)

1.1 Structural Overview

This thesis is written in conjunction with an internship at a data science consultancy firm in Rotterdam. The objective of this internship is to improve upon a previously developed model for predicting the market price of newly constructed residential es-tate. As such, we have access to a prepared dataset in which information about newly constructed homes has been combined with several other data sets with demographic and economic information. We also have some preliminary results of the previously de-veloped model available. However, this model has not been thoroughly cross-validated. Thus, we approach our research question by first analyzing the predictive performance of these models in depth, and then to see whether we can improve upon them in terms of both accuracy and interpretability.

The previously developed model is actually a set of multiple models. It contains a classical linear regression model, a linear regression model regularized with an elastic net penalty, a linear support vector machine model, and a random forest model. It also contains a simple and weighted averaging of these models. All component models are estimated on a particular subsample of the data (a training sample) and then validated on the remaning data (a test sample). For each house type, the model that gives the best predictions of the prices of the houses in the test data is then selected to make predictions for new input data. However, which model performs best can vary depending on the particular choice of which subset of the data is used as the test set. Thus, we will analyze the accuracy of this mixed model’s out-of-sample predictions when it is trained on many randomly chosen subsets of the data. Based on our literature review in the next Section, we then investigate whether or not predictive rule ensemble models have a better overall out-of-sample performance.

In Section 3, we begin by describing a general mathematical framework for

(8)

predictive performance of such models in Subsection3.2. In the following two Sections, we then describe the specific prediction models that we will compare in this thesis. A detailed discussion of the component models included in the original mixed model

is given in Section 4. We propose that a Predictive Rule Ensemble or RuleFit model

might be a good alternative for these models, and discuss these in Section 5. Here, we

also propose an extension to the existing approach in Subsection5.4. We propose to use

the results of a fitted RuleFit model to identify candidate interaction terms, and then re-estimate the RuleFit model with these candidates as additional linear interaction terms.

In Section6, we discuss two model agnostic tools for the interpretation of prediction

models. These are feature importance and partial dependence plots, which we will use to see how our final models describe the relationship between the price of a house and

its characteristics. In Section7 we introduce the available dataset, and we discuss the

preliminary analysis of this data in Section8. In Sections9we discuss the results of our

randomized cross-validations of the mixed models, and the results for predictive rule

ensembles are discussed in Section10. In Section11, we then discuss the interpretation

of the best performing models. Finally, we draw our conclusions and formulate some

(9)

2 Literature Review

Approaches to predicting the prices of real estate found in the literature can be divided into two main classes. The first relies on using so-called hedonic models with a regression

approach, while the second employs machine learning techniques (Park & Bae, 2015).

Hedonic models explain the price of an object of real estate as a sum of contributions made by its features or characteristics. Such models tend to be easily interpretable

and relatively straight forward to estimate for a given set of data (Adair, Berry, &

McGreal, 1996). Furthermore, the statistical properties of regression models are well

understood mathematically, which helps researchers in determining an approach for a given set of data, because the underlying assumptions of statistical models can be tested

(Stevenson,2004).

However, hedonic models often lack predictive accuracy because they tend to over-simplify the relationship between price and other characteristics of a house. One reason for this is that the value of some characteristics of a house can depend on other

char-acteristics. For example, Fan et al. (2006) find that the floor level of apartments in

Singapore has a notably higher contribution to the price of large apartments with more than 5 rooms than to smaller apartments with less rooms. Another difficulty for he-donic models is that house prices also depend on market forces which can have both lagged and feedback effects that are hard to capture in a simple regression approach

(Schulz & Werwatz, 2004). Furthermore, there is wide evidence that housing markets

are not uniform across countries or regions, but consists of submarkets that are mainly,

but not solely, determined by location (Bourassaa, Hoesli, & Peng, 2003).

Machine learning methods provide an alternative to hedonic models that tends to

give better results in terms of predictive accuracy. Selim (2009) compared the predictive

(10)

et al. (2014) use a support vector machine model that they optimize using particle swarm optimization to predict the selling price of houses between 2000 and 2010 in the Chinese city of Chongqing. They also compare this to a neural network trained using back propagation. They find that the optimized support vector machine has a mean average prediction error of about 2.2%, while the neural network has an accuracy of about 5%. Both are considered to be highly accurate predictions.

Although machine learning techniques can give a higher accuracy than hedonic mod-els, they are typically more difficult to implement and interpret. Many algorithms are available, and they often require that the researcher specifies several hyper parameters before a predictive model can be estimated. In many cases, such as highly popular en-semble methods, the statistical properties of the algorithms are also poorly understood

(Biau, 2010). This makes it difficult to choose an approach for a specific data set in

another way than trying out many algorithms and hyper parameters, and selecting the one with the best performance. Another issue is that the resulting prediction models can be hard to interpret. Although their predictions may be accurate, they operate as a so-called black box, whose inner mechanics are not easily understood. This problem can be adressed by using several model agnostic interpretation tools that have been developed in concordance with the interest in such black-box models in recent years, but the interpretation of advanced machine learning algorithms is still by no means

trivial (Molnar,2019).

A machine learning model that was specifically designed to combine predictive ac-curacy with relatively easy interpretability is the Predictive Rule Ensemble or RuleFit

model created by Friedman and Popescu (2008). This model can combine a simple

he-donic regression model with an ensemble method based on regression trees, by adding simple rules derived from the data as dummy variables in a simple linear model. For

instance, a rule dummy may be 1 for houses that are both larger than 100 m2 _{and built}

(11)

rules can greatly improve the predictive accuracy of a simple hedonic regression model, while preserving its interpretability, as the simple rules are easy to understand.

In their paper, Friedman and Popescu also briefly illustrate their approach by using RuleFit to predict real estate prices. Using a dataset of sales of 506 houses in Boston during the 1970’s, they find that their approach has a notably higher accuracy than a simple regression model. In a larger evaluation of the method on many disctinct

datasets, Aho et al. (2012) find that RuleFit models often perform only slightly worse

than the popular Random Forest model (Breiman, 2001), which is also what they use

(12)

3 Methodology

3.1 General Prediction Models

Suppose that we have a sample (yi, xi) consisting of N joint observations of K + 1

variables. We call yi the dependent variable and refer to xi = (xi1, xi2, ..., xiK) as

the explanatory variables or features. For each i = 1, ..., N , we assume that yi ∈ R

and let χ denote the domain of all possible values for xi. Our goal is to use the

explanatory variables xi to accurately predict yi for new observations of xi. In general,

we approach this task by specifying a map f : χ → R that depends on a set of p unknown parameters, which we denote as the p-vector β. We then model the relationship between the dependent and explanatory variables as

yi = f (xi, β) + ui, (1)

where ui is a stochastic error term. Throughout this thesis, we assume that the

ex-planatory variables are strictly exogenous and that the error term has zero expectation.

Once we obtain an estimate ˆβ of the parameters β, we can make a prediction of yi

for any observation of xi, by taking the expectation of (1) with β = ˆβ. Denoting this

prediction as ˆyi we thus have

ˆ

yi = f (xi; ˆβ). (2)

To estimate the parameters β for a given function f (·), we specify a loss function

L : R → R+ which takes a prediction error ˆui := yi − ˆyi as its input and returns

a non-negative measure of the discrepancy between the prediction and its associated observation. We then estimate the parameters by minimizing the total loss over a subset

(13)

solving the following optimization problem: ˆ β := arg min β N X i=1 L(ˆui) = N X i=1 L(yi− f (xi; β)). (3)

When finding the solution is prohibited by the computational complexity of the mini-mization problem, we look for an approximate solution instead. Furthermore, we keep some of the data aside when we estimate the parameters of our models, so that we can test the accuracy of the estimates made by the final fitted model on data that was not used to determine its parameters. This is known as cross-validation. We refer to data that we use to determine the parameters of a model as training data, and call the data that we use for the validation of the model predictions the test data.

In this thesis, we consider 6 different specifications of f (xi; ˆβ). These are:

• A classical linear regression model, denoted LM.

• A linear regression model estimated with an elastic net penalty, denoted LMNET. • A linear support vector model, denoted LSVM.

• A random forest, denoted RF.

• A weighed ensemble model of the models above, referred to as the weighed model. • A predictive rule ensemble model, to which we refer by its alternative name,

RuleFit.

(14)

3.2 Model Evaluation

To asses the predictive performance of a fitted model ˆyi = f (xi; ˆβ), we evaluate its

prediction errors on a sample (y_i∗, x∗_i) of size N∗ that was not used to estimate the its

parameters. In practice, we randomly select a subset of 20% of the data to be used as the test sample. We then use three different measures to evaluate the performance of a prediction model on such test samples: the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE) and the Root Mean Squared Error (RMSE). These are given by:

All of these performance measures would be zero for a model that perfectly predicts yi,

so we prefer models that give lower values of the performance measures.

Although we are primarily interested in the predictive performance of a model on the test sample, which we call its out-of-sample performance, we also compute the performance measures for the training sample. The reason for this is that a large discrepancy between the two can be an indication of overfitting. This means that a model does not only approximate the latent structure of the data generating process, but also describes some of the particular random variation observed in the training sample. This can lead to a model that does not generalize well to new data, so overfitting can be an indication that a better out-of-sample performance might be obtained by a using

a more sparse model (Burnham & Anderson, 2002).

(15)

(16)

4 Mixed Models

In this chapter, we give a detailed description of all the models that are a part of the original mixed model. For the sake of completeness, this includes a detailed discussion of the classical linear regression model, whose results are well known. However, this brief recap of its most important properties is also useful for comparison with the LMNET and LSVM models that we discuss next. After discussing these linear models, we turn

to a general discussion of classification and regression trees in Subsection4.2. Although

we do not use these models as such, this discussion is needed for the detailed description

of the Random Forest model in Sub-subsection4.2.3 and Predictive Rule Ensembles in

Section5.

4.1 Linear Models

A linear model supposes that the relationship between a dependent variable and a set of K explanatory variables can be (approximately) written as

yi = x0iβ + ui, (5)

where β is a constant K-vector of unknown parameters and ui a stochastic error term.

If we assume that the error term has zero expectation and that the explanatory variables

are non-stochastic, this model implies that the expected value of yi conditional on xi

and β is

E[yi|xi, β] = x0iβ. (6)

Therefore, we can use this model to predict yi for new observations of xi if we find an

estimate ˆβ of β based on observed data. The prediction ˆyi based on this model is given

by

ˆ

(17)

We consider three possible ways to obtain an estimate ˆβ from a given sample in the following three subsections.

4.1.1 Classical Linear Regression

For the classical linear regression model, we use the ordinary least squares (OLS) esti-mator of β. This estiesti-mator is found by minimizing the sum of squared residuals, i.e.

solving the minimization problem in (3) with L(ˆui) = ˆui2 = (yi− x0iβ)2. If we let y

de-note an N -vector whose elements are observations of yi and X an N × K matrix whose

rows contain the corresponding observations of xi, the minimization problem becomes

ˆ

βOLS : = arg min

β N X i=1 (yi− x0iβ)2, = arg min β (y − Xβ)(Xβ − y). (8)

An exact solution exists whenever X has full column rank. Differentiating the objective

in (8) with respect to β and equating the result to zero then gives

ˆ

βOLS = (X0X)−1X0y. (9)

Provided that E[u] = 0 and X is strictly exogeneous, where u denotes the N-vector

with elements ui, the OLS estimator is unbiased, as the vector of expectations of ˆβ is

given by:

E[ ˆβOLS] = E[(X0X)−1X0(Xβ + u)] = β, (10)

so that the model predictions ˆyi = x0iβˆOLS will also be unbiased. If it is also true that

E[u] = σ2_{I for some unknown σ}2 _{∈ R}

++, then it follows form the Gauss-Markov theorem

that among all unbiased linear estimators of β, the OLS estimator has the smallest

variance. More precisely, this means that for any estimator ˆβ∗ = Cy = C(Xβ + u) of

(18)

matrix of ˆβ satisfies:

Var[ ˆβ∗] ≥ Var[ ˆβOLS] = σ2(X0X)−1. (11)

Thus, under the conditions of strict exogeneity and homosekdasticity, the classical

linear model gives the best linear prediction of yi in the sense that it is unbiased with

the smallest possible variance. However, other linear predictors may lead to better predictions when these assumptions do not hold, which is often the case in practice when the linear model is used as an approximation to a more complicated data generating process.

4.1.2 LMNET

The functional form of the LMNET model is identical to that of the linear regression

model given in (5), but the parameters β are estimated by using a total loss function

that includes a penalty term for nonzero parameters. Using the notation:

||β||p := _Xk j=1 |βj|p 1/p for β ∈ Rk, (12)

the minimization problem for the LMNET model can be written as

ˆ

βLMNET = arg min β 1 N N X i=1 1 2(yi− x 0 iβ) 2_{+ λ}h_α||β|| 1+ 1 2(1 − α)||β|| 2 2 i , (13)

where λ > 0 and 0 ≤ α ≤ 1 are hyper-parameters that must be specified in advance. The additional penalty term for positive parameters consists of a mixing of two main components. The parameter α determines the relative weights of these components, and is known as the elastic net parameter, which gives this model its name. The parameter λ, which is known as the tuning parameter, determines the overall strength of the

weighed penalty terms. The first component, ||β||1, is known as the Least Absolute

Shrinkage and Selection Operator (LASSO) penalty, and the second component ||β||2

(19)

is known as the ridge penalty. Under the assumptions that the error term has zero expectation and that the explanatory variables are strictly exogenous, we find that the

estimator using the ridge penalty (α = 0 in (13)) is given by:

ˆ

βRidge = arg min β 1 N N X i=1 1 2(yi− x 0 iβ) 2_{+ λ}1 2||β|| 2 2 = (X0X + λI)−1X0y (14) so that we have E[ ˆβRidge] = (X0X + λI)−1X0Xβ

Var[ ˆβRidge] = σ2(X0X + λI)−1X0X(X0X + λI)−1. (15)

Thus, using the ridge penalty reduces the variance of the estimator β when λ increases, at the expense of increasing its bias.

The effect of using the LASSO penalty is that it leads to sparser models by forcing

some of the elements of β to zero (Tibshirani,1996). The estimator ˆβLASSO is given by:

ˆ

βLASSO = arg min β 1 N N X i=1 1 2(yi− x 0 iβ) 2_{+ λ||β||} 1 = arg min β (y − Xβ) 0 (y − Xβ) + λ K X j=1 |βj|, (16)

which has no closed form solution. Nonetheless, we can see from this minimization

problem that whenever the effect of setting βj to zero increases the total sum of squares

(y − Xβ)0(y − Xβ) by less than λ| ˆβj,OLS|, the LASSO penalty on its absolute value

(20)

The LMNET model mixes the LASSO and Ridge penalties together, which often leads to fitted models that are more parsimonious and simultaneously give more

accu-rate predictions than the classical linear regression model (Zou & Hastie, 2005). This

comes at the expense of having to specify the two hyper-parameters α and λ prior to fitting the model, but selecting these parameters can be automated by using multiple randomized cross-validations to select an optimal value. A fast algorithm for doing this

can be found in Simon et al. (2011).

4.1.3 Linear Support Vector Machine

The functional form of the linear support vector machine (LSVM) model is again

iden-tical to that of the classical linear model given in (5). To estimate β we now use a

ridge penalty together with a linear loss function that is insensitive to errors with a size

below a certain threshold b ∈ R++:

L(ˆui) := L(yi, ˆyi) =        0 if |yi− ˆyi| < b |yi− ˆyi| − b if |yi− ˆyi| > b . (17)

Introducing non-negative slack variables ζi, ζi∗ ∈ R+, we can write the resulting

mini-mization problem as ˆ βLSVM := arg min β 1 2||β|| 2 2+ C N X i=1 (ζi+ ζi∗), s.t. yi − x0iβ ≤ b + ζi, x0_iβ − yi ≤ b + ζi∗, ζi, ζi∗ ≥ 0, (18)

(21)

the error tolerance and sparseness of the fitted model. The threshold b is also known as the tolerance parameter, and the distance 2b is known as the margin. As with the LMNET model, we can use fast algorithms based on randomized cross-validation within the training sample to determine an appropriate choice for C.

The key idea behind the LSVM model is that the minimization problem (18) can

be transformed into a dual problem, where the solution expresses a prediction of the

dependent variable for an input xi as a linear combination of a (small) subsample of

the data. Introducing new parameters αi, α∗i ≥ 0 for i = 1, ...., N , we can write the

solution of (18) as (Smola & Schölkopf,2004)

ˆ β = N X i=1 ( ˆαi− ˆα∗i)xi, (19)

where the parameters ˆα(∗)_i (referring to both ˆαi and ˆα∗i) are given by the solution of the

dual problem max α(∗)₁ ,...α(∗)_N −1 2 N X i=1 N X j=1 (αi− α∗i)(αj− α∗j)x 0 ixj− b N X i=1 (αi+ α∗i) + N X i=1 yi(αi− α∗i) s.t. N X i=1 (αi− α∗i) = 0 αi, α∗i ∈ [0, C]. (20)

The full derivation of this dual problem and the proof that (19) is equal to (18) is

beyond the scope of this thesis. Details can be found in McCormick (1983).

The expression in (19) is known as the support vector expansion of β. The solution

of (20) will typically have many cases where αi = α∗i = 0, so predictions of the LSVM

model are based on a small set of important training observations, called the support vectors. The idea behind this approach is that it should avoid overfitting because this small set of only the informative observations is used to separate regions with similar

(22)

4.2 Decision Tree Models

Decision tree models offer an approach to classification and regression based on parti-tioning the data into subgroups according to a sequence of decision rules. As it is not feasible to consider every possible partition for all but the smallest data sets, fitting these models relies on heuristic algorithms to find partitions that lead to good

predic-tions (Hastie, Tibshirani, & Friedman,2001). In this thesis, we use a tree-based model

known as a Random Forest, which predicts a dependent variable by taking an average prediction from many regression trees grown on randomly selected subsets of observa-tions and features of the input data. To explain this in detail, we begin by introducing some terminology by considering simple decision trees for classification, and then dis-cuss the structure of a general decision tree used for regression. Once these concepts have been introduced, we give a precise description of the random forest model and discuss an algorithm that can be used to fit it to a dataset.

4.2.1 Decision Tree Terminology

An example of a simple classification tree is given in Figure 1. This tree explains the

categorical dependent variable "transport to work" as a decision based on the variables "weather" and "mood". We refer to the square and oval boxes in the tree as nodes, which correspond to particular subsets of the data. The nodes split into other nodes according to simple decision rules, which ask whether or not one of the explanatory variables of an observation has a given level. A sequence of nodes is called a branch. By conjunction of the simple rules in a branch, we obtain composite rules that divide the data up into subgroups. We call the number of terms leading up to a node its depth, and the depth of the full tree is the maximum depth among its nodes. For example,

the node labeled "Mood?" in Figure 1 has depth 1, and the three nodes that it splits

(23)

tree are also called the leaves, which are nodes that are not split any further. With each leaf, we associate a particular value of the dependent variable, which is the tree’s prediction for a given input of the explanatory variables.

Figure 1: Example Classification Tree

The figure shows a simple example of a decision tree for classifying how the choice between taking a bike or bus to work depends on the variables weather and mood.

(24)

4.2.2 Regression Trees

The basic idea behind classification trees can be extended to the task of regressing a numerical dependent variable on both numerical and categorical explanatory variables.

A generic decision tree for regression of a continuous dependent variable yi on a set of

explanatory variables xi divides the sample space χ of the explanatory variables into

M disjoint regions R1 ⊂ χ, ....RM ⊂ χ, and associates each region with a constant

cm ∈ R that is the tree’s prediction ˆyi for observations that it sorts into that region.

The relationship between the dependent and explanatory variables is thus modelled as

yi =

M

X

m=1

cmI{xi ∈ Rm} + ui, (21)

where ui is a stochastic error term and I{·} is the indicator function, which is 1 if the

expression within the brackets is true and zero otherwise. The tree’s prediction ˆyibased

on xi is thus ˆ yi =                c1 if xi ∈ R1 .. . ... cM if xi ∈ RM . (22)

For a giving partition R1, ..., RM of χ, the associated constants cm can be determined

by minimizing the total loss in each region separately

cm := arg min c N X i=1 L(yi− c)I{xi ∈ Rm}. (23)

In particular, if we use the quadratic loss function L(ˆui) = ˆu2i, it follows that the

(25)

that region, since differentiating (23) for the entire sample with respect to c yields d dc N X i=1 (yi− c)2 = N X i=1 h 2c − 2yi i , (24)

and setting this derivative equal to zero thus gives

c = 1 N N X i=1 yi. (25)

Thus, it is straightforward to determine the optimal choice ˆcm for a given partition

R1, ..., Rm. However, finding the optimal partitioning of χ into M disjoint regions for

a given M is generally not feasible, and their is also no a priori way to determine an optimal M so that the resulting tree will not overfit the input data and generalize well to new observations. Thus, we must rely on heuristic algorithms to construct a regression tree that leads to good predictions.

One example of such an algorithm, which is the one that we will use in this thesis to grow the trees that we use in the random forest model, is the greedy algorithm

for training classification and regression trees given by Breiman et al. (1984), which

Breiman (2001) also used in his original specification of the random forest model. This

algorithm partitions the sample space χ of the features into disjoint regions such that

each region contains at least M∗ observations. Here, M∗ is a hyper-parameter chosen

in advance. For each variable j = 1, ..., K, we split the data into two regions R1(j, s) =

{χ|xj ≤ s} and R2(j, s) = {χ|xj > s}, where we choose the splitting point s ∈ R by

solving min s h _X i∈R1(j,s) (yi− ˆcj1)2+ X i∈R2(j,s) (yi− ˆcj2)2 i (26)

where ˆcjr is the solution of (3) with a quadratic loss function for Rm = Rr(j, s). Since

we use the quadratic loss function, ˆcjr will be equal to the mean of the yi of the

(26)

Figure 2: Algorithm for growing a Regression Tree

Illustration of a greedy algorithm for splitting a sample space of two variables X1and X2 into regions

R1, ..., R5, adapted from Figure 9.2 in Hastie et al. (2001). First, the optimal splitting points t1 and

t2 are computed. The first split is then made on X1, which means that it leads to a smaller total

variance in the resulting nodes than splitting along X2≤ t2would. Both nodes are then split further,

where the left node is split according to X2≤ t2while the right node is again split based on the values

of X1. Finally, the region X1> t3 is split along X2≤ t2.

s by minimizing the sum of the variances of yi in the resulting regions. The variable

whose optimal split gives the smallest total variance in yi then becomes our choice

for the first split. This procedure is then repeated starting from each newly created

region, until a further split would create a region containing less than M∗ observations,

where we call M∗ the minimum node size. An illustration of this procedure is given

in Figure 2. The final result is a set of M decision rules partitioning the sample space

into regions R1, ..., RM, together with M constants ˆcm giving the tree’s prediction of yi

(27)

4.2.3 Random Forest

A random forest model predicts a dependent variable yi based on explanatory variables

xi by taking the average prediction of many decision trees, that are generated from

random sub-samples of the data. Letting {Tb : χ → R} denote the set of B regression

trees in the random forest, this model’s prediction ˆyi based on xi is simply the average

of the predictions of all individual regression trees in the ensemble:

ˆ yi = 1 B B X b=1 Tb(xi). (27)

In the original implementation of the random forest algorithm (Breiman, 2001),

trees are generated from bootstrap samples drawn with replacement from the original

data, where each bootstrap sample (y∗, X∗) has the same size as the original sample.

Also, for every split in the trees Tb, only a random selection of Q features is considered

instead of all K explanatory variables. Q is a hyper-parameter that must be specified

in advance, together with the minimum node size M∗ that implicitly determines the

depth of the trees in the final ensemble. Q is generally chosen to be small compared to the total number of features K, to ensure that many dissimilar trees are generated

(Liaw & Wiener,2002).

Other than the random selection of a subsample of the explanatory variables for splitting at each node in each tree, the algorithm that we will use in this thesis for generating the individual trees in the random forest is the same as the algorithm

de-scribed in Subsection 4.2.2. However, there are many somewhat different approaches

to generating random forests, and we will consider one of them in the following section where we discuss Predictive Rule Ensembles, which are models that combine a random forest with a linear model.

(28)

or only a small bias because each tree uses only a small number of randomly selected features. It thus seems plausible that an average of many unbiased estimates could give an accurate prediction, and the Random Forest does indeed have a reputation for giving highly accurate predictions in many practical applications. However, no precise math-ematical understanding of the statistical properties of the Random Forest exists, and

the underlying mechanism is not understood very well (Biau, 2010). Nonetheless, this

model is widely used in practice because of its often outstanding predictive performance

(Genuer, Poggi, & Tuleau,2008) compared to other available approaches, including the

linear models discussed in the previous subsection.

4.3 Weighed Model

A weighed or ensemble model can be used to generalize the main idea behind the random forest. Instead of making a prediction based on a simple average of decision trees, we now use a weighted average of different types of models. Suppose that we

have M fitted prediction models ˆy(m)_i = fm(xi; ˆβm), and let w1, ..., wM ∈ R be weights.

For a given choice of w1, ..., wM, the weighed model prediction of yi based on xi is given

by ˆ yi = fw(xi; w1, ..., wM) = M X m=1 wmfm(xi; ˆβm). (28)

To determine the weights, we use a random cross-validation approach on small

subsamples of the test data. S∗ times, we draw a sample of size Nw from the test data

and compute the predictions ˆy_i(m). Each time, we estimate the classical linear regression

model discussed in 4.1.1 on the subsample of the test data. Here, we use the observed

values yias the dependent variable and the predictions (ˆyi(1), ..., ˆy

(M )

i ) as the explanatory

variables. The weights for this sample are then given by the OLS estimates. Repeating

this procedure gives us S∗ sets of weights (w1, ..., wM), from which we take the median

(29)

5 Predictive Rule Ensembles

5.1 General Framework

A Predictive Rule Ensemble, also called a RuleFit model, combines a linear model with a decision tree model by adding dummy variables that correspond to nodes in

the decision trees of a of a tree ensemble fitted to the data (Friedman & Popescu,

2008). These dummy variables can be interpreted as rules, indicating whether or not

the explanatory variables xi satisfy a certain set of conditions. Let ri(xi) be a vector of

R binary variables, where each element depends on the values of a (small) subset of the

explanatory variables xi, so ri(xi) = (ri1(xi), ..., riR(xi)) with rij(xi) ∈ {0, 1}. Letting

γ be an R-vector of unknown parameters, the RuleFit model is given by

yi = x0iβ + ri(xi)0γ + ui, (29)

where ui is a stochastic error term. Fitting this model to a set of data now means

that we have to find a set of rules ri(xi) with corresponding parameter estimates γ in

addition to estimating the parameters β.

The basic idea behind the approach we use to find ri(xi) is that we begin by including

a large number of candidate rules into the model, and then fit the resulting linear model with R additional binary variables with a strong LASSO penalty term in the minimization, so that many of the variables and candidate rules will be given a weight of zero, effectively removing them from the final fitted model. Thus, given a set of rules

r(xi), we fit the model to the data by solving

ˆ β, ˆγ := arg min β,γ 1 2 n X i=1 (yi− x0iβ − ri(xi)0γ)2 + λ h ||β||1+ ||γ||1 i , (30)

(30)

strength of the penalty for giving a nonzero weight to any of the parameters. This parameter can be specified in advance, or determined using cross-validation as in the

LMNET and LSVM models (Friedman & Popescu,2004). To ensure that the candidate

rules have an a priori weight that is approximately the same as the linear variables, the linear variables are normalized prior to the minimization.

The main idea behind using such rules is that they can capture interactions between variables that are lost in a linear model. Furthermore, because each tree is constructed based on a random sub-sample of the data and a random subset of the features, the features that are most strongly related to the dependent variable will tend to appear in the rules more often, although some less important features will be included in the initial ensemble as well. This makes it possible to capture discontinuities, higher order interactions, or interactions with features that have only a small support in the dataset if these exist in the data generating process. In summary, a predictive rule ensemble thus tries to include a large number of potential interactions and discontinuities in a linear model, and then uses a LASSO penalty in the final optimization to ensure that

irrelevant candidate rules are ultimately removed from final fitted model (Friedman &

Popescu, 2008). The result is a sparse linear model complemented with a set of easily

understandable rules that allow it to accommodate non-linear relations in the data.

5.2 Finding Candidate Rules

(31)

which we describe below. We denote the parameters of tree b as pb, and refer to the

trees in the ensemble as Tb(xi; pb).

To obtain the base learners, we begin by specifying a randomly chosen maximum

number of leaves for each individual tree. Letting Lb denote this maximum for tree b,

we use Lb = 2 + bXbc Xb ∼ exp(λ) with λ = 1 ¯ L − 2 (31)

where bxc is the floor function, returning the largest integer that is less than or equal

to x, and ¯L denotes the average number of leaves in a tree, which is a hyper-parameter

that must be chosen beforehand. Typically, we want the individual trees to have a depth that is small compared to the total number of available features.

With Lb specified, we use an algorithm that is somewhat similar to the one discussed

in Subsection4.2.3to grow the trees, where we randomly select Q features to be

consid-ered for each split and use a random sample of size η from the data. However, instead

of using the greedy algorithm discussed in Subsection4.2.2to grow the individual trees,

we now use the model based recursive partitioning algorithm by Zeileis et al. (2008)

based on a simple linear model. Briefly summarized, this algorithm estimates a linear model at each node, starting with the full sample, and then splits the node according to the variable that would lead to the largest change in the parameter estimates when the linear model is estimated separately on the resulting subsets. For the exact details

on this procedure, see the paper by Zeileis at al. (2008).

Once we have obtained an ensemble of regression trees, {Tb(x; pb)}B1, we transform

all nodes in all trees into rules and remove any duplicate and complement rules. This

gives a set of R base-learners rb : χ → R, which we denote as rb(x; pb). We then

(32)

Fj : χ → R be functions of x ∈ χ, this algorithm proceeds as follows:

• First we set b = 1 and define the function F0 : χ → R as

F0(x) = arg min c N X i (yi− c)2 = 1 N N X i yi. (32)

• For base-learner b we draw a new subsample Sb of size η from the data and choose

its parameters by minimizing

pb = arg min p X i∈Sb Lyi, Fb−1(xi) + rb(xi; p) , (33)

where we use a quadratic loss function L(x) = x2_{. Next, we define the function}

Fj(x) as:

Fb(x) = Fb−1(x) + νrb(x; pb), (34)

where 0 ≤ ν ≤ 1 is a hyper-parameter known as the learnrate or boosting- or shrinkage parameter. Finally, we set b = b + 1, and repeat this step until all we have re-estimated the parameters of all base-learners in the ensemble.

The trained base-learners are then used as the candidate rules ri(xi) = (ri1(xi), ..., riR(xi))

in (30).

5.3 Choosing Hyperparameters

The final predictive performance of the RuleFit model depends on the choice of 5 hyper-parameters. These are the learnrate ν, the number of trees to be generated in the ensemble B, the number of random features to be considered for each split in each tree Q, the size of each subsample η used for growing the trees and estimating its

parameters, and the average number of nodes in each tree ¯L. Since we can not consider

(33)

large as possible while keeping computation times manageable. Furthermore, we will

use ¯L = 4 in all computations, so that we mostly obtain rules with two variables in the

initial ensemble. For Q, we only consider using all features for each tree or a randomly selecting a third of all features. Using all features will generate many similar trees, leading to many rules that concern the most informative features. Using Q = p/3 will give a more diverse initial ensemble. Similarly, for η we consider using either half the sample or the full sample for the estimation of each tree.

Choosing ν = 0 implies that each base-learner is optimized independently of the others, while using a positive ν creates dependence between the parameters of each base learner with the entire ensemble. Based on simulation studies, Friedman & Popescu

(2003) recommend using small but nonzero values for ν. Their results find that ν = 0.01

gives the best results, and this is also the default choice in the R package "pre" (Fokkema

& Christoffersen, 2018) that we use to estimate predictive rule ensembles in this thesis.

Therefore, we will consider ν = {0.001, 0.01, 0.1}.

Furthermore, to reduce the influence of outliers on both the selection of the candidate rules and the final model fitting, Friedman and Popescu suggest that one should consider "winsorizing" the input variables prior to estimating the model. This means that we do not use observations in the top and bottom ω quantiles of each explanatory variable.

Friedman & Popescu (2008) recommend considering ω ≤ 0.025. In this thesis, we will

consider ω = {0, 0.005, 0.025}.

5.4 Adding Linear Interactions

We will also consider a new addition to the RuleFit model, where we add linear

interac-tion terms of the form xjxj0 to (29) based on cross-validation of a regular RuleFit model.

(34)

interactions. This is useful because including every possible linear interaction term between two variables is generally not feasible. However, the original RuleFit model only includes such interactions as a binary variables, which does not accommodate in-teractions whose strength gradually changes with the values of its components well. If interactions of varying strength are indeed present, including a linear interaction term might lead to a model with a better predictive performance.

Because the RuleFit model relies on a randomly generated ensemble of base-learners, we use cross-validation to identify candidate interaction terms. We estimate the RuleFit model S times, and track how often each possible pair of variables is given a nonzero weight in the fitted model. We also track many rules are given a nonzero weight in total. We then compute the rate at which each variable pair that ever appeared is included (# occurrences of a pair / # rules with a nonzero parameter estimate). We shall refer to this as the occurence rate of a pair of variables. Based on the observed

occurrence rates, we select a cutoff rate rc, and include pairs with a higher frequency

(35)

6 Model Interpretation

In addition to the predictive performance of a model, we are also interested in its

inter-pretation. For the linear models discussed in Subsection 4.1, interpretation is trivial.

Here, the estimated parameters ˆβ immediately show the effect that each explanatory

variable has on the dependent variable. This also holds to some extent for the RuleFit Model, which contains linear components as well as composite rules in the form of bi-nary variables. If such a rule is satisfied, the corresponding estimate of its parameter gives the impact of this rule on the dependent variable. However, when a variable is included in multiple rules and also appears as a linear term, the interpretation of its effect becomes more difficult. On the contrary, the random forest and mixed model are very hard to interpret because the impact of any variable depends on the values of all other variables.

To assess the effects of individual explanatory variables on a dependent variables, we use two different model-agnostic tools for the interpretation of a general prediction model. These are the feature importance measure originally introduced by Breiman

(2001) for the interpretation of a random forest and partial dependence plots introduced

by Friedman (2001) for interpreting boosted regression tree ensembles.

To compute the feature importance of the explanatory variables xj in a prediction

model ˆyi = f (xi; ˆβ), we first compute the model’s root mean squared error on a test

sample (y∗_i, x∗_i). Denoting this error by E0, we have

E0 = v u u t 1 N∗ N∗ X i=1 (yi− ˆyi)2, for i = 1, ..., N∗. (35)

For each explanatory variable xj, we then compute the mean squared error again on

the test set with a random permutation of the values of xj, which we denote as Ej. The

(36)

the dependent variable, so that the resulting increase in the model’s prediction error

reflects the contribution of xj to the models predictive performance (Fisher, Rudin, &

Dominici,2018). We then compute the feature importance of variable xj as

FI(xj) =

Ej

E0

. (36)

Because the feature importance depends on taking a random permutation of the data of one feature, we take the average of 30 such computations. Furthermore, we scale the feature importances so that the variable with the highest feature importance is given an importance of 100 and the one with the lowest an importance of 0.

The partial dependence function of a subset of the explanatory variables xS ⊂ x in

a prediction model ˆy = f (x; ˆβ) is defined as fp(xS) = E[f (x)|x−S], where x−S denotes

the vector of variables in x that are not in xS(Molnar,2019). To estimate this function,

we compute the average model predictions on a set of training data for a range of values

of the variables in xS. The estimated function ˆfp(xS) is thus

ˆ fp(xS) = 1 N N X i=1 f (xS, xi,−S; ˆβ). (37)

We refer to a plot of (37) as a partial dependence plot. When xS contains two variables,

it is also sometimes called a pairplot. Such plots show the average marginal effect of

(37)

7 Data Description

We have a prepared dataset that consists of N = 5009 complete observations on housing units in The Netherlands. All houses were newly-constructed and sold immediately after their construction between 2003 and 2018. For each unit, we observe a total of 384 variables that can conceptually be divided into the following 6 groups:

1. House characteristics, such as the price for which it was sold, how long it took to sell the object since its first listing on the market, size, number of rooms, the year the object was build, the type of house and its location. (28 variables) 2. Information about the characteristics and demographics of the neighbourhood in

which the object is located, such as the average income of households, the percent-age of households with (young) children, population density, the percent-age distribution of the population and a quality of life score. (204 variables)

3. The distance to a wide range of services and utilities, such as schools, general practitioners, hospitals, libraries and major train stations. (91 variables)

4. Information about other houses in the neighbourhood, such as their type, age, size and the percentage of houses available for (social) rent. (20 variables) 5. General macroeconomic indicators, such as price index and indicators on consumer

trust, economic outlook, interest rates and GDP growth during the period that a house was sold. This also includes an indicator for the recession surrounding the financial and euro crises in the period 2008-2013. (15 variables)

(38)

This dataset was created by joining data on the houses built by two real estate development companies (group 1) together with 3 different data sets. Data on the variables in groups 2 and 5 were obtained from public data sets obtained from the Dutch Central Bureau for Statistics (CBS), data on the variables in groups 3 and 4 were obtained from a dataset that was purchased from a private research company, and those in group 6 were obtained from a publicly available dataset created by the Dutch association for real estate agents, the NVM.

The data contains 4 main types of houses, which are terrace houses, apartments, semi-detached and detached houses. Here, semi-detached means that a house is fully connected on one side to a similar house that was constructed simultaneously. We also have a dummy variable indicating whether a terrace house is on a corner, a dummy that indicates whether an apartment is a penthouse, and a dummy for detached houses that are coupled. Here, a detached house is coupled when it’s walls are partially connected to those of a different house, which need not be similar or constructed at the same time.

Additionally, all houses have one of 153 subtypes. Table1 shows how often each type

appears in the dataset. We see that over half of the data consists of terrace houses, and another 30% of apartments. We have relatively few observations on detached houses, and most of these are coupled.

Table 1: Types of Houses

Type Total Corner Penthouse Coupled

Terrace 2765 428 0 0

Apartment 1568 0 31 0

Semi-detached 510 0 0 0

Detached 166 0 0 95

The table shows how many observations we have for each of the 4 main house types in the data. For Terrace houses, it also shows how many of those are corner houses, for apartment it shows how many are penthouses and for detached houses it shows how many are coupled.

(39)

Figure 3: Houses per Project 5 10 15 40−50 ₉₀₋₁₀₀ 140−150 190−200 240−250 Number of houses Projects

The histogram shows the number of projects in the dataset that contain 0-10, 10-20, etc. houses.

of rooms. Houses in a project are also in the same neighbourhood, or in rare cases in adjacent neighbourhoods. This implies that many of observations in the dataset are highly similar, so that the data is not a random sample of newly constructed houses. We have a total of 107 projects which contain on average 47 houses per project. Figure

3 gives an indication of the distribution of the number of houses in a project. We see

that the majority of the projects contain at most 70 houses, while there are there are 19 projects with more than 70 houses, of which 11 contain more than 100 houses

(40)

8 Preliminary Data Analysis

Because this thesis was written during an internship with the objective of improving upon a previously developed model, we follow several choices that were made in the development of this original model to be able to compare the results of the RuleFit model to those of the mixed model. First of all, we use price per square meter as the dependent variable in all models, rather than the price itself. Secondly, we estimate all models separately for the four main types of houses in the data. These choices were found to improve the predictive accuracy of the previously developed mixed models.

To show why using price per square meter rather than the price itself as the

de-pendent variable in a prediction model is likely to improve its accuracy, Table 2 gives

the sample correlations between the house size and price, and house size and price per square meter. While the correlation between size and price is about the same for all types other than semi-detached, the correlation between size and price per square meter is notably different for each type. The nonzero correlations of price per square meter and size indicate that the relationship between price and size is not linear for all house types other than apartments. The predictive performance of the linear models is therefore likely to improve when price per square meter is used as the dependent variable while size is also included as an explanatory variable.

Table 2: Correlations

Type Living Area - Price Living Area - Price/m2

Terrace 0.78 0.21

Apartment 0.75 0.00

Semi-detached 0.49 -0.31

Detached 0.75 0.43

The table shows the Pearson correlation coefficients between living area (in m2_{) and price (in}_{e), and}

living area and price/m2 (in e), for each of the four main types of houses in the data. To calculate this coefficient ρ given a sample xi, yi of size N , we use ρ = N1

PN

i (xi− ¯x)(yi− ¯y)/(σxσy), where ¯x

(41)

Figure 4: Histograms of price per m2, per type 0 100 200 1500 2000 2500 3000 3500 4000 Price per m² Houses

(a) Terrace houses

0 50 100 150 200 2000 3000 4000 5000 6000 Price per m² Houses (b) Apartments 0 20 40 60 1500 2000 2500 3000 3500 Price per m² Houses (c) Semi-detached houses 0 10 20 2000 3000 4000 5000 Price per m² Houses (d) Detached houses

The histograms show the distribution of the price in euro’s per square meter of living area for each of the four main house types in the data.

Histograms of the price per square meter for each type are given in Figure 4, and

Table 3shows the respective means, standard deviations and minimum and maximum

(42)

Table 3: Summary statistics for price per m2

Mean S.d. Min Max

Terrace 2,321 411 1,504 4,181

Apartement 2,719 581 1,503 6,293

Semi-detached 2,679 401 1,532 3,787

Detached 2,731 560 2,014 4,945

The table shows the means, standard deviations, minima and maxima of the price per square meter of living space in euro’s for each of the four main house types in the data.

8.1 Variable selection

The method we use for variable selection is also chosen to follow the approach of the original mixed model. Selecting which variables to use as the explanatory variables prior to estimating our models is useful because our dataset contains 363 numeric variables, while we have at most 253 distinct houses (for terrace houses). Even if all observations were completely random we would have at most 7.6 observations per variable. For detached houses, we even have less observations (166) than variables.

To reduce the number of variables, we first remove any numeric variables that have no variance within the subsets of the data per type. This removes 9 variables for terrace houses, 12 for apartments, 13 for semi-detached houses and 15 for detached houses. Next, we proceed by removing variables that are strongly correlated with some of the other variables. To do this, we compute the correlation matrix of the numeric variables other than price and price per square meter, for the subsets of the data per house type separately. For each pair of variables with a correlation of 0.8 or higher, we remove the variable with the highest mean absolute correlation with all other variables. This removes 201, 235, 242 and 282 variables for terrace houses, apartments, semi-detached houses and detached houses, respectively.

For the remaining variables, we use a forward stepwise variable selection algorithm

(Venables & Ripley,2002) to determine which variables we include in our models. We

(43)

esti-Table 4: Final Variable Selection

Type Terrace Apartment Semi-detached Detached

Number of expl. variables 78 28 19 14

Number of observations 2765 1568 510 166

Observations per expl. variable 35.4 56 26.8 11.8

The table shows the number of variables that are included in all models for each house type, together with the number of observations that we have for those house types and the number observations per variable.

mate a classical linear model with the current selection of variables as the explanatory variables and price per square meter as the dependent variable on a random subset of

half the training data, using (8). We then compute the Akaike information criterion of

this model on the other half of the sample, using

AIC = 2k − 2 ln ˆL(Xk; ˆβOLS), (38)

where k denotes the number of variables included in the model, Xk denotes the data

matrix of the k explanatory variables, and ˆL(Xk; ˆβOLS) denotes the value of the

Gaus-sian log-likelihood function for the sample evaluated at the OLS estimates βOLS. We

repeat this 500 times and take the average of the AIC’s as the final AIC for the current model. For each variable that is not included in the current model, we then estimate the linear model that contains that variable as well in the same way, to compute the resulting AIC of the model with that variable. The variable that leads to the largest reduction in AIC on average is then included in the model. This process is repeated until no remaining variable would lead to a lower AIC, or until all variables are included in the model. The final number of explanatory variables and variables per observation

(44)

9 Results for Mixed Models

To evaluate the performance of the mixed model and its component models, we perform 300 randomized cross-validations of the LM, LMNET, LSVM, RF and mixed model. For the LMNET and LSVM model, we use an algorithm to automatically select the optimal choices for the hyper-parameters. For the Random Forest model, we generate 600 regression trees each time, and consider K/3 randomly selected variables for each split, with K the number of explanatory variables after variable selection for each type

of house. We set the minimum node size to M∗ = 5. On each run, we use a random

allocation of the data per type into a training set consisting of approximately 80% of the data and a test set containing the remaining 20%. Because the houses in the sample are grouped in projects that contain many highly similar houses, we randomize over the projects rather than individual houses. To ensure that we include on average about 20% of the houses in the test set, we use the following algorithm:

• We take a random permutation of the projects: {P1, P2, ..., P107} and start with no projects in the test sample.

• Starting with P1, we add the next project in the permutation to the test sample, so long as the resulting test sample contains less than 20% of houses.

• We add the next project PX if the percentage of houses in the resulting test sample would be closer to 20% than the test sample without it. For instance, suppose we have a test sample that contains 18% of all houses in the data. If PX contains 3% of houses we add it to the test sample, because the resulting sample would contain 21% of houses, giving a smaller difference from the desired 20%. The median MAE, MAPE and RMSE on the train and test samples for each house

type are given in Table5. We notice that the classical linear model has an extremely

(45)

Table 5: Median Model Performances

MAE MAPE RMSE

Terrace Train Test Train Test Train Test

LM 112 5,638,330 0.05 2,627 164 1,1381,774 LMNET 118 497 0.05 0.21 171 668 LSVM 111 648 0.05 0.28 173 894 RF∗∗∗ 85 271 0.04 0.12 133 349 Averaged 103 1,130,650 0.04 527 154 2,283,024 Mixed 88 152,448 0.04 67.52 133 299,968 Apartment LM 187 944 0.07 0.35 282 1258 LMNET∗∗∗ 188 358 0.07 0.14 283 463 LSVM 177 463 0.07 0.18 295 599 RF 160 398 0.06 0.15 258 520 Averaged 174 436 0.07 0.17 273 556 Mixed 162 481 0.06 0.19 259 628 Semi-detached LM∗ 79 142 0.03 0.05 114 187 LMNET∗∗∗ 79 126 0.03 0.05 115 168 LSVM∗ 80 142 0.03 0.05 120 187 RF 81 243 0.03 0.09 120 318 Averaged∗ 78 145 0.03 0.05 115 207 Mixed 86 163 0.03 0.06 120 207 Detached LM 88 499 0.03 0.20 137 598 LMNET 95 337 0.04 0.12 141 406 LSVM 99 293 0.04 0.12 144 379 RF∗∗∗ 90 219 0.03 0.09 147 276 Averaged 89 240 0.03 0.10 134 290 Mixed 98 512 0.04 0.21 139 576

The table shows the median in- and out-of-sample MAE (ine), MAPE (as a proportion) and RMSE (ine) obtained from 300 randomized cross-validations of the LM, LMNET, LSVM, RF, Averaged and Mixed models. Bold performance measures indicate the model with the best out-of-sample perfor-mance. The number of asterixes behind a model name indicates the number of performance measures according to which the model has the best or shared-best out-of-sample performance.

(46)

we find that the weighed model has a rather poor performance overall. Its cross-validated performance is 3rd worst for terrace houses, second worst for apartments and semi-detached houses and worst for detached houses.

The reason for the extreme errors in the predictions of the terrace houses is the relatively large number of explanatory variables (77) combined with the fact that of the 16 projects with more than 70 houses, 9 contain more than 70 highly similar terrace houses. In 292 out of 300 runs, this leads to a near-singular matrix of training data, which results in numerical errors that lead to extreme overfitting. Performance measures

for terrace houses on the test data for the remaining 8 runs are given in Table6. Here

we find that weighed model and random forest have the best performance, while only the random forest has the best median performance for terrace houses according to all three measures over all 300 runs.

Table 6: Median Out-of-Sample Performances for Terrace Houses

MAE MAPE RMSE

LM 1284 0.56 1971 LSVM 343 0.15 517 LMNET 330 0.14 480 RF∗∗ 197 0.09 277 Averaged 381 0.17 513 Mixed∗∗ 195 0.09 278

The table shows the median out-of-sample performance measures of the mixed models for terrace houses on the 8 / 300 runs for which MAPE < 1. Bold performance measures indicate the model with the best out-of-sample performance. The number of asterixes behind a model names indicates the number of performance measures according to which the model has the best, or shared-best out-of-sample performance. MAE and RMSE in euro’s, MAPE as a proportion.

For apartments, we find that the LMNET model outperforms the other models

according to all performance measures. Its median MAE for the price/m2 _{of a house is}

ise358, its median RMSE e463, and its median MAPE is 14%. The performances of the

(47)

the LM model is overfitting for the apartments, because the out-of-sample MAE and RMSE are about 4 to 5 times larger than their in-sample values, while this difference is only about 2,5 times for the other models. We also see that the LMNET and LSVM models, which are also linear but include less variables, perform notably better.

For the semi-detached houses, we again find that the LMNET model outperforms the other models. The out-of-sample prediction errors for this type are also notably smaller than those found for the terrace houses and apartments. We now have a median

MAE of e126, a median RMSE of e168 and a median MAPE of only 5%. Here, all

linear models perform well, and we see that the results for the LSVM model are almost identical to those of the LM model. This implies that the LSVM model does not set any coefficients to zero for the semi-detached houses.

For the detached houses, we find that the random forest model outperforms the other

models, giving a MAE ofe219, RMSE of e276 and MAPE of 9%. The performance of

the other models is again only slightly worse, with the exception of the LM and Mixed models, which have a MAE, MAPE and RMSE that is over twice as large. Overall, we find that the predictions for detached houses are slightly better than those for Terrace houses and apartments, and slightly worse than those for semi-detached houses.

Table 7 shows the proportion of runs in which each model gives the best

out-of-sample performance according to each performance measure. First of all, we notice that we find only minor differences between the three performance measures. We find that the random forest model gives the best out-of-sample performance for terrace houses in 94-95% of all runs. For the other house types, we find that the best models outperform the other models in only 32-39% of runs. The random forest model also performs best for detached houses, and the LMNET model gives the best results for apartments and semi-detached houses.

More extensive performance measures for the best performing model per type are

(48)

Table 7: Rates at which models perform best

Terrace Apartment Semi-detached Detached

MAE LM 0.00 0.15 0.12 0.10 LSVM 0.01 0.07 0.13 0.09 LMNET 0.03 0.35 0.39 0.12 RF 0.94 0.20 0.04 0.34 Averaged 0.00 0.15 0.17 0.28 Weighed 0.01 0.08 0.14 0.07 MAPE LM 0.00 0.16 0.13 0.10 LSVM 0.02 0.06 0.14 0.09 LMNET 0.03 0.35 0.39 0.14 RF 0.94 0.20 0.05 0.33 Averaged 0.00 0.15 0.16 0.27 Weighed 0.01 0.08 0.13 0.07 RMSE LM 0.00 0.14 0.18 0.09 LSVM 0.01 0.06 0.15 0.09 LMNET 0.03 0.39 0.38 0.13 RF 0.95 0.20 0.07 0.32 Averaged 0.00 0.15 0.11 0.29 Weighed 0.00 0.06 0.11 0.08

The table shows the proportion out of 300 randomized cross-validations that each model gives the best out-of-sample performance, according to MAE (e), MAPE (proportion) and RMSE (e), for each house type. Bold numbers indicate the highest value in each column.

(49)

Table 8: Best Model Performance Measures

Terrace Apartment Semi-detached Detached

RF LMNET LMNET RF MAE Median 271 357 129 219 Mean 271 379 155 275 S.d. 51.0 148 95.2 188 Min 139 146 43.6 5.19 Max 433 1105 748 836 5% 197 197 64.0 77.4 95% 348 673 332 709 MAPE Median 0.12 0.14 0.05 0.09 Mean 0.12 0.15 0.06 0.10 S.d. 0.02 0.06 0.03 0.06 Min 0.06 0.06 0.02 0.00 Max 0.17 0.41 0.28 0.26 5% 0.08 0.06 0.02 0.03 95% 0.15 0.24 0.12 0.22 RMSE Median 348 462 169 279 Mean 349 491 204 356 S.d. 62.0 186 120 204 Min 188 202 68.2 11.6 Max 547 1383 1081 895 5% 256 252 102 114 95% 455 869 422 754

The table shows medians, means, standard deviations, minima, maxima and 5% and 95% percentiles of the out-of-sample MAE, MAPE and RMSE per house type. These are obtained from 300 randomized cross-validations.

Histograms of the MAPE’s of the best performing models for each type are given

in Figure 5. These also indicate that which project are used as the test data can have

(50)

Figure 5: Best model out-of-sample MAPEs per type 0 10 20 30 0.0 0.1 0.2 0.3 0.4 0.5 MAPE Frequency

(a) Terrace houses

0 5 10 15 0.0 0.1 0.2 0.3 0.4 0.5 MAPE Frequency (b) Apartments 0 10 20 30 0.0 0.1 0.2 0.3 0.4 0.5 MAPE Frequency (c) Semi-detached houses 0 5 10 15 20 0.0 0.1 0.2 0.3 0.4 0.5 MAPE Frequency (d) Detached houses

The histograms show the distributions of the out-of-sample MAPEs (as a proportion) for the best model per type, obtained from 300 randomized cross-validations. The best performing model is the Random Forest for terrace and detached houses and LMNET for apartments and semi-detached houses.

(51)

10 Results for Predictive Rule Ensembles

Now that we have established the accuracy and robustness of the mixed models, we investigate whether or not RuleFit is able to outperform the weighed model and its best performing components. To do this, we begin by searching for a suitable choice for the hyperparameters for each house type in the following subsection. Next, we

perform randomized cross-validations of the RuleFit model in Subsection 10.2. We

use the results to find candidate interaction terms for each house type to be added to the RuleFit models, and perform another series of randomized cross-validations on the

resulting models in Subsection10.3.

10.1 Determining the Hyper-Parameters

We run multiple cross-validations of the RuleFit model to determine a suitable choice for

the set of hyper parameters Q, η, ν and ω. Table16in the appendix shows the medians

and standard deviations of the performance measure per house type for Q = {K/3, K} and η = {N/2, N }, where we fix the learnrate at ν = 0.01 and use no winsorizing, ω = 0. Each run consists of 50 randomized cross-validations. To determine which of these runs has the best overall out-of-sample performance, we first count how often each run gives the lowest median and standard deviation for each performance measure. We prefer the run which gives results in minimum values most often. In the case of a draw, which we have for detached houses, we prefer lower medians over lower standard deviations. Thus, we find that we obtain the best median out-of-sample performance when we use Q = K/3 for all types. For Terrace houses, we obtain the best median performance with η = N/2, while η = N is preferred for the other types.

Similarly, Table 17 in the appendix shows medians and standard deviations of the

(52)

and η = N . We now find that ν = 0.01 gives the best performance for each house type. The optimal winsorizing is found to be ω = 0.005 for terrace and semi-detached houses, ω = 0.025 for apartments and ω = 0 for detached houses.

Based on these results, we run a final series of 30 randomized cross-validations where we fix Q = K/3 and ν = 0.01, as these were found to consistently give the best results in the previous two runs. Thus, we now vary η = {N/2, N } and ω = {0, 0.005, 0.025}. The medians and standard deviations of the performance measures per house type for

this run are given in Table 18. For terrace houses, we again find that η = N/2 and

ω = 0.005 gives the best out-of-sample performance measures. For apartments, we find that the best performance is again obtained when we use η = N , but the optimal winsorizing is now found to be ω = 0 instead of ω = 0.025. For semi-detached houses we have the opposite situation, where we again find ω = 0.005 but now find that η = N/2 gives better results.

For detached houses, our rules for selecting the best out-of-sample performance would imply choosing η = 0.5 and ω = 0.005. Here, we find minimal standard deviations for each performance measure as well as a minimal RMSE, giving 4 minimal values in total. However, this choice also gives the highest overall MAE and MAPE, while we find a notably smaller MAE and MAPE when we use η = N, ω = 0.005 compared to all other runs. These are also smaller than the ones we found in all previous runs while varying other hyper-parameters. Meanwhile, the standard deviations of the MAE and MAPE for this choice are only slightly larger, and have medium values compared to the other runs, although the standard deviation of the RMSE is the highest among all other runs. Due to the notably smaller MAE and MAPE, and because η = N also gives a

better result in the randomized cross-validations shown in Table16, we decide to prefer

Predicting Market Prices of Residential Real Estate

Predicting Market Prices of Residential Real Estate

Predicting Market Prices of Residential Real Estate

Contents

1

Introduction

1.1

Structural Overview

2

Literature Review

3

Methodology

3.1

General Prediction Models

3.2

Model Evaluation

4

Mixed Models

4.1

Linear Models

4.2

Decision Tree Models

4.3

Weighed Model

5

Predictive Rule Ensembles

5.1

General Framework

5.2

Finding Candidate Rules

5.3

Choosing Hyperparameters

5.4

Adding Linear Interactions

6

Model Interpretation

7

Data Description

8

Preliminary Data Analysis

8.1

Variable selection

9

Results for Mixed Models

10

Results for Predictive Rule Ensembles

10.1

Determining the Hyper-Parameters