
BACHELOR THESIS

Business Administration - Finance

A comparison of traditional hedonic regression and random forest in mass real estate appraisal and the estimation of listing time effect on house prices

Phuong Anh Hoang - 11773944

First supervisor: Felipe Dutra Calainho

Time of submission: July 2020

Word count: 7046


Statement of Originality

This document is written by Student Phuong Anh Hoang, who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

Table of Contents

1. Introduction
2. Theoretical Framework
   a. Traditional hedonic house pricing – classical multiple regression
   b. Machine learning regressions
   c. Comparison between machine learning and classical regression methods
   d. Listing time effect on house prices
3. Data and method
   a. The data
   b. Traditional multiple regression model
   c. Machine learning regression model
   d. Metrics for model performance
4. Results
   a. Multiple regression model
   b. Random forest
5. Discussions
   a. Model performance and comparison between models
   b. Listing time effect on house prices
   c. Limitations
6. Conclusion


Abstract: This paper applies multiple regression analysis (MRA) and random forest to mass real estate appraisal to compare the explanatory ability and the predictive power of the two methods. The study also examines how each method estimates the effect of listing time on house prices. The analysis employs a Russian housing data set with 292 variables, collected from 2011 to 2015, containing 20,375 observations after cleaning. The results imply that random forest performs better than MRA both in explaining the in-sample variance and in predicting out-of-sample values. Both methods show that listing time contributes little to predicting house prices; however, MRA produces an estimate of the time effect that is more easily observed and interpreted.

1. Introduction

On December 5th, 2017, according to the New York Times, AlphaZero, a machine learning algorithm, was claimed to be the world's best player not only in chess but also in shogi (also known as Japanese chess) and Go. The algorithm was developed by researchers at DeepMind, the artificial intelligence company owned by Alphabet Inc. It started with only the basic rules of the game, played against itself millions of times, and learned as it went along (Strogatz, 2018).

In recent years, the world has witnessed rapid and strong growth in the field of machine learning. Numerous algorithms have been developed, solving tasks ranging from customer segmentation and sales prediction in business analytics, to voice and facial recognition in smart devices, to even more advanced cognitive tasks such as stock picking and diagnosing diseases. They gave rise to artificial intelligence assistants such as Siri, Alexa and Cortana, to virtual online customer support, and to AlphaZero, "the best player, human or computer, the world has seen" (Strogatz, 2018; Brooks, 2014).

Machine learning has also been applied in various studies to predict house prices. Traditionally, house pricing follows the hedonic method, an ordinary-least-squares approach in which a house price is treated as a bundle of different features. These features are assessed separately and then combined to determine the final house price using multiple regression analysis (MRA) (Sirmans, Macpherson, & Zietz, 2005). MRA models, however, rest on strict assumptions: that the relationship between the features and the price is linear, that the variables are normally distributed, and that the data points are homoscedastic. Hence, in more recent years, machine learning methods have been developed to overcome these problems. Some frequently used methods are artificial neural networks (ANN), decision trees, random forests, and multivariate adaptive regression splines (MARS). One of the most important reasons why machine learning is often preferred is that it can map complex relationships and interdependencies between the variables and handle non-linear relationships (Mullainathan & Spiess, 2017).


Nevertheless, the superior performance of machine learning methods over traditional regression is still debatable. While some papers found that machine learning methods provide better estimates of house prices (Peterson & Flanagan, 2009; Tay & Ho, 1992; Limsombunchai, 2004), others found little to no improvement (Guan, Zurada, & Levitan, 2008). Worzala, Lenk, and Silva (1995) even found that neural networks underperform MRA, owing to the inconsistent results obtained from different software packages, and concluded that the method is not suitable for mass appraisal. The purpose of this paper, therefore, is to once again compare the performance of the two approaches: the traditional hedonic regression pricing model and machine learning, with random forest regression representing the latter. In addition, this research tests how well each model assesses the effect of a specific variable, listing time (the year in which the property is listed on the market). Is there any difference in prices between the years? How much does listing time contribute to the predictive power of the models? And most importantly, which model provides a more meaningful estimate of the listing time effect?

In the following sections, we discuss the basics of the traditional hedonic pricing model and of machine learning algorithms, the advantages and disadvantages of each approach, and comparisons of the methods in previous studies. Some foundations of the listing time effect on house prices are also presented. The multiple regression analysis and random forest are then estimated on a large Russian housing data set. This data set was part of a competition on Kaggle, a large community for data scientists, which challenged analysts to build an algorithm that forecasts realty prices from a wide range of features. A detailed description of the data set, the logic of the algorithms, how the models are built, and how their performance is evaluated follows in the third section, Data and method. Lastly, we present the results, discuss their interpretation, and provide a conclusion.

2. Theoretical Framework

a. Traditional hedonic house pricing – classical multiple regression

Hedonic pricing is a common valuation method used in real estate, in which the price is broken down into components and analyzed using ordinary least squares regression (Sirmans et al., 2005). Although Goodman (1998) noted that the model was first introduced by Andrew Court in 1939, when he tried to estimate the price of cars, Colwell and Dilmore (1999) found that similar models had already been implemented by G. C. Haas in 1922 and H. A. Wallace in 1926. Lancaster (1966) attempted to break away from the traditional notion of treating a good as a direct object of utility, and considered it instead as an aggregation of the utilities of its different components. Sharing this view, Sirmans et al. (2005) stated that housing cannot be easily priced as a homogeneous good, but rather should be priced as a bundle of different characteristics. Furthermore, Malpezzi (2003) (cited in Sirmans et al., 2005) added that hedonic pricing also aims to solve the problem arising from heterogeneous customer opinions, meaning that different customers value the characteristics differently.

Sirmans et al. (2005) summarized the characteristics most commonly used in hedonic models to determine house prices, some of which are used in this paper. The features that appear most frequently in previous research include lot size, square footage, age of the estate, and number of rooms. Proximity to schools, distance from the city center, and time on the market are also often examined.

According to Sirmans et al. (2005), hedonic pricing models are typically single-stage equations, which assume that the independent variables are exogenous and do not examine the parameters that construct these variables. The model used in determining house prices is often log-linear, taking the natural log of the house price as the outcome variable; the effect of each characteristic on the price is determined by estimating its coefficient. This is in line with the model suggested by Andrew Court, in which he took the natural log of car price and regressed it against dry weight, wheelbase, and advertised horsepower. The semi-log form in Court's case outperformed the linear model simply because it produced a more linear relationship and higher sample correlations (Goodman, 1998). Follain and Malpezzi (1980) (cited in Sirmans et al., 2005) further extended the advantages of the semi-log form. Firstly, each coefficient can be interpreted as the approximate percentage change in the dependent variable for a one-unit change in the independent variable, which makes it more relevant for analysis. Furthermore, this form deals with heteroscedasticity better than the linear model.
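To make the semi-log interpretation concrete, here is a minimal sketch in R (the thesis does not disclose its software, so R is assumed in this and the later sketches; the data frame houses and its columns price, sqft and age are hypothetical placeholders):

```r
# A minimal semi-log hedonic sketch; `houses`, `price`, `sqft` and
# `age` are hypothetical placeholders, not the thesis's data.
fit <- lm(log(price) ~ sqft + age, data = houses)
summary(fit)

# A coefficient b on sqft is the approximate fractional change in
# price per extra square foot; exactly, 100 * (exp(b) - 1) percent.
b <- coef(fit)["sqft"]
100 * (exp(b) - 1)
```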

b. Machine learning regressions

Machine learning, in essence, is the way a computer learns from experience so that it improves its efficiency while executing a program (Michie, Spiegelhalter, & Taylor, 1994). Using semi- or non-parametric models, machine learning methods can classify observations into categories or predict numerical values (Steurer & Hill, 2020). Applications of machine learning vary from "traditional problems such as speech recognition, face recognition, handwriting recognition, medical data analysis, and game playing" to "new kinds of problems, including knowledge discovery in data-bases, language processing, robot control and combinatorial optimization" (Dietterich, 1997). Machine learning applications have appeared with increasing frequency in recent years; examples include virtual assistants such as Siri, Cortana, and Alexa, and Facebook's automatic photo tagging using face recognition (Mullainathan & Spiess, 2017).

While the classical hedonic method described above estimates the effect of each independent variable on the dependent variable and bases predictions on these effects, machine learning methods search for interactions between the independent variables and predict house prices from those (Mullainathan & Spiess, 2017). According to James, Witten, Hastie, and Tibshirani (2017), machine learning can fundamentally be divided into two main types: supervised and unsupervised learning. In a broader view, there is a third type known as reinforcement learning, in which the algorithm receives rewards or punishments along its learning progress (Ghahramani, 2003).

James et al. (2017) explained supervised learning as methods in which each value of the input variable has a corresponding output, and the aim is to predict output values for new inputs. Examples of supervised methods are decision trees, random forest, and naïve Bayes. In unsupervised learning, by contrast, there is no corresponding output value for each input; the main task is clustering, putting the observations into groups. Examples of unsupervised learning methods are k-means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Gaussian mixtures.

c. Comparison between machine learning and classical regression methods

Although many recent papers prefer machine learning methods to linear regression, each approach has its own pros and cons.

According to both Peterson and Flanagan (2009) and Kok, Koponen, and Martinez-Barbosa (2017), supporters of linear regression argue that the method is straightforward, simple to implement, and easy to interpret. Furthermore, by estimating a coefficient for each independent variable, linear models show the contribution of each variable to the total value, which allows the effects of changes in model composition to be detected and analyzed (Calhoun, 2001). However, the traditional hedonic model does not work well with categorical variables: to estimate their effects, each category must be transformed into a binary dummy variable, so when a data set contains numerous categorical variables the regression matrix can be dominated by dummies. To avoid this, analysts using linear regression tend to include fewer categorical variables and thus risk losing valuable information. Analysts also often run into model misspecification, as the effects of the independent variables on the dependent variable are not always linear (Peterson & Flanagan, 2009). In addition, since the hedonic model is a single-stage model, it often fails to take into account interaction effects between the independent variables, as well as multicollinearity; examining each pairwise effect is cumbersome and simply not feasible when there are many independent variables (Mullainathan & Spiess, 2017; Zurada, Levitan & Guan, 2011; Tay & Ho, 1992).

On the other hand, machine learning methods have their own advantages and disadvantages. They are often preferred in recent research thanks to their self-learning ability to uncover complex structures, nonlinear relationships, and interdependencies among the variables that are not specified beforehand (Mullainathan & Spiess, 2017; Tay & Ho, 1992). Artificial neural networks (ANN), a method frequently used in this field, for example, can solve complicated problems by inducing black-box rules without a specific algorithm (Tay & Ho, 1992). Moreover, some models, such as decision trees, are rather easy to understand and interpret (Kok et al., 2017). Nevertheless, since most machine learning methods maximize their performance on the training data set, they tend to overfit and thus lower out-of-sample performance. Furthermore, machine learning models can be run on different software packages, which are based on different algorithms and performance criteria; as a result, a model can yield varying results across packages (Worzala et al., 1995). Last but not least, as Worzala et al. (1995) noted, the setup for machine learning methods is complicated, the best result must be reached through trial and error, and the run times can be quite long.

When comparing the performance of MRA with that of machine learning regressions, various papers have found conflicting results. Peterson and Flanagan (2009), using a data set of 46,000 observations, Tay and Ho (1992), using 1,055 observations, and Limsombunchai (2004), using 200 observations, all employed ANN in their analyses and all found significant improvement over traditional MRA. Kok et al. (2017) used two large data sets, one with over 30,000 observations and the other with over 50,000, and employed MRA and ensemble-of-trees methods (random forest, gradient boosting, and XGBoost). They found that the machine learning methods beat traditional MRA on all measures.

On the contrary, Worzala et al. (1995), using ANN on a small data set of 288 observations, found no improvement in predictive power. They concluded that ANN was not suitable for mass appraisal because of the inconsistent results produced by multiple runs within the same machine learning package and across different packages; additionally, they noted that ANN took significantly longer to produce results. Rossini (1997) found that neural networks were more suitable for smaller data sets and that traditional regression outperformed them on large data sets. He shared the view of Worzala et al. (1995) that ANN took a long time and was not viable for commercialization. Guan et al. (2008) adopted an adaptive neuro-fuzzy inference system (ANFIS) and even showed inferior performance of ANFIS compared to MRA.

In contrast with Rossini's (1997) results, Nguyen and Cripps (2001) ran ANN on 3,906 observations and concluded that with a sufficiently large data set ANN tends to outperform MRA; otherwise, the results are mixed.

d. Listing time effect on house prices

Sirmans et al. (2005) found in their systematic review that the time trend is one of the most frequently examined variables in research on this topic; nevertheless, it is found insignificant the majority of the time. According to one of the foundations of theoretical housing economics described by Meen (2002), in the case of a positive demand shock in real estate, the real house price (adjusted for inflation) increases temporarily due to a shortage in supply, but in the long run it changes only in line with construction costs. If the costs of housing construction rise only with the general increase of the market due to inflation, then the real house price should be expected to remain constant.

Additionally, Guan et al. (2008) noted that nominal house prices are likely to change over the years because of inflation, which in turn influences the effect that other variables have on the house price. They suggested three approaches to this problem: (1) adjusting house prices using a formula for inflation, (2) analyzing house prices within a single year only, and (3) including the listing year as an attribute in the analysis. While Guan et al. (2008) took the first approach, this paper follows the third.
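A small sketch of the third approach (again assuming R; the data frame df and its timestamp column are hypothetical placeholders): the listing year is stored as a categorical variable so that a later regression expands it into year dummies.

```r
# Approach (3): include the listing year as a categorical attribute.
# `df` and its `timestamp` column are hypothetical placeholders.
df$time_year <- factor(format(as.Date(df$timestamp), "%Y"))
# In a later lm() call this factor expands into dummy variables for
# 2012-2015, each measured against the base year 2011.
```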

3. Data and method

a. The data

The data was provided by Sberbank, one of Russia's largest banks, which intends to predict Russian house prices to give its customers greater certainty when signing large rental, purchase, or investment contracts. The data set contains 38,133 observations on 292 variables and was collected from 2011 to 2016; however, no prices were recorded for 2016.

Data cleaning and feature selection

After examining the data set, a few points are worth highlighting: (1) this is a large data set, so feature selection is challenging; (2) although there are 38,133 observations, there are ample non-available (NA) values; and (3) a large number of variables are repetitive. For instance, one group of variables counts the number of cafés within 0.5, 1, 3, and 5 kilometers, and each of these variables includes sub-variables for price range. Selecting repetitive variables leads to regression matrices dominated by the same type of variable and makes the analysis less effective.

To select the features, previous studies were reviewed to find the most frequently used variables, after which stepwise feature selection was used to choose the features that add value to the model. Variables that frequently appeared in previous research, according to Sirmans et al. (2005), include: lot size, square footage, age, stories, construction structure, number of rooms, bathrooms, bedrooms, fireplace, air conditioning, distance, time on the market, and time trend.

Based on this systematic review, the 292 variables were narrowed down to 30, including size, floor, timestamp, distances (to school, metro station, kindergarten, market, church, etc.), number of schools, number of hospital beds, presence of nuclear reactors, and presence of dirty industries. The location of the realty is also a desired feature for the analysis. However, the sub_area variable (name of the district) has 105 categories, which would lead to the coefficient table being dominated by this variable alone. Therefore, instead of including district names as attributes, other variables that describe specific characteristics of the location were added to the model, for instance the presence of nuclear reactors, of dirty industries, or of universities in the top 20.

The number of rooms, house material, and build year, which were included in previous studies, are also provided in this data set. However, data for these variables are missing from 2011 until 2013. Given that we want to estimate the effect of time on house prices, including these variables would risk losing a large amount of useful information; therefore, they are not included in the model.

From these 30 variables, forward stepwise selection was used to determine the best-fitting linear model, which includes 10 variables. The distance variables are in kilometers (km); the currency of the sales price and the unit of total area are unknown. Descriptive statistics for the continuous variables are given in table 1:


Table 1. Descriptive statistics

Variable                                        Min        Max            Mean        Median
Sales price                                     100,000    111,111,112    7,123,035   6,274,411
Total area                                      0.000      5,326.000      54.110      50.000
Distance to metro station                       0.000      59.268         3.554       1.681
Floor (on which the apartment is located)       0.000      77.000         7.667       7.000
Distance to railroad station                    0.028      24.653         4.339       3.214
Distance to universities                        0.000      84.862         6.824       4.232
Distance to business areas/offices              0.000      19.413         1.991       1.053
Number of higher education institutions         0.000      29.000         6.719       5.000
Number of additional education organizations    0.000      16.000         2.897       2.000

Apart from the 9 continuous variables above, there are 2 categorical variables: listing time (in years) and presence of dirty industry (yes/no). Figure 1 presents graphically the relationship between the variable of interest, listing time, and the sales price.


Figure 1. Average house prices through the years

All non-available (NA) values of the selected variables are removed before the analysis. After the removal, the final data set contains 20,375 observations. Since no house prices were recorded for 2016, the time variable is left with only five values, the years 2011 to 2015. This data set is then divided into a training set containing a random 80% of the observations and a test set containing the remaining 20%. The training set is used to train the models to predict house prices, and the test set is used to assess how well the models perform out-of-sample.
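A minimal sketch of this split (R assumed; housing stands for the cleaned data set and the seed is arbitrary):

```r
set.seed(42)                                      # arbitrary seed
n <- nrow(housing)
train_idx <- sample(n, size = round(0.8 * n))     # random 80%
train <- housing[train_idx, ]                     # training set
test  <- housing[-train_idx, ]                    # remaining 20%
```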

b. Traditional multiple regression model

To determine the effect of time on house prices, two multivariate regression models were run: one with all variables, and one excluding the time variable. Initial analysis of the data suggests that all continuous variables are heavily skewed, with long right tails. This may cause problems with heteroscedasticity and lead to biased coefficients. To mitigate the problem, the natural log is taken of all continuous variables to bring them closer to a normal distribution. The two models take the following forms:

Model 1:

$$\ln(price) = \beta_1 \ln(full\_sq) + \beta_2\, time\_year + \beta_3 \ln(metro\_km) + \beta_4 \ln(floor) + \beta_5 \ln(railroad\_km) + \beta_6 \ln(university\_km) + \beta_7 \ln(office\_km) + \beta_8 \ln(higher\_edu) + \beta_9 \ln(additional\_edu) + \beta_{10}\, dirty\_industry + \varepsilon$$

Model 2:

$$\ln(price) = \beta_1' \ln(full\_sq) + \beta_3' \ln(metro\_km) + \beta_4' \ln(floor) + \beta_5' \ln(railroad\_km) + \beta_6' \ln(university\_km) + \beta_7' \ln(office\_km) + \beta_8' \ln(higher\_edu) + \beta_9' \ln(additional\_edu) + \beta_{10}'\, dirty\_industry + \varepsilon'$$

in which price is the sales price; full_sq is the total area; time_year is the listing year; metro_km is the distance to the metro station; floor is the floor on which the apartment is located; railroad_km is the distance to the railroad station; university_km is the distance to universities; office_km is the distance to business areas/offices; higher_edu is the number of higher education institutions; additional_edu is the number of additional education organizations; and dirty_industry is the presence of dirty industry in the area.
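The two specifications might be estimated as follows (a sketch only; the thesis does not publish its code, and the column names follow the definitions above):

```r
# Model 1, with listing-year dummies (time_year stored as a factor).
# Note: zero-valued regressors would need a small offset before
# log() in practice, a detail the thesis does not discuss.
m1 <- lm(log(price) ~ log(full_sq) + time_year + log(metro_km) +
           log(floor) + log(railroad_km) + log(university_km) +
           log(office_km) + log(higher_edu) + log(additional_edu) +
           dirty_industry, data = train)

# Model 2: the same specification without the time variable.
m2 <- update(m1, . ~ . - time_year)
```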

Initial analysis also suggests the existence of large, influential outliers that significantly affect the linearity of the models. Two methods of removing them were considered: trimming house prices using the quartiles and the interquartile range, or using Cook's distance, a measure commonly used to identify influential points in multivariate regressions. The initial analysis showed that Cook's distance is the more efficient way to remove the influential data points.

Cook (1977) suggested that, to find the influence of a data point on the estimated coefficients of a model, we can calculate the coefficients (betas) before and after removing this data point, and then measure the distance between the betas "in terms of descriptive levels of significance". The formula for Cook's distance is given by:

$$D_i = \frac{(\hat{\beta} - \hat{\beta}_{-i})'\, X'X\, (\hat{\beta} - \hat{\beta}_{-i})}{p\, s^2}, \qquad i = 1, 2, \ldots, n$$

in which $D_i$ is the distance, $\hat{\beta}$ is the vector of betas estimated before removing the $i$-th data point, $\hat{\beta}_{-i}$ is the vector estimated after removing it, $p$ is the number of coefficients in the model, $s^2$ is the residual mean square of the fitted model, and $X$ is the $n \times p$ matrix of known constants. For this paper, any point whose Cook's distance exceeds 4 times the mean distance is removed. After removing the influential points, the two regressions are run again to obtain the final results.
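In R, this filtering step could look as follows, continuing the sketch above:

```r
# Drop observations whose Cook's distance exceeds 4 times the mean,
# then refit; shown for model 1 (model 2 is treated the same way).
d <- cooks.distance(m1)
train_clean <- train[d <= 4 * mean(d), ]
m1_clean <- update(m1, data = train_clean)
```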

To obtain the coefficients, heteroscedasticity-robust standard errors were used so that heteroscedasticity does not distort the inference. The models are then used to predict the house prices of the observations in the test data set, and the difference between the predictions and the actual prices measures the predictive power of each model. The metrics for prediction performance are discussed separately in the following section.
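The thesis does not name its robust estimator; one common choice in R is a heteroscedasticity-consistent covariance from the sandwich and lmtest packages, which is assumed in this sketch:

```r
library(sandwich)  # heteroscedasticity-consistent covariance
library(lmtest)    # coeftest()

# Coefficients with robust (HC1) standard errors.
coeftest(m1_clean, vcov = vcovHC(m1_clean, type = "HC1"))

# Out-of-sample predictions, back-transformed to the price scale
# (this simple exp() ignores the retransformation bias).
pred <- exp(predict(m1_clean, newdata = test))
```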


c. Machine learning regression model

Since the aim of this paper is to build a model that predicts house prices, a supervised learning approach is appropriate. Random forest is a powerful predictive method that has been used widely in research in this field. How random forest is used in this paper is described below, but first we discuss the method it builds on, the decision tree:

Decision tree: the decision tree, a supervised machine learning method frequently used in research, presents its results as an upside-down, tree-like map. Three major advantages of decision trees are worth mentioning: (1) they can map non-linear relationships; (2) they can handle both classification and regression problems; and (3) their graphical output is straightforward and easily interpreted. Figure 2 gives a simple example of a decision tree.

Figure 2. Simple decision tree model (reconstructed based on James et. al., 2017)

Referring to figure 2, three main concepts of the decision tree need to be clarified. Conditions for splitting, or tests on attributes, are represented by the root node and the internal nodes. The root node, containing the whole data set, is where the tree begins; it is split into internal nodes, which are in turn split into further internal nodes or into leaf nodes. In a classification tree, a leaf node represents a class label of the dependent variable that we are trying to predict, while in a regression tree it holds a numerical value of the dependent variable.


The logic behind the decision tree follows a top-down approach called recursive binary splitting, which divides the sample into two parts using a cost function, as sketched below. At each node, the algorithm tries the candidate attribute conditions and calculates the cost of each possible split; the attribute that yields the lowest cost, and hence the most accurate split, is selected. Recursive binary splitting does not, however, consider which choice will lead to the best tree in later steps, which is why it is also referred to as a greedy algorithm. Each node is split into two branches until a value of the dependent variable is reached.
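The sketch below makes the cost function concrete for a regression tree. It is illustrative only (not the thesis's code): it finds the best binary split of a response y on one numeric predictor x by minimizing the combined residual sum of squares of the two halves.

```r
# Greedy search for the best binary split of y on one predictor x:
# the cost of a candidate threshold is RSS(left) + RSS(right).
best_split <- function(x, y) {
  rss <- function(v) sum((v - mean(v))^2)
  thresholds <- sort(unique(x))[-1]   # splits between observed values
  costs <- sapply(thresholds, function(s) rss(y[x < s]) + rss(y[x >= s]))
  list(threshold = thresholds[which.min(costs)], cost = min(costs))
}
```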

Decision trees face two major drawbacks. Firstly, since there can be an unlimited number of splits, they are prone to overfitting, meaning high out-of-sample error. Secondly, they are non-robust, in the sense that even a small change in the data set can lead to a large change in the result. Several techniques alleviate these problems: constraining the depth of the tree, pruning (finding the optimal sub-tree with the lowest test error rate), or using ensembles of trees (Kok et al., 2017; James et al., 2017).

Random forest: random forest is one of the most widely used ensembles of trees and is able to overcome the disadvantages of the single decision tree. Using the bootstrap, multiple samples are created from the one data set, and each tree in the forest is grown fully on a bootstrapped sample (meaning the trees are not pruned). Averaging the predictions of all trees then yields the estimate of the dependent variable.

A major advantage of random forest worth mentioning is that its trees are decorrelated. To understand the term "decorrelated", consider a scenario in which, among $p$ predictors, there is one strong predictor and a few moderate ones. If at every node of every tree the algorithm could use this strong predictor, all trees in the ensemble would look similar. The random forest algorithm overcomes this by considering only a random subset of $m$ of the $p$ predictors at each split, so that only a fraction $m/p$ of the splits can consider the strong predictor (with $m$ the number of predictors considered as split candidates at each node). As a result, the trees in the forest are not correlated, and averaging them reduces the variance of the predictions (James et al., 2017).

Apart from the key performance indicators (KPIs) described in the following section, the performance of a random forest can be measured by the out-of-bag (OOB) error, a prediction error computed within the training set. Each tree in the forest is trained on roughly two-thirds of the observations in the training set (its bootstrap sample), and the remaining one-third (the out-of-bag observations) is used to test that tree's predictions. To obtain a single OOB error for the whole forest, the predicted responses are averaged.

The random forest model was built through multiple rounds of trial and error. It was observed that increasing the number of trees from 200 to 500 does not improve the OOB error much, while reducing the number of trees below 200 worsens all other key performance indicators (KPIs) of the model. Similarly, the KPIs are best when the number of split candidates at each node is around 4 or 5. The final random forest therefore includes 200 trees and considers 4 predictors as split candidates at each node. For an unbiased comparison, this model is run on the data set of multiple regression model 1 after the outliers have been removed by Cook's distance. As with the MRAs, the fitted random forest is then used to predict the house prices in the test set, and the prediction performance measures are calculated to compare the methods.
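With the randomForest package in R (an assumption; the thesis does not name its implementation), the final configuration would look roughly as follows:

```r
library(randomForest)

# The ten predictors of regression model 1, on the Cook's-distance-
# cleaned training set; whether variables were log-transformed for
# the forest is not stated in the thesis, so raw values are used.
set.seed(42)  # arbitrary
rf <- randomForest(price ~ full_sq + time_year + metro_km + floor +
                     railroad_km + university_km + office_km +
                     higher_edu + additional_edu + dirty_industry,
                   data = train_clean,
                   ntree = 200,        # final choice: 200 trees
                   mtry = 4,           # 4 split candidates per node
                   importance = TRUE)  # record permutation importance
print(rf)  # reports OOB error and % variance explained
```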

Last but not least, to determine the contribution of time to predicting the house price, the variable importance plot of the random forest model is examined. The variable importance plot shows how much each predictor contributes to the model: the features are ranked by the percentage increase in the mean squared error (%IncMSE) observed when a feature's information is taken away (in practice, by randomly permuting its values), so a higher percentage represents a higher importance.
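In the randomForest package the same ranking can be read off directly, continuing the sketch above:

```r
importance(rf, type = 1)  # %IncMSE per predictor
varImpPlot(rf, type = 1)  # ranked dot plot of the same values
```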

d. Metrics for model performance

Steurer and Hill (2020) summarized the indicators of performance for automated valuation methods. This paper employs the following three metrics:

R2: used in this paper as a measure of the in-sample goodness of fit and robustness of the model; the higher the R2, the better.

$$R^2 = 1 - \frac{\text{Residual sum of squares (RSS)}}{\text{Total sum of squares (TSS)}}$$

Root Mean Squared Error (RMSE): measures the accuracy of the out-of-sample predictions; the lower the RMSE, the better.

$$RMSE = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (p_n - \hat{p}_n)^2}$$


Mean Absolute Percentage Error (MAPE): measures the error rate of the out-of-sample predictions in percentage terms; the lower the MAPE, the better.

$$MAPE = \frac{1}{N} \sum_{n=1}^{N} \left| \frac{\hat{p}_n - p_n}{p_n} \right| \times 100$$

In the formulas above, $p_n$ is the realized value of the dependent variable, $\hat{p}_n$ is its prediction, $n$ indexes the observations, and $N$ is the number of observations.
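All three metrics are straightforward to compute; a sketch in R, given a vector of realized prices p and predictions p_hat:

```r
# R^2 = 1 - RSS/TSS, computed in-sample.
r2   <- function(p, p_hat) 1 - sum((p - p_hat)^2) / sum((p - mean(p))^2)
# RMSE in the units of the sales price.
rmse <- function(p, p_hat) sqrt(mean((p - p_hat)^2))
# MAPE as a percentage.
mape <- function(p, p_hat) mean(abs((p_hat - p) / p)) * 100
```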

4. Results

a. Multiple regression model

Figures 3 and 4 below show the Cook's distance plots of the two multiple regression models used in this paper.

Figure 3. Influential observations by Cook's distance, MRA model 1

Figure 4. Influential observations by Cook's distance, MRA model 2

After removing all influential data points with a Cook's distance larger than 4 times the mean Cook's distance, model 1 is run on 15,471 observations and model 2 on 15,562 observations. Table 2 below summarizes the results of the two models with robust standard errors.

Table 2. Results of regressions

Regressor            Regression 1   Regression 2
Intercept            12.230         12.494
Full_sq              0.893***       0.913***
2012                 -0.024
2013                 -0.001
2014                 0.046
2015                 0.115***
Metro_km             0.096***       -0.095***
Floor                0.034***       0.030***
Railroad_km          -0.002         -0.002
University_km        -0.058***      -0.057***
Higher_edu           0.070***       0.072***
Additional_edu       -0.084***      -0.085***
Dirty_industry       0.136          0.093
Office_km            -0.021***      -0.022***

Summary statistics
R2                   0.4644         0.4365
Adjusted R2          0.4639         0.4361
N                    15,471         15,562
RMSE                 1,996,499      2,031,308
MAPE                 17.59%         17.88%

"***", "**", "*" denote significance at the 0.1%, 1%, and 5% levels, respectively.

The results of regression 1 show that 7 out of the 8 continuous variables have significant effects on the house price. The most influential variable is the total area, with a coefficient of 0.893, meaning that a 1% increase in total area is associated with a 0.89% increase in the house price. The distance to the railroad station and the presence of dirty industry are found to have no significant effect on the house price.

Our variable of interest, time, is represented by 4 binary dummy variables: 2012, 2013, 2014, and 2015. Among these, only 2015 is significant, meaning that only the house prices in 2015 differ significantly from those in 2011; there is no detectable difference in house prices between 2011 and 2014. The coefficient of 0.115 for 2015 means that sale prices in 2015 are on average roughly 12% higher than in 2011 (in a semi-log model the coefficient is only an approximation of the percentage change; exactly, exp(0.115) − 1 ≈ 0.122).

In regression 2, after removing the time attribute, 7 out of 9 variables are significant. Total area remains the most influential variable, with a slightly higher coefficient of 0.913. Distance to the railroad station and presence of dirty industry stay insignificant. Both R2 and adjusted R2 decrease slightly after removing the time variable. The root mean squared error and the mean absolute percentage error are 1,996,499 and 17.59% for model 1, and 2,031,308 and 17.88% for model 2, respectively.

b. Random forest

The random forest model is run on the data set of the first MRA after the influential data points have been removed by Cook's distance, which contains 15,471 observations. Figures 5 and 6 illustrate the results of the random forest model. Figure 5 shows the OOB error, which decreases as the number of trees in the forest increases; the OOB root mean squared error is 2,431,960. The R2 of the random forest model is 0.7257.


Figure 5. Random forest out-of-bag error

Figure 6 shows the contribution of each variable to the predictive power of the model, ranked by the percentage increase in the mean squared error (%IncMSE). Total area is again the variable that explains the most of the outcome variable, with an increase of 150% in the MSE if it is removed. It is followed by the distance to the metro station, the distance to universities, and the distance to offices and business areas. Although the distance to the railroad station was found insignificant in the first linear regression model, it ranks 5th in importance for the random forest model. Removing our variable of interest, listing time, increases the MSE by around 25%, so it ranks 7th on the importance plot. Similar to the result of the linear regressions, the presence of dirty industries does not contribute to the predictive power of the model.


The prediction performance of the random forest on the test data is as follows: the root mean squared error is 1,457,020 and the mean absolute percentage error is 11.23%.

5. Discussions

a. Model performance and comparison between models

Both MRA and random forest agree that the total area of the property contributes the most to its price. In both approaches, the distance to the metro station, universities, offices and business centers, and the numbers of higher education institutions and additional education organizations have moderate predictive power. The presence of dirty industry is found by both methods to have no impact on the house price. While the floor on which the apartment is located has a stronger effect than the distance to offices in the regression, it does not contribute much to the random forest model.

To examine and compare the performance of the classical linear regression with random forest, key performance indicators are summarized in table 3.

Table 3. Summary of key metrics for performance

Metric        Regression 1   Regression 2   Random Forest
R2            0.4644         0.4365         0.7257
Adjusted R2   0.4639         0.4361
RMSE          1,996,499      2,031,308      1,457,020
MAPE          17.59%         17.88%         11.23%

For an unbiased comparison of which model performs better, the KPIs of the random forest and the first regression are examined. The R2 values are 0.4644 for the first regression and 0.7257 for the random forest, meaning that the random forest explains a considerably higher portion of the variance in the outcome variable. The RMSE of the random forest is about 27% lower than that of the first regression, and the MAPE likewise indicates that the random forest outperforms MRA out-of-sample, though by a smaller margin than the in-sample comparison suggests. Within this data set we can therefore conclude that, while the random forest considerably outperforms the first linear regression in explaining the variance of the training set, the RMSE and MAPE show a more modest advantage in predicting the test set. This result is partly in line with Nguyen and Cripps (2001), who found that with a large data set machine learning tends to perform better than classical regression.


While Worzala et al. (1995) and Rossini (1997) argued that machine learning methods are not suitable for commercialization due to long run times and inconsistent results, much time has passed since those papers were published and machine learning packages have improved considerably. Although the random forest in this paper did take longer to run than the MRA, and the run time increases with the number of trees in the forest, a model with 500 trees took no longer than 5 minutes. Random forest is therefore suitable for wide use.

b. Listing time effect on house prices

Inspecting the effect of the time variable, the coefficients of the year dummies for 2012, 2013, and 2014 in multivariate regression 1 are insignificant, meaning that house prices in those years do not differ statistically from those in 2011. Nevertheless, house prices in 2015 are significantly higher than in 2011, by roughly 12% (coefficient 0.115). When the time variable is removed from the regression, R2 drops slightly from 0.4644 to 0.4365, and adjusted R2 from 0.4639 to 0.4361. In addition, the variable importance plot of the random forest shows that although time adds some value to the predictive power of the model, it is not one of the strongest predictors.

Recalling the earlier discussion, the result of this paper, showing no change in prices from 2011 to 2014, is in line with the systematic review by Sirmans et al. (2005), which found no significant effect of the time trend on house prices in previous studies. Meen (2002) stated that if the only source of increase in construction costs is inflation, the real price of real estate should stay the same and only the nominal house price increases. Guan et al. (2008) also added that inflation should be taken into consideration when analyzing nominal house prices. The result may hence be explained by a very low inflation rate, which kept the nominal price from deviating much from the real price. Alternatively, it may reflect offsetting changes in the costs of different components due to demand and supply; for example, an increase in labor costs may have been cancelled out by a decrease in material costs.

Nevertheless, the empirical result that prices in 2015 are statistically higher than in 2011 also supports Meen's (2002) point that changes between consecutive years might be unobservable but become more obvious over a longer time span. Moreover, similar to the argument above, the significant increase in 2015 may also be the result of a higher inflation rate in Russia in 2015 or of an increase in demand for construction components.

To summarize, the two completely different automated valuation models provide the same answer: listing time adds little to predicting house prices. However, the effect of time through the years is easier to observe and interpret in MRA; in random forest, the effect of time can only be inferred relative to the other variables.

c. Limitations

The paper faces some limitations, especially concerning the original data set and the attribute selection. First of all, the data contain an excessive number of missing values, which led to a large reduction in the number of observations; the year 2016 could not be included because no house prices were recorded for it. Secondly, the original data set included a substantial number of significant outliers that reduce the performance of our models: an initial MRA on the original data set showed an R2 of around 26%, an RMSE of nearly 4,000,000 and a MAPE of almost 60%, and for the random forest, although the percentage of variance explained was not significantly reduced, the RMSE was close to 3,000,000 and the MAPE around 55%. Last but not least, the copious number of variables, 292, many of them repetitive, made feature selection rather challenging and may have led to important attributes being excluded.

6. Conclusion

This paper contributes to the debate over which method performs better in mass real estate appraisal: the traditional hedonic multiple regression method or machine learning-based methods. In addition to a comparison of performance, it tests how the methods measure the effect of a specific attribute, listing time, on house prices. MRA and random forest analyses were run on a Russian housing data set covering 2011 to 2015 with 20,375 observations after cleaning.

The results obtained are interesting: random forest shows a better performance than MRA. Nonetheless, while random forest significantly outperforms MRA in explaining in-sample variance, its advantage in out-of-sample prediction is smaller. The analysis also updates the view of Worzala et al. (1995) and Rossini (1997) that machine learning methods take considerably longer to run: the random forest in this paper took less than five minutes even with a large model of 500 trees, which makes it suitable for commercialization, even on big data sets. This shows that advances in technology and improvements in the software packages have considerably reduced run times.

Although random forest outperforms MRA in explaining and predicting the price, the effect of time obtained by MRA was easier to distinguish and interpret. The MRA results show that within this data set there is no difference in nominal prices between consecutive years; a difference is only observed over a longer time span. This result may be explained by a small inflation rate in Russia from 2011 to 2015, or by offsetting changes in the demand for and supply of construction components.

This study therefore suggests that, when the goal is solely to predict house prices, machine learning-based methods are more useful, as they give more precise results. However, when determining the contribution of a specific variable to the whole model, traditional MRA tends to be more useful; especially for categorical variables with a modest number of categories, traditional MRA can easily compare the differences between the categories.

A limitation of this paper is that the data set included too many missing values, especially for the year 2016, leading to a less comprehensive conclusion about the time effect on house prices. Future research aimed at estimating the time effect should use data collected over a longer period. It would also be interesting to examine real house prices adjusted for inflation; future studies of the same question should therefore consider the inflation rate and trends in the general economy. Lastly, as mentioned above, it is worth considering the varying changes in the supply of and demand for construction components that may affect house prices over the years.


7. References

Brooks, D. (2014). What machines can't do. The New York Times. Retrieved from https://www.nytimes.com/2014/02/04/opinion/brooks-what-machines-cant-do.html?searchResultPosition=40

Calhoun, C. A. (2001). Property valuation methods and data in the United States. Housing Finance International, 16(2), 12.

Colwell, P. F., & Dilmore, G. (1999). Who was first? An examination of an early hedonic study. Land Economics, 620-626.

Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15-18.

Dietterich, T. G. (1997). Machine-learning research. AI Magazine, 18(4), 97-97.

Ghahramani, Z. (2003, February). Unsupervised learning. In Summer School on Machine Learning (pp. 72-112). Springer, Berlin, Heidelberg.

Goodman, A. C. (1998). Andrew Court and the invention of hedonic price analysis. Journal of urban economics, 44(2), 291-298.

Guan, J., Zurada, J., & Levitan, A. (2008). An adaptive neuro-fuzzy inference system based approach to real estate property assessment. Journal of Real Estate Research, 30(4), 395-422.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). An Introduction to Statistical Learning: with Applications in R, 7th Edition.

Kok, N., Koponen, E. L., & Martínez-Barbosa, C. A. (2017). Big data in real estate? From manual appraisal to automated valuation. The Journal of Portfolio Management, 43(6), 202-211.

Lancaster, K. J. (1966). A new approach to consumer theory. Journal of Political Economy, 74(2), 132-157.

Limsombunchai, V. (2004, June). House price prediction: hedonic price model vs. artificial neural network. In New Zealand Agricultural and Resource Economics Society Conference (pp. 25-26).

Meen, G. (2002). The time-series behavior of house prices: a transatlantic divide?. Journal of housing economics, 11(1), 1-23.

Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning. Neural and Statistical Classification, 13(1994), 1-298.


Mullainathan, S., & Spiess, J. (2017). Machine learning: an applied econometric approach. Journal of Economic Perspectives, 31(2), 87-106.

Nguyen, N., & Cripps, A. (2001). Predicting housing value: A comparison of multiple regression analysis and artificial neural networks. Journal of Real Estate Research, 22(3), 313-336.

Peterson, S., & Flanagan, A. (2009). Neural network hedonic pricing models in mass real estate appraisal. Journal of real estate research, 31(2), 147-164.

Rossini, P. (1997). Artificial neural networks versus multiple regression in the valuation of residential property. Australian Land Economics Review, 3(1), 1-12.

Sirmans, S., Macpherson, D., & Zietz, E. (2005). The composition of hedonic pricing models. Journal of real estate literature, 13(1), 1-44.

Steurer, M., & Hill, R. J. (2020). Metrics for measuring the performance of machine learning prediction models: An application to the housing market. GEP.

Strogatz, S. (2018). One giant step for a chess-playing machine. The New York Times. Retrieved from https://www.nytimes.com/2018/12/26/science/chess-artificial-intelligence.html?searchResultPosition=27

Tay, D. P., & Ho, D. K. (1992). Artificial intelligence and the mass appraisal of residential apartments. Journal of Property Valuation and Investment.

Worzala, E., Lenk, M., & Silva, A. (1995). An exploration of neural networks and its application to real estate valuation. Journal of Real Estate Research, 10(2), 185-201.

Zurada, J., Levitan, A., & Guan, J. (2011). A comparison of regression and artificial intelligence methods in a mass appraisal context. Journal of Real Estate Research, 33(3), 349-387.
