
Performance of Feature Selection on Automated Valuation Models in Real Estate


Performance of Feature Selection on Automated

Valuation Models in Real Estate

Julia Waardenburg 11750197

Automated Valuation Models

Supervisor: Felipe Dutra Calainho

July 15th, 2020

6013B0520Y Bachelor's Thesis and Thesis Seminar Finance for Business BSc Economics & Business Economics, specialization Finance


Statement of Originality

This document is written by student Julia Waardenburg who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

This thesis investigates whether feature selection methods increase the predicting performance of automated valuation models on real estate property prices. The study is based on a dataset containing the transaction prices of residential real estate in Moscow, Russia, in the period August 2013 until June 2015. First, three subsets of the most important variables are selected by three different techniques: Principal Component Analysis; Variable Importance; and the Least Absolute Shrinkage and Selection Operator (LASSO). Next, each subset is applied to a linear regression model and a random forest model. The predicting performance is measured by the out-of-sample percentage of variance in the target variable explained by the subsets of independent variables. The performance of each model restricted by feature selection is compared to a benchmark value, measured by the model containing the full feature set. In total, six restricted results are compared to two unrestricted results. All outcomes are ranked in three model horseraces: a feature selection race, a regression race and a total race. The total model horserace concluded that the best performing model is the unrestricted Ordinary Least Squares linear regression. The winner of the feature selection horserace was Variable Importance applied to the Random Forest model. The regression race showed that Random Forest outperforms Ordinary Least Squares on the restricted feature subsets.

Keywords: feature selection, Principal Component Analysis (PCA), Variable Importance, LASSO, Random Forest, OLS


Contents

1. Introduction ... 5

2. Literature review ... 6

2.1 Linear models ... 6

2.2 Machine learning models ... 8

2.3. Hypothesis ... 10

3. Methodology ... 11

3.1. Statement of Hypothesis ... 11

3.2. Data ... 12

3.3. Feature selection methods ... 14

3.3.1. Principal Component Analysis (PCA) ... 14

3.3.2. Variable Importance ... 15

3.3.3. LASSO ... 16

3.4 Models ... 17

3.4.1 OLS linear regression ... 17

3.4.2 Random Forest ... 18

3.5 Measuring model performance ... 19

4. Results ... 20

4.1. Feature selection performance ... 20

4.1.1. Principal Component Analysis ... 22

4.1.2. Variable Importance ... 23

4.1.3. LASSO ... 24

4.2. Model performance ... 24

4.2.1. Unrestricted ... 25

4.2.2. Restricted ... 25

4.3 Model horserace ... 25

5. Discussion ... 27

5.1. Limitations and Future Research ... 29

6. Conclusion ... 31

7. Bibliography ... 32


1. Introduction

Real estate makes up the largest and most important global asset class. When we want to invest in real estate, we need to determine the market value of the asset, just as we do with any other investment. In order to value real estate, an appraisal is made. There are two kinds of appraisals: individual and mass appraisals. Individual appraisals are done manually by valuing the characteristics of one specific property. For mass appraisals of real estate, there are several Automated Valuation Models (AVMs). The classical method to build these automated models is to use ordinary least squares (OLS) linear regression. Subsequently, they are applied to estimate the market value of real estate objects with numerous similarities in characteristics (Kontrimas and Verikas, 2011).

The usage of linear regression models to predict real estate value is popular. The models are easy to build and interpret, and they show how much value each characteristic adds. However, the choice of characteristics is usually subjective and determined by the appraiser, who is guided by experience. A problem emerging in these types of models is that they are prone to omitted variable bias and the inclusion of redundant variables, leading to model misspecification (Kubus, 2016). Biased outcomes of property value prediction are problematic in today's hyper-efficient capital markets. With the development of more advanced modelling techniques such as machine learning, the question arises whether we are able to construct better performing AVMs (Kok, Koponen & Barbosa, 2017). A recent topic in Finance and Economics is feature selection, a machine learning technique that selects the variables that contribute most to a target variable. Given the rising availability and storage of data, it is interesting to investigate whether these feature selection methods could make better predictions, or even point to characteristics that have not been considered before.

Therefore, this paper will investigate the performance of feature selection methods on Automated Valuation Models in real estate. The subject matter, being very recent, has not been investigated extensively before. There is research investigating the predicting performance of linear models on real estate prices (Sopranzetti, 2010; Schulz, Wersing & Werwatz, 2014), and there are papers comparing linear models to machine learning models in price predicting performance (Kok, Koponen & Martinez-Barbosa, 2017; Peterson & Flanagan, 2009; Kontrimas & Verikas, 2011). The only research into the performance of feature selection is done by Kubus (2016); however, he used a rather small data sample. The relevance of this paper is to extend the boundaries of recent research in machine learning models and


data analysis to isolate the performance of feature selection methods on price prediction of AVMs. In order to do so, the following research question is posed:

How do feature selection methods affect the predicting performance of automated valuation models in real estate?

First, a concise background literature review is carried out. Then, three different feature selection methods are assessed, which provide three subsets of the most important variables. These selections are applied to two AVMs: a linear regression model and a machine learning model. Next, the procedure for measuring model performance is defined. The data used to conduct the empirical analysis is from the housing market in Moscow, Russia, in the period 2011 to 2015.

This paper proceeds as follows. Section 2 reviews previous literature on the topic and derives hypotheses for empirical testing. Section 3 describes the methodology, deliberating how the data is used to conduct empirical research. Section 4 provides the results, and section 5 discusses the interpretations and limitations. Additionally, a suggestion for further research will be given. Section 6 states the concluding remarks.

2. Literature review

This section provides a review of insights and findings from recent literature that investigate AVMs in real estate. The section closes with a presentation of hypotheses to answer the research question. In paragraph 2.1 literature on linear AVMs is discussed. In paragraph 2.2 research on machine learning AVMs is explored. Paragraph 2.3 states the derived hypotheses.

2.1 Linear models

In order to establish the market value of property in real estate, an appraisal is made. These appraisals estimate the price for which the property would be sold. This value is not only used for the purchase and sale of property, but also for tax assessment, investment and financing, portfolio valuation, et cetera. Owners holding real estate as part of an investment portfolio want the appraisal value to be as accurate as possible. A precise valuation is beneficial not only for investors, but also for banks when providing debt. There are several approaches to appraising property. The most adopted methods are: (1) the income approach, which discounts the Net Operating Income (NOI) of the property by a certain yield; (2) the comparison approach, which


Traditional property appraisal uses a capitalization rate, calculated by dividing the NOI by the transaction price. This cap rate is taken from three to five recently sold neighboring buildings and adjusted for differences in characteristics. However, when valuing real estate in large numbers, individually appraising each property is time-consuming and costly. To solve this problem, the value of a property is deconstructed into different characteristics that influence price. These characteristics are used as independent variables in linear regression models, which calculate the added value of each characteristic to the total property value (Sopranzetti, 2010).

Cannon and Cole (2011) find that appraisals deviate 10 to 15 percent from the actual transaction price. Large deviations in appraisals can be costly, as they increase the probability of type I and type II errors. Overvaluing property can lead to denial of credit to creditworthy parties (type I error) and undervaluing property can lead to excess credit risk due to the provision of debt to non-creditworthy parties (type II error). Linear regression models are always exposed to these kinds of errors, due to their assumptions (Peterson and Flanagan, 2009).

Sopranzetti (2010) constructs a linear regression model to estimate real estate prices using characteristics such as square footage, age, number of bedrooms, a dummy variable representing the number of bathrooms, a dummy variable for a pool, whether the property was listed in summer, et cetera. The linear model explained 84.3% of the variation in transaction prices. The findings showed that a pool added substantial value to property prices in Texas, due to hot summers. Moreover, properties listed in summer sold for a higher price than those that were not. The advantage of linear regression models is the possibility to deconstruct the price into different components that provide information about the added value of a specific characteristic, such as a pool. To deal with nonlinearities in the hedonic pricing structure of real estate, Sopranzetti (2010) constructed a semi-log model. A semi-log regression allows the added value of a certain characteristic (e.g. number of bathrooms) to change proportionally with the value of other characteristics (e.g. number of bedrooms). In a linear regression, the added value of a second bathroom is the same for a one-bedroom and a five-bedroom residence. The semi-log model performed slightly better, explaining 87.5% of the variation in transaction prices. The research discusses several problems of linear regression modeling. There exists an identification problem, caused by characteristics influencing price that also correlate with quantity. That is, buyers of property will not only select on the quantity of characteristics, but also on their price. Furthermore, he discusses a theory about the equilibrium pricing problem. Transaction prices of real estate are the presumed 'true' value of property, thus the


reasons, we will assume the transaction price of real estate to be the correct market value.

Schulz, Wersing, and Werwatz (2014) investigate the performance of a linear regression model in predicting the market value of residential real estate in Berlin. They find that outliers in the data result in large appraisal errors. Moreover, they state that when using linear regression models to make predictions, it is crucial to use cross-validation to limit the risk of poor out-of-sample performance. In their paper they also raise the problem of heteroscedasticity in linear regression models. This occurs when the variance of the prediction errors is not constant across the variables, invalidating analysis derived from the model. Supporting the findings of Sopranzetti (2010), who obtained a higher R-squared using a semi-log model, Schulz, Wersing, and Werwatz (2014) find that transforming price logarithmically leads to lower prediction errors.

2.2 Machine learning models

Linear regression models produce good estimates of parameters that explain the relationship between independent and dependent variables. Yet, several problems arise when predicting something influenced by many factors, such as real estate value. Including too many variables in a linear regression can overfit the model. Machine learning models, on the other hand, are able to handle larger numbers of explanatory variables. The disadvantage is that they do not produce regression coefficients that can be interpreted in the same way as in a linear regression model. Nevertheless, machine learning models usually make better predictions of the dependent variable (Mullainathan & Spiess, 2017). The predictive performance of machine learning models on real estate prices has been investigated in various studies.

Kok, Koponen, and Barbosa (2017) compare three different machine learning models to a linear regression based on the Ordinary Least Squares principle. The three machine learning models in question are Random Forest, Gradient Boosting and XGBoost. These models are built as ensembles of decision trees, by dividing the data into smaller subsets and growing many decision trees. The algorithm minimizes the variance of a regression between each combination of the independent and dependent variables and selects the best predictors from the independent variables based on their reduction in variance. This provides the order of variable importance. Decision tree models are able to handle categorical variables without the need to create dummy variables, they can make fast predictions, and they are simple to understand. Using ensembles of trees avoids the problem of over- or underfitting the model, by limiting the trees from growing too deep. To measure the predictive performance of the constructed models, Kok, Koponen, and Barbosa (2017) use robustness and accuracy,


calculated by explained variance (R2) and mean squared error (MSE), respectively. The most

favorable model maximizes explained variance while optimizing accuracy. The findings indicate a significantly better performance of all three machine learning models compared to the linear regression model with the same set of explanatory variables. Including NOI as an explanatory variable in their data, the OLS model yielded an R2 of 0.84, Random Forest performed slightly lower with an R2 of 0.80, and the two gradient boosting models outperformed OLS with R2s of 0.89 and 0.92, respectively. In addition, they find that increasing the data sample to cover a wider region leads to more robustness in the machine learning models, in contrast to Ordinary Least Squares. This corresponds with the findings of Peterson and Flanagan (2009), who created an artificial neural network to estimate real estate prices and also found better performance in larger test samples. This is counterintuitive, as real estate is often perceived as a local business. Kok, Koponen, and Barbosa (2017) show that the better performance in larger samples is due to the fact that machine learning models incorporate detailed locational information (e.g. distance to public transport) better than the linear regression model. These generalizable characteristics seem to be important in estimating property value; thus the advantage of machine learning models not being constrained to the few standard explanatory variables led to outperforming the linear regression model.

Regarding the inclusion of variables, Peterson and Flanagan (2009) argue that machine learning handles categorical variables better than a linear regression model does. Including more categorical variables that linear regression models often have to discard, such as location or material, provides better predictions.

Mullainathan and Spiess (2017) test the performance of different machine learning algorithms, including Random Forest and LASSO, against the performance of OLS in predicting house values from the American Housing Survey. They find that Random Forest yields the highest prediction performance in the training sample (85.1% against 47.3% for OLS), and an ensemble (regression tree, LASSO and Random Forest together) generates the highest out-of-sample performance (45.9% against 41.7% for OLS). They warn, however, that an estimation problem can arise when using machine learning models. It is caused by so-called 'noise', referring to randomness or irrelevant information in the dataset. When the algorithm used to develop the model is too complex (has too many input variables), it can pick up noise from the data and cause the model to overfit. The model will then make biased predictions based on the noise and perform poorly on unseen data.


The findings of Peterson and Flanagan (2009) on the predictive superiority of artificial neural networks over linear regression models are countered by Worzala, Lenk, and Silva (1995). They did not find supporting evidence on the superiority of neural networks over linear regression analysis, based on 288 property transaction prices in Colorado. They also encountered inconsistent results between software packages providing the neural network algorithms.

Kontrimas and Verikas (2011) conducted research on the predicting performance of machine learning models and linear OLS regression on property prices from Lithuania. They found that OLS outperformed the multilayer perceptron neural network model, but did not outperform the support vector machine or an ensemble of the methods. From their results they concluded that a non-linear modeling technique is required to predict real estate value.

Kubus (2016) discusses the importance of feature selection methods for the accuracy of real estate valuation models. There are three groups of feature selection methods: filters, wrappers and embedded methods. Filters drop variables that have a small chance of being useful before the model parameters are estimated. Many filters provide a ranking of features, unlike wrappers, which estimate the best performing feature set to predict the dependent variable. A technique used in wrappers is stepwise regression, which deletes the worst feature or adds the best one (backward elimination or forward selection). Embedded methods perform feature selection simultaneously with model parameter estimation. Using a dataset of 23 transactions of single-family houses in Krakow, Kubus (2016) developed a model including feature selection that gave estimates with smaller variances than OLS estimates. The recursive feature elimination technique gave the largest error compared to the other feature selection techniques.

2.3. Hypothesis

This paper investigates the effect of feature selection on the predicting performance of automated valuation models in real estate. In order to answer this research question, a set of hypotheses will be formulated. The research question is divided into two domains: the automated valuation model part and the feature selection methods part. OLS and Random Forest are chosen as the AVMs in this investigation, as they appear most often in the reviewed literature. Each model will be tested on three chosen feature selection methods. Based on the reviewed literature, it is expected that feature selection methods can increase model performance, and that the machine learning model outperforms the linear regression model. Hypotheses 1 to 6 state the expected outcomes of the feature selection methods on the two


types of models. Hypothesis 1 expects that Principal Component Analysis will increase the predicting performance of an OLS linear regression model, and Hypothesis 2 expects that it will increase the predicting performance of a Random Forest model as well. Hypotheses 3 and 4 state that Variable Importance will increase the predicting performance of an OLS and a Random Forest model, respectively. The final two hypotheses concerning the feature selection methods, Hypotheses 5 and 6, expect an increase in the predicting performance of the OLS and Random Forest models after applying LASSO. Hypotheses 7 and 8 state the expected outcomes of the comparison between the linear regression model and the machine learning model. Hypothesis 7 anticipates a greater predicting performance of an unrestricted Random Forest model compared to an unrestricted OLS linear regression model. Finally, Hypothesis 8 expects that a restricted Random Forest model has a greater predicting performance than a restricted OLS linear regression model.

3. Methodology

This section describes how the research is conducted in order to test the hypotheses. Paragraph 3.1 contains the statistical statement of the hypotheses. Paragraph 3.2 concerns the data, explaining how the sample is selected from the data source; additionally, descriptive statistics are given. Paragraph 3.3 delineates the three feature selection methods used to construct different subsets of features. Paragraph 3.4 specifies the models used for making predictions. Lastly, Paragraph 3.5 describes the technique for testing model performance.

3.1. Statement of Hypothesis

For every hypothesis derived from the literature review, the null hypothesis states that the performance of a restricted model is lower than or equal to the performance of an unrestricted model, implying that feature selection does not improve performance. Model performance is measured in R-squared and will be elaborated further in Paragraph 3.5. Conversely, the alternative hypothesis states that the R-squared of a restricted model will be greater than the R-squared of an unrestricted model, meaning that feature selection improves model performance.

H0 : R²_restricted ≤ R²_unrestricted

H1 : R²_restricted > R²_unrestricted
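The comparison behind these hypotheses can be sketched in Python with scikit-learn. The data, the selected subset, and the models below are purely illustrative stand-ins, not the thesis's actual dataset or feature sets:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 candidate features.
X, y = make_regression(n_samples=500, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

selected = list(range(10))  # hypothetical subset from a feature selection step

# Out-of-sample R-squared of the unrestricted (full) and restricted models.
r2_unrestricted = r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
r2_restricted = r2_score(
    y_te, LinearRegression().fit(X_tr[:, selected], y_tr).predict(X_te[:, selected]))

# H0 is rejected in favour of H1 only if the restricted model scores higher.
feature_selection_helped = r2_restricted > r2_unrestricted
```

Whether the restricted model actually wins depends on the data and the subset, which is exactly what the hypotheses test.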


3.2. Data

To investigate the performance of feature selection methods on real estate valuation models, we make use of the Sberbank database of the Russian housing market, retrieved from Kaggle. The data was originally provided as part of a competition on Kaggle. Sberbank posed this challenge to find a better predictive model than the simple regressions it currently uses for the estimation of property value. The task was to build an algorithm that estimates realty prices using a broad spectrum of features. The data provided included a training set, a set with macroeconomic data, and a test set for which the prices needed to be predicted. For the sake of this research, only the characteristics data containing the actual transaction prices will be used. This 'train' datafile given on Kaggle consists of 30,471 observations on individual residential real estate transaction prices in Russia, with their corresponding characteristics. The observations span the period from August 2011 to June 2015. Initially, there are 292 variables, which include information on transaction prices and real estate characteristics. The variable set contains 15 categorical variables. The residences are apartments. The target variable is price, measured continuously in Russian rubles.

Before starting the analysis, the dataset has been cleaned. It contained many variables with missing observations. Variables with more than 40 percent missing values and no strong correlation with the target variable were removed completely. For categorical variables, dummy variables have been created in order to make the whole dataset numerical; this is needed because the target variable is continuous. After creating the binary variables, rows that still contained missing observations were removed from the data. Lastly, outliers were removed using the box-plot rule. Outliers are observations that differ greatly from the majority of the sample. Feature observations that were more than 1.5 times the interquartile range above or below the quartiles were removed from the data. The outliers in the target variable were deliberately not removed, because price is always positively skewed. This left us with a sample of 11,535 observations and 436 variables. For cross-validation the sample has been split into training and test data. The training data runs from May 2013 till November 2014 (9,115 observations), and the test data from December 2014 till June 2015 (2,420 observations). This gives a division of approximately 25% for the evaluation dataset and 75% to calibrate our models on. Figure 1 shows the median price per month after cleaning the data. Figure 2 shows a correlation heatmap of the 20 variables that have the highest correlation with the target variable, price (price_doc). Appendix figures A and B give descriptive statistics of the most correlated features and a graph showing the percentage of missing values per feature.
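These cleaning steps can be sketched with pandas. The toy frame below only mimics the Kaggle file: `timestamp` and `price_doc` are column names from the actual Sberbank dataset, while the remaining columns and values are illustrative:

```python
import numpy as np
import pandas as pd

# Tiny illustrative stand-in for the Kaggle 'train.csv' file.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2013-08-01", "2014-03-01", "2014-12-01", "2015-06-01"]),
    "full_sq": [40, 55, 38, 70],
    "material": ["panel", "brick", "panel", "brick"],  # categorical feature
    "sparse_col": [np.nan, np.nan, np.nan, 1.0],       # mostly missing
    "price_doc": [5.0e6, 7.2e6, 4.8e6, 9.1e6],         # target variable
})

# 1. Drop features with more than 40% missing values.
df = df.loc[:, df.isna().mean() <= 0.40]

# 2. One-hot encode categorical variables, then drop remaining missing rows.
df = pd.get_dummies(df, columns=["material"]).dropna()

# 3. Remove feature outliers beyond 1.5x the interquartile range.
q1, q3 = df["full_sq"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["full_sq"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 4. Chronological train/test split, as in the thesis.
train = df[df["timestamp"] < "2014-12-01"]
test = df[df["timestamp"] >= "2014-12-01"]
```

In the thesis the target variable `price_doc` is exempt from the outlier filter, which is why step 3 is applied to feature columns only.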


Figure 1. Bar chart of the median price per month


3.3. Feature selection methods

In machine learning, feature selection is used to improve out-of-sample model performance by picking the variables that contribute most to the prediction of the target variable. Generally, there are three approaches to feature selection: filter, wrapper and embedded methods. Filter methods select features individually, based on their relevance according to a proxy measure or threshold. Wrapper methods try to find the best performing subset of features to estimate the target. Embedded methods perform feature selection while building the model. In this analysis, the filter approach is taken for choosing variables. The methods providing the variables are Principal Component Analysis, Variable Importance and LASSO. The following subsections elaborate on how these methods work and on what criteria the features are selected.

3.3.1. Principal Component Analysis (PCA)

The first method used to select the most important variables from the full dataset is Principal Component Analysis (PCA). Jolliffe and Cadima (2016) describe the usage of PCA on large datasets. They state that it reduces the dimensionality and increases the interpretability of the dataset, while minimizing information loss. This is done by creating principal components that explain most of the variance in the data, fitting lines through the observations. The components are not correlated with each other, because each component line is orthogonal to the previous one. This is why PCA is also widely used to solve the problem of multicollinearity. The principal components are linear functions of the original variables, made by projecting the observations onto the principal component line and minimizing the sum of squared distances to the observations. Hence, each principal component selects a set of variables with a minimized error to the component line and assigns them loadings. PCA is largely used to reduce the dimensionality of the data for making predictions, by replacing the original variables with principal components. A certain threshold for the total variance explained by the principal components is chosen. The first principal component explains the most variation in the total data, and the following components are sorted in descending order. The cutoff is made when the components have accumulated the variance threshold value (Jolliffe & Cadima, 2016). For feature selection, we will make use of the loadings to select variables from each component. Because the first component contains the most variance in the data, we will select the variables with the largest loadings from the first component. To perform Principal Component Analysis, the data must be scaled so that each variable has a mean of zero. The loadings are the weights by which each variable in the component is multiplied in order to obtain the component score. The component score reflects the percentage of variance that the component explains in the data. Mathematically, PCA can be written as a data matrix Y decomposed into two matrices X and P. Matrix P represents the loadings matrix, thus the weights for each variable which transform X into Y. Matrix X is the scores matrix and contains the original data in a scaled system. The result is matrix Y, a new representation of the dataset.
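This selection rule can be sketched with scikit-learn (the thesis does not name its library, so this is an assumed implementation) on a synthetic feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))              # illustrative feature matrix
feature_names = [f"x{i}" for i in range(8)]

# PCA requires zero-mean features; here they are also scaled to unit variance.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3).fit(X_scaled)

# Loadings of the first principal component: the weight of each original
# variable in the component that explains the most variance.
first_pc = pca.components_[0]

# Select the k variables with the largest absolute loadings.
k = 4
selected = [feature_names[i] for i in np.argsort(np.abs(first_pc))[::-1][:k]]
```

Using absolute loadings reflects that a large negative weight contributes as much variance to the component as a large positive one.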

Figure 3. Notation of a transformed data matrix decomposed in loadings and scores matrices.

Note. Adapted from “A Tutorial on Principal Component Analysis” by Shlens, J., 2014, April 7. Retrieved from https://arxiv.org/abs/1404.1100

3.3.2. Variable Importance

The second method used to select features is Variable Importance. Variable importance is determined by a decision tree. A decision tree splits the dataset into increasingly smaller subsets to predict the target variable. Each split is called a node and is based on a condition on a single feature. At each step a variable is chosen that splits the data best. How these splits are made depends on the algorithm used, for example the average decrease in variance after splitting. This is repeated until a specified limit is reached, creating a tree with a ranked order of features (Antipov & Pokryshevskaya, 2012). Using only one decision tree to assess variable importance overfits the model, which leads to poor predictions on unseen data. Hence, we will use a random forest: an ensemble of many decision trees over which a weighted average is taken. The Random Forest draws bootstrap samples from the data given to it, which differ in feature composition. The accuracy of a bootstrapped sample containing a specific feature is compared to the accuracy of another bootstrapped sample in which the feature is not used (Kok, Koponen, & Barbosa, 2017). In this research, the machine learning


features. In this library, the measure of feature importance is calculated as the variance reduction that a feature brings to a node, computed with the mean squared error. To obtain the feature importance, this reduction is weighted by the probability of reaching the node. The mean squared error is calculated by subtracting the predicted values from the true values, squaring the differences, and averaging the sum.

Equation 1. Mathematical notation of MSE

MSE = (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)²

Equation 2. Node importance calculated with Gini Importance

N_j = W_j C_j − W_left(j) C_left(j) − W_right(j) C_right(j)

W_j : number of samples reaching node j
C_j : impurity value of node j (variance reduction with MSE)
left(j), right(j) : child nodes split from node j

Equation 3. Feature importance mathematical notation

FI_i = ( Σ_{j : node j splits on feature i} N_j ) / ( Σ_{k ∈ all nodes} N_k )

This yields a percentage value for the importance of feature i: the node importance accumulated at its nodes, divided by the sum of all node importance values. To get the feature importance in a random forest, the normalized importances are summed over all trees and divided by the number of trees in the forest.
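These weighted, normalized and tree-averaged importances presumably correspond to the impurity-based importances exposed by scikit-learn's random forest (the library choice is an assumption); a sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem with only a few informative features.
X, y = make_regression(n_samples=400, n_features=10, n_informative=3,
                       noise=5.0, random_state=1)

# Each tree is fit on a bootstrap sample; importances are MSE-based variance
# reductions per node, weighted by the fraction of samples reaching the node,
# normalized to sum to one, and averaged over all trees.
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]   # most important feature first
```

Selecting a feature subset then amounts to keeping the first k indices of `ranking`.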

3.3.3. LASSO

The final feature selection method is the Least Absolute Shrinkage and Selection Operator (LASSO) regression. The method uses a linear regression with regularization, which adds a penalty on the coefficient weights of the estimators. The weights of parameters that the LASSO regression does not deem important are shrunk to zero, excluding them from the regression. In this way the LASSO regression performs feature selection: it excludes variables that are not important by giving them a weight of zero, while more important variables receive a higher weight (Fonti & Belitser, 2017). The mathematical notation for LASSO is a loss


function minimizing the sum of squared residual errors plus a regularization term, which is a constant multiplied by the sum of absolute values of the coefficients. The function with the objective to be minimized is depicted as follows.

Equation 4. LASSO

$$L_{\text{LASSO}}(\hat{\beta}) = \sum_{i=1}^{n}\left(Y_i - x_i'\hat{\beta}\right)^2 + \lambda \sum_{j=1}^{p}\left|\hat{\beta}_j\right|$$

where $\lambda$ is the regularization constant, $|\hat{\beta}_j|$ the absolute value of the coefficient weights, and $(Y_i - x_i'\hat{\beta})^2$ the squared residual error.

To select features with LASSO we set a threshold and select the features with the largest absolute coefficient weights from the regularized regression.
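The threshold approach can be sketched as follows, assuming scikit-learn's Lasso on synthetic, standardized data; `alpha` stands in for the regularization constant λ and its value here is arbitrary.

```python
# Sketch: LASSO-based selection of the 10 largest-weight features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=30, n_informative=5,
                       random_state=0)
X = StandardScaler().fit_transform(X)  # coefficient weights are scale-sensitive

lasso = Lasso(alpha=1.0).fit(X, y)     # alpha plays the role of lambda
weights = np.abs(lasso.coef_)          # |beta_j|: many shrink exactly to zero
selected = np.argsort(weights)[::-1][:10]  # 10 features with the largest weights
```

Standardizing the features first matters: without it, the penalty would shrink coefficients of large-scale variables less than those of small-scale ones.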

3.4 Models

To measure the performance of feature selection on valuation models, we will make use of two types of estimation models: linear regression and Random Forest. First, the two models will make an estimation with all features, to set a benchmark performance value. This benchmark will then be compared against the performance of the same type of model with feature selection, referred to as the reduced model. From the feature selection methods, we obtain three variable subsets. For each model type there will be four variants: three reduced and one full model, resulting in eight models in total. The mathematical model specifications are elaborated below.

3.4.1 OLS linear regression

The first type of model we will use for estimation is a linear regression, based on the Ordinary Least Squares principle. The regression constructs a linear function by estimating parameters for the explanatory variables. It does so by minimizing the sum of squared differences between the observed dependent variable and the one predicted from the independent variables. This results in a best-fitting (linear) line through the data. As a rule, it is assumed that the error term is normally distributed and has a constant variance σ². All the independent variables are uncorrelated with this error term, the observations of the error term are uncorrelated with each other, and the independent variables cannot be perfectly correlated with each other (Stock & Watson, 2014). The mathematical notation for the linear models is defined as follows.

Equation 5. Linear regression

$$Price_i = \beta_0 + \beta X_i + \varepsilon_i$$

Where $i$ indexes the observations, $Price_i$ is the dependent variable, which is continuously measured. $\beta_0$ is the constant term, or intercept. $X_i$ is a vector of the predicting features that varies across the four models, $\beta$ contains the estimated regression coefficients for the explanatory variables, and $\varepsilon_i$ is the error term.

3.4.2 Random Forest

The second type of predicting model used is the Random Forest regression model. As described in subsection 3.3.2, it is an ensemble machine learning method, using a weighted average of several decision trees to make a prediction. The Random Forest algorithm draws bootstrapped samples from the data given to it and uses them to build the decision trees. This process, called bagging, is repeated many times before a weighted average of all predictions is taken. Without limitations, the trees in a Random Forest can grow very deep and overfit the training data, which then leads to poor performance on unseen data (Kok, Koponen, & Barbosa, 2017). In this research, the scikit-learn machine learning library is used to set hyperparameters for the Random Forest model. The hyperparameters include the maximum depth of the trees, the number of trees combined, the maximum number of features considered at each split, and whether the bootstrapping (or bagging) is done with or without replacement. A disadvantage of Random Forest is that we cannot directly interpret how much each feature contributes to the price prediction. The model makes splits based on the reduction in the mean squared error of the prediction, which is the standard criterion when solving a regression problem. Below, a visual representation of a Random Forest model is given.
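The four hyperparameters named above can be set as in the following sketch; the specific values are illustrative assumptions, not the tuned settings used in the thesis.

```python
# Sketch: Random Forest regressor with explicit hyperparameters (scikit-learn).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=200,     # number of trees combined in the ensemble
    max_depth=10,         # maximum depth of each tree (limits overfitting)
    max_features="sqrt",  # maximum number of features considered at each split
    bootstrap=True,       # bagging: sample the training data with replacement
    random_state=0,
).fit(X, y)

r2_train = rf.score(X, y)  # in-sample R-squared
```

Capping `max_depth` and `max_features` is the usual way to keep the trees from memorizing the training data, which is the overfitting risk discussed above.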


Figure 4. Visualization of a Random Forest model.

Note. Reprinted from “Random Forest Regression” by Chakure, A., 2019, June 29. Retrieved

from https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f

3.5 Measuring model performance

After constructing the 8 models, a measure of the performance of each model needs to be specified. With this measure the models can be compared and the effect of feature selection can be tested. The chosen measure of performance in this research is the R-squared, or coefficient of determination: the share of variance in the target explained by the independent variables.

Equation 6. Mathematical notation of the coefficient of determination, R-squared

$$R^2 = 1 - \frac{\text{Residual sum of squares}}{\text{Total sum of squares}} = 1 - \frac{SSE}{TSS}$$

To evaluate if the constructed models minimize the estimation error, cross-validation is used. Measuring the R-squared of a constructed model applied to unseen data gives the predicting R-squared, which can be compared across models with a reduced set of independent variables. For this reason, the dataset is split into a training and a test set before constructing the models. Each model is built on the training dataset and applied to the test set. Letting the models make predictions on unseen data gives the out-of-sample performance; the in-sample performance is usually higher because the model is fitted on that data.
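The split-and-evaluate procedure can be sketched as follows, assuming an 80/20 split (the exact ratio used in the thesis is not restated here) and synthetic data in place of the Moscow transactions.

```python
# Sketch: in-sample vs. out-of-sample R-squared via a train/test split.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)  # fitted on the training data only
r2_in = model.score(X_tr, y_tr)    # in-sample R-squared
r2_out = model.score(X_te, y_te)   # out-of-sample (predicting) R-squared
```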


Finally, a model horserace based on R-squared will be held. There will be three races, an overall race, a feature selection race, and a regression race. The overall or total race will provide a ranked list of all 8 models, the model with the highest explained variance will be on top. The feature selection race will only consider the models with applied feature selection methods, thus delivering a ranked list of 6 models. The regression race will be between the linear regression model and the random forest model. There will be 4 places in the ranked list, and for each place there will be a winner between the linear regression and the random forest, with the same conditions. Hence, the three restricted models and one full model with OLS are compared to the three restricted and one full random forest model. The model with the highest R-squared will win the place, and afterwards these places will be sorted to complete the regression horserace.
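The ranking logic of the total horserace can be sketched in a few lines, using the out-of-sample R-squared values reported later in Table 7.

```python
# Sketch: rank all 8 models by out-of-sample R-squared (values from Table 7).
results = {
    "Full OLS": 0.4805, "RF & VarImp": 0.4276, "Full RF": 0.4258,
    "OLS & VarImp": 0.3968, "RF & LASSO": 0.1042, "RF & PCA": 0.0759,
    "OLS & PCA": 0.0310, "OLS & LASSO": 0.0244,
}
ranking = sorted(results, key=results.get, reverse=True)
# ranking[0] is the winning model, ranking[-1] the lowest performer
```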

4. Results

In this section the findings from the research are presented. Paragraph 4.1 contains the results of the feature selection methods on the two previously specified models. Paragraph 4.2 provides a comparison between the two models, with and without feature selection. Finally, in Paragraph 4.3 three horseraces between the models are held, providing ranked lists based on performance.

4.1. Feature selection performance

In order to form a statement about accepting or rejecting the posed hypotheses about feature selection performance, the dataset is applied to the constructed models. Before reporting the results of our analysis, the corresponding hypothesis for each model is repeated. This will assist us in justifying the conclusions drawn from the findings in our research. For every feature selection method, the R-squared will be compared to the R-squared of the corresponding unrestricted model, in other words, the benchmark. Table 1 contains the values for R-squared obtained with the OLS linear regression model. Table 2 contains the values for R-squared from the Random Forest model. Table 3 contains the results for OLS and Random Forest when increasing the number of features selected.


Table 1. OLS Model Performance before and after feature selection (10 features)

OLS Full PCA VarImp LASSO

Train R2 0.5069 0.0790 0.4118 0.0744

Test R2 0.4805 0.0310 0.3968 0.0244

Table 2. Random Forest Model Performance before and after feature selection (10 features)

Random Forest   Full     PCA      VarImp   LASSO
Train R2        0.9237   0.7125   0.9234   0.7136
Test R2         0.4258   0.0759   0.4276   0.1042

Table 3. Model performance when increasing number of selected features

Feature selection   Number of features   Train/Test   OLS      RF
Full                436 features         Train R2     0.5069   0.9237
                                         Test R2      0.4805   0.4258
PCA                 10 features          Train R2     0.0790   0.7125
                                         Test R2      0.0310   0.0759
                    20 features          Train R2     0.1030   0.7319
                                         Test R2      0.0550   0.0865
                    40 features          Train R2     0.1243   0.7394
                                         Test R2      0.0680   0.0979
VarImp              10 features          Train R2     0.4118   0.9234
                                         Test R2      0.3968   0.4276
                    20 features          Train R2     0.4379   0.9253
                                         Test R2      0.4227   0.4331
                    40 features          Train R2     0.4517   0.9257
                                         Test R2      0.4350   0.4414
LASSO               10 features          Train R2     0.0744   0.7118
                                         Test R2      0.0244   0.1058
                    20 features          Train R2     0.0869   0.7252
                                         Test R2      0.0389   0.0991
                    40 features          Train R2     0.0978   0.7250
                                         Test R2      0.0463   0.1118


4.1.1. Principal Component Analysis

Hypothesis 1 states that feature selection using Principal Component Analysis increases the predicting performance of an OLS regression model. From the results in Table 1, it appears that PCA for feature selection did not improve the predictive performance of the linear regression model. The in-sample performance of 0.08 is considerably lower than the R-squared of 0.51 for the OLS model without feature selection. PCA feature selection also did not improve the performance of the model on unseen data: the out-of-sample R-squared amounted to 0.03, compared to 0.48 in the unrestricted model. Because the out-of-sample predictive performance of the restricted OLS model with the features chosen by PCA did not increase compared to the benchmark, Hypothesis 1 must be rejected.

Hypothesis 2 states that feature selection using Principal Component Analysis

increases the predicting performance of a Random Forest valuation model. From the results in Table 2, we see that both the in- and out-of-sample performance of the restricted random forest model decreased. Using the features retrieved with PCA the R-squared in the training sample decreased from 0.92 to 0.71. The R-squared in the testing sample decreased from 0.43 to 0.08. Due to the lower R-squared value in the testing sample, Hypothesis 2 must be

rejected.

Figure 3 shows the percentage of explained variance by the first ten principal components. The first principal component (PC1) accounts for 17.5 percent of the explained variance. Table 3 shows the effect of increasing the number of features selected from PC1. Both the out-of-sample results from OLS and Random Forest increased, but not enough to outperform the unrestricted models.


4.1.2. Variable Importance

The bar chart presented in Figure 5 shows the 10 most important features selected by variable importance. The square footage appears to be deemed most important by the feature importance algorithm from scikit-learn.

Hypothesis 3 states that feature selection using Variable Importance increases the predicting performance of an OLS linear regression model. The presented results in Table 1 do not indicate an improved performance of prediction after feature selection with Variable Importance. The R-squared decreased from 0.51 to 0.41 when using the restricted linear regression model on the sample used for fitting the model. The value of R-squared in the validation sample decreased from 0.48 to 0.40 using the restricted model for prediction. A lower value for R-squared in the validation sample obtained from the restricted OLS model leads to a rejection of Hypothesis 3.

Hypothesis 4 states that feature selection using Variable Importance increases the predicting performance of a Random Forest model. Variable Importance is derived from a Random Forest model, so there is some synergy. The results in Table 2 show a slight decrease in the R-squared value using the model on the training sample: 0.9237 with the full model and 0.9234 with the restricted model. The restricted model obtains a higher value for R-squared in the validation sample, namely 0.4276, whereas the full model has an R-squared of 0.4258. This suggests that feature selection using Variable Importance increases the predicting performance of a random forest model; thus Hypothesis 4 is accepted.

The results obtained from increasing the number of features selected by Variable Importance, depicted in Table 3, do not change the statements made about the hypotheses. Increasing the number of features does increase prediction performance; however, for OLS the unrestricted model remains superior.


4.1.3. LASSO

Hypothesis 5 states that feature selection using LASSO increases the predicting performance of an OLS linear regression model. The results in Table 1 do not show an increased predicting performance after feature selection with LASSO. The R-squared decreased from 0.51 to 0.07 when using the restricted linear regression model on the training sample. The R-squared value of the OLS model on the testing sample decreased from 0.48 to 0.02 using the reduced model for prediction. Hypothesis 5 must be rejected because of the lower R-squared value of the restricted OLS model in the testing sample.

Hypothesis 6 states that feature selection using LASSO increases the predicting performance of a Random Forest model. The presented values for the R-squared of the reduced random forest model decreased from 0.92 to 0.71 applying the model on the training data, and from 0.43 to 0.10 applied on the testing data. Thus, Hypothesis 6 is rejected due to a decreased performance of the model on the testing data.

Increasing the number of features selected by LASSO also did not change the outcome of the analysis. The predictive performance increased with the number of features but did not outperform the unrestricted models.

4.2. Model performance

In this section the results of the two types of models are compared. This approach will let us make a statement about the last two hypotheses. Table 4 combines the results from Tables 1 and 2 in an organized way.

Table 4. Model performance OLS versus Random Forest

Feature selection   Train/Test   OLS      Random Forest

Full                Train R2     0.5069   0.9237
                    Test R2      0.4805   0.4258
PCA                 Train R2     0.0790   0.7125
                    Test R2      0.0310   0.0759
VarImp              Train R2     0.4118   0.9234
                    Test R2      0.3968   0.4276
LASSO               Train R2     0.0744   0.7136
                    Test R2      0.0244   0.1042


4.2.1. Unrestricted

The unrestricted models include 434 explanatory variables, of which 153 are binary dummy variables. The remaining variables, including the target variable, are all continuous. The in-sample performance of the OLS linear regression model gives an R-squared of 0.51, and for the Random Forest model we measured an R-squared of 0.92. The measured out-of-sample R-squared is 0.48 and 0.43, respectively. It appears that the OLS model performs better than the random forest model on unseen data. Hypothesis 7 states that an unrestricted Random Forest model has a greater predictive performance than an OLS linear regression model. The in-sample performance of the random forest is greater than that of the OLS model. However, in out-of-sample, thus predicting, performance the Random Forest performs worse. In consequence, Hypothesis 7 must be rejected.

4.2.2. Restricted

Hypothesis 8 states that a restricted Random Forest model using feature selection has a greater predictive performance than an OLS linear regression model. In Table 4 we can compare the results between the restricted OLS and Random Forest models. For all the Random Forest models we measured a higher value of R-squared than for the corresponding OLS model, for both the training and testing data. On unseen data, the testing sample, Random Forest with PCA feature selection has an R-squared of 0.08, compared to 0.03 for OLS. With Variable Importance, Random Forest obtained an R-squared of 0.43, which is higher than the 0.40 for OLS. Finally, with the LASSO features, Random Forest has an R-squared of 0.10, while OLS obtained a value of 0.02. Due to the larger R-squared values of each restricted Random Forest model in the testing sample, Hypothesis 8 is accepted.

4.3 Model horserace

Lastly, this section provides a ranked order of the constructed models in this paper, by holding three horseraces.

The first horserace is the feature selection race, ranked on highest to lowest

performance on unseen data. This means that the first place will show the feature selection method that improves the predicting performance of the corresponding type of model the most out of all the methods tested in this research. The ranks are based on the coefficient of determination of the model, R-squared: the first place has the highest R-squared, and the last place the lowest. The Random Forest model with variable importance performed best. The second place is also taken by variable importance as feature selection method. The lowest performing feature selection method is LASSO applied to the OLS model.

Table 5. Feature selection horserace

Rank Feature selection R2

1. RF & VarImp 0.4276

2. OLS & VarImp 0.3968

3. RF & LASSO 0.1042

4. RF & PCA 0.0759

5. OLS & PCA 0.0310

6. OLS & LASSO 0.0244

The second horserace is the regression race. The two types of models used for

prediction in this paper will be compared to each other on a level playing field. This means that the OLS models are compared to the Random Forest models with the same restrictions, or with no restrictions at all. Each of the four comparisons yields one winner; the winners are

subsequently ranked based on highest performance. The results are given in Table 6. The winner of the regression race is the Ordinary Least Squares regression, without any

restrictions. The Random Forest wins in performance on all the restricted models. Variable importance as restriction got second place after the full, unrestricted model.

Table 6. Regression model horserace

Rank Winner Restriction Model R2

1. OLS   Full     OLS   0.4805
                  RF    0.4258
2. RF    VarImp   OLS   0.3968
                  RF    0.4276
3. RF    LASSO    OLS   0.0244
                  RF    0.1042
4. RF    PCA      OLS   0.0310
                  RF    0.0759


The third and last horserace ranks all the models based on their performance. In Table 7 we see that the model with the best performing score on unseen data is the unrestricted OLS linear regression. The second and third place are taken by the Random Forest model with variable importance and the unrestricted Random Forest model, respectively.

Table 7. Total model horserace

Rank Model R2

1. Full OLS 0.4805

2. RF & VarImp 0.4276

3. Full RF 0.4258

4. OLS & VarImp 0.3968

5. RF & LASSO 0.1042

6. RF & PCA 0.0759

7. OLS & PCA 0.0310

8. OLS & LASSO 0.0244

5. Discussion

In this section, the results from the data analysis are interpreted. The unexpected outcomes and limitations of the research are also discussed. Finally, some directions for further research are suggested.

From the review of previous literature, it was expected that feature selection would increase the predictive performance of the constructed models. Conducting feature selection using three different approaches provided us with three restricted subsets of independent variables. These subsets were applied to a linear regression model and a machine learning model. None of the reduced variable subsets appeared to improve the predicting performance of the OLS model, using R-squared as our measure. In the cross-validation sample, the full feature set applied to the linear model was able to explain 48 percent of the variation in the property transaction prices. The next best performing feature set applied to the linear model was chosen with Variable Importance; this reduced feature set explained 40 percent of the variation. Increasing the number of selected features slightly improved the predicting performance of the linear model. However, the larger subsets did not outperform the unrestricted OLS model. With R-squared as performance measure, it can be concluded that feature selection with PCA, Variable Importance and LASSO does not improve the predicting performance of an OLS linear regression model on this dataset.

For the machine learning model, a random forest is used. In contrast to the OLS model, applying the reduced independent variable subset obtained with Variable Importance improved the predictive performance of the random forest. The model with the full feature set succeeded in explaining 42.6 percent of the variance in the property transaction prices, while the reduced feature subset from the Variable Importance method resulted in an explained variance of 42.8 percent. The other two methods did not improve the predictive performance of the random forest model. Features selected with LASSO led to an explained variance of 10.4 percent, and features selected with PCA to 7.6 percent. From these results it can be concluded that Variable Importance as feature selection method increases the predicting performance of the Random Forest model. This is based on R-squared as a measure of performance, and on the dataset used in the analysis. Feature selection through the use of the loadings from PCA and the regularized coefficients from LASSO did not improve the predictive performance of the Random Forest model.

Comparing the linear regression model to the machine learning model, the OLS model scores higher than the Random Forest model when the number of features is not reduced. The unrestricted linear regression model explains 48 percent of the variance in price, while the prediction with Random Forest accounts for 42.6 percent. However, when both models are reduced to a specific subset, the Random Forest always outperforms the OLS model. Starting with the Variable Importance feature subset, the Random Forest achieved an explained variance of 42.8 percent and the OLS regression only 39.7 percent. With the LASSO-reduced features, Random Forest accounted for 10.4 percent of the variance, and OLS predicted 2.4 percent. Lastly, the feature subset selected with PCA allowed Random Forest to explain 7.6 percent of the variance, compared to 3.1 percent by OLS. This suggests that overall the Random Forest outperforms OLS in predictive power. However, in this research the unrestricted OLS model obtained the highest score in predicting performance. This could be due to the curse of dimensionality, which arises when a high-dimensional dataset is used for training and testing a machine learning model. More features bring more variance and noise into the data, causing the machine learning algorithm to build a model that is worse at making predictions on unseen samples (Aremu, Hyland-Wood & Mcaree, 2020). The curse of dimensionality is associated with overfitting, which could be a reason that Random Forest did not outperform OLS on the full variable set. This suggests that the Random Forest overfitted on the high-dimensional training data, using too many features (Kok, Koponen & Barbosa, 2017). This also explains the result depicted in the total model horserace in Table 7, where the Random Forest with Variable Importance is ranked above the unrestricted Random Forest model.

Regarding the investigated feature selection methods, from the feature selection horserace in Table 5 we can conclude that Variable Importance scores the highest in

increasing predictive performance, for both the linear regression and the machine learning model. Feature selection with LASSO and PCA applied to the Random Forest model follows in third and fourth place. Restricting the OLS model with feature selection using PCA and LASSO is ranked fifth and last. The markedly lower predictive performance scores both models received when restricted with PCA and LASSO could be due to the missing observations in the data. Variables that were highly correlated with the target contained many missing values. In the data preparation the choice was made to remove all observations with missing values, which might have removed observations that could improve the selection of features by PCA and LASSO. In the Principal Component Analysis, the first principal component explained 17.5 percent of the variance in the data. Selecting the features with the highest loadings from only the first principal component could have been insufficient. The LASSO feature selection method has some limitations, as described by Fonti and Belitser (2017). When there are groups of variables that are highly correlated with each other, LASSO tends to select only one feature from each group, neglecting the others. The features that appeared to be important in this dataset included variables such as the square footage (full, living and kitchen), the number of rooms and neighborhood characteristics (cafés, offices, shopping malls). These features are highly correlated, which could explain the low score of the LASSO method. These findings suggest that the choice of feature selection method is greatly dependent on the dataset in question and should be handled with particular care.

5.1. Limitations and Future Research

Combining the findings on the effect of the feature selection methods at issue, the general conclusion is that, overall, they do not increase the predictive performance of the two valuation models considered. The exception that was observed may be due to an overfit of the unrestricted models. As described in previous literature, a random forest model should perform better out of sample with a reduced feature set. Opposed to this, an OLS linear regression model yields a better performance when the subset of independent variables is increased. The approach taken to selecting features in the empirical analysis was the filter method, meaning the feature subset is reduced to only the most important

variables. However, instead of eliminating unimportant features with the methods under consideration, a basic set of features could have been chosen beforehand to set the benchmark performance value. Such a basic set could, for instance, include only indoor characteristics. From this point, feature selection could have been used to add variables deemed important by each technique. This paper was limited to using backward elimination to reduce the high dimensionality of the dataset. Further research could consider forward

selection as filter method, or even assess wrapper or embedded methods.

Furthermore, this paper was limited to using only predictive R-squared as the measure of performance. In further research, additional performance measures such as RMSE and MAPE could be evaluated and compared to R-squared. Another shortcoming of this paper is the limitation to only three feature selection techniques and two predictive models. Future research could consider comparing more or other models and feature selection techniques.

Feature selection as a subject matter has not been investigated extensively yet. As a suggestion for future study, investigating the performance of Variable Importance for feature selection in ensembles of decision trees could lead to relevant results, as the method seems to have potential for increasing predicting performance. To extend the purpose of this research, algorithm ensembles other than Random Forest could be considered. An interesting ensemble to investigate is the recently released software XGBoost. Its eXtreme Gradient Boosting algorithm is among the most used for supervised machine learning problems on Kaggle, where winning solutions almost always use this advanced implementation of gradient boosting. Investigating the construction of better predictive models is not only beneficial for the estimation of property value; such models could also be used for stress-testing under adverse macroeconomic situations. Furthermore, the

availability of an accurate and quick appraisal may stimulate financial innovation in the real estate sector (Kok, Koponen & Barbosa, 2017). Finally, feature selection can also be

considered for other applications in research. These methods can help in seeking and selecting the most important variables for interpretative models instead of predictive models, which is especially useful given the rising accessibility of big data. They could let the data speak for us, and potentially bring new insights. It is crucial to continue investigating the suggested effects of feature selection techniques.

6. Conclusion

This thesis has investigated the effect of three feature selection techniques on the predictive performance of two automated valuation models in real estate. The research is conducted based on residential property transaction prices with corresponding characteristics from August 2013 until June 2015.

First, three subsets of variables were chosen as most important by three different feature selection methods. This was done using the machine learning library scikit-learn for Python. With each feature selection technique, the 10, 20 and 40 most important variables were chosen and applied to a linear regression model and a supervised machine learning model. Subsequently, the predictive performance was measured on unseen data for cross-validation.

The first predictive model in question is an Ordinary Least Squares linear regression. It was expected that feature selection using Principal Component Analysis, Variable

Importance and LASSO would increase the predictive performance of the linear regression model. However, the results showed no increase in model performance after feature selection.

The second predictive model used in the research is a Random Forest regression model. The expectation for the machine learning model was that feature selection would increase its performance, and that it would outperform the linear regression model too. From the results it was concluded that only Variable Importance increased the predictive performance of Random Forest. Overall, the unrestricted OLS model performed best. Random Forest outperformed OLS only under the restricted circumstances, with a reduced feature set. Concluding, it can be stated that feature selection has a decreasing effect on the predictive performance of automated valuation models. This applies only to the three feature selection methods and two valuation models tested in this paper. This contradicts the expected outcomes, which anticipated an increasing effect on prediction performance. Except for Hypotheses 4 and 8, all hypotheses derived from previous literature were rejected: Variable Importance did improve the predicting performance of the Random Forest model, and Random Forest with feature selection outperforms OLS with feature selection.


7. Bibliography

Abidoye, R., Junge, M., Lam, T., Oyedokun, T., & Tipping, M. (2019). Property valuation methods in practice: evidence from Australia. Property Management, 37(5), 701–718.

Antipov, E. A., & Pokryshevskaya, E. B. (2012). Mass appraisal of residential apartments: An application of Random forest for valuation and a CART-based approach for model diagnostics. Expert Systems with Applications, 39(2), 1772-1778.

Aremu, O., Hyland-Wood, D., & Mcaree, P. (2020). A machine learning approach to circumventing the curse of dimensionality in discontinuous time series machine data. Reliability Engineering & System Safety, 195.

Cannon, S., & Cole, R. (2011). How accurate are commercial real estate appraisals? Evidence from 25 years of NCREIF sales data. The Journal of Portfolio Management, 5(5), 68-88.

Chakure, A. (2019, June 29). Random forest regression. Towards Data Science. Retrieved from https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f

Cousineau, D., & Chartier, S. (2010). Outliers detection and treatment: a review. International Journal of Psychological Research, 3(1), 58-67.

Faishal Ibrahim, M., Jam Cheng, F., & How Eng, K. (2005). Automated valuation model: an application to the public housing resale market in Singapore. Property

Management, 23(5), 357-373.

Fonti, V., & Belitser, E. (2017). Feature selection using lasso. VU Amsterdam Research Paper

in Business Analytics, 1-25.

Fortelny, A., & Reed, R. (2005). The increasing use of Automated Valuation Models in the Australian mortgage market. Australian property journal, 38(6), 681-685.


Goodman, A. C. (1998). Andrew Court and the invention of hedonic price analysis. Journal

of urban economics, 44(2), 291-298.

Guo, Q., Wu, W., Massart, D. L., Boucon, C., & De Jong, S. (2002). Feature selection in principal component analysis of analytical data. Chemometrics and Intelligent Laboratory Systems, 61(1-2), 123–132.

Jahanshiri, E., Buyong, T., & Shariff, A. R. M. (2011). A review of property mass valuation models. Pertanika Journal of Science & Technology, 19(1), 23-30.

Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202.

Kok, N., Koponen, E. L., & Martínez-Barbosa, C. A. (2017). Big data in real estate? From manual appraisal to automated valuation. The Journal of Portfolio Management, 43(6), 202–211.

Kontrimas, V., & Verikas, A. (2011). The mass appraisal of the real estate by computational intelligence. Applied Soft Computing Journal, 11(1), 443–448.

Kubus, M. (2016). Assessment of Predictor Importance with the Example of the Real Estate Market. Folia Oeconomica Stetinensia, 16(2), 29–39.

Kubus, M. (2016). Locally regularized linear regression in the valuation of real estate. Statistics in Transition, 17(3), 515–524.

Mullainathan, S., & Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2), 87–106.

Nguyen, N., & Cripps, A. (2001). Predicting housing value: A comparison of multiple regression analysis and artificial neural networks. Journal of Real Estate Research, 22(3), 313–336.

Peterson, S., & Flanagan, A. (2009). Neural network hedonic pricing models in mass real estate appraisal. Journal of Real Estate Research, 31(2), 147–164.


Schulz, R., Wersing, W., & Werwatz, A. (2014). Automated valuation modelling: A specification exercise. Journal of Property Research, 31(2), 131–153.

Shiller, R. J., & Weiss, A. N. (1999). Evaluating real estate valuation systems. Journal of Real Estate Finance and Economics, 18(2), 147–161.

Shlens, J. (2014, April 3). A tutorial on principal component analysis. Retrieved from https://arxiv.org/abs/1404.1100

Sopranzetti, B. J. (2010). Hedonic regression analysis in real estate markets: A primer. In Handbook of quantitative finance and risk management (pp. 1201–1207). Springer, Boston, MA.

Stock, J. H., & Watson, M. W. (2014). Introduction to econometrics, update, global edition (3rd ed.). Pearson Education.

Worzala, E., Lenk, M., & Silva, A. (1995). An exploration of neural networks and its application to real estate valuation. The Journal of Real Estate Research, 10(2), 185–201.


7. Appendix

Appendix A. Descriptive statistics of highly correlated variables with target
