Traditional regression versus kriging: an application on residential real estate

Bachelor’s Thesis

by

Sebastiaan Hersmis BSc.

Submitted in partial fulfillment of the requirements for the Degree Bachelor of Science

in Econometrics and Operational Research

Supervisor: dr. Wim van Beers

June 27, 2016

Abstract

This thesis compares the quality of predictions made by ordinary least squares (OLS) with that of predictions made by kriging. Kriging incorporates the spatial correlation of the input data but, unlike OLS, does not yield marginal effects. Both methods are used to predict residential real estate prices. OLS results in a smaller Mean Squared Error (MSE) than kriging for predictions of the out-of-sample data set, but the difference is small. Both methods show a high standard deviation in MSEs, which could be due to the nature of the data and the choice of metamodel. Overall, OLS appears to make slightly better predictions.

1 Introduction

Everyone wants to predict the future. More specifically, the goal of most researchers in the field of economics is to predict output values based on a sample of known input factors. Using the predicted values, the impact of changing input variables can be assessed. However, when working with real-life data, the number of input combinations will always be limited by the number of observations. A common problem is that observations are expensive or impossible to acquire. When observations are scarce and the sample size is small, it is imperative to use methods that still estimate model parameters accurately. This thesis compares two of those methods, traditional regression and kriging, and applies those to a real-life data set on residential real estate prices.

When working with real-life data, researchers mostly use approximation models, so-called 'metamodels'. A metamodel mimics the real-world scenario so that predictions can subsequently be made for new situations. A key issue here is the choice of the metamodel. Low-order polynomial regression is a widely used method in research on metamodels (Kleijnen, 2009, p. 707). This traditional regression method fits a low-order polynomial to the observations and can be used for explaining the underlying marginal effects and predicting the output for unknown input combinations. Another method, the interpolation method kriging, is increasingly used for predicting output (Jin, Chen, & Simpson, 2001, p. 1). Kriging, originally developed for geostatistics by Danie Krige, has the goal of predicting the output value of an underlying function, given the observations at the inputs/sample points (Kleijnen, 2009). Kriging uses spatial correlation between inputs: if inputs are close to each other, their outputs are expected to be close as well.

The goal of this thesis is to estimate the quality of the predictions of traditional, low-order polynomial regression compared to the predictions by the kriging method. Since much research has been done on kriging and simulated data, Kleijnen (2009, p. 715) recommends that kriging should also be tested on practical situations. In this thesis, a low-order polynomial metamodel and kriging metamodel are tested on real-life data of residential real estate prices.

The data set contains information on housing prices and multiple input factors, such as surface, location and type. In order to compare the quality of the predictions by both methods, the data set is split into two parts. One part is used to estimate the relevant parameters for both methods and the other part to validate the predictions of the models. It is important to note that the essence of this thesis lies in comparing the two methods, and not in developing a valuation model to estimate residential real estate prices.

The rest of this thesis is organised as follows. Section 2 covers the background of both methods and their theoretical differences, and discusses drivers of residential real estate prices. Section 3 explains the methodology and collection of data. Section 4 contains the results of the comparison of both methods and section 5 concludes this thesis. Two limitations and two recommendations are given in section 6.

2 Theory on traditional regression and kriging

This section covers the theoretical background of both methods, and discusses drivers of residential real estate prices. Firstly, in subsection 2.1 the concepts of metamodels and low-order polynomial regression are explained. Then, in subsection 2.2 kriging is treated. In subsection 2.3, methods for comparing the quality of predictions by both models are discussed. Finally, in subsection 2.4, the theory on drivers of residential real estate prices is treated.

2.1 Metamodels and polynomial regression

It is imperative to explain metamodels before discussing polynomial regression, as a model has to be established before regression can be done.

A metamodel is a model of a model, i.e. a simplified version of a real-life system. Kleijnen, Sanchez, Lucas, & Cioppa (2005, p. 265) describe a metamodel as an approximation of the true Input/Output (I/O) function, where this approximation simplifies the relation between input and output factors. In this thesis only simple (univariate) output metamodels are used, as most models used in scientific research also assume such output (Kleijnen, 2009, p. 708). This means that the metamodel yields a single, scalar output. For construction of the metamodel, see section 3.2.

Kleijnen et al. (2005, p. 265) state that polynomial regression, or traditional regression is a common metamodeling technique, where the I/O relation consists of a deterministic component (the polynomial function) and a random component (the error term). A first-order polynomial regression metamodel can be expressed as

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon \tag{2.1}$$

where $y$ is the explained variable, the $\beta$s are the parameters, $x$ is the input vector and $\epsilon$ is the error term. Note that a higher-order polynomial regression includes higher-order cross terms.

This thesis only uses low-order polynomial regression, assuming that interactions among three or more inputs are unimportant. This simplification is in line with Kleijnen (2015, p. 3), who states that such interactions are hard to interpret and often unimportant in practice. Furthermore, as this thesis compares two methods, the individual prediction process of those methods is of less importance than a comparison between the two.

When estimating the relevant parameters $\beta$ of the regression metamodel, (Ordinary) Least Squares, from now on abbreviated as OLS, is the most widely used method (Kleijnen, 2008, p. 29). This method minimizes the sum of the squared distances between the observations and the predicted output:

$$\min_{\hat{\beta}} SSR = \sum_{i=1}^{n} (\hat{e}_i)^2 = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \tag{2.2}$$

where $\hat{y}_i$ is the predicted output and $y_i$ is the observed value. The solution of equation 2.2 gives the least squares estimator for $\beta$, which is derived as

$$\hat{\beta} = (X'X)^{-1}X'y \tag{2.3}$$

where $X$ is the matrix of input values and $y$ is the dependent variable of the metamodel (Kleijnen, 2009, p. 709).
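As an illustration of equation 2.3, the estimator can be sketched in a few lines of Python with NumPy. This is a minimal sketch and not the code used in this thesis (which relies on Matlab): the function names `ols_fit` and `ols_predict` and the small noise-free data set are hypothetical, and the column of ones for the intercept $\beta_0$ is added explicitly.

```python
import numpy as np

def ols_fit(X, y):
    """Return the OLS estimate (X'X)^{-1} X'y; X holds the inputs without an intercept column."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for beta_0
    # Solve the normal equations X'X beta = X'y instead of inverting X'X explicitly.
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

def ols_predict(beta, X):
    """Predict outputs for new inputs with the estimated coefficients."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ beta

# Hypothetical, noise-free data generated from y = 2 + 3*x1 - 1*x2.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0], [5.0, 2.5]])
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]
beta_hat = ols_fit(X, y)
```

Because the example data contain no noise, `beta_hat` recovers the generating coefficients up to floating-point error. With real data the estimate differs from any underlying coefficients by an amount that depends on the error term.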

2.2 Kriging

In this subsection, the origin, theory and proof of kriging are discussed. Where valid, differences with traditional regression are highlighted.

Kriging originates from the field of geostatistics and was named after the mining engineer Danie Krige by Matheron (1963). Chen, Tsui, Barton, & Meckesheimer (2006, p. 275) recognize that since then, the method has gained popularity in the field of spatial statistics. They state that kriging uses the spatial correlation between input points to predict output values between observed points. It is important to remark that the input values can come from a multi-dimensional space. In this thesis, focus is laid on Ordinary Kriging (OK).

Although OK is the simplest form of kriging, it often suffices in practice (Kleijnen, 2009, p. 708). OK assumes that

$$y(x) = \mu + M(x) \quad \text{with } x \in \mathbb{R}^k \tag{2.4}$$

where $\mu$ is the constant mean and $M(x)$ is extrinsic, additive noise. The covariances between $y(x_i)$ and $y(x_j)$ ($i \neq j$) depend only on the distance between their inputs $x_i$ and $x_j$. The distance between the inputs is captured in the vector $h = (h_j)$, where $h_j = |x_i - x_j|$ (Kleijnen, 2015, p. 19).

When estimating new output, kriging gives more weight to observations whose inputs are at a small distance from the new input than to observations that lie further away (Basu & Thibodeau, 1998, p. 80). The predicted output is estimated as

$$\hat{y}(x_0) = \sum_{i=1}^{n} \lambda_i y(x_i) \tag{2.5}$$

which is a linear combination of the outputs already observed at the inputs $x_i$. Kleijnen (2008, p. 182) states that the optimal weight vector $\lambda_o$ gives the best linear unbiased predictor (BLUP), which minimizes the mean squared error (MSE) of the prediction

$$MSE[\hat{y}(x_0)] = E[(\hat{y}(x_0) - y(x_0))^2] \tag{2.6}$$

Kleijnen (2008, p. 183) shows that, by definition, the BLUP gives an unbiased prediction:

$$E[\hat{y}(x_0)] = E[y(x_0)] \tag{2.7}$$

which implies that if a new input coincides with an old input, OK is an exact interpolator. Kleijnen (2008, p. 183) states that kriging assumes that the closer the input data are, the more positively correlated their outputs are. He proves that it follows that the weights $(\lambda_1, \dots, \lambda_n)$ do not have to be constant, but the sum has to be equal to one ($\sum_{i=1}^{n} \lambda_i = 1$). This characteristic implies that weights could be negative.
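The sum-to-one condition can be made plausible with a short derivation. The sketch below writes the predictor as a weighted sum of the observed outputs $y(x_i)$ and assumes, as is standard for ordinary kriging but not stated explicitly above, that the noise $M(x)$ has mean zero.

```latex
\begin{align*}
E[\hat{y}(x_0)] &= E\Big[\sum_{i=1}^{n} \lambda_i\, y(x_i)\Big]
                 = \sum_{i=1}^{n} \lambda_i\, E[\mu + M(x_i)]
                 = \mu \sum_{i=1}^{n} \lambda_i \\
E[y(x_0)]       &= E[\mu + M(x_0)] = \mu \\
\mu \sum_{i=1}^{n} \lambda_i = \mu \ \text{for all } \mu
                &\;\Longrightarrow\; \sum_{i=1}^{n} \lambda_i = 1
\end{align*}
```

The constraint fixes only the sum of the weights, so a weight vector such as $(1.2, -0.2)$ satisfies it even though one weight is negative.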

To find the weights $(\lambda_1, \dots, \lambda_n)$, Van Beers & Kleijnen (2004) show that the optimal vector of weights depends on the correlations between the outputs of the kriging metamodel, as specified in equation 2.4. Following Cressie (1991, p. 53), this characteristic is modeled through a covariance process that is second-order stationary, meaning it has a constant mean and variance, and the covariance only depends on the distance between the input factors.

There are several types of covariance functions that give valid (i.e. positive definite) covariance matrices for second-order stationary processes. In general, these functions are based on the covariance:

$$Cov(Y(x_i), Y(x_j)) = C(|x_i - x_j|) = C(h) \tag{2.8}$$

As can be seen in equation 2.8, the covariance depends only on the distance h of the input factors. In geostatistics however, variograms are used to display these functions. Following Cressie (1991, pp. 69-70), a variogram can be estimated by:

$$2\hat{\gamma}(h) = \frac{1}{|N(h)|} \sum_{N(h)} (Y(x_i) - Y(x_j))^2 \tag{2.9}$$

where $N(h)$ denotes the set of distinct pairs $(x_i, x_j)$ at distance $h$, $|N(h)|$ is the number of such pairs, and $\gamma(h)$ is called the semivariogram function. Kleijnen (2009, p. 710) gives three popular semivariogram functions for modeling the variance:

• Linear: $\gamma(h) = c_0 + \theta h$

• Exponential: $\gamma(h) = c_0 + c_1(1 - e^{-\theta h})$

• Gaussian: $\gamma(h) = c_0 + c_1(1 - e^{-\theta h^2})$

The variogram in figure 1 displays the three functions, with parameters c0 = 0.5, c1 = 1.5

Figure 1: Variogram with three functions
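The three semivariogram functions and the estimator of equation 2.9 can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the inputs are taken to be one-dimensional, and pairs are grouped with a distance tolerance `tol`, which is an assumption added here because real data rarely contain pairs at exactly the same distance. The values c0 = 0.5 and c1 = 1.5 are those mentioned for figure 1, while theta = 1.0 is an arbitrary illustrative choice.

```python
import math

def linear_sv(h, c0, theta):
    """Linear semivariogram: c0 + theta * h."""
    return c0 + theta * h

def exponential_sv(h, c0, c1, theta):
    """Exponential semivariogram: c0 + c1 * (1 - exp(-theta * h))."""
    return c0 + c1 * (1.0 - math.exp(-theta * h))

def gaussian_sv(h, c0, c1, theta):
    """Gaussian semivariogram: c0 + c1 * (1 - exp(-theta * h^2))."""
    return c0 + c1 * (1.0 - math.exp(-theta * h ** 2))

def empirical_semivariogram(xs, ys, h, tol):
    """Estimate gamma(h) from all distinct pairs whose distance is within tol of h."""
    sq_diffs = [
        (ys[i] - ys[j]) ** 2
        for i in range(len(xs))
        for j in range(i + 1, len(xs))
        if abs(abs(xs[i] - xs[j]) - h) <= tol
    ]
    if not sq_diffs:
        return None  # no pairs at this distance
    # Equation 2.9 gives 2 * gamma(h), hence the factor 2 in the denominator.
    return sum(sq_diffs) / (2.0 * len(sq_diffs))

# Hypothetical one-dimensional data: three inputs with their outputs.
gamma_at_1 = empirical_semivariogram([0.0, 1.0, 2.0], [1.0, 3.0, 2.0], h=1.0, tol=0.1)
```

With these parameters the exponential and Gaussian functions both start at c0 = 0.5 for h = 0 and level off at c0 + c1 = 2.0 for large h, whereas the linear function keeps increasing.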

Kleijnen (2008, p. 187) states that an essential part of OK is estimating the optimal kriging weights. He argues that these weights depend on the covariance of the metamodel, but it is unknown which semivariogram function gives a valid model. Therefore, it is customary to select a semivariogram function based on observation of the variogram. A common variogram function is the Gaussian function. To estimate the parameter values, usually maximum likelihood (ML) is applied. After the parameters are estimated, the corresponding weights can be deduced. The optimal weights are proven to be

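$$\lambda_o = \Gamma^{-1}\left[\gamma + \mathbf{1}\,\frac{1 - \mathbf{1}'\Gamma^{-1}\gamma}{\mathbf{1}'\Gamma^{-1}\mathbf{1}}\right] \tag{2.10}$$

where $\Gamma$ is the matrix of covariances between the observed outputs, $\gamma$ is the vector of covariances between the observed outputs and the output to be predicted, and $\mathbf{1}$ is a vector of ones; the covariances follow from equation 2.8 (Kleijnen, 2009, p. 709).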
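Equation 2.10 can be evaluated directly once a covariance function is chosen. The sketch below is not the DACE implementation used later in this thesis: it assumes an exponential covariance $C(h) = \sigma^2 e^{-\theta h}$ with Euclidean distance and fixed, hypothetical parameter values (`sigma2`, `theta`), whereas in practice these would be estimated, for example by maximum likelihood. The function name `ok_weights` and the example data are also hypothetical.

```python
import numpy as np

def ok_weights(X_old, x_new, sigma2=1.0, theta=1.0):
    """Ordinary kriging weights for an assumed covariance C(h) = sigma2 * exp(-theta * h)."""
    X_old = np.asarray(X_old, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Gamma: covariances between all pairs of old inputs (n x n).
    dists = np.linalg.norm(X_old[:, None, :] - X_old[None, :, :], axis=2)
    Gamma = sigma2 * np.exp(-theta * dists)
    # gamma: covariances between each old input and the new input (length n).
    gamma = sigma2 * np.exp(-theta * np.linalg.norm(X_old - x_new, axis=1))
    ones = np.ones(len(X_old))
    Gi_gamma = np.linalg.solve(Gamma, gamma)  # Gamma^{-1} gamma
    Gi_ones = np.linalg.solve(Gamma, ones)    # Gamma^{-1} 1
    # lambda_o = Gamma^{-1} [gamma + 1 * (1 - 1'Gamma^{-1}gamma) / (1'Gamma^{-1}1)]
    return Gi_gamma + Gi_ones * (1.0 - ones @ Gi_gamma) / (ones @ Gi_ones)

# Hypothetical two-dimensional inputs with observed outputs.
X_old = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
y_old = np.array([10.0, 12.0, 11.0, 20.0])

lam = ok_weights(X_old, [0.5, 0.5])      # weights for a new input
y_hat = lam @ y_old                      # prediction as a weighted sum of old outputs
lam_at_old = ok_weights(X_old, X_old[1]) # new input coincides with an old input
```

Two properties discussed in this subsection can be checked on the result: the weights in `lam` sum to one, and when the new input coincides with an old input the weight vector puts all weight on that observation, which is the exact interpolation property.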

2.3 Comparing the quality of predictions by traditional regression and kriging

This subsection discusses the procedure of comparing traditional regression to kriging. Firstly, the practical differences are treated where special attention is paid to the effect of the differences when predicting outcome for each method. Thereafter, methods for comparing prediction results are explained.

Firstly, a highly useful characteristic of OLS is that marginal effects are given directly by the coefficients of the estimated model. Kriging, by contrast, needs to make a new estimation for every new combination of input factors, whereas traditional regression establishes a single model to estimate all output. In practice this means that if marginal effects are important for the research, OLS could be a better choice of method.

Secondly, kriging is an interpolation method. Therefore, for known input combinations, kriging returns the exact, observed output. This is different from traditional regression, which minimizes the sum of the squared differences and is not an exact interpolator, unless the design is saturated (Kleijnen, 2009, p. 711). This interpolation characteristic of kriging is not relevant when predicting new output and therefore has no use in this thesis.

Lastly, a characteristic of kriging is its relatively high prediction quality when predictions are based on a small sample. In that case kriging could be preferred over OLS; this is tested in this thesis.

When assessing the quality of prediction results, the MSE is widely used, see equation 2.6. Simpson, Korte, Mauery, & Mistree (1998, p. 6) use this method in their comparison between kriging and regression on simulated data. This is in line with Forsberg & Nilsson (2005). Therefore, this thesis uses the MSE to assess the quality of prediction results and to compare both methods. In order to test the quality when predicting based on a small sample, both methods are tested both on a larger sample and on a smaller sample. Again, the MSE is used to compare the quality of predictions.

2.4 Main drivers of residential real estate prices

Using previous academic research, this section discusses the main drivers of residential real estate prices and gives an expectation about the impact of each factor.

Sirmans, MacDonald, Macpherson, & Zietz (2006, p. 216) performed a meta-analysis of several scientific articles about house prices and nine house characteristics. They find the nine most important characteristics to be (1) square footage, (2) lot size, (3) age, (4) bedrooms, (5) bathrooms, (6) garage, (7) swimming pool, (8) fireplace, and (9) air conditioning. These price drivers are treated in detail in the next two paragraphs.

Firstly, Sirmans et al. (2006, p. 216) find that square footage has a significantly positive impact on the selling price, with a constant effect over time. Although they find a positive effect for lot size, they also find that controlling for square footage lowers the lot size coefficient significantly. Age has a small significant, negative effect, as expected by Sirmans et al. (2006).

Both the number of bathrooms and the number of bedrooms have a positive effect, although the effect of bathrooms is found to be significant only in the West of the United States. Furthermore, a garage, swimming pool, fireplace and air conditioning have a small but significantly positive effect on housing prices.

The selection of relevant factors by Sirmans et al. (2006) is in line with the selection by Basu & Thibodeau (1998). However, Basu & Thibodeau (1998) find that, apart from the mentioned characteristics, location is an important driver. They state that prices are spatially correlated and that location should therefore be taken into consideration when predicting prices.

3 Data and methodology

This section describes the approach to answering the central question of this thesis. Firstly, in section 3.1, characteristics of the data are treated. Secondly, in section 3.2, the construction of the metamodel is discussed and in section 3.3, a framework for comparing both methods is established in order to answer the central question. Lastly, section 3.4 discusses the software and the Matlab functions used.

3.1 Origin and characteristics of the data

This thesis uses a real-life data set of residential real estate prices, from the surroundings and the city of Sacramento, CA, in the United States. The data consist of transactions in residential real estate, in a period of five days, gathered by the Sacramento Bee (Real estate transactions, 2008). The original data set contains 985 observations of residential real estate prices, including the following characteristics:

• Sq ft: the square footage, ranging from 0 to 5822.

• Beds: number of bedrooms, ranging from 1 to 3.

• Baths: number of bathrooms, ranging from 1 to 5.

• Type: the type of real estate, which can be residential, condo, multi-family or unknown.

• Sale date: the date when the transaction took place, ranging from the 15th of May to the 21st of May, 2008.

• Price: the price of the transaction, ranging from 2,000 to 884,790, in US dollars.

• Latitude and Longitude: all describing locations in Sacramento and its surroundings, see figure 2 for all raw data points. Surrounding towns, which are included in the data, each have just a few observations. To correct for characteristics of individual (small) towns, only observations in Sacramento itself are used. These are the observations that lie within a radius of 11 km from the centre of Sacramento. The centre of Sacramento is taken to be Sutter's Fort, the physical centre (located in 'midtown') and the historical centre around which the city is built. Formula 3.2 explains how the distance between two coordinates is calculated.

As there are several anomalies in the data, filtering is needed. Firstly, properties with zero square footage are filtered out. Secondly, only properties of the type residential are relevant for this thesis and all other types are filtered out. Lastly, after filtering out properties outside Sacramento and removing multiple observations with the same coordinates (anomalies in the data), 235 observations are left, the geographical positions of which are displayed in figure 3.

Figure 2: Distribution of all original data points

Figure 3: Distribution of observations after filtering

3.2 A metamodel for estimating housing prices

In order to answer this thesis’ central question, a metamodel for traditional regression is constructed. Optimally, both methods use the same variables to capture the same effects in order to make a fair comparison. Therefore, the same input factors are used by both models, based on the theory in subsection 2.4. For traditional regression, the following linear model is established:

$$\text{Price} = \beta_0 + \beta_1 \times \text{Sq ft} + \beta_2 \times \text{Beds} + \beta_3 \times \text{Baths} + \epsilon \tag{3.1}$$

where all variables are specified in subsection 3.1. For both traditional regression and kriging, higher-order effects are not incorporated, as these complications detract from making a fair comparison.

The basic model is augmented with the location effect, which is proven to be an important driver for determining housing prices (see 2.4). Firstly, the data is plotted on a heat map and areas with higher residential real estate prices are identified. The number of peaks is variable and depends on the data set. Secondly, after areas are identified, the distance of all observations to the absolute peak in this area is calculated. To calculate the distance between two coordinates, the following procedure is applied to each observation (Shumaker & Sinnott, 1984).

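1. Convert the coordinates from decimal degrees to radians, using

$\text{Longitude}_{\text{rad},i} = \text{Longitude}_{\text{dec},i} \times \pi/180$ and $\text{Latitude}_{\text{rad},i} = \text{Latitude}_{\text{dec},i} \times \pi/180$

2. Subtract the coordinates of the peak/centre from those of each observation; the results are called dLon and dLat.

3. Use the haversine formula to calculate the intermediate value $\alpha_i$ (Shumaker & Sinnott, 1984, p. 159)

$$\alpha_i = \sin^2(\text{dLat}/2) + \cos(\text{Latitude}_i) \times \cos(\text{Latitude}_{\text{peak/centre}}) \times \sin^2(\text{dLon}/2)$$

4. Use the arctangent function to calculate the distance, where R is the radius of the earth (6378.1 km)

$$\text{Distance}_i = 2 \times R \times \operatorname{atan2}(\sqrt{\alpha_i}, \sqrt{1 - \alpha_i}) \tag{3.2}$$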

Thirdly, all Distance variables are (linearly) incorporated in both metamodels.
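The four steps of the distance calculation can be sketched in Python as follows. This is a minimal illustration rather than the Matlab code used in this thesis: the function name `haversine_km` is hypothetical, and the check uses two points on the equator a quarter of the globe apart instead of actual Sacramento coordinates.

```python
import math

EARTH_RADIUS_KM = 6378.1  # radius of the earth as used in formula 3.2

def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in decimal degrees."""
    # Step 1: convert decimal degrees to radians.
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    # Step 2: differences in latitude and longitude.
    d_lat = lat1 - lat2
    d_lon = lon1 - lon2
    # Step 3: haversine term alpha.
    alpha = (math.sin(d_lat / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin(d_lon / 2) ** 2)
    # Step 4: distance via the two-argument arctangent.
    return 2 * radius * math.atan2(math.sqrt(alpha), math.sqrt(1 - alpha))

# Two points on the equator, 90 degrees of longitude apart: a quarter of the circumference.
quarter_circle = haversine_km(0.0, 0.0, 0.0, 90.0)
```

For this check the result equals $R \times \pi / 2$, roughly 10,019 km, and the distance from a point to itself is zero.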

After completing the model, the variables are checked for multicollinearity. From the theory, it can be expected that square footage is correlated with the number of bedrooms and bathrooms. This will be checked by using OLS. If the number of bedrooms and bathrooms is insignificant only after adding square footage, these variables are left out. The same logic applies to both variables on an individual basis.

3.3 Framework for comparing quality of predictions

In order to compare traditional regression and kriging, several methods of comparison can be used. This subsection outlines the procedure of comparing these methods. This procedure is executed twice: once for a large sample to predict with, and once for a small sample.

Firstly, to check the predictions made by both methods, the data set is split into two parts. The first part, roughly two-thirds of the data, is used to create a model (traditional regression) or used as known input (kriging). Then, both methods make predictions by using the same inputs of the second part of the data set. The splitting of data is done randomly, see subsection 3.4 for a reference to the exact approach.

Subsequently, the predictions are checked against their real value, using MSE as explained in subsection 2.3.

Lastly, in order to check the consistency of the MSE, the procedure is repeated one thousand times for each of two sample sizes. The first thousand iterations use two-thirds of the data to predict the other one-third, where each iteration uses a different random division of the data. The second thousand iterations use one-third of the data to predict the other two-thirds. The mean and the standard deviation of all MSEs of both methods are compared.
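The repeated procedure can be sketched in Python as shown below. This is an illustration under several assumptions: the data are synthetic stand-ins (not the Sacramento data), only OLS is plugged in as a predictor, and the names `repeated_split_mse`, `fit_ols` and `predict_ols` are hypothetical. Any other method, such as a kriging predictor, could be passed in through the same `fit` and `predict` arguments.

```python
import numpy as np

def repeated_split_mse(X, y, fit, predict, train_fraction, n_repeats, seed=0):
    """Return the out-of-sample MSE of each of n_repeats random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_fraction * n))
    mses = np.empty(n_repeats)
    for r in range(n_repeats):
        perm = rng.permutation(n)  # a different random division at every iteration
        train, test = perm[:n_train], perm[n_train:]
        model = fit(X[train], y[train])
        errors = predict(model, X[test]) - y[test]
        mses[r] = np.mean(errors ** 2)
    return mses

def fit_ols(X, y):
    """Least squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

def predict_ols(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

# Synthetic stand-in data (NOT the thesis data): 235 observations, two inputs.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(235, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 1.0, size=235)

mses_large = repeated_split_mse(X, y, fit_ols, predict_ols, 2 / 3, 1000)
mses_small = repeated_split_mse(X, y, fit_ols, predict_ols, 1 / 3, 1000)
summary = {"large": (mses_large.mean(), mses_large.std(ddof=1)),
           "small": (mses_small.mean(), mses_small.std(ddof=1))}
```

With 235 observations, a training fraction of two-thirds gives 157 observations to predict with and 78 to check, and a fraction of one-third reverses these numbers. The `summary` dictionary holds the mean and standard deviation of the MSEs for each sample size.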

3.4 Software and packages

In this thesis, Matlab is used for calculations. The DACE toolbox handles a large part of the kriging calculations. Several functions of this toolbox are mentioned in this subsection, see also Lophaven, Nielsen, & Søndergaard (2002). For visualizing data points on geographical maps, this thesis uses Ward (2012).

Before calculations are made, the data is split randomly, using the built-in Matlab function randperm.m to select a random two-thirds of the data. Then, a variogram is created by the function variogram.m. The type of correlation function is determined visually, by looking at the variogram. Next, the DACE toolbox establishes a model using the function dacefit.m. This model is used for making predictions, using the predictor.m function.

4 Results

Following the method outlined in section 3, the results are obtained. Firstly, subsection 4.1 describes the characteristics of the data, after the (first) random division into two parts. The sample data set contains 157 observations to make predictions with and the out-of-sample data set contains 78 observations to check those predictions (two-thirds to predict and one-third to check). Subsequently in subsection 4.2, possible peaks in prices are identified and the distance to those peaks is calculated for each observation, using the method of Shumaker & Sinnott (1984). Thirdly, after determining the corresponding correlation function, traditional regression and kriging are used to predict the out-of-sample observations in subsection 4.3. Lastly, in subsection 4.4 the MSEs of both methods are calculated and compared with each other.

4.1 Descriptive statistics

The following results are obtained, after splitting the data, using a larger sample to predict with.

It can be seen that both sets of data have similar characteristics, see table 1. It is notable that prices lie in a large range, while the number of bathrooms takes only three values (1, 2 or 3). Therefore, the effect of bathrooms could be hard to estimate. The variable DistanceToPeak is treated in the next subsection.

4.2 Identifying possible peaks

Following the method outlined in subsection 3.2, two peaks in the data are identified, see figure 4. At the first peak, on the east side of the map, real estate prices cannot reasonably be expected to be higher. Rather, this peak in the data is caused by one or two extrema/outliers. The peak at the centre of the map identifies an actual peak in prices, as this is close to the centre, university and hospital. Therefore, only the central peak will be seen as a peak in housing prices. In figure 4, the peak is marked with a green circle.

Table 1: Descriptive statistics

(a) Complete data set

Variable            Range           Mean
Beds                1-5             3
Baths               1-3             2
Sq ft               539-4246        1331
Price               59222-699000    188594
DistanceToCentre    745-10980       7216
DistanceToPeak      0-12686         7086

235 observations

(b) Sample data set

Variable Range Mean

(c) Out-of-sample data set

Variable Range Mean

4.3 Predicting out-of-sample observations

In this subsection, firstly OLS is applied and the traditional regression model is estimated. Then, predictions are made for the out-of-sample observations based on this model. Secondly, the same procedure is applied to the kriging method. Finally, the results are summarized and reported.

Using the established model, OLS is applied. When Beds and Baths are included, neither variable has a significant effect at the 5% level, whether included together or separately, as can be seen in table 2 (p-value > 0.05). Therefore, a condensed form of the metamodel is used, without Beds and Baths. This OLS metamodel has the form:

$$\text{Price} = \beta_0 + \beta_1 \times \text{Sq ft} + \beta_2 \times \text{DistanceToPeak} + \epsilon \tag{4.1}$$

Table 2: OLS regression results

(a) OLS results including beds and baths

Variable          Estimate       T-statistic
(Intercept)       121927.633     3.322***
Sq ft             178.662        5.734***
Beds              -12601.532     -0.839
Baths             -28526.047     -1.713*
DistanceToPeak    -11.798        -3.133***

Observations      235
R-squared         0.4327

*** p<0.01, ** p<0.05, * p<0.1

(b) OLS results of condensed model

Variable          Estimate       T-statistic
(Intercept)       96117.928      3.410***
Sq ft             143.932        7.487***
DistanceToPeak    -13.595        -3.505***

Observations      235
R-squared         0.4129

*** p<0.01, ** p<0.05, * p<0.1

The OLS results of this model are in line with expectations from the theory. There exists a positive constant, which can be seen as a base price for residential real estate. Furthermore, it can be seen that more square footage is associated with a higher price. The negative coefficient of DistanceToPeak shows that proximity to an expensive area ('peak' here) indeed has a positive effect on real estate prices. Finally, the established model is

$$\text{Price} = 96117 + 143.93 \times \text{Sq ft} - 13.59 \times \text{DistanceToPeak} + \epsilon \tag{4.2}$$

Based on this model, predictions are made for the out-of-sample data set.
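To illustrate how equation 4.2 produces a prediction, the sketch below evaluates it for a single property. The function name `predict_price` is hypothetical, the error term is set to its mean of zero, and the example input uses the mean values of Sq ft (1331) and DistanceToPeak (7086) from table 1 purely as an illustrative property.

```python
def predict_price(sq_ft, distance_to_peak):
    """Point prediction from equation 4.2, with the error term set to zero."""
    return 96117 + 143.93 * sq_ft - 13.59 * distance_to_peak

# Illustrative property with the mean Sq ft and mean DistanceToPeak from table 1.
example_price = predict_price(1331, 7086)

# Marginal effect of one extra square foot, read directly from the model.
marginal_sq_ft = predict_price(1332, 7086) - predict_price(1331, 7086)
```

For this input the predicted price is about 191,389 dollars, which lies close to the mean price of 188,594 in table 1. The difference `marginal_sq_ft` equals the Sq ft coefficient, which illustrates that marginal effects follow directly from the OLS model.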

For kriging, firstly a variogram is established, see figure 5. By looking at the shape of the variogram, an exponential correlation function is chosen. For reference, see figure 1. Subsequently, a model is established using the DACE toolbox and predictions for the out-of-sample data are made (Lophaven et al., 2002). In the next subsection, the predictions are compared in terms of quality.

Figure 5: Variogram of sample data

4.4 Comparison of quality of predictions

Table 3 summarizes the predictions made by both methods, using a larger sample to predict with. The differences between each prediction and the true value are squared, and those squared differences are averaged, yielding the MSE. The first result is that the MSE of the prediction made with OLS is lower than the MSE of the prediction made with kriging.

Table 3: Prediction results

Variable             Min        Max       Mean      Std. deviation
Price                65000      660000    184821    100140
PredictionKriging    92629      350012    194573    52921
PredictionOLS        112349     355476    193198    50146
DifferenceKriging    -150906    330395    -9752     79727
DifferenceOLS        -153403    329147    -8376     77097

MSE OLS        5.9379 · 10^9
MSE Kriging    6.3701 · 10^9

Number of predictions: 78
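The MSEs in table 3 can be related to the reported mean and standard deviation of the differences, because the mean of squared differences equals the squared mean difference plus the variance of the differences. The sketch below checks this relation in Python. It assumes that the reported standard deviations use the n - 1 divisor (the default in Matlab); the function name `mse_from_moments` is hypothetical.

```python
def mse_from_moments(mean_diff, std_diff, n):
    """MSE of n prediction errors from their mean and sample standard deviation (n - 1 divisor)."""
    # (1/n) * sum(d^2) = mean(d)^2 + ((n - 1) / n) * s^2
    return mean_diff ** 2 + (n - 1) / n * std_diff ** 2

# Mean and standard deviation of the differences as reported in table 3 (78 predictions).
mse_ols = mse_from_moments(-8376, 77097, 78)
mse_kriging = mse_from_moments(-9752, 79727, 78)
```

Under this assumption the results are about 5.938 · 10^9 for OLS and 6.370 · 10^9 for kriging, which agrees with the MSEs in table 3 up to rounding of the table entries. In both cases the variance of the differences dominates, while the squared mean difference contributes less than two percent.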

In order to assess not only the accuracy, but also the consistency of the predictions, the whole procedure is repeated 1000 times, for two sample sizes (see subsection 3.3).

The first repeated procedure places a different, random one-third of the data in the out-of-sample data set at every iteration. Table 4a summarizes these results. The second result is that the mean MSE is lower for the OLS method, and the standard deviation of the MSE is slightly lower for OLS as well. Therefore, the quality of predictions made by OLS is slightly better than the quality of predictions made by kriging, although the difference is small.

The second repeated procedure predicts results using the same method but based on a smaller sample, namely one-third of the data. Table 4b summarizes these results. Here, it can be seen that also for a smaller sample size, the mean of the MSE is on average lower for the OLS method than for kriging.

Table 4: Comparison results

(a) Results of repeated procedure, based on a large sample

                    OLS             Kriging
No. repetitions     1000            1000
Mean MSE            6.972 · 10^9    7.146 · 10^9
Std. dev. of MSE    1.430 · 10^9    1.464 · 10^9

Number of predictions: 78, repeated 1000 times

(b) Result of repeated procedure, based on a small sample

                    OLS             Kriging
No. repetitions     1000            1000
Mean MSE            7.173 · 10^9    7.425 · 10^9
Std. dev. of MSE    7.882 · 10^8    8.234 · 10^8

5 Conclusion

This section restates the objective of this thesis, summarizes all important steps and arguments, and finally draws a conclusion. Firstly, the objective of this thesis is discussed, after which the main theoretical points are repeated. Then, the main steps from the methodology are repeated and finally an answer to the research question is drawn from the results.

The main goal of this thesis is to compare traditional regression (OLS) with kriging. When predicting values based on sample data, OLS has traditionally been the standard regression tool. Kriging, meanwhile, has several characteristics which could be advantageous when predicting output in specific situations. This thesis tested predictions made by OLS against those made by kriging on real-life data of residential real estate prices.

Section 2 treats the theory behind OLS and kriging. The main idea behind kriging is that there exists spatial correlation between input factors. This means that, when predicting new values based on input variables, kriging gives more weight to known inputs at a small distance than to inputs that lie further away, see equation 2.5. In practice, when predicting out-of-sample output, the most notable difference between OLS and kriging is that kriging makes predictions for a specific set of input values, rather than establishing a general model, as OLS does. As a result, marginal effects cannot be derived as easily as with OLS.

Section 3 describes the data set used in this thesis. The data contains several variables on residential real estate, where price is the dependent variable that is predicted. Firstly, the data was randomly split into two parts: a sample data set (157 observations) and an out-of-sample data set (78 observations). To fairly assess the quality of the predictions made by both methods, the same metamodel was used for both methods. After checking for multicollinearity, as the number of bedrooms and bathrooms is correlated with the square footage of the property, the final model was established, see equation 4.1.

Section 4 contains the results of the predictions. The most notable results are repeated in this paragraph. Firstly, one clear peak in housing prices was established, see figure 4, and the distances of all observations to that peak were calculated (see equation 3.2 for specifics). Next, a variogram of the sample data was established, see figure 5, which clearly showed an exponential correlation function. Thirdly, a model was estimated for both OLS and kriging, and predictions for the out-of-sample data set were made. Lastly, those predictions were checked against their real value and an MSE was calculated, see table 3. The first result is that OLS has a slightly lower MSE than kriging.

out-of-sample data set was constructed. All 1000 MSEs were calculated and their mean and standard deviation are reported in table 4a. The results of the repeated procedure confirm the first result: OLS has a slightly lower MSE, with a slightly lower standard deviation.

In order to check the quality of predictions for smaller samples, the procedure was repeated with a smaller sample: one-third of the data (78 observations) was used to predict the other two-thirds (157 observations). The same steps were taken, resulting in the mean and standard deviation of the MSE reported in table 4b. The results of this second repeated procedure confirm the earlier results: OLS has a slightly lower MSE, with a slightly lower standard deviation.
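
The repeated procedure summarized above can be sketched as follows. This is a simplified illustration: it uses synthetic data and an OLS fit with a single regressor instead of the metamodel of equation 4.1, and it omits the kriging predictor. All names, the seeds and the data-generating process are assumptions for the purpose of the example.

```python
import random
import statistics

def fit_ols(xs, ys):
    """Closed-form OLS estimates (intercept, slope) for a single regressor."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def mse(actual, predicted):
    """Mean squared error between realized and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def repeated_mse(xs, ys, n_sample, repetitions=1000, seed=0):
    """Repeat the random split, estimate on the sample, and score the out-of-sample part."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    mses = []
    for _ in range(repetitions):
        rng.shuffle(idx)
        fit, hold = idx[:n_sample], idx[n_sample:]
        b0, b1 = fit_ols([xs[i] for i in fit], [ys[i] for i in fit])
        preds = [b0 + b1 * xs[i] for i in hold]
        mses.append(mse([ys[i] for i in hold], preds))
    return statistics.mean(mses), statistics.stdev(mses)

# Synthetic stand-in for the 235 observations (not the thesis data).
data_rng = random.Random(1)
xs = [data_rng.uniform(50, 300) for _ in range(235)]
ys = [20 + 1.5 * x + data_rng.gauss(0, 40) for x in xs]

mean_large, sd_large = repeated_mse(xs, ys, n_sample=157)  # two-thirds used for estimation
mean_small, sd_small = repeated_mse(xs, ys, n_sample=78)   # one-third used for estimation
print(mean_large, sd_large, mean_small, sd_small)
```

Reporting both the mean and the standard deviation of the MSEs shows how much the outcome of a single split depends on which observations happen to end up in the out-of-sample data set.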

Therefore, the answer to the central question is that OLS gives a better prediction than kriging, although the difference is close to zero.

6 Discussion

This section evaluates and discusses the approach of this thesis, including the collection of data, the establishment of the metamodel and the comparison of prediction results. Firstly, two limitations are mentioned, after which two recommendations for future research are made.

A first limitation is due to the data used. The type and size of the data collected limit the quality of the comparison. Although residential real estate prices are heavily dependent on square footage and distance to an expensive area (see subsection 2.4), a large part of the price is determined by other factors. Some of these factors, for example the quality of the house, the arrangement of rooms and the placement of the garden, are very subjective and inherently difficult to model. Therefore, the model contains a lot of noise, captured in the error term. Because much of the variance in the data remains unexplained, corresponding to a relatively low R² (see table 2), making a comparison is difficult. This is aggravated by the small out-of-sample data set, which contains relatively few observations (78). In order to counter the uncertainty of this comparison, the method was repeated 1000 times, and although the mean MSE of OLS is slightly lower than that of kriging, the relatively high standard deviation hampers drawing a definitive conclusion.

A second limitation is due to the nature of OLS and kriging. As discussed in the theory, OLS is useful when there is little spatial correlation and when determining marginal effects is important. Kriging, however, is strong when spatial correlation is present and prediction of output is more important than having marginal effects. Therefore, both methods have their own strengths and weaknesses, and this is not taken into account when comparing these methods. However, the reason for using exactly the same metamodel was to make a completely fair comparison. Next, two recommendations are made.

A first recommendation is to use a more extensive metamodel in order to explain more of the underlying effects, resulting in a smaller error term. With less variance unexplained and thus a higher R², a better prediction can be made and the variance of the MSE of both methods will be smaller, which could lead to a clearer comparison and possibly a more definitive conclusion.

A second recommendation is to use a different data set. This thesis uses residential real estate data covering three days in a small geographical area, see also the first limitation. As the final complete data set holds just 235 observations, a larger data set would allow for better predictions and for multiple or larger out-of-sample data sets.

References

Basu, S., & Thibodeau, T. G. (1998). Analysis of spatial autocorrelation in house prices. The Journal of Real Estate Finance and Economics, 17 (1), 61–85.

Chen, V. C., Tsui, K.-L., Barton, R. R., & Meckesheimer, M. (2006). A review on design, modeling and applications of computer experiments. IIE Transactions, 38 (4), 273–291.

Cressie, N. (1991). Statistics for spatial data. John Wiley & Sons.

Forsberg, J., & Nilsson, L. (2005). On polynomial response surfaces and kriging for use in structural optimization of crashworthiness. Structural and Multidisciplinary Optimization, 29 (3), 232–243.

Jin, R., Chen, W., & Simpson, T. W. (2001). Comparative studies of metamodelling techniques under multiple modelling criteria. Structural and Multidisciplinary Optimization, 23 (1), 1–13.

Kleijnen, J. P. (2008). Design and analysis of simulation experiments (Vol. 20). Springer.

Kleijnen, J. P. (2009). Kriging metamodeling in simulation: A review. European Journal of Operational Research, 192 (3), 707–716.

Kleijnen, J. P. (2015). Regression and kriging metamodels with their experimental designs in simulation: review. CentER Discussion Paper, 2015.

Kleijnen, J. P., Sanchez, S. M., Lucas, T. W., & Cioppa, T. M. (2005). State-of-the-art review: a user's guide to the brave new world of designing simulation experiments. INFORMS Journal on Computing, 17 (3), 263–289.

Lophaven, S. N., Nielsen, H. B., & Søndergaard, J. (2002). DACE: a Matlab kriging toolbox, version 2.0 (Tech. Rep.).

Matheron, G. (1963). Principles of geostatistics. Economic Geology, 58 (8), 1246–1266.

Real estate transactions. (2008). Retrieved from https://support.spatialkey.com/spatialkey-sample-csv-data/

Shumaker, B., & Sinnott, R. (1984). Astronomical computing: 1. computing under the open sky. 2. virtues of the haversine. Sky and Telescope, 68, 158–159.

Simpson, T. W., Korte, J. J., Mauery, T. M., & Mistree, F. (1998). Comparison of response surface and kriging models for multidisciplinary design optimization (Tech. Rep. No. AIAA-98-4755). United States of America: Institute of Aeronautics and Astronautics, Inc.

Sirmans, G. S., MacDonald, L., Macpherson, D. A., & Zietz, E. N. (2006). The value of housing characteristics: a meta analysis. The Journal of Real Estate Finance and Economics, 33 (3), 215–240.

Van Beers, W., & Kleijnen, J. P. (2004). Kriging interpolation in simulation: a survey. In Simulation conference, 2004. proceedings of the 2004 winter (Vol. 1).

Ward, D. J. (2012). Map tool: Map latitude longitude coordinates/points. Retrieved 2016-06-01, from http://www.darrinward.com/lat-long
