The predictive power of Funda


Dorinth van Dijk

Abstract

In this thesis internet data is used for predicting Dutch housing market developments. It is shown

that data on listed properties from the brokerage website Funda is able to proxy for supply and

demand in the Netherlands. Furthermore, the Funda data make it possible to examine price and liquidity

dynamics of the Dutch housing market. In a panel VAR framework it is shown that transaction

volume responds relatively quickly to changes in demand and supply while prices react more

gradually. This is in accordance with search and matching models where changes in demand and

supply are first incorporated into an increase in liquidity while prices respond with a time lag.

This thesis shows that the usage of alternative data sources is useful in economic research and that

these data can provide information to future homebuyers.

University of Amsterdam
Supervisor: Prof. Dr. Marc K. Francke
Second reader: Dr. Jeroen E. Ligterink
Master Thesis Business Economics – Finance / Real Estate Finance
Student Number: 10642935
November 2014

Abstract
Acknowledgements
1 Introduction
2 Literature review
2.1 Price dynamics
2.1.1 Market efficiency
2.1.2 Price-Volume Correlation
2.1.3 Empirical evidence of price-volume correlation
2.2 Modelling house prices and transaction volume
2.3 New data
2.3.1 Alternative data in economics and finance
2.3.2 Alternative data in the housing market
2.4 Relevance
3 Data
3.1 Data sources
3.2 Aggregation process
3.3 Variable description
3.3.1 Cross-section models
4 Econometric Modelling
4.1 Cross-sectional models
4.2 Panel data models
4.3 Hypotheses
5 Results
5.1 Cross-sectional models
5.2 Panel data models
5.2.1 House prices
5.2.2 Transaction volume
5.2.3 Synergizing house prices and transaction volume
5.3 Summarizing the results
6 Robustness checks
6.1 Different smoothing parameters
6.2 OLS models
7 Implications and limitations
8 Conclusion
References

Acknowledgements

It is impossible to write a thesis without proper supervision. Therefore I would first like to thank

Prof. Dr. Marc Francke who provided such. Your comments and suggestions were most useful

and my academic skills have definitely benefited from them. Next I would like to thank Ir. Alex

van de Minne for providing me with many creative and useful ideas in the start-up phase of the

project. Especially, the visit to Funda and your comments on the proposal were most useful. Also,

thank you Dr. Martijn Dröes for providing me with useful technical guidance.

Special thanks go out to Funda for providing the data. In particular thank you Ruben

Scholten for gathering the massive amounts of data. Without this data it would clearly have been

impossible to write this thesis.

Finally, I would like to thank my family and friends for providing the necessary


1 Introduction

When predicting economic growth economists usually look at consumer confidence surveys.

Although a similar indicator exists for the Dutch housing market (i.e. the Vereniging Eigen Huis

Marktindicator), it is generally accepted that house prices differ regionally (Capozza et al., 2002).

Therefore it would be interesting to have an indicator that foreshadows housing prices on a

regional scale. This would in turn be valuable for home buyers and sellers to time their

transactions. But more importantly, this indicator could assist realtors and other real estate

professionals in predicting “where the market is heading”. This thesis seeks to develop such an

indicator using alternative data sources.

As for these sources, the internet proves to be a valuable source of information for

predicting economic variables. More specifically, Wu and Brynjolfsson (2009) point out that for the

US market Google search query data predict house transaction volume and house prices with

relatively high accuracy. The basic idea is that people who are interested in buying a house first start

by examining potential houses on the internet. Askitas and Zimmerman (2009) call this behavior

preparatory steps to spend.

The usage of both transaction volume and house prices as dependent variables is no

coincidence as there seems to exist a relationship between these (Stein, 1995; Clayton et al., 2010;

De Wit et al., 2013). Booming markets with rising prices are typically characterized by more

liquidity, while down markets with declining prices usually show less liquidity (De Wit et al.,

2013). A reason for this relationship stems from search and matching models in which transactions

In the light of Wu and Brynjolfsson (2009) this thesis uses internet search query data in

order to decompose these effects for the Dutch housing market. The Netherlands is characterized

by the fact that most people look on one specific website in order to search for and compare

houses: Funda (Funda, 2013). Therefore, the activity on this website, or more specifically search

behavior, could give a useful indication of (future) demand. Moreover, the number of listed

properties could be a useful supply indicator. The advantage of these data is that they can be

determined on a very detailed scale (i.e. on zip code level).

The insights in the relationship between transaction volume and prices provided in this

thesis are inherent to the usage of new data. As the usage of internet data is probably one of the

most recent developments in economic research this thesis is also affiliated with these

developments. The research question to be answered is:

“To what extent can Funda search data predict house price and transaction volume

dynamics in the Netherlands?”

In this thesis the number of times watched and the number of objects are obtained from Funda. The latter

equals the number of houses that are for sale on Funda in a certain neighborhood (on zip code

level) and the former describes how many times these houses have been clicked upon. By dividing

the times watched by the number of objects a demand versus supply variable is generated. In

order to examine the relationship between this variable and housing market developments the

data are implemented in two different types of models. First, in order to introduce the Funda data in

an approachable manner, cross-sectional models are estimated. Here the data are included in

estimated in which the dynamic relationships between the Funda data, house prices and

transaction volume are examined. The modelling process will take off with single equation models

for house prices and transaction volume. These will subsequently be extended into a panel VAR

set-up in order to examine responses of house prices and transaction volume to demand shocks.

As the sample period of the obtained data is rather short (i.e. 2011-2013), the goal of this research

is to model the short-run dynamics of the Dutch housing market.
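The demand versus supply variable described above (times watched divided by number of objects) can be illustrated with a minimal sketch. The record layout, the zip codes and all numbers below are assumptions made for illustration only; they are not actual Funda data, and the handling of months without listings is a choice of this sketch rather than something stated in the thesis.

```python
# Hypothetical monthly Funda records: (zip_code, month, times_watched, n_objects).
# Field names and values are illustrative assumptions, not actual Funda data.
records = [
    ("1011", "2011-01", 12000, 40),
    ("1011", "2011-02", 15000, 50),
    ("3511", "2011-01", 4500, 30),
    ("3511", "2011-02", 0, 0),  # no listings in this month
]


def demand_supply_ratio(times_watched, n_objects):
    """Clicks per listed property; None when nothing is listed."""
    if n_objects == 0:
        return None  # ratio is undefined without listings
    return times_watched / n_objects


# One ratio per zip code and month.
ratios = {
    (zip_code, month): demand_supply_ratio(watched, objects)
    for zip_code, month, watched, objects in records
}
```

A higher ratio indicates more viewing activity per listed property, which is the sense in which the variable proxies for demand relative to supply.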

This thesis is structured in the following manner. In the second chapter an overview of the

literature is given on house price dynamics, the relationship between liquidity and prices and use

of alternative data in economics. In the third chapter the used data sources are described both

verbally and statistically. This chapter is succeeded by chapter 4 which discusses the applied

methodology. Furthermore, some hypotheses are composed in this chapter. Chapter 5 discusses

the main results and seeks to answer the hypotheses and research question. In the sixth chapter

several robustness checks are performed. In chapter 7 some limitations regarding Funda data and

the research are addressed. Moreover, some implications for further research will be touched

upon in this chapter. The thesis ends with a conclusion that tries to capture the main points of this


2 Literature review

Three different topics are addressed in this literature review. First, dynamics regarding house

prices are discussed. This is followed by a discussion concerning the relationship between house

prices and transaction volume. Third, the use of “new” data in economic research is reviewed.

These are data that are related to the recent increase in internet usage. The applicability of these

sources in economic research is canvassed. This chapter will end with a note on the relevance.

2.1 Price dynamics

2.1.1 Market efficiency

One of the most well-known and discussed theories in finance is the Efficient Market Hypothesis

(EMH). This hypothesis states that all information regarding stocks is reflected into current prices

(Bodie et al., 2002). The degree to which information (i.e. public and/or private) is incorporated is

related to the form of this hypothesis. Kendal (1953) found that stock market price changes show

little serial correlation. This effectively means that past returns are unrelated to future returns.

Therefore no repeated patterns can be found in the stock market, and the analysis of past price

movements in order to generate positive future returns is futile. Outsiders should thus not be

able to generate positive returns solely by watching price movements. Kendal presents five

reasons why some investors might be able to have success: (1) luck, (2) at certain times all prices

rise so they can’t go wrong, (3) by exploiting inside information, (4) by acting first, and (5) by

exploiting scale advantages so that broker fees and stamp duties evaporate.

by Fama (1965). He provides further empirical evidence for the random walk theory. This theory

states that stock price changes are as predictable as a random set of numbers. In order for the

theory to hold, two statements are discussed by Fama (1965): price changes are (1) independent

and (2) normally distributed variables. The former is found to be true as all three methods (i.e.

serial correlation model, runs analysis and Alexander’s filter technique) confirm the statement

that price changes are independent of each other. The latter statement is rejected by Fama. He

finds that stock price returns are not normally distributed. Extreme events (very large positive or

negative returns) are more likely to happen than implied by a normal distribution. This

phenomenon is also known as fat tails (Bodie et al., 2002).

Counter-evidence for the random walk theory is provided by Lo & MacKinlay (1988). The

empirical evidence in this article shows that the random walk model is not valid for weekly stock

returns. The conclusion is that (significant) positive autocorrelation is evident in the stock market

when return figures are based on weekly data. More specifically, a first-order autocorrelation

coefficient of 0.30 is found. Furthermore, Lo & MacKinlay (1988) extend their framework by

creating size-sorted portfolios (e.g. small, medium and large stocks). The conclusion is that the

random walk model is rejected for all three categories. The behavior of small stocks however,

seems to contrast the most with the random walk model. There are numerous studies that show

that the stock market exhibits “anomalies” (see Malkiel, 2003 for an overview). The existence of

these anomalies however, does not give way to portfolio strategies that generate excess returns

given the risk (Malkiel, 2003).
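The first-order autocorrelation coefficient referred to above measures how strongly a return is related to the return of the preceding period. The following minimal sketch computes the sample coefficient for two made-up return series; the numbers are purely illustrative and are not taken from Lo & MacKinlay (1988) or any market data.

```python
def lag1_autocorrelation(returns):
    """Sample first-order autocorrelation of a return series."""
    n = len(returns)
    mean = sum(returns) / n
    # Co-movement between each return and its predecessor.
    cov = sum((returns[t] - mean) * (returns[t - 1] - mean) for t in range(1, n))
    var = sum((r - mean) ** 2 for r in returns)
    return cov / var


# Illustrative weekly returns (made-up numbers).
alternating = [0.01, -0.01, 0.01, -0.01, 0.01, -0.01]
trending = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]

rho_alternating = lag1_autocorrelation(alternating)  # negative: reversals
rho_trending = lag1_autocorrelation(trending)        # positive: persistence
```

Under a random walk the coefficient should be close to zero, so a clearly positive value, as reported for weekly stock returns, indicates that past returns carry some information about future returns.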

In the context of the efficiency of the housing market Case and Shiller (1989) provide

predictable. They use their Weighted Repeated Sales (WRS) index, which had only recently been

developed at the time of the research, to show that housing prices are predictable in the short run. This WRS index is

an index based on houses that are sold at least twice and did not significantly change in between

the two times of sale (see Case and Shiller 1987 for more on this WRS index). The authors show

that after-tax excess returns in Atlanta, Chicago, Dallas and San Francisco are strongly dependent

on the after-tax excess returns of the preceding year. Case and Shiller (1989) show that a change

in house prices in a certain year is usually followed by a change of about half the size in the same

direction. The authors do however emphasize that individual house price changes are not

predictable due to the excessive amounts of noise existent in transaction prices.

In an overview of empirical studies given by Cho (1996) regarding the efficiency of real

estate markets, the conclusion is that they are not fully efficient. A general trading rule that

generates excess returns seems to be practically infeasible. This is mainly the result of relatively

high transaction costs in real estate markets. This thesis does not relate directly to the efficiency of

the housing market in general, in the sense that it tries to examine this efficiency, nor does it try

to formulate a general trading rule to provide excess returns. It merely builds on the earlier

findings that suggest that real estate markets are not fully efficient, which means that modeling

house prices is not futile.

2.1.2 Price-Volume Correlation

Since housing markets are not perfectly efficient and no central housing exchange exists, the

housing market can be characterized as a search market (Diaz & Jerez, 2013). In a search market

which means a house will be transacted. This searching and matching principle is important in

house price dynamics. Buyers and sellers set their reservation prices for which they are willing to

buy or sell their house (Geltner et al., 2007). A transaction will occur if the reservation price of the

buyer equals or exceeds the reservation price of the seller. The way in which buyers and sellers react

to a shock (i.e. set their reservation prices) may be different. Genesove and Han (2011) show that

sellers react to a demand shock with a lag. In other words sellers gradually adjust their reservation

prices upwards when demand increases. If demand increases the group of buyers willing to pay

the sellers’ reservation prices increases. Hence, the probability that a transaction occurs increases.

This theory could explain the observed price-volume correlation in housing markets (De Wit et

al., 2013). Consider a positive demand shock which is not instantaneously observed by all market

participants. The increase in demand will result in more transactions at first, or houses will be sold

quicker. Sellers react more slowly to this increase in demand. In time they will increase their

reservation prices until the point at which the demand shock is completely absorbed into higher

prices. Therefore this theory gives rise to a positive price-volume correlation in the housing

market.
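The mechanism described above can be made concrete with a stylized numerical sketch. It assumes a single common seller reservation price, a one-off upward shift in all buyer reservation prices, and sellers who close a fixed share of the remaining gap each period. These assumptions, the function name and all numbers are illustrative only; this is not the model of Genesove and Han (2011) or De Wit et al. (2013).

```python
def simulate_demand_shock(buyers, seller_price, shock, adjustment, periods):
    """Return a list of (volume, seller_price) per period after a demand shock.

    buyers: buyer reservation prices before the shock
    seller_price: common seller reservation price before the shock
    shock: upward shift in every buyer's reservation price
    adjustment: share of the remaining gap sellers close each period (0 to 1)
    """
    shifted = [b + shock for b in buyers]  # buyers react immediately
    target = seller_price + shock          # level at which the shock is absorbed
    path = []
    for _ in range(periods):
        # A transaction occurs when the buyer's reservation price
        # equals or exceeds the seller's reservation price.
        volume = sum(1 for b in shifted if b >= seller_price)
        path.append((volume, seller_price))
        # Sellers only gradually revise their reservation price upwards.
        seller_price += adjustment * (target - seller_price)
    return path


buyers = [90, 95, 100, 105, 110]
baseline_volume = sum(1 for b in buyers if b >= 100)  # volume before the shock
path = simulate_demand_shock(
    buyers, seller_price=100, shock=10, adjustment=0.5, periods=4
)
```

In this sketch the number of transactions jumps in the first period and then falls back to its pre-shock level, while the seller reservation price rises step by step towards its new level, which reproduces the positive price-volume correlation in a very simple setting.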

De Wit et al. (2013) study this price-volume correlation for the Dutch housing market

and seek to explain it. Besides this

search and matching approach De Wit et al. (2013) identify two other groups of theories: (1) the

interaction between downpayment constraints, mobility and house prices and (2) behavioral

explanations. The authors stress however that the three approaches are not mutually exclusive.

The fundamentals of the group of downpayment constraints lie within the work of Stein

homeowners that would like to buy another (more expensive) house are constrained by a

downpayment that they have to make in order to buy the new house. In this proposition, credit

constraints are linked to transaction volume and price changes to explain the correlation between

these. Consider the example when house prices decrease. In this case current homeowners have

potentially less money to spend on a new home. This will hold especially when the current home

is financed with a mortgage, since this will first have to be amortized. The result is that

homeowners have little equity left in order to make a possible downpayment for their new home.

The consequence is that fewer people will sell and buy houses (i.e. the transaction volume will

decrease). This could in turn result in lower prices. The same is true vice versa if house prices

increase. In this case current homeowners should have more equity left to make their

downpayment, which allows them to “move up on the housing ladder” (De Wit et al., 2013).

Therefore, according to Stein (1995), price changes and changes in transaction volume are

thought to be reinforcing each other.

Finally, there are behavioral explanations for the relationship between prices and

transaction volume. The behavioral bias of loss aversion is generally thought to hold for

homeowners (De Wit et al., 2013). When markets go down, homeowners don’t like to sell their

houses for less than what they paid. The result is that reservation prices of sellers, hence asking


2.1.3 Empirical evidence of price-volume correlation

In a VEC-framework De Wit et al. (2013) model the long-run relationship between the rate of sale

and house prices in the Netherlands. They seek to explain what causes this correlation in the Dutch

situation. Their results suggest that the Dutch evidence seems to fit the search and matching

models best: buyers and sellers adjust gradually to changes in fundamentals. If news arrives, first

the rate of sale seems to change and prices react much more gradually.

Clayton et al. (2010) employ a panel VAR-framework and confirm that a price-volume

correlation exists for the US housing market. They confirm the downpayment hypothesis of Stein

(1995). Furthermore, it is found that prices react in a different manner than trading volume to

shocks in fundamentals (i.e. exogenous shocks). Transaction volume seems to react more strongly

than prices. Although it is not specifically mentioned in the article, this also confirms the first set

of theories: transaction volume reacts quickly while prices are adjusted gradually.

2.2 Modelling house prices and transaction volume

Based on the literature several possible variables that influence house prices and/or transaction

volume can be identified. In Table 2-1 an overview of three articles that research both house prices

and transaction volume is presented. In this table the effect of several explanatory variables is

listed. The most attention is given to variables that affect house price dynamics in the short-run as

this thesis seeks to model these.

A common factor among all three articles is that all researchers include seasonal effects

due to seasonality in house prices and transaction volume. Furthermore, there seems to be mixed

first-order autocorrelation while Clayton et al. (2010) find negative coefficients for these parameters.

The reason for these negative coefficients according to Clayton et al. is that both prices and

transaction volume adjust to exogenous shocks. In this adjustment process there may be

overshooting. House prices for example tend to overreact to an income shock, which causes them

to rise at first but is followed by a subsequent drop towards a new steady state level in a relatively

short timeframe. Nevertheless, it is generally accepted that house prices exhibit positive

autocorrelation in the short run and are mean reverting in the long run (Capozza et al., 1997).

Household income is expected to influence house prices and transaction volume positively as

more people will be able to move up the housing ladder.

The mortgage rate, which is linked to the interest rate, has a negative impact as a higher

mortgage rate makes buying a house effectively more expensive. The evidence regarding

(un)employment is rather mixed. A priori a higher unemployment rate should influence both

prices and transaction volume negatively as generally fewer people will be able to buy a house.

According to De Wit et al. (2013) this holds for the Dutch housing market in the long run. The

unemployment rate, however, does not seem to influence the short-term dynamics. The latter is

also found to be true for US house prices in Clayton et al. (2010). The unemployment rate does

influence transaction volume negatively in Clayton et al.

Clayton et al. (2010) also include two different stock market variables: the index returns,

which are the first-order difference of the level of the S&P 500, and the second-order difference of

the level of the S&P 500. The former is hypothesized to proxy for financial wealth and constraints

of households while the latter represents up- or downtrends of the stock market. Furthermore, Wu


Table 2-1: Overview of the variables used in articles to explain house price changes and transaction volume. + indicates that the effect is positive and statistically significant, – indicates a statistically significant negative effect, 0 indicates that the estimated coefficient is not statistically different from 0, and other indicates other control variables. SR indicates short-run estimates while LR indicates a long-run relationship.

Article Influence House prices* Transaction volume*

Clayton et al. (2010)

+

Employment Household income Mortgage rate trend S&P 500 trend

House prices lags 2/3 Turnover lag 3

Employment Household income Mortgage rate trend House prices lag 1

-

Mortgage rate level S&P 500 level House prices lag 1

Mortgage rate level S&P 500 trend Unemployment rate Turnover lags 1/2/3 0 Unemployment Turnover lags 2/3 S&P 500 level House prices lags 2/3

other

Quarter dummies Quarter dummies

De Wit et al. (2013) + Rate of sale (SR) Transaction prices (SR) List prices (SR/LR) Rate of entry (SR) Rate of sale (SR/LR) - Unemployment (LR) Interest rate(SR/LR) Rate of sale (LR) Rate of entry (SR/LR) Unemployment (LR) Interest rate (SR/LR) 0 Unemployment (SR) Transaction prices (LR), Unemployment (SR) Rate of entry (LR) Transaction prices (SR/LR) List prices (SR/LR) other

Seasonal dummies Seasonal dummies

Wu and Brynjolfsson (2009) + Transactions lag 1 HPI lag 1

Google “RE Agencies” contemp. and lag 1 index

Google “RE listing” contemp. index

Transactions lag 1

Google “RE Agencies” contemp. index Google “RE Listing” contemp. index

-

Google “RE listing” lag 1 HPI lag 1

0

Google “RE Agencies” lag 1, Google “RE Listing” lag 1

other

Quarter dummies State fixed effects MSA fixed effects Population

* The exact variables that the presented articles use as dependent variables are: Clayton et al. (2010) study Home prices and Turnover; De Wit et al. (2013) study transaction prices, list prices, rate of sale and rate of entry, in this overview the variables that affect transaction prices and the rate of sale are presented; Wu and Brynjolfsson (2009) study the House Price Index (HPI) and transaction volume


2.3 New data

2.3.1 Alternative data in economics and finance

The use of alternative data in scientific research recently has received some attention in the

literature (Lohr, 2012). Several examples on this topic will be discussed briefly. In reviewing these

articles specific attention is given to the alternative dataset which has been used. Most articles use

data that measure the gathering of information. The authors basically argue that information

which is gathered today by consumers, may say something about actions taken in the future. Or

as Askitas and Zimmerman (2009) put it: they provide a measure of preparatory steps to spend.

Askitas and Zimmerman (2009) find that Google Insights (known today as Google Trends)

can be useful in employment forecasts in Germany. They use an index based on Google search

queries like unemployment office (k1), unemployment rate (k2), personnel consultant (k3) and the

names of German Job search agencies (k4). They argue that k1 is associated with flow into

unemployment as activity along this search query should be linked to people contacting the

unemployment office. Hence, k1 is expected to have positive influence on unemployment figures.

According to Askitas and Zimmerman, k2 is just conveniently linked to unemployment and they

do not hypothesize this variable. The search for personnel consultant (k3) should be correlated

with the fear of high-skilled workers to get fired in a restructuring process, but k3 is not explicitly

hypothesized. Finally, the activity in the search query index for job searching agencies (k4) should

be associated with flow out of unemployment, since this search query should be related to job

searching activities of unemployed. They hypothesize that k4 should have a negative impact on

unemployment office (k1) and German job search agencies (k4) produces the best results in their

error correction framework. More specifically, they find that k1 has a significant positive impact

on unemployment figures in both the short and long-run and k4 has a significant negative impact

on unemployment figures in the short-run.

Vosen and Schmidt (2011) also use data from Google Trends to forecast economic data. In

their article, they create a new indicator for private consumption based on Google Trends. They

compare this Google Trends indicator with two survey-based private consumption indicators (i.e.

MSCI and CCI). Google groups several search queries into aggregated search indices regarding a

certain topic. For example “Real Estate” is one of those groups. The authors select 56 categories

that they think are connected to private consumption. They find that using any of the three

indicators improves the out-of-sample one-month ahead forecasts of their ARMAX model. But

more interestingly, the Google Trends indicator significantly outperforms the MSCI and CCI

indicators.

More related to the topic of finance is the article of Bollen et al. (2011). In their article they

use “collective mood indicators” to predict the stock market. Since the stock market is known to

be relatively efficient and returns will not depend on past returns but only on news, returns are

not predictable (i.e. semi-strong EMH). But according to Bollen et al. (2011), returns in the stock

market should, besides news, also depend on mood and sentiment. In order to capture the

collective mood and sentiment they analyze almost 10 million tweets. These tweets are then

grouped into seven different mood dimensions using two different software packages (i.e.

Opinion Finder and GPOMS). Opinion Finder groups the tweets into positive or negative tweets

happy). Using a Granger causality analysis it is shown that the “calm” dimension has the most

significant predictive power in forecasting the stock market. More specifically, it is found that the

calm dimension Granger causes the Dow Jones Industrial Average (DJIA) three to four days in

advance. In order to estimate a forecasting model to accurately predict the DJIA closing value, the

authors employ a Self-Organizing Fuzzy Neural Network (SOFNN).

2.3.2 Alternative data in the housing market

It seems that real estate research proves to be no exception when it comes to applying alternative

data sources in predicting future trends. Wu and Brynjolfsson (2009) use data from Google Trends

to forecast future transaction volume and price developments in the housing market. They use

quarterly search query data from Google of 51 states in the United States. In their research two

different predefined search query indices are used: (1) Real Estate Listing and (2) Real Estate

Agencies. Category (1) should reflect all search queries related to real estate listings and category

(2) should approximate for home buying activities. They add these Google search variables to

their baseline model, which is a seasonal autoregressive model. In this model contemporaneous

home sales (levels) are regressed against previous quarter home sales, previous quarter house

price index (i.e. the WRS index of Case and Shiller), and control variables (e.g. population, entity

fixed effects and time fixed effects). Wu and Brynjolfsson (2009) find that the coefficient on Google

search frequencies in the same month is significantly related to contemporaneous home sales, but

the coefficient on lagged search frequencies is not. More precisely, a one-percent increase in the

contemporaneous index on Real Estate Listing is associated with an increase of 20.8 percent

Real Estate Agencies index relates to an increase of 14.8 percent in home sales in the same

quarter. In order to perform out-of-sample forecasts, one-quarter forecast models are employed.

The findings are that the Mean Absolute Error (MAE) of the contemporaneous model is

0.170, which is an improvement of 2.3 percent over the baseline model without search indices. The

MAE of the one quarter forecast model is equal to 0.172, which is an improvement of 7.1 percent

over the baseline model.
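For reference, the MAE is the average absolute difference between forecasts and realizations. The sketch below computes it for two made-up forecast series and expresses the difference as a percentage reduction relative to the baseline. Reading the reported improvements as such a relative reduction is an assumption of this sketch, and the numbers are illustrative rather than taken from Wu and Brynjolfsson (2009).

```python
def mean_absolute_error(actual, predicted):
    """Average absolute forecast error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_mae, model_mae):
    """Percentage reduction in MAE relative to the baseline model."""
    return 100 * (baseline_mae - model_mae) / baseline_mae


# Illustrative values only (not data from the cited article).
actual = [10.0, 12.0, 11.0, 13.0]
baseline_forecast = [9.0, 13.0, 12.0, 11.0]   # model without search indices
search_forecast = [9.5, 12.5, 11.5, 12.0]     # model with search indices

baseline_mae = mean_absolute_error(actual, baseline_forecast)
search_mae = mean_absolute_error(actual, search_forecast)
improvement = relative_improvement(baseline_mae, search_mae)
```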

Similarly, Wu and Brynjolfsson (2009) employ models to forecast the Case and Shiller

house price index (HPI). A one-percent increase in the contemporaneous Real Estate Agencies

index is associated with an increase of 3.5 to 6 percent of the HPI. The past search frequencies of

Real Estate Agencies prove to be insignificant. Moreover, a one-percent increase in the current

search frequency of Real Estate Listing is related to an 8 to 9 percent increase of the HPI. But

strikingly, the coefficient on the one-quarter lagged variable of Real Estate Listing turns out to be

negative and significant. The authors do not give an explicit explanation for this phenomenon,

but suggest that this may have something to do with the difficulty in forecasting prices. The reason

why prices may be more difficult to forecast is the result of complex supply and demand

dynamics. Nevertheless, the MAE of the baseline model that predicts the current HPI improves

by 2.5 percent when Google search data is included. Moreover, the MAE improves by 3 percent

when predicting the future HPI.

Another notable result of the research is that Google Trends data alone are able to forecast the

contemporary transaction volume with relatively high accuracy. The result is obtained by leaving

out the autoregressive component of transaction volume and the lagged value of the HPI. The

contemporary HPI forecast model, the R² is 0.987 with the autoregressive component of the HPI

and lagged value of transaction volume. When these components are left out (i.e. only the Google

Trend data and controls remain) the R² remains at 0.987.

Although the results of the article of Wu and Brynjolfsson (2009) seem convincing, their

model may contain some issues. In their article, no results are provided on whether the time series

used (e.g. transaction data, price index and Google Trends data) are stationary. But the authors

specifically model levels. This may cause spurious regression if the series are actually

non-stationary and not cointegrated (Stock & Watson, 2011). In many studies price changes, which

are already differenced, are modelled (see for example Malpezzi, 1999). It also seems that the mean and

median price per month on a national aggregated scale in the Netherlands are non-stationary

(more on this in the methodology chapter). Furthermore, in the article of De Wit et al. (2013) a

VEC model is applied to prices on the Dutch market and transactions, which implies these

variables are I(1). This in turn implies these variables are non-stationary in levels.
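A series that is I(1) becomes stationary after taking first differences, which is why price changes rather than price levels are commonly modelled. The sketch below shows the differencing operation on a made-up trending series; the function name and the numbers are illustrative assumptions. A formal assessment of stationarity would require a unit root test, which is not shown here.

```python
def difference(series, order=1):
    """Apply first differencing `order` times."""
    for _ in range(order):
        # Change between each observation and the one before it.
        series = [b - a for a, b in zip(series, series[1:])]
    return series


# A deterministic trending series used purely for illustration.
levels = [100, 102, 105, 109, 114, 120]
first_diff = difference(levels)            # period-to-period changes
second_diff = difference(levels, order=2)  # changes in the changes
```

Each round of differencing shortens the series by one observation, which matters when the sample period is short.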

2.4 Relevance

This research uses Funda data in models of the Dutch housing market. To my knowledge this data

has never been used in scientific research. The data are comparable to several articles discussed

in this chapter, albeit there are differences. It is comparable in the sense that Funda data can also

be considered an alternative source of data, in that it could capture preparatory

steps to spend. It is different in the sense that the data are not publicly available. But on the

other hand, this also means that it can be captured on a much finer-grained scale than for example

data will be used in order to provide insights into the dynamics between price changes and transaction

volume in the Netherlands. Moreover, since the data contains both a cross-sectional and temporal

component the effects can be modelled in a panel-data setting which allows to decrease Omitted


3 Data

In this chapter a verbal, statistical and spatial description of the obtained data will be presented.

3.1 Data sources

Two main datasets are required in order to perform the research. First, data regarding house prices

and transaction volume are required. Data from the Dutch Brokerage Association (NVM) are used

for these purposes. The NVM data consist of the number of transactions and median price per

cubic meter per month per zip code area from January 2011 until December 2013. In total there are

approximately 275,000 transactions included in the database. A sale is recorded as a

transaction in the NVM database at the time the purchase contract is signed. In other

databases, like that of the CBS/Kadaster, a transaction is included when the legal transfer takes place.

Therefore it is generally found that the NVM transaction data lead other data sources.

As mentioned earlier, the relationship between house prices and “Funda search behavior” is

examined. In a survey conducted by Kaela Research, 93% of the respondents answered “Funda” when

they were asked to name a housing website (Funda, 2013). Additionally, 81% would prefer Funda if they

would sell their home online. There are 4.2 million unique visitors per month on Funda.nl (Funda,

2013). In a report by Kerste et al. (2012) it is shown that Funda is by far the most popular housing

website in the Netherlands. According to this research Funda has a stable market share of around

60% of all Dutch housing websites. The owner of Funda is the NVM from which the transaction

data originates. In the report of Kerste et al. (2012) it is mentioned that most of the brokers that are

from these NVM brokers. The relationship between Funda activity and NVM transactions and

price developments is therefore intuitive.

It must however be noted that not all transactions in the Netherlands go through NVM

brokers. The transaction and price data presented in this research is therefore not a 100% accurate

representation of the Dutch housing market. According to Kerste et al. (2012) the NVM has a

market share of around 75% of Dutch housing transactions between 2010 and 2011. In the article

of De Wit et al. (2013), which also employs NVM data, it is mentioned that the market share was

55-60% in 2007. According to the TU Delft (2014) the market share of the NVM lies around

70%. As mentioned before, the NVM dataset used here consists of roughly 275,000 transactions

between 2011 and 2013. According to CBS (2014a), which uses data from the Kadaster and therefore

includes all transactions, there were 348,094 transactions in this same period. This suggests that

the NVM had a market share of approximately 79%¹ over this period.
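
The market share quoted above follows directly from the two transaction counts in the text. A minimal sketch of the arithmetic (the variable names are illustrative only), together with the share of the 36-month sample covered by the three-month registration lag mentioned in the footnote:

```python
# Transaction counts quoted in the text for 2011-2013.
nvm_transactions = 275_000        # approximate size of the NVM dataset
kadaster_transactions = 348_094   # all registered transactions (CBS, 2014a)

# Implied NVM market share over the sample period.
market_share = nvm_transactions / kadaster_transactions
print(f"NVM market share: {market_share:.1%}")

# Three months of registration lag relative to a 36-month sample.
lag_share = 3 / 36
print(f"Lag as share of sample: {lag_share:.1%}")
```

Because the NVM count is itself rounded, the resulting share should be read as an approximation.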

The incomplete coverage of the Dutch housing market may influence the models

employed. Especially the regional models may exhibit some bias. The market share of the NVM

differs regionally (Kerste et al., 2012; De Wit et al., 2013). In more rural areas the market share is

said to be substantially lower. The models may therefore prove to be less accurate for rural areas.

However, as most houses on Funda originate from NVM brokers it is expected that this will not

be a large problem in this case. It does however lower the robustness of the employed models

with respect to the complete Dutch market. Nevertheless Funda and the NVM are still by far the

largest in terms of market share. The second-largest brokerage website is Jaap.nl with a market

share of only 9% (Kerste et al., 2012). The second-largest brokerage association is the VBO, which has

a market share of less than 10% (Kerste et al., 2012). Furthermore, obtaining transaction and search

information for each zip-code area from other associations and websites would be too

time-consuming. Therefore Funda.nl seems to be the most suitable source of data on search

behavior.

¹ As mentioned in the text, the NVM uses a different definition of “transaction”, so the compared periods might not be exactly equal. This should however have only a minor effect on the market share, as the lag of the Kadaster on the NVM is only three months (De Wit et al., 2013). Three months is only 3/36 of the whole series.

Fortunately, Funda was willing to collaborate in this research and is in a position to deliver

data regarding search behavior. These data are obtained in the following manner. The raw data

describe the times watched per object per month. The objects (i.e. houses) are subsequently linked

to a PC4 code (i.e. four-digit zip code). In the next step the totals per PC4-area are calculated. The results are

(1) times watched and (2) number of objects per PC4-area per month.
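
The aggregation steps described above can be sketched as follows. The record layout, the object identifiers and the numbers are hypothetical; the actual Funda export format is not described in the text.

```python
from collections import defaultdict

# Hypothetical records: (object_id, pc4, month, times_watched).
views = [
    ("obj-1", "1011", "2011-01", 120),
    ("obj-2", "1011", "2011-01", 80),
    ("obj-3", "1012", "2011-01", 50),
    ("obj-1", "1011", "2011-02", 90),
]

def aggregate_per_pc4(records):
    """Return {(pc4, month): (total times watched, number of objects)}."""
    watched = defaultdict(int)
    objects = defaultdict(set)
    for object_id, pc4, month, times_watched in records:
        watched[(pc4, month)] += times_watched
        objects[(pc4, month)].add(object_id)   # count each object once
    return {key: (watched[key], len(objects[key])) for key in watched}

totals = aggregate_per_pc4(views)
```

Counting distinct object identifiers rather than rows is a design choice of this sketch; it guards against an object appearing more than once within a month.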

Finally, some control variables are obtained from the Dutch Central Bureau of Statistics in

order to ascertain the cross-sectional explanatory power of Funda (CBS, 2014a). In the panel data

setting both time fixed effects and entity fixed effects are included, which should already account

for most of these control variables.

3.2 Aggregation process

An identified data issue is related to the scale level of the data. In spite of the fact that the price

index is corrected for size (i.e. price per cubic meter), it is not corrected for any other characteristic.

For example, if only one or two houses are sold per quarter, the result could be that in month 1 a

very well-maintained house is sold, while in month 2 a very poorly maintained house is sold. This

aggregate the zip codes in two ways: (1) to aggregated PC-levels for the cross-sectional models

and (2) to COROP-regions for the panel data models. A detailed description of the aggregation

process is presented in the Appendix.

3.3 Variable description

3.3.1 Cross-section models

In the cross-sectional models regional differences between house prices are explained with the

inclusion of Funda data. The variables used in the cross-sectional models are presented in Table

3-1. In this table all variables are measured per PC-area, where the PC-areas are the aggregated

areas described in the previous section.

Table 3-1: The variables employed in the cross-sectional models

Variable name  Description                                              Source
pr             Mean transaction price per m3 in 2011-2013               NVM
fw             Mean Funda times watched in 2011-2013                    Funda
fo             Mean Funda number of objects in 2011-2013                Funda
fwo            Mean Funda times watched per object in 2011-2013         Funda
inc            Average monthly fiscal income in 2008                    CBS
hh             Number of households in 2013                             CBS
p_he           Percentage of population with higher education in 2011   CBS
hhs            Average household size in 2013                           CBS
pd             Population density in 2013 (per ha)                      CBS/UvA

To measure “internet search popularity” an additional variable is generated. This variable

for this variable could indicate a more popular area. The concept behind this interpretation is as

follows. More clicks on a certain object could imply this object is relatively popular. Therefore,

more clicks on objects (relative to the total number of objects) in a certain area could be evidence

of a more popular area. The times watched can therefore be characterized as a demand variable.

The number of objects per area can proxy for supply in a certain area as these are roughly equal

to the number of houses that are for sale in this area. The generated variable can therefore

approximate demand relative to supply.
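
As a small illustration of the demand-relative-to-supply idea, the ratio can be computed per area as below. The two areas and their numbers are made up; the log step follows the definition of fwo in Table 4-1.

```python
import math

def times_watched_per_object(times_watched, n_objects):
    """Demand proxy (clicks) relative to supply proxy (listed objects)."""
    if n_objects <= 0:
        raise ValueError("an area needs at least one listed object")
    return times_watched / n_objects

# Hypothetical areas with identical supply but different demand.
area_a = times_watched_per_object(30_000, 100)
area_b = times_watched_per_object(15_000, 100)

# In logs, the gap between the two areas equals the log of their ratio.
log_gap = math.log(area_a) - math.log(area_b)
```

With equal supply, the area that attracts twice as many clicks ends up with twice the ratio, which is the sense in which the variable measures popularity.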

To isolate this “popularity” several control variables are included in the models. Although

literature on cross-sectional house prices is limited and not discussed in detail in the literature

review, it can be expected that demand and supply factors that determine house price dynamics

are similar to factors that determine house prices over the cross-section. As the scale level is

relatively detailed, the size of the pool of control variables decreases. CBS (2014a) provides yearly

demographic data per PC4-area like population size, number of households and household size.

Furthermore, the CBS (2014b) irregularly publishes data regarding income on a zip-code level. As

the most recent version of this database was published in 2012 and contains income data of 2008,

this database is somewhat outdated (CBS, 2014b). In 2013 the bureau also published data

regarding levels of education and social security on a zip-code level.

All variables are aggregated from PC4-level to PC-levels as elaborated in the Appendix.

The house prices per PC-area are calculated as the mean transaction price per cubic meter between

2011 and 2013. The Funda data are first aggregated by summing the number of times watched and

number of objects per aggregated PC-area. Next, the average number of objects and times watched

The aggregation of the control variables is elaborated in the Appendix. In order to calculate

the population density per PC-area spatial data regarding the polygons of each PC4-area is

obtained from the University of Amsterdam (2014). These PC4 polygons are aggregated using the

aggregation scheme. Next the size (in ha) is calculated using the geometry functions of the GIS

package ArcMap. Subsequently the population density is calculated by dividing the number of

inhabitants by the size of the area.

Summary statistics of the variables are presented in Table A-2. The result of the

aggregation procedure is that the larger aggregated areas are located in rural regions. This can

clearly be seen from the population density statistic which dramatically decreases when

aggregation level increases. The average price per cubic meter decreases when aggregation level

increases. Furthermore, higher-aggregated areas encompass more Funda search activity. This is

due to the fact there are simply more objects placed on the internet in these areas as they are larger.

Interestingly, Table A-2 indicates that the number of clicks per object decreases when the

aggregation level increases. This suggests that more aggregated areas are relatively unpopular

Figure 3-1: Average price per cubic meter (left) and average times watched per object (right), 2011-2013, darker areas indicate a higher price and more clicks respectively.

It is generally accepted that house prices differ regionally (Capozza et al., 2002). The Netherlands

proves to be no exception (Figure 3-1). From the left map it can clearly be observed that the central

areas in the Randstad or areas close to the Randstad are more expensive per cubic meter. Moreover,

prices in larger cities also seem to be higher. Furthermore, the prices per cubic meter

are lowest in the Northern provinces of Friesland, Groningen and Drenthe (with the exception of

the city of Groningen), in the Eastern province of Overijssel and in the South-East in the province

of Limburg.

With respect to the popularity of PC-areas, the right map in Figure 3-1 provides a

clear-cut overview. More popular areas are the areas in or close to larger cities. Moreover the

comparison between the left and right maps in Figure 3-1 is striking. For example, the houses in

the Randstad are watched relatively more often and are also more expensive per cubic

meter. Houses in PC-areas which are located in or near larger cities are watched more often and

are more expensive per cubic meter.

Although this pattern is visible for most of the country, some areas in the North (i.e.

Groningen) seem to show a somewhat different pattern. A possible explanation for this

phenomenon could be that in the years over which the sample was taken (2011-2013), this area was

hit by several earthquakes. During this period many questions were raised about whether these

earthquakes had an impact on house prices. Although research by Francke and Lee (2013) has

shown that price changes in Groningen did not differ significantly from a comparison area, news

regarding house prices in Groningen may have triggered internet search behavior.

The correlations between the variables are presented in Table 3-2. The cross-correlation

fairly high. The correlation between the number of objects and times watched is very high (0.88).

But the correlations between these variables and house prices are low. This suggests that the ratio

between times watched and the number of objects may be more useful in regression models than

the separate variables. Other variables that are relatively highly and positively correlated with

house prices are income and the percentage of highly educated people. These correlations are 0.65

and 0.69 respectively. Another notable feature of the correlation table is the very high correlation

between the number of objects and the number of households (0.90).

Table 3-2: Correlations between the variables in the cross-sectional models

Variable  pr       fw       fo       fwo      inc      hh       p_he     hhs      pd
pr        1
fw        0.1574   1
fo        -0.1360  0.8792   1
fwo       0.5988   0.1750   -0.3150  1
inc       0.6490   0.1715   -0.0875  0.5251   1
hh        -0.2151  0.7474   0.9043   -0.3797  -0.2838  1
p_he      0.6944   0.0815   -0.2134  0.6087   0.6694   -0.2783  1
hhs       -0.3471  0.0350   0.1580   -0.2618  0.0197   0.0623   -0.5964  1
pd        0.3353   -0.3313  -0.3976  0.1665   0.1008   -0.2812  0.4950   -0.6253  1

3.3.2 Panel data models

This paragraph presents an overview of the variables employed in the panel data models, which

are presented in Table 3-3. All variables are log transformed. The variable used to measure price

developments is the median² price per cubic meter. In the literature review several possible control

variables are presented and some of these are included in the cross-sectional models.

Unfortunately, most of these data are not available both at COROP level and on a quarterly basis. However, it

may be expected that most of the variation in these control variables is absorbed by the time fixed

and entity fixed effects. For example it can be expected that building costs do not differ much

across COROP regions, hence time fixed effects can control for these. Other variables that differ

regionally, like income, may be absorbed by entity fixed effects if it is assumed that they

do not change over time.

Table 3-3: Overview of variables employed in the panel data models

Variable name  Description
tr             Transaction volume
pr             Median house prices per m3
fw             Funda times watched
fo             Funda number of objects
fwo            Funda times watched per object

² Mean house prices were also tried, but median house prices provided a slightly better fit; furthermore, the median price should be less affected by outliers than the mean price.

For illustration purposes the time component of the variables is described visually by

presenting national aggregates of each variable. The transaction variables and Funda variables are

the totals of the respective variables in the corresponding quarter. The price index is calculated as the

weighted aggregate median transaction price index where the weights are based on the number

of transactions³. Therefore, regions with more transactions have a larger share in the national

median price index. The statistics presented in Table 3-4 are based on COROP-scale data.
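
The transaction-weighted national index described above can be sketched as follows. The three regions and their values are hypothetical and only illustrate how regions with more transactions pull the national figure towards their own price level.

```python
def national_price_index(regions):
    """Transaction-weighted average of regional median prices.

    `regions` is a list of (median_price, n_transactions) pairs for one quarter.
    """
    total = sum(n for _, n in regions)
    if total == 0:
        raise ValueError("no transactions in this quarter")
    return sum(price * n for price, n in regions) / total

# Hypothetical quarter with three regions (price per m3, number of transactions).
quarter = [(600.0, 3000), (500.0, 1000), (400.0, 1000)]
index = national_price_index(quarter)

# For comparison: the unweighted mean of the same regional prices.
unweighted = sum(price for price, _ in quarter) / len(quarter)
```

In this example the weighted index lies above the unweighted mean because the most expensive region also has the most transactions.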

As can be seen from the graphs in Figure 3-2 transaction volume decreased in 2011 and

2012. The median prices per cubic meter recovered slightly in 2013. The number of transactions

per quarter stayed within the range of 20,000 to 25,000 transactions until 2012, but started

increasing in the second half of 2013.

A striking feature of the quarterly transaction development is the spike during the last

three months of 2012. Similarly, 2013Q1 showed relatively few transactions. A possible reason

for this spike and the subsequent fall could be the tax reforms which came into effect on the 1st of

January 2013. In these reforms the interest deductibility for new interest-only mortgages was

abolished (Belastingdienst, 2014). However, interest payments on mortgages that were

originated prior to the 1st of January 2013 can still be deducted. These reforms may have caused

people who desired an interest-only mortgage to quickly buy a house before the new tax rules

came into effect. The spike in transactions in 2012Q4 and the subsequent drop in 2013Q1 could bias

the results and therefore have to be taken into account in the modelling process.

The drop which can be seen in the number of transactions is also visible in the Funda data.

Interestingly, the drop in times watched was already visible in 2012Q4, one quarter before the

actual drop in transactions occurred. Like the number of transactions the Funda variables also

showed an increase during 2013. The analogy between the Funda variables and prices is

somewhat less straightforward. But the small recovery in prices in 2013 is also visible in the Funda

variables.

³ $pr\_index_t = \sum_{i=1}^{I} pr\_index_i \cdot \frac{transactions_i}{total\ transactions_t}$

Although not visible in the national aggregate of the price index, the median price per

cubic meter contains noise. This is clearly visible in Table E-1 in the Appendix, which contains the

first-order autocorrelations of the first difference of the median price per cubic meter. The AR(1)

coefficient on the unsmoothed series is negative and significant, which indicates noise. In order to

cope with this noise, the data are smoothed using a Hodrick-Prescott (HP) filter. Several

smoothing parameters and other techniques, like weighted moving

averages and exponential decay smoothing, were tried, but the HP technique proved to be most successful. It

is somewhat arbitrarily chosen to continue with the HP filtered series with a lambda of 0.20, but

other smoothing parameters will be discussed in the robustness checks chapter. Similarly,

transaction volume also contains noise. Therefore, these series are also smoothed, using a lambda

of 0.50.
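
The two steps above, diagnosing noise through the first-order autocorrelation of the differenced series and smoothing with an HP filter, can be sketched in plain Python. This is a minimal illustration, not the routine used for the thesis: the series is made up, the function names are hypothetical, and the trend is obtained by solving the standard HP system (I + λD'D)τ = y with a small dense solver.

```python
def lag1_autocorr(x):
    """Sample first-order autocorrelation of a sequence."""
    mean = sum(x) / len(x)
    dev = [v - mean for v in x]
    return sum(a * b for a, b in zip(dev[1:], dev[:-1])) / sum(d * d for d in dev)

def hp_filter(y, lam):
    """HP trend: solve (I + lam * D'D) trend = y, D = second-difference matrix."""
    n = len(y)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 2):
        row = ((k, 1.0), (k + 1, -2.0), (k + 2, 1.0))
        for i, di in row:
            for j, dj in row:
                a[i][j] += lam * di * dj
    # Gaussian elimination with partial pivoting on the augmented system [a | y].
    m = [a_row + [float(v)] for a_row, v in zip(a, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    trend = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(m[r][c] * trend[c] for c in range(r + 1, n))
        trend[r] = (m[r][n] - rest) / m[r][r]
    return trend

# Hypothetical series: a steady trend plus noise that flips sign every quarter.
noisy = [100 + 2 * t + (3 if t % 2 == 0 else -3) for t in range(12)]
changes = [b - a for a, b in zip(noisy[:-1], noisy[1:])]
raw_ac = lag1_autocorr(changes)      # negative, because the noise reverses each period
smoothed = hp_filter(noisy, 0.20)    # lambda of 0.20, as chosen for prices in the text
```

The intuition for the diagnostic is that measurement noise in the level enters the first difference twice with opposite signs, which produces negative first-order autocorrelation in the changes. A series without curvature passes through the filter unchanged, so only deviations from a smooth path are dampened.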

Table 3-4: Descriptive statistics of the variables employed in the panel data models, 2011-2013

Variable  Mean        Std. Dev.   Min        Max         N
tr        579.66      535.69      37.00      3,145.00    480
pr        570.47      134.41      352.94     1,112.16    480
fw        14,875,266  12,133,033  2,576,366  79,000,000  480
fo        58,109      42,382      9,592      223,969     480

Figure 3-2: Quarterly developments of the nationally aggregated variables employed in the panel data models


4 Econometric Modelling

In this chapter the econometric methodology used to estimate the cross-sectional and panel

data models is presented. As the cross-sectional models are meant to provide an introduction to

the panel data modelling process, the emphasis will lie on the latter.

4.1 Cross-sectional models

The general specification is presented in equation (1). Here the βs are the parameters which have

to be estimated and ε is the error term which is assumed to be i.i.d. Furthermore, subscript i

denotes the aggregated PC-area. The variables are listed in Table 4-1. The methodology applied

in order to estimate the coefficients is Ordinary Least Squares (OLS).

(1) $pr_i = \beta_0 + \beta_1 fwo_i + \beta_2 inc_i + \beta_3 hh_i + \beta_4 p\_he_i + \beta_5 hhs_i + \beta_6 pd_i + \varepsilon_i$
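
To make the estimation step concrete, the following is a minimal OLS sketch that solves the normal equations for a reduced version of equation (1) with only two regressors. The data are made up and generated without noise, so the estimator should simply recover the coefficients used to generate them; in practice a statistical package would be used, and the function name is hypothetical.

```python
def ols(X, y):
    """OLS coefficients via the normal equations (X'X) b = X'y.

    X is a list of rows; a constant is prepended to every row.
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting on [X'X | X'y].
    m = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        rest = sum(m[r][c] * beta[c] for c in range(r + 1, k))
        beta[r] = (m[r][k] - rest) / m[r][r]
    return beta

# Made-up log values of (fwo, inc) for five areas; pr is generated as
# pr = 1.0 + 0.25 * fwo + 0.8 * inc, without an error term.
X = [(5.0, 7.5), (5.5, 7.6), (6.0, 7.4), (5.2, 7.9), (5.8, 7.7)]
y = [1.0 + 0.25 * fwo + 0.8 * inc for fwo, inc in X]
beta = ols(X, y)   # [constant, coefficient on fwo, coefficient on inc]
```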

Table 4-1: Definition of used variables

Variable  Description               Definition
pr        House prices              Log(Mean transaction price per m3)
fwo       Funda watched per object  Log(Funda times watched/Funda objects)
tr        Transaction volume        Log(Number of transactions per month)
inc       Income                    Log(Average monthly fiscal income)
hh        Number of households      Log(Number of households)
p_he      % High educated           Percentage of population with high education
hhs       Household size            Population/Number of households

4.2 Panel data models

Although it may be expected that there exists a long-run relationship between the Funda data and

house prices and/or transaction volume, the limited availability of the data does not allow this to be

examined. It is however expected that the data may be able to track the short-run dynamics of

supply and demand. Therefore an econometric framework that models short-run changes in house

prices and transaction volume is most applicable.

The data consists of a balanced panel of 40 COROP-regions with data ranging from 2011Q1

to 2013Q4. When applying a panel data approach one can control for unobserved heterogeneity

between the cross-sectional units by using for example a within transformation or a Least Squares

Dummy Variable (LSDV) approach (Stock & Watson, 2011). In the theoretical framework it was

concluded that house prices exhibit positive serial correlation in the short run. This suggests that

lags of the dependent variable should be included in the models. Hence, modelling the data as a

dynamic panel seems most natural. A problem that arises when including lags of the dependent

variable in the regression is that these lags are correlated with the error term and therefore will

lead to biased estimates (i.e. Nickell’s bias, see Nickell, 1981; Roodman, 2009). A possible solution

to this problem is to apply the GMM approach proposed by Arellano and Bond

(1991), which is elaborated in the following sections.

For convenience the house price model presented in equation (2) is discussed in the text.

The transaction volume model can be constructed in a similar fashion and is presented in equation

(3).

The general model to explain house price changes is presented in equation (2)⁴; the variables used

are presented in Table 4-1. Here house price changes of cross-sectional unit i in period t depend

on house price changes in cross-sectional unit i in periods t-1 through t-q and the clicks per object

in cross-sectional unit i in periods t-1 through t-p. Furthermore, v_it is the error term, which is i.i.d.

By transforming the variables, which in this case is done by taking first differences, the

unobserved heterogeneity αi cancels out. The lagged dependent variable(s), however, are still

correlated with the error term. Without the transformation, the error term by construction includes the entity fixed effects,

which are unobserved, and the lagged dependent variable also depends on these entity fixed effects,

since by definition these are time-invariant. After first differencing, the differenced lagged dependent variable still contains the error of period t-1, which is also part of the differenced

error term in t, which means the estimated coefficients will be biased and inconsistent.
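
Written out, first-differencing the levels model given in the footnote yields a specification of the following form. This is a sketch that mirrors the notation of equation (3); the exact form of equation (2) is an assumption here.

```latex
\begin{aligned}
pr_{it} - pr_{it-1} &= (\alpha_i - \alpha_i)
  + \sum_{q=1}^{Q} \gamma_q \,(pr_{it-q} - pr_{it-q-1})
  + \sum_{p=0}^{P} \beta_p \,(fwo_{it-p} - fwo_{it-p-1})
  + (\lambda_t - \lambda_{t-1}) + (v_{it} - v_{it-1}) \\
\Delta pr_{it} &= \sum_{q=1}^{Q} \gamma_q \,\Delta pr_{it-q}
  + \sum_{p=0}^{P} \beta_p \,\Delta fwo_{it-p}
  + \Delta\lambda_t + \Delta v_{it}
\end{aligned}
```

The entity effect drops out, but $\Delta pr_{it-1} = pr_{it-1} - pr_{it-2}$ and $\Delta v_{it} = v_{it} - v_{it-1}$ both involve period $t-1$, which is why the differenced lag remains endogenous.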

Besides entity fixed effects time fixed effects are also included in the models, which are

denoted by λt. These will account for temporal factors that are similar across entities. Seasonal

effects are also absorbed by these time fixed effects.

In order to cope with these issues, the parameters of interest are estimated by using the

Generalized Method of Moments (GMM) estimator (Arellano & Bond, 1991; Arellano & Bover,

1995). The coefficients are estimated using an Instrumental Variable (IV) approach, in which

first differences are instrumented by the lagged levels of the same variable⁵.
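
As an illustration of what instrumenting first differences by lagged levels means, the sketch below lists, for each period, which earlier levels are available as instruments in the textbook case with a single lag of the dependent variable. The function name is hypothetical, and the instrument set actually used in the thesis (number of lags, any restrictions) is not specified here.

```python
def lagged_level_instruments(n_periods, first_lag=2):
    """Map each period t (1-based) to the periods whose levels can instrument
    the differenced lagged dependent variable in the equation for t."""
    instruments = {}
    for t in range(first_lag + 1, n_periods + 1):
        # Levels dated t-2 and earlier are used as instruments.
        instruments[t] = list(range(1, t - first_lag + 1))
    return instruments

# T = 12 quarters, as in the panel used here.
arellano_bond = lagged_level_instruments(12)

# Anderson-Hsiao keeps only the most recent valid level (t-2) per period.
anderson_hsiao = {t: lags[-1:] for t, lags in arellano_bond.items()}

# Number of moment conditions when every available lag is used.
n_moment_conditions = sum(len(lags) for lags in arellano_bond.values())
```

The count grows quadratically with the number of periods, which is one reason instrument sets are often restricted in practice.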

⁴ This model is created by taking the first differences of:

$pr_{it} = \alpha_i + \sum_{q=1}^{Q} \gamma_q pr_{it-q} + \sum_{p=0}^{P} \beta_p fwo_{it-p} + \lambda_t + v_{it}$

⁵ In the Anderson-Hsiao approach only the lagged dependent variable is used as an instrument. If the variable in levels is close to a random walk, this instrument is found to be relatively weak (Roodman, 2009). Since house prices are expected to follow a random walk, more lagged variables are included as instruments.

The transaction volume model can be established in a similar fashion and is presented in

equation (3).

(3) $\Delta tr_{it} = \sum_{s=1}^{S} \delta_s \Delta tr_{it-s} + \sum_{r=0}^{R} \beta_r \Delta fwo_{it-r} + \Delta\theta_t + \Delta u_{it}$

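In order to examine the relationship between house prices and transaction volume and how these

respond to changes in the times watched per object variable, a panel Vector Autoregression (VAR)

model is defined in (4). In this VAR model the changes in transaction volume and price changes

are modelled simultaneously. In the impulse responses generated by a VAR model the

simultaneous causality between house prices and transaction volume is explicitly taken into

account.

(4) $\begin{pmatrix} \Delta pr_i \\ \Delta tr_i \end{pmatrix}_t = \sum_{q=1}^{Q} \begin{pmatrix} \gamma_1 & \delta_1 \\ \gamma_2 & \delta_2 \end{pmatrix}_q \begin{pmatrix} \Delta pr_i \\ \Delta tr_i \end{pmatrix}_{t-q} + \sum_{p=0}^{P} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}_p fwo_{it-p} + \begin{pmatrix} \Delta\lambda_1 \\ \Delta\lambda_2 \end{pmatrix}_t + \begin{pmatrix} \Delta v_1 \\ \Delta v_2 \end{pmatrix}_t$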
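
How impulse responses follow from a VAR can be sketched for the simplest case with one lag. The coefficient matrix and the impact vector below are hypothetical and chosen only to illustrate a pattern in which volume reacts on impact and prices follow with a lag; they are not estimates from this thesis.

```python
def impulse_response(a, shock, horizons):
    """Responses of a VAR(1), y_t = a * y_{t-1} + e_t, to a one-off impact.

    Returns the response vector at horizons 0, 1, ..., `horizons`.
    """
    responses = [list(shock)]
    for _ in range(horizons):
        prev = responses[-1]
        responses.append([sum(a[i][j] * prev[j] for j in range(len(prev)))
                          for i in range(len(a))])
    return responses

# Hypothetical coefficients, variables ordered as (price change, volume change).
a = [[0.5, 0.1],
     [0.0, 0.3]]

# Hypothetical impact of a demand shock: volume moves immediately, prices do not.
irf = impulse_response(a, shock=[0.0, 1.0], horizons=3)
```

In this made-up example the price response is zero on impact and only becomes positive from the next period onwards, through the effect of lagged volume on prices.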

Roodman (2009) distinguishes eight assumptions regarding the data-generating process of

applying GMM and states that these should be touched upon prior to doing the research. These

are included in Appendix B. The first assumption is the main reason why the coefficients are

estimated using GMM. From the literature it can be concluded that house prices are expected to

follow a dynamic process. The second assumption is a reason why a panel set-up is performed.

Although cross-sectional regressions are also performed, the time component is expected to provide

additional insights into the data. Lagged dependent variables are obviously endogenous but this is

handled through the GMM set-up. The standard errors are estimated using the cluster option, and

however may not be correlated over the cross-section. Therefore, time fixed effects (i.e. time

dummies) are included in the regressions, which makes this assumption more likely to hold

(Roodman, 2009). The panel used can be considered “small T, large N” as T is 12 and N is 40. In

the model only lags of the regressors are included as instruments, hence there are only

internal instruments.

4.3 Hypotheses

Although no empirical evidence regarding the Funda data exists, several hypotheses

regarding transaction volume and house prices in the Dutch housing market can be

stated based on the literature. The first hypothesis is related to both cross-sectional models and panel data models.

Hypotheses II through IV only apply to the panel data models.

I. The Funda data have explanatory power in house price models.

II. The Funda data have explanatory power regarding transaction volume.

III. Transaction volume and house prices are positively related.

IV. Transaction volume reacts quicker to changes in demand and supply dynamics than prices.

The first hypothesis explores whether the data are helpful in explaining cross-sectional and

intertemporal variance in house prices. This hypothesis is tested in two

ways: (1) it is tested whether several variables derived from the Funda data are (jointly)

significant in the cross-sectional model and (2) the Funda variable fwo and its lags are included

and it is tested whether the Funda data Granger-cause house prices.

Secondly, the relationship between the data and transaction volume is examined in a

Next, from the literature it is concluded that there exists a positive relationship between

house prices and transaction volume. In the third hypothesis this is stated and this will be

empirically tested for the data. In order to test this, a panel VAR model is formulated which

includes both transaction volume and house prices.

Finally, a combination of the first, second and third hypotheses is formulated. In

the literature it is generally found that transactions are the first to react to demand or supply

shocks and that prices adapt much more gradually. As it is expected that the Funda data are able to

track these demand and supply shocks, similar results are expected regarding the Funda

5 Results

5.1 Cross-sectional models

The results of three cross-sectional models are presented in Table 5-1. Model (1) includes several

explanatory variables, but no Funda variables. Model (2) additionally includes two separate

Funda variables being times watched and number of objects. Model (3) includes the generated

variable times watched per object. From Figure D-1 in Appendix D it can be seen that the

residuals are normally distributed.

Table 5-1: Cross-sectional models.

This table presents three alternative models with house prices per PC-area as dependent variable. Column 1 includes the explanatory variables and columns 2-3 additionally include Funda variables. Variable coding: pr is house prices, fw is Funda times watched, fo is Funda objects, fwo is times watched per object, inc is income, hh is households, p_he is percentage highly educated, hhs is household size and pd is population density; for a definition of the variables used see Table 4-1. Standard errors are heteroskedasticity-robust but unreported; t-statistics are given in parentheses. *, ** and *** indicate statistical significance at the 10%, 5% and 1% levels respectively.

           (1)         (2)         (3)
           pr          pr          pr
fw                     0.246***
                       (8.29)
fo                     -0.235***
                       (6.23)
fwo                                0.246***
                                   (8.40)
inc        0.812***    0.740***    0.748***
           (10.01)     (8.72)      (9.57)
hh         0.008       0.028       0.038***
           (0.65)      (0.98)      (3.22)
p_he       0.539***    0.202       0.195
           (3.51)      (1.39)      (1.34)
hhs        -0.148***   -0.128***   -0.128***
           (4.13)      (3.69)      (3.73)
pd         0.012**     0.025***    0.024***
           (1.97)      (3.73)      (3.91)
Constant   0.230       -0.834      -0.923*
           (0.41)      (1.43)      (1.77)
R²         0.56        0.60        0.60
RMSE       0.17        0.17        0.17
N          788         788         788

The variables in the first model all have the expected signs. The largest contributor to regional

house price differences is income: an increase of 1% of average income per area translates into

0.8% higher house prices. Household size is found to be inversely related to house prices: an

increase of household size by 1 is related to 14.8% lower house prices. A priori a similar effect is

expected, since larger households are located more often in suburbs or more rural areas while

smaller households are located in more urban areas. The significance of the variable is perhaps

somewhat unexpected since population density is also included in the model. This variable

should measure similar effects⁶. The sign of population density is as expected: an increase of 1%

translates into 0.01% higher house prices. An increase of 1% in population density is equal to 0.3

more inhabitants per ha. Even when controlling for income, the percentage highly educated is

significant. If the percentage of highly educated increases by 1%-point house prices increase by

0.5%. The number of households is found to be insignificant.
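
The percentage interpretations above depend on whether a regressor enters in logs or in levels (see Table 4-1). A small sketch of both readings, using the model (1) coefficients from Table 5-1; the helper names are illustrative only.

```python
import math

def pct_effect_log_log(coef, pct_change):
    """Approximate % change in price for a % change in a logged regressor."""
    return coef * pct_change

def pct_effect_log_level(coef, unit_change):
    """Exact % change in price for a unit change in an unlogged regressor."""
    return (math.exp(coef * unit_change) - 1.0) * 100.0

# Logged regressors: the coefficient is read as an elasticity.
income_effect = pct_effect_log_log(0.812, 1.0)     # income, model (1)
density_effect = pct_effect_log_log(0.012, 1.0)    # population density, model (1)

# Household size enters in levels: the coefficient times 100 is the usual
# approximation, while the exponential form gives the exact effect.
hh_size_approx = -0.148 * 100.0
hh_size_exact = pct_effect_log_level(-0.148, 1.0)
```

For a coefficient of this size the approximation and the exact value differ by roughly one percentage point, so the 14.8% quoted in the text should be read as the approximate figure.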

In the second model two Funda variables have been added. More specifically, the times

watched and number of objects variables are included. These variables have the expected sign

and are both highly significant. The demand variable (i.e. times watched) is therefore positively

related to prices while the supply variable (i.e. number of objects) is negatively related to prices.

A 1% increase in times watched translates into 0.25% higher prices. A 1% increase in the

number of objects relates to 0.24% lower house prices. A 1% increase in times watched on average

is related to approximately 1530 more clicks. An increase of 1% in the number of objects compares

to approximately 6 listed properties on Funda on average. The signs and magnitudes of the

control variables are roughly similar to those in model (1). One notable exception is that the

percentage highly educated becomes insignificant. Furthermore, the coefficient on income

decreases somewhat in size. The reason might be that the Funda data absorb some of these

demand characteristics and measure similar things. The increase in model fit when Funda data

are included is visible but limited, as the R-squared increases from 0.56 to 0.60⁷. This provides

evidence that the Funda data are able to capture similar variation in house prices as other demand

and supply characteristics, and thus are able to measure demand and supply.

⁶ When population density is left out, the size and significance of household size increase. Similarly, when household size is left out the size and significance of population density increase.

The similarity in magnitude between the Funda variables is striking. This could however

be the result of the high correlation between the number of times watched and number of objects

(0.88; see Table 3-2). Therefore, in model (3) these variables are combined into the times watched

per object variable. The sign of this variable is in line with expectations and is also highly significant.

If the times watched per object in a PC-area increases by 1% the prices increase by approximately

0.25%. A 1% increase in times watched per object is related to approximately 3 clicks more per

object. If it is assumed that the times watched per object variable measures demand relative to

supply accurately, the magnitude of an increase in demand relative to supply should be similar.

The estimated coefficients of the control variables with the exception of number of households are

very similar to those of model (2). The number of households is now significant and has the

expected sign as the number of objects is now excluded, solving the multicollinearity issues. An

increase of 1% in the number of households is related to 0.04% higher house prices.

When including control variables, the interpretation of the Funda data changes when no

⁷ When no control variables are included Funda explains roughly 36% of regional house price differences. Adjusted R-squared values are also 0.56 and 0.60; for consistency only the regular R-squared is reported, since it has to be calculated by hand in the panel data models estimated by GMM.