The predictive power of Funda
Dorinth van Dijk
Abstract
In this thesis internet data is used for predicting Dutch housing market developments. It is shown
that data on listed properties from the brokerage website Funda is able to proxy for supply and
demand in the Netherlands. Furthermore, the Funda data make it possible to examine price and liquidity
dynamics of the Dutch housing market. In a panel VAR framework it is shown that transaction
volume responds relatively quickly to changes in demand and supply while prices react more
gradually. This is in accordance with search and matching models where changes in demand and
supply are first incorporated into an increase in liquidity while prices respond with a time lag.
This thesis shows that the usage of alternative data sources is useful in economic research and that
these data can provide information to future homebuyers.
University of Amsterdam
Supervisor: Prof. Dr. Marc K. Francke
Second reader: Dr. Jeroen E. Ligterink
Master Thesis Business Economics – Finance / Real Estate Finance
Student Number: 10642935
November 2014
Abstract
Acknowledgements
1 Introduction
2 Literature review
2.1 Price dynamics
2.1.1 Market efficiency
2.1.2 Price-Volume Correlation
2.1.3 Empirical evidence of price-volume correlation
2.2 Modelling house prices and transaction volume
2.3 New data
2.3.1 Alternative data in economics and finance
2.3.2 Alternative data in the housing market
2.4 Relevance
3 Data
3.1 Data sources
3.2 Aggregation process
3.3 Variable description
3.3.1 Cross-section models
4 Econometric Modelling
4.1 Cross-sectional models
4.2 Panel data models
4.3 Hypotheses
5 Results
5.1 Cross-sectional models
5.2 Panel data models
5.2.1 House prices
5.2.2 Transaction volume
5.2.3 Synergizing house prices and transaction volume
5.3 Summarizing the results
6 Robustness checks
6.1 Different smoothing parameters
6.2 OLS models
7 Implications and limitations
8 Conclusion
References
Acknowledgements
It is impossible to write a thesis without proper supervision. Therefore I would first like to thank
Prof. Dr. Marc Francke who provided such. Your comments and suggestions were most useful
and my academic skills have definitely benefited from them. Next I would like to thank Ir. Alex
van de Minne for providing me with many creative and useful ideas in the start-up phase of the
project. Especially, the visit to Funda and your comments on the proposal were most useful. Also,
thank you Dr. Martijn Dröes for providing me with useful technical guidance.
Special thanks go out to Funda for providing the data. In particular thank you Ruben
Scholten for gathering the massive amounts of data. Without this data it would clearly have been
impossible to write this thesis.
Finally, I would like to thank my family and friends for providing the necessary
1 Introduction
When predicting economic growth, economists usually look at consumer confidence surveys.
Although a similar indicator exists for the Dutch housing market (i.e. the Vereniging Eigen Huis
Marktindicator), it is generally accepted that house prices differ regionally (Capozza et al., 2002).
Therefore it would be interesting to have an indicator that foreshadows housing prices on a
regional scale. This would in turn be valuable for home buyers and sellers in timing their
transaction. But more importantly, this indicator could assist realtors and other real estate
professionals in predicting “where the market is heading”. This thesis seeks to develop such an
indicator using alternative data sources.
As for these sources, the internet proves to be a valuable source of information when
predicting economic developments. More specifically, Wu and Brynjolfsson (2009) point out that for the
US market Google search query data predict house transaction volume and house prices with
relatively high accuracy. The basic idea is that people who are interested in buying a house first start
by examining potential houses on the internet. Askitas and Zimmerman (2009) call this behavior
preparatory steps to spend.
The usage of both transaction volume and house prices as dependent variables is no
coincidence as there seems to exist a relationship between these (Stein, 1995; Clayton et al., 2010;
De Wit et al., 2013). Booming markets with rising prices are typically characterized by more
liquidity, while down markets with declining prices usually show less liquidity (De Wit et al.,
2013). A reason for this relationship stems from search and matching models in which transactions
In the light of Wu and Brynjolfsson (2009) this thesis uses internet search query data in
order to decompose these effects for the Dutch housing market. The Netherlands is characterized
by the fact that most people use one specific website to search for and compare
houses: Funda (Funda, 2013). Therefore, the activity on this website, or more specifically search
behavior, could give a useful indication of (future) demand. Moreover, the number of listed
properties could be a useful supply indicator. The advantage of these data is that they can be
measured on a very detailed scale (i.e. at zip code level).
The insights into the relationship between transaction volume and prices provided in this
thesis are inherent to the usage of new data. As the usage of internet data is probably one of the
most recent developments in economic research, this thesis is also affiliated with these
developments. The research question to be answered is:
“To what extent can Funda search data predict house price and transaction volume
dynamics in the Netherlands?”
In this thesis the number of times watched and the number of objects are obtained from Funda. The latter
equals the number of houses that are for sale on Funda in a certain neighborhood (at zip code
level) and the former describes how many times these houses have been clicked upon. By dividing
the times watched by the number of objects a demand versus supply variable is generated. In
order to examine the relationship between this variable and housing market developments the
data are implemented in two different types of models. First, in order to introduce the Funda data in
an approachable manner, cross-sectional models are estimated. Here the data are included in
estimated in which the dynamic relationships between the Funda data, house prices and
transaction volume are examined. The modelling process will start with single equation models
for house prices and transaction volume. These will subsequently be extended into a panel VAR
set-up in order to examine responses of house prices and transaction volume to demand shocks.
As the sample period of the obtained data is rather short (i.e. 2011-2013), the goal of this research
is to model the short-run dynamics of the Dutch housing market.
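The construction of the demand versus supply variable can be illustrated with a minimal sketch. The record layout, the function name demand_supply_ratio and the sample figures are assumptions made for illustration only; they are not the Funda data, nor necessarily the aggregation procedure applied in this thesis.

```python
from collections import defaultdict


def demand_supply_ratio(records):
    """Return times watched per listed object for each (zip code, month) cell.

    records: iterable of (zip_code, month, times_watched, n_objects) tuples.
    This layout is hypothetical and chosen only for the example.
    """
    views = defaultdict(int)
    objects = defaultdict(int)
    for zip_code, month, times_watched, n_objects in records:
        key = (zip_code, month)
        views[key] += times_watched
        objects[key] += n_objects
    # Cells without any listed object are skipped: the ratio is undefined there.
    return {key: views[key] / objects[key] for key in views if objects[key] > 0}


# Made-up figures, not actual Funda observations.
sample = [
    ("1011", "2012-01", 1200, 40),
    ("1011", "2012-02", 1500, 30),
    ("9711", "2012-01", 300, 0),
]
ratios = demand_supply_ratio(sample)
```

In this sketch a higher ratio indicates more clicks per listed house, which would be read as stronger demand relative to supply in that zip code and month.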
This thesis is structured in the following manner. In the second chapter an overview of the
literature is given on house price dynamics, the relationship between liquidity and prices and the use
of alternative data in economics. In the third chapter the data sources used are described both
verbally and statistically. This chapter is followed by chapter 4, which discusses the applied
methodology. Furthermore, some hypotheses are formulated in this chapter. Chapter 5 discusses
the main results and seeks to answer the hypotheses and research question. In the sixth chapter
several robustness checks are performed. In chapter 7 some limitations regarding Funda data and
the research are addressed. Moreover, some implications for further research will be touched
upon in this chapter. The thesis ends with a conclusion that tries to capture the main points of this
2 Literature review
Three different topics are addressed in this literature review. First, the dynamics of house
prices are discussed. This is followed by a discussion concerning the relationship between house
prices and transaction volume. Third, the use of “new” data in economic research is reviewed.
These are data that are related to the recent increase in internet usage. The applicability of these
sources in economic research is canvassed. This chapter will end with a note on the relevance.
2.1 Price dynamics
2.1.1 Market efficiency
One of the most well-known and discussed theories in finance is the Efficient Market Hypothesis
(EMH). This hypothesis states that all information regarding stocks is reflected in current prices
(Bodie et al., 2002). The degree to which information (i.e. public and/or private) is incorporated is
related to the form of this hypothesis. Kendall (1953) found that stock market price changes show
little serial correlation. This effectively means that past returns are unrelated to future returns.
Therefore no repeated patterns can be found in the stock market, and the analysis of past price
movements in order to generate positive future returns is futile. Hence, outsiders should not be
able to generate positive returns solely by watching price movements. Kendall presents five
reasons why some investors might be able to have success: (1) luck, (2) at certain times all prices
rise so they can’t go wrong, (3) by exploiting inside information, (4) by acting first, and (5) by
exploiting scale advantages so that broker fees and stamp duties evaporate.
by Fama (1965). He provides further empirical evidence for the random walk theory. This theory
states that stock price changes are as predictable as a random set of numbers. In order for the
theory to hold, two statements are examined by Fama (1965): price changes are (1) independent
and (2) normally distributed variables. The former is found to be true as all three methods (i.e. the
serial correlation model, runs analysis and Alexander’s filter technique) confirm the statement
that price changes are independent of each other. The latter statement is rejected by Fama. He
finds that stock price returns are not normally distributed. Extreme events (very large positive or
negative returns) are more likely to happen than implied by a normal distribution. This
phenomenon is also known as fat tails (Bodie et al., 2002).
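The two statements can be made concrete with a small sketch. It computes a first-order serial correlation coefficient and the excess kurtosis of a return series; a positive excess kurtosis points to fatter tails than a normal distribution implies. The function names and the input series are illustrative assumptions, and the sketch is not a reproduction of the tests applied by Fama (1965).

```python
def lag1_autocorrelation(returns):
    """First-order serial correlation of a return series."""
    n = len(returns)
    mean = sum(returns) / n
    dev = [r - mean for r in returns]
    denominator = sum(d * d for d in dev)
    numerator = sum(dev[t] * dev[t - 1] for t in range(1, n))
    return numerator / denominator


def excess_kurtosis(returns):
    """Sample kurtosis minus 3; the value is 0 for a normal distribution."""
    n = len(returns)
    mean = sum(returns) / n
    m2 = sum((r - mean) ** 2 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    return m4 / (m2 ** 2) - 3.0


# Artificial series, chosen only to show the behavior of both measures.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # strongly negatively correlated
fat_tailed = [0.0] * 8 + [10.0, -10.0]           # mostly calm, two extreme returns

rho = lag1_autocorrelation(alternating)
kurt_flat = excess_kurtosis(alternating)
kurt_fat = excess_kurtosis(fat_tailed)
```

Under the random walk theory the serial correlation coefficient should be close to zero, while the fat tails reported by Fama correspond to a positive excess kurtosis.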
Counter-evidence for the random walk theory is provided by Lo & MacKinlay (1988). The
empirical evidence in this article shows that the random walk model is not valid for weekly stock
returns. The conclusion is that (significant) positive autocorrelation is evident in the stock market
when return figures are based on weekly data. More specifically, a first-order autocorrelation
coefficient of 0.30 is found. Furthermore, Lo & MacKinlay (1988) extend their framework by
creating size-sorted portfolios (i.e. small, medium and large stocks). The conclusion is that the
random walk model is rejected for all three categories. The behavior of small stocks, however,
seems to contrast the most with the random walk model. There are numerous studies that show
that the stock market exhibits “anomalies” (see Malkiel, 2003 for an overview). The existence of
these anomalies, however, does not give rise to portfolio strategies that generate excess returns
given the risk (Malkiel, 2003).
In the context of the efficiency of the housing market Case and Shiller (1989) provide
predictable. They use their Weighted Repeated Sales (WRS) index, recently developed at the time of
the research, to show that housing prices are predictable in the short run. This WRS index is
an index based on houses that are sold at least twice and did not significantly change in between
the two times of sale (see Case and Shiller 1987 for more on this WRS index). The authors show
that after-tax excess returns in Atlanta, Chicago, Dallas and San Francisco are strongly dependent
on the after-tax excess returns of the preceding year. Case and Shiller (1989) show that a change
in house prices in a certain year is usually followed by a change of about half the size in the same
direction. The authors do however emphasize that individual house price changes are not
predictable due to the excessive amounts of noise existent in transaction prices.
In an overview of empirical studies given by Cho (1996) regarding the efficiency of real
estate markets, the conclusion is that they are not fully efficient. A general trading rule that
generates excess returns seems to be practically infeasible. This is mainly the result of relatively
high transaction costs in real estate markets. This thesis does not examine the efficiency of
the housing market in general, nor does it try
to formulate a general trading rule that provides excess returns. Rather, it builds on the earlier
findings that real estate markets are not fully efficient, which means that modelling
house prices is not futile.
2.1.2 Price-Volume Correlation
Since housing markets are not perfectly efficient and no central housing exchange exists, the
housing market can be characterized as a search market (Diaz & Jerez, 2013). In a search market
which means a house will be transacted. This searching and matching principle is important in
house price dynamics. Buyers and sellers set the reservation prices at which they are willing to
buy or sell a house (Geltner et al., 2007). A transaction will occur if the reservation price of the
buyer equals or exceeds the reservation price of the seller. The way in which buyers and sellers react
to a shock (i.e. set their reservation prices) may be different. Genesove and Han (2011) show that
sellers react to a demand shock with a lag. In other words, sellers gradually adjust their reservation
prices upwards when demand increases. If demand increases, the group of buyers willing to pay
the sellers’ reservation prices increases. Hence, the probability that a transaction occurs increases.
This theory could explain the observed price-volume correlation in housing markets (De Wit et
al., 2013). Consider a positive demand shock which is not instantaneously observed by all market
participants. The increase in demand will result in more transactions at first, or houses will be sold
more quickly. Sellers react more slowly to this increase in demand. In time they will increase their
reservation prices until the point at which the demand shock is completely absorbed into higher
prices. Therefore this theory gives rise to a positive price-volume correlation in the housing
market.
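The mechanism described above can be illustrated with a stylized simulation. All modelling choices in this sketch are assumptions made for illustration: buyers' reservation prices are taken to be uniformly distributed, sellers close a fixed fraction of the gap to their new target price each period, and the parameter values are arbitrary. It is not the model of Genesove and Han (2011) or De Wit et al. (2013).

```python
def simulate_demand_shock(periods=6, shock=10.0, adjust=0.5,
                          buyer_low=80.0, buyer_high=120.0, seller_price=100.0):
    """Trace the seller reservation price and the share of matching buyers.

    Buyers' reservation prices are assumed uniform on [buyer_low, buyer_high];
    the demand shock shifts this interval upwards by `shock` at period 0.
    Sellers only move a fraction `adjust` towards their new target each period.
    """
    low, high = buyer_low + shock, buyer_high + shock
    target = seller_price + shock
    price = seller_price
    path = []
    for t in range(periods):
        # A transaction occurs when the buyer's reservation price is at least
        # the seller's, so the matching share is the mass of buyers above `price`.
        share = min(1.0, max(0.0, (high - price) / (high - low)))
        path.append((t, price, share))
        price += adjust * (target - price)  # gradual upward adjustment by sellers
    return path


path = simulate_demand_shock()
for t, price, share in path:
    print(f"period {t}: seller price {price:.2f}, matching share {share:.3f}")
```

With these assumed values the matching share jumps from 0.5 before the shock to 0.75 in the period of the shock and then falls back as sellers raise their prices, which is the positive price-volume pattern the theory predicts: volume reacts first and prices follow with a lag.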
De Wit et al. (2013) study this price-volume correlation for the Dutch housing market
and seek to find an explanation for it. Besides this
search and matching approach De Wit et al. (2013) identify two other groups of theories: (1) the
interaction between downpayment constraints, mobility and house prices and (2) behavioral
explanations. The authors stress however that the three approaches are not mutually exclusive.
The fundamentals of the group of downpayment constraints lie within the work of Stein
homeowners that would like to buy another (more expensive) house are constrained by a
downpayment that they have to make in order to buy the new house. In this proposition, credit
constraints are linked to transaction volume and price changes to explain the correlation between
these. Consider the example in which house prices decrease. In this case current homeowners have
potentially less money to spend on a new home. This will hold especially when the current home
is financed with a mortgage, since this will first have to be repaid. The result is that
homeowners have little equity left in order to make a possible downpayment for their new home.
The consequence is that fewer people will sell and buy houses (i.e. the transaction volume will
decrease). This could in turn result in lower prices. The reverse holds if house prices
increase. In this case current homeowners should have more equity left to make their
downpayment, which allows them to “move up on the housing ladder” (De Wit et al., 2013).
Therefore, according to Stein (1995), price changes and changes in transaction volume are
thought to reinforce each other.
Finally, there are behavioral explanations for the relationship between prices and
transaction volume. The behavioral bias of loss aversion is generally thought to hold for
homeowners (De Wit et al., 2013). When markets go down, homeowners don’t like to sell their
houses for less than what they paid. The result is that reservation prices of sellers, hence asking
2.1.3 Empirical evidence of price-volume correlation
In a VEC-framework De Wit et al. (2013) model the long-run relationship between the rate of sale
and house prices in the Netherlands. They seek to explain what causes this correlation in the Dutch
situation. Their results suggest that the Dutch evidence seems to fit the search and matching
models best: buyers and sellers adjust gradually to changes in fundamentals. If news arrives, first
the rate of sale seems to change and prices react much more gradually.
Clayton et al. (2010) employ a panel VAR-framework and confirm that a price-volume
correlation exists for the US housing market. They confirm the downpayment hypothesis of Stein
(1995). Furthermore, it is found that prices react in a different manner than trading volume to
shocks in fundamentals (i.e. exogenous shocks). Transaction volume seems to react more strongly
than prices. Although it is not specifically mentioned in the article, this also confirms the first set
of theories: transaction volume reacts quickly while prices are adjusted gradually.
2.2 Modelling house prices and transaction volume
Based on the literature several possible variables that influence house prices and/or transaction
volume can be identified. In Table 2-1 an overview of three articles that research both house prices
and transaction volume is presented. In this table the effect of several explanatory variables is
listed. The most attention is given to variables that affect house price dynamics in the short-run as
this thesis seeks to model these.
A common factor among all three articles is that all researchers include seasonal effects
due to seasonality in house prices and transaction volume. Furthermore, there seems to be mixed
first-order autocorrelation while Clayton et al. (2010) find negative coefficients for these parameters.
The reason for these negative coefficients according to Clayton et al. is that both prices and
transaction volume adjust to exogenous shocks. In this adjustment process there may be
overshooting. House prices for example tend to overreact to an income shock, which causes them
to rise at first but is followed by a subsequent drop towards a new steady state level in a relatively
short timeframe. Nevertheless, it is generally accepted that house prices exhibit positive
autocorrelation in the short run and are mean reverting in the long run (Capozza et al., 1997).
Household income is expected to influence house prices and transaction volume positively as
more people will be able to move up the housing ladder.
The mortgage rate, which is linked to the interest rate, has a negative impact as a higher
mortgage rate makes buying a house effectively more expensive. The evidence regarding
(un)employment is rather mixed. A priori a higher unemployment rate should influence both
prices and transaction volume negatively as generally fewer people will be able to buy a house.
According to De Wit et al. (2013) this holds for the Dutch housing market in the long run. The
unemployment rate, however, does not seem to influence the short-term dynamics. The latter is
also found to be true for US house prices in Clayton et al. (2010). The unemployment rate does
influence transaction volume negatively in Clayton et al. (2010).
Clayton et al. (2010) also include two different stock market variables: the index returns,
which are the first-order difference of the level of the S&P 500, and the second-order difference of
the level of the S&P 500. The former is hypothesized to proxy for financial wealth and constraints
of households while the latter represents up- or downtrends of the stock market. Furthermore, Wu
Table 2-1: Overview of the variables used in articles to explain house price changes and transaction volume. + indicates that the effect is positive and statistically significant, - indicates a statistically significant negative effect, 0 indicates that the estimated coefficient is not statistically different from 0, and other indicates other control variables. SR indicates short-run estimates while LR indicates a long-run relationship.
Article Influence House prices* Transaction volume*
Clayton et al. (2010)
+
Employment Household income Mortgage rate trend S&P 500 trend
House prices lags 2/3 Turnover lag 3
Employment Household income Mortgage rate trend House prices lag 1
-
Mortgage rate level S&P 500 level House prices lag 1
Mortgage rate level S&P 500 trend Unemployment rate Turnover lags 1/2/3 0 Unemployment Turnover lags 2/3 S&P 500 level House prices lags 2/3
other
Quarter dummies Quarter dummies
De Wit et al. (2013) + Rate of sale (SR) Transaction prices (SR) List prices (SR/LR) Rate of entry (SR) Rate of sale (SR/LR) - Unemployment (LR) Interest rate(SR/LR) Rate of sale (LR) Rate of entry (SR/LR) Unemployment (LR) Interest rate (SR/LR) 0 Unemployment (SR) Transaction prices (LR), Unemployment (SR) Rate of entry (LR) Transaction prices (SR/LR) List prices (SR/LR) other
Seasonal dummies Seasonal dummies
Wu and Brynjolfsson (2009) + Transactions lag 1 HPI lag 1
Google “RE Agencies” contemp. and lag 1 index
Google “RE listing” contemp. index
Transactions lag 1
Google “RE Agencies” contemp. index Google “RE Listing” contemp. index
-
Google “RE listing” lag 1 HPI lag 1
0
Google “RE Agencies” lag 1, Google “RE Listing” lag 1
other
Quarter dummies State fixed effects MSA fixed effects Population
Quarter dummies State fixed effects MSA fixed effects Population
* The exact variables that the presented articles use as dependent variables are: Clayton et al. (2010) study Home prices and Turnover; De Wit et al. (2013) study transaction prices, list prices, rate of sale and rate of entry, in this overview the variables that affect transaction prices and the rate of sale are presented; Wu and Brynjolfsson (2009) study the House Price Index (HPI) and transaction volume
2.3 New data
2.3.1 Alternative data in economics and finance
The use of alternative data in scientific research has recently received some attention in the
literature (Lohr, 2012). Several examples on this topic will be discussed briefly. In reviewing these
articles specific attention is given to the alternative dataset which has been used. Most articles use
data that measure the gathering of information. The authors basically argue that information
which is gathered today by consumers, may say something about actions taken in the future. Or
as Askitas and Zimmerman (2009) put it: they provide a measure of preparatory steps to spend.
Askitas and Zimmerman (2009) find that Google Insights (known today as Google Trends)
can be useful in employment forecasts in Germany. They use an index based on Google search
queries like unemployment office (k1), unemployment rate (k2), personnel consultant (k3) and the
names of German Job search agencies (k4). They argue that k1 is associated with flow into
unemployment as activity along this search query should be linked to people contacting the
unemployment office. Hence, k1 is expected to have positive influence on unemployment figures.
According to Askitas and Zimmerman, k2 is just conveniently linked to unemployment and they
do not formulate a hypothesis for this variable. The search for personnel consultant (k3) should be correlated
with the fear of high-skilled workers of being fired in a restructuring process, but no explicit
hypothesis is formulated for k3. Finally, the activity in the search query index for job search agencies (k4) should
be associated with flow out of unemployment, since this search query should be related to job
searching activities of the unemployed. They hypothesize that k4 should have a negative impact on
unemployment office (k1) and German job search agencies (k4) produces the best results in their
error correction framework. More specifically, they find that k1 has a significant positive impact
on unemployment figures in both the short and long run and k4 has a significant negative impact
on unemployment figures in the short run.
Vosen and Schmidt (2011) also use data from Google Trends to forecast economic data. In
their article, they create a new indicator for private consumption based on Google Trends. They
compare this Google Trends indicator with two survey-based private consumption indicators (i.e.
MSCI and CCI). Google groups several search queries into aggregated search indices regarding a
certain topic. For example “Real Estate” is one of those groups. The authors select 56 categories
that they think are connected to private consumption. They find that using any of the three
indicators improves the out-of-sample one-month ahead forecasts of their ARMAX model. But
more interestingly, the Google Trends indicator significantly outperforms the MSCI and CCI
indicators.
More related to the topic of finance is the article of Bollen et al. (2011). In their article they
use “collective mood indicators” to predict the stock market. Since the stock market is known to
be relatively efficient and returns will not depend on past returns but only on news, returns are
not predictable (i.e. semi-strong EMH). But according to Bollen et al. (2011), returns in the stock
market should, besides news, also depend on mood and sentiment. In order to capture the
collective mood and sentiment they analyze almost 10 million tweets. These tweets are then
grouped into seven different mood dimensions using two different software packages (i.e.
Opinion Finder and GPOMS). Opinion Finder groups the tweets into positive or negative tweets
happy). Using a Granger causality analysis it is shown that the “calm” dimension has the most
significant predictive power in forecasting the stock market. More specifically, it is found that the
calm dimension Granger-causes the Dow Jones Industrial Average (DJIA) three to four days in
advance. In order to estimate a forecasting model to accurately predict the DJIA closing value, the
authors employ a Self-Organizing Fuzzy Neural Network (SOFNN).
2.3.2 Alternative data in the housing market
It seems that real estate research proves to be no exception when it comes to applying alternative
data sources in predicting future trends. Wu and Brynjolfsson (2009) use data from Google Trends
to forecast future transaction volume and price developments in the housing market. They use
quarterly search query data from Google of 51 states in the United States. In their research two
different predefined search query indices are used: (1) Real Estate Listing and (2) Real Estate
Agencies. Category (1) should reflect all search queries related to real estate listings and category
(2) should proxy for home buying activities. They add these Google search variables to
their baseline model, which is a seasonal autoregressive model. In this model contemporaneous
home sales (levels) are regressed against previous quarter home sales, previous quarter house
price index (i.e. the WRS index of Case and Shiller), and control variables (e.g. population, entity
fixed effects and time fixed effects). Wu and Brynjolfsson (2009) find that the coefficient on Google
search frequencies in the same month is significantly related to contemporaneous home sales, but
the coefficient on lagged search frequencies is not. More precisely, a one-percent increase in the
contemporaneous index on Real Estate Listing is associated with an increase of 20.8 percent
Real Estate Agencies index relates to an increase of 14.8 percent in home sales in the same
quarter. In order to perform out-of-sample forecasts, one-quarter forecast models are employed.
The findings are that the Mean Absolute Error (MAE) of the contemporaneous model is
0.170, which is an improvement of 2.3 percent over the baseline model without search indices. The
MAE of the one-quarter forecast model is equal to 0.172, which is an improvement of 7.1 percent
over the baseline model.
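How such a comparison works can be sketched as follows. The MAE is the average absolute difference between realized and forecast values, and the improvement is the relative reduction in MAE compared to the baseline. The function names and all numbers in the example are hypothetical; they are not the forecasts or the exact evaluation procedure of Wu and Brynjolfsson (2009).

```python
def mean_absolute_error(actual, predicted):
    """Average absolute forecast error over all observations."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_mae, model_mae):
    """Percentage reduction in MAE of a model relative to the baseline."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


# Hypothetical realized values and two sets of forecasts.
actual = [10.0, 12.0, 11.0, 13.0]
baseline_forecast = [11.0, 11.0, 12.0, 12.0]   # e.g. a model without search data
augmented_forecast = [10.5, 11.5, 11.0, 13.5]  # e.g. a model with search data

baseline_mae = mean_absolute_error(actual, baseline_forecast)
augmented_mae = mean_absolute_error(actual, augmented_forecast)
improvement = relative_improvement(baseline_mae, augmented_mae)
```

A positive value of the improvement measure means that adding the extra variables lowered the average forecast error.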
Similarly, Wu and Brynjolfsson (2009) employ models to forecast the Case and Shiller
house price index (HPI). A one-percent increase in the contemporaneous Real Estate Agencies
index is associated with an increase of 3.5 to 6 percent of the HPI. The past search frequencies of
Real Estate Agencies prove to be insignificant. Moreover, a one-percent increase in the current
search frequency of Real Estate Listing is related to an 8 to 9 percent increase of the HPI. But
strikingly, the coefficient on the one-quarter lagged variable of Real Estate Listing turns out to be
negative and significant. The authors do not give an explicit explanation for this phenomenon,
but suggest that this may have something to do with the difficulty in forecasting prices. Prices
may be more difficult to forecast because of complex supply and demand
dynamics. Nevertheless, the MAE of the baseline model that predicts the current HPI improves
by 2.5 percent when Google search data is included. Moreover, the MAE improves by 3 percent
when predicting the future HPI.
Another notable result of the research is that Google Trends data alone are able to forecast the
contemporary transaction volume with relatively high accuracy. The result is obtained by leaving
out the autoregressive component of transaction volume and the lagged value of the HPI. The
contemporary HPI forecast model, the R² is 0.987 with the autoregressive component of the HPI
and lagged value of transaction volume. When these components are left out (i.e. only the Google
Trends data and controls remain) the R² remains at 0.987.
Although the results of the article of Wu and Brynjolfsson (2009) seem convincing, their
model may contain some issues. In their article, no results are reported on whether the time series
used (e.g. transaction data, price index and Google Trends data) are stationary, while the authors
specifically model levels. This may cause spurious regression if the series are actually
non-stationary and not cointegrated (Stock & Watson, 2011). In many studies price changes, which
are already differenced, are modelled (see for example Malpezzi, 1999). It also seems that the mean and
median price per month on a nationally aggregated scale in the Netherlands are non-stationary
(more on this in the methodology chapter). Furthermore, in the article of De Wit et al. (2013) a
VEC model is applied to prices and transactions on the Dutch market, which implies these
variables are I(1). This in turn implies these variables are non-stationary in levels.
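The differencing step referred to above can be sketched in a few lines. A series that is I(1) in levels is transformed into (approximate) percentage changes by taking first differences of the logarithm. The function name and the price figures are assumptions for illustration; the sketch only shows the transformation and does not test for stationarity.

```python
import math


def log_differences(levels):
    """First differences of the log of a series, i.e. log(x_t / x_{t-1})."""
    return [math.log(levels[t] / levels[t - 1]) for t in range(1, len(levels))]


# Hypothetical price levels that grow by 5 percent per period.
prices = [200.0, 210.0, 220.5]
changes = log_differences(prices)
```

Modelling such changes instead of levels is one common way to avoid the spurious regression problem when the levels are non-stationary and not cointegrated.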
2.4 Relevance
This research uses Funda data in models of the Dutch housing market. To my knowledge these data
have never been used in scientific research. The data are comparable to those used in several articles discussed
in this chapter, although there are differences. They are comparable in the sense that Funda data can also
be considered an alternative source of data that could capture preparatory
steps to spend. They are different in the sense that the data are not publicly available. But on the
other hand, this also means that they can be captured on a much finer-grained scale than for example
data will be used in order to provide insights into the dynamics between price changes and transaction
volume in the Netherlands. Moreover, since the data contain both a cross-sectional and temporal
component the effects can be modelled in a panel-data setting which makes it possible to decrease Omitted
3 Data
In this chapter a verbal, statistical and spatial description of the obtained data will be presented.
3.1 Data sources
Two main datasets are required in order to perform the research. First, data regarding house prices
and transaction volume are required. Data from the Dutch Brokerage Association (NVM) are used
for these purposes. The NVM data consist of the number of transactions and median price per
cubic meter per month per zip code area from January 2011 until December 2013. In total there are
approximately 275,000 transactions included in the database. A transaction is recorded
in the NVM database at the time of the signing of the buyer’s contract. In other
databases, like that of the CBS/Kadaster, a transaction is included when the legal transfer takes place.
Therefore it is generally found that the NVM transaction data lead other data sources.
As mentioned above, the relationship between house prices and “Funda search behavior” is
examined. In a survey conducted by Kaela Research, 93% of the respondents answered “Funda” when
asked to name a housing website (Funda, 2013). Additionally, 81% would prefer Funda if they
were to sell their home online. Funda.nl has 4.2 million unique visitors per month (Funda,
2013). Kerste et al. (2012) show that Funda is by far the most popular housing
website in the Netherlands. According to their report Funda has a stable market share of around
60% of all Dutch housing websites. Funda is owned by the NVM, from which the transaction
data originate. Kerste et al. (2012) mention that most of the brokers that are
[...] from these NVM brokers. The relationship between Funda activity and NVM transactions and
price developments is therefore intuitive.
It must, however, be noted that not all transactions in the Netherlands go through NVM
brokers. The transaction and price data presented in this research are therefore not a fully accurate
representation of the Dutch housing market. According to Kerste et al. (2012) the NVM had a
market share of around 75% of Dutch housing transactions between 2010 and 2011. De Wit et al.
(2013), who also employ NVM data, mention that the market share was
55-60% in 2007. According to the TU Delft (2014) the market share of the NVM is around
70%. As mentioned before, the NVM dataset used here consists of roughly 275,000 transactions
between 2011 and 2013. According to CBS (2014a), which uses Kadaster data covering
all transactions, there were 348,094 transactions in the same period. This suggests that
the NVM had a market share of approximately 79%¹ over this period.
The inaccurate representation of the Dutch housing market may influence the models
employed. The regional models in particular may exhibit some bias, as the market share of the NVM
differs regionally (Kerste et al., 2012; De Wit et al., 2013). In more rural areas the market share is
said to be substantially lower. The models may therefore prove to be less accurate for rural areas.
However, as most houses on Funda originate from NVM brokers, it is expected that this will not
be a large problem in this case. It does, however, lower the robustness of the employed models
with respect to the complete Dutch market. Nevertheless Funda and the NVM are still by far the
largest in terms of market share. The second-largest brokerage website is Jaap.nl with a market
share of only 9% (Kerste et al., 2012). The second-largest brokerage association is the VBO, which has
a market share of less than 10% (Kerste et al., 2012). Furthermore, obtaining transaction and search
information for each zip-code area from other associations and websites would be too
time-consuming. Therefore Funda.nl seems to be the most suitable source of data on search
behavior.
¹ As mentioned in the text, the NVM uses a different definition of “transaction”, so the compared periods might not be exactly equal. This should, however, have only a minor effect on the market share, as the lag of the Kadaster on the NVM is only three months (De Wit et al., 2013). Three months is only 3/36 of the whole series.
Fortunately, Funda is willing to collaborate in this research and is in a position to deliver
data regarding search behavior. These data are obtained in the following manner. The data
describe the times watched per object per month. The objects (i.e. houses) are subsequently linked
to a PC4 code (i.e. zip code). In the next step the totals per PC4-area are calculated. The results are
(1) times watched and (2) number of objects per PC4-area per month.
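The aggregation steps just described can be sketched in a few lines of Python. The record layout (object id, PC4 code, month, times watched) and all sample values are assumptions made for illustration; they do not reflect the actual structure of the Funda extract.

```python
from collections import defaultdict

# Hypothetical records: (object_id, pc4_code, month, times_watched).
# The field layout and values are illustrative assumptions only.
records = [
    ("obj-1", "1011", "2011-01", 120),
    ("obj-2", "1011", "2011-01", 80),
    ("obj-3", "1012", "2011-01", 45),
    ("obj-1", "1011", "2011-02", 60),
]


def aggregate_per_pc4(rows):
    """Return {(pc4, month): (times_watched, n_objects)}."""
    views = defaultdict(int)
    objects = defaultdict(set)
    for object_id, pc4, month, watched in rows:
        views[(pc4, month)] += watched        # (1) total times watched
        objects[(pc4, month)].add(object_id)  # (2) distinct listed objects
    return {key: (views[key], len(objects[key])) for key in views}


totals = aggregate_per_pc4(records)
```

Counting distinct object ids per area and month, rather than rows, keeps an object from being counted twice if it appears in several records for the same month.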
Finally, some control variables are obtained from the Dutch Central Bureau of Statistics in
order to ascertain the cross-sectional explanatory power of Funda (CBS, 2014a). In the panel data
setting both time fixed effects and entity fixed effects are included, which should already account
for most of these control variables.
3.2 Aggregation process
An identified data issue relates to the scale level of the data. Although the price
index is corrected for size (i.e. price per cubic meter), it is not corrected for any other characteristic.
For example, if only one or two houses are sold per quarter, the result could be that in month 1 a
very well-maintained house is sold, while in month 2 a very poorly maintained house is sold. This
[...] aggregate the zip codes in two ways: (1) to aggregated PC-levels for the cross-sectional models
and (2) to COROP-regions for the panel data models. A detailed description of the aggregation
process is presented in the Appendix.
3.3 Variable description
3.3.1 Cross-section models
In the cross-sectional models regional differences between house prices are explained with the
inclusion of Funda data. The variables used in the cross-sectional models are presented in Table
3-1. All variables in this table are per PC-area, where the PC-areas are the aggregated
areas described in the previous section.
Table 3-1: The variables employed in the cross-sectional models
Variable name   Description                                               Source
pr              Mean transaction price per m3 in 2011-2013                NVM
fw              Mean Funda times watched in 2011-2013                     Funda
fo              Mean Funda number of objects in 2011-2013                 Funda
fwo             Mean Funda times watched per object in 2011-2013          Funda
inc             Average monthly fiscal income in 2008                     CBS
hh              Number of households in 2013                              CBS
p_he            Percentage of population with higher education in 2011    CBS
hhs             Average household size in 2013                            CBS
pd              Population density in 2013 (per ha)                       CBS/UvA
To measure “internet search popularity” an additional variable is generated. This variable
[...] for this variable could indicate a more popular area. The concept behind this interpretation is as
follows. More clicks on a certain object could imply that this object is relatively popular. Therefore,
more clicks on objects (relative to the total number of objects) in a certain area could be evidence
of a more popular area. The times watched can therefore be characterized as a demand variable.
The number of objects per area can proxy for supply in a certain area, as it is roughly equal
to the number of houses that are for sale in this area. The generated variable can therefore
approximate demand relative to supply.
To isolate this “popularity”, several control variables are included in the models. Although
literature on cross-sectional house prices is limited and not discussed in detail in the literature
review, it can be expected that the demand and supply factors that determine house price dynamics
are similar to the factors that determine house prices over the cross-section. As the scale level is
relatively detailed, the pool of available control variables is small. CBS (2014a) provides yearly
demographic data per PC4-area, such as population size, number of households and household size.
Furthermore, the CBS (2014b) irregularly publishes data regarding income at zip-code level. As
the most recent version of this database was published in 2012 and contains income data for 2008,
this database is somewhat outdated (CBS, 2014b). In 2013 the bureau also published data
regarding levels of education and social security at zip-code level.
All variables are aggregated from PC4-level to PC-levels as elaborated in the Appendix.
The house prices per PC-area are calculated as the mean transaction price per cubic meter between
2011 and 2013. The Funda data are first aggregated by summing the number of times watched and
number of objects per aggregated PC-area. Next, the average number of objects and times watched
[...] The aggregation of the control variables is elaborated in the Appendix. In order to calculate
the population density per PC-area, spatial data on the polygons of each PC4-area are
obtained from the University of Amsterdam (2014). These PC4 polygons are aggregated using the
aggregation scheme. Next, the size (in ha) is calculated using the geometry functions of the GIS
package ArcMap. Subsequently the population density is calculated by dividing the number of
inhabitants by the size of the area.
Summary statistics of the variables are presented in Table A-2. A result of the
aggregation procedure is that the larger aggregated areas are located in rural regions. This can
clearly be seen from the population density statistic, which decreases sharply as the
aggregation level increases. The average price per cubic meter also decreases as the aggregation level
increases. Furthermore, higher-aggregated areas encompass more Funda search activity. This is
because there are simply more objects placed on the internet in these areas, as they are larger.
Interestingly, Table A-2 indicates that the number of clicks per object decreases as the
aggregation level increases. This suggests that more aggregated areas are relatively unpopular
[...]
Figure 3-1: Average price per cubic meter (left) and average times watched per object (right), 2011-2013, darker areas indicate a higher price and more clicks respectively.
It is generally accepted that house prices differ regionally (Capozza et al., 2002). The Netherlands
proves to be no exception (Figure 3-1). From the left map it can clearly be observed that the central
areas in the Randstad and areas close to the Randstad are more expensive per cubic meter. Moreover,
houses in larger cities also seem to be more expensive. Furthermore, prices per cubic meter
are lowest in the northern provinces of Friesland, Groningen and Drenthe (with the exception of
the city of Groningen), in the eastern province of Overijssel and in the south-eastern province
of Limburg.
With respect to the popularity of PC-areas, the right map in Figure 3-1 provides a
clear-cut overview. The more popular areas are those in or close to larger cities. Moreover, the
similarity between the left and right maps in Figure 3-1 is striking. For example, houses in
the Randstad are watched relatively more often and are also more expensive per cubic
meter. Houses in PC-areas located in or near larger cities are watched more often and
are more expensive per cubic meter.
Although this pattern is visible for most of the country, some areas in the North (i.e.
Groningen) seem to show a somewhat different pattern. A possible explanation for this
phenomenon could be that in the years over which the sample was taken (2011-2013), this area was
hit by several earthquakes. During this period many questions were raised about whether these
earthquakes had an impact on house prices. Although research by Francke and Lee (2013) has
shown that price changes in Groningen did not differ significantly from those in a comparison area, news
regarding house prices in Groningen may have triggered internet search behavior.
The correlations between the variables are presented in Table 3-2. The cross-correlation
[...] fairly high. The correlation between the number of objects and times watched is very high (0.88),
but the correlations between these variables and house prices are low. This suggests that the ratio
between times watched and the number of objects may be more useful in regression models than
the separate variables. Other variables that are relatively highly and positively correlated with
house prices are income and the percentage of highly educated people. These correlations are 0.65
and 0.69 respectively. Another notable feature of the correlation table is the very high correlation
between the number of objects and the number of households (0.90).
Table 3-2: Correlations between the variables in the cross-sectional models
Variable   pr        fw        fo        fwo       inc       hh        p_he      hhs       pd
pr         1
fw         0.1574    1
fo         -0.136    0.8792    1
fwo        0.5988    0.175     -0.315    1
inc        0.649     0.1715    -0.0875   0.5251    1
hh         -0.2151   0.7474    0.9043    -0.3797   -0.2838   1
p_he       0.6944    0.0815    -0.2134   0.6087    0.6694    -0.2783   1
hhs        -0.3471   0.035     0.158     -0.2618   0.0197    0.0623    -0.5964   1
pd         0.3353    -0.3313   -0.3976   0.1665    0.1008    -0.2812   0.495     -0.6253   1
3.3.2 Panel data models
This section presents an overview of the variables employed in the panel data models, which
are listed in Table 3-3. All variables are log-transformed. The variable used to measure price
developments is the median² price per cubic meter. In the literature review several possible control
variables are presented and some of these are included in the cross-sectional models.
Unfortunately, most data are not available both at COROP level and on a quarterly basis. However, it
may be expected that most of the variation in these control variables is absorbed by the time fixed
and entity fixed effects. For example, it can be expected that building costs do not differ much
across COROP regions, hence time fixed effects can control for these. Other variables that differ
regionally, like income, may be absorbed by entity fixed effects if it is assumed that they
do not change over time.
Table 3-3: Overview of variables employed in the panel data models
Variable name   Description
tr              Transaction volume
pr              Median house prices per m3
fw              Funda times watched
fo              Funda number of objects
fwo             Funda times watched per object
² Mean house prices were also tried, but median house prices provided a slightly better fit; furthermore, the median price should be less affected by outliers than the mean price.
For illustration purposes the time component of the variables is described visually by
presenting national aggregates of each variable. The transaction and Funda variables are
the totals of the respective variables in the corresponding quarter. The price index is calculated as the
weighted aggregate median transaction price index, where the weights are based on the number
of transactions³. Regions with more transactions therefore have a larger share in the national
median price index. The statistics presented in Table 3-4 are based on COROP-scale data.
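The transaction-weighted aggregate can be illustrated with a minimal sketch. The function name and the regional figures below are hypothetical; only the weighting scheme (regional median price weighted by the region's number of transactions) follows the description in the text.

```python
def national_price_index(regions):
    """Transaction-weighted average of regional median prices for one quarter.

    `regions` is a list of (median_price_per_m3, n_transactions) tuples.
    """
    total_transactions = sum(n for _, n in regions)
    if total_transactions == 0:
        raise ValueError("no transactions in this quarter")
    return sum(price * n for price, n in regions) / total_transactions


# Hypothetical quarter with three regions (values are made up).
quarter = [(600.0, 300), (500.0, 100), (400.0, 100)]
index = national_price_index(quarter)
```

In this made-up quarter the first region accounts for 300 of the 500 transactions, so it pulls the national figure towards its own median price.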
As can be seen from the graphs in Figure 3-2 transaction volume decreased in 2011 and
2012. The median prices per cubic meter recovered slightly in 2013. The number of transactions
per quarter stayed within the range of 20,000 to 25,000 transactions until 2012, but started
increasing in the second half of 2013.
A striking feature of the quarterly transaction development is the spike during the last
three months of 2012. Conversely, 2013Q1 showed relatively few transactions. A possible reason
for this spike and the subsequent fall could be the tax reforms which came into effect on the 1st of
January 2013. In these reforms the interest deductibility for new interest-only mortgages was
abolished (Belastingdienst, 2014). However, interest payments on mortgages that were
originated prior to the 1st of January 2013 can still be deducted. These reforms may have caused
people who desired an interest-only mortgage to buy a house quickly before the new tax rules
came into effect. The spike in transactions in 2012Q4 and the subsequent drop in 2013Q1 could bias
the results and therefore have to be taken into account in the modelling process.
The drop which can be seen in the number of transactions is also visible in the Funda data.
³ $pr\_index_t = \sum_{i=1}^{I} \dfrac{pr\_index_i \cdot transactions_i}{total\ transactions_t}$
Interestingly, the drop in times watched was already visible in 2012Q4, one quarter before the
actual drop in transactions occurred. Like the number of transactions, the Funda variables also
showed an increase during 2013. The analogy between the Funda variables and prices is
somewhat less straightforward, but the small recovery in prices in 2013 is also visible in the Funda
variables.
Although not visible in the national aggregate of the price index, the median price per
cubic meter contains noise. This is clearly visible in Table E-1 in the Appendix, which contains the
first-order autocorrelations of the first difference of the median price per cubic meter. The AR(1)
coefficient on the unsmoothed series is negative and significant, which indicates noise. In order to
cope with this noise, the data are smoothed using a Hodrick-Prescott (HP) filter. Several
smoothing parameters and other techniques, like weighted moving
averages or exponential decay smoothing, were tried, but the HP technique proved to be most successful. The choice
to continue with the HP-filtered series with a lambda of 0.20 is somewhat arbitrary, but
other smoothing parameters are discussed in the robustness checks chapter. Similarly,
transaction volume also contains noise. These series are therefore also smoothed, using a lambda
of 0.50.
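To make the smoothing step concrete, the HP trend can be computed by solving the linear system (I + λK'K)τ = y, where K takes second differences. The sketch below is a plain-Python illustration of that textbook formulation, not the routine used to produce the results; the series `log_prices` is made up and `hp_trend` is an illustrative name.

```python
def hp_trend(series, lamb):
    """Hodrick-Prescott trend: solve (I + lamb * K'K) trend = series."""
    n = len(series)
    if n < 3:
        return [float(v) for v in series]
    # Build A = I + lamb * K'K, where each row of K applies the
    # second-difference stencil (1, -2, 1) at positions t, t+1, t+2.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    stencil = (1.0, -2.0, 1.0)
    for t in range(n - 2):
        for i in range(3):
            for j in range(3):
                a[t + i][t + j] += lamb * stencil[i] * stencil[j]
    # Gaussian elimination with partial pivoting.
    b = [float(v) for v in series]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            if factor == 0.0:
                continue
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    trend = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(a[row][k] * trend[k] for k in range(row + 1, n))
        trend[row] = (b[row] - s) / a[row][row]
    return trend


# Hypothetical noisy quarterly log prices for one region (12 quarters).
log_prices = [6.30, 6.36, 6.29, 6.33, 6.27, 6.31, 6.25, 6.30, 6.26, 6.32, 6.28, 6.34]
smoothed = hp_trend(log_prices, lamb=0.20)
```

A larger lambda penalizes curvature in the trend more heavily and therefore gives a smoother series; a lambda of zero returns the original data unchanged.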
Table 3-4: Descriptive statistics of the variables employed in the panel data models, 2011-2013
Variable Mean Std. Dev. Min Max N
tr 579.66 535.69 37.00 3,145.00 480
pr 570.47 134.41 352.94 1,112.16 480
fw 14,875,266 12,133,033 2,576,366 79,000,000 480
fo 58,109 42,382 9,592 223,969 480
Figure 3-2: Quarterly developments of the nationally aggregated variables employed in the panel data models
4 Econometric Modelling
In this chapter the econometric methodology used to estimate the cross-sectional and panel
data models is presented. As the cross-sectional models are meant to provide an introduction to
the panel data modelling process, the emphasis lies on the latter.
4.1 Cross-sectional models
The general specification is presented in equation (1). Here the βs are the parameters which have
to be estimated and ε is the error term which is assumed to be i.i.d. Furthermore, subscript i
denotes the aggregated PC-area. The variables are listed in Table 4-1. The methodology applied
in order to estimate the coefficients is Ordinary Least Squares (OLS).
(1) $pr_i = \beta_0 + \beta_1 fwo_i + \beta_2 inc_i + \beta_3 hh_i + \beta_4 p\_he_i + \beta_5 hhs_i + \beta_6 pd_i + \varepsilon_i$
Table 4-1: Definition of used variables
Variable   Description                Definition
pr         House prices               Log(Mean transaction price per m3)
fwo        Funda watched per object   Log(Funda times watched/Funda objects)
tr         Transaction volume         Log(Number of transactions per month)
inc        Income                     Log(Average monthly fiscal income)
hh         Number of households       Log(Number of households)
p_he       % High educated            Percentage of population with high education
hhs        Household size             Population/Number of households
4.2 Panel data models
Although it may be expected that there exists a long-run relationship between the Funda data and
house prices and/or transaction volume, the limited availability of the data does not allow this to be
examined. It is, however, expected that the data may be able to track the short-run dynamics of
supply and demand. Therefore an econometric framework that models short-run changes in house
prices and transaction volume is most applicable.
The data consist of a balanced panel of 40 COROP-regions with data ranging from 2011Q1
to 2013Q4. When applying a panel data approach one can control for unobserved heterogeneity
between the cross-sectional units by using, for example, a within transformation or a Least Squares
Dummy Variable (LSDV) approach (Stock & Watson, 2011). In the theoretical framework it was
concluded that house prices exhibit positive serial correlation in the short run. This suggests that
lags of the dependent variable should be included in the models. Hence, modelling the data as a
dynamic panel seems most natural. A problem that arises when including lags of the dependent
variable in the regression is that these lags are correlated with the error term and will therefore
produce biased estimates (i.e. Nickell’s bias, see Nickell, 1981; Roodman, 2009). A possible solution
to this problem is to apply the GMM approach proposed by Arellano and Bond
(1991), which is elaborated in the following sections.
For convenience the house price model presented in equation (2) is discussed in the text.
The transaction volume model can be constructed in a similar fashion and is presented in equation
(3).
The general model to explain house price changes is presented in equation (2)⁴; the variables used
are presented in Table 4-1. Here house price changes of cross-sectional unit i in period t depend
on house price changes in cross-sectional unit i in periods t-1 through t-q and the clicks per object
in cross-sectional unit i in periods t-1 through t-p. Furthermore, vit is the error term, which is i.i.d.
By transforming the variables, which in this case is done by taking first differences, the
unobserved heterogeneity αi cancels out. The lagged dependent variable, however, is still
correlated with the transformed error term: the differenced lagged dependent variable contains the
price in period t-1, which depends on the error in period t-1, and that same error is part of the
differenced error term in period t. Hence an OLS estimate of the first-differenced model will be
biased and inconsistent.
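The differencing step can be written out as a short derivation. This is a sketch that assumes the levels model given in footnote 4, with the coefficient on fwo indexed by p; it is meant to illustrate how the entity effect drops out rather than to reproduce equation (2) exactly.

$$
pr_{it} - pr_{it-1} = (\alpha_i - \alpha_i) + \sum_{q=1}^{Q} \gamma_q \,(pr_{it-q} - pr_{it-q-1}) + \sum_{p=0}^{P} \beta_p \,(fwo_{it-p} - fwo_{it-p-1}) + (\lambda_t - \lambda_{t-1}) + (v_{it} - v_{it-1})
$$

$$
\Delta pr_{it} = \sum_{q=1}^{Q} \gamma_q \,\Delta pr_{it-q} + \sum_{p=0}^{P} \beta_p \,\Delta fwo_{it-p} + \Delta\lambda_t + \Delta v_{it}
$$

Since $\Delta pr_{it-1} = pr_{it-1} - pr_{it-2}$ and $\Delta v_{it} = v_{it} - v_{it-1}$ both involve period t-1, the first lag of the differenced dependent variable cannot be treated as exogenous in the differenced model.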
Besides entity fixed effects, time fixed effects are also included in the models; these are
denoted by λt. They account for temporal factors that are similar across entities. Seasonal
effects are also absorbed by these time fixed effects.
In order to cope with these issues, the parameters of interest are estimated using the
Generalized Method of Moments (GMM) estimator (Arellano & Bond, 1991; Arellano & Bover,
1995). Here the coefficients are estimated using an Instrumental Variable (IV) approach, in which
first differences are instrumented by the lagged levels of the same variable⁵.
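The idea of instrumenting first differences with lagged levels can be illustrated with a small simulation. Note that the sketch below uses the simpler just-identified Anderson-Hsiao IV estimator with a single lagged level as instrument, not the Arellano-Bond GMM estimator applied in this thesis, and it omits the Funda regressors and time effects. All parameter values, function names and the simulated data are assumptions for illustration.

```python
import random


def simulate_panel(n_units, n_periods, gamma, seed=0, burn_in=20):
    """Simulate y_it = alpha_i + gamma * y_i,t-1 + v_it for each unit."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n_units):
        alpha = rng.gauss(0.0, 0.5)   # time-invariant entity effect
        y = alpha / (1.0 - gamma)     # start at the unit's long-run mean
        path = []
        for t in range(burn_in + n_periods):
            y = alpha + gamma * y + rng.gauss(0.0, 1.0)
            if t >= burn_in:
                path.append(y)
        panel.append(path)
    return panel


def first_difference_estimates(panel):
    """Return (OLS, IV) estimates of gamma from the first-differenced model."""
    xy = xx = zy = zx = 0.0
    for path in panel:
        for t in range(2, len(path)):
            dy = path[t] - path[t - 1]          # differenced dependent variable
            dy_lag = path[t - 1] - path[t - 2]  # its first lag (endogenous)
            z = path[t - 2]                     # instrument: lagged level
            xy += dy_lag * dy
            xx += dy_lag * dy_lag
            zy += z * dy
            zx += z * dy_lag
    return xy / xx, zy / zx


panel = simulate_panel(n_units=5000, n_periods=10, gamma=0.5)
ols_fd, iv_fd = first_difference_estimates(panel)
print(f"true gamma = 0.5, OLS on differences = {ols_fd:.2f}, IV = {iv_fd:.2f}")
```

The simulated panel is deliberately much larger than the 40 regions available here, so that the difference between the two estimators is not masked by sampling noise. With few cross-sectional units the IV estimate can be quite imprecise, which is one motivation for using additional lags as instruments.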
⁴ This model is created by taking the first differences of:
$pr_{it} = \alpha_i + \sum_{q=1}^{Q} \gamma_q \, pr_{it-q} + \sum_{p=0}^{P} \beta_p \, fwo_{it-p} + \lambda_t + v_{it}$
⁵ In the Anderson-Hsiao approach only a single lag of the dependent variable is used as an instrument. If the variable in levels is close to a random walk, this instrument is found to be relatively weak (Roodman, 2009). Since house prices are expected to follow a random walk, more lags are included as instruments.
The transaction volume model can be established in a similar fashion and is presented in
equation (3).
(3) $\Delta tr_{it} = \sum_{s=1}^{S} \delta_s \, \Delta tr_{it-s} + \sum_{r=0}^{R} \beta_r \, \Delta fwo_{it-r} + \Delta\theta_t + \Delta u_{it}$
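In order to examine the relationship between house prices and transaction volume and how these
respond to changes in the times watched per object variable, a panel Vector Autoregression (VAR)
model is defined in (4). In this VAR model the changes in transaction volume and price changes
are modelled simultaneously. In the impulse responses generated by a VAR model the
simultaneous causality between house prices and transaction volume is explicitly taken into
account.
(4) $\begin{pmatrix} \Delta pr_i \\ \Delta tr_i \end{pmatrix}_t = \sum_{q=1}^{Q} \begin{pmatrix} \gamma_1 & \delta_1 \\ \gamma_2 & \delta_2 \end{pmatrix}_q \begin{pmatrix} \Delta pr_i \\ \Delta tr_i \end{pmatrix}_{t-q} + \sum_{p=0}^{P} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}_p fwo_{it-p} + \begin{pmatrix} \Delta\lambda_1 \\ \Delta\lambda_2 \end{pmatrix}_t + \begin{pmatrix} \Delta v_1 \\ \Delta v_2 \end{pmatrix}_t$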
Roodman (2009) distinguishes eight assumptions regarding the data-generating process when
applying GMM and states that these should be addressed prior to doing the research. These
are included in Appendix B. The first assumption is the main reason why the coefficients are
estimated using GMM: from the literature it can be concluded that house prices are expected to
follow a dynamic process. The second assumption is a reason why a panel set-up is used.
Although cross-sectional regressions are also performed, the time component is expected to provide
additional insights into the data. Lagged dependent variables are obviously endogenous, but this is
handled through the GMM set-up. The standard errors are estimated using the cluster option, and
[...] however may not be correlated over the cross-section. Therefore, time fixed effects (i.e. time
dummies) are included in the regressions, which makes this assumption more likely to hold
(Roodman, 2009). The panel used can be considered “small T, large N” as T is 12 and N is 40. In
the model only lags of the regressors are included as instruments, hence there are only
internal instruments.
4.3 Hypotheses
Although no empirical evidence regarding the Funda data exists, several hypotheses regarding
transaction volume and house prices in the Dutch housing market can be
stated on the basis of the literature. The first hypothesis relates to both the cross-sectional models and the panel data models.
Hypotheses II through IV apply only to the panel data models.
I. The Funda data have explanatory power in house price models.
II. The Funda data have explanatory power regarding transaction volume.
III. Transaction volume and house prices are positively related.
IV. Transaction volume reacts more quickly to changes in demand and supply dynamics than prices.
The first hypothesis explores whether the data are helpful in explaining cross-sectional and
intertemporal variance in house prices. This hypothesis is tested in two
ways: (1) several variables derived from the Funda data are tested for (joint)
significance in the cross-sectional model and (2) the Funda variable fwo and its lags are included
and it is tested whether the Funda data Granger-cause house prices.
Secondly, the relationship between the data and transaction volume is examined in a
[...]
Next, from the literature it is concluded that there exists a positive relationship between
house prices and transaction volume. This is stated in the third hypothesis and will be
tested empirically. To this end, a panel VAR model is formulated which
includes both transaction volume and house prices.
Finally, the fourth hypothesis combines the first, second and third hypotheses. In
the literature it is generally found that transactions are the first to react to demand or supply
shocks and that prices adapt much more gradually. As it is expected that the Funda data are able to
track these demand and supply shocks, similar results are expected regarding the Funda
[...]
5 Results
5.1 Cross-sectional models
The results of three cross-sectional models are presented in Table 5-1. Model (1) includes several
explanatory variables, but no Funda variables. Model (2) additionally includes two separate
Funda variables, namely times watched and number of objects. Model (3) includes the generated
variable times watched per object. From Figure D-1 in Appendix D it can be seen that the
residuals are normally distributed.
Table 5-1: Cross-sectional models.
This table presents three alternative models with house prices per PC-area as the dependent variable. Column 1 includes explanatory variables only; columns 2-3 additionally include Funda variables. Variable coding: pr is house prices, fw is Funda times watched, fo is Funda objects, fwo is times watched per object, inc is income, hh is households, p_he is percentage highly educated, hhs is household size and pd is population density; for a definition of the variables used see Table 4-1. All standard errors are heteroscedasticity-robust but unreported; t-statistics in parentheses. *, ** and *** indicate statistical significance at the 10%, 5% and 1% levels respectively.
            (1)          (2)          (3)
            pr           pr           pr
fw                       0.246***
                         (8.29)
fo                       -0.235***
                         (6.23)
fwo                                   0.246***
                                      (8.40)
inc         0.812***     0.740***     0.748***
            (10.01)      (8.72)       (9.57)
hh          0.008        0.028        0.038***
            (0.65)       (0.98)       (3.22)
p_he        0.539***     0.202        0.195
            (3.51)       (1.39)       (1.34)
hhs         -0.148***    -0.128***    -0.128***
            (4.13)       (3.69)       (3.73)
pd          0.012**      0.025***     0.024***
            (1.97)       (3.73)       (3.91)
Constant    0.230        -0.834       -0.923*
            (0.41)       (1.43)       (1.77)
R2          0.56         0.60         0.60
RMSE        0.17         0.17         0.17
N           788          788          788
The variables in the first model all have the expected signs. The largest contributor to regional
house price differences is income: an increase of 1% in average income per area translates into
0.8% higher house prices. Household size is found to be inversely related to house prices: an
increase in household size of one person is related to 14.8% lower house prices. Such an effect was
expected a priori, since larger households are more often located in suburbs or more rural areas while
smaller households are located in more urban areas. The significance of the variable is perhaps
somewhat unexpected, since population density is also included in the model and
should measure similar effects⁶. The sign of population density is as expected: an increase of 1%
translates into 0.01% higher house prices. An increase of 1% in population density is equal to 0.3
more inhabitants per ha. Even when controlling for income, the percentage highly educated is
significant. If the percentage of highly educated people increases by 1 percentage point, house prices increase by
0.5%. The number of households is found to be insignificant.
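The readings above use the usual small-change approximation for logarithmic models. A minimal sketch of the exact conversions follows, assuming (as in Table 4-1) that pr and inc are in logs while hhs is in levels; the function names are illustrative.

```python
import math


def loglog_effect(beta, pct_change):
    """Exact % change in y when x rises by pct_change % in a log-log model."""
    return ((1.0 + pct_change / 100.0) ** beta - 1.0) * 100.0


def semilog_effect(beta, unit_change):
    """Exact % change in y when x rises by unit_change in a log-level model."""
    return (math.exp(beta * unit_change) - 1.0) * 100.0


# Coefficients taken from model (1) in Table 5-1.
income_effect = loglog_effect(0.812, 1.0)       # 1% higher income
household_effect = semilog_effect(-0.148, 1.0)  # one extra household member
```

Under these assumptions the income effect is essentially unchanged at about 0.81%, while the exact effect of one extra household member is roughly -13.8% rather than -14.8%. The approximation is accurate for small coefficients and becomes less so as the coefficient on a level variable grows.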
In the second model two Funda variables have been added, namely the times
watched and number of objects variables. These variables have the expected signs
and are both highly significant. The demand variable (i.e. times watched) is positively
related to prices, while the supply variable (i.e. number of objects) is negatively related to prices.
An increase in times watched of 1% translates into 0.25% higher prices. A 1% increase in the
number of objects relates to 0.24% lower house prices. A 1% increase in times watched corresponds on average
to approximately 1530 more clicks. An increase of 1% in the number of objects corresponds
to approximately 6 listed properties on Funda on average. The signs and magnitudes of the
control variables are roughly similar to those in model (1). One notable exception is that the
percentage highly educated becomes insignificant. Furthermore, the coefficient on income
decreases somewhat in size. The reason might be that the Funda data absorb some of these
demand characteristics and measure similar things. The increase in model fit when Funda data
are included is visible but limited, as the R-squared increases from 0.56 to 0.60⁷. This provides
evidence that the Funda data are able to capture similar variation in house prices as other demand
and supply characteristics, and thus are able to measure demand and supply.
⁶ When population density is left out, the size and significance of household size increase. Similarly, when household size is left out, the size and significance of population density increase.
The similarity in magnitude between the Funda variables is striking. This could, however,
be the result of the high correlation between the number of times watched and the number of objects
(0.88; see Table 3-2). Therefore, in model (3) these variables are combined into the times watched
per object variable. The sign of this variable conforms to expectations and it is also highly significant.
If the times watched per object in a PC-area increases by 1%, prices increase by approximately
0.25%. A 1% increase in times watched per object corresponds to approximately 3 more clicks per
object. If it is assumed that the times watched per object variable measures demand relative to
supply accurately, the magnitude of an increase in demand relative to supply should be similar.
The estimated coefficients of the control variables, with the exception of the number of households, are
very similar to those of model (2). The number of households is now significant and has the
expected sign, as the number of objects is now excluded, which solves the multicollinearity issue. An
increase of 1% in the number of households is related to 0.04% higher house prices.
When including control variables, the interpretation of the Funda data changes when no
⁷ When no control variables are included, Funda explains roughly 36% of regional house price differences. Adjusted R-squared values are also 0.56 and 0.60; for consistency only the regular R-squared is reported, since these have to be calculated by hand in the panel data models estimated by GMM.