
Using spatial econometrics to find data quality issues.

Master Thesis
Stefan Zwart
University of Groningen, Faculty of Economics and Business
MSc Marketing Intelligence
02-07-2018
Folkingestraat 47D2-2, 9711 JV Groningen
Student number: S2550040
Email: s.zwart.1@student.rug.nl
Tel: +31643146597

Supervisors University of Groningen
First Supervisor: Prof. Dr. J.E. Wieringa (j.e.wieringa@rug.nl)
Second Supervisor: A. Ahmad (m.a.ahmad@rug.nl)


Abstract

The consequences of incorrect data in governmental databases are diverse, which leads us to investigate and predict data quality issues. This research expects data mutations, data size and the standard deviation of data size to be predictors of data quality issues, and it also expects spatial autocorrelation. To investigate this, we use a dataset of 98,600 addresses, aggregated at the postal code 6 level into 5,104 postal codes. We found spatial autocorrelation and therefore used different spatial models to control for it. All variables are statistically significant, with data mutations and the standard deviation of data size having positive effects and data size a negative effect. These effects are substantial, and future research should therefore include these variables when assessing data quality, while also trying out new variables. This thesis also shows that spatial autocorrelation is present at a very low aggregation level; future research should therefore always test for spatial autocorrelation when dealing with spatial data.

Table of Contents

Abstract
1. Introduction
2. Literature Review
2.1 Information Systems
2.2 Data Quality
2.2.1 The four-stage methodology
2.2.2 QUADRIS project
2.3 Spatial analysis
2.3.1 Spatial analysis in marketing
2.3.2 Spatial analysis in fraud
2.3.3 Social contagion
2.4 Possible predictors of data quality issues
2.4.1 Fraud
2.4.2 Socioeconomic status
2.4.3 Data volume and mutations
2.5 Control variables
2.6 Conceptual model
3. Data
3.1 Data collection
3.2 Data description
3.3 Outliers & missing values
3.4 Variable descriptions
4. Methodology
4.1 Model Criteria
4.2 Model
4.3 Spatial Models
4.3.1 Spatial autocorrelation
4.3.2 Spatial weights matrix
4.3.3 Spatial autoregressive models
4.3.4 Spatial error models
4.3.5 Spatial Durbin models
4.3.6 Spatial Durbin error models
4.3.7 Manski models
4.3.8 Spatial model interpretation
4.3.9 Hypothesis testing
5. Results
5.1 Spatial autocorrelation
5.2 Model validation
5.3 Model estimation and interpretation
5.3.1 Data mutations
5.3.2 Data size
5.3.3 Data size standard deviation
6. Discussion
6.1 Spatial autocorrelation
6.2 Variables
7. Conclusions
7.1 Conclusions
7.1.1 Contribution to literature
7.1.2 Managerial implications
7.2 Limitations
7.3 Future research suggestions
8. References


1. Introduction

Data quality issues are prevalent in all information systems and often only become apparent after the data has already been used. In a marketing context these issues can severely affect the effectiveness of organisations that use database marketing. Dealing with these quality problems can lower operational costs, improve customer satisfaction and decision making, and increase the effectiveness of customer relationship management (CRM) (Reid & Catterall, 2005). High quality data helps to create a competitive advantage and a more effective business, for instance by reducing the operational costs otherwise caused by redundant or incorrect data. The Data Warehousing Institute estimates that annually, in the United States alone, $611 billion is wasted on postage, printing and staff overhead costs (Eckerson, 2002). As discussed by Foss et al. (2002), poor data quality can also increase the risk of fraudulent behaviour, which leads us to the case discussed in this thesis: data quality issues in the Basisregistratie Personen (BRP) of the municipality of Groningen. A report by PBLQ (2014) on improving the address quality of the BRP states that agencies that use the BRP believe it is of insufficient quality. This could lead to agencies using their own address information instead of that of the BRP, which should be the authoritative source.

Recent research of the Central Bureau of Statistics shows that 400,000-500,000 address registrations do not match the people actually living at the address (Rijksoverheid: aanpak adresfraude loont, 2017). Incorrect address registrations have different kinds of consequences for a municipality. Citizens can abuse regulations by avoiding fines, evading taxes and unfairly acquiring grants or allowances. In more serious cases, address fraud is committed to participate in illegal activities such as drug production, human trafficking, prostitution or subversive crime. Nevertheless, the monetary fallout of these consequences often does not land on the municipality itself. A municipality is a non-profit organization whose main concern is the well-being of the people who live in it. The aforementioned criminal activities tied to address fraud can, however, affect the quality of life of the people living in the neighbourhoods where these crimes are committed. Another consequence is that citizens can become victims of address fraud. In some cases, people sublet houses they obtained through a social housing project, making those houses unavailable to the people who need them the most. Accurate address information can also help emergency services with accurate information on the number of residents in case of an emergency. Tackling these issues would benefit the quality of life of a municipality's citizens, improve the image of the municipality, and could greatly benefit the national government financially. Moreover, decreasing the opportunity to commit fraud will most likely lower the incentive to commit fraudulent behaviour.

As of now, address fraud is mainly detected using certain indicators that suggest a registration is probably incorrect. Debt collection agencies can, for example, indicate that a person no longer lives at the registered address. Furthermore, when changing your address, you receive a letter indicating how many people currently live at the address; if this is incorrect, you can contact the municipality. Lastly, property owners can ask the municipality who is registered at their address and indicate whether this information is correct. All these methods have one thing in common: they rely on other parties to report suspicious activity or an address mismatch, and they therefore do not lead to proactive behaviour by the municipality in tackling this problem.

In late 2014 the government launched a new project to improve address quality and prevent address fraud, called the Landelijke Aanpak Adreskwaliteit (LAA). The LAA creates risk profiles for each address, which are then investigated by the 162 municipalities participating in the project. This has led to more proactive behaviour in assessing address quality and preventing address fraud. However, these risk profiles are based on a national model that mostly does not take city characteristics into account.

This is where spatial econometrics could be of use. Spatial econometrics can take characteristics of spatial areas into account that would otherwise remain unobserved. For instance, citizens who live close to each other may share the same characteristics, and these unobserved characteristics of spatial areas may be correlated with those of neighbouring areas. By using spatial econometrics we can thus take information into account that is currently ignored in the LAA project. The main question of this research is: can spatial econometrics be used to predict data quality issues in the BRP?

Answering this question contributes to municipalities in a practical way by correcting for spatial characteristics and thereby producing more accurate predictions of data quality issues in the BRP. This research also contributes to marketing literature in multiple ways. Firstly, spatial econometrics has not been used before to address data quality issues and would be a powerful addition to the current literature. Secondly, correcting for spatial characteristics could be implemented in other fields of marketing as well and help create more accurate models. Lastly, looking at address specific information instead of person specific information could be a stepping stone for developing new marketing theories.


2. Literature Review

In this chapter, we first introduce the literature on information systems. After this introduction we discuss the definition of data quality as well as methods to ensure high quality data. This is followed by literature on spatial analysis, in which we make a link with fraud and marketing. The remainder of this chapter discusses predictors of data quality issues and control variables.

2.1 Information systems

Businesses can be seen as managers of a data manufacturing system. Within data manufacturing systems, three roles are defined by Strong, Lee and Wang (1997): data producers, data custodians, and data consumers. Data producers are sources that generate data. Data custodians use databases to store, process and secure the information of the data manufacturing system. Data consumers are the people that use the data manufacturing system. Strong, Lee and Wang (1997) discuss the issues that come with data quality problems. They state that mismatches in data can lead to believability problems: if data consumers believe that the data from the custodians is less accurate than their own data, the custodians' data gets a poor reputation. As the reputation declines, so does the added value, which in turn can lead to reduced use of the data.

2.2 Data Quality

Data quality can be defined in many ways; which quality dimensions are relevant is typically determined beforehand and differs per data consumer. Another definition is that of Loshin (2001): 'fitness for use', which is the definition we will use throughout this thesis. This definition is hampered by the issue that data custodians need to understand how the data is going to be used by the data consumers. If, however, data is fit for use, the data from the data manufacturing system can be used directly by the data consumers.

2.2.1 The four-stage methodology

One way to ensure high quality data is the four-stage methodology defined by Marsh (2005). The four-stage methodology uses a continuous, integrated process of data management and consists of four separate actions that a business should take to ensure high quality data. Firstly, a business needs to perform an audit to detect missing data, incorrect or duplicate records, or any inconsistencies in the data. Secondly, when the audit is complete, the business can start removing all errors and fixing the data quality issues. Thirdly, the business should implement systems that prevent errors from entering the system, for example by using real-time error prevention methods. In the fourth step it is important that all data entering the system is ready for future use, which makes it easier to deal with data issues in real time rather than ad hoc.

2.2.2 QUADRIS project

The QUADRIS project (Akoka et al., 2007) provides a framework of quality dimensions that help ensure high quality data. These dimensions are then measured by formal (quantitative) or informal (qualitative) measurements to detect data quality issues.

2.3 Spatial Analysis

The use of spatial analysis is almost unlimited, and it is applied in all fields that have spatial data. Spatial analysis has, for instance, been widely used in the field of medicine for reporting on epidemiology (Lessler et al., 2016) or for correcting for spatial autocorrelation in analyses of drug use and cancer mortality (Yang et al., 2018). In the field of marketing, spatial data has for instance been used to determine the location of new stores based on geographical information (Hunneman, 2011).

2.3.1 Spatial analysis in marketing

Analyses applying spatial statistics are still new in the field of marketing. Similar to time-series analysis, in spatial analysis nearby observations can predict behaviour at a certain place. The biggest differences between the two are that spatial data cannot be defined on a single dimension, that it runs in many directions, and that it has no constant units (Bronnenberg, 2005). Spatial data in marketing can include characteristics such as socio-demographic factors, lifestyles, attitudes or economic information that define a geographical location and would normally remain unobserved (Hunneman, 2011). Hunneman (2011) also states that in western countries customers in similar neighbourhoods can have the same characteristics and spending patterns. However, the origin of these similarities within spatial areas remains unclear and could come from either true or apparent spatial contagion (Anselin, 2010).

Both forms of contagion lead to local similarities. Because of these similarities, the error term, the dependent variable or the independent variables are likely to be spatially correlated. Not accounting for spatial correlation then leads to inefficient estimates of the parameters.

Bell and Song (2007) discuss neighbouring effects on consumer decisions: consumers in a zip code positively influence the trial decisions of other consumers in that zip code. This finding is contrary to one of the key assumptions in traditional marketing literature, namely that the behaviour of a consumer is independent of the behaviour of others (Bradlow et al., 2005). Even though spatial econometrics is still relatively new in the field of marketing, Bronnenberg (2005) discusses some other applications of spatial data and modelling.

Firstly, spatial data can be used to discover the size of stores' trade areas. After defining these areas, companies can find new profitable locations for opening a store. Secondly, marketing policies can be based on spatial data: companies could, for instance, focus more heavily on marketing campaigns in highly competitive trade areas, or raise their prices where they have less competition. Thirdly, spatial models can be used for a better understanding of product diffusion, allowing companies to better target certain customers or markets for seeding products. Lastly, spatial modelling can be used to predict sales. The main caveat is that the success of such predictions depends on the chosen distance metric.

2.3.2 Spatial analysis in fraud

Most research treats fraud as a combination of individual decisions and characteristics rather than as the result of many different influences. However, there has been some research using exploratory spatial data analysis to map different types of crime across space (Cracolici & Uberti, 2009). One type of crime they analyse is fraud. Their research indicates a spatial distribution of fraud, showing a pattern that suggests fraud may be spatially related at the national level in Italy. Another instance where spatial analysis was used is the research of Kim (2007) on monitoring fraud in a public delivery program, which highlights the importance of identifying appropriate territorial levels because of heterogeneity between locations.

2.3.3 Social contagion

Social contagion describes how the behaviour of individuals is influenced by the behaviour of the people around them. Duflo and Saez (2003), for example, show that social interactions influence retirement plan decisions, and Brown, Ivković, Smith and Weisbenner (2008) find causal community effects on stock market participation. Closer to our topic, Dimmock, Gerken and Graham (2018) find that misconduct by financial advisors is contagious among coworkers.

To conclude these three subsections, we can argue that spatial analysis makes it possible to better understand individual behaviour by assuming that the behaviour of individuals is spatially correlated. Therefore, in this thesis we hypothesise that:

H1: Data quality issues are positively spatially correlated.

2.4 Possible predictors of data quality issues

In this subsection we discuss predictors of data quality issues. We first discuss literature on fraud, since fraudulent behaviour can lead to false data entries. Secondly, we discuss how socioeconomic status could affect data quality. Finally, we discuss how data volume, data mutations and the standard deviation of data size could influence data quality.

2.4.1 Fraud

The Association of Certified Fraud Examiners reported in 2006 that organizations lose five percent of their yearly revenues to fraud and that almost every organization is at risk of becoming a victim of fraud (Murdock, 2008). Murdock (2008) describes three dimensions of fraud: the needs, opportunities and justifications that lead individuals to commit fraudulent acts. Huang, Lin, Chiu and Yen (2017) use such fraud triangle risk factors to detect fraud.

2.4.2 Socioeconomic status

The research on socioeconomic status (SES) is very broad and spans many fields. In the field of psychology, it has been found that a lower socioeconomic status is correlated with higher antisocial behaviour (Piotrowska et al., 2014). The research linking socioeconomic status to criminal or fraudulent behaviour is controversial: some research indicates a positive effect, some a negative effect, and other research a conditional effect (Tittle & Meier, 1990). It seems that SES can sometimes predict delinquency, but most of the time it does not. Since the research in this thesis is exploratory, we still try to see whether it predicts something in our case. We use the following variables as indicators of socioeconomic status: the floor area of the address in square metres, whether the house is bought or rented, the value of the house, and whether it is a social rented house. These variables are used to test the following hypothesis:

H2: Higher socioeconomic status will positively influence data quality.

2.4.3 Data volume and mutations

Kozak, Krzanowski, Cichocka and Hartley (2015) describe how data input errors affect subsequent statistical inference. Every data mutation is a new opportunity for such an input error to enter the data, so we expect more mutations to increase the risk of data quality issues:

H3a: A higher amount of data mutations negatively influences data quality.

By the same reasoning, we also expect data to be more prone to errors when there is a higher volume of data, not only when there are more data mutations. Therefore, we hypothesize the following:

H3b: Higher amounts of data negatively influence data quality.

Since this research is also exploratory in nature, we further hypothesize that high fluctuations in data size have a positive effect on data quality issues:

H3c: A higher standard deviation in data size negatively influences data quality.

2.5 Control variables

Besides the previously mentioned constructs that could affect data quality, there are also some variables we may have to control for in our research. These control variables are based on the construct of fraud defined in section 2.4. First, a change in need over a time period can be an incentive to falsify data. Second, a change in data control measures over a time period indicates a difference in the opportunity to falsify data. Controlling for both changes removes their effect from our final model.

2.6 Conceptual model

Figure 1: Conceptual model

3. Data

This chapter first describes how the dependent variable is collected and which information systems are used to extract the data for the independent variables. In section 3.2 we give some descriptive statistics of our data set. The following section discusses the missing data and outliers. The final section describes the variables and how they are computed.

3.1 Data collection

To detect data quality issues of addresses in the BRP we extract data from two different databases: the Basisregistratie Personen (BRP) and the Basisregistratie Adressen en Gebouwen (BAG). The BRP contains the personal data of all people living in the municipality of Groningen. The BAG contains all information on addresses and buildings. After extracting information from both databases we end up with a dataset of 98,600 addresses. Incidents of data quality issues are defined by the department of Burgerzaken of the municipality of Groningen: Burgerzaken verifies whether a person is living at the address specified in the BRP after suspicions arise.

3.2 Data description

We aggregate our data at the postal code 6 level (postal codes with 4 digits and 2 letters) and end up with 5,104 postal codes. For the distance matrix we define neighbouring postal code areas using the spatial weights matrix described in section 4.3.2.

We then plot the spatial polygon data frame with the incidents (figure 2) to give a more visual representation. Note, however, that this plot does not account for population. Of the 5,104 postal codes, the average proportion of data quality incidents is 0.1 with a standard deviation of 0.09.

Figure 2: Density plot of data quality incidents

3.3 Outliers & Missing values

The data we use for the analysis contains 127 postal codes with outliers in the dependent variable (figure 3). Postal codes are treated as outliers when the dependent variable falls outside the whiskers of the boxplot. However, since our data mostly consists of small numbers, removing these outliers would lead to a loss of information. Removing the outliers would also cause problems with the distance matrix, since many postal codes would lose their neighbours. Therefore, we decide not to remove the outliers from the data.

Figure 3: Missings and Outliers

3.4 Variable descriptions

In order to test our hypotheses, we make use of the variables shown in table 1. Some variables are computed from multiple underlying variables. The variable Data quality incidents is computed by assigning the number of data quality incidents to each spatial area and dividing it by the average number of people that have lived at an address. Taking the average number of people that have lived at an address is necessary since the incidents happen over time, while the number of people living at an address is a snapshot of the data. Data mutations is computed by taking the number of data mutations and dividing it by the number of months an address has been in use, which standardizes the data mutations.
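The thesis does not include its data preparation code; as an illustration, the two computations above could look as follows in pandas. This is a minimal sketch with hypothetical file and column names (addresses.csv, n_mutations, months_in_use, and so on), not the actual BRP/BAG extraction.

```python
import pandas as pd

# Hypothetical address-level export of the BRP/BAG data.
addresses = pd.read_csv("addresses.csv")

# Standardize mutations by the age of the data record (in months).
addresses["mutations_per_month"] = addresses["n_mutations"] / addresses["months_in_use"]

# Aggregate addresses to postal code 6 areas.
postcodes = addresses.groupby("postcode6").agg(
    n_incidents=("incident", "sum"),             # detected data quality incidents
    avg_residents=("n_residents_ever", "mean"),  # average number of people that lived at an address
    sd_residents=("n_residents_ever", "std"),
    data_mutations=("mutations_per_month", "mean"),
)

# Proportion of incidents, normalized by the average number of residents.
postcodes["prop_incidents"] = postcodes["n_incidents"] / postcodes["avg_residents"]
```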


It was also not possible to retrieve data that could be used to estimate socioeconomic status, so this variable is dropped from our model.

Variable                       Variable name           Description                                                                    Type
Data quality incidents         propAanpPostcode        Proportion of data quality incidents in a postal code 6 area                  Continuous DV
Data mutations                 gemPersVoorbijPostcode  Data mutations divided by the age of the data record                           Continuous IV
Data size                      nPersAvgPostcode        Average number of people that have lived in a postal code                     Continuous IV
Data size standard deviation   stddevPersPostcode      Standard deviation of the number of people that have lived in a postal code   Continuous IV

Table 1: Variable descriptions


4. Methodology

In this chapter we first discuss the criteria that define a good model. In the following section we formulate our model. In the last section we discuss the statistical techniques used to test our hypotheses.

4.1 Model Criteria

A model needs to fulfil certain criteria in order to be considered a good model: it should be simple, complete, adaptive and robust (Leeflang et al., 2015). The first criterion, simplicity, can be achieved by only including relevant variables in the model. A model is complete when it accounts for all important variables and is a good representation of reality. Adaptivity describes whether a model is able to adapt to environmental changes over time. Lastly, a robust model is one that is constrained in such a way that it produces meaningful answers.

4.2 Model

In order to test the hypotheses formulated in chapter two and estimate the effects of our independent variables on our dependent variable, we make use of the following model:

$$y = \beta_0 + \beta_1\,\text{Data mutations} + \beta_2\,\text{Data size} + \beta_3\,\text{Data size standard deviation} + \varepsilon$$
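As a concrete illustration of this baseline specification, the following minimal sketch fits it by OLS on simulated data. The statsmodels package and all data here are assumptions made for illustration, not the thesis code or dataset.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200  # illustrative sample, not the 5,104 postal codes

# Simulated stand-ins for the three regressors.
mutations, size, size_sd = rng.normal(size=(3, n))
y = 3.0 + 0.9 * mutations - 1.7 * size + 1.0 * size_sd + rng.normal(scale=0.5, size=n)

# y = b0 + b1*mutations + b2*size + b3*size_sd + e
X = sm.add_constant(np.column_stack([mutations, size, size_sd]))
ols = sm.OLS(y, X).fit()
print(ols.params)
```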

4.3 Spatial models

An ordinary least squares (OLS) model rests on several assumptions about the residuals: they should have a constant variance, they should follow a normal distribution and, lastly, they should not be autocorrelated. According to Tobler's (1970) first law of geography, "Everything is related to everything else, but near things are more related than distant things", the third assumption is often violated with spatial data, and spatial autocorrelation exists.

4.3.1 Spatial autocorrelation

Since we are using spatial data, the error term in our model explaining data quality is likely to be spatially autocorrelated. Failing to correct for spatial autocorrelation when it is present results in an inefficient model, leading to suboptimal predictions, misleading significance tests and biased parameter estimates (Hunneman, 2011; Shrestha, 2006). We use Moran's index to measure spatial autocorrelation; it ranges between -1 and +1 and indicates either a clustering of values that are alike (when positive and significant) or alternating patterns, that is, spatial heterogeneity (when negative and significant). Moran's index is computed as:

$$I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}(y_i - \bar{y})(y_j - \bar{y})}{\sum_{i}(y_i - \bar{y})^2}$$

Where
I = Moran's index
n = the number of spatial objects
y = the proportion of detected data quality issues in an area
w = the spatial weights matrix

To test the significance of Moran's index we use a Monte Carlo simulation. The simulation randomly assigns the observed values to the polygons and then calculates a new index. This process is repeated many times in order to obtain a distribution, and our observed Moran's I is compared to this simulated distribution. From this comparison we can see whether the observed data quality issues in the spatial data can be considered random or not.
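A minimal sketch of this procedure in plain NumPy is shown below; the weights matrix and values are random illustrative data, not the thesis dataset, and the implementation follows the formula and permutation scheme described above.

```python
import numpy as np

def morans_i(y, w):
    """Moran's I for values y and a spatial weights matrix w (formula above)."""
    z = y - y.mean()
    return (len(y) / w.sum()) * (z @ w @ z) / (z @ z)

def moran_mc(y, w, n_sims=999, seed=0):
    """Monte Carlo test: randomly reassign values to areas and recompute I."""
    rng = np.random.default_rng(seed)
    observed = morans_i(y, w)
    sims = np.array([morans_i(rng.permutation(y), w) for _ in range(n_sims)])
    # Pseudo p-value: share of simulated indices at least as large as the observed one.
    p = (np.sum(sims >= observed) + 1) / (n_sims + 1)
    return observed, p

# Illustrative use with random data.
rng = np.random.default_rng(1)
w = rng.integers(0, 2, size=(50, 50)).astype(float)
np.fill_diagonal(w, 0)  # an area is not its own neighbour
print(moran_mc(rng.normal(size=50), w))
```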

4.3.2 Spatial weights matrix

In order to use spatial models, we first need to define the spatial weights matrix. A spatial weights matrix is a mathematical representation of neighbouring areas in the form of a square matrix: an N by N non-negative matrix W with elements w_ij (figure 4), in which w_ij is nonzero for neighbours, w_ij is zero for non-neighbours, and w_ii is zero since an area cannot be its own neighbour.

Figure 4: Spatial Weights Matrix

In this matrix each neighbour is defined using queen contiguity weights, meaning that neighbours are areas that share a border or a corner. The weights in the matrix need to be standardized, which can be achieved by row standardization: if an area has n neighbours, each of its weights becomes 1/n. After standardization, a spatially lagged variable in the model can be interpreted as the mean of the neighbours' values. To further explain this, we interpret the Manski model (Elhorst, 2010) textually below as well as in section 4.3.7:


In this model, y is a function of rho times the average y of its neighbours, plus X times beta, plus theta times the average values of the explanatory variables of its neighbours, plus a spatially lagged error term. The spatially lagged error term is defined as lambda times the average residual of its neighbours, where lambda is the part of the residuals explained by the neighbours' residuals.
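To make the construction from section 4.3.2 concrete, the following minimal sketch builds queen contiguity weights from postal code polygons and row-standardizes them. The libpysal and geopandas packages and the shapefile name are assumptions for illustration; the thesis does not name its software.

```python
import geopandas as gpd
from libpysal.weights import Queen

# Hypothetical shapefile with one polygon per postal code 6 area.
postcodes = gpd.read_file("postcode6_polygons.shp")

# Queen contiguity: areas sharing a border or a corner are neighbours.
w = Queen.from_dataframe(postcodes)

# Row standardization: each of an area's n neighbours gets weight 1/n,
# so a spatial lag Wy is the mean of the neighbours' values.
w.transform = "R"
```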

LeSage and Pace (2014) cover spatial weights matrices and their effect on spatial models. In summary, they conclude that slope coefficients are sensitive to the choice of W, but that the marginal effects are not affected that much. It should be noted that this conclusion is only valid for spatial autoregressive models, because spatial Durbin models and spatial error models were not covered.

4.3.3 Spatial autoregressive models

As stated in section 2.3.1, spatial autoregressive (SAR) models are somewhat similar to time-series autoregressive models. Time-series models are useful for modelling processes that vary over time; they regress values of a time series on previous values of the same series. For SAR models we essentially replace time with space: SAR models regress values of spatial areas on neighbouring areas based on a distance metric. We use SAR models when there is spatial autocorrelation, which can be checked with the aforementioned Moran's I. In the model described below we use a spatially lagged dependent variable to control for spatial correlation; it can be formulated as:

$$y = \rho W y + X\beta + \varepsilon$$

Where
y = the proportion of detected data quality issues in an area
ρ = the spatial autocorrelation coefficient
W = the spatial weights matrix
X = the matrix of independent variables
β = the unknown parameters
ε = the error term

As can be seen from the formula, spatial autoregressive models are similar to regression models. The main difference is that in SAR models we control for spatial correlation by including a first order spatial lag of the dependent variable.
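As an illustration of estimating such a model, the sketch below fits a SAR specification by maximum likelihood with PySAL's spreg package on simulated lattice data. The package choice and all data are assumptions; the thesis does not state which software it used.

```python
import numpy as np
from libpysal.weights import lat2W
from spreg import ML_Lag

# Illustrative 10x10 lattice instead of the postal code map.
w = lat2W(10, 10)
w.transform = "R"  # row-standardized weights

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))   # stand-ins for the three regressors
y = rng.normal(size=(100, 1))   # stand-in for the incident proportion

# y = rho*W*y + X*beta + eps, estimated by maximum likelihood.
sar = ML_Lag(y, X, w=w, name_y="prop_incidents",
             name_x=["mutations", "size", "size_sd"])
print(sar.rho)  # estimated spatial autocorrelation coefficient
```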

4.3.4 Spatial error models

A spatial error model (SEM) incorporates a spatial pattern in the error term, meaning that it accommodates spill-over of residuals to surrounding areas. This effect can occur when a spatially related explanatory variable is missing from the model and ends up in the residuals. A spatial error model handles this by making the error a function of not only an area's own stochastic error term but also that of its neighbours. It can be formulated as follows:

$$y = X\beta + u, \qquad u = \lambda W u + \varepsilon$$

Where
y = the proportion of detected data quality issues in an area
X = the matrix of independent variables
β = the unknown parameters
W = the spatial weights matrix
λ = the coefficient of the spatially correlated errors
ε = the error term

4.3.5 Spatial Durbin models

The spatial Durbin model (SDM) is an extension of the SAR model. The SAR model only includes a spatially lagged dependent variable, while the Durbin model also adds spatially lagged explanatory variables. The spatially lagged explanatory variables represent local spill-overs, meaning that only an area's neighbours affect the outcome, whereas the spatially lagged dependent variable in SAR models has a global effect that reaches all areas. The spatial Durbin model is therefore a mix of global and local effects and can be formulated as follows:

$$y = \rho W y + X\beta + W X \theta + \varepsilon$$

Where
y = the proportion of detected data quality issues in an area
ρ = the spatial autocorrelation coefficient
W = the spatial weights matrix
X = the matrix of independent variables
β = the unknown parameters
θ = the spatial spill-over of the neighbours' explanatory variables
ε = the error term

4.3.6 Spatial Durbin error models

The spatial Durbin error model (SDEM) is created by extending the error model with spatially lagged explanatory variables. The SDEM does not include a spatially lagged dependent variable and can be formulated as follows:

$$y = X\beta + W X \theta + u, \qquad u = \lambda W u + \varepsilon$$

Where
y = the proportion of detected data quality issues in an area
X = the matrix of independent variables
β = the unknown parameters
W = the spatial weights matrix
θ = the spatial spill-over of the neighbours' explanatory variables
λ = the coefficient of the spatially correlated errors
ε = the error term

4.3.7 Manski models

The final model that we use in our analysis is the Manski model. This model is a combination of all the aforementioned spatial models and therefore includes all spatial effects. The Manski model can be formulated as follows:

$$y = \rho W y + X\beta + W X \theta + u, \qquad u = \lambda W u + \varepsilon$$

Where
y = the proportion of detected data quality issues in an area
ρ = the spatial autocorrelation coefficient
W = the spatial weights matrix
X = the matrix of independent variables
β = the unknown parameters
θ = the spatial spill-over of the neighbours' explanatory variables
λ = the coefficient of the spatially correlated errors
ε = the error term
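To see how these pieces fit together, the following minimal sketch simulates data from the Manski specification above. Solving for y gives y = (I - rho*W)^-1 (X*beta + W*X*theta + (I - lambda*W)^-1 eps); all parameter values and the random weights matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Random adjacency as a stand-in for queen contiguity.
W = rng.integers(0, 2, size=(n, n)).astype(float)
np.fill_diagonal(W, 0)
W = W / W.sum(axis=1, keepdims=True)   # row standardization

rho, lam, beta, theta = 0.5, 0.3, 1.0, 0.4
x = rng.normal(size=n)
eps = rng.normal(size=n)
I = np.eye(n)

# u = lambda*W*u + eps  ->  u = (I - lambda*W)^-1 eps
u = np.linalg.solve(I - lam * W, eps)
# y = rho*W*y + x*beta + W*x*theta + u  ->  y = (I - rho*W)^-1 (x*beta + W*x*theta + u)
y = np.linalg.solve(I - rho * W, x * beta + W @ x * theta + u)
```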

4.3.8 Spatial model interpretation

We log-transform the dependent and independent variables, adding a small constant in order to account for taking the log of 0. This allows us to interpret the effects as elasticities of the impact of the variables (Kirby & LeSage, 2009). Kirby and LeSage also explain that this transformation allows us to distinguish which variables are most important in explaining data quality issues.

When the independent variable in an area changes, it affects the dependent variable in that area (the direct effect) as well as the dependent variable in other areas (the indirect effect). The total impact, the combination of the direct and indirect effects, consists of all marginal effects obtained by taking the derivatives of y_i with respect to x_j for any i and j (LeSage & Pace, 2009). Kirby and LeSage (2009) consider the spatial Durbin model (section 4.3.5), which can be rewritten as:

$$(I_n - \rho W)y = x\beta + W x \theta + \varepsilon$$
$$y = S(W)x + V(W)\varepsilon$$
$$S(W) = V(W)(I_n \beta + W\theta)$$
$$V(W) = (I_n - \rho W)^{-1} = I_n + \rho W + \rho^2 W^2 + \rho^3 W^3 + \dots$$

From which it follows that:

$$\frac{\partial y_i}{\partial x_j} = S(W)_{ij}$$

Where $S(W)_{ij}$ is the (i, j)-th element of the matrix $S(W)$.

Reporting the impact of every spatial unit on every other spatial unit would produce an unmanageable number of results (Elhorst, 2014). To account for these many impacts, LeSage and Pace (2009) suggest a summary measure: sum the total impacts over all rows or columns of the matrix S(W) and take the average value over all regions. The average of the column or row sums provides the average total impact, while the average of the diagonal of the matrix provides the average direct impact. The average indirect impact can then be calculated by subtracting the average direct impact from the average total impact.
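A minimal sketch of these summary measures for a spatial Durbin model with one regressor is shown below; it directly implements the S(W) expression from the previous subsection and is an illustration rather than the thesis code.

```python
import numpy as np

def sdm_impacts(W, rho, beta, theta):
    """Average direct, indirect and total impacts (LeSage & Pace, 2009)."""
    n = W.shape[0]
    # S(W) = (I - rho*W)^-1 (I*beta + W*theta)
    S = np.linalg.solve(np.eye(n) - rho * W, np.eye(n) * beta + W * theta)
    direct = np.trace(S) / n   # average diagonal element
    total = S.sum() / n        # average row (or column) sum
    return direct, total - direct, total
```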

4.3.9 Hypothesis testing

The hypotheses formulated in chapter two are tested as follows: H1 is tested using the significance of Moran's index, while H2 and H3a-c are tested using the significance and signs of the estimated direct, indirect and total effects of the final spatial model.

5. Results

The results of our analysis are presented in this chapter. First, we test for spatial autocorrelation and resolve this issue. Thereafter, we discuss the validity and compare the fit of all models. Lastly, the final model is estimated and interpreted.

5.1 Spatial autocorrelation

In order to test for spatial autocorrelation, we first estimate a simple OLS regression of our model. We then test for spatial autocorrelation in the residuals using Moran's index, which shows significant positive spatial autocorrelation (table 2). This outcome shows that the residuals are spatially clustered, indicating spatial autocorrelation for which we need to correct.

   Model          Statistic   P-value
1  OLS              0.143      0.010
2  Lag             -0.025      0.990
3  Error           -0.015      0.950
4  Durbin          -0.017      0.950
5  Error Durbin    -0.009      0.880

Table 2: Moran's index of the residuals

Rho, the spatial spill-over coefficient in the spatial Durbin model, is 0.564 and highly significant, as can be seen in the appendix (table 3). A significant rho means that the inclusion of the spatially lagged dependent variable significantly improves our model.

5.2 Model Validation

Table 2 also reports Moran's index of the residuals for each spatial model. As can be seen from the table, the spatial autocorrelation appears to be resolved in all spatial models.

To assess model fit we first look at the estimated parameters. All parameters are significant in every model, so we cannot draw conclusions from this alone. We therefore compare the AIC scores of the models and find that the spatial Durbin model fits best, with an AIC of 14,105.8 (table 3). Lastly, we use a likelihood ratio test, sketched after table 4, to check whether the Durbin model predicts the proportion of detected data quality issues significantly better than a null model. The outcome is highly significant (table 4), so our independent variables do add explanatory power. To conclude, the spatial Durbin model fits our data best and is therefore used for estimation and interpretation.

             #DF   Log likelihood   ΔDF   Chisq    P-value
Null model    1      -12,100.4
SDM           9       -7,043.9       8    10,113   <0.001

Table 4: Log likelihood ratio test
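The likelihood ratio statistic in table 4 follows directly from the two log likelihoods; a minimal sketch, assuming SciPy for the chi-squared tail probability:

```python
from scipy.stats import chi2

ll_null, ll_sdm, delta_df = -12_100.4, -7_043.9, 8

lr = 2 * (ll_sdm - ll_null)         # 2 * 5,056.5 = 10,113
p_value = chi2.sf(lr, df=delta_df)  # far below 0.001
print(lr, p_value)
```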

5.3 Model Estimation and interpretation

In this subsection we interpret the outcomes of our spatial Durbin model based on the theory specified in section 4.3.8 and discuss whether the hypotheses are confirmed. The variables used for estimation are Data mutations, Data size and Data size standard deviation, their spatially lagged variants, and the spatially lagged dependent variable. The direct, indirect and total effects are shown in tables 5 to 7.

Variable                       Direct effect    Std. deviation   z-statistic   z-probability
Data mutations                     0.924            0.035            27.4         <0.001
Data size                         -1.753            0.055           -28.5         <0.001
Data size standard deviation       1.035            0.044            22.1         <0.001

Table 5: Direct effects

Variable                       Indirect effect  Std. deviation   z-statistic   z-probability
Data mutations                     0.367            0.054             6.8         <0.001
Data size                         -1.281            0.105           -11.8         <0.001
Data size standard deviation       0.989            0.105             9.5         <0.001

Table 6: Indirect effects

Variable                       Total effect     Std. deviation   z-statistic   z-probability
Data mutations                     1.291            0.052            28.7         <0.001
Data size                         -3.034            0.112           -26.7         <0.001
Data size standard deviation       2.025            0.114            17.7         <0.001

Table 7: Total effects

5.3.1 Data mutations

Our independent variable data mutations has a positive direct (0.924) and indirect (0.367) effect, both significant. Combined, these result in a significant total effect of 1.291, which statistically supports hypothesis 3a. We can interpret the effect as follows: when the data mutations in an area increase by 1%, the proportion of data quality issues in that area increases by 0.924% and the proportion in other areas by 0.367%. In total, a 1% increase in data mutations leads to a 1.291% increase in the total proportion of data quality issues.

5.3.2 Data size

Our independent variable data size has a negative direct (-1.753) and indirect (-1.281) effect, both significant. Combined, these result in a significant total effect of -3.034, and we therefore reject hypothesis 3b, since the effect is not in the direction we expected. We can interpret the effect of data size as follows: when the data size in an area increases by 1%, the proportion of data quality issues in that area decreases by 1.753% and the proportion in other areas by 1.281%. To conclude, a 1% increase in data size leads to a 3.034% decrease in the total proportion of data quality issues.

5.3.3 Data size standard deviation

The variable representing the standard deviation of data size has a positive direct (1.035) and indirect (0.989) effect, both significant. Combined, these result in a significant total effect of 2.025, which statistically supports hypothesis 3c. We can interpret the effect as follows: when the standard deviation of data size in an area increases by 1%, the proportion of data quality issues in that area increases by 1.035% and the proportion in other areas by 0.989%. To conclude, a 1% increase in the standard deviation of data size leads to a 2.025% increase in the total proportion of data quality issues.


6. Discussion

This chapter discusses the results found in the previous chapter. In each section we link the results to the theory of chapter two and discuss whether the theory is confirmed. Section 6.1 discusses the spatial autocorrelation and section 6.2 discusses the variables used in our model.

6.1 Spatial autocorrelation

Hunneman (2011) stated that in western countries customers in similar neighbourhoods can have the same characteristics and spending patterns. However, neighbourhoods often contain many different customers, which makes prediction difficult. In our analysis we found that areas are spatially correlated at the postal code 6 level, which suggests that even at small aggregation levels unobserved characteristics can be spatially clustered.

In chapter two we discussed the findings of Bell and Song (2007), which were contrary to the traditional marketing assumption that the behaviour of a consumer is independent of the behaviour of others (Bradlow et al., 2005). Our analysis focussed more on address specific characteristics. However, these characteristics are caused by citizen behaviour, and our findings therefore support Bell and Song in that consumer decisions can be spatially correlated.

6.2 Variables

Research on the exact variables that influence data quality is sparse. Most research focuses on defining data quality, such as the definition of Loshin (2001): 'fitness for use'. Fitness for use is an abstract definition and differs between data manufacturing systems.

The literature we used to define our hypotheses is that of Kozak and colleagues (2015), which describes data input errors. Their findings are based on data entry faults; in our data, however, we do not know whether the data entry faults are intentional or unintentional. We expected all variables to have a negative effect on data quality. Our results matched these expectations for data mutations and data size standard deviation, but not for data size. More mutations should increase the risk of data entry faults as described by Kozak and colleagues (2015), and we expected a larger data size to show a similar effect, which it did not. Further research should confirm whether this theory is indeed incorrect. As for the last variable, high fluctuations in data size could be an indicator of data entry faults, since the number of people living at an address is likely to be stable over time with low fluctuations.

To conclude, we stated earlier that research on the exact variables that predict data quality is sparse because quality itself is abstract. For this reason, we defined our variables in the most abstract way possible, so that they remain applicable when new theory is developed in the future.


7. Conclusions

This chapter first summarizes the results found in the previous chapters. Secondly, we discuss the theoretical and managerial implications of our research. Lastly, the limitations of this research and suggestions for future research are discussed.

7.1 Conclusions

In this research we expected data mutations, data size and the standard deviation of data size to positively affect the proportion of data quality issues. We also expected some form of spatial correlation.

After estimating a normal OLS model we indeed found spatial autocorrelation, and we therefore used multiple spatial models. From these models we can only confirm our hypotheses for data mutations and the standard deviation of data size; data size appears to negatively affect data quality issues. This suggests that more data mutations and larger fluctuations in data size lead to a higher proportion of data quality issues, while a larger data size leads to a lower proportion of data quality issues.


7.1.1 Contribution to literature

Our research contributes to the literature in multiple ways. The most important finding is that of spatial autocorrelation. As stated before, spatial econometrics is still new in the field of marketing. By finding spatial autocorrelation at an aggregation level as low as postal code 6, we highlight the importance of spatial econometrics. Using spatial econometrics can help researchers with cross-sectional data containing autocorrelation to obtain more efficient parameter estimates. Our findings concerning data mutations, data size and the standard deviation of data size show large, significant effects and can therefore benefit the existing literature on finding data quality issues. The variables used in this thesis should therefore always be included when assessing data quality, for example in the first step of the four-stage methodology of Marsh (2005).

7.1.2 Managerial implications

According to Courtheoux (2003), quality problems should be caught and fixed as soon as they arise, and quality is a shared responsibility. This implies that the municipality should try to detect data quality issues as soon as possible and should work with other governmental agencies to assess the quality. Quality assessment is not a one-time process and should be done continuously. Entirely correct data will likely never be achieved; the municipality should therefore define a goal beforehand and try to come as close to it as possible.

Practically, the municipality can rank postal code areas by our model's predictions: areas where the data mutations and the fluctuations in the number of people living at a postal code are high, and where the average number of people at a postal code is low, rank highest. When investigating the quality of the BRP, the municipality can then proactively investigate the highest ranking addresses first.

7.2 Limitations

This research has some limitations, and the results should therefore be interpreted carefully. The first limitation concerns the normality assumption: the residuals of our analysis do not exactly follow a normal distribution (figure 5).

Secondly, the dependent variable follows a Poisson distribution. We also analysed the model with a generalized linear model that can handle Poisson distributions; however, such models cannot correct for spatial autocorrelation. Hence we transformed our dependent variable with a log-transformation, adding a small constant to account for the zeros. This transformation makes the dependent variable more normal, but all zeros still fall outside the normal distribution.

Thirdly, due to a lack of computational power we could not estimate the Manski model. The Manski model includes the spatially lagged dependent variable as well as the lagged independent variables and the lagged error term, and could therefore have resulted in better parameter estimates. However, the Manski model could also overfit the data: because of their overlap, the multiple spatial lags can inflate each other, similar to what happens under multicollinearity (Leeflang et al., 2017).

Fourthly, our results are given at an aggregated level. Analysing individual addresses instead of postal code areas would give better insight into address specific proportions of data quality issues and would therefore pinpoint the high risk locations.

The final and most important limitation is that our model is biased because our data is biased. For our analysis we used detected data quality issues; however, many data quality issues remain undetected and are excluded from the analysis.

7.3 Future research suggestions

Future research should take our limitations and suggestions into account. The main focus of this research was on address specific details; future research could also include additional variables, such as personal details and more address specific variables, to gain explanatory power.

We have not found any spatial models that can deal with a dependent variable that follows a Poisson distribution; future research could try to find a way to account for spatial autocorrelation in count data.

A final suggestion is to experiment with different spatial weights matrices. It could be interesting to use weights based on other notions of proximity between areas instead of purely geographical weights.

8. References

Rijksoverheid. (2017). Aanpak adresfraude loont. Retrieved 4 August 2018, from https://www.rijksoverheid.nl/actueel/nieuws/2017/11/08/aanpak-adresfraude-loont

Akoka, J., Berti-Equille, L., Boucelma, O., Bouzeghoub, M., Comyn-Wattiau, I., Cosquer, M., et al. (2007). A framework for quality evaluation in data integration systems. 9th International Conference on Enterprise Information Systems (ICEIS), pp. 10.

Anselin, L. (2010). Thirty years of spatial econometrics. Papers in Regional Science, 89(1), 3-25.

Bell, D. R., & Song, S. (2007). Neighborhood effects and trial on the internet: Evidence from online grocery retailing. Quantitative Marketing and Economics, 5(4), 361-400.

Bradlow, E. T., Bronnenberg, B., Russell, G. J., Arora, N., Bell, D. R., Duvvuri, S. D., et al. (2005). Spatial models in marketing. Marketing Letters, 16(3-4), 267-278.

Bronnenberg, B. J. (2005). Spatial models in marketing research and practice. Applied Stochastic Models in Business and Industry, 21(4-5), 335-343.

Brown, J. R., Ivković, Z., Smith, P. A., & Weisbenner, S. (2008). Neighbors matter: Causal community effects and stock market participation. The Journal of Finance, 63(3), 1509-1531.

Courtheoux, R. J. (2003). Marketing data analysis and data quality management. Journal of Targeting, Measurement and Analysis for Marketing.

Cracolici, M. F., & Uberti, T. E. (2009). Geographical distribution of crime in Italian provinces: A spatial econometric analysis. Jahrbuch für Regionalwissenschaft/Review of Regional Research, 29(1), 1-28.

Crosby, P. B. (1980). Quality is free: The art of making quality certain. Signet.

Devillers, R., Stein, A., Bédard, Y., Chrisman, N., Fisher, P., & Shi, W. (2010). Thirty years of research on spatial data quality: Achievements, failures, and opportunities. Transactions in GIS, 14(4), 387-400.

Dimmock, S. G., Gerken, W. C., & Graham, N. P. (2018). Is fraud contagious? Coworker influence on misconduct by financial advisors. The Journal of Finance, 73(3), 1417-1450.

Duflo, E., & Saez, E. (2003). The role of information and social interactions in retirement plan decisions: Evidence from a randomized experiment. The Quarterly Journal of Economics, 118(3), 815-842.

Eckerson, W. (2002). Data warehousing special report: Data quality and the bottom line. Applications Development Trends, 1(1), 1-9.

Elhorst, J. P. (2010). Applied spatial econometrics: Raising the bar. Spatial Economic Analysis, 5(1), 9-28.

Elhorst, J. P. (2014). Linear spatial dependence models for cross-section data. In Spatial econometrics: From cross-sectional data to spatial panels. Springer.

Foss, B., Henderson, I., Johnson, P., Murray, D., & Stone, M. (2002). Managing the quality and completeness of customer data. Journal of Database Marketing, 10(2), 139.

Huang, S. Y., Lin, C., Chiu, A., & Yen, D. C. (2017). Fraud detection using fraud triangle risk factors. Information Systems Frontiers, 19(6), 1343-1356.

Hunneman, A. (2011). Advances in methods to support store location and design decisions. Groningen: University of Groningen, SOM Research School.

Kim, Y. (2007). Using spatial analysis for monitoring fraud in a public delivery program. Social Science Computer Review, 25(3), 287-301.

Kirby, D. K., & LeSage, J. P. (2009). Changes in commuting to work times over the 1990 to 2000 period. Regional Science and Urban Economics, 39(4), 460-471.

Kozak, M., Krzanowski, W., Cichocka, I., & Hartley, J. (2015). The effects of data input errors on subsequent statistical inference. Journal of Applied Statistics, 42(9), 2030-2037.

Leeflang, P. S., Wieringa, J. E., Bijmolt, T. H., & Pauwels, K. H. (2015). Building models for markets. In Modeling markets (pp. 1-24). Springer.

Leeflang, P. S., Wieringa, J. E., Bijmolt, T. H., & Pauwels, K. H. (2017). Advanced methods for modeling markets. Springer.

LeSage, J. P., & Pace, R. K. (2014). The biggest myth in spatial econometrics. Econometrics, 2(4), 217-249.

LeSage, J., & Pace, R. K. (2009). Introduction to spatial econometrics. Chapman and Hall/CRC.

Lessler, J., Salje, H., Grabowski, M. K., & Cummings, D. A. T. (2016). Measuring spatial dependence for infectious disease epidemiology. PLoS One, 11(5), e0155249.

Loshin, D. (2001). Enterprise knowledge management: The data quality approach. Morgan Kaufmann.

Marsh, R. (2005). Drowning in dirty data? It's time to sink or swim: A four-stage methodology for total data quality management. Journal of Database Marketing & Customer Strategy Management, 12(2), 105-112.

Murdock, H. (2008). The three dimensions of fraud: Auditors should understand the needs, opportunities, and justifications that lead individuals to commit fraudulent acts. Internal Auditor, 65(4), 81-83.

PBLQ. (2014). Verbeteren van de kwaliteit van het woonadres in de BRP.

Reid, A., & Catterall, M. (2005). Invisible data quality issues in a CRM implementation. Journal of Database Marketing & Customer Strategy Management, 12(4), 305-314.

Shrestha, P. M. (2006). Comparison of ordinary least square regression, spatial autoregression, and geographically weighted regression for modeling forest structural attributes using a geographical information system (GIS)/remote sensing (RS) approach (Doctoral dissertation). Canada: University of Calgary. Retrieved May 2018, from http://people.ucalgary.ca/~mcdermid/Docs/Theses/Shrestha_2006.pdf

Strong, D. M., Lee, Y. W., & Wang, R. Y. (1997). Data quality in context. Communications of the ACM, 40(5), 103-110.

Tittle, C. R., & Meier, R. F. (1990). Specifying the SES/delinquency relationship. Criminology, 28(2), 271-300.

Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(sup1), 234-240.

Yang, J., Zhang, Y., Luo, L., Meng, R., & Yu, C. (2018). Global mortality burden of cirrhosis and liver cancer attributable to injection drug use, 1990-2016: An age-period-cohort and spatial autocorrelation analysis. International Journal of Environmental Research and Public Health, 15(1).

Appendix

Results

Dependent variable: Proportion of data quality issues

                                   OLS         SAR         SEM         SDM         SDEM
                                   (1)         (2)         (3)         (4)         (5)
Data Mutations                   1.135***    0.895***    1.024***    0.906***    0.930***
                                (0.026)     (0.027)     (0.030)     (0.034)     (0.033)
Data Size                       -2.119***   -1.734***   -1.859***   -1.689***   -1.756***
                                (0.051)     (0.051)     (0.054)     (0.057)     (0.056)
Data Size standard deviation     1.178***    1.037***    1.028***    0.987***    1.037***
                                (0.042)     (0.040)     (0.042)     (0.042)     (0.042)
lag.Data Mutations                                                  -0.005       0.323***
                                                                    (0.050)     (0.049)
lag.Data Size                                                       -0.423***   -1.032***
                                                                    (0.104)     (0.103)
lag.Data Size standard deviation                                     0.415***    0.760***
                                                                    (0.090)     (0.097)
Constant                         2.948***    2.575***    2.336***    2.997***    3.925***
                                (0.128)     (0.122)     (0.144)     (0.186)     (0.221)

Observations                     5,104       5,104       5,104       5,104       5,104
Log Likelihood                              -7,054.578  -7,117.510  -7,043.917  -7,058.285
Akaike Inf. Crit.                           14,121.160  14,247.020  14,105.830  14,134.570
Rho                                          0.586***                0.564***

Table 3: Estimation results. Note: *p<0.1; **p<0.05; ***p<0.01

Master thesis defense slides: "Using spatial econometrics to find data quality issues", by Stefan Zwart (1st supervisor: prof. dr. J.E. (Jaap) Wieringa). The slides cover the aim of the study (the consequences of data quality issues in a marketing context: operational costs, customer satisfaction, decision making, CRM and the estimated $611 billion in annual costs, and why spatial econometrics is relevant for the municipality of Groningen), the base model y = β0 + β1 Data mutations + β2 Data size + β3 Data size standard deviation + ε, interpretation and hypothesis testing (Moran's I between -1 and +1, indicating spatial heterogeneity or clustering of alike values, with a Monte Carlo simulation to test significance; log-transformed dependent and independent variables; direct and indirect effects computed from the marginal effects matrix), the results, and the limitations and future research (the dependent variable is a proportion of a count variable, the normality assumption for the residuals does not hold, and results are given at an aggregated level).