MASTER THESIS
Predicting parking lot occupancy using Prediction Instrument
Development for Complex Domains
Public version
Author: Joanne Lijbers
Study programme: Business Information Technology
Email: JLijbers@deloitte.nl
Graduation committee:
S.J. van der Spoel, MSc. (Industrial Engineering and Business Information Systems, University of Twente)
Dr. C. Amrit (Industrial Engineering and Business Information Systems, University of Twente)
Dr.Ir. M. van Keulen (EEMCS - Database Group, University of Twente)
C. ten Hoope, MSc. (Analytics and Information Management, Deloitte Nederland)
Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente
July, 2016
“Not everything that can be counted counts, and not everything that counts can be counted”
William Bruce Cameron
Abstract
In predictive analytics, complex domains are domains in which behavioural, cultural, political, and other soft factors affect prediction outcomes. A soft-inclusive domain analysis can be performed to capture the effects of these (domain-specific) soft factors.
This research assesses the use of a soft-inclusive domain analysis to develop a prediction instrument in a complex domain, versus the use of an analysis in which no soft factors are taken into account: a soft-exclusive analysis.
A case study of predicting parking lot occupancy is used to test the methods. A regression approach is taken, trying to predict the exact number of cars in the parking lot, one day ahead.
Results show no significant difference in predictive performance when comparing the developed prediction instruments. Possible explanations for this result are the high predictive performance of the soft-exclusively developed predictive model, and the fact that not all soft factors identified using the soft-inclusive analysis could be used in training the predictive model.
Acknowledgements
In your hands, or on your screen, you find the result of my final project at the University of Twente, performed to graduate from the study programme Business Information Technology. As my specialization track was ‘Business Analytics’, I wanted to focus on this topic in my final project as well. With the topic of this thesis being predictive analytics, I have gained considerable knowledge of the field of my studies, and am eager to continue to work in, and learn from, this field after graduating.
I want to thank my university supervisors, Sjoerd, Chintan and Maurice, for their guidance during the research process, their feedback, and for answering my questions (from in-depth to design questions). Special thanks to Claudia, my supervisor at Deloitte, for the weekly meetings: thanks for helping me out with the technical aspects of this research and for your guidance during the process. Last, I want to thank friends and family for their support throughout the past months, and for the good times during all the years of study.
Joanne Lijbers
Amsterdam, July 2016
Contents
Abstract
Acknowledgements
List of Figures
List of Tables
List of Abbreviations
1 Introduction
  1.1 Introduction
  1.2 Research Background
    1.2.1 Case Description
  1.3 Research Objectives
    1.3.1 Validation Goal
    1.3.2 Prediction Goal
2 Methodology
  2.1 Research Design
    2.1.1 Research Methodology
    2.1.2 Research Questions
  2.2 Thesis structure
3 Theoretical Background
  3.1 Predictive analytics
  3.2 Domain-driven development methods
  3.3 Prediction Instrument Development for Complex Domains
    3.3.1 Preparation Stage
    3.3.2 Stage I: Qualitative assumptions
    3.3.3 Stage II: Predictive modelling
    3.3.4 Stage III: Model convergence
    3.3.5 PID-SEDA
4 Soft-exclusive development
  4.1 Stage I: Assumptions
    4.1.1 Goal definition
    4.1.2 Literature review
    4.1.3 Data constraint
  4.2 Stage II: Predictive modelling
    4.2.1 Data cleaning
    4.2.2 Data selection strategies
    4.2.3 Exploratory data analysis
    4.2.4 Technique selection
    4.2.5 Evaluation, validation & model selection
  4.3 Stage III: Model convergence
  4.4 Conclusion
5 Soft-inclusive development
  5.1 Preparation
    5.1.1 Problem identification
    5.1.2 Expert selection
  5.2 Stage I: Assumptions
    5.2.1 Hypothesis divergence
    5.2.2 Hypothesis convergence
    5.2.3 Constraint definition
  5.3 Stage II: Predictive modelling
    5.3.1 Data selection & cleaning strategies
    5.3.2 Reduction by data & domain constraints
    5.3.3 Exploratory data analysis
    5.3.4 Technique & parameter selection
    5.3.5 Model training
    5.3.6 Reduction by interestingness, deployment & domain constraints
  5.4 Stage III: Model Convergence
  5.5 Conclusion
6 Discussion
  6.1 Comparing models
  6.2 Validity
    6.2.1 Conclusion Validity
    6.2.2 Internal Validity
    6.2.3 External Validity
7 Conclusion
  7.1 Answering Research Questions
  7.2 Recommendations
A Correlation soft-exclusive factors
B Performance measures - Soft-exclusive
C Performance measures - Soft-inclusive
References
List of Figures
1.1 Influence of soft-factors on domain
1.2 PID-CD
2.1 Research Methodology
3.1 Developing & evaluating a prediction model
4.1 Refinement literature search
4.2 Exploratory data analysis
5.1 Hypotheses constructs
5.2 Hypotheses after specialization
5.3 Relation between traffic & occupancy
List of Tables
4.1 Performance Measures
4.2 Search Terms
4.3 Soft-exclusive prediction factors
4.4 Significant correlations
4.5 Soft-exclusive regression results
5.1 Domain experts
5.2 Collected hypotheses
5.3 Variable correlation
5.4 Random Forest - Performance Measures
5.5 Comparison of results
6.1 Performance measures selected strategies
7.1 Correlation occupancy & appointments
A.1 Correlation (a)
A.2 Correlation (b)
B.1 Performance measures - Strategy 1
B.2 Performance measures - Strategy 2
B.3 Performance measures - Strategy 3
B.4 Performance measures - Strategy 4
B.5 Performance measures - Strategy 5
B.6 Performance measures - Strategy 6
B.7 Performance measures - Strategy 7
C.1 Performance measures - Strategy 2
C.2 Performance measures - Strategy 3
C.3 Performance measures - Strategy 4
C.4 Performance measures - Strategy 5
C.5 Performance measures - Strategy 8
C.6 Performance measures - Strategy 9
List of Abbreviations
BI Business Intelligence
CART Classification And Regression Trees
DT Decision Tree
IMS Intelligence Meta-Synthesis
MAE Mean Absolute Error
MAPE Mean Absolute Percentage Error
MLR Multiple Linear Regression
MSE Mean Square Error
PID-CD Prediction Instrument Development for Complex Domains
PID-SEDA Prediction Instrument Development with Soft-Exclusive Domain Analysis
RF Random Forest
RMSE Root Mean Square Error
SVR Support Vector Regression
Chapter 1
Introduction
1.1 Introduction
Sensors, and the data they collect, are used in a wide variety of domains, like disaster management and intelligence analysis, but also in the ‘manufacturing, energy and resources’ industry and the ecology sector [1]. Sometimes the (sensor) data collected in these domains are very straightforward to analyse, for example when a sensor is used inside a machine to monitor tool conditions. The number of factors influencing the tool condition is limited because, for example, only the number of usages influences tool quality. Prediction of failure is easy once the process repeats itself. Analysis of such a simple domain can be done without taking soft factors into account. Soft factors are factors like behaviour, politics and strategies, which can influence a domain and the data retrieved from it [2]. Since no soft factors need to be taken into account, such an analysis is referred to as soft-exclusive. Van der Spoel, Amrit and van Hilligersberg [2] describe a soft-exclusive analysis as ‘domain analysis that only takes easily quantifiable factors into account’.
Other domains are more complex: they are only partially observable; probabilistic; evolving over time; and subject to behavioural influences [3]. Many factors might interact with these complex domains, and not all of them might be known. When human behaviour is involved, domains can almost always be referred to as complex [3].
Complex domains need a different approach compared to simpler ones. When analysing such a domain, the influence of soft factors like those mentioned above does need to be taken into account, because factors like behaviour and politics influence how a domain is represented by data (see Figure 1.1). Including soft factors in an analysis is referred to as a soft-inclusive approach. Van der Spoel et al. [2] developed a method that uses this soft-inclusive approach to develop predictive models. This research adds to this topic, amongst others, by validating the method of Van der Spoel et al. [2]. Previous and related research will now be discussed, as well as current gaps in knowledge, to motivate the choice of research. Thereafter
FIGURE 1.1: The influence of soft-factors on a domain, retrieved from [2, p. 11]. A domain gets represented by data going through a filter, which can be affected by soft-factors from the domain.
the objectives of the research are given. The objectives are used as a basis in designing the research.
1.2 Research Background
As mentioned in the introduction, this research focuses on the use of data gathered from complex domains. If valuable insights are to be derived from these data, it is important to understand the domain: to gather intelligence about it. Domain analysis is a way to do so: to learn to understand the context in which insights are created [2], [4].
Van der Spoel et al. [2] developed a method for building prediction instruments that uses domain analysis, see Figure 1.2. The method provides steps in which prediction models are created using hypotheses obtained from analysing the domain to which the prediction models apply. Using ‘intelligence’ from people involved in this domain (the experts), the domain can be analysed more thoroughly than with the knowledge of the researcher(s) alone. By performing field studies, or brainstorming with these experts, hypotheses on what influences the to-be-predicted system are gathered and, together with constraints, used to create prediction models. The steps, as displayed in Figure 1.2, are explained in more detail in Section 3.3.
Prediction Instrument Development for Complex Domains (PID-CD) has recently been developed (see [2]). The method needs to be tested in a new environment, to see how it performs there and to increase its validity.
1.2.1 Case Description
To validate PID-CD, a case study of predicting parking lot occupancy is
used. The research uses sensor data from the parking lot of ‘The Edge’, one
of the offices of Deloitte Nederland B.V. The building is said to be the most
FIGURE 1.2: Prediction Instrument Development for Complex Domains as designed by Van der Spoel et al. [2]. The steps are explained in more detail in Section 3.3.
sustainable office in the world [5]. Rainwater gets re-used to flush toilets, and solar panels collect all the power the building uses. Besides this, one of the special features of the building is the many sensors it contains. Because of these special features ‘The Edge’ is referred to as a smart building. Smart buildings pursue goals that relate to energy consumption, security and the needs of users [6]. At The Edge this is, among other things, realized by movement sensors, which control the lights based on occupancy, and by temperature sensors, which control the climate at the different departments. The data collected for these uses are also saved for the purpose of analysis and optimization. Data analysis could reveal patterns that enable more efficient use.
One of the things that can be used more efficiently is the parking lot of the building. Approximately 2500 employees are based at The Edge, but only 240 permanent parking spots are reserved for Deloitte employees. To compensate for the limited number of spots, employees can park at a garage near the office. Only employees to whom the following rules apply get the right to park in the parking lot of The Edge:
1. The Edge as main office;
2. Joining the lease-program;
4 Chapter 1. Introduction
3. Function-level senior manager or higher;
4. Ambulant function in Audit or Consulting.
Unfortunately this still leaves more people with parking rights than there are spots available. This results both in inefficient use of time, when employees have to search for a parking spot elsewhere, and in dissatisfaction among those employees.
Since it affects the efficiency and satisfaction of employees, it is important to maximize the use of the available parking space. Some employees get dissatisfied because they arrive at a full parking lot, but other employees get dissatisfied because on quiet days, when parking spots are available, they are still not allowed to park. Predicting the occupancy of the parking lot might help in resolving these problems.
1.3 Research Objectives
The goal of the research is twofold. Firstly, the research aims at validating PID-CD, developed by Van der Spoel et al. [2]. Secondly, it aims to predict the parking lot occupancy of the office ‘The Edge’ in Amsterdam, used by Deloitte Netherlands. These goals are discussed separately below.
1.3.1 Validation Goal
According to Wieringa and Moralı [7], validation research needs to be done to answer questions on the effectiveness and utility of a designed artefact. Wieringa and Moralı define an artefact in Information Systems Research as anything from software to hardware, or more conceptual entities like methods, techniques or business processes [7]. Validation can be used for trade-off analysis (‘Do answers change when the artefact changes?’) as well as for sensitivity analysis (‘Do answers change when the context in which the artefact is implemented changes?’) [7].
In this research, the artefact is the method developed by Van der Spoel et al. [2]. The research is aimed at answering both validation questions. The first goal is to see how this method performs in a different context, compared to the context Van der Spoel et al. describe in their research, which is predicting turnaround time for trucks at a container terminal [2] (sensitivity analysis).
The second validation goal is to see the change in answers when the artefact changes (trade-off analysis). This trade-off question will be answered by developing two prediction instruments. Besides developing a
prediction instrument using PID-CD, a prediction instrument will be developed using a soft-exclusive approach: prediction instrument development with soft-exclusive domain analysis (PID-SEDA) [2]. This method differs from PID-CD in the phase of collecting hypotheses and constraints, as will be explained in Section 3.3. By comparing the results of this change, a trade-off can be made between, for example, quality of results on the one hand and the effort to develop the artefact on the other.
1.3.2 Prediction Goal
The second goal of this research is to accurately predict the occupancy of the parking lot of ‘The Edge’ on a given day. If the occupancy can be predicted, arrangements can be made in advance: if it is predicted to be very busy, employees can be warned, or if it is predicted to be quiet, other employees might get access for that day.
The development of applications that enable these uses is beyond the scope of this research. This research is about developing an actionable prediction model for the occupancy of the parking lot, being a model that ‘satisfies both technical concerns and business expectations’ [8].
Chapter 2
Methodology
2.1 Research Design
This chapter explains how the research is conducted. The methodology, as well as the research questions, will be explained. At the end of this chapter, the structure of the remainder of this thesis is presented.
2.1.1 Research Methodology
This research is classified as Technical Action Research (TAR). TAR is ‘the attempt to scale up a treatment to conditions of practice by actually using it in a particular problem’, as defined by Wieringa and Moralı [7]. With Technical Action Research, a developed artefact is implemented in practice to validate its design and, by doing so, increase its relevance. By implementing it in practice, an artefact moves from idealized conditions to being an actionable artefact in the real world [9]. TAR is intended to solve improvement problems (designing and evaluating artefacts) as well as to answer knowledge questions (resolving a lack of knowledge about some aspect of the real world). This research is classified as TAR because it answers knowledge questions like ‘What would be the effect of applying Intelligence Meta-Synthesis (IMS) in developing a predictive instrument in a complex system?’ and addresses the improvement problem of predicting the occupancy of the parking lot of The Edge.
The structure of TAR is shown in the top half of Figure 2.1 [7, p. 231]. The left improvement problem, which shows the steps in developing an artefact, has already been conducted by Van der Spoel et al. [2]. This research contains the steps in the dotted frame. In the bottom half of Figure 2.1 these steps are applied to this research, showing the chapters in which the different steps will be discussed. The different stages of developing a domain-driven prediction instrument, as defined in [2], are also mapped onto the structure of TAR.
FIGURE 2.1: The structure of Technical Action Research, taken from [7]. The dotted frame in the bottom shows the steps of TAR applied to this research and the steps of PID-CD.
The Research Execution phase is performed twice: first, a prediction instrument is developed following a soft-exclusive development approach, using only literature. Second, a prediction instrument is developed following the PID-CD method of Van der Spoel et al. [2].
2.1.2 Research Questions
The goals of validating the domain-driven prediction instrument development method and predicting parking lot occupancy translate into the following research question and subquestions:
’How does a prediction instrument developed using a soft-inclusive method compare to a prediction instrument developed using a soft-exclusive method?’
1. What instrument for predicting parking lot occupancy results from using ’prediction instrument development with soft-exclusive domain analysis’?
2. What instrument for predicting parking lot occupancy results from using ’prediction instrument development for complex domains’?
2.2 Thesis structure
The remainder of this thesis will be structured as follows (as can be seen in Figure 2.1):
Chapter 3 provides a theoretical background on the topics of ’predictive analytics’ and ’intelligence meta-synthesis’. Common terms and practices are introduced to ease the understanding of the other chapters. The stages and steps of PID-SEDA and PID-CD are explained as well.
Chapter 4 addresses the improvement problem using a soft-exclusive development method, answering the first subquestion. As can be seen in Figure 2.1, this includes problem investigation, treatment design, design validation, implementation, and evaluation of the design.
Chapter 5 shows the results of performing the same steps, but using the PID-CD method, answering subquestion two.
In Chapter 6 the results of the two development methods are compared. Internal and external validity are checked, and contributions and limitations are described.
Concluding this thesis, the research questions are answered and recommendations for future work are given in Chapter 7.
Chapter 3
Theoretical Background
This chapter provides a theoretical background on the topics of ’predictive analytics’ and ’domain-driven development methods’. Next, the stages of Prediction Instrument Development for Complex Domains (PID-CD) [2] are explained.
3.1 Predictive analytics
According to Waller and Fawcett [10], data science is ‘the application of quantitative and qualitative methods to solve relevant problems and predict outcomes’. Besides, for example, database management and visualization, predictive analytics forms a subset of data science.
Predictive analytics is the process of developing prediction models, as well as evaluating the predictive power of such models [11]. A prediction model can be viewed as a function:
y = f (X)
The output of the model is represented by y, the variable to be predicted. X represents the (set of) input variable(s) [12]. By estimating this function, the relationship between X and y can be modelled and used to predict new values of y.
Training this function can be done by using a training set of data (e.g. 70 percent of a dataset), for which all values of X and y are known. Because the values are known, the relationship between the input and output variable(s) can be determined. After this, a test set (the remaining 30 percent of the data), using only the values of X, is used to test whether the trained function accurately predicts y.
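The split described above can be sketched in a few lines; the 70/30 ratio follows the text, while the helper name and the toy data are illustrative only:

```python
def train_test_split(rows, train_frac=0.7):
    """Split a dataset into a training part (first 70%) and a test part
    (remaining 30%), as described in the text."""
    cut = round(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

rows = list(range(10))                 # toy stand-in for (X, y) records
train, test = train_test_split(rows)   # 7 training rows, 3 test rows
```

In practice the split is often made chronologically for time-dependent data such as daily occupancy, so that the model is always tested on days that come after the days it was trained on.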
The process of predictive analytics is displayed in Figure 3.1. When using a function to predict a numerical outcome, the predictive model is referred to as a regression model. When the outcome is categorical, this is referred to as classification. Another prediction goal is ranking, which is used to "rank observations to their probability of belonging to a certain class" [11, p. 23].
Linear and Multiple Regression are the most important and most widely used prediction techniques [13]. Besides these, other techniques can be used, like Support Vector Regression, which can recognize subtle patterns in complex data sets [14], but also techniques like Decision Tree or Random Forest, which can be used for both classification and regression. Decision Trees (DT) consist of multiple nodes at which an attribute gets compared to a certain constant (often a greater-than or smaller-than comparison) [15]. Each branch represents an outcome of the comparison, and tree leaves represent prediction values (or classes in case of classification) [12]. Learning a DT is simple and fast, and the representation of results is intuitive and easy to understand [12]. Random Forest (RF) is a technique that uses multiple Decision Trees to create a prediction model. According to Breiman [16], using RF results in high prediction accuracy. The technique often achieves the same or better prediction performance compared to a single DT [17]. The process described before, and displayed in Figure 3.1, remains the same for these techniques, with the function being a decision tree or a forest.
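To make the DT-to-RF idea concrete, the sketch below fits depth-1 regression trees ("stumps") and averages many of them, each trained on a bootstrap resample, which is the core of Breiman's Random Forest. This is a deliberately minimal toy, not the implementation used in this thesis, and the data are invented:

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: pick the split on x that minimises
    the summed squared error of predicting the mean on each side."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                      # degenerate sample: predict the mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Toy 'random forest' for regression: average many stumps, each fit
    on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [10, 12, 11, 50, 52, 51]             # a step in the data around x = 3.5
predict = fit_forest(xs, ys)              # low x -> near 11, high x -> near 51
```

A real RF additionally grows deep trees and samples a random subset of features at every split; the averaging over bootstrap samples shown here is what reduces the variance of a single tree.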
As mentioned, evaluating the predictive power of a model is the second part of predictive analytics. Evaluating the accuracy of a (numerical) prediction model is done by calculating the difference (the error) between the known values y and the predicted values y’ [12].
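The error measures used later in this thesis (MSE, RMSE, MAE; see Chapter 4) all aggregate these per-observation differences. Their standard definitions are easy to state in code; the occupancy numbers below are made up for illustration:

```python
import math

def mse(y_true, y_pred):
    """Mean Square Error: average of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual    = [200, 180, 220]   # hypothetical observed occupancy
predicted = [190, 185, 210]   # hypothetical model output
# errors are 10, -5, 10, so MSE = 75, RMSE = sqrt(75), MAE = 25/3
```

Because MSE and RMSE square the errors, they penalise large misses more heavily than MAE does, which is one reason the thesis treats MAE as the most natural measure of average error.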
A prediction instrument, as developed in this research, is a combination of a predictive model (the trained function), the technique used to create it, its parameters, a data selection & refinement strategy, and (business) constraints (to determine whether or not the model is useful in practice) [2].
3.2 Domain-driven development methods
As mentioned in Section 1.1, developing prediction instruments becomes more difficult when dealing with complex domains. Predictive models often cannot be copied from existing ones, since every complex domain has its own unique characteristics. One way to develop actionable prediction models in such domains is to use domain-driven development methods.
According to Cao and Zhang [18], domain-driven data mining aims to develop specific methodologies and techniques to deal with complex (business) domains. When using a domain-driven approach, both objective and subjective factors can be included in a (predictive) model. Waller and Fawcett state that analysis and domain knowledge cannot be separated [10]. According
FIGURE 3.1: The process of developing and evaluating a prediction model. A function of y gets trained using a subset of the data. This function is used to predict the goal variable of the test set. The last step is evaluating the results by calculating the prediction error.
to the authors, data scientists need both a broad set of analytical skills and deep domain knowledge.
This domain knowledge, however, does not necessarily have to come from the data scientists themselves. One way of developing instruments with a domain-driven view is to use an Intelligence Meta-Synthesis (IMS) approach. IMS is a method for capturing soft factors, in the form of different kinds of intelligence, like human intelligence, data intelligence and domain intelligence [19]. According to Gu and Tang [20], ‘meta-synthesis emphasizes the synthesis of collected information and knowledge of various kinds of experts’. It is a methodology in which quantitative methods are combined with qualitative (domain) knowledge, obtained by consulting domain experts.
Van der Spoel et al. use IMS as a basis for their soft-inclusive domain analysis [2]. The domain analysis is soft-inclusive because, besides including hard factors, it also takes soft, domain-specific factors, like behaviour and culture, into account. Soft-exclusive domain analysis, on the other hand, only takes factors into account that are directly quantifiable (hard factors).
3.3 Prediction Instrument Development for Complex Domains
The development method designed by Van der Spoel et al. [2] is described below, as it is used in Chapter 5 of this thesis. The steps of the method are displayed in Figure 1.2. In the preparation stage the prediction goal is defined and experts are selected. In Stage I hypotheses and constraints regarding the domain are collected. In Stage II the collected hypotheses
are translated into datasets and the constraints are used to select the final datasets. These datasets are used to train predictive models, which need to comply with the given constraints. In the third stage, the final predictive models are chosen. These steps are discussed in more detail below.
3.3.1 Preparation Stage
As displayed in Figure 1.2, before hypotheses are collected, preparations need to be made. ‘What needs to be predicted?’ (the prediction goal) and ‘What are the characteristics of the problem domain?’ are questions that are answered during this preparation stage [2]. Another part of the preparation stage is to determine the experts who will be consulted later in the process.
3.3.2 Stage I: Qualitative assumptions
In the first core stage of the development method, hypotheses are collected and constraints are defined. Hypotheses are collected through brainstorming, individual interviews, field studies and/or literature review [2]. Brainstorming might have to be done anonymously, to ensure conflicting interests do not affect the results of the brainstorm.
After the hypotheses are collected, their number is reduced, to avoid having to test similar hypotheses. Selections are made by looking at the level of agreement. Only those hypotheses that are sufficiently different and interesting are taken into account in the development stage. Merging the hypotheses into one set T for testing is done by following these steps:
1. Translate the collected hypotheses into diagrams, showing the factors (constructs) and their relations;
2. Standardize and specialize the constructs (synonymous factors are replaced by one synonym; constructs are possibly replaced by their sub-constructs);
3. Determine the causal influence of the constructs and group the hypotheses with the same causal influence. One hypothesis per group gets added to the set of hypotheses to be tested, T.
The last part of the first stage is to define constraints. Through consulting experts, constraints are collected regarding the domain, data, deployment and interestingness. Domain constraints originate from the domain the prediction instrument is being developed for, for example having to comply with privacy standards. Data constraints are constraints on the structure, quantity and quality of data. Whether or not a prediction instrument can
actually be used in existing technological infrastructures relates to the deployment constraints. Finally, the interestingness constraint relates to the performance of the instrument.
3.3.3 Stage II: Predictive modelling
Once the hypotheses set is completed, prediction models are created. The hypotheses are translated into available variables for learning the models. After this selection of data, the data might need to be cleaned before usage (for example, deleting outlier rows). Through exploratory data analysis and consulting the experts, it can be checked which selection & cleaning strategies need to be applied. Next, the different selection & cleaning strategies are reduced by checking compliance with the data and domain constraints collected in Stage I. After that, prediction methods and parameters (like the size of the training/test set) are selected. For every strategy and every prediction method selected, predictive models are trained and evaluated using calculated performance measures. Based on the interestingness, deployment & domain constraints, models that do not meet the constraints are discarded.
3.3.4 Stage III: Model convergence
The final stage of the method is to select (a) predictive model(s). Domain experts are consulted to make this selection. Selection is done based on predictive performance, but factors like training time or personal preferences can also be taken into consideration. If a model gets selected it will, together with the data selection & cleaning strategy, prediction method, parameters and constraints, form the developed prediction instrument.
3.3.5 PID-SEDA
As explained, besides PID-CD, Prediction Instrument Development with Soft-Exclusive Domain Analysis (PID-SEDA) will be used to serve as a benchmark method. Using this method represents using a soft-exclusive approach to analysing a complex domain: almost no knowledge of the domain is used for selecting factors or algorithms. By comparing its results to the results of using PID-CD, the effect of including soft factors in analysing a complex domain is researched.
PID-SEDA is the soft-exclusive development method used in Chapter 4. The method differs from PID-CD by collecting hypotheses only through conducting a literature review. The predictive modelling stage is similar to the one in PID-CD, except that no domain, deployment and interestingness constraints need to be met. At the end of the (iterative) process, the best predictive model is chosen based on predictive power [2].
Chapter 4
Soft-exclusive development
This chapter presents the results of using the soft-exclusive development method (PID-SEDA) to develop a prediction instrument for predicting parking lot occupancy in The Edge. The different stages of the method, as well as the final prediction instrument developed, are presented below.
4.1 Stage I: Assumptions
The first stage of the soft-exclusive development model focuses on gathering assumptions on how to predict parking lot occupancy. The prediction goal is determined and a structured literature review is performed to see which factors are mentioned in existing literature. To conclude this stage, a description of the available data is given.
4.1.1 Goal definition
The prediction goal is to predict the occupancy of the parking lot of The Edge. Occupancy is the number of (Deloitte) cars currently in the parking lot. Different time windows are tested: predicting occupancy half an hour in advance, two hours in advance, and the evening before the predicted moment. The output (occupancy) lies in a wide range of possible, continuous values (approx. 0-250 cars). Therefore, a regression approach is taken, trying to predict the exact number of cars in the parking lot (referred to as a prediction goal [11, p. 23]).
The performance of the model(s) is determined by calculating the mean squared error (MSE), the root mean squared error (RMSE) and the mean absolute error (MAE). See Table 4.1 for a description of these measures. The MSE, RMSE and MAE are scale-dependent measures, useful when comparing different methods applied to the same dataset [21]. These measures will be used to select the best soft-exclusive developed prediction model: the lower the error, the better the model. MAE will be treated as the most relevant measure, since it is the most natural and unambiguous measure of average error [22]. Another often-used performance measure is the mean absolute
TABLE 4.1: Performance Measures

Measure                          Formula
Mean squared error               MSE  = (1/n) Σ_{t=1}^{n} e_t^2
Root mean squared error          RMSE = sqrt( (1/n) Σ_{t=1}^{n} e_t^2 )
Mean absolute error              MAE  = (1/n) Σ_{t=1}^{n} |e_t|
Mean absolute percentage error   MAPE = (100%/n) Σ_{t=1}^{n} |e_t / y_t|

n = the number of prediction values. e_t = prediction error: the difference between the t-th prediction and the t-th actual value. y_t = the t-th actual value.
percentage error (MAPE) (see Table 4.1), which can be used to compare prediction performance across different datasets [21]. This measure cannot be used in this case, since the actual data frequently contains zeros (for example, occupancy at night), resulting in an undefined (infinite) MAPE.
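The measures in Table 4.1 translate directly into code. The sketch below uses illustrative occupancy numbers; the zero actual value demonstrates concretely why MAPE breaks down on this data.

```python
import math

# Direct translations of the measures in Table 4.1; the occupancy numbers
# below are illustrative. The zero actual value demonstrates why MAPE
# cannot be used on this data.

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    # Division by a zero actual value (e.g. an empty lot at night) makes
    # this measure undefined, which is why it is not used in this case.
    return 100 / len(actual) * sum(abs((a - p) / a) for a, p in zip(actual, predicted))

actual = [200, 180, 0, 150]      # occupancy, with an empty lot at night
predicted = [190, 185, 5, 160]
```

For these values, MAE is 7.5 cars and MSE is 62.5, while the call to `mape` fails on the zero actual value.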
4.1.2 Literature review
To select factors from existing literature, a structured review, as described by Webster and Watson [23], is conducted. Criteria for inclusion and/or exclusion are defined; fields of research are determined; appropriate sources are selected; and specific search terms are defined. After the search, the results are refined by language, title, abstract and full text. Forward and backward citations are used to search for new relevant articles, until no new articles come up [23]. These steps are described in detail below.
Inclusion/exclusion criteria
To ensure the relevance of the selected articles, inclusion and exclusion criteria are determined. Articles should mention the topic of parking or anything synonymous. Real-time, near-real-time and non-real-time predictions are all included, to collect as many hypotheses as possible. Articles that do not use empirical data are excluded, since we try to find articles which test factors influencing parking lot occupancy.
Fields of research
No limits on fields of research will be set, since non-related articles will be filtered out in the refinement steps.
TABLE 4.2: Search Terms

Prediction terms:  Predicting, Prediction
Synonyms:          Parking lot, Parking space, Parking area, Parking spot, Lay-by, Garage, Parking
Prediction goals:  Occupancy, Availability
Sources
The sources used for the search are Google Scholar and Scopus. According to Moed, Bar-Ilan and Halevi [24], both of these databases cover a set of core sources in the field of study. Although Scopus is a good source for finding published articles, Google Scholar can add to a search by also showing 'intermediary stages of the publication process' [24]. Using both databases can therefore provide a comprehensive search.
Search
Table 4.2 displays the specific search terms that are used for the literature search. Besides 'occupancy', 'availability' is also used as a prediction goal, since predicting occupancy can also be done by predicting the number of available spots left. In the middle, synonyms for 'parking lot' are given. Using different synonyms in the literature search limits the impact of differences in terminology.
All possible combinations of these terms, synonyms and goals are used in the search, resulting in a total of 461 articles (354 from Google Scholar, 107 from Scopus).
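The full set of query combinations from Table 4.2 can be enumerated mechanically (2 prediction terms × 7 synonyms × 2 goals = 28 queries, each run against both databases); the sketch below illustrates this.

```python
from itertools import product

# Enumerating every combination of the terms in Table 4.2, as used in the
# literature search (2 prediction terms x 7 synonyms x 2 goals = 28 queries).

prediction_terms = ["predicting", "prediction"]
synonyms = ["parking lot", "parking space", "parking area",
            "parking spot", "lay-by", "garage", "parking"]
goals = ["occupancy", "availability"]

queries = [f"{term} {synonym} {goal}"
           for term, synonym, goal in product(prediction_terms, synonyms, goals)]
```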
Refine sample
The results are refined using the following steps:
1. Filter out duplicates
2. Filter by language
3. Refine by title
4. Refine by abstract
5. Refine by full text
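These steps form a sequential filter pipeline: each step only sees the articles that survived the previous one. A hypothetical sketch of this idea (the article records and predicates are illustrative; the real review applied these steps manually):

```python
# Hypothetical sketch of the five refinement steps as a filter pipeline.
# The article records and the concrete predicates are illustrative; the
# real review applied them manually.

def dedupe(articles):
    # Step 1: filter out duplicates (the same title found in both databases).
    seen, unique = set(), []
    for article in articles:
        key = article["title"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique

def refine(articles, steps):
    # Apply each remaining filter in order, keeping only surviving articles.
    for keep in steps:
        articles = [a for a in articles if keep(a)]
    return articles

articles = [
    {"title": "Parking occupancy prediction", "language": "en"},
    {"title": "Parking Occupancy Prediction", "language": "en"},   # duplicate
    {"title": "Vorhersage der Parkplatzbelegung", "language": "de"},
    {"title": "Traffic flow modelling", "language": "en"},
]

remaining = refine(dedupe(articles),
                   [lambda a: a["language"] == "en",                # step 2
                    lambda a: "parking" in a["title"].lower()])     # step 3
```

The abstract and full-text steps (4 and 5) require human judgement and cannot be expressed as simple predicates, which is why the sample sizes in Figure 4.1 were determined by reading.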
FIGURE 4.1: Filter & refinement steps of the structured literature review. n is the number of articles remaining after the preceding refinement step.
Figure 4.1 displays the refinement of the search results. The search results contained 79 duplicates, because two different sources were used. Seven articles were removed because they were not written in English. Articles whose title did not mention 'parking' (or related terms) were removed. The abstracts of the remaining 85 articles were read, and articles that did not seem to contribute to the purpose of predicting parking lot occupancy were removed. The remaining 35 articles were read in full, leaving nine relevant articles.
Forward & backwards citations
The nine selected articles were cited by, and referred to, 147 articles in total. The same refinement steps as above were applied, resulting in two new articles. These two new articles contained two new citations, which were removed from the list after reading the abstracts. This structured review resulted in eleven articles useful for selecting variables in the soft-exclusive development method.
Analysis
The final factors which will be used in the prediction model, derived from the eleven articles, can be found in a concept matrix (Table 4.3), as recommended by Webster and Watson [23].
The factor time of day is mentioned in most articles. The occupancy of a parking garage might for example be higher during business hours and low during the night, or vice versa if it is a residential garage.
The second factor derived from literature is day of week. Whether it is a working day or a non-working day (like in the weekend), might influence occupancy.
Weather is the third factor, displayed in Table 4.4, mentioned in three different articles. Weather conditions might influence people’s choice to go by car or not.
Holidays is another straightforward factor derived from literature. Whether or not it is a holiday (e.g. Christmas) likely influences the occupancy of a parking garage.
A factor mentioned only by David, Overkamp and Scheuerer [28] is the effect of a day falling between a holiday and a weekend. Many people take, or have to take, a day off on such days, possibly affecting parking lot occupancy.
Where Chen et al. [29] and Reinstadler et al. [30] only mention normal holidays as an influential factor, David et al. also mention the influence of school holidays [28].
TABLE 4.3: Soft-exclusive prediction factors (concept matrix)

Concepts, part 1: Time of day | Day of week | Events | Weather | Holidays

Chen (2014):                   X X X
Chen et al. (2013):            X X
David et al. (2000):           X X X X
Fabusuyi et al. (2014):        X X X X
Kunjithapatham et al. (n.d.):  X X
McGuiness and McNeil (1991):   -
Richter et al. (2014):         X
Reinstadler et al. (2013):     X X X X
Soler (2015):                  X X
Vlahogianni et al. (2014):     X X X
Zheng et al. (2015):           X X

Concepts, part 2: Historic occupancy | Day between holiday and weekend | School holidays | Parking lot accessibility | Parking price

Chen (2014):                   X
Chen et al. (2013):            -
David et al. (2000):           X X
Fabusuyi et al. (2014):        -
Kunjithapatham et al. (n.d.):  -
McGuiness and McNeil (1991):   X
Richter et al. (2014):         -
Reinstadler et al. (2013):     -
Soler (2015):                  -
Vlahogianni et al. (2014):     X
Zheng et al. (2015):           X