MASTER THESIS
Predicting parking lot occupancy using Prediction Instrument
Development for Complex Domains
Public version
Author: Joanne Lijbers
Study programme: Business Information Technology
Email: JLijbers@deloitte.nl
Graduation committee:
S.J. van der Spoel, MSc. (Industrial Engineering and Business Information Systems, University of Twente)
Dr. C. Amrit (Industrial Engineering and Business Information Systems, University of Twente)
Dr.Ir. M. van Keulen (EEMCS - Database Group, University of Twente)
C. ten Hoope, MSc. (Analytics and Information Management, Deloitte Nederland)
Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente
July, 2016
“Not everything that can be counted counts, and not everything that counts can be counted”
William Bruce Cameron
Abstract
In predictive analytics, complex domains are domains in which behavioural, cultural, political, and other soft factors affect prediction outcomes. A soft-inclusive domain analysis can be performed to capture the effects of these (domain-specific) soft factors.
This research assesses the use of a soft-inclusive domain analysis to develop a prediction instrument in a complex domain, versus the use of an analysis in which no soft factors are taken into account: a soft-exclusive analysis.
A case study of predicting parking lot occupancy is used to test the methods. A regression approach is taken, trying to predict the exact number of cars in the parking lot, one day ahead.
Results show no significant difference in predictive performance when comparing the developed prediction instruments. Possible explanations for this result are the high predictive performance of the soft-exclusively developed predictive model, and the fact that not all soft factors identified using the soft-inclusive analysis could be used in training the predictive model.
Acknowledgements
In your hands, or on your screen, you find the result of my final project at the University of Twente, performed to graduate from the study programme Business Information Technology. As my specialization track was ‘Business Analytics’, I wanted to focus on this topic in my final project as well. With the topic of this thesis being predictive analytics, I have gained considerable knowledge of the field of my studies, and am eager to continue to work in, and learn from, this field after graduating.
I want to thank my university supervisors, Sjoerd, Chintan and Maurice, for their guidance during the research process, their feedback, and for answering my questions (from in-depth to design questions). Special thanks to Claudia, my supervisor at Deloitte, for the weekly meetings: thanks for helping me out with the technical aspects of this research and for your guidance during the process. Last, I want to thank friends and family for their support throughout the past months, and for the good times during all the years of study.
Joanne Lijbers
Amsterdam, July 2016
Contents
Abstract
Acknowledgements
List of Figures
List of Tables
List of Abbreviations
1 Introduction
  1.1 Introduction
  1.2 Research Background
    1.2.1 Case Description
  1.3 Research Objectives
    1.3.1 Validation Goal
    1.3.2 Prediction Goal
2 Methodology
  2.1 Research Design
    2.1.1 Research Methodology
    2.1.2 Research Questions
  2.2 Thesis structure
3 Theoretical Background
  3.1 Predictive analytics
  3.2 Domain-driven development methods
  3.3 Prediction Instrument Development for Complex Domains
    3.3.1 Preparation Stage
    3.3.2 Stage I: Qualitative assumptions
    3.3.3 Stage II: Predictive modelling
    3.3.4 Stage III: Model convergence
    3.3.5 PID-SEDA
4 Soft-exclusive development
  4.1 Stage I: Assumptions
    4.1.1 Goal definition
    4.1.2 Literature review
    4.1.3 Data constraint
  4.2 Stage II: Predictive modelling
    4.2.1 Data cleaning
    4.2.2 Data selection strategies
    4.2.3 Exploratory data analysis
    4.2.4 Technique selection
    4.2.5 Evaluation, validation & model selection
  4.3 Stage III: Model convergence
  4.4 Conclusion
5 Soft-inclusive development
  5.1 Preparation
    5.1.1 Problem identification
    5.1.2 Expert selection
  5.2 Stage I: Assumptions
    5.2.1 Hypothesis divergence
    5.2.2 Hypothesis convergence
    5.2.3 Constraint definition
  5.3 Stage II: Predictive modelling
    5.3.1 Data selection & cleaning strategies
    5.3.2 Reduction by data & domain constraints
    5.3.3 Exploratory data analysis
    5.3.4 Technique & parameter selection
    5.3.5 Model training
    5.3.6 Reduction by interestingness, deployment & domain constraints
  5.4 Stage III: Model Convergence
  5.5 Conclusion
6 Discussion
  6.1 Comparing models
  6.2 Validity
    6.2.1 Conclusion Validity
    6.2.2 Internal Validity
    6.2.3 External Validity
7 Conclusion
  7.1 Answering Research Questions
  7.2 Recommendations
A Correlation soft-exclusive factors
B Performance measures - Soft-exclusive
C Performance measures - Soft-inclusive
References
List of Figures
1.1 Influence of soft-factors on domain
1.2 PID-CD
2.1 Research Methodology
3.1 Developing & evaluating a prediction model
4.1 Refinement literature search
4.2 Exploratory data analysis
5.1 Hypotheses constructs
5.2 Hypotheses after specialization
5.3 Relation between traffic & occupancy
List of Tables
4.1 Performance Measures
4.2 Search Terms
4.3 Soft-exclusive prediction factors
4.4 Significant correlations
4.5 Soft-exclusive regression results
5.1 Domain experts
5.2 Collected hypotheses
5.3 Variable correlation
5.4 Random Forest - Performance Measures
5.5 Comparison of results
6.1 Performance measures selected strategies
7.1 Correlation occupancy & appointments
A.1 Correlation (a)
A.2 Correlation (b)
B.1 Performance measures - Strategy 1
B.2 Performance measures - Strategy 2
B.3 Performance measures - Strategy 3
B.4 Performance measures - Strategy 4
B.5 Performance measures - Strategy 5
B.6 Performance measures - Strategy 6
B.7 Performance measures - Strategy 7
C.1 Performance measures - Strategy 2
C.2 Performance measures - Strategy 3
C.3 Performance measures - Strategy 4
C.4 Performance measures - Strategy 5
C.5 Performance measures - Strategy 8
C.6 Performance measures - Strategy 9
List of Abbreviations
BI Business Intelligence
CART Classification And Regression Trees
DT Decision Tree
IMS Intelligence Meta-Synthesis
MAE Mean Absolute Error
MAPE Mean Absolute Percentage Error
MLR Multiple Linear Regression
MSE Mean Square Error
PID-CD Prediction Instrument Development for Complex Domains
PID-SEDA Prediction Instrument Development with Soft-Exclusive Domain Analysis
RF Random Forest
RMSE Root Mean Square Error
SVR Support Vector Regression
Chapter 1
Introduction
1.1 Introduction
Sensors, and the data they collect, are used in a wide variety of domains, like disaster management and intelligence analysis, but also in the ‘manufacturing, energy and resources’ industry and the ecology sector [1]. Sometimes the (sensor) data collected in these domains are very straightforward to analyse, for example when a sensor is used inside a machine to monitor tool conditions. The number of factors influencing the tool condition is limited because, for example, only the number of usages influences tool quality. Prediction of failure is easy once the process repeats itself. Analysis of such a simple domain can be done without taking soft factors into account. Soft factors are factors like behaviour, politics and strategies, which can influence a domain and the data retrieved from it [2]. Since no soft factors need to be taken into account, such an analysis is referred to as soft-exclusive. Van der Spoel, Amrit and van Hilligersberg [2] describe a soft-exclusive analysis as ‘domain analysis that only takes easily quantifiable factors into account’.
Other domains are more complex: they are only partially observable; probabilistic; evolving over time; and subject to behavioural influences [3]. Many factors might interact with these complex domains, and not all of them might be known. When human behaviour is involved, domains can almost always be referred to as complex [3].
Complex domains need a different approach compared to simpler ones. When analysing such a domain, the influence of soft factors like those mentioned above does need to be taken into account, because factors like behaviour and politics influence how a domain is represented by data (see Figure 1.1). Including soft factors in an analysis is referred to as a soft-inclusive approach. Van der Spoel et al. [2] developed a method that uses this soft-inclusive approach to develop predictive models. This research adds to this topic, amongst others, by validating the method of Van der Spoel et al. [2]. Previous and related research will now be discussed, as well as current gaps in knowledge, to motivate the choice of research. Thereafter
FIGURE 1.1: The influence of soft-factors on a domain, retrieved from [2, p. 11]. A domain gets represented by data going through a filter, which can be affected by soft-factors from the domain.
the objectives of the research are given. The objectives are used as a basis in designing the research.
1.2 Research Background
As mentioned in the introduction, this research focuses on the use of data gathered from complex domains. If valuable insights are to be derived from these data, it is important to understand the domain: to gather intelligence about it. Domain analysis is a way to do so: to learn to understand the context in which insights are created [2], [4].
Van der Spoel et al. [2] developed a method for building prediction instruments that uses domain analysis, see Figure 1.2. The method provides steps in which prediction models are created using hypotheses obtained from analysing the domain to which the prediction models apply. Using ‘intelligence’ from people involved in this domain (the experts), the domain can be analysed more thoroughly than with the knowledge of the researcher(s) alone. By performing field studies, or brainstorming with these experts, hypotheses on what influences the to-be-predicted system are gathered and, together with constraints, used to create prediction models. The steps, as displayed in Figure 1.2, are explained in more detail in Section 3.3.
Prediction Instrument Development for Complex Domains (PID-CD) has recently been developed (see [2]). The method needs to be tested in a new environment, to see how it performs there and to increase its validity.
1.2.1 Case Description
To validate PID-CD, a case study of predicting parking lot occupancy is
used. The research uses sensor data from the parking lot of ‘The Edge’, one
of the offices of Deloitte Nederland B.V. The building is said to be the most
FIGURE 1.2: Prediction Instrument Development for Complex Domains as designed by Van der Spoel et al. [2]. The steps are explained in more detail in Section 3.3.
sustainable office in the world [5]. Rainwater gets re-used to flush toilets, and solar panels collect all the power the building uses. Besides this, one of the special features of the building is the many sensors it contains. Because of these special features ‘The Edge’ is referred to as a smart building. Smart buildings pursue goals that relate to energy consumption, security and the needs of users [6]. At The Edge this is, among other things, realized by movement sensors, which control the lights based on occupancy, and by temperature sensors, which control the climate at the different departments. The data collected for these uses are also saved for the purpose of analysis and optimization. Data analysis could reveal patterns that enable more efficient use.
One of the things that can be used more efficiently is the parking lot of the building. Approximately 2500 employees are based at The Edge, but only 240 permanent parking spots are reserved for Deloitte employees. To compensate for the limited number of spots, employees can park at a garage near the office. Only employees to whom the following rules apply get the right to park in the parking lot of The Edge:
1. The Edge as main office;
2. Joining the lease-program;
4 Chapter 1. Introduction
3. Function-level senior manager or higher;
4. Ambulant function in Audit or Consulting.
Unfortunately this still leaves more people with parking rights than there are spots available. This results both in inefficient use of time, when employees have to search for a parking spot elsewhere, and in dissatisfaction among those employees.
Since it affects the efficiency and satisfaction of employees, it is important to maximize the use of the available parking space. Some employees get dissatisfied because they arrive at a full parking lot, but other employees get dissatisfied because on quiet days, when parking spots are available, they are still not allowed to park. Predicting the occupancy of the parking lot might help in resolving these problems.
1.3 Research Objectives
The goal of the research is twofold. Firstly, the research aims at validating PID-CD, developed by Van der Spoel et al. [2]. Secondly, it aims to predict the parking lot occupancy of the office ‘The Edge’ in Amsterdam, used by Deloitte Netherlands. These goals are discussed separately below.
1.3.1 Validation Goal
According to Wieringa and Moralı [7], validation research needs to be done to answer questions on the effectiveness and utility of a designed artefact. Wieringa and Moralı define an artefact in Information Systems Research as anything from software to hardware, or more conceptual entities like methods, techniques or business processes [7]. Validation can be used for trade-off analysis (‘Do answers change when the artefact changes?’) as well as for sensitivity analysis (‘Do answers change when the context in which the artefact is implemented changes?’) [7].
In this research, the artefact is the method developed by Van der Spoel et al. [2]. The research is aimed at answering both validation questions. The first goal is to see how this method performs in a different context, compared to the context Van der Spoel et al. describe in their research, which is predicting turnaround time for trucks at a container terminal [2] (sensitivity analysis).
The second validation goal is to see the change in answers when the artefact changes (trade-off analysis). This trade-off question will be answered by developing two prediction instruments. Besides developing a
prediction instrument using PID-CD, a prediction instrument will be developed using a soft-exclusive approach: prediction instrument development with soft-exclusive domain analysis (PID-SEDA) [2]. This method differs from PID-CD in the phase of collecting hypotheses and constraints, as will be explained in Section 3.3. By comparing the results of this change, a trade-off can be made between, for example, quality of results on the one hand and the effort to develop the artefact on the other.
1.3.2 Prediction Goal
The second goal of this research is to accurately predict the occupancy of the parking lot of ‘The Edge’ on a given day. If the occupancy can be predicted, arrangements can be made in advance: if it is predicted to be very busy, employees can be warned, or if it is predicted to be quiet, other employees might get access for that day.
The development of applications that enable these uses is beyond the scope of this research. This research is about developing an actionable prediction model for the occupancy of the parking lot, being a model that ‘satisfies both technical concerns and business expectations’ [8].
Chapter 2
Methodology
2.1 Research Design
This chapter explains how the research is conducted. The methodology, as well as the research questions, will be explained. At the end of this chapter, the structure of the remainder of this thesis is presented.
2.1.1 Research Methodology
This research is classified as Technical Action Research (TAR). TAR is ‘the attempt to scale up a treatment to conditions of practice by actually using it in a particular problem’, as defined by Wieringa and Moralı [7]. With Technical Action Research, a developed artefact is implemented in practice to validate its design and, by doing so, increase its relevance. By implementing it in practice, an artefact moves from idealized conditions to being an actionable artefact in the real world [9]. TAR is intended to solve improvement problems (designing and evaluating artefacts) as well as to answer knowledge questions (resolving a lack of knowledge about some aspect of the real world). This research is classified as TAR because it answers knowledge questions like ‘What would be the effect of applying Intelligence Meta-Synthesis (IMS) in developing a predictive instrument in a complex system?’ and addresses the improvement problem of predicting the occupancy of the parking lot of The Edge.
The structure of TAR is shown in the top half of Figure 2.1 [7, p. 231]. The left improvement problem, which shows the steps in developing an artefact, has already been conducted by Van der Spoel et al. [2]. This research contains the steps in the dotted frame. In the bottom half of Figure 2.1 these steps are applied to this research, showing the chapters in which the different steps will be discussed. The different stages of developing a domain-driven prediction instrument, as defined in [2], are also mapped onto the structure of TAR.
FIGURE 2.1: The structure of Technical Action Research, taken from [7]. The dotted frame in the bottom shows the steps of TAR applied to this research and the steps of PID-CD.
The Research Execution phase is performed twice: first, a prediction instrument is developed following a soft-exclusive development approach, using only literature. Second, a prediction instrument is developed following the PID-CD method of Van der Spoel et al. [2].
2.1.2 Research Questions
The goals of validating the domain-driven prediction instrument development method and predicting parking lot occupancy translate into the following research question and subquestions:
’How does a prediction instrument developed using a soft-inclusive method compare to a prediction instrument developed using a soft-exclusive method?’
1. What instrument for predicting parking lot occupancy results from using ’prediction instrument development with soft-exclusive domain analysis’?
2. What instrument for predicting parking lot occupancy results from using ’prediction instrument development for complex domains’?
2.2 Thesis structure
The remainder of this thesis will be structured as follows (as can be seen in Figure 2.1):
Chapter 3 provides a theoretical background on the topics of ’predictive analytics’ and ’intelligence meta-synthesis’. Common terms and practices are introduced to ease the understanding of the other chapters. The stages and steps of PID-SEDA and PID-CD are explained as well.
Chapter 4 addresses the improvement problem using a soft-exclusive development method, answering the first subquestion. As can be seen in Figure 2.1, this includes problem investigation, treatment design, design validation, implementation, and evaluation of the design.
Chapter 5 shows the results of performing the same steps, but using the PID-CD method, answering subquestion two.
In Chapter 6 the results of the two development methods are compared. Internal and external validity are checked, and contributions and limitations are described.
Concluding this thesis, the research questions are answered and recommendations for future work are given in Chapter 7.
Chapter 3
Theoretical Background
This chapter provides a theoretical background on the topics of ’predictive analytics’ and ’domain-driven development methods’. Next, the stages of Prediction Instrument Development for Complex Domains (PID-CD) [2] are explained.
3.1 Predictive analytics
According to Waller and Fawcett [10], data science is ‘the application of quantitative and qualitative methods to solve relevant problems and predict outcomes’. Besides, for example, database management and visualization, predictive analytics forms a subset of data science.
Predictive analytics is the process of developing prediction models, as well as evaluating the predictive power of such models [11]. A prediction model can be viewed as a function:
y = f (X)
The output of the model is represented by y, the variable to be predicted. X represents the (set of) input variable(s) [12]. By estimating this function, the relationship between X and y can be modelled and used to predict new values of y.
Training this function can be done by using a training set of data (e.g. 70 percent of a dataset), for which all values of X and y are known. Because the values are known, the relationship between the input and output variable(s) can be determined. After this, a test set (the remaining 30 percent of the data), using only the values of X, is used to test whether the trained function accurately predicts y.
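The split described above can be sketched in a few lines; the 70/30 ratio follows the text, while the helper name and the toy data are illustrative only:

```python
def train_test_split(rows, train_frac=0.7):
    """Split a dataset into a training part (first 70%) and a test part
    (remaining 30%), as described in the text."""
    cut = round(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

rows = list(range(10))                 # toy stand-in for (X, y) records
train, test = train_test_split(rows)   # 7 training rows, 3 test rows
```

In practice the split is often made chronologically for time-dependent data such as daily occupancy, so that the model is always tested on days that come after the days it was trained on.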
The process of predictive analytics is displayed in Figure 3.1. When using a function to predict a numerical outcome, the predictive model is referred to as a regression model. When the outcome is categorical, this is referred to as classification. Another prediction goal is ranking, which is used to "rank observations to their probability of belonging to a certain class" [11, p. 23].
Linear and Multiple Regression are the most important and most widely used prediction techniques [13]. Besides these, other techniques can be used, like Support Vector Regression, which can recognize subtle patterns in complex data sets [14], but also techniques like Decision Tree or Random Forest, which can be used for both classification and regression. Decision Trees (DT) consist of multiple nodes at which an attribute gets compared to a certain constant (often a greater-than or smaller-than comparison) [15]. Each branch represents an outcome of the comparison, and tree leaves represent prediction values (or classes in case of classification) [12]. Learning a DT is simple and fast, and the representation of results is intuitive and easy to understand [12]. Random Forest (RF) is a technique that uses multiple Decision Trees to create a prediction model. According to Breiman [16], using RF results in high prediction accuracy. The technique often achieves the same or better prediction performance compared to a single DT [17]. The process described before, and displayed in Figure 3.1, remains the same for these techniques, with the function being a decision tree or a forest.
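To make the DT-to-RF idea concrete, the sketch below fits depth-1 regression trees ("stumps") and averages many of them, each trained on a bootstrap resample, which is the core of Breiman's Random Forest. This is a deliberately minimal toy, not the implementation used in this thesis, and the data are invented:

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: pick the split on x that minimises
    the summed squared error of predicting the mean on each side."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                      # degenerate sample: predict the mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Toy 'random forest' for regression: average many stumps, each fit
    on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [10, 12, 11, 50, 52, 51]             # a step in the data around x = 3.5
predict = fit_forest(xs, ys)              # low x -> near 11, high x -> near 51
```

A real RF additionally grows deep trees and samples a random subset of features at every split; the averaging over bootstrap samples shown here is what reduces the variance of a single tree.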
As mentioned, evaluating the predictive power of a model is the second part of predictive analytics. Evaluating the accuracy of a (numerical) prediction model is done by calculating the difference (the error) between the known values y and the predicted values y’ [12].
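The error measures used later in this thesis (MSE, RMSE, MAE; see Chapter 4) all aggregate these per-observation differences. Their standard definitions are easy to state in code; the occupancy numbers below are made up for illustration:

```python
import math

def mse(y_true, y_pred):
    """Mean Square Error: average of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual    = [200, 180, 220]   # hypothetical observed occupancy
predicted = [190, 185, 210]   # hypothetical model output
# errors are 10, -5, 10, so MSE = 75, RMSE = sqrt(75), MAE = 25/3
```

Because MSE and RMSE square the errors, they penalise large misses more heavily than MAE does, which is one reason the thesis treats MAE as the most natural measure of average error.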
A prediction instrument, as developed in this research, is a combination of a predictive model (the trained function), the technique used to create it, its parameters, a data selection & refinement strategy, and (business) constraints (to determine whether or not the model is useful in practice) [2].
3.2 Domain-driven development methods
As mentioned in Section 1.1, developing prediction instruments becomes more difficult when dealing with complex domains. Predictive models often cannot be copied from existing ones, since every complex domain has its own unique characteristics. One way to develop actionable prediction models in such domains is to use domain-driven development methods.
According to Cao and Zhang [18], domain-driven data mining aims to develop specific methodologies and techniques to deal with complex (business) domains. When using a domain-driven approach, both objective and subjective factors can be included in a (predictive) model. Waller and Fawcett state that analysis and domain knowledge cannot be separated [10]. According
FIGURE 3.1: The process of developing and evaluating a prediction model. A function of y gets trained using a subset of the data. This function is used to predict the goal variable of the test set. The last step is evaluating the results by calculating the prediction error.
to the authors, data scientists need both a broad set of analytical skills and deep domain knowledge.
This domain knowledge, however, does not necessarily have to come from the data scientists themselves. One way of developing instruments with a domain-driven view is to use an Intelligence Meta-Synthesis (IMS) approach. IMS is a method for capturing soft factors, in the form of different kinds of intelligence, like human intelligence, data intelligence and domain intelligence [19]. According to Gu and Tang [20], ‘meta-synthesis emphasizes the synthesis of collected information and knowledge of various kinds of experts’. It is a methodology in which quantitative methods are combined with qualitative (domain) knowledge, obtained by consulting domain experts.
Van der Spoel et al. use IMS as a basis for their soft-inclusive domain analysis [2]. The domain analysis is soft-inclusive because, besides including hard factors, it also takes soft, domain-specific factors, like behaviour and culture, into account. Soft-exclusive domain analysis, on the other hand, only takes factors into account that are directly quantifiable (hard factors).
3.3 Prediction Instrument Development for Complex Domains
The development method designed by Van der Spoel et al. [2] is described below, as it is used in Chapter 5 of this thesis. The steps of the method are displayed in Figure 1.2. In the preparation stage the prediction goal is defined and experts are selected. In Stage I hypotheses and constraints regarding the domain are collected. In Stage II the collected hypotheses
are translated into datasets and the constraints are used to select the final datasets. These datasets are used to train predictive models, which need to comply with the given constraints. In the third stage, the final predictive models are chosen. These steps are discussed in more detail below.
3.3.1 Preparation Stage
As displayed in Figure 1.2, before hypotheses are collected, preparations need to be made. ‘What needs to be predicted?’ (the prediction goal) and ‘What are the characteristics of the problem domain?’ are questions that are answered during this preparation stage [2]. Another part of the preparation stage is to determine the experts who will be consulted later in the process.
3.3.2 Stage I: Qualitative assumptions
In the first core stage of the development method, hypotheses are collected and constraints are defined. Hypotheses are collected through brainstorming, individual interviews, field studies and/or literature review [2]. Brainstorming might have to be done anonymously, to ensure conflicting interests do not affect the results of the brainstorm.
After the hypotheses are collected, their number is reduced, to avoid having to test similar hypotheses. Selections are made by looking at the level of agreement. Only those hypotheses that are sufficiently different and interesting are taken into account in the development stage. Merging the hypotheses into one set T for testing is done by following these steps:
1. Translate the collected hypotheses into diagrams, showing the factors (constructs) and their relations;
2. Standardize and specialize the constructs (synonymous factors are replaced by one synonym; constructs are possibly replaced by their sub-constructs);
3. Determine the causal influence of the constructs and group the hypotheses with the same causal influence. One hypothesis per group gets added to the set of hypotheses to be tested, T.
The last part of the first stage is to define constraints. Through consulting experts, constraints are collected regarding the domain, data, deployment and interestingness. Domain constraints originate from the domain the prediction instrument is being developed for, for example having to comply with privacy standards. Data constraints are constraints on the structure, quantity and quality of data. Whether or not a prediction instrument can
actually be used in existing technological infrastructures relates to the deployment constraints. Finally, the interestingness constraint relates to the performance of the instrument.
3.3.3 Stage II: Predictive modelling
Once the hypotheses set is completed, prediction models are created. The hypotheses are translated into available variables for learning the models. After this selection of data, the data might need to be cleaned before usage (for example, deleting outlier rows). Through exploratory data analysis and consulting the experts, it can be checked which selection & cleaning strategies need to be applied. Next, the different selection & cleaning strategies are reduced by checking compliance with the data and domain constraints collected in Stage I. After that, prediction methods and parameters (like the size of the training/test set) are selected. For every strategy and every prediction method selected, predictive models are trained and evaluated using calculated performance measures. Based on the interestingness, deployment & domain constraints, models that do not meet the constraints are discarded.
3.3.4 Stage III: Model convergence
The final stage of the method is to select (a) predictive model(s). Domain experts are consulted to make this selection. Selection is done based on predictive performance, but factors like training time or personal preferences can also be taken into consideration. If a model gets selected it will, together with the data selection & cleaning strategy, prediction method, parameters and constraints, form the developed prediction instrument.
3.3.5 PID-SEDA
As explained, besides PID-CD, Prediction Instrument Development with Soft-Exclusive Domain Analysis (PID-SEDA) will be used to serve as a benchmark method. Using this method represents using a soft-exclusive approach to analysing a complex domain: almost no knowledge of the domain is used for selecting factors or algorithms. By comparing its results to the results of using PID-CD, the effect of including soft factors in analysing a complex domain is researched.
PID-SEDA is the soft-exclusive development method used in Chapter 4. The method differs from PID-CD by collecting hypotheses only through conducting a literature review. The predictive modelling stage is similar to the one in PID-CD, except that no domain, deployment and interestingness constraints need to be met. At the end of the (iterative) process, the best predictive model is chosen based on predictive power [2].
Chapter 4
Soft-exclusive development
This chapter presents the results of using the soft-exclusive development method (PID-SEDA) to develop a prediction instrument for predicting parking lot occupancy in The Edge. The different stages of the method, as well as the final prediction instrument developed, are presented below.
4.1 Stage I: Assumptions
The first stage of the soft-exclusive development model focuses on gathering assumptions on how to predict parking lot occupancy. The prediction goal is determined and a structured literature review is performed to see which factors are mentioned in existing literature. To conclude this stage, a description of the available data is given.
4.1.1 Goal definition
The prediction goal is to predict the occupancy of the parking lot of The Edge. Occupancy is the number of (Deloitte) cars currently in the parking lot. Different time windows are tested: predicting occupancy half an hour in advance, two hours in advance, and the evening before the predicted moment. The output (occupancy) lies in a wide range of possible, continuous values (approx. 0-250 cars). Therefore, a regression approach is taken, trying to predict the exact number of cars in the parking lot (referred to as a prediction goal [11, p. 23]).
The performance of the model(s) is determined by calculating the mean squared error (MSE), the root mean squared error (RMSE) and the mean absolute error (MAE). See Table 4.1 for a description of these measures. The MSE, RMSE and MAE are scale-dependent measures, useful when comparing different methods applied to the same dataset [21]. These measures will be used to select the best soft-exclusive developed prediction model: the lower the error, the better the model. MAE will be treated as the most relevant measure, since it is the most natural and unambiguous measure of average error [22]. Another often-used performance measure is the mean absolute
TABLE 4.1: Performance Measures

Measure                          Formula
Mean squared error               MSE  = (1/n) Σ_{t=1}^{n} e_t^2
Root mean squared error          RMSE = sqrt( (1/n) Σ_{t=1}^{n} e_t^2 )
Mean absolute error              MAE  = (1/n) Σ_{t=1}^{n} |e_t|
Mean absolute percentage error   MAPE = (100%/n) Σ_{t=1}^{n} |e_t / y_t|

n = the number of prediction values. e_t = prediction error: the difference between the t-th prediction and the t-th actual value. y_t = the t-th actual value.
percentage error (MAPE) (see Table 4.1), which can be used to compare prediction performance across different datasets [21]. This measure cannot be used in this case, since the actual data frequently contains zeros (for example, occupancy at night), resulting in an undefined (infinite) MAPE.
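The measures in Table 4.1 translate directly into code. The sketch below uses illustrative occupancy numbers; the zero actual value demonstrates concretely why MAPE breaks down on this data.

```python
import math

# Direct translations of the measures in Table 4.1; the occupancy numbers
# below are illustrative. The zero actual value demonstrates why MAPE
# cannot be used on this data.

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    # Division by a zero actual value (e.g. an empty lot at night) makes
    # this measure undefined, which is why it is not used in this case.
    return 100 / len(actual) * sum(abs((a - p) / a) for a, p in zip(actual, predicted))

actual = [200, 180, 0, 150]      # occupancy, with an empty lot at night
predicted = [190, 185, 5, 160]
```

For these values, MAE is 7.5 cars and MSE is 62.5, while the call to `mape` fails on the zero actual value.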
4.1.2 Literature review
To select factors from existing literature, a structured review, as described by Webster and Watson [23], is conducted. Criteria for inclusion and/or exclusion are defined; fields of research are determined; appropriate sources are selected; and specific search terms are defined. After the search, the results are refined by language, title, abstract and full text. Forward and backward citations are used to search for new relevant articles, until no new articles come up [23]. These steps are described in detail below.
Inclusion/exclusion criteria
To ensure the relevance of the selected articles, inclusion and exclusion criteria are determined. Articles should mention the topic of parking or anything synonymous. Real-time, near-real-time and non-real-time predictions are all included, to collect as many hypotheses as possible. Articles that do not use empirical data are excluded, since we try to find articles which test factors influencing parking lot occupancy.
Fields of research
No limits on fields of research will be set, since non-related articles will be filtered out in the refinement steps.
TABLE 4.2: Search Terms

Prediction terms:  Predicting, Prediction
Synonyms:          Parking lot, Parking space, Parking area, Parking spot, Lay-by, Garage, Parking
Prediction goals:  Occupancy, Availability
Sources
The sources used for the search are Google Scholar and Scopus. According to Moed, Bar-Ilan and Halevi [24], both of these databases cover a set of core sources in the field of study. Although Scopus is a good source for finding published articles, Google Scholar can add to a search by also showing 'intermediary stages of the publication process' [24]. Using both databases can therefore provide a comprehensive search.
Search
Table 4.2 displays the specific search terms that are used for the literature search. Besides 'occupancy', 'availability' is also used as a prediction goal, since predicting occupancy can also be done by predicting the number of available spots left. In the middle, synonyms for 'parking lot' are given. Using different synonyms in the literature search limits the impact of differences in terminology.
All possible combinations of these terms, synonyms and goals are used in the search, resulting in a total of 461 articles (354 from Google Scholar, 107 from Scopus).
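The full set of query combinations from Table 4.2 can be enumerated mechanically (2 prediction terms × 7 synonyms × 2 goals = 28 queries, each run against both databases); the sketch below illustrates this.

```python
from itertools import product

# Enumerating every combination of the terms in Table 4.2, as used in the
# literature search (2 prediction terms x 7 synonyms x 2 goals = 28 queries).

prediction_terms = ["predicting", "prediction"]
synonyms = ["parking lot", "parking space", "parking area",
            "parking spot", "lay-by", "garage", "parking"]
goals = ["occupancy", "availability"]

queries = [f"{term} {synonym} {goal}"
           for term, synonym, goal in product(prediction_terms, synonyms, goals)]
```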
Refine sample
The results are refined using the following steps:
1. Filter out duplicates
2. Filter by language
3. Refine by title
4. Refine by abstract
5. Refine by full text
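These steps form a sequential filter pipeline: each step only sees the articles that survived the previous one. A hypothetical sketch of this idea (the article records and predicates are illustrative; the real review applied these steps manually):

```python
# Hypothetical sketch of the five refinement steps as a filter pipeline.
# The article records and the concrete predicates are illustrative; the
# real review applied them manually.

def dedupe(articles):
    # Step 1: filter out duplicates (the same title found in both databases).
    seen, unique = set(), []
    for article in articles:
        key = article["title"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique

def refine(articles, steps):
    # Apply each remaining filter in order, keeping only surviving articles.
    for keep in steps:
        articles = [a for a in articles if keep(a)]
    return articles

articles = [
    {"title": "Parking occupancy prediction", "language": "en"},
    {"title": "Parking Occupancy Prediction", "language": "en"},   # duplicate
    {"title": "Vorhersage der Parkplatzbelegung", "language": "de"},
    {"title": "Traffic flow modelling", "language": "en"},
]

remaining = refine(dedupe(articles),
                   [lambda a: a["language"] == "en",                # step 2
                    lambda a: "parking" in a["title"].lower()])     # step 3
```

The abstract and full-text steps (4 and 5) require human judgement and cannot be expressed as simple predicates, which is why the sample sizes in Figure 4.1 were determined by reading.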
FIGURE 4.1: Filter & refinement steps of the structured literature review. n is the number of articles remaining after the preceding refinement step.
Figure 4.1 displays the refinement of the search results. The search results contained 79 duplicates, because two different sources were used. Seven articles were removed because they were not written in English. Articles whose title did not mention 'parking' (or related terms) were removed. The abstracts of the remaining 85 articles were read, and articles that did not seem to contribute to the purpose of predicting parking lot occupancy were removed. The remaining 35 articles were read in full, leaving nine relevant articles.
Forward & backwards citations
The nine selected articles were cited by, and referred to, 147 articles in total. The same refinement steps as above were applied, resulting in two new articles. These two new articles contained two new citations, which were removed from the list after reading the abstracts. This structured review resulted in eleven articles useful for selecting variables in the soft-exclusive development method.
Analysis
The final factors which will be used in the prediction model, derived from the eleven articles, can be found in a concept matrix (Table 4.3), as recommended by Webster and Watson [23].
The factor time of day is mentioned in most articles. The occupancy of a parking garage might for example be higher during business hours and low during the night, or vice versa if it is a residential garage.
The second factor derived from literature is day of week. Whether it is a working day or a non-working day (like in the weekend), might influence occupancy.
Weather is the third factor, displayed in Table 4.4, mentioned in three different articles. Weather conditions might influence people’s choice to go by car or not.
Holidays is another straightforward factor derived from literature. Whether or not it is a holiday (e.g. Christmas) likely influences the occupancy of a parking garage.
A factor mentioned only by David, Overkamp and Scheuerer [28] is the effect of a day falling between a holiday and a weekend. Many people take, or have to take, a day off on such days, possibly affecting parking lot occupancy.
Where Chen et al. [29] and Reinstadler et al. [30] only mention normal holidays as an influential factor, David et al. also mention the influence of school holidays [28].
TABLE 4.3: Soft-exclusive prediction factors (concept matrix)

Concepts, part 1: Time of day | Day of week | Events | Weather | Holidays

Chen (2014):                   X X X
Chen et al. (2013):            X X
David et al. (2000):           X X X X
Fabusuyi et al. (2014):        X X X X
Kunjithapatham et al. (n.d.):  X X
McGuiness and McNeil (1991):   -
Richter et al. (2014):         X
Reinstadler et al. (2013):     X X X X
Soler (2015):                  X X
Vlahogianni et al. (2014):     X X X
Zheng et al. (2015):           X X

Concepts, part 2: Historic occupancy | Day between holiday and weekend | School holidays | Parking lot accessibility | Parking price

Chen (2014):                   X
Chen et al. (2013):            -
David et al. (2000):           X X
Fabusuyi et al. (2014):        -
Kunjithapatham et al. (n.d.):  -
McGuiness and McNeil (1991):   X
Richter et al. (2014):         -
Reinstadler et al. (2013):     -
Soler (2015):                  -
Vlahogianni et al. (2014):     X
Zheng et al. (2015):           X