Data-driven modelling of internal and external factors' impact on bus punctuality.


Patrick Wernke

10770313

Master Thesis Credits: 42 EC

Masters Computational Science
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: dr. Valeria Krzhizhanovskaya
Examiner: dr. Mike Lees
Second Assessor: mr. Mert Dekkers

November 1st, 2019


Abstract

With strict standards set by authorities, increasing urbanization and growing interest in environmentally friendly solutions to traffic problems, the need for efficient bus management systems has created a desire for more insight into their punctuality. Analyzing recorded bus travel data as a complex system with many influencing factors can help guide current methods of optimizing punctuality. This is carried out in 4 steps: (i) description and combination of the data; (ii) punctuality sensitivity analysis of external (e.g. traffic, weather and population demographics) and internal (e.g. previous punctuality and peak-usage) factors; (iii) creation of a predictive and a mathematical model based on these factors; (iv) best practice prescriptions for public transport operators. Previous research has focused primarily on late departures and single case studies, whereas this research addresses the whole punctuality spectrum, covering both early and late arrivals and departures, using bus travel data from multiple transit networks in the Netherlands. The model shows a statistically relevant influence of weather, traffic and time-dependent factors. With this, a method of responding to 'bad' punctuality forecasts is devised by adjusting a bus trip's departure time.


1 Introduction
2 Literature Review
    2.1 Passenger satisfaction
    2.2 Metrics
    2.3 System modelling
    2.4 Data-driven optimization
        2.4.1 Result-driven
        2.4.2 Performance-driven
    2.5 Prediction
3 Method
    3.1 Design overview
    3.2 Case study
    3.3 Data sources
    3.4 Models
    3.5 Simulation
4 Data processing
    4.1 Standardization
    4.2 City definition
    4.3 Internal
    4.4 External
        4.4.1 Weather
        4.4.2 Traffic
        4.4.3 Demographic
    4.5 Integration
    4.6 Factor definition
    4.7 Characteristics
        4.7.1 Ranges
        4.7.2 Correlations
5 Model creation
    5.1 General model
    5.2 Output definition
        5.2.1 Training
        5.2.2 Analysis
    5.3 Stop specific model
        5.3.1 Training
        5.3.2 Analysis
    5.4 Mathematical model
    5.5 Accuracy
6 Results
    6.1 External variations
    6.2 Start time adjustment
    6.3 Model validation
        6.3.1 Trip comparison
        6.3.2 Sensitivity analysis
        6.3.3 Parameter optimization
7 Conclusion
8 Discussion
    8.1 Validity results
    8.2 Future research
9 Acknowledgements


List of Abbreviations

APC Automatic Passenger Count.

AVL Automatic Vehicle Location.

LR Linear Regression.

NN Artificial Neural Network.

OTP On-Time Performance.

PT Public Transport.

PTO Public Transport Operator.


1 Introduction

One of the most important global trends of the past decades is urbanization [30]. Both city land mass and population size are increasing on all continents at such a rapid rate that it has prevented organic or well-planned urban development. For inhabitants this affects travel times, because road infrastructure that is poorly designed for the growing population leads to congestion and incidents [7]. Public Transport (PT) is considered a critical factor in coping with the urbanization process. By efficiently and methodically transporting a large number of passengers, PT can decrease the amount of traffic in a city.

Another important global issue is climate change. The greenhouse gas emissions of personal vehicles are stated as one of the largest negative influences on the environment [5]. This means that PT also has an environmental incentive to further decrease the amount of traffic. The incentive is also reflected in the increase in usage of electric vehicles, for both personal vehicles and buses [36].

Based on these issues, governments are major stakeholders in the performance of Public Transport Operators (PTOs). For this reason, Public Transport Authorities set clear performance standards that each PTO has to adhere to. An example is the set of standards issued by the Senior Traffic Commissioner for Great Britain for local bus service operators [29]. While trying to achieve the set performance standards, PTOs also have their own incentive to reduce costs and increase revenue. A key performance indicator that influences these 3 goals is punctuality, also referred to as On-Time Performance (OTP). The Public Transport Authorities strive for good punctuality to satisfy and increase the number of PT users.

Punctuality can be seen as the adherence to schedule, meaning the percentage of vehicles at the desired stop within a fixed time frame. This means a binary representation of success P for a vehicle arriving or departing with deviation δ, where a and b are the lower and upper bounds of the deviation in minutes:

$$
P = \begin{cases} 1, & \text{if } a < \delta < b \\ 0, & \text{otherwise} \end{cases} \tag{1}
$$
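As a worked illustration of equation 1, the sketch below computes the indicator P for each recorded deviation and the resulting punctuality percentage. The bounds of 60 seconds early and 300 seconds late, the function names and the sample deviations are assumptions chosen for illustration only; negative values denote early vehicles.

```python
def on_time(deviation_s, lower_s=-60.0, upper_s=300.0):
    """Equation 1: return 1 if lower_s < deviation_s < upper_s, else 0."""
    return 1 if lower_s < deviation_s < upper_s else 0


def punctuality(deviations_s, lower_s=-60.0, upper_s=300.0):
    """Percentage of recorded events that fall inside the on-time window."""
    if not deviations_s:
        raise ValueError("at least one deviation is required")
    hits = sum(on_time(d, lower_s, upper_s) for d in deviations_s)
    return 100.0 * hits / len(deviations_s)


# Hypothetical deviations in seconds (negative = early, positive = late).
sample = [-90, -30, 0, 120, 299, 300, 480]
print(punctuality(sample))
```

Because the inequalities in equation 1 are strict, a deviation exactly on a bound (300 seconds in the sample) counts as not on time.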

A closely associated term is reliability, which can be seen as the combination of punctuality and the number of cancellations. However, reliability and punctuality are often used interchangeably when discussing bus performance.

The goal of this thesis is to provide better insight into the factors that influence bus punctuality using historical data. In the past 15 years, buses have seen a widespread implementation of various data collection and management systems, such as Automatic Vehicle Location (AVL), Automatic Passenger Count (APC) and planning management using HASTUS. Analyzing punctuality using these data sources has proven to be a complex optimization problem, meaning it is still a rich field of scientific and business-oriented research.

The high complexity of bus operations stems from the following 5 processes, as defined by [3]:


2. High level of demand and human behavioural uncertainty.

3. Unpredictable operational events and random disturbances (congestion blockades, accidents).

4. Unreliable operation of the driver-vehicle entity.

5. Complex interactions with other vehicles, passenger demand and the environment.

These processes, which in turn consist of different sub-processes that all influence the bus operation in some way, are impossible to incorporate in a single model without resorting to approximations. A perfect model does not exist, which means that events or actions of the agents have to be evaluated as a stochastic process. This stochasticity has to be derived from experience or historical data sources. The latter will be the main focus of this research. Creating more insight into these complex processes by carefully combining and analyzing different data sources for parameter optimization can increase the accuracy of bus operation models.

In particular, processes 1 and 5 are modelled in this research. Due to the complex nature of these processes, artificial neural networks are trained and optimized to predict punctuality based on these factors. With sensitivity analysis the influence of each factor is quantified to indicate its relevance to the problem. To ensure that the smaller factors are not overlooked, punctuality is measured and predicted in seconds either early or late. By looking at this whole spectrum of punctuality, instead of reducing it to the binary indicator of equation 1, this research sets itself apart from previous methods. The method is applied to bus lines from multiple PTOs, allowing for a higher level of generality and reliability than previous work, which is almost exclusively applied to a single line.

More recent years have seen the introduction of Bus Rapid Transit, which combines the speed and reliability of a rail service with the operating flexibility and lower cost of a conventional bus service [14]. This is achieved through dedicated bus lanes, off-board fare collection and priority at traffic lights. These systems circumvent some of the causes of unpunctual service in traditional bus transit. This research targets traditional bus services, without resorting to major infrastructural changes. This means that all suggestions made should be applicable to currently operational systems and PTOs.

Research Question

What is the influence of internal and external factors on bus punctuality and can it be used for predictive modelling?

In this research the difference between internal and external factors is based on the source of the underlying data: a factor is internal if the data is provided by a PTO and external otherwise. The internal factors defined and analyzed in this research are peak-usage, previous punctuality, day of the week and month of the year. The external factors are descriptors of traffic, weather and passenger demographics. The hypothesis is that the external factors have a large influence, because drivers have to respond to them while operating the vehicle, whereas they can prepare for the internal factors before the trip.


2 Literature Review

This literature review discusses papers that in some way address the punctuality of public transport. The topic has been examined from different angles, such as optimizing the scheduled timetable, frequency, headway or reliability. Often, the goal of previous research is not punctuality directly, but the passengers' waiting time or their usage cost, which can have many definitions.

2.1 Passenger satisfaction

The operator's main goal of retaining current PT users and attracting non-users is to be achieved by increasing customer satisfaction. In order to do this, the shortcomings of the bus as the preferred choice over any other mode of transport have to be viewed from the user's perspective. This can be done passively, by analyzing filed complaints, or actively, by means of a passenger survey. These 2 methods have been applied by [16] for a PTO in Göteborg, Sweden. The conclusion was that drivers know their passengers poorly, because their perception of what customers dislike is not in line with the results of the passengers' complaints and survey. An interesting finding was that passengers complain about driver behaviour, yet when asked they name punctuality as the primary cause of dissatisfaction. This shows that optimizing punctuality will improve customer satisfaction, but other factors such as driver behaviour, fares and technical faults also play an important role.

In order to get a better understanding of the importance of punctuality for customer satisfaction, [28] estimate the benefits of improved punctuality for passengers and operators. A survey was conducted in Nagaoka, Japan, to estimate the cost improvement for both users and operators when bus punctuality is optimal (perfect arrival and departure at the scheduled times). They estimated a 20% decrease in passenger cost and a 9% increase in PTO revenue. This revenue consisted of increased passenger numbers, estimated using modal choice (i.e. bus, car, bike), and decreased operation costs. Although this research had a relatively small sample size (<200 users) in a single case and assumed perfect punctuality, it does show multiple benefits of improving punctuality.

The goal of this thesis is to reach general statements about increasing bus punctuality that could be used globally. A survey [4] in Dhaka City, Bangladesh, shows that this is not possible for every city. This rapidly growing and developing city has such irregular road usage (car, bus, rickshaw, bike) with high levels of congestion that punctuality was not even considered for their questionnaire, nor was there any form of automatic data gathering. They found comfort levels to be the largest cause of customer dissatisfaction. This shows that this thesis is better suited to technologically advanced PTOs in strongly regulated traffic.


2.2 Metrics

Clear standards for punctuality are needed for multiple reasons. Firstly, it is important that researchers maintain the same standards, so that results can be compared and correctly interpreted. Secondly, PTOs need them as internal performance indicators, in order to quantify how well they are doing as a business. In [8] a survey is held among 146 PTOs as to their definitions of OTP. Using the bus arrival and departure times, a range from 1 minute early to 5 minutes late would capture most of the standards. However, with the increase in accurate and sophisticated data sources, the definitions of punctuality and reliability have seen debate, because more inclusive options are now available.

In [6] the authors propose a measure defined as the fraction of passengers who will be served within an acceptable time of arrival. They apply the measure to multiple routes over a period of 4 weeks in Cagliari, Italy. Moreover, they provide steps to clean the raw AVL data of anomalies such as bus overtakings and failures, in order to create a fair assessment. They conclude that this measure is useful for Business Intelligence.

In [10] a paradigm shift is proposed. Instead of timetable-based operations, where a bus has to be at specific locations at specific times, regularity-driven operation is presented as the future type of operation. Regularity-driven means controlling the headway between vehicles, which works well for high-frequency routes [11]. It is applied to Stockholm, Sweden, and results in shorter waiting times and better capacity. A business model is created that uses the difference in headway as a performance indicator instead of OTP. In [21] three punctuality indices are described: adherence, regularity and evenness. They are applied to 22 routes in Seoul, South Korea. Adherence is the standard OTP measure, whereas regularity is the one described earlier and evenness describes the deviations in headway on a single day. The paper describes the effects of 8 factors on the evenness measure and concludes that evenness can show the effects of some of the factors with significant confidence.

These previous studies emphasize the importance of using multiple, clearly defined punctuality measures that are also commonly used by PTOs and academia. Clearly defining the measure will provide more complete and communicable results.

2.3 System modelling

There are 2 commonly used ways of creating models. The first, which will be described in this section, is to reason about the influencing variables and to define their interactions mathematically. By giving each interaction a parameter, data can be used to fit these parameters. The alternative, described in the next section, is to infer the underlying interactions from the data, as well as the fitting of its parameters. By analyzing the weights, the influence of variables on the outcome can be found.

The earliest papers that incorporate a bus passenger's cost are based on a deterministic corridor frequency model, with the goal of finding the optimal dispatching rate of buses on a single route. In [23] the users' waiting time is analyzed by changing the bus frequency, in order to increase fare prices at a reasonable rate, which then optimizes the PTO's operational revenue. They found a "square root formula" of user demand that defines the frequency as a function of passenger waiting cost, number of passengers and single bus operation cost over time. In [19] an extension was made assuming buses always stop and removing the number of bus stops as a variable. As is often the case in complex system modelling, this too saw multiple extensions in order to be more inclusive and accurate. An example is the inclusion of the influence of vehicle size on operation cost and crowding in [20]. In [37] this model is applied to Jaipur, India, without resorting to estimates. It recommended a change in frequency and number of stops, and discussed changes to the model to include congestion and headway control. Overall, the models lack accuracy for day-to-day operations, where the existence of stochastic processes can have a large impact.
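To illustrate how a square root formula of this kind arises, consider the simplest textbook setting, which is an assumption made here and not necessarily the exact formulation of [23]: buses run at frequency $f$ per hour, each bus-hour costs $c_o$, $q$ passengers per hour arrive at random so that the mean wait is half the headway, $1/(2f)$, and waiting time is valued at $c_w$ per passenger-hour. The total hourly cost $C(f)$ and its minimizer are then:

```latex
C(f) = c_o f + \frac{c_w q}{2 f}, \qquad
\frac{dC}{df} = c_o - \frac{c_w q}{2 f^{2}} = 0
\;\Longrightarrow\;
f^{*} = \sqrt{\frac{c_w q}{2 c_o}}
```

Under these assumptions the optimal frequency grows with the square root of demand rather than proportionally to it.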

In [22] steps are made towards a non-deterministic bus scheduling model, by having a stochastic user demand during each day, using a normal distribution based on estimates. They found that their stochastic demand model had a positive influence on temporal coverage as compared to deterministic alternatives.

In [3] stochastic disturbances are modelled in a schedule optimization method, by means of dynamic feedback regulation based on control matrices. The application of the control support tool proved useful in simulation, but required the use of on-board computers for distributed dispatching control, which will be reviewed further in the following data-driven research section.

2.4 Data-driven optimization

In this section, previous research that uses data-driven optimization is described. This means finding both the system's interactions and their weights based on data, without making any assumptions as to the underlying model. Doing this can result in finding unexpected interactions, which is the goal of this research.

First, it is useful to show the importance of advanced data management systems and their influence on punctuality. In [24] the benefits of Intelligent Transport Systems, such as Automatic Vehicle Location (AVL), Real-Time Passenger Information and Urban Traffic Control, are analyzed for 7 PT networks in Europe. In Madrid, for example, punctuality rose from 93% to 96%, while user satisfaction increased by 6%. This emphasizes the role of PTOs as the executing party as well. This thesis project will be carried out in such a way that novel findings can actually be implemented.

Before doing any form of data analysis, the data has to be processed. In [18] a modern intelligent bus system is described. The paper stresses the importance of where the processing happens (in-vehicle or at terminals) and of the database infrastructure (often non-relational). The latter especially complicates holistic analysis, whereas the former determines which optimizations are possible. These data system details have to be described in research to ensure reproducibility of the results. Furthermore, the paper shows benefits of traffic light priority for London, United Kingdom, which closes the gap between traditional bus systems and Bus Rapid Transit.

To this end, many researchers have tried to use this relatively new and growing source of information to increase bus punctuality in a variety of ways. Their work can be divided into two sub-problems, both of which influence punctuality. The first is result-driven. By fitting the timetable to match previously recorded arrival and departure data, accurate scheduling can be achieved. The second problem comes from the performance-driven perspective. By combining data from different sources, factors influencing punctuality can be extracted, analyzed and optimized.

2.4.1 Result-driven

In [35] one of the earliest models to analyze passenger wait times by estimating bus and passenger arrivals is proposed. The bus times are modelled as log-normal and the passenger times as a combination of random and normally distributed non-random arrivals. The proportion of non-random arrivals was found in an empirical study. The same researchers refined their results in [9] by making the proportion of non-random passenger arrivals depend on the reliability of service, under the assumption that more punctual buses lead to more passengers deciding to plan their arrival. This model described the data more closely than previous models. Current data systems do not capture when a user arrives at a bus stop, which means that this research is still valuable in modelling passengers.

In [27] an addition to the HASTUS software is proposed. This software package is widely used by PTOs to optimize their crew and vehicle scheduling based on automatic data gathering systems. The addition includes run-time values and incident-recovery times, while addressing causes of non-compliance. To ensure punctuality, hierarchical time-bands are introduced that deal with variability in round-trip times for different segments of the day. It was applied to a bus route in Barcelona, Spain, resulting in improved punctuality and a satisfied PTO. However, it did indicate structural faults in the data gathering process.

In [11] the impact of holding strategies on punctuality was modelled. Holding means having a bus wait at a stop when it is ahead of schedule, in order to adhere to the timetable. The strategies were applied in simulation, using BusMezzo, to a high-frequency line in Stockholm, Sweden. The study found that the best practice was to keep the mean headway to the leading and trailing bus. This optimizes passenger waiting time, fleet cost and crew management, without influencing the punctuality.
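The two holding ideas in this paragraph can be sketched in a few lines: schedule-based holding, and a simplified even-headway rule in which a bus leaves midway between its leader's departure and its follower's expected arrival. The second rule is one possible interpretation of keeping the mean headway to the leading and trailing bus; it is not the BusMezzo implementation, and all names and numbers are hypothetical. Times are in seconds since the start of service.

```python
def schedule_hold_departure(arrival_s, dwell_s, scheduled_departure_s):
    """Hold a bus that is ready early until its scheduled departure time."""
    ready_s = arrival_s + dwell_s
    return max(ready_s, scheduled_departure_s)


def even_headway_departure(ready_s, leader_departure_s, follower_arrival_s):
    """Hold until the midpoint between the leader and the follower.

    Assumption: the follower's expected arrival time at this stop is known.
    A bus that is already past the midpoint departs immediately.
    """
    midpoint_s = (leader_departure_s + follower_arrival_s) / 2.0
    return max(ready_s, midpoint_s)


# An early bus is held to the timetable; a late bus is not delayed further.
print(schedule_hold_departure(arrival_s=100, dwell_s=20, scheduled_departure_s=150))
print(schedule_hold_departure(arrival_s=100, dwell_s=20, scheduled_departure_s=110))

# Leader left at 300 s, follower expected at 1100 s, this bus is ready at 600 s.
print(even_headway_departure(ready_s=600, leader_departure_s=300, follower_arrival_s=1100))
```

Neither rule ever moves a departure earlier than the moment the bus is ready, which matches the definition of holding as waiting only.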

2.4.2 Performance-driven

In [13] bus service reliability is analyzed on stop, route and network levels, by examining correlations in the data for 3 reliability measures. Data is gathered from 36 routes in Beijing, China, to estimate the effect of 4 factors on service reliability. Route length, headway, bus lanes and distance from stop to origin terminal are described for the different reliability measures. The results from this research will be re-evaluated and extended using predictive models in this thesis.

In [2] the following factors affecting bus run-time were examined: link length; total number of intersections, also segmented by number of signalized intersections, unsignalized intersections with and without bus right of way; number of passenger stops; link proportion of allowed parking; boardings; alightings; time period; direction of travel; and running time deviation. The factors were tested on their statistical influence and correlations, concluding that mean running time is strongly influenced by trip distance, people boarding and alighting, and signalized intersections. All the parameter estimates were statistically significant, with the independent variables accounting for over 90% of the variation in running times. The authors discuss their limited data, and how the collinearity makes "cause and effect" models difficult. In [1] the same researchers propose a method for PTOs to optimize their reliability. They apply it to a case in Cincinnati, Ohio, with validation on data from Los Angeles, California. Based on mean run-time, run-time variation, headway variation and passenger wait-time, the optimal control strategy can be found as either schedule- or headway-based. It is one of the earliest papers to produce this as user-friendly software, which is also the goal of this research. In [33] the gap between reliability-focused and punctuality-focused studies is closed. It describes internal and external factors and the difference between short- and long-term remedies. The authors propose a multinomial logit model based on a medium-sized case study in Portland, Oregon, and proceed to find small specific changes that can be made to optimize punctuality. In [32] the same group performed a similar analysis on AVL and APC data, when these systems were first introduced in PT in 1999. It showed improvements in on-time performance, headway regularity, and running time variation.

In [26] travel time reliability is increased by looking at the removal of in-bus top-up of e-tickets. A case study is performed in Queensland, Australia, where the driving time, boarding time of different passenger types and passengers per stop are recorded. Setting the boarding time for top-up passengers to that of standard passengers leads to a 15% increase in punctuality. Even considering the small sample size and rough estimates, this shows that efficient payment methods can have a large influence on the variability of dwell times and thus indirectly on punctuality.

In [25] promising bus arrival estimation improvements are obtained by modelling distance, number of stops, dwell times, boarding and alighting passengers, and weather descriptors. The latter seemed unimportant, whereas demand seemed to be highly correlated. However, it should be noted that the authors stressed the small size of their data set, and thus these results should be verified further, which will be part of this research.

In [15] various factors are analyzed that can influence dwell time: passenger activity, lift operations, floor bus height, time of day, and route type. It provides a detailed account of rare factors that largely increase dwell time, in particular that of lifting disabled passengers.

These performance-driven models have analyzed many different factors influencing punctuality. This thesis will build upon this by adding even more possible factors. However, to test the impact of control methods on these factors in different scenarios, predictive models have to be used.

2.5 Prediction

In [12] artificial neural networks (NN) and Kalman filters (KF) are used to predict bus arrival time, based on APC and weather data. NNs were used to circumvent the need to define precise cause and effect relationships. The NN proved better than the schedule in all cases. The KF was used to update the NN predictions during a trip, because its value lies in its ability to use new information dynamically. Applying it led to very accurate arrival estimations. Here we can see the influence of up-to-date information: the schedule dates from the start of the season, the NN prediction from the start of the trip and the KF estimate from the last stop.
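The role of a KF in refining a trip-start prediction with each newly observed stop can be sketched with a one-dimensional filter. This is a generic scalar Kalman filter under a random-walk assumption for the delay, not the filter of [12]; the noise variances and the sample observations are hypothetical.

```python
def kalman_update(estimate, variance, observation,
                  process_var=25.0, obs_var=100.0):
    """One predict-and-update step for a scalar delay estimate in seconds.

    process_var and obs_var are assumed noise variances (seconds squared).
    """
    # Predict: the delay is modelled as a random walk, so the estimate is
    # carried over unchanged and only its uncertainty grows.
    variance = variance + process_var
    # Update: blend the prediction and the observation by their uncertainty.
    gain = variance / (variance + obs_var)
    estimate = estimate + gain * (observation - estimate)
    variance = (1.0 - gain) * variance
    return estimate, variance


# Hypothetical trip: a prediction of 60 s late at trip start, followed by
# observed delays at three consecutive stops.
estimate, variance = 60.0, 400.0
for observed in [90.0, 110.0, 120.0]:
    estimate, variance = kalman_update(estimate, variance, observed)
    print(round(estimate, 1), round(variance, 1))
```

Each update pulls the estimate towards the latest observation and shrinks its variance, which reflects the value of up-to-date information described above.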

In [31] two KFs were used to predict running and dwell times. This is different from other models, which have always estimated the running and dwell times together, even though many of the proposed optimizations come from changing the dwell times. It is compared with linear regression algorithms and an NN, made for the same dataset consisting of 5 days on a single route in Toronto, Canada. The KF method worked better for most measures, only outperformed by the NN in 1 of 4 test cases.

In [40] bus running times are estimated for multiple routes, which sets it apart from previous research. Four models are used: support vector machines, artificial neural networks, k-nearest neighbours and linear regression. Three error measures are used to fully describe the benefits of each estimator. It was applied to data from 3 days on 14 routes at 2 stops in Hong Kong, China. Single route models were significantly worse than the multi-route models, of which support vector machines turned out to be most reliable and precise for all measures.

In [34] the running time of a bus is estimated when the number of stops on a route is reduced. This can be used when splitting a single bus route into a frequently stopping or regular line and a sparsely stopping or express line. It was applied to a single route in Montreal, Canada, covering 6,000 trips. Using multivariate linear regression to calculate the influence of different factors, the authors concluded that selecting heavily used stops spaced between 800 and 1,600 meters apart could best be used to reduce bus running time.


3 Method

The research question stated in the Introduction can be dissected and subdivided into a large number of tasks in order to come to an answer. As seen in the literature, the same problem can also be solved with different methods. This means that an overview of the design difficulties and decisions is needed to explain the reasoning behind the actual implementation and analysis. This will also show the range to which this research is applicable.

3.1 Design overview

This research tries to satisfy the following requirements:

• The punctuality estimation has to be accurate.

• The model needs to have a high degree of generality.

• The results have to be descriptive of the underlying system.

The demand for accuracy is fairly straightforward in that a model with high accuracy is a good estimator for the unknown variable. This leads to confidence in the chosen method and reliability of the result. However, it does require a thorough experimental setup that excludes the existence of a training bias. The high level of accuracy will be achieved by using good quality data sources. By having multiple input variables from different sources, the complexity of the underlying system can be modelled more precisely than by reverting to estimates for these parameters.

Generality in this sense means the re-usability of the results and the implementation itself for other researchers. The reason for this is that it can actually contribute to the body of knowledge already established for PT. By using open source data, where available, and a clearly defined implementation with descriptions of all steps, the research should be replicable for other cases. These cases can be either different locations, types of transport or a similar research question.

The results have to be usable for a description of the underlying processes within the system. This is required for getting insight into the reasons why the system behaves the way it does, as well as for the creation of best practice rules for PTOs. This will be achieved by multivariate analysis of the factors that can be influenced. Experimentation with slight changes to the influenceable parameters can then be used to show the impact of policy changes.

Taking these requirements into account, a conceptual outline for the project can be created. The outline of the data processing and modelling pipeline is shown in figure 1. It has five different stages that have individual tasks. In the Data Sources stage the historical data for both internal and external sources is gathered. In the Data Processing stage the data is written to columnar storage, useful columns are selected, values are parsed and combined with different sources. This leads to a set of Normalized Factors, which get analyzed based on their correlations. These factors are used as input for a Regression Model, which can learn to calculate an approximated output. This output is the Predicted Punctuality.

Figure 1: Conceptual overview of the data processing and modelling pipeline.
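The step from processed columns to Normalized Factors and their correlations can be sketched as follows. Min-max scaling and the Pearson coefficient are assumptions made for illustration, since the overview does not fix either choice; the column names and values are hypothetical.

```python
from math import sqrt


def min_max(values):
    """Scale values linearly to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical processed columns: hourly rainfall and mean delay.
rain_mm = [0.0, 0.5, 2.0, 4.0]
delay_s = [10.0, 40.0, 60.0, 130.0]

factors = {"rain": min_max(rain_mm), "delay": min_max(delay_s)}
print(pearson(factors["rain"], factors["delay"]))
```

The Pearson coefficient is unaffected by min-max scaling, so in this sketch normalization matters for the regression model input rather than for the correlation analysis.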

3.2 Case study

For this research, two different PTOs have agreed to participate. They have chosen to remain anonymous, but the characteristics of the bus lines will be described. The stop locations are confined within their respective cities (A and B) in the Netherlands. For each case we have access to a single bus route in 2 directions.

• Case A: bus route in a city with around 200 thousand inhabitants, moving from one suburb, through the city center, into a second suburb.

• Case B: bus route between two adjoining cities, with around 200 thousand and 30 thousand inhabitants.

The directions of the routes are henceforth referred to as D1 and D2 respectively. The leading example for this research will be Case A D1, whereas the other routes and directions serve as validation of the results.

The same date ranges are taken for all cases to standardize the data set. The last 2 recorded years of data (2017 and 2018) have been requested to ensure relevance of the results to modern practices. Finally, a number of weeks (Monday-Sunday) have been selected in which the number of events, such as holidays or vacations, is minimal. The same weeks have been taken for both years to reduce the influence of factors that were only present in a single year. The week numbers, as defined by the ISO-8601 standard:


3.3 Data sources

In figure 1, two types of data sources are shown: internal and external. Internal data refers to data provided by PTOs. External data comes from other, only indirectly bus-related sources. Each of these sources will be described here, to show why they can be important to the outcome of the model.

The most essential data source is the actual recorded bus location data, split over the aforementioned Case A and B. Both PTOs use HASTUS for scheduling and a system called Albatros for recording and storing the data.

The first openly available data source used here is weather. Previous research has not been able to confidently state whether it has any influence on punctuality. This means that even if the weather turns out to have no significant influence, it is still worthwhile to do the experiments to prove this.

Another data source that is publicly available for most countries and regions is demographic data. By differentiating between population types, better insight can be gained as to why, for instance, boarding and alighting time is not comparable at each stop.

More difficult to acquire, due to its relatively young technology, is road usage. By measuring the amount of traffic on a road at multiple times a day, a description of transit activity can be made. It is safe to assume that an increase in traffic will decrease the bus travel speed between stops without dedicated bus lanes. Traffic congestion can be used to predict delays and is thus a good factor to take into account in the model.

To combine all the sources, time and space have to be known. The former is always provided by these sources, but the latter is often missing. In order to find a location for a given name, geocoding can be applied, which gives useable coordinates. By indexing on both space and time, a fair assessment of the influencing parameters can be made.

3.4 Models

As can be seen in figure 1, normalized factors serve as input for the predictive regression model. The output is the bus punctuality. A regression model is used, because a detailed description of how much a given bus has deviated from the schedule is desired. Another choice would have been to train a classification model, where the output is either on-time, early or late. Such a model, however, would not be able to differentiate between a deviation of 3 and 20 minutes, a level of detail that is desired here. It would also require a predefined on-time range, which cannot be changed after training. The analysis of how many trips were on time can still be done on the output of a regression model, which is thus the preferred model type.

To gain an understanding of the complexity of the system, experiments are done on three different models. Each of these is described in a separate section, where each builds upon the findings of the previous models. The first is the most general model, where a single large estimator is trained that takes all internal and external factors as input parameters. The stop specific model trains an estimator for each individual bus stop, where the factors that do not rely on time are excluded. In the mathematical model the findings of the previous estimators are used to create a discrete-space model.

The general and stop specific models make use of regression methods. Algorithms used for regression can differ in accuracy depending on the problem to be solved. To ensure an accurate method is used, three regressors are tested. The first is Linear Regression (LR), which gives a weight to each input parameter to minimize the least-squares error. Support Vector Regression (SVR) is also applied, which maps the input vector into a higher-dimensional feature space, as is done in its classification counterpart, and minimizes the error in that space. Lastly, a Multi-Layer Perceptron is trained, a type of Neural Network (NN), which works by setting a fixed number of layers and nodes that get weights assigned based on back-propagation of the error.

The regression models are evaluated by randomly splitting the data into a train and a test set. To ensure that the score of a model is not biased towards a single split, cross-validation is used. The method used in this research is k-fold cross-validation, which splits the data into k equally sized subsets, or folds. The regression method is trained k times, each time on k − 1 folds, and tested each time on the one remaining fold. The mean of the R² scores over the folds is the score reported in this research, and the standard deviation over the folds is reported as the confidence interval.
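The k-fold procedure can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used in this research: the function names `k_fold_indices` and `cross_validate` and the callables `fit` and `score` are hypothetical, and any regressor and scoring function could be plugged in.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(fit, score, X, y, k=5, seed=0):
    """Train on k - 1 folds, test on the held-out fold, and repeat k times.

    fit(X_train, y_train) returns a model (hypothetical callable).
    score(model, X_test, y_test) returns a number (hypothetical callable).
    Returns the mean and the standard deviation of the k fold scores.
    """
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return mean(scores), stdev(scores)
```

Every sample ends up in exactly one test fold, so each record contributes to the reported score once.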

To quantify the accuracy of the trained predictors, the R² metric is used. This metric compares the sum of squared residuals of the model with the sum of squared residuals obtained by always predicting the mean of the values. A score of 1 means that all values were predicted correctly, a score of 0 means that the model has the same accuracy as a constant model that always predicts the expected value of y, disregarding the input features, and negative scores indicate a model that performs worse than that constant model.
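Written out, the metric is R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals of the model and SS_tot is the sum of squared deviations from the mean. A minimal sketch in plain Python follows; the function name `r2_score` is chosen here for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Raises ZeroDivisionError when all values in y_true are identical,
    because the score is undefined in that case.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Predicting the mean for every sample makes SS_res equal to SS_tot, which gives a score of 0, and any model with larger residuals than that gives a negative score.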

To uncover the impact of a single feature on punctuality using the predictive models, Sobol sensitivity analysis is used. This method is needed, because many accurate regression methods work like a "black box", where the complex interactions are not understandable for a person to read. This sensitivity analysis is done in three orders, each analyzing the variance in the output caused by changing values in the input. The first order sensitivity (S1) is the result of changing a single input parameter, the second order (S2) results from changing two input parameters together, and the total sensitivity (ST) of an input includes its interactions with all other inputs. The sensitivities are estimated by sampling in the input range, meaning that there are errors in the resulting values. In [39], the process of analyzing the sensitivity of a NN is described. They use the first and second order sensitivity to get insight into the black box that this neural network presents.

3.5 Simulation

The three models output the punctuality at any bus stop in the route, given the factors at the previous stop. This means that a single trip can be simulated, assuming the starting punctuality and external factors are known, by iteratively updating the difference from the schedule at each stop according to the predictions. Showing the time at each bus stop according to the planned schedule, the actual recorded trip and the predicted punctuality, will be the method to test the models.

This simulation will allow for multiple experiments that answer the research question. The difference from the actual trip punctuality will be the determining value of prediction quality. Changing a single input parameter while keeping the others fixed can show the positive or negative influence of this factor on punctuality. Lastly, early departures at the first stop allow for adjustments to a predicted longer run-time.
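The iterative update described in this section can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: `predict_change` is a hypothetical callable that stands in for a trained model and returns the predicted change in punctuality (in seconds) between two consecutive stops.

```python
def simulate_trip(start_punctuality, stop_factors, predict_change):
    """Return the predicted punctuality (s) at every stop of a single trip.

    start_punctuality: difference from the schedule at the first stop (s).
    stop_factors: one entry of external factors per subsequent stop.
    predict_change: hypothetical model, called as
        predict_change(previous_punctuality, factors).
    """
    punctuality = [start_punctuality]
    for factors in stop_factors:
        # The prediction for a stop depends on the state at the previous stop.
        change = predict_change(punctuality[-1], factors)
        punctuality.append(punctuality[-1] + change)
    return punctuality
```

Shifting `start_punctuality` towards an earlier departure and rerunning the simulation is one way to explore the adjustment mentioned above.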

In order to apply shifts to the factors in a systematic and understandable way, min-max scaling is used. Given a value x for a parameter with minimum k_min and maximum k_max, the shift scale h is applied by:

\[
y =
\begin{cases}
x + (k_{\max} - x)\,h, & \text{if } h \geq 0 \\
x - (k_{\min} - x)\,h, & \text{otherwise}
\end{cases}
\tag{2}
\]

This results in a scaling system where -1 gives the parameter minimum, 0 gives the current value and 1 gives the parameter maximum.
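A direct transcription of equation (2) into Python is given below as a minimal sketch. The function name `shift` and the range check on h are additions made here for illustration.

```python
def shift(x, k_min, k_max, h):
    """Shift value x towards the parameter minimum (h < 0) or maximum (h > 0)."""
    if not -1.0 <= h <= 1.0:
        raise ValueError("shift scale h must lie in [-1, 1]")
    if h >= 0:
        return x + (k_max - x) * h
    return x - (k_min - x) * h
```

For example, with x = 5, k_min = 0 and k_max = 20, a shift of h = 0.5 moves the value halfway towards the maximum, giving 12.5.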

4 Data processing

This section is dedicated to describing the data and how it is processed. The resulting factors are also analyzed, including their ranges and correlations. This will give the reader the knowledge needed to follow the actual model creation steps taken in sections 5.1, 5.3, 5.4.

4.1 Standardization

The data in this project comes from multiple different sources, each with its own format. These formats can have a wide range of reporting styles, such as differences in database type, character encoding and time values.

To create a single coherent database, the choice is made here to rewrite all data sources into comma-separated values (csv) files. Each of these files contains columns that represent the attributes of each of the entries. Entries in multiple files can be linked based on their column values. Indexing into other data files can best be done using a direct link, such as a unique route ID. To create these links between the different data sources, they have to be matched on both time and space.

Time can prove difficult to process, due to the existence of time zones and daylight saving time. Both of these issues can be circumvented by using Coordinated Universal Time (UTC), which can be stored as the number of seconds since 1 January 1970. By converting any local times to UTC, the reported times can be compared based on their difference in seconds. The datetime format used in the implementation is YYYY-MM-DD hh:mm:ss.

Space is limited in this project to a location on earth. The standard format for this is the World Geodetic System 1984 (WGS-84). In this coordinate system, longitude and latitude are represented on an ellipsoid. Distance can be calculated between 2 points on earth to meter accuracy using this system, which is sufficient for this project. The coordinates for each entry are stored in 2 columns called Lat and Lon.
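As an illustration of the time standardization, the sketch below converts a local timestamp to UTC using only the Python standard library. It assumes that the UTC offset of the source is known and fixed (for example +1 hour for CET), and it assumes the hh:mm:ss ordering for the time part. The function names are hypothetical. A source that switches between summer and winter time would need a time zone database instead of a fixed offset.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD hh:mm:ss


def local_to_utc(text, utc_offset_hours):
    """Parse a local timestamp with a known fixed UTC offset and return it in UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(text, FMT).replace(tzinfo=local_zone)
    return local.astimezone(timezone.utc)


def seconds_between(a, b):
    """Difference b - a in seconds for two timezone-aware datetimes."""
    return (b - a).total_seconds()
```

Once every source is expressed in UTC, two records can be compared directly with `seconds_between`, regardless of the time zone they were reported in.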

The goals for data processing in this standardization stage are the following:

• Select relevant attributes.

• Parse strings to numbers.

• Clean bad values.

• Standardize time and space.

• Remove irrelevant time and space entries.

Doing this for each resource results in smaller volumes of correct data, that can be integrated in a larger single coherent database.

4.2 City definition

Standardizing the data is a relatively fast process due to the independence of the values. This is not the case for the data analysis and factor definitions, however. These operations require one-to-many and many-to-many lookups, which are expensive in time and memory. To reduce this overhead, all data is subdivided in time and space, into chunks of around 1MB. For time this means that the files have to be stored in either hours (traffic), days (Albatros) or years (weather).

Subdividing on space proves more difficult. For each bus line that has been used, a data range was defined. Because all are intra-city lines and cities tend to have a circular shape, the data has been subdivided based on distance to a city center. By setting a wide maximum range, that fully covers the city limits, data processing and analysis can be done in heavily reduced computational time without missing relevant values. The city range for Case A is 5.5km and a larger 6km for Case B.
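The range filter can be sketched as below. Two assumptions are made here that are not taken from the implementation: the distance is computed with the haversine formula on a sphere, which approximates the WGS-84 ellipsoid with an error of up to a few tenths of a percent, and records are dictionaries with the Lat and Lon columns described earlier. The function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def within_city(records, center, max_range_m):
    """Keep the records that lie within max_range_m of the (lat, lon) city center."""
    return [
        r for r in records
        if haversine_m(r["Lat"], r["Lon"], center[0], center[1]) <= max_range_m
    ]
```

With a range of 5500 meters this corresponds to the city definition for Case A, given a center coordinate for that city.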

4.3 Internal

The data comes from 2 sources: HASTUS and Albatros. The former is the previously mentioned software used for designing the network and provides the locations of stops and configuration of lines. The latter is the software solution for communication management of buses, created by the supplier of Case A. This means that the actual bus movements originate from Albatros. The combination of both sources results in a full description of bus movements, where the following attributes are known for each record:

Attribute          Description
Stop ID            Unique identifier for the current stop
Stop Name          Name of the current stop
Stop Location      Longitude and latitude coordinates of the current stop
Distance           Distance (m) between the current and previous stop
Trip ID            Unique identifier for the current trip
Planned Arrival    The scheduled arrival time (minute precision) at the current stop
Planned Departure  The scheduled departure time (minute precision) at the current stop
Actual Arrival     The recorded arrival time (second precision) at the current stop
Actual Departure   The recorded departure time (second precision) at the current stop

Table 1: Attribute descriptions of the processed internal dataset.

This result is reached after multiple cleaning, combining and reducing steps. First, raw Albatros records are removed based on the time difference between planned and actual, when the bus is either more than 15 minutes early or more than 30 minutes late. All entries of trips with fewer stops than the expected total are also filtered out. Around 1% of trips have to be removed, because the difference from the schedule can get set to 0 at a random stop in the trip, for which the cause is unknown. These trips are filtered out when more than 20% of the reported punctualities for a single trip are 0.
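The cleaning rules above can be combined into a single trip-level check, sketched below. It is a simplification made for illustration: the thresholds come from the text, but the function name, the argument names and the sign convention (negative values are early) are assumptions. Because a trip with a removed record would have fewer stops than expected, an out-of-range record is treated here as a reason to drop the whole trip.

```python
def keep_trip(punctualities, expected_stops,
              max_early_s=15 * 60, max_late_s=30 * 60, max_zero_share=0.2):
    """Decide whether a recorded trip passes the cleaning rules.

    punctualities: difference from the schedule per stop in seconds,
        negative when early (assumed sign convention).
    expected_stops: number of stops a complete trip should have.
    """
    if len(punctualities) < expected_stops:
        return False  # incomplete trip
    if any(p < -max_early_s or p > max_late_s for p in punctualities):
        return False  # implausibly early or late record
    zero_share = sum(1 for p in punctualities if p == 0) / len(punctualities)
    return zero_share <= max_zero_share  # drop trips with too many zero readings
```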

An important distinction to make is that the planned times are given in full minutes, whereas the actual recorded times are accurate to the second. In addition to this, the difference between the planned arrival and departure time at a stop is always 0 minutes, except for rest stops, where the driver gets extra time to wait for passengers. An example is a large train and bus station for Case A, where the additional time is 2 minutes.

The locations of the stops were not reported in the HASTUS data. To get these, a Python library called geopy was used to do geocoding with OpenStreetMap in order to match the bus stop names with longitude and latitude coordinates. Similarly, the route between 2 stops was not reported, so it was retrieved using OpenRouteService. Here the decision was made to use the bicycle as transportation mode, as bus was not supported and a car is unable to go on bus lanes.

In addition, the mean punctuality for multiple time frames is calculated. The shortest is each half-hour of the day, which is included as a measurement of expected peak ridership and traffic. Secondly, the day of the week is used to find differences between workdays and weekend. Lastly, the month of the year is calculated to reflect seasonal effects. These three values will be used as input factors for the regression models.
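A minimal sketch of how such time-frame means could be computed is shown below. The record layout (a timestamp paired with a punctuality in seconds) and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def half_hour_slot(ts):
    """Index of the half-hour of the day, from 0 (00:00) to 47 (23:30)."""
    return ts.hour * 2 + ts.minute // 30


def mean_by_key(records, key):
    """Mean punctuality per group, where key(ts) maps a timestamp to its group.

    records: iterable of (timestamp, punctuality_in_seconds) pairs.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for ts, punctuality in records:
        acc = sums[key(ts)]
        acc[0] += punctuality
        acc[1] += 1
    return {k: total / count for k, (total, count) in sums.items()}
```

The same helper covers the other two time frames by passing `lambda ts: ts.isoweekday()` for the day of the week or `lambda ts: ts.month` for the month of the year.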

4.4 External

The external data sources are provided by 3 separate parties. Each of these needs its own approach of downloading the data, cleaning bad values, reformatting to csv and filtering for the needed time and location. The code to do this is implemented in Python with use of the Pandas library to manage large data objects.

4.4.1 Weather

The data archive is created at weather stations. To ensure that the weather data used is accurate, the closest weather station to each bus line, city or concession has to be chosen. This data can then be pulled manually as zipped csv files, which is reasonable due to the option of downloading a year directly. A year of weather descriptors comprises around 1.2MB of data, which is a relatively small size.

The csv file contains 29 attributes for each of the 8760 entries. The time density is limited to a single measurement at each hour. These hours are reported in local time CET, without daylight saving time, meaning the global time in UTC can be retrieved by subtracting 1 hour from each measurement. Some attribute columns are dismissed outright due to their perceived irrelevance to bus punctuality. Attributes such as precipitation and visibility have values less than 1 described with strings, which are parsed into 0. Others are defined in ranges, such as cloud coverage and humidity, and get parsed to the average of the given range.

The relevant attributes from this data source are the temperature, precipitation, visibility and wind speed. The visibility is the only attribute with missing values, for 7% of the data. These have been filled using the forward filling method, where the last seen value is used instead. This is acceptable on the assumption that the visibility does not drastically change between consecutive measurements and that consecutive missing values occur rarely in the data.
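Forward filling can be illustrated with a few lines of plain Python. The function name and the use of `None` to mark a missing value are assumptions made for this sketch.

```python
def forward_fill(values):
    """Replace each missing value (None) with the last value seen before it.

    Leading missing values stay None, because there is nothing to carry forward.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled
```

With Pandas, the same operation is available on a column as `Series.ffill()`.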

4.4.2 Traffic

When using the NDW open data expert module [38], a pre-download location filter is not provided. This means the data has to be pulled for the entirety of the Netherlands. Coupling this with the large number of measurements per second, the total data volume can reach 12GB for a single week. Because the intended use of the traffic measurements is to describe how changes in bus travel time come to exist, we do not need the actual intensity and speed. A pre-download filter is available for recorded travel times, reducing the total by 90% to 1.2GB per week. The locations for travel time measurement sites are also denser in cities, which is more in line with our intended use.

After manually selecting a download time frame, the archive gets created. This can take multiple hours for larger requests consisting of multiple weeks. Afterwards, a link to the resource is provided, with a small chance that the download gets aborted. These issues have motivated a more automated method for pulling the traffic data. Using Python, code was created that takes the initial download link as a user argument. It refreshes this every 30 minutes and downloads the data when the link is live. The data gets unzipped, and each day in the archive gets passed to another Python script that parses it in approximately 90 minutes. With this method, a single week of data can be requested and parsed in around 12 hours.

The data for a single day of traffic is provided in zipped xml files, one per minute of measurements. Each is extracted into a single Python object, consisting of nested dictionaries and lists. This object contains different types of measurement recordings and attributes, of which the following are kept: SiteID, Time, Value. Even though xml has options for different data types and thus different classes, most data can be transformed into universal attributes without much issue. Time is reported correctly in UTC, Value describes a vehicle's travel time in seconds and SiteID links to a site description file. Records with a reported value error, a negative value or a value quality lower than 50% are filtered out.

The location and measurement characteristics for the traffic sites are defined in a separate lookup xml file. The coordinates are correctly provided in WGS-84, as well as the name of the measured road. The distance between the start and stop location of a measurement can be defined over multiple segments, of which the sum is taken to have a single length attribute for each site. Using the city definition, the SiteIDs can be filtered on their location. This greatly reduces the size of a single day from 2.2GB to 10MB.

The traffic is reported each minute per site. The reports contain the mean travel time of all measured vehicles within that time frame. By dividing the measurement distance by the mean travel time, the speed of traffic at that minute can be retrieved. For each site the mean speed over all measurements is calculated. To get a factor relevant to bus punctuality, multiple minutes of data are aggregated. The relative speed is defined as the mean of the measured speeds in the aggregation window divided by the mean speed at the site. This generalizes the speed, so that it is comparable between sites that are in a busy city street or on a highway. Aggregating over multiple minutes showed that there are missing minutes in the data, which can be attributed to too few vehicles on the road. The number of minutes with a measurement is also kept, which has a minimum of 0 and a maximum of the number of aggregated minutes.
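The aggregation of minute measurements into a relative speed can be sketched as follows. The function name, the argument names and the use of `None` for a missing minute are assumptions made for illustration.

```python
def relative_speed(travel_times_s, site_length_m, site_mean_speed):
    """Aggregate minute measurements of one site into a relative speed and a count.

    travel_times_s: mean travel time (s) per minute in the aggregation window,
        None when no vehicle was measured in that minute.
    site_length_m: length of the measurement site (m).
    site_mean_speed: mean speed at the site over all data (m/s).
    Returns (relative speed, number of minutes with a measurement).
    """
    speeds = [site_length_m / t for t in travel_times_s if t is not None and t > 0]
    if not speeds:
        return None, 0  # nothing was measured in this window
    return (sum(speeds) / len(speeds)) / site_mean_speed, len(speeds)
```

A result below 1 means traffic in the window was slower than usual for that site, independent of whether the site is a city street or a highway.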

4.4.3 Demographic

The population characteristics are provided in terms of 100 meter by 100 meter squares and are measured and reported yearly. The data has been formatted as a shapefile, which can be read using the Python library Geopandas, an extension of the previously used Pandas. The locations of the shapes are reported in the Rijksdriehoek coordinate system, a system designed specifically for the Netherlands. It can be mapped onto WGS-84 longitude and latitude coordinates by using the code 4326 from EPSG, which is a geodetic parameter look-up dataset. The arithmetic average of the shape coordinates is taken to have a single coordinate for each entry. The entries are then filtered based on the city definition.

Different years can contain other sets of attributes. Taking into account that the data quality is not too high with many missing values, that accuracy is limited to multiples of 5 individuals and that the sum of males and females does not always match the total, the choice is made to average all values for 2015, 2016 and 2017 as a single description of the population. This is based on the assumption that within this time frame, no significant demographic changes have occurred. The result is a 1MB csv file.

For each data square the number of inhabitants is kept, because it could influence the dwell time of a bus due to passenger numbers. The mean age, percentage of females and percentage of natives are kept without prior assumptions, in case they reveal interesting influences or connections.

4.5 Integration

The punctuality measurements need to be integrated with the external factors based on time. To achieve this, the external data are binned to fixed intervals, where the bin with the shortest time difference to a punctuality measurement is used as the value for that measurement. The traffic measurements are aggregated on 5- and 30-minute intervals. The weather values are reported per hour.
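Matching a punctuality measurement to the nearest bin can be sketched as below. The function name is hypothetical and all times are assumed to be expressed in seconds in UTC.

```python
def nearest_bin_value(t, bin_times, bin_values):
    """Return the value of the bin whose time is closest to t.

    t and bin_times are in seconds (UTC). When two bins are equally close,
    the earlier entry in bin_times is used.
    """
    best = min(range(len(bin_times)), key=lambda i: abs(bin_times[i] - t))
    return bin_values[best]
```

This linear scan is kept simple on purpose. For sorted bins and large data volumes, a binary search would give the same result faster.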

In figure 2 the locations of 3 sources of data are shown for Case A. The green squares represent the demographic data, where the brightness indicates the number of inhabitants. In blue the stops and the route between them are shown for line 2. The red and orange dots represent the traffic measurement sites. We can see here that most stops in the upper left corner do not have traffic sites nearby, which will reduce the accuracy of this factor. Another interesting observation is that when a site is close to a stop it tends to be on a slow road.

Figure 2: Map of spatial data sources for Case A.

Each bus stop gets a traffic site assigned based on the shortest route. The traffic descriptors for each stop are the measured values at that closest site. All demographic squares are assigned to the closest bus stop based on the geodesic distance, if it is less than 800 meters away, which is the maximum distance a person would travel to get to a bus [17]. The inhabitants are summed and the other values are averaged to get the demographic factors for each bus stop.

4.6 Factor definition

With the global definitions of time and space for all data sources, they can be combined to create factors that could have an influence on punctuality. These factors are defined in tables 2 and 3.

Name           Type   Description
Mean Duration  Float  Average time between stops (s)
Stop Distance  Float  Distance to previous stop (m)
Inhabitants    Int    Number of inhabitants in area
Age            Float  Average age in area
Female         Float  Percentage female in area
Native         Float  Percentage natives in area
Site Distance  Float  Distance to nearest traffic site (m)

Name               Type   Description
Last Punctuality   Int    Difference from schedule (s) at the previous stop
Temperature        Float  Temperature (°C)
Precipitation      Int    Precipitation (mm)
Visibility         Int    Horizontal visibility (km)
Wind               Float  Wind speed (m/s)
Vehicle Count 5m   Int    Number of vehicle measurements in the last 5 minutes
Mean Speed 5m      Float  Relative traffic speed change in last 5 minutes
Vehicle Count 30m  Int    Number of vehicle measurements in the last 30 minutes
Traffic Speed 30m  Float  Relative traffic speed change in last 30 minutes
Half-Hour          Float  Mean punctuality at this half-hour of the day
Weekday            Float  Mean punctuality at this day of the week
Month              Float  Mean punctuality at this month of the year

Table 3: Formal definition of all temporal factors influencing bus punctuality.

The factors serve as input for the models in the following sections, where their impact on punctuality will be quantified. The temporal attribute indicates whether the factor is dependent on time. Factors that are not temporal only change when a different bus stop is considered, meaning they are spatial variables.

4.7 Characteristics

Before doing estimation experiments, it is important to have a good grip on the underlying data values. Thoroughly examining the ranges and correlations between inputs will allow for meaningful insight in the results.

4.7.1 Ranges

The value ranges for the factors defined in tables 2 and 3, are described here for Case A D1. The relevant observations that can be made from the data ranges are noted, in order to understand the bus route.

Figure 3: Value ranges of spatial factors from demographic, traffic and internal data sources.

In figure 3 the data values for each stop are aggregated and shown. The duration and distance between two stops show a wide range (up to 7 times the distance). The duration and distance have similar distributions, meaning the longer stretches of road do not necessarily increase the average driving speed. Most bus stops have an area with one to two thousand inhabitants, with a few exceptions that serve up to six thousand. The average age of the inhabitants is mostly between 37 and 39. Females make up half of the population, as expected. The stops that have a large percentage of females and a high average age are near a nursing home. Differences between neighbourhoods can be seen in the distribution of natives, which has a large range between 40% and 90%. The nearest traffic measurement site tends to be within 1.3 kilometers, except for 7 stops that are more remotely located.

In figure 4 the data values for each recorded arrival at any bus stop are aggregated and shown. The selected weeks reflect the average temperature in the Netherlands, with values ranging between -5 and 35 degrees Celsius. The precipitation values show that rain and snow are uncommon. The visibility is reported with a decrease in accuracy for the larger values, leading to gaps in the graphs. Wind speeds show a similar distribution to temperature. The measurement counts show that most traffic sites report at least one vehicle per minute. The relative traffic speeds are by definition centered around 1, with similar normal distributions for both 5 and 30 minute intervals.

Figure 4: Value ranges of external temporal factors from traffic and weather data sources.

Figure 5: Value ranges of temporal factors from internal data sources.

In figure 5 the values for the factors from the internal data are shown. The punctuality at the previous stop seems to follow a slightly skewed normal distribution with a mean of 60 seconds late departures. Departures more than 200 seconds early or 400 seconds late are rare, but do take place. The operational hours of this bus line are from 5:30 to 24:00, with peak travel times at 8:30 and 17:00. Interestingly, the Mondays have the best punctuality on average compared to the other days of the week. In the month of the year the gaps are caused by missing data, as the data has been taken from a small subset of weeks as described in the Method section.

4.7.2 Correlations

A basic understanding of the way factors are interconnected can be gained by looking at their correlations. In this case, the correlation between two sets of numbers is quantified by taking their covariance. The covariance is positive when both values tend to increase at the same time, close to 0 when the values are unrelated and negative when one value tends to decrease as the other increases.
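The covariance and its normalized form can be sketched in plain Python as below. The normalization by both standard deviations (the Pearson correlation coefficient) is an addition made here for illustration: it bounds the value between -1 and 1, so that a factor compared with itself gives exactly 1. Whether the figures in this section use the raw or the normalized covariance is not stated in the text.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """Covariance normalized by both standard deviations, bounded by -1 and 1."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))
```

Because `covariance(xs, ys)` equals `covariance(ys, xs)`, a matrix of all pairwise values is symmetric.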

Figure 6: Covariance of all 2-pair temporal data values as well as the recorded punctuality.

In figure 6 the correlations between all temporal data values have been displayed in a heat map. It shows some expected behaviour, such as:

• The diagonal (all factors compared with themselves) has a full match.

• The matrix is symmetrical, as the covariance is commutative, meaning the order of the 2 inputs is irrelevant.

• Punctuality at the current stop is very dependent on the punctuality at the previous stop.

• Traffic speeds slightly decrease and vehicle counts increase as the half-hour punctuality (measure of peak hours) increases.

This does not yet explain some of the more unexpected results. A few are highlighted with specific examples to show the difficulties in generalizations for this system.

Figure 7: Data spread of temperature and the mean punctuality for each month.

In figure 7 the spread of all data points is shown for the temperature and month inputs. The goal of this graph is to clarify the unit of month punctuality, which is the mean punctuality in seconds for each month of the year. This means that the y-values do not change based on time of the year (i.e. where February would be higher than January), but on the mean punctuality for that time period. As seen in figure 5, the summer months have a lower mean punctuality, which corresponds to the higher temperatures seen in the graph above. This explains the negative correlation between temperature and month as seen in the covariance matrix. Only 9 different month values are in the dataset, instead of 12, because the sampling does not cover all months of the year.

Figure 8: Data spread of actual punctuality and the mean punctuality for half-hour of the day.

Figure 8 shows that the slight positive correlation that is seen in the covariance matrix can also be found in the data spread. The lower half-hour punctuality values, for 5:30 and 23:30, have a relatively lower punctuality than the peak hours such as 8:30, which can be found at the top of the graph. However, the spread of punctuality within any half-hour tends to be large, meaning this input on its own is not a sufficiently accurate predictor.

Figure 9: Data spread of punctuality and the number of vehicle measurements in a 30 minute time frame for a single bus stop.

In figure 9, the slightly positive correlation between vehicle measurements and punctuality is visualized. Intuitively, this meets expectations, as the increase in road usage will hinder a bus in driving optimally. The reason that this relationship is not more prevalent in the data is that punctuality of the later stops is very reliant on previous events, which can be unrelated to traffic. This means that an absence of traffic congestion is not enough to ensure optimal punctuality.

Figure 10: Data spread of relative traffic speed and the number of vehicle measurements in a 30 minute time frame for a single bus stop.

In figure 10, a negative correlation between traffic speed and vehicle counts is found. This meets expectations as increase in road congestion will result in a decreased average vehicle speed. However, in the covariance matrix a positive correlation is found between the two values. The reason for this discrepancy is that the data spread is shown for a single stop, whereas the correlations are calculated for all stops. The stops should always be considered separately, as the road types of their respective traffic sites can be in a range between city street and highway. On a highway, the number of vehicles and their speeds are higher, resulting in a seemingly positive correlation. Considering values at a single stop level, the correlations are the other way around.

5 Model creation

In this section multiple iterations of the model will be described with their advantages and disadvantages. The three finalized models explained in detail are the general model, the stop specific model and the mathematical model.

Each model rests on the premise that a single trip can be seen as a list of alternating time segments. The segments are the travel time between the stops and the dwell time at the stops. The 2 segment types will be predicted separately, which is similar to what has been done previously in [31]. These predictions can then be used to analyze the complete trip from the first to the last stop.

5.1 General model

This model consists of two predictors that feed into each other. The predictors are trained regression models, that output punctuality. The model will start at the first travel segment, which results in an expected arrival time at the next stop. Here the dwell time is predicted using the updated factors with the second predictor. These estimated times are added together, leading to the starting state.

Figure 11: The interaction between the two states (Travel and Dwell) used in the general model.

5.2 Output definition

This model relies on all previously defined input factors in tables 2 and 3. For the output, three separate solutions have been tested:

• Time difference from schedule.

• Duration difference from schedule.

• Duration difference from mean.

The first listed approach is the most straightforward. By directly predicting the time difference from the schedule (punctuality), no additional steps are required to feed the output into the next iteration of the model. The downside of this approach, however, is a loss of accuracy compared to the other output options. As seen in figure 6, punctuality at the current stop is extremely dependent on the punctuality at the previous stop. This makes it difficult for regression algorithms to take the smaller factors, such as traffic and weather, into account.

An improvement is to predict the difference from the planned driving and dwelling duration instead. The premise of this method is that the output is not fully dependent on a single input parameter. It does require another computation step to go from the stop level duration difference to the trip level punctuality: the previous punctuality, planned duration and duration difference are added together to get the updated punctuality in seconds.

The final solution resolves the issue with schedule accuracy. In table 1 it is noted that the planned schedule is provided in minutes, whereas the actual bus movements are recorded in seconds. Because the planned dwell duration is 0 for most stops, while the mean dwell duration is always larger than 0, predictions relative to the schedule carry an intrinsic bias. To get around this issue, the mean duration of the travel time between two stops and of the dwell time at a stop is taken instead of the scheduled duration. The punctuality is then obtained as the sum of the previous punctuality, the mean duration and the duration difference.
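The conversion from a predicted duration difference back to punctuality can be illustrated as follows. This is one possible reading of the summation described above, not a confirmed implementation: the function name and its arguments are hypothetical, all times are assumed to be in seconds, and the sum is assumed to be taken relative to the scheduled time of the previous event, after which the scheduled time of the next event is subtracted.

```python
def punctuality_from_duration(prev_scheduled, prev_punctuality,
                              baseline_duration, duration_diff, next_scheduled):
    """Reconstruct punctuality (s) at the next event from a duration prediction.

    baseline_duration is the planned duration (second output option) or the
    mean duration (third output option); duration_diff is the predicted
    difference from that baseline.
    """
    predicted_time = prev_scheduled + prev_punctuality + baseline_duration + duration_diff
    return predicted_time - next_scheduled


# Illustrative numbers: the bus left 30 s late and needs 135 s for a
# segment that is scheduled to take 120 s.
from_plan = punctuality_from_duration(0, 30, 120, 15, 120)   # baseline: planned 120 s
from_mean = punctuality_from_duration(0, 30, 130, 5, 120)    # baseline: mean 130 s

print(from_plan, from_mean)
```

Both baselines describe the same 135 s of driving, so under this reading they lead to the same punctuality; only the quantity the regressor has to predict differs.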

5.2.1 Training

In table 4 the scores for three regression methods are shown. Neural Networks (NN) score highest of the three methods for punctuality prediction, on both dwell and travel time. As could be expected, Linear Regression (LR) has the lowest scores, as it cannot capture complex interactions between inputs. Support Vector Regression (SVR) falls between the two.

Even though this identifies the best method, it also shows that the travel scores are quite low compared to the dwell scores. This gap exists because there is a larger variance in the dwell duration difference than in the travel time. That variance is caused by drivers waiting longer at certain stops when they arrive early and have the chance to wait. The NN seems to be the most accurate in predicting this behaviour.

Method                       Travel score   Dwell score
Linear Regression            0.047          0.536
Support Vector Regression    0.080          0.623
Neural Network               0.121          0.747

Table 4: Cross-validation R² scores for three regression methods.
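To make the scores in table 4 easier to interpret, the sketch below computes a k-fold cross-validated R² by hand for two toy models. It is a minimal illustration on synthetic data and does not reproduce the regressors or the data of this research; the helper names (`k_fold_r2`, `fit_mean`, `fit_line`) and the interleaved fold assignment are assumptions.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_r2(xs, ys, fit, k=5):
    """Mean R² over k folds; fit(train_x, train_y) must return a predictor function."""
    n = len(xs)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # interleaved fold assignment
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        test_x = [xs[i] for i in sorted(test_idx)]
        test_y = [ys[i] for i in sorted(test_idx)]
        model = fit(train_x, train_y)
        scores.append(r2_score(test_y, [model(x) for x in test_x]))
    return sum(scores) / k


def fit_mean(xs, ys):
    """Baseline that always predicts the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least squares with a single input."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Synthetic, noise-free data: the target is a linear function of the input.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

score_mean = k_fold_r2(xs, ys, fit_mean)
score_line = k_fold_r2(xs, ys, fit_line)
print(score_mean, score_line)
```

A score close to 0 means a model does little better than always predicting the mean, which puts the travel scores in table 4 into perspective.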

5.2.2 Analysis

In figure 12 the total sensitivity is shown for each input parameter in the NN model, using the Sobol method with 5 × 10⁴ samples. It shows that most of the input factors do not have any impact on punctuality, except for an almost negligible contribution by the number of vehicle measurements. The other values that do have an impact are spatial factors. As discussed for the training scores, there are a number of stops where the driver will wait when ahead of schedule. Because these stops are not known in advance, the model learns them from each stop's unique set of spatial descriptors. Other interactions are neglected in this model of the system.

Figure 12: Total sensitivity for all input parameters, with error bars indicating the 95% confidence interval.
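For readers unfamiliar with the total sensitivity index, the sketch below estimates it with the Jansen estimator on a toy function. It is only an illustration of the technique: it assumes independent inputs that are uniform on [0, 1], uses plain pseudo-random sampling, and the function `sobol_total` and the toy model are hypothetical stand-ins for the trained NN and the sampling scheme used in this research.

```python
import random


def sobol_total(model, n_inputs, n_samples, seed=0):
    """Estimate the Sobol total sensitivity index of every input (Jansen estimator).

    Assumes independent inputs, each uniform on [0, 1].
    """
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    b = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    y_a = [model(row) for row in a]
    mean = sum(y_a) / n_samples
    variance = sum((y - mean) ** 2 for y in y_a) / n_samples

    totals = []
    for i in range(n_inputs):
        squared_diff = 0.0
        for row_a, row_b, y in zip(a, b, y_a):
            mixed = list(row_a)
            mixed[i] = row_b[i]  # resample only input i, keep all others fixed
            squared_diff += (y - model(mixed)) ** 2
        totals.append(squared_diff / (2 * n_samples * variance))
    return totals


def toy_model(x):
    """Toy function: strong effect of x[0], weak effect of x[1], none of x[2]."""
    return 4.0 * x[0] + x[1]


totals = sobol_total(toy_model, n_inputs=3, n_samples=2000)
print(totals)
```

For this toy function the analytical total indices are 16/17, 1/17 and 0. An input that the model ignores gets an index of exactly 0, which is the pattern figure 12 shows for most of the non-spatial factors.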

The goal of this two-state model was to emphasize generality, by providing data from which widely applicable rules can be learned. Judging from the sensitivities, this has not been achieved. Instead, the result is a very specific model that only uses its input to determine which stop the bus is currently at.

5.3 Stop specific model

In the general model, stops had to be identified from their spatial attributes. This can be circumvented by creating a separate regressor for each individual stop, which removes the need for the spatial factors as model inputs, as they are constant per stop.

5.3.1 Training

As found in the general model, an NN is the best-performing regression method for this problem and is thus used for this model. For each stop, 60 NNs of different sizes are trained and 5-fold cross-validated. The sizes range from 5 to 100 nodes and from 1 to 4 layers. For each stop the NN with the highest accuracy is selected. No patterns have been found between the stops' spatial attributes and the sizes of their best performing NNs.
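The per-stop selection can be sketched as a search over candidate network sizes. This is a hedged illustration only: the exact grid of 60 sizes is not stated here, so the node counts below are placeholders, every hidden layer is assumed to have the same width, and `toy_cv_score` is a hypothetical stand-in for training an NN and returning its 5-fold cross-validation score.

```python
from itertools import product


def select_best_architecture(node_counts, layer_counts, cv_score):
    """Score every (nodes, layers) combination and return the best one.

    cv_score(architecture) must return the cross-validation score of a
    network whose hidden layers have the given widths.
    """
    candidates = [(nodes,) * layers
                  for nodes, layers in product(node_counts, layer_counts)]
    scored = [(cv_score(arch), arch) for arch in candidates]
    best_score, best_arch = max(scored)
    return best_arch, best_score, len(candidates)


def toy_cv_score(arch):
    """Stand-in scorer that peaks at two hidden layers of 20 nodes each."""
    return -abs(len(arch) - 2) - abs(arch[0] - 20) / 100


# Placeholder grid; the actual sizes used per stop are not reproduced here.
best_arch, best_score, n_candidates = select_best_architecture(
    node_counts=[5, 20, 50, 100],
    layer_counts=[1, 2, 3, 4],
    cv_score=toy_cv_score,
)
print(best_arch, best_score, n_candidates)
```

Running this search once per stop yields one selected architecture per stop, which is what allows the selected sizes to be compared against the stops' spatial attributes.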

In order to find the impact of the external factors (weather and traffic), models have been trained on them separately from the internal factors.

Figure 13: Sorted cross-validation score for the travel prediction to each of the stops on route Case A D1, with error bars indicating the confidence interval.

In figure 13, the prediction scores for all stops on the route are shown, ordered by accuracy. This graph shows how well the neural networks predict travel time for each stop; the dwell scores in figure 14 are very similar. The first thing that stands out is the large difference in accuracy between stops. The three stops with the highest accuracy are this predictable because the bus drivers there fully respond to being too early by waiting in order to get back on schedule.

It also shows that there is no significant change in driving time for around half of the stops. This means that the input parameters either do not have an influence on driving time or the influence of unknown factors or events is too large.

Lastly, there is the middle ground. From Stop 5 up to Stop 9 we can see the hard effects of the input parameters. These stops show the most complex interactions with the external and internal factors.

The time-only model (where the external factors have been excluded) shows a similar trend to the model that is based on the external factors only. For both, the previous punctuality leads to good results for 3 stops, around half of the stops are not predictable, and a couple show a factor influence of 5-10%.

In the best case, adding the internal factors to the model would simply add the prediction scores together. This is not the case. The combination of the external and time models does show a best of both worlds scenario, where the prediction is almost always better than with either the external or the time factors alone, meaning that this model is the best description of the underlying system we can get. The added complexity of the input does increase the uncertainty of the prediction accuracy, as indicated by the red error margins, caused by occasional overfitting of the data.

Figure 14: Sorted cross-validation score for the dwell prediction at each of the stops on route Case A D1, with error bars indicating the confidence interval.

For the dwell time, the positive influence of the time factors is even more pronounced, as seen in figure 14. Some stops that behave unpredictably based on external factors are slightly predictable (5%) based on the time factors. Combining the models again leads to the desired result, where the complete model is almost always the best of both worlds.

5.3.2 Analysis

As seen during training, the model performs best when all temporal inputs are used. The Sobol total sensitivity is calculated for the travel duration to and the dwell duration at each stop, using 5 × 10⁴ samples. The mean and standard deviation over all stops are calculated for each of the inputs. The results are shown in figures 15 and 16 for travel and dwell duration, respectively.

Figure 15: Total sensitivity of the travel duration for all input parameters with the error bars indicating the standard deviation.
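The aggregation over stops behind the bars and error bars can be sketched as below. The input names and sensitivity values are made up for illustration, the function name is hypothetical, and the population standard deviation is used as an assumption, since the choice between population and sample standard deviation is not specified here.

```python
from statistics import mean, pstdev


def summarize_sensitivities(per_stop):
    """Aggregate per-stop total sensitivities into (mean, std) per input.

    per_stop is a list with one dict per stop, mapping input name to the
    total sensitivity found for that stop's model.
    """
    summary = {}
    for name in per_stop[0]:
        values = [stop[name] for stop in per_stop]
        summary[name] = (mean(values), pstdev(values))
    return summary


# Made-up sensitivities for two stops and two hypothetical inputs.
per_stop = [
    {"previous_punctuality": 0.8, "rain": 0.1},
    {"previous_punctuality": 0.4, "rain": 0.1},
]

summary = summarize_sensitivities(per_stop)
for name, (avg, std) in summary.items():
    print(name, avg, std)
```

A large standard deviation for an input indicates that its influence differs strongly between stops, even when its mean sensitivity is moderate.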
