Demand Analysis and Privacy of Floating Car Data

by

Giancarlo Camilo

B.Sc., University of Victoria, 2019

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE

in the Department of Computer Science

© Giancarlo Camilo, 2019

University of Victoria

Supervisory Committee

Dr. Kui Wu, Supervisor (Department of Computer Science)

Dr. Baljeet Malhotra, Departmental Member (Department of Computer Science)

ABSTRACT

This thesis investigates two research problems in analyzing floating car data (FCD): automated segmentation and privacy. For the former, we design an automated segmentation method based on the social functions of an area to enhance existing traffic demand analysis. This segmentation is used to create an extension of the traditional origin-destination matrix that can represent origins of traffic demand. The methods are then combined for interactive visualization of traffic demand, using a floating car dataset from a ride-hailing application. For the latter, we investigate the properties of FCD that may lead to privacy leaks. We present an attack on a real-world taxi dataset, showing that FCD, even when anonymized, can leak private information.

List of Tables

Table 4.1 Entry of data from the Twitter API

Table A.1 Data entry of the GAIA Open dataset

Table A.2 Data entry of the Urban Trips dataset

List of Figures

Figure 3.1 Segmentation comparison

Figure 3.2 Segmentation of Xi’an City Second Ring Road

Figure 3.3 Visualization of intensity of demand for the selected area (grey

Chapter 1 Introduction

The usage of geolocation in smartphones, cars, and other devices has become widely popular. Various applications make use of our locations to provide us with certain services. Ride-hailing applications (such as Uber, Lyft, DiDi) and mapping applications (such as Google Maps, Apple Maps) are examples of mainstream applications that use our locations. This type of information is often referred to as floating car data (FCD) [51], and it presents an opportunity for traffic analysis.

Classic traffic analysis is limited and difficult to scale, so we look towards floating car data to enhance traffic analysis. Intelligent traffic systems need to use this data to extract insights about traffic conditions. After data extraction and processing, the results can be presented in a way that can be easily understood and manipulated by human operators. This is a relatively new method of traffic analysis so more research is needed on the subject. After reviewing the current literature, we identify certain research gaps that are the focus of this thesis.

Traffic analysis is a subject with many facets, one of them being traffic demand analysis. Traffic demand can be seen as the need that vehicles have for certain areas/roads in traffic. Understanding demand is important for traffic operators and managers who need to maintain traffic infrastructure. A common way to analyze traffic demand is by using an origin-destination matrix. Since the creation of this matrix depends on area segmentation, we present an enhancement that uses crowd-sourced information to generate better area segmentation than traditional methods. Also, we present a method that extends the classic origin-destination matrix to represent origins of demand. This allows us not only to determine high-demand areas but also to identify where traffic comes from. These two contributions are represented by the following research questions:

Q1: Can we enhance traffic demand analysis to find origins of demand by extending the origin-destination matrix?

Q2: Can we improve the current methods used for traffic demand analysis with an automated social function classification method?

While working with floating car datasets, we have found that the state of privacy of these types of datasets is not yet fully understood. Because these datasets contain detailed information about people’s driving habits, we wondered if they can be used by malicious entities to infringe on the privacy of drivers. Therefore, in this thesis we also provide a discussion of this topic, which is represented by the following research question:

Q3: Can the use of floating car data infringe on the privacy of its drivers?

Before looking into our research questions, we present our literature review, in which we found the research gaps addressed by this thesis. We cover classic traffic analysis and how floating car data has been used, so far, to power intelligent traffic systems. We also look into common metrics, anomaly detection, traffic prediction, traffic visualization and the evaluation of traffic systems and datasets.

After the review, we look into traffic demand. We present our enhanced methods for segmentation and for extending the origin-destination matrix to find demand origins. We tie these two methods together in a use case application that implements both to create an interactive traffic demand visualization. We test this application with a real floating car dataset obtained from [20].

We then look into the privacy of floating car datasets. We describe the current methods for privacy protection. Then, we identify what properties may be exploited to retrieve the identity of drivers by matching the floating car dataset to a secondary public dataset that contains identities. Finally, we describe a privacy attack algorithm and test it using a real floating car dataset.

Chapter 2 Background

Traffic analysis aims to understand and track traffic conditions. Traffic jams and overall road congestion cause billions of dollars of loss every year [52]. Traffic systems are important to help traffic managers and operators understand how the traffic structure is used, allowing them to make better decisions on traffic maintenance and improvements. Traffic systems can also provide navigation tools that help drivers by finding best routes or identifying congested areas that should be avoided. Google Maps [30] and Apple Maps [29] are popular examples of this type of traffic system.

Traffic systems normally depend on data from stationary sensors distributed throughout the traffic infrastructure. However, these sensors provide limited data and are also difficult to maintain. The usage of geolocation devices in vehicles (on vehicles themselves or on cellphones inside vehicles) has become widely popular and this presents an opportunity to use this data to enhance traffic analysis systems. This type of data is known as floating car data. In this chapter we provide an overview of traffic analysis with a focus on how floating car data is used. These concepts will be important for answering the research questions of this thesis.

2.1 Traffic Analysis

The phenomenon of urban agglomeration continues to be a trend in most countries. With so many people living in the same area, various city infrastructures become overly busy. The traffic infrastructure of a city is one of the first affected by this overuse. The fact that most traffic infrastructures in cities were not originally designed to handle so much usage, combined with cars being more accessible than ever, has created tremendous strain on the traffic of large cities [52]. As such, there is a need to create systems that can analyze traffic to help us manage it more efficiently.

2.1.1 Classic Traffic Analysis

Traffic analysis is a sub-field of traffic flow theory and its focus is to extract and process traffic data in order to make inferences about real traffic conditions [21]. Historically, traffic analysis has been done by collecting data from sensors that were purposely designed to track traffic conditions [15, 25]. These sensors are usually installed at fixed points and can collect information such as the speed of passing vehicles using speedometers and induction loops or number of vehicles on a given road by using image processing on traffic cameras.

Although this type of traffic analysis has been widely used [46], it has certain limitations. Because these sensors are purposely created and placed in traffic by a managerial entity, the cost of scaling up to keep up with traffic growth and changes is high. For this same reason, areas covered by these types of sensors are usually limited. Therefore, analysis based on this data is not ideal to infer information about the overall traffic of a city. Also, because most of the sensors used are static, it is impractical to find origin and destination information, which is very important for understanding the direction of traffic [28, 33].

2.1.2 Fundamentals of Traffic Analysis

First, we look at the users and the infrastructure of traffic. Generally, traffic users are pedestrians, cyclists and vehicles. In the context of floating car data, we will consider traffic users to be the drivers. The traffic infrastructure is generally defined as any structure that enables and supports traffic, such as roads, stoplights and monitoring devices. For the purposes of this research we consider traffic infrastructure to be only the roads used by traffic users.

Traffic analysis can be done in macro or micro scales. Macro scale traffic analysis usually refers to data that is collected from static sensors [25]. As mentioned, this method was extensively used but it has proven to be difficult and expensive to expand. On the other hand, micro analysis is defined as the analysis of traffic data from dynamic sensors present inside vehicles [39]. This allows us to extract driving trajectories, speed and other metrics. Some of these metrics could already be collected using static sensors; however, micro analysis provides more detailed information. Also, it usually represents a larger set of users in the traffic system, making the insight from this data more valuable for traffic management. We will describe how this data can be used to get important insights about traffic.

2.1.3 Efficiency

Before delving into traffic analysis itself, let us define what efficiency means in this context and what can make traffic more or less efficient. The traffic of an area is said to be perfectly efficient if users can freely navigate the traffic infrastructure without any waiting times. Of course, in the real world this is often not the case. In fact, perfect efficiency is virtually nonexistent in the real world. So, since there is no perfect traffic to use as a comparison, we can gauge how efficient traffic is by tracking known traffic problems that cause inefficiency.

An important task in traffic analysis is grading traffic areas based on their efficiency. This is very useful when trying to identify if a certain area has a traffic problem (based on efficiency history), when comparing traffic of different regions, or when designing a new traffic infrastructure. Efficiency is one of the first flags that can identify problems in traffic. Further analysis is then performed to find the causes of these issues, such as bottlenecks, street design redundancies and more [42].

There are many factors that influence efficiency. A poorly designed street pattern may lead to overall slower speed of vehicles; too many vehicles can cause congestion in roads; traffic accidents generally also cause slowness and so on [14].

2.1.4 Congestion

A commonly studied condition of traffic is the level of congestion for a given area. Congestion can be seen as any “less than efficient” usage of infrastructure caused by the number of users interacting with each other at the same time [42]. Congestion analysis is based on how human drivers react to road conditions. Finding congestion points is one of the main tasks of traffic systems. Given that it is not viable to get information about all drivers in traffic, this task is particularly difficult and should use inference and extrapolation from a subset of all drivers.

A traffic bottleneck is an important term when analyzing traffic conditions. Bottlenecks are localized disruptions of traffic that cause average vehicle speeds to go down and are classified as stationary or moving. Stationary bottlenecks usually happen when a road is temporarily limited (by closed lanes or similar), causing congestion and therefore slower speeds. We note that sometimes stationary bottlenecks are desirable if they serve a strategic purpose. Moving bottlenecks are caused by slow vehicles that can limit the speed of other vehicles. One common example is a truck using a road that has few lanes, which might cause overall speeds to go down. Traffic systems should be able to locate bottlenecks and identify their type so that we can understand how the efficiency of traffic is being affected [24].

2.1.5 Traffic Demand

Traffic demand is another important concept in traffic analysis. Traffic is usually not evenly distributed in a city. What usually happens is that traffic users tend to use certain roads more often, increasing the demand for these roads and possibly creating the previously mentioned bottlenecks.

Traffic demand refers to the need that vehicles have for a particular area/road. If more vehicles use the area, then the demand for the area increases. The definition of high or low demand is relative to other areas of the traffic infrastructure. Because of their high usage, high demand areas have a certain correlation with traffic congestion. While high demand may be an indicator of congestion, we note that these two terms have different meanings. As mentioned, congestion specifically points to less-than-efficient usage of infrastructure (which causes slowness). On the other hand, traffic demand does not depend on efficiency but it is simply defined by the need that vehicles have for certain areas. An area might be in high demand, but not present any sort of congestion. Inversely, an area might be considered low-demand but present congestion points.

The choice of which roads to use is based on origins and destinations of trips, because usually drivers want to use the shortest paths available. However, certain roads can become so overused that taking longer paths (detours) might be faster, and so the demand for these detour paths also increases. With micro-level data we can identify origins and destinations of traffic users and try to understand why certain roads have more demand than others. This analysis is based on the road capacity and the number of users that use it. In Chapter 3 we use the number of vehicles that drive through an area to represent traffic demand intensity.
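To make the notion of demand intensity concrete, the following minimal sketch counts the vehicles observed in each area. It assumes that every GPS point has already been assigned to a zone, and that intensity is measured as the number of distinct vehicles per zone; the function name, the input layout and the choice of distinct vehicles (rather than trips or points) are assumptions made for this example.

```python
from collections import defaultdict


def demand_intensity(points):
    """Count distinct vehicles seen in each zone.

    `points` is an iterable of (vehicle_id, zone_id) pairs, one per GPS
    point that has already been matched to a zone (hypothetical layout).
    """
    vehicles_by_zone = defaultdict(set)
    for vehicle_id, zone_id in points:
        # A set keeps each vehicle once, however many points it sends.
        vehicles_by_zone[zone_id].add(vehicle_id)
    return {zone: len(vehicles) for zone, vehicles in vehicles_by_zone.items()}
```

Zones with larger counts can then be ranked against each other, which matches the relative definition of high and low demand given above.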

The motivation for demand analysis is to find which roads should be given more attention in maintenance or which roads need upgrades (e.g., extra lanes). Bottlenecks are often indicators of areas with high demand, and removing them improves the overall quality of traffic in the area. New research has also shown how origin and destination detection can be used to make predictions about future demand by combining data from traffic with social media data from users [58].

2.2 Floating Car Data Overview

Recently we have seen widespread usage of geolocation devices inside vehicles, usually by drivers who want help navigating traffic. These devices can be seen as simple moving sensors, which presents us with an opportunity to gain insights into the traffic conditions of a certain area. Data that contains GPS coordinates sent directly from traffic users (vehicles, pedestrians, etc.) is called Floating Car Data (or sometimes Floating Cellular Data when data is collected strictly from the cellular network).

2.2.1 Sources

Although any traffic user can serve as a sensor in traffic, the sources of floating car datasets often have a homogeneous profile. This happens because of the logistics of getting permission and access to the sensors. The most common scenario is a company that already tracks a certain vehicle fleet and then decides to use the GPS data in traffic analysis (either itself or by sending the data to a third party). Examples are Uber, DiDi and others. Other companies specifically collect the data so that they can help users navigate traffic, such as Google Maps [30], Apple Maps [29] and more.

One of the most used sources is taxis [24, 39], where drivers most likely already have GPS devices installed in order to allow company headquarters to manage their rides. Floating car data from ride-hailing applications like Uber and DiDi can be placed in the same category as taxis because the fleet is operated by professional drivers, who are managed by a centralized entity. Data can also come from non-professional drivers, although that is not common because of privacy issues. Still, some systems have successfully performed traffic analysis on fleets of privately owned cars [53, 16]. It is important to keep in mind that traffic insights obtained from floating car datasets are based on driver behavior, and so they can be biased depending on their source [40]. Professional drivers may behave differently from regular drivers because they are “on duty”. Data from public buses would differ from other datasets, given that public transport has very different driving patterns, with lower speeds, constant stops and different routes. These properties need to be investigated and the traffic system must be adjusted accordingly (e.g., for a private car dataset a certain speed may be considered slow or congested, but for buses it would be normal).

2.2.2 Applications

Floating car data can be used for a wide range of applications. Our dependency on vehicles is such that, by observing and tracking drivers, it is possible to answer questions not only about traffic but also about human behavior, city structures and more. One of its main uses is to power intelligent traffic systems (ITS), which allow traffic operators to maintain and improve traffic [51, 39, 48]. ITS also enables drivers to be better informed about their travels by providing fastest routes and incident reports based on current traffic conditions [27, 24].

Floating car data can also be used for answering some questions that are not about traffic itself but about traffic users or the infrastructure. Research shows that it can be used to approximate gas consumption [60] based on inferred traffic volume. Additionally, it can also help us determine social functions of city areas by tracking the origins and destinations of trips [34] or by leveraging social data along with floating car data [23].

2.2.3 Usage In Traffic Analysis

Floating car data provides an opportunity to improve intelligent traffic systems. Unlike classic traffic sensors which are expensive to scale and provide only limited area coverage, floating car data can provide traffic systems with low-cost moving sensors that cover large parts of the traffic area. However, it does not make the other classic methods obsolete, but rather it complements systems with data from a different perspective [53]. Floating car data provides great scalability because the cost is distributed between all drivers, most of which have already adopted the use of GPS devices to improve their driving. This type of data can also be used to empower smart city systems, as it contains human movement patterns that can be used not only in traffic but in security, commerce, health, urban planning, and more [57].

Aside from its benefits, floating car data also has some challenges. Because it only provides a sequence of GPS locations sent by drivers, it is up to traffic systems to extract information from this type of data. In Section 2.3 we describe traffic indicators that are extracted from floating car data and used in traffic analysis. Even though these indicators alone can give us some understanding of traffic, we need methods for analyzing them. Using these indicators, traffic systems apply techniques from big data analysis such as statistical analysis, machine learning and data visualization to find real traffic insights.

2.3 Traffic Indicators and Metrics

To better understand traffic from the point of view of floating car data we need to rely on certain metrics and indicators. This process transforms raw sequences of GPS locations into information that models human behavior in traffic. In this section, we look into the most used traffic metrics and indicators for this type of dataset.

Speed of traffic is one of the first properties that come to mind when we think about traffic. In fact, many traffic systems use various algorithms to classify roads as congested or not based on average road speed [51]. Some floating car datasets may already include speed (most vehicles can measure their speed through the CAN bus, and this value can be sent along with the GPS location), while for others speed may need to be approximated based on GPS locations and their timestamps.
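As an illustration of the approximation mentioned above, the following minimal sketch estimates a vehicle's speed from two consecutive GPS fixes using the haversine great-circle distance. The function names, the tuple layout (latitude, longitude, Unix timestamp in seconds) and the Earth radius constant are assumptions made for this example and are not taken from any dataset discussed in this thesis.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def speed_kmh(fix_a, fix_b):
    """Approximate speed in km/h between two fixes (lat, lon, unix_seconds)."""
    lat1, lon1, t1 = fix_a
    lat2, lon2, t2 = fix_b
    dt = t2 - t1
    if dt <= 0:
        raise ValueError("fixes must be in strictly increasing time order")
    # metres per second -> kilometres per hour
    return haversine_m(lat1, lon1, lat2, lon2) / dt * 3.6
```

This measures straight-line displacement between fixes, so it tends to underestimate speed on curved roads when the sampling interval is long; GPS measurement error adds further noise.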

Just approximating vehicle average speed does not provide much information, but we can combine these speeds to find the average speed for a given road segment (also called link average speed). While link average speed can be approximated by a simple average, certain systems have achieved more accurate approximations by using different heuristics. These heuristics take into account historic records of road speed and apply weights to them based on temporal and spatial proximity [53].
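The simple-average version of link average speed can be sketched as follows. The example assumes that each speed sample has already been map-matched to a road segment; the function name and the (link_id, speed) layout are hypothetical. The weighted heuristics of [53] are not reproduced here.

```python
from collections import defaultdict


def link_average_speeds(samples):
    """Simple mean speed per link from (link_id, speed_kmh) samples."""
    totals = defaultdict(lambda: [0.0, 0])  # link_id -> [sum of speeds, count]
    for link_id, speed in samples:
        totals[link_id][0] += speed
        totals[link_id][1] += 1
    return {link: total / count for link, (total, count) in totals.items()}
```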

Another important measure for traffic analysis is travel time, which is how long it takes for vehicles to get to their destination. Because FCD contains the trajectory of various rides, one can easily calculate the time of travel for each ride [28, 27]. This measure is crucial for systems that aim to help users navigate traffic [38, 28, 24], or to allow drivers to approximate how long their travel will take upon their departure.

The previous metrics do not make use of the directional information that is available in floating car datasets, so systems often use the origin-destination matrix concept to understand directions of traffic [33, 26]. This matrix is a simplified way to look at places where trips start and where they end. It is generated by first segmenting the area into zones and then creating a matrix with one row and one column per zone, so that every ordered pair of zones has an entry. An element in the matrix represents the number of trips from the row zone to the column zone. More recent research also analyzes origins and destinations by clustering the data [26].
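The construction of an origin-destination matrix described above can be sketched as follows. This example uses a uniform grid as the segmentation, which is only a stand-in for whatever zoning a real system uses; the function names, the bounding-box layout and the trip format are assumptions made for illustration.

```python
def zone_of(lat, lon, bounds, rows, cols):
    """Map a point to a zone index on a rows x cols grid over `bounds`.

    `bounds` is (min_lat, min_lon, max_lat, max_lon). Returns None for
    points outside the area.
    """
    min_lat, min_lon, max_lat, max_lon = bounds
    if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
        return None
    # min(...) keeps points on the upper edge inside the last row/column.
    r = min(int((lat - min_lat) / (max_lat - min_lat) * rows), rows - 1)
    c = min(int((lon - min_lon) / (max_lon - min_lon) * cols), cols - 1)
    return r * cols + c


def od_matrix(trips, bounds, rows, cols):
    """Count trips between zones; matrix[o][d] is the number of trips o -> d.

    `trips` is an iterable of ((o_lat, o_lon), (d_lat, d_lon)) pairs.
    Trips that start or end outside `bounds` are skipped.
    """
    n = rows * cols
    matrix = [[0] * n for _ in range(n)]
    for (o_lat, o_lon), (d_lat, d_lon) in trips:
        o = zone_of(o_lat, o_lon, bounds, rows, cols)
        d = zone_of(d_lat, d_lon, bounds, rows, cols)
        if o is not None and d is not None:
            matrix[o][d] += 1
    return matrix
```

Summing a column of the matrix gives the number of trips ending in that zone, and summing a row gives the number of trips starting there.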

Traffic systems often combine these properties and others in order to get more accurate results. Although these properties provide insights into traffic, we need methods that allow us to use and understand them. In Section 2.4 we look into how they are used to detect traffic anomalies such as congestion. In Section 2.5 we describe how they are used in traffic prediction. Finally, in Section 2.6 we describe the methods used to display traffic information so that operators can easily understand it.

2.4 Anomaly Detection

In the context of traffic analysis, anomaly detection refers to identifying well-known traffic events such as congestion or accidents. It may also refer to detecting changes or trends in routes and origins and destinations.

2.4.1 Travel Change Detection

Although routes and destinations of drivers may seem random, they are molded by human behavior. And human behavior, most specifically in traffic, generates certain patterns and trends [57]. Patterns of a specific vehicle can be hard to detect, and alone they do not give us much information about traffic. However, when combined with those of other drivers, these patterns can become apparent, allowing us to see which routes are most used, average speeds based on time of day, and other metrics.

Detecting changes in travels can be an important indicator in traffic. As we have seen, drivers follow certain patterns and, if those patterns are broken, that could mean that some traffic anomaly has happened. This is easily seen in our daily traffic: if a road in high demand becomes limited due to construction, other streets will inevitably become busier than before. Detecting these scenarios is an important task of intelligent traffic systems and we will now look into how this can be done.

Changes in driving patterns can be observed from the point of view of a floating car dataset. It is possible to generate historical data of traces of vehicles in traffic. By using pattern matching between the current traces and the historical traces, one can detect if a certain area has an unusual traffic pattern [53, 57]. This is often an indicator that some anomaly is happening.

Origins and destinations are often used to detect changes in travels. Travels with similar origin and destination patterns are analyzed as a group. By using clustering, one can detect if there are any trips that are abnormal [33, 57]. This is useful for taxi companies that need to manage their drivers and detect any suspicious activity.

Some anomalies in traffic may have similar patterns from the point of view of floating car data. To take advantage of that, some systems use a semi-automatic method for anomaly detection. In these systems, machine learning methods such as Conditional Random Fields are used to detect abnormal driving patterns, but traffic experts may be required to understand what is the source of the anomaly [59].

2.4.2 Congestion

That is because congestion has an immediate effect on people’s lives. Detecting congested roads can help drivers better navigate traffic, and it also allows traffic operators and managers to identify roads that may need to be changed, or point to some other problems in the traffic design of a certain area.

There are many ways in which systems try to detect congestion. Older systems have used a speed threshold to determine if certain road segments were congested [51]. While this approach may work for certain small areas, it does not work well when applied to large traffic networks, like entire cities. More recent methods calculate “congested speeds” based on historical data for road segments or small areas [24], and they classify roads into different levels of congestion (from free to completely jammed) [15].
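The idea of grading congestion relative to historical speeds can be sketched as follows. This is only an illustration: the use of the 85th percentile as a reference speed, the ratio thresholds and the level names are all assumptions chosen for the example, not the methods of [24] or [15].

```python
import statistics


def reference_speed(historical_speeds):
    """Near free-flow reference speed: 85th percentile of past speeds (km/h)."""
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%;
    # index 16 is the 85% cut point. Needs at least two data points.
    return statistics.quantiles(historical_speeds, n=20)[16]


def congestion_level(current_speed, reference, thresholds=(0.75, 0.5, 0.25)):
    """Grade a road segment by the ratio of current to reference speed."""
    ratio = current_speed / reference
    free, slow, heavy = thresholds
    if ratio >= free:
        return "free"
    if ratio >= slow:
        return "slow"
    if ratio >= heavy:
        return "heavy"
    return "jammed"
```

Because the reference is computed per segment, the same absolute speed can be graded as free flow on a residential street and as heavy congestion on a highway, which is the main advantage over a single city-wide threshold.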

Traffic density changes throughout the day, but it follows certain patterns based on driver demand (e.g., the hours before and after work are usually the busiest because of people’s commutes). Using hour of day and type of day (workday or not), along with road speed, as parameters has been shown to yield good results when classifying roads as congested or not [39].

2.5 Traffic Prediction

Traffic prediction is useful for two main types of users [53]. Prediction can be used to aid managers of traffic networks by allowing them to be prepared for possible congestion or other traffic events. Such systems are called Advanced Traffic Management Systems (ATMS). On the other hand, traffic prediction can also be useful for drivers by showing them the best routes to their destinations. These systems are called Advanced Traveler Information Systems (ATIS).

Traffic prediction works based on the principle that events in a road influence traffic of nearby roads. Additionally, this influence is measured by temporal and spatial distance (or dependency). For temporal dependency, the traffic on a road is more affected by what happened five minutes ago than by what happened an hour ago. For spatial dependency, congestion in an adjacent road produces more effect than congestion in a road far away. Another important principle of traffic prediction is that traffic can be seen as a sequence of patterns (or classes) that are often repeated. That is, traffic can be congested, somewhat congested or free flow.

To find how the traffic of a given road is going to be in the near future, traffic systems can use Pattern Matching with nearby roads and historical data, giving them weights based on spatial and temporal distance [53]. Similarly, for prediction that is not based on traffic patterns, Neural Networks can be used to predict link travel speed in km/h [53]. Parameters of prediction are usually manually tuned based on our understanding of traffic theory; using temporal comparison of up to 30 minutes has provided good accuracy.
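The weighting by spatial and temporal distance can be sketched with a simple weighted average. This is not the pattern-matching method of [53]; it only illustrates the dependency principle, and the linear weight functions, the 1000 m spatial limit and all names are assumptions. The 30-minute temporal limit follows the figure quoted above.

```python
def predict_speed(observations, max_age_s=1800.0, max_dist_m=1000.0):
    """Predict a link speed from (speed_kmh, age_s, dist_m) observations.

    Each weight falls linearly to zero at `max_age_s` and `max_dist_m`
    (assumed kernels), so recent observations from nearby roads count most.
    Returns None when no observation carries any weight.
    """
    num = 0.0
    den = 0.0
    for speed, age_s, dist_m in observations:
        w_time = max(0.0, 1.0 - age_s / max_age_s)
        w_space = max(0.0, 1.0 - dist_m / max_dist_m)
        w = w_time * w_space
        num += w * speed
        den += w
    if den == 0.0:
        return None
    return num / den
```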

These recurring patterns in traffic can also be seen as states, so that the current traffic state is based on the previous states. In this case, a Markov chain can be used to predict states, where the transfer probability matrix is derived from how roads change from congested to non-congested [24]. One important limitation of these techniques is that their prediction is based on patterns that have already happened; therefore, new patterns are always missed at first. And given the limitations of temporal and spatial comparison, it is possible that certain patterns are cut off, and therefore they are not represented in prediction.
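A first-order Markov chain over traffic states can be sketched as follows. The example estimates the transfer probability matrix from an observed sequence of states for one road and predicts the most likely next state; the state labels and function names are assumptions, and the model in [24] may differ in its details.

```python
from collections import Counter, defaultdict


def transition_matrix(states):
    """Estimate P[a][b] = Pr(next = b | current = a) from a state sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }


def most_likely_next(matrix, current):
    """Return the most probable next state, or None for an unseen state."""
    row = matrix.get(current)
    if not row:
        return None
    return max(row, key=row.get)
```

The `None` result for an unseen state mirrors the limitation noted above: a state or transition that never occurred in the historical data cannot be predicted.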

2.6 Traffic Visualization

Another important role of traffic analysis is to present data to people in a way that it can be easily understood and used to answer important questions about traffic.

A popular visualization method displays traffic information in layers above a map of the area covered by floating car data. The map is often separated into small regions, and specific traffic information about the area can be seen. Areas are also color-coded based on traffic speed [51]. Speed of traffic is also commonly visualized by applying color coding to road segments covered by the dataset [24, 53].

Traffic disturbances are visualized by plotting their locations on the map of traffic. Congestion is usually represented by hot-spots on the map, drawn as circles whose size and color intensity increase with the level of congestion [39, 23]. Traffic accidents or social events are often displayed along with their description and source (social network posts, tweets, etc.).

Another visualization method, which is based on inferred origins and destinations, is to display them as spots in the map and then calculate and plot the route connecting the spots. This route is often color-coded based on the traffic experienced in each area [59, 33].

2.7 Social Data in Traffic Analysis

Although in this thesis we focus on how traffic systems use floating car data, it is worth noting what other types of data have been most recently used to improve traffic analysis.

Although floating car data offers a wide range of information about traffic, traffic systems might benefit from incorporating other types of data into their analysis. Authors of [58] provide a survey about systems that have incorporated data from sources that are not directly related to traffic, but they can indirectly affect traffic. One example is the usage of personal and community data from social networks to track possible agglomerations of drivers, such as concerts, special holidays and other events. Incorporating such data into traffic systems can help in predicting traffic patterns more intelligently.

Because social media often contains information about problems that are afflicting users, it can also be used to improve anomaly detection in traffic. Authors of [48] enhanced their anomaly detection system using tweets from the WeiBo platform to describe what the anomaly is. Once anomalies were detected they analyzed tweets originating from the same location and time. Then they matched tweet semantics to known terms used by users to refer to traffic (accident, event, etc.) to finally find the terms that were used the most during the anomaly.

[58] also describes an emerging trend in traffic analysis which is to use social data to create a human mobility model. These models can approximate user behavior trends such as the origins and destinations of trips (based on the origin-destination matrix) and use that along with GPS data to better estimate travel times from any path in traffic.

2.8 FCD Evaluation

The sections above describe how powerful FCD can be in traffic analysis systems. However, it can only provide approximations, since not all vehicles in traffic are monitored. Also, GPS coordinates sent by vehicles can contain errors, which need to be taken into account. Therefore, it is important to question the accuracy of the insights offered by these systems and to find a way to evaluate it.

One can evaluate the usage of a floating car dataset by comparing its results to data gathered from classic sensors [39], since these sensors have proven high accuracy. This is typically done by matching the locations of static sensors (such as induction loops or speedometers) with locations where floating car data is available; the properties inferred from floating car data are then compared to the values measured by the static sensors. Like any system that works based on a certain dataset, traffic systems risk being too specialized. Excessive specialization makes the system highly accurate on the current dataset, but if the data were to change somewhat (different coverage level, slightly different error margins, etc.) the system would experience a great loss in accuracy [15]. To verify that systems maintain a certain level of generalization, datasets can be programmatically generated with different levels of traffic coverage to check the relevance of traffic results.
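The two checks described above can be sketched in a few lines. This is only an illustration under assumptions: the function names, the choice of mean absolute error as the comparison metric, and the dictionary layout of the dataset are hypothetical and are not taken from the cited works.

```python
import random
from statistics import mean


def mean_absolute_error(fcd_speeds, sensor_speeds):
    """Average absolute gap between FCD-inferred and sensor-measured speeds.

    Both lists are assumed to be matched pairwise by sensor location.
    """
    if len(fcd_speeds) != len(sensor_speeds):
        raise ValueError("inputs must be matched pairwise by location")
    return mean(abs(f - s) for f, s in zip(fcd_speeds, sensor_speeds))


def subsample_vehicles(trips_by_vehicle, coverage, seed=0):
    """Keep a random fraction of vehicles to simulate a lower coverage level.

    trips_by_vehicle is assumed to map a vehicle id to its list of trips.
    """
    rng = random.Random(seed)
    vehicles = sorted(trips_by_vehicle)
    keep = rng.sample(vehicles, max(1, round(coverage * len(vehicles))))
    return {v: trips_by_vehicle[v] for v in keep}
```

Running the same analysis on several subsampled copies of a dataset gives a rough picture of how sensitive a system is to the coverage level.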

Another important evaluation method focuses on testing datasets, instead of the results of traffic systems. This dataset evaluation is valuable when choosing a dataset to work with, as well as when reviewing the results of new traffic systems (their results are influenced by the quality of the dataset). Floating car dataset evaluation focuses on evaluating various dimensions such as how well the dataset represents actual traffic conditions from the area where it was extracted, possible measurement errors, periods of missing information, the spatial coverage of the dataset and more [54, 55].

The most important evaluation of floating car datasets concerns the proportion of vehicles that were used as sensors (the data penetration level). If a dataset has used too few, its results will not be applicable to overall traffic, but only to those specific drivers. Most experiments and studies show that with a penetration of 2-3% one can reliably make assumptions about overall traffic based on FCD [19, 16, 53].

2.9 Open Questions

The usage of floating car data in traffic analysis is quite new, so many of its methods for extracting vehicle speed, link average speed, and origins and destinations are not mature. Many of the papers we have reviewed offer new approaches for calculating these metrics, but more work is still needed.

More specifically, origin-destination matrices usually require the covered area to be segmented. The rows and columns of the matrix are then used to represent every combination of the segmented regions. The area is usually segmented into equal square regions, but research [23] suggests that by calculating more homogeneous regions (which have mostly the same social function) we can better understand how driver behavior is connected to social functions.

Various traffic systems have demonstrated how floating car datasets can be used to infer traffic conditions. However, developing systems that deal with historical and real-time data is challenging. This involves creating a system that not only learns traffic patterns from a dataset, but also continues to learn and improve its parameters based on the new data it receives [28, 33].

Privacy in technology has become very important, so it is natural to question what the privacy implications of using floating car data are. Because these datasets are gathered from people’s GPS devices, the data is usually anonymized. However, certain studies have revealed that anonymizing the data is not enough to guarantee privacy [57, 16]. It is unclear whether a mix of origin and destination analysis and social data extraction may be used by a malicious entity to recover identities. This becomes even more important when systems use live data, as users may be tracked in real time. Implementing intelligent traffic systems that preserve privacy while still providing valuable information is currently an open question.


Finally, because traffic influences our daily life, floating car datasets offer the opportunity for understanding more than traffic. We have seen examples where this data can be used to measure gas consumption [60] or to classify areas of a city based on their functions [34, 23]. Usage of floating car datasets to get information about city functions is also an open problem.


Chapter 3 Traffic Demand

Some intelligent traffic systems have started to incorporate floating car datasets into their traffic analysis, which can provide certain information that traditional traffic sensors cannot [24, 39, 53, 16]. In our literature review in Chapter 2 we have seen that using floating car data in traffic analysis can provide us with useful reports such as average speed of vehicles, traffic volume, travel time estimates and more. There is continuous research into how to better extract and use information from these types of datasets. Many of these reports are specifically useful for traffic control agents and personnel who manage the traffic infrastructure. In this chapter we focus on traffic demand, a metric which is crucial for anyone overseeing the maintenance and improvement of traffic.

Traffic demand, which is often linked to a certain area, can be seen as the need that drivers have for using a part of the traffic infrastructure (roads, bridges, etc.) [58]. In Section 2.1.5 we mentioned that there is a correlation between high-demand areas and congestion. High-demand areas are used by many vehicles and usually suffer more from congestion. More specifically, certain roads that link very populous areas end up being the common paths that many vehicles take during their trips, thus increasing their demand. Understanding traffic demand is extremely important for maintaining and improving traffic infrastructure: paths that are in high demand require more upkeep, but are also the spots where improvements have the most positive effect on traffic [22, 37].

Identifying high-demand areas is something that traffic systems already do reasonably well [36, 41]. This can be done by detecting slow speeds and congestion or by using origins and destinations analysis. Although these methods are good at detecting high-demand areas, they only provide their locations but not their cause. In this chapter, we look into how we can enhance methods of traffic analysis so that we can better understand the causes of traffic demand. This is represented in the following research question:

Can we enhance traffic demand analysis to find the origins of demand by extending the origin-destination matrix?

Given its importance, it is not surprising that analysis of traffic demand is well researched and documented [56]. Our objective is not to create a novel approach to analyze traffic demand, but rather to improve existing methods. Usually, traffic demand analysis works by segmenting the area into sub-areas and then analyzing their interactions to understand traffic direction and volume. This segmentation has a large effect on the final results of the analysis [39, 51]. However, area segmentation has its challenges and more research is needed into the current methods. In this chapter we propose an enhancement to the existing segmentation methods, which is represented by our last research question:

Can we improve the current methods used for traffic demand analysis by using an automatic social function classification method?

Next we start by defining our problem and how we think that automatic social function classification can be used along with origins and destinations to enhance traffic demand analysis. Then we delve into the origin-destination matrix and how we extend it to identify origins of traffic demand. Following that, we look into why area segmentation is needed and what methods are currently available, taking note of their strengths and weaknesses. Then, we make a case for the usage of automatic segmentation and how the social functions of an area should be taken into account to enhance segmentation. Finally, we describe our automated segmentation algorithm and provide a use case where we implemented the algorithm and applied it to a real-world floating car dataset.

3.1 Problem Definition


about what do we mean by traffic demand analysis. Our focus is not on finding the areas with the most demand, but on analyzing where demand comes from and how it progresses over time. More specifically, we want to be able to select an area and visualize where its traffic comes from. Conversely, we also want to be able to select an area and see the most common destinations of vehicles driving from this area. Developing an algorithm that can extract the information needed for these tasks using floating car data is the main part of our efforts in this chapter.

In order to extract this information we build on the well-studied concept of the origin-destination matrix (as seen in Chapter 2) of floating car datasets. This concept has long been used to identify traffic demand [56, 49], but we want to extend its usage to understand the origins of traffic demand. In Section 3.2 we cover how we built on the origin-destination matrix to develop a process that can extract traffic demand information from floating car data and store it in a new structure. This structure can later be used by a system to make queries about traffic demand and generate visualizations.

Because our process for extracting traffic demand is based on the origin-destination matrix, it similarly depends on segmentation of the area covered by the floating car dataset. This segmentation is usually done in a simple way by separating the area into small equal parts [39, 51]. However, we understand that the traffic demand of an area may have different meanings based on what its social function is (commercial, residential, etc) [40]. We note that the term social function has the same meaning as land-use in literature, and it represents the main purpose that an area has [23, 40]. Therefore, to generate more meaningful results about demand, our second challenge was to use an automated segmentation based on the social function of traffic areas. In Section 3.3.1 we describe this enhancement in detail.

To test our enhancements we implemented them in a simple traffic analysis application. Part of our challenge was creating a visualization method that can be used to display the information generated by our traffic demand analysis. As we mentioned above, we would like to allow users to easily interact with the data, so that specific areas can be selected and analyzed more closely. This interaction enables users to get insights from different angles, with detailed information about each area segment and not only the entire analyzed area. In Section 3.5 we show the results of combining our traffic demand algorithm and automated segmentation to analyze a real-world floating car dataset. We then display the results using our interactive visualization methods.

3.2 Demand Based On Origins And Destinations

Traffic demand is governed by where vehicles start their trips (origins), where they need to go (destinations) and the path taken to get to their destination. This causes certain road segments to be used more than others. As expected, these origins and destinations are often not evenly distributed since cities are continuously growing and changing. This lack of uniform distribution often forces vehicles to take similar routes, thus increasing the demand for these roads and possibly causing congestion. Origins and destinations are also ever-changing as people move and change their routines. Some may be occasional changes (such as an event which drastically increases traffic on a given evening), while others might be part of a trend of change in traffic. Therefore, to understand traffic demand we must continuously analyze traffic data over time to uncover trends and patterns in demand, which is one of the main goals of this type of analysis.

One of the main weaknesses of classical traffic analysis was its lack of information about the direction of traffic [15, 25]. However, using floating car data we have access to GPS traces of various drivers and their trips, allowing us to see all the paths they drove through. These traces often contain a significant amount of data, so it is common to limit the analysis of these datasets to only origins and destinations [28, 33]. This particular approach ignores information about speed and the paths taken during the trips, but overall direction is retained.

These origins and destinations are most commonly represented by a matrix [33, 26]. This matrix represents the origins and destinations of an area which was previously separated into sub-areas. The rows and columns of this matrix represent all the area segments and their combinations. Matrix elements represent the number of trips originating from the row location and ending at the column location, as seen in Equation 3.1. This matrix usually includes information about a certain time frame (1 hour, for example), so that a day consists of an ordered array of matrices. A sequence of matrices shows us how origins and destinations change throughout the day.

\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1m} \\
a_{21} & a_{22} & \cdots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nm}
\end{pmatrix}
\tag{3.1}
\]

Origin-destination matrix: a_{ij} represents the number of cars that have trips that started at i and finished at j.
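As a minimal sketch of the structure described above, the following builds one origin-destination matrix per time frame. The trip tuple layout, the function name and the integer area indices are assumptions made for illustration; they are not the data format of any dataset used in this thesis.

```python
from collections import defaultdict


def build_od_matrices(trips, n_areas, frame_seconds=3600):
    """Return {time_frame_index: n_areas x n_areas matrix of trip counts}.

    Each trip is assumed to be a tuple
    (start_time_s, origin_area, destination_area), with areas given as
    integer indices in range(n_areas).
    """
    matrices = defaultdict(lambda: [[0] * n_areas for _ in range(n_areas)])
    for start_time_s, origin, destination in trips:
        frame = int(start_time_s // frame_seconds)
        # Row = origin, column = destination, as in Equation 3.1.
        matrices[frame][origin][destination] += 1
    return dict(matrices)
```

Sorting the returned dictionary by its keys gives the ordered array of matrices that describes one day.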

Using a traditional origin-destination matrix can give us some indication of where traffic comes from, but it is limited to only start and end locations. We are interested in knowing, for a given area, all other locations that contribute traffic to this area and all locations to which this area sends traffic. Assume that a floating car dataset was segmented into sub-areas [A1, A2, ..., An]. We want to know from which sub-areas the traffic of A1 comes and how the other sub-areas A2, A3, ..., An are affected by vehicles coming from area A1. The traffic demand of area A1 is not defined only by the sub-area where a vehicle started before passing through A1, but also by all the other sub-areas which the vehicle navigated until A1 was reached.

Therefore, we created a new matrix that extends the functionality of the origin-destination matrix. In this new matrix we analyze the entire path of drivers’ trips. We call this structure a traffic demand matrix. In this matrix the rows and columns continue to form all permutations of sub-areas, but an element in position {i, j} represents the number of times that vehicles were at sub-area Ai and then, later on, went through sub-area Aj.

This matrix allows us to see all locations where traffic comes from, along with the intensity of demand. Similarly to the origin-destination matrix, this traffic demand matrix depends on the segmentation of the area. So, before describing how to create the traffic demand matrix, we describe our enhancements to area segmentation in Section 3.3. Then in Section 3.4 we use the result of the segmentation to create the traffic demand matrix, which in turn will be visualized by our interactive visualization method in Section 3.5.


3.3 Automatic Segmentation

A common approach to segmentation of an area is to simply separate the entire area into small equal-sized sub-areas [39, 51]. This is a fast and straightforward process and it can be used to understand the interactions between these sub-areas. However, this segmentation process does not take into account the contents of the area, and the social function of an area impacts the way we analyze its traffic [40]. Figure 3.1 shows the difference between this simple segmentation (3.1a) and a segmentation based on social functions (3.1b). In our case, segmentation is going to be used to analyze the interaction between sub-areas, and knowing the social function of the sub-areas offers more insight into the cause of traffic.

Figure 3.1: Segmentation comparison. (a) Standard segmentation. (b) Segmentation by social function. Color key: green, Residential; blue, Commercial; red, Public Space; gray, Industrial.

Another common segmentation method is the usage of some pre-defined organization of the area, such as zoning maps designed by local government [18]. The area is then separated according to this reference model. Although social functions are taken into account in this case, these models can quickly become obsolete as cities change, and it is costly to recreate the segmentation since it relies largely on the manual effort of creating these maps.


We are particularly interested in automated approaches. As seen in Chapter 2, a common automated segmentation approach is to use satellite images of an area. By analyzing the colors of the areas one is capable of inferring certain information, which can then be used to classify the area as having a certain social function [61, 43, 34]. This is possible because each area with a certain social function produces a unique color profile in pictures taken from above. These color profiles are extracted and used in pattern matching to label areas. A downside of this approach is that satellite images are not always quickly updated, leading to out-of-date segmentation. Also, while color analysis can perform some rough classification, fine-grained classification (such as differentiating between retail areas and office areas) is very hard.

In this section we look into how we can create a segmentation that better represents areas of a city, so that we can provide more effective insights than the segmentation methods mentioned above. We focus on creating an automated method which is scalable and has low cost. At the same time, we want to be able to incorporate social function information into our segmentation. In the following section, we look more closely into what a social function is and how we can classify areas based on it. Then in Section 3.3.2 we describe how we can use this classification to create an algorithm for segmenting the overall area into segments with similar functions.

3.3.1 Social Function Classification

As mentioned above, origins and destinations are not distributed evenly. Cities are often separated into areas that have different social functions such as rural, industrial, commercial and residential. The traffic between certain areas has its own pattern and intensity based on human behavior. Residential and commercial areas are often linked by workers’ commutes, creating intense traffic of light vehicles during the hours before and after the workday. Industrial and rural areas are often linked by trucks that deliver raw goods and supplies, generating traffic with heavy vehicles. Each pair of linked areas has its own traffic profile, so it is interesting to understand what these zones are and how demand changes on the roads between them.

Finding a reliable way to infer the social functions of an area is challenging because the areas are constantly changing. It is not uncommon for dense residential areas to eventually be considered commercial areas, since businesses start opening in the area to take advantage of the local demand. Therefore, our classification must be able to run repeatedly so as to remain accurate. After some research we found that we can use crowd-sourced information. Research suggests this type of information can have high accuracy [13, 45], and it can be accessed at low cost.

Various map and navigation applications collect what can be seen as crowd-sourced information. These apps use user feedback to tag and classify places. Some reviewing might happen first, but usually the feedback is quickly incorporated into the databases. Examples of this are Google Maps [30], Apple Maps [29] and others [1, 5], where users tag various locations such as businesses, parks, monuments, roads, etc. Users often provide not only the type of the location but also detailed information such as working hours, photos, descriptions and more. Another interesting point is that several of these apps make their databases easy to access via open APIs. Therefore, we decided to use this pool of information to classify areas of floating car datasets by their primary social functions.

We are particularly interested in analyzing urban areas. So we considered the following as possible social functions:

• Residential: areas where there are mostly houses, with limited land occupied by businesses.

• Commercial: areas with commerce, companies and other establishments, with the exception of industries.

• Industrial: areas largely occupied by industries and manufacturing buildings.

• Public Space: areas with monuments, parks, government buildings, and other similar buildings.

These areas have different profiles and generate different types of traffic, so we are interested in knowing how traffic demand happens between them.

Our classifier operates on an area represented by two coordinates, {x1, y1} and {x2, y2}, located at diagonally opposite corners of a square. We limit the area to this format because of how the segmentation algorithm (Section 3.3.2) works. We used Google’s Places API [31] to gather information about each area. This API provides an endpoint which returns nearby places for a given location [31]. The request takes a location parameter (latitude and longitude) and a radius parameter. The following is an example of the API request:

https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=-33.8670522,151.1957362&radius=1500&key=API_KEY

Note that since the areas we are classifying are squares, we use the center of the square as the location parameter and set the radius to half the side length of the square. That way the searched circle is inscribed in the square. This leaves out the corners of the square, but it is a reasonable approximation.
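The construction of the request parameters from a square can be sketched as follows. The function name is hypothetical, x is assumed to be latitude and y longitude, and the side length is assumed to be supplied in meters by the caller. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

NEARBY_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"


def nearby_search_url(corner_a, corner_b, side_m, api_key):
    """Build a nearby-search URL for the circle inscribed in a square area.

    corner_a and corner_b are diagonally opposite corners given as
    (latitude, longitude); side_m is the side length of the square in meters.
    """
    (x1, y1), (x2, y2) = corner_a, corner_b
    center_lat = (x1 + x2) / 2
    center_lng = (y1 + y2) / 2
    params = {
        "location": f"{center_lat},{center_lng}",
        "radius": int(side_m / 2),  # half the side: circle touches the edges
        "key": api_key,
    }
    # safe="," keeps the comma in the location parameter unescaped.
    return f"{NEARBY_SEARCH_URL}?{urlencode(params, safe=',')}"
```

For example, `nearby_search_url((1.0, 2.0), (3.0, 4.0), 500, "API_KEY")` searches a 250 m radius around the point (2.0, 3.0).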

This request returns a list of places found in the area. For each of these places there is a large amount of information such as description, websites, etc. One of these properties is called “types” and it contains a list of categories that users believe this place to be in. These categories can be seen as the social functions of the place. Since there are many possible categories returned by the API, we mapped each of these categories to either Commercial, Industrial or Public Space. Note that our mapping does not include Residential areas since the API only includes public places, no residential buildings/houses. Later on we explain how we inferred an area to be residential. Although we limited our classification to only four social functions, one could easily extend this algorithm with a more complex mapping based on the categories and possibly even by using other properties returned by the API.
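The category mapping itself is not listed here, so the following is only an illustrative sketch of what such a mapping could look like. The type strings and their assignment to social functions are assumptions chosen for the example, and a place is assumed to be a dictionary with a "types" list as described above.

```python
# Illustrative mapping only: both the selection of type strings and their
# assignment to a social function are assumptions, not the thesis mapping.
TYPE_TO_FUNCTION = {
    "store": "Commercial",
    "restaurant": "Commercial",
    "bank": "Commercial",
    "shopping_mall": "Commercial",
    "storage": "Industrial",
    "park": "Public Space",
    "museum": "Public Space",
    "city_hall": "Public Space",
}


def place_function(place):
    """Return the social function of one API result, or None if unmapped.

    The first type that appears in the mapping decides the function.
    """
    for place_type in place.get("types", []):
        if place_type in TYPE_TO_FUNCTION:
            return TYPE_TO_FUNCTION[place_type]
    return None
```

Places whose types are all unmapped return None and would simply be ignored when counting places per social function.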

After each of the places in an area is classified as one of our social functions, we must decide what the primary social function of the area is. The area currently being classified is only a part of the overall area covered by the floating car dataset. Before we classify a single area, we retrieve places from all areas to compute a baseline: the average number of places per area for each social function. Then, using this average, we look at each area and check the number of places for each social function. If the number of places for a given social function is above the average, we classify the area as primarily having this social function. Ties are resolved in favor of the social function that is further from its average. After this process, many areas may remain unclassified because they do not have enough places of any social function. These areas are then considered to be Residential.
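The decision rule above can be sketched as follows. This is a minimal illustration under assumptions: the input layout and function name are hypothetical, and "further from the average" is interpreted as the largest absolute excess over the average.

```python
from collections import Counter

FUNCTIONS = ("Commercial", "Industrial", "Public Space")


def classify_areas(place_functions_by_area):
    """Map each area id to its primary social function.

    place_functions_by_area: {area_id: [function of each place found there]}
    """
    if not place_functions_by_area:
        return {}
    counts = {a: Counter(fs) for a, fs in place_functions_by_area.items()}
    n_areas = len(counts)
    # Average number of places per area, computed per social function.
    average = {f: sum(c[f] for c in counts.values()) / n_areas
               for f in FUNCTIONS}
    labels = {}
    for area, c in counts.items():
        # Excess over the average, kept only for functions above average.
        excess = {f: c[f] - average[f] for f in FUNCTIONS if c[f] > average[f]}
        # No function above average: the area is treated as Residential.
        labels[area] = max(excess, key=excess.get) if excess else "Residential"
    return labels
```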


3.3.2 Segmentation Algorithm

Similar to classic segmentation, our algorithm first separates the analyzed area into small sub-areas of equal size. The optimal size of these sub-areas depends on the topology of the traffic infrastructure, but ideally a sub-area should be small enough to cover one block. This is done so that each sub-area is as homogeneous as possible regarding the social functions within it. We understand that a block is not guaranteed to be perfectly homogeneous.

At this point in our algorithm we have segmented the entire area into small equal parts. These parts are then classified as Residential, Commercial, Industrial or Public Space by the methods described in the previous section. The size of our segments was selected so that each covers roughly a block; however, it is common for several adjacent blocks to have the same social function. We can therefore join these segments together to form a larger segment. Given our classification of each small area, we can say that this larger segment continues to have the same primary social function.

The result of the segmentation can be seen as a matrix M where each area is an element mij, and i and j index latitude and longitude, respectively. To join areas that have the same function we decided to use a flood-fill algorithm, which finds one homogeneous large segment each time it runs. So, to find a new large segment we do the following:

1. Create a new array for the large segment, LSn

2. Choose a random element mij of the matrix M

3. If mij is already part of a LS array then stop

4. Add index of mij to LSn

5. Check each neighbor of mij and add the index of the neighbor to LSn if it has the same social function as mij.

6. Recursively perform steps 4 and 5 on each neighbor that was added.
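The steps above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work, and it makes three assumptions: cells are visited in scan order instead of being chosen at random, an explicit queue replaces the recursion of step 6, and only the four edge-adjacent cells count as neighbors. The grid is assumed to be a non-empty list of rows of social function labels.

```python
from collections import deque


def find_large_segments(grid):
    """Group 4-connected cells that share a label into large segments."""
    rows, cols = len(grid), len(grid[0])
    assigned = [[False] * cols for _ in range(rows)]
    segments = []
    for i in range(rows):
        for j in range(cols):
            if assigned[i][j]:
                continue  # step 3: cell already belongs to a large segment
            function = grid[i][j]
            cells, queue = [], deque([(i, j)])
            assigned[i][j] = True
            while queue:
                r, c = queue.popleft()
                cells.append((r, c))  # step 4: add the index to LSn
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not assigned[nr][nc]
                            and grid[nr][nc] == function):
                        assigned[nr][nc] = True  # step 5: same function
                        queue.append((nr, nc))
            segments.append({"function": function, "cells": cells})
    return segments
```

Using a queue avoids hitting recursion limits on large homogeneous regions, which is the main reason for departing from the recursive wording of step 6.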

This is a breadth-first flood-fill and it results in LSn containing a sequence of indexes


when there is no element that is not already part of a large segment (when step 3 fails). Each large segment now has a sequence of indexes and the social function which it represents. The combination of all large segments covers the entire segmented area.

This segmentation is a costly process which involves classification and flood-fills. Fortunately, we do not need to run it frequently, because the social function of an area generally changes at a very slow pace or remains unchanged for a long time. So, it is sufficient to perform this segmentation occasionally (e.g. monthly or yearly). To be able to reuse the result every time an analysis is performed, we export the results of the segmentation into a file that can be imported by a traffic analysis system. In this file we save the configuration of the initial segmentation: the area covered and the size of the small areas. We also add the large segment arrays, each containing the social function it represents and the indexes of all small areas it contains. From this information, it is possible to quickly recreate the segmentation without going through the classification and flood-fill procedures every time.
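The export and import step can be sketched as follows. The file format is not specified here, so JSON and the field names "bounds", "cell_size" and "segments" are assumptions chosen for illustration.

```python
import json


def export_segmentation(path, bounds, cell_size, segments):
    """Write the initial configuration and the large segments to a file.

    bounds describes the area covered, cell_size the size of the small
    areas, and segments is a list of {"function": ..., "cells": [...]}.
    """
    payload = {
        "bounds": bounds,
        "cell_size": cell_size,
        "segments": [
            {"function": s["function"], "cells": [list(c) for c in s["cells"]]}
            for s in segments
        ],
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)


def import_segmentation(path):
    """Load a segmentation file and restore cell indexes as tuples."""
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)
    for segment in payload["segments"]:
        segment["cells"] = [tuple(c) for c in segment["cells"]]
    return payload
```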

3.4 Creating Traffic Demand Matrix

In Section 3.2 we described a data structure similar to an origin-destination matrix that contains general traffic direction and, for a given area, which other areas contribute to its traffic demand. We call this structure the traffic demand matrix. This matrix represents traffic between a set of areas, so the segmentation of the overall area is a necessary step before creating this matrix. More specifically, we are interested in using the output of the automated segmentation just described in Section 3.3. In this section we describe how this traffic demand matrix can be created.

As seen in Section 3.3, the overall area of the floating car dataset was automatically segmented based on social functions, resulting in a set of areas which cover the entire area of the floating car dataset. The traffic demand matrix has no dependency on the segmentation type and the steps described in this section to create the matrix can be applied on any segmentation. However, for our purposes, we will focus on using the areas found by our automated segmentation. This will include dealing with the exported file and performing the calculations needed to recreate the segmentation.


affect the traffic of other areas. According to our definition of traffic demand, the demand of an area is defined by the number of vehicles that need to use it [56, 36, 22]. Therefore, our matrix is built based on the number of cars that use a certain area, together with information about direction. In particular we are interested in knowing, for a given area, how its origins of traffic compare and which ones provide the most traffic. The goal is to quickly see and compare the levels of demand intensity.

Our traffic demand matrix is organized similarly to the origin-destination matrix. The rows and columns represent all permutations of areas, where rows are origins and columns are destinations. However, in this case, an element of the matrix represents the intensity of demand that an area has for another area. Assuming our matrix is M, an element mi,j represents the demand intensity of traffic that area i has for area j. Consequently, column j contains the intensity of all areas that have demand for j. Note that the meaning of origins in this matrix is slightly different from the origin-destination matrix: while a car drives through areas in the order [a1, a2, ..., an], a1 will be a demand origin for all areas [a1, a2, ..., an]. But other areas will also be considered origins, as a2 will be a demand origin for all ai with i >= 2, and so on.

Note that an area might contribute traffic to itself since trips happen within an area without leaving its borders. On a trip, every area that a car drives through has demand for every following area.

While the demand intensity of one area to another is measured in the number of vehicles, the final state of our matrix will not contain the actual number of cars but rather a relative scale between all origins of demand. Therefore, the combined demand intensity of all origins for a given destination must add up to 1 (i.e. the sum of each column must be 1), and so each origin must have a value between 0 and 1. We can better visualize this concept in Equation 3.2. The first column represents the demand intensity that each area has for area a. In this example, d is the area that has the most demand for a. The values of the column all add up to 1. This normalization provides an easier way to read the matrix and easier interaction with visualization methods. An element represents the intensity of demand that a location (row) has for a location (column).

\[
\begin{array}{c|cccc}
  & a & b & c & d \\ \hline
a & 0.1 & 0.25 & 0.7 & 0.13 \\
b & 0.3 & 0.5 & 0 & 0.38 \\
c & 0.2 & 0.25 & 0 & 0.16 \\
d & 0.4 & 0 & 0.3 & 0.33
\end{array}
\tag{3.2}
\]

a, b, c, d represent locations from the floating car dataset.

Now we define the algorithm to fill the traffic demand matrix. The algorithm takes a floating car dataset that is separated into trips and the output of the automated segmentation described in Section 3.3. Each trip is analyzed separately by going through the array of GPS coordinates and separating them into sub-arrays depending on which area they belong to. The areas found during segmentation are made up of multiple small sub-areas of fixed size (as in Figure 3.1a), so to find the area to which a GPS coordinate belongs, we must first find the sub-area to which it belongs. For full details on this process see Algorithm 1 in Appendix B.

Now that we know to which sub-area the coordinate belongs, we need to find the area that contains this sub-area. In the file exported by the segmentation, an area is defined by an array of sub-areas, where each sub-area is defined by a tuple (x, y). So, we can find the container of the sub-area by searching the array of each area to check if it contains the tuple (sub_area_x, sub_area_y).
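The two lookups can be sketched as follows. Algorithm 1 is not reproduced here, so the grid arithmetic below is an assumption: sub-areas are taken to be squares of side cell_size laid out from the minimum latitude and longitude of the covered area. The sketch also replaces the linear search through each area's array with a precomputed dictionary, which is a design choice of this example rather than the method described above.

```python
import math


def sub_area_of(lat, lng, lat_min, lng_min, cell_size):
    """Grid index (x, y) of the fixed-size sub-area containing a coordinate."""
    return (math.floor((lat - lat_min) / cell_size),
            math.floor((lng - lng_min) / cell_size))


def build_area_index(segments):
    """Invert the exported segments into {sub-area tuple: area id}.

    segments is assumed to be a list of {"function": ..., "cells": [...]},
    and the area id is the position of the segment in that list.
    """
    return {tuple(cell): area_id
            for area_id, segment in enumerate(segments)
            for cell in segment["cells"]}
```

With the index built once, finding the area of a coordinate becomes a single dictionary lookup: `index[sub_area_of(lat, lng, lat_min, lng_min, cell_size)]`.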

We use this process to find the area to which each coordinate belongs, which lets us separate the array of GPS coordinates of a trip into sub-arrays, each belonging to one area. This array of sub-arrays gives the sequence of areas that the vehicle drove through, [a_1, a_2, ..., a_n]. We then iterate through all coordinates of the trip, starting from the coordinates of area a_1; whenever we detect a change from the current area a_i (the next coordinate is in another sub-array), we increase the value of the matrix element m_{a_1,a_i}. We repeat this process, always starting at the first coordinate of an area. This means that the second iteration covers the coordinates from area a_2 to a_n, without going through the coordinates of a_1. This is because a vehicle going from a_k to a_n affects all areas in between, but no areas before a_k. In the second iteration, for example, a_2 adds vehicle counts to all columns of the matrix except a_1, because a_1 was driven through before a_2. We run this process on all trips of the floating car dataset.
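The counting step can be sketched as below. It is a simplified reading of the procedure rather than the pseudo-code of Appendix B: it assumes that every coordinate of a trip has already been labelled with its area (None when it falls outside all areas), and that each origin area contributes one vehicle to every area entered after it. The names new_counts, area_sequence and add_trip are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def new_counts():
    """Create an empty nested count matrix: counts[origin][destination]."""
    return defaultdict(lambda: defaultdict(int))


def area_sequence(coordinate_areas):
    """Collapse per-coordinate area labels into the sequence of areas visited.

    Unlabelled coordinates (None) are dropped before consecutive
    duplicates are merged.
    """
    labelled = (area for area in coordinate_areas if area is not None)
    return [area for area, _ in groupby(labelled)]


def add_trip(counts, coordinate_areas):
    """Add the contribution of one trip to the count matrix."""
    sequence = area_sequence(coordinate_areas)
    for k, origin in enumerate(sequence):
        # Areas driven through before the origin are never incremented.
        for entered in sequence[k + 1:]:
            counts[origin][entered] += 1
    return counts
```

Under this reading, a trip that re-enters an area it already left also increments the corresponding diagonal element.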

At this point each element of our matrix contains the number of vehicles that an area i provides to an area j. The more vehicles an area provides, the greater the intensity of the demand. To facilitate comparison, we normalize the elements of each column, since these represent the number of vehicles from each origin. After normalization each column sums to 1, so every element lies between 0 and 1. The result is a matrix that contains the intensity of demand for every ordered pair of areas found in segmentation. For the complete pseudo-code of the matrix creation please see Appendix B.
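A minimal sketch of the column normalization is given below. It assumes the counts are stored as a nested dictionary counts[origin][destination]; this layout and the function name normalize_columns are ours. The handling of a destination that received no vehicles (its column is left at 0) is also an assumption, since that case is not discussed here.

```python
def normalize_columns(counts, areas):
    """Scale each column of counts[origin][destination] so that it sums to 1."""
    normalized = {origin: {} for origin in areas}
    for destination in areas:
        # Total number of vehicles that all origins provide to this destination.
        total = sum(counts.get(origin, {}).get(destination, 0) for origin in areas)
        for origin in areas:
            value = counts.get(origin, {}).get(destination, 0)
            # Assumption: an empty column stays at 0 instead of dividing by zero.
            normalized[origin][destination] = value / total if total else 0.0
    return normalized
```

For example, if area a provides 1 vehicle to c and area b provides 3, the column for c becomes 0.25 and 0.75.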

This matrix allows us to answer certain questions about traffic demand. For a given area i, column i of the matrix shows which areas provide the largest number of vehicles, and therefore which areas generate the most demand for area i. By plotting this information we can also see the direction of demand. In the next section, we describe how we implemented this algorithm and which visualization methods we used to help a user better understand the data within this matrix.
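Answering the question "which areas generate the most demand for area i" amounts to sorting one column. The sketch below assumes the normalized matrix is a nested dictionary matrix[origin][destination]; the function name rank_origins and the optional parameter top_k are hypothetical.

```python
def rank_origins(matrix, destination, top_k=None):
    """Return (origin, share) pairs for one destination, largest share first."""
    shares = [
        (origin, row.get(destination, 0.0)) for origin, row in matrix.items()
    ]
    ranked = sorted(shares, key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]
```

A visualization layer could use the ranked shares to decide, for instance, how strongly to highlight each origin when a destination is selected.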

3.5 A Sample Use Case

In Section 3.3 we described our proposal for enhancing area segmentation with an automated method that uses crowd-sourced information to tag areas based on their social functions. In Section 3.4 we described a method for calculating traffic demand by extending the origin-destination matrix. To demonstrate the usability of these methods, we developed a simple traffic analysis application that combines them and displays their results using interactive visualization.

In terms of usability, the goal is to allow users to easily interact with the traffic demand matrix. For that, we need to set up certain parameters that will be used during segmentation and matrix creation. Our application provides an interface which allows calibration and execution of the automated segmentation. Then we plot the results of the segmentation onto a map, highlighting the areas found, along with their social functions. Furthermore, we use the result of the segmentation, along with a given floating car dataset, to create a traffic demand matrix and incorporate it into the visualization. Users can select areas and visualize where traffic demand comes from and with which intensity.

To test our algorithms we used a dataset from the GAIA Initiative [20], more specifically the Oct 2016, Xi’an City Second Ring Road Regional Trajectory Dataset. This dataset has trips from the DiDi ride-hailing application during the month of October 2016. Research suggests that taxi trips may be a good sample of the overall traffic of an area [24, 39]. Also, a 2-3% vehicle sample from the overall population can produce generalized insights [53]. So, we think this dataset is a good choice for analyzing demand as well. For full details on this dataset and implementation steps see Appendix A.1.

In the following sections we describe how we implemented and tested our application. We describe in depth the implementation of the application and how all the pieces are connected. We outline some of the technical challenges and the optimizations used to make our algorithms reasonably fast.

3.5.1 Implementation

Because of the visualization requirements of our application, we decided to implement it with Node.js and JavaScript. We used a standard client-server architecture, with data processing done in Node.js and the UI in JavaScript. For visualization specifically, we used the JavaScript library Chart.js to power our custom visualizations. For rendering and interacting with real-world maps we used a combination of OpenStreetMap [4], Leaflet [2] and Mapbox [3].

We have already covered the main algorithms used in the application, but it is worth mentioning some challenges we faced when implementing them to handle the test dataset. The dataset is quite large: it is divided into one file per day, each about 2.5 GB (with some days considerably larger). Therefore, we cannot assume that a file can be loaded into memory. Our segmentation and matrix creation were implemented to work on a stream of data, one trip (one line of the dataset) at a time. Before segmentation can be done, the entire floating car dataset must be iterated through once so that we know the actual span of the analyzed area. For the
