
Mining moving flock patterns in large

spatio-temporal datasets using a frequent pattern mining approach

Andres Oswaldo Calderon Romero

March 2011


Course Title: Geo-Information Science and Earth Observation for Environmental Modelling and Management
Level: Master of Science (MSc.)
Course Duration: September 2009 – March 2011
Consortium partners: University of Southampton (UK), Lund University (Sweden), University of Warsaw (Poland), University of Twente, Faculty ITC (The Netherlands)
GEM thesis number: 2011–


Mining moving flock patterns in large spatio-temporal datasets using a frequent pattern mining approach

by

Andres Oswaldo Calderon Romero

Thesis submitted to the University of Twente, Faculty ITC, in partial fulfilment of the requirements for the degree of Master of Science in Geo-information Science and Earth Observation for Environmental Modelling and Management.

Thesis Assessment Board

Chairman: Prof. Dr. Menno-Jan Kraak
External Examiner: Dr. Jadu Dash
First Supervisor: Dr. Otto Huisman
Second Supervisor: Dr. Ulanbek Turdukulov


Disclaimer

This document describes work undertaken as part of a programme of study at the University of Twente, Faculty ITC. All views and opinions expressed therein remain the sole responsibility of the author, and do not necessarily represent those of the university.


Abstract

Modern data acquisition techniques such as the Global Positioning System (GPS), Radio-Frequency Identification (RFID) and mobile phones have resulted in the collection of huge amounts of data in the form of trajectories during the past years. The popularity of these technologies and the ubiquity of mobile devices seem to indicate that the amount of spatio-temporal data will increase at accelerated rates in the future. Many previous studies have focused on efficient techniques to store and query trajectory databases. Early approaches to recovering information from this kind of data include single-predicate range and nearest neighbour queries. However, these are unable to capture collective behaviour and correlations among moving objects. Recently, a new interest in querying patterns that capture 'group' or 'common' behaviour has emerged.

An example of this type of pattern is the moving flock: a group of moving objects that move together (within a predefined distance of each other) for a certain continuous period of time.

Current algorithms to discover moving flock patterns report problems in scalability and in the way the discovered patterns are reported. The field of frequent pattern mining has faced similar problems during the past decade, and has sought to provide efficient and scalable techniques which successfully deal with those issues. This research proposes a framework which integrates techniques for clustering, pattern mining, postprocessing and visualization in order to discover and analyse moving flock patterns in large trajectory datasets.

The proposed framework was tested and compared with a current method (the BFE algorithm). Synthetic datasets simulating trajectories generated by large numbers of moving objects were used to test the scalability of the framework. Real datasets from different contexts and with different characteristics were used to assess the performance and analyse the discovered patterns. The framework proves to be efficient, scalable and modular. This research shows that moving flock patterns can be generalized as frequent patterns, so state-of-the-art algorithms for frequent pattern mining can be used to detect them. This research also develops preliminary visualizations of the most relevant findings. Appropriate interpretation of the results demands further analysis in order to display the most relevant information.

Keywords: Frequent pattern mining, Flock patterns, Trajectory datasets.


Acknowledgements

I would like to express my sincere gratitude to my first supervisor, Dr. Otto Huisman, and second supervisor, Dr. Ulanbek Turdukulov, for their great support and guidance during this research. I think I was the most fortunate student for having the chance to work with such great scientists. I greatly appreciate your support, critical comments and suggestions. Thank you so much!!!

I would also like to thank Petter Pilesjo, Malgorzata Roge-Wisniewska, Andre Kooiman and Louise van Leeuwen for their valuable help at different stages of my studies.

A special “Thank you!!!” goes to all my GEM friends for the wonderful time we had together. You were my second family during the past months and I will never forget you. I will miss you a lot.

I would like to dedicate this thesis to my parents, Marcelo and Esperanza, my brother and sisters, Carlos, Paola and Carolina, and my little nephew and niece, Chris and Gabi. Thank you for believing in me even when I found it difficult to believe in myself. I owe you much more than this.

Finally, I want to thank my fiancée. Nancy, you are the love of my life. Thank you for all your infinite love, support and patience during all this time. I love you!!!


Contents

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Research identification
    1.3.1 Research objectives
    1.3.2 Research questions
    1.3.3 Innovation aimed at
    1.3.4 Related work
  1.4 Thesis structure

2 Framework Definition
  2.1 Identifying patterns in moving objects
  2.2 Basic Flock Pattern algorithm
  2.3 Finding frequent patterns in traditional databases
    2.3.1 Shopping basket analysis: an example
    2.3.2 Maximal and Closed frequent patterns
  2.4 Proposed Framework
    2.4.1 Getting a final set of disks per timestamp
    2.4.2 From trajectories to transactions
    2.4.3 Frequent Pattern Mining Algorithms
    2.4.4 Postprocessing Stage
  2.5 Flock Interpretation

3 Implementation
  3.1 BFE Implementation
  3.2 Synthetic Generators
  3.3 Synthetic Datasets
  3.4 Internal Comparison
  3.5 Framework Implementation
  3.6 Computational Experiments
  3.7 Validation

4 Study Cases
  4.1 Tracking Icebergs in Antarctica
    4.1.1 Implications and possible applications
    4.1.2 Data cleaning and preparation
    4.1.3 Computational experiments
    4.1.4 Results
    4.1.5 Findings in iceberg tracking
  4.2 Pedestrian movement in Beijing
    4.2.1 Implications and possible applications
    4.2.2 Data cleaning and preparation
    4.2.3 Computational experiments
    4.2.4 Results
    4.2.5 Findings in pedestrian movement

5 Discussion
  5.1 Implementation and Performance Issues
    5.1.1 Impact of trajectory size
    5.1.2 Possible solutions
  5.2 Interpretation Issues
    5.2.1 Number of patterns and quality of the results
    5.2.2 Overlapping problem and alternatives

6 Conclusions and Recommendations
  6.1 Summary of the Research
  6.2 Recommendation

References

Appendices

A Main source code of the framework implementation


List of Figures

1.1 A flock pattern example: {T1, T2, T3}. Ti illustrates different trajectories, ci encloses a disk in which trajectories are considered close to each other and ti represents consecutive time intervals (after [82]).
2.1 BFE Algorithm for computing the set of final disks per timestamp and for joining and reporting final flock patterns (source: [82]).
2.2 BFE pruning stages. (a) The initial set of disks. (b) Only disks which surpass μ are retained (μ = 3). (c) Redundant disks with subset members are removed.
2.3 Shopping Basket Analysis example (source: [33]).
2.4 A trajectory dataset example.
2.5 Example of a flock where different interpretations can apply.
3.1 Oldenburg network representation.
3.2 San Joaquin network representation.
3.3 Comparison of internal execution time for the SJ25KT60 dataset.
3.4 Comparison of internal execution time for the SJ50KT55 dataset.
3.5 Systematic diagram for the proposed framework.
3.6 Overlapping problem during the generation of final disks.
3.7 Performance of the BFE algorithm and the proposed framework with different values of ε in the SJ25KT60 dataset. The additional parameters were set as μ = 5 and δ = 3.
3.8 Performance of the BFE algorithm and the proposed framework with different values of ε in the SJ50KT55 dataset. The additional parameters were set as μ = 9 and δ = 3.
3.9 Visualization of the results from BFE (left) and the proposed framework (right). BFE displays 448 flocks while the proposed framework displays 104.
4.1 Reported positions for all icebergs in the Iceberg dataset (1978, 1992-2009).
4.2 The circumpolar and coastal currents (West and East wind drifts) around the Antarctic continent (source: [93]).
4.3 Spatial location of Antarctic krill catches (dotted and lined regions). Black areas illustrate ice shelves and fast ice during summer (source: [63]).
4.4 Comparison between the BFE algorithm and the proposed framework's performance for different values of ε in the Icebergs06 dataset.
4.5 General view of the discovered patterns in the Icebergs06 dataset. Arrows indicate the direction of the flocks.
4.6 Detail of discovered flocks in the Icebergs06 dataset. Arrows indicate the direction of the flocks.
4.7 General view of the discovered patterns from January 01 to February 15.
4.8 General view of the discovered patterns from June 03 to August 17.
4.9 Distribution of points in the study area. Left shows the sparse distribution around China. Right focuses on the 5th Ring Road area in Beijing (source: [98]).
4.10 Comparison of both methods with different values of ε in the Beijing dataset.
4.11 General view of the discovered flocks in the Beijing dataset.
4.12 Close-up around the region which concentrates the largest number of flocks. Some universities and IT institutions are highlighted.
4.13 Patterns shorter than 5 km during workdays. The circle encloses the major concentration around the TSP region. Arrows highlight other locations.
4.14 Patterns showing different routes connecting the TSP area with the south. Yellow patterns go from TSP to the south; green patterns show the return.
5.1 Example of reported flocks with different values of ε.


List of Tables

2.1 Transactional version of the dataset from Figure 2.4.
3.1 Data format from generator.
3.2 Synthetic Datasets.
3.3 Number of combinations required for specific time intervals in SJ50KT55 dataset.
3.4 Number of flocks generated before and after postprocessing phase for BFE and the proposed framework in SJ25KT60 dataset.
3.5 Number of flocks generated before and after postprocessing phase for BFE and the proposed framework in SJ50KT55 dataset.
4.1 Iceberg trajectories during 2006 in Antarctica.
4.2 Number of flocks generated before and after postprocessing in Icebergs06 dataset.
4.3 Description of the discovered flock patterns in Icebergs06 dataset. The first column corresponds to tags in Figures 4.5 and 4.6.
4.4 GPS log trajectories in Beijing.
4.5 Number of flocks generated before and after postprocessing in Beijing dataset.
4.6 Description of the discovered flock patterns in Beijing dataset.


Chapter 1

Introduction

1.1 Background

Modern data acquisition techniques such as the Global Positioning System (GPS), Radio-Frequency Identification (RFID), mobile phones, wireless sensor networks, and general surveys have resulted in the collection of huge amounts of geographic data during the past years. The popularity of these technologies and the ubiquity of mobile devices seem to indicate that the amount of georeferenced data will increase at accelerated rates in the future.

However, and despite the growing demand, there are few tools available for the proper analysis of spatio-temporal datasets. The natural complexities in data handling, accuracy, privacy and sheer volume have turned the analysis of spatial data into a challenging task. Traditional spatial analysis techniques are not an effective solution: they were developed at a time when the access to and quality of geodata were poor and, as a result, they cannot offer the scalability required to manage the increasing dimensionality of the data. Therefore, there is an urgent need for new and efficient techniques to support the analysis and extraction of valuable information from voluminous and complex spatio-temporal datasets.

Trajectory data associated with moving objects is one of the fields in which volume has increased considerably. Early approaches to recovering information from this kind of data include single-predicate range and nearest neighbour queries, for instance, “find all the moving objects inside area A between 10:00 AM and 2:00 PM” or “how many cars drove between Main Square and the Airport on Friday”. Recently, diverse studies have focused on querying patterns capturing group behaviour in moving object databases, for instance: moving clusters, convoy queries and flock patterns [42, 47, 43, 82, 54].

Flock pattern detection is particularly relevant due to the characteristics of the objects of study (animals, pedestrians, vehicles or natural phenomena), how they interact with each other and how they move together [50, 31]. [82] define moving flock patterns as groups of entities moving in the same direction while being close to each other for the duration of a given time interval (Figure 1.1). They consider a group of trajectories to be close together if there exists a disk with a given radius that encloses all of them. The current approach to discover moving flock patterns consists in finding a suitable set of disks in each time instance and then merging the results from one time instance to another. As a consequence, the performance and the number of final patterns depend on the number of disks and how they are combined.

Figure 1.1: A flock pattern example: {T1, T2, T3}. Ti illustrates different trajectories, ci encloses a disk in which trajectories are considered close to each other and ti represents consecutive time intervals (after [82]).

In parallel, some areas of traditional data mining have also focused on discovering frequent patterns in general attribute data. Association rule learning and frequent pattern mining [37] are popular and well-researched methods for discovering interesting relations between variables in large databases. Frequent patterns are itemsets, subsequences, or substructures that appear in a dataset with frequency no less than a user-specified threshold. Initially, association rule learning and frequent pattern mining algorithms were designed to solve a specific task in the commerce sector [33]. However, the approach shares interesting similarities with the problem of finding moving flock patterns, for example, the efficient handling of candidates and combinations [1, 39].

1.2 Problem statement

Proposed algorithms to discover flock patterns scan the data in order to find disks which can be joined between consecutive time instances. The number of possible disks in a given time interval can be quite large, and the cost of joining those disks between time intervals can be quite expensive. Handling and analysing all possible combinations has a direct impact on the algorithm's performance. [82, 10] have tested some heuristics and approximations aiming to reduce the number of disks evaluated. However, experimental results still show large response times. In addition, the number and quality of the discovered flock patterns make it particularly difficult to perform a proper interpretation of the results.

1.3 Research identification

Traditional data mining techniques, such as association rule learning and, particularly, frequent pattern mining, have faced similar combination and interpretation issues. This investigation aims to define a new methodology to mine moving flock patterns in trajectory datasets based on the frequent pattern mining approach, aiming to tackle the aforementioned drawbacks. Procedures and concepts will be outlined together with their validity and usefulness using synthetic and real study cases.

1.3.1 Research objectives

In order to accomplish this purpose there are three main objectives:

1. To conceptualise an appropriate procedure to fit the concept of moving flock patterns into the frequent pattern mining methodology.

2. To implement a framework for pattern recognition in moving object datasets based on the methodology proposed.

3. To test the performance of the resulting framework using study cases with real and synthetic datasets.

1.3.2 Research questions

1. For design:

(a) How to apply the basic concepts of the frequent pattern mining approach in spatio-temporal datasets?

(b) How to adapt existing methods and data structures to fit the specific requirements of frequent pattern mining algorithms?

(c) What would be an appropriate method to visualize and interpret the results?

2. For testing:

(a) How does the proposed framework perform in datasets with different charac- teristics?

(b) Which parameters and characteristics are the most important in determining the algorithm’s performance?

(c) Is the proposed framework applicable to different contexts and phenomena?

(d) Are the results from the framework useful and interpretable?

1.3.3 Innovation aimed at

Innovation in this research will be aimed towards the implementation of a novel moving flock pattern framework which adapts traditional frequent pattern mining techniques in order to reduce the number of combinations and to improve the understanding of the results. The scalability and performance of the proposed framework will be tested with synthetic and real datasets in the context of human movement (pedestrians) and natural phenomena (icebergs). Generation and visualization of the most relevant results will also be explored.


1.3.4 Related work

Due to the increasing collection of movement datasets, the interest in querying patterns which describe collective behaviour has also increased. [82] enumerate three groups of 'collective' patterns in moving object databases: moving clusters, convoy queries and flock patterns.

Moving clusters [42, 47, 53] and convoy queries [43, 44] have in common that they are based on clustering algorithms, mainly density-based algorithms such as DBSCAN [21]. The main differences between these two techniques are how they join clusters between two consecutive time intervals and the use of an extra parameter to specify a minimum duration time in convoy queries. Although these methods are closely related to flock patterns, they differ from the latter technique because the resulting clusters do not assume a predefined shape.

Previous work on the detection of moving flock patterns is reported by [30] and [10]. They introduce the use of disks with a predefined radius to identify groups of trajectories moving together in the same direction. All trajectories which lie inside the disk at a particular time instance are considered a candidate pattern. The main limitation of this procedure is that there is an infinite number of possible placements of the disk at any time instance. Indeed, [30] have shown that the discovery of fixed flocks, patterns where the same entities stay together during the entire interval, is an NP-hard problem.

[82] are the first to present an exact solution for reporting flock patterns in polynomial time, including variants that can work effectively in real time. Their work reveals that a polynomial-time solution can be found by identifying a discrete number of locations at which to place the centre of the flock disk. They propose the Basic Flock Evaluation (BFE) algorithm, based on time-joins and combinations, and four other algorithms based on heuristics, to reduce the total number of candidate disks to be combined and, thus, the overall computational cost of the BFE algorithm. However, the pseudo-code and experimental results still show relatively high computational complexity, long response times and a large number of discovered flocks, which makes interpretation difficult.

Recently, [88] have proposed a new moving flock pattern definition and developed a corresponding algorithm based on the notion of spatio-temporal coherence. The experimental results focus on finding flock patterns in pedestrian datasets. Although they used a real dataset collected in a national park in the Netherlands, it is too small to appropriately test the scalability of the algorithm. An interesting contribution of this study is a comparison framework of existing flock detection approaches according to the classification criteria recently introduced by [92] for collective movement.

In order to reduce response times, spatial data structures and indexes have been tested, e.g. the k-d tree and some of its variations. [10] have applied skip-quadtrees, which make use of compressed quadtrees as the bottom-level structure. However, their study only explores flock identification in single time intervals, so the inclusion of temporal variables was not considered.

Traditional data mining techniques, and particularly the field of frequent pattern mining, have addressed the number of combinations by reducing the number of elements to be combined or by compacting the size of the dataset. [1] have applied pruning techniques based on the downward-closure property, which guarantees that all the subsets of a frequent pattern must also be frequent. Using this property, the authors identified invalid candidates and removed them from the analysis. However, this technique still scans through the dataset repeatedly.

[36] proposed an intermediate layer which organizes the records in a compact data structure called the frequent-pattern tree (FP-tree). The main advantages of this methodology are the compression of datasets, the minimization of scans and the detection of patterns without candidate generation [36, 12]. Recently, [39] have proposed a novel and improved FP-tree structure applied in different contexts, for instance: market baskets, association rules and sequential patterns. [75] have applied this methodology successfully to find co-orientation patterns in satellite imagery. The empirical results show an improvement of around one order of magnitude with respect to the traditional approach.

Recently, the Linear time Closed itemset Miner (LCM) [81] has demonstrated remarkable performance in dense databases using Binary Decision Diagrams, a compact graph-based data structure. Frequent patterns can be efficiently processed using algebraic operations. LCM requires linear time to mine frequent patterns when the data compression works well. A performance comparison of LCM and other state-of-the-art techniques can be consulted in [9, 28].

However, [33] show how frequent pattern mining may generate a huge number of frequent patterns. It is even worse when long patterns exist in the data, because if a pattern is frequent, each of its subpatterns is frequent as well. This clearly increases the complexity of analysis and understanding. To overcome this problem, closed and maximal pattern mining were proposed [7, 69]. The general idea is to report just the longest patterns, avoiding their subpatterns.

The aforementioned techniques have been applied successfully in diverse scenarios such as bioinformatics [17, 16], GIS [60, 35] and marketing [96, 24]. The interested reader should refer to [37] for a complete survey of the current status of the frequent pattern mining approach. Additionally, the Frequent Itemset Mining Implementations repository (FIMI) [26] has gathered a collection of open source implementations of the most efficient and scalable frequent/closed/maximal pattern mining algorithms.

Overall, the frequent pattern mining approach has made tremendous progress in the last decade, and it is thought that it can contribute substantially to solving the drawbacks of finding moving flock patterns in trajectory datasets.

1.4 Thesis structure

The remainder of the thesis is outlined as follows:

Chapter 2 explains the basic concepts needed to identify patterns in moving objects. The Basic Flock Pattern algorithm is introduced together with the formal definition of a moving flock pattern. Afterwards, frequent pattern mining in traditional databases is briefly discussed in order to explain in more depth relevant concepts used in the following chapters. Then, the general steps of the proposed framework are explained. Finally, a discussion about possible flock interpretations is presented.

Chapter 3 concentrates on implementation and technical issues. The first part explains the methods and technologies used in the development of the BFE algorithm. It then explains the generation and main characteristics of the synthetic datasets used to test the implementation, and focuses on the internal comparison between the two phases of the BFE algorithm. Afterwards, the main issues in the implementation of the proposed framework are described. The final part of the chapter presents a performance comparison between BFE and the proposed framework using the aforementioned synthetic datasets.

Chapter 4 focuses on study cases with real datasets. Two different types of moving entities are studied: pedestrians and icebergs. The chapter presents tests similar to those evaluated with the synthetic datasets, together with justification, possible applications and a discussion of the results.


Chapter 5 offers a more detailed discussion of the framework implementation. The main point of discussion is the impact of trajectory size on the framework's performance. The discussion then focuses on the limitations and alternatives of the techniques used in the framework with respect to the understanding and interpretation of the results.

Finally, chapter 6 shares the conclusions and recommendations.


Chapter 2

Framework Definition

2.1 Identifying patterns in moving objects

Due to the increasing availability of spatial databases, different methodologies have been explored in order to find meaningful information hidden in this kind of data. New understanding of how diverse entities move in a spatial context has proven useful in topics as diverse as sports [41], socio-economic geography [23], animal migration [20] and security and surveillance [58, 71].

Early approaches to recovering information from spatio-temporal datasets include ad-hoc queries aimed at answering single-predicate range or nearest neighbour queries, for instance, “find all the moving objects inside area A between 10:00 AM and 2:00 PM” or “how many cars drove between Main Square and the Airport on Friday”. Spatial query extensions in common GIS software packages and DBMS are able to run these types of queries; however, these techniques try to find the best solution by exploring one spatial object at a time according to some distance metric (usually Euclidean). As a result, it is difficult to capture collective behaviour and correlations among the involved entities using this type of query.

Recently, a new interest in querying patterns capturing 'group' or 'common' behaviour among moving entities has emerged. Of particular interest is the development of approaches to identify groups of moving objects which share a strong relationship and interaction in a defined spatial region during a given time duration. Some examples of these kinds of approaches are moving clusters [47] [42], convoy queries [43] and flock patterns [30] [10] [82].

Although different interpretations can be taken, a flock pattern refers to a predefined number of entities which stay close enough together during at least a given time interval. The challenge of identifying this kind of movement pattern is particularly relevant due to the intrinsic interactions among the members of the flock, especially in the context of animals, pedestrians or vehicles. In this research an alternative framework for discovering moving flock patterns is proposed. Part of this framework is based on an existing state-of-the-art algorithm, extended to take advantage of well-known and tested frequent pattern mining algorithms from the area of association rule learning. The details of these concepts and the methodology used to build the proposed framework are discussed in the following sections.


Figure 2.1: BFE algorithm for computing the set of final disks per timestamp and for joining and reporting final flock patterns (source: [82]).

2.2 Basic Flock Pattern algorithm

Flock pattern finding was first introduced by [31] and [50]; however, they did not consider the notion of duration in time. In a first approximation to identify flocks, just two variables were used: a constant maximum distance among moving objects (ε), which defines the size of a disk, and a minimum number of moving objects (μ) which should lie inside that disk. Later, [30] added a minimum time duration (δ) as a parameter of a flock.

Initial experiments showed that finding an appropriate location for the disk was not a trivial problem. It is shown in [30] that discovering the longest-duration flock pattern is an NP-hard problem; for that reason, that work presented only approximation algorithms.

Recently, [82] introduced an on-line algorithm to find moving flock patterns called the Basic Flock Evaluation (BFE) algorithm. This appears to be the first work to present exact solutions for reporting flock patterns in polynomial time.

This thesis adopts the general definition of moving flock patterns used in [82], illustrated in Figure 1.1. It takes a dataset of trajectories and the parameters ε, μ and δ as inputs:

Definition. Given a set of trajectories τ, a minimum number of trajectories μ > 1 (μ ∈ N), a maximum distance ε > 0 (ε ∈ R) and a minimum time duration δ > 1 (δ ∈ N), a flock pattern Flock(μ, ε, δ) reports all maximal-size collections F of trajectories such that: for each f_k in F, the number of trajectories in f_k is greater than or equal to μ (|f_k| ≥ μ), and there exist δ consecutive time instances such that, for every time instance t_i in that interval, there is a disk with center c_k^{t_i} and radius ε/2 covering all points of f_k at t_i.
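Read literally, the per-timestamp condition is a simple containment test. The following sketch (an illustrative helper with assumed types, not the thesis source code) checks whether a candidate disk centre covers a set of member locations at one time instance:

```java
// Direct reading of the per-timestamp condition in the definition (sketch
// with assumed types): a disk with centre (cx, cy) and radius eps/2 covers a
// flock at one time instance if every member location lies inside the disk.
class DiskCoverage {
    static boolean diskCovers(double cx, double cy, double eps,
                              double[][] points) {
        double r = eps / 2.0;
        for (double[] p : points) {
            double dx = p[0] - cx, dy = p[1] - cy;
            if (dx * dx + dy * dy > r * r) return false;  // point lies outside
        }
        return true;
    }
}
```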

The general operation of the BFE algorithm can be explained in two parts. A first function (left in Figure 2.1) aims to build a final set of disks which, for each timestamp, brings together a minimum number of objects that remain close enough to each other. The second part (right in Figure 2.1) joins candidate disks which share the same set of objects during consecutive timestamps if and only if their number exceeds the minimum value μ. In addition, the minimum duration parameter δ must be satisfied for the objects to be reported.

Figure 2.2: BFE pruning stages. (a) The initial set of disks. (b) Only disks which surpass μ are retained (μ = 3). (c) Redundant disks with subset members are removed.

The first part of the algorithm uses a grid-based index to organize the set of locations at every time instance and to identify pairs of points which are less than ε units of distance from one another. For each pair of points it is possible to generate two disks with radius ε/2 that have both points on their circumference. These two disks are considered candidates and are then tested to see whether they enclose the required minimum number μ of trajectories.
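The geometry behind this step is standard: the two candidate centres lie on the perpendicular bisector of the pair of points, offset from their midpoint. A minimal sketch of the computation (a hypothetical helper, not code from the BFE implementation):

```java
// Sketch of the candidate-disk geometry (hypothetical helper): for two points
// closer than eps, the centres of the two disks of radius eps/2 whose
// boundaries pass through both points lie on the perpendicular bisector of
// the segment p1-p2, at distance h from its midpoint.
class CandidateDisks {
    static double[][] candidateCentres(double x1, double y1,
                                       double x2, double y2, double eps) {
        double dx = x2 - x1, dy = y2 - y1;
        double d = Math.sqrt(dx * dx + dy * dy);  // distance between the points
        if (d == 0 || d > eps) return new double[0][];
        double r = eps / 2.0;
        double h = Math.sqrt(r * r - (d / 2.0) * (d / 2.0)); // midpoint offset
        double mx = (x1 + x2) / 2.0, my = (y1 + y2) / 2.0;
        double ux = -dy / d, uy = dx / d;         // unit normal to the segment
        return new double[][] {
            { mx + h * ux, my + h * uy },
            { mx - h * ux, my - h * uy }
        };
    }
}
```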

In large datasets the number of possible pairs of points, and therefore of candidate disks, can be huge. Additional tasks are therefore required in order to eliminate redundancy in the initial set of disks: if the complete set of trajectories within a disk also appears in another disk, just one of them should be kept. The algorithm organizes the initial set of disks in a KD-Tree structure, which makes it easy to detect groups of disks which intersect each other and then to check whether one disk's members are a superset or subset of another's. Figure 2.2 illustrates the pruning stages used to calculate a valid set of final disks.

When a final set of disks has been found for consecutive time instances, the second part of the algorithm compares the disks in each set one by one to find those which have a minimum number of trajectories in common (μ). When a new timestamp is explored, the new disks which match the requirements are joined with the previously stored candidates. The moment one of them reaches a duration of δ, it is immediately reported.

However, the number of disks in a given time instance can be quite large and the cost of joining those disks into a flock pattern can be quite expensive. BFE limits the number of candidates by storing just those with a time duration of δ. As a consequence, BFE reports flocks with a fixed time duration.

2.3 Finding frequent patterns in traditional databases

Frequent patterns are itemsets, subsequences, or substructures that appear in a dataset with frequency no less than a user-specified threshold [37]. The issue of unveiling interesting patterns in databases under different contexts has been a recurrent research topic during the last 15 years. General data mining has become widely recognized as a critical field by companies of all types. As part of the data mining repertoire, the task of association rule learning has studied different frequent pattern mining algorithms to identify relevant trends in datasets across different disciplines [17, 60, 96].

One of the areas where the techniques of association rule learning and frequent pattern mining algorithms have most often been applied is the analysis of data and market trends in the transactions of customers of large supermarkets and stores [1]. This problem is usually called 'the shopping basket problem', even though the methods derived to solve it can be applied in different contexts [33]. In this chapter these techniques will be referred to as 'Shopping Basket Algorithms' to facilitate their explanation and reference.

The shopping basket problem represents an attempt by a retailer to discover which items its customers frequently purchase together [79]. The goal is an understanding of the behaviour of a typical customer and the identification of valuable items and relationships among them. For this kind of problem the input is a database with information about the items purchased: when a customer pays for their products at the cashier, a record with the bought items is inserted into the database. In a general view, it is enough to capture just the transaction ID and the product ID (one record per item purchased). This is known as the {TID:itemset} schema. As the records in the database usually refer to transactions, these databases are called transactional databases. The goal of shopping basket analysis is to find sets of items (itemsets) that are “associated”; the fact of their association is often called an association rule [79].

For instance, if we know that a high percentage of customers buy milk and bread at the same time during their visits to a supermarket, this relationship represents an association rule. It can be used to formulate new marketing strategies, promotions, the introduction of new products, catalog design, cross-marketing or shelf space planning [33]. It is usual to locate associated items in different aisles, with high-profit or new products between them, to ensure they are exposed to more customers [79]. [24] discuss other case studies in commerce and marketing where different association rule methods are explored. Over the last years many improvements and new techniques have been developed and proposed in order to enhance and take advantage of the benefits of association rule analysis.

2.3.1 Shopping basket analysis: an example

Given the small example illustrated by Figure 2.3, we can take as input a database of 4 transactions. Visually, it is easy to identify that Milk is present in 3 out of 4 transactions. It is also easy to see that Bread appears in every transaction where Milk does. Therefore, we can report the pair Milk and Bread as a frequent pattern and, for example, infer an association rule such as:

Milk ⇒ Bread [support = 0.75, confidence = 1]

where support and confidence are two measures of the rule's interestingness. A support of 0.75 means that the number of transactions involving Milk equals 75% (3 out of 4) of the total number of transactions in the database. A confidence of 1 means that Bread appears in all (100%) of the transactions where Milk appears. These two measures are used to assess the quality of the obtained rules, whose number in large databases can be significant, and they correspond to the parameters minimum support and minimum confidence in most association rule algorithms.


Figure 2.3: Shopping Basket Analysis example (source: [33])

The process of retrieving a complete set of association rules from a large database can be divided in two parts. First, all itemsets which exceed the support threshold are found. This group is called the frequent itemsets and refers to the most frequent patterns in the database; the techniques used to discover the set of frequent itemsets are also called frequent pattern mining algorithms. Then, from the frequent itemsets, strong associations are generated among the members of each itemset. Depending on the size of an itemset, all possible combinations among its members are computed to obtain pairs of antecedent and consequent statements which define a rule. The confidence value is used at this stage to report just the most significant rules.
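As a concrete illustration, the following toy program computes the two measures for the rule above. The basket contents are assumed for the example; they only mirror what the text describes (Milk in 3 of 4 transactions, Bread in every transaction containing Milk):

```java
import java.util.*;

// Toy illustration of support and confidence (basket contents are made up
// to match the example): computes the metrics of the rule Milk => Bread.
public class RuleMetrics {
    public static void main(String[] args) {
        List<Set<String>> db = new ArrayList<Set<String>>();
        db.add(new HashSet<String>(Arrays.asList("Milk", "Bread", "Eggs")));
        db.add(new HashSet<String>(Arrays.asList("Milk", "Bread")));
        db.add(new HashSet<String>(Arrays.asList("Milk", "Bread", "Butter")));
        db.add(new HashSet<String>(Arrays.asList("Coffee", "Sugar")));

        int both = 0, antecedent = 0;
        for (Set<String> t : db) {
            if (t.contains("Milk")) {
                antecedent++;                      // transactions with Milk
                if (t.contains("Bread")) both++;   // ... that also have Bread
            }
        }
        double support = (double) both / db.size();      // 3/4 = 0.75
        double confidence = (double) both / antecedent;  // 3/3 = 1.0
        System.out.println("support=" + support + ", confidence=" + confidence);
    }
}
```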

2.3.2 Maximal and Closed frequent patterns

Although the first generation of algorithms designed to mine association rules aims to find the complete group of frequent itemsets, in large databases with low values of the minimum support threshold this number can be huge [33]. This is because if an itemset is frequent, each of its subsets is frequent as well, so long itemsets contain a large number of shorter frequent subsets. For instance, let I = {a1, a2, ..., a100} be a long itemset with 100 items, usually called a 100-itemset (for its number of members). It contains C(100, 1) 1-itemsets, C(100, 2) 2-itemsets, and so on, where C(n, k) denotes the binomial coefficient. The total number of frequent itemsets it would contain is:

C(100, 1) + C(100, 2) + ... + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30

This magnitude is obviously too large to handle, even for computer applications. To overcome this drawback the concepts of closed frequent pattern and maximal frequent pattern are used. A pattern α is a closed frequent pattern if α is frequent and there exists no other pattern with the same support which contains α. On the other hand, a pattern α is a maximal frequent pattern if α is frequent and there exists no other pattern, with any support, which contains α. For example:

α = {a1, a2, a3, a4 : 2}  (α is maximal)
β = {a1, a2, a3 : 4}  (β is closed but not maximal)

The set of maximal frequent patterns is important because it contains the longest patterns, from which any frequent pattern exceeding the minimum support can be generated. [33] provides a detailed theoretical definition. For clarification, these two concepts can be illustrated with an additional example.

Suppose a database D contains 4 transactions:

D = {⟨a1, a2, ..., a100⟩; ⟨a1, a2, ..., a100⟩; ⟨a20, a21, ..., a80⟩; ⟨a40, a41, ..., a60⟩}

Note that the first transaction is repeated twice. Let the minimum support be min_sup = 2. A complete search for all itemsets would generate a vast number of combinations. However, the closed frequent itemset approach will find only 3 frequent itemsets:

C = {{a1, a2, ..., a100 : 2}; {a20, a21, ..., a80 : 3}; {a40, a41, ..., a60 : 4}}

The set of closed frequent itemsets contains complete information to generate the remaining frequent itemsets with their corresponding supports. It is possible to derive, for example, {a50, a51 : 4} from {a40, a41, ..., a60 : 4} or {a90, a91, a92 : 2} from {a1, a2, ..., a100 : 2}. On the other hand, we obtain just one maximal frequent pattern, in this case:

M = {{a1, a2, ..., a100 : 2}}

From this result it is known that {a50, a51} and {a90, a91, a92} are frequent patterns, although it is not possible to assert their actual support counts.
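This derivation property is what makes the closed representation lossless. A small sketch (with an assumed in-memory representation) of how the support of an arbitrary itemset can be recovered from the closed itemsets:

```java
import java.util.*;

// Sketch of why closed itemsets preserve complete support information
// (representation assumed): the support of any frequent itemset equals the
// maximum support over the closed itemsets that contain it.
class ClosedSupport {
    static int supportFromClosed(Set<Integer> query,
                                 Map<Set<Integer>, Integer> closed) {
        int best = 0;
        for (Map.Entry<Set<Integer>, Integer> e : closed.entrySet()) {
            if (e.getKey().containsAll(query)) {
                best = Math.max(best, e.getValue());
            }
        }
        return best;  // 0 means the queried itemset is not frequent
    }
}
```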

2.4 Proposed Framework

Current frequent pattern mining algorithms developed in the area of association rule learning have made tremendous progress, providing efficient and scalable methods for discovering frequent itemsets in transactional databases that can be applied on numerous research frontiers. Therefore, the main aim of the remainder of this thesis is to explore a methodology which allows the identification of moving flock patterns using traditional and powerful algorithms for association rule mining.

In order to accomplish this goal, a framework including 4 steps is proposed:

1. Obtain a final set of valid clusters in each timestamp.

2. Construct a transactional version of the trajectory dataset based on the disks visited by each trajectory.

3. Apply a frequent pattern mining algorithm in the generated database.

4. Perform postprocessing procedures to check consecutiveness, prune duplicates and report patterns.

Each step of the proposed framework is explained in the remainder of this chapter.

2.4.1 Getting a final set of disks per timestamp

The first step of the framework is to identify a final set of clusters in each timestamp.

Although the first step of the BFE algorithm is affected by the number of trajectories, the initial implementation showed acceptable time responses in preliminary testing on large synthetically generated datasets (See Section 3.4). This fact promoted its use as a first step in the proposed framework. The main objective with this is the generation of a final set of disks which cluster the number of trajectories in groups according to proximity.

This step still uses the parameter  to define the diameter of the disks and μ for pruning procedures to reduce the number of valid disks.

For simplicity, BFE algorithm and the proposed framework uses a fix disk shape; a circumference with a predefined radius and the Euclidean distance metric. However dif- ferent shapes and metrics could be used. Indeed, alternative spatial clustering techniques, such as DBSCAN or grid-based methods, which allow the identification of dense regions with a minimum number of trajectories, could be used at this stage. These issues are discussed further in Section 5.2.2.

2.4.2 From trajectories to transactions

In a general sense, spatio-temporal datasets are comprised of information about the location of an entity at a specific time. Each entry in the dataset reflects an observation of a point, which in turn belongs to a specific trajectory. To be able to analyse trends in the data, we assume that spatio-temporal datasets contain at least 4 fields: the ID of the trajectory to which a point belongs, the time when it was measured, and the X and Y coordinates of the location.

In order to use frequent pattern mining algorithms, the input database should follow the {TID:itemset} schema (see Section 2.3). The ID of a trajectory can be used to identify its corresponding transaction, but it is necessary to define an item ID which encodes the time and location of each point. A unique ID is assigned to each disk generated in the first step of the framework. In addition, information about which trajectories visited a disk in a particular time interval is stored in a separate table, so it is possible to obtain a transactional version of each trajectory by matching the time and location of a point with the ID of the corresponding disk. A specific disk represents a particular region in space and time, and each trajectory can be translated into the disks which it visits during its lifetime. This concept is illustrated in the following example.

In Figure 2.4 we can see a dataset of 7 trajectories (T_i). From it, 5 disks can be identified throughout the dataset's lifetime (c_i). Table 2.1 is created from the disks which are visited by each trajectory at a specific timestamp (t_i). If Table 2.1 is treated as a transactional database, it is possible to apply any frequent pattern mining algorithm to find the frequent patterns. For instance, let the minimum support count (min_sup) be set to the same value as the minimum number of trajectories μ. If we use μ = 3, the patterns {C1, C2, C4 : 3} and {C3, C5 : 3} should be found. These patterns contain the information about the trajectory members and duration of the possible moving flock patterns.

Figure 2.4: A trajectory dataset example.

Table 2.1: Transactional version of the dataset from Figure 2.4.

TID   Disk IDs
T1    C1, C2, C4
T2    C1, C2, C4
T3    C1, C2, C4
T4    C3, C5
T5    C3, C5
T6    C3, C5
T7    ∅

A complete set of all frequent patterns is not necessary: the set of maximal frequent patterns retrieves the required information. The main advantage of this approach is that the longest flock patterns are reported. The maximal or closed sets of frequent patterns avoid the need to set a parameter δ to limit the duration of the patterns. In the proposed framework, the parameter δ is only used to set the minimum duration allowed, and flocks of any longer duration will be reported. By contrast, BFE uses δ to report flocks with this specific time duration in order to minimize the number of intermediate flocks to be combined. As a result, the final number of flocks reported by the proposed framework is significantly smaller than the number reported by BFE.

Although these patterns are valid output from the frequent pattern mining algorithms, they require additional checking before they can be reported as valid flocks.
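To make the conversion concrete, the following sketch builds the {TID:itemset} view of Table 2.1 from per-timestamp disk memberships. The record layout and names are assumed for illustration; they are not taken from the framework's source:

```java
import java.util.*;

// Sketch of the trajectories-to-transactions step (record layout assumed):
// each row of `memberships` says that trajectory m[0], at timestamp m[1],
// fell inside final disk m[2]. The result maps each trajectory ID to its
// time-ordered list of visited disk IDs, as in Table 2.1.
class TransactionBuilder {
    static Map<Integer, List<Integer>> toTransactions(int[][] memberships) {
        Map<Integer, TreeMap<Integer, Integer>> byTraj =
            new HashMap<Integer, TreeMap<Integer, Integer>>();
        for (int[] m : memberships) {
            TreeMap<Integer, Integer> visits = byTraj.get(m[0]);
            if (visits == null) {
                visits = new TreeMap<Integer, Integer>();
                byTraj.put(m[0], visits);
            }
            visits.put(m[1], m[2]);  // timestamp -> disk visited at that time
        }
        // Flatten each trajectory's time-ordered disk visits into its itemset.
        Map<Integer, List<Integer>> transactions =
            new HashMap<Integer, List<Integer>>();
        for (Map.Entry<Integer, TreeMap<Integer, Integer>> e : byTraj.entrySet()) {
            transactions.put(e.getKey(),
                             new ArrayList<Integer>(e.getValue().values()));
        }
        return transactions;
    }
}
```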

2.4.3 Frequent Pattern Mining Algorithms

Since [1], many improvements and new methods have been proposed by the scientific community to find frequent patterns in an efficient and robust way. The most popular solutions involve the use of compact data structures which compress the original database, such as FP-trees [36, 39] and Binary Decision Diagrams [80, 81, 61]. Their main principles have resulted in different implementations depending on the context, and they have also inspired additional variations designed to find representative types of patterns such as maximal and closed frequent itemsets.

The Frequent Itemset Mining Implementations repository (FIMI) [26] is one of the most important initiatives to discuss and analyse the computation time and memory performance of the most relevant algorithms in this topic. In addition, it collects open source code and sample datasets from the original authors. [25, 27] give an introductory survey of the state-of-the-art methods and techniques as well as their performance with different types of datasets and parameters.

Given the needs of the proposed framework, the technique which showed the best results on the preliminary datasets was the Linear time Closed itemset Miner (LCM) [81]. LCM demonstrated remarkable efficiency with extremely low support values in dense datasets, two characteristics present in mining moving flock patterns. LCM is a backtracking (or depth-first) algorithm based on recursive calls. The algorithm takes a frequent itemset P as input and generates new itemsets by adding unused items to P; then, for each new frequent itemset, it recurses. The process ends when no new items can be added. The detailed description of the algorithm, omitted here, can be found in [80, 81].

2.4.4 Postprocessing Stage

As discussed above, information about the time and location of each point of a trajectory was encoded into unique disk IDs. Once the LCM algorithm retrieves the set of frequent patterns, it is necessary to decode this information and check the quality and validity of the flocks. It is possible that the members of a valid frequent pattern belong to disks at non-consecutive times, so it is necessary to check this requirement, in addition to the minimum duration (δ), before reporting a pattern as a valid flock.

As in the BFE algorithm, it is necessary to prune possible duplicate patterns. Because a fixed diameter is used to define the disks, it is inevitable that some disks overlap others. Points belonging to different disks in the same time interval lead to the generation of redundant patterns, so an additional scan is needed in order to identify and remove repeated flocks. Alternatives to avoid this behaviour are discussed in Section 5.2.2.
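A minimal sketch of the consecutiveness check described above (the disk-to-timestamp lookup is an assumed structure built during disk generation):

```java
import java.util.*;

// Sketch of the consecutiveness check (diskToTimestamp is an assumed lookup):
// a pattern of disk IDs is a valid flock only if the timestamps of its disks
// contain a run of at least delta consecutive time instances.
class FlockValidator {
    static boolean isValidFlock(List<Integer> diskIds,
                                Map<Integer, Integer> diskToTimestamp,
                                int delta) {
        SortedSet<Integer> times = new TreeSet<Integer>();
        for (int diskId : diskIds) times.add(diskToTimestamp.get(diskId));
        int run = 0, best = 0, prev = Integer.MIN_VALUE;
        for (int t : times) {
            run = (t == prev + 1) ? run + 1 : 1;  // extend or restart the run
            best = Math.max(best, run);
            prev = t;
        }
        return best >= delta;  // longest consecutive run must reach delta
    }
}
```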

2.5 Flock Interpretation

Although a formal definition was stated previously in this document, different interpretations of a flock are possible depending on the application. Figure 2.5 illustrates a case where, according to the context and nature of the moving objects, diverse sets of patterns can be derived. The different interpretations are supported by the concepts of maximal and closed frequent patterns in the implementation of the proposed framework.

Let μ = 3. If a maximal frequent pattern approach is implemented using a minimum support count equal to μ (min_sup = 3), a moving flock pattern with members {T1, T2, T3} from time t1 to t6 would be identified. This is the general scenario used in the tests to measure the performance of the framework.

A second alternative uses the closed frequent pattern approach. In this case, four flock patterns could be identified with different start times and numbers of members: {T1, T2, T3} from time t1 to t6, {T1, T2, T3, T4} from time t2 to t4, {T1, T2, T3, T5} from time t3 to t5 and {T1, T2, T3, T4, T5} from time t3 to t4. This brings more detail about the interaction among moving objects but considerably increases the number of final flocks. However, it was useful during the validation stage because it generated a set of patterns similar to that generated by the BFE algorithm.

Figure 2.5: Example of a flock where different interpretations can apply.

Finally, based on the maximal frequent pattern approach, a third alternative is proposed which performs a further analysis of the additional trajectories. After identification of the core members of the flock (the leaders), the additional points are treated as followers of the core trajectories. In this fashion, just one flock is reported from the example, where {T1, T2, T3}, from t1 to t6, are the leader trajectories. T4, joining the flock from time t2 until t4, and T5, joining it from t3 to t5, are tagged as the corresponding followers.

This last interpretation is semantically more appropriate because it reflects the intrinsic attraction and repulsion forces present especially in social entities such as animals or pedestrians. For instance, it is able to represent how a person joins a crowd, interacts with its members for a moment and then leaves it. However, this approach needs additional processing, and the results require a more suitable representation. This interpretation was implemented in the visualization of the patterns generated with the real datasets.


Chapter 3

Implementation

3.1 BFE Implementation

An implementation of the BFE algorithm was developed with two goals in mind. First, to understand the bottleneck processes during the execution of the method: the experimental results in [82] show high response times when dealing with large datasets, but do not clarify which parts of the algorithm are the most affected. Second, an available implementation of the BFE algorithm would be useful so that parts of the code could be re-used in the development of the proposed framework and in testing the results.

Based on the pseudo-code published in [82], a version of the BFE algorithm was developed using several open source libraries and utilities. An initial attempt used the Java 1.6 programming language connected to the spatial functions provided by PostGIS [72]. Spatial queries were used to calculate the optimal locations of the final set of disks in the first stage of the BFE algorithm. However, this approach showed low performance due to multiple read/write operations and indexing. In addition, the difficult integration of SQL results with efficient spatial data structures (e.g. a KD-Tree) was also a limitation.

An alternative was an application written in 100% pure Java which works with the data in main memory, avoiding multiple read/write operations. The JTS Topology Suite (JTS) [85] was used for this purpose. It is an API for processing linear geometry which provides a complete, simple and robust implementation of distance and topological functions on the 2-dimensional plane. JTS implements the geometry model defined in the Simple Features Specification for SQL by the OpenGIS Consortium [65]. The software is published under the GNU Lesser General Public License (LGPL).
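For illustration, a minimal example of the kind of in-memory geometry work JTS enables (the coordinates and threshold are made up):

```java
import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Point;

// Illustrative JTS usage (made-up coordinates): Euclidean distance between
// two locations and a disk approximated by buffering a point.
public class JtsExample {
    public static void main(String[] args) {
        GeometryFactory gf = new GeometryFactory();
        Point p1 = gf.createPoint(new Coordinate(100.0, 200.0));
        Point p2 = gf.createPoint(new Coordinate(130.0, 240.0));
        double eps = 100.0;                    // flock distance threshold
        System.out.println("distance = " + p1.distance(p2));
        Geometry disk = p1.buffer(eps / 2.0);  // polygonal approximation
        System.out.println("p2 inside disk? " + disk.contains(p2));
    }
}
```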

Although JTS supports almost all the spatial functions offered by PostGIS, it requires efficient data structures to manage attribute data. fastutil [78] is a fast and compact library which extends the default Java Collections Framework. It provides type-specific maps, lists, sets and trees with a small memory footprint and fast access and insertion, minimizing the number of read/write operations. It was developed by the Laboratory for Web Algorithmics (LAW) at the University of Milan. The source code and API are released as free software under the Apache License 2.0.

Additional data management (especially for storing the resulting patterns) and some query verification were performed using PostGIS and OpenJUMP GIS [86].


3.2 Synthetic Generators

Many different approaches have been proposed to model moving entities under different criteria and scenarios. [70, 73, 48, 13] represent alternative efforts to recreate the movements and dynamics of diverse real-world entities such as pedestrians, cars and even fishing ships.

In this research, a group of synthetic datasets was created using a framework for generating moving objects, as described in [13, 14], to test the initial implementation of the BFE algorithm. An important characteristic of this generator is the possibility of having moving objects follow a given network. Along with the supplied network, one can set distinct parameters, e.g. the number of objects, the number of intervals and the maximum speed. Each edge in the network and each trajectory is associated with a road category and a probability, permitting varying movement speeds and lifetime durations. The source code and sample networks are available on the project's website at [11].

3.3 Synthetic Datasets

[11] provides a set of examples and resources which can be used in the online demo or the downloadable version of the generator. To begin with, a relatively small dataset collecting the positions of 1000 random moving objects in the German city of Oldenburg was used to test the BFE implementation explained above. The network data (edges and nodes files) are available on the website. The simulated data records the latitude and longitude of generated points during 140 time slices; the total number of stored locations is 57016 points. Figure 3.1 illustrates the network used for this dataset and Table 3.1 shows the output format of the generator.

The Oldenburg dataset was useful to test the final implementation and the results of the BFE algorithm, but it was too small to test the scalability of the method. Two additional synthetic datasets were therefore created using the network of San Joaquin, also provided on the project's website and illustrated in Figure 3.2. The first dataset collects 992140 simulated locations for 25000 moving objects during 60 timestamps. The second one collects 50000 trajectories comprising 2014346 points during 55 timestamps. Table 3.2 summarizes the main information about the synthetic datasets used at this stage, along with the tag names used in the remainder of the thesis.

3.4 Internal Comparison

Using the large datasets generated previously and the implementation of the BFE algorithm, a set of tests was performed to analyse the performance of the technique. The main idea of these tests was to identify bottlenecks and differences between the two internal phases of the algorithm. For each time interval in the dataset, the execution times for getting the final set of disks and for joining possible flocks were recorded separately. At the end of each test, the individual times for each interval were summed up.
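A hypothetical sketch of how such per-interval timings can be accumulated is given below; the two phase methods are stubs, and none of these names come from the actual implementation:

    public class PhaseTimer {
        // Stubs standing in for the two BFE phases (the real implementations are not shown)
        static void computeFinalDisks(int interval) { /* phase 1: getting the final set of disks */ }
        static void joinPossibleFlocks(int interval) { /* phase 2: combining and checking candidate flocks */ }

        public static void main(String[] args) {
            int numberOfIntervals = 60; // e.g. the SJ25KT60 dataset
            long disksTotal = 0L, flocksTotal = 0L;
            for (int t = 0; t < numberOfIntervals; t++) {
                long start = System.currentTimeMillis();
                computeFinalDisks(t);
                disksTotal += System.currentTimeMillis() - start;  // accumulate phase 1 time

                start = System.currentTimeMillis();
                joinPossibleFlocks(t);
                flocksTotal += System.currentTimeMillis() - start; // accumulate phase 2 time
            }
            System.out.println("Getting disks: " + disksTotal
                    + " ms, getting flocks: " + flocksTotal + " ms");
        }
    }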

Figure 3.3 shows the performance of the BFE algorithm on the SJ25KT60 dataset with the ε value ranging from 50 to 300 metres. The minimum number of trajectories (μ) and the minimum time duration (δ) were set to 5 trajectories and 3 consecutive timestamps respectively. A similar test was performed on the SJ50KT55 dataset with different values for ε; here μ and δ were set to 9 trajectories and 3 consecutive timestamps respectively. The time performance for this case can be seen in Figure 3.4.


[Figure: processing time (s) against ε (m), for ε from 50 to 300, comparing the "Getting Disks" and "Getting Flocks" stages; run SJ25KT60−P992140M5D3]

Figure 3.3: Comparison of internal execution time for the SJ25KT60 dataset.

[Figure: processing time (s) against ε (m), for ε from 50 to 300, comparing the "Getting Disks" and "Getting Flocks" stages; run SJ50KT55−P2014346M9D3]

Figure 3.4: Comparison of internal execution time for the SJ50KT55 dataset.


Table 3.3: Number of combinations required for specific time intervals in the SJ50KT55 dataset.

    Time      Number of disks   Number of previous   Number of needed   Time for    Time for
    interval  generated         flocks and disks     combinations       disks (s)   flocks (s)
    10        2112              3469                 7326528            4.3         15.9
    11        2070              3331                 6895170            6.3         16.4
    12        2121              3414                 7241094            4.2         16.4
    13        2031              3283                 6667773            4.0         15.6
    14        1918              3094                 5934292            5.0         14.2
    15        1950              2929                 5711550            4.2         13.5

As shown in Figures 3.3 and 3.4, an increase in the radius of the disk affects both stages of the algorithm. However, it is clear that after a critical point (around 150 metres in SJ25KT60 and 200 metres in SJ50KT55) the most affected step is the combination and checking of possible flocks. While for low values of ε joining possible flocks is slightly faster than getting the final set of disks, for larger ε values getting the disks is much faster than joining the flocks.

This can be explained by the number of combinations required in the second part of the algorithm. As the radius of the disk increases, each disk encloses more trajectories. As a result, the number of disks which exceed the minimum number of trajectories rises considerably. The disks generated in each time interval have to be compared one by one with the disks generated in the next time interval plus the set of candidate disks identified up to that moment, so the number of combinations to check is the product of the sizes of the two sets. If those sets are large enough, combining all their elements quickly becomes the dominant cost.

Table 3.3 illustrates the problem. It shows a segment of the SJ50KT55 dataset between time intervals 10 and 15 with an ε value of 300 metres. At this point, around 2000 new disks are generated at each timestamp. As the number of stored disks is also large (approximately 3000), the number of combinations is very high; at interval 10, for instance, 2112 new disks checked against 3469 stored flocks and disks yield 2112 × 3469 = 7326528 combinations. On average, analysing such a large number of combinations takes more than three times longer than generating the set of final disks for this dataset.

3.5 Framework Implementation

A functional prototype of the proposed framework was implemented in Java 1.6. To build the proposed framework, it was decided to keep the first part of the BFE algorithm but to address the combinatorial problem using a frequent pattern mining approach. A schematic diagram of the proposed framework is shown in Figure 3.5, and its pseudo-code is presented in Algorithm 3.1 at the end of this section. The framework takes as input a plain text file in the same format as that produced by the synthetic generator (see Table 3.1). The initial step of the framework implementation re-uses the procedure from the BFE implementation to calculate the final set of disks for each timestamp (line 2 in Algorithm 3.1).
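The overall flow can be summarized by the following driver sketch. Every name and signature here is hypothetical and serves only to mirror the phases of Algorithm 3.1, not to reproduce the prototype's code:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class FrameworkDriver {
        static class Disk {}  // placeholder for the disk record discussed below
        static class Flock {} // placeholder for a reported flock pattern

        // Stubs for the four phases of the framework
        static List<Disk> computeFinalDisks(File in, double eps, int mu) { return new ArrayList<Disk>(); }
        static File buildTransactions(List<Disk> disks) { return new File("transactions.txt"); }
        static File runLCM(File transactions, int minSup) { return new File("maximal.txt"); }
        static List<Flock> postProcess(File maximal, int delta) { return new ArrayList<Flock>(); }

        public static void main(String[] args) {
            File input = new File(args[0]); // plain text file in the generator's format (Table 3.1)
            double epsilon = 200.0;         // disk distance threshold in metres
            int mu = 5;                     // minimum number of trajectories
            int delta = 3;                  // minimum number of consecutive timestamps

            List<Disk> disks = computeFinalDisks(input, epsilon, mu); // BFE phase 1 (line 2)
            File transactions = buildTransactions(disks);             // lines 3 to 9
            File maximal = runLCM(transactions, mu);                  // line 12
            List<Flock> flocks = postProcess(maximal, delta);         // lines 14 to 34
            System.out.println(flocks.size() + " flock patterns found");
        }
    }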

At this stage, an efficient data structure was introduced to associate the point locations in each trajectory with their respective disk in order to generate a transactional version of the dataset (lines 3 to 9 in Algorithm 3.1). From a disk ID (c_i.id), the points contained by the disk (c_i.points) and its time interval (c_i.time) can be retrieved.

Figure 3.5: Schematic diagram for the proposed framework.
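A minimal sketch of such a structure and of the translation into transactions is given below, assuming (as the output format of the mining step suggests) that each trajectory becomes a transaction whose items are disk IDs; all class and field names are illustrative:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class DiskIndex {
        static class Disk {
            final int id;                                       // c_i.id
            final int time;                                     // c_i.time
            final Set<Integer> points = new HashSet<Integer>(); // c_i.points: trajectory IDs
            Disk(int id, int time) { this.id = id; this.time = time; }
        }

        private final Map<Integer, Disk> byId = new HashMap<Integer, Disk>();

        void add(Disk d) { byId.put(d.id, d); }

        Disk get(int diskId) { return byId.get(diskId); }

        // Build the transactional version: each trajectory becomes a transaction
        // listing the IDs of the disks it participates in. Trajectories covered
        // by no disk never appear, which realizes the pruning described below.
        Map<Integer, Set<Integer>> toTransactions() {
            Map<Integer, Set<Integer>> transactions = new HashMap<Integer, Set<Integer>>();
            for (Disk d : byId.values()) {
                for (Integer trajectory : d.points) {
                    Set<Integer> disks = transactions.get(trajectory);
                    if (disks == null) {
                        disks = new HashSet<Integer>();
                        transactions.put(trajectory, disks);
                    }
                    disks.add(d.id);
                }
            }
            return transactions;
        }
    }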

Since only those point locations which lie inside a valid disk are associated, trajectories beyond the threshold distance (ε) from all others are pruned at this stage. Consequently, in most cases the translation from trajectories to transactions results in a considerable reduction in the number of valid trajectories. However, it also introduces limitations: as BFE uses a fixed distance to cluster the trajectories, it is inevitable that some disks overlap others. As a consequence, the same point location can be associated with more than one disk.

Figure 3.6 illustrates a snapshot of the Oldenburg dataset using ε = 200 (metres) and μ = 3 (trajectories). Trajectories such as T_2 and T_3 can easily be associated with a unique disk. On the other hand, T_4 is contained by two different disks and T_1 by three.

While isolated locations such as T_5 will not appear in the transactional version, T_4 and T_1 will appear as members of several disks. However, this does not appear to affect the final size of the transactional version, which turns out to be considerably smaller than the original dataset.

When the transactional version of the dataset (D) is complete, it is passed, together with the minimum support threshold (min_sup), as a parameter to the LCM algorithm (line 12 in Algorithm 3.1). LCM is an independent program, written in the C programming language, available for download at [26]. Two variants of the program are available: LCM_max and LCM_closed retrieve the maximal or the closed set of frequent patterns, depending on the case. The output M (line 12 in Algorithm 3.1) is a plain text file in which each line is a maximal pattern containing a set of disk IDs separated by spaces.

Figure 3.6: Overlapping problem during the generation of final disks.
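Invoking the external LCM binary from Java could look like the following sketch. The binary name and the argument order here are assumptions based on the description above; the documentation of the LCM version downloaded from [26] should be consulted for the exact command-line syntax:

    import java.io.File;
    import java.io.IOException;

    public class LcmRunner {
        public static File run(File transactions, int minSup)
                throws IOException, InterruptedException {
            File output = new File("maximal_patterns.txt");
            ProcessBuilder pb = new ProcessBuilder(
                    "./lcm_max",                    // variant mining maximal frequent patterns
                    transactions.getAbsolutePath(), // transactional version of the dataset (D)
                    String.valueOf(minSup),         // minimum support threshold (min_sup)
                    output.getAbsolutePath());      // one pattern per line, disk IDs space-separated
            pb.redirectErrorStream(true);           // merge stderr into stdout
            Process lcm = pb.start();
            if (lcm.waitFor() != 0) {
                throw new IOException("LCM terminated abnormally");
            }
            return output;
        }
    }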

The set of core trajectories and the time consecutiveness are checked in the post-processing stage. Lines 14 to 18 in Algorithm 3.1 declare initial values for iterating through the maximal patterns. Afterwards, the time interval and trajectory members are retrieved for each disk contained in the pattern (lines 20 and 21 in Algorithm 3.1). The start and end of each flock pattern are set after checking time consecutiveness. The set of trajectories common to all the disks in a maximal pattern is considered as the leader trajectories (line 23 in Algorithm 3.1).

In many cases, each frequent pattern can be associated with a unique flock pattern. However, it is possible that long frequent patterns contain disks from non-consecutive time intervals. In that case, several flock patterns are reported from the same maximal pattern, one for each segment whose duration is at least the minimum time duration (δ) (lines 26 to 29 and 32 to 34 in Algorithm 3.1), as sketched below.
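The consecutiveness check can be sketched as follows, splitting the sorted timestamps of a maximal pattern into runs of consecutive intervals and keeping those lasting at least δ (names and input layout are hypothetical):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class Segments {
        // Returns {start, end} pairs for every run of consecutive timestamps
        // whose duration is at least delta.
        static List<int[]> consecutiveRuns(int[] times, int delta) {
            Arrays.sort(times);
            List<int[]> runs = new ArrayList<int[]>();
            int start = 0;
            for (int i = 1; i <= times.length; i++) {
                // a run ends at a gap in the timestamps or at the end of the array
                if (i == times.length || times[i] != times[i - 1] + 1) {
                    if (times[i - 1] - times[start] + 1 >= delta) {
                        runs.add(new int[] { times[start], times[i - 1] });
                    }
                    start = i;
                }
            }
            return runs;
        }

        public static void main(String[] args) {
            // timestamps 3..5 and 9..10: only the first run meets delta = 3
            for (int[] run : consecutiveRuns(new int[] {3, 4, 5, 9, 10}, 3)) {
                System.out.println("flock from t=" + run[0] + " to t=" + run[1]);
            }
        }
    }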

As in the BFE algorithm, the overlapping problem requires the pruning of duplicate and redundant patterns. The set of suitable flock patterns is stored in a tree structure, so that patterns with the same members (and the same start and end timestamps) are easily detected and excluded (the additional validation in lines 26 and 32 in Algorithm 3.1). Redundant patterns occur when two patterns share exactly the same members but the time duration of one of them is contained within that of the longer one. Using the same data structure, this kind of pattern can also be detected, keeping just the longer one. Once the post-processing stage finishes, the final flock patterns are saved to a file.
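The pruning idea can be illustrated with the following sketch. Note that it keys a plain map by the sorted member set rather than using the prototype's tree structure, and all names are illustrative:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    public class FlockPruner {
        static class Flock {
            final Set<Integer> members;
            final int start, end;
            Flock(Set<Integer> m, int s, int e) { members = m; start = s; end = e; }
            int duration() { return end - start + 1; }
        }

        private final Map<String, Flock> best = new HashMap<String, Flock>();

        // For each distinct member set, keep only the longest pattern;
        // exact duplicates have equal duration and are simply dropped.
        void offer(Flock f) {
            String key = new TreeSet<Integer>(f.members).toString(); // order-independent key
            Flock kept = best.get(key);
            if (kept == null || f.duration() > kept.duration()) {
                best.put(key, f);
            }
        }
    }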

The last phase of the framework covers the visualization of the resulting flock patterns. The key information about a specific flock pattern is its start and end timestamps and the trajectory IDs of its members. From this information the locations (latitude and longitude) of the members along the flock's lifetime can be queried from the original dataset. However, in large spatio-temporal datasets this could be costly. The implementation therefore stores a line representation of the flock, together with its key information, when a flock passes the post-processing stage. Two variants were used as representations depending on the context and application: firstly, a line generated from the centroids of the trajectory
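A centroid-based line of this kind can be sketched with JTS as follows, assuming one centroid per timestamp of the flock's lifetime (the input layout and all names are hypothetical):

    import com.vividsolutions.jts.geom.Coordinate;
    import com.vividsolutions.jts.geom.GeometryFactory;
    import com.vividsolutions.jts.geom.LineString;

    public class FlockLine {
        // positions[t][m] holds the coordinate of member m at timestamp t
        static LineString centroidLine(Coordinate[][] positions) {
            Coordinate[] centroids = new Coordinate[positions.length];
            for (int t = 0; t < positions.length; t++) {
                double x = 0, y = 0;
                for (Coordinate c : positions[t]) { x += c.x; y += c.y; }
                int n = positions[t].length;
                centroids[t] = new Coordinate(x / n, y / n); // centroid of the members at t
            }
            return new GeometryFactory().createLineString(centroids);
        }
    }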
