Grid architecture for distributed process mining


Grid architecture for distributed process mining

Citation for published version (APA):

Bratosin, C. C. (2011). Grid architecture for distributed process mining. Technische Universiteit Eindhoven.

https://doi.org/10.6100/IR699500

DOI:

10.6100/IR699500

Document status and date:

Published: 01/01/2011

Document Version:

Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)

Please check the document version of this publication:

• A submitted manuscript is the version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher's website.

• The final author version and the galley proof are versions of the publication after peer review.

• The final published version features the final layout of the paper including the volume, issue and page numbers.

Link to publication

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.

• You may not further distribute the material or use it for any profit-making activity or commercial gain.

• You may freely distribute the URL identifying the publication in the public portal.

If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, please follow the link below for the End User Agreement:

www.tue.nl/taverne

Take down policy

If you believe that this document breaches copyright, please contact us at openaccess@tue.nl, providing details, and we will investigate your claim.


Grid Architecture for

Distributed Process Mining


Copyright © 2011 by Carmen Bratosin. All Rights Reserved.

Grid Architecture for Distributed Process Mining by Carmen Bratosin

Eindhoven: Technische Universiteit Eindhoven, 2011. Proefschrift.

Cover photo by Iman Mosavat

A catalogue record is available from the Eindhoven University of Technology Library

ISBN 978-90-386-2446-4 NUR 980

This work has been supported by the NWO within grant “Workflow Management for Large Parallel and Distributed Applications (WoMaLaPaDiA)”

SIKS Dissertation Series No. 2011-12

The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems.


Grid Architecture for

Distributed Process Mining

PROEFSCHRIFT

to obtain the degree of doctor at the Technische Universiteit Eindhoven, by authority of the rector magnificus, prof.dr.ir. C.J. van Duijn, to be defended in public before a committee appointed by the College voor Promoties on Tuesday 29 March 2011 at 16.00

by

Carmen Bratosin


This thesis has been approved by the promotor:

prof.dr.ir. W.M.P. van der Aalst

Copromotor:


Acknowledgements

Almost four years ago, I took on the challenge of doing a PhD in computer science in the Netherlands. This journey was very much motivated by the ambition to grow and learn more. As in all journeys, the road gets rough sometimes, and one faces ups and downs. Luckily, I was surrounded by many people who supported and encouraged me.

First, I would like to thank my promotor Wil van der Aalst for his constant involvement and ideas. Guided by his supervision, I learned how to think and develop solutions in a structured and principled manner, especially by using formal methods. Moreover, he enabled me to link my research with real-life problems and improve its relevance.

Second, I am grateful to Natalia Sidorova for her day-to-day supervision. Most of my research ideas were seeded during conversations with Natalia. Thank you Natalia, for giving me confidence when I needed it, and for providing close supervision.

Third, I express great gratitude to Simona Caramihai, for introducing me to the field of discrete event systems more than 10 years ago, and for being a constant support in my professional and personal life.

Thank you Kees van Hee, Farhad Arbab, Guszti Eiben, Pilar Herrero and Simona Caramihai for serving on my committee and for your valuable feedback.

I would further like to thank all my colleagues from both Information Systems groups and the B.E.S.T. program for their support and interesting conversations. In particular, I thank Marc Voorhoeve for always being willing to help, Riet van Buul for her kindness and support, Eric Verbeek for answering all my technical questions, Boudewijn van Dongen for the ProM support, Nikola Trčka for the collaboration on the WoMaLaPaDiA project, Ana-Karla Alves de Medeiros for her work on genetic process mining, and Ine van der Ligt for being so resourceful in solving all the bureaucratic problems.


Special thanks go to Helen Schonenberg, the best office-mate I could ever imagine. Thank you for all the talking, the pat on the back, the complaining and ... the fitness classes. Life was so much easier with you being there.

I also want to thank all my friends for listening and for their understanding, especially, (in Eindhoven) Christiane Peters, Lusine Hakobyan and Omar Komiha; (in Romania) Carmen-Luisa Ghiță, Raluca Misleanu and Călin Munteanu.

Thank you Iman, for bringing the needed peace and happiness into my life in the past two years. Thank you for believing in me, and being next to me during the final stages of my PhD.

Last but not least I want to thank my family for their understanding.

In the end, I want to mention the person whose spirit watched over me for the past seven years: my mother Ana. Her kindness, love and understanding nurtured me. All my achievements are because of the love you gave me. Mulțumesc, mami! (Thank you, mom!)


Contents

1 Introduction
1.1 Process Mining and Its Challenges
1.2 Research Problems
1.3 Contributions
1.4 Outline of the Thesis

2 Preliminaries
2.1 Basic Notations
2.2 Processes and Process Models
2.2.1 Formalisms for Process Models
2.2.1.1 (Colored) Petri Nets
2.2.1.2 Yet Another Workflow Language - YAWL
2.2.1.3 Causal Matrices and Heuristics Nets
2.3 Process Mining
2.3.1 Event Logs
2.3.2 Process Mining Domain
2.3.3 PMAs Overview
2.3.4 Process Mining Experiments
2.3.5 ProM 6: the Process Mining Toolkit
2.4 Genetic Process Mining Algorithm
2.4.1 Genetic Algorithms and Genetic Programming
2.4.2 Building the Initial Population
2.4.3 Fitness Computation
2.4.4 Compute Next Population
2.5 Empirical Evaluation and Statistics
2.5.1 Comparing Means: T-test Evaluation
2.5.2 Graphical Representations
2.5.2.1 Error Bar Plot
2.5.2.2 Box Plot

3 Distributed Genetic Process Mining
3.1 Introduction
3.2 Overview of Distributed Genetic Algorithms
3.3 Distribution Architecture
3.3.1 Parameters and Strategies
3.4 Conclusions

4 Distributed Genetic Process Mining: Computation Distribution
4.1 Introduction
4.2 Test Event Logs
4.3 Related Work
4.3.1 Performance Analysis of Coarse-Grained Genetic Algorithms
4.3.2 Distributed Genetic Programming
4.4 Parameters Evaluation
4.4.1 Experimental Evaluation of Different Migration Policies
4.4.1.1 Selection Policy and Integration Policy
4.4.1.2 Varying Migration Interval and Migration Size
4.4.2 Experimental Evaluation of the Influence of the Population Size and Number of Islands
4.4.2.1 Log A - Results for Various Stop Conditions
4.4.2.2 Log B and Log Municipality
4.5 Discussion on the Experimental Results
4.6 Conclusions

5 Distributed Genetic Process Mining: Event Log Sampling
5.1 Introduction
5.2 Related Work
5.3 Event Logs Redundancy
5.4 Sample-based Genetic Miner
5.4.1 Algorithm
5.4.2 Evaluation
5.4.3 Improvement of the Execution Time
5.5 Distributed Sample-based Genetic Miner Algorithm
5.5.1 Evaluation
5.5.1.1 Sample Distribution Strategy
5.5.1.2 Population Size and Sample Size
5.5.2 Discussion on the Execution Time Improvement
5.6 Conclusions

6 Modeling a Grid Environment
6.1 Introduction
6.2 Related Work
6.2.1 Grid Computing Architectures
6.2.2 Modeling Information Systems with CPN
6.3 Scientific Workflows
6.4 CPN based Grid Model
6.4.1 Inter-job Coordination
6.4.2 Resource Allocation
6.4.3 Resource Layer
6.5 Analysis of the Grid Model
6.5.1 Resource Allocation when Using a Simple Strategy
6.5.2 Evaluating a Data Removal Strategy
6.6 Conclusions

7 Proof of Concept
7.1 Introduction
7.2 YAGA: Yet Another Grid Architecture
7.2.1 YAGA Inter-job Coordination Layer
7.2.1.1 Using YAWL to Define Grid Workflows
7.2.1.2 YAWL System
7.2.1.3 YAGA Service
7.2.2 YAGA Resource Allocation Layer
7.2.3 YAGA Intra-job Coordination
7.2.4 YAGA Resource Layer
7.3 ProM Distributed Context
7.4 Genetic Process Mining as ProM Plug-ins
7.4.1 GMA and SGMA
7.4.2 DGMA and DSGMA
7.5 Conclusions

8 Reference Model and YAGA: Model Validation
8.1 Introduction
8.2 Time Parameters
8.3 Validation Results
8.3.1 Homogeneous Resource Pool
8.3.2 Add a New Resource
8.4 Conclusions

9 Conclusions
9.1 Contributions
9.2 Discussion on Possible Work Distribution Strategies
9.3 Challenges
9.4 Final Remarks

Bibliography

Curriculum Vitae


Chapter 1

Introduction

Nowadays, information systems facilitate information access and the management and execution of different types of activities. Examples of such information systems are classical workflow management systems (e.g. Staffware), case handling systems (e.g. FLOWer), PDM systems (e.g. Windchill), middleware (e.g. IBM's WebSphere), and hospital information systems (e.g. Chipsoft).

Information systems record detailed information about the activities that have been executed. For example, hospital information systems log all events related to a patient, e.g., type of blood tests performed, surgery, medicines prescribed. Another example is the "CUSTOMerCARE Remote Services Network" of Philips Healthcare (PH), where events occurring within an X-ray machine (e.g. moving the table, setting the deflector, etc.) are recorded and analyzed. These examples show that systems record (parts of) the actual behavior, and thus implicitly or explicitly create event logs. Table 1.1 presents an example of an event log inspired by a real life case study: handling objections against the real-estate property valuation and tax at a Dutch municipality. This log contains information about three process instances (individual runs of the objection handling process).

1.1 Process Mining and Its Challenges

The goal of process mining is to automatically discover models explaining the behavior captured in the form of event logs. For example, one can deduce a Petri net model that represents the behavior recorded in some event log. The advantage of using process mining is twofold. First of all, it offers information on how processes are actually carried out in real-world settings. Consider the "CUSTOMerCARE Remote Services Network" example. The X-ray machines can be used for multiple purposes in different ways and there are no clear usage scenarios. Van Uden [106, 57] analyzed the PH logs and proposed a methodology for discovering


Table 1.1: A simplified event log

Case id | Activity name                 | Originator | Timestamp           | Data
1       | Register complaint            | Helen      | 27-04-2005 12:00:00 | ...
2       | Register complaint            | Helen      | 27-04-2005 15:00:00 | ...
2       | Suspend billing process       | Natalia    | 27-04-2005 16:00:00 | ...
1       | Analyze complaint             | Wil        | 28-04-2005 10:09:00 | ...
3       | Register complaint            | Helen      | 28-04-2005 11:00:00 | ...
2       | Analyze complaint             | Natalia    | 28-04-2005 11:32:00 | ...
1       | Recalculate                   | Wil        | 28-04-2005 12:00:00 | ...
3       | Suspend billing process       | Wil        | 28-04-2005 12:30:00 | ...
2       | Re-initialize billing process | System     | 28-04-2005 15:00:00 | ...
3       | Analyze complaint             | Wil        | 29-04-2005 09:00:00 | ...
3       | Recalculate                   | Wil        | 29-04-2005 10:40:00 | ...
3       | Re-initialize billing process | System     | 29-04-2005 13:00:00 | ...
1       | Inform complainant            | Helen      | 30-04-2005 10:00:00 | ...
2       | Recalculate                   | Natalia    | 30-04-2005 10:04:00 | ...
2       | Inform complainant            | Helen      | 30-04-2005 10:30:00 | ...
3       | Inform complainant            | Helen      | 30-04-2005 11:00:00 | ...

user profiles and extracting use cases for testing the X-ray system under realistic circumstances.
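To make the notion of an event log concrete, the first events of Table 1.1 can be grouped into one trace per case. The snippet below is our own illustrative sketch using plain tuples, not any ProM data structure:

```python
from collections import defaultdict

# (case id, activity name) pairs, already ordered by timestamp,
# transcribed from the first seven events of Table 1.1
events = [
    (1, "Register complaint"),
    (2, "Register complaint"),
    (2, "Suspend billing process"),
    (1, "Analyze complaint"),
    (3, "Register complaint"),
    (2, "Analyze complaint"),
    (1, "Recalculate"),
]

def traces(events):
    """Group events by case id, preserving their order of occurrence."""
    by_case = defaultdict(list)
    for case_id, activity in events:
        by_case[case_id].append(activity)
    return dict(by_case)

# trace of case 1: ['Register complaint', 'Analyze complaint', 'Recalculate']
print(traces(events)[1])
```

Grouping by case id in this way yields exactly the traces (one sequence of activities per process instance) that discovery algorithms take as input.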

A second advantage of process mining is the possibility to compare the actual behavior with the intended one. The intended behavior is usually defined either in a formal way (by defining a formal process model) or in an informal way (e.g., guidelines, BPMN models). By comparing the models of intended behavior with the discovered process models, deviations from the intended behavior can be found and this information can be used to improve the processes. For example, Rozinat et al. [92] investigate the testing process of ASML wafer scanners. The conclusion is that the "real process is much more complicated than the idealized reference model" and that the process is less structured than it was intended.

Various process mining algorithms able to automatically discover process models have been created in the last years. One of the earliest algorithms is the α-miner [12], which constructs a Petri net based on the precedence of events in the log. The algorithm can be formalized in only eight lines of code and is linear in the size of the event log. The simplicity of the algorithm is misleading, since it produces accurate models only if the event log contains all the possible direct successions, i.e., if activity a can be followed by activity b then the sequence ab should be present in the event log. Moreover, the algorithm requires the event log to be noise-free.
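As an illustration of the relation the α-miner starts from, the direct-succession pairs can be extracted from a log as follows. This is our own sketch of only that first step, not the full eight-step algorithm:

```python
def direct_successions(traces):
    """Collect all pairs (a, b) where activity b directly follows activity a in some trace."""
    succ = set()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            succ.add((a, b))
    return succ

# two traces over activities A, B, C, D
log = [["A", "B", "D"], ["A", "C", "D"]]
print(direct_successions(log))
```

If a succession such as (A, B) never occurs in the log, the α-miner cannot recover that ordering, which is exactly the completeness requirement described above.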

In order to discover processes from incomplete event logs containing noise, Weijters et al. [110] developed the Heuristics Miner. The main idea is to filter out all infrequent direct relations, under the assumption that infrequent behavior represents noise. The drawback of the algorithm is its inability to discover complex patterns like choices depending on combinations of earlier events.


Figure 1.1: Process model representing the behavior in the event log from Table 1.1

The Genetic Mining Algorithm (GMA) uses genetic operators to overcome the shortcomings of other algorithms. GMA evolves populations of graph-based process models towards a process model that fulfills a predefined fitness criterion: the process model manages to replay all the behaviors observed in the event log and does not allow additional ones. Commercial business intelligence tools such as Futura Reflect (http://www.futuratech.nl) and BPM systems such as BPM|one (http://www.pallas-athena.com) provide process mining facilities based on GMA. Figure 1.1 presents a process model derived from the information shown in Table 1.1 using GMA.

Many algorithms rely on assumptions that are not realistic: absence of noise, log completeness, etc. The algorithms able to cope with real-life logs, such as GMA, have the disadvantage of being computationally expensive. Moreover, event logs may be huge, e.g., there may be thousands of different cases and thousands of events per case. Logs such as the ones produced by the machines of PH and ASML illustrate the computational challenges [57, 92].

In all the real-life experiments performed [57, 92], practitioners follow different steps in order to gain insights into a process. The "process of process mining" consists of additional pre- and post-processing steps (filtering, cleaning, merging, conformance checking, etc.). Currently, these steps are performed manually, making the process tedious, error-prone and time-consuming. Moreover, inexperienced end-users want a "one push of a button" approach to obtain the result without having to consider other details like filtering. Only a few attempts have been made to automate this process [68].
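Automating such a chain of steps amounts to composing them. The toy sketch below chains a filtering step and a discovery step into a single run; the function names and the frequency-based noise filter are our own illustration, not the framework proposed in this thesis:

```python
def filter_noise(log, min_frequency=2):
    """Pre-processing: keep only trace variants that occur at least min_frequency times."""
    counts = {}
    for trace in log:
        counts[tuple(trace)] = counts.get(tuple(trace), 0) + 1
    return [list(t) for t, n in counts.items() if n >= min_frequency for _ in range(n)]

def discover(log):
    """Stand-in for a discovery algorithm: here, just the direct-succession pairs."""
    return {(a, b) for trace in log for a, b in zip(trace, trace[1:])}

def mining_pipeline(log):
    """One 'push of a button': filtering followed by discovery."""
    return discover(filter_noise(log))

log = [["A", "B"], ["A", "B"], ["A", "C"]]  # the variant <A, C> occurs only once
print(mining_pipeline(log))
```

Chaining the steps in one function is the essence of the "one push of a button" experience: the user supplies a log and gets a result, with the intermediate pre-processing hidden.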

Based on the above observations, we identify two main questions that we aim to answer in this thesis:

• How can we improve the time efficiency of advanced process mining algorithms?


• Is it possible to run the “process of process mining” in an automated and efficient way?

We use distributed environments to find an answer to the first question. Distributed environments such as grids offer the opportunity to run computationally and data intensive experiments in a shorter time by distributing the work over multiple computers.

Grid computing [52, 64] is concerned with the development and advancement of technologies that provide seamless and scalable access to world-wide distributed resources. The application areas of grid computing include physics, astronomy and biology (see e.g. [1, 3]). Grid computing requires hardware and software infrastructures that provide dependable, consistent, pervasive, inexpensive access to high-end computational capabilities and resource coordination. Several tools have been developed, such as Globus [2] and Condor [102], that support the submission and management of requests to a grid.

For the second question, i.e., automating the "process of process mining", we have to describe the "flow of work" in process mining. In business information systems, processes are usually described in terms of a workflow, where a workflow represents "a network of tasks with rules that determine the (partial) order in which tasks should be performed" [83]. Multiple formalisms are currently available to represent workflows: Petri nets, BPMN, Heuristics nets, EPC, etc. One of the most expressive formalisms is YAWL [101], a formalism based on the workflow patterns initiative [10]. Inspired by the business workflow concept, scientists introduced the idea of Scientific Workflows (SW); several systems that support their design and execution have been developed [42, 54, 58, 75, 85, 109]. However, unlike classical workflows, the control is decentralized and the resources are computing power, memory, software, etc. rather than people.

Next, we formulate the research problems for this thesis and show how distributed environments and workflow concepts are used to answer the research questions stated above.

1.2 Research Problems

The idea of using distributed environments to develop a time-efficient solution leads us to the following research question:

How to provide an easy-to-use and scalable process mining framework that allows for the definition and execution of complex process mining experiments in distributed environments?

To answer the above question, we need to address the following problems:

RP 1 How can we improve the time efficiency of computationally expensive algorithms by using grid environments?


RP 2 How can we handle large event logs?

RP 3 How to design scientific workflows for defining a “process” of process mining?

RP 4 How to manage the distribution of the tasks of a scientific workflow on the nodes of a (shared) grid?

Note that we investigate the research question from an architectural point of view. We focus on the usability and versatility of the architecture and we abstract from aspects such as scheduling algorithms and fault recovery techniques.

1.3 Contributions

We achieve the research goal by proposing a grid-based framework that, on the one hand, increases process mining efficiency by parallel execution of process mining activities and, on the other hand, allows users to define and execute complex process mining experiments. Figure 1.2 presents our conceptual architecture. The Inter-job Coordination layer allows users to define and execute process mining experiments in terms of scientific workflows. For solving RP 3 we define process mining workflows as a composition of jobs. The modules of the Resource Allocation layer provide solutions for RP 4. If a job requires more than one resource for its execution, the Intra-job Coordination level handles the orchestration of the interdependencies between the sub-jobs, where a sub-job is a part of the job running on one resource. The Intra-job Coordination focuses on answering the first two research problems (RP 1 and RP 2).

The thesis addresses the research goal by considering two levels of detail: 1) we focus first on the distribution of data and process mining activities, i.e., the intra-job coordination, and 2) we propose a framework for the automated execution of process mining workflows, i.e., the inter-job coordination and resource allocation. These contributions are summarized below.

Intra-job Coordination: Design of Distributed Process Mining Algorithms (RP 1 and RP 2 ) — Chapters 3, 4 and 5

Each process mining algorithm has its own characteristics and a uniform distributed solution is difficult to define. Therefore, we focus on a particular mining algorithm with high computation cost, i.e., GMA. Based on experiments with distributing GMA, we draw conclusions about the distribution aspects that can also be applicable to other process mining algorithms.

We present two distribution strategies: the Distributed Genetic Mining Algorithm (DGMA) and the Distributed Sample-based Genetic Mining Algorithm (DSGMA).

Figure 1.2: A conceptual architecture

The DGMA focuses on distributing the computation of GMA by using a coarse-grained approach [103]. Experimental evaluation of the algorithm shows that the execution time of DGMA is considerably reduced compared to the classical GMA. We propose guidelines for choosing the distribution parameters.
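The coarse-grained (island-model) idea can be sketched on a toy numeric optimization problem. Everything below — the objective, the ring migration topology, and all parameter values such as `migration_interval` and `migration_size` — is our own illustration of the general scheme, not the actual DGMA implementation:

```python
import random

def evolve(island, fitness, mutate):
    """One generation on one island: create children, keep the best half of the pool."""
    children = [mutate(ind) for ind in island]
    pool = sorted(island + children, key=fitness, reverse=True)
    return pool[:len(island)]

def island_ga(n_islands=4, pop_size=10, generations=60,
              migration_interval=5, migration_size=2, seed=42):
    rng = random.Random(seed)
    fitness = lambda x: -abs(x - 100)            # toy objective: get close to 100
    mutate = lambda x: x + rng.uniform(-3, 3)
    islands = [[rng.uniform(0, 10) for _ in range(pop_size)]
               for _ in range(n_islands)]
    for gen in range(1, generations + 1):
        # each island evolves independently (in DGMA, on its own grid node)
        islands = [evolve(isl, fitness, mutate) for isl in islands]
        if gen % migration_interval == 0:
            # ring topology: each island sends copies of its best individuals
            migrants = [sorted(isl, key=fitness, reverse=True)[:migration_size]
                        for isl in islands]
            for i, isl in enumerate(islands):
                isl.sort(key=fitness)            # worst individuals first
                isl[:migration_size] = migrants[(i - 1) % n_islands]
    return max((ind for isl in islands for ind in isl), key=fitness)
```

The migration interval and migration size in the sketch correspond to the distribution parameters whose choice is evaluated in Chapter 4; in DGMA the individuals are process models rather than numbers.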

The second strategy, DSGMA, focuses on the distribution of the input event log by exploiting redundancies in event logs. Empirical studies show that such a strategy can further improve the execution time.

The two algorithms show that process mining algorithms can use distribution strategies to reduce execution time. We provide evidence that event log characteristics can be used to improve the algorithms' time efficiency, which is critical when dealing with large event logs.

For evaluating our algorithms, we use the technology provided by the state-of-the-art process mining tool ProM 6 [4]. ProM is a pluggable framework for process mining. The framework enables process mining practitioners to perform log preprocessing, process model discovery, performance analysis, etc.


Reference Model for a Grid Architecture inspired by experiences from the Process Mining Domain (RP 3 and RP 4) — Chapters 6, 7 and 8

We identify the specific characteristics of the process mining domain to enable the execution of process mining workflows in a distributed environment. Since process mining workflows have the same characteristics as scientific workflows, we propose a grid-oriented architecture. We define our own formal executable reference model because of the lack of formal definitions of grid architectures in the literature. The proposed reference model clarifies the main concepts of our architecture and allows for various types of verification and analysis. The formal model is defined in terms of Colored Petri nets (CPNs) and we use CPN Tools [41] for modeling, analysis and simulation.

We use the model to test various data distribution strategies in order to assess the model adaptability to new strategies, and also, to investigate how data handling can improve the average throughput time of particular workflows.

Implementation of the Results into a Conceptual Framework — Chapters 7 and 8

Based on the results presented in the thesis, a proof of concept for the framework was developed. The proposed distributed algorithms, DGMA and DSGMA, have been implemented as plug-ins in ProM 6. We also implemented an Adaptable Distributed Genetic Mining Algorithm (ADGMA) to allow real-time tuning of the distribution parameters and to provide a better understanding of the search evolution. Moreover, ProM 6 has been extended in order to be integrated in a distributed environment.

For the execution and distribution of process mining workflows, Yet Another Grid Architecture (YAGA) was developed based on the reference model. YAGA combines the powerful workflow engine of YAWL [6, 101] and the ProM 6 framework through a Java-based grid middleware.

The use of a workflow language such as YAWL for workflow definition provides users with the possibility of defining their experiments in a clear and simple manner. Moreover, the development of the ProM framework is orthogonal to the definition of the workflows. In this manner, there is a clear separation between the use of the framework and the development of algorithms.

The grid middleware uses a modular approach to facilitate the deployment of new scheduling algorithms, data allocation strategies and fault handling mechanisms.

The results presented in this thesis are based on the following articles published in peer-reviewed conferences and journals: [7, 25, 26, 27, 28, 29, 30, 104, 105].


1.4 Outline of the Thesis

This section describes the remaining chapters of the thesis.

Chapter 2 introduces the process mining domain and presents the state of the art in workflow management systems for scientific domains.

Chapter 3 describes the main architecture of the Distributed Genetic Mining Algorithm and its advantages.

Chapter 4 focuses on the issues of the distribution of the computation and proposes guidelines, based on event log characteristics, for the choice of distribution parameters.

Chapter 5 shows how event log properties can be used in order to achieve better time efficiency. Similar to Chapter 4, we present guidelines for the choice of parameters.

Chapter 6 proposes a reference model for grid architectures. The main concepts are explained through a CPN-model that allows for different types of analyses.

Chapter 7 presents the tools developed for the proof of concept based on the proposed algorithms and architectures.

Chapter 8 validates the reference model proposed in Chapter 6 by performing a comparative study. The results of experiments on a real grid architecture are compared with the results of various simulation experiments.


Chapter 2

Preliminaries

2.1 Basic notations

Let S be a set. We denote by |S| the number of elements of the set S.

P(S) is the power set of S, i.e., the set of all subsets of S, and P+(S) denotes the power set of S without the empty set, i.e., P+(S) = P(S) \ {∅}.

S* is the set of all sequences created from the elements of S. The length of a sequence s ∈ S* is denoted as |s|.

Let s ∈ S*. We use set(s) ⊆ S for the set of elements of s. We use the shorthand notation a ∈ s for a ∈ set(s).

S_MS is the multiset type over the set S.
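These notations have direct counterparts in code. The sketch below is our own illustration on a two-element set; it is not part of the thesis formalism:

```python
from itertools import chain, combinations
from collections import Counter

def powerset(S):
    """P(S): the set of all subsets of S, represented as frozensets."""
    items = list(S)
    return {frozenset(c)
            for c in chain.from_iterable(combinations(items, r)
                                         for r in range(len(items) + 1))}

S = {"a", "b"}
P = powerset(S)                 # P(S), with |P(S)| = 2^|S|
P_plus = P - {frozenset()}      # P+(S): the power set without the empty set

s = ["a", "b", "a"]             # a sequence s over S, with |s| = 3
elements = set(s)               # set(s), the elements occurring in s
bag = Counter(s)                # a multiset over S: 'a' occurs twice
```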

2.2 Processes and Process Models

A (business) process refers to the production of a type of product [11]. Processes can be found in any domain. Consider the handling of insurance claims for an insurance company, or weather predictions based on image processing for a weather center. Activities need to be carried out in order to "deliver" the product. For example, possible steps for the insurance claim are:

A) Register the claim

B) Check the client policy

C) Verify the damage

D) Reject the claim

E) Verify client policy options


Figure 2.1: Processes Lifecycle

F) Evaluate the damage

G) Inform the client

Each step of the process corresponds to an activity that has to be executed. Note that not all activities need to be performed for every case and the order in which the steps are executed can vary. Moreover, precedence relations can exist between activities. For example, the damage cannot be evaluated without a priori verification.

A process instance is the representation of a single enactment of a process, including its associated data. Each process instance represents a separate execution of the process that is independently controlled and has its own state and unique identity. A process instance has a trace associated with it, i.e., the sequence of activities that have been executed.

(Business) processes are usually supported by Process Aware Information Systems (PAISs) [46]. Examples of PAISs are Workflow Management Systems, e.g. BPM|one, and Enterprise Resource Planning systems, e.g. SAP R/3 or PeopleSoft. These systems support the configuration of processes and facilitate the execution of tasks by users, e.g. by allocating tasks to human resources.

Figure 2.1 presents the life cycle of business processes. The lifecycle shows that processes are in a continuous transformation through their lifetime inside an organization. The four phases of the process lifecycle are:


A) Modeling is the phase in which the goals of a process are identified. Based on the domain knowledge, a process model is created. The process model depicts the dependencies between the activities to be executed and the possible alternatives of the process. Verification techniques can be used in order to ensure that the process model respects all the requirements (e.g., absence of livelocks and deadlocks).

B) Implementation is the phase in which the process model is transformed into an executable model. In this phase, the process model may need to be translated into the formalism used by the information system, i.e., the conceptual model needs to be converted into an executable system. Moreover, information related to data and resources needs to be configured.

C) Execution/Monitoring is the phase in which the users can start instances of the process. The system follows and records the progress of an instance. Newly available activities are offered to resources. Information regarding the execution of the process is recorded as sequences of uniquely identified events. An event represents the execution of an activity for a particular process instance. For a particular process, the collection of all events belonging to past executions is named event log.

D) Analysis is the phase in which, based on the event logs, the process is analyzed. The role of this phase is to discover possible problems in the execution of the process. The acquired information can further be used to improve the process. In this phase, different techniques, e.g. data mining, machine learning, and process mining, can be applied in order to extract information.

Note that findings from the analysis phase (phase D) lead to insights used in the modeling phase (phase A). Ideally, processes are redesigned based on proper analysis of running processes.

2.2.1 Formalisms for Process Models

Process models represent partially ordered sets of activities. In order to represent the relations between activities for a given process, graph-based models are usually preferred. Such models can lead to a better understanding of the process and of the possible problems that can occur when instances of the process are running. Additionally, if formal semantics are defined, analysis and verification methods can offer more insights into the process. In the following, we present the main formalisms used in this thesis.

2.2.1.1 (Colored) Petri Nets

A Petri net (PN) is a bipartite graph with two types of nodes: places and transitions. Transitions correspond to the activities composing the process. Places and transitions are connected by arcs; arcs connecting two places or two transitions are forbidden.

Definition 2.2.1 (Petri net). A Petri net is a triple (P, T, A) where:

• P is a finite set of places;

• T is a finite set of transitions such that P ∩ T = ∅;

• A ⊆ (P × T) ∪ (T × P) is a set of directed arcs.

A node x ∈ P ∪ T is an input node of a node y if and only if a directed arc from x to y exists, i.e., (x, y) ∈ A. Node x ∈ P ∪ T is an output node of a node y if and only if (y, x) ∈ A.

Each place holds a number of tokens, which signify threads of control flowing through the process. The state of the PN is given by the distribution of tokens across all places, denoted as the marking M, i.e., M : P −→ N. The state of a Petri net changes if and only if a transition fires:

A) A transition t is enabled in the current marking if and only if each input place p of t contains at least one token;

B) When an enabled transition t fires, t consumes a token from each of its input places and produces one token for each of its output places.

For more details about the semantics of Petri nets, we recommend [91]. Figure 2.2a shows an example of a PN with 7 transitions and 8 places.
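The enabling and firing rules above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the thesis: the Counter-based marking and the set-of-pairs encoding of the arcs are our own choices.

```python
from collections import Counter

def enabled(marking, t, arcs):
    """t is enabled iff every input place of t holds at least one token."""
    return all(marking[p] >= 1 for (p, x) in arcs if x == t)

def fire(marking, t, arcs):
    """Firing consumes one token per input place, produces one per output place."""
    m = Counter(marking)
    for (p, x) in arcs:          # input arcs (place, transition)
        if x == t:
            m[p] -= 1
    for (x, p) in arcs:          # output arcs (transition, place)
        if x == t:
            m[p] += 1
    return m

# A tiny net fragment: s --t1--> p --t2--> o
arcs = {("s", "t1"), ("t1", "p"), ("p", "t2"), ("t2", "o")}
m = Counter({"s": 1})            # initial marking: one token in the start place
assert enabled(m, "t1", arcs) and not enabled(m, "t2", arcs)
m = fire(m, "t1", arcs)          # consumes the token in s, produces one in p
```

Because places and transitions are disjoint name sets, an arc pair in which the transition appears second is an input arc and one in which it appears first is an output arc.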

Workflow nets are a subclass of Petri nets that require the existence of one start place and one end place, such that the start place has no input transitions and the end place has no output transitions. The initial marking of a workflow net is always the marking where the start place has one token and all other places are unmarked. The formal definition of a workflow net is given below.

Definition 2.2.2 (Workflow net). A workflow net is a tuple (P, T, A, s, o) where:

• (P, T, A) is a Petri net;

• s ∈ P is the start place, i.e., for any transition t ∈ T, (t, s) ∉ A;

• o ∈ P is the end place, i.e., for any transition t ∈ T, (o, t) ∉ A.

Workflow nets are a well-known formalism for modeling business processes and for verifying correctness criteria [11]. The most important correctness criterion is soundness. A workflow net is sound if and only if from any marking reachable from the initial marking there exists a transition sequence that reaches the final marking (i.e., the marking where the end place contains one token and all other places are unmarked). The soundness criterion guarantees that once a workflow net instance is started, it always has the possibility to terminate properly.
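For bounded nets, soundness as defined above can be checked by exhaustive state-space exploration: enumerate every marking reachable from the initial marking and verify that the final marking stays reachable from each of them. The sketch below is our own illustration (not the thesis's method); `transitions` pairs each transition's input places with its output places, and the net is assumed bounded so the exploration terminates.

```python
def step(m, transitions):
    """All markings reachable from m in one firing; m maps place -> token count."""
    for pre, post in transitions:
        if all(m.get(p, 0) >= 1 for p in pre):
            m2 = dict(m)
            for p in pre:
                m2[p] -= 1
            for p in post:
                m2[p] = m2.get(p, 0) + 1
            yield m2

def reachable(m0, transitions):
    """Set of all reachable markings, encoded as hashable frozensets."""
    key = lambda m: frozenset((p, c) for p, c in m.items() if c > 0)
    seen, todo = set(), [m0]
    while todo:
        m = todo.pop()
        if key(m) in seen:
            continue
        seen.add(key(m))
        todo.extend(step(m, transitions))
    return seen

def is_sound(m0, mf, transitions):
    """Sound iff the final marking mf is reachable from every reachable marking."""
    kf = frozenset((p, c) for p, c in mf.items() if c > 0)
    return all(kf in reachable(dict(k), transitions)
               for k in reachable(m0, transitions))

# Sound net: s -> t1 -> o.  The unsound variant adds t2: s -> d, a dead end.
sound = [({"s"}, {"o"})]
unsound = [({"s"}, {"o"}), ({"s"}, {"d"})]
```

Re-exploring the state space from every reachable marking is quadratic in the number of markings; it is adequate only for toy nets, which is all this sketch is meant for.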


Figure 2.2: (a) Petri net example; (b) Heuristics net example (the start activity is marked by a light gray color and the final activity by a dark gray color); (c) YAWL net example


CPNs [60] extend PNs with data, time, and hierarchy, and combine their strengths with those of programming languages. Each place has an associated "color set", i.e., a specific significance of the tokens residing in the place. Transitions and arcs have associated expressions. The set of all expressions is denoted EXPR. Type[e] denotes the type of the expression e ∈ EXPR, e.g., boolean, integer, etc. The set of free variables in an expression e is denoted Var[e]. We use the notation EXPR_V for the set of expressions e ∈ EXPR such that Var[e] ⊆ V, where V is a finite set of variables.

Definition 2.2.3 (Colored Petri net [60]). A non-hierarchical Colored Petri Net is a nine-tuple CPN = (P, T, A, Σ, V, C, G, E, I) where:

• P is a finite set of places;

• T is a finite set of transitions such that P ∩ T = ∅;

• A ⊆ (P × T) ∪ (T × P) is a set of directed arcs;

• Σ is a finite set of non-empty color sets;

• V is a finite set of typed variables such that Type[v] ∈ Σ for all variables v ∈ V;

• C : P −→ Σ is a color set function assigning a color set to each place;

• G : T −→ EXPR_V is a guard function assigning a guard to each transition t such that Type[G(t)] = Bool;

• E : A −→ EXPR_V is an arc expression function assigning an arc expression to each arc a such that Type[E(a)] = C(p)_MS (the multisets over C(p)), where p is the place connected to the arc a;

• I : P −→ EXPR_∅ is an initialization function assigning an initial marking to each place p such that Type[I(p)] = C(p)_MS.

CPNs follow firing rules similar to those of PNs. The difference is that for CPNs, the marking M(p) of a place p is a multiset over its color set, i.e., M(p) ∈ C(p)_MS. A transition t ∈ T is enabled if and only if there are enough tokens in its input places, i.e., each input place p contains enough tokens to evaluate the expression on the connecting arc (p, t), and the guard G(t) is true for the same set of tokens that enables the arc expressions. All variables associated with a transition (i.e., the variables in the guard and on the input and output arcs) need to be bound to particular values based on the available tokens. There may be multiple bindings for one marking. If a transition is enabled for a particular binding, then the transition can fire. When a transition fires, the tokens enabling the arc expressions are consumed from the input places and the output places receive tokens according to the arc expressions connecting them with the fired transition. For more details about CPNs, we recommend [60]. CPN Tools supports the editing, simulation, state space analysis, and performance analysis of CPN models.


Figure 2.3: YAWL constructs

2.2.1.2 Yet Another Workflow Language - YAWL

In order to identify the main features of workflow management systems, the Workflow Pattern Initiative was established to identify the constructs required for the specification of control-flow dependencies between tasks. Based on the identified patterns, the Yet Another Workflow Language (YAWL) was defined. The need for a new language came from the fact that none of the existing workflow languages could express most of the workflow patterns in a direct, intuitive manner.

YAWL [101] was inspired by PNs, but its formal semantics are defined directly in terms of a state transition system. Figure 2.3 presents the YAWL constructs. Each YAWL net is composed of tasks and conditions, similar to transitions and places in PNs. There are two types of tasks: atomic tasks, which represent executions of a particular activity, and composite tasks, which refer to a YAWL net.

A YAWL net has one start condition and one final condition. Similar to PNs, a YAWL net is a directed graph. The difference between PNs and YAWL, from the graphical point of view, is that YAWL allows tasks to be directly connected assuming that an implicit condition exists between them.

Multiple instance tasks represent the execution of the same task multiple times. Lower and upper bounds on the number of instances can be defined.

YAWL tasks have different constructs such as AND-join, OR-join, XOR-join, AND-split, OR-split, and XOR-split. Finally, YAWL supports the notion of a cancellation region, where tokens are removed from the net at execution time upon the execution of a specific task.

Definition 2.2.4 (YAWL net [101]). A YAWL net is a tuple (nid, C, i, o, T, TA, TC, M, F, Split, Join, Rem, Nofi, ArcCond) such that:

• nid is the identity of the YAWL net;

• C is the set of conditions;


• i ∈ C is the start condition;

• o ∈ C is the final condition;

• T is the set of tasks;

• TA ⊆ T is the set of atomic tasks;

• TC ⊆ T is the set of composite tasks;

• TA and TC partition T, i.e., T = TA ∪ TC and TA ∩ TC = ∅;

• M ⊆ T is the set of multiple instance tasks;

• F ⊆ (C \ {o} × T) ∪ (T × C \ {i}) ∪ (T × T) is the flow relation, such that every node in the graph (C ∪ T, F) is on a directed path from i to o;

• Split : T ↛ {AND, XOR, OR} specifies the split behavior of each task;

• Join : T ↛ {AND, XOR, OR} specifies the join behavior of each task;

• Rem : T ↛ P⁺(T ∪ C \ {i, o}) specifies the tokens to be removed by emptying a part of the net and the tasks that should be canceled as a consequence of an instance of this task completing execution;

• Nofi : M −→ N × N^inf × N^inf × {dynamic, static} specifies the multiplicity of each multiple instance task: the lower and upper bound of instances to be created at task initiation, the threshold for continuation indicating how many instances must complete for the thread of control to be passed to subsequent tasks, and whether additional instances can be created "on the fly" once the task has commenced;

• ArcCond : (F ∩ ({t ∈ T | Split(t) = XOR ∨ Split(t) = OR} × (T ∪ C))) −→ BoolExpr identifies the condition associated with each branch of an OR-split or XOR-split. BoolExpr represents the set of expressions that yield boolean results when evaluated.

A YAWL net is presented in Figure 2.2c. The net models the same process as the PN in Figure 2.2a. Task A is an XOR-split task: based on the internal variables of the task, an explicit choice between task D and tasks B and C is made. An AND-split task is used before tasks B and C in order to ensure that both tasks are enabled in parallel. Note that except for the start and final conditions, no other conditions are used.


2.2.1.3 Causal Matrices and Heuristics Nets

A causal matrix describes a process as a set of activities and their input/output conditions. The input/output conditions are sets composed of multiple subsets of activities. Activities in the same subset have an XOR-split/join relation, while the different subsets have an AND-split/join relation. Dedicated start and final activities are used to clearly indicate the start, respectively the end, of the process.

Definition 2.2.5 (Causal Matrix [20]). Let LS be a given set of labels. A Causal Matrix is a tuple CM = (A, C, I, O, Label, i, o), where:

• A is a finite set of activities,

• C ⊆ A × A is the causality relation,

• I : A −→ P(P(A)) is the input condition function,

• O : A −→ P(P(A)) is the output condition function,

• Label : A −→ LS is a function mapping each activity to a label from the set of labels,

• i ∈ A is the start activity,

• o ∈ A is the final activity,

such that

• C = {(a1, a2) ∈ A × A | a1 ∈ ⋃_{s ∈ I(a2)} s} = {(a1, a2) ∈ A × A | a2 ∈ ⋃_{s ∈ O(a1)} s}, i.e., (1) a causal relation (a1, a2) exists if and only if activity a1 belongs to the input set of a2 and a2 belongs to the output set of a1, and (2) for any activities a1 and a2, if a2 ∈ ⋃_{s ∈ I(a1)} s then (a2, a1) ∈ C, and if a2 ∈ ⋃_{s ∈ O(a1)} s then (a1, a2) ∈ C;

• I(i) = ∅, i.e., the start activity has no input arcs;

• O(o) = ∅, i.e., the final activity has no output arcs;

• Label : A −→ LS is injective, i.e., for all a1, a2 ∈ A, Label(a1) = Label(a2) implies a1 = a2.

Activities other than the start and final activities may also have empty input/output conditions. However, an activity t ≠ i with I(t) = ∅ is not considered enabled, since the process may start only from the start activity i.
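Condition (1) of Definition 2.2.5 can be checked mechanically by deriving C once from the input conditions and once from the output conditions and comparing the results. The sketch below is our own encoding (conditions as sets of frozensets); the I/O conditions are modeled on the causal matrix of Table 2.1, with Z's input condition taken as {{E, D}, {F, D}}, the form consistent with condition (1).

```python
def causal_relation(I, O, activities):
    """Derive C from I and from O; both derivations must agree (condition (1))."""
    from_inputs = {(a1, a2) for a2 in activities for s in I[a2] for a1 in s}
    from_outputs = {(a1, a2) for a1 in activities for s in O[a1] for a2 in s}
    assert from_inputs == from_outputs, "input/output conditions are inconsistent"
    return from_inputs

# Input/output condition functions, each condition a set of activity subsets.
I = {"A": set(), "B": {frozenset("A")}, "C": {frozenset("A")},
     "D": {frozenset("A")}, "E": {frozenset("C")}, "F": {frozenset("B")},
     "Z": {frozenset("ED"), frozenset("FD")}}
O = {"A": {frozenset("BD"), frozenset("CD")}, "B": {frozenset("F")},
     "C": {frozenset("E")}, "D": {frozenset("Z")}, "E": {frozenset("Z")},
     "F": {frozenset("Z")}, "Z": set()}
C = causal_relation(I, O, "ABCDEFZ")   # e.g. ("A", "B") is in C
```

Dropping, say, (D, Z) from one side while keeping it on the other makes the assertion fail, which is exactly the inconsistency that condition (1) forbids.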

Table 2.1 presents an example of a causal matrix. The start activity is A. After the execution of A, activities B, C and D are enabled. If D is started, we observe that due to the output condition of A, i.e., {{B ∨ D} ∧ {C ∨ D}}, B and C are disabled. After the execution of D, only the final activity Z is enabled, since O(D) = {{Z}}.


Activity             Input condition     Output condition
A (start activity)   -                   {{B, D}, {C, D}}
B                    {{A}}               {{F}}
C                    {{A}}               {{E}}
D                    {{A}}               {{Z}}
E                    {{C}}               {{Z}}
F                    {{B}}               {{Z}}
Z (final activity)   {{E, D}, {F, D}}    -

Table 2.1: Causal matrix example

Case ID   Trace
1         A B C E F Z
2         A B F C E Z
3         A C B E F Z
4         A B C E F Z
5         A C E B F Z
6         A D Z
7         A B C

Table 2.2: Event log example

Note that causal matrices can also be represented as graphs, called heuristics nets. In Figure 2.2b, the heuristics net for the causal matrix from Table 2.1 is depicted. We observe that activity A has three output arcs, to B, C and D, corresponding to its output condition in Table 2.1: {{B, D}, {C, D}}. B and D are in an XOR-split relation (depicted by the ∨ sign) since they belong to the same subset. Similarly, C and D are in an XOR relation. The two subsets are in an AND-split relation (signaled by the ∧ sign). Activity Z has three input activities, D, E and F, as denoted by its input condition {{E, D}, {F, D}}.

2.3 Process Mining

Process mining comprises a set of methods focusing on extracting knowledge about concurrent processes. The following two subsections introduce the terminology and the main concepts of the process mining domain.

2.3.1 Event Logs

In this subsection we define the main concepts and notations for event logs. In this thesis, we focus on the control-flow perspective of an event log, i.e., the order in which events are registered. Therefore, we abstract from other information in the event logs such as timestamps or originators.


Definition 2.3.1 (Activity). An activity is an atomic logical unit of work.

Definition 2.3.2 (Trace). Let A be a set of activities. A trace t is defined as a tuple (id, s) where:

• id ∈ PI_id is the process instance id of the trace;

• s ∈ A* is the sequence of activities executed.

Note that the individual activities belonging to the sequence of a trace are called events. Let T_inst be the universe of traces.

Definition 2.3.3 (Event Log). An event log L ⊆ T_inst is a set of traces such that if t1 = (id1, s1) and t2 = (id2, s2) with id1 = id2, then t1 = t2, i.e., the process instance ids are unique.

Definition 2.3.4 (Event log measurements). For an event log L the following measurements can be defined:

• The Number of traces NT of a log L is the number of traces in the log, i.e., NT(L) = |L|;

• The Size S of a log L is the total number of events in the log, i.e., S(L) = Σ_{(id,s)∈L} |s|;

• The Number of activities NA of a log L is the number of different activities present in the event log, i.e., NA(L) = |⋃_{(id,s)∈L} set(s)|, where set(s) denotes the set of activities occurring in the sequence s.

Table 2.2 presents an example of an event log for the process in Figure 2.2b. We observe that the event log contains seven traces, i.e., NT = 7, each identified by a unique case id. Note that the traces of cases 1 and 4 are identical. For this event log, we have S = 36 and NA = 7.
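The three measurements of Definition 2.3.4 are straightforward to compute. A minimal Python sketch, using our own encoding of the log as a mapping from case id to activity sequence:

```python
def log_measurements(log):
    """Return (NT, S, NA) for a log given as {case_id: activity_sequence}."""
    nt = len(log)                                  # NT: number of traces
    s = sum(len(seq) for seq in log.values())      # S: total number of events
    na = len(set().union(*(set(seq) for seq in log.values())))  # NA: distinct activities
    return nt, s, na

# The event log of Table 2.2, one character per activity.
log = {1: "ABCEFZ", 2: "ABFCEZ", 3: "ACBEFZ",
       4: "ABCEFZ", 5: "ACEBFZ", 6: "ADZ", 7: "ABC"}
print(log_measurements(log))   # → (7, 36, 7)
```

The result matches the values stated above for Table 2.2: seven traces, 36 events, and seven distinct activities.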

The log measurements offer some insight into the size of the process that created the event log, but they do not fully characterize the complexity of the process model. Such characterization in terms of metrics is an open research question.

Analysis algorithms cannot provide insights beyond the information available in the event log. Some notion of completeness is usually imposed on the event log as a requirement by analysis algorithms. The following types of completeness are used by process mining algorithms:

• Global completeness (GC) – An event log L is globally complete if and only if the event log contains all possible process behavior.

• Local completeness (LC) – An event log L is locally complete if and only if for any two activities A and B that can follow each other directly, there exists a trace (id, s) ∈ L in which B directly follows A.


Figure 2.4: Process Mining Domain

• Extended local completeness (eLC) – An event log L is extended locally complete if and only if for any two activities A and B that can succeed each other, there exists a trace (id, s) ∈ L in which B happens after A, possibly with some other activities in between.

Note that algorithms using either LC or eLC do not capture long-distance dependencies, i.e., cases where two activities always appear together in the same trace but never follow each other directly.

Some algorithms focus on discovering the most frequent behavior. For this reason, these algorithms require the event log to reflect frequency, either by the number of repetitions of a direct succession (event significance (ES)) or by the number of identical traces from the point of view of the control flow (trace significance (TS)).

Obviously, it is impossible to determine, just by inspecting an event log, which type of completeness the event log satisfies. Therefore, one cannot be certain that the insights derived from the event log conform to reality.


2.3.2 Process Mining Domain

Figure 2.4 summarizes the areas related to process mining. The input of process mining activities is an event log. When available, a textual or formal description of a process, i.e., a process model, can be used as additional input. As depicted in Figure 2.4, we can divide the process mining activities into four categories:

• Pre-processing (filtering) removes "unnecessary" data contained in the event log or modifies the structure of the event log to conform to the requirements of the algorithms that will be applied further on. For example, one may be interested only in the most frequent paths in the process; therefore, all events related to activities occurring in less than, e.g., 10% of the traces are removed. Information from the process model can be used in this step, when the model is available.

• Discovery algorithms extract knowledge in the form of mined models from event logs. Mined models can present information from different perspectives: the control-flow perspective, focusing on the order of the executed activities (e.g., Genetic Miner [20], Alpha Miner [12], Heuristics Miner [110]); the organizational perspective, identifying the responsibilities and relations of the resources involved in the execution of the process (e.g., Social Network Miner [8]); or the data perspective, extracting information related to the data flow (e.g., Decision Miner [93]).

• Performance Analysis algorithms within the process mining domain (e.g., Performance Sequence Analysis [90]) require an event log and possibly a model (i.e., a hand-made or discovered process model). Their goal is to derive performance measurements such as the throughput time or the bottlenecks of processes.

• Conformance Checking algorithms (e.g., [94]) can be divided into two categories: those comparing the difference between the original process model and the mined models, and those assessing how well a given model (original or mined) can reproduce the input event log. These algorithms can discover deviations from the intended behavior of a process or assess the quality of mining algorithms.

In the remainder of this thesis, Process Mining Algorithm (PMA) refers to discovery algorithms for the control-flow perspective of a process.
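The pre-processing category can be illustrated with a short sketch. This is our own code, not part of the thesis; the final-activity check and the 10% threshold follow the examples given in the text.

```python
def filter_log(log, final_activity, min_fraction=0.1):
    """Keep traces ending in `final_activity`; then drop events whose activity
    occurs in less than `min_fraction` of the remaining traces."""
    kept = {cid: seq for cid, seq in log.items()
            if seq and seq[-1] == final_activity}
    counts = {}                       # in how many traces each activity occurs
    for seq in kept.values():
        for a in set(seq):
            counts[a] = counts.get(a, 0) + 1
    frequent = {a for a, c in counts.items() if c / len(kept) >= min_fraction}
    return {cid: [a for a in seq if a in frequent] for cid, seq in kept.items()}

# The event log of Table 2.2; case 7 ends in C and is dropped as incomplete.
log = {1: "ABCEFZ", 2: "ABFCEZ", 3: "ACBEFZ",
       4: "ABCEFZ", 5: "ACEBFZ", 6: "ADZ", 7: "ABC"}
lf = filter_log(log, final_activity="Z")
```

For this log the activity-frequency filter removes nothing further: even D, which occurs in only one of the six remaining traces, is above the 10% threshold.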

2.3.3 PMAs Overview

PMAs focus on discovering process models based on the information from event logs. The algorithms focus on the control-flow perspective, i.e., they capture the partial order relation between different events. Van Dongen et al. [44] classify PMAs into the following categories:


• Abstraction-based algorithms use log abstractions in order to derive process models. The abstraction consists in extracting the basic ordering relations between events. Based on these relations, advanced ordering relations are induced and further used for constructing a process model. Examples of such algorithms are the α-algorithm [12] and the α++-algorithm [111]. These algorithms are very fast, but they are not robust with respect to noise in the event log and they impose a very restrictive requirement of log completeness, i.e., local completeness.

• Heuristics-based algorithms can deal with possible noise in event logs under the assumption that noisy fragments appear with low frequency. These algorithms discover all the common control-flow constructs. An example of such an algorithm is the Heuristics Miner [110].

• Search-based algorithms use search techniques in the space of all possible solutions to find a process model that represents the event log well enough. An example is the GMA [21]. This algorithm handles all the common workflow patterns and is robust to noise.

• Language-based algorithms use the theory of regions to synthesize PNs from the behavior captured in the event logs. These algorithms start from a PN composed only of transitions, each transition corresponding to an event. Then, places are added to restrict the behavior. The ILP-miner [112] uses integer linear programming to find the places that enforce the causal dependencies present in the event log.

• State-discovery algorithms translate each trace into a sequence of states with transitions corresponding to the events, and then combine them into a transition system. We refer to [9, 65] for details on state-discovery algorithms.

Table 2.3 [44] presents an overview of the best-known PMAs, their completeness requirement, the workflow patterns that they are able to discover, and their balance between underfitting and overfitting. An underfitting process model allows "more behavior" than the behavior present in the event log. On the other hand, an overfitting process model does not allow any behavior other than the behavior available in the event log.

2.3.4 Process Mining Experiments

A process mining experiment is a combination of process mining activities belonging to any of the four categories mentioned above. Figure 2.5 presents an example of such an experiment for the example event log in Table 2.2. Since the event log contains no information about resources or data, only process mining algorithms related to the control flow can be applied. In the first step (Filter), we remove process instance 7 because it is an incomplete trace (the last executed activity is C, while the only final activity of the process, according to business analysts, is Z). As a


Algorithm                  Completeness   s   p   c   l   nfc  it  dt   Fitness
α                          LC             +   +   +   ±   -    -   -    overfitting
α++                        eLC            +   +   +   +   -    -   -    overfitting
Heuristics Miner           ES             +   +   +   +   ±    +   -    underfitting
(Duplicate) Genetic Miner  TS             +   +   +   +   +    +   +    balance between overfitting and underfitting
Language Region ILP        unknown        +   +   +   +   +    -   -    underfitting
State Discovery            unknown        +   +   +   +   +    +   +    undefined

Table 2.3: Process discovery algorithms. Legend: Workflow patterns: s - sequence, p - parallel, c - choice, l - loops, nfc - non-free-choice, it - invisible tasks, dt - duplicate tasks; Completeness: LC - local complete, eLC - extended local complete, ES - event significance, TS - trace significance. [44]

result, a filtered event log LF is obtained. Three different algorithms are applied (Genetic Miner, Heuristics Miner and Alpha Miner) and the obtained models (Figure 2.6) are assessed against the event log (Fitness Computation). The model that fits the event log LF best is selected. In parallel with the mining algorithms, the Performance Sequence Analysis is used to identify the main sequence patterns present in the log and to extract information regarding the average throughput time. Figure 2.6 presents the results of the three mining algorithms and their fitness values (computed with the same fitness metric), reflecting how well each model can reproduce the event log.

We observe that GMA retrieves the original model. The other two miners, Alpha Miner and Heuristics Miner, create models with a lower fitness because these algorithms are based on local information from the event log, namely which activities directly precede or directly follow each other. For the event log in Table 2.2, the sequence CF does not appear, and therefore the two algorithms try to enforce that E always happens after C.

2.3.5 ProM 6: the Process Mining Toolkit

ProM 6 is a state-of-the-art Java-based framework supporting the development and usage of PMAs. Currently, ProM contains plug-ins covering multiple domains: control-flow discovery, process model verification, conformance checking, performance analysis, etc. ProM is open source and allows developers to add new algorithms through new plug-ins.


Figure 2.5: Process mining experiment example

The ProM architecture consists of three layers:

A) The Framework layer represents the base of the ProM architecture. This layer coordinates the execution of plug-ins (Plug-ins Manager) and manages the existing objects (Provided Objects Manager) and the possible connections between objects (Connections Manager).

B) The Plug-ins and Models layer contains the classes defining the algorithms and the available object types supported by the framework.

C) The Context layer defines the manner in which users can interact with the framework and the plug-ins. This layer allows developers to define multiple "contexts" for plug-in execution. In this manner, a separation is made between the algorithm logic and the algorithm's interaction with users. The interaction with the user can be done, e.g., through a user interface, using the command line, or remotely. Figure 2.8 shows the contexts available in the current ProM. Note that developers can create new contexts defining, e.g., new object visualizations.

As mentioned above, developers can add their algorithms by creating new plug-ins. Each new plug-in has to specify the context in which it can be executed. For example, a plug-in that contains an algorithm without any user interaction belongs to the Plug-in Context. On the other hand, a plug-in requiring a user interface must be defined only in a graphical user interface context (e.g., UI Context). As seen in Figure 2.8, all contexts are derived from the Plug-in Context.


(a) Genetic Miner Result in terms of Heuristics Net [20, 110]. Fitness Value is 0.98

(b) Heuristics Miner Result in terms of Heuristics Net [20, 110]. Fitness Value is 0.67

(c) Alpha Miner Result in terms of Petri Net [91]. Fitness Value is 0.34

Figure 2.6: Process mining algorithms results


Figure 2.8: ProM Contexts Hierarchy

Therefore any plug-in implemented in the Plug-in Context may be executed in all other contexts.

2.4 Genetic Process Mining Algorithm

Most PMAs [12, 110] use heuristic approaches to retrieve the dependencies between activities based on patterns in the logs. However, these heuristic algorithms fail to capture complex process structures and they are not robust to noise or infrequent behavior. In [20], Alves de Medeiros proposed GMA, which uses genetic operators to overcome these shortcomings. GMA evolves populations of process models towards a process model that fulfills a predefined fitness criterion. The fitness value quantifies the ability of the process model to replay all the behavior observed in the log without allowing additional behavior. An empirical evaluation [21] shows that GMA indeed achieves its goal.

In this section, we present the main concepts of genetic algorithms and describe GMA as introduced in [21] in more detail.

2.4.1 Genetic Algorithms and Genetic Programming

Genetic algorithms (GAs) are search techniques that use genetic operators to find approximate solutions to optimization and search problems. The idea is to find the solution that fits best. The search space is formed by all possible solutions. A GA starts from an initial set of solutions, named the Initial Population. Algorithm 1 shows the main steps of the algorithm. Each individual of the population represents a solution. Starting from the initial population, reproduction operators such as mutation, crossover and selection are applied in order to construct new individuals (ComputeNextPopulation). To each individual a Fitness value is associated.


Input: InputData, StopCondition
Output: Solution

P = ∅ ;                                       // Population
repeat
    if P = ∅ then
        P = BuildInitialPopulation(InputData) ;
    else
        P = ComputeNextPopulation(P, Fitness) ;   // see Algorithm 2
    end
    Fitness = ComputeFitness(P, InputData) ;
until StopCondition(Fitness) == true ;
Solution = bestIndividual(P, Fitness) ;

Algorithm 1: GMA
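The loop of Algorithm 1 maps directly onto a generic GA skeleton. The sketch below is our own illustration with hypothetical helper names; the toy problem maximizes the number of 1-bits in a fixed-length bit string, using binary tournament selection, one-point crossover, and bit-flip mutation.

```python
import random

def genetic_search(build_initial, next_population, fitness_of, stop, max_gen=200):
    """Generic GA loop following the shape of Algorithm 1."""
    population = build_initial()
    for _ in range(max_gen):
        fitness = [fitness_of(ind) for ind in population]
        if stop(fitness):
            break
        population = next_population(population, fitness)
    return max(population, key=fitness_of)        # best individual found

def build_initial(size=20, length=8):
    return [[random.randint(0, 1) for _ in range(length)] for _ in range(size)]

def next_population(pop, fit):
    def pick():                                   # binary tournament selection
        i, j = random.randrange(len(pop)), random.randrange(len(pop))
        return pop[i] if fit[i] >= fit[j] else pop[j]
    new = []
    for _ in range(len(pop)):
        a, b = pick(), pick()
        cut = random.randrange(1, len(a))         # one-point crossover
        child = a[:cut] + b[cut:]
        if random.random() < 0.1:                 # bit-flip mutation
            child[random.randrange(len(child))] ^= 1
        new.append(child)
    return new

random.seed(1)
best = genetic_search(build_initial, next_population, sum,
                      stop=lambda fit: max(fit) == 8)
```

The stop condition here is a desired fitness value, as described in the text; `max_gen` is a safety bound in case that fitness is never reached.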

The best individuals have a higher chance to survive and to reproduce. The algorithm iterates until the Stop Condition is satisfied. Usually, the Stop Condition is expressed in terms of a desired fitness value.

GA individuals are typically represented as bit strings. The crossover operators exchange parts of two individuals. The mutation operator modifies, in a random manner, the bits of an individual. The length of an individual is fixed. One of the challenges of using genetic algorithms is to define how the possible solutions of the problem are encoded into bit strings.

Genetic Programming (GP) [47, 67, 87] evolves graph-based structures in order to create an executable "program" that fulfills the fitness criteria. A GP algorithm follows the steps of a classical genetic algorithm. Initially, GP individuals were trees that represent the execution of a computer program. Currently, GP individuals can be anything that can be represented by a graph structure, e.g., digital circuits [77, 78]. GMA is closely related to GP since the algorithm uses graph-based individuals. In GP algorithms, the next population is computed by swapping subgraphs between individuals (crossover) and by replacing/removing subgraphs of an individual (mutation) [47]. However, GMA uses a different procedure, more similar to that of GAs: each activity is considered as a string, and crossover and mutation operations are done by swapping/replacing/removing parts of the activity code (see Subsection 2.4.4 for details).

2.4.2 Building the Initial Population

GMA as defined in [20, 21] follows the usual steps of the classical genetic algorithm [55]. The first step of GMA is the construction of the initial population. Individuals are causal matrices with the set of activities A representing the activities from the event log. GMA uses a heuristic method in order to ensure a faster convergence of the algorithm. The heuristics are based on the construction of a dependency matrix, which measures how strongly an activity depends on another activity. The idea is to count how often the patterns AB or ABA, with A, B ∈ A, occur in the event log.


       A       B       C       D       E       F       Z
A      0       0.75    0.66    0.5     0       0       0
B     -0.75    0       0.25    0       0       0.66    0
C     -0.66   -0.25    0       0       0.8    -0.5     0
D     -0.5     0       0       0       0       0       0.5
E      0       0      -0.8     0       0       0.75    0.5
F      0      -0.66    0.5     0      -0.75    0       0.8
Z      0       0       0      -0.5   -0.5    -0.8     0

Table 2.4: Dependency relation matrix

Based on these measurements, the dependency matrix is built as follows:

D(A, B) =
    (#ABA + #BAB) / (#ABA + #BAB + 1)    if A ≠ B and #ABA > 0
    (#AB − #BA) / (#AB + #BA + 1)        if A ≠ B and #ABA = 0
    #AB / (#AB + 1)                      if A = B

where #s, with s ∈ A*, is the number of occurrences of the string s in the event log. The dependency relation distinguishes between two-activity loops (where two activities are iterated), parallel activities, and self-loops (the same activity is executed multiple times sequentially). The "+1" in the denominator favors more frequent patterns over less frequent ones. For example, if the pattern AA is seen only once, the dependency value is D(A, A) = 0.5, and if it occurs 100 times then D(A, A) ≈ 0.99.

Table 2.4 presents the dependency matrix for the example event log (see Table 2.2). A positive value of D(a, b) suggests that b follows a, while a negative value suggests that a follows b. A value equal to 0 is interpreted as the absence of a relation between the two activities. In Table 2.4, we observe that activity A has no input activities (i.e., the values in its row are all non-negative and the values in its column are all non-positive), and A is the only such activity. Thus A is the start activity, consistent with the traces of the event log. Similarly, activity Z is the final activity.

Based on the dependency matrix values, the initial population is built: the higher the dependency value, the higher the probability that a causality relation between the two activities is established. In this manner, an input set and an output set of activities are chosen for each activity. In order to construct the input/output conditions, the activities of the input/output sets are randomly distributed over subsets; note that these subsets need not be disjoint. Figure 2.9 shows a possible initial population of four individuals built from the dependency matrix in Table 2.4. In none of the individuals is activity D connected to any of the activities B, C, E or F, because the corresponding dependency matrix values are all 0. For all individuals the start activity is


Figure 2.9: Initial Population — (a) Individual 1, (b) Individual 2, (c) Individual 3, (d) Individual 4

A (highlighted by a light gray background) and the final activity is Z (highlighted by a dark gray background).
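The construction of one individual can be sketched as follows. This is a deliberate simplification of the procedure in [21, 20] (the actual GMA additionally uses a power parameter and specific rules for partitioning the input/output sets into condition subsets), and all names are ours:

```python
import random

def sample_causal_relations(activities, D, rng):
    """Include a causal relation a -> b with probability equal to the
    positive dependency value D[a][b] (simplified selection rule)."""
    return {(a, b) for a in activities for b in activities
            if D[a][b] > 0 and rng.random() < D[a][b]}

def io_sets(activities, causal):
    """Derive each activity's input and output activity sets from the
    sampled causal relations; in GMA these sets are then randomly
    partitioned into (possibly overlapping) subsets to form the
    input/output conditions of the causal matrix."""
    inputs = {a: {x for (x, y) in causal if y == a} for a in activities}
    outputs = {a: {y for (x, y) in causal if x == a} for a in activities}
    return inputs, outputs
```

Because each individual is sampled independently, repeating this procedure with different random draws yields a diverse initial population, such as the four individuals of Figure 2.9.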

2.4.3 Fitness Computation

When new individuals are created, their fitness is computed. The fitness value reflects how well an individual represents the behavior in the event log with respect to completeness, which measures the ability of the individual to replay the traces from the log, and preciseness, which quantifies the degree to which the individual underfits the event log. An individual is complete when it can parse (i.e., reproduce) all the event traces in the log, and it is precise when it cannot parse traces that are not observed in the log. The fitness computation rewards an individual for correctly parsed events and penalizes it for incorrectly parsed events, as well as for extra behavior that the other individuals do not cover.
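The overall shape of such a fitness function can be sketched as below. The weighting constant and the exact penalty term here are illustrative assumptions, not the definitions used in [21, 20]:

```python
def fitness(parsed_events, total_events, extra_enabled, total_enabled,
            kappa=0.025):
    """Illustrative fitness shape: completeness (fraction of correctly
    parsed events) minus a small penalty for extra enabled behavior,
    a proxy for lack of preciseness. kappa is an assumed weighting
    constant, not the thesis value."""
    completeness = parsed_events / total_events
    penalty = extra_enabled / max(total_enabled, 1)
    return completeness - kappa * penalty
```

An individual that parses every event and enables nothing beyond the log would score 1.0 under this sketch, while parsing errors and extra enabled behavior both pull the score down.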

The fitness of an individual is computed by parsing the traces of the event log. Figure 2.10 presents the parsing of two traces from the event log of Table 2.2 by the same individual. During the parsing, we use the following four counters:
