Lossless compression of event logs to construct process models

Layout: typeset by the author using LaTeX.

Lossless compression of event logs to construct process models

Karst J. Backer
11700173

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dr. G. Sileno
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 26th, 2020

Contents

1 Introduction
2 Background
   2.1 Event logs and Process models
      2.1.1 Primitive operators
   2.2 Precision and Recall for process models
   2.3 Complexity of process models
3 Method
   3.1 General architecture for a lossless extension
   3.2 Lossy compression algorithm
   3.3 Extending operators
   3.4 Filling the loss in lossy compressions
   3.5 Further simplification
   3.6 Computing complexity
      3.6.1 Symbol series representation
      3.6.2 Scope
      3.6.3 Encoding the symbol series
      3.6.4 Calculating MDL without encoding
4 Results
   4.1 Generating results
5 Discussion

Abstract

Compression is generally necessary for storing and analyzing large amounts of data, such as the event logs recorded by business processes of organizations. A compressed output strives to describe most of the data (a criterion functionally equivalent to high recall), not much more than the data (high precision), without being too large (low complexity). This paper investigates how precision, recall and complexity interact when extending a lossy compression (a process model obtained by process mining algorithms) to a lossless one. The proposed extension is obtained by a two-step method, introducing new operators for handling trace logs. We also introduce a method for measuring the complexity of process models based on minimum description length. A recent public data set, divided sequentially into smaller fractions, is used as input to generate process models using a specific lossy compression algorithm. These outputs are then compared to their lossless extensions in terms of precision, recall and complexity. The results show that the analysis can indeed serve to make hypotheses about the nature of the models and of the data set.


Chapter 1

Introduction

In today’s digital world, events such as business transactions, user behavior, hardware events and more are recorded in the form of event logs at all system levels, from business processes to infrastructure, under the assumption that something can be learned from observational data. To store and analyze the copious amounts of data generated today, compression, in implicit or explicit forms, often becomes a necessary step in the data-processing pipeline.

Various algorithms exist that can compress an event log into a process model (process mining algorithms), [1, 3, 5, 6] to name a few, which may each produce a different process model. Analyzing these process models can lead to insights into the processes that were recorded, potentially allowing for improvements to the system that executed those processes. In this context, having many algorithms to choose from can be helpful when analyzing an event log [11], because the different process models these algorithms generate may lead to different discoveries about the system that executed the processes.

While all these algorithms produce different and possibly useful process models, no current algorithm can guarantee that the process model it produces will perfectly match the event log. There may exist traces that match the process model but are not in the log (what might be called imperfect precision), or traces from the log that do not match the process model (imperfect recall). Because information is lost, compressing the event log in this way is a form of lossy compression. Dually, when a process model has both perfect recall and perfect precision, it is a lossless compression of its event log.

The aim of this research is to introduce an algorithm that can extend a lossy compression algorithm to produce a lossless compression, and to use this to answer the question: "How do a compression’s precision, recall and complexity interact when extending a lossy compression algorithm to obtain a lossless compression algorithm?" In other words, we want to experiment with ways of interacting with process models (and process model construction algorithms) in order to enable, in the longer term, a more fine-grained tuning of the process modeling phase. An intuitive expectation is that the lossless algorithm will be more complex but have perfect precision and recall, while the lossy algorithms will be less complex but have lower precision and recall.

In this paper we will discuss how we designed an algorithm to extend a lossy compression algorithm into a lossless compression algorithm. We consider what operators are necessary to do so and how we will measure the precision, recall and complexity of their results. This will be done in order to inspect the changes in these measurements when a lossy process model is extended to a lossless one.


Chapter 2

Background

2.1 Event logs and Process models

All the events that belong to a single process can be grouped in a trace, ordered from earliest to latest. All of a system’s traces can be stored in an event log, which can contain anywhere from fewer than ten to millions of traces, which can make it impossible to analyze them individually. Instead, a process model can be created to represent (and summarize) the event log, both for further computational treatment and to support a human user analysing the system’s actual behaviour.

Process models are abstract tree structures that were first described in 1967 by S. Williams in the paper "Business Process Modelling Improves Administrative Control" [12]. Process models consist of events and operators. Events are the parts that make up a trace and are conventionally represented by a unique letter for every unique event. An exception is the silent event τ, which is not recorded by the system generating the event log, meaning that it does not appear in any trace. It is, however, necessary for the functioning of some models.

Operators have arguments, which can be events or other operators. Operators can be executed, meaning that their arguments are used to construct their results, which is the set of traces that operator can produce. The method with which an operator’s results are constructed (the operation) depends on the operator’s type. A trace is said to match a process model if that trace can be generated by that model.

2.1.1 Primitive operators

There are typically four operations in basic process models, whose names, symbols and descriptions can be found in table 2.1. All these operators need at least two arguments to perform a functioning, non-trivial operation. Several examples demonstrating the execution of each operator type can be seen in table 2.2.

Operation                               Symbol   Description
Sequential composition                  +        All arguments must be executed sequentially, from left to right.
Exclusive choice / Alternative choice   ⊕        Exactly one argument is executed.
Parallel composition                    &        All arguments must be executed exactly once, in any order.
Loop                                    ⟲        The first argument of the loop is called the body; the other arguments are redo arguments. The loop must execute the body first. It can then execute a redo argument followed by the body any number of times.

Table 2.1: Names, symbols and explanations for the functioning of all operators that occur in process models.

Model            Generated traces
+(a, b, c)       {abc}
+(a, ⊕(b, c))    {ab, ac}
&(a, b)          {ab, ba}
⟲(a, b)          {a, aba, ababa, . . . }

Table 2.2: Examples demonstrating the execution of each primitive operator type. The letters a, b and c each represent a unique event.

2.2 Precision and Recall for process models

No standard definition of precision has been adopted in the process modeling community [7], so we need to select the most appropriate measures for our purposes.

For lossless compression, perfect precision means that every trace that can be produced by a model is in the log. This implies that any model that has a loop can never have perfect precision, as it can produce an infinite number of traces. The "soundness" measure introduced by [4] fits this description:

soundness(WS∨, LP) = |{s | s ∈ LP ∧ s ⊨ WS∨}| / |{s | s ⊨ WS∨}|

where WS∨ is the "disjunctive workflow model", that is, the set of every trace that can be produced by the relevant process model, and LP is the set of every trace in the event log. This formula effectively means: the number of unique generated traces that appear in the log divided by the number of unique generated traces.

Since loops can generate infinitely many traces, not every generated trace can be enumerated. To solve this, we propose a small change to the soundness measure: when calculating WS∨, loops cannot loop more times than the length of the longest trace in the log. This way, the precision of models containing a loop does not become vanishingly small or incomputable.

The same paper [4] also describes a measure that can be used for recall, called completeness, and defined as:

completeness(WS∨, LP) = |{s | s ∈ LP ∧ s ⊨ WS∨}| / |{s | s ∈ LP}|

Informally: the number of unique generated traces that appear in the log divided by the number of unique traces in the log. Perfect recall means that every trace that is in the log can be produced by the model. Since no term in the fraction can be infinite, a model with a loop can still have perfect completeness.

In this paper, we will refer to soundness and completeness as precision and recall, but keep in mind that these are not the only possible measures of precision and recall.
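To make the two measures concrete, the following sketch (not code from the thesis; the trace and log representations are assumptions) computes soundness and completeness for a finite, loop-bounded set of generated traces and an event log, and uses them as precision and recall.

```python
# Hedged sketch: traces are tuples of event labels, "generated" is the finite
# (loop-bounded) set of traces a model can produce, "log" is the list of traces
# in the event log. Both measures compare unique traces only.
def precision(generated, log):
    """Soundness: unique generated traces that appear in the log / unique generated traces."""
    log_traces = set(log)
    return len(generated & log_traces) / len(generated) if generated else 0.0

def recall(generated, log):
    """Completeness: unique generated traces that appear in the log / unique log traces."""
    log_traces = set(log)
    return len(generated & log_traces) / len(log_traces) if log_traces else 0.0

# Example: a model generating {ab, ac} for the log [ab, ab, ad].
generated = {("a", "b"), ("a", "c")}
log = [("a", "b"), ("a", "b"), ("a", "d")]
print(precision(generated, log))  # 0.5: of the generated traces, only ab is in the log
print(recall(generated, log))     # 0.5: of the unique log traces, ad is not generated
```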

2.3 Complexity of process models

A third criterion relevant to evaluating a model is complexity, intuitively capturing the amount of information necessary to store the model. There is no standard definition of complexity available in the literature. However, the paper "Using Minimum Description Length for Process Mining" [2] introduces a method to measure the complexity of the Petri nets used to specify process models, relying on minimum description length (MDL).

This is done in several steps, the first of which is to represent all the information necessary to reconstruct the Petri net as a sequence of numbers, separated by commas and brackets. If this sequence of numbers were directly encoded to binary, a parser (a human or machine reading the encoding) would be unable to tell where each number starts and stops, since there are no commas and brackets in binary. To solve this, the parameter(s) indicating the length of each type of number are explicitly given at the start of the encoding. The final encoding is the concatenation of the parameters and all the numbers in the sequence, in binary. The length of this whole string of bits is the minimum description length, which is an objective measure of the complexity of the encoded model.

Finally, [2] constructs a formula to calculate the length of the encoding using only information about the Petri net, so that the actual encoding process does not need to be performed.

In the following chapter we propose a mapping of their method to calculate the MDL of process models.


Chapter 3

Method

3.1 General architecture for a lossless extension

In order to answer our research question, we need to find a systematic way to extend a lossy compressor (which generates the process model) such that the overall encoding is lossless.

The paper "Kolmogorov’s structure functions and model selection" [9] proposes a method to describe the input data perfectly (therefore a lossless compression) yet more compactly. The method relies on identifying a theory that describes most of that data, together with an exact description of the remainder of the data. The method followed in this project stems from the intuition that a process model generated by a lossy process mining algorithm can act as the "theory" that most traces in the log match. The remainder’s description must contain information about (1) which traces should not have been produced by the model and (2) which traces should have been produced but were not. The theory combined with this description forms the lossless extension of the lossy compression. This is a general architecture and can in principle be applied to extend any lossy compression algorithm.

3.2 Lossy compression algorithm

In order to apply an algorithm that losslessly extends a lossy compression algorithm, that lossy algorithm is first needed. For this research project, we selected the algorithm called "inductive miner directly-follows" (IMd) from the paper "Scalable process discovery and conformance checking" [5], because it is able to handle very large logs, which should allow the losslessly extended algorithm to handle larger logs as well.


Figure 3.1: Cut characteristics, image from [5].

IMd constructs process models using a directly-follows graph (or DFG for short). A DFG contains a node for each unique event in the event log. A connection from node A to node B exists if there is a trace in the log where event A is directly followed by event B.
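As an illustration, a directly-follows graph can be derived from an event log in a few lines of code; the sketch below is an assumed reconstruction of the idea for this paper, not the IMd implementation.

```python
# Hedged sketch of DFG construction: one node per unique event, and an edge
# A -> B whenever some trace contains A immediately followed by B.
from collections import defaultdict

def build_dfg(log):
    dfg = defaultdict(set)
    for trace in log:
        for event in trace:
            dfg[event]                  # make sure every event gets a node
        for a, b in zip(trace, trace[1:]):
            dfg[a].add(b)               # A is directly followed by B in this trace
    return {node: sorted(successors) for node, successors in dfg.items()}

# Example: the log [abc, acb] yields edges a->b, b->c, a->c and c->b.
print(build_dfg([("a", "b", "c"), ("a", "c", "b")]))
# {'a': ['b', 'c'], 'b': ['c'], 'c': ['b']}
```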

To construct a process model, IMd attempts to find what it calls a cut in the DFG. A cut is a pattern that matches the behavior of one of the four operators of an unextended process model. Figure 3.1 shows how clusters of events (gray areas) need to connect to each other in order to match an operator’s behavior. If a cut is found, the algorithm is repeated on each cluster until a cluster consists of only one event, in which case that event is returned, or until no cut can be found. The algorithm returns a model whose root is the operator whose pattern was matched, with arguments equal to the results of the algorithm repeated on the clusters.

If IMd cannot find a cut that perfectly matches one of the patterns, then the flower model is used instead. The flower model is a loop with the silent event as its body and each event in the cluster as a redo, which effectively allows any behavior over its events. As an example, the flower model of the events a, b and c would be ⟲(τ, a, b, c).

3.3 Extending operators

Using lossy process models to create lossless process models encounters two main problems. The first is that the loop operator can loop indefinitely, producing a theoretically infinite number of traces. This can never perfectly match a log, as no event log is infinitely long. To solve this, the iteration operator is introduced: an operator that executes its arguments a specified number of times (see table 3.1).

The second problem is that lossy process models often generate traces that are not in their log. The option to produce these traces needs to be removed to achieve lossless compression. This is achieved by a subtraction operator, which removes traces from its first argument’s results and can also be found in table 3.1. Examples of the execution of these operators can be seen in table 3.2.

Operation     Symbol   Description
Iteration     *        The first argument must represent an integer greater than or equal to zero. The operator then executes any of its other arguments a number of times equal to that integer.
Subtraction   -        Returns the result of the first argument, with all the results of the other arguments removed from it.

Table 3.1: Names, symbols and explanations for the functioning of all operators that are necessary to extend process models to perform lossless compression.

Model                    Generated traces
*(2, a)                  {aa}
*(2, a, b)               {aa, ab, ba, bb}
−(⊕(a, b), a)            {b}
−(*(2, a, b), +(a, b))   {aa, ba, bb}

Table 3.2: Examples demonstrating the execution of each extending operator type. The letters a and b each represent a unique event.
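The semantics of the two new operators can be made concrete on sets of traces (tuples of event labels); the sketch below is an assumed formulation for illustration, not code from the thesis.

```python
# Hedged sketch of the extending operators' semantics on sets of traces.
from itertools import product

def iterate(n, *arg_sets):
    """*(n, args...): concatenate n traces, each drawn from any of the arguments."""
    choices = set().union(*arg_sets)
    if n == 0:
        return {()}                                    # only the empty (silent) trace
    return {sum(combo, ()) for combo in product(choices, repeat=n)}

def subtract(first, *other_sets):
    """-(first, others...): the first argument's traces minus all the others' traces."""
    removed = set().union(*other_sets) if other_sets else set()
    return set(first) - removed

a, b = {("a",)}, {("b",)}
print(iterate(2, a, b))                                # {aa, ab, ba, bb}, as in table 3.2
print(subtract(iterate(2, a, b), {("a", "b")}))        # {aa, ba, bb}
```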

3.4 Filling the loss in lossy compressions

Now that all the process model extensions are defined, the method to achieve lossless compression can be discussed. This method is based on the two-step model of Kolmogorov’s structure function theory [9], which is to have a theory that fits most of the data and specific corrections to the theory to fit the rest of the data.

A lossy compression algorithm is used to generate a process model that matches most of the log (a theory that fits most of the data). If there are loops in that model, they would be able to create an infinite number of unique traces, which would require infinite corrections to perfectly fit the log. Since no method is known to construct a loop that makes exactly these corrections, we opt to alter the process model itself instead.

The behavior of a loop can be described as follows: The traces that a loop can generate are the concatenation of the body and any number of redo+body pairs. This can be recreated without a loop, using the following structure:

+(body, *(⊕(0, 1, . . .), +(⊕(redo1, redo2, . . .), body)))

At this point, the possible numbers of iterations in ⊕(0, 1, . . .) are still infinite, while only the numbers that can produce a valid trace are wanted. In order to find which numbers of iterations are valid for each iterator, a set per iterator is produced with the following method: if looping the iterator N times allows the process model to produce a trace that is in the log, then N is in the set of that iterator. Each iterator’s argument for the number of iterations it can perform is then made equal to the alternative choice of its set.

Note that this changes the meaning of the process model, unlike the step before it, which only changed the structure of the model. The iterator can no longer iterate any number of times, but only a number of times that can lead to valid traces. Now that the model can only generate a finite number of traces, those traces are compared to the event log that is being compressed. Any generated traces that are not in the log form the "minus" set. Similarly, any traces in the log that are not generated by the model form the "plus" set.

The plus and minus sets can be viewed as new logs, which are then recursively compressed by this algorithm. Recursion ends when these logs are incompressible. The results of these compressions are then added to the existing model with the − and ⊕ operators for the minus and plus set’s compressions respectively, giving the following structure:

⊕(−(model, compress(minus set)), compress(plus set))

where compress() is the lossless compression function described in this section. We will call models of this shape extended compressed models.

A log is considered incompressible if either the minus or the plus set contains more traces than the log itself. This means that applying the corrections to the lossy model would be more work than describing the log directly. This direct description is done using the explicit model: each trace in the log is represented as the sequential composition of its events, and the explicit model is the alternative choice of these sequential compositions. For example, the log [abc, cab, bca] would be converted to the explicit model ⊕(+(a, b, c), +(c, a, b), +(b, c, a)).
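The following sketch summarizes this procedure. It is a hedged outline under assumptions, not the thesis implementation: models are assumed to be nested tuples such as ("+", ("event", "a"), ("event", "b")), and the helpers lossy_miner (any lossy discovery algorithm, e.g. IMd) and generated_traces (which enumerates a model's traces after the loop-to-iterator rewrite and bounding step above) are assumed to exist rather than defined here.

```python
# Hedged sketch of the lossless extension of section 3.4 (assumed helpers and
# model representation; recursion and bookkeeping simplified for illustration).
def explicit_model(log):
    """⊕ of the sequential composition of each unique trace in the log."""
    seqs = [("+",) + tuple(("event", e) for e in trace) for trace in sorted(set(log))]
    return seqs[0] if len(seqs) == 1 else ("⊕",) + tuple(seqs)

def lossless_compress(log, lossy_miner, generated_traces):
    log_traces = set(log)
    model = lossy_miner(log)                     # the "theory" fitting most of the data
    produced = generated_traces(model, log)      # finite after bounding the iterators
    minus = produced - log_traces                # produced by the model, not in the log
    plus = log_traces - produced                 # in the log, not produced by the model
    if len(minus) > len(log_traces) or len(plus) > len(log_traces):
        return explicit_model(log)               # incompressible: describe the log directly
    if minus:                                    # remove the excess traces with -
        model = ("-", model, lossless_compress(sorted(minus), lossy_miner, generated_traces))
    if plus:                                     # add the missing traces with ⊕
        model = ("⊕", model, lossless_compress(sorted(plus), lossy_miner, generated_traces))
    return model
```

A final Simplify pass (section 3.5) would then remove any redundancy this construction introduces.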

3.5 Further simplification

This algorithm may create a process model with some redundancy. In order to reduce this needless complexity, an algorithm to remove redundancy called Simplify is introduced here, which is used as the final step of the lossless extension.

Each operator type must be handled differently, but most of them go through the same sorts of processes. For instance, every operator except ⟲ and * can have one argument, but performs no useful operation in that case. Simplify will then return the simplification of that argument, thus removing the operator entirely.

Furthermore, there are cases where an operator inside another operator is redundant due to the function of the operator it is in. In such cases, Simplify replaces the redundant operator with the simplifications of that operator’s arguments. All needless operators in operators can be seen in table 3.3, along with an example of their simplification.

Redundant operator in useful operator   Removal of redundancy example
+ in +                                  +(+(a, b), c) becomes +(a, b, c)
⊕ in ⊕                                  ⊕(⊕(a, b), c) becomes ⊕(a, b, c)
⊕ in the redo of ⟲                      ⟲(a, ⊕(b, c)) becomes ⟲(a, b, c)
⊕ in the redo of *                      *(2, ⊕(b, c)) becomes *(2, b, c)
⊕ in the subtraction of −               −(a, ⊕(a, b)) becomes −(a, a, b)

Table 3.3: All types of needless operators in operators, with an example showing how each is simplified.

If an iterator can only loop zero times, it cannot output anything but the silent event, so it can be replaced by the silent event. If an iterator can only loop one time, it can be replaced by a ⊕ whose arguments are the redo arguments of the iterator.

If a silent event appears in a sequential composition or a parallel composition, it will never affect any trace and should be removed. This is done before the step that checks whether the operator has one argument, to allow these simplifications to work in conjunction.
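A partial sketch of Simplify, reusing the nested-tuple model representation assumed in the sketch of section 3.4, is given below. It covers only some of the rules (flattening + in + and ⊕ in ⊕, dropping silent events from + and &, and removing one-argument operators), so it is an illustration rather than the full pass.

```python
# Hedged, partial sketch of the Simplify pass (a subset of the rules above).
SILENT = ("event", "τ")

def simplify(model):
    kind = model[0]
    if kind in ("event", "int"):
        return model
    args = [simplify(arg) for arg in model[1:]]
    if kind in ("+", "&"):                       # silent events never affect a trace here
        args = [a for a in args if a != SILENT]
    if kind in ("+", "⊕"):                       # + in +, ⊕ in ⊕: splice the children in
        args = [c for a in args for c in (a[1:] if a[0] == kind else (a,))]
    if kind in ("+", "⊕", "&", "-") and len(args) == 1:
        return args[0]                           # a one-argument operator is redundant
    return (kind,) + tuple(args)

m = ("+", SILENT, ("*", ("int", 2), ("+", ("⊕", ("event", "a"), ("event", "b")), SILENT)))
print(simplify(m))
# ('*', ('int', 2), ('⊕', ('event', 'a'), ('event', 'b')));
# the full pass would also splice the ⊕ into the redo of * (table 3.3), giving *(2, a, b).
```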

As an example, let a lossy compression algorithm have generated the model ⟲(τ, a, b) for the log [aa, ab, ba, bb]. The second step of the lossless extension would convert this to the iterator-based model +(τ, *(2, +(⊕(a, b), τ))).

Simplify starts at the root operator and finds the silent event in a +, so it removes it. The + then has one argument, so the simplification of *(2, +(⊕(a, b), τ)) will be the new result. The iterator’s first argument is not 0 or 1 and it has more than one argument, so this operator is not redundant.

Simplify is then applied to the arguments of the iterator. The 2 cannot be simplified, so it stays the same. +(⊕(a, b), τ) contains a τ, and after removing it there is only one argument left. This step thus returns the simplification of ⊕(a, b), which cannot be simplified further, so it is returned as-is.

This brings the algorithm back to *(2, ⊕(a, b)). A ⊕ in the redo of * is redundant, so it is replaced by the arguments of the ⊕, resulting in *(2, a, b), which cannot be simplified further.

Note that none of these steps change the meaning of the model, only its complexity.

3.6 Computing complexity

In order to have a measure of the complexity of the model, we refer to Kolmogorov complexity [10] and minimum description length. A series of further steps need to be clarified in order to compute this value. First, the model is converted into a series of numbers, which is then encoded to a binary string. That string’s length is the minimum description length, which is the measure of complexity we will use. A method to calculate the minimum description length without performing the encoding is then derived.

3.6.1 Symbol series representation

An encoding of an extended process model needs to store data about which operators it has, what their scopes are, and which integers and events appear where in the model.

In order to store where operators, events and integers are, their values (the operators’ symbols, the events’ letters and the integers’ values) can be written down in the order that a human would read them. In other words, it is the collapsed version of the tree structure of the process model.

For example, the process model +(b, ⊕(a, b), a) would collapse into the symbol series +, b, ⊕, a, b, a. However, this series can be interpreted as +(b, ⊕(a), b, a), +(b, ⊕(a, b), a) or +(b, ⊕(a, b, a)). Clearly, the scope of the operators needs to be clarified.

3.6.2 Scope

In order to capture the scopes of the operators, only the number of arguments each operator has needs to be stored. The "number of arguments" argument (the meta-argument) is stored directly after its respective operator. As an example, the series "+, 3 arguments, b, ⊕, 2 arguments, a, b, a" can only be interpreted as +(b, ⊕(a, b), a).

Since all objects that are not the root must be arguments of an operator, the symbol series representation can be simplified further. This is done by removing the argument count from the root operator: it simply takes all the arguments that are not required by other operators, until the symbol series ends.

Returning to the previous example, the symbol series +, b, ⊕, 2 arguments, a, b, a must represent the process model +(b, ⊕(a, b), a), because the final "a" must be an argument of the root operator, since the ⊕ already has its two arguments.
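As a small illustration, collapsing a model into this symbol series with meta-arguments can be done recursively; the sketch below assumes the nested-tuple model representation used in the earlier sketches and is not code from the thesis.

```python
# Hedged sketch: flatten a model into (type, value) pairs, where each non-root
# operator carries its argument count and the root operator carries None.
def flatten(model, is_root=True):
    kind = model[0]
    if kind in ("event", "int"):
        return [(kind, model[1])]
    series = [(kind, None if is_root else len(model) - 1)]
    for child in model[1:]:
        series += flatten(child, is_root=False)
    return series

model = ("+", ("event", "b"), ("⊕", ("event", "a"), ("event", "b")), ("event", "a"))
print(flatten(model))
# [('+', None), ('event', 'b'), ('⊕', 2), ('event', 'a'), ('event', 'b'), ('event', 'a')]
```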

3.6.3 Encoding the symbol series

In order to encode a symbol series into a bit string, the symbol series first needs to be represented as a series of numbers. In order to distinguish events, integers and the different operators, an integer representing the data type is stored with the necessary data. The data types can be seen in table 3.4. Which number refers to which type has no effect on the minimum description length, as the type always takes 3 bits to encode when there are eight types. This order was chosen simply because it is relatively easy to remember.

Type    Encoding   Binary
event      0        000
int        1        001
+          2        010
⊕          3        011
&          4        100
⟲          5        101
*          6        110
−          7        111

Table 3.4: Arbitrarily chosen encodings for the type of each symbol that can appear in an extended process model.

If the data types are applied to the example symbol series (+, b, ⊕, 2 arguments, a, b, a), it becomes (2)(0, b)(3, 2 arguments)(0, a)(0, b)(0, a), where the first number of each group indicates the type (2 = +, 0 = event, 3 = ⊕) and the rest of the group is that symbol’s value.

Next, the values of these types must be encoded to a number. The values each have some complications, which will be discussed now.

First, the letters in a process model represent an event. That event could be represented by any letter, and the model would still function all the same. Because of this, space can be saved by treating each model as though its letters are as close to the start of the alphabet as possible. For instance, the model +(b, f, c, f ) would be treated as though it was the model +(a, c, b, c).

Second, the silent event must be encodable as well as the regular events. To achieve this, the silent event is encoded as 0 and each unique letter is represented by a unique number, starting at 1.

Third, negative integers are unnecessary for the iteration operator, meaning that integers can be stored as unsigned integers to save space.

Finally, the number of arguments of a non-trivial operator is always at least two. By not allowing trivial operators, more space can be saved in the encoding. An operator needing two arguments can then be encoded as 0, three arguments as 1, and so on.

Applying these techniques to our recurring example (2)(0, b)(3, 2 arguments)(0, a)(0, b)(0, a) yields

(2)(0, 2)(3, 0)(0, 1)(0, 2)(0, 1)

where a is encoded as 1, b as 2, and "2 arguments" as 0.

In order to turn this sequence of numbers into binary, a parser will need to know how many bits each value takes in order to keep the numbers separated, as there are no brackets and commas in binary. n bits can express 2^n options, but we want to know how many bits are required to describe at least x options, so:

2^n ≥ x
n ≥ log2(x)

Since we can only have whole bits and cannot describe x options in fewer than n bits, the ceiling of log2(x) is taken to find the number of bits required to describe at least x options, resulting in the formula:

n = ⌈log2(x)⌉

In the case of the types, extended process models always have 8 types, which requires ⌈log2(8)⌉ = 3 bits to encode. The resulting binary encoding can be seen in table 3.4.

The number of options for the events, integers and numbers of arguments can vary, depending on the model being encoded. Using a fixed number of bits for these could both waste space (when not all options are necessary) and limit which models can be encoded (when not enough options are available).

Instead, the smallest number of bits necessary for the events, integers and numbers of arguments can be used, provided the encoding explicitly tells the parser at the start how many bits each of these uses. This cannot itself be done in plain binary, or the parser would not be able to tell where each of these numbers ends.

Instead, unary is used to encode these bit counts. The value of a unary number is equal to the number of ones in it, read from left to right; a zero indicates the end of the number. Three unary numbers are placed at the start of the encoding to tell the parser how many bits are needed for an event, for an integer and for an operator’s number of arguments, in that order. The order is a mostly arbitrary choice so that the parser knows which number is which, with the exception of the number of bits for an event being first. That number is first because it is zero if and only if every event in the model is the silent event, which allows the encoding to avoid leading zeroes for most models, which can help prevent confusion. The unary numbers use a little extra space to avoid the problems caused by a fixed number of bits.

When applying this to our example number sequence (2)(0, 2)(3, 0)(0, 1)(0, 2)(0, 1), the number of unique events, the largest integer and the largest number of arguments in a non-root operator are checked first.

There are 3 unique events (silent, a and b), which requires ⌈log2(3)⌉ = 2 bits. There are no integers, so the largest integer defaults to zero, requiring 0 bits to encode.

The largest number of arguments is 2. Because of the assumption that no operator has fewer than two arguments, this requires 0 bits to encode.

The unary numbers 2, 0 and 0 are put at the front of the number sequence and the sequence itself is converted to binary, resulting in:

unary header: 110 (2 bits per event), 0 (0 bits per integer), 0 (0 bits per argument count)
symbols: 010 (+), 000 10 (event b), 011 (⊕, 2 arguments), 000 01 (event a), 000 10 (event b), 000 01 (event a)

Concatenated, this gives 1100001000010011000010001000001, for a minimum description length of 31.

3.6.4 Calculating MDL without encoding

It is not necessary to perform the encoding to obtain its length; the length can simply be calculated. The length of the encoding is equal to the number of bits needed to encode each type of value multiplied by the number of times that type of value is used, plus the number of bits per type (3) times the number of symbols, plus the length of the unary numbers.

First, the numbers of bits needed to encode the values of each type are needed.

Let Me be the number of unique events in the model, always including the silent event. It then takes ⌈log2(Me)⌉ bits to encode an event.

Let Mi be the maximum integer that appears in the model, or zero if there are none. Since zero is the first option, one is the second option and so on, one needs to be added to Mi to get the number of options that an integer needs, meaning that it takes ⌈log2(Mi + 1)⌉ bits to encode an integer.

Let Mo be the maximum number of arguments that an operator in the model has, excluding the root operator. Because two is the first option, it takes ⌈log2(Mo − 1)⌉ bits to encode a number of arguments.

Let Ne, Ni and No be the total numbers of events, integers and operators in the model, respectively. The MDL of that model is then:

events: Ne · ⌈log2(Me)⌉
+ integers: Ni · ⌈log2(Mi + 1)⌉
+ operators excluding the root: (No − 1) · ⌈log2(Mo − 1)⌉
+ types: 3 · (Ne + Ni + No)
+ length of the unary numbers

The length of the unary numbers is equal to the sum of the numbers of bits needed to encode the values of each type, plus the length of the three terminating zeroes. This leaves us with the function:

MDL(model) = Ne⌈log2(Me)⌉ + Ni⌈log2(Mi + 1)⌉ + (No − 1)⌈log2(Mo − 1)⌉ + 3(Ne + Ni + No) + ⌈log2(Me)⌉ + ⌈log2(Mi + 1)⌉ + ⌈log2(Mo − 1)⌉ + 3

which can be simplified to:

MDL(model) = (Ne + 1)⌈log2(Me)⌉ + (Ni + 1)⌈log2(Mi + 1)⌉ + No⌈log2(Mo − 1)⌉ + 3(Ne + Ni + No + 1)

To test this function, it is executed on the model +(b, ⊕(a, b), a) to see if an MDL of 31 is found.

MDL(+(b, ⊕(a, b), a)) =
(4 + 1)⌈log2(3)⌉ + (0 + 1)⌈log2(0 + 1)⌉ + 2⌈log2(2 − 1)⌉ + 3(4 + 0 + 2 + 1) =
5 · 2 + 1 · 0 + 2 · 0 + 3 · 7 = 31
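The same result can be obtained directly from a model with a small function implementing the simplified formula; the sketch below again assumes the nested-tuple representation from the earlier sketches.

```python
# Hedged sketch of the closed-form MDL calculation of section 3.6.4.
from math import ceil, log2

def clog2(x):
    return ceil(log2(x)) if x > 1 else 0

def mdl(model):
    events, ints, arg_counts = [], [], []

    def walk(node, is_root):
        kind = node[0]
        if kind == "event":
            events.append(node[1])
        elif kind == "int":
            ints.append(node[1])
        else:
            arg_counts.append(None if is_root else len(node) - 1)
            for child in node[1:]:
                walk(child, False)

    walk(model, True)
    Ne, Ni, No = len(events), len(ints), len(arg_counts)
    Me = len(set(events) | {"τ"})                 # unique events, silent always counted
    Mi = max(ints, default=0)                     # largest integer, zero if none
    Mo = max((c for c in arg_counts if c is not None), default=2)
    return ((Ne + 1) * clog2(Me) + (Ni + 1) * clog2(Mi + 1)
            + No * clog2(Mo - 1) + 3 * (Ne + Ni + No + 1))

model = ("+", ("event", "b"), ("⊕", ("event", "a"), ("event", "b")), ("event", "a"))
print(mdl(model))  # 31, matching the bit string constructed in section 3.6.3
```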

Chapter 4

Results

4.1 Generating results

In order to investigate how recall, precision and complexity change when extending a lossy model to a lossless one, event logs to run the compression algorithms on are needed. The data set "Testing representational biases" by van der Aalst [8] provides 120 event logs that were generated to test process mining algorithms.

To inspect how the compression algorithms behave with (possibly incomplete) logs of various sizes, fractions in 5% increments from 5% to 100% were taken from each log. These fractions are used as input for the lossy compression algorithm IMd and for our lossless extension algorithm using IMd. The precision, recall and complexity of the lossy models, as defined in sections 2.2 and 3.6, can then be compared to those of the lossless models. The lossless models’ precision and recall are always perfect, so these statistics do not need to be inspected.

A more interesting statistic is the relative change in the complexity of the model before and after the lossless extension of the lossy model. This statistic is plotted against the precision and the recall of the lossy model to illustrate which ranges correlate with the model’s complexity increase.

There are four types of results that the lossless extension can generate. The first two are the extended compressed model (ECM) and the extended explicit model (EEM). If the model generated by the lossy compression algorithm already happened to be lossless, then that model is returned (LM). Finally, the lossless compression algorithm can return an error (ERR), indicating either that it could not execute the lossy model due to time or memory issues, or that the log is invalid, such as an empty log.

Since the error result is not a relevant model for the present purposes, it does not have measurements and is therefore not analyzed in this section. (The error cases may be relevant for other purposes, such as finding methods for improving the proposed extension algorithm.)

When the algorithm directly returned the lossy model, the precision and recall were already a perfect one, and the complexity did not change, as the lossy model equals the lossless model in these cases. Since these results are always the same, they are not shown.

Figure 4.1: The relative increase in complexity from the lossy models to the losslessly extended models, plotted against the lossy model’s recall (panel a) and precision (panel b) scores. The fraction of the log that was used to generate the models is shown as the color of each plotted point.

Figure 4.1a shows a negative correlation between the fraction and the increase in complexity, and that every model that became an ECM had a recall of one. Figure 4.1b shows that as the fraction decreases, the precision decreases while the complexity multiplier increases. No ECM had a precision lower than 0.5.

Figure 4.2: The relative increase in complexity from the lossy models to the explicit models, plotted against the lossy model’s recall (panel a) and precision (panel b) scores. The fraction of the log that was used to generate the models is shown as the color of each plotted point.

In figure 4.2a, various clusters can be seen. Most models with a high fraction tend to have a higher complexity, but there are numerous exceptions. Figure 4.2b shows that most models that were extended to an EEM had a precision near zero. Most models with higher fractions also tend toward higher complexity, but not without exceptions. No EEM had a precision higher than 0.4.


Figure 4.3: The number of results of each type, plotted against the fraction of the traces from each log that was used.

The numbers of EEMs and ERRs remained relatively consistent as the fraction changed, see figure 4.3. At high fractions, more LMs were generated than ECMs. The inverse is true when the fraction drops below 0.6, until the fraction reaches 0.05, where there are no LMs or ECMs at all.


Chapter 5

Discussion

Our research found that lossy models that became ECMs had perfect recall regardless of the fraction and complexity, while their precision decreased with the fraction. In contrast, the recall of the models that became EEMs was clustered and inconsistent, and their precision was almost consistently low, regardless of complexity and fraction.

The lossy models from which the ECMs in figure 4.1 were generated most likely contain the flower model. Flower models always have perfect recall because they include all traces, which is reflected in figure 4.1a. If the same flower model is generated by the lossy algorithm when the fraction of the log is lowered, then the same traces are produced, but some of them are no longer in the log. This lowers the lossy model’s precision, and it also means that the removed traces have to be subtracted from the flower model to achieve lossless compression, causing greater complexity the lower the fraction. The correlation between a lower fraction, lower precision and higher complexity can be seen clearly in figure 4.1b.

This behavior stops when precision falls below 0.5: there would then be as many or more traces to subtract from the flower model as there are traces in the log, which signals the lossless extension algorithm to use the explicit model instead. This can be seen from the fact that every lossy model that was extended to an explicit model had a precision less than 0.5, as shown in figure 4.2b.

This means that IMd may not be ideal for our lossless extension algorithm, as the precision of the lossy model needs to be greater than 0.5 to avoid producing an EEM, due to the design of the lossless extension algorithm. IMd focuses on attaining high recall, often neglecting precision in the process. Furthermore, IMd has trouble producing precise models when compressing a log in which events occur multiple times per trace.


Both panels of figure 4.2 show that most EEMs with a higher fraction tend to have higher complexity. This is expected, because when a log has a higher fraction, there is more information that needs to be described. Since EEMs do not compress, this results in a higher complexity than for lower fractions. The exceptions, data points with a high fraction but a relatively low complexity multiplier, are likely caused by lossy models that already had a high complexity, thus lowering the complexity multiplier of the EEM.

Additionally, as expected, the EEMs have an overall higher complexity than ECMs since they do not compress.

It should be noted that our research was limited by the computer’s working memory and processing time, which means we were unable to execute the models that would produce over hundreds of thousands of traces; in those cases errors were returned instead. Thus, no conclusions can be drawn about these models. Additionally, the data set we used may not have been able to produce every possible process model construction. For example, a different log may have resulted in an ECM that was not a flower model. Due to time constraints, we could not confirm the theory that every generated ECM contained a flower model. We were also unable to look into the clusters of models that became EEMs for the same reason. Because of this, no conclusion could be drawn about the interactions between precision, recall and complexity of these models.

Further research is necessary to repeat the experiment with better resources, as well as with different data sets. We also recommend using lossy algorithms other than IMd and comparing those results to ours.

Finally, further research could look into how the minimum description length could be further optimized. This could be done by using Huffman coding for the operator types, but that would require research into which operators appear most frequently to produce optimal results. Additionally, since integers only appear in the first argument of the * operator, integers and events could be distinguished by whether they are in that argument or not, rather than being assigned individual types.

Despite these limitations, this work still offers an initial starting point that can be used to investigate the effects of lossless extension and the measure of complexity in process models.


Bibliography

[1] Adriano Augusto. Accurate and efficient discovery of process models from event logs. PhD thesis, The University of Melbourne, 2019.

[2] Toon Calders, Christian W Günther, Mykola Pechenizkiy, and Anne Rozinat. Using minimum description length for process mining. In Proceedings of the 2009 ACM Symposium on Applied Computing, pages 1451–1455, 2009.

[3] Long Cheng, Boudewijn F van Dongen, and Wil MP van der Aalst. Scalable discovery of hybrid process models in a cloud computing environment. IEEE Transactions on Services Computing, 13(2):368–380, 2019.

[4] Gianluigi Greco, Antonella Guzzo, Luigi Pontieri, and Domenico Sacca. Discovering expressive process models by clustering log traces. IEEE Transactions on Knowledge and Data Engineering, 18(8):1010–1027, 2006.

[5] Sander JJ Leemans, Dirk Fahland, and Wil MP Van der Aalst. Scalable process discovery and conformance checking. Software & Systems Modeling, 17(2):599–631, 2018.

[6] Tijs Slaats. Declarative and hybrid process discovery: Recent advances and open challenges. Journal on Data Semantics, pages 1–18, 2020.

[7] Niek Tax, Xixi Lu, Natalia Sidorova, Dirk Fahland, and Wil MP van der Aalst. The imprecisions of precision measures in process mining. Information Processing Letters, 135:1–8, 2018.

[8] W.M.P. (Wil) van der Aalst. Testing representational biases. https://doi.org/10.4121/uuid:25d6eef5-c427-42b5-ab38-5e512cca08a9, 2017. Accessed: 2020-06-22.

[9] Nikolai K Vereshchagin and Paul MB Vitányi. Kolmogorov’s structure functions and model selection. IEEE Transactions on Information Theory, 50(12):3265–3290, 2004.


[10] Chris S. Wallace and David L. Dowe. Minimum message length and Kolmogorov complexity. The Computer Journal, 42(4):270–283, 1999.

[11] Philip Weber, Behzad Bordbar, Peter Tiňo, and Basim Majeed. A framework for comparing process mining algorithms. In 2011 IEEE GCC Conference and Exhibition (GCC), pages 625–628. IEEE, 2011.

[12] S Williams. Business process modeling improves administrative control. Automation, December, 44:50, 1967.
