Eindhoven University of Technology

MASTER

Predicting undesired business process behavior using supervised machine learning

Sotudeh, Hadi

Award date: 2018

Awarding institution: Royal Institute of Technology

Disclaimer

This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration.

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.


Abstract

Process mining makes business processes more efficient in order to improve productivity, save time and cost, and reduce waste, by incorporating important patterns discovered in business process behavior. This can be achieved effectively through Predictive Process Monitoring, in which a wide range of algorithms is applied to business process datasets so that businesses can take action in advance and become proactive. In this study, previous related work was reviewed to identify the knowledge gap and to design the research question: "What supervised machine learning algorithms are suitable for predicting undesired business process behavior on imbalanced datasets?". To answer it, a real-world imbalanced business process dataset was used and preprocessed, and different algorithms appropriate for the stated problem were investigated, considering suitable strategies for overcoming the imbalance challenge. The results showed that the Light Gradient Boosting algorithm with a weighted cost function generally outperformed Naive Bayes, Decision Tree, Random Forest, and the stacked classifier in terms of AUC ROC. The findings of the study can be applied to predict undesired behavior in any business process whose dataset is imbalanced.

Keywords: Process Prediction, Process Monitoring, Business Process Management, Process Mining, Machine Learning.


Summary

Process mining is a method for making business processes more efficient, so that productivity increases, time and cost are kept down, and waste is reduced. This is done by discovering important patterns in the behavior of the business processes. One way to achieve this is predictive process monitoring, a collective term for algorithms that aim to let companies detect and address problems proactively. In this study, previous related work was reviewed to identify the knowledge gap and to design the research question: "Which supervised machine learning algorithms are suitable for predicting undesired business process behavior on imbalanced datasets?". To answer it, a real-world imbalanced business process dataset was used and preprocessed; then, different algorithms suitable for the stated problem were investigated, considering appropriate strategies for overcoming the imbalance challenge. The results showed that the Light Gradient Boosting algorithms with weighted cost functions outperformed Naive Bayes, Decision Tree, Random Forest, and the stacked classifier in terms of AUC ROC in general. The results of the study can be applied to any business process with an imbalanced dataset in order to predict its undesired behavior.

Keywords: Process Prediction, Process Monitoring, Business Process Management, Process Mining, Machine Learning.


Acknowledgment

I would like to express my sincere gratitude to my supervisors Prof. Magnus Boman and Julian Madrzak, and my examiner Prof. Henrik Boström, for their guidance and support. The completion of this thesis would not have been possible without their enormous contributions. My special thanks go to Mr. Diego Roa, Mr. Jan Philipp Thomsen, Mr. Sebastian Roßner, Mr. Tom Shaffner, and Dr. Anton Kurz in the content-store team, and to other colleagues at Celonis, for their assistance during my internship program. In addition, it is not possible to close this part without acknowledging the previous studies in the process mining area, so I should quote Newton: "If I have seen further, it is by standing on the shoulders of Giants".

Furthermore, I would like to thank the EIT Digital Master School for the generous scholarship which allowed me to complete the first year of my master studies at TU/e in Eindhoven (the Netherlands), the second year at KTH (Sweden), and the master thesis in Munich (Germany). Last but not least, I should thank my father, mother, and sister, who supported me throughout my master studies abroad.

Eindhoven, September 17, 2018


Contents

List of Figures

List of Tables

Abbreviations

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Objectives
1.5 Methodology
1.6 Ethics and Sustainability
1.7 Risks
1.8 Delimitations
1.9 Outline

2 Extended background
2.1 Business Process Case
2.2 Purchase to Pay Process
2.3 Imbalanced Data
2.4 Evaluation Metrics
2.5 Filling Features with Missing Values
2.6 Encoding
2.7 Machine Learning Algorithms (Classifiers)
2.7.1 Dummy Classifier
2.7.2 Naive Bayes Classifier
2.7.3 Logistic Regression Classifier
2.7.4 Decision Tree Classifier
2.7.5 Random Forest Classifier
2.7.6 Light Gradient Boosting Classifier
2.7.7 Stacked Classifier
2.8 Hyper–Parameter Tuning

3 Methodology
3.1 Input Dataset
3.2 Problem–Defining Step
3.2.1 Imbalanced Dataset
3.2.2 Evaluation Metric
3.2.3 Cut–off Threshold
3.3 Data Preparation Step
3.3.1 Selecting Complete Cases
3.3.2 Designing Feature Engineering
3.3.3 Filling Missing Values
3.3.4 Label Generation
3.3.5 Data Encoding
3.4 Modeling Step
3.4.1 Train and Test Sets Split
3.4.2 Classification Algorithms
3.4.3 Hyper–Parameter Tuning
3.5 Production Step

4 Results

5 Discussion
5.1 The Cut–off Threshold Value
5.2 Performance of Different Classifiers
5.3 Interpretation and Root Cause Analysis

6 Conclusion and Future Work

List of Figures

2.1 A business process case.
2.2 The main flow of a P2P process.
2.3 A P2P process with different paths (non-deterministic behavior).
2.4 The closest point on the ROC plot to the point (0, 1).
2.5 Index-based encoding.
2.6 An example of a decision tree built on the Iris flower dataset.
3.1 Applied methodology in this study.
4.1 A heat map showing correlations among the features and the target labels.
4.2 The AUC ROC scores over different process steps on the training sets.
4.3 The AUC ROC scores over different process steps on the test sets.
4.4 The TPR and the FPR over different thresholds of LGBM at process step (5) on the training set.
4.5 The AUC ROC plot of the decision tree at process step (3) on the test set.
4.6 The training execution time of different classifiers over all process steps.
5.1 The decision tree classifier at process step (7) on the training set.

List of Tables

1.1 Previous work.
2.1 Main columns needed to perform process mining.
2.2 Most common P2P activities.
2.3 A confusion matrix.
2.4 A categorical feature.
2.5 One-hot encoded feature.
3.1 The description of the input dataset.
3.2 Global parameters.
3.3 Results of predictions on ongoing cases.
4.1 The hardware configuration used in this study.
4.2 The utilized Python libraries in this study.
4.3 The ratios of the target class over different process steps in the dataset.
4.4 Statistics of the training sets up to process step (9).
4.5 The results of different classifiers over different process steps on the training sets.
4.6 The results of different classifiers over different process steps on the test sets.
5.1 Suitable classifier at each process step.

Chapter 1

Introduction

“The difficulty of beginning will be nothing to the difficulty of knowing how to stop.”

Agatha Christie, Murder in Mesopotamia (1936)

1.1 Background

Prof. Wil van der Aalst, the founding father of process mining, says on page 31 of his book "Process Mining: Data Science in Action": "Process mining is a relatively young research discipline that sits between machine learning and data mining on the one hand and process modeling and analysis on the other hand. The idea of process mining is to discover, monitor and improve real processes (i.e., not assumed processes) by extracting knowledge from event logs readily available in today's systems" [1]. Traditional approaches to monitoring business processes are either "off-line" or "online". In the "off-line" setting, a database of complete business process cases is used to calculate performance metrics such as bottlenecks and cycle times. In the "online" setting, a monitoring dashboard shows the status of running business process cases and raises alerts when an undesired situation happens. Both are descriptive techniques: they show process issues after their occurrence, but they do not tell what is going to happen in the future [2].

Predictive process monitoring [3] is a field of study within process mining that uses historical data to forecast, by applying various algorithms, how an ongoing business process case will behave in the future. These predictions can concern numerical attributes such as remaining time [4][5][6][7], activity delays [8], and violations of specified deadlines [9], or categorical attributes such as the next activity [10][4], abnormal terminations [11], and slow cases [12].

Algorithms for these tasks fall into two classes: (a) algorithms that have access to information about the business process model and use it to perform predictions, e.g. transition systems [8] and stochastic Petri nets [13], and (b) algorithms that do not use this information, e.g. deep learning [14] and other supervised machine learning algorithms [15]. This study focuses on predicting categorical attributes of business process cases using algorithms that do not have access to information about the business process model.


1.2 Problem

A lot of work has been performed in the predictive process monitoring area, and more specifically on undesired behavior prediction. Table 1.1 summarizes some of these studies, all of which applied algorithms that do not use information about the business process model to predict categorical attributes.

| Year | Summary | Methods | Evaluation Metrics | Reference |
|------|---------|---------|--------------------|-----------|
| 2012 | Predict the business process failure | Support Vector Machine (SVM) | Accuracy and proactiveness | [11] |
| 2012 | Business process monitoring for prediction of abnormal terminations | K-Nearest Neighbor Imputation (KNNI) based Local Outlier Factor (LOF) | Precision, Recall, and earliness | [16] |
| 2014 | Business process monitoring, e.g. predicting whether a diagnosed patient will recover | Decision Tree | Accuracy | [3] |
| 2014 | Predict the next activity in business processes | Hidden Markov Model (HMM) | Accuracy | [10] |
| 2015 | Comparing different encoding methods on a business process prediction problem | Random Forest | Area Under the Receiver Operating Characteristic curve (AUC ROC) | [17] |
| 2015 | Comparing three different methods on predictive monitoring | Machine learning, constraint satisfaction, and Quality of Service (QoS) aggregation | e.g. Accuracy, confusion matrix | [18] |
| 2017 | Predictive monitoring to generate decision rules | Evolutionary algorithms | e.g. Precision, Recall, and F-measure | [19] |
| 2017 | Predict the next activity in business processes | Long Short-Term Memory (LSTM) | Precision | [14] |
| 2017 | Predict next activities in business processes (a steel manufacturing case study) | Deep neural networks | Accuracy and Recall | [20] |
| 2018 | Business process monitoring | Clustering, Random Forest, and Decision Tree | Accuracy and failure rate | [15] |
| 2018 | Nirdizati, a tool for predictive process monitoring | Random Forest, Gradient Boosting, Decision Tree, and eXtreme Gradient Boosting Tree (XGBT) | e.g. Accuracy, F1, logarithmic loss, and mean absolute error | [12] |

Table 1.1. Previous work.

Among these studies, Leontjeva et al. predicted an undesired activity on a balanced dataset [17]. Mehdiyev et al. applied deep neural networks to predict next activities and to detect some defined undesired activities in a case study [20], but their chosen evaluation metric can be criticized, as there are better options for the given imbalanced dataset. Di Francescomarino et al. performed predictive business process monitoring using several clustering methods and two tree-based classifiers, but their experiments were not on imbalanced datasets [15]. The latest study in this area was developed as an open-source tool called "Nirdizati" [12], which proposes a general framework for predicting categorical and numerical attributes of business process cases; however, this framework currently does not support undesired behavior prediction on imbalanced datasets.

To the best of our knowledge, there is no study that investigates suitable supervised machine learning algorithms for undesired business process behavior prediction on imbalanced datasets using techniques such as sampling methods, weighted cost functions, and ensemble algorithms [21]. To sum up, the problem statement is to predict the occurrence of an undesired activity in a business process on an imbalanced dataset, where only a small percentage of business process cases go through the undesired activity, as a binary classification task.


1.3 Purpose

To the best of our knowledge, supervised machine learning algorithms for predicting undesired business process behavior on imbalanced datasets have not been reported in the published studies. That is the identified knowledge gap, addressed in the form of the following research question:

“What supervised machine learning algorithms are suitable for predicting undesired business process behavior on imbalanced datasets?”

1.4 Objectives

Celonis©1, the market leader in process mining as recognized by Gartner©2 in 2018 [22], is software that discovers and visualizes business processes to make them transparent, fast, and cost-effective. The objective of this master thesis, which is sponsored by the Machine Learning team at Celonis©, is to predict an undesired activity in business process cases with satisfactory results based on the chosen evaluation metric. To achieve this objective, suitable supervised machine learning algorithms for the given problem are investigated, taking into account the imbalance challenge of the dataset after the feature engineering and other data preparation have been applied. In this study, the Purchase to Pay (P2P) process is chosen as the business process and the 'Change Price' activity as the undesired behavior to be predicted. However, one could choose any undesired activity in any business process if enough information is provided. The outcome of this study will be integrated into Celonis©.

1.5 Methodology

The empirical method is adopted to answer the proposed research question in this thesis, as various data analysis algorithms are tested in an exploratory way, exploring a large parameter space on a real-world business process dataset provided by Celonis©. The proposed research question is about finding the suitable non-neural-network supervised machine learning algorithms among the chosen ones based on the AUC ROC. That is why the quantitative data analysis method is applied to measure the performance of the different algorithms in Chapter 3. All in all, any researcher should be able to replicate this study on any other business process, and any undesired activity therein, by following the methodology explained in Chapter 3.

1http://celonis.com, accessed: June 18, 2018

2http://gartner.com, accessed: June 18, 2018


1.6 Ethics and Sustainability

Since the dataset used in this thesis is not publicly available, our main concerns are data security and privacy. Since 25 May 2018, the General Data Protection Regulation (GDPR) [23] has applied in the European Economic Area and must be taken into account. Although there is big hype around GDPR, it has not changed the world, and most of the data protection laws in Germany already existed before it. The Celonis terms and conditions generally let us use anonymized user data and aggregated data, securing users' identities, for benchmarking studies, marketing, or other business goals. It is important to note that GDPR only applies to EU citizens, not to those of other countries such as the United States of America, although its ethical concerns remain valid there as well.

Regarding sustainability, one can make a process more efficient by detecting potential undesired behavior in advance. Assume that a company has 200,000 cases in its P2P process per year, and that around five percent of these cases contain 'Change Price' activities. In addition, assume that each 'Change Price' activity takes 20 minutes (this number varies among companies) for an employee to interpret and act upon. This amounts to 0.05 × 200,000 × 20 = 200,000 minutes of the company's time, equal to roughly 19 months of a single employee working full-time. With the outcome of this study, it should be possible to predict undesired behavior early enough to take the right actions. All in all, predicting undesired behavior in a business process will:

• save operational costs

• boost efficiency and avoid massive rework

• improve quality, increase bilateral satisfaction, and reduce incidents

• affect the cash flow and revenue forecast

• redesign the process

Like any other prediction task, there is the risk of incorrect predictions; the final goal is to obtain satisfactory results based on the chosen evaluation metric, which also minimizes the number of incorrect predictions.

In terms of the Sustainable Development Goals introduced by the United Nations in 2016 [24], "Goal 8: decent work and economic growth", "Goal 9: industry, innovation, and infrastructure", and "Goal 12: responsible consumption and production" are addressed.

1.7 Risks

This thesis could fail to answer the introduced research question properly and to predict the chosen undesired business process behavior as expected. In addition, missing milestones and deadlines is another potential risk.


1.8 Delimitations

The given problem is stated as a binary classification task, which is why multi-class classification is out of scope. In addition, among algorithms that do not use information about the business process model, only non-neural-network supervised machine learning algorithms (classifiers) are within the scope of this thesis; other algorithms such as deep neural networks or HMMs are therefore not considered. The reason for this choice was to start with simple supervised machine learning algorithms and evaluate their results. Had it been necessary to improve the results, we could have experimented with more complex algorithms such as neural networks, but the non-neural-network supervised machine learning algorithms performed well, so there was no need to include neural network or HMM approaches. In this thesis, only the Light Gradient Boosting algorithm, Naive Bayes, Decision Tree, Random Forest, and the stacked classifier are studied. Finally, this study is performed on a single machine, as running it on a distributed framework is out of scope.

1.9 Outline

The thesis is organized as follows. First, relevant theory is discussed in Chapter 2. Then, the thesis methodology is elaborated in Chapter 3, and the obtained results are presented in Chapter 4. Finally, the discussion and conclusions are covered in Chapter 5 and Chapter 6, respectively.


Chapter 2

Extended background

“Everything must be taken into account. If the fact will not fit the theory - let the theory go.”

Agatha Christie, The Mysterious Affair at Styles (1920)

This chapter is organized as follows. First, the concept of a business process case is defined in Section 2.1. Next, the Purchase to Pay process is described in Section 2.2. Then, the implications of imbalanced datasets and the corresponding approaches are explained in Section 2.3. The remainder of this chapter covers evaluation metrics (Section 2.4), filling features with missing values (Section 2.5), data encoding methods (Section 2.6), supervised machine learning algorithms (Section 2.7), and hyper-parameter tuning of machine learning algorithms (Section 2.8), respectively.

2.1 Business Process Case

A business process consists of several cases, each of which has a set of activities occurring in temporal order. Each activity has a set of features, e.g. the name of the responsible person and the occurrence time of the activity. The data generated during the execution of business process cases is stored in IT systems accordingly. In order to perform process mining, at least a case-id (a business process case identifier), the activity name, and its timestamp should be recorded (see Table 2.1) [1].

| Case–id | Activity | Timestamp |
|---------|----------|-----------|
| 506 | h | 3/2/2016 13:34 |
| 507 | h | 3/2/2016 18:58 |
| 506 | u | 3/2/2016 20:00 |
| 507 | c | 4/4/2016 13:20 |
| 508 | a | 5/5/2016 21:45 |
| 506 | f | 6/5/2016 18:48 |
| 506 | v | 8/5/2016 14:10 |
| ... | ... | ... |

Table 2.1. Main columns needed to perform process mining.

From this table, information on business process cases can be extracted. For example, case-id 506 has (h, u, f, v) as its sequence of activities, with four process steps: process step (1) is h, process step (2) is u, process step (3) is f, and process step (4) is v. This means each business process case can be shown as a sequence of activities (circles) with their corresponding features, as in the following figure:

Figure 2.1. A business process case.

As a business process case proceeds, new activities and corresponding attributes occur. In addition, at each process step, we know the business process case from the beginning up to that step. For example, at process step (6), we know activities and their corresponding attributes which occurred for that case before step (7).


2.2 Purchase to Pay Process

The purchase to pay process contains all steps needed to purchase goods, from supply to payment [25]. These steps, or activities, can vary from company to company and there is no standard, but the most common activities of this process, which can happen in any order, are shown in the following table:

| # | Activity |
|---|----------|
| 1 | Adjustment Charge |
| 2 | Block Purchase Order Item |
| 3 | Cancel Goods Receipt |
| 4 | Cancel Invoice Receipt |
| 5 | Change Currency |
| 6 | Change PR Approval |
| 7 | Change Price |
| 8 | Change Quantity |
| 9 | Change Vendor |
| 10 | Create Purchase Order item |
| 11 | Create Purchase Requisition item |
| 12 | Delete Purchase Order item |
| 13 | Delete Purchase Requisition item |
| 14 | Dun Order Confirmation |
| 15 | Print and Send Purchase Order (Paper) |
| 16 | Reactivate Purchase Order Item |
| 17 | Receive Order Confirmation |
| 18 | Record Goods Receipt |
| 19 | Record Invoice Receipt |
| 20 | Send Overdue Notice |
| 21 | Send Purchase Order (eMail) |
| 22 | Send Purchase Order (eOrder) |
| 23 | Send Purchase Order Update |
| 24 | Vendor creates Invoice |

Table 2.2. Most common P2P activities.

Each process can have undesired activities, e.g. 'Change Price', 'Change Quantity', and 'Delete Purchase Order item'. For example, 'Change Price' means a change in the price of an order due to various reasons, e.g. dummy initial price values or vendor behavior, to name but a few. The main flow of a P2P process, as shown by the Celonis© Process Mining software, is depicted in Figure 2.2, in which the number between consecutive activities is the number of cases passing through them. Figure 2.2 only shows the main path, while a case can also go through other sub-paths, as shown in Figure 2.3; this fact should be taken into account.


Figure 2.2. The main flow of a P2P process.


Figure 2.3. A P2P process with different paths (non-deterministic behavior).


2.3 Imbalanced Data

A dataset is called imbalanced when there is a significant inequality in the ratio of its classes. There is no standard criterion for considering a dataset imbalanced, but Haibo He and Yunqian Ma define the concept on page 15 of their book "Imbalanced Learning: Foundations, Algorithms, and Applications" as follows: "A dataset where the most common class is less than twice as common as the rarest class would only be marginally unbalanced, that datasets with the imbalance ratio about 10:1 would be modestly imbalanced, and datasets with imbalance ratios above 1000:1 would be extremely unbalanced." [21]

The imbalance of a dataset is generally an issue for machine learning algorithms, as their predictions will likely go to the majority class, making their results ineffective. There are different ways to tackle this issue and balance the datasets. These techniques are grouped into data-level and algorithmic-level methods [21][26][27], as explained below:

• Data–level methods:

1. Collect more data:

Adding more instances from the minority class.

2. Sampling methods:

– Over–sampling

Add copies of instances from the minority class.

– Under–sampling

Remove instances from the majority class.

• Algorithmic–level methods:

1. Change the evaluation metric:

There are different metrics to evaluate classifiers (see Section 2.4). Choosing the right metric can mitigate the imbalanced dataset issue.

2. Apply different algorithms:

There are algorithms, especially "ensemble algorithms", which are capable of handling imbalanced datasets (see Section 2.7).

3. Use weighted cost functions:

The optimized cost function can be made weighted. With this change, the cost of misclassifying a minority class instance is no longer the same as the cost of misclassifying a majority class instance. For example, misclassifying a minority class instance could cost 100 times more than misclassifying a majority class instance (see the sketch after this list).

4. Change the problem statement:

Instead of looking at the problem as a binary classification task, it could be reformulated as a multi-class classification task.
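As a minimal sketch of the weighted cost function idea with scikit-learn (the synthetic dataset and the 100:1 weight are illustrative assumptions, not this study's configuration):

```python
# A minimal sketch of a weighted cost function via class weights.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (~1% minority class), for illustration only
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)

# Misclassifying a minority-class (1) instance costs 100 times more
# than misclassifying a majority-class (0) instance:
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 100}).fit(X, y)

# Alternatively, derive the weights from the inverse class frequencies:
clf_balanced = DecisionTreeClassifier(class_weight="balanced").fit(X, y)
```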


2.4 Evaluation Metrics

There are different metrics to evaluate classifiers:

1. Confusion Matrix:

For a binary classification task, the performance of an algorithm can be summarized in a confusion matrix (shown in Table 2.3), which tabulates predicted classes against observed classes. True Negatives (TN) are instances which are negative and are predicted negative. False Positives (FP) are instances which are negative but are predicted positive. False Negatives (FN) are instances which are positive but are predicted negative, and True Positives (TP) are instances which are positive and are predicted positive.

|  | Predicted Negative | Predicted Positive | Total |
|---|---|---|---|
| Observed Negative | TN | FP | TN + FP |
| Observed Positive | FN | TP | FN + TP |
| Total | TN + FN | FP + TP | |

Table 2.3. A confusion matrix.

2. Accuracy:

Accuracy is defined as the number of correctly predicted instances divided by the total number of instances:

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2.1}
\]

3. Precision:

Precision is defined as the number of correctly predicted positive instances divided by the number of instances predicted positive:

\[
\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2.2}
\]

4. Recall:

Recall is defined as the number of correctly predicted positive instances divided by the number of all actually positive instances:

\[
\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2.3}
\]

5. F-score:

The F-score [28] combines Precision and Recall into a single metric, so that one deals with one metric instead of two. It is a weighted harmonic mean with different variants depending on the associated weight, e.g. F1 and F2:

\[
F_1 = 2\left(\frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}\right) \tag{2.4}
\]

\[
F_2 = 5\left(\frac{\mathrm{Precision} \times \mathrm{Recall}}{4 \times \mathrm{Precision} + \mathrm{Recall}}\right) \tag{2.5}
\]

In general, \(F_\beta\), where \(\beta\) is the weight assigned to the F-score, is written as:

\[
F_\beta = (1 + \beta^2)\left(\frac{\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}}\right) \tag{2.6}
\]

6. Kappa:

Kappa is a normalized accuracy for imbalanced datasets [26], where the baseline is the average accuracy of an algorithm which randomly permutes the predictions:

\[
\kappa = 1 - \frac{1 - \mathrm{accuracy}}{1 - \mathrm{baseline}} \tag{2.7}
\]

7. AUC ROC:

The Area Under the Receiver Operating Characteristic Curve (AUC ROC) was first developed during the Second World War to detect enemy objects on battlefields [29]. Although the metric is still useful, its name no longer relates to its common usage. The metric assumes that a classifier outputs, for each instance, the probability of belonging to the target class rather than directly assigning a class. If the probability of belonging to the target class is above a threshold value, the target class is assigned to the instance; otherwise, the non-target class is assigned. The ROC plot is created by calculating the True Positive Rate (TPR) and the False Positive Rate (FPR) over different values of this threshold:

\[
\mathrm{TPR} = \frac{TP}{TP + FN} \tag{2.8}
\]

\[
\mathrm{FPR} = \frac{FP}{FP + TN} \tag{2.9}
\]

After finding the TPR and FPR for each threshold value, the corresponding points (FPR, TPR) are placed on a two-dimensional plot and connected to each other. The AUC ROC metric is simply the area under this curve, which lies between zero and one. A classifier which groups instances completely at random has an AUC ROC equal to 0.5 (a diagonal line on the ROC plot); this classifier can be considered the baseline. Therefore, classifiers with an AUC ROC higher than 0.5 are considered good classifiers, and classifiers with an AUC ROC lower than 0.5 are considered poor classifiers [3].


• Cut-off Threshold Value:

Finally, a cut-off threshold value should be chosen in order to decide whether an instance belongs to the target class or not. There are different ways to make this choice:

– Euclidean Distance:

Take the threshold value of the point on the ROC curve which has the smallest Euclidean distance to the point (0, 1) [30][31]:

Figure 2.4. The closest point on the ROC plot to the point (0, 1).

– Youden's Index:

Apply Youden's index (Y) [32][31]. This index picks as the cut-off value the threshold which maximizes the following function:

\[
Y(\mathrm{threshold}) = \mathrm{TPR}(\mathrm{threshold}) - \mathrm{FPR}(\mathrm{threshold}) \tag{2.10}
\]

– Financial Cost:

Define a financial cost function on the confusion matrix over different threshold values, and pick the threshold which minimizes the cost function [30][31].

The AUC ROC metric is insensitive to imbalanced datasets, and the higher the score, the better. The interpretation of the score depends strongly on the problem context. For example, the University of Nebraska Medical Center uses the following definitions to interpret the AUC ROC score [33] in its diagnostic tests:


• Between 0.90 and 1.00 is considered excellent (level A).

• Between 0.80 and 0.90 is considered good (level B).

• Between 0.70 and 0.80 is considered fair (level C).

• Between 0.60 and 0.70 is considered poor (level D).

• Between 0.50 and 0.60 is considered fail (level F).

2.5 Filling Features with Missing Values

There are different ways to fill missing values of features:

1. With mean or median:

Missing values of categorical features are replaced with the median of the existing values, and missing values of numerical features are replaced with the mean of the existing values [34].

2. With sub–classification:

Missing values are filled in by training classifiers. For each feature with missing values, the instances with missing values form the test set and the rest form the training set. A classification model is then trained on the training set to predict the missing values in the test set. This approach is more realistic, but it requires more effort. In addition, it assumes that the dataset contains predictor variables relevant for predicting the target feature with missing values [34].

3. With ‘No Value’:

Missing values of categorical features are replaced with a new class called 'No Value'.

2.6 Encoding

In order to apply classification models to the dataset, each business process case should be represented as a row, with its features as the columns of the table. Therefore, the dataset should be transformed and encoded. There are different ways to accomplish this. For example, Leontjeva et al. compared different encoding methods for business processes, e.g. boolean encoding, frequency-based encoding, simple index encoding, and index-based encoding [17]. Based on their discussion, index-based encoding (shown in Figure 2.5) was selected in this study, since it performed well in their experiments. In addition, this encoding method is lossless, meaning the original case can be reconstructed completely from the encoded one [35], which makes the interpretation of classifiers easier in the end. The method encodes a business process case as one row, keeping all its activities and corresponding features as columns indexed by process step: Activity_i is the activity name at process step i and feature_i is the feature value at process step i. Features which take different values over the process steps of a case are referred to as dynamic features. The method also supports static features by adding extra columns to the table. Then, for each process step, a corresponding label is generated which shows whether the undesired activity happens in the future of the case; these labels are denoted label_i, the prediction variable at process step i. A drawback is that the size of an encoded business case grows with each event, which costs memory usage and computation speed [35].

| Case ID | Event1 | ... | Eventm | Resource1 | ... | Resourcem | Static feature | label1 | ... | labelm |
|---|---|---|---|---|---|---|---|---|---|---|
| 1233434 | Create Purchase Order Item | ... | Clear Invoice | Andi | ... | Hadi | Value 1 | 0 | ... | 0 |
| 8574626 | Create Purchase Order Item | ... | Cancel Order | Andi | ... | Diego | Value 3 | 1 | ... | 0 |

Figure 2.5. Index-based encoding.
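As a minimal sketch of index-based encoding (the toy event log and column names are illustrative assumptions, not the exact implementation used in this study):

```python
import pandas as pd

# Toy event log, one row per activity occurrence (cf. Table 2.1)
log = pd.DataFrame({
    "case_id":  [506, 506, 506, 506, 507, 507],
    "activity": ["h", "u", "f", "v", "h", "c"],
    "user":     ["Andi", "Hadi", "Andi", "Hadi", "Diego", "Diego"],
})

max_steps = 4  # encode up to this process step

rows = []
for case_id, trace in log.groupby("case_id", sort=False):
    row = {"case_id": case_id}
    # One column per (dynamic feature, process step) pair
    for i, (_, event) in enumerate(trace.head(max_steps).iterrows(), start=1):
        row[f"activity_{i}"] = event["activity"]
        row[f"user_{i}"] = event["user"]
    rows.append(row)

encoded = pd.DataFrame(rows)  # one row per case; NaN where a case is shorter
print(encoded)
```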

After encoding the business process cases, the categorical features themselves need to be encoded. There are different methods to accomplish this task:

1. One-hot encoding:

For each categorical feature, the number of its unique values is computed as S; then S new binary features are added and the original feature is removed. In other words, the categorical feature is expanded into S new binary features [36]. For example, consider a 'color' feature with three values ('red', 'blue', and 'black'). One-hot encoding removes 'color' and instead adds three new features ("color is red", "color is blue", and "color is black") with 0 (false) and 1 (true) as their values (see the following tables).

| Instance Id | Color |
|---|---|
| 1 | red |
| 2 | blue |
| 3 | black |

Table 2.4. A categorical feature.

| Instance Id | Color is red | Color is blue | Color is black |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |

Table 2.5. One-hot encoded feature.

2. Label encoding:

For each categorical feature, its values are mapped to numbers. This method is appropriate when there is an order in the values of the categorical feature, as the numerical comparison among them is then meaningful. In addition, the applied classifiers should be tree-based: linear classifiers internally compute quantities like the average of the categorical feature, which is not meaningful when it is label encoded [36]. Both methods are sketched below.
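A minimal sketch of both encoding methods with pandas, using the 'color' example above (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "black"]})

# One-hot encoding: expand 'color' into one binary column per unique value
one_hot = pd.get_dummies(df["color"], prefix="color_is")

# Label encoding: map each value to an integer code; only meaningful for
# ordered categories or for tree-based classifiers
labels = df["color"].astype("category").cat.codes

print(one_hot)
print(labels)
```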

2.7 Machine Learning Algorithms (Classifiers)

Since supervised machine learning algorithms are the focus of this study, only such algorithms, here called classifiers, are considered. There are different classifiers with numerous variants; the ones used in this study are introduced here.

2.7.1 Dummy Classifier

This classifier does not learn anything from the features: it always outputs the majority class of the dataset as its prediction. Its interpretation therefore provides no valuable insight into the dataset; however, it can be used as a baseline against which to compare the results of the other classifiers.

2.7.2 Naive Bayes Classifier

The Naive Bayes classifier, as its name suggests, is an implementation of Bayes' theorem:

\[
P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)} \tag{2.11}
\]

The classifier rests on the naive assumption that features are pairwise independent of each other. This means that the probability of an instance with features X_1, X_2, ..., X_n belonging to class C_k corresponds to the probability of class C_k multiplied by the probability of each feature occurring in this class [37]. That is why this classifier is skew insensitive [21]:

\[
P(C_k \mid X_1, X_2, \dots, X_n) = \frac{P(C_k)\,P(X_1, X_2, \dots, X_n \mid C_k)}{P(X_1, X_2, \dots, X_n)} = \frac{P(C_k) \prod_{i=1}^{n} P(X_i \mid C_k)}{\prod_{i=1}^{n} P(X_i)} \tag{2.12}
\]

\[
\hat{P}(C_k \mid X_1, X_2, \dots, X_n) = P(C_k) \prod_{i=1}^{n} P(X_i \mid C_k) \tag{2.13}
\]

Although independence is a strong assumption, Naive Bayes competes well with other classifiers in practice [38].


2.7.3 Logistic Regression Classifier

This classifier is similar to linear regression, with a small difference which makes it suitable for classification tasks. First, the input features should be normalized, so that the effect of their units is eliminated. Standardization is one possible normalization method; it transforms each feature x into a standardized feature x_s with mean (µ) 0 and standard deviation (σ) 1:

\[
x_s = \frac{x - \mu}{\sigma} \tag{2.14}
\]

The classifier uses a sigmoid function to transform the output of the linear regression into a number between zero and one for the binary classification purpose; a threshold value on the final output (see Equation 2.17) then defines the class of the instance. If the probability is above the threshold, the target class is assigned to the instance [39].

\[
g(x) = \frac{1}{1 + \exp(-x)} \tag{2.15}
\]

\[
\theta_0 X_0 + \theta_1 X_1 + \dots + \theta_n X_n = \theta^T X \tag{2.16}
\]

\[
h_\theta(X) = g(\theta^T X) = \frac{1}{1 + \exp(-\theta^T X)} \tag{2.17}
\]

This classifier can use an optimizer such as Stochastic Gradient Descent to minimize its cost function and find the optimal parameters \(\theta\). The cost function to be optimized is:

\[
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(h_\theta(X_i)) + (1 - y_i) \log(1 - h_\theta(X_i)) \right] \tag{2.18}
\]

where:

• C is a constant (the inverse of the regularization strength, cf. Section 3.4.3).

• L is the regularization method, e.g. L1 or L2.

• X_i is the standardized feature vector of instance i.

• y_i is the target label of instance i.

2.7.4 Decision Tree Classifier

Decision trees use simple decision rules to predict the target class. There are different implementations of decision trees, e.g. C4.5 [40], C5.0 [40], and the Classification And Regression Tree (CART) [41]. CART supports both classification and regression tasks. It constructs binary trees using the input features so as to maximize the information gain at each data split. An example of a decision tree (built using the scikit-learn library in Python [42]) is shown in Figure 2.6. This decision tree was built on the publicly available Iris flower dataset [43]. As the figure shows, the interpretation of this classifier is easy: it lays out a set of rules leading to the defined target class. In general, tree-based classifiers do not need the normalization step, since the units of the input features have no effect on the data splitting.

Figure 2.6. An example of a decision tree built on the Iris flower dataset.
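A minimal sketch of how such a tree can be reproduced with scikit-learn on the Iris dataset (the depth limit is an assumption chosen for readability):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned decision rules; this is what makes the
# classifier easy to interpret
print(export_text(tree, feature_names=iris.feature_names))
```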

2.7.5 Random Forest Classifier

Combining different classifiers of the same type can lead to better results; this idea is the basis of ensemble classifiers. Random forest [44] is an ensemble classifier which combines different decision trees to improve the results. First, the random forest creates n different decision trees by taking random instances from the dataset; a random set of features is then selected for splitting, and each decision tree is built accordingly. These n decision trees are then applied to the dataset, so that for each instance there are n predictions (class probabilities in classification tasks and numerical values in regression tasks). Finally, the average of the predictions is reported as the final prediction for each instance. Although this classifier performs quite well, it is not easy to interpret correctly.

2.7.6 Light Gradient Boosting Classifier

The Gradient Boosting Decision Tree, a popular classifier, combines several classifiers into a strong classifier in a sequential order. First, a classifier is trained on the training set; then a second classifier is fit to the residuals of the first. The process of fitting a classifier to the residuals of the previous classifier continues until a given condition is satisfied. This classifier has different implementations, such as the Extreme Gradient Boosting Tree (XGBT). Despite the amount of work done on them, implementations of the Gradient Boosting Decision Tree tend not to be efficient and scalable when the feature space and the dataset size become large.

The Light Gradient Boosting Machine (LGBM) is a newer implementation of the Gradient Boosting Decision Tree, introduced by Microsoft in 2017 [45]. It is claimed to perform up to 20 times faster than the XGBT while achieving almost the same accuracy in experiments on public datasets. It differs from other implementations, which scan all data instances to estimate the information gain of all possible split points, a tedious and time-consuming approach. LGBM instead proposes two novel methods, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), in order to be fast and efficient while achieving almost the same accuracy. With GOSS, only a sample of the instances with large residuals is kept to estimate the information gain. With EFB, mutually exclusive features, i.e. features that rarely take non-zero values simultaneously (e.g. one-hot encoded features), are bundled to reduce the number of features. In this way the classifier reduces both the number of instances and the number of features in order to be fast, memory efficient, and highly accurate [45]. Like any other tree-based classifier, it does not need the normalization step. To sum up, the main properties of this classifier are [46]:

1. Fast training speed
2. Efficient memory usage
3. CPU and GPU support
4. Support for large-scale datasets

Although this classifier has great advantages, it is not easy to interpret correctly. A weighted-cost configuration is sketched below.
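As a sketch of how LGBM can be combined with a weighted cost function for an imbalanced binary task (the synthetic data and the weight value are illustrative assumptions, not the study's configuration):

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the P2P data (~5% positives)
X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = lgb.LGBMClassifier(
    n_estimators=100,
    num_leaves=31,
    scale_pos_weight=19.0,  # inverse of the ~19:1 class ratio (weighted cost)
)
clf.fit(X_tr, y_tr)

# Evaluate with the thesis's metric of choice
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```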

2.7.7 Stacked Classifier

This classifier, commonly used in data science competitions such as Kaggle [47], combines different types of classifiers (base-learners) and applies a new classifier (a meta-classifier) on top of their predictions. The meta-classifier learns the relationship between the predictions of the base-learners and the target class [48][49]; a minimal sketch follows.
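A minimal sketch of such a stacked classifier with scikit-learn, using logistic regression as the meta-classifier as in this study (the base-learners and the data are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1_000, weights=[0.9], random_state=0)

# Base-learners whose predictions become the meta-classifier's inputs
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("nb", GaussianNB()),
]

# Logistic regression learns how to combine the base-learners' predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
```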


2.8 Hyper–Parameter Tuning

Hyper-parameters cannot be learned by the classifiers and must be defined before the training process starts. There are different search methods to find the best hyper-parameters for each classifier. Two important ones are:

1. Grid Search Method:

For each hyper-parameter of a classifier, a range of possible values is defined by an expert. Then, all possible combinations of these values across the different hyper-parameters are tested, based on a validation set or K-fold cross-validation. Finally, the combination of hyper-parameters which maximizes the target result is obtained.

2. Random Grid Search Method:

When the hyper-parameter search space is large, this technique picks hyper-parameter combinations at random instead of examining all possible combinations. Random grid search expedites the search, but it does not guarantee finding the best possible combination of hyper-parameters for a classifier [50]. Both methods are sketched below.
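A minimal sketch of both search methods with scikit-learn, reusing the decision tree ranges listed in Section 3.4.3 (the data is an illustrative stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, weights=[0.9], random_state=0)

params = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(1, 13)),
    "min_samples_leaf": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
}

# Grid search: every combination, 10-fold cross-validation, AUC ROC scoring
grid = GridSearchCV(DecisionTreeClassifier(), params, cv=10,
                    scoring="roc_auc")

# Random grid search: only 10 sampled combinations, much faster
rand = RandomizedSearchCV(DecisionTreeClassifier(), params, n_iter=10,
                          cv=10, scoring="roc_auc", random_state=0)
rand.fit(X, y)
print(rand.best_params_)
```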

In this chapter, the theory needed to perform this study was elaborated. In the next chapter, the applied methodology is explained.


Chapter 3

Methodology

“If one approaches a problem with order and method, there should be no difficulty in solving it - none whatever!”

Agatha Christie, Death in the Clouds (1935)

The main focus of this study is to adapt existing theory on handling imbalanced datasets to predicting business process behavior, as this was not done in the previously studied publications. In addition, the applied research method in this thesis is pseudo-empirical, since we tested various data analysis algorithms rather than going out into the real world to make observations, in order to answer the proposed research question:

“What supervised machine learning algorithms are suitable for predicting undesired business process behavior on imbalanced datasets?”

This research question aims at solving a known problem by developing an undesired business process behavior prediction capability in the Celonis© software, based on existing research and on data from the real world. There are two main options for answering the research question:

1. Hypothesis-driven: a hypothesis is formed before testing it pseudo-empirically or algebraically, or proving it valid formally.

2. Exploratory: a large parameter space is explored to get to grips with the problem, without a defined hypothesis.

The exploratory option is chosen, since different supervised machine learning algorithms and imbalanced dataset handling techniques are explored over a large parameter space. Choosing the hypothesis-driven approach would have required an educated guess based on prior knowledge and observations, which was not possible in this study, as there were too many candidate algorithms to single out in a hypothesis.

The focus of this research is to choose suitable non-neural-network supervised machine learning algorithms, so quantitative data analysis is chosen to measure and evaluate the performance of the different algorithms, e.g. Naive Bayes, Decision Tree, Random Forest, and the Light Gradient Boosting Tree, based on the proposed evaluation metric (AUC ROC). Finally, the suitable classifier at each process step is chosen to answer the research question.

In a nutshell, business knowledge of the business process is obtained by reading relevant reports and documents, and the data understanding step is completed by Exploratory Data Analysis (EDA). Then, the knowledge gap is identified to form a research question, and the best evaluation metric for the given problem is chosen among the possible options. The data preparation step, which consists of removing incomplete cases, conducting feature engineering, filling missing values, generating classification labels, and encoding process cases, starts afterwards.

In the modeling phase, the prepared dataset is iterated over the process steps to create a larger labeled dataset. Then, different classifiers, chosen for their speed and evaluation metric performance in a small pilot study, are applied, and their results are stored so that the suitable classifier at each process step can be identified in the end. In the production step, the suitable model at each process step is used to predict, for ongoing cases, whether a 'Change Price' activity will happen, based on their case lengths. Figure 3.1 shows the applied methodology of this study.

Figure 3.1. Applied methodology in this study.

3.1 Input Dataset

A real-world SAP© dataset for a P2P process within a time frame of two months is used, as SAP© has the highest share of the Enterprise Resource Planning (ERP) market [51]. The dataset contains only the cases whose lengths lie within the 99th percentile of the existing case lengths, 52,261 cases in total. The description of this dataset is shown in Table 3.1.

| Field Name | Definition |
|---|---|
| CASE KEY | Case id |
| ACTIVITY EN | Activity name in English |
| USER NAME | Name of the user |
| USER TYPE | A person, a computer agent, etc. |
| PRICE | Price associated with the case id |
| CURRENCY | Currency type |
| QUANTITY | Quantity of ordered items/products |
| VENDOR | Name of the supplier |
| PURCHASINGDOCTYPE | Type of the purchasing document |
| PURCHASINGGROUP | Purchasing group |
| STORAGELOCATION | Location of the storage |
| PLANT | Plant |
| MATERIALGROUP | Group of the ordered items |
| PURCHASINGORGANIZATION | The organization which ordered the items |
| OUTLINEAGREEMENT | Availability of a long-term agreement with the supplier |
| COUNTRY | Country of the supplier |
| CITY | City of the supplier |
| YEAR | Year the activity occurred |
| MONTH | Month the activity occurred |
| DAY | Day the activity occurred |
| WEEKDAY | Weekday the activity occurred |
| HOUR | Hour the activity occurred |
| GAP | Time difference between two consecutive activities |

Table 3.1. The description of the input dataset.

3.2 Problem–Defining Step

In this step, the problem is defined based on the business and technical requirements, and the right evaluation metric for it is chosen:

3.2.1 Imbalanced Dataset

In this study, a cost-sensitive (weighted) cost function, ensemble algorithms, and the right evaluation metric are the approaches taken to tackle the imbalance issue. The weights of the cost-sensitive function are the inverses of the class ratios. The applied ensemble algorithms are explained in detail in Subsection 3.4.2, and the chosen evaluation metric is explained in the next section.


3.2.2 Evaluation Metric

Accuracy is the evaluation metric that first comes to mind, but it is not the right one when the dataset is imbalanced: classifiers will be biased toward the majority class to attain a high accuracy, so the result is misleading if this point is not taken into account. Of Recall and Precision, neither is meaningful on its own in this study; increasing one of these metrics decreases the other. In addition, measures such as F1 or F2 are not chosen, because they do not show the behavior of Precision or Recall individually; in general, Fβ is not commonly used, as other evaluation metrics are more robust and perform better [21].

Desk research showed that the AUC ROC metric is a good choice for binary classification tasks on imbalanced datasets, which is why the AUC ROC is chosen in this study. The interpretation of this metric depends on the problem context, as discussed in Section 2.4. To the best of our knowledge, an interpretation of a good AUC ROC in this domain does not exist. Previous studies which applied this evaluation metric did not compare their results with a baseline; they only claimed that the closer the AUC ROC is to 1, the better [3][17]. In addition, the use cases they implemented differ from ours, so we could not take their results as the baseline. The only interpretation of the AUC ROC we could find is the one the University of Nebraska Medical Center uses in its diagnostic tests (see Section 2.4). As detecting undesired behavior is a similar use case, we adopted their interpretation to evaluate the results obtained in this study.

3.2.3 Cut–off Threshold

In this study, the Euclidean distance approach (see Section 2.4) is adapted with a minor change to find the best cut-off threshold value. Instead of taking the exact Euclidean distance, a weighted distance D is used to give more importance to the TPR:

\[
D = \sqrt{(\mathrm{FPR})^2 + 2\,(1 - \mathrm{TPR})^2} \tag{3.1}
\]
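A minimal sketch of this threshold selection, assuming the reconstruction of Equation 3.1 above:

```python
import numpy as np
from sklearn.metrics import roc_curve

def weighted_cutoff(y_true, y_score):
    """Pick the threshold minimizing D = sqrt(FPR^2 + 2*(1 - TPR)^2)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    d = np.sqrt(fpr ** 2 + 2 * (1 - tpr) ** 2)
    return thresholds[np.argmin(d)]

# Usage: cutoff = weighted_cutoff(y_test, clf.predict_proba(X_test)[:, 1])
```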

3.3 Data Preparation Step

This step consists of selecting complete cases, designing the feature engineering, filling missing values, encoding the business process cases, and generating classification labels, as explained below:

3.3.1 Selecting Complete Cases

Creating classification labels is only possible for complete cases: no further action will happen for a complete case, and the whole trace of the case is known, which reveals the occurrence of the undesired activity (here 'Change Price') at the different process steps. Based on the gained domain knowledge, a P2P case is considered complete when it contains the "Create Purchase Order Item" activity and ends with either the "Clear Invoice", "Delete Purchase Order Item", "Refuse Purchase Order", or "Refuse Purchase Requisition" activity.

3.3.2 Designing Feature Engineering

Feature engineering plays a vital role in machine learning applications. Experience has shown that improvements in results come directly from the feature engineering stage. For example, Tax et al. reported achieving a better result than previous studies in predicting the next activity of business process cases by performing good feature engineering [4]. In this stage, business domain knowledge identifies the important features; that is why some features are removed and others are expanded. For example, the timestamp feature is expanded into year, month, day, weekday, and hour. Tax et al. also mentioned that adding interactions among features as new features had a high impact on their work [4]; an interaction could be, for instance, taking the difference of two features and adding it as a new feature called 'DIFF'. We adapted the feature engineering proposed by Tax et al. [4] to our study; for example, we added the time difference between two consecutive activities as a new feature called 'GAP' (see the sketch below).

In the categorical features, there is the possibility of different representations of a unique value; for example, Barcelona and Barçelona are two different representations of one city's name. To tackle this issue, similar values are mapped together to increase the data quality.
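A minimal sketch of the 'GAP' feature and the timestamp expansion with pandas (the column names are illustrative, cf. Table 3.1):

```python
import pandas as pd

# Toy event log with timestamps; the column names mirror Table 3.1
log = pd.DataFrame({
    "CASE_KEY": [506, 506, 507, 507],
    "TIMESTAMP": pd.to_datetime([
        "2016-02-03 13:34", "2016-02-03 20:00",
        "2016-02-03 18:58", "2016-04-04 13:20",
    ]),
}).sort_values(["CASE_KEY", "TIMESTAMP"])

# GAP: time difference between two consecutive activities within a case
log["GAP"] = (log.groupby("CASE_KEY")["TIMESTAMP"].diff()
                 .dt.total_seconds().fillna(0))

# Expanding the timestamp into calendar features
log["YEAR"] = log["TIMESTAMP"].dt.year
log["MONTH"] = log["TIMESTAMP"].dt.month
log["DAY"] = log["TIMESTAMP"].dt.day
log["WEEKDAY"] = log["TIMESTAMP"].dt.weekday
log["HOUR"] = log["TIMESTAMP"].dt.hour
```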

3.3.3 Filling missing Values

The applied supervised machine learning algorithms cannot handle datasets with missing values, as they take an input array in which all values are numeric [52]. That is why the input dataset must not have missing values. There are different ways to fill in missing values, as discussed in Section 2.5. For numerical features with missing values, the mean of the existing values is used; for categorical features with missing values, the 'No Value' option is applied.

3.3.4 Label Generation

Classifiers need target labels to perform the classification, and there are no labels in the extracted dataset. Therefore, classification labels are generated by looking through the whole trace of a business case. At each process step i, a target label_i is added to the dataset which shows whether the 'Change Price' activity happens in the future of the case, after process step i. For a case of length m, there are thus m target labels (see Figure 2.5 and the sketch below).
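A minimal sketch of label generation for one completed trace (the helper name is hypothetical):

```python
def labels_for_case(activities, undesired="Change Price"):
    """label_i = 1 iff the undesired activity occurs after process step i."""
    m = len(activities)
    return [int(undesired in activities[i:]) for i in range(1, m + 1)]

# Example: the undesired activity occurs at step 2, so only label_1 is 1
print(labels_for_case(
    ["Create Purchase Order Item", "Change Price", "Clear Invoice"]))
# -> [1, 0, 0]
```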


3.3.5 Data Encoding

First, the dataset is encoded using the index–based method. Then, the categorical features are transformed to numbers using the one–hot method (see Section 2.6).

3.4 Modeling Step

The modeling step consists of splitting into training and test sets at each process step, applying the classification algorithms, and hyper-parameter tuning, as explained below:

3.4.1 Train and Test sets split

At each process step i, the business process cases with case length at least i are selected. Then, their activities and features from process step 1 to process step i, together with label_i, are kept (see Section 2.1). This prepared dataset is split into a training and a test set such that both sets have the same target class ratio, as sketched below.
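A minimal sketch of such a ratio-preserving (stratified) 80/20 split, with illustrative stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for the encoded dataset at some process step i
X_step, y_step = make_classification(n_samples=1_000, weights=[0.95],
                                     random_state=0)

# 80/20 split; stratify keeps the target class ratio equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X_step, y_step, test_size=0.20, stratify=y_step, random_state=0)
```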

3.4.2 Classification algorithms

First, various classifiers, e.g. Naive Bayes, SVM, K-Nearest Neighbors (KNN), logistic regression, and tree-based classifiers, were investigated in a small pilot. A dummy classifier is included as a baseline against which to compare the results of the other classifiers. The remaining classifiers were chosen based on their performance (speed and evaluation metric) in the small pilot. All in all, the classifiers selected for the final setup are:

• Dummy Classifier

• Naive Bayes

• Decision Tree (CART)

• Random Forest

• LGBM

• Stacked Classifier:

Logistic Regression as the meta–classifier on the base–learners.

3.4.3 Hyper-Parameter Tuning

In this stage, different sets of hyper-parameters are chosen for each classifier. Based on the discussion in Section 2.8, the Grid Search and Random Grid Search methods are applied, integrated with the K-fold approach [53]. The training set is divided into K subsets at random; K − 1 of these subsets are used to train the classifier and 1 subset is used to validate it. This process is repeated K times, where each time 1 subset is held out for validation and the rest are used for training. In the end, the average over these validations lets us choose the best hyper-parameters.

The parameters used in this study are grouped into two categories:

1. Global parameters:

| Parameter | Value |
|---|---|
| Train/Test split | 80%/20% |
| K (in K-folding) | 10 |
| Number of classifiers in the stacked classifier | Top two |
| Number of iterations in the random grid search | 10 |

Table 3.2. Global parameters.

The reason for choosing only the top two classifiers for the stacked classifier is the infrastructure limitations of this study; choosing all of them, or selecting diverse types of classifiers, should not make a difference.

2. Hyper–parameters:

• Dummy Classifier and Naive Bayes: No hyper–parameters

• Decision Tree:

The information gain criterion: [’gini’,’entropy’]

The maximum depth: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

The minimum samples leaf: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

• Random Forest:

The information gain criterion: [’gini’, ’entropy’]

The number of estimators:

[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]

The maximum depth: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

Bootstrap (sample instances with replacement): [True,False]

• LGBM:

The learning rate: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]

The number of estimators:

[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]

The number of leaves: [10, 20, 30, 40, 50]

Alpha (regularization): [0.0, 0.01, 1, 10, 100, 1000]

Lambda (regularization): [0.0, 0.01, 1, 10, 100, 1000]

• Stacked Classifier:

C is the inverse of regularization strength: [0.1, 1, 10]


3.5 Production Step

Finally, it is time to apply the predictions to the ongoing business process cases. The same stages of the data preparation and modeling steps are implemented, with two differences: (a) incomplete cases are selected, as the predictions concern ongoing cases, and (b) there are no training, validation, or test sets in this step. Instead, the suitable classifier per process step is picked from the modeling step and applied to the ongoing cases according to their case lengths. In the end, the final results are sent to the Celonis software, as shown in Table 3.3.

| Case ID | Prediction ("Change Price" occurrence) |
|---|---|
| 106883 | Yes |
| 344371 | No |
| 576623 | No |
| 754202 | Yes |
| ... | ... |

Table 3.3. Results of predictions on ongoing cases.

In this chapter, the applied methodology of this study was elaborated. In the next chapter, the empirical results obtained by applying this methodology are presented.
