
Accelerating Model Driven Platforms: A study into the predictability of business process models


Academic year: 2021


A study into the predictability of business process models

Submitted August 2019, in fulfilment of the conditions for the degree of BSc Industrial Engineering & Management.

Frits Sieds Tuininga s1739409

Supervised by A. Abhishta, S. Kaya & L.O. Meertens

Department of Behavioural, Management & Social Sciences University of Twente

Study conducted at

CAPE Groep & eMagiz


Acknowledgements

Throughout the writing of this bachelor Thesis, I have received support and assistance from numerous people. Firstly, I would like to express my gratitude towards my supervisors, MSc A. Abhishta, Dr. L.O. Meertens and MSc S. Kaya. Abhishta acted as my first UT supervisor and provided valuable insights regarding both the construction of this Thesis and the machine learning analysis. Meertens acted as my second UT supervisor and provided very detailed feedback regarding this Thesis. Moreover, Meertens made it possible to access all on-ramp related dataflows from the eMagiz platform. These were of vital importance for this research. Kaya ensured the research followed the right direction, not only regarding the Thesis, but also regarding the prototype. Moreover, Kaya set up a series of important meetings with eMagiz users and developers. The support of these three supervisors was essential for the creation of this Thesis.

During this study a number of fellow students were essential for the creation of this Thesis.

Hence, I would like to thank M. Blume, N. Bussmann, W. Klaassen, B. van Tintelen and A. van Vlastuin. Blume provided valuable feedback regarding this Thesis. Bussmann gave insights regarding the use of machine learning algorithms. The translation of dataflows into a table which is suitable for machine learning was established through discussions with both Klaassen and van Vlastuin. Furthermore, both were involved in the creation of this report. Van Tintelen assisted with the understanding and application of machine learning.

From my CAPE Groep colleagues, I would like to thank A. van Brakel and A. Willemsen.

Van Brakel created a demo of the assistance tool in action. In the beginning, the assistance tool could not be implemented; therefore, a demo could demonstrate the inner workings and appearance of the final product. Willemsen provided the first example dataflow and explained how it worked.

I would like to express my gratitude towards my parents Y. and A. Tuininga. Both have provided valuable feedback on the Thesis. Lastly, I want to thank my dog Koda for her moral support and overall enthusiasm.


Executive Summary

In the modern business world, communication between distinct business units is generally achieved by means of multiple computer systems. In some cases, these systems cannot communicate directly. Hence, an additional program must be added to make communication possible. Typically, the construction of this program can take quite some time.

eMagiz is a company which creates such programs by means of a model-driven platform (which is also called eMagiz). This research explores the opportunities for reducing the time spent on creating such a program in eMagiz.

To reduce the time spent on creating programs in a model-driven platform, an assistance tool is created. This assistance tool helps a user to create programs more effectively and efficiently. The assistance tool is based on a thorough machine learning analysis, which is described in detail within this report. During this study, the following research question is answered: How can user actions in a model-driven platform be predicted, with machine learning, to increase modelling speed with the help of an assistance tool? This question is answered in two phases.

Firstly, the available data (derived from the model-driven platform) is converted into a format which machine learning algorithms can understand.

Secondly, ten machine learning algorithms are tested on the available data. The algorithms are compared, and the algorithm that makes the best predictions is used as the basis for the assistance tool.

The study found that, given the eMagiz data set, the machine learning algorithm Random Forest performs best. eMagiz is recommended to apply this algorithm to its model-driven platform, where it can act as the assistance tool.

Future research must show whether the tool does increase modelling speed. eMagiz management has indicated it will implement the assistance tool as soon as possible. In addition, if the assistance tool does increase modelling speed, it could be applicable to other model-driven platforms as well. This would make it useful not only for eMagiz, but for all corporations that use a model-driven platform and wish to increase modelling speed. Future research must show whether this hypothesis holds true.


Reader’s Guide

The research described in this report was conducted in fulfilment of the degree of BSc Industrial Engineering & Management. This report is divided into five distinct chapters.

These chapters are introduced below:

Chapter 1 introduces the reader to the research. Here, the involved parties, overall problem and the structure of the report are elaborated upon. In addition, the current situation and the opportunities for improvement are discussed. Lastly, the general aim and measurability of the research are described.

Chapter 2 explains which literature from the scientific body of knowledge was applied within this research. This chapter provides the reader with essential knowledge required for the research approach. In addition, a substantiation is given of the choices made.

Chapter 3 outlines the research approach in two distinct phases. The first phase elab- orates on the manner in which data is converted. The second phase shows how machine learning algorithms are compared.

Chapter 4 describes the results of the research conducted. Moreover, the appearance of the prototype is explained.

Chapter 5 provides the reader with the recommendations and conclusions of this research. In addition, the opportunities for future research are explored in the Discussion section.


Contents

Acknowledgements
Executive Summary
Reader’s Guide

1 Introduction
1.1 Research Aim
1.2 Context
1.2.1 CAPE Groep
1.2.2 eMagiz
1.2.3 Stakeholders
1.2.4 Current Situation
1.2.5 Problem Identification
1.3 Scope & Limitations
1.4 Research Question
1.5 Methodological Framework
1.6 Structure of the Thesis

2 Theoretical Framework
2.1 Brief Summary
2.2 Data Analysis
2.3 Types of Machine Learning
2.4 Applied Machine Learning Algorithms
2.5 Validation
2.5.1 Train and Test Sets
2.5.2 K-fold Cross-Validation
2.5.3 Balance score
2.6 ROC Curve

3 Design & Development
3.1 Brief Summary
3.2 Data Conversion
3.2.1 Summary Data Conversion
3.2.2 XML Conversion
3.2.3 Routes
3.2.4 Exclusion
3.2.5 Tables
3.3 Comparing Algorithms
3.3.1 Summary Comparing Algorithms
3.3.2 Accuracy Scores
3.3.3 F-scores
3.3.4 Number of Suggestions
3.4 ROC Curve Application

4 Implementation & Demonstration
4.1 Brief Summary
4.2 Results
4.2.1 One Prediction
4.2.2 Impact Number of Suggestions
4.2.3 Five Predictions
4.2.4 ROC Curve
4.3 Design of the Prototype

5 Recommendation & Conclusion
5.1 Brief Summary
5.2 Recommendations
5.2.1 Application
5.2.2 Continuous Improvement
5.2.3 Applicability to Other eMagiz Components
5.2.4 Another Point of Improvement
5.3 Discussion
5.3.1 Measuring Impact
5.3.2 Broader Application
5.3.3 Choice of Algorithms
5.3.4 Usage of Input Data Sets
5.4 Conclusions
5.4.1 Answering the Research Question
5.4.2 Advantage Routes
5.4.3 Advantage Rule Based Learning
5.4.4 ROC curve
5.4.5 Number of suggestions

Bibliography
References
Appendices
A The eMagiz platform
B Iris flower data set
C XML converter
D Comparison program


List of Tables

2.1 Labelled data (Fisher’s Iris data set)
2.2 Overview of Machine Learning Algorithms
3.1 Example of Routes 1
3.2 Example of Routes 2
3.3 Overview of all routes
3.4 Input data sets applied
4.1 Results Comparison Phase
4.2 Results Comparison Phase
A.1 Comparison standard system and eMagiz bus


List of Figures

1.1 eMagiz Bus
1.2 eMagiz bus in detail
1.3 Example dataflow
1.4 eMagiz editor
1.5 BPM Current Situation
1.6 Problem Cluster
1.7 Framework Design Science Research Methodology
2.1 Overview Types Machine Learning
2.2 Train and Test Set
2.3 K-Fold cross validation (K = 10)
2.4 Example Confusion Matrix
2.5 Basic Confusion Matrix
2.6 ROC Curve
3.1 Example route
3.2 Wrong assumption
4.1 Random Forest ROC
4.2 Design of the Prototype
A.1 Two nodes with associations
A.2 Multiple nodes with associations
A.3 eMagiz bus with associations


Chapter 1 Introduction

This chapter provides the reader with 1.1) Research Aim, 1.2) Context, 1.3) Scope & Limitations, 1.4) Research Question, 1.5) Methodological Framework and 1.6) Structure of the Thesis.

1.1 Research Aim

Over the years, computer systems became increasingly important in the business world. Typically, within large companies, information is shared between multiple computer systems. Sometimes these systems are able to communicate directly, but sometimes an extra translation step must be taken. From this perspective, computer systems are quite similar to humans. A Dutchman could, for instance, communicate flawlessly with another Dutchman (direct communication), but when the Dutchman tries to speak with a Japanese person, the need for a translator arises (indirect communication). It is essential for businesses that internal computer systems are able to communicate fast without losing relevant information. Hence, the aim of this study is to make the ’translation process’ between different computer systems as efficient as possible.

1.2 Context

1.2.1 CAPE Groep

The client of this research is CAPE Groep (CAPE Groep, n.d.). This corporation is located in Enschede and is specialised in digital integration and low-code platforms. Digital integration refers to the ’translation process’ described earlier. If distinct computer systems are able to communicate with the assistance of a ’translation process’, then these systems are digitally integrated.

A low-code platform is software that provides an environment in which applications (or apps) can be created. As the name suggests, a low-code platform does not require the user to be well educated in programming. Instead, users that have little or no experience in programming can still develop apps.

1.2.2 eMagiz

CAPE Groep has constructed its own low-code platform, made with the sole purpose of digital integration. This platform is known as eMagiz (eMagiz, n.d.). As time went by, eMagiz became increasingly important. Large customers, such as the Royal BAM Groep, use eMagiz to integrate their internal systems. eMagiz grew so large that it was decided to make it a subsidiary of CAPE Groep. This research concentrates on optimising one of the internal processes of eMagiz and applying the solution in a broader context.

1.2.3 Stakeholders

For this research, three distinct stakeholders are of importance: the eMagiz developer, the eMagiz user and the end-user. The eMagiz developer is responsible for the development and maintenance of the eMagiz platform. This person makes adjustments to the eMagiz platform based on feedback received from the customer. The developer ensures that the platform is of the highest quality, which makes it more appealing for potential customers. Quality, in this sense of the word, refers to both high user-friendliness and high functionality. The eMagiz developer is an employee of the eMagiz corporation.

The second stakeholder is the eMagiz user. This person makes digital integrations within the eMagiz platform. The user wants to create digital integrations as fast as possible while keeping potential malfunctions to an absolute minimum. A series of companies use eMagiz and can hence be described as eMagiz users, but the most important eMagiz user is CAPE Groep.

The third stakeholder is the end-user. This person can be seen as a customer. The end-user outsources the digital integration process to the eMagiz user and requires a ready-made product. The wishes of the end-user are similar to the wishes of the eMagiz user. Both want to implement digital integrations as quickly as possible while keeping potential malfunctions to a minimum. The key difference is that the end-user purchases the product (digital integration) while the eMagiz user sells the product. In general, the end-user is another company.

1.2.4 Current Situation

To integrate systems, eMagiz uses a concept known as a ’bus’. Via a bus, systems can be connected relatively easily (APPENDIX A). The bus can be seen as a translation program which makes communication possible between multiple systems.

Figure 1.1: eMagiz Bus


In Figure 1.1, the bus acts as translator between systems 1 and 2. The arrows represent the direction to which information is sent. Information flowing through the eMagiz bus is known as a ’message’ and shall be referred to as such further on in this report.

In Figure 1.2, for two systems, the eMagiz bus is displayed in more detail. Within the eMagiz bus five components are displayed: entry connector, on-ramp, routing, off-ramp and exit connector. Each component performs an action on the message received. This implies that, if necessary, data could be removed or modified. Furthermore, the modified message is sent to the next component (or system). As one might observe, information can flow from system 1 to system 2 and vice versa. So, two ’information flows’ are displayed in this example.

Figure 1.2: eMagiz bus in detail

Figure 1.2 does not cover all actions taken within the eMagiz bus. Actually, each component consists of a series of actions. The number of actions and the order in which actions are executed are determined by the eMagiz user. The eMagiz user accomplishes this by means of a dataflow. A dataflow is a concatenation of actions (visualised in Figure 1.3).

In this research, these actions are referred to as ’building blocks’.

Figure 1.3: Example dataflow

In Figure 1.3 a series of building blocks is shown. The building block on the far left (green circle) is referred to as a ’starting block’. Similarly, the building block on the far right (circle with green edge) is referred to as an ’ending block’. Consider the entry connector. If a message enters the entry connector, then it starts at the starting block. Next, a series of actions is performed. Finally, the message reaches an ending block and leaves the entry connector. Similar actions take place in the remaining four components.
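Conceptually, a dataflow of this kind can be represented as an ordered sequence of building blocks, from which (prefix, next block) pairs can be derived for prediction. The sketch below illustrates this idea only; the block names are made-up placeholders, not actual eMagiz building-block names.

```python
# Hypothetical sketch: a dataflow as an ordered list of building blocks.
# The block names are illustrative, not actual eMagiz identifiers.
dataflow = ["starting block", "transformer", "filter", "ending block"]

# Each prefix of the sequence can serve as an input, with the block that
# follows it as the correct output (label).
pairs = [(dataflow[:i], dataflow[i]) for i in range(1, len(dataflow))]

for prefix, next_block in pairs:
    print(prefix, "->", next_block)
```

Framed this way, suggesting the next building block becomes a supervised prediction problem over such pairs.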

Currently, in the eMagiz platform each dataflow is mostly constructed manually. In Figure 1.4 a dataflow of an eMagiz bus component is displayed. Circled in red, the list of all available building blocks is shown.


Figure 1.4: eMagiz editor

Within this list, a user can search for a specific block. Subsequently, the user drags and adds the building block to the dataflow. This process is repeated until the dataflow meets the expectations of the user. This process is shown in the form of a BPM (Business Process Model) in Figure 1.5.

Figure 1.5: BPM Current Situation

The start block (green) represents the start of the dataflow construction process. Similarly, the end block (red) represents the end of the dataflow construction process. The four blue blocks represent the actions required for constructing the dataflow. The diamond (yellow) represents a decision the eMagiz user must make. This diamond block represents the question: ’Is the dataflow finished, or should other building blocks be inserted?’ If the dataflow is not finished, then the cycle continues, and the four actions should be performed again. If the dataflow is finished, then the process stops. To improve the process of constructing dataflows, both the number and duration of tasks could be reduced.

1.2.5 Problem Identification

According to eMagiz management, the time spent on creating dataflows can be reduced. Hence, the norm is that dataflows are constructed in X amount of time, while in reality dataflows are constructed in Y amount of time (where X < Y). This leads to a discrepancy between norm and reality (Heerkens, 2017). eMagiz management indicated that the construction of dataflows could be sped up if an assistance tool were introduced. This tool could assist the eMagiz user in constructing dataflows. The reasoning behind this is to reduce the time spent on constructing dataflows by means of automation. Logically, if automation is implemented, then the productivity of the eMagiz user improves. This idea should be investigated before it is implemented. Through a series of unstructured interviews, a set of problems was identified. To obtain more insight into the causality between problems, a problem cluster was created from the relevant problems (Figure 1.6).

Figure 1.6: Problem Cluster

Each block in the problem cluster represents a problem as stated by eMagiz personnel.

These problems were found as a result of an unstructured interview with the eMagiz competence center (eMagiz developers and users). Causality between problems is shown in arrow form. For instance, problem 1 is the cause of problem 5. On the right, problem 10 (blue) is displayed. This is the action problem. The action problem captures the discrepancy between norm and reality. This research is aimed at solving this problem.

The action problem can, however, not be solved directly. One must find a series of problems which have no causes and directly or indirectly influence the action problem.

These problems are known as candidate core problems and are displayed in red on the left of the problem cluster. From these four candidate core problems the actual core problem must be identified. The research aims at solving the core problem and as a result solving the action problem. So, from the problem cluster it became clear that there are four candidate core problems. Problem 2 cannot be influenced easily. This was made very clear in an unstructured interview with the eMagiz developers. Also, a client would rather not replace the digital systems in use. Therefore, problem 3 cannot be the core problem.

Problem 4 can be influenced. eMagiz management could target clients that originate from the same sector. Nevertheless, eMagiz management explained it was a conscious choice to target clients from multiple sectors. The reasoning behind this is that risk is spread. For example, if demand for transportation by trucks decreases, then the transport sector will economise. This could imply that demand for digital integration drops rapidly. To prevent too much dependence on one sector, eMagiz wishes to spread risk. As a result, problem 4 cannot be the core problem. This leaves problem 1. This problem can be influenced and reduces the time spent on creating dataflows. Therefore, problem 1 is chosen as the core problem.

Another point of interest is the building blocks displayed to the eMagiz user during the dataflow construction process. As shown in Figure 1.4, a list filled with all different types of building blocks is displayed. It is, however, possible that a high number of blocks are never used within the on-ramp. These blocks could, for instance, be entry-connector related. If this is indeed the case, then building blocks which are not used could be relocated. This makes it easier for the eMagiz user to find the desired building block, because obsolete blocks are removed from the list. As a result, the time spent on constructing dataflows could decrease.

To solve the core problem an assistance tool can be implemented. This tool makes it easier for the eMagiz user to construct dataflows. Also, the modelling speed increases.

1.3 Scope & Limitations

This study focuses exclusively on the on-ramp component of the eMagiz bus. The justification for this is that not all five components can be automated within the given time-span of a bachelor thesis. eMagiz has created a large number of dataflows for its customers, which were made accessible for this research. As described earlier, the aim of this research is to create an assistance tool. This tool can only make valuable suggestions if these suggestions are based on a thorough analysis. This analysis gives the best results if patterns are visible in the data. From a discussion with the eMagiz competence center (eMagiz developers and users) it became clear that patterns are most visible in both the on-ramp and the off-ramp. Consequently, it does not matter on which of these components the focus lies. In the end, it was decided to explore the possibility of an assistance tool for the on-ramp in further depth. Hence, the off-ramp component is excluded.

Machine learning algorithms are very suitable for analysing the data as provided by eMagiz. An algorithm is a set of instructions to solve a certain problem. Machine learning implies that a computer teaches itself to recognise certain patterns in the data provided. From the recognised patterns, a computer can make valuable predictions. Hence, machine learning is used for analysing dataflows and acts as the basis for the assistance tool.

Python is a programming language capable of applying machine learning algorithms. A huge advantage of Python is that it has built-in machine learning libraries. A library consists of multiple lines of code. In the case of machine learning libraries, this implies that an algorithm can be applied without writing the entire program yourself. Scikit-learn (Pedregosa, 2011) is such a machine learning library. This library is applied for all machine learning algorithms in this report, with the exception of rule-based learning. The reason for this is that rule-based learning is based on statistics rather than pattern recognition.
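As a minimal illustration of what such a library offers (using a toy data set, not the eMagiz data), scikit-learn reduces applying an algorithm, here a Support Vector Machine, to a handful of calls:

```python
from sklearn.svm import SVC

# Toy labelled data: one input feature, two output classes.
X = [[0], [1], [2], [3]]
y = ["low", "low", "high", "high"]

# The library supplies the algorithm; fitting and predicting each take
# a single call, with no algorithm code written by hand.
clf = SVC().fit(X, y)
print(clf.predict([[2.5]]))
```

The same fit/predict interface is shared by the other scikit-learn algorithms, which is what makes comparing many algorithms on one data set practical.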


1.4 Research Question

Since machine learning is applied to solve the core problem, the research question is formulated as follows:

How can user actions in a model-driven platform be predicted, with machine learning, to increase modelling speed with the help of an assistance tool?

1.5 Methodological Framework

The methodological framework applied for this research is the Design Science Research Methodology (DSRM) (Peffers, 2007). This methodology is used to create ’things’. In this research, these ’things’ are referred to as prototypes. The idea behind the concept of a prototype is that it serves a human purpose and is hence more focused on creation rather than description.

The aim of this research is to create an assistance tool for eMagiz. This assistance tool acts as the prototype. The overall aim of the assistance tool is to reduce the time spent on constructing dataflows. In other words, the assistance tool is a prototype that serves a human purpose. Therefore, the DSRM is the most suitable methodological framework and is used as a guideline throughout this report. The DSRM consists of six phases. These phases are visualised in Figure 1.7.

Figure 1.7: Framework Design Science Research Methodology

The first and second phases are described in this chapter. Chapter 3 is dedicated to the third phase. Similarly, chapter 4 is dedicated to the fourth phase. The fifth and sixth phases are captured in chapter 5.

1.6 Structure of the Thesis

After identifying the core problem (section 1.2.5), the research concentrates on building and validating the assistance tool. In chapter 2 Theoretical Framework, the techniques required for building and validating the tool are elaborated upon. Chapter 3 Design & Development describes how the techniques from chapter 2 are applied. Moreover, it describes the problem approach. In chapter 4 Implementation & Demonstration, the results of the research are displayed. In chapter 5 Recommendation & Conclusion, the recommendation, discussion and conclusion are described.

For Chapter 2 Theoretical Framework the following questions are answered:

1. Which literature can be applied to understand and apply machine learning?


(a) In general, what does machine learning imply according to literature?

(b) Which machine learning algorithms should be applied according to literature?

(c) Which built-in Python library should be used according to literature?

(d) How, according to literature, can machine learning algorithms be compared and evaluated?

For Chapter 3 Design & Development, the focus lies on the research approach and prototype development. In this chapter, the following questions are answered:

2. What is the research approach and how was the prototype developed?

(a) What are the requirements of the desired prototype?

(b) Which type of management framework should be applied to the research approach and design?

(c) Can the mechanisms of the prototype be visualised to make replication of the research possible?

(d) Which data is used?

(e) How was the data used?

(f) If applicable, which data should be excluded?

(g) How is the effect of the prototype measured?

For Chapter 4 Implementation & Demonstration, the research is viewed as a process.

The usage and effect of the prototype are discussed. Within this chapter, the following questions are answered:

3. How does the assistance tool work?

(a) How does the assistance tool appear?

(b) What input does the analysis tool require?

(c) What output does the analysis tool produce?

4. What results do the machine learning algorithms generate?

(a) What is the Ac score for each machine learning algorithm?

(b) What is the Fs score for each machine learning algorithm?

(c) What is the Dt score for each machine learning algorithm?

In Chapter 5 Recommendation & Conclusion, the outcome of the research is described and evaluated. Advice is given regarding the construction of dataflows. Furthermore, the research implementation is evaluated. Within this chapter, the following questions are answered:


5. What are the recommendations and conclusions of the research?

(a) What is the link between results and the conclusions?

(b) How are the recommendations and conclusions substantiated?

(c) For eMagiz, what unexpected points of improvement were encountered during the research?

6. What points of improvement does the research have?

(a) In hindsight, was the research design correct?

(b) Could the time spent on constructing dataflows be reduced by other means? How effective would that be?

(c) Could the accuracy be positively influenced by aspects that were left out of account?

(d) Are old and new dataflows equally relevant for analysis?


Chapter 2 Theoretical Framework

In this chapter, the techniques applied for both analysis and validation are discussed. Also, the decisions made are substantiated with literature. The following sections are discussed: 1) Data Analysis, 2) Types of Machine Learning, 3) Applied Machine Learning Algorithms, 4) Validation and 5) ROC Curve. In the Data Analysis section, a substantiation is given in favour of a machine learning analysis. Furthermore, the different types of machine learning and the algorithms applied are discussed. The Validation section elaborates on how the machine learning algorithms can be evaluated. Lastly, the ROC Curve section describes how the predictions made are optimised. All sections are substantiated with existing literature.

2.1 Brief Summary

For the eMagiz data set, supervised machine learning proved to be the most suitable type of machine learning. Therefore, ten supervised machine learning algorithms are chosen for the comparison phase (further explained in chapter 3 Design & Development). The techniques used to compare the algorithms are accuracy and F-score (Sokolova, 2006). In addition, K-fold cross-validation is applied to validate the algorithms. Lastly, the ROC Curve is used to determine a threshold that generates optimal results.
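As a sketch of how these techniques fit together, shown here on the public Iris data set rather than the eMagiz data, scikit-learn can run 10-fold cross-validation and report both accuracy and a macro-averaged F-score for a Random Forest:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative only (not the eMagiz data set): validate a Random Forest
# with K-fold cross-validation (K = 10), scoring each fold twice.
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0)

acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
f1 = cross_val_score(clf, X, y, cv=10, scoring="f1_macro")

print(f"mean accuracy: {acc.mean():.2f}")
print(f"mean F-score:  {f1.mean():.2f}")
```

Averaging the per-fold scores gives one accuracy and one F-score per algorithm, which is the basis on which the ten algorithms can be compared.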

2.2 Data Analysis

The assistance tool must generate helpful and valuable suggestions. To achieve this, a thorough data analysis must be executed. For the assistance tool to generate the most helpful suggestions, it is essential to select the most suitable form of data analysis.

Since suggestions are made based on existing patterns in the data, it would be ideal to implement a form of data analysis which includes pattern recognition. For this reason, machine learning appears to be the reasonable choice. ’There are several applications for Machine Learning (ML), the most significant of which is data mining. People are often prone to making mistakes during analyses or, possibly, when trying to establish relationships between multiple features. This makes it difficult for them to find solutions to certain problems. Machine learning can often be successfully applied to these problems, improving the efficiency of systems and the designs of machines.’ (Kotsiantis, 2007) For the reasons outlined, it was decided to apply a machine learning analysis to the data provided.

2.3 Types of Machine Learning

In general, there are three types of machine learning: 1) unsupervised, 2) supervised and 3) reinforcement machine learning (Figure 2.1). To select the most appropriate type, the concepts of labelled and unlabelled data must be introduced.

Figure 2.1: Overview Types Machine Learning

Labelled data implies that the data set used contains both input and (correct) output data (Kotsiantis, 2007). An example of labelled data is given in Table 2.1. Here, a small sample of the famous Iris flower data set (Fisher, 1936) is displayed. A machine learning algorithm can use the input data (Sepal length, Sepal width, Petal length and Petal width) and make predictions based on that. Predictions, in this case, refer to the Flower species. For instance, a machine learning algorithm reads that the Sepal length, Sepal width, Petal length and Petal width are 4.9cm, 3.0cm, 1.4cm and 0.2cm respectively. Given this input data, the algorithm predicts that the Flower species is Iris-setosa. Furthermore, the algorithm can check whether the prediction made was correct and adjust its prediction strategy accordingly. In other words, the machine learns from experience! For more information about the Iris flower data set and its role within machine learning, please see APPENDIX B.
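The prediction mechanism described above can be sketched with scikit-learn on the Iris data set; Gaussian Naïve Bayes is used here purely as an illustrative choice of algorithm:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Split the labelled data, learn from the training part, then predict the
# species for one measurement of
# (sepal length, sepal width, petal length, petal width) in cm.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
pred = clf.predict([[4.9, 3.0, 1.4, 0.2]])
print(iris.target_names[pred[0]])
```

Because the data is labelled, the held-out test set can also be used to check how often such predictions are correct.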

The data, as provided by eMagiz, is labelled. Unsupervised learning requires unlabelled data. As a result, this type of machine learning is unsuitable for analysing the eMagiz data set. Reinforcement learning is indifferent to the labelling of data sets. This type of machine learning is, however, used to interact with an environment. Reinforcement algorithms can, for example, play chess against another player. After playing a number of matches, the algorithm notices which moves work well and which moves do not. Then it changes its strategy accordingly. In the case of the eMagiz data set, however, the algorithm cannot interact with an environment. It does not play against another player. Therefore, reinforcement learning is not suitable for analysing the eMagiz data set. Supervised learning, on the other hand, works with labelled data and does not need to interact with an environment. This makes supervised learning the most suitable machine learning type to apply to the eMagiz data set.

Input data (cm)                                         Output data
Petal size   Petal width   Sepal length   Sepal width   Flower species
5.1          3.5           1.4            0.2           Iris-setosa
4.9          3.0           1.4            0.2           Iris-setosa
6.1          2.9           4.7            1.4           Iris-versicolor
5.6          2.9           3.6            1.3           Iris-versicolor
6.9          3.1           5.4            2.1           Iris-virginica
6.7          3.1           5.6            2.4           Iris-virginica

Table 2.1: Labelled data (Fisher’s Iris data set)


2.4 Applied Machine Learning Algorithms

Supervised machine learning is an umbrella term for a series of machine learning algorithms. Kotsiantis, in his paper Supervised Machine Learning: A Review of Classification Techniques, described the most effective supervised machine learning algorithms (Kotsiantis, 2007). Kotsiantis categorised these algorithms into six types: 1) Decision Tree, 2) Neural Networks, 3) Naïve Bayes, 4) k-Nearest Neighbors, 5) Support Vector Machines and 6) Rule based learners. Since k-Nearest Neighbors is highly intolerant to noise, it was not considered further. From the remaining categories, the following machine learning algorithms were used to analyse the eMagiz data set:

Decision Tree: As the name suggests, it uses a tree-like model for decisions. It makes decisions based on attributes, and makes a prediction by interpreting the decisions made.

Extremely Randomized Trees: This algorithm is similar to Random Forest. The difference is that the splits in Random Forest are determined deterministically, while the splits in Extremely Randomized Trees are chosen at random.

Multiclass Logistic Regression: Logistic Regression plots input (X) against output (Y), where the Y-values are binary. The one-versus-rest technique makes it possible to apply it to multiple classes: multiple plots are made in which the output falls either in a given class A or in all other classes. The algorithm tests which plot fits the given input data best and makes a prediction based on that.

Naïve Bayes Bernoulli: This algorithm implements Naïve Bayes training and classification. It assumes binary-valued feature vectors and otherwise binarizes its input (which is the case in this research).

Naïve Bayes Gaussian: This algorithm implements Naïve Bayes training and classification. It assumes a normal distribution and makes predictions based on that. In general, Naïve Bayes Gaussian works best with continuous data. Since there is still a chance that this algorithm performs well even though the given data is discrete, it is applied anyway.

Naïve Bayes Multinomial: This algorithm implements Naïve Bayes training and classification. It assumes that the data is multinomially distributed and makes predictions accordingly.

Neural Network: A Neural Network optimises the log-loss function by applying either stochastic gradient descent or L-BFGS.

Random Forest: Random Forest consists of multiple decision trees. The data set used for each decision tree consists of a series of rows from the original data set; a row from the original data set might appear multiple times or not at all in the newly constructed data set. In other words, the newly constructed data sets are constructed at random. Each decision tree makes a prediction, and the aggregate of all predictions is the prediction the random forest makes.

Rule Based Learning: As the name suggests, the algorithm makes predictions based on a rule. It counts the total number of combinations of building blocks (input and output) and makes a sorted table from it (highest to lowest occurrence). When given input data, the algorithm searches for rows which share the same input, groups these rows, and makes several predictions starting from the most frequent row. When the given input does not exist in memory, the algorithm simply predicts the most frequently occurring output, independent of the input.

Support Vector Machines: SVM plots the input data over a number of dimensions. For instance, if the input is three blocks, then this data is plotted in a three-dimensional space. The algorithm plots all points and determines their class. If the output can take the form of multiple classes, then the one-versus-rest principle is applied. Once the points are plotted, a separating line is added which separates the two classes. Based on this separating line, the output for new input is determined.

Table 2.2: Overview of Machine Learning Algorithms


As mentioned in chapter 1 Introduction, Rule Based Learning is not part of the Scikit-learn library. This algorithm is based on simple statistics. The data on which the algorithm is trained consists of input and output. For example, after the combination of building blocks A-B (input) comes block C (output). The Rule Based Learning algorithm counts the number of times a certain combination of input and output appears in the data set. For instance, the combination of building blocks A-B (input) and block C (output) appears twenty times in the data set. From the combinations and their numbers of occurrences, the algorithm makes an ordered list (highest occurrence to lowest). Consequently, the algorithm makes predictions based on this ordered list. So, if the algorithm encounters the combination A-B, then it searches for this input in the list, starting from the top. Once the combination is found, the algorithm gives the corresponding output (for example block C).
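A minimal sketch of this counting logic in pure Python (the class and method names are my own, not the names used in the actual program):

```python
from collections import Counter

class RuleBasedLearner:
    """Minimal sketch of the rule-based learner described above."""

    def __init__(self):
        self.counts = Counter()    # (input blocks, next block) -> occurrences
        self.fallback = Counter()  # next block -> occurrences, for unseen input

    def fit(self, routes):
        # Each route is a list of building blocks; the last block is the
        # output, the preceding blocks are the input
        for route in routes:
            inputs, output = tuple(route[:-1]), route[-1]
            self.counts[(inputs, output)] += 1
            self.fallback[output] += 1

    def predict(self, inputs, n=1):
        inputs = tuple(inputs)
        # Walk the ordered list (highest occurrence first) and collect the
        # outputs recorded for this exact input
        ranked = [out for (inp, out), _ in self.counts.most_common() if inp == inputs]
        if not ranked:
            # Unseen input: predict the most frequent outputs overall
            ranked = [out for out, _ in self.fallback.most_common()]
        return ranked[:n]
```

For example, a learner trained on twenty occurrences of A-B-C and five of A-B-D predicts C first for the input A-B.
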

2.5 Validation

2.5.1 Train and Test Sets

To evaluate the performance of the machine learning algorithms, the concepts of train and test sets (Ordonez, 2006) are introduced (see Figure 2.2). The data set, as provided by eMagiz, must be split into a train and a test set. As the names suggest, a train set is used to train and a test set is used to test a machine learning algorithm. Training an algorithm implies that the algorithm tries to find patterns between the input and output data of the train set. To illustrate this, suppose a machine learning algorithm has been trained on the Iris flower data set (mentioned previously). The algorithm could have found the following pattern: if a flower has a Petal size that lies in the interval [4.5cm, 5.3cm] and a Petal width that lies in the interval [3.0cm, 4.0cm], then the flower must be an Iris-setosa. The input data in the example are the Petal size and width, and the output data is the Flower species.

Once the algorithm has been trained, it is applied to the test set. From the train set, the machine learning algorithm has recognised a series of patterns. Given the input data of the test set and the patterns recognised in the train set, the algorithm predicts the output data of the test set. The predictions made can be either correct or incorrect. The number of correct predictions divided by the total number of predictions (equation 2.1) is called the accuracy (Kotsiantis, 2007) of the machine learning algorithm.

Accuracy = Number of Correct Predictions / Total Number of Predictions   (2.1)


Figure 2.2: Train and Test Set

A problem arises if the data in the data set is unevenly distributed. Take, for example, the Iris flower data set. Suppose the data set is split into a train and a test set, and that all data regarding the Flower species Iris-virginica is only present in the test set. In that case, the machine learning algorithm cannot recognise patterns regarding the Iris-virginica (simply because it is not present in the train set). Hence, the trained algorithm has a low accuracy. To prevent this, the data set is shuffled at random before it is split into a train and a test set.
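A sketch of shuffling, splitting and computing the accuracy of equation 2.1, using scikit-learn on the Iris data set (the split ratio, classifier and random seed are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# shuffle=True (the default) randomises the rows before splitting, which
# reduces the risk of one flower species ending up only in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)

# Train on the train set, then predict the test set
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Equation 2.1: correct predictions divided by total predictions
acc = accuracy_score(y_test, clf.predict(X_test))
```
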

2.5.2 K-fold Cross-Validation

The probability of unbalanced data is reduced by means of shuffling, but unbalanced data might still occur. To further reduce this probability, a technique called K-fold cross-validation (Bengio, 2004) can be applied. As Kotsiantis (Kotsiantis, 2007) puts it: ’In another technique, known as cross-validation, the training set is divided into mutually exclusive and equal-sized subsets and for each subset the classifier is trained on the union of all the other subsets. The average of the error rate of each subset is therefore an estimate of the error rate of the classifier.’ In other words, this technique splits the provided data set into a train and a test set and determines the accuracy. The validation then continues with the same data set, but splits it into a new train and test set (where the test set differs from the first test set). Once again the accuracy is determined. This process continues K times and is visualised in Figure 2.3 with K = 10. Once K accuracy scores are calculated, their average is determined. Now, the probability of a low accuracy due to an unbalanced data set is minimised.


Figure 2.3: K-Fold cross validation (K = 10)
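With scikit-learn, K-fold cross-validation as described above can be sketched as follows (K = 10; the data set and classifier are illustrative, and scikit-learn's default stratified folds are used rather than an explicit shuffle):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# cv=10: split the data into 10 folds, train on 9 folds and test on the
# remaining one, and repeat this for every fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

# The average of the 10 accuracy scores is the reported accuracy
mean_accuracy = scores.mean()
```
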

2.5.3 Balance score

It will not suffice to solely optimise the assistance tool’s accuracy: a high accuracy does not necessarily imply a good assistance tool. Presume that there are two situations: situation A and situation B. Situation A occurs frequently, while situation B occurs seldom. Given situation A, the assistance tool always makes correct predictions. For situation B, it is the exact opposite: the assistance tool always makes incorrect predictions. Since situation B occurs significantly less often than situation A, the accuracy of the assistance tool is quite high. The balance, however, is very poor. The balance is a percentage that displays to what extent the accuracy scores of each situation differ. If there is no difference in accuracy scores, then the balance is 100%. To measure the balance, the F-score can be applied. To understand the F-score, one must first understand the concepts of a confusion matrix, precision and recall. The confusion matrix provides an overview of all predictions made by the assistance tool. In Figure 2.4, an example of a confusion matrix is displayed.

Figure 2.4: Example Confusion Matrix

From the confusion matrix the precision, recall and F-score can be determined. Suppose a machine learning algorithm tries to predict a situation A, B or C (see Figure 2.4). The column on the far left represents the predictions made by the machine learning algorithm. The top row represents the actual outcome (reality). From Figure 2.4 it can be deduced that situation A was correctly predicted 13 times, situation B was correctly predicted 16 times, etc. As one might observe, 5 times situation A was predicted while in reality the answer was situation B. Similarly, 2 times situation B was predicted while the actual answer was situation A.

For Figure 2.4, three recall and precision scores can be calculated. The first recall score refers to situation A regarding reality. Here, the number of correct predictions (13 times) is divided by the total number of predictions made in that column. The total number of predictions made is: 13 + 2 + 3 = 18. So, the first recall score is: 13/18 ≈ 0.722. Likewise, the second recall score is 0.762 and the third 0.833. The calculation of the precision score works similarly. The first precision score refers to situation A regarding predictions. The number of correct predictions is divided by the total number of predictions in that row. So, the first precision score is: 13/(13 + 5 + 7) = 0.52. The second and third precision scores are 0.842 and 0.930 respectively. In conclusion, precision is the probability that situation X is correctly predicted out of all cases in which situation X is predicted, while recall is the probability that situation X is correctly predicted out of all cases in which the reality is situation X. In other words, precision measures from the perspective of predictions and recall from the perspective of reality. From the recall and precision scores, the F-score can be determined as follows:

F-score = 2 ∗ Recall ∗ Precision / (Recall + Precision)   (2.2)

The F-score is the balance score of the machine learning algorithm. A high F-score implies that the predictions are well balanced, while a low F-score implies the opposite.
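The worked example can be verified in a few lines of Python. The confusion matrix below is reconstructed from the precision and recall scores given in the text (rows are predictions, columns are reality; the values not stated explicitly are inferred from those scores):

```python
# Confusion matrix from Figure 2.4 (order A, B, C)
matrix = [
    [13, 5, 7],   # predicted A
    [2, 16, 1],   # predicted B
    [3, 0, 40],   # predicted C
]

def precision(m, i):
    # Correct predictions of class i divided by all predictions of class i (row)
    return m[i][i] / sum(m[i])

def recall(m, i):
    # Correct predictions of class i divided by all actual cases of class i (column)
    return m[i][i] / sum(row[i] for row in m)

def f_score(m, i):
    # Equation 2.2: harmonic mean of recall and precision
    p, r = precision(m, i), recall(m, i)
    return 2 * r * p / (r + p)
```
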


2.6 ROC Curve

Figure 2.5: Basic Confusion Matrix

In Figure 2.5, a typical example of a confusion matrix is given. Here, there are only two possible states: 1 and 0. If a machine learning algorithm predicts the future state to be 1 and the future state is indeed 1, then the prediction falls in the category True Positive. Ideally, the number of True Positive and/or True Negative predictions is greater than zero, whereas the number of False Positive and False Negative predictions equals zero. Normally, this situation cannot be reached. Therefore, the True Positive Rate (TPR) was created to keep track of the number of True Positives in relation to the number of False Negatives. In addition, the False Positive Rate (FPR) was created to keep track of the number of False Positives in relation to the number of True Negatives.

The TPR must be maximised while the FPR must be minimised. The equations below show how the TPR and FPR are calculated.

True Positive Rate = TP / (TP + FN)   (2.3)

False Positive Rate = FP / (FP + TN)   (2.4)

To understand how the results of these ratios might change, one must first understand how a machine learning algorithm makes predictions. Based on input, a machine learning algorithm comes up with several probabilities. In the case of the basic confusion matrix (Figure 2.5) this could be: a 40% chance that the future state is 0 and a 60% chance that the future state is 1. Logically, one would say that the algorithm should predict that the future state is 1. This is correct, but it might not always lead to the desired result. Suppose a machine learning algorithm is applied in a hospital to decide whether a patient has a serious illness. In this situation, it is better to wrongly predict that a patient has a serious illness than to wrongly predict that a patient does not have a serious illness. In other words, the TPR is of vital importance and the FPR is of less importance. To influence the TPR and FPR, a decision rule can be set. For example, if a machine learning algorithm is 60% sure that a patient is not ill, then one could tell the algorithm to still classify this patient as ill; being 60% sure that a patient is not ill is too low. The hospital might demand that the algorithm be at least 90% sure before classifying a patient as not ill. This 90% is called a threshold. By adjusting the threshold, the TPR (2.3) and FPR (2.4) change. So, one could plot the TPR against the FPR while adjusting the threshold step by step. The curve that appears is called the ROC Curve (James, 1982).
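The threshold sweep behind a ROC curve can be sketched in pure Python; the scores and labels below are hypothetical:

```python
# Hypothetical predicted probabilities for class 1, with the true labels
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    0]

def rates(threshold):
    # A case is classified as 1 when its score reaches the threshold
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    # Equations 2.3 and 2.4
    return tp / (tp + fn), fp / (fp + tn)

# Sweeping the threshold step by step traces out the (TPR, FPR) points
# of the ROC curve
roc = [rates(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

At threshold 0 everything is classified as 1 (TPR = FPR = 1), and at threshold 1 nothing is (TPR = FPR = 0); the points in between form the curve.
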

Figure 2.6: ROC Curve

In Figure 2.6 an example of a ROC curve is shown. The situation where the TPR is 1 and the FPR is 0 is ideal. In general, this situation cannot be reached. The point closest to this ideal is the optimal combination of TPR and FPR; in Figure 2.6, this point is displayed as a red dot. In chapter 3, the implementation of the ROC Curve within the research is discussed.


Chapter 3

Design & Development

In this chapter, the research approach is sketched. The following three sections are addressed: 1) Data Conversion, 2) Comparing Algorithms and 3) ROC Curve. For both Data Conversion and Comparing Algorithms a brief summary is given.

3.1 Brief Summary

The aim of this research is to create an assistance tool to reduce time spent on constructing dataflows. To achieve this, two phases must be completed: 1) Data Conversion and 2) Comparing Algorithms. The data provided by eMagiz is not suitable for machine learning. Therefore, the data format is changed. Next, ten machine learning algorithms are compared on accuracy and F-score (chapter 2 Theoretical Framework). The best algorithm could act as the basis for the assistance tool. This tool could then be added to the eMagiz platform.

3.2 Data Conversion

3.2.1 Summary Data Conversion

The data provided by eMagiz is in XML-format (subsection 3.2.2). This file format is unsuitable for machine learning. Therefore, the file format is altered. First, the dataflows are translated into routes (subsection 3.2.3). Undesired routes are removed (subsection 3.2.4) and, after some alterations, added to a table (subsection 3.2.5). This table (or data set) can be used as input for machine learning.

3.2.2 XML Conversion

For this research, eMagiz provided dataflows in XML-format (on-ramp related). XML stands for eXtensible Markup Language. It contains information which is readable for both humans and machines. It is, however, not suitable for machine learning. Therefore, this file format must be altered. For a computer, XML documents are accessible by applying a technique called parsing. Parsing enables a computer program to access the data very fast while understanding the underlying structure. In the case of dataflows, this implies that the computer understands which building blocks are connected.
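A parsing sketch with Python's standard xml.etree.ElementTree module; the XML structure and attribute names below are invented for illustration and do not match the actual eMagiz schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal flow definition; the real eMagiz XML differs
XML = """
<flow>
  <component type="jms-message-driven-channel-adapter" out="c1"/>
  <component type="standard-transformer" in="c1" out="c2"/>
  <component type="jms-outbound-channel-adapter" in="c2"/>
</flow>
"""

root = ET.fromstring(XML)
components = list(root)

# Index the components by their input channel, so that the chain of
# connected building blocks can be followed
by_input = {c.get("in"): c for c in components if c.get("in")}

# Start at the component without an input channel and walk the chain
current = next(c for c in components if c.get("in") is None)
route = [current.get("type")]
while current.get("out") in by_input:
    current = by_input[current.get("out")]
    route.append(current.get("type"))
```
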

From the XML-files, a series of routes could be obtained. An example of a route is given in Figure 3.1. Circled in red, a route is shown consisting of the following building blocks: 1) jms-message-driven-channel-adapter, 2) standard-transformer, 3) X-path-splitter, 4) standard-validating-filter, 5) X-path-router and 6) jms-outbound-channel-adapter. As one could imagine, multiple routes can be found in the same dataflow. Further on, these routes are used for machine learning. The code for the XMLconverter program is shown in APPENDIX C.

Figure 3.1: Example route

3.2.3 Routes

In Figure 3.1, there are two building blocks from which multiple channels originate: a standard-validating-filter (stretched yellow pentagon) and a standard-router (yellow diamond). As one might have noticed, four routes can be extracted from this dataflow. These four routes are displayed in Table 3.1. In this Table, ’inbound’ represents the green circle, ’transformer’ represents the blue rectangle, etc. These routes are, however, ill-defined, and that causes a problem. Namely, if the four routes are used as input for a machine learning algorithm, then the algorithm thinks there are four different series of blocks (see Figure 3.2), while in reality there is only one with four branches (see Figure 3.1).

The problem with this situation is, for example, that the machine learning algorithm thinks an X-path-splitter is followed by a standard-validating-filter four times, while in fact the X-path-splitter is followed by a standard-validating-filter only once. Suppose an X-path-splitter is followed by a standard-router. In theory, the standard-router could have 100 outgoing channels. This implies that the machine learning algorithm thinks that in 100 cases the X-path-splitter is followed by a standard-router, even though in reality it appears only once. The result is a poor accuracy and balance score.


1st Block   2nd Block     3rd Block   4th Block   5th Block   6th Block
inbound     transformer   splitter    filter      router      outbound
inbound     transformer   splitter    filter      router      outbound
inbound     transformer   splitter    filter      router      outbound
inbound     transformer   splitter    filter      outbound    -

Table 3.1: Example of Routes 1

Figure 3.2: Wrong assumption

To solve this issue, the series of blocks prior to a validating-filter or a router are removed, with the exception of one route (Table 3.2). If this data is now used as input for a machine learning algorithm, then the algorithm observes that an X-path-splitter is followed by a standard-validating-filter only once. The data becomes more reliable. Therefore, poor accuracy and balance scores, as a result of ill-defined input data, are avoided.

1st Block   2nd Block     3rd Block   4th Block   5th Block   6th Block
inbound     transformer   splitter    filter      router      outbound
-           -             -           -           router      outbound
-           -             -           -           router      outbound
-           -             -           filter      outbound    -

Table 3.2: Example of Routes 2

3.2.4 Exclusion

When the eMagiz on-ramp is opened for the first time, a standard dataflow is presented to the user. This dataflow is not constructed by the user, but it is used in various situations. Hence, this standard dataflow appears quite often in the data set (as provided by eMagiz). Since the user does not construct this dataflow, and the assistance tool is concerned with the construction of dataflows, these standard dataflows are excluded from the data set.


Within the eMagiz data set, there exist a couple of unfinished dataflows. This means that a dataflow does not have a starting point (for example an inbound-channel-adapter) and/or an ending point (for example an outbound-channel-adapter). Moreover, such dataflows are likely to have been wrongly constructed. Therefore, these types of dataflows are excluded.

For Example and Training dataflows, a similar problem occurs. Since these dataflows were constructed with the sole purpose of pointing something out, they might be wrongly constructed or appear in a unique and unrepresentative manner. Hence, these types of dataflows were also excluded.

3.2.5 Tables

All remaining dataflows were written out in table form (as shown in Table 3.3). NaN simply means ’Not a Number’. In other words, NaN-values represent empty cells. In the Table, each row represents a route. For the sake of overview, the last two rows of the Table show that the Table continues until row n is reached.

Table 3.3: Overview of all routes

The assistance tool tries to predict the next building block based on a series of previous blocks. This series of blocks can be quite large. In fact, the larger the route, the smaller the number of occurrences. For instance, one route has a length of 14 blocks, but it occurs only once in the entire data set. This means that if all routes with length 14 are extracted from the data set and used as input for a machine learning algorithm, then the newly formed input data set has merely one row. Since machine learning algorithms require a large input data set, an input data set consisting of only one row simply does not suffice. Therefore, as a rule of thumb, it was decided to only use data sets with at least one thousand rows. As it turns out, the maximum route length that generates a data set with at least one thousand rows is seven.

Another point of attention is the minimum route length. The machine learning algorithms applied in this study demand a multi-dimensional vector input. Alternatively stated, the series of previous blocks must contain at least two blocks. Therefore, the minimum route length is set at three (two previous blocks, one block to predict). From Table 3.3, a series of input data sets are created, with a minimum and maximum column length of three and seven respectively.

Suppose an input data set of three columns is created. This is achieved in three simple steps. Firstly, a series of sub-data sets are constructed from Table 3.3: column 1 to 3 forms a sub-data set, column 2 to 4 forms a sub-data set, etc. Secondly, the rows with NaN values are removed from all sub-data sets. Thirdly, all sub-data sets are merged into one input data set with three columns. The same process is applied for the other input data sets with different route lengths.
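The three steps above can be sketched in pure Python; the routes below are toy rows padded with None, which plays the role of the NaN cells in Table 3.3:

```python
# Toy routes written as padded rows; None stands for an empty (NaN) cell
routes = [
    ["inbound", "transformer", "splitter", "filter", "router", "outbound"],
    ["inbound", "filter", "outbound", None],
]
width = 3  # number of columns of the input data set to build

windows = []
for route in routes:
    # Step 1: slide a window of the chosen width over each row
    # (equivalent to taking columns 1-3, 2-4, ... as sub-data sets)
    for start in range(len(route) - width + 1):
        window = route[start:start + width]
        # Step 2: drop windows that contain empty cells
        if all(block is not None for block in window):
            windows.append(window)
# Step 3: 'windows' is the merged input data set with three columns
```
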

3.3 Comparing Algorithms

3.3.1 Summary Comparing Algorithms

A Key Performance Indicator (KPI) is a measurable variable which makes it possible to measure performance. For this research, the KPIs accuracy score (subsection 3.3.2) and F-score (subsection 3.3.3) are used to compare the ten machine learning algorithms. Furthermore, the impact of the number of suggestions made by the algorithms is discussed.

3.3.2 Accuracy Scores

In total, ten machine learning algorithms are compared (Table 2.2). Since accuracy is most valuable for eMagiz, it was decided to concentrate on the accuracy score. The balance score is of less importance, but is also taken into account. The input data sets are as described previously. Within an input data set, the last column is used as Y-value (output) and the remaining columns are used as X-value (input). In other words, the last column represents the block that must be predicted and the remaining columns represent the series of previous blocks.

For a machine learning algorithm, the accuracy is determined K times (as a result of K-fold cross-validation). The average of these accuracy scores is the accuracy score associated with a certain input data set. There are five input data sets; therefore, the comparison program calculates five corresponding accuracy scores: ac1, ac2, ac3, ac4 and ac5. These accuracy scores are not equally important. For instance, a situation with three blocks (two input and one output block) occurs more frequently than a situation with five blocks (four input and one output block). Hence, the accuracy score tied to the situation with three blocks (ac1) is of more importance than the accuracy score tied to the situation with five blocks (ac3). To determine the accuracy score weights, the number of situations in which a corresponding prediction is applied are counted. For example, when a dataflow under construction consists of two blocks, the only prediction that can be made is based on the input data set with three blocks (two input and one output block). Likewise, if there are three blocks present in the dataflow, then the prediction is based on the input data set with four blocks. A dataflow which consists of six or more blocks is based on the input data set with seven blocks; so, if the number of blocks in the dataflow is greater than six, then the input data set with seven blocks is applied. As a result, the input data set with seven blocks is applied more often than the input data set with six blocks. In total, the number of times each input data set is applied is displayed in Table 3.4.

Input data set   Applied
3 blocks         1005
4 blocks          827
5 blocks          788
6 blocks          745
7 blocks         1007
Total            4372

Table 3.4: Input data sets applied

From Table 3.4, the weights of each accuracy score are determined. This is shown in the following equation:

Ac = (1005/4372) ∗ ac1 + (827/4372) ∗ ac2 + (788/4372) ∗ ac3 + (745/4372) ∗ ac4 + (1007/4372) ∗ ac5   (3.1)

The Total accuracy score of each machine learning algorithm is used to compare the algorithms.
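The weighting of equation 3.1 can be sketched as follows; the weights come from Table 3.4, while the accuracy scores ac1 to ac5 are hypothetical placeholders:

```python
# Weights from Table 3.4; the accuracy scores ac1..ac5 are hypothetical
weights = [1005, 827, 788, 745, 1007]
accuracies = [0.80, 0.78, 0.75, 0.74, 0.72]

total = sum(weights)  # 4372
# Equation 3.1: weighted average of the five accuracy scores
Ac = sum(w / total * ac for w, ac in zip(weights, accuracies))
```
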

3.3.3 F-scores

The F-scores for the machine learning algorithms are calculated in a similar fashion to the accuracy scores. K-fold cross-validation is applied and the average of the F-scores is taken. The result is five F-scores, one for each corresponding input data set: f1, f2, f3, f4 and f5. The Fs score (Total F-score) is calculated by applying the same weights as were applied for the Ac score:

Fs = (1005/4372) ∗ f1 + (827/4372) ∗ f2 + (788/4372) ∗ f3 + (745/4372) ∗ f4 + (1007/4372) ∗ f5   (3.2)

3.3.4 Number of Suggestions

According to eMagiz management, the number of suggestions the assistance tool makes must not be too high, for reasons of convenience. If, however, the accuracy and balance score increase significantly when an extra suggestion is added, then this extra suggestion should be added. To measure the effect of adding an extra suggestion, a graph is made in which the accuracy score of the best performing algorithm is plotted against the number of suggestions n. After discussing the subject with the eMagiz management, it was decided to stop adding extra suggestions to the assistance tool once the accuracy score increases by less than 1% per added suggestion. In addition, to keep the tool easy to oversee, the maximum number of suggestions made by the assistance tool is ten.


3.4 ROC Curve Application

As described in chapter 2 Theoretical Framework, the ROC curve is used to optimise performance by maximising the True Positive Rate and minimising the False Positive Rate. Since the assistance tool is based on only one machine learning algorithm, the ROC Curve is applied only to this specific algorithm. As described in this chapter, the assistance tool (and therefore the underlying machine learning algorithm) makes multiple suggestions, and in total 98 different building blocks can be the future state. With this many possible classes, it becomes difficult to construct a ROC Curve from the confusion matrix. Therefore, all suggestions given are seen as one class and the remaining states are seen as the other class. In this manner a simple confusion matrix can be constructed and a ROC Curve can be made.
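The collapse into two classes can be sketched as follows: a case counts as positive when the true next block appears among the suggestions made (the suggestion lists and true blocks below are hypothetical):

```python
# Hypothetical suggestion lists (n = 5) and the true next blocks
suggestions = [["A", "B", "C", "D", "E"], ["B", "C", "D", "E", "F"]]
truths = ["C", "A"]

# A case is 'positive' when the true next block is among the suggestions;
# everything else is 'negative', yielding a simple two-class problem
hits = [truth in suggested for suggested, truth in zip(suggestions, truths)]
```
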


Chapter 4

Implementation & Demonstration

In this chapter, the 1) Results and 2) Design of the Prototype are discussed. The Results are the outcome of the research approach (chapter 3 Design & Development ). The Design of the Prototype describes how the assistance tool is implemented.

4.1 Brief Summary

The machine learning algorithms were allowed to give five suggestions in total. For the eMagiz data set the algorithm Random Forest performed best with an accuracy score of 95.70% and an F-score of 85.40%.

4.2 Results

4.2.1 One Prediction

If the machine learning algorithms are compared based on one suggestion, then the results are as described in Table 4.1.

Algorithm                        Ac (%)   Fs (%)
Extremely Randomized Trees       76.66    81.83
Random Forest                    76.60    79.97
Decision Tree                    76.45    80.33
Rule Based Learning              74.65    81.07
Support Vector Machines          74.82    82.55
Neural Network                   66.54    90.62
Multiclass Logistic Regression   53.15    78.98
Naïve Bayes Gaussian             41.35    64.03
Naïve Bayes Bernoulli            33.46    82.85
Naïve Bayes Multinomial          26.54    42.34

Table 4.1: Results Comparison Phase

4.2.2 Impact Number of Suggestions

The results of Table 4.1 are, however, sub-optimal. If the number of suggestions is increased, then the accuracy scores of all algorithms increase tremendously. The advantage of increasing the number of suggestions is that the assistance tool can make more accurate suggestions, which (significantly) reduces time spent on constructing dataflows. The effect of an increased number of suggestions on the accuracy score (of the best algorithm) is illustrated in the graph below.

[Graph: accuracy score of the best algorithm (76–100%) plotted against the number of suggestions n (1–10)]

The machine learning algorithms follow a similar pattern when tested on F-score. As one might notice, the difference in accuracy score between one and two suggestions is approximately 10%, which is quite large. The difference between nine and ten suggestions is, however, less than 1%. As described in chapter 3 Design & Development, the eMagiz management wishes to increase the number of suggestions until the increase in accuracy score is less than 1%. Going from four suggestions to five, the accuracy score increases by more than 1%; going from five suggestions to six, it increases by less than 1%. Therefore, it was decided to compare the machine learning algorithms based on five suggestions. The results can be found in Table 4.2.

4.2.3 Five Predictions

Algorithm                          Ac (%)   Fs (%)
Random Forest                       95.70    85.40
Support Vector Machines             95.42    86.23
Rule Based Learning                 95.33    84.44
Extremely Randomized Trees          95.22    84.11
Decision Tree                       93.78    82.37
Neural Network                      90.97    77.93
Multiclass Logistic Regression      89.35    72.60
Naïve Bayes Gaussian                88.40    70.95
Naïve Bayes Bernoulli               86.13    87.67
Naïve Bayes Multinomial             62.60    45.63

Table 4.2: Results Comparison Phase

From Table 4.2, it becomes clear that the Random Forest algorithm performs best in terms of accuracy, while Support Vector Machines performs best in terms of F-score. eMagiz management made clear that accuracy is deemed more important than F-score. Therefore, it was decided to use Random Forest as the basis for the eMagiz assistance tool.


4.2.4 ROC Curve

Figure 4.1 shows the ROC curve of the Random Forest algorithm. The best possible result is achieved at a True Positive Rate (TPR) of 0.68 and a False Positive Rate (FPR) of 0.16; this point is shown in the figure as a red dot. At that point, the threshold is approximately 87.5%. This implies that the Random Forest algorithm must be at least 87.5% confident that the five suggestions are correct before making them. Conversely, if the algorithm is less than 87.5% confident, it should not make these suggestions but instead focus on alternatives. Compared to applying no threshold, however, a threshold of 87.5% increases the total number of False Positives and False Negatives. So although the combination of TPR and FPR is best at a threshold of 87.5%, the same does not hold for accuracy.

Figure 4.1: Random Forest ROC
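The text does not state which criterion identifies the "best possible" point on the ROC curve; a common choice is Youden's J statistic (TPR minus FPR), which the sketch below applies. Both the scores and labels here are invented toy data, so the selection rule and the numbers are assumptions for illustration only:

```python
def roc_point(scores, labels, threshold):
    """TPR and FPR when samples with score >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# Illustrative classifier confidence scores and true labels (not thesis data).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1,    1,   0,   1,   1,   0,   0,   0]

# Scan candidate thresholds, keeping the one maximising Youden's J = TPR - FPR.
best_t, best_j = None, -1.0
for t in sorted(set(scores), reverse=True):
    tpr, fpr = roc_point(scores, labels, t)
    if tpr - fpr > best_j:
        best_t, best_j = t, tpr - fpr
print(best_t, best_j)
```

As the paragraph above notes, the threshold that optimises the TPR/FPR trade-off need not be the one that maximises plain accuracy, since J weighs the two error rates rather than the total error count.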


4.3 Design of the Prototype

The prototype is based on the Random Forest algorithm. When added to the on-ramp, the prototype can take the form shown in Figure 4.2. It is opened by right-clicking the last building block in the premature dataflow.

Figure 4.2: Design of the Prototype

In Figure 4.2, two input blocks are shown, namely a 'jms-message-driven-channel-adapter' and a 'file-to-string-transformer'. When an eMagiz user right-clicks on the latter, a suggestion window appears. In this window, five suggestions are given regarding the continuation of the dataflow. These suggestions are based on the Random Forest algorithm.
