Process Mining in Audits

The use of process mining in the exploratory phase of an audit

Bas Verhaar (10177914)

University of Amsterdam, Faculty of Science, Master Information Studies: Business Information Systems

Final version: 27-06-2018. Supervisor: Loek Stolwijk

Abstract: Financial audits are becoming more and more complex due to the growth of information systems. Traditional methods for exploring system processes are becoming obsolete. Process mining is a tool that can be used to make the audit process more efficient. The following question will be discussed in this paper: how to transform ERP data into an event log in order to discover fitting and general process models for exploring in the early stages of an audit? A design science methodology has been applied to research this problem. A two-component framework was created as a solution. The first component is called the extraction phase and discusses how data from an ERP system needs to be extracted before it is transformed. The second component, the transformation phase, is the main element of the framework and consists of an algorithm that transforms the ERP data into an event log usable for process mining. When using this framework, an auditor can obtain an understanding of the underlying process more efficiently and effectively than when using traditional methods.


Introduction
Methodology
Literature review and background information
Process mining
Financial Audits and Business Processes
Evaluating discovered process (evaluation)
Transformation Framework
Preparation phase
Potential complications and consequences
Transformation phase
Transformation algorithm
Evaluation
Preparation phase
Transformation phase
Iteration 1
Iteration 2
Conclusion and Further Research
Future research
Appendix
Appendix A - ER Diagram of ERP system used as case study
Appendix B - Petri net of the first iteration
Appendix C - Petri net of the second iteration


Introduction

A traditional audit is a systematic and independent examination of the financial records, accounts, business transactions and accounting practices of an organization to confirm whether the financial statements present a true and fair view of the concern. Financial audits are not only important to increase the reliability of published financial statements of companies; they are also required by local or international laws, regulations and standards. The analysis of business processes is part of an audit. The International Standard on Auditing (ISA) 315 (Revised) states that "The auditor should obtain an understanding of the information system, including the related business processes, relevant to financial reporting (...)" (IFAC, 2012). Traditionally, auditors would proceed with interviews and inspections of documents in order to map out process models. However, due to the increasing integration of information systems and the explosion of data, traditional audit methods are inefficient and ineffective (Werner & Gehrke, 2015). Most transactional data is stored in enterprise resource planning (ERP) systems. This data can be characterized as Big Data, due to its volume and the velocity of data accumulation (Chen et al., 2012).

One potential solution to this problem might be the use of process mining. Process mining is a relatively young discipline that combines machine learning and data analysis on the one hand with process modeling and analysis on the other (Van der Aalst, 2011). By extracting data from event logs that are present in information systems, one can discover, monitor and improve processes. Event logs serve as the input for process mining. It is important that they are of high quality, since this will define the quality of the output: a process model (Suriadi et al., 2017). The quality of event logs will be discussed later. In order to use process mining techniques, an event log is required to contain at least the following three data elements: a case identifier, a task identifier and a time notation (Bernard et al., 2016). Depending on what type of analysis is performed, extra data can be added, like agent or department. Take the example of the process of going to the cinema. In this example, the case identifier would be a unique number of someone going to the cinema, for example someone's (unique) name. The task identifier would be the tasks performed during this process, for example "Buying a ticket" or "Buying food". The time notation is a timestamp of when the unique case performs said task. Van der Aalst (2011, p. 9) distinguishes three types of process mining: discovery, conformance and enhancement. The first type, discovery, aims to produce a process model from event logs without a normative process or any a priori information. The second type, conformance, compares the process model that emerged from the event logs against a predefined process model. The predefined model could be the business rules. In this type, one typically answers questions like: does the process really behave according to the predefined rules? The third type of process mining is enhancement. The idea of this type is to improve the process model based on information extracted from event logs. The predefined business rules could be changed in order to let the process run more efficiently. For auditing purposes and the purposes of this thesis, discovery is most essential, since it adds to the understanding of business processes required by ISA 315.
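To make this concrete, the following is a minimal sketch of what such an event log could look like for the cinema example; the names, activities and timestamps are illustrative rather than taken from a real system.

```python
import csv

# A minimal event log for the cinema example: each line carries the three
# required elements (case identifier, task identifier, timestamp).
rows = [
    ["Case ID", "Event ID",        "Timestamp"],
    ["Alice",   "Buying a ticket", "2018-05-12 19:02:11"],
    ["Alice",   "Buying food",     "2018-05-12 19:10:45"],
    ["Bob",     "Buying a ticket", "2018-05-12 19:04:30"],
]

with open("cinema_log.csv", "w", newline="") as f:
    csv.writer(f, delimiter=";").writerows(rows)
```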

A difference with traditional process mining is that process instances of financially relevant data are not necessarily linear. The purchase of different goods could be paid by one payment (1-N relationship), one purchase could be paid by different payments (N-1 relationship) or the purchases of different goods could be paid by even more payments (N-M relationship). Therefore, it is important to keep in mind how data out of ERP systems should be transformed into event logs. This means that the event logs will not be fully objective, because a (subjective) choice needs to be made whether the event log will be from the perspective of an invoice or the perspective of an order. An event log will therefore be partially objective. The process models from these event logs should be perfectly fitting and highly precise, since auditors rely on the correctness of these models.

Once event logs are available, the algorithm of Gehrke and Mueller-Wickop (2010) is capable of mining financial entries and open items to reconstruct the process instances that produced the financial entries. Building on this, the Financial Process Mining (FPM) algorithm (Werner & Gehrke, 2015) can map out the corresponding control flows. However, for the first algorithm to work, it is important that the event log extracted from the ERP system is well structured.

Usually when data is blended from different sources, the ETL method is used (What Is ETL?, n.d.). ETL stands for extract, transform and load, representing the three steps of putting data from one database into another. Data is taken (extracted) from different sources, after which it is carefully transformed into data that can be used for analysis before it is stored (loaded) into a data warehouse or other system. This means that constructing an event log needs to happen in the transform phase, after the right data is extracted from the sources.
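As a rough illustration of how the ETL steps relate to event log construction, consider the following sketch; the file name, column names and the single hard-coded date column are hypothetical simplifications.

```python
import csv

# A minimal ETL sketch: extract rows from an ERP export, transform them into
# event log lines, and load the result into a new CSV file.
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter=";"))

def transform(rows):
    # One event per (case, date column); here a single hypothetical date column.
    return [[r["INVOICE"], "APPROVE_DATE", r["APPROVE_DATE"]]
            for r in rows if r["APPROVE_DATE"]]

def load(lines, path):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerow(["Case ID", "Event ID", "Timestamp"])
        writer.writerows(lines)

load(transform(extract("erp_export.csv")), "event_log.csv")
```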

Since ERP systems consist of multiple interconnected tables, the challenge is to connect the appropriate tables to retrieve the right information (see Appendix A). Generally, ERP systems not only differ from one another, but even the data within one system can be inconsistent. These inconsistencies occur when data is manually entered by different people in different locations (e.g. different branches). It is not always clear which tables are linked with each other, which could result in data loss when extracting data. One needs to have certain expertise when combining and extracting this data. If this is not done accurately, the data will be incorrect, resulting in an inaccurate event log.

The goal of this thesis is to find ways of using process mining in the auditing process. The research question is as follows: how to transform ERP data into an event log in order to discover fitting and general process models for exploring in the early stages of an audit? To narrow the scope for the purpose of this paper, the focus lies solely on the planning phase of an audit. A transformation framework will be created that can guide auditors in transforming ERP data into event logs. In the next section, the research methodology will be discussed, followed by a literature review. After this, the framework will be discussed, with a case study for confirmation. The thesis closes with a conclusion and an outlook on future research in the last section.


Methodology

For this thesis, the design science methodology is used (March & Smith, 1995). The reason for choosing this method is that the research question at hand has a close proximity to practical problems in auditing, but will also contribute new knowledge to this discipline in the form of an artifact. Design science aims to develop and provide instructions for action that allow the design and operation of innovative concepts within IS (Österle et al., 2011). In contrast, behaviorism-based IS research analyzes existing IS as phenomena and not 'to-be' concepts. Österle et al. (2011) state that design science research (DSR) follows four different phases: analysis, design, evaluation and diffusion. During the analysis phase, the business problem is identified, along with research objectives and gaps. This is partly done in the introduction and will be further elaborated in the next section. The design phase builds an artifact according to generally accepted methods. The methods used for this research are method engineering (Brinkkemper, 1996) and prototyping. Method engineering entails the creation of a new method. In order to do this, previously defined method fragments can be combined in a new manner or new method fragments can be created. Prototyping allows the development of software artifacts. In this study, prototyping is used to evaluate the method engineered in the design phase. This brings us to the third phase, evaluation. With the prototype and a case study provided by an audit company, the method is evaluated. Using the quality checking methods for process discovery discussed in Buijs et al. (2012, September), the model generated from the event log can be evaluated. First, a Petri net (Petri & Reisig, 2008) is created in ProM Lite using the inductive mining algorithm. Then the quality of the Petri net is assessed by replaying the log on the model. The diffusion phase functions as an overlapping phase, aiming for the best diffusion among the target groups of the research. This means that results should be general and rigorous.

Literature review and background information

Process mining

As discussed earlier, process mining is a research area that is rapidly expanding. Since the development of powerful heuristic (Weijters et al., 2006), fuzzy (Günther & van der Aalst, 2007), genetic (van der Aalst et al., 2005, June) and financial process mining algorithms (Werner & Gehrke, 2015), process mining has started to mature as a discipline. Van der Aalst et al. (2011) provide an overview of most challenges and techniques used in process mining in The Process Mining Manifesto.

Process mining has been successfully used in many different professional disciplines. Two of those disciplines are internal auditing (Jans et al., 2013) and financial auditing (Werner & Gehrke, 2015). Both general and more specific data mining techniques have been used in auditing. Debreceny and Gray (2010) used general data mining techniques for internal fraud detection purposes. More specific techniques were used by Jans et al. (2014), who used the data from a financial service company to create an event log by transforming the company-specific data structure. However, after transforming the specific data to an event log, general process mining methods were still used to map out the process model. For performing financial audits with process mining, several methods have been created (Gehrke & Mueller-Wickop, 2010, August; Werner & Gehrke, 2015; Werner, 2017). Those studies are based on the methods of the Financial Process Mining (FPM) algorithm and visually represent process models using Colored Petri Nets (CPN).

This thesis chooses an alternative direction. As stated earlier, Jans et al. (2014) exploited the data structure of one company's ERP system for creating an event log. This thesis is about developing a standardized method for creating such an event log, in the form of a framework. I am not aware of any other research that develops or researches a framework for transforming ERP data into an event log which can be used for audits.

A big challenge when creating an event log is to make sure the process model mined from the event log will be useful for exploratory purposes. Two extreme cases of mined process models are the spaghetti process and the lasagne process (van der Aalst, 2011, April). Extremely unstructured processes are called spaghetti processes due to their visual resemblance to spaghetti. For these processes it is difficult to know which information is required for activities, because those processes are often driven by intuition, experience or vague qualitative information. Lasagne processes are more structured, since their activities have a well-defined input and output. These processes are less driven by human judgement than spaghetti processes.

Fig. 2 -​ An example of a spaghetti process with 619 activities executed by 266 individuals.

Financial Audits and Business Processes

A business process is a set of activities which, when performed in a certain order, reaches a specific business goal (Reichert & Weber, 2012). When performing risk-based audits, it is important to understand business processes and internal controls. The relationship between the process activities and internal controls needs to be clear in order to perform an audit. Auditors use models to map out or gain understanding of business processes. As discussed earlier, these models were traditionally created by interviewing contact persons from the audited firms or by inspecting financial documents. After this, the models were captured using modeling tools like Microsoft Visio, PowerPoint or Word (Werner, 2017).

Not only does performing interviews take a lot of time, which makes it an inefficient way of working; it is also error-prone. Contact persons from the audited firms often lack sufficient knowledge about the automated processes happening within the ERP systems. This results in inaccurate or incomplete information about a process.

Process mining is an automated discipline that can be used by auditors to get better and quicker insights into business processes. Nevertheless, it is important that these processes are modelled accurately. The accuracy of processes can be assessed with different types of quality criteria, and different mining algorithms will yield different results when evaluating the mined models. Buijs et al. (2012, September) discuss four criteria for assessing discovered models: fitness, precision, generalization and simplicity.


Evaluating discovered process (evaluation)

Figure 3 shows the four quality dimensions for process model discovery discussed by Buijs et al. (2012, September) and van der Aalst (2011). Replay fitness quantifies to what extent the discovered model can accurately reproduce the recorded cases from the event log. A perfect fitness has the value 1 and can replay all cases from the event log on the model. This is closely related to the internal validity of an experiment. Precision is the dimension concerned with the number of traces the model can play that are not in the event log. A so-called flower model, which can produce an infinite number of traces, has a precision infinitely close to 0 (Rozinat & van der Aalst, 2008). A flower model is therefore underfitting the log. Simplicity captures the complexity of a process model. As discussed earlier, spaghetti processes have a low simplicity. The last dimension is called generalization and captures whether the model not only shows the observed behavior, but is also able to produce future behavior. This is closely related to the external validity of an experiment.

Fig. 3 -​ Four quality criteria dimensions for process model discovery

Even though the ETM algorithm (Buijs et al., 2012, September) provides the best alternative, it is usually not possible to have a perfect score on all four dimensions. Most discovery algorithms only focus on one or two of the quality criteria dimensions for process model discovery. When evaluating the resulting event log from the framework of this thesis, it is important to keep in mind that choosing a different mining algorithm will produce different results. For the purpose of this paper, the inductive miner has been chosen as the discovery algorithm. The inductive miner not only allows filtering of infrequent behavior, which is beneficial for exploratory purposes; it is also a generally accepted mining algorithm (Leemans et al., 2013, August).

Transformation Framework

The transformation framework consists of two components. The first component overlaps the extraction phase and the transformation phase of the ETL structure. It describes a method of how data can be prepared and extracted before it is subsequently transformed into an event log. The second component of the framework covers the actual data transformation into an event log. This is the fundamental part of the framework, but for it to work it is important to have extracted the correct data from the system. The first component functions more as a preparation for the actual framework, but is discussed in this paper to illustrate an example of how the appropriate data could be extracted.

Preparation phase

The goal of this phase is to join and flatten the ERP data into one table. Since expertise in both data and accounting is required for this phase, it is recommended that a data scientist, an ERP specialist and an auditor work together closely. The ERP specialist possesses the knowledge of the ERP system and prepares the data. The auditor knows which information is significant in the exploratory phase of an audit.

First, it is important to determine the case perspective. The end result will show the life cycle of all these cases. When mapping out a purchase-to-pay process, there are two obvious choices for the case perspective: an order or an invoice. Both are related to each other in a many-to-many relationship, meaning an order can have one or more invoices and an invoice can contain one or more orders. This means that, in terms of process mining, one of these two should be chosen as the case perspective. Choosing orders as cases means that invoices are not represented with high accuracy, and vice versa. This is an issue that needs to be discussed with and decided by the auditor responsible for the exploratory phase, as illustrated in the sketch below.
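The following minimal sketch (with hypothetical order and invoice identifiers) shows why the choice matters: the same links produce different case groupings, and whichever entity is not chosen as the case ends up duplicated across cases.

```python
# Hypothetical invoice-order links illustrating the many-to-many relationship:
# order O1 is billed on invoices I1 and I2, and invoice I2 also covers order O2.
links = [("O1", "I1"), ("O1", "I2"), ("O2", "I2")]

# Order perspective: each order becomes a case; invoice I2 appears twice.
orders = {}
for order, invoice in links:
    orders.setdefault(order, []).append(invoice)

# Invoice perspective: each invoice becomes a case; order O1 appears twice.
invoices = {}
for order, invoice in links:
    invoices.setdefault(invoice, []).append(order)

print(orders)    # {'O1': ['I1', 'I2'], 'O2': ['I2']}
print(invoices)  # {'I1': ['O1'], 'I2': ['O1', 'O2']}
```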

The next step is to determine the relevant tables. Tables with a column containing cases and with columns containing timestamps are relevant. These columns will later be transformed into events. The timestamps are essential, since they are the key to creating processes. In the same tables, resources can also be found. To determine the resource columns, the auditor and data scientist should collaborate closely to examine which resources are necessary and which columns are available in the system. To illustrate, if an auditor wants to see who approved an invoice, the data scientist should look for a column named something like APPROVED_BY.

Table 3 illustrates how tables 1 and 2 can be joined to obtain an appropriate table. Note that it is not necessary to merge the rows into one. The rows can also be concatenated underneath the preexisting table, as long as all rows contain the case ID (in this case, INVOICE).

Fig. 4 -​ Preparation and extraction phase

Potential complications and consequences

Potential problems that are faced in the preparation phase are identified below. Some of them have little to no effect on the end result while some do have an effect. When executing this phase, it is important to keep these potential problems in mind. It can help to improve efficiency and the quality of the extracted data before being transformed.

The biggest potential complication is that the joined tables are incomplete. If timestamps in certain tables have been overlooked, these tables might not be included in the data, resulting in a fragmentary event log. The transformation algorithm will not prevent or work around this problem; it is therefore the shared responsibility of the auditor and the data scientist to thoroughly work through the data so as not to miss essential timestamps. When this happens, the external validity of the process model will be affected, since the results cannot be well generalized (Bryman, 2015).

Another possible complication could be that too many events are added to the data. Cases will have timestamps of unnecessary events, resulting in an overcomplicated event log and perhaps a spaghetti process. This complication has few severe consequences. The second component of the framework works as an iterative algorithm and offers the auditor the choice to filter events. This means that it is better to have too many events than too few.

A third potential complication arises when the tables and columns are incorrectly joined. This can occur when the data scientist has insufficient expertise concerning the data model of the ERP system. It results in an incorrect event log, making the data unreliable and affecting reliability (Bryman, 2015). To avoid this problem, it is essential for the data scientist to fully understand the system. This can be tricky, since ERP systems consist of interconnected tables, and it is often unclear how they are related to each other (see Appendix A).

Table 1. Demo table 1

INVOICE   APPROVE_DATE   APPROVED_BY
A1        2018.02.02     John
A2        2018.02.05     John
B5        2018.03.15     Lisa


Table 2. Demo table 2

PAYMENT   INVOICE   PAYDATE
P001      A1        2018.03.21
P002      A2        2018.04.07

Table 3. Example joined tables 1 and 2

INVOICE   APPROVE_DATE   APPROVED_BY   PAYDATE
A1        2018.02.02     John
A2        2018.02.05     John
A1                                     2018.03.21
A2                                     2018.04.07
B5        2018.03.15     Lisa

Transformation phase

In the transformation phase, the prepared data of the previous component is transformed into an event log that can be used in the exploratory phase of an audit. The algorithm is illustrated in the form of pseudo-code, together with an explanation of how it works and how it can be used. Pseudo-code is a notation that resembles a coding language but in fact is not one. It is used to notate algorithms so that others can implement them in their preferred language.

The transformation phase serves to transform ERP data into an event log. The framework will be used in one of the first stages of an audit, for exploratory reasons. This means that at first the auditor has no idea what the process will look like or which events in the process are essential. Since all date fields are extracted in the first component of the framework, there is a chance that some of these fields are unnecessary. To tackle this problem, the transformation algorithm works in an iterative way. The first time it runs, all columns, both with dates and with extra information, are added to the event log. When a process model is generated from this log, the auditor can choose to run the algorithm again, this time skipping unnecessary columns. This results in a different event log and thus a different process model. This iteration can be repeated until the auditor is satisfied with the resulting event log and process model.

It is up to the auditor to decide which process mining tool will be used for mining the process model. Easy-to-use tools like Disco will provide a quick fuzzy model, while open-source tools like ProM Lite provide more formal model notations and the ability to evaluate the processes. Since process mining is not the field of expertise for most auditors, it is advised to use Disco for these exploratory purposes.

Transformation algorithm

Listing 1 shows an abstract and simplified version of the algorithm that was developed. As discussed earlier, the goal of this algorithm is to take an extracted database as input and produce an event log as output. The algorithm is therefore called the Transformation Algorithm (TA), after its main feature of being able to transform the data. This is not its only feature, since it functions as an iterative artifact whose output varies with its input. This input depends on which columns the auditor wants to skip and is therefore called a human variable.

Listing 1.​ Transformation Algorithm

Set Standard Variables
    Extract     = Extracted CSV file from ERP
    SkipColumns = Columns to be skipped for transformation
    Type        = x (appropriate transaction type, will differ per ERP)
    Values      = ∅ (empty list where extra info of invoices will be stored)
    EventLog    = ∅ (empty event log)

Transformation Algorithm
    For Invoice ⊆ Extract
        Line = ∅
        If the TRANSTYPE of Invoice is Type
            For Column ⊆ Invoice
                If Column ∈ SkipColumns
                    Ignore Column
                If Column ∋ Date
                    EventLine = [INVOICENUM | Column | Date]
                    If EventLine ∉ EventLog
                        Insert EventLine into EventLog
                Else
                    Insert Column Variable into Line
            If INVOICENUM ∉ Values
                Insert Line into Values[INVOICENUM]
    For EventLine ∈ EventLog
        Concatenate Values[Line] to EventLine

The first two standard variables function as input to the algorithm. Extract is the extracted CSV file from the ERP and SkipColumns contains the columns that need to be skipped. The first time the algorithm is run, SkipColumns will be empty, but as the auditor iterates, these columns can be filled in to obtain different process models as a result. Type is the required transaction type of the invoice, to be determined by the auditor in collaboration with the data scientist; this could be more than one type. Values starts as an empty set that will be filled with the values of the columns containing extra information per invoice. Extra information is data that is not necessary to make a process model, but provides extra insight into the process. An example is an invoice that is approved on date x by person y. In this case, person y is extra information, since it is not mandatory to add this to the log. EventLog starts as an empty event log and will be filled with both the relevant data for making an event log and the extra information.

The TA then commences by reading all the lines (invoices) of the extract. From the first iteration it is unclear which columns are dates and which are extra information. Line is an empty set that will be filled with the extra information about one invoice when found. It is a temporary variable that is refilled for each invoice iteration.


The TRANSTYPE is the transaction type of the invoice. This should be the same as the earlier defined Type. Transaction types can be many different things, like purchase orders, sales orders, payments, settlements and more.

The TA then iterates through all columns of the invoice. If the column is one to be skipped, the algorithm moves on to the next column. If the column contains a date, the event log is filled with INVOICENUM as the case ID, Column as the event and the contained Date as the timestamp. If it does not contain a date, it contains extra information and is added to the temporary variable Line.

At the end of the iteration of one invoice, Line is added to Values with INVOICENUM as the key. At the end of the iteration of all invoices, Values is concatenated to the event log, mapped on case ID. This means that for each invoice in the event log, the corresponding line of extra information is added. When this is done, the event log is complete.
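To make the walkthrough concrete, the following sketch traces a single (hypothetical) extracted row through the core of the TA; the crude date check stands in for the more careful parsing in Appendix D, and the case and type columns are skipped for clarity.

```python
# One TA pass over a single extracted row (hypothetical column names and
# values, mirroring the demo tables above).
row = {
    "INVOICE": "A1",
    "TRANSTYPE": "3",          # matches the Type chosen by the auditor
    "APPROVE_DATE": "2018.02.02",
    "PAYDATE": "2018.03.21",
    "APPROVED_BY": "John",     # no date -> extra information
}
skip_columns = set()           # empty on the first iteration

event_log, extra = [], []
for column, value in row.items():
    if column in skip_columns or column in ("INVOICE", "TRANSTYPE"):
        continue
    if value.startswith("2018"):          # crude date check, for the sketch only
        event_log.append([row["INVOICE"], column, value])
    else:
        extra.append(value)               # ends up in Values[INVOICENUM]

# event_log == [['A1', 'APPROVE_DATE', '2018.02.02'], ['A1', 'PAYDATE', '2018.03.21']]
# extra     == ['John']  (concatenated to A1's event lines afterwards)
```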

Evaluation

To evaluate the framework, both components discussed in the previous section have been applied to the demo dataset of Microsoft Dynamics AX 2012 (Microsoft, 2012), henceforth referred to as AX2012. AX2012 is one of Microsoft's ERP software systems and functions in this thesis as a case study for evaluation. Even though it contains demo data, the structure of the tables resembles that of a real ERP. The main focus of the evaluation is the transformation algorithm. A short description of the extracted columns and the amount of extracted data is given first. Note that the first phase differs per ERP system and therefore will not be evaluated.

Preparation phase

To evaluate the framework, a case perspective has to be determined. In consultation with an auditor, the perspective of an invoice was chosen. There is no particular reason for choosing invoice over order, except that a decision had to be made because, as discussed earlier, it cannot be both. The relevant tables were selected by finding tables containing case IDs and timestamps. In this dataset, the case is INVOICE. The relevant tables were VENDTRANS, PURCHTABLE and VENDINVOICEJOUR. The relevant columns included all columns with a timestamp plus columns containing additional resource information. In cooperation with the auditor, those resource columns were decided upon. Some of the most relevant ones were APPROVER, TRANSTYPE and AMOUNTCUR. A small sample of the extracted table is shown in Appendix B. See table 4 for general information about the extracted data.

Table 4. General information about extracted data from AX2012

                    Amount
Lines               24073
Unique cases        3525
Timestamp columns   9
Resource columns    88

Transformation phase

In order to evaluate the transformation algorithm, it was first implemented in the Python programming language. The Python script takes a CSV database file as input, together with optional human input (i.e. the columns to be skipped by the algorithm). The output of the algorithm is an event log, also in the form of a CSV file. Since the framework works in an iterative way, there are two evaluations: the first without skipping any columns and the second with some columns skipped. These skipped columns were decided upon in collaboration between an auditor and myself.

When making a process model from the derived event log, different tools and mining techniques can be used. Two of those tools are Disco and ProM Lite. Disco is a commercial process mining tool that is easy to use but has limited functionality. ProM Lite is an open-source tool that has more functionality but also requires more knowledge of how to operate it. Disco is less suited for evaluation, because checking any of the four quality criteria requires comparing an event log to a Petri net. Disco produces a fuzzy model, which does not provide functionality for these evaluations. Therefore, ProM Lite is used for the evaluation of both iterations of the algorithm.

For the purpose of this thesis, two process discovery quality criteria are assessed on the process model: fitness and generalization. Fitness closely relates to internal validity, while generalization closely relates to external validity. Different calculations can be used for assessing the quality of process models (van der Aalst, 2011). In this paper the calculations presented by Buijs et al. (2012, September) are used as quality criteria, because they generalized all equations to fit process trees. These equations are not necessarily the best method to evaluate process models; they are, however, the most convenient and easy to calculate. The fitness score is calculated as follows (Buijs et al., 2012, September):

$$\mathrm{fitness} = 1 - \frac{\text{cost for aligning model and event log}}{\text{minimal cost to align an arbitrary event log on the model and vice versa}}$$

The fitness metric is computed on a scale from 0 to 1, where 1 is optimal. A score of 1 means that all traces from the event log can be replayed by the model, whereas a score of 0.5 means that only half the traces in the event log can be replayed by the model. The second metric, generalization, can only reach 1 in the limit. This means that the more often a node is visited, the closer the value gets to 1, without ever reaching it. The generalization score is calculated as follows (Buijs et al., 2012, September):

$$\mathrm{generalization} = 1 - \frac{\sum_{\mathrm{nodes}} \left(\sqrt{\#\,\mathrm{executions}}\,\right)^{-1}}{\#\,\mathrm{nodes\ in\ tree}}$$

When calculating generalization, first a so-called process tree (Buijs et al., 2012, June) has to be created. A process tree is a way to describe models that, by definition, are always sound (e.g. no deadlocks in the model), unlike, for example, Petri nets. A process tree contains operator nodes and leaf nodes. Operator nodes specify the relation between their children. At the end of all edges are leaf nodes, typically depicted as the events of the process model. As seen in the equation for calculating generalization, it is essential to count the number of nodes in the tree.
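As an illustration of this equation, the following sketch computes the generalization score from hypothetical per-node execution counts of a process tree.

```python
import math

# The generalization metric of Buijs et al. (2012, September), computed from
# hypothetical per-node execution counts obtained by replaying a log on a tree.
executions_per_node = {"seq": 3525, "A": 3525, "B": 3490, "xor": 3525, "C": 2200, "D": 1325}

n_nodes = len(executions_per_node)
penalty = sum(1 / math.sqrt(e) for e in executions_per_node.values() if e > 0)
generalization = 1 - penalty / n_nodes

print(round(generalization, 3))  # approaches 1 as nodes are executed more often
```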


Iteration 1

All 3525 cases make up a total of 125 different traces. Figure 5 shows the distribution of the number of cases over the number of events. Note that the y-axis is plotted on a logarithmic scale. The distribution reveals that 1640 cases have 5 events and 1291 cases have 6 events. This is not a surprising result, since it is expected that most cases (about 85%) follow the a priori process. The right side of the graph can be seen as the tail: cases containing more than 10 events, which might be interesting for an auditor to assess. These cases could be exceptions allowed by the process or errors that should not have happened. The Petri net mined using the inductive mining algorithm can be seen in Appendix B. The inductive miner mines with a noise threshold of 0.3. This means that, in order to keep the mined Petri net understandable and simple, perfect log fitness is not guaranteed. In other words, unique traces that occur only a minimal number of times are not taken into account by the miner. For exploratory purposes this is not a problem, since the goal is to find out the general form of the process model in question. A process model with perfect replay fitness would be unrealistic with data of this size. The auditor can always choose to mine again at a later stage with a different noise threshold.

Fig 5.​ Distribution of the number of cases over the number of events.

The process tree mined with the inductive miner is illustrated in figure 6. Table 5 lists the fitness and generalization values of the model. The fitness is relatively low due to the high noise threshold chosen as input for the miner. This prevents the miner from creating an overly complex process model that would be unreadable for an auditor. The generalization score is close to 1, meaning the model does not overfit the log; this corresponds to the high external validity of the model.


Fig 6. The process tree mined with the inductive miner. (A = TRANSDATE, B = APPROVEDATE, C = DUE DATE, D = DOCUMENTDATE, E = LASTSETTLEDATE, F = CLOSED, G = MODIFIEDDATETIME, H = CREATEDDATETIME)

Table 5. Quality criteria scores of iteration 1

Quality Criterion   Score
Fitness             0.859
Generalization      0.957
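For readers who want to reproduce such an iteration outside ProM Lite, the following sketch uses the open-source pm4py library (assuming its 2.x API); the column headers match the event log produced by the Python implementation in Appendix D.

```python
import pandas as pd
import pm4py

# Load the event log produced by the transformation algorithm.
df = pd.read_csv("eventLog3.2.csv", sep=";")
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df = pm4py.format_dataframe(df, case_id="Case ID",
                            activity_key="Event ID", timestamp_key="Timestamp")
log = pm4py.convert_to_event_log(df)

# Inductive miner with the same noise threshold as used in the thesis (0.3).
net, im, fm = pm4py.discover_petri_net_inductive(log, noise_threshold=0.3)

# Alignment-based replay fitness, comparable to the fitness dimension above.
print(pm4py.fitness_alignments(log, net, im, fm))
```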

Iteration 2

After reviewing the first extracted process model, the auditor decided on the columns that are not interesting to map out in a process model for exploratory purposes. MODIFIEDDATETIME, CREATEDDATETIME and LASTEXCHADJ are the columns left out for the second iteration. The resulting Petri net can be found in Appendix C. All 3525 cases now make up a total of 85 unique traces. Figure 7 shows the distribution of the number of cases over the number of events. Note that here too, the y-axis is plotted on a logarithmic scale.


For this iteration, the noise threshold was also set at 0.3 when mining the process model with the inductive miner. The process tree mined with the inductive miner is illustrated in figure 8. The scores of the quality criteria replay fitness and generalization can be seen in table 6. Both the fitness and the generalization are higher. This is not surprising, since events have been skipped in this iteration, which leads to a simpler process model.

Fig 8. The process tree mined with the inductive miner. (A = TRANSDATE, B = APPROVEDATE, C = DUE DATE, D = DOCUMENTDATE, E = LASTSETTLEDATE, F = CLOSED)

Table 6. Quality criteria scores of iteration 2

Quality Criterion   Score
Fitness             0.999
Generalization      0.992

Conclusion and Further Research

Auditors face the challenge of dealing with ERP systems whose data is growing rapidly. Traditional methods are becoming obsolete for auditing the financial statements generated by these automated processes. Process mining has been introduced in various corporate contexts, but is still missing in the field of auditing. Mining and reconstruction of financial process models can be done using process mining methods. In order to use process mining, data should be in the form of an event log. However, due to the relatively immature technology, most ERP systems are not set up to support this. In this thesis, a framework is provided that transforms regular ERP data into an event log that can be used for the exploratory purposes of an audit. By using this framework in the early stages of an audit, the auditor will be able to map out processes to obtain understanding of and information about the processes, as required by the ISA.

The framework consists of two components: a preparation component and a transformation component. The preparation component, simply put, joins databases together on a pre-decided key (e.g. an invoice). This preparation is done by an ERP specialist and an auditor. The ERP specialist has knowledge about the structure of the ERP system, while the auditor has knowledge about the meaning of the data. The transformation phase focuses on transforming the previously prepared data into an event log. The algorithm responsible for this process works in an iterative way, making sure the auditor can customize the event log by skipping unnecessary events.

The goal of the framework is to transform data into an event log. To evaluate the quality of the event log, the inductive miner was used to mine a Petri net. Two of the four quality criteria, replay fitness and generalization, were assessed on two iterations of the transformation algorithm, revealing that the quality is higher when unnecessary columns are ignored.

Practically, auditors will not evaluate their event logs in this manner. Due to the technical nature of this process, auditors will most likely use simpler tools for their exploratory purposes. Disco is an example of one of these tools. It offers auditors enough functionality to explore processes without needing any process mining knowledge or experience.

Future research

The proposed solution for transforming data into an event log consists of a two-component framework. The first component of the framework, the preparation phase, requires a lot of human resources. Both the auditor and the data scientist have to work together to select the appropriate data. When ERP systems get bigger and more complex, this phase could take a long time and the chance of error increases. A next step for improving this framework would be to automate this part of the framework. This automation would save a lot of time for the data scientist as well as the auditor. A complete automation of this phase is unrealistic, since no two ERP systems are identical. This means that human input will always be necessary when initiating the framework.

Perhaps when the preparation phase is automated, both components can be merged, making it one framework. This would imply that an auditor could use the framework on any ERP system to quickly and easily learn what the processes look like and where exceptions or errors occur within the system or process.

Since the extraction phase still requires human input, it is unclear how much time is needed to perform this phase. Perhaps when ERP systems grow bigger or more complex, it will consume more (or fewer) resources than traditional methods. Further research could investigate the difference in time consumed between using and not using the framework for the exploratory phase of an audit.

Currently, the framework is used for an exploratory purpose. It provides the auditor with an understanding of the processes of the company. However, when actually performing an audit, the auditor needs to use financial process mining to map out the corresponding control flows (Werner & Gehrke, 2015). Can this framework also be used to do this, or is a different artifact needed?

When evaluating the framework, the process quality criteria equations discussed by Buijs et al. (2012, September) were used. Since there are different methods and equations to calculate those quality criteria, there will also be different results. Perhaps in further research, those different equations could be used to determine whether the framework still scores well on these criteria.


References

Audit Manual (2015, November 25). University of Illinois. Retrieved from https://www.audits.uillinois.edu/UserFiles/Servers/Server_700/File/Audit%20Manual/Audit_Manual.pdf

Bernard, G., Boillat, T., Legner, C., & Andritsos, P. (2016). When sales meet process mining: A scientific approach to sales process and performance management.

Bou-Raad, G. (2000). Internal auditors and a value-added approach: the new business regime. Managerial Auditing Journal, 15(4), 182-187.

Brinkkemper, S. (1996). Method engineering: engineering of information systems development methods and tools. Information and Software Technology, 38(4), 275-280.

Buijs, J. C., van Dongen, B. F., & van der Aalst, W. M. (2012, June). A genetic algorithm for discovering process trees. In Evolutionary Computation (CEC), 2012 IEEE Congress on (pp. 1-8). IEEE.

Buijs, J. C., van Dongen, B. F., & van der Aalst, W. M. (2012, September). On the role of fitness, precision, generalization and simplicity in process discovery. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems" (pp. 305-322). Springer, Berlin, Heidelberg.

Bryman, A. (2015). Social research methods. Oxford University Press.

Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: from big data to big impact. MIS Quarterly, 1165-1188.

Debreceny, R. S., & Gray, G. L. (2010). Data mining journal entries for fraud detection: An exploratory study. International Journal of Accounting Information Systems, 11(3), 157-181.

D'Arcy, S. P., & Brogan, J. C. (2001). Enterprise risk management. Journal of Risk Management of Korea, 12(1), 207-228.

Dumas, M., La Rosa, M., Mendling, J., & Reijers, H. A. (2013). Fundamentals of business process management (Vol. 1, p. 2). Heidelberg: Springer.

Gehrke, N., & Mueller-Wickop, N. (2010, August). Basic Principles of Financial Process Mining: A Journey through Financial Data in Accounting Information Systems. In AMCIS (p. 289).

Gramling, A. A., Nuhoglu, N. I., & Wood, D. A. (2013). A descriptive study of factors associated with the internal audit function policies having an impact: Comparisons between organizations in a developed and an emerging economy. Turkish Studies, 14(3), 581-606.

Günther, C. W., & van der Aalst, W. M. (2007, September). Fuzzy mining: adaptive process simplification based on multi-perspective metrics. In International Conference on Business Process Management (pp. 328-343). Springer, Berlin, Heidelberg.

International Federation of Accountants (IFAC). (2012). Identifying and Assessing the Risks of Material Misstatement through Understanding the Entity and Its Environment. International Standards on Auditing 315.

Jans, M., Alles, M., & Vasarhelyi, M. (2013). The case for process mining in auditing: Sources of value added and areas of application. International Journal of Accounting Information Systems, 14(1), 1-20.

Jans, M., Alles, M. G., & Vasarhelyi, M. A. (2014). A field study on the use of process mining of event logs as an analytical procedure in auditing. The Accounting Review, 89(5), 1751-1773.

Leemans, S. J., Fahland, D., & van der Aalst, W. M. (2013, August). Discovering block-structured process models from event logs containing infrequent behaviour. In International Conference on Business Process Management (pp. 66-78). Springer, Cham.

March, S. T., & Smith, G. F. (1995). Design and natural science research on information technology. Decision Support Systems, 15(4), 251-266.

Microsoft (2012). Microsoft Dynamics AX [2012]. Retrieved from https://mbs.microsoft.com/customersource/northamerica/AX/downloads/service-packs/MicrosoftDynamicsAX2012R2

Van der Aalst, W. M. (2011). Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, Berlin, Heidelberg.

Oikawa, M. K., Ferreira, J. E., Malkowski, S., & Pu, C. (2009, September). Towards algorithmic generation of business processes: From business step dependencies to process algebra expressions. In International Conference on Business Process Management (pp. 80-96). Springer, Berlin, Heidelberg.

Österle, H., Becker, J., Frank, U., Hess, T., Karagiannis, D., Krcmar, H., ... & Sinz, E. J. (2011). Memorandum on design-oriented information systems research. European Journal of Information Systems, 20(1), 7-10.

Parkinson, M. (1999). Presenter at the Institute of Internal Auditors Educators Symposium, 20 October. Sydney, Australia.

PEMPAL (2014). Risk Assessment in Audit Planning: A guide for auditors on how best to assess risks when planning audit work. PEMPAL Internal Audit Community of Practice.

Penn, S. (2018). Six-Step Audit Process. Chron. Retrieved from http://smallbusiness.chron.com/sixstep-audit-process-17816.html

Petri, C. A., & Reisig, W. (2008). Petri net. Scholarpedia, 3(4), 6477.

Reichert, M., & Weber, B. (2012). Enabling flexibility in process-aware information systems: challenges, methods, technologies. Springer Science & Business Media.

Rozinat, A., & van der Aalst, W. M. (2008). Conformance checking of processes based on monitoring real behavior. Information Systems, 33(1), 64-95.

Suriadi, S., Andrews, R., ter Hofstede, A. H., & Wynn, M. T. (2017). Event log imperfection patterns for process mining: Towards a systematic approach to cleaning event logs. Information Systems, 64, 132-150.

The Audit Process. (n.d.). Retrieved from https://internalaudit.ku.edu/project-process

Vaassen, E., & Koopman, A. (2016). Handen en voeten aan process mining. Management Control & Accounting.

Van der Aalst, W., Adriansyah, A., De Medeiros, A. K. A., Arcieri, F., Baier, T., Blickle, T., ... & Burattin, A. (2011, August). Process mining manifesto. In International Conference on Business Process Management (pp. 169-194). Springer, Berlin, Heidelberg.

Van der Aalst, W. M., De Medeiros, A. A., & Weijters, A. J. M. M. (2005, June). Genetic process mining. In International Conference on Application and Theory of Petri Nets (pp. 48-69). Springer, Berlin, Heidelberg.

Van der Aalst, W. M. (2011). Process Discovery: An Introduction. In Process Mining (pp. 125-156). Springer, Berlin, Heidelberg.

van der Aalst, W. M. (2011, April). Process mining: discovering and improving Spaghetti and Lasagna processes. In Computational Intelligence and Data Mining (CIDM), 2011 IEEE Symposium on (pp. 1-7). IEEE.

Van der Aalst, W. M., & Koopman, A. (2015). Proces mining: data analytics voor de accountant die wil weten hoe het nu echt zit. MAB, 89, oktober.

What Is ETL? (n.d.). Retrieved from https://www.sas.com/en_us/insights/data-management/what-is-etl.html

Werner, M., & Gehrke, N. (2015). Multilevel process mining for financial audits. IEEE Transactions on Services Computing, 8(6), 820-832.

Werner, M. (2017). Financial process mining: Accounting data structure dependent control flow inference. International Journal of Accounting Information Systems, 25, 57-80.

Weijters, A. J. M. M., van der Aalst, W. M., & De Medeiros, A. A. (2006). Process mining with the heuristics miner-algorithm. Technische Universiteit Eindhoven.


Appendix

Appendix A - ER Diagram of ERP system used as case study

Appendix B - Petri net of the first iteration

Appendix C - Petri net of the second iteration

Appendix D - Python code of algorithm

import csv
import datetime


def opencsv(csvfile):
    # Read the extracted CSV into a list of rows; the first row holds the headers.
    with open(csvfile, 'r', newline='') as f:
        reader = csv.reader(f, delimiter=';')
        importedFile = list(reader)
    headers = importedFile[0]
    return importedFile, headers


def makeLog(importedFile, headers, skipColumns):
    print('Making log')
    # Look up the positions of the key columns.
    transtype = headers.index("TRANSTYPE")
    invoicePos = headers.index("INVOICE")
    eventLog = [['Case ID', 'Event ID', 'Timestamp']]
    invoicecounter = 1
    cleanHeaders = []   # headers of the extra-information columns
    values = {}         # extra information per invoice (Values in Listing 1)

    # Loop through the extracted file, one invoice (row) at a time.
    for invoice in importedFile:
        num = 0
        line = []
        if invoice == headers:
            continue
        # Rows without an invoice number get a generated case ID.
        if invoice[invoicePos] == '':
            invoiceNum = 'NoInvoice' + str(invoicecounter)
            invoicecounter += 1
        else:
            invoiceNum = invoice[invoicePos]
        # Transtype 3 = order, transtype 24 = settlement.
        if invoice[transtype] == '3':
            for column in invoice:
                if headers[num] in skipColumns:
                    num += 1
                    continue
                timeColumn = checkIfDateTime(column)
                if timeColumn != 0 and timeColumn != 1:
                    # A date: add an event line (case ID, event, timestamp).
                    eventLog = addToLog(eventLog, [invoiceNum, headers[num], column])
                elif timeColumn == 1:
                    # Not a date: collect as extra information.
                    if headers[num] not in cleanHeaders:
                        cleanHeaders.append(headers[num])
                    line.append(column)
                num += 1
            if invoiceNum not in values:
                # Store the extra information for this invoice (first occurrence wins).
                values[invoiceNum] = line

    # Concatenate the extra-information values to the event lines.
    eventLog[0] += cleanHeaders
    for row in eventLog:
        if row[0] == 'Case ID':
            continue
        row += values[row[0]]
    return eventLog


def checkIfDateTime(value):
    # Returns a datetime if the value is a usable timestamp,
    # 0 if it is a placeholder date (year 1900), and 1 if it is not a date.
    try:
        time = datetime.datetime.strptime(value, '%Y-%m-%d %H:%M:%S.%f')
        if time.year == 1900:
            time = 0
    except ValueError:
        time = 1
    return time


def addToLog(eventLog, row):
    # The event log cannot contain exactly the same row twice.
    if row not in eventLog:
        eventLog.append(row)
    return eventLog


def exportToCSV(eventLog, name='eventLog3.2.csv'):
    print('Exporting to csv')
    with open(name, 'w', newline='') as f:
        writer = csv.writer(f, delimiter=';')
        writer.writerows(eventLog)


def whatToSkip(headers):
    # Ask the auditor which columns to skip (the human variable).
    while True:
        print('Which columns would you like to skip? Separate each column with a '
              'space. Type "?" to see all column names\n')
        columns = input("")
        if columns == '?':
            for header in headers:
                print(header)
        else:
            break
    return columns.split()


if __name__ == "__main__":
    csvfile = 'roughExtract.csv'
    importedFile, headers = opencsv(csvfile)
    columns = whatToSkip(headers)
    eventLog = makeLog(importedFile, headers, columns)
    exportToCSV(eventLog)
