MASTER'S INFORMATION STUDIES
BUSINESS INFORMATION SYSTEMS

Automated and Flexible Error Detection for Mobile Expense Reporting

Author: Simon Widler (10850279)
Supervisor: Dr. Ilya Markov
July 19, 2015


Automated and Flexible Error Detection for Mobile Expense Reporting

Simon Widler

University of Amsterdam, Amsterdam, Netherlands

Abstract

SecuReceipt BV, founded in 2010, provides an innovative digital service for mobile expense reporting. To satisfy the needs of a diverse customer field, a generic approach to handle business rules and cover erroneous or fraud-related behavior is desired. To solve the problem, a set of business rules was extracted from historical data and suitable solutions were chosen. First, a Business Rule Management System was implemented. It proved that for the sampled customers a surprisingly large number of expenses had been falsely approved; the historical data was then rectified and brought to a state suitable for Machine Learning. Using the modified datasets, several Machine Learning algorithms were implemented to learn the business rules from the data and then classify new expenses according to the rules. All algorithms were then tested and evaluated. Testing on the data of three representative customers, the DecisionTree-based C4.5 algorithm significantly (p < 0.05) outperformed DecisionTables, Multilayer Perceptrons and Support Vector Machines in F-Measure (0.96) and AUC (0.98). All classifiers were optimized for the false negative rate; C4.5 achieved a false negative rate of only 3%. Despite the good results, a Business Rule Management System is recommended as a first step, as the current technical infrastructure of SRXP does not allow large-scale Machine Learning across all customers.

Keywords: Error Detection, Business Rules, Machine Learning, Business Rule Management Systems

1 Introduction

1.1 Business case

Expense management and the associated expense reporting process can be found in nearly all companies, independent of size or industry. Allowing employees to request a refund for costs incurred for company-related activities, the process usually ends with a reimbursement from the company to the employee. This process is commonly paper based, as until recently the original receipts needed to be physically stored for auditing reasons. However, a change in Dutch legislation removed the need to store physical receipts, making digital copies sufficient [Belastingdienst 2012]. This change allowed a digitalization of the expense reporting process.

An early adopter of this change was SecuReceipt BV, further referred to as SRXP. Founded in 2010, SRXP provides an innovative service for mobile expense reporting, with more than 2000 customers worldwide. Having provided apps for several platforms, an online portal and various integrations with financial systems, SRXP facilitates an entirely new process. Employees who want to report expenses can now simply take a picture of a receipt with their smartphone from within the SRXP mobile app and then throw the paper receipt away. The pictures are automatically synchronized with the online portal, where employees add information and send their expenses to their manager for approval. After the approval, the expenses are exported from the SRXP application to the financial system of a customer, replacing the manual transfer from paper to the system. By exporting the data to the financial system, a booking is created automatically, which then starts the reimbursement process.

This new process minimizes the duration between the creation of an expense and the reimbursement to employees, reduces paper waste and saves manual effort. Handing in expenses becomes easier for employees, managers do not have to spend their time on paperwork for their employees, and the audit team no longer has to transfer the data from paper to the financial system.

The process, of course, is not always this straightforward and also includes cases in which expenses have to be rejected. Rejection reasons vary between companies, but also within a company, and stem from diverse sources. Those sources include national and international laws, bookkeeping guidelines and business policies, all summarized as business rules in the following. Due to the ease and speed of the process, customers fear that the business rules in place are less likely to be checked by approvers and the audit team. Expenses that violate any business rules are understood as erroneous expenses. If erroneous expenses are exported to financial systems and stay undetected, they cause economic damage. If they are detected, they need to be corrected manually, causing manual effort. Basic error handling was implemented by SRXP right from the start, for example to prevent incomplete expenses from being reported, but this was only the start. Open questions arose, such as defining which fields are mandatory. For governmental institutions, for example, reporting taxes is not necessary, while corporations rely heavily on reporting taxes accurately. Currently, the tax field is mandatory for all customers, leaving governmental institutions with the inconvenience of always having to choose a tax rate of zero percent to create a complete expense. In this case the result is an inconvenience; other cases might lead to financial loss. For example, industrial-sized companies allow higher costs for travel and accommodation than a small business, so a threshold cannot be defined centrally by SRXP. Multinational companies have different tax needs than a local business, and SRXP cannot restrict either of them in their possibilities. Customers might also allow or restrict certain categories for certain employees, based on their position or location. The number of examples seems endless, leaving SRXP to offer an extremely customizable solution that creates much freedom. However, customers do not actually want their employees to have all this freedom and need to restrict their behavior according to their business rules. During implementation meetings, questions about an automated check of expenses against internal business rules were raised frequently. With a growing customer base, the need for configuration options was growing as well.

By SRXP's philosophy, no customer-specific implementations are made, raising the question of how the needs of all customers can be supported. With more than 200,000 stored receipts, in theory all available business rules are represented by the stored data, given the attributes of an expense and its status. With the trend of Big Data analysis and the possibilities of artificial intelligence, it should be investigated whether automatically learning and applying business rules can be realized across the whole customer base. Automatically learning the business rules from historical data and using them to approve or reject new expenses would not only help current customers but is also expected to be a great feature for marketing and sales purposes.


1.2 Research Questions

The goal of SRXP is to deliver a reliable solution to check compliance with business rules from within the application, without the need for customer-specific implementations. Therefore, a solution needs to be designed that supports the approver and audit teams of each company in detecting erroneous data according to their business rules. Automating this process is considered the ideal solution, provided its quality can be assured. From this goal, the following research questions were derived to guide the research process.

RQ1 Which kinds of business rules need to be addressed?

RQ2 How can those business rules be checked reliably and automatically?

RQ3 How can an integrated solution be achieved?

1.3 Research Method and Findings

Based on a literature study to collect the necessary knowledge about business rules and Machine Learning, a case study and several experiments were set up. Three customers, chosen from the ten biggest customers, were investigated in detail, providing over 40,000 receipts. Their datasets were manually analyzed to find the problems that were mentioned by customers and to answer RQ1. The most common rejection reasons were caused by expenses that violate financial business rules. During the research, it became apparent that Machine Learning cannot be applied directly without extensive data rectification, as for technical reasons no usable rejected expenses were found in the data. To clean the data, a Business Rule Management System was implemented to determine whether an expense should be rejected or approved. While applying the rules of the BRMS to the datasets, a remarkable number of erroneous expenses was found that had been approved nevertheless. In an experimental setup, those expenses were then used as negative examples to train multiple Machine Learning algorithms for classification. After applying optimization techniques, Machine Learning successfully identified expenses that should be rejected according to their attributes, outperforming the manual audit process by far. With the successful implementation of both a Business Rule Management System and ML approaches, RQ2 was answered. All findings and the learning process were then used to give recommendations, suggesting solutions for RQ3.

1.4 Contributions and Outline

The following contributions were achieved with the research:

• First cross-customer analysis of historical data for rejection reasons, with a condensed illustration of the found business rules and the associated attributes that have to be used for process automation.

• Design and implementation of a Business Rule Management System to rectify the datasets, demonstrating that the current audit process does not fulfill the expectations regarding quality.

• Design, implementation and evaluation of various Machine Learning algorithms on sampled data, outperforming the manual process in place by far.

• Identification of implementation obstacles and recommendations on future steps to realize an integrated solution.

The remainder of this work is structured as follows. Chapter 2 elaborates on the necessary background and terminology for business rules and Machine Learning and sets the scope of the research. Chapter 3 reviews recent research on the topic and summarizes the findings related to business rules and Machine Learning used to shape the research at hand. Chapter 4 examines the findings of the initial data analysis in detail and elaborates on the found business rules, but also on constraints and problems encountered. At this point the descriptive research phase finishes. We enter the experimental phase, putting the findings into practice. Chapter 5 illustrates the rectification of the datasets using a BRMS, while Chapter 6 examines the experimental setup for Machine Learning. Chapter 7 presents the experimental outcomes and leads to the final part of the paper. The last part summarizes the found results and gives recommendations for concrete next steps in Chapter 8 and draws a conclusion in Chapter 9.

2 Background and Scoping

The first chapter framed the initial situation and the motivation of the research, giving insight into the business of SRXP. The main part of the research focuses on Machine Learning and Business Rule Management Systems, and uses the corresponding terminology. To create a shared understanding of the terminology used in the analysis and experiment, the related topics are explored briefly. After business rules and Business Rule Management Systems are shortly discussed, Machine Learning is explored. Here, three main subtopics are elaborated, related to the components of the framework: the algorithm, the features and the datasets used. Afterwards, optimization techniques, identified limitations and the concrete implementations of the used systems are introduced.

2.1 Business Rules

Hay et al. [2000] give the following definition: "A business rule is a statement that defines or constrains some aspect of the business. It is intended to assert business structure or to control or influence the behavior of the business". Moreover, two perspectives can be separated, a business perspective and an information system perspective. The business perspective focuses on the constraints that apply to the behavior of people in the enterprise. The information system perspective focuses on the facts that are recorded as data and constraints on changes to the values of those facts [Hay et al. 2000]. Violation of those business rules can happen by mistake, or on purpose. The latter case is defined as fraud, as "it is an intentional deception, misappropriation of a company's assets or manipulation of its financial data to the advantage of the perpetrator" [Hall 2010].

2.2 Information System

2.2.1 Business Rule Management Systems

Business Rule Management Systems (BRMS) are information systems that can be used flexibly to implement and maintain business rules without the necessity of software development. A BRMS usually consists of several components. One of them is the set of isolated rules, the so-called rule repository, which can be created and maintained by domain experts. A rule engine then processes the externalized rules to determine the state of a business object according to the rules. This setup allows decentralized maintenance of rules, minimizing communication, development and maintenance efforts while increasing process automation [OpenRules 2015].
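To make the repository/engine split concrete, the minimal Java sketch below separates externalized rules from a generic engine that applies them to a business object. All names here are hypothetical; the system actually used later in this research is OpenRules, which externalizes its rules into Excel sheets.

```java
import java.util.List;

// A minimal sketch of the BRMS split described above (names hypothetical).
interface BusinessRule<T> {
    String name();
    boolean isCompliant(T subject);
}

final class RuleEngine<T> {
    private final List<BusinessRule<T>> repository; // the externalized rule repository

    RuleEngine(List<BusinessRule<T>> repository) {
        this.repository = repository;
    }

    // A business object is approved only if it violates none of the rules.
    boolean approve(T subject) {
        return repository.stream().allMatch(rule -> rule.isCompliant(subject));
    }
}
```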

2.2.2 Machine Learning

In general, Machine Learning (ML) systems are supposed to "learn" computer programs from raw data. ML approaches are attractive in cases where manual implementation is expected to be too complex. A wide variety of ML approaches exists, though exploring them in detail is not in the scope of this research. Classification problems like classifying an expense as approved or rejected are called binary classification problems and are common in ML. ML approaches are expected to be more cost efficient than manual implementation and are heavily researched. To keep the paper concise, only the parts relevant to this research are described. In short, binary classifiers use a set of input features, where we distinguish continuous and discrete feature values, to output a single discrete value, the predicted class [Domingos 2012].

Types of Machine Learning According to the "no free lunch" theorem, there is not one approach that is superior to all others for all given problems; for each problem, a different approach may turn out to be the best suited [Domingos 2012]. Hence a variety of approaches needs to be tested in this research. Domingos [2012] defines five classes to categorize approaches of Machine Learning. One representative algorithm was chosen for each class. The Multilayer Perceptron, further referenced as MLP, represents the class of neural networks. As a rule-based classifier, DecisionTables were used (DT). As a simple probabilistic classifier, naive Bayes was chosen (nBayes). Instance-based classifiers are represented by a Support Vector Machine (SVM). For DecisionTree classifiers, an implementation of the C4.5 algorithm was used.
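For orientation, the five representatives map onto implementations in Weka, the toolkit used in this research (see section 2.3), as in the sketch below. The instances are shown with Weka defaults; the thesis does not document its exact parameter settings.

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMO;
import weka.classifiers.rules.DecisionTable;
import weka.classifiers.trees.J48;

public class CandidateClassifiers {
    // One representative per class of approaches, instantiated with Weka defaults.
    public static Classifier[] all() {
        return new Classifier[] {
            new J48(),                  // C4.5 decision tree
            new DecisionTable(),        // rule-based classifier (DT)
            new MultilayerPerceptron(), // neural network (MLP)
            new NaiveBayes(),           // simple probabilistic classifier
            new SMO()                   // support vector machine (SVM)
        };
    }
}
```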

Feature selection In Machine Learning, selecting the right number of features, and the right features in particular, is crucial for the performance of the classifier. For each classification problem, there is an optimal number of features. Increasing the number of features beyond it will degrade rather than improve the performance of the classifier, known as the curse of dimensionality [Domingos 2012]. Simply using all available attributes is therefore not the right way to approach this problem. Commonly known methods to reduce the number of features are "feature selection" and "feature extraction". Whereas feature extraction meaningfully summarizes features to reduce their number, feature selection removes redundant and irrelevant information [Kittler 1986]. As the question of meaningful summarization might differ from customer to customer, no feature extraction was used for the experimental setup.

The data To train a classifier, training data needs to be available. The smaller the available dataset, the harder it is to learn meaningful things from it. In a real-world environment, assuming data availability is not a problem, noise and/or a strong bias in the data usually is. Bias in the data means that the classes are not equally distributed; one class usually tends to be much bigger than the others. Those two issues can be approached in different ways. To remove noise, the dataset can be preprocessed. This includes identifying and removing irrelevant or misleading information. To solve the issue of a biased dataset, the dataset can again be preprocessed, removing redundant information from the bigger class, or adding information to the smaller class [Longadge and Dongre 2013].

Optimization Techniques Classifiers themselves offer possibilities to be optimized for a given dataset and problem in various ways. The technique used here is called cost-sensitive learning. Cost-sensitive learning can be used to attach costs of wrong predictions to classes. In our case, expenses that were classified as approved while violating the rules, the false negatives, cause higher costs than false positives. Research suggests that "preprocessing provides a better solution than other methods because it allows adding new information or deleting the redundant information" [Longadge and Dongre 2013]. Both approaches were tested, as the paper further suggests "that applying two or more technique i.e. hybrid approach gives a better solution for the class imbalance problem".
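As a sketch, cost-sensitive learning can be realized in Weka by wrapping a base classifier in a CostSensitiveClassifier. The false-negative cost of 10 and the class order are illustrative assumptions; the thesis does not report its exact weight values.

```java
import weka.classifiers.CostMatrix;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;

public class WeightedClassifier {
    public static CostSensitiveClassifier build() throws Exception {
        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48());
        // Rows are actual classes, columns predicted ones. Assuming the class
        // order {rejected, approved}, cell (0,1) is a rejected expense predicted
        // as approved, i.e. a false negative, which receives the highest cost.
        csc.setCostMatrix(CostMatrix.parseMatlab("[0.0 10.0; 1.0 0.0]"));
        return csc;
    }
}
```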

Limitations Besides the problem of strongly biased datasets, another problem was identified for the research. This is the case when the reasons for rejection are not represented in the facts that are stored with an expense. One frequent example, found in conversations with domain experts, is that the manually entered data does not match the actual receipt. This might be a simple mistake while entering the amount of the receipt, choosing the wrong category (choosing category "Hotel" for a flight ticket receipt), entering the wrong currency and so on. The consistency of the reported expense with the picture of the receipt is currently checked, or in the worst case not checked, by the approver and the audit team. Automating this step requires computer-based reading of the receipt pictures, known as Optical Character Recognition (OCR). Even though OCR is a strongly researched topic, the current state of acquirable OCR solutions is not sufficient to integrate into SRXP and the experimental setup. This has been researched by SRXP in the past and hence is out of the scope of this paper.

2.3 Implementation

For both ML and BRMS a wide variety of implementation options exists, ranging from open source and GPL-licensed frameworks over simple commercial products for SMEs to industrial solutions coping with huge amounts of data and rules. For the purpose of reproducibility of the research, only open source options were chosen. For the chosen ML approaches, Weka was used as an open source implementation, offering training, testing and evaluation of the performance of several classifiers [Witten et al. 2011]. For the BRMS, the tool OpenRules by OpenRules, Inc. was used. OpenRules uses Microsoft Excel spreadsheets as rule repositories and a Java framework with Eclipse IDE integration for the rule engine [OpenRules 2015].

3 Related Work

ML approaches can be used for a variety of purposes, so a lot of work can be found on those topics. As several classifiers were to be compared for the ML approaches, the findings of the related research were used to support the experimental setup.

Caruana and Niculescu-Mizil [2006] compared ten different learning algorithms over a variety of datasets, trying to evaluate their performance. The findings agree with the no free lunch theorem, stating that there is not a single best algorithm for all likely problems. On average, however, "calibrated boosted trees were the best learning algorithm overall. Random forests are close second, followed by uncalibrated bagged trees, calibrated SVMs, and uncalibrated neural nets." Those findings also cohere with the findings of Entezari-Maleki et al. [2009], who compared different approaches focusing on the number of features, feature type, and sample size. The results are summarized as "Four classifiers DT, k-NN, SVM and C4.5 obtain higher AUC than three classifiers LogR, NB, and LC."

Kliegr et al. [2014] investigated the use of Machine Learning for business rules as well, however with a different goal. Their research focused on discovering implicit business rules from data and making them visible and understandable for humans. The discovered rules were then implemented in a BRMS. For this purpose, different algorithms, so-called association rule classifiers, are used, solving the main problem of potentially contradictory rules generated by some rule learners. However, as the business rules here are explicitly known by the customer, this problem is not relevant. Therefore, association rule classifiers are out of scope.

Verver [2014] gives a ten-step starting guide for fraud and error detection using data analysis approaches. He begins by identifying and defining fraud risks and defining a dedicated fraud detection test for each risk. This approach was applied in the research at hand. According to Verver [2014], another central aspect of automated data analysis is the organizational ownership of the solution. As SRXP provides the infrastructure for customers to store and maintain their data securely within their SRXP application, the ownership of any new infrastructure needs to stay with SRXP as well. Therefore, including third parties in the planned architecture is not desired. However, it has to be mentioned that contemporary SaaS solutions exist that are already tailored to the purpose of fraud detection. Siftscience specializes in fraud detection as a service, providing a RESTful API and using Machine Learning to adapt constantly to fraudulent behavior. BigML, on the other hand, offers a more general service for Machine Learning, also providing a modern API.

Discussion and Decisions While the background introduced the terminology for BRMS and ML, the related work directly influences the setup of the research. Using the classification of Domingos [2012] and the experimental results of Caruana and Niculescu-Mizil [2006], the following five classifiers were chosen to be compared: Support Vector Machines, Multilayer Perceptron, C4.5, DecisionTables and naive Bayes. As suggested by Longadge and Dongre [2013], multiple combinations of optimization techniques will be tested to improve the classifier performance. Moreover, the first cases were identified that currently cannot be handled at all, at least not from an automation point of view. A consistency check of the picture of the receipt against the entered data cannot be realized without OCR. This topic is expected to be solved separately, so we can assume that all stored data represents the receipt correctly.

4 Initial Analysis

At the time the research was conducted, SRXP had more than 2000 customers worldwide with more than 200,000 expenses submitted using the application. However, the ten biggest customers are responsible for more than 50% of the expenses, demonstrating the differences in company size in the customer field of SRXP. Given the nature of the diverse customer field, SRXP needs to provide a broad range of configurations. Customers can set up the product in their preferred way and adjust it to internal processes. To understand which attributes of an expense can and should be used for ML / BRMS, the available attributes of expenses defined by SRXP are introduced, together with a detailed process overview. Afterwards, the business rules found in the data analysis are elaborated together with their associated facts to understand how the stored facts can be used to determine the status of an expense.

4.1 Reports and Expenses

Expenses are created and gathered by employees using the online portal. There the expenses are bundled into a report and submitted by the employee. To simplify the workflow, each expense is seen as an isolated instance. An explanation of why expenses are not used in the context of a report can be found in the next section. An expense, when handed in, contains the following attributes: Audit Code, Date, Currency, Amount, Category, Tax Code, Total, Payment Method, Description and Project. Whenever an expense gets rejected, the approver fills in a field "motivation" to explain why the expense was rejected. Therefore, this field can be used as an indicator of whether and why an expense was rejected during the process. Figure 1 visualizes a simplified version of the process. As an expense is always connected to a user, more fields are available afterwards for validation. For a user, a job title and an establishment can also be configured.

Figure 1: Simplified Expense Reporting Process

All attributes are briefly explained in the following, supported by examples where necessary, and gathered into a single record sketch after the list.

Audit Code Unique hexadecimal code linking to the picture of the receipt.

Date Calendar date when the receipt was issued.

Currency Currency in which the receipt was paid.

Amount Amount paid on the receipt. An expense can have several amounts, based on the items of a receipt. For example, a hotel bill contains one amount each for lunch, cleaning service and accommodation. Those amounts then get different categories; see the next point.

Category An expense amount has one category code, which is mapped to a general ledger account.

Tax Code The applicable tax rate that was paid. This is particularly important for international customers. Variable tax rates can be set up to support the VAT reclaim process.

Total The total of all amounts of an expense.

Payment Method How the bill was paid. The allowed payment methods a user can choose from are usually "Cash", "Pinpass" and "Creditcard" (private or company card).

Description A free-text description of the expense.

Project An expense can be assigned to a project for reporting purposes.

Establishment Office/country where an employee is located.

Job Title Function of an employee.
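Gathered into one record, the attributes read as follows. The field names are hypothetical, as SRXP's actual schema is not published.

```java
import java.time.LocalDate;
import java.util.List;

// The expense attributes listed above as one hypothetical record.
public class Expense {
    String auditCode;        // hexadecimal link to the receipt picture
    LocalDate date;          // calendar date the receipt was issued
    String currency;
    List<Double> amounts;    // one receipt can carry several amounts...
    List<String> categories; // ...each with its own general ledger category
    String taxCode;
    double total;            // sum of all amounts
    String paymentMethod;    // "Cash", "Pinpass" or "Creditcard"
    String description;
    String project;
    String motivation;       // filled by the approver only on rejection
}
```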

The Job Title is of particular interest, as other attributes can be attached to a Job Title in the current application of SRXP. This was a first attempt at implementing business rules within the application. For a job title, certain dependencies can be configured. Customers can assign an employee to a Job Title and restrict the usage of categories and payment methods for all employees with this Job Title. Thus user-related fields were not checked in this research.

The available values for all fields, except for the amount and date, are set up during the roll-out of SRXP, and various constraints apply for each customer. Therefore, a generalization is not easily possible. Each customer was investigated separately, and common features were extracted to create a universally valid application. The results of this analysis can be found in the following sections.

4.2 From Motivation to Features

The field "motivation" is mandatory when rejecting an expense. The input field is currently a free-text field, which means that the rejection reasons are entered manually. Here not only the languages vary but also the style of feedback, so no automatic analysis was possible. The motivations were checked manually. The following approach was used to choose the features for Machine Learning: the motivation field illustrates the business rule applied when rejecting an expense, and the business rule leads to the facts that can be used as features. As explained in section 2.2.2, only feature selection was used.

The following are some representative examples found in the rejected expenses of the ten biggest customers; English translations of the Dutch motivations are given in brackets. Three categories were identified to cluster the motivations.

Picture related

"bon niet leesbaar" ["receipt not readable"]
"datum verkeerd" ["wrong date"]

Report related

"Remove expenses September..."
"Please can you complete the credit card seperate from the cash and also the credit card details need to be whole and not partial. thanks"
"double"

Financial policy related

"Kantoorbenodigdheden, autokosten en parkeerkosten moeten onder 21% staan." ["Office supplies, car costs and parking costs must be filed under 21%."]
"BTW moet 0% zijn. Maximale bedrag is 7 euro." ["VAT must be 0%. The maximum amount is 7 euros."]
"BTW percentages zijn niet correct ingevuld. Vraag Max hoe het moet. Eten is bijvoorbeeld 0% en geen 6%. Gr. Bas" ["The VAT percentages are not filled in correctly. Ask Max how it should be done. Food, for example, is 0% and not 6%. Regards, Bas"]
"Brandstof is 21% BTW" ["Fuel is 21% VAT"]
"maximaal 7 euro per dag" ["a maximum of 7 euros per day"]

Picture-related rules cannot be checked, as elaborated in 2.2.2, because no suitable OCR technology exists. Report-related rules are also out of scope, for the following reason: the example of not reporting expenses with different payment methods together was found frequently, but this relationship is not represented in the facts related to an expense. This feature would need to be created, so we exclude this rule. The particular problem of duplicates can be addressed nevertheless, as for duplicates only expense attributes are necessary. How this problem was approached can be found in 5.3. The financial policy related rules were formulated as follows; their concrete implementation can be found in the next chapter, and a code sketch of the first two rules follows the list.

• Max amount per category: For certain categories, e.g. lunch, a maximum allowed amount is defined. The amount is either the expense amount or the expense total. Therefore, the continuous feature "expense amount" was used, together with the nominal feature "category".

• Limited tax code per category: For certain categories, e.g. "public transport", only a particular tax code is allowed. This constraint usually affected several categories, leading to multiple rules. Therefore, the nominal feature "tax code" was used as well.

• No duplicated expense reporting: Each expense can only be reported once. This was no problem before the digitalization of the process, as the physical paper receipt was attached to an expense. Now, several employees can take pictures of the same receipt and hand it in as a fraud attempt.
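A minimal sketch of the first two rules as executable checks, using only the features named above. The concrete values are illustrative, mirroring the customer A examples in section 5.2 (category "7564" = lunch, capped at 700 cents); in practice each customer would maintain its own maps.

```java
import java.util.Map;
import java.util.Set;

// The two financial-policy rules as plain checks (values illustrative).
public class FinancialPolicyRules {
    static final Map<String, Integer> MAX_AMOUNT_CENTS =
        Map.of("7564", 700);          // "7564" = Lunch, maximum 7 euros
    static final Map<String, Set<String>> ALLOWED_TAX_CODES =
        Map.of("7564", Set.of("0%")); // illustrative tax restriction

    static boolean maxAmountOk(String category, int amountCents) {
        Integer limit = MAX_AMOUNT_CENTS.get(category);
        return limit == null || amountCents <= limit;
    }

    static boolean taxCodeOk(String category, String taxCode) {
        Set<String> allowed = ALLOWED_TAX_CODES.get(category);
        return allowed == null || allowed.contains(taxCode);
    }
}
```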

4.3 Summary

With the introduction of the found business rules, their associated features and facts were extracted. Deciding on an expense-focused view was an important step in keeping the application generic, not creating new features that only certain customers would use. Even though the number of features and rules seems very small, those were repeatedly found throughout all customers. For the next component of ML, the training data, the defined features were selected from the databases of the ten biggest customers to create the datasets.

5 Data Preparation

5.1 Rectifying the Data

After extracting the business rules and features, the datasets for the experiment were created and analyzed. Not all ten customers were used; three anonymized customers, referred to in the following as customers A, B and C, were chosen. The customers selected provided, at the time of the experiment, 41,972 expenses. Table 1 lists the available expenses, with the companies' size represented by the number of employees using the mobile app. To create comparability for the experiment, only the newest 10,000 receipts were used for customers A and C.

Customer   Active users   Expenses
A          125            18838
B          44             7670
C          118            15464

Table 1: Customers for Experimental Setup

Unfortunately, the infrastructure of SRXP does not support versioning. Put simply, when an expense is changed anywhere in the process, the change is directly applied to the database. There is, unfortunately, no way to see the change history of an expense. This means that for the process flow illustrated in figure 1, the latest version of an expense is usually correct and approved. As no history exists, the "status" of an expense will in the end usually be "approved", even though former versions of the expense were rejected. Therefore, the initial datasets contained only 6, 1 and 14 rejected expenses for customers A, B and C respectively (see table 2, column initially rejected). The introduction of versioning should certainly happen in the future, but its absence should not hinder this research. To create meaningful datasets, different solutions were identified. One could manually go through the datasets and revert the receipts to the state in which they were rejected, or add new expenses that violate the found rules and reject them. Both solutions would skew the representation of the real dataset. To gain more insight, OpenRules was implemented with the determined features and rules. Comparing the status that OpenRules produced with the actual status from the data unveiled a surprisingly big number of falsely approved expenses. This was not in the scope of the research but shows that the concerns of customers were justified. Despite the seemingly small number of rules, 488, 707 and 338 falsely approved expenses were found for customers A, B and C respectively (see table 2, column falsely approved). Those expenses had already been exported to financial systems and had created the feared economic damage. In contrast, 388, 158 and 233 expenses were rejected, adapted and submitted again; for those, the field motivation was filled in the initial dataset. The implementation of the business rules within OpenRules can be found in section 5.2.

Customer   Dataset size   Initially rejected   Motivation filled   Falsely approved
A          10000          6                    388                 488
B          7670           1                    158                 707
C          10000          14                   233                 338

Table 2: Summary of Data Analysis

The expenses from the last column were then, for training purposes only, changed to status "rejected", leaving a reasonable number of rejected expenses in each dataset to train and evaluate classifiers. Table 3 shows the resulting datasets, after limiting to 10,000 expenses and changing the status of the falsely approved expenses to rejected.

Customer   Dataset size   Approved expenses   Rejected expenses
A          10000          9512                488
B          7535           6828                707
C          10000          9662                338
average    9187           8666                511

Table 3: Manually Modified Datasets

5.2 BRMS Illustration

To illustrate what the implementation in OpenRules looks like, the externalized rule sheets are shown in figures 2, 3 and 4. The structure of the rule sheets, the columns, needs to be predefined to be readable by the rule engine. The values and logic, the rows, can be changed by the domain experts of each customer. The figures show the data found for customer A.

Figure 2 shows the implementation of the maximum amount per category rule; category "7564" was "Lunch" in this case, and the amount was 7 euros (700 cents). Figure 3 shows the implementation of the limited tax code per category. The rule engine compares the conditions and comes to a conclusion. Conclusion "yes" means compliant with the rules, conclusion "no" means not compliant.

Figure 4 shows how a conclusion is drawn. If no rule was violated, "Status" and "Status1" are both "yes", which then determines the overall status: "yes" = "approved".

Figure 2: Maximum Allowed Amount per Category implementation

Figure 3: Limited Tax Code per Category implementation

Figure 4: Status Definition implementation, ”yes” = ”approved”, ”no” = ”rejected”


5.3 Deduplication

Before the experiment was conducted, the risk of duplicates was approached with Machine Learning. Using the open source tool dedupe [Gregg and Eder 2015], ML-based clustering was applied to the initial dataset. Duplicates should be rejected, but will not be found by the introduced approaches, as those use similarity to classify: the same receipt handed in twice, approved the first time, will be classified as "approved" when handed in again. Therefore, the duplication check needs to be done in advance, removing real duplicates from the datasets. Dedupe compares continuous and nominal features and also performs a semantic analysis of texts (the description) to cluster expenses. It does not delete duplicates directly but indicates a probability of being a duplicate.

Implementation The same features as for the rules were used, with the description added. No user-related data was used because both the same person and multiple employees can hand in the same receipt several times. The following fields were used for deduplication:

• Description
• Expense amount
• Category
• Payment method
• Tax Code

In a first attempt, similar receipts were clustered with high confidence. Most expenses that were identified as duplicates turned out to be recurring expenses, such as monthly costs for public transport, or phone and internet costs. Therefore, the expense date needed to be taken into account. To exclude expenses that recur regularly, the date was checked in a range of ten days into the past and ten days into the future. If the difference between two expenses was bigger than 20 days, no connection was assumed. As a result, fewer duplicates were identified. The remaining identified expenses were manually sampled, and indeed one duplicate was found. The pictures can be found in the appendix; a discussion of the relevance of the results follows in the next paragraph. For all other identified duplicates, the attributes of the expenses were nearly identical, but the pictures showed no connection.
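Restated as a standalone check, the date window looks as follows. This is a sketch of the described heuristic in the spirit of the setup, not dedupe's own API.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Two otherwise-similar expenses only remain duplicate candidates when their
// dates lie within ten days of each other (the thesis phrases this as a
// 20-day window in total: ten days into the past, ten into the future).
public class DuplicateDateWindow {
    static final long MAX_DAYS_APART = 10;

    static boolean withinWindow(LocalDate a, LocalDate b) {
        return Math.abs(ChronoUnit.DAYS.between(a, b)) <= MAX_DAYS_APART;
    }
}
```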

Evaluation Deduplication on past data is difficult to evaluate, as it requires comparing all pictures of receipts manually. Therefore, deduplication would need to be tested in a productive environment with manual labeling. However, by manually comparing the images of the receipts, one true duplicate was identified by dedupe. In this case, two different employees reported the same costs, handing in two different receipts. Both expenses are from the same supplier, with the same date, amount and even product. The receipts show slightly different timestamps, but as one is a so-called "customer copy", it might just have been issued shortly after the original receipt. Both expenses had been approved and reimbursed. Without contacting both employees or the supplier, it cannot be stated with certainty whether this is a true duplicate or two employees traveling together, hence having very similar expenses. For the experimental setup, we assume that no true duplicates are in the data, as their occurrence, as just shown, could not be proven.

5.4 Discussion

The current process and its technical implementation do not yet fully support automated ML for business rules, as without the manual rule application no usable dataset could have been created. This leads to the first results and recommendations for future changes: implementing versioning or a change history of expenses is strongly recommended. However, the lack of change history led to the compliance check of the current data against the gathered rules, and uncovered a surprisingly big number of expenses that were approved while violating the business rules in place.

6 Experimental Setup

6.1 Setups

After selecting the necessary features, illustrating the business rules in place and rectifying the datasets of three customers, the prerequisites for the experiment are fulfilled. Using the introduced baseline datasets, four different setups were tested: the initial dataset and a downsampled version of it, each once without weights and once with weights. Table 3 showed the baseline datasets; table 4 shows the downsampled datasets. To even the classes out more, we randomly removed correctly approved expenses from the datasets.

Customer   Dataset size   Approved expenses   Rejected expenses
A          2511           1928                488
B          2516           1809                707
C          2508           2170                338
average    2512           1969                511

Table 4: Downsampled Datasets

Not only differently biased datasets were used, but also cost-sensitive learning. By attaching costs to false negatives, we use weights to modify the outcome. This leads to four setups in a matrix of two datasets and unweighted/weighted learning. The four configurations are referred to as Setups (1) to (4). Table 5 visualizes this for better understanding; a sketch of the downsampling step follows the setup list.

Setup         Unweighted   Weighted
Original      (1)          (2)
Downsampled   (3)          (4)

Table 5: Final Experimental Setups

(1) Baseline dataset
(2) Cost-sensitive learning for false negatives
(3) Downsampled, no weighting
(4) Downsampled, cost-sensitive learning for false negatives
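As a sketch, the downsampled variants can be produced in Weka with the SpreadSubsample filter, which caps the majority-to-minority class ratio. The thesis describes random removal of approved expenses, so the filter and its spread value of 4 are stand-in assumptions; the weighted variants (2) and (4) reuse the cost-sensitive wrapper sketched in section 2.2.2.

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.SpreadSubsample;

// Producing the downsampled datasets for Setups (3) and (4).
public class Downsampling {
    static Instances downsample(Instances original) throws Exception {
        SpreadSubsample filter = new SpreadSubsample();
        filter.setDistributionSpread(4.0); // keep at most 4 approved per rejected
        filter.setRandomSeed(1);
        filter.setInputFormat(original);
        return Filter.useFilter(original, filter);
    }
}
```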

6.2 Classifiers

For all datasets, five classifiers were compared for their performance. Each of them represents a different class of approaches. Exploring those in detail is not the goal of this research, as only the performance is of relevance. The study compared the following implementations.

C4.5     DecisionTree-based classifier
DT       Rule-based classifier
MLP      Neural-network-based classifier
nBayes   Simple probabilistic classifier
SVM      Instance-based classifier


6.3 Measurements

The measurements upon which the performance was compared are explained briefly. A variety of measurements exists to compare classifiers, and their interpretation can be misleading. The most commonly used measurements are accuracy and F-Measure, the latter calculated from precision and recall as F = 2 * precision * recall / (precision + recall). Accuracy, however, will not be used, as for unevenly distributed datasets it does not give as much insight as other measurements. The area under the ROC curve, called AUC, represents, simplified, "the probability that a randomly chosen positive example is correctly rated with greater suspicion than a randomly chosen negative example" and is more sensitive than accuracy [Bradley 1997]. Both measurements have a value between 0 and 1, where a higher value means better performance. As mentioned in paragraph 2.2.2, false negatives cause more damage than false positives. Therefore, the false negative rate and the number of false negatives will be displayed to illustrate the desired performance of the classifiers. For each measurement a five-fold validation was performed, which is why the number of false negatives can be a decimal.
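For reference, the sketch below shows how these numbers can be obtained in Weka: five-fold cross-validation, then the per-class F-Measure, AUC and false-negative statistics. Treating "rejected" as the positive class at index 1 is an assumption; the index depends on how the dataset is encoded.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

// Five-fold cross-validation and the four reported measurements in Weka.
public class Measurements {
    static void report(Classifier classifier, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 5, new Random(1));
        int rejected = 1; // assumed index of the positive ("rejected") class
        System.out.printf("F-Measure %.2f  AUC %.2f  FN-rate %.2f  #FN %.2f%n",
            eval.fMeasure(rejected),
            eval.areaUnderROC(rejected),
            eval.falseNegativeRate(rejected),
            eval.numFalseNegatives(rejected));
    }
}
```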

7 Results

For each setup, the results are reported averaged over customers A, B and C, as the results for all three matched. All tables are ordered by the absolute number of false negatives. After the raw data is presented, the performance differences are tested for significance. The chapter finishes with a discussion of the results.

7.1 Aggregated

Table 6 shows the results of the first experiment. C4.5 and DT separate themselves from the other three algorithms on all four measurements.

Setup (1)   F-Measure   AUC    FN-rate   #FN
C4.5        0.94        0.98   0.10      10.73
DT          0.88        0.97   0.14      14.40
MLP         0.65        0.86   0.47      47.07
SVM         0.59        0.72   0.56      52.27
Bayes       0.22        0.76   0.83      80.00

Table 6: Results Setup (1), average for A, B and C

Applying only weighting, the performance of C4.5 improved whereas the performance of DT decreased. Still, DT outperformed the remaining three on all four measurements. Table 7 shows the detailed results.

Setup (2)   F-Measure   AUC    FN-rate   #FN
C4.5        0.96        0.98   0.03      3.40
DT          0.68        0.97   0.27      25.60
MLP         0.60        0.87   0.36      34.47
SVM         0.59        0.73   0.53      49.40
Bayes       0.27        0.76   0.72      67.13

Table 7: Results Setup (2), average for A, B and C

Similar results were found when only downsampling was applied. The performance of all classifiers improved compared to Setup (1); the details can be found in table 8.

Combining both optimization techniques, good results were achieved for both C4.5 and DT, with less than 10% false negative rate. Table 9 shows the results.

Setup (3)   F-Measure   AUC    FN-rate   #FN
C4.5        0.95        0.98   0.07      7.33
DT          0.91        0.98   0.11      12.53
MLP         0.77        0.94   0.28      28.47
SVM         0.63        0.75   0.51      49.73
Bayes       0.34        0.77   0.72      72.40

Table 8: Results Setup (3), average for A, B and C

Setup (4)   F-Measure   AUC    FN-rate   #FN
C4.5        0.95        0.98   0.04      3.93
DT          0.83        0.98   0.07      5.67
MLP         0.74        0.93   0.19      17.93
Bayes       0.49        0.77   0.39      37.80
SVM         0.63        0.75   0.50      49.53

Table 9: Results Setup (4), average for A, B and C

7.2 Significance Testing and Summary

After all 12 experiments were conducted, each comparing five classifiers on four measurements, only aggregated results are presented. The data illustrated in the previous section indicates that C4.5 and DT separate themselves from MLP, Bayes and SVM for the given classification problem.

For each setup, all classifiers were tested for significance. Comparing the results of four measurements for three customers, 12 tests were conducted for each setup. MLP, nBayes and SVM were significantly outperformed (p < 0.05) by C4.5 and DT in nearly all cases. Of particular interest is the testing between C4.5 and DT, to determine which of the tested classifiers suits the given problem best. The detailed results can be found in table 10.

Setup   C4.5   Draw   DT
(1)     1      11     0
(2)     11     1      0
(3)     7      5      0
(4)     5      7      0

Table 10: Significant Performance Advantages

In 50% of all tests, C4.5 significantly outperforms DT, while DT was never able to outperform C4.5. Remarkably, the combination of downsampling and weighting improved the performance of DT much more than that of C4.5, but C4.5 still consistently outperforms DT and all other classifiers. This indicates that C4.5 is also more robust regarding the size of the training set. Therefore, C4.5 is suitable for both bigger and smaller companies.

8 Final Discussion

8.1 Summary of Findings

The study at hand set out to tackle the real-world problem of supporting the approval and audit process of SRXP's customers while improving its quality. We decomposed the main problem into smaller, isolated problems. The first part described the analysis of common business rules. We answered RQ1 and showed that restrictions from financial bookings are frequently violated, but also that duplicates in the expenses can lead to economic damage. Those two issues were investigated separately. During the initial analysis, it became apparent that both cases had already led to financial damage for customers, as both duplicates and business-rule-violating expenses were found that had already been approved. Those findings were achieved using dedupe for deduplication and OpenRules for business rules. Using the resulting datasets, several classifiers were trained to classify expenses based on their attributes. The efficacy of the algorithms was evaluated in different setups, using preprocessing (downsampling) and cost-sensitive learning to optimize them. Using both downsampling and cost-sensitive learning, the performance of all classifiers was improved; however, weighting the false negatives can decrease F-Measure and AUC. Comparing the classifiers with each other, C4.5 and DT outperform MLP, SVM and nBayes on all datasets, and C4.5 performs significantly better than DT. With an average false negative rate of 3% and a consistently high F-Measure, the C4.5 algorithm, using only cost-sensitive learning, performs outstandingly well compared not only to the other algorithms but also to the manual process. The implemented frameworks proved able to handle the found business rules, answering RQ2. The answer to RQ3 can be found in section 8.3.

8.2 Limitations

The research focused mainly on the technical feasibility and quality of an integrated solution, but did not elaborate on other important points. From an Information Systems perspective, not only the performance of a system needs to be evaluated, but also economic factors like implementation and maintenance efforts. Those efforts concern not only SRXP but also its customers. Moreover, user acceptance of an ML approach is unclear. It has to be tested whether ML would be accepted by audit teams, given all the design decisions that were made during the research. All quality-related findings were achieved in a "perfect world" setup with datasets that represented only the given rules. In a real-world environment with the currently given infrastructure, such a minimal level of noise is unlikely to be maintained.

Due to the given technical constraints and the extensive preprocessing necessary to enable the experiments, only three customers were tested in detail, with a limited set of rules. Even though the rules are representative of the ten biggest customers, their implementation was not investigated for all of them. Recommendations on how the research can be repeated on a more representative scale are given in the next section.

From an academic point of view, the research was very specific and might have created little universal knowledge due to the very specific problem statement. However, the paper illustrates the difficulties of and necessary steps in applying Machine Learning to solve a given problem. It elaborates on each component in detail and can help other researchers who face similar problems.

8.3 Recommendation and Future Research

There are several answers to RQ3 on how an integrated solution can be achieved. As shown, both a BRMS and ML can handle the business rules in place. However, a BRMS can do this without changes to SRXP's infrastructure, while for ML fundamental changes have to happen. As the two solutions do not necessarily exclude each other, several options are elaborated, proposing next steps and indicating advantages and disadvantages.

Option a) Implement a BRMS, not open source software like OpenRules, but using contemporary SaaS approaches. With the already defined business rules, a vast number of mistakes can be prevented. By providing the infrastructure to let customers set up their business rules based on the features used in this research, customers are able to make use of the research's results faster. Checking the business rules before a report is even submitted prevents erroneous data from being forwarded to an approver or the audit team, easing the approval process. This solution combines minimal effort for SRXP with reliable execution for customers.

Option b) Implement change logs to store the change history of expenses. This way, rejected expenses are retained and do not need to be created manually. Then, with a more representative dataset for training purposes, all experiments should be repeated to see whether the same rules will be learned again by the classifiers. Moreover, actual knowledge discovery could be investigated, finding unknown relationships behind rejections. Overall performance is then, however, expected to decrease, as more noise will be found in the data without further preprocessing.

A combined approach should be investigated as well, to profit from the benefits of both systems. After realizing proper versioning and integrating a BRMS, the datasets for each customer are expected to be much more suitable for large-scale Machine Learning. The BRMS prevents known cases of violations from being handed in, and all other instances are documented properly by the versioning. This new setup will create new datasets for all customers over time that can then be used to train classifiers.

For a combined approach, the performance of ML will drop, as the explicit cases are then already excluded. Therefore, new feature selection or extraction needs to be performed to detect outlying cases, probably including a bigger variety of features. This will also decrease the certainty of a classification (the curse of dimensionality). As automation should not rely on an uncertain classification, a semi-automatic approach should be analyzed and tested in a pilot. Simplified, this method automatically outputs a risk score that indicates the risk of a report being erroneous in any way. This way, dangerous cases can be highlighted and brought to an approver's attention for a manual check. The risk score should include the findings of this research and further ideas: basic components could be the results of a deduplication evaluation, the results of the classification, a consistency check using OCR, and the potential financial damage. This idea requires further in-depth research.

9 Conclusion

Despite the little attention companies usually give their expense management processes, their proper execution is a necessity. SRXP optimized big parts of this process but lacks a flexible error checking component that can be configured and adapted to customer needs. To design an integrated solution, a Business Rule Management System and Machine Learning approaches were implemented and tested. The research showed that automated error detection can be performed successfully with both a Business Rule Management System and Machine Learning. On the other hand, a variety of problems was identified that can only be detected by a manual check and was therefore excluded initially. However, the research created the foundation for SRXP to improve their product further and support their customers, and can help other researchers solve similar problems faster.

10 Acknowledgment

I would like to express my gratitude to the whole SRXP team for offering the environment to conduct the research, and to Mark Bothof and Rogier Schutte for all their experience and the help they offered. Moreover, I'd like to thank my supervisor Dr. Ilya Markov for the great flexibility, optimism, and knowledge with which he supported me during my research.


References

BELASTINGDIENST, 2012. Uw geautomatiseerde administratie en de fiscale bewaarplicht. http://www.belastingdienst.nl/wps/wcm/connect/bldcontentnl/themaoverstijgend/brochures_en_publicaties/uw_geautomatiseerde_administratie_en_de_fiscale_bewaarplicht.

BRADLEY, A. P. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30, 7, 1145–1159.

CARUANA, R., AND NICULESCU-MIZIL, A. 2006. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, ACM, 161–168.

DOMINGOS, P. 2012. A few useful things to know about machine learning. Commun. ACM 55, 10 (Oct.), 78.

ENTEZARI-MALEKI, R., REZAEI, A., AND MINAEI-BIDGOLI, B. 2009. Comparison of classification methods based on the type of attributes and sample size. Journal of Convergence Information Technology 4, 3, 94–102.

GREGG, F., AND EDER, D., 2015. Dedupe. https://github.com/datamade/dedupe.

HALL, J. A. 2010. Information Technology Auditing. Cengage Learning.

HAY, D., HEALY, K. A., HALL, J., ET AL. 2000. Defining business rules: what are they really? Final Report.

KITTLER, J. 1986. Feature selection and extraction. Handbook of Pattern Recognition and Image Processing, 59–83.

KLIEGR, T., KUCHA, J., SOTTARA, D., AND VOJIR, S. 2014. Learning business rules with association rule classifiers. 8th International Symposium, RuleML.

LONGADGE, R., AND DONGRE, S. 2013. Class imbalance problem in data mining: review. arXiv preprint arXiv:1305.1707.

OPENRULES, 2015. OpenRules architecture. http://openrules.com/architecture.htm.

VERVER, J., 2014. Automating fraud detection: The essential guide.

WITTEN, I., FRANK, E., AND HALL, M. 2011. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier/Morgan Kaufmann Publishers.

A Found Duplicates

Figure 5: Found Duplicates Nr. 1
