
Advanced Self-Enrichment Algorithm for Procurement Data Classification

Author: Aleksandr BYKOV
Examiner: Dr. Valeria KRZHIZHANOVSKAYA (UvA)
Assessor: Marcel BOERSMA (UvA)
Daily supervisor: Elena VERBITCAIA (Philips)

A thesis submitted in fulfillment of the requirements for the degree of Master of Computational Science


University of Amsterdam

Abstract

Faculty of Science Graduate School of Science

Master of Computational Science

Advanced Self-Enrichment Algorithm for Procurement Data Classification

by Aleksandr BYKOV

Self-enrichment of procurement spending data in business enterprises is a serious classification problem. Due to the individuality and complexity of business structures, there is no universal algorithmic solution. This thesis focuses on designing a model and a generic algorithm that does not depend on the quality of the data, adding the flexibility of supervised learning while keeping transparent rules of data enrichment, with a Decision Tree as the core machine learning method. The model is validated on different types of data by simulating real-life problems recurring in business, namely by adding noise to the data. The simulations focus on analysing, replacing or excluding data features so that the algorithm is robust and computationally efficient, especially compared to existing methods. The main idea of the thesis is to move from the deterministic algorithms currently used in business to a non-deterministic supervised learning methodology, treating business processes with a modern scientific approach. The results of the simulations define a way to justify the exclusion of certain features in order to gain prediction accuracy and increase computational efficiency.


Acknowledgements

I would like to thank the Procurement Data Management Team for the wonderful opportunities I was given during my internship.

My thanks also go to: Elena Verbitcaia for her daily mentorship during the internship at Philips, John Willemsen for guiding me through creating an algorithm, Mike Lees and Niels Molenaar for assisting me with all the issues I had in starting my internship, and Valeria Krzhizhanovskaya for the indispensable effort to make this thesis happen.


Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Research questions
2 Literature review
  2.1 Procurement Data Classification
  2.2 Supervised learning
  2.3 Data Science
3 Problem description
  3.1 Data Self-Enrichment
    3.1.1 WHAT / WHO ownership
      L1: Clusters
      L2, L3: Commodity levels
      L4: CLOGS code
      Exception: # - unassigned bucket
      Ownership
  3.2 Old Algorithm overview
    3.2.1 Properties
    3.2.2 Problems of the algorithm
      Theoretical limitations
      Practical problem
      Implementation deficiencies
    3.2.3 Legacy issue
4 New Model
  4.1 Decision Tree
    4.1.1 Advantages of Decision Tree
    4.1.2 Disadvantages of Decision Tree and ways to alleviate the consequences
    4.1.3 Mathematical Formulation
      Classification criteria
  4.2 Reasoning
  4.3 Data preparation
    4.3.1 Feature engineering and Mapping function
    4.3.2 Pre- and Post-processing
  4.4 Technological overview: new versus old algorithm
5 Results
  5.1 Accuracy of prediction
  5.2 Computational time comparison
  5.3 Simulation of noisy data
    5.3.1 Feature importance (weight)
  5.4 Noise rate
  5.5 Feature 'Damage' integral
    5.5.1 Weights / 'damage' integral distribution
  5.6 Excluded features simulation
6 Conclusions
  6.1 Future work

List of Figures

3.1 The roll-up scheme for classes on WHAT levels
3.2 General scheme of ownership in SMART
3.3 WHO / WHAT spend table
4.1 Concepts of Self-Enrichment: old versus new algorithm
5.1 Examples of Decision Tree models trained on random samples of 10 and 1000 spending lines
5.2 New algorithm (Decision Tree) classification accuracy versus training-set size, with cross-validation
5.3 Computational time comparison for the old ('Current') and new ('Decision Tree') algorithms
5.4 Classification accuracy for different noise rates and features
5.5 Example: 'damage' integral of features 1 and 9 based on the noise coefficient
5.6 Sampling distribution of the 'damage' integral
5.7 Prediction accuracy versus number of excluded least important features and training-set size
5.8 Accuracy and computational time for the new algorithm with missing features
5.9 Prediction accuracy and computational time depending on the number of features excluded

List of Abbreviations

BOM Bill Of Material

DT Decision Tree

IMS Indirect Services and Materials

IP Intellectual Property

MPP Market Project Procurement

NPS Non-Purchasing Spend

PDM Procurement Data Management team

R & D Research and Development

VGU Vendor Global Ultimate

VLOB Vendor Line Of Business

SE Self Enrichment

SAP Systeme, Anwendungen und Produkte in der Datenverarbeitung (Systems, Applications & Products in Data Processing)

SAP BI SAP NetWeaver BI (formerly SAP NetWeaver BW; "BW" is still used)


Chapter 1

Introduction

Typically, Procurement is the department responsible for purchasing goods, services, etc. for the company. Technically speaking, for Procurement Data Management every single purchased item is represented by a spend line in a database with a fairly strict format.

Every month the Procurement Data Management team (PDM) runs an algorithm that assigns each spend line a class based on several attributes (features) in the database. That class can be rolled up into other layers of the company's internal division to place spend lines into commodities, clusters, branches, etc. Typically classes simply look like codes, and rolling up is a process that is not part of the classification algorithm itself. In this thesis we focus on classification only at the lowest ('codes') level. At the same time, for evaluating the accuracy of the algorithm we may use other, higher-level classes.

In most cases, because of the huge amounts of data, employees can see only spending data that has been rolled up into their subdivision. If certain spend lines have polluted data or no data at all, they are classified into incorrect buckets or not classified at all. If a spend line is misclassified, the person who made the purchase will not find the spending data in their slice of the data, while a person in another department will see extra spend data they should not be responsible for, thus creating two problems out of a single inaccuracy in the data.

In this work we concentrate on how to improve the quality of that process using computational science methods and machine learning: implementing supervised learning, performing data analysis and computational time analysis, and designing a new advanced algorithm that makes the Self-Enrichment (SE) process both more accurate and more computationally efficient.

1.1

Research questions

In order to know where this research is headed, let us define the main research questions for the thesis:

1. Can a spending data classification algorithm be based on a supervised learning approach instead of a deterministic one? Can the Decision Tree machine learning method deliver the same level of robustness without losing prediction quality?

2. Is there a way to design a spending data classification algorithm that is universal, in the sense that it does not depend on the specific format in which the data is stored?


3. Do more features used in a training set for the Decision Tree deliver better classification quality and robustness? Is there a way to link a feature's weight in the Decision Tree to a measure of how well the method handles noisy spend data?

4. Is there a way to analyse how useful a feature is for classification and, again, to make this procedure independent of a particular dataset?


Chapter 2

Literature review

2.1

Procurement Data Classification

Public procurement is one of the key elements of a demand-oriented innovation policy (Edler and Georghiour, 2007).

Product classification in Procurement is a well-known field, with reports on the methodology and results of the current state of the respective standards (Leukel and Maniatopoulos, 2005). Procurement spend data classification, on the other hand, is a topic specific to Market Procurement. Having a clear, concise product classification is essential in the private sector, especially if one sells products. If a company buys products and services or pays fees, then at some point of company growth it faces the problem of many people being responsible for many purchases. There should be one company algorithm that splits purchases into categories. For small businesses this problem does not exist, as a few employees are enough to do the classification. Large-scale companies, however, have to come up with an efficient solution.

The general tone of discussion around Procurement data classification used to be quite hostile (Herfurth and Weiss, 2010): "Enterprise information systems (such as enterprise resource planning (ERP) or procurement systems, etc.) do not adequately support electronic procurement of industrial services. Business processes suffer today from poor quality of underlying master data in supportive information systems." The authors state that better classification services lead to an overall increase in Procurement performance, and that such services do not always come with ERP systems by default.

Classification properties are a significant part of product classification. For instance, (Ondracek and Sander, 2003) proposed a "property based product classification" where more than one classification hierarchy can be developed for specific purposes; in that case common standardised properties are used. (Kim et al., 2004) developed a "semantic classification model" based on properties in order to enable an in-depth understanding of product classification. In that paper the authors tried to "understand what it means to classify products and present how best to represent classification schemes so as to capture the semantics behind the classifications and facilitate mappings between them".

Unfortunately, the achievements and solutions mentioned previously are more than 10 years old and are not directly applicable to this work. Most essentially, relevant solutions and methods for spending data classification are protected by intellectual property (IP), and even best practices are mathematically quite vague: "Leading


spend data management initiatives rely on access to all spend data sources; a common classification schema; category expertise; efficient and repeatable data cleansing and classification capabilities; advanced reporting and decision support tools; and sufficient resources and executive support" (Aberdeen Group, 2003). 'Accuracy', 'prediction', 'formula', 'algorithm', 'test', 'simulation' are not the keywords you see when searching for a solution to this strictly algorithmic problem, which is instead usually addressed by non-algorithmic methods with 'strategies', 'feedback' and 'low-level approaches'.

It may look like the problem is not that important for businesses if, in a free market, there are no good solutions for it. Some (Aberdeen Group, 2003) beg to differ: ". . . inadequate spending analysis capabilities are costing businesses $260 billion in missed savings opportunities annually." There are solutions on the market built around ideas in the spirit of "let us give each product a serial number we all agree upon" and using it for classification. Frankly, that kind of solution does not solve problems involving repeating spend with no information (almost empty data fields), because there is nobody to define the code for the service. Most importantly, if the company buys the same kind of product, paper for example (half for re-selling and half for its own needs), then no 'product code' can separate these purchases into two different categories.

To be fair, not all solutions are vague or simplistic. In the area of product data classification there are interesting examples of applying more algorithmic methods like machine learning, such as GoldenBullet (Ding et al., 2002). That paper focuses on cleansing procurement data with machine learning. The application is, unfortunately, very different from what is needed for spend data classification, although the core ideas and approaches do go in the same direction.

Working on Procurement spend data analysis and classification in a large and diverse company means extracting information from data, sometimes from raw, incomplete and erroneous sources. Although current knowledge discovery and data engineering methods have proven very successful in many modern applications and programmes, the issue of learning from imbalanced data (the imbalanced learning problem) is in fact a new and ongoing challenge, and it is getting a lot of attention from both academia and industry (He and Garcia, 2009).

The scientific research closest to the actual problems of misclassified procurement spend data (correctly calculating savings, commodity mapping, etc.) was published by IBM. (Singh et al., 2005) analysed transactions that are not cross-indexed, transactions that refer to the same suppliers by different names, and transactions that use different ways of representing information about the same commodities. In several data sources it was found that commodity mapping is made more complex by the fact that it has to be created on the basis of unstructured text descriptions. Machine learning and information retrieval techniques were applied as well. Unfortunately, the mentioned research was not about the classification procedure but about data "cleansing" and generating a spend "view". The methods are similar, but the goal is very different.


Many businesses still think it is important to draw beautiful schemes with subdivisions and to 'pay close attention' to low-level scheme implementation details (Aberdeen Group, 2003). My personal and so far unproven view is that, as in any other classification task, coming up with a good solution does not depend on the classes proposed; it depends on the classification algorithm and its prediction accuracy. Simply put, businesses lose money because spending data classification is far from perfect, not because their class hierarchy is not sufficient, accurate, representative or diverse enough.

2.2

Supervised learning

The main idea of supervised learning is to create a concise model of the distribution of class labels in terms of the predictor features. "The resulting classifier is then used to assign class labels to the testing instances where the values of the predictor features are known, but the value of the class is unknown." (Kotsiantis, 2007). The paper by S.B. Kotsiantis covers topics important for developing and choosing the right machine learning techniques: data pre-processing, feature selection, logic-based learning, perceptron-based techniques, Support Vector Machines, Neural Networks and, finally, Decision Trees. According to Kotsiantis, decision trees predict classes by sorting instances based on feature values: each node in a decision tree represents a feature of the instance to be classified, and each branch represents a value that the node can take. Starting at the top node (root), instances are classified and sorted with respect to their feature values.

The reason to choose the Decision Tree as the main classifier for this thesis might be counter-intuitive, but it is beneficial for certain business applications: generalising (Murthy, 1998) and over-fitting at the same time in spend data classification. A lot of the data is repetitive, and quite often only the price feature changes (which is why over-fitting is needed). Many transactions are assigned to a class once, and afterwards the company wants to see the exact same transaction falling into the same category time after time, irrespective of the price value. It then reacts (assigns a different class) if something is not as it was before (the value of one of the important features has changed).

For that matter, a conclusion from (Kotsiantis, 2007) comes in handy: "a decision tree, or any learned hypothesis h, is said to over-fit training data if another hypothesis h' exists that has a larger error than h when tested on the training data, but a smaller error than h when tested on the entire dataset."

As stated in (Badulescu, 2007), during the induction phase of the Decision Tree, the attribute selection measure determines the attribute that optimally separates the remaining samples of a node partition into individual classes. This property is useful when analysing feature importances and deciding which features to select.

After reviewing the analyses of advantages and disadvantages of Decision Trees, the method was chosen because spending classification techniques currently tend to be as transparent as possible, and a Decision Tree classifier can always be visualised in a way that makes clear which rules were used, and how, at every iteration.

2.3

Data Science

Big international business is a good place to extract data, to apply data science techniques for the needs of the company, and to measure the role of predictive and explanatory modelling. Research in this emerging area should be evaluated for contribution and significance (Agarwal and Dhar, 2014).

Companies often want to use data-science programs (Provost and Fawcett, 2013), but this particular thesis is not about analytics and the extraction of 'unknown' truths and knowledge from the data (Martinez, 2017). Modern scientists tend to look at big data for somewhat educational reasons (Song and Zhu, 2016), to see the impact of science on life and to address big data problems.

At the moment there are certain spend data classification solutions (Clinton, 2017) that use deep machine learning and artificial intelligence to structure and cleanse data in a format suitable for procurement. Many researchers are sure that human accuracy will be replaced by the accuracy of a modern, digitised system that uses innovative technologies to classify data more quickly and accurately. Moreover, the success of analytics solutions depends mostly upon the consistency and quality of supplier and product data (An Oracle White Paper, April 2012).

Sometimes business needs science not to launch a rocket into space, but to solve more down-to-earth problems like predictive analytics (Waller and Fawcett, 2013) and local domestic problems. Very often these problems are similar across companies and can be solved via simple scientific methods: the problem is defined very locally, but the solution finds its purpose not in the larger scale of one company but in scaling to many companies with the same issues.

Procurement spend data classification is exactly such a problem: one that needed a little scientific approach.

It is important to pay attention to how the algorithm works with incomplete or erroneous data, because noise is a problem that cannot be ignored or underestimated. It affects the process of data collection and data preparation in Data Mining applications (García, Luengo, and Herrera, 2014).

The effectiveness of models created under such circumstances depends mostly on the training data and its quality. The noise sensitivity of the model learner is also a property that affects effectiveness.


Chapter 3

Problem description

As soon as purchase data appears, the system has to perform Self-Enrichment based on "technical" data such as the purchase order: classify each spend line so that spend records are assigned to the departments that should be responsible for the spend. In reality, for large amounts of spend data, the attributes of a spend record become insufficient to accurately assign the right owner of the spend. The old classifier used in businesses does very "mechanical" self-enrichment (repetitive and deterministic, with rules almost duplicating each other), so all the "noisy" records are assigned to commodities they are not supposed to be assigned to. Some spend lines are not classified at all, and some are classified incorrectly. The enriched data becomes polluted; spend lines with large amounts of money (transactions with high costs) are later re-classified 'by hand' by employees when (and if) they notice them, but huge amounts of money spread over thousands of misclassified small transactions stay in the system forever. Because of the "mechanical" approach of the old algorithm, it is not possible to apply supervised learning, and even systematic mistakes seen in the enrichment can only be fixed by changing the entire algorithm, which can happen only once, at the end of the calendar year.

Classification mistakes immediately lead to mistakes in assigning the ownership of a spend. In order to better understand internal company data classification, let us take a look at the way Self-Enrichment happens in the given company's case.

3.1

Data Self-Enrichment

3.1.1 WHAT / WHO ownership

In the company there are two kinds of ownership:

"WHO" is the ownership of vendors and VGUs (Vendor Global Ultimate – high-est level name of the company), which determines a person responsible for all the spending a company has regarding a certain VGU. One VGU can have only one owner. One employee can have multiple ownerships.

On the other hand, WHAT ownership defines, for each spend line, the cluster that the spend should belong to. There are 4 levels on the WHAT axis. Low-level classes roll up to higher levels, and one type can roll up to only one higher class (Figure 3.1).


FIGURE 3.1: The roll-up scheme for classes on WHAT levels

L1 : Clusters

Cluster level is the highest level in the hierarchy. It has 6 types (classes):

• BOM - Bill of Material
• IMS - Indirect Materials and Services
• NPS - Non-Purchasing Spend
• R&D - Research and Development
• MPP - Market Project Procurement
• # - Unassigned bucket

L2,L3 : Commodity levels

L2 indicates the commodity, L3 indicates the group of products or services. The actual names of classes and their quantity are related to business processes and cannot be disclosed for Intellectual Property (IP) reasons. The exact format also cannot be discussed, because it is not unified and cannot be disclosed without business specifics.

L4 : CLOGS code

L4 is the lowest level in the hierarchy. The data format is '0XX000': strings of 6 symbols, where the first and the last three are digits and the second and third are letters. There are more than 600 classes. An L4 CLOGS code unequivocally rolls up to L3, L2 and L1.
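As a small illustration of this format only, the check below encodes the '0XX000' shape as a regular expression; the example codes are made up and are not real CLOGS codes.

```python
# Illustrative check of the L4 (CLOGS) code shape described above:
# one digit, two letters, three digits (6 symbols in total).
import re

CLOGS_PATTERN = re.compile(r"^\d[A-Z]{2}\d{3}$")  # hypothetical, for illustration

for code in ("0AB100", "1XY042", "00A100"):
    print(code, bool(CLOGS_PATTERN.match(code)))
```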

Exception: # - unassigned bucket

In the unassigned bucket, a hash symbol '#' is used on all levels. This bucket indicates that a spend line could not be classified into any existing cluster or commodity. It contains not only completely empty spend lines but also spend lines with some data, yet not enough for the algorithm to assign them to a class.


Ownership

SIS is the system where ownerships [employee:VGU] are defined. Every owner belongs to a certain cluster/commodity on the WHO axis; thus the entire WHO ownership of VGU relations can be described as follows:

VGU type (L1) ⇔ employee ⇔ VGU

This relationship says that for any spend within a certain VGU, the L1 type on the WHO axis is defined via the employee-owner. Levels L2, L3 and L4 cannot be applied in this case.

By contrast, the WHAT axis is not defined explicitly. It is calculated via the Self-Enrichment algorithm using dictionary files (made by the company). The way the dictionary files are built is a commercial secret and cannot be discussed in detail.

The WHO and WHAT axes contribute to creating a spend ownership, shown in Figure 3.2, in an outsourced, SAP-produced enterprise management system called SMART. This spend matrix is used for reporting and for measuring Key Performance Indicators (KPI) related to Procurement.

FIGURE 3.2: General scheme of ownership in SMART

The idea behind the WHO / WHAT matrix is that one VGU has to have one owner-employee responsible (WHO) for all the purchases from that VGU. But first, not all VGUs have owners as soon as they appear in the system, and second, spend from one VGU can belong to more than one cluster on the WHAT axis. Reporting and KPIs are different for clusters and commodities, so the data has to be processed and analysed by separate systems.

For example, imagine a company buys 3 packs of identical chairs from a VGU, paying separately for each pack and thus creating three separate transactions in the spend database. One pack will be used in the office (L1: 'NPS'), another will go to research on radiation from MRI machines on chairs (L1: 'R&D'), and the third is bought to be re-sold for a government project to build a new hospital (L1: 'MPP').

Same chairs, same company, but the usage and purposes are different business-wise, and the effectiveness of those purchases is measured differently. The spend lines for these three transactions are almost identical, but Self-Enrichment must assign them to completely different categories based only on the spend lines.

An example of the WHO / WHAT spend matrix is given in Figure 3.3. Each owner (usually a commodity manager) belongs to a certain commodity, and by default all VGUs owned by that owner belong to the same commodity on the WHAT level.

FIGURE 3.3: WHO / WHAT spend table

One of the main problems with ownership is not even the possibility of spending data being assigned to the wrong commodity, but the possibility of it ending up at the intersection of two '#' classes: spending data that has no owner and no class. This intersection of spend data belongs to nobody, and nobody is responsible for it.

In this work we develop an algorithm for WHAT spending classification only. The WHO ownership classification is less problematic because it is assigned by senior Procurement directors and rarely changes afterwards.

3.2

Old Algorithm overview

For research purposes, only the properties, but not the actual algorithm [1], are described below.

3.2.1 Properties

The algorithm takes a raw spend data set as input. The features include company names, VGU, cost in Euro and several other records that cannot be disclosed. Most of the features are in string format.

• The Self-Enrichment algorithm has N rules in a specific order;
• The algorithm uses K-4 attributes (out of K available);
• Each rule uses 1 or 2 attributes; some also use dictionary files;
• Some rules assign classes based on attributes;
• Some rules copy classes from the attributes (in case the data is clear and correct);
• The last rule assigns the "#" class to the rest of the dataset, which has not yet been enriched by the previous rules.

[1] Numbers, datasets, names of companies, purchases and measurements cannot be revealed entirely.

Dictionary files, also used as input, are updated regularly and thus can change frequently. They are used to add certain property attributes to spending lines based on VGU names.

The Self-Enrichment algorithm is encoded in SAP and can be changed only once at the end of the calendar year.

There is no way to evaluate the prediction accuracy of the Self-Enrichment algorithm currently used. Employees fix mistakes individually when polluted data harms KPIs. In this research we simply assume that the data provided is classified 100% correctly.

3.2.2 Problems of the algorithm

We start by outlining key theoretical and practical issues the old algorithm has.

Theoretical limitations

"Rules" cannot do supervised learning. In other words, deterministic approach does not help to re-assign classes for the spend lines that were misclassified during the previous run. Other input files-dictionaries can help with well-known VGUs with stable product spend, but the same mistakes could be made over and over, especially for the new VGUs and/or unrecognized spending data. As an example, for one particular VGU a couple of million spending in Euro in 2017, that were put into unassigned bucket for several months, even though in was supposed to be classified.

Practical problem

Due to the "mechanical" nature of some rules, the old algorithm classified and rolled up CLOGS codes that were not in the class set. That mistake could not be observed by employees without access to the actual CLOGS codes (which is almost all commodity owners). As a result, the ownership prediction is only as good as the accuracy of certain data features (which are not accurate).

Implementation deficiencies

Even though the SE algorithm was designed by Procurement Data Management (PDM), the actual implementation in SAP was performed by SAP developers. PDM defines rules in a very high-level format, so many nuances are resolved individually by developers. When the SE algorithm was re-built by the author of this work, in many cases the classification results differed from the actual SE run results (approximately 3% of more than 70 000 spend lines), even though the model was in theory supposed to be deterministic. The differences are explained by the interpretations the programmers made while implementing the algorithm.

3.2.3 Legacy issue

Around seven years ago, the design of the SE algorithm was outsourced to an IT company specialised in Machine Learning. Back then, the company focused on what it was told was important: word recognition and text analysis. The problem is that the only attribute people fill in during purchasing in natural language is more or less optional (in most cases filled with object sizes, irrelevant symbols or simply random data). The algorithm based on that approach over-fitted and was not applicable to 'untrained' data (it was trained on one year of data and showed very low results on the next year's data). PDM then decided to develop its own algorithm that was more mechanical, state-of-the-art, and based on fully transparent deterministic rules that would give the same results for the same data.


Chapter 4

New Model

In order to address the emerging problem of procurement data spend classification we have to keep in mind the following core algorithm properties:

• Supervised learning
No other technical requirements were provided; although the data volumes are large, computational time is negligible given current computing power.
• Rule transparency and simplicity
This principle does not have any scientific or computational reasoning. Simply put, "business executives" prefer to have clarity and predictability in how the spend is classified. The rules and logic of classification ought to be observable.

Supervised learning is what Machine Learning algorithms are famous for (Mohri, Rostamizadeh, and Talwalkar, 2012). In this work the Decision Tree algorithm was chosen partly because of its typical disadvantages; a more thorough explanation follows in the next section.

4.1

Decision Tree

A Decision Tree uses a predictive model in which observations about an item, represented in the branches, lead to conclusions about the item's target value, represented in the leaves. Let us focus mostly on its properties rather than on a formal definition (Decision Trees, Scikit-learn).
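A minimal sketch of this kind of classifier, using the scikit-learn implementation cited above, is shown below; the feature matrix and the CLOGS-like labels are hypothetical placeholders, not the company's actual attributes.

```python
# Minimal sketch of training and inspecting a scikit-learn Decision Tree
# on already-encoded spend features (placeholder data, not the thesis dataset).
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy encoded spend lines: each row is one spend line, each column one feature.
X = [[0, 2, 1],
     [1, 2, 0],
     [0, 3, 1],
     [1, 0, 0]]
y = ["0AB100", "0CD200", "0AB100", "0EF300"]  # hypothetical L4 (CLOGS) labels

model = DecisionTreeClassifier(criterion="gini", random_state=0)
model.fit(X, y)

print(model.predict([[0, 2, 1]]))   # classify a new spend line
print(export_text(model))           # white-box view of the learned rules
```

The `export_text` call illustrates the white-box property discussed below: the learned splits can be printed as plain rules.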

4.1.1 Advantages of Decision Tree

Let us define algorithmic properties of Decision Trees, as described in (Decision Trees, Scikit-learn):

• DT is simple to understand and to interpret. Trees can be visualised.

• It requires little data preparation. Other techniques often require data normalisation, additional variables and filling the gaps in the data. Note, however, that this module does not support missing values.

• The computational cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.

• DT is able to handle both numerical and textual data. Other techniques are usually specialised in analysing datasets that have only one type of data.

• DT is able to handle multi-output problems.

• DT uses a white box model. If a given situation is observable in a model, the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., an artificial neural network), results are more difficult to interpret.

• It is possible to validate a DT model using statistical tests. With that, we can estimate the reliability of the model.

• DT performs well even if its assumptions are somewhat violated by the true model, from which the data were generated.

4.1.2 Disadvantages of Decision Tree and ways to alleviate the consequences

• Decision-tree learners can create over-complex trees that do not generalize the data well. This is called over-fitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.

• Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by running an ensemble.

• The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality, even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner (generating many DTs and checking statistical significance), where the features and samples are randomly sampled with replacement.

4.1.3 Mathematical Formulation

The model as described by (Breiman et al., 1984):

Given training vectors $x_i \in \mathbb{R}^n$, $i = 1, \ldots, l$ and a label vector $y \in \mathbb{R}^l$, a decision tree recursively partitions the space such that samples with the same labels are grouped together.

Let the data at node $m$ be represented by $Q$. For each candidate split $\theta = (j, t_m)$ consisting of a feature $j$ and a threshold $t_m$, partition the data into subsets $Q_{left}(\theta)$ and $Q_{right}(\theta)$:

$$Q_{left}(\theta) = \{(x, y) \mid x_j \le t_m\}$$
$$Q_{right}(\theta) = Q \setminus Q_{left}(\theta)$$

The impurity at $m$ is computed using an impurity function $H(\cdot)$, the choice of which depends on the task being solved (classification or regression):

$$G(Q, \theta) = \frac{n_{left}}{N_m} H(Q_{left}(\theta)) + \frac{n_{right}}{N_m} H(Q_{right}(\theta))$$

Select the parameters that minimise the impurity:

$$\theta^* = \operatorname*{argmin}_{\theta} G(Q, \theta)$$

Recurse for the subsets $Q_{left}(\theta^*)$ and $Q_{right}(\theta^*)$ until the maximum allowable depth is reached, $N_m < \mathrm{min}_{samples}$, or $N_m = 1$.

Classification criteria

If the target is a classification outcome taking values $0, 1, \ldots, K-1$, then for node $m$, representing a region $R_m$ with $N_m$ observations, let

$$p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$$

be the proportion of class $k$ observations in node $m$. Common measures of impurity are:

Gini:
$$H(X_m) = \sum_k p_{mk}(1 - p_{mk})$$

Cross-Entropy:
$$H(X_m) = -\sum_k p_{mk} \log(p_{mk})$$

Misclassification:
$$H(X_m) = 1 - \max_k p_{mk}$$

where $X_m$ is the training data in node $m$.
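For concreteness, the small sketch below evaluates these three impurity measures for a single node given its labels; the class codes are made-up placeholders and this is not part of the thesis implementation.

```python
# Sketch of the node impurity measures defined above, computed from the
# class proportions p_mk of one node (placeholder labels, not real data).
import numpy as np

def impurities(labels):
    """Return Gini, cross-entropy and misclassification impurity of one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                     # p_mk: class proportions in the node
    gini = np.sum(p * (1.0 - p))
    cross_entropy = -np.sum(p * np.log(p))
    misclassification = 1.0 - p.max()
    return gini, cross_entropy, misclassification

# Example: a node holding six spend lines from three hypothetical classes.
print(impurities(["0AB100", "0AB100", "0AB100", "0CD200", "0CD200", "0EF300"]))
```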

4.2

Reasoning

This section explains why the Decision Tree class of models was chosen for the task.

First and most important: the DT is a white-box model. That was a requirement from the Company. All the other advantages mentioned previously align perfectly with what was needed for the task.

Unlike in almost every other classification task, in the given case over-fitting is good. It is not a problem, but rather part of the solution. This can be explained by the fact that the model has to predict more than 700 classes based on a limited number of parameters, most of which are "technical" string codes, and by the fact that this is not a task people traditionally solve with machine learning.

Generally, the first two of the disadvantages mentioned before are beneficial for the task. This assumption is counter-intuitive, but that does not make it irrelevant. The algorithm does not need to generalise the data, and sensitivity of the prediction to small variations is exactly what the Company was looking for.

Another reason the Decision Tree was chosen is the possibility to visualise the prediction model and make beautiful slides for the executives; but, spoiler alert, in the end the prediction model was so big that a personal computer could visualise only a model trained on 1000 spend lines.

4.3

Data preparation

In order to apply and train a Decision Tree on the company's data, a few procedures had to be performed on the data, since most of the values were strings and the Decision Tree implementation requires numerical values.

4.3.1 Feature engineering and Mapping function

Generally speaking, most of the features were taken as they were, with minor changes such as replacing incorrect values for fixed-type features or splitting some features into two new ones because they had been combined into one in the database. Some available features were simply not used because they had no relation to Self-Enrichment, and some had too much influence: the feature storing the cost in Euro was not used, because it is not traditionally used in Self-Enrichment in Procurement and because it made the Decision Tree over-fit instantly.

Due to company’s restrictions actual feature names and their nature cannot be disclosed.

A mapping function is a classic solution when string values need to be transformed into numbers. For each feature, a dictionary function was created that transformed strings into integer values and could inverse-transform integers back into strings.
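A minimal sketch of such a per-feature mapping function is given below; the class name and example vendor strings are illustrative assumptions, not the thesis code or real data.

```python
# Sketch of a string-to-integer mapping function with an inverse transform,
# as described above (illustrative only).
class StringMapper:
    def __init__(self):
        self.to_int = {}   # string -> integer code
        self.to_str = {}   # integer code -> string

    def transform(self, values):
        codes = []
        for v in values:
            if v not in self.to_int:
                code = len(self.to_int)
                self.to_int[v] = code
                self.to_str[code] = v
            codes.append(self.to_int[v])
        return codes

    def inverse_transform(self, codes):
        return [self.to_str[c] for c in codes]

mapper = StringMapper()
encoded = mapper.transform(["VENDOR_A", "VENDOR_B", "VENDOR_A"])
print(encoded)                             # e.g. [0, 1, 0]
print(mapper.inverse_transform(encoded))   # back to the original strings
```

In practice one such mapper would be built per string feature, so that post-processing can restore the original values.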

4.3.2 Pre- and Post-processing

Pre-processing consists of preparing the dataset for training the decision tree by doing feature engineering and defining the mapping functions. After that, the data is ready for training. The same process is applied to the testing data for classification (prediction).

The actions in reverse order, inverse mapping functions and feature de-engineering, constitute Post-processing.

4.4

Technological overview: new versus old algorithm

For big companies it is common to stick to a single algorithm or software for processing the classification of spend data: changes are very rare, and the algorithm itself should be solid and implemented deep in the company's IT solutions. Some companies also use outsourced solutions for handling all spend data processing, for example SAP BW. How exactly it is implemented is irrelevant for this research, but it is important to notice that the principles of Self-Enrichment are similar: usually the company designs the algorithm itself and then software engineers implement it in SAP. The similarity in that case is in the general approach: the algorithm should not be changed very often, because it is crucial for the data to be stable.


The data from Philips provided for this research falls into the last category. The way supervised learning can change the concept of Self-Enrichment is shown in Figure 4.1.

Supervised learning integrated into classical Self-Enrichment theoretically solves the problem of the algorithm being unable to adapt to feedback and to treat small changes in the data as influential for the classification. There would be no need to change the classification algorithm, because it changes itself every time new training data and/or dictionary files are used.


(A) Old (current) algorithm concept

(B) New algorithm concept

FIGURE 4.1: Concepts of Self-Enrichment. (A) Concept used in the old Self-Enrichment. The algorithm runs in SAP, is processed by a third party and cannot be easily changed (dark-grey box); the 'Self-Enrichment N rules' (classifier) is stable. The company has flexibility over the data and dictionary files it provides for the algorithm. The yellow sub-box shows the data being enriched. (B) differs in that there are now separate training and spend data. The Machine Learning algorithm is stable and solid, but the actual 'MODEL' (classifier) is not: it is generated and depends on the training data and dictionary files. Thus the classification process is different every time, but the programme itself stays the same, so it does not need to be changed every time there is a mistake to be fixed.


Chapter 5

Results

In this chapter we focus on how the new algorithm performs on real data with and without cross-validation, analyse the features, simulate noise in the dataset, and measure the importance and influence of each feature on the prediction.

The spend dataset of Royal Philips B.V. for January 2017 was used for the task. The data had been Self-Enriched by the company's algorithm (the 'old algorithm') and consisted of more than 75 000 spend lines of real transactions performed by the company. This data was used for supervised learning. Even though there is no way to evaluate how much of the data was actually classified correctly, for the sake of research we consider it all classified 100% correctly. That simplification does not in any way undermine the justification of the model's applicability.

Example images of Decision Trees are shown in Figure 5.1.

5.1

Accuracy of prediction

As expected, the Decision Tree model over-fits "perfectly": 100% accuracy on the training set. Using 10% of the data for training is enough to give 90% accuracy on the entire dataset, and with more than 60% used for training it already reaches 99% (Figure 5.2).

The figure shows that on all 4 levels of enrichment the accuracy goes to 100%; for comparison, cross-validation scores on the same training set are also plotted (dashed lines, tested on the remaining data, so if the training set is 80% then the remaining 20% were used for testing). Noticeably, the cross-validation scores do not reach 100%, but still exceed 95% accuracy even if the training set is less than 10%. Cross-validation was performed to see how the new algorithm handles prediction on data it was not trained on: the prediction score (accuracy) of classification was measured for each of the 4 levels (L1, L2, L3, L4 (CLOGS)) depending on the size of the randomly sampled data, from 1% to 99% in a total of 20 steps, taking the mean of 10 measurements at each step.

It is interesting to notice that after about 20-40% the actual prediction accuracy does not grow significantly, because it already reaches a critical accuracy of about 98%; the subsequent increase in accuracy is achieved only because of the larger training set and smaller test set.

On the higher levels (L1, L2 and L3) the results show the same tendency but with a better starting point.
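The sketch below outlines this kind of accuracy simulation under assumed data shapes; the placeholder arrays stand in for the encoded spend lines and labels, which cannot be shown here.

```python
# Sketch of the accuracy simulation described above: sample a training fraction,
# fit a Decision Tree, then score it on the full dataset and on the held-out rest.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def accuracy_curve(X, y, fractions, repetitions=10):
    """Mean full-data and cross-validation accuracy for each training fraction."""
    results = []
    for frac in fractions:
        full_scores, cv_scores = [], []
        for _ in range(repetitions):
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=frac)
            tree = DecisionTreeClassifier().fit(X_tr, y_tr)
            full_scores.append(tree.score(X, y))      # tested on 100% of the data
            cv_scores.append(tree.score(X_te, y_te))  # tested on the unseen remainder
        results.append((frac, np.mean(full_scores), np.mean(cv_scores)))
    return results

# Example with random placeholder data (10 features, 3 classes).
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(2000, 10))
y = rng.integers(0, 3, size=2000)
for frac, full_acc, cv_acc in accuracy_curve(X, y, np.linspace(0.05, 0.95, 5)):
    print(f"train={frac:.0%}  full={full_acc:.3f}  cv={cv_acc:.3f}")
```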


FIGURE 5.1: Examples of Decision Tree models trained on random samples of (A) 10 spending lines, (B) 1000 spending lines.

5.2

Computational time comparison

In this section we look at the Computational Time (or time complexity) needed for the new algorithm and how it compares with the old (or 'current') algorithm. For the new algorithm to be effective it should satisfy two conditions:

1. The Decision Tree's time complexity on the given data should be less than that of the old Self-Enrichment.

2. The first condition should also hold when scaling to a larger dataset.

(A) 4 levels accuracy: full dataset and cross-validation. (B) Same plot, focusing on the top-right part.

FIGURE 5.2: New algorithm (Decision Tree) accuracy of classification, trained on a randomly sampled percentage of the available data set and then tested on 100% of the data (lines 'L1'-'L4') and on the remaining dataset not used for training (cross-validation, same lines with the 'CV' suffix). The simulation runs from 1% to 99% of the data in 20 steps, taking the mean of 10 repetitions.

Figure 5.3 (A) shows that the computational time of both the new and the old algorithm grows linearly. The new algorithm's time is also split into 4 sub-measurements (Pre-processing, Decision Tree training, Classification and Post-processing) to make sure computational time is measured on delivering the same result on the same dataset (without default pre-processing). While plot (A) shows the general behaviour, Figure 5.3 (B) shows exactly how large the difference between the two algorithms is in absolute value on the largest dataset available.

5.3

Simulation of noisy data

5.3.1 Feature importance (weight)

Feature importance is a measure of how valuable the input of a certain feature is to the overall prediction, on a [0,1] scale. The higher the value, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the split criterion brought by that feature. It is also known as the Gini importance (Breiman and Cutler, 2004).

Every time a split of a node is made on variable m, the Gini impurity criterion for the two descendant nodes is less than that of the parent node.

This measurement will be used later as the 'weight' of each feature. It is also worth noting that feature weights are not the same every time a Decision Tree is generated, due to the element of randomness in creating a Decision Tree in the algorithm.
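A brief sketch of reading these weights from a fitted scikit-learn tree follows; the feature matrix and labels are random placeholders, not the Philips data.

```python
# Sketch: per-feature Gini importances ('weights') from fitted trees.
# Different random seeds can give different weights on the same data,
# illustrating the randomness noted above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(500, 10))   # placeholder encoded spend features
y = rng.integers(0, 5, size=500)          # placeholder class labels

for seed in (1, 2):
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    print(np.round(tree.feature_importances_, 3))  # importances sum to 1
```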

5.4

Noise rate

A common problem in Procurement comes from human error: quite often values in the spend data are polluted; some codes are missing, inserted incorrectly, and so on. To simulate and measure the effect of such errors on Enrichment, let us define a 'noise rate' parameter from 0.1 to 0.9 for a feature. It 'pollutes' one column in such a way that the noise rate defines the chance of each value in the column being replaced at random by any other possible value for that column. Higher noise rates pollute the column significantly, and vice versa.

FIGURE 5.3: Computational time comparison for the old ('Current') and new ('Decision Tree') algorithms. Figures (A) and (C) show the overall linear behaviour of the algorithms depending on the size of the dataset. In (B) the new algorithm is divided into 4 steps: Pre-processing, Decision tree training, Classification and Post-processing. Each point on the graph represents the mean, with the standard deviation as error bar, of 100 measurements for the new algorithm (total and all steps) and of 5 measurements for the old algorithm.

Next, we measure the effect of the noise rate on each feature on the accuracy of the new algorithm's prediction by running a simulation that generates noisy data. The 3D surface in Figure 5.4 (A) indicates overall accuracy on noisy data, while plot (B) shows accuracy only on the data that was changed during the 'pollution' stage (to see how the algorithm handles not only the data overall, but the changed data in particular).
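The pollution step itself can be sketched as below; this is an illustrative implementation with made-up data, assuming the features have already been encoded as integers.

```python
# Sketch of the noise-injection ('pollution') step described above: each value
# in one feature column is replaced, with probability equal to the noise rate,
# by a value drawn at random from that column's set of possible values.
import numpy as np

def pollute_column(X, column, noise_rate, rng=None):
    """Return a copy of X with one column polluted at the given noise rate."""
    rng = rng or np.random.default_rng()
    X_noisy = X.copy()
    values = np.unique(X[:, column])              # possible values for the column
    hit = rng.random(X.shape[0]) < noise_rate     # which rows get polluted
    X_noisy[hit, column] = rng.choice(values, size=hit.sum())
    return X_noisy

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(8, 3))
print(pollute_column(X, column=1, noise_rate=0.5, rng=rng))
```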

In (A) we can separate the features into two categories: first, features that can handle noise (features 0, 2, 3, 4, 5, 6) and do not affect the overall robustness of the algorithm; second, features that are quite sensitive to noise, for which accuracy drops as soon as they are polluted.

FIGURE 5.4: Classification accuracy. (A) Simulation for different noise rates and features; accuracy measured on the entire data set. (B) The same simulation, but accuracy measured only on the data affected by noise. 100% of the training set is used for training. Each measurement uses the mean accuracy of 10 new Decision Trees.

5.5

Feature ’Damage’ integral

From the previous sections we learnt:

1. Features used for classification have weights showing how much the Decision Tree depends on each feature, and these weights are new and different every time a Tree is generated.
2. Some features are robust to noise and some are very sensitive.

This leads to quite reasonable questions:

1. Is the weight related to how much noise a feature can handle?
2. Does a small weight mean a feature is insignificant? Is that the reason it does not react to noise?
3. Is it possible to exclude some of the features without a decrease in accuracy? If so, will it save computational time?

So far we have only defined the noise rate parameter, which helps to measure the accuracy for a certain feature at a certain noise rate, but this is not enough to measure the overall damage a noised feature can cause, since accuracy does not change linearly as the noise rate grows. In order to find such a measure, let us define the following integral:

$$I_f = \int_{\alpha}^{\beta} A^t_f(x)\,dx$$

the 'damage' integral for feature $f$, where $x \in [\alpha, \beta]$ is the noise rate, $0 < \alpha < \beta < 1$, and $A^t_f(x)$ is the accuracy of the decision tree trained on training set $t$ for feature $f$ at noise rate $x$.

This definite integral gives us a value with which to evaluate how well a feature of the algorithm performs. An illustration with intuition is given in Figure 5.5.

Numerically, in this work the 'damage' integral was calculated using a Riemann sum.
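A sketch of such a Riemann-sum estimate is given below; the decaying accuracy curve is a placeholder, whereas in the actual simulation each point would come from evaluating trees on data polluted at that noise rate.

```python
# Sketch of estimating the 'damage' integral I_f for one feature with a left
# Riemann sum over the accuracy curve A_f(x) (placeholder curve, not real results).
import numpy as np

def damage_integral(accuracy_at, noise_rates):
    """Left Riemann sum of accuracy_at(x) over the given sorted noise rates."""
    total = 0.0
    for x0, x1 in zip(noise_rates[:-1], noise_rates[1:]):
        total += accuracy_at(x0) * (x1 - x0)
    return total

# Example with an assumed accuracy curve that decays as noise grows.
noise_rates = np.linspace(0.01, 0.99, 20)
print(damage_integral(lambda x: 0.99 - 0.3 * x, noise_rates))
```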

FIGURE 5.5: Example: 'damage' integral of feature 1 (A) and feature 9 (B) based on the noise coefficient. The grey surface area shows the desired 'damage' integral: the difference between the accuracy on noisy data and the accuracy on the noise only, integrated over the noise coefficient from 1% to 99%.

5.5.1 Weights / ’damage’ integral distribution

Since there is no direct formula connecting the 'damage' integral and the feature weight, the easiest way to at least understand the behaviour of the model on given data is to draw the distribution of the 'damage' integral against the weight of each feature and explore it. Figure 5.6 shows 70 Decision Trees, each with 10 features, generated and sampled on 100% of the training data.

The conclusions we draw from the distribution are not as obvious as they might have seemed before:

1. For each feature, the 'damage' integral has a somewhat linear dependency on the weight.
2. Features 7 and 8 almost always have a stable, high weight and integral value; those are core features (which is also true based on inside information about the company's data and structure).
3. Features 'around' zero are most likely not relevant for classification, and there may be no need for them, since even when polluted they cause no harm (see Figure 5.4 (A)).
4. Features 1 and 9 are quite unpredictable; they often have different weights for each Decision Tree, and the more weight they have, the more sensitive the entire model is to noise in these features.

5.6

Excluded features simulation

FIGURE 5.6: Sampling distribution of the 'damage' integral. For each sample, a Decision Tree is generated and its weights are calculated; then for each feature the 'damage' integral is calculated and plotted paired with the feature's weight. 70 samples are drawn, with 10 features in each, 700 points in total.

There is a strong indication that some of the features can be excluded without really losing accuracy, and maybe even gaining some. We shall design a blueprint for finding the right features to exclude and prove its usefulness (a sketch of this loop is given after the list):

1. Simulate new algorithm cross-validation by sampling the training set from 1% to 99%.
2. Generate N Decision Trees on the training data and a weight array for each. Find the feature that has the lowest weight on average.
3. Exclude that feature from the training set.
4. Train a Decision Tree on the new training set.
5. Go to step 1, until only one feature remains.
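The following sketch shows one way this exclusion loop could look under assumed data shapes; it is illustrative, with random placeholder data, and is not the thesis implementation.

```python
# Sketch of the iterative feature-exclusion loop above: repeatedly drop the
# feature whose mean Gini weight across several trees is lowest, and track
# cross-validation accuracy along the way.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def exclusion_curve(X, y, n_trees=10, train_size=0.8):
    """Cross-validation accuracy after excluding 0, 1, 2, ... least important features."""
    kept = list(range(X.shape[1]))
    accuracies = []
    while len(kept) >= 1:
        X_tr, X_te, y_tr, y_te = train_test_split(X[:, kept], y, train_size=train_size)
        trees = [DecisionTreeClassifier().fit(X_tr, y_tr) for _ in range(n_trees)]
        accuracies.append(np.mean([t.score(X_te, y_te) for t in trees]))
        mean_weights = np.mean([t.feature_importances_ for t in trees], axis=0)
        kept.pop(int(np.argmin(mean_weights)))     # drop the least important feature
    return accuracies

rng = np.random.default_rng(0)
X = rng.integers(0, 30, size=(1000, 10))
y = (X[:, 7] + X[:, 8]) % 4                        # labels driven by two 'core' features
print(np.round(exclusion_curve(X, y), 3))
```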

Figure 5.7 shows the overall surface of accuracies. Until 8 of the 10 features are excluded, the accuracy stays close to 99% and does not significantly decrease.

For a more thorough look and comparison, let us make another plot that also includes measurements of computational time (Figure 5.8).

If we slice this simulation at an 80% training set only, we can observe the nature of the simulation results (Figure 5.9).

Based on the simulation we can predict that 8 (!) out of 10 features actually make little difference, only wasting computational time and creating unnecessary paths in the decision tree.

FIGURE 5.7: Simulation of how prediction accuracy changes based on the number of iteratively excluded least important features and the training set size for cross-validation. 20 steps for each simulation; the mean of 10 repetitions becomes a point of the surface.

FIGURE 5.8: Accuracy and computational time for the new algorithm with missing features. (A) shows a comparison of cross-validation accuracies of the new algorithm with 9 variations of the same algorithm, such that each line represents the number of features excluded from the training set. Iteratively, in each line the least important feature, based on the weights in the Decision Tree, is excluded. 20 steps in the measurements; the mean of 10 repetitions is plotted, with the standard deviation as error bar. In order to decide which feature to exclude, 10 Decision Trees are generated and the least important feature is determined from the mean weights of the generated models. Computational time measurements (dash-dotted marker) are also put on the plot. (B) shows a close-up of the top-right corner.


FIGURE 5.9: Prediction Accuracy and Computational Time depending on the number of features excluded. Data for the 80% training set.


Chapter 6

Conclusions

In this thesis, one of the least dissected and even niche, yet still important, topics in data classification for businesses was explored. Spend data classification is a topic that companies treat very individually, and by repeating the same pipelines year after year they often misrepresent the importance of this algorithm.

In this work it was argued that spend classification solutions should move from a strictly deterministic approach to a cheaper, more flexible, and supposedly more accurate in the long run supervised learning method. For the new algorithm it is now possible to change almost any part of it: add or remove any number of features, or change the set of classes. And it is entirely irrelevant exactly what kind of string data is in the dataset.

The model, which does not depend on the data format or the classes, proved on real data to be almost two orders of magnitude more efficient in terms of computational time, while showing no signs of decreased accuracy.

The simulation with noise generation showed that there is a linear dependency between the weight of a feature in the decision tree and how much noise in that feature can affect the overall classification. The simulation also showed that there can be features that are important yet such that a little noise in their columns has a negligible effect on the overall accuracy, and features so 'useful' that they only give an illusion of better classification when, in fact, they do not improve accuracy significantly and only slow down classification by creating unreasonable paths in the trees.

More importantly, the simulation gave an idea of how one can adjust the classification algorithm so that the effectiveness of each feature can be evaluated.

6.1

Future work

So far the algorithm has not been tested on larger datasets. It would be beneficial to know how, say, an algorithm trained explicitly on 2016 data would perform when classifying 2017 data.

Some features, like the cost of the transaction, were not used in the algorithm at all (due to over-fitting and a company restriction against using it for classification); this needs to be researched separately, to at least prove that not using it was a good idea in the first place, or to find a way to also use it for classification.

The universality of the new algorithm is still in question because it was tested on only one type of data; it thus needs to be tested on other datasets, preferably from companies with the same issues but a different data storage approach.


Bibliography

Aberdeen Group (2003). Best Practices in Spending Analysis. URL: https://www.unspsc.org/Portals/3/Documents/Best%20Practices%20in%20Spending%20Analysis%20--%20Cure%20for%20a%20Corporate%20Epidemic.pdf.

Agarwal, Ritu and Vasant Dhar (2014). "Editorial — Big Data, Data Science, and Analytics: The Opportunity and Challenge for IS Research". In: Information Systems Research.

An Oracle White Paper (April 2012). Spend Management Best Practices: A Call for Data Management Accelerators. URL: http://www.oracle.com/us/industries/industrial-manufacturing/spend-management-best-practices-1621437.pdf.

Badulescu, Laviniu Aurelian (2007). "The choice of the best attribute selection measure in Decision Tree". In: Annals of University of Craiova, Math. Comp. Sci. Ser., Volume 34(1), 2007, pages 88-93.

Breiman, L. and A. Cutler (2004). Random Forests. URL: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.

Breiman, L. et al. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.

Clinton, Nancy (2017). Coupa Adds AI and Machine-Learning Spend-Data Classification to its Suite. URL: http://spendmatters.com/uk/coupa-adds-ai-machine-learning-spend-data-classification-suite/.

Decision Trees, Scikit-learn. Classification. URL: http://scikit-learn.org/stable/modules/tree.html.

Ding, Y. et al. (2002). "GoldenBullet: Automated Classification of Product Data in E-commerce". In: Business Information Systems - BIS.

Edler, Jakob and Luke Georghiour (2007). "Public procurement and innovation — Resurrecting the demand side". In: Research Policy, Volume 36, Issue 7, September 2007, pages 949-963.

García, Salvador, Julián Luengo, and Francisco Herrera (2014). "Dealing with Noisy Data". In: Data Preprocessing in Data Mining, pp. 107-145.

He, Haibo and Edwardo A. Garcia (2009). "Learning from Imbalanced Data". In: IEEE Transactions on Knowledge and Data Engineering 21 (9).

Herfurth, Maik and Peter Weiss (2010). "Approach to e-classification and e-tradability of complex industrial services". In: eChallenges 2010 conference, IEEE.

Kim, D. et al. (2004). "A Semantic Classification Model for e-Catalogs". In: Proceedings of the 2004 IEEE International Conference on E-Commerce Technology (CEC 2004), IEEE Computer Society, pp. 88-92.

Kotsiantis, S.B. (2007). "Supervised Machine Learning: A Review of Classification Techniques". In: Emerging Artificial Intelligence Applications in Computer Engineering.

Leukel, J. and G. Maniatopoulos (2005). "A Comparative Analysis of Product Classification in Public vs. Private eProcurement". In: The Electronic Journal of e-Government, Volume 3, Issue 4, pp. 201-212.

Martinez, Lourdes S. (2017). "Data Science". In: Encyclopedia of Big Data.

Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar (2012). Foundations of Machine Learning. The MIT Press. ISBN: 9780262018258.

Murthy, Sreerama K. (1998). "Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey". In: Data Mining and Knowledge Discovery, Volume 2, Issue 4.

Ondracek, N. and S. Sander (2003). "Concepts and benefits of the German ISO 13584-compliant online dictionary". In: Proceedings of the 10th ISPE International Conference on Concurrent Engineering (CE 2003), pp. 255-262.

Provost, Foster and Tom Fawcett (2013). "Data Science and its Relationship to Big Data and Data-Driven Decision Making". In: Big Data, February 2013, 1(1): 51-59.

Singh, Moninder et al. (2005). "Automated cleansing for spend analytics". In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management.

Song, Il-Yeol and Yongjun Zhu (2016). "Big data and data science: what should we teach?" In: Expert Systems, Special Issue: Big Data trends: Modeling, Management and Visualization.

Waller, Matthew A. and Stanley E. Fawcett (2013). "Data Science, Predictive Analytics, and Big Data: A Revolution That Will Transform Supply Chain Design and Management". In: Journal of Business Logistics, Volume 34, Issue 2, June 2013, pages 77-84.
