
Feature Reduction for Binary Classification Problems using Weight of Evidence and XGBoost

Dante Niewenhuis (11058595)

Bachelor thesis, Credits: 18 EC
Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: dr. Sander van Splunter
Informatics Institute, Faculty of Science
University of Amsterdam
Science Park 904, 1098 XH Amsterdam

June 29th, 2019

Abstract

In binary classification problems, several features present in a data set do not influence the prediction process. These features are redundant and not used, but they do cause the learning algorithm to be slower and more prone to overfitting. In this thesis, an attempt is made to create a system that removes these redundant features from a data set using a combination of Weight of Evidence and XGBoost. This system is evaluated using neural networks, comparing both the balanced accuracy and the F-score. This thesis is written in collaboration with ABN-AMRO, using their incident data set. Aside from the ABN data set, three other data sets are evaluated to get a broader understanding of the impact of the method used. All four data sets tested resulted in a significant reduction of the number of features without a drop in predictive power. One of the data sets even showed a significant increase in both the balanced accuracy and the F-score. Evaluating the results has shown that a combination of Weight of Evidence and XGBoost gives more consistent and better results than either method by itself.

Keywords— Feature Reduction, Weight of Evidence, Information Value, Neural Networks, Readability.

Contents

1 Introduction
  1.1 Research questions
  1.2 Hypotheses
  1.3 Context: Blue Student Lab and ABN-AMRO
      1.3.1 Related Thesis 1: Predicting resolution time
      1.3.2 Related Thesis 2: Predicting assignment group
      1.3.3 Related Thesis 3: Predicting caused by change
      1.3.4 Related Thesis 4: Clustering events and incidents
2 Background knowledge
  2.1 Weight of Evidence and Information Value
      2.1.1 Binning
      2.1.2 Calculating Weight of Evidence
      2.1.3 Calculating Information Value
      2.1.4 Potential problems
  2.2 Decision Tree Algorithm
      2.2.1 Random Forest
      2.2.2 AdaBoost
      2.2.3 XGBoost
      2.2.4 Potential problems
  2.3 Neural Networks
      2.3.1 Overfitting and speed
  2.4 Encoding
      2.4.1 Ordinal Encoding
      2.4.2 One-Hot Encoding
  2.5 Evaluation Metrics
3 Data
  3.1 Simple data set - Titanic
  3.2 Big data set - WeatherAUS
  3.3 Complex data set - Adult
  3.4 Domain data set - ABN-AMRO OOT
4 Method
5 Results
  5.1 Simple data set - Titanic
  5.2 Big data set - WeatherAUS
  5.3 Complex data set - Adult
  5.4 Domain data set - ABN-AMRO OOT
  5.5 Analysis
6 Conclusion
7 Discussion & Future Research
  7.1 Discussion
  7.2 Future Research
  7.3 Recommendations for ABN-AMRO
8 Acknowledgements
References
Appendices
A Data clarification
  A.1 Simple data set - Titanic
  A.2 Big data set - WeatherAUS
  A.3 Complex data set - Adult
  A.4 Domain data set - ABN-AMRO OOT
B Test Results
  B.1 Titanic
      B.1.1 Accuracy
      B.1.2 Loss
  B.2 Weather
      B.2.1 Accuracy
      B.2.2 Loss
  B.3 Adult
      B.3.1 Accuracy
      B.3.2 Loss
  B.4 ABN-AMRO
      B.4.1 Accuracy
      B.4.2 Loss

Abbreviations

bAcc Balanced Accuracy

DTC Decision Tree Classification algorithm

IG Information Gain

IV Information Value

NN Neural Network

WoE Weight of Evidence

1 Introduction

Large-scale organizations rely on many hundreds of applications. For such applications, a lack of availability, reliability, or responsiveness can lead to extensive losses (Wang et al., 2013). For example, customers being unable to place orders could cost Amazon up to $1.75 million per hour (Wang et al., 2013), which means that knowledge of software, hardware and their incidents is vital. This thesis is written in collaboration with ABN-AMRO[1] and attempts to gain information from incident data. ABN-AMRO is a large organization based in the Netherlands that deals with a large number of applications in many different fields of operation, ranging from online banking to internal communication systems. Having this many different systems working together creates many possible problems, which need to be solved as quickly as possible. When an incident is reported, it is assigned a priority rating as well as a time of completion. If this time is not met, the result is an out-of-time incident (OOT). Reducing the number of OOTs is a high priority for ABN-AMRO.

In 2018, ten Kaate attempted to create a system capable of predicting whether an incident would go out of time based on its first documentation (ten Kaate, 2018). This was achieved using a multi-layered neural network and resulted in an accuracy of 0.7679, but a precision of only 0.2169 (ten Kaate, 2018). Neural networks have the advantage that most prediction problems can be solved quite accurately without much added domain knowledge. Neural networks, however, have problems with readability: it is hard to know which features are the more important ones, or why certain data sets are less complicated to predict than others. This makes neural networks very effective when only predictions are needed, but insufficient when looking for insight into the solution. Knowing why incidents are predicted to be out of time could help ABN-AMRO reduce the number of incidents rather than merely predict them.

In this thesis, an attempt is made to expand on the project by ten Kaate by making a system that removes features that are redundant when trying to make predictions. Besides readability, reducing features has more advantages. The first obvious improvement is the speed of the algorithm. Regardless of the kind of algorithm used, more features almost always mean slower execution. Removing redundant data will, therefore, always have a positive impact on speed. The second advantage is a lower chance of overfitting. Overfitting is the phenomenon where the algorithm is not finding patterns that could help with predicting but is merely memorizing the data. Many factors can cause overfitting, and features that do not add new predictive information are one of them. Reducing the features in a data set could lower the possibility of overfitting and thereby improve predictive power.

The system proposed in this thesis is a combination of Weight of Evidence[2] (WoE) and Extreme Gradient Boosting[3] (XGBoost). WoE is a measure of how much a feature supports or undermines a hypothesis. WoE is ideally used when dealing with binary problems but can be modified to work on classification problems with more than two possible categories. WoE is further explained in Subsection 2.1. XGBoost is a tree boosting algorithm. Tree boosting algorithms use multiple weak learners and combine them to create a strong learner. An advantage of XGBoost and other boosting algorithms is their readability. XGBoost is an ideal algorithm to determine the importance of features in a data set. XGBoost is further explained in Subsubsection 2.2.3.

[1] https://www.abnamro.nl
[2] https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
[3]

Removing features has many advantages, but those advantages are worthless if the removal causes a significant drop in predictive power. This is why the evaluation focuses primarily on the impact of the removal on the predictive power. Evaluation of the system is done using neural networks. For the evaluation, three neural networks are trained: one trained on all features as a reference, one trained on the important features, and one trained on the unimportant features. The three networks are compared based on predictive power. Predictive power is based on both the balanced accuracy (bAcc) and the F-score. In this thesis, a significant drop in predictive power is defined as a drop of more than 0.05 in either F-score or bAcc. All evaluation metrics used in this thesis are explained in Subsection 2.5. In an ideal result, the network trained on the important features has no significant drop in predictive power compared to the reference network, while the network trained on the unimportant features does.

1.1 Research questions

Given the goals stated in Section 1, research question 1 is formulated. The research question is divided into four subquestions. Subquestions 1.1 and 1.2 determine whether the methods used in this thesis are individually capable of reducing the number of features of a data set. This is important to know when explaining why the combination of the two is effective or ineffective. These subquestions can also reveal potential problems of the two methods, which is very useful information when discussing the viability of the combination. Subquestion 1.3 is used to determine which of the two methods is more effective. If one of the methods is more effective, it is used as the reference when answering Subquestion 1.4. Subquestion 1.4 is the most important to answer, because even if it is possible to reduce the number of features using a combination of WoE and XGBoost, this is only beneficial if it is more effective than both methods separately. If one of the methods is as effective as the two combined, it is a waste of effort to use the combined method.

Can a combination of Weight of Evidence and XGBoost be used to reduce the number of features of a data set without a significant drop in predictive power when solving a binary classification problem? (1)

Can Weight of Evidence be used to reduce the number of features of a data set without a significant drop in predictive power when solving a binary classification problem? (1.1)

Can XGBoost be used to reduce the number of features of a data set without a significant drop in predictive power when solving a binary classification problem? (1.2)

Does Weight of Evidence or XGBoost perform better when reducing the number of features of a data set without a significant drop in predictive power when solving a binary classification problem? (1.3)

Does a combination of Weight of Evidence and XGBoost perform better when reducing the number of features of a data set without a significant drop in predictive power when solving a binary classification problem than one of them by themselves? (1.4)

1.2 Hypotheses

The first subquestion is expected to be answered positively, since the method used is based on research on variable reduction using Weight of Evidence (Lin & Hsieh, 2014). The second subquestion is also expected to be answered positively, based on papers written on the possibilities of using random forest algorithms for feature reduction (Genuer, Poggi, & Tuleau-Malot, 2010). XGBoost is, like random forests, a tree ensemble algorithm and is therefore also expected to be capable of feature reduction. The third and fourth subquestions are much harder to predict. The combination of WoE and XGBoost would ideally combine the strengths of both methods and produce better and more reliable results. The questions stated above are answered using the ABN-AMRO incident data set, as well as three extra data sets. The three extra data sets are chosen based on size and difficulty to predict. This ensures that this thesis provides a broader overview of the reliability of the methods used.

1.3 Context: Blue Student Lab and ABN-AMRO

This thesis is written in the context of the Blue Student Lab. The Blue Student Lab is a collaboration between the University of Amsterdam and large organizations in which bachelor students get the opportunity to write their thesis using data from those organizations. The project started in 2018. Since then, there has only been a collaboration with ABN-AMRO, but it will be expanded in the future. In 2019, ten students are working together with ABN-AMRO, divided into two groups: blockchain and incident management. This thesis is part of the incident management group, and thus this section provides an introduction to the other projects in that group.

1.3.1 Related Thesis 1: Predicting resolution time

The first thesis in the OOT group is written by Riemersma (2019). In her thesis, Riemersma attempts to expand on the system of ten Kaate by not only predicting whether an incident will be out of time but also by how much time. Solving incidents in time is a complex task that can be optimized in several different ways. One aspect that may help this process is knowing the resolution time of an incident beforehand.

1.3.2 Related Thesis 2: Predicting assignment group

The second thesis in the OOT group is written by Wiggerman (2019). Wiggerman attempts to reduce OOT incidents by assigning incidents directly to the right assignment group. When an incident is noticed, it is assigned to an assignment group. If this assignment group is unable to solve the incident, it is passed on to another. This process continues until the incident is solved. The problem is that every assignment group needs to repeat many steps of the solving process, which means that time is spent very inefficiently. It is therefore no surprise that incidents with a high number of different assignment groups are more likely to take too long to solve. Wiggerman attempts to improve this process using neural networks and k-nearest neighbour clustering algorithms to predict the best assignment group for a given incident.


1.3.3 Related Thesis 3: Predicting caused by change

The third thesis in the OOT group is written by Velez (2019). In his thesis, Velez creates a theoretical model that could predict whether an incident is caused by a change. In large-scale software organizations, up to 80% of the incidents are caused by previously made changes (Scott, 2001). A system that could predict which change caused an incident would be beneficial when trying to solve software incidents and prevent further ones from occurring. Velez attempts to predict whether an incident is caused by a change using PU learning[4]. PU learning is a niche machine learning technique which uses a combination of machine learning algorithms and a special sampling method to handle incorrectly labelled data.

1.3.4 Related Thesis 4: Clustering events and incidents

The fourth thesis in the OOT group is written by Knigge (2019). At ABN-AMRO there are, besides incidents, also events. Events are incidents that are detected and registered by automatic systems within the organization. An example of an event is a bot that tries to log into a system every few minutes and creates an event every time it fails. Because events are created automatically, there is a tendency to create many events for the same problem. This can be overwhelming for teams solving incidents, and thus many of these events are largely ignored. In his thesis, Knigge looks at the possibilities of clustering these events so that it becomes easier to recognize new events and filter out duplicates. Knigge also tries to connect the events to an incident. In the example given above, this would mean that when an incident is created because a customer could not log in, this incident would be connected to the events created by the bot.

[4]

2 Background knowledge

2.1 Weight of Evidence and Information Value

Weight of Evidence (WoE) is a concept that has appeared in scientific literature for at least the last 50 years (Weed, 2005). It has mostly been used as a method of risk assessment but can also be used for segmentation, variable reduction and various other purposes. In this thesis, WoE is used for variable reduction with a method that is primarily based on a paper by Lin and Hsieh (2014). Lin and Hsieh use WoE to assess the predictive power of a feature by separating the data into multiple bins and calculating the difference between the proportion of events in each bin and that in the rest of the data. The bigger the discrepancy, the higher the WoE. In this thesis, an event means that the target value is true, while a non-event means the target value is false. The target is the feature that the algorithm tries to predict. For example, in the OOT data set, the goal is to predict whether an incident is going to be OOT. This means that the target is the feature OOT, an event is when the incident is OOT, and a non-event is when the incident is not OOT.

2.1.1 Binning

The method used in this thesis consists of four steps. The initial step is to separate the feature into bins. In a paper about WoE, Guoping states that three rules should be followed when binning a data set for WoE (Guoping, 2014). The first rule states that each bin should contain at least 5% of the observations. This prevents the final score from being determined by a small fraction of the data. The second rule states that missing values have to be placed in a separate bin. The third rule states that every bin should contain at least one event and one non-event. The third rule has not been followed in this thesis because the data used did not always allow for it. The problems caused by a bin with either no events or no non-events are solved using an adjusted WoE equation, which is explained in Subsubsection 2.1.2. In this thesis, the data is divided into nine bins plus one bin for missing data. The nine bins for the values are made as similar in size as possible. If the feature has fewer than nine unique values, the number of bins is equal to the number of unique values. The bins are made using the cut function from the pandas[5] package in Python.

2.1.2 Calculating Weight of Evidence

The second step is to calculate the WoE for every bin. The equation to calculate the WoE is as follows:

\[ \mathrm{WoE} = \ln\left(\frac{\%\,\mathrm{Events}}{\%\,\mathrm{nonEvents}}\right) \tag{1} \]

The WoE is calculated using the percentages of both events and non-events. Note that the percentage of events does not mean the percentage of observations in the bin that are events, but the percentage of events in the bin relative to the total number of events in the data set. The WoE is positive when the percentage of events is higher than the percentage of non-events and grows as the discrepancy grows. The WoE is negative when the percentage of events is lower than the percentage of non-events and decreases as the discrepancy grows. The WoE is zero when the percentage of events is equal to the percentage of non-events.

This equation for the WoE works in most cases but assumes that every bin has at least one observation that is an event and at least one that is a non-event. This is because dividing by zero, as well as taking the natural logarithm of zero, is mathematically undefined. As stated in Subsubsection 2.1.1, this thesis does not use the third rule of the paper by Guoping, and because the system created in this thesis should work for many different data sets, it cannot be guaranteed that bins with either zero events or zero non-events are absent. To accommodate all data, an adjusted equation for the WoE is used. The adjusted equation is as follows:

\[ \mathrm{adjWoE} = \ln\left(\frac{(\%\,\mathrm{nonEvents} + 0.5)\,/\,\%\,\mathrm{nonEvents}}{(\%\,\mathrm{Events} + 0.5)\,/\,\%\,\mathrm{Events}}\right) \tag{2} \]

The adjusted equation for the weight of evidence can handle zero events or zero non-events by adding a small value to both.

2.1.3 Calculating Information Value

The third step is to calculate the Information Value (IV) for all bins. The WoE expresses the degree of difference between the proportion of events in a single bin and that in the whole feature. However, to state something about the predictive power of a feature, the IV is needed. The equation for the IV is as follows:

\[ \mathrm{IV} = \mathrm{WoE} \cdot (\%\,\mathrm{Events} - \%\,\mathrm{nonEvents}) \tag{3} \]

Table (1) Predictive power of a feature based on the total Information Value (Lin & Hsieh, 2014).

    Information Value   Predictive Power
    < 0.02              Unpredictive
    0.02 to 0.1         Weak
    0.1 to 0.3           Medium
    0.3 to 0.5           Strong
    > 0.5                Suspicious

The IV of a bin is calculated by multiplying its WoE by the difference between the percentage of events and the percentage of non-events. The IV of a bin is always positive, because the WoE and the difference between the percentages always have the same sign: the WoE is negative exactly when the percentage of events is lower than the percentage of non-events. The IV increases rapidly when the discrepancy between the percentage of events and the percentage of non-events increases. The IV of a feature is calculated by taking the sum of the IVs of all its bins and can be used to determine the predictive power of the feature. Figure 1 shows two examples of the IV being calculated. Figure 1a shows a feature with a very low total IV, while Figure 1b shows a feature with a high total IV. In a paper about WoE and IV, Lin and Hsieh state that the predictive power of a feature can be determined using Table 1 (Lin & Hsieh, 2014). Using this table, the feature shown in Figure 1a would be considered unpredictive, and the feature shown in Figure 1b would be considered strong.
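To make the three calculation steps above concrete, the following sketch computes the total IV of a single feature with pandas, following Equations 1 and 3. It is a minimal illustration rather than the code used in this thesis: the 0/1 target convention, the column names and the use of pandas.qcut to obtain bins of similar size are assumptions (the thesis itself mentions the pandas cut function), and adding 0.5 to the counts is a smoothing choice in the spirit of the adjusted WoE of Equation 2.

    import numpy as np
    import pandas as pd

    def feature_iv(df, feature, target, n_bins=9):
        """Total Information Value of one feature (Equations 1 and 3)."""
        values = df[feature]
        # Binning (Subsubsection 2.1.1): up to nine bins of similar size plus
        # a separate bin for missing values.
        if pd.api.types.is_numeric_dtype(values) and values.nunique() > n_bins:
            binned = pd.qcut(values, q=n_bins, duplicates="drop").astype(str)
        else:
            binned = values.astype(str)
        binned = binned.where(values.notna(), "Missing")

        events = df[target] == 1                      # 1 = event, 0 = non-event
        total_e, total_ne = events.sum(), (~events).sum()

        iv = 0.0
        for _, idx in binned.groupby(binned).groups.items():
            e, ne = events.loc[idx].sum(), (~events).loc[idx].sum()
            # Share of all events / non-events that falls in this bin; the 0.5
            # keeps empty cells from breaking the logarithm (smoothing in the
            # spirit of the adjusted WoE, Equation 2).
            pct_e = (e + 0.5) / total_e
            pct_ne = (ne + 0.5) / total_ne
            woe = np.log(pct_e / pct_ne)              # Equation 1
            iv += woe * (pct_e - pct_ne)              # Equation 3
        return iv

    # Features with a total IV above the 0.05 threshold of Section 4 are kept:
    # important = [f for f in candidates if feature_iv(df, f, "OOT") > 0.05]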

2.1.4 Potential problems

Even though WoE has many advantages, it has two potential problems. The first potential problem is that WoE depends on the quality of the binning. Because WoE is calculated based on the difference between a bin and the total data, the method of binning can affect the results.

The second potential problem is the fact that WoE is based purely on the feature by itself. This can cause a problem when a feature is important for the prediction of a subset of the data but not necessarily for the prediction of all the data: because the WoE is calculated over the whole data set, such a feature could be classified as unimportant even though it is important.

(a) Weight of Evidence table for feature A. Feature A has low predictive power.

    Range    Bin   nonE   E     %nonE   %E     %E - %nonE   WoE       IV
    0-20     1     198    97    19.7    19.5   -0.2         -0.0002   0
    20-40    2     204    105   20.3    21.1    0.8          0.0009   0.0007
    40-60    3     197    98    19.7    19.6   -0.1         -0.0002   0
    60+      4     196    102   19.6    20.5    0.8          0.001    0.001
    Missing  5     207    98    20.7    19.7   -1.0         -0.0012   0.0012
    Total          1002   498                                         0.0029

(b) Weight of Evidence table for feature B. Feature B has high predictive power.

    Range    Bin   nonE   E     %nonE   %E     %E - %nonE   WoE       IV
    0-20     1     250    80    23.7    14.6   -9.1         -0.013    0.117
    20-40    2     180    140   17.1    25.5    8.4          0.0094   0.080
    40-60    3     250    80    23.7    14.6   -9.2         -0.013    0.117
    60+      4     196    120   18.6    21.8    3.2          0.0039   0.012
    Missing  5     180    130   17.1    23.6    6.5          0.0079   0.05
    Total          1056   550                                         0.376

Figure (1) Weight of Evidence tables for two different features. The low total IV of feature A suggests low predictive power, while the high total IV of feature B suggests high predictive power. (Note that "events" is abbreviated to E.)

2.2 Decision Tree Algorithm

Decision Tree Classification algorithms (DTC) are among the most used learning algorithms. DTCs have many advantages; they are, for example, straightforward to use (Su & Zhang, 2006). This simplicity is caused by the fact that DTCs do not require many parameters and can deal with many different data types effectively. Another advantage of DTCs is readability. DTCs are easy to understand because they work similarly to how we would make decisions ourselves. DTCs can be used for various types of categorical problems, but in this thesis we only discuss binary classification problems, which means the answer is either true or false.

DTCs work by splitting the data into subsets based on feature values. The best split is determined using Information Gain (IG). The IG of a split is the entropy before the split minus the entropy after the split. The higher the IG, the better the split. The entropy of a data set is calculated using the following equation:

\[ E = \sum_{i=0}^{c} -p_i \log(p_i) \tag{4} \]

In this equation, p_i is the fraction of the total observations that belongs to category i. When dealing with binary problems, there are only two possible categories, true or false. This simplifies the equation into:

\[ E = (-p_t \log(p_t)) + (-p_f \log(p_f)) \tag{5} \]

The entropy behaves like a parabola: it peaks at p_t = 0.5 (with a value of 1.0 when a base-2 logarithm is used) and has a value of 0.0 if either everything is true or everything is false. After a split into subsets, the entropy of the data set is calculated as the weighted average of the entropies of the subsets. The equation for this weighted average is as follows:

\[ E(S) = \sum_{i=1}^{n} P_i \cdot E(i) \tag{6} \]

In this equation, E(S) is the entropy of the whole data set after the split, while E(i) is the entropy of subset i. P_i is the fraction of the data that is part of subset i, which means that the entropy of larger subsets weighs more heavily.

Figure (2) Example of a simple decision tree

Figure 2 shows an example of an effective split. The data set consists of 11 observations, of which five are red stars and six are blue diamonds. The best prediction that could be made from this initial data set would be to predict all observations to be diamonds, which would result in only 55% of the predictions being correct. The difficulty of prediction is also shown by the high entropy value of 0.69.

The data set is split into two subsets, one consisting of all observations with feature X larger than 30 and one consisting of the remaining observations. The entropies of the two subsets, 0.45 and 0.5 respectively, are lower than the entropy of the root. The entropy of the data set after the split is calculated using the weighted average and results in 0.47. The IG of the split therefore has a value of 0.22, indicating that the split is effective.

The example given in Figure 2 is of a simple tree consisting of only one split, while in reality many more splits are needed to predict complex data sets correctly. It is not uncommon for trees to grow to many hundreds of splits. When using DTCs, it is advised to limit the number of splits to prevent overfitting.
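The numbers in Figure 2 can be reproduced with a few lines of Python. This is only an illustrative sketch: the exact composition of the two subsets is not stated in the text, so the 1/5 and 4/1 split below is an assumption chosen to match the reported entropies of 0.45 and 0.5.

    import math

    def entropy(counts):
        """Entropy of a label distribution (Equation 4, natural logarithm)."""
        total = sum(counts)
        return -sum(c / total * math.log(c / total) for c in counts if c > 0)

    root = entropy([5, 6])                          # 5 stars, 6 diamonds -> ~0.69

    # Assumed subset composition (chosen to match the reported 0.45 and 0.5):
    # left: 1 star and 5 diamonds, right: 4 stars and 1 diamond.
    left, right = entropy([1, 5]), entropy([4, 1])  # ~0.45 and ~0.50

    after = (6 * left + 5 * right) / 11             # weighted average (Equation 6) -> ~0.47
    gain = root - after                             # information gain -> ~0.22
    print(round(root, 2), round(after, 2), round(gain, 2))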


2.2.1 Random Forest

Even though DTCs can be good classifiers and offer great readability, there are classification problems that are very hard to solve using normal DTCs. One of the methods to improve the predictive power of tree algorithms is to extend them into a random forest algorithm. Random forest algorithms function by creating a large number of simple DTCs that are all trained on subsets of the data set. These small DTCs are called weak learners because they have low predictive power by themselves. When a random forest algorithm makes a prediction, all weak learners make a prediction. The predictions from the weak learners are evaluated, and the most common prediction is chosen as the final prediction. Results from research by Breiman show that random forest algorithms are more reliable and accurate compared to algorithms that are based on a single tree (Breiman, 2001).

2.2.2 AdaBoost

AdaBoost is one of the most popular boosting ensemble algorithms. AdaBoost uses boosting to create and weight the large number of weak learners that, as in random forest algorithms, together form the classifier. AdaBoost is the first practical boosting algorithm and is still one of the most widely used (Schapire, 2013). The first step of AdaBoost is to create a weak learner, similar to the one shown in Figure 2, based on the full data set. Note that every weak learner used by AdaBoost consists of a single split; these are also called stumps. A subset of the data is then created, which consists primarily of the observations that are not predicted correctly by the first tree. A second tree is created based on this new subset. This process repeats until either the desired number of weak learners has been created or the predictions made by the algorithm have reached the desired accuracy. In AdaBoost, not all weak learners are weighted equally when predicting; they are assigned weights which determine how much they influence the prediction. The weight of a weak learner is determined by the fraction of the data it predicts correctly.

2.2.3 XGBoost

In this thesis, Extreme Gradient Boosting (XGBoost) is used instead of AdaBoost. XGBoost is similar to AdaBoost but has certain advantages which make it more suitable here. The first reason to use XGBoost is its optimization for sparse data. XGBoost has been shown to run 50 times faster on sparse data than naive boosting algorithms (Chen & Guestrin, 2016). Effective handling of sparse data is vital given the amount of sparse data used in this project. Benchmarks comparing different types of boosting algorithms[6] show that XGBoost is among the fastest and most accurate boosting algorithms. XGBoost has proven to be very successful and is widely used in many programming competitions. An example of this is the KDDCup 2015, where all top-10 finishers used XGBoost (Chen & Guestrin, 2016).

This thesis uses XGBoost to determine the importance of each feature. First, a model is trained using XGBoost. When using the Python version of XGBoost[7], it is possible to get a list of feature importances. From this list, the most important features can be selected. Determining which features to select can be done using various methods, but in this thesis a simple threshold is used. If a feature has a higher importance than the threshold, it is selected; otherwise, it is removed. Increasing this threshold decreases the number of selected features but increases the possibility of a significant drop in performance.

[6] http://datascience.la/benchmarking-random-forest-implementations/
[7] https://xgboost.readthedocs.io/en/latest/python/python_intro.html
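As an illustration of this selection step, the sketch below trains an XGBoost classifier and splits the features at an importance threshold of 0.01, the value used in Section 4. It is a simplified sketch rather than the exact code used in the thesis: the default XGBClassifier parameters and a numerically encoded feature matrix are assumptions.

    import pandas as pd
    from xgboost import XGBClassifier

    def select_by_importance(X: pd.DataFrame, y, threshold: float = 0.01):
        """Split the columns of X into important and unimportant features."""
        # X is assumed to be numerically encoded (for example with ordinal
        # encoding, Subsubsection 2.4.1); model parameters are left at defaults.
        model = XGBClassifier()
        model.fit(X, y)
        importance = pd.Series(model.feature_importances_, index=X.columns)
        important = importance[importance >= threshold].index.tolist()
        unimportant = importance[importance < threshold].index.tolist()
        return important, unimportant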

2.2.4 Potential problems

Even though XGBoost has many advantages, it has two potential problems. The first potential problem is that XGBoost is a greedy algorithm, which means that XGBoost generates its splits using heuristics rather than processing the whole data set. This could result in XGBoost making locally optimal choices that are not always globally optimal, which could affect the importance value given to a feature.

The second potential problem is the way XGBoost deals with features containing similar information. If two features contain similar information, for example by being correlated, XGBoost only needs one of the two features for predicting. This means that one of the two features could get a very low importance rating, even though it is as important as the other feature.

2.3 Neural Networks

Neural networks (NN) are among the most used learning algorithms for solving classification problems. NNs are used for classification problems like credit card fraud detection, cursive handwriting recognition and cancer screening, to name a few (Widrow, Rumelhart, & Lehr, 1994). Standard NNs consist of many simple, connected processors called neurons, each producing a sequence of real-valued activations (Schmidhuber, 2014). Input neurons are activated through sensors perceiving the environment, and other neurons get activated through weighted connections from previously active neurons (Schmidhuber, 2014). The simplest forms of NNs have been around for over 50 years, with the first form of a neural network proposed in 1943 (McCulloch & Pitts, 1943). It was not yet able to learn but was dependent on static parameters given by the user. Nowadays, neural networks can train on data using either supervised or unsupervised methods.

2.3.1 Overfitting and speed

Many machine learning models are prone to overfitting. Overfitting is the phenomenon where, instead of finding patterns in the data, the algorithm starts to memorize the data. An example of overfitting is shown in Figure 3. In this figure, two algorithms attempt to predict the value of feature Y based on feature X. Figure 3a shows a line that predicts the value of feature Y while not being too complex. Figure 3b shows a line that predicts the value of feature Y with a very complex curve. While Figure 3b is much more accurate when predicting the training data, it is much worse when predicting the validation data.

NNs have the advantage that they can learn very complicated relationships between inputs and outputs. NNs are, however, very susceptible to overfitting (Lawrence, Giles, & Tsoi, 1997). Lawrence et al. state that one reason for overfitting is the high number of weights present. A reduction of the number of features in a data set reduces the number of weights (Lawrence et al., 1997) and could thereby reduce the possibility of overfitting. Another advantage of feature reduction is the improvement in learning speed. The number of calculations done by a NN depends on the number of weights; if this number is reduced, the training speed automatically increases.

(a) Example of an algorithm that is not overfitting.

(b) Example of an algorithm that is overfitting.

Figure (3) Example of the difference between an algorithm that is overfitting and one that is not.

2.4 Encoding

Many algorithms have difficulties when dealing with categorical data. These difficulties are caused by the fact that most algorithms function using numeric data. To resolve this problem, categorical data is processed using an encoder. There are many different methods of encoding data, but only two are used in this thesis.

2.4.1 Ordinal Encoding

The first method used is ordinal encoding, which means that categorical values are replaced by numeric values. An example of ordinal encoding is shown in Figure 4. This method is sometimes also called numeric or integer encoding. In this thesis, ordinal encoding is performed using sklearn[8]. Ordinal encoding has the advantage that it is easy to execute and very space-efficient, given that it is one of the only encoding methods that does not add new columns to the data.

While ordinal encoding has many advantages, it also has some problems. One problem with ordinal encoding is that it implies a relationship between categories that might not be present. In the example given, the encoded data could imply that Rotterdam is twice Amsterdam, and London even more, even though this is not the case. Another problem with ordinal encoding is that not all types of algorithms work with it optimally. In research on the impact of encoding on the performance of a neural network, ordinal encoding was shown to be the worst-performing encoding method tested (Potdar, Pardawala, & Pai, 2017).

    Original column          Encoded column
        City                     City
    0   Amsterdam            0   1
    1   Rotterdam            1   2
    2   Amsterdam            2   1
    3   Rotterdam            3   2
    4   London               4   3

Figure (4) Example of a column encoded using ordinal encoding.
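A minimal sketch of this encoding with sklearn is shown below. It is only an illustration: OrdinalEncoder assigns the integers 0 to k-1 in sorted category order, so the exact codes differ from the ones in Figure 4, but the principle is the same.

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    # The small City column from Figure 4 (illustrative data).
    df = pd.DataFrame({"City": ["Amsterdam", "Rotterdam", "Amsterdam",
                                "Rotterdam", "London"]})

    # OrdinalEncoder assigns 0..k-1 in sorted category order
    # (Amsterdam=0, London=1, Rotterdam=2), so the codes differ from Figure 4,
    # but one numeric column replaces the categorical one in the same way.
    encoder = OrdinalEncoder()
    df["City_encoded"] = encoder.fit_transform(df[["City"]])[:, 0]
    print(df)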

2.4.2 One-Hot Encoding

The second encoding method used in this thesis is one-hot encoding. One-hot encoding is one of the most used encoding methods because it requires no knowledge of the data and works very well with neural networks. One-hot encoding creates a separate column for every unique category in a column. The values in the new columns are only 1 and 0, stating whether the observation is part of the category or not. In Figure 5, an example of one-hot encoding is shown: the column City turns into separate columns for Amsterdam, Rotterdam and London, respectively. The reason NNs work so well with one-hot encoding is that they can assign different weights to all the categories separately.

While one-hot encoding has many advantages, it also has some problems. The primary problem with one-hot encoding is space efficiency. One-hot encoding creates a new column for every unique category, which can lead to a very large data set, especially when the number of categories increases. The space efficiency can be improved by using sparse matrices[9], but is still not ideal. To reduce the number of columns created, a method from ten Kaate's thesis is used whereby all categories present fewer than 5 times are placed together in the category "uncommon" (ten Kaate, 2018).

Another problem with one-hot encoding is that it removes information that was present in the original data set, because it splits a column into multiple columns. When looking at the encoded data, it is not possible to determine which columns were based on the same original column. This can have a negative impact on performance when using algorithms that use this kind of information, like DTCs.

[8] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder

    Original column          Encoded columns
        City                     City A   City R   City L
    0   Amsterdam            0   1        0        0
    1   Rotterdam            1   0        1        0
    2   Amsterdam            2   1        0        0
    3   Rotterdam            3   0        1        0
    4   London               4   0        0        1

Figure (5) Example of a column encoded using one-hot encoding.
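A minimal sketch of one-hot encoding in Python is shown below. The use of pandas.get_dummies is an assumption for illustration; the thesis does not state which function was used. The grouping of rare categories into "uncommon" described above is sketched as well.

    import pandas as pd

    df = pd.DataFrame({"City": ["Amsterdam", "Rotterdam", "Amsterdam",
                                "Rotterdam", "London"]})

    # One new 0/1 column per unique category, analogous to Figure 5.
    one_hot = pd.get_dummies(df["City"], prefix="City", dtype=int)
    print(one_hot)

    # The rare-category grouping described above can be applied beforehand:
    # categories occurring fewer than 5 times become "uncommon". (In this toy
    # column every city is rare; on real data only rare categories are affected.)
    counts = df["City"].value_counts()
    rare = counts[counts < 5].index
    grouped = df["City"].where(~df["City"].isin(rare), "uncommon")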

2.5 Evaluation Metrics

Evaluation is one of the most important stages of research. Knowing the quality of the results and their implications is vital when discussing the success of the method used. In this thesis, success is based on the number of features removed as well as on the predictive power. The predictive power is a combination of the F-score and the balanced accuracy (bAcc). Precision and recall are also shown in the results of this thesis but are not used during the evaluation of the results.

Precision, recall and accuracy make use of true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn). A prediction is a true positive when it is correctly predicted positive, a true negative when it is correctly predicted negative, a false positive when it is incorrectly predicted positive and a false negative when it is incorrectly predicted negative.

\[ \mathrm{Precision} = \frac{\sum tp}{\sum tp + \sum fp} \tag{7} \]

\[ \mathrm{Recall} = \frac{\sum tp}{\sum tp + \sum fn} \tag{8} \]

\[ \mathrm{F\text{-}score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9} \]

\[ \mathrm{TrueNegativeRate} = \frac{\sum tn}{\sum tn + \sum fp} \tag{10} \]

\[ \mathrm{Accuracy} = \frac{\sum tp + \sum tn}{\sum tp + \sum fp + \sum tn + \sum fn} \tag{11} \]

\[ \mathrm{BalancedAccuracy} = \frac{\mathrm{Recall} + \mathrm{TrueNegativeRate}}{2} \tag{12} \]

Figure (6) Equations used to evaluate the results. TrueNegativeRate and Accuracy are not directly used in this research but are given as context.

The precision of a model is the fraction of all positive predictions that are correct. When an algorithm is optimized for precision, it will predict significantly more true positives than false positives. This can be very useful when developing a system where false positives are a big problem. An example could be an automatic fine system, where an algorithm determines an offence and automatically hands out fines without any human interference. The downside of precision is that it does not take the number of false-negative predictions into account. This means that such algorithms tend to predict positive less often, because they need to be very sure to do so, and can therefore produce a high number of false negatives.

The recall of a model is the fraction of all actual positives that are predicted correctly. When an algorithm is optimized for recall, it will predict significantly more true positives than false negatives. This can be very useful when developing a system where false positives are not a big problem. An example would be an algorithm used to filter data before a human looks at it to determine the further process. In this case, false positives are not a problem because the human evaluation can remove those, while the filtering is still successful because it saves the observer a significant amount of work. The downside of such algorithms is that they do not take the number of false-positive predictions into account. This means that they tend to predict positive much more often, because that increases the chance of predicting all the positives correctly.

The F-score combines the precision and the recall. The F-score behaves like a mean when the precision and recall are close together but punishes disparity between the precision and the recall.

Despite both recall and precision being capable of evaluating results, they both dismiss the importance of the true negative predictions. Accuracy incorporates the true negative predictions and could thereby give a better evaluation of the results. Accuracy does, however, have a big downside when working with unbalanced prediction targets. A prediction target is unbalanced when one class is much more present than the other. An example of an unbalanced target is the OOT data set used in this thesis, which contains many more incidents that are solved within the time limit than incidents that are not. In Figure 7, an example of a problematic situation is described. In this example, both the precision and the recall are extremely low while the accuracy is very high. This problem is caused by the fact that, without a normalizing factor, the dominant class has much more impact on the accuracy. This problem can easily be resolved by using the balanced accuracy (bAcc). bAcc works similarly to normal accuracy but normalizes over the two classes. Because not all data used in this thesis is balanced, it is vital that the evaluation methods used can deal with unbalanced data, and thus the bAcc is used.

total positives (p) = 50, total negatives (n) = 950
tp = 1, fp = 9, tn = 941, fn = 49

Precision = 1/(1 + 9) = 0.1
Recall = 1/(1 + 49) = 0.02
Accuracy = (1 + 941)/1000 = 0.942
TrueNegativeRate = 941/950 ≈ 0.99
BalancedAccuracy = (0.02 + 0.99)/2 ≈ 0.51

Figure (7) Example of the problem that can occur when using accuracy while working with unbalanced data
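The situation in Figure 7 can be verified with a short calculation following Equations 7 to 12. The sketch below is only an illustration of the metrics; in the thesis itself the sklearn model evaluation tools are used.

    def evaluation_metrics(tp, fp, tn, fn):
        """Precision, recall, F-score, accuracy and balanced accuracy (Eq. 7-12)."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_score = 2 * precision * recall / (precision + recall)
        tnr = tn / (tn + fp)
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        balanced_accuracy = (recall + tnr) / 2
        return precision, recall, f_score, accuracy, balanced_accuracy

    # Confusion counts from Figure 7: the accuracy looks high (0.942) even
    # though the balanced accuracy is barely above chance level (~0.51).
    print(evaluation_metrics(tp=1, fp=9, tn=941, fn=49))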

3 Data

In this project, four data sets are used to ensure that the methods tested are effective on different kinds of data. The data sets vary in size, number of features and ease of prediction. All data sets are preprocessed in two steps. The first step removes all columns for which more than 95% of the rows are empty. The second step removes all columns that either consist of only one unique value or consist of more than 95% unique values. These two preprocessing steps are used to reduce overfitting. A more elaborate overview of the data sets can be found in Appendix A.

3.1 Simple data set - Titanic

The first data set used is the Titanic data set[10], taken from a basic Kaggle machine learning exercise. In this exercise, the user is asked to predict whether a person would have survived based on features like sex, age or cabin. This data set is chosen because of its small size and the expected ease of solving it, given that it is a basic exercise. The Titanic data set shows whether the methods used are capable of reducing the number of features of a simple data set. The Titanic data set consists of 891 rows and 11 columns. Two columns are removed during the preprocessing phase, zero due to missing values and two due to unique values. This leaves nine usable columns. A more in-depth breakdown of the data can be found in Appendix A.1.

3.2 Big data set - WeatherAUS

The second data set used is the Australian weather data set[11], which was also taken from the Kaggle machine learning exercises. The goal of this data set is to predict whether there will be any rainfall the following day based on various weather features of the current day. This data set is chosen because it has a much larger number of rows than the Titanic data set. The weatherAUS data set shows whether the size of a data set influences the effectiveness of the method used. The weatherAUS data set consists of 142,193 rows and 24 columns. No columns are removed during the preprocessing phase, leaving 24 usable columns. A more in-depth breakdown of the data can be found in Appendix A.2.

3.3 Complex data set - Adult

The third data set used is the Adult data set[12]. This data set consists of information about various adults living in America, collected in 1996. The goal of this data set is to predict whether a person earns more or less than 50k a year. This data set is chosen because it is expected to be more complex than both the Titanic and the weatherAUS data set. The Adult data set shows whether the effectiveness of the methods used is influenced by the difficulty of a problem. The full data set consists of 48,842 rows and thereby falls right between the two previous data sets in terms of size. It has been used many times in scientific papers and is commonly known as the "Census Income" data set (Zadrozny & Bianca, 2004)(Rosset & Saharon, 2004). The version of the Adult data set used in this thesis consists of 32,561 rows and 15 columns. No columns are removed during the preprocessing phase, leaving 15 usable columns. A more in-depth breakdown of the data can be found in Appendix A.3.

[10] https://www.kaggle.com/c/titanic
[11] https://www.kaggle.com/jsphyg/weather-dataset-rattle-package
[12] https://archive.ics.uci.edu/ml/datasets/Adult

3.4 Domain data set - ABN-AMRO OOT

The fourth data set used is the ABN-AMRO OOT data set. This data set contains the first documentation of incidents in the ABN-AMRO system. The goal of this data set is to predict whether an incident will be out of time. It is hard to predict how the ABN-AMRO OOT data set compares to the other three data sets because not much prior research has been done on it. It is expected that the complexity of the problem is comparable to the Adult data set, given the results achieved by ten Kaate (2018). The ABN-AMRO OOT data set shows whether the effectiveness of the methods used is influenced by the number of features in the data set. The ABN-AMRO data set consists of 55,583 rows and 269 columns. 186 columns are removed during the preprocessing phase, 166 due to missing values and 20 due to unique values. This leaves 83 usable columns. A more in-depth breakdown of the data can be found in Appendix A.4.

4 Method

In this thesis, an attempt is made to divide the set of features into a subset of important and a subset of unimportant features. This is done using three different methods. The first method is based on the WoE. The IV of every feature is determined, as explained in Subsection 2.1. The selection of important features is made using a threshold: every feature with a total IV higher than 0.05 is classified as important, while all other features are classified as unimportant. The threshold was determined based on initial experimental exploration. The first experiments were done with a threshold value of 0.02, the upper bound of the "unpredictive" range in Table 1. This threshold, however, resulted in too many features being classified as important that did not seem important, because they could be removed without a significant drop in predictive power. The next experiments were done with a threshold value of 0.1, the upper bound of the "weak" range in Table 1. This threshold, however, resulted in a significant drop in predictive power. The threshold used in this thesis lies between these two values, at 0.05, and resulted in the highest number of removed features without a significant drop in predictive power.

The second method is based on XGBoost[13]. XGBoost is used to determine the importance of every feature using the method explained in Subsubsection 2.2.3. The selection of important features is again made using a threshold: every feature with an importance value below 0.01 is removed. This threshold was also determined based on initial experimental exploration. Initially, the same threshold value as for the WoE was used, but this resulted in a drop in predictive power. After multiple experiments, a threshold value of 0.01 was shown to be the most reliable.

The final method is a combination of the two methods mentioned above. First, a selection is made using the IV-based method. The XGBoost-based method is then applied to the same data set consisting of only the selected features. Both WoE and XGBoost have potential problems, as discussed in Subsubsection 2.1.4 and Subsubsection 2.2.4, respectively. The combination of the two methods would ideally utilize the advantages of both methods and dispose of the disadvantages.
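A sketch of the combined selection, under the assumption that the hypothetical helpers feature_iv and select_by_importance from the sketches in Subsection 2.1 and Subsubsection 2.2.3 are available, could look as follows. The thresholds of 0.05 and 0.01 are the values described above; everything else is illustrative.

    def combined_selection(df, candidates, target):
        """Combined WoE/XGBoost selection (a sketch of the third method)."""
        # Step 1: IV-based pre-selection with the 0.05 threshold.
        iv_selected = [f for f in candidates if feature_iv(df, f, target) > 0.05]

        # Step 2: XGBoost-based selection on the reduced, numerically encoded
        # feature set, with the 0.01 importance threshold.
        important, _ = select_by_importance(df[iv_selected], df[target],
                                            threshold=0.01)

        unimportant = [f for f in candidates if f not in important]
        return important, unimportant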

The evaluation is done using neural networks, which are built using TensorFlow[14] and Keras[15], with Python[16] as the programming language. The neural networks used are of limited complexity, consisting of an input layer, one hidden layer and an output layer. The size of the input layer is based on the input size, the size of the hidden layer is 128, and the size of the output layer is 2 to predict binary values. The activation function of the hidden layer is ReLU[17] and the activation function of the output layer is Softmax[18]. The optimizer used to train the networks is the Adam optimizer[19] with default parameter values. All neural networks are trained for ten epochs with a 30% validation split.

[13] https://xgboost.readthedocs.io/en/latest/
[14] https://www.tensorflow.org/

Figure (8) Neural network used in this thesis
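A minimal Keras sketch of the network described above is given below. The layer sizes, activations, optimizer, number of epochs and validation split follow the text; the loss function (sparse categorical cross-entropy) and the exact input handling are assumptions.

    import tensorflow as tf

    def build_and_train(X_train, y_train, n_features):
        """Sketch of the evaluation network: 128-unit ReLU hidden layer,
        2-unit softmax output, Adam with default parameters, ten epochs,
        30% validation split."""
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(n_features,)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(2, activation="softmax"),
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(),
                      loss="sparse_categorical_crossentropy",  # assumed loss
                      metrics=["accuracy"])
        model.fit(X_train, y_train, epochs=10, validation_split=0.3)
        return model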

In total, seven different neural networks are trained: one using all features, which is used as a reference, and, for each of the three reduction methods, one based on the important features and one based on the unimportant features. The results of the experiments consist of two parts. The first part is the reduction power of the split, meaning how many features are removed and how much this improves the training speed. The expectation is that the more features are removed, the more the speed improves. The second part of the results is the quality evaluation. Removing features can only be positive when it does not have a significant negative impact on the predictive power of the system. The quality is evaluated using the F-score and the bAcc. In this thesis, a significant drop in predictive power is defined as a drop of more than 0.05 in either F-score or bAcc. All evaluations are made using the sklearn model evaluation tools[20] in Python, except for the F-score, which is calculated using Equation 9. In an ideal result, the predictive power of the network trained on the selected features is not significantly worse than that of the network trained on all features, while the predictive power of the network trained on the removed features is.

[15] https://keras.io/
[16] https://www.python.org/
[17] https://www.tensorflow.org/api_docs/python/tf/nn/relu
[18] https://www.tensorflow.org/api_docs/python/tf/nn/softmax
[19] https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer

5 Results

Below are the results gathered during testing. The predictive power is based on both the balanced accuracy and the F-score. All networks are trained for ten epochs, and line graphs of the accuracy over time can be found in Appendix B. Appendix B also shows bar plots with the importance of each feature.

5.1 Simple data set - Titanic

The best results for the Titanic data set were obtained using either the IV-based or the combined method. The best result reduces the number of columns by 56% and the training time by 21% without a significant drop in predictive power. Note that XGBoost classified all features as important and thus resulted in no reduction. A more in-depth breakdown of the results and the training process can be found in Appendix B.1.

    Method                  Features  OHcolumns  Speed  Precision  Recall  F-score  bAcc
    Reference               9         1195       1.18   0.73       0.73    0.73     0.79
    IV        Important     4         404        0.93   0.81       0.67    0.73     0.78
              Unimportant   5         791        1.05   0.65       0.37    0.47     0.62
    Tree      Important     9         1195       1.21   0.73       0.73    0.73     0.79
              Unimportant   X         X          X      X          X       X        X
    Combined  Important     4         404        0.92   0.77       0.72    0.75     0.79
              Unimportant   5         791        1.08   0.6        0.43    0.50     0.61

Figure (9) Results for the Titanic data set. X means that the test has not been done because there are 0 columns and thus no data. OHcolumns is the number of columns present after one-hot encoding.

5.2 Big data set - WeatherAUS

The best results for the weatherAUS data set were obtained using either the XGBoost-based or the combined method. The best result reduces the number of columns by 95% and the training time by 60% without a significant drop in predictive power. A more in-depth breakdown of the results and the training process can be found in Appendix B.2.

    Method                  Features  OHcolumns  Speed  Precision  Recall  F-score  bAcc
    Reference               23        2280       271    1          1       1        1
    IV        Important     7         337        130    1          1       1        1
              Unimportant   16        1943       261    0.57       0.48    0.52     0.69
    Tree      Important     1         63         106    1          1       1        1
              Unimportant   22        2217       282    0.64       0.53    0.58     0.72
    Combined  Important     1         64         107    1          1       1        1
              Unimportant   22        2217       279    0.64       0.53    0.58     0.72

Figure (10) Results for the weatherAUS data set. OHcolumns is the number of columns present after one-hot encoding.

5.3 Complex data set - Adult

The best results for the Adult data set were obtained using either the IV-based or the combined method. The best result reduces the number of columns by 29% and the training time by 3% without a significant drop in predictive power. The unimportant features from both the XGBoost-based method and the combined method result in a precision and recall of 0.0. This is caused by all rows being predicted as False, which in turn is caused by the lack of useful information in the given data. A more in-depth breakdown of the results and the training process can be found in Appendix B.3.

    Method                  Features  OHcolumns  Speed  Precision  Recall  F-score  bAcc
    Reference               14        101        31     0.68       0.31    0.43     0.63
    IV        Important     10        64         30     0.67       0.43    0.52     0.68
              Unimportant   4         37         24     0.53       0.11    0.18     0.54
    Tree      Important     12        82         23     0.69       0.32    0.44     0.64
              Unimportant   2         19         25     0.0        0.0     X        0.5
    Combined  Important     10        64         35     0.6        0.57    0.59     0.72
              Unimportant   4         37         32     0.0        0.0     X        0.5

Figure (11) Results for the Adult data set. It is not possible to calculate the F-score when both the precision and recall have a value of 0.0, so X is written as the value. OHcolumns is the number of columns present after one-hot encoding.

5.4 Domain data set - ABN-AMRO OOT

The best results for the ABN-AMRO data set were obtained using the combined method. The best result reduces the number of columns by 76% and the training time by 93% without a significant drop in predictive power. A more in-depth breakdown of the results and the training process can be found in Appendix B.4.

    Method                  Features  OHcolumns  Speed  Precision  Recall  F-score  bAcc
    Reference               83        16202      1770   0.74       0.77    0.76     0.86
    IV        Important     30        6319       212    0.8        0.68    0.74     0.83
              Unimportant   53        9883       1132   0.47       0.45    0.46     0.68
    Tree      Important     30        5649       623    0.76       0.76    0.76     0.86
              Unimportant   53        10553      610    0.5        0.38    0.43     0.66
    Combined  Important     20        3705       132    0.79       0.70    0.74     0.83
              Unimportant   63        12497      1493   0.52       0.51    0.52     0.71

Figure (12) Results for the ABN-AMRO data set. OHcolumns is the number of columns present after one-hot encoding.

5.5 Analysis

The IV-based method is able to reduce the number of features of all the data sets used in this thesis. The IV-based method does, however, perform significantly worse than both the XGBoost-based and the combined method on the weatherAUS data set. This is caused by the second potential problem of the IV-based approach discussed in Subsubsection 2.1.4. In the weatherAUS data set, only one feature is required to predict the target 100% correctly, which means that all other features can be removed. This does not mean that those features are useless for predicting, but that they do not add anything to the information already present in the most important feature. These features are nevertheless classified as important by the IV-based method, because they could be used for prediction by themselves, even though they contribute little compared to the best feature.

The XGBoost-based method can reduce the number of features of three of the four data sets used in this thesis. XGBoost is, however, not able to reduce the number of features of the Titanic data set. This is possibly caused by the first potential problem discussed in Subsubsection 2.2.4. When comparing the list of feature importances to the one produced by the IV-based method, it is notable that the most important features are the same. The problem is that, while all features classified as unimportant by the IV-based method are also the lowest-scoring features in the XGBoost-based method, their values are too high for them to be classified as unimportant. This could be resolved by increasing the threshold, but that could also increase the possibility of a significant drop in predictive power.

The combined method can reduce the number of features of all the data sets used in this thesis. In all four data sets, the combined method was either one of the best methods or the single best method to use. The results suggest that the combined method has the advantages of both methods while removing the disadvantages. The ABN-AMRO OOT data set is the one data set where the combined method proved to be better than either of the two methods. This is most likely caused by the lower number of features present when executing XGBoost: fewer unimportant features lower the impact of the second potential problem discussed in Subsubsection 2.2.4. This is an excellent example of the advantage of using the combined method. Noteworthy is the fact that in both the Adult data set and the ABN-AMRO OOT data set, either the precision decreases and the recall increases, or vice versa. No mention of this effect was found in the literature used for this thesis.

6 Conclusion

Based on the results discussed in Subsection 5.5, it is possible to answer the research question proposed in Subsection 1.1. Before the research question can be answered, all four subquestions have to be answered.

The first subquestion is ”Can Weight of Evidence be used to reduce the number of features of a data set without a significant drop in predictive power when solving a binary classification problem?”. The results indicate that Weight of Evidence is indeed able to reduce the number of features without a significant drop in predictive power. The WoE based method resulted in a reduction of the number of features in every data set used in this thesis. The reduction of the adult data set even resulted in a small increase in the balanced accuracy score.

The second subquestion is ”Can XGBoost be used to reduce the number of features of a data set without a significant drop in predictive power when solving a binary classification problem?”. The results indicate that XGBoost is indeed able to reduce the number of features without a significant drop in predictive power. The XGBoost based method resulted in a reduction of the number of features in three of the four data sets used. The XGBoost based method was, however, not able to reduce the number of features of the Titanic data set.

The third subquestion is ”Does Weight of Evidence or XGBoost perform better when reducing the number of features of a data set without a significant drop in predictive power when solving a binary classification problem?”. Neither of the two methods consistently performs better than the other. In both the weatherAUS data set and the ABN-AMRO OOT data set, the XGBoost based method performed better than the WoE based method. In both the Titanic data set and the Adult data set, the WoE based method performed better than the XGBoost based method. Noteworthy, however, is the fact that the XGBoost based method was not able to reduce the Titanic data set at all, which might indicate that the WoE based method is more reliable.

The fourth subquestion is ”Does a combination of Weight of Evidence and XGBoost perform better when reducing the number of features of a data set without a significant drop in predictive power when solving a binary classification problem than one of them by themselves?”. The results show that in all four data sets tested in this thesis, the combined method removed the highest number of features without a significant drop in predictive power. Besides being more consistent than either method by itself, the combined method also performs better on the ABN-AMRO OOT data set. The ABN-AMRO OOT data set is an excellent example of the two methods using their advantages to resolve each other's disadvantages.

The overall research question is ”Can a combination of Weight of Evidence and XGBoost be used to reduce the number of features of a data set without a significant drop in predictive power when solving a binary classification problem?”. The results indicate that a combination of WoE and XGBoost can reduce the number of features, and does so better and more consistently than either of the two methods separately. There are no indications of difficulties with data set complexity or size, which suggests that the method can be used for all binary classification problems.

7 Discussion & Future Research

7.1 Discussion

Even though the method used in this thesis has proven to be very promising, three remarks are identified. The first remark is that currently, when two features contain the same information, only one of the two is chosen by XGBoost to be of high importance while the other feature is deemed redundant. This could be problematic when using the method purely for readability reasons.

The second remark is about the neural network that is used to evaluate the selected features. This neural network is always built with a hidden layer and an output layer of fixed size, while the size of the first layer varies with the size of the input. This is done to make a fair comparison between the different neural networks, but it can be argued that a transition from a larger input size is more difficult and should require either more hidden layers or a larger hidden layer. This could mean that the selected data has a slight edge due to its smaller input size. The results on the removed features, however, do not suggest this factor to be of significant impact: those features also have a much smaller input size, but still performed much worse than the whole data set.
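As a minimal sketch of this evaluation set-up (the layer size of 64 and the use of scikit-learn's MLPClassifier are placeholders, not the thesis configuration): the hidden layer stays fixed, and only the inferred input layer changes when features are removed.

```python
from sklearn.neural_network import MLPClassifier

def build_evaluation_net():
    # The hidden layer size stays constant across experiments; the input layer is
    # inferred from whichever (full or reduced) feature matrix is passed to fit().
    return MLPClassifier(hidden_layer_sizes=(64,), max_iter=200)
```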

The third remark is about the thresholds chosen for both the XGBoost and the Weight of Evidence. Both thresholds were primarily chosen because they provided the best results during initial experimentation. While these thresholds seem to work well, it is possible that better values exist. A possible improvement to the system would be to adapt the way features are selected from the importance scores to the needs of the project. Increasing the thresholds would remove more redundant features and thus leave only the most important ones, but would cause a loss in predictive power. For some projects, a loss in predictive power is much less detrimental, because readability is the more important goal. Another way of classifying features could be to choose the top X features. This method has the obvious downside that it requires some knowledge about the data in order to choose the number of features to return. Picking the wrong number of features could result in either removing important features, because too few features are returned, or classifying unimportant features as important, because too many features are returned.
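A sketch of this alternative ”top X” selection; the value of 20 is an arbitrary example, not a recommendation.

```python
def select_top_k(scores, k=20):
    """scores: a pandas Series of importance values indexed by feature name."""
    return scores.sort_values(ascending=False).head(k).index.tolist()
```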

7.2 Future Research

Aside from the methods of determining thresholds, there is more future research possible. The first possible continuation is to vary the different parameters that can be defined when using XGBoost. In this project, the basic parameters from the Python packages were used, but improvements in both execution and results could be made by optimizing these parameters. A similar continuation could be done exploring the parameters of the NNs. The second possible continuation is an exploration of other types of tree-based algorithms like AdaBoost, Microsoft's LightGBM21 or even basic single tree algorithms. A similar continuation could be done exploring alternatives for the Weight of Evidence algorithm, such as Chi-squared or Gini22.
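As an indication of how small such a change could be, the sketch below swaps the XGBoost classifier for LightGBM's scikit-learn wrapper. This is a hypothetical substitution, not part of the thesis; note that LGBMClassifier reports raw split counts as importances by default, so any threshold would need to be re-tuned.

```python
from lightgbm import LGBMClassifier

def lgbm_importances(X, y):
    """Fit LightGBM with default parameters and return its per-feature importance scores."""
    model = LGBMClassifier()
    model.fit(X, y)
    return dict(zip(X.columns, model.feature_importances_))
```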

A third possible continuation is an expansion towards all types of classification, as opposed to only binary classification. XGBoost is already capable of all types of classification, but the IV based method would have to be altered to account for non-binary classification.

7.3 Recommendations for ABN-AMRO

For ABN-AMRO, the following options are given for the usage of this method. The first option is to analyze the features that are classified as important. This can be done by looking for either correlations or for exceptional values that are either much more frequent OOT or much less frequent OOT. This analysis could also include predicting OOT based on the data, but without the feature being analyzed. The second option is to analyze the differences between periods. In this thesis, all incidents between December 2018 and June 2019 are put into one data set, but this could be split into 6 subsets where each subset contains the data of a single month.
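A minimal sketch of that per-month split; the date column name used here is hypothetical, the actual ABN-AMRO field name will differ.

```python
import pandas as pd

def split_by_month(df: pd.DataFrame, date_column: str = "opened_at") -> dict:
    """Return one sub-DataFrame per calendar month, keyed by e.g. '2019-01'."""
    months = pd.to_datetime(df[date_column]).dt.to_period("M")
    return {str(month): group for month, group in df.groupby(months)}
```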

21 https://github.com/microsoft/LightGBM

22 A number of different methods can be found at http://documentation.statsoft.com/portals/0/


8 Acknowledgements

I would like to thank ABN-AMRO for the use of their time, data and expertise. I am especially grateful to Monique Gerrits, Ronald van der Veen and Paul ten Kaate for their time and support within ABN-AMRO. I would also like to thank all my fellow students for their feedback and collaboration.

References

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. doi: 10.1023/A:1010933404324

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). New York, NY, USA: ACM. doi: 10.1145/2939672.2939785

Genuer, R., Poggi, J.-M., & Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14), 2225–2236. doi: 10.1016/j.patrec.2010.03.014

Guoping, Z. (2014). A necessary condition for a good binning algorithm in credit scoring. Applied Mathematical Sciences, 8, 3229–3242. doi: 10.12988/ams.2014.44300

Knigge, D. (2019). Event correlation and dependency-graph analysis to support root cause analysis in ITSM environments (Bachelor's Thesis). University of Amsterdam.

Lawrence, S., Giles, C. L., & Tsoi, A. C. (1997). Lessons in neural network training: Overfitting may be harder than expected.

Lin, A. Z., & Hsieh, T.-Y. (2014). Expanding the use of weight of evidence and information value to continuous dependent variables for variable reduction and scorecard development. SESUG 2014.

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115–133. doi: 10.1007/BF02478259

Potdar, K., Pardawala, T., & Pai, C. (2017). A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175, 7–9. doi: 10.5120/ijca2017915495

Riemersma, R. (2019). Predicting incident duration time (Bachelor's Thesis). University of Amsterdam.

Rosset, S. (2004). Model selection via the AUC. In Proceedings of the Twenty-First International Conference on Machine Learning (p. 89). New York, NY, USA: ACM. doi: 10.1145/1015330.1015400

Schapire, R. E. (2013). Explaining AdaBoost. In B. Schölkopf, Z. Luo, & V. Vovk (Eds.), Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik (pp. 37–52). Berlin, Heidelberg: Springer. doi: 10.1007/978-3-642-41136-6_5

Schmidhuber, J. (2014). Deep learning in neural networks: An overview. CoRR, abs/1404.7828. Retrieved from http://arxiv.org/abs/1404.7828

Scott, D. (2001). NSM: Often the weakest link in business availability. Gartner Group AV-13-9472.

Su, J., & Zhang, H. (2006). A fast decision tree learning algorithm. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1 (pp. 500–505). AAAI Press.

ten Kaate, P. (2018). Automatic detection, diagnosis and mitigation of incidents in multi-system environments (Bachelor's Thesis). University of Amsterdam.

Velez, M. (2019). Predicting causal relations between ITSM incidents and changes (Bachelor's Thesis). University of Amsterdam.

Wang, C., Kavulya, S., Tan, J., Hu, L., Kutare, M., Kasick, M., . . . Gandhi, R. (2013). Performance troubleshooting in data centers: An annotated bibliography? ACM SIGOPS Operating Systems Review, 47, 50–62. doi: 10.1145/2553070.2553079

Weed, D. L. (2005). Weight of evidence: A review of concept and methods. Risk Analysis, 25(6).

Widrow, B., Rumelhart, D. E., & Lehr, M. A. (1994). Neural networks: Applications in industry, business and science. Communications of the ACM, 37(3), 93–105. doi: 10.1145/175247.175257

Wiggerman, M. (2019). Predicting the first assignment group for a smooth incident resolution process (Bachelor's Thesis). University of Amsterdam.

Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004). doi: 10.1145/1015330.1015425


Appendices

A Data clarification

A.1 Simple data set - Titanic

In the Titanic data set, two features were removed during preprocessing. A feature is removed because it has: above 95% unique values, only 1 unique value, or above 95% missing values. A minimal sketch of this preprocessing filter is given below Table 2.

feature name  missing  missing%  unique  unique%
PassengerId   0        0.0       891     100.0
Survived      0        0.0       2       0.22
Pclass        0        0.0       3       0.34
Name          0        0.0       891     100.0
Sex           0        0.0       2       0.22
Age           177      19.87     89      9.99
SibSp         0        0.0       7       0.79
Parch         0        0.0       7       0.79
Ticket        0        0.0       681     76.43
Fare          0        0.0       248     27.83
Cabin         687      77.1      148     16.61
Embarked      2        0.22      4       0.45

Table (2) All features in the Titanic data set. Colored rows show the feature was removed during preprocessing.
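A minimal sketch of the preprocessing filter described by the criteria above; the 95% limits are taken from the text, while the implementation itself is an assumption.

```python
import pandas as pd

def drop_uninformative(df: pd.DataFrame, max_missing: float = 0.95,
                       max_unique: float = 0.95) -> pd.DataFrame:
    """Drop features that are above 95% missing, above 95% unique, or constant."""
    keep = []
    for col in df.columns:
        missing_frac = df[col].isna().mean()
        n_unique = df[col].nunique(dropna=True)
        unique_frac = n_unique / len(df)
        if missing_frac > max_missing or unique_frac > max_unique or n_unique <= 1:
            continue    # removed, like the coloured rows in the tables of this appendix
        keep.append(col)
    return df[keep]
```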


A.2 Big data set - WeatherAUS

The weatherAUS data set consists of 24 features and none were removed during preprocessing.

feature name   missing  missing%  unique  unique%
Date           0        0.0       3436    2.42
Location       0        0.0       49      0.03
MinTemp        637      0.45      390     0.27
MaxTemp        322      0.23      506     0.36
Rainfall       1406     0.99      680     0.48
Evaporation    60843    42.79     357     0.25
Sunshine       67816    47.69     146     0.1
WindGustDir    9330     6.56      17      0.01
WindGustSpeed  9270     6.52      68      0.05
WindDir9am     10013    7.04      17      0.01
WindDir3pm     3778     2.66      17      0.01
WindSpeed9am   1348     0.95      44      0.03
WindSpeed3pm   2630     1.85      45      0.03
Humidity9am    1774     1.25      102     0.07
Humidity3pm    3610     2.54      102     0.07
Pressure9am    14014    9.86      547     0.38
Pressure3pm    13981    9.83      550     0.39
Cloud9am       53657    37.74     11      0.01
Cloud3pm       57094    40.15     11      0.01
Temp9am        904      0.64      441     0.31
Temp3pm        2726     1.92      501     0.35
RainToday      1406     0.99      3       0.0
RISK MM        0        0.0       681     0.48
RainTomorrow   0        0.0       2       0.0


A.3 Complex data set - Adult

The adult data set consists of 14 features and none were removed during preprocessing.

feature name    missing  missing%  unique  unique%
age             0        0.0       73      0.22
workclass       0        0.0       9       0.03
fnlwgt          0        0.0       21648   66.48
education       0        0.0       16      0.05
education-num   0        0.0       16      0.05
marital-status  0        0.0       7       0.02
occupation      0        0.0       15      0.05
relationship    0        0.0       6       0.02
race            0        0.0       5       0.02
sex             0        0.0       2       0.01
capital-gain    0        0.0       119     0.37
capital-loss    0        0.0       92      0.28
hours-per-week  0        0.0       94      0.29
native-country  0        0.0       42      0.13


A.4 Domain data set - ABN-AMRO OOT

The ABN-AMRO OOT data set consists of 269 features, of which 186 were removed during preprocessing. A feature is removed because it has: above 95% unique values, only 1 unique value, or above 95% missing values.

feature name                      missing  missing%  unique  unique%
ACTION                            55583    100.0     1       0.0
Avail                             55583    100.0     1       0.0
Average                           55570    99.98     9       0.02
COMMAND                           55583    100.0     1       0.0
CPU                               55583    100.0     1       0.0
CollateralID                      55583    100.0     1       0.0
CompanyName                       55583    100.0     1       0.0
Description                       55582    100.0     2       0.0
Device                            55583    100.0     1       0.0
Duration                          55583    100.0     1       0.0
ERR                               55570    99.98     6       0.01
Email                             55583    100.0     1       0.0
FirstName                         55583    100.0     1       0.0
Group                             55487    99.83     5       0.01
IpAddress                         55583    100.0     1       0.0
IsActive                          55583    100.0     1       0.0
LastLoginDate                     55583    100.0     1       0.0
LastName                          55583    100.0     1       0.0
Maximum                           55570    99.98     11      0.02
Message                           55581    100.0     3       0.01
Minimum                           55570    99.98     10      0.02
MobilePhone                       55583    100.0     1       0.0
Name                              55583    100.0     1       0.0
Opened by                         55583    100.0     1       0.0
PID                               55582    100.0     2       0.0
Port                              55582    100.0     2       0.0
PostalCode                        55583    100.0     1       0.0
Profile.PermissionsApiEnabled     55583    100.0     1       0.0
Profile.PermissionsModifyAllData  55583    100.0     1       0.0
Profile.PermissionsViewSetup      55583    100.0     1       0.0

Table (5) Part of the features in the ABN-AMRO OOT data set. Colored rows show that the feature was removed during preprocessing.
