
Amsterdam School of Economics

Faculty of Economics and Business

A COMPARISON OF VARIOUS FRAUD DETECTION

METHODS

Using outlier and imbalanced classification learning methods to

expose credit card fraud

Jeroen Korver

10722416

MSc in Econometrics

Track: Big Data Business Analytics
Date of final version: August 15, 2018
Supervisor: Noud van Giersbergen
Second reader: Marco van der Leij

Abstract

This paper intends to give a comparison of multiple fraud detection methods, both in computing time and in performance. Due to a large imbalance in the dataset, the performance of the methods deteriorates. The objective of this paper is to conclude which adaptations to the class imbalance problem increase the accuracy of the classification, to an extent that the computational power required to increase the accuracy stays minimal. The analysis finds that pre-processing using SMOTE, as well as adapting the AdaBoost, RandomForest and SVM all increase their


Statement of originality

This document is written by Student Jeroen Korver (10722416) who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


1 Introduction

Algorithms are increasingly capable of taking over work in the health care sector. A prime example of algorithm use in health care is the study by Hoorn & Winter (2017), in which they say: “A robot a day keeps the doctor away” (Hoorn & Winter, 2017, p.16). In the survey conducted by Hoorn & Winter (2017), good or bad news was delivered to patients by a robot. They came to the conclusion that participants found the robot’s message superior to the same message as received from humans. The robots are: “Involving, ethical, skilled, and people wanted to consult them again” (Hoorn & Winter, 2017, p.2). This is just one of the many applications in which robots or algorithms may take over work in the health care sector. Another example of algorithm involvement in health care is an online competition in which heartbeats are analysed by letting a computer listen to different hearts through a digital stethoscope (Kaggle, 2017). Teams train algorithms to identify a heartbeat as either anomalous or regular; after this training, new heartbeat sounds can be classified and diagnosed. As only a small fraction of all hearts contains anomalies, the algorithm may find it difficult to correctly classify the heartbeats. For example, if only 1 in 1000 heartbeats is anomalous, the algorithm might simply always classify the heartbeats as regular, with an accuracy of 99.9%. This paper intends to give a comparison and evaluation of different imbalanced outlier or fraud detection methods. How various generalised outlier detection methods compare to data pre-processing techniques and/or imbalanced-data-specific methods is therefore the main theme of this paper.

Many real-world datasets involve modelling of extremely imbalanced distributions of the target variable (Branco et al., 2016, p.31). These imbalances can correspond to events which are highly relevant for end users, e.g., fraud detection, disease detection, catastrophe anticipation, etc. More sophisticated outlier detection methods can improve scientific research into all of these topics, which is beneficial for many applications, e.g., tax fraud. This paper is limited to comparing two possible solutions for skewed datasets, namely: data pre-processing, after which normal outlier detection methods can be used, and special-purpose learning methods, which specifically focus on skewed datasets. How the various (un-)supervised outlier analysis methods rank on the transaction data is the main theme of this paper. In order to do so, several sub-themes are addressed. The standard econometric technique for predicting a binary dependent variable, the logit model, is described. Two data pre-processing methods, SMOTE and CBO, are evaluated and compared. Similarly, two regular unsupervised outlier detection methods are treated, as are four supervised learning algorithms. Finally, three special-purpose learning methods are treated and compared in terms of performance.

Credit card fraud data always contain a low number of frauds, where most transactions are just regular. Hence, such a dataset is used for evaluating the algorithms. First, two general outlier detection methods are applied to the unprocessed data: the approaches of Liu et al. (2008) and Breunig et al. (2000), which are the Isolation Forest and the Local Outlier Factor (LOF) respectively. In order to find a better model specification for the fraud detection, two data pre-processing techniques are also discussed: the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al., 2002) and the Clustered-Based Over-sampling (CBO) algorithm (Jo & Japkowicz, 2004). After these pre-processing techniques have been used, the most extreme skewness in the class distribution is taken away, which leaves room for the general classification algorithms to work with this data. The regular classification algorithms used in this paper are: k-Nearest Neighbours (Barandela et al., 2003), AdaBoost (Schapire & Singer, 1997), Support Vector Machines (Cortes & Vapnik, 1995) and Random Forest (Breiman, 2001). These outcomes are matched with the outcomes from the special-purpose learning methods (SPLM): Weighted Random Forest (WRF) (Chen et al., 2004), the AdaBoost variants AdaC1 and AdaC2 (Sun et al., 2007) and RareBoost (Joshi et al., 2001), and finally Fuzzy-Support Vector Machines for Class Imbalanced Data (FSVM-CIL) (Batuwita & Palade, 2010). All these algorithms are also compared to the standard econometric technique for estimating a binary dependent variable, namely the logit model.

This paper starts off by explaining all the data pre-processing techniques and algorithms mentioned above, including the logit model, in Chapter 2. Next, the study design is discussed in Chapter 3, whereafter Chapter 4 gives the results and analysis of the study. Finally, a conclusion is drawn in Chapter 5.

2 Theoretical Framework

As stated in the introduction above, this chapter explains the various data pre-processing techniques, algorithms and special-purpose learning methods which are used in the literature on imbalanced outlier detection. Four possible solutions for working with imbalanced data are suggested in Branco et al. (2016): data pre-processing, special-purpose learning methods, prediction post-processing and finally a hybrid method. This paper looks at two out of these four possible solutions, namely data pre-processing and special-purpose learning methods. Only these two techniques are discussed due to limited computational power and other current research in this field. Firstly, the logit model is described in 2.1 and the performance measures in 2.2. The general classification methods are discussed in 2.3, whereafter the general outlier detection methods are examined in 2.4. The data pre-processing techniques are treated in 2.5 and finally the SPLM are considered in 2.6.


2.1 Logit Model

Cameron & Trivedi (2005) offer an extensive analysis of the logit model. They state that the logit model specifies

\[
p = \Lambda(x'\beta) = \frac{e^{x'\beta}}{1 + e^{x'\beta}}, \tag{2.1}
\]

where \(\Lambda(\cdot)\) is the logistic cdf, which has the following form: \(\Lambda(z) = e^z/(1 + e^z)\). Then, the first-order MLE conditions simplify to

\[
\sum_{i=1}^{N} \bigl( y_i - \Lambda(x_i'\beta) \bigr) x_i = 0, \tag{2.2}
\]

due to the fact that \(\Lambda'(z) = \Lambda(z)[1 - \Lambda(z)]\). As the model includes an intercept, Cameron & Trivedi (2005) state that the residuals sum to zero. The literature most commonly suggests interpreting the coefficients in terms of their marginal effects on the odds ratio, which for the logit model is:

\[
\frac{p}{1-p} = \exp(x'\beta). \tag{2.3}
\]

From (2.3), it follows that a logit slope parameter of 0.1 means that a one-unit increase in the regressor multiplies the initial odds by \(\exp(0.1) \approx 1.105\). Thus, a one-unit increase in the regressor increases the initial odds by 10.5%.
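As a small illustration of this odds interpretation, the sketch below fits a logit model on synthetic data and converts the estimated coefficients into odds multipliers as in (2.3). The use of scikit-learn, the toy regressors and the slope values are assumptions made for illustration only; this is not the estimation code used in the thesis.

```python
# Hedged sketch: logit fit on toy data, coefficients read as odds multipliers.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                    # two hypothetical regressors
eta = 0.1 * X[:, 0] - 0.5 * X[:, 1]               # assumed true index x'beta
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))   # Bernoulli draws with logistic cdf

logit = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)  # large C ~ no penalty
odds_multipliers = np.exp(logit.coef_[0])         # exp(beta_j): effect on the odds, see (2.3)
print(odds_multipliers)                           # roughly [exp(0.1), exp(-0.5)]
```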

2.2 Performance measures

Before the actual algorithms are discussed, the performance measures of these algorithms are introduced first, such that the algorithms can be compared and evaluated.


Figure 1: Confusion Matrix

There are six commonly used performance measures for binary classification. The generally accepted performance measure for a classification problem is the predictive accuracy:

\[
\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{2.4}
\]

Here, TP is the number of frauds correctly specified, TN the number of non-frauds correctly specified, FP the number of actual non-fraudulent cases classified as fraud and FN the number of frauds classified as non-fraudulent cases. However, in the context of an imbalanced dataset, the accuracy most certainly gives a misleading picture, due to the simple solution of assigning all the data to one class, as in the example in the introduction of this paper. A two-class prediction problem with a skewness of 1:1000 in one class and 999:1000 in the other already gives an algorithm an accuracy of 99.9% when it simply assigns all instances to the majority class. Therefore, other appropriate measures have been constructed. The second most common performance measure is the Receiver Operating Characteristic (ROC) introduced by Swets (1988), which summarises classifier performance over a range of different tradeoffs between true positive and false positive errors.


The ROC represents a family of decision boundaries for the relative costs of TP and FP. The X-axis of the ROC shows the percentage of false positives, while the Y-axis shows the percentage of true positives. (0, 100) would be the ideal point, as this means that all positive datapoints are correctly specified without any negative datapoints specified as positive. In Appendix A, an illustration of an ROC curve from Chawla et al. (2002) is given. The Area Under the ROC Curve (AUC) is an effective measure of model performance, as it is independent of the decision criterion and prior probabilities (Chawla et al., 2002).

However, the AUC also takes into account the number of correctly specified non-fraud cases. Therefore, other effective metrics for fraud detection are used as well, namely recall, precision and the harmonic mean between these two measures. These are independent of the ROC or accuracy, and their formulas are as follows.

\[
\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2.5}
\]

These metrics focus on predicting the actual frauds rather than on classifying all instances, including non-frauds, correctly. The harmonic mean between recall and precision, or F1-measure, is one of the metrics used for model performance in this paper, as it gives a good indication of how well the model is able to detect the fraudulent cases.

Next to the F1-measure, Cohen's Kappa is also used for performance monitoring (Cohen, 1960). Cohen's Kappa measures the agreement between the categorical items, while also taking into account the possibility of a random agreement between the items. Therefore, Cohen's Kappa provides a more robust measure of performance than measures like accuracy. This is especially useful for imbalanced datasets: simply classifying all points to the majority class will most likely still yield a high accuracy, but Cohen's Kappa adjusts for this. It measures the observed accuracy, but corrects it for the hypothetical probability of random agreement. The formulas for Cohen's Kappa can be found in Appendix B under (6.2).

The final metric used for monitoring model performance is Matthews correlation coefficient, introduced by Brian W. Matthews (1975) as a measure that can be used even if the classes are of very different sizes. It takes a value in [−1, 1], where +1 is the best possible score. It is claimed to be the most informative score in binary classification for summarising the quality of a confusion matrix (Chicco, 2017). Matthews correlation coefficient (MCC) is calculated as follows:

\[
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{2.6}
\]

Thus, this paper provides six different metrics to establish the quality of the binary classification predictor. Of these metrics, the MCC is used as the primary performance metric, as it provides the most stable performance measure. This is due to the fact that the MCC takes the balance ratios of all four confusion matrix categories into account (Chicco, 2017). In addition, Cohen's Kappa is used to measure the performance of the models.
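To make the six measures concrete, the sketch below computes them for a small made-up set of predictions. The scikit-learn metric functions and the toy labels are assumptions for illustration; they are not the evaluation code of this paper.

```python
# Hedged sketch: the six performance measures of Section 2.2 on toy predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             cohen_kappa_score, matthews_corrcoef, roc_auc_score)

y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]             # 1 = fraud, 0 = regular
y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]             # hard class predictions
y_score = [.1, .2, .1, .3, .2, .1, .4, .6, .9, .45]  # scores used for the ROC/AUC

print("accuracy ", accuracy_score(y_true, y_pred))
print("AUC      ", roc_auc_score(y_true, y_score))
print("recall   ", recall_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))
print("F1       ", f1_score(y_true, y_pred))
print("kappa    ", cohen_kappa_score(y_true, y_pred))
print("MCC      ", matthews_corrcoef(y_true, y_pred))
```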

2.3 General classification algorithms

The two data pre-processing techniques described in Section 2.5 are used on the imbalanced credit card fraud dataset, which makes it suitable for regular outlier detection methods to work with the data. The following classification algorithms are used both before and after pre-processing, to verify that the algorithms are better able to detect the frauds once the data distribution is changed in such a way that the algorithms are more biased towards finding the frauds.

2.3.1 k-Nearest Neighbours

The k-Nearest Neighbour (kNN) algorithm is a classification algorithm which assigns an instance to a specific class when the majority of its k nearest neighbours is of that specific class. The kNN are identified by calculating the distance between the instance and all instances in the training set, where the distance can for instance be the simple Euclidean distance. This technique was introduced by Fix & Hodges (1951) in an unpublished report. In order to find the nearest neighbours, all columns are first normalised: the column mean is subtracted from each instance, after which it is divided by the column standard deviation. In this way, all columns have mean 0 and a standard deviation of 1, so that none of the columns dominates the distance measure. The k is chosen in conformity with the size of the dataset. However, with an extremely unbalanced dataset, which is discussed later in the study design, most instances have a higher chance of being allocated to the majority class, simply because nearly all of the instances belong to the majority class. Hence, the data most likely needs to be pre-processed.
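A minimal sketch of this procedure, assuming scikit-learn, is given below; the toy data, the choice of k = 5 and the pipeline itself are illustrative and not the thesis implementation.

```python
# Hedged sketch: column standardisation followed by k-nearest-neighbour classification.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 4))          # toy stand-in for the transaction features
y_train = rng.integers(0, 2, 200)            # toy stand-in for the fraud labels

knn = make_pipeline(StandardScaler(),        # each column gets mean 0 and std 1
                    KNeighborsClassifier(n_neighbors=5, metric="euclidean"))
knn.fit(X_train, y_train)
print(knn.predict(X_train[:5]))              # majority class among the 5 nearest neighbours
```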

2.3.2 AdaBoost

The AdaBoost algorithm was originally conceived by Freund & Schapire (1997). Schapire & Singer (1998) further generalised the algorithm and made it more precise in the tuning of its parameters. They adapted the algorithm such that the weak learners can have a range over all real numbers, and they leave the tuning of the parameter \(\alpha_t\) unspecified, although they do recommend various tunings. The AdaBoost algorithm as specified by Schapire & Singer is as follows. Firstly, a weak learner is trained using a distribution \(D_t\), where the starting values of all weights are equal to \(1/N\). In order to train this function, the target values are coded as \(y_i \in \{-1, 1\}\). Next, a weak learner \(h_t: \mathcal{X} \to \mathbb{R}\) is obtained from the dataset \(x_i \in \mathcal{X}\). Finally, \(D_{t+1}\) is updated:

\[
D_{t+1}(i) = \frac{D_t(i)\,\exp\bigl(-\alpha_t y_i h_t(x_i)\bigr)}{Z_t}, \tag{2.7}
\]

where \(Z_t\) is a normalisation factor. The final hypothesis is used as output:

\[
H(x) = \operatorname{sign}\!\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) \tag{2.8}
\]

Conclusively, the weight of each item is initialised at \(1/N\), increased if the example is misclassified and decreased if it is correctly classified. Normally, \(\alpha_t\) is chosen as \(\tfrac{1}{2}\ln\bigl((1 - \epsilon_t)/\epsilon_t\bigr)\), where \(\epsilon_t\) is equal to the fraction of incorrectly forecast instances.
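The update in (2.7)-(2.8) can be sketched from scratch as below, using depth-1 decision trees from scikit-learn as weak learners. This is an illustrative reading of the algorithm under the stated choice of \(\alpha_t\), not the implementation used in this thesis.

```python
# Hedged sketch: AdaBoost weight updates (2.7) and final hypothesis (2.8).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """y must be coded in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                       # D_1(i) = 1/N
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        D = D * np.exp(-alpha * y * pred)         # misclassified points gain weight
        D /= D.sum()                              # normalisation factor Z_t
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    agg = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(agg)                           # H(x) = sign(sum_t alpha_t h_t(x))
```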

2.3.3 Support Vector Machines

Support Vector Machines (SVM) are a classification technique developed by Cortes & Vapnik (1995). An SVM tries to find a separating hyperplane between the data points of the two classes. Cortes & Vapnik (1995) introduced this method in order to solve the problem of finding an algorithm which updates the weights while minimising the error term. They search for the optimal hyperplane, which maximises the distance between the different classes while also maximising the accuracy of the algorithm. In order to do this, the features are first transformed by a mapping function \(\Phi\), which maps the features into a (possibly higher-dimensional) feature space. The possible hyperplane is then represented by

\[
w \cdot \Phi(x) + b = 0, \tag{2.9}
\]

where \(w\) denotes a 30-dimensional unknown vector, \(\Phi(\cdot)\) denotes a known basis function and \(b\) denotes an unknown constant.

In order to find this hyperplane, the following margin problem is optimised:

\[
\min \; \tfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{N} \epsilon_i \quad \text{s.t.} \quad y_i\bigl(w \cdot \Phi(x_i) + b\bigr) \geq 1 - \epsilon_i, \quad \epsilon_i \geq 0, \; i = 1, \dots, N \tag{2.10}
\]

In (2.10), the slack variables \(\epsilon_i\) allow instances to violate the margin, so the objective is to classify the instances correctly up to a total slack of \(\sum_i \epsilon_i\). This objective function has two goals, namely to maximise the margin between the classes and to minimise the number of misclassifications. The parameter \(C\) can be interpreted as the cost of misclassifying a training instance. Solving these equations ultimately gives the following formula for the weights:

\[
w = \sum_{i=1}^{N} \alpha_i y_i \Phi(x_i). \tag{2.11}
\]

Here \(\alpha_i\) is chosen such that the following function is maximised:

\[
\max \; W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad i = 1, \dots, N \tag{2.12}
\]

All instances for which \(\alpha_i\) is unequal to zero are called the support vectors.

Still, the data pre-processing techniques have to be used before applying the SVM method, as SVM still maximises accuracy, which can be achieved by just defining all instances as non-fraudulent cases.
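A minimal sketch of a soft-margin SVM in the sense of (2.10)-(2.12) is given below, assuming scikit-learn's SVC; the RBF kernel, the value of C and the toy data are illustrative assumptions rather than the specification used in this paper.

```python
# Hedged sketch: soft-margin SVM; C is the misclassification cost of (2.10).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)        # toy, roughly separable labels

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
svm.fit(X, y)
print(svm.named_steps["svc"].n_support_)             # counts of support vectors (alpha_i != 0)
```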

2.3.4 Random Forest

“A random forest is an ensemble method of unpruned classification or regression trees” (Chen et al., 2007, p. 2). Breiman (2001) developed the random forest. Random forests use an ensemble method for regression, classification or other tasks. They work by constructing multiple decision trees at training time, where the output of the algorithm is the mode of the classes or the mean prediction of the individual trees. In this way, the decision trees are corrected for their tendency to overfit. The training algorithm applies the bootstrap aggregating technique (hereafter: bagging) to the tree learners. Given the training set \(X_1, X_2, \dots, X_N\) and the fraud classes \(Y_1, Y_2, \dots, Y_N\), bagging repeatedly selects a subsample from these sets, upon which decision trees are fitted.

Each decision tree selects an attribute from the random subset of variables it is offered and splits the data in such a way that the probability of observing the same class is the highest. An example of a decision tree can be found in Appendix A in Figure 3. After training B of these trees, where B is equal to the number of times a subsample of X is taken to train a decision tree, the prediction of the class is equal to the majority vote of the trees. This is the theory developed by Breiman (2001).
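The bagging-plus-random-feature idea can be sketched as follows with scikit-learn; the number of trees, the max_features choice and the toy data are assumptions for illustration, not the thesis configuration.

```python
# Hedged sketch: bagged decision trees with a random subset of variables per split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 30))                 # toy stand-in: 30 explanatory variables
y = rng.integers(0, 2, 500)                    # toy stand-in: fraud labels

rf = RandomForestClassifier(
    n_estimators=200,                          # B trees, each on a bootstrap subsample
    max_features="sqrt",                       # variables considered at each node (~sqrt(30))
    random_state=0,
).fit(X, y)
print(rf.predict(X[:5]))                       # class = majority vote over the B trees
```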

2.4 General Outlier Detection Methods

This section describes the general outlier detection methods which are used on the dataset without any data pre-processing except for the feature engineering. Outlier detection methods use the fact that outliers deviate so much from the other observations that this can spark suspicion that they were generated by a different mechanism (Aggarwal, 2016). One of the first examples mentioned by Aggarwal (2016) is credit card fraud detection. He mentions that: “In many cases, unauthorized use of a credit card may show different patterns, such as buying sprees from particular locations or very large transactions” (Aggarwal, 2016, p.1). From this book, two methods of anomaly detection are taken and used on the unprocessed data: the Isolation Forest as introduced by Liu et al. (2008) and the Local Outlier Factor (LOF) method as introduced by Breunig et al. (2000).

2.4.1 Isolation Forest

Now that the performance of the methods can be measured, the general outlier detection methods are discussed. Anomalies are more susceptible to isolation, where isolation is defined as separating an instance from all other instances (Liu et al., 2008). This increased susceptibility comes from the fact that outliers deviate so much from regular observations. Liu et al. (2008) therefore propose a new approach for detecting anomalies, in which the values of the variables are repeatedly partitioned until all instances are isolated. They argue that anomalies should have shorter paths, as the fewer instances result in short paths in the tree structure and distinguishable attribute values have a higher chance of being separated in an early partitioning (2008, p.414). A path is the length from the root of the tree to the node where the instance is partitioned from all other instances.

Isolation Forest (iForest) randomly selects an attribute, which is then randomly partitioned. As random partitioning can be represented as a tree structure, the number of recursive partitions required to isolate an instance is equal to the path length from the root to the terminating node. See Figure 6 for an example of such a tree; this is an example of a tree as used in RandomForest. However, a RandomForest decision tree maximises the chance of correctly classifying, whereas an iForest tree splits the attributes randomly. Again, the path length of regular instances should be greater than that of anomalies, which is visualised in Figure 7 in Appendix A. Multiple individual trees are generated, and the average path length of each instance is calculated. Subsequently, the path length is turned into an anomaly score, where the formulas for the calculation can be found in Appendix B under (6.1). The higher the expected value of the path length, the smaller the anomaly score. This is because c(n) is a function of the number of instances, and due to the negative exponent in the score function a very short path yields a score close to 1. Contours of the anomaly scores can be generated, in which the anomalies can be visualised. Both the contours and the anomaly scores can then be used to identify the outliers.
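A minimal sketch of obtaining iForest anomaly scores is given below, assuming scikit-learn's IsolationForest (whose score_samples returns the negated anomaly score of Liu et al.); the subsample size and the toy data are illustrative assumptions.

```python
# Hedged sketch: Isolation Forest anomaly scores on toy data with a few planted outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(995, 2)),
               rng.normal(6, 1, size=(5, 2))])            # 5 synthetic anomalies

iforest = IsolationForest(n_estimators=500, max_samples=256, random_state=0).fit(X)
scores = -iforest.score_samples(X)                        # higher = shorter average path = more anomalous
print(np.argsort(scores)[-5:])                            # indices of the 5 most isolated points
```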

2.4.2 Local Outlier Factor

Breunig et al. (2000) propose a new way to identify outliers in a dataset, namely the Local Outlier Factor (LOF) method. In this method, they first find the k-Nearest Neighbours (kNN) of the instance, purely based on the Euclidean distance (2000, p.95). Next, they define a reachability distance to other objects, that is, the maximum of either the distance to the kNN or the distance from the instance to the point:

\[
R_k(\bar{X}, \bar{Y}) = \max\bigl\{ d(\bar{X}, \bar{Y}),\; k\text{-distance}(\bar{Y}) \bigr\} \tag{2.13}
\]

where \(d(\cdot)\) denotes the distance between two instances and \(k\text{-distance}(\cdot)\) is the distance to the kNN.

These distances are not symmetric between \(\bar{X}\) and \(\bar{Y}\). For example, when \(\bar{Y}\) lies in a dense region and the distance between the two points is large, the reachability distance is just equal to the true distance. However, when the distance between the two points is small, it is corrected by using the k-nearest-neighbour distance of \(\bar{Y}\). The larger the k, the greater this correction will be. Subsequently, the average reachability distance is defined, which is just the average of all these reachability distances:

\[
AR_k(\bar{X}) = \frac{1}{N} \sum_{i=1}^{N} R_k(\bar{X}, \bar{Y}_i) \tag{2.14}
\]

Equation (2.14) is only calculated for the points \(\bar{Y} \in L_k(\bar{X})\), where \(L_k(\cdot)\) is defined as the set of points within the kNN region of the point \(\bar{X}\). Then, the LOF score is calculated as the average reachability of the instance multiplied by the mean of the inverse average reachabilities of the data points around the instance. Hence,

\[
LOF_k(\bar{X}) = AR_k(\bar{X}) \cdot \operatorname{MEAN}_{\bar{Y} \in L_k(\bar{X})} \left( \frac{1}{AR_k(\bar{Y})} \right) \tag{2.15}
\]

Here, MEAN simply represents the mean of the set of values. Formula (2.15) implies that the LOF values for instances inside a dense cluster are close to 1 when the points in that cluster are homogeneously distributed. Even though two clusters can be far apart, the LOF values in both clusters are close to 1, and only outliers obtain a score much larger than 1. These values are larger than 1, as they are defined in terms of the ratio of the average reachability to that of the closest k instances. Conclusively, these scores can be viewed as the normalised reachability distance of a point.
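The LOF scores of (2.13)-(2.15) can be obtained as sketched below, assuming scikit-learn's LocalOutlierFactor; the number of neighbours and the toy data are illustrative assumptions.

```python
# Hedged sketch: LOF scores are close to 1 in dense regions and much larger for outliers.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(995, 2)),
               rng.normal(8, 1, size=(5, 2))])       # 5 synthetic outliers

lof = LocalOutlierFactor(n_neighbors=5).fit(X)
lof_scores = -lof.negative_outlier_factor_           # scikit-learn stores the negated LOF
print(np.argsort(lof_scores)[-5:])                   # the 5 points with the highest LOF score
```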

2.5 Data pre-processing techniques

This section discusses the data pre-processing techniques and the advantages of using these approaches in order to overcome imbalanced datasets.

Data pre-processing methods change the data distribution to make standard algorithms focus on the cases that are more relevant for the user (Galar et al., 2016, p.19). They claim that the advantages of these procedures are that (i) they are applicable to any existing learning tool, and (ii) the models become more biased towards the goals of the user, as the data distribution is changed to match these goals. However, they also mention the disadvantage of the approach, namely that mapping the distribution into an optimal distribution for the user's goals is far from easy.

The current methods for changing the distribution of the data are of three types: synthesising new data, stratified sampling, or a combination of both. This paper focuses on one approach based on synthesising new data, the Synthetic Minority Over-sampling Technique (SMOTE), and one based on stratified sampling, namely Clustered-Based Over-sampling (CBO). These two approaches are the most common choices for redistributing data.

2.5.1 SMOTE

SMOTE is a technique developed by Chawla et al. (2002), designed for imbalanced datasets. They define it as a blend of under-sampling the majority class combined with a particular form of over-sampling the minority class (2002, p.322).

SMOTE is introduced to change the distribution of the dataset. Chawla et al. (2002, p.328) propose this technique as they believe it is superior to create synthetic examples, rather than to simply over-sample existing instances. The minority class is over-sampled by introducing examples along the line segments towards the k nearest minority neighbours. They choose a k of five; if, for example, an over-sampling of 200% is needed, two of the five nearest neighbours are randomly chosen and one new point is added along the line segment between the instance and each chosen neighbour. This is done by taking the difference between the two vectors which represent the point and the neighbour, multiplying this difference by a random number between 0 and 1, and adding the result to the instance under consideration. This way, a new fraudulent case is added to the dataset on the line segment between the instance and its chosen neighbour. By doing this, the decision region of the frauds becomes more general and hence better detectable. The full pseudo-code of the algorithm can be found in Chawla et al. (2002, p.329).
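The line-segment construction can be sketched from scratch as below. This is an illustrative reading of the procedure just described, not the reference implementation of Chawla et al. (2002); the helper name smote_examples and its parameters are hypothetical.

```python
# Hedged sketch: synthetic minority points on segments towards k nearest minority neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_examples(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1 because a point is its own neighbour
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                      # a random minority instance
        j = rng.choice(idx[i][1:])                        # one of its k nearest minority neighbours
        gap = rng.random()                                # random number between 0 and 1
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_frauds = np.random.default_rng(1).normal(size=(50, 4))  # toy minority class
print(smote_examples(X_frauds, n_new=100).shape)          # (100, 4)
```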

2.5.2 CBO

The Clustering-Based Over-sampling (CBO) algorithm is developed by Jo & Japkowicz (2004). They wanted to test whether class imbalances are responsible for the degradation in performance of standard classifiers. In order to do so, they experimented with a method that takes the small disjunct problem into account. “This method yields a performance superior to the performance obtained using standard or advanced solutions to the class imbalance problem” (Jo & Japkowicz, 2004, p.40).

The method consists of clustering the training data per class separately, whereafter random over-sampling is performed within each cluster (Jo & Japkowicz, 2004). They argue that in this way, not only the imbalance between the classes but also the imbalance within classes is handled, as both are rectified simultaneously. The clusters are formed with the k-means algorithm: k random vectors are chosen as the means of their clusters, after which each remaining vector is assigned to the cluster mean closest to it.

After the clustering is performed, all clusters in the majority class are over-sampled until they have the same size as the largest cluster. Next, the minority clusters are over-sampled to the point where the total number of points in the minority class equals the total number of points in the majority class, such that no class imbalance remains. For example, if the majority clustering is 8, 9, 10, 11, where 8 stands for a cluster of 8 instances closest to the first randomly chosen cluster mean, the clusters are over-sampled to the size of the largest: 11, 11, 11, 11. If the minority clustering is 2, 2, each of these clusters is scaled to (11+11+11+11)/2 = 22 instances. After this re-sampling, they state that both classes are free of imbalances.
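The cluster-then-oversample idea can be sketched roughly as below, using scikit-learn's KMeans; the helper oversample_clusters, the random over-sampling with replacement and the toy cluster sizes (taken from the example above) are assumptions for illustration, not the authors' algorithm.

```python
# Hedged sketch: cluster each class with k-means, then oversample every cluster to a target size.
import numpy as np
from sklearn.cluster import KMeans

def oversample_clusters(X, k, target_size, seed=0):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    parts = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        take = rng.choice(idx, size=target_size, replace=True)   # random oversampling
        parts.append(X[take])
    return np.vstack(parts)

rng = np.random.default_rng(2)
X_maj = rng.normal(size=(38, 3))                 # majority class, clusters grown to size 11 each
X_min = rng.normal(5, 1, size=(4, 3))            # minority class, clusters grown to 44 / 2 = 22 each
maj = oversample_clusters(X_maj, k=4, target_size=11)
mino = oversample_clusters(X_min, k=2, target_size=len(maj) // 2)
print(len(maj), len(mino))                       # 44 44: no class imbalance remains
```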

2.6 Special-purpose Learning Methods

As mentioned in the introduction of this paper, three different special-purpose learning methods are used in the identification of outliers. These are: Weighted Random Forest as developed by Chen et al. (2004), the AdaBoost variants as first used by Sun et al. (2007) and Joshi et al. (2001), and finally the Fuzzy-Support Vector Machines for Class Imbalanced Data (FSVM-CIL) by Batuwita & Palade (2010).


2.6.1 Weighted Random Forest

Chen et al. (2007) continue on the RandomForest model developed by Breiman (2001), which is discussed in 2.3.4. They recommend a new weighting factor for the trees, where the weights are set in accordance with the sizes of the classes. The cost of misclassifying an instance from the minority class is penalised more heavily, such that the minority class is also classified better. In addition, if a tree suggests a minority case, this vote is multiplied by a “weighted majority vote” factor, such that the chance of predicting a minority case is also increased. This new method is called Weighted Random Forest and outperforms other solutions to the imbalanced data classification problem (Chen et al., p.7).
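As a rough approximation of this idea (and not Chen et al.'s exact scheme, which also reweights the terminal-node vote), a class-weighted random forest can be sketched with scikit-learn's class_weight argument; the weight values and toy data below are hypothetical.

```python
# Hedged sketch: a random forest in which misclassifying the rare fraud class costs more.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.05).astype(int)      # toy imbalanced labels (~5% "frauds")

wrf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",
    class_weight={0: 0.2, 1: 0.8},             # hypothetical weights favouring the fraud class (1)
    random_state=0,
).fit(X, y)
print(wrf.predict(X[:5]))
```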

2.6.2 Adaboost’s variants

The weight updating of general AdaBoost, however, does not take very imbalanced datasets into account. Hence, multiple solutions to this problem are suggested in Sun et al. (2007) and Joshi et al. (2001). Sun et al. (2007) come up with three different new weight updating rules, which they call AdaC1, AdaC2 and AdaC3. The three modifications add a cost item to the weight update formula: for AdaC1, AdaC2 and AdaC3, the cost item is added inside the exponent, outside the exponent, and both inside and outside the exponent, respectively. The costs can be varied; for example, higher costs can be given to frauds which involve a larger amount of money. Of course, as the purpose of cost-sensitive boosting is to boost the weight of the minority class, the costs of frauds are higher than the costs of non-fraudulent cases.

RareBoost is the adaptation of the AdaBoost algorithm developed by Joshi et al. (2001). RareBoost uses the number of true/false positives/negatives to update the weights. Two different \(\alpha_t\) are calculated for positive and negative cases:

\[
\alpha_t^{p} = \frac{1}{2}\ln\frac{TP_t}{FP_t}, \qquad \alpha_t^{n} = \frac{1}{2}\ln\frac{TN_t}{FN_t}, \tag{2.16}
\]

where \(TP_t\) is the true positive rate, \(FP_t\) the false positive rate, and so on. The weights are then updated with these new \(\alpha_t\). However, this weighting strategy only increases the weights of false positives and false negatives if \(TP > FP\) and \(TN > FN\). This means that the precision should be larger than 0.5. With the imbalance problem, however, the small class normally has problems with precision and recall, which makes this constraint a strong condition. If this condition is not satisfied, the algorithm is likely to collapse. In case this happens with the credit card fraud data, this paper disregards this algorithm without any further consideration.
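The role of the cost item can be sketched as below, in the spirit of AdaC1 and AdaC2 as described above; the exact formulation, the cost values and the helper name are assumptions for illustration and do not reproduce Sun et al.'s (2007) algorithms in full.

```python
# Hedged sketch: a per-instance cost entering the AdaBoost weight update.
import numpy as np

def adac_weight_update(D, y, pred, alpha, cost, variant="C1"):
    """D: current weights; y, pred coded in {-1, +1}; cost: per-instance cost item."""
    if variant == "C1":                              # cost inside the exponent
        D_new = D * np.exp(-alpha * cost * y * pred)
    else:                                            # "C2": cost outside the exponent
        D_new = cost * D * np.exp(-alpha * y * pred)
    return D_new / D_new.sum()                       # normalise

D    = np.full(6, 1 / 6)
y    = np.array([+1, +1, -1, -1, -1, -1])            # +1 = fraud
pred = np.array([-1, +1, -1, -1, +1, -1])            # one fraud and one non-fraud misclassified
cost = np.where(y == 1, 2.0, 1.0)                    # hypothetical: frauds cost twice as much
print(adac_weight_update(D, y, pred, alpha=0.5, cost=cost))
```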

2.6.3 Fuzzy Support Vector Machines for Class Imbalanced Data

Regular SVM is adapted in order to overcome the class imbalance problem; the resulting method is called Fuzzy Support Vector Machine for Class Imbalance Data (FSVM-CIL) and learns from imbalanced datasets better than regular SVM. It is developed by Batuwita & Palade (2010), where the first adaptation of regular SVM to Fuzzy SVM was made by Lin & Wang (2002). This adaptation made SVM more robust against noise and/or outliers. Fuzzy SVM was then further extended by Wang et al. (2005), where two membership values are introduced for each training example, one for the positive and one for the negative class. This approach was further expanded, based on vague sets, by Hao et al. (2007). However, FSVM-CIL is only based on the regular FSVM as introduced by Lin & Wang (2002).

Regular SVM is sensitive to outliers and/or noise. Hence, Lin & Wang (2002) came up with an adaptation to the algorithm. They added extra weights to the slack variables \(\epsilon_i\), where these weights are denoted by \(m_i\). This turns the optimisation problem of (2.10) into the following system of equations:

\[
\min \; \tfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{N} m_i \epsilon_i \quad \text{s.t.} \quad y_i\bigl(w \cdot \Phi(x_i) + b\bigr) \geq 1 - \epsilon_i, \quad \epsilon_i \geq 0, \; i = 1, \dots, N \tag{2.17}
\]

Note that every point is now assigned the cost \(m_i C\), instead of every point having the cost \(C\). In this way, the SVM algorithm can find a more robust hyperplane, as the margin is maximised while down-weighting some less important examples, being outliers or noise. Due to adding \(m_i\) to the problem in (2.17), the bounds on \(\alpha_i\) also need to be reconsidered, which turns Formula (2.12) into the following problem:

\[
\max \; W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \leq \alpha_i \leq m_i C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad i = 1, \dots, N \tag{2.18}
\]

This method is known as FSVM. In order to adapt this method to the final approach used in this paper, the weights \(m_i\) are defined as in Batuwita & Palade (2010). Higher misclassification costs are assigned to the minority class. The proposed method has two goals, namely to suppress the effect of class imbalance and to reflect the within-class importance of the various training instances, in order to suppress the effect of outliers and noise (Batuwita & Palade, 2010, p. 562). They assign the weights as follows:

\[
m_i^{+} = f(x_i^{+})\, r^{+}, \qquad m_i^{-} = f(x_i^{-})\, r^{-} \tag{2.19}
\]

In the equations of (2.19), the functions \(f(x_i)\) assign a value between 0 and 1 that reflects the importance of \(x_i\) within its own class. The values \(r^+\) and \(r^-\) reflect the class imbalance; since the number of positive cases is far lower than the number of negative cases, \(r^+ > r^-\). This means that the weight values \(m_i^+\) lie somewhere in \([0, r^+]\), while the weights \(m_i^-\) lie in \([0, r^-]\). Due to both the function \(f(x_i)\) and the class imbalance weight \(r\), FSVM-CIL handles the noise and outliers as well as the problems of class imbalance.

Batuwita & Palade (2010) come up with three different types of functions for \(f(x_i)\), based on the distance to the own class centre, the distance to the estimated hyperplane, or the distance to the actual separating hyperplane. In this paper, only the distance to the own class centre is considered, which results in the following formulas for \(f(x_i)\):

\[
f_1(x_i) = 1 - \frac{d_i}{\max(d_i) + \Delta}, \qquad f_2(x_i) = \frac{2}{1 + \exp(\beta d_i)}, \qquad d_i = \lVert x_i - \bar{x} \rVert, \quad \beta \in [0, 1], \tag{2.20}
\]

where \(\Delta\) is regarded as a small positive number, added to avoid the case that \(f_1(\cdot)\) becomes zero, and \(\beta\) can be regarded as the parameter that determines the steepness of the decay. As can be seen, for \(\beta = 0\) all instances have the same weight, namely 1, whereas for higher \(\beta\) the weights start to decay as the instances get further from the centre \(\bar{x}\). Only these two functions are considered, as they led to the best results for Batuwita & Palade (2010).
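A minimal sketch of how the membership function \(f_2\) in (2.20) and the class factors \(r^+\), \(r^-\) could be turned into per-instance costs \(m_i C\) is given below, passing the weights to scikit-learn's SVC through sample_weight. The toy data, the value of \(\beta\) and the chosen \(r^+\), \(r^-\) are assumptions; this is not Batuwita & Palade's implementation.

```python
# Hedged sketch: FSVM-CIL-style per-instance weights m_i = f(x_i) * r for an SVM.
import numpy as np
from sklearn.svm import SVC

def memberships(X, beta=0.5):
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)    # distance to the own class centre
    return 2.0 / (1.0 + np.exp(beta * d))             # f2 in (2.20): decays with distance

rng = np.random.default_rng(3)
X_pos = rng.normal(2, 1, size=(20, 4))                # minority (fraud) class, toy data
X_neg = rng.normal(0, 1, size=(500, 4))               # majority class, toy data
r_pos, r_neg = 1.0, 0.1                               # hypothetical class factors, r+ > r-

m = np.concatenate([memberships(X_pos) * r_pos, memberships(X_neg) * r_neg])
X = np.vstack([X_pos, X_neg])
y = np.r_[np.ones(20), np.zeros(500)]

svm = SVC(kernel="rbf", C=10.0)
svm.fit(X, y, sample_weight=m)                        # effective cost m_i * C per instance
```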

This section has described the data pre-processing techniques, SMOTE and CBO, which are used to change the distribution of the data in order to overcome the problem of the skewed dataset. Next, the general classification algorithms can be used on the processed data in order to find the fraudulent cases. The algorithms used on the pre-processed data are: AdaBoost, kNN, RandomForest and regular SVM. Isolation Forest and LOF are not used on the pre-processed data, as they require the instances to be as isolated as possible and thus not affected by a pre-processing method. Finally, the SPLM algorithms are discussed: WRF, the AdaBoost variants and FSVM-CIL, which are used on the unprocessed data, as they are supposed to better handle imbalanced data. The next section contains the study design, where the dataset on which these techniques are used is described. Also, any possible feature engineering on the dataset is presented in the next section.

3 Study Design

This chapter discusses the study design, which consists of the data used, as well as the estimation procedure used to arrive at the results found in Chapter 4.

3.1 Data

The data used in this paper concern credit card fraud. They are acquired from a website that hosts online machine learning competitions and are publicly available so that anyone can learn from them. The dataset is downloaded from the web address: https://www.kaggle.com/mlg-ulb/creditcardfraud.

The dataset contains 28 components obtained with a PCA transformation, while it also contains the following variables:

Amount amount of currency which is withdrawn

Time contains the seconds elapsed between the first transaction and the current instance

Class contains a 1 if the instance is fraud and 0 otherwise

The first 28 components are PCA-transformed in order to keep the privacy of the credit card users guaranteed. Only about 0.2% of all transactions are fraudulent in this dataset, which makes the dataset very skewed. First of all, a random 80% of the data is used to train the algorithms and the remaining 20% is used to obtain the results. A summary of the descriptive statistics can be found in Table 14. The training set is then resampled for the use of cross-validation. Cross-validation ensures that the model does not overfit on the training data, such that the independent test set is evaluated with non-overfitted parameters. k-fold cross-validation is used, which means that the training set is split into k equal samples. One of these k samples is left out of the training and used as a validation set, while the others are used for training. This is repeated k times, so that each of the k samples is used once as the validation data. The results are then averaged across all k folds. This method has an advantage over other cross-validation techniques, as all the data are used both for validation and for training. The first papers conducting this method of cross-validation are those of Mosteller & Tukey (1968), with a further assessment in Stone (1974). Further use of k-fold cross-validation shows that 5-fold is most commonly used (McLachlan et al., 2004). Thus, 5-fold cross-validation is also used in this paper.
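A minimal sketch of this evaluation setup is given below, assuming scikit-learn; the toy data (with a much milder imbalance so the folds stay populated) and the placeholder classifier are assumptions, not the thesis code.

```python
# Hedged sketch: random 80/20 split, then 5-fold cross-validation on the training part.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 30))
y = (rng.random(5000) < 0.02).astype(int)      # toy labels; the real data are far more skewed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # 5 random folds of the training set
scores = cross_val_score(RandomForestClassifier(random_state=0), X_tr, y_tr,
                         cv=cv, scoring=make_scorer(matthews_corrcoef))
print(scores.mean())                           # MCC averaged over the 5 folds
```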


3.2 Feature Engineering

The main dataset contains 283,726 transactions, of which 492 are frauds. To start, a correlation matrix between the explanatory variables and the fraudulent cases is given in Appendix A.

From this correlation matrix, it can be seen that none of the variables have a high correlation with each other, which ensures that the problem of multicollinearity does not occur. This is due to the PCA transformation: none of the 28 explanatory variables other than Time and Amount are correlated, and hence only the correlations involving these two variables are given in Appendix A in Table 15.

Subsequently, the times of the fraudulent transactions are compared to the times at which most regular transactions occur. This is visualised in Figures 2 and 3. From these figures, it can be seen that most fraudulent transactions happen at times where the number of regular transactions is low, e.g. at Time = 100000. Therefore, the fraudulent instances should be better detectable.

3.3 Estimation Procedure

First of all, the dataset is split into a training set containing a random 80% of the data, while the other 20% is used as the test set to arrive at the results found in Chapter 4. Next, the training set is split into 5 random folds, which are used for the cross-validation.

The data pre-processing techniques described in Chapter 2 are first applied to this data. SMOTE is used on the data in order to counter the class imbalance; the imbalance is countered both only slightly and fully, such that the algorithms can better detect the outliers without greatly overestimating the number of fraudulent cases. The latter could for instance be the case with kNN, as there are then more possible fraudulent neighbours, which increases the number of times a fraudulent case is predicted. Hence, SMOTE is used where 100%, 500%, 1000% and 2000% of the frauds are oversampled, while the majority class is undersampled such that the final class distributions equal either (0.99, 0.01), (0.95, 0.05), (0.90, 0.10) or (0.50, 0.50). The k chosen for SMOTE is equal to 10, as this leaves room for SMOTE to also put new frauds on line segments towards neighbours further away, making the fraudulent region better detectable. In this way, the algorithms described in Section 2.3 can better predict the fraudulent cases. In order to keep a trade-off between better predicting the fraudulent cases and not greatly overestimating the number of frauds, Cohen's Kappa as well as Matthews correlation coefficient is used to check the performance, where the MCC is the leading performance measure.

Figure 2: Count of fraudulent transactions with their timestamp

When CBO is used on the dataset, different values for the number of cluster means are taken. The options chosen for this paper are (2, 3, 4, 5); these are used to determine to which cluster each instance belongs. The number of clusters is kept low, as the number of frauds is low. Oversampling really small clusters leads to a dataset with many frauds in a relatively small region, which causes the algorithms to deteriorate, as the frauds only become detectable in those areas.

The general classification methods are used on the pre-processed data. The Isolation Forest and LOF algorithms are not, because when the outliers are SMOTE'd they become more general and hence less isolated, making them better detectable for the general classification algorithms rather than for the isolation-based methods. Hence, normal RandomForest and standard AdaBoost are used after the SMOTE technique, as well as the kNN and SVM algorithms. For each algorithm, a different amount of SMOTE over- and undersampling might give the best predictions. Thus, in the results, each chosen SMOTE variant is discussed


per algorithm to find the best specification for each model. The kNN algorithm is used with the number of neighbours varied in each iteration. An iteration is a training of an algorithm for which either a different fold is used or the algorithm uses a different option. This leads to 25 iterations per algorithm, as there are 5 folds and 5 options. For testing the SMOTE rates, there are 400 iterations per algorithm: 5 folds, 5 model specifications, 4 oversampling rates and 4 undersampling rates. The numbers of neighbours used are 1NN, 3NN, 5NN, 7NN and 9NN. For RandomForest, the regularly chosen number of variables the decision tree can choose from at each node is equal to the square root of the number of variables. As the dataset contains 30 explanatory variables, (4, 5, 6, 7, 8) are the options tested to analyse which gives the better predictions. With the AdaBoost algorithm, the tree depth is varied from 2 to 10 in steps of 2, to evaluate whether an increase in the tree depth gives a better prediction of the frauds. The number of rounds is set to 150, as after this number of rounds the MCC becomes steady. Finally, with the SVM algorithm, the cost is varied with each iteration and chosen from the following options: (0.01, 0.1, 1, 10, 100). The model specification with the highest MCC is chosen to be trained and used to predict on the test set.

3.3.1 kNN Estimation

As the kNN algorithm calculates distances between points, variables with a higher standard deviation and/or a mean unequal to zero have a larger influence on the distance between points. In order to solve this problem, all variables are standardised by subtracting the mean of the training set, after which the values are divided by the standard deviation of the variable. In this way, all variables are normalised and have an equal influence on the distance, which should be the case in order to get the best results. This is done for every algorithm which makes use of kNN, e.g. LOF.

4 Results & Analysis

In this chapter the results from the trained algorithms are given, after which an analysis of these results is provided. First, the optimal model specifications are given for each model. These are chosen by calculating all MCCs from the 5-fold cross-validation performed on the training set. The different options which are tested are those specified in the estimation procedure subsection. For the general classification methods, different values of over- and undersampling are chosen in order to find the best model specification for each algorithm. Next, the MCCs and Cohen's Kappas are discussed to find the best SMOTE sampling rates and model specifications for the general classification methods. Subsequently, the results of the outlier detection methods are discussed. Finally, the results of the SPLM are considered. The models are compared both by MCC and by the computing time on the test set. After randomly splitting the dataset into a training and a test set, the test set contains 99 fraudulent instances, which automatically means the training set contains 393 frauds.

4.1 Logit model results

To start, the results of the logit model are given. These are the results obtained after using the logit model on all possible explanatory variables, so including Time and Amount. The logit model obtains the following results on the training set, given in Table 1.


Table 1: Results for the logit model

          Logit
MCC       0.746
Kappa     0.734

4.2 Results general classification methods

Next, the results of the general classification methods are discussed. These are the results derived after randomly partitioning the dataset into the training and the test set. For each algorithm, five different options are tested by 5-fold cross-validation. To start, the kNN algorithm is used on the dataset. The five options tested for k are (1, 3, 5, 7, 9). These are all kept low, due to the low number of frauds in comparison to the high number of non-fraudulent cases. This gives Table 2.

Table 2: Results for the general kNN algorithm

k               1        3        5        7        9
Average MCC     0.7638   0.2048   0.1271   0.2023   0.1817
Average Kappa   0.7536   0.1119   0.1347   0.1121   0.1059

From Table 2 it can be concluded that k = 1 is by far optimal for this dataset. The reason is that the fraudulent cases are far outnumbered by the non-fraudulent cases, so every instance automatically has a higher chance of having a non-fraudulent neighbour. The chance of two out of three nearest neighbours being fraudulent is therefore even lower than the chance of the single nearest neighbour being fraudulent.

Next, the regular AdaBoost algorithm is used on this dataset. The weak learners for this algorithm are based on decision trees, as these have multiple advantages: non-linearity, consistently better than random guessing, reasonably fast to train, and an easy trade-off between bias and variance. This trade-off between bias and variance can be found in the tree depth: shorter trees have a higher bias, whereas longer trees have a higher variance. Therefore, different options for the depth of each tree are chosen in order to find the best trade-off. The different options for the tree depth are (2, 4, 6, 8, 10). After 100 trees, the results converge to steady MCCs and Kappas. This gives the results in Table 3. The highest average MCC among the original options is obtained with a tree depth of 10.

Table 3: Results for the general AdaBoost algorithm

Tree Depth      2        4        6        8        10       12       15
Average MCC     0.8004   0.8514   0.8611   0.8609   0.8615   0.8699   0.8474
Average Kappa   0.7954   0.8437   0.8548   0.8416   0.8574   0.8589   0.8398

After finding that the highest tree depth results in the highest MCC, even higher tree depths of 12 and 15 are also tested. The MCCs for tree depths of 12 and 15 are equal to 0.869939 and 0.847388 respectively. Hence, a tree depth of 12 is used on the test set, as this seems to give the best trade-off between bias and variance: trees with more depth have a higher variance, whereas shorter trees have a higher bias.

The results of the general SVM are discussed next. The cost in the linear SVM kernel is varied in order to find the best trade-off between an optimal margin, obtained with higher weights, and the number of correctly specified instances, denoted by \(\epsilon_i\) in formula (2.10). The different cost options are chosen from (0.01, 0.1, 1, 10, 100), which gives the results in Table 4.

Table 4: Results for the general SVM algorithm

Costs           0.01   0.1      1        10       100
Average MCC     0      0.3467   0.7913   0.8175   0.7823
Average Kappa   0      0.2296   0.7769   0.8046   0.7746


From Table 4, it can be concluded that a cost of 10 gives the best results on the training set.

The final general classification algorithm is the RandomForest, where different options are tested for the number of variables the decision tree can choose from at a node. The options tested are (4, 5, 6, 7, 8), as these lie closest to the normally chosen number of variables, namely the square root of the number of variables, which would be the square root of 30. This leads to the results in Table 5.

Table 5: Results for the general RandomForest algorithm

Variables at node (mtry)   4        5        6        7        8
Average MCC                0.8597   0.8540   0.8587   0.8453   0.8494
Average Kappa              0.8554   0.8491   0.8546   0.8416   0.8454

In Table 5, mtry stands for the number of variables which can be chosen at each node, and the average MCC is calculated by taking the mean MCC over all 5 folds. From Table 5, it can be concluded that among the different options, the option of 4 out of 30 variables at each node is chosen for the decision trees. However, as it is standard in the literature to take the square root of the number of explanatory variables, 6 out of 30 variables is also used to obtain a result on the test set.

From the RandomForest, the variable importance can be derived by looking at the mean decrease in gini. Gini is the criterion used to decide the split at a node. A gini value of 0 is obtained if, by splitting on the specific variable at a node, the model can perfectly separate the classes. Hence, the variable with the lowest gini is the optimal variable to split on, and the variable with the highest mean decrease in gini is the most important variable in the dataset. The variable importance of the general RandomForest can be found in Figure 8 in Appendix A. Leaving out the ten worst variables, those with the lowest mean decrease in gini, and running the RandomForest again does not improve the MCC; doing so obtains a highest cross-validated MCC of 0.824694.

4.3 Results general outlier detection methods

This section describes the results obtained from using the general outlier methods on the dataset.

To start, the LOF algorithm is used on the entire dataset, as this method does not have a training phase. In order to evaluate the performance of the models, the 500 instances with the highest LOF score are taken as fraudulent cases. The top 500 instances are evaluated, as the dataset contains slightly fewer than 500 fraudulent cases. For evaluating the model, multiple choices of k are considered, the same as with the kNN algorithm, namely (1, 3, 5, 7, 9). Next, a confusion matrix is made out of these 500 instances, from which again the MCC and Cohen's Kappa can be derived. The results are given in Table 6.

Table 6: Results for the LOF algorithm

k               1        3        5        7        9
Average MCC     0.0713   0.0922   0.1119   0.0846   0.0947
Average Kappa   0.0312   0.0836   0.1168   0.0423   0.0673

From Table 6 it can be concluded that for k = 5, the model has the highest value for the MCC; hence, this is used in the final results. As the MCC is below 0.12, it is already clear that this outlier detection method does not work well on this dataset. This is due to the fact that the distance between different frauds is not necessarily large enough to increase the LOF score for all the frauds. Thus, most of the frauds still have a relatively low LOF score, which is not among the 500 highest LOF scores.


Next to the LOF algorithm, the Isolation Forest is also used on this dataset. For optimising the MCC, different values for the number of instances used to construct a tree are chosen. These options are (1000, 5000, 10000, 20000, 50000); a value of 1000 means that a subsample of 1000 cases is used to construct each isolation tree. After 500 trees, the anomaly scores are steady, so 500 trees are used in the algorithm. This leads to the results in Table 7.

Table 7: Results for the Isolation Forest algorithm

Instances per tree   1000     5000     10000    20000    50000
Average MCC          0.1898   0.2475   0.2954   0.3120   0.3010
Average Kappa        0.1686   0.2401   0.3116   0.3116   0.2845

These results are obtained after all instances with an anomaly score higher than 0.6 are classified as fraudulent cases. Only these are classified as fraudulent, as Liu et al. (2008) state that: “potential anomalies can be identified as an anomaly score larger than 0.6” (Liu et al., 2008, p.416).

4.4 Results data pre-processing techniques

This section describes the best results obtained by applying the data pre-processing techniques to the data before using the general classification algorithms. For each algorithm, all the different options as specified in Chapter 3.3 are tried in order to further increase the MCC and Kappa found in Chapter 4.1 of this paper. This results in 80 different options for over- and undersampling in combination with the different model specification options per model. Hence, only the best model specifications are given below, in combination with their MCC and Cohen's Kappa; the remaining MCCs and Cohen's Kappas can be found in Appendix A. Within each of these tables, for each oversampling rate, the optimal model specification is given in the middle row. This optimal model specification is based on the row where the MCC is the highest. This section starts off by giving all the optimal models for the SMOTE pre-processing, after which the optimal model specifications with CBO are given. If the optimal distribution of a model is (0.99, 0.01), this indicates that 99% of all cases are non-fraudulent, while the final 1% consists of frauds.

Starting, all optimal SMOTE pre-processing over- and undersampling rates for each algorithm are given in Table 8.

Table 8: Optimal SMOTE rates per classification algorithm

Algorithm      Oversampling (%)   Class distribution   Optimal Parameter   MCC
kNN            2000               (0.95, 0.05)         k = 3               0.6828
AdaBoost       500                (0.99, 0.01)         Tree depth = 8      0.8869
SVM            2000               (0.90, 0.10)         Cost = 0.1          0.7648
RandomForest   500                (0.99, 0.01)         mtry = 6            0.8281

These optimal rates are again 5-fold cross-validated; the results can be found in Tables 16, 17, 18 and 19 in Appendix A.

Next, the results when pre-processing with the CBO technique are given. Having five different options for the CBO pre-processing as well as five different model specifications gives these models 25 possible specifications.

Table 9: Optimal CBO pre-processing per classification algorithm

Algorithm      k (CBO)   Optimal Parameter   MCC
kNN            4         k = 5               0.2164
AdaBoost       3         Tree depth = 10     0.3961
SVM            4         Cost = 1            0.1674
RandomForest   3         mtry = 6            0.3621

The low MCCs in Table 9 are explained by the fact that CBO redistributes the classes to equal sizes. As the class imbalance in this dataset leaves only 0.17% fraudulent instances, changing the class distribution to equal sizes by far overestimates the number of frauds in the dataset. Therefore, the algorithms predict many false positives, which lowers the MCC significantly. This makes CBO an inefficient data pre-processing technique for this dataset and for many other very skewed datasets.

4.5 Results Special-purpose Learning Methods

Again, for the SPLM numerous options are tried to find the optimal weights or extra costs assigned to the fraudulent instances. In this way, the frauds become easier to detect and a higher priority for the algorithms.

This gives the results for the Weighted Random Forest shown in Table 10. For the WRF, the number of variables that can be used in each decision tree is set to 6, as this is most common in the literature, being the (rounded) square root of the number of explanatory variables. Next, different options for the weights are tried: (0.2,0.8), (0.1,0.9), (0.05,0.95) and (0.01,0.99).

Table 10: Results for the WRF algorithm

Weights         (0.3,0.7)   (0.2,0.8)   (0.1,0.9)   (0.05,0.95)   (0.01,0.99)   (0.001,0.999)
Average MCC     0.8167      0.8334      0.2048      0.2165        0.2023        0
Average Kappa   0.8103      0.7537      0.1119      0.1347        0.1121        0

The sharp decrease in MCC after the first options is due to a large increase in the false positive rate. The combination of an increased weight in the decision trees and an increased weight in the majority vote greatly increases the false positive rate, so the MCC drops rapidly from a very good to a poor classification score. The most extreme weight of (0.001,0.999) even predicts all instances as fraudulent and hence has an MCC and Kappa of 0. Relaxing the weights to (0.3,0.7) does not increase the MCC either, as this gives an MCC of 0.8167.
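A weighted random forest of this kind can be approximated with scikit-learn's class_weight argument, as sketched below; note that scikit-learn applies the weights inside the split criterion and the leaf probability estimates, which is close to, but not identical to, the weighted-vote WRF described above. X_train and y_train are placeholders and the number of trees is an assumption.

from sklearn.ensemble import RandomForestClassifier

wrf = RandomForestClassifier(
    n_estimators=500,                 # illustrative number of trees
    max_features=6,                   # mtry = 6, as described above
    class_weight={0: 0.2, 1: 0.8},    # the best-performing weights from Table 10
    random_state=0,
)
# wrf.fit(X_train, y_train)
# y_pred = wrf.predict(X_test)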

The results of the AdaBoost variants, AdaC1 and AdaC2, as well as RareBoost, are described next. The same tree depth options as used with the general AdaBoost algorithm are used with these algorithms, namely (2,4,6,8,10). This gives the results in Table 11.

Table 11: Results for the AdaBoost variants

Tree depth                2        4        6        8        10
Average MCC AdaC1         0.7862   0.842    0.8762   0.8717   0.8697
Average Kappa AdaC1       0.7786   0.8350   0.8701   0.8651   0.8614
Average MCC AdaC2         0.7954   0.8454   0.8531   0.8647   0.8631
Average Kappa AdaC2       0.7871   0.8382   0.8474   0.8597   0.8552
Average MCC RareBoost     0.3224   0.3498   0.3867   0.3794   0.3746
Average Kappa RareBoost   0.2805   0.3247   0.3548   0.3380   0.3600

It is clear that RareBoost does not work well on this dataset. This can be explained by the point already made in Chapter 2.5.2, namely that RareBoost only works if the precision is immediately higher than 0.5. However, it takes the AdaBoost model several trees before its predictions reach a precision above 0.5, and due to this strong constraint the algorithm collapses. RareBoost is therefore not applied to the test set, as it will not improve on the MCC found by the general AdaBoost algorithm.
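To make the cost-sensitive boosting step concrete, the sketch below implements an AdaC1-style weight update, in which the cost item enters the exponent of the update, with a depth-limited decision tree as base learner; the cost values and the number of boosting rounds are illustrative choices, not the tuned values of this paper.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adac1_fit(X, y, cost_pos=1.0, cost_neg=0.5, depth=8, rounds=50, seed=0):
    y_pm = np.where(y == 1, 1, -1)                 # labels in {-1, +1}
    cost = np.where(y == 1, cost_pos, cost_neg)    # higher cost on the frauds
    w = np.full(len(y), 1.0 / len(y))
    learners, alphas = [], []
    for _ in range(rounds):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=seed)
        tree.fit(X, y_pm, sample_weight=w)
        pred = tree.predict(X)
        correct = (pred == y_pm)
        a = np.sum(cost[correct] * w[correct])     # cost-weighted correct mass
        b = np.sum(cost[~correct] * w[~correct])   # cost-weighted error mass
        alpha = 0.5 * np.log((1.0 + a - b) / (1.0 - a + b))
        # AdaC1 places the cost inside the exponent of the weight update.
        w *= np.exp(-alpha * cost * y_pm * pred)
        w /= w.sum()
        learners.append(tree)
        alphas.append(alpha)
    return learners, alphas

def adac1_predict(X, learners, alphas):
    score = sum(a * t.predict(X) for a, t in zip(alphas, learners))
    return (score > 0).astype(int)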

4.6 Final Results

Now that all final model specifications are known, the full training set is used for training, after which the trained parameters are used to predict frauds on the test set. Conclusively, the final model specifications are as follows: the general RandomForest uses both 4 and 6 variables at each node, kNN uses 1 nearest neighbour, the general SVM uses a Cost of 10 and, finally, AdaBoost uses a tree depth of 12.
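Written out with scikit-learn estimators, these final specifications look roughly as follows; this is only an illustrative sketch, and details such as the number of trees or boosting rounds are assumptions rather than the exact settings used in this paper.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

final_models = {
    "kNN": KNeighborsClassifier(n_neighbors=1),
    "SVM": SVC(C=10),
    "RandomForest": RandomForestClassifier(max_features=6,   # 4 was also evaluated
                                           n_estimators=500),
    # 'estimator' is called 'base_estimator' in scikit-learn versions before 1.2.
    "AdaBoost": AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=12),
                                   n_estimators=100),
}
# for name, model in final_models.items():
#     model.fit(X_train, y_train)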

The general outlier methods are applied to the entire dataset, from which estimates for the MCC and Cohen's Kappa are derived directly. For the LOF algorithm, using k = 5, the top 500 outliers are identified as frauds, from which the MCC and Cohen's Kappa can be calculated. For the Isolation Forest, all instances with an anomaly score higher than 0.6 are classified as fraudulent. Drawing 20,000 instances to build each tree is optimal for this dataset, as this yields the highest MCC and Kappa.
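A minimal sketch of the LOF step, assuming scikit-learn's LocalOutlierFactor, is shown below; X and y are placeholders for the full feature matrix and the fraud labels.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score

lof = LocalOutlierFactor(n_neighbors=5)
lof.fit(X)
# negative_outlier_factor_ is the negated LOF score: lower means more outlying.
top500 = np.argsort(lof.negative_outlier_factor_)[:500]
y_pred = np.zeros(len(X), dtype=int)
y_pred[top500] = 1
# print(matthews_corrcoef(y, y_pred), cohen_kappa_score(y, y_pred))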

For the special-purpose learning methods, the WRF uses an mtry of 6, while the weights are equal to (0.2,0.8). In Table 12, computation time is defined as the time needed to both train the model and predict on the test set. From the same Table 12, it can be concluded that pre-processing the data does not always increase the MCC. For both kNN and SVM, the MCC only decreases after pre-processing, regardless of the SMOTE rates tried. This is because the False Positive rates of both algorithms increase greatly when the class distribution is changed even slightly, and adjusting the other parameters of these models to keep the number of False Positives low does not offset this increase. The AUC is extremely high for all models because the specificity is nearly 1 for every value of the threshold: even when a small number of non-frauds are predicted as fraud, the large imbalance in the dataset ensures that the specificity remains close to 1. All fraudulent instances are also found for most values of the threshold, which explains the high sensitivity and thus the high AUC. This is visualised in Figure 4. The AUC is therefore not a suitable metric to evaluate model performance on this specific dataset.
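A small numerical illustration of this point: with a legitimate class of roughly 283,000 transactions, as in this dataset, even several hundred false positives barely move the specificity, while the MCC reacts strongly to the same mistakes. The counts below are round numbers chosen for illustration, not results from this paper.

tn_total = 283_000          # approximate number of non-fraudulent transactions
for fp in (10, 100, 500):
    specificity = (tn_total - fp) / tn_total
    print(f"FP = {fp:4d}  ->  specificity = {specificity:.4f}")
# FP =   10  ->  specificity = 1.0000
# FP =  100  ->  specificity = 0.9996
# FP =  500  ->  specificity = 0.9982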


Table 12: Final results on the test set

Algorithm              MCC     Cohen's Kappa   AUC     Computation Time (sec)
Logit                  0.747   0.739           0.999   8
General RandomForest   0.860   0.857           0.999   207
General AdaBoost       0.854   0.849           0.999   407
General SVM            0.818   0.811           0.999   451
General kNN            0.829   0.829           0.999   659
LOF                    0.127   0.117           0.923   75198
Isolation Forest       0.312   0.312           0.942   528
SMOTE RandomForest     0.885   0.884           0.999   135
SMOTE AdaBoost         0.889   0.888           0.999   362
SMOTE SVM              0.763   0.760           0.999   236
SMOTE kNN              0.722   0.716           0.999   408
CBO RandomForest       0.357   0.337           0.967   613
CBO AdaBoost           0.321   0.313           0.938   1189
CBO SVM                0.128   0.105           0.916   1562
CBO kNN                0.204   0.187           0.902   2247
WRF                    0.866   0.863           0.999   210
AdaBoost C1            0.869   0.860           0.999   837
AdaBoost C2            0.832   0.825           0.999   910
FSVM-CIL               0.824   0.816           0.999   2165

The logit model is outperformed by all of the general machine-learning classification methods. This is because the logit model assumes a linear influence of the variables on the expected outcome, while the influence on the classification may be non-linear, which the logit model cannot capture. As expected, the SPLM also outperform the regular classification methods. The WRF does outperform the standard RandomForest, but due to a large increase in False Positives it only slightly beats it. The increased weight in the decision trees, in combination with the weighted majority vote, increases the number of correctly identified frauds.


Figure 4: ROC Curve of the regular SVM model

However, the number of False Positives increases more quickly than the number of True Positives, so the MCC ultimately only improves slightly. For FSVM-CIL and the AdaBoost variants, the MCC also increases slightly, a sign that these algorithms are capable of handling the class imbalance problem. With the AdaBoost variants and FSVM-CIL, the same problem arises as with the WRF: the number of True Positives increases, but the number of False Positives increases as well, so the MCC does not improve greatly.


The best performing algorithm is the AdaBoost algorithm, where the data is pre-processed by SMOTE. The fastest algorithm is the RandomForest algorithm, also with data pre-processed by SMOTE; this is because the final dataset used by this algorithm only contains 198,858 cases. Both RandomForest and AdaBoost handle the classification task most precisely and most quickly. AdaBoost is slightly more precise, whereas RandomForest has a significantly lower computing time, since it does not have to update instance weights each round. Finally, it can be concluded that these two algorithms are superior in correctly classifying credit card frauds. However, as most results do not differ greatly when comparing the MCC, it is important to check the statistical significance of these differences. This could, for instance, be done by bootstrapping the final results in Table 12. With the computational limitations of the current set-up, however, this bootstrapping is left as a task for further research on this topic.
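A minimal sketch of such a bootstrap check is given below, assuming scikit-learn's matthews_corrcoef; y_test, pred_a and pred_b are placeholders for the true test labels and the predictions of two competing models.

import numpy as np
from sklearn.metrics import matthews_corrcoef

def bootstrap_mcc_diff(y_test, pred_a, pred_b, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_test)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample test cases
        diffs[b] = (matthews_corrcoef(y_test[idx], pred_a[idx])
                    - matthews_corrcoef(y_test[idx], pred_b[idx]))
    # A simple percentile interval; if it excludes 0, the difference is
    # unlikely to be due to sampling noise alone.
    return np.percentile(diffs, [2.5, 97.5])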

The final confusion matrix, as predicted by the AdaBoost algorithm pre-processed by SMOTE, is given in Table 13.

Table 13: Confusion Matrix of the AdaBoost algorithm, pre-processed by SMOTE

                    Predicted
                    0        1
Actual     0        56854    5
           1        16       83
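As a quick check, the headline metrics can be recomputed directly from the counts in Table 13; the short calculation below reproduces the reported MCC of roughly 0.889 and a Cohen's Kappa of roughly 0.888.

import numpy as np

tn, fp, fn, tp = 56854, 5, 16, 83
mcc = (tp * tn - fp * fn) / np.sqrt(
    float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
n = tn + fp + fn + tp
po = (tp + tn) / n                                           # observed agreement
pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2  # agreement by chance
kappa = (po - pe) / (1 - pe)
print(round(mcc, 3), round(kappa, 3))                        # approx. 0.889 0.888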


5 Conclusion

This paper has attempted to compare various fraud detection methods. For many applications, e.g. tax fraud detection, it is important that fraudulent transactions can be detected as accurately as possible. As the number of fraudulent transactions is almost always dominated by the number of non-fraudulent transactions, the class imbalance problem has also been central to this paper. The main theme has therefore been to compare various generalised outlier detection methods, data pre-processing techniques and imbalanced-data-specific learning methods. In order to do this, several subtopics have been addressed. Firstly, the performance measures have been discussed, as regular accuracy is an ineffective metric under class imbalance. The Matthews Correlation Coefficient (Matthews, 1975) is used as the leading performance metric in this paper, and Cohen's Kappa has also been applied to measure the performance of the algorithms.

Next, some general classification methods have been discussed, as well as the standard econometric approach for a binary classification task. These are k-Nearest Neighbour (Fix & Hodges, 1951), AdaBoost (Schapire & Singer, 1998), Support Vector Machines (Cortes & Vapnik, 1995) and RandomForest (Breiman, 2001); the standard econometric approach is the logit model. Due to the class imbalance, these models are biased towards the majority class. As the dataset only contains 492 frauds on 283,726 transactions, other techniques to counter this problem have been examined. General outlier detection methods have been considered first, namely Isolation Forest (Liu et al., 2008) and Local Outlier Factor (Breunig et al., 2000). Next, data pre-processing techniques have been considered, which change the data distribution in such a way that the algorithms focus more on the fraudulent instances. In this paper, the Synthetic Minority Over-sampling Technique (Chawla et al., 2002) and Clustering-Based Over-sampling (Jo & Japkowicz, 2004) have been examined. Finally, some special-purpose learning methods have been discussed as a final way to overcome the class imbalance problem.

After all optimal model specifications have been found via 5-fold cross-validation, the full training set has been used to train all classification models, after which the models have been evaluated on the separate test set. It has been found that the general classification methods already classify the instances very well. However, an increase in performance has been found when either SMOTE is applied for data pre-processing or when the SPLM are used. This was expected, as these models are specifically designed to work with class-imbalanced datasets. The best performing model is the AdaBoost algorithm, for which the data has been pre-processed by SMOTE. However, it only slightly outperforms the RandomForest with SMOTE pre-processing, while the RandomForest requires only about half the computing time. The final conclusion of this paper is that AdaBoost and RandomForest, in combination with SMOTE pre-processing, are superior in detecting credit card fraud. In future research with even larger datasets, RandomForest might be the better algorithm, mostly due to the lower computational power needed. Future research can also include other classification algorithms not considered in this paper, such as Neural Networks or Stochastic Gradient Descent-based classifiers, as well as other data pre-processing techniques not regarded here.


6 Appendix A: Figures and tables

The following figure is taken from the Synthetic Minority Over-sampling Technique paper (Chawla et al., 2002, p.324).


Figure 6: An example of a decision tree


Table 14: Descriptive statistics for the dataset

          mean       sd        median    min       max
Time      94813.86   47488.15  84692.00  0.00      172792.00
V1        0.00       1.96      0.02      -56.41    2.45
V2        0.00       1.65      0.07      -72.72    22.06
V3        -0.00      1.52      0.18      -48.33    9.38
V4        0.00       1.42      -0.02     -5.68     16.88
V5        0.00       1.38      -0.05     -113.74   34.80
V6        0.00       1.33      -0.27     -26.16    73.30
V7        -0.00      1.24      0.04      -43.56    120.59
V8        0.00       1.19      0.02      -73.22    20.01
V9        -0.00      1.10      -0.05     -13.43    15.59
V10       0.00       1.09      -0.09     -24.59    23.75
V11       0.00       1.02      -0.03     -4.80     12.02
V12       -0.00      1.00      0.14      -18.68    7.85
V13       0.00       1.00      -0.01     -5.79     7.13
V14       0.00       0.96      0.05      -19.21    10.53
V15       0.00       0.92      0.05      -4.50     8.88
V16       0.00       0.88      0.07      -14.13    17.32
V17       -0.00      0.85      -0.07     -25.16    9.25
V18       0.00       0.84      -0.00     -9.50     5.04
V19       0.00       0.81      0.00      -7.21     5.59
V20       0.00       0.77      -0.06     -54.50    39.42
V21       0.00       0.73      -0.03     -34.83    27.20
V22       -0.00      0.73      0.01      -10.93    10.50
V23       0.00       0.62      -0.01     -44.81    22.53
V24       0.00       0.61      0.04      -2.84     4.58
V25       0.00       0.52      0.02      -10.30    7.52
V26       0.00       0.48      -0.05     -2.60     3.52
V27       -0.00      0.40      0.00      -22.57    31.61
V28       -0.00      0.33      0.01      -15.43    33.85
Amount    88.35      250.12    22.00     0.00      25691.16
Class     0.00       0.04      0.00      0.00      1.00


Table 15: Correlations between the variables

          Time      Amount    Class
V1        0.117     -0.223    -0.101
V2        -0.011    -0.531    0.091
V3        -0.420    -0.211    -0.193
V4        -0.105    -0.099    0.133
V5        0.173     -0.386    -0.095
V6        -0.063    0.216     -0.044
V7        0.085     0.397     -0.187
V8        -0.037    -0.103    0.020
V9        -0.009    -0.044    -0.098
V10       0.031     -0.102    -0.217
V11       -0.248    0.000     0.155
V12       0.124     -0.010    -0.261
V13       -0.066    0.005     -0.005
V14       -0.099    0.034     -0.303
V15       -0.183    -0.003    -0.004
V16       0.012     -0.004    -0.197
V17       -0.073    0.007     -0.326
V18       0.090     0.037     -0.111
V19       0.029     -0.056    0.035
V20       -0.051    0.339     0.020
V21       0.045     0.106     0.040
V22       0.144     -0.065    0.001
V23       0.051     -0.113    -0.003
V24       -0.016    0.005     -0.007
V25       -0.233    -0.048    0.003
V26       -0.041    -0.003    0.004
V27       -0.005    0.029     0.018
V28       -0.009    0.010     0.010
Time      x         -0.011    0.012
Amount    -0.011    x         0.006
Class     0.012     0.006     x


Table 16: kNN Cross Validation SMOTE’D

Oversampling Rate (%)   Class Distribution   k    MCC         Cohen's Kappa
...                     ...                  ...  ...         ...
100                     (0.95,0.05)          9    0.4265995   0.34441114
100                     (0.99,0.01)          1    0.5667551   0.52509978
100                     (0.99,0.01)          3    0.5350303   0.48569529
...                     ...                  ...  ...         ...
500                     (0.95,0.05)          9    0.4407144   0.39498583
500                     (0.99,0.01)          1    0.6251566   0.6260857
500                     (0.99,0.01)          3    0.5963624   0.58839265
...                     ...                  ...  ...         ...
1000                    (0.95,0.05)          9    0.4659306   0.39498583
1000                    (0.99,0.01)          1    0.6454182   0.62660857
1000                    (0.99,0.01)          3    0.6139313   0.58839265
...                     ...                  ...  ...         ...
2000                    (0.95,0.05)          1    0.6541127   0.63684045
2000                    (0.95,0.05)          3    0.6827902   0.67172372
2000                    (0.95,0.05)          5    0.6365271   0.61515984
...                     ...                  ...  ...         ...
