
Master’s Thesis Econometrics, Operations Research and Actuarial Studies

Classification of a Severely Class-Imbalanced Dataset

Author: David Vroom (D.J. Vroom, s2309939)
Supervisor:

Abstract


Table of Contents

1 Introduction
  1.1 Introduction
  1.2 Research Question
  1.3 Literature Review
2 Methodology
  2.1 Logistic Regression
    2.1.1 Logistic Regression
    2.1.2 Weighted Logistic Regression
    2.1.3 Rare Event Weighted Logistic Regression
    2.1.4 Weighted Elastic Net
  2.2 Random Forest
    2.2.1 Class Imbalance
    2.2.2 Interpretation Methods
  2.3 Performance Metrics
  2.4 Cross-Validation
  2.5 Missing Data
3 Data
4 Results
  4.1 Logistic Regression
  4.2 Random Forest
  4.3 Comparison
  4.4 Missing Data Imputation
  4.5 Interpretation
5 Conclusion and Discussion


1 Introduction

1.1 Introduction

In many real-world classification problems the data have some level of class imbalance, ranging from minor to severe. Severe class imbalance occurs in application domains such as fraud detection and loan default prediction, where fraud and loan default cases represent only a small proportion of the total cases. The less frequently sampled class is referred to as the minority class and the more frequently sampled class as the majority class. A problem arises when the class imbalance is severe and the minority class is the class of interest (Chawla et al., 2004). Several machine learning algorithms have difficulty classifying an instance belonging to the minority class (Stolfo et al., 1997). A reason for this is the accuracy-oriented design of such classifiers (Fernández et al., 2018). Accuracy is defined as the number of correctly predicted instances among the total instances. As a result, correctly predicting instances of the majority class implicitly weighs more heavily than correctly predicting instances of the minority class, and the classifier’s predictions are biased towards the majority class. To illustrate this, assume a majority-to-minority class ratio of 100:1. Classifying all instances as belonging to the majority class then results in an accuracy of 99%, which sounds like a decent classifier. However, all instances belonging to the minority class are misclassified, which is not acceptable when the cost of misclassifying the minority class is high.
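To make this illustration concrete, a minimal sketch in Python (scikit-learn assumed; the 100:1 sample below is synthetic, not the thesis data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic 100:1 sample: 10,000 majority instances (zeros), 100 minority (ones).
y_true = np.array([0] * 10_000 + [1] * 100)

# A trivial "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))             # ~0.990: looks decent
print(recall_score(y_true, y_pred, pos_label=1))  # 0.0: every minority case missed
```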

This study compares two models based on their predictive performance on a class-imbalanced dataset. Furthermore, as the interpretability of a model also determines its usefulness, both models are evaluated on their interpretability as well. The first model is the logistic regression. The logistic regression is considered a standard approach for binary classification problems, but only when the number of parameters is small compared to the sample size. Extensions of the logistic regression, such as ridge regression (Hoerl and Kennard, 1988), the lasso (Tibshirani, 1997) and the elastic net (Zou and Hastie, 2005), provide a way to deal with the situation where the number of parameters is not small compared to the sample size. The other model is the random forest classifier (Breiman, 2001). The random forest is chosen for three reasons: it can model nonlinear patterns in the data, it is relatively easy to interpret (Hastie et al., 2009), and it is robust to outliers. In most studies (Muchlinski et al., 2016; Couronné et al., 2018; Bhattacharyya et al., 2011; Whitrow et al., 2009), the random forest shows superior predictive performance compared with the logistic regression on class-imbalanced datasets.


Both models are applied to a real-world dataset, obtained from a large online Dutch retail company. The dataset contains historical orders, each labeled as fraud or nonfraud. The fraud class is also referred to as the ones and the nonfraud class as the zeros. An order is marked as fraudulent when a fraudster uses leaked credentials to order products on invoice; the order is picked up without being paid for. The fraud class comprises less than 1% of the total orders, so the dataset is severely imbalanced. The cost of misclassification cannot be sufficiently approximated, as the cost of misclassifying an order depends on too many factors. However, the cost of misclassifying a truly fraudulent order is considered higher than that of misclassifying a truly nonfraudulent order. Class imbalance might therefore affect the classification performance of both models.

1.2 Research Question

The aim of this research is to compare the performance of the logistic regression with that of the random forest for a severely class-imbalanced dataset, including an evaluation of corrections for class-imbalanced data. Performance is measured by the ability to predict class labels on unseen data and by the interpretability of the model. Hence, the research question is: “How does the logistic regression perform compared to the random forest for a severely class-imbalanced dataset?”

1.3 Literature Review

The logistic regression is one of the classifiers affected by class imbalance. One reason is that the logistic probabilities are calculated from the predictor variables as well as from the relative proportion of ones and zeros in the dataset. Consequently, the predicted classification probability is sensitive to the majority-to-minority class ratio, and the probabilities are biased towards the majority class (Real et al., 2006). Furthermore, Cramer (1999) demonstrated that the problem of class imbalance also depends on the fit of the model: a poor fit increases the underestimation of the probabilities of the minority class, whereas no adjustments are needed if the fit is perfect. In real-world problems, however, a perfect fit is rather exceptional. One solution to this problem is to adjust the constant coefficient (Real et al., 2006). Another solution is to adjust the threshold of the predicted probabilities at which the samples are classified; Cramer (1999) suggests setting this threshold to the prevalence of ones in the training dataset.


The common assumption that small-sample bias is negligible once samples number in the thousands does not necessarily hold for class-imbalanced data. The study of King and Zeng (2001) demonstrates that small-sample bias is magnified by class imbalance. This problem leads to underestimation of the actual probability of the minority class, because the coefficients are biased in the direction of favoring zeros at the expense of ones. Appropriate adjustments of the bias are given by Firth (1993) and by Cordeiro and McCullagh (1991).

Another problem that may arise when applying the logistic regression is overfitting. Overfitting occurs when the number of covariates is large compared to the sample size. A regularization method may then improve the prediction performance of the logistic regression. The elastic net is a regularization method that penalizes the coefficients (Zou and Hastie, 2005); it combines the penalties of ridge regression (Hoerl and Kennard, 1988) and the lasso (Tibshirani, 1997) and often outperforms both.

The predictive performance of the random forest is also affected by class imbalance in the data. In the case of this model, it is caused by the splitting criterion, which implicitly assumes a balanced class distribution (Liu et al., 2010). During the building process of the decision trees in a random forest, splitting criteria such as the Gini index are based on the prevalence of ones and zeros. As a consequence, this measure is skew-sensitive and results in a bias towards the majority class. Moreover, as class overlap increases, it becomes more difficult to establish discriminative rules. As a result, more general rules are extracted, which are biased towards the majority class (García et al., 2008). On the other hand, with little overlap between the classes, the class distribution of the instances becomes less important (Fernández et al., 2018).


King and Zeng (2001) describe two corrections for choice-based sampling: prior correction and weighting. Prior correction adjusts the constant coefficient after the usual maximum likelihood estimation. Weighting, on the other hand, corrects the logarithm of the likelihood function that is maximized. When weighting is applied to the logistic regression and the elastic net, we speak of the weighted logistic regression and the weighted elastic net, respectively. Both corrections are based on prior information about the proportion of ones in the population and in the sample. According to Xie and Manski (1989), weighting is more robust and can outperform prior correction when the sample size is large and the functional form is misspecified. Adjustment of the small-sample bias in combination with choice-based sampling is proposed by King and Zeng (2001); when weighting is applied as the correction method for choice-based sampling, the resulting model is referred to as the rare event weighted logistic regression (RE-WLR).


2 Methodology

2.1 Logistic Regression

We introduce four versions of the logistic regression. Table 1 gives an overview of the versions and the problems each version addresses. As no version addresses all problems at once, all versions are used to classify the orders and are compared with the random forest.

                               Class imbalance bias   Sample selection bias   Overfitting
Logistic regression
Weighted logistic regression                                    X
RE-WLR*                                 X                       X
Weighted elastic net                                            X                  X

*Rare Event Weighted Logistic Regression

Table 1: Overview of the four logistic regression versions and the biases they address.

2.1.1 Logistic Regression

The variable of interest is the binary response $Y_i$, which in this study corresponds to the label fraud or nonfraud of the $i$th order. The response takes the value 1 when the order is labeled as fraud and 0 when the order is labeled as nonfraud. Instead of modelling the binary response directly, the logistic regression models the probability that $Y_i$ belongs to one of the two classes. In our dataset we have $n$ observations and $p$ covariates. Let the matrix $\boldsymbol{X} \in \mathbb{R}^{n \times (p+1)}$ represent the characteristics of the $n$ observed orders. The first column of $\boldsymbol{X}$ is a vector of ones corresponding to the intercept in the model; the other columns correspond to the $p$ covariates. A row of $\boldsymbol{X}$ is denoted by $\boldsymbol{x}_i$ for $i = 1, \ldots, n$, and each vector $\boldsymbol{x}_i$ is associated with a response $y_i$. Since a strict linear relationship between the input $\boldsymbol{x}_i$ and the probability $\Pr(Y_i = 1 \mid \boldsymbol{x}_i)$ might produce unrealistic values, such as negative values or values above 1, a transformation is required. The logistic function solves this problem. The probability of observing $Y_i = 1$ is then modelled as

$$\Pr(Y_i = 1 \mid \boldsymbol{x}_i) = \pi_i = \frac{e^{\boldsymbol{\beta}'\boldsymbol{x}_i}}{1 + e^{\boldsymbol{\beta}'\boldsymbol{x}_i}},$$

where $\boldsymbol{\beta} = (\beta_0, \boldsymbol{\beta}_1')'$ is a $(p+1)$-vector of coefficients, with $\beta_0$ representing the intercept coefficient.


The coefficients are estimated by maximum likelihood estimation. The log-likelihood of the sample is given by

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \ln \pi_i + (1 - y_i) \ln(1 - \pi_i) \right]. \tag{1}$$

To find the maximum likelihood estimates, the log-likelihood is differentiated with respect to the parameters and set equal to zero. This results in the so-called score function

$$s(\boldsymbol{\beta}) \equiv \frac{\partial \log L(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n} \boldsymbol{x}_i'(y_i - \pi_i) = \boldsymbol{X}'(\boldsymbol{y} - \boldsymbol{\pi}) = \boldsymbol{0}.$$

Since $\pi_i$ is a nonlinear function of $\boldsymbol{\beta}$, an iterative method such as the Newton-Raphson algorithm is used to find the estimates. The predicted probability of a fraudulent order $i$ is then

$$\hat{\pi}_i = \frac{e^{\hat{\boldsymbol{\beta}}'\boldsymbol{x}_i}}{1 + e^{\hat{\boldsymbol{\beta}}'\boldsymbol{x}_i}}.$$

In contrast to the classical linear regression model, the effect of a unit increase of $x_{ij}$ on $\hat{\pi}_i$ cannot be read off directly, because the effect depends on the current value of $x_{ij}$: the relationship between a covariate and the response follows an S-curve rather than a straight line. What the coefficients do tell us is whether an increase of the covariate has a positive or a negative effect on $\hat{\pi}_i$.
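To make the estimation procedure concrete, a minimal Newton-Raphson sketch (Python/NumPy; the function and variable names are illustrative, not from the thesis):

```python
import numpy as np

def logistic_mle(X, y, tol=1e-8, max_iter=50):
    """Newton-Raphson maximization of the logistic log-likelihood in (1).

    X : (n, p+1) design matrix whose first column is ones (intercept),
    y : (n,) binary response. Returns the coefficient vector beta.
    """
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))   # pi_i = e^{b'x}/(1 + e^{b'x})
        score = X.T @ (y - pi)                 # s(beta) = X'(y - pi)
        W = pi * (1.0 - pi)                    # weights of the observed information
        hessian = X.T @ (W[:, None] * X)       # X' diag(W) X
        step = np.linalg.solve(hessian, score) # Newton step
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Each iteration solves $(\boldsymbol{X}'\boldsymbol{W}\boldsymbol{X})\,\text{step} = s(\boldsymbol{\beta})$, the Newton update for the log-likelihood in (1).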

2.1.2 Weighted Logistic Regression

The weighted logistic regression adjusts for choice-based sampling. When the data is severely imbalanced, as is the case with the data under study, undersampling of the majority class results in a much smaller sample size. This is advantageous when computational costs are high or when data collection is expensive. The disadvantage is that less information is included in the model. King and Zeng (2001), however, argue that this is not a problem, as ones are more informative than zeros. In fact, they state that in general no more than two to five times more zeros than ones have to be collected. On the other hand, if collecting zeros is costless, as many zeros as possible should be collected. To see why eliminating instances from the majority class has no major effect on the prediction performance, we have to study the variance-covariance matrix of the maximum likelihood estimator,

$$V(\hat{\boldsymbol{\beta}}) = \left[ \sum_{i=1}^{n} \pi_i (1 - \pi_i)\, \boldsymbol{x}_i' \boldsymbol{x}_i \right]^{-1}.$$


Under the assumption that our model has explanatory power, the ones will on average have a higher $\hat{\pi}_i$ than the zeros. However, when the fit is not perfect, the $\hat{\pi}_i$ of the ones will on average be underestimated (Cramer, 1999) and will therefore lie closer to 0.5. Consequently, $\hat{\pi}_i(1 - \hat{\pi}_i)$ will usually be larger for the ones than for the zeros. In other words, additional ones increase $\hat{\pi}_i(1 - \hat{\pi}_i)$, and thus decrease the variance, more than additional zeros do. From this we conclude that ones are more informative than zeros.

Choice-based sampling, however, leads to inconsistent estimates of the coefficients (Cameron and Trivedi, 2005). Manski and Lerman (1977) proposed the weighted exogenous sampling maximum likelihood estimator to deal with this bias. The weighted log-likelihood for the logistic regression is

$$\ell_w(\boldsymbol{\beta}) = \sum_{i=1}^{n} w_i \left[ y_i \ln \pi_i + (1 - y_i) \ln(1 - \pi_i) \right], \tag{2}$$

where

$$w_i = \frac{\tau}{\bar{y}}\, y_i + \frac{1-\tau}{1-\bar{y}}\,(1 - y_i),$$

with $\tau$ and $\bar{y}$ representing the proportions of ones in the population and in the sample, respectively. Intuitively this makes sense: when the proportion of zeros in the sample is lower than in the population, the weight corresponding to the zeros increases. As the method applies weights to the logistic regression, we also refer to it as the weighted logistic regression.
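A sketch of how the weighting in (2) could be implemented via scikit-learn sample weights (the function name and the value of $\tau$ are illustrative assumptions, not the thesis settings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def wlr_weights(y, tau):
    """Case weights w_i = (tau/ybar) y_i + ((1 - tau)/(1 - ybar)) (1 - y_i)."""
    ybar = y.mean()  # proportion of ones in the (choice-based) sample
    return (tau / ybar) * y + ((1.0 - tau) / (1.0 - ybar)) * (1.0 - y)

# Hypothetical usage, with tau the (assumed known) population fraud rate:
# model = LogisticRegression(penalty=None, max_iter=1000)  # penalty="none" on older scikit-learn
# model.fit(X_train, y_train, sample_weight=wlr_weights(y_train, tau=0.005))
```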

2.1.3 Rare Event Weighted Logistic Regression

King and Zeng (2001) combine the correction of the small sample bias, as described by McCullagh (1989) with the weighted maximum likelihood estimation. The resulting model is called the Rare Event Weighted Logistic Regression (RE-WLR). The study of King and Zeng (2001) argues that the bias of the coefficients in small sample sizes is amplified in the presence of class imbalance. As the data under study is severely imbalanced, the predictive performance of the RE-WLR is included in the comparison with the random forest.

Maximizing the log-likelihood in (2) yields the weighted maximum likelihood estimate $\hat{\boldsymbol{\beta}}_{\text{WMLE}}$. The bias-corrected estimate is then given by

$$\hat{\boldsymbol{\beta}}_{\text{RE-WLR}} = \hat{\boldsymbol{\beta}}_{\text{WMLE}} - \operatorname{bias}(\hat{\boldsymbol{\beta}}_{\text{WMLE}}) = \hat{\boldsymbol{\beta}}_{\text{WMLE}} - (\boldsymbol{X}'\hat{\boldsymbol{W}}\boldsymbol{X})^{-1}\boldsymbol{X}'\hat{\boldsymbol{W}}\boldsymbol{\eta},$$

where the elements of $\boldsymbol{\eta}$ are defined by $\eta_i = \tfrac{1}{2} A_{ii} \left[ (1 + w_1)\hat{\pi}_i - w_1 \right]$, with $w_1 = \tau/\bar{y}$ and $A_{ii}$ the diagonal elements of $\boldsymbol{A} = \boldsymbol{X}(\boldsymbol{X}'\hat{\boldsymbol{W}}\boldsymbol{X})^{-1}\boldsymbol{X}'$. The variance-covariance matrix $\boldsymbol{V}(\hat{\boldsymbol{\beta}}_{\text{RE-WLR}})$ is estimated by

$$\hat{\boldsymbol{V}}(\hat{\boldsymbol{\beta}}_{\text{RE-WLR}}) = \left( \frac{n}{n+p} \right)^{2} \hat{\boldsymbol{V}}(\hat{\boldsymbol{\beta}}_{\text{WMLE}}).$$

Since $\left( \frac{n}{n+p} \right)^{2} < 1$, both the bias and the variance of the estimates $\hat{\boldsymbol{\beta}}_{\text{RE-WLR}}$ are reduced.

In comparison with the logistic regression, this approach yields less biased estimates of $\boldsymbol{\beta}$. Variable selection, which is a form of regularization, could be applied with methods like best-subset selection or forward- or backward-stepwise selection (Hastie et al., 2009), with for example the Akaike information criterion as the performance metric. However, a method like the lasso (which we discuss in the next subsection) is much more convenient: it works with $p > n$, it is much more computationally efficient, and it adds regularization. Another popular regularization method is ridge regression. A penalized version of the RE-WLR has been introduced (Maalouf and Siddiqi, 2014); however, to our knowledge no implementation of a regularized version (lasso, ridge regression or elastic net) of the RE-WLR is available in R or Python.
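A sketch of the bias correction, assuming, following King and Zeng (2001), that $\hat{\boldsymbol{W}}$ is diagonal with elements $\hat{\pi}_i(1-\hat{\pi}_i)w_i$ (Python/NumPy; names are illustrative):

```python
import numpy as np

def rare_event_correction(X, pi_hat, beta_wmle, w, w1):
    """King and Zeng (2001) bias correction applied to the WMLE.

    X        : (n, p+1) design matrix (first column ones),
    pi_hat   : fitted probabilities from the weighted logistic regression,
    beta_wmle: weighted ML estimate of the coefficients,
    w        : (n,) case weights from (2),
    w1       : tau / ybar.
    """
    W = pi_hat * (1.0 - pi_hat) * w                     # assumed diag of W-hat
    XtWX_inv = np.linalg.inv(X.T @ (W[:, None] * X))
    A_diag = np.einsum("ij,jk,ik->i", X, XtWX_inv, X)   # diag of X (X'WX)^{-1} X'
    eta = 0.5 * A_diag * ((1.0 + w1) * pi_hat - w1)
    bias = XtWX_inv @ (X.T @ (W * eta))                 # (X'WX)^{-1} X'W eta
    return beta_wmle - bias
```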

2.1.4 Weighted Elastic Net

The elastic net is a regularized regression method (Zou and Hastie, 2005). It incorporates a linear combination of the $L_1$ and $L_2$ penalties of the lasso and ridge regression, respectively. The ridge regression (Hoerl and Kennard, 1988) performs maximum likelihood estimation subject to a bound on the $L_2$-norm of the coefficients. As a consequence, the coefficients are shrunk towards zero. The weighted elastic net minimization problem is given by

$$\min_{\boldsymbol{\beta} \in \mathbb{R}^{p+1}} \; -\frac{1}{N} \sum_{i=1}^{N} w_i \left[ y_i \ln \pi_i + (1 - y_i) \ln(1 - \pi_i) \right] + \lambda \left[ \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right], \tag{3}$$

where the second term is the penalty term, with $\lambda$ representing the regularization parameter ($\lambda \ge 0$) and $\alpha$ the elastic net mixing parameter ($0 \le \alpha \le 1$). The weights are included to correct for choice-based sampling, as in (2). Because the elastic net is not invariant under scaling of the inputs, the inputs are standardized before solving (3).

2.2 Random Forest

The random forest (Breiman, 2001) predicts the binary response based on multiple decision trees bagged together, where the results are averaged and the trees are de-correlated.

Decision Trees

We focus here on the CART (Classification And Regression Tree) algorithm for building the decision trees (Breiman et al., 1984). The CART algorithm uses a greedy top-down approach that selects the covariate and split-point minimizing a cost function. The splitting process starts at the top of the tree, where all observations belong to a single region; this region is then split into two sub-regions based on a condition on one of the features. The binary splitting process is repeated recursively until a region consists only of instances of a single class or a stopping criterion is reached. More formally, at every split the feature space is divided into two sub-regions conditioned on a covariate $X_j$ and a split-point $s$:

$$R_{\text{left}}(j, s) = \{ X \mid X_j \le s \} \quad \text{and} \quad R_{\text{right}}(j, s) = \{ X \mid X_j > s \}. \tag{4}$$

The decision tree is built on the training dataset. An appropriate measure of the homogeneity of the class distribution within a single region, for a particular subset of the training dataset, is the Gini index, also known as a node impurity measure. It is defined by

$$\text{Gini}(D) = 1 - \sum_{k \in \{0,1\}} \hat{p}_k^2, \tag{5}$$

where $D$ is the dataset of a node and $\hat{p}_0$ and $\hat{p}_1$ are the proportions of zeros and ones in $D$, respectively. The Gini index takes a small value when the proportion $\hat{p}_k$ of a class is close to zero or one, and a large value when it is close to 0.5. The Gini index is used to select the optimal $X_j$ and $s$ in (4). More specifically, the split that maximizes the reduction of the Gini index of the parent node is selected in the tree-growing process. This reduction is computed by subtracting the weighted average of the Gini indices of the child nodes from the Gini index of the parent node and is referred to as the information gain:

$$\text{InfoGain}(X_j, s, D_p) = \text{Gini}(D_p) - \frac{N_{\text{left}}}{N_p}\text{Gini}(D_{\text{left}}) - \frac{N_{\text{right}}}{N_p}\text{Gini}(D_{\text{right}}), \tag{6}$$

where $D_p$, $D_{\text{left}}$ and $D_{\text{right}}$ are the datasets of the parent node, left child node and right child node, respectively, and $N_p$, $N_{\text{left}}$ and $N_{\text{right}}$ are the corresponding numbers of instances. The CART algorithm selects the covariate $X_j$ and split-point $s$ that maximize this information gain.

This splitting process is performed recursively and stops either when the Gini index of a node equals zero or when a stopping criterion is reached. Examples of stopping criteria are the maximum depth of the tree, a minimum number of instances per node, or a minimum value of the information gain. Once the tree is built, a prediction is obtained by sending an instance down the tree until it arrives at one of the terminal nodes. The probability of a class is then equal to the proportion of that class in the terminal node.
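A direct transcription of (5) and (6) as a sketch (Python; binary labels assumed):

```python
import numpy as np

def gini(y):
    """Gini index (5) of a node holding binary labels y."""
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y)
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

def info_gain(y_parent, y_left, y_right):
    """Information gain (6): parent impurity minus weighted child impurities."""
    n = len(y_parent)
    return (gini(y_parent)
            - len(y_left) / n * gini(y_left)
            - len(y_right) / n * gini(y_right))
```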

Bagging and De-correlation


In bagging, multiple trees are grown on bootstrap samples of the training dataset and their predictions are averaged, which reduces variance. Because the trees are grown on similar data, however, their predictions are correlated, which limits this variance reduction. The random forest mitigates this problem by modifying the tree-growing process: instead of using all covariates as candidates for a split, only a random subset of the covariates is considered. This is referred to as de-correlation. Typically, the size of the random subset is $\sqrt{p}$ (Hastie et al., 2009).

The random forest can be generalized further by varying its parameter settings. For example, the bias can be influenced by limiting the depth of the decision trees through early stopping criteria such as the maximum depth of the trees and the minimum node size. The variance can be influenced by varying the number of randomly selected covariates before each split and the number of trees.

For the random forest to deal with categorical variables in a computationally efficient manner, one-hot encoding can be applied as a preprocessing step: a categorical covariate with $K$ different values is transformed into $K$ dummy variables. In contrast to the logistic regression, the reference group is included when the categorical covariate has more than two categories.
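As a sketch with pandas (the column name and values are hypothetical, not from the dataset):

```python
import pandas as pd

# One-hot encode a categorical covariate into K dummy variables. The
# reference category is kept (drop_first=False), as described above.
df = pd.DataFrame({"payment_method": ["invoice", "card", "invoice", "transfer"]})
dummies = pd.get_dummies(df["payment_method"], prefix="payment_method",
                         drop_first=False)
```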

2.2.1 Class Imbalance

The decision tree building process selects sub-optimal splits in the presence of class imbalance (Liu et al., 2010). This is best shown by rewriting the information gain in (6). Recall that the information gain is used to select the best covariate among a subset of covariates, and that maximizing it is equivalent to minimizing the weighted average of the Gini indices of the child nodes. Substituting the Gini index (5) into this weighted average and multiplying the result by $N_p$ leaves the maximization of the following objective function:

$$Q = N_{\text{left}} \left( \frac{N_{\text{left},0}}{N_{\text{left}}} \right)^{2} + N_{\text{left}} \left( \frac{N_{\text{left},1}}{N_{\text{left}}} \right)^{2} + N_{\text{right}} \left( \frac{N_{\text{right},0}}{N_{\text{right}}} \right)^{2} + N_{\text{right}} \left( \frac{N_{\text{right},1}}{N_{\text{right}}} \right)^{2}, \tag{7}$$

where the proportion of zeros in the left child node is substituted by $\hat{p}_{\text{left},0} = N_{\text{left},0} / N_{\text{left}}$; the other terms are derived in a similar manner. In the presence of class imbalance we have $N_{\text{left},1} + N_{\text{right},1} \ll N_{\text{left},0} + N_{\text{right},0}$: the number of ones is much smaller than the number of zeros. As a result, we see from (7) that the minority class has relatively little influence on the information gain of the splits. The root of the problem is that the algorithm does not take the significance of the split for the minority class, $N_{\text{left},1}/N_1$ and $N_{\text{right},1}/N_1$, into account (Liu et al., 2010).


To illustrate this, consider a parent node consisting of 10 ones and 100 zeros. Table 2 shows five different splits. The best split is split 1, where all ones end up in the left child node and all zeros in the right child node; it therefore has the largest value of the objective function $Q$. The worst split is split 2, where the classes are distributed equally between the left and right child nodes. To see the effect of class imbalance, consider splits 3, 4 and 5. Splits 3 and 4 show that the objective function is higher for a perfect split of the majority class than for a perfect split of the minority class. Even more so, split 5, which is not a perfect split of the majority class, is selected over split 3. We therefore conclude that the significance of the split of the majority class weighs more heavily than the significance of the split of the minority class.

          Left child node      Right child node     Q
Split 1   10 ones, 0 zeros     0 ones, 100 zeros    110
Split 2   5 ones, 50 zeros     5 ones, 50 zeros     91.82
Split 3   10 ones, 50 zeros    0 ones, 50 zeros     93.33
Split 4   5 ones, 0 zeros      5 ones, 100 zeros    100.48
Split 5   5 ones, 10 zeros     5 ones, 90 zeros     93.86

Table 2: Five examples of splits. Every split divides the parent node into a left child node and a right child node. For every split the objective function Q is calculated.
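The $Q$ values of Table 2 can be reproduced directly from (7); a short sketch (Python):

```python
def q_value(children):
    """Objective (7): sum over child nodes of N * (p0^2 + p1^2)."""
    total = 0.0
    for ones, zeros in children:        # one (ones, zeros) pair per child node
        n = ones + zeros
        total += n * ((zeros / n) ** 2 + (ones / n) ** 2)
    return total

# The five splits of Table 2 (parent node: 10 ones, 100 zeros).
table2 = {
    "split 1": [(10, 0), (0, 100)],
    "split 2": [(5, 50), (5, 50)],
    "split 3": [(10, 50), (0, 50)],
    "split 4": [(5, 0), (5, 100)],
    "split 5": [(5, 10), (5, 90)],
}
for name, children in table2.items():
    print(name, round(q_value(children), 2))   # 110.0, 91.82, 93.33, 100.48, 93.86
```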

2.2.2 Interpretation Methods

The random forest has the advantage over the logistic regression that it can model complex relationships between the covariates and the response. To gain insight into these complex relationships, four approaches are explained. These approaches help to identify the important covariates, plot the relationship between a covariate and the response, explain individual predictions, and show whether high or low feature values move the predicted probability towards 0 or 1.


The first approach is the variable importance measure. The importance of covariate $X_j$ is measured by taking the sum of the weighted information gain over all internal nodes $t$ of a tree $T$ in which covariate $X_j$ is involved; note that, by definition, every internal node is associated with a split. Subsequently, the result is averaged over all $N_T$ trees in the forest. Hence, the variable importance of covariate $X_j$ is given by

$$\text{Imp}(X_j) = \frac{1}{N_T} \sum_{T} \sum_{\{t \in T : v(t) = X_j\}} \frac{N_t}{n} \, \text{InfoGain}_t(X_j, s, D_t),$$

where $\text{InfoGain}_t(X_j, s, D_t)$ is the information gain corresponding to node $t$ and $N_t/n$ is the proportion of samples reaching node $t$. To consider only the nodes in which the covariate is involved, the formula conditions on the covariate $v(t)$ used in node $t$.

The second approach is the partial dependence plot (Hastie et al., 2009). This plot is informative about the global relationship between a covariate and the response. More formally, the partial dependence function gives the average marginal effect of a covariate on the prediction outcome. Let $\hat{f}(\boldsymbol{x}^{(i)})$ be the prediction of the random forest built on the training dataset, where $\boldsymbol{x}^{(i)}$ is the characteristic $p$-vector of order $i$. Furthermore, let $C$ be the complement set of $j$, so that $\{j\} \cup C = \{1, 2, \ldots, p\}$. The relationship between a single covariate and the response can then be estimated by

$$\bar{f}_j(X_j) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(X_j, \boldsymbol{x}_C^{(i)}\right),$$

where $X_j$ is the covariate of interest and $\boldsymbol{x}_C^{(i)}$ is the $(p-1)$-vector containing the remaining characteristic values of order $i$. In other words, a point $\bar{f}_j(x_j)$ on the partial dependence plot is computed by replacing the value of covariate $X_j$ with $x_j$ for every order and averaging the resulting $n$ predicted values. This method assumes independence between $X_j$ and the complementary covariates; when this assumption is violated, the plot $\bar{f}_j(X_j)$ is biased. Furthermore, because the partial dependence plot takes the marginal average, heterogeneous effects are not captured and might bias the conclusions drawn from the plot.


The third approach is the SHAP method. Following Lundberg et al. (2018), the prediction outcome of order $i$ can be explained as

$$\hat{f}\left(\boldsymbol{x}^{(i)}\right) = \phi_0 + \sum_{j=1}^{p} \phi_j^{(i)},$$

where $\phi_j^{(i)}$ is the SHAP value, or feature attribution value, corresponding to covariate $X_j$ for order $i$.

The fourth approach is the SHAP summary plot. Each point on the summary plot is a SHAP value corresponding to a covariate and an order. The covariates are ordered by the sum of the absolute feature attribution values:

$$\text{Imp}_{\text{SHAP}}(X_j) = \sum_{i=1}^{n} \left|\phi_j^{(i)}\right|.$$

Furthermore, the color of a point represents the corresponding feature value. The summary plot is very informative, because it gives a sense of the variable importance and of whether high or low feature values move the predicted probability towards 0 or 1. As a concluding remark: these four interpretation methods rely on the assumption that the covariates are uncorrelated with each other.
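A sketch using the `shap` package (the synthetic data below stands in for the confidential order data; exact return shapes of `shap_values` vary across shap versions):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in for the (confidential) order data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 1.5).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes the SHAP values phi_j^(i) efficiently for tree ensembles.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)

# shap.summary_plot(shap_values, X)   # variable importance and effect direction
```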

2.3 Performance Metrics

In a binary classification problem, the confusion matrix represents the performance of the model at a given threshold. Each predicted value falls into one of four categories: True Positive (TP), False Positive (FP), False Negative (FN) or True Negative (TN); see table 3 for an overview. Some important metrics derived from it are presented in table 4.

                 Actual: YES   Actual: NO
Predicted: YES   TP            FP
Predicted: NO    FN            TN

Table 3: Confusion matrix.


Metric                                   Formula                           Probability
Accuracy                                 (TP + TN) / (TP + TN + FP + FN)
Precision                                TP / (TP + FP)                    $\Pr(Y = 1 \mid \hat{Y} = 1)$
Recall = True Positive Rate (TPR)        TP / (TP + FN)                    $\Pr(\hat{Y} = 1 \mid Y = 1)$
Specificity = True Negative Rate (TNR)   TN / (TN + FP)                    $\Pr(\hat{Y} = 0 \mid Y = 0)$

Table 4: Overview of four important metrics, with the formula and corresponding probability.

Because the predicted probabilities can be biased (Cramer, 1999) and the cost distribution of a FP versus a FN is unknown, a classification threshold cannot be determined beforehand. A rank metric, on the other hand, only measures how well the positive instances are ordered relative to the negative instances. The area under the receiver operating characteristic curve (AUC-ROC) is such a rank metric. The ROC curve plots the recall against 1 − specificity for different thresholds. Another rank metric is the area under the precision-recall curve (AUC-PR); the PR curve plots the precision against the recall for different thresholds. Note that the AUC-PR focuses on the ones, whereas the AUC-ROC considers the classification of both classes. Therefore, the AUC-PR is more informative than the AUC-ROC when comparing models in the presence of class imbalance (Saito and Rehmsmeier, 2015). Furthermore, Davis and Goadrich (2006) proved that a curve dominates in ROC space if and only if it dominates in PR space, and showed that an algorithm that optimizes the AUC-ROC is not guaranteed to optimize the AUC-PR. Hence, the PR curve has advantageous properties over the ROC curve in the presence of class imbalance.
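A sketch of computing both rank metrics with scikit-learn (the toy labels and scores below are illustrative; average precision is a common estimate of the AUC-PR):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

# Toy example: true labels and predicted fraud probabilities.
y_test = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([.02, .05, .10, .10, .20, .30, .40, .60, .70, .90])

print(roc_auc_score(y_test, scores))            # AUC-ROC
print(average_precision_score(y_test, scores))  # AUC-PR (average precision)
precision, recall, thresholds = precision_recall_curve(y_test, scores)
```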

2.4 Cross-Validation


For every candidate parameter setting, the training dataset is partitioned into 10 folds; the model is trained on nine folds and evaluated on the left-out fold, and the cross-validation AUC-PR is computed by

$$\text{CV}_{\text{AUC-PR}} = \frac{1}{10} \sum_{k=1}^{10} \text{AUC-PR}_k,$$

where $\text{AUC-PR}_k$ is the area under the precision-recall curve for left-out partition $k$. The parameter setting that yields the maximum cross-validation AUC-PR is defined as the optimal setting.
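A sketch of this tuning procedure with scikit-learn (the parameter grid is illustrative, not the settings used in this study):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Select the parameter setting maximizing the 10-fold cross-validated
# AUC-PR (estimated here via average precision).
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=500),
    param_grid={"max_depth": [None, 10, 20], "min_samples_leaf": [1, 5, 20]},
    scoring="average_precision",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
# grid.fit(X_train, y_train); grid.best_params_
```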

2.5 Missing Data

Missing data is an undesirable occurrence for any modelling exercise, and the best way to deal with it is to prevent it. In real-world applications, however, it is often not preventable, and adequate measures should be taken. According to Dong and Peng (2013), a researcher must address three aspects: the proportion of missing data, the patterns of missing data and the missing data mechanism.

The acceptable proportion of missing data is not strictly defined. Schafer (1999) and Bennett (2001) argued that missing data below 5% and 10%, respectively, can be considered inconsequential. Even if the proportion of missing data is below 5% or 10%, the other two aspects remain relevant (Tabachnick et al., 2007).

Patterns of missing data are divided into three categories: (1) the univariate missing data pattern, (2) monotonic pattern of missing data on 2 or more variables, or (3) arbitrary pattern of missing data (Cameron and Trivedi, 2005).

The third aspect is the missing data mechanism, which is divided into three categories (Rubin, 1976). The first category is data that are Missing At Random (MAR). Let $Y = (Y_{\text{obs}}, Y_{\text{mis}})$ and let $R$ be the matrix of missingness, where for observation $i$ and variable $j$

$$R_{ij} = \begin{cases} 1 & \text{if } Y_{ij} \text{ is missing}, \\ 0 & \text{if } Y_{ij} \text{ is not missing}. \end{cases}$$

The MAR condition is then

$$\Pr(R \mid Y, \xi) = \Pr(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \xi) = \Pr(R \mid Y_{\text{obs}}, \xi), \tag{8}$$

where $\xi$ represents the unknown parameter(s) of the missingness distribution. For example, if missingness is determined by a single fully observed variable, $R_{ij}$ can be described by a logistic regression on that variable and the data are MAR. Under MAR, the missing data mechanism is unrelated to the missing values themselves and depends only on the observed data. In other words, knowledge about $Y_{\text{mis}}$ does not carry any additional information about $R$ once $Y_{\text{obs}}$ is taken into account. As a result, imputing values is possible.

The second category of the missing data mechanism is Missing Completely At Random (MCAR). This is a special case of MAR, with condition

$$\Pr(R \mid Y, \xi) = \Pr(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \xi) = \Pr(R \mid \xi).$$

In other words, the missing data mechanism is unrelated both to the missing values themselves and to the other observed values, and we can view the observed data as a valid representation of the population. When this assumption holds, the missing data mechanism introduces no bias; removing observations containing missing values then only increases the standard errors.

The third category is when the missing data mechanism is Missing Not At Random (MNAR). This occurs when the missingness depends on the missing values themselves. If this is the case, the researcher has to specify the missingness mechanism and incorporate it into the data analysis in order to obtain unbiased estimates (Dong and Peng, 2013).

Missing Data Techniques

The most common traditional missing data techniques are ad hoc methods such as list-wise deletion and substitution with the variable mean (Peugh and Enders, 2004). List-wise deletion deletes all observations containing missing values from the dataset. A drawback of mean and regression imputation is the lack of variability in the hypothetically complete dataset. Methods like list-wise deletion can be used under the MCAR assumption, but they reduce the sample size and therefore result in larger standard errors. If the missing data mechanism is MAR, other techniques are needed. The recommended missing data procedures are then maximum likelihood (ML) estimation and Multiple Imputation (MI). Under MAR and MCAR, ML and MI produce unbiased and efficient parameter estimates; when the data are MNAR, however, these methods produce biased results.


Multiple Imputation produces parameter estimates based on regression imputation. The missing values are imputed $m$ times with a random component, so that we obtain $m$ slightly different imputed datasets and uncertainty is taken into account. The next step is to combine the results of the $m$ parameter estimates (Rubin, 1976). A drawback of the ML and MI methods for generalized linear models is that they require the specification of a parametric model for the missing covariates (Ibrahim et al., 2005).

A basic approach for dealing with missing values in the random forest is the On The Fly Imputation method, where the missing values are imputed only during a split: the missing values of the particular covariate are imputed by the mean or mode of the parent node. According to Wan et al. (2015), a better method, when computing power is not an issue, is the missForest algorithm (Stekhoven and Bühlmann, 2012). This method is especially useful when the data include complex interactions or nonlinear relationships. As the random forest intrinsically averages over multiple trees, there is no need to create multiple imputed datasets to account for uncertainty. This approach is applicable as an imputation method for both the elastic net regression and the random forest model.
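missForest itself is an R package; a sketch of a close Python analogue (scikit-learn's `IterativeImputer` with a random forest estimator, not the implementation used in the thesis):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Iterates random forest regressions over the covariates with missing
# values until the imputations stabilise, in the spirit of missForest.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100),
    max_iter=10,
    random_state=0,
)
# X_imputed = imputer.fit_transform(X_with_missing)
```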

Summary


3 Data

The dataset is provided by a large online Dutch retail company. It consists of approximately 6 million orders over a period of 11 months, each labeled either as fraud or nonfraud. Less than 1% of the orders are marked as fraud, which makes the dataset severely class imbalanced. Furthermore, 81 covariates are included as predictors, 9 of which contain missing values. The missing values occur mostly in the same orders; we therefore deal with a monotone pattern. The proportion of missing data is 1.1% of the total orders. This number is fairly small; however, missing values occur in 6.4% of the fraudulent orders. Based on expert knowledge of the data, the missing data mechanism is classified as MAR. Due to the small amount of missing data in combination with the relatively large dataset, we applied list-wise deletion first and afterwards checked whether imputing the missing data with the missForest algorithm makes a difference. For the latter imputation method, we used all covariates, including the response, to produce unbiased estimates (Graham, 2009).


4 Results

In this chapter the results of the study are presented and discussed. The AUC-PR is used as the predictive performance measure. The PR curve is based on the test dataset, which contains 25% of the total orders and is the same for every model. First, the four versions of the logistic regression are compared with each other, investigating whether an adjustment for class imbalance, choice-based sampling or overfitting leads to better predictive performance. Next, the effect of class imbalance on the random forest is analyzed by comparing the predictive performance for different majority-to-minority class ratios in the training dataset. This is followed by the overall comparison of the predictive performance of the four versions of the logistic regression with the random forest. To select the best parameter settings for the weighted elastic net and the random forest, 10-fold cross-validation is used. Furthermore, to assess the influence of missing values, missing data is imputed using the missForest algorithm and the results are compared with the situation after list-wise deletion. Finally, the interpretability of both the logistic regression and the random forest is evaluated.

4.1 Logistic Regression


An interpretation of the predictive performance of the weighted logistic regression is as follows: when the company accepts a recall of 80%, that is, it detects 80% of the true fraud cases, then approximately 80% of the cases marked as fraud are truly fraudulent.

Figure 1: Predictive performance of the four versions of the standard logistic regression. The figure shows the PR curve of the logistic regression, the weighted logistic regression, the RE-WLR and the weighted elastic net. The models are trained on a class balanced dataset.

4.2 Random Forest

The predictive performance of the random forest is expected to be affected by the class imbalance. Figure 2 shows that, relative to the balanced training dataset, the predictive performance increases considerably when nonfraud instances are added to the training dataset. However, this increasing effect diminishes beyond 95% undersampling of the nonfraud instances. An explanation of this result is that more instances add more information to the model. Furthermore, we conclude that the overlap between the two classes is small, because the predictive performance of the model increases as the class imbalance increases.


Figure 2: Predictive performance of the random forest for different sampling strategies. Four sampling strategies are considered: a majority-to-minority class ratio of 1:1, 95% undersampling of the majority class, 35% undersampling of the majority class, and no undersampling.

4.3 Comparison

This section compares the predictive performance of the versions of the logistic regression with that of the random forest. The AUC-PRs of the models are given in table 5 for three different sampling strategies. We conclude from the results that the random forest yields a higher AUC-PR than any of the logistic regression models. This holds both for the random forest trained on the correlation-adjusted dataset and for the random forest trained on the original dataset. Furthermore, as nonfraud instances are added to the training dataset, the AUC-PRs of all models increase. However, no undersampling shows little improvement over 95% undersampling, while computational costs increase as the training dataset grows drastically. The best predictive performance is obtained by training the random forest on the original dataset with no undersampling. However, the assumption of uncorrelated covariates is then violated, and therefore we cannot rely on the interpretation approaches of the model.


                           1:1      95% undersampling   No undersampling
Logit                      0.789    0.872               0.894
Logit weighted             0.849    0.892               0.894
RE-WLR                     0.601    0.892               0.894
Elastic weighted           0.811*   0.892**             0.894**
Random forest              0.859    0.918               0.936

On correlation-adjusted data:
Random forest corr. adj.   0.863    0.910               0.918

*λ = 0.0001, **λ = 0

Table 5: Overview of the AUC-PRs of the models under three sampling strategies.

Figure 3: panels (a), (b) and (c).


4.4 Missing Data Imputation

The missing data is imputed using the missForest algorithm (Stekhoven and Bühlmann, 2012). As a result, the training dataset is expanded with fraud cases and slightly extended with nonfraud cases. Because the fraud class on which the models are trained is enlarged, we would expect an increase in predictive performance in comparison with 3(b). However, figure 4 shows that the prediction on the test dataset has not improved. An explanation could be the relatively small proportion of missing data in the dataset. Hence, we conclude that list-wise deletion of 1.1% of the instances in the dataset, or equivalently 6.4% of the instances of the fraud class, does not significantly affect the predictive performance of the models.

Figure 4: Predictive performance evaluation after imputing missing data. The PR curves of the RE-WLR and the weighted logistic regression coincide with the weighted elastic net result and are therefore not included in the figure.

4.5 Interpretation

This section evaluates the interpretability of the logistic regression and the random forest. The models are trained on the dataset after the majority class is undersampled by 95%. Furthermore, as the interpretation approaches of the random forest are sensitive to pairwise correlation between the covariates, the random forest trained on the correlation-adjusted dataset is considered.

Logistic Regression


The weighted elastic net determines for this model which variables should be included. After selecting an optimal regularization parameter by cross-validation, we found that under 95% undersampling of the majority class all covariates are selected for inclusion in the model.

Random Forest

The random forest is interpreted by using four approaches: the variable importance plot, the partial dependence plot, the SHAP method and the SHAP summary plot.

The variable importance plot shows the relative importance of the covariates in the model. The variable importance depends on how often a covariate is selected for a split and on the information gain resulting from those splits. Figure 5 shows the relative importance of the 30 most important covariates. It can be observed that covariates 24 and 28 are the two most important.

Figure 5: Variable importance plot.


Figure 6: Partial dependence plots of (a) feature 23 and (b) feature 27.

SHAP values (Lundberg and Lee, 2017) explain the attribution of each feature value to the final prediction for a single order. Figure 7 shows that for this particular fraudulent order, the values of features 3 and 24 have an increasing effect on the probability of fraud, whereas the value of feature 28 has the opposite effect. Furthermore, we can derive from the figure that the value of feature 3 contributes approximately 27% to the final prediction value of 73%, while the value of feature 28 contributes approximately -5%.

Figure 7: SHAP values of a fraudulent order.


5 Conclusion and Discussion

This work compares the predictive performance of four versions of the logistic regression with the random forest on a real-world dataset in the presence of severe class imbalance. The models are trained on the whole dataset and on two subsamples obtained by choice-based sampling. The four versions of the logistic regression are the standard logistic regression; the weighted logistic regression, which corrects for choice-based sampling; the RE-WLR, which corrects for both choice-based sampling and class imbalance; and the weighted elastic net, which corrects for both choice-based sampling and overfitting. The predictive performance is measured using the PR curve of the predicted values on the test dataset. The interpretability of the models is evaluated as well, as it is an important aspect in many real-world applications.

The research leads to the conclusion that the random forest outperforms the logistic regression on predictive performance. This result is attributed to the fact that the random forest can model complex relationships between the covariates and the response and reduces variance by design, whereas the logistic regression is restricted to relationships that are linear in the log-odds. Our findings are consistent with other comparative studies (Muchlinski et al., 2016; Couronné et al., 2018; Bhattacharyya et al., 2011; Whitrow et al., 2009). Furthermore, although the random forest models complex relationships, several approaches exist to interpret the model, such as the variable importance plot, the partial dependence plot, the SHAP method and the SHAP summary plot. These approaches appeared to be very informative from the company's perspective. Again, the random forest shows superior performance over the logistic regression. Hence, based on both predictive performance and interpretability, we find that the random forest performs better than any of the logistic regression models.


As noted in the literature review, when the overlap between the classes is small, class imbalance becomes less important (Fernández et al., 2018).

Based on our research, we recommend the company to use the random forest model for classification of the orders. We recommend undersampling the majority class by 95%, because the loss of information is small and computational costs decrease substantially. The company can then choose the classification threshold for the probabilities of fraud based on the precision-recall curve. For example, when the company chooses to detect 80% of the true fraud cases, approximately 90% of the cases marked as fraud are truly fraudulent. A higher threshold results in a lower recall but a higher precision, and vice versa. Furthermore, the variable importance plot and the SHAP summary plot show that the random forest attributes the highest importance to features 24 and 28. According to the SHAP summary plot, high values of features 24 and 28 are in general indicative of a decrease and an increase of the predicted probability, respectively. Marginal relationships between the covariates and the response can be communicated with the partial dependence plots, and SHAP values can be used to explain individual predictions. The results demonstrated in this research need not hold for other datasets. However, the methodological approach used in this study can be applied to other applications as well, as it addresses class imbalance, choice-based sampling and regularization.


6 Bibliography

Bennett, Derrick A (2001). How can I deal with missing data in my study? Australian and New Zealand journal of public health 25 (5), 464–469.

Bhattacharyya, Siddhartha, Sanjeev Jha, Kurian Tharakunnel, and J Christopher Westland (2011). Data mining for credit card fraud: A comparative study. Decision Support Systems 50 (3), 602–613.

Breiman, Leo (2001). Random Forests. Machine Learning 45 (1), 5–32.

Breiman, Leo, Jerome Friedman, Richard A Olshen, and Charles J Stone (1984). Classification and Regression Trees. Chapman & Hall/CRC.

Cameron, A Colin and Pravin K Trivedi (2005). Microeconometrics: Methods and Applications. Cambridge University Press.

Chawla, Nitesh V, Nathalie Japkowicz, and Aleksander Kotcz (2004). Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations Newsletter 6 (1), 1–6.

Cordeiro, Gauss M and Peter McCullagh (1991). Bias Correction in Generalized Linear Models. Journal of the Royal Statistical Society: Series B (Methodological) 53 (3), 629–643.

Couronné, Raphael, Philipp Probst, and Anne-Laure Boulesteix (2018). Random forest versus logistic regression: A large-scale benchmark experiment. BMC Bioinformatics 19 (1), 270.

Cramer, Jan Salomon (1999). Predictive Performance of the Binary Logit Model in Unbalanced Samples. Journal of the Royal Statistical Society: Series D (The Statistician) 48 (1), 85–94.

Davis, Jesse and Mark Goadrich (2006). The Relationship Between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240. ACM.

Dong, Yiran and Chao-Ying Joanne Peng (2013). Principled missing data methods for researchers. SpringerPlus 2 (1), 222.


Fernández, Alberto, Salvador García, Mikel Galar, Ronaldo C Prati, Bartosz Krawczyk, and Francisco Herrera (2018). Learning from Imbalanced Data Sets. Springer.

Firth, David (1993). Bias Reduction of Maximum Likelihood Estimates. Biometrika 80 (1), 27–38.

García, Vicente, Ramón Alberto Mollineda, and José Salvador Sánchez (2008). On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Analysis and Applications 11 (3-4), 269–280.

Graham, John W (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology 60, 549–576.

Haixiang, Guo, Li Yijing, Jennifer Shang, Gu Mingyun, Huang Yuanyue, and Gong Bing (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications 73, 220–239.

Hastie, T., R. Tibshirani, and J.H. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer.

Hoerl, A and R Kennard (1988). Ridge regression. In Encyclopedia of Statistical Sciences, Vol. 8.

Ibrahim, Joseph G, Ming-Hui Chen, Stuart R Lipsitz, and Amy H Herring (2005). Missing-Data Methods for Generalized Linear Models: A Comparative Review. Journal of the American Statistical Association 100 (469), 332–346.

Jeni, László A, Jeffrey F Cohn, and Fernando De La Torre (2013). Facing Imbalanced Data - Recommendations for the Use of Performance Metrics.

King, Gary and Langche Zeng (2001, Spring). Logistic Regression in Rare Events Data. Political Analysis 9, 137–163.

Liu, Wei, Sanjay Chawla, David A Cieslak, and Nitesh V Chawla (2010). A Robust Decision Tree Algorithm for Imbalanced Data Sets. In Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 766–777. SIAM.

López, Victoria, Alberto Fernández, Jose G Moreno-Torres, and Francisco Herrera (2012). Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Systems with Applications 39 (7), 6585–6608.


Lundberg, Scott M and Su-In Lee (2017). A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774.

Maalouf, Maher and Mohammad Siddiqi (2014). Weighted logistic regression for large-scale imbalanced and rare events data. Knowledge-Based Systems 59, 142–148.

Manski, Charles F and Steven R Lerman (1977). The Estimation of Choice Probabilities from Choice Based Samples. Econometrica: Journal of the Econometric Society, 1977– 1988.

McCullagh, Peter (1989). Generalized Linear Models. Chapman and Hall/CRC.

Muchlinski, David, David Siroky, Jingrui He, and Matthew Kocher (2016). Comparing Random Forest with Logistic Regression for Predicting Class-Imbalanced Civil War Onset Data. Political Analysis 24 (1), 87–103.

Nemes, Szilard, Junmei Miao Jonasson, Anna Genell, and Gunnar Steineck (2009). Bias in odds ratios by logistic regression modelling and sample size. BMC Medical Research Methodology 9 (1), 56.

Peugh, James L and Craig K Enders (2004). Missing data in educational research: A review of reporting practices and suggestions for improvement. Review of Educational Research 74 (4), 525–556.

Real, Raimundo, A Márcia Barbosa, and J Mario Vargas (2006). Obtaining Environmental Favourability Functions from Logistic Regression. Environmental and Ecological Statistics 13 (2), 237–245.

Rubin, Donald B (1976). Inference and Missing Data. Biometrika 63 (3), 581–592.

Saito, Takaya and Marc Rehmsmeier (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 10 (3), e0118432.

Schafer, Joseph L (1999). Multiple Imputation: A Primer. Statistical Methods in Medical Research 8 (1), 3–15.

Stekhoven, Daniel J and Peter Bühlmann (2012). MissForest – non-parametric missing value imputation for mixed-type data. Bioinformatics 28 (1), 112–118.

Stolfo, Salvatore J, David W Fan, Wenke Lee, Andreas L Prodromidis, and Philip K Chan (1997). Credit Card Fraud Detection Using Meta-Learning: Issues and Initial Results. In AAAI-97 Workshop on Fraud Detection and Risk Management.

Tabachnick, Barbara G, Linda S Fidell, and Jodie B Ullman (2007). Using Multivariate Statistics, Volume 5. Pearson Boston, MA.

Tibshirani, Robert (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine 16 (4), 385–395.

Wan, Y, S Datta, DJ Conklin, and M Kong (2015). Variable selection models based on multiple imputation with an application for predicting median effective dose and maximum effect. Journal of Statistical Computation and Simulation 85 (9), 1902–1916.

Whitrow, Christopher, David J Hand, Piotr Juszczak, David Weston, and Niall M Adams (2009). Transaction aggregation as a strategy for credit card fraud detection. Data Mining and Knowledge Discovery 18 (1), 30–55.

Xie, Yu and Charles F Manski (1989). The Logit Model and Response-Based Samples. Sociological Methods & Research 17 (3), 283–302.
