
University of Amsterdam

Faculty of Economics and Business, Amsterdam School of Economics

Imbalanced Learning in Classification Problems

A study on AUC optimisation and Boosting Methods

Sven van Dam

Master's thesis in Econometrics, track Big Data and Business Analytics

Supervisor

Dr. N.P.A. van Giersbergen University of Amsterdam


Statement of Originality

This document is written by student Sven van Dam who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
2 Theory
   2.1 Imbalanced Learning
   2.2 Evaluation Metrics
   2.3 AUC Optimisation
   2.4 Ensemble Methods
       2.4.1 Bagging
       2.4.2 Boosting
       2.4.3 Ensembles in Imbalanced Learning
   2.5 Sampling Strategies
   2.6 Statistical Testing for Algorithm Comparison
       2.6.1 The Non-parametric Testing Framework for Multiple Comparisons
       2.6.2 Advanced Non-parametric Tests and Adjustments
3 Method
   3.1 Approximate Gradient Based AUC Optimisation
   3.2 Ensemble Based AUC Optimisation
   3.3 Comparison Design
4 Data
5 Results
   5.1 Approximate Gradient Based AUC Optimisation Results
   5.2 Ensemble Based AUC Optimisation Results
6 Discussion


1 Introduction

A common problem faced when machine learning techniques are applied to real-world classification problems is the presence of imbalanced classes. We speak of imbalanced classes when one or more of the possible categories is underrepresented in the data. When a classifier is trained with regular methods in order to optimise accuracy, classification performance on the minority class(es) will suffer, since these observations contribute a comparatively small fraction of the total loss. In many applications, however, the minority classes are of primary interest. For instance, in online marketing one is specifically interested in predicting who will click on an online advertisement. Another example is classifying the nature of tumours: a doctor would not aim at obtaining maximum accuracy, but would be more than willing to misclassify some benign tumours in order to identify the malignant tumours more accurately.

The above examples indicate that accuracy might not be an optimal performance metric, since a user would prefer a lower accuracy with better performance on the minority class over a classifier that obtains a high accuracy but predicts the minority class poorly. Hence, performance metrics that are more suitable for imbalanced problems have been developed. Perhaps the most widely used metric is the Area Under the ROC Curve (AUC). The aim of this study is to compare existing techniques that aim to optimise (approximations of) the AUC and to propose extensions to these methods.

Results show that methods which optimise approximate AUC values lead to undesirable results. The boosting methods show no significant difference across all datasets, although a newly proposed algorithm named RankBoost.C1 exhibits significantly better performance on a specific subset of difficult imbalanced problems.

The rest of this thesis is structured as follows. Section 2 discusses theory and literature relevant to this subject. Section 3 presents the methods applied in this study. Section 4 discusses the data. Section 5 presents the results. Section 6 discusses these results. Finally, Section 7 concludes.

2 Theory

2.1 Imbalanced Learning

In general, classification tasks in machine learning are defined by an algorithm or function $f(\cdot)$ that takes a feature vector $X_i$ and returns a score $r_i$. Parameters in this function are changed by a search process in such a way that some criterion function is optimised based on a training dataset $D$ with features $X$ and targets $Y$. The trained algorithm has then 'learned' from the training data and can predict target values for new observations for which the true target value is not (yet) known.


One speaks of imbalanced learning when an algorithm is trained using a dataset in which observations of a certain type occur relatively seldom compared to other types of observations. Most often, this problem is addressed in classification problems where one class contains only a few observations. The problem can also occur in regression problems if there are only a few observations with a target value on a given sub-domain. In this study only classification problems are addressed. More specifically, only binary classification is considered, where only two classes occur: $C_0$ (the negative class) and $C_1$ (the positive class). In binary imbalanced problems, exactly one class is underrepresented (the minority class, $C_{min}$) and one is overrepresented (the majority class, $C_{maj}$). The degree of imbalance can then be expressed by the Imbalance Rate (IR): $\mathrm{IR} = |C_{maj}| / |C_{min}|$.

When a classifier is trained to maximise accuracy (minimise loss), the minority class contributes only a small fraction of the total loss. The training therefore focuses primarily on performance on the majority class, and performance on the minority class is largely sacrificed: large decreases in performance on the minority class are offset by smaller increases in performance on the majority class. The result is that the trained classifier will often predict the minority class poorly and the majority class well.

This need not be a problem when the user of the classifier has no preferences regarding which classes should be classified well. When the user has a specific interest in performance on the minority class, however, there is a problem. Examples of this were already mentioned in Section 1. Correctly predicting malignant tumours is more important than correctly predicting benign tumours, since a false positive (falsely diagnosing cancer) is less damaging to the patient than a false negative (not diagnosing cancer where it should have been). Similar reasoning holds for marketing and sales problems (better to invest a little in someone who does not want your product than to miss out on a potential customer), fraud detection (better to scrutinise a regular transaction than to allow a fraudulent transaction into your system) and many more applications.

When the preferences of the user are not represented well by the class distribution in the data, steps need to be taken. Branco et al. (2016) provide an overview of methods that are applied to overcome problems associated with imbalance. The vast majority of the solutions encompass some manual step which explicitly gives more importance to observations from the minority class. This can be done in multiple ways. The three main approaches described by Branco et al. (2016) are data pre-processing, special-purpose learning methods and result post-processing.

For the first approach, data pre-processing, changes are made to the data on which a classifier is trained. By adding more (possibly artificial) observations to the minority class or by discarding observations of the majority class, the class distribution can be changed in such a way that it represents the preferences of the user well. For the second approach, special-purpose learning methods, functions in the classification algorithm are changed or entirely new functions are created in such a way that more weight is given to loss in the minority class. The last approach, post-processing of results, changes the way in which the scores $r_i$ generated by a 'regular' algorithm trained on the original data are handled. The most straightforward way in which this can be done is to change the classification threshold $\tau$ that a score $r_i$ needs to exceed in order for the observation to be predicted to be in $C_1$. It is important to note that in certain cases, different approaches are proven to result in problems that are equivalent (Elkan, 2001). As an example, duplicating all minority samples $k$ times or multiplying the contribution of the minority class to the loss by a factor $k$ results in classifiers that are exactly the same.
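To make this equivalence concrete, the following sketch (using scikit-learn's LogisticRegression purely as an illustration; the thesis does not prescribe any particular implementation) fits one model on data in which every minority observation is duplicated $k$ times and another in which minority observations instead receive a sample weight of $k$. Both fits minimise the same objective, so the coefficients should agree up to numerical tolerance.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Hypothetical imbalanced toy data; class 1 is the minority class.
    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
    k = 9  # importance factor for the minority class

    # Approach 1: duplicate every minority observation k times (k - 1 extra copies).
    idx_min = np.where(y == 1)[0]
    X_dup = np.vstack([X] + [X[idx_min]] * (k - 1))
    y_dup = np.concatenate([y] + [y[idx_min]] * (k - 1))
    clf_dup = LogisticRegression(max_iter=1000).fit(X_dup, y_dup)

    # Approach 2: keep the data as-is but weight the minority losses by a factor k.
    w = np.where(y == 1, float(k), 1.0)
    clf_wgt = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

    # Identical objectives, so the fitted coefficients (nearly) coincide.
    print(np.allclose(clf_dup.coef_, clf_wgt.coef_, atol=1e-3))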

2.2 Evaluation Metrics

When one is faced with a machine learning task where multiple algorithms are applicable, it is desirable to be able to assess how well each of these algorithms performs. In order to obtain a fully transitive comparison among all algorithms, one should aim to summarise the performance of an algorithm in a single statistic. If more than one metric is monitored, isocurves with respect to all metrics would have to be defined to come up with complete and transitive rankings of algorithms (although the function defining these isocurves maps multiple metrics into a scalar, effectively reducing the problem to a one-dimensional ranking problem). When multiple metrics are monitored at the same time without such isocurves, two algorithms may outperform each other on different measures and no conclusion can be drawn on which is preferable. When a single statistic is chosen to evaluate performance, it needs to rank algorithms consistently with how desirable they are. The direction of this ranking is irrelevant (one wants mean squared errors to be low but accuracy to be high) as long as lying closer to a predetermined end of the metric's domain always implies a more desirable algorithm. If this is the case, conclusions based on the metric are fully transitive (A better than B and B better than C implies A better than C) and the user gets a complete and correct impression of how well each available algorithm is suited to the given problem.

In practice it is often difficult to make preferences with respect to algorithm performance (fully) explicit. In classification problems, the costs of type-I and type-II errors would need to be known. These costs might not be constant over time, might be hard or nearly impossible to measure, or might depend on external and possibly unknown factors. If this is the case, defining explicit preferences becomes infeasible and one should choose a performance metric which represents one's preferences as accurately as possible. What one should not do is blindly select common metrics without any regard for how well the rankings obtained by these metrics are in line with how the algorithms are desired to be ranked.

The key issue in imbalanced problems is that the preferences of the user are not represented by the class distribution. Hence, the most common performance metric in classification, accuracy, does not rank classifiers adequately. For example, a classifier with 70% accuracy and 80% of the minority class correctly predicted could be preferred over a classifier with 80% accuracy but only 30% of the minority class correctly classified. It becomes clear that accuracy fails to represent these preferences and is not suitable for imbalanced problems. This idea is widely accepted and many alternative metrics have been proposed. Branco et al. (2016) give an overview of such metrics. Most of these metrics are derived from values in the confusion matrix, as shown in Table 1.

Table 1: Confusion matrix

                      Predicted True          Predicted False
    Real True         True Positive (TP)      False Negative (FN)
    Real False        False Positive (FP)     True Negative (TN)

Based on values in this table, one can construct fractions that describe how well the classifier performs on (a sub-domain of) the data. Accuracy would be defined by

$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}.$  (1)

If the positive class is the majority class, the magnitude of $TP$ will be larger than that of $TN$ and thus dominate the outcome of this metric. Other fractions can be constructed that describe performance on a more specific sub-domain. Such fractions are:

$\text{true positive rate} = \frac{TP}{TP + FN}$  (2)

(often called recall or sensitivity),

$\text{true negative rate} = \frac{TN}{TN + FP}$  (3)

(often called specificity),

$\text{false positive rate} = \frac{FP}{TN + FP},$  (4)

$\text{false negative rate} = \frac{FN}{TP + FN},$  (5)

$\text{positive predicted value} = \frac{TP}{TP + FP}$  (6)

(often called precision),

$\text{negative predicted value} = \frac{TN}{TN + FN}.$  (7)

None of these values is a good evaluation metric on its own, since each disregards part of the prediction domain entirely. Several of these metrics would have to be monitored simultaneously, or they would have to be combined into one metric. The latter is preferable since it allows the performance of classifiers to be represented in a one-dimensional space. Two common ways in which this is done are

$F_\beta = \frac{(1 + \beta^2) \cdot \text{recall} \cdot \text{precision}}{\beta^2 \cdot \text{precision} + \text{recall}}$  (8)

and

$\text{geometric mean} = \sqrt{\text{recall} \cdot \text{specificity}}.$  (9)

The first metric, $F_\beta$, allows the user to make a trade-off between precision and recall by setting the parameter $\beta$ manually. This means the metric can represent the preferences of the user very well, but it also requires the user to be able to state those preferences explicitly. The geometric mean aggregates 'in-class accuracy' over both the positive and negative class and gives them equal weight (note that this metric would easily extend to multiclass problems).
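As an illustration of how these quantities follow from the counts in Table 1, the sketch below computes the rates of equations (2)-(7) together with $F_\beta$ and the geometric mean; the counts are invented for the example.

    import numpy as np

    # Hypothetical confusion-matrix counts (positive = minority class).
    TP, FN, FP, TN = 30, 20, 100, 850

    recall      = TP / (TP + FN)   # true positive rate / sensitivity, eq. (2)
    specificity = TN / (TN + FP)   # true negative rate, eq. (3)
    fpr         = FP / (TN + FP)   # false positive rate, eq. (4)
    fnr         = FN / (TP + FN)   # false negative rate, eq. (5)
    precision   = TP / (TP + FP)   # positive predicted value, eq. (6)
    npv         = TN / (TN + FN)   # negative predicted value, eq. (7)

    def f_beta(precision, recall, beta):
        # Eq. (8): beta > 1 favours recall, beta < 1 favours precision.
        return (1 + beta ** 2) * recall * precision / (beta ** 2 * precision + recall)

    g_mean = np.sqrt(recall * specificity)   # eq. (9)
    print(recall, specificity, precision, f_beta(precision, recall, 2.0), g_mean)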

In practice, however, metrics based on the Receiver Operating Characteristic (ROC) curve are more common. The ROC curve is based on the true positive rate and the false positive rate. To fully understand how the curve is constructed, consider the set of scores $\{r_1, ..., r_N\}$ assigned by classifier $f(X)$ to $\{X_1, ..., X_N\}$. This set of scores can be separated into scores assigned to positive observations and scores assigned to negative observations, for instance $R_0 = \{r_i : i \in C_0\}$ and $R_1 = \{r_i : i \in C_1\}$. Classification is done by setting a threshold $\tau$ and comparing all calculated scores to this threshold. Scores that exceed $\tau$ are predicted to be part of $C_1$. Consider Figure 1: the two distribution lines represent the distributions of $R_0$ and $R_1$, and the three vertical lines represent three possible choices of $\tau$. For each choice of $\tau$, a fraction of $R_0$ and $R_1$ lies above this threshold. These fractions correspond to the false positive rate and the true positive rate, and both rates are continuous and strictly decreasing in $\tau$. In the extreme case where $\tau = -\infty$, everything is predicted to be in $C_1$ and both the true positive rate and the false positive rate are 1. In the extreme case where $\tau = \infty$, nothing is predicted to be in $C_1$ and both rates are 0. This allows couples of true positive and false positive rates to be directly associated with a unique threshold.

Figure 1: Distribution of $R_0$ (grey) and $R_1$ (white)

The ROC curve is constructed by plotting these couples against each other, with the false positive rate on the x-axis and the true positive rate on the y-axis. The performance of a classifier is thereby summarised into a single line. Figure 2 shows some possible ROC curves. Note that when a ROC curve is constructed for a classifier, one can only observe rates at scores that occur in $R_0 \cup R_1$, since other scores, and thus the rates corresponding to those thresholds, are not observed. This makes the drawn ROC curve a piecewise linear approximation of the true ROC curve of a classifier. In practice, when one speaks of the ROC curve, this observable approximation is meant. An ideal classifier is able to perfectly predict observations in both classes. This corresponds to having no overlap between the two distributions in Figure 1 and thus a true positive rate of 1 and a false positive rate of 0. A random classifier would give an exact overlap of the two distributions and thus can never obtain a true positive rate different from the false positive rate. Classifiers trained on real-world data fall in between these two extreme cases and are able to discriminate better than random, though not perfectly. Lines A and B are examples of ROC curves of two such classifiers. In both cases it is clear that they are better than random: for each false positive rate, they have a true positive rate that is at least that of the random classifier (and at at least one point higher, to exclude the equality case). With similar reasoning it becomes clear that both classifiers are worse than a perfect classifier. When classifiers A and B are compared to each other it becomes harder to draw conclusions, since each curve is higher than the other in some regions. One could compare the curves of all available classifiers and choose the one which reaches ratios that match the preferences as well as possible, but this would be a rather arbitrary practice and would require explicitly stated preferences for any possible threshold. Instead, the performance is summarised into a single statistic, the area under the ROC curve (AUC). This AUC is the integral of the ROC curve over the $[0, 1]$ domain. Since in practice we can only observe a piecewise linear approximation of the ROC curve, the true AUC is approximated by the sum of trapezoid areas under the linear pieces.

Figure 2: ROC curve for two classifiers and a random guesser
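The construction just described is easy to write down. The sketch below builds the empirical (piecewise linear) ROC curve by sweeping the threshold over the observed scores and approximates the AUC by summing the trapezoid areas; the scores and labels are invented for the example.

    import numpy as np

    def roc_points(scores, labels):
        # TPR/FPR pairs at every threshold occurring in the observed scores.
        scores, labels = np.asarray(scores, float), np.asarray(labels)
        pos, neg = labels == 1, labels == 0
        thresholds = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
        tpr = np.array([(scores[pos] >= t).mean() for t in thresholds])
        fpr = np.array([(scores[neg] >= t).mean() for t in thresholds])
        return fpr, tpr

    def auc_trapezoid(fpr, tpr):
        # Sum of trapezoid areas under the piecewise linear ROC curve.
        return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

    # Invented scores; a higher score should indicate the positive (minority) class.
    labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
    scores = np.array([.1, .2, .3, .35, .4, .45, .5, .7, .8, .9])
    fpr, tpr = roc_points(scores, labels)
    print(auc_trapezoid(fpr, tpr))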

A key observation here is that the ROC curve, and thus the AUC, is based on ratios which are not affected by the number of observations in each class. This makes the AUC suitable for imbalanced problems; in practice the AUC is most often the chosen metric to evaluate the performance of classifiers on imbalanced data. There are some cases in which the AUC is not applicable, however. The first is when the user wants to apply post-processing of results by threshold variation. Since the ROC curve is constructed from TP and FP rates at every possible threshold, the scores, not the predicted classes, are needed for the calculation. Hence, the AUC is invariant with respect to the chosen threshold and cannot be used in an experiment where threshold-varying techniques are applied. The second limitation is the extension to multiclass problems. Although the principle of the ROC curve and its integral theoretically extends into higher dimensions, Fawcett (2006) notes that the dimension of the ROC shape of an m-class problem is m(m − 1). Calculating the integral of such a higher-dimensional shape becomes very computationally expensive and easily intractable as the number of classes increases. Instead of calculating this difficult integral, multiple AUC extensions have been proposed that average the AUC scores of binary sub-problems. Since multiclass problems lie beyond the scope of this study, these metrics are not discussed.

Fawcett (2006) notes an interesting property of the AUC. He finds that the AUC can also be interpreted as the probability that a random draw $r_1 \in R_1$ is larger than a random draw $r_0 \in R_0$. More formally:

$AUC = P(r_0 < r_1 \mid r_0 \in R_0, r_1 \in R_1).$  (10)

This probability is approximated by

$AUC = \frac{\sum_{r_1 \in R_1} \sum_{r_0 \in R_0} I(r_0 < r_1)}{|R_0| \cdot |R_1|},$  (11)

where $I(\cdot)$ is the indicator function, equal to 1 if its argument is true and 0 otherwise.

This definition illustrates a noteworthy property of the AUC, namely that it is only affected by the ordering of the scores assigned by the classifier. Hence, the scale and location of these scores do not affect the AUC. This implies that multiplying all scores by a positive constant and/or adding a constant to them will not affect the AUC.
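Equation (11) can be evaluated directly, and doing so makes the ordering-only property easy to verify: any strictly increasing transformation of the scores leaves the value unchanged. A small sketch with invented scores:

    import numpy as np

    def pairwise_auc(scores, labels):
        # Fraction of correctly ordered (negative, positive) pairs, eq. (11).
        s = np.asarray(scores, float)
        r1 = s[np.asarray(labels) == 1]   # scores of positive observations, R1
        r0 = s[np.asarray(labels) == 0]   # scores of negative observations, R0
        # I(r0 < r1) for every pair via broadcasting (ties could be counted as 1/2).
        return float((r0[:, None] < r1[None, :]).mean())

    labels = np.array([0, 0, 0, 0, 1, 1, 0, 1])
    scores = np.array([-2.0, -1.0, 0.2, 0.5, 0.7, 1.5, 2.0, 3.0])

    auc_raw = pairwise_auc(scores, labels)
    auc_scaled = pairwise_auc(5.0 * scores + 10.0, labels)   # rescaled and shifted scores
    print(auc_raw, auc_scaled, auc_raw == auc_scaled)        # the values are identical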

Although the AUC is the metric most prevalent in the imbalanced learning literature, it is not necessarily an optimal metric. The choice of the AUC over other metrics is largely arbitrary. In problems where the goal of the algorithm is to obtain a correct ordering, as is the case with recommender systems, the reasons to prefer the AUC are quite clear. Since the AUC is derived using TP and FP rates, or using pairs of observations where each class is represented equally often, the AUC implicitly gives equal weight to both classes. It is argued that letting the weight of classes be determined by the data, as is the case with accuracy, is undesirable. This does not mean, however, that giving all classes exactly equal weight is the most desirable choice, since the preferences of the user might be more nuanced. The motivation for choosing the AUC as the central metric in this study is twofold. For one, the AUC is a widely used metric throughout the literature. Whether or not this is always the optimal choice is left up for debate, but representing results in terms of the AUC makes them relevant to a large number of researchers. Secondly, by always giving classes the same weight, regardless of their size, results become more comparable across different problems. To illustrate, consider accuracy: obtaining 99% accuracy on a heavily imbalanced dataset with a minority class of 1% can be achieved by always predicting the majority class, whereas obtaining 99% accuracy on a dataset with equally sized classes does indicate good performance of the classifier. The fact that an accuracy score cannot be separated from the context of the problem on which it was obtained makes it incomparable across problems. In this study, results are obtained using different datasets. By using the AUC, this study entirely abstains from giving classes different weights. This not only allows for a fairer comparison among the included problems but also makes the results more applicable to problems not considered in this study.

2.3 AUC Optimisation

In Subsection 2.1, multiple approaches to adapt training methods to imbalance were discussed. Although the resulting classifiers are certainly better suited for imbalanced data, the algorithms still aim to minimise some cost function based on misclassification loss. When the user has chosen to evaluate the performance of a classifier with the AUC, it would be preferable to optimise this metric directly.

This task has proven to be difficult, however. Regular optimisation methods require at least a gradient with respect to the parameters to construct an update rule. Equation (11) contains a sum of indicator functions in the numerator. This sum of non-continuous functions has discontinuities at many points, so the gradient is often not defined at all, and at other points the derivative is 0, making a gradient-based search impossible. In order to implement a gradient-based optimisation scheme, this indicator function needs to be replaced by some continuous approximation. Note that if this is done, it is only an approximation of the AUC that is optimised rather than the observable AUC itself. Multiple choices exist for the approximation of the indicator function. Two are discussed in this study: a sigmoidal and a polynomial approximation. The sigmoidal approximation,

$\sigma_\beta(z) = \frac{1}{1 + e^{-\beta z}},$  (12)

approximates the indicator function arbitrarily well as $\beta$ is chosen to be large: for every $\varepsilon > 0$ and every $\delta > 0$ there exists a $\beta$ such that $\sup_{|z| \geq \delta} |I(z) - \sigma_\beta(z)| < \varepsilon$, and $\sigma_\beta$ converges pointwise to $I(\cdot)$ at every point except 0 as $\beta \to \infty$ (Calders and Jaroszewicz, 2007). This approximation has a few neat properties; in particular, its derivative is easy to calculate everywhere. Some caution is needed, since as $\beta \to \infty$ the derivative in an increasingly small neighbourhood of 0 will also go to $\infty$, which might cause gradient-based search methods to exhibit unexpected behaviour at those points. Another drawback is the computational complexity of this method: Calders and Jaroszewicz (2007) note that the calculation time of the sigmoidal approximation of the AUC increases quadratically in the number of observations $N$, since pairs of observations are considered. Instead of the sigmoidal approximation, Calders and Jaroszewicz (2007) propose a polynomial approximation for which the calculation time increases linearly in $N$. As the order of the polynomial increases, any function can be approximated arbitrarily well. This method is not without drawbacks of its own, however, the first and foremost being that for any non-constant polynomial $g(x)$, $\lim_{x \to \pm\infty} |g(x)| = \infty$. Hence, the approximation of $I(\cdot)$ is only close on a given domain. This need not be a problem due to the nature of the AUC: remember that the AUC is not affected by the scale of the scores. Hence, scores can be divided by an arbitrary number such that all differences between scores fall within the domain on which the polynomial approximates $I(\cdot)$ well, without altering the ordering and thus the AUC.

Let $X_0$ ($X_1$) be the set of feature vectors corresponding to observations in $C_0$ ($C_1$). Note that only index-based classifiers are considered for these methods, that is, classifiers that form a score $r_i$ as a function of an index $w^\top x_i$; hence $r_i = f(w^\top x_i)$. The approximation of the AUC becomes

$\widehat{AUC} = \frac{\sum_{x_1 \in X_1} \sum_{x_0 \in X_0} h(f(w^\top x_1) - f(w^\top x_0))}{|X_0| \cdot |X_1|},$  (13)

with $h(\cdot)$ the approximation of the indicator function. The gradient with respect to the weights $w$ then becomes

$\frac{\partial \widehat{AUC}}{\partial w} = \frac{\sum_{x_1 \in X_1} \sum_{x_0 \in X_0} h'(f(w^\top x_1) - f(w^\top x_0)) \cdot (f'(w^\top x_1) \cdot x_1 - f'(w^\top x_0) \cdot x_0)}{|X_0| \cdot |X_1|}.$  (14)

Equations (13) and (14) allow for any choice of $h(\cdot)$ and $f(\cdot)$ as long as they are continuous and differentiable.
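As a minimal illustration of this recipe, the sketch below implements the surrogate AUC of equation (13) and its gradient of equation (14) for the simple choices $f(w^\top x) = w^\top x$ and $h = \sigma_\beta$, and takes a few plain gradient-ascent steps with a fixed learning rate; the data are invented. It uses the $O(N^2)$ pairwise form; the efficient polynomial scheme is discussed next.

    import numpy as np

    def sigmoid(z, beta):
        return 1.0 / (1.0 + np.exp(-beta * z))

    def surrogate_auc_and_grad(w, X0, X1, beta=2.0):
        # Approximate AUC (eq. 13) and its gradient (eq. 14) for f the identity, h the sigmoid.
        s1, s0 = X1 @ w, X0 @ w
        diff = s1[:, None] - s0[None, :]            # all pairwise score differences
        h = sigmoid(diff, beta)
        auc_hat = h.mean()                          # eq. (13)
        hprime = beta * h * (1.0 - h)               # derivative of the sigmoid
        # eq. (14): average of h'(diff) * (x1 - x0) over all pairs
        grad = (hprime[:, :, None] * (X1[:, None, :] - X0[None, :, :])).mean(axis=(0, 1))
        return auc_hat, grad

    # Invented data: two Gaussian clouds, the positive class slightly shifted.
    rng = np.random.default_rng(0)
    X0 = rng.normal(0.0, 1.0, size=(200, 3))   # negative class features
    X1 = rng.normal(0.5, 1.0, size=(40, 3))    # positive (minority) class features

    w, gamma = np.zeros(3), 0.5                # initial weights and a fixed learning rate
    for _ in range(50):
        auc_hat, g = surrogate_auc_and_grad(w, X0, X1)
        w = w + gamma * g                      # plain gradient-ascent update
    print(auc_hat)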


Calders and Jaroszewicz (2007) restrict themselves to linear classifiers with polynomial approximations, that is, $f(w^\top x_i) = w^\top x_i$ and $h(z) = \sum_{k=0}^{K} c_k z^k$, where $K$ is the order of the polynomial and $c_k$ a polynomial coefficient. Using the binomial theorem,

$h(f(w^\top x_1) - f(w^\top x_0)) = \sum_{k=0}^{K} c_k (w^\top x_1 - w^\top x_0)^k = \sum_{k=0}^{K} \sum_{l=0}^{k} c_k \binom{k}{l} (-1)^{k-l} (w^\top x_1)^l (w^\top x_0)^{k-l}.$

Now let

$\alpha_{kl} = c_k \binom{k}{l} (-1)^{k-l},$  (15)

equation (13) then becomes

$\widehat{AUC} = \frac{\sum_{x_1 \in X_1} \sum_{x_0 \in X_0} \sum_{k=0}^{K} \sum_{l=0}^{k} \alpha_{kl} (w^\top x_1)^l (w^\top x_0)^{k-l}}{|X_0| \cdot |X_1|},$

which can be rewritten as

$\widehat{AUC} = \frac{\sum_{k=0}^{K} \sum_{l=0}^{k} \alpha_{kl} \left( \sum_{x_1 \in X_1} (w^\top x_1)^l \right) \left( \sum_{x_0 \in X_0} (w^\top x_0)^{k-l} \right)}{|X_0| \cdot |X_1|}.$  (16)

This representation of the polynomial approximation of the AUC and thus its gradient is of key importance. Note that by applying the binomial theorem, the double summation is no longer done over the observations but rather the orders of the polynomial. When a new observation is added to X0 (X1), only one extra summation step has to be performed instead of |X1| (|X0|).

This decreases the complexity of the algorithm from $O(N^2)$ to $O(N)$. Taking the derivative with respect to $w$ gives

$\frac{\partial \widehat{AUC}}{\partial w} = \frac{\sum_{k=0}^{K} \sum_{l=0}^{k} \alpha_{kl} \left[ \left( \sum_{x_1 \in X_1} l (w^\top x_1)^{l-1} x_1 \right) \left( \sum_{x_0 \in X_0} (w^\top x_0)^{k-l} \right) + \left( \sum_{x_1 \in X_1} (w^\top x_1)^l \right) \left( \sum_{x_0 \in X_0} (k-l) (w^\top x_0)^{k-l-1} x_0 \right) \right]}{|X_0| \cdot |X_1|}.$  (17)

This gradient can be used to construct an update rule $w \leftarrow w + \gamma \cdot g$, with $g = \partial \widehat{AUC} / \partial w$ the gradient and $\gamma$ the learning rate. Calders and Jaroszewicz (2007) propose an alternative gradient-based update scheme in which an optimal learning rate is calculated in each step. The key observation behind this scheme is that the magnitude of the scores does not affect the AUC, and thus that the magnitude of the weights $w$ is irrelevant. Now let $D$ be the number of features in the dataset. Both $w$ and $g$ are vectors in $\mathbb{R}^D$, and together they span a plane in $\mathbb{R}^D$; any weighted sum of these vectors lies on this plane. Given knowledge of this plane, the updated weight vector can be summarised by two numbers: the angle $\alpha$ between the updated and old weights and the length $\beta$ of the updated weight vector. The length corresponds to the magnitude of the weights, which does not affect the AUC, so no search for an optimal $\beta$ has to be performed. This leaves only the optimal angle $\alpha \in [0, 2\pi]$ to be found. Calders and Jaroszewicz (2007) perform a grid search with steps of 0.01 over the entire domain of $\alpha$. More sophisticated search schemes could be implemented, but the gain in performance is likely negligible. Given the optimal $\alpha$, the update scheme takes the form $w \leftarrow \cos(\alpha) w + \sin(\alpha) g$.

A problem with the polynomial approximation of the indicator function is that it only lies close to this function on a fixed domain. Calders and Jaroszewicz (2007) use a polynomial of order 30 that lies close to the desired indicator function on the interval $[-1, 1]$. This means that if the absolute difference between two scores becomes larger than 1, the approximation becomes very poor and the results become unreliable. This problem can be overcome fairly easily by simply rescaling the vectors $w$ and $g$. Again, note that rescaling only affects the magnitude of the updated weight vector and thus not the AUC. In order to ensure that all differences fall within the $[-1, 1]$ domain, the weights are scaled down by a factor equal to the largest difference between two scores in the data, that is, $\max_{x_1 \in X_1, x_0 \in X_0} |w^\top x_1 - w^\top x_0|$. The same is done for the gradient vector $g$. These rescaling steps ensure proper scaling with respect to the current vectors but do not ensure proper scaling of the updated weights. Let $w^*$ be the updated weight vector. Then

$|w^{*\top} x_1 - w^{*\top} x_0| = |\cos(\alpha)(w^\top x_1 - w^\top x_0) + \sin(\alpha)(g^\top x_1 - g^\top x_0)| \leq |\cos(\alpha)| + |\sin(\alpha)| \leq \sqrt{2}.$

This means that $w^*$ needs to be rescaled by a factor $1/\sqrt{2}$ to ensure that the resulting vector yields scores whose pairwise differences all fall in the $[-1, 1]$ domain. In most cases the factor $\sqrt{2}$ is excessive, but rescaling by this factor ensures that each value in the grid search over possible values of $\alpha$ results in valid approximations. Hence, the update rule needs to be $w \leftarrow \frac{1}{\sqrt{2}}(\cos(\alpha) w + \sin(\alpha) g)$, where $w$ and $g$ are properly scaled. This update rule allows the approximate AUC to be optimised in an efficient manner.
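The following sketch checks the algebra behind equations (15) and (16): the pairwise evaluation of the polynomial surrogate and the power-sum form return the same value, while the latter needs only one pass over each class. For illustration it uses a low-order smooth step on $[-1, 1]$, $h(z) = \frac{1}{2} + \frac{15}{16}z - \frac{5}{8}z^3 + \frac{3}{16}z^5$, rather than the order-30 fit used by Calders and Jaroszewicz (2007), and it assumes the scores have already been scaled so that all pairwise differences lie in $[-1, 1]$.

    import numpy as np
    from math import comb

    # Coefficients c_0, ..., c_K of an illustrative smooth-step polynomial on [-1, 1].
    c = np.array([0.5, 15 / 16, 0.0, -5 / 8, 0.0, 3 / 16])
    K = len(c) - 1

    def poly_auc_pairwise(s1, s0):
        # Direct O(|X0|*|X1|) evaluation of eq. (13) with the polynomial h.
        diff = s1[:, None] - s0[None, :]
        return float(np.polyval(c[::-1], diff).mean())

    def poly_auc_powersums(s1, s0):
        # O(N) evaluation via eq. (16): only per-class power sums are needed.
        p1 = np.array([np.sum(s1 ** l) for l in range(K + 1)])   # sums of (w'x1)^l
        p0 = np.array([np.sum(s0 ** m) for m in range(K + 1)])   # sums of (w'x0)^m
        total = 0.0
        for k in range(K + 1):
            for l in range(k + 1):
                alpha_kl = c[k] * comb(k, l) * (-1) ** (k - l)   # eq. (15)
                total += alpha_kl * p1[l] * p0[k - l]
        return float(total / (len(s1) * len(s0)))

    # Invented scores, pre-scaled so that |s1_i - s0_j| <= 1 for every pair.
    rng = np.random.default_rng(1)
    s0 = rng.uniform(-0.4, 0.3, size=300)   # scores of negative observations
    s1 = rng.uniform(-0.1, 0.6, size=60)    # scores of positive observations
    print(poly_auc_pairwise(s1, s0), poly_auc_powersums(s1, s0))   # equal up to rounding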

2.4 Ensemble Methods

A group of methods which has proven to be effective in imbalanced problems are ensembles. Generally speaking, ensembles combine multiple trained algorithms and aggregate the predictions of these algorithms into one single prediction. Galar et al. (2012) provide an extensive review of the ways in which ensembles are applied to imbalanced problems. The main motivation behind combining the 'votes' of multiple classifiers is the reduction of variance, in the sense of the bias-variance decomposition. By averaging over multiple classifiers, each with their own specific flaws but all with generally similar behaviour, the specific characteristics of each classifier have less impact and their common behaviour becomes more dominant. This common behaviour is likely to carry over to new data, making the ensemble generalise well.


Before these methods are described in more detail, the two families of ensembles that are often used for imbalanced problems are introduced: bagging and boosting. Note that ensembles are merely combination schemes for multiple trained instances of a given base classifier. This means that a broad variety of algorithms can be combined into an ensemble and that an ensemble is not defined by the base classifier it combines. Also, given the context of this study, only ensembles for classification are considered, but most of the described techniques could be applied with little or no change to regression problems. When describing ensemble algorithms, it is often more convenient to represent the labels of classes $C_0$ and $C_1$ by $-1$ and $1$ respectively; this notation will be used for the rest of this subsection.

2.4.1 Bagging

Bagging, short for bootstrap aggregating, is perhaps the simplest form of ensemble. It was first proposed by Breiman (1996). It creates $B$ different bootstrap samples from the training data and trains one classifier on each bootstrap sample. This results in $B$ trained classifiers, each with different parameters due to the differences between the bootstrap samples. The predicted score of the bagged classifier is then the unweighted average of the scores given by the $B$ classifiers. This score can be compared to a threshold to determine the predicted class; the usual choice for this threshold is 0.5. Algorithm 1 shows pseudocode for the bagging procedure.

Algorithm 1: Bagging

Data: training data $D$ (features $X_i$ and targets $y_i \in \{-1, 1\}$), number of bootstrap samples $B$, base classifier $f$
Result: bagged classifier $F(x)$
  $N$ := number of observations in $D$
  for $b = 1, \ldots, B$ do
    create a bootstrap sample $D_b$ with $N$ rows drawn with replacement from $D$
    train the base classifier on $D_b$ and store it as $f_b(x)$
  end
  bagged classifier $F(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$

Variations on the bagging algorithm exist in which the number of rows in the bootstrap samples is specified by an input parameter, or in which each bootstrap sample includes only a random subset of the available features. The combination of bagged decision trees with random feature selection is known as a random forest.
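A direct translation of Algorithm 1 into Python might look as follows; shallow scikit-learn decision trees are used as the base classifier purely for illustration, and any estimator with fit/predict methods would do.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, B=25, random_state=0):
        # Train B base classifiers on bootstrap samples of (X, y), as in Algorithm 1.
        rng = np.random.default_rng(random_state)
        n, models = len(y), []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)   # bootstrap sample: n rows with replacement
            models.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        # Unweighted average of the members' predictions; labels are in {-1, 1},
        # so 0 plays the role that 0.5 plays for probability-type scores.
        scores = np.mean([m.predict(X) for m in models], axis=0)
        return np.where(scores > 0, 1, -1), scores

    # Usage sketch (X_train, y_train, X_test are placeholders):
    # models = bagging_fit(X_train, y_train)
    # y_hat, scores = bagging_predict(models, X_test)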


2.4.2 Boosting

Boosting ensembles always encompass an iterative procedure in which a classifier is trained in each iteration. Each classifier is trained using the entire training dataset, but each observation is given an observation weight which is updated in every iteration. The performance of the trained classifier is measured in each iteration, and these measures are used to weight each iteration when all classifiers are aggregated. Although boosting is arguably more sophisticated than bagging, its foundational work was published six years earlier, by Schapire (1990).

A key difference between bagging and boosting in general is that bagging relies on classifiers that perform relatively well on their own, whereas boosting was designed to combine classifiers that each have to perform only slightly better than random guessing; these slightly-better-than-random classifiers are often called weak learners in the boosting literature. It is the diversity of these weak learners that makes the combined ensemble a strong classifier. Algorithm 2 gives pseudocode for the most common boosting algorithm, AdaBoost (short for adaptive boosting). The most common choice of weak learner is a shallow decision tree which creates a simple decision rule based on one feature. Again, other base classifiers are possible, but this shallow decision tree will be used in the further explanation of boosting. In each iteration, more weight is given to observations that were difficult to predict in the previous iteration. The 'new' classifier will then aim to predict these difficult observations correctly, resulting in other observations being predicted wrongly and given more weight in the next iteration. This process continues for a fixed number of iterations which is to be supplied by the user. Since in each iteration different observations are given more weight, the resulting weak learners will be quite different, providing the rich diversity needed for effective boosting schemes. The combination of many weak but different classifiers generally results in one flexible and well-performing classifier.

Looking at Algorithm 2, it becomes clear that all observations are given equal weight when the algorithm is initialised. A classifier is then trained on the training data and stored. This trained classifier is subsequently used to predict the classes of the training observations. The weighted fraction of wrongly predicted classes (the error rate) is calculated and used to construct the performance measure $\alpha$ of the classifier in that iteration. For error rates close to 0, $\alpha$ will be large; for error rates close to 0.5, $\alpha$ will be close to 0. If the error rate exceeds 0.5, the classifier performs worse than random guessing, the classifier is discarded and the iterative scheme is stopped; predictions are then based on the previously completed iterations alone. If the error rate does not exceed this value, each observation weight is multiplied by the exponent of $-\alpha \cdot \hat{y}_i \cdot y_i$. Note that $\hat{y}_i, y_i \in \{-1, 1\}$. Hence, for observations where the true and predicted class are the same, the product of these two will be 1, and for wrongly predicted observations it will be $-1$. As a result, the weight of correctly predicted observations decreases and the weight of wrongly predicted observations increases for the next iteration. The magnitude of these changes depends on how well the classifier performed: classifiers with a high error rate will have an $\alpha$ close to 0 and thus an exponential factor close to 1, resulting in little change in the observation weights.


Algorithm 2: AdaBoost

Data: training data $D$ (features $X_i$ and targets $y_i \in \{-1, 1\}$), number of iterations $B$, base classifier $f$
Result: trained AdaBoost classifier $F(x)$
  $N$ := number of observations in $D$
  initialise observation weights $\gamma_i := \frac{1}{N}$ for $i = 1, \ldots, N$
  for $b = 1, \ldots, B$ do
    train $f$ with observation weights $\gamma$ and store as $f_b(x)$
    predict targets $\hat{y}_i := f_b(x_i)$ for $i = 1, \ldots, N$
    calculate the error rate $\varepsilon := \sum_{i=1}^{N} \gamma_i \cdot I(y_i \neq \hat{y}_i)$, with $I$ the indicator function
    if $\varepsilon > 0.5$ then $B := b$; exit for
    calculate the performance measure $\alpha_b := \frac{1}{2} \log\left(\frac{1 - \varepsilon}{\varepsilon}\right)$
    update the observation weights $\gamma_i := \gamma_i \cdot \exp(-\alpha_b \cdot y_i \cdot \hat{y}_i)$ for $i = 1, \ldots, N$
    normalise the observation weights $\gamma_i := \gamma_i / \sum_{j=1}^{N} \gamma_j$
  end
  create the ensemble of all classifiers $F(x) := \frac{1}{\sum_b \alpha_b} \sum_{b=1}^{B} \alpha_b \cdot f_b(x)$

After the new weights are calculated, they are normalised to ensure a proper distribution.

After the iterative scheme has ended, predictions of all individual classifiers are aggregated in a manner similar to bagging. The key difference is that classifiers are weighted by their performance metric α.
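For concreteness, a minimal from-scratch version of Algorithm 2 is sketched below, with a depth-1 decision tree (a 'stump') as the weak learner; scikit-learn also offers a ready-made AdaBoostClassifier, but the explicit loop makes the weight and $\alpha$ updates visible.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, B=50):
        # AdaBoost as in Algorithm 2; y must take values in {-1, 1}.
        y = np.asarray(y)
        n = len(y)
        gamma = np.full(n, 1.0 / n)               # observation weights
        stumps, alphas = [], []
        for _ in range(B):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=gamma)
            y_hat = stump.predict(X)
            eps = np.sum(gamma * (y_hat != y))    # weighted error rate
            if eps > 0.5:                         # worse than random guessing: stop early
                break
            eps = max(eps, 1e-10)                 # guard against log of zero
            alpha = 0.5 * np.log((1.0 - eps) / eps)
            gamma = gamma * np.exp(-alpha * y * y_hat)   # up-weight misclassified observations
            gamma = gamma / gamma.sum()                  # normalise to a distribution
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, np.array(alphas)

    def adaboost_score(stumps, alphas, X):
        # Alpha-weighted average of the weak learners' predictions.
        preds = np.array([s.predict(X) for s in stumps])
        return alphas @ preds / alphas.sum()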

2.4.3 Ensembles in Imbalanced Learning

The bagging and boosting methods described above are in no sense designed to perform well in imbalanced problems. Bagging trains classifiers on bootstrap samples that will all have roughly the same level of imbalance, and boosting actively tries to improve accuracy, making it prone to imbalance. The methods described by Galar et al. (2012) can be categorised into two approaches to address imbalance. The first, cost-sensitive boosting, alters the boosting algorithm in such a way that observations from the minority class are given more weight. Multiple methods are described in which the calculation of the observation weights, the calculation of $\alpha$, or both, is changed such that misclassification of the minority class becomes more costly. The choice of costs is up to the user and requires an explicit formulation of the class preferences. Note that these methods do not translate to bagging, since bagging makes no use of observation weights or of weights for the separate classifiers.

The second approach combines a sampling scheme to level out the data distribution with an ensemble technique. Ensembles allow for a deeper integration with sampling schemes than individual classifiers, since the sampling step can be applied in every bootstrap sample or boosting iteration. Galar et al. (2012) give a number of examples of how sampling schemes and ensembles have been combined in the relevant literature. The concept of applying a sampling step in each iteration or sample is very general, however, and any combination of a sampling scheme with a bagging- or boosting-related ensemble should give a feasible technique which is more robust to imbalance. The fact that a sampling step is applied multiple times could help the resulting classifier to generalise better, since it is less dependent on the outcome of a single sampling step, which is ultimately based on a random process. Sampling schemes are described in more detail below.

A technique possibly relevant to imbalanced problems is a boosting technique which originates from the related machine learning literature on recommender systems. This technique is named RankBoost and is described by Freund et al. (2003). The algorithm follows a procedure largely similar to AdaBoost, but it considers pairwise ordering instead of correct classification. This is clearly relevant for recommender systems, where only the ordering of items is important, but it might also be useful in imbalanced classification problems. Remember how the AUC has an interpretation that relies solely on the ordering of observations. By increasing the weights of wrongly ordered pairs, RankBoost aims to optimise the ordering and will thus aim to optimise the AUC, though not through the use of (approximate) gradients. Moreover, Cortes and Mohri (2004) state that (under some distributional assumptions on rankings) the global function optimised by RankBoost is equal to the AUC. A boosting scheme aimed at optimal ordering could therefore very well result in a classifier that performs well in imbalanced problems.

As stated before, RankBoost considers pairs of observations rather than individual observations. Instead of assigning a weight to each observation $x_i$ in the dataset, a weight $\gamma_{x_0,x_1}$ is assigned to each couple $(x_0 \in X_0, x_1 \in X_1)$. These weights are updated according to whether the predicted ordering of the corresponding couple is correct. Let $\gamma'_{x_0,x_1}$ be the updated weight and $\hat{y}_0$ ($\hat{y}_1$) the predicted class for $x_0$ ($x_1$); the update rule then becomes

$\gamma'_{x_0,x_1} = \gamma_{x_0,x_1} \cdot \exp(\alpha \cdot (\hat{y}_0 - \hat{y}_1)).$  (18)

The performance measure $\alpha$ is no longer based on the error rate but on a ratio describing how well the ranking is performed. Freund et al. (2003) note that if the range of the 'rankings' is restricted to $\{0, 1\}$, as is the case with binary classification, the optimal choice for this measure is

$\alpha = \frac{1}{2} \log\left( \frac{\sum_{x_0, x_1} \gamma_{x_0,x_1} \cdot I(\hat{y}_1 > \hat{y}_0)}{\sum_{x_0, x_1} \gamma_{x_0,x_1} \cdot I(\hat{y}_1 < \hat{y}_0)} \right).$  (19)
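A small sketch of these two updates, assuming the couple weights are stored as an $|X_0| \times |X_1|$ matrix W and that the base classifier's predictions for the negative and positive observations are held in y0_hat and y1_hat (both taking values in {0, 1}); it shows a single boosting iteration, not the full procedure given below.

    import numpy as np

    def rankboost_step(W, y0_hat, y1_hat, eps=1e-12):
        # One RankBoost update: alpha from eq. (19), couple weights from eq. (18).
        y0_hat, y1_hat = np.asarray(y0_hat), np.asarray(y1_hat)
        # Entry (i, j) of the pairwise difference compares negative i with positive j.
        d = y1_hat[None, :] - y0_hat[:, None]
        correctly_ordered = np.sum(W * (d > 0))   # predicted positive ranked above negative
        wrongly_ordered = np.sum(W * (d < 0))
        alpha = 0.5 * np.log((correctly_ordered + eps) / (wrongly_ordered + eps))   # eq. (19); eps avoids log(0)
        W_new = W * np.exp(alpha * (y0_hat[:, None] - y1_hat[None, :]))              # eq. (18)
        return alpha, W_new / W_new.sum()         # renormalise to a distribution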

Algorithm 3 gives pseudocode for the full RankBoost procedure.

Algorithm 3: RankBoost

Data: training data $D$ (features $X_i$ and targets $y_i \in \{0, 1\}$), number of iterations $B$, base classifier $f$
Result: trained RankBoost classifier $F(x)$
  $N_0, N_1$ := number of observations in $D_0$, $D_1$
  initialise couple weights $\gamma_{x_0,x_1} := \frac{1}{N_0 \cdot N_1}$ for each $x_0 \in X_0$, $x_1 \in X_1$
  for $b = 1, \ldots, B$ do
    train $f$ with couple weights $\gamma$ and store as $f_b(x)$
    predict targets $\hat{y}_i := f_b(x_i)$ for $i = 1, \ldots, N$
    calculate the performance measure $\alpha_b$ according to equation (19)
    update the couple weights according to equation (18)
    normalise the couple weights $\gamma := \gamma / \sum_{j=1}^{N_0 \cdot N_1} \gamma_j$, where $j = 1, \ldots, N_0 \cdot N_1$ indexes the couples
  end
  create the ensemble of all classifiers $F(x) := \frac{1}{\sum_b \alpha_b} \sum_{b=1}^{B} \alpha_b \cdot f_b(x)$

2.5 Sampling Strategies

Although sampling strategies are not the primary focus of this study, altering the training data to overcome imbalance problems is the most common solution in practice and will thus be discussed to some extent. The main advantage of this approach is that it is applicable in combination with any classifier and requires no algorithm-specific operations. In essence, all sampling strategies alter the number of observations in each class in such a way that the resulting data shows a more equal class distribution. A balanced distribution can be achieved by either increasing the number of observations in the minority class (oversampling), discarding observations from the majority class (undersampling), or a combination of both. Both over- and undersampling have advantages and disadvantages. Oversampling discards no data and thus ensures that no learnable information is thrown away, but since the number of observations increases, the required training time increases as well. Another important drawback is that if the information in a small minority class is oversampled heavily, there is a danger of overfitting: characteristics specific to the few available minority samples are blown up, and the classifier trained on the data learns these non-general characteristics to a disproportionate extent and will thus generalise less well. Undersampling, on the other hand, decreases the training time since fewer observations are considered, but possibly relevant information is thrown away, making the resulting classifier perform worse. Again, the danger of overfitting occurs if a very large portion of the majority class is thrown away and only a few examples remain in the majority class.

Branco et al. (2016) provide an extensive overview of available sampling schemes and their applications. The most elementary examples of sampling schemes are random oversampling (ROS) and random undersampling (RUS). These techniques randomly select minority (majority) observations and duplicate (discard) them until the resulting dataset has the desired balance. These techniques suffer the most from overfitting dangers, since no steps are taken to make the resulting data generalise well. Non-random sampling schemes exist as well. These do not select observations randomly but rather aim to select good minority examples to duplicate, or redundant and/or noisy observations to discard, in a deterministic manner.

More sophisticated sampling schemes include the creation of new artificial examples. The most famous example is the synthetic minority oversampling technique (SMOTE). SMOTE is an oversampling technique and thus increases the number of observations in the minority class. Its procedure, first described by Chawla et al. (2002), selects the $k$ nearest neighbours of each minority observation $i$ and creates linearly interpolated observations between minority sample $i$ and one or more (depending on the degree of oversampling) of its nearest neighbours. The degree of interpolation can vary from 0 to 1 and is randomly chosen for each new synthetic sample. This procedure creates new synthetic minority samples that all share characteristics of two minority 'parents'. Because these synthetic samples are a weighted average of two observations, they predominantly exhibit the characteristics their parents have in common, lessening the observation-specific characteristics and thus the susceptibility to overfitting. The number of nearest neighbours $k$ to select for each minority observation is to be supplied by the user. Algorithm 4 gives pseudocode for the full SMOTE procedure.

Similar ideas can also be applied in an undersampling setting. To do so, execute a k-means clustering algorithm over the majority class with $k$ equal to the desired number of samples in the majority class. The centroids of the fitted clusters can then be used as majority class observations. This undersampling scheme has the neat property that it implicitly aims to compress as much common information as possible into the remaining number of observations (the centroids), thereby not only discarding less relevant common information compared to random schemes but also reducing the impact of observation-specific characteristics. Hence, this scheme makes the resulting data more general and less prone to overfitting.
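A sketch of this cluster-centroid undersampling idea, using scikit-learn's KMeans (the interface details are illustrative, not prescribed by the thesis):

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_undersample(X_maj, n_keep, random_state=0):
        # Replace the majority class by the centroids of n_keep k-means clusters.
        km = KMeans(n_clusters=n_keep, n_init=10, random_state=random_state).fit(X_maj)
        return km.cluster_centers_

    # Usage sketch: shrink the majority class to the size of the minority class
    # (X_maj and X_min are placeholders for the two classes' feature matrices).
    # X_maj_reduced = kmeans_undersample(X_maj, n_keep=len(X_min))
    # X_bal = np.vstack([X_maj_reduced, X_min])
    # y_bal = np.r_[np.zeros(len(X_maj_reduced)), np.ones(len(X_min))]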

Other synthetic sampling schemes and even hybrid sampling schemes exist, but these lie beyond the scope of this study.


Algorithm 4: SMOTE

Data: minority data $X$, degree of oversampling $d$ in %, number of nearest neighbours $k$
Result: synthetic minority observations $X^*$
  $N := |X|$
  if $d \bmod 100 \neq 0$ then
    randomly select $\frac{d \bmod 100}{100} \cdot N$ observations and store their id's
  end
  copy the id's of all observations $\frac{d - (d \bmod 100)}{100}$ times, append the randomly selected id's and store the result in $I$
  for each unique $i$ in $I$ do
    find the $k$ nearest neighbours of $x_i$ and store them
  end
  for each $i$ in $I$ do
    randomly select one of the $k$ nearest neighbours of observation $x_i$, denoted $x_i^k$
    randomly draw a number $t \in [0, 1]$
    create the synthetic sample $x^*_i = t \cdot x_i + (1 - t) \cdot x_i^k$
    append $x^*_i$ to $X^*$
  end
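A compact version of the interpolation step of Algorithm 4 is sketched below, using scikit-learn's NearestNeighbors; in practice one would typically rely on a maintained implementation such as the one in the imbalanced-learn package. For simplicity the sketch takes the number of synthetic samples directly rather than a percentage $d$.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote(X_min, n_synthetic, k=5, random_state=0):
        # Generate synthetic minority samples by interpolating towards nearest neighbours.
        rng = np.random.default_rng(random_state)
        # k + 1 neighbours are requested because each observation is its own nearest neighbour.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
        _, neigh = nn.kneighbors(X_min)
        synthetic = np.empty((n_synthetic, X_min.shape[1]))
        for s in range(n_synthetic):
            i = rng.integers(len(X_min))          # pick a minority 'parent'
            j = rng.choice(neigh[i][1:])          # one of its k nearest neighbours
            t = rng.uniform()                     # degree of interpolation in [0, 1]
            synthetic[s] = t * X_min[i] + (1 - t) * X_min[j]
        return synthetic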

A decision regarding sampling in imbalanced problems which is not often considered is the degree to which sampling is performed. The common solution is to fully balance the data (Branco et al., 2016), meaning that the sampled data has an equal number of observations in each class. This need not be the optimal solution, however. Khoshgoftaar et al. (2007) show that in the case of a very small minority class, retaining some imbalance gives better results. Weiss and Provost (2003) show that the optimal degree of sampling differs per dataset and per metric used to measure performance. When the AUC is considered, sampling the data to full balance indeed seems to be a good choice, though not necessarily the optimal one in every case. If explicit preferences on class weights can be formed, sampling can be performed exactly so as to represent these preferences.

2.6 Statistical Testing for Algorithm Comparison

In order to formalise empirical findings on differences in performance between various algorithms, it is desirable to support claims by means of statistical testing procedures. This practice is unfortunately less common than it should be, and the literature describing good testing procedures based on reasonable assumptions is still fairly recent. The testing procedures described here are all designed for an experimental framework as is present in this study: more than two algorithms are compared by means of some performance metric obtained on multiple datasets. Under the null hypothesis, the performance metrics of the different algorithms are draws from the same population. If this hypothesis is rejected, enough evidence is found to reject this assumption of similarity, and one can conclude that there is a significant difference in performance between two or more algorithms. Note that differences in performance could also be tested using only one dataset, by making some distributional assumptions on the performance metric of interest or by bootstrapping the procedure. Rejection of the null in such a setting would only allow the user to draw conclusions for the problem (dataset) to which the algorithms were applied. By basing results on a broader set of problems with different characteristics, the obtained conclusions are more generally relevant. Also note that the described testing framework is in no way restricted by the performance metric chosen, as long as the metric is calculated in the same way for each algorithm. This means that a user can compare algorithms by means of accuracy or the AUC, but also by metrics such as training time.

Throughout this subsection, $N$ denotes the number of datasets used, $k$ denotes the number of algorithms used and $c_i^j$ the performance score of algorithm $j \in \{1, ..., k\}$ on dataset $i \in \{1, ..., N\}$. The described methods and theory are primarily aimed at tests applicable when more than two algorithms are present, though pairwise testing procedures are also available.

Demšar (2006) was the first to address how statistical testing in the machine learning literature is often absent or performed poorly. Algorithms were often compared by average metrics, and algorithms with better average results were concluded to be superior. When statistical methods were applied to evaluate results, the tests were often parametric testing procedures. To compare the performance of $k$ algorithms by means of $N$ results, repeated-measures ANOVA and its accompanying post-hoc procedures would be the common parametric procedure. This and other parametric tests rely on assumptions which are at best very questionable when performances of algorithms are compared. For one, it relies on normality of the performance metrics. In general, there is no reason to assume normality for performance metrics: they are often bounded between 0 and 1 and are rarely found to be lower than 0.5, since this often implies worse-than-random performance. Another, and in this context more important, assumption is sphericity, which means that the differences between all algorithms are assumed to have the same variance. In the context of algorithm performance this assumption is by no means guaranteed and is most often violated. Demšar (2006) notes the above-mentioned pitfalls of parametric tests and proposes the use of non-parametric tests, which rely on weaker assumptions. These tests do not regard the actual obtained performance metric but instead look at rankings of algorithms: the best algorithm gets rank 1, the second-best rank 2, and so on. If one suspects that two algorithms perform the same, one would also expect the average ranks of these algorithms to be the same. Ranking-based tests exist and allow equal performance to be rejected based on comparative ranking rather than on the actual obtained scores. While the non-parametric tests require only minimal assumptions, independence of the samples is still needed. This means that one cannot, for instance, use multiple cross-validation results as multiple samples to increase $N$ (Demšar, 2006).

In the following, the general framework to test for statistical differences is first presented with simple tests. After that, more advanced tests are presented. Although these tests are more advanced and more powerful, they still fit in the same framework and are adaptations of the simpler tests.

2.6.1 The Non-parametric Testing Framework for Multiple Comparisons

Testing for statistical differences in performance between more than two algorithms encompasses three steps. First, an omnibus test is performed on all algorithms together. Under $H_0$ the results of all algorithms originate from the same population. Rejection of this hypothesis allows the user to conclude that some difference in performance exists; it does not give any information on which of the algorithms within the set of all available algorithms exhibit the difference in performance. The Friedman test (Friedman, 1937) is the omnibus test described by Demšar (2006). It can be seen as the non-parametric equivalent of the repeated-measures ANOVA test. The Friedman test regards the ranking of classifiers. Let $r_i^j \in \{1, ..., k\}$ be the rank associated with algorithm $j$ on dataset $i$. If two algorithms perform equally well, average ranks are assigned; hence, a tie for places 2 and 3 would result in both algorithms being given rank 2.5. The average rank of algorithm $j$, $R_j = \frac{1}{N} \sum_i r_i^j$, calculated for each algorithm, is used to construct the test statistic

$\chi^2_F = \frac{12N}{k(k + 1)} \left[ \sum_j R_j^2 - \frac{k(k + 1)^2}{4} \right].$  (20)

This statistic follows a $\chi^2$ distribution with $k - 1$ degrees of freedom. A more powerful F-test based directly on this statistic was developed by Iman and Davenport (1980). This statistic is

$F_F = \frac{(N - 1)\chi^2_F}{N(k - 1) - \chi^2_F}$  (21)

and follows an F-distribution with $k - 1$ and $(k - 1)(N - 1)$ degrees of freedom. Only when the omnibus test rejects equal performance of all algorithms may the user proceed to perform pairwise tests to investigate which algorithms do in fact differ in performance. To test whether two algorithms exhibit a significant difference, their difference in average ranks is compared to the critical difference

$CD = q_\alpha \sqrt{\frac{k(k + 1)}{6N}},$  (22)

where $q_\alpha$ depends on the significance level chosen by the user (values can be found in Demšar (2006)). This test is known as the Nemenyi (1963) test. How multiple algorithms are compared may vary depending on the aim of the study. In some cases the goal might be to provide a complete comparative overview of multiple algorithms; in this case all $k$ algorithms are compared to all $k - 1$ remaining algorithms, resulting in a total of $k(k - 1)/2$ comparisons. Other studies might propose a new algorithm and want to compare this new method to benchmark algorithms. If these benchmark methods have already been studied extensively, the study can abstain from comparing the benchmark methods among themselves and only needs to compare the new method to the benchmarks, resulting in $k - 1$ comparisons. The reason a researcher would want to omit certain comparisons in a study is the family-wise error rate (FWER). If $t$ (independent) pairwise comparisons are made and each of these tests is designed to have a size of $\alpha = 0.05$, the probability of making at least one false rejection in the entire set of comparisons is $1 - (1 - \alpha)^t$. If $k = 5$ algorithms are completely compared at $\alpha = 0.05$, this results in a probability of $1 - (0.95)^{10} \approx 0.40$. This inflated FWER harms the credibility of any conclusions drawn from a study when it is not controlled for.

To control the inflated FWER, the value of $\alpha$ needs to be adjusted in each of the pairwise comparisons. In this study these procedures are described as corrections of $\alpha$, but note that for each of them a complementary procedure can be derived which instead adjusts the p-values, which are then compared to the regular level $\alpha$. Regardless of how exactly this is done, the probability with which the tests reject is decreased substantially. Multiple adjustments exist, varying from rather crude methods to more sophisticated and computationally expensive ones. The latter result in higher power regardless of how the algorithms compare and are thus preferable in any case. Note that methods for controlling the FWER are applicable in any multiple-comparison context, not just in non-parametric tests as is the case here. First, some of the simpler methods are described to illustrate the general procedure of adjustment.

The most crude adjustment is known as the Bonferroni-Dunn procedure (Dunn, 1961). This procedure makes no use of any relation between the tested hypotheses and simply divides α by the number of tested hypotheses (t). While this method can be regarded as the safest procedure available, since it guards against the worst case possible, it also heavily reduces the power of each test, making it an undesirable approach. When the Bonferroni-Dunn procedure is applied, it becomes clear why a researcher might want to test only one algorithm against a set of benchmarks. If the performance of this one algorithm alone is of interest, performing k(k − 1)/2 comparisons would require reducing the value of α for each test to a greater extent than when only a subset of these comparisons (the k − 1 comparisons which include the algorithm of interest) is performed, resulting in (substantially) less power for the tests of interest. Refinements of the Bonferroni-Dunn procedure give a higher probability of rejection, but more comparisons will always decrease the probability of rejection of any single hypothesis.

Refinements of the Bonferroni-Dunn procedure all rely on some form of reasoning that not all combinations of (non-)rejections of hypotheses can be true at the same time, and thus that some comparisons become non-sensible after one or more comparisons have been done. This then allows the α of subsequent tests to be divided by a smaller number. The simplest adjustments to the Bonferroni-Dunn method are the Holm (1979) and Hochberg (1988) methods. Both methods are sequential procedures. Let H_1, ..., H_t be the hypotheses to be tested, p_1, ..., p_t their corresponding p-values before adjustment, H_(1), ..., H_(t) the hypotheses ordered by their unadjusted p-value (with H_(1) having the lowest p-value) and p_(1), ..., p_(t) the ordered p-values. Holm (1979) starts by comparing p_(1) to α/t, as would be done by Bonferroni-Dunn. If this test does not reject, none of the subsequent hypotheses are rejected. If this test rejects, Holm (1979) shows that at least one combination of outcomes becomes obsolete and that by comparing p_(2) to α/(t − 1) the probability of making a false rejection in the entire set of comparisons is still maintained to be equal to or less than α. If the second ordered hypothesis is not rejected, all subsequent hypotheses are not rejected either. If it is rejected, the next ordered p-value is compared to α divided by a factor one less than in the previous step. The method proposed by Hochberg (1988) is very similar to the one described above. The only difference is that it starts from the hypothesis with the highest p-value, compares it to α and, if it rejects, also rejects all other hypotheses. If not, the next ordered p-value is compared to α/2. If this rejects, all hypotheses with a lower p-value are rejected, and so on. Because of the ‘direction’ in which these methods work, they are often referred to as the Holm step-down method and the Hochberg step-up method.

These methods of controlling the FWER are very attractive when compared to the Bonferroni-Dunn method. Because α is divided by successively smaller numbers, they result in uniformly more powerful tests. Moreover, the methods require no analysis of the logical relations between the tested hypotheses and can be directly applied to any multiple comparison hypothesis test. Hence, there is virtually no reason to apply the Bonferroni-Dunn method.
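For concreteness, the sketch below gives one possible implementation of the Holm step-down and Hochberg step-up procedures as p-value adjustments; the function names are chosen for this example only, and the adjusted values can be compared directly to the desired level α.

```python
import numpy as np

def holm_adjust(pvalues):
    """Holm step-down adjustment: multiply the i-th smallest p-value by (t - i)
    and enforce monotonicity from the smallest p-value upwards."""
    p = np.asarray(pvalues, dtype=float)
    t = len(p)
    order = np.argsort(p)
    adjusted = np.empty(t)
    running_max = 0.0
    for i, idx in enumerate(order):
        running_max = max(running_max, (t - i) * p[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

def hochberg_adjust(pvalues):
    """Hochberg step-up adjustment: work from the largest p-value downwards and
    enforce monotonicity in the opposite direction."""
    p = np.asarray(pvalues, dtype=float)
    t = len(p)
    order = np.argsort(p)[::-1]          # largest p-value first
    adjusted = np.empty(t)
    running_min = 1.0
    for i, idx in enumerate(order):
        running_min = min(running_min, (i + 1) * p[idx])
        adjusted[idx] = min(running_min, 1.0)
    return adjusted

print(holm_adjust([0.01, 0.04, 0.03, 0.20]))
print(hochberg_adjust([0.01, 0.04, 0.03, 0.20]))
```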

Summarising, the general methodology to test for differences in performance between algorithms consists of three steps. First, a non-parametric omnibus test such as the Friedman test should be applied. If this test does not reject, the user can stop. If the test rejects, the user may proceed to post-hoc comparisons of the algorithms by pair. These pair-wise tests should preferably also be non-parametric, the Nemenyi test being an example. As a final step, the results of this post-hoc test should be adjusted to control for the inflated FWER. Only with these adjusted tests may the user draw valid conclusions on how two algorithms compare to each other.

2.6.2 Advanced Non-parametric Tests and Adjustments

In the above, the general framework for testing the performance of multiple algorithms is given. Here, more sophisticated and powerful tests and adjustments are described. These methods still fit within the previously described framework, meaning that they still take on the role of either omnibus test, post-hoc test or FWER-controlling adjustment. Any combination of omnibus test, post-hoc test and adjustment leaves the user with a valid procedure.


In Garcia and Herrera (2008) and García et al. (2010), more powerful methods for multiple comparisons are described. In Derrac et al. (2011) a practical overview of available methods is given. Not all available methods are discussed in this study. Instead, only the applied methods and the methods on which these are based are described. First, the Friedman aligned ranks test and its accompanying post-hoc test as described in García et al. (2010) are discussed. After that, the Hommel (1988) adjustment for testing one algorithm against a set of benchmarks is described. Finally, the adjustment by Bergmann and Hommel (1988) for testing a group of algorithms completely is described.

The Friedman aligned ranks test differs from the Friedman (1937) test by the fact that it considers one ranking of all results. The Friedman test ranks algorithms for each dataset separately. Results across different datasets are not directly comparable. For instance, a dataset where classes are difficult to predict will lead to lower accuracy than a dataset with easily predictable classes, regardless of the ‘goodness’ of the algorithms. To allow for comparisons across datasets, the results need to be aligned with each other. To do so, the results for each dataset are centred around the average performance obtained on that dataset. This operation does not affect the rankings within one dataset, as the shift is equal for all results for that dataset. However, results of different datasets are now all centred around their respective mean, allowing for comparison between any two re-centred performance measures. As a result, instead of N rankings of length k, one can now construct a single ranking of length k · N. The ranking r̃_i^j ∈ {1, ..., k · N} is known as the aligned rank of algorithm j for dataset i. Based on these aligned ranks, average aligned ranks for both datasets and algorithms can be constructed. For a dataset the average aligned rank is R̃_i = (1/k) Σ_j r̃_i^j. Similarly, for an algorithm the average aligned rank is R̃_j = (1/N) Σ_i r̃_i^j. The fact that all results are now comparable effectively introduces more information, which results in tests that generally are more powerful. García et al. (2010) provide a study on the power of this test in comparison to the regular Friedman test.

The test statistic used in the Friedman aligned ranks test is

\[
\chi^2_{FA} = \frac{(k-1)\left[\sum_j \tilde{R}_j^2 - (kN^2/4)(kN+1)^2\right]}{kN(kN+1)(2kN+1)/6 - \frac{1}{k}\sum_i \tilde{R}_i^2}. \tag{23}
\]

Under the null, this statistic follows a χ2 distribution with k − 1 degrees of freedom. Again, this omnibus test only allows one to conclude that some difference exists amongst the entire set of classifiers. A post-hoc test is required to gain insight into which algorithms exhibit a difference. In the following, let j and j′ denote the indices of two algorithms to be compared. The Friedman test has the associated post-hoc test with statistic

\[
z_F = \frac{R_j - R_{j'}}{\sqrt{\frac{k(k+1)}{6N}}}, \tag{24}
\]

which follows a standard normal distribution under the null. The Friedman aligned ranks test has an analogous post-hoc test based on the aligned ranks. Its statistic is

\[
z_{FA} = \frac{\tilde{R}_j - \tilde{R}_{j'}}{\sqrt{\frac{k(N+1)}{6}}} \tag{25}
\]

and follows the same standard normal distribution.
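Continuing the illustrative example from above, the following sketch computes aligned ranks and the post-hoc statistic of equation (25) for one pair of algorithms; the resulting p-value would still need to be adjusted for the inflated FWER as discussed next.

```python
import numpy as np
from scipy.stats import rankdata, norm

# results[i, j] = AUC of algorithm j on dataset i (illustrative values only)
results = np.array([[0.91, 0.89, 0.85],
                    [0.78, 0.80, 0.75],
                    [0.66, 0.64, 0.60],
                    [0.83, 0.83, 0.79]])
N, k = results.shape

# Align: centre every dataset around its own mean performance
aligned = results - results.mean(axis=1, keepdims=True)

# One single ranking of all k*N aligned results (rank 1 = best)
aligned_ranks = rankdata(-aligned.ravel()).reshape(N, k)

R_alg = aligned_ranks.mean(axis=0)      # average aligned rank per algorithm
R_data = aligned_ranks.mean(axis=1)     # average aligned rank per dataset

# Post-hoc comparison of algorithms j and j' via equation (25)
j, j_prime = 0, 2
z = (R_alg[j] - R_alg[j_prime]) / np.sqrt(k * (N + 1) / 6)
p_value = 2 * norm.sf(abs(z))
print(z, p_value)
```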

This post-hoc test allows one to obtain p-values for each pair-wise comparison but does not control for the inflated FWER. Sophisticated methods to control for the inflated FWER evaluate the relations between all tested hypotheses in order to assess whether any combinations are logically contradictory. As an example, consider three algorithms: A1, A2 and A3. The composite null hypothesis is defined by three pair-wise hypotheses, namely equality in performance for each pair {(A1, A2), (A1, A3), (A2, A3)}. If one of these hypotheses is rejected, at least one other hypothesis must be false. For instance, if equal performance between A1 and A2 is rejected, it is logically contradictory to hypothesise equal performance of A3 with both A1 and A2. The fact that in this setting it is impossible for exactly one hypothesis to be false can be leveraged to construct an adjustment method where the value α is divided by a value equal to or less than in the methods of Bonferroni-Dunn, Holm and Hochberg, resulting in uniformly more powerful tests.

A drawback in comparison to the step-down and step-up methods described before is that the logical relations between all hypotheses need to be assessed explicitly. The way in which these hypotheses should be assessed varies with the design of the experiment (whether one algorithm is compared to a set of benchmarks or a complete comparison is performed) and with the outcome of pair-wise tests throughout the process. This is in contrast to the simpler methods, where the adjustment of α is known without any assessment of how the hypotheses relate. Algorithms to assess the relations between hypotheses exist for both the k(k − 1)/2 and the k − 1 comparisons case. They result, however, in rather involved procedures and, more importantly, are very computationally expensive. If a complete comparison between more than 9 algorithms is performed, the method described below becomes computationally intractable and the user should resort to methods with less power.

Hommel (1988) proposes an algorithm to adjust p-values if k − 1 comparisons are considered. The method requires one to find the largest possible j for which p_(t−j+k) > kα/j holds for all k ∈ {1, ..., j}, where t is the number of tested hypotheses. Such a value of j does not need to exist. If no such j is found, all hypotheses are rejected. If such a j is found, all hypotheses H_i for which p_i < α/j are rejected. The algorithm to determine the adjusted p-values, which can be compared to the desired significance level α, can be found in García et al. (2010), page 2056.
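A direct, illustrative translation of this rejection rule into Python could look as follows; it only returns the reject/do-not-reject decisions, while the adjusted p-values would follow from the referenced algorithm in García et al. (2010).

```python
import numpy as np

def hommel_reject(pvalues, alpha=0.05):
    """Hommel procedure: find the largest j such that p_(t-j+k) > k*alpha/j
    for all k = 1..j, then reject all H_i with p_i < alpha/j."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    t = len(p)
    j_found = None
    for j in range(1, t + 1):
        # check p_(t-j+k) > k*alpha/j for k = 1..j (1-based indices)
        if all(p[t - j + k - 1] > k * alpha / j for k in range(1, j + 1)):
            j_found = j
    if j_found is None:
        return np.ones(len(pvalues), dtype=bool)      # reject every hypothesis
    return np.asarray(pvalues) < alpha / j_found

print(hommel_reject([0.009, 0.011, 0.03, 0.20]))
```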


Bergmann and Hommel (1988) propose a method to adjust p-values when a complete comparison is performed. To understand this procedure, one needs to be acquainted with the definition of an exhaustive set of hypotheses. A set of hypotheses H ⊆ {H1, ..., Ht} is exhaustive if all its elements can logically be true at the same time. In order to be able to address both hypotheses and p-values, let I ⊆ {1, ..., t} be a set of indices so that HI = {Hi : i ∈ I} is exhaustive. The method by Bergmann and Hommel (1988) requires one to first find all exhaustive sets for a set of hypotheses {H1, ..., Ht}. This step is very computationally expensive since every possible subset of hypotheses needs to be assessed. Some performance gain can be made by leveraging the notion that any subset of an exhaustive set must be exhaustive itself, but in general this assessment becomes intractable for more than 9 algorithms, or 36 hypotheses. Now let E denote the set of exhaustive index sets, that is, E = {I : HI is exhaustive}. For each index set I in E, one needs to check whether the smallest p-value associated with that set I is larger than α/|I|. Let Ã = {I : I is exhaustive, min(p_i : i ∈ I) > α/|I|} be the set of index sets for which this comparison holds. The union of the indices contained in this set, A = ∪Ã, is named the acceptance set by Bergmann and Hommel (1988) and contains exactly the indices of the hypotheses that are not rejected. One can also use this method to calculate adjusted p-values p̃_i = min(a_i, 1), where a_i = max(|I| · min(p_j : j ∈ I) : I is exhaustive, i ∈ I). These adjusted p-values can be compared to the desired significance level α.
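Assuming the exhaustive index sets have already been enumerated (the computationally expensive step discussed above), the acceptance set and the adjusted p-values can be computed with a short sketch such as the one below; the three-algorithm example enumeration at the end is illustrative only.

```python
def bergmann_hommel(pvalues, exhaustive_sets, alpha=0.05):
    """Given unadjusted p-values and a list of non-empty exhaustive index sets,
    return the acceptance set A and the Bergmann-Hommel adjusted p-values."""
    # Acceptance set: union of exhaustive sets whose smallest p-value exceeds alpha/|I|
    acceptance = set()
    for I in exhaustive_sets:
        if min(pvalues[i] for i in I) > alpha / len(I):
            acceptance |= set(I)

    # Adjusted p-value of H_i: largest |I| * min(p_j, j in I) over exhaustive I containing i
    adjusted = []
    for i in range(len(pvalues)):
        candidates = [len(I) * min(pvalues[j] for j in I)
                      for I in exhaustive_sets if i in I]
        adjusted.append(min(max(candidates), 1.0))
    return acceptance, adjusted

# Three algorithms give three pairwise hypotheses; the non-empty exhaustive sets
# are then the three singletons and the full set (illustrative enumeration)
p = [0.001, 0.20, 0.30]
sets = [[0], [1], [2], [0, 1, 2]]
print(bergmann_hommel(p, sets))
```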

The methods described above are the most powerful procedures for the k − 1 and k(k − 1)/2 comparisons cases which assure that the FWER is equal to or less than α (Derrac et al., 2011).

3 Method

This section describes which of the previously discussed methods are applied in this study. In addition, a new technique is proposed and its motivation is explained. The final subsection describes the methodology used to perform statistical tests based on the results. The primary aim of this study is to compare various methods which aim to optimise the AUC in binary classification tasks. Two experiments are conducted. The first experiment compares various classifiers that optimise an approximation of the AUC. The second experiment compares different ensembles which are aimed at ordering rather than classification and thus implicitly optimise the AUC. In both experiments, AUC-optimising methods are compared both with regular methods which are not adapted to imbalance and with more common solutions to imbalance that make use of a SMOTE sampling scheme.

In the previous section it is mentioned that sampling to a fully balanced dataset need not be optimal. Including multiple SMOTE-based algorithms, each sampling to a different degree, would likely violate the independence assumptions needed for valid testing. Results of these algorithms can be expected to lie close to each other. In an extreme case, one could create a large number of SMOTE-based algorithms with only minimal variation in the sampling rates.


This would result in a number of results being clustered together, and results obtained by other algorithms, which lie further away from this cluster, would then appear more extreme. Hence, the null hypothesis of results originating from the same distribution could very well be rejected more often for any problem. More concretely, it is suspected that the size of the statistical tests is inflated by including multiple algorithms that only vary by means of hyperparameter settings. No formal proof or study is given on this potential issue. It is left for future research to investigate how the tests behave in the presence of such algorithms and how one could control for these issues if they appear to be present. To be safe, only one SMOTE algorithm is included in each experiment. One could either determine the optimal sampling ratio for each dataset by internal cross validation or choose a ratio ex ante. Since the SMOTE algorithms serve as benchmarks, the most common practice is chosen for this study, namely to sample to a balanced dataset.
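As an indication of how such a SMOTE benchmark can be set up in practice, the following sketch oversamples the minority class of the training data to a fully balanced set before fitting a classifier; it assumes the imbalanced-learn and scikit-learn libraries, and the synthetic data and choice of classifier are illustrative only.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative imbalanced problem with a 5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class of the training data to a fully balanced dataset
X_res, y_res = SMOTE(sampling_strategy=1.0, random_state=0).fit_resample(X_train, y_train)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```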

One could argue about when two algorithms truly are ‘different’. The definition chosen here is that two algorithms are not different when they only differ in terms of hyperparameters. However, the four approximate AUC gradient-based algorithms are all closely related, and one could very well choose a definition that regards the four variations described below as variations of a single procedure. Because each of these algorithms optimises a cost function of a mathematically different form, the four algorithms are all treated as separate methods in this study.

3.1 Approximate Gradient Based AUC Optimisation

The first experiment revolves around the method described by Calders and Jaroszewicz (2007). In their paper, a polynomial approximation of the AUC is used to train a linear classifier. This method is extended in this study in two ways. First, not only polynomial approximations are considered but also the sigmoidal approximation as described in equation (12). Besides a different indicator approximation, the work of Calders and Jaroszewicz (2007) is also extended by the utilisation of a nonlinear classifier. This results in four distinct classifiers that optimise an approximate AUC by gradient-based methods. These classifiers are characterised by the following approximations of the AUC and its gradient. The first is the polynomial approximation with a linear classifier, giving AUC and gradient approximations equal to equations (16) and (17) respectively.

The second classifier also uses a polynomial approximation of the indicator curve but calculates scores with a logistic cumulative distribution function, much like one does when using a logistic regression algorithm. The score function now becomes

\[
r_i = \frac{1}{1 + \exp(-w'x_i)} . \tag{26}
\]

Let this function be denoted by σ(w'x_i) from now on. This function has the neat property that its derivative can be expressed in terms of the function itself, σ'(z) = σ(z)(1 − σ(z)).
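To illustrate the general mechanics of gradient-based optimisation of a smooth AUC surrogate, the sketch below maximises a sigmoidal pairwise surrogate with linear scores by plain gradient ascent; this corresponds in spirit to the sigmoidal approximation of equation (12) combined with a linear classifier, but the data, step size and stopping rule are illustrative and it is not the exact implementation used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def surrogate_auc_and_grad(w, X_pos, X_neg):
    """Sigmoidal AUC surrogate for linear scores r = Xw and its exact gradient.
    The surrogate averages sigmoid(r_i - r_j) over all positive/negative pairs."""
    s_pos = X_pos @ w                        # scores of positive observations
    s_neg = X_neg @ w                        # scores of negative observations
    S = sigmoid(s_pos[:, None] - s_neg[None, :])
    auc = S.mean()
    W = S * (1.0 - S)                        # derivative of the sigmoid per pair
    grad = (X_pos.T @ W.sum(axis=1) - X_neg.T @ W.sum(axis=0)) / S.size
    return auc, grad

# Toy gradient-ascent loop on synthetic imbalanced data (illustrative only)
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=0.5, size=(30, 5))
X_neg = rng.normal(loc=0.0, size=(470, 5))
w = np.zeros(5)
for _ in range(200):
    _, grad = surrogate_auc_and_grad(w, X_pos, X_neg)
    w += 0.5 * grad                          # fixed step size, illustrative
final_auc, _ = surrogate_auc_and_grad(w, X_pos, X_neg)
print(final_auc)
```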
