
Combining dynamic selection with data preprocessing for classification in imbalanced data sets

July 14, 2018

Author: Thomas Becker 11093471

Supervisor: dr. N.P.A. van Giersbergen

Second reader: dr. J.C.M. van Ophem


1 Statement of Originality

This document is written by Thomas Becker who declares to take full responsibility for the contents of this document.


Contents

1 Statement of Originality
2 Introduction
3 Theoretical background
  3.1 Problem formulation
  3.2 General solutions
    3.2.1 Data preprocessing
    3.2.2 Adaptation of algorithms and cost sensitive learning
    3.2.3 Ensemble methods
    3.2.4 Ensembles in imbalanced datasets
  3.3 Dynamic Selection
  3.4 Evaluation metrics
    3.4.1 ROC
    3.4.2 PR curves
  3.5 Research question
4 Methodology
  4.1 Data
  4.2 Creating the relevant datasets
  4.3 Data preprocessing
  4.4 Training the models
  4.5 Evaluating and selecting classifiers for prediction
  4.6 Prediction on test set
  4.7 Evaluating the predictions
  4.8 Analyzing the results
    4.8.1 Sub question 5
5 Results
  5.1 Analysis output
  5.2 Comparison to benchmark models
  5.3 Comparison of techniques
  5.4 Combining techniques
  5.5 Varying the data preprocessing
  5.6 Revisiting the research questions
  5.7 Robustness
    5.7.1 Parameter values in data preprocessing
    5.7.2 Parameter values in dynamic selection
    5.7.3 Classification algorithms
  5.8 Choosing the right algorithms


2 Introduction

Imbalanced learning, “the learning process for data representation and information extraction with severe distribution skews to develop effective decision boundaries to support the decision-making process” [30], is still a relatively new problem that has emerged from the recent rapid development of big data and machine learning applications. The problem of having only a few relevant observations of a particular class (usually called the minority class) in the dataset presents a researcher with a major challenge. It became a research topic at the start of the millennium, when support vector machines (SVMs) and decision tree methods were rapidly gaining popularity [58] [36] [56]. It occurs in a variety of applications, including oil spill detection, medical diagnosis, fault diagnosis, email foldering, face recognition, credit card fraud, conversion of online advertisements, tumor detection in CT scans and, more generally, anomaly detection [25] [29]. In many cases, failed performance on a specific dataset or problem gave rise to the more general question: "What can be learned from data which has a severe skewness in its distribution?". When standard techniques are applied to such datasets, they tend to ignore small groups of observations, as correctly classifying them gives only a marginal contribution to the error function [25]. This leads to a useless model, as it fails to distinguish between the most prominent and the less frequently occurring class. It is one of the topics that has seen a rapid growth in articles covering it, yet it is still regarded as one of the main challenges for data scientists [29]. Not only has it brought serious challenges to the data science community, it has also raised serious questions for many users in practical applications. How good can a prediction be when there are only a few observations to learn from, and is it even possible to extract general conclusions or build generally applicable models based on such datasets? Various solutions to this problem have been introduced in the past years; these are discussed in the next section. Some of them are widely applicable and can be generalized to the whole class of imbalanced datasets. Others are more problem specific and can only be applied to a certain field of research. This research aims to develop a generally applicable method of handling imbalanced datasets and therefore requires testing on datasets obtained from various fields. Even though this does not guarantee general validity, it does give strong evidence that the method can be used as a baseline technique for handling imbalanced datasets.

Recently, the work of Roy et al. (2018) has introduced Dynamic Selection (DS) into the toolbox of imbalanced class learning [49]. Working from the hypothesis that different classifiers perform better for different parts of the data space, they demonstrate that DS can indeed improve performance. This thesis aims to extend this research by retesting that conclusion, testing the robustness of the experimental framework, re-examining the various techniques that can be applied and investigating how the data preprocessing in dynamic selection can be varied in order to improve performance. The central research question is therefore:

To what extent can data preprocessing techniques combined with dynamic selection be applied to imbalanced class classification problems?


3 Theoretical background

3.1 Problem formulation

Suppose we have a set of N observations of outcomes {Y}, which can either be integers representing classes or real numbers representing a value, and which can also be vectors in the case of multiple outcomes. As this research focuses on binary classification, we will only consider binary outcomes. More information on imbalanced learning in the case of regression can be found in [54] and [55]. Together with these outcomes we have N observations of k features {X_1, ..., X_k} which we expect to contain useful information to predict Y by Ŷ = f(X_1, ..., X_k). In many cases the aim of a researcher is to predict a certain class or set of classes of Y. When the data contain only a small number of observations from this class or classes compared to another class or classes, the imbalanced class problem can occur [5]. The class(es) with a low number of observations (called the minority class(es)) are underrepresented in the data and are overruled by the far more frequent majority class data. A measure for imbalance within a dataset is the imbalance ratio (IR). It is defined as the number of occurrences of the minority class compared to the majority class [25] [29]. For multiclass problems the imbalance ratio can be extended by considering all minority class ratios or their (weighted) average.

However, the imbalance ratio is not the only thing that needs to be taken into consideration. Even though the minority class can have a small number of occurrences in the data, separation or modelling does not have to be difficult when the classes can easily be separated. In those cases, standard algorithms such as SVMs or neural networks can be used effectively [25]. Imbalance therefore relates not only to class size but also to class separability.

To further specify imbalance, it is important to distinguish between absolute and relative imbalance [29]. Relative imbalance refers to imbalance in which the imbalance ratio is inherent to the data space but where increasing the number of observations is expected to increase the number of occurrences of the minority class. Absolute imbalance refers to imbalance in which an increase of the sample size does not affect the number of occurrences of the minority class [29]. The former is the sort of imbalance that is considered in this research. Absolute imbalance is closely related to within class imbalance which occurs when the data within classes can actually be split up into several subclasses. This is also related to the problem of small disjuncts and is hard to solve [30] [25].

Figure 1: Example dataset containing a few observations of the minority class (blue) relative to the majority class (red). The minority class is split into sub-clusters.

Figure 1 shows an artificial dataset with 2 classes. The minority class (blue) occurs less often than the majority class (red). Moreover, the minority class suffers from within-class imbalance as it is split up into 2 parts.


3.2 General solutions

Traditionally, the solutions to imbalanced class learning can be applied at each stage of the analysis when working with an imbalanced dataset. Following the work of several authors, the solutions can be divided into 3 categories [5] [30] [46] [49] [37]:

1. Data preprocessing

2. Adaptation of algorithms and cost sensitive learning

3. Ensemble learning

3.2.1 Data preprocessing

The simplest way to remove any imbalance from the data is to balance it by ensuring that the number of instances in each class is roughly similar. However, in most cases there is simply no additional data available to balance the minority and majority class, as the minority class is scarce. The only solution left would be to balance the data by randomly removing a substantial part of the majority class data (undersampling) or replicating the minority class data (oversampling). Undersampling leads to a massive loss of information and could potentially increase the risk of making a Type I error in prediction. Even though this can work for slightly imbalanced datasets, we have to do better for more extreme cases. Kubat and Matwin (1997) therefore suggest a technique where a special selection of the majority class instances is maintained [39]. The key in their approach is to use Condensed Nearest Neighbour (CNN) and Tomek links (named after Ivan Tomek) to remove those observations that can be regarded as noise or redundant, so that the remaining ones can be regarded as representative of the whole majority class. Other scholars use similar techniques, either combining nearest neighbour techniques with an evaluation of classification or introducing an intelligent removal scheme based on the distance of observations to the majority and minority class data [43] [41]. Even though these solutions are elegant and have been shown to substantially increase classification performance, they still do not provide a solution when the number of instances in the minority class is very low. In addition, the nature of the problem in imbalanced class learning does not lie in the fact that the majority class is over-represented, but in the fact that the information in the minority class needs to be exploited in a better way.

Oversampling provides a simple and elegant solution for this. To stress the importance of the minority class, its observations can be duplicated (possibly with noise) to a point where the imbalance is removed. The downside to this approach can be seen immediately. Replicating the minority class data greatly increases the chance of overfitting the data, as it increases the effect of outliers and measurement errors. As a result, the performance on the test set is known to be very low when the data is preprocessed using oversampling by replication [30].

Figure 2: Example of both oversampling and undersampling using the same dataset as before. In oversampling a lot of minority class observations (blue) are added to the data, in this case with random noise. In undersampling, majority class observations (red) are removed. Both sampled datasets can be used for training standard algorithms.

The problem of overfitting can be partly removed by data generation techniques. Instead of simply duplicating the minority class data, artificial minority class data is generated based on the minority class data available [30]. One of the first such techniques that was found to be successful is the Synthetic Minority Oversampling Technique (SMOTE) [9], which has had several successors including Borderline-SMOTE and Safe-Level-SMOTE. The SMOTE algorithm generates artificial data x* for each minority class observation x_i on the line segment joining the observation with one of its K nearest neighbours x̄. In formula:

x* = x_i + (x̄ − x_i) · U[0,1].

The idea behind this approach is that the artificial data are very likely to exist in real life somewhere between the observations in the minority class, even though they are not present in the data itself. Contrary to random duplication, it does not copy the information but implicitly uses the feature values that distinguish the minority class from the majority class. This does require that the minority class observations are to some extent clustered together. Otherwise new synthetic data will just be generated within the majority class and the data will become useless in training. In practice this assumption turns out to be too strict in many cases. Minority class observations can lie completely surrounded by majority class observations or on the border between two or more classes.
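To make the data generation step concrete, the following is a minimal base-R sketch of the SMOTE interpolation described above (not the implementation used in this thesis); it assumes X is a numeric matrix containing only minority class observations with more than K rows, and the function name smote_sketch is hypothetical.

```r
# Minimal SMOTE-style sketch: generate n_new synthetic minority points by
# interpolating between a minority observation and one of its K nearest
# minority neighbours, i.e. x* = x_i + (x_bar - x_i) * U[0,1].
smote_sketch <- function(X, n_new, K = 5) {
  d <- as.matrix(dist(X))                        # pairwise Euclidean distances
  synth <- matrix(NA_real_, n_new, ncol(X))
  for (s in seq_len(n_new)) {
    i  <- sample.int(nrow(X), 1)                 # pick a minority observation x_i
    nn <- order(d[i, ])[2:(K + 1)]               # its K nearest minority neighbours
    j  <- sample(nn, 1)                          # pick one neighbour x_bar
    synth[s, ] <- X[i, ] + runif(1) * (X[j, ] - X[i, ])
  }
  synth
}
```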


These observations can be regarded as outliers or noise, and having many of them can indicate that more adequate features are needed to properly separate the classes. To distinguish between useful and useless minority class observations for oversampling, the Borderline-SMOTE and ADASYN [28] sampling techniques were proposed. In addition to this, data cleaning techniques such as Tomek links and CNN can also be combined with SMOTE to improve performance. Examples of this include OSS [39], CNN+TOMEK, NC, SMOTE+ENN and SMOTE+TOMEK [3]. Finally, substantial progress has been made in combining data preprocessing techniques with ensemble methods. These will be discussed in Section 3.2.3.

3.2.2 Adaptation of algorithms and cost sensitive learning

The second way of dealing with data imbalance is to adapt the classification algorithms themselves or to adapt the loss or cost function the algorithm is optimizing. In some cases this can be done fairly straightforwardly by tuning algorithm parameters. In other cases, algorithmic steps have to be altered. On the algorithm level, several successful alterations to SVMs [1] [4] [53], K-nearest neighbours [26] [32] and decision trees [12] have been made. The solutions vary from biasing the classifiers towards the minority class to adaptation of the cost function. The cost-sensitive learning framework has also been applied successfully to imbalanced datasets, of which the AdaCost family (a cost-sensitive alteration of AdaBoost) is one of the most well known [23]. In cost-sensitive learning, the cost of wrongly classifying a minority instance is penalized more heavily than the cost of wrongly classifying a majority instance. The simplest way of doing this is to take the cost ratio as the inverse of the imbalance ratio, but more elaborate methods vary the cost based on the local imbalance ratio, the updated imbalance ratio or the complete distribution [29].

The advantage of cost-sensitive learning, especially over data preprocessing, lies in the fact that it does not require extra computation. Secondly, it works with the data itself and does not require the creation of new artificial instances, as is the case in oversampling techniques.

There is a practical difficulty in applying cost-sensitive learning or adapting the algorithms. In many cases the standard classification methods have been optimized so that they are computationally efficient and can be used generally. When altering the algorithms, this efficiency can be lost. It also means that for every new algorithm that is developed in the future, an adapted version has to be made to make it suitable for imbalanced datasets. In addition, parameters have to be chosen to rebalance the cost function of the classification algorithm. It is not immediately clear which parameter values should be chosen, and the optimal choice will vary between datasets, between adaptations of the same dataset during online learning and possibly even within different parts of the data. It should also be noted that even though the adaptation of the cost function in some sense balances the error function, the chance of encountering a minority class observation in training is still very low. This means that the model training is very sensitive to the sampled data, and small measurement errors can have a huge impact on the parameters to be estimated.

3.2.3 Ensemble methods

In recent years, ensemble methods have become increasingly popular in the data science community [37] [30]. The general idea behind this approach is that if a whole pool of different classifiers is trained on the training set or on slightly altered versions of the training set, every classifier is competent in some part of the dataset but can potentially perform poorly in other parts. Ensemble methods therefore train multiple base classifiers and average their prediction outcomes to classify the data [40]. This means that in areas where a classifier performs poorly its predictions are outweighed by other classifiers' predictions (static ensemble) or completely ignored (dynamic ensemble). Ensemble methods have proven to be very powerful for any dataset and specifically for imbalanced class learning [5] [25] [38].

To apply ensemble methods, all the algorithms have to be trained properly and their performance on the different parts of the data space has to be evaluated. Ensemble methods are therefore usually combined with bagging or boosting. For both, the data is sampled multiple times from the training set with replacement. Classifiers are trained on each of the new training sets and their performance is evaluated by looking at their (local) accuracy. It is therefore possible to decide which classifier performs best for each observation in the training sets and for observations that lie close to it. For a new observation, one can then decide which of the trained classifiers should be used for classification. Several studies have shown that of boosting and bagging, the latter generally outperforms the former, besides the fact that it is easier to implement [49]. This leads to the following setup for combining bagging with ensemble learning:

1. Generate T samples of the training data of size N by drawing with replacement

2. On each of the samples a base classifier is trained, resulting in prediction probabilities of each instance in each sample

3. When a new instance has to be predicted, it is classified based on the posterior probabilities that are given to it by all the base classifiers trained on the bagged training sets.
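As an illustration of this setup, the sketch below shows the three steps with bagged logistic regressions in base R; the data frame train with a 0/1 outcome y and the function names are assumptions for illustration, not the code used for the experiments.

```r
# Steps 1-2: draw n_bags bootstrap samples (T in the text) and fit a base classifier on each.
bagged_fit <- function(train, n_bags = 10) {
  lapply(seq_len(n_bags), function(b) {
    boot <- train[sample.int(nrow(train), replace = TRUE), ]
    glm(y ~ ., data = boot, family = binomial)
  })
}

# Step 3 (static ensemble): average the posterior probabilities of all base classifiers.
bagged_predict <- function(models, newdata) {
  rowMeans(sapply(models, predict, newdata = newdata, type = "response"))
}
```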

While steps 1 and 2 are straightforward and only require choosing a suitable number for T, step 3 allows for more variation and freedom in the experimental setup. In static ensembles, the predictions of all base classifiers are used. This does not completely solve the problem of weak competence of some learners in some parts of the data space. Following the suggestion of [37], dynamic selection has been proposed to allow for variation in the pool of base classifiers. This selection method, called Dynamic Selection (DS), has been studied widely and has offered promising results [49] [6] [14]. It was especially shown to outperform static ensembles in complex datasets (specifically high class overlap but high separability and density per dimension [6]) and it will be further discussed in Section 3.3.

3.2.4 Ensembles in imbalanced datasets

The aforementioned approaches do not address any of the difficulties that occur in imbalanced class datasets. Although combining the classification outcomes of several classifiers based on their competence does in general improve classification accuracy in the minority class, the imbalance of the data remains intact in the (bagged) training sets. Therefore ensemble learning is often combined with either cost-sensitive learning or data preprocessing to remove the effect of imbalance in learning the base classifiers [49]. Even though ensemble learning and cost-sensitive learning have been applied successfully together [59], Roy et al. (2018) point out several advantages of combining ensemble learning with data preprocessing. In general, base classifiers perform better once the data is balanced, and the randomization of the data exploits the diversity in the pool of classifiers [49]. Moreover, as data preprocessing can be done completely independently of the classification algorithms, it can be applied without having to adapt them. This allows for more flexibility in the algorithms. They conducted an extensive experiment showing that ensemble learning, together with dynamic selection and data preprocessing, leads to better results than static ensemble learning.

Several data preprocessing techniques have been used in combination with ensemble learning including Random Under Sampling (RUS), SMOTE, Ranked Minority Oversampling (RAMO) and Random Balance (RB). Apart from RUS, all sampling techniques are discussed in section 4. RUS is disregarded due to the objections mentioned earlier. Even though SMOTE has proven to be a very powerful data preprocessing technique, it can be outperformed by RAMO [11]. RB is a third alternative which has proven to be an effective preprocessing technique combined with ensemble learning [19]. ADASYN has not yet been combined with ensemble learning and bagging even though it is known to perform well.

3.3 Dynamic Selection

The dynamic selection (DS) technique mentioned in Section 3.2.3 allows for combining the strengths of different classifiers or groups of classifiers in different parts of the dataset. Where ensemble learning allows multiple classifiers to be used, dynamic selection uses the information in the dataset to only use those classifiers that perform well for a specific set of observations. Given a set of base classifiers, the competence of each classifier in each part of the data space is determined. This is done using a validation or test set called the Dynamic Selection Dataset (DSEL) [6] [49]. From this validation the so-called regions of competence for each classifier are determined. In the third step of ensemble learning the selection of the different classifiers is based on the regions of competence of the classifiers. This can be done in different ways. Britto et al. (2014) distinguish between individual-based methods and group-based methods and within each class specify several subgroups [6] (see Table 3.3).

Main cluster      | Sub cluster           | Examples
Individual-based  | Ranking-based         | DCS-RANK, DS-MR
                  | Accuracy-based        | DS-LA OLA, DS-LA LCA
                  | Probabilistic-based   | A priori / a posteriori method, DES-M1, DES-M2, DES-CS
                  | Oracle-based          | KNE/KNORA, KNU
                  | Behavior-based        | DS-MCB, DSA-C, DECS-LA
Group-based       | Diversity-based       | DS-KNN, DS-Cluster, DES-CD
                  | Ambiguity-based       | DSA
                  | Data handling-based   | GDES

Another relevant distinction can be made in the number of classifiers selected. Whereas Dynamic Classifier Selection (DCS) selects one classifier for an instance in the DSEL, Dynamic Ensemble Selection (DES) selects an ensemble of classifiers. Ko et al. showed that in general DES outperforms DCS, although there is variation within the different algorithms of each method [35]. DES was also shown to perform well in small datasets [7] [8] and further studies have looked into the pool size in DES [48] [47].

A lot of research has focused on the performance of different DS methods. Britto et al. (2014) showed that no single dynamic selection method significantly outperforms the other ones, but they do state that simpler methods such as KNORA and DS-LCA generally outperform the more complex ones. Xiao et al. (2012) showed that a dynamic switching scheme between DCS-LA and GDES, combined with cost-sensitive learning, provided good results for a customer dataset [59]. Roy et al. considered two DES and DCS methods and found that KNORA performed best in general imbalanced dataset problems when combined with data preprocessing [49]. This partly overcomes the problem of DSEL imbalance as mentioned by [15]. As the DSEL is a key factor in dynamic selection, a significant imbalance in the DSEL would result in a poor selection procedure even when the training data is re-balanced. META-DES and META-DES.ORACLE are two recently developed (META-)DS strategies [16] [13]. These approaches take dynamic selection one step further by considering a meta framework to decide which classifiers are competent to classify an instance. In this meta framework, a prediction is made on the classifiers themselves, determining whether they are competent or not in a specific region of the dataset. In this training, the classifiers themselves, along with their performance, are the "observations" and the dependent variable is competence in a region. This setup is outside the scope of this thesis, but a variation on it is treated in Section 5.8.

3.4 Evaluation metrics

Having discussed different methods for dealing with imbalanced class learning, the question arises how such a variety of techniques can objectively be compared. Traditionally a dataset at hand is split into a training set and a test set, and performance is measured by using the training set to train a model and the test set to validate its performance. The observations in the test set are "fed" to the algorithm and the outcome of the algorithm is compared to the real value. However, in datasets with imbalanced classes, standard accuracy measures are not an adequate measure of performance [27] [33] [10] [42] [51] [34] [44] [45]. In a situation with an imbalance ratio of 1/99, a 90% accuracy on the majority class and a 10% accuracy on the minority class give an overall accuracy of 0.01 · 0.1 + 0.99 · 0.9 = 0.892 (89.2%), outperforming a classifier with an accuracy of 80% on both the majority and minority class. It is clear that such measures are useless, especially in cases where the minority class is the class of most interest.

One alternative is to balance the accuracy based on the total number of majority (Maj) and minority (Min) observations. Define T_X as having predicted class X correctly and F_X as having predicted class X incorrectly. The balanced accuracy is then defined as:

BalancedAccuracy = T_Min / (2(T_Min + F_Min)) + T_Maj / (2(T_Maj + F_Maj))

In the new metric, accuracy on the majority and the minority class is rewarded equally, independently of the imbalance ratio. In the previously mentioned example the first classifier would reach a balanced accuracy of 0.5 while the second one reaches 80%. The balanced accuracy is a good metric when we want the accuracy in all classes to be equally high. However, in many cases we are interested in how many times we predict the minority class correctly, without overpredicting it too often. A model that has 100% accuracy on the minority class but predicts 50% of all majority class observations incorrectly will have a balanced accuracy of 75% but still be useless in practice. This trade-off between making a type I error and optimizing accuracy in the minority class can be analyzed using ROC curves.
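The following is a small sketch of how the balanced accuracy can be computed from 0/1 vectors of predictions and true labels (1 = minority); the function name is hypothetical and each class's accuracy is taken as its within-class recall, which reproduces the examples above.

```r
# Balanced accuracy: average of the per-class accuracies.
balanced_accuracy <- function(pred, truth) {
  acc_min <- mean(pred[truth == 1] == 1)   # accuracy on the minority observations
  acc_maj <- mean(pred[truth == 0] == 0)   # accuracy on the majority observations
  (acc_min + acc_maj) / 2
}
# Example from the text: 10% minority and 90% majority accuracy give (0.1 + 0.9)/2 = 0.5.
```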

3.4.1 ROC

In many classification tasks a certain cutoff point has to be determined which classifies an observation into one class or the other. Logistic regression, for example, returns probabilities between 0 and 1 of belonging to a certain class. It is up to the researcher to determine the threshold for classification into a certain class. One way of determining such a threshold is to calculate Receiver Operating Characteristic (ROC) curves, which summarize the trade-off between the classification accuracy on the minority class and the type I error rate [30] [52] [22]. This is summarized in Figure 3 below.


Figure 3: Example ROC curves

In Figure 3 the value on the y-axis corresponds to the first part of the balanced accuracy calculation and equals the (unweighted) accuracy on the minority class. The value on the x-axis, the false minority rate FmR = F_min / (F_min + T_maj), can be regarded as the probability of making a false prediction on the minority class weighted by the total number of majority class observations. As a result, the two values no longer add up to 1, as was the case in the balanced accuracy. In selecting cutoff values, one typically selects those values that lead to a low false minority rate and a high true minority rate, thus corresponding to points that lie in the top left part of the graph. Typically an increase of the true minority rate also increases the false minority rate, as the model is biased towards the minority class. The slope of the ROC curve will therefore almost always be positive. As a result, the ROC curve gives insight into the trade-off between prediction accuracy on the minority class and the chance of making a type I error.

In addition to using the ROC curves as a method for selecting parameter values and cutoff values within a model, they can also be used to compare models. In Figure 3 it is clear that the selection method corresponding to the ROC curve in the bottom right corner performs significantly worse than the one corresponding to the curve in the top left corner, regardless of the parameter values or cutoff values chosen. This notion can be generalized to an objective metric for comparing the performance of classification techniques by calculating the area under the ROC curve (AUC) [30]. This area can be calculated using:

AUC = (S_min − n_min(n_min + 1)/2) / (n_min · n_maj)

where S_min corresponds to the sum of the ranks of the minority class observations in the test set when the test set is ordered increasingly by the predicted minority class probabilities, n_min is the number of minority class observations and n_maj the number of majority class observations.
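A minimal sketch of this rank-based AUC computation, assuming p holds the predicted minority class probabilities and y is 1 for minority and 0 for majority observations (the function name is hypothetical):

```r
# Rank-based AUC: (S_min - n_min(n_min + 1)/2) / (n_min * n_maj).
auc_rank <- function(p, y) {
  r     <- rank(p)                 # ranks when ordered by predicted probability
  s_min <- sum(r[y == 1])          # S_min: sum of ranks of the minority observations
  n_min <- sum(y == 1)
  n_maj <- sum(y == 0)
  (s_min - n_min * (n_min + 1) / 2) / (n_min * n_maj)
}
```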

3.4.2 PR curves

Over the years, research has shown that the ROC and AUC measures are not always effective metrics. Davis and Goadrich (2006) showed that in highly skewed datasets, ROC curves and AUC can evaluate an algorithm too optimistically [18]. They show that precision-recall (PR) curves are then more informative: a curve dominates another one in ROC space if and only if it dominates it in PR space, while dominating the area under one curve does not guarantee dominating the area under the other. The PR curves are constructed from the precision and the recall:

Precision = T_min / (T_min + F_min),    Recall = T_min / (T_min + F_maj)   [30].

Intuitively, precision refers to the accuracy among the observations predicted as minority, while the recall corresponds to the ability to distinguish between the majority and minority class. The PR curves are constructed by plotting the precision and recall values for different threshold values in the classification procedure. Naturally, algorithms that reside in the upper right corner of the PR plot are preferred. Precision and recall can furthermore be used to calculate two more evaluation metrics for imbalanced class datasets:

F_measure = (1 + β)² · Recall · Precision / (β² · Recall + Precision),    G_mean = sqrt( Recall · T_maj / (T_maj + F_min) ).

It should be noted that both metrics are useful for optimizing an algorithm by adjusting the classification threshold values, but they cannot be used to compare different algorithms as they require a threshold value.
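For completeness, a small sketch computing precision, recall, the F-measure and the G-mean from 0/1 predictions at a given threshold (1 = minority; counts follow the standard confusion matrix convention; the function name is hypothetical):

```r
pr_metrics <- function(pred, truth, beta = 1) {
  tp <- sum(pred == 1 & truth == 1)   # minority predicted as minority
  fp <- sum(pred == 1 & truth == 0)   # majority predicted as minority
  fn <- sum(pred == 0 & truth == 1)   # minority predicted as majority
  tn <- sum(pred == 0 & truth == 0)   # majority predicted as majority
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f_measure <- (1 + beta)^2 * recall * precision / (beta^2 * recall + precision)
  g_mean    <- sqrt(recall * tn / (tn + fp))
  c(precision = precision, recall = recall, f_measure = f_measure, g_mean = g_mean)
}
```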

Next to the ROC curves and PR curves, a third metric for evaluation has been proposed that allows for constructing confidence intervals and testing statistical significance. This can be achieved by defining a measure of cost and creating cost curves [30]. As the AUC has become the standard metric for comparison in imbalanced classification, it will be used for evaluation in this thesis.

3.5 Research question

As mentioned before, the central research question considered in this thesis is: "To what extent can data preprocessing techniques combined with dynamic selection be applied to imbalanced class classification problems?"

To be able to answer this, 5 sub questions are treated:

1. Does using a combination of data preprocessing and dynamic selection yield better classification results than standard classification?

2. Which combination of data preprocessing technique and dynamic selection method performs best?

3. Does varying the data preprocessing technique between the DSEL and the bagged datasets improve the performance in classification?

4. To what extent is the experimental framework of data preprocessing and dynamic selection sensitive to varying parameters and other changes to the experimental setup?

5. Can the performance of data preprocessing techniques and dynamic selection techniques be predicted based on the dataset characteristics?

The first sub question relates to the necessity of dynamic selection and data preprocessing instead of applying standard algorithms. Although there is strong evidence that data preprocessing works better, it has not been compared to standard approaches. The second research question addresses the various choices of balancing and dynamic selection techniques. Previous research has shown that RAMO together with KNU is the superior dynamic selection choice [49], but ADASYN has never been tested in this setting before. The third question deals with an addition to previously used techniques by varying the data preprocessing between the DSEL and the bagged datasets. As the DSEL is a key part of the dynamic selection procedure, its setup could possibly influence the dynamic selection outcome. The fourth question tackles the robustness and general applicability of the experimental setup. In a regular experimental setup, a vital part of the model fitting and optimization is choosing the optimal parameter values. This does not only apply to the model fitting but also to the data preprocessing and dynamic selection methods. It could very well be that the performance of the different preprocessing and selection techniques could be improved by using other parameter settings. On the other hand, one could argue that the experimental framework is so extensive that potential sensitivity to variation is removed. When fitting multiple models on partly random datasets to generate an average prediction, particular model characteristics could already be averaged out. This could potentially make data preprocessing and dynamic selection a generally applicable technique.

The last question tackles the problem of choosing the right methods. In practice, there will always be several combinations of techniques that work best, depending on the application or the nature of the dataset, and no single technique will outperform all others in every situation. As the data preprocessing techniques and dynamic selection techniques vary in approach and put more importance on different parts of the minority class observations or on local accuracy, having knowledge of several characteristics of the data space could potentially give insight beforehand into which method to use, without having to try all of them.


4 Methodology

This section describes the experimental setup which is used to train the models, test them and evaluate their performance. In order to determine the performance of dynamic selection combined with different data preprocessing techniques, the same experimental protocol is applied to each combination of dataset, data preprocessing method and dynamic selection technique. For each dataset the performance of the different classification protocols is then compared. Following the setup of Roy et al. [49], the analysis for each combination can be summed up by the following steps:

1. Read the data, augment it into the right format and split into training set and test set

2. Create the DSEL using a predefined preprocessing technique

3. Bagging of the training set and preprocessing of each bootstrapped dataset

4. Train classifiers on the bootstrapped datasets

5. Determine which classifiers will be used in each region using the DSEL (region of competence)

6. Process test data through all relevant classifiers, generate predictions based on regions of competence (with or without aggregation)

7. Evaluate performance of the predictions on the test set using the AUC measure

A schematic overview of the experimental setup is shown in Figure 4. All computations to obtain the results are done in R and the results are analyzed in Microsoft Excel.


Figure 4: Experimental setup in case of 6 base classifiers


4.1 Data

The data used come from the KEEL data repository, the HDDT collection and the UCI data repository [2] [49]. Within these repositories there are collections of imbalanced datasets which are frequently used when evaluating classification algorithms on imbalanced data [20] [12]. In total 105 datasets were processed for analysis. The data is augmented in such a way that all features are either binary or continuous variables. If necessary, a categorical predictive variable var consisting of k classes is split up into k − 1 binary variables var_1, ..., var_{k−1}. The majority class is always taken as the baseline class.

Figure 5: Overview of all datasets in the data repositories used for the analysis with the number of observations

In the analysis some datasets with a low number of observations could not be processed, as the data preprocessing techniques and dynamic selection techniques require a minimum number of observations. Especially in bagging, the number of minority class observations can become very low or even zero. For some datasets this is solved during sampling, but for datasets with a very small number of observations (less than 300) this required so many duplicated instances that overfitting was almost certain to occur. On the other hand, the datasets with over 10,000 observations took too long to process for all techniques. In the end, 56 datasets remained on which all the different combinations of techniques were tested.

4.2 Creating the relevant datasets

After being put in the right format, the dataset is split into a training set and a test set in a ratio of 3:1. From the training set the DSEL is created by randomly drawing with replacement from the training set to create an equally large dataset. It can potentially contain some observations multiple times. The same procedure is followed for the bagged training sets, which are used to train the different classifiers. In total 10 bagged datasets are created in the default setting. For datasets in which the minority class occurs fewer than 5 times, minority class instances are added to the training set, the DSEL and the bagged datasets so that each contains at least 5 minority class observations, to allow the data preprocessing techniques to work. This can potentially lead to overfitting of the models and has to be taken into account when interpreting the results.
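A minimal sketch of this construction in base R, assuming data is the augmented data frame (the seed, object names and sizes are illustrative, not the exact code used for the thesis):

```r
set.seed(1)
idx_train <- sample.int(nrow(data), size = floor(0.75 * nrow(data)))  # 3:1 split
train <- data[idx_train, ]
test  <- data[-idx_train, ]

# DSEL: bootstrap of the training set, the same size as the training set.
dsel <- train[sample.int(nrow(train), replace = TRUE), ]

# 10 bagged training sets, each a bootstrap sample of the training set.
bags <- lapply(1:10, function(b) train[sample.int(nrow(train), replace = TRUE), ])
```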

4.3 Data preprocessing

On both the DSEL and the bagged training sets data preprocessing techniques are applied to balance them. The following data preprocessing techniques are considered:


1. SMOTE creates new data points based on the procedure described in Section 3.2.1. In the default analysis the number of minority instances is multiplied by 10 (% oversampling = 1000) and the number of majority instances is taken as large as the new size of the minority class (% undersampling = 100). The number of nearest neighbours in each data generation step is 5.

2. ADASYN creates new random samples for every instance in the minority class but varies the number of synthetic samples based on the number of majority class neighbours. Given a training set D_i with N_min minority class observations and N_maj majority class observations, one first calculates

d = N_min / N_maj.

The total number of synthetic samples is then calculated according to

G = (N_maj − N_min) · β,

where β determines the new balance ratio between the classes; in this case β = 1 to create complete balance. Next the rank of each minority instance is calculated using

r_i = Δ_i / K,    i = 1, ..., N_min,

where Δ_i is equal to the number of majority class instances within the K nearest neighbours. In the default case K = 5, copying the setup of SMOTE. After normalizing the r_i so that they sum to one, the number of synthetic data points for each minority class observation x_i is g_i = r_i · G. Next, a SMOTE-like data generation process is used on the minority class observations using the numbers of synthetic data points g_i. For each x_i, a random sample x_imin is drawn from the minority class observations within its K nearest neighbours and a synthetic data point x* is created using

x* = x_i + (x_imin − x_i) · U[0,1].

This results in g_i synthetic data points for each x_i [28]. A minimal sketch of this procedure is given after this list.

3. RAMO samples from the minority class based on an estimated distribution and then creates synthetic minority samples. More formally, for each item x_i in the minority class its k_1 (= 10) nearest neighbours are determined from the whole dataset. Next the rank of each minority instance r_i is calculated as

r_i = 1 / (1 + exp(−α · δ_i)),

where δ_i is the number of majority instances in the set of nearest neighbours and α is a user-defined value, in this case 0.3 as is done in [49] and [20]. From the set of minority instances, instances are then sampled according to their weights r_i. Next the SMOTE procedure is applied to the sampled instances with number of neighbours 5 (k_2).
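As announced in the description of ADASYN above, the following is a minimal base-R sketch of that procedure; X_min and X_maj are assumed to be numeric matrices of minority and majority observations with more than K minority rows, and the function name adasyn_sketch is hypothetical (the thesis experiments do not use this code).

```r
adasyn_sketch <- function(X_min, X_maj, beta = 1, K = 5) {
  n_min <- nrow(X_min); n_maj <- nrow(X_maj)
  G <- round((n_maj - n_min) * beta)             # total number of synthetic points
  X_all <- rbind(X_min, X_maj)
  d_all <- as.matrix(dist(X_all))[1:n_min, ]     # distances from minority points to all points
  r <- sapply(1:n_min, function(i) {
    nn <- order(d_all[i, ])[2:(K + 1)]           # K nearest neighbours in the whole dataset
    sum(nn > n_min) / K                          # share of majority neighbours: Delta_i / K
  })
  r <- if (sum(r) > 0) r / sum(r) else rep(1 / n_min, n_min)  # normalize so the weights sum to one
  g <- round(r * G)                              # synthetic points per minority observation
  d_min <- as.matrix(dist(X_min))                # distances within the minority class
  synth <- lapply(1:n_min, function(i) {
    if (g[i] == 0) return(NULL)
    nn_min <- order(d_min[i, ])[2:(K + 1)]       # K nearest minority neighbours
    t(sapply(1:g[i], function(s) {
      j <- sample(nn_min, 1)
      X_min[i, ] + runif(1) * (X_min[j, ] - X_min[i, ])   # x* = x_i + (x_j - x_i) * U[0,1]
    }))
  })
  do.call(rbind, synth)
}
```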

When comparing the 3 algorithms it is clear that RAMO and ADASYN are advanced versions of SMOTE. Both algorithms try to increase the importance of minority class observations that are surrounded with majority class observations. This could possibly increase classification performance in small minority class regions but could on the other hand highlight outliers and/or noisy instances. Of these techniques, RAMO has been shown to perform better than SMOTE but ADASYN has never been considered in this context yet [49].

The data preprocessing techniques are applied to all bagged datasets and to the DSEL. Whereas Roy et al. (2018) use the same preprocessing techniques for the bagged datasets and the DSEL, they are varied for this analysis. All possible combinations of the 3 methods are considered and a comparison is made between the datasets in which the preprocessing techniques are equal and those in which the preprocessing techniques vary.

4.4 Training the models

After the datasets are preprocessed, a classifier is trained on each one of them. Three main classifiers are considered: logistic regression, SVM and C50. The first two are straightforward classification methods. For the SVM the kernel used is vanilladot (linear) and the cost parameter in the Lagrangian is taken to be 10. The C50 algorithm is an improvement of the popular C4.5 algorithm, a decision tree algorithm widely used in imbalanced classification [3] [12], especially when applied in ensembles [20]. All three algorithms generate a model based on the training set which can then be used for prediction on both the DSEL and the test set. In the default setup, one of the three classifiers is chosen randomly to train on each bagged dataset.
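As an illustration, training one of each base classifier on a preprocessed bagged set bag (a data frame with a binary factor outcome y) could look as follows; the kernlab and C50 packages are assumed to be available, the settings mirror those named above, and the probability column selection assumes the minority class is the second factor level.

```r
library(kernlab)  # ksvm with a linear ("vanilladot") kernel
library(C50)      # C5.0 decision trees

fit_lr  <- glm(y ~ ., data = bag, family = binomial)
fit_svm <- ksvm(y ~ ., data = bag, kernel = "vanilladot", C = 10, prob.model = TRUE)
fit_c50 <- C5.0(y ~ ., data = bag)

# Predicted minority class probabilities on new data (e.g. the DSEL or the test set):
p_lr  <- predict(fit_lr,  newdata = dsel, type = "response")
p_svm <- predict(fit_svm, newdata = dsel, type = "probabilities")[, 2]
p_c50 <- predict(fit_c50, newdata = dsel, type = "prob")[, 2]
```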

4.5 Evaluating and selecting classifiers for prediction

To create an ensemble for prediction various dynamic selection techniques are considered. Each classifier is used to create predictions on the DSEL and the selection technique is used to create a pool of classifiers for each minority class observation in the test set. In the default setup, the 5 nearest neighbours are used to evaluate the performance [49][20]. The following techniques are considered:


1. Static In static selection all classifiers are used in predicting. All classifiers are assumed to perform equally well on the minority class observations so there is no evaluation of prediction required on the DSEL.

2. RANK [50] [6]. The RANK procedure selects one classifier based on the local rank within the region of the minority class. The rank of a classifier within a local region is determined by the number of consecutive nearest minority class neighbours of a test set instance that are correctly classified by the classifier in the DSEL. The classifier with the highest rank is used for classification of that minority class observation.

3. LCA [57] [6]. The LCA algorithm determines the local accuracy of the classifier and selects the classifier with the highest accuracy. The accuracy is calculated as the percentage of correctly classified observations within the local neighbourhood of a test instance that belong to the same class.

4. KNE [35]. The KNE algorithm selects a pool of classifiers based on the local accuracy. It selects those classifiers that achieve perfect accuracy within a local region of the test instance, in this case the 15 nearest neighbours. Only classifiers that achieve 100% accuracy are used. In the case where there is no such classifier, the algorithm stops.

5. KNU [35]. The KNU algorithm also selects a pool of classifiers based on accuracy in the local region, but selects all classifiers that have an accuracy higher than 0.

When looking at the different algorithms, several differences emerge. Whereas RANK promotes perfect accuracy within a relatively small region, LCA selects classifiers that have a relatively high accuracy in a much larger region. The KNE and KNU algorithms exhibit a similar difference. As KNE requires perfect accuracy, it will generally select a small number of classifiers, whereas KNU will in practice only remove a small number of classifiers, possibly even preserving the whole pool. Of all selection techniques KNU has been shown to perform best [49], but never in combination with varying preprocessing techniques, and its robustness to varying parameters has not yet been tested.
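To make the selection step concrete, the sketch below implements a KNU/KNE-style rule: given a logical DSEL correctness matrix (rows = DSEL instances, columns = base classifiers) it returns the indices of the classifiers selected for one test instance. The names and the exact neighbourhood handling are illustrative assumptions, not the thesis code.

```r
select_pool <- function(dsel_x, correct, x_new, k = 5, require_perfect = FALSE) {
  d    <- sqrt(colSums((t(dsel_x) - x_new)^2))     # Euclidean distance to every DSEL point
  roc  <- order(d)[1:k]                            # region of competence: k nearest DSEL instances
  hits <- colSums(correct[roc, , drop = FALSE])    # correct classifications per base classifier
  if (require_perfect) which(hits == k)            # KNE: only classifiers with 100% local accuracy
  else which(hits > 0)                             # KNU: every classifier with at least one hit
}
```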

4.6 Prediction on test set

All dynamic selection techniques result in one or more competent classifiers for each instance in the test set. Using the predicted probabilities of each base classifier on the test set, the average predicted probability is calculated for each test instance, using only those classifiers selected by dynamic selection. This results in an average predicted probability for every test instance.

4.7 Evaluating the predictions

To evaluate the accuracy of the predictions, the predicted probabilities are compared to the actual classes using the AUC measure.

4.8 Analyzing the results

The previous sections described the various data preprocessing techniques and dynamic selection techniques that are compared. In total we compare 3 data preprocessing techniques for the DSEL, 3 data preprocessing techniques for the bagged training sets and 5 dynamic selection techniques, which results in 45 different experimental setups. All 45 of them are applied to the 56 datasets, resulting in 56 lists containing the performances of the different setups. Besides the complex models, one regular logistic regression is run on the data as well as one logistic regression on a balanced dataset (using RAMO). These 2 models will be referred to as the benchmark models. The lists of results are then ordered to create a ranking of the methods.

For comparison to logistic regression and regular data preprocessing, the average AUC of the extensive setup is compared to the AUC of logistic regression for each dataset. To determine the best data preprocessing technique and dynamic selection method, the average rank of each method is compared as well as the winrate of the method over the datasets. The winrate is defined as the percentage of times an experimental setup has the highest rank among all setups compared. To evaluate the effect of varying the data preprocessing, the average AUC is calculated for each dataset for the cases in which the data preprocessing is equal and the cases where the data preprocessing between the bagged datasets and the DSEL differs. The winrate and average ranks are reported. Robustness checks are made by considering various combinations in the experimental setup and ranking them according to their AUC value. The average ranks of the different experimental setups are then reported.
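A small sketch of how the ranks and winrates can be derived from a matrix auc of AUC values (rows = datasets, columns = experimental setups; the object names are illustrative):

```r
ranks    <- t(apply(auc, 1, function(a) rank(-a)))  # rank per dataset, best AUC gets rank 1
avg_rank <- colMeans(ranks)                          # average rank of each setup
winrate  <- colMeans(ranks == 1)                     # share of datasets where a setup ranks first
```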

4.8.1 Sub question 5

In the last sub question, the winning methods are compared on six dataset characteristics: the number of variables, the number of observations, the imbalance ratio, the Dunn index, the Davies-Bouldin index and Fisher's discriminant ratio. The number of observations and the number of variables can easily be extracted from the data. The other measures are calculated as follows:


1. Imbalance ratio. As previously defined, the imbalance ratio measures the amount of imbalance in the dataset by calculating the ratio between the number of minority and majority class observations.

2. Davies-Bouldin index (DB index) [17]. Introduced as a measure to evaluate clustering algorithms, the Davies-Bouldin index relates the within-cluster distance to the between-cluster distance based on a predefined distance measure. More formally, having defined S_i as a measure of the within-cluster distance of cluster i and M_{i,j} as a measure of the distance between clusters i and j, one calculates

R_{i,j} = (S_i + S_j) / M_{i,j}

as the measure of separation. Next, one calculates D_i = max_{j≠i} R_{i,j} for each cluster, after which the DB index is calculated as the average of D_i over all clusters. For a 2-cluster problem, the DB index simplifies to R_{1,2} and is of course symmetric. Large values of the DB index indicate large within-cluster distances and a low distance between clusters, while small values indicate well separated classes. The value of the DB index depends heavily on the distance metric chosen to calculate M_{i,j} and S_i. In this case the Euclidean distance is chosen, as most datasets contain mainly continuous, normalized variables.

3. Dunn index [21]. The Dunn index is another slightly simpler measure to evaluate clustering performance. It measures the ratio between the smallest distance between observations not in the same cluster and the largest distance within a cluster. Large values of the Dunn index indicate compact clusters that lie far apart while small values indicate non-compact clusters that lie close together and possibly overlap. In this case the distance metric used is again the Euclidean distance.

4. Fisher's discriminant ratio [24] [31]. Apart from being a measure of cluster separation, Fisher's discriminant ratio provides a feature-wise comparison of two clusters i and j on a feature k:

f_k = (μ_{ki} − μ_{kj})² / (σ_{ki}² + σ_{kj}²),

where μ_{ki} and σ_{ki} are the sample mean and standard deviation of feature k within class i. Having calculated the discriminant ratio, the maximum value over all features can be used as a measure of the variation between the clusters. Naturally, a high value means that the clusters differ substantially on the different features and are therefore probably easily separable. Low values indicate a lot of similarity between the clusters and will in general mean that classification will be hard.
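A brief sketch of two of these characteristics for a two-class dataset, assuming X is a numeric feature matrix and y a 0/1 vector with 1 = minority (the function names are hypothetical):

```r
imbalance_ratio <- function(y) sum(y == 1) / sum(y == 0)

# Maximum per-feature Fisher discriminant ratio: (mu_k1 - mu_k0)^2 / (s_k1^2 + s_k0^2).
fisher_ratio <- function(X, y) {
  mu0 <- colMeans(X[y == 0, , drop = FALSE]); mu1 <- colMeans(X[y == 1, , drop = FALSE])
  v0  <- apply(X[y == 0, , drop = FALSE], 2, var)
  v1  <- apply(X[y == 1, , drop = FALSE], 2, var)
  max((mu1 - mu0)^2 / (v0 + v1))
}
```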


5 Results

This section describes the results of the analysis. First, the preliminary results are presented to give an overview of the overall analysis. Next, the detailed results are discussed and the various sub research questions are treated after which the results of the sub questions are used to revisit the main research question. The fourth part deals with the robustness of dynamic selection and data preprocessing and the fifth section examines the winning techniques compared to dataset characteristics.

5.1 Analysis output

All 56 datasets are processed through 47 different combinations of data preprocessing methods and dynamic selection methods and the results are evaluated with the AUC measure. This results in 2633 AUC values, based on which the ranking of the various combinations is made per dataset. In 5 cases the AUC value was equal to 1 for all combinations. These were very small datasets in which all models overfitted the data. These datasets are left out of the remaining analysis. Table 1 shows the average AUC values of the datasets for the basic logistic regression and the more complex models, grouped by their imbalance ratio.

Imbalance ratio | Number of datasets | Average AUC of logistic model | Average AUC of logistic model + data preprocessing | Average AUC of complex model
0 - 0.05   | 18 | 0.900 | 0.896 | 0.853
0.05 - 0.1 | 12 | 0.917 | 0.926 | 0.925
0.1 - 0.15 | 15 | 0.936 | 0.934 | 0.936
> 0.15     | 11 | 0.863 | 0.842 | 0.853

Table 1: Average AUC for all datasets

The results show that the average AUC values lie very close together for all models and no hard conclusions can be drawn from them. They do show that the average AUC in highly imbalanced datasets is slightly lower than in less imbalanced datasets, for both the logistic regression and the more advanced models. This again underlines the difficulty of classification in highly imbalanced datasets. The results do not show that the average AUC increases significantly as the datasets become less imbalanced, but this is partly due to the fact that the average also contains complex models that perform poorly. It therefore makes sense to further investigate which combinations of preprocessing and dynamic selection perform best and to eliminate those combinations that perform poorly. Besides the fact that the average contains predictions of both well performing and poorly performing models, the average AUC over a group of datasets cannot be used as a measure to compare performance, as the AUC value is dataset specific. It makes more sense to evaluate based on the rank of the model within the dataset.

5.2 Comparison to benchmark models

To effectively compare the complex setup to both logistic regression and data preprocessing alone, the performance of the models is evaluated within each dataset. For each dataset it is determined whether the complex model outperforms the benchmark model, and the average AUC and maximum AUC are compared to the AUC of the benchmark models. The average AUC increase of the complex setup compared to the benchmark setup is also calculated.

Imbalance ratio | Average AUC > LR | Max AUC > LR | Average increase in AUC of average | Average increase in AUC of maximum
0 - 0.05   | 33 %  | 87 %  | -1 % | 17 %
0.05 - 0.1 | 55 %  | 82 %  | 2 %  | 11 %
0.1 - 0.15 | 40 %  | 73 %  | 0 %  | 4 %
> 0.15     | 55 %  | 100 % | 1 %  | 8 %

Table 2: Results of comparing the AUC value of the logistic regression with the average AUC value and the largest AUC value of the complex models. The numbers in the second and third columns indicate the percentage of times the complex model had a higher average and maximum AUC than logistic regression. The fourth and fifth columns indicate the average percentage increase in AUC of the average and of the best performing complex model compared to logistic regression.


Imbalance ratio | Average AUC > LR+preproc | Max AUC > LR+preproc | Average increase in AUC of average | Average increase in AUC of maximum
0 - 0.05   | 60 % | 100 % | 4 %   | 23 %
0.05 - 0.1 | 45 % | 82 %  | 0 %   | 9 %
0.1 - 0.15 | 60 % | 93 %  | 15 %  | 4 %
> 0.15     | 55 % | 91 %  | -14 % | 7 %

Table 3: Results of comparing the AUC value of the logistic regression with data preprocessing to the average AUC value and the largest AUC value of the complex models.

The results in Tables 2 and 3 summarize the analysis. In comparison to logistic regression, the complex setup of data preprocessing + dynamic selection does not perform better in general, as the average AUC is only higher in roughly 50% of the cases. The same observation can be made when comparing to logistic regression with data preprocessing. For both, the average increase in AUC from adding dynamic selection is also marginal. However, when looking at the best performing complex model within each dataset, the difference compared to both benchmark models is quite large. Both the number of times the simple models are beaten by the best performing model and the average increase of the best performing model are substantial. The results furthermore show that there is no clear difference between the imbalance ratio classes when looking at the average AUC. However, this difference is quite distinct when looking at the maximum AUC, as the average increase in AUC is higher for the more imbalanced classes than for the less imbalanced classes. The results have two main implications. Firstly, we cannot simply conclude that combining data preprocessing and dynamic selection offers a solution for classification in imbalanced datasets regardless of the method used; compared to both simple models the differences are simply too small. Secondly, the maximum AUC is almost always higher than the AUC of the simple models, which confirms earlier findings by [49]. This shows the need for further research into the best performing techniques.

5.3 Comparison of techniques

To further examine the possibilities of combining data preprocessing with dynamic selection, this section compares all the different techniques with each other. When comparing the preprocessing techniques, the average AUC is calculated for each technique over all dynamic selection techniques, and the reverse is done when comparing the dynamic selection techniques. For the data preprocessing techniques only the combinations in which the preprocessing of the DSEL and the bagged datasets is the same are considered. The tables below show the comparisons of the techniques.

Preprocessing method | Percentage of times method has highest AUC | Average AUC | Average rank
ADASYN               | 29 %                                       | 0.895       | 23.5
RAMO                 | 27 %                                       | 0.885       | 20.5
SMOTE                | 42 %                                       | 0.900       | 24.7
All equal            | 2 %                                        |             |

Table 4: Results for different preprocessing techniques

Table 4 shows that the different preprocessing techniques result in similar average AUC values, with SMOTE having a slightly higher win rate than both RAMO and ADASYN. Surprisingly, the average rank of RAMO (20.5) is lower than the average rank of SMOTE (24.7) even though its win rate is lower. RAMO therefore appears to be the more robust method across applications, as it has a better overall performance than SMOTE. This makes sense, as the RAMO algorithm is an extension of SMOTE and behaves almost identically to it when most minority class instances lie within a cluster of minority class instances. The two methods only differ on datasets that contain minority class instances separated from such clusters, because RAMO assigns a relatively higher probability of generating synthetic instances around these outlying minority class instances. It should also be noted that in most cases in which ADASYN had the highest AUC, RAMO also had a higher AUC than SMOTE. This partly confirms the findings of [49] and shows that ADASYN is not a useful preprocessing technique in this context.
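The difference between the two oversamplers can be illustrated with a small sketch of RAMO's sampling weights, assuming the standard RAMOBoost-style formulation in which a minority instance surrounded by many majority neighbours receives a larger selection probability. The parameter names k1 and alpha match those discussed in Section 5.7.1, but the code itself is an illustrative simplification, not the implementation used for the experiments.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def ramo_selection_weights(X, y, minority_label=1, k1=5, alpha=0.3):
        """Selection probabilities for minority instances (RAMO-style sketch).

        Each minority instance i gets weight 1 / (1 + exp(-alpha * delta_i)),
        where delta_i is the number of majority neighbours among its k1 nearest
        neighbours; the weights are normalized to a probability distribution.
        """
        X_min = X[y == minority_label]
        nn = NearestNeighbors(n_neighbors=k1 + 1).fit(X)        # +1: the point itself
        _, idx = nn.kneighbors(X_min)
        delta = (y[idx[:, 1:]] != minority_label).sum(axis=1)   # majority neighbours
        w = 1.0 / (1.0 + np.exp(-alpha * delta))
        return w / w.sum()

    # Synthetic instances are then generated SMOTE-style: sample a minority seed
    # according to these weights and interpolate towards one of its k2 minority
    # nearest neighbours, x_new = x_seed + u * (x_neighbour - x_seed), u ~ U(0, 1).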


DS method | Percentage of times method has highest AUC | Average AUC | Average rank
KNE       | 6 %                                        | 0.87        | 27.9
KNU       | 27 %                                       | 0.91        | 19.7
LCA       | 29 %                                       | 0.90        | 20.2
RANK      | 13 %                                       | 0.88        | 27.3
Static    | 23 %                                       | 0.91        | 19.5
All equal | 2 %                                        |             |

Table 5: Results for different dynamic selection techniques

Table 5 shows the results of comparing the different dynamic selection techniques. Out of all considered techniques, LCA has the highest AUC in 29% of all datasets, followed by KNU and static selection with 27% and 23% respectively. KNE has the highest AUC in only 6% of the cases, while RANK has the highest AUC in 13% of all datasets. RANK and KNE also have a relatively low average AUC compared to the other techniques, and their average ranks are quite high. Although different in implementation and outcome, both techniques focus on high classification performance in a local region: RANK selects the locally most accurate classifier and KNE selects only those classifiers that achieve 100% accuracy in the region of competence. Both methods therefore disregard classifiers that perform only slightly worse than the optimal classifiers in the local region, and for many datasets this exclusion criterion is too strict. Compared to LCA, the local accuracy criterion in RANK is much stricter, as it looks at the accuracy on the consecutive neighbours instead of on all minority class nearest neighbours. This results in LCA performing much better and having the highest AUC value most often. LCA therefore seems a reasonable alternative to both KNE and RANK, as it takes local accuracy into account without being too strict.

Alternatively, the KNU and static selection algorithms use a much larger pool of classifiers. In many cases the classifier pools of both will be similar, as KNU selects all competent classifiers within the region and static selection simply selects all of them. Only when a classifier has zero accuracy in the region is it disregarded, which filters out classifiers that overfit noisy data in the dataset. The results in Table 5 show that both KNU and static selection generally perform better in terms of average rank than LCA, even though their win rates are slightly lower than that of LCA. This confirms the findings of [49], although they only regard average rank as a comparison measure. These results pose an interesting dilemma for implementation. In general, static selection and KNU perform better than LCA and can be regarded as the safe choice in ensemble selection. However, in specific cases LCA can be a better alternative to both, with a much higher AUC. It is not yet clear whether there is any pattern as to when to apply which method, although the analysis showed that KNU performed best on datasets with less imbalance. Whether this holds more generally is treated further in Section 5.8.
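For concreteness, the sketch below illustrates the selection rules discussed above on a precomputed region of competence. It is a simplified illustration under the usual definitions of KNORA-Eliminate (KNE), KNORA-Union (KNU) and static selection, not the exact implementation used in the experiments.

    import numpy as np

    def region_of_competence(x_query, X_dsel, k=7):
        """Indices of the k nearest DSEL instances to the query point."""
        dists = np.linalg.norm(X_dsel - x_query, axis=1)
        return np.argsort(dists)[:k]

    def select_ensemble(correct, method="KNU"):
        """Select classifiers given a boolean matrix `correct` of shape
        (n_classifiers, k): correct[c, j] is True if classifier c labels
        the j-th neighbour in the region of competence correctly."""
        if method == "KNE":    # only classifiers that are perfect in the region
            chosen = np.where(correct.all(axis=1))[0]
        elif method == "KNU":  # every classifier correct on at least one neighbour
            chosen = np.where(correct.any(axis=1))[0]
        else:                  # static selection: keep the whole pool
            chosen = np.arange(correct.shape[0])
        # Simplification: fall back to the full pool if the rule selects nothing
        return chosen if len(chosen) > 0 else np.arange(correct.shape[0])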

5.4 Combining techniques

Having examined the optimal preprocessing techniques and dynamic selection techniques, it is worth looking into their combined performance. To do so, the average ranks, average AUC and win rates are calculated for the combinations of the preprocessing techniques SMOTE and RAMO with the dynamic selection techniques KNU, LCA and static selection.

Preprocessing method | DS method | Percentage of times method has highest AUC | Average AUC | Average rank
RAMO                 | KNU       | 4 %                                        | 0.91        | 16.9
RAMO                 | LCA       | 13 %                                       | 0.89        | 17.9
RAMO                 | Static    | 11 %                                       | 0.91        | 15.5
SMOTE                | KNU       | 7 %                                        | 0.91        | 22.6
SMOTE                | LCA       | 27 %                                       | 0.91        | 21.4
SMOTE                | Static    | 13 %                                       | 0.91        | 23.0
All equal            |           | 27 %                                       |             |

Table 6: Results for combinations of data preprocessing and dynamic selection

Table 6 shows the results of the combined preprocessing and dynamic selection methods. In 27 % of the cases all methods achieved an AUC of 1; these were rather small datasets in which the methods overfitted. In the other cases, the SMOTE-LCA combination had the highest win rate of all methods. This again confirms the earlier finding that LCA is the best choice for specific datasets. In terms of average rank, however, it is clear that RAMO combined with either of the dynamic selection methods performs much better. The results also reconfirm the superiority of static selection and KNU in terms of average rank.


5.5 Varying the data preprocessing

Having examined the various data preprocessing techniques and dynamic selection methods, a further analysis of the data preprocessing of the DSEL can be made. Until now, the data preprocessing of the bagged datasets and of the DSEL was done using the same technique, but it could be profitable to use a different technique for each, as suggested by [49]. Therefore all combinations of data preprocessing of the DSEL and of the bagged datasets have been applied, leading to 9 different combinations for each dataset. The results are shown in Table 7.

Preprocessing of data and DSEL | Percentage of times method has highest AUC | Average AUC | Average rank
Using all techniques:
Similar preprocessing          | 48 %                                       | 0.89        | 22.9
Different preprocessing        | 52 %                                       | 0.89        | 23.0
Only using best performing techniques:
Similar preprocessing          | 43 %                                       | 0.91        | 19.4
Different preprocessing        | 57 %                                       | 0.91        | 18.9

Table 7: Results of varying the data preprocessing.

The results show no clear difference in win rate, average AUC or average rank between similar and different preprocessing. Different preprocessing does have a slightly lower average rank than similar preprocessing, but the difference is too small to draw any major conclusions from it. Bearing in mind that ADASYN, RANK and KNE perform poorly across all datasets, however, the results obtained with those methods should be disregarded in the analysis. The second part of Table 7 shows that the exclusion of these weakly performing techniques leads to slightly more distinctive conclusions: the average rank and average AUC of different preprocessing are now better than those of similar preprocessing. The difference is still too small to draw any major general conclusions, though. At the algorithm level, SMOTE seems to benefit from varying the data preprocessing while RAMO does not, as the average rank of SMOTE improved slightly when combined with RAMO while the average rank of RAMO worsened. However, this result mainly reconfirms the superiority of RAMO over SMOTE rather than demonstrating the value of varying the data preprocessing.

5.6 Revisiting the research questions

Having compared the different data preprocessing techniques and dynamic selection techniques, and having concluded that varying the data preprocessing does not increase classification performance, it needs to be examined whether the previously described findings still hold. Table 8 therefore again shows the comparison between the different preprocessing techniques, but this time the results in which KNE and RANK are used as the dynamic selection technique are disregarded. The same is done for the different dynamic selection techniques in Table 9, but with ADASYN disregarded as preprocessing method.

Preprocessing method | Percentage of times method has highest AUC | Average AUC | Average rank
RAMO                 | 37 %                                       | 0.90        | 16.5
SMOTE                | 52 %                                       | 0.91        | 22.3
All equal            | 12 %                                       |             |

Table 8: Results for different preprocessing techniques


DS method | Percentage of times method has highest AUC | Average AUC | Average rank
KNU       | 24 %                                       | 0.91        | 19.4
LCA       | 37 %                                       | 0.90        | 19.7
Static    | 27 %                                       | 0.91        | 19.3
All equal | 12 %                                       |             |

Table 9: Results for different dynamic selection techniques

The results in Table 8 do not lead to conclusions different from the ones previously drawn. SMOTE does have a higher win rate than RAMO, but the average rank of RAMO is lower. This makes both methods useful in practice, even though RAMO seems the slightly safer choice. When looking at the results of the renewed analysis of the dynamic selection methods in Table 9, the previous conclusions partly hold. LCA still has a higher win rate than both KNU and static selection, but the average ranks of the latter two are now only slightly lower. Removing the poorly performing data preprocessing method therefore has a much larger effect on LCA than on KNU and static selection. From this analysis it remains unclear which of the dynamic selection techniques is best, although the results indicate that LCA is slightly superior.

Having concluded this, the first sub question regarding the comparison to logistic regression can be revisited. To do so, Tables 2 and 3 from Section 5.2 are reconstructed, but this time ADASYN is disregarded as a preprocessing technique and KNE and RANK are disregarded as dynamic selection techniques. Combinations with different preprocessing for the DSEL and the bagged datasets are also not considered.

Imbalance ratio | Average AUC > LR | Max AUC > LR | Average increase in AUC of average | Average increase in AUC of maximum
0 − 0.05        | 40 %             | 80 %         | 2 %                                | 15 %
0.05 − 0.1      | 55 %             | 82 %         | 3 %                                | 11 %
0.1 − 0.15      | 53 %             | 73 %         | 1 %                                |  4 %
> 0.15          | 64 %             | 100 %        | 2 %                                | 15 %

Table 10: Comparison of the AUC of logistic regression with the average AUC and the maximum AUC of the complex models, excluding ADASYN, KNE and RANK. The second and third columns indicate the percentage of datasets in which the complex model had a higher average and maximum AUC than logistic regression. The fourth and fifth columns indicate the average percentage increase in AUC of the average and of the best performing complex model relative to logistic regression.

Imbalance ratio | Average AUC > LR+preproc | Max AUC > LR+preproc | Average increase in AUC of average | Average increase in AUC of maximum
0 − 0.05        | 60 %                     | 87 %                 | 7 %                                | 17 %
0.05 − 0.1      | 55 %                     | 73 %                 | 1.3 %                              |  7 %
0.1 − 0.15      | 73 %                     | 87 %                 | 1 %                                |  3 %
> 0.15          | 82 %                     | 91 %                 | 2 %                                |  5 %

Table 11: Comparison of the AUC of logistic regression combined with data preprocessing with the average AUC and the maximum AUC of the complex models, excluding ADASYN, KNE and RANK. Columns are defined as in Table 10.

The results in Tables 10 and 11 again underline that dynamic selection combined with data preprocessing can indeed increase classification performance. The number of datasets in which dynamic selection with data preprocessing has a higher average AUC than logistic regression has increased compared to the previous results, and the average increase of the average AUC has grown relative to both regular logistic regression and logistic regression combined with data preprocessing. This increase is largest for the datasets in which the imbalance is most severe.


5.7 Robustness

The results in the previous sections indicate that combining data preprocessing with dynamic selection can indeed increase classification performance. Robustness has partly been investigated by applying dynamic selection to various datasets that vary in imbalance. However, other robustness checks have not yet been examined. In this section, robustness is further investigated by focusing on variation in:

1. Parameter values in data preprocessing
2. Parameter values in dynamic selection
3. Classification algorithms

5.7.1 Parameter values in data preprocessing

Up until now, RAMO and SMOTE have shown to be the best performing data preprocessing methods. However, both methods require parameters, which so far have been held fixed. On the one hand, the performance of both methods could be increased by varying the parameter values. On the other hand, the setup of dynamic selection is so extensive that the effect of the parameter choices may well be diminished.

For RAMO, three parameter values have to be chosen: k1, k2 and α. The first two refer to the number of neighbours used in the estimation of the probability distribution and in the data generation, while α controls the difference in probabilities in the density estimation. Each of the parameters is varied, and the results of applying RAMO as the data preprocessing method for both the bagged datasets and the DSEL on 20 datasets, combined with the dynamic selection method KNU and a random choice of base classifiers, are shown below.
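The 27 parameter combinations can be evaluated with a simple grid, as in the sketch below. The function run_experiment is a hypothetical stand-in for the full RAMO + bagging + KNU pipeline (here it only returns a random AUC so the grid and ranking logic can be demonstrated), and the dataset identifiers are placeholders.

    from itertools import product
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    datasets = [f"d{i}" for i in range(20)]          # placeholder dataset identifiers

    def run_experiment(dataset, alpha, k1, k2):
        """Hypothetical stand-in for the full RAMO + bagging + KNU pipeline."""
        return rng.uniform(0.7, 1.0)                 # dummy AUC for illustration only

    rows = [{"alpha": a, "k1": k1, "k2": k2, "dataset": d,
             "auc": run_experiment(d, a, k1, k2)}
            for a, k1, k2 in product([0.1, 0.3, 0.5], [3, 5, 7], [5, 10, 15])
            for d in datasets]

    df = pd.DataFrame(rows)
    # Rank the 27 combinations within each dataset (1 = highest AUC) and average
    df["rank"] = df.groupby("dataset")["auc"].rank(ascending=False)
    print(df.groupby("alpha")["rank"].mean())        # average rank per alpha, as in Table 12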

α   | Average rank | k1 | Average rank | k2 | Average rank
0.1 | 14.2         | 3  | 14.6         | 5  | 14.6
0.3 | 13.8         | 5  | 14.1         | 10 | 13.8
0.5 | 14.0         | 7  | 13.2         | 15 | 13.6

Table 12: Robustness to variation in parameters of RAMO

Table 12 shows the average ranks of varying the three parameters over three possible values each. For every value, the average rank within the set of 27 possible parameter combinations is shown. The average rank is lowest for α = 0.3, although the differences between the values are very small. For k1 and k2 the differences are slightly larger, with the lowest average ranks at k1 = 7 and k2 = 15. These results show that the experimental setup is quite robust to parameter changes in RAMO, although the combination (0.5, 7, 15) performs best for these datasets in terms of AUC.

For SMOTE, three parameter values have to be chosen: the oversampling percentage, the undersampling percentage and the number of nearest neighbours. The first refers to the amount of oversampling of the minority class; the second indicates the balance between the oversampled minority class and the majority class, where 100 % indicates complete balance. As the number of nearest neighbours plays the same role as k2 in RAMO, it is not considered further. The other two parameters are varied in the same experimental setup as before.
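To make the role of the oversampling percentage concrete, a minimal sketch of the SMOTE interpolation step is given below. It assumes the standard SMOTE formulation (interpolation between a minority instance and one of its minority nearest neighbours) and is an illustrative simplification, not the implementation used in the experiments.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote(X_min, perc_oversampling=500, k=5, rng=np.random.default_rng(0)):
        """Generate synthetic minority instances.

        perc_oversampling = 500 means 5 synthetic instances per minority instance.
        Each synthetic point is an interpolation x_new = x_i + u * (x_nn - x_i),
        with u ~ U(0, 1) and x_nn one of the k nearest minority neighbours of x_i.
        """
        n_per_instance = perc_oversampling // 100
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
        _, idx = nn.kneighbors(X_min)                # idx[:, 0] is the point itself
        synthetic = []
        for i, x in enumerate(X_min):
            for _ in range(n_per_instance):
                neighbour = X_min[rng.choice(idx[i, 1:])]
                synthetic.append(x + rng.uniform() * (neighbour - x))
        return np.vstack(synthetic)

    # The undersampling percentage then determines how many majority instances are
    # randomly kept relative to the enlarged minority class (100 % = full balance,
    # as described above).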

% oversampling | Average rank | % undersampling | Average rank
100            | 7.6          | 50              | 6.9
500            | 6.7          | 100             | 6.5
1000           | 6.5          | 200             | 6.2
2000           | 5.2          |                 |

Table 13: Robustness to variation in parameters of SMOTE

Table 13 shows the average ranks for the different values of the oversampling and undersampling percentages in SMOTE. For the oversampling percentage, the results show that the more new minority class instances are generated, the better the performance in terms of AUC. It is therefore safe to say that choosing a high oversampling percentage is very beneficial for classification performance. In contrast, the undersampling percentage of the majority class seems to have less impact on the classification results: the average ranks of the three parameter values are fairly close together, although heavier undersampling seems slightly beneficial. Naturally, combining a high oversampling percentage with a high undersampling percentage yields the best results.

Looking at the results of both data preprocessing methods, it is clear that parameter selection still matters in the experimental setup, but that the effect of different parameter values is not large.


This reconfirms the robustness of dynamic selection combined with data preprocessing across different applications, without necessarily having to optimize over parameter values. The only parameter with a strong influence on classification performance is the oversampling percentage in SMOTE. Further research into choosing the optimal value of this parameter based on the application could remove the need for such optimization.

5.7.2 Parameter values in dynamic selection

Each dynamic selection method selects classifiers based on their performance in a local region. In all cases this local region is defined by the number of nearest neighbours, which is the only parameter value that needs to be chosen. The results of the previous sections have shown that KNU and LCA are the best performing dynamic selection techniques; the number of nearest neighbours is therefore varied for these two methods. As static selection simply selects all trained classifiers, it does not require any parameter choice.

k  | Average rank
1  | 4.6
3  | 4.1
5  | 3.1
7  | 3.1
9  | 3.3
11 | 2.8

Table 14: Robustness to variation in parameters of KNU

k  | Average rank
3  | 3.2
5  | 2.5
7  | 2.9
9  | 2.7
11 | 3.6

Table 15: Robustness to variation in the parameters of LCA

Looking at the results in Tables 14 and 15, it is clear that there are differences in sensitivity to parameter changes. KNU performs best when the number of nearest neighbours is large; apparently, the inclusion of competent classifiers benefits from expanding the neighbourhood of the minority class instances. It should however be noted that KNU did not benefit further from increasing the number of nearest neighbours beyond this range. For LCA, the effect of the parameter change is much smaller than for KNU and no clear optimal value for k can be distinguished. Bearing in mind that LCA was one of the best performing dynamic selection methods, this provides evidence for the robustness of LCA within the preprocessing and dynamic selection framework. Although specific parameter values may generate better results for specific datasets, LCA seems generally applicable.

5.7.3 Classification algorithms

The final robustness check is done by varying the base classification algorithms. Three algorithms are considered: logistic regression, SVM and C50. The performance of using each of these three is compared to the previously used random choice of algorithms. For all choices, the data preprocessing technique is RAMO and the dynamic selection technique is KNU with default parameter values.

Classifier          | Average rank
C50                 | 1.9
Logistic regression | 2.8
Random              | 2.8
SVM                 | 2.5

Table 16: Robustness to variation in base classifiers used

The reported average ranks in Table 16 show a clear performance difference between the classifiers. The C50 algorithm has a much lower average rank than the other methods, which lie very close together. The superiority of C50 means that it should be the first choice when classifying in an imbalanced dataset.
