
Handling Imbalanced Datasets in Multi-class Image Recognition

Layout: typeset by the author using LaTeX. Cover illustration:

Handling Imbalanced Datasets in Multi-class Image Recognition

Lars Veefkind 11630876

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
dr. A.G. Brown

Institute for Informatics
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

Extensive research has been conducted regarding the classification bias that occurs during the training of a neural network on imbalanced datasets, which has resulted in a wide variety of techniques that aim to reduce this training bias. This thesis focuses on the mostly unexplored multi-class imbalanced image recognition datasets, and aims to determine which technique achieves the best performance on such datasets. Using the CIFAR-10 and CIFAR-100 datasets, the results show that over-sampling techniques, particularly SMOTE, achieve the best classification performance at the cost of an increased computational cost, while applying a class-balanced loss function offers excellent performance without the penalty in computational cost. However, partially due to the datasets used in this research, additional research is required to provide further verification of these results.


Contents

1 Introduction

2 Techniques
  2.1 Re-sampling
    2.1.1 Synthetic minority over-sampling technique
    2.1.2 Polynomial-fit SMOTE
    2.1.3 Adaptive synthetic sampling
    2.1.4 NearMiss
    2.1.5 Cluster-based under-sampling
    2.1.6 Generative adversarial networks
  2.2 Class-balanced loss function

3 Methods
  3.1 Datasets
  3.2 Deploying techniques
  3.3 Application of techniques
  3.4 Training
  3.5 Evaluation

4 Results
  4.1 Baselines
  4.2 Techniques
    4.2.1 Performance results
    4.2.2 Generated images

5 Discussion
  5.1 Performance remarks
    5.1.1 General classification performance
    5.1.2 General computational performance
    5.1.3 Baseline performance
    5.1.4 CIFAR-10 spike distribution performance
    5.1.5 k-NN performance
    5.1.6 GAN performance
  5.2 Future research

6 Concluding remarks


Acknowledgments

I would like to express my sincerest gratitude towards a number of people, without whom this project could not have been completed:

My supervisor dr. Andrew Brown, for providing the fundamental idea for this research, and for providing guidance, information and feedback on both the thesis and my research methods throughout the entire process, while still enabling me to use my own creative insights to complete this project.

My supervisor dr. Sander van Splunter, for organizing the thesis project and providing guidance and information.

And in general, the University of Amsterdam, for providing the necessary tools which enabled me to receive the chance to start and complete this project.

1 Introduction

In recent years, excellent results have been achieved in image recognition using convolutional neural networks (CNNs) on image recognition datasets such as CIFAR [18, 15]. However, most image recognition datasets used for benchmarking, including CIFAR, are balanced, such that each class consists of the same number of data points, while in reality this is often not the case [6]. This imbalance is often expressed using the imbalance factor, which is the factor by which the majority class outnumbers the minority class. It is not uncommon for the imbalance factor to exceed values of 100 [19]. Training a neural network, without any counter-measures, on such datasets can result in a bias towards the over-represented classes during inference [11]. This problem stems from how a network is trained. During the training stage, a network attempts to minimize the overall error in classification (loss). It does so without taking the distribution of the different classes into account. Therefore, in extreme cases of imbalance, the network can have a seemingly high classification performance based on the accuracy values. However, in reality, the network only outputs the label of the over-represented class as its prediction, meaning it has not learned the distinction between the classes. Therefore, it can be the case that the network is not as capable of discriminating between the classes as the performance metrics would suggest.

Extensive research has been conducted regarding the implications of and possible solutions to this problem. Through this research, several techniques have been proposed to reduce the adverse effects of training a neural network on severely imbalanced datasets [3, 10, 26, 5, 11]. These different techniques can be grouped into three basic approaches. The first option is to re-sample the dataset such that all classes contain the same number of data points. This can be achieved by over-sampling, where more data points are created for the minority classes, by under-sampling, where data points are deleted from the majority classes, or by a combination of the two. For both over- and under-sampling there are multiple possible options in terms of execution, ranging from randomly deleting or copying data to synthesizing new images, using a technique like the synthetic minority over-sampling technique (SMOTE) [3], based on the k-nearest neighbor algorithm (k-NN) [4], or by utilizing a generative adversarial network (GAN) [9, 27, 7].

Another technique is to change the network's loss function into a class-balanced loss function, such that each class receives a weight, which is applied during the computation of the prediction loss for a data point of that particular class. This weight should be higher for minority classes and lower for majority classes, such that the disparity between the classes' total influence on the loss decreases.

The final technique is ensemble learning, where the dataset is split into multiple ensemble sets, each of which contains a different portion of the majority class but all of the minority class. On each of these ensemble sets, a network is trained. Upon final classification, the output from each of those networks is combined to create the final output and classification prediction.

However, while most of the conducted research concerns binary classification problems, many image recognition problems are multi-class, where the most substantial majority class is several orders of magnitude larger than the smallest minority class, with numerous classes in between. A notable example of such a dataset is ImageNet 2011 [6]. Additionally, some well-performing techniques, such as SMOTE, are not initially designed to be applied to high-dimensional data such as images. Therefore, this research aims to determine which of these techniques yields the best performance when applied to neural-network-based multi-class imbalanced image recognition problems. Due to the focus on multi-class problems, ensemble learning will not be considered in this research, since ensemble learning increases the computational cost significantly when dealing with many classes with varying degrees of imbalance, because it requires training a large number of networks.

2 Techniques

In this study, various techniques to counter majority class bias are tested. This section provides an overview of the utilized techniques, along with an in-depth analysis of these techniques.

2.1 Re-sampling

The majority of the techniques employed in this research are re-sampling techniques. As stated before, the simplest form of re-sampling is to do it randomly. For over-sampling, this results in creating copies of randomly chosen data points for the minority classes, while random data points are deleted from the majority classes for under-sampling. Both random over-sampling (ROS) and random under-sampling (RUS) will be examined in this research.

While ROS and RUS can improve performance using a simple solution, both these techniques have their limitations. Although ROS is capable of increasing the loss penalty for the over-sampled classes during misclassification, which should enhance the performance, it does not provide additional features to learn from. Contrarily, RUS increases the relative misclassification loss on minority classes by decreasing the misclassification loss on the majority samples, and due to the deletion of data, it even removes existing trainable features. Therefore, this research will additionally explore more sophisticated means of re-sampling, which aim to combat these problems. The majority of these methods are based on k-NN. While these methods have shown promising performance on certain types of data [3, 10, 8], limited research has been conducted with regard to image recognition, especially multi-class image recognition.

2.1.1 Synthetic minority over-sampling technique

The synthetic minority over-sampling technique, or SMOTE, is an over-sampling technique where new data samples are synthesized based on a nearest neighbor algorithm [3]. For each minority class, each data point is represented as a point in an N-dimensional feature space. Subsequently, the algorithm loops over these data points and computes the k nearest neighbors for each of them, thereby selecting the k closest data samples in the feature space. To generate a new data sample, a random neighbor is selected, and the difference between the feature vector of the neighbor and the chosen data point is computed. Subsequently, this difference vector is multiplied by a random value c ∈ [0, 1], and the result is added to the chosen data point. Thus the formula for a data point generated by SMOTE is as follows:

$\vec{X}' = \vec{X} + (\vec{P} - \vec{X}) \cdot c$   (1)

Where $\vec{X}'$ is a data point synthesized by SMOTE, $\vec{X}$ is the selected starting data point, $\vec{P}$ is a randomly selected nearest neighbor, and $c$ is the randomly generated value between 0 and 1. This process is repeated until sufficient data points have been generated for that particular existing data sample. This number depends on the over-sampling rate, where the same number of samples is generated for each existing data sample. An over-sampling rate of 600% would thus result in the synthesis of six new data samples for every existing data sample.
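To make the procedure concrete, the following sketch implements equation 1 with NumPy and scikit-learn; the function and variable names are illustrative, and the actual experiments in section 3 use the imbalanced-learn implementation instead.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, per_sample, k=5, seed=0):
    """Synthesize per_sample new points for every sample in X_min (n, d)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)              # idx[i][0] is the point itself
    synthetic = []
    for i, x in enumerate(X_min):              # same count per existing sample
        for _ in range(per_sample):
            p = X_min[rng.choice(idx[i][1:])]  # one of the k nearest neighbors
            c = rng.random()                   # random factor c in [0, 1]
            synthetic.append(x + (p - x) * c)  # equation 1
    return np.asarray(synthetic)

# Example: a 600% over-sampling rate for 50 flattened 32x32x3 images.
X_min = np.random.rand(50, 3072)
X_new = smote_oversample(X_min, per_sample=6)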

2.1.2 Polynomial-fit SMOTE

While traditional SMOTE has shown excellent results [3] in dealing with imbalanced datasets, many variations of the basic SMOTE algorithm have been developed. Through research in which the performance of 85 of these different variations of SMOTE has been compared, it has been shown that the polynomial-fit variation of SMOTE (P-SMOTE) [8] has, on average, the best performance across a range of 104 datasets [16]. Instead of synthesizing new data entries from a set number of neighbors, this variation of SMOTE fits a polynomial function to all the data entries in each of the minority classes and subsequently synthesizes new entries by sampling feature vectors from that polynomial.

Figure 1: Visual representation of SMOTE in two-dimensional feature space. Left: the crosses mark data samples from the majority class and the circles data samples from the minority class. Right: the circles with an opaque edge are the original data samples from the minority class, while the circles with a transparent edge are synthesized data samples. Image from [25].

There are a total of four different polynomials to choose from, which perform indistinguishably from each other in terms of performance [16]. Therefore this research will consider the default "star" polynomial, which works as follows: for each minority class, the mean feature vector is computed. Thereafter, line functions are fitted between the mean feature vector and the feature vector of each other data entry in the minority class. The collection of the mean feature vector and the lines fitted to the existing data samples describes the entire star polynomial. New feature vectors are sampled along the lines of the star polynomial, resulting in new data points.

Figure 2: Visual representation of P-SMOTE using a "star" polynomial-fit. Image from "New oversampling approaches based on polynomial fitting for imbalanced data sets" by Gazzah and Amara, 2008.

The effective difference between standard SMOTE and P-SMOTE is that, despite every data point synthesized by P-SMOTE being synthesized using all of the data, polynomial-fit SMOTE generates more scattered data entries than SMOTE. This is due to synthesis only depending on two points, namely the mean point and one other real data entry, which could be a relatively far outlier compared to the other real data entries.
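A rough sketch of this star-polynomial sampling is given below, under the assumption that each new point is drawn uniformly along the line between the class mean and one real sample; the thesis itself uses the smote-variants library (section 3.2), so the helper is purely illustrative.

import numpy as np

def star_oversample(X_min, per_sample, seed=0):
    """Sample new points on the lines between the class mean and each real sample."""
    rng = np.random.default_rng(seed)
    mean_vec = X_min.mean(axis=0)                          # centre of the "star"
    synthetic = []
    for x in X_min:                                        # one line per real sample
        for _ in range(per_sample):
            c = rng.random()                               # position along the line
            synthetic.append(mean_vec + (x - mean_vec) * c)
    return np.asarray(synthetic)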


2.1.3 Adaptive synthetic sampling

He et al. aim to improve SMOTE’s positive influence on the overall classification performance by developing adaptive synthetic sampling (ADASYN) [10]. The ADASYN algorithm adaptively alters the number of data entries to be synthesized from each existing data entry, which is always roughly the same in the case of traditional SMOTE. The goal is to emphasize the synthesis of data points around existing data points where the features are more difficult to learn. This is accomplished by executing the following steps for each minority class, as described by the original paper [10]: in the N -dimensional feature space, search for the k-nearest neighbors from the entire dataset for each data entry. Then compute the ratio of neighbors which belong to a different class for each data point:

$r_i = \frac{\Delta_i}{k}$   (2)

Where $r_i$ is the ratio of the $i$th data point in the evaluated minority class, $k$ is the number of nearest neighbors, and $\Delta_i$ indicates the number of nearest neighbors belonging to a different class.

Since the eventual goal is to divide the total number of data entries which need to be generated over the existing data points, the ratios need to be normalized, such that a density distribution remains:

$\hat{r}_i = \frac{r_i}{\sum_i r_i}$   (3)

Where $\hat{r}_i$ is the normalized ratio, with $\sum_i \hat{r}_i = 1$.

Then the final step is to compute the number of data entries that need to be generated for each original data entry:

$g_i = \hat{r}_i \cdot G$   (4)

Where $g_i$ is the number of data points to be generated from the $i$th existing data entry in the minority class, and $G$ is the total number of data points that need to be synthesized for the minority class.

The previous equations 2, 3, 4 have been adapted from the original paper [10].

The remainder of the algorithm works in the same manner as the traditional SMOTE algorithm, except that instead of randomly choosing a data entry to utilize for synthesizing a new data sample, the algorithm loops over all of the existing data entries and uses the g values to determine the number of data points to synthesize.
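The density weighting of equations 2 to 4 can be sketched as follows; X and y are assumed to hold the flattened training images and their labels, and the helper name is hypothetical (the experiments use the imbalanced-learn ADASYN implementation).

import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, minority_label, G, k=5):
    """Return g_i: the number of samples to synthesize from each minority point."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)        # neighbors from the whole dataset
    minority_idx = np.where(y == minority_label)[0]
    _, idx = nn.kneighbors(X[minority_idx])
    # r_i = Delta_i / k: fraction of neighbors that belong to another class (eq. 2)
    r = np.array([(y[nbrs[1:]] != minority_label).mean() for nbrs in idx])
    r_hat = r / r.sum()                                    # normalize to a density (eq. 3)
    return np.round(r_hat * G).astype(int)                 # g_i = r_hat_i * G (eq. 4)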

2.1.4 NearMiss

Similar to the previous algorithms, the NearMiss algorithm is a k-NN based algorithm. NearMiss is an under-sampling technique, however, which attempts to reduce the loss of trainable features during random under-sampling by only maintaining the data entries from the majority class which are closest to data entries from the minority class, since these data points are the most difficult to learn and are therefore most likely to be wrongly classified (near misses). There are three different versions of the NearMiss algorithm, of which the second algorithm (NearMiss-2) will be considered in this research, due to its superior performance compared to the other two methods [22]. The NearMiss-2 (NM-2) algorithm determines the near misses by computing the average distance between a data entry from a majority class and its k farthest neighbors from the other remaining classes. The data entries with the smallest average distance are selected as near misses, and the remaining data entries are deleted.
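The selection rule can be sketched as follows, assuming X_maj and X_other hold flattened samples from a majority class and from the remaining classes; this is an illustration of the rule described above, not the imbalanced-learn implementation used in section 3.2.

import numpy as np
from sklearn.metrics import pairwise_distances

def nearmiss2_select(X_maj, X_other, n_keep, k=5):
    """Keep the majority samples with the smallest average distance to their k farthest neighbors."""
    d = pairwise_distances(X_maj, X_other)      # all majority-to-other-class distances
    k_far = np.sort(d, axis=1)[:, -k:]          # distances to the k farthest points
    avg_far = k_far.mean(axis=1)                # average of those distances
    keep = np.argsort(avg_far)[:n_keep]         # smallest averages are the "near misses"
    return X_maj[keep]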


2.1.5 Cluster-based under-sampling

Similarly to the NearMiss algorithm, the cluster-based under-sampling (CBU) technique [21] attempts to reduce the loss of essential features during under-sampling. However, instead of selectively deleting samples from the majority classes, which still results in lost features, CBU attempts to encapsulate the crucial features from all of the data in fewer synthesized data points. This is accomplished through the use of k-means clustering [1], unlike the previously described k-NN based techniques.

To synthesize the replacing data points for a majority class, k cluster centroids are computed using k-means clustering in N-dimensional feature space, where k equals the number of images to be synthesized, typically equal to the size of the minority class. These k cluster-centroid vectors, which represent the mean feature vectors of their respective clusters, replace all of the majority class data entries.
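A minimal sketch, assuming X_maj contains the flattened images of one majority class: the scikit-learn k-means centroids directly become the replacement samples.

import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, n_keep, seed=0):
    """Replace a majority class by n_keep k-means cluster centroids."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=seed).fit(X_maj)
    return km.cluster_centers_                  # one synthetic sample per cluster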

2.1.6 Generative adversarial networks

As an alternative to over-sampling with techniques based on relatively computationally inexpensive methods such as k-NN and k-means, it is also possible to over-sample through the deployment of a generative neural network. Since the introduction of generative adversarial networks (GANs) [9], great results have been achieved regarding the synthesis of visually convincing images using state-of-the-art GANs, such as Nvidia's StyleGAN2 [13].

GANs distinguish themselves from other networks through the underlying method to generate new samples. Most generative models (including the previously described models) are explicit, which means that these models attempt to capture the underlying distribution of the data, which is often complicated and near-unobtainable. Contrarily, GANs are implicit, meaning that instead of directly using the training data to generate new data, the model aims to discover the function which transforms random data, in the form of noise, to the data space from the training data.

A GAN realizes this goal through the utilization of two separate adversarial networks: a generator and a discriminator. The generator is tasked with generating data samples using only random noise as input. On the other hand, the discriminator specializes in discriminating between real data samples and fake data samples from the generator. During training, the discriminator receives data samples from both the original dataset and the generator and is eventually trained through conventional back-propagation. Contrarily, the generator trains by minimizing the performance of the discriminator, by basing the generator loss on the prediction of the discriminator. Effectively, this results in the two networks competing with each other. The generator receives a lower loss value when it manages to deceive the discriminator, and the discriminator receives a lower loss when it accurately discriminates between real data and generated data.

Figure 3: Simplified diagram of the architecture of a GAN. The generator generates data samples from random noise, and the discriminator learns to discriminate based on these generated data samples and real data samples. Image from [24].


As the original paper showed, it is possible to define this competition mathematically in terms of a value function V [9]. Take $D(x)$ as the discriminator model, which returns a probability $\in [0, 1]$ that the image $x$ is a real image. Additionally, $G(z)$ is the generator model, which generates a fake image based on random noise $z$, with $p_z(z)$ and $p_{data}(x)$ being the priors defined for the input noise $z$ and real image $x$ respectively. Finally, $L(y)$ computes the loss value based on network output $y$. Then the equation to optimize the discriminator becomes:

$\max_D V(D) = \mathbb{E}_{x \sim p_{data}(x)}[L(D(x))] + \mathbb{E}_{z \sim p_z(z)}[L(1 - D(G(z)))]$   (5)

And the equation to optimize the generator based on the discriminator output:

$\min_G V(G) = \mathbb{E}_{z \sim p_z(z)}[L(1 - D(G(z)))]$   (6)

Finally, combining these equations results in the following minimax game for both networks, as shown by [9]:

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[L(D(x))] + \mathbb{E}_{z \sim p_z(z)}[L(1 - D(G(z)))]$   (7)

Theoretically, a solution exists for every GAN, specifically when the output of the discriminator network $D(x)$ always equals $\frac{1}{2}$ [9], which suggests that the generator has found the perfect transform function.

Considering a conventional GAN, as described above, is only capable of generating data samples that are indistinguishable from the training data in its entirety, multi-class over-sampling would require the training of one GAN for each minority class, which is highly impractical. However, through the relatively simple addition of a label parameter, the GAN becomes conditional [23], which allows for the specification of a class label when over-sampling with the GAN.

A major downside to using a GAN is that GANs can be quite sensitive to hyperparameters, such as the learning-rate, and to properties of the dataset [2]. Improper configurations can easily result in no apparent learning at all, or in mode collapse, which often occurs later in the training when the generator discovers a particular configuration that manages to increase the loss of the discriminator but results in generated images with a low variety. From equation 6 it follows that, during a single time step, the generator attempts to generate images such that $G(z) = \arg\max_x D(x)$. The network's optimizer attempts to reach this point $G(z)$. In some cases, the best result for the generator might be realized by exerting no variety on the generated images, therefore negating the noise input z as a variable. The discriminator should prevent this from occurring by learning that it should assign a low probability to those generated images. However, when this does not happen fast enough, the generator gets trapped, which will consequently result in an unrecoverable state of poor image generation.
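The adversarial game of equations 5 to 7 translates into an alternating training loop. The sketch below shows one conditional training step with binary cross-entropy loss, assuming netG(noise, labels) and netD(images, labels) are already-defined PyTorch modules and netD ends in a sigmoid producing a (batch, 1) probability; it illustrates the principle and is not the cDCGAN implementation referenced in section 3.2.

import torch
import torch.nn as nn

criterion = nn.BCELoss()

def cgan_step(netG, netD, optG, optD, real_imgs, labels, z_dim, device):
    b = real_imgs.size(0)
    ones = torch.ones(b, 1, device=device)        # targets for "real"
    zeros = torch.zeros(b, 1, device=device)      # targets for "fake"

    # Discriminator: push D(real) towards 1 and D(G(z)) towards 0 (equation 5).
    optD.zero_grad()
    z = torch.randn(b, z_dim, device=device)
    fake_imgs = netG(z, labels).detach()          # no generator gradients here
    loss_d = criterion(netD(real_imgs, labels), ones) + \
             criterion(netD(fake_imgs, labels), zeros)
    loss_d.backward()
    optD.step()

    # Generator: push D(G(z)) towards 1, i.e. try to fool the discriminator (equation 6).
    optG.zero_grad()
    z = torch.randn(b, z_dim, device=device)
    loss_g = criterion(netD(netG(z, labels), labels), ones)
    loss_g.backward()
    optG.step()
    return loss_d.item(), loss_g.item()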

2.2 Class-balanced loss function

Instead of re-balancing the training at the data-level, it is also possible to re-balance the training at the algorithm-level. Considering the root of the imbalance problem is located at the computation of the loss function, where the majority classes exert significantly more influence on the total loss than the minority classes, an apparent solution is to change the influence weighting for each class, using a class-balanced loss function (CBL), which ascribes higher weights to minority classes than to majority classes. The most straightforward solution is to apply these weights one-to-one, such that each class exerts the same amount of influence on the total loss value, resulting in the following class weights:

$W_i = \frac{\max_{0 \le j < N} n_j}{n_i}$   (8)

Where $W_i$ is the weight for the $i$th class, $n_i$ is the number of data samples in the $i$th class, and $N$ is the number of classes. However, according to Cui et al., datasets with long tails, where only a few classes contain most of the data samples and a large number of classes have few samples per class, are common among real-world visual datasets. These types of datasets benefit from a class-balanced loss function where the degree of re-balancing the loss can be tuned [5]:

$W_i = \frac{1 - \beta}{1 - \beta^{n_i}}$   (9)

Where $\beta \in [0, 1)$ is the tunable hyper-parameter, with $\beta = 0$ resulting in no re-balancing of the loss function, and $\beta \to 1$ resulting in re-balancing according to equation 8.¹ Cui et al. suggest using high $\beta$ values such as 0.9, 0.99 and 0.999.

¹ While these two equations might not yield the same weight values in the case that $\beta \to 1$, these equations are proportional to each other. Since multiplying the loss function by a constant factor does not influence the training, these two equations perform effectively the same.

3 Methods

This section describes the procedures this research deployed to benchmark the varying techniques described in section 2. All the methods described below were executed on a Windows 10 machine with an AMD Ryzen 9 3900X 24-thread CPU, Nvidia GeForce RTX 2080 Ti GPU with 11GB VRAM, and 32GB of DDR4 RAM.

3.1 Datasets

To train and evaluate the various techniques, CIFAR-10 [18] and CIFAR-100 [18] were used, with 10 and 100 classes, respectively. Both versions of CIFAR consist of 50,000 small 32x32x3 RGB images in the training set, and 10,000 in the test set. Each image contains one subject out of the 10 or 100 labels, with the subject located in roughly the middle of the frame. Figure 4 portrays some example images for both CIFAR-10 and CIFAR-100.

Figure 4: CIFAR-10 and CIFAR-100 image samples. (a) Ten sample images from each of the ten CIFAR-10 classes. (b) Ten sample images from the first ten CIFAR-100 classes.

While both of these datasets are balanced, imbalanced versions of these datasets have been created with varying degrees of imbalance. This enables the evaluation of the compared techniques across a broader spectrum of cases and allows for performance comparisons between applying these techniques on imbalanced datasets and using a balanced dataset, representing the performance target. The following imbalanced datasets have been constructed for both CIFAR-10 and CIFAR-100:

ImageNet

The first imbalanced dataset is an approximation of the data distribution of the ImageNet 2011 [6] dataset (figure 5), which is an accessible, unbalanced dataset with 14.2M images, distributed across 21,841 classes. Unlike the following datasets, this dataset has only been replicated for CIFAR-100, due to CIFAR-10 having too few classes to retain a relatively gradual approximation of the original distribution. To achieve this, the proportional class values of the original ImageNet dataset were computed, where the class with a proportional value of 1 is the largest majority class. Subsequently, the distribution of these proportional values was binned into N bins. These bins represent the class factors for each of the CIFAR classes. This distribution is re-scaled as follows:

$f_i = \frac{1 - \gamma}{N} \cdot h_i$   (10)

Where $N$ is the number of classes, $0 \le i < N$ is the class index, $f_i \in [0, 1]$ is the factor of images retained from the $i$th class, $\gamma$ is the minimum factor of images, and $h_i$ is the $i$th value from the binned distribution.

Figure 5: Visualization of the ImageNet class distribution.

Linear

A linear dataset, where the number of images per class grows linearly, according to the following equation:

$f_i = \frac{1 - \gamma}{N - 1} \cdot i + \gamma$   (11)

Logarithmic

A logarithmic dataset, to represent the aforementioned long-tailed datasets:

$f_i = (1 - \gamma) \cdot \frac{-\ln\left(-\frac{i - N}{N}\right)}{\ln(N)} + \gamma$   (12)


Spike

Finally, to represent a worst-case scenario, a spike dataset, where almost all of the classes retained the minimum number of images, and only a couple of randomly selected classes retained significantly more image samples.

With the exception of the ImageNet dataset, each of these imbalanced datasets maintained a minimum of 100 image samples per class. This resulted in γ = 0.2 for CIFAR-100 and γ = 0.02 for CIFAR-10 in the logarithmic and linear distributions of equations 12 and 11. The ImageNet distribution used γ = 0.05, to simulate the actual ImageNet dataset as closely as possible without effectively deleting classes. These distributions were then applied to both the training and the test set of the CIFAR datasets. Figures 6 and 7 demonstrate the distributions of these datasets on CIFAR-100 and CIFAR-10 respectively, and table 1 lists the number of images for each imbalanced dataset.

Figure 6: Distribution of the imbalanced CIFAR-100 datasets (panels a-d).

Figure 7: Distribution of the imbalanced CIFAR-10 datasets (panels a-c).

                        CIFAR-100                       CIFAR-10
             Train size  Test size  Ratio    Train size  Test size  Ratio
Logarithmic     18405      3679     0.368      17857       3571     0.357
Linear          30000      6000     0.600      25500       5100     0.510
Spike           12270      2454     0.245       9034       1806     0.181
ImageNet        23768      4750     0.475          -          -         -

Table 1: Shows the number of images in the training and test set for each imbalanced dataset, as well as the ratio compared to the full dataset.

As can be seen from the ratio in table 1, the created imbalanced datasets are considerably smaller than the original CIFAR datasets. This makes it unreasonable to compare the performance on these imbalanced datasets with the performance on the full datasets. Therefore, re-scaled baseline CIFAR datasets have been created for each imbalanced dataset, by re-scaling each class from the full dataset according to the ratio. This results in seven re-scaled balanced CIFAR datasets, one for each imbalanced dataset.
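A sketch of how such an imbalance profile can be imposed on a CIFAR split follows, assuming images and labels are already loaded as NumPy arrays and factors[c] holds the fraction f_c from equations 10 to 12; the exact preprocessing script used for this thesis may differ.

import numpy as np

def make_imbalanced(images, labels, factors, seed=0):
    """Keep a fraction factors[c] of the samples of every class c."""
    rng = np.random.default_rng(seed)
    keep = []
    for c, f in enumerate(factors):
        idx = np.where(labels == c)[0]
        n_keep = max(1, int(round(f * len(idx))))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.concatenate(keep)
    return images[keep], labels[keep]

# Example: the linear profile of equation 11 for CIFAR-10 with gamma = 0.02.
N, gamma = 10, 0.02
factors = (1 - gamma) / (N - 1) * np.arange(N) + gamma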

3.2 Deploying techniques

The previously described techniques were applied to each of these seven imbalanced datasets through the use of various implementations in Python. For the re-sampling algorithms ROS, RUS, SMOTE, NM-2, CBU and ADASYN, this research used the implementation from the Python imbalanced-learn API [20], while the implementation for the P-SMOTE algorithm was adapted from the SMOTE-variants Python library implementation [17].

The GAN used in this research was based on a PyTorch implementation of a conditional deep convolutional GAN (cDCGAN): https://github.com/togheppi/cDCGAN. Both the discriminator and generator are 4-layer networks, using the Adam optimizer [14], with binary cross-entropy loss. For CIFAR-10, the generator and discriminator both use a learning-rate of 2·10⁻⁴, while the discriminator uses a learning-rate of 10⁻⁴ for CIFAR-100. The learning-rate schedule used for all the learning-rates decays the learning-rate by a factor of 2 after epoch 5, and again after every 10th epoch. The total number of epochs was set to 50, with a batch size of 128. Table 2 displays the computational properties of the cDCGANs for CIFAR-10 (cDCGAN-10) and CIFAR-100 (cDCGAN-100) in terms of the number of parameters and the number of floating point operations (FLOPs) required to forward one image through the network.

                  Parameters (M)             Cost per image (MFLOPs)
             Generator  Discriminator   Generator  Discriminator  Total
cDCGAN-100      3.449       2.738           576         189        1719
cDCGAN-10       3.081       2.645           564         141        1551

Table 2: Overview of the computational properties of the individual GAN networks for both CIFAR-10 and CIFAR-100. Considering each batch results in the generator being executed twice (once to provide fake samples to train the discriminator and once to train itself) and the discriminator being executed thrice (twice to train itself, for both fake and real data, and once to train the generator), the "Total" column represents the total cost per image for a single batch.

Since the class-balanced loss method is executed during training time by merely changing the loss weights for each class, it only required the implementation of a weight function which computes equation 9 for every class index, based on the β parameter. These weights can be forwarded to the loss function of the training network elaborated on in section 3.4.
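A minimal sketch of that weight function and of how the weights can be passed to the PyTorch cross-entropy loss is given below; class_counts is the per-class number of training images, and the rescaling is just the constant-factor convention mentioned in section 2.2.

import numpy as np
import torch
import torch.nn as nn

def class_balanced_weights(class_counts, beta=0.99):
    """Compute W_i = (1 - beta) / (1 - beta^n_i) for every class (equation 9)."""
    counts = np.asarray(class_counts, dtype=np.float64)
    weights = (1.0 - beta) / (1.0 - np.power(beta, counts))
    weights = weights / weights.sum() * len(counts)   # constant rescaling; it does not affect training
    return torch.tensor(weights, dtype=torch.float32)

# Example: three classes with 5000, 500 and 100 training images.
criterion = nn.CrossEntropyLoss(weight=class_balanced_weights([5000, 500, 100]))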

3.3 Application of techniques

In order to apply the implemented techniques on the training of the imbalanced datasets, some parameters for the different types of techniques were determined.

The over-sampling techniques (ROS, SMOTE, P-SMOTE, and ADASYN) were all set up such that each class from the training set was over-sampled to contain the same number of image samples as the largest majority class, creating, for every imbalanced dataset, a completely balanced training set of the same size as the training set of the original CIFAR dataset. Unlike the other over-sampling techniques, the GAN requires the training of a neural network. Since each imbalanced dataset is unique in terms of size and balance, training the GAN on these datasets would require re-configuration of the parameters to avoid difficulties in learning, such as the mode collapse explained in section 2.1.6. To avoid this re-configuration, the GAN was instead trained on the randomly over-sampled version of each imbalanced dataset. This is a relatively simple solution to implement and does not result in alterations to the individual data samples themselves.

Similarly to over-sampling, the under-sampling techniques (RUS, NM-2, CBU) were also configured to create perfectly balanced datasets, but in this instance through under-sampling each class to the same number of images as the smallest minority class. However, since this results in a significant reduction in total data samples for every imbalanced dataset, each under-sampling technique was also combined with random over-sampling such that each class was either over- or under-sampled to half the size of the largest majority class.

Additionally, each of the re-sampling techniques which rely on k-NN (SMOTE, ADASYN, and NM-2) uses a k value of 5, such that five neighbors were evaluated for each respective algorithm. Furthermore, CBL was set up using three different values for β, due to the varying results a single β value might produce on different types of imbalance. The chosen values for β were 0.9, 0.99 and 0.999. Figure 8 illustrates how the number of class images affects the loss weight for these three values of β.
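As an illustration of the wiring, the imbalanced-learn re-samplers operate on flattened feature vectors, so each 32x32x3 image is reshaped to 3072 features before re-sampling and reshaped back afterwards; the dummy data below only stands in for an imbalanced CIFAR split.

import numpy as np
from imblearn.over_sampling import SMOTE

# Dummy imbalanced data: 200 majority and 20 minority images of 32x32x3.
images = np.random.rand(220, 32, 32, 3).astype(np.float32)
labels = np.array([0] * 200 + [1] * 20)

X = images.reshape(len(images), -1)                          # flatten to (n, 3072)
X_res, y_res = SMOTE(k_neighbors=5).fit_resample(X, labels)  # balance all classes to the majority size
images_res = X_res.reshape(-1, 32, 32, 3)                    # back to image shape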

Figure 8: Class-specific loss weight for various values of β. Shows the class loss weight as a function of the number of images the class contains, with varying values of β.

3.4 Training

The network used to train the individual datasets is ResNet-50 [12], which is a residual CNN with 50 layers, based on the Python PyTorch implementation on: https://github.com/bearpaw/pytorch-classification. It uses the stochastic gradient descent (SGD) optimizer with the cross-entropy loss function, a weight decay of 10−4, and a momentum of 0.9. The computational properties of the networks are visible in table 3.

                       Parameters (M)   Cost per image (MFLOPs)
ResNet-50 CIFAR-100        0.764                 226
ResNet-50 CIFAR-10         0.759                 226

Table 3: Overview of the computational statistics of the ResNet-50 networks for both CIFAR-10 and CIFAR-100.

Due to the varying sizes of the datasets, setting a fixed learning-rate, learning-rate decay schedule, and number of epochs would either result in strong over-fitting on the smaller datasets or under-fitting on the larger datasets. Therefore, it was chosen to create an adaptive learning-rate schedule and number of epochs, which were determined during the training. This was achieved by using an initial learning-rate of 0.1, which decayed with a factor of 10 after the 81st and 122nd epoch, for a total of 164 epochs. During the training, the algorithm kept track of the best performing network, based on the macro arithmetic mean accuracy (MAMA), which is the arithmetic average of the partial class accuracies. When the learning-rate decayed, the most recent iteration of the network was replaced by the best performing network. This effectively discards any regression of the network in the epochs before the decay, which also shifts the learning-rate decay schedule forward and reduces the effective number of epochs.
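The sketch below captures this rollback schedule; run_epoch and eval_mama are hypothetical callables for one training epoch and for computing the MAMA metric (equation 13), so the snippet only illustrates the bookkeeping, not the full ResNet-50 training script.

import copy

def train_with_rollback(model, optimizer, run_epoch, eval_mama,
                        decay_epochs=(81, 122), total_epochs=164, lr=0.1):
    """Roll back to the best-MAMA checkpoint at every learning-rate decay."""
    best_mama, best_state = 0.0, copy.deepcopy(model.state_dict())
    for epoch in range(total_epochs):
        if epoch in decay_epochs:
            lr /= 10.0                                 # decay the learning-rate by a factor of 10
            for group in optimizer.param_groups:
                group["lr"] = lr
            model.load_state_dict(best_state)          # discard the regressed epochs
        run_epoch(model, optimizer)                    # assumed helper: one epoch of SGD training
        mama = eval_mama(model)                        # assumed helper: validation MAMA
        if mama > best_mama:
            best_mama, best_state = mama, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model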

Using this configuration, a total of three ResNet-50 networks were trained for every configura-tion, using a batch size of 128.

3.5 Evaluation

After training, there are a total of seven imbalanced datasets, each with a total of 14 additional configurations — one for each implementation of a technique — as well as eight baseline datasets. Since three networks were trained for each configuration, this resulted in a total of 318 networks that needed to be evaluated.

Considering the nature of the imbalance problem, traditional overall classification accuracy is a poor indicator of performance, since it exhibits a similar bias towards the majority classes. Therefore, various other metrics have been utilized to provide a better representation of the performance differences:

Macro arithmetic mean accuracy

Macro arithmetic mean accuracy (MAMA) is the same metric that has been utilized during the training of the network. This metric averages individual class accuracies by computing the arithmetic mean between the accuracies of these classes. This effectively results in a weighted accuracy, where the predictions from minority classes receive a higher weight.

$MAMA = \frac{\sum_{i=1}^{N} acc_i}{N}$   (13)

Where $N$ is the number of classes, and $acc_i$ is the accuracy for the $i$th class.

In addition to computing this metric value for the performance of all the classes, this metric has also been computed for the 20% worst performing classes.

Macro geometric mean accuracy

Instead of utilizing the arithmetic mean to compute the average class accuracy, the macro geometric mean accuracy (MGMA) computes the geometric mean:

$MGMA = \left( \prod_{i=1}^{N} acc_i \right)^{\frac{1}{N}}$   (14)

This metric is more sensitive to lower values among the class accuracies and therefore ascribes a lower value to methods where some classes perform much worse than others, which is exactly the behavior these methods should avoid.


Arithmetic mean F1-score

Besides accuracy, two other frequently used metrics are precision and recall, with the following equations:

$P_i = \frac{TP_i}{TP_i + FP_i}$   (15)

Where $i$ is the class index, $P$ is the precision, $TP$ is the number of true positive classifications and $FP$ is the number of false positives.

$R_i = \frac{TP_i}{TP_i + FN_i}$   (16)

Where $R$ is the recall, and $FN$ is the number of false negatives.

The F1-score metric computes the harmonic mean between precision and recall:

$F1_i = \frac{2 \cdot P_i \cdot R_i}{P_i + R_i}$   (17)

To compute a single F1-score for all the classes combined, the arithmetic mean of the per-class F1-scores was computed, resulting in the arithmetic mean F1-score (AM-F1).
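The four classification metrics can be computed from a confusion matrix as sketched below; the helper name is illustrative and scikit-learn is only used to build the confusion matrix.

import numpy as np
from sklearn.metrics import confusion_matrix

def imbalance_metrics(y_true, y_pred, n_classes):
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes))).astype(float)
    acc = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)      # per-class accuracy (recall, eq. 16)
    prec = np.diag(cm) / np.maximum(cm.sum(axis=0), 1)     # per-class precision (eq. 15)
    f1 = 2 * prec * acc / np.maximum(prec + acc, 1e-12)    # per-class F1-score (eq. 17)
    worst = np.sort(acc)[: max(1, n_classes // 5)]         # the 20% worst performing classes
    return {
        "MAMA": acc.mean(),                                # equation 13
        "MGMA": float(np.prod(acc) ** (1.0 / n_classes)),  # equation 14
        "AM-F1": f1.mean(),
        "Min 20%": worst.mean(),
    }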

Rank

To enable the ranking of the techniques, each technique has received four ranks — one for every previous metric — for every imbalanced dataset, which ranks their performance relative to the other techniques. These four ranks are averaged for each imbalanced dataset, which provides a ranking for the individual imbalanced datasets. Finally, for both CIFAR-10 and CIFAR-100 the average ranks for each technique were computed, to provide a general ranking of each technique.

Floating point operations

Finally, to provide an indication of the computational cost of training, the total number of floating point operations (FLOPs) has been computed for the training of each network. As explained previously, the network adaptively discards epochs where over-fitting occurred. The number of discarded epochs has been subtracted from the 164 total training epochs, to simulate the typical number of epochs if the network's learning-rate schedule were manually fine-tuned for each dataset. After subtraction, the remaining effective number of epochs is multiplied by the number of images in the entire dataset, which differs for each imbalanced dataset due to their varying sizes, and then by the FLOPs-per-image value of the respective network.

For evaluation, these metrics have been computed for the classification performance on the test set of each of the three networks for each imbalanced dataset. To illustrate the performance impact of utilizing imbalanced datasets instead of balanced ones, these metrics have also been computed for the baseline network of each respective imbalanced dataset. The metric results for each of these three networks have been averaged. Additionally, synthesized images from each of the image-synthesizing re-sampling methods have been randomly selected for visual inspection.

4 Results

This section contains the results obtained through the evaluation methodology as elaborated in the previous section. These results contain numerous acronyms for the varying techniques and metrics. Table 4 provides an additional overview of these acronyms.

Acronym     Meaning                                          Section

Techniques
ADASYN      Adaptive synthetic sampling                      2.1.3
CBU         Cluster-based under-sampling                     2.1.5
GAN         Generative adversarial network                   2.1.6
CBL         Class-balanced loss function                     2.2
NM-2        NearMiss-2                                       2.1.4
ROS         Random over-sampling                             2.1
RUS         Random under-sampling                            2.1
SMOTE       Synthetic minority over-sampling technique       2.1.1
P-SMOTE     Polynomial-fit SMOTE                             2.1.2

Metrics
MAMA        Macro arithmetic mean accuracy                   3.5
MGMA        Macro geometric mean accuracy                    3.5
AM-F1       Arithmetic mean F1-score                         3.5
Min 20%     MAMA for the 20% worst performing classes        3.5
FLOPs       Floating point operations                        3.5

Table 4: Overview of the acronyms used in the results for the techniques and the performance metrics. The section column refers to the section where the acronym originates from.

4.1 Baselines

The following tables show the difference in the performance metrics between an imbalanced dataset and a balanced dataset of the same size, thereby illustrating the performance impact of training a network on an imbalanced multi-class image recognition dataset.

CIFAR-10
                           MAMA   MGMA   AM-F1  Min 20%  TFLOPs
Linear       Imbalanced    0.841  0.801  0.851   0.530     601
             Balanced      0.899  0.897  0.898   0.805     795
Logarithmic  Imbalanced    0.809  0.782  0.827   0.508     418
             Balanced      0.869  0.867  0.869   0.764     436
Spike        Imbalanced    0.636  0.409  0.612   0.308     192
             Balanced      0.826  0.819  0.824   0.690     242

Table 5: Shows the mean average results of the varying metrics on the classification of each network trained on imbalanced and balanced versions of the CIFAR-10 dataset.

CIFAR-100
                           MAMA   MGMA   AM-F1  Min 20%  TFLOPs
Linear       Imbalanced    0.629  0.604  0.632   0.186     585
             Balanced      0.641  0.624  0.641   0.319     509
Logarithmic  Imbalanced    0.554  0.517  0.557   0.142     325
             Balanced      0.577  0.552  0.577   0.212     452
Spike        Imbalanced    0.500  0.453  0.499   0.125     255
             Balanced      0.514  0.484  0.515   0.160     283
ImageNet     Imbalanced    0.521  0.000  0.523   0.000     481
             Balanced      0.587  0.565  0.588   0.237     416

Table 6: Shows the mean average results of the varying metrics on the classification of each network trained on imbalanced and balanced versions of the CIFAR-100 dataset.

These tables illustrate that imbalanced versions of CIFAR-10 have a significantly higher reduction in performance than imbalanced versions of CIFAR-100. For example, the performance reduction in MAMA is 12.12% for CIFAR-10, and 2.86% for CIFAR-100 when only considering the versions of imbalance which are shared by both CIFAR-10 and CIFAR-100. This difference is especially severe in the spike dataset, with a MAMA reduction of 23.00% for CIFAR-10 and only 2.72% for CIFAR-100.

4.2 Techniques

This section displays all the results from the implemented and tested techniques in this research. The generated images for each image-generating re-sampling technique will be displayed, as well as the performance metrics as described above.

4.2.1 Performance results

Due to the great quantity of performance results (Appendix) and the low variation in the performance statistics between the individual types of imbalanced datasets, this section only displays an overall ranking of each technique and the performance on each metric for the spike distribution of CIFAR-10, due to the larger gap this distribution provides between the balanced and imbalanced dataset, which might suggest that the results on this distribution are more significant. These results contain a great number of acronyms, indicating the techniques and metrics.

                           CIFAR-10 rank   CIFAR-100 rank   Avg. TFLOPs

Over-sampling
  SMOTE                      2.17 (1/1)       2.63 (1/1)          907
  GAN                        3.33 (2/2)       4.06 (3/2)         1172 (+4118)
  ADASYN                     3.67 (3/3)       4.19 (4/3)          914
  ROS                        4.92 (6/4)       5.63 (6/4)          929
  P-SMOTE                    7.25 (8/5)       9.63 (10/5)        1117

Under-sampling
  RUS                       10.42 (11/1)      9.69 (11/1)         143
  CBU                       12.42 (12/2)     11.13 (12/2)         133
  NM-2                      12.50 (13/3)     12.50 (13/3)         136

Over- + under-sampling
  ROS + CBU                  6.75 (7/1)       7.19 (7/1)          462
  ROS + RUS                  7.42 (9/2)       7.81 (9/3)          457
  ROS + NM-2                 9.67 (10/3)      7.69 (8/2)          500

Loss
  CBL                        4.33 (4/-)       2.69 (2/-)          388

No technique
  No technique               4.67 (5/-)       4.44 (5/-)          408

Table 7: Overview of the average ranking for each technique, separated by technique-type, on both CIFAR-10 and CIFAR-100. The first number in brackets indicates the overall rank, and the second number represents the rank within its particular type of technique.


To condense the results from the tables in the appendix, table 7 provides an overall ranking of each technique on both CIFAR-10 and CIFAR-100, and the average computational cost required to train a network after applying the respective technique. These results show that on both CIFAR-10 and CIFAR-100, SMOTE yields the best overall performance, with CBL following closely on the CIFAR-100 dataset. Additionally, the majority of the techniques perform worse than applying no technique, with only the SMOTE, GAN, ADASYN, and CBL techniques performing better.

In terms of computational cost, the over-sampling techniques are the most expensive, increasing the computational cost of just training the network by 122.31% to 187.25% on average compared to applying no technique. Contrarily, under-sampling techniques are the least expensive, reducing the computational cost by between 64.95% and 67.40% on average. Both CBL and the combined over- and under-sampling techniques perform relatively close to the default imbalanced dataset, adding between -4.90% and 22.55% on average.

Technique      MAMA        MGMA        AM-F1       Min 20%     TFLOPs         Avg. rank
No technique   0.636 (4)   0.409 (11)  0.612 (5)   0.308 (8)     192             7.00
ADASYN         0.682 (2)   0.649 (2)   0.666 (2)   0.400 (1)     721             1.75
CBU            0.399 (12)  0.249 (13)  0.127 (13)  0.055 (13)     18            12.75
GAN            0.644 (3)   0.603 (3)   0.681 (1)   0.383 (2)     938 (+3878)*    2.25
CBL (β=0.99)   0.629 (5)   0.588 (4)   0.624 (3)   0.358 (4)     170             4.00
NM-2           0.370 (13)  0.321 (12)  0.159 (12)  0.129 (12)     20            12.25
ROS            0.600 (6)   0.558 (5)   0.580 (6)   0.358 (4)     834             5.25
ROS + CBU      0.567 (9)   0.503 (7)   0.522 (10)  0.223 (11)    377             9.25
ROS + NM-2     0.553 (10)  0.494 (9)   0.534 (9)   0.267 (10)    420             9.50
ROS + RUS      0.578 (7)   0.531 (7)   0.543 (8)   0.300 (9)     323             7.75
RUS            0.476 (11)  0.464 (10)  0.299 (11)  0.358 (4)      20             9.00
SMOTE          0.693 (1)   0.659 (1)   0.620 (4)   0.383 (2)     667             2.00
P-SMOTE        0.578 (7)   0.533 (6)   0.556 (7)   0.333 (7)     789             6.75

Table 8: Shows the performance metrics for the techniques on CIFAR-10's imbalanced spike dataset. The values in brackets represent the ranking the technique receives for that particular dataset according to the metric of that column. These ranks are averaged in the last column. * The additional FLOP values for the GAN technique represent the cost to train the GAN itself.

Table 8 shows the performance for each technique on the CIFAR-10 spike dataset, with largely similar results to the previous table. The over-sampling techniques SMOTE, ADASYN and GAN still provide the best performance; however, in this case ADASYN outperforms SMOTE, with the SMOTE and GAN techniques performing close to ADASYN. In addition, P-SMOTE and random over-sampling provide slightly better performance on this dataset than applying no technique, unlike in the previous table where these techniques performed worse.

4.2.2 Generated images

CIFAR-10

Figure 9: Examples of generated images for CIFAR-10 synthesized by the re-sampling techniques: (a) SMOTE, (b) ADASYN, (c) P-SMOTE, (d) CBU, (e) GAN. A black bar implies that no images have been generated for that class.


CIFAR-100

Figure 10: Examples of generated images for CIFAR-100 synthesized by the re-sampling techniques: (a) SMOTE, (b) ADASYN, (c) P-SMOTE, (d) CBU, (e) GAN.

Figures 9 and 10 contain samples of the images generated by the data-generating re-sampling techniques. The results between CIFAR-10 and CIFAR-100 are quite similar. In both cases, SMOTE and ADASYN generate images which seem fairly accurate in terms of the shapes of the subjects and the color intensity of the images, but with frequent color artifacts. The P-SMOTE, CBU and GAN techniques do not display such artifacting. While the subjects in the images generated by P-SMOTE seem accurate in terms of shape, the colors are washed out. In addition to being washed out, the CBU images are generally blurred. The GAN images suffer from a different problem: despite the generated images showing features from the respective classes, the image classes are not immediately recognizable, and the images generally seem to be deformed and of low quality.

5 Discussion

This section provides an in-depth analysis and interpretation of the results regarding the evaluation of the varying techniques to counter training imbalance on multi-class imbalanced image recognition. Additionally, shortcomings of these results and the underlying research are identified, which leads to propositions for future research.

5.1 Performance remarks

5.1.1 General classification performance

The overall rankings indicate that the over-sampling techniques SMOTE, GAN and ADASYN, together with CBL, are the best performing techniques with regard to the classification performance on imbalanced multi-class image recognition datasets. Overall, the best performing technique for both CIFAR-10's and CIFAR-100's imbalanced datasets is the traditional SMOTE technique, with over-sampling through the use of a GAN or ADASYN coming relatively close on the CIFAR-10 dataset, and CBL performing very similarly on CIFAR-100. This increase in relative performance for CBL on CIFAR-100 could be explained by ADASYN and GAN-based over-sampling losing performance.

For ADASYN, this performance reduction on CIFAR-100 could be caused by the way ADASYN intrinsically generates images. In contrast to SMOTE, ADASYN focuses its over-sampling on the areas which contain more data samples from other classes, in an attempt to provide more learning data for the areas which are harder to learn. Due to the complexity of image data, where a single class may consist of multiple clusters, this might result in generated images that overlap with other classes. Since CIFAR-100 consists of more classes, there should be more overlap between these different classes, resulting in more overlapping images.

The performance reduction for the GAN, on the other hand, can be explained by the complexity of the GAN. For both CIFAR-10 and CIFAR-100, the GAN has virtually the same number of parameters. Considering there are more classes to distinguish for CIFAR-100, which increases the complexity of the problem, this could result in a decreased quality of the generated images by the GAN on this dataset.

The remaining techniques, including random over-sampling and P-SMOTE, the under-sampling techniques, and the combined re-sampling techniques, perform worse than the default configurations, where no techniques are applied. From the results, it becomes evident that the three under-sampling techniques CBU, NearMiss-2, and random under-sampling suffer from the main drawback of under-sampling techniques, which is the loss of trainable features due to the removal of data. While combining these under-sampling techniques with random over-sampling improves the classification performance, the performance is considerably worse than that of the other techniques and the default configuration.

5.1.2 General computational performance

In addition to performance metrics, the results provide computational metrics. Due to the increase in dataset size, the over-sampling techniques significantly increase the computational cost required to train the classifier compared to applying no technique, with SMOTE increasing the average number of FLOPs by 122.31%. Contrarily, applying a class-balanced loss function does not result in a higher computational cost, since the size of the dataset does not change, while it still yields excellent classification performance.


5.1.3 Baseline performance

Notably, the default configurations perform relatively well, ranking fifth for both CIFAR-10 and CIFAR-100. This relatively high classification performance is supported by the baseline results. Tables 5 and 6 indicate that while the classification on both the CIFAR-10 and CIFAR-100 dataset suffers when these datasets are imbalanced, most versions of imbalance across both datasets result in only a relatively small reduction in classification performance, with MAMA, MGMA and AM-F1 values decreasing by less than 15% for CIFAR-10 and less than 4% for CIFAR-100. Although this reduction increases for the Min 20% score, this metric only represents a relatively small, albeit relevant, spectrum of the total performance.

Due to these minimal performance reductions, there is not much room for improvement for the tested techniques. This lack of performance degradation is probably a result of the datasets used in this research. CIFAR, especially CIFAR-100 with 500 images per class, does not allow for a high imbalance factor for the creation of imbalanced datasets. If the imbalance factor becomes too big, the minority classes will have too few images to learn from. Therefore, the imbalanced datasets still have relatively large minority classes.

For example, in the spike dataset for CIFAR-100, the majority classes combined contain roughly 3400 images, whereas combining the minority classes yields a total of 8900 images. Consequently, the minority classes together strongly outnumber the majority classes, which forces the network to learn these minority classes to achieve a low loss value, which results in the majority class bias not being as prominent as would be the case with a higher imbalance factor. With the other imbalanced datasets, the line between a majority class and a minority class is much less pronounced, but the same principle generally holds, even for CIFAR-10. Only the worse-performing spike variation of CIFAR-10 is not affected by this issue. In this dataset, the majority classes, with 8300 total images, severely outnumber the minority classes with 700 images, which explains the more severe performance reduction. This suggests that the performance figures on this dataset are perhaps more significant than those on the other datasets.

5.1.4 CIFAR-10 spike distribution performance

However, table 8 illustrates that the ranking of the techniques on CIFAR-10's spike distribution dataset is quite similar to that on the other imbalanced datasets. Similarly to the general ranking on CIFAR-10, SMOTE, ADASYN, and GAN-based over-sampling achieve the best performance, with ADASYN performing slightly better than SMOTE and the GAN. However, due to the lower performance of the default configuration on this dataset, the class-balanced loss function, P-SMOTE, and random over-sampling improve their relative scores significantly, with random over-sampling and P-SMOTE now performing slightly better than the default configuration.

5.1.5 k-NN performance

Considering the k-NN based algorithms are not generally utilized for image data, it is remarkable that some of these techniques provide such excellent performance, especially since the images generated by both techniques contain colored artifacts, which makes the images look unconvincing to humans. However, from the classification performance, it seems that these artifacts do not diminish, and may even enhance, the classification performance. It is conceivable that this is due to the particular datasets used in this research. The images in CIFAR are only 32x32x3, resulting in 3072 features, which is a relatively small dimensionality for images.

Furthermore, the image subjects are centered and generally fill the entire frame. Therefore the same pixels, and accordingly feature indexes, describe the discriminating information in each image. This gives k-NN methods an advantage, considering the different images from the same class are therefore more clustered in N-dimensional space than on datasets with larger images and non-centered subjects.


However, it is notable that P-SMOTE performs much worse than traditional SMOTE and ADASYN, while ADASYN, SMOTE, and P-SMOTE are all based on k-NN. Additionally, the synthesized images generated by P-SMOTE in figures 9 and 10 seem to look quite convincing, if a bit washed out. Unlike SMOTE and ADASYN, it does not seem to suffer from the colored artifacts exhibited by both these techniques. The reason P-SMOTE performs poorly could be the distance between the two image samples used to generate a new image. Since P-SMOTE was configured to use the star pattern, where data samples are randomly sampled between the mean feature vector and other, possibly far removed data samples, the generated images might fall outside the class regions if these class regions consist of multiple clusters.

5.1.6 GAN performance

As mentioned before, the GAN technique shows quite good performance, despite the poor visual appearance of the images. Similarly to SMOTE and ADASYN, this is probably due to the GAN learning underlying distinguishing features. Even though the computational cost of training the GAN is 251.36% more than the training of the classification network on average, the network is still relatively shallow for a GAN. Especially for CIFAR-100, where the GAN performance is worse than on CIFAR-10, a better GAN could result in images which are more convincing to humans, which might result in better performance. This could result in GANs overtaking SMOTE in terms of classification performance on datasets with larger and non-centered images, where k-NN techniques might be less suited.

On the other hand, the performance improvement of GAN-based over-sampling could be attributed to the increase in effective training parameters. In a GAN, the discriminator has a similar function to the ResNet-50 classifier network, and the discriminator of the cDCGAN used in this research has 3.6 times as many parameters as the ResNet-50 classifier. If this explains the improvement, the GAN's performance figures are not comparable to those of the other techniques, since the GAN technique has access to significantly more classification parameters.
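
This kind of comparison can be reproduced with a short parameter-counting sketch; the ResNet-50 below comes from torchvision, while the cDCGAN discriminator is not defined here, so that part of the comparison is only indicated.

import torch.nn as nn
from torchvision.models import resnet50

def count_parameters(model: nn.Module) -> int:
    # Count only trainable parameters, which determine the effective capacity.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

classifier = resnet50(num_classes=10)
print("ResNet-50 classifier parameters:", count_parameters(classifier))
# The same helper would be applied to the cDCGAN discriminator (not shown here):
# print("Discriminator parameters:", count_parameters(discriminator))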

5.2 Future research

To obtain a more complete overview of the different techniques, this research proposes several changes for future work. The main suggestion is to utilize a different dataset, preferably one with more images per class, which allows for a more substantial imbalance factor and should therefore widen the performance gap between the balanced and imbalanced versions. Furthermore, the individual images should be of higher resolution, and the image subjects' locations should vary throughout the frame. This change aims to discover whether k-NN techniques like SMOTE and ADASYN retain their excellent performance on more complicated datasets.

In addition to changes to the dataset, it could be worthwhile to test more SMOTE variations on image data. In this research, the variation that performs best on other types of data according to previous research (P-SMOTE) performed poorly, while traditional SMOTE performed quite well. Since there are many variations of SMOTE, some of them might perform better on image data.

Finally, considering the uncertainties regarding the performance improvements of the cDCGAN, more research is required to determine the usefulness of over-sampling using GANs. The primary question is whether the GAN's increase in performance is due to the additional training parameters, or whether GANs provide a more significant increase than merely enlarging the classifier. This research proposes to investigate this by comparing a smaller classifier aided by GAN-based over-sampling against a larger classifier that has the same number of parameters, or requires the same number of FLOPs, as the smaller classifier and the GAN's discriminator combined. If such research reveals that the additional training parameters cannot solely explain the GAN's increase in performance, it becomes worthwhile to expand the research with deeper and more complex GANs, which generate more accurate images than those produced in this research.


6 Concluding remarks

This research has provided compelling results regarding various techniques to reduce the training bias on multi-class imbalanced image recognition datasets. Primarily, when the only goal is to increase the classification performance, the over-sampling technique SMOTE is capable of providing excellent results on a wide range of imbalanced dataset variations, scaling well to datasets with both 10 and 100 classes. Moreover, ADASYN and GAN-based over-sampling also exhibit excellent performance on the smaller datasets with ten classes, while falling behind on the datasets with 100 classes. Additionally, adapting the loss weights using a class-balanced loss function can yield excellent results, especially on the datasets with 100 classes, without increasing the computational cost of training a network, whereas SMOTE almost triples this cost.

Furthermore, the results show that the evaluated under-sampling techniques severely under-perform in comparison to the other techniques on datasets with multiple imbalanced classes, due to the excessive data loss these methods cause.

However, extensive research is still required to verify this research's results, primarily the generalizability of the techniques' performance across multi-class image recognition datasets with more imbalance, complexity, and image samples.


References

[1] Paul S Bradley and Usama M Fayyad. “Refining initial points for k-means clustering.” In: ICML. Vol. 98. Citeseer. 1998, pp. 91–99.

[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. “Large scale gan training for high fidelity natural image synthesis”. In: arXiv preprint arXiv:1809.11096 (2018).

[3] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. “SMOTE: synthetic minority over-sampling technique”. In: Journal of artificial intelligence research 16 (2002), pp. 321–357.

[4] Thomas Cover and Peter Hart. “Nearest neighbor pattern classification”. In: IEEE transactions on information theory 13.1 (1967), pp. 21–27.

[5] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. “Class-balanced loss based on effective number of samples”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 9268–9277.

[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In: 2009 IEEE conference on computer vision and pattern recognition. Ieee. 2009, pp. 248–255.

[7] Georgios Douzas and Fernando Bacao. “Effective data generation for imbalanced learning using conditional generative adversarial networks”. In: Expert Systems with applications 91 (2018), pp. 464–471.

[8] Sami Gazzah and Najoua Essoukri Ben Amara. “New oversampling approaches based on polynomial fitting for imbalanced data sets”. In: 2008 The Eighth IAPR International Workshop on Document Analysis Systems. IEEE. 2008, pp. 677–684.

[9] Ian Goodfellow et al. “Generative adversarial nets”. In: Advances in neural information processing systems. 2014, pp. 2672–2680.

[10] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. “ADASYN: Adaptive synthetic sampling approach for imbalanced learning”. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE. 2008, pp. 1322–1328.

[11] Haibo He and Edwardo A Garcia. “Learning from imbalanced data”. In: IEEE Transactions on knowledge and data engineering 21.9 (2009), pp. 1263–1284.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778.

[13] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and Improving the Image Quality of StyleGAN. 2019. arXiv: 1912.04958 [cs.CV].

[14] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).

[15] Alexander Kolesnikov et al. Big Transfer (BiT): General Visual Representation Learning. 2019. arXiv: 1912.11370 [cs.CV].

[16] György Kovács. “An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets”. In: Applied Soft Computing 83 (2019), p. 105662.

[17] György Kovács. “Smote-variants: A python implementation of 85 minority oversampling techniques”. In: Neurocomputing 366 (2019), pp. 352–354.

[18] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features from tiny images”. In: (2009).

[19] Miroslav Kubat, Robert C Holte, and Stan Matwin. “Machine learning for the detection of oil spills in satellite radar images”. In: Machine learning 30.2-3 (1998), pp. 195–215.


[20] Guillaume Lemaître, Fernando Nogueira, and Christos K Aridas. “Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning”. In: The Journal of Machine Learning Research 18.1 (2017), pp. 559–563.

[21] Wei-Chao Lin, Chih-Fong Tsai, Ya-Han Hu, and Jing-Shang Jhang. “Clustering-based undersampling in class-imbalanced data”. In: Information Sciences 409 (2017), pp. 17–26.

[22] Inderjeet Mani and I Zhang. “kNN approach to unbalanced data distributions: a case study involving information extraction”. In: Proceedings of workshop on learning from imbalanced datasets. Vol. 126. 2003.

[23] Mehdi Mirza and Simon Osindero. “Conditional generative adversarial nets”. In: arXiv preprint arXiv:1411.1784 (2014).

[24] Joseph Rocca. Understanding Generative Adversarial Networks (GANs). 2019. url: https://towardsdatascience.com/understanding-generative-adversarial-networks-gans-cd6e4651a29 (visited on 06/23/2020).

[25] Rohit Walimbe. Handling imbalanced dataset in supervised learning using family of SMOTE algorithm. 2017. url: https://www.datasciencecentral.com/profiles/blogs/handling-imbalanced-data-sets-in-supervised-learning-using-family (visited on 06/23/2020).

[26] Show-Jane Yen and Yue-Shi Lee. “Cluster-based under-sampling approaches for imbalanced data distributions”. In: Expert Systems with Applications 36.3 (2009), pp. 5718–5727.

[27] Xinyue Zhu, Yifan Liu, Jiahong Li, Tao Wan, and Zengchang Qin. “Emotion classification with data augmentation using generative adversarial networks”. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer. 2018, pp. 349–360.


Appendix

Tables 9 and 10 contain the performance results and rankings for each metric for every tested technique on all the imbalanced datasets derived from CIFAR-10 and CIFAR-100. In table 10, the MGMA and Min 20% scores equal 0.000 for every technique on the ImageNet distribution, because some classes are never accurately recognized. These scores do not contribute to the average rank, since the rank would be the same for each technique. Table 4 contains a list of the acronyms.

Imbalance-type | Technique | MAMA | MGMA | AM-F1 | Min 20% | TFLOPs | Avg. rank
Linear | No technique | 0.841 (2) | 0.801 (3) | 0.851 (2) | 0.530 (3) | 601 | 2.50
Linear | ADASYN | 0.834 (5) | 0.774 (8) | 0.847 (3) | 0.493 (7) | 1071 | 5.75
Linear | CBU | 0.315 (12) | 0.219 (12) | 0.215 (12) | 0.053 (13) | 18 | 12.25
Linear | GAN | 0.836 (4) | 0.776 (7) | 0.843 (4) | 0.486 (8) | 1372 (+3878)* | 5.75
Linear | CBL (β=0.9) | 0.834 (5) | 0.788 (5) | 0.839 (6) | 0.522 (4) | 599 | 5.00
Linear | NM-2 | 0.281 (13) | 0.195 (13) | 0.215 (12) | 0.065 (12) | 18 | 12.50
Linear | ROS | 0.842 (1) | 0.819 (1) | 0.838 (7) | 0.605 (1) | 947 | 2.50
Linear | ROS + CBU | 0.827 (7) | 0.792 (4) | 0.824 (9) | 0.516 (6) | 556 | 6.50
Linear | ROS + NM-2 | 0.800 (10) | 0.732 (10) | 0.789 (10) | 0.440 (10) | 481 | 10.00
Linear | ROS + RUS | 0.826 (9) | 0.782 (6) | 0.832 (8) | 0.517 (5) | 524 | 7.00
Linear | RUS | 0.464 (12) | 0.441 (11) | 0.389 (11) | 0.249 (11) | 20 | 11.25
Linear | SMOTE | 0.841 (2) | 0.803 (2) | 0.854 (1) | 0.535 (2) | 863 | 1.75
Linear | P-SMOTE | 0.827 (7) | 0.768 (9) | 0.841 (5) | 0.474 (9) | 1033 | 7.50
Logarithmic | No technique | 0.809 (5) | 0.782 (5) | 0.827 (2) | 0.508 (6) | 418 | 4.50
Logarithmic | ADASYN | 0.816 (3) | 0.793 (3) | 0.819 (5) | 0.559 (3) | 1094 | 3.50
Logarithmic | CBU | 0.415 (12) | 0.288 (12) | 0.265 (12) | 0.044 (13) | 23 | 12.25
Logarithmic | GAN | 0.827 (2) | 0.807 (1) | 0.841 (1) | 0.554 (4) | 1372 (+3878)* | 2.00
Logarithmic | CBL (β=0.999) | 0.810 (4) | 0.790 (4) | 0.820 (3) | 0.544 (5) | 312 | 4.00
Logarithmic | NM-2 | 0.308 (13) | 0.234 (13) | 0.203 (13) | 0.075 (12) | 18 | 12.75
Logarithmic | ROS | 0.803 (6) | 0.755 (9) | 0.820 (3) | 0.453 (10) | 768 | 7.00
Logarithmic | ROS + CBU | 0.795 (8) | 0.772 (6) | 0.774 (10) | 0.583 (2) | 332 | 4.50
Logarithmic | ROS + NM-2 | 0.787 (10) | 0.751 (10) | 0.790 (9) | 0.454 (9) | 538 | 9.50
Logarithmic | ROS + RUS | 0.800 (7) | 0.760 (8) | 0.811 (8) | 0.464 (7) | 574 | 7.50
Logarithmic | RUS | 0.461 (11) | 0.449 (11) | 0.380 (11) | 0.313 (11) | 18 | 11.00
Logarithmic | SMOTE | 0.823 (1) | 0.803 (2) | 0.813 (7) | 0.592 (1) | 929 | 2.75
Logarithmic | P-SMOTE | 0.792 (9) | 0.761 (7) | 0.818 (6) | 0.462 (8) | 909 | 7.50
Spike | No technique | 0.636 (4) | 0.409 (11) | 0.612 (5) | 0.308 (8) | 192 | 7.00
Spike | ADASYN | 0.682 (2) | 0.649 (2) | 0.666 (2) | 0.400 (1) | 721 | 1.75
Spike | CBU | 0.399 (12) | 0.249 (13) | 0.127 (13) | 0.055 (13) | 18 | 12.75
Spike | GAN | 0.644 (3) | 0.603 (3) | 0.681 (1) | 0.383 (2) | 938 (+3878)* | 2.25
Spike | CBL (β=0.99) | 0.629 (5) | 0.588 (4) | 0.624 (3) | 0.358 (4) | 170 | 4.00
Spike | NM-2 | 0.370 (13) | 0.321 (12) | 0.159 (12) | 0.129 (12) | 20 | 12.25
Spike | ROS | 0.600 (6) | 0.558 (5) | 0.580 (6) | 0.358 (4) | 834 | 5.25
Spike | ROS + CBU | 0.567 (9) | 0.503 (7) | 0.522 (10) | 0.223 (11) | 377 | 9.25
Spike | ROS + NM-2 | 0.553 (10) | 0.494 (9) | 0.534 (9) | 0.267 (10) | 420 | 9.50
Spike | ROS + RUS | 0.578 (7) | 0.531 (7) | 0.543 (8) | 0.300 (9) | 323 | 7.75
Spike | RUS | 0.476 (11) | 0.464 (10) | 0.299 (11) | 0.358 (4) | 20 | 9.00
Spike | SMOTE | 0.693 (1) | 0.659 (1) | 0.620 (4) | 0.383 (2) | 667 | 2.00
Spike | P-SMOTE | 0.578 (7) | 0.533 (6) | 0.556 (7) | 0.333 (7) | 789 | 6.75

Table 9: Shows the performance metrics for the techniques on CIFAR-10’s imbalanced datasets. The values in brackets represent the ranking the technique receives for that particular dataset according to the metric of that column. These ranks are averaged in the last column.


Imbalance-type | Technique | MAMA | MGMA | AM-F1 | Min 20% | TFLOPs | Avg. rank
Linear | No technique | 0.629 (5) | 0.604 (5) | 0.632 (5) | 0.186 (7) | 585 | 5.50
Linear | ADASYN | 0.634 (3) | 0.608 (3) | 0.640 (2) | 0.165 (9) | 861 | 4.25
Linear | CBU | 0.418 (12) | 0.382 (12) | 0.392 (12) | 0.117 (11) | 217 | 11.75
Linear | GAN | 0.632 (4) | 0.607 (4) | 0.635 (4) | 0.197 (6) | 1245 (+4298)* | 4.50
Linear | CBL (β=0.99) | 0.635 (2) | 0.611 (1) | 0.638 (3) | 0.215 (4) | 524 | 2.50
Linear | NM-2 | 0.347 (13) | 0.000 (13) | 0.334 (13) | 0.079 (13) | 235 | 13.00
Linear | ROS | 0.626 (6) | 0.602 (6) | 0.631 (6) | 0.213 (5) | 823 | 5.75
Linear | ROS + CBU | 0.581 (8) | 0.560 (8) | 0.575 (8) | 0.231 (2) | 493 | 6.50
Linear | ROS + NM-2 | 0.573 (9) | 0.550 (9) | 0.561 (9) | 0.222 (3) | 545 | 7.50
Linear | ROS + RUS | 0.595 (7) | 0.574 (7) | 0.592 (7) | 0.238 (1) | 425 | 6.50
Linear | RUS | 0.472 (11) | 0.442 (11) | 0.461 (11) | 0.121 (10) | 240 | 10.75
Linear | SMOTE | 0.637 (1) | 0.610 (2) | 0.642 (1) | 0.184 (8) | 909 | 3.00
Linear | P-SMOTE | 0.523 (10) | 0.481 (10) | 0.526 (10) | 0.116 (12) | 1306 | 10.50
Logarithmic | No technique | 0.554 (6) | 0.517 (5) | 0.557 (6) | 0.142 (5) | 325 | 5.50
Logarithmic | ADASYN | 0.569 (1) | 0.531 (1) | 0.574 (1) | 0.087 (12) | 946 | 3.75
Logarithmic | CBU | 0.426 (12) | 0.388 (12) | 0.408 (12) | 0.092 (11) | 221 | 11.75
Logarithmic | GAN | 0.560 (3) | 0.525 (3) | 0.561 (5) | 0.163 (2) | 1229 (+4298)* | 3.25
Logarithmic | CBL (β=0.9) | 0.565 (2) | 0.530 (2) | 0.566 (2) | 0.142 (5) | 457 | 2.75
Logarithmic | NM-2 | 0.409 (13) | 0.371 (13) | 0.390 (13) | 0.111 (10) | 233 | 12.25
Logarithmic | ROS | 0.557 (5) | 0.517 (5) | 0.563 (3) | 0.128 (8) | 1064 | 5.25
Logarithmic | ROS + CBU | 0.548 (7) | 0.511 (7) | 0.547 (7) | 0.137 (7) | 551 | 7.00
Logarithmic | ROS + NM-2 | 0.546 (8) | 0.510 (8) | 0.544 (8) | 0.144 (4) | 597 | 7.00
Logarithmic | ROS + RUS | 0.541 (9) | 0.503 (9) | 0.541 (9) | 0.123 (9) | 565 | 9.00
Logarithmic | RUS | 0.469 (11) | 0.439 (11) | 0.459 (11) | 0.165 (1) | 228 | 8.50
Logarithmic | SMOTE | 0.560 (3) | 0.524 (4) | 0.563 (3) | 0.152 (3) | 890 | 3.25
Logarithmic | P-SMOTE | 0.530 (10) | 0.484 (10) | 0.535 (1) | 0.067 (13) | 1236 | 8.50
Spike | No technique | 0.500 (5) | 0.453 (5) | 0.499 (6) | 0.125 (1) | 255 | 4.25
Spike | ADASYN | 0.528 (2) | 0.473 (2) | 0.532 (2) | 0.075 (9) | 999 | 3.75
Spike | CBU | 0.473 (11) | 0.428 (8) | 0.453 (11) | 0.083 (6) | 251 | 9.00
Spike | GAN | 0.505 (4) | 0.455 (3) | 0.507 (4) | 0.075 (9) | 1119 (+4298)* | 5.00
Spike | CBL (β=0.9) | 0.500 (5) | 0.455 (3) | 0.502 (5) | 0.092 (5) | 255 | 4.50
Spike | NM-2 | 0.456 (12) | 0.139 (13) | 0.432 (13) | 0.075 (9) | 224 | 11.75
Spike | ROS | 0.507 (3) | 0.305 (10) | 0.510 (3) | 0.083 (6) | 1013 | 5.50
Spike | ROS + CBU | 0.484 (8) | 0.291 (11) | 0.482 (8) | 0.100 (2) | 445 | 7.25
Spike | ROS + NM-2 | 0.499 (7) | 0.451 (6) | 0.497 (7) | 0.075 (9) | 619 | 7.25
Spike | ROS + RUS | 0.476 (9) | 0.431 (7) | 0.475 (9) | 0.100 (2) | 612 | 6.75
Spike | RUS | 0.474 (10) | 0.428 (8) | 0.460 (10) | 0.083 (6) | 273 | 8.50
Spike | SMOTE | 0.534 (1) | 0.486 (1) | 0.537 (1) | 0.100 (2) | 1067 | 1.25
Spike | P-SMOTE | 0.450 (13) | 0.259 (12) | 0.451 (12) | 0.058 (13) | 1331 | 12.50
ImageNet | No technique | 0.521 (2) | 0.000** | 0.523 (3) | 0.000 | 481 | 2.50
ImageNet | ADASYN | 0.502 (5) | 0.000 | 0.517 (5) | 0.000 | 705 | 5.00
ImageNet | CBU | 0.184 (12) | 0.000 | 0.124 (12) | 0.000 | 183 | 12.00
ImageNet | GAN | 0.516 (3) | 0.000 | 0.519 (4) | 0.000 | 927 (+4298)* | 3.50
ImageNet | CBL (β=0.999) | 0.528 (1) | 0.000 | 0.533 (1) | 0.000 | 396 | 1.00
ImageNet | NM-2 | 0.149 (13) | 0.000 | 0.106 (13) | 0.000 | 206 | 13.00
ImageNet | ROS | 0.489 (6) | 0.000 | 0.496 (6) | 0.000 | 1055 | 6.00
ImageNet | ROS + CBU | 0.478 (8) | 0.000 | 0.480 (8) | 0.000 | 479 | 8.00
ImageNet | ROS + NM-2 | 0.465 (10) | 0.000 | 0.462 (10) | 0.000 | 545 | 10.00
ImageNet | ROS + RUS | 0.466 (9) | 0.000 | 0.475 (9) | 0.000 | 475 | 9.00
ImageNet | RUS | 0.228 (11) | 0.000 | 0.192 (11) | 0.000 | 203 | 11.00
ImageNet | SMOTE | 0.514 (4) | 0.000 | 0.527 (2) | 0.000 | 1022 | 3.00
ImageNet | P-SMOTE | 0.483 (7) | 0.000 | 0.491 (7) | 0.000 | 1218 | 7.00

Table 10: Shows the performance metrics for the techniques on CIFAR-100’s imbalanced datasets. The values in brackets represent the ranking the technique receives for that particular dataset according to the metric of that column. These ranks are averaged in the last column.

* The additional FLOP values for the GAN technique represent the cost to train the GAN itself. ** These 0.000 values have not contributed to the ranking.
