A Framework for Systematic Comparison of Convolutional Neural Network Architectures


Master Software Engineering

A Framework for Systematic Comparison of Convolutional Neural Network Architectures

Rico Lamein – 10589848

9th August 2017

Supervisors: Maarten Roosendaal

Software Engineering, University of Amsterdam


Abstract

Image recognition is on the rise, with applications in a wide range of areas such as the mechanical field (e.g. self-driving cars) and the medical field (e.g. skin cancer detection). Recently, Convolutional Neural Networks (CNNs) have proven to be most successful in performing image recognition. With this success, many CNN architectures have been developed. However, each of these architectures performs differently on a given image recognition problem. Accordingly, we cannot, for instance, say that a certain architecture is the fastest at performing all image recognition tasks. Moreover, implementing these architectures requires some experience in Artificial Intelligence, as well as a well-annotated, very large dataset. Therefore, the research conducted during this master's thesis set out to investigate to what extent the user can be aided in deciding which CNN architecture fits her image recognition problem best. A framework has been constructed that aids in collecting a dataset with the help of a Google Images scraper and, subsequently, trains and evaluates five CNN architectures. During the evaluation, the architectures are ranked according to their speed, accuracy, and number of false positives and negatives. The results are displayed in table format so the user can easily decide which one of the five CNN architectures best fits her image recognition problem. Furthermore, the confusion matrix as well as the precision and recall metrics are calculated per CNN architecture and shown to the user. As a proof of concept, the framework was utilized to decide which CNN architecture was best for our product recognition system. As will become clear, the results not only make it easy for the user to make her decision, they also help in performing trade-offs.


Contents

Abstract
1 Introduction
  1.1 Market research
    1.1.1 Google's reverse image search engine
    1.1.2 Diagnosing skin cancer
    1.1.3 Wildlife preservation
  1.2 Bol.com
  1.3 Research questions
  1.4 Thesis outline
2 Research method
  2.1 Dataset collection
  2.2 Convolutional neural network comparison
3 Theoretical Background
  3.1 Artificial Neural Networks
    3.1.1 Single layer ANN (Perceptron)
    3.1.2 Multi-layer ANN
    3.1.3 Activation function
    3.1.4 Backpropagation
  3.2 Convolutional Neural Networks
    3.2.1 Convolution
    3.2.2 Convolutional Layer
    3.2.3 Pooling Layer
    3.2.4 Fully Connected Layer
    3.2.5 Linear Support Vector Machines
    3.2.6 Transfer Learning
  3.3 Convolutional Neural Network Architectures
    3.3.1 VGG16 and VGG19
    3.3.2 Resnet50
    3.3.3 InceptionV3
    3.3.4 Xception
  3.4 Classifier Performance
4 Architecture of a CNN Comparison Framework
  4.1 System overview
  4.2 Collecting the dataset
  4.3 Training and evaluating classifiers
5 Experiments
  5.1 MNIST and CIFAR-10
    5.1.1 Setup
    5.1.2 Results
  5.2 Product recognition system
    5.2.1 Setup
    5.2.2 Results
6 Discussion
7 Related Work
8 Conclusion
  8.1 Future work
Bibliography


CHAPTER 1

Introduction

Image recognition is a long-standing research topic in the field of computer vision and machine learning that deals with identifying objects in an image. Recently, Convolutional Neural Networks (CNNs) have proven to be one of the most successful approaches in tackling this problem [16]. Research regarding this topic is fundamental because the applications are endless and of great importance. For instance, a self-driving car might employ an image recognition system to recognize traffic signs.

Recognizing objects in complex scenes, and thus creating an image recognition system, requires a flexible representation of the visual world. This was initially performed with a hand-crafted model, called the Bag of visual Words (BoW) model, as implemented by Deselaers et al. [6]. First, local descriptors were extracted from the dataset using feature extractors such as SIFT [20]. Subsequently, a clustering algorithm was applied to all the vectors to obtain a codebook. Finally, a classifier was trained on top of the vectorial representation obtained using the BoW. However, in the last few years, several papers have shown that convolutional neural networks (CNNs) are most successful in performing image recognition tasks, whether the objects in the images are handwritten characters [19], house numbers [33] or objects from the 1000-category ImageNet dataset [16]. Especially in the latter, CNNs showed a record-beating performance during the ImageNet classification contest [30] by achieving an error rate of 16.4%, compared to the second place result of 26.1%.

1.1 Market research

In this section, we discuss some real-world applications of image recognition. We discuss the purpose of each application and which attributes are most important in its domain.

1.1.1 Google's reverse image search engine

Google's reverse image search engine lets its users search by images instead of the usual keywords. After uploading an image, the engine scrapes the web for the provided image. Not only can the engine tell which websites are hotlinking the exact searched image, it also collects images that are visually similar. This is useful for many applications. For instance, Tinder and Facebook users can use the engine to research profile pictures of their potential dates, while travelers can use it to find the locations where photos were taken.

That the engine provides visually similar images indicates it is using some direct image matching algorithm. However, the engine also provides a description of the searched image, which implies it uses CNNs to recognize the object (or context) in the image.

As the engine deals with user interaction, speed is important: most users will not use the service if they have to wait minutes before receiving the result. Moreover, accuracy is an important attribute. Preferably, the engine correctly annotates the image, but it is not disastrous when it fails once in a while, since the user can correct the annotation himself when it is incorrect.

1 https://en.wikipedia.org/wiki/Convolutional_neural_network
2 https://en.wikipedia.org/wiki/Bag-of-words_model_in_computer_vision
3 https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
4 www.image-net.org/
5 https://images.google.com/
6 https://www.gotinder.com/

1.1.2 Diagnosing skin cancer

When skin cancer is suspected, a dermatologist usually looks at the suspicious lesion with the naked eye and with the aid of a dermatoscope. If these methods are inconclusive or lead the dermatologist to believe the lesion is cancerous, a biopsy is the next step. Research by Esteva et al. [7] showed that, by training a CNN with skin images, the first step could be replaced by a deep learning algorithm which decides whether a biopsy is needed. The researchers eventually want to deploy their application on smartphones.

They do, however, need to be cautious when deploying their application. People may assume the result of the algorithm is 100% correct without second thoughts. However, as we know, prediction algorithms sometimes make mistakes.

Therefore, reducing the number of false negatives (i.e. the number of images classified as "no cancer" when the subject actually does have cancer) is essential for an application like this. A false positive (an image classified as "cancer" when the subject does not have cancer), on the other hand, is less disastrous, since the dermatologist can still correct the prediction. Furthermore, speed is also of less interest, as long as the result does not take days to compute.

1.1.3 Wildlife preservation

There are many applications in wildlife preservation where image recognition can be of help. In this section, we discuss one of the most important ones: the protection of honeybees. Honeybees are indispensable in our everyday life. For one, they play a crucial role in our food production: according to Greenpeace, "seventy out of the top 100 human food crops - which supply about 90 percent of the world's nutrition - are pollinated by bees" [31]. Furthermore, bees and bee products have been used in many areas of health: from bee sting apitherapy for arthritis to antibiotic treatment with honey for burn victims. From this, we can conclude that the extinction of bees would have an immense negative impact on our everyday life.

Yet, according to research performed by Inverse [11], one-third of the honeybee population in the United States died last year. This is mainly caused by a small, red parasite named the Varroa destructor. These parasites attach themselves to honeybees, spread among the bees, and destroy the entire beehive. Therefore, Swedish beekeeper and inventor Björn Lagerman turned to CNNs for preserving bee colonies. With the help of CNNs, the parasites can be spotted just by taking a picture of the bees. Once the parasites are spotted, appropriate action, such as applying miticides, should be taken to save the bees.

Thinking about important attributes the network should comply with, accuracy is at the top of the list: as many bees as possible should be saved. Furthermore, we want to stop the parasites from spreading throughout the entire hive, which makes speed an important attribute as well. However, since the parasites do not spread in a matter of minutes, speed is a less important driver than accuracy. Finally, when thinking about false negatives and positives (i.e. classifying a bee as non-infected when it is, and the other way around), neither is disastrous. When a false negative occurs, the parasite might spread to a few other bees. However, as long as the false negative rate is not too high, we will find out that some bees in the hive are affected long before the whole hive is, and thus appropriate action can still be taken in time. On the other hand, when a false positive occurs, the whole hive is cleaned while no bees are affected. This is a tedious task, but taking the precaution is better than losing the whole hive.

7 https://www.facebook.com/
8 http://www.greenpeace.org/international/en/
9 https://www.inverse.com/

1.2 Bol.com

Thinking about Google's reverse image search engine, as described in section 1.1.1, we see that image recognition provides some form of searching by image. Searching by image would be ideal for a webshop like bol.com, as it would allow its customers to shop by image: by taking a picture of a product the user is interested in, she gets offered that exact product or its closest match offered by bol.com. This would drastically ease the customer journey, especially for products with specific textures, such as furniture and home accessories. Sometimes these products no longer have brand names or barcodes, making them virtually impossible to look up. Therefore, bol.com asked us to investigate the capabilities of image recognition. However, by looking at how Google's reverse image search engine works, we see that it does not only employ image recognition but also algorithms such as image matching. Investigating both is out of scope for a master's thesis research like this. Therefore, we decided to focus on pure image recognition alone, as described below.

As we can see from the examples in section 1.1, there are a lot of different CNNs. We can also see this by looking at the ImageNet classification contest, where each participant built their own CNN to classify objects in images. Each CNN, or more specifically its architecture, is invented with different attributes in mind. Now, if one wants to solve her own image recognition problem with existing architectures, she needs a guideline of which architecture performs best at which attributes for her specific problem.

1.3 Research questions

As CNNs perform differently on separate image recognition problems (i.e. we cannot say "this architecture is the most accurate at performing every image recognition problem"), this research sets out to construct a framework that hints at which CNN architecture best fits the user's image recognition problem. This framework should be kept as generic as possible, making it applicable to many image recognition problems. For example, one user may want to check which architecture is best for building a bird classifier, while another wants to know which architecture best suits building a facial recognition system.

In order to construct such a framework, and thus help the user in making her decision, we have to answer the following research questions:

• How can we facilitate the process of systematically collecting a dataset?

• What design principles should a generic framework comply with? How can these be used to construct a framework for Convolutional Neural Network comparison?

• What are the methods and metrics used for CNN comparison? How can we use these to decide which CNN is best for a specific image recognition problem?

11 https://www.bol.com/nl/index.html

1.4 Thesis outline

In chapter 2, we discuss our chosen research methodology. More specifically, we discuss the methods chosen to answer the research questions of section 1.3. Next, in chapter 3, the theoretical framework of this research is outlined. The different aspects that play a role in using neural networks to build classifiers and in comparing them to each other are discussed. Subsequently, in chapter 4, we provide an overview of the system and discuss its implementation. Hereafter, in chapter 5, the experiments are described and the results are shown. These results are discussed in chapter 6. Before drawing a conclusion in chapter 8, we first present some related work regarding our research in chapter 7.


CHAPTER 2

Research method

In this chapter, we will discuss our chosen research method. Specifically, we explain the methods we used to answer the research questions described in section 1.3.

2.1 Dataset collection

The idea of automatically collecting a large dataset is adopted from research performed by Xiao et al. [44]. As they point out, obtaining a massive amount of well-labeled data is usually very expensive and time-consuming. This prevents people from training deep learning models, such as CNNs, on new image recognition problems simply because they lack the required amount of training data. It is, therefore, necessary to develop new, efficient labeling frameworks for deep learning. Just scraping the internet for web images and their labels, however, is extremely unreliable due to various types of noise, such as labeling mistakes by search engines. This could adversely impact the classification accuracy of the induced classifiers [24]. Xiao et al. tried to solve this problem by collecting images from the internet and automatically labeling them according to the keywords in their surrounding text. They then try to automatically correct the wrong labels before feeding the data to CNNs.

However, as this framework is intended to be used for specific image recognition tasks, the user already knows which classes she wants to include. Moreover, she probably already knows what kind of images she wants to collect. For instance, when the goal is to create a classifier that can differentiate between n bird species in nature images, the user wants to collect images containing noisy backgrounds for each of the n bird species. Therefore, we use manually selected labels in combination with search queries to collect the training images. Thus, in the bird species example, a label would be "sparrow" and the corresponding search query would be "sparrow in the wild". This, however, still results in some incorrect images. Adding an extra filtering step with minimal user interaction fixes this problem.

2.2 Convolutional neural network comparison

Comparing classifiers to each other has been performed many times in the field of machine learning [37, 32, 5]. Since classifiers and CNNs essentially perform the same task, the methods used for classifier comparison can also be applied to CNN comparison.

Most literature focuses on one attribute of the classifiers: their accuracy. This makes it easy to draw conclusions such as "classifier X is the best". However, as we have seen in section 1.1, other attributes are important drivers in classifier choice as well. Sokolova and Lapalme [36] propose twenty-four performance metrics used in the complete spectrum of machine learning classification tasks (i.e. binary, multi-class, multi-labelled, and hierarchical). However, since object recognition is a multi-class task, meaning the input is to be classified into one of N non-overlapping classes, we are only interested in the multi-class performance measures.

Once the metrics are computed, we need a way of displaying them to the user so she can quickly decide which classifier suits her problem best. As we can see in the literature focusing on the accuracy attribute alone, graphs are a good way of providing this information. However, if we compute a graph per measure, the user quickly loses the overview when multiple attributes play a role in her decision. For instance, thinking back to the example in section 1.1.3, we see that accuracy is the most important driver. However, if we are dealing with a lazy beekeeper, he also wants to minimize the number of false positives, as these result in having to clean the whole hive. Consequently, when a classifier performs best on the accuracy metric but worst on the false positive metric, the beekeeper might choose a classifier that performs average on both. Comparing these metrics by means of graphs is impractical. We therefore chose a ranking: for each of the metrics, the classifiers are ranked according to their performance. This not only offers the ability to see which classifier performs best at which metrics, but also provides a quick overview of how a classifier scores on the other metrics, making it easy to perform trade-offs.


CHAPTER 3

Theoretical Background

In this chapter, we will discuss both the theoretical background and relevant literature behind each aspect that played a role in our research. In order to get a clear picture, this chapter is divided into three parts. The first part describes how (convolutional) neural networks work and how they assist with classification tasks. We then proceed with the theory behind the most widely used convolutional neural network architectures for image recognition problems. Finally, we discuss the theory on classifier comparison and the metrics that are involved in this task.

CNNs are very much based on the classical Artificial Neural Network (ANN). We therefore first provide the theoretical basis of ANNs before diving into the concept of CNNs.

3.1 Artificial Neural Networks

An Artificial Neural Network (ANN) is a relatively new concept in the field of Artificial Intelligence (AI) and machine learning. As with many AI developments, ANNs are fundamentally based on how the human brain performs computations and makes decisions. ANNs particularly focus on how neurons in the human brain interact.

The first steps in the development of the ANN were made by Warren McCulloch and Walter Pitts in 1943 [21]. They showed that, because of the "all-or-none" character of nervous activity, neural events and the relations among them can be treated by means of propositional logic. Using this hypothesis, they developed a logical model of how neurons in the brain work. This model paved the way for neural network research to split into two approaches: a biological one and an AI one. The biological approach focused on simulating the brain as accurately as possible, while the AI approach focused more on the application side of the model.

However, McCulloch and Pitts' model lacked a mechanism for learning, which is crucial in AI applications. This changed in 1957, when Frank Rosenblatt invented the idea of a perceptron [28], a simplified model of a neuron in the brain. He was inspired by the foundational work of Donald Hebb, who introduced the Hebbian Theory in 1949 [10]. Basically, Hebb pointed out that neural pathways are strengthened each time they are used: if two nerves fire at the same time, he argued, the connection between them is enhanced. This concept is fundamentally essential to the ways in which humans learn. By implementing his idea on custom hardware, Rosenblatt created a system capable of learning to classify simple shapes correctly with a 20x20 array of photocell inputs [27]. This is considered the birth of AI.

1 https://en.wikipedia.org/wiki/Artificial_neural_network
2 https://en.wikipedia.org/wiki/Artificial_intelligence
3 https://en.wikipedia.org/wiki/Perceptron


Although the perceptron initially seemed promising, it entailed one major drawback: perceptrons were only able to output a one or a zero, and thus could not be used to recognize many classes of patterns. This, in combination with a book published by Marvin Minsky and Seymour Papert [22], caused the field of neural network research to stagnate for many years. They showed that perceptrons were only capable of learning linearly separable patterns, as they were incapable of processing the exclusive-OR (XOR) logic gate, and thus also incapable of learning complex functions. Their solution was to stack multiple perceptrons to create a multi-layer perceptron (MLP). However, because computers lacked sufficient processing power to effectively handle the work required by large neural networks, research stagnated again.

This did not change until 1986, when Rumelhart et al. [29] independently rediscovered the backpropagation algorithm [35]. Backpropagation allowed perceptrons to be trained in a multi-layer configuration. In essence, this allowed the network to improve itself by learning from its output error. However, the algorithm was computationally very expensive, too expensive for computers at that time. Therefore, computationally less expensive algorithms (e.g. SVMs) gradually overtook neural networks in machine learning popularity.

Finally, in the late 2000s, ANNs regained interest with the rise of fast GPU implementations. They have been successful in a wide range of fields ever since: from image processing (e.g. recognizing handwritten text [25]) to economics (e.g. stock market prediction [17]).

3.1.1 Single layer ANN (Perceptron)

As discussed in the previous section, perceptrons are the most basic form of an ANN. Although they are rarely implemented nowadays, understanding perceptrons is crucial to understanding more complex networks. Since the perceptron is based on the workings of a neuron in the human brain, it shows many similarities.

A typical neuron (Figure 3.1) is divided into three parts: the cell body (the area surrounding the nucleus), the dendrites, and the axon. The axon connects to the dendrites of multiple other neurons. It uses this connection to send electrical and chemical signals, each with its own amplitude. This way, each neuron receives multiple signals with different amplitudes. The neuron sums up its incoming signals and, if the sum exceeds a certain threshold, transmits a signal through its axon.

Figure 3.1: A representation of a neuron in the human brain.

A perceptron (Figure 3.2) works roughly the same: a weighted sum of all inputs is calculated. Subsequently, this sum is fed to an activation function which decides whether the perceptron outputs a signal. The activation function can thus be regarded as a threshold function that ultimately defines the output.

5 https://en.wikipedia.org/wiki/Multilayer_perceptron
6 https://en.wikipedia.org/wiki/Neuron

Figure 3.2: Model of a perceptron.

As can be seen in the figure, the perceptron is a linear classifier: it can be translated into the linear formula

Output = F(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n)   (3.1)

where the x_i are the inputs, the w_i their weights, and F is the activation function. This exposes the major drawback of perceptrons: they only support input that is linearly separable.
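To make equation 3.1 concrete, the following is a minimal NumPy sketch of a perceptron forward pass with a step activation; the weights implementing a logical AND are illustrative values, not taken from this thesis.

import numpy as np

def step(x):
    # Step activation: output 1 if the weighted sum is positive, else 0.
    return 1.0 if x > 0 else 0.0

def perceptron(x, w, b):
    # Weighted sum of the inputs plus a bias (w0 in equation 3.1),
    # fed through the activation function F.
    return step(np.dot(w, x) + b)

# Illustrative weights implementing a logical AND of two binary inputs.
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x, dtype=float), w, b))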

3.1.2 Multi-layer ANN

The Multi-layer ANN, also known as the Multi-layer Perceptron (MLP) or feed-forward neural network, is offered as a solution to the non-linearly separable input problem. Basically, the network now consists of multiple perceptrons combined, as in Figure 3.3 below.

Figure 3.3: An example of a fully-connected MLP.

This is called a fully-connected network because each node in any layer is connected to all the nodes in the subsequent layer. The foremost layer, also known as the input layer, consists of nodes receiving a single value. They multiply the value by their weight and transfer it to every node in the next layer. This brings us to the hidden layers of the network. An MLP can consist of any number of hidden layers, including zero. However, when the network has zero hidden layers, it is a simple perceptron. Each node in a hidden layer receives a weighted sum of all outputs of the previous layer (whether this is an input or a hidden layer) and feeds the sum to an activation function. The result is then forwarded to every node in the next layer. When the data finally reaches the output layer, one or more values are calculated that say something about the input. For instance, when trying to classify objects in images, the number of nodes in the output layer corresponds to the number of objects the network ought to be able to recognize, and each node then returns the probability of the input containing that object.


When building an ANN, we try to approximate a complex function f*(x) by iteratively applying a function f to the training examples x. During each iteration, an error is calculated and the weights of the network are adjusted in the hope of lowering this error in the next iteration. The weights are thus not fixed but learned by the network itself. This, however, does require an error function. One commonly used error function is the Mean Squared Error (MSE):

MSE = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - f^*(x_i))^2   (3.2)

where f(x_i) and f^*(x_i) represent the predicted and actual values corresponding with the n training samples, respectively. The weights are subsequently updated using backpropagation, as will be explained in section 3.1.4.

3.1.3 Activation function

The activation function decides the output of the perceptron, and can thus be regarded as a function that ultimately defines the output. This function can take many forms, the simplest one being the step function (Figure 3.4a). The step function outputs a 1 if the sum exceeds a certain threshold and a 0 otherwise. The more widely used activation functions are logistic functions, such as the hyperbolic tangent (Figure 3.4b), or sigmoids (Figure 3.4c):

f(x) = \frac{e^{2x} - 1}{e^{2x} + 1}   (3.3)

f(x) = \frac{1}{1 + e^{-x}}   (3.4)

Figure 3.4: Three examples of activation functions. a) Step function. b) Hyperbolic tangent function, y = (e^{2x} - 1)/(e^{2x} + 1). c) Sigmoid function, y = 1/(1 + e^{-x}).

However, recent research by Geoffrey Hinton and Vinod Nair [23] pointed out that the Rectified Linear Unit (ReLU) function (Equation 3.5) is a much better choice because it allows the network to train a lot faster with roughly no difference in accuracy. Furthermore, the function also helps to alleviate the vanishing gradient problem. This is a problem where the lower layers of the network train very slowly because the gradient decreases exponentially through the layers. The ReLU function is defined as:

f(x) = \max(0, x)   (3.5)

Basically, this means it changes all negative values to 0. Krizhevsky et al. [16] have shown that using the ReLU function leads to a great improvement in convergence compared to the tanh function, as can be seen in Figure 3.5.

7 https://en.wikipedia.org/wiki/Mean_squared_error
8 https://en.wikipedia.org/wiki/Step_function
9 https://en.wikipedia.org/wiki/Hyperbolic_function#Hyperbolic_tangent
10 https://en.wikipedia.org/wiki/Sigmoid_function


Figure 3.5: A plot from Krizhevsky et al. indicating the sixfold improvement in convergence with the ReLU function compared to the tanh function.
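For reference, the activation functions above translate directly to NumPy; this is a plain transcription of equations 3.3 to 3.5, not code from the framework.

import numpy as np

def step(x):
    return np.where(x > 0, 1.0, 0.0)

def tanh(x):
    # Equation 3.3; equivalent to np.tanh(x).
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)

def sigmoid(x):
    # Equation 3.4.
    return 1 / (1 + np.exp(-x))

def relu(x):
    # Equation 3.5: all negative values become 0.
    return np.maximum(0, x)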

3.1.4 Backpropagation

When building an ANN, we try to optimally approximate a complex function f*(x). This is done with the backpropagation algorithm, as introduced by Rumelhart et al. [29]. The goal of this algorithm is to fine-tune the randomly initialized weights of the network so that the error function reaches its minimum. Essentially, the algorithm computes how much each individual weight in the network contributes to the error function. Furthermore, it calculates in which direction (i.e. positive or negative) the weight needs to be changed in order to minimize the error function. In order to do so, it uses gradient descent. The gradient of a function f(x) defines how much the value of f(x) will change with a unit increase/decrease in the value of x. This basically means we are differentiating f with respect to x.
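As an illustration of the gradient descent step that backpropagation builds on, the sketch below minimizes the MSE of a single-weight linear model; the data and learning rate are made up for the example.

import numpy as np

# Toy data: we try to recover f*(x) = 3x from noisy samples.
x = np.linspace(-1, 1, 50)
y = 3 * x + 0.1 * np.random.randn(50)

w = 0.0   # arbitrarily initialized weight
lr = 0.1  # learning rate: the step size along the negative gradient

for _ in range(100):
    # Gradient of MSE = mean((w*x - y)^2) with respect to w.
    grad = np.mean(2 * (w * x - y) * x)
    # Move the weight in the direction that decreases the error.
    w -= lr * grad

print(w)  # close to 3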

3.2 Convolutional Neural Networks

A Convolutional Neural Network (CNN) shows many similarities with an Artificial Neural Network (ANN). First, they are both made up of nodes representing neurons that have learnable weights. Furthermore, each node receives some input, performs an operation and optionally follows it with a non-linearity. Finally, the whole network still expresses a single differentiable score function (e.g. from the raw image pixels on one end to object class scores at the other). They differ in the sense that CNNs are specifically designed to exploit local connectivity, which is found in high-dimensional data such as images and audio. As opposed to ANNs, CNNs have the following distinguishing features:

1. 3D volumes of nodes. The layers of a CNN have nodes arranged in three dimensions: width, height, and depth. The nodes inside a layer are connected to only a small number of nodes in the previous layer, called a receptive field. This means the CNN does not have to be fully connected. A CNN architecture typically contains a combination of both locally and completely connected layers.

2. Weight sharing. Instead of having a unique weight for each interconnected pair of nodes, CNNs share weights and form a feature map. This means that all nodes in a convolutional layer respond to the same feature. This allows for features to be detected regardless of their position in the field, thus constituting translation invariance.

3. Sub-sampling or pooling. The goal of this operation is to reduce the dimensions of the convolutional responses in order to make the network more spatially invariant.


3.2.1 Convolution

In mathematics, convolution is a mathematical operation on two functions, f and g, which produces a third function. The convolution operator is defined as the integral of the product of the two functions after one is reversed and shifted. For continuous functions:

(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t - \tau) \, d\tau   (3.6)
           = \int_{-\infty}^{\infty} f(t - \tau) g(\tau) \, d\tau   (3.7)

However, since we are mostly dealing with discrete input represented as multidimensional matrices, the discrete convolution operation is of more interest:

(f * g)(t) = \sum_{i=-\infty}^{\infty} f(i) g(t - i)   (3.8)

This works for one-dimensional data. However, images contain three-dimensional data. Since we only want to convolve over the width and height, this reduces to two dimensions, leading to the formula for two-dimensional convolution:

(f * g)(w, h) = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} f(i, j) g(w - i, h - j)   (3.9)

Intuitively, performing a convolution with an image and a kernel feels like sliding the flipped kernel over the image. The kernel has to be flipped first in order to preserve the convolution’s commutative property. This is also demonstrated in Figure 3.6. We can also see here that if we want to compute the new value of the top-left pixel, we encounter a problem. This can be solved by saying that all positions outside of the image get value 0, or by wrapping the image, meaning we take the pixel value of the other end of the image.

Figure 3.6: Example of a two-dimensional convolution.
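A direct, unoptimized NumPy transcription of equation 3.9 with zero padding might look as follows; it is a sketch for clarity, not the implementation used in this framework.

import numpy as np

def convolve2d(image, kernel):
    # Flip the kernel in both dimensions to preserve the commutative
    # property of convolution (see Figure 3.6).
    kernel = kernel[::-1, ::-1]
    kh, kw = kernel.shape
    # Zero-pad the image so positions outside it contribute 0.
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros(image.shape)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Example: a 3x3 averaging kernel applied to a random 5x5 "image".
print(convolve2d(np.random.rand(5, 5), np.ones((3, 3)) / 9))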


3.2.2 Convolutional Layer

The convolutional layer is the core building block of a CNN that performs most of the computational heavy lifting. The layer consists of a set of learnable filters, each spatially small but extending through the full depth of the input volume. For instance, when dealing with an image, we might choose a convolutional layer with size 5x5x3. For the convolution we disregard the third dimension, resulting in a 5x5 kernel. Then, during the forward pass, we convolve the image with this kernel and produce an output feature map. Furthermore, doing this for multiple kernels gives us multiple output feature maps. Finally, we stack these output maps along the depth dimension and produce the output volume.

As a result of performing convolutions, CNNs are not fully connected. This results in fewer computations and less redundancy between parameters. We should, however, note that the extent of the connectivity along the depth axis is always equal to the depth of the input volume. This means we treat the spatial dimensions differently from the depth dimension: the connections are local in space (along width and height), but always full in depth.

Another result of performing convolutions is weight sharing. This only relies on one assumption: if a patch feature is useful to compute at some spatial position, then it should also be useful to compute at other positions. Since we take the same convolution across the whole input image, this is exactly what we assume. Weight sharing drastically reduces the number of weights.

3.2.3 Pooling Layer

Another important concept of CNNs is the pooling layer. The pooling layer is usually inserted between successive convolutional layers. In this layer, the input gets non-linearly downsampled. In other words, it reduces the spatial size of the input to reduce the number of parameters, and hence to control overfitting. Overfitting occurs when the network performs well on the training set, but not on new instances.

To perform the downsampling, the pooling layer most commonly adopts the max pooling algorithm. Again, a window of arbitrary size is slid over the input. The maximum of each window is computed and placed in the output feature map. This process is demonstrated in Figure 3.7.

Figure 3.7: Max pooling with a 2x2 filter.
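A minimal sketch of max pooling in NumPy, matching Figure 3.7; it assumes the input dimensions are divisible by the window size.

import numpy as np

def max_pool(x, size=2):
    h, w = x.shape
    # Group the input into non-overlapping size x size windows (stride
    # equal to the window size) and take the maximum of each window.
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]])
print(max_pool(x))  # [[6 5]
                    #  [7 9]]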

3.2.4 Fully Connected Layer

The output from the convolutional and pooling layers represents high-level features of the input of the network. These features are fed to the final layers of the network, also known as the fully connected layers. Their purpose is to use the features to classify the input into one of the various classes based on the training set. This means that the final layer of the CNN contains a single node for each target class in the model, as in Figure 3.8. Note that the figure does not show connections between the nodes in the fully connected layers.


Figure 3.8: Example of the fully connected layers of a CNN. The final layer contains as many nodes as there are classes.

The final layer of the network generally contains a softmax activation function to generate a value between zero and one for each node. It does so by squashing a K-dimensional vector z of arbitrary real values to a K-dimensional vector σ(z) of real values in the range [0, 1] that add up to 1, using the formula

\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \dots, K.   (3.10)

We can, however, also completely remove the output layer of the network and use the network as a fixed feature extractor. These features are subsequently fed to a classifier, which performs the classification task. As shown in an article by Yichuan Tang, “simply replacing softmax with linear SVMs gives significant gains on popular deep learning datasets” [39].
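Equation 3.10 also translates directly to NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the formula itself.

import numpy as np

def softmax(z):
    # Shift by the maximum for numerical stability; softmax is
    # invariant to adding a constant to all entries of z.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))        # each value in [0, 1]
print(softmax(z).sum())  # 1.0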

3.2.5 Linear Support Vector Machines

Support Vector Machine (SVM) classifiers are inherently binary classifiers based on statistical learning theory [41]. However, they can be extended to become multi-class classifiers with the one-against-one approach [13]. This entails that N(N-1)/2 binary SVMs are constructed, with N the number of classes. Each SVM trains on data from two classes and performs a classification. The final result of the classification becomes the class that is selected by the most classifiers. In the rest of this section, we will explain the theoretical background behind a binary SVM.

SVMs are based on the idea of finding a hyperplane with maximum margin that best divides a dataset into two classes (Figure 3.9a). An SVM uses n learning samples of the form (x_i, y_i) with i = 1, ..., n and y_i = ±1, where -1 and 1 represent the two classes. Subsequently, a hyperplane w \cdot x + z = 0 is formed that separates the classes:

y_i(x_i \cdot w + z) \geq 1   (3.11)

for all i, with w the normal vector of the hyperplane. The margin of the hyperplane is defined as the sum of the distances between the hyperplane and the nearest positive and negative learning samples (learning samples of the classes 1 and -1, respectively). This margin is simply 2/|w|, with |w| the Euclidean norm of w. It can thus be maximized by minimizing |w|^2 subject to equation 3.11.

14 https://en.wikipedia.org/wiki/Softmax_function
15 https://en.wikipedia.org/wiki/Support_vector_machine


Figure 3.9: a) Training samples that are separable. b) Training samples that are not separable.

It is, however, not always possible to separate the learning samples, as displayed in Figure 3.9b. If this is the case, the expression

|w|^2 + C \sum_{i=1}^{m} \xi_i   (3.12)

must be minimized subject to the restrictions

y_i(x_i \cdot w + z) \geq 1 - \xi_i, \quad \xi_i \geq 0

for all i, with C the penalty for error and \xi_i slack variables that measure how much the restrictions are exceeded.

The learning samples are transformed to a feature space by replacing their inner products x_i \cdot x_j with a kernel

K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)   (3.13)

Many kernel implementations are available, with the Radial Basis Function (RBF) kernel being the most popular. The RBF kernel is defined as:

K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)   (3.14)

where σ is a parameter. Subsequently, the learning process consists of maximizing the Lagrangian

W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)   (3.15)

subject to the restrictions

\alpha_i \geq 0, \quad \sum_{i=1}^{m} \alpha_i y_i = 0

The optimal α can be found with the help of quadratic programming. Finally, a new observation b can be classified with the formula

G(b) = \sum_{i=1}^{m} \alpha_i y_i K(b, x_i) + z   (3.16)

16 https://en.wikipedia.org/wiki/Radial_basis_function_kernel


If the result of this formula is negative, b belongs to the class represented by -1; otherwise, it belongs to the other class [3].
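In practice, this optimization is rarely solved by hand. As a sketch, scikit-learn's SVC implements the one-against-one scheme for multi-class problems and supports the RBF kernel; the toy data below is made up for the example.

import numpy as np
from sklearn.svm import SVC

# Toy three-class dataset: 2-D points around three centers.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + c for c in ([0, 0], [4, 4], [0, 4])])
y = np.repeat([0, 1, 2], 20)

# SVC constructs N(N-1)/2 binary classifiers (one-against-one); gamma
# plays the role of 1/(2*sigma^2) in the RBF kernel of equation 3.14.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.predict([[0, 0], [4, 4]]))  # [0 1]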

3.2.6 Transfer Learning

In practice, it is not feasible to train an entire CNN from scratch because it is relatively rare to have a dataset of sufficient size. In order to classify images correctly, a network requires roughly a thousand images per class. Moreover, training a network from scratch requires extensive training, costing both time and computational power. Therefore, it is more common to apply a technique called transfer learning.

With transfer learning, we first train a base network on a dataset and task and then transfer the learned features to a second network to be trained on a target dataset and task. This process will tend to work if the features are general, meaning suitable to both base and target tasks. However, as pointed out by Yosinski et al. [46], although the transferability of features decreases as the distance between the base and target task increases, transferring features even from distant tasks is better than initializing with random features.

In the case of image recognition, transfer learning boils down to retraining the last few layers of a pre-trained network or using the pre-trained network as a fixed feature extractor to train a classifier (as we do in this research). We can assume the pre-trained network is capable of extracting important features, even of objects it has never seen before. The most commonly used pre-trained networks are trained on the large-scale, well-annotated ImageNet dataset, containing 1.2 million images of over a thousand classes.
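As a sketch of the feature-extractor form of transfer learning, Keras can load an ImageNet-pre-trained network without its classification layers; the calls follow the Keras applications API, but the image path is a placeholder.

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image

# Load VGG16 with ImageNet weights, dropping the fully-connected
# classification layers so the network acts as a feature extractor.
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

img = image.load_img("example.jpg", target_size=(224, 224))  # placeholder path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

features = model.predict(x)  # one fixed-length feature vector per image
print(features.shape)        # (1, 512) for VGG16 with average pooling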

3.3 Convolutional Neural Network Architectures

Now that we know the basics of how a Convolutional Neural Network (CNN) works, let us look at some well-known CNN architectures. Specifically, we will discuss the architectures of the networks used in the rest of our research. We chose these architectures because they performed best at the ImageNet classification contest, a contest where participants build their own neural network architecture to classify images containing objects of over a thousand classes.

3.3.1 VGG16 and VGG19

The VGG network architecture (Figure 3.10) was first introduced by Simonyan and Zisserman in 2014 [34].

Figure 3.10: The VGG network architecture.


The main idea behind the VGG architectures is to keep them simple. The inventors only used 3x3 convolutional layers stacked on top of each other in increasing depth, along with 2x2 max-pooling layers. Finally, the features are fed to a softmax classifier.

The only difference between the VGG16 and VGG19 networks is the number of weight layers they use. They do, however, entail the same drawbacks. First, they are painfully slow to train due to the high depth of the networks. This can be mitigated with pre-training, a process where smaller networks are trained first and subsequently used as initialization for the larger, deeper networks. The second drawback of the VGG networks is that they consume a lot of bandwidth and disk space due to their depth and number of fully-connected nodes.

3.3.2 Resnet50

The Resnet50 architecture (Figure 3.11) was first introduced by He et al. in 2016 [9].

Figure 3.11: The Resnet50 network architecture.

Unlike traditional sequential architectures, such as the VGG network architectures, Resnet50 is a form of network-in-network architecture that relies on micro-architecture modules. It has shown that extremely deep networks (fifty layers or more) can be trained using residual blocks.

Figure 3.12: Two types of CNNs. a) Traditional CNN. b) CNN with a residual block.

With traditional CNNs, we have an underlying mapping with a nonlinear function H(x) from input to output. Now, instead of H(x), we use a nonlinear function F(x) which is defined as H(x) - x. At the output of the second weight layer (Figure 3.12b), we arithmetically add x to F(x). Subsequently, we pass F(x) + x through the Rectified Linear Unit (ReLU) function. This enables us to carry important information from the previous layer to the next layers. Furthermore, even though intuitively surprising, it speeds up the training of the network.

Even though Resnet50 is much deeper than the VGG networks, the model disk size is substantially smaller due to the usage of global average pooling rather than fully-connected layers. It also trains faster due to the residual blocks.


3.3.3 InceptionV3

The Inception micro-architecture was first introduced by Szegedy et al. in 2015 [38]. Just as Resnet50 has its residual blocks, Inception has its inception modules (Figure 3.13).

Figure 3.13: The Inception module architecture.

The bottom green box represents the input, while the top box represents the output of the module. In traditional CNNs, one has to choose whether to add a pooling or a convolutional layer. An Inception module, however, allows all of these operations to be executed in parallel. This has the disadvantage of producing too many outputs. The solution the authors provide is to add a 1x1 convolution to reduce the dimension of the data.

3.3.4 Xception

The Xception architecture (Figure 3.14) was introduced by Chollet in 2016 [4]. Xception is an extension of the Inception architecture that replaces the standard Inception modules with depthwise separable convolutions, yielding a more efficient architecture. Furthermore, the paper states that the Xception architecture outperforms InceptionV3 on a large image classification dataset. Due to the fact that both architectures have roughly the same number of parameters, the performance gains are not due to increased capacity but rather to a more efficient use of model parameters.


Figure 3.14: The Xception network architecture.

3.4 Classifier Performance

Now that we know how to build classifiers with the help of CNN architectures, we need a way of systematically comparing them so that the user can easily choose the classifier that fits her image recognition problem best. Sokolova and Lapalme [36] propose twenty-four performance measures used in the complete spectrum of machine learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical. However, since object recognition is a multi-class task, meaning the input is to be classified into one of N non-overlapping classes, we are only interested in the multi-class metrics. These are shown in Table 3.1.

In the table, tp_i, fp_i, tn_i and fn_i indicate the number of true positives, false positives, true negatives and false negatives for class i, respectively. The number of true positives for class i can be calculated by counting the samples that are correctly recognized as belonging to i. The true negatives for class i are the samples that do not belong to i and are also recognized as not belonging to i. Furthermore, samples that were assigned to class i but do not belong to i are false positives, and samples that do belong to class i but are recognized as another class form the false negatives.

With the help of these numbers and the formulas in Table 3.1, we can calculate the metrics. However, we must also know what they mean. The average accuracy of a classifier indicates how well the classifier performs (i.e. what fraction of the samples it correctly classifies). The error rate indicates exactly the opposite: the fraction of samples the classifier incorrectly classifies. Furthermore, precision indicates the fraction of correctly classified samples over the total of positively classified samples, and recall indicates the fraction of correctly classified samples over the total of correct samples. We intentionally disregard the f-score, the harmonic mean of precision and recall, since we want to use our framework to know which classifier performs best on each of these metrics separately, and their combination would not tell us much.


Table 3.1: Measures for multi-class classification.

Metric             Formula
Average Accuracy   \frac{1}{N} \sum_{i=1}^{N} \frac{tp_i + tn_i}{tp_i + tn_i + fp_i + fn_i}
Error Rate         \frac{1}{N} \sum_{i=1}^{N} \frac{fp_i + fn_i}{tp_i + tn_i + fp_i + fn_i}
Precision_\mu      \frac{\sum_{i=1}^{N} tp_i}{\sum_{i=1}^{N} (tp_i + fp_i)}
Recall_\mu         \frac{\sum_{i=1}^{N} tp_i}{\sum_{i=1}^{N} (tp_i + fn_i)}

On top of the metrics provided in Table 3.1, we also compute the time to train the classifiers and how many samples they can classify per second. Furthermore, we build a confusion matrix so the user can see on which classes the classifier performs best and worst. This provides some form of root-cause analysis; when, for instance, the classifier drastically fails to correctly classify images of watches, the training set consists of too few images of watches or the images of watches are not clear. Finally, we also provide the rank-1 and rank-5 scores, which indicate the fraction of samples that are correctly classified (accuracy) and the fraction of samples whose actual class is among the top five predictions, respectively.

With all these metrics in mind, the user can choose which classifier suits her problem best. For instance, when dealing with user interfaces, speed and the rank-1 score are important metrics. However, when the users are provided with a list of the five classes with highest probabilities, the rank-5 score is more important than the rank-1 score. On the other hand, when dealing with medical applications, the amount of false negatives (e.g. concluding a person does not have cancer when she actually does) should be minimized, leading to an increased recall.
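As an aside, scikit-learn provides these metrics out of the box; a minimal sketch with made-up labels:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Made-up true and predicted labels for a three-class problem.
y_true = ["cat", "dog", "bird", "cat", "dog", "bird"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "bird"]

print(accuracy_score(y_true, y_pred))
# Rows are true classes, columns are predicted classes; the
# off-diagonal cells give the false positives and negatives per class.
print(confusion_matrix(y_true, y_pred, labels=["bird", "cat", "dog"]))
# Per-class precision, recall and f-score.
print(classification_report(y_true, y_pred))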


CHAPTER 4

Architecture of a CNN Comparison Framework

In the previous chapter, we explained how CNN architectures can be used to build classifiers. Furthermore, we discussed some of these architectures and how they can be compared with each other. This chapter focuses on combining this into a generic CNN comparison framework. We first sketch an overview of the framework itself and explain how it can be used to construct our proof of concept: a product recognition system. Hereafter, each of the system's components is discussed in detail.

4.1 System overview

This section provides an overview of the framework and explains how it is used to build a product recognition system. The overall system can be divided into three separate components:

1. Collecting: semi-automatic collection of the dataset.

2. Training: using the dataset, in combination with CNN architectures, to build and evaluate classifiers.

3. Using: using the best classifier to build a product recognition system.

By combining these components, we attempted to build a product recognition system that, given an image of a product, detects what kind of product the image contains. Subsequently, this information is used as input to bol.com’s search engine. The system looks like Figure 4.1 with the numbers denoting the components of the system.


Figure 4.1: Overview of the system.

4.2 Collecting the dataset

In order to compare classifiers with each other, one first needs a dataset to train them with. As explained in section 2.1, obtaining a massive amount of well-labeled data is usually very expensive and time-consuming, preventing people from training CNNs simply because they lack the required dataset. Since one of the main motivations behind our framework is to aid the user in building CNNs, we included a Google Images scraper.

The scraper requires a class name as well as a search query as input. It then retrieves the first 100 images corresponding to the search query and places them in a temporary directory. The scraper then loops through the images and checks if a directory with the class name already exists. If this is not the case, the directory is created and the image is given the name classname_1.jpg before being placed inside the directory. If the directory already exists, the image is saved with the name classname_d.jpg, with d being the first number, counting from 1, that the scraper finds such that the filename is not yet in use. This ensures that, when the user manually deletes an image from the dataset and then runs the scraper again, no data is overwritten. This creates the following directory hierarchy (a sketch of the naming logic follows below):

train
|-- class1
|   |-- class1_1.jpg
|   |-- class1_2.jpg
|-- class2
|   |-- class2_1.jpg
|   |-- class2_2.jpg
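The naming scheme can be expressed in a few lines of Python; this is a sketch of the logic described above, with a hypothetical helper name, not the scraper's actual code.

import os

def save_image(class_name, data, root="train"):
    # Create train/<class_name>/ on first use.
    class_dir = os.path.join(root, class_name)
    os.makedirs(class_dir, exist_ok=True)
    # Find the first free index, counting from 1, so existing images
    # are never overwritten when the scraper is rerun.
    d = 1
    while os.path.exists(os.path.join(class_dir, f"{class_name}_{d}.jpg")):
        d += 1
    with open(os.path.join(class_dir, f"{class_name}_{d}.jpg"), "wb") as f:
        f.write(data)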

4.3 Training and evaluating classifiers

Once the dataset is acquired, it can be used to train and evaluate the classifiers. Per class, all images are collected and divided: 80% of the images are used to train the classifiers, while the remaining 20% is used for the evaluation. This is done per class to ensure that the classifiers learn enough samples from all classes, preventing one or more classes from dominating.
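A per-class 80/20 split like this can be done with scikit-learn's stratified splitting; a sketch with placeholder data:

from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples of class "a" and 10 of class "b".
images = list(range(20))
labels = ["a"] * 10 + ["b"] * 10

# stratify keeps the 80/20 ratio within every class, so no class
# dominates the training or the evaluation set.
train_x, eval_x, train_y, eval_y = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=42)
print(len(train_x), len(eval_x))  # 16 4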

Subsequently, the training process itself is initiated. The CNN architectures discussed in section 3.3 are initialized with ImageNet weights and the last fully-connected layer is popped, since the network is used for feature extraction, not for classification. These architectures are readily available through the Keras library. Keras is a high-level neural network API that runs on top of Tensorflow. We chose Keras over other APIs since it provides clear documentation and, because it is higher level, it allows us to create sophisticated models with just a few lines of code. The CNN architectures are subsequently used to extract features from each of the training images. All of these features are used as input to the Linear SVM classifier. Once trained, the classifiers are saved as pickle files, making them easily transferable.

At this point, five classifiers are trained and saved, each one built from features extracted with a different CNN architecture. During the next step, the classifiers are systematically compared so the user can easily decide which classifier fits her problem best. This is done with the help of the metrics described in section 3.4, along with some other insightful metrics. Here we briefly explain how these metrics are computed.

The first metric calculated is the time to train the classifier. This includes the time it takes for the CNN architectures to extract the features from the training images. Next, during the evaluation process, we loop through all of the evaluation images, extract their features and predict their output. This results in a probability for each of the possible classes per image. The five classes with the highest probabilities are kept. If this top-5 of classes includes the actual class of the image, the rank-5 counter is increased. When the predicted class with the highest probability is the actual class of the image, the rank-1 counter is also incremented. Finally, both counters are divided by the number of evaluation images to arrive at a percentage.
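The rank-1 and rank-5 counters can be sketched as follows, assuming the classifier exposes a per-class score for each sample (for a multi-class LinearSVC, decision_function provides exactly that):

import numpy as np

def rank_scores(clf, features, true_labels):
    rank1 = rank5 = 0
    scores = clf.decision_function(features)  # one score per class per sample
    for row, actual in zip(scores, true_labels):
        # Classes ordered from the highest score to the lowest.
        top5 = clf.classes_[np.argsort(row)[::-1][:5]]
        if actual == top5[0]:
            rank1 += 1
        if actual in top5:
            rank5 += 1
    n = len(true_labels)
    return rank1 / n, rank5 / n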

By timing how long it takes a classifier to perform the evaluation, the images per second metric can be computed by dividing the total number of evaluation images by the evaluation time. Furthermore, for each of the CNN architectures, a confusion matrix is provided. The confusion matrix not only provides some form of root-cause analysis, but also allows us to calculate the number of false positives and false negatives. This is helpful in, for instance, medical applications where a false negative can be lethal. The scikit-learn library is also used to compute the classification report, which includes the recall, precision and f-scores. However, as pointed out in section 3.4, the f-score is irrelevant for our purposes.

Finally, the results of each classifier are written to a file. On top of that, we created a ranking for each of the metrics (images per second, rank-1, rank-5, number of false negatives, and number of false positives), so the user can easily see which classifier performs best at which metric. This ranking is written to the same file as the results per classifier.

2 https://keras.io/
3 https://www.tensorflow.org/


CHAPTER 5

Experiments

In the previous chapter, we thoroughly discussed the different components that compose our generic Convolutional Neural Network comparison framework. In this chapter, we put it to the test. We first apply our framework to the well-known MNIST and CIFAR-10 benchmark datasets to make sure it approaches state-of-the-art results. Subsequently, we proceed with our proof of concept: we show how the framework can be used to compose a product recognition system.

5.1 MNIST and CIFAR-10

Since accuracy generally is the most important driver in CNN architecture choice, we tested our framework on the well-known MNIST [18] and CIFAR-10 [15] benchmark datasets to see if the framework achieves near state-of-the-art results. We specifically chose these two datasets since each is designed for a different image recognition problem. As we want our framework to be as generic as possible, it should perform well on both datasets.

The MNIST dataset is composed of 70,000 28x28 grayscale images of the ten handwritten digits (Figure 5.1a), resulting in ten classes. It contains 60,000 training images and the remaining 10,000 instances are used for testing. This is relatively small compared to other benchmark datasets. This, however, is done on purpose due to our limited time resources.

The CIFAR-10 dataset consists of 60,000 32x32 color images of 10 object classes, ranging from airplanes to horses (Figure 5.1b). Each of the 10 classes contains 6,000 images. There are 50,000 training images and 10,000 test images.


Figure 5.1: Example images from the MNIST and CIFAR-10 datasets. a) MNIST dataset. b) CIFAR-10 dataset.

5.1.1 Setup

Both the MNIST and the CIFAR-10 datasets are readily available through the Keras library. The images from the CIFAR-10 dataset need no pre-processing: they are fed directly to the CNN architectures, which perform the feature extraction. Subsequently, the features are fed to the Linear SVM classifier to initiate the training process. The training is performed on the UvA cluster of the DAS-4 (http://www.cs.vu.nl/das4/), a supercomputer with multiple NVIDIA Titan GPUs. Once trained, the classifiers are evaluated. Since we only care about accuracy in this experiment, it is the only metric we compute.
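As an illustration of this pipeline, a sketch using VGG16 as the feature extractor is shown below. It is a sketch under assumptions, not the framework's exact code: the pre-processing and the `pooling="avg"` setting are our choices, and the transfer-learning setup follows section 3.2.6.

```python
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.datasets import cifar10
from sklearn.svm import LinearSVC

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# a pretrained VGG16 without its fully connected layers acts as a fixed
# feature extractor; other architectures (e.g. InceptionV3) expect larger
# inputs, so their images would need upscaling first
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg",
                  input_shape=(32, 32, 3))

train_features = extractor.predict(preprocess_input(x_train.astype("float32")))
test_features = extractor.predict(preprocess_input(x_test.astype("float32")))

# the extracted features are fed to a Linear SVM (section 3.2.5)
classifier = LinearSVC()
classifier.fit(train_features, y_train.ravel())
print("accuracy: %.4f" % classifier.score(test_features, y_test.ravel()))
```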

This process is repeated with the MNIST dataset. However, as the MNIST dataset consists of grayscale images while the CNN architectures expect color images, the images are first converted using the OpenCV library (http://opencv.org/).
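A minimal sketch of that conversion, assuming the images are loaded through `keras.datasets.mnist`:

```python
import cv2
import numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# replicate the single grayscale channel into three identical color channels
x_train_rgb = np.stack([cv2.cvtColor(img, cv2.COLOR_GRAY2RGB) for img in x_train])
x_test_rgb = np.stack([cv2.cvtColor(img, cv2.COLOR_GRAY2RGB) for img in x_test])
```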

5.1.2 Results

Table 5.1 presents the accuracy results of our framework applied to the MNIST and CIFAR-10 datasets. State-of-the-art results for these datasets are 99.79% and 96.53%, respectively.

Table 5.1: Accuracy results on the MNIST and CIFAR-10 datasets.

CNN architecture    MNIST accuracy    CIFAR-10 accuracy
VGG16               99.24%            76.34%
VGG19               99.21%            77.20%
Resnet50            98.17%            41.94%
InceptionV3         97.71%            85.23%
Xception            97.94%            87.24%


5.2 Product recognition system

As a proof of concept, we utilized the framework to come up with the best CNN architecture for our product recognition system.

5.2.1 Setup

The Google Images scraper is used to form a product dataset consisting of images from six categories: jeans, shirts, shoes, rings, watches and bracelets. We specifically chose the latter two since they look alike and we wanted to see how well the classifier could differentiate between them. Since a product recognition system is most likely to be used with real-world images, we chose search queries that result in product images with noisy backgrounds. Some example queries are displayed in Table 5.2.

Table 5.2: Examples of search queries used to collect the dataset.

Class name    Search query
Jeans         Woman jeans outside
Ring          Ring on hand
Watch         Watch on wrist

Once the dataset is collected, it is used to train and evaluate the classifiers using the CNN architectures from section 3.3. Per architecture, a list of its results is created and written to a text file. On top of that, a ranking per metric is created and written to the same text file. With the help of this ranking, we choose the best CNN architecture for our product recognition system. Subsequently, we combine this architecture with a simple web application. The web application lets the user upload an image, recognizes the object in the image and uses this, in combination with the most dominant colors, as input to bol.com's search engine.
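The thesis does not prescribe how the dominant colors are determined; a common approach is k-means clustering over the pixel values, sketched here with OpenCV. This is an illustrative helper, not the application's actual code:

```python
import cv2
import numpy as np

def dominant_colors(image_path, k=3):
    """Cluster the pixels of an image with k-means and return the k
    cluster centers as RGB tuples, most dominant first."""
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    pixels = image.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria,
                                    10, cv2.KMEANS_RANDOM_CENTERS)
    counts = np.bincount(labels.flatten())   # pixels per cluster
    order = np.argsort(counts)[::-1]         # largest cluster first
    return [tuple(int(c) for c in centers[i]) for i in order]
```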

5.2.2 Results

During the collection of the dataset, we quickly noticed that the scraper not only collected correct images (as in Figure 5.2a), but also images containing multiple objects and images without the intended object at all, as displayed in Figures 5.2b and 5.2c, respectively.

Figure 5.2: Results of the query “woman with jeans on”. a) The result is a correct image where jeans are the main object. b) The result contains many objects and jeans are not the main one. c) The result does not contain jeans at all.


As an example, the results of the InceptionV3 architecture are provided in Figure 5.3.

Figure 5.3: The list of results for the InceptionV3 architecture.

As we can see, the InceptionV3 architecture performs quite well on our dataset, with an accuracy of 93.47% and an astonishing rank-5 accuracy of 99.78%. Furthermore, as expected, it performs the worst on the bracelet images, predicting eleven of them as rings and eleven as watches. The ranking per metric is displayed in Table 5.3, where Ips, Fn and Fp stand for images per second, false negatives and false positives, respectively.


Table 5.3: Ranking per metric.

Ranking  Ips                   Rank 1             Rank 5             Fn             Fp
#1       vgg16: 26.23 ips      inception: 93.47%  xception: 100%     inception: 24  inception: 35
#2       vgg19: 23.1 ips       xception: 92.15%   inception: 99.78%  xception: 32   xception: 39
#3       resnet: 20.69 ips     vgg19: 88.16%      vgg19: 99.56%      vgg19: 36      vgg16: 62
#4       xception: 18.79 ips   vgg16: 88.05%      vgg16: 99.45%      vgg16: 46      vgg19: 71
#5       inception: 12.84 ips  resnet: 66.37%     resnet: 99.0%      resnet: 89     resnet: 215

Using the classifier trained with features extracted with the help of the best architecture, we built an interactive web application. The usage of the web application is depicted in Figure 5.4 below. The user first uploads an image (Figure 5.4a). She is then asked to segment the object of interest (Figure 5.4b). Subsequently, she is asked for a final check and to select the color(s) of the product (Figure 5.4c) before being redirected to bol.com (Figure 5.4d).
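A minimal sketch of the final step of this flow, written here with Flask. The helper functions are stubs and the bol.com search URL format is our assumption, not taken from the thesis:

```python
from urllib.parse import quote_plus
from flask import Flask, request, redirect

app = Flask(__name__)

def classify(image_file):
    """Stub: run the stored CNN-based classifier on the uploaded image."""
    return "jeans"

def dominant_color_name(image_file):
    """Stub: map the selected dominant color to a color name."""
    return "blue"

@app.route("/search", methods=["POST"])
def search():
    image_file = request.files["image"]
    query = "%s %s" % (dominant_color_name(image_file), classify(image_file))
    # the bol.com URL format below is an assumption, not from the thesis
    return redirect("https://www.bol.com/nl/s/?searchtext=" + quote_plus(query))
```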


Figure 5.4: Usage of the web application. a) The user uploads an image. b) The user segments the object of interest. c) The user performs a final check and selects the colors. d) The user is redirected to bol.com.


CHAPTER 6

Discussion

The previous chapter outlined the experiments we performed and presented their results. In this chapter, we discuss those results in depth while reflecting on our research questions, as described in section 1.3.

The first experiment we performed was the baseline experiment. We tested whether our framework matches state-of-the-art results by applying it to the well-known MNIST and CIFAR-10 benchmark datasets. Although this experiment does not answer a research question directly, accuracy is always an important driver in convolutional neural network architecture choice. This means that scoring low on this metric compared to the state of the art would make the framework useless. Furthermore, since the framework should be generic, it should perform well on different image recognition problems. Therefore, we chose two benchmark datasets that are different in nature.

The state of the art accuracy for the MNIST dataset is 99.79% [42]. We must, however, note that the authors also added randomly cropped and scaled images to the training set. On the original MNIST dataset, they achieved an accuracy of 99.48%. As we can see from Table 5.1, the VGG networks both come very close to the state of the art. The others score a little lower but still achieve high accuracy results.

Regarding the CIFAR-10 dataset, the state of the art is set at 90.92% [8]. Looking at Table 5.1, we can see that the Xception and InceptionV3 architectures perform near state of the art. The VGG networks also perform well, but the Resnet50 architecture fails miserably. It did, however, perform reasonably well on the MNIST dataset. One plausible explanation is that the Resnet50 architecture is vulnerable to noise in the background of an image: the images in the MNIST dataset contain no background noise, whereas the images in the CIFAR-10 dataset do. This explanation is supported by the results in Table 5.3, where the Resnet50 architecture scores the lowest on our product dataset, another dataset with images that contain noisy backgrounds.

Seeing that our framework achieves near state-of-the-art results, we can move on to the second experiment: a performance experiment in which we tested whether our framework is applicable to real-world problems, such as product recognition. This experiment focuses mostly on the first and third research questions and partly on the second.

As described in section 5.2.2, the Google Images scraper not only collects correct images, but also images containing multiple objects and images without the intended object at all. As this would adversely impact the classification accuracy, a way of filtering out the incorrect images was required. However, since one of the design principles of the framework is user-friendliness, this ought to be done with minimal user interaction. On top of that, we want to maximally facilitate the process of systematically collecting a dataset.


Therefore, the scraper was altered so that, as soon as it has downloaded all images, a user interface opens that loops through them. In this interface, the user has the option to delete an image, to save it, or to draw a bounding box around the object of interest. When the latter is chosen, only the area inside the bounding box is saved. For instance, the process of segmenting the jeans in the image of Figure 5.2b is depicted in Figure 6.1 below. Obviously, the interface also includes a reset function in case the user makes a mistake while drawing the bounding box.
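A minimal sketch of the cropping step, built around OpenCV's selectROI function; the delete, save and reset options of the actual interface are omitted here:

```python
import cv2
import os

def review_images(directory):
    """Loop over the downloaded images and let the user crop each one to a
    bounding box around the object of interest (ENTER confirms the box;
    cancelling leaves the image untouched)."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        image = cv2.imread(path)
        if image is None:          # skip files that are not images
            continue
        x, y, w, h = cv2.selectROI("review", image)
        if w and h:                # a box was drawn: keep only its contents
            cv2.imwrite(path, image[y:y + h, x:x + w])
    cv2.destroyAllWindows()
```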

Figure 6.1: Process of segmenting an object in an image. a) The original image. b) The image with a bounding box drawn around the object of interest by the user. c) The resulting image.

As noted in section 5.2.2, the InceptionV3 architecture performs the worst on the bracelet images, predicting eleven of them as rings and eleven as watches. Unfortunately, we cannot see exactly which images are wrongly predicted. However, by looking at the training images, we can make an educated guess. For instance, the image in Figure 6.2 may simply be too hard to predict, as even we do not know whether it shows a ring or a bracelet.

Figure 6.2: An example from the bracelet image set that looks like both a ring and a bracelet.

Looking at the results in Table 5.3, we can see that the classifier trained with the help of the InceptionV3 architecture scores highest on accuracy, with 93.47%, making it a logical candidate for our system. However, it also scores lowest on speed, at only 12.84 images per second. Speed is an important driver in our choice as well, since we are dealing with an interactive web application. We therefore chose the Xception architecture, as, in our opinion, the 1.32% loss in accuracy is more than compensated by the gain in speed. This shows that the framework makes it relatively easy to choose the best classifier for a given problem. Furthermore, the ranking per metric allows the user to perform trade-offs.


CHAPTER 7

Related Work

Literature dealing with comparison of classifiers can typically be organized into two main groups:

• Work that validates and justifies a new approach by comparing it with relatively few methods [45, 1, 40].

• Work that performs systematic qualitative [14, 12] and quantitative [5, 43] comparison between many representative classifiers.

Our research belongs to the second group since it performs a qualitative comparison between classifiers built with several convolutional neural network architectures.

Most of the aforementioned papers focus on one attribute of the classifiers: their accuracy. However, as we have seen in section 1.1, other attributes, such as speed, are important drivers in classifier choice as well. Kotsiantis [14] agrees and presents a review of supervised learning algorithms. He summarizes his findings in a table that considers multiple attributes of the classifiers. Some of these attributes are very meaningful, such as speed, how well the classifier deals with overfitting and its tolerance to noise. Other attributes, such as tolerance to missing values, are of less interest. Providing the results in a table, with the classifiers ranked according to their score per attribute, ensures the user can easily make her decision once she knows which attributes are important for her classification problem.

Furthermore, in most of the aforementioned papers, the classifiers are applied to relatively few datasets. This introduces a risk: choosing datasets of only a specific kind might lead to incorrect conclusions. This statement is supported by Table 5.1, where we see that the performance of classifiers can fluctuate considerably when they are applied to datasets of a different nature. One possible reason for choosing few datasets is the inability to easily apply the classifiers to a new dataset, which indicates the researchers lacked a generic framework built around the classifiers.

On the other hand, there are frameworks that can be used to explore and process datasets in a simple way. One such framework is RapidMiner [26], introduced by Klinkenberg et al. in 2010. RapidMiner is written in Java, which is not the best choice when working with classifiers, as they require quite some computational power and Java is a relatively slow language. Moreover, as pointed out by Cetinkaya et al. [2], comparing classifiers using RapidMiner is a laborious task, probably because RapidMiner is not intended solely for classifier comparison. Therefore, Cetinkaya et al. presented a framework that can be used to compare and visualize the performance of classifiers in a user-friendly way. The comparison is carried out with the help of the confusion matrices produced by the classifiers, which include the precision and recall metrics. The best scores in the confusion matrices are highlighted before they are offered back to the user.

Although their framework makes the user journey a bit easier, it has one major drawback compared to ours: classifiers can only be compared two at a time. This means that, when we have n classifiers and only one attribute plays a role in the decision, it takes n − 1 pairwise comparisons to find out which classifier performs best; when multiple attributes play a role, this number quickly rises. Furthermore, our framework is more suitable for image recognition problems, since it focuses on classifiers built with CNN architectures, which have proven to be best at image recognition. Finally, our framework offers more attributes on which the user can base her decision. One advantage of Cetinkaya et al.'s framework over ours is the graphical user interface. Although our results are neatly formatted and written to file, wrapping the whole application in a user interface would make it even easier for the user to compare classifiers.
