
A comparative study of human engineered features and learned features in deep convolutional neural networks for image classification

Research Internship

July 16, 2018

Student: T. van Elteren, S1849948, University of Groningen
Primary supervisor: Prof. M. Biehl, University of Groningen

Secondary supervisor: Dr. M.H.F. Wilkinson, University of Groningen


This paper covers a comparative study of human engineered features and learned features in deep convolutional neural networks for image classification. Human-engineered, or hand-crafted, features were the de facto standard for computer vision tasks until recent developments in feature learning surpassed them in the state of the art. The goal of this paper is to describe the differences between the two approaches, learn the specifics of the methods, and determine where they could benefit each other in a hybrid architecture.

Keywords: deep learning, computer vision, feature learning, hand-crafted features, learned features, neural networks, convolutional neural networks, deep neural networks


Contents

1 Introduction
1.1 Traditional approach
1.2 Feature descriptor complexity
1.3 Learning features
1.4 Biological inspiration
1.5 Neural networks
1.6 Deep Neural Networks
1.7 Deep Convolutional Neural Networks
1.8 Research questions
1.9 Outline
1.10 History and literature summary

2 Methods
2.1 Methods for comparison
2.2 Considered approaches
2.3 Engineered feature descriptors
2.3.1 Gabor filter
2.3.2 Scale Invariant Feature Transform (SIFT)
2.3.3 Speeded Up Robust Features (SURF)
2.3.4 Histogram of Oriented Gradients (HOG)
2.3.5 Gradient Location and Orientation Histogram (GLOH)
2.4 Engineered image representations
2.4.1 Bag of Visual Words (BoVW)
2.4.2 Fisher Vector (FV)
2.4.3 Vector of Locally Aggregated Descriptors (VLAD)
2.4.4 Machine learning
2.4.5 Image recognition and classification
2.4.6 Neural networks
2.5 Convolutional neural networks
2.5.1 Convolutional layer
2.6 Deep Neural Network Architectures
2.6.1 (BN-)AlexNet
2.6.2 Multi-column Deep Neural Networks for Image Classification
2.6.3 Zeiler-Fergus (ZF) Net
2.6.4 VGG
2.6.5 GoogLeNet-Inception
2.6.6 OverFeat
2.6.7 DeCAF
2.6.8 ResNet
2.7 Datasets, benchmarks and frameworks
2.7.1 Benchmarks & conferences
2.7.2 Frameworks
2.8 Summary

3 Results
3.1 Method comparison
3.2 Experimental validation and analysis
3.2.1 Elementary filters
3.2.2 Higher order concept filters

4 Discussion
4.1 Conclusion
4.1.1 Theses
4.2 Outlook
4.3 Peregrine


1 Introduction

The use of human-engineered, or hand-crafted, features for computer vision tasks such as object detection and image classification has been considered the de facto standard for many years. In recent years, improving on the state of the art has required these hand-crafted features to increase in complexity.

1.1 Traditional approach

The typical traditional approach for the classification of objects in images, of specific interest to the computer vision research community, consists of the combination of a feature descriptor followed by the application of a learning algorithm [24].

1.2 Feature descriptor complexity

The most predominant and well-known hand-engineered methods for feature description, such as the Scale-Invariant Feature Transform (SIFT) [72], the Histogram of Oriented Gradients (HOG) [27] [28], the application of the Bag of Words (BoW) method to computer vision known as Bag of Visual Words (BoVW) [124], and related feature detectors such as Speeded Up Robust Features (SURF) [10], Binary Robust Independent Elementary Features (BRIEF) [17] and the Vector of Locally Aggregated Descriptors (VLAD) [53], are evidence of a process of growing complexity.

1.3 Learning features

In another field of research, the domain of machine learning, researchers have taken a different approach: by learning models and features from the raw image data itself, recent approaches have surpassed the standard set by traditional methods and raised the bar to a human-competitive state of the art [37]. Only very recently have these approaches been successful in significantly surpassing their engineered state-of-the-art counterparts; see Figure 1.1.

1.4 Biological inspiration

The approach of learning features is biologically inspired. Research shows that a cat's visual cortex contains two types of cells: simple and complex. The simple cells fire in response to certain properties of visual sensory input, such as the orientation of edges. Complex cells exhibit more spatial invariance than simple cells [48] [94].

The artificial neural network, and the artificial neurons it consists of, are also biologically inspired, modelling the biological neural networks in the mammalian brain. The artificial neuron is a simplified representation of the biological neuron [25]. In addition, the Scale Invariant Feature Transform (SIFT), a feature descriptor engineered by Lowe and based on the research in Complex Cells and Object Recognition by Edelman, Intrator, and Poggio [36], is claimed by its author to also be biologically inspired [72].

Figure 1.1: Performance by classification error rate in the image recognition task as part of the ImageNet Challenge and the type of approach, either traditional computer vision by feature engineering, or deep learning where the features are learned using a deep layering of neural networks [78]. Image source: Musings on Deep Learning, Medium [79].

1.5 Neural networks

Artificial Neural Networks (ANNs) are, as mentioned, biologically inspired. They were described in the mid-20th century by McCulloch and Pitts in A Logical Calculus of the Ideas Immanent in Nervous Activity [77] and later constructed based on a computational model of the human brain, whose functioning was described by Donald Hebb in The Organization of Behavior: A Neuropsychological Theory. The neuron in these networks resembles the biological cell and is able to process several inputs, activated by some outside process.

The activity of the neuron depends on the amount of activation and determines the outputs. The paths, or connections, between these neurons can be strengthened or weakened by weights. Linear regression is considered the earliest use of neural networks, as a single-layer artificial neural network performs least-squares regression [25]. The research leading towards deep learning is a culmination of many years of research in neural networks, computer vision, machine learning, neuroscience and many other fields.

1.6 Deep Neural Networks

The term Deep Learning, possibly coined by Geoffrey Hinton in 2006 [46], describes the use of a system that learns the representations required to solve a task using multiple layers. A clear separation between so-called shallow architectures and their deep counterparts is as of yet not formally defined, but a shallow architecture, such as a hand-engineered feature descriptor combined with a Support Vector Machine (or any other linear or non-linear predictor) as final prediction layer, in turn only has a few layers and does not learn hierarchical representations. Deep architectures have an advantage due to the hierarchical representations of features they learn, which are partly general and partly task-specific.

1.7 Deep Convolutional Neural Networks

The application of deep learning to the research field of computer vision has resulted in an approach named Convolutional Neural Networks (CNNs) [37]. Convolutional is added to the name because of the filter-like approach used in this type of neural network. As CNNs are the main contribution within deep learning specifically geared towards computer vision tasks, they will be the main topic of this paper.

1.8 Research questions

As the topic of this paper has been introduced and the goals have been set, it is time to define the main research questions, which are enumerated below.

1. Which types of features are learned? Do they show similarity to human-engineered features?

2. What is the generalization ability of a deep neural network, e.g. the transferability of learned features to a new task?

3. Feature engineering efforts so far focused on developing feature descriptors that were invariant to, among others, changes in orientation and scale. How do deep convolutional neural networks handle these invariances? What techniques are used here?

4. Which approaches in feature engineering are successful? What is the rationale behind these approaches?

5. Deep neural networks have different architectures. Which types of architectures exist, what are their differences, and what is the impact on their ability to perform image and object recognition?

6. Which metrics exist for comparing the characteristics of traditional and deep learning approaches?

7. Which steps can be explored to learn about the inner workings of a deep convolutional neural network?

1.9 Outline

The outline of the paper is as follows. The first section reviews the literature covering the most important breakthroughs in research leading up to deep learning, a history that starts at the end of the 17th century, when the chain rule in mathematics was formulated, forming the basis for the application of gradient descent. The second section covers the approaches in feature crafting and learning; it also covers the problems related to (deep) learning of features and defines a comparison framework. The second part of that section describes experiments performed to learn about the differences between the application of engineered and learned features. The final sections cover the results and discussion respectively.

1.10 History and literature summary

This section covers only a brief and limited review of the literature, as an in-depth survey is considered outside the scope of this paper. The research field of machine learning, with its subfield of deep learning, has a long history; this section provides a short summary of the research that resulted in the emergence of deep learning.


Early history

In 1763 the basis for Bayes' Theorem was published in Thomas Bayes' An Essay towards solving a Problem in the Doctrine of Chances, which marks the start of the line of research that eventually culminated in neural networks. By reasoning about the probability of the occurrence of a future event, based on prior knowledge and probabilities, the foundation for statistical inference was laid down. The method of least squares, published in 1805 and now a standard approach in regression analysis, determines a solution for an overdetermined system; its most important application is data fitting. In 1812 Bayes' Theorem, based on the previously mentioned essay, was published and provided a concrete foundation for further research in the field.

’40s and ’50s

The Turing test, described by Alan Turing and published in 1950 in Computing Machinery and Intelligence, is an important step towards reasoning about machine intelligence. The Turing test and other research combined resulted in 1951 in the birth of the first neural network machine, named SNARC, by Marvin Lee Minsky; in 1952 machine learning was first applied to the game of checkers by Arthur Samuel at IBM's Poughkeepsie Laboratory.

’60s and ’70s

Research up to the '60s focused on Bayesian methods for probabilistic inference in machine learning, but this period also saw the birth of the perceptron, by Frank Rosenblatt at the Cornell Aeronautical Laboratory. The Nearest Neighbor algorithm was published and marked the start of basic pattern recognition. A 1969 publication on the limitations of perceptrons by Minsky and Papert marked the beginning of a period of diminishing research interest in neural networks. The automatic differentiation algorithm, which forms the basis for the backpropagation algorithm, was published in 1970, but backpropagation was only published as such in 1986 by Rumelhart, Hinton and Williams.

’80s

The Neocognitron, an artificial neural network published in 1980 by Kunihiko Fukushima, later inspired the creation of convolutional neural networks, a type of deep learning specifically applicable to computer vision tasks. In addition, in the '80s a number of research fields relevant to neural networks and (deep) machine learning were actively pursued: in 1982 John Hopfield published the Hopfield network, a recurrent neural network, and in 1989 Watkins published Q-learning, an improvement of reinforcement learning.

’90s

In the '90s research shifted towards kernel methods and Support Vector Machines by Cortes and Vapnik, which are in fact specialized neural networks and therefore named differently, and which saw many applications. Temporal Difference (TD) learning was applied to backgammon. Random (decision) forests were published by Tin Kam Ho, Deep Blue, an IBM solution, beat Kasparov at the game of chess, and LSTMs, long short-term memory recurrent neural networks, were published by Hochreiter and Schmidhuber. In 1998 the MNIST database, the benchmark dataset for handwriting recognition performance evaluation, was released by Yann LeCun et al.

2000 and later

In 2002 the machine learning library Torch was released, allowing a broad range of developers to use the components required to implement a (deep) neural network. In the 2010s and later, large companies such as Google, Facebook, Netflix and IBM compete in a number of challenges where machine learning is applied to, among others, games, handwriting, and object and face recognition. The large movement over these decades is from a knowledge-driven to a data-driven approach, which is analogous to the shift from using human-engineered features to features learned from data. The ability to apply GPU processing power to neural networks with backpropagation has been imperative to the rise of Deep Learning, an ongoing and active field of research, and the main topic of this paper. The following section covers the methods for comparison between human-engineered and learned features and their applications in the field of Deep Learning.


2 Methods

This section covers the methods that are traditionally employed in computer vision for object recognition and the methods used in the learning approach within deep convolutional neural networks. The first subsection determines the methods for comparison. The research is mostly qualitative in nature, to form a basis for further research; the quantitative research is limited to the metrics defined in the following subsection.

2.1 Methods for comparison

In order to perform a comparison between the hand-crafted and learned features we first need to discover the features that are learned. In contrast to the hand-crafted features this is not trivial. Next we define methods of comparing the hand-crafted and learned features. One of the methods is visual inspection: we compare the features visually and try to find differences between them, defining hypotheses on the reasons why they differ.

Another approach is by determining the visual output of the approaches, e.g. the result after transformation of the input by the algorithm: based on a fixed input image we discover the output images and compare the results. Again we hypothesize about the differences.

Another approach to determining the differences by visual inspection is subtracting the input image from the output image and comparing the difference images of the various systems. Finally, we try to find evidence to validate or invalidate the hypotheses, supported by theoretical findings. As we first need to discover the features that are learned, we need to learn which types of convolutional neural network architectures, datasets and benchmarks, e.g. applications or tasks, exist, and what impact these have on the features that are learned.

The generalization of an approach to unseen data, e.g. the performance of a solution, is an important metric in determining the state of the art. In the traditional approach, the time to research, engineer and develop feature descriptors is typically years. As the deep learning approach does not require features to be engineered, in principle no time is required to engineer them. However, as the deep neural network itself needs to be adapted to the problem at hand, this is not entirely the case. The use of pre-trained models can greatly speed up the use of a deep neural network, although the problem to solve needs to be closely similar to the problem the model was trained on.

In the neural network learning approach, the time to train a deep neural network, i.e. the computational complexity of the training phase, is a factor to take into account in a comparison. The runtime requirements and computational resources required, i.e. the computational complexity of the testing phase, to apply the trained network in (near-)real-time applications are regarded separately, due to the large difference between traditional and deep learning approaches. One of the most important differences between the approaches is the training time and complexity of a convolutional neural network versus the use of traditional approaches for computer vision tasks. The transferability of features over datasets, i.e. the performance of learned features trained on one dataset applied to another, and the generalization ability over tasks, i.e. the application of a model learned on one task to another, are two more metrics that are important in scoring the performance of either a learned or an engineered approach.

In addition, some approaches exist to compare deep learning based solutions with their traditional counterparts. One way is to compare the performance of a solution consisting of a Support Vector Machine (SVM) with hand-crafted features against a Convolutional Neural Network with learned features. As the features learned by deep learning approaches are currently not entirely understood, visual inspection can be used to determine what features are learned. The use of homogeneous datasets when comparing solutions is important, as it removes the variance introduced by inhomogeneous datasets.

In order to compare the two approaches, a number of comparison metrics are defined, as follows:

• Development complexity

• Computational complexity in the training phase

• Computational complexity in the testing phase (Execution time)

• Generalization ability (wrt other datasets)

• Transferability quality

• Interpretability

• Requirement for expert knowledge

• Time required for hand-tuning of parameters

Now that we have defined a number of metrics for comparison, we need to list the methods that we consider in our comparison, which will be described in the next subsection.

2.2 Considered approaches

The considered approaches, both traditional and deep learning methods, are given in the following itemization.

• Gabor filter

• Scale Invariant Feature Transform (SIFT)

• Histogram of Oriented Gradients (HOG)

• Gradient Location and Orientation Histogram (GLOH)

• Bag of (Visual) Words (Bo(V)W)

• Vector of Locally Aggregated Descriptors (VLAD)

• Fisher Vector (FV)

The thesis is that the features that result from these filters, or (feature) transforms, are similar to the features learned in the first layers of a deep convolutional neural network. Therefore we consider the following convolutional neural network architectures, paying special attention to the result of the first few layers, to compare them to the previously mentioned filters. Extending the thesis regarding the first layers, the higher-order filters found in the subsequent layers might reveal similarities to the features derived using traditionally engineered feature descriptors. The feature detectors, and feature learners, need to be invariant, or in other words robust, to a number of variations, or variances, that are part of the image data. A number of variances that are common in images are shown in Figure 2.1.


Figure 2.1: Visualization of variances in input images that feature detectors and learners need to be invariant to. Image source: Stanford CS231n course notes [106].

The deep convolutional neural network architectures that are considered in the comparison are the following:

• Visual Geometry Group (VGG)

• (BN-)AlexNet

• Zeiler-Fergus (ZF) Net

• OverFeat

• DeCAF

• GoogLeNet-Inception

• ResNet

The traditional engineered descriptors mentioned in the considered approaches are described in the following subsections.

2.3 Engineered feature descriptors

The engineered feature descriptors are described in the following subsections.

2.3.1 Gabor filter

The first filter of interest for this research is the Gabor filter [41]. Especially the application of the two-dimensional Gabor filter to an image is of interest here, as it is able to extract features from the image. The 2-D Gabor filter is defined as the product of an elliptical Gaussian envelope, which may be rotated to any orientation, and a complex exponential representing a sinusoidal plane wave [49]. The Gabor features are the Gabor filter responses for a given input image. These responses are calculated over the image using a filter bank tuned to a number of orientations and frequencies. Figure 2.2 shows the resulting Gabor filter bank.


Figure 2.2: The resulting Gabor filter bank. Image source: Mathworks, Matlab central, Gabor wavelets [75].
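As an illustration, the sketch below builds a small Gabor filter bank and computes the filter responses for an input image. It assumes OpenCV is available; the kernel size, orientations, wavelengths and the file name input.png are illustrative choices, not values prescribed by this paper.

```python
import cv2
import numpy as np

def gabor_bank(ksize=31, sigma=4.0, wavelengths=(5, 10, 15), n_orientations=8):
    """Build a filter bank tuned to several orientations and frequencies."""
    bank = []
    for lambd in wavelengths:                    # wavelength of the sinusoid
        for i in range(n_orientations):
            theta = i * np.pi / n_orientations   # orientation of the filter
            bank.append(cv2.getGaborKernel((ksize, ksize), sigma, theta,
                                           lambd, gamma=0.5, psi=0))
    return bank

# Gabor features: the filter responses over a given input image.
image = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
responses = [cv2.filter2D(image, cv2.CV_32F, k) for k in gabor_bank()]
```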

2.3.2 Scale Invariant Feature Transform (SIFT)

Typically, the traditional object recognition task is performed by extracting local features of an object and trying to match them with the features of an unknown object. The state of the art in object recognition, until recent developments in neural networks, was the approach published by Lowe in Distinctive Image Features from Scale-Invariant Keypoints [72], better known as the Scale Invariant Feature Transform (SIFT).

The main idea is to transform image content, globally, to local feature coordinates that are invariant to translation, rotation and scale. The features that result from the application of the SIFT algorithm are invariant to image scaling, translation and rotation, and are partially invariant to illumination changes and affine or 3D projection. The algorithm consists of four phases in defining key points for feature extraction [2]:

1. Scale space peak selection / extrema detection

2. Key-point localization

3. Orientation assignment

4. Defining key point descriptors

The first phase consists of identifying key locations in scale space by determining extrema of a Difference of Gaussians (DoG) function. These points are used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. This includes searching over multiple scales and image locations. The second phase is the part where a model is fit to determine location and scale; the stability of the keypoints is determined and the keypoints are selected based on their stability. The third phase determines the best orientation for each keypoint region. The last phase is keypoint description, where the local image gradients at the selected scale and rotation are used to describe each keypoint region. This step ends with indexing and matching, by creating a hash table, or dictionary, of descriptors of sample images.

The descriptors extracted from a new image are matched to the descriptors in the dictionary to perform object recognition.

At the heart of the scale-space interest points of SIFT is a Laplacian of Gaussian kernel. This is used as a blob detector. The Difference of Gaussian (DoG) is a function that closely resembles the Laplacian function [70].


The subdivision of the scale space into multiple scales in SIFT is handled by a pyramid scheme: the scale space is separated into octaves. Octave 1 uses scale σ, octave 2 uses scale 2σ, and so on. In every octave the initial image is repeatedly convolved with Gaussians to produce a set of scale space images. The Difference of Gaussians (DoG) is the result of subtracting adjacent Gaussians. After each octave, the Gaussian image is down-sampled by a factor of 2, producing an image a quarter of the size, to start the next level.
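A minimal sketch of this pyramid scheme is given below, assuming numpy and scipy are available; the number of blur levels per octave and the base scale of 1.6 are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, sigma=1.6, levels=5, k=2 ** 0.5):
    # Repeatedly convolve the image with Gaussians of increasing scale ...
    blurred = [gaussian_filter(image, sigma * k ** i) for i in range(levels)]
    # ... and subtract adjacent Gaussians to obtain the DoG images.
    return [blurred[i + 1] - blurred[i] for i in range(levels - 1)]

def next_octave_input(image):
    # After each octave, down-sample by a factor of 2 in each dimension.
    return image[::2, ::2]
```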

In the key-point localization step, each point is compared to its 8 neighbours in the current image and 9 neighbours each in the scales above and below. Once a keypoint candidate is found, a detailed fit to the surrounding data is performed in order to determine location, scale and the ratio of principal curvatures. A 3D quadratic function is used for fitting, and the Hessian matrix is used to remove edge responses.

The next step, orientation assignment, creates histograms of local gradient directions at the selected scale; an orientation is then assigned at the peak of a smoothed histogram. Each keypoint thus specifies stable two-dimensional coordinates: in addition to the x and y coordinates, the scale and orientation are defined along with them.

The final step, keypoint description, is based upon the location, scale and orientation that were gathered from every keypoint in the previous steps. The process computes a descriptor for a local image region that is highly distinct and invariant to variations.

The normalization step rotates the window to a standard orientation and scales the window size based on the scale at which the keypoint candidate was found. Figure 2.3 shows on the left the gradient magnitude and orientation at each point, weighted by a Gaussian, with 2x2 descriptors over an 8x8 grid, and on the right the orientation histograms, with the sum of the gradient magnitudes in each direction. In typical experiments, a 4x4 array of 8-bin histograms is used, resulting in a total of 128 features for one keypoint.

The Lowe keypoint descriptor uses a normalized region around the keypoint and computes the gradient magnitude and orientation at each point in the region. A Gaussian window is then overlaid to weight them, and an orientation histogram is created over the 4x4 subregions of this window. The resulting vector typically contains 128 values.

Figure 2.3: The generation of the SIFT descriptor from image gradients to keypoints. Image source: David G. Lowe, Distinctive image features from scale-invariant keypoints [72].
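In practice SIFT keypoints and descriptors can be computed with OpenCV, as in the hedged sketch below; it assumes an OpenCV build (version 4.4 or later) in which SIFT is included, and the file name query.png is illustrative.

```python
import cv2

image = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

# Each keypoint carries x, y, scale and orientation; descriptors is an
# (n_keypoints, 128) array that can be matched against the dictionary of
# descriptors extracted from sample images, e.g. with a brute-force matcher.
matcher = cv2.BFMatcher(cv2.NORM_L2)
```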

2.3.3 Speeded Up Robust Features (SURF)

Speeded Up Robust Features [10] is an effort to improve upon the SIFT algorithm for feature description. The SURF algorithm uses Haar features and integral images instead of the approach used by SIFT, and a box filter approach is applied.

(a) The SURF descriptor and orientation. Image source: OpenCV, Surf orientation [83].

(b) The SURF descriptor and use of the box filter. Image source: OpenCV, Surf orientation [82].

There are a number of differences between SIFT and SURF. The differences are:

• SIFT approximates the Laplacian of Gaussian (LoG) with a Difference of Gaussians (DoG), while SURF approximates the LoG with a box filter

• In SIFT the DoG is convolved with images of different sizes using the same size of filter, while in SURF box filters of different sizes are convolved with an integral image

• In SIFT the filter is fixed and convolved with down-sampled images, while in SURF the image is fixed and convolved with up-scaled filters

• For keypoint detection, SIFT uses local extrema detection, applies non-maxima suppression and eliminates edge responses with the Hessian matrix, while SURF uses the Hessian matrix and non-maxima suppression

• In SIFT orientation histograms are created over 4x4 sample regions; for each orientation histogram 8 directions are used, with the length of each arrow equivalent to the magnitude of the histogram, as shown in Figure 2.3. SURF, in contrast, uses a quadratic grid with 4x4 square subregions for the orientation, where for every square the wavelet responses are computed from 5x5 samples

• SIFT uses a 128-dimensional descriptor, while SURF uses a 64-dimensional descriptor

The main components in which the SURF descriptor differs from SIFT [80] are shown in Figure 2.4a, showing the descriptor and the use of the orientation, and Figure 2.4b, showing the use of the box filter.
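A hedged usage sketch follows; note that SURF is patented and only ships in opencv-contrib builds compiled with the non-free modules enabled, and that the threshold value is illustrative.

```python
import cv2

image = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
keypoints, descriptors = surf.detectAndCompute(image, None)
# descriptors has shape (n_keypoints, 64); surf.setExtended(True) switches
# to the extended 128-dimensional variant of the descriptor.
```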

2.3.4 Histogram of Oriented Gradients (HOG)

The approach described by Viola and Jones [122], using Haar features with AdaBoost, was until recently considered the state of the art in object detection, and with that face detection, but has been surpassed by the Histogram of Oriented Gradients (HOG) method.

The more recent approach is that of the Histogram of Oriented Gradients, for example in Histograms of Oriented Gradients for Human Detection by Dalal et al. [27], where it is applied to the task of human detection.

The Histogram of Oriented Gradients approach consists of the following steps:

1. Start with a greyscale image as input

2. Evaluate the grey values surrounding every pixel and replace every pixel with a vector in the direction of the gradient of the surrounding pixels

3. Perform a sliding window approach that replaces every window with the vector of the largest gradient

The result of applying the Histogram of Oriented Gradients descriptor to an image is shown in Figure 2.5.

Figure 2.5: The result of the Histogram of Oriented Gradients descriptor on an image. Image source: SciKit Image Organisation, Histogram of Oriented Gradients [96].
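The sketch below computes a HOG descriptor and a visualization like Figure 2.5 with scikit-image; the cell, block and bin parameters follow the common Dalal-Triggs defaults, and the RGB input file person.png is an assumption.

```python
from skimage import color, io
from skimage.feature import hog

image = color.rgb2gray(io.imread("person.png"))  # greyscale image as input

features, hog_image = hog(
    image,
    orientations=9,          # number of gradient direction bins
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    visualize=True,          # also return an image like Figure 2.5
)
```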

2.3.5 Gradient Location and Orientation Histogram (GLOH)

Similar to the SIFT approach outlined in the previous subsection, at least with respect to the first three steps described, the Gradient Location and Orientation Histogram (GLOH) differs in the subsequent steps by a change in the local image descriptor.

The difference is that this approach uses a log-polar location grid. The default implementation uses 3 different radii and 8 angular directions for 2 of the radii, resulting in a location grid divided over 17 bins, as shown in Figure 2.6.

Next, a histogram of gradients using 16 bins is computed, which together with the log-polar location grid forms a feature vector of 272 dimensions. As this high dimensionality negatively impacts the performance of the algorithm, dimensionality reduction is performed by projecting the features to a 128-dimensional space.


Figure 2.6: The GLOH descriptor. Image source: Zhang Li, Gradient location and orientation histogram [131].

In addition to descriptors, methods to represent images by their parts after description, and other relevant methods are described in the following subsection.

2.4 Engineered image representations

In the following subsections three relevant approaches for representing image patches as vectors are covered: the Bag-of-Features (BoF) and the related Bag-of-Visual-Words (Bo(V)W), the (Improved) Fisher Vector ((I)FV) and the Vector of Locally Aggregated Descriptors (VLAD).

2.4.1 Bag of Visual Words (BoVW)

The Bag-of-Words (BoW) approach, adapted to images as the Bag of Visual Words (BoVW), substitutes each description of the region around an interest point of the image with visual words obtained from a predefined vocabulary, as shown in Figure 2.7.

The iconic image fragments around interest points represent the image as a histogram of visual words and are, without any notion of the location of the iconic image fragment, converted to a vector representing the relative number of occurrences of an image fragment in the image.

After representation as a vector, the next step is the generation of a codebook, by converting the vector-represented patches into codewords, analogous to words in text, resulting in a dictionary of these words. A single codeword is representative of several similar patches; after clustering, each patch in an image is mapped to a certain codeword, and the image can be represented by the histogram of these codewords.

Next, since the high-dimensional image has been reduced to a low-dimensional vector representation, typical text retrieval techniques can be used, such as term frequency-inverse document frequency weighting or cosine similarity.
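A minimal end-to-end sketch of this pipeline is shown below, assuming OpenCV for SIFT descriptors and scikit-learn for the clustering; the vocabulary size of 256 and the file names are illustrative.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()

def local_descriptors(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, des = sift.detectAndCompute(gray, None)
    return des

# Codebook generation: cluster all training descriptors into codewords.
train_des = np.vstack([local_descriptors(p) for p in ["a.png", "b.png"]])
codebook = KMeans(n_clusters=256, n_init=4).fit(train_des)

def bovw_histogram(path):
    # Map each patch to its codeword and build the histogram of codewords.
    words = codebook.predict(local_descriptors(path))
    hist = np.bincount(words, minlength=256).astype(np.float64)
    return hist / hist.sum()   # relative occurrence of each codeword
```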

The most relevant task as an application of the BoW model is object categorization. Two categories of approaches to this problem can be identified: generative, where the way the data was generated is incorporated in the solution, and discriminative, where the model is used to determine which category best represents the image, without any notion of how the data was generated.


Figure 2.7: The Bag of Words representation model. Image source: Andrew Zisserman, Bag of words [6].

An alternative to Bag-of-Features is the use of a Fisher Vector, which will be covered in the next subsection.

2.4.2 Fisher Vector (FV)

An alternative approach to the method described in the previous section is the Fisher Vector for image representation. The Fisher Vector provides an image representation that forms a global image descriptor from local image features. Fisher vectors exploit generative models in discriminative classifiers: they describe the deviation of a set of descriptors from an average distribution, modelled by a parametric generative model, and use Gaussian Mixture Models (GMMs) to cluster the local descriptors. This process is shown graphically in Figure 2.8.

Figure 2.8: The Fisher Vector method. Image source: Tuan Nguyen, Fisher vector [117].

2.4.3 Vector of Locally Aggregated Descriptors (VLAD)

Another approach, similar to the Fisher Vector, is the Vector of Locally Aggregated Descriptors (VLAD). The difference is that instead of Gaussian Mixture Models, K-means is used to cluster the local descriptors.

The Vector of Locally Aggregated Descriptors has been shown to outperform both the Bag-of-Features and Fisher Vector methods and is thus considered the de facto standard for representing image patches as vectors.


Figure 2.9: The Vector of Locally Aggregated Descriptors method. Image source: Tuan Nguyen. Vector of locally aggregated descriptors [118].
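As an illustration, the sketch below aggregates local descriptors into a VLAD vector with numpy; the K-means codebook is assumed to be trained beforehand (for example as in the BoVW sketch), and the power and L2 normalization steps are common practice rather than part of this paper's description.

```python
import numpy as np

def vlad(descriptors, centroids):
    """descriptors: (N, D) local descriptors; centroids: (K, D) codebook."""
    k, d = centroids.shape
    # Assign each local descriptor to its nearest cluster centre.
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assignments = dists.argmin(axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assignments == i]
        if len(members):
            # Accumulate residuals between the descriptors and their centroid.
            v[i] = (members - centroids[i]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))        # power normalization
    return v / (np.linalg.norm(v) + 1e-12)     # L2 normalization
```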

To conclude this section: some more recent approaches to engineered feature generation and detection are available, but they are considered outside the scope of this paper.

2.4.4 Machine learning

The field of machine learning focuses on the development of generic algorithms that are able to build logic based on the input data. The idea is that an algorithm should be generic, so that it can be applied to other types of data without changes or adaptation [31]. The field is a sub-discipline of Artificial Intelligence, with the difference that it does not focus on strong AI but on a specific task, solved by generalizing over the input data.

Categorization of algorithms

We can categorize the different types of algorithms in the field based on the nature of the learning signal, i.e. the feedback. Based on this categorization we arrive at the following two main categories:

• Unsupervised learning

• Supervised learning

We can also categorize based on the desired output of a machine learning solution, as shown in the itemization below:

• Classification

• Regression

• Clustering

• Density estimation

• Dimensionality reduction

Unsupervised learning

In the supervised approach the labels, i.e. the output to be learned, are available to the algorithm so that it can learn rules from the data. In the unsupervised category, on the contrary, the output labelling is unknown and needs to be learned from the data itself. In this paper we focus on the problems of classification and regression in the subfield of supervised learning. An example problem in the unsupervised field is separating data points with the use of a clustering algorithm.


Supervised learning

Consider an expert who requires a system to aid in decision making. The outputs, labels in the case of classification, are known, and the algorithm, the logic, needs to adapt in such a way that the transformation it performs produces the labels corresponding to a given input in the training set; i.e. the relationships between input and output are learned based on a labelled training set.

The components contribute to the output according to a certain distribution, in which every component, or feature, has a weight. Starting with a weight of 1 for every component in the system, i.e. an equal contribution, we update the set of weights of these components to approximate the output.

The brute-force approach would be to try every combination of weights and determine the best weights to fit the data. Instead of using all possible combinations of weights, in the case of supervised learning the deviation of the output from the expected, optimal, output, i.e. a cost function, can be used to determine an optimal set of weights to fit the data.

The graph of the cost function, depending on the data and the type of algorithm, typically looks like a surface that has an optimum, i.e. a weight configuration that fits the data with the minimal amount of error.

When we start at a random location on this surface, we can find a path from this starting location to the optimum by determining the derivative, i.e. the slope of the function's tangent, and changing the weights only minimally. By testing only the sets of weights along this path, or vector, we drastically decrease the number of weight combinations we try before finding the optimum. The approach outlined here is known as batch gradient descent.

The goal of multivariate linear regression is to estimate a function, an equation for a line that fits through all the data points, which is then used to predict the output. When a linear function cannot approximate, or fit, the data using a continuous line, approaches exist that expand upon linear regression to fit more complex data.
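A toy example of batch gradient descent for linear regression is sketched below, assuming numpy; the synthetic data, learning rate and iteration count are illustrative.

```python
import numpy as np

X = np.random.randn(100, 3)                # 100 samples with 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * np.random.randn(100)

w = np.ones(3)                             # equal contribution to start
learning_rate = 0.1
for _ in range(500):
    error = X @ w - y                      # deviation from the expected output
    gradient = X.T @ error / len(y)        # slope of the cost surface
    w -= learning_rate * gradient          # a minimal step along the path
```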

Non-linear algorithms, such as neural networks or support vector machines with kernels, can fit even more complex high dimensional data.

2.4.5 Image recognition and classification

In addition to discrete scalar values, we can also use image pixels as input for machine learning. To use an image pixel, we take its grey intensity value, or in the case of a color image three intensity values for red, green and blue respectively, and use these values to represent the image.

For a color image the typically used representation is known as a tensor. This tensor is a multi-dimensional array, a number of pixels high, a number of pixels wide (typically the same as the height) and 3 channels (red, green, blue) deep.

For binary classification of the class of an object in a grey-value image, we would have a neural network consisting of 3 layers: an input layer with a number of units equal to the width times the height of the image in pixels, an intermediate layer, and 2 outputs: the likelihood that the input belongs to the class, and the likelihood that it does not.

Training the neural network requires a set of labelled instances, both of images with the class that we want to learn and of images that contain everything except that class, i.e. the counterexamples.

Subproblems

During training of the neural network as described, it becomes apparent that the recognizer only performs well when the object that we want to classify appears in exactly the same orientation in the image as it does in the training set of images.

A brute-force approach to solving the translation invariance problem would be to use a sliding window and search for the class in the test image. The inefficiency of this brute-force approach forces us to search for more efficient solutions.

In addition to the challenges described in the previous paragraphs, the task of image classification is further complicated when deformation, as an object is not necessarily a rigid body, and occlusion, as an object can be only partly visible, are taken into account.

Some more characteristics of the problem that add to the issue at hand are viewpoint and scale variation: objects can be oriented in many ways and come in different sizes. Differences in illumination, the amount of background clutter and the intra-class variation, as broad classes can consist of a range of objects, also contribute to the problem of image classification.

Learning image categories

The image categories are learned using machine learning: we present a (large) number of labelled images per class and thereby allow the learner to learn the intricacies of each category of images.

This process of learning image categories consists of three components: a training set, as previously described; the learning component, which is called training a classifier or learning a model; and the test or evaluation part, which consists of a dataset of labelled images that the model or classifier has never seen before, where the predicted label and the true label are compared to determine how well the classifier performs.

A very naive approach would be to perform Nearest Neighbor (1-NN) classification: comparing a test image with all the images in the dataset and predicting the label of the closest one. This comparison is performed naively, using the difference on a per-pixel basis. The distance measure described is known as the L1 distance. If the absolute difference, again considering the difference in value on a per-pixel basis, is zero (or small), the images are similar; if it is large, the images are very different.

The accuracy is measured as the fraction of correct predictions with respect to all predictions. In addition to L1 we can choose L2 as the distance measure, otherwise known as the Euclidean distance: the root of the sum of every difference squared. The difference between them is that the latter, in contrast, tolerates many small differences rather than a single big one.

An improvement over NN is to consider more than a single nearest neighbor: k-Nearest Neighbor (k-NN) classification. In this approach we compare a test image to the top k closest images, which vote to determine the label of the test image. Increasing the value of k smooths the prediction and makes it less prone to outliers.
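A sketch of k-NN classification on flattened images with both distance measures is given below, assuming numpy arrays and integer labels; the shapes and the k value are illustrative.

```python
import numpy as np

def knn_predict(test_image, train_images, train_labels, k=5, metric="L1"):
    """train_images: (N, D) flattened images; test_image: image array."""
    diffs = train_images - test_image.ravel()   # per-pixel differences
    if metric == "L1":
        dists = np.abs(diffs).sum(axis=1)       # sum of absolute differences
    else:                                       # L2, the Euclidean distance
        dists = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]             # the top k closest images ...
    votes = train_labels[nearest]
    return np.bincount(votes).argmax()          # ... vote on the label
```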

The parameter k and the selected distance measure, L1 or L2, are known as hyperparameters. In order to tune these types of parameters without degrading the integrity of the experiment and data, we need to keep a separate part of the training dataset as a validation set. This set allows us to validate different sets of hyperparameters without introducing possible overfitting or a loss of generalization on the test set. When the amount of training data available is small, cross-validation, the use of a collection of randomly selected train, validation and test sets, can be used.

The k-NN approach outlined has some clear limitations in the task of image classification: first, too much time is required during testing, as all images need to be loaded for comparison with the test image; and second, the distance measure is not based on any measure that corresponds to semantic similarity or the meaning of the image contents, especially considering the large inner-class variance within image classes, as shown in Figure 2.10.


Use of the current metric results in grouping together images with the same color distribution, including the background of the images. From the above we learn that alternatives are required to improve our approach to image classification in order to reach good performance.

Figure 2.10: Visualization of common inner-class variances in input images. Image source: Stanford CS231n course notes [106].

As k-NN was an improvement over NN, we can define a linear classifier, with score and loss functions that respectively map raw data to class scores and quantify the agreement between the predicted scores and the true labels, to improve our method.

The score function maps the pixels of the input image to a score per class. For this a linear function is used that depends on weights and biases. The parametric approach has the important advantage that after learning the parameters, the training data is not required anymore. In addition, during test time the prediction for a given test image is fast, as it requires only a single matrix multiplication with the learned weights. Along with this, an optimization known as the bias trick allows the bias vector to be combined with the weight matrix, which in turn allows keeping track of a single parameter matrix.

A linear classifier is trained by minimizing a loss function. Two common loss functions are the linear Support Vector Machine (SVM) loss and the Softmax loss [114]. These functions measure how well a set of parameters fits the training set with respect to the true labels; a good prediction thus corresponds to a set of parameters with a small loss value.
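The sketch below combines the linear score function, the bias trick and a softmax loss in numpy; the input dimension (32x32x3 images) and the number of classes are illustrative assumptions.

```python
import numpy as np

D, C = 3072, 10                        # input dimension and number of classes
W = 0.01 * np.random.randn(C, D + 1)   # bias trick: bias folded into W

def scores(x):
    x = np.append(x, 1.0)              # append a constant 1 to the input
    return W @ x                       # a single matrix multiplication

def softmax_loss(x, y):
    s = scores(x)
    s -= s.max()                       # shift for numeric stability
    p = np.exp(s) / np.exp(s).sum()    # class probabilities
    return -np.log(p[y])               # negative log-likelihood of true label
```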

When we visualize the loss function described before, we get a high-dimensional optimization landscape, where the global optimum is the bottom of this landscape [22]. The SVM, depending on the kernel, can be either linear, with a linear kernel, or non-linear, with a non-linear kernel.

The optimization of the loss function can be performed iteratively, starting with a random set of weights and optimizing them until the loss is minimal. The learning rate requires careful tuning, as a setting that is too low results in steady but slow progress, while a setting that is too high might seem faster but comes with the risk of overshooting the optimum.

There are two types of gradient that we can compute: numerical and analytical. The numerical gradient is simple to implement, but it is approximate and expensive to compute. The analytical gradient is exact and fast to compute, but it is error-prone in practice, due to the fact that it requires deriving the gradient by hand. Therefore the analytical gradient is used, in combination with a procedure known as a gradient check, in which the computed gradient is compared to the numerical gradient.
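A hedged sketch of such a gradient check follows, assuming numpy; f is any scalar loss function of the weights, and the step size h and the error threshold mentioned in the comment are conventional choices.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Centered finite-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h
        f_plus = f(w)
        w.flat[i] = old - h
        f_minus = f(w)
        w.flat[i] = old                          # restore the weight
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

def gradient_check(f, analytic_grad, w):
    num = numerical_gradient(f, w)
    # Relative error; values around 1e-7 or smaller usually indicate that
    # the analytical gradient was derived correctly.
    denom = np.maximum(np.abs(num), np.abs(analytic_grad)).max() + 1e-12
    return np.abs(num - analytic_grad).max() / denom
```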


The Gradient Descent algorithm iteratively computes the gradient and performs a parameter update in a loop. As the computed gradients flow backwards through the network, they communicate to each part of the network whether it should increase or decrease, and by how much, to push the output towards the required output. In addition, by using staged computation, breaking up the function to approximate into small functions for which the gradient is easy to derive, and chaining them using the chain rule, backpropagation can be implemented in a practical way, going back through the variables one step at a time [67] [12].

2.4.6 Neural networks

As mentioned in a previous section, when we need to model complex relations, one way of doing so is with the use of neural networks.

We can extend upon the previous use of a single set of weights to predict the output from the inputs when a single set of weights does not capture the relationships between input and output. These sets of weights can in turn be used to create a new layer that, using a number of neurons, defines an intermediate representation.

Each neuron is an estimation function that takes a set of inputs and multiplies them by weights to produce an output. By chaining many neurons, functions that are too complex to be modelled by a single neuron can be modelled.

In this way we can model complex relations and learn what the underlying representation of the data is.

Now we will take a look at some challenges that arise from the approach of learning underlying representations with the use of a large number of neurons in a neural network.

A few types of activation, or non-linearity, functions exist: Sigmoid, the common sigmoid function; Tanh, the common tanh function; ReLU, an abbreviation of Rectified Linear Unit, a function that is thresholded at 0; the leaky ReLU, which allows a small negative slope in an attempt to alleviate the dying-ReLU problem; and Maxout, which generalizes the ReLU by taking the maximum of two dot products between weights and data.
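These non-linearities are small enough to state directly; the numpy sketch below is illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)                      # thresholded at zero

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)         # small negative slope

def maxout(x, w1, b1, w2, b2):
    return np.maximum(w1 @ x + b1, w2 @ x + b2)  # max of two linear units
```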

Neural networks can have different types of layers based on the interconnections between the neurons. Fully-connected layers have neurons connected to adjacent layers by full pairwise connections, while the neurons within a layer are not connected. A layered architecture enables very efficient evaluation of neural networks based on matrix multiplications interwoven with the application of the activation function.

Neural networks are universal function approximators: they seem to inherit the correct assumptions about the functional forms of the functions required to solve problems seen in practice. Larger networks bring a desirable higher model capacity, but this must be met with an appropriate amount of (stronger) regularization, such as higher weight decay, or else the result might be an overfitting network.

Preprocessing of the input can aid performance: best performance is achieved by centering the data to have a mean of zero and by normalizing its scale to [−1, 1] along each feature. Initialization of the weights is best done by drawing them from a Gaussian distribution with standard deviation √(2/n), where n is the number of inputs to the neuron.
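A short numpy sketch of this preprocessing and initialization follows; the data and layer sizes are illustrative.

```python
import numpy as np

X = np.random.rand(1000, 3072)        # raw data, one row per sample
X -= X.mean(axis=0)                   # center each feature to zero mean
X /= np.abs(X).max(axis=0)            # normalize each feature to [-1, 1]

n_inputs, n_units = 3072, 512
# Draw weights from a Gaussian with standard deviation sqrt(2/n).
W = np.random.randn(n_inputs, n_units) * np.sqrt(2.0 / n_inputs)
```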

The typical regularization techniques are [66]: L2, penalizing the squared magnitude of all parameters directly in the objective function; L1, where for each weight a term is added to the objective function; max norm constraints, enforcing an absolute upper bound on the magnitude of the weight vector; and Dropout, complementing the other methods by deactivating neurons based on some probability.
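As an illustration of the last technique, the sketch below implements inverted dropout at training time, assuming numpy; the keep probability of 0.5 is a common but illustrative choice.

```python
import numpy as np

p = 0.5                                         # probability of keeping a neuron

def dropout_forward(h, train=True):
    if not train:
        return h                                # no scaling needed at test time
    mask = (np.random.rand(*h.shape) < p) / p   # drop neurons and rescale
    return h * mask
```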

Batch normalization is a technique that makes networks more robust to initialization, by explicitly forcing activations throughout the network to take on a unit Gaussian distribution at the beginning of training [50].

A number of methods exist for parameter updates:

• Vanilla SGD: stochastic gradient descent, with an update along the negative gradient direction

• Momentum update: improves over SGD by using a parameter vector that builds up velocity in any direction that has a consistent gradient

• Nesterov momentum: uses a future approximate position as a lookahead

• Annealing: a process that lets the learning rate decay over time using a step, exponential or 1-over-number-of-iterations type of function

• Adagrad: an adaptive learning rate method that keeps track of a per-parameter sum of squared gradients, allowing element-wise parameter updates [35]

• RMSprop: improves upon Adagrad by alleviating the issue of its aggressive, monotonically decreasing learning rate through a leaky cache variable

• Adam [61]: performs a smoothed version of RMSprop

In addition, a number of "tricks" that improve SGD are provided in Stochastic Gradient Descent Tricks [15].
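Three of the update rules above are sketched in numpy below; the hyperparameter values are the usual defaults, and the module-level state is for illustration only.

```python
import numpy as np

lr = 1e-3

def sgd(w, dw):
    return w - lr * dw                         # step along the negative gradient

v = 0.0
def momentum(w, dw, mu=0.9):
    global v
    v = mu * v - lr * dw                       # build up velocity
    return w + v

m, cache, t = 0.0, 0.0, 0
def adam(w, dw, beta1=0.9, beta2=0.999, eps=1e-8):
    global m, cache, t
    t += 1
    m = beta1 * m + (1 - beta1) * dw               # smoothed gradient
    cache = beta2 * cache + (1 - beta2) * dw ** 2  # leaky sum of squared grads
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    cache_hat = cache / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(cache_hat) + eps)
```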

In line with this research are an effort to define a number of tests as a standardized benchmark for stochastic optimization in "Unit Tests for Stochastic Optimization" [93], and a visualization of these processes as part of "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" [87].

As neural networks have a number of hyperparameters, hyperparameter optimization is required for parameters such as the initial learning rate, the learning rate decay (possibly with the use of a schedule) and the regularization strength, for example the L2 penalty and the dropout strength. Random search, based on log ranges, can be used to find good hyperparameters [14].

2.5 Convolutional neural networks

The Convolutional Neural Network (CNN) was first applied by LeCun et al. in 1998 for hand-written digit recognition. The problem that seemed unsolvable at the time was scaling to larger images. In addition, the lack of large amounts of training data and the computational resource limitations of the time prevented the step towards the results we see today.

Instead of a sliding window approach, which would be the crude approach for template matching, the convolutional neural network splits the input image into small overlapping image parts and afterwards performs operations on these image tiles.

In the following step, every part of the original input image is presented to a small neural network, where the neural network weights stay the same for every image part of the same original image. The result is that every part is processed equally.

The result of this operation is saved in a new array, storing the spatial arrangement in a grid, with information about the parts of the image that contain interesting information useful for further processing.

A follow-up step consists of downsampling to reduce the size of the array that results from the previous convolution operation. This process is known as max pooling, and it works by finding the maximum value in each predefined square of the array, a 2 by 2 square of parts for example, and storing only that biggest value in the resulting max-pooled array.
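A compact numpy sketch of 2x2 max pooling with stride 2 follows; it assumes an input whose sides are divisible by 2.

```python
import numpy as np

def max_pool_2x2(a):
    h, w = a.shape
    # Group the array into 2x2 squares and keep only each square's maximum.
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.arange(16).reshape(4, 4)
print(max_pool_2x2(a))   # 2x2 array holding the per-square maxima
```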

The last step of the convolutional neural network approach is feeding the max-pooled array to a fully-connected neural network layer to determine, by means of the two outputs giving the likelihood of a match or non-match, whether the object class is present in the image.

A typical architecture [90] consists of a combination and typical ordering of a number of convolution and max pooling layers. The architecture starts with the image input and ends with a number of fully connected layers that produce the prediction of the class.

Convolutional neural networks are very similar to the neural networks we described previously: both consist of a number of neurons that have learnable weights and biases. Every neuron receives inputs, performs a dot product and is sometimes followed by a non-linearity; all neurons together form a single differentiable score function, and the network has a loss function.

The main difference is that the convolutional neural network enjoys some optimizations with respect to the expected input: images. Due to this assumption we can optimize the approach in a way that improves the efficiency of the forward function and reduces the number of parameters of the network.

Instead of full connectivity of the neurons between adjacent layers, which with regular-sized images (200 by 200 pixels) would lead to each neuron having over 100 thousand weights, the number of parameters is greatly reduced by dropping the constraint of having every neuron fully connected. In addition, a different arrangement of the neurons is employed in the architecture: they are arranged in 3 dimensions, forming a volume with a width, height and depth, as shown in Figure 2.11.

Figure 2.11: Convolutional neural network layers. Image source: Deepnotes.io [30].

The convolutional architecture brings two types of layers that are not typically employed in neural network architectures: the Convolutional Layer and the Pooling Layer. Combined with the Fully-Connected Layer, the type of layer that was already typically part of neural networks, and a Non-Linearity Layer, they make up the Convolutional Neural Network (CNN) architecture. In its simplest form we can think of a CNN as a list of layers that transforms, through a differentiable function [11], an input image (as a volume) into an output volume containing the class scores.

A typical CNN architecture consists of a certain succession of layers, [INPUT - CONV - RELU - POOL - FC]: the INPUT layer holds the raw pixel values of an image; the CONV layer computes the output of neurons that are connected to local regions in the input, using a dot product between the weights and a small region of the input volume; the RELU layer performs an elementwise max operation, thresholding the activation at zero; the POOL layer is the downsampling layer that decreases the volume by downsampling along the width and height; and finally the FC (fully-connected) layer computes the class scores, with a size equal to the number of categories or labels.
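A hedged sketch of this succession is given below using PyTorch (an assumed framework choice; this paper does not prescribe one), with illustrative layer sizes for a 32x32 RGB input and 10 classes.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # CONV: dot products on local regions
    nn.ReLU(),                                   # RELU: elementwise max(0, x)
    nn.MaxPool2d(2),                             # POOL: downsample width and height
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # FC: scores for 10 classes
)

class_scores = cnn(torch.randn(1, 3, 32, 32))    # INPUT: raw pixel volume
```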

Some of the layers, the convolutional and fully-connected layers, have parameters, while the ReLU and pooling layers do not. In addition, the convolutional, fully-connected and pooling layers have hyperparameters, but the ReLU layer does not.

2.5.1 Convolutional layer

The parameters in the convolutional layer consist of a set of learnable filters: each filter extends through the full depth of the input volume but only considers a small part of the image spatially (in height and width). The response to these filters, the activation, is stored as a map along with the spatial position; each convolutional layer thus has a set of filters, each of which produces a separate 2-dimensional activation map.

The output volume consists of the stacked activation maps along the depth dimension.

The biological analogy is that each entry in the output volume can be thought of as the output of a neuron that considers only a small part of the input, but shares its parameters with all neurons to its left and right spatially. This combination of local connectivity and the receptive field, the spatial extent of the connectivity, is considered a hyperparameter of a CNN.

The spatial arrangement of a CNN, the number of neurons per output volume, is controlled by three hyperparameters: the depth is the number of filters to be used, the stride is the step with which the filter slides across the image, and zero-padding is used to control the spatial size of the output, for example to keep it equal to that of the input. Together they determine the size of the output volume, as the sketch below shows.
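
For a given input width W, filter size F, padding P and stride S, the output width follows the standard relation (W - F + 2P)/S + 1; a small helper, with illustrative values, makes this concrete.

def conv_output_size(w, f, stride, padding):
    # Standard output-size relation for a convolutional (or pooling) layer.
    return (w - f + 2 * padding) // stride + 1

print(conv_output_size(224, 3, stride=1, padding=1))  # 224: 3x3 filters preserve size
print(conv_output_size(224, 7, stride=2, padding=3))  # 112: a 7x7 filter with stride 2 halves it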

In order to control the number of parameters, a parameter sharing scheme is employed in CNNs, under the assumption that a feature learned at a specific spatial position is also of use at every other spatial position.

This gives rise to the notion of a depth slice, a slice of the volume that contains neurons sharing the same weights and bias. The result is that all neurons in a single depth slice use the same weight vector, and the forward pass of the convolutional layer can, in each depth slice, be computed as a convolution of the neuron's weights with the input volume. In this way the number of parameters to be learned can be greatly decreased, allowing training even in the face of larger images; a rough comparison is sketched below.
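
A back-of-the-envelope comparison illustrates the effect; the numbers below are illustrative and not tied to a specific network in this paper.

h = w = 200   # a 200 by 200 pixel input image
c = 3         # RGB channels
f, k = 3, 64  # 3x3 filters, 64 filters (the depth of the output volume)

# Without sharing: each of the k neurons connects to the whole input volume.
unshared = (h * w * c) * k    # 7680000 weights
# With parameter sharing: one small filter (plus bias) per depth slice.
shared = (f * f * c + 1) * k  # 1792 parameters
print(unshared, shared)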

In special cases, such as learning faces for face recognition, the parameter sharing scheme does not hold, as specific centered features of faces need to be learned in specific spatial configurations, and in these cases a Locally-Connected Layer is used.

The backward pass in backpropagation for a convolution operation is again a convolution operation, though with spatially flipped filters. An often used variant, the 1x1 convolution of Network in Network [69] and Inception-style CNNs, extends over the full depth of the input volume; in line with these operations, dilated convolutions add another hyperparameter that allows spaces between the filter cells [127]. Both variants are sketched below.
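
Both variants can be written down directly as PyTorch layers; the channel counts are illustrative.

import torch.nn as nn

# A 1x1 convolution extends over the full input depth (here 256 channels -> 64)
# without looking at spatial neighbours.
one_by_one = nn.Conv2d(256, 64, kernel_size=1)

# A dilated convolution inserts gaps between the filter cells: with dilation=2,
# a 3x3 filter covers a 5x5 effective receptive field.
dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)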

Several types of pooling exist. Max pooling, as described, is the current de facto standard for performance, but general pooling and L2-norm pooling also exist and were historically used abundantly. Current research even focuses on removing pooling altogether, with the use of conversions from pooling to fully connected layers [104].

There are some common layer sizes and common patterns for how layers are subsequently organized that seem to provide good performance. In these architectures a stack of small-filter convolutions is predominant over a single convolutional layer with a large receptive field. As for sizes: the input image size is typically a power of 2, the convolutional layers use small filters, padded with zeros to keep the spatial dimensions equal, and the pooling layers downsample the input, typically using the max operation with receptive fields of 2x2 and a stride of 2.
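
The preference for stacked small filters can be motivated by a quick parameter count: three stacked 3x3 convolutions cover the same 7x7 effective receptive field as a single 7x7 convolution, but with fewer parameters (biases ignored; the channel count is illustrative).

C = 64                              # channels in and out of each layer
stacked_3x3 = 3 * (3 * 3 * C * C)   # three 3x3 layers: 110592 weights
single_7x7 = 7 * 7 * C * C          # one 7x7 layer:    200704 weights
print(stacked_3x3, single_7x7)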

A number of common architectures have been developed using the building blocks described in this section; their specifics are covered in the following section.

Visualizations of activations and first-layer weights, and retrieving images that maximally activate a neuron, are some of the first techniques used to interpret the features learned by a CNN. The visualization of activations during the forward pass shows the typical edges of an input image that are used for recognition. An alternative approach is to visualize the weights that are learned by the network, as shown in Figure 2.12 for AlexNet trained on the CIFAR-10 dataset.

Figure 2.12: Learned weights visualized. Image source: Stanford CS231n course notes [107].

The first convolutional layer shows interpretable features: smooth filters if the network is well trained, with high-frequency grayscale features and low-frequency color features. The visualized weights of the second convolutional layer show a very large number of filters that, if the network is trained correctly, do not contain noise. A sketch of such a filter visualization follows.
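
A minimal sketch of a first-layer filter visualization, using torchvision's pre-trained resnet18 as a stand-in model (its first layer holds 64 RGB filters of size 7x7); any trained model with an RGB first layer would do.

import matplotlib.pyplot as plt
from torchvision import models

model = models.resnet18(pretrained=True)     # stand-in for a trained network
w = model.conv1.weight.data.clone()          # shape: (64, 3, 7, 7)
w = (w - w.min()) / (w.max() - w.min())      # rescale weights to [0, 1] for display
fig, axes = plt.subplots(8, 8, figsize=(6, 6))
for ax, filt in zip(axes.flat, w):
    ax.imshow(filt.permute(1, 2, 0).numpy())  # show each filter as a 7x7 RGB patch
    ax.axis("off")
plt.show()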

In addition to visualizing the weights and activations in general, visualizing the images that maximally activate a neuron offers new insights into the workings of the receptive field. What becomes clear from these visualizations is that by the time the features reach the fifth pooling layer, they tend to activate on large concepts or object components in the images, as shown in Figure 2.13.

Figure 2.13: Maximally activating images for some of the fifth (max-)pooling layer neurons of the AlexNet CNN. Image source: Stanford CS231n course notes [108].

Some restrictions to general acceptance of this interpretation do exist, however: it is unknown what the effect of the ReLU neurons is on the subsequent layers, though it is probably the case that these neurons combined represent the basis vectors of some space of image patches.

An alternative approach is to interpret the CNN at large, as it transforms an input image into a final representation in which images cluster by semantic meaning, and with that become separable by a linear classifier. A particularly well-suited method to show this is t-SNE: the CNN codes, i.e. the vector right before the classifier but after the ReLU non-linearity, are extracted and fed to the t-SNE algorithm, which yields a 2-dimensional vector for each image; shown in a grid, these reveal the semantic clustering that a CNN is apparently capable of.
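
A minimal sketch of this procedure, with scikit-learn's TSNE as one possible implementation and random data standing in for real CNN codes:

import numpy as np
from sklearn.manifold import TSNE

cnn_codes = np.random.rand(500, 4096)          # placeholder for real CNN codes
coords = TSNE(n_components=2).fit_transform(cnn_codes)
print(coords.shape)  # (500, 2): one 2-D position per image, ready to be
                     # snapped to a grid and filled with image thumbnails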

Another alternative approach is to apply destructive transformations to the images, in the form of occlusions, as shown in Figure 2.14, and to learn about the inner workings of feature learning in CNNs by comparing the resulting label with a heat map showing the activations of the network on the same image. In this way it becomes clear which areas of the image are responsible for a high classification score and which parts do not impact the classification score if occluded. A sketch of this experiment follows the figure.

Figure 2.14: Effect of occlusions on activations visualized by a heatmap. Image source: Stanford CS231n course notes [108].
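
A minimal sketch of the occlusion experiment, assuming a hypothetical `model` that maps a (1, 3, H, W) image tensor to class scores; the patch size and stride are illustrative.

import torch

def occlusion_map(model, image, target_class, patch=32, stride=16):
    _, _, h, w = image.shape
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = torch.zeros(rows, cols)
    for i in range(rows):
        for j in range(cols):
            occluded = image.clone()
            y, x = i * stride, j * stride
            occluded[:, :, y:y + patch, x:x + patch] = 0  # blank out one patch
            with torch.no_grad():
                heat[i, j] = model(occluded)[0, target_class]
    return heat  # low scores mark regions the classification depends on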

Transfer learning is an interesting approach that implicitly supports the conclusion that the features learned by a CNN generalize very well across different sets of image data [125]. There are three approaches (a sketch of the first one follows the list):

• The CNN is used as a fixed feature extractor: the learned CNN codes are applied, without change, to another dataset to train a linear classifier [88].

• Alternatively, by keeping both the features and the weights and fine-tuning the CNN on a new dataset, training on the new dataset can be jumpstarted.

• In addition to the two approaches mentioned, more and more researchers and engineers are sharing their pre-trained models, ready for direct application.
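
A sketch of the first approach, using torchvision's pre-trained resnet18 as one convenient choice and assuming a new dataset with 10 classes:

import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False                   # freeze the learned features
model.fc = nn.Linear(model.fc.in_features, 10)    # new, trainable linear classifier
# Training now only updates model.fc: the CNN acts as a fixed feature extractor.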

Important related research shows that deep neural networks learn correspondence [71], which is a prerequisite for transfer learning to work. In addition, a deeper understanding of the inner workings of deep neural networks, through visualizations, is found in "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps" [99] and "Understanding Deep Image Representations by Inverting Them" [73].

The deep learning architectures that we consider in the comparison are described in the following subsections as case studies.

2.6 Deep Neural Network Architectures

This subsection covers the predominant deep neural network architectures and the methods that these architectures inherit, and is based upon "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision" [37] by Srinivas et al.

The most important method, which is also the basis for deep learning, is representation (feature) learning. In this approach the goal is to automatically learn good representations (features) that best describe the most important parts of the input.

The approach in deep learning is to exploit feature learning on multiple levels: in an architecture consisting of layers of neural networks, the combination of the input and the first layer learns the most basic features. The consecutive layers, in turn, learn higher feature representations, each describing a distinctive part of an image, such as part of a wheel in the case of a dataset of means of transportation.

This process of learning multiple levels of features is known as hierarchical feature learning. The learned intermediate representations are dataset specific, contrary to the representations learned in the first layers, which show similarity across datasets, as shown in Figure 2.15.

Figure 2.15: Hierarchical Feature Learning and the learned representations from a number of datasets. Image source: Yu Huang. Visual detection, recognition and tracking with deep learning [129].

There are as many approaches to learning features as there are architectures. Typically, architectures need some optimizations to outperform the state of the art in the task for which the features are learned. Combining the tasks that a deep neural network needs to solve seems to boost its performance on every single task.

An important study of the features learned in deep convolutional neural networks has been performed by Yosinski et al [126], with the goal of understanding neural networks by visualizing the learned features.

The features that a Convolutional Neural Network (CNN) is able to extract from images appear to be more descriptive than their human-engineered counterparts. There is a trade-off to the learned features, however: the importance of a large dataset grows.

In addition, scale and rotation invariance requires extending the dataset, using a process called data augmentation, to allow the deep neural network to recognize objects with different orientations and scales. This in turn boosts the ability of the CNN to learn the importance of certain features in the images and improves its generalizability.

A number of deep neural network architectures exist, as described by Canziani et al [18]. The most predominant convolutional architectures are described in the following sections.

2.6.1 (BN-)AlexNet

One of the most important architectures to discuss is the architecture defined by Alex Krizhevsky et al [65], better known by its alias AlexNet, together with an approach that extends it with unsupervised pre-training and batch normalization, known as BN-AlexNet [98], by Marcel Simon, Erik Rodner and Joachim Denzler.

AlexNet consists, in its original form, of 5 convolutional layers, interleaved with max pooling and local contrast normalization layers. The last 3 layers are fully-connected, ending in a 1000-way softmax. The architecture is shown in Figure 2.16, with the succession of layers shown in Figure 2.17.

Several methods are employed to improve the training of the network: Rectified Linear Units (ReLU) allow faster training, overlapping pooling aids in reducing overfitting, data augmentation is used to artificially enlarge the dataset with label-preserving transformations, and drop-out is used as regularization. The methods used to enable and improve the training of deep neural networks are covered in more detail later in this paper.

The features learned in the consecutive convolutional layers are shown in Figure 2.18. It shows that layers closer to the input learn more basic features such as edges and blobs, while layers further down learn textures and object parts. Ultimately, in the final fully-connected layer, the object classes for object classification are learned.

Figure 2.16: Alex Krizhevsky et al's deep neural network architecture, i.e. the AlexNet architecture. The neurons are physically divided over two GPUs to accommodate the number of neurons required for training in acceptable time. Image source: Kaggle, Alexnet. Imagenet classification with deep convolutional neural networks [58].

Figure 2.17: An abstract, modular overview of the AlexNet architecture, showing the different types of subsequent layers. See the legend for an explanation of the layers. Image source: Hirokatsu Kataoka. CNN feature evaluation [47].

Figure 2.18: The AlexNet architecture and the types of features that are learned in the different parts of the architecture. Image source: James Hays, Deep net visualisation [51].

2.6.2 Multi-column Deep Neural Networks for Image Classification

An approach by Ciresan et al is to process parts of the input images, divided into P blocks, over a number of Deep Neural Networks (DNN) in parallel, forming the Multi-Column Deep Neural Network (MCDNN) for image processing and classification.

Figure 2.19: The Deep Neural Network (DNN) architecture on the left and the Multi-Column Deep Neural Network (MCDNN) architecture on the right. Image source: Jurgen Schmidhuber, Multi-column deep neural networks for image classification [95].

Together with the AlexNet architecture, these architectures have inspired many research groups to build improved deep neural networks for various applications. One of the networks that builds upon them is described in the next subsection.

2.6.3 Zeiler-Fergus (ZF) Net

The ZF Net owes its name to the two researchers, Matthew D. Zeiler and Rob Fergus, who used a novel method for "Visualizing and Understanding Convolutional Networks" [130].

The architecture consists of two components: one performs the convolution and the other performs a deconvolution operation, as shown in Figure 2.20. The convolution component performs the operations of a typical Convolutional Neural Network (CNN), while the other component performs some of the operations in reverse on the generated output, to visualize the activations that resulted from applying the CNN to the input image.
