
MSc Computational Science

Master Thesis

Attentive Learning in Deep Neural Networks

by

Nikos Tsakmakidis

October 9, 2018

Daily Supervisor:

Prof. Dr. S. Bohte

Academic Supervisor and Examiner:

Dr. R. Quax

Second Assessor:

Dr. H. S. Scholte


Abstract

Past studies have shown that when humans perform visual tasks, they observe only the objects that they attend to. Although it is well known that humans use attention while performing tasks, it is not yet understood how this attention is used by humans, and by extension by animals, when they learn a task. On the other hand, we understand the learning process of artificial neural networks well, as they are widely used and studied on visual tasks. In this thesis, we train and test multiple deep classifiers on a dataset of images of 10 different classes of animals with a new 'attention' method, and we compare it with the default way of training. We also compare both the regular and our 'attention'-based deep classifiers with the behavior shown by humans in similar studies. The two steps of our method are: first, the images are filtered and the backgrounds are eliminated, keeping only the segmented object on a black background; second, the segmented object is centered in the image. With this method we approximate attention on the objects. The results of our experiments show that by eliminating the backgrounds and attending only to the objects we achieve better performance with our classifiers. To explain these results we visualize the attention maps of regular CNN classifiers, and our findings suggest that they learn to segment the images and find important discriminative features on the actual objects, which indicates that the background is not as important for the classification task as the object is. It also shows that the background may add unnecessary complexity to the problem, and by removing it the problem simplifies, which means that shallower networks and smaller datasets can achieve the same performance as deeper networks trained on bigger datasets of full images. Finally, we observe in image ablation experiments that both types of models show human-like behavior, with the attention models being the most similar. The accuracy of the models trained on full images drops slightly when background or low-significance features are removed, and it drops rapidly when parts of the object or important features are removed. The accuracy of the attention models behaves similarly for features located on the objects, but is not affected by the background.

Contents

1 Introduction
2 Background
  2.1 Feed-Forward Artificial Neural Networks
  2.2 Convolutional neural networks (CNNs)
    2.2.1 Convolution
    2.2.2 Network structure
    2.2.3 Back-propagation
  2.3 Pixel-wise Decomposition
  2.4 Layer-wise relevance propagation
  2.5 Pixel-wise Decomposition of Multilayer Networks
    2.5.1 Layer-wise relevance backpropagation
  2.6 Image-specific class saliency maps
  2.7 Mask R-CNN
  2.8 Residual networks
  2.9 Related work
    2.9.1 Human-DNN comparison
    2.9.2 Other Attention Models
    2.9.3 Visualizing deep features for understanding
3 Methods
  3.1 Masking and attending objects
  3.2 Centering the attended object
4 Model architectures
  4.1 Toy CNN
  4.2 ResNet Architectures
5 Experiments
  5.1 Training tests
  5.2 Training networks of different depth and on datasets of different size
  5.3 Attention visualization from activations
  5.4 Image ablation tests
  5.5 Training on random labels
6 Results
  6.1 The dataset
  6.2 Training tests
    6.2.1 Training tests - network depth
    6.2.2 Training tests - dataset size
  6.3 Image ablation tests
  6.4 Activation visualization
  6.5 Random labels CNN
7 Conclusion
8 Discussion - Future work
9 Appendix
  9.1 Training and validation accuracy plots
  9.2 Extra activations
  9.3 Image ablation examples

1 Introduction

Past studies have shown that humans, when asked to classify objects in images, find parts of the objects that are discriminative features [6], and their ability to classify those objects is linked to how many and which of those features are visible [1]. In general, if features of low importance are not visible, humans can perform the classification tasks with almost the same accuracy as when the full image is revealed. On the other hand, if important features are hidden, their accuracy drops rapidly in a non-linear way.

This process of feature selection can be considered a form of selective attention, which means that the subjects attend to features relevant to the task. Studies have shown that selective attention also influences learning in reinforcement learning tasks [19]. There are also many examples like the invisible gorilla experiment [24], where the subjects are instructed to keep track of the basketball passes made by one of the two teams. In this process most subjects completely miss the fact that there is a person dressed as a gorilla in the video. Generally, we know that attention influences both the tasks that humans perform and, in some cases, how they learn them. The question that all this research tries to answer is how the human brain learns tasks. In this thesis, we focus on the effects of attention on learning image classification tasks.

Although DNNs are widely used, as far as we know there has been no research conducted yet on how Deep Artificial Neural Networks (DNNs) [17], and especially deep Convolutional Neural Networks (CNNs), can utilize attention for training on image classification tasks. CNNs are inspired by the human visual system, and it is fitting to compare such models with humans. DNNs use a large number of layers to extract important features and learn to perform a task. Although this can be problematic in the sense that they are difficult or impossible to understand, this is less of an issue for CNNs trained on images, because we can visualize and inspect them layer by layer. Here, we test the similarities between attention CNNs and humans, as well as between regular CNNs and those trained using attention.

The questions we investigate are:

1. Can we train DNNs to perform similarly to humans on the image ablation tests?

2. What features do CNNs learn to attend to from an image dataset, and what are the differences in those features between CNNs that use attention and those that do not? We want to see these differences visually with activation-attention map visualization and with image ablation tests.

3. What is the importance of the background in image recognition tasks performed by DNNs? We want to know how the use of attention through background removal affects the training and performance of a DNN.

To investigate these questions, we run experiments with networks of different depths and architectures. First, we train two different kinds of models. One is a regular CNN classifier and the other is a CNN trained and tested on the same dataset as the previous one, but with the backgrounds of the images replaced by a black background. This method resembles attention, as the black background holds no information and the classifier attends only to the objects. For both model types we experiment with networks of different size, and we also train models on subsets of the data. We compare these two models and show their differences in performance during and after training.

Secondly, we visualize the attention maps[4] of the previously trained DNNs to show and understand what parts of the images they attend to, and how this compares to the attention to the important features that humans express. In addition, we train a deep classifier on data with randomized labels to compare its attention maps to the other classifiers.

Finally, we conduct image ablation experiments on the trained classifiers with four different methods of ordering the pixel removal. The goal again is to compare the results with the human experiments, and also to compare the two different classifier models with each other.

In the first and second experiments, we find that the background plays no significant role in classification tasks. Classifiers learn to segment the images and isolate the object that is to be classified; the background is used only to learn segmentation. We also find that the models trained on segmented images converge faster and reach the same or better performance as the models trained on the full images, and this advantage grows when we use fewer resources (network depth, dataset size). From the attention map visualization, which is simply a summation of the corresponding layer's filter activations, we see that DNNs perform the task of segmentation in the first layers and focus more on features of the actual object in deeper layers. The Clicktionary experiments [6], performed with human subjects, showed which features humans find important in an object classification task. We find that DNNs are comparable to humans in the way they choose important features in a classification task: in the attention maps we see that DNNs rank different parts of an image and make a decision based on the most important features, which are located on the objects and not in the background.

Human studies [1] show that human accuracy in an image recognition task with two classes, animal and distractor, drops non-linearly as pixels are hidden from the least important to the most important. We observe the same behavior for both DNN model types, segmented and non-segmented, as we hide pixels in order from the least important to the most important. Finally, we see that the models trained on segmented images have slightly different performance curves, which is expected: all pixels outside the objects carry zero importance, so removing them causes no performance loss. In contrast, the models trained on full images lose some accuracy from the background, although the most important pixels are always on the object. We suspect that this small drop in performance occurs due to the limited ability of our DNNs to perfectly learn the segmentation task.


Before we move on to the other sections, we need to introduce some basic background information on what artificial neural networks are, which types of them we use later on, and how they are trained on a task.

2 Background

In this section we briefly give some background knowledge from the literature that is necessary to continue further. In this thesis we train attention and regular models with different deep neural network architectures, and we study the properties and differences of the two model types using different visualization methods. The topics up to Section 2.6 cover general knowledge about artificial neural networks and two methods of visualizing them. The last two topics, 2.7 and 2.8, are about important network architectures that we make use of in this work.

2.1 Feed-Forward Artificial Neural Networks

Artificial neural networks (ANNs) [17] are computational systems inspired by the biological neural networks that constitute animal and human brains. They can learn to perform tasks from multiple examples, without task-specific programming. An ANN is composed of multiple simpler components connected to each other, which are called artificial neurons. The neurons usually compute a simple function such as a multiplication, or a multiplication followed by an addition. Note that any function can be used as a neuron, but simple functions are preferred, since the desired complexity arises from the network structure.

The connections between the neurons are weighted, and those weights w are the parameters that an ANN adjusts during the learning phase to approximate the solution to the problem. The neurons are organized in layers, and the network can become more complex and better approximate non-linear functions by adding more layers. Each neuron has an activation function f, which is usually a ReLU [17] for the hidden units; for the output layer, a sigmoid function is used for binary tasks and a softmax [17] function for multi-class classification tasks. Finally, ANNs usually have a bias neuron for each hidden layer, which is a trainable parameter that is added to the other neurons before the activation function.

A feed-forward neural network [17] is an ANN in which information flows one way, from the input layer to the output layer, without recurrent connections. Recurrent connections link the output of a layer back to a previous layer, which allows information to flow both ways in a network and gives it memory properties. Such connections are used to build recurrent neural networks, which are models used in tasks that require memory and are therefore not relevant to our task.

A common feed-forward architecture is the multi-layer perceptron (MLP). The units that constitute an MLP are called perceptrons, and they are simple neurons that perform a weighted multiplication followed by an activation function:

y = f(w^T x + b),   (1)

where f is the activation function, w is the weight matrix, x is the input vector, and b is the bias vector. The input vector x for the units located in the hidden layers is the output vector of the previous layer. A simple MLP architecture can be seen in figure 1.

Figure 1: One hidden layer MLP. The weights are shown on the connections, the filled nodes are the neurons and the extra (empty nodes) are the biases. [21]

2.2 Convolutional neural networks (CNNs)

2.2.1 Convolution

Before diving into the convolutional neural networks[17] we have to define the convolution operation. Data like images are by nature stationary, which means that the statistics of a section of an image are the same as in any other section of that image. This property creates an opportunity to learn small features (usually of the size 3 × 3 pixels), which can be translated to the whole image. These features are called filters and after they are learned, they give different feature activation values per location of the image, when applied. The operation of applying these filters on an image is the convolution and it is performed as follows:

Each 3 × 3 filter slides over the image and is multiplied element-wise with the local 3 × 3 patch of image values. The final value of the convolution at each location is the sum of all the elements of the resulting 3 × 3 matrix. The convolution operation S of a two-dimensional kernel K with a two-dimensional image I(m, n) is defined as,

S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n) K(i - m, j - n),   (2)

where (i, j) are the indices of the 2-D convolution and (m, n) are the indices of the image array. An example of the filter applied to an area of the image can be seen in Figure 2. When applied this way, the filter produces an array that is smaller than the input, with the reduction depending on the filter size. If this is not desirable, zero padding is used to keep the dimensions similar to the input, or to prevent the output from becoming too small when the stride is larger than 1. Zero padding that keeps the output dimensions similar to the previous case, but with a stride of 2, can be seen in Figure 3.
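
To make the operation concrete, here is a small NumPy sketch of a single-channel 'valid' convolution with a configurable stride (our own illustration, not code from the thesis; as in most CNN libraries, the kernel is applied without flipping, i.e. as cross-correlation):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2-D convolution of a single-channel image with a small kernel.

    As described in the text, the kernel slides over the image and, at every
    position, the element-wise products are summed into one output value.
    """
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Example: a 3x3 filter applied to a 6x6 image with stride 1 gives a 4x4 output.
img = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
print(conv2d(img, edge_filter, stride=1).shape)  # (4, 4)
```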

Figure 2: The filter slides across the image with stride = 1 and produces a new array, which has smaller size than the input. [20]

2.2.2 Network structure

Convolutional neural networks (CNNs) [17] are feed-forward neural networks that are mostly used for tasks that involve images, for example image classification, segmentation, and region proposal. They consist of layers of filters as described above. Unlike fully connected neural networks, CNNs share the weights that are applied to the input, as each filter slides over the whole image and updates the same weights for all areas. The weight-sharing property is what makes CNNs capable of excellent performance on image-related tasks. Fully connected networks (MLPs), on the other hand, need different weights for each individual pixel of the image, which results in a huge number of parameters; thus they become impractical or even impossible to train.


Figure 3: The filter slides across the image with stride = 2 and produces a new array, which is the same size as in Figure 2 due to zero padding. [20]

For example, in our case we have inputs of size 128 × 128 × 3 = 49152 and we apply 3 × 3 convolutions. In the first hidden layer the number of parameters per filter is 3 × 3 × 3 = 27, which is the filter size times the 3 channels of the input. For the 32 filters our Toy CNN has, this gives 864 parameters in the first hidden layer, which is a small number compared to the input size. In comparison, an MLP with 10 hidden neurons would need 491520 parameters, which is a very large number for what is still a much weaker model. The output dimensions of each convolution layer are given by,

height = (input_height - filter_height) / stride + 1,   (3)

width = (input_width - filter_width) / stride + 1,   (4)

depth = #filters.   (5)
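
Equations (3)-(5) can be wrapped in a small helper to check layer sizes; a sketch assuming square inputs and filters and no zero padding:

```python
def conv_output_shape(input_size, filter_size, stride, n_filters):
    """Output (height, width, depth) of a conv layer per Eqs. (3)-(5), no padding."""
    side = (input_size - filter_size) // stride + 1
    return side, side, n_filters

# First conv layer of the Toy CNN described later: 128x128 input, 3x3 filters,
# stride 1, 32 filters. With zero padding the spatial size would stay 128x128.
print(conv_output_shape(128, 3, 1, 32))  # (126, 126, 32)
```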

The dimensions of the output are always slightly smaller than the input, but we usually want an even bigger reduction in the total number of parameters of the network. In this case, we use pooling layers such as MaxPool [17]. A demonstration of a CNN that uses all the layers discussed above is shown in Figure 4.

Finally, each convolution layer is able to capture different features of the input image. For example, the filters in the first layers usually capture simple features like lines or corners, whereas deeper layers capture more complicated structures like eyes, noses, and ears.

Figure 4: Convolutional neural network structure. [22]

2.2.3 Back-propagation

The weights of an artificial neural network are its trainable parameters. These parameters have to be trained in a way such that the ANN approximates the solution to the given problem and minimizes a loss function suitable for the problem. The most common training method is the back-propagation algorithm [17]. This method propagates the calculated loss of the forward pass backwards through the hidden layers until the bottom layer. A simple example of back-propagation for an ANN with one input, one linear output, and one hidden unit with activation function f is:

1. First, the input x is given to the network, and the output of the forward pass t' is the approximation of the target t:

h_{input} = w_1 x,   (6)

h_{output} = f(h_{input}),   (7)

t' = w_2 h_{output},   (8)

where w_1 and w_2 are the weights, and h is the hidden unit.

2. Secondly, the error of the output layer is calculated,

E = t' - t.   (9)

3. The error is then back-propagated through the hidden units, and we find the weight updates with the help of the chain rule as,

\delta w_2 = \frac{\partial E}{\partial w_2} = \frac{\partial E}{\partial t'} \frac{\partial t'}{\partial w_2},   (10)

\delta w_1 = \frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial t'} \frac{\partial t'}{\partial h_{output}} \frac{\partial h_{output}}{\partial h_{input}} \frac{\partial h_{input}}{\partial w_1}.   (11)

4. Finally, the weights are updated using these derivatives, depending on the method we choose. For example, with the gradient descent method:

w_1^{new} = w_1^{old} - \alpha \, \delta w_1,   (12)

w_2^{new} = w_2^{old} - \alpha \, \delta w_2,   (13)

where \alpha is the learning rate.

Likewise, we can expand this to multiple hidden layers and apply steps 3 and 4 to update each hidden layer's weights. In practice, we train DNNs on small portions of the dataset called batches, which usually contain 32 or 64 samples. The weights are updated once after every batch, and the process continues until we reach the desired accuracy. This usually takes multiple passes through the dataset, which we call training epochs.
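
The following NumPy sketch walks through steps 1-4 for the one-hidden-unit network above (our own illustration, not thesis code). It assumes a ReLU activation f and a squared-error loss L = ½(t' − t)², so the factor (t' − t) appears in both weight updates; the numerical values are arbitrary.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return float(z > 0)

# Illustrative values for the input, target, weights, and learning rate.
x, t = 2.0, 1.0
w1, w2, a = 0.5, -0.3, 0.1

# 1. Forward pass (Eqs. 6-8).
h_in = w1 * x
h_out = relu(h_in)
t_prime = w2 * h_out

# 2. Output error (Eq. 9).
E = t_prime - t

# 3. Back-propagate with the chain rule (Eqs. 10-11),
#    here for the squared-error loss, so each update carries the factor E.
d_w2 = E * h_out
d_w1 = E * w2 * relu_grad(h_in) * x

# 4. Gradient descent update (Eqs. 12-13).
w2 -= a * d_w2
w1 -= a * d_w1
print(w1, w2)
```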

2.3 Pixel-wise Decomposition

To understand which parts of an image are important for the decision of a DNN classifier, we create a 2-D importance map for the image. This importance map has the same dimensions as the input picture, and each value on the map corresponds to the importance of one of the image's pixels. An importance map can be created with pixel-wise decomposition [5]. Pixel-wise decomposition is a method that explains the contribution of each pixel of an image x to the prediction f(x) that a classifier f makes in an image classification task. The decomposition of the prediction f(x) can be expressed as a sum of terms over the separate input pixels x_d:

f(x) \approx \sum_{d=1}^{V} R_d,   (14)

where R_d is the relevance of each pixel. A negative relevance indicates evidence against the presence of the object class to be classified, and a positive relevance indicates evidence for its presence. A good way to achieve pixel-wise decomposition is layer-wise relevance propagation.

2.4 Layer-wise relevance propagation

DNNs are composed of many layers. Before reaching the input, the relevance has to be propagated through all layers, from the output back to the input. This is done as described below.

A classifier can be decomposed into several layers [5], where the first layer is the input pixels and the last layer is the output prediction of the classifier f. The l-th layer is a vector z = (z_d^{(l)})_{d=1}^{V(l)} with dimensionality V(l). For each dimension z_d^{(l+1)} of the vector z at layer l + 1 there is a relevance score R_d^{(l+1)}. Thus, the score R_d^{(l)} for each dimension z_d^{(l)} of the vector z at layer l, where l is a layer closer to the input, satisfies

f(x) = \dots = \sum_{d \in l+1} R_d^{(l+1)} = \sum_{d \in l} R_d^{(l)} = \dots = \sum_{d} R_d^{(1)}.   (15)

This equation is then iterated from the classifier to the input and the Relevance score is calculated for each pixel of the image. Finally, each Relevance score is linked to the previous scores as follows,

R_k^{(l+1)} = \sum_{i:\, i \text{ is input for neuron } k} R_{i \leftarrow k}^{(l,l+1)},   (16)

where R_{i \leftarrow k}^{(l,l+1)} denotes the relevance messages sent over the connections from neuron k in layer l + 1 to its inputs in layer l.

2.5 Pixel-wise Decomposition of Multilayer Networks

Now that we have seen how the relevance of one layer is projected to the previous layer, we can explore how the pixel decomposition is used in a DNN.

A layer of a deep neural network is mapped to the next layer [5] as follows,

z_{ij} = x_i w_{ij},   (17)

z_j = \sum_i z_{ij} + b_j,   (18)

x_j = g(z_j),   (19)

where x_i is the neuron corresponding to the pixel activations, w_{ij} is the weight connecting neuron x_i to neuron x_j, b_j is the bias, and g is the activation function.

2.5.1 Layer-wise relevance backpropagation

This section describes how the relevances described above are back-propagated through a deep neural network. The steps of the back-propagation [5] are:

Given the relevance R_j^{(l+1)} of a neuron j for the classification decision f(x), we decompose it into part-relevances R_{i \leftarrow j} sent to the neurons of the previous layer. We use the property shown by Eq. (15), where the total relevance should be conserved. For a linear network f(x) = \sum_i z_{ij}, where the relevance is R_j = f(x), the decomposition is R_{i \leftarrow j} = z_{ij}. Of course our neural networks are not linear, since they use the ReLU as a non-linearity, but the z_{ij} can still reasonably measure the relative contribution of each neuron. The propagation rule is,

R_{i \leftarrow j}^{(l,l+1)} = \begin{cases} \frac{z_{ij}}{z_j + \epsilon} R_j^{(l+1)} & z_j \ge 0 \\ \frac{z_{ij}}{z_j - \epsilon} R_j^{(l+1)} & z_j < 0 \end{cases}   (20)

and, by Eq. (16),

\sum_i R_{i \leftarrow j}^{(l,l+1)} = \begin{cases} \left(1 - \frac{b_j + \epsilon}{z_j + \epsilon}\right) R_j^{(l+1)} & z_j \ge 0 \\ \left(1 - \frac{b_j - \epsilon}{z_j - \epsilon}\right) R_j^{(l+1)} & z_j < 0 \end{cases}   (21)

where \epsilon \ge 0 is a predefined stabilizer and b_j is the bias.

The output of this method is a heatmap of pixel relevances for each of the three channels of an RGB image. Adding those three values per pixel we create a 2D relevance map that indicates the importance of each pixel to the decision of the classifier.
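
As an illustration of the propagation rule, the sketch below applies the ε-rule of Eq. (20) to a single fully connected layer in NumPy (our own sketch under the notation above, not the thesis implementation): `x` holds the layer inputs, `W` and `b` the weights and bias, and `R_out` the relevances arriving from the layer above.

```python
import numpy as np

def lrp_epsilon(x, W, b, R_out, eps=1e-6):
    """Redistribute the relevance R_out of the output neurons to the input
    neurons of one fully connected layer using the epsilon rule of Eq. (20)."""
    z_ij = x[:, None] * W                               # contributions z_ij = x_i * w_ij (Eq. 17)
    z_j = z_ij.sum(axis=0) + b                          # pre-activations z_j (Eq. 18)
    denom = z_j + eps * np.where(z_j >= 0, 1.0, -1.0)   # stabilized denominator of Eq. (20)
    # messages R_{i<-j} = z_ij / denom_j * R_j, summed over j as in Eq. (16)
    return (z_ij / denom) @ R_out

# Toy example: 3 inputs, 2 output neurons, all relevance on the first output.
x = np.array([1.0, 0.5, -0.2])
W = np.array([[0.3, -0.1], [0.2, 0.4], [-0.5, 0.1]])
b = np.array([0.05, -0.02])
R_out = np.array([1.0, 0.0])
print(lrp_epsilon(x, W, b, R_out))
```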

2.6 Image-specific class saliency maps

Another way of visualizing the importance of each pixel of a specific image is the saliency map [3]. Given a CNN classifier and an image I_0 with its class label c, there is a score function S_c(I) that leads to the classifier's choice. The goal is to measure the relevance of each pixel of the image I_0 to the final score S_c(I_0). In a simple linear case, where

S_c(I) = w_c^T I + b_c,   (22)

with b_c the bias and w_c the weights, the relevance scores are equivalent to the magnitudes of the weights w_c.

Although that is not the case for deep CNNs, because they are not linear, we can still approximate the score S_c(I) in the neighborhood of I_0 with the first-order Taylor expansion,

S_c(I) \approx w^T I + b,   (23)

where

w = \left. \frac{\partial S_c}{\partial I} \right|_{I_0}.   (24)

To calculate the saliency map, we first obtain the derivative w with back-propagation. For an RGB image, the saliency map M is then given by,

M_{ij} = \max_c |w_{h(i,j,c)}|,   (25)

where h(i, j, c) is the index of the element of w corresponding to row i, column j, and color channel c.
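
A hedged sketch of Eqs. (22)-(25) using TensorFlow 2 / Keras (the thesis used Keras, but the exact API and preprocessing here are our assumptions): the gradient of the class score with respect to the input image is computed, and the per-pixel maximum of its absolute value over the color channels gives the saliency map.

```python
import numpy as np
import tensorflow as tf

def saliency_map(model, image, class_idx):
    """Image-specific class saliency map M_ij = max_c |dS_c/dI| (Eq. 25).

    `model` is a Keras classifier, `image` a single HxWx3 array already
    preprocessed for the model, `class_idx` the class whose score S_c we use.
    """
    img = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(img)
        score = model(img)[:, class_idx]          # S_c(I)
    grads = tape.gradient(score, img)[0]          # w = dS_c/dI at I_0 (Eq. 24)
    return np.max(np.abs(grads.numpy()), axis=-1)  # max over the color channels
```

`saliency_map(model, x, c)` returns a 2-D array with the same spatial size as the input image.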

2.7 Mask R-CNN

Our attention method uses a pre-trained CNN to segment images and attend to one object before passing it to the classifier. For the segmentation task we chose the pre-trained Mask R-CNN.

The Mask R-CNN [10] is an extension of the Faster R-CNN [16] with the addition of a mask output. The latter consists of two stages. In the first stage, a Region Proposal Network proposes candidate bounding boxes. In the second stage, features are extracted from each candidate box using RoIPool to perform classification and bounding-box regression. The Mask R-CNN generates a binary mask for each RoI in parallel with the other two tasks in the second stage. The total multi-task loss is defined as L = L_{cls} + L_{box} + L_{mask}, where L_{mask} is the average binary cross-entropy loss. The goal of this approach is to decouple mask and class prediction, so the masks can be generated independently for every class and without competition between classes.

2.8 Residual networks

In this section we give an introduction to Residual Networks, which is the CNN architecture that our models are based on.

Deep neural networks become harder to train as the depth increases and more layers are added [7]. On the other hand, we need the depth to improve performance and reach human-like or even better accuracy [7]. Deep residual networks, or ResNets [7], were proposed to reduce the training difficulties and allow us to build very deep neural networks. The idea behind ResNets is that instead of trying to learn the exact function H(x) that fits the data with a stack of layers, we can learn the residual function F(x) = H(x) - x. The initial mapping then takes the form F(x) + x, which can be realized as shortcut connections that skip one or more layers. Those skip connections do not add new parameters to the network, as they only perform an identity mapping whose output is added to the output of the stacked layers. This type of network has been shown to be easier to optimize and to achieve very high accuracies on well-known problems. In particular, an ensemble of ResNets achieved a 3.57% error on ImageNet [2], winning 1st place in the ILSVRC 2015 classification task.

Although deep residual networks can be very successful and can improve their performance as their depth increases, they need an enormous increase in the number of layers to gain even a fraction of a percentage point in accuracy. This results in very deep neural networks that take a long time to train. The Wide ResNet [8] was proposed to tackle this problem. The architecture is similar to a ResNet, but it is shallower and wider, with the width added in the form of more filters per layer. These networks have been shown to achieve better accuracy and efficiency than all standard ResNets.
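
A minimal Keras sketch of the basic residual block with an identity shortcut, in the pre-activation form with batch normalization shown in Figure 6 (filter counts and layer choices here are illustrative assumptions, not the exact thesis code):

```python
from tensorflow.keras import layers

def basic_block(x, filters):
    """Pre-activation residual block: (BN -> ReLU -> 3x3 conv) twice, plus identity shortcut.

    Assumes the input already has `filters` channels; otherwise a 1x1
    projection shortcut would be needed before the addition.
    """
    shortcut = x
    y = layers.BatchNormalization()(x)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Add()([shortcut, y])  # F(x) + x
```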

2.9 Related work

2.9.1 Human-DNN comparison

State-of-the-art deep convolutional neural networks reach accuracies close to, or sometimes even better than, human performance on image recognition tasks. It is also known that these CNNs use different visual strategies than humans do. This has been demonstrated by comparing the features that a CNN (VGG) attends to with the human importance maps derived, for instance, from the Clicktionary [6] game. Humans were tested [1] on a visual categorization task with two cases, animal or not animal, where the images were masked to hide a proportion of the visual features based on the Clicktionary attention maps. The results showed that human accuracy improves non-linearly as the percentage of revealed features increases, whereas a regular CNN increases its accuracy linearly with the percentage of features revealed. Finally, it was shown that a CNN trained with the human feature-importance maps seems to learn more features from the object in the image than from the background. Although these tests give an intuition about what the difference between humans and DNNs might be, they use only one CNN architecture, the VGG-16. For that reason, we go further into this topic and explore more CNN architectures for more conclusive results.

2.9.2 Other Attention Models

Attention methods have been used for multiple classification tasks, such as fine-grained image classification [11][12]. The latter is the ability to classify hundreds of sub-categories belonging to the same basic-level category. These models have two levels of attention: object-level attention, which detects objects in images, and part-level attention, which selects discriminative parts of the object. The two attention levels work together for the prediction. The model of [12] outperformed many state-of-the-art methods on the CUB-200-2011, Cars-196, Oxford-IIIT Pet, and Oxford-Flower-102 datasets.

Other applications of attention models are the Mask RCNN [10], which uses a detection system before the classifier, and work that has been done on vehicle detection from aerial images [14], where the images are segmented first and then the individual segments are given to the classifier for classification.

All the above use attention methods based either on bounding boxes or on masks, but they are all trained on full, unsegmented images, the attention is part of the network, and they do not explore the option of training on a segmented, already attended dataset. The latter is the method we implement and evaluate in this thesis: we use a separate CNN to segment and attend to the objects, and we create new, attended data.

2.9.3 Visualizing deep features for understanding

Deep neural networks can generate correct and successful solutions and results for many problems but due to their complex architectures, they can still be considered as black boxes. It is important to shine light inside these black boxes because it allows humans to better understand the reasoning of the system and gives important information to the user. Two methods that give us a good understanding of an image classifier are the Pixel-wise explanation by Layer-wise relevance propagation[5], and Saliency maps[3]. Both methods generate a heatmap that indicates the importance of each pixel of a given image. If the class of this image is also given, then the heatmap generated shows the contribution of each pixel to the decision of the classifier. Finally, the heatmaps can be viewed by a human expert, who can analyze them and get a good understanding of the classifier’s inner workings.

All these methods use the activation information to calculate a heatmap, which shows the features that the classifier attends to. If the activations of each layer are combined, they give us the attention map [4] of that layer, which yields information about the attention preferences of the layer. In this work, we use the attention maps to show what each layer attends to, and to compare the attention maps of a DNN trained normally on 10 class labels with those of a DNN trained on random labels [15]. The latter is done to explore what features a DNN learns when it is trained on a dataset that has no class structure. We expect the classifier to attend to different features in each image, or to attend to all objects that appear in it.


Figure 5: Mask R-CNN mask generation example.

3 Methods

In this section we discuss in detail the attention method that we applied in this project.


3.1 Masking and attending objects

Class      Train   Test
Bird 1     944     236
Bird 2     867     217
Bird 3     912     227
Bird 4     455     114
Elephant   700     175
Bird 5     421     105
Horse      290     72
Bird 6     930     233
Bird 7     462     241
Zebra      916     230

Table 1: Dataset classes and number of training and test samples.

Our attention model works in three steps. In the first step, we generate masks of the objects in every image of the dataset using the Mask R-CNN model [10] pre-trained on the COCO [9] dataset. In the second step, we choose, for each picture, the mask generated with the highest confidence, given the probability scores and the class names produced by the pre-trained model. For example, if the class is bird, we choose the mask with the highest score among all objects classified as bird. Samples for which the Mask R-CNN generates no mask, or only masks with a probability of less than 0.98 of belonging to the desired class, are discarded. Visual inspection showed that masks produced by the Mask R-CNN with a probability lower than 0.98 are at risk of being low quality or lying entirely on wrong objects. For example, in Figure 5 we see that the masks labeled cow and elephant are given with a low probability and are wrong, whereas the masks labeled bird with high probability are correct. Note that the best mask is the one with the highest probability for bird, which is the mask we chose. This filtering slightly simplifies the data by removing 'hard' examples; it might slightly affect the results, but we choose not to take this effect into consideration. Finally, we apply the masks to the original images and set every pixel outside the mask to zero. By doing this, we 'filter' the original data and create a new dataset (Table 1) in which only one object is shown per image and the background is the same (black) for all images. This leads to simpler and easier-to-learn images, because complex backgrounds with many objects, like ImageNet's [2] backgrounds, are removed.
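
A sketch of this filtering step, assuming the Mask R-CNN detections have already been collected into Python lists of class names, confidence scores, and boolean masks (this output format, and the helper itself, are our own assumptions for illustration):

```python
import numpy as np

def attend_object(image, class_names, scores, masks, target_class, threshold=0.98):
    """Keep only the best-scoring mask of `target_class` and black out the rest.

    Returns None when no mask of the target class reaches the confidence
    threshold, in which case the sample is discarded as described above.
    """
    best_idx, best_score = None, threshold
    for i, (name, score) in enumerate(zip(class_names, scores)):
        if name == target_class and score >= best_score:
            best_idx, best_score = i, score
    if best_idx is None:
        return None
    mask = masks[best_idx].astype(bool)   # HxW boolean object mask
    attended = np.zeros_like(image)
    attended[mask] = image[mask]          # pixels outside the mask stay zero (black)
    return attended
```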

3.2 Centering the attended object

From studies [18] and common knowledge we know that attention usually also means centering the object in the visual field. The centering in our model is achieved by finding the center of mass of the image array. Since every pixel other than the object's pixels is zero, the center of mass of the array is always the center of mass of the object we attend to. After the object's center is found, we calculate the distance vector from the center of the image array. Next, we move the object so that its center of mass lies at the center of the array, while keeping the same dimensions. This step has only a slight effect on accuracy, because CNNs are in general position invariant. We decided to keep this filtering step, since it showed an effect, and because it is part of attention and computationally cheap.
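
A sketch of the centering step using the center of mass of the already masked image, with `scipy.ndimage` assumed to be available; the shift keeps the array dimensions and fills with zeros (black):

```python
import numpy as np
from scipy import ndimage

def center_object(attended):
    """Shift the attended image so the object's center of mass lies at the image center."""
    intensity = attended.sum(axis=-1) if attended.ndim == 3 else attended
    cy, cx = ndimage.center_of_mass(intensity)  # only the object contributes: background is zero
    h, w = intensity.shape
    dy, dx = int(round(h / 2 - cy)), int(round(w / 2 - cx))
    shift = (dy, dx, 0) if attended.ndim == 3 else (dy, dx)
    return ndimage.shift(attended, shift, order=0, mode="constant", cval=0.0)
```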

Layer   | Wide ResNet                                | Toy CNN
conv1   | 3×3, 16, stride 1, output 128×128          | [3×3, 32] × 2, stride 1, output 128×128; MaxPool 2×2, output 64×64
conv2   | [3×3, 128; 3×3, 128] × 2, output 128×128   | [3×3, 64] × 2, stride 1, output 64×64; MaxPool 2×2, output 32×32
conv3   | [3×3, 256; 3×3, 256] × 2, output 64×64     | [3×3, 128] × 2, stride 1, output 32×32; MaxPool 2×2, output 16×16
conv4   | [3×3, 512; 3×3, 512] × 2, output 32×32     | -
final   | Average pool; fully connected 10-d softmax | Dense 512, Dense 1024; fully connected 10-d softmax

Table 2: Wide ResNet and Toy CNN architectures. The values indicate the kernel size of the layer, the number of filters, the stride (stride = 1 if not given), and the output size; brackets denote a residual block of two convolution layers, with the multiplier giving how many times the block is repeated.

4 Model architectures

In this section we analyze the architectures of the deep neural networks used to build our models.

4.1 Toy CNN

First, we build a small CNN similar to a scaled down AlexNet[13], with 6 convolution layers, 2 fully connected layers, and an output layer. The model architecture is shown in Table 2. The number of filters doubles after every max pooling layer starting from 32 and ending at 128. To avoid over-fitting because of the small data size, we apply dropout with probability 0.20. We use the Adam optimizer with a starting learning rate of 0.001 and the default settings of Keras. We train the toy model on both the attended and original data on batches of 64 images for 100 epochs.
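
A hedged Keras sketch of a Toy CNN of this kind, following Table 2 (three pairs of 3×3 convolutions with 32/64/128 filters, max pooling, dropout 0.2, two dense layers, and a 10-way softmax); details such as padding and the exact placement of dropout are our assumptions:

```python
from tensorflow.keras import layers, models, optimizers

def build_toy_cnn(input_shape=(128, 128, 3), n_classes=10, dropout=0.2):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128):          # three conv blocks, filters doubling after each pool
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Dropout(dropout)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dense(1024, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```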


4.2 ResNet Architectures

Figure 6: Basic ResNet and Wide ResNet block [23]. Each block has two batch normalization layers and two weight layers. The weight layers are convolution layers as described in Tables 2 and 3.

We choose ResNet [7] architectures to perform the experiments, as they are considered close to state-of-the-art and perform very well on datasets like ImageNet [2], of which our dataset is a subset. Another reason for our choice is that ResNets are very convenient, because they can be scaled up and down in size by adding or removing their basic building blocks. For our experiments we choose ResNet-6, ResNet-10, ResNet-18, ResNet-34, and a variant of the ResNet architecture, the Wide ResNet-16-8 [8], where the numbers in the names indicate the number of activation layers in the network. The latter seems more promising for high-accuracy experiments, as these networks are shallower (fewer layers) than the ordinary ResNets and tend to perform better on tasks like CIFAR-10. We use the basic block with batch normalization, as shown in Figure 6, because it is well suited to smaller networks, as suggested in the ResNet and Wide ResNet literature. The architectures can be seen in Tables 2 and 3. We train all model architectures on the attended and original data, on batches of 64 images for the ResNets and batches of 16 images for the Wide ResNet, for 100 epochs. The smaller batch size for the Wide ResNet is chosen due to GPU memory constraints.

5 Experiments

5.1 Training tests

First, we train the different models for 100 epochs and monitor their accuracy on the validation set after each epoch. We reinitialize the networks and repeat the process for 10 different random seeds to obtain statistical results. We also monitor the training loss (see Appendix) to ensure that there is no over-fitting. Finally, we compare the performance of the attention models to that of the non-attentive models on their respective datasets.

Layer   | Output size | ResNet6                   | ResNet10                  | ResNet18                  | ResNet34
conv1   | 64×64       | 7×7, 64, stride 2 (all networks)
conv2   | 32×32       | [3×3, 64; 3×3, 64] × 2    | [3×3, 64; 3×3, 64] × 1    | [3×3, 64; 3×3, 64] × 2    | [3×3, 64; 3×3, 64] × 3
conv3   | 16×16       | -                         | [3×3, 128; 3×3, 128] × 1  | [3×3, 128; 3×3, 128] × 2  | [3×3, 128; 3×3, 128] × 4
conv4   | 8×8         | -                         | [3×3, 256; 3×3, 256] × 1  | [3×3, 256; 3×3, 256] × 2  | [3×3, 256; 3×3, 256] × 6
conv5   | 4×4         | -                         | [3×3, 512; 3×3, 512] × 1  | [3×3, 512; 3×3, 512] × 2  | [3×3, 512; 3×3, 512] × 3
final   |             | Average pool, Dense 10-d softmax (all networks)

Table 3: ResNet architectures. The values indicate the kernel size of the layer, the number of filters, and the stride (stride = 1 if not given); brackets denote a residual block of two convolution layers, with the multiplier giving how many times the block is repeated.

5.2 Training networks of different depth and on datasets of different size

To better understand the advantages that attention has over regular models, we train networks of different depths on the attended and non-attended datasets, and we monitor their accuracy during training, the best performance achieved, the mean of the accuracies recorded over the last 10 epochs of training, and the difference between those accuracies. The neural network architectures used for this task are ResNet-6, ResNet-10, ResNet-18, and ResNet-34.

Another way to see the effects of attention is by gradually reducing the dataset size and monitoring how the two model types perform with fewer data samples. This is achieved by reducing the size of the dataset by 50% and by 70%, and then training on the remaining data. This process is repeated for 10 random seeds. The architectures we chose for this task are ResNet-10, ResNet-34, the Toy model, and Wide ResNet-16-8, as we wanted to test all the different types of classifiers.

5.3 Attention visualization from activations

Visualizing the filter activations of each convolution layer of a CNN gives heatmaps that show which features of the given image the corresponding filter attends to. These heatmaps can sometimes be informative, but usually they are not helpful enough on their own: they tend to be repetitive, sometimes unreadable, and individually they do not carry enough information about the importance of each feature.

To better understand the inner workings of our CNN models and to confirm the previous assumptions, we visualize the attention maps [4] of each convolution layer. To visualize the attention map, we first extract all the filter activations of a layer, which form one 2-D array per filter. Then we take the absolute value of those arrays, as for this purpose it is the magnitude of the activations that matters, not their sign. Finally, we sum all the filters together to create the attention map for the specific image.
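
A sketch of this attention-map computation, assuming the layer activations have already been extracted as a NumPy array of shape (height, width, n_filters):

```python
import numpy as np

def attention_map(activations):
    """Sum of the absolute filter activations of one layer, as described above."""
    return np.abs(activations).sum(axis=-1)  # shape: (height, width)
```

In Keras, such intermediate activations can be obtained, for instance, with a helper model like `Model(inputs=model.input, outputs=model.get_layer(layer_name).output)`.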

Additionally, we create a small test set of 900 images along with the masks of the objects in these images, using the Mask R-CNN as the mask generator. Next, we calculate the ratio R of the average activation intensity inside the mask, I_{in}, over the average activation intensity outside the mask, I_{out}:

R = \frac{I_{in}}{I_{out}},   (26)

where the average intensity is calculated as the sum of all activation values i_n in the area divided by the number of pixels N of the area,

I = \frac{\sum_{n=1}^{N} i_n}{N}.   (27)

This ratio is normalized by the number of activations; the unnormalized version is the same but without the division by N. For this last step we use the activation maps given by the hidden layers of ResNet6, ResNet18, the Wide ResNet, and the Toy model.
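
A sketch of the ratio of Eqs. (26)-(27), given an attention map and a boolean object mask of the same spatial size (resizing the mask to the feature-map resolution is assumed to have been done beforehand):

```python
import numpy as np

def intensity_ratio(att_map, mask, normalized=True):
    """Ratio of activation intensity inside vs. outside the object mask (Eq. 26)."""
    inside, outside = att_map[mask], att_map[~mask]
    if normalized:                        # mean intensity per pixel (Eq. 27)
        return inside.mean() / outside.mean()
    return inside.sum() / outside.sum()   # unnormalized variant: plain sums
```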

In addition, we train ResNet18 on random labels [15] and visualize the attention maps of the first 7 layers, as described above, to compare the normally trained classifiers with a classifier that memorizes (overfits) each image separately instead of generalizing (more details below).

5.4 Image ablation tests

To better understand which features each of our models gives high importance to, and how these features are related to the objects and the background, we carry out image ablation tests.

For these tests we first simply test the trained models on test sets that contain partially hidden images. We start from the completely revealed image and gradually hide parts of it, starting from the pixels most distant from the center of the object we want to classify. The values of the hidden pixels are replaced with the average value of all removed pixels for the original images, and with zero for the filtered images. We do this to keep the overall intensity of the images constant. Examples are shown in Figure 7.

To get more information on the differences created by segmenting before training, we create test sets of images masked according to: the importance maps given by the activations of the last convolution layer of each network, the importance maps given by the elrp method [5], and the important pixels highlighted by the saliency maps [3]. After the pixels are ranked by importance, we ablate the image by hiding pixels from the lowest importance value to the highest,


Figure 7: Simple (naive) image ablation examples.

and we measure the accuracy of the two networks for different percentages of hidden pixels. This experiment is repeated for 10 different networks per architecture. The reason for using three different methods to define pixel importance is, first, to compare these methods, and secondly, to ensure that the resulting importance maps are robust. Figure 8 shows some examples of image ablation using elrp on the Wide ResNet. Note that this figure is a demonstration of the method; in the experiment we replace removed pixels with black only for the attended data, and with the average intensity of the removed pixels for the normal data. More examples with other methods and other networks can be found in the appendix.
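
A sketch of one importance-ranked ablation step: given an importance map, the fraction p of least important pixels is hidden and replaced either by the mean of the removed pixels (full images) or by zero (attended data), as described above (the helper and its arguments are our own illustration):

```python
import numpy as np

def ablate(image, importance, p, fill="mean"):
    """Hide the fraction p of least-important pixels according to `importance`."""
    flat = importance.ravel()
    n_hide = int(p * flat.size)
    if n_hide == 0:
        return image.copy()
    idx = np.argsort(flat)[:n_hide]                  # lowest-importance pixels first
    hidden = np.unravel_index(idx, importance.shape)
    out = image.copy()
    if fill == "mean":                               # full images: keep overall intensity roughly constant
        out[hidden] = image[hidden].mean(axis=0)
    else:                                            # attended data: black
        out[hidden] = 0
    return out
```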

5.5 Training on random labels

To show that deep neural network classifiers learn to suppress the background and to attend only to the objects even when the training task is not meaningful, we train on randomly labeled images. To achieve this, we randomly shuffle the labels of our training set. The result is a training set in which an image can belong to any class, so every image has to be memorized individually by the classifier. This of course makes the classifier unable to generalize, and its performance on the validation set is at chance level.


Figure 8: Wide ResNet image ablation with elrp. From left to right: original image, importance map, pixel removal examples.


6 Results

6.1 The dataset

Figure 9: Datasets sample, full images on the left and attended images on the right.

The dataset we chose for our task is a subset of ImageNet [2] containing 10 classes of animal images. The first 7 classes are pictures of birds, which increases the difficulty of the classification task because the models need to learn specific features for each class. The remaining classes are large animals, added for some diversity.

From this data we built two new datasets: one to train the attention models (the attended dataset), and another with the same, but full, images from ImageNet. The attended data was created with the help of Mask R-CNN [10], as described in the methods section. All images were resized to 128 × 128 pixels, as the detail at this size is reasonable and bigger images caused issues with GPU memory. Finally, we created the two datasets only from images that were compatible with the Mask R-CNN; thus the two datasets have about 9000 images each, which is fewer than 10 full ImageNet classes would contain. We used 80% of the data for training and 20% for validation. Figure 9 shows some examples of the data.

6.2 Training tests

The accuracies recorded during training show that the models trained on segmented images always achieve better accuracy in the early stages of training and, depending on the architecture, better or similar accuracy in the late stages compared to the models trained on full images. In addition, they always converge in fewer epochs. This is visible in Figure 10, where we compare the validation accuracies, over 100 training epochs and 10 classes, of the attention models and the regular models. Finally, we see a different behavior between our less powerful models, such as the Toy model and ResNet-6, where there is a significant difference in accuracy across all training epochs, and the Wide ResNet as well as the deeper ResNets. These differences are linked to the capacity of the network, and we explore them further in the next two experiments.

Figure 10: ResNet depth comparison. Left side, accuracy comparison during training for the attention model and the regular model. Standard error is included as line width.


6.2.1 Training tests - network depth

Figure 11: (Left side) Difference in epochs at which each model reaches convergence, for different CNN architectures. (Right side) Mean of the top 10 accuracies recorded per model.

By increasing the depth of the model, we expect the accuracy to rise, and this is indeed what happens for the networks trained on full images. This is not what we see for the models trained on segmented images, whose accuracy stays the same across the different depths we test. This is visible on the right side of Figure 11. On the left side of the same figure we see that the difference in epochs until convergence shrinks as the network depth increases. This shows that the effect of attention decreases as the depth of the classifier increases.

Figure 10 also shows ResNet model comparisons at different depths. We notice that the differences between the two types of models, as mentioned above, become smaller as the network depth increases. For example, in the case of ResNet-6 the difference between the accuracies of the two models is significant at all training stages, whereas for deeper networks such as ResNet-18, ResNet-34, and even ResNet-10, there is only a significant difference in the early training stages, and they later converge to similar values. On top of that, the difference in the early stages of training is also affected, as seen in Figure 11 (left side).

6.2.2 Training tests - dataset size

Generally, having less data to train on causes a loss in model accuracy, but the accuracies of the attention and no-attention models do not decrease at the same rate as the data size decreases.

The training curves of models trained on reduced fractions of the data are shown for the Toy model, Wide ResNet, ResNet-10, and ResNet-34 architectures. We notice that when the data is reduced, the accuracy of both network types (attention and regular) drops, but it drops faster for the models trained on full images, which results in an increasing gap between the two performances. We also notice that the difference between the two models becomes significant in all cases trained on less data.

Finally, in Figure 13 we clearly see that the difference in epochs between attention and regular models increases as the dataset size decreases, and the attention models always train and converge faster to the same accuracies as the regular models. In addition, we see that the models trained without backgrounds achieve better final accuracy and need less data to train than the regular models.


Figure 13: Model summary, dataset size comparison. (Left side) Difference in epochs at which each model reaches convergence, for networks trained on different percentages of the dataset. (Right side) Mean of the top 10 accuracies recorded per model. From top to bottom the architectures are: a) Toy model, b) Wide ResNet, c) ResNet10, d) ResNet34.


6.3 Image ablation tests

All image ablation test curves show a human-like accuracy drop, where we observe the biggest reduction when pixels are removed from the object, which usually carries the most important features for classification. Figures 14-17(d) (naive) show how the accuracy of the models drops on their respective test sets when we gradually hide pixels of the image, starting from those furthest from the object and ending at its center. The architectures used for these results are the Toy model, Wide ResNet, ResNet10, and ResNet34. We notice that the attention models start losing accuracy only after 40% or more of the image is hidden, which is the point at which the actual objects to be classified become partially covered. This result is expected, since there is no background in this model's test images. The regular CNNs, on the other hand, lose some accuracy right from the start, but the major reduction again comes after 40%. This indicates that the majority of features useful to a CNN for classification are located on the actual objects. The background also contains a small amount of useful features for the regular CNNs, but the attention CNNs show that it is not needed to perform the task with the same or better accuracy.

Similar curves appear in the plots of the image ablation based on feature importance (Figures 14-17(a, c)). The models show only a slight performance drop when the low-importance pixels are removed, and a major drop in accuracy when we remove important features. The similarity between the behavior in this experiment and the previous one is another indicator that the most important pixels for classification are located on the object and not on the background. This behavior is common to all model architectures tested and, with the exception of the saliency maps, to all three methods we used to rank pixel importance.

Overall, this experiment shows that both types of models, attention and no attention, behave more like humans than we would expect, and that both learn to attend to the object, rather than the background, to find strongly discriminative features. The models trained on full images lose some accuracy from the background ablation, which indicates that they rely on it to some degree. This effect might be related to network capacity, as the deeper the network, the smaller the effect. We consider this another advantage of training on segmented (attended) images, as we can train better classifiers without the influence of the background.

Note that the saliency map ablation behaves irregularly compared to the others: the attention model loses accuracy even when low-importance features are removed. This should not happen, as in this case the low-importance features should be located in the black background, which contains no information. As a result, the accuracy drop for the attention models is faster than for the regular models for the ResNet architectures. Since the other three results all agree, and since there is no such effect in the other architectures, we choose to ignore this result in the final conclusion. Table 4 shows the normalized and unnormalized intensity ratios for all the different methods.


Figure 14: ResNet10 image ablation results with different methods of choosing pixel removal. Standard error is included as line width.

Figure 15: ResNet34 image ablation results with different methods of choosing pixel removal. Standard error is included as line width.


Figure 16: Toy model image ablation results with different methods of choosing pixel removal. Standard error is included as line width.

Figure 17: Wide ResNet image ablation results with different methods of choosing pixel removal. Standard error is included as line width.


Intensity Ratios

Ratio                               | WRN   | ResNet6 | ResNet10 | ResNet18 | ResNet34 | Toy CNN
Normalized (mean) - activations     | 5.214 | 3.751   | 2.044    | 2.165    | 2.200    | 5.709
Normalized - elrp                   | 4.871 | 3.484   | 3.365    | 3.403    | 3.630    | 4.637
Normalized - saliency               | 4.811 | 3.345   | 3.198    | 3.463    | 3.510    | 5.278
Not normalized (sum) - activations  | 0.915 | 0.737   | 0.537    | 0.565    | 0.588    | 1.436
Not normalized - elrp               | 0.815 | 0.794   | 0.657    | 0.692    | 0.686    | 0.934
Not normalized - saliency           | 0.799 | 0.688   | 0.639    | 0.712    | 0.683    | 1.030

Table 4: Normalized and unnormalized intensity ratios for the three activation-based methods of ranking pixels for image ablation.

We suspected that the saliency maps would have a lower ratio than the other two methods, as this would explain their different behavior. Although the saliency maps do have a lower ratio than elrp in some cases, this is not enough to confirm that the intensity ratio is the cause of the different behavior, because the activations have lower ratios for both ResNet-10 and ResNet-34.

6.4 Activation visualization

The attention maps from the activations of each convolution layer show that, in both the Toy CNN and the ResNet architectures trained on full images, the background is gradually suppressed inside the network (see Figure 20 for the WRN and the appendix for more examples). This indicates that CNN classifiers learn to attend to important features, and by extension to the objects in the images, and almost completely eliminate the influence of the background when the depth (capacity) of the network is sufficient. It suggests that the classifier learns to segment the images before classifying. Note that the lightest parts of the attention maps shown in the figures are the most important features for the classification.

ResNet models show a different behavior from the Toy CNN, as the background lights up again every couple of layers. This is the result of the skip connections in every block, which feed the input of each block to its output. Nevertheless, the final result is similar to, and somewhat better than, that of the Toy CNN.


Figure 18: Intensity ratios per hidden layer for the Toy CNN, Wide ResNet, ResNet6, and ResNet18. The ratio increases for deeper layers because the network attends more to the objects, and the final layer usually has the biggest ratio. Note that this does not seem to hold for ResNet18, but its low ratio values are due to the feature maps of its deeper layers being 4 × 4 pixels, which is not a sufficient resolution for this method to work.

As expected, the attention maps of the models trained on pre-segmented images contain no background activations, as shown in Figure 19 for the Wide ResNet (see the appendix for the Toy CNN). The attention patterns on the objects are similar to those of the previous models, despite these models being trained on already segmented images. This similarity in the activations might be a clue as to why the two different types of classifiers reach similar performance, while the latter need less training: they do not have to learn to perform segmentation and can focus only on classification.

The intensity ratios in Figure 18 confirm what we saw in the attention maps. The ratio increases for deeper layers in the majority of cases. It is also noticeable that the final layer spikes to a very high ratio for most models. Finally, for ResNet18 we do not see as large an increase as for the others, but this experiment is not well suited to it, because its activation maps become too small (4 × 4 = 16 pixels) in the deeper layers.


Figure 19: Attention Wide ResNet activations for each block’s second layer.


6.5 Random labels CNN

Lastly, in figure 21 and figure 22 we compare two ResNet18 models, one trained normally and the other trained with random labels. The latter shows properties similar to all the other models, which separate object and background. The main difference is that, since it was trained on random labels and has no clear feature targets to look for, it attends to the wrong object or to every object in the image. This shows that DNNs learn to attend to something and perform some image segmentation, even when they overfit and memorize the data without generalizing.
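
For reference, the random-label setup only requires assigning fixed random classes to the training examples once before training starts. A minimal sketch follows; the function name and the NumPy representation of the labels are assumptions.

```python
import numpy as np

def randomize_labels(labels, num_classes=10, seed=0):
    """Assign every training example a uniformly random class label.

    The random labels are drawn once, before training, and kept fixed
    across epochs, so the network can only fit them by memorization.
    """
    rng = np.random.RandomState(seed)
    return rng.randint(0, num_classes, size=len(labels))
```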


Figure 22: ResNet 18 activations for each of the first 7 layers, when trained on random labels.


7 Conclusion

From the experiments and the activation visualizations we learned that DNNs learn discriminative features of the objects they are tasked to classify, similar to how humans find important features. We found that DNNs also rank those features and depend on some of them more than on others. This holds for all models and architectures that we tested, even when the DNN is trained on random labels. Finally, we saw that regular DNNs still use background information, to a degree that depends on the network's capacity.

From the image ablation tests we learned that the experimental curves resemble the human curves more closely when attention is used than when it is not. The training tests showed that attention-based models train faster, needing fewer epochs than the regular DNNs, and that the benefit of attention is largest when the models have few layers or when the dataset is small, where it yields clearly higher accuracies.

Overall, we conclude that the background is not useful for learning or performing our image classification task. Models that do not use attention have to learn to perform segmentation before classification, and this makes them weaker, sometimes much weaker, than the attention models when the networks are not deep enough or the dataset is not rich enough. Attention and background removal simplify the problem and make attention CNNs more resource- and data-efficient than regular ones. All of our experiments confirm that the same task can be learned and performed with similar or better accuracy by attention models that remove the backgrounds of the images during training and testing. Furthermore, we showed that by utilizing attention, simpler and shallower models perform much better than their counterparts without attention, and, similarly, that attention CNNs perform much better than regular CNNs when less training data is used.

Finally, we saw that all of our CNNs behaved similarly to humans with respect to important-feature extraction and produced similar curves in the image ablation tests. By similar we do not mean identical: the human curves that we based our work on show the same shape, but with fewer important features, all concentrated on the objects, whereas our CNNs show this shape when tested on full images. We believe this happens because CNNs downscale the input in deeper layers, which makes neighboring pixels appear equally important when the feature map is resized back to the original resolution. Overall, attention models resemble the human curves better than regular models, because even the best regular models we tested learned some weak features from the background, which affected their performance in the image ablation tests.


8 Discussion - Future work

In this work, we used background removal as a method of attention, which showed promising results. We used a pre-trained Mask R-CNN to filter the data before training and testing. A limitation of this method is that we need a CNN model already trained to segment images before we can train and test our own model. In addition, our attention method is not just background removal: we also keep only one object per image, even if the image originally contained multiple objects of the same class. In the future, we would like to explore other forms of attention, such as keeping part or all of the background with the objects highlighted. A simple improvement is to keep all objects of the same class in the image instead of just one; this would be more data-efficient because we would not waste examples. We also saw that all CNNs learn some segmentation and attend more to the objects. An improvement on our model would be one that learns to attend correctly from examples during training, without the use of a second pre-trained CNN such as Mask R-CNN. If that is not efficient enough, a CNN that only needs externally segmented data during training and applies attention by itself at test time would also be an improvement, because we would then need the second CNN only for filtering the training data and not the test data.
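
As a sketch of the filtering step described above, the snippet below uses torchvision's COCO-pretrained Mask R-CNN and keeps only the highest-scoring instance mask on a black background; the choice of torchvision's implementation, the score threshold, and selecting the single best-scoring instance are assumptions about details not fixed here.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

segmenter = maskrcnn_resnet50_fpn(pretrained=True).eval()

def remove_background(image, score_threshold=0.5):
    """Keep only the highest-scoring segmented object; set the rest to black.

    image: float tensor of shape (3, H, W) with values in [0, 1]
    """
    with torch.no_grad():
        pred = segmenter([image])[0]     # dict with boxes, labels, scores, masks
    keep = pred["scores"] >= score_threshold
    if not keep.any():
        return image                     # nothing detected: leave the image unchanged
    best = pred["scores"][keep].argmax()
    mask = (pred["masks"][keep][best, 0] > 0.5).float()   # (H, W) binary mask
    return image * mask                  # broadcasts over the colour channels
```

In practice one would also restrict the kept instance to the class of the image's label, so that the attended object matches the classification target.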

Finally, the dataset that we used consists of only 10 classes, and the task is quite simple, as we reached 90% or higher accuracy with every architecture. In future work the dataset can be expanded to more classes, to see whether our results hold on more difficult tasks.



9 Appendix

9.1 Training and validation accuracy plots

The figures in this subsection show the accuracies on the training and validation sets during training.

Figure 23: ResNet6 and ResNet18 accuracy curves on the training and validation sets. (Left side) The networks trained on segmented data. (Right side) The networks trained on full images.


Figure 24: Toy model and WRN accuracy curves on the training and validation sets. (Left side) The networks trained on segmented data. (Right side) The networks trained on full images.


9.2 Extra activations

The figures below show activations of our different model architectures.


Figure 29: Wide ResNet activations for each block’s second layer.


Figure 31: ResNet 18 activations for each of the first 7 layers, when trained on random labels.


9.3 Image ablation examples

Figure 32: Wide ResNet image ablation with elrp. From left to right: original image- importance map- pixel removal examples.


Figure 33: Wide ResNet image ablation with saliency. From left to right: original image- importance map- pixel removal example.


Figure 34: Wide ResNet image ablation with activations. From left to right: original image- importance map- pixel removal example.


Figure 35: Toy CNN image ablation with elrp. From left to right: original image- importance map- pixel removal example.


Figure 36: Toy CNN image ablation with activations. From left to right: original image- importance map- pixel removal example.


Figure 37: Toy CNN image ablation with saliency. From left to right: original image- importance map- pixel removal example.


9.4 P values

Figure 38: P values of the means of the Toy and WRN models. The dotted red line shows p = 0.05, the threshold for a significant difference.


Figure 39: P values of the means of the ResNet models trained on small data. The dotted red line shows p = 0.05, the threshold for a significant difference.

Figure 40: P values of the means of the ResNet models. The dotted red line shows p = 0.05, the threshold for a significant difference.


9.5 Cross accuracy tests

Figure 41: Accuracy tests of attention and no attention models on attended and regular data. (Seg: attended data/ attention CNN, Unseg: regular data/ regular CNN)
