
MSc Artificial Intelligence
Track: Computer Vision

Master Thesis

Directing Attention of Convolutional Neural Networks

by
Steve Nowee
10183914

June 23, 2017
42 EC, January 2016 - June 2017

Supervisor: Dr J.C. van Gemert
Assessor: Prof. Dr T. Gevers

Abstract

In object detection, regions of interest are often cropped from the input image and then classified by a convolutional neural network. Another approach is to crop regions of interest from the output feature maps of a convolutional neural network and use those extracted features for classification. By cropping regions of interest, whether from the input image or from the feature maps, valuable contextual information is discarded and not taken into account during classification.

In this thesis a method to explicitly direct the attention of convolutional neural networks is proposed, the Directed Attention CNN (DACNN). The directing of attention is achieved by using binary masks that define what is foreground, a region of interest, and what is background, the context of the region of interest. These binary masks are simply concatenated to the input of a CNN and to the intermediate outputs of the convolutional layers in the CNN. Since the regions of interest are not cropped, no information is disregarded. Both the contextual information and the information in the region of interest can be used for classification.

The theoretical basis of DACNNs is proven using the MNIST dataset and several variations of that dataset. It is shown that the DACNN is able to focus its attention on one specific digit when four digits are shown. Also, it is shown that the DACNN can use information from the context to improve classification performance.

Finally, the performance of the DACNN is evaluated on realistic data from PASCAL VOC 2007 and compared with two CNN baselines. The DACNN has a slightly lower performance than one of the baselines in which the region of interest is cropped from the image. However, when an object region is cropped from the full image the classification task becomes translation invariant and thus less complex. The DACNN outperforms the other baseline in which all but the region of interest is blacked out. This baseline is more comparable to the DACNN as both have to deal with translation variance. It is also shown that the DACNN is able to classify regions of interest using only their context information with an accuracy that is more than three times higher than random chance.

Thus, the DACNN introduced in this thesis has been shown to improve the performance of object classification using contextual information.


Contents

1 Introduction
2 Background
  2.1 Neural Networks
  2.2 Convolutional Neural Networks
  2.3 Optimizers and Weight Initialization
3 Related Work
  3.1 Object Detection
    3.1.1 Regions with CNN features
    3.1.2 Fast R-CNN
    3.1.3 Faster R-CNN
    3.1.4 You Only Look Once
    3.1.5 Single Shot Detector
    3.1.6 Region-based Fully Convolutional Networks
    3.1.7 Mask R-CNN
    3.1.8 Convolutional Feature Masking
  3.2 Context in Computer Vision
    3.2.1 ContextLocNet
    3.2.2 Inside-Outside Net
4 Methodology
  4.1 Directed Attention Convolutional Neural Networks
    4.1.1 DACNN Architectures
    4.1.2 Object Detection using DACNNs
5 Experiments and Results
  5.1 DACNNs on MNIST
    5.1.1 Will a DACNN perform equally as well as a normal CNN on the original MNIST dataset?
    5.1.2 Can a DACNN perform a complex task equally as well as a normal CNN performs a less complex task?
    5.1.3 How do the performances of a DACNN and a CNN compare on the QM dataset?
    5.1.4 What settings work well for a DACNN?
    5.1.5 Conclusion
  5.2 Context in DACNNs
    5.2.1 Does using a DACNN improve classification performance of classes that are reinforced by context information?
    5.2.2 What settings work well for a DACNN?
    5.2.3 Conclusion
  5.3 DACNN Object Detection
    5.3.1 Can a DACNN outperform a CNN in object classification on PASCAL VOC 2007?
    5.3.2 How well can a DACNN perform object classification based only on the context?


1 Introduction

Convolutional neural networks (CNNs) have shown very good results in most of the large computer vision tasks, from object detection and segmentation [2,17,20,22] to face recognition [21] and action recognition [12,26].

Various types of information can be incorporated into the networks to improve their performance. For example, within object detection, instead of having networks focus only on regions of interest (RoIs), networks were developed that explicitly include information from the context of the RoIs in the classification as well [1,13].

This thesis introduces a new approach for CNNs to focus their attention. We propose to explicitly direct the attention of a CNN towards certain regions, without cropping those specific regions from the data. The proposed type of CNN will be referred to as a Directed Attention CNN (DACNN).

By not cropping the RoIs from the data, the context of the RoIs is not thrown away, as has been the case in previous work on object detection [7]. Another approach, adopted in several object detection methods, is to project the RoIs onto the output of the convolutional layers and to crop that instead of the input image [6,23]. This is already more efficient than processing crops of the input image for each RoI. However, it is still a crop, meaning that potentially useful information is disregarded.

We achieve the directing of attention with the DACNN by using binary masks. By adding the masks to the input data and the intermediate output of the convolutional layers, the network should focus more attention on the regions of the images that are specified in the masks than on the background of the images. Finally, at the end of the network, features from both the RoI and from the context of the RoI are taken into account for the classification of the image. Adding this information should improve the classification performance. Figure 1 shows the general principle introduced in this thesis.


In this thesis the DACNN is mostly discussed in the setting of object detection/classification. However, it is simply a new and different way in which a CNN can interact with data, and it could be employed in other tasks as well. For example, in visual tracking one may want to determine whether two regions from consecutive frames contain the same object. Instead of cropping the regions and processing the crops, a DACNN can be used with masks that 'point out' the RoIs while retaining the context and spatial information of both RoIs.

The structure of the rest of this thesis is as follows. First, a background of previous work on object detection and the use of context in CNNs is given. After that, the method that this thesis introduces is explained in more detail. Then the various datasets used in the conducted experiments are introduced, followed by the experiments themselves and their results. Finally, a conclusion of the work done for this thesis is provided, together with options for future work.


2 Background

2.1 Neural Networks

A neural network (NN) is a model that can be used in pattern recognition tasks. An NN consists of layers of nodes with connections between the nodes of adjacent layers. There is an input layer, an output layer and a variable number of hidden layers between the input and output layers. The more complex the task to be solved by the NN, the more hidden layers are necessary in the network. An example of a neural network structure can be seen in Figure 2.

Figure 2: Example of a neural network with five hidden layers.

During training, input data is fed to the NN. The nodes of the subsequent layers get an activation based on the activations of the nodes in the previous layers and the weights of the connections between the nodes. The product of the weights and activations is then processed by a non-linear function to enable the network to model non-linear relations between input and output. If no non-linearity is introduced between layers, the whole network can be described by a linear function, no matter how many layers are in the network. Thus non-linearity is a necessity for a well performing neural network. Some examples of non-linear functions that are often used are the hyperbolic tangent, the sigmoid and the Rectified Linear Unit (ReLU).
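This collapse of stacked linear layers can be checked directly in a few lines of NumPy; the layer sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input vector
W1 = rng.normal(size=(4, 3))      # weights of the first layer
W2 = rng.normal(size=(2, 4))      # weights of the second layer

# Without a non-linearity, two stacked layers equal one linear layer.
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True

# A ReLU between the layers breaks this equivalence.
with_relu = W2 @ np.maximum(W1 @ x, 0)
print(np.allclose(with_relu, one_layer))   # False (in general)
```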

Finally, the output nodes get a certain activation that, depending on the ground truth, has a certain error. This error is used to update the weights of the network by propagating it backwards through the network. Thus, by providing a network with enough data, the weights between the nodes are trained to perform a certain task.

2.2 Convolutional Neural Networks

A type of neural network that is based on the structure of the visual cortex and is especially effective for image data is the Convolutional Neural Network (CNN). A CNN is made up of layers, of which there are various types; the most common are the convolutional layer, the fully connected layer, the pooling layer and the loss layer.


Figure 3: Example of a CNN with various types of layers. The convolutional layer does not decrease the size of its input; rather, it encodes the features of its input. The max pooling layer does decrease the size of its input. The amount by which the size decreases depends on the size of the pooling kernel and its stride. The fully connected layers are similar to the layers of a neural network (Figure 2). Extracting features at different scales and positions enables a CNN to perform complicated tasks such as image classification.

The convolutional layer consists of a set of overlapping kernels, which are also called filters. When input is processed by a convolutional layer, the input is convolved with each of the kernels in the layer. A convolution is an operation between a kernel matrix and, in this case, a part of the input of the convolutional layer. An example of such an operation is shown in Figure 4. The output of a convolutional layer is also called a set of feature maps. The kernels in a convolutional layer learn to activate for specific features in the local region they correspond to. Convolutional layers at the beginning of a network will mostly learn to activate for simple features such as lines or corners. The features that a convolutional layer learns to recognize become more complex and task-specific towards the end of a network.
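As a sketch of the operation itself, the following NumPy function performs a 'valid' convolution of a single-channel image with one kernel (like most deep learning libraries it computes a cross-correlation, i.e. the kernel is not flipped); the example edge kernel is illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise product of the kernel and the image patch, summed
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # responds to vertical edges
print(conv2d(image, edge_kernel).shape)  # (3, 3)
```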


Just as in a normal neural network, the activations between the layers in a CNN are processed by a non-linear function. The non-linear function that is used in this thesis is the ReLU. A ReLU returns zero for input values smaller than zero and the identity for values above zero. The ReLU function can be seen in Figure 5.

Figure 5: The ReLU activation function. The ReLU is zero for input values under zero and linear with a slope of 1 for values above zero.

The fully connected layer, as the name suggests, is a layer in which all neurons are connected to all of the activations in the layer before it. These layers can be seen as classifiers that base their decision on the features that were created by the convolutional layers. For that reason, the fully connected layer is often used at the end of a network architecture.

There are also pooling layers, which are used to down-sample the result of a convolutional layer and thus reduce the number of parameters in the network, which in turn prevents overfitting. There are several variations of the pooling layer, the most common being max pooling and average pooling. The kernel of the pooling layer moves across the input with a certain step size (stride) and the maximum or average value within the kernel becomes the output value. The amount of down-sampling achieved by a pooling layer depends on the size of the kernel and on the stride with which the kernel moves across the input of the layer.
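A minimal NumPy sketch of max pooling, with the kernel size and stride as parameters; shapes and values are illustrative:

```python
import numpy as np

def max_pool2d(feature_map, kernel=2, stride=2):
    """Down-sample one feature map by taking the maximum in each window."""
    h, w = feature_map.shape
    oh = (h - kernel) // stride + 1
    ow = (w - kernel) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = feature_map[r:r + kernel, c:c + kernel].max()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(fmap))  # 2x2 output: [[5, 7], [13, 15]]
```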

Loss layers are used during training and specify how ‘wrong answers’ are penalized. For example if a CNN is taught to predict a continuous value (regression), the loss can simply be the difference between the value the CNN predicted and the target value.

Since training a CNN takes a lot of time and annotated data, a common practice is to fine-tune the weights of a pre-trained network using data that is specific for the new problem. Fine-tuning is the training of weights that are already trained to perform a specific task. For example, in this thesis we use the weights that are pre-trained for object detection on ImageNet [3]. These weights are then fine-tuned to perform object detection on PASCAL VOC 2007 [5]. Since the weights are pre-trained to perform object detection, it will take less time to train the network to convergence on a different dataset.

2.3 Optimizers and Weight Initialization

The optimizers used in this thesis are variants of Stochastic Gradient Descent (SGD). SGD is based on standard gradient descent, in which the parameters are updated based on the gradient of the objective function on the full data set (see Equation 1).

$w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla Q_i(w)$ (1)

Evaluating all data points for every update step can be a costly operation. For that reason SGD updates parameters on-line, based on the gradient of the objective function on single examples or small batches of examples (see Equation 2).

$w := w - \eta \nabla Q_i(w)$ (2)

The gradient of the objective function on a set of data points points in the direction of steepest change. Thus, by moving along the gradient (or, for minimization, against it), the parameters move closer to an optimum of the objective function, which may be local or global.

In Equations 1 and 2, η is the learning rate, which determines the size of the step taken in the direction of the gradient. If the learning rate is too large, gradient descent methods can jump over the optimum instead of converging to it. Conversely, if the learning rate is too small, converging to the optimum will take a long time, as the parameters are only updated in small steps.

AdaDelta is another variant of SGD, which builds upon AdaGrad [4]. In AdaGrad the learning rates of the parameters are adapted by scaling them inversely with the square root of the sum of all squared gradients up to that point. This way, the learning rates of the parameters with the largest gradients are decreased the fastest. The learning rates of parameters with smaller gradients are decreased much more slowly. Thus AdaGrad has a self-adapting learning rate that paces the optimization separately for each of the parameters. However, by using the sum of all the squared gradients, the scaling eventually drives the learning rate down to being infinitesimally small. AdaDelta improves upon AdaGrad by only summing the squared gradients over a certain number of time steps back, instead of all the squared gradients over all time. This way the scaling of the learning rates does not decrease as fast as with AdaGrad, resulting in faster and more stable convergence.

Adam is an extension of SGD in which the gradient is updated using momenta computed from previous update steps. The algorithm uses first and second order momenta (moving averages of the gradient and the squared gradient respectively) of the previous update steps to compute the momenta of the current time step. Thus, if the gradient pointed in a certain direction in the previous time steps, the gradient in the current time step will be slightly corrected towards the direction of the previous gradients. This builds on the intuition that the moving average of the previous gradients should largely point in the most optimal direction. Adam should converge even faster than AdaDelta and AdaGrad.
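The SGD and Adam update rules can be sketched as follows; the hyperparameter defaults are the commonly used ones, not settings from this thesis:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Plain SGD update (Equation 2): step against the gradient."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second momenta."""
    m = b1 * m + (1 - b1) * grad         # moving average of the gradient
    v = b2 * v + (1 - b2) * grad ** 2    # moving average of the squared gradient
    m_hat = m / (1 - b1 ** t)            # bias correction; t starts at 1
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimize Q(w) = w^2, whose gradient is 2w.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 101):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.1)
print(w)  # approaches 0
```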

The way the weights of a CNN are initialized influences the convergence and performance of the network considerably. In this thesis the weights are initialized in two different ways. The first simply initializes the weights uniformly between zero and one. The second initialization is called Glorot (Xavier) weight initialization [8]. Glorot weight initialization, called normalized initialization in the paper, is a uniform sampling between $-\text{limit}$ and $\text{limit}$, where

$\text{limit} = \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}$

and $n_j$ and $n_{j+1}$ are the numbers of units in the current and the next layer.
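A sketch of this initialization for a single weight matrix, where n_in and n_out play the roles of $n_j$ and $n_{j+1}$:

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng=None):
    """Normalized (Glorot/Xavier) initialization: uniform in [-limit, limit]."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

W = glorot_uniform(784, 500)  # e.g. a fully connected layer on MNIST
print(np.abs(W).max() <= np.sqrt(6.0 / (784 + 500)))  # True
```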


3 Related Work

3.1 Object Detection

The field of object detection is one that has gone through several large increases in performance in the past few years. The following sections give an overview of various approaches that have been the reason for these increases in performance.

3.1.1 Regions with CNN features

Regions with CNN features, also known as R-CNN [7], was one of the first approaches in object detection that more than doubled the performance in comparison to the then state-of-the-art. The name ‘Regions with CNN features’ originates in the fact that this approach is based on processing region proposals of images with a CNN. The region proposals used in R-CNN are generated with selective search [28], as this enables a fair comparison with previous work in detection.

Each of the regions generated by selective search is cropped out of the input image and transformed affinely to a fixed size equal to the input size of the CNN. The output of the CNN is a 4096-dimensional feature vector for each of the transformed input regions. These feature vectors are then scored by a separate Support Vector Machine (SVM) for each class. Finally, for each separate class, a greedy non-maximum suppression is applied to all the scored regions. This means that all regions that overlap with a higher scoring region of the same class are rejected, resulting in a set of maximal non-overlapping detections for every class. An overview of this process is shown in Figure 6. The mean average precision (mAP) achieved with R-CNN on PASCAL VOC 2007 [5] is 58.5%.

Figure 6: Architecture of R-CNN. The region proposals are cropped and resized to fixed dimensions. The resized proposals are all processed by a CNN, which outputs features for each proposal. Finally, these per-proposal features are classified.
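Greedy non-maximum suppression as described above can be sketched as follows; boxes are assumed to be given as (x1, y1, x2, y2) corners and the 0.5 overlap threshold is illustrative:

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def greedy_nms(boxes, scores, threshold=0.5):
    """Keep the highest-scoring box, reject boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(greedy_nms(boxes, scores))  # [0, 2]: the near-duplicate box 1 is rejected
```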

3.1.2 Fast R-CNN

R-CNN was the method that improved significantly on the plateaued performance of its time. However, building on the R-CNN framework, Fast R-CNN improved even further. Fast R-CNN [6] trains 9 times faster than R-CNN and is 213 times faster at test time. Apart from runtime improvements, Fast R-CNN has also achieved a higher mAP of 70% on PASCAL VOC 2007.

Fast R-CNN introduces the RoI pooling layer, which is based on a normal Spatial Pyramid Pooling (SPP) layer [10]. However, while a normal SPP layer uses several levels of splitting the input into grids (full image, two by two, four by four, etc.), the RoI pooling layer only uses one level.

Figure 7 shows a high-level overview of Fast R-CNN.

Figure 7: Architecture of Fast R-CNN. The input images are processed by a CNN once. Next, the region proposals are extracted from the image feature maps using the RoI pooling layer. These extracted features per proposal are used for classification and bounding box optimization.

3.1.3 Faster R-CNN

It was found that in state-of-the-art methods like Fast R-CNN, what slowed detection down the most were the region proposal methods. For that reason, in Faster R-CNN [23] a Region Proposal Network (RPN) was introduced. This RPN is a fully convolutional network that predicts bounding boxes as well as a score of ‘objectness’. This RPN is trained at the same time as Fast R-CNN which is used for the actual detection, which essentially means both the RPN and Fast R-CNN use the same convolutional features.

The RPN uses anchors of multiple scales and multiple aspect ratios in a sliding window approach to predict the ‘objectness’ and ‘not objectness’ as well as four coordinates for each anchor at each sliding window position to regress the bounding box coordinates. Three scales and three aspect ratios were used, resulting in nine anchors per sliding window position.

This approach of simultaneously predicting regions of interest and classifying them significantly speeds up object detection, as the most time-intensive task of region proposal is optimized with the internal RPN. Selective search takes an average of 1500 milliseconds to generate 2000 proposals and Fast R-CNN (VGG16) takes 320 milliseconds to process that number of proposals. In contrast, Faster R-CNN (VGG16) takes only 198 milliseconds to generate proposals and classify them, making it around ten times faster than Fast R-CNN with selective search proposals. At the same time, the performance of Faster R-CNN, with a mAP of 73.2% on PASCAL VOC 2007, is higher than that of Fast R-CNN.

In Figure 8 the general pipeline of Faster R-CNN can be seen. The input image is first convolved by a number of convolutional layers. The resulting feature maps are used by the RPN to generate the proposals, which are then combined with the feature maps in the RoI pooling layer. The resulting fixed-length vectors are then processed by fully connected layers, as in Fast R-CNN (Figure 7).


Figure 8: Architecture of Faster R-CNN. The input images are convolved once. However, instead of using pre-computed region proposals the resulting feature maps are used by a Region Proposal Network to generate regions of interest. These generated regions of interest are then extracted from the feature maps using the RoI pooling layer and used for classification.

3.1.4 You Only Look Once

While Faster R-CNN was already faster than its predecessors, it was still not real-time. A method that does perform real-time object detection is You Only Look Once (YOLO) [22]. Like Faster R-CNN, YOLO can be trained end-to-end to predict bounding boxes and class probabilities for those boxes at the same time.

YOLO divides an input image into an even grid and for each of the grid cells predicts a specified number of bounding boxes, the probability of an object being in the predicted boxes and a probability for each class. Class-specific confidence scores can be computed by multiplying the intersection over union between the predicted box and the ground truth bounding box, the object probability and the separate class probabilities.
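As a toy illustration of this score computation (all numbers invented, not from the paper):

```python
# Illustrative values for one predicted box in one grid cell.
iou_with_truth = 0.8       # overlap of the predicted box with the ground truth
p_object = 0.9             # predicted probability that the box contains an object
p_class_given_object = [0.7, 0.2, 0.1]  # per-class probabilities of the cell

box_confidence = p_object * iou_with_truth
class_scores = [p * box_confidence for p in p_class_given_object]
print(class_scores)  # [0.504, 0.144, 0.072]
```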

As is often the case, there is a trade-off between computational speed and performance. While YOLO is able to handle 45 frames per second (∼ 22 milliseconds per frame) it has a mAP of 63.4%. In contrast, Faster R-CNN takes 320 milliseconds per frame and has a mAP of 73.2% on PASCAL VOC 2007. This means that YOLO has a lower performance than Faster R-CNN, but it is also more than fourteen times faster than Faster R-CNN. Compared to other real-time object detection methods, the performance of YOLO is more than twice as high.

An example of the dividing of an image in an even grid, the bounding box prediction and the probability estimation can be seen in Figure9.


Figure 9: Example of YOLO dividing an input image into a 7 x 7 grid and predicting class probabilities and bounding boxes for the grid cells.

3.1.5 Single Shot Detector

The following method is not subject to the trade-off between computation speed and performance. The Single Shot Detector (SSD) [19] is able to process more frames per second than YOLO. However, where YOLO had a decreased performance compared to Faster R-CNN, the performance of SSD exceeds the performance of Faster R-CNN.

SSD uses the concept of anchor boxes from Faster R-CNN, but here they are called default boxes. The biggest difference is that Faster R-CNN applies its anchor boxes only to one feature map with a certain scale. Conversely, SSD applies its default boxes to several feature maps at different scales. This way objects of varying scales can more easily be detected in one image. In Figure 10 the default boxes are applied to feature maps with scales of eight by eight and four by four. In this example it can be seen that both the cat and the dog are detected by the default boxes because of the use of different scales.

Another difference is that Faster R-CNN was based on a sliding window approach. SSD, on the other hand, only makes predictions per feature map cell. For each feature map cell SSD predicts four offsets for the default boxes as well as class probabilities for each of the boxes.

While YOLO is already a relatively fast method with a speed of 45 frames per second, SSD is able to process 59 frames per second, which is the same as ∼ 17 milliseconds per frame. But where YOLO had a reduced performance of 63.4% mAP, SSD has a mAP of 74.3% on PASCAL VOC 2007. Thus SSD is faster than YOLO and has a higher performance than Faster R-CNN.


Figure 10: Example of how the default bounding boxes of SSD are used to detect objects at scales of 8 x 8 and 4 x 4. This example shows how SSD can be useful to detect objects at different scales.

3.1.6 Region-based Fully Convolutional Networks

One of the latest developments within the field of object detection is the Region-based Fully Convolutional Network (R-FCN) [17]. It builds upon the fact that state-of-the-art classification networks are all fully convolutional. It is argued that it similarly makes sense to use a fully convolutional network for detection. However, while for classification networks a higher level translation invariance is better, object detection is inherently somewhat translation-variant. So a balance needs to be maintained between the translation variance and invariance in the network.

The R-FCN approach is built upon a shared, fully convolutional architecture such as in FCN [20]. FCN was originally used for semantic segmentation of images, which can be seen as a pixel-wise classification. Deep convolutional networks are inherently translation invariant, as the basic operations the networks consist of are applied locally. As R-FCN is an object detection method, a level of translation variance needs to be introduced to FCN for the benefit of the object detection. The final convolutional layer in R-FCN is used to encode position-sensitive score maps for each class. Each of these score maps encodes information about a specific position. These score maps are then pooled by a position-sensitive RoI pooling layer. In the example overview shown in Figure 11 there are nine score maps and these are pooled into an even grid of 3 × 3. Each of the grid cells describes a relative position; in the example of 3 × 3, these positions are 'top left', 'top middle', ..., 'bottom right'. The RoI pooling layer pools selectively, such that each of the grid cells is pooled from its corresponding score map. This principle can be seen in the overview, e.g. the top left grid cell is pooled from the leftmost score map (dark yellow). The results of the pooling layer are then averaged for each class, resulting in a vector containing scores for each class. Finally, softmax is used to compute the cross-entropy loss.

R-FCN can be trained with precomputed region proposals. However, the Region Proposal Network as seen in Faster R-CNN can also be used in this approach to generate region proposals. The features between the RPN and R-FCN are shared, just as they were in Faster R-CNN.

R-FCN is able to process one image in 170 milliseconds, using ResNet-101 [11]. This is about three times faster than the optimized version of Faster R-CNN combined with ResNet-101. Comparing the performances of both R-FCN and Faster R-CNN with ResNet-101 on the COCO dataset [18] shows that R-FCN performs slightly worse than the optimized Faster R-CNN. R-FCN scores 53.2% mAP while the optimized Faster R-CNN has a performance of 55.7% mAP.


Figure 11: Overview of R-FCN position-sensitive score maps and position-sensitive RoI pooling. The input image is convolved by several convolutional layers. Then the position-sensitive convolutional layer generates single feature maps that specifically encode position-sensitive information. The position-sensitive RoI pooling then pools each of the position-sensitive feature maps into the position they encoded respectively, resulting in the chequered pattern output. Then the network votes per class based on the position-sensitive pooling output, generating a scoring vector for each class. Finally these vectors are used to do classification.

3.1.7 Mask R-CNN

Another variation on Faster R-CNN, which is somewhat closer to the topic of using masks in CNNs introduced in this thesis, is called Mask R-CNN [9]. Faster R-CNN just detects objects in images, but Mask R-CNN also predicts object masks for the detected objects simultaneously. These object masks are generated in a pixel-wise manner, using a fully convolutional network to process the RoIs. The architecture of Mask R-CNN can be seen in Figure12. In this overview the RPN introduced by Faster R-CNN is not visualized, however it is also used by Mask R-CNN.

Mask R-CNN can process an image in around 200 milliseconds, which is about the same speed as R-FCN. We saw that R-FCN was faster than the optimized Faster R-CNN; however, it did not achieve a higher performance on the COCO dataset. The optimized Faster R-CNN achieves a mAP of 55.7%, whereas Mask R-CNN achieves 62.3% mAP.


Figure 12: Overview of the Mask R-CNN pipeline. Each of the RoIs is classified, while simultaneously an object mask is generated for each RoI by the fully convolutional part of the architecture.

3.1.8 Convolutional Feature Masking

A method that comes very close to the approach proposed in this thesis is Convolutional Feature Masking [2]. Convolutional Feature Masking (CFM) is a method for object segmentation that projects segments from region proposal methods onto the domain of the last convolutional feature maps. These projected segments act as masks on the convolutional features. The convolutional features that fall under the active parts of the mask are fed into the fully connected layers for object recognition. So separate segments from the input image are classified separately based on their convolutional features, resulting in a semantic segmentation of the input image. This approach is slightly similar to the RoI pooling layer that was introduced in Fast R-CNN. An overview of the CFM pipeline can be seen in the bottom part of Figure 13.

Where other methods such as R-CNN crop out regions of interest and extract convolutional features from the crops, CFM applies the convolutions to the full image. This means that no information is thrown away or ignored.

The segmentation performance of CFM on PASCAL VOC 2012 is an mAP^r of 60.7%. mAP^r is a special type of mean average precision used for segmentation that is computed as the area under the precision-recall curve. This means that it is not directly comparable to the mAP results that are reported in the previous sections for object detection. The CFM method takes 570 milliseconds to process one image.


Figure 13: Overview of the Convolutional Feature Masking approach for semantic segmentation (bottom half). The input image is processed by convolutional layers. The resulting feature maps are then masked with the segment proposals. The features in the feature maps that are under the active part of the mask are then processed by the fully connected layer and used for classification of the segment.

3.2 Context in Computer Vision

Context is an important aspect of computer vision. For example, it can provide more information than a region of interest alone. For that reason, using context in tasks such as object detection is a logical approach. The following sections discuss approaches for object detection that use context.

3.2.1 ContextLocNet

A method that uses context for object localization in a weakly supervised manner is ContextLocNet [13]. Herein, context is defined as the outer region of a RoI. Both the RoI and the corresponding context are pooled using RoI pooling (Fast R-CNN). The output of the pooling of the RoI is used to classify the object. For localization, the pooled results of the RoI and the context are combined. How this combination is done depends on the type of model employed. In the paper two different models are introduced. The additive model encourages the network to predict regions that are semantically compatible with their context, by maximizing the sum of the RoI class score and the context class score. Conversely, the contrastive model encourages the prediction of regions that differ starkly from their context, by maximizing the difference between the RoI and context class scores. The general overview of ContextLocNet is shown in Figure 14.

It is argued that the additive model helps with preventing wrong detections outside of actual objects, as those will most probably not have the correct context. The contrastive model on the other hand prevents incorrect small detections within an object, as those regions will not be contrastive enough compared to their context.

ContextLocNet with the additive model has achieved a mAP of 33.3% on PASCAL VOC 2007. On the other hand, the contrastive model has achieved a mAP of 36.3%. As this is a weakly supervised method, the performances of this method are not directly comparable to the other performances reported for object detection which were not weakly supervised.


Figure 14: Overview of the ContextLocNet architecture. The two different streams for classification and localization can be seen in this overview. The block labelled merge in the localization stream combines the output of the fully connected layers depending on whether the additive or contrastive model is employed.

3.2.2 Inside-Outside Net

The Inside-Outside Net (ION) [1] is another approach for object detection that incorporates contextual information. The name Inside-Outside comes from the fact that the network first extracts features 'inside' the RoIs from feature maps and uses four-directional IRNN layers that create features from 'outside' of the RoIs.

Using RoI pooling (Fast R-CNN), features are extracted from the third, fourth and fifth convolutional layers. Added on top of the fifth convolutional layer are two IRNN layers to encode contextual features. RoI pooling is also used on these final contextual features, resulting in four feature vectors extracted at different scales and levels of abstraction. These features are then concatenated and used for classification and bounding box regression. This pipeline is shown in Figure 15.

Figure 15: Overview of the Inside-Outside Net architecture. The input image is first convolved by five convolutional layers. RoI pooling is used to extract regional features from the third, fourth and fifth convolutional feature maps.


Figure 16: A more detailed overview of the two IRNN layers of Inside-Outside Net. The fifth convolutional feature maps are copied four times and each of the copies is then processed by an RNN, each of which moves in a different direction. The resulting feature maps are concatenated and the process is repeated again.

Figure 17: A more detailed illustration of how the contextual encoding using the IRNNs works. Each of the colour-coded blocks is the encoding of information of one of the directions in the output of an IRNN.

On the PASCAL VOC 2007 test set an mAP of 79.2% has been reached by ION. This is better than the state-of-the-art results of some of the previously mentioned methods for object detection, showing the importance of using context in object detection.


4 Methodology

4.1 Directed Attention Convolutional Neural Networks

The principle of Directed Attention CNNs builds on the conventional CNNs described in Section 2. It is proposed to make CNNs focus on regions of interest (RoIs) through the use of binary masks. In previous work, CNNs simply crop the RoI and perform their assigned task on that crop [7]. By using masks to 'highlight' the RoI instead of cropping it, no information is discarded before the data is processed by the CNN. An example of a full image compared to the crop of the RoI within that image can be seen in Figure 18.

Figure 18: Comparison between a full image and a crop of the RoI within that image. The majority of the image is discarded by the cropping, resulting in loss of information from the context of the RoI.

The approach of convolving full images instead of crops has shown good results in previous work on object detection [2,6,9,13,17,23]. By additionally masking the RoIs we intend to make the CNN focus on those specific regions, while the background will still be processed by the CNN. This should provide the CNN with extra information that can be used to classify what is shown in the RoIs.

The masks used to direct the attention of the CNN are binary. The masks are 1 over the RoI and 0 over the region outside the RoI. Figure 19 illustrates the process of masking the full image.

Figure 19: Example of an image in which the RoI is pointed out by a mask. By accentuating the RoI instead of cropping it, the contextual information of the RoI is still visible. Note: The opacity of the mask in the ‘Masked Image’ is for the sake of illustration. The actual mask is binary.


Figure 20: Example of a foreground mask and the corresponding background mask. Note: The opacity of the masks in the ‘Foreground Mask’ and ‘Background mask’ is for the sake of illustration. The actual masks are binary.

There are several positions where the mask can be inserted in a CNN structure. For example, the mask can be added to the input image, before the first convolution. Additionally, the mask can be added to the intermediate feature maps. The mask simply needs to be pooled as many times as the input image has been pooled to achieve dimensions compatible with the feature maps. The actual incorporation of the masks in a CNN is achieved by simply concatenating them to the input of a convolutional or fully connected layer.
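The following NumPy sketch illustrates this masking mechanism; it is not the thesis implementation, and the image size, number of pooling steps and feature map depth are made up:

```python
import numpy as np

def pool_mask(mask, times):
    """Max pool a binary mask `times` times (2x2 kernel, stride 2), so it
    matches feature maps that went through the same number of poolings."""
    for _ in range(times):
        h, w = mask.shape[0] // 2, mask.shape[1] // 2
        mask = mask[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))
    return mask

# Illustrative shapes: a 224x224 RGB image and a binary RoI mask.
image = np.zeros((224, 224, 3))
fg_mask = np.zeros((224, 224))
fg_mask[60:160, 80:190] = 1.0            # RoI: 1 inside, 0 outside
bg_mask = 1.0 - fg_mask                  # the complementary background mask

# Before the first convolution: concatenate masks as extra input channels.
net_input = np.concatenate(
    [image, fg_mask[..., None], bg_mask[..., None]], axis=-1)  # 224x224x5

# Deeper in the network: suppose the feature maps were max pooled twice.
feature_maps = np.zeros((56, 56, 128))   # hypothetical intermediate output
m = pool_mask(fg_mask, times=2)          # 56x56, aligned with the feature maps
masked_features = np.concatenate([feature_maps, m[..., None]], axis=-1)
print(net_input.shape, masked_features.shape)  # (224, 224, 5) (56, 56, 129)
```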

Finally, the weights for the mask layers can also be initialized in different ways depending on what the network should learn. For example, if the network should actually only focus on the RoI and ignore the background, the weights for the background mask could be initialized with a negative number of a high magnitude. Using different weight initializations enables networks to interact with the data in different ways.

4.1.1 DACNN Architectures

In the experiments conducted for this thesis two CNN architectures are used, namely the LeNet architecture [15] with only one max pooling layer and the VGG16 architecture [27]. These two architectures can be seen in Figure 21 and Figure 22.


Figure 22: VGG16 architecture. One layer that is not originally part of VGG16 is the Spatial Pyramid Pooling layer before the fully connected layers.

Examples of both these architectures with the Directed Attention approach are shown in Figure 23 and Figure 24. The example of DA-LeNet shown here has the mask added before each of the convolutional layers and before the fully connected layer. For images of the DA-LeNet architecture where the mask is added at varying levels, consult Figure 36 and Figure 37 in Appendix A. The example of DA-VGG16 shown here has the mask added only to the convolutional layers. For an image of the DA-VGG16 architecture where the masks are added before each convolutional layer and the fully connected layer, consult Figure 38 in Appendix B.


Figure 24: DA-VGG16 architecture. The image input (bottom track) is concatenated with the mask input (top track) in all the yellow layers. These concatenation layers occur before every convolutional layer. To ensure compatible dimensions between the mask and image, the mask goes through the same number of max pooling layers before concatenating it to the smaller feature maps. Before the fully connected layers, the feature maps go through a Spatial Pyramid Pooling layer.

In Figure 22 and Figure 24, we can see that the layer before the fully connected layers is a Spatial Pyramid Pooling (SPP) [10] layer. The SPP layer can handle inputs of variable sizes and returns a fixed-size feature vector. It encodes the data spatially at several scales by dividing the input into grids of increasing resolution and pooling each of the grid cells. An example of how the SPP layer works is shown in Figure 25. Since the SPP layer can process input of varying sizes, the input images themselves can vary in size as well. This means that the input images do not have to be resized and the aspect ratios of the images are retained.


Figure 25: Example of the Spatial Pyramid Pooling operation. The feature maps are split into several grids, based on the settings of the pyramid. In the example the pyramid has the levels 1, 2 and 4. Thus the feature maps are split into grids of 1x1, 2x2 and 4x4. Each of the grid cells in these grids is pooled (usually max pooled) and the results are concatenated into a vector with a fixed length.
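A NumPy sketch of the SPP operation for a single feature map with pyramid levels 1, 2 and 4; bin edges are computed with a simple linspace, which assumes the input is at least as large as the finest grid:

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max pool one feature map into 1x1, 2x2 and 4x4 grids and concatenate,
    giving a fixed-length vector (here 1 + 4 + 16 = 21) regardless of size."""
    h, w = feature_map.shape
    outputs = []
    for level in levels:
        # integer bin edges so the grid covers the whole map
        rows = np.linspace(0, h, level + 1).astype(int)
        cols = np.linspace(0, w, level + 1).astype(int)
        for i in range(level):
            for j in range(level):
                cell = feature_map[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
                outputs.append(cell.max())
    return np.array(outputs)

print(spatial_pyramid_pool(np.random.rand(13, 9)).shape)   # (21,)
print(spatial_pyramid_pool(np.random.rand(32, 57)).shape)  # (21,) as well
```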

The VGG16 and DA-VGG16 architectures can be initialized with weights pre-trained on ImageNet [3] to allow for faster training convergence. Only the weights for the convolutional layers should be used and not those of the fully connected layers. The reason for not using the weights of the fully connected layers is that ImageNet has a thousand classes, while PASCAL VOC [5] only has twenty. Thus, the ImageNet pre-trained fully connected layers are taught to look for different things than are necessary for PASCAL VOC. So instead of using the pre-trained weights for the fully connected layers, the weights of the fully connected layers are initialized randomly using Glorot (Xavier) weight initialization.

An additional step is necessary for the weight initialization of the layers in DA-VGG16. As one or two additional channels are added to the input of each convolutional layer (foreground mask only or foreground and background masks), each of the convolutional layers also needs one or two extra weight layers. These mask weights are initialized randomly and drawn from a uniform distribution between 0 and 1.

4.1.2 Object Detection using DACNNs

One of the main topics in this thesis is the use of DACNNs for object detection. The task of object detection can be separated into two subtasks, which are localization and classification. The localization detects potentially interesting parts of an image, after which the classification determines what was detected.

Often object detection methods use region proposal methods to do the localization. All the proposals that are generated by the proposal method are then classified and assigned a probability for each object class.


Figure 26: Example of the selective search algorithm. The top row shows the segmentation of the image and the bottom row shows the bounding box proposals for each of the segmentations. On the far left we can see the over-segmentation of the image. The images to the right get less and less segmented, ensuring bounding box proposals of every scale as can be seen in the images in the bottom row.

DACNNs do not have bounding box prediction (localization) as their goal. Instead, the DACNN enhances the way the bounding boxes are classified. Thus a proposal method like selective search is essential to use DACNNs for object detection. A mask has to be generated for each of the proposed regions. These masks can then be concatenated to the image and fed into the DACNN.
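Generating such a mask from one bounding box proposal can be sketched as follows, assuming proposals are given as (x1, y1, x2, y2) pixel coordinates:

```python
import numpy as np

def proposal_to_mask(box, image_shape):
    """Turn one region proposal (x1, y1, x2, y2) into a binary foreground
    mask that can be concatenated to the image as an extra channel."""
    mask = np.zeros(image_shape[:2], dtype=np.float32)
    x1, y1, x2, y2 = box
    mask[y1:y2, x1:x2] = 1.0
    return mask

# E.g. one selective-search proposal on a 300x400 image:
mask = proposal_to_mask((120, 40, 260, 180), (300, 400, 3))
print(mask.sum())  # 140 * 140 = 19600 foreground pixels
```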


5 Experiments and Results

The goals of the first two experiments discussed here are, respectively, to test whether a mask can be used to direct the attention of a CNN and whether the context of said mask is also taken into account. The last experiment tests whether using masks to direct attention to regions of interest in object detection improves classification performance, compared to simply cropping regions of interest or blacking out all but the region of interest.

5.1 DACNNs on MNIST

To determine whether DACNNs are actually capable of adding information in comparison to an ordinary CNN, a series of experiments has been conducted. There are four questions that these experiments should answer. Each of the following sections addresses one of these questions.

5.1.1 Will a DACNN perform equally as well as a normal CNN on the original MNIST dataset?

Firstly, it needs to be verified whether using a DACNN results in a performance that is on par with a normal CNN on a simple task. For this experiment the original MNIST [16] dataset is used. The MNIST dataset is one of the classic machine learning datasets and consists of images of handwritten digits (0-9) used for classification. Each of the image samples in MNIST has a size of 28 by 28 pixels. Some example images from the MNIST dataset can be seen in Figure 27.

Figure 27: Example images from the MNIST dataset.

The number of occurrences per digit in the train set and test set of MNIST is provided in Table 1. From these statistics it becomes clear that the train and test images are evenly divided over the classes, which means that there is no statistical bias towards specific classes.

Table 1: Number of occurrences per digit in the train and test sets of MNIST.

The masks used for DA-LeNet on the original MNIST dataset are simply all-white images the size of the MNIST images.

In Table 2 we can see that the results of LeNet and DA-LeNet on original MNIST are almost equal. The table shows results for the different set-ups of DA-LeNet where the masks are concatenated to varying layers. None of these variations seem to have a large influence on performance.

Network    Fg-mask  Bg-mask  Mask Add  Fg-Weights  Bg-Weights  Accuracy
LeNet      -        -        -         -           -           98.97% (±0.07%)
DA-LeNet   X        -        1         Random      -           98.95% (±0.03%)
DA-LeNet   X        -        1,2       Random      -           98.96% (±0.05%)
DA-LeNet   X        -        1,2,3     Random      -           98.88% (±0.08%)

Table 2: Results of LeNet and DA-LeNet on original MNIST. The checkmarks (X) denote whether a foreground mask, a background mask or both are used. The layers to which the masks are added are denoted by: 1 (before the first convolutional layer), 2 (before the second convolutional layer) and 3 (before the fully connected layer). The random weight initialization is performed using Glorot (Xavier) uniform sampling. Each result shows its corresponding standard deviation, computed over five iterations. Each of the separate iterations was trained for fifteen epochs and the optimizer used for training was AdaDelta with an initial learning rate of 1.

To prove that there is significant equivalence between the performance achieved by DA-LeNet and the performance achieved by standard LeNet we can apply the TOST (Two One-Sided Tests) procedure [25]. As with a t-test, the TOST outputs a p-value. If this p-value is below a threshold (usually 0.05 or 0.01), the two results are significantly equivalent. Since the differences are quite small, a threshold of 0.01 is used. We apply the TOST to the set-up of DA-LeNet with a performance of 98.96% and the set-up of LeNet with a performance of 98.97%. This yields a p-value of 0.0001, which is below the specified threshold. Thus we can conclude that DA-LeNet performs at a level comparable to LeNet on the MNIST dataset and does not decrease performance.
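A sketch of such a TOST using statsmodels; since the thesis reports only means and standard deviations over five iterations, the per-run accuracies below are invented for illustration, as is the equivalence margin of ±0.1 percentage points:

```python
from statsmodels.stats.weightstats import ttost_ind

# Hypothetical per-run accuracies (five iterations each); only the means
# and standard deviations are reported in the thesis.
lenet = [98.97, 99.05, 98.90, 98.99, 98.94]   # illustrative values
dacnn = [98.96, 98.91, 99.01, 98.98, 98.94]   # illustrative values

# Equivalence margin: the two means must lie within ±0.1 percentage points.
p_value, lower_test, upper_test = ttost_ind(lenet, dacnn, -0.1, 0.1)
print(p_value)  # below 0.01 -> significantly equivalent
```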

5.1.2 Can a DACNN perform a complex task equally as well as a normal CNN performs a less complex task?

The usefulness of a DACNN only becomes apparent when several objects are visible and the directing of attention can make a CNN focus specifically on one of those objects. The current question is posed to test whether focusing on one object among several will result in a performance comparable to when only one object is visible and no explicit focusing is applied.

In order to answer this question, two datasets have been generated using the MNIST dataset. The first of these datasets is called Single Random Quadrant MNIST and shall be referred to as the SRQM dataset. The images in the SRQM dataset are 56 by 56 pixels, twice the size of a normal MNIST image. The SRQM images can be said to have four quadrants, each of which is the size of a normal MNIST image. Each SRQM image contains one digit, which is added to a quadrant that is chosen at random. This means that a digit can be located in any of the four quadrants. Figure 28 shows some examples of the SRQM dataset.


Figure 28: Examples of images from the SRQM dataset.

The second generated dataset is Quadrant MNIST, which shall be referred to as the QM dataset. The QM dataset is the same as the SRQM dataset, but instead of filling only one quadrant with a digit, every image in this dataset contains four digits from the original MNIST dataset. One of the four quadrants corresponds to the actual label of that data sample and the other three quadrants are filled with randomly assigned filler digits. Each of the digits in one image of the QM dataset is unique in that image; thus there are no duplicate digits within one image. See Figure 29 for some example images.


Figure 29: Examples of quadrant images.

For the QM dataset a set of masks has been generated, one per quadrant, in the same order as the quadrants are numbered. Additionally, a set of inverse masks has been generated to experiment explicitly with both foreground and background masks. The inverse masks can be seen in Figure 31.

Figure 30: Examples of masks used in combination with QM. The quadrants are numbered in the same order as shown in this figure. (Note: a small grey border has been added to visualize the edges of each mask, however these borders are not part of the actual masks.)

Figure 31: Examples of inverted masks used in combination with QM. The quadrants are numbered in the same order as shown in this figure. (Note: a small grey border has been added to visualize the edges of each mask, however these borders are not part of the actual masks.)

To answer the second question, we compare the performance of DA-LeNet on the QM dataset with that of LeNet on the SRQM dataset. The SRQM dataset poses the less complex task, in which the network does not have to disambiguate between several digits. Conversely, the QM dataset poses the complex task, in which the network needs to specifically focus on one of four digits. If DA-LeNet can reach a performance on the QM dataset that is equal to the performance of LeNet on the SRQM dataset, this shows that the DACNN can efficiently focus on specific parts of an image.

Besides varying which layers the masks are concatenated to, we use separate foreground and background masks as well as different weight initializations for the mask layers. This allows for an analysis of how well the various set-ups work.

In Table 3 the results of LeNet and DA-LeNet on the SRQM dataset are shown. To find an answer to the current question we are interested only in the performance of LeNet, which is shown in the first row of the table. The performance of LeNet is 98.32%.


Network    Fg-mask  Bg-mask  Mask Add  Fg-Weights  Bg-Weights  Accuracy
LeNet      -        -        -         -           -           98.32% (±0.05%)
DA-LeNet   X        X        1         Random      Random      98.27% (±0.09%)
DA-LeNet   X        X        1,2       Random      Random      98.24% (±0.05%)
DA-LeNet   X        X        1,2,3     Random      Random      97.96% (±0.13%)
DA-LeNet   X        X        1         Random      -1000       98.23% (±0.12%)
DA-LeNet   X        X        1,2       Random      -1000       98.34% (±0.09%)
DA-LeNet   X        X        1,2,3     Random      -1000       98.11% (±0.12%)
DA-LeNet   X        X        1,2       0.01        -0.01       98.29% (±0.11%)
DA-LeNet   X        X        1,2       0           0           98.30% (±0.12%)

Table 3: Results of LeNet and DA-LeNet on the SRQM dataset. The checkmarks (X) denote whether a foreground mask, a background mask or both are used. The layers to which the masks are added are denoted by: 1 (before the first convolutional layer), 2 (before the second convolutional layer) and 3 (before the fully connected layer). The random weight initialization is performed using Glorot (Xavier) uniform sampling. Each result shows its corresponding standard deviation, computed over five iterations. Each of the separate iterations was trained for fifteen epochs and the optimizer used for training was AdaDelta with an initial learning rate of 1.

Table 4 shows the results of LeNet and DA-LeNet on the QM dataset. The highest performance achieved by DA-LeNet on the QM dataset is 98.43%, whereas LeNet on the SRQM dataset achieved a performance of 98.32%.

We can perform a TOST test between the results of these two set-ups to determine whether the results are significantly equivalent. The TOST test yields a p-value of 0.0002, which is below the threshold of 0.01. Thus DA-LeNet achieves results on the QM dataset that are significantly equivalent to the results of LeNet on the less complex SRQM dataset.

Network    Fg-mask  Bg-mask  Mask Add  Fg-Weights  Bg-Weights  Accuracy
LeNet      -        -        -         -           -           24.77% (±0.32%)
DA-LeNet   X        -        1         Random      -           97.97% (±0.06%)
DA-LeNet   X        -        1,2       Random      -           97.98% (±0.13%)
DA-LeNet   X        -        1,2,3     Random      -           97.85% (±0.17%)
DA-LeNet   -        X        1         -           Random      98.23% (±0.12%)
DA-LeNet   -        X        1,2       -           Random      98.31% (±0.12%)
DA-LeNet   -        X        1,2,3     -           Random      97.99% (±0.12%)
DA-LeNet   X        X        1         Random      Random      98.07% (±0.10%)
DA-LeNet   X        X        1,2       Random      Random      98.20% (±0.15%)
DA-LeNet   X        X        1,2,3     Random      Random      97.99% (±0.25%)
DA-LeNet   X        X        1         Random      -1000       98.06% (±0.46%)
DA-LeNet   X        X        1,2       Random      -1000       98.41% (±0.08%)
DA-LeNet   X        X        1,2,3     Random      -1000       98.22% (±0.10%)

Table 4: Results of LeNet and DA-LeNet on the QM dataset. The notation is the same as in Table 3.


5.1.3 How do the performances of a DACNN and a CNN compare on the QM dataset?

If the standard LeNet network is trained on the MNIST dataset, a test accuracy of ∼99% can be achieved. The expectation is that if the same network structure is trained on the Quadrant MNIST dataset the accuracy that can be achieved will be ∼25%. This is because the actual class for each image is one of four visible digits, meaning that the network has to randomly pick one of the four and thus can at most achieve an accuracy of one out of four. By pointing out the correct label with DA-LeNet, the performance should greatly increase compared to the ∼25%.

The results in Table 4 show that the performance of LeNet on the QM dataset is 24.77%. This is in line with the expected behaviour of LeNet having to randomly choose one out of four digits. Additionally, the highest performance of DA-LeNet on the QM dataset is 98.43%, which is close to the expected performance of LeNet on MNIST and a great improvement over the performance of 24.77% of LeNet on the QM dataset.

5.1.4 What settings work well for a DACNN?

Following is an analysis of the various set-ups of DA-LeNet and how the different settings influence performance. Firstly, we look at the influence of the type of mask used on the performances in Table 4. Surprisingly, only using a background mask performs better than only using a foreground mask. Using both a foreground mask and a background mask results in performances that are similar to the results of using only a background mask.

Next, the influence on performance of the layers to which the masks are added is analysed. The largest difference in performance is between the cases where the mask is added only to the convolutional layers (1,2) and the cases where the mask is added only to the first convolutional layer (1) or to all the layers (1,2,3). The network seems to perform slightly better when the masks are added only to the convolutional layers (1,2). This can be seen in both Table 3 and Table 4.

Finally, we analyse how the initialization of the mask weights influences the performance of the network. For this analysis we look specifically at the cases where both the foreground and the background mask are used, in Table 3 and Table 4. In most cases, the networks where both mask weights are initialized randomly have a lower performance than those where a non-random weight initialization is used. However, the performances of the networks that have a non-random weight initialization (-1000, 0.01 and 0) do not differ much.

5.1.5 Conclusion

From these experiments we can conclude that a DACNN does not decrease performance in comparison to a normal CNN on the same dataset (MNIST). Additionally, the DACNN achieves a performance on the QM dataset that is significantly equivalent to the performance a normal CNN can achieve on the SRQM dataset. This means that the masks used in the DACNN efficiently help the network differentiate between the various objects shown in one image. Additionally, the performance of a CNN on the QM dataset is relatively low, underlining the usefulness of the DACNN even more. Finally, the setting of a DACNN that has the most influence on performance in these experiments is the initialization of the mask weights.


5.2 Context in DACNNs

The next experiments test whether DACNNs are also able to use context information to improve classification performance. There is only one question, which can be answered in two ways. Both answers are addressed in the following section.

5.2.1 Does using a DACNN improve classification performance of classes that are reinforced by context information?

To study the effect of context information on classification, a new dataset is generated called Correlated Quadrant MNIST, referred to as the CQM dataset. The CQM dataset is effectively the same as the Quadrant MNIST dataset introduced in the previous section, but with a correlation between the digits ‘0’ and ‘1’: if a zero occurs, a one is placed in the quadrant directly beneath it, and if a one occurs, a zero is placed in the quadrant directly above it. The zero and the one cannot occur separately from each other. All other digits behave just as in the Quadrant MNIST dataset, i.e. they are placed in random quadrants and the filler digits (non-label digits) are sampled randomly with no duplicate digits in one image. Examples of this dataset can be seen in the top row of Figure 32.

There is also a variation of this dataset in which the label quadrant is occluded, i.e. left empty. This dataset is called Occluded Correlated Quadrant MNIST and is referred to as the OCQM dataset. Examples are shown in the bottom row of Figure 32.
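To make the construction of these datasets concrete, the following is a minimal NumPy sketch of how a single CQM or OCQM image could be composed. It assumes 28x28 MNIST digits placed on a 56x56 canvas; all names are illustrative and not taken from the actual generation code:

import numpy as np

# Top-left corners of the four quadrants on a 56x56 canvas of 28x28 digits.
QUADRANTS = {"tl": (0, 0), "tr": (0, 28), "bl": (28, 0), "br": (28, 28)}

def make_cqm_image(digits_by_class, label, rng, occlude_label=False):
    # digits_by_class[c] is a list of 28x28 arrays for class c; label is the
    # class of the labelled quadrant. Returns the canvas and a binary
    # foreground mask marking the label quadrant.
    canvas = np.zeros((56, 56), dtype=np.float32)

    def paste(cls, quad):
        y, x = QUADRANTS[quad]
        imgs = digits_by_class[cls]
        canvas[y:y + 28, x:x + 28] = imgs[rng.integers(len(imgs))]

    if label in (0, 1):
        # Correlation: a '0' always has a '1' directly beneath it.
        top = rng.choice(["tl", "tr"])
        bottom = {"tl": "bl", "tr": "br"}[top]
        label_quad = top if label == 0 else bottom
        if not occlude_label or label != 0:
            paste(0, top)
        if not occlude_label or label != 1:
            paste(1, bottom)
        used = {top, bottom}
    else:
        label_quad = rng.choice(sorted(QUADRANTS))
        used = {label_quad}
        if not occlude_label:
            paste(label, label_quad)

    # Fill the remaining quadrants with distinct filler digits from '2'-'9',
    # so a '0' or '1' never appears without its correlated partner.
    fillers = rng.choice([c for c in range(2, 10) if c != label],
                         size=4 - len(used), replace=False)
    for cls, quad in zip(fillers, (q for q in QUADRANTS if q not in used)):
        paste(int(cls), quad)

    fg_mask = np.zeros_like(canvas)
    y, x = QUADRANTS[label_quad]
    fg_mask[y:y + 28, x:x + 28] = 1.0
    return canvas, fg_mask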

Figure 32: Examples of correlated quadrant images (the six example panels are labelled, from left to right: 1, 2 and 5 in the top row; 1, 6 and 0 in the bottom row). The top row consists of images without occlusion from the Correlated Quadrant MNIST dataset. The bottom row consists of images with the label quadrant occluded from the Occluded Correlated Quadrant MNIST dataset. The introduced correlation between zero and one can be observed in these example images.

Two experiments are conducted on these datasets to analyse the influence of the DACNN and of the context correlation on performance.

In Table 5 we see that the classification performance of the 01-classes of LeNet on the MNIST dataset is 99.22%. It can also be seen that the classification performance of the rest-classes is already lower than that of the 01-classes.

Network    01 Class Accuracy    Rest Class Accuracy
LeNet      99.22% (±0.12%)      98.13% (±0.13%)

Table 5: Accuracy of the classes ‘0’ and ‘1’ and the rest-classes (‘2’-‘9’) of LeNet on MNIST.

Looking at Table 6 we see that the highest performance achieved on the 01-classes by DA-LeNet is 99.86%. Compared to the 99.22% of the 01-classes of LeNet on MNIST, this is an improvement of only 0.64 percentage points. However, considering that the maximum improvement that can be reached is 0.78 percentage points, the achieved improvement is relatively large.

Network     Fg-mask   Bg-mask   Mask Add   Fg-Weights   Bg-Weights   01 Class Accuracy   Rest Class Accuracy
DA-LeNet    X         -         1,2        Random       -            99.85% (±0.02%)     97.76% (±0.26%)
DA-LeNet    X         -         1,2,3      Random       -            99.86% (±0.03%)     97.51% (±0.15%)
DA-LeNet    -         X         1,2        -            Random       99.77% (±0.10%)     97.92% (±0.18%)
DA-LeNet    -         X         1,2,3      -            Random       99.72% (±0.08%)     97.82% (±0.15%)
DA-LeNet    X         X         1,2        Random       Random       99.84% (±0.08%)     97.85% (±0.18%)
DA-LeNet    X         X         1,2,3      Random       Random       99.79% (±0.05%)     97.50% (±0.33%)
DA-LeNet    X         X         1,2        Random       -1000        99.56% (±0.02%)     98.08% (±0.12%)
DA-LeNet    X         X         1,2,3      Random       -1000        99.43% (±0.08%)     97.76% (±0.15%)
DA-LeNet    X         X         1,2        0.01         -0.01        99.75% (±0.07%)     98.15% (±0.17%)
DA-LeNet    X         X         1,2        0            0            99.68% (±0.09%)     98.08% (±0.19%)

Table 6: Accuracy of the classes ‘0’ and ‘1’ and the rest-classes (‘2’-‘9’) of DA-LeNet on the CQM dataset. The checkmarks (X) denote whether a foreground mask, a background mask or both are used. The layers to which the masks are added are denoted by: 1 (before the first convolutional layer), 2 (before the second convolutional layer) and 3 (before the fully connected layer). The random weight initialization is performed using Glorot (Xavier) uniform sampling. Each result shows its corresponding standard deviation, computed over five iterations. Each iteration was trained for fifteen epochs using the AdaDelta optimizer with an initial learning rate of 1.

To test whether the difference in performance between LeNet and DA-LeNet is significant, a t-test has been conducted. The t-test resulted in a p-value of 0.0005, which is lower than the threshold of 0.01. Thus, on the 01-classes, DA-LeNet on the CQM dataset performs significantly better than LeNet on MNIST.
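As an illustration of how such a test can be conducted, a minimal SciPy sketch is given below. The per-run accuracy values are placeholders rather than the actual measurements, and an independent two-sample t-test is assumed:

from scipy import stats

# Per-run 01-class accuracies (%) over five iterations; placeholder values.
lenet_mnist_01 = [99.22, 99.10, 99.30, 99.15, 99.33]
dalenet_cqm_01 = [99.86, 99.84, 99.88, 99.83, 99.89]

# Two-sample t-test on the 01-class accuracies of both networks.
t_stat, p_value = stats.ttest_ind(dalenet_cqm_01, lenet_mnist_01)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# The difference is called significant if p < 0.01.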

Another observation that can be made is that the accuracies of the rest-classes of DA-LeNet on CQM diminish when compared to the results of LeNet on MNIST. This is expected behaviour, as the classification task is more complex when using the CQM dataset. However, the accuracy of the 01-classes does increase using DA-LeNet on CQM. Thus, even though the CQM dataset poses a more complex task than MNIST, DA-LeNet is able to leverage the contextual information of the correlation to improve the performance.

5.2.2 Can a DACNN classify a label digit based only on its context?

In the second experiment, DA-LeNet is tested on the OCQM dataset, in which the label digit is occluded. Due to the introduced correlation, the context then only carries information about the classification of the 01-classes. Therefore, the performance of the rest-classes should drop drastically, while the performance of the 01-classes should stay comparatively high.

Because there is an inherent imbalance between the per-class performances on MNIST, a direct comparison between the performances of the 01-classes and those of the rest-classes is not conclusive. Thus, when analysing the following results, we do not make this direct comparison. Instead, we verify whether the performances behave as expected. If the label digit is occluded, DA-LeNet has to guess what the label is. If DA-LeNet were unable to leverage the contextual information, the performances of both the 01-classes and the rest-classes should drop drastically, and by roughly the same amount. Looking at the results in Table 7, however, it can be seen that this is not the case. In most cases, the performance of the 01-classes has decreased somewhat in comparison to the results in Table 6, but not nearly as much as the performance of the rest-classes. Thus, from this experiment it can also be concluded that DA-LeNet is able to use information from the context of the label digit to improve classification certainty.

Network     Fg-mask   Bg-mask   Mask Add   Fg-Weights   Bg-Weights   01 Class Accuracy    Rest Class Accuracy
DA-LeNet    X         -         1,2        Random       -            48.58% (±5.55%)      14.74% (±0.73%)
DA-LeNet    X         -         1,2,3      Random       -            58.47% (±7.10%)      14.79% (±0.97%)
DA-LeNet    -         X         1,2        -            Random       33.90% (±5.56%)      14.07% (±0.48%)
DA-LeNet    -         X         1,2,3      -            Random       17.11% (±12.68%)     14.05% (±1.19%)
DA-LeNet    X         X         1,2        Random       Random       45.31% (±9.35%)      13.37% (±1.39%)
DA-LeNet    X         X         1,2,3      Random       Random       47.49% (±5.70%)      13.32% (±1.10%)
DA-LeNet    X         X         1,2        Random       -1000        26.76% (±24.00%)     8.80% (±2.35%)
DA-LeNet    X         X         1,2,3      Random       -1000        26.76% (±24.00%)     9.23% (±2.80%)
DA-LeNet    X         X         1,2        0.01         -0.01        21.63% (±18.39%)     11.57% (±2.78%)
DA-LeNet    X         X         1,2        0            0            35.84% (±18.49%)     11.34% (±2.57%)

Table 7: Accuracy of the classes ‘0’ and ‘1’ and the rest-classes (‘2’-‘9’) of DA-LeNet on the OCQM dataset (label digit occluded). The checkmarks (X) denote whether a foreground mask, a background mask or both are used. The layers to which the masks are added are denoted by: 1 (before the first convolutional layer), 2 (before the second convolutional layer) and 3 (before the fully connected layer). The random weight initialization is performed using Glorot (Xavier) uniform sampling. Each result shows its corresponding standard deviation, computed over five iterations. Each iteration was trained for fifteen epochs using the AdaDelta optimizer with an initial learning rate of 1.

It can be observed that the standard deviations of the results for the 01-classes in Table 7 are relatively large. This is most likely caused by the occlusion of the label digits: for the 01-classes, a stable prediction depends on whether the network learns the inherent correlation between the 0 and the 1. The high standard deviation implies that DA-LeNet learned the correlation in only some of the runs; those runs achieved relatively high performance, while the remaining runs achieved relatively low performance.

Looking at the weight initializations in Table 7, it can further be seen that initializing the weights of the background masks to a negative number negatively influences the performance of the 01-classes. Overall, initializing the mask weights randomly seems to perform better than any other initialization.

5.2.3 Conclusion

The experiments described in this section were conducted to answer the question ‘Does using a DACNN on the CQM dataset improve classification performance of the 01-classes?’. From the results of the experiments it can be concluded that DA-LeNet is able to achieve a higher classification performance for the 01-classes on the CQM dataset than LeNet on the MNIST dataset. Additionally, when the label digit is occluded while testing DA-LeNet, the performance of the 01-classes does not drop to random chance as is the case for the rest-classes. Both of the experiments show that DA-LeNet is able to use contextual information to increase classification certainty.


5.3 DACNN Object Detection

The goal of the previous experiments on MNIST and its variations was to prove the theoretical basis of the use of DACNNs. However, to actually test the usefulness of DACNNs they have to be applied to more realistic and complex data. In the following experiments the PASCAL Visual Object Classes 2007 [5] dataset is used. PASCAL VOC is an image dataset that is used in object classification and detection competitions. The dataset contains twenty object classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train and tv/monitor. Examples of these classes can be seen in Figure 33.

For every image in the dataset an annotation file exists that contains ground truth bounding boxes and class labels for every object that occurs in the image. For these experiments DACNNs are used to perform object classification on the PASCAL VOC dataset, using the ground truth bounding boxes.
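For reference, the ground truth boxes and labels can be read from the PASCAL VOC XML annotation files roughly as follows (a minimal sketch; the example path is illustrative):

import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    # Returns a list of (class_name, (xmin, ymin, xmax, ymax)) tuples.
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        bndbox = obj.find("bndbox")
        box = tuple(int(float(bndbox.find(tag).text))
                    for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((name, box))
    return objects

# Example (illustrative path):
# annotations = read_voc_annotation("VOCdevkit/VOC2007/Annotations/000005.xml")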

Figure 33: Examples of the classes of PASCAL VOC 2007. From top left to bottom right, the classes are respectively: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train and tv/monitor.

5.3.1 How does the performance of a DACNN compare to the CNN baselines?

The DACNN is compared with two VGG16 baselines that use different preprocessing methods. The first preprocessing method crops the RoI from the full image and rescales it while retaining the aspect ratio. This baseline will be referred to as the ‘cropping baseline’. By cropping the RoIs, the classification task becomes translation invariant, because the object region is no longer in a variable location in the full image. This complicates the comparison between the DACNN and the cropping baseline, since using the DACNN does not make the classification task translation invariant.

The other preprocessing method blacks out all but the RoI, instead of cropping the RoI out of the image. This baseline will be referred to as the ‘blackout baseline’. The blackout baseline conveys the same information as cropping the RoIs, but it does not make the classification task translation invariant. Thus, the classification performance of the blackout baseline is more directly comparable to that of the DACNN.
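The two preprocessing methods can be sketched as follows. This is a minimal NumPy/PIL sketch that assumes RGB uint8 image arrays and a 224x224 network input; the exact rescaling and padding used in the thesis may differ:

import numpy as np
from PIL import Image

def crop_baseline(image, box, target=224):
    # Cut out the RoI and rescale its longest side to `target` pixels,
    # retaining the aspect ratio; the remainder is zero-padded.
    xmin, ymin, xmax, ymax = box
    roi = image[ymin:ymax, xmin:xmax]
    h, w = roi.shape[:2]
    scale = target / max(h, w)
    new_w, new_h = max(1, int(w * scale)), max(1, int(h * scale))
    resized = np.array(Image.fromarray(roi).resize((new_w, new_h)))
    canvas = np.zeros((target, target, 3), dtype=image.dtype)
    canvas[:new_h, :new_w] = resized
    return canvas

def blackout_baseline(image, box):
    # Keep the full image size, but zero out everything outside the RoI.
    xmin, ymin, xmax, ymax = box
    out = np.zeros_like(image)
    out[ymin:ymax, xmin:xmax] = image[ymin:ymax, xmin:xmax]
    return out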

In the following experiments we use the DA-VGG16 architecture with varying set-ups to compare with the CNN baselines. The varying set-ups are similar to those of DA-LeNet. For example, the masks are either added only to the input of the convolutional layers (Figure 24), or to the input of the convolutional layers and the input of the first fully connected layer (Figure 38, Appendix B). Furthermore, foreground and background masks are used both simultaneously and separately, with varying weight initializations.

The convolutional layers of both VGG16 and DA-VGG16 are initialized with pre-trained ImageNet weights to speed up convergence.
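A minimal Keras-style sketch of how the masks can be concatenated to the inputs of the convolutional layers is given below. Only the first two VGG-style blocks are shown, all names are illustrative, and the loading of pre-trained kernels into the image-channel slices is indicated only by a comment:

from tensorflow.keras import Input, Model, layers

def da_block(x, mask, filters, n_convs, name):
    # One VGG-style block: the (pooled) binary mask is concatenated to the
    # input of every convolutional layer, so no information is cropped away.
    for i in range(n_convs):
        x = layers.Concatenate()([x, mask])
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu",
                          name=f"{name}_conv{i + 1}")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    mask = layers.MaxPooling2D((2, 2))(mask)  # keep mask at feature-map resolution
    return x, mask

image = Input((224, 224, 3))
fg_mask = Input((224, 224, 1))  # binary mask marking the RoI
x, m = da_block(image, fg_mask, 64, 2, "block1")
x, m = da_block(x, m, 128, 2, "block2")
# ... blocks 3-5, the SPP layer and the fully connected layers would follow;
# the image-channel kernel slices can then be filled with pre-trained VGG16
# weights, with the mask-channel slices initialized as described earlier.
model = Model([image, fg_mask], x)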

The expected outcome of these experiments is that the performance of DA-VGG16 will be higher than that of both the cropping baseline and the blackout baseline.

In Table 8 the results of the VGG16 baselines on the PASCAL VOC 2007 dataset are shown. It can be seen that the performances of the blackout baseline are overall lower than those of the cropping baseline. This is most likely caused by the fact that the cropping baseline is translation invariant while the blackout baseline is not. Translation invariance makes the classification task less complex, resulting in a higher overall performance for the cropping baseline.

The SPP depth influences the performance of the cropping baseline differently than that of the blackout baseline. For the cropping baseline, more depth has a positive influence on performance. Conversely, less depth in the SPP layer works better for the blackout baseline. This effect can be explained: more SPP depth results in feature extraction at several scales and regions. In the blackout baseline, however, only a small part of the image is the RoI and the rest of the image is black, so extracting features at several scales and regions will most likely not provide additional information. In the cropping baseline the full image is the RoI, so extracting features from separate regions and scales of the image does provide additional information, as all these regions and scales lie within the RoI.

Optimizer   SPP depth   Cropping Accuracy   Blackout Accuracy
SGD         1,2,4       84.50%              34.95%
SGD         1           83.93%              77.36%
Adam        1,2,4       79.68%              74.43%
Adam        1           81.97%              77.18%

Table 8: Results of VGG16 on cropped and blacked out PASCAL VOC 2007 data using different optimizers and SPP settings. The networks were trained for fifteen epochs, with a learning rate of 0.00001.
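For reference, a minimal NumPy sketch of such a spatial pyramid pooling layer is given below; max pooling is assumed, and the thesis implementation may differ in detail:

import numpy as np

def spp(feature_map, depths=(1, 2, 4)):
    # Max-pool a (H, W, C) feature map over an n x n grid for every n in
    # `depths` and concatenate the results into one fixed-length vector.
    h, w, c = feature_map.shape
    pooled = []
    for n in depths:
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(cell.max(axis=(0, 1)))
    # depths=(1,) yields C values; depths=(1, 2, 4) yields (1+4+16) * C values.
    return np.concatenate(pooled)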


Optimizer   SPP depth   Mask add        Fg-mask   Bg-mask   Fg-weights   Bg-weights   Accuracy
SGD         1,2,4       all, excl. fc   X         X         U(0,1)       U(0,1)       69.67%
SGD         1           all, excl. fc   X         X         U(0,1)       U(0,1)       81.45%
Adam        1,2,4       all, excl. fc   X         X         U(0,1)       U(0,1)       10.30%
Adam        1           all, excl. fc   X         X         U(0,1)       U(0,1)       59.09%
SGD         1           all, incl. fc   X         X         U(0,1)       U(0,1)       80.75%
SGD         1           all, excl. fc   X         X         0            0            83.95%
SGD         1           all, incl. fc   X         X         0            0            83.64%
SGD         1           all, excl. fc   X         -         U(0,1)       -            78.44%
SGD         1           all, excl. fc   -         X         -            U(0,1)       82.20%
SGD         1           all, incl. fc   X         -         U(0,1)       -            83.55%
SGD         1           all, excl. fc   -         X         -            0            80.20%
SGD         1           all, incl. fc   X         -         0            -            82.76%

Table 9: Results of DA-VGG16 on masked PASCAL VOC 2007 data using different optimizers, SPP settings and masking set-ups. The checkmarks (X) denote whether a foreground mask, a background mask or both are used. The layers to which the masks are added are denoted by: all, excl. fc (before all convolutional layers, not before the fully connected layer) and all, incl. fc (before all convolutional layers and before the first fully connected layer). The networks were trained for fifteen epochs, with a learning rate of 0.00001.

It can be seen that the best result of 83.95% is achieved when the masks are added to all convolutional layers but not to the fully connected layer, and both foreground and background masks are used with a weight initialization of zero. We compare this result of DA-VGG16 with the highest performances of the cropping baseline and the blackout baseline, which are 84.50% and 77.36% respectively. DA-VGG16 thus performs better than the blackout baseline, while the cropping baseline performs slightly better than DA-VGG16. The most likely reason for this is again that cropping the data makes the classification task translation invariant, so that classifying the cropped RoIs is less complex than the classification in the blackout baseline or with DA-VGG16.

5.3.2 How well can a DACNN perform object classification based only on the context?

To test how well DA-VGG16 can perform the classification based on context only, an additional experiment is conducted. In this experiment the network is trained normally, with both the RoI and the context visible. During testing, however, the RoI is occluded and only the context remains visible. The expectation is that DA-VGG16 will perform better than random choice, which is 5% (there are twenty classes, so 1/20).
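Occluding the RoI at test time is essentially the inverse of the blackout baseline; a minimal sketch (names illustrative):

import numpy as np

def occlude_roi(image, box):
    # Zero out the RoI so only the context remains visible; the binary masks
    # fed to the DACNN still mark the (now empty) RoI location.
    xmin, ymin, xmax, ymax = box
    out = image.copy()
    out[ymin:ymax, xmin:xmax] = 0
    return out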

Table 10 shows the results of two set-ups of DA-VGG16 tested on PASCAL VOC 2007 with the RoI occluded, so that DA-VGG16 classified the RoIs based only on their context. The baselines using cropping and blacking out would not have any context information available to perform the classification, and would thus have to choose classes randomly. If DA-VGG16 performed randomly, it would achieve a score of ∼5%, as there are twenty classes in PASCAL VOC. It can be seen, however, that both set-ups of DA-VGG16 perform better than random. The highest performance achieved by DA-VGG16 on the occluded PASCAL VOC data is 16.63%, more than three times higher than random performance.
