Automatic Indoor 3D Scene Classification using RGB-D data



Track: Computer Vision

Master Thesis

Automatic Indoor 3D Scene Classification using RGB-D data

by

Nick de Wolf

10165401

September 2016

42 EC
Supervisors: Prof. Dr. T. Gevers, S. Karaoglu
Assessor: Dr. P.H. Rodenburg

Abstract

Being able to understand natural scenes is crucial to many activities for humans and animals. A large portion of scene understanding is derived from vision, and as a result a great amount of research has been put into this topic in the field of computer vision. Most of this research has been oriented towards 2D approaches, while 3D approaches have only seen an increase in popularity in recent years. Depth sensors have become more precise and affordable, which paved the way for gathering 3D data on a larger scale. In this work, we investigate the potential performance gains that this new depth data can bring to existing 2D methods in two different tasks within scene classification, namely object proposal generation and scene category classification. First, we extend a popular 2D approach for object proposal generation with two novel depth-based features that use the information gathered from the 3D point cloud. The results show that the combination of the default algorithm with these additional features can achieve similar recall and MABO scores while generating significantly fewer object proposals. Second, we create a global depth-based feature for the task of scene category classification that uses the detected objects in a scene. The spatial relations of the objects are used to generate a context-based feature. This feature is used in combination with deep learning approaches on both color and depth information to train an SVM classifier for 21 different scene classes. The results show that using this feature in combination with the deep learning approaches yields an increase in mAP scores on the classification task.

Acknowledgements

I would like to thank my supervisor Theo Gevers for providing me the opportunity to work on this interesting topic and providing guidance during our meetings.

In addition, I would like to thank my colleagues at 3DUniversum; their knowledge of the fields related to my thesis and their willingness to help have certainly improved the quality of this work. In particular, I would like to thank my daily supervisor Sezer Karaoglu for his guidance, ideas, and feedback during the process of writing this thesis. I am also grateful to Morris Franken for his assistance with setting up the server, and his extended knowledge of the tools used in this thesis.

Contents

Abstract
Acknowledgements
1 Introduction
   1.1 Motivation
   1.2 Goal
   1.3 Thesis Outline
2 Background
   2.1 Object proposals
   2.2 Image Oversegmentation
      2.2.1 RGB-D Oversegmentation
   2.3 Selective Search
   2.4 Scene classification
   2.5 Convolutional Neural Networks
      2.5.1 Convolutional layer
      2.5.2 Pooling layer
      2.5.3 Fully connected layer
      2.5.4 GoogleNet
   2.6 Features using context
3 Methodology
   3.1 Extending Selective Search
   3.2 CNN with additional depth features
      3.2.1 Fine-tuning CNN
4 Experiments
   4.1 Dataset
      4.1.1 Properties
   4.2 Data pre-processing and Analysis
   4.3 Implementation
   4.4 Extended Selective Search
   4.5 Scene Classification
5 Results
   5.1 Extended Selective Search
      5.1.1 Evaluation Metrics
      5.1.2 Reducing segment count
      5.1.3 Varying Colorspaces
      5.1.4 Weighing depth features
      5.1.5 Varying IoU overlap threshold
   5.2 Scene Classification
      5.2.1 Evaluation Metrics
      5.2.2 Combining measures
6 Conclusion


1 Introduction

1.1 Motivation

The human ability to understand a scene is crucial to many natural activities, such as recognition, navigation, and general interaction with the environment. Knowing where you are and in what kind of environment you are can also be useful in tasks where robots have to navigate, as their navigation models could be fine-tuned to different room types.

Scene understanding has always been an active field of research within computer vision, and significant progress has been achieved over the past decades. Until recently, research mostly focused on using standard RGB images for scene classification tasks. State-of-the-art methods [1,2] have used hand-crafted features such as HOG [3], SIFT [4], and SURF [5]. These methods use a variety of techniques, such as computing descriptors at generated keypoints in images. These descriptors can be used to efficiently describe and compare images by matching the descriptor of an image against already classified images. In order to create efficient features, a sufficient amount of prior knowledge is required, and the accuracy of the resulting method is largely determined by the creativity of the designer.

The current state-of-the-art approaches mostly use Convolutional Neural Networks (CNNs) [6] instead of hand-crafted features for the scene classification task [7–10]. One of the big advantages of CNNs over hand-crafted features is that one does not have to design features by hand; instead, the CNN essentially learns the features. CNNs have become highly involved in the learning process for computer vision tasks. One of the reasons that CNNs have only become popular recently is that they require a large amount of image data to be trained properly. In recent years, large datasets of image data have become available [11–13], which in turn allowed large CNNs to be trained without the costs associated with gathering all this data. Another reason is the improvement in the computational power of systems that are now widely available. Today, most of the computations with a CNN are performed using a GPU or a cluster of GPUs [14].

While most state-of-the-art conventional methods and CNNs that work with RGB images perform well on most scenes [8–10,15,16], their performance can deteriorate under different conditions such as varying lighting conditions (e.g. highlights, shadows, and dim light) [17]. Moreover, these methods inherently lack geometric information about the scene because, in essence, a picture is a 2D projection of the 3D world. This loss of information makes scene classification algorithms less robust to varying image conditions. By adding depth as an additional channel to the image, geometric information can be exploited at the pixel level.

Geometric information can be useful for deriving contextual information from a scene. If we look at the task of object detection, we can see that previous work has used the relations between objects before [18–21]. Rabinovich et al. [19] demonstrated that the presence of a specific object class inside a scene can influence the probability of an object of another class being present. The work of Torralba et al. [18] showed that certain objects occur more often in certain scenes than others. For instance, it is more likely to find a toilet inside a bathroom than inside an office. By adding additional depth information to each pixel, one could also obtain properties such as the relative sizes between objects or the distance between objects in the real world.

Depth measurements are invariant to variables such as lighting conditions, which could lead to more robust features. Another benefit is that, unlike methods that attempt to estimate the 3D structure of a scene with techniques like Structure from Motion [22], this depth data is already available. Hence, generating the 3D structure is computationally less expensive and, provided that the depth sensor is accurate, also more precise. Being able to use this additional information freely would enable algorithms to extract the scene structure more accurately, which could improve the performance in various tasks within scene classification.

In recent years, depth sensors have become available on the consumer market for an affordable price. This enables us to acquire reliable depth maps at a very low cost, stimulating the use of this additional channel of data. Current depth sensors are accurate for indoor scenes, where most affordable sensors have an effective range from 50 centimeters up to five to seven meters. By combining a depth sensor with a regular camera, the lack of geometric data in RGB images can be solved, resulting in RGB-D data [23]. In order to make optimal use of the depth data generated by these sensors, the scope of the scene understanding tasks in this work is limited to indoor scenes. For indoor scenes, there is a higher likelihood that the edges of the scene stay within the effective range of the depth sensor than for outdoor scenes.

For the approach in this thesis, we will use object-object context information. Hence, it is important to be able to find the objects inside the scene. To detect objects in an image, one first has to find their approximate locations. Afterwards, a classifier can determine what kind of object is present at each location, if any. Until recent years, the Sliding Windows paradigm was used in most successful approaches [3,24–26]. This paradigm means that an image has to be searched systematically for potential objects. The main problem lies in the number of locations that have to be searched for potential candidates. If you generate a large number of boxes, with a large number of different dimensions, you are certain to find every possible object location in the image. Unfortunately, applying detection algorithms to a high number of object locations is computationally intractable for most state-of-the-art object detectors [8–10, 15, 16]. In recent years, approaches have been suggested that provide a trade-off between high detection quality and computational tractability under the name of object proposals [27–32]. Although they all use different approaches, they work under the common assumption that all objects have certain properties that differentiate them from the background and can thus be localized. These methods are able to maintain a high recall of the objects present in the scene, while using significantly fewer windows than Sliding Windows. This reduction in generated proposals could in turn improve object detection results [33].

Most common object proposal generators use just the color values and do not account for depth information. Yet, this additional depth information can be used to reduce the number of boxes significantly, while losing only a little accuracy in the object proposal generation step.

The contributions of this thesis are two-fold. The main contribution is the introduction of a novel approach for improving object proposals for object detection, by incorporating the additional depth channel into a popular existing method called Selective Search [32]. The second contribution is a potential application of these improved object proposals in combination with RGB-D data. For this task, we assume that the object proposals are used effectively in an object detector, and that the resulting detected objects are used to model the scene context. The detected objects are used to model the object-object spatial relations in a scene, and we show that the context provided by the detected objects can be used in combination with the depth information to improve scene category classification. These spatial relations are represented by a feature that uses the objects present in a scene and the distances between these objects. The performance of this novel feature is compared to the performance of an RGB CNN [8] and a depth CNN [7] for the task of scene category classification.

1.2 Goal

The focus of this thesis is on investigating the potential use of the additional depth channel in both the region proposal generation step and the scene category classification task. The research questions for these tasks are:

1. How can the additional depth channel of RGB-D data be used to improve the results of existing object proposal methods, originally implemented for RGB data?

2. Can the spatial object-object relations be used to generate a feature that can be used with state-of-the-art approaches for scene classification?

3. How do object-object based context features, based on the global context in an image, compare to local depth based features?

1.3 Thesis Outline

The organization of the remainder of this thesis is as follows: in Section 2 the background and prior work related to scene classification will be discussed. Section 3 will present the research approach for both the object proposal approach and the scene classification task. Implementation details and an elaboration on the experiments are given in Section 4. Section 5 presents the results and their corresponding analysis. Finally, our conclusions and possible directions for future work are presented in Section 6.


2 Background

As discussed in the introduction, this work focuses on object proposal generation and scene classification using RGB-D data. A common task within scene classification is object detection, which will only be discussed briefly. First, related work on object proposal generation is discussed, and then the background on scene recognition is presented.

2.1 Object proposals

In recent years, it has become important not just to determine what object is presented in an image, but also to determine the location of the object. Most state-of-the-art object detectors are designed to classify a single object at a time per given frame. As most images do not display just a single object, it should be possible to detect multiple objects in an image. In order to solve the problem of multiple objects per image, an often used approach is to divide the image into smaller windows in an attempt to capture all possible object locations. The most naive way of solving this problem is by using the Sliding Windows paradigm [3, 24, 25], in which windows (also called bounding boxes) are generated on a predefined grid, in an attempt to cover every possible window in the image. By using the Sliding Windows paradigm, the algorithm is constrained to computationally efficient classifiers. Because these approaches generally produce a large number of windows, it quickly becomes computationally intractable to apply expensive state-of-the-art object detectors, because of the significant computation time required per window.

Less naive ways of proposing object windows have also been suggested in recent years [27–32,34]. The goal of these object proposal methods is to reduce the number of generated windows, while keeping the high quality windows. The quality of an object proposal is determined by the probability of the proposal containing an object and how tightly the proposal fits the object. An example of some object proposals in an image is shown in Figure 2.1. In the figure, the green boxes, representing the ground truth, have a tight fit around an object in the scene and are preferred over the larger blue boxes, which contain many more background pixels. The red boxes do not capture any objects, or only part of an object, and most object proposal methods attempt to remove most of these types of boxes.

Figure 2.1: An example of generated object proposals inside a scene. The blue proposals should be scored lower than the green proposals (ground truth) because they are not as precise, and the red proposals should receive the lowest score, because they either do not cover an object, or only cover an object partially.

Because most object proposal methods filter out the lower quality bounding boxes, they generally generate fewer bounding boxes than the traditional Sliding Window paradigm. This reduction in generated bounding boxes allows one to use a computationally more expensive classifier in combination with these boxes to improve the results of, for instance, object detection pipelines. There is a variety of different approaches for generating object proposals. For instance, previous work has shown that a measure of Objectness can be determined for a window, which attempts to rank candidates on the probability that the candidate contains an object [34]. This score is computed by considering local cues, such as edges, corners, and contours. Recently, Cheng et al. [27] presented their Binarized Normed Gradients (BING) algorithm. BING can efficiently compute the Objectness of a window by resizing the window to an 8x8 window and taking the gradients to compute a 64D feature. As shown in their paper, windows with objects inside them usually share similarities in this 64D feature vector, which can therefore be used to measure Objectness. The above approaches attempt to reduce the number of generated bounding boxes by scoring and thresholding them.


Other approaches, such as the work of Endres and Hoiem [28], attempt to reduce this number by generating the windows differently. In their paper, regions are generated by placing seeds at various places in the image. For each of these seeds, a separate foreground-background segmentation is performed, which in turn generates the regions that will be considered. The main advantage of this seed-based approach is the high quality of the generated proposals, although this comes at a high computational cost.

The work of Hosang et al. [35] compares a number of state-of-the-art approaches on a variety of criteria. By limiting the scope to the three best performing methods on recall and detection results, we can split these three methods into two categories: Objectness [36] and superpixel merging [29, 32]. The approach of Zitnick and Dollár [36] starts with the concept of the Sliding Window paradigm, but instead of testing every possible location, it uses better estimates for the windows in combination with some refinement of the results afterwards. Their approach differs from the previously discussed methods that compute an Objectness score [27,34]. Those approaches use a variety of cues, such as the edges within the window. The algorithm of Zitnick and Dollár [36] only uses the edges within the window, and in contrast to the previous methods, only makes use of edges that belong to a contour that is fully contained within the window.

The approaches in the superpixel merging category do not use the Sliding Window paradigm, but instead they start by oversegmenting an image into a high number of small segments, often called superpixels. By generating a large number of segments per image, it is more likely that all pixels within each segment share a certain set of properties.

Segmentation has several advantages and disadvantages over just generating bounding boxes. A task for which it is used in general is semantic segmentation, where one does not only want to create segments in the image, but also label these segments. The labels could later be used in tasks such as object detection. The advantage over using bounding boxes is that a proper segmentation can capture even irregularly shaped objects more precisely than a bounding box. The pixels inside a segment should only include pixels that lie within the object boundaries, while the pixels inside a bounding box may also include a lot of background pixels. A disadvantage of using segmentation is that creating and storing the image segments takes more time and space than computing the four corners of a bounding box.

As the name suggests, the superpixel merging approaches attempt to merge segments together based on some similarity measure. For instance, Arbeláez et al. [29] introduced their Multiscale Combinatorial Grouping (MCG) method, which consists of both multiscale hierarchical bottom-up segmentation [37] and object candidate generation. The resulting segments are merged based on the edge strengths, and the resulting object proposals are ranked based on several cues such as edge strength, location, shape, and size. While this approach is amongst the top scoring approaches on both recall and detection scores [35], its repeatability and computational time per image are worse than those of the other two approaches.

The best scoring object proposal method according to the comparison of Hosang et al. [35] is the Selective Search [32] algorithm. Although this approach is slower than that of Zitnick and Dollár [36], it is able to output segments and not just bounding boxes. Selective Search also runs significantly faster than MCG and scores better on repeatability [35]. Selective Search starts with an initial oversegmentation and iteratively merges these regions until a single region remains. Afterwards, bounding boxes are generated for the regions of every iteration. For this thesis, we decided to use a superpixel merging approach because the generated segments are more versatile than plain bounding boxes. For instance, when using RGB-D data, the depth data can directly be extracted from the pixels inside the segments, instead of first having to distinguish the borders of the objects inside a bounding box. Selective Search performs best out of the superpixel merging approaches presented in the work of Hosang et al. [35]. Hence, this object proposal algorithm will also be used in this thesis.

As mentioned in the introduction, we want to use the additional depth channel of RGB-D data to improve the performance of existing object proposal methods for RGB data. In order to extract features from this additional channel effectively, the initial oversegmentation method will also have to use this extra channel. In the following section, we will first discuss the initial oversegmentation method used in Selective Search and some related work, before introducing oversegmentation algorithms that also use the extra depth channel. Afterwards, we will discuss Selective Search in greater detail in Section 2.3.

2.2 Image Oversegmentation

As mentioned in the previous section, the Selective Search algorithm starts with an oversegmentation of an image. For this initial oversegmentation, it uses the graph-based oversegmentation algorithm of Felzenszwalb and Huttenlocher [38] to generate the initial regions. In this section, we discuss both this method and another often used oversegmentation method, namely the normalized cuts algorithm [39]. Afterwards, oversegmentation methods based on RGB-D input data are discussed in Section 2.2.1.

(14)

Both oversegmentation methods are graph-based methods and are known for their speed and performance. A graph-based method represents an image as an undirected graph $G = (V, E)$, where $v \in V$ are the vertices; in this instance of image segmentation, $v$ represents a pixel in the image. The pairs $(v_i, v_j) \in E$ represent neighbouring vertices/pixels, and the weight of an edge represents the dissimilarity between the connected pixels. The normalized cuts algorithm recursively partitions the generated graph using cues like texture and contour. The graph is cut using the eigenvectors corresponding to the smallest eigenvalues, and the goal is to minimize the normalized cut criterion. A benefit of this approach is that the number of segments can be directly controlled by this criterion.

The graph-based method of Felzenszwalb and Huttenlocher [38] measures the likelihood of a boundary between two regions by comparing two quantities: the intensity differences between connected pixels within each region and the intensity differences across the potential boundary. The intuition behind this approach is that when the difference between the intensity at the boundary and the intensity within the regions is large, the boundary is likely to be correct. Unlike the normalized cuts algorithm, the number of superpixels or their size cannot be directly controlled with this algorithm.
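To make the boundary test concrete, the sketch below shows a minimal version of the merge predicate described above. It is not the thesis' implementation; the parameter k stands in for the size-dependent threshold of the original paper, and the internal variation and boundary weight are assumed to be tracked elsewhere.

```python
def should_merge(int_ri, size_ri, int_rj, size_rj, boundary_weight, k=300.0):
    """Felzenszwalb-Huttenlocher style test: merge two regions when the cheapest
    edge crossing their boundary is not much larger than the internal intensity
    variation of either region, relaxed by a size-dependent slack tau = k / size."""
    min_internal = min(int_ri + k / size_ri, int_rj + k / size_rj)
    # weak boundary evidence compared to internal variation -> no real boundary
    return boundary_weight <= min_internal
```

Larger k favours larger components, which is why the segment count cannot be set directly, only influenced.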

2.2.1 RGB-D Oversegmentation

The standard image oversegmentation methods are designed to work with RGB images and thus do not inherently use the depth channel that RGB-D images provide. For this work, we investigate whether using a segmentation method that uses depth information gives any improvement in the Selective Search algorithm. So far, we have only discussed methods that work on RGB images. In particular, we focused on superpixel methods that reduce the number of regions that have to be considered by, for instance, object detection algorithms. Although these methods can certainly be effective, the fact that superpixels have to stay within object boundaries using only the 2D projection of the 3D scene means they do not use all the available information.

An example where a color-based segmentation algorithm would have more difficulty segmenting the image, compared to an algorithm that can use the depth information, can be found in Figure 2.2. In the image, a white table is positioned against a white cabinet. For color-based methods, the colors from both surfaces are mostly equal, apart from the shadows, making it harder to segment these surfaces. But if one were to use the depth information of RGB-D data, the two orthogonal planar surfaces would be easily detected and segmented.

Figure 2.2: A situation where depth information can greatly improve segmentation. The white table is positioned against a white cabinet, which would make it harder for color-based measures to properly segment the two orthogonal surfaces.

Image segmentation using RGB-D images has seen plenty of research in recent years [12,40–44]. By using RGB-D data, this geometric information can easily be included in the segmentation process, and the 3D geometric relationships between points can be used to prevent superpixels from crossing object boundaries. In this thesis, we focus on segmentation methods that can specifically be used to provide a good oversegmentation. One such method is the Depth Adaptive SuperPixel (DASP) algorithm [43]. DASP is an adaptation of a popular RGB superpixel generation algorithm called Simple Linear Iterative Clustering (SLIC) [45]. SLIC is an iterative gradient ascent algorithm that uses local k-means clustering to cluster pixels in a five-dimensional space consisting of the two-dimensional location and the color. DASP uses the additional depth channel to extend this space by adding both the depth and the angles of the surface normals at each point.

DASP is considered a 2.5D method, because it works in the 2D domain, while using some additional information from the depth channel. One of the downsides of these approaches is that the segmentation is performed on a single view, which makes strongly occluded objects difficult to detect. In this thesis, we will use an approach that uses the 3D point cloud of the scene that can be generated from the RGB-D image, namely the Voxel Cloud Connectivity Segmentation (VCCS) algorithm [44]. VCCS can be used to generate supervoxels inside point clouds. In this work, supervoxels are essentially superpixels in three dimensional space, in contrast to other papers where voxels imply extensions of 2D methods to 3D by taking video frames and stacking them to generate the additional dimension [46]. Using a point cloud representation of a scene as input, the supervoxels are generated by using k-means clustering with two geometrical constraints.

The first constraint forces the seeding of the clusters to spread uniformly through the 3D space, which ensures that supervoxels are spread evenly throughout the geometry of the scene.


The second constraint enforces that all voxels in a supervoxel are connected in 3D space, which ensures that supervoxels cannot merge with voxels they are not connected to, even if those voxels are neighbours in the projected image. Two voxels are considered adjacent in 3D space if they share faces, edges, or vertices.

The algorithm starts by dividing the 3D space into a voxelized grid with a resolution $R_{seed}$, which effectively translates to the distance between the initial seeds. If seeds are isolated in space from any other seeds, they are removed because they are most likely the result of noise. Using a small search radius $R_{search}$, the voxels surrounding each seed are counted, and if there are not at least as many voxels as one would expect from a planar surface fitted through the search radius, the seed is removed. The seeds are relocated to their closest voxel, and the supervoxel clustering starts. For each supervoxel, starting at the seed, the closest voxel is searched in 3D space. If that voxel has not yet been assigned to a supervoxel, it is added to its closest seed. Once a single voxel has been added to a seed, the same process starts for the next seed, and this process continues until either all voxels have reached the leaf nodes of their adjacency graph or there are no further voxels without a label.

Supervoxels are generated by iteratively expanding each seed in 3D space. Hence, labels can not cross over object boundaries. Because all supervoxels expand at the same rate, a similar size for each supervoxel can be expected. The resulting supervoxels can either be used directly or be projected back into the 2D plane, depending on the algorithm that uses them. Figure 2.3 shows an example of the possible outputs for the VCCS algorithm.

Figure 2.3: Example of the VCCS algorithm output. Left: original image; Middle: supervoxels with connectivity lines; Right: Resulting 2D projection of supervoxels
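The seeding and equal-rate expansion can be illustrated with a heavily simplified sketch. It ignores the colour and normal terms and the strict adjacency-graph expansion of the real VCCS algorithm, and the parameter names (r_seed, r_voxel) are illustrative only.

```python
import numpy as np
from scipy.spatial import cKDTree

def toy_supervoxels(points, r_seed=0.5, r_voxel=0.05, max_rounds=100):
    """Toy VCCS-style clustering: voxelize the cloud, seed cluster centres on a
    coarse grid, then let every seed claim nearby unlabelled voxels at the same
    growing radius so that all clusters expand at an equal rate.
    `points` is an (N, 3) float array in metres."""
    # one representative point per occupied (fine) voxel
    _, keep = np.unique(np.floor(points / r_voxel).astype(int), axis=0, return_index=True)
    voxels = points[keep]
    # one seed per occupied cell of the coarse seeding grid
    _, seed_keep = np.unique(np.floor(voxels / r_seed).astype(int), axis=0, return_index=True)
    seeds = voxels[seed_keep]

    labels = np.full(len(voxels), -1, dtype=int)
    tree = cKDTree(voxels)
    radius = r_voxel
    for _ in range(max_rounds):
        for s, seed in enumerate(seeds):
            for v in tree.query_ball_point(seed, r=radius):
                if labels[v] == -1:          # claim only unlabelled voxels
                    labels[v] = s
        if (labels != -1).all():
            break
        radius += r_voxel                    # all seeds grow by the same step
    return voxels, labels
```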

The initial oversegmentation is just the first step of the Selective Search algorithm. In the following section, the remainder of the algorithm is discussed in more detail.

2.3 Selective Search

In this thesis, we chose to extend the Selective Search method [32] for object proposal generation by adding geometric features. This method is chosen in particular because it has been shown to outperform the other current state-of-the-art methods [35]. In addition, it allows for easy addition of new measures, such as those based on depth features. The Selective Search method starts by oversegmenting an image using graph-based segmentation [38], which is known for its speed and relatively good performance. Selective Search is a bottom-up hierarchical grouping method that keeps combining the two most similar regions until only a single region remains at the top of the hierarchy. The similarity score is determined by a combination of easily computable measures, as opposed to other methods which use a single computationally expensive measure, as in [29].

Using several different measures has the benefit of diversifying the results, and Selective Search has three areas in which it can diversify. First, complementary color spaces can be used, such as RGB, grey-scale, and HSV. This helps to account for different light sources, as each of the color spaces captures different types of such conditions. In the original paper, the HSV color space produced the best results; hence, it is considered the default in this work. Second, the starting locations can be varied by altering the parameters of the graph-based segmentation [38], or by choosing a different initial segmentation method. Last, four easy to compute complementary similarity measures are implemented, and these are discussed in greater detail below.

The first measure is $s_{color}(r_i, r_j)$, which computes normalized one-dimensional color histograms $C_i$ and $C_j$ for segments $r_i$ and $r_j$, where each channel in the colorspace receives 25 bins. The similarity between these two vectors is computed as their intersection. The normalization ensures the similarity value stays in the range $[0, 1]$. The $s_{color}$ measure can efficiently be propagated through the hierarchy by taking the average of the two histograms, weighted by the size of each segment. This can be formalized as:

$$C_t = \frac{size(r_i) \times C_i + size(r_j) \times C_j}{size(r_i) + size(r_j)}, \tag{2.1}$$

where $C_t$ represents the color histogram of the newly generated region $r_t$. The size of $r_t$ is simply the sum of the sizes of $r_i$ and $r_j$.

The second measure is $s_{texture}(r_i, r_j)$, which attempts to capture the texture similarity between two segments by taking Gaussian derivatives in eight orientations, with $\sigma = 1$, for each color channel. For each orientation and color a histogram is created; these are combined and normalized to obtain the final descriptor. Similar to $s_{color}$, the similarity score is computed as the intersection between the two histograms.
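As a small illustration of the colour measure and its propagation rule (Equation 2.1), a sketch in Python; the bin count and value range follow the description above, everything else is illustrative.

```python
import numpy as np

def colour_histogram(pixels, bins=25):
    """Normalized 1D colour histogram: 25 bins per channel, concatenated and
    L1-normalized so that the intersection score lies in [0, 1].
    `pixels` is an (N, 3) array of colour values in [0, 255]."""
    per_channel = [np.histogram(pixels[:, c], bins=bins, range=(0, 255))[0] for c in range(3)]
    h = np.concatenate(per_channel).astype(float)
    return h / h.sum()

def intersection(h_i, h_j):
    # histogram intersection similarity, used for both s_color and s_texture
    return np.minimum(h_i, h_j).sum()

def propagate(h_i, size_i, h_j, size_j):
    # Equation 2.1: size-weighted average histogram of the merged region r_t
    return (size_i * h_i + size_j * h_j) / (size_i + size_j)
```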

The third measure, $s_{size}(r_i, r_j)$, tries to prioritize merging smaller regions over larger regions, so as to prevent one large region from merging with all the smaller regions around it. The measure allows similarly sized regions to merge at each stage of the algorithm, as larger regions are penalized more than smaller regions. This measure is easily computed by taking the sum of the sizes of the two regions and dividing it by the size of the entire image in pixels, such that a value in the range $[0, 1]$ is maintained.

The last measure is $s_{fill}(r_i, r_j)$, which ensures that gaps are filled and essentially measures how well two segments fit together. By computing the bounding box around the combination of the two regions, its size can be compared to the size of the image, similarly to $s_{size}$.

All measures have in common that they return a result in the range $[0, 1]$, which allows them to be combined, and that only the data inside the two segments has to be combined when merging them, providing the required speed and low computational complexity. The resulting similarity score is computed as:

$$s(r_i, r_j) = a_1 s_{color}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j), \tag{2.2}$$

where $a_i \in \{0, 1\}$ is a Boolean value that determines whether a similarity measure is used or not.
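The greedy grouping itself can be summarized in a few lines. The sketch below assumes `similarity` combines the enabled measures as in Equation 2.2 and `merge` propagates histograms and sizes; for brevity it considers all region pairs instead of only neighbouring ones, unlike the real algorithm.

```python
def selective_search_merge(regions, similarity, merge):
    """Bottom-up grouping: repeatedly merge the most similar pair of regions
    until one region remains, recording every intermediate region so bounding
    boxes can be generated for all of them. `regions` maps id -> region data."""
    proposals = list(regions.values())
    ids = list(regions)
    sims = {(a, b): similarity(regions[a], regions[b])
            for i, a in enumerate(ids) for b in ids[i + 1:]}
    next_id = max(ids) + 1
    while len(regions) > 1:
        a, b = max(sims, key=sims.get)                      # most similar pair
        regions[next_id] = merge(regions.pop(a), regions.pop(b))
        proposals.append(regions[next_id])
        # drop stale similarities, add the ones involving the new region
        sims = {pair: s for pair, s in sims.items() if a not in pair and b not in pair}
        for other in regions:
            if other != next_id:
                sims[(other, next_id)] = similarity(regions[other], regions[next_id])
        next_id += 1
    return proposals
```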

So far, the related work for the first contribution has been discussed. The second contribution consists of using the additional depth channel to improve scene category classification. The general pipeline of this work can be found in Figure 2.4.

As can be seen from the pipeline, the results of the Selective Search could be used by an object detector to localize objects in a scene. For this thesis, we will use the spatial relations between objects to generate a novel depth-based feature for scene classification, but implementing an object detector lies outside the scope of this work. Hence, instead of actually using the generated object proposals in an object detector, the ground truth values of the scene are used. Instead of focusing on object detection algorithms, our main focus will be on scene classification algorithms. In the following section, a number of methods for object detection and scene classification are discussed.

2.4 Scene classification

In the previous sections, the various approaches to generating object proposals were discussed; the resulting object proposals can be used for scene classification. In this section, the focus is on the second contribution of this work, namely the scene classification task.


Figure 2.4: Using an RGB-D image pair, our system segments the image in 3D space using the VCCS algorithm [44] as discussed in Section 2.2.1. Afterwards, this segmentation is used to generate object proposals, which are used to generate the context feature. This feature is used in combination with a depth CNN feature and an RGB CNN feature to train an SVM classifier for the task of scene classification.

Until recently, scene classification was mostly performed using methods that work with RGB images. In the past, the general approach of state-of-the-art methods [1, 2] was to use hand-crafted features such as HOG [3] and SIFT [4] in order to classify objects within the image. These methods generate features at distinct locations in an image, for instance locations that contain corners or edges. These features are used to generate a descriptor for the entire image, by which the image can be compared to other images. One of the main benefits of these approaches over simply comparing all the pixels between two images is that these features have additional properties, such as being scale invariant, making them more robust.

The usage of these features changed with the introduction of the Bag of Words (BoW) approach to computer vision tasks [47–49]. This method originates from the field of natural language processing, where a document is represented as a vector filled with the occurrence counts of the words it contains. The method was adapted to the field of computer vision by treating local features inside the image as words, so that an image is represented by the occurrence counts of its local features. The benefit of this method over just using the features as an image-wide descriptor is that local features can now be used to classify an image on different levels. One of the downsides of the BoW technique is that any spatial information regarding the features is lost, as only the occurrence counts are stored.

A common way to address this loss of spatial information is the use of spatial pyramids [50]. This method divides the image into several regions in a hierarchical fashion: at the bottom there are many small regions, which are gradually concatenated into a region that covers the entire image. By computing a BoW for each of these separate regions and computing a weighted combination of these regions, the spatial information is maintained while still benefiting from the properties of BoW techniques. An adaptation of the spatial pyramids has also been published recently [51]. This approach uses RGB-D images in combination with spatial pyramids and two depth descriptors, NARF [52] and PFHRGB [53], to classify scene categories.

Today, the main approaches to the scene classification task make less use of such features and are oriented towards Convolutional Neural Networks (CNNs). As this work also uses a CNN, the methods involved are discussed in greater detail below.

2.5 Convolutional Neural Networks

Recently, a different approach to scene classification has been taken, namely the usage of Convolutional Neural Networks (CNNs), as can be seen in challenges such as the ImageNet challenge [54]. CNNs have become highly involved in training scene classifiers, and find their origin in neural networks. In 2012, a deep convolutional neural network was submitted to the ImageNet challenge [55]; it won that year's challenge and largely influenced the entries of later years.

CNNs are based on the concept of feed-forward neural networks. We define a neural network as a system built from layers that each have a certain number of neurons. Each of these neurons can be connected to neurons in the next layer, and each connection has a specific weight. In general, a neural network always has an input layer and an output layer, with optional hidden layers in between. The feed-forward neural network is a type of neural network where the links between the different layers only point from the input layer, through the hidden layers, towards the output layer, which results in all information going just one way. A simple illustration of such a network can be found in Figure 2.5.

Because layers in neural networks are fully connected, image data can quickly give an explosive growth in the number of weights that have to be learned. Take, for instance, a 64x64 RGB image: each node in the first hidden layer will have at least 64 × 64 × 3 = 12,288 weights, making standard neural networks unsuited for the task of scene classification.

(21)

Figure 2.5: Illustration of a feed-forward neural network, with 3 input nodes, 2 hidden layers of 3 nodes each, and an output layer with 3 nodes.

CNNs differ from standard neural networks because CNNs have layers that consist of convolutions. Convolutions are essentially filters that can be used to detect local features such as edges and corners. A CNN can safely apply convolutions to the input data because the explicit assumption of CNNs in the field of computer vision is that the input consists of images. A standard neural network is not specifically designed to accept image data, and one cannot assume that convolutions can always be applied to the input. As a result, a CNN can be designed specifically to process image data efficiently, and different types of layers can be used, each having a specific task.

The general observation with CNNs is that each consecutive layer triggers on more specific elements in an image. For instance, the first layer will find differences in the gradient. The second layer might find edges and corners, and further layers would trigger on the general shape of objects such as a car or a chair. An illustration of this behaviour can be generated by visualizing the individual layers in a network [56]. A small example of such visualizations can be found in Figure 2.6.

Figure 2.6: Visualization of the first two layers of a CNN, taken from the work of Zeiler and Fergus [56]. In general, the filters of the layers become more complex as more layers get added.

A typical CNN is built up from a set of different types of layers, each with a specific task. The remainder of this section describes some of the categories of layers used in previous work [6,8] and the CNN used for this work.

2.5.1 Convolutional layer

The convolutional layers consist of a set of learnable filters, and they form the core building block of CNNs. Each filter has weights that can be learned, which form the parameters of the layer. Each filter in this layer only covers a small spatial area of the image, also called the receptive field, but covers all channels of the image. For RGB images, these channels are usually the three color channels. During the forward pass, the filter is convolved across the width and height of the image, and the dot product is computed between the filter and the input at all visited positions. This process produces a 2D activation map, which presents the activation of each filter across the image. The goal of the learning process for the parameters of a convolutional layer is to have each filter correspond to a specific type of feature in the image. A convolutional layer typically has four hyperparameters. First, the receptive field size, which controls the size of the filter. Second, the depth of the layer, which is equal to the number of filters used in the layer. Third, the stride by which the filter is moved across the image. Last, the amount of zero padding, which allows one to control the output size.
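The spatial output size follows directly from these hyperparameters; the small helper below makes the relation explicit (a standard formula, not specific to this thesis).

```python
def conv_output_size(input_size, field_size, stride, zero_padding):
    """Output width/height of a convolutional layer: (W - F + 2P) / S + 1.
    The depth hyperparameter only sets the number of output channels."""
    return (input_size - field_size + 2 * zero_padding) // stride + 1

# e.g. a 224x224 input with 7x7 filters, stride 2 and padding 3 -> 112x112
assert conv_output_size(224, 7, 2, 3) == 112
```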

2.5.2 Pooling layer

Each convolutional layer in the network adds more parameters that have to be learned. In order to reduce the number of parameters, which in turn reduces the computational cost of training the network, a pooling layer can be added. This layer reduces the spatial size of the representation by taking the maximum over a small region that is moved across the representation. For instance, if one takes the maximum over a 2x2 region with a stride of 2, the size of the representation is effectively reduced by 75%. An illustration of the effect of a max pooling layer is shown in Figure 2.7.

Figure 2.7: Illustration of a pooling layer in a CNN, with a 2x2 filter and a stride of 2
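A minimal NumPy sketch of the 2x2, stride-2 max pooling operation shown in Figure 2.7 (illustrative only, not the thesis implementation):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the maximum of every non-overlapping
    2x2 block, reducing the number of values by 75%. Odd borders are trimmed."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [8, 9, 1, 0],
              [2, 4, 3, 3]])
print(max_pool_2x2(x))   # [[6 7]
                         #  [9 3]]
```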


2.5.3 Fully connected layer

Typically a CNN ends with fully connected layers because, similar to a normal neural network, these can easily be used to give scores to the different classes in a classification task. The main difference between a fully connected layer and a convolutional layer is that the convolutional layer is only connected to a local area of the input and that many parameters of a convolutional layer are shared, while the parameters in a fully connected layer are not.

2.5.4 GoogleNet

For this work, we use the CNN that won the ImageNet Challenge 2014, namely GoogleNet [8]. This network is a deep CNN that was used in the object detection challenge and has been trained to classify 1,000 different object classes. One of the key differences with other approaches is that the layers are not just applied linearly but also in parallel. A full description of this network lies outside the scope of this work; for that we refer to the original paper. The general pipeline of GoogleNet can be seen in Figure 2.8.

Figure 2.8: Representation of GoogleNet, as provided by the authors [8]

2.6 Features using context

While a CNN is able to capture specific patterns, these are still only local patterns. What we would want for scene category classification is a method that does not just extract features from local information, but also from global information. This global information has also been referred to as contextual information in previous work. Contextual information is interpreted as any available information in the image that can influence the perception of the scene and the objects it contains [57].

The work of Divvala et al. [58] and Galleguillos and Belongie [59] each present an overview of the different types of contextual information. Their overviews also include contextual information that can be gathered from outside an image, such as cultural and geographic context. In this thesis, we only consider the context that can be provided by the image itself. Contextual information that can be derived from data available inside an image includes, for instance, 2D scene gist context [60,61], 3D geometric context [62,63], and semantic context [19,21,64].

Torralba [60] showed that a global feature, which he called a gist, could predict the presence of an object and its location. Gupta et al. [62] show that the localization of objects can be improved by predicting the location and size of the support surfaces in an image. Rabinovich et al. [19] demonstrated that certain objects are more likely to co-occur in an image. A somewhat similar approach by Farhadi and Sadeghi [21] proposed to model objects in common conjunctions, such as a person riding a horse, instead of identifying the objects person and horse individually. Hence, taking the research regarding the effectiveness of contextual information into account, we can conclude that the addition of contextual information is a useful contribution to a scene classification pipeline.

The approaches of Rabinovich et al. [19] and Farhadi and Sadeghi [21] are both heavily influenced by the performance of the underlying object detection. If the object detection does not capture all objects in the image, the quality of their methods also deteriorates. By adding these two additional measures to Selective Search we hope to improve the object proposal generation, which in turn could be used to improve an object detection algorithm.

The proposed new feature of this work uses the co-occurrence of the objects in the scene, like the method of [19], but instead of just keeping count of the co-occurrences in a scene, the 3D spatial relations are also considered. These 3D spatial relations are derived using the additional depth information of the RGB-D data. In the following section, we discuss the new measures for Selective Search and present an application of the improved object proposals in the scene classification task.


3 Methodology

As discussed in the introduction, the contributions of this thesis are two-fold. The first contribution attempts to improve an object proposal method called Selective Search, by adding depth-based features. The second contribution improves the scene recognition task by using newly created depth features in conjunction with an already existing classifier.

3.1 Extending Selective Search

In an attempt to improve the Selective Search algorithm [32], we explore whether replacing the RGB-based graph-based oversegmentation [38], which is used to generate the initial regions, with a 3D-based segmentation method improves the results of the Selective Search method. The graph-based segmentation generates superpixels, which essentially are clusters of pixels that are similar according to a certain measure. In this thesis, we want to use the 3D equivalent of superpixels, namely supervoxels. As the replacement of this graph-based segmentation, the VCCS algorithm is taken. By using this algorithm, it is now also possible to use the depth channel of the RGB-D data, instead of just RGB. Because Selective Search can only use a 2D representation for the labeling of the images, the resulting point cloud of the VCCS algorithm is projected back into the original view. Because of this projection, some artifacts, such as small clusters of pixels of one label floating in a different segment, are left in the image. In order to remove these artifacts, a median blur is used to smooth them out. This results in a cleaner segmentation; an example of this median blur is displayed in Figure 3.1.

Figure 3.1: Example of an initial segmentation, showing the difference between (a) an unsmoothed and (b) a smoothed image. Most of the specks that can be found in the unsmoothed image have been smoothed out using a median blur.

By using the additional depth information, two novel similarity measures based on depth features are implemented. These additional measures are combined with the four already existing similarity measures of Selective Search that are described in Section 2.3. As a result, the algorithm now has the four standard measures $s_{color}$, $s_{texture}$, $s_{size}$, and $s_{fill}$, and the two new measures $s_{voxel}$ and $s_{distance}$. These two new measures can either be weighted separately or grouped by color and depth features. Next, the two new features are described in more detail.

$s_{distance}(r_i, r_j)$ measures the Euclidean distance between the centers of two segments $r_i$ and $r_j$ in 3D space. The measure is represented as a fraction of the maximum possible distance within the scene, so that the score stays within the range $[0, 1]$, in line with the already existing measures. The maximum distance within an image is obtained by creating a 3D bounding box around the point cloud of the image and taking the Euclidean norm (also known as 2-norm) between the two diagonally opposite points. This value is inverted by subtracting the result from 1, such that a small distance gives a high score, while a large distance gives a low score. For the (x, y, z) locations $c_i$ and $c_j$ that represent the centers of the supervoxels $r_i$ and $r_j$ respectively, this gives:

$$s_{distance}(r_i, r_j) = 1 - \frac{\lVert c_i - c_j \rVert_2}{\max(s_{distance})} \tag{3.1}$$
$$= 1 - \frac{\left[\sum_{m} (c_{i,m} - c_{j,m})^2\right]^{1/2}}{\max(s_{distance})}, \tag{3.2}$$

where the Euclidean distance is divided by the maximum distance to return a ratio within the range $[0, 1]$.
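A direct sketch of Equations 3.1 and 3.2 in Python, where the maximum distance is taken as the diagonal of the 3D bounding box of the scene's point cloud (function and variable names are illustrative):

```python
import numpy as np

def s_distance(center_i, center_j, cloud_points):
    """Equations 3.1/3.2: Euclidean distance between two segment centres,
    normalized by the bounding-box diagonal of the whole point cloud and
    inverted so that nearby segments score close to 1."""
    max_distance = np.linalg.norm(cloud_points.max(axis=0) - cloud_points.min(axis=0))
    distance = np.linalg.norm(np.asarray(center_i) - np.asarray(center_j))
    return 1.0 - distance / max_distance
```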

$s_{voxel}(r_i, r_j)$ is a simpler measure that uses the adjacency matrix retrieved from the supervoxel segmentation: it is 1 if there is an adjacency link between $r_i$ and $r_j$ in the point cloud representation, or 0 if this adjacency link is missing. For two segments $r_i$ and $r_j$, this results in:

$$s_{voxel}(r_i, r_j) = \begin{cases} 1 & \text{if a voxel link } (r_i, r_j) \text{ exists} \\ 0 & \text{otherwise} \end{cases} \tag{3.3}$$

With these new features, there are now two possible ways to combine the new depth-based measures with the original color-based measures. The first approach is to weigh the color measures and depth measures equally. If the user chooses to use a measure, it gets a weight of 1, and 0 otherwise. This results in a simple extension of Equation 2.2:

$$s(r_i, r_j) = a_1 s_{color}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j) + a_5 s_{distance}(r_i, r_j) + a_6 s_{voxel}(r_i, r_j), \tag{3.4}$$

where $a_i \in \{0, 1\}$ describes whether the similarity measure is used or not.

A potential issue with weighing all measures equally is that the original color-based measures of Selective Search will always have a majority over the depth measures, simply because there are more of them, resulting in a potentially biased final measure. A simple solution is to weigh the combination of color and depth features separately and combine the results, which gives:

$$s(r_i, r_j) = \frac{a_1 s_{color}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j)}{a_1 + a_2 + a_3 + a_4} + \frac{a_5 s_{distance}(r_i, r_j) + a_6 s_{voxel}(r_i, r_j)}{a_5 + a_6}. \tag{3.5}$$
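A compact sketch of the two weighting schemes (Equations 3.4 and 3.5); the score and flag lists are assumed to be ordered as (colour, texture, size, fill) and (distance, voxel), and the zero-division guard is an addition for robustness:

```python
def combined_similarity(colour_scores, depth_scores, colour_flags, depth_flags, grouped=True):
    """Eq. 3.4 (grouped=False): plain sum of all enabled measures.
    Eq. 3.5 (grouped=True): colour and depth groups normalized separately so the
    more numerous colour measures cannot dominate the final score."""
    colour = sum(a * s for a, s in zip(colour_flags, colour_scores))
    depth = sum(a * s for a, s in zip(depth_flags, depth_scores))
    if not grouped:
        return colour + depth
    # guard against a group with no enabled measures
    return colour / max(sum(colour_flags), 1) + depth / max(sum(depth_flags), 1)
```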

The results of altering the equation to weigh color features and depth features equally are presented in Section 5.

If we look at the pipeline in Figure 2.4, we can see that the next step in the pipeline is to use the generated object proposals for the object detection task. In this thesis, the object detection step is not performed. Instead, the ground truth values for the objects in the image are used. This allows the novel depth-based feature of the second contribution of this work to use the information of the objects in the image without the additional noise that is introduced by object detectors. In the following section, we elaborate on how the novel depth features are used in combination with the CNNs for the task of scene classification.

3.2 CNN with additional depth features

In this thesis, we explore whether there are other features that could be extracted from the image that make use of the additional depth channel. We have already seen the HHA features [7], which are generated on a superpixel level. These features are still mostly derived from the two-dimensional representation, making them 2.5D features, as explained by the authors [7].

For this thesis, we implement a feature that uses the locations of the detected objects within the scene, in 3D space. For this feature, a strong assumption is made, namely that there is a good, or preferably perfect, object detector to detect all the objects in an image. The feature is structured as a 3D co-occurrence matrix over all objects that will be compared. In this thesis, we use all objects that are present in at least 50 unique images in the dataset, which leaves a total of 209 object classes. For all the objects in an image, the 3D location is computed using the depth data. Next, the distances between all pairs of distinct objects are computed for each image. To avoid having to measure on a continuous scale, the distance between objects is discretized into a histogram of ten bins: the first nine bins have a size of 50 centimeters each, covering distances up to 4.5 meters. The last bin contains all distances greater than 4.5 meters; this particular value is selected because the inaccuracy of most sensors grows when nearing, or going past, this distance.

The result is a 3D matrix with the objects on the x and y axes, and the histogram with 10 bins on the z axis. If computed naively, this results in $n^2$ elements for $n$ objects per image. However, the distance between two objects is the same regardless of whether you compare object $o_i$ with $o_j$ or vice versa. Hence, only one half of the co-occurrence matrix, up to and including the diagonal, is used, which reduces the number of elements to $\frac{n(n+1)}{2}$. The last step of generating the feature is to flatten it, because many implementations of neural networks do not accept 3D features. The matrix is flattened per row, where all histograms on the row are concatenated and then combined with the other rows, resulting in a vector of size $\frac{n(n+1)}{2} \times \#bins$.
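A sketch of how such a flattened context feature can be built for one scene; the class count, bin size, and number of bins follow the description above, while the input format and names are illustrative assumptions.

```python
import numpy as np

def context_feature(objects, n_classes=209, bin_size=0.5, n_bins=10):
    """Flattened 3D co-occurrence feature. `objects` is a list of
    (class_id, xyz_centre) pairs for the objects in one scene. Distances are
    binned in 0.5 m steps up to 4.5 m, the last bin collecting everything
    beyond; only one triangle of the class-pair matrix is kept, giving
    n(n+1)/2 * n_bins values."""
    feature = np.zeros((n_classes, n_classes, n_bins))
    for a in range(len(objects)):
        for b in range(a + 1, len(objects)):        # distinct object pairs only
            ca, pa = objects[a]
            cb, pb = objects[b]
            dist = np.linalg.norm(np.asarray(pa) - np.asarray(pb))
            bin_idx = min(int(dist / bin_size), n_bins - 1)
            i, j = sorted((ca, cb))                 # keep one triangle only
            feature[i, j, bin_idx] += 1
    # flatten the kept triangle (including the diagonal) row by row
    rows, cols = np.triu_indices(n_classes)
    return feature[rows, cols].reshape(-1)
```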

Similar to the work of [7], the resulting features are used to train a one-vs-all SVM classifier [65]. By design, an SVM is fundamentally a two-class classifier, yet the problem of classifying $K > 2$ different classes occurs often. A common solution to this problem is to train $K$ different SVM classifiers, where each classifier $SVM_k$, $k \in K$, classifies the $k$-th class as positive and all entries that belong to other classes as negative. The final output of the combination of the SVM classifiers is the class belonging to the SVM that produces the highest score for a given entry.
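A minimal one-vs-rest linear SVM setup, sketched with scikit-learn; the thesis cites [65] for its SVM, and the random data and feature dimension below are placeholders for the concatenated CNN and context features.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# Placeholder data: 200 training scenes with 512-D features, 21 scene classes.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 512)), rng.integers(0, 21, size=200)
X_test = rng.normal(size=(10, 512))

classifier = OneVsRestClassifier(LinearSVC(C=1.0))   # one SVM per scene class
classifier.fit(X_train, y_train)
print(classifier.predict(X_test))   # the highest-scoring per-class SVM wins
```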

Before the RGB and depth CNN features can be used optimally for the task of scene category classification on the SUN RGB-D dataset, they have to be fine-tuned towards this specific task and dataset. This has to happen before using them in the SVM classifier, and the steps involved in this process are discussed below.

3.2.1 Fine-tuning CNN

Most recent research has been oriented towards object classification, while only a small portion is dedicated to specifically classifying the scene category of the image. Hence, when using a CNN that has been trained to classify a certain set of objects, it has to be fine-tuned towards classifying a set of scene types. A benefit of using an already existing CNN, even though it was trained on different but similar data, is that a properly working model can be achieved with a relatively small amount of data. When working with deep CNNs such as GoogleNet, a dataset like SUN RGB-D with around 10.000 RGB-D images is usually not considered to be enough to train the weights of a CNN from scratch.

Recent work has shown that it is more effective to fine-tune a CNN starting from the weights of an already existing CNN, whose weights have been optimized for a closely related goal, than to train from scratch with little data [7, 66]. For instance, in the work of Xia [66], AlexNet is used to classify styles in an image, while the network was originally trained to classify objects. Gupta et al. [7] showed that a CNN model trained on RGB data can be fine-tuned to create features based on depth data, which they called HHA features.

In order to fine-tune the GoogleNet model to classify scene categories instead of objects, the last few layers of the model have to be adjusted. Taking a closer look at the last layers of the pipeline displayed in Figure 2.8, we see that the final layers consist of a fully connected layer (represented as a convolution layer in the image) and a softmax classification layer. GoogleNet was originally trained to detect 1.000 different object categories, while in this work the network is used to classify 21 different scene categories. Hence, the number of outputs of the final layer is changed from 1.000 to 21, and the weights of the layers feeding the loss functions are discarded, such that they can be retrained from scratch. This allows the model to adjust its parameters to classify the 21 scene classes instead of the 1.000 object classes. After the CNN is fine-tuned, the novel depth features can be added to the pipeline.
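A minimal sketch of this fine-tuning step using the Caffe Python interface is shown below. The file names are placeholders, and the usual Caffe convention is assumed: in the train prototxt the final classifier layer is renamed and given num_output: 21, so that its weights are not copied from the released model but learned anew.

```python
import caffe

# Fine-tuning sketch (file names are placeholders, not from the thesis).
caffe.set_mode_gpu()
solver = caffe.SGDSolver('solver_scene21.prototxt')
# Copy weights for all layers whose names match the released model; the
# renamed 21-way classifier layer(s) are initialized from scratch.
solver.net.copy_from('bvlc_googlenet.caffemodel')
solver.step(30000)  # 30.000 fine-tuning iterations, as used in this work
```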


4 Experiments

In this section, we first discuss the specifics of the dataset and the implementation details in Section 4.1 and Section 4.3 respectively. Next, the experiments for the extension of the Selective Search algorithm are discussed in Section 4.4. Finally, the experiments for the scene category classification task are described in Section 4.5.

4.1 Dataset

In this thesis, the SUN RGB-D V1 dataset is used [13]. The dataset contains RGB-D images from three other datasets, namely NYU depth v2 [12], Berkeley B3DO [67], and SUN3D [68]. In total, the SUN RGB-D V1 dataset contains RGB-D data generated by four different sensors. Figure 4.1 displays the quality differences per sensor, and Table 4.1 shows the distribution of the images over these sensors. The dataset is composed of four folders, one for each sensor. In total the dataset contains 10.272 entries with depth data, and these images are used to create the train- and test set used in this thesis.

Sensor            Initial #images    #images after noise removal    #images after threshold
Intel RealSense   1.159              1.115                          1.092
Asus Xtion        3.784              3.191                          3.088
Kinect v1         2.009              1.948                          1.804
Kinect v2         3.320              2.772                          2.706
Total             10.272             9.754                          8.690

Table 4.1: The sensors used to generate the SUN RGB-D dataset [13], and the number of images per sensor. First column: images remaining after filtering missing depth information. Middle column: images remaining after also removing the noisy classes, such as idk and furniture store. Last column: the number of images that remain after applying a minimum threshold on the scene class occurrences.


Figure 4.1: Comparison of the four RGB-D sensors. The raw depth map from Intel RealSense is noisier and has more missing values. The Asus Xtion and Kinect v1 depth maps show an observable quantization effect. The Kinect v2 measures depth details more accurately, but is more sensitive to reflections and dark colors. [13]

4.1.1 Properties

Each entry contains various properties, as displayed in Table 4.2. The table does not display all available properties in the dataset, but only the ones used in this work. For instance, the extrinsics are not used because the tasks at hand do not require combining multiple images or point clouds. The 3D annotations are also ignored, because they are incomplete and only consist of bounding boxes for some of the objects in the room. Instead, the 3D annotations are re-computed using the 2D annotations in combination with the point cloud that is generated from the depth information and the intrinsic parameters of each image.

4.2 Data pre-processing and Analysis

In order to be able to use the dataset for scene type classification using supervised techniques, one first has to determine the number of unique scene types in the dataset. The first analysis shows two odd classes that cannot be used in this thesis: idk and furniture store.


Property       Description
Image          3-channel 8-bit .jpg RGB image
Depth          1-channel 16-bit .png image, each pixel represents the measured depth
Depth bfx      1-channel 16-bit .png smoothed depth image, to cover missing values
Intrinsics     .txt file with the intrinsics matrix from the used sensor
Scene          .txt file containing the ground truth class of the scene
Annotation2D   .json file containing the ground truth of the objects in the scene

Table 4.2: Properties of an entry in the dataset that are used in this thesis.

The entries labelled with the idk class are removed because this label is an abbreviation for "I do not know"; hence, these entries are seen as data without a correct ground truth. Looking at some examples of this class in Figure 4.2, it is evident that these images lack any context from which a room category could be derived.

Figure 4.2: Samples of the removed idk class.

The furniture store entries are removed because the contents of these images are showroom models of various scene types, which the classifier would mistake for the actual scene type shown. Hence, these entries are treated as noise and removed. Some examples of the furniture store class are shown in Figure 4.3, where the scene could easily be perceived as either a bedroom or a kitchen.

Figure 4.3: Samples of the removed furniture store class, which have the appearance of either the bedroom class ((a), (b)) or the kitchen class ((c), (d)).

The class analysis shows that there are 45 unique classes present in the dataset once the idk and furniture store classes are removed. However, the classes are not equally represented within the dataset. In order to ensure that all classes are represented in both the train- and test set, a minimum threshold of 50 total occurrences is used. This results in a dataset of 21 unique classes, and the distribution of these classes is displayed in Figure 4.4.

Figure 4.4: Scene category distribution of the classes occurring more than 50 times, resulting in a total of 8.690 images. 21 of the original 45 scene categories remain.

The train- and test set used in the object proposal task are obtained by randomizing the images and splitting them into 90% for the train set and 10% for the test set. For the fine-tuning task we do not want to train a biased R-CNN model. As Figure 4.4 clearly shows, the data is skewed; therefore, the train set for this specific task is capped at a maximum of 150 occurrences per class, so as to remain in range of the minimum of 50 class occurrences. If a class occurs more than 150 times, the remainder is inserted into the test set. For the scene category classification, we use 3-fold cross validation to train and test the resulting SVM classifier. Each of the folds is guaranteed to have an equal distribution of classes by dividing the images over the three sets per class. As a final step, the order of the images in the train set is randomized, so that overfitting in the early stages of training the SVM classifier can be avoided.
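A minimal sketch of the per-class 90/10 split described above is given below; variable names are illustrative and the exact bookkeeping in the thesis implementation may differ.

```python
import random
from collections import defaultdict

def split_per_class(entries, train_fraction=0.9, seed=0):
    # `entries` is a list of (image_id, scene_class) pairs that already
    # passed the 50-occurrence threshold.
    per_class = defaultdict(list)
    for image_id, scene_class in entries:
        per_class[scene_class].append(image_id)
    train, test = [], []
    rng = random.Random(seed)
    for scene_class, ids in per_class.items():
        rng.shuffle(ids)
        cut = int(train_fraction * len(ids))           # split each class separately
        train += [(i, scene_class) for i in ids[:cut]]
        test += [(i, scene_class) for i in ids[cut:]]
    rng.shuffle(train)  # avoid class-ordered batches during training
    return train, test
```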

4.3 Implementation

The implementation of the object proposal generation and the scene category classification is performed using Python 3.5.1^1 and C++ on a Linux server. For this project the OpenCV^2, Point Cloud Library (PCL)^3, Selective Search Python^4, and Caffe [69] libraries are used. The OpenCV library is used to apply image processing to the RGB images, and the PCL library is used with the depth data to both generate and perform operations on 3D point clouds. The Caffe library is a framework for deep learning neural networks, and also includes various pretrained CNNs, such as GoogleNet and AlexNet [6]. Finally, the Selective Search Python library is an unofficial version of the original implementation in Matlab.

1 https://www.python.org/downloads/release/python-351/
2 http://opencv.org/downloads.html
3 http://pointclouds.org/downloads/
4 https://github.com/belltailjp/selective_search_py

4.4 Extended Selective Search

Previous work has shown the importance of having a low number of object proposals [33]. Hence, in the first experiment we study potential ways to reduce the number of generated bounding boxes when using the novel depth features. For this task, the scores of all measures are evaluated separately per image. Using these observations, a threshold is set for the depth features.

For the second experiment, the colorspaces are varied for all combinations of color measures with the novel depth measures. For the experiments in this thesis, the RGB and HSV colorspaces are tested. Following the results of a comparison between five different colorspaces for the task of image segmentation [70], the hypothesis is that the HSV colorspace will outperform the RGB colorspace. For the default Selective Search, the parameters are kept at their default values, resulting in using all color measures and setting the parameters of the original oversegmentation method to σ = 0.8 and k = 100, where σ is the amount of smoothing applied to the image and k is the value of the threshold function. All other variants of Selective Search use the VCCS algorithm with a voxel resolution of 0.008, a seed resolution of 0.1, the color importance set to 0.2, the spatial importance set to 0.4, and the normal importance set to 1.0. All tested combinations of measures apply all default color measures and a subset of the novel depth features.
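For reference, the parameter values listed above can be collected as follows; the dictionaries themselves are only an illustrative configuration sketch, since the actual implementation passes these values directly to the respective libraries.

```python
# Default graph-based oversegmentation parameters (sigma and k above).
GRAPH_SEGMENTATION_PARAMS = {"sigma": 0.8, "k": 100}

# VCCS supervoxel parameters used by all depth-based variants.
VCCS_PARAMS = {
    "voxel_resolution": 0.008,   # metres
    "seed_resolution": 0.1,      # metres
    "color_importance": 0.2,
    "spatial_importance": 0.4,
    "normal_importance": 1.0,
}

COLORSPACES = ["RGB", "HSV"]     # varied in the second experiment
```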

In the third experiment, the possible problem of a biased algorithm is addressed. The original equation is presented in Equation 3.4, in which the color measures outnumber the depth measures. As it is not certain whether this influences the results, we propose Equation 3.5, in which the combination of color measures and the combination of depth measures are weighed equally. This could remove a bias in the Selective Search algorithm, if one is present.

The final experiment explores different Intersection-over-Union (IoU) threshold values, where the IoU threshold determines whether a generated bounding box matches a ground truth box. For this experiment, every bounding box generated during the run is stored together with its highest overlap score with a ground truth box. This list is later used to generate the results.
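A minimal IoU computation for axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates might look as follows; this is a sketch of the standard definition, not the exact implementation used in the experiments.

```python
def iou(box_a, box_b):
    # Intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Intersection over union of the two box areas.
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0
```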

4.5 Scene Classification

For the scene classification task the dataset is split into a train- and test set. The train set consists of 90% of the data, and the test set contains the remaining 10%. For these tests, scene classes that occur more than 50 times are used, which results in 21 scene categories. An equal distribution in the train- and test set is guaranteed by splitting each of the classes separately.

The scene classification process consists of three steps. First, the CNNs that are involved for both the RGB and HHA features are fine-tuned on the dataset. Second, the RGB, HHA, and/or the new depth features are computed, depending on the hyper-parameter that decides which features are tested; as a result, different combinations of the features are evaluated. The last step is to train the SVM using these features and to pass the test set through the trained SVM to compute the results.

For the process of fine-tuning the R-CNN for the SUN RGB-D dataset [13] and the task of scene type classification, the dataset is split into a train-, validation-, and test set. In order to prevent bias towards the most frequently occurring classes, each scene class is limited to a maximum of 150 and a minimum of 50 occurrences in the train- and validation set, where the remainder is added to the test set. As a final step before the fine-tuning process, the order of the entries in the train set is randomized. The fine-tuning process of the R-CNN model is performed on the Distributed ASCI Supercomputer 4 (DAS-4) server [71]. The fine-tuning process is run for 30.000 iterations, which is in line with the R-CNN paper [7]. For the remainder of this work it can be assumed that all CNNs that are mentioned have been fine-tuned.

The CNNs are trained in a similar way, although the input is different. For the HHA features we use AlexNet [6] instead of GoogleNet, because the original authors only tested their approach with that specific CNN model, and they made no claims as to whether it would work with other CNN models. In both cases the weights of the last two layers are fine-tuned to model scene categories instead of object classes. The GoogleNet results in a feature vector with 4.096 entries, and the AlexNet returns a vector of 1.024 entries. Both are independently normalized using the L2-norm. After this step they can be used by the SVM classifier.
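As a sketch of how the normalized feature vectors can be combined before training the SVM (function names are illustrative; which features are included depends on the combination being tested):

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Normalize a feature vector to unit L2 length.
    return v / (np.linalg.norm(v) + eps)

def combine_features(rgb_feat, hha_feat, depth_feat=None):
    # Each modality is normalized independently, then concatenated into a
    # single vector that is fed to the SVM classifier.
    parts = [l2_normalize(rgb_feat), l2_normalize(hha_feat)]
    if depth_feat is not None:
        parts.append(l2_normalize(depth_feat))
    return np.concatenate(parts)
```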

For the novel depth features, the objects that are used to train the model have to occur in at least 50 unique images according to the ground truth values. This method is chosen to remove any potential noise generated by a sub-optimal object detector. Another source of noise is the ground truth annotations themselves, unless we assume that they are generated with knowledge that cannot be derived from the image itself. For instance, objects near the borders of the image had annotations that extend outside the borders of the image. This problem is solved by taking the closest pixel inside the image for every pixel that lies out of bounds.
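A minimal sketch of this clamping step, assuming annotation polygons given as (x, y) pixel coordinates and an image of width W and height H (NumPy is used for illustration):

```python
import numpy as np

def clamp_polygon(points_xy, W, H):
    # Clamp out-of-bounds annotation coordinates to the nearest pixel
    # inside the image.
    pts = np.asarray(points_xy, dtype=np.int64)
    pts[:, 0] = np.clip(pts[:, 0], 0, W - 1)  # x coordinates
    pts[:, 1] = np.clip(pts[:, 1], 0, H - 1)  # y coordinates
    return pts
```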

Finally, we explore the effectiveness of the proposed depth feature compared to another context-based feature. The proposed depth feature essentially extends a co-occurrence matrix of the objects within the scene, and splits the number of co-occurrences between two objects across a histogram of ten bins. Hence, in order to test whether the performance improvement is mostly influenced by the object occurrence counts or by the actual split across the ten bins, we compare the proposed depth feature against a plain co-occurrence context feature, which can essentially be seen as the proposed depth feature with a single bin per object-object pair. In the context of the pipeline this feature still requires a properly functioning object detector, and it can serve as another example of an application for the improved object proposal generation.
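Under the same assumptions as the earlier feature sketch (a list of (class index, centroid) pairs per image), the baseline can be written as follows; this is illustrative only.

```python
import numpy as np

def cooccurrence_feature(objects, n_classes=209):
    # Single-bin variant: only co-occurrence counts, no distance information.
    feat = np.zeros((n_classes, n_classes), dtype=np.float32)
    for a in range(len(objects)):
        for b in range(a + 1, len(objects)):
            lo, hi = sorted((objects[a][0], objects[b][0]))
            feat[lo, hi] += 1
    # Keep one triangle (including the diagonal) and flatten.
    return feat[np.triu_indices(n_classes)]
```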


5 Results

In this section, we will first discuss the results of the various experiments involving the object proposal generation. Afterwards, the experiments involving scene category classification will be discussed in Section 5.2.

5.1 Extended Selective Search

Throughout this section, the abbreviations CTSF-1, CTSF-2, CTSFA, CTSFV, and CTSFVA are used. These abbreviations denote the measures applied in the extended Selective Search, where the color-based measures are described in Section 2.3 and the depth-based measures are described in Section 3.1. The abbreviations can be decoded as follows: (C)olor, (T)exture, (S)ize, (F)ill, (V)oxel, (A)djacency, each matching the corresponding measure. CTSF-1 is the default Selective Search algorithm, and CTSF-2 uses the same measures but applies the VCCS oversegmentation algorithm. Before discussing the results of the experiments, we start by elaborating on the evaluation metrics used in this section, namely Recall and MABO.

5.1.1 Evaluation Metrics

For the task of object proposal generation, we are mainly concerned with reducing the number of proposals while maintaining, or possibly improving, the recall of the default Selective Search approach. The quality of the proposals is also important: preferably, the generated bounding boxes are similar to the ground truth and have a high overlap. The metrics used for this task are the recall metric and the Mean Average Best Overlap (MABO) metric, as used in the original Selective Search paper [32].
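As a sketch of how these two metrics can be computed from the stored best-overlap scores (the per-class bookkeeping shown here is an assumption about the implementation; the metric definitions follow [32]):

```python
import numpy as np

# `best_overlaps[c]` is a list with, for every ground truth box of class c,
# the best IoU achieved by any proposal generated for its image.

def recall(best_overlaps, iou_threshold=0.5):
    # Fraction of ground truth boxes covered by a proposal with IoU >= threshold.
    all_scores = [s for scores in best_overlaps.values() for s in scores]
    return float(np.mean(np.asarray(all_scores) >= iou_threshold))

def mabo(best_overlaps):
    # Mean over classes of the Average Best Overlap per class.
    return float(np.mean([np.mean(scores) for scores in best_overlaps.values()]))
```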
