
Sparse salient visual feature-based localization

Annemarie W. Burger

10793399

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor:
Drs. Anthony 'Toto' van Inge
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Summary

Visual localization is a lesser used but quite promising method of localization. The aim of this research is to find a connection between the sparseness of a database, in terms of the number of pictures used for training and the number of pixels per picture, and the accuracy of a system at correctly identifying the room a photograph is taken in and the position within that room from where it is taken. The system built for this research performs very well on room categorization, with an average accuracy of 95%, but improvements can still be made on the position determination part. The connection found between the sparseness of the database and the accuracy of the room classifier can be described by the following equation: accuracy = 0.01425 * 'number of pictures used for training' + 0.00142 * 'number of horizontal pixels per picture'.

Contents

Summary
Contents
Introduction
Related work
Research method
Database
Implementation
    Room classification
        Colour histograms
        k-Nearest-Neighbour
    Position determination
        Edge extraction
        Straight line extraction
        2D to 3D transformation
Results
    Room classification
    Position determination
Evaluation
    Room classification
    Position determination
Conclusion
Discussion & future work
Literature references


Introduction

The most common idea that pops up in many people's minds when thinking about artificial intelligence is that of robots. This is of course only partially correct, but robotics is ultimately a major topic within the field. Three major problems associated with robotics are mobility, self-representation and localization. The latter is the main theme of this research. Robot localization is often done using GPS or movement tracking, and far less frequently using visual representations, even though humans, for example, perform well with this method and rely heavily on their eyesight for localization.

The reason visual representations are not commonly used for localization is probably that researchers still have a hard time correctly interpreting them. While a human can easily identify numerous objects such as chairs, faces and buildings, this is considerably harder for a computer program. Some people believe this is because humans have learnt where to look, and what to expect.

This research makes use of visual representations for localization by using pictures of the upper corners of rooms. A hypothetical robot using the software in question would operate inside a building, such as an office, a hospital or a school, so the robot would have to determine in which room it is currently located, as well as its position within this room. The upper corners of rooms are suitable for this task since they do not tend to change much over time, which is useful when categorizing the room, as is the fact that a picture of an upper corner generally does not contain many objects. Another useful feature of upper corners is that they usually have a known shape, namely right angles, which makes it possible to determine the position within the room.

This research revolves around the question of how the sparseness of a training database influences the accuracy of a specifically designed system built to determine the room in which an input picture is taken and the position within the room from where this picture is shot. This requires building a computer system that handles these problems, and a series of tests in which this system is run with a varying database. The variation in the database consists of two factors: the number of pictures of each room in the training database, and the number of pixels in each picture. The hypothesis is that these two variables are related to the accuracy of the computer system in performing localization using visual representation.

This thesis starts by explaining the state-of-the-art work in the relevant fields. After this, the research method and the database are illustrated, and explanations are given of the design choices made when building the system. Then the results are presented and evaluated, leading to the conclusion that the accuracy of the system can indeed be linked to the sparseness of the database. The system built works very well on the room classification part, but there are still improvements to be made on the position determination part. Suggestions for doing so, among others, are given in the discussion and future work section, which also concludes this thesis.

The relevance of this research lies in finding a formal relation between the sparseness of a database and the accuracy of the system built. This can be of use to other researchers when discussing matters such as the memory required for a certain localization problem, or when trying to predict the accuracy based on the quality of the input pictures. Better localization using visual representation could lead to easier, faster and cheaper localization tools, which could make robots more useful for a number of tasks and continue to further assist humans in their lives.


Related work

Visual localization is a lesser used method of localization. This could be because of the clutter and noise that come with pictures. However, very useful features can be extracted when using the proper methods. Zhang and Kosecká (2005), for example, used colour histograms in combination with pixel selection and SIFT keypoints to identify buildings. This approach was quite effective, with an 83.5% accuracy rate, and also worked well when viewing the building from different angles. They only used the pixels belonging to an edge on the image whose direction complied with the main vanishing directions, because these pixels were more likely to belong to the buildings in the picture: manmade structures are far more likely to have straight lines and corners than natural objects. They used this method to reduce the problem of background change and noise, and called this representation a 'localized colour histogram' (Zhang and Kosecká, 2005).

Mallya and Lazebnik (2015) conducted interesting research on finding the box-like shape that forms most rooms. Their paper 'Learning Informative Edge Maps for Indoor Scene Layout Prediction' uses informative edges to predict the probability of different kinds of box shapes. They define these informative edges as "the edges of the projected 3D box that fits the room" (Mallya and Lazebnik, 2015), which Figure 1 illustrates.

An important aspect of finding features to determine this box is corner detection, on which great work was done by David Lowe (2004), among others. He developed SIFT (Scale Invariant Feature Transform) keypoints. These keypoints are rotation and scale invariant, highly discriminant, and robust against changes in illumination and noise. Rosten and Drummond (2006) wrote a comparison of the Harris, SUSAN and SIFT feature detectors, as well as one they developed themselves; they found that their method was less computationally intensive and took less processing time. An example of the effectiveness of SIFT keypoints is illustrated in Figure 2: the lines show the matches in keypoints, and it is clear that most matches are found when the building is in fact the same.

Probably the most popular edge finding algorithm is Canny edge detection, which was invented by John Canny (Canny, 1986) and of which an example can be found in Figure 3. This algorithm works by smoothing the image using a Gaussian convolution and looking for maxima in the derivative.

Figure 1: Image with groundtruth box (Mallya and Lazebnik, 2015)

Figure 2: An example of matches of SIFT keypoints (Zhang and Kosecká, 2005)

Figure 3: An example of edges found using Canny Edge (Canny, 1986)


Edge pixels are extracted, and complete edges are constructed by finding adjacent sets of edge pixels, grouping them into ordered lists and using thresholding to eliminate the weakest solitary edges, while keeping the ones connected to more distinct edges (Siegwart, Nourbakhsh and Scaramuzza, 2011). A Gaussian convolution is a linear filter used to blur the image and reduce noise: every pixel of the picture is replaced by a weighted average of the intensities of the pixels in its neighbourhood (Siegwart, Nourbakhsh and Scaramuzza, 2011). To make sure that the noise reduction is maximal while the loss of information due to blurring is minimal, a proper filter needs to be used.

A commonly used algorithm to extract straight lines from pictures is the Hough transform. It works by letting edge pixels 'vote' for the parameters of a straight line; the lines with the highest number of votes in the end are the straight edge features of the picture (Siegwart, Nourbakhsh and Scaramuzza, 2011).
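A minimal OpenCV sketch of this voting procedure (the file name and threshold values here are illustrative assumptions, not the thesis's settings):

```python
import cv2
import numpy as np

# extract edge pixels, then let them vote for (rho, theta) line parameters
image = cv2.imread("corner.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file
edges = cv2.Canny(image, 50, 150)
lines = cv2.HoughLines(edges, 1, np.pi / 180, 100)
# each entry in `lines` is a (rho, theta) pair that received enough votes
```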

When working with a sparse database, Se, Lowe and Little (2005) found that this sparseness was not necessarily an issue. Their research revolves around a robot driving around while trying to locate itself and simultaneously mapping its surroundings. They used SIFT keypoints for the localization, and showed that even at low resolution these were sufficiently discriminative.

Shi, Zhang and Liu (2004) showed that it is possible to determine one's position within a room using a picture of a corner. Most corners of rooms are right angles; knowing this, and knowing the angles the edges of the corner form on the picture, one can determine the position of the camera relative to the corner. To do so, it is necessary to find the angles, referred to in Figure 4 below as θi and θj, between the actual lines, referred to as li and lj, and the pictured lines, referred to as l'i and l'j.

Figure 4: (a) Basic imaging geometry of the corner; (b) edge lines li and lj of the corner (Shi, Zhang and Liu, 2004)


With these θ's found, it is possible to rotate the corner so that the position vector can be determined.

From all this related work, a knowledge gap emerges: researching the accuracy of a system that localizes both the room a picture is taken in and the position within that room, while using a sparse database. A colour histogram proved useful for the room classification part of this task, and the aforementioned Canny edge algorithm helped find the lines needed to determine the position vector within the room. The Hough transform proved not so effective in this research, so an alternative method was proposed. The SIFT keypoints looked very promising in the articles mentioned before, but were not used in this research for lack of immediate necessity.

Research method

The research method is as follows. Firstly, the computer system for room recognition and position determination is built; details are discussed later in this thesis. Then this system is tested using different databases as input. These databases differ from each other in two factors: the number of photographs used to train the system, and the number of pixels in the horizontal direction of each picture. The aim is to find a formula that describes the relation between these variables and the accuracy of the system, since this answers the main research question on the connection between the sparseness of a database and the accuracy when performing localization using visual features.

The parameters chosen for the variable 'pixels per picture' are 500, 400, 300, 200 and 100 pixels in the horizontal direction of the picture. Since all pictures are taken in landscape mode, this is the longer dimension; the vertical dimension is simply resized by the same factor as the horizontal one, so the aspect ratio is not affected.

The parameters chosen for the variable 'pictures used for training' are 50, 40, 30, 20 and 10 pictures. For the first of these parameters, 50 pictures are used to train the classifier and the remaining 25 pictures are used to test the system, since the database used in this research consists of 75 pictures in total. When using only 10 pictures to train the classifier, 65 pictures remain to test the system. Although more test pictures can yield more precise accuracy estimates than fewer test pictures, this imbalance is mitigated by running the classifier a thousand times per parameter setting.

The accuracy of the system with regard to room classification is simple to determine: it is the number of pictures correctly identified divided by the total number of pictures. The accuracy of the position determination part is somewhat different. The goal is to determine, from the information in the picture and the knowledge that the corner on the picture is a right angle, a position vector that coincides with the line between the camera and the corner of which the picture is taken. The accuracy is measured by the Euclidean distance between the actual camera position and the line determined.
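A minimal sketch of this accuracy measure, assuming the corner is the origin of the coordinate system (as in the database labels described below):

```python
import numpy as np

def position_error(camera_pos, direction):
    # Euclidean distance from the actual camera position to the line
    # through the corner (the origin) with the determined direction vector
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    p = np.asarray(camera_pos, dtype=float)
    return np.linalg.norm(p - np.dot(p, d) * d)

# example: camera at (200, 100, 140) cm; a perfect position vector gives 0.0
print(position_error([200, 100, 140], [200, 100, 140]))
```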


Database

The database consists of 75 pictures shot at Science Park 904, 1098 XH Amsterdam. Three different rooms were chosen, namely A1.06, A1.50 and D1.110. One of the rooms was an accessible toilet and the other two were classrooms. One of the classrooms had a ceiling with a lower and a more elevated part, and both had one wall with windows. These rooms were chosen since they are quite representative of other rooms in the building.

In each of these rooms, 5 different positions were chosen. These positions were measured relative to the upper corner positioned to the right of the main door of the room. The x-axis was defined as the distance from the wall containing the main door to the camera. The y-axis was defined as the distance from the wall to the right of the wall containing the main door to the camera. The z-axis was defined as the distance from the ceiling to the camera. All this is illustrated in Figure 5.

At each position, photographs were taken of the door and of the four upper corners of the room, starting at the corner to the right of the door and moving clockwise. This is illustrated in Figure 5 by the numbers 0 to 4, which indicate the order in which the pictures were taken and by which they were labelled. All pictures were labelled by room number (respectively 6, 5 and 1 for A1.06, A1.50 and D1.110), position number (1 to 5 for each room) and object number (0 for the door, 1 to 4 for the respective corners). Also included in the name of each picture was the position of the camera relative to the upper corner to the right of the main door, in the order x-axis, y-axis, z-axis. One of the names of the pictures in the database, shown below in Figure 6 as the upper right picture, is 5.2.1.200.100.140. This means the picture was taken in room 5, i.e. A1.50; the position within the room was the second of the series of five; and the object on the picture is the first corner to the right of the door, indicated in Figure 5 as corner 1. The distance from the camera to the left wall on the picture is 200 centimetres, the distance to the right wall is 100 centimetres, and the distance to the ceiling is 140 centimetres.

It is important to realise that the goal of the position determination system built in this research is to find the vector describing the line from the corner to the camera. In Figure 5 this vector is illustrated as a red dotted line. In the example just given, a perfect position vector output by the system would be [200*x, 100*x, 140*x] with x an arbitrary constant.

Figure 5: A general outline of the labels in the database and the position vector
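As a small illustration of this naming scheme, a hypothetical helper (not part of the thesis code) that decodes a file name:

```python
def parse_name(name):
    # "5.2.1.200.100.140" -> room 5, position 2, object 1 (corner 1),
    # camera at (200, 100, 140) cm relative to corner 1
    room, pos, obj, x, y, z = (int(v) for v in name.split("."))
    return room, pos, obj, (x, y, z)

print(parse_name("5.2.1.200.100.140"))
```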


The rooms chosen for this database all had right-angled corners, and consequently a box shape. This was necessary for the position vector determination. Examples of pictures from the database can be found in Figure 6. Their respective names, from left to right, are: 1.1.0.200.200.100, 1.4.3.150.150.200, 5.2.1.200.100.140, 5.4.4.150.150.40, 6.3.0.300.200.70 and 6.4.2.200.300.120.

Implementation

Room classification

The approach taken in this research for determining the room in which a photograph has been taken works with colour histograms. After extracting these features, the k-Nearest-Neighbour classification algorithm is used.

Colour histograms

As mentioned before, Zhang and Kosecká (2005) used a method called 'localized colour histograms' for their research on recognizing buildings. Although this method works very well for Zhang and Kosecká, a simpler approach was chosen in this research, because the pictures in our database contain no background change and little noise: everything in the pictures is part of the room and thereby useful for classification. The colour histogram method used in this research is one suggested by Adrian Rosebrock (2016) as part of a machine learning tutorial. Originally the idea was to further improve the classification algorithm after applying this method, but since the results were already quite good using only Rosebrock's implementation, we refrained from that.

Rosebrock implements a colour histogram as follows: he first converts the image to HSV colours, where HSV stands for hue, saturation and value (sometimes referred to as HSB, where the B stands for brightness). Then the OpenCV function calcHist is used, and the resulting histogram is normalized using the OpenCV function normalize. The histogram is flattened before being returned, so that it becomes a feature vector.


In this research, eight bins per channel are used for the histogram, which means each image is defined by an 8 × 8 × 8 = 512-dimensional feature vector. The code for this reads as follows:
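A minimal sketch of this feature extractor in the spirit of Rosebrock's (2016) tutorial:

```python
import cv2

def extract_color_histogram(image, bins=(8, 8, 8)):
    # convert to HSV and compute a 3D colour histogram over all channels
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, bins,
                        [0, 180, 0, 256, 0, 256])
    # normalize so histograms are comparable across picture sizes,
    # then flatten into a single 512-dimensional feature vector
    cv2.normalize(hist, hist)
    return hist.flatten()
```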

k-Nearest-Neighbour

The k-Nearest-Neighbour, or k-NN, classifier is one of the simplest machine learning/image classification algorithms according to Rosebrock (2016). It works by mapping each object to its features; when the features are chosen properly, similar features usually correspond to the same label. Knowing this, an object is classified by determining its features, checking the labels of the k objects with the most similar features, and assigning the object the label most common among its k neighbours.

Empirical testing led to the determination that, when applied to the full database of 75 pictures, using 50 pictures for training and 25 for testing, a 5-Nearest-Neighbour classifier was most effective and could reach accuracies of more than 90%. More on this topic is discussed in the chapters Results and Evaluation.
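A minimal sketch of this classification step, assuming scikit-learn's k-NN implementation and the histogram features from above (`features` and `labels` are assumed inputs, one histogram vector and one room number per picture):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# features: list of 512-d histogram vectors; labels: room numbers (1, 5, 6)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, train_size=50, test_size=25)

model = KNeighborsClassifier(n_neighbors=5)  # the 5-NN setting used here
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```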

Position determination

The approach to determining the position within the room is as follows. Firstly, Canny edge detection is used to find the edges in the picture. Hopefully the three edges we need (the one between the two walls, the one between wall 1 and the ceiling, and the one between wall 2 and the ceiling) are among the edges found. Then a number of pixels in the centre of the picture are tested as the centre point of these three edges, i.e. as the corner itself. The centre with the most edge points on the three lines drawn through it is chosen as the most probable approximation of the corner. This yields the equations of the three lines that, if correct, span the corner. From these, the angles between the lines can be found, and these can be used to determine the position vector, defined as the line between the camera and the corner.

Edge extraction

Since the aforementioned Canny edge detection (Canny, 1986) is a proven, solid algorithm, it is used in the program built for this research. A function that alters the parameters of the OpenCV function Canny was designed to keep increasing the parameters until only a certain percentage of the pixels of the original picture is recognised as an edge. The parameters used are two thresholds: the higher one determines whether a pixel is an edge pixel, and the lower one determines whether a weaker candidate edge pixel may still count as one because of stronger edge pixels surrounding it (OpenCV, 2017).


It was determined by trial and error that with 9% of the pixels being edge pixels, the correct edges were usually found. Figure 7 shows some examples of different percentages, which illustrates the ultimate choice of the 9% value for the system.
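A minimal sketch of such an adaptive thresholding loop (the starting thresholds and step sizes here are assumptions, not the thesis's exact values):

```python
import cv2
import numpy as np

def canny_edge_ratio(gray, target=0.09, low=10, high=30, step=5):
    # raise both Canny thresholds until at most `target` (9%) of all
    # pixels in the image are marked as edge pixels
    edges = cv2.Canny(gray, low, high)
    while np.count_nonzero(edges) / edges.size > target:
        low += step
        high += 2 * step
        edges = cv2.Canny(gray, low, high)
    return edges
```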

Once the edge image is returned, a function converts it to a list of coordinates of the edge pixels. We distinguish three different lists: one with the pixels in the rightmost part of the picture, one with the pixels in the leftmost part, and one with the pixels at the bottom, the latter excluding the pixels already included in the former two. Testing determined that a band of 30% of the total picture was usually enough to accurately capture only the right/left/under edge in the respective pixel list, as illustrated in Figure 8. The database was composed such that the corner would always be somewhere in the middle of the picture, and it was determined that when 30% of the picture is used to find the right/left/under edges, the corner always lies in the 'rest' part of the image.

The separation of the edge pixels into three different lists was necessary because, if all edge pixels were included in the process of finding the three edges, the right edge could be chosen wrongly because of outliers in the left part of the picture. A sketch of this partitioning is shown below.
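A minimal version of the partitioning, assuming the 30% bands described above:

```python
import numpy as np

def split_edge_pixels(edges, band=0.30):
    # partition edge-pixel coordinates into 'left', 'right' and 'under'
    # lists using 30% bands; bottom pixels already assigned to a side
    # band are excluded, and the middle remains for centre candidates
    h, w = edges.shape
    ys, xs = np.nonzero(edges)
    left, right, under = [], [], []
    for x, y in zip(xs, ys):
        if x <= band * w:
            left.append((x, y))
        elif x >= (1 - band) * w:
            right.append((x, y))
        elif y >= (1 - band) * h:
            under.append((x, y))
    return left, right, under
```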

Straight line extraction

Although the aforementioned Hough transform is a widely used method to find straight lines, it did not perform well when experimented with in this research. It almost never found more than one edge and was not able to accurately determine the start and end of a line. Two examples of this can be found in Figure 9. Reasons could be the low quality of the pictures, or the fact that the colour and texture differences between the walls and the ceiling were sometimes very small.

Because of these inadequate results, a new approach was devised and an original algorithm was developed.

Figure 7: Three examples of different percentages of edge pixels (8%, 9% and 10%, each at 100 pixels wide)

Figure 8: An example of the categorization of the edge pixels

Figure 9: Two examples of the poor performance of the Hough transform


The algorithm takes the lists with edge pixel coordinates as input. It then iterates over the pixels in the area labelled in Figure 8 as 'Rest of the edge pixels' and tries to form a corner from each such point, referred to as the centre point. It starts by finding the line through the centre on which the highest number of edge pixels labelled 'Right' in Figure 8 lie, and then does the same for the 'Left' and 'Under' edge pixel lists. The number of edge pixels lying on each line is divided by the total number of edge pixels in the respective part of the image, and the three fractions are summed. The centre point that maximizes this value is chosen as the most probable estimate of the real corner, and the algorithm returns the equations of the three lines found. This method, although quite slow, produces adequate results and usually returns the correct corner lines, as illustrated below in Figure 10.

Written in pseudocode the algorithm works as follows:
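(A reconstruction from the description above; `line_through` and `count_on_line` are hypothetical helper names: the former fits a line through two points, the latter counts the edge pixels lying approximately on it.)

```python
# try every candidate centre, fit the best line to each of the three
# edge-pixel lists, and keep the centre with the highest combined score
best_score, best_corner = 0.0, None
for centre in rest_pixels:                    # candidate corner positions
    score, lines = 0.0, []
    for part in (right_pixels, left_pixels, under_pixels):
        best_count, best_line = 0, None
        for p in part:                        # candidate line directions
            line = line_through(centre, p)
            count = count_on_line(line, part)
            if count > best_count:
                best_count, best_line = count, line
        score += best_count / len(part)       # fraction of pixels covered
        lines.append(best_line)
    if score > best_score:
        best_score, best_corner = score, (centre, lines)
```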

2D to 3D transformation


With the angles between the three lines found, it is possible to determine the position vector from where the photograph has been taken. This is because all the corners in the database are right angles. Knowing this, and knowing the angles the corner forms on the picture, one can determine the position of the camera relative to the corner. From the slopes of the three lines found, we can find the angles between them by taking the arctangent of the slopes. The angle between the right line and the down line is then defined by 180 degrees minus the arctangent of the right line plus the arctangent of the down line. The angle between the left line and the down line is defined by 180 degrees plus the arctangent of the left line minus the arctangent of the down line. The angle between the left and the right line is defined by the arctangent of the right line minus the arctangent of the left line. All this follows from basic geometry.
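A worked sketch of these angle formulas (the arguments are the slopes of the three lines found):

```python
import math

def corner_angles(slope_right, slope_left, slope_down):
    # arctangent of each slope, in degrees
    a_r = math.degrees(math.atan(slope_right))
    a_l = math.degrees(math.atan(slope_left))
    a_d = math.degrees(math.atan(slope_down))
    right_down = 180 - a_r + a_d   # between the right and down lines
    left_down = 180 + a_l - a_d    # between the left and down lines
    left_right = a_r - a_l         # between the left and right lines
    return right_down, left_down, left_right
```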

From this information, Shi, Zhang and Liu (2004) explain how to find the angles between the actual lines and the lines pictured, referred to as θ's. Once these angles are found, we can rotate the real 3D corner twice according to two of the found θ's, so that one of the edges of the corner points exactly in the direction of the camera and the other two span the plane of the image. The rotation matrix is shown below, with α and β being two of the θ's found by following Shi, Zhang and Liu (2004)'s steps as explained in 'Related work'.

$$R = \begin{pmatrix}
\cos\alpha\cos\beta & -\cos\alpha\sin\beta & \sin\alpha \\
\sin\beta & \cos\beta & 0 \\
-\sin\alpha\cos\beta & \sin\alpha\sin\beta & \cos\alpha
\end{pmatrix}$$

This rotation matrix rotates the object in question about the y- and z-axes, using the angles found between the real lines and the pictured lines, so that the coordinate system of the actual corner rotates and becomes aligned with the coordinate system of the picture frame. Once the coordinate systems are aligned, the z-axis is also aligned with the line from the camera to the corner, i.e. the position vector we are looking for.
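A minimal sketch showing that the matrix above is the composition of two elementary rotations; multiplying them out reproduces R:

```python
import numpy as np

def rotation_matrix(alpha, beta):
    # R = Ry(alpha) @ Rz(beta): a rotation about the y-axis followed
    # by one about the z-axis, matching the matrix shown above
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    Ry = np.array([[ca, 0, sa], [0, 1, 0], [-sa, 0, ca]])
    Rz = np.array([[cb, -sb, 0], [sb, cb, 0], [0, 0, 1]])
    return Ry @ Rz
```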

Results

Room classification

The room classification algorithm, which uses colour histograms, was run 1000 times on the full database of 75 pictures, for 2 × 5 different parameters. These parameters were the number of pixels in the horizontal direction of the picture (100, 200, 300, 400 and 500 pixels) and the number of pictures used to train the system before testing it with the pictures left in the database once the training pictures were removed. This latter parameter was tested using 10, 20, 30, 40 and 50 training pictures. Graph 1 displays the average results of these tests with the number of horizontal pixels per picture as variable.


Graph 2 displays the average accuracy with the number of pictures in the training database as variable.

Graph 2: The accuracy of the room classifier with the number of pictures in the training database as variable

Position determination

When determining the position within the room, a position vector was given as output. The accuracy was then calculated by taking the Euclidean distance from this line to the exact point from where the picture was taken. Since the algorithm was so time consuming, it was only tested using 100 pixels in the horizontal direction of the picture. Since the algorithm finds the position vector based on the corner in the picture, only pictures of corner 1 (the first corner to the right of the door) were used, because the measurements in the database are relative to this corner.

To quantify how time consuming the system is: it was run on an Asus Zenbook laptop with an i7-4510 CPU and 4 GB of RAM. When running the full algorithm over all its options, determining the lines on a single picture takes over 100 minutes, which is why the options the algorithm tries were limited: only one fifth of the edge pixels are investigated as forming a line with the centre. The process then still took about 11 minutes per picture.

The results were quite irregular in accuracy. One of the worst position vector determinations had a Euclidean distance of 107 centimetres between the position vector and the actual camera position, with the camera 114 centimetres away from the corner. This makes for a relative error of 107/114 = 0.94 metres of error per metre of distance from the corner. The best determination had a Euclidean distance of 14 centimetres at a distance of 380 centimetres from the corner, a relative error of only 0.04. All tested images and their errors can be found in Table 1. Taken into account is the fact that the line finding algorithm does not always find the correct lines, especially at the lower pixel rates used for this test. This is why each line determination is given a rating, with 1 out of 5 points when the lines are not even close to the actual corner, and 5 out of 5 points when the lines align (almost) exactly.


Table 1: Results from the position vector determination algorithm

Evaluation

Room classification

Looking at the accuracy of the system when changing the number of pixels in the image, an interesting phenomenon occurs: the accuracy is actually better at a lower pixel rate. The reason could be that the resizing algorithm keeps the more important or more discriminant pixel values, or that the histogram algorithm performs better with less input. The first explanation could be tested by manually resizing the image before even loading it into the system. If the better performance at lower pixel rates is to be ascribed to the histogram algorithm, this could be proven or disproven by trying histogram methods other than the one featured in this system, i.e. the one from OpenCV.

Also interesting was the fact that the accuracy actually dropped when more training pictures were given to the system. An explanation could be over-fitting, seeing that the setting with 50 training pictures and the 5-NN classifier had an average accuracy of 89%, compared with the overall accuracy of 95%. One might think that better performance with more training pictures could be achieved by using, for example, a 7-NN classifier instead of a 5-Nearest-Neighbour classifier. However, when the system was tested with a 7-NN classifier, this improved neither the accuracy of the 50-training-picture setting nor the overall result.

By solving the minimization of the squared error between a linear surface and the values found by the classification system, the following linear equation was found to best approach the values: accuracy = 0.01425 * 'number of pictures used for training' + 0.00142 * 'number of horizontal pixels per picture'. The mean squared error between this equation and the actual values found was only 0.1.
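A minimal sketch of such a least-squares fit with NumPy; the accuracy values below are placeholders, not the measured data (those are in Graphs 1 and 2):

```python
import numpy as np

# the 5 x 5 grid of parameter settings tested in this research
train_counts = np.array([10, 20, 30, 40, 50])
pixel_widths = np.array([100, 200, 300, 400, 500])
T, P = np.meshgrid(train_counts, pixel_widths)
X = np.column_stack([T.ravel(), P.ravel()])

# placeholder accuracies; with the measured values this fit yields
# the coefficients 0.01425 and 0.00142 reported above
y = np.full(X.shape[0], 0.95)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```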


Position determination

Unfortunately, the results of the position determination were not as good as those of the room classifier. The position vectors returned were often not even close to the actual camera position, and sometimes the system was not able to find the proper lines at all. Especially the room labelled '5' caused a lot of trouble when trying to find the edges at a horizontal width of 100 pixels. An example of this can be found below in Figure 11.

The average error over the 15 pictures is 0.50, which means that for every metre the camera moves further from the corner, the position determination algorithm gets on average 50 centimetres further from the original position. Taking only the average error of the 4 pictures whose lines were (almost) perfectly aligned with the actual corner lines, the error drops to 0.27; also including the three pictures that scored 4 out of 5 for line precision, the error is 0.28. There could be a number of reasons the position determination is not working adequately. One could be that the photograph is not precise enough to accurately determine the lines the edges form, as seems to be the case in Figure 11. Another could be that something is wrong in the determination of the angles between the real lines and the image lines.

Conclusion

The research question which fuelled this research was the following: how does the sparseness of a database affect a specifically designed system that determines the room in which an input picture is taken and the position within the room from where it is shot? After building and testing the system in question, we found that even with a database as sparse as the one tested in this research, i.e. consisting of 10 training pictures of only 100 pixels wide each, the system can still determine the room a picture is taken in with an accuracy of over 95%, while using only a quite simple colour histogram method and a k-Nearest-Neighbour algorithm. The connection between the sparseness of the database and the accuracy of the room classification part of the system can be described by the following equation: accuracy = 0.01425 * 'number of pictures used for training' + 0.00142 * 'number of horizontal pixels per picture'. A new method was tested for finding the edges which span an upper corner of a room; the position determination built on these edges still leaves room for improvement, since even when the correct edge lines were found, the average Euclidean weighted error would still be 0.28 metres of deviation per metre of distance from the corner itself.

Discussion & future work

There are some aspects of this research that could be further researched or improved. Among these is the obvious point that the system should be tested with a more extensive database. Since the database used in this research consisted of only 75 pictures taken in three rooms, a more accurate and precise conclusion could be drawn with a bigger database.

An idea from the start of this research that was not completed in the end was to further improve the room classifier with a SIFT keypoint matcher. SIFT keypoints were used in much of the research mentioned in 'Related work', but since the classifier was already performing quite well using only colour histogram features, this improvement was left for future work. The same goes for a method using Grey Level Co-occurrence Matrix (GLCM) features: once the edge lines of the corner have been found, one could take patches of the two walls and the ceiling and compare these to patches found in pictures yet to be classified. More information about this method can be found on Scikit-image.org (2017).

Another idea which was not carried out, since the classification algorithm was already performing quite well, is that of conditional probability. Since the database consists of 15 different positions, with 5 pictures taken from each, one could better categorize a room given another picture of the same room and position. If a robot were uncertain of its location after taking a picture of just one corner, it could take a picture of another corner and use this as extra input to correctly classify the room. The accuracy of the room classifier would almost certainly go up if two input pictures of the same room and position were given instead of just one.

The reason all corners of the room, as well as the door, were included in the database is a direct consequence of an earlier idea that was not completed. The plan was to find the exact location of the camera within the room, which requires some sort of scaling of the photograph. A picture of a door is very useful for this, since doors in general, and certainly doors in public buildings like the ones in the database, have set measurements. With this knowledge and the position vector determined from the corner, one can determine the exact location of the camera within the room. This idea was ultimately not executed since expanding the system to handle this task was expected to be very time consuming, though not very scientifically challenging, nor something that had not been done before. It could be a useful expansion of the system in the future, though.

Literature references

Canny, J.F. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), pp.679-698.

Lowe, D. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2), pp.91-110

Mallya, A. and Lazebnik, S. (2015). Learning Informative Edge Maps for Indoor Scene Layout Prediction. International Conference on Computer Vision, 2015.


OpenCV. (2017). Feature Detection — OpenCV 2.4.13.2 documentation. [online] Available at: http://docs.opencv.org/2.4/modules/imgproc/doc/feature_detection.html?highlight=canny#id1 [Accessed 20 Jun. 2017].

Rosebrock, A. (2016). k-NN classifier for image classification - PyImageSearch. [online] PyImageSearch. Available at: http://www.pyimagesearch.com/2016/08/08/k-nn-classifier-for-image-classification/ [Accessed 20 Jun. 2017].

Rosten, E. and Drummond, T. (2006). Machine Learning for High-Speed Corner Detection. In: Leonardis, A., Bischof, H. and Pinz, A. (eds), Computer Vision – ECCV 2006. Lecture Notes in Computer Science, vol. 3951. Berlin, Heidelberg: Springer.

Scikit-image.org. (2017). GLCM Texture Features — skimage v0.14dev docs. [online] Available at: http://scikit-image.org/docs/dev/auto_examples/features_detection/plot_glcm.html [Accessed 26 Jun. 2017].

Se, S., Lowe, D. and Little, J. (2005). Vision-based global localization and mapping for mobile robots. IEEE Transactions on Robotics, 21(3), pp.364-375

Shi, F., Zhang, X. and Liu, Y. (2004). A new method of camera pose estimation using 2D–3D corner correspondence. Pattern Recognition Letters, 25(10), pp.1155-1163.

Siegwart, R., Nourbakhsh, I. and Scaramuzza, D. (2011). Introduction to Autonomous Mobile Robots. 2nd ed. Cambridge, MA: MIT Press.

Zhang, W. and Kosecká, J. (2005). Localization Based on Building Recognition. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

Appendix 1

Link to the database used in this research, as well as the Python code used to run the system: https://drive.google.com/open?id=0B0rszAkjH1RBZmdHSC1ub3hwY1E
