
Master Thesis

3d localization of a (transparent) bottle in changing lab environments using Artificial Intelligence

Michiel Hoogeboom

Hanze University of Applied Sciences 23-06-2019


1. Summary

Robots and cobots are increasingly core technology within laboratory automation. Within BD Kiestra all kinds of robots and cobots are used for lab automation; however, these robots have no way to grasp or interact with objects without knowing beforehand where the object is, which is necessary for BD Kiestra to advance their automated laboratories. This thesis researches and demonstrates the real-time localization of 3d objects (position and orientation) by means of applying Artificial Intelligence (AI). The object that is localized is a Bactec bottle from BD Kiestra. Localizing objects in 3d in real time is not an easy task. To be able to do this, some sort of 3d measuring system is needed. In this approach stereo cameras are used to determine the position of the Bactec bottle.

With stereo vision a 3d depth image can be made by matching two 2d images [14]. The advantage is that, in contrast to the other methods, no color data is lost when making the depth image. However, stereo vision needs a lot of computational power to work in real time, it is difficult to get an accurate 3d position from the 2d images at longer ranges [7], and it is difficult to find stable features in transparent objects. This is also the reason why, until now, no general solution using stereo cameras has been found.

To design the localization algorithm using stereo vision, data is collected using a UR5-e cobot. This cobot is programmed to move around the object in order to capture the bottle at different angles and positions. In addition, a separate thread runs on the robot which sends the position information to the Jetson TX2 development board, where each image is coupled to its location. In total ~5k images are collected, with which a localization algorithm can be designed. To process the data, deep neural networks are used. Because the application uses a new approach for 3d object detection, this thesis covers how different hyperparameters influence the neural network performance on the localization method using stereo vision in changing lab environments without changing the Bactec bottle.

Within the neural network, the influence of the following parameters on the localization task is researched:

- epoch sizes
- optimizers
- learning rates
- batch sizes
- loss functions
- network complexity
- layer types

First some theoretical research is done. Then experiments are performed to analyse the performance of these parameters. From these experiments several things can be concluded.

First, it was important for training the neural network that the number of epochs is >160, because otherwise the neural network might not reach its global minimum or a solution close to the global minimum.

Secondly, the experiments have shown that for this localisation task, only Adamax can be used to train the neural network, probably due to its infinite-order norm, which makes it more stable and robust to noise than the other optimizers. When using other optimizers the neural network could not detect the Bactec bottle.


Thirdly, it was important to keep the learning rate at 0.05 or lower and the batch size at 10. This combination has shown the best performance when training the neural network. This learning rate allows the network to train towards a local minimum without directly overshooting it, and this batch size allows the network to find new local minima.

The fourth parameter was the loss function. The experiments have shown that the mean squared error loss function was the best to use when training the neural network for the localization task. When using the other loss functions the neural network was not able to detect the Bactec bottle.

The last parameters/settings, the network complexity and the layer types, were more variable: the experiments have shown that several different choices here still result in a high accuracy or low loss. Although different network sizes and configurations were able to show good performance on the localization task, there were several requirements for the network to be able to localize the Bactec bottle. First, it was important to use two 2d convolutional layers as the input and the first hidden layer, without any dimensional reduction layers in between, because the dimensional reduction layers removed detail that the first two layers of the network needed to localize the Bactec bottle.

The best performing experiment was one of the simplest networks that were tried, which shows that a higher complexity does not necessarily increase the performance. However, even with this simple neural network it was visible that the prediction is not optimal. Although it is possible to determine the position, it is difficult to reach a high accuracy within a range smaller than +/- 60 centimeters. The same holds for the rotation estimates: the rotation can be estimated with a high accuracy only if estimates within +/- 1 radian are acceptable.

These single results are not sufficient to use on one of the robots within BD. However, when estimating the position of the Bactec bottle several times from different angles and combining the results into one single estimate, the results could be sufficient. This should be determined in further research. The recommendations contain some suggestions which could potentially improve the performance of the neural network so that it can localize the Bactec bottle with a high precision and accuracy.


2. Foreword & acknowledgements

This thesis is written as the completion of the master Sensor System Engineering at the Hanze University of Applied Sciences. This master program focuses on the development and research of smart sensor systems using data analysis, machine learning and adaptive filtering.

The subject of this thesis is 3d localization of a (transparent) bottle in changing lab environments using Artificial Intelligence. As a master student I would like to thank several people from BD Kiestra who helped me along the way.

First, I would like to thank Johannes Bruinsma for giving me the opportunity to research this challenging problem. Besides Johannes Bruinsma, I would also like to thank my company supervisor Frans Feijen for always having the time to discuss my progress and challenges within the project. Lastly, I would like to thank my colleagues Antons Prokopenko, Mahir Faik Karaaba, Samer Ahmed, Maksym Kyryliuk, Timo Holzmann, Rinalds Kugis and Roland Renkema for always being open to discuss problems and give suggestions.

Besides my colleagues, I would also like to thank the people from my university. First my university supervisor and teacher Felipe Nascimento Martins, for checking all my work, giving really useful feedback, and being there to help me find a solution whenever I was stuck. I would also like to thank all the teachers who helped me perform at the level at which I now am. These include Ronald van Elburg, Marietta de Rooij, Corina Vogt, Bryan Williams, Ward van der Houwen and of course Felipe Nascimento Martins. It was a pleasure to do the program and I have learned a lot.


3. Contents

4. Rationale (introduction)
4.1. Problem definition
4.2. Initial situation
4.3. Goals and outcomes (desired situation)
4.4. Outline of conceptual model
4.5. Ethical consideration
4.6. Thesis outline
5. Situational & theoretical analysis
5.1. Measurement system
5.2. 3d depth estimation with stereo cameras
5.3. Localization method
5.3.1. Older techniques
5.3.2. Neural network method
5.3.3. Conclusion
5.4. Localization In 3d using TensorFlow
5.4.1. Keras (TensorFlow)
5.5. Tuning neural networks
5.5.1. The layers
5.5.2. The compiler
5.5.3. The batch size
5.5.4. Experiments with tuning a Custom Deep Neural network
6. Research design & results
6.1. Data collection
6.1.1. How much data is needed and how to collect this data efficiently?
6.1.2. Design camera construction
6.1.3. How the data is captured
6.2. Data preparation
6.2.1. Resizing images
6.2.2. How the pictures are annotated
6.2.3. Data split and shuffle
6.2.4. Neural network initialization
6.3. Training & Tuning
6.3.1. Parameter review
6.3.2. Review of best performing experiments
6.4. Prediction
8. Recommendations
9. References
10. Attachments


4. Rationale (introduction)

Robots and cobots are increasingly core technology within laboratory automation. Robots enclosed in instruments are often engineered for a specific application and with a discrete mode of operation. The control comes from sensor feedback, and tolerances are an overall effort between Hardware (HW) and software (SW). Any complexity or flexibility increases the overall cost. Even the application of traditional machine vision is a leap in cost, both on the instrument side, because of requirements based on precise conditions, and in the amount of engineering effort needed in SW.

Technology based on Artificial Intelligence (AI) is arising; self-driving cars already use this method for object detection at runtime [81]. The availability of 3d scanners is also improving: a few years ago only expensive LIDAR laser scanners were available, but since the Kinect many more low-cost 3d sensors have become available.

This thesis researches and demonstrates the real-time localization of 3d objects (position and orientation) by means of applying AI. The output will be sufficient for a robot control algorithm to manipulate these objects in a poorly controlled environment. In this context the poorly controlled environment is the lab environment within BD Kiestra, which cannot be changed or manipulated. Understanding how to annotate the dataset and select/use the proper toolchain will reveal the technology and skill set required for deployment of real applications.

4.1. Problem definition

BD Kiestra uses all kinds of robots and cobots for lab automation. This includes the robotic arms which move Petri dishes between conveyor belts, and the incubator which stores Petri dishes in an organized way for incubation. These robots and cobots work in the lab but have no way to grasp or interact with objects without knowing beforehand where the object is. To be able to determine the position of an object without manually entering this position, the robots and cobots need some type of real-time object localization system.

Localizing objects in real time is not an easy task, especially because the objects cannot be changed. To be able to do this, some sort of 3d measuring system is needed. There are several 3d measuring options [2], including:

- Stereo vision [1]
- Time of Flight
- Structured light
- LIDAR

With these sensing methods, 3d location determination should be possible; however, they all have problems which make it difficult to locate objects in real time. This is especially true in this situation, because the localization method should potentially be able to detect transparent objects. In the project initialization phase (which is described in chapter 5.1) stereo vision was chosen. With stereo vision a 3d depth image can be made by matching two images [14]. The advantage is that, in contrast to the other methods, no color data is lost when making the depth image. However, stereo vision needs a lot of computational power to work in real time, it is difficult to get an accurate 3d position from the 2d images at longer ranges [7], and it is difficult to find stable features in transparent objects. This is also the reason why no general solution using stereo cameras has been found.


4.2. Initial situation

This research is done for BD Kiestra. BD Kiestra is part of Becton Dickinson (BD), a leading medical technology company that is active worldwide. BD Kiestra develops and produces high-quality innovative automated laboratory systems for medical bacteriological laboratories.

BD Kiestra has worked on the development of cobots within their lab environments. The cobots are used to work collaboratively with humans. The target is to use this research within BD Kiestra’s cobots; however, it does not rely on any existing cobot project. The research itself is new for BD Kiestra, which means that no work has been done beforehand and that there is little knowledge about the subject at BD Kiestra.

Because this knowledge is missing within BD Kiestra, some research is done on external projects which try to tackle a similar problem.

The Bactec bottle that needs to be detected in this research is made of glass and is partially transparent. As noted earlier, it is difficult to detect transparent objects, which is why current sensors and solutions have problems detecting them and why no general solution could be found [72].

In earlier attempts, approaches other than cameras were tried, because, as explained earlier, transparent objects are almost invisible to a normal camera [71]. One of these attempts uses LIDAR [72]. However, the lasers of the LIDAR sensors reflect several times before they hit any surface. This causes inaccurate data and makes it impossible to detect, localize and/or reconstruct transparent objects. In the end this all comes down to a lack of stable features to define the characteristics of the transparent object.

4.3. Goals and outcomes (desired situation)

To find a solution for the problem, the goal of this research was to answer the following research question:

How do different hyperparameters influence the neural network performance on a localization method where the location and orientation of a disposable (Bactec bottle) needs to be estimated in 3d, using stereo vision in changing lab environments without changing the disposable?

During the research a proof of concept is developed that is used to answer the research question. This proof of concept is a system with the following main target:

A system is designed, made and documented which shows how accurately the position and orientation of a moving object can be determined in real time using Artificial Intelligence.


To achieve the main goal the following tasks are performed:

- Determine how the object location can be measured (determine which measurement setup should be able to perform the localization task)

- Determine how the object location can be determined (how can the location be extracted from the measurement setup)

- Collecting the data which includes:

o Design and build the data collection setup

o Doing research about the amount of data needed

o Collecting the data in different environments with different backgrounds and different illumination

- Gaining knowledge about neural networks

o Research into how neural networks work and which types exist
o Research into how to tune all the neural network hyperparameters
o Research into how a custom network can be trained

- Research into existing 2d and 3d location methods

- Research into what the best tool is to implement the neural network with
- Research into what the best type of neural network is for the application
- Research into how the data can be annotated

- Training and tuning the neural network

- Test and evaluate the performance of the system

4.4. Outline of conceptual model

To achieve the goal and to answer the research question, a system is developed based on the conceptual model shown in Figure 1.


This model consists of three parts. The first part is the object of which the position and orientation need to be determined. The object that will be used to train the system is a medical Bactec bottle from BD Kiestra, which is visible in Figure 2.

Figure 2 The object that needs to be detected (Bactec bottle from BD Kiestra)

The bottle that has to be detected has several characteristics that can change relative to the camera: a changing orientation and a changing position, both in 3 dimensions. The changing position includes a back and forward motion, an up and down motion and a left and right motion. The changing orientation includes a roll motion, a pitch motion and a yaw motion. Because the position and orientation are relative to the camera, these motions can occur when the camera moves or when the Bactec bottle moves. Besides this, the background and illumination can also change.

The second part is the measurement setup: a stereo camera setup. By using two cameras, depth perception can be added to existing 2d localization techniques. This should make it possible to localize objects in 3d, as further explained in chapter 5.2 (3d depth estimation with stereo cameras). The cameras used are Logitech C925e cameras.

The third part of the model is the neural network model. A neural network can be seen as a cascaded chain of logistic regressions where the input of each layer is the output of the previous one. Each layer learns a concept based on the previous layer, which means that it does not have to learn the whole concept at once. The result is a chain of features that builds the knowledge. The neural network has a specific architecture, in which the number of layers, the activation of the layers and the size of the layers are defined. By experimenting with the architecture, the neural network is trained for the optimal results to be able to localize objects in 3d.

The neural network model is built in Keras using TensorFlow. Keras and TensorFlow are free open-source software libraries for machine learning and Artificial Intelligence (AI) applications based on neural networks. With TensorFlow, neural networks can be highly optimized to work on GPUs and TPUs (Tensor Processing Units) [32, 33, 34]. The neural network and all the rest of the processing are implemented on a Jetson TX2, an NVIDIA board that is able to run complex AI applications [82].

The choices behind this design are further explained in chapter 5.4 (Localization In 3d using TensorFlow).


4.5. Ethical consideration

There are several ethical considerations within this project which need attention when using stereo vision. First of all, in the technological field: in the project, cameras are used to localize the Bactec bottles. These Bactec bottles could potentially hold body material, which is private information. Therefore secure data storage is required, or no storage at all. Secondly, by being able to localize the Bactec bottles or other medical disposables, the interaction between robot and disposables could become more efficient, which could result in a speed improvement in the overall diagnosis time. This could positively impact public health.

To investigate people’s point of view on this project, the following question was asked: “Would you mind if a bottle containing your body material was photographed in a laboratory with the aim of improving public health?” This question was asked to 7 people of different ages, ranging from 22 to 59. Their answers were:

- Three people said: “No I don’t mind. As long as it is increasing public health it is okay.”
- One person said: “No I don’t mind. As long as the law on privacy is enforced to make sure the data stays private.”
- One person said: “Yes I do mind. I don’t want to have any of my private data saved because I have no clue what is done with the data.” After a follow-up question where the person was asked if he would still mind if it is for his own health, the person answered: “If it is for my own health it is okay.”
- One person said: “I don’t mind as long as the data is anonymized or not stored.”
- My own opinion is that I also don’t mind as long as it benefits the public health.

Among these opinions there was no one who really minded, as long as it is for their own health. At the beginning of the project it was decided that, because there are still groups of people who express concern about how their data is handled [96], the data will not be stored over time. This potentially sacrifices improvement over time but protects people’s privacy.

4.6. Thesis outline

To explain the subject the following thesis outline is used:

Figure 3 Thesis outline

The thesis starts with a situational and theoretical analysis, where some preliminary research is done on how the Bactec bottle can be measured and how the localization method should work. After the situational and theoretical analysis, the research design is explained. In this research design the data collection, data preparation, training, tuning and prediction steps of the neural network method are further described. In the tuning chapter, experiments are done to optimize the method for the most optimal performance. In the last chapter, the results of the research design are discussed. This includes a full evaluation of the results and an overview of the influence of the hyperparameters on the localization task (neural network). In the end the conclusion and recommendations are noted down.


5. Situational & theoretical analysis

In this chapter, all preliminary research is described. To be able to localize an object, first a measuring method is needed. This measuring method should be able to localize objects in 3d and potentially be able to localize transparent objects. The solution chosen is stereo vision. This measuring method and the reasons for choosing it are described in chapter 5.1.

While choosing the measuring method, some research is done regarding current 2d localization methods using camera vision. This answers the question which current 2d method could be an inspiration when designing 3d object localization. Because the final focus is on transparent object detection (where it is difficult to find stable features), the focus is on machine learning methods which should be able to detect the anomalies in transparent object features when found. It is decided that neural networks are expected to be the solution for localizing in 3d. This research can be found in chapter 5.3.

In chapter 5.2 the neural network method and the measurement method are further researched for their usability in localizing in 3d. This answers the question if and how neural networks could be the solution for localizing objects in 3d. Furthermore, it is decided to use Keras with TensorFlow to implement the neural networks. This can be found in chapter 5.4.

5.1. Measurement system

To be able to determine the position of an object in 3d, a method is needed to detect an object. Localizing objects in real time is not an easy task. The measuring option that is chosen is stereo vision. To make this decision, several options were researched. All options had difficulties to overcome when using them for object localization. Common options were [2]:

- Stereo vision [1]
- Time of Flight
- Structured light
- LIDAR

The first option, stereo vision, needs a lot of computational power to work in real time, because, in contrast to other methods, two images need to be processed and matched instead of a single image [2]. With stereo vision a 3d depth image can be made by matching the two images without losing the color information. Besides this, it is difficult to get an accurate 3d position [7] using 2d images, especially at longer ranges. It is also difficult to find stable features in transparent objects. This is also the reason why there is no general solution using stereo cameras for transparent object localization in 3d. In theory, it should be possible to detect glass with a camera, because measurable light (with a wavelength from 350 nm to 1050 nm) [6] is partly absorbed by the glass [4].

The second option is Time of Flight (ToF). ToF works with Near Infra-Red (NIR) light. It measures the time it takes for an emitted light source to return to the lens: the longer it takes, the further away an object (or point) is [9]. ToF cameras only capture depth information and can produce noisy data if the surface reflects poorly [2]. Besides this, it has the same problems with transparent objects as a normal camera, because the emitted light is NIR.

The third option is structured light. Structured light consists of a projector and a scanner. The projector projects a light pattern on the object. Based on the pattern, which is captured by the scanner, the depth map can be made [11]. With structured light a high-resolution 3d data image can be captured at relatively low cost. However, structured light has a low acquisition rate, is sensitive to illumination changes and is limited to static scenes [2]. This could make it difficult to detect objects. Also, it projects light with a wavelength between 400-700 nm [10], which makes it difficult to detect glass for the same reason as with a camera [4].


The last option is LIDAR, which works with laser beams. Like a ToF camera it works with the time-of-flight principle; the emitter in this case is a laser beam [13]. The beams are usually partly reflected and refracted several times before any surface is hit, which can lead to missing or false 3d information [3, 5]. Besides this problem, LIDAR is also computationally demanding because of the high dimensionality of point clouds [8]. Most LIDAR systems work with a wavelength of 905 or 1550 nm [12], which gives it the same issues with detecting transparent objects as stereo vision [4].

All options have their own difficulties and all have the same problem with detecting transparent objects. Stereo vision is chosen because it is the only option that can obtain more information besides depth information. The aim is to overcome the problems that are faced when using stereo vision.

5.2. 3d depth estimation with stereo cameras

In theory it is possible to determine the 3d position of an object with a stereo camera setup. This is because stereo cameras can, in contrast to single cameras, add depth perception, which provides the third dimension of a detected object. This depth perception can be calculated in the following way (Figure 4).

Figure 4 3d depth estimation with stereo cameras

If the target is to calculate the distance E, which is the distance to a specific point in the image, the calculations are the following.

Several things need to be given:

- $v$ = field of view
- $f$ = focal distance of the camera
- $G$ = distance between the cameras
- $B$ in pixels

First $A$ needs to be calculated with the following equation:

$$A = 2 f \tan\!\left(\frac{v}{2}\right) \tag{1}$$

When $A$ is known it can be used to calculate the distance covered by one pixel ($p$), which is done with the following equation:

$$p = \frac{A}{\mathit{AmountOfPixelsInImageAxis}} \tag{2}$$

When $p$ is also known, $B$ can be calculated in a metric unit. This is done by:

$$B = B_{\mathit{pixels}} \cdot p \tag{3}$$

After $B$ is calculated the target is to calculate $c$, because it can reveal the angle of $d$. This is done with the following formula:

$$c = \tan^{-1}\!\left(\frac{B}{f}\right) \tag{4}$$

This $c$ gives the angle of $d$ by:

$$d = 90^{\circ} - c \tag{5}$$

When $d$ is calculated, the same steps need to be done for the other side to calculate $h$. Knowing $h$, the final step can be performed, which reveals the distance $E$ to the point in the image. This can be calculated with the following calculation [35]:

$$E = G \cdot \frac{\sin(h)\,\sin(d)}{\sin(h + d)} \tag{6}$$

When a depth image is made by calculating the distances to all the points in an image, the orientation of objects can also be determined, based on the object that is detected at that specific place.
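To make this derivation concrete, the following minimal Python sketch applies equations 1-6, assuming hypothetical camera parameters and pixel offsets that are measured inward from each image centre:

import math

def stereo_depth(v_fov, f, G, b_left_px, b_right_px, image_width_px):
    """Estimate the distance E to a point seen by both cameras (eq. 1-6)."""
    A = 2 * f * math.tan(v_fov / 2)           # eq. 1: metric width of the image plane
    p = A / image_width_px                    # eq. 2: metric size of one pixel
    B_left = b_left_px * p                    # eq. 3: pixel offsets in metric units
    B_right = b_right_px * p
    d = math.pi / 2 - math.atan(B_left / f)   # eq. 4 and 5: ray angles measured
    h = math.pi / 2 - math.atan(B_right / f)  # from the baseline, one per camera
    return G * (math.sin(h) * math.sin(d)) / math.sin(h + d)  # eq. 6

# Hypothetical values: 78 degree field of view, 4 mm focal distance,
# cameras 10 cm apart, 1920 px wide images, offsets of 40 and 35 px.
E = stereo_depth(math.radians(78), 0.004, 0.10, 40, 35, 1920)
print(f"Estimated distance: {E:.2f} m")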

5.3. Localization method

The chosen measuring method is stereo vision. To be able to localize in 3d, first the 2d localization methods that use camera vision are researched.

5.3.1. Older techniques

During the last decade, object detection/localization methods could be categorized into two main categories [15]:

- Extracting local image features like SIFT and HOG.

- Fitting rich object models such as deformable parts models to detect the presence of objects [20, 21].

The first method, extracting local image features like SIFT and HOG, can construct Bag Of Visual Words (BOVW) representations and run statistical classifiers [16, 17, 18, 19], as further explained in the chapters “SIFT” and “HOG”. These methods have shown good performance in image classification, but locating objects has been unfruitful, because the classifier relies on visual words that barely describe the context of the object and fall in the background.


The second method is fitting rich object models, such as deformable parts models, to detect the presence of objects [20, 21] (further explained in “Fitting rich object models such as deformable parts models to detect the presence of objects”). This method can reveal the location, constellation and pose of the detected objects. However, most of the time the model is trained from images where the locations or even the constituent parts of objects are known.

Both methods have their advantages and disadvantages. It is also possible to combine these methods, which has been done with success [22]. Further detail about the methods is given in the following sub chapters.

SIFT

SIFT (Scale-Invariant Feature Transform) is a feature detection algorithm that detects and describes local image features [36]. SIFT searches for potential key points. To find potential key points, scale-space filtering is used, in which the DoG (Difference of Gaussians) is used. This is an approximation of the LoG (Laplacian of Gaussian), which acts as a blob detector that uses the change in scale to detect blobs of various sizes.

When the DoG images are found, they are searched for local extrema over space and scale. This is done by comparing each pixel with its 8 neighbors, as well as the 9 pixels in the next and previous scales. All local extrema are potential key points.

When the potential key point locations are found, they are refined to get accurate results. This is done using a Taylor series expansion of the scale space, a process where every extremum with a value lower than a threshold value gets rejected. After this, the edges are removed via a concept similar to Harris corner detection, which uses a 2x2 Hessian matrix to compute the principal curvature. What remains are the strong interest points.

When the strong interest points are determined, the target is to assign an orientation to all the key points. This is done by taking the neighborhood around the key point location, whose size depends on the scale, and calculating the gradient magnitude and direction for that area. This yields key points with the same scale and location but different directions.

The fourth step is to create the key point descriptor. This is done by taking a 16x16 neighborhood around the key point, dividing it into sub-blocks of 4x4 and creating an orientation histogram for every sub-block. The result is 128 bin values which are represented as a vector to form the key point descriptor.

The last step is key point matching. This is done between two images by identifying their nearest neighbors, or by taking the ratio of the closest distance to the second-closest distance.
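As an illustration of these steps, the following OpenCV sketch detects SIFT key points in two images and matches them with the ratio test described above; the file names are placeholders for two real views of an object:

import cv2

# Load the two images to match (hypothetical file names)
img1 = cv2.imread("bottle_view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("bottle_view2.png", cv2.IMREAD_GRAYSCALE)

# Detect key points and compute the 128-bin SIFT descriptors
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Keep a match only when the closest neighbour is clearly better than
# the second closest (the closest/second-closest ratio test)
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} stable matches found")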

HOG

HOG (Histogram of Oriented Gradients) is a feature descriptor that can be used for object localization. In this case a feature descriptor is an image representation or patch that extracts the useful information and removes the extraneous information, for the purpose of simplifying the image [83]. The HOG feature descriptor extracts this useful information by counting the occurrences of gradient orientations in localized portions of an image [84]. It does this for every window of a sliding window. To compute the HOG for every window, the source image is first divided into blocks. Each block is divided into cells (small regions). The blocks usually overlap each other, which means that the same cell can be in several blocks. For each cell pixel the horizontal and vertical gradients are obtained, after which the magnitude and phase of the gradient are determined.


When the gradients are determined, the HOG is created for every cell. Normalization is applied to these cells to remove contrast effects. This is used to make a descriptor for every window, which consists of all the cell histograms for each block in the window [85].

To make the HOG feature descriptor work for object localization, it is often combined with a linear SVM (Support Vector Machine) to train a robust object detector. In this case the HOG feature descriptor produces the HOG features of an image, on which the SVM classifies what object it is and where the object is located.

To make object localization with SVM and HOG robust, the following steps are often performed:

- Sample positive samples from a training set that contains the objects that need to be detected. After this, extract HOG descriptors from these samples.
- Sample negative samples from a negative training set that does not contain any object that needs to be detected, and extract the HOG descriptors for these samples as well.
- Train an SVM on the positive and negative samples.
- Apply a sliding window technique and slide the window across all images. In each window, compute the HOG descriptors and apply the SVM classifier, which should give a correct classification. If it gives a false-positive classification, record the feature vector associated with the false patch along with the probability of classification. This is called hard-negative mining.
- The false-positive samples can be sorted by confidence and used to retrain the classifier with these hard-negative samples.
- Now the classifier is trained and can be used to classify the dataset. To use the classifier, the sliding window technique is applied again: every window is classified, which results in object detection. To remove redundant and overlapping boxes, non-maximum suppression can be used. This is optional but improves the result [27].

HOG has shown to be a good method for object localization in 2d. It has the advantage of being easy and fast to train when used together with an SVM, compared to modern neural networks [28]. On the other hand, the disadvantage is that HOG is very sensitive to image rotation [29]. This could give problems when using the algorithm in 3d on a potentially moving camera setup.
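The following sketch illustrates the HOG-plus-SVM pipeline with scikit-image and scikit-learn; the random windows and hand-made labels are placeholders for a real training set of positive and negative patches:

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(image):
    # 9-bin orientation histograms over 8x8 cells, normalized per 2x2 block
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# Placeholder training data: fixed-size grayscale windows, label 1 for
# patches containing the object and 0 for background patches.
windows = np.random.rand(20, 64, 64)
labels = np.array([1] * 10 + [0] * 10)

X = np.array([hog_features(w) for w in windows])
clf = LinearSVC().fit(X, labels)

# Classify a new sliding-window patch
patch = np.random.rand(64, 64)
print(clf.predict([hog_features(patch)]))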

Fitting rich object models such as deformable parts models to detect the presence of objects

Deformable Part Model (DPM) detection was one of the most popular object detection methods. This method relies heavily on deformable parts for object localization [20]. DPMs were originally proposed for the Pascal VOC challenge and are good at handling large appearance changes in challenging datasets. The disadvantage, however, is that it takes more than 10 seconds per image on Pascal VOC (without parallelization, on a 2.8 GHz 8-core Intel Xeon Mac Pro running Mac OS X 10.5) [30, 86]. This makes speed the bottleneck of DPM in a real-time application.

In DPMs, the detection score of each hypothesis is determined by the appearance score minus the deformation cost. This appearance score is calculated by the correlation between the HOG features and a sequence of filters, including parts and root. This takes most of the time due to the high dimensionality [30].

5.3.2. Neural network method

At the moment neural networks, especially deep convolutional neural networks, show the best performance on image classification tasks [23]. This is a method using artificial intelligence (AI) [31]. Recent image/object classification and localization applications using these convolutional neural networks show extremely successful results [24, 25, 26]. This is the “new” solution for image object localization, mainly in 2d. With this method, successive feature vectors are constructed that progressively describe the properties of larger image areas.

Neural networks are examples of non-linear hypotheses. They scale better than logistic regression for a large number of features and can be trained to classify much more complex relations. Neural networks are formed, as the name indicates, by artificial neurons which are organized in layers. There are three types of layers:

- Input layer
- Hidden layers
- Output layer

Neural networks are classified by their number of hidden layers and how these layers connect. They can have multiple hidden layers. When a neural network has more than 2 hidden layers, it can be considered a deep neural network. Deep neural networks have the advantage that more complex patterns can be recognized [31].

In neural networks the output of a full layer can be calculated as a matrix multiplication followed by an element-wise activation function. This has advantages when using Tensor Cores, Matlab, Numpy, or when implemented on other hardware. The process of calculating the output of each layer, from the input to the output layer, is called forward propagation. Further information about the working of neural networks is given in the following sub chapter.

Why are neural networks better than logistic regression? To see the difference, a neural network can be seen as a cascaded chain of logistic regressions where the input of each layer depends on the output of the previous one. Each layer learns a concept based on the previous layer, which is better because the layers do not need to learn the whole concept at once. A chain of features is built that builds the knowledge of the neural network [31].

Neural networks in depth

In a neural network, a neuron is a thing that holds a value. These neurons are connected to each other with connections, and each connection has a weight, as in a perceptron [37, 38, 39].

All neurons are organized in layers. These layers interact with each other: the activation in one layer determines the activation in the next layer. In each layer, successive feature vectors are constructed that progressively describe larger image areas. An example of a simple neural network is visible in Figure 5.

Figure 5 Example of a basic neural network [87]

With a neural network two actions that make the network self-learning can be performed:

- Forward propagation
- Backwards propagation

Forward propagation

Forward propagation is the estimation action where, based on the input, features are selected and extracted, which results in an estimation. As noted before, this is done in several layers. Each layer can be calculated relatively simply via a matrix calculation, which looks like the following:

$$\sigma\left( \begin{bmatrix} w_{0,0} & w_{0,1} & \cdots & w_{0,n} \\ w_{1,0} & w_{1,1} & \cdots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k,0} & w_{k,1} & \cdots & w_{k,n} \end{bmatrix} \begin{bmatrix} a_0^{(0)} \\ a_1^{(0)} \\ \vdots \\ a_n^{(0)} \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} \right) \tag{7}$$

In this calculation, the $w_{k,n}$ are the weighted connections between the layers, as visible in Figure 5. The $a_n^{(0)}$ are the current values of the neurons. The vector with the $b_n$ is the bias vector. The $\sigma$ is the activation function, which is further explained in chapter 5.5.1.

Because this is solvable as a matrix calculation, it can be implemented and optimized on GPUs. There are several tools that are able to do this, for example Tensor Cores, Cuda, Matlab or Numpy.
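As a minimal illustration, the following NumPy sketch applies equation 7 once per layer for a small, randomly initialized network; the layer sizes are arbitrary:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a, weights, biases):
    """Forward propagation: apply equation 7 once per layer."""
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)  # sigma(W a + b)
    return a

# Hypothetical network: 4 inputs, one hidden layer of 3 neurons, 2 outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(2)]
print(forward(rng.standard_normal(4), weights, biases))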

An example of a neural network performing forward propagation can be seen in Figure 6. This is a neural network with 4 layers that recognizes a handwritten digit.

Figure 6 Detection of a handwritten digit using a neural network [37]

The digit 9 needs to be recognized. In an ideal situation the network could be trained to detect straights and corners in the first layer. This results in the segments visible in Figure 7.

Figure 7 Dividing the handwritten digit in segments [37]

Next, a layer which combines these segments performs its action. It makes long straights and curves from the earlier segments (see Figure 8).

Figure 8 Dividing the handwritten digit in bigger segments [37]

Based on this result the output layer can estimate the result and say which number it is.

Most of the time neural networks are not as simple as in this example, because they use features which are not easily understandable. However, the interaction between layers is similar.


Backwards propagation

Most neural networks are a form of AI. However, by only estimating something, which is done with forward propagation, the network is not yet intelligent or self-learning. This skill is added by backwards propagation. In the backwards propagation step, the neural network compares the estimated result with the expected result. Comparing these two values gives the error, which is used to calculate the cost function. The cost function is the squared difference between the expected and estimated values. When the cost function is determined, the negative gradient of this function is calculated. This shows how each weight in each layer needs to be changed to decrease the cost. It can also be seen as the partial derivative of the cost function with respect to all neural network values.
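A minimal NumPy sketch of one backwards propagation step for a single sigmoid layer with a squared-error cost could look as follows; all values are random placeholders:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One layer with a squared-error cost, as a minimal illustration
rng = np.random.default_rng(1)
W, b = rng.standard_normal((2, 3)), rng.standard_normal(2)
x, target = rng.standard_normal(3), np.array([0.0, 1.0])

z = W @ x + b
estimate = sigmoid(z)

# Cost: squared difference between estimated and expected values
cost = np.sum((estimate - target) ** 2)
print("cost before update:", cost)

# Partial derivatives of the cost with respect to W and b
delta = 2 * (estimate - target) * estimate * (1 - estimate)  # dC/dz
grad_W, grad_b = np.outer(delta, x), delta

# Step the weights against the gradient to decrease the cost
lr = 0.1
W -= lr * grad_W
b -= lr * grad_b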

Experiment

To test and experiment with the capabilities of neural networks, a program is made that uses a pre-built neural network to determine the position of an object in an image. This is a 2d object localization method. The model used is the ‘ssd_mobilenet_v1_coco’ model from [40]. This model is trained on the COCO dataset, a large-scale object detection, segmentation and captioning dataset. It consists of 330k images, of which around 200k are labelled, with 80 object categories and 1.5 million object instances [41, 88].

The program performs the following steps:

- Initializing
  o Downloading the pre-trained model
  o Build a frozen graph
  o Optimize the model for TensorRT
  o Create a session and load the graph
- Running
  o Capture an image
  o Run the network on the image using the tf_sess.run function
  o Display the results by drawing the boxes in which an object is detected on top of the image

The program was able to determine the position of multiple objects in 2d on the Jetson TX2 using the Logitech C925e camera. It used the TensorFlow library to optimize the algorithm for the GPU in the Jetson. The object localization worked in real time and is visible in Figure 9.
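A sketch of the running part could look as follows, assuming a TensorFlow 1.x environment and the standard tensor names of the TensorFlow object detection API; the graph file name is a placeholder:

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, as used on the Jetson TX2

# Load the frozen graph (hypothetical path to the optimized model)
graph_def = tf.GraphDef()
with tf.gfile.GFile("ssd_mobilenet_v1_coco_trt.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=graph) as tf_sess:
    # image: a single captured camera frame as a uint8 HxWx3 array
    image = np.zeros((300, 300, 3), dtype=np.uint8)
    boxes, scores, classes = tf_sess.run(
        ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
        feed_dict={"image_tensor:0": image[np.newaxis, ...]})
    # Each box is [ymin, xmin, ymax, xmax] in normalized coordinates
    print(boxes[0, scores[0] > 0.5], classes[0, scores[0] > 0.5])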

5.3.3. Conclusion

Several methods have been used for 2d image classification and object detection. Some of these methods should be able to work in 3d. It seems that a neural network is better than the older logistic regression methods, because a neural network is able to learn a total concept in several layers: a chain of features builds the knowledge. This results in a method that scales better for a large number of features and that can be trained to classify much more complex relations than logistic regression methods.

Therefore, neural networks are chosen to build the 3d object localization algorithm. To research how this is possible, neural networks are studied further in the next chapter.

5.4. Localization In 3d using TensorFlow

To determine the position and orientation of an object in 3d, a method for implementing the neural network is chosen. For implementing and making a 3d object localization algorithm, Keras with TensorFlow is used. This is a library for making machine learning and AI applications based on neural networks. With the depth perception added by the stereo cameras (chapter 5.2), the neural network should be able to determine the 3d position and orientation.

5.4.1. Keras (TensorFlow)

TensorFlow is an open-source software library for making machine learning and AI applications based on neural networks. This free library can be used for research or to develop products and is highly optimized to work on GPUs and TPUs (Tensor Processing Units). It is developed and maintained by Google [32].

TensorFlow implements neural networks in a distinctive way: a layer is an operation, and these operations are combined in a so-called data flow graph. An example of a data flow graph is visible in Figure 10.


TensorFlow is chosen to implement the neural networks because it is free, has a relatively large community, has been used within BD Kiestra before, can be highly customized (supports complex workflows) and has hardware acceleration support. The other options compared were [33, 34]:

- pyTorch
- CNTK
- Apache MXNet
- Theano
- Torch
- Infer.NET

These toolkits were more complex to learn, not easily accessible, or had a smaller community than TensorFlow.

To work with TensorFlow, Keras is used. Keras is the officially supported API that is now built into TensorFlow, which makes it faster to implement, train and test neural networks in TensorFlow.

Besides this, the NVIDIA Jetson TX2 is used, because it packs a lot of computational power (a GPU and a CPU) in a small package intended for AI and machine learning. This board has excellent support for TensorFlow with Tensor Cores.
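As an illustration of this toolchain, the following sketch builds a small Keras regression network along the lines of the findings in the summary (Adamax, mean squared error, two convolutional layers without dimensional reduction in between); the input shape and layer sizes are hypothetical:

from tensorflow import keras
from tensorflow.keras import layers

# Stereo images stacked to 6 channels in, 6 pose values out
# (x, y, z position and roll, pitch, yaw orientation).
model = keras.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(120, 160, 6)),
    layers.Conv2D(16, 3, activation="relu"),  # no pooling between the
    layers.MaxPooling2D(),                    # first two conv layers
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(6, activation="linear"),     # position + orientation
])
model.compile(optimizer=keras.optimizers.Adamax(learning_rate=0.05),
              loss="mean_squared_error")
model.summary()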

5.5. Tuning neural networks

Tuning a neural network is important for the optimal performance of the network. To do this properly, some knowledge is needed about the adjustable parameters. When making a neural network, the following parameters can be adjusted [89]:

- The layers
  o Amount and size of layers
  o Type of the layers
  o Activation function of each layer
- The compiler
  o Type of optimizer
  o Optimizer-specific parameters like momentum and learning rate
  o Type of loss function
- The batch size

When tuned correctly, the neural network works optimally.

5.5.1. The layers

Amount and size of layers

The amount and size of the layers need to be specified beforehand. The best way to determine the number of layers is via systematic experimentation with a robust test harness.

When using a single-layer neural network, the network can only represent linearly separable functions. Because of this, it can only be used for very simple problems where, for example, only two classes need to be separated and they can be neatly separated by a line [45]. When only one layer is used, the advantage is that the neural network can be trained a lot faster than when using multiple layers. When using a multilayer network, convex regions can be represented. With this, it can learn to draw shapes around examples in some high-dimensional space. A multilayer network can overcome the limitation of linear separability.


The difference between a single-layer and a multilayer perceptron is now known. However, this gives no idea of how many nodes to use in a single layer to learn their weights efficiently. To answer the real question of how many layers and nodes are needed, several approaches can be used.

The first approach is experimentation. There is no analytic way to calculate the number of nodes and layers; these two parameters are hyperparameters which need to be specified beforehand. With experimentation, different configurations are simply tried until the correct result is reached.

The second approach is intuition. This approach requires more knowledge about how to globally solve the problem, as well as experience with similar problems, so that the number of layers and nodes can be estimated.

The third approach is to go for depth: deep networks perform better because they can understand more complex relationships. This argument suggests that many layers are the solution. However, using many layers also means that the training takes longer and is more complex.

The fourth approach is to borrow ideas, meaning that the number of layers is defined based on literature with similar problems, and a similar setup is then used.

The last approach is to do an automated search that tests different network configurations (a minimal sketch follows the list). The search can be seeded with ideas from literature and intuition. Some popular search strategies include:

- Random (random configurations of layers and nodes)
- Grid (systematic search across the number of layers and nodes)
- Heuristic (a direct search across configurations, like a Bayesian optimization or genetic algorithm)
- Exhaustive (try all combinations of layers and numbers of nodes)
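A minimal sketch of such an automated grid search could look as follows; train_and_score is a placeholder that would normally build, train and evaluate a network for each configuration:

import itertools
import random

def train_and_score(num_layers, num_nodes):
    """Placeholder for: build a network with this configuration, train it
    on the dataset, and return its validation loss. A random score stands
    in here so the sketch runs standalone."""
    return random.random()

# Grid search: systematically try every combination of layers and nodes
configs = itertools.product([1, 2, 3], [16, 32, 64, 128])
results = {cfg: train_and_score(*cfg) for cfg in configs}
best = min(results, key=results.get)
print(f"Best configuration: {best[0]} hidden layers of {best[1]} nodes")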

When working with larger networks or datasets this can be challenging. There are some options to reduce the amount of time:

- Parallelize the process, so multiple options are tried at once
- Bound the size of the search
- Fit on smaller subsets of the original training set of data [45]

Type of the layers

Each of the layers in a neural network can have its own type. In Keras there are a lot of different layers, which can be categorized in [54]:

- Core layers
- Convolutional layers
- Pooling layers
- Locally-connected layers
- Recurrent layers
- Remaining layers

Within these categories there is a high variety of layer types. Most of them are neural network layers that learn, but some are layers that do not learn. The latter are called computational or editing layers.


Core layers

The Keras core layers include the following layer types:

- Dense
- Activation
- Dropout
- Flatten
- Input
- Reshape
- Permute
- RepeatVector
- Lambda
- ActivityRegularization
- Masking
- SpatialDropout

These layers are the simplest types of layers in a neural network. Most of them are just calculations to organize the data, for example the Flatten layer, which flattens the data to a single dimension. They can be seen as calculation/editing layers.

One of the core layers is not of this type: the dense layer. The dense layer is a regular densely-connected neural network layer. It implements output = activation(dot(input, kernel) + bias) [55].

Convolutional Layers

The convolutional layers include:

- Normal Conv
- SeparableConv
- DepthwiseConv2D
- ConvTranspose
- Cropping
- UpSampling
- ZeroPadding

Among the convolutional layers there are also some calculation or editing layers. These include the Cropping, UpSampling and ZeroPadding layers.

The normal convolutional layers are the 1d, 2d and 3d convolutional layers. These layers create a convolution kernel that is convolved with the layer input over a temporal or spatial dimension. The SeparableConv combines a depth-wise spatial convolution with a pointwise convolution, and the ConvTranspose performs a reversed (transposed) convolution [56, 57].

The convolutional layers are the most commonly used layers for object localization, because they show the best performance in these tasks [69].


Pooling layers

The pooling layers include [58]:

- MaxPooling
- AveragePooling
- GlobalMaxPooling
- GlobalAveragePooling

A pooling layer reduces the number of dimensions to make the calculations less computationally demanding [59]. The type of pooling determines in what way the dimensions are reduced. This is more a computational layer type that can be used in between other neural network layers.
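For example, a 2x2 max pooling layer halves both spatial dimensions of a feature map, as the following small Keras sketch shows (the feature map here is random placeholder data):

import numpy as np
from tensorflow.keras import layers

# A 2x2 max pooling layer keeps only the strongest activation in each
# 2x2 region, halving the height and width of the feature map.
feature_map = np.random.rand(1, 8, 8, 16).astype("float32")  # batch, h, w, channels
pooled = layers.MaxPooling2D(pool_size=(2, 2))(feature_map)
print(feature_map.shape, "->", tuple(pooled.shape))  # (1, 8, 8, 16) -> (1, 4, 4, 16)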

Locally-connected layers

The locally connected layers include [60]:

- LocallyConnected1D
- LocallyConnected2D

The locally connected layers are similar to the normal convolutional layers. The only difference is that the weights are unshared: a different set of filters is applied at each patch of the input.

Recurrent layers

The recurrent layers include [61]:

- RNN
- SimpleRNN
- GRU
- LSTM
- ConvLSTM2D
- ConvLSTM2DCell
- SimpleRNNCell
- GRUCell
- LSTMCell
- CuDNNGRU
- CuDNNLSTM

The recurrent layers are the type of layers where a directed graph along a temporal sequence is formed by the connections between nodes. This makes temporally dynamic behavior possible: the patterns can change over time. These types of layers are often used as forecasting layers [62].

Remaining layers

The layer types that cannot be categorized are:

- Embedding layers [63]
- Normalization layers [64]
- Noise layers [65]

The embedding layer computes a vector for an index. Embedding layers can reduce the number of dimensions of specific inputs. An example can be seen in Figure 11.

Figure 11 Example of using an embedding layer [67]

This layer is useful to separate different types of features that can be used to draw conclusions; it can bring structure to items that don’t have a structure [67]. The second type, the normalization layer, can normalize the data to fit all the input data points/features on the same scale. This is useful for features that vary a lot. The normalization layer is a calculation/editing layer [68].

Noise layers can be used to mitigate overfitting [65]. They do this by adding random data augmentation and can also be seen as calculation/editing layers.

Activation function of layer

Activation functions are applied to the data. They introduce nonlinear properties (nonlinearities) to the network; without them, only linear functions could be calculated with the network, which limits the network’s ability. All activation types are listed below [46, 47]:

- Softmax*
- ELU*
- selu
- softplus
- softsign
- ReLU*
- tanh
- sigmoid
- hard_sigmoid
- exponential
- linear
- LeakyReLU**
- PReLU**
- ThresholdedReLU**

* Advanced options are also available in Keras
** Only the advanced option is available

The advanced options have more complex settings that can be changed. The most popular activation functions of the list are the sigmoid, tanh and ReLU.


Sigmoid

The sigmoid has the mathematical notation [49]:

$$f(x) = \frac{L}{1 + e^{-k(x - x_0)}} \tag{8}$$

where $L$ is the curve’s maximum value (almost always equal to 1), $k$ defines the steepness of the curve and $x_0$ is the value of the sigmoid’s midpoint. In neural networks the sigmoid often has an $L$ of 1 and an $x_0$ of 0, which results in the following equation [48, 50]:

$$f(x) = \frac{1}{1 + e^{-kx}} \tag{9}$$

The sigmoid is one of the first activation functions that was used, because it can be interpreted as the firing rate of a neuron, where 0 means no firing and 1 a fully saturated firing.

While the sigmoid adds nonlinearities to the network, it has two problems [48]. The first problem is that it causes the gradients to vanish. When a neuron’s activation saturates to either 0 or 1, the gradient in this region is close to 0. During the backwards propagation step this local gradient is multiplied by the gradient of the gate output for the whole objective; if the local gradient is small, the gradient vanishes and almost no signal will flow through the neuron’s weights.

The second problem is that its output is not zero-centered: the output of the function lies between 0 and 1. This results in only positive values after the function, which can mean that the gradients swing too far in different directions. This makes optimization harder.

Tanh

The sigmoid is improved by the tanh (Hyperbolic Tangent Function). Its mathematical notation is:

$$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1} \tag{10}$$

With tanh the output is between -1 and 1, which means that its output is zero-centered. Although this makes the optimization easier, it is still no solution for the vanishing gradients.

ReLU

ReLU (Rectified Linear Unit) has become popular recently. Its mathematical notation is the following:

$$R(x) = \max(0, x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases} \tag{11}$$

This means that the minimum value is 0 and the output increases linearly with the input [9]. With this, the network learns faster and the vanishing problem is removed. Although ReLU is used in almost all networks nowadays, it is only used for the hidden layers: for the output layer a softmax function can be used, because it gives the probability for different classes, and for the input layer a linear function can be used.

The only problem with ReLU is that some units can be fragile during training: a big gradient flowing through a neuron could cause a weight update which makes the neuron never activate on any data point again, so all gradients flowing through it will always be 0.


Leaky ReLU

This fragile property is solved by the Leaky ReLU. Leaky ReLU can be represented by:

$$LR(x) = \begin{cases} ax & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases} \tag{12}$$

In this equation, $a$ is a constant which makes the negative slope less steep than the positive slope. This results in the characteristic visible in the following graph:

Figure 12 Leaky ReLU and Parametric ReLU displayed [51]

Besides Leaky ReLU there are several other options which tackle the problem of ReLU in a similar way (a minimal Keras sketch of how these activations are configured follows below):

- Parametric ReLU
- Concatenated ReLU (CReLU)
- ELU
- SELU
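As an illustration, the following hedged Keras sketch shows how the simple activations are passed by name, while advanced variants such as LeakyReLU are added as separate layers so their extra parameters can be set; the layer sizes and input shape are arbitrary and only serve as an example:

```python
from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

model = Sequential()
# Simple activations are passed to a layer by name...
model.add(Dense(64, activation='relu', input_shape=(100,)))
model.add(Dense(64, activation='tanh'))
# ...while the advanced variants are added as a separate layer,
# so their extra parameters can be configured.
model.add(Dense(64))
model.add(LeakyReLU(alpha=0.1))          # alpha is the negative-slope constant a
model.add(Dense(10, activation='softmax'))
```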

5.5.2. The compiler

With the compiler several parameters can be tuned:

- Type of optimizer
- Optimizer specific parameters like momentum and learning rate
- Type of loss function

These parameters will be further explained in this chapter.

Type of optimizer

The optimizer enables the neural network to learn from data. Almost all optimizers are based on gradient descent, one of the most popular optimization methods. Gradient descent does not immediately guess the best solution for a given objective. Instead, it starts from an initial estimate and steps in a direction closer to a better solution whenever that estimate improves on its current state. It repeats this step for every training item.

This is done by computing the gradient of the loss function with respect to the parameters over the entire training set for a given number of epochs. One epoch is one pass over all the training data to train the neural network [90]. Because the gradient is computed over the full dataset, this is relatively slow and intractable for


datasets that do not fit in the GPU or RAM memory. To get around this, stochastic gradient descent can be used [40].

All optimizers also have their own parameters. The parameters they share are further explained in appendix 10.3.

Stochastic gradient descent

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It is called stochastic because the method uses randomly selected (or shuffled) samples to evaluate the gradients.

With stochastic gradient descent a parameter update is performed for each training sample and label, which causes the updates to have a higher variance and the results to fluctuate more intensely. This can be a good thing for finding new and better local minima.

Figure 13 shows why it is important that multiple local minima can be found: a local minimum does not have to be the best solution. With normal gradient descent the model would get stuck in a possibly less optimal local minimum, which could lead to a suboptimal final solution.

Figure 13 Representation of local minima and the global minima when tuning a neural network

Mini-batch gradient descent

While stochastic gradient descent can be a good solution, it can also start to overshoot due to the high variance. To overcome this, a combination of gradient descent and stochastic gradient descent called mini-batch gradient descent can be used. This method takes the best of both worlds by taking small batches and performing the update after each batch. Although this method differs from the original stochastic gradient descent, it is often still called by the same name.
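To make the difference concrete, the sketch below implements mini-batch gradient descent for a simple linear model with a mean squared error loss (all names and the loss choice are illustrative, not taken from the thesis code). With batch_size = 1 it degenerates to the original stochastic gradient descent, and with batch_size = len(X) to plain full-batch gradient descent:

```python
import numpy as np

def minibatch_gd(X, y, w, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch gradient descent for a linear model with an MSE loss."""
    n = len(X)
    for _ in range(epochs):                       # one epoch = one pass over the data
        order = np.random.permutation(n)          # shuffling makes it 'stochastic'
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            err = X[batch] @ w - y[batch]         # prediction error on this batch
            grad = X[batch].T @ err / len(batch)  # gradient of the MSE loss w.r.t. w
            w = w - lr * grad                     # update after every mini-batch
    return w

# Toy usage: fit y = 2*x
X = np.random.randn(256, 1)
y = 2.0 * X[:, 0]
w = minibatch_gd(X, y, w=np.zeros(1))
```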

Adagrad and AdaDelta

Adagrad is an optimizer that adapts the learning rate per parameter by making big updates for infrequent parameters and small updates for frequent parameters. With Adagrad the learning rate does not have to be tuned manually. Its big weakness, however, is that the learning rate only decreases, which could lead to the model not learning anymore and not being able to find a new local minimum. The cause is that the squared gradients are accumulated in the denominator of the update: every added term is positive, so the accumulated sum keeps growing and the learning rate keeps shrinking. This problem is solved by AdaDelta, an extension of Adagrad that reduces this aggressive, monotonically decreasing learning rate by restricting the window of accumulated past gradients to some fixed size instead of accumulating all past squared gradients [44].


RMSprop

RMSprop also adapts the learning rate and solves the problem of a constantly decreasing learning rate. The difference is that RMSprop divides the learning rate by an exponentially decaying average of squared gradients [44].

Adam

In turn, Adam can be seen as a combination of RMSprop and momentum, where RMSprop contributes the exponentially decaying average of past squared gradients and momentum accounts for the exponentially decaying average of past gradients [44]. This method updates, besides the learning rate, also the momentum at every step. Because Adam adapts all the parameters in the neural network, it is the best choice most of the time [40]. Adam has several variants which can perform a bit better than the base Adam. The first is Adamax, which is based on the infinite-order norm and is more stable and more robust to noise than Adam. The second is Nadam, where Nesterov accelerated adaptive moment estimation is added to Adam. The Nesterov accelerated adaptive moment makes the momentum more sensitive when the model deteriorates, so that the high momentum does not cause an overshoot [44].
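In Keras, these optimizers are simply instantiated and passed to the compiler. The hedged sketch below shows one way to do this; the learning rates shown are the usual Keras defaults, but exact signatures and defaults may differ between Keras versions:

```python
from keras.optimizers import SGD, Adagrad, Adadelta, RMSprop, Adam, Adamax, Nadam

sgd      = SGD(lr=0.01, momentum=0.9)  # gradient descent with momentum
adagrad  = Adagrad(lr=0.01)            # per-parameter, monotonically decreasing rate
adadelta = Adadelta()                  # windowed gradient history instead
rmsprop  = RMSprop(lr=0.001)           # decaying average of squared gradients
adam     = Adam(lr=0.001)              # RMSprop + momentum
adamax   = Adamax(lr=0.002)            # infinite-order norm variant of Adam
nadam    = Nadam(lr=0.002)             # Adam + Nesterov accelerated momentum
```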

Type of loss function

The loss function determines how the error is calculated. There are several loss functions [70]:

- Mean squared error
- Mean absolute error
- Mean absolute percentage error
- Mean squared logarithmic error
- Squared hinge
- Hinge
- Categorical hinge
- Logcosh
- Categorical crossentropy
- Sparse categorical crossentropy
- Binary crossentropy
- Kullback-Leibler divergence
- Poisson
- Cosine proximity

These loss functions can be divided into two types: classification and regression losses. A classification loss generates an error for every class that can be estimated. For example, if a digit between 0 and 9 needs to be classified, each digit is represented by a class: 0 is a class, 1 is a class, etc. The classification loss then calculates the error for every class.

A regression loss is a loss function which calculates the error on a number; this could be, for example, the price of a product or the number of sales in a day.


The article [52] notes several cases with the best loss function for each:

- When a regression loss function is needed, the best loss function is the mean squared error (MSE).
- When a classification loss function is needed for a model with a binary outcome, the best loss function is binary cross entropy.
- When a classification loss function is needed for a model that predicts a single label from multiple classes, the best loss function is cross entropy.
- When a classification loss function is needed for a model that predicts multiple labels from multiple classes, the best option is binary cross entropy.

This is summarized in Table 1 [52].

Table 1 Problem types with suitable loss functions [52]

5.5.3. The batch size

The batch size is the size of the subset on which a correction of the neural network is done. The larger the batch size, the faster the network trains and the more accurate it can get. However, when the batch size is too big the network can overfit, which means it no longer generalizes well. The optimum size is difficult to estimate and needs to be determined experimentally.
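Bringing the loss function, optimizer and batch size together, a hedged Keras sketch of where each parameter ends up could look like this; model, x_train and y_train are assumed to already exist, and all concrete values are illustrative:

```python
# 'model', 'x_train' and 'y_train' are assumed to exist (hypothetical names).
model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # single label, multiple classes (Table 1)
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=32,          # the subset size per weight update
          epochs=50,
          validation_split=0.1)
```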

5.5.4. Experiments with tuning a Custom Deep Neural network

To test how neural network training works and which parameters have the most influence, some experiments were done with building and tuning a custom neural network. The model is made to classify handwritten digits using an MNIST-like dataset [53]. This dataset includes over 15k images of the digits 0 to 9, written in different styles, with a size of 100x100 pixels. To be able to classify the handwritten digits, the data is first prepared. This includes the following steps (a minimal sketch of the pipeline follows the list):

- Increasing the image dimensions to 200x200 pixels (to make the images more difficult to process and more similar to the dimensions of the localization images)

- Gray scaling the images (to drop the color information, which is not necessary for digit recognition and can only confuse the network)

- Creating one single data array with a single label array (everything needs to be in one array to make training the network easier)

- Shuffling the data (as quoted from [91], “To make sure the data point creates an independent change of the model. Without being biased by the same points before them”)

- Splitting the data into training and validation sets (training: 13,680, validation: 1,500) (the neural network should be able to recognize data it has never seen before)
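A minimal sketch of this preparation pipeline, under the assumption that OpenCV is used for the image operations (the thesis' actual implementation may differ), could look as follows:

```python
import cv2
import numpy as np

def prepare(images, labels, val_size=1500):
    """Hypothetical preparation pipeline following the steps listed above."""
    processed = []
    for img in images:
        img = cv2.resize(img, (200, 200))            # 1. upscale to 200x200 pixels
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # 2. drop the color information
        processed.append(img)
    x = np.array(processed, dtype=np.float32) / 255.0  # 3. one data array...
    y = np.array(labels)                               #    ...and one label array
    order = np.random.permutation(len(x))              # 4. shuffle
    x, y = x[order], y[order]
    return (x[val_size:], y[val_size:]), (x[:val_size], y[:val_size])  # 5. split
```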


After the preparation, the model was trained in 18 experiments using different hyper parameters. The target was to find out which parameters have the biggest influence and which could be useful for the final neural network in an image recognition application. A dataset that is proven to work with neural networks gives a good baseline of what influences an image recognition neural network the most, so that the correct parameters can be tried when implementing the final neural network for the 3d object localization. The results of the experiments can be found in chapter 10.1.

In the first experiments the target was to find out whether it was possible to classify the digits with a simple neural network (with only 1 hidden layer). This was done in experiments 1 to 9, where different learning rates, optimizers, activation functions, momentums, training times (from 30 minutes to 10 hours per experiment) and optimizer specific parameters were used. Unfortunately no accuracy better than 30% could be achieved. Although this approach did not work, it became clear that the learning rate had the biggest impact on the result, especially on how fast the maximum result was achieved.

During experiments 10 to 17 deep dense neural networks were tried (more than one hidden dense layer). These bigger networks did not achieve a higher accuracy and took a lot longer to train: while the smaller neural networks took around 30 minutes to an hour to show their potential, these networks needed well over an hour.

The last experiment used different types of convolutional neural network layers. This was the key to success: the classification reached a maximum accuracy of 84.3%. However, after the maximum accuracy was reached the neural network overshot, which led to a final accuracy of 11%. The high learning rate (0.5) is assumed to have caused this overshoot.

When the neural network for the 3d localization was trained, the same settings as in the last experiment were used with a lower learning rate to prevent the overshoot.
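For illustration, a hedged Keras sketch of the kind of small convolutional network used in the last experiment is shown below. The exact architecture and hyper parameters of the thesis are not reproduced; only the structure is: convolution and pooling layers followed by dense layers, a softmax output for the 10 digit classes, and a learning rate well below 0.5:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import SGD

model = Sequential([
    # Convolution + pooling stages extract local image features
    Conv2D(16, (3, 3), activation='relu', input_shape=(200, 200, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),      # one output class per digit 0-9
])

# A learning rate well below the 0.5 of experiment 18 avoids the overshoot.
model.compile(optimizer=SGD(lr=0.01),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```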
