
Installation correctness using object positions

Jorick van Rhenen

jvrhenen@gmail.com

12-07-2018, 47 pages

Supervisor: Ana Oprescu, UvA

Host organisation: Alliander, https://www.alliander.com

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Contents

Abstract 3
1 Introduction 4
1.1 Problem statement 4
1.1.1 Research Questions 5
1.2 Contributions 6
1.3 Outline 6
2 Background 7
2.1 Object detection algorithms 7
2.2 Classification algorithms 8
2.2.1 Linear Models 8
2.2.2 Trees 8
2.2.3 Neural Network 8
2.2.4 Nearest Neighbors 8
2.3 Unbalanced datasets 9
2.3.1 Mitigating approaches 9
2.4 Metrics 9
3 Pipeline 12
3.1 Architecture 12
4 Experiment design 14
4.1 Python, scikit-learn and TensorFlow 14
4.2 Object Detection 14
4.2.1 Setup 15
4.2.2 Training the algorithms 16
4.2.3 Threat to validity 17
4.3 Classification 17
4.3.1 Data Preparation 18
4.3.2 Setup 18
4.3.3 Threats to validity 19
4.4 Pipeline experiment 20
5 Results 21
5.1 Object Detection 21
5.2 Classification 22
5.3 Pipeline 23
6 Discussion 24
6.1 Object Detection 24
6.2 Classification 25
6.3 Pipeline 25
7 Related Work 26
7.1 Quality assurance using images 26
7.2 Object detection benchmark 26
7.3 Classification benchmark 27
7.4 Chaining Machine Learning algorithms 27
7.5 Dataset 27
8 Conclusion 29
8.1 Future work 29
A Object detection evaluation 35


Abstract

Quality assurance on gas installations is critical: small mistakes in the configuration can cause severe damage. Manually inspecting every gas installation that is updated is extremely time and resource consuming. Quality assurance using images is not a new idea. Gas installations can be classified as correct or incorrect, and with image classification we can train an algorithm to classify them. However, this approach does not yield high accuracy. Our approach uses object detection algorithms to detect all elements of the gas installation; those object positions are fed into a classification algorithm. In our research, we conduct three experiments. The first experiment evaluates six different object detection algorithms. Secondly, we evaluate eight different classification algorithms that classify object positions as correct or incorrect. The last experiment combines object detection and classification algorithms in a pipeline. Our combination of object detection and classification performs up to 106% better than image classification in terms of F1-score.


Chapter 1

Introduction

Quality management has gained in popularity mainly because of increasing consumer consciousness of quality and growing international competitive pressure. For many years, the importance of product and service quality has been acknowledged as a major factor contributing to competitive advantages. With the need to cater for more demanding customers and to cope with intensifying competition, quality orientation seems to be the required strategy to remain competitive. [57]

Quality assurance is part of quality management. Quality assurance on gas installations is also critical because a simple error can cause serious damage or even loss of life. To ensure this does not happen, companies make sure they inspect the installation on correctness. Gas installations need to be checked for leaks, unsecured elements and items that are missing or misplaced.

Energy grid operators are responsible for the correct and safe operation of gas installations. Grid operators maintain gas installations for households, and each gas installation needs to measure up to certain standards. Many sub-contractors work on gas installations. To determine whether a gas installation is properly installed, grid operators send inspectors to check the installation. However, they cannot inspect all changes that have been made: checking every gas installation takes a lot of time, and in the meantime a fault can emerge in an installation and cause serious problems. The inspectors therefore take samples of the renovations and inspect their quality. Even when an inspector checks a gas installation, he or she can miss certain faults, which can cause severe problems.

We want to automate the checking of gas installations, so that we can detect faults in the installations before a serious problem occurs. Currently, sub-contractors take pictures of the gas installation after renovation. We can use these images to determine the quality of the installation. Quality assurance using images is not a new idea [11,35,4,5,59].

1.1

Problem statement

Image classification, as the name suggests, classifies images into their appropriate classes. Gas installations can be classified as "correct" and "incorrect". Image classification to determine quality has been proven effective in prior research [4,59,5].

Following the same approach as prior research, we use image classification to classify gas installations. Our dataset comprises 1138 correct and 288 incorrect images of gas installations. Our image classification experiment yields an accuracy of 63%. A precision of 0.26 and a recall of 0.35 result in an F1-score of 0.30.

When taking a closer look at the training of the algorithm (figure 1.1), we notice that the training accuracy approaches one, while the testing accuracy alternates between 0.55 and 0.70. We also notice a descending trend in the testing accuracy. These are signs of overfitting on training data. Apparently, detecting the difference between correct and incorrect images is very difficult. The difference between a good and a bad gas installation can be very subtle; perhaps only one part of the installation is missing, for example a gas seal (see figure 1.2). We need a more robust solution to detect correct and incorrect installations.

Figure 1.1: Screenshot of TensorBoard: accuracy of image classification on gas installations. Vertical axis: accuracy (minimum 0, maximum 1). Horizontal axis: number of training steps. Orange line: training data; blue line: test data.

Figure 1.2: On the left an example of a correct gas installation, on the right an example of a bad gas installation. The installation on the right is missing gas seals.

1.1.1

Research Questions

Determining installation correctness using image classification is not possible with high precision. Our solution uses object positions to classify an image as correct or incorrect. We explain more about our system in chapter 3. We are challenging the state-of-the-art research in image classification to detect correct and incorrect installations. To determine whether this hypothetical system will work, we ask the following question.

Will a classification algorithm, based on object positions, yield a better F1-score than regular image classification for the detection of correct/incorrect installations?


We need to find the most suitable object detection algorithm to detect objects in gas installations. Hence, we need to answer the following subquestion: SR1: What object detection algorithm scores the best on accuracy for detecting objects in gas installation images?

The second part of our research is to find out which algorithms work best to classify object positions. To find the best algorithm, we ask the following research question: SR2: What Machine Learning algorithm yields the highest F1-score to classify correct or incorrect installation of a gas installation using object positions?

One important aspect of achieving a high accuracy is the form of the input data. We may need to filter or adjust our object positions before we can classify them. Therefore, we ask the following question: SR2.1: How to prepare object position data for classification to achieve a high accuracy?

1.2

Contributions

Our research makes the following contributions:

1. A strategy, based on object positions, to classify gas installations on correctness that outperforms the state of the art by 106%.

2. An object position data preparation strategy for classification algorithms.

3. A framework that allows combining object detection algorithms, data preparation and classification algorithms.

1.3

Outline

Chapter 2 gives background information for this research. Chapter 3 explains our vision for a new system to determine the correctness of gas installations. Chapter 4 explains the experiments we conduct. We present the results of our experiments in chapter 5 and discuss them in chapter 6. In chapter 7 we refer to other research that has been conducted in the same key areas as our thesis. Finally, in chapter 8, we state our conclusions and mention future work.


Chapter 2

Background

Our solution uses the object positions of gas installation elements to determine the classification. First, we discuss various object detection algorithms. Secondly, we describe various classification algorithms. Because our dataset is unbalanced, we then dive deeper into this topic. Finally, we describe metrics to compare object detection algorithms and classification algorithms.

2.1

Object detection algorithms

Object detection algorithms can detect objects in an image. They are usually extensions of image classification models: the near-final layers of an image classification model are used as feature selectors, and these features are passed to an object detection algorithm that is able to localise and detect objects. [31,23,45]

R-CNN is one of the first modern object detection algorithms. R-CNN introduced selective search, which combines the strengths of both an exhaustive search and segmentation. Fast R-CNN [19] and Faster R-CNN [46] are improvements based on the original implementation; Faster R-CNN is the fastest and achieves the highest accuracy of the three.

You Only Look Once (YOLO) presents a new approach to object detection. Prior work on object detection repurposed classifiers to perform detection [45]. YOLO frames object detection as a regression problem: the whole detection pipeline is a single network, so YOLO can be optimised end-to-end directly on detection performance. Compared to other state-of-the-art detection algorithms, YOLO makes more localisation errors but is less likely to predict false positives on the background. YOLO has two successors: YOLOv2 [43] and YOLOv3 [44]. YOLOv3 achieves the highest accuracy on the COCO object detection challenge.

Overfeat is a multi-scale, sliding-window approach that can be efficiently implemented within a ConvNet. Overfeat is an end-to-end system for object detection, whereas traditional object detection algorithms make use of pre-trained classification models as feature selectors. Overfeat won the localisation task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classification tasks. [50]

SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments of the box to match the object shape better. [33]

R-FCN presents region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN, which apply a costly per-region subnetwork hundreds of times, the R-FCN detector is fully convolutional, with almost all computation shared over the entire image. This results in nearly the same accuracy as Faster R-CNN at a lower computational cost per image. [10]

RetinaNet is designed to be a one-stage detector with high accuracy. One-stage detectors trail in accuracy compared to two-stage detectors like R-CNN. Using the Focal Loss, Lin et al. achieve a healthy gap with their closest competitor. [31]

2.2

Classification algorithms

Classification is the most common supervised learning task in machine learning [18]. Classification algorithms classify data into classes. We can group classification algorithms into the following classes [18]: linear models, tree algorithms, neural networks and nearest neighbors. Each type of algorithm has its benefits over the others.

2.2.1

Linear Models

Logistic Regression is a linear model. Logistic Regression is commonly used to estimate the probability that an instance belongs to a particular class.

Naive Bayes Classifier technique is based on the so-called Bayesian theorem and is particularly suited when the dimensionality of the input is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods. [18]

A Support Vector Machine (SVM) can perform linear or nonlinear classification. Support vectors are the coordinates of individual observations; the SVM finds the frontier that best segregates the two classes. [18]

Geron describes that non-linear data can be transformed so that it can be used in linear models: a polynomial transformation followed by a standard scaler makes non-linear data usable by linear models [18].
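As a minimal sketch of this idea (the degree and the choice of linear classifier below are placeholder assumptions, not the configuration used in this thesis):

    # Illustrative sketch: a polynomial transform and a standard scaler in front of a linear model.
    # The degree and the LinearSVC choice are example values, not the thesis setup.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler
    from sklearn.svm import LinearSVC

    model = make_pipeline(
        PolynomialFeatures(degree=2),   # expand features into polynomial terms
        StandardScaler(),               # rescale the expanded features
        LinearSVC()                     # linear model applied to the transformed data
    )
    # model.fit(X_train, y_train); model.predict(X_test)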

2.2.2

Trees

The Decision tree is a popular machine learning algorithm. Decision Tree algorithms require very little data preparation and do not require feature scaling or centering. The tree is also not influenced by unbalanced datasets. Decision tree creates a tree of rules. Each node has a rule that guides the classification down the tree. The leaf nodes contain the classification. [18]

Random Forest is an ensemble of Decision Trees, generally trained via the bagging method. The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which trades a higher bias for a lower variance, generally yielding an overall better model. [18]

Boosted Tree is also an ensemble method that combines several weak learners into a strong learner. The general idea of a boosting method is to train predictors sequentially, each trying to correct its predecessor. [18]

2.2.3

Neural Network

Neural Networks are a simplified computational model of how our biological neurons work. Neural Networks are the foundation of deep-learning. Neural Network can take on many different forms and functions. Multi-Layer Perceptron is a Neural Network with one input layer, one or more hidden layers and one final output layer. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. [18]

2.2.4

Nearest Neighbors

k-nearest neighbors (KNN) is a non-parametric, lazy learning algorithm. Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point. Given a new data point, KNN finds the k nearest known points and takes the majority vote of those neighbors. k-nearest neighbors makes no assumptions about the data and is not sensitive to an unbalanced dataset. [49]

2.3

Unbalanced datasets

Akbani et al. show that accuracy on an unbalanced dataset can be a useless value: "Take an unbalanced set of ratio 99 negatives to 1 positive. A classifier that classifies everything negative will be 99% accurate" [1]. However, such a classifier will fail to detect, for example, the positive results of a brain cancer test. It is important that we use a metric that exposes the true performance of a model trained on an unbalanced dataset.

2.3.1

Mitigating approaches

Resampling and down-sampling are both suitable solutions as data-preparation steps [25]. Resampling takes the minority class and generates new points; down-sampling throws away data points from the majority class until the dataset is balanced. Both methods are effective when encountering unbalanced datasets.
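A minimal sketch of up-sampling the minority class with scikit-learn's resample utility; the array-based layout and the label convention (1 = minority class) are assumptions for illustration.

    # Illustrative up-sampling of the minority class; X and y are assumed to be numpy arrays.
    import numpy as np
    from sklearn.utils import resample

    def upsample_minority(X, y, minority_label=1, random_state=42):
        X_min, y_min = X[y == minority_label], y[y == minority_label]
        X_maj, y_maj = X[y != minority_label], y[y != minority_label]
        # Draw minority samples with replacement until both classes have the same size.
        X_min_up, y_min_up = resample(X_min, y_min,
                                      replace=True,
                                      n_samples=len(y_maj),
                                      random_state=random_state)
        return np.vstack([X_maj, X_min_up]), np.concatenate([y_maj, y_min_up])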

Penalised classification adds an additional cost on the model for making classification mistakes. These penalties can bias the model to pay more attention to the minority class [17].

Decision trees often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes. In modern applied machine learning, tree ensembles (Random Forests, Boosted Trees) almost always outperform single decision trees. [18,49]

2.4

Metrics

We need metrics to compare object detection algorithms and classification algorithms. We conducted a literature study to find metrics used in papers, following the mapping strategy proposed by Petersen et al. [41]. We first limit our search space to the categories "object detection", "classification" and "unbalanced dataset". The next step is to determine the quality of the papers. We set a limit of ten papers per category, totalling thirty papers. Table 2.1 shows all the different metrics we encountered and the number of occurrences in the set of papers.


Metric | Papers using it (classification / object detection / unbalanced dataset) | Total
% of correct | [21,60,56,3,37,22] [36,34,42] [27,26,7,20] | 13
Recall | [34,2,58,12,13,32] [1,27,53,26,7,20] | 12
Precision | [34,2,58,12,13,32] [1,27,53,26,7,20] | 12
F-value score | [58] [27,53,26,20] | 5
Average Precision | [40] [12,23,13,32,34] | 6
mean Average Precision | [23,32] | 2
% error rate | [6,8,37] | 3
% of improvement | [30] | 1
Per-class accuracy | [60] | 1
Fragmentation | [34] | 1
True positives | [42] | 1
False positives | [38,42,61] | 3
True negatives | [42] | 1
False negatives | [42,61] | 2
Sensitivity | [53,1] | 2
Specificity | [53,1] | 2
G-mean | [53] | 1
Cohen's kappa | [26] | 1
Krippendorff's alpha | [26] | 1
Area under ROC | [26,7] | 2
Precision Recall curve | [26] | 1
Jaccard coefficient | [42] | 1
Yule coefficient | [42] | 1
Memory | [23] | 1
GPU Time | [23] | 1

Table 2.1: Metrics used to compare data

% of correct is the most popular metric, followed by precision and recall (table 2.1). When we look at the categories "unbalanced dataset" and "object detection", we see that recall and precision are used more often. The F-value score is used as often as % of correct in the "unbalanced dataset" category.

Precision and recall originate from information retrieval [1]. Information retrieval is the task of finding documents that are relevant to a user's need for information. Precision measures the proportion of documents in the result set that are relevant; recall measures the proportion of all relevant documents in the collection that are in the result set. The F1-score is the harmonic mean of recall and precision. It is widely used to evaluate an algorithm trained on an unbalanced dataset. [49]

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

\text{Precision} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}}

\text{Recall} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}}
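These metrics are available directly in scikit-learn; a small check of the formulas above on assumed toy labels (1 = incorrect installation) could look as follows.

    # Toy check of the formulas above with scikit-learn's implementations.
    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # assumed toy ground truth
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # assumed toy predictions

    print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2 / 3
    print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2 / 3
    print(f1_score(y_true, y_pred))         # harmonic mean of the two = 2 / 3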

Intersect over Union calculates the overlap between a predicted object and ground-truth. The object overlap should exceed a certain threshold to be viewed as a correct prediction.

\text{IoU} = \frac{B_1 \cap B_2}{B_1 \cup B_2}
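A small sketch of computing IoU for two axis-aligned boxes; the (x, y, width, height) box format is an assumption made for illustration.

    # Illustrative IoU for axis-aligned boxes given as (x, y, width, height).
    def iou(box_a, box_b):
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        # Intersection rectangle.
        ix = max(ax, bx)
        iy = max(ay, by)
        iw = max(0.0, min(ax + aw, bx + bw) - ix)
        ih = max(0.0, min(ay + ah, by + bh) - iy)
        inter = iw * ih
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0

    # A detection counts as correct when iou(prediction, ground_truth) exceeds the chosen threshold (e.g. 0.50).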

Well-known object detection challenges like COCO [9,32], Pascal VOC [13] and ImageNet [47] use Average Precision (AP) and mean Average Precision (mAP). Average Precision is precision averaged across all values of recall between 0 and 1. Intersect over Union is used to determine whether an object is correctly detected. Mean Average Precision takes the mean of multiple Average Precision values computed with different IoU thresholds.

Pascal VOC uses an IoU threshold of 0.50. The COCO mean Average Precision uses a set of IoU thresholds from 0.50 to 0.95 in steps of 0.05 (0.50:0.05:0.95).

\text{Average Precision} = \frac{\sum_{k=1}^{n} P(k) \times \text{rel}(k)}{\text{numberOfRelevantDocuments}}

\text{mean Average Precision} = \frac{\sum_{q=1}^{Q} \text{Average Precision}(q)}{Q}


Chapter 3

Pipeline

The current state of the art is not able to detect correct and incorrect gas installations with high accuracy (section 1.1). State-of-the-art approaches take an image as input and give a classification as output (e.g., correct or incorrect). Image classification uses the whole image to determine the class. A lot of noise surrounds gas installations, which makes it hard to classify correct and incorrect installations.

A gas installation is comprised of several elements: a gas-meter, gas-tap, b-valve, seals and pipes. Our approach uses these object positions. We want to detect all the objects in the input installation and use their positions as input for a classification algorithm. By doing this, we remove noise from the image and focus on the elements of the gas installation.

To succeed, we need to detect objects in the gas installation image. Object detection algorithms can detect object positions in the installation. A classification algorithm then takes these object positions as input and classifies them as correct or incorrect. Figure 3.1 shows the comparison between our system and the state of the art.

Figure 3.1: Proposed pipeline

3.1

Architecture

As described, we want to create a two-step approach to the classification of gas installations: first, we detect objects in the images, then we use a classification algorithm to classify based on object positions. The field of Machine Learning is moving fast [18]. Our framework needs to be modular to leverage future algorithms. We use components to separate concerns: a component for object detection, a component for classification, and a data library component that takes care of data storage and preparation. Because GPUs are often used in Machine Learning [49], we want our framework to support both CPU and GPU execution.

Object detection is the first step in our pipeline. The object detection algorithm takes an image as input and outputs a set of bounding boxes. Gas installations have multiple instances of the same object, so it is important that the detector can detect all objects. The object detection algorithm should be trainable on our dataset, so that we can train it to detect seals, b-valves, gas-taps and gas-meters. The bounding boxes of the detection are input to the classification algorithm. Geron states "garbage in, is garbage out" [18]. Therefore, the accuracy of an object detection algorithm is more important than its speed.

The second and last step in our pipeline is classification. The classification algorithm takes bounding boxes as input and classifies them as correct or incorrect. Before the classification can make use of the bounding boxes, they first need to be transformed. This is done using the data library.

Our data library consists of two parts: a data store and a filter system. The data store makes sure all annotations for training and testing are preserved. When a new algorithm needs to be trained, the framework provides direct access to all necessary data. Our filter system makes it easy to filter and/or transform the data from the store into the format an algorithm desires. We can create filters or transformers and chain them together, as sketched below.
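The thesis does not list the data library's API; as a hypothetical sketch, such a filter system could chain small callables, where every name below is invented for illustration.

    # Hypothetical sketch of a chainable filter/transformer system; all names are invented.
    class FilterChain:
        def __init__(self, *filters):
            self.filters = filters  # each filter is a callable taking and returning annotations

        def apply(self, annotations):
            for f in self.filters:
                annotations = f(annotations)
            return annotations

    # Example usage with invented filters:
    # chain = FilterChain(drop_unlabeled, rescale_to(640, 480), order_by_position)
    # prepared = chain.apply(data_store.load("train"))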


Chapter 4

Experiment design

In order to answer our research questions, we devised three experiments. The first experiment, which answers SR1: ”What object detection algorithm scores the best on accuracy for detecting objects in gas installation images?”, evaluates object detection algorithms. The second experiment, which answers SR2: ”What Machine Learning algorithm yields the highest F1-score to classify correct or incorrect installation of a gas installation using object positions?”, focuses on classification of object positions. The final experiment, to answer our main research question, combines the results of the object detection and classification experiment.

4.1

Python, scikit-learn and TensorFlow

For our experiments, we make use of Python. Python is a general-purpose, interpreted, object-oriented programming language. The Python programming language is establishing itself as one of the most popular languages in scientific communities [39]. Many Machine Learning frameworks make use of Python [18,49].

For our research, we make use of the following packages:

Scikit-learn is a Machine Learning framework that includes a wide range of algorithms for data preprocessing, supervised and unsupervised learning, model validation and selection, and metrics. Scikit-learn has implemented many algorithms, all sharing a uniform interface.

TensorFlow is a deep learning framework. Tensorflow makes it possible to create complicated models and train them. TensorFlow is the most popular Machine Learning framework.

Keras is an easy-to-use deep learning framework. Its high-level APIs make it easy to stack complicated Machine Learning layers on top of each other. Keras is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano or MXNet.

COCO is the official library for the COCO challenge. This library contains all evaluation metrics used in the COCO challenge.

4.2

Object Detection

In order to answer SR1 ("What object detection algorithm scores the best on accuracy for detecting objects in gas installation images?"), we train and evaluate six different object detection algorithms. After training the algorithms, we compare them to each other on accuracy and throughput.

Dataset For this experiment, we make use of a single dataset provided by Alliander. The dataset consists of 1430 images of gas installations. The dataset is completely annotated and double-checked for faulty annotations. We pre-split the entire dataset into a training set containing 80% of the images and a testing set containing the remaining 20% [18]. Each algorithm uses the same training set for training and the same test set for testing.
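A minimal sketch of such an 80/20 split with scikit-learn; the placeholder records and the fixed random seed are assumptions added so that every algorithm would see the same split.

    # Illustrative 80/20 split; the random_state value is an assumption for reproducibility.
    from sklearn.model_selection import train_test_split

    records = list(range(1430))  # stands in for the 1430 annotated image records
    train_set, test_set = train_test_split(records, test_size=0.20, random_state=42)
    print(len(train_set), len(test_set))  # 1144 286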

Metrics When training the algorithms on our dataset is finished, we use Average Precision (AP 0.50), mean Average Precision (mAP 0.50:0.95) and throughput to compare them (see section 2.4). Average Precision and mean Average Precision are used in large object detection challenges like Pascal VOC, COCO and ImageNet; AP and mAP are implemented in the official COCO library. For throughput, we measure the time, in milliseconds, needed to detect all elements of a gas installation (seals, b-valve, gas-tap and gas-meter). We evaluate all images of the test set, measure the total time it takes to process them, and then take the average time per image.

Environment Our experiment will run on two different machines. The training will be done on a GPU accelerated machine. Evaluation of throughput will be both on a GPU machine and CPU machine.

The CPU machine is a 2016 15” MacBook Pro running on macOS High Sierra 10.13.4. The machine has a 2.6 GHz Intel Core i7 with 16 GB of LPDDR3 RAM. This machine is running Tensorflow 1.8.0 and Keras 2.1.6.

The GPU machine runs Linux Ubuntu 16.04.4. The machine has an Intel Core i7-6950X running at 3.0 GHz and 62 GB of DDR3 RAM. There are two graphics cards in this machine: an Nvidia GeForce GTX 980 with 4 GB of video RAM and an Nvidia GeForce GTX 980 Ti with 6 GB of video RAM. To make use of the GPUs, Tensorflow-gpu 1.8.0 and Keras 2.1.6 are installed.

4.2.1

Setup

For this experiment, we evaluate the following algorithms: YOLO, Faster R-CNN, Overfeat, SSD, R-FCN and RetinaNet. In this section, we discuss how we configure each of them.

YOLO

YOLOv3 was released in 2018 [44]. This version of YOLO scores a higher mAP than YOLOv2 and YOLO. We use the implementation of experiencor/keras-yolo3 [14].

We use the default configuration file found in the repository [14]. We change the labels to 'b-valve', 'gas-meter', 'gas-tap' and 'seals'. We change the anchors in the configuration with a script provided by the repository (gen_anchors.py).

To use our own dataset, we change the function create_training_instances in train.py. Instead of using the repository's import method, we integrate our own data library.

Faster R-CNN

Faster R-CNN is implemented in the official TensorFlow object detection repository [54]. The repository supplies a simple script that transforms a JSON file into a TFRecord file. We took this script as a basis and applied it to our own dataset. We can reuse this TFRecord file for every algorithm we train with the Tensorflow object detection API.
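A condensed, illustrative sketch of writing one annotated image to a TFRecord file in the style of the TensorFlow object detection API examples (TF 1.x); the exact set of feature keys here is reduced and should be treated as an assumption, not the thesis script.

    # Condensed, illustrative TFRecord writer following the TF object detection API conventions.
    import tensorflow as tf

    def bytes_feature(v): return tf.train.Feature(bytes_list=tf.train.BytesList(value=[v]))
    def float_list(v):    return tf.train.Feature(float_list=tf.train.FloatList(value=v))
    def int64_list(v):    return tf.train.Feature(int64_list=tf.train.Int64List(value=v))

    def to_tf_example(encoded_jpg, xmins, xmaxs, ymins, ymaxs, class_ids):
        # Coordinates are assumed to be normalised to [0, 1].
        return tf.train.Example(features=tf.train.Features(feature={
            'image/encoded': bytes_feature(encoded_jpg),
            'image/format': bytes_feature(b'jpeg'),
            'image/object/bbox/xmin': float_list(xmins),
            'image/object/bbox/xmax': float_list(xmaxs),
            'image/object/bbox/ymin': float_list(ymins),
            'image/object/bbox/ymax': float_list(ymaxs),
            'image/object/class/label': int64_list(class_ids),
        }))

    # writer = tf.python_io.TFRecordWriter('gas_installations_train.record')
    # writer.write(to_tf_example(...).SerializeToString()); writer.close()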

Faster R-CNN with Inception ResNet achieves the highest accuracy. A drawback of Inception ResNet is the speed of the model: it is by far the slowest of all the feature selectors. Huang et al. mention ResNet 101 as the sweet spot for Faster R-CNN. We therefore evaluate both the Inception ResNet and the ResNet 101 feature selector backends for Faster R-CNN. [24]

For ResNet 101, we take the example config supplied by the repository (faster rcnn resnet coco) and tailor it to our own dataset. We change the training and evaluation input to the newly created TFRecord file. We also change the number of classes to 4 (seals, b-valve, gas-tap, gas-meter). We decreased the batch size to 1, batch_queue_capacity to 50, num_batch_queue_threads to 4 and prefetch_queue_capacity to 5. This solves memory problems while training Faster R-CNN ResNet 101.

In order to train Faster R-CNN with Inception ResNet we use the default configuration found in the repository (faster rcnn inception resnet coco). In this config file we change the path to the training and evaluation TFRecord file. We also change the number of classes to 4.

Unfortunately, we could not train on the GPU machine because the algorithm requires a lot of memory, so we moved training to Google Cloud ML. In order to do this, we reverted to an older version of the repository because of a bug when training on the cloud. On Cloud ML, we created a cluster of 10 GPUs: one master node and nine worker nodes. Cloud ML automatically provisions all the workers and manages the training.

SSD

SSD is implemented in the Tensorflow object detection API [54]. The preparation of the algorithm and dataset is the same as for Faster R-CNN (section 4.2.1). In the object detection API repository, SSD based on Inception V2 achieves the highest score in terms of mAP [54], so we use Inception V2 as feature selector for this algorithm.

The repository supplies a config file for SSD Inception V2 (ssd inception v2 coco). In order to accept our dataset, we change the training and evaluation path for the TFRecord file and set the batch size to 12.

R-FCN

R-FCN is implemented in the Tensorflow object detection API and has the same preparation steps as SSD and Faster R-CNN (section 4.2.1). R-FCN with ResNet 101 yields the highest accuracy [23]. The repository supplies a default configuration file to train an R-FCN network based on ResNet 101 (rfcn resnet 101 coco). As with SSD and Faster R-CNN, we change the input file for the training and evaluation TFRecord and the number of classes we want to predict.

Overfeat

The Overfeat object detection algorithm is implemented by russell91 [48]. The implementation is based on the paper by Stewart et al. [52]. We use the default configuration of Overfeat.

To allow training on the Alliander dataset, we replace the function in the repository that feeds data into the algorithm by creating our own implementation of load data gen in utils/train_utils.py.

RetinaNet

RetinaNet is implemented by Fizyr in Keras [16]. The implementation is derived from the original paper on RetinaNet [31]. According to Lin et al., RetinaNet achieves the highest accuracy with a ResNet-101 backbone.

The implementation is easy to extend. It uses a Generator to feed data into the algorithm. We create an AllianderGenerator which extends the base Generator. The AllianderGenerator reads data from the Alliander data library and transforms it into the desired input for the algorithm.
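The exact interface of the base Generator is not reproduced in the thesis; as a rough, hypothetical sketch, a dataset-specific generator typically exposes methods along these lines (class and method names below are assumptions, and the real class would extend the repository's Generator base class).

    # Hypothetical sketch of a dataset-specific generator; all names are invented for illustration.
    class AllianderGenerator:
        def __init__(self, records):
            # 'records' stands in for annotated images loaded from the Alliander data library.
            self.records = records

        def size(self):
            # Number of images available for training.
            return len(self.records)

        def load_image(self, index):
            # Return the image array for one record.
            return self.records[index]["image"]

        def load_annotations(self, index):
            # Return bounding boxes and class ids in the format the base Generator expects.
            return self.records[index]["boxes"]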

4.2.2

Training the algorithms

In previous sections, we discussed the implementation and configuration of the algorithms. In this section, we will look at how we conduct the training and determine when an algorithm is done training.


YOLO

The implementation of YOLOv3 makes use of the early-stopping functionality of Keras. This callback monitors the loss and stops training when the loss has not decreased within a given number of epochs. It allows us to train for 100 epochs: when the algorithm is no longer improving, training stops automatically.
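Keras provides this behaviour through the EarlyStopping callback; a minimal sketch, where the monitored quantity and patience value are illustrative choices rather than the repository's settings:

    # Minimal illustration of Keras' early-stopping behaviour; monitor/patience are example values.
    from keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor='val_loss',  # stop when the validation loss stops improving
                               patience=10)          # allow 10 epochs without improvement

    # model.fit(x_train, y_train, epochs=100,
    #           validation_data=(x_val, y_val), callbacks=[early_stop])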

Faster R-CNN, SSD, R-FCN

In order to see the progress of the trained model, the Tensorflow object detection API supplies an eval.py script. This script evaluates the latest checkpoint of the training every 5 minutes, using the official COCO evaluation metrics. We use the graphs created by the script to determine the highest point and thus the most accurate checkpoint.

Overfeat

To monitor the progress of the algorithm, we watch the loss function. We need to make sure the model is not overfitting and stop training at the optimal moment. When the training loss keeps decreasing while the testing loss no longer follows, the model is overfitting.

RetinaNet

We configure the algorithm to train for 100 epochs. After each epoch, the algorithm runs an evaluation script to determine the accuracy of the model. By watching this accuracy, we can determine after which epoch the model performs at its best.

4.2.3

Threat to validity

In our experiment, we do not fine-tune the algorithm parameters, because we only want to find out whether our pipeline achieves a higher score than image classification. Future research on this topic can focus on fine-tuning parameters for gas installation objects in order to maximise the results.

We found that our GPU machine is not sufficient to train the Faster R-CNN Inception ResNet network, so we moved training to Google Cloud ML. During training we encountered a fatal exception that seemed to occur randomly, which made it impossible to train the network on Google Cloud ML. We made the author of the repository aware of this issue, and they provided an older version that still worked on Google Cloud ML so that we could continue our research. There can be a difference in performance because of the older version: newer versions are normally more stable, fix bugs or introduce new features, so the latest version we were unable to test may behave differently from the older version we did test.

4.3

Classification

To answer SR2 ”What Machine Learning algorithm yields the highest F1 score to classify correct or incorrect installation of a gas installation using object positions?” we train and evaluate eight different algorithms on eight different dataset configurations.

This experiment makes use of the Alliander dataset. It is the same dataset as used in the object detection experiment (section 4.2). This dataset contains 1430 data-points, 1138 correct and 288 incorrect gas installations. All data-points have annotated object positions for each element in the gas installation. The dataset is split into 80% training and 20% testing set. The training and testing set will remain the same during the experiment.

To evaluate and compare the different classification algorithms, we use F1 score, Recall, Precision and Accuracy. All metrics are implemented in the scikit-learn library.


4.3.1

Data Preparation

Data preparation is an important step in Machine Learning [18]. We conduct a sub-experiment to determine which data preparation steps perform best on object positions. In total, we run three data experiments with two variations each, so the total number of data experiments is 2^3 = 8. For each classification algorithm we evaluate, we run all eight data experiments, totalling 64 experiments.

Experiment 1: object information First, we determine whether we need all the data. Unnecessary data is bad for training and can lead to worse-performing models [18,49]. We evaluate whether we need the full object information (x coordinate, y coordinate, width and height) or whether we can use only the x-center and y-center.

Experiment 2: object normalisation Normalisation and generalisation are very important for machine learning algorithms [18]. Our dataset contains images of different sizes. To normalise, we rescale all data points to 640x480. The second variant of this experiment uses the gas-meter as centre point (0,0) and rescales all other objects relative to this point. In doing so, we expect the data to become robust against distance and location within the image.

Experiment 3: object ordering Our last experiment tests object ordering. Machine learning algorithms expect features to always be in the same place. The first variant of this experiment focuses on categories: we predetermine the order in which we place the objects: 4x seals, 1x b-valve, 2x gas-tap and 1x gas-meter.

This order can run into an issue: the order of the seals and gas-taps can fluctuate because object detection algorithms may return them in a different order. That is why, in our second variant, we order the parts of the same category based on their positions in the image.

Experiment # | Object info | Normalisation | Object ordering
1 | full | Rescale | category
2 | full | Rescale | object
3 | full | Relative | category
4 | full | Relative | object
5 | center | Rescale | category
6 | center | Rescale | object
7 | center | Relative | category
8 | center | Relative | object

Table 4.1: Data preparation experiment with data configurations
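For illustration, a rough sketch of two of the preparation steps described above, rescaling object positions to a 640x480 reference frame and ordering objects of the same category by position; the (x, y, w, h, category) record layout and the category names are assumptions.

    # Illustrative data preparation; the (x, y, w, h, category) record layout is an assumption.
    def rescale(objects, img_w, img_h, target_w=640, target_h=480):
        sx, sy = target_w / img_w, target_h / img_h
        return [(x * sx, y * sy, w * sx, h * sy, cat) for (x, y, w, h, cat) in objects]

    def order_objects(objects, category_order=('seal', 'seal', 'seal', 'seal',
                                               'b-valve', 'gas-tap', 'gas-tap', 'gas-meter')):
        # Group per category and sort each group by position, so that repeated parts
        # (seals, gas-taps) always appear in the same order in the feature vector.
        ordered = []
        for cat in dict.fromkeys(category_order):  # unique categories, order preserved
            group = sorted((o for o in objects if o[4] == cat), key=lambda o: (o[0], o[1]))
            ordered.extend(group)
        return ordered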

4.3.2

Setup

In this experiment, we train eight different machine learning algorithms: Logistic Regression, Naive Bayes Classifier, Support Vector Machines, Decision Tree, Random Forest, Boosted Tree, Neural Network and Nearest Neighbors.

Logistic Regression

Logistic Regression is implemented in the scikit-learn framework. To transform our dataset for a linear model, we use PolynomialFeatures() with a degree of 3. Geron suggests applying a StandardScaler after PolynomialFeatures. [18]

Logistic Regression is sensitive to unbalanced datasets. Since our dataset is unbalanced, we need to make sure the algorithm compensates for this problem. Scikit-learn implements a class_weight parameter in LogisticRegression to increase the weight of the smaller class, so we set class_weight to 'balanced'.
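A sketch of this configuration as described (polynomial features of degree 3, standard scaling, class_weight='balanced'); any parameter not mentioned in the text is left at its scikit-learn default.

    # Sketch of the described Logistic Regression setup; unspecified parameters stay at defaults.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler
    from sklearn.linear_model import LogisticRegression

    logreg = make_pipeline(
        PolynomialFeatures(degree=3),
        StandardScaler(),
        LogisticRegression(class_weight='balanced'),
    )
    # logreg.fit(X_train, y_train)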


Naive Bayes Classifier

The Naive Bayes Classifier follows the same steps as Logistic Regression: we first use PolynomialFeatures() with a degree of 1, followed by a StandardScaler().

Scikit-learn implements this model as GaussianNB(). This function accepts a parameter to increase the weight of smaller classes; thus, we set class weight to 'balanced'.

Support Vector Machines

We process our data with PolynomialFeatures() with a degree of 2, followed by a StandardScaler(). For Support Vector Machines, we use the LinearSVC() implementation. We change the loss function to 'hinge' and C to 0.5. We set the class_weight parameter to make sure the algorithm handles the unbalanced dataset.

Decision Trees

Decision Trees are implemented in scikit-learn as DecisionTreeClassifier(). We set the max depth of the tree to 3. We do not need to change the dataset to make it work with decision trees.

Random Forest

The Random Forest algorithm is based on decision trees. Scikit-learn provides an implementation of the algorithm in the function RandomForestClassifier(). The algorithm takes the same parameters as the decision tree. We set the max depth to 3.

Boosted Trees

The boosted trees algorithm is implemented in scikit-learn. We use the GradientBoostingClassifier() function to train a boosted tree. The boosted tree algorithm is based on decision trees. We set the max depth to 3. This is the same value as for decision trees.

Neural Network

Scikit-learn provides a multi-layer Neural Network implementation. The downside of this implementation is that the network does not work well with unbalanced datasets. To balance the dataset, we use the resampling technique: we resample our dataset and generate new data points to balance the training data. We do not balance our testing set.

We use the 'lbfgs' solver with an alpha of 1e-4. The algorithm slowly decreases the learning rate; this is done by setting the learning rate property to 'invscaling'.

Our network has four layers: one input layer, two hidden layers and one output layer. The first hidden layer has a size of five and the second hidden layer a size of two.
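A sketch of the described network in scikit-learn; the hidden layer sizes (5, 2), the lbfgs solver, alpha 1e-4 and the invscaling learning-rate schedule come from the text, everything else remains a default.

    # Sketch of the described MLP configuration; unspecified settings remain scikit-learn defaults.
    from sklearn.neural_network import MLPClassifier

    mlp = MLPClassifier(hidden_layer_sizes=(5, 2),   # two hidden layers of sizes 5 and 2
                        solver='lbfgs',
                        alpha=1e-4,
                        learning_rate='invscaling')
    # mlp.fit(X_resampled, y_resampled)   # trained on the resampled (balanced) training data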

Nearest Neighbors

In scikit-learn, Nearest Neighbors is implemented as KNeighborsClassifier(). We set n_neighbors to 5 and algorithm to 'auto'. For training, we make sure the algorithm uses all available CPU cores to maximise training performance.

4.3.3

Threats to validity

In our experiment, we follow a trial-and-error approach to determine the parameters for the classification algorithms. This is not a sound way to determine the correct hyperparameters; we did not tune them because of time constraints. Future research can focus on finding the correct parameters for this problem, for example by using GridSearchCV as implemented in scikit-learn. To fine-tune the parameters, we need to make sure to create a validation set; otherwise we would tune the parameters to perform well on our test set, but not on real-world examples [18].

The dataset is provided by Alliander. We double-checked the object annotations. Unfortunately, we are not qualified or competent to validate correct or incorrect gas installations, so we depend on Alliander having annotated this perfectly. It is also possible that Alliander has only selected images where mistakes are easy to detect. If this is the case, a future implementation is likely to be less accurate because the system cannot detect the hard examples.

4.4

Pipeline experiment

The pipeline experiment is our final experiment and answers our main research question: "Will a classification algorithm, based on object positions, yield a better F1-score than regular image classification for the detection of correct/incorrect installations?". This experiment builds on top of the object detection (section 4.2) and classification (section 4.3) experiments. We use the architecture described in chapter 3.

The dataset is provided by Alliander. It contains 286 images: 64 correct and 222 incorrect. All images have annotated object information and are classified as correct or incorrect. We use this previously unseen dataset in the pipeline experiment.

Chapter 3 describes how the pipeline is composed: first, we use an object detection algorithm to detect all gas installation objects, then we feed these object positions into a classification algorithm that determines whether the gas installation is correct or incorrect.
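Conceptually, this is a straightforward composition of the two trained components; a hypothetical sketch in which every name is invented for illustration:

    # Hypothetical composition of the two pipeline stages; all names are invented.
    def classify_installation(image, detector, preparer, classifier):
        boxes = detector.detect(image)             # step 1: object detection -> bounding boxes
        features = preparer.apply(boxes)           # step 2: data preparation (rescale, order, ...)
        return classifier.predict([features])[0]   # step 3: 'correct' or 'incorrect'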

In section 4.2, we describe our object detection experiment. We use the results of that experiment to select the top AP-scoring and top mAP-scoring object detection algorithms. If the top-performing algorithm is the same for AP and mAP, we also select the second-best mAP-scoring algorithm. Section 4.3 describes the classification experiment; from this experiment, we use the top two F1-scoring algorithms. If multiple algorithms achieve equal scores, we use all of the top-scoring algorithms.

This experiment combines every selected object detection algorithm with every selected classification algorithm. We use the F1-score as the main metric to compare the combinations. Next to the F1-score, we also measure recall, precision and accuracy.

Experiment # | Object detection algorithm | Classification algorithm
1 | #1 AP | #1
2 | #1 AP | #2
3 | #1 mAP | #1
4 | #1 mAP | #2


Chapter 5

Results

Chapter 4 discusses the three experiments we conduct. This chapter shows the results of each experiment: first the object detection experiment, followed by the classification experiment and finally the pipeline experiment.

5.1

Object Detection

In section 4.2, we explain our object detection experiment. We compare six different object detection algorithms: YOLO, Faster R-CNN, Overfeat, SSD, R-FCN and RetinaNet. We compare these algorithms on three different metrics: Average Precision (AP) with a threshold of 0.50, mean Average Precision (mAP) with thresholds 0.50:0.95, and evaluation time in seconds. Individual results can be found in Appendix A.

We calculate the Average Precision (AP) for every object in the gas installation and for each algorithm. Table 5.1 shows the full comparison. From our experiment, we can conclude that YOLOv3 achieves, on average, the highest Average Precision (AP). YOLOv3 also achieves the highest score for seals, b-valves and gas-meters; only R-FCN beats YOLOv3 for the detection of gas-taps.

Algorithm | Seals | B-valve | Gas-tap | Gas-meter | Average
YOLOv3 | 84.3 | 96.7 | 55.7 | 97.0 | 83.4
Faster R-CNN (Inception ResNet) | 79.5 | 93.6 | 56.0 | 96.9 | 81.5
Faster R-CNN (ResNet-101) | 70.7 | 89.0 | 54.2 | 96.7 | 77.7
Overfeat | 65.1 | 91.5 | 57.6 | 96.3 | 76.6
SSD | 53.6 | 71.0 | 29.0 | 92.0 | 61.6
R-FCN | 75.9 | 92.9 | 60.2 | 96.8 | 81.4
RetinaNet | 73.1 | 93.2 | 58.6 | 96.3 | 80.3

Table 5.1: Average Precision comparison (multiplied by 100)

The mean Average Precision (mAP 0.50:0.95) is calculated with thresholds starting at 0.50 and increasing in steps of 0.05 until we reach 0.95. Table 5.2 displays the comparison of all objects and algorithms using the mAP metric. Faster R-CNN based on Inception ResNet outperforms all other algorithms: it achieves the highest score on average and for all objects.


Algorithm | Seals | B-valve | Gas-tap | Gas-meter | Average
YOLOv3 | 32.5 | 54.6 | 19.4 | 66.6 | 43.3
Faster R-CNN (Inception ResNet) | 34.7 | 55.8 | 25.1 | 71.1 | 46.7
Faster R-CNN (ResNet-101) | 25.0 | 45.0 | 18.7 | 69.0 | 39.4
Overfeat | 22.7 | 42.7 | 17.9 | 55.1 | 34.6
SSD | 19.4 | 26.6 | 8.9 | 61.5 | 29.1
R-FCN | 28.7 | 47.2 | 22.3 | 66.3 | 41.1
RetinaNet | 30.3 | 52.8 | 24.6 | 59.6 | 41.8

Table 5.2: mean Average Precision comparison (multiplied by 100)

We measure the time, in seconds, it takes to evaluate the entire test set of 286 images. Table 5.3 shows the evaluation time for the GPU-accelerated machine and the CPU machine: the total time to evaluate the 286 images and the average time per image.

Algorithm | GPU total time | GPU average/image | CPU total time | CPU average/image
YOLO | 71.41 | 0.25 | 360.60 | 1.26
Faster R-CNN (Inception ResNet) | 186.40 | 0.50 | 4038.32 | 14.12
Faster R-CNN (ResNet-101) | 71.11 | 0.25 | 1392.82 | 4.87
Overfeat | 23.63 | 0.08 | 401.63 | 1.40
SSD | 33.03 | 0.11 | 117.62 | 0.41
R-FCN | 53.63 | 0.19 | 589.16 | 2.06
RetinaNet | 46.14 | 0.16 | 1185.80 | 4.14

Table 5.3: Evaluation time for 286 images in seconds

5.2

Classification

In section 4.3, we set up our experiment to determine the best performing classification algorithm for our dataset. We calculate the F1-score, precision, recall and accuracy to rank the algorithms. This section shows the results of the following algorithms: Logistic Regression, Naive Bayes Classifier, Support Vector Machines, Decision Tree, Random Forest, Boosted Trees, Neural Networks and Nearest Neighbors.

For Logistic Regression (table B.1), Naive Bayes Classifier (table B.2), Support Vector Machine (table B.3) and Random Forest (table B.5), data experiment 2 achieves the highest score. Data experiment 1 achieves the highest score for Decision Tree (table B.4) and Boosted Trees (table B.6). Neural Networks (table B.7) perform best using data experiment 5. Two algorithms perform equally well on two different data experiments: Decision Tree (table B.4) achieves the same score for data experiments 1 and 2, and Nearest Neighbors (table B.8) for data experiments 1 and 4.

For each algorithm, we took the highest scoring data experiment (by F1-score) and combined them in table 5.4. Three algorithms achieve the best F1-score of 0.74: Logistic Regression has the best recall, Boosted Trees the best precision, and Random Forest is more balanced in terms of precision and recall.


Algorithm | Experiment | Accuracy | Precision | Recall | F1-score
Logistic Regression | 2 (full, rescale, object) | 0.89 | 0.78 | 0.70 | 0.74
Naive Bayes Classifier | 2 (full, rescale, object) | 0.83 | 0.63 | 0.61 | 0.62
Support Vector Machine | 2 (full, rescale, object) | 0.88 | 0.82 | 0.64 | 0.72
Decision Tree | 1 (full, rescale, category) | 0.88 | 0.88 | 0.56 | 0.69
Decision Tree | 2 (full, rescale, object) | 0.88 | 0.88 | 0.56 | 0.69
Random Forest | 2 (full, rescale, object) | 0.89 | 0.89 | 0.66 | 0.74
Boosted Trees | 1 (full, rescale, category) | 0.90 | 0.93 | 0.61 | 0.74
Neural Networks | 5 (center, rescale, category) | 0.87 | 0.85 | 0.52 | 0.64
Nearest Neighbors | 1 (full, rescale, category) | 0.88 | 0.82 | 0.62 | 0.71
Nearest Neighbors | 4 (full, relative, object) | 0.88 | 0.82 | 0.62 | 0.71

Table 5.4: Classification Results

5.3

Pipeline

In section 4.4, we explain the setup of the pipeline experiment. We use YOLOv3 and Faster R-CNN Inception ResNet as object detection algorithms and Logistic Regression, Random Forest and Boosted Trees as classification algorithms. We evaluate the combinations on a new dataset that remains the same throughout the pipeline experiment and was not used in previous experiments. Table 5.5 shows the results of this experiment. In the classification experiment (section 5.2), we noticed that the algorithms each performed well on different aspects; we see these aspects back in the pipeline experiment. Logistic Regression has the best recall, Boosted Trees the highest precision, and Random Forest the best balance between precision and recall.

From the table, we can conclude that YOLOv3 in combination with Random Forest and Faster R-CNN in combination with Boosted Trees achieve the highest F1-score, with YOLOv3 with Boosted Trees a close second. Both detectors combined with Logistic Regression achieve the lowest scores.

Detector | Classification | Data experiment | Accuracy | Precision | Recall | F1-score
YOLOv3 | Logistic Regression | 2 (full, rescale, object) | 0.70 | 0.39 | 0.56 | 0.46
YOLOv3 | Random Forest | 2 (full, rescale, object) | 0.86 | 0.80 | 0.50 | 0.62
YOLOv3 | Boosted Trees | 1 (full, rescale, category) | 0.86 | 0.88 | 0.45 | 0.60
Faster R-CNN | Logistic Regression | 2 (full, rescale, object) | 0.75 | 0.46 | 0.56 | 0.50
Faster R-CNN | Random Forest | 2 (full, rescale, object) | 0.81 | 0.59 | 0.50 | 0.54
Faster R-CNN | Boosted Trees | 1 (full, rescale, category) | 0.85 | 0.74 | 0.53 | 0.62

Table 5.5: Pipeline results


Chapter 6

Discussion

6.1

Object Detection

In our experiment, we notice that gas-taps lag in accuracy compared to the other objects. This is probably due to the variation in gas-taps. Our initial thought was the size of the objects: on average, gas-taps have a height and width of 100x115 pixels, which seems small compared to gas-meters (458x613 pixels) and b-valves (219x235 pixels). However, seals, with an average height and width of 89x119 pixels, score much better on AP and mAP than gas-taps. When taking a closer look at the gas-taps in our dataset, we notice that there are many different gas-taps and that they can appear in different orientations. In order to train an object detection algorithm with higher accuracy, we need to gather more data on the different gas-taps.

During our first experiments, we made use of an easy-to-use COCO evaluation tool. This tool made it very accessible to evaluate a model according to the COCO standard. After we evaluated several models with this tool, we noticed that it scored different models with the exact same score: 71.9. We then used the official COCO evaluation tools and found that those results were different. We reported this to the author of the tool and made him aware of the difference in evaluation. From that point onwards, we used the official COCO evaluation script.

We trained and evaluated many different algorithms for this research, and each algorithm has its own method of determining whether the model is done training. In the implementation of YOLO, we encountered automatic early-stopping functionality, which is part of the Keras library. We recommend future systems to implement this early-stopping functionality: it removes the need to manually determine when the model is done and stop the training, and letting a machine decide when to stop training reduces human errors in determining when a model is at its optimum.

When we compare our results, trained on the Alliander dataset, to the research of Huang et al. [23], we see a similar ordering of algorithms. The ordering of object detection algorithms in the Huang research is thus not biased towards the dataset used: the object detection algorithms show similar characteristics when facing different datasets.

It is interesting to see that two different algorithms come out on top for AP and mAP. For Average Precision, YOLOv3 is the top performing algorithm and Faster R-CNN comes second (a difference of 1.9 AP). For mean Average Precision, Faster R-CNN achieves the highest score and YOLOv3 comes second (a difference of 3.4 mAP). Since mAP averages over stricter IoU thresholds, this means Faster R-CNN localises objects more precisely, while YOLOv3 detects objects slightly better at the looser 0.50 overlap threshold.


6.2

Classification

We have trained multiple algorithms, which can be placed in different categories: linear models, trees, neural networks and nearest neighbors. In our experiment, tree-based algorithms achieve, on average, the highest F1-score.

Five out of the ten top results are achieved with data experiment 2 (full, rescale, object); three out of ten with data experiment 1 (full, rescale, category). The combination of full object information and rescaling thus achieves the highest score eight out of ten times.

In our object information experiment, we test whether we need the full object information (x, y, width, height) or whether we can make do with only the center positions (x-center, y-center). From our results, we notice that full object information outperforms center positions in 28 out of the 32 experiments. The algorithms we train on our problem perform better when receiving full object information. We recommend other researchers to use full object information and let the algorithm determine what to do with the data.

In the normalisation experiment, we test basic rescaling (to 640x480 pixels) versus positions relative to the gas-meter. Rescaling outperforms relative positions in 21 out of the 32 experiments. We believed relative object positions would be a better normalisation of object positions and achieve a higher score; our experiment shows that this is not the case.

Finally, we experiment with object ordering. Ordering on object level outperforms category ordering in 17 out of the 32 experiments; category ordering outperforms object ordering in 11 out of 32. In four experiments it does not matter whether we use object ordering or category ordering.

6.3

Pipeline

Three different classification algorithms achieve the highest score in our classification experiment (section 5.2): Logistic Regression has a high recall and low precision, Boosted Trees has a high precision and low recall, and Random Forest has both precision and recall in between the other algorithms. In our pipeline experiment, we notice similar characteristics.

In our object detection experiment, YOLOv3 and Faster R-CNN achieve the highest score for AP and mAP. In the pipeline experiment we notice that Faster R-CNN outperforms YOLOv3 two out of three times.

For now, this research sounds promising: we achieve a precision of 74% and 80%. However, the recall of 50% and 53% is more of a problem. A recall of 50% means that only half of the incorrect gas installations are marked as incorrect; the other half are not detected. Before using this system, we should focus on increasing its recall: for a safety system, we want to detect as many incorrect installations as possible and care less about correct installations being marked as incorrect.

Currently, our approach detects seals, b-valves, gas-taps and gas-meters. Detecting more objects in the installation, for example gas pipes or gas entry points in a house, might increase the accuracy of the pipeline.


Chapter 7

Related Work

This chapter presents various studies conducted in the same areas as our thesis. Where relevant, we compare the approaches to our own and highlight the differences.

7.1

Quality assurance using images

Edreschi et al. present an approach to classify potato chips using pattern recognition on colour digital images. This approach consists of five steps: (1) image acquisition, (2) preprocessing, (3) segmentation, (4) feature extraction, and (5) classification. They combine the results of segmentation and feature extraction in their classification. Edreschi et al. achieve a confidence of 78% and a probability of 95%. [11]

Morison et al. use image analysis to determine the quality of fish. Their approach, proposed in 1998, is largely manual: features about the fish are gathered from images and added to a spreadsheet, and the spreadsheet classifies the fish into age groups. Although they do not use an automated system to process the images, their work shows that quality can be determined from images. [35]

Brosnan et al. describe the need for an automated food quality system. In their paper, they use X-ray images and image classification to solve this problem [4]. Chang et al. use a technique called transfer learning to retrain a model to recognize breast cancer from images [5]. Xia and Xu use transfer learning and image classification to classify flower types [59]. These studies all use images to determine the quality of an object.

The approaches presented above all show quality assurance using images. Older approaches use image analysis and classification to determine quality, while newer approaches use image classification directly. Our approach is a combination of both: we use newer technology such as object detection as a feature selector and use classification (like Edreschi et al.) to determine the quality.

7.2 Object detection benchmark

Erhan et al. propose a new object detection algorithm and evaluate it against other object detection algorithms. They make use of an object detection challenge, in this case Pascal VOC, which is widely used to evaluate and compare object detection algorithms. They measure the accuracy of the model using Average Precision (AP). [12]

Huang et al. present a deep comparison of object detectors. They compare three different object detection algorithms, each of which can be based on six different backbones (feature selectors). All the algorithms are trained on the COCO challenge. Huang et al. compare the algorithms on mean Average Precision (at multiple thresholds), training time and memory. [23]

The papers presented above each use a single dataset to train the object detection algorithms. After training, the algorithms are evaluated using Average Precision and mean Average Precision. Our approach is similar to theirs. The difference in our experiment is the dataset: they make use of open-sourced object detection challenges to determine the accuracy of the algorithms, whereas we use a specialized dataset to determine the accuracy in a real-world scenario.

7.3 Classification benchmark

Lessmann et al. compare several classification algorithms. They train classifiers on ten different public datasets. All classifiers are evaluated using AUC (Area under Curve). This research does not perform any fine-tuning. [29]

Our experiment on classification is almost the same as the research done by Lessmann et al. We train and evaluate several classification algorithms. Instead of ten different datasets, we only use one dataset. For evaluation, we use F1-score instead of AUC.

Kou et al. [28] compare classification algorithms on eleven different public datasets. They score the algorithms on ten different performance criteria. The datasets and performance criteria are fixed; the only variable is the classification algorithm.

Instead of eleven different datasets, we only have one dataset. This is the only difference between our research approach and theirs; we both evaluate multiple classification algorithms on multiple performance criteria.

7.4 Chaining Machine Learning algorithms

Fasel et al. present an approach to determine facial expressions. Using object detection, they detect the face and locate the eyes in an image. Following object detection, Fasel et al. use image cropping to create a new image containing only the eyes. They use the newly created image to classify the facial expression. [15]

Their approach is very similar to ours. By first applying object detection and image cropping, Fasel et al. remove all unnecessary data before classification. The difference between our approaches is the step between object detection and classification: we use the object positions for classification instead of a cropped image.

We introduced the work by Edreschi et al. in section 7.1. They use images to classify potato chips. First, they use pattern recognition to determine the features in an image. These features are then used in a classification algorithm to classify the potato chip. [11]

This approach is comparable to ours: we use feature selectors first and then classification. However, we use object positions instead of pattern recognition as input for our classification algorithm.

Object detection algorithms come in two variants: one-stage and two-stage detectors. Two-stage detectors use an existing feature selector to extract features from an image and then train a localization algorithm on top. These algorithms are then chained together. [55, 51, 46]

Our approach is loosely coupled compared to object detection algorithms. We could create one big algorithm that combines object detection and classification, but then we would not be able to replace parts of the system independently.
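As a sketch, this loose coupling can be expressed as plain function composition, where the three callables stand in for the trained detector, the feature preparation and the trained classifier; any of them can be replaced without touching the others.

# Sketch: the pipeline as function composition; each stage is a separate component.
def run_pipeline(image, detector, encoder, classifier):
    detections = detector(image)     # stage 1: image -> list of labelled boxes
    features = encoder(detections)   # stage 2: boxes -> fixed-length feature vector
    return classifier(features)      # stage 3: features -> correct / incorrect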

7.5 Dataset

The Pascal Visual Object Classes (VOC) challenge consists of five different challenges: classification, detection, segmentation, action classification and person layout. The challenge ran from 2008 until 2012. For the evaluation of the object detection challenge, they use the metric Average Precision, which is based on Intersect over Union (IoU) with a threshold of 0.50. The Pascal VOC object detection challenge has 20 classes. The challenge consists of 11,540 images and 31,561 objects. [13]
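For reference, the sketch below computes IoU for two boxes given as (x, y, width, height); a detection counts as correct when its IoU with a ground-truth box reaches the threshold.

# Sketch: Intersect over Union (IoU) between two (x, y, width, height) boxes.
def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 100, 100), (50, 0, 100, 100)))  # 0.33..., below a 0.50 threshold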

Microsoft Common Objects in Context (COCO) is a newer object detection challenge. Compared to Pascal VOC, COCO has more object classes and more instances of the objects to be detected. It also has a higher average number of object instances per image. The COCO object detection challenge contains 328,124 images and 80 classes. COCO uses mean Average Precision to evaluate the algorithms, using multiple IoU thresholds from 0.50 to 0.95 in steps of 0.05. [9]

We have created a gas installation dataset, focused on detecting objects of a gas installation. In our dataset, we have four different object classes to be detected. The dataset contains 1430 images and 7058 objects. Compared to Pascal VOC and COCO, our dataset is small. However, our dataset is more specialized: Pascal VOC and COCO are challenges to detect objects in many different contexts, whereas gas installations are mostly in the same context and the objects of a gas installation look very similar.


Chapter 8

Conclusion

The goal of this research is to find a solution that can detect incorrect gas installations based on images. In our pre-study, we found that state-of-the-art image classification is not suitable for the task. In chapter 3 we describe our approach to solve this problem. To find out whether our solution performs well, we ask the following research question: Will a classification algorithm, based on object positions, yield a better F1-score than regular image classification for the detection of correct/incorrect installations?

First, we need to answer SR1: What object detection algorithm scores the best on accuracy for detecting objects in gas installation images? Chapter 4.2 describes the experiment to find the best performing object detection algorithm. From our experiment, we found that YOLOv3 achieves the highest Average Precision (AP 0.50) and Faster R-CNN Inception ResNet the highest mean Average Precision (mAP 0.50:0.95) score (chapter 5.1).

Secondly, we ask SR2: What Machine Learning algorithm yields the highest F1-score to classify correct or incorrect installation of a gas installation using object positions? Chapter 4.3 describes the experiment to answer this question. In our experiment, three classification algorithms achieve the highest score: Logistic Regression, Random Forest and Boosted Trees (chapter 5.2).

Our final experiment (chapter 4.4) shows how we combine the object detection algorithm and classification algorithms to predict correct and incorrect gas installations. With our approach, we perform up to 106% better than the state-of-the-art: we achieve an F1-score of 0.62 compared to 0.30 for the state-of-the-art (chapter 5.3). Our approach outperforms the state-of-the-art on every metric. We achieve an accuracy of 0.86 compared to 0.63, a precision of 0.80 versus 0.26 and a recall of 0.50 versus 0.35.

                  Accuracy  Precision  Recall  F1-score
Our approach          0.86       0.80    0.50      0.62
State-of-the-art      0.63       0.26    0.35      0.30

Table 8.1: Comparison of the state-of-the-art vs our approach
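As a consistency check, the F1-scores in table 8.1 follow directly from the reported precision and recall: F1 = 2 · (precision · recall) / (precision + recall), which gives 2 · (0.80 · 0.50) / (0.80 + 0.50) ≈ 0.62 for our approach and 2 · (0.26 · 0.35) / (0.26 + 0.35) ≈ 0.30 for the state-of-the-art.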

8.1 Future work

All gas installations are connected by pipes. If we can detect which elements are connected to each other, we can create a graph object with all connections. This graph can then be used to classify installations as correct or incorrect. This approach may outperform our pipeline because it preserves information about which elements are connected.
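As a rough illustration of what such a representation could look like (the element names and connections below are invented):

# Sketch: a gas installation as a graph; nodes are detected elements, edges are pipes.
# Names and edges are invented for illustration only.
installation = {
    "nodes": {"gas-meter": (120, 40), "gas-tap": (300, 45), "b-valve": (80, 200)},
    "edges": [("gas-meter", "gas-tap"), ("gas-meter", "b-valve")],
}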


Future work can also focus on hyper-tuning the top-performing algorithms from our research. Using hyper-tuning, we can improve on our results.


Bibliography

[1] Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz. “Applying support vector machines to imbalanced datasets”. In: European conference on machine learning. Springer. 2004, pp. 39–50.

[2] Álvaro Bayona, Juan Carlos SanMiguel, and José M Martínez. “Comparative evaluation of stationary foreground object detection algorithms based on background subtraction techniques”. In: Advanced Video and Signal Based Surveillance, 2009. AVSS'09. Sixth IEEE International Conference on. IEEE. 2009, pp. 25–30.

[3] Oren Boiman, Eli Shechtman, and Michal Irani. “In defense of nearest-neighbor based image classification”. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE. 2008, pp. 1–8.

[4] Tadhg Brosnan and Da-Wen Sun. “Improving quality inspection of food products by computer vision – a review”. In: Journal of food engineering 61.1 (2004), pp. 3–16.

[5] J. Chang et al. “A method for classifying medical images using transfer learning: A pilot study on histopathology of breast cancer”. In: 2017 IEEE 19th International Conference on e-Health Networking, Applications and Services (Healthcom). Oct. 2017. doi: 10.1109/HealthCom.2017.8210843.

[6] Olivier Chapelle, Patrick Haffner, and Vladimir N Vapnik. “Support vector machines for histogram-based image classification”. In: IEEE transactions on Neural Networks 10.5 (1999), pp. 1055–1064.

[7] Nitesh V Chawla. “Data mining for imbalanced datasets: An overview”. In: Data mining and knowledge discovery handbook. Springer, 2009, pp. 875–886.

[8] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. “Multi-column deep neural networks for image classification”. In: Computer vision and pattern recognition (CVPR), 2012 IEEE conference on. IEEE. 2012, pp. 3642–3649.

[9] COCO. Detection Evaluation. url:http://cocodataset.org/#detections-eval.

[10] Jifeng Dai et al. “R-fcn: Object detection via region-based fully convolutional networks”. In: Advances in neural information processing systems. 2016, pp. 379–387.

[11] FP Edreschi et al. “Classification of potato chips using pattern recognition”. In: Journal of Food Science 69.6 (2004).

[12] Dumitru Erhan et al. “Scalable object detection using deep neural networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 2147–2154.

[13] M. Everingham et al. “The Pascal Visual Object Classes Challenge: A Retrospective”. In: International Journal of Computer Vision 111.1 (Jan. 2015), pp. 98–136.

[14] Experiencor. Training and Detecting Objects with YOLO3. Apr. 2018. url: https://github.com/experiencor/keras-yolo3.

[15] Ian Fasel, Bret Fortenberry, and Javier Movellan. “A generative framework for real time ob-ject detection and classification”. In: Computer Vision and Image Understanding 98.1 (2005), pp. 182–210.


[17] Aitor Gastón and Juan I García-Viñas. “Modelling species distributions with penalised logistic regressions: A comparison with maximum entropy models”. In: Ecological Modelling 222.13 (2011), pp. 2037–2041.

[18] Aurélien Géron. Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly, 2017.

[19] Ross Girshick. “Fast r-cnn”. In: arXiv preprint arXiv:1504.08083 (2015).

[20] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. “Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning”. In: International Conference on Intelligent Computing. Springer. 2005, pp. 878–887.

[21] Robert M Haralick, Karthikeyan Shanmugam, et al. “Textural features for image classification”. In: IEEE Transactions on systems, man, and cybernetics 6 (1973), pp. 610–621.

[22] Joseph C Harsanyi and C-I Chang. “Hyperspectral image classification and dimensionality reduction: an orthogonal subspace projection approach”. In: IEEE Transactions on geoscience and remote sensing 32.4 (1994), pp. 779–785.

[23] Jonathan Huang et al. “Speed/accuracy trade-offs for modern convolutional object detectors”. In: CoRR abs/1611.10012 (2016). arXiv: 1611.10012. url: http://arxiv.org/abs/1611.10012.

[24] Jonathan Huang et al. “Speed/accuracy trade-offs for modern convolutional object detectors”. In: IEEE CVPR. 2017.

[25] Nathalie Japkowicz. “The class imbalance problem: Significance and strategies”. In: Proc. of the Int Conf. on Artificial Intelligence. 2000.

[26] László A Jeni, Jeffrey F Cohn, and Fernando De La Torre. “Facing imbalanced data–recommendations for the use of performance metrics”. In: Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on. IEEE. 2013, pp. 245–251.

[27] Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas, et al. “Handling imbalanced datasets: A review”. In: GESTS International Transactions on Computer Science and Engineering 30.1 (2006), pp. 25–36.

[28] Gang Kou et al. “Evaluation of classification algorithms using MCDM and rank correlation”. In: International Journal of Information Technology & Decision Making 11.01 (2012), pp. 197–225.

[29] Stefan Lessmann et al. “Benchmarking classification models for software defect prediction: A proposed framework and novel findings”. In: IEEE Transactions on Software Engineering 34.4 (2008), pp. 485–496.

[30] Rainer Lienhart and Jochen Maydt. “An extended set of haar-like features for rapid object detection”. In: Image Processing. 2002. Proceedings. 2002 International Conference on. Vol. 1. IEEE. 2002, pp. I–I.

[31] Tsung-Yi Lin et al. “Focal loss for dense object detection”. In: arXiv preprint arXiv:1708.02002 (2017).

[32] Tsung-Yi Lin et al. “Microsoft coco: Common objects in context”. In: European conference on computer vision. Springer. 2014, pp. 740–755.

[33] Wei Liu et al. “Ssd: Single shot multibox detector”. In: European conference on computer vision. Springer. 2016, pp. 21–37.

[34] Vladimir Y Mariano et al. “Performance evaluation of object detection algorithms”. In: Pattern Recognition, 2002. Proceedings. 16th International Conference on. Vol. 3. IEEE. 2002, pp. 965–969.

[35] Alexander K Morison, Simon G Robertson, and David C Smith. “An integrated system for production fish aging: image analysis and quality assurance”. In: North American Journal of Fisheries Management 18.3 (1998), pp. 587–598.
