Car damage estimation using deep learning

(1)

i | P a g e

Master of Business Administration

specialization in Big Data & Business Analytics

Master Thesis

CAR DAMAGE ESTIMATION USING DEEP LEARNING

by

JAMES PAUL GNANASEKARAN

11417714

30-Sep-2018

Supervisor:

Assessor:

(2)

ii

Acknowledgments

My sincere gratitude goes out to my supervisor Dr. Stevan Rudinac for his valuable guidance and constructive suggestions throughout this research.

Thanks to my managers Yvonne vd Veen and Judith Toet for their encouragement and allowing me flexible working hours which helped enormously with balancing my work and study.

I am grateful to the following colleagues of mine:

Gerard Koch, for his creative ideas, inspiration, and unfailing support;

Paul Knapper, for his assistance in getting the images needed for this research;

Olaf van Haver, for the useful discussions around labeling and model setup in this project; Vasilis Bankov, for helping me get started with AWS cloud9 and lambda services.

Personally, I would like to thank my family and friends, for their love and moral support throughout these years.

(3)

iii

Executive summary

This thesis investigates the possibility of estimating car damages by applying deep learning technique on the images available with Aegon. An end-to-end solution approach is proposed. In this initial research, we offer a solution for classifying the severity of the damage into low, medium and high. This thesis has given valuable insights into the data and the latest techniques that are available for image classification. The challenges, recommendations, and direction for further research are listed down.

Recently some start-ups have started entering the market in this space, which shows the potential for such a solution. This solution has use case not only for car Insurance but also for other insurance products such as building damage, building contents and liability and even for car rental companies. The biggest differentiator for Aegon is the data that it has on the claims and the insurance expertise. It would be worthwhile for Aegon to pursue this research further and explore the possibility of kick-starting this as a separate venture either with or without partnering with start-ups in this domain.

(4)

iv

put together by Daimler Chrysler, SPSS and NCR to satisfy the need to have a standard approach to handle data mining problems, irrespective of the industry, modeling tool and application. It involves x phases. It is an adaptive approach, in the sense that we need to go back to the previous phase if questions arise in the current stage.

In the Data understanding phase, it is essential to do exploratory analysis and get an understanding of the data and gather initial insights. In this case, as discussed further in section ‘Data understanding’ it was helpful to identify the different types of input images, which helped to decide the models (image classifiers needed).

Data preparation is a labor-intensive phase, in which we prepare the data before using it to train the model. We used a supervised machine learning technique, which needs labeled images.

A label indicates category/classification (also called class label) of the image. For instance, if we want to train a model to classify an image as either Car or Not-car, we need images containing car as well as images that do not include a car, along with the label for each image. Some models would expect images of a fixed size, in which case we need to resize the image before we can feed it into a model. In the case of class imbalance (proportion of images belonging to a particular class is much smaller than other class), we may have to do data augmentation – increase the number of images by applying specific techniques such as flip, mirroring, shearing and rotation – to ensure that the model is not biased towards the class that is in high proportion. In the modeling phase, we select appropriate models to solve the business problem. The model is trained by fine-tuning the parameters to get the optimum results. Several different models might be applied. Moreover, we may need to go back to the data preparation phase to get the data in the format expected by the model. The modeling phase would result in one or more models.

In the evaluation phase, we evaluate the models based on appropriate metrics and ascertain if the model meets the intended goals set out in the business/research understanding phase. The outcome of the evaluation phase could either result in choosing a model for use or going back again to the modeling phase.

The developed model must finally be integrated into the business process for it to be useful. A deployment strategy for the model along with the steps needed to execute it should be in place. This phase also involves monitoring and maintenance of the model. It is essential that we continue to monitor our image classification model and ensure that the model is re-trained when needed. For instance, over time the prediction might be off when the input images are very different from the images that were used to train the model. We will see in the subsequent sections the activities carried out per phase.

(7)

2

2.Business understanding

Aegon is a financial services company operating worldwide with products ranging from life insurance, property and casualty, pension and mortgages. Car insurance product is sold in Aegon country units, Netherlands and Hungary.

In the Netherlands, an insured can take the car directly to the garage for repair. The garage notifies Aegon and fixes the damage. The cost is settled by Aegon directly with the garage. The communication between the garage/repair shops and Aegon goes via Audaflow (Solera), and claims below a certain limit are settled automatically. Claims of specific categories require an expert inspection.

In Aegon Hungary, the customer contacts the Insurance company directly and the claim handler requests for pictures of the car damage and estimates the damage. The claim handlers use an application in which they select the make and model of the car, different parts that were damaged and estimate based on their judgment if the part needs to be repaired or replaced. Based on these inputs, the application gives back a cost estimate referring to a price catalog.

Automatically estimating the damage would improve the business process by saving efforts for the claim handlers, reduce fraud and enhance the customer experience.

Research question

With the given advancements in deep learning neural networks for computer vision, we aim to investigate whether it is possible to:

1. Assess the severity of the damage (low/medium/high) 2. Detect the parts of the car from the image

3. Estimate if the damage can be repaired or replaced, and the cost estimate

Cost estimate could be done based on the available price catalog for car repairing based on the make and model of the car and the part damaged. In this initial research, we addressed the first item. However, the models suggested in this thesis could be used as a starting point for researching further the other two items. Recommendations for future work is provided in the last section.

Literature review

Research on the relevant literature was focused on the application of computer vision for identifying car parts and damages.

In general, image classification which comes very naturally for a human is instead a hard problem for a computer. Over the years, computer vision has evolved to a state where image classification by computers is improving. Nowadays, for a computer, it is relatively easy to detect objects in an image, such as a car, but it is very challenging to identify the make and model of a car. This type of classification is referred to as fine-grained image classification, as there is less variance between different models. Variations between different make and model are very subtle which makes the problem harder. In our situation, we want to identify the damaged part of a car and assess the severity of the damage. The challenge is in detecting and differentiating the car parts with damage and those without damage, which is a fine-grained classification problem.

(8)

3 In 2011 (Chavez-Aragon et al., 2011) applied the technique Cascade of Boosted Classifiers (CBC) (Viola & Jones, 2001) to detect up to fourteen different parts of a car base on images showing the lateral view of the car.

The earliest work related to estimating car damages using images is from 2013. (Jayawardena, 2013) proposed a method for vehicle damage detection by overlaying a 3D CAD model over an image of the damaged vehicle and estimating damage by the extent of the deviation of the edges to the 3D model. Deep learning-based techniques are recently being tried out for automated car damage classification. (Patil et al., 2017) have applied feature extraction using various pre-trained networks, to identify the damage type on images of car damages downloaded from the web. They used an ensemble method to combine the prediction from the pre-trained networks, which worked best. (de Deijn, 2018) has applied fine tuning with a pre-trained VGG16 model to estimate car damages on images downloaded from the web. They tried to identify the type of damage, location, and severity of the damage. Terms such as feature extraction, fine-tuning and VGG16 are explained in the section on neural networks.

In the recent years, some startups have come up focusing on applying deep learning to estimate car insurance damages. Tractable founded in 2014 is a startup based in the UK which is active in this area. In 2017 it partnered with the Belgian Insurer Ageas to automate their car insurance claim assessment. Based on a set of images of car damage, it classifies the images into various pre-defined parts of the car and assesses the severity of the damage. It uses ‘interactive learning’ to do labeling faster. By applying multi-instance learning, the model can analyze a set of images taken at different angles on the same damage before it makes the final prediction(“Automotive Insurance with TensorFlow,” 2018). It claims that their technique is ten times faster than a human claim expert at estimating damages.

In 2017, Ant financials (a group of Alibaba) has come up with a deep learning based computer vision App ‘Dingsunbao,’ that does a damage estimation based on pictures of damaged cars. Ant financials set up a challenge against claim experts to evaluate 12 cases, wherein the app took 6 seconds for all the cases while the claim experts took 6mins and 48 seconds(“China Auto Insurance Claims Adjusters Get AI Boost from Ant,” 2017) and the decision on the claims was the same. Alibaba group says “Dingsunbao 2.0’s secret sauce includes 46 patented technologies, such as simultaneous localization and mapping, a mobile deep-learning model, damage detection with video streaming, results in a display with augmented reality and others.”

In this thesis, we want to explore the possibilities and challenges of applying deep learning to estimate damage severity based on the images that Aegon has received from the repair shops.

(9)

4

3.Data understanding

The images in scope for the thesis are the images received by Aegon Netherlands for car insurance claims in the year 2017. In total there are 77014 images which belong to 13239 claims. All these images were unlabeled.

As we saw earlier in the section Business understanding, the communication between the repair shops and the Insurance companies goes via Audaflow. Audaflow verifies the information sent by the repair shops, and it generates a PDF report and sends it to the claim administration system, with the pictures embedded in the PDF. We could extract the pictures from the PDF, but the quality of these pictures would be low. Audaflow also stores the images in their native format(.jpg) with the same quality as sent by the repair shops. These images were of high quality and were used for this research.

The images could contain personally identifiable information such as the license plate or persons. So, in line with the organization guidelines, a Privacy Impact Assessment (PIA) was done to get the clearance from the Privacy board to do the analysis.

Exploratory analysis of the data

As suggested by CRISP model, before applying the models we started with understanding the data – the kind of images that are available. 77.000 is quite a large set to examine manually. So, a subset with all claims submitted in January and February was selected. This subset had in total 10280 images which formed the basis of this research.

Although this was a tedious job, it proved to be very useful in gaining insight into the kind of images we had. Initially started viewing the images as icons (thumbnail view) in windows explorer. However, later in the exploration phase, came across an open source tool called Orange,(“Orange – Data Mining Fruitful & Fun,” 1997) an open source machine learning, visualization software from the University of Ljubljana.

Orange helps to create machine learning pipelines quickly without coding, by just dragging and dropping the pre-built components. For analyzing the images, we built a workflow to do feature extraction using pre-trained ImageNet VGG-16, and then do hierarchical clustering on those features. Orange also has a nice feature that allows one to select the images interactively from the hierarchical clusters, and then display/save them separately in a folder, which was very useful for labeling a particular group

Exploration of the different clusters brought to light that we had many images that didn’t belong to the car. Refer Figure 2 for the overview of the different types of images that are present in the input. There were images of company logos (two of them were prominent and many duplicates as they were present in many claims), Insurance certificate (green card of the car) and other documents (forms filled by repair shops, additional notes). So, we had to have a model to identify these non-car images and filter them out of the pipeline.

Next, when we looked further into the images of the car, there were images of the dashboard, cars in full view, separate parts showing works-in-progress or fully repaired. These images were present across different damage categories. These images would not allow the models to learn well and had to be filtered before we can use the images to train the models to detect car parts and damage severity. For assessing the severity of the damages, we need the images which have a close-up view of the damages and the images from the full view of the car with significant damages.

(10)

5

Figure 2 Overview of the different images present in the input

The input contains different kinds of images – logos and documents, dashboards and pictures of chassis numbers, separate parts showing work-in-progress or finished, full view of the car and close-up shots of the damages.

It is essential that we have a good understanding of the data before we start with modeling. Though the task of identifying car damages might sound generic, we need to look into the input images to ensure that we set up the correct pipeline.

Exploration of the images gave a good understanding of the data and shed light on the machine learning pipeline needed (Figure 3). We built three different models that work in sequence. The first model separates the images into car and not-car. The images of car are then passed to the second model which separates them into four categories – dashboard, separate parts, cars in full-view and a close-up view of cars. We are interested in the close-up view of cars for estimating the damage. These images are then passed to a third model which estimates the severity of the damage. We also propose another model for future research, to identify the part that is damaged. The predicted severity and the part of the car can be used as input along with the price catalog to estimate the damage amount.

(11)

6

Model-1 Car vs Not-car classifer

Model-2 car drilldown classifier

Model-3 damage severity classifier

Model-4 car parts classifer - Low - Medium - High - Bumper - Door - Hood etc

Out of scope for this thesis Estimate

repair cost

In-scope of this thesis

Figure 3 Image classifier pipeline

Price catalog

(12)

7

4. Data preparation

We are using a supervised form of machine learning, which means that we need to know the label for all the images that are used for training the model. The biggest challenge was that the images were not labeled. We explored the option of manually labeling the images taking the help of claim handlers. However, this is an arduous task, as we would have to get large set of images validated and possibly seek validation with three different people and take a majority vote to ensure that it is not mislabeled. There are labeled data of cars available ((Tafazzoli et al., 2017), (Yang et al., 2015)) which have been used for the task of identifying vehicle make and model, which is also a fine-grained image classification problem. These image datasets have images showing the front, rear and the lateral view of the car. However, in our situation, we have images showing damages, and hence we cannot use these existing car datasets for training our model. We needed to label the images and use it for training our models. We tried different approaches for labeling as outlined below.

Labeling the images

Manual labeling

As mentioned in the exploration phase, we manually checked the thumbnails of the images in windows explorer. The first model classifies the images into car and not-car, this was much easier to identify as the not-car images were the logos and documents which stood out from the rest of the images. So, all the 10280 image thumbnails were inspected and labeled as either car or not-car.

Cluster and label

The other approach that we used was to first apply clustering, an unsupervised algorithm on the unlabeled images. Then randomly check some images in each cluster to see if we can identify a logical grouping and use it for labeling. As mentioned earlier in the section on exploratory analysis, Orange was helpful with

(13)

8 this task. Figure 5 shows the hierarchical clustering with the legs in the dendrogram selected for the images that contain the close-up shot of the tire /wheel of the car.

The visual clustering greatly helped to identify the various groups of images that were present. It was pleasantly surprised to see that all images belonging to various categories, such as car dashboard, wheels were automatically clustered in separate groups. With this approach, we labeled the input for the model-2 ‘car drill down’.

Figure 5 Image embedding and Hierarchical clustering

Data enrichment

The third approach we followed for labeling was to get the label of the images by connecting them to their corresponding claim information. We wanted to fetch additional information such as purchase value of car, claim reason and damage severity from the claim administration system and link it to the images. However, it was much more difficult than thought initially. Audaflow has its own case-id’s to manage the images. We had to match the claim number maintained in the claim administration system of Aegon to the case-id in Audaflow. However, the case-id of Audaflow is present only in a PDF report. So, first, we had to parse the PDF report to extract the required fields such as claim reason, repair cost, the value of the car, build year of the car. To add to the challenge, not all case-id’s had a match – some PDFs were of a slightly different format, and some were missing the required information. Furthermore, the standard report from the repairers does not contain information on a specific category of claims (windshield damage, claims involving experts), so we were not able to successfully match all the images to their corresponding damage severity information from the claim system.

Information about total loss is not administered correctly in the claim system. Only the purchase value is administered in the claim system. Repair cost vs. the purchase value of the car was used to categorize the damage. Based on the input from claim handlers the thresholds were determined as follows; <= 5% is

(14)

9 categorized low, >5 and <= 15% as medium and > 15% as a high category. Depreciation value is not considered.

Pre-processing the images

The input layers of the VGG16 model expects the images of 224 x 224 size. The input images that we have are of different sizes. They need to be pre-processed before they can be sent to the model for the training. Keras has a function to do pre-processing, and it makes it easier to build the pipeline. When resizing the images, the aspect ratios were maintained.

(15)

10

5. Modeling

How does the computer see an image?

The basic unit of an image is a pixel, which denotes the intensity of the color. The computer sees an image just as an array of numbers, each denoting the pixel intensity. In a greyscale, the image is black and white, and the intensity of a pixel could vary between 0 and 255. The light intensity increases from 0 to 255. In the RGB color space, each pixel is indicated by three numbers, each denoting the intensity of the colors Red Green and Blue respectively. In the computer, an image in the RGB color space is therefore represented as a three-dimensional array – Width x Height x Channel depth. The channel depth is 1 for greyscale images and 3 for RGB color space.

Combination of various values in the RGB color space produces different colors. A [0,0,0] would indicate black and [255,255,255] would indicate a white color (mix of an equal proportion of red, green and blue). An image of 640x480 pixel in the RGB color space will have 921,600 elements in the array.

Traditional computer vision techniques

For humans, the task of classifying an image is straightforward. However, for computers, it is a challenging task. Traditional computer vision techniques involved handcrafting features. Pixel intensities were not used directly as input for the image classifier. Instead, the features were extracted by applying different techniques, and then passed as input to the classifier. Various techniques were applied on the images to extract features such as the texture (Local binary patterns(Ojala et al., 2002), Haralick texture), Shape – (Hu moments, Zernike moments(Khotanzad & Hong, 1990)), color (color moments, color histograms, color correlograms(Huang et al., 1997)) and interesting regions of an image (Key point detector, local invariant descriptor(Lowe, 1999)).

However, in deep learning, the raw pixel intensities are passed directly as input to the neural network model, and the model learns the features automatically.

Hand crafted feature extractions Traditional Machine learning classifier Predicted class label Images Convolutional neural network Automatically detects features Predicted class label Images

(16)

11

Neural networks

Neural network is the buzzword recently. It might be surprising to know that they have been in existence since the 1940s. The reason it became so popular now is due to the increasing amount of data and the computing power that is available nowadays.

History of neural nets

The following section gives the history of the neural network, when it started and where it stands today. Neural networks have been called by different names – Cybernetics, connectionism and the familiar Artificial Neural Network (ANN) – over the years. The first neural network model was a binary classifier, introduced in 1943(McCulloch & Pitts, 1988). The problem with this model was that the weights had to be determined manually. This human intervention to adjust the model weights limited the scalability of the model. Later in 1950, Rosenblatt came up with the Perceptron algorithm, that could automatically learn weights. However a perceptron algorithm - which uses a linear activation function (step function) - can only solve a linear classifier and cannot solve a non-linear problem, irrespective of the number of layers in the network(Minsky & Papert, 1969). This limitation effectively put a freeze on the progress of research on neural networks for some period, which is also referred to as the AI winter. However, with the introduction of the backpropagation algorithm(Werbos, 1974), and the use of non-linear activation functions, neural networks have started becoming used widely.

A neural network with many layers is called as a deep learning network. In deep learning, the model applies what is called hierarchical learning. The lower layers of the network learn features such as edges; the higher layers tend to build upon the previously learned features and detects contours, which is then used by the next layers to detect the parts of the object in the image(Zeiler & Fergus, 2013).

(17)

12

What is a neural network model?

The human brain is composed of neurons which are interconnected by dendrites. Messages are passed from one neuron to another using electrochemical signals. Neural networks are not realistic models of the brain; instead they are inspired by the functioning of the brain. A Neural network is a directed graph structure with a set of nodes and the labeled connection between the nodes. Each of the nodes performs a simple computation. The connection passes the signal (output of one node) to another node, and the strength of the signal is indicated by labeled weights which either enhance or diminish the signal. Drawing a parallel to the functioning of the brain, each of these nodes in the neural network represents neurons. In a human brain, a neuron receives a signal from many neurons, and if the power of the signal exceeds a certain threshold, then the receiving neuron gets activated and sends out a signal to another set of neurons. The critical point here is that a neuron is either activated or not. There is no in-between state. Artificial neural network models mimic this brain function of neuron firing, using what is called an activation function.

Figure 7A cartoon drawing of a biological neuron (left) and its mathematical model (right).

Image source:(Karpathy, 2015) Activation function

A weighted sum is calculated based on the input X and weights W. This weighted sum is then passed through an activation function which determines if the neuron fires or not. There are several activation functions such as sigmoid, tan-hyperbolic and recently popular ReLU. Rectified LinearUnit(ReLU) function(Hahnloser et al., 2000) returns zero for negative inputs and increases linearly for positive inputs. ReLU is recommended over other activation functions, as it is computationally inexpensive and found to greatly accelerate the convergence of stochastic gradient descent(Karpathy, 2015).

Layer

A neural network can have multiple layers, with each layer having multiple nodes. There will always be an input layer and output layer, and there could be layers in between which are referred to as hidden layers. There could be more than one hidden layer. Earlier neural networks used to have a small number of hidden layers(LeNet, four layers), however, with the recent advancement in computing power we see networks with more hidden layers(ResNet 100+ layers). The more the number of layers the network has, the more it can learn. A deep neural network refers to a neural network with more hidden layers. When the activation function in a node sends out a signal, the signal is propagated further to the nodes in the next layer. This way the network propagates the signal all the way to the output layer where the prediction is made.

(18)

13 The input layer only contains the input to the network. The hidden layer and the output layer do contain an activation function.

An image classifier based on the neural network takes a set of input data (the raw pixels of the images along with the actual class labels) and feeds it to the input layer. The network learns a scoring function that maps the input data to the class label by defining and optimizing the weights in such a way as to increase the overall classification accuracy.

The number of nodes in the input layer will be equal to the number of pixels times the color channel. The number of nodes in the output layer corresponds to the number of class labels to be predicted. The activation function in the output layer will usually be a softmax classifier.

Scoring function

The function that maps the input data to the target class label is called the scoring function.

Loss function

A loss function determines the extent to which the classifier predicts the class label correctly. The more the predicted class and the actual class (also referred to as ground truth) agree, the lesser is the loss (and hence higher the accuracy). In an image classifier, the commonly used loss function is the cross-entropy loss. This loss function should minimize the negative log-likelihood of the correct class.

Cross-entropy loss per image is defined as 𝐿𝑖 = −𝑙𝑜𝑔𝑃(𝑌 = 𝑦𝑖|𝑋 = 𝑥𝑖)

Cross-entropy loss is calculated by the negative log of the normalized probability of the correct class. A softmax function calculates the normalized probability for each class label. This function takes the exponent on the results of the scoring function and normalizes it over the sum of the exponents for the data point. The probability is calculated by the exponent of the score (the result of the scoring function) for the actual label divided by the sum of exponents of the score for all labels

𝑃(𝑌 = 𝑦𝑖|𝑋 = 𝑥𝑖) = 𝑒𝑆𝑦𝑖

∑ 𝑒𝑆𝑗

𝑗

The loss function for a single image(datapoint) in terms of the score can be written as follows: 𝐿𝑖 = − log (

𝑒𝑆𝑦𝑖

∑ 𝑒𝑆𝑗

𝑗 )

Stochastic Gradient Descent (SGD)

Obtaining a lower loss (higher accuracy) is dependent on finding the optimal values of weights. Optimization techniques help us in arriving at the optimal value of weights, by iteratively taking a step that moves us closer in the direction that minimizes loss. Stochastic Gradient Descent (SGD) is a commonly used technique in machine learning, to optimize the weights by reducing the loss. A learning rate alpha controls the step size, and is a hyperparameter that needs to be fine-tuned to obtain lower loss. Typical values of alpha could range from 0.1 to 0.001.

(19)

14

Backpropagation

Backpropagation algorithm consists of two phases – forward pass and a backward pass.

The forward pass is when the inputs from one layer are passed on to the next layer, and all the way through the output layer and a prediction is made. The backward pass is where the gradient of the loss function is calculated in the final layer and is applied to the weights of the prior layers all the way to the input layer. This process is also referred to as the ‘weights update phase.’

When the neural network starts training, we start with some random weights. There are several ways in which weights in the neural network can be initialized. Keras (“Keras Documentation,” 2015), the deep learning library we use for building the models, uses ‘Glorot initialization’ or ‘Xavier initialization,’(Glorot & Bengio, 2010)’. While training, the backward pass of the backpropagation algorithm learns the optimal weights to achieve a lower loss.

Need for regularization

We need to ensure that the model we trained works not only on the training data but also for data that it has not seen during training. We refer to this as a generalization of the model. To improve the

generalization of the model usually the input data that we use to train the model is split into training,

validation and testing data. Testing data should strictly not be used for training. The model is trained based on the training data, and the prediction is validated using the validation data. We test the performance of several models on the validation data and then choose the best model, which is then tested on the test data.

Overfitting is a situation when the model works well with the training data but doesn’t perform well with

unseen data – we say that the model does not generalize well. To avoid issues of overfitting, we resort to regularization. There are various types of regularization L1, L2, and elastic net which applies an additional parameter to the loss function. We also have a regularization technique, dropout which is added explicitly to the neural network itself. Drop out skips explicitly some connections between nodes from one layer to another layer, thereby ensuring that the activation is not dependent entirely on one single node but rather on multiple, redundant nodes when given similar inputs.

Too much of regularization could lead to underfitting, where the model is not able to learn well from the training data and does a poor job of mapping the input features to the target class label.

Our loss function with the regularization term is written as follows. The lambda (𝜆) controls the strength of the regularization that is applied.

𝐿 = 1

𝑁∑ 𝐿𝑖+ 𝜆𝑅(𝑊) 𝑁

𝑖=1

Convolutional neural networks

Let us now focus on Convolutional neural network, which is the most common network that is applied in the field of computer vision. Convolutional neural networks or shortly ConvNets or CNN is a specific form of feedforward neural networks. In Feedforward network architecture, a connection from a node in one

(20)

15 layer is allowed only to the next layer. It is not possible to have a backward connection or inter-layer connection. In the feedforward network, each neuron in the input layer is connected to all neurons in the output layer. However, that is not the case with a convolutional neural network.

Convolution is not as complicated as it sounds. It is the operation of element-wise multiplication (not the dot product) of two matrices and summing up the result. As we saw earlier, an image is a multi-dimensional matrix. This matrix is multiplied by a smaller matrix called Kernel or a convolutional matrix. We can associate these kernels with the handcrafted kernels that we traditionally used to introduce blur, sharpen or identify edges in the images. In the traditional computer vision techniques, these kernels had to be hand defined. However, in a convolutional network, the model automatically learns the right values for these kernels.

The convolutional network has neurons arranged in three dimensions: width, height, depth. The neurons in a convolutional layer will be connected only to a small region of the input layer.

A convolutional network is built of several layers: • Input layer

• Convolutional (CONV)

• Activation (ACT or RELU, where we use the same or the actual activation function) • Pooling (POOL)

• Fully-connected (FC) • Output layer

Let us look briefly at these layers and relate them to the VGG16(Simonyan & Zisserman, 2014) architecture to get our understanding of how a convolutional neural network is constructed.

Simonyan and Zisserman provide the configuration of the various networks they tried out. We are focusing on configuration D in the paper, commonly called as VGG16, as it contains 16 weighted layers. Later, to do our image classification, we will be using this pre-trained model as a base for transfer learning.

Input layer

The image is fed into the input layer. There are no parameters to learn in this layer. The VGG16 model takes an image of size 224 x 224 in the RGB color space. The size of this layer will be 224 x 224 x 3

(21)

16

Figure 8VGG16 network architecture

Image source: (leonardblier, 2016)

VGG16 has 16 layers with weights. It uses a convolutional kernel of size 3x3, with a padding of 1 and the convolution stride fixed to 1 pixel. Max-pooling is carried out over a 2 × 2 pixel window, with stride 2.

Figure 9 VGG16 - Parameters and output size per layer

Layer VGG16 layers Description Matrix size Parameters

0 Input layer

No parameters to learn. The input layer just

receives the input image 224 x 224 x 3 -

1 Convolutional layer 64 kernels of size 3x3 224 x 224 x 64 1,792

Pooling layer 2x2 kernel with max pooling, reduces the shape 112 x 112 x 64 -

Pooling layer Kernel of size 2x2 56 x 56 x 128

Pooling layer 2x2 kernel with max pooling 28 x 28 x 256

8 Convolutional layer 512 kernels of size 3x3 28 x 28 x 512 1,180,160

14 Fully connected layer connected to all nodes of previous layer 1 x 4096 102,764,544

15 Fully connected layer connected to all nodes of previous layer 1 x 4096 16,781,312

16 Fully connected layer(Output layer) softmax classifier to predict output labels 1 x 1000 4,097,000

(22)

17

Convolutional layer

The Convolutional layer is the core building block of a Convolutional Neural Network. This layer consists of a set of kernels (values are learned during back-propagation) of size m x n. The size of the kernel is also referred to as the receptive field(F). We can imagine this kernel as a sliding window that slides from left-to-right and top-to-bottom across an image, performing convolution operation at each step. At every step of the slide, the number of pixels to which the kernel is shifted is denoted as the stride. The border of the image might be padded with zeroes to ensure that the kernel’s center goes through the pixels in the border of the images. The neural network layers are built of many three-dimensional matrices whose dimension is represented by Width(W), Height(H) and Depth(D), also commonly as volume.

Calculation of output volume

If we consider an input volume of size 𝑊_𝑖𝑛𝑝𝑢𝑡, 𝐻𝑖𝑛𝑝𝑢𝑡, 𝐷_𝑖𝑛𝑝𝑢𝑡, with padding P, stride S, receptive field of the kernel F, and the number of kernels K, then the size of the ouptut volume is:

𝑊_𝑜𝑢𝑡𝑝𝑢𝑡 = ((𝑊_𝑖𝑛𝑝𝑢𝑡 − 𝐹 + 2𝑃)/𝑆) + 1 𝐻_𝑜𝑢𝑡𝑝𝑢𝑡 = ((𝐻_𝑖𝑛𝑝𝑢𝑡 − 𝐹 + 2𝑃)/𝑆) + 1 𝐷_𝑜𝑢𝑡𝑝𝑢𝑡 = 𝐾

Figure 10 How convolution works

(23)

18 The matrix in the 1st_{column could be considered as an image of pixel 7x7 in the RGB color space, each}

matrix referring to each of the colors Red, Green, and Blue. We could see that the matrix has zero padding on the border. The matrices in the next two columns are the convolutional filters. There are two filters W0 and W1, each having a depth of three equal to the depth of the input layer (the image in this example). The convolution process results in the output volume of 3x3x2(we see a depth of 2 because we have two convolutional filters/kernels)

If we look at the first convolution layer of the VGG16, we can see that the input is the image (224 x 224 pixels, with three color channels - RGB). The receptive field of the kernel is 3x3. VGG16 uses 1 pixel for padding and has a stride of 1. A bias term will be included in the input weights. Therefore the output size per kernel, after convolution will be

𝑊_𝑜𝑢𝑡𝑝𝑢𝑡 = ((224 − 3 + 2)/1) + 1 => 224 𝐻_𝑜𝑢𝑡𝑝𝑢𝑡 = ((224 − 3 + 2)/1) + 1 => 224 𝐷_𝑜𝑢𝑡𝑝𝑢𝑡 = 3

Number of weights to learn

Given the size of the kernel(F), depth of the input volume (Dinput), number of kernels(K), the number of

weights (also referred as parameters) in this layer is calculated as (F*D+1)*K

For VGG16, the first convolutional layer then has in total 1792 weights (3*3*3+1)*64)

Activation

Activation is not a separate layer, but after the convolution operation, within every convolutional layer an activation function is applied in-place, and this does not alter the shape of the matrix.

Pooling layer

Pooling is the operation of applying a function to choose the maximum or average from a smaller region of the input matrix. The pool size is referred to as the receptive size(F) of the pool. Moreover, similar to the convolution kernel, the pool kernel is also slid across the input matrix. As we see in Figure 11 Max pooling, pooling allows us to reduce the input size, thereby reducing the number of parameters and computation in the network.

We can calculate the output volume after applying pooling as follows:

With an input volume of size 𝑊_𝑖𝑛𝑝𝑢𝑡, 𝐻𝑖𝑛𝑝𝑢𝑡, 𝐷_𝑖𝑛𝑝𝑢𝑡, and a pooling kernel with the receptive field(F) and stride(S), the size of the output volume is:

𝑊_𝑜𝑢𝑡𝑝𝑢𝑡 = ((𝑊_𝑖𝑛𝑝𝑢𝑡 − 𝐹)/𝑆) + 1 𝐻_𝑜𝑢𝑡𝑝𝑢𝑡 = ((𝐻_𝑖𝑛𝑝𝑢𝑡 − 𝐹)/𝑆) + 1 𝐷_𝑜𝑢𝑡𝑝𝑢𝑡 = 𝐷_𝑖𝑛𝑝𝑢𝑡

(24)

19

Figure 11 Max pooling

Image source: (Rosebrock, 2017)

So, for the first pooling layer of VGG16, which has an input volume of 224 x 224 x 64 and uses pooling kernel size of 2 x 2, the output size will be:

𝑊_𝑜𝑢𝑡𝑝𝑢𝑡 = ((224 − 2)/2) + 1 => 112 𝐻_𝑜𝑢𝑡𝑝𝑢𝑡 = ((224 − 2)/2) + 1 => 112 𝐷_𝑜𝑢𝑡𝑝𝑢𝑡 = 64

Fully connected layer (FC)

Nodes in FC layers are fully-connected to all activations in the previous layer, which is the standard for feedforward neural networks. It is always placed at the end of the network. It is common to have two FC layers before the final fully connected layer (output layer with the softmax classifier, with nodes equal to the number of class labels to be predicted).

For VGG16, if we look at the first fully connected layer. It has 4096 nodes. It is connected to all the nodes from the previous layer, which is 25089(7*7*512+1(bias term)). Then the total number of parameters in this layer is 102.764.544(25089*4096). In total, VGG16 has 138 million parameters to learn.

Additionally, a convolutional network might contain a Batch Normalization and a dropout layer, which help with regularization. VGG16 does not contain batch normalization as it was introduced later.

Transfer learning

Convolutional neural networks need lots of data to train from scratch, the more the data better the accuracy is. ImageNet is trained on 1.2 million images, and the whole training takes few days in a multi-GPU setup. In many situations, when we want to apply supervised deep learning, we may not always have such large quantities of data available. In those cases, transfer learning techniques are beneficial. Transfer learning is when a model, trained on a set of images, is used for training an entirely new set of images that are not even related to the original image set. For instance, we could use a pre-trained model that is trained in classifying images of cats and dogs to train the model to detect new categories, such as dashboards and flowers.

Transfer learning is possible because a convolutional neural network learns patterns in stages. The earlier layers of the network identify generic features such as edges in the image, the next layer would be able to detect contours, and the final layer would be able to detect features that are specific to the target that the network tries to predict. In general, transferring features is better than training a model from scratch using random features(Yosinski et al., 2014).

(25)

20 If we want to detect labels of the classes that are part of ImageNet, we could directly use readily available model and start predicting. However, if we have labels that don’t fall under ImageNet, we could still use these pre-trained models as a starting point and apply the transfer learning technique to train the model on an entirely different set of images.

Techniques of transfer learning

There are two types of transfer learning, namely feature extraction and fine-tuning. Let us take a closer look at these two types as we will be using these techniques.

Feature extraction

We can use the pre-trained convolutional network as a feature extractor. In the convolutional network, when the image is given as input, features are learned in each layer and propagated all the way to the output layer where the class label is predicted. Instead of using the output layer for prediction, we could stop earlier and get the features extracted till then and use a classical machine learning algorithm such as logistic regression or a linear SVM. We could choose to get the features from any layer; the layers closer to the input layer have generic features, and the features become increasingly specific to the images as they layers get closer to the output layer.

Figure 12 Feature extraction and fine-tuning in transfer learning

Image source: (Karpathy, 2015)

The first figure shows the model trained on Imagenet. The second figure shows the original model where the last fully connected layer is swapped with a new classifier using the features extracted. Moreover, the last figure shows that layers from the original model can be fine-tuned as well.

(26)

21

Fine-tuning

We create a new model from a pre-trained model by replacing the last fully connected layer with a new layer. We then start training the new model, with random weights for the new layer. We have the possibility to fine tune the weights for the old layers (from the pre-trained model). We could fine tune one or more layers, by allowing backpropagation to backpropagate the weight adjustments through the layers we want to fine-tune and stopping it on the other layers. If the data available for training is substantial (> 1000 per class label), fine-tuning will give better results than feature extraction.

When to choose which technique

The specific technique to apply depends mainly on the size of the dataset and how similar it is to the data that was used to train the original model. These are some guidelines put forth by in his lecture notes of Stanford’s cs231n class(Karpathy, 2015).

Images similar to the pre-trained model’s training images

Images dissimilar to the pre-trained model’s training images

Small dataset Feature extraction using the last fully connected layer and a traditional classifier

Feature extraction using lower level convolutional layers and a traditional classifier

Large dataset Fine-tuning might work Fine tuning may not work, most likely

the model has to be trained from scratch

In our situation, the images we have are not like ImageNet. Though ImageNet has class ‘cars/trucks,’ they are trained on images showing full cars. Our dataset does contain such images, but as we saw in the section ‘Data understanding,’ we have more variations. Moreover, the dataset we have is quite small. So most likely the feature extraction would work out best. However, we also tried out fine tuning to compare the result. The results of the models are discussed in section ‘Model evaluation.’

Choosing a model for transfer learning

For transfer learning the models trained on ImageNet data challenge is used, as they have been trained on a large dataset. There is a slight confusion when we refer to ImageNet(Deng et al., 2009). ImageNet is a project that aims to label and categorize images into 22000 categories based on a defined set of words and phrases. This project was started, to get a large number of labeled data using which the image classification algorithms could be improved. The goal is to have 1000+ images per category.

However when in the deep learning community people refer to models trained on ImageNet, they refer to the ImageNet is a hyperparameter that needs to be fine-tuned to obtain lower loss challenge ILSVRC ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al., 2014). The goal of this challenge is to train a model to classify 1,000 separate categories (“ILSVRC2014,” 2014) using approximately 1.2 million images for training, 50,000 for validation, and 100,000 for testing. This challenge ran from 2010 until 2017. From 2012 onwards, the convolutional neural network models have consistently been at the top. Each year different CNN model had won this challenge. We could pick any of these models as our pre-trained model.

(27)

22

Which models were developed?

We applied transfer learning and chose VGG16 network as our pre-trained model. VGG16 was chosen as the architecture is quite simple to understand. Also, it has been demonstrated to generalize well (as compared to other network types such as GoogLeNet and ResNet) to datasets it was not trained on (Rosebrock, 2017), so it is commonly used in transfer learning.

VGG16 was developed by Oxford Visual Geometry Group and secured first place in the localization challenge of the ILSVRC challenge for the year 2014(Simonyan & Zisserman, 2014). The downside is that the model size is large (528 MB), and it takes a lot of time to train. The VGG16 model training on ImageNet data took 2-3 weeks on a system equipped with four NVIDIA Titan Black GPUs. The exciting thing is that we do not have to spend that much time to train our models. By applying transfer learning technique, we could use this pre-trained model and build a powerful image classifier in a few hours.

There are mainly three variations

1. Feature engineering: Removed the last layer (layer 16) and extracted the features from layer 15, which has 4096 features. These features were used as input for logistic regression.

2. Fine-tuning: Swapped the last layer (layer 16) with a new layer. The last fully connected layer (layer 16) of the VGG16 has 1000 nodes corresponding to the 1000 class labels of ImageNet. The new layer we replace will contain nodes corresponding to the class labels we want to predict. For instance, in the Car vs. Not-car model, this new layer will have only two nodes. We fine-tune only the weights of the last layer. The other layers are ‘frozen,’ which means we do not backpropagate the weight adjustments to those layers.

3. Another variation to the above fine-tuning model is to fine-tune both layers 15 and 16.

Model description Model-id for reference

Model-1 Car vs. Not-Car

Feature engineering, followed by logistic regression M1-FEAT-LOGREG Model-2 Car drill-down

Feature engineering, followed by logistic regression M2-FEAT-LOGREG Fine tuning by removing only the last fully connected

layer

M2-FNTN-FROM-LYR16 Finetuning by removing the last fully connected layer

and the last convolutional layer

M2-FNTN-FROM-LYR15 Model-3 Damage severity

Feature engineering, followed by logistic regression M3-FEAT-LOGREG Fine tuning by removing only the last fully connected

layer

M3-FNTN-FROM-LYR16 Finetuning by removing the last fully connected layer

and the last convolutional layer

M3-FNTN-FROM-LYR15

Train and validation split is 75:25 for all models.

(28)

23 Training neural network models on GPU are much faster than on CPU. Amazon Web Service (AWS) provides several types of EC2 instances with GPU acceleration, which can be quickly spun up.

• Amazon EC2 P3 Instances have up to 8 NVIDIA Tesla V100 GPUs.

• Amazon EC2 P2 Instances have up to 16 NVIDIA Tesla K80 GPUs.

• Amazon EC2 G3 Instances have up to 4 NVIDIA Tesla M60 GPUs.

The models were built using a P2.Xlarge machine which has one GPU(NVIDIA K80).

System details

Source: https://aws.amazon.com/ec2/instance-types/

Software:

Keras is used for developing the models. Keras is a high-level API written in python, which can run on top of deep learning libraries Tensorflow or Theano. It acts as an abstraction layer and provides a more natural way to program deep learning networks without having to deal with the complexities of tensorflow programming.

(29)

24

6. Model Evaluation

Metrics

Accuracy is the usual measure to check how many cases are correctly categorized.

However, accuracy may not give a correct picture when we have a class imbalance. It is crucial that we pay attention to other metrics such as precision and recall. Precision refers to the proportion of selected items that are relevant. Whereas recall refers to the proportion of relevant items that are selected. When we focus on improving precision, the recall could go down. F1-Score connects both precision and recall and gives a single score for interpretation. F1-score is the harmonic mean of precision and recall, a higher value indicates a better model.

Image source: Wikipedia

Model evaluation – model-1 car vs. not-car

Input data

Class label # of images Class distribution %

Car 9356 91%

Not car 924 9%

(30)

25 Model: M1-FEAT-LOGREG

The label ‘Not car’ is present only in 9% of the cases. However, despite the considerable class imbalance, we have got good performance from the model, just by using logistic regression on the features learned from a pre-trained VGG-16 model. We did not explore with further models as this model already yields a good result

Transfer learning type Training

time(mins) Optimizer Learning rate Epochs Validation Accuracy Validation Loss Feature extraction + Logistic regression < 10mins NA NA 1 0

precision recall f1-score support car 1.00 1.00 1.00 2348 notcar 1.00 0.98 0.99 222 avg / total 1.00 1.00 1.00 2570

Model evaluation – model-2 car drilldown

Input data

Dashboard 484 5.17%

Separate parts 953 10.18%

Full car view 2103 22.47%

close-up view 5816 62.16%

Total 9356

We tried three different models

Model-ID Transfer learning type Training

time(mins) Optimiser Learning rate Epochs Validation Accuracy M2-FEAT-LOGREG Feature extraction + Logistic regression 10 NA NA NA 0.88 M2-FNTN-FROM-LYR16

Fine tuning (replacing only the head – the last FC layer)

71 RMS prop 0.001 25 0.86

M2-FNTN-FROM-LYR15

Fine tuning (replacing the last FC layer and the ConvNets)

(31)

26 Model: M2-FEAT-LOGREG

precision recall f1-score support car 0.90 0.92 0.91 1453 dashboard 0.97 0.94 0.96 108 fullcarshot 0.89 0.87 0.88 550 separateparts 0.73 0.65 0.69 228 avg / total 0.88 0.88 0.88 2339

Despite being under-represented, the model can identify features that discriminate the dashboard clearly from the rest of the images, yielding a very high precision of 0.97. Also, we see that the precision and recall of the car and full-car-shot is around 0.90 which is good. However, the category ‘separate parts’ doesn’t do that well. When we look further into the images within this group we have a vast variety of images – it is hard to learn the standard pattern.

Model: M2-FNTN-FROM-LYR16

precision recall f1-score support car 0.89 0.89 0.89 1444 dashboard 0.88 0.97 0.92 134 fullcarshot 0.80 0.94 0.87 520 separateparts 0.81 0.43 0.56 241

avg / total 0.86 0.86 0.85 2339

The F1-score of the model M2-FNTN-FROM-LYR16 is lower than the model M2-FEAT-LOGREG. Moreover, we could notice that the recall for the class ‘separate parts’ reduces to 0.43.

precision recall f1-score support car 0.92 0.91 0.91 1444 dashboard 0.96 0.97 0.97 134 fullcarshot 0.85 0.92 0.88 520 separateparts 0.76 0.68 0.71 241

avg / total 0.89 0.89 0.89 2339

When compared to the model M2-FEAT-LOGREG, the accuracy has improved slightly. Also, we see that for the class ‘separate parts,’ the precision and recall has increased.

(32)

27

Model evaluation – model-3 damage severity

Input data

Low 1736 54.88%

Medium 1309 41.38%

High 118 3.74%

Total 3163

There is a considerable class imbalance. The label ‘high’ is present only in less than 4% of the cases. We started with three different models. As they did not perform well, the models were rerun with the class weights adjusted to handle class imbalance.

Model-ID Transfer learning type Training

time(mins) Optimiser Learning rate Epochs Validation Accuracy M3-FEAT-LOGREG Feature extraction + Logistic regression 10 NA NA NA 0.60 M3-FNTN-FROM-LYR16

Fine tuning (replacing only the head – the last FC layer)

23 RMS prop 0.001 25 0.48

M3-FNTN-FROM-LYR15

Fine tuning (replacing the last FC layer and the ConvNets)

93 SGD 0.001 100 0.58

Model: M3-FEAT-LOGREG

precision recall f1-score support high 0.67 0.09 0.16 22 low 0.66 0.68 0.67 447 medium 0.54 0.54 0.54 322 avg / total 0.61 0.61 0.60 791

The overall precision of this model is 0.61. So far with the other two models (1, 2), we have been getting a precision close to 0.90. One reason could be that we have very less number of images per label, compared to what we used earlier. Also, we see that the class ‘High’ is highly imbalanced, it is present only in 3.74% of the cases. The recall is only 0.09, which indicates that amongst the items predicted as ‘high,’ only 9% is relevant. The classifier is doing an abysmal job for the ‘high’ category.

(33)

28 This model performs even worse, the F1-score has gone down from 0.60 to 0.48. It is not able to identify even a single ‘high’ case correctly.

Similar to M3-FNTN-FROM-LYR16, this model also is not able to identify even a single ‘high’ case correctly. Increasing the number of layers to be finetuned seems to increase the precision a bit, but it does not help to classify the label ‘high’ correctly.

Adjusting the class weights

M3-FEAT-LOGREG with class weights adjustment

The overall F1-score of this model is the same as we had previously without the class weights. However, we see that for the class ‘high,’ the F1-score has gone up.

M3-FNTN-FROM-LYR16 with class weights adjustment

The overall F1-score has gone down by 0.08 compared to the same model without the class weights, but the F1-score for the class ‘high’ has increased to 0.16.

M3-FNTN-FROM-LYR15 with class weights adjustment

(34)

29 When compared to the same model without the class weights, the overall F1-score has dropped by 0.15. Moreover, the F1-score for class ‘high’ has increased, while it has dropped a lot for medium (to 0.21 from 0.52).

Conclusion:

When adjusting the class weights, we see that there is a slight improvement in the prediction of F1-score for class ‘high’ in all the three models. The model M3-FEAT-LOGREG, with feature extraction and class weights, has a better result than the rest of the models.

To further improve the prediction we could try the following. These could be taken up in future work. a) Data augmentation to handle the class imbalance.

b) Explore the images further. There may still be some clutter in the images spread across all the different damage types, which should be analyzed and removed.

c) Furthermore, accurate labeling for damage type could help us improve the prediction. We had taken an approach of using the % of the repair cost of the total purchase value of the car as the basis for labeling. We could consider the depreciation value of the car to come to an accurate categorization.

d) Get more labeled data

Opening the blackbox

For a model to be trusted, it is essential that its rationale behind its prediction can be explained. Simpler models are easier to explain. Take the case of linear regression; it is easier to understand which parameters influence (both positively as well as negatively) the prediction. Neural networks are preferred over simpler models, as they can learn complex patterns and are more accurate. However, one drawback that comes up is that the more accurate the models become, the more complicated it is to understand and explain per prediction precisely what goes on inside the network and the parameters that influenced the outcome. The explanation becomes even harder for a deep learning network with millions of parameters (the VGG16 network that we use has 138million parameters!)

For the users to accept the model decision and improve willingness to use the model, we need to make the predictions explainable. Also, GDPR (General data protection regulation), the data privacy regulation that came into force in May 2018, offers EU citizens the ‘right to an explanation.’ For instance, if an Insurance company rejects a customer’s claim based on the decision of a model, it still needs to give a clear explanation to the customer on the reasons that led to the decision. Furthermore, when data scientists develop models, they need to understand why a model behaves a certain way so that they can compare and choose the best model.

In the recent years, much research has gone into the explainability of neural network models. DARPA((Gunning, 2017)) has started a transparency project XAI(Explainable AI) that aims to “produce "glass box" models that are explainable to a "human-in-the-loop," without greatly sacrificing AI performance.” Attempts have been made to visualize the feature maps in various layers of the neural network, using class activation maps and providing explanation process in a heat map involving sensitivity analysis and Layer-wise relevance propagation(Samek et al., 2017).

We pay specific attention to LIME(Local Interpretable Model-agnostic Explanation) (Ribeiro et al., 2016). LIME claims to explain any specific prediction of a blackbox model. LIME focuses on fitting a local model

(35)

30 to explain why a prediction is made. It considers the model to be explained as a blackbox. LIME generates perturbed samples closer to the input data that needs to be explained and obtains a prediction for these points from the blackbox model. Using this new data (perturbed samples and predictions), LIME fits a new simple model (such as linear regression) weighted by the proximity of the samples to the input, which can then explain the outcome. It is an excellent local approximation of the original blackbox model.

LIME does not know the complex model(f) that classifies into two different regions (blue/pink). The red cross in bold is the point to be explained. The surrounding points are predicted using the complex model, and their size indicates their proximity to the red cross. The dashed line indicates the new linear model fitted.

Image source: (Ribeiro et al., 2016) paper “Why should I trust you.”

For an image classification model, LIME can make it visually easier to understand which pixels of an image is for/against the prediction. Let us look at the explanations of LIME for our Model-2 for predicting ‘dashboard.’

The first image is the actual image. The second image shows only the part that LIME identified as contributing to the prediction. The third image shows the part along with the rest of the image. The fourth image shows the parts contributed positively (highlighted in green) and the parts that contributed negatively (highlighted in red) for the prediction.

From the explanations, we see that the model is looking for the curved surface with the numbers showing up. Moreover, it looks like the color does play a role. The dark patches seem to indicate the dashboard.

(36)

31

7. Model deployment

When the models are developed, we want to be able to apply these models whenever an image related to a car claim is sent by the customer or repair company and send the prediction (damage estimation) to the claim system. In the initial phase, we want the model to assist the claim handlers, and not automatically show the repair estimation directly to the customer/repairer.

We did not get as far as deploying the model in a real-time production environment. However this step was thought through, and an approach is suggested here. Amazon AWS makes it easier to deploy the trained models. The trained model can be stored in an S3 bucket. A lambda function can be setup with a trigger to invoke the model whenever a new image is stored in an S3 bucket. The images and the prediction can be stored back in an S3 bucket for consumption by other applications such as the claim administration system or the workflow applications.

Figure 14 Model deployment with AWS

(37)

32

8.Conclusions and directions for further research

Collect better quality images

The images that we get from the repair shops network in the Netherlands are different from the images collected in Aegon Hungary (as the images are submitted both by customers and repair shops). What we noticed in this research is that there is a considerable variation in the images sent by different repair shops. Some companies have a standard way of taking pictures of the complete car and the dashboard reading. Some companies share the images of the repair work-in-progress (the parts are already separated out from the car), or the repair that is already completed, and the paintwork is in-progress. It is essential that we have images that show the damage and not the ones after the damage is fixed. Furthermore, the images are taken at different light settings – outdoor, within the workshop, and there is much reflection on the images, which makes it difficult even for a human to identify the damage or even understand which part of the car is in the image.

The first image shows the picture of the left side view of the car, with the reflection of the tiles and the parking lane indicator (white dashes). The second image shows a damaged windshield (damaged area circled in red), very hard even for humans to identify.

Mobile app to pictures of the damage

It will greatly benefit if we provide a mobile app (combined with AR / VR) that could guide the customer to take the pictures. We could request

the customer to take pictures of various sides of the car, with the mobile app prompting the views to be taken, with a real-time verification if the customer is pointing the camera to the right view of the car as indicated. Also, the app can also highlight the damaged area (scratch or dent) with the help of semantic segmentation. The app would facilitate getting the right label automatically, while the picture is being taken.

Car damage estimation using deep learning

Master of Business Administration

specialization in Big Data & Business Analytics

Master Thesis

CAR DAMAGE ESTIMATION USING DEEP LEARNING

by

JAMES PAUL GNANASEKARAN

11417714

30-Sep-2018

Supervisor:

Assessor:

Acknowledgments

Executive summary

Contents

2.Business understanding

Research question

Literature review

3.Data understanding

Exploratory analysis of the data

4. Data preparation

Labeling the images

Pre-processing the images

5. Modeling

How does the computer see an image?

Traditional computer vision techniques

Neural networks

History of neural nets

What is a neural network model?

Scoring function

Loss function

Stochastic Gradient Descent (SGD)

Backpropagation

Need for regularization

Convolutional neural networks

Transfer learning

Techniques of transfer learning

When to choose which technique

Choosing a model for transfer learning

Which models were developed?

6. Model Evaluation

Metrics

Model evaluation – model-1 car vs. not-car

Model evaluation – model-2 car drilldown

Model evaluation – model-3 damage severity

Opening the blackbox

7. Model deployment

8.Conclusions and directions for further research

Collect better quality images

Mobile app to pictures of the damage