
Faculty of Economics and Business, Amsterdam School of Economics
Master thesis Econometrics: Big Data & Business Analytics
15 July 2018, academic year 2017-2018
prof. dr. M. Worring, Period 5 and 6

Fine-grained classification of seeds in images

Bram Postma (10790675)


Statement of originality

This document is written by Bram Postma, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

The aim of this work is to explore how to classify images of very similar looking seeds. Traditional methods make use of handcrafted features of the seeds; those features are then used to train a model that classifies the seeds. In this work we explore how deep learning methods (neural networks) can be applied to classifying images of very similar looking seeds. From an application perspective, the additional challenge is to find out how image information and the characteristics of objects can be combined to increase classification performance, compared to using only image information or characteristics. We trained different convolutional neural networks on the images, using normalization of the input data and dropout in the fully connected layers to reduce overfitting. The best performing classifier trained on the characteristics is a random forest. By taking the mean of the predicted probabilities of the network and the random forest, the accuracy is increased to 99.64%. This is higher than the SVM that is currently used, which reaches an accuracy of 97.4%.


Contents

1 Introduction
2 Related Work
  2.1 Seed Classification
  2.2 Basic Image Recognition
    2.2.1 Methods
    2.2.2 Results
  2.3 Fine-grained Image Recognition
    2.3.1 Methods
    2.3.2 Results
  2.4 Conclusion
3 Data
  3.1 Data Description
    3.1.1 Images
    3.1.2 Meta Data
  3.2 Data-issues and Cleaning
    3.2.1 Quality of the Data
    3.2.2 Quantity of the Data
    3.2.3 Normalization
  3.3 Data Description after Preprocessing
4 Research methods
  4.1 Metadata Integration
  4.2 CNN
    4.2.1 Architecture
    4.2.2 Configurations
    4.2.3 Training and Testing
  4.3 Random Forest
  4.4 SVM
  4.5 Boosting Classifier
5 Visualizing and Understanding CNNs
  5.1 Weight Visualization
  5.2 Occlusion Sensitivity
  5.3 Neuron Activation
6 Results and Empirical Analysis
  6.1 CNN based on images
    6.1.1 Effect Size Fully Connected Hidden Layers
    6.1.2 Effect Dropout Probability
  6.2 Classifiers Based on Metadata
  6.3 Combining images with metadata
  6.4 Visualizing and Understanding CNNs
    6.4.1 Weight visualization
    6.4.2 Occlusion Sensitivity
    6.4.3 Neuron activation


1 Introduction

In the seed breeding industry, millions of seeds are processed every day. The visual difference between those seeds is very small, but some of the seeds contain gluten while others do not. Clearly, a person with a gluten allergy finds it important that those seeds get separated consistently. Because of the small differences between these seeds, it is hard for a human to discriminate between them: a human can easily recognize differences between houses or horses because they differ greatly in appearance, but telling two small seeds apart is difficult. Additionally, checking every single seed by hand takes a lot of time and manpower, and it is mind-numbing work. This asks for advanced methods to classify different seeds.

The advantages of using an automated system over a human are clear. Machines can work 24 hours a day, seven days a week. While humans get bored or sleepy during their work, a machine does not, and it will always have the same accuracy, while that of a human will usually decrease. Additionally, by law, humans are not allowed to do this kind of work for more than 4 hours a day and 1 year in total. So the hypothesis is that creating a machine to do the sorting using image processing will improve the sorting process.

Traditional models use handcrafted features to classify the seeds: some variables are extracted from the image of the seed, and those variables are then used to train a model that does the classification. This approach requires an extra step, namely defining and extracting the variables.

Current approaches to image recognition make essential use of machine learning techniques. One of these techniques, deep convolutional neural networks (CNNs), has led to a series of breakthroughs for image classification (Krizhevsky, Sutskever & Hinton, 2012, pp. 1097-1105). These networks are able to discover low-, mid- and high-level features (Zeiler & Fergus, 2014, pp. 818-833) or attributes of a picture, such as curves or colour transitions, and the "levels" of the features depend on the number of stacked layers in the network. Deep CNNs can learn these features, which makes them well suited for this job.

Figure 1: Image of a Seed

Recognizing seeds in images (Figure 1) differs from most image recognition. Networks are commonly trained on images with large differences, but the differences between seeds are very small, making this a special branch of image recognition. Discriminating sub-categories belonging to the same basic-level category, such as seeds, is called fine-grained recognition. It falls between identifying generic visual categories such as bikes and identifying individual instances such as faces. Because this is a fine-grained classification problem, techniques in addition to standard CNNs may be required.

It is hard to explain why a neural network makes a certain decision, and for this reason neural networks are often called black boxes (Olden & Jackson, 2002). We make use of different graphical visualization techniques to get insight into the workings of our CNN. Additionally, by better understanding the workings of the network, we can improve it and understand why it predicts some observations wrongly.

This research will focus on recognizing and sorting seeds using image recognition. We also have additional variables that are extracted from the picture, such as the size of the seed in pixels or the length of a fitted ellipse. It is possible to do image recognition using images of the seed, but an alternative is to categorize the seeds using those additional variables. For example, this metadata can be used to train a random forest classifier.


Currently, deep learning on the images in combination with extracted data has not often been applied to the task of recognizing different seeds. This research will try to develop a model that combines both using deep learning. The central research question asks whether it is possible to develop a model that outperforms currently used methods to recognize different seeds using a combination of metadata and images. Our hypothesis is that combining both will boost performance.

The aim of this thesis is to develop efficient methods for classifying images of very similar seeds, in which each image depicts one seed. From a methodological perspective, the aim is to explore how deep learning methods can be applied to fine-grained image recognition. From an application perspective, the additional challenge is to find out how image information and the characteristics of objects can be combined optimally to increase classification performance, compared to using only image information. To evaluate the models, we compare them with non-deep models like a support vector machine and random forest models. Finally, the workings of the model are shown graphically so the reason for wrong predictions can be deduced. The main goal is thus to create a model that combines the raw images and the metadata in an optimal way to discriminate between different seeds.

This thesis is organized as follows. Chapter 2 covers the theory as well as the empirical methods used in this research. Chapter 3 describes the preparation of the data, and Chapter 4 the research methods. Chapter 5 explains how the visualizations of the network's inner workings were created. Chapter 6 states the results and an empirical analysis. Finally, Chapter 7 summarizes and provides concluding remarks.


2 Related Work

To better understand what deep convolutional neural networks are and what the problems are when trying to classify images of very similar seeds, relevant scientific papers are discussed in the following paragraphs. Image recognition in general as well as the specific classification of seeds is reviewed. Techniques used in the past are discussed, as well as the improvements made. Finally, different views and opinions of several researchers concerning this topic are discussed.

2.1 Seed Classification

Machine vision has been used for quite some time to classify seeds. In 1995, Shatadal, Jayas, Hehn, and Bulley developed a model to classify seeds using machine vision, on a dataset with a combination of grain, small and large seeds. Later, Shahin and Symons (2003) developed a model to classify lentil types using machine vision. Recently, Kurtulmus, Alibas and Kavdir (2016) proposed to classify pepper seeds using machine vision. All three papers extracted attributes of the seeds from the images and used those to train their models. The selection of attributes is based on experience and advice of experts on the type of data. The attributes they included contain image features such as color, shape and texture.

Different machine learning techniques were used to predict the class based on the extracted features. Shatadal et al. used k-nearest neighbor and the Bayes decision rule. Shahin and Symons also used k-nearest neighbor, but in addition to that they used linear discriminant analysis, a quadratic discriminant function and normal density kernels. Kurtulmus et al. trained a neural network on the attributes to classify different kinds of pepper.


Shatadal et al. correctly classified 75.5% of the small seeds; the large seed category defined in their research corresponds to the seeds used in this thesis. Shahin and Symons achieved a much higher accuracy when trying to classify lentil types: an overall testing accuracy of 98.9 percent using only seed size and seed color. This is much higher than Kurtulmus et al., who reached an accuracy of 84.9 percent when classifying pepper. Note that the k-nearest neighbor classifier of Shatadal et al. performs worse than the neural network, but that the k-nearest neighbor classifier of Shahin and Symons achieved a higher accuracy. Based on the papers alone, because they used different datasets, it is not possible to definitively conclude that one method works better than the other. Note also that these three papers did not use the images directly; they only extracted features from the images to train their models. In this thesis, both the images themselves and extracted features are used to classify seeds.

2.2 Basic Image Recognition

A lot of research has been done on the topic of image recognition. Convolutional neural networks are widely used for image recognition because of their ability to efficiently take the correlation between pixels into account. One of the very first convolutional neural networks was named LeNet5 and was the work of LeCun, Bottou, Bengio, & Haffner (1998), the result of many successful iterations since 1988. Later, the CNN was compared with other methods by LeCun, Huang, and Bottou (2004), who did research on several popular learning methods for the problem of recognizing basic-level categories. They used a relatively small dataset consisting of 194,400 images divided into five categories. The improvements made since 1998 were not substantial until the research by Krizhevsky, Sutskever & Hinton (2012), who trained a large deep CNN to classify images, which was the start of the intensive use of large neural networks. Their dataset is much bigger than the set used by LeCun et al. and consists of 1.4 million images with 1000 generic visual categories coming from ImageNet. Since they won the ImageNet competition, their network AlexNet has been successfully applied to a large variety of computer vision tasks. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was first held in 2010. Every year, the ILSVRC provided images of 1000 (varying) object categories for classification, with a total of over 1.2 million images for training, 50,000 for validation and (at least) 100,000 for testing (Russakovsky et al., 2015). Every year multiple teams participate in this competition; they publish their results in papers and are responsible for most developments in the field of image recognition.

2.2.1 Methods

Different methods and network architectures to classify images have been compared in previous papers, and since 1988 a lot of improvements have been made to CNNs. LeCun et al. tested nearest neighbor methods, support vector machines (SVMs), and convolutional networks (CNNs) to predict the category of an image, and made their models invariant to pose, lighting, and surrounding clutter. Their CNN was still relatively primitive. Krizhevsky et al. used eight layers in total in their model and combined that with the dropout method (Hinton, Srivastava, Krizhevsky, Sutskever & Salakhutdinov, 2012) to reduce overfitting. Later, Zeiler and Fergus (2014) worked with the same network architecture but used a novel visualization technique to show that some hyperparameters could have been chosen differently to increase the performance of the network. They used different filter sizes and outperformed Krizhevsky et al.'s model on the ImageNet classification benchmark. Simonyan and Zisserman (2015) further built on AlexNet by adding more layers to their network. This was possible because computing power had increased since then and because they used smaller filters. The current state-of-the-art models are developed by Szegedy, Ioffe, Vanhoucke and Alemi (2016). They mainly improved the work of others by combining it: they used the very deep Inception network, which is comparable in performance with the work of Simonyan and Zisserman, and combined it with the residual learning framework from He, Zhang, Ren and Sun (2016).

2.2.2 Results

Convolutional neural networks proved to be the best for basic image recognition. LeCun et al. (1998) showed that using convolutional layers instead of using each pixel as a separate input of a large multi-layer neural network resulted in higher accuracy. This is because images are highly spatially correlated, and using individual pixels of the image as separate input features would not take advantage of these correlations. Later work of LeCun et al. (2004) concluded that, when trained on images with a uniform background, convolutional networks perform better than an SVM. When using cluttered images, the SVM proved to be of no use, while a CNN had a higher error rate than before but still acceptable. Using a CNN, they reached an accuracy of 83.3% on their relatively small dataset. Krizhevsky et al. achieved top-1 and top-5 error rates of 37.5 and 17.0 percent, much better than previous attempts (45.7 and 25.7 percent respectively). Zeiler and Fergus improved this network by resizing the filters and achieved top-1 and top-5 error rates of 36 and 14.8 percent. Additionally, they showed that their model generalizes well to other datasets and was better than any other model at that time. Finally, Simonyan and Zisserman found a significant improvement in accuracy when the depth of the network increases; they reached top-1 and top-5 error rates of 23.7 and 6.8 percent, a huge improvement over the original work of Krizhevsky et al. Currently, a top-5 error rate of 3.08% is achieved by Szegedy et al. by combining the work of others. From the previously discussed papers we can see that the performance of CNNs in image classification has increased significantly in the past couple of years.

2.3 Fine-grained Image Recognition

All previously discussed papers focus on recognizing basic-level categories such as cars, dogs or houses. In this thesis, we focus on fine-grained image categorization, which requires an approach that captures the fine and detailed information in images. An example is discriminating between different breeds of dogs. Techniques described in Section 2.2 will be used, in combination with techniques found by researchers who focused specifically on fine-grained image recognition.

Unlike basic image recognition, a lot of different datasets are used for fine-grained image recognition. Among the first to address the problem of fine-grained image recognition were Hillel and Weinshall (2007). They considered 12 subordinate classes from 6 basic categories; their dataset came from different sources, among which the Caltech Motorcycle and Faces database. Later, Yao, Khosla and Fei-Fei (2011) also classified images from the Caltech set in combination with other sources. The downside of their research is that their data requires object or part annotations, made by placing a bounding box on top of the images to indicate where discriminative image patches are. Chai, Lempitsky and Zisserman (2013) combined the work of Yao et al. and Singh, Gupta and Efros (2012); they used the same data but required only a loose bounding box around the instance in each image. Wang et al. (2014) used a triplet dataset containing a query image and a negative and positive example of that query where, according to human raters, the positive image is more similar to the query image than the negative image. Doing this for each image classification problem would be too much work. Finally, Zhang, Xiong, Zhou, Lin and Tian (2016) focus on fine-grained image recognition and propose an automatic fine-grained recognition approach; their data needed no annotation whatsoever. So while some researchers used feature annotation, it is not necessary for fine-grained image recognition.

2.3.1 Methods

Because fine-grained image classification appears to be harder than basic category classification, a few new techniques have been introduced by several authors. Hillel and Weinshall's approach is motivated by observations from cognitive psychology, which say there is a difference between recognizing a category and a sub-category. They split their model into two parts: part one discriminates between categories, and part two discriminates within a category by looking at features of sub-categories. In this thesis we consider only one category, namely seeds, so part one is not needed in our model. Hillel and Weinshall did not use the raw images but trained an SVM on features derived from the images. Yao, Khosla and Fei-Fei (2011) used a random forest with a discriminative decision trees algorithm to solve the problem of fine-grained image categorization. In this algorithm every tree node is a discriminative classifier, which uses a feature of the image to make a decision in each node; this is trained by combining the information in the node as well as all nodes above it in the tree. Randomization is used to handle the huge feature space and to prevent overfitting. Chai, Lempitsky and Zisserman (2013) combined the work of Yao et al. (2011) and Singh, Gupta and Efros (2012) to select features automatically and use those, instead of the whole image, to discriminate between categories. Wang et al. (2014) proposed a deep ranking model that employs deep learning techniques to find similarities between images. Zhang et al. (2016) made two major contributions. First, they pick good filters which respond to specific parts significantly and consistently, for example a filter that detects the beak of a bird for all kinds of birds in all kinds of settings. Secondly, they propose a simple but effective feature encoding method: they add weights to filters with the goal of highlighting the filter responses that are crucial for the recognition, by giving those filters a relatively high weight. An important difference with most previous work on fine-grained image recognition is that their model does not depend on any object or part annotation at either the training or the testing stage.

2.3.2 Results

Despite not using the same data, the findings of the previously discussed papers can be combined. Hillel and Weinshall show that the large number of features used to discriminate within a category is critical for successfully recognizing sub-categories. Additionally, they noted that multiple SVM training sessions are typically much shorter than multiple applications of relational model learning; this advantage becomes more important as the amount of data and features increases, although speed is less of an issue with current state-of-the-art hardware. Yao et al. showed that their methods achieve state-of-the-art performance and that they are able to extract a lot of useful information and features from images. The downside of their research is that the training of the feature localization depends on object or part annotations, which are heavily labor-intensive and an obstacle when using those models in practical applications. Chai et al. showed that the previously described object or part annotations are not necessary: accuracy is much higher when using unsupervised training rather than human annotation to train certain feature detectors, such as a head detector. This is in accordance with previous research done by Singh et al. (2012). Wang et al. (2014) came up with similar findings; extensive experiments showed that the algorithm they proposed outperformed models based on hand-crafted visual features and deep classification models. The downside here is again the demand for labor: the model they proposed is trained on a dataset consisting of triplets made by human raters.

2.4 Conclusion

Research done in the past shows a few important things that will be used in this thesis. First of all, it was shown that convolutional neural networks perform better than alternatives like SVMs or nearest neighbor methods. Furthermore, SVMs proved to be impractical when the observations are cluttered. Secondly, Simonyan and Zisserman (2015) have shown that network depth has a positive correlation with accuracy when recognizing images. Multiple others showed that feature learning based on deep learning outperforms handcrafted features. Zeiler and Fergus (2014) came up with a novel visualization technique to show how the intermediate feature layers of a convolutional neural network work; this will be used in this thesis as well. Many papers rely on object or part annotation, which is less of a problem in this thesis because of the solid background in each image. The hypothesis based on the literature is that it is possible to develop a better model than the one currently used to classify images of very similar seeds. This is based on the fact that convolutional networks have performed better than SVMs in similar tasks. Whether adding additional information to the network besides the images will significantly improve the model is unclear and is therefore investigated further.


3 Data

This chapter is devoted to the exploration and preparation of the dataset. In Section 3.1 the sources are mentioned and the data is described. In Section 3.2 the data cleaning is described; additionally, data issues are discussed and remedies to some of these issues are proposed. In Section 3.3 the preprocessed dataset is explored. The entire analysis is performed with R, a free software environment for statistical computing and graphics.

3.1 Data Description

Figure 2: Observation before cleaning

This section provides a description of important features of the data. The data is provided by SeQso, a leading Dutch high-tech company, innovative in patented seed sorting machines and seed analysis instruments. Dedicated image analysis and smart mechanical solutions are the basis of the sorters made by SeQso. In addition to the images themselves, metadata is available for every picture; the metadata is described in Subsection 3.1.2.

3.1.1 Images

The data consists of fixed-size 250x250 RGB images of seeds, where each image depicts one seed (see Figure 2 for an example). Every image is split into three channels, the first representing the red values, the second the green values and the last the blue values. There are four types of seeds (A-D). The differences between A and B are small, while C and D are easy to distinguish from the rest.

The data contains 4123 observations of seeds. This is a relatively small dataset, and the question is whether this influences the performance of the model; because the data does not contain complicated features, we expect the influence to be small. The frequencies of the different seeds are given in Table 1. The data is quite unbalanced: 70% of the observations belong to category B. A consequence is that a model can predict category B for every image and still reach an accuracy of 70%. Furthermore, the model may not be able to learn the characteristics of the other seeds because their observations are sparse. Methods to deal with this problem are discussed in Subsection 3.2.2; additionally, some model specifications can deal with this problem and are discussed in the next chapter.

Table 1: Frequencies of different seeds in the data

Seed  | Number of observations | Percentage of total
A     | 445                    | 11%
B     | 2890                   | 70%
C     | 292                    | 7%
D     | 496                    | 12%
Total | 4123                   | 100%

3.1.2 Meta Data

Besides the images, 25 parameters derived from the images are available. Table 3 shows all 25 variables and their descriptive statistics. These parameters will be used to train classifiers that need handcrafted features. Those classifiers will also be used in combination with CNNs, with the hypothesis that combining both will boost performance. Below is a short description of all 25 parameters.


Size: Surface of the seed in pixels
AvgLL: Mean L value from LAB color space
AvgAA: Mean A value from LAB color space
AvgBB: Mean B value from LAB color space
LegendreEllips Ratio: Ratio width/length of fitted ellipse
AvgLL2: Mean L value from LAB color space, excluding the outer 3 pixels
AvgAA2: Mean A value, likewise
AvgBB2: Mean B value, likewise
Contour smoothness: Circumference contour/convex hull
AvgLL3: See description below
AvgAA3: See description below
AvgBB3: See description below
Edge percentage: Percentage of 'edge' pixels in the seed
Breedte histogram contrast: Difference between thick and thin parts of the seed
Breedte histogram contrast relatief: Same, but normalized
LegendreEllips breedte: Length of short side of fitted ellipse
LegendreEllips lengte: Length of long side of fitted ellipse
Contour Max breedte: Longest distance between contour points
BreedteOffset: Border of histogram
FeretMax: Maximum Feret diameter
FeretMin: Minimum Feret diameter
FeretRatio: Ratio max/min Feret diameter
Convexiteit: Ratio total surface/surface of convex hull
Breedte asymmetry: Symmetry w.r.t. the longest side
FeretAngle: Angle between Feret max and min diameter

The AvgLL3, AvgAA3 and AvgBB3 variables are created by taking the LAB color space values of the center part of the seed. After that, histograms of those L, A and B values are created. Finally, the values of the variables are defined as the mean L, A and B values in the top 30% of the histogram.
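As an illustration, below is a minimal R sketch of this construction for a single channel. The "top 30% of the histogram" is interpreted here as the values above the 70th percentile, and the center_crop helper with its 50% crop fraction is a hypothetical stand-in, not the actual SeQso implementation.

# Sketch of an AvgLL3-style feature for one LAB channel.
# 'l_values' is a matrix of L values of one seed image.
center_crop <- function(m, frac = 0.5) {
  # Keep the central 'frac' portion of the matrix in both dimensions.
  nr <- nrow(m); nc <- ncol(m)
  m[seq(max(1, floor(nr * (1 - frac) / 2)), min(nr, ceiling(nr * (1 + frac) / 2))),
    seq(max(1, floor(nc * (1 - frac) / 2)), min(nc, ceiling(nc * (1 + frac) / 2)))]
}

avg_top30 <- function(l_values) {
  center <- as.vector(center_crop(l_values))
  # Mean of the values in the top 30% of the histogram,
  # read here as: the values above the 70th percentile.
  mean(center[center >= quantile(center, 0.7)])
}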

A t-test is conducted to test whether the parameter values for seeds A and B differ from each other. Parameters with a significant difference in mean have an asterisk behind their name in Table 3.

Table 3: Descriptive statistics of the metadata, mean (std. dev.). An asterisk behind the name means that there is a significant difference in mean between seed A and seed B for that variable.

Variable Name                        | Seed A           | Seed B           | Seed C          | Seed D
Size                                 | 3961.73 (567.42) | 3961.1 (1012.55) | 3403 (86.23)    | 2919.01 (424.58)
AvgLL                                | 77.79 (4.51)     | 77.77 (4.18)     | 68.81 (3.73)    | 67.23 (2.85)
AvgAA*                               | -1.39 (0.56)     | -0.71 (0.61)     | -1.12 (0.51)    | 0.77 (0.45)
AvgBB*                               | 16.72 (1.81)     | 14.66 (2.19)     | 10.04 (1.32)    | 12.73 (1.46)
LegendreEllips Ratio*                | 2.64 (0.39)      | 3.45 (0.44)      | 3.3 (0.3)       | 1.85 (0.22)
AvgLL2                               | 85.36 (3.99)     | 85.15 (4.12)     | 76.38 (4.07)    | 74.78 (2.91)
AvgAA2*                              | -1.81 (0.49)     | -0.91 (0.58)     | -1.21 (0.42)    | 0.43 (0.52)
AvgBB2*                              | 17.25 (1.97)     | 14.08 (2.41)     | 10.71 (1.44)    | 13.32 (1.84)
Contour smoothness                   | 0.953 (0.036)    | 0.949 (0.04)     | 0.89 (0.04)     | 0.93 (0.04)
AvgLL3*                              | 10.68 (2.26)     | 9.94 (2.22)      | 10.19 (1.72)    | 10.13 (1.41)
AvgAA3*                              | 0.41 (0.5)       | 0.54 (0.51)      | 0.61 (0.53)     | 1.15 (0.35)
AvgBB3*                              | 4.79 (0.92)      | 4.3 (0.9)        | 4.23 (0.67)     | 4.21 (0.87)
Edge percentage*                     | 25.69 (14.73)    | 10.74 (13.64)    | 6.32 (11.29)    | 7.56 (7.34)
Breedte histogram contrast*          | 17.06 (3.67)     | 10.46 (2.9)      | 9.96 (2.34)     | 10.08 (2.23)
Breedte histogram contrast relatief* | 35.51 (5.35)     | 26.63 (6.92)     | 26.25 (5.67)    | 21.96 (4.14)
LegendreEllips breedte*              | 57.89 (5.31)     | 65.76 (10.8)     | 59.98 (4.52)    | 41.24 (2.53)
LegendreEllips lengte*               | 22.17 (2.43)     | 19.07 (2.31)     | 18.27 (1.71)    | 22.62 (2.64)
Contour Max breedte*                 | 48.27 (5.49)     | 40.14 (4.51)     | 38.76 (3.39)    | 46.76 (5.45)
BreedteOffset*                       | 29.14 (12.17)    | 26.73 (11.88)    | 29.01 (13.29)   | 27.3 (11.5)
FeretMax*                            | 118.46 (11.05)   | 133.95 (22.19)   | 119.238 (8.79)  | 84.52 (6.26)
FeretMin*                            | 46.84 (5.39)     | 38.64 (4.55)     | 37.48 (3.33)    | 45.11 (5.63)
FeretRatio*                          | 2.56 (0.36)      | 3.47 (0.46)      | 3.2 (0.27)      | 1.89 (0.22)
Convexiteit                          | 0.989 (0.017)    | 0.99 (0.08)      | 0.97 (0.019)    | 0.99 (0.09)
Breedte asymmetry                    | 0.044 (0.412)    | 0.023 (0.447)    | 0.154 (0.404)   | 0.115 (0.559)
FeretAngle                           | 89.64 (7.14)     | 89.88 (4.14)     | 90.247 (3.05)   | 89.68 (6.62)

Looking at Table 3, we see that most of the variables differ significantly between seeds A and B. Additionally, we see that every variable can be used as a discriminating feature for at least one type of seed.

3.2 Data-issues and Cleaning

Before the data is used to train a model, we explore it and modify it where needed. This process is often called data cleansing; without it, it can be hard for the model to predict accurately. In this section the issues encountered in the data, as well as the fixes made, are discussed.

3.2.1 Quality of the Data

When looking at the images of the seeds, the first thing that stood out is that the background is noisy. This could disturb our model: it could try to categorize the seeds based on information in the background, which has no correlation with the type of seed depicted in the image. To prevent this, all backgrounds are made solid blue. The second thing we noticed is that only a small percentage of each image is occupied by the actual seed; the rest is background. Unfortunately, we are not able to crop the images, because we do not know exactly which part of the image is covered by the seed. This is not a problem, because every image now contains the same background and the model will only focus on the actual seed. Lastly, all seeds are centered so that each seed is in the middle of the image.

When looking closer at the data, a few more things stood out. First, some images contain just the seed cover and no actual seed. Second, some images contain only half a seed and are therefore not representative. Finally, some images contain a seed without its cover. All these images could influence the way the model classifies the seeds. Observations with only the cover or half a seed do not occur often and can be deleted without serious consequences, but observations of seeds without a cover occur too often to exclude from the data. The images of just the seed cover and of half a seed are therefore deleted from the set, while the images containing a seed without its cover are kept: such observations also occur often in the testing data and still need to be classified. When the cover is removed from a seed, the seeds look even more alike, and it is almost impossible for a human to discriminate between them.

3.2.2 Quantity of the Data

The previous paragraph showed that the data is unbalanced and described the corresponding problems. There are a few methods to deal with this. First, it is possible to collect more data, but this takes time. Second, it is possible to create "new observations" by using perturbed versions of the original observations, such as rotations and shifts. From every image, 48 perturbed versions are created, where every image is displaced and/or rotated a bit. Seeds A and B are very similar, while seeds C and D are much easier to recognize; for this reason, only perturbed images of seeds A and B are generated. First we separate A and B from the rest based on the original data, and then discriminate between seed A and B using the data with the perturbed versions. Last, it is possible to naively resample the data such that it is no longer skewed. There are two ways to do this. The first option is adding copies of instances from the under-represented classes, called over-sampling (see the sketch below). The second option is to remove observations from the over-represented class, called under-sampling. But in order to still have enough data to train the model, under-sampling would also require collecting more data; for this reason we will not use it. The descriptive statistics of all these options are given in the next paragraph.
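A minimal sketch of the over-sampling option in R, assuming a factor vector labels that indexes the observations; under-represented classes are duplicated with replacement until all classes match the majority count.

# Over-sampling: duplicate observations of under-represented classes
# until every class is as frequent as the largest class.
oversample_idx <- function(labels) {
  counts <- table(labels)
  target <- max(counts)
  unlist(lapply(names(counts), function(cl) {
    idx <- which(labels == cl)
    c(idx, sample(idx, target - length(idx), replace = TRUE))
  }))
}

# Example with the class frequencies from Table 1:
labels <- factor(rep(c("A", "B", "C", "D"), times = c(445, 2890, 292, 496)))
table(labels[oversample_idx(labels)])  # 2890 observations of each class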

3.2.3 Normalization

Overfitting is one of the problems that occurs when training CNNs. Different techniques to counter this have been proposed; one of them is normalization. Normalization is a process that gives the data a standard distribution. It can take different forms; the most common one is to subtract the mean of the data from each individual observation and divide by the standard deviation, which is called the standard score:

x' = \frac{x - \mu}{\sigma},

where \mu is the mean and \sigma is the standard deviation. We normalize the input data (images) by subtracting the mean RGB values and dividing by their standard deviation. Figure 3 gives an illustration of the normalization process: the second image is the data with the mean subtracted, the last one is the standard score.

Figure 3: Normalization process
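A sketch of the standard score in R, applied to one image; computing the mean and standard deviation per colour channel (rather than over all channels at once) is an assumption.

# Standard score per colour channel: x' = (x - mu) / sigma.
normalize_image <- function(img) {
  # img: 250 x 250 x 3 array of RGB values
  for (ch in 1:3) {
    img[, , ch] <- (img[, , ch] - mean(img[, , ch])) / sd(img[, , ch])
  }
  img
}

img <- array(runif(250 * 250 * 3), dim = c(250, 250, 3))
round(apply(normalize_image(img), 3, mean), 10)  # per-channel means are ~0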

3.3 Data Description after Preprocessing

The frequencies of the different seeds under the different techniques are given in Table 4. The sample size when using the under-sampling technique is so small that we expect no valid results; for this reason we will not train models on this (sub)sample. Figure 4 gives a preview of what the data looks like after cleaning.

Table 4: Frequencies of the different seeds using different techniques

Seed | Original | With perturbed versions | Over-sampling | Under-sampling
A    | 445      | 21,360                  | 2,890         | 292
B    | 2,890    | 138,720                 | 2,890         | 292
C    | 292      | 0                       | 2,890         | 292
D    | 496      | 0                       | 2,890         | 292


Figure 4: Image samples of the four different seeds: (a) Seed A, (b) Seed B, (c) Seed C, (d) Seed D


4 Research methods

A few steps have to be taken to develop efficient methods for classifying images of very similar seeds. First, before describing how the different methods work, we choose how to combine the metadata with the 'raw' images. Second, the network architectures to investigate have to be chosen; the training, testing and implementation of the models will also be described. Finally, the other machine learning techniques that are trained on the metadata are described.

4.1 Metadata Integration

The goal of this thesis is to develop a method to recognize different seeds using a combination of metadata and images. The handcrafted variables available for every seed are combined with the 'raw' images in order to increase prediction accuracy. Combining them can be done in a few different ways. First, the extra variables can be added to the first fully connected layer in the CNN (see Table 5). Alternatively, it is possible to train a model on just the handcrafted variables and a CNN on the 'raw' images, and combine the predictions of both models by taking a weighted sum of the two predicted probabilities that an observation falls into a certain class. The result is that if one method is confident about its prediction while the other is not, the weighted sum acts as a tie breaker. The no free lunch theorem (Wolpert & Macready, 1997) says that no single algorithm performs best in all cases; for this reason, we train a few algorithms and compare their performance.
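A sketch of the second option in R; p_cnn and p_rf are hypothetical N x 4 matrices of predicted class-membership probabilities from the CNN and the metadata model, and w is the weight given to the CNN.

# Weighted combination of two predicted probability matrices.
# w = 0.5 corresponds to taking the plain mean of the two predictions.
combine_predictions <- function(p_cnn, p_rf, w = 0.5) {
  p <- w * p_cnn + (1 - w) * p_rf
  colnames(p)[max.col(p)]  # class with the highest combined probability
}

# Tie-breaker behaviour: the CNN is unsure (A vs B), the second model is not.
classes <- c("A", "B", "C", "D")
p_cnn <- matrix(c(0.40, 0.35, 0.15, 0.10), 1, dimnames = list(NULL, classes))
p_rf  <- matrix(c(0.10, 0.60, 0.20, 0.10), 1, dimnames = list(NULL, classes))
combine_predictions(p_cnn, p_rf)  # "B"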


4.2 CNN

In this subsection, the CNNs used to recognize different seeds are discussed. The input data for these models are the 'raw' images, as described in the previous chapter. Three models based on winning network architectures in the ImageNet competition are used to classify our data. First, the Alexnet model based on research by Krizhevsky et al. (2012) is used. Second, the VGG network structure based on research by Simonyan and Zisserman (2014) is used. Finally, the Resnet network structure by He et al. (2016) is used. Note that all these architectures were developed to recognize images of basic-level categories and not specifically for fine-grained image recognition, although they are used for fine-grained image recognition as well (Zhang et al., 2016). The difference is that many researchers (Chai et al., 2013) used parts-based modelling to explicitly or implicitly find local parts and attributes to locate subtle differences in appearance across sub-categories. We hypothesize that the network is able to learn the distinctive features itself, because of the fairly simple structure of the seeds.

4.2.1 Architecture

The input we use to train the models is a fixed-size 250x250 RGB image. The image is passed through a number of convolutional layers, some of which are followed by spatial pooling layers. Alexnet and VGG use multiple max-pooling layers; Resnet uses only one average-pooling layer, after the last convolutional layer. The convolutional layers are followed by fully-connected (FC) layers, where Alexnet and VGG have two layers with 4096 channels each and a ReLU activation function. All networks contain an output layer that performs 4-way classification and thus contains 4 channels (one for each class). The final layer uses a softmax activation function to obtain probabilities. In some cases the dropout technique (Srivastava, Hinton, Krizhevsky, Sutskever & Salakhutdinov, 2014) is applied in the fully connected hidden layers. To reduce overfitting, dropout randomly drops connections while training the network.

4.2.2 Configurations

The configurations of the CNNs evaluated in this paper are presented in Table 5, one per column. The names of the layers are defined as follows:

[layer type][filter size†]-[number of filters/nodes†], † = optional.

For example, "conv11-96" would be a convolutional layer with 96 filters of size eleven. The configurations were based on results of Zeiler and Fergus (2014) and were found by doing a 10-fold cross-validation over different filter sizes and strides.

Figure 5: Identity mapping in residual blocks (Jay, 2018)

The first three structures follow the same generic design and differ only in depth. We have a small dataset, and our hypothesis is that this will cause the VGG network to give bad results. Additionally, the network has a lot of parameters: the computation time would be very high and the prediction time too slow for a live implementation. For those reasons we will not consider this network in the results. In addition to standard convolutional layers, the Resnet architecture uses residual connections in the model (Figure 5). The idea behind this is that, instead of hoping that each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping w.r.t. the identity. He et al. hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping.

The initialization of the weights of the network can influence the results. Depending on the choice of starting weights, the network may get stuck in a local minimum. To avoid this we choose our weights at random every time we train, making sure that the scale of the gradients is roughly the same in all layers. This is called Xavier initialization, from the work of Glorot and Bengio (2010). It helps keep the output from exploding to a high value or vanishing to zero, which would stop the network from learning or would cause the network weights to diverge. We choose the weights uniformly on [-c, c], where

c = 2.24 \sqrt{\frac{1}{2(n_{in} + n_{out})}},

with n_{in} the number of neurons feeding into the layer and n_{out} the number of neurons the layer feeds into.
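A sketch of this initialization in R, directly using the formula above; n_in and n_out are the fan-in and fan-out of the layer being initialized.

# Xavier-style initialization: weights uniform on [-c, c] with
# c = 2.24 * sqrt(1 / (2 * (n_in + n_out))).
xavier_init <- function(n_in, n_out) {
  c_bound <- 2.24 * sqrt(1 / (2 * (n_in + n_out)))
  matrix(runif(n_in * n_out, -c_bound, c_bound), n_in, n_out)
}

w_fc <- xavier_init(4096, 4)  # e.g. the final FC layer (4096 in, 4 out)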


Table 5: Network configurations of the different networks (columns)

Alexnet (8 layers) | Alexnet+ (9 layers) | VGG19 (19 layers) | Resnet18 (50 layers)
conv7-96           | conv7-96            | conv3-64 (x2)     | conv3-64 (x3)
maxpool3
conv5-192          | conv5-192           | conv3-128 (x2)    | conv3-128 (x3)
maxpool3
conv3-384          | conv3-384           | conv3-256 (x4)    | conv3-256 (x3)
maxpool3
conv3-384          | conv3-384           | conv3-512         | conv3-512
                   | conv3-384           | conv3-512         | conv3-512
                   |                     | conv3-512         | conv3-512
                   |                     | conv3-512         | conv3-512
conv3-256          | conv3-256           | conv3-512 (x4)    | conv3-512 (x4)
maxpool3           | maxpool3            | maxpool3          | avgpool3
Dropout (p = 0.5)
FC-4096
Dropout (p = 0.5)
FC-4096
Dropout (p = 0.5)
FC-4


4.2.3 Training and Testing

The training of the network is done by minimizing the softmax cross-entropy loss function using mini-batch gradient descent. Softmax is an activation function which allows us to interpret the outputs as probabilities of class membership. The probability returned by the softmax activation function that an observation falls into category j is defined as

P(y = j \mid z) = \sigma(z)_j = \frac{e^{z_j}}{\sum_{i=1}^{C} e^{z_i}} \equiv \hat{p}_j, \qquad j = 1, 2, 3, 4,

where z is the output of the final layer before the activation function. The cross-entropy loss function measures the difference between the predicted probabilities and the true probabilities, penalizing large differences logarithmically harder than small ones. The function is given by

L(w) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{C} w_i \, y_{in} \log(\hat{p}_{in}),

where y_{in} is 1 when observation n falls into category i and 0 otherwise, and \hat{p}_{in} is the estimated probability (coming from the softmax activation function) that observation n falls into category i. C denotes the number of categories and N the total number of observations. The weight of category i is denoted w_i; setting each weight to 1 over the number of observations of that category counters unbalanced data. This way every category contributes equally to the loss and the model does not tend to get stuck in the (biggest) category.
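Both formulas transcribed to R; z is a length-C vector of final-layer outputs, y a one-hot N x C matrix, p_hat the N x C matrix of softmax probabilities, and w the vector of class weights.

# Softmax over the final-layer outputs z.
softmax <- function(z) {
  e <- exp(z - max(z))  # subtracting max(z) leaves the result unchanged
  e / sum(e)            # but avoids numerical overflow
}

# Weighted cross-entropy loss over N observations and C categories.
cross_entropy <- function(y, p_hat, w) {
  -mean(rowSums(sweep(y * log(p_hat), 2, w, `*`)))
}

# Class weights as 1 over the number of observations per category (Table 1).
w <- 1 / c(A = 445, B = 2890, C = 292, D = 496)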

To find weights that minimize the value of the loss function on the training dataset, stochastic gradient descent (SGD) is used. SGD is an optimization algorithm that finds the weights of our model by calculating the slope of the loss function and changing the weights such that the loss moves towards its minimum. Mini-batch gradient descent calculates the loss and its derivative after each small subset of observations and then updates the weights. These small subsets of the data are called mini-batches, which gives the method its name.

In order to get reliable results, we perform a five-fold cross-validation: we split the dataset into five subsets, train our model on four of them and test on the remaining one. We do this for all five combinations and take the average of the achieved test accuracies. With five-fold cross-validation the test set consists of 20% of the data, which is over 800 observations. With more folds the test set would become so small that it becomes hard to draw conclusions from it.

Training is performed using the publicly available MXNet library. Acceleration libraries like MXNet offer powerful tools to exploit the full capabilities of GPUs and cloud computing. MXNet stands out because of its combination of high performance, clean code, access to a high-level API, and low-level control (Chen et al., 2015).

4.3 Random Forest

A random forest (Breiman, 2001) is an estimator that fits a number of decision tree classifiers on various sub-samples of the data and uses the average of all these tree classifiers to improve the predictive accuracy and control overfitting. We use a sub-sample size equal to the original input sample size, where the samples are drawn with replacement (bootstrap). A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute (e.g. whether our metadata variable 'size' is bigger than 3000), each branch represents an outcome of the test, and each leaf node represents a class label (the decision taken after all tests). The paths from root to leaf represent classification rules. The most important difference with a CNN, for this thesis, is that the features used here are determined by humans, whereas a CNN learns the attributes by itself and learns a classifier based on those attributes. A grid search is done to find the best hyperparameters (number of trees, etc.) for the random forest; the settings that we tested are presented in Table 6. In order to find the best combination, a 3-fold cross-validation is performed.
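A sketch of this search in R with the randomForest package, assuming the metadata sits in a data frame x with the labels in a factor y; the package choice and the mapping of "max depth" onto maxnodes (a tree of depth d has at most 2^d leaves) are assumptions, since the thesis does not name the implementation used for the classical models.

library(randomForest)

# 3-fold cross-validated accuracy for one hyperparameter setting.
# 'x' holds the 25 metadata variables, 'y' the factor of seed types A-D.
cv_accuracy <- function(x, y, ntree, maxnodes, nodesize, k = 3) {
  folds <- sample(rep(1:k, length.out = nrow(x)))
  mean(sapply(1:k, function(f) {
    fit <- randomForest(x[folds != f, ], y[folds != f], ntree = ntree,
                        maxnodes = maxnodes, nodesize = nodesize)
    mean(predict(fit, x[folds == f, ]) == y[folds == f])
  }))
}

# Grid from Table 6: N estimators 1-20, max depth 1-10, min leaf size 1-10.
grid <- expand.grid(ntree = 1:20, maxnodes = 2^(1:10), nodesize = 1:10)
scores <- apply(grid, 1, function(g)
  cv_accuracy(x, y, g["ntree"], g["maxnodes"], g["nodesize"]))
grid[which.max(scores), ]  # best hyperparameter combination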

Table 6: Grid-search intervals for the hyperparameters

Random Forest     | N estimators 1-20; max depth 1-10; min samples per leaf 1-10; criterion function Gini or Entropy
SVM               | kernel linear or RBF; C 1-10
Gradient Boosting | N estimators 200-350; max depth 1-5

4.4 SVM

An SVM classifies points by assigning them to one of two disjoint half-spaces in a higher-dimensional feature space. In the case of more than two classes, the "one-against-one" approach (Knerr et al., 1990) is used, so for four classes (4 * 3/2 =) 6 classifiers are trained. The idea of the support vector machine method is to construct a multi-dimensional hyperplane as a decision surface such that the margin of separation between observations on both sides of the hyperplane is maximized. The SVM is trained on the metadata described in the previous chapter. A grid search is done to find the best hyperparameters for the SVM: a linear or radial basis function (RBF) kernel in combination with different values of the C parameter. The settings that we tested are presented in Table 6. In order to find the best combination, a 3-fold cross-validation is performed.
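A sketch with the e1071 package, whose svm() already uses the one-against-one scheme for multi-class problems; the package choice and the meta data frame with a factor column seed are assumptions.

library(e1071)

# 3-fold cross-validated grid search over C for one kernel;
# repeat with kernel = "linear" and keep the better of the two.
tuned <- tune(svm, seed ~ ., data = meta, kernel = "radial",
              ranges = list(cost = 1:10),
              tunecontrol = tune.control(cross = 3))
tuned$best.parameters  # best value of C for the RBF kernel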


4.5 Boosting Classifier

A boosting classifier is an ensemble algorithm that uses weak learners to create one strong learner (Schapire, 1990). The learners are trained sequentially, with each new learner modeling the residuals of the previous ones. We use small decision trees, or even trees with just one node and two leaves, as our weak learners. Again, a grid search is done to find the best hyperparameters for the boosting classifier, testing a few different numbers of boosting stages. The settings that we tested are presented in Table 6. In order to find the best settings, a 3-fold cross-validation is performed.
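A sketch using the gbm package as one possible implementation (an assumption; any gradient boosting library with a stage count and a tree depth parameter fits the description), with stumps as weak learners.

library(gbm)

# Boosting with stumps (interaction.depth = 1: one split, two leaves)
# as weak learners; n.trees is the number of boosting stages.
fit <- gbm(seed ~ ., data = meta, distribution = "multinomial",
           n.trees = 300, interaction.depth = 1)
probs <- predict(fit, meta, n.trees = 300, type = "response")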


5 Visualizing and Understanding CNNs

To better understand the workings of the neural network, we visualize it. The visualization methods used here come from the work of Zeiler and Fergus (2014, pp. 818-833). We combine those techniques with the metadata that we have extracted from every image to better understand what each filter reacts to.

5.1 Weight Visualization

Figure 6: Weight visualization

A common strategy is to visualize the weights. The first convolutional layer uses the raw image data as input and because of that has the most recognizable filters. It is not always clear what feature a filter tries to capture, but it is still useful to visualize the weights: well-trained networks usually display nice and smooth filters without noisy patterns. When a network contains filters with noisy patterns, this can indicate a few things: first, that the network has not been trained long enough, or secondly, that the network has been overfitting, caused by a very low regularization strength.

Visualizing the weights is done by portraying what a filter reacts to. The filter weights get multiplied with the RGB values of the images, so the more an image corresponds to the filter weights, the higher the output. If a filter has high values in the blue layer and not in the red or green layer, it reacts strongly to images containing a relatively high amount of blue. And if a filter has high values on its diagonal, it reacts strongly to images or parts of images that also have a diagonal stripe. So the weights of a filter can be seen as RGB values, and the weights can be plotted as images (Figure 6).

5.2 Occlusion Sensitivity

We may want to know which parts of the image determine the correct classification, and that the classifier does not base its decision on parts of the image that play no role. We can determine this by systematically occluding parts of the input image: we place a white square over a part of the image and then feed it to our classifier. By calculating the difference in probability for the correct class, we see which parts of the image are important for the classification. We can visualize those differences in probability in a heatmap (see Appendix C for an example).
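A sketch of the occlusion procedure in R; predict_prob is a hypothetical function returning the probability of the correct class for one image, standing in for the trained CNN.

# Slide a white square over the image and record, per position,
# how much the probability of the correct class drops.
occlusion_map <- function(img, predict_prob, size = 25, stride = 25) {
  base <- predict_prob(img)
  pos <- seq(1, dim(img)[1] - size + 1, by = stride)
  heat <- matrix(NA, length(pos), length(pos))
  for (i in seq_along(pos)) {
    for (j in seq_along(pos)) {
      occluded <- img
      occluded[pos[i]:(pos[i] + size - 1),
               pos[j]:(pos[j] + size - 1), ] <- 1  # white square
      heat[i, j] <- base - predict_prob(occluded)
    }
  }
  heat  # high values mark regions important for the classification
}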

5.3 Neuron Activation

To get an understanding of what a specific neuron is trying to capture, we can find the images that maximize its activation. We do this by feeding all the images to the network and, for every neuron in the last convolutional layer, collecting the five images that maximize its activation. For example, if the last convolutional layer consists of 256 neurons (as in the case of Alexnet), we have 256 times five images. We have chosen to display the top five responses for reasons of convenience.

To even better understand what the neurons look for, we try to link them to metadata variables. We do this by collecting not only the five images that maximize a neuron's activation, but also the five images that minimize it. The metadata of those images is known, so we can look for large differences in the metadata of the two groups of images. We define the difference as

\text{Difference} = \frac{\frac{1}{5}\sum_{i \in B} x_i - \frac{1}{5}\sum_{j \in W} x_j}{\frac{1}{5}\sum_{i \in B} x_i + \frac{1}{5}\sum_{j \in W} x_j},

where set B contains the metadata of the images which maximize the activation and W contains the metadata of the images minimizing the activation. The metadata variables that show the biggest difference are possible properties that the neuron uses to discriminate.
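A direct transcription of this measure in R; best and worst are 5 x 25 matrices holding the metadata of the five maximizing (B) and five minimizing (W) images.

# Relative difference per metadata variable between the five images
# maximizing and the five images minimizing a neuron's activation.
activation_difference <- function(best, worst) {
  (colMeans(best) - colMeans(worst)) / (colMeans(best) + colMeans(worst))
}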


6 Results and Empirical Analysis

In this chapter the results of this research are presented. In the first section we present different models and configurations trained on the images. In the second section, models trained purely on metadata are presented. After that, the results of the models that use a combination of 'raw' images and metadata are displayed. A visualization of the workings of the best performing CNN is presented, as described in the previous chapter. Finally, the results are compared and interpreted.

To present and evaluate our results, a few statistics are shown. First of all, to evaluate the predictions we compute the accuracy of every model. Secondly, we are interested in the precision (P) and recall (R) of the predictions. Most importantly, we want the seeds that we classify as class B to indeed be of class B, so we want a high precision for seed B. Another statistic that we use is Cohen's kappa coefficient, which corrects the accuracy for the possibility of a correct classification occurring by chance.

The kappa measure is defined as

\text{ChanceAgreement} = \frac{1}{100\%} \sum_{i=1}^{4} \%\text{Predicted}_i \cdot \%\text{Real}_i,

\kappa = \frac{\text{TestAccuracy} - \text{ChanceAgreement}}{100\% - \text{ChanceAgreement}}.

We also introduce an overfit measure to compare the degree of overfitting between models. The overfit measure is defined as

\text{OverM} = \frac{\text{Train Accuracy}}{\text{Test Accuracy}} - 1.
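Both statistics transcribed to R; cm is a 4 x 4 confusion matrix with the real classes in rows and the predictions in columns, as in Tables 8-11.

# Cohen's kappa from a confusion matrix (real in rows, predicted in columns).
cohens_kappa <- function(cm) {
  accuracy <- sum(diag(cm)) / sum(cm)
  chance <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # chance agreement
  (accuracy - chance) / (1 - chance)
}

# Overfit measure: relative gap between train and test accuracy.
overfit_measure <- function(train_acc, test_acc) train_acc / test_acc - 1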


6.1 CNN based on images

The results of different CNN architectures and configurations that are used to classify our data are presented in Table 7. The scores reported here are achieved on the test set which consists of 20% of the data. Alexnet with normalization and dropout performs best among our selection of models. The accuracy, recall and precision are the highest when using those techniques.

We also see that using oversampling does not make a difference in performance. In theory the networks are exactly the same; the only difference is that with oversampling the network needs fewer epochs to get good results (see pages 41-44). That it needs fewer epochs is a distorted picture, though: the dataset is simply larger, and both networks need the same number of observations to learn.

Table 7: Results of different CNNs

Network               | Accuracy | R (B) | P (B) | κ     | OverM
Alexnet (Norm & drop) | 99.0%    | 99.2% | 99.6% | 0.979 | 0.010
Alexnet (Batch Norm)  | 98.1%    | 98.9% | 98.5% | 0.959 | 0.019
Alexnet               | 97.6%    | 98.1% | 98.8% | 0.949 | 0.025
Alexnet (Oversample)  | 97.6%    | 97.9% | 99.0% | 0.949 | 0.025
Resnet18              | 97.4%    | 98.9% | 97.8% | 0.943 | 0.027
Alexnet+              | 94.8%    | 96.0% | 97.4% | 0.887 | 0.055

The reason that normalization of the images and dropout in the fully connected layers give the best results is that they reduce overfitting. Every network reported here is able to almost perfectly predict every training-set observation, but some overfit more than others. Because we can almost perfectly predict our training set, our loss is also very close to zero, so choosing other weights in our loss function would not be beneficial; reducing overfitting, on the other hand, would be. Normalization reduces overfitting by making sure that the images have the same distribution every time. Additionally, the dropout technique prevents overfitting by using fewer connections between hidden nodes while training.

A possible reason that more extensive models do not perform better is that seeds do not contain really detailed features, so a few convolutional layers should be enough to capture the features of those seeds. Resnet18 and Alexnet+ perform worse than the other models because they have more parameters and layers and are better at modeling the noise in the training data, which is not what we want. We see that the Alexnet network with one additional layer performs worse than the original Alexnet network; this justifies our decision to exclude the VGG19 network.

The (un)certainty of a prediction is measured by the predicted class membership probability. The certainty of the incorrectly predicted observations lies below 53%, while only 3.3% of the correctly predicted observations have a certainty below 53%. If correct classification is important, it is possible to, for example, assign an observation to a class only if the certainty is above 60% and discard it otherwise.


Original Alexnet

Below are the results of training an Alexnet network structure. Figure 7a shows the training (red) and test (green) accuracy against the number of epochs; this is an increasing function with some drops caused by a high learning rate. Figure 7b shows a moving average of the loss.

Figure 7: Results of Alexnet: (a) accuracy, (b) loss

Table 8: Confusion matrix of the original Alexnet

Real \ Predicted  | A           | B           | C          | D          | Total (Recall)
A                 | 93          | 5           | 0          | 2          | 100 (93.0%)
B                 | 8           | 568         | 3          | 0          | 579 (98.1%)
C                 | 0           | 1           | 55         | 0          | 56 (98.2%)
D                 | 0           | 1           | 0          | 89         | 90 (98.9%)
Total (Precision) | 101 (92.1%) | 575 (98.8%) | 58 (94.8%) | 91 (97.8%) | 825 (97.6%)
Kappa: 0.9494


Alexnet with Oversample

Below are the results of training an Alexnet network structure while oversampling the data. Figure 8a shows the training (red) and test (green) accuracy against the number of epochs; this is an increasing function with some drops caused by a high learning rate. Figure 8b shows a moving average of the loss.

Figure 8: Results of Alexnet with Oversampling. (a) Accuracy; (b) Loss.

Table 9: Alexnet with Oversampling

Real \ Predicted     A            B            C           D           Total (Recall)
A                    92           3            0           1           96 (95.8%)
B                    8            569          2           2           581 (97.9%)
C                    0            2            56          0           58 (96.6%)
D                    1            1            0           88          90 (97.8%)
Total (Precision)    101 (91.1%)  575 (99.0%)  58 (96.6%)  91 (96.7%)  825 (97.6%)

Kappa: 0.9493


Alexnet with Dropout and Normalization

Below are the results of training an Alexnet network structure with dropout and normalization of the input data. Figure 9a shows the training (red) and test (green) accuracy against the number of epochs. This is an increasing function with some drops caused by a high learning rate. Figure 9b shows a moving average of the loss.

Figure 9: Results of Alexnet with dropout and normalization. (a) Accuracy; (b) Loss.

Table 10: Alexnet with dropout and normalization

Real \ Predicted     A           B            C           D            Total (Recall)
A                    87          1            1           0            89 (97.8%)
B                    3           516          1           0            520 (99.2%)
C                    0           1            58          1            60 (96.7%)
D                    0           0            0           99           99 (100%)
Total (Precision)    90 (96.7%)  518 (99.6%)  60 (96.7%)  100 (99.0%)  768 (99.0%)

Kappa: 0.9794


Resnet18

Below are the results of training a Resnet18 network structure. Figure 10a shows the training (red) and test (green) accuracy against the number of epochs. This is an increasing function with some drops caused by a high learning rate. Figure 10b shows a moving average of the loss.

Figure 10: Results of Resnet18. (a) Accuracy; (b) Loss.

Table 11: Resnet18

Real \ Predicted     A           B            C           D           Total (Recall)
A                    81          3            0           0           84 (96.4%)
B                    4           540          2           0           546 (98.9%)
C                    1           3            48          0           52 (92.3%)
D                    0           6            1           79          86 (91.9%)
Total (Precision)    86 (94.2%)  552 (97.8%)  51 (94.1%)  79 (100%)   768 (97.4%)

Kappa: 0.9435


6.1.1 Effect Size Fully Connected Hidden Layers

The basic Alexnet architecture uses 4096 hidden neurons in the fully connected layers, which makes sense because it has to distinguish 1000 classes. In our case there are only four classes, so it makes sense to make the hidden layers smaller. We tried a few different settings; the results can be found in Table 12. We used dropout with probability 0.5 in all fully connected layers.

Table 12: Results of different numbers of hidden neurons in the FC layers

#Hidden Neurons  Accuracy  Convergence at Epoch  Overfit Measure
128              97.14%    40                    0.03
256              97.43%    60                    0.02
512              97.79%    280                   0.02
1024             97.92%    350                   0.02
2048             98.44%    450                   0.02
4096             98.96%    550                   0.01
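A minimal sketch of the fully connected head varied in these experiments, with a configurable hidden width (this subsection) and dropout probability (next subsection); the 256*6*6 input size follows the standard Alexnet convolutional output, and the names are illustrative.

```python
import torch.nn as nn

def make_fc_head(n_hidden=4096, p_drop=0.5, n_classes=4):
    """Fully connected head with a configurable number of hidden neurons
    (varied in Table 12) and dropout probability (varied in Table 13)."""
    return nn.Sequential(
        nn.Dropout(p=p_drop),
        nn.Linear(256 * 6 * 6, n_hidden),  # 256*6*6: Alexnet conv output
        nn.ReLU(inplace=True),
        nn.Dropout(p=p_drop),
        nn.Linear(n_hidden, n_hidden),
        nn.ReLU(inplace=True),
        nn.Linear(n_hidden, n_classes),
    )
```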

We see that more hidden neurons lead to a higher test accuracy, but also that it takes longer for the network to converge to a solution (train accuracy above 99%). The hypothesis that an excessive number of hidden neurons leads to more overfitting seems to be false. The explanation might be that the dropout in the layers prevents this from happening, but even when no dropout is used the difference in the overfit measure is not significant. When using 64 hidden neurons the overfit measure does decrease significantly, but the model gets stuck around 92% accuracy. So dropout does not seem to affect overfitting for different numbers of hidden neurons.


6.1.2 Effect Dropout Probability

Until now we used dropout with a probability of 0.5, because this value yields the highest variance in the distribution over network architectures. Other probabilities can be used as well; Table 13 shows the results for probabilities of 0.2, 0.5 and 0.8. All training sessions use a learning rate of 0.001, which explains the slower convergence compared to the convergence shown on pages 41-44.

Table 13: Results of different dropout probabilities

Dropout probability  Accuracy  Convergence at Epoch  Overfit Measure
0.2                  98.1%     280                   0.023
0.5                  98.9%     510                   0.022
0.8                  98.7%     600                   0.019

First, we see that a dropout probability of 0.5 performs best, but that 0.8 performs almost the same, which is in line with the results Srivastava et al. (2014) found in their paper. Secondly, we see that the higher the probability, the longer it takes the network to converge to a solution. This makes sense: when 80% of the connections are dropped, only 20% of the weights are trained during that epoch, so it takes longer until all connections are well trained.

6.2 Classifiers Based on Metadata

The results of a few different classifiers trained on the metadata are presented in Table 14. The scores reported here are achieved on the test set, which consists of 20% of the data. The SVM classifier reported here corresponds to the method currently used in seed sorting with machine vision.

The hyperparameters of the classifiers reported here are found using a grid search. The optimal parameters for the random forest are an entropy criterion function with 19 trees in the forest. The gradient boosting algorithm uses 195 boosting stages, which leads to the highest score on the test set. Note that we would expect overfitting with that many boosting stages, but looking at the results this does not seem to be the case. The neural network uses two hidden layers with 64 nodes and Relu activation functions. The SVM uses linear kernels and a C value of 2. The results of a KNN classifier, which uses two neighbours to make its predictions, are also reported. The fitted decision tree is shown in Figure 11.
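For illustration, a minimal grid search for the random forest could look as follows in scikit-learn; the grid and the variable names (`X_train_meta`, `y_train`) are assumptions, not the exact setup used here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the actual search space is not reproduced here.
param_grid = {
    "criterion": ["gini", "entropy"],
    "n_estimators": list(range(5, 51)),
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train_meta, y_train)
print(search.best_params_)  # e.g. {'criterion': 'entropy', 'n_estimators': 19}
```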

Table 14: Results of different classifiers based on metadata

Classifier        Accuracy  Recall (B)  Precision (B)  Kappa
Random Forest     98.8%     99.5%       98.8%          0.9744
Gradient Booster  98.5%     99.0%       99.1%          0.967
Neural Network    98.5%     98.8%       99.1%          0.9696
SVM (Linear)      97.4%     97.7%       98.4%          0.9443
Decision Tree     96.8%     97.1%       98.4%          0.9296
KNN               88.7%     95.7%       88.9%          0.7399

First, we see that the random forest predicts with the highest accuracy. Secondly, we see that the gradient booster and the neural network have a higher precision when predicting seed class B, so when precision is paramount those classifiers could be considered. The neural network trained on the metadata is not the best performing technique but comes close to the random forest. Linear SVMs such as the one used here are equivalent to single-layer neural networks, so the neural network can outperform the SVM by using multiple layers. When we look at the errors that the models make, we see that the different classifiers are less correlated than we would expect, and combining multiple models could therefore increase performance.


Comparing the CNNs with the classifiers based on metadata, the differences are small. The best performing CNN achieves a slightly higher accuracy than the best performing classifier based on metadata. We see that the SVM that is currently used is comparable with a basic Alexnet or Resnet architecture. Only the KNN classifier performs significantly worse than the other classifiers and CNNs. The fact that the results of classifiers based on metadata and CNNs do not differ much could indicate that our handcrafted features are well defined and able to discriminate well.


Figure 11: Decision tree trained on the metadata


6.3 Combining images with metadata

By combining the predicted probabilities of a model trained on the images with those of a model trained on the metadata we obtain new results, presented in Table 15. For every combination the accuracy is reported.

Table 15: Accuracy of combining image-based and metadata-based classifiers

Network                 Random Forest  Gradient Booster  Neural Network
Alexnet (Norm & drop)   99.64%         99.39%            99.14%
Alexnet                 98.14%         98.91%            98.8%

We see that the accuracy of our predictions increases when we take the average of the predicted probabilities, with the random forest and our best performing CNN as the best performing combination. Looking at which observations get classified incorrectly, all combinations make the same mistakes as the best performing combination. The fact that the accuracy of our best performing CNN increases when combined with a random forest could indicate that the CNN misses some of the features in our metadata; it could be that the CNN has difficulty capturing some of the variables in our metadata. We can make an importance ranking of the metadata variables by looking at how much each variable decreases the loss of our random forest. Besides that, we have collected which metadata variables correspond to each filter in the last convolutional layer and made a ranking of the variables according to the CNN. When we compare these two rankings we see some differences. For example, the CNN attaches more importance to color than the random forest does, while the random forest uses intensity as a discriminating feature.
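The fusion itself is a simple average of class membership probabilities; a sketch, assuming both models output probabilities for the same test seeds in the same class order:

```python
import numpy as np

# proba_cnn: softmax outputs of the CNN, shape (n_samples, 4)
# proba_rf:  rf.predict_proba(X_test_meta), same shape and class order
proba_combined = (proba_cnn + proba_rf) / 2.0
pred_combined = np.argmax(proba_combined, axis=1)
```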

It is hard to draw strong conclusions from these results, because an increase in accuracy of 0.12 percentage point corresponds to classifying just one more observation correctly. The reason for this is the small dataset of only 4123 observations: when testing on 20% of the data, every observation is worth 0.12 percentage point. To draw stronger conclusions this research should be repeated with a bigger dataset.


6.4 Visualizing and Understanding CNNs

In this section we present visualizations of the best performing CNN. All three visualization methods described in chapter 5 are discussed, as well as the insights that can be extracted from them.

6.4.1 Weight visualization

The weights of the first layer of the Alexnet network (94 filters) are portrayed in Figure 12. Figure 12a displays the weights of the Alexnet network trained for only 10 epochs. Next to it, Figure 12b shows the weights of the Alexnet network that uses dropout in the fully connected layers and normalization of the input data, trained for 8 times as many epochs. Of each filter a grey and a coloured version is presented, where the first grey filter (upper left corner) corresponds to the first coloured filter.

Figure 12: Weight visualization. (a) Alexnet after 10 epochs; (b) Alexnet (Norm & Drop) after 80 epochs.

The filters in Figure 12a appear noisier than the filters in Figure 12b. This suggests that the second network is not overfitting as much as the first. Some of the filters on the right are still noisy, which could indicate that the network should be trained even longer or that the regularization strength of the network is still too low.
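A sketch of how such a weight visualization can be produced, assuming a torchvision-style model whose first convolutional layer is `model.features[0]`:

```python
import math
import matplotlib.pyplot as plt

# First-layer filters, shape (n_filters, 3, k, k).
w = model.features[0].weight.data.cpu()
w = (w - w.min()) / (w.max() - w.min())  # rescale to [0, 1] for display

cols = 12
rows = math.ceil(w.shape[0] / cols)
fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
for ax in axes.flat:
    ax.axis("off")
for ax, f in zip(axes.flat, w):
    ax.imshow(f.permute(1, 2, 0))  # channels last for imshow
plt.show()
```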

6.4.2 Occlusion Sensitivity

We have covered parts of the input images to find the prediction's sensitivity to every part of the image. Figure 13 shows an example of an image and a heat map indicating which parts are essential for the prediction.

Figure 13: Occlusion sensitivity

On the right we see the original image and on the left an intensity map of the change in probability when that part of the image is covered, where white indicates a high change in probability and black little to no change. In this case the two ends of the seed are essential for determining which kind of seed it is. This makes sense because this seed is of a class with relatively small seeds, so by knowing where it starts and stops you know its size and, in this case, its class. Additionally, the shape of the ends of a seed is a distinctive feature according to experts. In a lot of cases (see Figure 15) covering one particular part of the seed gives a high change in probability. This does not mean that the rest of the seed is unimportant for the classification, but rather that at least that part is important.
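A sketch of the occlusion procedure, assuming a trained `model` and a normalized `(3, H, W)` input tensor; the patch size and stride are illustrative.

```python
import torch
import torch.nn.functional as F

def occlusion_map(model, image, target, patch=16, stride=8):
    """Slide a grey patch over the image and record the drop in the
    predicted probability of `target`; large drops mark essential parts."""
    model.eval()
    _, H, W = image.shape
    heat = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
    with torch.no_grad():
        base = F.softmax(model(image.unsqueeze(0)), dim=1)[0, target]
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = 0.5  # grey patch
                prob = F.softmax(model(occluded.unsqueeze(0)), dim=1)[0, target]
                heat[i, j] = base - prob
    return heat
```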

6.4.3 Neuron activation

We have collected the 5 images that maximize the activation of each of the 256 filters in the last convolutional layer. Figure 14 displays the results for one particular filter: these five images result in the highest output of that filter.


Figure 14: Top 5 images that maximize neuron activation

As expected, the images that respond well to this filter have a lot in common. All images displayed here are of the same seed class, which indicates that this filter discovered some distinctive features. When looking at the metadata we see that the variable this filter corresponds to most is 'edge percentage', which is an indication of the smoothness of the outside of the seed. All images displayed here have uneven surfaces, so this neuron seems to discriminate on surface smoothness. Other metadata variables that correspond well with this filter are color and size. Again, looking at the images we can easily see that they have those properties in common as well.

The fact that the images have a lot in common indicates that the network behaves well. It is able to discriminate based on features of the seeds and, moreover, to discover features that distinguish classes from each other. We have shown one filter here, but the conclusions drawn above hold for almost all neurons in the last convolutional layer (see Figure 16). By aggregating the features that correspond with each filter we get an indication of how important every metadata variable is according to our CNN. We see that the most common features the filters react to are color, edge percentage (surface smoothness) and 'Breedte asymmetry'.
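A sketch of how the top-5 images per filter can be collected with a forward hook; `conv5`, `dataset` and `FILTER_ID` are illustrative names, and the loop would be repeated for each of the 256 filters.

```python
import heapq
import torch

acts = {}
def save_act(module, inp, out):
    acts["out"] = out.detach()

# conv5 is assumed to be the last convolutional layer of the network.
hook = conv5.register_forward_hook(save_act)

FILTER_ID = 0   # illustrative: one of the 256 filters
top5 = []       # min-heap of (mean activation, image index)
model.eval()
with torch.no_grad():
    for idx, (img, _) in enumerate(dataset):
        model(img.unsqueeze(0))
        score = acts["out"][0, FILTER_ID].mean().item()
        heapq.heappush(top5, (score, idx))
        if len(top5) > 5:
            heapq.heappop(top5)
hook.remove()
best = sorted(top5, reverse=True)  # the five strongest-responding images
```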

Seeds that react strongly to the same filter are usually rotated in the same way; the filters seem to be sensitive to rotation. Because of this, the results did not improve when training on the perturbed dataset described in chapter 3. An idea would be to rotate all seeds to the same position during the preprocessing of the data.

