
The use of image classification in the estimation of price indexes

Carmen Wolvius

Supervisors: Raoul Grasman, University of Amsterdam & Edwin de Jonge, Statistics Netherlands

Price indexes are important statistical measures, used for example to determine economic inflation. Unfortunately, they are a subject of ongoing discussion and there is no consensus yet on a reliable estimation method. The increasing amount of information available in webshops might provide new ways to approach the estimation of price indexes. One way is to use automatic image classification to assign category labels to product images; these category labels can then be used to estimate price indexes. In this paper, clothing images from the webshop Wehkamp are used to investigate whether image classification is a reliable classification method for clothing items. For the image classification we constructed a k-nearest neighbor algorithm and a convolutional neural network. We compared both models with a rule-based method that uses keywords as a reference. In the comparison we focused on classification accuracy and efficiency, and we estimated the additional value of image classification in the estimation of price indexes.

Introduction

Price indexes are statistical measures that represent the ratio of a magnitude to a reference value. For example, the price of a product in the year 2000 can be used as a reference for the price of the same product in 2001, 2002, up to 2015. By setting a reference value, the development of these prices can be evaluated and tracked over a certain period (Espejo, 2005). Price development is an important measure of economic development: economic inflation, for example, is determined using price indexes for a wide range of products. Price indexes have been the subject of much debate (de Haan & Hendriks, 2013), and there is no consensus yet on a reliable estimation method. Price indexes can be calculated in two different ways. Originally, a price index was calculated based on the price trends of samples of different products. These sample products are randomly selected from a store and their prices are tracked over time. This method, however, is subject to several forms of bias. The biases that arise are mostly

due to the variety in shops, variation in the assortments of shops, periods of sale, and seasonality, meaning that a product is only sold in a certain season, like shorts in the summer. These biases cause items to be difficult to track or even to disappear. This leads to missing values, which makes a price index less reliable. A price index can also be determined by defining specific categories and calculating a mean value per category, for example 'woman's woolen sweaters'. Instead of the actual price, changes in the mean value are observed. This way, specific items do not have to be tracked, but are replaced by a new set of items from the same category. This method is much less subject to bias; however, it requires considerably more time and effort because the required sample has to be much larger.
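To make the category-mean approach concrete, the following is a minimal sketch with invented prices for one hypothetical category; it is for illustration only and is not a computation from the paper.

```python
# Minimal sketch of the category-mean approach, with hypothetical prices.
# The index expresses the mean price of a category in a period relative
# to a base period (base = 100).

prices_2000 = [24.99, 29.99, 19.99]  # sampled woman's woolen sweaters, 2000
prices_2001 = [26.49, 31.99, 22.99]  # a new sample from the same category, 2001

def mean(values):
    return sum(values) / len(values)

index_2001 = 100 * mean(prices_2001) / mean(prices_2000)
print(f"Category index for 2001 (2000 = 100): {index_2001:.1f}")  # ~108.7
```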

Recently, Statistics Netherlands started experimenting with automated processes to determine price indexes. The advantage of an automated process is that larger samples can be used without the process becoming more labor-intensive. Currently, a rule-based method is implemented that uses keywords within product descriptions to classify products. The products and product descriptions are first collected by an internet robot that scans webshops and then divided into categories by the rule-based method. The rules that are used for classification are very basic; the method therefore requires a large amount of supervision, maintenance and manual correction of mistakes. To bring down the amount of human interference, Statistics Netherlands has been experimenting with machine learning algorithms that automatically classify items based on product descriptions (de Boer, 2015). This automatic text classification was tested on clothing items and uses the description accompanying an item to indicate which category the item belongs to. The results of this algorithm have been promising, but it has some particular issues. For example, an item described as a 'laptop' ends up classified as a 'top', and a multicolored item described as 'blue sweater with yellow dots' is classified as both yellow and blue, even if it is primarily a blue sweater. These issues are direct limitations of the use of text and might be mitigated when the images of the items are used as well. Image classification based on machine learning algorithms has shown great results, by itself or in combination with item descriptions (Guillaumin, Verbeek, & Schmid, 2010; Simo-Serra, Fidler, Moreno-Noguer, & Urtasun, 2014).


Figure 1: Example of overfitting a sine function. The points represent individual data points originating from a sine function with added noise. The green solid line approximates the data points and estimates the underlying sine function. The red dotted line overfits the data by passing through every data point; it therefore does not generalize to the population the sample belongs to.

The state-of-the-art algorithms are able to classify images correctly up to 70 percent of the time using just the pixel values. This paper investigates the possibility of using image classification to classify the products that are used for price indexes. The recent increase in clothing webshops has made it relatively easy to gather information about clothing items; this project therefore focuses on the use of image classification in the clothing industry. Different image classification techniques will be discussed and evaluated on two datasets.

Image Classification

When an image is presented to a computer it is a 3-dimensional array of numbers: one dimension represents the width of the image, one the height, and one the number of color channels. To classify an image, a function has to map all these numbers onto a category label. Instead of specifying all features of these categories by hand, an algorithm can derive the specifics of a category by learning them from a set of example images. These examples, with corresponding category labels, allow the algorithm to distinguish the different categories and to assign category labels to new images. This is done by a machine learning algorithm.
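As a small illustration of this representation (a sketch, not code from the paper), the following shows an image as a 3-dimensional array and its flattening into the single feature vector that classifiers operate on; the dimensions match the Wehkamp image size reported in the Methods section.

```python
import numpy as np

# A color image as a 3-D array of numbers: height x width x color channels.
# 315 x 210 x 3 matches the Wehkamp image size mentioned in the Methods.
image = np.random.randint(0, 256, size=(315, 210, 3), dtype=np.uint8)

# A classifier maps all these numbers onto one category label; the nearest
# neighbor method below first flattens the array into one feature vector.
features = image.reshape(-1)
print(image.shape, features.shape)  # (315, 210, 3) (198450,)
```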

There are many image classification methods available for such a learning approach. These methods can be divided into parametric and non-parametric classifiers. Parametric classifiers require a training phase in which the model parameters are repeatedly adjusted to fit a training dataset. This training phase can be quite intensive, which can lead to very specific parameter weights for the selected categories. This specificity generally leads to parametric classifiers outperforming non-parametric classifiers. The intensive training phase, however, also carries the risk of adjusting the parameters to random noise in the training set. When this happens, the parameter weights change due to noise instead of an actual feature of the category. This is called overfitting. Overfitting results in weights that are not representative of the population and that, therefore, do not fit the test set (Bishop, 2006). Figure 1 shows an example of a model that overfits a dataset.

Non-parametric classifiers do not require a training phase, but only compare test images to a set of training images that have already been assigned a category label. The model assigns the test image to the category of the image it resembles most (Bishop, 2006). Because non-parametric models have no parameters to estimate, they cannot overfit in this way. However, it is important for a non-parametric classifier that the training set is representative of the population, because it needs representative examples when identifying test images. A non-representative training set has an effect similar to overfitted parameters in a parametric model. The lack of a training phase also makes a non-parametric method less computationally intensive. For these reasons Boiman, Shechtman, and Irani (2008) advocate the use of the k-nearest neighbor (k-NN) algorithm, a straightforward non-parametric classifier. The input of this classifier consists of the k closest training examples in the feature space. The output can be a property value in the case of regression, or a class membership in the case of classification, depending on whether the outcome variable is continuous or categorical.

To indicate whether a non-parametric or a parametric method suits the current problem better, both methods will be tested and evaluated. The k-NN approach will be evaluated as a non-parametric method, and convolutional neural networks (CNNs) will be evaluated as a parametric method. A CNN is a neural network that allows a model to observe local features. A normal neural network used for image classification ignores the fact that closely positioned pixels correlate more strongly than pixels that are further apart. The different layers of a CNN, especially the convolutional layer, allow this image-specific information to be used (Bishop, 2006). Given that parametric methods usually outperform non-parametric methods, we expect the CNN to perform better than the k-NN. Both methods are compared on their classification performance on a hand-labeled dataset and on their efficiency.

Methods

Data Collection

An internet robot, developed at Statistics Netherlands, collected 100,000 clothing items from the Wehkamp website by automatically gathering the images with their accompanying tags. This number is based on the number of items presented daily on the website and is, therefore, an approximation of the number of items that would be used if the method were implemented on a daily basis. There are two types of tags: the category label assigned by the website and an additional description. This description is written by the website itself or by the original seller (brand) of the item. The 100,000 images are labeled by the rule-based method that is currently used by Statistics Netherlands. As discussed above, this method may also have classified some images incorrectly. Therefore, a sample of the images was first labeled by hand by two employees of Statistics Netherlands. In a later stage we used all 100,000 images as a dataset to see how the algorithms cope with noise and over- or underrepresented categories. Due to problems with combining the images and labels, the final dataset contained 51,733 items.

Golden Standard

The sample for the hand-labeled set was randomly selected and labeled for all basic categories, for example sweaters, jackets or skirts. With this sample we observed the behavior of the algorithms without the interference of too much noise. Human classification is used here as a 'golden standard', although it is not free of biases. To minimize the amount of bias in the hand-labeled set, we first computed the inter-rater reliability of the two sets with Cohen's Kappa (Cohen, 1960) and then compared the answers, which resulted in one labeled set.
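As an illustration, a minimal computation of Cohen's Kappa for two raters might look as follows; this is a sketch with invented labels, not the code used for the actual reliability analysis.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two label sequences of equal length."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

print(cohens_kappa(["skirt", "dress", "jeans", "jeans"],
                   ["skirt", "skirt", "jeans", "jeans"]))  # 0.6
```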

Nearest Neighbor Method

In a nearest neighbor algorithm, the category label assigned to a test item is based on the resemblance of the test item to an item, or a set of items, in the training set. This resemblance is based on the pixel values of an image, which are represented in a one-dimensional vector. These vectors are used to determine the distance between images. Because the images are of the same size, pixels at the same position are compared. However, because the vector is one-dimensional, the shape of the item is not taken into account: the model would, for example, not identify two identical images as such when one is rotated a quarter turn.

There are many different ways to determine the distance between two images. We tested the most commonly used distance measures but found little difference between them. Because the Euclidean distance is customary in image classification, we used it as our distance measure: $\sqrt{\sum_{i=1}^{n}(q_i - p_i)^2}$, where $q_i$ and $p_i$ are the $i$-th pixel values of a training image and of the test image to be categorized. To determine the category label, the algorithm can use the category of the single nearest neighbor, or it can use the modal category of a group of nearest neighbors. In the latter method, the group of nearest neighbors with the smallest distances to the test item is selected, and the category label best represented in this group becomes the category label of the test item. For example, when there are 10 t-shirts, 2 sweaters and 3 dress shirts in this group, the selected category label is 't-shirt'.
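A minimal sketch of this classification rule is shown below; our actual implementation used TensorFlow and is not reproduced here.

```python
import numpy as np
from collections import Counter

def knn_classify(test_vec, train_vecs, train_labels, k=1):
    """Label a flattened test image by the modal category among the k
    training images with the smallest Euclidean distance."""
    dists = np.sqrt(((train_vecs - test_vec) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Example with random pixel vectors; k=1 is the smallest distance method.
train = np.random.rand(10, 198450)
labels = np.array(["t-shirt"] * 5 + ["sweater"] * 5)
print(knn_classify(np.random.rand(198450), train, labels, k=3))
```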

Convolutional Neural Network Method

A CNN is a type of artificial neural network. Artificial neural networks are models inspired by biological neural networks. They consist of a group of interconnected nodes in different layers, just as neurons in the brain. A neural network estimates the function parameters of a dataset from a set of examples: instead of defining beforehand a function that will most likely suit the dataset, the function parameters are optimized in a training phase. The training dataset used in this phase contains example inputs with the desired outputs. By means of this training set, the model is trained to eventually predict which output belongs to a new input item. Neural networks can be supervised or unsupervised. In the supervised version the data have the form of inputs and targets (output values), and every new input value gets assigned one of the previously determined targets, just as in the nearest neighbor algorithm discussed above (MacKay, 2003).

Figure 2: Example of the convolutional neural network architecture. The plane on the left represents the input image with a 5 by 5 receptive field; the plane in the middle represents the convolutional layer, which consists of feature maps that take input values from the receptive fields. The plane on the right represents the pooling layer.

During the training phase the input values pass through the nodes. At every node a dot product is computed between the input values and the weights of the node. In the final layer a category label is determined by means of logistic regression. The weights of the different nodes are adjusted to optimize the estimation of the output values.

As mentioned in the introduction, a CNN is especially suited for image classification because it considers the dependency between closely situated pixels. This characteristic is incorporated in the network through a few properties: the receptive fields in the layers, the sharing (or constraining) of weights, and the pooling of the images. Figure 2 shows these properties of a CNN layer. The figure originates from Bishop (2006) and has additional measurements to help explain the model.

Convolutional layer The convolutional layer is what characterizes the network. In a convolutional layer all pixels of an image are organized in planes. Such a plane is called a feature map. The units of a feature map take pixels from a small part of the image as input; how large this part is depends on the size of the receptive field. All units of a feature map are constrained to have the same weights. Figure 2 shows the input of a feature map from the input image. In this example the feature maps are organized in one large 2D plane, the convolutional layer, but they can also be visualized as separate planes stacked behind each other. The units in a feature map can be thought of as feature detectors, each identifying a small part of the image; because of the weight constraints, this part becomes a pattern. Every feature map identifies a different pattern, or feature, in order to distinguish the different categories. In the example figure the receptive field is 5 by 5 pixels and gives input to a 2 by 2 feature map in the convolutional layer.

(4)

Pooling layer After the convolutional layer there is a pooling layer, which shrinks the image based on a certain pooling size, the stride. In the figure the image is pooled with a stride of 2 by 2, leaving an image of 14 by 14. For every feature map there is a plane in the pooling layer with its own units, each of which takes the average of a few feature map units as input. A normal network would be fully connected in every layer; the receptive fields decrease the number of weights considerably (Bishop, 2006).

Fully connected layer The final layer of a CNN is a fully connected layer, which has the architecture of a traditional multilayer perceptron. Its input consists of the set of feature maps of the last convolutional layer. The category label is determined in this final layer by means of logistic regression.
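Putting the three layer types together, a minimal version of such a network can be sketched in the modern Keras API of TensorFlow. The layer sizes follow the settings reported in the Results section (28 by 28 gray scale input, 32 feature maps with a 3 by 3 receptive field, 2 by 2 pooling, 9 categories); the ReLU activation and Adam optimizer are assumptions, and this is not the original 2015 implementation.

```python
import tensorflow as tf

# Sketch of the described architecture: 28x28 gray scale input, one
# convolutional layer with 32 feature maps and a 3x3 receptive field,
# a 2x2 average-pooling layer, and a fully connected softmax layer
# (multinomial logistic regression) over the 9 hand-labeled categories.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
    tf.keras.layers.AveragePooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(9, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```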

Data Analysis

We compared both methods on the proportion of correctly classified items, the precision and the recall. Precision and recall are two different measures of classification accuracy. Precision represents the percentage of items assigned a label that are classified correctly: for example, of 20 items classified as jeans, 16 are in fact jeans. Recall represents the proportion of items of a certain category that receive their true label: for example, of the 24 actual jeans in the dataset, the algorithm categorized 16 as jeans. The precision and recall for the entire dataset are calculated by taking the mean of the precision and recall values of every category.
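For clarity, the per-category computation can be sketched as follows, with invented labels for illustration:

```python
def precision_recall(true_labels, predicted_labels):
    """Per-category precision and recall; the overall figures reported
    later are the unweighted means of these values over categories."""
    scores = {}
    for c in set(true_labels):
        tp = sum(t == c and p == c for t, p in zip(true_labels, predicted_labels))
        n_pred = sum(p == c for p in predicted_labels)
        n_true = sum(t == c for t in true_labels)
        scores[c] = (tp / n_pred if n_pred else 0.0,  # precision
                     tp / n_true)                     # recall
    return scores

truth = ["jeans", "jeans", "skirt", "dress"]
preds = ["jeans", "skirt", "skirt", "dress"]
print(precision_recall(truth, preds))
```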

We aimed at a proportion correct of 75 percent, which is competitive with previous research (Guillaumin et al., 2010; Simo-Serra et al., 2014; Boiman et al., 2008). To assess the accuracy of the image classification we additionally looked at Cohen's Kappa coefficient (Cohen, 1960).

The number of predictors used in the algorithms is equal to the number of pixels of an image, which is 210 by 315 by 3 for the Wehkamp images. We used the Caltech101 dataset as an indication of how many images are needed to compensate for this high number of predictors. This set has about 9,000 images for 101 categories, and an image is 200 by 300 by 3 pixels. Researchers using this dataset reported 15 to 30 training images per category for 10 to 30 categories, for both parametric and non-parametric methods (Boiman et al., 2008; Dosovitskiy, Springenberg, Riedmiller, & Brox, 2014). Boiman et al. (2008) showed that using more than 30 training images per category did not further improve performance on Caltech101. For clothing items there are about 40 categories. Assuming that this relationship is linear, and comparable for different kinds of images, at least 1,200 training items are required if all categories are used. We analyzed different ratios of test and training items to find the optimal parameters for the algorithms and as an additional measure of the quality and stability of the algorithms. Additionally, cross-validation reduces the chance of overfitting the training set.
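The ratio experiments can be sketched as follows; this is an assumed procedure built around a classifier such as the knn_classify sketch above, not the paper's exact setup.

```python
import numpy as np

def accuracy_per_size(features, labels, classify, sizes, n_test=200, seed=0):
    """Evaluate a classifier for several training set sizes against one
    fixed held-out test set; returns the proportion correct per size.
    `features` and `labels` are NumPy arrays of equal length."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    test_idx, pool = order[:n_test], order[n_test:]
    results = {}
    for n in sizes:
        train_idx = pool[:n]
        hits = sum(
            classify(features[i], features[train_idx], labels[train_idx]) == labels[i]
            for i in test_idx
        )
        results[n] = hits / n_test
    return results
```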

Computing Platform

We implemented both algorithms in Python (van Rossum & Drake, n.d.) and mainly used the package TensorFlow (Abadi et al., 2015). This package allows the implementation of different machine learning techniques and the visualization of the model architecture.

Results

The hand-labeled set consisted of the categories dinner jacket, jacket, jeans, dress, underpants, shirt, skirt, t-shirt and sweater, each of which contained 100 items, except for skirts, for which only 95 examples were available. Cohen's Kappa was computed to determine the agreement between the two employees of Statistics Netherlands on the different categories. There was strong agreement, k = .94 (95% CI, .92 to .95). In order to focus on the characteristics of the clothes we used gray scale images. In a few instances color might contribute, for example for identifying 'jeans', but for most images we estimated it would introduce too much bias.
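The gray scale conversion can be done as in the following sketch; the standard luminance weights used here are an assumption, since the paper does not state the exact conversion.

```python
import numpy as np

def to_grayscale(rgb):
    """Collapse the color channels of a (height, width, 3) RGB array
    into one intensity channel using standard luminance weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])
```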

Figure 3: Different numbers of k, or neighbors, for the multiple nearest neighbor method. Accuracy measures for the nearest neighbor method are shown for 2 up to 150 neighbors. Given that every category had a maximum of 100 items, beyond 99 neighbors the algorithm could logically not increase in performance; this is indicated with the dotted line.

Nearest Neighbor

As mentioned in the method section, there are two ways to classify a dataset with a nearest neighbor model.

Figure 4: Classification accuracy for different numbers of training items. We investigated the increase in accuracy from 50 to 700 training items. For every number of training items we used the same test set and k = 1 for determining the nearest neighbor.

(5)

Figure 5: Classification results of the smallest distance nearest neighbor method. The 'smallest distance' method uses k = 1 for determining the nearest neighbor. The darkest columns show how many items the dataset actually contained and the lighter columns show the labels the nearest neighbor selected. The figure shows which categories are under- or overestimated by the algorithm and gives an indication of the mistakes the classifier makes. It does not show how many items the classifier identified correctly; this is shown in Table 1.

To see which method is most suitable in this situation, we first tested multiple values of k to see which number of neighbors would result in the best classification. Larger values of k usually reduce the effect of noise in the set, but also cause the boundaries between the different categories to fade. Figure 3 shows the results for the different values of k: an increase in neighbors results in a decrease in the performance of the model. Because of these results we decided to look only at the smallest distance method, i.e. k = 1. Because there are only a hundred items per category, more than 99 neighbors could logically never lead to a better estimation: in a perfect situation all categories are perfectly separated, which means that all 99 neighbors of the same category would be closest to the test item. This limit is indicated in Figure 3 by the dotted line.

Next, we looked at the number of training items needed for the classification. As mentioned in the method section, previous research showed that with 40 items per category the maximal precision is reached. Since the hand-labeled dataset consists of 9 categories, we expected the optimal number of training items to be between 300 and 400. Figure 4 shows that the performance of the model increases as the number of training items increases, and that at 400 items the precision becomes stable.

After optimizing the model with the above results, we had a nearest neighbor algorithm with the smallest distance method and a training set of 400 items. The results of this method can be seen in Table 1. The model classified around 78% of the clothing images in the hand-labeled set with the correct label, and it retrieved 75% of the test items within the correct category. We ran Cohen's Kappa to investigate the agreement between the expert, in this case the hand labeling, and the classifier. There was moderate agreement, k = .71.

Figure 5 visualizes the performance of the nearest neighbor method for the hand-labeled dataset.

Table 1: Precision and recall of the nearest neighbor and the CNN algorithm on the hand-labeled dataset, per category.

                  Precision        Recall
                  k-NN    CNN      k-NN    CNN
dinner jacket     0.53    0.83     0.75    0.76
jacket            0.46    0.87     0.60    0.70
jeans             1.00    1.00     1.00    0.98
dress             0.83    0.92     0.70    0.77
underpants        1.00    1.00     1.00    1.00
shirt             1.00    0.66     0.29    0.72
skirt             0.80    0.83     0.80    0.91
t-shirt           0.93    0.82     0.93    0.90
sweater           0.44    0.64     0.64    0.90
mean              0.78    0.84     0.75    0.85
sd                0.24    0.13     0.23    0.11

The bars denoted as 'Real' show the number of items in the test set per category, and the 'Smallest Dist' bars show the classification of the smallest distance method. Together with the precision and recall measures of Table 1, this figure gives an idea of the mistakes the classifier makes. For example, looking at the dinner jackets, jackets and sweaters, we can see that these categories are overestimated by the classifier, which means that other categories have to be underestimated. The figure shows a large underestimation for the (dress) shirts. This is also visible in the table: the precision of 'shirt' is 100 percent, while the recall is only 29 percent. This means that all test images labeled 'shirt' were in fact shirts, but the classifier missed over 70 percent of the shirts in the dataset. They most likely ended up with the label 'dinner jacket', 'jacket' or 'sweater', because these categories are overestimated in the figure, and because they have higher recall and lower precision. From these results we conclude that the classifier has more difficulty distinguishing between clothing categories that cover the torso than between, for example, sweaters and skirts. This suggests that the image classifier makes 'image-specific' mistakes rather than random mistakes. For a human, too, it is more difficult to distinguish an image of a dress shirt from an image of a sweater than from a pair of jeans. In the section with additional findings we will show that this is in fact a mistake the classifier makes.

We expected the performance of the model shown in Table 1 to deteriorate when tested on the larger set, which has 64 categories. Given the number of categories and the previous research, the training set should contain around 1,900 to 2,500 items. However, the categories are not equally distributed over the set; a category such as 'plus sizes' appears only 12 times. Therefore, we expected to need more training items for reasonable results and comparison. We experimented with the number of training items; Table 2 shows the results. The table shows a large decrease in accuracy for this set, which does not increase when the number of training items goes up.


Table 2: Results on the large set for different numbers of training items, classified by the two nearest neighbor methods.

Number of items   Acc smallest distance   Acc 10-NN
 2000             0.12 (0.035)            0.14 (0.03)
 5000             0.14 (0.024)            0.17 (0.034)
 9000             0.17 (0.024)            0.19 (0.024)
15000             0.16 (0.005)            0.19 (0.006)
50000             0.18 (0.01)             0.20 (0.01)

We also looked into the multiple neighbors method, which performed slightly better in this situation; since the set has more categories, the multiple neighbors method might allow some flexibility. Again, we examined different values of k, but did not find any notable differences. There are a few explanations for the performance on the large set compared to the hand-labeled set. First, some of the categories in the larger set cover multiple categories; for example, 'girls wear' covers girls' t-shirts, girls' pants, girls' shoes and more. Similar items can therefore belong to different categories: girls' shoes could have the label 'girls shoes' or 'girls wear'. This makes it difficult for the model to distinguish between the categories. Second, the automatic labeling might have made mistakes when labeling the dataset, so the training set might already have contained mistakes, which leads to bias when training the model.

Convolutional Neural Network

For the CNN we considered different settings for the receptive field, the strides and the size of the image. We started off with gray scale clothing images of size 28 by 28, which is similar to the standard MNIST example (LeCun, Cortes, & Burges, 1998). We found a fast decline in accuracy when increasing the strides and the receptive field size. The larger the receptive field, the more weights are constrained to be the same, so less information remains in the network; the larger the strides, the faster an image is reduced and the less information remains to be processed. We therefore used a receptive field of 3 by 3 and a pooling layer of 2 by 2 for images of size 28 by 28. The convolutional layer contained 32 feature maps. Table 1 shows the results of the network with these settings for the hand-labeled set. The network classified around 84% of the clothing images in the hand-labeled set with the correct label, and it retrieved 85% of the test items within the correct category. We ran Cohen's Kappa to investigate the agreement between the expert and the classifier; there was moderate agreement, k = .72. Figure 6 again shows the predicted item labels and the actual labels present in the dataset. Just as for the nearest neighbor, the figure shows that the categories that cover the torso are over- or underestimated: 't-shirts', 'shirts' and 'sweaters' are somewhat underestimated, while 'dinner jackets' and 'jackets' are somewhat overestimated. This can be seen in the precision and recall numbers as well. The differences are smaller for the CNN than for the nearest neighbor,

Figure 6: Classification of the CNN algorithm for every category. The darker columns represent the number of items that the dataset contained per category and the lighter columns represent the number of items predicted by the algorithm. The figure therefore shows the categories that are over- or underestimated by the algorithm.

Table 3: Precision and recall of the rule-based method currently used by Statistics Netherlands.

        Precision   Recall
mean    0.61        0.63
sd      0.016       0.012

however they still give the impression that the mistakes the algorithm makes are not random, but fall within a wider category, for example ‘torso clothing’.

Finally, we classified the large set with the CNN. Here too we experimented with the number of training items and different image sizes, but this did not have much effect on the performance. The results are comparable with the nearest neighbor results and did not reach a performance higher than 20 percent.

Comparison

To indicate the relevance of the results, we looked at the currently used rule-based method. This comparison is not straightforward, because we are interested in classification accuracy but also in preparation time and human interference. A relatively simple version of the rule-based method classifies based on keywords that appear in the clothing tags. Rules can be added to a rule-based method up to the point that every item is classified correctly, but this requires a lot of time: researchers have to come up with the additional rules themselves, and these rules only apply to a specific situation. They have to be adjusted for every webshop, season or even trend. We therefore compared the simple keyword method to the methods discussed in this paper, because it only requires clothing tags and frequent words such as 'skirt' or 'dress'. It does not require very specific information about the website and could be used in multiple situations, which is what we expect from the machine learning methods as well. Table 3 shows that the results of the rule-based method are less precise than those of the machine learning methods.


Category   Assigned category   Website description
Sweater    Underpants          'WE Fashion sweatvest'
Jacket     Underpants          'WE Fashion coltrui*'
Skirt      T-shirt             'POLO Ralph Lauren denim kokerrok**'

Table 4: Images that were misclassified by a text mining algorithm based on website descriptions, and correctly classified by the nearest neighbor algorithm. *Dutch for turtleneck. **Dutch for pencil skirt.

Additional Findings

We looked into image classification methods because of the specific mistakes of previously investigated methods: the text-based methods showed text-specific mistakes in previous research (de Boer, 2015). If the mistakes of the image classification methods are image-specific, the two methods might complement each other, and a combination of the two could result in the best classification strategy. In the results section we already showed some evidence that the image classification makes mistakes that are not random, but seem related to image specifics. To add to this claim, we compared the results of the image classification with those of a text mining method and looked at a few individual examples. Table 4 shows examples of images that were classified incorrectly by the text mining method, but correctly by the image classification. The web descriptions contain some irregular names: 'kokerrok' and 'coltrui' are Dutch examples of compound words. The words 'rok' (skirt) and 'trui' (sweater) would have been recognized by the text mining method, but the compounds make it difficult to classify the item. The interesting thing is the category label that the item gets instead: this seems to be a random label, because it does not relate to the item other than that it is a clothing item. Table 5 shows items that were incorrectly classified by both methods. The web descriptions again show irregular names, for example 'sweat' instead of 'sweater'. Looking at the assigned category labels, the table shows that the image classification method chooses a category similar to the actual category. Just as we expected from the precision and recall results, the image classification assigns a label that is not far off the actual label, in this case 'jacket' instead of 'sweater'; the same holds for 'dress' instead of 'skirt'.

Category   Assigned by TM   Assigned by IC   Website description
Sweater    Underpants       Jacket           'Gsus hooded sweat'
Skirt      Sweater          Dress            'Paprika maxirok*'
Sweater    Underpants       Jacket           'Superdry hooded sweat'

Table 5: Images that were misclassified by both the text mining (TM) algorithm and the nearest neighbor algorithm (IC). *Dutch for a long (or maxi) skirt.

Discussion

Both the nearest neighbor and the convolutional neural network reached, or exceeded, the classification performance we aimed for, and the CNN outperformed the nearest neighbor method. We established the best performance with the CNN, in combination with gray scale images of size 28 by 28, two layers in the network, small receptive fields and small pooling layers. The automated process only requires a labeled training set and does not require any additional rules or exceptions. Additionally, we compared the results with those of the text mining method and showed the overlap and differences between the misclassifications of the image classification and text mining methods. This implies that the methods can complement each other when used for classification. With the results of the larger set, however, we showed that both methods are not robust to bias. This means that in practice a reliable, and probably hand-labeled, set would have to be available. Where the method itself decreases the amount of human interference because of the training phase, the construction of this set might add to it again. Further research could indicate whether these methods lead to better classification and more efficiency. When a method is proven to classify clothing items efficiently, the resulting categories can be used for determining price indexes. For further research we recommend two strategies. The first is to look into the combination of text mining and image classification. The specific mistakes of the two methods indicate that they will most likely complement each other. More specifically, image classification could be used as a rough, first-stage classification, in which items are classified as, for example, clothes that cover the torso or the legs; text mining can then add the details in a later stage. As an example, Yang, Luo, and Lin (2014) used clothing tags as a prediction parameter in a neural network. When the hand-labeled set can be kept within a certain proportion and the model does not require too much supervision, this combination can greatly


add to efficiency. Second, it would be interesting to look into unsupervised, or semi-supervised, learning methods. These methods are allowed some, or complete, freedom in the classification of items: they are capable of identifying new categories and adding them to the set of possible categories. Using such a method could make it possible to use a less specific, not hand-labeled, dataset.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., . . . Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Retrieved from http://tensorflow.org/ (Software available from tensorflow.org)

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Boiman, O., Shechtman, E., & Irani, M. (2008). In defense of nearest-neighbor based image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008) (pp. 1-8).

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.

de Boer, G. (2015). (Tech. Rep.). Statistics Netherlands.

de Haan, J., & Hendriks, R. (2013). Online data, fixed effects and the construction of high-frequency price indexes. In Economic Measurement Group Workshop (pp. 28-29).

Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., & Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 766-774).

Espejo, M. R. (2005). Consumer price index manual: Theory and practice. Journal of the Royal Statistical Society: Series A (Statistics in Society), 168(2), 461-461.

Guillaumin, M., Verbeek, J., & Schmid, C. (2010). Multimodal semi-supervised learning for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010) (pp. 902-909).

LeCun, Y., Cortes, C., & Burges, C. J. (1998). The MNIST database of handwritten digits.

MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge University Press.

Simo-Serra, E., Fidler, S., Moreno-Noguer, F., & Urtasun, R. (2014). A high performance CRF model for clothes parsing. In Computer Vision - ACCV 2014 (pp. 64-81). Springer.

van Rossum, G., & Drake, F. L. (n.d.). Python language reference [Computer software manual]. Available at http://www.python.org.

Yang, W., Luo, P., & Lin, L. (2014). Clothing co-parsing by joint image segmentation and labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3182-3189).
