Image classification of fashion industry

BY

Sunita Putchala, 11417811

MBA in Big Data and Business Analytics Thesis

SUPERVISED BY

Dr. Gijs Overgoor

Amsterdam Business School

Abstract

As described in the iMaterialist Challenge (Fashion) at FGVC5 (https://www.kaggle.com/c/imaterialist-challenge-fashion-2018), the entire shopping experience is moving online. People prefer to sit back and order from home or the office via a digital platform rather than visit physical stores at a distant location. As shops go online, it becomes a challenge for retailers to classify their products automatically, with minimal human intervention, using machine learning algorithms. [1] However, automatic product recognition and classification is difficult because picture quality varies. A picture can be taken under different lighting, angles, backgrounds, and levels of occlusion. Apart from picture quality and pixelation, another factor is the colour naming convention. Two pictures can look very similar because their colours are almost the same, such as royal blue versus turquoise. It is a genuine challenge for machine learning algorithms to detect and differentiate minute colour details in a fashion product. However, these details are very important to customers when they make buying decisions. Currently, the classification of products in the fashion industry is done manually, which is a tedious and time-consuming task. Present models work on fixed parameters of pictures or photos. This thesis places the emphasis on the deep learning classification approach and on how this technique is used to improve classification accuracy. We tackle the problem of multi-label classification of fashion images, learning from noisy data with minimal human intervention.

INTRODUCTION

Fashion plays an important role in everyday life, yet it is a complicated task for computer vision. Because fashion is subjective, obtaining high-quality data for training learning-based models on fashion products is an open problem. Recommending fashion products according to user or item is comparatively easy and has already been done, but classifying products into the same category with respect to pictures, colours and labels is complicated. In addition, the amount of image data received from online shopping websites such as Amazon, Wish and Malong is constantly increasing.

Can a computer automatically detect pictures of shirts, pants, dresses, and sneakers taken under different lighting, angles, backgrounds and colour co-ordinates? [3] This is where deep learning algorithms come in. Deep learning has emerged as a methodology of continuing interest in artificial intelligence, and it can be applied in various business fields for better performance. In the fashion business, deep learning, especially the Convolutional Neural Network (CNN), is used to classify apparel images. However, apparel classification can be difficult because of the variety of apparel categories and the lack of labelled image data for each category. Therefore, we propose to pre-train the GoogLeNet architecture on the ImageNet dataset and fine-tune it on our fine-grained fashion dataset based on design attributes. [4]

Visual Categorization

According to Fritz Venter and Andrew Stein, the human brain can process multiple images and videos with sound simultaneously. It is exceptionally responsive and effective at understanding meaning or context within a few minutes. Today, through image classification and detection, we are trying to catch up with the way the brain functions. Images and image sequences (video) make up about 80% of the big data on multimedia platforms.

Visual categorization is generally divided into three levels: superordinate-level, basic-level, and subordinate-level categorization. [8] Until now, most research and practical work has been done on basic-level categorization, for example classifying objects belonging to different categories such as products, animals, flowers and plants. Presently, subordinate-level categorization, commonly called Fine-Grained Visual Categorization (FGVC), is attracting more attention. FGVC aims to distinguish objects belonging to the same or closely related categories, for example classifying different species or types of birds, flowers, animals, and products. [7]

FGVC is a challenging task because these objects commonly show small inter-class variance and large intra-class variance, for example variance caused by different poses and backgrounds.

Research Questions

The aim of my thesis is to answer the questions below:

1) What data is provided by the FGVC5?
a) Understand the characteristics of the data
b) How to prepare the available data for modelling
2) What are the challenges that I faced during the image exploration and classification?
3) What are the existing models and literature available in image classification?
4) Evaluate the model with existing data

Business Requirements:

1) The model should be able to recognize the fashion product displayed in the image
2) The model should recognize the dominant colour of the fabric in the image

Procedure of Image Recognition

• Data collection
• Data cleaning and pre-processing
• Feature selection and extraction
• Machine learning modelling
• Parameter tuning
• Metrics, performance and accuracy

NOTE: If the accuracy is too low, repeat all the steps from step 1. The steps above form the first stage, training. The second stage is recognition. It includes the sub-steps below:

• The same pre-processing as in training
• The same feature extraction as in training
• Feed the features to the trained classifier
• Output the classification results
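The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis pipeline: it assumes NumPy, uses the mean colour per channel as a stand-in feature and a nearest-centroid rule as a stand-in classifier, and all function names (preprocess, extract_features, train, recognize) are hypothetical. The point it shows is that recognition reuses exactly the same pre-processing and feature extraction as training.

```python
import numpy as np

def preprocess(image):
    """Scale 8-bit pixel values to the range [0, 1]."""
    return image.astype(np.float64) / 255.0

def extract_features(image):
    """Use the mean colour per channel as a tiny 3-value feature vector."""
    return preprocess(image).reshape(-1, 3).mean(axis=0)

def train(images, labels):
    """Training stage: store one mean feature vector (centroid) per label."""
    features = np.array([extract_features(img) for img in images])
    labels = np.array(labels)
    return {lab: features[labels == lab].mean(axis=0) for lab in set(labels.tolist())}

def recognize(model, image):
    """Recognition stage: same pre-processing and features, then classify."""
    feat = extract_features(image)
    return min(model, key=lambda lab: np.linalg.norm(feat - model[lab]))

# Synthetic 4 x 4 RGB "images": two reddish and two bluish ones.
red = np.full((4, 4, 3), (220, 30, 30), dtype=np.uint8)
dark_red = np.full((4, 4, 3), (180, 20, 20), dtype=np.uint8)
blue = np.full((4, 4, 3), (30, 30, 220), dtype=np.uint8)
dark_blue = np.full((4, 4, 3), (20, 20, 180), dtype=np.uint8)

model = train([red, dark_red, blue, dark_blue], ["red", "red", "blue", "blue"])
query = np.full((4, 4, 3), (200, 40, 40), dtype=np.uint8)
predicted = recognize(model, query)
print(predicted)
```

In a real system the feature extractor and classifier would be replaced by learned components, but the split between a training stage and a recognition stage stays the same.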

Deep Learning

Deep learning refers to neural networks with multiple hidden layers that can learn increasingly abstract representations of the input data. For example, deep learning has led to major advances in computer vision. We are now able to classify images, find objects in them, and even label them with captions. To do so, deep neural networks with many hidden layers sequentially learn more complex features from the raw input image: [15], [19]

• The first hidden layers might only learn local edge patterns.
• Then, each subsequent layer (or filter) learns more complex representations.
• Finally, the last layer can classify the image as a cat or dog.

Convolutional Neural Networks

Convolutional Neural Networks are very similar to ordinary neural networks. They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function, from the raw image pixels at one end to class scores at the other. It still has a loss function (for example SVM/Softmax) on the last (fully connected) layer, and all the tips and tricks developed for learning regular neural networks still apply. [2]
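As a small illustration of the neuron described above, the following NumPy sketch computes a dot product followed by a ReLU non-linearity, and a softmax over class scores. The input values, weights and function names are illustrative assumptions, not taken from the thesis code.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Dot product of inputs and weights plus bias, followed by ReLU."""
    return max(0.0, float(np.dot(inputs, weights) + bias))

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

x = np.array([0.5, -1.0, 2.0])   # three input values (illustrative)
w = np.array([0.2, 0.4, 0.1])    # learnable weights (illustrative)

activation = neuron(x, w, bias=0.3)    # positive pre-activation passes through
silenced = neuron(x, w, bias=-1.0)     # negative pre-activation is clipped to 0

probs = softmax(np.array([2.0, 1.0, 0.1]))  # class scores -> probabilities
print(activation, silenced, probs)
```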

ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These properties make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
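To make the parameter saving concrete, the following sketch compares a fully connected layer with a convolutional layer on a 200 x 200 RGB image. The layer sizes (100 units, 100 filters of 3 x 3) are assumptions chosen only for the arithmetic.

```python
def dense_params(height, width, channels, units):
    """Weights plus biases of a fully connected layer on a flattened image."""
    return height * width * channels * units + units

def conv_params(kernel, channels, filters):
    """Weights plus biases of a convolutional layer with square kernels."""
    return kernel * kernel * channels * filters + filters

dense = dense_params(200, 200, 3, 100)   # every unit sees every pixel value
conv = conv_params(3, 3, 100)            # each filter is shared across positions
print(dense, conv)
```

Because each filter is reused at every image position, the convolutional layer needs a few thousand parameters where the fully connected layer needs about twelve million.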

High level overview

In this section, we briefly introduce the different libraries we tested and our experimental apparatus and protocol.

Deep Learning with Keras

The top two numerical platforms in Python that provide the basis for deep learning research and development are Theano and TensorFlow. Both are very powerful libraries, but both can be difficult to use directly for creating deep learning models. [19] Keras is the official high-level API of TensorFlow. Keras was developed and is maintained by Francois Chollet, a Google engineer, using guiding principles that include: [16]

• Modularity – A model can be understood as a sequence or a graph alone. All the concerns of a deep learning model are discrete components that can be combined in arbitrary ways.
• Minimalism – The library provides just enough to achieve an outcome, with no frills and maximizing readability.
• Extensibility – New components are intentionally easy to add and use within the framework, intended for the exploration of new ideas.

The focus of Keras is the idea of a model. The main type of model is called Sequential, which is a linear stack of layers. We create a Sequential model and add layers to it in the order in which we wish the computation to be performed. Once defined, we compile the model, which makes use of the underlying framework to optimize the computation to be performed by our model. At this point we can specify the loss function and the optimizer to be used.

Once compiled, the model must be fit to data. This can be done one batch of data at a time. Once trained, we can use the model to make predictions on new data. We can summarize the construction of deep learning models in Keras as:

• Define the model – Create a Sequential model and add layers.
• Compile the model – Specify loss functions and optimizers.
• Fit the model – Execute the model using data.
• Make predictions – Use the model to generate predictions on new data.
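The four steps above might look as follows in Keras. This is a minimal sketch under stated assumptions: it uses the tensorflow.keras API, random arrays in place of real fashion images, a hypothetical count of five labels, and a sigmoid output with binary cross-entropy as one common choice for multi-label problems. It is not the model trained in this thesis.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_LABELS = 5  # hypothetical label count, kept small for the sketch

# 1) Define the model: a linear stack of layers.
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(NUM_LABELS, activation="sigmoid"),  # one score per label
])

# 2) Compile the model: choose the loss function and the optimizer.
model.compile(optimizer="adam", loss="binary_crossentropy")

# 3) Fit the model on random stand-in data (not real fashion images).
rng = np.random.default_rng(0)
x_train = rng.random((16, 32, 32, 3)).astype("float32")
y_train = rng.integers(0, 2, size=(16, NUM_LABELS)).astype("float32")
model.fit(x_train, y_train, epochs=1, batch_size=8, verbose=0)

# 4) Make predictions: one probability per label for each image.
predictions = model.predict(x_train[:2], verbose=0)
print(predictions.shape)
```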

The three API styles are: [16]

1) The Sequential model
• Dead simple
• Only for single-input, single-output, sequential layer stacks
• Good for 70% of use cases

2) The functional API
• Like playing with Lego bricks
• Multi-input, multi-output, arbitrary static graph topologies
• Good for 95% of use cases

3) Model sub-classing
• Maximum flexibility
• Larger potential error surface

The general idea is based on layers and their inputs and outputs:
• Prepare the input and output tensors
• The first layer is created to handle the input tensor
• The output layer is created to handle the targets
• The model is built virtually between the input and output layers

Here are the steps for building a CNN using Keras:

1) Set up the environment:
• Python 3+
• SciPy with NumPy
• Matplotlib

2) Install Keras

3) Import libraries and modules – I started by importing NumPy. Then, I imported the Sequential model type from Keras. This is a linear stack of neural network layers, and it is well suited to a feed-forward CNN. Then, I imported the core layers of Keras. I used the “Flatten” and “Dense” layers in this case; other core layers include “Dropout” and “Activation”. These layers are used in almost any neural network. The Flatten layer unrolls the values of its input into a single vector, beginning at the last dimension. The Dense layer is a simple regular layer in which each unit or neuron is connected to each neuron in the next layer, and is thus densely connected.

Then I imported the Keras CNN layers called Convolution2D (Conv2D) and MaxPooling2D. These layers help us to train on image data efficiently.

4) Data collection: The Conference on Computer Vision and Pattern Recognition (CVPR) has partnered with Google, Wish and Malong Technologies to tackle the issues related to automatic fashion product detection under Fine-Grained Visual Categorization.

CVPR provided an image_url for each image instead of the raw image. I downloaded a subset of around 20K image files to continue my analysis, because downloading all the images took too long and consumed a lot of local storage space. I was provided with the train.json, validation.json and test.json datasets in JSON file format.

• train.json: Training dataset with unique image_id, image_url and associated image_labels.
• validation.json: Validation dataset with unique image_id, image_url and associated image_labels.
• test.json: Test dataset with image_id and image_url. These are the images for which I need to generate predictions. Only image URLs are provided.
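One possible way to read these files is sketched below. The key names used here (images, annotations, imageId, url, labelId) are assumptions about the JSON layout and should be checked against the downloaded files; the sample text and the load_records helper are hypothetical.

```python
import json

# A tiny stand-in for train.json; the key names are assumptions.
sample = """
{
  "images": [
    {"imageId": "1", "url": "https://example.com/1.jpg"},
    {"imageId": "2", "url": "https://example.com/2.jpg"}
  ],
  "annotations": [
    {"imageId": "1", "labelId": ["66", "105"]},
    {"imageId": "2", "labelId": ["66"]}
  ]
}
"""

def load_records(text):
    """Join each image URL with its labels, keyed by image id."""
    data = json.loads(text)
    labels = {a["imageId"]: a["labelId"] for a in data.get("annotations", [])}
    return [
        {
            "image_id": img["imageId"],
            "image_url": img["url"],
            # test.json has no annotations, so labels default to an empty list
            "image_labels": labels.get(img["imageId"], []),
        }
        for img in data["images"]
    ]

records = load_records(sample)
print(records[0])
```

For a file on disk, the same helper could be called on the text returned by reading the file, for example the contents of train.json.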

5) Data cleaning and transformation
6) Pre-process class labels for Keras
7) Define model architecture
8) Compile model
9) Fit model on training data
10) Evaluate model on test data

PROBLEM SOLUTION

Our task is to develop algorithms that take an important step towards automatic product detection by accurately assigning attribute labels to fashion images. The steps for the task are listed below.

1. Data Analysis and Data Exploration
2. Feature Extraction
3. Feature Engineering (Labelling Analysis)
4. Building Deep Learning models
5. Evaluation of models
6. Decision making

1. Data Analysis and Exploration –

After loading and reading the JSON files, we found that there are 1,014,544 images with a total of 228 labels in the training dataset. In the test dataset we have 39,706 images with no labels, hence we need to predict multiple labels for each of them. In the validation dataset we have 9,897 images with 225 labels. The JSON file has the format shown below:

I have written a Python snippet to check the occurrences (frequency) of each label in the training dataset. I noticed that label 1 has a high frequency (around 8000K labelled images) in the dataset.
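A label-frequency check of this kind could be written as follows. The sketch uses collections.Counter on a small hypothetical list of label lists; the label values are made up for illustration and are not counts from the dataset.

```python
from collections import Counter

# Hypothetical label lists for four training images.
image_labels = [
    ["66", "105", "17"],
    ["66", "17"],
    ["66"],
    ["105", "214"],
]

# Count in how many images each label occurs.
label_counts = Counter(label for labels in image_labels for label in labels)
total_images = len(image_labels)

for label, count in label_counts.most_common():
    share = 100.0 * count / total_images
    print(f"label {label}: {count} images ({share:.1f}%)")

most_common_label, most_common_count = label_counts.most_common(1)[0]
```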

1.1 Data

The images are interpreted as a matrix of pixels with a defined height and width. Each pixel has an RGB (Red/Green/Blue) colour value; each of the three colours, sometimes called a channel in machine learning software, typically varies between 0 and 255. In total, 60K images larger than 200 * 200 pixels have been collected and loaded into the model.
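The following NumPy sketch illustrates this representation. It uses a random array as a stand-in for a decoded photo, and the is_large_enough helper is a hypothetical name for the 200 * 200 size check.

```python
import numpy as np

MIN_SIDE = 200  # minimum height and width, taken from the text

rng = np.random.default_rng(42)
# A random stand-in for a decoded photo: height x width x 3 channels.
image = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)

height, width, channels = image.shape
red, green, blue = image[0, 0]   # RGB value of the top-left pixel

def is_large_enough(img, min_side=MIN_SIDE):
    """True when both the height and the width reach min_side pixels."""
    return img.shape[0] >= min_side and img.shape[1] >= min_side

thumbnail = image[:100, :100]    # a 100 x 100 crop, too small to keep
print(image.shape, image.dtype, is_large_enough(image), is_large_enough(thumbnail))
```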

I have defined a customised function which displays the image in HTML format for a given image_id from the dataframe. In pandas, a DataFrame is a two-dimensional labelled data structure with columns of potentially different types. Similarly, I have defined a display_label function which displays the labels of an image based on the image_id from the specified dataframe. From this function we can retrieve the labels and the total number of labels. I have also defined a function to display each label_id with its percentage of occurrences.

1.2 Downloaded Images

I have defined a customised function to download the images to a specific local directory by reading the URLs provided in the JSON file.

1.3 Computing the number of labels in dataset

I have computed the total number of unique labels present in the train, test and validation JSON files. I observe 228 unique labels in the given training dataset.

In the test dataset, I observe 39,706 images. My task is to predict the labels for the test data by finding patterns in the training data.

I observe that there are 225 unique labels in the validation dataset. A validation dataset is a part of the data that is held back from training the model and is used to estimate model skill while tuning the model’s parameters. The validation set is used to minimize overfitting. The validation and test datasets are used to evaluate the models.

1.4 Distribution of Labels in the training data

The graph below displays the distribution of our unique labels in the training data. Basically, I am finding the frequency (or count) of each unique label.

1.5 Determining images with top labels with highest count:

1.6 Determining Images with the greatest number of labels

1.7 Determining Images with one label

1.8 Determining Images with two labels:

I found the images which are associated with only two labels.

To make our data consistent, I eliminated low-resolution images from the training dataset before using them in the models. Only images of 200 * 200 pixels and larger are kept. Further, I resized some of the images down to 200 * 200 to extract the colours and counts [29].

1.9 Determining colours used in the images:

I have used the K-Means clustering algorithm to determine the colours in the images. According to Charles Leifer [28], every pixel in the image is an RGB colour value that can be represented as a point in 3D space. The algorithm picks ‘k’ random pixels and assigns them as the initial centroids of the clusters. Each pixel is assigned to its nearest centroid, and the algorithm then computes a new centroid for every cluster by averaging its pixels. The process is repeated until the pixel assignments no longer change and the centroids stabilize.

The images in the training dataset were run with k = 3 and a minimum distance of 3.0.
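A minimal NumPy sketch of this procedure is given below. It is an illustration rather than the code used for the thesis: the function name dominant_colours, the seed and max_iter parameters and the synthetic pixels are assumptions, and min_diff is interpreted here as the largest centroid movement below which the iteration stops.

```python
import numpy as np

def nearest_centroid(pixels, centroids):
    """Index of the closest centroid (Euclidean distance in RGB) for each pixel."""
    distances = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def dominant_colours(pixels, k=3, min_diff=3.0, seed=0, max_iter=100):
    """Cluster RGB pixels with k-means and return (centroids, pixel counts)."""
    pixels = np.asarray(pixels, dtype=np.float64).reshape(-1, 3)
    rng = np.random.default_rng(seed)
    # Pick k distinct random pixels as the initial centroids.
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(max_iter):
        assignment = nearest_centroid(pixels, centroids)
        # Recompute each centroid as the mean of its pixels (empty clusters stay put).
        new_centroids = np.array([
            pixels[assignment == i].mean(axis=0) if np.any(assignment == i) else centroids[i]
            for i in range(k)
        ])
        # Stop once no centroid moved by more than min_diff.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift <= min_diff:
            break
    counts = np.bincount(nearest_centroid(pixels, centroids), minlength=k)
    return centroids, counts

# Synthetic pixels: mostly red, some blue, a few near-white.
rng = np.random.default_rng(1)
red = rng.normal((200, 30, 30), 5, size=(60, 3))
blue = rng.normal((30, 30, 200), 5, size=(30, 3))
white = rng.normal((245, 245, 245), 3, size=(10, 3))
pixels = np.clip(np.vstack([red, blue, white]), 0, 255)

centroids, counts = dominant_colours(pixels, k=3, min_diff=3.0)
print(np.round(centroids), counts)
```

Because the initial centroids are random, the clusters found can differ between runs and seeds; sorting the centroids by their pixel counts gives a ranking of the dominant colours.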

Some images are included below as the output of the algorithm. This set of images is analysed to determine the colour combinations used in each image. The results from K-Means seem good; however, there are some weaknesses. The main weakness is that, along with the apparel’s colour, the background colour and furniture colour are also detected. When focusing on fine-grained visual categorization, one should exclude such noise from the training data. The background colour in the image influences the detection of the average colour of the fabric. In the paper at http://people.csail.mit.edu/khosla/papers/arxiv2016_Wang.pdf, co-authored by Khosla, a threshold-based segmentation method is adopted to automatically detect the background region of cancer tissue images, which nullifies the effect of the background colour on the image.
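One simple way to reduce the influence of a light background is sketched below. It is a deliberately crude stand-in, not the segmentation method of the cited paper: it assumes NumPy, treats every pixel whose three channels are all at or above a fixed threshold as background, and the threshold of 240 and the function name foreground_pixels are assumptions.

```python
import numpy as np

def foreground_pixels(image, threshold=240):
    """Return the pixels that are not near-white, as an (n, 3) array.

    A pixel counts as background when all three channels are at or above
    the threshold; the value 240 is an assumption for this sketch.
    """
    image = np.asarray(image)
    background = (image >= threshold).all(axis=2)
    return image[~background]

# A 4 x 4 white canvas with a 2 x 2 royal-blue patch in the middle.
canvas = np.full((4, 4, 3), 255, dtype=np.uint8)
canvas[1:3, 1:3] = (65, 105, 225)

kept = foreground_pixels(canvas)
mean_all = canvas.reshape(-1, 3).mean(axis=0)   # pulled towards white
mean_foreground = kept.mean(axis=0)             # the patch colour only
print(kept.shape, mean_all, mean_foreground)
```

Running the colour clustering only on the kept pixels would stop a white backdrop from dominating the result, although a fixed threshold will also discard genuinely white garments.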

In the image below, the white background colour is detected more strongly than the colours of the fabric or fashion product.

The features are extracted with a Keras CNN pre-trained on ImageNet (Inception-V3) [Szegedy et al., 2016]. The trained deep neural nets are made publicly available in the Keras package. The pre-trained Keras models I considered are Inception-v3, ResNet50, InceptionResNetV2 and MobileNet. A CNN can be used as a basic technique for extracting features from an image: features are extracted by passing the images through a CNN that has been pre-trained on ImageNet, a generic set of images. The Inception and ResNet architecture families produced winning entries in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

Creating and training deep neural nets from scratch would take a lot of time and involve many iterations, hence I used the weights of the mentioned architectures, pre-trained on the ImageNet dataset, for my own training dataset. The keras.preprocessing module is the data pre-processing and augmentation module of the Keras library. It provides utilities for working with image data, text data and sequence data.

Keras Applications are deep learning models that are made available along with pre-trained weights. These models can be used for prediction, feature extraction and fine-tuning. Weights are downloaded automatically when instantiating a model. They are stored in the local ~/.keras/models/ directory by default.

The pre-processing steps are as follows:

1) Load the pre-trained ImageNet models Inception-V3 and ResNet50.
2) For each image, load the image, resize it to the input size defined by the respective ImageNet model and then pass it through the model.
3) The model produces a feature vector of size (1, 2048) for ResNet50 and (1, 131072) for InceptionV3.

Inception V3:

It is a type of Convolutional Neural Network. It consists of many convolution and max-pooling layers, and it includes a fully connected neural network. Our Inception V3 model uses weights pre-trained on ImageNet. We use the default input size for this model, which is 299 x 299.

ResNet50:

It is another type of Convolutional Neural Network. Our ResNet50 model also uses weights pre-trained on ImageNet. We use the default input size for this model, which is 224 x 224.
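The feature sizes can be checked with a short calculation. The sketch assumes that the Inception V3 features are the flattened final 8 x 8 x 2048 feature map for a 299 x 299 input, and that the ResNet50 features are taken after average pooling, which leaves 2048 values; both shapes are assumptions about how the models were configured.

```python
from math import prod

# Assumed shapes of the final feature maps (height, width, channels).
INCEPTION_V3_FEATURE_MAP = (8, 8, 2048)   # for a 299 x 299 input, no pooling
RESNET50_POOLED = (1, 1, 2048)            # for a 224 x 224 input, after average pooling

inception_size = prod(INCEPTION_V3_FEATURE_MAP)  # values per image once flattened
resnet_size = prod(RESNET50_POOLED)
print(inception_size, resnet_size)
```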

With the above models I was able to differentiate and classify the different apparel and fashion products, obtaining a probability score for the type of clothing each one falls under. As per the output, Inception-v3 is an effective pre-trained model, with a highest reported accuracy of 96.5% in the ILSVRC.

Processing 40K images took 1 hour and 32 minutes, which amounts to roughly 14 seconds per 100 images.

However, the ImageNet models have only a limited set of predefined labels. If a fashion product is not among the labels of the model, then it is classified incorrectly. Further, the prediction of labels is quite intuitive if the picture or image is relatively clear and prior knowledge of the fashion product is available. If the images are not clear, then the model fails to provide the correct label.

Output prediction of ResNet50 and Inception V3

Conclusion and Application:

From the results we conclude that the computer model can approximately determine and classify the dominant colour of fashion apparel, depending on the quality of the image. It can also label the fashion product with a high-level name by using a pre-trained ImageNet model. Of the two models, ResNet50 and Inception V3, Inception V3 produced better-ranked results.

I realized that the approach to this problem can also be applied in industries other than fashion, such as the food and grains industry or wildlife classification and detection. However, to do so the model has to be trained with labels relevant to the dataset. If a label is not defined in the model, then the image detection becomes erroneous.

Future Work:

I would like to explore this area further, which was not possible here due to time limitations. There are further improvements and approaches that could be followed to optimize the results, and other kinds of ImageNet models to experiment with.

References

[1] Kaggle Competition, https://www.kaggle.com/c/imaterialist-challenge-fashion-2018
[2] Convolutional Neural Networks for Visual Recognition.
[3] Primary Objects, Software Development, Programming, AI.
[4] Image Recognition For Fashion with Machine Learning.
[5] IEEE International Conference on Multimedia & Expo 2013, Augmenting Descriptors for Fine-grained Visual Categorization Using Polynomial Embedding, The University of Tokyo.
[6] Coarse-to-Fine Description for Fine-Grained Visual Categorization, Shiliang Zhang & Hanto Yao.
[7] Advances in Fine-grained Visual Categorization, D.Phil Thesis, Robotics Research Group, Department of Engineering Science, University of Oxford.
[8] Fine-grained Visual Categorization via multi-stage metric learning, Qi Qian, Rong Jin, Department of Computer Science & Engineering, Michigan State University.
[9] Using Pre-trained Models for Fine-grained image classification in Fashion Field, Moscow Institute of Physics and Technology.
[10] Style Finder: Fine-Grained clothing style recognition and retrieval, Wei Di, Department of Computer Science and Engineering, University of California.
[11] Deep Learning, Lisbon Machine Learning Summer School, Lisbon, Portugal.
[12] Multi-label Fashion Image classification with minimal human supervision, Naoto Inoue, The University of Tokyo, Waseda University.
[13] Convolution Neural Networks for Fashion Classification and object detection, Brian Lao & Karthik Jagadesh, Stanford University.
[14] Using Python in Computer Vision: Performance & Usability, Brian Thorne, Raphael Grasset, HIT Lab NZ, University of Canterbury.
[15] Deep Learning with Keras, Implement neural networks with Keras on Theano and Tensorflow, Antonio Gulli, Sujit Pal.
[16] Introduction to Keras, Francois Chollet.
[17] Keras Tutorial: An Introduction by Dylan Drover.
[18] Deep Learning 101 – a Hands-on Tutorial by Yarin Gal.
[19] Keras Tutorial: The Ultimate Beginner’s Guide to Deep Learning in Python, Elite Data Science.
[20] Keras official Documentation.
[21] https://github.com/DeepLearningSandbox/DeepLearningSandbox/blob/master/transfer_learning/fine-tune.py
[22] http://people.csail.mit.edu/khosla/
[23] https://www.kaggle.com/badalgupta/simple-data-exploration/notebook
[24] https://deeplearningsandbox.com/
[25] https://buzzrobot.com/dominant-colors-in-an-image-using-k-means-clustering-3c7af4622036
[26] https://www.pyimagesearch.com/2014/05/26/opencv-python-k-means-color-clustering
[27] https://zeevgilovitz.com/detecting-dominant-colours-in-python
[28] http://charlesleifer.com/blog/using-python-and-k-means-to-find-the-dominant-colors-in-images/
[29] http://people.csail.mit.edu/khosla/papers/fgvc2011.pdf