
AN ATTRIBUTE-EMBEDDING CLOTHING RECOMMENDATION SYSTEM

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Rick Vergunst
10793925

MASTER INFORMATION STUDIES: DATA SCIENCE
FACULTY OF SCIENCE, UNIVERSITY OF AMSTERDAM

Date of defence: 2018-07-30

Internal supervisor: Masoud Mazloom (UvA)

An Attribute-Embedding Clothing Recommendation System

Rick Vergunst
University of Amsterdam
Student number: 10793925
vergunstje@hotmail.com

Supervisor: Masoud Mazloom

ABSTRACT

Nowadays, social media usage is widespread and increasing dramatically. This generates data, which can be used for analysis in different tasks. Recommendation systems are an example of such analysis: they use the data to describe users and compare them to each other to recommend certain items, for instance in fashion recommendation. People post pictures of themselves on a daily basis, often involving certain clothing pieces. Those pictures give a sense of the styling of certain people, which can be used to recommend clothing pieces to other, similar users. In this thesis, a system was built that uses that information to recommend clothing pieces based on the bodily features of the user. The system extracts features from popular people and matches that information with the user input to recommend clothing that fits the user based on their appearance. The proposed system is a content-based method, which differs from existing work, where the focus lies on collaborative methods. Furthermore, it uses an attribute-based method to compare users and recommend clothing pieces. The data set crawled in this thesis consists of 282.831 images gathered from Flickr. These images contain either the face or the full body of 1.366 unique celebrities. Evaluation of such a system is not trivial as it is experimental in nature, thus no comparable benchmark is available. This meant that a user-in-the-loop evaluation was used, with 27 respondents who were each shown 15 images. In the end, 23.21% of the images were liked, suggesting that the system can work. The actual implementation takes more work however, along with future research to validate that the system actually works, due to the experimental nature of this research.

Keywords

Fashion; Neural Network; Recommendation system

1. INTRODUCTION

As of today, the number of people shopping online and the quantities in which they order are increasing rapidly. Research from the Forrester Research group clearly indicates a yearly increase


in the amount of online shopping [9]. In [9], a poll among 5.000 customers found that in 2016, 51% of the total shopping was done online, up from 48% in 2015 and 47% in 2014. This could possibly be explained by the increasing acceptance of online shopping by the general public [13]. One branch that is following and profiting from this trend is fashion retail. Companies like Zalando, ASOS and TaoBao provide an online fashion platform to sell clothes from a combination of retailers and show a yearly increase in revenue. To make this explicit: as shown by the yearly numbers published by Statista, Zalando has enjoyed an increase in revenue of 23.01% from 2015 to 2016.

Indispensable are the recommendation systems that technically back up these platforms. These platforms track your behaviour on their website and will display or recommend items similar to the things you have looked at or interacted with so far. It has been shown that these recommendations are very popular and therefore fruitful. This means that recommendation systems are expected to play relevant and growing roles within the development of online platforms and their revenues [3]. Improving and investing in such systems therefore seems desirable and profitable for the aforementioned online fashion platforms.

Even though a lot of work has already been done in the field of recommendation systems, the systems are far from perfect and improvements can still be made. Numerous papers are published each year suggesting small improvements on existing models or proposing somewhat novel approaches to existing contexts. Within the field of recommendation systems, three main approaches to recommend entities exist: collaborative filtering, content-based filtering and hybrid approaches. Collaborative filtering uses user behaviour and purchase history to understand the user and match him or her to other people that show the same behaviour in order to recommend items. Content-based filtering, as the name suggests, analyzes both the user's profile and the potential items, describes them using the same features, and subsequently selects and recommends the items that best match the user's profile. The last approach combines content-based filtering and collaborative filtering in a way determined by the creator. The hybrid approach is often used in current recommendation systems: it starts off with content-based filtering and, as the user buys and looks at items, the system learns the user's behaviour and slowly moves towards collaborative filtering. This gives satisfying results for contexts such as movies or books, since one's taste is best described by one's likes and dislikes rather than by a preset profile. However, for the fashion context this only works up to a certain degree. Here, user behaviour is certainly relevant and helps understand the user, but in fashion the user profile and someone's physical appearance


are of greater importance in order to determine the suitability of specific items of clothing to the individual. Therefore, this thesis proposes a content-based approach using the body features of a person. Features such as height, weight, body shape and hair colour are used to map the user's body and to match it to that of celebrities. Assuming that celebrities dress well for their body type, it is expected that flattering recommendations can be made to the individual. So, the overall goal is to improve current recommendation systems by utilizing this method.

Examining current work in the field of fashion recommendation, it can be concluded that there is a lack of systems basing their recommendations on bodily features. Still, body feature extraction is improving and expanding even without any interest from the recommendation field. The current study attempts to combine said improving methods into one recommendation system that recommends suitable clothing items to the user. Next to that, it also attempts to improve the body feature extraction methods altogether. This means that the proposed system could contribute to both the recommendation and the feature extraction field. Furthermore, the system does not have to be solely a standalone implementation but could also be combined with existing recommendation systems to improve their performance.

In order to formalize the aforementioned system, a main abstract research question is answered along with a few sub-questions that create concrete problems to solve. The main research question in this thesis is the following:

How well does clothing recommendation based on body features work?

The concrete sub-questions are mentioned below:

1. How to gather the data?
   (a) Where to gather images to train on?
   (b) Where to gather the true labels for those images?

2. How to predict the body features?
   (a) How to predict facial features?
   (b) How to predict full-body features like height, weight and BMI?

3. How to evaluate the created framework?
   (a) Which evaluation criteria to use?
   (b) Where to obtain comparable results?

In this thesis, three novelties can be highlighted that contribute something to the field. The novelties are the following:

• The data set created in this thesis can be used for other purposes as well. The set contains almost 300k images of celebrities along with the correct information for certain body features.

• The recommendation system takes a user input image and returns a list of celebrities and their clothing as recommendations for the user, which has not been done before and could be deployed next to existing systems.

• The system is a novelty, thus a user-in-the-loop app was created for evaluation. A similar evaluation is applicable to many more cases.

The remainder of this thesis starts with a related work section, which compares the current literature. Afterwards, the proposed methods are explained, followed by the experimental setup, which specifies the details and the evaluation. Subsequently, the results are discussed, which leads to the conclusion and future work in the final section.

2. RELATED WORK

In this section, the current literature surrounding the field of recommendation within the fashion context will be discussed. This discussion consists of different methods solving the same problems and similar methods solving different problems. The combination of both serves as a base to validate the thesis and prove both its scientific and practical relevance.

Current recommendation systems utilize collaborative or hybrid methods as these have proven to work well in many situations. This is reflected in the literature, as many authors take the same approach to improve on the current benchmarks. The same holds true for the field of fashion recommendation. In order to improve, people come up with new adaptions of the current approaches by adding specific features that are relevant for the field they work in. An example of such an adaption is the work of Yu-Chu et al. [12]. In [12] the authors acknowledge that the fashion field is different from other fields. Systems from other fields assume that users behave similarly when purchasing items, but this is not the case for fashion, since personal preference has a large impact and users rarely own the same pieces of clothing. Yu-Chu et al. suggest including personal preference to overcome this difference by incorporating the history of clothing items and the user's own evaluation of previous recommendations, without comparing those to others. Experimental results show that this method outperforms existing systems under the same conditions, thus suggesting a better approach. Another paper that emphasizes the importance of personal preference in fashion is that of Hu, Yi and Davis [6]. In order to incorporate personal preference, their system suggests pieces of clothing that match the user's personal preference. The system utilizes a functional tensor factorization to map the user and fashion items, in combination with a non-linear vectorization of the fashion items, to suggest a set of clothing pieces. Experiments were performed by Hu, Yi and Davis on real user data, validating the performance. A recently proposed hybrid method also adds on to current collaborative methods for better results [7]. Their article builds upon the recent success of incorporating visual features contained in items on top of collaborative filtering. The system was trained simultaneously to learn recommendations and image representations from clothing pieces in a Siamese CNN. This method outperformed similar methods that also utilize visual features. In addition to recommendations, the model was able to generate clothing pieces given a user and category, which could be useful for designing new clothing pieces and styles.

However, more recent literature shows an increase in interest in content-based methods. Researchers consider new features and angles that differ from the common collaborative method. In the fashion context this often results in using the clothing pieces themselves as the main component for recommendation, due to the importance of the visual appearance of clothing pieces. The paper by Yu et al. is in line with this trend [11]. It acknowledges the importance of the clothing pieces but differs from the pack by focusing on the aesthetic features of clothing pieces. Common approaches utilize classic image representations, but these do not capture aesthetic features. The system of Yu et al. uses two neural networks to achieve the recommendation: a CNN that classifies the clothing from the RGB components of the clothing and a


pre-trained network that receives the image and creates a list of relevant aesthetic information. Subsequently, this information is combined with user information in a tensor factorization model to personalize the recommendation. Yu et al.'s results were compared to several current recommendation systems on real-world data, which showed a significant improvement in performance. The study of de Melo, Nogueira and Guliato uses clothing pieces as the main component as well [5]. Traditional recommendation systems use text and visual features to compute the similarity of two items. However, they argue that human visual attention plays an important role in how people perceive clothing. To incorporate this into a system, a model was trained that captures those attention points on clothing pieces. Afterwards this novel approach was used to calculate a similarity that was used in conjunction with more traditional methods, resulting in a similarity score. In the experimental results, this combination culminated in better results than the benchmarks. The work of McAuley et al. aligns with the previously mentioned studies as well [8]. This article presents a system that is able to find complementary and substitute clothing pieces based on a query image. For example, the user uses an image of a shoe as input and the system returns pants, a shirt and a belt that go along with that shoe, or it suggests another pair of shoes that is similar to the query. The idea is that humans make relationships between entities in a natural way and the system tries to capture that sense of relationships. To achieve this, a model was used to extract features from images and the items were placed within a large-scale network to find similar pieces. This network was utilized for both training and evaluating and resulted in the eventual system, which could be used in a wide variety of applications. The last paper discussed in this section differs from the previously discussed articles as it incorporates the user's social circle in the recommendation [10]. It argues that the personal circle of the user has a large influence on his or her fashion style. The system consists of two parts: one part measures fashion style consistency, while the other considers the personal circle of the user. A Siamese CNN was trained for the fashion style consistency, which transforms the fashion item features into a latent space and calculates the similarity between items. The social circle consisted of three factors, namely interpersonal influence, personal interest and interpersonal interest similarity. These factors were combined with the first part into a probabilistic matrix factorization. This resulted in a recommendation for clothing pieces. Evaluation was performed on real-world data and showed promising performance.

In addition to the previously mentioned existing content-based methods, some systems similar to the one proposed in the current thesis are already employed. There are, however, still significant differences to recognize. The study of Zhang et al. contains an evaluation system for clothing recommendation with body features [14]. The motivation comes from an increasing amount of garments and the trouble merchants have meeting the demands, resulting in a total increase in time spent on trying and finding clothes. The paper of Zhang et al. proposes a threefold approach as a remedy for this problem. The first step was establishing a data set of images along with a fashion level. This database was used to train a support vector machine by extracting weak appearance features. The face was located in the image and the hair, make-up, accessories etc. were localized, which was used alongside the SVM and the ground truth fashion level. The resulting classification was validated through a hierarchical analysis and showed strong results in predicting someone's fashion level. As the writers suggest, this approach can be taken further and incorporated in recommending actual clothing pieces. The final relevant paper of this section comes closest to the approach proposed in

Figure 1: Overview of the architecture. The top shows the data preparation, where celebrities were used as subjects to gather ground truths and images. The bottom shows the process of training a model and the process of predicting via user input.

this thesis, as it combined both the body style and fashion style of women to recommend clothing [4]. Their system consists of three different modules that each make up one task. The first two classify the fashion style and the body type from the input image of the user. The remaining module combines those results and links an image of a model with clothing pieces that match the given style and body. This method gives the user the ability to choose different categories within clothing to match the user's specific needs.

Different from the aforementioned works, a fashion recommendation system is proposed that utilizes attribute embedding and content-based methods to recommend clothing pieces. As seen, this has not been done before in this way, validating the work done in this thesis.

3. PROPOSED METHOD

The following section starts off with a problem definition in order to create a concrete, solvable question. Subsequently, the procedures are explained on a high level to give an overview of the proposed methods. Finally, each method is explained and justified.

3.1 Problem definition

As set out in the literature review above, several fashion recommendation systems already exist. Yet none of these satisfy the specific context required for fashion recommendations. Body features play a significant role in how clothing fits, yet only a few of the systems use those features, and those features are often not used to recommend specific clothing pieces. Building such a system comes with a few complex problems to solve. The training data is the first problem: both the data and the ground truths have to be crawled, and such data is hard to come by. The system itself requires decisions on the type of model and the training of such a system. The evaluation is difficult as well due to the novelty of the system, which means that a comparable benchmark is not available and an experimental evaluation has to be created. An overview of the proposed methods used in this thesis is presented in figure 1.

3.2 Data gathering and preparation

In order to create a successful fashion recommendation system, a data set that contains both images and ground truths is required.


Algorithm 1 Celebrities list
Result: List of celebrities
for Page in sources (IMDB, The Famous People etc.) do
    Scrape page
    for Name in scraped page do
        if Name not in CelebList then
            Add celeb to list
        end
    end
end

In addition, said data set has to be large in volume and diverse enough to capture different kinds of people around the world. This includes but is not limited to race, hair color and eye color. The data also has to be online and free to use, as the resources for this project are limited. Given these requirements, the choice was made to use celebrities. Celebrities are a great fit for this project as they are often very public. This means that for relevant celebrities, both their pictures as well as additional recent and accurate information are available online for free. In order to gather such data, the celebrity names had to be scraped from the web; the process used in this thesis is outlined in algorithm 1. The first step was choosing the right sources that contain the celebrity names and, most importantly, are structured in a consistent manner. The structure was the most important, as the names had to be scraped by considering the underlying HTML pages that contain the information. When these pages are consistent, the amount of false positives is reduced. The sources used throughout the thesis for this purpose were IMDB, The Famous People, US Magazine and Wikipedia. These pages are all recent, popular and contain many celebrity names.
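As an illustration, a minimal sketch of the scraping step in algorithm 1 is given below, using the BeautifulSoup and urllib modules described in section 4.1.1. The source URLs and the HTML tag that holds a name are placeholder assumptions, since the real markup differs per source.

```python
# Sketch of algorithm 1: scraping celebrity names from consistently
# structured HTML pages. The URLs and the "h2" tag are illustrative
# placeholders, not the exact markup of the real sources.
from urllib.request import urlopen
from bs4 import BeautifulSoup

SOURCES = [
    "https://www.example.com/celebrities/page-1",  # placeholder URL
    "https://www.example.com/celebrities/page-2",  # placeholder URL
]

celeb_list = []
for url in SOURCES:
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    for tag in soup.find_all("h2"):          # tag assumed to hold one name
        name = tag.get_text(strip=True)
        if name and name not in celeb_list:  # deduplicate across sources
            celeb_list.append(name)
```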

Retrieving the ground truths for each celebrity was the next step after collecting all the names. Choosing the source came first: the source had to be trustworthy, consistent in its markup and preferably contain as many celebrities as possible of all sizes and shapes. In the end, three similar websites fulfilled the requirements best, namely Healthy Celeb, Celeb Healthy Magazine and Celebrity Sizes. All three websites are somewhat trustworthy and consistent in their markup, thus usable for this case. The differences mostly lie in the information that they contain. The goal was to gather as much information as possible to allow for pruning of information if certain categories are hard to predict or too diverse. All the websites contain the basic information, which includes height, weight, eye color and hair color. Healthy Celeb sets itself apart, however, because it contains two unique categories. The build category has a variety of build possibilities ranging from slim to large and everything in between. This category is useful as the body type has a large influence on how clothing fits, which could prove effective for a clothing recommendation system. The race of a person could also be interesting to use, as the color of one's skin has an effect on how colors of clothing look. The choice was made to use Healthy Celeb as the main source for the ground truths due to these two additional categories. Important to note here is that, as the name suggests, the site mostly contains healthy celebrities, which could give a skewed image. In line with the common trend, most celebrities are in shape today due to the increasing emphasis on health, which is represented on this website. However, the other mentioned websites suffer from the same problem, as they also mostly contain healthy celebrities, and there was no other easy source for the required data.

After selecting the source, the next task was retrieving the features from the source. Algorithm 2 contains an overview of the procedure used.

Algorithm 2 Ground truths
Result: Ground truths
for Name in CelebList do
    Forge Healthy Celeb link
    if Link is valid then
        Extract features
        Preprocess features
        Add name and features to list
        Guess gender
    end
end
Count categories
Merge and remove categories
Calculate BMI category
Normalize all the features

The first step was creating a link and checking whether the link is valid. Afterwards, the page is considered and the features were extracted by looking at the HTML page. Subsequently, the features were preprocessed to remove as many inconsistencies as possible, such as differences in measurement units. In addition, the gender of the celebrity was guessed, which could be done based on either their image or name. The method using the image is more accurate but also harder to implement and uses more resources to predict. Because of the limited resources available in this project, the lightweight option was chosen. After the ground truths were collected, the final proceedings consisted of removing and merging categories to reduce the total number of categories to a manageable amount. The last preprocessing step consisted of normalizing the features. Normalizing usually yields better results on models, is sometimes even required and is relatively easy to apply. In this case the values were normalized between 0 and 1 and the categories were one-hot encoded.
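To make the last step concrete, the sketch below shows min-max normalization to [0, 1] and a one-hot encoding, which is what the preprocessing amounts to; the feature values and class list used here are illustrative.

```python
# Sketch of the final preprocessing: min-max normalization of real-valued
# features and one-hot encoding of categorical ones. Values are illustrative.
import numpy as np

def min_max(values):
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

heights_norm = min_max([155.0, 170.0, 185.0, 201.0])  # heights in cm -> [0, 1]

BUILD_CLASSES = ["Athletic", "Average", "Large", "Slim"]

def one_hot(label, classes):
    vec = np.zeros(len(classes))
    vec[classes.index(label)] = 1.0
    return vec

build_vec = one_hot("Slim", BUILD_CLASSES)  # -> [0., 0., 0., 1.]
```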

The actual data sets that had to be crawled must consist of images of celebrities, preferably pictures of their whole body. Examples of great sources are Flickr, Google Images, Imgur and Pinterest. These sources fit this problem well as they offer high-quality pictures and a search option, which allows for retrieving specific images via a query. The search options differ, however, when comparing the sources. Imgur was dropped, for example, as it often returns jokes and memes when querying a celebrity. APIs further set apart the different sources. Google Images stopped supporting its API, which makes scraping images hard, thus Google Images was dropped as well. The remaining two sources are rather similar in performance and both have API support, either as an official API or through a third party.

The difference between Flickr and Pinterest comes from the rules of the website. Flickr has strict rules for uploading images and is purely focused on images, while Pinterest also incorporates social network aspects. Due to this, the overall quality of Flickr photos is higher, which led to the decision for Flickr as source. The next step was collecting the images from Flickr; the process used for this can be seen in algorithm 3. The celebrity names were used as the query for the Flickr API. The total amount of pictures was limited to 500, because this gives enough data to work with while simultaneously not letting the quality drop too much. Each image was filtered by checking whether it was RGB and whether there was a single face in the image. Counting the faces increased the probability that a celebrity was in a picture, which improved the overall quality of the data. The images that passed this test were then preprocessed and a total of three data sets were created.


Algorithm 3 Data set
Result: Three data sets
for Name in CelebList do
    for Image in Flickr query[:500] do
        if Image is RGB and Image contains 1 face and celebrity has at least 100 images then
            if Image contains body then
                Extract body
                Extract face
                Resize body
                Resize face
                Store body in data sets 2 and 3
                Store face in data set 3
            end
            Resize and crop Image
            Store new Image in data set 1
        end
    end
end

The first data set contained images that were cropped and resized to the appropriate size for the system. The images were first resized and afterwards cropped to keep the aspect ratio the same. The second data set consisted of a body extraction from the original image. This extraction was then preprocessed to fit within the boundaries of an input image by resizing it; the remaining space within the image was filled with black. The final data set had the same images as the second one and in addition contained the extracted face from the original image, resized in the same way as the body pictures.

3.3 Recommendation system

The proposed system requires both user features and a way to find people similar to those features in order to recommend clothing pieces. Both requirements have multiple options to reach the same goal. The user features can either be predicted via some model or filled in by the user himself. For this thesis, the former method is proposed as the way of gathering features. The choice was made for two main reasons. The biggest difference comes from the amount of user effort that is required. Letting the user fill in all the information takes more time and effort, raising the threshold for using the system. In this experimental case, the threshold should be kept as low as possible as people are hesitant to use the system, which validates the choice for the model. Furthermore, accuracy and usability played a role in the decision as well. Manual user input is expected to have a higher accuracy as the user can decide for himself, especially when picking categories such as eye color and hair color. The build category is difficult however, as that category is rather vague and placing yourself in such a category is hard. The accuracy of the build was the most important one for this context due to the goals of the system. For this reason, the decision was made to use a model for predicting the user's features. This model can be represented in a mathematical overview. A set of users P provide a single image I of themselves. The model M takes the input I and extracts features F from this image. These features F are compared to the other users in set P or some data set D, which in this case is a data set of celebrities and their features F. This comparison results in a list R with the most similar users or entities from the data set. The top three of list R is returned as recommendation to the user. The implementation of such a model can be achieved in multiple ways. In this case, the proposed model was a convolutional neural network (CNN).

The systems within the current literature mostly utilize neural networks as they show the best performance in similar recommendation cases. The rise of CNNs can be explained by the increase in both computing power and volume of data. The data sets created in this thesis also contain a large amount of data points, making CNNs a strong option. Transfer learning was chosen as the method for training the model instead of training it from scratch. The project was conducted in limited time with limited resources, as mentioned before, and by using transfer learning the total amount of time required to get results is reduced. VGG19 was picked as the base model for the transfer learning. This was done for the sake of keeping the model lightweight while still getting top results, as the difference in performance on the ImageNet benchmark between different models is marginal. A few layers were added on top of the VGG19 to train specifically for this case. In addition, the learning rate was tweaked alongside the training to compare the base results between the three different data sets created in the previous steps.

Comparing users for similarity can be done via a multitude of similarity measures. Cosine similarity was used for the feature set and is defined in the following way:

\cos(t, e) = \frac{t \cdot e}{\|t\|\,\|e\|} = \frac{\sum_{i=1}^{n} t_i e_i}{\sqrt{\sum_{i=1}^{n} (t_i)^2}\,\sqrt{\sum_{i=1}^{n} (e_i)^2}} (1)

This measure was chosen as it is more efficient on high-dimensional sparse data sets. The feature set of this system fits perfectly, as it is sparse and the dimensionality can be increased or decreased as seen fit. Furthermore, the literature shows that cosine similarity often performs better on positive data bounded between 0 and 1, which is the case here as normalization between 0 and 1 was applied in algorithm 2. The cosine similarity can be used to find the most similar celebrities based on the user input image. Afterwards the model can recommend clothing pieces worn by the celebrities, which in this case are the same images that were used for training. This was done as the available time was limited and the focus within this thesis was on learning the features and coupling those predictions with other users or celebrities.
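The following sketch implements equation (1) and the top-3 selection from list R in NumPy; the dictionary layout of the celebrity features is an illustrative assumption.

```python
# Sketch of equation (1): cosine similarity between a predicted user feature
# vector and each celebrity's feature vector, returning the top-3 matches.
import numpy as np

def cosine_similarity(t, e):
    return np.dot(t, e) / (np.linalg.norm(t) * np.linalg.norm(e))

def top_matches(user_features, celeb_features, k=3):
    # celeb_features: dict mapping celebrity name -> feature vector in [0, 1]
    scores = {name: cosine_similarity(user_features, vec)
              for name, vec in celeb_features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```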

4. EXPERIMENTAL SETUP

The next section is composed of the implementation details of the previously proposed methods. Furthermore, the evaluation metric used for the experimental setup is detailed. The section ends with the experiment that was conducted to evaluate the system.

4.1 Implementation details

The implementation details are divided in the same way as the proposed methods to make the distinction obvious and to keep the structure clear. The programming language used throughout this section and the system itself was Python 3.5.2. Python was the most appropriate as the level of experience with it was the highest and every problem was solvable within that language.

4.1.1 Data gathering and preparation

The data gathering started off with collecting the celebrity names from the web. The BeautifulSoup and urllib modules were utilized to achieve this. Urllib allows for access to web links and the page content of those web pages. BeautifulSoup structures a web page by HTML tags and gives the user the ability to search those web pages by tag in a structured manner. In total, four different sources were accessed for the collection of celebrity names,


Figure 2: Ground truths extracted from Healthy Celeb along with the first five images crawled from the Flickr API.

namely Wikipedia, US Magazine, IMDB and The Famous People. Wikipedia and US Magazine each had a single page with names, so providing a single link along with the BeautifulSoup commands proved to be enough to gather names. The Wikipedia page was named "List of Asian Americans", which was used due to the lack of Asians on the other websites, while US Magazine had celebrity names of some of the main or general celebrities that are known worldwide. The other two sources required the forging of links to gather all the possible names. The IMDB source was a user-created list of 1000 actors and actresses spread over 10 pages with the same format. A unique web link was created for each of those pages and using the same script all the names were harvested. The most useful source was The Famous People, as it contained a section with birthdays of celebrities from different fields, in contrast to the other sources that mostly provide movie star names, which could give a skewed fashion image. Each day in the year had a list of celebrities born on that day, resulting in 366 web pages that were scraped with BeautifulSoup. The site goes back to the 1900s, resulting in irrelevant celebrities, thus every celebrity with a birth year earlier than 1970 was dropped. The final list had a total of 10.243 celebrity names from different fields and ethnicities. The ground truths had to be extracted from the Healthy Celeb site. Both BeautifulSoup and urllib were used to achieve this, similar to the celebrity names. The process started with creating and validating the link using all the celebrity names. The final list contained 1847 celebrity names and their working links. Those links were used to extract the different features from the Healthy Celeb website using BeautifulSoup. Seven features were extracted for each celebrity, namely height, weight, build, race, hair color, eye color and measurements. An example of the extracted ground truths and extracted images can be seen in figure 2. The height, weight and measurements contained both the metric and imperial values, however only one value was needed, thus the imperial values were removed. In addition to the seven categories, the gender was guessed for each celebrity due to the lack of gender information on the website. The gender-guesser module was utilized to achieve this. The module takes a name as input and returns either male, female, mostly male, mostly female, andy (androgynous) or unknown. Every non male or female class was classified into male or female; this process was manual for each class not containing male or female in the name. Afterwards the categorical features were counted, which resulted in a total of 167 unique categories, too many to learn. Therefore, the total amount of categories had to be cut to a reasonable amount. The categories were inspected and many categories had a relatively low amount of occurrences due to some celebrities having a sentence instead of a single word for a category. Furthermore, hair colors were also described as dark brown, light brown etc. In addition, some celebrities had no value for certain categories; especially the measurements category was sparse. The aforementioned problems were reduced or resolved by
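A minimal sketch of the gender-guessing step is shown below, using the gender-guesser package mentioned above; the mapping of the intermediate classes to male/female mirrors the manual resolution described in the text.

```python
# Sketch of the gender-guessing step with the gender-guesser package.
import gender_guesser.detector as gender

detector = gender.Detector()

def guess_gender(full_name):
    first_name = full_name.split()[0]
    raw = detector.get_gender(first_name)  # 'male', 'mostly_female', 'andy', ...
    if raw in ("male", "mostly_male"):
        return "male"
    if raw in ("female", "mostly_female"):
        return "female"
    return None  # 'andy' and 'unknown' were resolved manually in the thesis
```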

Build      Eye     Hair     Race
Athletic   Black   Bald     Asian
Average    Blue    Black    Black
Large      Brown   Blonde   White
Slim       Gray    Brown    Light-Skinned
           Green   Gray     Multi-Racial
           Hazel   Red

Table 1: The remaining categories after removing the measurements feature and merging the remaining 167 categories.

removing the measurements category and any celebrity that lacked a value for any remaining category. The latter was justified, as the total amount of celebrities was still high enough after removing these celebrities. Besides removing rows and a column, non-distinctive categories were merged into one category. The sentences and classes like dark brown were merged into one overarching class, for example brown. This process reduced the total of 167 categories to merely 21 different categories across the four categorical features. An overview of these values can be seen in table 1.

Afterwards, the BMI category was added by using the basic BMI formula. This resulted in three different real-valued categories, which were all normalized to a value between 0 and 1.
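For reference, the basic formula used here is body weight divided by the square of the height; a one-line sketch:

```python
# The basic BMI formula: weight in kilograms divided by height in meters squared.
def bmi(weight_kg, height_cm):
    return weight_kg / (height_cm / 100.0) ** 2

print(bmi(70, 175))  # -> ~22.9
```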

A Flickr API module and the urllib module were used to download images from Flickr. The Flickr API module is called flickrapi and is a wrapper around the official API for Python. Each celebrity name was used as query, alongside a sorting based on relevance. The images were downloaded with urllib and saved locally. Each call produced up to 500 images of a celebrity if available, which was done to gather enough data while not reducing the quality by much. This resulted in a total of 799.715 images. Afterwards, the faces in each image were counted using the dlib and OpenCV modules. Every image that did not have exactly one face was removed, reducing the total number of images to 324.191. This number was further crunched by removing any celebrity that did not have 100 images, culminating in a total of 282.831 images divided among 1.366 celebrities. The division for each category can be seen in tables 2, 3, 4 and 5. The first data set was created out of this image set. The model as described in section 4.1.2 requires an input of 224 by 224. Therefore, every image was resized and cropped to that size using the PIL module. The resizing was applied first: every image's smallest side was reduced to 224 while keeping the aspect ratio the same. This often resulted in images with a side larger than 224, which was cropped as a solution. The crop was applied from the top, as the top often contains the face, which holds valuable information for the system. The other two data sets required more preprocessing. The original set of 282.831 images was taken and Mask RCNN was run on those images. Mask RCNN can identify bodies in images and the first body that it found was extracted. If an image did not contain a body, the image was removed, resulting in a total of 278.346 images. Subsequently, the faces from those images were extracted and stored as well. Finally, both the body and face images were resized to fit within a 224 by 224 boundary, adding black pixels to fill up any empty space that was left. One data set was created with only bodies, while the other data set was composed of both the bodies and faces of each original image. An example of each created image can be seen in figure 3.
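The sketch below ties these steps together: querying Flickr per celebrity via the flickrapi wrapper, keeping images with exactly one dlib-detected face, and resizing and cropping to 224 by 224 with PIL. The API credentials are placeholders and the exact flickrapi fields (such as url_c) may differ per photo, so this is an illustration rather than the exact crawler.

```python
# Sketch of the image pipeline: download up to 500 Flickr results per
# celebrity, keep images with exactly one face, resize and crop for data set 1.
import flickrapi, urllib.request, dlib, cv2
from PIL import Image

flickr = flickrapi.FlickrAPI("API_KEY", "API_SECRET")  # placeholder credentials
face_detector = dlib.get_frontal_face_detector()

def download_images(name, limit=500):
    paths = []
    for i, photo in enumerate(flickr.walk(text=name, sort="relevance",
                                          extras="url_c")):
        if i >= limit:
            break
        url = photo.get("url_c")
        if url:
            path = f"{name}_{i}.jpg"
            urllib.request.urlretrieve(url, path)
            paths.append(path)
    return paths

def has_single_face(path):
    img = cv2.imread(path)
    return img is not None and len(face_detector(img, 1)) == 1

def resize_and_crop(path, size=224):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)                    # smallest side becomes 224
    img = img.resize((round(w * scale), round(h * scale)))
    return img.crop((0, 0, size, size))         # crop from the top
```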

4.1.2 Recommendation system

In total, three models were built throughout the course of the research, one for each type of input as described in the previous section.


Athletic   Average   Large   Slim
480        226       21      639

Table 2: Build category distribution

Black   Blue   Brown   Gray   Green   Hazel
43      388    582     15     210     128

Table 3: Eye category distribution

Bald   Black   Blonde   Brown   Gray   Red
11     272     325      697     18     43

Table 4: Hair category distribution

Asian   Black   Light-skinned   Multiracial   White
27      149     60              177           953

Table 5: Race category distribution

Figure 3: Three different ways an image was preprocessed for use. The resized and cropped image was used for model 1, the body extraction was used for models 2 and 3 and the face extraction was used for model 3.

The different models were built using the Keras framework for Python. Keras was used as it is a high-level module on top of Tensorflow, while still allowing the creation of complex architectures. Furthermore, the VGG19 base model is implemented in the module and it is easy to start building an architecture with Keras. The first step in building the models was choosing an architecture, specifically the layers added on top of the base model. Each model had the same structure, making comparisons valid. VGG19 formed the basis of each model, however the third model had a second VGG19 model next to the first one: one handled the body picture input, while the other handled the face image input. The layers added to each of the three models consisted of one flatten layer, two dense layers with 4096 nodes, a dropout layer after each dense layer with a probability of 0.5 and finally the output layer. The output layer had five different outputs, one for each categorical feature and one for the real values. The number of nodes of each output equaled the number of categories of that feature, or three for the real-valued output. An example of the first model is given in figure 4.
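A sketch of model 1 in Keras under these descriptions is given below. The output activations (softmax for the categorical heads, sigmoid for the normalized real values) are assumptions, as the thesis does not specify them.

```python
# Sketch of model 1: a VGG19 base followed by a flatten layer, two dense
# layers of 4096 nodes with dropout 0.5, and five output heads.
from keras.applications.vgg19 import VGG19
from keras.layers import Input, Flatten, Dense, Dropout
from keras.models import Model

inp = Input(shape=(224, 224, 3))
base = VGG19(weights="imagenet", include_top=False)(inp)
x = Flatten()(base)
x = Dropout(0.5)(Dense(4096, activation="relu")(x))
x = Dropout(0.5)(Dense(4096, activation="relu")(x))

outputs = [
    Dense(4, activation="softmax", name="build")(x),  # 4 build classes
    Dense(6, activation="softmax", name="eye")(x),    # 6 eye colors
    Dense(6, activation="softmax", name="hair")(x),   # 6 hair colors
    Dense(5, activation="softmax", name="race")(x),   # 5 race classes
    Dense(3, activation="sigmoid", name="real")(x),   # height, weight, BMI
]
model = Model(inputs=inp, outputs=outputs)
```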

A generator had to be used to provide the data due to the sheer size of the data and the multiple outputs of the model, creating a unique case. Keras provides basic generators, however due to the multiclass multilabel nature of this system these basic generators do not fulfill the need, thus a custom generator had to be created. The first step consisted of partitioning the available data into training and test data. The Sklearn module was used to achieve this, as it has a method called train_test_split which shuffles the input and partitions it in a given ratio. The partition was 80/20 in this thesis, which resulted in two lists of random image names. These names were the labels of the images in the data set as gathered from the Flickr API. Two label files were also created, one for the real values and one for the categorical values. These files were dictionaries where the key was the image name and the value was a list of the ground truths for that image. The generator required the indices as specified by the input file and the batch size. First, it randomized the indices, which means that each epoch has a different ordering. Afterwards, the generator divided the indices into batches of the given batch size and looped through the batches one by one. Each iteration formed a step within the epoch.

Figure 4: Example of first model.

For each step, the generator looped through the indices and retrieved the images and the ground truth labels from the aforementioned label files. Everything was stored in Numpy arrays and for each step the generator yielded a tuple containing the images and a dictionary with the ground truths for each feature. This meant that two images and the ground truths were provided for model three and only one image for the first two models. The generator was wrapped in a while True loop and only yielded a batch of images and ground truths if the model asked for it. It could be used for both the training and test set to provide the input data required to train or test the system.
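A condensed sketch of such a generator for the single-image models is shown below; the file layout and label dictionaries are illustrative assumptions.

```python
# Sketch of the custom generator: shuffles the image names each epoch and
# yields (images, {output_name: labels}) batches that match the five heads.
import numpy as np
from PIL import Image

def data_generator(names, real_labels, cat_labels, batch_size, img_dir):
    # names: image file names; real_labels / cat_labels: dicts keyed by name
    while True:                                    # Keras pulls batches on demand
        order = np.random.permutation(len(names))  # new ordering each epoch
        for start in range(0, len(names), batch_size):
            batch = [names[i] for i in order[start:start + batch_size]]
            imgs = np.array([np.asarray(Image.open(f"{img_dir}/{n}"))
                             for n in batch])
            targets = {"build": np.array([cat_labels[n][0] for n in batch]),
                       "eye":   np.array([cat_labels[n][1] for n in batch]),
                       "hair":  np.array([cat_labels[n][2] for n in batch]),
                       "race":  np.array([cat_labels[n][3] for n in batch]),
                       "real":  np.array([real_labels[n] for n in batch])}
            yield imgs, targets
```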

The model also required different parameters to be set. The optimizer used for each model was the Adam optimizer. Adam is the current default choice, as it converges relatively fast and works with basic settings for most cases [2]. This was perfect here, as there was a time constraint. The actual learning rate used for the model was 1e-5 for every epoch, however some experimentation was done with tweaking the learning rate, as can be seen in section 5. Two types of losses were used, one for each type of feature. The real features had mean squared error as their loss, while the categorical features had categorical cross entropy. The latter was appropriate in this particular case as each feature had its own output. The amount of epochs was kept reasonable to retain control. For the first two models this resulted in 20 epochs per run, while for the third model, with two VGG19s as base, the runs were done in intervals of 10 epochs due to the size of the model and available storage space. Class weights were also used for each model. Tables 2 up to 5 show a skewed distribution for each category, especially the race category. To combat this, Keras allows the user to define class weights to tell the model to learn more from specific classes and balance out the distribution. The compute_class_weight method of Sklearn was used to calculate the values for the different categories. These values were then passed to the model in a dictionary where each feature had its own entry and class weights to use.
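The class-weight computation can be sketched as follows, using the build distribution from table 2; the dictionary layout for the Keras call is an assumption.

```python
# Sketch of the class-weight computation with Sklearn; rare classes such as
# Large receive a larger weight, balancing the skewed distribution.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

build_labels = np.array(["Athletic"] * 480 + ["Average"] * 226
                        + ["Large"] * 21 + ["Slim"] * 639)
classes = np.unique(build_labels)
weights = compute_class_weight(class_weight="balanced",
                               classes=classes, y=build_labels)

# One entry per output head, mapping class index -> weight.
class_weight = {"build": dict(zip(range(len(classes)), weights))}
```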

The actual training was done on the DAS-4 cluster available at the UvA [1]. The cluster is public and has computers with TitanX graphics cards available for use. Epochs took 36 minutes for the first two models, while the third model took up to 1 hour to train one epoch


Figure 5: The web pages of the app.

Figure 6: Overview of the web app architecture.

as it had a second VGG19 model attached, increasing the total amount of trainable parameters. The cluster decreased the training time significantly when compared to the home machines. Availability of the clusters was limited however due to the high traffic, and the clusters had limited personal space for storing data sets, which resulted in less training and experimenting than planned.

4.2 Evaluation metrics

The evaluation metrics highlighted in this section are the ones used for testing the model and evaluating the results of the experiment. The model was evaluated by two main metrics, one for the real features and one for the categorical features. Mean squared error was used for the real features. It is one of the default metrics for regression tasks as it penalizes very wrong predictions harsher than slightly wrong ones. This was important for this case, as body features that have a large impact on how clothing fits, like height and weight, should be guessed as accurately as possible. The metric is included in Keras and is logged if specified by the user. The categorical features were evaluated by the mean average precision (MAP). This metric was chosen as it combats the accuracy paradox, which can make accuracy a misleading metric. In this case it was important to pick at least one class, as the recommendation system needs something to calculate the similarity with. The implementation of MAP was not as trivial as mean squared error, as MAP is not part of Keras due to context-dependent calculations. In this thesis, Sklearn was used for the implementation, specifically the precision_score method. The actual evaluation function used the same generator as the training to load in the data. Afterwards the probabilities were calculated for each batch of data. The probabilities for the categorical features were edited, which meant turning the highest probability class into a one and the rest of the values into zeroes. This was done as the average precision method expects ones and zeroes for its calculation. Subsequently, the edited probabilities were given to the method along with the macro keyword, which specifies that label imbalance is not taken into account. This resulted in a score for each category, which was summed and divided by the total amount of categories, which was four. This calculation led to the MAP score, which was stored

in a dictionary with a key-value pair for each epoch performed. The experiment was harder to evaluate due to the novelty of the system and subject. This meant that no benchmark was available that was applicable to this case, thus a specific metric had to be used for evaluation. The decision was made to use the user-in-the-loop method for evaluation. The downside of this method is that the results are experimental, and thus cannot be used for strong conclusions. It can give an indication of performance however, which could justify further research on the same subject. For evaluation, the user had to provide an image as input and the system would produce three celebrities along with five images of each celebrity. He or she could 'like' the images with celebrities and their clothing, which resulted in a score of x out of 15, where x was the amount of likes. These scores were summed up to give a percentage of liked images, which can be interpreted as an experimental indication of success.
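A sketch of this MAP computation is given below; the per-feature dictionary layout is an illustrative assumption.

```python
# Sketch of the MAP computation: turn the highest probability per categorical
# head into a one-hot prediction, score macro-averaged precision per feature
# with Sklearn, then average the four scores.
import numpy as np
from sklearn.metrics import precision_score

def map_score(probabilities, true_one_hot):
    # probabilities / true_one_hot: dicts of (n_samples, n_classes) arrays
    scores = []
    for feature in ("build", "eye", "hair", "race"):
        probs = probabilities[feature]
        preds = np.zeros(probs.shape, dtype=int)
        preds[np.arange(len(probs)), probs.argmax(axis=1)] = 1  # top class -> 1
        scores.append(precision_score(true_one_hot[feature], preds,
                                      average="macro"))
    return sum(scores) / len(scores)
```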

4.3 Experiment

An app was built for the experiment, which was distributed for people to fill in. The implementation was achieved using Python, Bootstrap/HTML, Flask, Gunicorn and Nginx; an overview can be seen in figure 6. The first step was creating the app with HTML and Bootstrap. The decision was made that an online app would be best to reach as many people as possible. Bootstrap allows for creating apps that look good on mobile as well, which reduces the effort for the user, thus possibly resulting in more users. The app consisted of three pages, which can be seen in figure 5. The first page had an explanation of the application along with instructions. The bottom had an example image along with a window to upload the user's picture and the option to choose the gender. Manually choosing the gender was important, as this had to be 100% correct to get results that were relevant to the user. Afterwards, the user could submit the file and gender, which redirected him or her to the next page. On this page, 15 images of celebrities were showcased as clothing recommendations. These images were the first five images of each celebrity as crawled by the Flickr API, which resulted in some out-of-place pictures. The user could choose as many of the images as he or she liked based on the clothing and submit those results. The last page was a thank-you screen, which also confirmed that the results were submitted.

Flask was used for the functionality of the app. Flask is a framework built on top of Python that forms the link between HTML pages and Python code. The first functionality was preparing the user input and feeding it to the model to get the predictions. The user input was passed from the HTML page with AJAX to the Flask app on submit. The image was read using the PIL module, converted to RGB and resized and cropped the same way as the training data. Subsequently, the image was passed to the model and based on those predictions the cosine similarity was calculated for the given gender. This resulted in a score for each celebrity and an ordered list with the most similar celebrities.


Figure 7: Overall MSE value for each model for the real-valued features.

Figure 8: Overall MAP value for each model for the categorical features.

The list sometimes contained more than three celebrities with a top score, meaning that the alphabetical ordering played a role, which was resolved by shuffling the top results. The results were passed back along with the second page, which rendered the 15 images as mentioned in the previous paragraph. The user picked the pictures that he or she liked and those results were passed to Flask. The data was a list of 15 Boolean values with a true value for each picture that was liked by the user. This list was stored in a SQLAlchemy instance, which is a framework for SQL databases for Flask. SQLAlchemy was used as it is easy to set up, which saved time in the overall process. The last page was returned after storing the results, which gave confirmation of a successful process.
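A minimal sketch of the Flask prediction route is given below; it assumes the trained Keras model, a per-gender celebrity feature dictionary and the top_matches helper from the earlier sketches are in scope, and the route, form fields and template name are illustrative.

```python
# Sketch of the Flask route that links the HTML page to the model: preprocess
# the uploaded picture, predict features, rank celebrities by cosine
# similarity and render the recommendation page.
from flask import Flask, request, render_template
from PIL import Image
import numpy as np

app = Flask(__name__)

@app.route("/recommend", methods=["POST"])
def recommend():
    img = Image.open(request.files["picture"]).convert("RGB")
    img = img.resize((224, 224))                 # simplified preprocessing
    preds = model.predict(np.asarray(img, dtype="float32")[None])
    user_vec = np.concatenate([p[0] for p in preds[:4]])  # categorical heads
    gender = request.form["gender"]
    celebs = top_matches(user_vec, celeb_features[gender])  # cosine ranking
    return render_template("results.html", celebs=celebs)
```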

The app was self-hosted on a local machine. Hosting required a WSGI HTTP server that handled serving the app and a system that handled incoming requests from outside. The former was achieved with Gunicorn, while the latter was done with Nginx. Gunicorn handled the communication with the Flask app, which meant requesting pages, submitting requests and handling the responses from the Flask app. It was easy to implement and lightweight on resources, which saved time in implementation as well. Nginx was used on top of Gunicorn to handle the requests from the public. It can serve static web pages or communicate with Gunicorn through proxy requests for the Flask functionality that is required. Nginx is lightweight and can handle a multitude of requests at the same time, making it easy to use. In addition, it was easy to implement, similar to Gunicorn, thus saving time.

The distribution was done via social media and through personal communication. This means that the results are biased and not representative of a larger and more diverse group. The results are even more experimental and only give a slight indication of how the system could perform, which should always be kept in mind when reviewing the results.

5. RESULTS

This section contains the results of the aforementioned evaluation metrics, along with some experimentation done while training the models. First, the training of the different models is considered and afterwards the results of the application and the user in the loop are examined. An important note is that model 1 is the model with the resized and cropped images, while models 2 and 3 are respectively the model with the body pictures and the model with the body and face pictures.

Figure 9: MAP values for each category: (a) Build, (b) Eye, (c) Hair, (d) Race.

Figure 10: MAP values with different learning rates for the first model after the initial 40 epochs.

Figure 7 shows the MSE scores for each model over the 40 epochs. The graph shows that the models do not differ much in their eventual MSE scores. Furthermore, the lines stabilize rather fast, as they started off at 0.021 and stopped improving at around 0.013. Further inspection shows that the models did not learn from the images, as the predictions defaulted to 0, thus preferring celebrities that are short, have a low weight or a low BMI, which is the bulk of the data. For this reason, the real values were not considered in the app.

The different scores for MAP are shown in figures 8, 9 and 10. Model 1 shows the best training of the three after 40 epochs, as seen in figure 8. Model 1 went from a score of 0.2473 to 0.3914 at epoch 39. The scores seem to decrease with increasing complexity in this context, which is interesting, as model 3 is almost twice as large as the other models while achieving only a max score of 0.3209. In addition, the three models seem to stabilize somewhat at this stage, but further training could show whether this assumption is true. Figure 9 shows the different MAP scores for each category. There is a clear division between two groups. The build and race categories perform above average, with max scores of 0.4646 for the former and 0.4191 for the latter, while the eye and hair categories underperform with scores of 0.3460 and 0.3546 respectively. Interesting to note is that the strong performing categories are full-body categories, while the underperforming categories are specific to the face.


          Image 1   Image 2   Image 3   Image 4   Image 5   Total
Celeb 1   7         8         6         9         6         36
Celeb 2   4         7         3         4         9         27
Celeb 3   2         10        8         6         5         31
Total     13        25        17        19        20        94

Table 6: Amount of pictures liked per image and celebrity. The total amount of pictures shown to respondents was 405, thus each cell can have a maximum of 27 likes.

Furthermore, model 3 had a specific face image attached to the model, but the scores did not increase relative to the other models, which is unexpected. Model 1 performed best for each category, while the most complex model performed worst, similar to the overall MAP values. Experimentation was done with the learning rate after the initial 40 epochs on the first model, which can be seen in figure 10. The scores did not differ much after changing the learning rate, and the initial learning rate even scored the highest after 20 additional epochs, with a score of 0.4154 at epoch 5 (or 45 overall), which is slightly higher than the score after the first 40 epochs. This epoch was also used for the app predictions as it had the highest MAP score.

             Image 1  Image 2  Image 3  Image 4  Image 5  Total
    Celeb 1     7        8        6        9        6       36
    Celeb 2     4        7        3        4        9       27
    Celeb 3     2       10        8        6        5       31
    Total      13       25       17       19       20       94

Table 6: Number of pictures liked per image and celebrity. A total of 405 pictures was shown to the respondents, so each cell can hold at most 27 likes.

The user-in-the-loop results can be seen in table 6. The total number of respondents was 27, which results in a total of 405 shown pictures and 27 pictures per cell. 94 pictures were liked by the users, meaning that 23.21% of all shown pictures were liked. The celebrities were ordered by similarity, so the first celebrity would be expected to receive the most likes and the last the fewest. The table does not bear this out, however: the likes are divided quite evenly. The images were likewise sorted by relevance, so the first image should be the best and the last the worst. The table again shows something different, as the likes are distributed seemingly at random and the first picture even received the fewest likes. All in all, the ordering seems to have mattered for neither the celebrities nor the images.
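The ordering under evaluation comes from the similarity step of the system. As a hedged sketch of how such an ordering can be produced (the function names and the cosine-similarity choice are illustrative assumptions, not necessarily the thesis implementation):

    import numpy as np

    def recommend(user_vec, celeb_vecs, celeb_images, n_celebs=3, n_images=5):
        # Rank celebrities by cosine similarity between attribute vectors,
        # then return the first n_images pictures of the closest matches.
        sims = celeb_vecs @ user_vec / (
            np.linalg.norm(celeb_vecs, axis=1) * np.linalg.norm(user_vec) + 1e-9)
        top = np.argsort(-sims)[:n_celebs]
        return {idx: celeb_images[idx][:n_images] for idx in top}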

6. CONCLUSION AND FUTURE WORK

The main goal of this thesis was to explore clothing recommendation systems based on body features. The research focused on creating such a system and evaluating how well it works in deployment. The system consists of a model that predicts the user's features and uses those features to recommend clothing. The model showed that it learned the categorical features from the data, but not the real-valued ones. This means the model is capable of learning from the crawled images, but adjustments are needed to learn, for example, height and weight. The user-in-the-loop evaluation was used to assess the whole system: roughly one out of four pictures was liked by the respondents, which suggests that the system works to some degree given the context. This work does have some shortcomings, however, and future research is needed to resolve them before the suggestion that the system works can be validated. These shortcomings, along with possible solutions, are outlined in the coming paragraphs.

This thesis functions as a starting point for exploring recommendation systems based on body features. The experimental nature of this exploration, together with the time constraints, means that the research has some flaws and leaves some areas unexplored; these can be taken up in future research, either as new work or as extensions of this thesis. This holds for every stage of the research, and the possibilities are discussed below in the same order as the thesis.

The data preparation can be improved in a multitude of ways. In this thesis, celebrities were used as the source for both the data and the ground truths. Celebrities do not represent the true population, however, especially considering that they were taken from a website focusing on healthy people. Future research could therefore look at other data sources, for example assembling a large and diverse group of users who do represent the true population and who are willing to provide photos and their ground truths. This would combat the class imbalance that occurs in this research and could improve the training of the model as well. The data itself could also be reconsidered: future research could draw on sources other than Flickr or use more specific queries to gather data, while also validating the data manually or with methods different from those used in this thesis. This could improve the learning of the model and produce better results. Pre-processing the data for the model is another point to consider, as it was limited in this thesis to resizing and cropping or adding a black border (see the sketch below). Examples are using different methods to extract bodies, or even extracting specific body parts such as eyes or skin patches to use as input for training a model.
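For concreteness, the resize-with-black-border step can be sketched as follows. This is a minimal Pillow implementation under the assumption of a square 224 x 224 network input; the thesis's actual preprocessing code may differ:

    from PIL import Image

    def letterbox(path, size=224):
        # Shrink the image to fit inside a size x size square, keeping the
        # aspect ratio, then paste it centred onto a black canvas.
        img = Image.open(path).convert("RGB")
        img.thumbnail((size, size))
        canvas = Image.new("RGB", (size, size), (0, 0, 0))
        canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
        return canvas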

Future research could also explore the training of the model further. Time constraints meant that training was cut short and experimentation was limited. Experimentation could be done with the learning rate, for example by trying different numbers of epochs and different rates, or by replacing the fixed rate with a dynamic learning-rate schedule (a sketch follows this paragraph). The architecture could be changed as well: the base model could be swapped for a heavier one, and output layers could be changed or added. Training a model from scratch that is attuned to the context is also something to consider. Novel state-of-the-art methods could be incorporated too; attentive models are an example, as they teach a model to focus on specific parts of a picture when learning features. In this thesis, the eyes and hair were the worst performing categories, and an attentive model that focuses on the eyes and hair in a picture could improve those results.
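One common way to make the learning rate dynamic, assuming a Keras-style training setup (the monitored quantity, factor, and patience below are illustrative choices, not values from the thesis):

    import tensorflow as tf

    # Halve the learning rate whenever the validation loss stops improving
    # for three epochs, down to a floor of 1e-6.
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6)

    # model.fit(train_x, train_y, validation_data=(val_x, val_y),
    #           epochs=60, callbacks=[reduce_lr])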

The evaluation can also be improved in a few ways. This thesis had only 27 respondents, which is far too few to make strong claims. Furthermore, these respondents came from social media or personal relations and do not represent the true population. A survey group should be gathered that is representative and large enough to evaluate the system and support strong claims about the results. The recommendations themselves can also be improved. The current recommendations are the first five pictures of each celebrity, which may contain neither clothing nor even the celebrity. A remedy would be manually picking popular pictures of the celebrities that do contain clothing pieces. A better option would be extracting clothing pieces from these pictures and recommending those to the user, optimally by linking them to real items in online stores to make direct recommendations. Such a system could be deployed next to an existing recommendation system and possibly improve it, resulting in an increase in revenue, which is interesting for web shops.


