

predicting image appreciation
with convolutional neural networks

joop pascha

case study of: food images on instagram

Bachelor thesis of: Artificial Intelligence
European Credits: 18
Student Identification: 10090614
Supervisor: Mr. Drs. Efstratios Gavves, Assistant Professor, Institute of Informatics, University of Amsterdam
Co-supervisor: Mr. Kirill Gavrilyuk, University of Amsterdam
Faculty of Science, Science Park 904, 1098 XH Amsterdam
June 2016


Joop Pascha: Predicting Image Appreciation with Convolutional Neural Networks, Case study of: Food Images on Instagram, © June 2016. E-mail:


ABSTRACT

In this study an attempt is made to model human image-appreciation using Convolutional Neural Networks (CNNs). Unlike questioning people directly, such a model could provide valuable insights into what people like about images in objective and quantifiable terms. Previous work has focused primarily on predicting text and video popularity based on social cues, as image content is significantly harder to extract and correlates with social popularity. To fill this gap, this study focused only on the effects that image-dependent features (colors, textures and composition) have on image-appreciation. Food images were extracted from Instagram with their corresponding number of likes as an image-appreciation metric. The reference model AlexNet, available in the deep learning framework Caffe, was then adapted to fit a regression problem and fine-tuned on this data. A Euclidean loss function was used as the minimization objective for the predicted and observed number of likes. This resulted in a correlation score of 0.25, indicating that image-appreciation could be modeled moderately well using only image-dependent features. Increased contrast, brightness and sharpness, as well as specific colors and filters, were associated with a higher number of likes. More findings were reported, but these are presumed to be tightly coupled to food images in particular. In future work, more image categories could be included to investigate whether the findings generalize to images in general.

keywords: image appreciation, convolutional neural networks, modeling, food, Instagram, social media, computer vision.


ACKNOWLEDGEMENTS

I would like to thank everyone that helped me come this far in my educational journey, including all teachers, family and friends that supported me along the way.

You know who you are, Joop Pascha


CONTENTS

1 introduction
  1.1 research questions
2 theoretical framework
  2.1 social media
  2.2 instagram
    2.2.1 related work
    2.2.2 prior observed phenomena
  2.3 convolutional neural networks
    2.3.1 model analysis
    2.3.2 caffe framework
3 research methods
  3.1 food data set collection
    3.1.1 initial data retrieval & analysis
    3.1.2 final data retrieval
  3.2 model selection & adaptation
    3.2.1 task formulation
    3.2.2 model training
    3.2.3 alexnet
    3.2.4 vgg-19
  3.3 model evaluation methods
    3.3.1 prediction scores
    3.3.2 manual top/middle/bottom analysis
    3.3.3 quantified top/middle/bottom analysis
    3.3.4 model scene variation scores
4 results
  4.1 data
    4.1.1 distributions
    4.1.2 purity
  4.2 model evaluation
    4.2.1 prediction scores
    4.2.2 manual bottom/middle/top analysis
    4.2.3 quantified bottom/middle/top analysis
    4.2.4 model scene variation scores
5 discussion
  5.1 future work
bibliography
appendix
  b code
  c used tags
  d tables
  e figures
indices
  figures
  tables


1 INTRODUCTION

Social media’s recent surge in popularity has increased the desire for individuals and companies to improve their usage of these platforms (Kaplan & Haenlein, 2010; Mangold & Faulds, 2009). While the social aspects responsible for text and video popularity have been extensively studied, image content has been mostly neglected. A possible reason for this is that predicting image-popularity on image-content alone is significantly harder due to its high interconnectedness with social influences (Khosla, Das Sarma, & Hamid, 2014). In an ongoing effort to find general image properties that affect image popularity, this study attempts to quantify the image-dependent features (colors, textures and composition) affecting the ’like’-ability of images by using Convolutional Neural Networks (CNNs). As CNNs rely on mechanics similar to those found in our brains and rival human performance in many visualisation tasks (Nguyen, Yosinski, & Clune, 2015), the hypothesis is that these algorithms can model human image-appreciation given enough images as input with a metric of appreciation as target output. In contrast to humans, this model could then be used to systematically extract valuable patterns that are responsible for image-appreciation in quantitative terms.

Until recently, the large amounts of data and time needed to train a CNN from scratch were the main barriers to applying CNNs to many problem domains. However, with the introduction of various deep learning frameworks, e.g. Caffe, fine-tuning pre-trained state-of-the-art models, which requires far fewer resources, has become an option. This is mainly possible due to the remarkable property of CNNs to learn similar features in the first few parameter-heavy layers across seemingly very different problem domains (Yosinski, Clune, Bengio, & Lipson, 2014). Moreover, the number of available pre-trained models has increased since the introduction of the yearly ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2010, where the inclusion of implementation details of algorithm submissions was strongly encouraged. As a result, many of the top performing models have been adapted to a variety of advanced deep learning frameworks. However, it was not until the introduction of the CNN AlexNet in 2012 (Krizhevsky, Sutskever, & Hinton, 2012) that CNNs became the leading algorithm in many visualisation tasks and made their way into deep learning frameworks (Russakovsky et al., 2015). With the introduction of Caffe in 2014, a wide variety of state-of-the-art learning algorithms and reference models, including AlexNet, became available (Jia et al., 2014). Therefore, in this study Caffe is used in combination with AlexNet to overcome these barriers.

Instagram, the 4th most popular social media service (Duggan, Ellison, Lampe, Lenhart, & Madden, 2015), could provide the combination of images and an appreciation metric needed to model human image-appreciation. Since its launch in October 2010, it has acquired 400+ million active users that upload 80+ million images daily (Press Page • Instagram, n.d.). Users of the platform can, among other things, ’like’ these images, resulting in a considerable amount of data on user image interactions. Interestingly, the user group of social media is in general very diverse, showing little bias towards gender, living area, or ethnic or racial groups (Perrin, 2015). Consequently, conclusions drawn on the basis of this data could be generalized to a large portion of the human population. Although limited information was available about the diversity of Instagram users in particular, this was considered a good starting point. In addition, Instagram’s ability to search for images that contain specific hashtags favors the retrieval of a well-contained data set where image-dependent features are presumably more dominant. Food was selected as such an image category for a variety of reasons. First, it is expected to have limited bias towards groups of people because of its wide availability. Second, the image subject is often centered in the frame, excluding other factors for attracting likes. Lastly, the variety of colors, textures and compositions found in food images is assumed to be ideal for CNNs to extract features from.

1.1 research questions

To investigate the possibility for CNNs to model human image appreciation, the following research question was posed:

Can Convolutional Neural Networks model human image-appreciation solely based on image-dependent features?

This was further narrowed down into the following sub-questions:

RQ1: How can a data set be obtained where image-dependent features are central, yet can also be generalized to larger groups of people?

RQ2: Can Convolutional Neural Networks accurately predict the number of likes using only image-dependent features?

RQ3: What are the most important image-dependent features that contribute to image-appreciation?


2 THEORETICAL FRAMEWORK

Instagram is only a part of the wider social media landscape and human population; it is therefore necessary to understand the relationship between the Instagram-using population and the rest of the world. In order to obtain results that can be generalised to the whole population, the data set should include a balanced representation of the human population. To make well-informed decisions about which data to include, a literature review was conducted with special focus on understanding how user-groups and other image-independent factors affect the ’like’-ability of images. The benefits are twofold. First, it helps to simplify the problem for Convolutional Neural Networks (CNNs) when the target feature, the number of likes, has the same relational interpretation across the whole data set. For example, famous people have more followers and as a result they gain more likes than the average person. However, this does not imply that their photos are more ’like’-able. Instead, it has the unwanted effect of emphasizing the relationship users have with their followers, rather than the images themselves, as the source of their respective number of likes. Secondly, by focusing on the most common relation that followers have to the images they like, the results could potentially be generalised to a bigger part of the population.

The literature review is broken up into three parts. First, a definition of social media is given in combination with a framework to further divide social media into different categories (see 2.1). Next, research with similar objectives is discussed to give an overview of what has already been achieved and which methods were applied. In addition, studies about Instagram that describe user-groups and other image-independent factors that could potentially affect the number of likes an image gets are discussed (see 2.2). These two parts will be significant for understanding the large data impurity found during the retrieval of data, as well as for making well-informed decisions about which data to include. Lastly, a quick introduction to CNNs is given to support their usage here (see 2.3).

2.1 social media

Social media is a relatively new phenomenon and has changed rapidly in recent years. The term was introduced when the social networking sites MySpace and Facebook were created in 2003 and 2004 respectively. This characterized a shift from weblogging to today’s social networking sites, which later became known as social media. This shift was mainly feasible due to the growing availability of high-speed Internet access. Maha (2015) has given a formal definition of what social media are: a group of internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of User Generated Content (UGC). Web 2.0 refers to the technological and ideological foundation where content and applications are no longer created or published by individuals; instead, they are continuously modified by a group in collaborative fashion. This was made possible by technologies such as Adobe Flash and Really Simple Syndication. UGC is defined as publicly available creative content that is created by amateur end-users.

Short, Williams, and Christie (as cited in Maha, 2015) divide media into categories according to their degree of social presence and self-disclosure. The presence theory states that media differ in their degree of "social presence", which is influenced by intimacy (closeness of contact) and immediacy (how direct the contact of the medium is). The higher the social presence, the larger the social influence the communication partners have on each other’s behaviour. Self-disclosure can be defined as the conscious or unconscious revelation of personal information that is consistent with the image one would like to convey about oneself (Goffman, as cited in Maha, 2015). Maha (2015) categorizes social networking sites as having high self-representation and medium social presence. It can be argued that Instagram belongs to this category according to this definition, because it contains rich visual content of nearby surroundings shared immediately with acquaintances and relatives. These two concepts seem particularly connected to marketing and self-promotion because of their ability to affect people’s behaviour.

Many businesses have made social media one of their top agenda items, widening the application of social media (Cristofaro, Friedman, Jourjon, Kaafar, & Shafiq, 2014; Kaplan & Haenlein, 2010). Despite their content not falling under the UGC definition of social media due to its collective nature, businesses have become a major part of many social media. It is even estimated that a ’like’ on Facebook is worth between $3.60 and $214.18 of revenue (Cristofaro et al., 2014), giving an indication of its monetary applications. On the other hand, it seems no longer possible for companies to ignore social media, as their public image is increasingly shaped by the democratized view of users on these platforms (Kietzmann, Hermkens, McCarthy, & Silvestre, 2011). This at least indicates that the users of social media could be very diverse, resulting in a heterogeneous relation between its users and content.

The user-groups of social media seem to be very diverse and relatively uniform with respect to the whole population. Perrin (2015) conducted a comparative study of American surveys to gain insight into the social media usage rates of different demographic groups. No significant differences were found for gender, ethnic groups or residential area. The biggest difference in usage rates was found between age groups: younger people had higher usage rates, but this difference has decreased significantly in recent years. In addition, people with higher educational levels and income were identified as the most active social media groups. Despite these differences, it seems that social media represent a relatively large part of the human population. However, major differences could still reside in behavioural differences between various user-groups and specific social media sites.

2.2 instagram

Instagram launched in October 2010 and has over the years accumulated an active user base of 400 million users with an average of 80 million daily image uploads (Press Page • Instagram, n.d.). When people sign up, their account is public by default, giving everyone around the world access to their posts. Users can upload images with optional text or tags describing them. These are collectively called posts and form the foundation of Instagram. A commonly used feature is the ability to add filters to images that change a variety of components, including contrast, brightness, saturation and colors, resulting in a different appearance.

Users of the service can follow other users, who are referred to as followers from the perspective of the user being followed. The users a particular user is following are collectively referred to as followings. The relation between followers and followings is asymmetrical, meaning that a user can follow someone without being followed back. However, if a user changes their privacy setting to private, all posts become viewable only by the followers they manually accepted.

Content on Instagram can be viewed in two ways. The most used view is the user’s time-line (Hu, Manikonda, Kambhampati, et al., 2014), where the images of followings are prominently visible in Instagram’s iconic squared image dimensions. Another way is to search on a specific tag to find the most recent posts that contain this tag. A user can ’like’ an image by pressing a heart icon below each image. This mechanic is used by Instagram to highlight the most loved/liked images when searching for a specific hashtag.

With the basic Instagram terminology explained, closely related studies will be discussed next. In the related work section (2.2.1) emphasis is put on research with similar methodology and research questions, which the research methodology of this study uses and builds upon. Subsequently, in the section named prior observed phenomena (2.2.2) a more in-depth look is taken at the investigated phenomena on Instagram. These findings will be used in the extraction of data and give more context for the discussion at the end (5).

2.2.1 related work

Despite Instagram’s popularity, only a few studies have been conducted on it compared to other popular social-media sites (Hu et al., 2014). Hu et al. (2014) investigated the most popular image categories, how users differ based on the types of images they post, and how these differences relate to the accumulation of followers. For their data, the most recent images from Instagram’s public time-line were extracted. These were then refined by excluding organizations, brands or spammers, and users that had fewer than 30 friends or followers and 60 posts. Only 14.6% of the users satisfied this selection. At the same time they found that 9.4% of the users changed their privacy settings to private during the retrieval of data. Eight major types of image categories were found: friends, food, gadgets, captions, pets, activities, self-portraits and fashion. Users were then clustered into distinctive groups based on the frequency of images they posted belonging to each of these categories. This resulted in five groups showing very different interests in the posed image categories. One group showed special interest in food and had the most diverse interests of all groups. This group included 11.22% of the users and was the second largest, second only to a group that posted mostly self-portraits (selfies) or images from friends (22.44%). Interestingly, no significant relationship was found between the number of followers and user-type.

Kalayeh, Seifu, LaLanne, and Shah (2015) studied the popularity of selfies by looking at how different filters and attributes affected image popularity. For their data, only images containing the ’selfie’-hashtag were extracted, of which 55% satisfied their definition of a self-portrait. Effects of filters on image popularity were studied by applying seven different Instagram filters to 10k images to investigate their effect on the model-predicted popularity score. A combination of feature vectors from different machine learning algorithms was used to make this prediction, including a CNN trained on ImageNet. By extracting the feature vectors of the last fully connected layer of the CNN, a 0.41 correlation between the observed and predicted popularity score was achieved. This demonstrated that CNNs can detect useful features to predict image-popularity. The effects of filters were shown to be image-dependent, as they showed no overall trend. However, 76% of the images on Instagram are already processed (Souza Araujo et al., 2014), which makes this experiment mostly about how multiple filters affect image popularity. Image attributes (e.g. female, baby, child) and objects (e.g. sunglasses, glasses) did have a significant effect on image popularity.

In his blog, Karpathy (2015) described a method for extracting features responsible for selfie popularity. For the data, five million images containing the ’selfie’-tag were extracted. This was further narrowed down to two million using a CNN classifier to remove any non-selfie image from the data set. The problem was framed as a classification problem where the images were divided into two groups: good selfies and bad selfies. To achieve this, the data was sorted by ascending number of followers and cut into groups of 100 images, of which the top 50% in ’likes’ were labeled good selfies as opposed to bad selfies for the other half, essentially limiting the effect that a higher number of followers can have on the attraction of likes. Images that contained females were found to be more popular than those with men. More findings were reported, but these were possible artifacts of improper use of center cropping: most images among the top model predictions contained artificially added borders around the subject, which does not particularly seem an indicator of good images but rather puts emphasis on the subject by centering it within the image.
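The binning procedure above can be sketched in a few lines of Python. This is a reconstruction from Karpathy's description, not his actual code; the function name, tuple format and bin size are illustrative assumptions:

```python
# Sort posts by follower count, cut into bins, and within each bin label
# the top half by likes as "good" (1) and the rest as "bad" (0). This
# normalizes away the effect that more followers attract more likes.

def label_by_follower_bins(posts, bin_size=100):
    """posts: list of (followers, likes) -> list of (followers, likes, label)."""
    ranked = sorted(posts, key=lambda p: p[0])  # ascending follower count
    labeled = []
    for start in range(0, len(ranked), bin_size):
        bin_posts = ranked[start:start + bin_size]
        # the median number of likes within the bin separates good from bad
        likes_sorted = sorted(p[1] for p in bin_posts)
        median = likes_sorted[len(likes_sorted) // 2]
        for followers, likes in bin_posts:
            labeled.append((followers, likes, 1 if likes >= median else 0))
    return labeled
```

Because labels are assigned relative to a bin of accounts with a similar audience size, a modestly followed account with strong images can still receive the "good" label.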

Based on the three studies above, the following conclusions can be drawn. First, the number of followers was found to be important for the accumulation of likes, as it was used by Hu et al. (2014) to make a user selection and by Karpathy (2015) to group users according to their number of likes. Second, a refinement of the data is necessary when the hashtag search mechanism is used, considering that a significant part of the data was mistagged (Hu et al., 2014; Karpathy, 2015).

2.2.2 prior observed phenomena

Khosla et al. (2014) studied how image content and social context influenced image popularity. Image popularity was defined by the number of times an image was viewed. For the image content, they extracted low-level image-dependent features that captured color patches, gradients and texture, much like CNNs do, and combined them with social cues. Interestingly, these image-dependent features had more predictive value for image popularity than social cues when images from the same user were compared (0.40 versus 0.21 rank correlation). However, when images between users were compared, social cues were found to be significantly more important for the prediction of popularity (0.36 versus 0.77). Another finding was that semantically more meaningful structures, e.g. faces in contrast to landscapes, correspond with increased popularity. The same phenomenon was observed for images that contained faces, which were more likely to receive likes and on average gained 14% more likes (Bakhshi, Shamma, & Gilbert, 2014; Khosla et al., 2014). An analysis of color, performed by clustering the 16.8 million available RGB colors into 50 distinct color hues, showed a rank correlation between 0.12 and 0.23 with image popularity for different data sets. Green- and blue-ish colors were found to be less important than the more reddish colors. Simple features such as hue and saturation were shown to have no significant impact on image popularity.
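Khosla et al. clustered the full RGB space into 50 color groups; a simplified stand-in for that idea, binning pixel hues directly with the standard library's `colorsys` module, could look like the sketch below. The function name and bin count are illustrative, and hue binning is not the same as their clustering procedure:

```python
import colorsys

# Bucket pixel colors into a fixed number of hue bins and return a
# normalized histogram; such per-image color features can then be
# correlated with a popularity score.

def hue_histogram(pixels, bins=50):
    """pixels: iterable of (r, g, b) in 0..255 -> normalized hue histogram."""
    counts = [0] * bins
    total = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)  # h in [0, 1)
        counts[min(int(h * bins), bins - 1)] += 1
        total += 1
    return [c / total for c in counts]
```

For example, an image of pure red pixels puts all its mass in the first (red) hue bin, while green pixels land roughly a third of the way around the hue circle.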

Souza Araujo et al. (2014) pioneered the analysis of people’s interaction with Instagram images. The study focused on 1.2+ million publicly accessible photos and videos of both ordinary and popular users. First, a rich-get-richer phenomenon was observed, where images that received a higher number of likes increased the perceived like-ability of images. Second, 76% of the images were found to contain some level of processing, and it was suggested that the remainder of non-processed images were mostly posted by amateur smart-phone users. Lastly, posts with under five tags gained the highest number of likes, despite the expectation that searchability would improve as more tags were used. Other factors affecting the number of likes were age and possibly the time of posting. Araújo, Corrêa, da Silva, Prates, and Meira (2014) showed that over 50% of the images were posted at the weekend, with a relatively flat distribution of images throughout the day, showing only a decrease in the late evening hours. As a result, the time of posting could have a significant effect on the number of likes an image gets, since only the most recent posts appear at the top of the time-line. Age matters for differences in engagement activity (Jang, Han, Shih, & Lee, 2015), along with social media usage in general (Perrin, 2015). Teens (13-19) attract significantly more activity than adults (25-39), receiving significantly more comments and likes per photo (56.10 versus 40.3). Even though the subjects of posts differed significantly in general, the percentage of food-related posts between the groups remained the same (3.18% versus 3.13%). Perrin (2015) showed that while there is a big usage gap between young and old, the difference has decreased significantly over recent years. Lastly, services have become available where likes can be bought, collectively called like-farms. Although no literature was found confirming that this phenomenon occurs on Instagram specifically, it seems likely as there are many sites that offer this service for Instagram (e.g. igfamous, buzzoid).

2.3 convolutional neural networks

Today CNNs, with their ability to automatically extract key features, outperform all previously known methods in many visualization tasks. Since 2012, they have been leading the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), starting with AlexNet up to the current leader VGG-net (Krizhevsky et al., 2012; Russakovsky et al., 2015). For the first time in history these algorithms show the ability to compete with human image recognition capabilities in many tasks. Interestingly, Cadieu et al. (2014) have shown that CNNs rely on mechanisms for visualization tasks similar to those found in the inferior temporal cortex of primate brains, including humans. Despite the success of CNNs and their similarities with the way humans learn, it remains controversial whether they mimic the processes in the human brain in approximately the same way, as it was shown that white noise images were classified as specific objects with high confidence (Nguyen et al., 2015). This at the very least highlights fundamental differences in the way CNNs perceive images.


2.3.1 model analysis

A CNN consists of many different types of layers and hyper-parameter settings, which makes it difficult to understand what a trained model has learned (Zeiler & Fergus, 2014). The convolutional layer, which is central to CNNs, exploits the correlation of spatially-local image input by sliding convolutional filters of a specific size (w×h) over each R, G, B channel. The weights of the 2-dimensional filters are learnt during training and produce higher activations for specific low-level image features, e.g. edges and blotches of color. These activations are hard to interpret, especially since consecutive convolutional layers learn higher levels of abstraction, which no longer directly correspond to structures we perceive. As a result, evaluating what a CNN model has learned is a particularly difficult problem (Zeiler & Fergus, 2014).
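The sliding-filter mechanic described above can be made concrete with a minimal 'valid' convolution on a single channel (as in most CNN frameworks, this is technically a cross-correlation; the function name and example filter are illustrative, not from the thesis):

```python
# Slide a kh x kw filter over a 2-D image; each output value is the sum of
# element-wise products between the filter and the underlying image patch.

def conv2d_valid(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            out[y][x] = sum(
                image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw)
            )
    return out

# A 3x3 vertical-edge filter: it responds strongly where dark meets bright,
# the kind of low-level feature early convolutional layers tend to learn.
edge_filter = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]
```

On a uniform patch the edge filter's positive and negative weights cancel to zero, while a dark-to-bright transition produces a large positive activation.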

2.3.2 caffe framework

Training a CNN from scratch requires a lot of data and is time-consuming, which forms a problem for many tasks where the amount of data is limited. However, one key benefit of CNNs is that the first convolutional layers learn general features, while consecutive layers learn increasingly more task-specific features. This favors transfer learning, where trained models can be fine-tuned with relatively small data sets by only re-learning the last layers while freezing the remaining ones (Yosinski et al., 2014). In 2014, Caffe was introduced, a fully open-sourced framework that provides an easy way to train, test, deploy, and fine-tune (pre-existing) models (Jia et al., 2014). As the creators of Caffe strongly believe in reproducible research and the framework offers many state-of-the-art pre-trained reference models, it was used in this study.


3 RESEARCH METHODS

This study was broken up into three parts, corresponding with the research questions posed in the introduction (1.1): data collection; model selection and fine-tuning; and lastly an extensive model analysis to extract the properties of images that are responsible for image-appreciation. In short, the research questions were operationalised by the following methods:

O1: No data set existed that contained images with a corresponding like-ability measurement, so a new data set had to be created. A selection of images from Instagram was chosen that presumably relied more on image-dependent factors for their accumulation of likes. To find the criteria for this selection, the findings of the literature review were combined with the results of an initial data analysis in which the relation between user-groups and likes was analyzed (see 3.1.1). This knowledge was then used to create a data set where image-independent factors had minimal effect on the number of likes, to improve the capability of CNNs to model image-appreciation solely based on image-dependent features (see 3.1.2).

O2: To quantify how accurately a CNN model can simulate human image-appreciation, the problem was framed both as a supervised classification and as a regression problem. The accuracy of the classification problem could serve as a first indication of the difficulty of the problem, as it was simplified to only predicting whether an image was considered good or bad (similar to the method used by Karpathy (2015) in his blog). The number of likes predicted by a regression model trained on the data, where image-dependent features were central, was then compared with the observed number of likes by calculating the correlation between them. This served as the best score to answer how accurately the model can predict the number of likes.
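The evaluation in O2 comes down to a correlation between predicted and observed like counts, which can be computed without any dependencies. The helper below is an illustrative Pearson correlation, not code from the thesis:

```python
# Pearson correlation between two equally long sequences: covariance of the
# two variables divided by the product of their standard deviations.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A value of 1 means the predictions rank and scale perfectly with the observations; the 0.25 reported in the abstract indicates a weak-to-moderate linear relationship.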

O3: The next step was to comprehend what these models have learned about image-appreciation in humanly understandable terms. For this, a series of visualization methods was used (see 3.3). First, images from the top, middle and bottom of the regression model predictions were manually analysed for remarkable differences. Second, the group differences were quantified by scoring and comparing the differences in hue, sharpness, contrast and brightness. Next, a series of photos with variation in colors, textures and camera viewpoint was scored by the model to investigate their effect on image-appreciation.
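The quantified comparison in O3 needs per-image scores. A rough sketch for a grayscale image follows, using mean intensity for brightness, standard deviation for contrast, and the variance of a discrete Laplacian as a common sharpness proxy; the thesis's exact measures are not reproduced here and the function name is an assumption:

```python
# Compute brightness, contrast and a sharpness proxy for a grayscale image
# given as a 2-D list of 0..255 intensity values (at least 3x3).

def image_stats(gray):
    pixels = [p for row in gray for p in row]
    n = len(pixels)
    brightness = sum(pixels) / n                                        # mean intensity
    contrast = (sum((p - brightness) ** 2 for p in pixels) / n) ** 0.5  # std deviation
    # discrete Laplacian response at interior pixels
    lap = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            lap.append(gray[y - 1][x] + gray[y + 1][x]
                       + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x])
    m = sum(lap) / len(lap)
    sharpness = sum((v - m) ** 2 for v in lap) / len(lap)  # Laplacian variance
    return brightness, contrast, sharpness
```

Averaging these scores over the top, middle and bottom prediction groups makes their differences directly comparable; a perfectly flat image scores zero contrast and zero sharpness.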

3.1 food data set collection

Ideally, a data set would be obtained where the number of likes has the same relational interpretation to the corresponding images across the whole data set. If not, it can be expected that more emphasis is put on distinguishing images based on user type by learning how image content is related to it. As user type is not technically an image-dependent factor, this was considered noise in the data, which was filtered out as far as possible. For example, a universally recognised good image from a user that is mainly family-oriented is still expected to gain fewer likes than a bad image created by a celebrity. As a result, it can be expected that the algorithm will learn more about which images belong to specific user groups, rather than which general image features contribute to higher image appreciation. A solution could be to include meta-data for user-group normalization, but this was disregarded because it introduces two new problems. First, it would encourage learning specific features that are more important for different user-groups instead of focusing on features that are generally important. Secondly, it would make the model reliant on meta-data from Instagram, which makes it impossible to generalize outside the Instagram domain. Instead, the findings of the literature review were used in combination with an initial analysis of the distribution to minimise this effect and ultimately answer RQ1 (1.1).

3.1.1 initial data retrieval & analysis

For the extraction of the data, Instagram's hashtag-search capabilities were used to obtain the most recently posted food images. 10 food-related tags were selected (see appendix C.1) that covered a wide variety of food images while being popular enough to accumulate new images quickly (each hashtag used 5+ million times). Each hashtag was used the same number of times to obtain the data set. Instagram offers posts in two different sizes, of which the squared image is the default view (as opposed to the original dimensions of the uploaded image); therefore, the squared image was used in the collection of the data. To cover the most active period of user interaction with posts, the number of likes was updated exactly seven days after the upload date (F1). Interestingly, after two days of

activity, no significant differences were found in the accumulation of likes for popular and unpopular posts, so no bias was introduced towards either group. In addition, posts gained over 90% of their likes during the first two days of a 10-day period, indicating that seven days is enough to cover most of the user interactions with posts.



F1: Accumulation of likes during the first 10 days online. 1600 posts were followed during this period. No differences were found between popular and less popular posts after two days.

A sample of 200 images contained 30.5% non-food images, reaffirming the high levels of data impurity observed in other studies discussed in the theoretical framework (2). An image was

considered non-food if food was not the main subject in the frame or was covered up in packaging. The most frequent non-food images contained women, text or selfies (F2). After a closer examination of the meta-data, 9 specific

hashtags (appendix C.2) were identified that correlated highly with mislabeled images. As food is one of the most popular hashtags on Instagram, it was assumed that a significant portion of the users that mistagged their images did so on purpose to achieve better post searchability. Sites such as

tagsforlikes and top-hashtags even enable people to add popular hashtags to their posts. Fortunately, these sites add their own distinctive tag, making it easy to filter such posts out of the data completely. Even better results were obtained by also excluding posts that contained these hashtags as substrings of longer hashtags. As a result, the percentage of non-food images in the data was reduced to 16.5%. With these filters in place, the initial data set of 140k images was obtained.
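As a rough illustration, this filtering step can be sketched as a substring check over each post's hashtags. The tag names and post structure below are hypothetical; the study's actual blacklist was the 9 hashtags listed in appendix C.2.

```python
# Illustrative sketch of the hashtag filter (hypothetical tag names and
# post structure): drop posts whose hashtags contain a blacklisted tag,
# also when it appears as a substring of a longer hashtag.
BLACKLIST = {"tagsforlikes", "tophashtags"}  # illustrative entries only

def is_spam_post(hashtags):
    """True if any hashtag contains a blacklisted tag as a substring."""
    return any(bad in tag.lower() for tag in hashtags for bad in BLACKLIST)

posts = [
    {"id": 1, "hashtags": ["food", "yummy"]},
    {"id": 2, "hashtags": ["food", "tagsforlikesapp"]},  # substring match
]
clean = [p for p in posts if not is_spam_post(p["hashtags"])]
```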

F2: Examples of frequently appearing non-food image categories in the images that contained food-related tags.

To improve the purity of the data, a food/non-food classifier was obtained by adapting an AlexNet model trained on ImageNet data for binary classification (see 3.2). One reason why non-food images can be particularly harmful is

that they can contain semantically meaningful objects, e.g. faces, as was shown in the theoretical framework (2), which could act as noise relative to the remainder

of the food images. In a first attempt to improve the purity of the data, food images were taken from the food-101 data set (Bossard, Guillaumin, & Van Gool, 2014) with a selection of ImageNet categories as negative

examples. Despite achieving 97.1% prediction accuracy on the validation set after only 1 epoch, the domain of the Instagram images proved to be vastly



different, resulting in inadequate performance on the real data set. In a second attempt, the ability of CNNs to handle noisy data was exploited by labeling 41k images from the 16.5% impure data set as food. To collect negative examples from a domain similar to the one found in the data, hashtags that could describe the observed mistagged images were used to obtain a 41k non-food data set for a 50/50 split. This resulted in the removal of approximately 53.1% of the non-food images at 80.0% precision, with 86.6% accuracy on the validation set. As a result, the impurity of the data was further decreased from 16.5% to 8.9%.

T1: Correlation between meta-data, 117k data set.

            hashtags  posts     followers  following  likes   comments
hashtags    1         -0.0069   -0.0029    0.0042     -0.021  -0.044
posts                 1         -0.00023   -0.000085  0.040   0.031
followers                       1          0.00010    0.69    0.29
following                                  1          0.15    0.018
likes                                                 1       0.52
comments                                                      1
average     12.47     655       2394       607        76.8    2.5
median      10        289       341        277        28      1
std         9.65      1412      24811      7416       463.56  10.86

To obtain a data set in which the number of likes had the same relation across the data set, the relation of the number of likes to user type and to the number of followers was investigated. The distribution of likes was very skewed, with the top 12.1% of the posts attracting 60.1% of the total number of likes. This could mostly be attributed to user popularity, with a correlation of 0.82 between followers and likes in the 140k data set (appendix T5). However, the Pearson correlation is especially vulnerable to

outliers in the data. To gain a more realistic view of the correlation between followers and likes, the top 12.1% was excluded and the food classifier was applied, yielding a data set of 117k images. Interestingly, the correlation between followers and likes was still high at 0.69 (T1). The

distributions of likes per post (F3), followers per post and followers per user

all followed the same bell shape followed by a very long tail (appendix F36, F37). This was interpreted as a sign that the data was very unbalanced, as only a small portion of users gained the most likes near the end of the tail of the distribution.
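The outlier sensitivity that motivated excluding the top of the tail can be demonstrated on synthetic data. The numbers below are purely illustrative, not the study's data:

```python
# Illustrative sketch (synthetic numbers): a handful of extreme
# "celebrity" points can inflate the Pearson correlation of a
# heavy-tailed followers/likes sample.
import numpy as np

rng = np.random.default_rng(0)
followers = rng.integers(100, 1000, size=1000).astype(float)
likes = 0.1 * followers + rng.normal(0.0, 30.0, size=1000)  # weak, noisy relation

# Append a few outliers: huge follower counts with matching huge like counts.
followers_out = np.concatenate([followers, [1e5, 2e5, 5e5]])
likes_out = np.concatenate([likes, [1e4, 2e4, 5e4]])

r_plain = np.corrcoef(followers, likes)[0, 1]
r_outlier = np.corrcoef(followers_out, likes_out)[0, 1]
# r_outlier comes out far higher than r_plain, which is why the top of
# the tail was excluded before interpreting the correlation.
```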

F3: Initial distribution of likes, excluding the 12.1% most liked posts.

To understand why people with a higher number of followers gain more likes, other than possibly taking better images, user profiles in different parts of the followers distribution were analysed. This showed



very distinctive user profiles for people with a higher number of followers, e.g. food channels, businesses, models, vloggers and other types of celebrities, in contrast to users around the mode of the followers distribution, who were more friends and family oriented. It can therefore be assumed that a higher number of followers is a direct effect of having a higher social reach. It is then only logical that the relation users have with their followers is very different for these groups, making it impossible to compare their numbers of likes. Even naive approaches to minimise the differences, such as excluding all posts that gained over 99 likes or normalising by dividing the number of likes by the number of followers, showed no significant improvement and instead introduced a bias towards people with fewer followers (F4).

F4: Nonlinear relationship between the number of followers and the number of likes divided by the number of followers (Pearson r = -0.55, p = 0).

The most important conclusion from the analysis of followers was that the interpretation of the number of likes is very closely connected to the user group and the number of followers. As can be seen in the elbow-shaped joint distribution plot in F4, people with a higher number of followers have a lower ratio of likes to followers. This is to be expected, because people with a higher social reach have a more distant relationship with their followers, presumably resulting in a lower ratio of likes per follower. As the distribution of the number of likes is very skewed, this forms a major problem, because it puts major emphasis on the relatively small portion of images that gained a high number of likes. In addition, the rich-get-richer phenomenon discussed in the theoretical framework (2) would amplify this effect by putting even more weight on image-independent factors contributing to the number of likes. This demonstrated the need for a different way to obtain the data set.

Many of the discussed problems could be resolved by selecting a specific group of users that have approximately the same relation to their followers. To make this selection, two requirements were imposed. First, it should capture the mode of the likes and followers distributions, as can be seen in F3 and F36. Second,

within this range, image-independent factors should play a minimal role in the accumulation of likes, to capture the most frequent relation between images and followers. To satisfy the latter condition, a part of the data was



chosen where the correlation between followers and likes was low. It was expected that image-dependent factors would then play a more significant role in the accumulation of likes. To achieve this, the data was sorted in ascending order of followers, and the correlation and variance were calculated for windows of different sizes. A window of 20% of the data seemed to offer a good trade-off between a low correlation between followers and likes and still capturing a significant part of the top of the followers and likes distributions (F5, F6). Interestingly, this showed a parabolic shape

with a relatively flat bottom. This part was selected, corresponding with a range of 101-224 followers (between 17 and 37% in F5, F6).
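A minimal sketch of this window-selection procedure follows. The function and parameter names are my own; the study used a 20% window over posts sorted by follower count and inspected the correlation per window position.

```python
# Hedged sketch of the window-selection step (names are my own): sort
# posts by follower count, slide a window covering `frac` of the data,
# and keep the window with the lowest |correlation| between followers
# and likes.
import numpy as np

def lowest_correlation_window(followers, likes, frac=0.20):
    f = np.asarray(followers, float)
    l = np.asarray(likes, float)
    order = np.argsort(f)
    f, l = f[order], l[order]
    n = len(f)
    w = max(2, int(n * frac))
    best_start, best_r = 0, np.inf
    for start in range(0, n - w + 1, max(1, w // 10)):
        r = abs(np.corrcoef(f[start:start + w], l[start:start + w])[0, 1])
        if r < best_r:
            best_start, best_r = start, r
    # Return the follower range covered by the best window and its |r|.
    return f[best_start], f[best_start + w - 1], best_r

low, high, r_abs = lowest_correlation_window(
    np.arange(1000.0), np.random.default_rng(1).normal(size=1000))
```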

F5:Correlation between followers and likes in the distribution (sorted by the number of likes).

F6: Variance of the number of likes in the distribution (sorted by the number of likes; also see F5).

This data set contained a relatively equal distribution of followers (F7). This

is beneficial for CNNs, as it does not favor users with more or fewer followers. As this selection would reduce the amount of data by 80%, more data was retrieved, since more data was expected to be greatly beneficial for the final results.

F7: Distribution of followers is relatively constant in the range between 101 and 224 followers.

3.1.2 final data retrieval

For the collection of the final data set, mechanisms similar to those of the initial data extraction were used (see 3.1.1). The follower range of 101-224 was used, as justified in the initial data analysis (3.1.1). In addition, more effort was put into improving the food classifier for greater purity of the data.

pre-processing

The same food-related hashtags were used to search for food-containing images. This was then refined by only including users that had between



101-224 followers and did not use any of the nine specific hashtags that correlated highly with non-food images. A shorter period of three days was used before updating the number of likes, as this already covered over 90% of a post's activity over a ten-day period. As a result, the data set from the initial data retrieval was not used for the final analysis.

post-processing

The retrieved data set contained 93097 images and was further narrowed down by only allowing one image per user, resulting in 67009 images. This was necessary so as not to introduce a bias towards the most active users, as age was identified as an important factor for the number of likes, as discussed in the theoretical framework (2). In an attempt to

improve the purity of the data, the method of obtaining the food classifier was improved by using the current state-of-the-art model VGG-19, in combination with an improved food/non-food data set. For the food data set, the 117k images with 8.9% non-food content from the initial data set were used. The non-food data set was improved by manually selecting 171 of the 2500 most popular tags from top-hashtags.com (appendix C.3) to obtain a 117k non-food data set for a 50/50 split. To diminish the odds of these images containing food, images that contained a selection of the 85 most-used food-related tags were excluded. The classifier was then used to obtain the final data set of 58k images (for the purity of this data set, see section 4.1.2). The

hashtag filters, the user selection based on followers, and the classifier were together assumed to be a good solution to RQ1.

3.2

model selection & adaptation

The reference models in Caffe, AlexNet and VGG-19, which were trained on the ILSVRC ImageNet data, were adapted and fine-tuned to answer RQ2. As these models were trained to differentiate between 1000 different image categories, they were presumably well suited to capture the wide variety of colors, textures and compositions found in food images.

3.2.1 task formulation

The problem of modeling human image appreciation was framed both as a classification and as a regression task. For the regression task, the images were used as input with the number of likes as target output. The differences between the observed and predicted number of likes could then be used to partially answer RQ2. In addition, the model prediction scores could be analysed by seeing how they correspond to specific patterns in their corresponding images. Framing it as a classification problem gave a more general sense of the difficulty of the problem, as this greatly reduces its complexity. The images in the top and bottom 24% of the number of likes were labeled as good and bad, respectively; 24% corresponded with the mode of the distribution of the number of likes. The accuracy score could contribute to answering RQ2.
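The labeling step for the classification task can be sketched as follows (the helper name is hypothetical):

```python
# Sketch of the labeling step (helper name is hypothetical): the bottom
# and top 24% of posts by like count become class 0 ("bad") and class 1
# ("good"); the middle of the distribution is discarded.
import numpy as np

def label_extremes(likes, frac=0.24):
    likes = np.asarray(likes, float)
    lo, hi = np.quantile(likes, [frac, 1.0 - frac])
    labels = np.full(len(likes), -1)   # -1 marks the discarded middle
    labels[likes <= lo] = 0            # "bad" images
    labels[likes >= hi] = 1            # "good" images
    return labels

labels = label_extremes(np.arange(100))
```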



3.2.2 model training

The adapted CNNs were then trained on the final data set (3.1.2). After some consideration, the use of meta-data was excluded for better generalizability of the results (for a more in-depth explanation, see the end of section 3.1). For both models, all fully connected layers (fc6, fc7, fc8) were re-initialised and trained while the earlier layers remained unchanged, as this consistently showed near-optimal out-of-the-box performance and decreased training time. Most time was devoted to finding the best learning rate and weight decay settings. The models used, with their corresponding loss functions, will be discussed next.

3.2.3 alexnet

AlexNet has a competitive mixture of performance and computational requirements, which allows for quick fine-tuning; it was therefore preferred over the larger, more resource-intensive state-of-the-art models for modeling image appreciation. For the regression task, the number of output nodes of the last layer was changed to one, with the use of a EuclideanLoss layer. As a result, the network minimises the (scaled) sum of squared differences between the predicted and observed number of likes:

E = \frac{1}{2N} \sum_{n=1}^{N} \|\hat{y}_n - y_n\|_2^2    (1)
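Equation (1) can be checked numerically with a small sketch:

```python
# Numerical sketch of the EuclideanLoss objective in equation (1):
# E = 1/(2N) * sum_n ||y_hat_n - y_n||^2.
import numpy as np

def euclidean_loss(y_hat, y):
    y_hat = np.asarray(y_hat, float)
    y = np.asarray(y, float)
    return np.sum((y_hat - y) ** 2) / (2 * len(y))

loss = euclidean_loss([10.0, 20.0], [12.0, 26.0])  # (4 + 36) / (2 * 2) = 10.0
```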

For the binary classification, the SoftmaxWithLoss layer was used, which calculates the cross-entropy classification loss over the softmax output class probabilities \hat{p}:

E = -\frac{1}{2N} \sum_{n=1}^{N} \log(\hat{p}_{n,I_n})    (2)

3.2.4 vgg-19

The VGG-19 model is the current state-of-the-art model and was adapted to fit the binary classification task of the food classifier. As in the binary classification task with the AlexNet model, the SoftmaxWithLoss layer was used as loss layer (see equation 2).
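For reference, the softmax cross-entropy computation can be sketched as below. Note that this sketch uses the conventional per-sample mean (1/N); the scaling in equation (2) only rescales the loss value.

```python
# Sketch of the SoftmaxWithLoss computation for the binary classifier:
# softmax over the output logits, then the mean negative log-likelihood
# of the true class.
import numpy as np

def softmax_cross_entropy(logits, labels):
    logits = np.asarray(logits, float)              # shape (N, num_classes)
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    n = len(labels)
    return -np.mean(np.log(p[np.arange(n), labels]))

loss = softmax_cross_entropy([[0.0, 0.0]], [0])  # uniform p -> log(2)
```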

3.3

model evaluation methods

To investigate the most important features responsible for image appreciation and answer RQ3, multiple evaluation methods were applied to the regression model. First, a quantitative score of the performance of the model was obtained in terms of the correlation between the observed and predicted number of likes (see 3.3.1). Secondly, images from the bottom, middle and

top of the model prediction scores were manually analysed for patterns (see

3.3.2). Particular attention was paid to colors, textures and composition, as these are presumably used by CNNs to learn to distinguish images. This resulted in assumptions about the underlying differences between these



image categories, which were later tested by taking a series of images of the same scene with variation only in angles, colors and primitive filters, to see their effect on the predicted number of likes.

3.3.1 prediction scores

A quantitative measurement of the performance of the AlexNet model was achieved by obtaining the loss and accuracy of the model trained on the regression and classification task, respectively. Subsequently, the distributions of the observed and predicted number of likes were visualized and a statistical analysis was applied, giving a clear overall view of the performance of the model. To compare the performance of both models, a baseline was created by applying the SGDRegressor and SGDClassifier from the sklearn Python package on the 4096-D vector representations of the corresponding images from the fully connected layers fc6 and fc7.
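A hedged sketch of this baseline setup, with random stand-in features instead of the real 4096-D fc6/fc7 activations:

```python
# Hedged sketch of the linear baseline: sklearn's SGDRegressor on
# stand-in feature vectors (random here; the study used the 4096-D fc6
# and fc7 activations of the unchanged AlexNet reference model).
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))                    # stand-in for fc6 features
true_w = rng.normal(size=64)
y = X @ true_w + rng.normal(0.0, 0.1, size=500)   # synthetic "likes"

model = SGDRegressor(eta0=1e-6, max_iter=1000, random_state=0)
model.fit(X, y)
preds = model.predict(X)
```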

3.3.2 manual top/middle/bottom analysis

To understand which features the model has learned, a manual analysis was performed to see which patterns were more visible in different parts of the model predictions. In particular, 500 images were taken from the top, middle and bottom of the distribution of the model predictions and analysed for noticeable differences in hue, saturation, brightness, sharpness and subject framing (F8).

F8: Representative examples taken from the 500 images of the bottom, middle and top model predictions.

3.3.3 quantified top/middle/bottom analysis

To quantify the differences between the three groups, special attention was paid to differences in colors, brightness, contrast and sharpness. First, all R,G,B pixel triplets were clustered to 25 distinctive hues (F9). The frequencies of the hues were then visualised using barplots for each individual group, as well as the differences in color frequency between the groups.
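The hue-clustering step can be sketched as a nearest-palette assignment (a 4-color palette here instead of the study's 25 hues):

```python
# Sketch of the hue-clustering step: assign each RGB pixel to its
# nearest palette entry (4 reference colors here; the study used 25).
import numpy as np

PALETTE = np.array([[255, 0, 0],     # red
                    [0, 255, 0],     # green
                    [0, 0, 255],     # blue
                    [255, 255, 0]],  # yellow
                   dtype=float)

def hue_histogram(pixels):
    pixels = np.asarray(pixels, float)                         # (N, 3) RGB
    d = ((pixels[:, None, :] - PALETTE[None, :, :]) ** 2).sum(axis=-1)
    nearest = d.argmin(axis=1)
    return np.bincount(nearest, minlength=len(PALETTE))

hist = hue_histogram([[250, 5, 5], [0, 250, 10]])  # one red-ish, one green-ish
```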



F9: Hues that were used to cluster all individual pixels from the top/middle/bottom image groups.

The following formula from Finley (2006) was used to calculate the perceived brightness score of each image individually:

brightness = \frac{1}{\text{pixels}} \sum_{i=1}^{\text{pixels}} \sqrt{0.299 \, r_i^2 + 0.587 \, g_i^2 + 0.114 \, b_i^2}    (3)

After this, the distributions of these brightness scores were compared between all groups by visualisation and statistical analysis (t-test) to see whether the differences were significant. Similar comparison methods were used for the contrast and sharpness scores.
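Equation (3) translates directly into a few lines of numpy (assuming an (H, W, 3) RGB array):

```python
# Equation (3) in numpy: the Finley (2006) perceived-brightness score,
# averaged over all pixels of an (H, W, 3) RGB image.
import numpy as np

def perceived_brightness(image):
    img = np.asarray(image, float)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return float(np.mean(np.sqrt(0.299 * r**2 + 0.587 * g**2 + 0.114 * b**2)))

score = perceived_brightness(np.full((2, 2, 3), 255.0))  # pure white -> 255
```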

The contrast score was calculated by converting the image to grey scale and taking the mean squared deviation of each grey value from the mean grey value of that particular image:

contrast = \frac{1}{\text{pixels}} \sum_{i=1}^{\text{pixels}} (\text{pixel}[i] - \text{image\_average})^2    (4)

The variance of the Laplacian was calculated as a measurement of image sharpness (see appendix B.1).
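Both remaining scores can be sketched in plain numpy, assuming a 2-D greyscale array; the study's exact Laplacian implementation (see appendix B.1) may differ, e.g. OpenCV's cv2.Laplacian.

```python
# Sketch of the contrast and sharpness scores on a 2-D greyscale array.
# Contrast: mean squared deviation from the image's mean grey value
# (equation 4). Sharpness: variance of a Laplacian-filtered image,
# convolved here in plain numpy.
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)

def contrast(grey):
    grey = np.asarray(grey, float)
    return float(np.mean((grey - grey.mean()) ** 2))

def sharpness(grey):
    grey = np.asarray(grey, float)
    h, w = grey.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):                 # correlate with the 3x3 kernel
        for j in range(3):
            out += LAPLACIAN[i, j] * grey[i:i + h - 2, j:j + w - 2]
    return float(out.var())

flat = np.full((5, 5), 7.0)  # a flat image has zero contrast and sharpness
```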

3.3.4 model scene variation scores

Photos of the same scene with different camera angles, colors and textures were taken to see how these variations affect the number of likes predicted by the model (F10). In addition, the effects of filters on this series of images were investigated, as most images on Instagram already contain filters. This way, the assumptions about the model's inner workings were tested in semi-real-world situations. See table 7 in the appendix for all variations.



F10: Photo series with variations in camera angle, colors and filters that were used to see their corresponding effects on the predicted number of likes.


4

R E S U L T S

4.1

data

The final data set contained 58244 images from unique users that had between 101 and 224 followers.

4.1.1 distributions

In the data, significant correlations were found between followers and followings (0.21, F11), followings and likes (-0.11, F12), followers and likes (0.089, F13) and lastly posts and likes (-0.16, F14). Interestingly, users with more posts, or who followed more users, gained fewer likes. In addition, people with more followers also followed more people.

F11: Distribution of followers and number of users people are following (Pearson r = 0.21, p = 0).

F12: Distribution of number of users people are following and likes (Pearson r = -0.11, p = 1.9e-140).

F13: Distribution of followers and likes (Pearson r = 0.089, p = 1.2e-102).

F14: Distribution of posts and likes (Pearson r = -0.16, p = 2.5e-301).



The 10 different hashtags used to extract the images appeared with very different frequencies in the data after the pre- and post-filtering was applied (F15). Initially, all tags were used equally often for image retrieval.


F15:Number of retrieved images for each search tag used in the collection of the data.

Significant differences were found between the distributions of likes for the different hashtags (F16).


F16:Distributions of likes per hashtag.

4.1.2 purity

The final data set contained an estimated 97% food images, with a recall of 98.3% (T2). This resulted in a combined accuracy of 0.961 with an F1-score of 0.976.

T2: VGG-19 food/non-food classifier results (sample size = 500).

            food                    nonfood
food        56496.71 (97.0% TP)     985.648 (9.0% FN)
nonfood     1747.29 (3.0% FP)       10737.352 (91.0% TN)



4.2

model evaluation

4.2.1 prediction scores

regression

The regression model adapted from AlexNet achieved a 0.246 correlation between the observed and predicted number of likes (F17), with an MSE of 268.26 (T3). This was significantly better than always predicting the mean of the validation set: 283.60 (denoted as mse baseline in appendix T8). Training was stopped at 6 epochs with a learning rate of 0.000001 (appendix T8).

T3: Statistics of the regression model predictions.

    mse observed   mse baseline   std    avg misprediction   min    max
    268.26         283.60         4.72   11.66               7.25   41.56

F17: Distribution of the observed and the model's predicted number of likes (Pearson r = 0.25, p = 9e-161). Important: notice the different scale of the y-axis from the x-axis.

The distribution of the predicted number of likes was centered around the true mean (F17), with a significantly lower standard deviation (F18, F19).


F18: Distribution of the observed and the model's predicted number of likes (Pearson r = 0.25, p = 9e-161).

F19: Boxplot of the observed and the model's predicted number of likes.

sgd-regressor

For a baseline, the SGDRegressor from the Python sklearn library was used with the default parameter settings, except for the eta0 parameter, which was set to 0.000001. This was done for both fully connected layers, fc6 and fc7, using the vector representations of the images from the unchanged AlexNet reference model. This resulted in correlations of 0.178 and 0.130, respectively (T4), indicating that the fc6 feature vectors were more useful for this task.

The SGDRegressor scored significantly worse than the AlexNet regression model, and only with fc6 did it do marginally better than the mse baseline.

T4: Statistics of the SGD regression baseline predictions.

        mse observed   mse baseline   std    avg misprediction   min      max
fc6     278.46         283.60         4.97   12.16               5.2      52.41
fc7     314.24         283.60         8.07   12.96               -14.69   65.25

F20: Distribution of the observed and predicted number of likes (Pearson r = 0.18, p = 3.7e-84). Feature vectors extracted from fc6 of the unchanged AlexNet reference model with the default settings of the sklearn SGDRegressor.

F21: Distribution of the observed and predicted number of likes (Pearson r = 0.13, p = 2.6e-45). Feature vectors extracted from fc7 of the unchanged AlexNet reference model with the default settings of the sklearn SGDRegressor.



classification

The classification model adapted from AlexNet achieved an accuracy of 62.4% using the top and bottom 24% of the data. Training was stopped at eight epochs with a learning rate of 0.000001 (appendix T10).

Despite abandoning the idea of using the complete range of followers from the 117k data set (3.1.1), the classification accuracy was also calculated for the top and bottom 25% of that data (which corresponded with the number of images that appeared before the mode of the number of likes), producing an accuracy of 65.0%. This is significantly higher than the 62.4% based on the 58k data set but, as explained earlier, can be attributed to the model learning about the image-independent factor user type. Training was stopped at 17 epochs with a learning rate of 0.000001 (appendix T9).

sgd-classifier

As a baseline for the AlexNet classification model, the SGDClassifier from the Python sklearn library was used with the default parameter settings. With the fc6 and fc7 feature vectors, accuracies of 0.567 and 0.558, respectively, were obtained.

4.2.2 manual bottom/middle/top analysis

The following notable properties were found in the 500 images with the lowest prediction scores (for example images, see appendix F38):

• Hue: A remarkable amount of green and blue. A low variety of colors per image.

• Saturation and brightness: Low contrast and brightness, lacking whites and pure hue colors.

• Edges: Soft edges around the borders with large shadows.

• Content: Most food was presumably home-made, while a decent amount of images did not contain food.

• Camera focus: A significant amount of images was out of focus.

• Subject framing: Most subjects were cut off by the image frame.

• General feel: Images presumably contained a very low amount of staging, as a decent amount of distractions from the main subject was prominently visible in the frame.

The following notable properties were found in the 500 images taken from the middle of the prediction scores distribution (for example images, see appendix F39):

• Hue: A high variety of colors (in specific more red and yellow).

• Saturation and brightness: Increased saturation and brightness. Still contained a lot of brownish colors.

• Edges: Well defined edges with large shadows.

• Content: Most food was presumably home-made.

• Camera focus: Only limited images were out of focus.

• Subject framing: Most subjects were still cut off by the image frame, but mostly taken from a wider angle.



The following notable properties were found in the 500 images with the highest prediction scores (for example images, see appendix F40):

• Hue: A high variety of colors.

• Saturation and brightness: Mostly saturated and bright colors, while lacking mostly black and brown colors in particular.

• Edges: Very well defined edges. Short shadow.

• Content: Most food was presumably made by food fanatics or restaurants, as most images showed a high variety of different colors and shapes.

• Camera focus: An insignificant amount of images were out of focus.

• Subject framing: A low amount of images was cut off by the image frame.

• General feel: Images contained a lot of staging, as the subject was very well isolated from its surroundings.

A significant difference in the number of non-food images appearing in the different categories was observed (F22). The non-food images were mostly concentrated in the bottom category, showing 653% of the average percentage of non-food in the whole data set (compared to 147% for the middle and 40% for the top, respectively).

F22: Nonfood images in the selection of 500 images from the top/middle/bottom prediction groups (top: 6, middle: 22, bottom: 98).

4.2.3 quantified bottom/middle/top analysis

colors

Significant differences in color frequency were found between the image categories. The most remarkable findings were that colors closer to yellow and cyan were more frequently found in the top and middle 500 model predictions (F23, F24) than in the bottom predictions (F25).


F23: Frequency of hues in the highest 500 model predictions.

F24: Frequency of hues in the intermediate 500 model predictions.

F25: Frequency of hues in the lowest 500 model predictions.

Colors closer to red, blue and green hues appeared more frequently in images in the lowest prediction category (F26, F27, F28). However, some remarks can be made about how this experiment was conducted; these are noted in the discussion (5).

F26: Differences in hues between the lowest and highest 500 model predictions.

F27: Differences in hues between the top and medium prediction groups.

F28: Differences in hues between the medium and bottom prediction groups.

brightness/contrast/sharpness

The measurements of brightness, contrast and sharpness of the images in the top, middle and bottom image categories showed significant differences between the image groups (F29, F30, F31). Higher contrast, sharpness and brightness were all associated with higher image appreciation. The only comparison between distributions that showed no significant difference was between the top and middle image categories for image contrast (p-value of 0.22, appendix T11, F29).

F29: Distributions of image contrast for the different prediction groups.

F30: Distributions of image brightness for the different prediction groups.

F31: Distributions of image sharpness for the different prediction groups.

4.2.4 model scene variation scores

Applying filters to the images taken in a controlled environment had different effects on image appreciation. First, increasing the contrast and brightness of the images consistently resulted in a higher predicted number of likes (F32). Decreasing the contrast showed no significant effect on image appreciation, while changing the color warmth showed mixed results. Lastly, decreasing the brightness or saturation did show a significant negative effect on image appreciation. However, it can be expected that these findings are heavily tied to this particular data set. This is left for the discussion (5).

filters

F32: Effect of filters on the predicted number of likes. The number of likes was normalised per image by subtracting the prediction for the black-and-white image from the image score. Abbreviations: l = light, s = saturation, c = contrast, h = hue, w = warm (+15%), c = cold (-15%), with dd = -30%, d = -15%, i = +15%, ii = +30%. Letters and percentages correspond with their corresponding settings in Photoshop.

angles

Images taken from a 32° angle were found to gain a higher number of likes than images taken from a 90° angle of the same scene (p-value = 0.036, appendix T12). As the sample size was relatively small, no significant differences were found between the other combinations of angles (appendix T12).

F33: Effect of angles on the predicted number of likes. Photos were taken of the same object from 4 different angles. A total of 8 objects were used, corresponding with the different colors.

F34: Barplot of the effect of angles on the predicted number of likes.
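Because each of the 8 objects was photographed at all four angles, the 32° versus 90° comparison is a paired one. A sketch of such a paired comparison with illustrative numbers (not the thesis data), using a Wilcoxon signed-rank test as one plausible choice of test:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical predicted likes for the same 8 objects shot at 32 and 90 degrees.
likes_32 = np.array([24.1, 26.3, 22.8, 25.0, 23.5, 27.2, 21.9, 24.8])
likes_90 = np.array([21.0, 24.8, 20.1, 23.9, 21.9, 25.1, 20.5, 23.0])

# Paired test: each object serves as its own control across angles.
stat, p = wilcoxon(likes_32, likes_90)
mean_gain = (likes_32 - likes_90).mean()  # average advantage of the 32° shot
```

With every object predicted higher at 32° in this toy example, the paired test reaches significance despite only 8 observations, which is the pattern reported for the real data.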

colors

Colors, in contrast to angles, showed significant differences in the predicted number of likes and were found to rely heavily on color combinations (F35). Specifically, the color red was found to be responsible for a higher number of likes. Interestingly, this was not the case once it was combined with the color orange, which showed one of the lowest predictions (ro, F35), indicating that combinations of colors can have a significant impact on the model's predicted number of likes.

F35: Effect of colors on the predicted number of likes. Comparison of images taken from a 45° angle, normalised to the model's prediction for the black-and-white version. Abbreviations: w = white, b = brown, g = green, y = yellow, p = purple, r = red, with compound letters standing for the corresponding mixture of colors.
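The per-scene normalisation described in the F35 caption amounts to subtracting the model's black-and-white prediction from each colored variant's prediction for the same scene. A minimal sketch with hypothetical prediction values:

```python
def normalise_to_bw(predictions, bw_key="bw"):
    """Express each variant's predicted likes relative to the
    black-and-white baseline of the same scene (cf. F32/F35)."""
    base = predictions[bw_key]
    return {k: v - base for k, v in predictions.items() if k != bw_key}

# Hypothetical raw predictions for one scene photographed in several colors.
raw = {"bw": 14.0, "r": 25.5, "ro": 15.2, "w": 20.1}
effects = normalise_to_bw(raw)  # e.g. 'r' scores +11.5 over the baseline
```

This removes per-scene offsets, so that only the effect of the color manipulation remains, as plotted in F35.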

5 discussion

This research shows that CNNs can model human image-appreciation moderately well when restricting the network to learn only image-dependent features for one specific image category: food. In accordance with earlier observations of Khosla et al. (2014), separating social influences from image-dependent factors proved to be challenging. Selecting users based on followers is, as Kalayeh et al. (2015) and Karpathy (2015) showed to a lesser extent, a promising way to neglect the social effects on the data, as these were found to be highly associated with user types. In addition, this was found to be a good way to neglect the rich-get-richer phenomenon found in the literature (Souza Araujo et al., 2014). However, this selection was expected to sacrifice some generalizability of the results by excluding an estimated 80% of the total amount of data. Still, it was considered the best option among the alternatives. In addition, selecting users that had between 101-224 followers presumably favors better photography, as it was expected that this selects users who are more familiar with photography. The most important number in this study was the correlation of 0.25 between the observed and the predicted number of likes. While this significantly outperformed the created baselines (0.13 and 0.18), the score was still relatively low. This was reaffirmed by the classification accuracy of 62.4%. In the end, this correlation can be considered the ultimate test of the extent to which human image-appreciation can be modeled solely on image-dependent features. It is expected that a larger data set would improve the prediction score of the model the most. Higher contrast, image brightness and sharpness were associated with images with a higher image-appreciation score. Colors also had a significant effect on image-appreciation, but this is discussed further below.

Khosla et al. (2014) and Kalayeh et al. (2015) obtained higher correlations when predicting image popularity using CNNs (0.36 and 0.41, respectively). Part of this difference could be explained by the extra emphasis put here on separating user type from user content. This limits the ability of the CNN to learn about user type and how it corresponds to specific types of images. As user type was shown to correlate highly with the number of followers, which in turn correlated highly with the number of likes (0.81), this was expected to restrict the model even more to only image-dependent features. To neglect the effect of oversampling active users during the collection of data, only one image per user was permitted. This was necessary as age in particular was shown to be associated with higher engagement levels among a user's followers, which in turn resulted in, among other effects, a higher number of likes (Jang et al., 2015).

Unfortunately, most of the performed experiments had some caveats in their design. First, the experiments that used the 500 top/middle/bottom images of the predicted-likes distribution were expected to suffer to a certain extent from the amount of nonfood images left in the data. The nonfood images seemed to be gathered mostly in the bottom category, with 553% more nonfood images than average. In particular, the color analysis was assumed to be affected, as nonfood images are expected to contain very different
