
Using Machine Learning to improve

image selection in Marketing

Thesis for MBA Big Data at the Amsterdam Business School

University of Amsterdam

Version: Final

Date: 23 September 2018
Author: Stefan Beekman
Student number: 11165480

Email: Stefan.Beekman@hotmail.com


Table of Contents

SUMMARY
ACKNOWLEDGEMENTS
1 INTRODUCTION
1.1 BUSINESS QUESTION
1.2 IMAGE POPULARITY
1.3 IMAGE ANALYSIS
1.4 TEXT ANALYSIS
1.5 DELAWARE
1.6 THESIS APPROACH
2 BUSINESS REQUIREMENTS
2.1 GOAL
2.2 REQUIREMENTS
3 DATA USED
3.1 DATA SET
3.2 DATA PREPARATION AND FEATURE ENGINEERING
3.3 AUTOMATED FEATURE EXTRACTION FROM IMAGES
3.4 AUTOMATED FEATURE EXTRACTION FROM TEXT
3.5 FEATURE SCALING
3.6 TRANSFORMING THE POST META DATA
3.7 TRANSFORMING THE DEPENDENT VARIABLE
4 INITIAL MODEL EXPLORATION
4.1 VARIABLES USED
4.2 INITIAL MODELS
4.3 MODEL VARIATIONS
4.4 EVALUATION METRIC
4.5 INITIAL MODEL VALIDATION
5 MODELLING
5.1 FINAL MODEL USED
5.2 USE OF PRE-TRAINED MODELS
5.3 MODELS TRAINED
6 RESULTS
6.1 PERFORMANCE EVALUATION METRIC
6.2 PER BRAND
6.3 ACROSS BRANDS
7 CONCLUSION AND RECOMMENDATIONS
7.1 CONCLUSION
7.2 RECOMMENDATIONS FOR IMPROVEMENTS
7.3 RECOMMENDATIONS FOR USE CASES
7.4 RECOMMENDATIONS FOR FURTHER ANALYSIS

Summary

This thesis was the final part of the MBA Big Data course at the Amsterdam Business School (ABS), during which an end-to-end data science project was carried out following the CRISP-DM approach.

The idea for this thesis was born during initial meetings with company representatives from Delaware, who wanted to research whether Machine Learning algorithms could assist employees in making better decisions when selecting images for marketing purposes. As a professional services provider, Delaware is interested in offering its customers a service to improve the return on their online marketing activities. In most marketing activities, images are used to attract the attention of customers and consumers. Therefore, the topic of this thesis was to research, and try out in practice, whether and how machine learning techniques can help to predict which images are more likely to be popular and will therefore be more effective for marketing purposes. This has resulted in a trained and tested predictive model that can be used and further developed into a production application.

The data set used for this thesis is a collection of about 150,000 Instagram posts and corresponding images. This data set was used in Python to train a large number of predictive models for about 500 different brands. The pre-trained deep neural networks Word2Vec and MobileNet were used to automatically extract high-level and low-level features from the image content and context. Using these automatically extracted features, the predictive models apply the Elastic Net regression algorithm to predict the number of likes that a particular image will attract. To validate model performance, Spearman's correlation coefficient was used to correlate the ranking of the predicted likes with the ranking of the ground-truth likes.

The results that were achieved in the relatively short time frame of this thesis are very promising. The models, which only use automatically extracted features from the image content and context, show an impressive average rank correlation of 0.55 for brands with at least 400 posts in one year. However, there is no guarantee that the model will work for all brands, since the rank correlation ranges from 0.155 to 0.833.

This varying level of model performance for different brands was analysed to some extent, but will require further work to fully understand, as mentioned in the last chapter with the recommendations for further analysis.

Acknowledgements

The completion of this thesis project marks the end of two very intensive years. Two years of rushing at the end of the day, twice a week, to be on time for the lectures at the university. Of countless individual assignments and group projects. With many Skype calls and meetings with fellow students that involved interesting discussions. I have made new friends and have learned many new things. It feels like the grand finale of this MBA degree.

I could not have done this on my own, and therefore I would like to take the opportunity to thank some of the people who helped me.

First and foremost, I would like to thank my lovely girlfriend Helen for her support and understanding in those moments where I had to choose to focus on my study. I would also like to thank my fellow MBA students for their help in understanding some of the challenging topics throughout the course. And of course, the MBA programme management, Marc Salomon and Lesley Swensen, for putting together and organising this interesting programme.

With regards to the thesis, I would like to thank the following people for their contributions:

- Kurt Vergult and Thierry Bruyneel for their ideas and input on the topic for this thesis.
- Wouter Labeeuw and Inez van Lear for sharing their knowledge on Python programming and Deep Learning.
- Gijs Overgoor for his supervision throughout this thesis, including some very valuable Skype calls, and for sharing his data set with me.
- My colleagues at work for their understanding and flexibility, which allowed me to focus on my study and thesis when I had to.

Stefan Beekman


1 Introduction

This chapter provides summary information on the business problem and a short introduction to the concept of image popularity prediction by analysing image content and context. The chapter concludes by outlining the approach that was followed during this thesis project.

1.1 Business question

As a professional services provider, Delaware is interested in offering its customers a service to improve the return on their online marketing activities. This could be, for example:

- increasing the number of likes on their corporate social media account,
- improving the Click-Through-Rate (CTR) of email campaigns, or
- increasing the sales of a web shop.

In most marketing activities, images are used to attract the attention of customers and consumers. Therefore, the topic of this thesis was to research, and try out in practice, whether and how machine learning techniques can help to predict which images are more likely to be popular and will therefore be more effective for marketing purposes. In other words: which picture out of a selection of n will be the most effective?

Currently, the selection of an image is based on the knowledge and experience of individuals. However, more and more companies are on a transformational journey to become more data driven. The model developed as part of this thesis can be used to help employees make better, data-driven decisions on which images to use. Such a model will probably not replace human experience just yet. However, it can be used:

- to validate one's own or someone else's choice,
- as an additional input in an image selection workflow, or
- when potential images are very similar and hard to choose between.


1.2 Image popularity

To measure image popularity, one can look at, for example, the likes on a social media site, the CTR of an email campaign or the sales of a web shop. The general idea is: the more likes, the higher the CTR or sales, the more popular an image is.

It is of course up for debate whether image popularity can be measured objectively. However, just like (Cappallo, Mensink, & Snoek, 2015), this project assumes that recorded online metrics like view count, comment count and likes are indicative of the overall popularity of an image.

It is this image popularity measure that will be predicted by the model that will be developed as part of this thesis project. The prediction can be done based on the image content and the image context.

1.3 Image analysis

In order to predict the image popularity, for example the number of likes, the image needs to be analysed. Based on this analysis, the predictive model will make a prediction of the popularity.

According to the very popular paper "What makes an image popular?" (Khosla, Das Sarma, & Hamid, 2014), image popularity can be predicted based on image content at three different levels:

- Colour and simple image features: simple features that can be interpreted by the human eye, such as hue, saturation and the colour space of an image.
- Low-level computer vision features: more complex features at the pixel level.
- High-level features: objects in an image that can be recognised by humans.

For this project, only low-level features as detected by Convolutional Neural Networks and high-level features are used to predict image popularity. This is driven by evidence from research mentioned by (Khosla, Das Sarma, & Hamid, 2014) that low-level features and high-level semantic features tend to have more predictive power. Furthermore, this so-called "automated feature extraction" method is also very easy to implement, which is one of the key requirements described in chapter 2.

1.4 Text analysis

Apart from images, text can obviously also be used for the online marketing activities mentioned in paragraphs 1.1 and 2.1. This could be the email subject of an email campaign, the image caption or hash-tags of a social media post, or the product description in a web shop. But it could also be the meta data of the post, such as "time of the day" and "day of the week". This is part of what is called the context of the image.

A lot of work has been done on predicting social media popularity based on text content, for example on Twitter messages by (Hong, Dan, & Davison, 2011) and (Petrovic, Osborne, & Lavrenko, 2011). Results published by (Gelli, Uricchio, Bertini, Del Bimbo, & Chang, 2015) and (Khosla, Das Sarma, & Hamid, 2014) show that, next to image content, the social context is also important for predicting image popularity.

Therefore, the models described further in this document will also include text analysis to predict the image popularity.

1.5 Delaware

Delaware is a global company that provides professional services and solutions to customers, helping them to create and sustain a competitive advantage. These services and solutions always have some sort of IT component in them, whether this is the implementation of a software solution (e.g. a BI platform, ERP system or Digital Content Management system) or the development of a custom application.

Delaware has more than 1800 employees working at offices in more than 13 countries in North-America, Europe and Asia.

The idea for the topic of this thesis was born at Delaware in Belgium where some bright brains in the digital marketing team had the conviction that computer algorithms should be able to make better choices than their human counterparts when it comes to choosing which images to use for marketing purposes.

It is expected that the outcome of this thesis will be used to develop an application that will be offered to customers, in which they can upload their own data set. This application will then train a customer-specific predictive model on their data, helping them to make better decisions and, bottom line, increase their business results.


1.6 Thesis approach

This thesis was carried out as the final part of the MBA Big Data course at the Amsterdam Business School (ABS) of the Universiteit van Amsterdam (UvA). For this thesis, there were two options:

1. Carry out a Data Science project from beginning to end, with the goal of experiencing first-hand what it is to do such a project.
2. Describe and propose a solution to a more strategic problem. Typically, this option requires a deep literature study, alongside surveys or interviews at an organisation.

This thesis was of the first type: a data science project was carried out with the purpose of evolving into a business solution or service as a next phase after finishing the thesis. For the execution of this project, the CRISP-DM approach as published by (Chapman, et al., 2000) was largely followed. The methodology was the result of the work of employees of NCR, SPSS and DaimlerChrysler and has since become the de facto standard for data mining projects.

It follows an iterative approach where multiple cycles can be performed to achieve the best results, which is visualised in Figure 1.


While executing this thesis project, the following types of activities were performed during each of the first five phases:

- Business Understanding
  - Meet with business representatives to brainstorm about ideas for possible machine learning use cases.
  - Translate these ideas into a machine learning problem definition.
  - Create a high-level plan of approach.
  - Agree on requirements.
- Data Understanding
  - Learn about Neural Networks, Deep Learning, Transfer Learning and how to use them in Python.
  - Collect the required data.
  - Investigate and assess the data quality.
- Data Preparation
  - Analyse the data and determine which data transformations to apply.
  - Determine which data subset(s) to use.
  - Apply data cleansing and data transformation so the data can be used by the models.
  - Create some initial predictive models to assess which variable subsets to use and how long it takes to train the models.
- Modelling
  - Select the appropriate machine learning algorithm.
  - Test, validate and select the right methods for automated feature engineering.
  - Train, validate and test the model.
  - Conduct various iterations of training, validating and testing to improve the performance of the predictive model.
- Evaluation
  - Compare all the different models that were trained and determine which metric to use for model evaluation.
  - Understand the results.
  - Determine how the model can best be used in practice.

Throughout the entire thesis project, feedback sessions were conducted with the supervisor about the business problem, the chosen approach and models.

The result of this thesis is a working model, programmed in Python, that can be further developed into a final model and an application in which this model can be executed. Therefore, the Deployment phase was not part of this thesis project.


2 Business Requirements

This chapter describes in detail the goal and requirements of the thesis project.

2.1 Goal

The first phase of the CRISP-DM approach is the Business Understanding phase. This is a very important phase: without a solid understanding of the business problem, the risk is very high that the data science project will not deliver a result that answers the right business question.

During this phase, meetings were held with company representatives from Delaware, who wanted to research whether Machine Learning algorithms could assist employees in making better decisions about which images to use for marketing purposes. The idea and approach were also validated with the supervisor for this thesis project, all with the purpose of making sure that the right business question would be answered and that the problem would be approached in the right way, so that the thesis project would yield a usable result for Delaware.

As a professional services provider, Delaware consults its customers on how to improve the performance of their company. This can be done by improving business processes, implementing new IT systems and training staff in many different lines of business. Companies are doing more and more business online, so online marketing is becoming increasingly important. To make these online marketing activities more effective, it is important to use the right branding and to approach the right audience. There are many agencies specialised in this field of advertising and marketing, with a lot of experience. Based on this experience, the right visual content is created and chosen for online activities such as social media posts, e-mail campaigns, corporate websites and web shops.

However, at Delaware there is the vision that Machine Learning algorithms should be able to assist marketing employees in choosing more effective visual content. More specifically, the idea is that, when presented with a set of images, a Machine Learning model should be able to predict which image in that set will be the most effective. What this means in practice will depend on where the model is applied. This could be, for example:

- increasing the number of likes on a corporate Instagram or Twitter account,
- improving the Click-Through-Rate (CTR) of email campaigns, or
- increasing the sales of a web shop.

2.2 Requirements

To ensure that this project would yield the right result, and to make sure that it could be done in the limited timeframe available, defining clear requirements was very important. Below are the most important ones.

- Simplicity
The key requirement was that the model should be simple to implement. The amount of data processing and manual feature engineering needed to train and test the model should be limited. Ideally, it should be able to consume a new customer's data set without having to manually create a ground truth for the model, for example by manually labelling each image. This ensures that the model can be implemented and validated at low cost and that it can be used without domain expertise in marketing or image processing.

- Generic
The model should not be specific to one industry or scenario. By adhering to this principle, Delaware will be able to implement the model for customers in various industries and predict image popularity in different scenarios, like social media, marketing campaigns or online web shops.

- Cold start
If a company already has an online presence, for example on Instagram, but wants to improve its effectiveness, it should be possible to train the model using that data. However, if a company has no online presence yet, there is no data to train the model on. This is what is called a "cold start problem". Ideally, the model would be able to cater for such a scenario.


3 Data used

This chapter provides an overview of the data that was used for this project and how features are extracted from the image content and context. It also explains which data transformation techniques were applied.

3.1 Data Set

The data set that was used for this thesis is a collection of about 150,000 Instagram posts and the corresponding images, posted between May 2015 and April 2016. This data set was provided by the supervisor of this thesis, G. Overgoor, and has been used before in research on image popularity, as published in (Overgoor, Mazloom, Worring, Rietveld, & van Dolen, 2017). As a source of data for this thesis, Instagram is particularly interesting because it is a visual social media platform where sharing images is the main focus of its users. It is one of today's most popular social media sites, where users share about 95 million videos and images per day. The selection of brands to scrape from Instagram was based on their Gartner L2 Digital IQ index (https://www.l2inc.com/about/l2-digital-iq-index), where the top 1000 brands were selected.

The data extraction was done halfway through July 2016, allowing enough time between the date of posting (until April 2016) and the date of scraping for each post to collect its likes. As can be seen in the summary below, the number of posts within the one-year period differs from brand to brand:

Summary of the data set:

Number of brands with fewer than 100 posts: 283 (13,049 records)
Number of brands with 100 to 200 posts: 208 (30,670 records)
Number of brands with 200 to 400 posts: 220 (62,525 records)
Number of brands with more than 400 posts: 86 (45,934 records)
Total number of brands: 797 (152,178 records)

From this initial data set, various subsets have been used to train a large number of predictive models, which will be explained in more detail below. To ensure that enough information is available to train each of the models, only brands with at least 100 posts were used. In total, there were 514 brands with more than 100 posts during the 12 months mentioned above. Figure 2 shows an overview of the number of brands by industry with more than 100 posts:

Figure 2 Overview of number of brands and records by industry

It shows that there is a broad range of different industries (27) represented in the data set. However, the number of brands ("Number of Models" in the overview) per industry varies widely: from only 2 brands in, for example, the "Food and Beverage" industry, to 100 brands in the "Fashion & Personal Care" industry.


3.2 Data preparation and feature engineering

During the data preparation phase of the CRISP-DM process, the raw data is prepared so that it can be used by the predictive model that was chosen to solve the business problem. This is required to make sure that the model can be used and will yield the best possible performance. Which data preparation tasks need to be performed depends on both the data that is used and the model that is chosen. A large portion of the effort during a data science project is typically spent on these tasks.

To prepare the data set from Instagram to be used in the Elastic Net regression model, the data transformations that were performed are described below.

3.3 Automated feature extraction from images

As will be explained in detail in section 5.2, a pre-trained neural network, excluding the top, can be used for automated feature extraction. The first part of the deep learning network mainly consists of multiple Convolutional Layers that learn features. The second part of the network consists of fully connected layers that predict the classes; this is referred to as the top part. Figure 3 shows a simplified representation of this concept.

Figure 3 Source: https://www.learnopencv.com/keras-tutorial-transfer-learning-using-pre-trained-models/


For this project, the pre-trained MobileNet model as published in Keras is used. A detailed explanation of the choice for this model can be found in section 5.2. Based on the MobileNet model, a stripped model excluding the top part was created, by using the last Average Pooling layer, called "global_average_pooling2d_1", as the output of the stripped model:

Figure 4 Neural Network layer used as output

The MobileNet model requires images to be passed to the input layer as a tensor representing the image at 224 by 224 pixels, with each pixel containing an RGB value. This means that the input for the MobileNet model is in fact 224 * 224 * 3 = 150,528 features. The stripped model automatically reduces this to 1024 features.

A similar approach was also used in previously published work, albeit with different layers used as output. For example, (Cappallo, Mensink, & Snoek, 2015) use "the output of last fully connected layer of the network" and (Khosla, Das Sarma, & Hamid, 2014) use "the layer just before the final classification layer", whereas (Mazloom, Rietveld, Rudinac, Worring, & van Dolen, 2016) and (Overgoor, Mazloom, Worring, Rietveld, & van Dolen, 2017) use "the 15,293-dimensional output of the Softmax layer of the network to represent the image of each post". Additionally, the full MobileNet model was used to predict the probability of the occurrence of each of the 1000 classes it has been trained on.
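As an illustration, the sketch below shows how such a stripped feature extractor might be constructed. This is a minimal sketch assuming the keras.applications API; the layer name is the one quoted above, and the image file name is illustrative.

import numpy as np
from keras.applications.mobilenet import MobileNet, preprocess_input
from keras.models import Model
from keras.preprocessing import image

# Load the full pre-trained MobileNet (ImageNet weights, including the top).
full_model = MobileNet(weights="imagenet")

# Strip the classification top by taking the last average-pooling layer
# as the new output: 224 * 224 * 3 = 150,528 inputs -> 1024 features.
stripped_model = Model(
    inputs=full_model.input,
    outputs=full_model.get_layer("global_average_pooling2d_1").output)

# Extract both feature sets for one post image.
img = image.load_img("post.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

low_level = stripped_model.predict(x).reshape(-1)   # 1024 low-level features (X1)
class_probs = full_model.predict(x).reshape(-1)     # 1000 class probabilities (X2)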

3.4 Automated feature extraction from text

For the automated extraction of features from the image caption, the Word2Vec model was used, as published by (Mikolov, Chen, Corrado, & Dean, 2013) at Google. Word2Vec is a pre-trained deep neural network which computes a 300-dimension vector by mapping each word onto its Word2Vec representation. The code for this pre-trained model is published on http://code.google.com/p/word2vec. In order to extract a 300-dimension vector representation of the image caption of each Instagram post, the following data transformation steps are performed as part of the Python code:

- Split the image caption (sentence) into separate words (tokens).
- Convert each word to lowercase.
- Remove non-alphabetic tokens.
- Filter out stop words, since they do not add any meaning.
- For each word that is left, retrieve its 300-dimension vector representation.
- To represent the whole sentence, calculate the average vector by dividing the sum of all word vectors by the number of words.
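As an illustration, these steps could be implemented along the following lines. This is a minimal sketch assuming the gensim and NLTK libraries and the pre-trained Google News vectors; the actual thesis code may differ.

import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import stopwords

# Load the pre-trained 300-dimension Word2Vec model (path is illustrative).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
stop_words = set(stopwords.words("english"))  # requires nltk.download("stopwords")

def caption_vector(caption):
    """Average 300-dimension Word2Vec representation of an image caption."""
    tokens = [t.lower() for t in caption.split()]        # tokenise and lowercase
    tokens = [t for t in tokens if t.isalpha()]          # drop non-alphabetic tokens
    tokens = [t for t in tokens if t not in stop_words]  # drop stop words
    vectors = [w2v[t] for t in tokens if t in w2v]       # look up word vectors
    if not vectors:                                      # fall back for empty captions
        return np.zeros(300)
    return np.mean(vectors, axis=0)                      # average over all words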

This approach of using Word2Vec for automated feature extraction was also used by (Overgoor, Mazloom, Worring, Rietveld, & van Dolen, 2017). They call it one of the "state of the art features" and use it to represent a social media post in their data set.

3.5 Feature scaling

Because features in a predictive model can have varying magnitudes and units of measure it is good practice to apply a method to scale these features. Without scaling, features with a higher magnitude (e.g. kilograms) would have more impact on the prediction than features with a lower magnitude (e.g. milligrams). By applying scaling, we ensure that all features weigh in equally in applying the algorithm of choice.

For this reason, it is standard practice to apply scaling methods by default. This is also the approach that was taken during this project. To ensure that all automatically extracted features are on a scale between 0 and 1, min-max normalisation was applied to each feature, as displayed in Figure 5.
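Figure 5 presumably shows the standard min-max formula; a short sketch of the same rescaling is given below, where the small epsilon (an assumption of this sketch, not part of the original formula) guards against constant columns.

import numpy as np

def min_max_scale(X):
    """Rescale each column of the feature matrix X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min + 1e-12)  # epsilon avoids division by zero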


This was done for the variable sets X1, X2 and X3 as described below. The cosine similarity in variable X4 is already on a scale between 0 and 1 and therefore does not have to be standardised.

3.6 Transforming the post meta data

Post meta data features that were included as explanatory variables are:

- the count of image tags of the Instagram post, and
- the Hour, Day, Week, Week Day, Month, Year and Time of Day of the Instagram post.

The variables with the time and date information are categorical variables which contain numerical values. To ensure that these variables are used correctly by the regression model, they need to be transformed into dummy variables. This technique makes it possible to use categorical variables in a regression model.

Therefore, all categorical variables of the post meta data were transformed to dummy variables as shown in Figure 6.

Figure 6 Create dummy variables for categorical variables

This was done for the Hour, Day, Week, Week Day, Month, Year and Time of Day of the Instagram post.
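A one-line sketch of this transformation, assuming pandas and illustrative column names (Figure 6 shows the actual code that was used):

import pandas as pd

date_cols = ["Hour", "Day", "Week", "WeekDay", "Month", "Year", "TimeOfDay"]
posts = pd.get_dummies(posts, columns=date_cols)  # one 0/1 indicator column per category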

3.7 Transforming the dependent variable

Looking at the overall data set, the dependent variable is highly skewed, with values ranging from 1 to more than 900,000 likes. This becomes clear after inspecting the boxplot and histogram in Figure 7 and Figure 8.

Figure 7 Boxplot of ImageLikeCount

Figure 8 Histogram of ImageLikeCount

When looking at the skewness of the dependent variable per brand, the situation is more varied. Figure 9 shows a histogram of the skewness per brand grouped by the number of records available per brand.


Figure 9 Distribution of ImageLikeCount skewness per brand

In order to plot this histogram, the skewness of the ImageLikeCount was calculated for each brand. Skewness is a statistical measure of the asymmetry of the distribution of a variable, so the skewness value shows the amount and direction of the skew:

- If skewness is less than -1 or greater than 1, the distribution is highly skewed.
- If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
- If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.

Inspecting Figure 9 makes clear that for the brands with a lower number of records (between 100 and 400), the likes are less skewed, because the skewness is more centred around 0. For brands with more records (400 or more), the skewness is a bit more spread out. The model used to predict the likes is a regression model, which performs better when the data is not skewed. Since models are trained both per brand and across brands, a broad-brush approach was taken: a Log transformation of the dependent variable was applied across the board.
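A short sketch of the per-brand skewness calculation and the log transformation, assuming pandas and SciPy with illustrative column names:

import numpy as np
from scipy.stats import skew

# Skewness of the like counts per brand (used for the histogram in Figure 9).
skew_per_brand = posts.groupby("Brand")["ImageLikeCount"].apply(skew)

# Broad-brush log transformation of the dependent variable;
# like counts start at 1, so log(x) is always defined.
posts["LogLikes"] = np.log(posts["ImageLikeCount"])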


Figure 10 and Figure 11 are included to verify that the Log-transformed dependent variable approximates symmetry and can therefore safely be used in a regression model. Outliers have not been excluded from the data set.

Figure 10 Boxplot of Log transformed ImageLikeCount


4 Initial model exploration

This chapter shows the results of the initial data and model exploration phase. To explore which variables provide a good balance between model complexity and model performance, different models were trained and evaluated. Each model contains an increasing number of variables and therefore increases in complexity. The chapter concludes by explaining which model was selected for further analysis.

4.1 Variables used

Different types of variables were used, grouped in so-called "variable sets". The overview below describes the variable sets that were used; variable sets X1 to X5 are the independent variables and Y is the dependent variable.

- X1: Low-level image features, as extracted by a pre-trained Keras model.
  Purpose: can image likes be predicted based on low-level features of the image?
- X2: High-level image features, as extracted by a pre-trained Keras model.
  Purpose: can image likes be predicted based on high-level features (automatically detected objects) of the image?
- X3: Vector representation of the image caption, as extracted by the pre-trained Word2Vec model.
  Purpose: determine whether the image caption has a significant influence on the image likes.
- X4: Cosine similarity between the image caption and the user biography.
  Purpose: understand whether image likes are significantly higher if the meaning of the image caption is in line with the user's biography.
- X5: Meta data of the Instagram post.
  Purpose: understand which characteristics of the post have an impact on popularity.
- Y: Dependent variable: image likes on Instagram.
  Purpose: investigate whether it is possible to predict that one image will receive more likes than another image.

In this way, a combination of image content and image context features was used to predict image popularity. This approach was driven by published research indicating that both image content and image context contribute to image popularity; examples are (Khosla, Das Sarma, & Hamid, 2014), (Gelli, Uricchio, Bertini, Del Bimbo, & Chang, 2015) and also (McParlane, Moshfeghi, & Jose, 2014).

The grouping and naming of the variable sets was inspired by the article ""Nobody comes here anymore, it's too crowded"; Predicting Image Popularity on Flickr" by (McParlane, Moshfeghi, & Jose, 2014). They use the following "feature categories":

- Image context
- Image content
- User context
- Tags

Figure 12 shows a schematic overview of the variable sets that have been used to train the model. It also shows how this model is then used to predict popularity and how the model is validated by comparing the prediction against the ground truth using a Rank Correlation.


Figure 12 Schematic overview of model training, prediction and validation

Below are the numbers of features per variable set:

X1: 1024
X2: 1000
X3: 300
X4: 1
X5: 118

These features were combined in the regression model using "early fusion", i.e. by concatenating all feature vectors into one feature matrix before training. Because of the large number of features in relation to the number of records, there is a risk of overfitting. This risk is mitigated by using a penalised model, as explained in detail in paragraph 5.1.
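A minimal sketch of this early fusion step, assuming X1 to X5 have been built as (n_posts, n_features) arrays with the feature counts listed above:

import numpy as np

# Concatenate all variable sets column-wise into one design matrix.
X = np.hstack([X1, X2, X3, X4, X5])  # 1024 + 1000 + 300 + 1 + 118 = 2443 columns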


4.2 Initial models

As part of the data exploration phase of the CRISP-DM approach, the information contained in the data set was explored by initially training 5 different model types on various selections of the complete data set.

The purpose of this exercise was two-fold:

1. Determine which types of features (variable sets) should be used in the final predictive model for the deeper analysis described in the next chapter.
2. Verify that the pre-trained MobileNet and Word2Vec models can actually provide predictive information for this application.

The following 5 model types were trained on different data selections:

- Model M1 (X1): only low-level image features are used to predict image likes.
- Model M2 (X1, X2): both the low- and high-level image features are used to predict image likes.
- Model M3 (X1, X2, X3): next to the low- and high-level image features, the image caption is also used for the prediction.
- Model M4 (X1, X2, X3, X4): the above features, plus the cosine similarity score between the image caption and the user biography.
- Model M5 (X1, X2, X3, X4, X5): finally, the meta data of the Instagram post is also included.

As described in section 2.2, one of the main criteria for this thesis was to keep the model as simple as possible. The two main reasons for this are:

- The simpler the model, the easier and cheaper it is to implement and validate, in line with the requirements of section 2.2.
- But also, the computational power required to train the model is limited. The more features are included, the more complex the model becomes, and the longer it will take to train.

Because it is commonly known that Machine Learning algorithms perform better with more training data, only the brands with a high number of posts were considered during this phase of the project. A threshold of 400 posts was used, so only the 86 brands with 400 posts or more were used initially.

4.3 Model variations

For each of the 5 model types, two different model variations were trained:

- Across Brands: a random selection of an incremental number of records (100 / 200 / … / 1,000 / 2,000 / … / 9,000) across all 86 brands was used to train each of the 5 model types.
- Per Brand: for each of the 86 brands that had more than 400 posts, all records were used to train and validate each of the 86 models.

This means that a total of 5 * (86 + 18) = 520 predictive models were trained and validated. Later on, three additional "Across Brands" models were trained with 10,000 / 15,000 / 20,000 records respectively, to analyse the impact on performance and on the weights of the nodes in the neural network.
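As an illustration of the "Per Brand" variation, the training could be organised along the lines below; train_and_validate is a hypothetical helper standing in for the feature extraction, model fitting and validation described in chapters 3 and 5.

results = {}
for brand, group in posts.groupby("Brand"):
    if len(group) < 400:                        # exploration-phase threshold
        continue
    results[brand] = train_and_validate(group)  # hypothetical helper: fit and score one model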

The model performance of the "Per Brand" variation will probably be most representative of the situation where such a model is used for a customer of Delaware, since only the data of one customer is used. The model performance of the "Across Brands" variation is included in the data exploration as well, both to see the difference with the other variation and because there might be "cold start" situations where the data of a number of similar companies is used to train a new model for a customer.

As mentioned above, one of the reasons to keep the model relatively simple is to keep the computation time as low as possible. This was especially important during the execution of this thesis, as the available time to go through different iterations of the CRISP-DM model was limited. As an indication: it took 18 hours to train and validate the 525 predictive models on a MacBook Pro with a 2.9 GHz Intel i5 CPU and 8 GB of DDR3 RAM.


4.4 Evaluation metric

Since the model is a regression model that predicts the number of likes, it seems logical to use the R², also called the coefficient of determination, for evaluation. The R² measures the proportion of the variance in the dependent variable (in this case the likes) that can be explained by the independent variables (in this case the features extracted from the image and the image caption).

More simply put: by using the R² to evaluate the model, we would look at how well the number of likes is predicted for an image. However, going back to the business requirement for this thesis as described in paragraph 1.1, the goal is not to predict the exact number of likes (or the CTR, depending on which measure of popularity is used). What we want to predict is which image out of a selection of n images will be more popular.

Therefore, this can be addressed as a ranking problem, by ranking the popularity of a series of images. This means that the model can be evaluated using Spearman's correlation coefficient r, which is a measure of the dependence between two rankings (Spearman, 1904). Spearman's correlation coefficient is used to correlate the predicted likes with the ground-truth likes, so the value of r measures how closely the predicted ranking of likes aligns with the ground-truth ranking. The value of r ranges from -1 (perfect negative) to 1 (perfect positive). A value of 0 indicates a random relationship, which means in this case that the number of likes cannot be predicted.

The usage of the Spearman’s correlation coefficient aligns with previously published work on image popularity such as (Khosla, Das Sarma, & Hamid, 2014) and (Cappallo, Mensink, & Snoek, 2015).

4.5 Initial model validation

The main purpose of this data understanding phase was to find the features that provide the right balance between complexity and effectiveness: a model that is easy to create and quick to train, but still effective. Therefore, the effectiveness of the above 5 models has been compared; the results are displayed below (for models trained on 400 to 9,000 records):


Figure 13 Overview of initial model performance

For reasons explained above, the Spearman Rank Correlation is used to evaluate the model performance.

Figure 13 shows that the average model performance of the "Across Brands" models does not vary much and sits around the 0.44 mark, with a relatively low standard deviation of about 0.09. For the "Per Brand" models, however, there is a clear improvement in performance when the image caption (Model M3) and the post meta data (Model M5) are added. It also shows that for the "Per Brand" models the standard deviation is higher. For example, for the M3 model the Spearman rank correlation ranges from a very weak 0.16 to a very strong 0.85, and for the M5 model it ranges from a weak 0.32 to a very strong 0.92.

The scatterplot in Figure 14, which plots the model performance of the "Across Brands" models against the number of records used in training, shows a strong correlation: the more records were used for training, the higher the model performance.


Figure 14 Initial Model performance - Across Brands

For the detailed model evaluation and analysis in the next chapter, model M3 will be used, for the following reasons:

- It only contains image features (low- and high-level) and text directly related to the image (the image caption), so these features are easy to collect.
- It provides relatively high performance for a low-complexity model, which requires less effort to train.
- It excludes the cosine similarity, which does not add much predictive power. Additional time would be required to analyse whether this is due to a non-representative user biography, or whether this similarity actually has no impact on whether a user likes an image or not.
- Because the main focus of this thesis is on how an image can increase the number of likes, and not on when an image is posted or when an email is sent, the post meta data (which is part of model M5) is excluded.

To validate that the pre-trained Neural Networks that are used (MobileNet and Word2Vec) actually learn from the data they are trained on, the number of empty features is investigated.


Model M3 contains a total of 2324 automatically extracted features from the image and the image caption:

- MobileNet: 1024 features representing the low-level image features,
- MobileNet: 1000 features representing the likelihood of 1000 objects, and
- Word2Vec: 300 features representing the image caption.

The value of each feature represents the weight of an individual node in the layer of a neural network. If too many features have a value of 0, this would indicate that the neural network has not learned from the data it was trained on. For each of the 520 models that were trained, the number of empty features was recorded and plotted against the number of records used in training.

Figure 15 shows that for all 5 models the number of empty features decreases when the models are trained using more data (100 to 9,000 records). This shows that the pre-trained models learn from the data provided and can therefore be used to predict the popularity of an image.

Figure 15 Empty coefficients per model - Across Brands

This learning effect increases when more data is used to train the models. The effect is very clearly visible in Figure 16, which shows the number of empty features for Model M3 when also trained on larger data sets (10,000 / 15,000 / 20,000 records). The number of empty features decreases dramatically when a large number of records is used to train the model.
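Counting "empty" features amounts to counting the coefficients that the penalised regression shrank exactly to zero; a one-line sketch, assuming a fitted scikit-learn model as described in section 5.1:

import numpy as np

n_empty = int(np.sum(model.coef_ == 0))  # number of features with a zero coefficient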


Figure 16 Empty coefficients for model M3 - Across Brands

As a result of the data exploration phase, Model M3 was chosen as the model for detailed analysis. The results are described in the next chapter.


5 Modelling

This chapter provides a detailed explanation of the final model that was used plus details on how the model performance is evaluated.

5.1 Final model used

As explained in chapter 2, the dependent variable that we want to predict is the number of likes of an Instagram post. This can be modelled in different ways. For example, (McParlane, Moshfeghi, & Jose, 2014) model it as a binary classification, because they want to predict whether an image will receive a high or low number of views and comments in the future. On the other hand, (Khosla, Das Sarma, & Hamid, 2014) and (Can, Oktay, & Manmatha, 2013), for example, address the prediction as a regression problem.

For this thesis, the latter approach is followed, using a regression model. Because a large number of features is used (as described in the previous chapter), there is a risk of overfitting the model. This is especially the case for the "Per Brand" models, where the number of features is much larger than the number of observations (records in the data set). In such a case, a standard linear regression model would perform poorly.

To avoid overfitting, the most common approach is to use one of the three penalised regression models listed below. In these models, there is a penalty for having a large number of variables in the model; as a consequence, the coefficient values are shrunk towards zero. This can also be seen as an automated feature reduction method.

- Ridge Regression
This model shrinks the coefficient values as close to zero as possible, so that the features which contribute the least to the prediction are reduced to almost zero. This is called L2 regularisation. An advantage of this model is that, compared to a standard regression, it will still perform well even though a large number of features is used. A disadvantage is that all features are included in the final model, even the features that do not contribute much.

- Lasso Regression
This model overcomes the disadvantage of Ridge regression, since it will reduce the coefficient values completely to zero for features that make only a minor contribution to the predictive model. This is called L1 regularisation. By doing so, it performs an automated feature selection and results in a less complex model, which is also the most important advantage of this model.

- Elastic Net
This model, introduced by Hui Zou and Trevor Hastie in 2005, combines the L1 and L2 regularisation methods of Ridge and Lasso regression. According to their paper (Zou & Hastie, 2005), real-world data and a simulation study show that the Elastic Net often outperforms the Lasso regression.

Since the Elastic Net model outperforms Lasso and Ridge by combining the L1 and L2 regularisations, this model was chosen. Especially since the authors mention that "the elastic net is particularly useful when the number of predictors is much bigger than the number of observations" (Zou & Hastie, 2005), which is the case here.

5.2 Use of pre-trained models

To predict the popularity of an image, features need to be extracted from the image that can then be used as independent variables. This can be done by manually hand-crafting features, or by using pre-trained deep learning models that extract features automatically.

Automated feature extraction can be classified as one of three types of transfer learning, as explained on this web page: https://towardsdatascience.com/transfer-learning-946518f95666.

The three types of transfer learning are:

1. Training a model to reuse it
This is used when you need to solve task A, but do not have enough data to train a deep neural network for it. Instead, you train a model to solve a related task B (for which enough data exists). This model is then used as a starting point to solve the initial task.
2. Using a pre-trained model
This is the most common use of transfer learning in the Deep Learning field: a pre-trained model is used to solve your task.
3. Automatic feature extraction
Instead of manually hand-crafting features, for which domain expert knowledge is required, features are created automatically. This type of transfer learning is used to automatically extract features from an image or text, which are then used to represent that image or text.

According to (Goodfellow, Bengio, Courville, & Bach, 2016), transfer learning "refers to the situation where what has been learned in one setting … is exploited to improve generalization in another setting".

To use pre-trained models, Keras was used for this thesis, which is a Deep Learning library for Python. Currently, there are 10 published pre-trained models in Keras:

Figure 17 Keras Models. Source: https://keras.io/applications/

As is mentioned on the Keras web site: "Keras Applications are deep learning models that are made available alongside pre-trained weights. These models can be used for prediction, feature extraction, and fine-tuning."

In the M3 model used here, the MobileNet application is used. MobileNet was published by (Howard, et al., 2017) at Google and was trained to classify 1000 objects in images. The list of 1000 classes that were used to train this model is listed on this Github page: https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a.

Figure 17 shows the size of the models, the Top-1 and Top-5 accuracy, the number of parameters and the depth (number of layers) of each pre-trained deep neural network. It is clear that MobileNet does not have the best accuracy. However, this model was chosen for practical reasons: it is the smallest of all the models (only 17 MB) and was therefore quicker to train for the large number of models that were trained, as explained in paragraphs 4.3 and 5.3.

As mentioned in section 3.3, many other research projects have used pre-trained deep neural network models for automated feature extraction. Additionally, the use of this technique is backed by a paper by (Yosinski, Clune, Bengio, & Lipson, 2014), in which the transferability of features in deep neural networks is researched. They show that "the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features".

It will be interesting to test whether the use of other, more complex, pre-trained Keras models will further increase the performance of the model. This is one of the recommendations for further research mentioned in paragraph 7.2.

5.3 Models trained

To limit the time required for training the models, only the 86 brands with more than 400 posts were used during the initial data and model exploration phase.

For a more thorough and detailed analysis of model M3 in this final phase, however, all 514 brands with 100 posts or more were used:

Number of brands with 100 to 200 posts: 208 (30,670 records)
Number of brands with 200 to 400 posts: 220 (62,525 records)
Number of brands with more than 400 posts: 86 (45,934 records)

Just like in the data exploration phase, two variations of the model were created, in this case for the selected model M3:

- Across Brands: a random selection of an incremental number of records (100 / 200 / … / 1,000 / 2,000 / … / 10,000) across all posts of the selected brands.
- Per Brand: for each of the brands that had 100 posts or more, all records were used to train and validate each of the 514 models.

The training of the models was done in three different batches:

- brands with 100 to 200 posts,
- brands with 200 to 400 posts, and
- brands with more than 400 posts.

There were two reasons for splitting the model training into three batches:

- In case the computer crashed during training, only the models in that batch would have to be retrained.
- To test whether the number of records per brand has an impact on the model performance of the "Across Brands" models.


This process of data selections and model training is schematically represented in Figure 18.


This means that a total of 571 predictive models were trained and validated, which took about 33 hours in total:

Batch            Training time   Model type      # of models
100 to 200       ~9 hours        Per Brand       208
                                 Across Brands   19
200 to 400       ~9 hours        Per Brand       220
                                 Across Brands   19
More than 400    ~15 hours       Per Brand       86
                                 Across Brands   19


6 Results

This chapter contains a detailed analysis of the predictions made by the M3 model described in the previous chapter. It starts with an overall evaluation and then focuses on the Per Brand models and the Across Brands models in turn.

6.1 Performance evaluation metric

After training and validating the model on the training set, the model is tested on the test data set by predicting the likes for each image in the test set. The next step is to evaluate the performance of the predictive model.

As explained above, Spearman's correlation coefficient is used to evaluate the model's performance. As a rule of thumb, the value of this metric is interpreted as follows:

0.00 - 0.19: very weak rank correlation
0.20 - 0.39: weak rank correlation
0.40 - 0.59: moderate rank correlation
0.60 - 0.79: strong rank correlation
0.80 - 1.00: very strong rank correlation

Figure 19 shows an overview of the performance of Model M3, split by model variation (Per Brand / Across Brands) and batch.


Figure 19 Overall M3 Model Performance

Some initial conclusions can be drawn from this overview that are valid for both model variations:

- As the number of posts used to train the model increases, so does the model performance.
- An increasing number of posts means a lower p-value, and thus higher confidence in the model.
- For brands that have fewer than 200 posts, the p-value is too high (>0.05) and the average Spearman correlation is significantly lower compared to the models trained on brands with more posts.

It can also be concluded that the Per Brand models perform better than the Across Brands models. Lastly, it shows that when at least 400 posts of one brand are used to train the predictive model, image popularity can be predicted quite accurately, with an average Spearman correlation of 0.55. For some brands, the Spearman correlation is even as high as 0.85. When comparing this model's performance with previously published work, for example the very popular paper "What Makes an Image Popular?" by (Khosla, Das Sarma, & Hamid, 2014), the conclusion can be drawn that the performance of this model is very acceptable. (Khosla, Das Sarma, & Hamid, 2014) report rank correlations varying from 0.31 to 0.40 when only using image content; when combining image content with social content, they report that the rank correlations increase to between 0.48 and 0.81. Considering that Model M3 only uses the image content and image description, an average rank correlation between 0.4 and 0.55 is actually quite impressive.


6.2 Per brand

Figure 20 shows an overview of the model performance of the models trained per brand. By looking at the performance by industry, we can check whether there are certain industries for which this model works better than for others. This information can be used by Delaware to initially approach customers that operate in the industries for which the model seems to perform best.

Figure 20 Per Brand Model Performance by Industry

All except two industries have an average rank correlation of 0.4 or higher.

To understand what portion of the models in each industry performs well, Figure 21 shows what portion of the models has a rank correlation of more than 0.5 and what portion of less than 0.5.


Figure 21 Per Brand Model Performance split by Rank Correlation

In general, industries with a higher average rank correlation also have a higher proportion of brands with a rank correlation >0.5. However, there are also industries with a low average rank correlation where the majority of the brands still have a rank correlation >0.5. This is the case for, for example, the Not-for-profit organisations and the Sports, leisure and travel industries.

As expected, Figure 22 shows that the model performance in general increases when more records are used to train the model.


Figure 22 Correlation between number of records and Model Performance

Even though the average Spearman correlation for brands with more than 400 posts is an impressive 0.55, there is no guarantee that the model will work for every brand with at least 400 posts. Below are some examples of brands with more than 400 posts, with varying levels of model performance.

Brand                  Spearman Correlation   Number of records
Abercrombie & Fitch    0.833                  510
National Geographic    0.626                  993
Texas Instruments      0.398                  467
Ritz-Carlton           0.155                  472

Looking at the correlation between the likes and the predicted likes for these 4 brands in Figure 23, it becomes very clear how the model performance varies across brands.

Figure 23 Correlation between Y and predicted Y for four brands

One potential explanation for why likes might be harder to predict for certain brands or industries is that all images receive a similar number of likes. After all, there should be enough variation in the values of the dependent variable for the model to learn from.

To test this theory, the range of ImageLikeCount was calculated for each brand. This range is an indication of the variation in the dependent variable: the bigger the range, the higher the variation. However, Figure 24 clearly shows that the range does not have a strong relation with the Spearman correlation.


Figure 24 Correlation between Range of ImageLikeCount and Model Performance

Furthermore, Figure 25 shows that this theory does not hold either when looking at averages across industries.

Another potential explanation for the variation in model performance could be that for certain brands the images themselves are very similar, which makes the variation smaller. Or maybe the types of objects are very similar; for example, brands in the Professional Services industry (which performs poorly on average) might post similar pictures of business-focused subjects. However, analysing this as part of this thesis would require too much manual effort (images would have to be visually examined one by one), and it is therefore considered out of scope.

The conclusion is that the current model performs very well on average, but will not necessarily work for all brands / customers, and its usability should therefore be tested per case. As a recommendation for further model improvement, it would be interesting to see what can be achieved if much more data (more posts) for one brand is available.

6.3 Across brands

The Across Brands models performed slightly less well, as can be seen by inspecting Figure 26. These models are representative of what could be achieved if, for example, an industry-specific model were trained on the data of a collection of brands within one industry. Such a model could then be used for brands in that industry for which no training data is available.

Figure 26 Across Brand Model Performance by Batch

When only the models trained on 1,000 records or more are selected, the following effects can be noticed:

- the average rank correlation increases,
- the standard deviation decreases dramatically, and
- the p-value becomes (almost) zero.


Figure 27 Across Brand Model Performance by Batch trained on > 1,000 records


7 Conclusion and Recommendations

This chapter provides the final conclusions, as well as recommendations for further research, potential model improvements and use cases.

7.1 Conclusion

The results that were achieved in the relatively short time frame of this thesis are very promising. The models, which use only automatically extracted features from the image content and the image caption (the context), show an impressive average rank correlation of 0.55 for brands with at least 400 posts in one year.

As mentioned before, results published by (Gelli, Uricchio, Bertini, Del Bimbo, & Chang, 2015) and (Khosla, Das Sarma, & Hamid, 2014) show that, next to image content, the social context is also important for predicting image popularity. For example, time of posting, number of tags and number of followers also have a large impact on the number of likes. Therefore, achieving an average rank correlation of 0.55 based solely on the image and image caption is very impressive.

However, there is no guarantee that the model will work for all brands, since the rank correlation ranges from 0.155 to 0.833. This varying level of model performance for different brands was analysed to some extent, but will require further analysis to fully understand, as mentioned below.

One of the criteria was that the model should be generic and not specific to a particular scenario. Therefore, it is worth mentioning that this model can be used for different scenarios. For example, the model can be used both for predicting the Click Through Rate (CTR) on an email campaign and for predicting likes on social media. However, a model trained on likes should not be used to predict CTR; it would have to be trained on data representing the scenario it will be used for.

Below are some recommendations to further develop and improve the model that was created and used in this project. Because time was a limiting factor in completing this thesis, there are still some areas that can be further analysed; these are listed below as well.


7.2 Recommendations for improvements

The expectation is that the model performance can be further improved by:

- testing the use of other pre-trained Keras models. For this thesis, the MobileNet model was used because it is a small and relatively simple model. However, it would be very interesting to see whether model performance increases when more complex models like Xception or InceptionResNetV2 are used (a minimal sketch of such a swap follows this list).

- gathering and using more data. Currently, a readily available data set is used in which about 900 posts are available for the largest brands. If a data set with more data can be collected, this will probably improve performance.

- testing for skewness per brand before applying a log transformation of the dependent variable. Currently, a broad-brush approach is taken across all brands: a log transformation is applied to the dependent variable (ImageLikeCount) to approach normality. However, the skewness per brand varies, so it might be useful to test for skewness before applying the transformation.

- adding additional explanatory variables. Currently, the image content and image description are used as explanatory variables. It would be interesting to test model performance when additional variables are added, such as the cosine similarity between the image description and the mission statement of a brand.
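
The sketch below illustrates the first recommendation: swapping MobileNet for Xception as the feature extractor via the standard Keras applications API. This is a minimal example under stated assumptions, not the exact extraction code used in this project.

import numpy as np
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing import image

# Load Xception without its classification head; pooling='avg' collapses the
# final feature map into one 2048-dimensional vector per image.
model = Xception(weights='imagenet', include_top=False, pooling='avg')

def extract_features(img_path):
    # Xception expects 299x299 input, whereas MobileNet uses 224x224.
    img = image.load_img(img_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return model.predict(x)[0]  # shape: (2048,)

These feature vectors can then be fed to the same Elastic Net regression as before; only the feature dimensionality changes.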

7.3 Recommendations for use cases

Clearly, the most obvious use case is to train a model on a company's own data set. For example, a company has two years of social media posts and is not happy with its customer engagement on social media. The likes or comments on pictures can be used as a proxy for customer engagement. The model can be trained to predict the number of likes or comments based on the features that are extracted from the posted images. The trained model can then be used to predict the number of likes or comments for a set of candidate pictures for a social media post. The picture with the highest predicted number of likes or comments can then be used for the post.
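
A minimal sketch of this selection step, assuming a fitted scikit-learn Elastic Net regressor reg (the algorithm used in this thesis) and an extract_features function that produces the same feature vector the model was trained on; the file names are illustrative.

import numpy as np

candidates = ['option_a.jpg', 'option_b.jpg', 'option_c.jpg']
features = np.vstack([extract_features(path) for path in candidates])
predicted = reg.predict(features)

# Pick the candidate with the highest predicted number of likes. If the model
# predicts log-transformed likes, the argmax is unchanged because the log
# transformation is monotonic.
best = candidates[int(np.argmax(predicted))]
print('Suggested image for the post:', best)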

The second use case is when a company wants to train the model on the data set of a competitor that is much more popular on social media. Naturally, this can only work if the data of this competitor is publicly available, which is normally the case for social media data. By training the model on the competitor's data, it will learn from this data. Just like in the first use case, the trained model can then be used to predict the number of likes or comments for a picture. The difference is that the model has now learned from the competitor's more successful data instead of from the company's own posts.


Another use case is a scenario where a company is not yet active on social media and therefore does not have a data set to train the model on. This is what is called a "cold start" problem. In this case, publicly available social media data can be collected from other companies. Which companies should be used can be based on criteria such as:

- select companies with similar target customers
- select companies of similar size and product category
- select companies with a high average number of likes or comments
- select companies with better financial performance

Lastly, if meta data such as "day of week" and "time of day" is included when training the model, the model can also be used to recommend to companies the best day and time to post social media messages (a minimal sketch follows).
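
A minimal sketch of how such metadata features could be derived, assuming each post carries a timestamp in a column named PostDate (an illustrative name):

import pandas as pd

# Parse the raw timestamps and derive simple calendar features.
df['PostDate'] = pd.to_datetime(df['PostDate'])
df['day_of_week'] = df['PostDate'].dt.dayofweek  # 0 = Monday
df['hour_of_day'] = df['PostDate'].dt.hour

# One-hot encode so a linear model such as Elastic Net can learn a separate
# weight for each day and each hour.
meta_features = pd.get_dummies(df[['day_of_week', 'hour_of_day']].astype(str))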

7.4 Recommendations for further analysis

As mentioned, there is no guarantee that the model will work for all brands since the rank correlation ranges from 0.155 to 0.833. Therefore, additional analysis is required to fully understand what causes this variation in model performance.

Some ideas for further analysis are:

- Look at all individual images of a poorly performing brand to see if the image content reveals any useful information.

- Investigate whether there are any characteristics that the poorly performing brands have in common.


8 References

Can, E., Oktay, H., & Manmatha, R. (2013). Predicting retweet count using visual cues. Proceedings of the 22nd ACM international conference on Information & Knowledge Management (pp. 1481-1484). San Francisco, California, USA: ACM.

Cappallo, S., Mensink, T., & Snoek, C. (2015). Latent Factors of Visual Popularity Prediction. ICMR’15 (p. n/a). Shanghai, China: ACM.

Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0 Step-by-step data mining guide. Misc.: CRISP-DM consortium.

Gelli, F., Uricchio, T., Bertini, M., Del Bimbo, A., & Chang, S.-F. (2015). Image Popularity Prediction in Social Media Using Sentiment and Context Features. Proceedings of the 23rd ACM international conference on Multimedia (pp. 907-910). Brisbane, Australia: ACM.

Goodfellow, I., Bengio, Y., Courville, A., & Bach, F. (2016). Deep Learning (Adaptive Computation and Machine Learning). Cambridge, Massachusetts, United States: MIT Press.

Hong, L., Dan, O., & Davison, B. (2011). Predicting popular messages in Twitter. WWW '11 Proceedings of the 20th international conference companion on World wide web (pp. 57-58). Hyderabad, India: ACM.

Howard, A., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., . . . Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861.

Khosla, A., Das Sarma, A., & Hamid, R. (2014). What Makes an Image Popular? Proceedings of the 23rd international conference on World wide web (pp. 867-876). Seoul, Korea: ACM.

Mazloom, M., Rietveld, R., Rudinac, S., Worring, M., & van Dolen, W. (2016). Multimodal Popularity Prediction of Brand-related Social Media Posts. Proceedings of the 2016 ACM on Multimedia Conference (pp. 197-201). Amsterdam, The Netherlands: ACM.

McParlane, P., Moshfeghi, Y., & Jose, J. (2014). "Nobody comes here anymore, it's too crowded"; Predicting Image Popularity on Flickr. Proceedings of International Conference on Multimedia Retrieval (p. 385). Glasgow, United Kingdom: ACM.


Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Neural Information Processing Systems 2013 (p. 1). Stateline, Nevada, United States: NIPS.

Overgoor, G., Mazloom, M., Worring, M., Rietveld, R., & van Dolen, W. (2017). A Spatio-Temporal Category Representation for Brand Popularity Prediction. Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (pp. 233-241). Bucharest, Romania: ACM.

Petrovic, S., Osborne, M., & Lavrenko, V. (2011). RT to Win! Predicting Message Propagation in Twitter. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (pp. 586-589). Barcelona, Spain: AAAI.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15, 72-101.

Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? Advances in Neural Information Processing Systems 27, 3320-3328.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society: Series B, 67(2), 301-320.
