

To Click or Not to Click: Machine Learning Techniques for Predicting and Uncovering Influencers of Click Behaviour.

by Erwin Oosterhuis
June 25, 2017

Master Thesis
MSc Marketing Intelligence
University of Groningen
Faculty of Economics and Business
Department of Marketing
PO Box 800, 9700 AV Groningen

Supervisors:

First Supervisor: prof. dr. J.E. Wieringa
Second Supervisor: dr. J.E.M. van Nierop

Author:

Erwin Oosterhuis (s2211173)
Wrangelstrasse 3, 10997 Berlin

+31611501465


Management Summary

Nowadays, the highly competitive nature of the online marketing industry causes customers to avoid banner advertisements. Click-through rates (CTR) have decreased to as low as 0.1%. In order to increase CTR, a precise prediction of whether or not website visitors will click on an advertisement is necessary. The aim of this study is two-fold. First, it focuses on how state-of-the-art machine learning techniques can be used to accurately predict clicks, thereby providing further insights into open research issues with regard to handling class imbalance, feature selection, and calibration. Machine learning techniques included in this study are logistic regression, decision trees, boosted decision trees, bagged decision trees, random forests, and support vector machines. Second, this study uncovers relevant influencers of click behaviour, aimed at improving click prediction and providing relevant insights. The focus of this research is on browsing behaviour data and demographical data, obtained from an online web shop which makes use of banner advertisements.

Random forests result in the highest performance, making them the best algorithm for click prediction. Calibration using Platt's scaling significantly improves the performance of all algorithms, making it a good procedure for improving the quality of the predicted probabilities. After calibration, boosted decision trees result in the highest performance. In order to increase performance even further, it is advised to apply feature selection. Wrapper feature selection is a good method, although caution is advised when using it in combination with boosted decision trees, since it may cause a decrease in performance. Filter feature selection is a good option when computation power is limited. No general rule emerges as to which class imbalance procedure works best, so it is best practice to empirically assess the influence of different sampling schemes on model performance. It is advisable to take SMOTE into consideration, since it performed surprisingly well.

Demographical data and browsing behaviour data are easily obtainable. Combining these types of data may in some cases increase model performance. However, most of the time, using only browsing behaviour data results in the best performance. It is therefore better to focus on the data quality of browsing behaviour variables than on data quantity acquired by combining multiple datasets.


advertising on mobile devices. Visitors acquired via organic search and direct traffic are more likely to click on advertisements than visitors acquired via paid search and email. It may therefore be advisable to focus on search engine optimization and brand knowledge. In addition, browsing through more result pages has a positive influence on click behaviour. Existing customers are more likely to show click behaviour, as well as customers who have recently visited the website.


Preface

Due to a voluntary internship, writing my thesis was delayed by half a year. This gave me the opportunity to take a new course, Data Science and Marketing Analytics, during which I was introduced to multiple machine learning techniques, which immediately caught my interest. I decided to apply this knowledge in my thesis in order to gain a deeper understanding of these fascinating techniques, which nowadays form the backbone of many organizational and marketing processes. This experience also gave me a direction for my future: analysing Big Data.

I want to thank my family and friends who frequently visited me when I stayed in Berlin, where I lived most of the time to focus on writing my thesis. I also want to thank my girlfriend whose insights provided clarity when I dealt with problems. In addition, special thanks go to my supervisor, prof. dr. Jaap Wieringa for his support and feedback.

Erwin Oosterhuis
Berlin


TABLE OF CONTENTS

Management Summary
Preface
Chapter 1: Introduction
Chapter 2: Theoretical Framework
2.1 Drivers of click-through
2.1.1 The importance of user demographics and browsing behaviour
2.1.2 The influence of user demographics on click-through
2.1.3 The influence of browsing behaviour on click-through
2.2 Machine learning techniques and open research issues
2.2.1 Choice of classification techniques
2.2.2 Handling class imbalance
2.2.3 Calibration
2.2.4 Feature selection
Chapter 3: Methodology
3.1 Machine learning techniques
3.1.1 Logistic regression
3.1.2 Decision trees
3.1.3 Bagged decision trees
3.1.4 Boosted decision trees
3.1.5 Random forests
3.1.6 Support vector machines (SVM)
3.2 Performance metrics
3.3 Feature selection
3.4 Data analysis procedure
Chapter 4: Data
4.1 Data collection
4.2 Data description
4.3 Data inspection and cleaning
4.4 Missing values
4.5 Correlation and multicollinearity
4.6 Descriptive statistics
Chapter 5: Results
5.1 Logistic regression
5.2 Decision trees
5.3 Bagged decision trees
5.4 Boosted decision trees
5.5 Random forests
5.6 Support vector machine (SVM)
5.7 Variable importance
5.8 Model differences
5.9 Demographic variables vs. browsing behaviour variables
Chapter 6: Discussion
Chapter 7: Conclusion and Recommendations
Chapter 8: Limitations and Future Research
References
Appendices
Appendix A
Appendix B - Logistic regression
Appendix C - Decision trees
Appendix D - Bagged decision trees
Appendix E - Boosted decision trees
Appendix F - Random forests


Chapter 1: Introduction

Online marketing expenditures in Europe are increasing drastically, totalling €36.4 billion in 2015; an increase of 13% compared to €32.1 billion in 2014 (Interactive Advertising Bureau Europe, 2015). These numbers indicate an increasing interest in online marketing to draw traffic to firms' websites, and firms are intensifying their online advertising efforts (Interactive Advertising Bureau, 2013). The expenditures are spread across multiple, often distinct, forms of online advertising such as display advertising, video advertising, affiliate marketing, retargeting, and search engine advertising. Platforms such as LinkedIn, Facebook, Instagram, and Twitter offer innovative ways to advertise within their networks. Firms like Google and Yahoo offer, amongst others, search engine advertising, and many organisations such as AdRoll, Retargeter, Fetchback, and Chango offer retargeting. The result is an industry with an abundance of different marketing options, causing consumers to avoid banner advertisements (Cho and Cheon 2004; Drèze and Hussherr 2003); consequently, the click-through rate (CTR)1 has decreased to as low as 0.1% (MediaMind 2012).

According to Chandon et al. (2003), the success of display advertisements should be measured by their CTR. In addition, according to Archak et al. (2010), CTR is often used as a standard measure of advertisement quality. To maximize the revenue of display advertisements, it is essential to make a proper selection of advertisements, requiring a precise prediction of whether or not users will click on them (Shaparenko et al. 2009). Users are most likely to click on advertisements related to their interests. It is therefore important to be able to predict whether an ad is likely to be clicked, and to maximize the number of these clicks (Ciaramita et al. 2008). Being able to predict clicks creates an opportunity to target advertisements. According to Briggs and Hollis (1997), Sherman and Deighton (2001), Chandon et al. (2003), and Chatterjee et al. (2003), successful targeting of advertisements improves CTR. Consequently, predicting click-through increases advertising effectiveness.

Nowadays, multiple sources of consumer information are available, such as personal identification information, demographics, shopping preferences, history of products bought, and geographical location information (Phelps et al. 2000; Unni and Harmon 2007; White et al. 2008). Free tools such as Google Analytics already offer a broad range of demographical,

1 CTR: the ratio between the number of times an advertisement was clicked by a user and the number of times it was displayed.


browsing behaviour and customer interests data. This paper investigates the importance of these types of data in influencing click behaviour. Different classification machine learning techniques are used to set up a click-through prediction model, and this helps to show the roles of different variables.

Machine learning techniques are seen as state-of-the-art data mining classification algorithms (Verbeke et al. 2011). However, many open research issues exist with regard to machine learning (Verbeke et al. 2012). Burez and Van den Poel (2009) advised comparing different machine learning techniques in terms of their performance. Verbeke et al. (2012) called for future research into how the performance of prediction models changes when combined with different sampling schemes to handle class imbalance. In addition, the effects of calibration and feature selection on performance differ per machine learning technique and per study (e.g. Caruana and Niculescu-Mizil 2006; Nnamoko et al. 2014; Kumari and Swarnkar 2011; Hall and Holmes 2003). Overcoming these issues is of paramount importance to setting up a proper predictive model.

Verbeke et al. (2011) advised focusing not only on accuracy during model building, but also on comprehensibility. According to Lima et al. (2009), the comprehensibility of prediction models has received much less attention in the literature than the performance of these models. This trend is also apparent within the click prediction literature. For example, Chen et al. (2009) developed a highly complex behavioural targeting model, but did not take comprehensibility into consideration. Zhang et al. (2014) modelled the dependency on users' sequential behaviours in click prediction by making use of recurrent neural networks. However, neural networks can be considered 'black boxes', making the influence of individual variables difficult to interpret. The abundance of complex click prediction models shows that predicting click-through has been the main objective, while relevant insights into the driving forces behind click behaviour have been neglected.


This thesis will contribute to the existing literature by developing an accurate click-through prediction model. This prediction model can be used to target advertisements and maximize the revenue generated by display advertisement campaigns. In addition, open research issues regarding machine learning techniques will be addressed and relevant variables affecting click behaviour are uncovered, giving advertisers insight into what influences customers to click on advertisements.

The remainder of this thesis is organized as follows: Chapter 2 introduces the current literature on variables influencing click-through, as well as different open research issues with regard to machine learning techniques. Chapter 3 describes the methodology of this research, with an explanation of the machine learning techniques used. Chapter 4 details which data is used and which pre-processing steps are taken. Chapter 5 contains the results of the different machine learning techniques. These results are discussed in Chapter 6. Chapter 7 concludes, and Chapter 8 touches upon limitations and future research directions.

Chapter 2: Theoretical Framework

2.1 Drivers of click-through

Much research focuses on the design of advertisements as a driver of CTR. According to Baltas (2003), bigger advertisements are more effective in attracting attention, increasing the likelihood that web visitors will click on ads. In addition, Baltas (2003) revealed that unbranded advertisements may evoke curiosity, resulting in click-through. However, Dahlen (2001) concluded that well-known brands are favoured due to their familiarity, resulting in higher click-through rates when brand names are shown. According to Drèze and Hussherr (2003), the creativity of banners only had marginal effects on CTR.


Therefore, it is of importance to create a proper match between advertisement and web visitor. If there is a proper match, the probability of a web page visitor clicking an ad or making a transaction will increase (Jaworska and Sydow 2008). The problem underlying creating such a match is called targeting: users are divided into groups based on web usage and age, and each user group is shown a different advertisement (Jaworska and Sydow 2008). Targeting comes in different forms, such as contextual, behavioural, and demographical targeting.

When predictions of visitors' click-through are known, targeted ads can be shown, e.g. by offering a coupon code to visitors with the lowest probability of clicking or offering more information to visitors with medium probabilities. In order to decrease advertising costs and increase the average CTR, advertisements could be shown only to visitors with a high probability of clicking. This prediction can be based on the general click-through probability, calculated over all advertisements, or on the click-through probability of specific advertisements. In the latter case, multiple calculations could be made to search for the optimal advertisement. Predicting clicks influences user experience as well as the profitability and revenue of advertising (Graepel et al. 2010; Zhang et al. 2014). Predicting the probability of clicking also helps to uncover the best set of advertisements (Shaparenko et al. 2009). As previously mentioned, relevant advertisements result in multiple favourable outcomes.

Why use click-through as the measure of interest? CTR can be considered a simple way of measuring and attributing clicks, which has made it the de facto standard of measuring advertisement quality. It is common to assume that the goal of advertisers is to maximize clicks on an advertisement within a limited budget. In addition, CTR is often used to measure the return on investment of specific keywords (Archak et al. 2010). Information on click-through rates is also easily obtainable. Moreover, many advertising companies only have access to CTR, since information such as profitability and sales is usually owned by the organisations using the advertisement systems, a problem encountered by Yan et al. (2009). The higher the CTR of advertisements, the higher the revenue of the system and, at the same time, the more efficient the ad campaign (Jaworska and Sydow 2008).

2.1.1 The importance of user demographics and browsing behaviour


Jaworska and Sydow (2008) developed a model using machine learning techniques to decide which online advertisement to show to a web page visitor based on previous website behaviour. They were able to increase the CTR by 20%, in some cases by as much as 40%. Joshi et al. (2011) found that data on consumer demographics can improve targeting. According to Yan et al. (2009), CTR can be improved by as much as 670% by properly segmenting consumers with behavioural targeting. He et al. (2014) introduced a machine learning model based on both historical and contextual features, where historical features capture information about user behaviour and ad interaction, and contextual features capture information regarding the context in which an ad is shown. He et al. (2014) found historical features to have considerably more explanatory power than contextual features.

Chen et al. (2009) advised using user behaviour data when predicting click-through. Joshi et al. (2011) and Beel et al. (2013) emphasised the importance of demographical data. Coupled with the high availability of such data, both kinds are used in developing the model, which focuses on general click-through rates. This results in the following hypothesis:

H1: The combination of browsing behaviour data and user demographics data will result in better model performance in predicting click-through of banner advertisements, compared to the separate influence of browsing behaviour data and user demographics data on model performance.

2.1.2 The influence of user demographics on click-through

To date, limited research has focused on the influence of specific user demographics on click-through. To the author's knowledge, only one study has examined the impact of demographics, specifically age and gender, on the CTR of paper recommendations made by recommender systems. In this study, Beel et al. (2013) found gender to have only a limited influence on click behaviour, whereas age had a considerable influence on CTR.


behaviour. According to Wang and Sun (2010) and Cho (1999), attitudes positively and significantly influenced click-through behaviour.

The aforementioned research findings confirm the influence of attitudes on click behaviour, but how do demographics in turn influence these attitudes? According to Shavitt et al. (1998), younger consumers have more positive attitudes towards advertisements than older consumers. Contrary to this, McKay-Nesbitt et al. (2011) reported that elderly people have more favourable responses to advertisements. Shavitt et al. (1998) found males to have more positive attitudes, whereas Bush et al. (1999) stated that women have more positive attitude scores towards advertisements than men. Although authors did not reach a consensus about how exactly demographic variables influence attitudes, they all report significant influences. In addition, Hirschman and Thompson (1997) emphasised the importance of taking gender into consideration when analysing advertising effectiveness. Reportedly, gender has a direct relationship with advertisement effectiveness, mediated by emotions (Moore 2007).

It may be concluded that gender and age influence responses and attitudes towards advertising, either directly or indirectly. Since the dataset does not contain any information about intentions or attitudes, it is only possible to uncover the direct influence of these variables on click behaviour. Based on the findings of McKay-Nesbitt et al. (2011) and Bush et al. (1999), the following hypotheses are formulated:

H2a: Age positively influences click-through rates of banner advertisements

H2b: Women show significantly higher click-through rates of banner advertisements compared to men

2.1.3 The influence of browsing behaviour on click-through

Customers involved in a certain product category are highly receptive to information related to this category and tend to be more engaged with advertisements for that product. Therefore, they are more likely to seek out more information by clicking on advertisements (Cho 1999). This is in line with the Elaboration Likelihood Model proposed by Petty and Cacioppo (1986).


It is expected that when customers show search behaviour, they are more involved and therefore more responsive towards banner advertisements. This results in the following hypothesis:

H3a: Search behaviour is positively associated with click-through rates of banner advertisements

Kim et al. (2007) found a significant relation between involvement and patronage intention towards an online store, confirmed in an offline setting by Wakefield and Baker (1998). Since involvement is positively related to patronage intention, and involved customers are more responsive towards advertisements, there may also be a positive relation between online store patronage and click behaviour.

H3b: Store patronage is positively associated with click-through rates of banner advertisements

According to Assael (1984), the effectiveness of advertising depends on the type of medium through which the message is delivered. Bart et al. (2014) argued that advertising on mobile devices should not be expected to have a large impact on consumers' attitudes and intentions. According to Patel et al. (2013), consumers are often exposed to mobile advertisements when they are 'on the go', limiting their attention towards the respective advertisement.

H3c: Browsing via tablets or mobile phones will negatively influence click-through rates of banner advertisements

2.2 Machine learning techniques and open research issues

Machine learning techniques can be broadly classified into several categories, including classification, clustering, dependency analysis, data visualization, and text mining (Shaw et al. 2001). Classification analysis is the process of training an algorithm (or classification technique) to categorize a set of training examples into classes. Such a classification model is then used to classify future instances (Wei and Chiu 2002). In this research, classification techniques are used to classify web page visitors according to their potential clicking behaviour and to uncover relevant influencers of clicking behaviour.

2.2.1 Choice of classification techniques


Caruana and Niculescu-Mizil (2006) conducted a large-scale empirical comparison of ten supervised learning algorithms, including calibration with Platt's scaling and isotonic regression (refer to section 2.2.3). According to the authors, calibration has a significant impact on the performance of multiple algorithms. Calibrated boosted trees showed the best overall performance, followed by random forests, uncalibrated bagged trees, and calibrated SVMs. Without calibration, bagged trees, random forests, and neural networks performed best overall, and logistic regression was one of the worst performing algorithms. According to a benchmark study by Verbeke et al. (2012), alternating decision trees resulted in the best overall performance, although a large number of other techniques were not significantly outperformed. Wei and Chiu (2002) added that decision tree approaches appear to be more appropriate for targeted learning and prediction because they are capable of efficiently generating interpretable knowledge in an understandable form.

The goal of this work is not only to classify future customers according to their clicking behaviour, but also to give insights into which variables influence that behaviour. Therefore, a set of models must be found that satisfies both high classification performance and interpretability, in the form of output which can be used to judge variable importance. Consequently, relying on the research of Caruana and Niculescu-Mizil (2006) and the suggestions of Wei and Chiu (2002) and Verbeke et al. (2012), boosted trees, bagged trees, random forests, and SVM are used to set up a click-through prediction model. In addition, decision trees and logistic regression are used because of their interpretable output (see section 3.1 for a further explanation of the machine learning techniques). This results in the following hypotheses:

H4a: Before calibration, bagged decision trees have the best performance in predicting click-through of banner advertisements

H4b: After calibration, boosted trees have the best performance in predicting click-through of banner advertisements

H4c: Before calibration, logistic regression has the worst performance in predicting click-through of banner advertisements

2.2.2 Handling class imbalance


Training on a heavily imbalanced dataset can result in a ‘null’ prediction system which classifies all cases as having the majority classification (Wei and Chiu 2002).

Random undersampling resolves class imbalance by randomly removing cases from the majority class in order to decrease the size of the majority class. In contrast, oversampling randomly duplicates instances in order to increase the minority class. SMOTE (Synthetic Minority Oversampling TEchnique), proposed by Chawla et al. (2002) is a form of oversampling. In contrast to random oversampling, SMOTE does not duplicate instances, but creates new examples which are interpolated between existing neighbours from the minority class.
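As a minimal sketch of both schemes, assuming a training data frame train with a binary factor Clicked with levels "no"/"yes" (all names here are illustrative), the R packages used later in this thesis can be applied as follows:

# Class imbalance handling (sketch); train and Clicked are hypothetical names
library(DMwR)   # SMOTE
library(ROSE)   # random undersampling

# Random undersampling: shrink the majority class until both classes are equal
n_min <- min(table(train$Clicked))
under <- ovun.sample(Clicked ~ ., data = train, method = "under",
                     N = 2 * n_min, seed = 1)$data

# SMOTE: create synthetic minority cases interpolated between nearest neighbours
smoted <- SMOTE(Clicked ~ ., data = train, perc.over = 200, perc.under = 150)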

In an experiment conducted by Van Hulse et al. (2007), undersampling was found to outperform SMOTE on low-dimensional data, a finding confirmed by Wallace et al. (2011). However, according to Chawla et al. (2002), SMOTE outperformed undersampling. A study by Blagus and Lusa (2012) compared undersampling and SMOTE on six datasets; only the k-NN classifier showed a significant performance increase due to SMOTE, whereas the other classifiers benefited from undersampling. Although researchers still disagree on which class imbalance technique results in the best performance, most evidence points in the direction of undersampling.

H5: Undersampling outperforms SMOTE in terms of model performance in predicting click-through of banner advertisements

2.2.3 Calibration


Furthermore, according to Caruana and Niculescu-Mizil (2004), Platt’s scaling performs best with small datasets.

Platt (1999) proposed transforming SVM predictions into probabilities by running the predictions $f$ of the algorithm through a sigmoid function:

$$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)}$$

Parameters $A$ and $B$ are learned using maximum likelihood estimation, trained on a separate calibration set in order to prevent overfitting. Although originally developed for SVMs, this calibration technique also works well for other classification techniques.

H6: Calibration positively influences model performance in predicting click-through of banner advertisements

2.2.4 Feature selection

Variable selection is a key step for many machine learning techniques. In order to increase the comprehensibility of classification techniques, a reduced number of highly predictive variables is preferred (Verbeke et al. 2012). In addition, noisy or low-quality features can decrease a classifier's performance (Čehovin and Bosnić 2010). According to Guyon and Elisseeff (2003), variable selection has three objectives: improving predictive performance, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. Blum and Langley (1997) proposed viewing the set of relevant variables as the smallest set of features that achieves optimal performance.

Multiple methods of variable selection exist, which can be broadly classified into three categories. Filter methods use statistical measures to rank features according to their usefulness in explaining the dependent variable, relying solely on properties of the data; this makes them independent of the chosen classification technique. Wrapper methods use the machine learning technique itself as a tool, or 'black box', to select the most relevant variables: different combinations of variables serve as input for the technique under consideration, and its performance is used to score these combinations. In contrast, embedded methods score variables according to their relevance during the process of model training. Their design is coupled with specific algorithms, which limits their use for other algorithms. Therefore, the focus of this study is on filter methods and wrapper methods.


(2014), the performance of filter methods was satisfactory, taking their low complexity and low information use into consideration. Though wrapper methods made accurate selections, their impact on performance and selected variables differed per classification technique, even when the same training sets were used. They address the main advantages of filter methods, namely their speed of calculation, scalability to larger datasets, and the fact that they select relevant variables for any machine learning technique. Kumari and Swarnkar (2011) compared the wrapper and filter method by using a high dimensional dataset. They recommended filter approaches for fast data analysis but advise using wrapper approaches in order to better validate results. According to John et al. (1994), wrapper methods are superior to the filter selection methods because they use a separate evaluation function which has a bias that can differ from the classifier used to evaluate the final set.

Both a wrapper method and a filter method are applied to search for the optimal feature selection method. Since the previously cited authors agree on the influence of the different feature selection techniques, it is likely that both the wrapper and the filter method will result in significantly better performance. However, the wrapper method is frequently considered the best performing method.

H7a: The wrapper method positively influences model performance in predicting click-through of banner advertisements

H7b: The filter method positively influences model performance in predicting click-through of banner advertisements

H7c: The wrapper method outperforms the filter method in terms of model performance in predicting click-through of banner advertisements

Chapter 3: Methodology

3.1 Machine learning techniques

3.1.1 Logistic regression


must be applied to $p(X)$. In logistic regression, this is the logistic function, which results in the following:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}.$$

This function always results in an S-shaped curve, where no probability $p$ falls below 0 or above 1. Rearranging yields the log-odds:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p.$$

The left-hand side between the brackets shows the log-odds, and the right-hand side shows that a one-unit increase in $X_1$ increases the log-odds by $\beta_1$. The coefficients $\beta_0, \beta_1, \dots, \beta_p$ are unknown and need to be estimated. This is done with the maximum likelihood method: the coefficients are estimated such that the predicted probabilities are as close as possible to the observed outcomes.

Logistic regression is of interest because (1) it is popular within marketing contexts, (2) it is conceptually simple (Bucklin and Gupta 1992), (3) it is easy to interpret (Burez and Van den Poel 2009), and (4) it provides robust results compared to other techniques (Neslin et al. 2006).
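As an illustration of how such a model is estimated in R, a minimal sketch with hypothetical variable names (a training frame train, a hold-out frame test, and predictors written as syntactic names) is:

# Logistic regression via maximum likelihood; all object names are illustrative
fit <- glm(Clicked ~ SearchDepth + TimePerPage + Channel,
           data = train, family = binomial(link = "logit"))

summary(fit)       # log-odds coefficients with associated p-values
exp(coef(fit))     # odds ratios for easier interpretation
p_hat <- predict(fit, newdata = test, type = "response")   # predicted P(click)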

3.1.2 Decision trees

Decision trees are easy to interpret and can be understood intuitively (Ramasubramanian and Singh 2017). A decision tree consists of a series of splitting rules, starting at the top of the tree, called the root node, and ending at terminal or leaf nodes; trees are mostly visualized upside down so that the leaves are at the bottom. The predictor space is split at different points within the tree, called internal nodes. Variables higher in the tree are more important in determining to which category an observation belongs, or in this case, in deciding whether or not a website visitor will click on an advertisement. After the model is trained, the algorithm returns the tree structure as output, giving relevant insights for future business practices (Lantz 2015).


$$H(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)$$

Sets with high entropy are heterogeneous, providing little information about the other observations belonging to the set. With $n$ classes, entropy can take values from 0 to $\log_2(n)$, where 0 represents complete homogeneity and the maximum value represents complete heterogeneity: data which is as distinct as possible. In the formula above, $c$ refers to the number of class levels and $p_i$ to the proportion of values falling into class level $i$. Suppose a dataset $S$ with two class levels, where the proportion of examples falling into one class is $p$; applying the formula to all values of $p$ results in Figure 1.

Figure 1: Entropy for all values of p

The peak at $p = 0.50$ depicts the maximum entropy: a 50-50 split results in the maximum amount of heterogeneity. When deciding which feature to use for splitting, the change in homogeneity is calculated, commonly referred to as information gain:

$$IG(F) = H(S_1) - H(S_2)$$

Information gain is the difference between the entropy before the split ($S_1$) and after the split ($S_2$). To calculate the entropy after the split, the total entropy across all partitions needs to be considered; each partition's entropy is therefore weighted by the proportion of records falling into that partition. The higher the information gain, the better the feature is at decreasing heterogeneity. An information gain of 0 means that splitting on the feature does not decrease entropy.
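Both quantities can be computed directly; the following sketch in base R assumes a class vector y and a categorical feature, both hypothetical:

# Entropy of a class vector
entropy <- function(y) {
  p <- prop.table(table(y))
  p <- p[!is.na(p) & p > 0]          # guard against empty factor levels
  -sum(p * log2(p))
}

# Information gain of splitting y on a categorical feature:
# IG = H(before split) - weighted H(after split)
info_gain <- function(y, feature) {
  w <- prop.table(table(feature))    # partition weights |Si| / |S|
  h_split <- sum(sapply(split(y, feature), entropy) * w)
  entropy(y) - h_split
}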


3.1.3 Bagged decision trees

Bootstrap aggregation, or bagging, is the process of constructing multiple datasets by bootstrapping the training set with replacement. It is a useful procedure for reducing the variance of a decision tree (James et al. 2013). In the case of bagged decision trees, a decision tree is applied to all created datasets. The predictions of these trees are averaged, thereby reducing variance and increasing prediction accuracy (Breiman 1996).
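A minimal sketch with the ipred package named in Table 1 (object names and the number of bags are illustrative):

library(ipred)
# Bagged decision trees: nbagg bootstrap samples, one tree each, votes averaged
bag_fit <- bagging(Clicked ~ ., data = train, nbagg = 25, coob = TRUE)
bag_fit$err                                    # out-of-bag error estimate
p_bag <- predict(bag_fit, newdata = test, type = "prob")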

3.1.4 Boosted decision trees

Boosting refers to a generalizable and proven effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb. It is a general method for improving the accuracy of any given learning algorithm (Freund et al. 1999); in this context it is applied only to decision trees. When using boosting, trees are grown sequentially; each tree contains information from previously grown trees. Each consecutive iteration increases the weights of misclassifications and simultaneously decreases the weights of correct classifications, forcing the algorithm to focus on difficult cases (Lemmens and Croux 2006). Stated differently: a decision tree is grown using the residuals of the previously grown tree instead of the outcome $Y$. In many cases, such a combined series performs much better than the base function alone (Burez and Van den Poel 2009).
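With the C5.0 package named in Table 1, boosting is a single argument; a sketch, again assuming the hypothetical train/test objects with Clicked levels "no"/"yes":

library(C50)
# trials > 1 grows trees sequentially, reweighting misclassified cases
boost_fit <- C5.0(Clicked ~ ., data = train, trials = 20)
p_boost <- predict(boost_fit, newdata = test, type = "prob")[, "yes"]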

3.1.5 Random forests

Dudoit et al. (2002) report suboptimal performance of decision trees. Breiman (2001) proposed a solution for these drawbacks, resulting in a new technique called random forests. A selection of randomly chosen variables (fewer than the total number of variables) is used to grow a tree, based on a bootstrap sample of the training data. This process is repeated in order to create a large set of trees in which each tree votes for the most popular class, called a forest. Allowing only a selection of variables prevents the continuous use of one strong predictor in the top split, which could result in trees looking rather similar and being highly correlated.


(Coussement and Van den Poel 2006) and, most importantly, (5) performance is considered to be among the best of the different techniques (Luo et al. 2004).
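A minimal sketch with the randomForest package from Table 1 (parameter values are illustrative, not the tuned settings):

library(randomForest)
# Each split considers only mtry randomly chosen predictors, decorrelating
# the trees before their class votes are aggregated
rf_fit <- randomForest(Clicked ~ ., data = train,
                       ntree = 500, mtry = floor(sqrt(ncol(train) - 1)))
rf_fit$importance                    # internal variable importance measure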

3.1.6 Support vector machines (SVM)

SVM is a method which classifies cases using hyperplanes. A hyperplane is a subspace of dimension $p - 1$. In two dimensions the hyperplane is one-dimensional (a line), and in three dimensions it is two-dimensional (a plane); the concept extends to any number of dimensions. In the two-dimensional example, the hyperplane can be defined as

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0.$$

If a point $X = (X_1, X_2)^T$ satisfies this formula, $X$ lies on the hyperplane. However, if

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0,$$

$X$ lies on one side of the hyperplane. The hyperplane can therefore be seen as a way to divide the parameter space in half, and an observation is assigned a class depending on which side of the hyperplane it is located. The magnitude of the distance between the observation and the hyperplane is also of importance: a high magnitude indicates that the observation is located far from the hyperplane, giving high certainty that it belongs to the predicted class, whereas a small magnitude means the observation lies near the hyperplane, resulting in uncertainty about its class assignment.

In many cases, multiple hyperplanes that separate the training observations exist (James et al. 2013). In this case, it is useful to calculate how far every observation is located from the hyperplane; the smallest such distance is called the margin. As previously stated, the further away an observation $x$ is from the hyperplane, the higher the certainty of its assignment to a particular class. Consequently, the hyperplane with the largest margin, called the maximum margin classifier, is selected. Some observations may lie directly on the margin; when these observations are moved slightly, the margin also changes, whereas moving an observation outside the margin does not affect it. Observations on the margin are therefore called support vectors.


the margin. Instead of constructing the maximum margin classifier, the algorithm now focuses on minimizing the total cost $C$.

In the explanation above, observations were linearly separable. However, in practice, this often is not the case. The support vector machine is an extension of previous principles which can handle nonlinearity by enlarging the feature space using kernels. By doing so, the problem is mapped into a higher dimension, making the separation linear again.
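A sketch with e1071 (Table 1); the radial kernel and its parameter values are illustrative choices, not the tuned settings:

library(e1071)
# cost trades margin width against margin violations; gamma shapes the RBF kernel
svm_fit <- svm(Clicked ~ ., data = train, kernel = "radial",
               cost = 1, gamma = 0.1, probability = TRUE)
p_svm <- attr(predict(svm_fit, test, probability = TRUE),
              "probabilities")[, "yes"]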

3.2 Performance metrics

It is difficult or even impossible to set up a model which classifies all examples from the test set correctly. Therefore, a model is chosen which has a minimal amount of loss, measured by a specific performance measure. These suboptimal models differ from their ‘true’ model in multiple ways. These deviations are reflected in the different performance metrics. The decision of which metric to use depends on the problem, the learning algorithm, and how the predictions will ultimately be used (Caruana and Niculescu-Mizil 2004). Optimizing a model according to a certain performance metric can influence the choice of the model, making the selection of performance metrics an essential step.

A commonly used performance metric is the Area Under the receiver operating Characteristic curve (AUC), which has proven to be a good performance metric (Ling and Li 1998; Burez and Van den Poel 2009). AUC is based on the ROC plot, which plots the true positive rate against the false positive rate, i.e. sensitivity vs. (1 - specificity), where

$$\text{sensitivity} = P(\text{Pred} = \text{positive} \mid \text{True} = \text{positive})$$

and

$$\text{specificity} = P(\text{Pred} = \text{negative} \mid \text{True} = \text{negative}).$$


Caruana and Niculescu-Mizil (2004) conducted a large-scale empirical analysis comparing nine different performance metrics for classification algorithms. They introduce a new metric, SAR, which combines squared error, accuracy, and ROC area, because these metrics were found to be the most consistent. Squared error can be defined as

$$RMSE = \sqrt{\frac{1}{N} \sum \left(\text{Pred}(C) - \text{True}(C)\right)^2}.$$

In addition, accuracy can be defined as the number of correct classifications divided by the total size of the dataset (Caruana and Niculescu-Mizil 2004). The authors propose SAR as a good general-purpose metric to use when more specific criteria are unknown. Given its general applicability, SAR is used as a second performance measure to compare models when AUC does not show any significant differences.
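Assuming predicted click probabilities p_hat and a hold-out set test with factor Clicked (levels "no"/"yes") from the earlier sketches, AUC and SAR can be computed with the packages from Table 1; the SAR formula (ACC + AUC + (1 - RMSE)) / 3 follows Caruana and Niculescu-Mizil (2004):

library(ROCR)
library(Metrics)

pred_obj <- prediction(p_hat, test$Clicked)
auc <- performance(pred_obj, measure = "auc")@y.values[[1]]
plot(performance(pred_obj, "tpr", "fpr"))      # ROC curve

y   <- as.numeric(test$Clicked == "yes")
acc <- mean((p_hat > 0.5) == y)                # 0.5 cut-off is illustrative
sar <- (acc + auc + (1 - rmse(y, p_hat))) / 3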

AUC and SAR are insensitive to calibration, making them unsuitable for measuring the effect of calibration. Therefore, LogLoss (LL) is used to evaluate performance before and after calibration, defined as:

$$LL = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log(p_{i,j}),$$

where $N$ is the number of observations, $M$ the number of class labels, $\log$ the natural logarithm, $p_{i,j}$ the predicted probability that observation $i$ belongs to class $j$, and $y_{i,j}$ equal to 1 if observation $i$ is in class $j$ and 0 otherwise. LogLoss has a minimum value of 0 (in case of perfect predictions) and no maximum value. As the certainty of a correct classification increases, LogLoss gradually approaches 0; as the certainty of a wrong classification increases, the metric grows rapidly without any upper bound. Confident misclassifications are thus penalized more heavily than misclassifications with low predicted probabilities. Refer to Figure 2 for a visual representation of this process.
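For binary classification the double sum reduces to the familiar two-class form; a small helper, with an epsilon guard that is a common practical addition rather than part of the definition:

# Binary LogLoss; eps avoids log(0) for overconfident predictions
logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
logloss(y, p_hat)   # y and p_hat as in the sketch above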


3.3 Feature selection

Information gain is used as the selection criterion for filter feature selection. The entropy-based information gain is calculated between each feature and the target variable individually:

$$IG(A) = H(S) - \sum_{i} \frac{|S_i|}{|S|} H(S_i),$$

where $H(S)$ is the entropy of the total dataset and $H(S_i)$ the entropy of subset $S_i$, created by partitioning dataset $S$ on variable $A$. Information gain measures the amount of information a feature carries, where high information gain denotes strong power in classifying the data. In a comparative study on feature selection, Yang and Pedersen (1997) found information gain to be among the best performing criteria.

Sequential backward search is used as the wrapper method. In this case, the algorithm is applied to all but one variable present in the dataset, resulting in n different models which are trained and compared to each other. Thereafter, the variable which was not present in the model with the best performance is eliminated. This process is repeated until there are no variables left. Sequential search strategies are considered to be fast wrapper methods. In addition, the backward algorithms often result in better performance compared to a forward search algorithm, where the feature selection process consists of sequentially adding the best performing variable to the model (Kudo and Sklansky 2000).
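A sketch of the backward loop; eval_auc is a hypothetical helper that trains the chosen classifier on the given features and returns its cross-validated AUC:

backward_search <- function(data, target, eval_auc) {
  feats <- setdiff(names(data), target)
  best  <- list(auc = eval_auc(data, target, feats), feats = feats)
  while (length(feats) > 1) {
    # AUC of the model with each remaining feature left out in turn
    scores <- sapply(feats, function(f) eval_auc(data, target, setdiff(feats, f)))
    feats  <- setdiff(feats, names(which.max(scores)))   # drop the least useful
    if (max(scores) > best$auc) best <- list(auc = max(scores), feats = feats)
  }
  best   # best-scoring feature subset found along the way
}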

3.4 Data analysis procedure

All data analysis takes place in the statistical program R. Refer to Table 1 for the specific packages and functions used.

The dataset is randomly divided into a training set, a cross-validation set, and a test set, in a ratio of 60%:20%:20% respectively (Ramasubramanian & Singh 2017). Training data is used for training the algorithm. The test dataset contains data points which the algorithm has not yet processed, in order to measure its predictive power on new data. The cross-validation set is used to estimate the performance of calibration.
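A minimal sketch of this split (the seed and the data frame name dat are illustrative):

set.seed(42)
n   <- nrow(dat)
idx <- sample(seq_len(n))                                  # random permutation of rows
train <- dat[idx[1:floor(0.6 * n)], ]
cv    <- dat[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]   # calibration set
test  <- dat[idx[(floor(0.8 * n) + 1):n], ]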


According to Gao et al. (2015), filter feature selection yields the best results when performed on the complete dataset, before sampling. Sampling, in turn, should also be performed on the full dataset. Therefore, sampling is performed first, while filter feature selection uses the full, initial dataset. As previously mentioned, training on imbalanced data would result in a ‘null’ prediction system; to prevent distorted results, sampling is therefore performed before the wrapper feature selection process starts, following the procedure of Al-Shahib et al. (2005). The wrapper feature selection process is conducted using ten-fold cross-validation.

Many algorithms have parameters that can be tuned, a process called hyperparameter optimization. For a C5.0 decision tree, these parameters are model and winnow (in addition to trials, the number of iterations, for boosting). SVM requires a specification of the kernel function in addition to the cost associated with violating the margin, where large values for cost result in narrow margins; depending on the kernel function, additional parameters may be specified. Random forests require a specification of mtry: the number of variables considered at every split (Lantz 2013). Bagging, performed via the package ipred, does not have any parameters to optimize (Kuhn 2008). Hyperparameter optimization is executed by conducting a grid search using ten-fold cross-validation, and is performed for every combination of sampling and feature selection by making use of the caret package.
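For the C5.0 case, such a grid search might look as follows (grid values are illustrative; the outcome factor is assumed to have valid R level names such as "no"/"yes", as caret requires for class probabilities):

library(caret)
grid <- expand.grid(model  = c("tree", "rules"),
                    winnow = c(TRUE, FALSE),
                    trials = c(1, 10, 20))
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
fit  <- train(Clicked ~ ., data = train, method = "C5.0",
              metric = "ROC", trControl = ctrl, tuneGrid = grid)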

When calibration is performed on the training dataset, unwanted bias is introduced. Therefore, an independent calibration set is needed: the cross-validation set. The trained algorithm is applied to the cross-validation set, and the results are put through a sigmoid in the form of a logistic regression model. The LogLoss after calibration is compared to the LogLoss before calibration. The same process is repeated on the test data; first the trained model without calibration is used, followed by the procedure of running the output of the algorithm through a sigmoid.
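A sketch of this Platt-style step, assuming the caret model fit and the cv/test sets from the earlier sketches:

# Fit the sigmoid on the calibration set: raw scores f -> calibrated probabilities
f_cv  <- predict(fit, newdata = cv, type = "prob")[, "yes"]
platt <- glm(y ~ f, family = binomial,
             data = data.frame(y = cv$Clicked == "yes", f = f_cv))

# Apply the learned mapping to the raw test-set scores
f_test  <- predict(fit, newdata = test, type = "prob")[, "yes"]
p_calib <- predict(platt, newdata = data.frame(f = f_test), type = "response")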

In order to draw relevant conclusions about the effects of the different algorithms, sampling techniques, and feature selection techniques on model performance, the trained algorithm is applied to a randomly selected bootstrapped sample of the dataset (sampling with replacement), which is repeated 15 times. The set.seed function is used to ensure that all algorithms are applied to the same sampled datasets.
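A compact sketch of one such evaluation loop (seed and object names illustrative):

library(ROCR)
set.seed(123)   # same seed for every algorithm -> identical bootstrap samples
aucs <- replicate(15, {
  boot <- test[sample(nrow(test), replace = TRUE), ]   # bootstrap the test set
  p    <- predict(fit, newdata = boot, type = "prob")[, "yes"]
  performance(prediction(p, boot$Clicked), "auc")@y.values[[1]]
})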


Because all algorithms are evaluated on the same bootstrap samples, the resulting performance metrics are paired instead of independent. The different metrics yielded per technique and algorithm are therefore compared with paired samples t-tests. When the sample size is equal to or larger than 30, the t distribution can be used to estimate a confidence interval (Mann 2010). Comparisons of performance with regard to the sampling scheme and feature selection meet this criterion, since these comparisons consist of sample sizes of 45 and 30 respectively. A sample size lower than 30, which is the case when the performance of the best performing algorithm settings is compared, requires a normal distribution of the paired differences (Mann 2010). Therefore, a Shapiro-Wilk test is performed to check the normality assumption. If this assumption is violated, the Wilcoxon test is used to compare performance. Using paired samples t-tests to compare model performance is common practice (e.g. Hung et al. 2006; Caruana and Niculescu-Mizil 2006; Chen et al. 2012).
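Given two vectors of 15 bootstrap AUCs, auc_a and auc_b (hypothetical names), the test cascade is:

d <- auc_a - auc_b
if (shapiro.test(d)$p.value > 0.05) {
  t.test(auc_a, auc_b, paired = TRUE)        # normality not rejected
} else {
  wilcox.test(auc_a, auc_b, paired = TRUE)   # non-parametric fallback
}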

Variable importance is uncovered by using the varImp-function of the Caret package. In the case of decision trees, the variable importance is calculated as the decrease in entropy attributed to each variable at every split. The same procedure is used for bagged decision trees and boosted decision trees. The variable importance of Random Forests is calculated by making use of the internal measure for variable importance. In the case of an SVM, which does not have an internal measure for variable importance, a filter approach is used to calculate the variable importance. In this case, an ROC-curve analysis is performed on each predictor. Subsequently, for every class the highest AUC is used as the variable importance measure.
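For a caret model object fit, both routes are available (a sketch):

library(caret)
varImp(fit)                     # model-specific importance, where available
varImp(fit, useModel = FALSE)   # ROC-based filter importance, as used for the SVM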


Table 1: Used R-packages per process

Process                                     Package
Decision trees and boosted decision trees   C5.0
Bagged decision trees                       ipred
Random forests                              randomForest
Support Vector Machine                      e1071
Feature selection                           mlr
SMOTE                                       DMwR
Undersampling                               ROSE
ROC-curves/AUC                              ROCR
RMSE (to calculate SAR)                     Metrics
Variable importance                         caret

Chapter 4: Data

4.1 Data collection

This research makes use of website data from a web shop which offers a broad range of products in the entertainment sector, and targets customers in the Netherlands and Belgium. However, in order to set up a targeted prediction model, data should be more focused. Therefore, because of the frequent use of display advertisements and extensive browsing patterns, only data regarding visitors to the classical music category is used.

4.2 Data description

The dataset contains information about browsing behaviour as reported by Google Analytics, enriched with information from internal data sources. The time span ranges from 1 January 2017 to 18 March 2017 (12 weeks), resulting in 19,358 raw observations. The dataset consists of different variables concerning browsing behaviour and demographics (see Table 2). Refer to Table 3 for a detailed overview of all the categories.

4.3 Data inspection and cleaning

In order to conduct a relevant analysis, the dataset needs to be cleaned by removing missing values, outliers, and other anomalies. The data contains session regions ranging from the United States to China, often accompanied by a bounce. Since the web shop only targets and ships to customers in the Netherlands and Belgium, traffic outside these regions is considered invalid. In total, 945 observations are discarded.



Table 2: Operationalization of variables

Variable                  Coding        Explanation of coding
Session Count             Integer       Total number of sessions by individual user; subsequent sessions do not change previous values
Days Since Last Session   Integer       Days elapsed since user previously visited the website
Device Category           Categorical   Type of device used
Region                    Categorical   Region of user according to IP address
Landing Page Category     Categorical   The page via which the user accesses the website
Channel                   Categorical   The channel via which the user accesses the website
Hour Index                Categorical   Index for the hour in which the session started
New User                  Dummy 0/1     1 when new user
Bounce                    Dummy 0/1     1 when bounce
Session Duration          Integer       Duration of session in seconds
Page Views                Integer       Number of page views
Unique Searches           Integer       Number of unique searches
Search Depth              Integer       Number of search result pages viewed
Products Removed          Integer       Number of products removed from cart
Day                       Categorical   Day on which session started
Age                       Categorical   Age category of user
Gender                    Dummy 0/1     1 when male
Advertisement Clicked     Dummy 0/1     1 when clicked

Table 3: Different categories of categorical variables

Device Category: Desktop, Mobile, Tablet
Landing Page Category: Catalogue, Composer Catalogue, Customer Account, Genre Catalogue, Home Page, Internal Search Result, Other Catalogue, Product Page
Region: Brussels, Drenthe, Flandres, Flevoland, Friesland, Gelderland, Groningen, Limburg, North Brabant, North Holland, Overijssel, South Holland, Utrecht, Walloon Region, Zeeland
Channel: Direct, Email, Organic Search, Other, Paid Search, Referral, Social
Hour Index: 0-23
Day: Monday - Sunday
Gender: Male, Female


In order to be able to predict clicks, data from bounces is not relevant. Therefore, 7786 additional observations are omitted. Box plots are used to search for outliers. All variables show large outliers, but many of them are not necessarily abnormal.

Variable Days Since Last Session has an average of around 5.8 days, but a maximum registered value of 182 days. Since the cookies used for this measurement exist for 2 years, 182 days cannot be considered invalid. Observations with large values for Page Views are always combined with high values for Session Duration, indicating actively browsing website visitors. Class Walloon Region in variable Region contains only 36 observations; the same holds for Hour Index 2-6 (totalling 107 observations). Since the dataset is split multiple times, it may be that none of these observations fall into the training set, and when the algorithm encounters these categories during cross-validation or evaluation stages, problems arise. Therefore, Hour Index 2-6 is combined into the category Night, and category Walloon Region is combined with category Brussels into the category French Speaking Belgium, with 140 observations.

4.4 Missing values

Variables Region and Landing Page Category show missing values of 0.42% (46 observations) and 0.63% (67 observations) respectively. Since these are relatively small numbers, the influence of these missing values is considered limited, and the observations are deleted (Little and Rubin 2014). In addition, a large number of sessions (9.6%) do not contain information about session length. In order to cope with these missing values, it is important to know why they occur. Three types of missing data are distinguished: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). When the reason for missing observations is completely random, in other words, not related to any characteristics of the dataset, missing data is said to be MCAR. If the probability of an observation being missing depends on the underlying value of that observation, missing data is MNAR. Data is considered MAR when the probability of being missing depends on variables which are present in the dataset (Donders et al. 2006).


is caused by the value itself, since this has no influence on the measurement. The data is therefore assumed to be MAR. Consequently, multiple imputation is used to handle the missing values.

Multiple imputation (MI) can be seen as an extension of single imputation. Single imputation fills in a missing observation based on various other characteristics of the respective observation. However, this procedure often results in underestimated standard errors (Donders et al. 2006), since the imputed values are based on only one imputation. MI therefore creates multiple imputed datasets with different imputations based on different underlying distributions. MI is performed with the R package mice, with the default setting of 5 datasets and predictive mean matching as the imputation method.
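A sketch of this step (seed illustrative; dat is the cleaned dataset):

library(mice)
imp  <- mice(dat, m = 5, method = "pmm", seed = 1)   # 5 imputations via PMM
dat1 <- complete(imp, 1)                             # first completed dataset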

Table 4: Results of Little's MCAR test

Chi Square   Degrees of Freedom   P-value   Missing Patterns
1381.805     16                   0         2

4.5 Correlation and multicollinearity

Correlation implies a statistical relationship between two variables. When severe correlation within a set of independent variables exists, multicollinearity arises. In the case of logistic regression, this may result in inflated parameter estimates as well as sign changes of regression coefficients (Ramasubramanian and Singh 2017). Multicollinearity can be detected by means of a correlation matrix, which can be found in Appendix A1. Two strong correlations (>0.5) are found: between Session Duration and Page Views (0.614), and between Page Views and Search Depth (0.518).

Another way to detect multicollinearity is by means of Variance Inflation Factors (VIF), defined as

$$VIF_j = \frac{1}{1 - R_j^2},$$

in which $R_j^2$ is the coefficient of determination of a regression of variable $j$ on all the other variables. Multicollinearity exists when the tolerance is less than 0.20 or when the VIF is higher than 5 (Ramasubramanian and Singh 2017). However, no extreme VIF scores are observed; see Appendix A2 for the complete table.
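As a sketch of the check, VIFs can be obtained from the car package (an assumption; the thesis does not name the package used) by regressing the outcome on all predictors:

library(car)
# Values above 5 would indicate multicollinearity; factors yield generalized VIFs
vif(lm(as.numeric(Clicked) ~ ., data = dat))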


To resolve the correlation, a new variable Time Per Page is created, where the duration of the session is divided by the number of page views, resulting in the time spent on a page in seconds. After this transformation, no strong correlations were found anymore.

Time Per Page shows extreme outliers, with a maximum value of 17110 seconds. However, extensive information booklets are included on some product pages and this may take considerable time to read. Also, these high values combined with multiple page views may still indicate browsing behaviour instead of invalid traffic or abandoned sessions. Therefore, observations with only 1 page view exceeding the arbitrarily chosen 30 minutes (1800 sec.) are regarded as outliers, following the procedure of White and Drucker (2007). This results in 25 deletions.

4.6 Descriptive statistics

The final model contains 10,489 observations. Of these observations, 64.3% are male. During the reported time frame, the click-through rate of banner advertisements was 6.1%, and 56% of these clicks were by men. The total click-through rate for women is 5.8%, and for men 6.2%. Most sessions took place on Friday; however, the average click-through rate was highest on Monday (see Table 5). Age category 65+ is the largest category (37.65%), followed by 55-64 (30.36%), 45-54 (14.91%), 35-44 (8.99%), 25-34 (6.37%), and lastly 18-24 (1.72%).

Table 5: CTR per day


Chapter 5: Results

5.1 Logistic regression

Feature selection was based on the filter method results, which can be found in Appendix B1. According to these results, one variable reports a high information gain (Search Depth), followed by a range of variables with lower, comparable information gains. Following Ramasubramanian and Singh (2017), the top 6 variables were used for model building: Search Depth, Time Per Page, Channel, Landing Page Category, Device Category, and New User. Note that filter feature selection thereby omitted all demographic variables.
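A sketch of such an information-gain filter; the FSelector package is one possible implementation (the thesis does not name one), and `train`/`click` are assumed names.

```r
library(FSelector)

# A minimal sketch: compute the information gain of every predictor
# with respect to the click outcome and keep the 6 highest-ranked ones.
weights <- information.gain(click ~ ., data = train)
top6 <- cutoff.k(weights, k = 6)
filter_formula <- as.simple.formula(top6, "click")
```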

The wrapper feature selection procedure suggested omitting the variables Hour Index, Unique Searches, Region, Day, Gender, and Age, resulting in an increase in test-AUC from 0.697 to 0.709; refer to Appendix B2 for a full overview of the steps. Here, too, removing demographic variables from the model increased performance. Logistic regression does not require any hyperparameters to be optimized, so no grid search was performed. The impact of feature selection, class imbalance techniques, and calibration was analysed using 15 resampled datasets; the average results are summarized in Table 6.
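The wrapper procedure can be sketched as a backward elimination guided by test-AUC; the helper below is illustrative only (assumed names throughout), not the thesis' actual code.

```r
library(pROC)

# A minimal backward-elimination wrapper sketch: repeatedly drop the
# variable whose removal yields the largest gain in test-AUC, until no
# removal improves the AUC anymore.
wrapper_backward <- function(train, test, outcome = "click") {
  auc_of <- function(v) {
    m <- glm(reformulate(v, outcome), data = train, family = binomial)
    p <- predict(m, newdata = test, type = "response")
    as.numeric(auc(test[[outcome]], p))
  }
  vars <- setdiff(names(train), outcome)
  best <- auc_of(vars)
  while (length(vars) > 1) {
    cand <- vapply(vars, function(x) auc_of(setdiff(vars, x)), numeric(1))
    if (max(cand) <= best) break
    best <- max(cand)
    vars <- setdiff(vars, names(which.max(cand)))
  }
  list(variables = vars, auc = best)
}
```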

Table 6: Overall average performance after resampling of a logistic regression

Feature selection   Class imbalance technique   AUC     SAR     LogLoss before   LogLoss after
no                  SMOTE                       0.667   0.658    6.176           0.227
no                  undersampling               0.689   0.611   12.853           0.223
filter              SMOTE                       0.677   0.655    4.142           0.225
filter              undersampling               0.689   0.611   12.853           0.228
wrapper             SMOTE                       0.704   0.630    7.501           0.219
wrapper             undersampling               0.688   0.646    3.487           0.223

Best performing combinations are marked in blue.
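The two class imbalance schemes compared in Table 6 can be sketched as follows; the DMwR package is one common SMOTE implementation (the thesis does not name one), and the object names are assumptions.

```r
library(DMwR)  # provides SMOTE(); one possible implementation

# SMOTE: synthesize additional minority (click) cases and downsample
# the majority class. 'click' must be a factor.
train_smote <- SMOTE(click ~ ., data = train,
                     perc.over = 200, perc.under = 200)

# Random undersampling: keep all clicks and draw an equally sized
# random sample of non-clicks.
pos <- train[train$click == "yes", ]
neg <- train[train$click == "no", ]
train_under <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])
```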


According to paired samples t-tests, the average AUC of SMOTE was significantly higher than that of undersampling, p=0.002. The AUC of wrapper feature selection (0.689) was also significantly higher than the AUC of filter feature selection; t(29)=2.129, p=0.041. In addition, calibration significantly influenced performance, decreasing LogLoss from 7.835 to 0.223 (t(89)=17.509, p=0.000). According to Table 6, the wrapper/SMOTE combination results in the best performance. To check whether this model outperformed the other models, a Shapiro-Wilk test was first used to verify the normality assumption of a paired samples t-test, since the number of samples was only 15. This test was not significant (W(15)=0.947, p=0.483), so the paired samples t-test could be used to check for significant differences. Several combinations did not perform significantly worse; the first significant difference appeared with the wrapper/undersampling combination (t(14)=2.195, p=0.05).
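In base R, this test sequence amounts to the following; `auc_a` and `auc_b` are assumed vectors of 15 resampled AUCs for two combinations.

```r
# A minimal sketch of the comparison procedure: first verify the
# normality of the paired differences, then run the paired t-test.
shapiro.test(auc_a - auc_b)
t.test(auc_a, auc_b, paired = TRUE)
```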

In conclusion, SMOTE significantly outperforms undersampling when combined with logistic regression. Wrapper feature selection results in the best performance in terms of AUC, followed by filter feature selection and the full model. Calibration improves the performance of the model in terms of LogLoss to a significant extent. The model using wrapper feature selection and SMOTE yields the highest AUC (0.704) and the lowest LogLoss after calibration (0.219), although two other combinations do not significantly underperform. Figure 3 and Figure 4 show the ROC plot and the reliability plot of this model, respectively.
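Platt's scaling, the calibration method used here, reduces to fitting a logistic regression on the raw model scores of a held-out set; a minimal sketch with assumed vector names:

```r
# A minimal sketch of Platt's scaling: 'val_scores'/'val_y' are raw
# scores and outcomes on a calibration set, 'test_scores' are the
# scores to be calibrated into probabilities.
platt <- glm(val_y ~ val_scores, family = binomial)
calibrated <- predict(platt,
                      newdata = data.frame(val_scores = test_scores),
                      type = "response")
```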


Logistic regression produces interpretable output: coefficients are reported along with their associated p-values. These results are therefore highly relevant for uncovering the variables that influence click behaviour. Since the model using SMOTE and wrapper feature selection resulted in the highest AUC, this model was used to gain insights. The parameter estimates of variables significant at p<0.1 are reported in Table 7; the full table can be found in Appendix B3.

According to the results, browsing through more search result pages leads to a higher probability of clicking on an advertisement, and a longer time per page has the same effect. However, new users are less likely to respond to advertisements, as are visitors whose last session was longer ago. Finally, browsing via mobile phones or tablets reduces the likelihood of clicking, and landing on a specific product page increases this likelihood, whereas landing on an internal search result decreases it.

Table 7: Parameters of logistic regression with significance p<0.1

Variable                                          β           Z
Intercept                                         -2.905***   -5.667
Search Depth                                       0.016***    5.153
Time Per Page                                      0.004***    8.545
Days Since Last Session                           -0.018***   -4.001
New User: Yes                                     -1.199***   -8.810
Device Category: Mobile                           -0.697**    -2.700
Device Category: Tablet                           -0.465*     -2.493
Channel: Organic Search                            2.979***    6.101
Channel: Direct                                    2.481***    4.873
Channel: Paid Search                               1.464**     2.937
Channel: Email                                     1.689***    3.514
Channel: Referral                                  1.740***    3.297
Landing Page Category: Internal Search Result     -0.831*     -2.470
Landing Page Category: Product Page                0.618***    3.557

Reference categories: Device Category: Desktop | Landing Page Category: Catalogue | New User: No | Channel: Other
*p<0.05 **p<0.01 ***p<0.001
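To illustrate the magnitude of these effects, the coefficients can be converted into odds ratios: exp(-1.199) ≈ 0.30 for New User: Yes means that, all else equal, the odds of a new user clicking are roughly 70% lower than those of a returning user, while exp(0.016) ≈ 1.016 for Search Depth implies about a 1.6% increase in the odds of clicking per additional result page browsed.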

5.2 Decision trees

For the filter method, the results of the previously performed filter selection (detailed in Appendix B1) were reused. The wrapper procedure suggested omitting the variables Hour Index, Region, and Device Category, increasing test-AUC from 0.661 to 0.715; refer to Appendix C1 for a full overview of the steps.
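A sketch of fitting such a tree with rpart; the balanced training set and object names are assumptions.

```r
library(rpart)

# A minimal sketch: fit a classification tree on a (balanced) training
# set and obtain click probabilities on the test set. 'train_bal' and
# 'test' are assumed names.
tree <- rpart(click ~ ., data = train_bal, method = "class")
pred <- predict(tree, newdata = test, type = "prob")[, 2]
```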


Table 8 sums up the average performance of a decision tree per setting, when applied to the bootstrapped datasets.

Table 8: Overall average performance after resampling of a decision tree

Feature selection   Class imbalance technique   AUC     SAR     LogLoss before   LogLoss after
no                  SMOTE                       0.703   0.352   0.660            0.220
no                  undersampling               0.652   0.290   0.814            0.225
filter              SMOTE                       0.737   0.378   0.597            0.215
filter              undersampling               0.701   0.275   0.673            0.219
wrapper             SMOTE                       0.704   0.354   0.567            0.218
wrapper             undersampling               0.718   0.285   0.626            0.213

Best performing combination is marked in blue.

To check for significant differences between the settings, multiple paired samples t-tests were performed. According to these t-tests, the average AUC of SMOTE (0.714) was significantly higher than the average AUC of undersampling (0.690), t(44)=4.366, p=0.000. The AUC after filter feature selection (0.719) differed significantly from the full model (0.677), t(29)=5.511, p=0.000. No difference was found between filter feature selection and wrapper feature selection (0.711), t(29)=1.088, p=0.285, which was confirmed by a comparison of the average SAR of filter feature selection (0.327) and wrapper feature selection (0.320), t(29)=1.691, p=0.102. Calibration influenced performance significantly (t(89)=51.975, p=0.000), decreasing LogLoss from 0.656 to 0.218. According to Table 8, the filter/SMOTE combination results in the best performance. A Shapiro-Wilk test was again used to verify the normality assumption of a paired samples t-test, since the number of samples was only 15; this test was not significant (W(15)=0.932, p=0.294), so the paired samples t-test could be used. The performance of the optimal combination differed significantly from the second-best performing combination (t(14)=2.855, p=0.013).


Filter feature selection combined with SMOTE resulted in the best performance, so this model was used to gain insights into variable importance. According to this procedure, the number of search result pages browsed and the time spent per page are the most important predictors, both with a variable importance of 100. The variables New User: Yes, Landing Page Category: Internal Search Result, and Channel: Referral follow with importances of 69.44, 57.48, and 55.50 respectively. For more information, refer to Appendix C4.
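rpart stores these importances on the fitted object; a minimal sketch, rescaled to the 0-100 range used above and reusing the `tree` object from the earlier sketch:

```r
# A minimal sketch: extract rpart's variable importance and rescale it
# so that the most important predictor scores 100.
imp <- tree$variable.importance
round(100 * imp / max(imp), 2)
```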

Valuable information can also be uncovered from the splits of the decision tree itself, which can be found in Appendix C5. According to this tree, a high time spent per page combined with viewing more than 3 result pages is likely to result in a click on a banner advertisement. Apparently, search behaviour together with a high time spent per page is an important predictor of click behaviour. The variable New User is often used in later splits as a final decision step; in these cases, new customers are likely to refrain from clicking on banner advertisements.

5.3 Bagged decision trees

Wrapper feature selection indicated the best performance when deleting the variables Gender, Hour Index, and Day, increasing test-AUC from 0.831 to 0.839 (refer to Appendix D1 for a full overview). Since the bagging function from the package ipred does not expose hyperparameters to tune, no grid search was performed. Table 9 sums up the average performance of bagged decision trees per setting after applying them to the bootstrapped datasets.
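A sketch of the bagging step with ipred; object names are assumptions, and nbagg is shown at its default of 25 bootstrap trees.

```r
library(ipred)

# A minimal sketch: bag classification trees and predict click
# probabilities; coob = TRUE additionally reports the out-of-bag error.
bag <- bagging(click ~ ., data = train_bal, nbagg = 25, coob = TRUE)
pred <- predict(bag, newdata = test, type = "prob")[, 2]
```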

Table 9: Overall average performance after resampling of bagged decision trees

Feature selection   Class imbalance technique   AUC     SAR     LogLoss before   LogLoss after
no                  SMOTE                       0.691   0.360   0.706            0.223
no                  undersampling               0.715   0.312   0.795            0.217
filter              SMOTE                       0.712   0.357   1.140            0.218
filter              undersampling               0.701   0.307   1.648            0.220
wrapper             SMOTE                       0.722   0.338   0.731            0.215
wrapper             undersampling               0.715   0.298   0.812            0.219

Best performing combinations are marked in blue.
