7/24/2017 | 1

To Click or Not to Click: Machine Learning Techniques for Predicting and Uncovering Influencers of Click Behaviour

Erwin Oosterhuis

s2211173

University of Groningen

First Supervisor: prof. dr. J.E. Wieringa

Second Supervisor: dr. J.E.M. van Nierop

Click-through rates have decreased to as low as 0.1% (MediaMind 2012).

Predicting clicks makes it possible to target advertisements, which improves CTR (Briggs and Hollis 1997, Sherman and Deighton 2001, Chandon et al. 2003, Chatterjee et al. 2003).

Two-fold research purpose

Research Question 1: How can machine learning techniques be used to develop an accurate display advertising click-through prediction model, providing further insights into open research issues in model building?

Research Question 2: What is the importance and influence of user demographics and browsing behaviour variables on click-through?

Hypotheses

Choice of algorithm

Before calibration, bagged decision trees have best performance

Before calibration, logistic regression has worst performance

After calibration, boosted decision trees have best performance

Calibration

Calibration positively influences model performance

Handling class imbalance

Undersampling outperforms SMOTE in terms of model performance

Feature selection

Wrapper feature selection positively influences model performance

Filter feature selection positively influences model performance

Wrapper feature selection outperforms filter feature selection in terms of performance

Importance of the datatypes

Combining demographic variables and browsing behaviour variables results in better performance than using either type separately

Influence of demographic variables

Age positively influences click behaviour

Women show higher click behaviour than men

Influence of browsing behaviour variables

Search behaviour is positively associated with click behaviour

Store patronage is positively associated with click behaviour

Browsing via tablets or mobile phones negatively influences click behaviour

› 12 weeks of Analytics data, enriched with internal data = 19,358 raw observations

› 9.6% missing values for session length

• Little’s MCAR test: p < 0.001, so the data is not MCAR (missing completely at random)

• Assumption: data is MAR (missing at random)

• Multiple imputation (MI) to impute values

› Final dataset consists of 10,489 observations

› Analysis takes place in R

Methodology

› Data set (Boosted dataset: resampled 15 times)

› Handling class imbalance: SMOTE or undersampling

› Feature selection: full, filter or wrapper

› Calibration: no calibration or calibration

› Algorithms: LR, DT, boosting, bagging, RF, SVM

• Grid search for optimal parameter settings

• 10-fold cross-validation

› Performance: AUC, SAR, LogLoss

• SAR = 1/3 × (AUC + ACC + (1 − RMSE))

› Comparing performance: Shapiro-Wilk test for normality assumption

• p < 0.05: Wilcoxon test

• p > 0.05: paired samples t-test
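The three performance measures on this slide can be sketched in a few lines. The block below is an illustrative Python sketch using only the standard library, not the R code used in the analysis. It assumes SAR is the plain average of accuracy, AUC and (1 − RMSE), that accuracy uses a 0.5 cut-off, and that AUC is computed by pairwise comparison. All function names and the toy data are hypothetical.

```python
import math

def accuracy(y_true, p_pred, threshold=0.5):
    # Share of observations whose thresholded prediction matches the label.
    # The 0.5 cut-off is an assumption, not taken from the slides.
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

def rmse(y_true, p_pred):
    # Root mean squared error between predicted probability and 0/1 label.
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true))

def auc(y_true, p_pred):
    # Probability that a random click is scored above a random non-click;
    # ties count as half a win. Quadratic in sample size, fine for a sketch.
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, p_pred, eps=1e-15):
    # Mean negative log-likelihood; probabilities are clipped to avoid log(0).
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def sar(y_true, p_pred):
    # SAR = 1/3 * (ACC + AUC + (1 - RMSE)); higher is better.
    return (accuracy(y_true, p_pred) + auc(y_true, p_pred)
            + (1 - rmse(y_true, p_pred))) / 3

# Toy data (hypothetical): 1 = click, 0 = no click.
y = [1, 0, 0, 1, 0]
p = [0.9, 0.2, 0.4, 0.6, 0.7]
print(auc(y, p), sar(y, p), log_loss(y, p))
```

AUC and SAR are higher-is-better, while LogLoss is lower-is-better, so the three measures cannot be ranked on a single scale without flipping the sign of one of them.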

Results Research Question 1

› Random forests result in highest performance before calibration

› SVM results in worst performance before calibration

› Calibration significantly improves performance of all algorithms

› Boosted decision trees result in highest performance after calibration

› Feature selection improves model performance (except for SVM and boosting); which selection procedure works best differs

Results Research Question 2

› Only bagged decision trees benefitted from combination of data

› Age and gender do not influence click behaviour

› Search behaviour is a strong predictor of click behaviour

• Search depth has positive influence

• Number of unique searches is important predictor

› Store patronage is strong predictor of click behaviour

• New users are less likely to click on advertisements

• Days since last session negatively associated with click behaviour

• Session count is important predictor

› Browsing via mobile phones and tablets decreases click behaviour

Additional results

› Time per page is most important predictor of click behaviour

› The acquisition channel influences click behaviour

• Organic search and direct traffic have largest positive influence

› Landing page influences click behaviour

• Landing on product page has positive influence

• Internal search result has negative influence

Conclusions

Research Question 1

1. ‘There is no such thing as a free lunch in statistics’. Take an empirical approach.

2. Calibration and feature selection (in almost all cases) increase model performance. Wrapper feature selection is preferred.

3. Take SMOTE into consideration when handling class imbalance.

4. Use random forests for highest performance. Or, when using calibration, combine it with boosted decision trees.

Research Question 2

1. Focus on data quality instead of quantity. Browsing behaviour variables are better predictors than demographic variables.

2. Incorporate search behaviour data, store patronage data and data on time spent per page.

3. Critically assess advertising on
