7/24/2017 | 1
To Click or Not to Click: Machine Learning
Techniques for Predicting and Uncovering
Influencers of Click Behaviour
Erwin Oosterhuis
s2211173
University of Groningen
First Supervisor: prof. dr. J.E. Wieringa
Second Supervisor: dr. J.E.M. van Nierop
Click-through rates have decreased to as low as 0.1% (MediaMind 2012).
Predicting clicks makes it possible to target advertisements, which improves CTR (Briggs and Hollis 1997; Sherman and Deighton 2001; Chandon et al. 2003; Chatterjee et al. 2003).
Two-fold research purpose
Research Question 1: How can machine learning techniques be used to develop an accurate display advertising click-through prediction model, providing further insights into open research issues in model building?
Research Question 2: What is the importance and influence of user demographics and browsing behaviour variables on click-through?
Hypotheses
Choice of algorithm
Before calibration, bagged decision trees have best performance
Before calibration, logistic regression has worst performance
After calibration, boosted decision trees have best performance
Calibration
Calibration positively influences model performance
Handling class imbalance
Undersampling outperforms SMOTE in terms of model performance
Feature selection
Wrapper feature selection positively influences model performance
Filter feature selection positively influences model performance
Wrapper feature selection outperforms filter feature selection in terms of performance
Importance of the datatypes
Combining demographic variables and browsing behaviour variables results in better performance than using either type of variable separately
Influence of demographic variables
Age positively influences click behaviour
Women show higher click behaviour than men
Influence of browsing behaviour variables
Search behaviour is positively associated with click behaviour
Store patronage is positively associated with click behaviour
Browsing via tablets or mobile phones negatively influences click behaviour
› 12 weeks of Analytics data, enriched with internal data: 19,358 raw observations
› 9.6% missing values for session length
• Little’s MCAR test: p < 0.001, so the data are not MCAR
• Assumption: data are MAR
• Multiple imputation (MI) to impute values
› Final dataset consists of 10,489 observations
› Analysis takes place in R
Methodology
› Data set: boosted dataset, resampled 15 times
› Class imbalance: SMOTE or undersampling
› Feature selection: full, filter or wrapper
› Algorithms: LR, DT, boosting, bagging, RF, SVM
› Calibration: no calibration or calibration
› Grid search for optimal parameter settings
› 10-fold cross-validation
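As an illustration of the undersampling option, a minimal sketch in Python follows. The original analysis was done in R, so this is not the code used in the study; the function name `undersample`, the toy data and the fixed seed are all assumptions made for the example.

```python
import random
from collections import Counter


def undersample(rows, labels, seed=0):
    """Randomly drop rows from the larger class(es) until every class
    has as many rows as the smallest class. Hypothetical helper."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Size of the smallest class (here: the clicks).
    n_min = min(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sampling without replacement keeps n_min rows per class.
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(label)

    # Shuffle so the classes are not stored in contiguous blocks.
    order = list(range(len(out_rows)))
    rng.shuffle(order)
    return [out_rows[i] for i in order], [out_labels[i] for i in order]


# Toy data (invented): 10 clicks (label 1) among 100 sessions.
rows = [[i] for i in range(100)]
labels = [1 if i < 10 else 0 for i in range(100)]

bal_rows, bal_labels = undersample(rows, labels)
class_counts = Counter(bal_labels)
```

All minority-class rows are kept and only the majority class shrinks. SMOTE, by contrast, keeps the majority class and synthesises new minority-class rows.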
› Performance: AUC, SAR, LogLoss
• SAR = 1/3 × (AUC + ACC + (1 − RMSE))
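SAR is the mean of accuracy, AUC and 1 − RMSE. The sketch below shows how the three performance measures could be computed from predicted click probabilities. It is written in Python for illustration, not taken from the R analysis; the 0.5 classification threshold, the function names and the toy data are assumptions.

```python
import math


def accuracy(y_true, p_hat, threshold=0.5):
    # Share of observations classified correctly at the assumed threshold.
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_hat))
    return hits / len(y_true)


def rmse(y_true, p_hat):
    # Root mean squared error between predicted probability and outcome.
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true))


def auc(y_true, p_hat):
    # Probability that a random click is scored above a random non-click;
    # ties count as half. Quadratic in sample size, fine for a sketch.
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def log_loss(y_true, p_hat, eps=1e-15):
    # Mean negative log-likelihood; probabilities are clipped to avoid log(0).
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def sar(y_true, p_hat, threshold=0.5):
    # SAR = (ACC + AUC + (1 - RMSE)) / 3
    return (accuracy(y_true, p_hat, threshold)
            + auc(y_true, p_hat)
            + (1 - rmse(y_true, p_hat))) / 3


# Toy example (invented): 1 = click, 0 = no click.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.4]
sar_score = sar(y, p)
ll_score = log_loss(y, p)
```

Higher AUC and SAR indicate better performance, whereas lower LogLoss is better. RMSE and LogLoss depend on the predicted probabilities themselves, so they are the measures that respond to calibration.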
› Shapiro-Wilk test for normality assumption
• p<0.05: Wilcoxon test
• p>0.05: paired samples t-test
Results Research Question 1
› Random forests result in highest performance before calibration
› SVM results in worst performance before calibration
› Calibration significantly improves performance of all algorithms
› Boosted decision trees result in highest performance after calibration
› Feature selection improves model performance (except for SVM and boosting); which selection procedure works best differs
Results Research Question 2
› Only bagged decision trees benefitted from combination of data
› Age and gender do not influence click behaviour
› Search behaviour is a strong predictor of click behaviour
• Search depth has positive influence
• Number of unique searches is important predictor
› Store patronage is strong predictor of click behaviour
• New users are less likely to click on advertisements
• Days since last session negatively associated with click behaviour
• Session count is important predictor
› Browsing via mobile phones and tablets decreases click behaviour
Additional results
› Time per page is most important predictor of click behaviour
› The acquisition channel influences click behaviour
• Organic search and direct traffic have largest positive influence
› Landing page influences click behaviour
• Landing on product page has positive influence
• Internal search result has negative influence
Conclusions
Research Question 1
1. ‘There is no such thing as a free lunch in statistics’. Take an empirical approach.
2. Calibration and feature selection (in almost all cases) increase model performance. Wrapper feature selection is preferred.
3. Take SMOTE into consideration when handling class imbalance.
4. Use random forests for highest performance. Or, when using calibration, combine it with boosted decision trees.
Research Question 2
1. Focus on data quality instead of quantity. Browsing behaviour variables are better predictors than demographic variables.
2. Incorporate search behaviour data, store patronage data and data on time spent per page.
3. Critically assess advertising on mobile devices.
4. Focus on acquisition via organic search and direct traffic.