
Master’s Thesis

Speed is key:

Decreasing time till convergence on

CTR prediction models

Thijs van Velzen

Student number: 10379924

Date of finished version: July 28, 2017

Master’s programme: Econometrics

Specialisation: Big Data Analytics

Supervisor: Dr. K. Pak

Second reader: Dr. N. P. A. Giersbergen

Faculty of Economics and Business


Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis, for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)

(c) Introduction

(d) Theoretical background

(e) Model

(f) Data

(g) Empirical Analysis

(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the heading of the sections. You have a free choice how to list your references but be consistent. References in the text should contain the names of the authors and the year of publication. E.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and year of publication in case of the first reference and use the first name and et al and year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number

(d) Date of submission final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics


This document is written by Thijs van Velzen, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
2 Literature Review
   2.1 System overview
   2.2 Recent works
3 Data
   3.1 Avazu
   3.2 Evaluation
4 Stochastic Gradient Descent
   4.1 Optimization algorithms
   4.2 Results
5 FTRL
   5.1 FTRL algorithm
   5.2 AdPredictor
   5.3 Results
6 FTRDL
   6.1 Follow-The-Regularized-Dynamic-Leader
   6.2 Results
   6.3 Logistic vs. Probit
   6.4 Failed experiments
7 Conclusion
A Diebold-Mariano significance test
   A.1 Derivation of the Diebold-Mariano test
   A.2 Diebold-Mariano test results


Chapter 1

Introduction

In the online world we live in today, online advertising is a booming business with an estimated revenue growth of 18.2 billion dollars in the US between 1998 and 2007 (Chapelle et al., 2015; Ha, 2008). One way to advertise online is by using banner ads. Often the banner spots are auctioned by real-time bidding, so it is important for companies to have ads that make a positive impact (Yuan et al., 2013).

One way to measure this is by looking at the click-through rate (CTR) of an ad. By looking at the historical data you can predict the CTR of the ad, based on the characteristics of the viewer. The advertisements all rely heavily on the accuracy, speed and reliability of the predictions made by learned models (Richardson et al., 2007). These learned models have shown in the past that they can do just that.

However, since the size of databases is increasing faster than the power of processors, models and computers begin to struggle to handle all this data. Stochastic or online gradient descent models have been developed to tackle this problem (Bottou, 2010). Instead of feeding all the data at once to the model, the data is fed in subsets. After each subset, the weights are updated until convergence.

Data can be reused as well, in so-called epochs, which can lead to better convergence of the weights. However, since it can take a lot of epochs to get the optimal result, this can lead to a loss in training speed. Since speed is an important factor in predicting click-through rates, it is important to have a model that can handle the data size while producing accurate predictions without losing training speed.

In this thesis, the main objective is to find a new and improved algorithm for predicting the CTR. Because memory usage has been researched before (Graepel et al., 2010; McMahan et al., 2013), this paper is focused on the training speed of the models. Here, training speed is the time it takes for the algorithm to converge to its optimal solution. The objective is to explore the issues in training speed of known models and to improve their speed by adjusting them. Moreover, since increasing the training speed of the models can possibly lead to a decrease in prediction accuracy, the aim is to get a model that increases the training speed without a loss of accuracy in the predictions.

To answer these questions, data from Avazu is used. This data was supplied to competitors in Kaggle's click-through rate prediction competition. With this data, predictions are made using various published models, as well as the models proposed in this paper. The predictions are evaluated by means of the logistic loss. This is explained further in chapters 2 and 3.

The remainder of this thesis is organized as follows. Before the proposed algorithm of this research is explained in chapter 6, the techniques and models of other studies are discussed in chapter 2. Chapter 3 provides more information about the dataset from Avazu, before the algorithms of recent works are discussed in chapters 4 and 5. The results are shown in chapters 4, 5 and 6 as well, showing the overall progress of the models. Finally, conclusions and future directions are given in chapter 7.


Chapter 2

Literature Review

The process of CTR forecasting starts with the dataset. Since click datasets are large, simple prediction models like ordinary least squares and maximum likelihood cannot be used: these prediction models require all the data at once, and feeding the entire click dataset to the model at once would flood the memory of a regular computer, making it unable to produce any predictions. Other models that are able to handle big datasets might require too much power for a processor to handle. And finally, there are models that can handle both big datasets and the processor problem, but lose accuracy or training speed in doing so.

In this chapter, a few methods and models are discussed, along with why they can be useful in this particular research. First, a brief CTR system overview is presented in section 2.1. In that section, problems concerning the system are also mentioned. Finally, in section 2.2, models developed in other studies are discussed.

2.1 System overview

When users browse online, they often find advertisements on the webpages. Companies can buy the advertising spot on that page, often by real-time bidding (Yuan et al., 2013). When a user loads the webpage, an auction opens where interested companies can bid on advertising for that particular user. The auction mechanism then determines which company gets the banner spot.

For companies it is important to know what impact their advertisement can have on user i. Since every banner costs money, it is essential that the advertisement eventually leads to the user buying a product from the company, a so-called conversion. By estimating the CTR for that particular user i, the company can decide whether to invest in the auction or not. However, some ads have different purposes. Some are used just for name recognition, while others are directly aimed at selling a product (Anderl et al., 2016). Therefore, the effect of an ad cannot be measured by the CTR alone in all instances. Since this research is aimed at predicting the CTR, it is assumed that every ad is meant to be clicked.

Since the model estimates the probability of a click, its outcome should be between 0 and 1. This is not really an issue, since there are plenty of binary functions that can deal with this kind of data (Gunduz and Fokoue, 2015). More important, however, is picking a binary function that suits click data. Since click-through rates are usually around 0.1% (Stec, 2015), the model should be able to handle skewed data.

In estimating the CTR, companies often are able to see the features of the user. These features can be plentiful and can include the user ID, what device he is using and the type of connection he is currently using (Agarwal et al., 2009). By using these features, or a combination of them, the company has more knowledge about the user's behaviour, which helps in predicting the CTR.

However, these features are often plentiful and can contain thousands if not millions of different values (Graepel et al., 2010; McMahan et al., 2013). Since a lot of features have binary values, for example you either have a certain device or not, this can lead to extremely sparse data. Therefore, not every feature value is relevant for a certain user i. A CTR prediction model should be able to deal with this sparse data.

Most models tend to use all available data at once, setting the weights and then using the weights to predict new impressions. Since banner-ad datasets are usually very large, this can be computationally costly (Bishop, 2006). Moreover, since multiple new impressions can become available every single second (Sherman and Deighton, 2001), it can be preferable to have a model that can use this new information on the fly. A model that can predict the impression and afterwards update the weights with the newly acquired result is optimal. This way, the model is able to deal with new features directly instead of waiting for the model to be trained again.

2.2 Recent works

Since the online advertising business is growing, interest in better prediction models has been rising as well. In the last couple of years multiple models have been developed by specialists from various companies. In this section, a couple of these models are discussed.

In one of the first of these studies, a model was developed that was able to predict the CTR for sponsored search advertising for Microsoft's Bing search engine (Graepel et al., 2010). They started with a generalised linear model with a probit link function, which is able to deal with the binary nature of clicks. They use Bayes' rule to update their model in two ways. First, they use a training example and a prior to infer a new posterior. Next, they use the posterior and feature vector to infer the predictive distribution.

The used weights are the means and the variances of each feature vector. By using stochastic gradient descent (SGD), they update the weights after each training instance. What separates their model from regular SGD is the use of a Gaussian correction. More details on this can be found in their article (Graepel et al., 2010), but in summary this allows the model to update the weights according to how "surprising" the observation is. The amount of change is also proportional to the variance, and since every observation leads to a reduction of the variance, this leads to higher learning rates at the start. This way the model steps away more quickly from the prior weights. In case you don't want to diverge from the prior weights, Graepel et al. (2010) also included a variant of their model with an exponential moving average, where the weights are updated by a weighted average of the prior and posterior weights. This allows the model to deal with changes in the environment (Graepel et al., 2010).

To evaluate their model, they use a variant of the logistic loss function called the relative information gain (RIG). This allows them to see how accurate the model is. They compare it with predictions made by a regular Naïve Bayes model. The AdPredictor model provides more accurate and more spread-out predictions, which allows the model to give more informative predictions. However, the AdPredictor model doesn't seem to have regularization to deal with overfitting, which might cause problems with forecasting (He et al., 2014).

Another model that does try to deal with overfitting is the Follow-The-Regularized-Leader-Proximal (FTRL-Proximal) model (McMahan et al., 2013). Instead of using a probit regression model, McMahan et al. (2013) use a logistic regression, which can deal with binary outcomes as well. A full explanation of the used algorithm is provided in chapter 5. In their model, McMahan et al. (2013) add L1 and L2 regularization terms. L1 regularization deals with overfitting the coefficients: instead of having a lot of non-zero coefficients, L1 induces sparsity, therefore preventing overfitting. This leads to better out-of-sample accuracy. L2 regularization, on the other hand, penalizes the model for having large coefficients. Large coefficients can cause the model to produce high CTR predictions while only one weight is positive, resulting in inaccurate predictions. In training their model, McMahan et al. (2013) also use a per-coordinate learning scheme, which allows each feature to have its own learning rate based on the amount of impressions of that feature. This way, new features can learn faster than frequent features.

In testing the model, they compared their model with various other models. The results showed that FTRL-Proximal gives better sparsity than regular SGD, while the accuracy loss caused by L1 regularization is very small. They also noticed that their model gives better predictions when the CTR of the dataset is large, around 50%, as opposed to a small CTR of around 2%, which is a more realistic CTR.

An adaptation of the FTRL-Proximal model of McMahan et al. (2013) is the follow-the-regularized-factored-leader (FTRFL) model (Ta, 2015). In his model, Ta (2015) adds second-order factorization machines (FM) to the FTRL-Proximal model, based on the work of Rendle (2010). A normal linear model follows

y(x, w) = \sum_{j=1}^{N} w_j x_j \qquad (2.1)

with $w_j$ the weights of input features $x_j$. This, however, imposes restrictions on the amount of information you can get from the features. By using factorization machines, Ta (2015) wants to capture interaction between features in the hope of getting more information out of the features. He therefore changes the linear model from (2.1) to

y(x, w) = \sum_{i=1}^{N} \sum_{j>i}^{N} \langle v_i, v_j \rangle x_i x_j \qquad (2.2)

He applies (2.2) to the FTRL-Proximal model of McMahan et al. (2013). Ta (2015) notes that for his model to work on a dataset of 45 million impressions, he used 28 CPUs with a shared memory of 139 GB. His FTRFL model outperformed a normal SGD model with FM implemented. He also notes that it converges faster than the SGD model, but does not report the total computation time needed for the model.

Finally, a model with a proven reputation is the Field-aware Factorization Machine (FFM) (Juan et al., 2016). This model uses the same idea as the model of Ta (2015). The FFM model has been used successfully in two competitions on Kaggle, winning both the Criteo display advertising challenge (Juan et al., 2011) and the Avazu Click-Through Rate Prediction competition (Juan et al., 2014). In the FM model, every feature has only one latent vector to learn the effect with any other feature. For example, for an impression of an ad from Nike on the website of ESPN being read by a male, we get

y(x, w) = w_{ESPN} \cdot w_{Nike} + w_{ESPN} \cdot w_{Male} + w_{Nike} \cdot w_{Male} = w_{ESPN} \cdot (w_{Nike} + w_{Male}) + w_{Nike} \cdot w_{Male}

So here $w_{ESPN}$ is used to learn the latent effect of both (ESPN, Nike) and (ESPN, Male). However, since the fields of Nike and Male are different, the latent effects of (ESPN, Nike) and (ESPN, Male) might be different as well.

FFM, however, assumes that the effect of ESPN is indeed different for Nike and Male. Therefore each feature has several latent vectors, one for each field. In this example, A stands for the field of Advertiser, P for Publisher and G for Gender. This results in the following

y(x, w) = w_{ESPN,A} \cdot w_{Nike,P} + w_{ESPN,G} \cdot w_{Male,P} + w_{Nike,G} \cdot w_{Male,A}

This time $w_{ESPN,A}$ is used to learn the latent effect of (ESPN, Nike), while $w_{ESPN,G}$ is used for (ESPN, Male), therefore returning more accurate effects. In summary, this results in

y(x, w) = \sum_{i=1}^{N} \sum_{j>i}^{N} (v_i \cdot v_j) x_i x_j \qquad (2.3)
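To make the difference between the FM interaction (2.2) and the field-aware interaction concrete, the short sketch below scores a single impression under both schemes. It is a minimal illustration with randomly drawn toy latent vectors and an assumed feature-to-field mapping; it is not the implementation of Juan et al. (2016).

```python
import numpy as np

# Toy example: three active features with their fields (assumed for illustration).
features = ["ESPN", "Nike", "Male"]
fields = {"ESPN": "Publisher", "Nike": "Advertiser", "Male": "Gender"}
k = 4  # latent dimension

rng = np.random.default_rng(0)
v_fm = {f: rng.normal(size=k) for f in features}                       # FM: one vector per feature
v_ffm = {(f, fld): rng.normal(size=k)                                  # FFM: one vector per (feature, field)
         for f in features for fld in fields.values()}

def fm_score(active):
    """Equation (2.2): sum over feature pairs of <v_i, v_j> (all x_i = 1 here)."""
    s = 0.0
    for a in range(len(active)):
        for b in range(a + 1, len(active)):
            s += v_fm[active[a]] @ v_fm[active[b]]
    return s

def ffm_score(active):
    """Field-aware variant: each feature uses its vector for the other feature's field."""
    s = 0.0
    for a in range(len(active)):
        for b in range(a + 1, len(active)):
            i, j = active[a], active[b]
            s += v_ffm[(i, fields[j])] @ v_ffm[(j, fields[i])]
    return s

print(fm_score(features), ffm_score(features))
```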

In their paper, Juan et al. (2016) also explain the use of epochs. To get better results, they feed the data to the model multiple times; each pass is called an epoch. It is shown that using multiple epochs results in the LogLoss score converging to a more optimal score than using only 1 epoch. They also note the importance of early stopping, since the model starts to diverge from the optimal solution after using too many epochs. Finally, they show the results they got by comparing the FFM model with linear models, poly2 models and factorization models with different parameters on both datasets (Criteo and Avazu). The results show that FFM indeed outperformed all the other models on these datasets. However, they also notice that the computation time increases significantly when using models other than the linear model. For weaker processors this can be worrisome.


Chapter 3

Data

Before the models are explained in chapters 4, 5 and 6, we first take a look at the data. This is done in section 3.1. With this data, the models can be trained and eventually evaluated. How the evaluation is performed is explained in section 3.2.

3.1 Avazu

In this paper, a CTR dataset from Avazu is used. This dataset was made available for a competition on Kaggle.[1] It contains over 40 million click impressions with 24 variables for mobile banner ads. One-hot encoding and feature hashing from Vowpal Wabbit are used to create $2^{23}$ features. Since 40 million impressions are too many for the computer used in this research, a single Intel Core i5-4460 processor with 16 GB of memory, 10% of the data is randomly sampled for use. This sample is then split up into a training and a test set. Since the dataset contains click impressions for 10 days, the first 9 days are used for the training set and the 10th day as the test set. The training set is used to train the models, while the predictions are made on the test set. The average CTR of the three datasets is almost equivalent at 0.17. This is quite high for a CTR dataset, since CTRs usually are around 0.001 (Stec, 2015). A brief summary is given in table 3.1.

Dataset        Impressions   Variables   Features   CTR
Full set       40,428,967    22          2^23       0.1698
Training set   3,621,307     22          2^23       0.1698
Test set       421,744       22          2^23       0.1700

Table 3.1: Summary of used dataset

[1] Avazu Click-Through Rate Prediction: https://www.kaggle.com/c/avazu-ctr-prediction

As previously mentioned, the data contains click impressions for mobile banner ads. These impressions have information on 24 fields. A lot of these fields have encoded values. For example, site id has a value of 1fbe01fe. It is therefore not easy to say what the true values of these fields are. This doesn't affect the prediction, only the data exploration. The fields of the dataset are:

• id: identifies which impression we are dealing with.

• click: shows whether an ad is clicked (1) or not (0).

• hour: gives the time of the impression in YYMMDDHH format.

• banner position.

• site related variables, consisting of the id, the domain and the category. The id gives the id of the site the banner is hosted on, the domain gives the site domain, for example example.com, and the category shows the type of site, like a news site.

• app related variables, consisting of the id, the domain and the category. The sub-categories are the same as for the site field.

• device related variables, consisting of the id, the ip, the model, the type and the connection type. The id gives the unique device ID, the device ip stores the IP address of the device, model gives the model number of the device, and type the kind of device that is used (like computer or smartphone). Finally, connection type stores the connection that is used. This could be Wi-Fi or 4G for example.

• 9 anonymized categorical variables.

From this dataset, the id variable is removed. From hour the first six numbers, the year, month and day, are removed, so only the hours are used. The hours should contain the most information and since the data set only contains ten days total, the year, month and day aren’t as valuable. Then 22 variables and 1 class vector click remain. This is the dataset that is used in the training and testing of the models. From this point forward, the variables are denoted by x, while the clicks are denoted by y.
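The preprocessing and the hashing trick described above can be sketched as follows. The file path, the use of Python's built-in hash and the exact encoding of "field=value" strings are choices of this sketch, not taken from the thesis; in practice a stable hash (as in Vowpal Wabbit) would be used.

```python
import csv

D = 2 ** 23  # number of hashed feature slots, matching the text

def to_features(row):
    """One-hot encode a raw impression and hash it into D slots (hashing trick)."""
    row = dict(row)
    row.pop("id", None)                      # drop the identifier
    row["hour"] = row["hour"][6:8]           # keep only the hour from YYMMDDHH
    # Hash "field=value" strings; collisions are accepted, as with Vowpal Wabbit.
    return [hash(f"{field}={value}") % D for field, value in row.items() if field != "click"]

with open("train.csv") as f:                 # assumed path to the Avazu training file
    for row in csv.DictReader(f):
        y = int(row["click"])
        x = to_features(row)                 # indices of the active (one-valued) features
        # ... feed (x, y) to an online learner, e.g. the SGD sketch in chapter 4
        break
```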


3.2 Evaluation

The goal of this paper is to find a model that predicts CTRs while keeping the training of the model fast. In doing so, the probability of a click is determined by

CTR = P(y = 1 \mid x, w) \qquad (3.1)

with $w$ the weights of the model. The function $P(y \mid x, w)$ should be able to map a linear model to $[0, 1]$. Some functions that can be used are discussed in chapter 4. Since probabilities are computed, a common way of evaluating them is by looking at the resulting LogLoss (logistic loss). The objective of every prediction model is to minimize the errors. Since the goal is to predict $P(y = 1 \mid x, w) = p$ and $P(y = 0 \mid x, w) = 1 - p$, this can be generalised to

P(y \mid x, w) = p^{y} (1 - p)^{1 - y}

By taking the negative log-likelihood, this results in the logistic loss (LogLoss)

LogLoss = \frac{1}{N} \sum_{i=1}^{N} \left( -y_i \log p_i - (1 - y_i) \log(1 - p_i) \right) \qquad (3.2)

The LogLoss measures the uncertainty of the probabilities in the model by comparing them with the true result. If a prediction is close to the true value, the error is small and the LogLoss function gives it a small penalty close to zero. If, however, the prediction is completely wrong, the penalty can become large, easily surpassing 1. It is therefore rewarding to make more thought-out predictions instead of predicting 0 or 1. The more certain the model is in its predictions, the more willing it is to take the risk of predicting values that are close to 0 and 1. Therefore a low LogLoss score means that the model is more certain about its predictions.

This makes the LogLoss preferable over the SSE. Where the SSE only gives an indication of how wrong the predictions are, the LogLoss score gives information on how sure the model is about its predictions. The LogLoss score can therefore be used for decision making as well, while the SSE, due to y only taking the values 0 or 1, only indicates in what confidence interval the prediction might lie.
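As an illustration of (3.2) and of how heavily confident mistakes are penalized, a small sketch; the clipping constant is a choice of this example, used only to avoid taking the logarithm of zero.

```python
import numpy as np

def logloss(y, p, eps=1e-15):
    """Average logistic loss of predictions p for binary outcomes y, equation (3.2)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)  # avoid log(0)
    y = np.asarray(y, dtype=float)
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

y = [0, 0, 1, 0]
print(logloss(y, [0.17, 0.17, 0.17, 0.17]))  # benchmark-style guess: moderate penalty
print(logloss(y, [0.01, 0.05, 0.90, 0.02]))  # confident and right: small loss
print(logloss(y, [0.99, 0.05, 0.01, 0.02]))  # confident and wrong: the loss blows up
```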

The models are trained using multiple epochs, as considered by Juan et al. (2016). This could lead to lower LogLoss scores. Since using multiple epochs increases the total training time, the number of epochs needed for convergence is also evaluated. The average time required to run 1 epoch is also examined, to see if some models require more computation than others. Finally, since it can be valuable to know if predictions are significantly different from each other, a Diebold-Mariano test is performed. This test checks whether the predictions of the models are significantly different from each other. An extended explanation of the Diebold-Mariano significance test is given in appendix A.1.

All models are compared with a Benchmark estimate, which guesses the average click rate of the training set,

p_{Benchmark} = \frac{1}{N} \sum_{i=1}^{N} y_i,

and with Naive Bayes classification. This is formulated by

p_{Bayes} = \frac{P(y = 1) \prod_i P(X_i = x_i \mid y = 1)}{\sum_z P(y = z) \prod_i P(X_i = x_i \mid y = z)}

In this, $X_i$ stands for the feature and $x_i$ for the feature value. This classifier assumes that the $X_i \mid Y$ are independent. In other words, given the outcome of value $Y$, the value of feature $X_1$ is independent of feature $X_2$. In this paper, the features used for Naive Bayes classification are the banner position and three anonymized variables, c1, c16 and c18. These four features resulted in the lowest LogLoss score for the Naive Bayes classifier.
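A minimal sketch of the two baselines is given below. It assumes the raw Kaggle column names (click, banner_pos, C1, C16, C18) and uses Laplace smoothing with a fallback probability for unseen values; both of these are choices of the sketch, not stated in the text.

```python
import pandas as pd

def fit_naive_bayes(train, features, alpha=1.0):
    """Estimate P(y) and P(X_i = x_i | y) with Laplace smoothing (alpha is this sketch's choice)."""
    prior = train["click"].value_counts(normalize=True).to_dict()
    cond = {
        (f, y): (grp[f].value_counts() + alpha) / (len(grp) + alpha * train[f].nunique())
        for f in features
        for y, grp in train.groupby("click")
    }
    return prior, cond

def predict_nb(row, prior, cond, features):
    """p_Bayes for one impression: joint likelihood for y = 1 divided by the sum over both classes."""
    scores = {}
    for y in (0, 1):
        s = prior[y]
        for f in features:
            s *= cond[(f, y)].get(row[f], 1e-6)   # unseen value: small fallback probability
        scores[y] = s
    return scores[1] / (scores[0] + scores[1])

# Benchmark: always predict the average CTR of the training set.
# p_benchmark = train["click"].mean()
# prior, cond = fit_naive_bayes(train, ["banner_pos", "C1", "C16", "C18"])
```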


Chapter 4

Stochastic Gradient Descent

As previously established in section 3.2, the models compute probabilities $P(y \mid x, w)$. One of the more popular functions that can transform the linear model $y = w^T x$ to a $[0, 1]$ map is the logistic regression model (Juan et al., 2016; McMahan et al., 2013; Ta, 2015), defined by

P(y = 1 \mid x, w) = \frac{1}{1 + \exp(-w^T x)}

Another option is the probit regression model, which is used in the AdPredictor model (Graepel et al., 2010). This model is defined by

P(y \mid x, w) = \Phi\left( \frac{y \cdot w^T x}{\beta} \right)

However, for now only logistic regression is considered, unless noted otherwise.
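The two link functions are easy to compare numerically. A small sketch, assuming SciPy for the standard normal CDF and setting β to 1 purely for illustration:

```python
import numpy as np
from scipy.stats import norm

def p_logistic(w, x):
    """Logistic link: P(y = 1 | x, w) = 1 / (1 + exp(-w'x))."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def p_probit(w, x, beta=1.0):
    """Probit link as in AdPredictor: P(y = 1 | x, w) = Phi(w'x / beta)."""
    return norm.cdf(np.dot(w, x) / beta)

w = np.array([0.4, -1.2, 0.1])
x = np.array([1.0, 1.0, 0.0])
print(p_logistic(w, x), p_probit(w, x))
```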

A problem, however, is the estimation of the weights w. Since the datasets are large, simple OLS or ML estimation won't work, since it requires too much memory. Stochastic Gradient Descent solves this problem by using only part of the data and updating the weights accordingly after each iteration.

Algorithm 1 Stochastic Gradient Descent
  g_t ← ∇f(w_{t−1})
  w_t ← w_{t−1} − η g_t

In this algorithm, $\nabla f(w_{t-1})$ denotes the gradient of the LogLoss function, which happens to be equal to $(p_t - y_t) x_t$. This can be shown easily (note that $p_t$ is a function of $w$):

\nabla LogLoss = -\sum_{i=1}^{N} \frac{\partial}{\partial w} \left( y_i \log p_i + (1 - y_i) \log(1 - p_i) \right)
             = -\sum_{i=1}^{N} \left( \frac{y_i}{p_i} p_i' - \frac{1 - y_i}{1 - p_i} p_i' \right)
             = -\sum_{i=1}^{N} \left( \frac{y_i - p_i}{(1 - p_i) p_i} \right) p_i'

For the logistic regression, the value of $p_i'$ can be evaluated by

p_{i,logistic}' = \frac{\partial}{\partial w} \frac{1}{1 + \exp(-w^T x)} = \frac{\partial}{\partial w} \frac{\exp(w^T x)}{1 + \exp(w^T x)}
               = \frac{x \exp(w^T x)(1 + \exp(w^T x)) - x \exp(w^T x) \exp(w^T x)}{(1 + \exp(w^T x))^2}
               = \frac{x \exp(w^T x)}{(1 + \exp(w^T x))^2} = x\, p_i (1 - p_i)

Substituting this into the last line of the LogLoss gradient gives $\sum_{i=1}^{N} (p_i - y_i) x_i$.

The algorithm also uses a learning rate $\eta$ for its update. This learning rate can be constant, but it can also be updated using the number of training instances (He et al., 2014). Besides the constant learning rate, the global and per-weight schemes are also used in this paper. These are defined by

w_{t,global} \leftarrow w_{t-1,global} - \frac{\eta}{\sqrt{N(t)}} g_t, \qquad w_{t,weight} \leftarrow w_{t-1,weight} - \frac{\eta}{\sqrt{N_i(t)}} g_t

with $N(t)$ the number of impressions up to that point and $N_i(t)$ the number of impressions at time $t$ for feature $i$.

However, although the global and per-weight schemes are adaptive learning rate schemes, there are a lot of more advanced adaptations of the basic SGD algorithm. These are discussed in section 4.1. In section 4.2 the algorithms that are discussed in this chapter are trained and briefly evaluated. When no learning rate scheme is mentioned, the learning rate remains constant.
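A minimal sketch of one SGD pass with the three basic schemes, operating on the sparse 0/1 features described in chapter 3; the dictionary-based storage and the clamping constant are choices of this sketch.

```python
import math
from collections import defaultdict

def sgd_epoch(data, eta=0.01, scheme="constant"):
    """One SGD pass for logistic regression on sparse 0/1 features.

    data yields (x, y) pairs, with x a list of active (hashed) feature indices.
    scheme is 'constant', 'global' (eta / sqrt(N(t))) or 'per_weight' (eta / sqrt(N_i(t))).
    """
    w = defaultdict(float)        # weights, stored sparsely
    counts = defaultdict(int)     # N_i(t): impressions seen per feature
    n = 0                         # N(t): impressions seen in total
    for x, y in data:
        n += 1
        z = max(min(sum(w[i] for i in x), 35.0), -35.0)   # clamp to avoid overflow
        p = 1.0 / (1.0 + math.exp(-z))                    # logistic prediction
        g = p - y                                         # gradient factor (p_t - y_t), x_i = 1
        for i in x:
            counts[i] += 1
            if scheme == "constant":
                rate = eta
            elif scheme == "global":
                rate = eta / math.sqrt(n)
            else:                                         # per weight
                rate = eta / math.sqrt(counts[i])
            w[i] -= rate * g
    return w
```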

4.1 Optimization algorithms

In this section, various SGD optimization algorithms are discussed. These are also discussed in the articles of Dozat (2016) and Ruder (2016). For a full explanation on the various algorithms however, it is advised to read the articles cited within the text.

A problem SGD can have is that it struggles with local optima. The learning process can be slowed down by these local optima, since it tends to oscillate between the slopes of the local optima (Ruder, 2016). Momentum is used to push SGD in the relevant direction, accelerating the process. A process that does this is the classical momentum process (Sutton, 1986). By adding a fraction of the past update to the current update, it is able to push the process out of local minima much quicker.

An extension to the classical momentum is Nesterov's accelerated gradient (Nesterov, 1983). Since classical momentum has the tendency to accelerate too much, pushing the process away from the optimum, Nesterov uses the expectation of the next momentum update in the current update. By doing this, the process slows down when a new optimum has been reached, so it does not overshoot the optimum (Bengio et al., 2013).

Algorithm 2 Momentum based
  g_t ← (p_t − y_t) x_t
  m_t ← γ m_{t−1} + g_t
  w_t ← w_{t−1} − η m_t              (Classical momentum)
  w_t ← w_{t−1} − η (γ m_t + g_t)    (Nesterov's accelerated gradient)
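A dense-vector sketch of the two updates in Algorithm 2, using NumPy arrays instead of the sparse storage above purely for brevity:

```python
import numpy as np

def momentum_step(w, m, g, eta=1e-4, gamma=0.9, nesterov=False):
    """One update of Algorithm 2: classical momentum or Nesterov's accelerated gradient."""
    m = gamma * m + g                      # m_t = gamma * m_{t-1} + g_t
    if nesterov:
        w = w - eta * (gamma * m + g)      # look one step ahead along the accumulated direction
    else:
        w = w - eta * m                    # classical momentum
    return w, m

# Example: w, m = momentum_step(np.zeros(3), np.zeros(3), np.array([0.2, -0.1, 0.4]), nesterov=True)
```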

Another class of optimization algorithms are the L2 norm-based algorithms. These are based on preventing the weights from increasing indefinitely, which could result in overfitting. Possibly one of the simplest algorithms which applies L2 regularization is the Adagrad algorithm (Duchi et al., 2011). Adagrad adapts its learning rate to the weights, where infrequent weights receive larger updates than common weights. It updates its weights based on the past gradients that have contributed to that weight.

Algorithm 3 Adagrad
  g_t ← (p_t − y_t) x_t
  n_t ← n_{t−1} + g_t²
  w_t ← w_{t−1} − (η / (√n_t + ε)) g_t    (ε small, to prevent dividing by 0)

However, Adagrad has an ever decreasing learning rate, which comes to a halt when n_t eventually becomes really large, preventing the model from reaching the local minimum. RMSProp is an alternative to Adagrad which solves that very problem by taking a decaying mean as the learning rate update (Tieleman and Hinton, 2012). Therefore the learning rate won't become infinitely small. This makes sure that the model keeps on learning from new impressions.

Algorithm 4 RMSProp
  g_t ← (p_t − y_t) x_t
  n_t ← γ n_{t−1} + (1 − γ) g_t²
  w_t ← w_{t−1} − (η / (√n_t + ε)) g_t    (ε small, to prevent dividing by 0)

Finally, the last algorithm that is discussed in this section is the ADAM algorithm. This algorithm combines the momentum-based with the L2 norm-based algorithms. It also uses a bias correction on its first and second moment estimates, making sure the initialization values won't interfere in the process (Kingma and Ba, 2014).

Algorithm 5 ADAM
  g_t ← (p_t − y_t) x_t
  m_t ← γ1 m_{t−1} + (1 − γ1) g_t
  n_t ← γ2 n_{t−1} + (1 − γ2) g_t²
  m̃_t ← m_t / (1 − γ1^t)
  ñ_t ← n_t / (1 − γ2^t)
  w_t ← w_{t−1} − (η / (√ñ_t + ε)) m̃_t    (ε small, to prevent dividing by 0)
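The three adaptive updates written out for a single step on dense vectors; ε = 1e-8 and the default rates follow table 4.1 where available, but the code itself is only a sketch of Algorithms 3 to 5.

```python
import numpy as np

def adagrad_step(w, n, g, eta=0.05, eps=1e-8):
    """Algorithm 3: accumulate squared gradients, shrink the rate per coordinate."""
    n = n + g ** 2
    return w - eta / (np.sqrt(n) + eps) * g, n

def rmsprop_step(w, n, g, eta=0.005, gamma=0.9, eps=1e-8):
    """Algorithm 4: decaying mean of squared gradients keeps the rate from vanishing."""
    n = gamma * n + (1 - gamma) * g ** 2
    return w - eta / (np.sqrt(n) + eps) * g, n

def adam_step(w, m, n, g, t, eta=0.001, g1=0.9, g2=0.999, eps=1e-8):
    """Algorithm 5: momentum plus RMSProp with bias-corrected moment estimates (t >= 1)."""
    m = g1 * m + (1 - g1) * g
    n = g2 * n + (1 - g2) * g ** 2
    m_hat = m / (1 - g1 ** t)
    n_hat = n / (1 - g2 ** t)
    return w - eta / (np.sqrt(n_hat) + eps) * m_hat, m, n
```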

4.2 Results

Figure 4.1: LogLoss scores for the learning algorithms introduced in chapter 4. Epochs are on the x-axis, the LogLoss score is on the y-axis.

The results of the algorithms are plotted in figure 4.1, while a summary of the results is shown in table 4.1.


Algorithm     Parameters                            LogLoss    Epochs   Time per epoch (min)   Time till convergence (min)   Improvement
Benchmark     -                                     0.455369   -        -                      -                             -
Naive Bayes   -                                     0.441787   -        -                      -                             -2.98%
Constant      η = 0.001                             0.402385   50       4.07                   203.5                         -11.64%
Global        η = 0.01                              0.404270   8        4.35                   34.8                          -11.22%
Per weight    η = 0.05                              0.403027   7        4.53                   31.71                         -11.49%
Momentum      η = 0.0001, γ = 0.9                   0.402423   50       4.53                   226.5                         -11.63%
Nesterov      η = 0.0001, γ = 0.9                   0.402403   50       4.49                   224.5                         -11.63%
Adagrad       η = 0.05                              0.402462   7        4.54                   31.78                         -11.62%
RMSProp       η = 0.005, γ = 0.90                   0.404575   6        4.87                   29.22                         -11.15%
Adam          η = 0.001, γ1 = 0.9, γ2 = 0.999       0.405170   5        6.27                   31.35                         -11.02%

Table 4.1: Summary of results and used parameters for the algorithms of chapter 4.

Note that the results of the benchmark and the naive Bayes estimator aren't shown in figure 4.1, since their LogLoss scores are much higher than those of the algorithms of chapter 4.

Figure 4.1 shows the LogLoss scores of the various algorithms against the number of epochs needed to reach that score. The RMSProp and Adam algorithms underperform in comparison with the other algorithms, having trouble with the sparse nature of the data. The models with a global and a per-weight learning rate show better scores at lower epochs, with the per-weight learning rate performing significantly better than the global learning rate, but both stop improving after 8 epochs; the increase after that point is most likely caused by overfitting. Adagrad has a similar performance, but its minimal LogLoss score is much closer to the overall lowest LogLoss score. Finally, the momentum and Nesterov algorithms show a similar performance to the constant learning rate and seem to produce the lowest LogLoss scores.

Table 4.1 gives more details about the results. The results show that there isn't a significant difference between the minimal LogLoss scores of the constant model, momentum, Nesterov and Adagrad. They all have an overall LogLoss score improvement of around 11.63% over the Benchmark. The other algorithms improve as well, with improvements between 11.02% and 11.49%. The main difference that can be noted between the algorithms is that the constant model, momentum and Nesterov need the full 50 epochs to reach their best LogLoss, while the others reach theirs in under 10 epochs. This shows the importance of early stopping, where the training should be stopped when the model stops improving.

Of all the algorithms, Adagrad has the most interesting results, since it reaches the same results as the best performing model, the constant, but in far fewer epochs. This results in Adagrad reaching its score in just 31 minutes, while a full run with a constant learning rate takes 3 hours and 24 minutes. Since the difference in improvement is just 0.02%, it could be beneficial to pick Adagrad over the constant model, since its computation time is much shorter. Even though momentum-based models should speed up the convergence process, this wasn't seen in the results. This could be due to the learning rate used in the models, since they used a constant learning rate. With a decreasing learning rate scheme, this might be different.

Finally, using the results from the Diebold-Mariano significance test, reported in appendix A.2, it is shown that the predictions of the Nesterov algorithm are not significantly different from those of the Momentum and Constant algorithms. This supports the results of figure 4.1, where the three showed similar trends. However, Momentum doesn't show any significant similarity in predictions with the constant model, with a p-value of only 0.002, despite showing similar LogLoss scores over the entire training.


Chapter 5

FTRL

It was shown in the last chapter that SGD models can be effective in training models for CTR prediction, while computing resources are kept at a minimum. In practice, however, it is also important to keep the memory usage low. Models can be stored sparsely, where the number of non-zero coefficients determines the memory usage. However, SGD is not effective at producing sparse models, since it considers all available features.

In this chapter, FTRL-Proximal is introduced. This model is able to produce sparser models, effectively leading to lower memory usage and even more accurate predictions (McMahan et al., 2013). In section 5.1 the model is explained, while results are reported in section 5.3.

5.1 FTRL algorithm

An option to induce sparsity is L1 regularization. L1 regularization penalizes the sum of the absolute values of the weights. This is mainly used to prevent overfitting, but it has the side effect of 'removing' features that aren't as valuable by setting their weights to zero. Therefore L1 actually induces sparsity as well.

FTRL-Proximal introduces both L1 and L2 regularization to the model. Where normal SGD performs the update

w_{t+1} = w_t - \eta_t g_t, \qquad (5.1)

the update for FTRL-Proximal is more complicated. The FTRL-Proximal algorithm instead uses

w_{t+1} = \arg\min_{w} \left( g_{1:t} \cdot w + \frac{1}{2} \sum_{s=1}^{t} \sigma_s \|w - w_s\|_2^2 + \lambda_1 \|w\|_1 \right) \qquad (5.2)

Here, $\sigma_s$ is defined such that $\sigma_{1:t} = \frac{1}{\alpha_t}$, and $\lambda_1$ is the term that adds the L1 regularization (McMahan et al., 2013). In $\sigma_{1:t}$, $\alpha$ is the learning rate scheme as a function of the initial learning rate $\eta$.

This looks a lot different from the regular SGD algorithm. However, by rewriting (5.2), this is the same as

\left( g_{1:t} - \sum_{s=1}^{t} \sigma_s w_s \right) \cdot w + \frac{1}{\alpha_t} \|w\|_2^2 + \lambda_1 \|w\|_1. \qquad (5.3)

Next, let $z_{t-1} = g_{1:t-1} - \sum_{s=1}^{t-1} \sigma_s w_s$ be a value that stores previous values; then at the beginning of round $t$ we update by using $z_t = z_{t-1} + g_t + \left( \frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}} \right) w_t$. If $w_{t+1}$ is then solved in closed form, this results in the update of FTRL-Proximal (McMahan et al., 2013)

w_{t+1,i} = \begin{cases} 0 & \text{if } |z_{t,i}| \leq \lambda_1 \\ -\alpha_t \left( z_{t,i} - \text{sgn}(z_{t,i}) \lambda_1 \right) & \text{otherwise} \end{cases} \qquad (5.4)

The complete algorithm is reported in algorithm 6. Notice that the used learning rate scheme is actually the same as Adagrad. Also notice that only two values have to be stored, namely $z_i$ and $n_i$, where the models of section 4.1 stored $w_i$ and $n_i$, or only $w_i$ in the case of a constant learning rate. Therefore FTRL-Proximal won't use more memory; in fact, since it induces sparsity, it uses even less memory than the models of section 4.1 (McMahan et al., 2013).

Algorithm 6 Adagrad FTRL-Proximal for Logistic Regression
Require: parameters η, β, λ1, λ2
  (∀i ∈ {1, . . . , d}), initialize z_i = n_i = 0
  for t = 1 to T do
    Receive feature vector x_t and let I = {i | x_i ≠ 0}
    For i ∈ I compute
      w_{t,i} = 0                                                   if |z_i| ≤ λ1
      w_{t,i} = −((β + √n_i)/η + λ2)⁻¹ (z_i − sgn(z_i) λ1)          otherwise
    Predict p_t = σ(x_t · w) using the computed w_{t,i}
    Observe click y_t ∈ {0, 1}
    for all i ∈ I do
      g_i = (p_t − y_t) x_i
      σ_i = (1/η) (√(n_i + g_i²) − √n_i)      (equals 1/α_{t,i} − 1/α_{t−1,i})
      z_i ← z_i + g_i − σ_i w_{t,i}
      n_i ← n_i + g_i²
    end for
  end for
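A compact Python sketch of Algorithm 6 for the sparse 0/1 features used here. The class layout, the dictionary storage and the overflow clamp are choices of this sketch; the default hyper-parameters follow table 5.1.

```python
import math
from collections import defaultdict

class FTRLProximal:
    """Sketch of Algorithm 6: Adagrad FTRL-Proximal for logistic regression on 0/1 features."""

    def __init__(self, eta=0.04, beta=1.0, l1=50.0, l2=50.0):
        self.eta, self.beta, self.l1, self.l2 = eta, beta, l1, l2
        self.z = defaultdict(float)
        self.n = defaultdict(float)

    def _weight(self, i):
        """Closed-form weight: zero if |z_i| <= lambda_1, shrunk value otherwise."""
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        return -(z - math.copysign(self.l1, z)) / ((self.beta + math.sqrt(self.n[i])) / self.eta + self.l2)

    def predict(self, x):
        """p_t = sigma(x_t . w) over the active feature indices x."""
        s = sum(self._weight(i) for i in x)
        return 1.0 / (1.0 + math.exp(-max(min(s, 35.0), -35.0)))

    def update(self, x, p, y):
        """z and n updates for one observed click y in {0, 1}."""
        g = p - y                              # gradient for x_i = 1
        for i in x:
            w_i = self._weight(i)
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.eta
            self.z[i] += g - sigma * w_i
            self.n[i] += g * g
```

In use, each impression is first scored with predict(x) and the returned probability is then passed back to update(x, p, y), so the model learns on the fly as described in chapter 2.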


5.2 AdPredictor

In this chapter AdPredictor is also added to the comparison (Graepel et al., 2010). Since this algorithm uses a probit regression model instead of the common logistic regression, it can be interesting to see the differences in results between this algorithm and algorithms using a logistic regression. AdPredictor is also one of the first published algorithms to be focused entirely on CTR prediction, so it is interesting to see how much the models have improved after a few years.

AdPredictor estimates the CTR by using

p(y \mid x, w) = \Phi\left( \frac{y \cdot \mu^T x}{\Sigma} \right) \qquad (5.5)

with $\mu$ the mean and $\Sigma^2$ the total variance. Updating the parameters starts by computing the total variance as follows

\Sigma^2 = \beta^2 + x^T \sigma^2 \qquad (5.6)

with $\beta$ a set minimum value of the total variance, preventing the variance from converging to zero and preventing dividing by zero. Next, by using (5.6), the mean and variance are updated

\mu_i = \mu_i + y\, x_i\, \frac{\sigma_i^2}{\Sigma}\, v\!\left( \frac{y \cdot x^T \mu}{\Sigma} \right) \qquad (5.7)

\sigma_i^2 = \sigma_i^2 \left( 1 - x_i\, \frac{\sigma_i^2}{\Sigma^2}\, w\!\left( \frac{y \cdot x^T \mu}{\Sigma} \right) \right) \qquad (5.8)

with $v(t) = \frac{N(t; 0, 1)}{\Phi(t; 0, 1)}$ and $w(t) = v(t) \cdot (1 + v(t))$. These are called the Gaussian corrections and control the amount of change of the weights after each step.

5.3 Results

Now that the FTRL-Proximal algorithm has been established, it can be used to train the model. This algorithm, alongside AdPredictor (Graepel et al., 2010), is added to the two best performing models of section 4.2, which are the constant and Adagrad models. The results are shown in figure 5.1 and table 5.1.

From figure 5.1 it can be seen that FTRL-Proximal performs better than the constant model at almost all epochs. The difference in LogLoss score between FTRL-Proximal and the constant model seems to decrease at high epochs, but FTRL-Proximal still shows a significant improvement over the constant model at early epochs. AdPredictor outperforms the constant model and FTRL-Proximal at the first few epochs.

Figure 5.1: LogLoss scores for the best performing algorithms of chapter 4, AdPredictor and FTRL-Proximal.

Algorithm     Parameters                                    LogLoss    Epochs   Time per epoch (min)   Time till convergence (min)   Improvement
Benchmark     -                                             0.455369   -        -                      -                             -
Constant      η = 0.001                                     0.402385   50       4.07                   203.5                         -11.64%
Adagrad       η = 0.05                                      0.402462   7        4.54                   31.78                         -11.62%
AdPredictor   α = 0.1, β = 0.2, σ = 0.05, µ = 0, ε = 0      0.403150   17       4.87                   82.79                         -11.47%
FTRL          η = 0.04, β = 1, λ1 = λ2 = 50                 0.402014   20       4.97                   99.4                          -11.72%

Table 5.1: Summary of results and parameters for the algorithms of chapter 5.

At higher epochs, however, AdPredictor is clearly outperformed: Adagrad does what AdPredictor does, but better. FTRL-Proximal does reach the lowest LogLoss score and does so in about 20 epochs, which is far fewer than the 50 the constant model needs.

Table 5.1 supports the conclusions from figure 5.1. FTRL reaches the lowest LogLoss score at 0.402014, with an overall improvement of 11.72% over the Benchmark. Although it uses only 20 epochs as opposed to the 50 epochs needed for the constant model, a single epoch takes 4.97 minutes for FTRL-Proximal, while the constant model needs only 4.07 minutes. With early stopping, this leads to a training time reduction of 50% for the FTRL-Proximal algorithm over the constant model. Since the FTRL-Proximal algorithm gives both slightly better LogLoss scores and does this in just 50% of the time the constant model needs, it is clear that FTRL-Proximal outperforms the constant model. Finally, the results from the Diebold-Mariano test in appendix A.2 show that the predictions don't show any significant similarities.


Chapter 6

FTRDL

Until this point a lot of common algorithms have been used to predict the CTR. Two of the better performing algorithms were the constant and FTRL-Proximal models. A downside of these two models turned out to be their time till convergence, where it could take a lot of epochs for the models to reach their optimal LogLoss score.

In this chapter a solution is provided. A new model called Follow-The-Regularized-Dynamic-Leader is established, which speeds up the process without loss of accuracy. In section 6.1 the model is introduced and explained. In section 6.2 the performance of the new model is compared with the best model thus far. Finally, since only the logistic regression has been used in the past models, a comparison between logistic and probit is made in section 6.3.

6.1 Follow-The-Regularized-Dynamic-Leader

One of the input parameters that can influence the result the most is η. Testing the models for various values of η showed that a large η could result in faster LogLoss score improvements early on. On the other hand, low values of η resulted in much slower, but longer-lasting improvements of the LogLoss score. Even with learning schemes like Adagrad controlling the learning rate, a change in the initial value of η can lead to drastically different end results.

The model proposed in this chapter exploits this behaviour of η. By decreasing the value of η during training, the model should be able to make strong improvements early on while still improving at later epochs. This is applied to the current best algorithm, the FTRL-Proximal algorithm, with an added momentum term.

Something to take into consideration is the maximum and minimum value of η. If the start value of η is too high, it might take too long for η to drop to a good value for later epochs. On the other hand, if the dynamic of η is too strong and the value drops too fast, the model might not actually benefit from it, since the learning rate becomes too low.

Let’s start by saying that η is a function of e, with e the current epoch. The proposed dynamic for η is based on the bias-correction of the Adam model (Kingma and Ba, 2014). The proposed dynamic is the following formula

η(e) = η0

1 − γe, (6.1)

with η0, γ ∈ [0, 1]. This function can control both the maximum and minimum value

of η(e). First, since e is a strictly positive integer, the maximum value of η occurs for e = 1 and is equal to η0

1−γ. For its minimum value the limit of η(e) with respect to

e should be computed. Since lime→∞1−γη0e → η0, the function converges towards the

initial set value of η0. Therefore it’s possible to control the maximum learning rate η

by γ, while the minimum can be controlled by η0.
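The behaviour of (6.1) is easy to check numerically; the parameter values below follow table 6.1 but are used here only for illustration.

```python
eta0, gamma = 0.04, 0.9

def eta(e):
    """Dynamic learning rate of (6.1): large at epoch 1, decaying towards eta0."""
    return eta0 / (1 - gamma ** e)

print(eta(1))                                        # maximum: eta0 / (1 - gamma) = 0.4
print([round(eta(e), 4) for e in (2, 5, 10, 50)])    # decays towards eta0 = 0.04
```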

This is then applied to the FTRL-Proximal algorithm (McMahan et al., 2013) with added momentum. Recall that the FTRL-Proximal algorithm used the update $z_t = z_{t-1} + g_t + \left( \frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}} \right) w_t$. Momentum can be added by replacing $g_t$ with $\tilde{m}_t$. Here, $\tilde{m}_t$ stands for the Nesterov accelerated gradient term. An exponentially weighted moving average (Dozat, 2016) is used to calculate the momentum. The exponentially weighted moving average is written as

m_t = \gamma_1 m_{t-1} + (1 - \gamma_1) g_t

By applying the Nesterov update to the exponentially weighted moving average, this results in the used momentum term

\tilde{m}_t = \gamma_1 m_t + (1 - \gamma_1) g_t

and thus in the updated update rule $z_t = z_{t-1} + \tilde{m}_t + \left( \frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}} \right) w_t$.

Finally, the last task for the FTRDL-Proximal algorithm is changing the FTRL-Proximal algorithm to incorporate the dynamic learning rate $\eta(e)$. Since $z_t$ is used as the update term in which all the past updates are stored, if $\eta(e)$ were applied to the weight formula $-\left( \frac{\beta + \sqrt{n_i}}{\eta} + \lambda_2 \right)^{-1} (z_i - \text{sgn}(z_i) \lambda_1)$, the changing $\eta(e)$ would be applied over all the past update terms in $z_t$. However, by looking at $z_t = z_{t-1} + \tilde{m}_t + \left( \frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}} \right) w_t$, three elements can be separated: $z_{t-1}$ and $\left( \frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}} \right) w_t$ contain all the past elements, while $\tilde{m}_t$ is the new element. So the dynamic learning rate is applied only to $\tilde{m}_t$, that is, to the correct term instead of to all the past terms as well. This results in the final algorithm 7.

Algorithm 7 Adagrad FTRDL-Proximal for Logistic Regression
Require: parameters η0, γ1, γ2, β, λ1, λ2
  (∀i ∈ {1, . . . , d}), initialize z_i = n_i = 0
  for e = 1 to E do                             (E total epochs)
    for t = 1 to T do
      Receive feature vector x_t and let I = {i | x_i ≠ 0}
      For i ∈ I compute
        w_{t,i} = 0                                                     if |z_i| ≤ λ1
        w_{t,i} = −((β + √n_i)/η0 + λ2)⁻¹ (z_i − sgn(z_i) λ1)           otherwise
      Predict p_t = σ(x_t · w) using the computed w_{t,i}
      Observe click y_t ∈ {0, 1}
      for all i ∈ I do
        g_i = (p_t − y_t) x_i
        m_i ← γ1 m_i + (1 − γ1) g_i
        m̃_i = γ1 m_i + (1 − γ1) g_i              (Nesterov's accelerated gradient)
        σ_i = (1/η0) (√(n_i + g_i²) − √n_i)       (equals 1/α_{t,i} − 1/α_{t−1,i})
        z_i ← z_i + m̃_i / (1 − γ2^e) − σ_i w_{t,i}    (apply dynamic learning rate)
        n_i ← n_i + g_i²
      end for
    end for
  end for
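Compared with the FTRL-Proximal sketch of chapter 5, Algorithm 7 only adds a per-feature Nesterov momentum term and the division by 1 − γ2^e on the new term entering z. A minimal sketch, extending the hypothetical FTRLProximal class introduced in the chapter 5 sketch (so the class name and layout are assumptions of these sketches, not of the thesis):

```python
import math
from collections import defaultdict

class FTRDLProximal(FTRLProximal):
    """Sketch of Algorithm 7: FTRL-Proximal with Nesterov momentum and a dynamic learning rate."""

    def __init__(self, eta0=0.04, beta=1.0, gamma1=0.9, gamma2=0.9, l1=50.0, l2=50.0):
        super().__init__(eta=eta0, beta=beta, l1=l1, l2=l2)
        self.gamma1, self.gamma2 = gamma1, gamma2
        self.m = defaultdict(float)                       # momentum per feature

    def update(self, x, p, y, epoch):
        """Same z/n updates as Algorithm 6, but the new term is a rescaled Nesterov momentum."""
        g = p - y
        for i in x:
            w_i = self._weight(i)
            self.m[i] = self.gamma1 * self.m[i] + (1 - self.gamma1) * g
            m_tilde = self.gamma1 * self.m[i] + (1 - self.gamma1) * g     # Nesterov term
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.eta
            # Dynamic learning rate: only the new term is divided by (1 - gamma2^epoch), epoch >= 1.
            self.z[i] += m_tilde / (1 - self.gamma2 ** epoch) - sigma * w_i
            self.n[i] += g * g
```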


6.2 Results

Figure 6.1: LogLoss scores for the best performing algorithms of chapter 5 and the FTRDL algorithm of section 6.1.

Algorithm   Parameters                                        LogLoss    Epochs   Time per epoch (min)   Time till convergence (min)   Improvement
Benchmark   -                                                 0.455369   -        -                      -                             -
Constant    η = 0.001                                         0.402385   50       4.07                   203.5                         -11.64%
Adagrad     η = 0.05                                          0.402462   7        4.54                   31.78                         -11.62%
FTRL        η = 0.1, β = 1, λ1 = λ2 = 50                      0.402014   20       4.97                   99.4                          -11.72%
FTRDL       η0 = 0.04, β = 1, γ1 = γ2 = 0.9, λ1 = λ2 = 50     0.402014   8        5.12                   40.96                         -11.72%

Table 6.1: Summary of results and parameters for the algorithms.

The first thing that can be noticed from the results in figure 6.1 is that the curve of FTRDL-Proximal is no longer a smooth curve. At the early stages of training it has sharper edges, showing that the learning rate η changes value over the course of training. This results in strong learning early on, while the learning is kept under control at later epochs. The training process is therefore faster for the FTRDL-Proximal algorithm, and its minimum LogLoss score is reached in about 7 to 8 epochs. Since FTRL-Proximal needs 20 epochs to reach its minimal LogLoss score and the LogLoss scores of both models are comparable, this shows that FTRDL-Proximal indeed speeds up the process.

Table 6.1 tells the same story. Both FTRL-Proximal and FTRDL-Proximal result in exactly the same LogLoss score of 0.402014, which is an overall improvement of 11.72% over the Benchmark. However, FTRDL-Proximal requires 8 epochs to converge versus the 20 that it takes the FTRL-Proximal algorithm. Since the running time per epoch of FTRDL-Proximal and FTRL-Proximal is comparable, 5.12 and 4.97 minutes respectively, this results in a reduction of the total time till convergence of 58.8%. The Diebold-Mariano test results from appendix A.2 also show that there is significant proof that the predictions of FTRL-Proximal and FTRDL-Proximal aren't different from each other. Therefore the FTRDL-Proximal algorithm successfully reduces the total training time significantly, while keeping the predictions similar.

6.3 Logistic vs. Probit

Until this point only the logistic regression has been used. Since this is the most used regression model for CTR prediction because of its simplicity, most studies use it in their models as well (He et al., 2014; Juan et al., 2016; McMahan et al., 2013; Ta, 2015), while only a few use a probit regression in their algorithms (Graepel et al., 2010). This is most likely caused by computational convenience, in the sense that the model yields faster convergence than other models (Gunduz and Fokoue, 2015).

However, Gunduz and Fokoue (2016) also note that most studies dealing with binary classification tend to automatically choose the logistic regression over probit regression or any other model without a solid reason. Therefore, in this section FTRL- and FTRDL-Proximal are trained with both logistic and probit regression models to see if this results in significant differences.

Figure 6.2: LogLoss scores for FTRL- and FTRDL-Proximal models for logistic and probit regression.


Algorithm          Parameters                                       LogLoss    Epochs   Time per epoch (min)   Time till convergence (min)   Improvement
Benchmark          -                                                0.455369   -        -                      -                             -
FTRL (Logistic)    η_Log = 0.1, β = 1, λ1 = λ2 = 50                 0.402014   20       4.97                   99.4                          -11.72%
FTRL (Probit)      η_Prob = 0.04, β = 1, λ1 = λ2 = 50               0.401901   23       5.08                   116.84                        -11.74%
FTRDL (Logistic)   η0,Log = 0.04, γ = 0.9, β = 1, λ1 = λ2 = 50      0.402014   8        5.12                   40.96                         -11.72%
FTRDL (Probit)     η0,Prob = 0.01, γ = 0.9, β = 1, λ1 = λ2 = 50     0.401996   5        5.25                   26.25                         -11.72%

Table 6.2: Summary of the results for the logistic versus probit test for FTRL- and FTRDL-Proximal algorithms. The suffix {Log, Prob} shows which parameter belongs to which probability model.

Figure 6.2 shows the results of training the FTRL- and FTRDL-Proximal algorithms with logistic and probit regressions. The probit regression shows a small improvement, both in convergence speed and LogLoss score. Although the differences are small, it does show that one shouldn't automatically pick logistic over probit regression. For FTRL-Proximal, the probit regression model reaches the same LogLoss score as the logistic regression model in approximately 5 epochs fewer. This speed-up is more noticeable for the FTRDL-Proximal models, where the probit regression model converges after 3 epochs versus the 8 of the logistic regression model.

Besides changing the initial values of η, as noted in table 6.2, nothing has been altered in the parameters for the different probability models. From table 6.2 it can be seen that the probit models slightly outperform the logistic models in their LogLoss scores, although the difference is most likely insignificantly small. The biggest difference can be seen in the FTRL-Proximal model, where the probit regression has a LogLoss score which is 0.000113 smaller than the LogLoss score of the logistic regression for the same algorithm.

The Diebold-Mariano significance test, as reported in appendix A.2, shows some interesting results. The test tells us that there is significant proof that the predictions of the FTRL-Proximal with logistic regression and the FTRDL-Proximal for both logistic and probit regression are similar. However, the test results for the FTRL-Proximal with probit regression show that it has no significant similarities with any of the other models. It is therefore safe to say that the predictions of the FTRL-Proximal with probit regression outperform the predictions of the other three considered algorithms.

The running time per epoch is slightly higher for the probit models. For FTRDL-Proximal the total time till convergence for probit is lower than the time needed for the logistic regression. The probit regression model for FTRL-Proximal needs more epochs till convergence than the logistic regression, with 23 and 20 epochs respectively. However, if a similar LogLoss score is targeted in both probit and logistic regression for this algorithm, the probit model needs only 15 epochs to reach a LogLoss score of 0.402005, while the logistic regression model needs 20 epochs for 0.402014. This shows that the probit regression model gives faster and better performing models than the logistic regression. Therefore one should not automatically pick logistic over probit regression, since probit can lead to better performance of the model.

6.4 Failed experiments

Although the final algorithm succeeded in decreasing the time till convergence without losing accuracy, some other experiments have been done to (unsuccessfully) improve the models. In this section these unsuccessful experiments are reported to give a direction for possible future research.

First off, similar to section 6.3, some other binary regression models have been tested. Regression models based around the complementary log-log and compit link functions have been used. However, the complementary log-log regression model didn't show any significant difference in comparison with the logistic regression model. The compit regression model, on the other hand, wasn't able to produce results with the same or better LogLoss scores as the logistic or probit regression models.

Since both FTRL- and FTRDL-Proximal use an Adagrad learning rate scheme, these models were adapted to support other learning rate schemes like RMSProp and Adam as well. However, this didn't result in good predictions, with LogLoss scores much higher than the current best results. This was also noted in section 4.2 when these learning rate schemes were used on a standard SGD model. This could be due to the fact that click impression datasets are very sparse and these schemes are simply outperformed by Adagrad on sparse datasets. It could therefore be interesting to see their performance on less sparse datasets. However, due to the lack of additional data, these experiments couldn't be performed.

Due to the works of Ta (2015) and Juan et al. (2016) on feature interactions and factorization machines, some simple experiments concerning feature interactions have been performed as well. Therefore simple poly2 feature interactions have been implemented. However, this increased the training time significantly, taking 1 hour and 15 minutes to perform a single epoch. Since the LogLoss scores weren’t improving, there was no reason to include this in the final algorithm.

Finally, experiments based on the work of Neelakantan et al. (2015), who added noise to the gradients to make them more robust to poor initialization, have been performed. Adding the noise did not improve the model, only increasing the running time per epoch while giving insignificantly small differences, both positive and negative, in the LogLoss scores.


Chapter 7

Conclusion

In this paper, research has been done on the subject of convergence speed for CTR prediction models and on finding a new model that can improve the convergence speed of such models. This is an interesting subject since training can be both time-consuming and expensive; therefore a model that speeds up the training process can be financially interesting.

In recent works on predicting CTR the most used basic model was the SGD model. This model is used to train on large datasets while requiring little memory, and it can be run on weaker processors as well. Following the idea of Juan et al. (2016), epochs can be used to pass training data through the model multiple times, which can improve the predictions significantly. This, however, results in longer training times. Therefore, finding a model that is able to use the accuracy benefits of epoch training, while keeping the total running time at a minimum, was the main focus of the model in this paper. The accuracy of the models was compared by the use of the LogLoss score.

The model of McMahan et al. (2013) was used as a starting point. Their FTRL-Proximal algorithm uses a logistic regression model. The weights are computed by the SGD algorithm with an Adagrad learning rate scheme, while also adding L1 and L2 regularization terms to prevent overfitting. This model was able to give improved predictions at a reduced training time over the constant SGD model. However, there was still room for improvement in training speed.

Therefore, the Follow-The-Regularized-Dynamic-Leader-Proximal algorithm was proposed in this paper. This model uses the same properties as the FTRL-Proximal algorithm by McMahan et al. (2013), but adds Nesterov's accelerated gradient and uses the characteristics of the learning rate parameter η to create a more dynamic algorithm. Using a high learning rate would result in strong learning at the beginning of the training, but the training would come to a halt very quickly. However, setting the learning rate too low would result in a very slow, but more accurate model. It was therefore proposed to change the learning rate parameter from a constant value to a more dynamic parameter, which takes a large value at early epochs and slowly decreases towards the initial value η0 at higher epochs. By using the benefits of both a large and a small parameter, the algorithm should be able to train faster and for longer.

The results showed that the FTRDL-Proximal algorithm was indeed able to speed up the training: it reduced the training time till convergence by 58% in comparison with the FTRL-Proximal algorithm. The increase in training speed did not affect the accuracy, since both models resulted in the same LogLoss score.
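As an illustration of this idea, the snippet below gives a hypothetical epoch-dependent schedule of the kind described above: large during the first epochs and decaying towards the base value η0 as training progresses. The functional form and the constants are illustrative assumptions only, not the schedule actually used for FTRDL-Proximal, which is defined in Chapter 6.

```python
# Hypothetical dynamic learning rate: starts high and decays towards eta0.
# The form eta0 * (1 + boost / epoch) is an illustrative assumption.
def dynamic_eta(epoch, eta0=0.05, boost=4.0):
    return eta0 * (1.0 + boost / epoch)

print([round(dynamic_eta(e), 3) for e in range(1, 11)])
```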

The use of logistic regression over probit regression was also questioned. Since most researchers tend to use logistic regression by default, while probit regression has some interesting properties of its own, the FTRL- and FTRDL-Proximal algorithms were compared with each other using both regression models. The results showed that the probit regression models were able to slightly outperform the logistic regression models in both speed and accuracy, showing that the default regression model is not always the best. It is therefore advisable to always test the algorithms with different regression models in order to find the model that fits best.

For future research on this subject, a closer look can be taken at the use of feature interactions. In this research, simple poly2 feature interactions have been implemented unsuccessfully. Since factorization machines have shown in the past that they are able to make good predictions (Juan et al., 2016; Ta, 2015), using more advanced feature interactions in combination with the dynamic learning rate could possibly lead to better performances as well.

Further exploration of the learning rate dynamics can be interesting as well. In this paper a simple dynamic for the learning rate has been proposed. However, a dynamic that uses the current LogLoss score to adjust the learning rate could be very interesting, possibly leading to more accurate adjustments.

Despite these possible improvements of the algorithm, FTRDL-Proximal succeeded in its main goal. The algorithm is able to reduce the overall time till convergence without loss of accuracy. FTRDL-Proximal can therefore be used to reduce the costs of training and prediction.


Appendix A

Diebold-Mariano significance test

A.1

Derivation of the Diebold-Mariano test

When performing predictions using multiple algorithms, it can be of value to see whether the predictions are significantly different from each other. The Diebold-Mariano test (Diebold and Mariano, 2002) has been developed to do this. Let us start with

Actual values: $\{y_t : t = 1, \ldots, T\}$

Predictions: $\{p_{1,t} : t = 1, \ldots, T\}$ and $\{p_{2,t} : t = 1, \ldots, T\}$

Next, they define a loss function $g(\cdot)$ for an error $e_{i,t} = p_{i,t} - y_t$, using $g(e_{i,t}) = e_{i,t}^2$. However, since in this paper the LogLoss score is used, it is preferable to use the LogLoss score instead. Some tests have shown that using the LogLoss score for the loss function, instead of the squared-error loss defined by Diebold and Mariano (2002), does not give different results. Therefore the loss function is defined as

$$g(y_t, p_{i,t}) = -y_t \log(p_{i,t}) - (1 - y_t)\log(1 - p_{i,t})$$

Next, define the loss differential between the two predictions by $d_t = g(y_t, p_{1,t}) - g(y_t, p_{2,t})$.

The next step is defining the hypotheses. The point of interest is whether the predictions are equal to each other, in other words, whether $E[d_t] = 0$. Therefore the null hypothesis is defined as
$$H_0 : E[d_t] = 0 \quad \forall\, t$$
versus the alternative hypothesis
$$H_1 : E[d_t] \neq 0 \quad \forall\, t$$


Now consider
$$\sqrt{T}\,(\bar{d} - \mu) \to N(0, \Sigma^2),$$
with $\bar{d} = \frac{1}{T}\sum_{t=1}^{T} d_t$, $\mu = E[d_t]$ and $\Sigma^2$ the variance of $d_t$. Under the null, this can be rewritten as
$$DM = \sqrt{T}\,\frac{\bar{d}}{\Sigma} \to N(0, 1).$$

So under the null hypothesis, the test statistic is asymptotically $N(0, 1)$ distributed. Therefore the null hypothesis of no difference is rejected if the computed DM statistic falls outside the range of the z-values from the standard normal table corresponding to the desired significance level α. In other words, reject $H_0$ if $|DM| > z_{\alpha/2}$.
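As an illustration, the snippet below sketches the test with the LogLoss as loss function, as used in this paper. The simple i.i.d. variance estimate and the synthetic data are illustrative assumptions, not the exact implementation behind Table A.1.

```python
import numpy as np
from scipy import stats

def logloss(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def diebold_mariano(y, p1, p2):
    d = logloss(y, p1) - logloss(y, p2)            # loss differential d_t
    dm = np.sqrt(len(d)) * d.mean() / d.std(ddof=1)
    p_value = 2 * (1 - stats.norm.cdf(abs(dm)))     # two-sided p-value
    return dm, p_value

# Example on synthetic clicks and predictions.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000).astype(float)
p1 = np.clip(0.3 + 0.4 * y + rng.normal(0, 0.05, 1000), 0.01, 0.99)
p2 = np.clip(0.3 + 0.3 * y + rng.normal(0, 0.05, 1000), 0.01, 0.99)
print(diebold_mariano(y, p1, p2))
```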



A.2

Diebold-Mariano test results

Model                   1.     2.     3.     4.     5.     6.     7.     8.     9.     10.    11.    12.    13.    14.
1.  Benchmark
2.  Naive Bayes         0.000
3.  Constant            0.000  0.000
4.  Global              0.000  0.000  0.000
5.  Per Weight          0.000  0.000  0.000  0.000
6.  Momentum            0.000  0.000  0.002  0.000  0.000
7.  Nesterov            0.000  0.000  0.049  0.000  0.000  0.145
8.  Adagrad             0.000  0.000  0.000  0.000  0.000  0.000  0.000
9.  RMSProp             0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
10. Adam                0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
11. AdPredictor         0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
12. FTRL-Proximal       0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
13. FTRDL-Proximal      0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.130
14. FTRL (Probit)       0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.001  0.000
15. FTRDL (Probit)      0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.389  0.320  0.002

Table A.1: p-scores of the Diebold-Mariano test. Bold results show significance at a 5% level, while italics show significance at a 1% level. The null hypothesis of no difference is rejected if the p-score is below the significance level.


References

Agarwal, D., Chen, B.-C., and Elango, P. (2009). Spatio-temporal models for estimating click-through rate. In Proceedings of the 18th international conference on World Wide Web, pages 21–30. ACM.

Anderl, E., Becker, I., Wangenheim, F. V., and Schumann, J. H. (2016). Mapping the customer journey: A graph-based framework for online attribution modeling. International Journal of Research in Marketing, 33(3):457–474.

Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. (2013). Advances in optimizing recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8624–8628. IEEE.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer.

Chapelle, O., Manavoglu, E., and Rosales, R. (2015). Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61.

Diebold, F. X. and Mariano, R. S. (2002). Comparing predictive accuracy. Journal of Business & Economic Statistics, 20(1):134–144.

Dozat, T. (2016). Incorporating nesterov momentum into adam. URL http://cs229.stanford.edu/proj2015/054_report.pdf. Accessed May 2017. Stanford University, Tech. Rep.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Graepel, T., Candela, J. Q., Borchert, T., and Herbrich, R. (2010). Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 13–20.

Gunduz, N. and Fokoue, E. (2015). On the predictive properties of binary link functions. arXiv:1502.04742.

Ha, L. (2008). Online advertising research in advertising journals: A review. Journal of Current Issues & Research in Advertising, 30(1):31–48.

He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., et al. (2014). Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pages 1–9. ACM.

Juan, Y., Zhuang, Y., and Chin, W.-S. (2011). 3 idiots' approach for display advertising challenge. URL http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf. Accessed Jun 2017. NTU CSIE MLGroup.

Juan, Y., Zhuang, Y., Chin, W.-S., and Jahrer, M. (2014). 4 idiots' approach for click-through rate prediction. URL http://www.csie.ntu.edu.tw/~r01922136/slides/kaggle-avazu.pdf. Accessed Jun 2017. NTU CSIE MLGroup.

Juan, Y., Zhuang, Y., Chin, W.-S., and Lin, C.-J. (2016). Field-aware factorization machines for ctr prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 43–50. ACM.

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L., Phillips, T., Davydov, E., Golovin, D., et al. (2013). Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1222–1230. ACM.

Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). In Doklady AN SSSR, volume 269, pages 543–547.

Rendle, S. (2010). Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference, pages 995–1000. IEEE.

Richardson, M., Dominowska, E., and Ragno, R. (2007). Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pages 521–530. ACM.

Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.

Sherman, L. and Deighton, J. (2001). Banner advertising: Measuring effectiveness and optimizing placement. Journal of Interactive Marketing, 15(2):60–64.

Stec, C. (2015). 20 display advertising stats that demonstrate digital advertising's evolution. URL https://blog.hubspot.com/marketing/horrifying-display-advertising-stats#sm.00013oki5d57kfic11mx58e7ag2lb. Accessed Jun 2017.

Sutton, R. S. (1986). Two problems with backpropagation and other steepest-descent learning procedures for networks. In Proc. 8th annual conf. cognitive science society, pages 823–831. Erlbaum.

Ta, A.-P. (2015). Factorization machines with follow-the-regularized-leader for ctr prediction in display advertising. In Big Data (Big Data), 2015 IEEE International Conference on, pages 2889–2891. IEEE.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31.

Yuan, S., Wang, J., and Zhao, X. (2013). Real-time bidding for online advertising: measurement and analysis. In Proceedings of the Seventh International Workshop on Data Mining for Online Advertising, page 3. ACM.
