
Faculty of Economics and Business, Amsterdam School of Economics MSc Econometrics

Specialisation: Big Data Business Analytics

CTR prediction using first- and second-order methods

Ramon Wonink (10421777) 12 January 2018

Supervisor: Dr. N. P. A. van Giersbergen Second reader: Dr. K. Pak


Statement of Originality

This document is written by Ramon Wonink who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

In this research different models based on first- and second-order methods were used to predict whether someone will click on an ad. The evaluation was based on a logloss function and the accuracy of the prediction.

Stochastic gradient descent (SGD) and the variants of SGD were the first-order methods, based only on the gradient. The most successful algorithms were SGD itself and the variants Adam and Adagrad. These three algorithms had the lowest logloss value and the highest accuracy in both the fixed window (fixed sample size) and the rolling window (using observations in the test set to train the weights after a prediction has been made). While not all variants improved over time, the accuracy using a rolling window was always higher than using a fixed window.

Quasi-Newton with the DFP and BFGS update formulas and Online Newton Step were the second-order methods, based not only on the gradient but also on the Hessian matrix. Due to memory limitations only a part of the dataset could be used to test these methods. Theoretically second-order algorithms should outperform first-order algorithms, but this was not the case here. Using a rolling window is not feasible in practice because of the increased run time, and the results improved little or not at all in comparison with SGD.


Contents

1 Introduction
2 Literature
   2.1 CTR
   2.2 Recent papers
   2.3 Programming optimizations
3 Data
   3.1 Criteo Labs
   3.2 Features
4 Models
   4.1 Evaluation process
   4.2 CTR as benchmark
   4.3 Stochastic Gradient Descent
   4.4 Quasi-Newton and Online Newton Step
5 Results
   5.1 CTR as Benchmark
   5.2 Stochastic Gradient Descent
   5.3 Quasi-Newton and Online Newton Step
6 Conclusion
References
A Appendix
   A.1 Examples Python optimizations

1 Introduction

Traditional media like television are impersonal: everyone who watches a certain television channel sees the same advertisements. On the Internet this does not have to be the case, because users can be tracked. Therefore a lot of characteristics are known which can be used to personalize ads. There are different kinds of advertisements on the Internet, but the focus in this thesis is on banner ads. Examples of characteristics are whether a certain user has clicked on an ad before, what kind of device he or she uses and the website a user is visiting. These characteristics can be used to predict whether a certain user will click on an ad or not. If banner ads are placed by real-time auctions, ads should preferably only be shown to a user who has a high probability of clicking. A lot of money is spent on online advertising: according to IAB, digital advertising revenues in the U.S. increased by 23% to 19.6 billion dollars in Q1 2017 in comparison with Q1 2016.

Datasets with characteristics of users that clicked or did not click on an ad can become very large. To analyse these datasets different models can be used. Stochastic gradient descent is a first-order algorithm that can deal with big datasets. A disadvantage is that not all information in the data is being used. Newton's methods make use of the second-order properties, but calculation times increase significantly (Ye et al., 2017). In this research different models based on first- and second-order methods will be used to predict whether someone will click on an ad. Not only is the accuracy important, but also the time the models need to obtain results. For the real-time auction, the prediction for a new observation needs to be made almost instantly. Preferably the new observations should be added to the training set without training the entire model again. The difference in accuracy between adding observations to the training set and using a fixed sample size is part of the research.

To test the different models, a large Click-Through-Rate (CTR) dataset from Criteo Labs is used. This dataset was originally supplied for a three-month competition in 2014 on Kaggle. The competition was to predict whether someone will click on an advertisement or not. After the competition the dataset was released for academic use. The size of the dataset also creates some problems. One of these problems is that the entire dataset cannot be loaded into the memory of a normal laptop with, for instance, 8 GB of internal memory. Therefore a standard estimation technique like OLS cannot be used. The programming language used in this research is Python and the graphs are made with the program R.

The organization of this thesis is as follows. In Section 2.1 CTR and real-time auctions are explained. In Section 2.2 relevant literature is discussed and in Section 2.3 the topic is programming optimizations. The content of the CTR dataset is discussed in Section 3.1 and feature extraction is the main topic of Section 3.2. Section 4 starts with the evaluation methods in Section 4.1 and ends with the different models that will be used to predict the CTR in Sections 4.2 - 4.4. The results of these models are provided in Section 5. The conclusion is given in Section 6.

2 Literature

A CTR dataset from Criteo Labs is used to test the models introduced in Section 4. Therefore in Section 2.1, CTR will be explained in more detail. Key is to find models that are accurate, but also fast to train. If there are new observations, these should be tested and then added to the training set without retraining the entire model. In Section 2.2, recent papers are discussed. In this research the programming language is Python. Because time is an important aspect, the topic in Section 2.3 is programming optimizations.

2.1 CTR

When browsing on the Internet it is possible that on some web pages banner ads are shown to the user. Companies compete for those advertisement spots for instance through real time bidding (Yuan et al., 2013). When a user loads a web page with advertisement spots an auction starts where companies can bid. The advertisement of the company that wins the auction will be shown to the user. It is important for companies that the ads have a positive impact. Not every user is interested in a certain product and the focus of this thesis is not how high the bid must be, but to determine the probability that a user will click on an ad. This is important for companies because these personalized advertisements cost money. The Click-Through-Rate (CTR) is the ratio of users that click on, in this case, a banner to the total number of users:

$$\text{CTR} = \frac{\text{number of clicks}}{\text{total number of users}}.$$

The average CTR for desktop display advertising is approximately 0.1% (Shan et al., 2016). To determine the probability that a user will click, features are needed. Companies are able to see some characteristics of the user like the user IP, but also the date and time of the request to the web page. Cookies can be used to track down the sequence of page requests of a user. It is also possible to track usage data of users at the client side by using Javascript, Java applets or modified browsers. They can provide detailed information about the behaviour of the user, but these approaches raise privacy issues (Facca and Lanzi, 2005).

Selecting good features is one of the core challenges. The extracted features that are used for the models need to handle discrete features of different sizes. Some features only have a few options while there are other features which can have millions of different options such as user ID (Graepel et al., 2010). A lot of these possible features are categorical variables which can lead to a sparse dataset. It is crucial that the CTR prediction model can deal with sparse data.

Computational cost is also an essential aspect. Millions of ads each hour should be able to be served with fast response times of 100 ms or less. Therefore the algorithm used to predict the result of new observations should have a low computational cost. Due to large sparse datasets it is also important that the learning algorithm has a bounded memory footprint in RAM to be able to run continuously (Graepel et al., 2010).

2.2 Recent papers

CTR prediction in this research is based on first-order and second-order methods. In this section relevant recent papers are discussed.

The research of van Velzen (2017) shows that stochastic gradient descent (SGD), an iterative method used to minimize an objective function, can be used to predict the CTR. The weights $w_t$ of the model are updated using the old weights $w_{t-1}$, a learning parameter $\eta_t$ and the gradient $g_t$:

$$w_t = w_{t-1} - \eta_t g_t.$$

The learning parameter does not need to be dependent on time t; if it is not, it becomes equal to $\eta$. There are multiple variants based on SGD, whose goal is to improve the results. Van Velzen used the variants Momentum, Adagrad, RMSProp and Adam. More information about these variants is given in Section 4.3. Also the normal SGD algorithm in


combination with a different learning parameter was used. In those cases the learning parameter was dependent on the number of observations (global) and also dependent on the number of times a feature occurred (per weight).

The results were obtained using these different algorithms in combination with a maximum of 50 epochs. An epoch is feeding the data to the model so in the case of three epochs, the data has been fed to the model three times. The evaluation method was a logistic loss function

$$\ell = \frac{1}{n}\sum_{i=1}^{n}\left(-y_i \log p_i - (1 - y_i)\log(1 - p_i)\right),$$

in which $y_i$ stands for click (1) or no click (0) and $p_i$ is a function that converts the data into a probability in [0, 1]. The lower the logloss value, the better the algorithm performs. The normal stochastic gradient descent algorithm had the lowest logloss value. The disadvantage was that it took long to converge in comparison with other models. After 50 epochs it still had not converged, while the variant Adagrad converged after 7 epochs. The training time per epoch was larger for Adagrad, but the total training time was lower. The difference in logistic loss between Adagrad and the normal stochastic gradient descent algorithm was very small (0.02%), while Adagrad only took 15.6% of the training time.

The algorithms SGD, Momentum and Nesterov, which is also Momentum based, needed the full 50 epochs, while all other algorithms required 8 epochs or less. RMSProp and Adam were the models that resulted in the highest logloss. Two questions still remain. The first is whether it is beneficial to use the new observations to train the model after a prediction has been made. The second is whether the accuracy in predicting a click or no click is related to a lower logloss value.

Newton's method is also an iterative method to optimize an objective function. It uses the gradient $g_t$, the learning parameter $\eta_t$ and the Hessian matrix $H_t$. The weights $w_t$ are updated according to the following rule

$$H_t = H_{t-1} + g_t g_t^T$$
$$w_t = w_{t-1} - \eta_t (H_t)^{-1} g_t.$$

Stochastic gradient descent is a variant of this updating rule in which the Hessian matrix $H_t$ is a constant equal to the identity matrix. Calculating the Hessian matrix for small datasets and small models is not an issue, but in the case of a big dataset with a lot of features it becomes a problem. The advantage of using the Hessian is that more information is used. The computational complexity of each operation is of order $O(nd^2 + d^3)$, in which $n$ is the number of observations and $d$ the dimension of $H_t$. This method is expensive when $n$ or $d$ is large (Ye et al., 2017). The advantage is that in terms of convergence rate it outperforms first-order methods like SGD (Civek and Kozat, 2017). Quasi-Newton's method lies between SGD and Newton's method: the Hessian matrix is updated in such a way that the secant condition is satisfied. This results in slower convergence, but is faster to calculate in comparison to Newton's method (Gower and Richtárik, 2016). More information about Quasi-Newton is given in Section 4.4.

The main problem of second-order methods is the computational complexity of order $O(d^2)$, while first-order methods like SGD have a computational complexity of order $O(d)$. Civek and Kozat (2017) reduced the computational complexity from $O(d^2)$ to $O(d)$ for the problem of sequential linear data prediction while not relying on any statistical assumptions. Their algorithm offered the performance of the second-order method at the computational cost of the first-order method. They also showed that their algorithm was numerically stable. Time series data sequences were used, which means that consecutive data vectors consist of mostly the same information; this is the basis of their algorithm. They then rewrote the updating rule of the Hessian matrix $H_t$ using the matrix inversion lemma

$$(H_t)^{-1} = (H_{t-1})^{-1} - \frac{(H_{t-1})^{-1} g_t g_t^T (H_{t-1})^{-1}}{1 + g_t^T (H_{t-1})^{-1} g_t}.$$


The expensive operation in this rule is $(H_{t-1})^{-1} g_t$, but instead of calculating $(H_{t-1})^{-1} g_t$ directly they developed a rule which calculates $(H_{t-1})^{-1} g_t$ from $(H_{t-2})^{-1} g_t$ without any knowledge of $(H_{t-1})^{-1}$. This resulted in a computational complexity of $O(d)$ with the same error rates as the regular implementation of order $O(d^2)$ (Civek and Kozat, 2017).

2.3 Programming optimizations

While the goal of this research is to use algorithms which are fast and accurate, it is also important to look into other optimization techniques which reduce the run time. The programming language Python is used, which can be optimized in different ways. A few techniques are discussed below and timed examples are given in Appendix A.1.

Explicit loops in Python should be avoided if possible (Langtangen, 2009). Instead, the package NumPy, which contains mathematical support for arrays and matrices, should be used to vectorize expressions. If this is not possible, loops can often still be avoided by replacing them with a combination of the following Python functions: map, reduce and filter. The function map applies another function to all items in a list. The function reduce applies a function of two arguments cumulatively to the items of an iterable from left to right, reducing the iterable to a single value. The function filter constructs a list with the elements of an iterable for which another function returns True.
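To illustrate (a minimal sketch; the array sizes and the summing task are arbitrary and not taken from the thesis), an explicit loop, a vectorized NumPy expression and the map, filter and reduce built-ins can be compared as follows:

    import numpy as np
    from functools import reduce

    x = np.arange(1, 1_000_001, dtype=np.float64)

    # explicit loop: slow, every iteration is interpreted Python
    total_loop = 0.0
    for value in x:
        total_loop += value ** 2

    # vectorized NumPy expression: no Python-level loop
    total_vec = np.sum(x ** 2)

    # map / filter / reduce on a plain Python list
    squares = list(map(lambda v: v * v, [1, 2, 3, 4]))        # [1, 4, 9, 16]
    evens = list(filter(lambda v: v % 2 == 0, [1, 2, 3, 4]))  # [2, 4]
    summed = reduce(lambda a, b: a + b, [1, 2, 3, 4])         # 10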

Another optimization is removing the module prefix in frequently called functions (Langtangen, 2009). For instance in Python the package NumPy can be loaded into memory by using import numpy as np. If the logarithm of x is required then the command is np.log(x). If this operation occurs multiple times, it saves time to store the function np.log into a local object and to use the local object instead of np.log(x).
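A minimal sketch of this idea (the loop and the data are only illustrative):

    import numpy as np

    values = np.random.rand(1_000_000)

    # with the module prefix: the attribute np.log is looked up on every call
    result_a = [np.log(v) for v in values]

    # with a local alias: the lookup is done once, outside the loop
    log = np.log
    result_b = [log(v) for v in values]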

Loop unrolling is another optimization technique. The run time of a loop can be reduced by decreasing the number of repetitions of the loop and doing multiple iterations within its body. The step size of the loop changes to the number of iterations done within the body. To illustrate: if the normal loop has n iterations and the programmer manually repeats the body x times per iteration, then the number of iterations after loop unrolling is n/x. The performance can be increased because the loop overhead is reduced, the instruction-level parallelism is increased and the register and data cache usage is improved (Bacon et al., 1994). The experiments of Asher and Rotem (2009) show that loop unrolling increases the performance.
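A minimal sketch in Python (the summing task is arbitrary; in CPython the gain is usually modest compared with vectorization):

    n = 1_000_000
    data = list(range(n))

    # normal loop: n iterations
    total = 0
    for i in range(n):
        total += data[i]

    # unrolled by a factor of 4: n/4 iterations, four additions per body
    total_unrolled = 0
    for i in range(0, n, 4):
        total_unrolled += data[i] + data[i + 1] + data[i + 2] + data[i + 3]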

Normally when a Python script runs it only utilizes one core. This means that every bit of code is processed in serial. There are ways in Python to use multiple cores to distribute the workload in parallel. This is called multiprocessing and in Python this can be done in a few different ways. One of the approaches is the Pool/Map approach which creates a pool of worker processes and returns the results in a list. Every element of an iterable is assigned to one of the worker processes. A disadvantage is that this approach only allows for one argument as input parameter. The best performance is achieved when the number of processes is equal to the number of processor cores on the machine (Singh et al., 2013). Another approach is the Process/Queue approach which allows for multiple arguments as input parameter. First two queues are created, one for input data and another one for output data. The process class is used to start the parallel worker processes. Each worker process picks up the next item from the input data and the result is stored on the output data queue. A third approach is the Parallel Python approach. Parallel Python is an open source cross-platform module in which the processes run under a job server which is started locally with a number of processes. In the research of Singh et al. (2013) the Process/Queue approach performed better than both the Pool/Map and Parallel Python, but all three approaches reduced the wait time.
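A minimal sketch of the Pool/Map approach (the worker function is a placeholder for an expensive, independent computation; it is not code from the thesis):

    import math
    from multiprocessing import Pool, cpu_count

    def simulate(seed):
        # placeholder for an expensive, independent task with one input argument
        return sum(math.sqrt(i) for i in range(seed, seed + 100_000))

    if __name__ == "__main__":
        # one worker process per core, as suggested by Singh et al. (2013)
        with Pool(processes=cpu_count()) as pool:
            results = pool.map(simulate, range(8))
        print(results)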

3 Data

In Section 3.1 the CTR dataset from Criteo Labs is described. In Section 3.2 the main topic is how to create a feature set when the entire dataset cannot be loaded into memory.

3.1 Criteo Labs

In this research a CTR prediction dataset from Criteo Labs is used. In June 2014 this dataset was made available for a three-month competition on Kaggle.[1] After the competition the data was made available for academic use.[2] There are 3 files provided:

a readme file which contains some general information about the data, a training file which contains information about the observations which can be used to train a model and a test file which should be used to test the model.

It is given that the training data covers a portion of Criteo’s traffic over a period of 7 days and the test set covers a portion of the traffic the day after the training set. The rows in both datasets are ordered chronologically. The clicked and non-clicked examples have both been subsampled at different rates in order to reduce the dataset size. The training set contains 45,840,617 rows and 40 columns. The first column indicates if someone clicked (1) or not (0). The next 13 columns are integer variables (mostly count features) and the last 26 columns are categorical variables.

Data exploration has been made difficult for this dataset. The main reason is that the header is missing, so it is not possible to know what each column means. Also, every categorical value is hashed into 32 bits for anonymization purposes, so it is not possible to know what these values are. This is only a problem when trying to understand what every column means, not when training a model. Another problem is that the provided test set cannot be used for this research, because the first column (click or no click) has been removed. Therefore it is not possible to test the models with this test set. To test a model a new test set needs to be created: the last 6,500,000 observations have been removed from the training set and added to the test set. This means that roughly 6 days of data are used for training a model and 1 day of data for testing a model. A summary of the datasets is given in Table 1. The CTR of the training sample is slightly lower (0.43%) than the CTR of the test sample.

[1] https://www.kaggle.com/c/criteo-display-ad-challenge

Dataset | Observations | Count var | Categorical var | Total var | CTR
Training set | 45,840,617 | 13 | 26 | 39 | 25.62%
Training sample | 39,340,617 | 13 | 26 | 39 | 25.56%
Test sample | 6,500,000 | 13 | 26 | 39 | 25.99%

Table 1: Summary datasets

Variable | Min | q25 | Median | q75 | q95 | q99 | Max | Mode | Mean | Unique values | Missing values
Count 01 | 0 | 0 | 1 | 3 | 15 | 39 | 5,775 | 0 | 3.50 | 649 | 20,793,556 (45.36%)
Count 02 | -3 | 1 | 5 | 45 | 600 | 2,361 | 257,675 | 0 | 118.19 | 9,364 | 0 (0.00%)
Count 03 | 0 | 2 | 6 | 18 | 76 | 222 | 65,535 | 1 | 26.91 | 14,746 | 9,839,447 (21.46%)
Count 04 | 0 | 2 | 4 | 10 | 25 | 40 | 969 | 1 | 7.32 | 490 | 9,937,369 (21.68%)
Count 05 | 0 | 328 | 2,813 | 10,131 | 63,804 | 348,664 | 23,159,456 | 1 | 18,538.99 | 476,707 | 1,183,117 (2.58%)
Count 06 | 0 | 8 | 32 | 102 | 474 | 1,286 | 431,037 | 0 | 116.06 | 11,618 | 10,252,328 (22.37%)
Count 07 | 0 | 1 | 3 | 11 | 65 | 217 | 56,311 | 0 | 16.33 | 4,142 | 1,982,866 (4.33%)
Count 08 | 0 | 2 | 7 | 19 | 42 | 48 | 6,047 | 0 | 12.52 | 1,373 | 22,773 (0.05%)
Count 09 | 0 | 10 | 38 | 110 | 431 | 988 | 29,019 | 1 | 106.11 | 7,275 | 1,982,866 (4.33%)
Count 10 | 0 | 0 | 1 | 1 | 2 | 3 | 11 | 0 | 0.62 | 13 | 20,793,556 (45.36%)
Count 11 | 0 | 1 | 1 | 3 | 10 | 26 | 231 | 1 | 2.73 | 169 | 19,982,886 (43.59%)
Count 12 | 0 | 0 | 0 | 1 | 4 | 19 | 4,008 | 0 | 0.99 | 407 | 35,071,652 (76.51%)
Count 13 | 0 | 2 | 4 | 10 | 29 | 48 | 7,393 | 1 | 8.22 | 1,376 | 9,937,369 (21.68%)

Table 2: Count variables original training set

A high CTR is noticeable in Table 1, since the expectation is a CTR of approximately 0.1% (Shan et al., 2016). The reason is that different rates have been used for subsampling the clicked and non-clicked examples. This is also preferred when training a model: if the CTR were for instance only 1 percent, the model could simply predict no click for every observation and still achieve a high accuracy.


An overview of the characteristics of the count variables is given in Table 2. All count variables show the same pattern: low values occur often and the higher the value, the lower the frequency. For instance only 1% of the values of count 01 are higher than 39, while there are 649 unique values. Half of the non-missing values for count 01 are equal to zero or one. Unfortunately most count variables have a high percentage of missing values, with the extreme case being count 12 with 76.51%. Only count 02 is complete.

Variable | Missing values | Unique values | Variable | Missing values | Unique values
Cat 01 | 0 (0.00%) | 1,460 | Cat 14 | 0 (0.00%) | 27
Cat 02 | 0 (0.00%) | 583 | Cat 15 | 0 (0.00%) | 14,992
Cat 03 | 1,559,473 (3.40%) | 10,131,227 | Cat 16 | 1,559,473 (3.40%) | 5,461,306
Cat 04 | 1,559,473 (3.40%) | 2,202,608 | Cat 17 | 0 (0.00%) | 10
Cat 05 | 0 (0.00%) | 305 | Cat 18 | 0 (0.00%) | 5,652
Cat 06 | 5,540,625 (12.09%) | 24 | Cat 19 | 20,172,858 (44.01%) | 2,173
Cat 07 | 0 (0.00%) | 12,517 | Cat 20 | 20,172,858 (44.01%) | 4
Cat 08 | 0 (0.00%) | 633 | Cat 21 | 1,559,473 (3.40%) | 7,046,547
Cat 09 | 0 (0.00%) | 3 | Cat 22 | 34,955,073 (76.25%) | 18
Cat 10 | 0 (0.00%) | 93,145 | Cat 23 | 0 (0.00%) | 15
Cat 11 | 0 (0.00%) | 5,683 | Cat 24 | 1,559,473 (3.40%) | 286,181
Cat 12 | 1,559,473 (3.40%) | 8,351,593 | Cat 25 | 20,172,858 (44.01%) | 105
Cat 13 | 0 (0.00%) | 3,194 | Cat 26 | 20,172,858 (44.01%) | 142,572

Table 3: Categorical variables original training set

An overview of the characteristics of the categorical variables is given in Table 3. The only characteristics that can be extracted are the number of missing values and the number of unique values. Some categorical variables have millions of unique values (cat. 03 and cat. 12), while other categorical variables only have a few different values (cat. 09). There are 14 categorical variables without missing values.

As can be seen in Table 2 and Table 3, a lot of variables have missing values; in total only 756,554 rows are complete. Even so, there is still information that can be used to predict whether someone will click or not. Missing observations can create bias, but the goal is not to estimate the correct weight for each categorical variable but to predict the CTR.

3.2 Features

The next step is to generate features from these count and categorical variables. The laptop which was used for this research had an Intel Core i7-3610QM, 2.3 GHz with 8 GB of memory and 4 cores with hyper-threading. Some models require that all data is loaded into memory. With a dataset of this magnitude this is not possible: a program will crash (R) or generate errors (Python) when this dataset is loaded into memory. The size of the training sample is 8.94 GB and the test sample is 1.47 GB. Because the data is already large, it is also not attractive to generate features and store them on a hard disk, as this would make the file even larger. To create features, the count and categorical variables have been treated differently: the first step below is only applied to the count variables and the second step only to the categorical variables.

Step 1:

For each count variable:
- Determine all possible values.
- Sort the values from small to large.
- Split the sorted values into 40 groups of equal size.
- Save the largest value in every group.
- If some groups have the same largest value, save this value only once.

Step 2:

For each categorical variable:
- Determine all possible options and count their occurrence.
- Sort the possible options by occurrence and delete the values which occur less than 100 times.
- Store the remaining possible values.

Step 3:

- Count all possible options for each count and categorical variable.
- Give each possibility a unique number starting from 1.


Step 4:

For each row in the dataset:
    For each column in the dataset:
        if the column is a count variable:
            - Allocate the value to the first group whose saved largest value (from step 1) is greater than or equal to the value.
            - Apply the recoding scheme created in step 3.
            - If the value is missing, no number is saved.
        else:
            - Apply the recoding scheme created in step 3.
            - If the value does not occur among the stored options, no number is saved.

The problem with the count variables is that the higher values are widespread (Table 2). Therefore, using step 1, every value of a count variable is placed into a group, with a maximum of 40 groups per count variable. For all count variables except count variable 2, the missing values form the reference group. For count variable 2 the reference group is the negative values, because this variable has no missing values. The idea of a group is that it probably does not matter whether a count variable is 155 or 156, but the effect is probably different from 0 or 1. This also solves the problem of missing values by placing them in a reference group.

The same problem occurs for the categorical variables. Some categorical variables have a lot of unique values (Table 3). If these values are not removed, the number of features would run into the millions (almost as many as the number of observations). In comparison with a count variable there is no order in the data, so categorical variables cannot be placed into groups. The weight of a feature cannot be estimated precisely if the value only occurs a few times. Therefore, in step 2, the least occurring values of the categorical variables are deleted. This also reduces the calculation time later when training a model.

The reason why unique numbers need to be created in step 3 is to locate the column of the feature vector which is used in step 4. The constant is also included, with number 0. As stated before, not every value of a categorical variable is used; in step 4 these values are removed from the dataset.

This processed dataset can now be used to train and test a model. Note that steps 1, 2 and 3 are only for the training set, because the model needs to be based on the features of the training set which are known in advance. The features of the test set are only known after training. Step 4 is applied for both the training and test set.
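As an illustration, a minimal sketch of steps 1 to 4 on a pandas DataFrame (the use of pandas, the function names and the data structures are assumptions for this sketch, not the thesis implementation):

    import numpy as np
    import pandas as pd

    def fit_encoding(train, count_cols, cat_cols, n_groups=40, min_count=100):
        """Steps 1-3: group boundaries, frequent categories and the recoding scheme."""
        boundaries = {}
        for c in count_cols:                              # step 1
            values = np.sort(train[c].dropna().to_numpy())
            groups = np.array_split(values, n_groups)
            boundaries[c] = sorted({g[-1] for g in groups if len(g) > 0})

        kept = {}
        for c in cat_cols:                                # step 2
            counts = train[c].value_counts()
            kept[c] = set(counts[counts >= min_count].index)

        feature_index = {}                                # step 3: feature -> column number
        next_id = 1                                       # 0 is reserved for the constant
        for c in count_cols:
            for b in boundaries[c]:
                feature_index[(c, b)] = next_id
                next_id += 1
        for c in cat_cols:
            for v in kept[c]:
                feature_index[(c, v)] = next_id
                next_id += 1
        return boundaries, kept, feature_index

    def encode_row(row, boundaries, kept, feature_index):
        """Step 4: return the column numbers of the active features of one row."""
        active = [0]                                      # the constant
        for c, bounds in boundaries.items():
            value = row[c]
            if pd.isna(value):
                continue                                  # missing value: reference group
            for b in bounds:
                if value <= b:                            # first group boundary >= value
                    active.append(feature_index[(c, b)])
                    break
        for c, kept_values in kept.items():
            if row[c] in kept_values:
                active.append(feature_index[(c, row[c])])
        return active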

For the CTR dataset of Criteo Labs 168,414 features are created using values which occur at least 100 times for the categorical variables. Table 4 contains the number of features for each variable. The main goal of the next section is to define different models and how to compare them.

Variable | Features | Variable | Features | Variable | Features | Variable | Features | Variable | Features
Count 01 | 13 | Count 09 | 39 | Cat 04 | 21,105 | Cat 12 | 18,867 | Cat 20 | 3
Count 02 | 26 | Count 10 | 4 | Cat 05 | 181 | Cat 13 | 3,082 | Cat 21 | 19,316
Count 03 | 25 | Count 11 | 11 | Cat 06 | 14 | Cat 14 | 26 | Cat 22 | 12
Count 04 | 21 | Count 12 | 6 | Cat 07 | 9,829 | Cat 15 | 6,641 | Cat 23 | 14
Count 05 | 40 | Count 13 | 21 | Cat 08 | 341 | Cat 16 | 19,821 | Cat 24 | 13,684
Count 06 | 38 | Cat 01 | 683 | Cat 09 | 2 | Cat 17 | 9 | Cat 25 | 57
Count 07 | 22 | Cat 02 | 537 | Cat 10 | 15,171 | Cat 18 | 3,153 | Cat 26 | 11,076
Count 08 | 28 | Cat 03 | 18,508 | Cat 11 | 4,411 | Cat 19 | 1,576 | Total | 168,414

Table 4: Number of features per variable

4 Models

Using the feature vector of Section 3.2 a model can be trained and tested. In this section all models are explained. In Section 4.1 the evaluation process is described using the logloss function. In Section 4.2 a benchmark is created so results can be compared to other models. In Section 4.3 the first-order method stochastic gradient descent and some variants of this method are described. The Quasi-Newton algorithm and Online Newton Step are the topics of Section 4.4.

4.1 Evaluation process

Every model needs to be evaluated. To compare results of the different models two criteria are considered, namely the Logloss and the accuracy. The Logarithmic Loss or the Logistic Loss or Logloss is created in the following way. The probability of a click is given by

$$p_i = P(y_i = 1 \mid x_i, w),$$

in which y stands for click (y = 1) or no click (y = 0), x stands for all features and w the weights of the model. The likelihood is given by

$$L = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}.$$

For the variable $p_i$ different functions can be used. It is important that they can deal with probabilities: the value of $p$ has to be between zero and one. Known models which can deal with this restriction are the probit and logit models. Only the logit or logistic regression model is used, which is defined as follows

$$P(y = 1 \mid x, w) = \frac{1}{1 + e^{-w^T x}}.$$

The variable p is a function of x and w and taking the negative log likelihood and dividing by n gives the logloss function

$$\ell = \frac{1}{n}\sum_{i=1}^{n}\left(-y_i \log p_i - (1 - y_i)\log(1 - p_i)\right).$$


The logloss function compares the true value with the outcome of the model. The larger the distance between the estimated p and the true value of y the larger the value that is added to the logloss. A lower logloss score indicates a better model.

It is clear that a lower logloss score is preferred, but ultimately a decision has to be made: does a user click or not? Therefore a second prediction criterion is added. If an estimated p value is higher than 0.5 the prediction is that a user will click; if it is lower than 0.5 the prediction is that a user will not click. The accuracy is then calculated as the proportion of predictions that are correct.
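A minimal sketch of both evaluation criteria (function names and the example numbers are illustrative):

    import numpy as np

    def logloss(y, p, eps=1e-15):
        # clip to avoid log(0) for predictions of exactly 0 or 1
        p = np.clip(p, eps, 1 - eps)
        return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

    def accuracy(y, p, threshold=0.5):
        return np.mean((p > threshold) == (y == 1))

    y = np.array([1, 0, 0, 1])
    p = np.array([0.8, 0.2, 0.6, 0.4])
    print(logloss(y, p), accuracy(y, p))  # approximately 0.5697 and 0.5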

4.2 CTR as benchmark

One of the goals of this research is to find a model that gives accurate results in a reasonable amount of time. To determine the performance of a model, a simple model is used as a benchmark. The result of each model is a probability that someone will click or not. Therefore, as benchmark, the CTR is taken as the probability that someone will click. This probability is in this case the same for everyone, namely

$$\hat{p}_b = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

4.3 Stochastic Gradient Descent

Stochastic gradient descent, also known as sequential gradient descent or on-line gradient descent, is an iterative method to find a local minimum of a function f, which is in this case the logloss function. In every iteration the weight vector is updated based on one data point in the following way

$$w_t = w_{t-1} - \eta g_t,$$

in which $g_t$ stands for the gradient of f and $\eta$ for a learning parameter. This can be repeated for the entire dataset. In the initialization step an $\eta$ and a $w$ must be chosen. To determine the gradient of the logloss function, the derivative of the logit model is required, which is equal to $p_i(1 - p_i)$ (Heij et al., 2004). Using this result the gradient of the logloss function becomes equal to $\sum_{i=1}^{n}(p_i - y_i)x_i$ (Heij et al., 2004). The updating step of the weight vector in every iteration changes into

$$w_t = w_{t-1} + \eta(y_t - p_t)x_t.$$

The advantage is that the entire dataset does not have to be in memory, only one data point at a time. Choosing a learning rate can be difficult. The learning rate needs to be small to be stable, but convergence will be slow. If the learning rate is too large, convergence will be fast but unstable (Toulis and Airoldi, 2015). A property of stochastic gradient descent is that it is possible to escape a local minimum (Bishop, 2006).

There are different variants based on stochastic gradient descent which try to improve the results. Most of these variants are also discussed in the papers of Ruder (2016) and van Velzen (2017). The first variant is the addition of a momentum term to increase the rate of convergence. The updating step becomes

$$v_t = \gamma v_{t-1} + \eta(y_t - p_t)x_t$$
$$w_t = w_{t-1} + v_t,$$

in which $\gamma$ is the momentum parameter and $v_t$ the change in direction within an iteration. The momentum term is a fraction of the previous update vector. This method helps accelerate SGD in the relevant direction while it reduces oscillations (Qian, 1999).

Another variant is the average stochastic gradient descent algorithm (Polyak and Juditsky, 1992). The update step is almost the same as in regular SGD, but the difference is that the next weight vector is the average of past weight vectors.

$$w_t = w_{t-1} + \eta(y_t - p_t)x_t, \qquad \bar{w}_t = \frac{1}{t}\sum_{i=1}^{t} w_i.$$


Regular SGD, Momentum and the average SGD algorithm have learning rates which are the same for every feature and remain constant. The following variants change the learning rate over time. The Adagrad algorithm (Adaptive Gradient Algorithm) applies different learning rates to the different features. The learning rate of features that occur frequently is lower than the learning rate of infrequent features. The intuition behind this algorithm is that if a feature is seen that is normally rarely seen, it should have a larger impact on the model (Duchi et al., 2011). The updating step keeps track of the past gradients in the following way

$$v_{t+1} = v_t + \mathrm{diag}(g_t g_t^T)$$
$$w_{t+1} = w_t + \frac{\eta}{\sqrt{v_{t+1}} + \epsilon}(y_t - p_t)x_t,$$

where the parameter $\epsilon$ is a small number to prevent dividing by zero. In every iteration each element of the gradient is squared. The result is that the denominator keeps on growing, because the added term is always positive, and the learning rates will therefore decline. Eventually the learning rate of frequently occurring features will be very small and the model stops learning for those features.
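A minimal sketch of this per-feature update for the sparse 0/1 features used in this thesis (illustrative code with illustrative parameter values, not the thesis implementation):

    import numpy as np

    def adagrad_step(w, v, active_idx, y, eta=0.01, eps=1e-8):
        """One Adagrad step for a logistic model with 0/1 features.

        w is the weight vector, v the accumulated squared gradients,
        active_idx the indices of the features equal to 1, y the label (0 or 1).
        """
        p = 1.0 / (1.0 + np.exp(-np.sum(w[active_idx])))  # w'x over active features only
        g = p - y                                         # gradient element for active features
        v[active_idx] += g * g                            # accumulate squared gradients
        w[active_idx] += eta / (np.sqrt(v[active_idx]) + eps) * (y - p)
        return p

    n_features = 168_415
    w = np.zeros(n_features)
    v = np.zeros(n_features)
    adagrad_step(w, v, active_idx=np.array([0, 17, 523]), y=1)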

Another optimization variant is the Adadelta algorithm, which is derived from the Adagrad algorithm to improve on two main drawbacks (Zeiler, 2012). The first drawback is the continual decay of the learning rates and the second drawback is that a learning parameter needs to be selected. Instead of using all past squared gradients in the same way, this algorithm implements an exponentially decaying average of the squared gradients. The learning rate parameter is rewritten using the idea of the Hessian approximation. The initialization step is to choose a decay rate $\rho$, an $\epsilon$ and an initial weight vector. The following two variables are set to 0:

$$\gamma_0 = 0, \qquad \overline{\Delta w^2}_0 = 0.$$

The weight vector update, starting with $t = 1$, becomes

$$\gamma_t = \rho\gamma_{t-1} + (1 - \rho)g_t^2$$
$$RMS_g = \sqrt{\gamma_t + \epsilon}$$
$$RMS_{\Delta w} = \sqrt{\overline{\Delta w^2}_{t-1} + \epsilon}$$
$$\Delta w_t = -\frac{RMS_{\Delta w}}{RMS_g} g_t$$
$$\overline{\Delta w^2}_t = \rho\overline{\Delta w^2}_{t-1} + (1 - \rho)\Delta w_t^2$$
$$w_{t+1} = w_t + \Delta w_t.$$

RMSProp is an unpublished variant on SGD (Tieleman and Hinton, 2012). The algorithm is mostly the same as the Adadelta algorithm:

$$\gamma_t = \rho\gamma_{t-1} + (1 - \rho)g_t^2$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\gamma_t + \epsilon}} g_t.$$

The difference with Adadelta is that $\eta$ needs to be initialized and the suggestion for $\rho$ is a constant equal to 0.9. Dividing the gradient by $\sqrt{\gamma_t + \epsilon}$ improves the learning part of the model.

The last variant of SGD that is discussed is the Adam (Adaptive Moment Estimation) algorithm (Kingma and Ba, 2015). This algorithm uses exponential decay rates for the moment estimates, not only for the first but also for the second moment. Both first and second moment estimates are bias-corrected. The update step is as follows

$$v_t = \rho_1 v_{t-1} + (1 - \rho_1)g_t$$
$$\hat{v}_t = \frac{v_t}{1 - \rho_1^t}$$
$$u_t = \rho_2 u_{t-1} + (1 - \rho_2)g_t^2$$
$$\hat{u}_t = \frac{u_t}{1 - \rho_2^t}$$
$$w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{u}_t} + \epsilon}\hat{v}_t.$$

The default settings mentioned in Kingma and Ba (2015) are $\eta = 0.001$, $\rho_1 = 0.9$, $\rho_2 = 0.999$ and $\epsilon = 10^{-8}$.
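A minimal sketch of these update steps applied to the logloss gradient of a single observation (illustrative code with a dense weight vector, not the thesis implementation):

    import numpy as np

    def adam_step(w, v, u, g, t, eta=0.001, rho1=0.9, rho2=0.999, eps=1e-8):
        """One Adam update; w, v, u and g are vectors, t is the step number (starting at 1)."""
        v[:] = rho1 * v + (1 - rho1) * g        # first moment estimate
        u[:] = rho2 * u + (1 - rho2) * g ** 2   # second moment estimate
        v_hat = v / (1 - rho1 ** t)             # bias corrections
        u_hat = u / (1 - rho2 ** t)
        w[:] = w - eta / (np.sqrt(u_hat) + eps) * v_hat
        return w

    d = 10
    w, v, u = np.zeros(d), np.zeros(d), np.zeros(d)
    x = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0])
    y = 1.0
    for t in range(1, 6):
        p = 1.0 / (1.0 + np.exp(-w @ x))
        g = (p - y) * x                         # gradient of the logloss for this observation
        w = adam_step(w, v, u, g, t)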

4.4 Quasi-Newton and Online Newton Step

Online Newton Step (ONS) is a second-order iterative algorithm to optimize an objective function. In comparison with SGD, besides the gradient the Hessian matrix $H_t$ is also required (Ye et al. (2017) and Civek and Kozat (2017)):

$$H_t = H_{t-1} + g_t g_t^T$$
$$w_t = w_{t-1} - \eta (H_t)^{-1} g_t.$$

The Hessian update is recursive and can be rewritten using the matrix inversion lemma (Civek and Kozat, 2017)

$$(H_t)^{-1} = (H_{t-1})^{-1} - \frac{(H_{t-1})^{-1} g_t g_t^T (H_{t-1})^{-1}}{1 + g_t^T (H_{t-1})^{-1} g_t}.$$
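A minimal sketch of one ONS step using this inverse-Hessian update (illustrative; the identity-matrix initialization of the inverse Hessian is an assumption):

    import numpy as np

    def ons_step(w, H_inv, x, y, eta=0.001):
        """One Online Newton Step for the logistic model; H_inv is updated in place."""
        p = 1.0 / (1.0 + np.exp(-w @ x))
        g = (p - y) * x                    # gradient of the logloss for this observation
        H_inv_g = H_inv @ g
        # matrix inversion lemma applied to H_t = H_{t-1} + g g'
        H_inv -= np.outer(H_inv_g, H_inv_g) / (1.0 + g @ H_inv_g)
        w = w - eta * (H_inv @ g)          # uses the updated inverse Hessian
        return w, H_inv

    d = 294                                # number of features (as an example)
    w = np.zeros(d)
    H_inv = np.eye(d)                      # assumed initialization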

This is a direct update from $(H_{t-1})^{-1}$ to $(H_t)^{-1}$. For large amounts of features, storing and updating the Hessian matrix is an expensive calculation and also memory intensive. There are different methods to update the Hessian matrix. Quasi-Newton methods try to make an approximation of the Hessian. The secant condition needs to hold for the approximation of the Hessian from time $t-1$ to time $t$ (Yamashita, 2008)

$$(H_t)^{-1} y_{t-1} = \Delta w_{t-1},$$

where $y_{t-1} = g_t - g_{t-1}$ and $\Delta w_{t-1} = w_t - w_{t-1}$. There are different update formulas for the inverse Hessian, but in this research only the DFP and BFGS update formulas are used. The weights are updated according to

$$w_t = w_{t-1} - (H_{t-1})^{-1} g_{t-1}.$$

The description of the DFP update is (Yamashita, 2008)

$$(H_t)^{-1} = (H_{t-1})^{-1} - \frac{(H_{t-1})^{-1} y_{t-1} y_{t-1}^T (H_{t-1})^{-1}}{y_{t-1}^T (H_{t-1})^{-1} y_{t-1}} + \frac{\Delta w_{t-1} \Delta w_{t-1}^T}{\Delta w_{t-1}^T y_{t-1}}.$$

The description of the BFGS update is (Yamashita, 2008)

$$(H_t)^{-1} = (H_{t-1})^{-1} - \frac{(H_{t-1})^{-1} y_{t-1} \Delta w_{t-1}^T + \Delta w_{t-1}\left((H_{t-1})^{-1} y_{t-1}\right)^T}{\Delta w_{t-1}^T y_{t-1}} + \left(1 + \frac{y_{t-1}^T (H_{t-1})^{-1} y_{t-1}}{\Delta w_{t-1}^T y_{t-1}}\right)\frac{\Delta w_{t-1} \Delta w_{t-1}^T}{\Delta w_{t-1}^T y_{t-1}}.$$

5 Results

In this section the results are presented using the data of Section 3 and the models of Section 4. In Section 5.1 the results of the benchmark are discussed. In Section 5.2 the results of the first-order algorithms, stochastic gradient descent and its variants, are presented. In Section 5.3 the results of Quasi-Newton and Online Newton Step are discussed.

For almost every algorithm a fixed window and a rolling window is used. If a fixed window is used then the number of observations in the training set remains the same. If a rolling window is used then the number of observations in the training set increases after a prediction has been made on an observation in the test set. This is illustrated in Figure 1.

Figure 1: Difference fixed and rolling window where black is the training set and red the observation that is tested
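A minimal sketch of the difference between the two schemes (predict and update_model are placeholders for whichever algorithm is being evaluated):

    def evaluate(test_set, w, predict, update_model, rolling=False):
        """Score every test observation in order; with a rolling window the
        observation is used to update the weights right after it is scored."""
        predictions = []
        for x, y in test_set:
            predictions.append(predict(w, x))
            if rolling:
                w = update_model(w, x, y)  # the fixed window skips this step
        return predictions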

5.1 CTR as Benchmark

The CTR of the training sample is 25.56%. Because this number is lower than 50%, the benchmark algorithm predicts no click for every user in the test set. The CTR of the test sample is 25.99%, so the accuracy of the benchmark is 74.01%. An overview of the results is given in Table 5. The logloss decreases slightly if the test set is also used for training, because the average CTR moves toward the CTR of the test set.


Algorithm | Parameters | Epochs | Training time (min) | Fixed window: logloss, accuracy, test time | Rolling window: logloss, accuracy, test time
CTR | - | 1 | 3.2 | 0.57295, 74.01%, 0.9 | 0.57294, 74.01%, 0.9

Table 5: Result Benchmark

The winners of the competition in 2014 had a logloss of approximately 0.445.[3] It is not possible to directly compare this result with the results of this research, because there are some differences in the datasets used. Nonetheless it gives some insight into how effective the trained models are.

5.2 Stochastic Gradient Descent

In Section 4.3 the formulas for SGD are given. These can be implemented directly, but two optimizations have been applied to reduce calculation time. One of the two expensive operations is calculating $p_i$ in

$$P(y = 1 \mid x, w) = \frac{1}{1 + e^{-w^T x}},$$

in which $w$ and $x$ are vectors of length 168,415. In every iteration the inner product of those two vectors needs to be calculated. The features $x$ only consist of dummy variables, so every weight is multiplied by zero or one. There are only 39 variables and a constant, so at most 40 features are equal to one. Therefore the operation $w^T x$ can be optimized by summing only the weights of $w$ for which $x = 1$. The second optimization is applied to the update of the weight vector

$$w_{t+1} = w_t + \eta(y_t - p_t)x_t.$$

The same argument applies as before: the features $x$ consist only of zeros and ones, so only the weights for which $x = 1$ need to be updated.
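A minimal sketch of an SGD step with both optimizations, where each observation is represented only by the indices of its active features (illustrative code, not the thesis implementation):

    import math

    def sgd_step(w, active_idx, y, eta=0.001):
        """One SGD update for the logistic model with sparse 0/1 features.

        w is the weight list (index 0 is the constant), active_idx contains the
        indices of the features equal to 1, y is the observed label (0 or 1).
        """
        # optimization 1: w'x is just the sum of the active weights
        score = sum(w[j] for j in active_idx)
        p = 1.0 / (1.0 + math.exp(-score))
        # optimization 2: only the active weights change
        step = eta * (y - p)
        for j in active_idx:
            w[j] += step
        return p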

SGD applied to the entire dataset, with a learning parameter of η = 0.001, has a logloss of 0.47207 and an accuracy of 77.90%. In comparison with the benchmark the accuracy is 3.89% higher and the logloss is 0.1009 lower. One of the problems with this dataset, explained in Section 3.1, is that it is unknown what kind of variables are included. The question then is whether all those variables and features are needed to get an accuracy of 77.90%. To determine which variables cause the largest improvements, SGD has been applied separately to every variable. The results have been ordered by logloss and are shown in Figure 2. Count variable 06 is the best performing variable if it is the only explanatory variable used in SGD; the second best performing variable is categorical variable 07. Some of the variables do not explain the CTR, like categorical variable 21 and categorical variable 26; their logloss values are not even lower than the logloss of the benchmark.

Figure 2: Results logloss of SGD on each variable

To determine the effect of adding more variables to the model, SGD has been applied 38 more times. In the first round the two best variables are included (count 06 and cat. 07), in the second round the three best variables (count 06, cat. 07 and cat. 15), etcetera; in the last round every variable is included. The results for the logloss are in Figure 3. As expected, adding more variables leads to a lower logloss. When the number of variables is 0, the result of the benchmark is used. The first difference of the logloss is shown in Figure 4. In the beginning adding a variable has a large effect on the decline of the logloss. In the end this effect goes to zero, but sometimes the difference is larger than zero, which results in a higher logloss. Visible are the cases of cat. 08 and cat. 05, but the differences of the logloss from adding cat. 02, cat. 09, cat. 19, cat. 20, cat. 25, cat. 01, cat. 16 and cat. 21 are also positive.

Figure 3: Results SGD logloss

Figure 4: Results SGD first differences logloss

The same pattern applies to the accuracy, shown in Figure 5. Adding more variables leads to a higher accuracy. When the number of variables is 0, the result of the benchmark is used. The first difference of the accuracy is shown in Figure 6. Not all variables lead to a higher accuracy: the variables cat. 06, cat. 08, cat. 05, cat. 01, cat. 03 and cat. 26 impact the accuracy negatively.

The lowest logloss score is achieved after using the best 28 variables (0.47165). The highest accuracy is achieved after using the best 27 variables (77.91%). The logloss score is slightly higher for 27 variables, namely 0.47176. As a result the last 12 variables will not be used in the remainder of this section. This reduces the number of features from 168,415 to 77,971.


Figure 5: Results SGD accuracy

Figure 6: Results SGD first differences accuracy

The initial weight vector in SGD and the variants of SGD is a zero vector. Therefore the first batch of examples is used to point the weights in a certain direction. To improve the accuracy, the dataset has been fed to the model a maximum of ten times (epochs). To speed up calculation, optimizations have been applied to the different algorithms. For Momentum the idea is to push the weights in a certain direction for multiple iterations. This does not work here, because the dataset is very sparse and by the second time a feature occurs this effect has already diminished; the effect is then the same as in normal SGD. To determine whether the idea still works, momentum is applied when a feature occurs instead of in every iteration. For averaging it is not necessary to calculate the average weights in every iteration; only the average weights for the features that occur are needed. This is also the case for the last four algorithms. The results of SGD and the variants of SGD are presented


in Table 6 and in Figures 7, 8 and 9.

Algorithm | Parameters | Epochs | Training time (min) | Fixed window: logloss, accuracy, test time | Rolling window: logloss, accuracy, test time
CTR | - | 1 | 3.2 | 0.57295, 74.01%, 0.9 | 0.57294, 74.01%, 0.9
SGD | η = 0.001 | 10 | 214.7 | 0.46574, 78.20%, 2.8 | 0.46433, 78.29%, 3.7
Momentum | η = 0.0008, γ = 0.8 | 10 | 248.0 | 0.46941, 77.94%, 2.8 | 0.46559, 78.21%, 4.3
Average | η = 0.001 | 1 | 56.6 | 0.47929, 77.54%, 2.9 | 0.47843, 77.57%, 10.0
Adagrad | η = 0.012, ε = 10^-8 | 10 | 268.5 | 0.46516, 78.23%, 2.8 | 0.46505, 78.24%, 4.7
Adadelta | ρ = 0.9, ε = 10^-8 | 5 (6) | 223.5 (268.1) | 0.47480, 78.02%, 2.8 | 0.47337, 78.13%, 7.6
RMSProp | η = 0.001, ε = 10^-8, ρ = 0.9 | 1 | 30.3 | 0.47864, 77.70%, 2.9 | 0.47733, 77.95%, 2.9
Adam | η = 0.001, ε = 10^-8, ρ1 = 0.9, ρ2 = 0.999 | 10 | 348.7 | 0.46704, 78.11%, 2.8 | 0.46437, 78.27%, 6.0

Table 6: Result SGD and the variants of SGD

Table 6 shows that SGD and its variants perform better than using the CTR only, but some variants are better than others. The Average algorithm performed badly because after one epoch the stored past weights became too large to keep in memory, so only one epoch succeeded. The algorithms Adadelta and RMSProp did not need the full 10 epochs to reach their lowest logloss value: for Adadelta the fixed window needed five epochs and the rolling window six. The best performing algorithms were SGD, Adagrad and Adam. Using a rolling window improves the accuracy for every algorithm, albeit only very slightly for Adagrad, which is as expected.

Looking at a fixed window, the lowest logloss and the highest accuracy occurred for the Adagrad algorithm. With a rolling window, SGD had the lowest logloss and the highest accuracy. However, a lower logloss does not always mean a higher accuracy: the logloss of Adadelta in a fixed window is larger than the logloss of Momentum, but so is the accuracy.

Figure 7 shows the change in logloss over time for the different algorithms. On the left a fixed window is used and on the right a rolling window is used. It is clear that RMSProp, Average and Adadelta do not perform well in comparison with the other algorithms. After only one epoch Adam performs best, but after a few epochs SGD


and Adagrad perform better than Adam using a fixed window. Using a rolling window only SGD performs better than Adam after nine epochs. Momentum also seems to have a good performance using a rolling window and one epoch, but in the end Momentum is surpassed by SGD, Adam and Adagrad.

Figure 7: Results SGD and the variants of SGD with on the left a fixed window and on the right a rolling window, logloss on the y-axis and the number of epochs on the x-axis

Figure 8: Results SGD and the variants of SGD with on the left a fixed window and on the right a rolling window, accuracy on the y-axis and the number of epochs on the x-axis

Figure 8 shows the change in accuracy over time for the different algorithms. On the left a fixed window is used and on the right a rolling window. Most of the conclusions are the same as for Figure 7. Strange is the behaviour of RMSProp, which is downward sloping using a fixed window but remains almost constant over time using a rolling window.

Figure 9 shows the results of the three best performing algorithms over time. On the left the logloss is presented and on the right the accuracy is shown. Using a rolling window does increase the performance, but the distance between the rolling window and the fixed window over time is also interesting. In the case of SGD and Adam it seems to remain constant, but in the case of Adagrad the distance becomes smaller over time. Apparently the test set contains some new information which improves the result.

Figure 9: Results SGD, ADAM and Adagrad, logloss on the y-axis on the left, accuracy on the y-axis on the right and the number of epochs on the x-axis

In both this research and the research of van Velzen (2017) the normal SGD algorithm and Adagrad perform well in comparison with the other variants. The Adam algorithm performs well in this research, but this was not the case in the research of van Velzen (2017), where it was the worst performing algorithm of all the variants and SGD. Another difference is the moment of convergence: there, Adam and Adagrad needed respectively 5 and 7 epochs to converge, and using more epochs the logloss would increase. In this research this was not the case: after every epoch the logloss of both Adam and Adagrad was lower than before.

5.3 Quasi-Newton and Online Newton Step

Unfortunately not all features can be used to assess the second-order methods due to memory limitations, which is a big disadvantage. To make a comparison between SGD, Quasi-Newton and ONS, the focus in this section lies on the features derived from the count variables.


The online version of the Quasi-Newton methods did not perform according to expectations in all cases. The problem lies with the inverse Hessian matrix: its values quickly went to 0, with the consequence that the weights no longer changed for newer observations. Therefore the logloss values are high and the accuracy low, in some extreme cases around 31%. At the end of this section an offline version of Quasi-Newton is discussed.

Using the method described in Section 3.2 the count variables have 294 features. The results of predicting the CTR using the different methods are in Table 7.

Algorithm | Parameters | Observations | Training time (min) | Fixed window: logloss, accuracy, test time | Rolling window: logloss, accuracy, test time
CTR | - | 1m / 0.5m | 0.0 | 0.55328, 75.85%, 0.0 | 0.55312, 75.85%, 0.0
CTR | - | 1 epoch | 3.2 | 0.57295, 74.01%, 0.9 | 0.57294, 74.01%, 0.9
SGD | η = 0.001 | 1m / 0.5m | 0.5 | 0.49930, 76.95%, 0.2 | 0.49814, 76.98%, 0.3
SGD | η = 0.001 | 1 epoch | 21.5 | 0.51678, 75.47%, 3.0 | 0.51592, 75.55%, 4.2
ONS | η = 0.001 | 1m / 0.5m | 17.5 | 0.68906, 76.65%, 0.2 | 0.68896, 76.67%, 8.8
ONS | η = 0.001 | 1 epoch | 693.4 | 0.68814, 75.17%, 2.9 | 0.68812, 75.17%, 115.0

Table 7: Results count variables

Two different sizes of the dataset have been used; these are given in the column Observations. The entire dataset is used when "1 epoch" is noted. When "1m / 0.5m" is given, the first million observations of the training set are used together with the first half million of the test set. Table 7 shows that SGD has the lowest logloss values and the highest accuracy. ONS has a high logloss value in comparison with the CTR and SGD, but its accuracy is higher than the benchmark. However, ONS has a lower accuracy than SGD. Looking at the differences between using a fixed window and a rolling window, it seems that ONS stops learning much faster than SGD.

Using this many features is not preferable when second-order methods are used. In Section 3.2 each count variable has been divided into groups. The consequence is a larger amount of features. To further test the second-order methods another transformation is used. For each count variable the following transformation has been applied.


For each count variable:
- determine the minimum value
- determine the difference between 1 and the minimum value
- add this difference to each value
- take the logarithm of every value

In Section 3.1 it was shown that most of the count variables have low values, but the higher numbers are widespread. By taking the logarithm all these numbers are much closer to each other. There are a lot of missing values. If a missing value occurs it is set to 0. This reduces the number of count features from 294 to 13. The features are no longer 0 or 1. This means that also the difference between using feature scaling or not has been looked into. If feature scaling has been used to make sure that all values of a feature are between [0, 1] then this formula is used

$$x_{ij}^{n} = \frac{x_{ij}^{o}}{\max(x_i)}.$$
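A minimal sketch of this transformation and the optional scaling (illustrative; it assumes a count column stored as a NumPy array with NaN for missing values):

    import numpy as np

    def log_transform(col):
        """Shift the column so its minimum becomes 1, take logs, set missing values to 0."""
        shift = 1 - np.nanmin(col)
        out = np.log(col + shift)
        return np.where(np.isnan(out), 0.0, out)

    def scale(col):
        """Feature scaling: divide by the (training) maximum so all values lie in [0, 1]."""
        return col / np.max(col)

    raw = np.array([0.0, 3.0, np.nan, 600.0, 45.0])
    features = scale(log_transform(raw))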

Algorithm | Parameters | Feature scaling | Training time (min) | Fixed window: logloss, accuracy, test time | Rolling window: logloss, accuracy, test time
CTR | - | - | 3.2 | 0.57295, 74.01%, 0.9 | 0.57294, 74.01%, 0.9
SGD | η = 0.001 | no | 18.6 | 0.53955, 74.38%, 2.9 | 0.52892, 74.97%, 3.3
SGD | η = 0.001 | yes | 20.1 | 0.52477, 75.00%, 3.2 | 0.52408, 75.09%, 3.5
ONS | η = 0.001 | no | 32.0 | 0.68729, 74.80%, 2.9 | 0.68717, 74.80%, 5.5
ONS | η = 0.001 | yes | 32.7 | 0.68797, 74.43%, 3.1 | 0.68795, 74.44%, 5.7

Table 8: Result log count variables

The results of using only the 13 log-transformed features are in Table 8. Again SGD achieves the highest accuracy using only 1 epoch. ONS performs better without feature scaling and, when a fixed window is used, even has a higher accuracy than SGD without feature scaling.

Unfortunately a lot of values are set to zero due to missing values, which probably has a negative impact on the results. To test this, a new dataset has been created using only observations with no missing values. This reduces the training set to 4,647,298 observations and the test set to 838,234 observations. The same transformations have been applied as in Table 8. The results are in Table 9.

Algorithm | Parameters | Feature scaling | Training time (min) | Fixed window: logloss, accuracy, test time | Rolling window: logloss, accuracy, test time
CTR | - | - | 0.2 | 0.61967, 68.93%, 0.1 | 0.61967, 68.93%, 0.1
SGD | η = 0.001 | no | 2.3 | 0.60123, 68.77%, 0.4 | 0.57169, 71.26%, 0.4
SGD | η = 0.001 | yes | 2.4 | 0.56667, 71.40%, 0.4 | 0.56618, 71.47%, 0.5
ONS | η = 0.001 | no | 3.9 | 0.68993, 70.61%, 0.4 | 0.68992, 70.62%, 0.7
ONS | η = 0.001 | yes | 4.3 | 0.69028, 71.12%, 0.4 | 0.69026, 71.12%, 0.9

Table 9: Result log count variables without missing values

The accuracy is a lot lower compared to Table 8, but the reason is that the benchmark accuracy decreases from 74.01% to 68.93%: apparently the rows containing missing values have a lower CTR, so removing them raises the CTR of the sample. Again, to get the best results out of SGD it is important to scale the features, which increases the accuracy. In this case feature scaling also has a positive impact on ONS. Still, SGD performs better than the second-order algorithm ONS in both logloss and accuracy.

While important categorical variables have been omitted from this analysis, which reduces the accuracy, it still seems that SGD performs better than ONS. This has mainly to do with the Hessian matrix: the elements of the inverse Hessian matrix quickly become very small, which means that the weights cannot escape a local optimum. All results presented until now were generated using online algorithms. At the beginning of this section it was mentioned that the online Quasi-Newton methods did not perform according to expectation. Therefore the offline variant of BFGS has been tested as well, using the Python package SciPy. In this package the function scipy.optimize.minimize is used with the method BFGS. This was tested on a desktop with an Intel(R) Core(TM) i5-7600 CPU @ 3.50GHz with 16 GB of memory. The results are in Table 10.
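A minimal sketch of this offline fit (illustrative; the dense random data, the analytic gradient and the variable names are assumptions, not the thesis code):

    import numpy as np
    from scipy.optimize import minimize

    def logloss(w, X, y):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        p = np.clip(p, 1e-15, 1 - 1e-15)
        return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

    def gradient(w, X, y):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        return X.T @ (p - y) / len(y)

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 20)).astype(float)  # n x d matrix of 0/1 features
    y = rng.integers(0, 2, size=1000).astype(float)

    result = minimize(logloss, x0=np.zeros(X.shape[1]), args=(X, y),
                      jac=gradient, method="BFGS")
    w_hat = result.x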

(36)

Algorithm | Case | Observations | Training time (min) | Iterations | logloss | accuracy
BFGS | 294 count features | 1m / 0.5m | 14.7 | 320 | 0.49642 | 77.06%
BFGS | 13 count features | 1 epoch | 128.8 | 43 | 0.52748 | 75.03%
BFGS | 13 count features | no missing values | 15.9 | 42 | 0.56622 | 71.43%

Table 10: Results BFGS using SciPy

The results of BFGS are similar to the results of SGD using a fixed window with feature scaling. While the accuracy of BFGS is 0.03% or 0.11% higher, the training time is a lot longer. A disadvantage is that new observations cannot easily be taken into account.

The 16 GB of memory is not enough to train a BFGS algorithm with 294 count features on the entire training set. Therefore the last test is derived from a lecture slide called Thinking out the Box: Spraygun (Hastie, 2015), in which the training set was divided and trained separately. In this case the training set is divided into 40 parts and every part trains the BFGS algorithm separately. The outcome is 40 sets of weights. During the testing phase 40 different probabilities are calculated for every observation; the final probability used to determine the logloss and the accuracy is the average of those probabilities. The result is in Table 11.

Algorithm  Case                Observations    Training time (min)  Groups  logloss  accuracy
BFGS       294 count features  Entire dataset  525.1                40      0.51570  75.55%

Table 11: Results BFGS using SciPy

Again the results of BFGS are similar to those of SGD. The accuracy of BFGS is 0.08% higher than SGD using a fixed window and equal to SGD if a rolling window is used. The logloss of BFGS is 0.00108 lower than SGD using a fixed window and 0.00022 lower than SGD using a rolling window. The training time, however, is a lot longer: 525 minutes compared with 21.5 minutes.
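The chunk-and-average procedure behind Table 11 could be sketched as follows. This is only an illustration with randomly generated placeholder data (X_train, y_train and X_test are hypothetical); in practice each part of the training set would be read separately from disk.

import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logloss(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def logloss_grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# Hypothetical stand-ins for the real feature matrices and click labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40000, 20))
y_train = (rng.random(40000) < 0.26).astype(float)
X_test = rng.normal(size=(5000, 20))

# Split the training set into 40 parts and fit BFGS weights on each part separately.
n_groups = 40
weight_sets = []
for X_part, y_part in zip(np.array_split(X_train, n_groups),
                          np.array_split(y_train, n_groups)):
    fit = minimize(logloss, np.zeros(X_part.shape[1]), args=(X_part, y_part),
                   jac=logloss_grad, method='BFGS')
    weight_sets.append(fit.x)

# The final probability per test observation is the average over the 40 models.
p_test = np.mean([sigmoid(X_test @ w) for w in weight_sets], axis=0)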


6 Conclusion

In this research different models based on first- and second-order methods have been used to predict if someone will click on an ad. Because the CTR dataset was too large to fit into 8 GB of memory, primarily iterative methods have been used, with the advantage that only one observation at a time has to be loaded into memory. The evaluation was based on a logloss function and the accuracy of the prediction. The difference in accuracy between a fixed window (fixed sample size) and a rolling window (adding observations to the training set over time) was also part of the research.

The CTR of the test set with 6,500,000 observations is 25.99%, which was used as a benchmark value; other algorithms should increase the accuracy. The benchmark logloss value was 0.57295, which other algorithms should lower.

Stochastic gradient descent and the variants of SGD were the first-order methods, based only on the gradient. The variants of SGD are optimization methods intended to find a better result; most of them change the learning rate of the model over time, whereas SGD itself has a constant learning rate. The most successful algorithms were SGD itself and the variants Adam and Adagrad. These three algorithms had the lowest logloss value and the highest accuracy in both the fixed window and the rolling window. A maximum of 10 epochs was used. Most of the algorithms improved over time; only RMSProp and Adadelta stopped improving, after 1 and 5-6 epochs respectively. The other algorithms kept improving and had still not converged after 10 epochs. The accuracy using a rolling window was always higher than using a fixed window. The improvement of Adagrad using a rolling window was small, which was expected, because that is how the algorithm is set up. SGD and Adam both improved using a rolling window, even after multiple epochs; these two algorithms can tune the weights better using the newly acquired information.

Without using a rolling window, Adagrad was the best performing algorithm with 78.23% accuracy, but using a rolling window it increased by only 0.01% to 78.24%, while SGD went from 78.20% to 78.29% and Adam from 78.11% to 78.27%. If a rolling window is used, the best performing algorithm is therefore plain SGD. Every newly acquired observation takes only a few milliseconds to process. In this research the number of features was constant, so the algorithm cannot run out of memory. Over time new features will emerge, such as new types of devices, so the learning algorithm should be rerun from time to time with the new features.
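For clarity, the rolling-window updating with plain SGD amounts to the loop below: each test observation is predicted first, and its observed label is then used immediately to update the weights before the next prediction. The sketch uses hypothetical placeholder data and a logistic model with a constant learning rate η = 0.001; it is not the exact implementation used in this research.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical placeholders for the trained weights and the test stream.
rng = np.random.default_rng(0)
w = np.zeros(13)
X_test = rng.normal(size=(1000, 13))
y_test = (rng.random(1000) < 0.26).astype(float)
eta = 0.001

predictions = np.empty(len(y_test))
for t in range(len(y_test)):
    x_t = X_test[t]
    p_t = sigmoid(x_t @ w)                 # predict before the label is revealed
    predictions[t] = p_t
    w -= eta * (p_t - y_test[t]) * x_t     # one SGD step on the logloss of this observation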

Quasi-Newton with the DFP and BFGS update formulas and Online Newton Step were the second-order methods, based not only on the gradient but also on the Hessian matrix. Just like SGD these algorithms are iterative methods. The big disadvantage is that, due to memory limitations, the full feature vector based on all variables cannot be used. Therefore only the count variables were used to predict the CTR in this case. Three different tests were done, but in all cases the online second-order methods performed badly in comparison with SGD. Theoretically second-order algorithms should outperform first-order algorithms, but this was not the case. As expected, these second-order methods required much more time than the first-order methods. Using a rolling window is therefore not feasible in practice; moreover, the rolling window did not increase the accuracy or decrease the logloss in comparison with SGD. The offline BFGS algorithm gave small improvements in accuracy over SGD when only a fixed window is used.

While it was clear in Section 5.3 that applying a log transformation to the count variables was not ideal, it still provided some information for predicting whether someone will click on an ad. A better method was dividing the count variables into groups, which increased the performance, but a disadvantage is that groups of the same count variable are no longer related to each other.
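As a small illustration of this grouping, a count variable could be divided into buckets and one-hot encoded, so that each group receives its own weight. The bin edges below are purely hypothetical and not the boundaries used in this research.

import numpy as np

# Hypothetical count values and assumed group boundaries.
counts = np.array([0, 3, 17, 250, 4000])
edges = np.array([1, 10, 100, 1000])

group = np.digitize(counts, edges)              # group index per observation: 0, 1, 2, 3, 4

# One-hot encode the group so every bucket gets its own weight in the model.
one_hot = np.zeros((len(counts), len(edges) + 1))
one_hot[np.arange(len(counts)), group] = 1.0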

For future research on this subject the process of creating features can be looked into. In this research a particular value of a categorical variable was only used when it occurred at least 100 times. Over time the values of categorical variables can change, and new variables and/or values can emerge. It can be useful to directly implement these changes into the feature set if a new feature has a positive effect on the model.


References

[1] Asher Y. B. and Rotem N. The effect of unrolling and inlining for Python bytecode optimizations. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR '09, pages 14:1–14:14, New York, NY, USA, 2009. ACM.

[2] Bacon D. F., Graham S. L., and Sharp O. J. Compiler transformations for high-performance computing. ACM Comput. Surv., 26(4):345–420, December 1994.

[3] Bishop C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006, pages 240 - 241.

[4] Civek B. C. and Kozat S. S. Efficient implementation of newton-raphson methods for sequential data prediction. IEEE Transactions on Knowledge & Data Engineering, 29(12):2786–2791, 2017.

[5] Duchi J., Hazan E., and Singer Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[6] Facca F. M. and Lanzi P. L. Mining interesting knowledge from weblogs: a survey. Data & Knowledge Engineering, 53(3):225 – 241, 2005.

[7] Gower R. M. and Richtárik P. Randomized quasi-newton updates are linearly convergent matrix inversion algorithms. SIAM Journal on Matrix Analysis and Applications, 38(4):1380–1409, 2017.

[8] Graepel T., Candela J. Q., Borchert T., and Herbrich R. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). Omnipress, 2010.

[9] Hastie T. Statistical learning with big data. https://web.stanford.edu/~hastie/TALKS/SLBD_new.pdf, October 2015. Accessed January 2018.

[10] Heij C., de Boer P., Franses P. H., Kloek T., and van Dijk H. Econometric Methods with Appli-cations in Business and Economics. Oxford University Press, 2004, pages 447 - 450.

[11] IAB. Digital advertising revenues hit $19.6 billion in Q1 2017, climbing 23% year-over-year, according to IAB. https://www.iab.com/news/ad-revenues-hit-19-6b/, June 2017.

[12] Kingma D. P. and Ba J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.


[13] Langtangen H. P. Python Scripting for Computational Science. Springer Publishing Company, Incorporated, 3rd edition, 2009, pages 442 - 449.

[14] Polyak B. T. and Juditsky A. B. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, July 1992.

[15] Qian N. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145 – 151, 1999.

[16] Ruder S. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016.

[17] Shan L., Lin L., Sun C., and Wang X. Predicting ad click-through rates via feature-based fully coupled interaction tensor factorization. Electronic Commerce Research and Applications, 16(Supplement C):30–42, 2016.

[18] Singh N., Browne L., and Butler R. Parallel astronomical data processing with Python: Recipes for multicore machines. Astronomy and Computing, 2(Supplement C):1 – 10, 2013.

[19] Tieleman T. and Hinton G. Lecture 6.e rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for Machine learning, 2012.

[20] Toulis P. and Airoldi E. M. Implicit stochastic approximation. October 2015.

[21] van Velzen T. Speed is key: Decreasing time till convergence on CTR prediction models. July 2017.

[22] Yamashita N. Sparse quasi-Newton updates with positive definite matrix completion. Mathematical Programming, 115(1):1–30, September 2008.

[23] Ye H., Luo L., and Zhang Z. A unifying framework for convergence analysis of approximate newton methods. February 2017.

[24] Yuan S., Wang J., and Zhao X. Real-time bidding for online advertising: measurement and analysis. In Proceedings of the Seventh International Workshop on Data Mining for Online Advertising. Association for Computing Machinery (ACM), August 2013.


A Appendix

A.1 Examples Python optimizations

The Python examples below illustrate the optimization methods described in Section 2.3. The code is written in Python 3 and the goal of these examples is to show that using a certain optimization method can improve the run time. This does not mean that every example is optimized in the best way possible; for instance, the loop unrolling example can be made faster if the function np.sum is used. The packages time, numpy and Pool from multiprocessing need to be loaded into memory before these examples can be run. The run time of each example is noted in parentheses after each Method 1 and Method 2. For a full overview the different run times are also given in Table 12.

# Required packages
import time
import numpy as np
from multiprocessing import Pool

# Avoid loops M1 (119 s, 100%)
begin_time = time.time()
for i in range(40000000):
    list_string = ['1', '2', '3', '4', '5']
    len_list_string = len(list_string)
    for j in range(len_list_string):
        list_string[j] = int(list_string[j])
print(time.time() - begin_time)

# Avoid loops M2 (73 s, 61%)
begin_time = time.time()
for i in range(40000000):
    list_string = ['1', '2', '3', '4', '5']
    list_string = list(map(int, list_string))
print(time.time() - begin_time)

# Remove module prefix M1 (13.47 s, 100%)
begin_time = time.time()
for i in range(100000):
    result = []
    for j in range(1, 100):
        result.append(np.log(j))
print(time.time() - begin_time)

# Remove module prefix M2 (12.14 s, 91%)
begin_time = time.time()
np_log = np.log
for i in range(100000):
    result = []
    f_append = result.append
    for j in range(1, 100):
        f_append(np_log(j))
print(time.time() - begin_time)


# Loop unrolling M1 (33 s, 100%)
begin_time = time.time()
list_n = list(range(1000))
n = len(list_n)
for i in range(100000):
    total_sum = 0
    j = 0
    while j < n:
        total_sum += list_n[j]
        j += 1
print(time.time() - begin_time)

# Loop unrolling M2 (22 s, 66%)
begin_time = time.time()
list_n = list(range(1000))
n = len(list_n)
for i in range(100000):
    total_sum = 0
    j = 0
    while j < n:
        total_sum += list_n[j] + list_n[j + 1] + list_n[j + 2] + list_n[j + 3]
        j += 4
print(time.time() - begin_time)

# Multiprocessing M1 (20 s, 16%)
def Counter(a):
    count = 0
    for i in range(100000000):
        count += 1
    return count

if __name__ == '__main__':
    begin_time = time.time()
    pool = Pool(processes=8)
    results = pool.map(Counter, [1, 1, 1, 1, 1, 1, 1, 1])
    print(sum(results))
    print(time.time() - begin_time)

# Multiprocessing M2 (130 s, 100%)
begin_time = time.time()
count = 0
for i in range(800000000):
    count += 1
print(count)
print(time.time() - begin_time)

Optimization Technique   Method 1           Method 2
                         Time in s   in %   Time in s   in %
Avoid loops              119         100    73          61
Remove module prefix     13.47       100    12.14       91
Loop unrolling           33          100    22          66
Multiprocessing          20          16     130         100

Table 12: Run times of the Python optimization examples
