Engineering Features in Mobile Ad-Click Optimization

Academic year: 2021


University of Amsterdam

Mobile Professionals

Bachelor Thesis

Engineering Features in Mobile Ad-Click

Optimization

Author:

Lonneke Brakenhoff

10328882

Supervisor UvA: Dr. M. W. van Someren, FNWI
Supervisor MobPro: Dr. K. Holsheimer, Data Scientist at MobPro


Abstract

Currently, there are 17 million mobile users in the Netherlands, and these users generate a lot of data. A current topic of interest is the sheer amount of such data available and the challenge of extracting valuable knowledge from it. This thesis discusses what information is gathered from mobile users and how it is used. Mobile Professionals (MobPro) is a company that specializes in mobile advertising and marketing. They use this information to target mobile users more efficiently. In order to convert gathered data into useful knowledge, a model is needed. The goal of this thesis is to build an automated click predictor for mobile advertisers, with the aim of increasing click-through rates. A logistic regression model is trained using a modern technique known as Follow The Regularized Leader. The existing model's performance is improved by merging co-occurring features in the dataset, thereby engineering new features. This must be done carefully, so that only bona fide features that contribute to the model's performance are added. Naively, one may be tempted to interpret the logistic regression weights as measures of the importance of their corresponding features. By studying other metrics, such as Cohen's κ-coefficient and Scott's π-coefficient, it is shown that this naive intuition is not valid for the newly engineered composite features.


Contents

1 Introduction
2 Literature Review
   2.1 User Privacy
   2.2 Existing Methods
      2.2.1 Click-Through Rate
3 Machine Learning: Background
   3.1 Gradient Descent
   3.2 Linear Models
      3.2.1 Logistic Regression
   3.3 Online Learning and Follow The Regularized Leader
   3.4 How To Handle Sparse Categorical Data
4 Case Study: Real-Time Bidding
   4.1 Data Description
      4.1.1 General Features of the Data
   4.2 Approach
      4.2.1 Measuring co-occurrence of feature pairs
      4.3.1 Results of the co-occurring measurements
      4.3.2 Results of Feature Selection
5 Conclusion
   5.1 Discussion
   5.2 Further Research & Development
A Example Incoming Bid Request
B Explanation Bid Request
   B.1 Raw Bid Request
   B.2 Impression
   B.3 Timestamp


Chapter 1

Introduction

These days, almost everyone has a smartphone or tablet and uses it with great regularity. Retailers and service providers seize their chance by approaching these users with advertisements via their mobile devices. Mobile Professionals (MobPro) is a company that focuses on mobile marketing and advertising. They guide and advise retailers and providers in the mobile marketing world, and thereby act as a bridge between those clients and the other parties operating in the mobile landscape.

MobPro handles a lot of data about mobile users, and the goal of this project is to gain new insights by using this data. In particular, one of MobPro's main objectives is better audience targeting that leads to an increased number of advertisement clicks. The relevant metric for the number of ad clicks is called the click-through rate or CTR, which is the fraction of ad views that were clicked on. A more detailed explanation of the CTR is given in Section 2.2.1. A view, or impression, is a displayed ad or banner in a mobile application or on a mobile website. A click is generated when a user views the displayed ad and clicks on it. When a user clicks on an ad, the user is directed to an application or website of the advertiser.

MobPro operates on the demand side of the programmatic mobile advertisement market, and they have developed their own Demand-Side Platform (DSP). Programmatic advertising essentially comes down to the buying and selling of ad space as it appears in real time, through sending out bid requests and returning bid responses. A bid request is sent out by a Supply-Side Platform (SSP), which signals that an advertising space becomes available due to a user opening an application or browsing on a website. It is a call for buyers (DSPs) to start bidding on the available ad space. A detailed explanation of a bid request is given in Section 4.1. After a bid request is sent out, a bid response is expected within 100 ms. The bid price chosen by the DSP typically depends on the budget and goal of the advertiser's campaign. When an auction is won by a DSP, their advertisement is displayed. MobPro's DSP handles approximately 10,000 bid requests every second.


increases the click-through rates. Such a click predictor is capable of seeking out which incoming bid requests are likely to result in an ad click. It is, for example, possible that some features turn out to be very important in the click prediction, meaning they apparently create a situation in which an ad is clicked more often.

The model MobPro uses to make such predictions is logistic regression. Logistic regression outputs a number on the unit interval, which may be interpreted as the expected probability p ∈ (0, 1) of a click. The way we shall use this output, however, is slightly different. Namely, one of the things that a company operating a DSP2, like MobPro, has to make sure of is that it generates a sufficient number of views, which is typically agreed on beforehand. Thus, the probability p will serve as a gauge for whether we respond to the corresponding bid request. In other words, the decision of how to respond to the bid request is given by:

decision = { no-bid if p < T (not likely to be clicked through); bid if p ≥ T (likely to be clicked through) }   (1.1)

where we introduced the decision threshold T ∈ (0, 1). The standard value for the threshold is T = 1/2, but at MobPro it is determined dynamically in such a way that we always generate the right amount of traffic. The right amount of traffic depends on the customer's budget and the timespan of the campaign. A typical situation might be that we are able to decline the bulk of the incoming bid requests, such that we can be very picky, only accepting bid requests with a very high value for p. For instance, in such a situation we might take T = 0.9.
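As an illustration, the decision rule of Eq. (1.1) amounts to a one-line threshold test; the function name and the example threshold below are our own, not MobPro's production code:

```python
def bid_decision(p, threshold=0.5):
    """Decision rule of Eq. (1.1): bid only when the predicted click
    probability p reaches the (possibly dynamically chosen) threshold T."""
    return "bid" if p >= threshold else "no-bid"

# With a picky, dynamically raised threshold of T = 0.9:
assert bid_decision(0.95, threshold=0.9) == "bid"
assert bid_decision(0.40, threshold=0.9) == "no-bid"
```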

This thesis aims to improve the existing click prediction model by examining the co-occurrence of features that accompany a bid request. The current model is built using machine learning techniques, which we discuss in Chapter 3. An important hurdle that needs to be cleared has to do with the fact that the datasets are very large, categorical, and incredibly sparse. The central research question is:

Does the CTR prediction model’s performance improve by engineering new composite features?

These composite features are generated by combining original features that appear simultaneously. One has to be careful, though, for not all such co-occurring features are valuable additions to the set of features. What it means for a feature to be 'a valuable addition' will be explained in detail in Section 4.2. The improvement of the model is measured in terms of an increased number of correct predictions and a decreased number of false predictions.

The broader subject of this thesis is data science as applied to Big Data, which is an increasingly ubiquitous subject. Machine learning plays a crucial role in data science, and has grown rapidly over the past 15 years. This upcoming technique, used for prediction, arose out of

2 DSP: Demand-Side Platform; MobPro operates on the demand side of the real-time bidding auction.


the need to recognize patterns, such as speech and image recognition [4]. Wladawsky-Berger wondered why data science was necessary to solve the data problem, since statistics has been the tried-and-tested paradigm for a long time. He explained the difference between statistics and data science as follows [23]:

“It’s all about the difference between explaining and predicting. Data analysis has been generally used as a way of explaining some phenomenon by extracting interesting patterns from individual data sets with well-formulated queries. Data science, on the other hand, aims to discover and extract actionable knowledge from the data, that is, knowledge that can be used to make decisions and predictions, not just to explain what’s going on.”

Throughout this work, we adopt the latter point of view.

The topic of privacy is very relevant to mobile marketing, since gathering (personal) data is an important aspect of targeted advertising. The gathered data is used for the optimization of the CTR predictor. In Section 2 the aspects of privacy and the handling of personal information will be described with regard to the way data is used and collected at MobPro. This section also includes a brief introduction to the models that have formed the basis for the current training model used at MobPro. In the following section, Section 3, this model is explained in detail, together with some basics of machine learning, such as a general discussion of linear models and gradient descent for finding optimal weights. This section also shows that predictions, such as the likelihood of evoking a click, can be made using relatively straightforward machine learning models. In particular, we discuss how linear models like logistic regression are quite powerful when used in conjunction with recently developed online learning techniques, cf. [13]. Next, in Sections 3.3 and 3.4, some basic online learning methods and the hash trick for dealing with categorical data will be discussed.

In Section 4 the gathered information described above is applied in a case study to the data and the model of MobPro. The case study focuses on the interaction between coexisting features. The section includes a description of the data and of a bid request, the approach and method, followed by the results and the analysis.

Finally, in Section 5, the overall results will be discussed together with potential further research.


Chapter 2

Literature Review

This chapter discusses the subject of privacy with regard to the use of personal data by MobPro. Legislation and privacy concerns in practice will be discussed. The treatment of user privacy is followed by a short explanation of the choice of the current method used at MobPro.

2.1 User Privacy

The Internet transports personal data from the user's mobile device or desktop to an endpoint [14]. Many people are not aware of the kind of data that is made accessible to companies by using mobile applications [21]. The question is what a company like MobPro is allowed to do with the personal data of mobile device users. This is a rather complex subject, since there are no specific rules with regard to the personal information of mobile device users.

According to Montgomery [14], COO of GroupM1, the business plans are not about finding individuals, but about finding large groups of people that are interested in buying a certain product. He defines the main theme as follows [14]:

“Marketers need consumers to share their non-personal data. And, in return for the assurance that we will treat their data with responsibility and respect, consumers will have access to a cornucopia of information and content on a largely free Internet – that’s the valuable exchange.”

There are laws that apply to the use of personal data in the Netherlands. Firstly, there is the Personal Data Protection Commission, in Dutch called the College bescherming persoonsgegevens (CBP)2, which oversees the Personal Data Protection Act, in Dutch the Wet bescherming persoonsgegevens (Wbp)3. Secondly, there is the Telecommunicatiewet4, which aims to protect the rights of citizens regarding any form of digital communication.

1 GroupM: "GroupM is the largest media investment company in the world."

Since the introduction of the Cookie Act, a connection has been made between the Wbp and the Telecommunication Act. The Cookie Act regulates the use of personal data in information technology. The Wbp determines whether a party is allowed to process data. For commercial purposes, this is not allowed unless the user is asked for his or her permission. The point of discussion, however, is the question put to the user, since there are many stakeholders in the mobile advertisement landscape that can access the data. Is it possible to explain the intention of each party, and if so, is the user able to understand this information? Another issue is the question of who is responsible for the personal data and who needs to ask the user/customer for permission.

The CBP. The CBP oversees compliance with the Wbp. It also conducts research into the use of personal data within companies and organisations. When necessary, it imposes fines for violations, and it provides information on privacy.

The Wbp protects one's privacy. It states what can and cannot happen with personal data, and what one's rights are if one's data is used. For example, a person has the right to access his or her personal data, and can always object to its use. An organisation is obliged to notify users when their information will be used [22].

The Telecommunication Act. The Telecommunication Act includes the Cookie Act (art. 11.7a Telecommunicatiewet) [18]. Compliance is monitored by the ACM, formerly the OPTA.

The Cookie Act. The Cookie Act states that storing or accessing personal data is allowed only if the user a) is provided with a detailed explanation of the purpose of storing the information, and b) agrees and thus allows a company to store his or her data [18]. For mobile advertising this means that somewhere in the process of displaying an ad, the user that clicks on the ad needs to give a party permission to store or access his or her data.

Technology is moving much faster than legislation. It is almost impossible for the government to keep track of the innovations. A simple example is the set of sensors implemented in mobile devices, such as device rotation detection or the tracking of the number of steps a user takes each day. These innovations make it easier for companies to create a user profile, which according to the Wbp is only allowed if a company has the user's permission [22].

However, Montgomery states that it is nearly impossible to find or target individuals. Using an IP address in combination with other variables could produce a match that identifies an individual, but "that is not going to happen" [14].

3 Wbp: Wet bescherming persoonsgegevens


2.2 Existing Methods

The relative number of clicks evoked by an online ad is known as the click-through rate or CTR, which is often used as a metric for the successfulness of the ad in question. The need for optimizing the CTR of mobile ads is fairly new, which results in a limited number of articles. The vast majority of the literature on CTR optimization focuses on ordinary website banners and search engine ads. Direct application of these results is often not possible due to some technical differences between ordinary web ads and mobile ads, the most important being the absence of cookies in mobile apps.

Recently, two machine learning challenges appeared on Kaggle5, a website that provides a platform for companies looking to develop solutions through online data science competitions. Both challenges asked the Kaggle community to find a model that predicts the CTR of a mobile ad as accurately as possible.

The logistic regression model, including the adjustments that MobPro uses, is based on a top submission to one of the Kaggle competitions mentioned above.

On the one hand, MobPro wants as much data as possible, but on the other hand they are careful not to sacrifice user privacy in the process. A major issue with legislation is that it cannot keep up with the enormous pace at which modern (mobile) advertising changes due to continuous technical innovations in the mobile devices landscape. In other words, it is nearly impossible for governments to keep track of these innovations by adopting new laws to prevent violations of user privacy. The challenge for a company like MobPro is to interpret the aforementioned acts, like the Wbp and the Telecommunication Act, as carefully as possible.

2.2.1 Click-Through Rate

The click-through rate, or CTR, measures the fraction of impressions6 that lead to a click on an ad [7]. The CTR is simply:

CTR = #clicks / #views   (2.1)

The CTR is used to indicate the effectiveness of an ad. An advertiser is often interested in achieving a high CTR. As mentioned before, machine learning models can be used to predict the probability of a user clicking on an ad [17]. If the expected behavior fits the type of ad, a company can decide to consider bidding on the ad space. However, the system faces a problem when it tries to predict the click-through rate for a new ad, since no click information has been generated yet.
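One common way to handle this cold-start problem (a standard smoothing technique, not necessarily the one used at MobPro or by Richardson et al.) is to shrink the observed CTR of Eq. (2.1) towards a prior such as a global average CTR; the prior values below are made up for illustration:

```python
def ctr(clicks, views):
    """Plain click-through rate, Eq. (2.1)."""
    return clicks / views

def smoothed_ctr(clicks, views, prior_ctr=0.005, prior_strength=100):
    """Shrink the observed CTR towards a prior (e.g. a global average CTR).
    A brand-new ad with no views falls back to the prior entirely."""
    return (clicks + prior_ctr * prior_strength) / (views + prior_strength)

assert ctr(5, 1000) == 0.005
assert abs(smoothed_ctr(0, 0) - 0.005) < 1e-12       # cold start: prior only
assert smoothed_ctr(50, 1000) > smoothed_ctr(5, 1000)  # evidence moves the estimate
```

As more impressions accumulate, the observed clicks and views dominate and the estimate converges to the plain CTR.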

Richardson et al. studied how to use features of the new ad to predict the CTR [17]. Usually the following features are predefined: information about the ad, such as the length, size, and words; the page the ad points to; and statistics of other related ads.



In earlier research, the CTRs of other, similar ads were used to predict the CTR of a new ad. Similarity was determined by shared bid terms or topic clusters.


Chapter 3

Machine Learning: Background

In this chapter we discuss some machine learning background needed in the remainder of this thesis. We start by giving a very brief overview of the gradient descent method for finding an optimal set of model parameters by minimizing a scalar function known as the loss function. We then move on to discuss linear models. Special attention is paid to logistic regression, as it is used extensively throughout the remainder of this work.

3.1 Gradient Descent

In order to make accurate predictions, we need to find a good set of weights w. Finding such weights can be automated by defining an appropriate loss function Q and instructing a computer to minimize Q as a function of the weights:

w = arg min_w Q(w)   (3.1)

There are several algorithms available to extremize scalar functions such as Q(w). The one we use is gradient descent. We discuss the inner workings of this algorithm below.

Overview of gradient descent. A set of weights w is represented by a point in weight space W , which is typically just W = Rn. The loss function Q : Rn → R defines a hypersurface over weight space W known as the error surface. The first step in the algorithm consists of starting at a random point w ∈ W . At this point, we compute the gradient of the error surface ∇Q(w), which tells us the direction and magnitude of the slope of the error surface at w. The last step is then to update the weight vector [2]:

w ← w − η ∇Q(w) (3.2)

The parameter η > 0 is called the learning rate, which dictates the step size between iterations. Furthermore, the minus sign is there to counteract the fact that the gradient points 'uphill' rather than 'downhill'. The algorithm terminates when ‖∇Q(w)‖ drops below some specified cutoff value.
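The steps above can be sketched as a small routine; the quadratic toy loss used to exercise it is our own example:

```python
import numpy as np

def gradient_descent(grad_Q, w0, eta=0.1, tol=1e-6, max_iter=10_000):
    """Minimize a loss Q via Eq. (3.2): w <- w - eta * grad_Q(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad_Q(w)
        if np.linalg.norm(g) < tol:  # stop once the slope is nearly flat
            break
        w = w - eta * g
    return w

# Toy loss Q(w) = ||w - (1, 2)||^2, whose gradient is 2(w - (1, 2));
# the minimum sits at w = (1, 2).
w_opt = gradient_descent(lambda w: 2 * (w - np.array([1.0, 2.0])), w0=[0.0, 0.0])
assert np.allclose(w_opt, [1.0, 2.0], atol=1e-4)
```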


3.2 Linear Models

By linear models we mean models whose hypothesis takes the form:

h(w, x) = f (w · x) (3.3)

for some generic function f . In other words, the hypothesis depends on the features x only through the linear combination w · x, where the parameters w are the weights of the model. The prime example of a linear model is linear regression, for which f is simply the identity function: h(w, x) = w · x. Another obvious example is logistic regression, in which case f is the logistic sigmoid function1.

3.2.1 Logistic Regression

The model used in this work is logistic regression, which is very similar to linear regression except that the model's output is not a number on the real line R but rather a number in the unit interval [0, 1]. This has the advantage that the output of a logistic regression model can be interpreted as a probability in binary classification.

The hypothesis of logistic regression is, as explained above, given by:

h_logistic(w, x) = σ(w · x) = 1 / (1 + e^(−w·x))   (3.4)

which may be interpreted, in the case of binary classification, as the probability of the label (y = 1), given the feature vector x:

P (y = 1|x) = σ(w · x) (3.5)

The loss function in logistic regression is given by the log-loss:

Q(w) = −y log σ(w · x) − (1 − y) log(1 − σ(w · x))   (3.6)

Here, Q(w) is the loss suffered due to one data point (or row). The complete log-loss function is the average loss over all rows:

Q(w) = (1/n) Σ_{i=1}^{n} [ −y(i) log σ(w · x(i)) − (1 − y(i)) log(1 − σ(w · x(i))) ]   (3.7)
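A direct NumPy translation of the averaged log-loss (3.7) might look as follows; the two toy rows are our own example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    """Average log-loss over all n rows, Eq. (3.7)."""
    p = sigmoid(X @ w)
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

# Two toy rows; with w = 0 every prediction is p = 0.5,
# so the average loss is log 2.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
assert abs(log_loss(np.zeros(2), X, y) - np.log(2)) < 1e-12
```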

For notational convenience, we focus on only one single row. Using this loss function allows us to find the weight vector w, which is the one that minimizes Q(w):

w = arg min_w Q(w)   (3.8)


The next step is to compute the gradient of the log-loss function. This gradient will be used in the gradient descent optimization method. We will work in component notation, writing the gradient ∇Q as ∂Q/∂wi. First, we use the chain rule to split up the gradient:

∂Q/∂wi = (∂Q/∂σ) (∂σ/∂z) (∂z/∂wi)   (3.9)

where σ(z) is the logistic sigmoid and z = w · x is the logit.

Let us start with the first factor of the expression above:

∂Q/∂σ = −y/σ + (1 − y)/(1 − σ)   (3.10)

The second factor can be obtained directly by using the defining differential equation for the logistic sigmoid:

∂σ/∂z = σ(1 − σ)   (3.11)

Lastly, for the third factor, the logit is written as z = w · x = Σ_{j=1}^{p} wj xj, such that:

∂z/∂wi = Σ_{j=1}^{p} (∂wj/∂wi) xj = Σ_{j=1}^{p} δij xj = xi   (3.12)

where we used the Kronecker delta, defined as:

δij = 1 if i = j, 0 otherwise   (3.13)

Putting these three factors together:

∂Q/∂wi = (−y/σ + (1 − y)/(1 − σ)) σ(1 − σ) xi = (−y(1 − σ) + (1 − y)σ) xi = (σ − y) xi   (3.14)

In vector notation, this is:

∇Q = (σ − y) x (3.15)

Instead of taking one row, the average over all rows is taken:

∇Q = (1/n) Σ_{i=1}^{n} (σ(w · x(i)) − y(i)) x(i)   (3.16)
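As a sanity check on the derivation, the closed-form single-row gradient (3.15) can be compared against a central finite-difference approximation of the log-loss; the test point below is arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_loss(w, x, y):
    """Single-row gradient of the log-loss, Eq. (3.15): (sigma - y) x."""
    return (sigmoid(w @ x) - y) * x

# Central finite differences at an arbitrary test point agree with (3.15)
w, x, y = np.array([0.3, -0.7]), np.array([1.0, 2.0]), 1.0
Q = lambda w: -y * np.log(sigmoid(w @ x)) - (1 - y) * np.log(1 - sigmoid(w @ x))
eps = 1e-6
numeric = np.array([(Q(w + eps * e) - Q(w - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
assert np.allclose(grad_log_loss(w, x, y), numeric, atol=1e-6)
```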

Regularization. In order to avoid overfitting2, regularization is used:

Q(w) = −y log p − (1 − y) log(1 − p) + λ2 Σ_{i=1}^{p+1} |wi|² + λ1 Σ_{i=1}^{p+1} |wi|   (3.17)


where λ2 and λ1 are the L2- and L1-norm regulators, respectively. What these regulators do is provide a fixed budget for the weights w. The standard L2 regulator spreads the weight budget relatively evenly, but curbs the weights' overall magnitude. The L1 regulator puts a more stringent budget restriction on the weights, where only the strongly predictive weights survive and all other weights are killed off entirely. The latter type of regularization may be used for automatic feature selection.

3.3 Online Learning and Follow The Regularized Leader

The standard machine learning algorithms make use of batch learning, where each step in the learning procedure uses the entire training set. Now imagine a situation in which we cannot load the entire data set into main memory. In those situations we can proceed by using out-of-memory algorithms. The options for out-of-memory learning are then to either load consecutive chunks of the data (mini batch) or even to read only one data point at a time (online learning).

Another useful application of out-of-memory techniques might be a situation in which the data set may fit into memory, but the incredibly sparse nature of the data typically calls for a predictive model that by itself is rather sizable. In such a situation it is still infeasible to have both the data and the predictor in main memory.

The model used at MobPro is essentially just logistic regression, but the way the model is trained is somewhat more sophisticated. The algorithm for training their logistic regression model is called Follow The Regularized Leader (FTRL), cf. [13]. Although a proper in-depth treatment of the FTRL algorithm is beyond the scope of this work, we now discuss a few of its interesting properties. FTRL is an example of an online learning algorithm, as it trains a predictive model by going over the training data row-by-row.

What makes FTRL unique compared to other online learning algorithms is essentially that it makes use of adaptive learning rates. That is, each feature receives not only its own logistic regression weight, but also its own counter that keeps track of how well that specific weight has been trained. This has the effect that the FTRL algorithm basically trains billions of single-feature logistic regression models in parallel.
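A minimal sketch of the per-coordinate FTRL-Proximal update described in [13]; the hyperparameter values, class interface, and toy training loop below are our own illustration, not MobPro's implementation:

```python
import math

class FTRLProximal:
    """Sketch of the per-coordinate FTRL-Proximal update (cf. [13]).
    Each coordinate keeps its own gradient statistics, which act as a
    per-feature adaptive learning rate."""

    def __init__(self, alpha=1.0, beta=1.0, l1=0.0, l2=0.0, dim=2**20):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # per-coordinate lazy weight statistics
        self.n = [0.0] * dim  # per-coordinate sum of squared gradients

    def _weight(self, i):
        # L1 shrinkage: weakly predictive weights are set to exactly zero
        if abs(self.z[i]) <= self.l1:
            return 0.0
        return -(self.z[i] - math.copysign(self.l1, self.z[i])) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, indices):
        # Sparse logit: only the active (x_i = 1) coordinates contribute
        z = sum(self._weight(i) for i in indices)
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, indices, y):
        p = self.predict(indices)
        g = p - y  # gradient of the log-loss per active coordinate
        for i in indices:
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g

# Feature 3 always co-occurs with a click, feature 7 never does:
model = FTRLProximal(dim=100)
for _ in range(200):
    model.update([3], 1)
    model.update([7], 0)
assert model.predict([3]) > 0.9 and model.predict([7]) < 0.1
```

Note how each coordinate's effective step size shrinks with the square root of its accumulated squared gradients, so frequently seen features are updated more cautiously than rarely seen ones.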

3.4 How To Handle Sparse Categorical Data

The type of datasets we consider contains a large number of categorical features. In particular, the RTB bid request data set contains about 20 useful categorical features, where each feature may take on a huge number of distinct values. For instance, the name of an app or site is highly heterogeneous. A more precise description of what the bid request data looks like is given below in Section 4.1.


can be used as input for a predictive model. To this end, we use a method called the hash trick, which is explained below.

The hash trick is a superbly elegant and simple method that solves many of our problems by mapping all categorical features to well-defined vectors in Z_n = {1, 2, . . . , n}. This is accomplished by using a hash function. Assume that our categorical features are strings, e.g. suppose we are given the following feature vector ξ:

ξ = (Samsung, Galaxy S4, Android, NLD, . . .)^T   (3.18)

The hash function h then maps these strings to a number in Z_n. Let us take a small value for the dimensionality of the hashed feature space, say n = 10, and suppose that the hash function h maps the strings to:

h(Samsung) = 8,   h(Galaxy S4) = 3   (3.19)
h(Android) = 8,   h(NLD) = 2   (3.20)

Notice that the hash function h mapped the OS string Android to the same number as the brand Samsung. This is called a collision. Collisions are unavoidable, but the probability of a collision can be reduced by increasing the hashed feature space dimensionality n. A good hash function distributes all incoming strings approximately uniformly over its output range. The typical choice of n is set by the system's memory capacity. For instance, on an 8 GB machine, we typically take n = 2^28 ≈ 10^8. Choosing such a large value enables us to handle very sparse data sets with many different features.

The next step in the hashing trick is to interpret the output integers as indices of a feature vector x. First, let X be the set of integers that the components of ξ are mapped to, i.e. X = h(ξ). We can then interpret this set X as the set of active indices of an underlying feature vector x whose components xi are either 0 or 1, i.e.

X = {i | xi = 1} (3.21)

The logit z can then be computed very easily as:

z = w · x = Σ_{i∈Z_n} wi xi = Σ_{i∈X} wi   (3.22)

The final step is to pass the logit to the logistic sigmoid:

σ(z) = 1 / (1 + e^(−z))   (3.23)

which finally gives us the probability of finding y = 1, i.e. P(y = 1 | ξ) = σ(z).
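The whole pipeline above — hash each feature string, collect the index set X, and sum the corresponding weights — can be sketched as follows; the choice of md5 as hash function is ours, purely for illustration:

```python
import hashlib
import math

def hash_feature(s, n):
    """Map one categorical feature string to an index in {0, ..., n-1};
    md5 is an arbitrary but stable choice of hash function."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % n

def hash_features(xi, n):
    """Hash the whole feature vector xi into the index set X = h(xi)."""
    return {hash_feature(s, n) for s in xi}

def predict(w, X):
    """Sparse logit, Eq. (3.22), passed through the sigmoid, Eq. (3.23)."""
    z = sum(w[i] for i in X)
    return 1.0 / (1.0 + math.exp(-z))

# Tiny hashed space n = 10, as in the example above; with so few buckets,
# collisions are likely and X may hold fewer than four indices.
X = hash_features(["Samsung", "Galaxy S4", "Android", "NLD"], n=10)
w = [0.0] * 10
assert X <= set(range(10)) and 1 <= len(X) <= 4
assert predict(w, X) == 0.5  # all-zero weights give z = 0
```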


Chapter 4

Case Study: Real-Time Bidding

The features used for training are obtained with Python from the incoming bid requests. The features used are listed and explained in Appendix B. Using the sparse representation saves a lot of memory. Instead of going through the whole dataset, a smaller set is made containing only the feature values that are not empty. Hashing only this data requires about 4 GB of memory.

MobPro stores the daily generated data in BigQuery, a component of the Google Cloud Platform [19]. BigQuery allows one to ask questions about the data by running SQL queries.

Click-Through Rate Optimization. A way to optimize the predictor is to clean the dataset. As described previously (see Sparseness, Section 4.1.1), the data contains a lot of empty values. Examining the properties of the features may allow us to remove some sparse data, which will improve the quality of the data. Another possibility is to cluster coherent features, which will result in heavier weights for specific clusters.

4.1 Data Description

In order to design a prediction model that will improve the CTR in mobile advertising, data is needed to train and test the adjustments. The data is made available by MobPro and consists of different data sets. The data sets relevant for this research are all the incoming bid requests and the actually generated clicks.

The Guidelines of a Bid Request. The OpenRTB project from the Interactive Advertising Bureau (IAB)1 [11] aims for better communication between DSPs and sellers of publisher inventory by creating protocol standards2 for the RTB interface. The protocols can be used as guidelines, which means that, besides some required objects, some objects are only recommended or optional. However, the more information the request contains, the better. The chance that a certain bidder does not bid on a bid request with limited information is reasonably high. So, for the providers of the request it is important to provide as much information as possible to increase the number of bids.

As an example, a bid request may contain the field 'cat'. This field consists of an array of strings, where each string is a category specified by the IAB. The specified categories indicate types of sites or applications, for example 'Arts & Entertainment' (IAB1) or 'Automotive' (IAB2). Each category has multiple sub-categories: 'Books & Literature' (IAB1-2) is a sub-category of Arts & Entertainment, and 'Autoparts' (IAB2-1) one of Automotive. Besides the categories that fit the banner space of the bid request, it is also possible to define 'blocked categories' (bcat). These categories are blocked for ads.

4.1.1 General Features of the Data.

The training set that is used to train the model consists of rows and is received in JSON (JavaScript Object Notation). The number of rows depends on the time the naive model was active. Each row represents one accepted bid request and is composed of different kinds of objects, depending on the type of the request. For example, a distinction is made between an app object and a site object.

An Incoming Bid Request. For a typical incoming bid request, see Appendix A. A bid request consists of different kinds of objects, including: the App and Site objects, the Device object, the Geo object, the Banner object, and the Video object, cf. Tables B.1–B.6. Additionally, a timestamp and whether the ad was clicked (y = 1) or not (y = 0) are added to the bid request by MobPro itself. For the hierarchy of the bid request, see Appendix B.1.

Sparseness. The MobPro dataset has a lot of missing values. The dataset has about 1 billion (2^28) features, of which only 30 are not empty. This means the data is sparse. Sparse simply means that the number of features whose value is empty is greater than the number of features whose value is not empty. Because of the high sparseness of the data, a 'sparse representation' is used. The hash trick, described in Section 3.4, is a method for handling the non-numeric values in a sparse dataset.

To explain the sparse representation, a comparison is made with the dense representation. Table 4.1 shown below represents a dataset D1.

1  5  9  13
2  6  10 14
3  7  11 15
4  8  12 16

Table 4.1: A simple dataset D1.


The dense representation of the dataset is notated as follows:

D1 = [[1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15], [4, 8, 12, 16]]

And, the sparse representation:

D1 = {(0, 0): 1, (0, 1): 5, (0, 2): 9,  (0, 3): 13,
      (1, 0): 2, (1, 1): 6, (1, 2): 10, (1, 3): 14,
      (2, 0): 3, (2, 1): 7, (2, 2): 11, (2, 3): 15,
      (3, 0): 4, (3, 1): 8, (3, 2): 12, (3, 3): 16}

As you can see, for D1 the dense representation is considerably shorter than the sparse representation. However, Table 4.2 represents a dataset D2 with many missing values; an empty value is represented by 0.

1 0 9 0
0 0 0 0
0 7 0 0
0 0 0 0

Table 4.2: A simple, sparse dataset D2.

In comparison to Table 4.1, only three values are known in Table 4.2. Again, the dense representation of the dataset is notated as follows:

D2 = [[1, 0, 9, 0],
      [0, 0, 0, 0],
      [0, 7, 0, 0],
      [0, 0, 0, 0]]

However, the sparse representation is a dictionary with the location of each value as its key, and is in this case much shorter:

D2 = {(0, 0): 1, (0, 2): 9, (2, 1): 7}
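As an illustration, this dense-to-sparse conversion can be sketched in a few lines of Python; the function name to_sparse is chosen here for illustration and is not taken from the thesis implementation:

```python
def to_sparse(dense):
    """Convert a dense row-major matrix (a list of lists) into a
    dictionary mapping (row, column) positions to non-zero values."""
    return {(i, j): value
            for i, row in enumerate(dense)
            for j, value in enumerate(row)
            if value != 0}


# Dataset D2 from Table 4.2: only three non-empty values.
D2 = [[1, 0, 9, 0],
      [0, 0, 0, 0],
      [0, 7, 0, 0],
      [0, 0, 0, 0]]

print(to_sparse(D2))  # {(0, 0): 1, (0, 2): 9, (2, 1): 7}
```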

4.2 Approach

An incoming bid request comes in the form of a JSON object; see Appendix A for an example bid request. The features are extracted from this JSON object such that each feature is a string of the form

namespace|featurename=value

Thus, each feature consists of a namespace, a feature name and its corresponding value. For example, a feature that contains information about the device make may look as follows: device|make=apple, where device is the namespace, make is the feature name, and apple is the value of the feature. Such a feature string can be hashed to an integer. The training data consists of two lists: one of the hashed features X, and one of the corresponding labels y that indicate whether the bid request was clicked (y = 1) or not (y = 0). This data is used to train a logistic regression model with the FTRL algorithm mentioned above.
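A minimal sketch of this hashing step is shown below. The modulus 2**28 matches the feature-space size mentioned in Section 4.1.1; the use of an MD5 digest from Python's hashlib is an illustrative assumption (Python's built-in hash is randomized per process, so a stable hash is preferable), not necessarily the hash function of the actual implementation:

```python
import hashlib

D = 2 ** 28  # size of the hashed feature space

def hash_feature(namespace, name, value):
    """Map a 'namespace|featurename=value' string to an integer
    index in [0, D) using a stable (non-randomized) hash."""
    feature = '%s|%s=%s' % (namespace, name, value)
    digest = hashlib.md5(feature.encode('utf-8')).hexdigest()
    return int(digest, 16) % D

x = hash_feature('device', 'make', 'apple')
assert 0 <= x < D
```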

This project makes use of one dataset and one model for training. However, the process between loading the dataset and feeding it to the classification model differs per approach. In the first approach the data is not pre-processed, other than the general step of hashing the gathered features as input for the classifier. In the second approach, every feature is first combined with each co-occurring feature into a pair, and the resulting pairs are hashed. As in the second approach, the features in the third approach are also combined with co-occurring features. Additionally, since the number of clicks (y = 1) is significantly lower than the number of non-clicks (y = 0), oversampling with replacement is used. Sampling with replacement, also called bootstrap sampling, means that a random sample is chosen and then reinserted. This works as follows. Instead of one list with all feature pairs, two lists are made: a list Xy0 for y = 0, and a list Xy1 for y = 1. Since significantly more bid requests are not clicked, the list Xy0 is much longer than Xy1. In order to eliminate the difference between the number of clicks and non-clicks, the minority class (y = 1) is oversampled: for every pair in Xy0, that pair is added to Xnew together with a randomly selected pair from Xy1. The selected pair from Xy1 is then reinserted, which allows a feature pair to appear more than once in the oversampled dataset. In the next two approaches the datasets from the last two approaches are reused, but reconstructed by removing 'bad features' from the data. To determine which of the feature pairs are 'bad' and which are 'good', the earlier explained Scott's π-coefficient is used. Each 'good feature pair' satisfies the following rule (explained in Section 4.3.1): −1 < π < 1 and π ≠ 0.


Together with the separate features, the ‘good features’ are then used as input to train the model.
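The bootstrap oversampling of the minority class described above can be sketched as follows; the names oversample, X_y0 and X_y1 are illustrative, not taken from the actual implementation:

```python
import random

def oversample(X_y0, X_y1, seed=0):
    """For every non-clicked example, emit it together with one clicked
    example drawn uniformly with replacement from the minority class,
    so both classes end up equally represented in the output."""
    rng = random.Random(seed)
    X_new = []
    for pair in X_y0:
        X_new.append((pair, 0))
        X_new.append((rng.choice(X_y1), 1))  # sampled with replacement
    return X_new

balanced = oversample(['a', 'b', 'c', 'd'], ['e'])
# Both classes are now equally frequent.
assert (sum(1 for _, y in balanced if y == 0)
        == sum(1 for _, y in balanced if y == 1))
```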

          feature1  feature2  feature3
feature1            X         X
feature2                      X
feature3

Table 4.3: Explanation of the assembling of co-occurring features. Only the feature pairs marked X (above the diagonal) are used, to prevent duplicate pairs.
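The upper-triangular pair assembly of Table 4.3 amounts to taking all 2-combinations of the feature list, which can be sketched with Python's itertools:

```python
from itertools import combinations

def make_pairs(features):
    """Combine every feature with every co-occurring feature exactly
    once, avoiding self-pairs and duplicate (a, b)/(b, a) orderings."""
    return list(combinations(sorted(features), 2))

pairs = make_pairs(['feature1', 'feature2', 'feature3'])
# [('feature1', 'feature2'), ('feature1', 'feature3'), ('feature2', 'feature3')]
```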

Naively, one might be tempted to interpret the magnitude of the logistic regression weights as the importance of the corresponding (composite) features. This point of view is refined by looking at a more sophisticated measure of feature correlation: Scott's π-coefficient, which will be defined shortly.

4.2.1 Measuring co-occurrence of feature pairs

The main goal is to reduce the feature-space dimensionality by figuring out which features are redundant. A standard PCA-type approach does not work, because the data consists only of categorical features. Several features are not relevant for the prediction of a click, since they only appear together; for example, the make and the operating system of a device: when the device make is 'Apple', the operating system is by default 'iOS'. The relation between combined features can instead be quantified with a degree-of-similarity measure.

To measure the degree of correlation between two combined features, Cohen's κ-coefficient and Scott's π-coefficient are used [6, 9]. Both coefficients are calculated from the number of occurrences (N) of the two features together and separately (see Table 4.4).

                               Feature 1
                     present    absent     Totals
Feature 2  present   N11        N01        F1
           absent    N10        N00        N − F1
Totals               F2         N − F2     N

Table 4.4: Variables needed to calculate Cohen's κ score for a pair of two features (Feature 1, Feature 2).

N11 constitutes the number of co-occurrences of both Feature 1 and Feature 2 in an incoming bid request, while N00 is the number of bid requests in which both Feature 1 and Feature 2 are absent. N10 is the number of bid requests where Feature 1 is present and Feature 2 is absent, while N01 is its opposite: Feature 2 present and Feature 1 absent. The single-feature counts Fi are related to the Nij as F1 = N11 + N01 and F2 = N11 + N10. The total sample size N is just the sum of all components Nij. Finally, it is convenient to normalize these counts as nij = Nij/N and fi = Fi/N. The first question is to what extent the two features agree (in terms of presence/absence). This is found straightforwardly by the observed agreement pobs:

pobs = n00 + n11    (4.2)

Here, pobs ∈ [0, 1], where pobs = 0 indicates no agreement while pobs = 1 means perfect agreement. One may wonder, however, how much of this observed agreement should be attributed to random chance. It is this issue that Cohen's κ-coefficient addresses.

Cohen’s κ-coefficient

Cohen's κ-coefficient is an adjusted version of pobs that compensates for agreement due to chance. In order to make this adjustment, one needs to figure out what the expected agreement by chance is, denoted pexp. This expected agreement is computed by taking the expectation value of the occurrence of one feature with respect to the probability distribution of the other. More specifically, the (binary) probability densities are simply (fi, 1 − fi), so pexp is just the inner product of these probabilities:

pexp = (f1, 1 − f1) · (f2, 1 − f2) = f1 f2 + (1 − f1)(1 − f2)    (4.3)

Note that also pexp ∈ [0, 1]. Cohen's κ-coefficient is then given by:

κ = (pobs − pexp) / (1 − pexp)    (4.4)

Thus, Cohen's κ computes the observed relative to the expected agreement, in units of the expected disagreement 1 − pexp. Note that κ is finite as long as pexp < 1.

Scott’s π-coefficient

Yet another modification to the measure of agreement is obtained when we compensate for 'observer bias', which in our case is the considerable imbalance between a given feature being present or absent. This imbalance can be compensated for by replacing n10 and n01 by their average (n10 + n01)/2, thereby removing the difference in imbalance of the two single-feature occurrences, cf. [6]. This has the effect that f1 and f2 are replaced by their average (f1 + f2)/2, such that the expected agreement becomes:

p̃exp = ((f1 + f2)/2)² + (1 − (f1 + f2)/2)²    (4.5)

Notice that the averaging of n10 and n01 has no effect on the observed agreement pobs (4.2). Cohen's κ-coefficient is thus modified into what is known as Scott's π-coefficient:

π = (pobs − p̃exp) / (1 − p̃exp)    (4.6)

This is the measure for the amount of co-occurrence of feature pairs that we shall use henceforth.
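Both coefficients follow directly from the counts of Table 4.4; the sketch below implements equations (4.2)–(4.6), with the function name kappa_and_pi chosen for illustration:

```python
def kappa_and_pi(N11, N10, N01, N00):
    """Compute Cohen's kappa and Scott's pi from the co-occurrence
    counts of Table 4.4."""
    N = N11 + N10 + N01 + N00
    p_obs = (N11 + N00) / N                 # observed agreement, eq. (4.2)
    f1 = (N11 + N01) / N                    # relative frequency of Feature 2
    f2 = (N11 + N10) / N                    # relative frequency of Feature 1
    p_exp = f1 * f2 + (1 - f1) * (1 - f2)   # expected agreement, eq. (4.3)
    kappa = (p_obs - p_exp) / (1 - p_exp)   # eq. (4.4)
    f = (f1 + f2) / 2                       # averaged marginal frequency
    p_exp_tilde = f ** 2 + (1 - f) ** 2     # eq. (4.5)
    pi = (p_obs - p_exp_tilde) / (1 - p_exp_tilde)  # eq. (4.6)
    return kappa, pi

# Perfect agreement yields kappa = pi = 1.
assert kappa_and_pi(5, 0, 0, 5) == (1.0, 1.0)
```

Note that the sketch assumes pexp < 1 and p̃exp < 1, so that neither denominator vanishes.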


Interpretation.

π = −1 means that a pair never occurs. The feature pairs in Figure 4.1 near π = −1 are not relevant for the model, since it is not likely that the pair exists. The broadest level at which to look at feature interaction is the namespace. Pairs with a score of π = −1 are feature pairs that do not share the same namespace and whose namespaces never occur together. For example, a feature with the namespace app will never co-occur with a feature belonging to the namespace site, since a banner is shown in either a site or an application. Another example of a pair that will never exist, at the level of the feature name, is the combination of the operating systems ios and android. Such feature pairs are redundant. If π = 0 the features may occur together, but they do not create a new valuable feature. π = 1 indicates a maximum positive correlation between two co-occurring features and means that the two features only occur together, never one without the other. Those pairs are also not relevant for the model, because both separate features are given the same weight; the pair is redundant. Feature pairs that share the same namespace contain information about the same subject. For example, see Table 4.5: within the namespace device, the feature os with the value ios will always co-occur with the feature make with the value apple.

4.3 Analysis & Results

To test the utility of the adjustments, a training set and a test set are made. The model is trained on the training set and the resulting model is applied to the test set. The classification output is given in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). A positive (P) prediction means that the model predicted the bid request likely to be clicked, which results in acceptance of the bid request; a negative (N) prediction is a non-click. When the prediction is true (T), the prediction is correct: the bid request is correctly accepted or rejected. When the prediction is false (F), the bid request is incorrectly rejected or accepted (see Table 4.6).

It is important that as few potential clicks as possible are classified as non-clicks (FN); otherwise, these potential clicks are thrown away. However, by decreasing the threshold to avoid throwing away too many true clicks, the number of falsely accepted clicks (FP) will increase. To weigh these factors, the performance of the model is measured with the precision and recall scores. Precision P and recall R are defined as:

P = TP / (TP + FP),    R = TP / (TP + FN)    (4.7)

The precision tells us what fraction of the incoming bid requests that were labeled 'likely to click' actually did result in a click; the precision P may in this case be interpreted as the predicted CTR. The recall, on the other hand, tells us what fraction of actual clicks were labeled as such by the predictor. The precision and recall are conjugate in the sense that maximizing one tends to minimize the other, like a seesaw.


namespace  feature          data type
time       timeofday        string
time       dayofweek        string
device     devicetype       integer
device     connectiontype   integer
device     make             string
device     model            string
device     language         string
device     os               string
device     osv              string
geo        type             integer
geo        city             string
geo        region           string
geo        zip              string
user       gender           string
user       yob              string
app        id               string
app        name             string
app        domain           string
app        publisherdomain  string
app        storerating      float
app        bundle           string
app        paid             integer
site       id               string
site       name             string
site       domain           string
site       publisherdomain  string

Table 4.5: The OpenRTB feature specification.

                                    truth is click  truth is non-click
predicted outcome is click (P)      TP              FP (error)
predicted outcome is non-click (N)  FN (error)      TN

Table 4.6: Explanation of the classification output.

A compromise is reached by maximizing the precision and recall in a way that is appropriate for the situation at hand. The standard way of reaching such a compromise is to maximize the Fβ-score [16]. This score is defined as the weighted harmonic mean of the precision and recall [16]:

Fβ = (β² + 1) P R / (β² P + R)    (4.8)

where β parametrizes the importance of the precision relative to the recall. The closer the outcome is to one, the better (if the Fβ-score is equal to 1, the classifier made no errors). The way β should be chosen in the case of CTR prediction is to set:

β² = (# no-clicks) / (# clicks)    (4.9)

where the number of (no-)clicks is observed from the training data.

4.3.1 Results of the co-occurring measurements

The π score is used to define a mutual relation between two co-occurring features, and it varies from −1 to 1. As mentioned before, the data contains far more non-clicks than clicks. As a result, most feature pairs are better able to predict when an ad is not clicked than when it is clicked, which results in weights in [−1, 0]. This is clearly shown in Figure 4.1: most data points (feature pairs) are located below 0. To balance the number of clicks against the non-clicks, the feature pairs where y = 1 were oversampled. Figure 4.2 shows the result for the oversampled dataset.

Both Figure 4.1 and Figure 4.2 also show that Scott's π does not correlate with the logistic regression weights.

Figure 4.1: Scott's π against the logistic regression weights of the feature pairs.

As explained, feature pairs with π = −1, π = 0 or π = 1 are not a valuable addition to the model. We implement this rule to extract the 'good features' (−1 < π < 1 and π ≠ 0) from the original dataset, after which Scott's π is recalculated. The results are shown in Figure 4.3 and Figure 4.4.


Figure 4.2: Scott’s π against the logistic regression weights of the oversampled dataset.

Figure 4.3: Scott's π against the logistic regression weights of the dataset with selected feature pairs.

4.3.2 Results of Feature Selection

The selected 'good features' are used as input for the classifier to reduce the noise of the dataset. The performance of the model is displayed in the four plots in Figure 4.5, which show the performance, indicated by the Fβ-score, as a function of the threshold.
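The selection of 'good features' can be sketched as a simple filter on the computed π scores; the tolerance eps used to exclude the endpoints and zero numerically is an assumption of this sketch, as are the example feature names:

```python
def select_good_pairs(pi_scores, eps=1e-9):
    """Keep only feature pairs whose Scott's pi lies strictly between
    -1 and 1 and is not (numerically) equal to 0."""
    return [pair for pair, pi in pi_scores.items()
            if -1 + eps < pi < 1 - eps and abs(pi) > eps]

scores = {('os=ios', 'make=apple'): 1.0,    # redundant: always co-occur
          ('app', 'site'): -1.0,            # never co-occur
          ('time=evening', 'os=ios'): 0.3}  # potentially informative
assert select_good_pairs(scores) == [('time=evening', 'os=ios')]
```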

To compare the examined approaches, the Receiver Operating Characteristic (ROC) score is used, displayed in Figure 4.7. The ROC score indicates the overall performance of a binary classifier and is here used to compare the approaches. The column 'single' shows the


Figure 4.4: Scott’s π against the logistic regression weights of the oversampled dataset with selected feature pairs.

result of the classifier trained on single features. The next column displays the result of the feature pairs with and without feature selection; the ROC score increased slightly after applying feature selection. The last, rightmost column indicates the score of the oversampled feature pairs, which shows a slightly decreased ROC score with feature selection.

Figure 4.5: The Fβ-score against the threshold, indicating the performance of the model for each approach.

                           single  pairs   oversampled pairs
without feature selection  0.6849  0.6666  0.6340
with feature selection     -       0.6704  0.6330


Chapter 5

Conclusion

To analyse the coincidence of features, the model was trained on pairs of co-occurring features instead of only individual features. By training the model on the hashed feature pairs, weights were assigned to those pairs. The naive interpretation of the weights as feature importance is made more robust by looking at the π-coefficient instead. An important weight is close to −1 or 1; pairs with a high weight that occur relatively infrequently were considered to be more important. To determine the agreement of coexisting features, Scott's π was calculated. It turned out that the weight of each pair did not relate to its π-coefficient, thereby corroborating the expected subtlety of interpreting logistic regression weights as feature importance.

By selecting features within a specified range of π, the model did not perform better than the naive, current approach in which single features are trained. According to the ROC score, the ability to predict the CTR decreased. This applies to both the approach of combining co-occurring features and the approach of oversampling the dataset. Applying feature selection did improve these two approaches, but they still did not perform better than the naive approach.

5.1 Discussion

The goal of this project was to improve the CTR prediction by training the model on co-occurring feature pairs. Unfortunately, the approaches used for engineering new features did not improve the CTR prediction. This can be caused by multiple factors. It is possible that training the model on pairs increased rather than decreased the noise of the dataset, because many features occur rarely or not at all, which results in new infrequent feature pairs.

Apparently, the feature pairs selected within the specified range of the π-coefficient do not contain valuable features as input for the classification model. The specified range might be too large and may need to be redetermined.


In terms of user privacy, the discussion is about the need for advertisers in the mobile advertising landscape to ask permission, since it is doubtful whether a user is able to honestly assess the purpose of the permission question. The landscape of mobile advertising is complex, and it is therefore hard to determine which of the parties is responsible for the handling of personal data. But if legislation covers this problem with a concrete act, users will no longer need to give permission to every separate party, because the legislation will impose the rules on the relevant parties. Additionally, a user making a phone call is not aware of the amount of data stored during or after the call, nor did he or she give permission for storing this data. The discussion will be endless, but it is important to keep it going.

5.2 Further Research & Development

Valuable new features could be determined from the calculated Scott's π by adjusting the range of allowed values of π. For example, feature pairs with a score between −0.5 and 0.5, with the exception of 0, could be added to the model before it is trained; this range is chosen to leave the non-valuable features out. Training the model on more valuable features will hopefully reduce the noise of the sparse dataset, leading to a better-trained model. Additionally, another coefficient for measuring feature agreement could be used. Besides testing improvement of the model on pairs of two features, it could also be interesting to train on higher-order interactions, like feature triples consisting of three co-occurring features.

Possibly, privacy is going to be a concern for companies in the (mobile) advertising market if asking a user's permission for the use of personal data becomes mandatory. Depending on the value of the personal data for a company like MobPro, it might be useful to include in the CTR predictor the probability that a user will decline to share personal information, together with the probability that the user will click on the ad.


Appendix A

Example Incoming Bid Request

{'imp': {'banner': {'api': [4, 3, 5],
                    'battr': [3, 8, 9],
                    'btype': [1, 4],
                    'ext': {},
                    'h': 50,
                    'hmin': 50,
                    'id': '1',
                    'pos': 1,
                    'topframe': 1,
                    'w': 320,
                    'wmin': 300},
         'bidfloor': 0.05,
         'bidfloorcur': 'USD',
         'ext': {},
         'id': '1',
         'instl': 0},
 'rb': {'app': {'bundle': 'com.surpax.ledflashlight.panel',
                'cat': ['IAB1', 'IAB3'],
                'ext': {},
                'id': '528009',
                'keywords': [],
                'name': 'PanelFlashlightAndroidTier1',
                'publisher': {'ext': {}, 'id': '194294', 'name': 'iHandyInc'},
                'storeurl': 'https://play.google.com/store/apps/...'},
        'at': 2,
        'badv': [],
        'bcat': ['IAB25', 'IAB26'],
        'device': {'devicetype': 4,
                   'dnt': 0,
                   'dpidmd5': 'd41d8cd98f00b204e9800998ecf8427e',
                   'dpidsha1': 'da39a3ee5e6b4b0d3255bfef95601890afd80709',
                   'ext': {},
                   'geo': {'city': 'Gouda',
                           'country': 'NLD',
                           'ext': {},
                           'lat': 52.0226,
                           'lon': 4.707,
                           'type': 2,
                           'zip': ''},
                   'ifa': '7b09889b-a730-490c-a62c-aab707e0be16',
                   'ip': '195.241.174.129',
                   'js': 1,
                   'language': 'nl',
                   'make': 'Samsung',
                   'model': '',
                   'os': 'Android',
                   'osv': '4.2',
                   'ua': 'Mozilla/5.0 (Linux; U; Android 4.2.2; nl-nl;...'},
        'ext': {'exkey': 'exchange VLGtIgJFVorrudLAKSrhLteOSdnPD'},
        'id': '5755265917024450472',
        'imp': [{'banner': {'api': [4, 3, 5],
                            'battr': [3, 8, 9],
                            'btype': [1, 4],
                            'ext': {},
                            'h': 50,
                            'hmin': 50,
                            'id': '1',
                            'pos': 1,
                            'topframe': 1,
                            'w': 320,
                            'wmin': 300},
                 'bidfloor': 0.05,
                 'bidfloorcur': 'USD',
                 'ext': {},
                 'id': '1',
                 'instl': 0}],
        'tmax': 250,
        'user': {'gender': '', 'id': '-2889970989402297355'}},
 'rid': '54a8e4f003f460d23c956048',
 'timestamp': '2015-01-04 07:00:00 UTC'}

Appendix B

Explanation of the Bid Request

B.1 Raw Bid Request

App and Site Object.

field      description                           scope        type
id         unique ID for each bid request        required     string
name       app or site name                      optional     string
page       URL of the page shown                 recommended  string
domain     domain of the application or site     optional     string
publisher  contains: id, name, cat, domain, ext  optional     Object
cat        IAB categories                        optional     array
bundle     only in App Object: package name      recommended  string

Table B.1: Feature explanation of the App and Site Object of a raw bid request. Note: only one of them is applicable.

Device Object.

field  description                                   scope     type
type   type of the device (tablet, phone, pc, etc.)  optional  integer
make   device make                                   ”         string
model  device model                                  ”         string
lang   browser language                              ”         string
os     operating system                              ”         string
osv    version of os                                 ”         string
js     does the device support JavaScript            ”         integer
conn   connection type of the device                 ”         integer

Table B.2: Feature explanation of the Device Object of a raw bid request.

Geo Object.

field   description                           scope     type
lat     latitude of the coordinates           optional  float
lon     longitude of the coordinates          ”         float
region  region using ISO 3166-2               ”         string
type    indicates the source of the geo data  ”         integer
city    city                                  ”         string
zip     zip/postal code                       ”         string

Table B.3: Feature explanation of the Geo Object of a raw bid request.

B.2 Impression

Banner Object.

field     description                                              scope        type
width     width of the banner space                                recommended  integer
height    height of the banner space                               recommended  integer
pos       ad position (unknown, header, footer, etc.)              optional     integer
topframe  whether the banner is delivered in a topframe or iframe  optional     integer
api       supported API frameworks                                 optional     array of int

Table B.4: Feature explanation of the Banner Object (included directly in the Impression Object) of a raw bid request.

Video Object.

field     description                  scope     type
protocol  video bid response protocol  optional  Object
api       supported API frameworks     optional  array of int

Table B.5: Feature explanation of the Video Object (included directly in the Impression Object) of a raw bid request.


B.3 Timestamp

Timestamp.

field          description
TIMESTAMP      date and time object
hour           morning, lunchtime, afternoon, evening, night
day of week    weekend, weekday
week of month  day of the timestamp / 7
month          month

Table B.6: Feature explanation of the timestamp features added to a bid request.

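The derived time features of Table B.6 can be sketched as follows. The exact hour boundaries of the time-of-day buckets are an assumption here, since the thesis does not specify them:

```python
from datetime import datetime

def time_features(ts):
    """Derive the Table B.6 time features from a timestamp string;
    the time-of-day bucket boundaries are illustrative assumptions."""
    dt = datetime.strptime(ts, '%Y-%m-%d %H:%M:%S UTC')
    if 6 <= dt.hour < 12:
        timeofday = 'morning'
    elif 12 <= dt.hour < 14:
        timeofday = 'lunchtime'
    elif 14 <= dt.hour < 18:
        timeofday = 'afternoon'
    elif 18 <= dt.hour < 23:
        timeofday = 'evening'
    else:
        timeofday = 'night'
    return {'timeofday': timeofday,
            'dayofweek': 'weekend' if dt.weekday() >= 5 else 'weekday',
            'weekofmonth': dt.day // 7,   # day of the timestamp / 7
            'month': dt.month}

# Timestamp taken from the example bid request in Appendix A.
features = time_features('2015-01-04 07:00:00 UTC')
```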

B.4 Hierarchy of a bid request


Bibliography

[1] Askwith, B., Merabti, M., Shi, Q., & Whiteley, K. (1997, December). Achieving user privacy in mobile networks. In Computer Security Applications Conference, 1997. Proceedings., 13th Annual (pp. 108-116). IEEE.

[2] Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010 (pp. 177-186). Physica-Verlag HD.

[3] Bouneffouf, D., Bouzeghoub, A., & Gançarski, A. L. (2012, January). A contextual-bandit algorithm for mobile context-aware recommender system. In Neural Information Processing (pp. 324-331). Springer Berlin Heidelberg.

[4] Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199-231.

[5] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005, August). Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning (pp. 89-96). ACM.

[6] Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence and kappa. Journal of clinical epidemiology, 46(5), 423-429.

[7] Farris, P. W., Bendle, N. T., Pfeifer, P. E., & Reibstein, D. J. (2010). Marketing metrics: The definitive guide to measuring marketing performance. Pearson Education.

[8] Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of clinical epidemiology, 43(6), 543-549.

[9] Hoehler, F. K. (2000). Bias and prevalence effects on kappa viewed in terms of sensitivity and specificity. Journal of clinical epidemiology, 53(5), 499-503.

[10] iab Nederland (2014). Cookie Compliance: update October 2015. Amsterdam: iab Nederland.

[11] iab (2014). RTB Project: OpenRTB API Specification Version 2.2. San Francisco: iab.

[12] Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3), 264-323.

[13] McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., ... & Kubica, J. (2013, August). Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1222-1230). ACM.

[14] Montgomery, J. (2015). A Private Conversation: The data privacy view from GroupM's John Montgomery. GroupM, What's Next Quarterly.

[15] Sim, J., & Wright, C. C. (2005). The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical Therapy, 85(3), 257-268.

[16] Perner, P. (2008). Advances in Data Mining: Medical Applications, E-Commerce, Marketing, and Theoretical Aspects. 8th Industrial Conference, ICDM 2008, Leipzig, Germany, July 2008, Proceedings. Springer.

[17] Richardson, M., Dominowska, E., & Ragno, R. (2007, May). Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web (pp. 521-530). ACM.

[18] Telecommunicatiewet (1998, 19 October). Accessed at 24/04/2015, from [link]

[19] Tigani, J., & Naidu, S. (2014). Google BigQuery Analytics. John Wiley & Sons.

[20] Rijksoverheid (n.d.). Wat regelt de Wet bescherming persoonsgegevens (Wbp)? Obtained on 24/04/2015, from [link]

[21] Wang, Y., Huang, Y., & Louis, C. (2013). Respecting user privacy in mobile crowdsourcing. SCIENCE, 2(2), pp-50.

[22] Wet bescherming persoonsgegevens (2000, 6 July). Accessed at 24/04/2015, from [link]

[23] Wladawsky-Berger, I. (2014). Why Do We Need Data Science When We've Had Statistics for Centuries?
