
Evaluating The Use of Recurrent Neural Networks in Display Advertising

Submitted in partial fulfilment for the degree of Master of Science

Tom Dop

11439769

Master Information Studies: Data Science

Faculty of Science, University of Amsterdam

2018-07-06

Internal Supervisor: Rolf Jagerman (UvA)
External Supervisor: Daniel De Sybel (Infectious Media)


Evaluating The Use of Recurrent Neural Networks in Display Advertising

Tom Dop

University of Amsterdam
thomasdop@btinternet.com

ABSTRACT

This paper examines the use of recurrent neural networks in the context of display advertising. By transforming an individual user’s advertising history into a sequence of events, it is possible to show that long short-term memory networks can be utilized in order to create models predicting the user’s propensity to convert. These models are then compared to baseline models, whereby LSTMs are shown to offer an improvement in performance. The output of these models is then visualized over time, allowing for the user’s predicted propensity to convert to be observed after each event in the sequence. Finally the advantages and disadvantages of using recurrent neural networks in this context are discussed.

1 INTRODUCTION

In the world of online advertising, being able to predict consumer behaviour has many advantages, from optimising bidding algorithms [3] to conversion rate prediction [17]. If an advertiser can calculate the optimum type of advert, when to serve it to a user and how much should be paid for it, then advertising budget can be allocated accordingly. However, in practice, perfectly predicting the behaviour of individual users is a virtually impossible task. Fortunately, in display advertising, users are not targeted as individuals but as groups (known as audiences). Therefore, if it is possible to identify a group of users who have a higher probability to convert, targeting can be adjusted accordingly, potentially resulting in multiple improvements in how efficiently advertising budgets are spent. Data on almost every individual advertising event is available. This means that whenever a user is served, views or clicks on an advert, or when they visit the advertiser's website, this is recorded. These events contain both high-level user data, such as the country the user is located in, as well as more detailed event-level data, such as the price paid for the advert or the type of advert that was shown. Typical models used to predict user behaviour, such as logistic regression, are often built using user-level features [7]. These can comprise both user-level data and aggregated event-level data, where information on individual events has been transformed into user-level features. While these models can often be effective, they suffer both from the need to manually aggregate event-level data to create these top-level features and from the fact that detailed information about each individual event may be neglected, as it cannot necessarily be transformed into a user-level feature. Modelling individual events sequentially is a way to mitigate both of these problems.

Recurrent neural networks (RNNs) have proven their use as an effective machine learning method for sequence-based data. In this thesis I attempt to answer the research question: Can Recurrent Neural Networks be used to identify converters in the context of display advertising, and do they offer an improvement in performance over baseline methodologies? I show that a user's advertising history can be modelled as a sequence and that RNNs can be used as part of a machine learning framework to identify users with a higher propensity to convert. This is compared against baseline non-sequential, user-level logistic regression and neural network models, and the advantages of using RNNs over these baseline models are discussed.

2 A BACKGROUND TO DISPLAY ADVERTISING

Display advertising is the process whereby users browsing a web page are shown advertisements on the site, often in the format of banners and sidebars as observed in Figure 1. These adverts are designed to raise the brand awareness of the user browsing the website and to entice them to visit the website of the advertiser and potentially make a purchase (known as a conversion). While historically the effectiveness of display advertising has been measured through the number of clicks on an advertisement, more recent methodologies use a user's cookies to track when they have been served an advert and when they have visited the advertiser's website. This means that any visits to the advertiser's website can be subsequently traced back and attributed to the initial advertising activity.

Figure 1: An example of how a banner ad may appear on a website

A user's behaviour on the web can be tracked through the use of cookies [1]. Cookies are used to tie individual events back to a single user ID; therefore it is possible to record event-level data on all advertising events a user has been exposed to, as well as all visits the user makes to an advertiser's website and user-level attributes such as country, browser and operating system. Having a history of user activity means that subsequent visits to the advertiser's website, or purchases on that site, can be tied back to the advertising activity. It stands to reason that a history of user events can be modelled sequentially in time and that machine learning on this sequential data is possible.

The process by which these advertisements are purchased is known as Real-Time Bidding (RTB). When a user arrives on a website, there may be several advertisers who wish to show the user an advert. Each advertiser effectively bids how much they are willing to pay to show the given user an advert from their brand, and the advertiser who bids the most will have their ad displayed to the user. This entire process happens in under 100 milliseconds and is almost imperceptible to the user. Ultimately, advertisers who believe the user is more likely to make a purchase on their website are likely to bid more to show them an advertisement. Therefore, the price paid for each advertisement is effectively determined by the perceived value of the user to the marketplace of advertisers.

While perfect classification of users who will go on to convert on the advertiser’s website is very difficult, it may still be possible to identify a subset of users that are more likely to convert than others, using a combination of both user features and their history of advertising activity to date. This may have benefits to advertisers as they are able to refine the users they are targeting with ads by, for instance, removing users who are unlikely to convert from their targeting pool. A more sophisticated model could also potentially be used to determine which adverts should be shown to which users and what price should be paid for each one.

3 RELATED WORK

While the majority of work related to online advertising has been conducted in the field of search engine advertising, there is still a considerable amount of work that has been done in the field of display advertising. Wang, Zhang & Yuan (2017) [21] have explored predicting user response from categorical data and find basic logistic regression approaches to be effective. Qu et al. (2016) [18] have shown that simple neural networks are effective in predicting user response in the form of click-through rates. Zhang, Du & Wang (2016) [23] explore several deep learning approaches and note the limitations of deep learning on large vectors of categorical variables. For example, using domain as a feature in a predictive model when there are millions of potential domains an advert can be served on will result in very large feature vectors, which may lead to increased training times and computational limitations when it comes to training. However, methods do exist for learning on large input vectors, such as the cross-product method in Cheng et al. (2016) [4]. It is also worth noting that due to the complexity of deep learning frameworks, these models are rarely transparent, and interpreting the weights of such a model becomes very difficult [8]. Alternative approaches to modelling advertising activity also exist. Convolutional Neural Networks, in the case of Chen et al. (2016) [2], have been shown to be effective in predicting user response, while factorization machine based approaches, in the case of Guo et al. (2017) [13], have also been shown to be beneficial in predicting click-through rates.

Recurrent Neural Networks have become widely used for a variety of purposes where data is available as sequences of events. There have been several extensions to the basic RNN framework, the most well-known and widely used being the Long Short-Term Memory network. First introduced in 1997 by Hochreiter and Schmidhuber [14], the LSTM builds upon the basic RNN model, incorporating gating elements to provide a form of memory in the system. One of the most widely used applications of LSTM networks is in text prediction, as demonstrated in Graves (2013) [10]. A system that is able to remember which words have already appeared in a sentence is much more effective at predicting the next word. LSTMs have also been shown to be effective within many other fields where data can be modelled sequentially, such as speech recognition [12] and predicting stock prices using financial data [19]. With specific regard to display advertising, work has been conducted, in the case of Zhang et al. (2014) [24], into predicting user response in search advertising using Recurrent Neural Networks on event-level data; however, these studies have generally been applied to the prediction of direct user response to ads in the form of clicks, not indirect user response such as conversions. It is worth noting that while extensions to RNNs other than the LSTM exist, such as the Gated Recurrent Unit introduced by Cho et al. (2014) [5], this paper focuses on the use of LSTMs. While Chung et al. (2014) [6] have shown that the performance of GRUs is comparable to that of LSTMs, in practice the use of LSTMs is much more commonplace, and the main advantage of the GRU comes when creating models based on smaller datasets. There have been broader studies, such as that of Toth et al. (2017) [20], that use RNNs to predict consumer behaviour on an e-commerce website and show that they generally make better predictions than Markov-chain methods. Similar studies, such as that of Lang and Rettenmeier (2017) [16], focus specifically on conversion rate prediction using RNNs in e-commerce. With access to on-site data for consumers, the results of their study indicate that RNNs may provide greater predictive power than using simplistic logistic regression alone. They also demonstrate the advantages of transparency within a sequential model, by being able to show a user's propensity to convert at different points within their sequence, as well as determining which events within a user's history of events contribute most to conversion uplift.

4 MODELS

There are a variety of machine learning models which can be used to classify user behaviour; however, in this study we compare the performance of a sequential RNN, in the form of an LSTM network, against a non-sequential baseline in the form of a logistic regression and, by extension, a one-layer neural network.

4.1 Vector-based models

The most basic machine learning methods, such as logistic regression, take a single vector input, in this case one per user, and are trained against a classifier to minimise loss. When constructing input features $[X_0, X_1, X_2, \ldots, X_n]$, user-level features, such as the user's country of origin, can be easily transformed into one-hot encoded categorical variables. However, event-level features must be manually transformed into user-level features. These individual events can be aggregated into counts and then transformed into logistic, categorical variables for logistic regression, as demonstrated in Figure 2.


Figure 2: Feature engineering multiple logistic input features from aggregated event level variables

The thresholds by which these variables are defined are not fixed. Although different combinations may result in a better performing model, it is impossible to determine the performance of the model without first creating the input features and running the model. While it is possible to engineer input values for every possible value of the event-level variable, the input vector to the model very quickly becomes large, so there may be a trade-off between training time and model performance when determining the size of the input vectors to use. Although transforming event counts into logistic variables is possible with only some loss of information, this is not possible with other features. In the case of the price paid per advertising impression, it is possible to create user-level features to represent the total sum of all prices paid, or the average price paid per impression. However, any information relating to the distribution of prices over impressions is lost when aggregating individual events up to a user level. After user-level features have been created, the resulting input vectors can be fed into a one-layer neural network with a sigmoid activation function to perform the task of a logistic regression. This is comprised of an input layer consisting of $n$ nodes, where $n$ is the length of the user-level input vector. The prediction layer consists of a single binary node $y$, whose value is 1/0 depending on whether the user is a future converter. By using stochastic gradient descent the weights $[W_0, W_1, \ldots, W_n]$ are learned to minimise the loss between predicted and actual values of $y$, as illustrated in Figure 3.

Figure 3: Logistic Regression in one-layer Neural Network

It is also possible to add an additional hidden layer consisting of $i$ nodes to the neural network, as demonstrated in Figure 4. When combined with non-linear activation functions such as ReLU, this allows the network to learn complex features that a logistic regression would be unable to learn.

Figure 4: Neural network with a single hidden layer $h$ consisting of $i$ nodes
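To make the two baselines of Figures 3 and 4 concrete, a minimal Keras sketch is given below. The feature dimension and hidden-layer size are illustrative placeholders rather than the exact configuration used later in this thesis.

```python
# Minimal sketch of the two user-level baselines (illustrative dimensions only).
from tensorflow.keras import layers, models, optimizers

n_features = 500  # length of the user-level input vector (placeholder)

# Logistic regression expressed as a one-layer network (Figure 3):
# the inputs feed directly into a single sigmoid output node.
logistic_regression = models.Sequential([
    layers.Dense(1, activation="sigmoid", input_shape=(n_features,)),
])

# The same model with one hidden layer of i nodes and ReLU activations (Figure 4).
hidden_nodes = 200  # "i" in the text (placeholder)
one_hidden_layer = models.Sequential([
    layers.Dense(hidden_nodes, activation="relu", input_shape=(n_features,)),
    layers.Dense(1, activation="sigmoid"),
])

for model in (logistic_regression, one_hidden_layer):
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy")
```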

While these user-level models can be effective, the process of manually engineering user-level features and deciding on the features to use in a final model can be time consuming. Furthermore, by aggregating or summarizing at a user level, much of the information related to the individual events is neglected. This may include elements such as the price paid and the type of each individual advertisement, as well as timing-related features such as the order in which the events occurred and the length of time between events. This is where sequence-based models may prove advantageous.

4.2 Recurrent Neural Networks

Recurrent Neural Networks allow for the modelling of sequential data. Rather than taking a single vector as an input to the model, by vectorizing the individual events of a sequence and passing this sequence of vectors into a Recurrent Neural Network, sequence-based learning can occur. The events within an RNN are modelled as connected sequences of nodes, maintaining a hidden state $h$ as each event is fed into the network. This means that at each step, the new hidden state is calculated using the state from the previous time step $h_{t-1}$ and the incoming event vector $E_t$:

$$E_t = [e_1, e_2, \ldots, e_n]$$
$$h_t = \sigma(W_E E_t + W_h h_{t-1} + b)$$

The parameters learned are the matrices $W_E$ and $W_h$, which correspond to the weights on the incoming event and the previous hidden layer. A sigmoid activation function is applied to ensure the system is non-linear. The final hidden state is then used as the output of the model, which can be used to learn the output $y$, or can be combined with other features as part of a deeper neural network. It is worth noting that the hidden state has a number of dimensions $d$, which is a hyperparameter to be defined upon model creation. Figure 5 shows the hidden state $h$ changing as it receives each event input $E$, outputting the final hidden state $h_n$ as an input $X_n$ into the rest of the machine learning architecture.

Figure 5: Basic RNN architecture

A commonly used extension of basic RNNs is the Long Short-Term Memory network (LSTM). This uses the basic architecture of an RNN, however it has been modified to better learn long-term dependencies between events. This is done through the use of a series of gates, with weights learned by the network, which are used to control the flow of states through the network. An additional long-term memory state $c$ is also kept. While a basic RNN only calculates the updated hidden state at a given time step by using the previous hidden state and the new input vector, an LSTM uses the long-term memory state in conjunction with input ($i$), output ($o$) and forget ($f$) gates to better preserve long-term event dependencies, as defined by the equations below [9]:

$$i_t = \sigma_g(W_i E_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma_g(W_o E_t + U_o h_{t-1} + b_o)$$
$$f_t = \sigma_g(W_f E_t + U_f h_{t-1} + b_f)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \sigma_c(W_c E_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \odot \sigma_h(c_t)$$

where $W$ and $U$ represent weight matrices, $b$ represents biases learned by the model, and $\odot$ represents the entry-wise product. Figure 6 shows how each of these gates is implemented in a single LSTM cell.
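The gate equations can be transcribed almost directly into code. The NumPy sketch below implements a single LSTM cell step under the common choice of tanh for the activations $\sigma_c$ and $\sigma_h$; the weight and bias containers are hypothetical placeholders.

```python
# One LSTM time step, transcribed from the equations above (illustrative only).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(E_t, h_prev, c_prev, W, U, b):
    """W, U, b: dicts of weight matrices / bias vectors keyed by gate name ('i', 'o', 'f', 'c')."""
    i_t = sigmoid(W["i"] @ E_t + U["i"] @ h_prev + b["i"])  # input gate
    o_t = sigmoid(W["o"] @ E_t + U["o"] @ h_prev + b["o"])  # output gate
    f_t = sigmoid(W["f"] @ E_t + U["f"] @ h_prev + b["f"])  # forget gate
    # long-term memory: keep part of the old state, write part of the new candidate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ E_t + U["c"] @ h_prev + b["c"])
    h_t = o_t * np.tanh(c_t)  # new hidden state exposed to the rest of the network
    return h_t, c_t
```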

Both RNNs and LSTMs generally have much longer training times than simple vector-based methods, due to the high degree of computational power required to back-propagate through multiple stages of an RNN layer. However, the increased model complexity can be shown to have advantages, and may result in a better predictive model than using single-vector-based approaches.

Figure 6: LSTM cell diagram, adapted from Graves (2014) [11]

5 METHODOLOGY

5.1 Data

The data used for the models is based on all display advertising events for a single advertiser (name undisclosed for privacy reasons) over the course of a single month. The dataset is based on over 200 million advertising impressions served over the month of February 2018 to approximately 50 million users. This data is then enriched with further event-level data from other advertising events (i.e. clicks on an advert), as well as visits to the advertiser's website, resulting in a full dataset of over 500 million events. The complete list of event types can be found in Table 1.

The goal of the model is to better identify converting users, so converting users must first be defined by some criteria. The data is split into 4 weeks, where the first three weeks are the advertising events to be analyzed and the final week is the prediction window, where a binary 0/1 label is used to denote whether the user converts within this window. This means that conversions within the initial three-week period are used to create the model inputs, and subsequent conversions in the final week are used to create the conversion label. Using a fixed point in time as a conversion window is preferable as it emulates how such a model would be put into practice in a business scenario, i.e. the predictions would be generated at a fixed point in time using all available historic data.

Figure 7: Demonstration of how the binary conversion flag is calculated
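As a rough illustration of this split, the pandas sketch below derives the binary conversion flag from a hypothetical per-event table with user_id, timestamp and event_type columns; the file name and cutoff date are illustrative, with the cutoff simply marking the start of the final week.

```python
# Hypothetical sketch: events before the cutoff become model inputs,
# conversions after the cutoff become the binary label (Figure 7).
import pandas as pd

events = pd.read_parquet("events_feb_2018.parquet")  # columns: user_id, timestamp, event_type (assumed)
cutoff = pd.Timestamp("2018-02-22")                   # start of the one-week prediction window (illustrative)

history = events[events["timestamp"] < cutoff]        # weeks 1-3: used to build features and sequences
future = events[events["timestamp"] >= cutoff]        # week 4: used only to derive the label

converters = set(future.loc[future["event_type"] == "Conversion", "user_id"])
labels = (history[["user_id"]]
          .drop_duplicates()
          .assign(converts=lambda df: df["user_id"].isin(converters).astype(int)))
```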


Table 1: Description of event types

Event Type | Description
Impression | An advert is served to a user (not necessarily viewed, as it may occur off-screen)
View | User views an impression (50% of the impression must be on screen for one second)
Click | User clicks on an impression
Site Visit - Category | User visits a "category" page on the advertiser's website
Site Visit - Newsletter | User visits the "newsletter" page on the advertiser's website
Site Visit - Product | User visits a "product" page on the advertiser's website
Site Visit - Basket | User visits the "basket" page on the advertiser's website
Site Visit - Search | User visits a "search" page on the advertiser's website
Conversion | User visits the "purchase confirmed" page on the advertiser's website after making an order

The raw event data is then preprocessed for input into the user-level and event-level models. For the user-level models, data is aggregated per user and features are created based on the feature generation methods discussed in Section 4.1. For the event-level models, event data is transformed into sequences of events per user. Individual events within these sequences are transformed into separate vectors, which are then passed sequentially to the model. These separate event vectors consist of a one-hot encoded event type, as well as a timestamp differential, which measures the length of time between events. This is so the model can learn not only how the order of events affects conversion propensity, but also how the specific timing of events affects it. Additional event-level features such as price can also be added to these event-level vectors if required. User sequence length follows a long-tail distribution, with 44% of users having just a single advertising event and 90% of users having fewer than 10 events. The maximum event sequence length for a single user is 7,500 events; however, as sequences must be of the same length for input into the RNN, it is clearly impractical to model all sequences as 7,500-event sequences. Therefore sequential models were built using both sequences of 100 events and sequences of 200 events. Sequences shorter than this length were padded with empty vectors, while the oldest events were dropped from sequences longer than these lengths. Although just 0.15% of users had sequences longer than 100 events and only 0.03% of users had sequences longer than 200 events, there is still invariably some information loss as a result of shortening sequences. However, the trade-off between modelling the long tail of the data and model complexity is a common problem in machine learning [22].

Figure 8: One-hot encoded event vectors
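A sketch of this encoding is given below: each event becomes a one-hot event-type vector with an appended time differential, and every user's sequence is padded or truncated to a fixed length of 100 events. The padding side and the per-user input format are assumptions made for illustration.

```python
# Illustrative encoding of a user's event history into a fixed-length sequence.
import numpy as np

EVENT_TYPES = ["Impression", "View", "Click", "Site Visit - Category",
               "Site Visit - Newsletter", "Site Visit - Product",
               "Site Visit - Basket", "Site Visit - Search", "Conversion"]
MAX_LEN = 100

def encode_sequence(user_events):
    """user_events: list of (event_type, timestamp_seconds) tuples, oldest first."""
    vectors, prev_ts = [], None
    for event_type, ts in user_events[-MAX_LEN:]:  # drop the oldest events beyond MAX_LEN
        one_hot = np.zeros(len(EVENT_TYPES))
        one_hot[EVENT_TYPES.index(event_type)] = 1.0
        time_diff = 0.0 if prev_ts is None else float(ts - prev_ts)
        vectors.append(np.concatenate([one_hot, [time_diff]]))
        prev_ts = ts
    # pad shorter sequences with empty (all-zero) vectors
    padding = [np.zeros(len(EVENT_TYPES) + 1)] * (MAX_LEN - len(vectors))
    return np.stack(padding + vectors)  # shape: (MAX_LEN, len(EVENT_TYPES) + 1)
```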

One of the main limitations of the data is the sparsity of conversions. Of the users with an event in the first 21 days of the analysis window, only 0.4% go on to visit the advertiser's website and just 0.02% make a purchase. The data can therefore be split into two separate machine learning tasks: predicting users that will visit the advertiser's website and predicting users who will convert on the advertiser's website. By creating two binary labels for each user, whether the user visits the advertiser's website in the final week and whether the user converts on the advertiser's website in the final week, the performance of both the user-based and sequence-based models can be compared across the two conversion criteria.

5.2 Models

The models consist of two main types. The first type, the user-level model, consists of a single vector input and is constructed with a simple one-layer architecture, with input nodes feeding directly into the output node. This is first done using just the hand-crafted event-level features (clicks, impressions etc.) and then again with the addition of user-level attributes (country, browser etc.). The model is then extended to include a single fully connected layer between the input and output layers.

For the event-level sequential models, an LSTM layer is used to handle the event-level inputs of the model. Each event-level input to the model consists of a sequence of 100 input vectors, each consisting of the event type as a one-hot encoded vector as well as the timestamp differential ($T_t - T_{t-1}$) between each event and the previous event in the sequence. The motivation behind this is to allow the model to learn the contribution of event timings, as well as the event type and order, to conversion propensity. The output hidden state of the LSTM is then concatenated with user-level features and the features relating to the timing of the user's final event. These features are then passed through a final fully-connected layer before being connected to the output node. This is repeated on both sequences of length 100 and sequences of length 200, to determine how sequence length affects model performance. One final "enhanced" sequential model is then constructed. This model consists of additional event-level features that are not compatible with the user-level model. This includes additional information such as the price paid for each advertising event and the size of each advert on the page. A comprehensive list of all dimensions included in the enhanced model can be found in Table 2.


Figure 9: Sequential model
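A rough sketch of this architecture in the Keras functional API is shown below. The LSTM width, fully-connected layer size, optimizer and learning rate follow the settings described later in this section; the input dimensions are placeholders.

```python
# Sketch of the sequential model: an LSTM over the event sequence, whose final
# hidden state is concatenated with user-level features before the dense layer.
from tensorflow.keras import layers, models, optimizers

SEQ_LEN = 100   # 100 or 200 events per user
EVENT_DIM = 10  # one-hot event type + time differential (placeholder size)
USER_DIM = 50   # user-level features such as country and browser (placeholder size)

event_seq = layers.Input(shape=(SEQ_LEN, EVENT_DIM), name="event_sequence")
user_feats = layers.Input(shape=(USER_DIM,), name="user_features")

h_n = layers.LSTM(10)(event_seq)                       # 10-node LSTM layer
merged = layers.Concatenate()([h_n, user_feats])       # combine sequence output and user features
dense = layers.Dense(200, activation="relu")(merged)   # 200-node fully connected layer
output = layers.Dense(1, activation="sigmoid")(dense)  # visit/conversion probability

model = models.Model(inputs=[event_seq, user_feats], outputs=output)
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")
```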

The performance of each of these models is then evaluated over the two variations of the dataset, using both the visits flag and the conversion flag as the model output.

All models attempt to minimize binary cross-entropy loss between the predicted output values and the actual values. After running a small hyperparameter search, the following initializations were determined for each of the two model types:

User-level Model - 200 fully connected layer nodes with ReLU activation functions and random initialization. 1 predictor node with a sigmoid activation function. Adam optimizer with a learning rate of 0.001. Minibatch size of 100,000 users.

Event-level Model - 200 fully connected layer nodes with ReLU activation functions and random initialization. 1 predictor node with a sigmoid activation function. 10-node LSTM layers with Glorot-normal initialization. Adam optimizer with a learning rate of 0.001. Minibatch size of 2,000 user sequences. All models were trained until validation loss converged to its minimum value; when validation loss began to rise as a result of overfitting, training was halted.
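Continuing the sketch above, the "train until validation loss starts rising" procedure can be expressed with an early-stopping callback; the training arrays, patience and epoch cap here are hypothetical.

```python
# Illustrative training call: halt once validation loss stops improving.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)

model.fit([train_sequences, train_user_features], train_labels,
          validation_data=([val_sequences, val_user_features], val_labels),
          batch_size=2000,   # minibatch size used for the event-level models
          epochs=50,         # upper bound; early stopping usually halts earlier
          callbacks=[early_stop])
```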

The models were built using Keras [15], a neural network API running on top of TensorFlow, and were run on a MacBook Pro with a 2.3 GHz Intel Core i7 processor and 8GB of DDR3 memory. Training times were significantly shorter for the user-level models, with each training epoch taking between five and ten minutes depending on model complexity. Event-level models took significantly longer to train, at around 4-6 hours per epoch depending on sequence length.

6 EVALUATION

The datasets are divided into training, validation and test sets, whereby 80% of the data is used to train the model and 10% is used for validation. After tuning hyperparameters against this validation set, the performance of the model is evaluated on the remaining 10% of unseen data.

In the instance of our models, we are trying to predict very rare events, which means that there is an extreme imbalance in classes. Experimenting with rebalancing the classes (both through upsampling the smaller class and downsampling the larger class) did not result in an improvement in model performance. The imbalanced nature of the classes means that using accuracy as an evaluation metric is unsuitable. Instead, the models are evaluated on three main criteria. The first is the total binary cross-entropy loss of the model when run over the unseen dataset. The second and third evaluation metrics are average precision and area under the precision/recall curve respectively. Instead of classifying the users as conversion/non-conversion, the users are ranked from most likely to convert to least likely based on the model output. From this, the precision/recall curve can be plotted and the area under this curve, as well as the average precision, can be calculated.

Figure 10: Precision/Recall curve for Enhanced LSTM Conversion Model
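The three evaluation criteria can be computed directly with scikit-learn, as in the sketch below; the label and score arrays are dummy placeholders.

```python
# Computing validation loss, average precision and PR-curve AUC with scikit-learn.
import numpy as np
from sklearn.metrics import (auc, average_precision_score, log_loss,
                             precision_recall_curve)

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0])                            # dummy conversion labels
y_score = np.array([0.02, 0.10, 0.80, 0.05, 0.40, 0.01, 0.15, 0.03])   # dummy model outputs

validation_loss = log_loss(y_true, y_score)                # binary cross-entropy
avg_precision = average_precision_score(y_true, y_score)   # ranking-based metric
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)                            # area under the PR curve
```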

Comparing the performance of the models predicting site landings in Table 3, the two model types seem to offer similar performance based on the evaluation metrics. At a user level, adding user features to the model improves performance substantially, suggesting that not only the user's events but also the type of user contains information relating to their propensity to convert. Adding a dense layer can also be seen to help the model fit the data, resulting in a smaller total loss on the validation data set. In the case of the basic sequential, event-level models, performance was worse than the user-level models, albeit marginally so. Increasing the sequential model to a sequence length of 200 only resulted in a minute improvement in performance, at the expense of much longer training times. By extending the basic sequential model to the "enhanced" version and including event-level features that are unable to fit into the user-level models, performance actually increased above that of the user-level models, resulting in the smallest total loss, and the largest precision and AUC by a substantial margin.

Comparing the performance of the models predicting user conversions in Table 4, the sequential event-level models outperform the user-level models by a substantial margin across all three metrics. In a similar fashion to the landing-based models, extending the conversion-based model to the full "enhanced" version further improves model performance.

One argument that could be made is that the LSTM model exhibits an improvement in performance simply because the model itself is more complex and has more parameters. It is possible to show that this is not the case by extending the user-level model so that it contains the same number of learned parameters as the LSTM model. This can be done by adding additional nodes to the dense


Table 2: All dimensions included in “enhanced” sequential model

Dimension | Description | In Basic Sequential Model?
Event Type | Type of event (one-hot encoded) | Yes
Browser | User browser (one-hot encoded) | Yes
Operating System | User operating system (one-hot encoded) | Yes
Country | User country (one-hot encoded) | Yes
Device | User device (one-hot encoded) | Yes
Time Differential | Differential between the timestamp of the current event and the timestamp of the previous event in the sequence | Yes
Campaign | The campaign the served advert was from (one-hot encoded) | No
Slot | The on-screen size of the served advert (one-hot encoded) | No
Price | Price paid for the advert | No
PMP | Whether the advert was from a PMP campaign (binary) | No
Firewall | Whether the advert was blocked by a firewall (binary) | No

Table 3: Models Predicting Landing Probability

Level | Model | Validation Loss | Average Precision | AUC
User | Handcrafted Event Features Only | 0.02347 | 0.07002 | 0.01337
User | Event and User Features | 0.02268 | 0.08163 | 0.01416
User | Event and User Features, 1 Fully-connected Layer | 0.02181 | 0.08508 | 0.01510
Event | LSTM, 1 FC Layer, Sequence Length 100 | 0.02210 | 0.07864 | 0.01383
Event | LSTM, 1 FC Layer, Sequence Length 200 | 0.02207 | 0.07917 | 0.01398
Event | Enhanced LSTM, 1 FC Layer, Sequence Length 100 | 0.02032 | 0.08689 | 0.01593

Table 4: Models Predicting Conversion Probability

Level | Model | Validation Loss | Average Precision | AUC
User | Handcrafted Event Features Only | 0.001585 | 0.002705 | 0.000429
User | Event and User Features | 0.001396 | 0.003692 | 0.000533
User | Event and User Features, 1 Fully-connected Layer | 0.001306 | 0.005742 | 0.000611
Event | LSTM, 1 FC Layer, Sequence Length 100 | 0.001208 | 0.007009 | 0.000642
Event | LSTM, 1 FC Layer, Sequence Length 200 | 0.001202 | 0.007034 | 0.000653
Event | Enhanced LSTM, 1 FC Layer, Sequence Length 100 | 0.001188 | 0.007191 | 0.000712

layer so that the total number of parameters is approximately equal. From the results in Table 5 it can be seen that while adding additional parameters to the model improves performance slightly, the comparative LSTM model still performs better, indicating that any improvements the LSTM has over the baseline model do not come as a result of model complexity alone.

One advantage of sequential models is that it is possible to see the model's output at each stage of the sequence. By observing the model output after each event, it is possible to see how each additional event affects the user's propensity to convert. Figure 11 shows how the model output changes at different points in the user's journey. The user is located in the UK, uses Mac OS X with the Chrome browser on desktop, and is exposed to different events at irregular time intervals as defined on the graph.
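A curve like Figure 11 can be reproduced, under the assumptions of the earlier model sketch, by scoring the model on successively longer prefixes of a single user's sequence; the helper below is a hypothetical illustration of that idea.

```python
# Score the model after each event to trace a user's propensity over time (Figure 11).
import numpy as np

def propensity_over_time(model, sequence, user_features, max_len=100):
    """sequence: (n_events, event_dim) array for one user, oldest event first."""
    scores = []
    for k in range(1, len(sequence) + 1):
        prefix = sequence[max(0, k - max_len):k]   # most recent events up to step k
        padded = np.zeros((max_len, sequence.shape[1]))
        padded[-len(prefix):] = prefix             # left-pad with empty vectors
        score = model.predict([padded[None, :, :], user_features[None, :]], verbose=0)
        scores.append(float(score[0, 0]))          # predicted propensity after event k
    return scores                                   # ready to plot against event timestamps
```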

7 CONCLUSION

It has been demonstrated that utilizing recurrent neural networks may offer improvements in performance in predicting converters in display advertising. By contrasting a sequential RNN model with a comparative logistic regression model using the same data, the RNN saw a notable increase in performance over the baseline model in the case of the model predicting user conversions. However, the model predicting user visits exhibited similar, albeit slightly worse performance than the baseline models.

It can also be shown that expanding the sequential model to include event-level data that would otherwise have been excluded from the model results in a notable improvement in performance. Therefore, a demonstrable advantage of the sequential models is the ability to include and model detailed event-level data that cannot be modelled at a user level.

In terms of transparency, the sequential models allow a user's current propensity to convert to be observed at any stage of their sequence of


Table 5: Conversion Probability Models with Equalized Parameters

Model | Parameters | Validation Loss | Average Precision | AUC
Event and User Features, 200 Node FC layer | 31,191 | 0.001306 | 0.005742 | 0.000611
Event and User Features, 237 Node FC layer | 36,889 | 0.001297 | 0.005759 | 0.000614
LSTM, 200 Node FC Layer, Sequence Length 100 | 36,881 | 0.001208 | 0.007009 | 0.000642

Figure 11: Model Output (Visit Propensity) at different points in the user’s journey using the enhanced LSTM model

events. The current hidden state of the model after a given event can be used as an input to the rest of the model, and the user's predicted propensity to convert can be calculated at that point in time. This means that it is possible to view a user's sequence as a time series, where their predicted conversion rate is calculated after each event, as demonstrated in Figure 11. While the user-level models are somewhat transparent, as their weights can be examined to determine which factors contribute most highly to conversions, they are based on event-level data aggregated to a user level. This means that it is not possible to assess the individual contribution of a single event to the user's conversion propensity, nor is it possible to gauge the effect of the timing of these events.

In terms of efficiency, it can be shown that the process of constructing input data for the models is faster for the sequential, event-based models. These models simply require the encoding of events as one-hot vectors, which are then combined to form sequences; this can be done with relative ease. However, the user-level models require the modelling of events as aggregated logistic features, as demonstrated in Figure 2. There is no predefined way to decide how many of these features should be constructed or how this will affect the performance of the final model. Therefore, the event-level models are advantageous as they allow all data to be utilized in the model without losing information, and there is no ambiguity as to how the input features should be constructed.

One of the main trade-offs with the recurrent neural network approach is training time. The user-level models took approximately ten minutes per epoch to train on the defined hardware, whereas the sequential models take approximately four hours per epoch. While recurrent neural networks can be shown to outperform basic logistic regression approaches in terms of loss and average precision, this comes with a significant trade-off against training time.

However, in practice it is unlikely that model training times will be an issue. By scaling the process up to run on more powerful hardware (i.e. GPUs), model training times should not be prohibitively large, and a better performing model is likely preferable to one that takes longer to train. So long as the time taken for the model to generate predictions given new data is small, it is unlikely that time will be a restricting factor.

While the usefulness of recurrent neural networks has been shown, their practical use in the context of display advertising still requires validation. The models used in this thesis have been trained on a single month's worth of labelled display advertising data. Whether or not these models will generalise to data from an unobserved month is yet to be tested. Furthermore, whether the outputs of the model can be used to improve the performance of display advertising campaigns is yet to be tested. While the model outputs could easily be used to segment users into different targeting pools based on their predicted propensity to convert, whether this would result in improved performance would require extensive A/B testing to verify.

REFERENCES

[1] Claude Castelluccia. 2012. Behavioural Tracking on the Internet: A Technical Perspective. 21–33 pages.
[2] Junxuan Chen, Baigui Sun, Hao Li, Hongtao Lu, and Xian-Sheng Hua. 2016. Deep CTR Prediction in Display Advertising. CoRR abs/1609.06018 (2016). arXiv:1609.06018 http://arxiv.org/abs/1609.06018
[3] Ye Chen, Pavel Berkhin, Bo Anderson, and Nikhil R. Devanur. 2011. Real-time Bidding Algorithms for Performance-based Display Ad Allocation. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11). ACM, New York, NY, USA, 1307–1315. https://doi.org/10.1145/2020408.2020604
[4] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. CoRR abs/1606.07792 (2016). arXiv:1606.07792 http://arxiv.org/abs/1606.07792
[5] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. CoRR abs/1406.1078 (2014). arXiv:1406.1078 http://arxiv.org/abs/1406.1078
[6] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR abs/1412.3555 (2014). arXiv:1412.3555 http://arxiv.org/abs/1412.3555
[7] Krzysztof Dembczyński, Wojciech Kotłowski, and Marcin Sydow. 2008. Effective Prediction of Web User Behaviour with User-Level Models. Fundam. Inf. 89, 2-3 (April 2008), 189–206. http://dl.acm.org/citation.cfm?id=2366356.2366358
[8] G. David Garson. 1991. Interpreting Neural-network Connection Weights. AI Expert 6, 4 (April 1991), 46–51. http://dl.acm.org/citation.cfm?id=129449.129452
[9] Felix A. Gers, Jürgen A. Schmidhuber, and Fred A. Cummins. 2000. Learning to Forget: Continual Prediction with LSTM. Neural Comput. 12, 10 (Oct. 2000), 2451–2471. https://doi.org/10.1162/089976600300015015
[10] Alex Graves. 2013. Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013). arXiv:1308.0850 http://arxiv.org/abs/1308.0850
[11] Alex Graves and Navdeep Jaitly. 2014. Towards End-To-End Speech Recognition with Recurrent Neural Networks. In Proceedings of the 31st International Conference on Machine Learning (Proceedings of Machine Learning Research), Eric P. Xing and Tony Jebara (Eds.), Vol. 32. PMLR, Beijing, China, 1764–1772. http://proceedings.mlr.press/v32/graves14.html
[12] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. CoRR abs/1303.5778 (2013). arXiv:1303.5778 http://arxiv.org/abs/1303.5778
[13] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. CoRR abs/1703.04247 (2017). arXiv:1703.04247 http://arxiv.org/abs/1703.04247
[14] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9 (1997), 1735–1780.
[15] Keras. 2017. Keras Documentation. https://keras.io/
[16] Tobias Lang and Matthias Rettenmeier. 2017. Understanding Consumer Behavior with Recurrent Neural Networks.
[17] Quan Lu, Shengjun Pan, Liang Wang, Junwei Pan, Fengdan Wan, and Hongxia Yang. 2017. A Practical Framework of Conversion Rate Prediction for Online Display Advertising. In Proceedings of the ADKDD'17 (ADKDD'17). ACM, New York, NY, USA, Article 9, 9 pages. https://doi.org/10.1145/3124749.3124750
[18] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based Neural Networks for User Response Prediction. CoRR abs/1611.00144 (2016). arXiv:1611.00144 http://arxiv.org/abs/1611.00144
[19] Sreelekshmy Selvin, R Vinayakumar, E. A. Gopalakrishnan, Vijay Menon, and Soman Kp. 2017. Stock price prediction using LSTM, RNN and CNN-sliding window model. (09 2017), 1643–1647.
[20] Arthur Toth, Louis Tan, Giuseppe Di Fabbrizio, and Ankur Datta. 2017. Predicting Shopping Behavior with Mixture of RNNs.
[21] Jun Wang, Weinan Zhang, and Shuai Yuan. 2016. Display Advertising with Real-Time Bidding (RTB) and Behavioural Targeting. CoRR abs/1610.03013 (2016). arXiv:1610.03013 http://arxiv.org/abs/1610.03013
[22] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. 2017. Learning to Model the Tail. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 7029–7039. http://papers.nips.cc/paper/7278-learning-to-model-the-tail.pdf
[23] Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep Learning over Multi-field Categorical Data: A Case Study on User Response Prediction. CoRR abs/1601.02376 (2016). arXiv:1601.02376 http://arxiv.org/abs/1601.02376
[24] Yuyu Zhang, Hanjun Dai, Chang Xu, Jun Feng, Taifeng Wang, Jiang Bian, Bin Wang, and Tie-Yan Liu. 2014. Sequential Click Prediction for Sponsored Search with Recurrent Neural Networks. CoRR abs/1404.5772 (2014). arXiv:1404.5772 http://arxiv.org/abs/1404.5772
