Click Sequence Based Product Recommendations in E-commerce


submitted in partial fulfillment for the degree of
Master of Science

Enzo Blindow
11840374

Master Information Studies: Data Science
Faculty of Science
University of Amsterdam

2018-07-06

Internal Supervisor: Ilya Markov (UvA)
External Supervisor: Rob Winters (TravelBird)

Click Sequence Based Product Recommendations in E-commerce

Enzo Blindow

University of Amsterdam
enzo@blindow.eu

ABSTRACT

Understanding how to derive user preferences from click and interaction histories plays a critical role in recommending relevant products to users on commercial web pages. In web search, click models rank documents by relevance to a user's query on the basis of previously recorded interactions and information need. The goal of this work is to translate that approach into a predictive model that ranks products by click probability for issuing recommendations suitable for e-commerce. Instead of relying on item-to-item recommendations built around item similarity, we propose a click sequence-based model (CSBM) for learning click and interaction patterns in users' click histories. We investigate the model's accuracy on (i) predicting the product clicked next, (ii) predicting the product the user is most likely to interact with post click, and (iii) predicting the product first clicked in a successive browsing session, indicating which product inspires the user to return to the website.

Experimental results show the CSBM is most accurate at predicting the product clicked next and suggests similar products even when its predictions are not correct, making it useful for recommendations, albeit with lower accuracy on the pure classification task.

KEYWORDS

Click Prediction, Sequence Classification, Recurrent Neural Networks, Long Short-Term Memory, Recommender Systems

1 INTRODUCTION

Understanding users' preferences is key to making accurate product recommendations in e-commerce. Identifying how a user interacts with web content can have a substantial effect on how these interactions lead to a given goal, such as conversion in a commercial space or clicking on advertisements. To gain a better understanding of customer needs and to make more fitting recommendations, this research investigates how a user's sequential click history can be modeled to predict the most likely next interaction with a product.

The data used to research the accuracy of such a model comes from TravelBird¹, a travel-tech company that aims to inspire its customers with customized and care-free travel offers around the world. Travel as a product is seen as very unique, and a purchase decision typically comes with a long conversion time frame over repeated visits, which makes it paramount to understand how to best match relevant products to customers.

In web search, various models exist to better recognize users' interactions, designed to rank documents relevant to a user's search query on search engine result pages (SERPs); for an overview see [1]. Many of these models are based on probabilistic graphical models (PGMs) [2] utilizing the sequential nature of users' clicks, but

¹ www.travelbird.nl

depend upon manually setting the relations between observations. To alleviate setting manual dependencies, Borisov et al. [3, 4] introduced sequential neural models in recent publications. These models showed that click interactions can be used to determine a user's next-click probability in order to present documents ranked by relevance to the user.

As an alternative to the web search environment, we propose to adapt the neural click model to an e-commerce setting by ranking relevant products to users for issuing accurate recommendations instead. Existing recommender systems are commonly based around item-to-item recommendations [5] relying on the similarity of products, which are determined from a user's last click. Though Hidasi et al. [6] have shown that user interactions during a browsing session can be modeled sequentially to produce session-based recommendations, we argue that using the entire click history yields more accurate predictions.

We propose a click sequence-based model (CSBM) that predicts which products are most likely to be clicked on next. The hypothesis is that short and long-term behavioral patterns can be learned and used to make more accurate forecasts. As the order of information is crucial, this research focuses on utilizing Recurrent Neural Networks (RNNs), which are explicitly designed to process sequential information.

Experimental results show that modeling users' click histories sequentially yields more accurate predictions of which products will be clicked next. Further investigation of the predictions reveals that the CSBM managed to generalize interaction patterns and suggested similar products even when the next-click prediction was not correct. Additionally, restricting learning to product clicks where users interacted with the product page post click, or learning which product is clicked first and causes a user to return to the web page, showed little improvement over a range of linear and neural baseline models. And while the classification task could be improved by predicting a cluster of products instead, we found limited evidence that clustering products actually produces more effective product recommendations.

Our work's contribution is modeling users' clicks and interactions with the objective of predicting products based on click probabilities in order to obtain accurate product recommendations. This differs from existing recommender systems by modeling entire click histories sequentially and employs methods from neural click models, originally used for ranking documents on search result pages, for ranking products on an e-commerce site instead.

The primary goal of this research is to accurately predict which product will be visited next in order to issue relevant recommendations to a user, and to investigate a range of methods that improve accuracy. This research is structured around these questions:


RQ1 How accurately can the next clicked product be predicted based on a user’s click history?

RQ2 How does performance change if training on post-click interactions or first clicks instead?

RQ3 Does grouping products improve the accuracy?

We start by laying out how related works model sequences effectively for a prediction task in Section 2, as well as discussing corresponding publications on the effectiveness of using sequential data as the basis for issuing recommendations. In Section 3 we introduce the methods that are used for modeling the sequence predictions. Section 4 outlines the implementation and data of the models. We discuss the results in Section 5 by running a set of experiments designed to answer each research question.

2 RELATED WORK

Our work connects two types of related work for the objective of obtaining product recommendations: click predictions by modeling online user behavior and the use of sequence classification models for predicting products clicked next.

2.1 Click prediction

Modeling users' click sequences to represent their online behavior or intents has a wide foundation in ranking results on Search Engine Result Pages (SERPs). These use various models ranging from user behavior models, dynamic Bayesian networks, and task-centric click models [7] to probabilistic graphical models. Borisov et al. [3] have recently introduced a neural approach by using an RNN for modeling user click sequences on search engine results. RNNs have also demonstrated promising results over feature-engineered linear models when analyzing user click sequences within a browsing session in a commercial web setting. Predictions were made for purchase propensity [8], session-based recommendations [9], or classifying such sessions into various intent types [10]. Toth et al. [10] also showed that using RNNs yielded better results for classification purposes than a more traditional approach using high-order Markov chain models.

2.2 Sequence-based recommendations

Utilizing sequential user click information, primarily with RNNs and LSTMs, for recommendation purposes was previously shown to be effective by MacLean et al. [11], who compared their RNN-based recommender system to state-of-the-art approaches and achieved comparable to superior results. Proven methodologies like collaborative filtering suffer from fading effectiveness when interactions are too few, and it was proposed to overcome the shortcomings of similarity-based item-to-item recommendations by respecting all interactions in a user's sequence [6] or by utilizing review and content information when modeling a user's interests [12]. To overcome the widespread complication of the cold start problem, it was proposed to primarily use page and article information to train a predictive model, eliminating the need for users to have purchased before predictions can be made [13]. Moreover, Yu et al. [14] include all historic sessions of a user in conjunction with the current one, arguing that this way the user's dynamic preferences can be learned from recurring global user interaction patterns.

3 METHOD

In this section we propose modeling a user's click log, as well as learned representations of the content and pages on a website, as a sequence of feature vectors in order to classify the next product a user is going to engage with. This approach has the advantage of learning product preferences from a user's click behavior, allowing it to pick up more intricate patterns indicating a visitor's taste. Furthermore, we expand on using embeddings in the preprocessing step to create meaningful representations of categorical variables in the data, and we discuss clustering methods for grouping products into bins that can be utilized as alternative training targets.

3.1 Click sequence prediction model

Given the hypothesis that most users have several repeated touch points on a website while researching their next product purchase, we also want to express a user's preferences over time. Touch points are all occurrences of users visiting or interacting with content relevant to the site's products. We want to capture a broad variety of product clicks during the user's research process, but also a narrow selection close to a potential purchase decision. This model is designed to express the available information at each of these touch points by concatenating information about the user, the click event, and data about the page that was clicked and interacted with.

Figure 1 displays an example of a possible sequence of a user's interactions with a website, spanning the user's task of finding a suitable next product to purchase. The figure illustrates a variety of touch points of a user with home pages, product pages, and category pages listing a variety of products sharing common attributes. Each record possesses information on the content displayed to the user², as well as behavioral information on how the user interacted with it. Metadata such as timestamps is used to determine the time a user spent on a given page and how much time has passed between sessions. The illustrated sequential structure of touch points can be leveraged when modeling the click history of a user, as each of these touch points contains valuable information on the user's taste. When modeling this data we distinguish between the relatively static, page-specific information, such as the page type, the content that is displayed, and the available elements on the page that can be interacted with, and the dynamic information on how the user engaged with the web page. Observations on engagement include the time between clicks, the sequencing of touch points, and which elements on the page were clicked. Next we discuss how this information is effectively leveraged in the model.

3.2 Feature representations

We outline the different features and transformations of the data points shown in the example in Figure 1, which we concatenate into a feature vector and use as input at every time step of the model described in Section 3.3.

3.2.1 Page types. We want to distinguish between the various types of pages a user can visit while browsing the website, like home page, product pages, and category pages listing products

² We limit the content representations to product-related content only.


Figure 1: Transformation of user click logs into input data for LSTM cells. Each of the features is transformed into a vector and concatenated at each time step. Demonstration of how the training label is derived for either the next product within a session, the next offer with more interactive elements, or the first product visited in the upcoming session.

with similar attributes, as each of these page types conveys different information indicating a user's taste³. The page types are represented by one-hot encoded binary vectors with the length of all possible different page types.

3.2.2 Content representation. Instead of using one-hot representations for the products, we utilize pre-trained embedding vectors. This allows for a fixed-size representation even with increasing numbers of categories, instead of a vector whose length varies with the number of categorical values.

Embeddings come from the field of natural language processing and are learned vectors comprised of multiple values ranging between zero and one [15]. They are learned by maximizing the probability of a word occurring given a window of its surrounding words and are able to capture information like gender, plurality, or other semantics, making them a useful representation of a categorical variable. This method is also referred to as word2vec [15] and can also be used in a non-textual context if sequential input data is provided. These multi-dimensional vectors possess specific properties retaining semantic and syntactic information, allowing basic arithmetic operations [15]. The information gained by these learned representations over one-hot encoded variables becomes clear when applying the vector offset calculation [16], adding and subtracting specific vectors with related attributes and revealing the most

³ Other page types such as search, account, landing pages, etc. are excluded here as they are not seen as contributing to revealing a user's preferences.

cosine-similar item with the calculated attribute in vector space, Eq. (1):

y = x_b - x_a + x_c, \qquad w^{*} = \arg\max_{w} \frac{x_w^{\top} y}{\lVert x_w \rVert \, \lVert y \rVert} \qquad (1)

The following example on products illustrates one of the main advantages of embeddings.

Products. To produce product embeddings we use the sequential information in the user click logs, treating each user's click sequence as an input sentence of the word2vec algorithm, with each product represented as one word in our corpus. This allows us to capture which products are being clicked together by users while retaining product-specific semantics on how those products were browsed together, which allows for improved calculations by a neural network when used as input [17, 18]. To examine whether the word2vec embeddings learned domain-specific properties, we can apply the vector offset method (1) to specific product embedding vectors V_product.

V_{CitytripMadrid} - V_{Spain} + V_{Italy} = V_{CitytripRome} \qquad (2)

In our case, example (2) reveals that concepts such as the similarity between countries and their capitals, as well as travel categories like city trips, were learned. This shows how effective this representation of content is compared to one-hot encodings.
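As an illustration of this preprocessing step, the sketch below shows how product embeddings of this kind could be trained with Gensim's word2vec implementation on click sequences, and how the vector offset of Eqs. (1) and (2) maps to a `most_similar` query. The item identifiers, sequence contents, and vector size are hypothetical placeholders (country tokens stand in for country-level pages purely to mirror the analogy); the thesis only states that the CBOW variant was used (Section 4.5.2).

```python
from gensim.models import Word2Vec

# Toy click sequences: one list of clicked item identifiers per user history.
# Identifiers are hypothetical; in the thesis each product is one "word".
click_sequences = [
    ["citytrip_madrid", "spain", "citytrip_rome", "italy"],
    ["citytrip_rome", "italy", "citytrip_madrid", "spain"],
    ["citytrip_madrid", "citytrip_rome"],
]

# CBOW word2vec (sg=0) over item identifiers, as noted in Section 4.5.2.
model = Word2Vec(click_sequences, vector_size=32, window=5, min_count=1, sg=0)

# Vector offset method of Eqs. (1)/(2): x_b - x_a + x_c, ranked by cosine similarity.
print(model.wv.most_similar(positive=["citytrip_madrid", "italy"],
                            negative=["spain"], topn=3))
```

With real click logs, the returned neighbours would be the products most often browsed in similar contexts; with this toy corpus the output merely demonstrates the API.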

Categories. If a recorded event is on a category page, all visible products on that category page are aggregated into an average product vector. This allows for an aggregated representation of those products, biased toward the most frequent similar offers present in the category. As only a part of all clicks represents visits on product


pages, we also create a zero vector of the same size, which acts as a mask when a click contains no product information.

3.2.3 User Interactions. Interactive elements on pages are represented as a vector with binary flags indicating which elements have been used. If all elements have been touched, the resulting vector consists exclusively of ones [1, 1, ..., 1]. We use this representation over one-hot encodings to capture all combinations and variants of interactions and to allow for identifying more intricate patterns. At this point, only interactions on product pages are modeled, to represent a user's interest more closely.

3.2.4 Timestamps and dwell time. The time spent on a page, also referred to as dwell time, is represented by the number of seconds passed between the recorded click and the prior one. This allows the model to learn which product clicks are more meaningful, in addition to the interaction vector, differentiating between a product click where the user exited the page early and a click where the user remained on the product, showing increased interest. To make differences between values in the lower range as meaningful as differences in higher ranges, log(Δt_s) is applied, with values capped at 500 seconds. High values of Δt_s act as a delimiter between browsing sessions, indicating that the user was away longer between clicks.
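A minimal sketch of this dwell-time transformation, assuming the 500-second cap is applied to Δt_s before the logarithm (the text leaves the ordering open) and using hypothetical column names:

```python
import numpy as np
import pandas as pd

def dwell_time_feature(timestamps: pd.Series) -> pd.Series:
    """Seconds between consecutive clicks of one user, capped, then log-scaled."""
    delta = timestamps.sort_values().diff().dt.total_seconds().fillna(0.0)
    delta = delta.clip(upper=500.0)   # assumption: the 500 s cap applies to delta_t
    return np.log1p(delta)            # log1p keeps a zero delta at exactly 0

# Toy usage: two clicks 40 seconds apart, then a long absence capped at 500 s.
clicks = pd.Series(pd.to_datetime([
    "2018-03-01 10:00:00", "2018-03-01 10:00:40", "2018-03-02 09:00:00"]))
print(dwell_time_feature(clicks))
```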

3.3 Implementing the sequence model

In an effort to leverage the consecutive nature of a user's browsing history over longer time frames, in this section we discuss utilizing Recurrent Neural Networks (RNNs) [19], specifically Long Short-Term Memory (LSTM) [20]. Furthermore, we elaborate on how these models are explicitly designed to capture the effects of sequential data [3, 8, 17] and learn to distinguish long and short-term effects, which is useful for modeling click behavior as described in Section 3.1.

3.3.1 Recurrent Neural Network. RNNs are an adaptation of a Feed Forward Neural Network (FFNN)⁴ specifically designed to retain memory longer in the hidden states, allowing the capture of sequential information over longer time steps. This is achieved by computing the hidden state for a time step, h_t, by combining an input x_t with a weight matrix W and adding the hidden state of the prior time step h_{t-1}, multiplied by a transition matrix U that determines how much information is transferred from one hidden state to the next (3). The result is squashed with a sigmoid function σ, allowing weaker signals to show more strongly while retaining the effects of already strong signals (4) [21].

h_t = \sigma(W x_t + U h_{t-1}) \qquad (3)

\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma : \mathbb{R} \to (0, 1) \qquad (4)
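As a toy illustration of the recurrence in Eqs. (3) and (4), the sketch below unrolls a plain RNN over a short random sequence with NumPy; the dimensions and weight scales are arbitrary assumptions.

```python
import numpy as np

def sigmoid(x):
    # Eq. (4): squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

input_dim, hidden_dim, seq_len = 8, 16, 5
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input weight matrix W
U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # transition matrix U

h = np.zeros(hidden_dim)                 # initial hidden state h_0
for x_t in rng.normal(size=(seq_len, input_dim)):
    h = sigmoid(W @ x_t + U @ h)         # Eq. (3): h_t = sigma(W x_t + U h_{t-1})
print(h.shape)                           # (16,)
```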

As hidden states are determined at every step in the input sequence, weights from the previous step are kept, along with the combined weights remaining from all prior steps.

After determining the weights, errors E are back-propagated, allocating partial responsibility to all individual input, output, and hidden weights w by determining the partial derivative ∂E/∂w, the rate of change of the error with respect to the weight. The direction and rate

⁴ RNNs reuse cells repeatedly instead of solely feeding through.

Figure 2: LSTM node with internal gates and flow states

of adjustment of the weight according to the error are then determined by gradient descent.

The recursive nature of the optimization function allows for using arbitrarily long sequences as input; however, as weights are continuously updated by the gradient of the error function, the values are at risk of becoming too small for efficient training. This vanishing gradient problem [22] with long sequences or deep layers in an FFNN was tackled with the introduction of the LSTM cell [20], discussed next.

3.3.2 LSTM configuration. LSTM cells were developed by Hochreiter and Schmidhuber [20, 23] to combat the vanishing gradient problem. This is achieved by relaxing the non-linear dependency to allow for linear dependencies and preserve a more constant error even over a sizeable number of time steps. An LSTM cell contains multiple gates outside of the recursive function, steering which prior information is kept (forget gate f_t), how much of the current information is utilized (input gate I_t), and how much of the information should be fed to the next iteration (output gate O_t), illustrated in Figure 2 and designed after [23]. The gate values are outputs of separate sigmoid and tanh activation functions resulting in values between zero and one, which are adjusted as part of the network's back-propagation. Using a network with LSTM cells allows us to model which of a user's historic and current interactions are relevant to learning the provided label; these are captured by a final output vector of the LSTM representing the hidden learned internal states.

The LSTM cells require input sequences S of the same length containing vectors of identical dimensions. Sequences shorter than a fixed length L_S of clicks are left-padded with special zero values. These zero values are masked in all layers of the network so that they do not contribute to the training of the model. Each data point in the sequence is the concatenated vector of all input features I_concat described in Section 3.2. Categorical variables such as products and categories are replaced with the corresponding learned embeddings. All input vectors are combined into a matrix of batch size B, so that the LSTM cells receive an input of shape (B, L_S, I_concat).
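A sketch of how such left-padded, masked inputs could be prepared with Keras; the sequence length and feature dimensionality are placeholder values, not the thesis' configuration.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

L_S, I_CONCAT = 50, 64     # placeholder sequence length and feature dimensionality

# One variable-length click history: a list of concatenated feature vectors.
user_sequence = [np.random.rand(I_CONCAT) for _ in range(12)]

# Left padding ("pre") with zeros up to L_S steps; a Masking(mask_value=0.0)
# layer in the network then keeps the padded steps out of the gradient.
batch = pad_sequences([user_sequence], maxlen=L_S, dtype="float32",
                      padding="pre", value=0.0)
print(batch.shape)         # (B, L_S, I_concat) == (1, 50, 64)
```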

3.3.3 Network Design. The first layer of LSTMs is configured to be stateful so that it retains index-level information from prior steps in a batch [23]. This allows the model to be run much faster in larger batches instead of processing each sequence individually.


Figure 3: LSTM model architecture

This makes the choice of batch size non-trivial for sequence prediction tasks, as the LSTM weights controlling the gates need to be reset once a sequence is processed; otherwise, following sequences will be initialized with a bias. The weight reset in this model with larger batch sizes is achieved by a callback function called at the end of every sequence. As shown in the network diagram in Figure 3, the first layer of LSTM cells feeds its output to a second layer. This adds complexity to the model and allows the capture of longer patterns. The vector produced by the recurrent layers is concatenated with non-sequential data that does not need to be processed in sequence. The combined features then feed into two fully connected layers, with the second layer performing the prediction task, represented by one node for each class to be classified. The softmax activation at the end emphasizes nodes with high values and squashes weights that are significantly different from the maximum weight. As nodes in the prediction layer are mapped to products, we can serve recommendations with the products whose node weights activated the most.
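A minimal Keras sketch of the architecture described above and in Figure 3: two stacked LSTM layers on the padded click sequence, concatenation with non-sequential features, two fully connected layers, and a softmax over the product classes. Layer sizes, input names, and the optimizer are illustrative assumptions; statefulness and the state-reset callback discussed above are omitted here and sketched after the training steps in Section 3.3.4.

```python
from tensorflow.keras import Model, layers

L_S, I_CONCAT = 50, 64           # padded sequence length, concatenated feature size
N_STATIC, N_PRODUCTS = 10, 4701  # non-sequential features, product classes

seq_in = layers.Input(shape=(L_S, I_CONCAT), name="click_sequence")
static_in = layers.Input(shape=(N_STATIC,), name="non_sequential_features")

x = layers.Masking(mask_value=0.0)(seq_in)       # skip left-padded steps
x = layers.LSTM(128, return_sequences=True)(x)   # first recurrent layer
x = layers.LSTM(128)(x)                          # second layer returns final state
x = layers.Concatenate()([x, static_in])         # add non-sequential data
x = layers.Dense(200, activation="relu")(x)
out = layers.Dense(N_PRODUCTS, activation="softmax")(x)  # one node per product

csbm = Model(inputs=[seq_in, static_in], outputs=out)
csbm.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
             metrics=["accuracy"])
csbm.summary()
```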

3.3.4 Training process. The entire training process can be outlined with the following steps:

(1) Events and clicks are collected per user
(2) Input sequences for each feature are produced
(3) Sequences are left-padded, as the label represents the next click at the right end of the sequence
(4) Categorical variables are replaced with their corresponding pre-trained embedding vectors
(5) Network inputs are concatenated
(6) LSTM node weights are randomly initialized from a truncated Gaussian distribution centered around zero [24]
(7) Sequential data is recursively fed into the LSTM cells
(8) Output is predicted and loss calculated at the end of each batch
(9) All weights are updated
(10) When all data has been processed, the errors on the validation set are calculated
(11) The training procedure is repeated for all batches until the validation error stops decreasing
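One way the state-reset callback mentioned in Section 3.3.3 could look in Keras is sketched below; the callback class, the point at which states are cleared, and the usage line are assumptions, since the thesis only states that a custom TensorFlow callback resets the LSTM internal states (Section 4.5.2).

```python
from tensorflow.keras.callbacks import Callback

class ResetLSTMStates(Callback):
    """Clears the internal states of stateful LSTM layers between batches so
    that subsequently processed sequences are not initialized with a bias."""

    def on_train_batch_end(self, batch, logs=None):
        self.model.reset_states()

# Hypothetical usage with the model sketched in Section 3.3.3:
# csbm.fit([sequences, static_features], labels, batch_size=256,
#          validation_split=0.2, callbacks=[ResetLSTMStates()])
```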

3.4 Learning to predict next click, post-click interactions and first click

In supervised machine learning, training happens as input features X are associated with a given label y. The quality of the model in terms of accuracy can then be assessed by testing the model against a test set and measuring how often predictions y_pred match the true label y_true. To investigate which type of product click can be forecast best, we implement each model with three different types of labels, as illustrated in Figure 1.

3.4.1 Next click. The most common setup in sequence prediction is using the following item in a given sequence as the label y_NC, effectively predicting how the sequence will continue. This setup is commonly used in language processing or other sequence prediction tasks over an underlying time series, such as weather or financial-market predictions. In the context of clicks for product recommendations, getting this prediction right allows for valuable insights into the preferences of a user and for deriving suitable recommendations. This research limits training labels to product pages exclusively, as recommendations are made by ranking the likelihood of the predicted next product clicked per user. In order to predict the next item in a sequence, we have to consider how to choose the right training label. Simply using the last observation in the sequence would introduce a bias toward interactions that happen toward the end of the sequence. To retain more effective training of the model, the training data is generated by splitting each full sequence of a user's click history into subsequences. The label of each subsequence is the subsequent product click.
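A sketch of how such (subsequence, next-product) training pairs could be derived from one user's full click history; the record fields and identifiers are hypothetical.

```python
def next_click_pairs(click_history):
    """Split a user's ordered click history into (subsequence, label) pairs,
    where the label is the next product-page click after the subsequence."""
    pairs = []
    for i, click in enumerate(click_history[1:], start=1):
        if click["page_type"] == "product":          # labels are product pages only
            pairs.append((click_history[:i], click["product_id"]))
    return pairs

history = [
    {"page_type": "home", "product_id": None},
    {"page_type": "product", "product_id": "citytrip_rome"},
    {"page_type": "category", "product_id": None},
    {"page_type": "product", "product_id": "citytrip_madrid"},
]
print(next_click_pairs(history))   # two pairs, labelled rome and madrid
```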

3.4.2 Post-click interactions. While learning which product will be clicked next can be useful, the data shows that users' browsing behavior indicates not all product clicks are equally relevant. Users tend to click on a variety of products while browsing and only remain on a product page, or interact with its content, for a fraction of all clicks. This observation demonstrates the potential of improving the model by training on products where users interact post click, y_PCI. This removes randomness in exploratory click behavior during the product research phase and keeps data which reflects the user's interest in products more closely.

3.4.3 First click. Using product clicks with interactions as a label, or learning the next product for a user, is useful especially when a user is already on the website and predictions are made within their browsing session [25]. However, many user touch points with product recommendations do not occur during a browsing session, but rather while the user is not on the website. Promotional material, as well as marketing channels, compete for a user's attention so that they make their way back to the website to interact with the content. Therefore, we investigate adjusting the training label not only to


clicks with interaction, but to the product that was first clicked in a session, y_FC, essentially limiting the learning to classifying the product that brought a user back to the website.

3.5 Categorizing products into collections

Training a model to learn which product is most likely to be clicked by a user raises the question of scalability, as the classification task grows with the number of products. Following Rodrigues et al. [26], who demonstrated success in grouping users prior to making product recommendations, we investigate whether the classification task can be improved by grouping products together. Predicting a group of offers reduces the number of classes in the classification task and improves accuracy, but is more imprecise about which actual products will be recommended to users. Product groups can be created by using one or more shared attributes. These attributes can be created manually or be based on properties that differentiate products into bins of suitable sizes. This research, however, also aims to generate clusters of products that do not rely on selecting a shared attribute.

Hierarchical clustering [27] is used in order to obtain these collections of products. The clustering algorithm determines distances between data points and produces groups based on predetermined criteria for linkage. This method allows the creation of these groupings without having to determine a hyperparameter for a fixed number of clusters beforehand, but offers the option to determine the cutoff point based on a maximum distance criterion [28]. It iteratively calculates the distance between each centroid at each step and links clusters based on an objective function. Links are combined at the following step and create a new cluster one step above in the hierarchy until one root cluster remains [29]. For further details on the internal workings of the hierarchical clustering algorithm we refer the reader to Section A in the appendix or the clustering methods book [27].
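A sketch of hierarchical clustering with a maximum-distance cutoff instead of a fixed cluster count, here over placeholder product embedding vectors. The thesis uses a scikit-learn implementation (Section 4.5.2), but the distance threshold and linkage criterion below are illustrative assumptions, not the thesis' settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder product embedding matrix (500 products x 32 latent dimensions).
product_vectors = np.random.rand(500, 32)

# n_clusters=None together with distance_threshold cuts the hierarchy at a
# maximum linkage distance instead of fixing the number of clusters up front.
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=38.0,
                                    linkage="ward")
cluster_labels = clusterer.fit_predict(product_vectors)
print(cluster_labels.max() + 1)   # number of resulting product collections
```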

4 EXPERIMENTAL SETUP

This section outlines the experimental setup used to answer the research questions. We describe the dataset in Section 4.1 and discuss the product clustering in Section 4.2. Next, Section 4.4 outlines the baselines that are used to compare the effectiveness of the CSBM, measured by the evaluation metrics given in Section 4.3. The design of the experiments and technical details are shown in Section 4.5.

4.1 Data set

The data used in the models and experiments is taken from a TravelBird-internal event collector, which logs user clicks and back-end processes to represent all interactions and events happening on their website. The events we utilize are primarily recordings of users clicking on the homepage, search pages, category pages, and, most importantly, product pages. For each of these recorded events, information is available on how long the user stayed on the page and which elements were interacted with. All records are grouped by user and ordered by time. For training, we randomly divide the data into two parts, providing 80% for training and 20% as a holdout test set⁵.

⁵ Randomization happens repeatedly for each separate training process.

The data set is limited to users of the Dutch site that have visited at least one product page and recorded a minimum of five events in the period between January and May 2018. This makes the data set 6,539,737 records with 1,774,650 unique browsing sessions coming from 449,747 unique users. The data set comprises 4,701 different products, with each user visiting around 6 unique product pages on average. Even with some products clearly being more popular in terms of visits, we can neglect product class imbalance, as even the most frequently visited product pages make up less than 1% of all product clicks in that time frame.

The purchase of travel products is usually researched over repeated visits and comes with a longer conversion time frame. This can be seen in the records, which reveal that 91.82% of all users interact with 46.16% of the products they click on, and 77.22% of users return to the website to continue researching their next travel. We utilize this information to produce alternative labels to investigate how training on post-click interactions or the first click of consecutive sessions affects accuracy.

With the goal of predicting clicks on product pages, we limit the user records to types of pages that contain products or information about products, which make up 97.4% of all records. Out of these records, 49.41% of visits are exclusively on product pages and 30.80% on category pages, which list products sharing specific attributes such as travel destination, a broad theme, or type of transportation.

Included in the data set are several attributes of each of the products. We primarily focus on attributes that allow products to be grouped into bins of suitable sizes, as described in Section 3.5. We propose using a semi-manual label of the touristic region each product falls in. These regions are based on Dutch travel consumer reports and are primarily determined by the location of the travel product that is offered. The touristic regions were created for the Dutch market to provide an adequate representation of the various areas of interest of Dutch travelers. The attribute of touristic regions therefore makes a good candidate for semi-manual grouping of the products.

4.2 Product Clusters

The provided data shows that products can be differentiated by an extensive list of distinct features like price level, travel duration, family-friendliness, distance, included amenities, quality of accommodation, etc. We also already obtained product embeddings from applying word2vec, providing latent features on which products are clicked together in a browsing session. As many of these features are dynamic and the product portfolio slowly changes, it does not seem sensible to rely on a clustering algorithm that requires the number of clusters as a hyperparameter, such as K-means. Therefore, we opted to use hierarchical clustering with a maximum distance cutoff point, as described in Section 3.5. Following that implementation, the cluster generation is best visualized in a dendrogram, showing each split in a hierarchical manner. The values indicate pairwise distances between clusters at each level of the hierarchy. A suitable cutoff


Figure 4: Dendrogram depicting the hierarchy of clusters

point can be obtained by determining the value beyond which each next distance adds no extra meaning. Clusters are obtained by cutting the hierarchical ordering at the y-axis value of 38, resulting in around 80 clusters. This leaves us with a number of classes comparable to the product groups based on touristic regions, making differences in the number of classes negligible.

To ensure that the cluster learning is not included in the model itself, we not only replace the label with clusters, but also adjust the product clicks in the model input to use clusters instead.

4.3 Metrics

In the light of recommendations, we have to carefully choose how to evaluate the prediction and classification task of our models. Errors in the recommendations can lead to two characteristic types of error [26]: products the user likes but that are not recommended (false negatives) and recommended products the user does not like (false positives). A range of evaluation metrics is used to properly assess the usefulness of the models.

One of the primary means of evaluation is accuracy, the rate of correct predictions out of all predictions made. With the data showing some class imbalance, it is important to note that this metric is not always the most representative; nevertheless, the distribution of class labels shows that no single class occurs more than 1% of the time, which allows us to use accuracy reliably. Since recommendations typically are not single products, but rather a selection of multiple products in promotional material, accuracy is also measured for a correct prediction among the top k results. In this research the parameter defaults to k = 6, as most of the listing pages, promotional emails, and marketing efforts at TravelBird display six products to the user.
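A sketch of how accuracy@k can be computed from the softmax outputs of a classifier (k = 6 per the text; the toy data below is only for demonstration):

```python
import numpy as np

def accuracy_at_k(probabilities: np.ndarray, true_labels: np.ndarray, k: int = 6) -> float:
    """Fraction of samples whose true class is among the k highest-scored classes."""
    top_k = np.argsort(probabilities, axis=1)[:, -k:]        # indices of top-k classes
    hits = (top_k == true_labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage: 3 samples, 5 classes.
probs = np.random.rand(3, 5)
probs /= probs.sum(axis=1, keepdims=True)
labels = np.array([0, 3, 4])
print(accuracy_at_k(probs, labels, k=2))
```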

To assess the effects of true positive results, the metrics precision and recall are used. Precision can quickly reveal when too many products a user does not like are recommended, by measuring the true positive rate in all returned results. Recall, on the other hand, helps in understanding the rate of products a user clicked but that were not predicted, by calculating the proportion of true positives out of all relevant items.

In addition to measures of model quality, the recommendation task is assessed by comparing the similarity of the labeled offer to the offers in the sequence, as well as the similarity between the label and the predicted offer. A similarity score is calculated by measuring the cosine of the angle between products in the latent feature space of the product embeddings. The score is a value between zero and one, shows decent similarity around 0.7, and equals one for identical products.
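The similarity score reduces to the cosine of the angle between two product embedding vectors; a minimal sketch with random placeholder vectors:

```python
import numpy as np

def product_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Cosine of the angle between two product embeddings; 1.0 means identical."""
    return float(np.dot(vec_a, vec_b) /
                 (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

# Toy usage with random placeholder embeddings.
a, b = np.random.rand(32), np.random.rand(32)
print(product_similarity(a, b))
```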

4.4 Baselines

In order to properly assess the usefulness of a sequential neural model, the results are compared against a range of frequently used models, ranging from linear and tree-based models to standard neural classifiers. Each of the baseline models was run with the same dataset and labels, albeit with an additional step in the data pre-processing transforming the sequential data into count aggregates instead. As linear baseline models, two variations of a linear support vector machine are used: one utilizing stochastic gradient descent (SGD) as an optimizer, while the other is implemented with a "one-vs-rest" decision function to support multi-class classification (SVC) [30]. Another conventional approach for multi-class classification problems utilizes random forest models (RF), which determine the class by optimizing the objective with the help of a large array of diversely configured decision trees. As a third baseline, a neural network classifier (NN) is used to also compare against non-sequential neural models. This particular implementation is a simple setup of two fully connected layers with 100 and 200 nodes respectively and a dropout layer in between for additional complexity. To perform effective comparisons with the LSTM network, both neural networks are evaluated with the same measures and identical classification reports post training.
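A sketch of how the non-sequential baselines described above could be instantiated with scikit-learn and Keras; the hyperparameters and dropout rate are illustrative assumptions, as the thesis does not list the exact settings.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras import Sequential, layers

baselines = {
    # Linear SVM trained with stochastic gradient descent.
    "SGD": SGDClassifier(loss="hinge"),
    # Linear SVM wrapped in a one-vs-rest decision function for multi-class output.
    "SVC": OneVsRestClassifier(LinearSVC()),
    # Ensemble of diversely configured decision trees.
    "RF": RandomForestClassifier(n_estimators=100),
}

# Non-sequential neural baseline: two fully connected layers with dropout in between.
def make_nn(n_features: int, n_classes: int) -> Sequential:
    return Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(100, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(200, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```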

4.5 Experimental methodology

4.5.1 Design. The experiments are divided into three parts. The first experiment is closely modeled to answer the first research question of classifying the product of the next click. In order to assess the classification task, the results are compared to all baseline models described in Section 4.4. For the second experiment, all models are trained on all labels described in Section 3.4, and the differences in performance are discussed. The third experiment's goal is to investigate the effect of decreasing the number of classes by exchanging the product class label with a less granular label representing a collection of products. First we do this with a manual label describing touristic regions, and secondly with cluster labels generated by a hierarchical clustering algorithm from a broad range of product features.

4.5.2 Technical details and hardware. The experiments were conducted with models implemented in Python, primarily utilizing the Keras framework [31], relying on a custom callback function, written in TensorFlow [32], that resets the LSTM internal states. The product embeddings were trained with the continuous bag-of-words model (CBOW) using the Gensim package [33] in Python. Clusters were generated using the hierarchical clustering implementation in scikit-learn [34]. All models were trained on NVIDIA K80 GPU instances hosted on Amazon Web Services (AWS).


Table 1: Results for Click Sequence Based Model (CSBM) compared to baseline models on forecasting which product will be clicked next. Best results highlighted in bold.

Next Click (y_NC)

Model   Val. Accuracy   Accuracy@6   Precision   Recall
TOP     <0.001          0.002        <0.001      <0.001
LAST    0.182           0.432        0.185       0.182
RF      0.041           0.112        0.012       0.041
SGD     0.117           0.238        0.190       0.117
SVC     0.100           0.273        0.130       0.100
NN      0.199           0.449        0.270       0.200
CSBM    0.317           0.583        0.320       0.320

Figure 5: Accuracy@k measures with k between 1 and 10.

5 RESULTS

We present the results of the experiments implemented as outlined in Section 4.5. Each of the following sections provides answers to the research questions presented in Section 1. The underlying models and dataset have been kept identical throughout the experimentation process for the sake of comparison; the experiments are distinguished only by the training labels outlined in Section 3.4.

5.1 Predicting the next product

The results of the Click Sequence Based Model (CSBM) for the task of predicting the next product a user clicks on are given in Table 1 as well as Figure 5. Table 1 compares model performance against the baseline models introduced in Section 4.4, measured in terms of validation accuracy, accuracy@6, precision, and recall. Figure 5 plots changes in validation accuracy@k for all k between 1 and 10. Table 1 shows that the CSBM outperforms all other models on all metrics by a considerable margin. Using the most frequently visited product as a prediction (TOP) yields < 0.1% accuracy, showing that the class imbalance has no effect on the classification task. The approach of using the prior product clicked (LAST) as a prediction has an accuracy of 0.182, making it a naive baseline. The RF, SGD, and SVC models failed to pick up even the basic pattern used by the naive LAST model, with accuracies below it.

The notable exception is the SGD model's precision, which is on par with the naive LAST model. The sole baseline outperforming the LAST model is the NN model. Interestingly, recall is consistently lower than precision, except for the CSBM and LAST models.

Only the two neural models in our comparison managed to beat the naive baseline of using previously visited products as predictions with statistical significance. This shows that even a relatively simple two-layer neural network with a dense layer in between manages to learn to predict the product clicked purely from the number of occurrences of previous clicks, with a mean validation accuracy of around 20%. This type of neural network does not rely on any sequential information in the data, nor does it leverage the timing or types of clicks recorded, but instead relies on the amounts and types of previously visited offers to reach a classification. The click sequence-based model CSBM, however, shows a drastic increase in accuracy over even the basic NN, to almost 32%. This shows that the LSTM network manages to utilize the provided user clicks as sequential input effectively to predict the following product visited more accurately.

Investigating the results of the CSBM model more in depth with regard to similarity reveals that the predicted and actual offers are alike, with a similarity score of over 0.76. This is much more similar than the average similarity of 0.69 measured between the product used as a label and all offers in the sequence. This shows that even when the prediction was not correct, the model predicts products similar to the actual next offer clicked. The test set was split into two parts for further investigation: samples where the labeled product is similar to products in the click sequence, and samples with dissimilar products as training labels. The samples were split on the mean similarity between product label and sequence. Accuracy measured on the part with dissimilar product labels revealed that the model was nevertheless able to predict 26% correctly, showing the model's ability to learn beyond purely relying on products in the input sequences, albeit with less accuracy.

However, a similarity score of 0.73 between the predicted offer and the offers in the sequence is higher than the score of 0.69 measured for the actual offer, indicating that the model primarily, although not exclusively, suggests products most similar to previously seen offers without being exactly the same. Depending on the goal of the recommendation, this might be problematic, as it reuses already clicked products instead of suggesting something new and leaves less room for inspirational product recommendations.

Additionally, we find that the CSBM generally yields higher recall compared to the baseline models. This demonstrates the classifier's ability to retrieve the positive samples, which is crucial when adopting the classifications as product recommendations: it increases the chance of finding more products the user will like while recommending fewer products the user will not visit.

As recommendations are typically made with multiple product suggestions, Figure 5 shows the differences in accuracy@k at different levels of k. This demonstrates a model's ability to correctly predict the next visited offer among the top k highest predicted


Table 2: Results for baseline and sequential models on alternative training labels. Best results highlighted in bold.

Post-Click Interaction (y_PCI)

Model   Val. Accuracy   Accuracy@6   Precision   Recall
SGD     0.077           0.182        0.159       0.077
SVC     0.075           0.196        0.100       0.075
NN      0.161           0.360        0.180       0.150
CSBM    0.154           0.347        0.170       0.150

First Click (y_FC)

Model   Val. Accuracy   Accuracy@6   Precision   Recall
SGD     0.104           0.218        0.162       0.104
SVC     0.072           0.186        0.103       0.072
NN      0.146           0.323        0.200       0.160
CSBM    0.173           0.358        0.180       0.170

products ranked by likelihood. The figure reveals that correct predictions are most often within the top k = 4 predictions, as shown by the steep increase at low k and a diminishing effect for k > 4 for all models; k = 4 is the last step at which the curve gradient is higher than the mean gradient. The CSBM model shows the highest accuracy over all k, with only the NN also outperforming the naive baseline LAST. However, the LAST model visibly has a stronger diminishing effect at higher k, as not all samples have at least k product clicks in the prior sequence.

5.2 Predicting post-click interactions and first clicks

Model performance for predicting post-click interactions y_PCI and first clicks of consecutive sessions y_FC is presented in Table 2, comparing the baselines presented in Section 4.4 to the CSBM model⁶.

5.2.1 Post-click interactions. The results shown in Table 2 reveal similar accuracy for the NN and CSBM models at around 16%. The difference is not statistically significant. This suggests it is harder to model clicks that lie further in the future than the next click. This can be explained by the next click being closely related to the prior sequence: a visit on a category page has only a limited range of valid next product clicks, namely the products listed on that category page, exiting the page, or clicking on another globally available page. Predicting the next product with post-click interactions, however, presents a considerably more complex problem, as the range of valid following clicks given the last prior click is much larger.

5.2.2 First click in a browsing session. Running the same experiment with the label set to the first product click in the subsequent user session reveals the models have learned to correctly predict the next product a user clicks on 17.3% of the time. This is slightly more accurate than the NN with 14.6%. Both neural models outperform the remaining baselines. Training on y_FC yields a higher

⁶ The naive baseline prediction of the last product visited in the prior sequence (LAST) is not shown, as not all sequences contain prior visits with post-click interactions or first clicks, and it therefore cannot be compared to the other models.

validation accuracy than models trained on y_PCI, but is not more accurate than limiting the prediction to the next product in a click sequence.

This suggests that different models can be used for these two separate problems, each optimized either to recommend products during a session or to reactivate a user, compelling them to return to the website. This particular variant of the experiment is exceptionally useful when judged by the accuracy@6 also depicted in Table 2, as many marketing-related efforts to reactivate users display multiple products to choose from; in TravelBird's particular case, promotional emails contain six products. An accuracy@6 of almost 36% suggests that this model is useful for bulk recommendations, as it correctly predicts the product that compels a user to return to the website a little more than one-third of the time.

5.3 Improving classifications with product collections

Grouping products was achieved first by following a semi-manual process based on specific product attributes, as outlined in Section 4.1, and secondly by generating clusters from product properties with hierarchical clustering, as described in Section 4.2. Results are compared in Table 3, including measurements across the different labels y_NC, y_PCI, and y_FC, which were presented in Section 3.4 and already investigated in the previous Section 5.2.

5.3.1 Product groups by semi-manual labeling. Results in Table 3 expectedly show higher accuracy for both the baselines and the sequential model compared to predicting exact products, confirming the presumed benefit of reducing the number of classes. The CSBM model on y_NC shows the highest accuracy with 48.4% and 0.48 on both precision and recall. The NN and SVC baselines reveal similar accuracy values of around 37%. The two linear models SGD and SVC demonstrate a better classification ability on this task with fewer classes. Using y_FC or y_PCI is less accurate than y_NC,

in line with the results shown in Section 5.2. Both alternative training labels reveal no significant benefit of using the CSBM, with all baselines being similarly accurate. This can be explained as discussed in Section 5.2.1. The NN model seems to obtain an advantage in measuring the highest accuracy@6; however, the SGD model shows the highest precision at 0.391 and 0.363, respectively. Table 3 also shows drastically lower recall than precision values across all baselines, demonstrating an advantage of the CSBM, which shows comparably high recall measurements.

Results for predictions on product collections are more accurate; however, this makes deriving recommendations less precise. Recommendations are derived by ordering the products within a predicted group by their mean daily visits. Applying this ranking reveals that 52% of all product labels are ranked in the top 6 of their product group. This contrasts with the accuracy@6 of 0.583 of the CSBM on y_NC for predicting products directly, compared to a validation accuracy of 0.484 for predicting the touristic region correctly with the actual correct product in the top 6 of that group.

5.3.2 Product groups through hierarchical clustering. Table 3 shows some improvements over using the semi-manual product


Table 3: Results for CSBM and baseline models on forecasting which group the next clicked product, the next product interacted with post click, and the first product clicked in a session belong to. Best results highlighted in bold.

Manual

Next Click (y_NC)
Model   Val. Acc.   Acc@6   Precision   Recall
SGD     0.292       0.559   0.393       0.292
SVC     0.366       0.772   0.379       0.366
NN      0.376       0.763   0.410       0.380
CSBM    0.484       0.795   0.480       0.480

Post-Click Interaction (y_PCI)
Model   Val. Acc.   Acc@6   Precision   Recall
SGD     0.203       0.481   0.391       0.203
SVC     0.313       0.638   0.369       0.313
NN      0.320       0.653   0.360       0.320
CSBM    0.338       0.644   0.340       0.340

First Click (y_FC)
Model   Val. Acc.   Acc@6   Precision   Recall
SGD     0.246       0.462   0.363       0.246
SVC     0.330       0.330   0.340       0.330
NN      0.312       0.651   0.350       0.310
CSBM    0.320       0.646   0.310       0.310

Cluster

Next Click (y_NC)
Model   Val. Acc.   Acc@6   Precision   Recall
SGD     0.277       0.632   0.423       0.277
SVC     0.374       0.807   0.382       0.374
NN      0.380       0.793   0.420       0.380
CSBM    0.480       0.836   0.470       0.480

Post-Click Interaction (y_PCI)
Model   Val. Acc.   Acc@6   Precision   Recall
SGD     0.288       0.549   0.325       0.288
SVC     0.330       0.700   0.347       0.330
NN      0.331       0.709   0.380       0.330
CSBM    0.354       0.729   0.350       0.350

First Click (y_FC)
Model   Val. Acc.   Acc@6   Precision   Recall
SGD     0.258       0.547   0.378       0.258
SVC     0.355       0.736   0.361       0.355
NN      0.336       0.708   0.360       0.340
CSBM    0.355       0.712   0.340       0.350

groups based on touristic regions, especially when learning clicks further into the future, such as y_PCI or y_FC. This suggests that learning clusters as part of the preprocessing step has a benefit for predicting the group of products most likely to be clicked on next by a user. For y_FC or y_PCI there is no notable difference between the CSBM model and the baselines, with the NN and SVC yielding the highest accuracies. CSBM shows an improvement for all metrics only on the y_NC label.

While clustering yields more accurate predictions than the semi-manual labels, especially on y_FC and y_PCI, the ranking of the correct products within clusters is much lower, with 34% of product labels ranked among the top 6. This shows that clusters might lead to a slight advantage in the prediction task, but do not yield the same quality of recommendations when compared to a more manual grouping of products.

6 CONCLUSION

In this research, we have introduced a click sequence-based model to predict which product is clicked on next based on users' interaction histories in a commercial web setting. We represented the state and products presented on the page at each step in a series of a user's clicks as embeddings, allowing us to utilize learned syntactic and semantic properties between the products. Feeding this data into an RNN with LSTM cells allowed it to generalize behavioral patterns found across all users' click histories.

Our results show that the CSBM outperforms all baselines on the prediction task, even when presented with a substantial number of classes. It demonstrates value for recommendations as, even with incorrect predictions, the suggested product is still similar to the actual product that was clicked next. The model improves further when increasing the result set to the top k = 4, making it useful for issuing bulk recommendations. However, upon closer inspection, we identified limited variation of the product suggestions compared to the products found in the input sequence, revealing a limited ability of the CSBM to inspire users. Training the CSBM on alternative labels, such as predicting the next product the user interacts with post click or predicting

the first product clicked in succeeding browsing sessions, showed only small improvements in accuracy over the baselines. We identified the alternative labels as a considerably more difficult problem to optimize for, as they are not closely associated with the immediate prior click in the input sequence, making the predictions vary a great deal more. Additionally, we found that predicting with a limited number of classes allows the model to stay consistent and more accurate, but only gets the general direction of products right, which then still has to be translated into actual product recommendations based on a metric ranking all products inside that group.

For future work, the model could be improved by combining it with features from the item-to-item similarity generated by alternating least squares (ALS) to combat the cold start problem. We should also investigate applying the CSBM to a dataset that is unbiased by clicks that were already driven by an existing recommender system. Furthermore, as this research was mainly concerned with predictive accuracy, the actual effect of the recommendations would have to be evaluated by presenting the CSBM's predictions to users in an A/B test.

Acknowledgements. Firstly I want to thank both my excellent supervisors, Ilya Markov and Rob Winters, for lending me their knowledge and giving valuable feedback. I am especially grateful for the large degree of freedom and autonomy they gave me; it made writing this thesis a mostly stress-free experience. A special thanks to Bastien for being the second reader and helping out on the defense examination. I also tremendously appreciate the continuous support during my studies and research from the entire data science team at TravelBird: Rob, Bastien, Niels, Egle, Jeff and Teodora. Especially Niels, whose help accelerated training models in the cloud many-fold. Lastly I want to thank my brother Ben for being there for me during the entire research and writing process.

REFERENCES

[1] Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. Click Models for Web Search. Synthesis Lectures on Information Concepts, Retrieval, and Services, 7(3):1– 115, 2015.

[2] Daphne. Koller and Nir. Friedman. Probabilistic graphical models : principles and techniques. MIT Press, 2009.

(12)

[3] Alexey Borisov, Ilya Markov, Maarten de Rijke, and Pavel Serdyukov. A Neural Click Model for Web Search. In Proceedings of the 25th International Conference on World Wide Web - WWW ’16, pages 531–541, New York, New York, USA, 2016. ACM Press.

[4] Alexey Borisov, Martijn Wardenaar, Ilya Markov, and Maarten de Rijke. A Click Sequence Model for Web Search. 2018.

[5] Badrul Sarwar, George Karypis, Joseph Konstan, and John Reidl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the tenth international conference on World Wide Web - WWW ’01, pages 285–295, New York, New York, USA, 2001. ACM Press.

[6] Balazs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based Recommendations with Recurrent Neural Networks. Physical Review Letters, 116(15), 11 2015.

[7] Yuchen Zhang, Weizhu Chen, Dong Wang, and Qiang Yang. User-click modeling for understanding and predicting search-behavior. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining -KDD ’11, page 1388, New York, New York, USA, 2011. ACM Press.

[8] Sebastian Heinz, Christian Bracher, and Roland Vollgraf. An LSTM-Based Dynamic Customer Model for Fashion Recommendation. CoRR, abs/1708.0, 8 2017.

[9] Robin Devooght and Hugues Bersini. Long and Short-Term Recommendations with Recurrent Neural Networks. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization - UMAP '17, pages 13–21, New York, New York, USA, 2017. ACM Press.

[10] Arthur Toth, Louis Tan, Giuseppe Di Fabbrizio, and Ankur Datta. Predicting Shopping Behavior with Mixture of RNNs. eCom, 2017.

[11] Cole Maclean, Barbara Garza, and Suren Oganesian. A Recurrent Neural Network Based Subreddit Recommendation System. Technical report, 2016.

[12] David Zhan Liu and Gurbir Singh. A Recurrent Neural Network Based Recommendation System. Technical report, 2015.

[13] Christian Bracher, Sebastian Heinz, and Roland Vollgraf. Fashion DNA: Merging Content and Sales Data for Recommendation and Article Mapping. CoRR, abs/1609.0, 8 2016.

[14] Feng Yu, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. A Dynamic Recurrent Model for Next Basket Recommendation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '16, pages 729–732, New York, New York, USA, 2016. ACM Press.

[15] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. 1 2013.

[16] Tomas Mikolov, Wen-Tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. pages 746–751, 2013.

[17] Daniel Sanchez Santolaya. Using recurrent neural networks to predict customer behavior from interaction data. PhD thesis, 2017.

[18] Makbule Gulcin Ozsoy. From Word Embeddings to Item Recommendation. CoRR, abs/1601.0, 1 2016.

[19] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323:533, 10 1986.

[20] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Juergen Schmidhuber. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies.

[21] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[22] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning Long Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[23] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.

[24] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks.

[25] Yong Kiam Tan, Xinxing Xu, and Yong Liu. Improved Recurrent Neural Networks for Session-based Recommendations.

[26] Fatima Rodrigues and Bruno Ferreira. Product Recommendation based on Shared Customer's Behaviour. Procedia Computer Science, 100:136–146, 1 2016.

[27] Lior Rokach and Oded Maimon. Clustering Methods. In Oded Maimon and Lior Rokach, editors, Data Mining and Knowledge Discovery Handbook, pages 321–352. Springer US, Boston, MA, 2005.

[28] Daniel Müllner. Modern hierarchical, agglomerative clustering algorithms.

[29] R. Sibson. SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1 1973.

[30] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2012.

[31] François Chollet. Keras, 2015.

[32] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, Xiaoqiang Zheng, and Google Brain. TensorFlow: A System for Large-Scale Machine Learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), pages 265–284, 2016.

[33] Radim Rehurek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, 2010.

[34] Eric Jones, Travis Oliphant, and Pearu Peterson. SciPy: Open source scientific tools for Python, 2001.

[35] Joe H Ward. Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, 58(301):236–244, 1963.


Appendices

A HIERARCHICAL CLUSTERING

The distance between two centroids d(a, b) is determined by calculating the Euclidean distance, i.e. the straight-line distance between two points in space, as given in (5). The Euclidean distance is expressed as the square root of the sum of the squared differences between the points over all dimensions.

d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \qquad (5)

To determine which two clusters to link, we choose to minimize the variance within a cluster, following the objective function of Ward [35]. As all pairwise distances are known at each step of the recursive procedure, the distance from a newly merged cluster a ∪ b to an unrelated cluster c can be expressed through the distances from each of the merged clusters a and b to c, minus the distance between a and b, with each term weighted by the respective cluster sizes n_i, as depicted in (6).

d(a \cup b, c) = \frac{n_a + n_c}{\sum_{i \in \{a, b, c\}} n_i} \, d(a, c) + \frac{n_b + n_c}{\sum_{i \in \{a, b, c\}} n_i} \, d(b, c) - \frac{n_c}{\sum_{i \in \{a, b, c\}} n_i} \, d(a, b) \qquad (6)
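As a brief illustration, the snippet below sketches how such an agglomerative clustering with Ward's criterion over product vectors can be computed with SciPy [34]. The random embedding matrix and the choice of ten clusters are placeholder assumptions made for this example only, not the actual data or settings used in this research.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder product embeddings (100 products in a 32-dimensional space);
# in practice these would be the learned product vectors.
rng = np.random.RandomState(42)
embeddings = rng.rand(100, 32)

# Agglomerative clustering with Ward's minimum-variance criterion, which
# operates on Euclidean distances between observations as in (5).
Z = linkage(embeddings, method="ward")

# Cut the resulting dendrogram into a fixed number of clusters,
# e.g. ten product groups, and inspect the first few labels.
labels = fcluster(Z, t=10, criterion="maxclust")
print(labels[:10])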
