Attribute-aware Diversification for Sequential Recommendations


MSc Artificial Intelligence

Master Thesis

by

Anton Steenvoorden

11850493

June 25, 2020

48 European Credits October 2019 - June 2020

Supervisor:

Dr. Pengjie Ren

Emanuele Di Gloria

Assessor:

Prof. Dr. Maarten de Rijke

Institute of information and language processing systems


Abstract

Research has shown users prefer diverse recommendations over homogeneous ones. However, most recent work on Sequential Recommenders does not consider diversity and only strives for maximum accuracy, resulting in accurate but redundant recommendations. In this work, we present a novel model called the Attribute-aware Diversifying Sequential Recommender (ADSR). The ADSR takes both accuracy and diversity into consideration while generating the list of recommendations. Specifically, the ADSR utilizes available attribute information when modeling a user’s sequential behavior. The ADSR simultaneously learns the user’s most likely item to interact with, and their preference for attributes, which is used to diversify the recommendations. The ADSR consists of three modules: the Attribute-aware Encoder (AE), the Attribute Predictor (AP) and the Attribute-aware Diversifying Decoder (ADD). First of all, the AE is responsible for encoding the input sequences and learning the preference over items. Second, the AP is responsible for learning the preference for the attributes. Third, the ADD incrementally generates a diversified list of recommendations based on the predicted attribute preference distribution, while taking the item-attributes already present in the recommended list into account.

Experiments on the publicly available datasets MovieLens-1M and TMall demonstrate that the ADSR can significantly increase the diversity of recommendations while maintaining accuracy. Moreover, an ablation study shows the positive effect each module has on performance by investigating the performance of stripped-down variants of the ADSR. Results from the comparison with other baselines show that the ADSR can provide highly diverse recommendations while outperforming a number of models. Furthermore, several studied cases show that the ADSR provides properly diverse recommendations, but that instances exist where the ADSR is limited in its diversifying capabilities. Finally, a follow-up experiment is performed in which the user identifier is used to learn an embedding used to make predictions. Results show that this helps the ADSR yield higher performance, indicating that further improvements over the model are possible in the future.


Acknowledgements

First of all, I would like to thank Emanuele Di Gloria (Ahold Delhaize) and Pengjie Ren (University of Amsterdam), for their guidance and support during this thesis. Without their time, effort, and expertise this thesis would not have been in the shape it is today. Second, I would like to thank Wanyu Chen (UvA), who has helped me get familiar with the topic at the start of this thesis and helped form the idea of this research. Furthermore, I want to thank Bart Voorn for providing the opportunity to conduct this research at Ahold Delhaize, and my colleagues Hinda Haned and Kim de Bie for providing a good atmosphere and for giving feedback on my work. I would also like to thank members of the Data Science team of Albert Heijn (Sven Schagen, Almer Tigelaar, Freek Boutkan, and Mats Willemsen) for giving me insights into a real-world environment, and for letting me present my work and gather feedback. Next, I would like to extend my gratitude to Maarten de Rijke for proofreading my short paper before submitting and for taking the time to read and assess this thesis. Finally, I would like to thank my family, and especially my girlfriend Anne, for being ever supportive during the entirety of my master’s degree in Artificial Intelligence.


Introduction

Recommender Systems (RSs) are widely used to reduce the information overload users face [18]. In other words, Recommender Systems help users find a subset of items they are interested in from a large collection of items. At Netflix, RSs are used to help users find the next movie to watch [23], at YouTube they are used to suggest which video to view next [13], and in an e-commerce setting, they are used to suggest products the user might be interested in purchasing [32]. Many companies in e-commerce have recognized the value of RSs and have been researching them extensively [36]. A number of publicly available datasets meant for research on RSs are from (online) retailers such as Taobao, Amazon, and Tesco [2].

At Ahold Delhaize, a Dutch-Belgian retail company (where the research for this thesis is conducted), e-commerce plays an important role. It is therefore of great importance to have a good RS in place to help satisfy the customers, to give them a better experience, and to increase sales [26]. A good RS can convert users who are just browsing into buyers. Loyalty can be increased by providing the best matching personalized offers, making the user want to return to the webshop that knows their preference [32]. Especially during the current pandemic (COVID-19), the presence of e-commerce has increased significantly, and it plays a key role for many people in obtaining products and groceries. A large number of products previously bought in stores are now purchased through webshops, increasing the importance of a good recommender that helps satisfy customers and makes buying from the webshop a pleasant experience.

To provide recommendations, Recommender Systems use the historical behavior of many users to predict the relevance of items, which is then used to select and present a list of items. Traditional Recommender Systems such as Collaborative Filtering and Matrix Factorization [26, 35, 23] provide recommendations using a decomposition of a user-item interaction matrix. Recommendations are made by either suggesting items other users have interacted with (user-to-item), or items that are related to items from the user’s history (item-to-item). The matching of items is performed in the D-dimensional space of the decomposed matrix.

Another approach called content-based filtering uses features of the items and users to produce recommendations instead, using e.g. techniques from the field of Natural Language Processing (NLP) to interpret textual descriptions.

However, none of the above approaches take the order of interactions into account, and they fail to capture the users’ evolving preferences [19]. As the user’s preferences evolve, interactions performed a long time ago may not be very relevant at the time of making recommendations. These methods do not forget [24], meaning that the interactions performed a long time ago still affect recommendations. Therefore, Sequential Recommenders (SRs) have attracted a lot of attention. They are able to capture the temporal drift in preference and relevance, using a fixed number of parameters to do so. SRs are designed to exploit the order of interactions, by using e.g. Recurrent Neural Networks (RNNs) to model sequences of behavior [19, 25, 31]. These models read and encode sequences into hidden representations, which are then used to make predictions. In most recent research, these RNNs have been enhanced with so-called attention mechanisms. These are specifically designed to learn which interactions of a sequence the RNN should pay “attention” to when forming a representation of the full sequence. In industry, Sequential Recommenders allow e-commerce retailers to provide recommendations for (unknown) users, based on their current sequence of interactions, and have proven to be very effective [19, 25, 31, 26].

After these successes, researchers looked into exploiting available side information in Sequential Recommendation, e.g., item attributes (category, genre, etc.). This approach has recently been shown to be useful for capturing user preference [5, 10]. Bai et al. [5] use available item attributes to make their model attribute-aware by obtaining a unified representation from the items and their attributes. Their recommendations are based on both the items interacted with as well as attribute information about the items. Similarly, Chen et al. [10] use item attributes to infer users’ future intention, allowing them to do intent-aware recommendation, where the recommendations are selected according to the user’s intentions. However, for both methods, the focus lies on increasing performance in terms of accuracy only, resulting in homogeneous recommendations. These RSs seek to provide items relevant to the user, but do not consider that each recommended item needs to be of value and should not be a near-duplicate of the other recommended items.

Zhang and Hurley [41] and Carbonell and Goldstein [8] have shown that diversity is an important metric to consider when returning results in web search and that users prefer more diverse search results as opposed to highly accurate, but redundant ones. Therefore, to satisfy the wish for a diverse list, the RS should ensure the list of recommendations contains items that each provide distinct value. Moreover, Li et al. [26] show that besides increasing user satisfaction, providing diverse recommendations can also increase sales, as users might explore and add more to their basket.

1.1 Outline thesis

In this work, we propose to use available attribute information to diversify Sequential Recommenders. For instance, assume we know that a user enjoys apples, bananas, and oranges, and that this user has previously bought Pink Lady apples. When the user is shopping at an online retailer and has just placed bananas in their basket, the recommender could suggest e.g. Pink Lady apples. With this recommendation in place, the user is probably satisfied with the “apples” category and would rather see some other fruits, or perhaps some baking material, as opposed to the recommender suggesting 10 different brands of apple. Therefore, category information and preferences can be used to present diverse recommendations, collectively covering multiple categories. To this end, we present an Attribute-aware Diversifying Sequential Recommender (ADSR). This model uses available attribute information to predict and diversify the recommended items, considering both accuracy and diversity (another example illustrating the need for diversity is presented in Section 3.6).

The ADSR consists of three modules: an Attribute-aware Encoder (AE), an Attribute Predictor (AP), and an Attribute-aware Diversifying Decoder (ADD). The AE models both the sequence of items and their attributes. The AP learns the user’s preference on attribute level to predict a probability distribution over the item categories. Finally, the ADD is used to diversify recommendations according to the probability distribution from the AP. To do this, the ADD trades off relevance and diversity while selecting items, via a balancing hyperparameter $\lambda_s$ (a scalar between 0 and 1).

As a result, ADSR can provide diverse recommendations based on the user’s preferences. The AP and ADD are optimized in a multi-task learning paradigm [9]. To compare performance in terms of accuracy and diversity we carry out experiments on two benchmark datasets.

With this work, we seek to answer the following research questions:

(RQ1) What are the effects on accuracy and diversity of the AE, AP and ADD modules? (See §5.1.)

(RQ2) How does the performance of ADSR compare to baseline methods in terms of accuracy and diversity? (See §5.2.)

(RQ3) How does the trade-off parameter $\lambda_s$ affect the accuracy and diversity of ADSR? (See §5.3.)

(RQ4) Can ADSR generate properly diversified recommendations? What are the limitations? (See §5.4.)

(RQ5) What is the effect of modeling the user on the performance of the ADSR? (See §5.5.)

The thesis is outlined as follows. First, the necessary background on Traditional Recommender Systems, Sequential Recommenders, and recent techniques, is presented in Chapter 2. Next, our proposed model is described in detail in Chapter 3. Then, the experimental setup is described with details on the datasets, metrics, and baselines in Chapter 4. The collected results of the experiments and answers to the research questions are presented and discussed in Chapter 5. Finally, the work is concluded in Chapter 6 and suggestions for future research are made.


Chapter 2

Background

In this chapter, some necessary background is provided and relevant techniques, that serve as the foundation for the architecture proposed in this thesis, are explained.

In Section 2.2, the traditional approaches to Recommender Systems are briefly discussed. Then, in Section 2.3, the sequential approach to recommendations is outlined. Here, a number of essential building blocks are introduced. Section 2.3.3 provides an explanation of the attention mechanism used in modern Recommender Systems. Next, we go over some approaches in the field of web search to diversify results in Section 2.5. Finally, in Section 2.6, the multitask learning paradigm which is used in our work is introduced.

2.1 Recommender Systems

Recommender Systems are widely used to reduce the information overload users face [18]. In other words, RSs help users find a subset of items they are interested in from a large collection of items. Recommender Systems provide recommendations in many forms. At Netflix, Recommender Systems are used to help users find the next movie to watch [23], at YouTube they are used to suggest which video to view next [13], and in an e-commerce setting, they are used to suggest products the user might be interested in purchasing [32]. In all cases, the Recommender System is used to provide better service to customers and to improve sales and engagement with the service [26, 32].

2.2 Traditional Recommender Systems

One of the first Recommender Systems, Tapestry, assumed that if users u1 and u2 rate n items similarly, or have similar behaviors, they will rate other items similarly [18]. This is called Collaborative Filtering and has been widely adopted. The interactions are transformed into a user-item (or item-item) matrix. From this matrix, ratings are predicted for items the user has not yet interacted with. These ratings can be used directly to present recommendations by simply sorting according to their values. Another early approach is called content-based recommendation. These Recommender Systems analyze the content of the items and users, such as text descriptions, and try to find similarities in content. Often, content-based recommenders use cosine similarity between bag-of-words encodings or use Euclidean distance to do matching [29].
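To make the content-based matching concrete, the following is a minimal sketch of cosine similarity between bag-of-words encodings. The item identifiers, the descriptions, and the helper names (`bag_of_words`, `cosine_similarity`, `most_similar`) are hypothetical and serve only as an illustration; a real system would typically add tokenization, stop-word removal, and term weighting.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split on whitespace and count each token."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical item descriptions, invented for illustration only.
descriptions = {
    "pink_lady": "sweet crisp red apple",
    "granny_smith": "sour crisp green apple",
    "baking_flour": "fine wheat flour for baking",
}


def most_similar(item_id, catalog):
    """Rank all other items by similarity to the description of item_id."""
    query = bag_of_words(catalog[item_id])
    scores = {
        other: cosine_similarity(query, bag_of_words(text))
        for other, text in catalog.items()
        if other != item_id
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranking = most_similar("pink_lady", descriptions)
```

In this toy catalog the two apples share two of their four tokens, so they are ranked as most similar, while the flour shares no tokens and receives a similarity of zero.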

A third popular approach is Matrix Factorization [23]. This is an example of a latent factor model, which infers a representation for each row in the user-item matrix in a D-dimensional space. Inferring this representation is often done with the Alternating Least Squares algorithm or by performing the Singular Value Decomposition. The dimensionality of this latent space is often a relatively low number compared to the number of unique items a user can interact with. These latent factors are assumed to explain the properties of the items. In the case of movie recommendations, these factors might represent e.g. “comedy” or “has a female protagonist”. For each user, the model measures how much they like the factors and can suggest movies that have similar factors [35].
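As an illustration of a latent factor model, the sketch below factorizes a small, made-up set of ratings into D-dimensional user and item factors. Note that it uses plain stochastic gradient descent on the observed ratings, which is a simpler substitute for the Alternating Least Squares and Singular Value Decomposition procedures mentioned above. The rating triples, the hyperparameters, and all function names are assumptions chosen for the example.

```python
import random


def predict(P, Q, u, i):
    """Predicted rating: dot product of the user and item factors."""
    return sum(p * q for p, q in zip(P[u], Q[i]))


def mean_squared_error(ratings, P, Q):
    """Average squared error over the observed ratings."""
    return sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)


def init_factors(n_rows, dim, rng):
    """Small positive random factors, one D-dimensional row per user or item."""
    return [[rng.uniform(0.1, 0.5) for _ in range(dim)] for _ in range(n_rows)]


def train(ratings, P, Q, lr=0.02, reg=0.01, epochs=1000):
    """Fit the factors in place with SGD and L2 regularization."""
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for k in range(len(P[u])):
                p_uk, q_ik = P[u][k], Q[i][k]
                P[u][k] += lr * (err * q_ik - reg * p_uk)
                Q[i][k] += lr * (err * p_uk - reg * q_ik)


# Hypothetical observed (user, item, rating) triples; all other cells are unknown.
ratings = [
    (0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
    (2, 1, 5), (2, 3, 2), (3, 2, 1), (3, 3, 5),
]

rng = random.Random(0)
D = 2  # dimensionality of the latent space
P = init_factors(4, D, rng)  # user factors
Q = init_factors(4, D, rng)  # item factors

mse_before = mean_squared_error(ratings, P, Q)
train(ratings, P, Q)
mse_after = mean_squared_error(ratings, P, Q)

# Score for a user-item pair that was never observed.
unseen_score = predict(P, Q, 0, 2)
```

Recommendations for a user can then be obtained by computing `predict` for every item the user has not interacted with and sorting the items by the resulting score.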

However, the above methods are limited when the data about a user is sparse, in which case recommendations become unreliable. A similar issue, called the cold-start problem, occurs when a new user enters the system. This user has not provided any ratings yet, and no specific recommendations can be made [35].

Another limitation is the linearity of these models, which restricts them from learning more intricate patterns. Neural Network-based models are capable of this as they employ non-linear activations [42], allowing them to learn these complex relations. Furthermore, these approaches do not take the order of interactions into account and fail to capture the users’ evolving preferences [19]. However, users are known to have a drifting preference and these methods do not forget historical interactions [24] which may not be very relevant anymore.

All the above reasons have inspired the development of models that can deal with these issues, called Sequential Recommenders, which will be explained next.

2.3 Sequential Recommender Systems

Traditionally, Sequential Recommenders used only the last click to predict the next by using Markov Chain based models. Since the successes of Recurrent Neural Networks (RNNs) in Natural Language Processing (NLP) [19], Sequential Recommenders have attracted a lot of attention, as they were able to process sequential information, and outperformed traditional methods. These Neural Network-based methods can in fact capture intricate user interaction patterns due to their large number of parameters and the non-linearities present in their architectures [42]. In this section, we first describe what RNNs are and some of their limitations. Next, a number of models applying RNNs to recommendations are outlined. Finally, the so-called attention mechanism employed by some of these models is explained.

2.3.1 Recurrent Neural Network (RNN)

Recurrent Neural Networks can capture the temporal drift in preference and relevance [19, 24]. RNNs sequentially read input data (the input is often converted into learned embeddings before reading) and update their hidden state/representation by performing a series of matrix multiplications and by applying non-linear functions (referred to as activations). The parameters are shared, meaning that the same matrices are used for each input element (Figure 2.2 shows that the same network is used for all the steps). This allows RNNs to model input of varying lengths while keeping the number of parameters constant. After reading the complete input sequence, a global preference is calculated from the hidden states and a prediction is made by feeding this global preference into a classification network, often called the Multi-Layer Perceptron (MLP). The MLP is a fully connected layer that transforms the hidden representation into a vector with one score per item and is trained to predict the correct item. Oftentimes, the last hidden state is used as the global preference, as this hidden state captures the complete input sequence [25]. From the predictions, an error is calculated. The errors for each time step are aggregated and a gradient is calculated. Finally, the gradient is backpropagated through the network and the parameters of the network are updated.

Formally, the input to the RNN is a sequence of items $S = \{x_1, x_2, \cdots, x_T\}$, where each $x_t$ is the index of an item. These indices are often represented as one-hot-encoded vectors, which are converted to a sequence of learned embeddings via matrix multiplication: $S = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_T\}$ with $\mathbf{x}_t \in \mathbb{R}^{d_x}$. The RNN processes these sequentially and generates a hidden state for each input item, resulting in a sequence of hidden states $\{\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_T\}$ with $\mathbf{h}_t \in \mathbb{R}^{d_{GRU}}$ ($d_{GRU}$ is the number of dimensions used in the neural network). The processing of input in the “Vanilla” RNN works as follows.

For each time step $t$, the new hidden state is calculated from the previous hidden state $\mathbf{h}_{t-1}$ (a null vector is used for the first input) and the input embedding $\mathbf{x}_t$, after which the output $\mathbf{y}_t$ is calculated from the new hidden state:

$$
\mathbf{h}_t = \sigma_1(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}_h), \qquad
\mathbf{y}_t = \sigma_2(W_y \mathbf{h}_t + \mathbf{b}_y), \tag{2.1}
$$

where $\sigma_1$ and $\sigma_2$ are activation functions. $\sigma_1$ is often the tanh function (see Eq. 2.2), and $\sigma_2$ is often the softmax activation (see Eq. 2.5). Here $W_h \in \mathbb{R}^{d_{GRU} \times d_{GRU}}$ and $W_x \in \mathbb{R}^{d_{GRU} \times d_x}$, such that the output $\mathbf{h}_t \in \mathbb{R}^{d_{GRU}}$.
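The recurrence of Eq. 2.1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: tanh and softmax are used for the two activations, and the embeddings, weight matrices, and dimensions (two hidden units, three items) are arbitrary toy values rather than learned parameters.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def softmax(v):
    """Numerically stable softmax."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(h_prev, x_t, W_h, W_x, b_h, W_y, b_y):
    """One step of the vanilla RNN: new hidden state and output distribution."""
    h_t = [math.tanh(a) for a in add(matvec(W_h, h_prev), matvec(W_x, x_t), b_h)]
    y_t = softmax(add(matvec(W_y, h_t), b_y))
    return h_t, y_t


# Toy parameters (assumed values): d_x = 2, hidden size 2, 3 items.
embedding = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}
W_h = [[0.5, -0.3], [0.1, 0.4]]
W_x = [[0.2, 0.7], [-0.6, 0.3]]
b_h = [0.0, 0.1]
W_y = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b_y = [0.0, 0.0, 0.0]

sequence = [0, 2, 1]  # item indices x_1, x_2, x_3
h = [0.0, 0.0]        # null vector used before the first input
outputs = []
for item in sequence:
    # The same matrices are reused at every time step (parameter sharing).
    h, y = rnn_step(h, embedding[item], W_h, W_x, b_h, W_y, b_y)
    outputs.append(y)
```

After the loop, `h` holds the last hidden state and `outputs[-1]` is a probability distribution over the three items, from which the next item would be predicted.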

Vanilla RNNs are known to suffer from a phenomenon called the vanishing gradient, which essentially means that the RNN is not able to propagate the error signal back to inputs early in the sequence, failing to update the network through all the time steps. This is caused by the recurrent factor in the gradient, which is multiplied once per time step: the gradient shrinks quickly if this factor is smaller than 1 (and may explode if it is larger than 1). To tackle this problem, the Long Short-Term Memory (LSTM) cell was proposed to prevent the gradient from vanishing [15].
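The effect can be illustrated numerically with a one-dimensional RNN. The sketch below assumes a scalar recurrence $h_t = \tanh(w h_{t-1} + x_t)$ with an arbitrary weight and constant inputs, and accumulates the chain-rule factors of the gradient of the last hidden state with respect to the initial state.

```python
import math


def gradient_wrt_initial_state(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - tanh^2(.)) = w * (1 - h_t^2).
        grad *= w * (1.0 - h * h)
    return grad


# Assumed toy setting: recurrent weight 0.5 and a constant input of 0.1.
grad_short = gradient_wrt_initial_state(0.5, [0.1] * 5)
grad_long = gradient_wrt_initial_state(0.5, [0.1] * 50)
```

Because every factor is at most 0.5 in this setting, the gradient after 50 steps is many orders of magnitude smaller than after 5 steps, so early inputs barely influence the parameter updates.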

The LSTM (left in Figure 2.1) uses a separate memory cell, combined with a number of gates to regulate the following: 1) how much information is kept from the last time step, 2) which and how much information from the current time step is used and 3) which information to pass on to the next time step. The most important difference between the vanilla RNN and the LSTM is that the LSTM has control over the existing memory, and over how this memory is updated. Similar to the LSTM, the Gated Recurrent Unit (GRU) was proposed (right in Figure 2.1), using a cell with reset and update gates [12] to modify the hidden state. Both LSTM and GRU adjust the state by adding the new information, as opposed to replacing it, helping gradients to flow through and allowing the network to learn long-term dependencies. The main difference between the GRU and the LSTM is the fact that there is no separate memory in the GRU, and the GRU exposes the state fully to the next time step, whereas the LSTM controls this via the output gate. A side-by-side comparison of the LSTM and GRU is visible in Figure 2.1, showing the different gates and how the hidden states are updated.

Figure 2.1: Visualizations of LSTM (left) and GRU (right) cells. The LSTM has a larger number of gates and components interacting compared to the GRU. In both cases, the gates determine 1) which information is relevant from the previous hidden state and 2) which information from the new input should be captured in the hidden state. Because the hidden states are not directly replaced at each time step, but are updated via addition, the problem of a vanishing gradient is tackled.

Because the GRU requires a smaller number of parameters, it can be trained faster. Even though GRUs have less gated control, they still perform on par with LSTMs in many scenarios [12]. In Sequential Recommenders the GRU is also common [19, 25, 31]. Therefore, in this work GRU-cells are used.

The GRU employs gates that decide when and how to update the hidden state. The hidden state $\mathbf{h}_t$ is replaced at each time step by a linear interpolation of the previous state $\mathbf{h}_{t-1}$ and the candidate state $\hat{\mathbf{h}}_t$ from the current time step. The candidate state $\hat{\mathbf{h}}_t$ is formed from the previous hidden state and the new input. The reset gate $\mathbf{r}_t$ selects which information is relevant from the previous hidden state. The sigmoid function $\sigma$ is applied element-wise to each value of the hidden state vector. The resulting values are often close to 0 or 1, either forgetting or selecting parts of the hidden state. Formally, the GRU is defined as follows:

$$
\begin{aligned}
\mathbf{r}_t &= \sigma(W_r [\mathbf{x}_t; \mathbf{h}_{t-1}]), \\
\hat{\mathbf{h}}_t &= \tanh(W_h [\mathbf{x}_t; \mathbf{r}_t \odot \mathbf{h}_{t-1}]), \\
\sigma(x) &= \frac{1}{1 + \exp(-x)}, \qquad \tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}
\end{aligned}
\tag{2.2}
$$

Then, the update gate $\mathbf{z}_t$ selects which information from the candidate is used to form the new hidden state:

$$
\begin{aligned}
\mathbf{z}_t &= \sigma(W_z [\mathbf{x}_t; \mathbf{h}_{t-1}]), \\
\mathbf{h}_t &= (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \hat{\mathbf{h}}_t
\end{aligned}
\tag{2.3}
$$

The result is a hidden state for this time step, which is used as input for the next time step, and as input to make predictions from. Intuitively, the GRU (but also the LSTM) learns which inputs are relevant to inputs already seen (captured in the hidden state). The updates are only made to parts of the hidden state for which the new input is relevant, according to the gates.
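A single GRU step following Eqs. 2.2 and 2.3 can be sketched as below. The sketch omits bias terms, as the equations do, and the weight matrices are arbitrary toy values (hidden size 2, embedding size 2) chosen only to make the example runnable.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(h_prev, x_t, W_r, W_h, W_z):
    """One GRU update: reset gate, candidate state, update gate, interpolation."""
    concat = x_t + h_prev  # list concatenation implements [x_t; h_{t-1}]
    r_t = [sigmoid(a) for a in matvec(W_r, concat)]
    reset_state = [r * h for r, h in zip(r_t, h_prev)]  # r_t (element-wise) h_{t-1}
    h_cand = [math.tanh(a) for a in matvec(W_h, x_t + reset_state)]
    z_t = [sigmoid(a) for a in matvec(W_z, concat)]
    # Linear interpolation between the previous state and the candidate state.
    return [(1.0 - z) * h + z * c for z, h, c in zip(z_t, h_prev, h_cand)]


# Toy weights (assumed values); each matrix maps d_x + d_h = 4 inputs to 2 units.
W_r = [[0.4, -0.2, 0.1, 0.3], [-0.5, 0.6, 0.2, -0.1]]
W_h = [[0.3, 0.8, -0.4, 0.2], [0.7, -0.3, 0.5, 0.1]]
W_z = [[-0.2, 0.5, 0.3, -0.6], [0.1, 0.2, -0.3, 0.4]]

inputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h = [0.0, 0.0]
for x_t in inputs:
    h = gru_step(h, x_t, W_r, W_h, W_z)

# With all-zero weights every gate equals 0.5 and the candidate is 0,
# so the new state is exactly half of the previous state.
zeros = [[0.0] * 4, [0.0] * 4]
halved = gru_step([0.4, -0.2], [1.0, 2.0], zeros, zeros, zeros)
```

The zero-weight case makes the interpolation explicit: the update gate decides per dimension how much of the previous state is kept and how much of the candidate is mixed in.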


Figure 2.2: RNN visualized. The compact rendition with an arrow pointing to itself is shown on the left; on the right it is shown in more detail. At each step the cell outputs a hidden state which is also fed into the next step and gets combined with the new input. This process continues for the entire sequence, allowing for a variable length of input sequences.

2.3.2 Recurrent Neural Networks in recommendation

Hidasi et al. [19] with their GRU4Rec were among the first to apply RNNs to recommendation using, as the name suggests, an RNN with GRU cells to model the interaction sequence. This model is applied to session-based recommendations, where it outperforms Matrix Factorization and handles short sequences better. The inputs to the model are sequences of user-item interactions, which are sequentially read to form a hidden representation for the entire sequence. To make the predictions, they use the final hidden state as the global preference. The prediction is done by extending the architecture with an MLP and feeding the global preference to it.

Many other SRs are based on this model, such as the Neural Attentive Recommendation Machine (NARM) [25]. This model assumes a user might show different behavior from their main purpose within a session, such as clicks out of curiosity. In NARM, the last hidden state of the encoder is used as the global representation. NARM also employs a local encoder that uses an item-level attention mechanism (attention is explained in Section 2.3.3) to get a second representation of the sequence, emphasizing parts of the inputs. Then, a unified representation, used as the global preference, is made from the global and local representations. To do the prediction, a bilinear matching scheme is used to match the item embeddings with the global preference.

2.3.3 Attention

Attention was originally proposed for Neural Machine Translation [4], but quickly spread to other fields such as Information Retrieval [25, 31, 40]. The idea of attention is to compare the hidden states of previously encoded elements (the keys) with the encoding of the new input (the query). Well-matching representations are assumed to be of importance. From the calculated matching scores, attention weights $\alpha_{Tj}$ are calculated and a mixture is formed as global preference $\mathbf{g}$ by using the hidden state $\mathbf{h}_j$ of each time step:

$$
\mathbf{g} = \sum_{j=1}^{T} \alpha_{Tj} \mathbf{h}_j \tag{2.4}
$$

The global preference is then used as input for the prediction network. In this way, the prediction does not fully rely on the last hidden state but uses the information spread throughout the sequence. In the original paper, the input word was matched with the already translated words. Cheng et al. [11] proposed the idea to apply attention to a single sequence, called self-attention. This is how attention is applied in Recommender Systems. The keys are the hidden representations of the previously encoded items and the query is the last hidden representation.

There are various approaches to do the matching before calculating the attention weights; we go over two common methods. First, Additive Attention [4] projects the keys and query to a D-dimensional space. Then, the projections are added (hence additive) and the result is activated by applying the non-linear tanh function. To find the weights, the activated vector is projected to a hidden space where the dimensions match the number of keys, after which the softmax function is applied to make the weights sum to 1.

Formally, this is defined as:

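$$
\alpha_{Tj} = \mathrm{softmax}\left(W_p \tanh(W_q \mathbf{h}_T + W_k \mathbf{h}_j)\right), \qquad
\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}, \tag{2.5}
$$

where the query is the last hidden state $\mathbf{h}_T$ and the keys are the hidden states $\mathbf{h}_j$. The matrices $W_q$ and $W_k$ are used to transform the query and keys into the latent space where matching is done. $W_p$ projects the activated vector before the softmax is used to calculate the weights.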
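The following sketch illustrates additive self-attention over a sequence of hidden states, with the last hidden state as the query. It assumes the common formulation in which the projection (here a vector `w_p`) yields one scalar score per key, after which the softmax is taken over all keys. The hidden states and weights are toy values, not learned parameters.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(v):
    """Numerically stable softmax."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(hidden_states, W_q, W_k, w_p):
    """Return the attention weights and the resulting global preference."""
    query = hidden_states[-1]            # the last hidden state acts as the query
    q_proj = matvec(W_q, query)
    scores = []
    for h_j in hidden_states:            # every hidden state acts as a key
        k_proj = matvec(W_k, h_j)
        activated = [math.tanh(q + k) for q, k in zip(q_proj, k_proj)]
        scores.append(sum(w * a for w, a in zip(w_p, activated)))
    weights = softmax(scores)            # weights sum to 1 over the time steps
    g = [
        sum(alpha * h_j[d] for alpha, h_j in zip(weights, hidden_states))
        for d in range(len(query))
    ]
    return weights, g


# Toy hidden states and parameters (assumed values).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]

weights, g = additive_attention(hidden_states, W_q, W_k, w_p=[0.5, -0.5])

# With a zero projection all scores are equal, so g is the plain average.
uniform_weights, g_mean = additive_attention(hidden_states, W_q, W_k, w_p=[0.0, 0.0])
```

The second call shows the degenerate case: when the scores carry no information, attention reduces to averaging the hidden states.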

Second, Scaled Dot-product Attention was proposed together with the Transformer model [39]. The Transformer transforms the input sequence all at once, getting rid of the recurrent units. To model the order of the sequence, a positional embedding is added to the input embeddings. The positional embedding is calculated by applying a periodic function to the time step. The Transformer relies heavily on the self-attention mechanism: each input acts as a query that attends to all inputs, which serve as the keys and values. The alignment is done by computing the dot-product between query and keys, and takes the number of dimensions $d$ into account by scaling the attention scores with $\frac{1}{\sqrt{d}}$. In doing so, the gradients flow better, and learning can be faster.
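A minimal sketch of scaled dot-product self-attention is given below. It is deliberately simplified: the learned query, key, and value projections and the positional embeddings of the Transformer are omitted, so the raw input vectors are used in all three roles. The input matrix is a toy example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(v):
    """Numerically stable softmax."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, mix the values by softmax(q . k / sqrt(d))."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


# Self-attention: the same toy inputs serve as queries, keys and values.
X = [[1.0, 0.0], [0.0, 1.0]]
attended = scaled_dot_product_attention(X, X, X)
```

Each output row is a convex combination of the input rows, with more weight on the inputs whose dot-product with the query is larger.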

Although the Transformer is very interesting and has shown promising results [20, 36], our preliminary results using the proposed model (Chapter 3) favored the GRU; thus we employ the RNN with GRU-cells and the Additive Attention mechanism.

2.4 Context- & Attribute-aware Recommendation

Exploiting available side information in SRs has recently been shown to be useful for capturing user preference [5, 27, 30, 40]. One of the approaches seeks to use additional information about the interaction, such as weather or time, to create context-aware models. For instance, it makes sense to recommend barbecue supplies during summer after a user has added sausage to their cart, and less so during winter. Liu et al. [27] proposed a recurrent approach to explicitly model the context information called Context-Aware RNN (CA-RNN). This model learns a matrix per possible context, so the number of parameters grows quickly with the number of contexts. Therefore, Rakkappan and Rajan [30] propose to stack RNNs on top of each other in a model called Stacked Temporal-context-aware RNN (STAR). In STAR, the first RNN models the sequence of contexts and outputs the matrix used for multiplying with the hidden state in the second RNN. This allows them to have the same number of parameters in their model regardless of the number of context values.

Unfortunately, for many datasets, there is no such information available about the interaction. Another approach is to use additional information about the items instead, as the datasets often do provide some additional information about their items. Bai et al. [5] use available item attributes to make their model attribute-aware. They model both the interaction sequence and the attribute sequence to obtain a unified representation. They do this by first obtaining representations for the item and attribute sequences to then get the unified representation via element-wise multiplication. Wang et al. [40] propose the Mixture-Channel Purpose Routing Network (MCPRN) consisting of a purpose router and a Mixture-Channel RNN. The purpose router predicts a probability distribution over the m latent purposes. The Mixture-Channel RNN is a recurrent network with a custom GRU-based cell. The authors have added a purpose concentration gate, which serves as a threshold, ignoring a channel if the probability of this purpose is too low. Each of m channels has an MCRNN and produces a hidden state. During the reading of the input, the hidden state of a channel is modified according to how much the items belong to the channel. The hidden states produced by the channels will, therefore, be different from each other. Before making the prediction, a weighted combination is made from the hidden states according to the purpose distribution at the last time step, which is then used to match items with.

At the early stages of this research, we experimented with a context-aware and an attribute-aware approach to decide which approach to pursue in this work. The contexts tried were country information and time information (weekday and hour), as this was the only context we had available for the datasets we considered. Preliminary results showed that providing ground truth information about the context helped, with an increase in performance of about 100%, but this was less than we expected. In contrast, providing ground truth information about item attributes greatly improved performance, with an increase of roughly 300%, which is why in this work we use attribute information instead.

2.5 Diversity in Web Search and Recommendation

The importance of diverse results has been recognized in web search [1, 8, 41], where it has been shown that users prefer more diverse search results as opposed to highly accurate, but redundant ones. In this section, we cover two approaches fundamental to the work in this thesis.


2.5.1 Maximal Marginal Relevance

In web search, much of the retrieved information overlaps, as search engines often return only relevant documents, which contain redundant information. This does not satisfy the user, who instead seeks diverse results [8]. Therefore, Carbonell and Goldstein [8] propose a method where each document in the ranked list is selected according to a combination of query relevance and diversity (the novelty of information). This means that a measure of dissimilarity between previously selected documents and candidate documents is used. They achieve this with a linear combination they call marginal relevance. A document has high marginal relevance if it is relevant to the query and has minimal similarity to previously selected documents. Because they maximise this marginal relevance, their method is called Maximal Marginal Relevance (MMR), defined as:

$$\mathrm{MMR} = \arg\max_{D_i \in R \setminus S} \left[ \lambda \, \mathrm{Sim}_1(D_i, Q) - (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \right] \tag{2.6}$$

Here, Q is the query, R is the set of retrieved documents, and S is the set of documents already selected. With this definition, when using MMR to incrementally add items to the list, the standard relevance-ranked list is returned when λ = 1, and a maximally diverse list is returned when λ = 0. In other words, to sample around the query (more diverse), λ should be set low. To focus on similar documents (less diverse), λ should be set closer to 1.
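The greedy selection that MMR performs can be illustrated with a minimal Python sketch. It assumes the formulation of Carbonell and Goldstein in which redundancy, the maximum similarity to an already selected document, is subtracted from the relevance term. The function name `mmr_select`, the toy documents, and the topic-based similarity are hypothetical and only serve the illustration.

```python
def mmr_select(candidates, relevance, similarity, lam, k):
    """Greedy MMR: pick k items, trading off relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def marginal_relevance(d):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * redundancy
        best = max(remaining, key=marginal_relevance)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): two near-duplicates on topic "a", one on topic "b".
topic = {"d1": "a", "d2": "a", "d3": "b"}
relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
same_topic = lambda x, y: 1.0 if topic[x] == topic[y] else 0.0

ranked_by_relevance = mmr_select(topic, relevance, same_topic, lam=1.0, k=3)
diversified = mmr_select(topic, relevance, same_topic, lam=0.5, k=3)
print(ranked_by_relevance)  # ['d1', 'd2', 'd3']
print(diversified)          # ['d1', 'd3', 'd2']
```

With λ = 1 the order follows relevance alone, while with λ = 0.5 the document from the unseen topic moves ahead of the near-duplicate.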

2.5.2 Intent-aware Select

Another approach taken to diversify search results makes use of knowledge about the documents and query explicitly. Agrawal et al. [1] assume there exists a topic-level taxonomy of documents, and want to model user intentions at the topic-level. To do this, the authors use probability distributions of the query with respect to the topics P (c | q) and the quality of the document with respect to the query and topic V (d | q, c). This can be interpreted as the probability of document d satisfying the query q with the intended topic c. The objective to optimize is the probability that the returned list S contains at least one relevant item:

$$P(S \mid q) = \sum_{c} P(c \mid q) \left( 1 - \prod_{d \in S} \left( 1 - V(d \mid q, c) \right) \right) \tag{2.7}$$

Here, (1 − V (d | q, c)) can be interpreted as the probability that d fails to satisfy c and q. The product is, therefore, the probability that all documents in S fail to satisfy the user. Taking the complement yields the probability of having at least one satisfying document. Finally, the weighted combination yields the probability that the returned list S satisfies the user for the given query. However, a strong assumption in the work by Agrawal et al. [1] is that these probability distributions and values are available. In reality this is not the case, and they must be estimated.

To optimize Equation 2.7 the authors proposed the following algorithm:

Algorithm 1: The IA-Select algorithm

input : k, q, C(q), R(q), C(d), P (c | q), V (d | q, c)
output: set of documents S

S = ∅                                          // start with an empty list
∀c, U (c | q, S) = P (c | q)                   // initial estimation of importance
while |S| < k do
    for d ∈ R(q) do                            // calculate marginal utility
        g(d | q, c, S) ← Σ_{c ∈ C(d)} U (c | q, S) V (d | q, c)
    end
    d∗ ← arg max_d g(d | q, c, S)
    S ← S ∪ {d∗}
    ∀c ∈ C(d∗), U (c | q, S) = (1 − V (d∗ | q, c)) U (c | q, S \ {d∗})
    R(q) ← R(q) \ {d∗}
end
return S

In total, k items will be recommended, based on query q. The marginal utility g is the probability that a document d satisfies the user given that all documents already present in S fail to do so. U (c | q, S) is the probability that the query q belongs to category c assuming all documents present in S fail to satisfy the user, i.e. the importance of c given the query. This importance is updated each time an item is added to the list. In doing so, the algorithm promotes categories that are not yet satisfied, i.e. documents from categories not yet present in S. This still allows multiple items in the list of results S to be from the same category: in IA-Select, a document can still be preferred if the initial preference towards its category c was very high. If a high-scoring document from an already satisfied category still has a higher marginal utility g than documents from unsatisfied categories, it enters the list of results S.
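The greedy loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of Agrawal et al.: the function name `ia_select`, the dictionary-based inputs, and the toy probabilities are hypothetical, and the running category importance is stored in a plain dictionary that is initialised with P (c | q).

```python
def ia_select(k, docs, categories, p_c, value):
    """Greedy IA-Select sketch.

    docs:       candidate documents R(q)
    categories: dict doc -> set of categories C(d)
    p_c:        dict category -> P(c | q)
    value:      dict (doc, category) -> V(d | q, c)
    """
    selected = []
    remaining = list(docs)
    importance = dict(p_c)  # running importance per category, starts at P(c | q)
    while remaining and len(selected) < k:
        def marginal_utility(d):
            return sum(importance[c] * value[(d, c)] for c in categories[d])
        best = max(remaining, key=marginal_utility)
        selected.append(best)
        remaining.remove(best)
        # Discount the categories that the chosen document already covers.
        for c in categories[best]:
            importance[c] *= 1.0 - value[(best, c)]
    return selected


# Toy query (hypothetical): mostly about sports, sometimes about music.
categories = {"s1": {"sports"}, "s2": {"sports"}, "m1": {"music"}}
p_c = {"sports": 0.7, "music": 0.3}
value = {("s1", "sports"): 0.9, ("s2", "sports"): 0.8, ("m1", "music"): 0.6}

result = ia_select(3, ["s1", "s2", "m1"], categories, p_c, value)
print(result)  # ['s1', 'm1', 's2']
```

After s1 is selected, the importance of the sports category drops from 0.7 to 0.07, so the music document overtakes the second sports document even though its value estimate is lower.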

2.6 Multi-task learning

Multi-task learning improves generalization by leveraging signals from multiple related tasks [9]. This is done by training for multiple tasks at once while sharing part of the architecture or representations. Caruana [9] showed that optimizing for multiple tasks at once works better than optimizing multiple models separately. The idea is that training for multiple tasks can help all tasks by reusing knowledge. Rather than optimizing each task with its own individual loss function, the network is trained end-to-end by combining the losses of the various tasks and updating the entire network from this combined signal:

$$\mathcal{L} = \sum_{t=1}^{T} \lambda_t \sum_{i=1}^{N_t} \mathcal{L}_t^i, \tag{2.8}$$

where $N_t$ is the number of samples for task $t$, $T$ is the number of tasks, and $\mathcal{L}_t^i$ is the loss of task $t$ for input $i$. Here, $\lambda_t$ weights the losses in the linear combination that is used for backpropagation. The auxiliary tasks can serve as a bias, causing the architecture to prefer the parameters that work well for more than one task, resulting in better generalization.
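As a small worked illustration of Equation 2.8, the following Python sketch combines the per-sample losses of two tasks with task weights. The function name `multi_task_loss` and the numbers are hypothetical; in a real training loop the per-sample losses would come from the network rather than from fixed lists.

```python
def multi_task_loss(per_sample_losses, weights):
    """Weighted sum over tasks of the summed per-sample losses (Eq. 2.8)."""
    return sum(w * sum(losses) for w, losses in zip(weights, per_sample_losses))


main_task = [0.5, 1.5]             # N_1 = 2 samples
auxiliary_task = [2.0, 1.0, 1.0]   # N_2 = 3 samples
total = multi_task_loss([main_task, auxiliary_task], weights=[0.75, 0.25])
print(total)  # 0.75 * 2.0 + 0.25 * 4.0 = 2.5
```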

The idea of Multi-task learning has been applied successfully in various fields. Gibert et al. [16] used multi-task learning to automatically inspect railroad tracks more frequently to ensure safe transportation. There, the learner has to detect good, broken, or missing rail fasteners and also do segmentation of e.g. crumbling concrete ties (bars connecting the rails). In the field of RS and NLP, Bansal et al. [6] use a deep text model combined with multi-task learning to recommend scientific papers. Their auxiliary task is to predict metadata about papers, tags in this case, based on the shared representation.

These, amongst others, have shown that for their tasks learning multiple tasks at once is advantageous and improves performance. In the model we propose in this work, the network performs two tasks which each induce a separate loss. We combine these losses and optimize the model in a multi-task learning fashion.


Chapter 3

Method

In this chapter, the proposed model called the Attribute-aware Diversifying Sequential Recommender (ADSR) is explained by following the individual modules that form the architecture. First, an overview of the task is described together with the workflow of the ADSR. Next, in Section 3.2 the Attribute-aware Encoder (AE) is explained. Then, the modeling of the auxiliary task done by the Attribute Predictor (AP) is described in Section 3.3. Section 3.4 then describes how the outputs of these modules are combined in the Attribute-aware Diversifying Decoder (ADD). Finally, Section 3.5 describes the losses of the modules and how these together form the loss used to train the network.

After the model is explained formally, Section 3.6 walks through a concrete example to show intuitively how the ADSR works.

3.1 Overview

The goal of the model is to use a sequence of interactions with items to make predictions about the next item the user will interact with (see Section 2.3.2). This work aims at providing a diversified list of recommendations by leveraging additional information about the items. While doing so, the task is not only to predict the next item but also to predict which attribute that next item has. To be more precise, given a user $u$ and their behavior sequence $S_v = \{v_1, \dots, v_t, \dots, v_T\}$, with corresponding item attribute sequence $S_c = \{c_1, \dots, c_t, \dots, c_T\}$, where $v_t \in V$ is the item $u$ interacts with at time step $t$, and $c_t \in C$ is the attribute of $v_t$, the goal is to create a diversified list of recommendations $R_L$. Formally, we can find $R_L$ by maximizing $P(R_L \mid S_v, S_c, u)$, defined as:

$$P(R_L \mid S_v, S_c, u) = \sum_{j=1}^{|C|} P(R_L \mid S_v, S_c, c_j, u) \cdot P(c_j \mid S_c, S_v, u), \tag{3.1}$$

where $P(c_j \mid S_c, S_v, u)$ is the predicted importance of attribute $c_j$ based on the sequences and $P(R_L \mid S_v, S_c, c_j, u)$ is the probability of recommending $R_L$ conditioned on $c_j$ being the preferred item attribute of user $u$.

Optimizing $P(R_L \mid S_v, S_c, u)$ directly is difficult due to the large search space [1]. Therefore, we propose to generate $R_L$ iteratively, by appending the item with the highest score $S_{item}(v_i)$ to $R_L$ at each time step, using a similar approach as [1, 8]. The score used to select items is a combination of the relevance score and the diversity score. The relevance score is calculated based on the history of interactions and the diversity score is calculated with respect to the list of recommendations so far:

$$S_{item}(v_i) = \lambda_s \cdot S_{rel}(v_i) + (1 - \lambda_s) \cdot S_{div}(v_i), \tag{3.2}$$

where $S_{rel}(v_i)$ is the relevance score for item $v_i$ (see §3.2), and $S_{div}(v_i)$ is the attribute-aware diversity score (see §3.4). The hyperparameter $\lambda_s$ is used to balance $S_{rel}$ and $S_{div}$, giving control over the accuracy and diversity of the trained model.

To optimize this goal, we propose the Attribute-aware Diversifying Sequential Recommender (as shown in Figure 3.1) to model $S_{rel}(v_i)$ and $S_{div}(v_i)$. The Attribute-aware Encoder (AE) (Section 3.2) models the input sequences $S_c$, $S_v$ to get attentive hidden representations $f_c$ and $f_v$ and the relevance score for each item $S_{rel}$. Next, the Attribute Predictor (AP) (Section 3.3) predicts the next item attribute distribution $P(c \mid S_v, S_c, u)$. Finally, the Attribute-aware Diversifying Decoder (ADD) (Section 3.4) trades off $S_{rel}$ and $S_{div}$ to generate the diversified list of recommendations $R_L$.

Figure 3.1: Overview of the Attribute-aware Diversifying Sequential Recommender (ADSR). The inputs to the Attribute-aware Encoder are an item interaction sequence and the corresponding item attribute sequence. From these sequences the Attribute-aware Encoder produces attentive hidden representations $f_c$ and $f_v$. Then, the Attribute Predictor uses the attentive representations and outputs a probability distribution over the attributes. From this, the Attribute-aware Diversifying Decoder calculates $S_{rel}$ and $S_{div}$ and trades off relevance and diversity when generating $R_L$. The output is a diversified list of recommendations based on learned attribute preferences. The losses of the AE and AP are combined in $\mathcal{L}_{ADSR}$ and used to update the weights of the model.

Figure 3.2: This shows how the hidden states are created in a Bidirectional-RNN¹. The sequences of hidden states are aligned, i.e. $h_0 = [s_0; s'_0], \dots, h_i = [s_i; s'_i]$, so that each hidden state combines the same step in both directions (as opposed to $h_0 = [s_0; s'_i]$). The combined last hidden representation is the concatenation of the last hidden representation of the forward pass and the last hidden representation of the backward pass (from right to left). In this way, the combined last representation captures all information in both directions.

3.2 Attribute-aware Encoder

First, the sequence of attributes is encoded using an RNN with Gated Recurrent Unit (GRU) cells, as is common in the literature [19, 25, 31, 7]. In this work a Bidirectional-RNN [33] (with GRU cells) is employed because early results showed an increase in performance.

The Bidirectional-RNN utilizes weights that model the forward time direction and separate weights that model the backward direction. The output hidden states from the forward direction and the backward direction are not connected, so they can be seen as two independent RNNs. At each time step, the hidden states from both directions are concatenated to form a single hidden representation. The order of the backward pass is flipped to match the forward pass, see Figure 3.2. In this way, the combined representations capture the context from both directions.

The sequence of items and item-attributes are first represented as one-hot-encoded vectors (for attributes with more than 1 active label this is a binary encoded vector). The Attribute-aware Encoder first converts these sequences of one-hot encodings to a sequence of learned item embeddings concatenated with the corresponding learned item attribute embeddings: $\{p_1, p_2, \dots, p_T\}$ where $p_t = [c_t; v_t] \in \mathbb{R}^{d_c + d_v}$. Next, the sequence is read by the GRU, which outputs a hidden representation for each time step, yielding $\{h^c_1, h^c_2, \dots, h^c_T\}$ where $h^c_t \in \mathbb{R}^{2 d_{GRU}}$; the dimension is twice as large due to the bi-directionality. Then, a second Bidirectional-RNN (again, with GRU cells) is used to get the representation for the sequence of the item interactions combined with the output of the first encoder. To be precise, the input to the second encoder is the concatenation of the item embeddings and the hidden states of the previous encoder: $q_t = [v_t; h^c_t]$, which then yields hidden representations $\{h^v_1, h^v_2, \dots, h^v_T\}$ with $h^v_t \in \mathbb{R}^{2 d_{GRU}}$. Early results showed this sequential and interactive approach increased modeling capacity.

Next, we employ additive attention to get the global preference (attentive hidden representations) for the item and attribute encoder, $f_v$ and $f_c$, as follows:

$$f_v = \sum_{j=1}^{T} \alpha_{Tj} h^v_j, \qquad \alpha_{Tj} = \mathrm{softmax}\left( W_p \tanh\left( W_q h^v_T + W_k h^v_j \right) \right), \tag{3.3}$$

where matrices $W_q, W_k \in \mathbb{R}^{2 d_{GRU} \times 2 d_{GRU}}$ are used to transform the hidden states into the latent space where matching is done, and $W_p \in \mathbb{R}^{2 d_{GRU} \times 1}$ is used as projection to get the final attention weights $\alpha_{Tj}$. Then, $f_c$ is calculated from the hidden representations $\{h^c_1, h^c_2, \dots, h^c_T\}$ in the same way as in Eq. 3.3. However, it uses the attention weights $\alpha_{Tj}$ obtained for $f_v$. This further enhances the dependency of the attribute and item embeddings and reduces the number of parameters (intuitively it also makes sense to weigh the attributes of the items as much as the items themselves are important). When calculating the attention weights, the final hidden representations of both encoders $h^v_T$ and $h^c_T$ are used as the query for the attention module, as these capture the complete sequences [25]. The other hidden representations are used as keys and as values, meaning that $f_v$ and $f_c$ are weighted averages of the hidden representations.
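The attention pooling of Eq. 3.3 can be sketched in plain Python. This is a minimal illustration under simplifying assumptions: the helper names `matvec`, `attention_weights`, and `weighted_sum` are hypothetical, the weight matrices are set to the identity instead of being learned, the query is taken from the item encoder only, and the hidden states are tiny hand-written vectors rather than GRU outputs.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(hidden, w_q, w_k, w_p):
    """alpha_Tj = softmax over j of w_p . tanh(W_q h_T + W_k h_j)."""
    query = matvec(w_q, hidden[-1])  # the last hidden state acts as the query
    scores = []
    for h_j in hidden:
        key = matvec(w_k, h_j)
        activated = [math.tanh(q + k) for q, k in zip(query, key)]
        scores.append(sum(p * a for p, a in zip(w_p, activated)))
    return softmax(scores)


def weighted_sum(weights, hidden):
    """Pool the hidden states into one vector using the attention weights."""
    dim = len(hidden[0])
    return [sum(a * h[d] for a, h in zip(weights, hidden)) for d in range(dim)]


identity = [[1.0, 0.0], [0.0, 1.0]]               # stand-in for learned weights
hidden_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # item encoder states (toy)
hidden_c = [[0.5, 0.5], [0.0, 1.0], [1.0, 0.0]]   # attribute encoder states (toy)

alpha = attention_weights(hidden_v, identity, identity, w_p=[1.0, 1.0])
f_v = weighted_sum(alpha, hidden_v)
f_c = weighted_sum(alpha, hidden_c)  # reuses the weights computed for f_v
```

Reusing `alpha` for both pooled vectors mirrors the weight sharing described above: the attribute states are weighted exactly as much as the corresponding item states.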

Finally, the Attribute-aware Encoder calculates the relevance score² as:

where $W_g$ is used to project the concatenated vector $g$ to $\mathbb{R}^{d_v}$, the dimensions of the items. The predicted next item attribute $\hat{c}_{t+1}$ is the weighted average of attribute embeddings, weighted according to the distribution predicted by the Attribute Predictor $P(c_j \mid S_v, S_c)$, see Section 3.3. The relevance score is calculated for each item $v \in V$ and is subsequently used to make the first prediction and to help predict the remainder of the recommended list $R_L$. For clarity, the loss induced by the AE ($\mathcal{L}_{AE}$) is described in Section 3.5.

3.3 Attribute Predictor

The Attribute Predictor (AP) is of great importance to our model, as the output $P(c \mid S_v, S_c)$ is used by the Attribute-aware Encoder to determine item relevance, and by the Attribute-aware Diversifying Decoder (ADD) to diversify the recommendations (see Section 3.4). The AP predicts the distribution over the attributes $P(c \mid S_v, S_c)$ with a small neural network. The input is the concatenation of the attentive final hidden representations of the item and attribute sequences, $q = [f_v; f_c]$. By using both representations, the Attribute Predictor can learn to exploit features from both encoders when making the prediction and optimizing the loss. The prediction network consists of a fully connected layer, with Batch Normalization [3] and the ReLU activation, followed by the output layer which is then activated with a sigmoid function (see Figure 3.1). Batch Normalization ensures that the mean and variance of the input data are stable and stabilizes learning. The sigmoid activation function is defined as:

$$\sigma(x) = \frac{1}{1 + \exp(-x)}, \tag{3.5}$$

where each output class is activated separately (element-wise), squashing each value into the range [0, 1], which allows for a probabilistic interpretation.

² In Figure 3.1, calculating $S_{rel}$ is part of the ADD as “Relevance”. However, in practice this is computed once

This prediction task is an auxiliary task, with an additional loss $\mathcal{L}_{AP}$. This loss is combined with the loss of the Attribute-aware Encoder. Consequently, both the AE and AP modules are updated from a single unified loss. Details are presented in Section 3.5.

3.4 Attribute-aware Diversifying Decoder

The Attribute-aware Diversifying Decoder (ADD) is responsible for generating the diversified list of recommendations $R_L$. Inspired by IA-Select, a method used in web search [1], we incrementally build $R_L$ while accounting for diversity. At each step the item with the highest score according to Equation 3.2, i.e. $v_i^* = \arg\max_{v_i} S_{item}(v_i)$, is appended to $R_L$, where $\lambda_s$ controls the contribution of $S_{div}$. Specifically, at each step the ADD calculates $S_{div}$ using $P(c \mid S_v, S_c)$, obtained from the AP, as initial estimation of the importance per attribute class (category/genre) $U(c_j \mid S_v, S_c, R_L)$:

$$S_{div}(v_i) = \sum_{c_j \in C(v_i)} U(c_j \mid S_v, S_c, R_L) \, A(v_i \mid c_j). \tag{3.6}$$

Here, $A(v_i \mid c_j)$ represents the value of $v_i$ in context of attribute $c_j$. The categories of item $v_i$ are denoted by $C(v_i)$, i.e. the binary encoded attribute vector is used directly.

After each step, we update $U(c_j \mid S_v, S_c, R_L)$ to reflect the newly added item to $R_L$:

$$U(c_j \mid S_v, S_c, R_{L+1}) = \mathrm{normalize}\left[ \left( 1 - A(v_i^* \mid c_j) \right) U(c_j \mid S_v, S_c, R_L) \right], \tag{3.7}$$

where $R_{L+1}$ represents the recommended list including $v_i^*$. This update causes items from other categories than those of $v_i^*$ to get a boost from $S_{div}$, while the items that share their category will not. Note that even though not all items will be boosted by $S_{div}$, the next highest scoring item can still be from the same category as items in $R_L$ if its relevance is still higher than that of items of the boosted categories. The value estimates are normalized by dividing the value for each category by the sum over all categories. Before normalization is done, we add $\epsilon = 10^{-5}$ to prevent division by zero, which can occur when the value for all categories has already been updated. This means that when all item categories have been selected and updated, the method resorts to selecting items by using $S_{rel}$. To ensure unique items in $R_L$, at each step the selected item $v_i^*$ is excluded from the set of items $V$. In our implementation, to select the first item only the relevance score (Eq. 3.4) calculated by the Attribute-aware Encoder is used, after which the diversified selection begins.

This results in the following algorithm:

Algorithm 2: The selection algorithm used in the ADD to incrementally build the diversified list of recommendations RL

input : λs, k, Sv, Sc, V, C(vi), P (ci | Sv, Sc), A(vi | Sv, Sc)
output: List of recommendations RL

RL = {v1}                                      // start with a single relevance-based prediction
∀c, U (c | Sv, Sc, RL) = P (c | Sv, Sc)        // initial estimation of importance
while |RL| < k do
    for v ∈ V do                               // calculate marginal utility
        Sdiv(v) ← Σ_{c ∈ C(v)} U (c | Sv, Sc, RL) A(v | Sv, Sc, c)
    v∗ ← arg max_v [λs · Srel(v) + (1 − λs) · Sdiv(v)]
    RL ← RL ∪ {v∗}
    for c ∈ C(v∗) do
        U (c | Sv, Sc, RL) = softmax [(1 − A(v∗ | Sv, Sc, c)) U (c | Sv, Sc, RL \ {v∗})]
    V ← V \ {v∗}
return RL
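The selection loop of the ADD can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: the function name `add_select` and the toy catalogue are hypothetical, A(v | c) is taken to be 1 for every category of v (the binary attribute vector), the importance is renormalized by dividing by its sum plus ε as described in the text rather than with a softmax, and, as in Algorithm 2, the importance is not updated for the first, relevance-only item.

```python
def add_select(k, lam, s_rel, item_categories, p_attr, eps=1e-5):
    """Greedy list construction in the spirit of Algorithm 2.

    s_rel:           dict item -> relevance score
    item_categories: dict item -> set of categories C(v)
    p_attr:          dict category -> predicted attribute probability
    """
    remaining = sorted(s_rel)  # fixed order keeps tie-breaking deterministic
    importance = dict(p_attr)  # U(c | ...), initialised with the AP output
    first = max(remaining, key=s_rel.get)  # first item: relevance only
    recommended = [first]
    remaining.remove(first)
    while remaining and len(recommended) < k:
        def score(v):
            # Assumption: A(v | c) = 1 for each category c of v.
            s_div = sum(importance[c] for c in item_categories[v])
            return lam * s_rel[v] + (1 - lam) * s_div
        best = max(remaining, key=score)
        recommended.append(best)
        remaining.remove(best)
        for c in item_categories[best]:
            importance[c] = 0.0  # (1 - A(best | c)) * U(c) with A = 1
        total = sum(importance.values()) + eps  # eps avoids division by zero
        importance = {c: u / total for c, u in importance.items()}
    return recommended


# Toy catalogue (hypothetical): three action movies, one comedy, one drama.
s_rel = {"a1": 0.9, "a2": 0.8, "a3": 0.7, "c1": 0.5, "d1": 0.4}
item_categories = {"a1": {"action"}, "a2": {"action"}, "a3": {"action"},
                   "c1": {"comedy"}, "d1": {"drama"}}
p_attr = {"action": 0.5, "comedy": 0.3, "drama": 0.2}

relevance_only = add_select(4, 1.0, s_rel, item_categories, p_attr)
diversified = add_select(4, 0.5, s_rel, item_categories, p_attr)
print(relevance_only)  # ['a1', 'a2', 'a3', 'c1']
print(diversified)     # ['a1', 'a2', 'c1', 'd1']
```

In the toy run, a second action movie still enters the list because the action category keeps its initial importance until an item selected inside the loop discounts it, after which the comedy and drama items are boosted past the third action movie.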

3.5 Losses

The loss calculated for the Attribute-aware Encoder is the Categorical Cross-Entropy Loss:

$$\mathcal{L}_{AE} = -\frac{1}{|V|} \sum_{i=1}^{|V|} y_i \log P(v_i \mid S_v, S_c), \qquad y_i = \begin{cases} 1 & \text{if } v_i = v_{T+1} \\ 0 & \text{else} \end{cases}, \tag{3.8}$$

where $v_{T+1}$ is the target. The Cross-Entropy Loss increases when the predicted probability diverges from the true probability. Optimizing this loss forces the probabilities to become large where the target is 1.

When using the Cross-Entropy Loss, the assumption is often made that there is only one target class. Therefore, to calculate the loss induced by the Attribute Predictor, $\mathcal{L}_{AP}$, the Binary Cross-Entropy Loss is used instead. The Binary Cross-Entropy Loss treats each class as a binary classification problem, allowing the network to optimize for multi-labeled targets. Optimizing this loss forces the probabilities to become large where the target is 1 and to become small where the target is 0.

$$\mathcal{L}_{AP} = -\frac{1}{|C|} \sum_{i=1}^{|C|} \Big[ y_i \log P(c_i \mid S_v, S_c) + (1 - y_i) \log \left[ 1 - P(c_i \mid S_v, S_c) \right] \Big], \qquad y_i = \begin{cases} 1 & \text{if } c_i = c_{T+1} \\ 0 & \text{else} \end{cases}, \tag{3.9}$$

where $c_{T+1}$ is the target.

Then, to form the final loss $\mathcal{L}_{ADSR}$, which is used to optimize the ADSR, the two losses are interpolated using the balancing hyperparameter $\lambda_{MT}$. That is,

$$\mathcal{L}_{ADSR} = \lambda_{MT} \cdot \mathcal{L}_{AE} + (1 - \lambda_{MT}) \cdot \mathcal{L}_{AP}. \tag{3.10}$$

This loss is calculated for each input separately, and is averaged over the batch of samples.

Finally, $\lambda_{MT}$ controls the contribution of each module to the loss. Thus, $\lambda_{MT}$ affects the step size when updating the modules and should compensate for the difference in magnitude of the losses [21].
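To make the three loss equations concrete, the following Python sketch evaluates them for a single input. It is an illustration only: the function names and the toy probabilities are hypothetical, the predicted probabilities are assumed to lie strictly between 0 and 1, and a real implementation would use the loss functions of a deep learning framework on batched tensors.

```python
import math


def cross_entropy(probs, target_index):
    """Eq. 3.8 with a one-hot target: -(1 / |V|) * log P(target)."""
    return -math.log(probs[target_index]) / len(probs)


def binary_cross_entropy(probs, targets):
    """Eq. 3.9: mean over classes of the per-class binary loss."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(probs)


def adsr_loss(item_probs, target_item, attr_probs, attr_targets, lam_mt):
    """Eq. 3.10: interpolate the encoder loss and the predictor loss."""
    l_ae = cross_entropy(item_probs, target_item)
    l_ap = binary_cross_entropy(attr_probs, attr_targets)
    return lam_mt * l_ae + (1 - lam_mt) * l_ap


# Toy prediction (hypothetical): 3 items, 2 attributes, target item 0.
loss = adsr_loss(item_probs=[0.5, 0.25, 0.25], target_item=0,
                 attr_probs=[0.5, 0.5], attr_targets=[1, 0], lam_mt=0.5)
print(round(loss, 4))  # 0.4621
```

Because the binary cross-entropy sums a term for every attribute class while the categorical loss only has one non-zero term, the two losses can differ in magnitude, which is what the balancing hyperparameter compensates for.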

3.6 Example

To make the concept more concrete, consider the following hypothetical examples³. First, to show why diversity is needed, consider this exaggerated example: a user is planning drinks and adds cheese to their basket. A non-diversifying recommender might produce the recommendations visible in Figure 3.3. Although the recommendation is correct in the sense that this user wants chorizo, the user now only gets different chorizo brands and products. This might, in some cases, be desired, but as mentioned in Chapter 2, the literature has shown that users prefer diverse results.

Figure 3.3: Made up example of a non-diversifying recommender output

Now consider a customer that is doing groceries to cook pasta. The customer follows a recipe that consists of chorizo, penne, tomatoes, cream sauce, and some other ingredients. They go online to purchase these items and start by adding some chorizo to their basket. The hypothetical recommendations of ADSR are presented in Figures 3.4 and 3.5. At the top of the image, the input items and their categories are shown. In the center, the image displays the predictions made by the AP. Finally, on the bottom the recommended list is visible.

Figure 3.4: Example recommendations after 1 step in the customer journey

Figure 3.5: Example recommendations after 2 steps in the customer journey

First, in Figure 3.4 several items are recommended based on the chorizo input. Because meat is often combined with bread and dry goods, the recommender has a strong preference for recommending these attributes and also has a high item relevance for these products. Remember that the Attribute-aware Diversifying Decoder reduces the score of products from a category already present in the recommended list, which is why many of the recommendations are from different categories. However, a second item from the category bread is added, even though the recommended list already contains an item from this category. This is because the Attribute-aware Diversifying Decoder allows recommending items from categories already present in the list. This happens when the score of such an item is still the highest, regardless of the boost $S_{div}$ provides to the newly favored categories.

Let the customer accept the recommendation and put the pasta in their basket. The Attribute-aware Diversifying Sequential Recommender updates the state and generates new attribute preference predictions, and the list of recommendations changes accordingly, see Figure 3.5.

From this addition, it has become more clear that the customer might be purchasing items from a recipe for an Italian dish⁴. The ADSR, therefore, suggests relevant but diverse items to go with the items already present in the basket. This process will continue and the recommendations will keep changing as the customer journey progresses and items are added to their basket.


Chapter 4

Experiment Setup

In this chapter, details about the experimental setup used to answer the research questions are provided. First, the datasets used to train and evaluate the models are described, as well as how they are processed before usage in this work. Next, the evaluation measures are introduced and the evaluation method is explained. After these topics, the baselines used to compare ADSR with are introduced and explained. Finally, some implementation specifics are mentioned.

4.1 Datasets

The Attribute-aware Diversifying Sequential Recommender is evaluated on the real-world datasets MovieLens-1M¹ and TMall². This section starts with a brief description of the datasets, after which the pre-processing steps are outlined.

MovieLens-1M

MovieLens-1M is a widely used benchmarking dataset containing 1 million ratings from 6,000 unique users, spread over 4,000 unique movies. Additionally, this dataset contains genre information about each movie, with all movies belonging to at least one of 18 genres. GroupLens has collected the data from its movie recommendation website, aptly named MovieLens. Here, users can rate movies they have watched to get personal recommendations. A user can give an arbitrary number of ratings in one session, and for many users interactions over multiple sessions are collected.

TMall

Tmall.com, previously called Taobao Mall, is one of China’s largest e-commerce retailers. This particular TMall dataset is a collection of interactions with TMall.com of customers that shopped between July 1, 2015 and November 30, 2015. The dataset was originally released for the IJCAI 2016 conference. The data consists of 44.5 million user interactions over 2.4 million items belonging to one of 72 categories. However, in this work, we only use “buy” interactions. The buy events in this dataset have timestamps of when the items were placed in the shopping cart, which allows the interactions with the purchased items to be sorted chronologically.

4.1.1 Pre-processing

First, users and items with fewer than 20 interactions are removed from the datasets to increase density. Next, the datasets are split into training, validation, and test sets. To do this, we do not treat sessions for a specific user separately, but instead view all their interactions as one large sequence. Per sequence, the first 80% is used for the training set. From the remaining 20%, sequences containing items that are not present in the training split are removed. Subsequently, the sequences are randomly divided into two sets of equal size to create the validation and test sets. Finally, to create the inputs for the model, the sliding-window approach is used with size 10 [19, 37]. With this approach, the first 9 items are used as the input and the tenth item is used as the target item. The first, shorter, windows are kept and zero-padded to allow training in batches. Using the sliding-window approach ensures the model is trained on nearly all items.
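The sliding-window step can be sketched in Python as follows. The function name `sliding_windows` is hypothetical, and two details are assumptions that the text does not specify: the padding is placed on the left, and item identifiers start at 1 so that 0 can serve as the padding value. A window size of 4 is used in the example only to keep the output short.

```python
def sliding_windows(sequence, window=10, pad=0):
    """Return (inputs, target) pairs; inputs always has window - 1 entries."""
    examples = []
    for end in range(1, len(sequence)):
        target = sequence[end]
        history = sequence[max(0, end - (window - 1)):end]
        # Assumption: shorter histories are zero-padded on the left.
        padding = [pad] * (window - 1 - len(history))
        examples.append((padding + history, target))
    return examples


examples = sliding_windows([5, 6, 7, 8, 9], window=4)
for inputs, target in examples:
    print(inputs, "->", target)
# [0, 0, 5] -> 6
# [0, 5, 6] -> 7
# [5, 6, 7] -> 8
# [6, 7, 8] -> 9
```

Every item except the first one of a sequence appears once as a target, which is why nearly all items are seen during training.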

¹ https://grouplens.org/datasets/movielens/1m

Table 4.1: Dataset statistics after pre-processing.

Dataset  #Users  #Items  #Train sequences  #Valid sequences  #Test sequences  #Attributes  Sparsity
ML1M      6,041   3,261           784,309            93,929           97,871           18      0.95
TMall    31,855  58,344           698,081            54,706           54,705           71      0.99

The genre and category are used as the item-attribute for ML1M and TMall respectively. A number of movies in the MovieLens dataset belong to more than one genre at once. These multi-labeled movies are handled by summing the genre embeddings, as this method yielded the best results during early experimentation. In this work, a single genre/category is used as the attribute, but the Attribute-aware Diversifying Sequential Recommender can be modified to use other/multiple attributes (or contextual information) by e.g. aggregation of embeddings before reading the input sequences with the GRUs.

The statistics on the datasets after pre-processing are reported in Table 4.1. Note that due to the large number of items in the TMall dataset, the sparsity is significantly higher compared to MovieLens, making it increasingly difficult to generate accurate recommendations [10].

4.2 Evaluation

Various metrics are used to measure the performance of the Attribute-aware Diversifying Sequential Recommender and to compare it with other existing methods. Each metric measures a different aspect, so that the models can be compared on multiple properties. Following [25, 19, 31] we measure accuracy of the model using MRR@k and Recall@k. Additionally, we measure diversity using Intra-List Dissimilarity (ILD) and a simple metric we call Discrete Diversity (Discrete). To measure the performance of the Attribute Predictor we use the accuracy measure, as this is a classification task.

Next, each metric is presented in more detail.

4.2.1 MRR@k

The Mean Reciprocal Rank (MRR) is a positional metric that measures how high in the list of results the ground truth item is ranked, on average. In this case, MRR@k is used as only k items are recommended. This reduces the overall value for MRR because in many cases it will be 0 (if the ground truth is not present). MRR@k is calculated as follows:

$$\mathrm{MRR@}k = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i} \tag{4.1}$$

where $N$ is the number of recommended lists to calculate the mean over, and $\mathrm{rank}_i$ is the rank of the ground truth in $R_L$.

The reciprocal rank for a recommended list is set to 0 if the ground truth is not present in the k recommended items. In this setting, there is always a single ground truth item, which is why the numerator equals 1. This measure is important when the order of the recommended list matters. In the case of personal offers, this does not matter a lot, as people will make use of an offer if it appeals to them regardless of the position, as it saves money. For movies, this matters as people tend to view and interact with the items on top of lists quicker than others [14].
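A minimal Python sketch of MRR@k for the single-target setting is given below. The function name `mrr_at_k` and the toy lists are hypothetical.

```python
def mrr_at_k(recommended_lists, targets, k):
    """Mean reciprocal rank of the single target within the top k items."""
    total = 0.0
    for recommended, target in zip(recommended_lists, targets):
        top_k = recommended[:k]
        if target in top_k:
            total += 1.0 / (top_k.index(target) + 1)  # ranks start at 1
        # A list without the target contributes a reciprocal rank of 0.
    return total / len(targets)


lists = [[3, 1, 2], [4, 5, 6], [7, 8, 9]]
targets = [1, 4, 0]
mrr = mrr_at_k(lists, targets, k=3)
print(mrr)  # (1/2 + 1/1 + 0) / 3 = 0.5
```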

4.2.2 Recall

Recall@k is the proportion of relevant items that are recommended among the first k items. A system recommending all items gets a recall of 1. However, we only look at the top k, so we restrict the denominator to at most k. Therefore, Recall@k is defined as:

$$\mathrm{Recall@}k = \frac{1}{N} \sum_{i=1}^{N} \frac{\#\text{hits}}{\min(\#\text{relevant items}, k)} \tag{4.2}$$

Because the number of ground truth items is always 1, the denominator is always 1, and because in our case we have at most 1 hit, this measure simply becomes:

$$\mathrm{Recall@}k = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} 1 & \text{if } y \in R_L \\ 0 & \text{else} \end{cases}, \tag{4.3}$$

where $y$ is the ground truth target. This is sometimes referred to as Hit@k [37, 10], which is identical in the case of 1 ground truth item.
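For the single-target setting, Recall@k reduces to counting hits, as the following Python sketch shows. The function name `recall_at_k` and the toy lists are hypothetical.

```python
def recall_at_k(recommended_lists, targets, k):
    """Fraction of lists whose top k items contain the single target."""
    hits = sum(1 for recommended, target in zip(recommended_lists, targets)
               if target in recommended[:k])
    return hits / len(targets)


lists = [[3, 1, 2], [4, 5, 6], [7, 8, 9]]
targets = [1, 4, 0]
recall_at_1 = recall_at_k(lists, targets, k=1)  # only the second list hits
recall_at_3 = recall_at_k(lists, targets, k=3)  # the first two lists hit
```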

4.2.3 ILD

Intra-List Dissimilarity (ILD), sometimes referred to as average dissimilarity [41, 34], measures how different items in a single list are from each other using some D-dimensional representation for each item. Formally, the ILD measure is defined as:

$$\mathrm{ILD}(R_L) = \frac{2}{|R_L| \cdot (|R_L| - 1)} \sum_{i \in R_L} \sum_{j \neq i \in R_L} d(i, j), \tag{4.4}$$

where $d(i, j)$ is the Euclidean distance between items $i$ and $j$, and each unordered pair of items is counted once. In this work, the representation used for the items is their one-hot-encoded item-attributes. E.g. for a list of two items, if i has attributes [0, 1, 0, 0] and j has [0, 1, 0, 0] the ILD will be 0. Similarly, when j has attribute encoding [1, 0, 0, 1] the ILD will instead be √3 ≈ 1.7321.
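The two-item example can be reproduced with a short Python sketch. The function name `ild` is hypothetical, and the sketch assumes that every unordered pair of items is counted once, which is the reading under which the example value of 1.7321 is obtained. It relies on `math.dist`, available from Python 3.8.

```python
import math
from itertools import combinations


def ild(attribute_vectors):
    """Mean pairwise Euclidean distance between the items of one list."""
    n = len(attribute_vectors)
    if n < 2:
        return 0.0  # a single item has no pairs to compare
    # Assumption: each unordered pair {i, j} is counted once.
    total = sum(math.dist(a, b) for a, b in combinations(attribute_vectors, 2))
    return 2.0 * total / (n * (n - 1))


same = ild([[0, 1, 0, 0], [0, 1, 0, 0]])
different = ild([[0, 1, 0, 0], [1, 0, 0, 1]])
print(same)                 # 0.0
print(round(different, 4))  # 1.7321
```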

4.2.4 Discrete Diversity

The second diversity measure used is the discrete diversity (Discrete). This measure counts the number of unique attributes present in the recommended list. To be precise, we go over all the recommended items and set each category of the item to 1 (if two items in $R_L$ are from the same category, the category is only counted once). Then, we sum over all the categories to get the final value for the measure. Discrete diversity indicates whether a list covers a narrow or a broad set of categories.
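The counting described here amounts to taking the size of the union of the category sets, as this small Python sketch shows. The function name `discrete_diversity` and the toy list are hypothetical.

```python
def discrete_diversity(item_categories):
    """Count the distinct categories across the recommended list."""
    seen = set()
    for categories in item_categories:
        seen.update(categories)  # a repeated category is only counted once
    return len(seen)


recommended = [{"action"}, {"action", "comedy"}, {"drama"}]
print(discrete_diversity(recommended))  # 3
```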

4.2.5 Accuracy

To measure the performance of the Attribute Predictor we use the accuracy measure. The accuracy is calculated simply as the fraction of correct predictions from the total number of predictions:

$$\text{Acc.} = \frac{\#\text{correct predictions}}{\#\text{total number of predictions}} \qquad (4.5)$$

As the MovieLens dataset contains movies with multiple labels, the prediction is considered correct if the AP predicts one of the active genres.
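The multi-label rule can be made concrete with a small sketch. It assumes the AP outputs a single predicted attribute per input, and the function name `attribute_accuracy` and the genre labels are hypothetical.

```python
from typing import Hashable, Sequence, Set


def attribute_accuracy(predicted: Sequence[Hashable],
                       active: Sequence[Set[Hashable]]) -> float:
    """Fraction of inputs whose predicted attribute is one of the active ones."""
    if len(predicted) != len(active):
        raise ValueError("need one set of active attributes per prediction")
    # A prediction is correct if it matches any of the target's active genres.
    correct = sum(1 for p, genres in zip(predicted, active) if p in genres)
    return correct / len(predicted)


predicted = ["Drama", "Action", "Comedy", "Horror"]
active = [{"Drama", "Romance"}, {"Comedy"}, {"Comedy"}, {"Horror", "Thriller"}]
print(attribute_accuracy(predicted, active))
```

Here three of the four predictions match an active genre; the second one does not.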

4.2.6 Evaluation method

The model only creates the list of recommendations RL during evaluation. While generating, the score S(v) is calculated for each item, creating the lists using Algorithm 2. The generated lists are stored until all inputs are considered. Subsequently, the measures are computed for each input-output pair and are averaged. The best scoring model according to the validation performance is used and evaluated on the test set. The performance on the test set is reported in the results.

Ideally, only unrated movies would be recommended for MovieLens. However, we do not exclude previously rated items from the recommendations. This keeps the model evaluation general, as recommending previously seen items can be favorable in scenarios with recurrent purchases, such as e-commerce. It might, however, negatively impact the performance on the MovieLens dataset.

Negative sampling is sometimes applied during evaluation [10, 40]. This method reduces the number of items to calculate a score for, which allows for quicker computation but also leaves a smaller set of items to select from. As we are interested in diversity, it is not beneficial for the ADSR to narrow the search space before any recommendations are made, because doing so can result in a less diverse set of items to select from. Therefore, negative sampling is not used when evaluating ADSR; the recommendations are generated from the full item set. The hardware used to train the model allows training on the full item set with reasonable training times.
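The difference between the two evaluation settings can be illustrated with a toy ranking. This sketch is not Algorithm 2 and does not use the score S(v) of the model; the helper names `top_k` and `sampled_candidates`, the item names and the scores are all assumptions made for the example.

```python
import random
from typing import Dict, List, Optional, Sequence


def top_k(scores: Dict[str, float], k: int,
          candidates: Optional[Sequence[str]] = None) -> List[str]:
    """Return the k highest-scoring items, optionally restricted to candidates."""
    pool = scores.keys() if candidates is None else candidates
    return sorted(pool, key=lambda item: scores[item], reverse=True)[:k]


def sampled_candidates(all_items: Sequence[str], target: str,
                       num_negatives: int, rng: random.Random) -> List[str]:
    """Candidate set under negative sampling: the target plus random negatives."""
    negatives = rng.sample([i for i in all_items if i != target], num_negatives)
    return [target] + negatives


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.1}

# Full item set: every item competes for a place in the list.
full = top_k(scores, k=3)

# Negative sampling: only the target and two sampled negatives are ranked,
# so items outside this small pool can never be recommended.
cands = sampled_candidates(list(scores), target="e", num_negatives=2,
                           rng=random.Random(0))
sampled = top_k(scores, k=3, candidates=cands)
print(full, sampled)
```

In the sampled setting the low-scoring target always ends up in the list, and the attributes available for diversification are limited to those of the sampled pool.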


4.2.7 Statistical significance testing

To test for statistical significance, the two-tailed paired t-test is employed. This test evaluates the null hypothesis that two related or repeated samples have identical average values. In this work, the data samples that are compared are the metrics calculated for each generated output. Results with p-values lower than 0.01 are considered statistically significant. We use the SciPy statistics package and its ttest_rel function (see footnote 3).
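To make the test concrete, the paired t statistic can be computed by hand from the per-output differences, as in the standard-library sketch below. The function name `paired_t_statistic` and the metric values are hypothetical. In practice, `scipy.stats.ttest_rel(a, b)` returns this statistic together with the two-sided p-value that is compared against the 0.01 threshold.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the paired (related-samples) t-test.

    a[i] and b[i] are the metric values of two models on the same input i.
    Raises ZeroDivisionError if all differences are identical.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err


# Hypothetical per-output metric values for two models on four inputs.
model_a = [5, 7, 6, 9]
model_b = [4, 5, 5, 6]
t_value = paired_t_statistic(model_a, model_b)
print(round(t_value, 4))
```

The statistic has len(a) - 1 degrees of freedom; converting it to a p-value requires the t distribution, which is what the SciPy function provides.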

4.3 Baselines

To validate the proposed model ADSR we construct/select baselines that are fair (use the same information, similar architectures, etc.) to compare with. In this section, the baselines that are used to compare with ADSR are described and motivated.

GRU for recommendations (GRU4Rec)

The GRU4Rec model [19] was one of the first to apply an RNN to the recommendation task. This model uses Gated Recurrent Unit (GRU) cells to model the interaction sequence Sv. For each sequence, a hidden state is produced, which captures the complete input sequence. From this hidden state, the predictions are made by ranking the output of a fully connected layer that predicts a score for each item. This is a simple and solid baseline that does not use attribute information.

Base Sequential Recommender (BSR)

Similar to GRU4Rec, the Base Sequential Recommender is not an attribute-aware model and therefore uses only Sv to generate hidden representations. To be more precise, the BSR employs a bi-directional RNN and applies additive attention to get the global preference, which is then used to generate RL with a fully connected prediction layer. This model is similar to NARM [25] and can be considered the most basic variant of ADSR: it is equal to ADSR with a modified AE and without the AP and ADD modules. No attribute information is used in the AE; it only uses the item interaction sequence.

Attribute-aware Neural Attentive Model (ANAM)

The Attribute-aware Neural Attentive Model incorporates attribute information and applies attention to the hidden representations, similar to ANAM [5].

We have adapted it to do sequential recommendation rather than next-basket recommendation. The core principles remain the same: the attention mechanism is applied to obtain a unified representation based on the individual representations, which are formed from the item sequence Sv and the item attribute sequence Sc. This model is one step closer to ADSR and can be seen as BSR made attribute-aware. ANAM does not have the AP and ADD modules, but it has the same input. This model is therefore attribute-aware but not diversifying.

Multi-Task Attribute-aware Sequential Recommender (MTASR)

The Multi-Task Attribute-aware Sequential Recommender is another step closer to ADSR. This model extends ANAM with the AP module. It predicts the attribute preference using the representations for the item sequence Sv and the item attribute sequence Sc. This model then uses the distribution over the attributes to form a weighted attribute embedding, which is then combined with the attentive hidden representations to predict RL. This model can be regarded as a special case of ADSR with λs = 1, meaning that the main difference between MTASR and ADSR is the lack of the Attribute-aware Diversifying Decoder. In practice, the trained weights of this model are used directly in ADSR. This model is attribute-aware but does not diversify.
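The weighted attribute embedding mentioned above can be read as a probability-weighted sum of per-attribute embedding vectors. The sketch below illustrates that reading only; the function name `weighted_attribute_embedding`, the toy distribution and the two-dimensional embeddings are assumptions, and the way the result is combined with the attentive hidden representations is not covered here.

```python
from typing import List, Sequence


def weighted_attribute_embedding(probs: Sequence[float],
                                 embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Sum of attribute embeddings, each weighted by its predicted probability.

    probs[c] is the predicted preference for attribute c;
    embeddings[c] is the embedding vector of attribute c.
    """
    if len(probs) != len(embeddings):
        raise ValueError("need one probability per attribute embedding")
    dim = len(embeddings[0])
    return [sum(p * e[d] for p, e in zip(probs, embeddings)) for d in range(dim)]


# Hypothetical distribution over three attributes with 2-dimensional embeddings.
probs = [0.5, 0.25, 0.25]
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(weighted_attribute_embedding(probs, embeddings))
```

Because the weights form a distribution, the result stays within the span of the attribute embeddings and shifts toward the attributes the user is predicted to prefer.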

3 Documentation of the ttest_rel function: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/
