
MASTER THESIS

Investigating Positional and Time Stamp Embeddings for News Recommendation with Bidirectional Transformer User Encoder

by Henning Bartsch (12307912)
48 EC, November 3, 2019 - October 30, 2020
Academic Supervisor: Maartje ter Hoeve, MSc.
Company Supervisors: Joris Baan, MSc.; Lucas de Haas, MSc.
Assessor: Ilya Markov, PhD.
Examiner: Maartje ter Hoeve, MSc.


Declaration of Authorship

I, Henning Bartsch, declare that this thesis titled, “Investigating Positional and Time Stamp Embeddings for News Recommendation with Bidirectional Transformer User Encoder” and the work presented in it are my own. I confirm that:

• This work was done wholly while enrolled as a Master Student at the University of Amsterdam.

• No part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:


UNIVERSITY OF AMSTERDAM

Abstract

DPG Media Informatics Institute

MSc. Artificial Intelligence

Investigating Positional and Time Stamp Embeddings for News Recommendation with Bidirectional Transformer User Encoder

by Henning BARTSCH

In news recommendation systems, modelling user interests based on their reading history poses one of the central challenges because users have different, diverse and often changing content preferences. Contextual information about the reading order or interaction time of articles may be useful to capture dynamic interest patterns and to inform next-article recommendations. In order to model dependencies between articles and learn behavioural patterns from user data, we employ bidirectional Transformer-based encoders. Our proposed BERT4NewsRec model extends this powerful representation learning method to the domain of news recommendation by integrating a news encoder. The model is trained on the Cloze objective, leveraging global dependencies within the reading history to predict randomly masked articles. For the news encoder, we compare a trainable CNN-based approach with personalised attention to fixed article embeddings created with a pre-trained BERT model. We also investigate the effect of supplementing encoded articles with contextual information to improve user interest representations. Article positions and order are encoded via learnable positional embeddings. Contrary to our expectations, this sequential information does not improve but rather impairs recommendation performance. To model temporal context, we propose to employ a neural network to learn a mapping function from time stamps to temporal embeddings. However, these also decrease recommendation performance when added to article embeddings. The learned time stamp embeddings presumably do not capture temporal relations and properties effectively. We conduct experiments on real-world Dutch news data and show that our model without supplemental order embeddings outperforms state-of-the-art news recommendation systems. Moreover, we resolve the item cold-start problem prominent in other representative methods by employing a trainable news encoder and training our model on a pseudo classification task that does not impose fixed-size output units.


Acknowledgements

The research and writing process for this Master Thesis was one of my most exciting, challenging, demanding and rewarding experiences, which was made possible by the support and guidance of various people. First of all, I want to thank DPG Media, namely the News Personalisation Squad led by Anne Schuth, for offering this project, the research internship and the required resources. Thank you Anne and the rest of the team for giving me the opportunity to conduct self-directed research, work alongside experienced professionals and become part of the team.

I want to thank my company supervisors Joris Baan and Lucas de Haas who have guided and accompanied me through the research process beyond our weekly meetings. Thanks to your thoughtful questions and comments I could often resolve mental knots and clarify my thinking and strategy. Moreover, I am grateful for your genuine interest in the research project and developing ideas together.

A big thank you also to my academic supervisor Maartje ter Hoeve who guided me and the research from the beginning, through extensions, until the end. Our meetings and discussions have always encouraged me to think more deeply and critically about research. With your questions and comments you helped me to overcome many obstacles and to develop my perspective as a scientist.

Thank you to the University of Amsterdam, the lecturers and academic staff who provide us with the highest-quality education and organisational structures. Lastly, I want to thank my fellow students and friends of the AI Masters program. Thank you for the active and helpful discussions as well as the personal exchange that contributed significantly to developing the knowledge and skills required to actually realise this Thesis.


Contents

Declaration of Authorship
Abstract
Acknowledgements

1 Introduction

2 Related Work
  2.1 Recommendation Systems
    2.1.1 Conventional Methods
    2.1.2 Deep Learning Approaches for Recommendation
  2.2 News Recommendation with Personalised Attention
    2.2.1 NPA Architecture
    2.2.2 Training and Objective Function of NPA
    2.2.3 Relevant Findings
    2.2.4 NPA's Limitations
  2.3 BERT4Rec
    2.3.1 BERT4Rec Architecture
    2.3.2 Relevant Findings
    2.3.3 Limitations of BERT4Rec for User Interest Modelling
  2.4 Challenges in News Recommendation

3 Method
  3.1 Datasets
    3.1.1 Data Statistics and User Behavioural Patterns
          Time Gap between Reads
    3.1.2 Time Split of the Dataset
  3.2 Model Architecture: BERT4NewsRec
    3.2.1 News Encoder
          Trainable CNN
          Fixed BERTje Embeddings
    3.2.2 BERT-based User Encoder
  3.3 Modelling Sequential and Temporal Order of Reading Histories
    3.3.1 Positional Embeddings
    3.3.2 Temporal Embeddings
    3.3.3 Input Modulation
    3.3.4 Relevance of Sequence Order
  3.4 Training Procedures
    3.4.1 Training B4NR with Masked Interest Modelling
    3.4.2 Candidate Article Selection

4 Experiments
  4.1 Performance Evaluation Metrics
  4.2 Baseline Performance on Dutch News Data
  4.3 Comparing NPA and BERT-based User Encoders
  4.4 B4NR Configurations with Different News Encoders
  4.5 Incorporating Positional Information
  4.6 Incorporating Time Stamps

5 Analysis and Discussion of Experimental Results
  5.1 RQ1: Baseline Performances on Dutch News Data
  5.2 RQ2: BERT4NewsRec Compared to NPA Baseline
  5.3 RQ3: Comparing News Encoders of BERT4NewsRec
  5.4 RQ4: Effect of Positional Embeddings on Recommendation
    5.4.1 Positional Embeddings for Longer Reading Histories
    5.4.2 Topic Distribution in User Reading Histories
    5.4.3 Takeaways and Suggestions for Future Research on Positional Embeddings
  5.5 RQ5: Effect of Temporal Embeddings
    5.5.1 Implications and Limitations of B4NR and Temporal Embeddings

6 Conclusion
  6.1 Main Findings and Implications
  6.2 Limitations and Future Research


List of Figures

2.1 Hierarchical model architecture of NPA. Figure adopted from Wu et al. (2019b)

2.2 CNN-based news encoder with personalised attention. D_N = [w_1, w_2, ..., w_M] refers to a news article title consisting of M words. Figure adopted from Wu et al. (2019b)

2.3 Personalised attention to create attention distributions depending on a user-specific preference query. Mechanism and figure adopted from Wu et al. (2019b)

2.4 Schematic description of a Transformer layer (left) and the architecture of BERT4Rec (right). Figure adapted from Sun et al. (2019)

2.5 Overview of a Transformer layer adopted from Sun et al. (2019)

3.1 Distribution over number of read articles aggregated for 10k users over 1 month (no pre-selection)

3.2 Cumulative distribution of number of read articles. Red horizontal line marks the 95% threshold.

3.3 Number of articles read, aggregated per day. Plot shows the user engagement over the month of November 2019 with blue markers on Mondays. The green dashed line indicates the average number of reads.

3.4 Number of articles read, aggregated per hour. Plot shows the user engagement trend over the course of the day. The green dashed line indicates the average number of reads. Orange vertical line marks 9 AM.

3.5 Time difference between consecutive article reads in minutes, aggregated for all users in the working dataset. Plot zooms in on time differences up to 30 minutes.

3.6 CNN-based news encoder with personalised attention. D_N refers to a news article, in our notation A_i = [w_1, w_2, ..., w_M], consisting of M words. Figure adopted from Wu et al. (2019b)

3.7 Personalised attention to create attention distributions depending on a user-specific preference query. Mechanism and figure adopted from Wu et al. (2019b)

3.8 B4NR user encoder comprising stacked bidirectional Transformer layers to encode user interests from a sequence of article representations. Figure adopted from Sun et al. (2019). v_i refer to the encoded articles, in our notation denoted as r_i. p_i denote the supplemental order embeddings, in our notation w_i.

3.9 Schematic description of multi-head self-attention (MHSA) that combines features extracted from multiple attention heads in parallel. Figure adopted from Vaswani et al. (2017)

3.10 Schematic description of scaled dot-product attention. Figure adopted from Vaswani et al. (2017)

5.1 Comparing performances of Vanilla and Modified NPA models on Dutch news data. All models run with default configurations except variation of the history length for ModNPA.

5.2 Training losses and validation AUC scores of the ModNPA baseline and B4NR, L indicating the number of layers of the user encoder.

5.3 Comparing B4NR with different news encoders.

5.4 B4NR with different methods to model sequential information.

5.5 Performance of B4NR with and without LPE, depending on the history length. The size of the blue dots indicates the proportions of users.

5.6 Performance of B4NR with and without LPE, depending on the history length, zooming into histories longer than 50 articles. The size of the blue dots indicates the proportions of users.

5.7 B4NR with different positional embeddings on a separate dataset of users with a minimum of 50 read articles.

5.8 The homogeneity of topic distributions across user histories measured by the entropy and standard deviation. Metrics were computed on the default dataset comprising long and short reading histories.

5.9 B4NR performance with different order embeddings, comparing None, learnable positional (LPE) and temporal embeddings (NTE).


List of Tables

4.1 Default parameter setting of the CNN-based news encoder with personalised attention. 'D#' denotes 'dimension of'.

4.2 Default parameter setting of our BERT-based user encoder. 'D#' denotes 'dimension of'.

4.3 Parameter setting for pre-computing article embeddings with BERTje. 'D#' denotes 'dimension of'.

4.4 Parameters for the temporal embedding function. 'D#' denotes 'dimension of'.

5.1 Baseline news recommendation systems evaluated on English and Dutch news data. Performances on English were copied from Wu et al. (2019b). 'hl' denotes the history length. Metric scores reported on the respective test sets.

5.2 Performances of ModNPA and B4NR with varying number of layers L evaluated on our Dutch test set. Number of trainable model parameters given in millions.

5.3 Performances of B4NR with CNN or BERTje news encoder evaluated on the test set. ModNPA baseline for comparison.

5.4 Performance overview of B4NR with CNN news encoder with different types of sequential information on the test set. The Learnable Positional Embeddings (LPE) are either added or concatenated to the article embeddings. Experiments where the sequence order has been shuffled are denoted with "shuf.".

5.5 Performance of B4NR on the dataset of longer sequences with different types of sequential information on the test set (L_min = 50).

5.6 Performance of B4NR with temporal embeddings (NTE) on the test set. Reported metric values averaged over 3 runs ± SD.


List of Symbols

S_u            original user-specific sequence of read articles, i.e. reading history
S'_u           modified user-specific sequence, i.e. model input
S'_{u;T_i}     target-specific user history
S_u^train      user-specific sequence for the training interval
S_u^test       user-specific sequence for the test interval
S_u^m          set of masked-off articles
u              user ID
U              entire set of all users
e_u            user ID embedding
u (vector)     user representation as embedding vector
N_u            number of articles in user-specific sequence
L_hist         maximum length of user reading histories
L_art          maximum article length
V_A            article vocabulary, set of all available articles
A_i            news article represented as article ID
T_i            specific target article
r_i            article representation as embedding vector
N_C            number of candidates
C_u            user-specific candidate subset
K              negative sample ratio
k              kernel size, e.g. of convolutional filter
e_i            word embedding vector
c_i            contextualised word embedding
W              matrix with trainable projection parameters
b              trainable bias term
α_i            un-normalised attention weights
q_w            user-specific preference query
D_r            dimensionality of article representation
E_{T_C;i}      candidate article embeddings for position i
h_i^l          contextualised representation of last Transformer layer l at position i
H^l            concatenated Transformer representations at layer l
t_i (vector)   normalised time vector for point i
t_i            time stamp for article interaction at point i
w_i            order embedding for position i, e.g. positional or temporal embedding
p_mask         probability of masking off sequence tokens
p_dropout      probability for dropout modules
L              loss function


Chapter 1

Introduction

Application and Methods of Recommender Systems

With the rising popularity and availability of digital media in recent years, an increasing number of users rely on personalised recommendations and search engines to filter relevant information from the overwhelming offer (Özgöbek, Gulla, and Erdur, 2014; Karimi, Jannach, and Jugovac, 2018). The vast amount of available online content follows a constant update cycle that makes it even more challenging for readers to find personally relevant information (Karimi, Jannach, and Jugovac, 2018). Recommender systems (RS) have become valuable and ubiquitous tools for various item domains such as books (Park et al., 2012), movies (Miller et al., 2003) or music (Park, Yoo, and Cho, 2006). Typically, these systems are automated to match relevant candidate items with the user's preferences to improve the user experience and reduce information overload (Das et al., 2007).

Early recommender systems used matrix factorisation methods (Rendle, 2010) or restricted Boltzmann machines (Salakhutdinov, Mnih, and Hinton, 2007), applied to very sparse user-item interaction matrices. Typically, a model is trained to project items and users into a shared latent vector space. With these latent representations, items are recommended that score highest on a certain similarity measure between user and candidate item (e.g. L2 distance). In general, we distinguish four basic methods, namely collaborative filtering or content-based filtering (Wu et al., 2019a), knowledge-based techniques (Wang et al., 2018) or hybrid methods (Cheng et al., 2016; Guo et al., 2017). In most domains, collaborative filtering methods have been commonly preferred because they are domain-independent (Karimi, Jannach, and Jugovac, 2018). Recommendations are made based on user similarities without incorporating item-specific information (see ??). Content-based methods, on the other hand, heavily rely on item information and analyse user histories to recommend similar items, i.e. "more of the same" (see ??). News RS generally follow a more content-based approach (Karimi, Jannach, and Jugovac, 2018) because the items in this case consist of text that can be automatically analysed with techniques from Natural Language Processing (NLP) to create meaningful item representations (An et al., 2019). Most methods distill user interests from this type of implicit feedback (Özgöbek, Gulla, and Erdur, 2014), i.e. the logged reading history, but rarely incorporate explicit feedback or other interest indicators.
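To make the latent-factor idea concrete, the following minimal Python sketch scores candidate items by the dot product between user and item factors. The dimensions, names and the randomly initialised factors are illustrative assumptions; in a real system the factors would be learned by factorising the sparse interaction matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, dim = 1000, 500, 32

    # Latent factors; normally learned from the user-item interaction matrix,
    # here sampled at random purely for illustration.
    user_factors = rng.normal(size=(n_users, dim))
    item_factors = rng.normal(size=(n_items, dim))

    def recommend(user_id, k=5):
        # Score every candidate item against the user vector and return the top k.
        scores = item_factors @ user_factors[user_id]
        return np.argsort(-scores)[:k]

    print(recommend(user_id=42))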

Deep Learning for News Recommendation

With the emergence of deep learning, the traditional methods were extended and representation learning methods based on artificial neural networks (NN) were applied to model items and users. Most modern approaches follow a hierarchical model architecture with the goal of predicting click probabilities for a set of candidate items (Young et al., 2018; Zhou et al., 2018). The foundation of this hierarchy is formed by the article representations obtained from encoding the article content which, in turn, are aggregated and potentially combined with additional user features to create a latent user representation (Wu et al., 2019a). The last component typically predicts a click probability between a candidate article and the user representation (Wu et al., 2019b). Most research has been conducted on industrial news datasets featuring English articles (Özgöbek, Gulla, and Erdur, 2014). As a result, the domain of news recommendation suffers from a lack of established public datasets and, consequently, few or no benchmarks (Karimi, Jannach, and Jugovac, 2018). Comparing different methods is often difficult because there is no common baseline. Furthermore, other, non-English languages are very underrepresented (Karimi, Jannach, and Jugovac, 2018) and hence we lack understanding of the general capabilities of current recommendation and representation learning methods.

Main Challenges in News RS

Despite the evident improvements through deep learning methods (Cheng et al., 2016; Guo et al., 2017; Young et al., 2018), representing user interests remains a challenging task: research in this area has advanced recommendation performance on certain datasets, featuring approaches like recurrent neural networks (Zhou et al., 2018), denoising auto-encoders (Vincent et al., 2008; Okura et al., 2017) or convolutional neural networks (Wu et al., 2019b), and shows that user representations are crucial for recommendation quality. With these advances, we have gained insights about the open problem of how to model user interests from implicit feedback data and what specific challenges we have to address: (i) diverse and changing interests (Cheng et al., 2016; Zhou et al., 2018), (ii) different dependencies between items, reflecting long- and short-term interests (Das et al., 2007; An et al., 2019) and (iii) varying relevance of consumed items for representing user preferences (Kang and McAuley, 2018; Wu et al., 2019b; Wu et al., 2019a).

Recurrent and Attention-based Sequence Representation

Current news recommender systems often rely on recurrent neural networks (RNN) to model sequential user behaviour (An et al., 2019; Zhou et al., 2018). However, RNNs are limited in selective encoding and modelling complex dependencies, especially in longer sequences (Bahdanau, Cho, and Bengio, 2014; Young et al., 2018). With the emergence of attention mechanisms (Bahdanau, Cho, and Bengio, 2014; Luong, Pham, and Manning, 2015), more effective methods for capturing complex dependencies and variable relevance within sequential data, specifically in NLP tasks (Luong, Pham, and Manning, 2015), have been widely used. In news recommendation, for instance, Wu et al. (2019b) employed additive attention to model the importance of words and articles depending on user preferences. Furthermore, they have shown that deep learning methods with such attention mechanisms outperform those without. With the introduction of self-attention and the Transformer model (Vaswani et al., 2017), the attention family further expanded and grew yet more effective in capturing complex and long-term relationships. The powerful Transformer-based architecture has also been adopted in the domain of (sequential) recommendation (Kang and McAuley, 2018; Sun et al., 2019). Overcoming the limitations of uni-directional (left to right) context modelling, the language model BERT (Devlin et al., 2018) leverages bi-directional Transformer-based representations, primarily for the NLP tasks of language modelling and sentence prediction, but also for a multitude of other downstream tasks. Besides the success in classic NLP tasks, the bi-directional representation methods from BERT have been applied to modelling user behaviour (Sun et al., 2019) and have shown significant improvements over uni-directional representations (Hidasi et al., 2015; Kang and McAuley, 2018).

Crucial to the success of the position-invariant self-attention mechanism were positional embeddings that induce the model with relevant spatial information (Vaswani et al., 2017). Without this explicit positional information, the resulting representations and model performance are dramatically worse (Vaswani et al., 2017; Sun et al., 2019), demonstrating that a notion of where something is located within a sequence is highly important (Gehring et al., 2017; Wang et al., 2019). Unlike natural language, the data on users' interaction histories contains temporal information in the form of time stamps and sequential order. Among other factors, we assume that user behaviour partially depends on the given time. For news recommendation in particular, research has shown that recency and local trends may impact user preferences (Das et al., 2007; Karimi, Jannach, and Jugovac, 2018). Similar to sequence position, we hypothesise that the notion of when an interaction occurred can be leveraged to distill user preferences more effectively.

Motivation

Reviewing contemporary literature on news RS (see Section 2), we identified certain gaps and want to address limitations of existing techniques.

Our research is motivated by the following observations.

Compared to textual representation methods, the area of user interest modelling seems underrepresented in current news RS research. Sequence representation methods often emerge from NLP research (e.g. Chung et al. (2014), Bahdanau, Cho, and Bengio (2014), and Vaswani et al. (2017)) and can be applied as news encoders for RS. However, user interest modelling is less studied and more confined to the specific recommendation domain. Due to this research gap between news and user encoders, we focus more on the latter component. We evaluate the proposed methods on downstream news recommendation tasks. Since user representations constitute an integral part of news RS, more effective representation methods may improve personalised recommendation that aims at providing relevant items, in our case news articles, to real users. Among other challenges in news recommendation, like the cold-start and recency problem (Özgöbek, Gulla, and Erdur, 2014; Karimi, Jannach, and Jugovac, 2018), the aforementioned challenges of user modelling form the focal point of this research project. In recent work, the difficulty of modelling complex user interests has become more apparent (Zhou et al., 2019; Wu et al., 2019b), which, conversely, has also shown the need for more sophisticated interest representation techniques. Recent applications of attention mechanisms for (news) recommendation (Zhou et al., 2018; Wu et al., 2019b; Wu et al., 2019a; Kang and McAuley, 2018) have significantly improved over previous deep learning approaches. Extending from additive attention to self-attention may further improve the RS, analogous to advances in NLP (Vaswani et al., 2017; Young et al., 2018). We want to contribute to the open problem of how to capture different dependencies within a user's interaction history and how to determine item relevance from implicit feedback by applying Transformer-based techniques. Since self-attention has been successfully applied to natural language sequences, the mechanism seems promising to model importance and dependencies within a sequence of items as well. While BERT captures linguistic relationships (Devlin et al., 2018), Transformer-based methods could also model corresponding patterns in user behaviour. To address the limitations of recurrent sequence modelling and uni-directional representations dominant in current news RS, we propose to use bi-directional representations to model user interests, similar to Sun et al. (2019).

Furthermore, we investigate how to incorporate and leverage temporal information on user interactions, a novel approach in this domain. Modelling temporal relationships may help to capture users' diverse and changing interests. Most RS that incorporate temporal or positional information apply RNN-based methods (Chen et al., 2018; Kumar, Zhang, and Leskovec, 2019; Ji et al., 2020). Adding explicit positional embeddings greatly improved Transformer-based sequence modelling of natural language (Vaswani et al., 2017). This inspires the development of similar embeddings for temporal information to improve user interest modelling.

We want to transform interaction time stamps into temporal embedding vectors via trainable projection functions approximated by NNs. This effort towards more time-sensitive recommendations should ultimately enhance the quality of the recommender service and user satisfaction. From a technical perspective, we aim to advance methods to model complex relationships and relevance in sequential, time-sensitive data (of user behaviour).

Research Questions

Transitioning from the brief introduction on news recommendation, sequence modelling and the specific challenges of user interest modelling, we formulate the research questions as follows:

RQ1 How does a state-of-the-art news recommendation model perform on real-world, Dutch news data?

RQ2 Do bi-directional, Transformer-based representations of user interests improve recommendation performance over baseline methods?

RQ3 How do article representations from pre-trained BERTje compare to end-to-end CNN-based methods?

RQ4 Can positional embeddings infuse our Transformer-based encoder with sequential information and improve recommendation performance?

RQ5 Does our proposed time stamp embedding method improve model performance? Does this outperform positional embeddings?

RQ1 should provide insights on how modern news RS perform when transferred to a different language and news dataset. This side-by-side comparison and analysis of general properties is often lacking in the current literature. In general, we would expect similar performance to the baselines using English news data. Since the content-based approaches leverage pre-trained word embeddings as input features for the article encoder, we expect that the proposed methods also work with word embeddings for other languages. Subsequent research questions will also be evaluated on the Dutch news data. Extending on that, RQ2 asks to compare representation learning methods and analyse their behaviour and effectiveness for modelling user interests empirically. While using the same news encoder, we compare the existing state-of-the-art personalised attention user encoder (Wu et al., 2019b) to our proposed Transformer method. We expect the relatively complex Transformer-based encoder to outperform the baseline because of its more powerful representation capabilities to capture inter-article dependencies.

RQ3 aims at an analysis of different news encoders. In particular, we are interested in using article representations from pre-trained BERT models (Vries et al., 2019) and comparing those to the common end-to-end training regime. Since transfer learning in NLP has recently fostered many successes (Howard and Ruder, 2018; Devlin et al., 2018; Sanh et al., 2019), we investigate the effectiveness of transfer learning for RS. We expect to confirm the findings of Wu et al. (2019b) that attention-based methods perform best because they can better determine the items' relevance with respect to the user interests. Furthermore, we hypothesise that the news encoder and the resulting article representations have a higher impact on the recommendation quality than the methods for user interest modelling because of the architecture hierarchy. We also expect article representations from pre-trained BERT to yield comparable performance due to its powerful representation capabilities but that, ultimately, finetuning is required to suit the recommendation objective. This aims at an empirical analysis of different methods for user modelling. Combined with the comparison of user encoders, we will gain insights into integral components of news RS and their respective effect on performance.

In RQ4 we infuse the position-invariant Transformer user encoder with positional embeddings. We evaluate existing methods of modelling sequential order with regard to recommendation performance. Since we assume that sequential information is valuable for modelling user behaviour, the provided positional embeddings are expected to improve performance.

With RQ5 we explore how to condition the user interest representations on time and train the model to predict next reads leveraging temporal order. For temporal embeddings, we embed absolute interaction time stamps using a trainable neural-network-based mapping function. We expect that temporal information in the form of additive embeddings can be leveraged by Transformer modules to weight the importance of articles in the reading history for the current next-item recommendation. Since we assume user behaviour depends more on specific interaction times rather than positions within the history, the model infused with temporal embeddings is expected to outperform positional embeddings as well as the baseline. Perhaps there are time-sensitive personal trends, e.g. the reading behaviour in the morning hours is very similar across all days.

Outline and Contributions

In this last paragraph, we briefly outline the structure of this research project and our main contributions. First, we present the technical background in Section ??, namely approaches to RS (e.g. collaborative and content-based filtering) and sequence modelling (e.g. RNNs and attention mechanisms). In Section 2, we discuss in more detail the most relevant work for our research and the two baseline RS, namely NPA (Wu et al., 2019b) and BERT4Rec (Sun et al., 2019). Given this foundation, we then present our methods for investigating the research questions in Section 3 and describe the experimental setup in Section 4. In Section 5, the experimental results are analysed and our hypotheses are evaluated critically. Lastly, we summarise and conclude our research in Section 6, where we also sketch out potential future research.

The main contributions of this research project are as follows:

(C1) We test state-of-the-art news RS on Dutch news data and analyse various architecture configurations

(C2) We focus on user interest modelling by applying Transformer-based modelling tech-niques to the domain of news RS.

(C3) We introduce temporal embeddings to induce the self-attention mechanism with information on temporal order and compare different approaches to create these embeddings

(C4) We show in our experiments that, against our expectation, both positional and temporal embeddings impair our model's performance. These findings challenge fundamental assumptions on the importance of sequence order for recommendation

(C5) Our proposed model architecture in its basic configuration without supplemental order embeddings outperforms the state-of-the-art baseline in experiments on our real-world news dataset.


Chapter 2

Related Work

In this chapter we provide relevant technical background for the methods presented and discussed in this research project. Furthermore, we position our research in the contemporary literature by first presenting related research on recommendation. We start with an overview of conventional methods for recommendation in Section 2.1.1 and deep learning approaches in Section 2.1.2. We present two relevant approaches, namely NPA (Wu et al., 2019b) and BERT4Rec (Sun et al., 2019), in more detail as we build our research on their methods and significantly expand them. Lastly, Section 2.4 presents important challenges specific to news recommendation, which our proposed method addresses.

2.1 Recommendation Systems

2.1.1 Conventional Methods

Conventional recommendation systems (RS) are commonly grouped into four categories, depending on the included data and the processing algorithms. We primarily distinguish between (i) collaborative, (ii) content-based, (iii) knowledge-based and (iv) hybrid techniques. The following paragraphs briefly describe the principles and present example applications.

Collaborative Filtering In RS based on collaborative filtering (CF), user-specific recommendations consider the preferences of other, similar users (Özgöbek, Gulla, and Erdur, 2014). This approach assumes that users belonging to a certain neighbourhood share common preferences and tends to recommend items that similar users have consumed. In CF, item and user features are built from interest patterns in the identified neighbourhoods and typically don't consider item properties. Since these features don't require domain-specific knowledge or detailed information about the items themselves, CF is generally the most common approach to recommendation. Especially for academic research, the availability of public benchmark datasets (e.g. MovieLens, Amazon Purchases) further facilitates CF applications.

Content-based Filtering With content-based methods, RS incorporate specific item properties and make recommendations based on their extracted features (Özgöbek, Gulla, and Erdur, 2014). In contrast to CF, this technique aims to find similarities between items. User preferences are typically inferred from the content a user has consumed. These techniques are more common in the domain of news recommendation, mostly because the text of news articles is well-suited for NLP representation methods and reading histories entail user content preferences.

Knowledge-based Techniques In knowledge-based (KB) approaches, explicit domain knowledge is used to determine similarities between user preferences and item features. These techniques require specific and detailed knowledge about the available items, recommendation criteria and user preferences. For news recommendation, KB methods are less relevant because creating and maintaining the necessary knowledge bases is only feasible for domains with long life cycles and high involvement. With news articles, we observe the exact opposite because most articles quickly become irrelevant and have relatively low involvement (Karimi, Jannach, and Jugovac, 2018).

Hybrid Methods The so-called hybrid methods combine the above-mentioned methods, typically content-based and collaborative filtering (Özgöbek, Gulla, and Erdur, 2014). While the combinations of methods may differ, the overall goal is to address problems arising from using a single approach. For instance, Lin et al. (2012) apply implicit social experts and TF-IDF features of items for personalised recommendations.

2.1.2 Deep Learning Approaches for Recommendation

Recent successes of deep learning in feature representation and combination also apply to recommendation systems, shifting away from traditional methods (Rendle, 2010). Most commonly, deep learning methods use embedding layers to encode features as latent vectors and apply multi-layer neural networks to extract and combine features. For instance, Wide & Deep (Cheng et al., 2016) and DeepFM (Guo et al., 2017) combine a wide channel of factorisation methods with a deep NN channel to incorporate high- and lower-order features. Neural Collaborative Filtering (He et al., 2017) captures user preferences with deep NNs instead of the traditional matrix factorisation methods. The Deep Fusion Model (Lian et al., 2018) uses neural networks of different depths to learn user and news representations, applying attention mechanisms to weight user features.

Modelling Sequence Order In order to incorporate the order of users' behaviour, most methods use recurrent neural networks (RNN) and their variants to encode previous interactions into a latent representation of user preferences. For instance, Hidasi et al. (2015) apply GRUs (Chung et al., 2014) to model click sequences for session-based recommendation. The Caser model (Tang and Wang, 2018) applies convolutional neural networks (CNNs) to learn sequential patterns. In the news domain, Okura et al. (2017) propose to learn distributed news representations with denoising auto-encoders and create user representations by applying RNNs to the browsing history. An et al. (2019) learn short-term user interests from the recent browsing history with a GRU and add an embedding vector capturing long-term interests.
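As a rough illustration of such RNN-based user encoding, the following minimal PyTorch sketch compresses a click sequence into a latent user state; the class name and dimensions are illustrative assumptions and not a re-implementation of the cited systems.

    import torch
    import torch.nn as nn

    class GRUSessionEncoder(nn.Module):
        # Encodes a sequence of clicked-item IDs into a latent user/session state.
        def __init__(self, num_items, emb_dim=64, hidden_dim=128):
            super().__init__()
            self.item_emb = nn.Embedding(num_items, emb_dim)
            self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

        def forward(self, click_ids):
            # click_ids: (batch, seq_len) item IDs in reading order
            _, h_n = self.gru(self.item_emb(click_ids))
            return h_n[-1]  # (batch, hidden_dim): last hidden state as user representation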

Modelling Time Existing methods (Hidasi et al., 2015; Yu et al., 2019) assume that current user preferences build on previous interactions, largely on the most recent ones (Jannach et al., 2012), and that their importance depends on temporal order (Li et al., 2017; Zhou et al., 2019). With RUM (Chen et al., 2018), user histories are modelled with external memory networks that attentively access the user-specific memory matrix to generate user representations. The ASARS model (Wang, Li, and Yan, 2019) proposes a session-aware RS that integrates inter-session temporal dynamics. They consider the session gap time as most valuable and embed it to represent importance within and across sessions. The proposed JODIE model (Kumar, Zhang, and Leskovec, 2019) applies RNNs to learn joint user-item embeddings depending on and changing based on the interaction time. They convert time deltas into a time-context vector with a trainable linear layer. Recently, Ji et al. (2020) proposed a gating mechanism for attention and recurrent modules to incorporate the time distance between interactions. However, these methods are mostly based on RNNs and model temporal dynamics implicitly.
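A minimal sketch of such an explicit time representation, loosely following the idea of projecting a time gap through a trainable linear layer; the module name, non-linearity and dimensions are assumptions for illustration, not taken from the cited papers.

    import torch
    import torch.nn as nn

    class TimeDeltaEmbedding(nn.Module):
        # Maps a scalar time gap (e.g. minutes since the previous interaction) to a context vector.
        def __init__(self, dim=64):
            super().__init__()
            self.proj = nn.Linear(1, dim)

        def forward(self, delta_t):
            # delta_t: (batch, seq_len) time gaps; returns (batch, seq_len, dim)
            return torch.tanh(self.proj(delta_t.unsqueeze(-1)))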

Attention Mechanism As attention mechanisms demonstrate effectiveness for sequence modelling in machine translation (Bahdanau, Cho, and Bengio, 2014; Luong, Pham, and Manning, 2015), recent works also incorporate attention modules into RS. For instance, Li et al. (2017) combine GRUs with attention for session-based recommendation to model sequential behaviour and the session's main purpose. Zhou et al. (2018) emphasise the diversity and importance of user interests, and propose a method using an attention mechanism to capture relative user-item interactions. The Deep Interest Evolution Network (Zhou et al., 2019) improves on that by not only extracting interests but also modelling the interest-evolution process.

Transformer Extending on the basic additive attention mechanism, the Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2018) models introduce multi-head self-attention (MHSA) and achieve state-of-the-art performance on sequence modelling tasks without recurrent or convolutional modules. The order of tokens in the input sequence is incorporated by adding a positional embedding to the input embeddings. The self-attention layer connects all n input positions at once, with a constant number of computations, and allows the model to capture long-distance dependencies in the input sequence; a typical recurrent layer requires O(n) sequential operations and commonly struggles with long-distance relationships (Vaswani et al., 2017).
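The core operation can be sketched as follows; this is a simplified single-head version without the learned projection matrices, masking and multi-head split of a full Transformer layer.

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(q, k, v):
        # q, k, v: (batch, n, d); every position attends to all n positions at once.
        scores = torch.bmm(q, k.transpose(1, 2)) / (q.size(-1) ** 0.5)
        return torch.bmm(F.softmax(scores, dim=-1), v)

    # Order information enters only through the inputs:
    # x = token_embeddings + positional_embeddings, then q = k = v = x for self-attention.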

In the recommendation domain, SASRec (Kang and McAuley, 2018) introduces a uni-directional Transformer-based model to learn sequential user behaviour for next-item recommendation. Building on that, BERT4Rec (Sun et al., 2019) proposes a bidirectional Transformer-based approach, inspired by the architecture and training method of BERT, to achieve SOTA results on several sequential recommendation datasets. Wu et al. (2019a) develop a form of MHSA to capture the interaction and relevance of tokens. To our knowledge, an approach using bidirectional Transformer encoders has not been applied to news recommendation and therefore poses a relevant gap.

2.2 News Recommendation with Personalised Attention

The modelling of news and user representations in a hierarchical architecture is picked up and improved by Wu et al. (2019b). Their "News Recommendation with Personalised Attention" (NPA) model first encodes news titles with a CNN-based approach and then creates a user representation as a sum of the consumed articles weighted by personalised attention weights. This content-based approach introduces a novel attention mechanism to generate a user-specific attention distribution and models the varying relevance and dependencies of words and articles in browsing histories. Evaluated on an MSN News dataset (Wu et al., 2019b), NPA outperforms several representative recommender baselines (e.g. DeepFM (Guo et al., 2017) and DKN (Wang et al., 2018)). Due to its SOTA performance in news recommendation, we use this method as a baseline. To understand the method and its limitations, we discuss it in more detail.

Dataset First, we look into preparing the data because NPA's approach for collecting and preparing the data forms a reasonable basis for news recommendation tasks. To a large degree, we are going to follow these methods but also make explicit when adjusting them for our data and specific setting (see Section 3.1). For their experiments, NPA uses a news dataset from MSN News that was collected over one month (Dec. 2018 - Jan. 2019). From the available user logs, they randomly sampled a subset of N = 10k users. Based on these user interactions, an article set is constructed, i.e. each article from each reading history is added to the collection. The composed dataset is split into train, validation and test sets based on time intervals. For training, the first three weeks of user logs are used, with 10% randomly sampled for validation. The last week is designated for testing. This division into time intervals reserves some articles and interactions for model evaluation on previously unseen data. With regard to news recommendation, this approach simulates real-world processes where new articles are published and the candidate set changes over time.

2.2.1 NPA Architecture

Figure 2.1: Hierarchical model architecture of NPA. Figure adopted from Wu et al. (2019b).

Next we examine the NPA model architecture as a representative example of the commonly used hierarchical approach for (news) RS. As shown in Figure 2.1, NPA comprises three main components: (1) a CNN plus personalised attention as News Encoder, (2) a personalised attention module as User Encoder and (3) a softmax over inner products as Click Predictor.

News Encoder: local context + word-level attention In summary, the news encoder receives a sequence of words [w_1, w_2, ..., w_M] and a user ID embedding e_u, and produces a contextualised article representation r_i.

Starting from the bottom up, the NPA news encoder first converts each word of the article's title into a dense vector representation e_i, using pre-trained GloVe word embeddings. These word embeddings are passed to a convolutional neural network (CNN) that creates a contextualised word representation c_i for the i-th word by capturing local contexts in the word sequence.

Figure 2.2: CNN-based news encoder with personalised attention. D_N = [w_1, w_2, ..., w_M] refers to a news article title consisting of M words. Figure adopted from Wu et al. (2019b).

The last sub-module, a word-level personalised attention module, produces the final contextualised article representation r_i by computing a weighted sum of the CNN's word representations. Here, the underlying assumptions are that each word entails different informativeness about the news title and that the relevance of each word depends on the user's preferences. The proposed personalised attention module, depicted in Figure 2.3, models the user-specific relevance of individual tokens, incorporating a user preference query q_w. In order to personalise the attention distribution over the word sequence, the preference query q_w is created by a feed-forward NN from the embedded user ID e_u.

Figure 2.3: Personalised attention to create attention distributions depending on a user-specific preference query. Mechanism and figure adopted from Wu et al. (2019b).

Throughout the architecture, personalised attention is applied at two levels: the word and document level, for which the authors use two distinct NNs to create the preference vectors q_w and q_d respectively. Hence these distinct preference vectors can learn properties of different granularity. To obtain the final article representation r_i of the i-th news title, all M contextualised word representations are added together, weighted by the computed attention weights. The news encoder network is formed by connecting the above-described modules and is applied to encode all news articles.
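The following PyTorch sketch illustrates this word-level pipeline (word embedding, CNN, personalised attention pooling). Layer sizes and class names such as NewsEncoder are illustrative assumptions under the description above, not the authors' exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PersonalisedAttention(nn.Module):
        # Attention pooling where the query is derived from the user ID embedding.
        def __init__(self, query_dim, key_dim):
            super().__init__()
            self.proj = nn.Linear(key_dim, query_dim)

        def forward(self, keys, query):
            # keys: (batch, M, key_dim), query: (batch, query_dim)
            scores = torch.bmm(torch.tanh(self.proj(keys)), query.unsqueeze(-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=-1)                      # user-specific attention weights
            return torch.bmm(alpha.unsqueeze(1), keys).squeeze(1)  # weighted sum

    class NewsEncoder(nn.Module):
        # Word embeddings -> 1D CNN -> word-level personalised attention -> article vector r_i.
        def __init__(self, vocab_size, emb_dim=300, n_filters=400, kernel=3, user_dim=50, query_dim=200):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, emb_dim)  # initialised from GloVe in practice
            self.cnn = nn.Conv1d(emb_dim, n_filters, kernel, padding=kernel // 2)
            self.q_w = nn.Sequential(nn.Linear(user_dim, query_dim), nn.ReLU())
            self.word_attn = PersonalisedAttention(query_dim, n_filters)

        def forward(self, title_ids, user_emb):
            # title_ids: (batch, M) word IDs of a title, user_emb: (batch, user_dim)
            e = self.word_emb(title_ids)                     # word embeddings e_i
            c = self.cnn(e.transpose(1, 2)).transpose(1, 2)  # contextualised words c_i
            return self.word_attn(c, self.q_w(user_emb))     # article representation r_i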

User Encoder: personalised attention Based on the produced article representations, the user encoder aims to combine the browsed news into a dense vector u, representing the user's interests. Again, the personalised attention module is applied to determine and weight the relevance of individual articles for a specific user (see Fig. 2.3). Assuming that a certain article has different informativeness depending on the user who read it, the article-level attention mechanism may filter out noisy interactions while focusing on relevant ones. On the article level, the personalised attention depends on the article preference query q_d.

Click Predictor: inner product + softmax In the last step of the architecture hierarchy, the model computes a click score given a user representation and candidate articles. The similarity score is calculated as the inner product between the user representation u and the article embedding r'_i, and normalised over all K+1 candidates by the softmax function:

    \hat{y}'_i = r'^{\top}_i u    (2.1)

    \hat{y}_i = \frac{\exp(\hat{y}'_i)}{\sum_{j=0}^{K} \exp(\hat{y}'_j)} = \mathrm{softmax}(\hat{y}'_i)    (2.2)

The model is trained end-to-end on a pseudo K+1 classification task, which is described in detail in Section 3.4.1. In short, the model encodes and computes similarity scores for K+1 candidate articles, K negative samples plus the target article. The training objective minimises the cross entropy between the target and its predicted softmax probability. From each impression in a user's history, a training instance is created with that impression as target and an unordered input sequence obtained via random subsampling of the original reading history.
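A small sketch of the scoring and training objective; the batch size, dimensions and the convention that the target sits at index 0 of the candidate list are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def click_scores(user_repr, cand_reprs):
        # user_repr: (batch, D), cand_reprs: (batch, K+1, D); inner products as in Eq. 2.1.
        return torch.bmm(cand_reprs, user_repr.unsqueeze(-1)).squeeze(-1)

    user_repr = torch.randn(8, 256)
    cand_reprs = torch.randn(8, 5, 256)        # K = 4 negative samples + 1 target
    logits = click_scores(user_repr, cand_reprs)
    target = torch.zeros(8, dtype=torch.long)  # positive class at index 0
    loss = F.cross_entropy(logits, target)     # softmax (Eq. 2.2) + negative log-likelihood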

2.2.2 Training and Objective Function of NPA

In the following subsection, we look into the training procedure and objective function of NPA because we use this approach in our baseline and want to highlight its properties and potential limitations.

Creating Training Samples from User Logs Given the pre-processed user logs, NPA first creates training instances directly proportional to the length of the reading history. Each such instance comprises a target article T_i, an input sequence S'_{u;T_i} and K negative examples. For a given user ID u, their ordered reading history S_u = [A_0, ..., A_M], comprised of various articles A_i ∈ V_A (represented as word sequences), is retrieved from the dataset. As described in Section 3.1.2, this original sequence is divided into S_u^train and S_u^test according to the pre-defined time intervals; but to explain the basic procedure, we assume the undivided, full reading history S_u for simpler notation. Now, iterating through the history of read articles, starting with article A_0, this one is considered the target T_0 = A_0. From the remaining articles a random subset S'_{u;T_0} = {A_i ∈ S_u \ T_0}^{L_hist} is sampled according to the pre-defined maximum history length L_hist. Shorter histories are padded accordingly. This sub-sampling process effectively creates a bag of articles without any sequential order. Next, a set of candidate articles C_{u;T_0} = {A_i ∈ V_A \ S_u}^K is constructed following a negative sampling method. Note that before sampling the K candidates, the full article set V_A is filtered by removing the articles that the user has already read (i.e., S_u) to avoid repetitive or redundant recommendations. Lastly, one training instance is formed by the user ID, target-specific history, target and negative samples, denoted as [u, S'_{u;T_0}, T_0, C_{u;T_0}], and added to the training set X_train. For each following article T_i in S_u, a new target-specific history S'_{u;T_i} and candidates are sampled following this method. Hence, the number of training instances is directly proportional to the number of read articles M_u for each user u ∈ U. The creation of test instances follows a similar scheme but sub-samples only a single input sequence S'^{test}_u for all test targets T_i in S_u^test.
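A simplified sketch of this instance-creation procedure, treating article IDs as plain strings and omitting the train/test split and word-level representation; function and variable names are illustrative.

    import random

    def make_training_instances(user_id, history, article_vocab, L_hist=50, K=4):
        # history: ordered list of article IDs read by this user (the reading history S_u).
        pool = [a for a in article_vocab if a not in set(history)]  # filter already-read articles
        instances = []
        for target in history:
            remaining = [a for a in history if a != target]
            sampled = random.sample(remaining, min(L_hist, len(remaining)))  # unordered bag S'_{u;T_i}
            sampled += ["<pad>"] * (L_hist - len(sampled))                   # pad shorter histories
            negatives = random.sample(pool, K)                               # K negative candidates
            instances.append((user_id, sampled, target, negatives))
        return instances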

While the original NPA approach uses the data efficiently to create an exhaustive set of training instances, it may also introduce a training bias and restrict evaluation. For example, user A with 50 read articles (in the training interval) would yield 50 training instances. Compared to user B with 5 read articles, that would result in ten times more datapoints in the training set. The linear relation between the length of the reading history and the number of training instances may lead to a dominating presence of users with many read articles and possibly an unbalanced dataset. This could negatively impact the generalisation capabilities of the model because of potential over-representation of, or bias towards, a particular user group. The masking approach described in Section 3.4.1 proposes an alternative that preserves the sequence order and could also mitigate this possible bias.

Objective Function and Pseudo Classification Task The loss for each training instance [u; S'_{u;T_0}; T_0; C_{u;T_0}] ∈ X_train is defined as the negative log-likelihood:

    L = \frac{1}{|X_{train}|} \sum_{(S'_{u;T_i}, T_i) \in X_{train}} -\log P(\hat{y}_i = T_i \mid S'_{u;T_i})    (2.3)

Recall that \hat{y}_i is the predicted probability for the positive class (see Section 2.2) formulated as

    \hat{y}'_i = r'^{\top}_i u    (2.4)

    \hat{y}_i = \frac{\exp(\hat{y}'_i)}{\sum_{j=0}^{K} \exp(\hat{y}'_j)}    (2.5)

where r'_i is the contextualised representation of candidate article A_i and u is the user representation. The recommendation task is formulated as a so-called pseudo K+1 classification task where the target T_i corresponds to the positive class and the other candidates to negative classes.

2.2.3 Relevant Findings

The following section presents findings from Wu et al. (2019b) relevant to our research questions and methodology, mainly examining the effectiveness of attention mechanisms for personalisation and the influence of negative sampling. Both attention mechanisms and negative sampling are important components of our proposed model (see Sections 3.2 and 3.4.1).

In their experiments with representative baseline methods, Wu et al. (2019b) show that neural-network-based methods outperform traditional factorisation methods, likely because NNs are able to extract better latent features for users and articles. Among these methods, models using attention mechanisms perform consistently better than those without. Overall, NPA improves over all tested baselines with a statistically significant margin (p < 0.001). Compared to the performance of contemporary works like LSTUR (An et al., 2019) and NRMS (Wu et al., 2019a), NPA appears the best baseline candidate, not only because it scores highest on that news dataset but also because the code for reproducibility was (publicly) available.

Comparing vanilla and personalised attention in ablation studies shows the advantage of user-specific attention distributions for creating article and user representations. Since vanilla attention usually works with a fixed query vector (see Section ??), it cannot model the user-dependent informativeness of words or articles. The proposed personalised attention module, on the other hand, can dynamically attend to relevant tokens and articles using the latent characteristics that are captured by the user-specific preference queries q_w and q_a. Analysing the impact of personalised attention on the word and article level reveals that word-level attention yields the higher performance increase. The authors reason that the words form basic semantic units and the selection of relevant words according to user preferences leads to more informative article representations. Since the hierarchical architecture builds on these fundamental representations, their quality propagates through the entire modelling process. Personalised attention on the article level, while seemingly less effective than word-level attention, also improves model performance. The results strongly suggest that weighting article relevance leads to better user representations and recommendation performance.

The core idea behind the negative sampling approach is to provide complementary examples to the actual target, i.e. the model should not only learn from positive examples (i.e. what the user has clicked) but also from negative ones. In general, NPA performs better with than without negative sampling. The negative sample ratio K has a relatively small influence on the results. While the model first slightly improves with increasing K, the performance declines for values of K > 8. The authors reason that with too many negative examples, it becomes challenging for the model to recognise the positive target and the added-value through more information quickly diminishes. Based on empirical results, they recommend a moderate value of K=4 for their setting.

2.2.4 NPA’s Limitations

In this section we point out a number of limitations of the NPA approach which our proposed BERT4NewsRec model aims to address and improve upon. Here we focus on the news encoder, article length and sequence order. The detailed solution approaches are presented in Section 3. Limitations and adjustments of the training task are discussed later in Section 3.4.3.

CNN-based News Encoder: as a content-based approach, NPA leverages words from the news article to create an embedding. The proposed method applies a single convolutional layer with one kernel size to the news title, whose maximum length is restricted to 30 words. This modelling decision leaves room for improvement, as the NLP literature proposes more effective alternatives to learn sequence representations, e.g. Transformer-based methods. The authors give no explanation about the modelling choice of their news encoder or whether they have tried, for instance, deeper CNNs or various kernel sizes like Kim (2014) to create better article representations.

Article Length: additional benefits could be gained from including article content beyond the title. Feeding more information from the article body to the news encoder of choice could lead to better article representations, improved content understanding and therefore improved recommendation quality. While the title may be sufficient to create a minimalist content representation, additional textual information probably improves that representation by also modelling style, tone and nuances in the content (e.g. Tintarev et al. (2018)). Even for humans it seems difficult to understand and distinguish articles solely by their titles, especially articles on the same topic. To improve user interest modelling, more effective representation learning methods and additional article content seem promising leads, both of which are discussed in Section 3.2.

Sequence Order: as mentioned above, the NPA model cannot capture the sequence order of user interactions. The user encoder is basically a weighted summation of article representations, which is entirely position-invariant. With this approach, it does not matter whether an article has been read yesterday or two weeks ago. Its relevance for the user interest heavily depends on the user preference query, which, again, has no explicit capacity to capture sequential or contextual information. Other research (Zhou et al., 2019; An et al., 2019) has shown that users exhibit both long- and short-term interests. This suggests that article relevance with regard to the user interest representation depends on the position in the reading history. With NPA, each user representation is built from up to 50 articles, randomly sampled from the user's reading history. Shorter sequences are padded to this length. For instance, if a user has read 150 articles, the user representation is built from the titles of 50 random, unordered samples from this history. Model performance could probably be improved with longer and ordered reading histories because this provides more information about reading behaviour and could help to distinguish relevant article reads from less informative, noisy clicks.

2.3 BERT4Rec

With BERT4Rec (Sun et al., 2019), a bidirectional Transformer-based architecture for sequential recommendation is proposed that addresses a fundamental limitation of previous sequential approaches, which model user behaviour sequences from left to right. The authors train and evaluate the BERT4Rec model on publicly available datasets, e.g. MovieLens or Amazon Beauty, on which it outperforms SOTA baseline methods. However, they do not apply it to news articles or leverage specific item features to create the respective embedding vectors. Our research builds on the methods of BERT4Rec and extends them to the news domain by leveraging article content; therefore we describe the model and its limitations in detail. Our architectural contributions are described in Section 3.2.

Sun et al. (2019) pick up the idea of modelling global dependencies in a sequence with Transformer modules (i.e. multi-head self-attention + feed-forward NN) and leverage the resulting representations for next-item predictions. Inspired by the well-known BERT model (Devlin et al., 2018) and the task of Masked Language Modeling (i.e. Cloze objective, see ??), BERT4Rec is trained to predict randomly masked items of the original interaction history by jointly conditioning on both left and right context. Since we adapt this training task for our news recommendation model, it is described in detail in Section 3.4.1. During inference, the next-item prediction is realised by appending the mask token to the existing user history as a signal to predict the item at that position. While the original BERT focuses on pre-training a sentence representation model and is typically supplemented with additional layers for the specific NLP task (e.g. a linear layer to predict the correct word from the vocabulary), this model is trained end-to-end directly on sequential recommendation data. A major contribution is the application of bidirectional, Transformer-based representations of the user history for sequential recommendation tasks.

FIGURE 2.4: Schematic description of a Transformer layer (left) and the architecture of BERT4Rec (right). Figure adapted from Sun et al. (2019).
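To illustrate this Cloze-style corruption of an interaction history (a simplified sketch under our own assumptions, e.g. a reserved MASK_ID and a fixed masking probability; not the authors' implementation):

```python
import random

MASK_ID = 0  # assumed reserved ID for the [MASK] token

def cloze_mask(history, mask_prob=0.2, rng=random.Random(0)):
    """Randomly mask items of an interaction history (Cloze objective).

    Returns the corrupted input sequence and the (position, original_id) targets
    the model has to recover from bidirectional context.
    """
    masked, targets = [], []
    for pos, item in enumerate(history):
        if rng.random() < mask_prob:
            masked.append(MASK_ID)
            targets.append((pos, item))
        else:
            masked.append(item)
    if not targets:  # ensure at least one prediction target per sequence
        pos = rng.randrange(len(history))
        targets.append((pos, history[pos]))
        masked[pos] = MASK_ID
    return masked, targets

history = [12, 55, 8, 102, 33, 71]
print(cloze_mask(history))
# At inference time, next-item prediction instead appends MASK_ID to the full history.
```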

The model architecture is depicted in Figure 2.4 and is described in the following sections.

2.3.1 BERT4Rec Architecture

Embedding Items and Constructing Model Input As input, BERT4Rec uses an embedding layer to create a $D_v$-dimensional vector representation for each item $v_i \in E_I$ from its item ID and supplements this with positional information $p_i \in P$. The entire item set, denoted $V_I$, comprises the respective item IDs. For each ID, $E_I \in \mathbb{R}^{|V_I| \times D_v}$ stores the corresponding item embedding, which is randomly initialised and fine-tuned during model training. The positional embeddings are also captured by an embedding matrix $P \in \mathbb{R}^{L_{hist} \times D_v}$ and learned for each position index $i$. Sun et al. (2019) chose learnable positional embeddings (LPEs) because in their experiments they performed empirically better than the fixed trigonometric positional embeddings from Vaswani et al. (2017). This restricts the maximum sequence length to a fixed length $L_{hist}$. Consequently, the input sequences are truncated to the last $L_{hist}$ items and padded on the left side accordingly. The input representation $h_i^0$ for a given item $v_i$ is the sum of its corresponding item embedding $v_i$ and the positional embedding $p_i$. Through the embedding layer, each item of the input sequence is encoded and passed to the subsequent user encoder module described in the following paragraph.
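A minimal sketch of such an embedding layer (our own PyTorch illustration; the class name BERT4RecEmbedding and all dimensions are assumptions, and dropout/LayerNorm are omitted for brevity) could look as follows:

```python
import torch
import torch.nn as nn

class BERT4RecEmbedding(nn.Module):
    """Input layer sketch: item-ID embedding plus learnable positional embedding."""

    def __init__(self, n_items, hidden_dim=64, max_len=200):
        super().__init__()
        self.item_emb = nn.Embedding(n_items + 2, hidden_dim, padding_idx=0)  # + pad, + mask IDs
        self.pos_emb = nn.Embedding(max_len, hidden_dim)                      # learnable P

    def forward(self, item_ids):                         # (batch, L_hist) item indices
        positions = torch.arange(item_ids.size(1), device=item_ids.device)
        h0 = self.item_emb(item_ids) + self.pos_emb(positions)                # v_i + p_i
        return h0                                        # (batch, L_hist, hidden_dim)

emb = BERT4RecEmbedding(n_items=10_000)
print(emb(torch.randint(1, 10_000, (4, 50))).shape)      # torch.Size([4, 50, 64])
```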


BERT as User Interest Encoder The underlying architecture comprises L stacked Transformer layers, depicted in Figure 2.4 (right). Each layer consists of two sub-layers: a multi-head self-attention module followed by a point-wise feed-forward network. Starting with the input sequence of length $L_{hist}$, at each layer $l$ the model produces hidden representations $h_i^l$ for each position $i$ in parallel. The multi-head self-attention module computes item interactions by jointly attending to all sequence positions and aggregating information from different representation subspaces (Vaswani et al., 2017). The subsequent point-wise feed-forward network (PFFN) processes these (sub-layer) hidden representations separately at each position and models interactions between different dimensions. The same two-layer PFFN (with GELU activation (Hendrycks and Gimpel, 2016)) is applied to each position, and the per-position outputs are concatenated.

In contrast to the original BERT architecture, BERT4Rec does not employ ReLU as the non-linear activation function between layers, but the Gaussian Error Linear Unit (GELU) (Hendrycks and Gimpel, 2016). Sun et al. (2019) chose GELU over other non-linearities because of empirically better performance. The GELU activation function combines certain properties of dropout (Srivastava et al., 2014) and ReLU, viewing a neuron's output as more probabilistic (Hendrycks and Gimpel, 2016). In experiments with this non-linear function on tasks from computer vision and NLP, the GELU has matched or exceeded models with ReLU or ELU activations (Hendrycks and Gimpel, 2016). The GELU is defined as

\begin{align}
\text{GELU}(x) &= x\,\Phi(x) \tag{2.6}\\
&\approx 0.5x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\bigl(x + 0.044715x^{3}\bigr)\right]\right) \tag{2.7}\\
&\approx x\,\sigma(1.702x) \tag{2.8}
\end{align}

where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution $\mathcal{N}(0,1)$. Compared to the standard ReLU, the GELU is a smoothed binary activation that can take both negative and positive values and also exhibits curvature in the positive domain.
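As a quick numerical check of these definitions (our own illustration, not part of the original work), both approximations track the exact CDF-based form closely:

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF expressed via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation, Eq. (2.7)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

def gelu_sigmoid(x):
    # sigmoid approximation, Eq. (2.8)
    return x / (1.0 + math.exp(-1.702 * x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  "
          f"tanh_approx={gelu_tanh(x):+.4f}  sigmoid_approx={gelu_sigmoid(x):+.4f}")
```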

Following Devlin et al. (2018), BERT4Rec also applies dropout (Srivastava et al., 2014), residual connections (He et al., 2016) and layer normalisation (LayerNorm) (Ba, Kiros, and Hinton, 2016) around the multi-head self-attention (MH) and PFFN sub-layers. These sub-layer connections are crucial to avoid overfitting and stabilise network training, especially when stacking multiple Transformer layers (Devlin et al., 2018; Sun et al., 2019). The full Transformer layer is shown in Figure 2.4 (left).
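Putting these pieces together, a single Transformer layer of this kind can be sketched as follows (a simplified PyTorch illustration with assumed dimensions, not the authors' implementation; the class name TransformerLayer and all hyper-parameters are hypothetical):

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Multi-head self-attention + point-wise feed-forward network (PFFN),
    each wrapped with dropout, a residual connection and LayerNorm."""

    def __init__(self, hidden_dim=64, n_heads=2, ff_dim=256, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(hidden_dim, n_heads, dropout=dropout,
                                         batch_first=True)
        self.pffn = nn.Sequential(
            nn.Linear(hidden_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, hidden_dim)
        )
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, h):                                  # (batch, L_hist, hidden_dim)
        attn_out, _ = self.mha(h, h, h, need_weights=False)
        h = self.norm1(h + self.drop(attn_out))            # residual + LayerNorm (MH)
        h = self.norm2(h + self.drop(self.pffn(h)))        # residual + LayerNorm (PFFN)
        return h

layer = TransformerLayer()
print(layer(torch.randn(4, 50, 64)).shape)                 # torch.Size([4, 50, 64])
```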

Prediction Layer Given the final hidden representations $H^L$, the prediction layer applies a two-layer feed-forward NN and computes the inner product over all items in $E_I$. The proposed prediction layer can be described as:

$$P(v_i) = \text{softmax}\bigl(\text{GELU}(h_i^L W^P + b^P)\,E_I^{\top} + b^O\bigr) \tag{2.9}$$

where $W^P \in \mathbb{R}^{D_i \times D_i}$ (weights), and $b^P \in \mathbb{R}^{D_i}$ and $b^O \in \mathbb{R}^{|V_I|}$ (biases) are trainable network parameters. The prediction layer ultimately yields softmax probabilities over all items for the (masked) position $i$.
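A minimal sketch of this output head (our own illustration of Eq. (2.9); the class name PredictionLayer is hypothetical, and the item embedding matrix is shared with the input layer as described above):

```python
import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    """GELU feed-forward projection of the final hidden state, followed by an
    inner product with the shared item embedding matrix E_I and a softmax."""

    def __init__(self, hidden_dim, item_emb: nn.Embedding):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)   # W^P, b^P
        self.act = nn.GELU()
        self.item_emb = item_emb                        # shared E_I
        self.out_bias = nn.Parameter(torch.zeros(item_emb.num_embeddings))  # b^O

    def forward(self, h_masked):                        # (batch, hidden_dim) at a masked position
        x = self.act(self.proj(h_masked))
        logits = x @ self.item_emb.weight.T + self.out_bias
        return torch.softmax(logits, dim=-1)            # probabilities over all items

item_emb = nn.Embedding(10_000, 64)
head = PredictionLayer(64, item_emb)
print(head(torch.randn(4, 64)).shape)                   # torch.Size([4, 10000])
```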

2.3.2 Relevant Findings

Impact of Bidirectional Transformers Besides proposing the BERT4Rec model architecture, the authors also investigate its modeling capacities on four real-world datasets and compare it with state-of-the-art (SOTA) baselines from the literature. The following selected findings of Sun et al. (2019) make a strong argument for the effectiveness of BERT4Rec and for its application to news recommendation. First, the proposed model outperforms and achieves statistically significant improvements (with p < 0.001) over all the tested SOTA sequential recommendation systems. Most notably, the comparison to the SASRec model (Kang and McAuley, 2018), a unidirectional, Transformer-based recommender, demonstrates the improvements gained from bidirectional representations (Sun et al., 2019). This finding underlines the importance of jointly attending to and leveraging context from both sides as a valuable property for modeling user behaviour. The analysis of attention distributions across attention heads shows that individual heads attend to different sides of the context. The attention also varies across layers: layers directly connected to the output tend to focus on the more recent items, indicating their importance for the next-item prediction. In addition to the bidirectional representations, the Cloze task allows modelling interests at different positions in the user history, as opposed to the common approach of predicting only the last/next item in a sequence (Sun et al., 2019; Kang and McAuley, 2018). Under the Cloze task, the model needs to predict random items at different positions, which incentivises generalisation to handle diverse examples and contexts. Thus, this method seems promising for addressing a core challenge of news recommendation, the multi-modal and changing user interests.

FIGURE 2.5: Overview of a Transformer layer, adopted from Sun et al. (2019).


Effect of Sequence Length Secondly, the experiments on variable sequence length N show that the model can deal with both longer (N = 200) and shorter (N = 20) input sequences. Even for relatively long sequences (N = 200 and N = 400), BERT4Rec's performance remains strong. With increasing sequence length N, more information as well as more noise is introduced, but the model seems able to attend to the relevant items and filter the noise from the user history. For their datasets with longer sequences, the model focuses on the less recent items for capturing the user behaviour.

Importance of Positional Embeddings Lastly, the ablation studies confirm that positional embeddings are crucial for the modeling capacity. Removing positional information decreases performance dramatically, as the contextualised representations then depend only on the item embeddings, demonstrating that the items' positions and order matter. These results align with similar findings on the impact of positional embeddings (Gehring et al., 2017; Devlin et al., 2018; Wang et al., 2019) and motivate further investigations into the design and role of such positional information for modeling and recommendation tasks. Our research aims to investigate the relevance of sequence order as well as how to model and leverage this information with powerful representation learning methods. While BERT4Rec applies two variations of positional embeddings, we extend this by also considering the interaction time stamps to represent the sequence order (see Section 3.3.2).

2.3.3 Limitations of BERT4Rec for User Interest Modelling

Item Embeddings As described above, each item is represented by a randomly initialised embedding vector $v_i$ that is jointly trained with the other model parameters. In the beginning these embeddings do not capture any meaningful information about the items, neither based on content nor on other features, and derive their semantic representation from the context in which they appear, i.e. the different interaction histories of many users. Furthermore, this embedding approach assumes a complete and static item set $V_I$. Since BERT4Rec was evaluated on such datasets, e.g. Amazon Beauty and MovieLens, this embedding approach is viable there. In our news recommendation scenario, however, content-based item representations seem more promising, mainly because (1) the article content is available and indicative for deriving a meaningful latent representation, (2) the interaction matrix is more sparse, which makes it difficult to derive good representations from few interactions, and (3) the item set changes dynamically in a real-world setting. In Section 3.2 we discuss this in more detail and propose a content-based representation learning method.

Positional Embeddings While the positional information from $p_i$ allows the model to identify and distinguish the absolute position of an item, it may be limited in capturing the distance and relationship between those positions. Since the positional embeddings are essentially a lookup table that maps a given index to a vector, these embeddings are trained independently of each other. As a result, the actual position relative to other positions within a sequence is not modelled (Gehring et al., 2017). For instance, given two position indices, $j_1$ at position $pos$ and the subsequent $j_2$ at $pos+1$, the corresponding positional embeddings $p_{j_1}$ and $p_{j_2}$ only imply that these are distinct positions. While the position indices impose an ordered relationship (e.g. adjacency or precedence), the positional embeddings $p_{j_1}$ and $p_{j_2}$ do not necessarily provide information about this relationship (Wang et al., 2019). In the case of natural sentences, when using a similar mapping from a word index to an embedding vector, these embeddings can be trained independently because there is no specific ordered relationship between neighbouring indices in a given arbitrary vocabulary. A given sentence composed of various words assumes an
