
MSc Artificial Intelligence

Master Thesis

Explainable Conversational Recommendations

by

Nikolaos Kondylidis

11853913

May 31, 2020

36 EC April 2019 - May 2020

Supervisor/Examiner:

J. Zou MSc

Prof. E. Kanoulas

Assessor:

Dr. E. Gavves

1 Abstract

In this work we study explainable conversational recommendations for new users, focusing on presenting clear reasons why an item is recommended to the user. We merge the user and item modeling spaces into a common, explainable categorical space, making the behavior of the system more intuitive and explainable. By doing so, we set the foundations for systems that allow the user to comment on the explanations of why an item was recommended, and that utilize that feedback. Such systems could potentially make the user feel more involved in the recommendation process and amplify the user's trust and satisfaction, but also improve the performance of the recommender system.

In the setting of conversational item recommendation, the Seeker (user) initiates a conversation with the Recommender (system) so that the latter can recommend an item matching the user's interests. In the case of new users, the Recommender has no prior knowledge regarding the user's preferences. Therefore, the explanations need to (a) depend only on the knowledge given in the conversation, besides the items' properties, and (b) be concise, so that the user can understand them, judge them, and provide feedback.

We developed a neural recommender model that performs category-aware item recommendations by first predicting the user's item category preference distribution, and then translating it into an item preference distribution. This way, user modeling is reduced to a distribution over item categories, which is interpretable by definition.

In more detail, the item recommender system is a feedforward network composed of (a) a category encoder and (b) an item rating decoder. The category encoder translates a set of item ratings into a (user) category distribution, and the item rating decoder translates a category distribution into item ratings. Due to the conversational aspect of the task, knowledge of item ratings is not provided explicitly but has to be inferred from the conversation. For this reason, we use BERT as a text comprehension medium to either predict the ratings of the user over the mentioned items or to directly infer the category preferences of the user from the conversation. In the latter case, some words mentioned by the Seeker can be used to justify the predicted category preferences.

The approaches are applied to a real conversational movie recommendation dataset, the ReDial dataset [1]. We experiment with two tasks: (a) item recommendation using the recommender model while ignoring the conversation, and (b) conversational item recommendation. The experimental results on the ReDial conversational recommendation dataset show a notable improvement on the item recommendation task compared to the baselines and competitive performance on the conversational item recommendation task. At the same time, the results can be justified and explained based on the category distribution and the keywords mentioned.

2 Acknowledgments

After the completion of this thesis, the final step in graduating from the master's program, I would like to express my thanks and gratitude to those who contributed to my efforts. Throughout the course, I had the continuous support of my family and friends. In addition, the scholarship offered to me by the Bequest Department of the Aristotle University of Thessaloniki, Greece, made the attendance and completion of this course possible. The teaching staff of the University of Amsterdam helped me not only to expand my knowledge, but also to broaden my horizons, and provided me with inspiration in my field of studies. Regarding this thesis, I want to underline the involvement of my daily supervisor Jie Zou, who was always available to provide me with his guidance and help me understand how to frame an experiment with the appropriate research questions. A special mention is due to Prof. Evangelos Kanoulas, my examiner, for his encouragement, creative guidance and mentoring in writing and submitting scientific papers.


Contents

1 Abstract
2 Acknowledgments
3 Introduction
4 Literature Review
4.1 Recommender Systems
4.1.1 Autoencoder Based Recommenders
4.1.2 Explainable Recommenders
4.1.3 Conversational Recommender Systems
4.2 Text Comprehension
4.2.1 BERT
4.2.2 Target Sentiment Analysis using BERT
4.2.3 Category Sentiment Analysis Using BERT
4.2.4 Dealing With Attention Distraction on BERT
4.3 Research Questions
5 Methodology
5.1 Building a User Category Preference Distribution Vector
5.2 Category Aware AutoEncoder
5.2.1 Category Encoder
5.2.2 Item Rating Decoder
5.3 Category Aware Conversational Recommender
5.3.1 Text Comprehension Model
5.3.2 Conversational Recommendation Models
5.4 Using Token Attention for Justifying CPD Prediction
6 Experiments
6.1 Datasets
6.2 Item Recommendation Experiments
6.3 Item Recommendation Results
6.3.1 Analysis
6.4 Conversational Recommendation Experiments
6.5 Conversational Recommendation Results
6.6 Identifying Important Tokens and CPD Prediction Experiments
6.7 Identifying Important Tokens and CPD Prediction Results
7 Effectiveness of Proposed Approaches
7.1 How effective is the proposed explainability?
7.2 Important Token Identification and CPD Prediction From Text
8 Conclusion
9 Future Work

3 Introduction

Recommender Systems Nowadays, technology gives us access to constantly increasing amounts of information, which makes the process of accessing the right information more and more challenging. Information Retrieval (IR), a field of computer science, studies and tries to tackle this problem. IR tools are so important that the success of big companies depends on their successful application: Google's main product retrieves information in the domain of websites, and Amazon allows users to search products from multiple sellers at the same time. One subfield of IR is recommender systems. In traditional IR, users who issue a query know what they are looking for. In contrast, a recommender system is used when the user is looking for recommendations of items that are usually new to her, so her query has a vague and abstract form and the target is unclear.

Popularity of Conversational Recommender Systems Recent advancements in Natural Language Processing (NLP) and speech recognition have allowed us to develop chatbots and conversational systems; systems that have a conversational interface. Personal assistants and conversational systems are becoming increasingly popular, and many big technology companies are investing in personal digital assistants such as Siri, Google Assistant, Alexa and Cortana. Conversation will probably become the next most commonly used interface for search engines, as it can be used much faster and more easily. A vital module for all these assistants is conversational recommendation, which is considered of high research importance [2, 3, 4]. A conversational system, apart from its simplicity, has the advantages of being able to ask for clarifications, being bidirectional, and being able to take an active role in forming the query.

Importance of Explainable Recommendations Even though recommending the correct item is the main target of recommender systems, being able to explain the recommendations is equally important. As Liu et al. [5], Zhang et al. [6], and Tintarev and Masthoff [7] clearly state, "By explaining how the system works and/or why an item is recommended, the system becomes more transparent and has the potential to allow users to tell when the system is wrong (scrutability), help users make better (effectiveness) and faster (efficiency) decisions, convince users to try or buy (persuasiveness), or increase the ease of the user enjoyment (satisfaction)". These arguments are further supported by many studies [8, 9, 10, 11, 12, 13, 14, 15].

Advantage of Handling Feedback The study of Yu et al. [16] shows that using the user's feedback reduces the number of interactions needed for efficient recommending. In a conversational setting, the user is able to provide feedback. Explainable conversational recommending should allow the user to understand why an item was recommended to her, help her approve, or reject, the recommendation, and let her provide direct feedback on the reasons that led to this recommendation.

Problem Definition Recommender systems have trouble satisfying new users: the well-known "cold-start" problem. Albalawi et al. [17] state that users are frequently dissatisfied with initial recommendations because models are unable to elicit their preferences in the first steps of the interaction, nor to grasp their actual needs. Trying to tackle this problem, we worked towards giving a conversational recommender system the ability to provide explanations in a form that allows feedback. In order to do so, the explanation provided should depend solely on the ongoing conversation and be concise. This means that the explanation should not refer to nor depend on any sources outside the conversation (e.g. social media content, the user's previous activity, similar users, past preferred items), directing towards a "cold-start" approach. Furthermore, the explanation needs to be concise enough so that the user can judge it immediately and provide feedback (e.g. the predicted user preference over an item's categories or properties).

Literature Gap Regardless of the importance of explanations and feedback handling, most explainable recommenders cannot support a conversational recommender system that depends solely on the information provided in the conversation and functions in real time. Most systems provide explanations that depend on the user's reviews, online social accounts or preferences over either items or their features [18, 19, 20, 21, 22, 12, 23, 9, 24, 25, 26, 27]. Other systems have to be executed separately for each item in order to estimate how much a user will like it [28, 29, 30], prohibiting their application in real-time systems.

Scope of the Thesis Our objective is to define a recommender system that complies with the prerequisites of the aforementioned problem definition. To that end, we designed a neural network that first performs user modeling in an explainable space, and then performs item recommendation based on that. This way, the model can explain its understanding of the user and allow her to provide feedback on it. Regarding our approach, we aimed at minimizing the prior information required about the user, so we assume that no user information is provided. Even though we aim at utilizing feedback on the explanation level, we only provide a system that makes this possible and leave the approach of feedback comprehension for future work.

The Proposed Explainable Recommender To allow better interpretability, we propose a new recommendation approach, the Category Aware AutoEncoder (CAAE). The CAAE has the properties of an autoencoder, that is, (i) it is composed of an encoder and a decoder, (ii) the input space and the output space are common, and (iii) the encoder creates a compact representation of the input space. On top of these properties, the CAAE ensures that the compact representation of the input space can be directly interpreted. Specifically, this method assumes that items fall under certain categories (e.g. movie genres, product categories, food cuisines, etc.), and allows user modeling to take place in the item category space, in the form of a Category Preference Distribution (CPD). The category preferences are then translated into item ratings. Also, the user profile in our model is self-explanatory in plain terms, and the item rating prediction can be directly justified. Moreover, the description is in the form of categories and not in the form of attributes. The attribute space of items can easily expand over time; the categories, on the contrary, remain the same, ensuring that the user modeling space remains unchanged over time.
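As an illustration, the CAAE's two stages can be sketched as a tiny numpy forward pass. The layer sizes, weight initialization and variable names below are illustrative assumptions, not the thesis's actual configuration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class CategoryAwareAutoEncoder:
    """Minimal sketch: the encoder maps item ratings to an explainable
    Category Preference Distribution (CPD); the decoder maps the CPD
    back to item-rating predictions."""
    def __init__(self, n_items, n_categories, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0, 0.1, (n_categories, n_items))
        self.W_dec = rng.normal(0, 0.1, (n_items, n_categories))

    def encode(self, ratings):
        # The CPD is a distribution over item categories (sums to 1),
        # so it is directly interpretable as the user model.
        return softmax(self.W_enc @ ratings)

    def decode(self, cpd):
        # Translate category preferences into item scores.
        return self.W_dec @ cpd

    def forward(self, ratings):
        cpd = self.encode(ratings)
        return cpd, self.decode(cpd)

caae = CategoryAwareAutoEncoder(n_items=100, n_categories=8)
ratings = np.zeros(100)
ratings[[3, 17, 42]] = 1.0  # items the user liked
cpd, scores = caae.forward(ratings)
```

The explainability claim rests on the bottleneck: unlike an ordinary autoencoder's arbitrary hidden units, each coordinate of `cpd` corresponds to a named item category.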

The Proposed Explainable Conversational Recommender Furthermore, we apply the same approach in a conversational recommendation setting. Two approaches were studied in this case, using BERT [31] models to extract information from the conversation. In the first one, we initially predict the ratings of mentioned items and then apply the CAAE. In the second one, we predict the category preference of the user solely from the conversation at the textual level, ignoring the mentioned items, and then apply an item rating decoder. Predicting the category preference of the user from the text can be perceived as topic modeling, with the recommendation based on the predicted topics-categories. It is a common technique to use topic modeling either for recommending [23, 32] or for explaining a recommendation [19, 33, 34]. When predicting the category preference directly from the text, we perform further studies in order to utilize words or phrases mentioned by the user during the ongoing conversation for justifying and supporting our prediction and, subsequently, our recommendation. We do so by highlighting words that received a lot of attention from the BERT model and led to the predicted category preference distribution. In contrast to studies that use a standard lexicon per topic for identifying important words, we use the self-attention of our model to select the important words. This way we can confidently say that these words actually led to the prediction of the category preference, as the work of Vashishth et al. [35] shows that attention can justify the results of self-attentive models, since it is crucial for the produced outcome.
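The attention-based highlighting idea can be sketched as follows. This toy numpy example uses random query/key matrices in place of BERT's learned projections, and the `focus` position (standing in for a classification-style token) is an illustrative assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_highlights(tokens, Q, K, focus=0, top_k=2):
    """Rank tokens by the attention weight a focus position pays to
    them, as a proxy for which words drove the prediction."""
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (seq, seq) attention matrix
    weights = attn[focus]                 # attention paid by the focus token
    order = np.argsort(weights)[::-1]     # most-attended tokens first
    return [tokens[i] for i in order[:top_k]]

rng = np.random.default_rng(1)
tokens = ["[CLS]", "i", "like", "science", "fiction", "movies"]
Q = rng.normal(size=(len(tokens), 16))  # stand-ins for learned projections
K = rng.normal(size=(len(tokens), 16))
top = attention_highlights(tokens, Q, K)
```

In the real system the attention weights come from a trained BERT model, so the top-ranked tokens are the words that actually influenced the predicted CPD.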

To summarize, we propose a new explainable recommender model, by proposing a new technique to train an autoencoder for collaborative filtering. Our method outperforms autoencoders that are trained for collaborative filtering while using the same architecture. Our way of utilizing the categorical information of the items can therefore be assumed to be very effective, while having explainable properties. Taking into account that the trade-off between performance and explainability is well known for the recommendation task [9, 10], it is very important to make clear that our method both improves on the performance of its compared baseline and makes the model explainable at the same time. On top of that, we propose a way to extend our model's principle to conversational domains, which also provides the advantage of using the user's language for justification. Our conversational model performs competitively compared to its baselines and opens the way to explainable conversational recommender systems that provide concise and session-contained explanations, which can potentially handle feedback on the recommendation but also on the explanation.

The structure of the thesis is the following. Initially, the research that has been performed on relevant topics is presented in Section 4, where the research gaps that led us towards developing our approach are also pointed out. At the end of that section we define the research questions that we try to answer. In Section 5 our proposed approaches are defined in detail. Later, in Section 6 we present the experimental setup that allows us to answer the research questions; this section contains a subsection for each research question, followed by another subsection with the corresponding results. Then, in Section 7 we demonstrate the effectiveness of our approaches with some examples; more examples are given in the Appendix. Finally, in Sections 8 and 9 we draw our conclusions and propose ways in which this study can be extended in the future.

4 Literature Review

The main contribution of this thesis is an explainable recommender model, the CAAE. The CAAE can be applied to item ratings, predict the user's CPD and then recommend items accordingly. Additionally, we propose three approaches so that the CAAE can be applied to the conversational recommendation domain. The first one applies a text comprehension model to predict the user's sentiment over mentioned items and then applies the CAAE. The second one predicts the user's CPD directly from the conversation and then applies an item rating decoder to recommend items accordingly. The third approach is the combination of the first two. In this section, relevant work on recommender systems is reviewed first. Then, studies that use text comprehension models similar to the ones used in the proposed approaches are presented. The review of text comprehension models starts with a common introduction and then splits into the two tasks: Target Sentiment Analysis and Category Sentiment Analysis.

4.1 Recommender Systems

In this subsection, recommender systems are reviewed. They are divided into autoencoder-based recommenders, explainable recommenders and conversational recommender systems.

4.1.1 Autoencoder Based Recommenders

Our explainable recommender system follows the intuition of an autoencoder. For this reason, we will go through two similar recommender systems, even though they lack explainability properties. The two recommenders use autoencoder models, applying collaborative filtering techniques: AutoRec [36], which is used by Li et al. [1], and the Collaborative Denoising Auto-Encoder (CDAE) proposed by Wu et al. [37]. In both cases, the ratings of a user are given as input to an autoencoder, which is trained to reconstruct them. The activations of the hidden layer can be perceived as implicit user modeling, as mentioned by Li et al. [1]. The CDAE adds implicit information describing the user, by expanding the input differently for each user.
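A minimal sketch of the AutoRec-style reconstruction objective, assuming a single hidden layer with a sigmoid activation and a mask that restricts the loss to observed ratings (dimensions and initialization are illustrative, not taken from the cited papers):

```python
import numpy as np

def autorec_loss(ratings, mask, W, V):
    """AutoRec-style reconstruction: hidden = sigmoid(V @ r),
    r_hat = W @ hidden; the squared error counts only observed ratings."""
    hidden = 1.0 / (1.0 + np.exp(-V @ ratings))  # implicit user model
    r_hat = W @ hidden                           # reconstructed ratings
    return float((((r_hat - ratings) ** 2) * mask).sum())

rng = np.random.default_rng(0)
n_items, n_hidden = 50, 10
ratings = np.zeros(n_items)
ratings[[1, 5, 9]] = 1.0
mask = (ratings != 0).astype(float)  # only observed entries contribute
V = rng.normal(0, 0.1, (n_hidden, n_items))
W = rng.normal(0, 0.1, (n_items, n_hidden))
loss = autorec_loss(ratings, mask, W, V)
```

The `hidden` vector here is the implicit, non-interpretable user model that the CAAE replaces with an explicit category distribution.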

4.1.2 Explainable Recommenders

The explainable recommenders are split according to their limitations: (a) they require prior user knowledge, (b) they cannot be applied in real time, or (c) the form of the explanations is not conversation-friendly, or the explanation is not debatable (e.g. similar users or similar items). Many studies fall into more than one of these categories, but we will only refer to each of them once.

Require Prior Knowledge Regarding the User As stated in the problem definition (Section 3), the explanation needs to involve only information accessible from the current conversation and should not be based on any prior user knowledge. The following explainable recommenders depend on information of these types.

The first two approaches require information about the user from social networks [18, 19]. Ji and Shen [18] used tags obtained from social networks and keywords that describe items; they then jointly model users, items and keywords in a common representation space using Matrix Factorization. Tags around the items and users can be used for explanation. Ren et al. [19] applied a latent variable model in order to predict item ratings using user reviews and social relations. Their approach assumes that items belong to categories, each category has different viewpoints, and each user focuses on a subset of viewpoints. This approach is intuitively similar to ours, but requires social relationships and user reviews.

Other approaches require the preferences of the user over the aspects of items [20, 21]. Specifically, Hou et al. [20] apply matrix factorization on item-aspect and user-aspect information, predicting item-aspect quality (the IAQ matrix) and user-aspect preference (the UAP matrix) respectively. These predictions are jointly used to predict the user-item rating, and the aspects are used for explanation. On the other hand, Zheng et al. [21] perform recommendation of items and tags at the same time, and the tags are used as explanation. The method proves that co-training for tag recommendation improves the item recommendation results, similarly to our model, where we prove that first predicting the category preference and then recommending improves the recommendation performance. In contrast to our method, [21] requires user-tag and item-tag rating matrices, and the predicted tag can be argued to be correlational rather than causal, since it does not affect the item recommendation directly, which is the case in the CAAE.

Four more approaches require the user's item reviews in order to understand her preferences [22, 12, 23, 9]. Wang et al. [22] apply Tensor Factorization in order to co-relate users, items and features. The predicted preference of a user over the features of a recommended item is used as explanation. Chen et al. [12] and Cheng et al. [23] perform user modeling on a predefined number of topics of arbitrary meaning. These topics are correlated with visual features of items' images, and highlighting the correlated image areas that best describe the preferred topics is used as explanation. The former use neural attention, while the latter apply Matrix Factorization. Both also require presenting the image of the item, which might not be available in every conversational recommender system. Tao et al. [9] proposed the use of regression trees and rule-based generated explanations. The tree nodes represent item features and are selected from user reviews, which are again additionally needed. Matrix factorization is applied for modeling users and items in a common latent space.

Last but not least, some of the latest approaches extract item features from textual descriptions, including reviews or product summaries, in order to identify certain features that can be used for providing explanations [24, 25, 26]. Nevertheless, these approaches also come with significant limitations, such as the fact that different items have different attributes, the same attribute can be described in many ways, and new items may lack textual descriptions such as reviews. Furthermore, [27] shows that the most useful review, for user-item rating prediction, is the one made by the same user. But this might not be particularly practical in real applications, since such information is expected to be unavailable for new user-item predictions.

Cannot Be Applied in Real Time Some additional models are similar to ours, but they are strictly bounded by computational limitations and therefore cannot be applied in a real-time conversational setting. Lin et al. [28], Tal et al. [29] and Samih et al. [30] perform point-wise comparisons: the model needs to be executed separately for each item in the dataset to predict the item's rating for a user. Specifically, Lin et al. [28] recommend candidate outfits that match a given outfit (an outfit can be a top or a bottom), where the matching score is predicted by applying mutual attention on the visual feature space of the outfits, and a GRU translates this mutual attention into a comment which is used as explanation. Similarly, Tal et al. [29] apply attention on the feature space, between a candidate item and the previous items of a user (each feature has an embedding, allowing for attention in the feature space among items). Again, the model can evaluate only one candidate item at a time, and when it finds the best-matching item to recommend, it uses the features that received a lot of attention for explaining the recommendation. Furthermore, Samih et al. [30] studied how a knowledge graph and a set of rules can explain or justify a recommendation while calculating the recommendation probability as well. For example, one can query the model "Why User 4 will choose Item 3?" and get a recommendation probability and a set of "fired" explanation rules. Some of the generated explanations look like "User 4's common friend User 2 and User 1 have chosen Item 3" and "Item 3 is tagged sport like Item 5 which has already been selected by User 4" [30]. The second example is similar to the form of explanations provided by the CAAE, but the model has to be applied to each item separately.
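The computational argument can be illustrated with a toy comparison: a point-wise scorer must be invoked once per candidate item, while a list-wise model such as the CAAE produces all item scores in a single pass. The dot-product scorer below is a stand-in for illustration, not any of the cited models:

```python
import numpy as np

calls = {"pointwise": 0}

def pointwise_score(user_vec, item_vec):
    # Point-wise models need one forward pass per candidate item.
    calls["pointwise"] += 1
    return float(user_vec @ item_vec)

rng = np.random.default_rng(0)
user = rng.normal(size=8)
items = rng.normal(size=(1000, 8))

# Point-wise: 1000 separate evaluations for 1000 candidates.
scores_pw = [pointwise_score(user, it) for it in items]

# List-wise: all 1000 item scores in a single matrix-vector product.
scores_all = items @ user
```

For a real catalogue of millions of items, the per-candidate invocation cost is what makes the point-wise models impractical in a live conversation.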

Explanation Not Useful for Our Case Other explainable recommenders combine graph or tree traversal with rule generation in order to generate an explanation [38, 39]. A rule-guided neural recommendation model has been introduced by Ma et al. [38]. This method requires a graph that contains items and their in-between relations. The recommendation follows an item-to-item setting, and the in-between relations are used as explanations. Given the problem definition (Section 3), where a new user is having a conversation with a chatbot and only a handful of movies are mentioned, using the mentioned movies as an explanation is obvious. The second model utilizes a Conditional Restricted Boltzmann Machine (RBM) [39]. This model, besides predicting item ratings, also predicts the rating distribution of similar (anonymous) users. Thus, the explainability approach of the method is based on the rating distribution of similar users, which is not self-explanatory, and is the expected behaviour of CF algorithms by default. Given our motivation, it should also be mentioned that these explanations are not debatable and would not allow the user to comment on them and the system to utilize that feedback.

4.1.3 Conversational Recommender Systems

Recently an actual conversational recommendation dataset (ReDial) [1] was introduced to the research community. Earlier research was conducted either in collaboration with tech companies on datasets that are not publicly available, or using synthetic conversations. We will separate previous work on conversational recommender systems based on that. To the best of our knowledge, no conversational recommender system study has explainable properties.

Studies Performed on Real Conversations There are four relevant studies based on real conversations. In the first three [1, 40, 41], the aim is to develop a recommender system that participates in the conversation. On the other hand, the fourth study [17] only observes a conversation between two users and recommends advertisements accordingly.

The creators of ReDial [1] applied three modules in order to build a conversational recommender system: an item recommender, a targeted sentiment analysis module and a natural language response generation module. The recommender is an autoencoder that applies a collaborative filtering technique in order to predict movie ratings, following the approach of Sedhain et al. [36] as described before. The recommender follows a black-box setting, without any explainable properties. Sentiment analysis on mentioned movies is performed with the encoder part of a Hierarchical Recurrent Encoder Decoder (HRED) [42]. For text comprehension and Natural Language Generation (NLG), the authors use the encoder part of the HRED and the complete HRED, respectively. A switch mechanism [43] is applied on the hidden activations of the HRED. This mechanism decides whether the next token to be produced should be a word or a movie. The switch probability value scales the word probabilities and the movie ratings, which are all concatenated together to form an output probability distribution over words and movies. The first two modules are pretrained separately. Additionally, the encoder of the applied HREDs is modified to take general purpose sentence (GenSen) representations from a bidirectional GRU. Pretrained GenSen representations are used [44], due to the small amount of text in the dataset.
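A minimal sketch of such a switch mechanism, assuming a scalar switch probability that scales two softmax distributions before concatenating them (the vocabulary and catalogue sizes and all names are illustrative):

```python
import numpy as np

def switch_output(word_logits, movie_scores, switch_prob):
    """Switch mechanism sketch: a scalar probability decides whether the
    next token should be a word or a movie; the two distributions are
    scaled by (1 - p) and p and concatenated into one output
    distribution over words and movies."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    words = (1.0 - switch_prob) * softmax(word_logits)
    movies = switch_prob * softmax(movie_scores)
    return np.concatenate([words, movies])

rng = np.random.default_rng(0)
out = switch_output(rng.normal(size=30), rng.normal(size=10),
                    switch_prob=0.2)
```

Because both halves are scaled by complementary probabilities, the concatenation remains a valid probability distribution over the joint word-plus-movie output space.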

The two following studies, contrary to our recommendation method, use very complicated approaches and do not work in an explainable way. Specifically, they both rely on graph information and direct entity relations. The first of them is the only additional study developed on the ReDial dataset [40]. In that study, the text generation model is replaced with a transformer [45], but the main difference lies in the recommender model. External knowledge in the form of a graph is introduced from DBpedia [46]. Graph nodes can either be item entities or non-item entities (e.g. the property "science fiction film"). Chen et al. [40] initially identify items in the conversation through name matching and then apply entity linking [47] in order to expand the matches between the dialogue and the graph by matching non-item entities. A non-item entity is, for example, Science fiction film, which can be matched with the sentence "I like science fiction movies" [40]. A Relational Graph Convolutional Network is applied in order to recommend items based on the entities that appear in a dialogue. In the second study [41], the conversational recommender system proposes venues to travelers. The recommendation in this case is based on three aspects. Firstly, a topic is predicted given the ongoing conversation. Secondly, conversation and venue similarity is calculated at the textual level, based on the venue's textual description. Thirdly, a graph is formed based on venue information, and a graph convolutional network is applied for improving venue relations, given their descriptions and details.

Last but not least, the conversational recommender system proposed by Albalawi et al. [17] has a very similar intuition to our approach. Their task is to provide a real-time conversational recommender system that observes a conversation and proposes relevant advertisements; they apply it to conversations between users of online social networks. They train a Latent Dirichlet Allocation (LDA) model that performs topic modeling on the conversations and the advertisements, and propose the advertisement that is most similar by comparing the topic distributions. In contrast to our approach, their topics are of arbitrary meaning and do not have explainable properties, whereas our approach performs topic modeling on explainable topics that are the same as the categories of the items.
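The topic-matching step can be sketched with hand-picked toy distributions; the three topics and their values below are invented for illustration (the real system learns the distributions with LDA), and cosine similarity is one plausible choice of comparison:

```python
import numpy as np

def most_similar(conv_topics, ad_topics):
    """Pick the ad whose topic distribution is most similar (by cosine
    similarity) to the conversation's topic distribution."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = [cos(conv_topics, ad) for ad in ad_topics]
    return int(np.argmax(sims))

conv = np.array([0.7, 0.2, 0.1])  # conversation: mostly topic 0
ads = np.array([
    [0.1, 0.8, 0.1],  # ad 0: concentrated on topic 1
    [0.6, 0.3, 0.1],  # ad 1: concentrated on topic 0
    [0.2, 0.2, 0.6],  # ad 2: concentrated on topic 2
])
best = most_similar(conv, ads)  # selects ad 1, which shares topic 0
```

With the CAAE the same comparison would happen over named item categories instead of arbitrary latent topics, which is what makes the match itself explainable.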

Studies Performed on Synthetic Conversations Studies on conversational recommendation systems have also been performed using synthetic conversations based on item information and reviews, but they never provide explainable recommendations. Two of them apply reinforcement learning techniques in order to predict the correct item in the minimum number of dialogue steps. The first one [48] focuses on predicting item-facet values while acknowledging uncertainty. If the model is certain enough about the facet values, then it recommends a set of items; otherwise, it asks the user for further information regarding the most uncertain facet. The dialogues are formed from a set of predefined sentences with entities that are replaced with facet names and values. The second one [49] constructs the dataset similarly, but user characteristics are explicitly asked for during the conversation and they affect the item-rating prediction.

4.2 Text Comprehension

Even though the main contribution of this study is the CAAE model and its intuition towards explainable recommendation, it is extended with a Bidirectional Encoder Representations from Transformers (BERT) [31] model in order to be applied in a conversational setting. Two approaches are proposed and compared: (a) applying Target Sentiment Analysis (TSA) in order to predict the user's sentiment over the mentioned items and then using that as input to the CAAE, and (b) applying Category Sentiment Analysis (CSA) to predict the user's preferences directly from the text and later decode them into item recommendation scores. BERT is used for both tasks, so the necessary background regarding BERT's input preprocessing and application will be presented, followed by relevant work that justifies the way BERT was applied for each task. Finally, when CSA is applied, the user's tokens that can justify the predicted categories are highlighted using BERT's attention. Vashishth et al. [35] recently showed that attention values are vital for the model's prediction when the model applies self-attention, as BERT does. On the other hand, when more than one sentiment is present in a piece of text, the model's attention gets distracted; a problem known as attention distraction [50]. Three approaches are put to the test for tackling this problem, so two studies that attempt the same will be reviewed.


4.2.1 BERT

Transformer models (first introduced by Vaswani et al. [45]) consist of an encoder and a decoder and have led to significant performance improvements in a large variety of Natural Language Processing (NLP) tasks [45, 31, 51]. Devlin et al. [31] proposed BERT, which applies only the transformer encoder in a bidirectional setting, where the hidden representation produced for each token of the input is affected by every token in the input (both the ones preceding it and the ones that follow), achieving state-of-the-art results in many text comprehension tasks. Consequently, BERT models were used either to predict the user's sentiment over mentioned items and then apply the CAAE, or to directly infer the user's category preferences and recommend items accordingly. BERT is applied to each token of the input and produces a hidden representation for each of them, which we will call from now on the hidden representation of that token. For further details regarding BERT's function please refer to the original study of Devlin et al. [31].

Architecture BERT [31] is a Language Model (LM) whose architecture is based on the encoder of a Transformer [45] model, but applies bidirectional attention instead. A Transformer consists of transformer layers that linearly project the input space into several sub-spaces of smaller dimensionality, where self-attention is applied independently; each sub-space has its own self-attention parameters and is known as an attention head. This way, each transformer layer provides a contextual representation for a token that depends on the input text, for each attention head. Different attention heads pay attention to different aspects of the input, as their parameters are independent and trainable. The original Transformer model provides context for each token that depends only on the tokens that precede it. BERT, on the other hand, applies bidirectional context by paying attention to the complete text that surrounds the input token from both directions.

Pre-training Part of BERT's performance is due to the fact that it is initially pre-trained in an unsupervised manner and then fine-tuned for a specific task. Even though in this study no pre-trained versions are used, the text preprocessing follows the same guidelines, so BERT's application on text will be reviewed. BERT is pre-trained on a Masked LM task, where a piece of text is given as input and 15% of its terms are selected as prediction targets. The task is to predict the selected terms, but each selected term has been altered in the input in one of the following three ways: 80% of the time it is masked and replaced by a [MASK] special token, 10% of the time it is replaced by a random term, and the remaining 10% of the time it is left unaltered. In every case the target of the sample is the original term. This way, BERT learns to encode a representation for an input token that depends on the surrounding tokens to such a degree that the input token can be predicted from them alone. BERT is also pre-trained on the task of binarized next sentence prediction. During this task, the model needs to predict whether a sentence B follows another sentence A. A and B are tokenized, concatenated, and two special tokens are added: [SEP] is put between them and [CLS] is put at the beginning of the first sentence. The model classifies whether sentence B follows A or not using the hidden representation of the [CLS] token. After pre-training is complete, BERT is fine-tuned for a specific task in a supervised manner. This way, BERT achieved state-of-the-art results at the time of publication in a wide range of NLP tasks, including question answering and language inference [31].
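As a concrete illustration, the 15% / 80-10-10 masking scheme described above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual BERT implementation; the token list and the tiny replacement vocabulary are made up.

```python
import random

def mask_tokens(tokens, rng, mask_token="[MASK]", vocab=("the", "a", "movie")):
    """Select ~15% of the tokens as prediction targets; of those,
    80% are replaced by [MASK], 10% by a random term, and 10% are
    left unaltered. The target is always the original token."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:          # this token becomes a prediction target
            targets[i] = tok
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_token   # 80%: mask it
            elif roll < 0.9:
                inputs[i] = rng.choice(vocab)  # 10%: random replacement
            # else: 10%: keep the original token in the input
    return inputs, targets

tokens = [f"w{i}" for i in range(100)]
inputs, targets = mask_tokens(tokens, random.Random(0))
```

Note that positions whose target is `None` do not contribute to the Masked LM loss.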

Token Embeddings Each token is accompanied by three embeddings: the token's embedding, a positional embedding and a segment embedding. The positional embedding compensates for the fact that the network is not recurrent and needs a way to describe the position of each token in a sentence; it was introduced by [45]. The segment embedding allows the model to understand whether the token belongs to sentence A or B. The final form of a token embedding, which is given as input to BERT, is the summation of the three aforementioned embeddings. Each embedding can be pre-trained, fixed, or trainable. The encoder outputs a hidden representation for each token that is given as input to a variety of classifiers depending on the task, usually consisting of one linear layer and an activation function.
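The composition of the three embeddings can be sketched as follows with NumPy lookup tables. All sizes and the example token IDs are illustrative, and the tables stand in for the (potentially trainable) embedding matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 5000, 128, 64     # illustrative sizes

token_table = rng.normal(size=(vocab_size, hidden))
position_table = rng.normal(size=(max_len, hidden))
segment_table = rng.normal(size=(2, hidden))    # segment 0 = sentence A, 1 = sentence B

token_ids = np.array([101, 7, 8, 102])          # e.g. [CLS] w1 w2 [SEP] (made-up IDs)
segment_ids = np.array([0, 0, 0, 0])
position_ids = np.arange(len(token_ids))

# The input to the encoder is the element-wise sum of the three embeddings.
x = token_table[token_ids] + position_table[position_ids] + segment_table[segment_ids]
```

Each row of `x` is one token's final input embedding of dimensionality `hidden`.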

4.2.2 Target Sentiment Analysis using BERT

The review of Sentiment Analysis (SA) models will be brief, as our main contribution is the CAAE explainable recommender, and we only utilize an SA module in order to extend the recommender's application to the conversational domain. Therefore, we will only mention relevant work that led to our SA module's design. The task of Target Sentiment Analysis (TSA) is to predict the sentiment of a sentence over a target that is mentioned in the sentence. As mentioned before, the hidden representation of the [CLS] token is used for semantic extraction. However, when it comes to the TSA task, Gao et al. [52] showed that it is better to use the hidden representation of the target token(s), even though they applied it on sentences where only one target was present. Li et al. [53] applied the same approach to sentences that contain more than one target and showed that applying only one linear layer after the hidden representation of the target is enough to predict the sentiment. Additionally, Li et al. [53] and Song et al. [54] fine-tuned a pre-trained version of BERT, achieving state-of-the-art results and proving further the effectiveness of this model on this task. Finally, Lei et al. [55] studied the problem of SA in the microblog domain (e.g. tweets from Twitter) and showed that the model performs better when previous content (e.g. previous tweets) is given as input as well. For the above reasons, we used the BERT architecture for predicting the sentiment of the user over an item mentioned in the conversation. We use the hidden representation of the mentioned item and give all past utterances as additional input.

4.2.3 Category Sentiment Analysis Using BERT

Our work on category preference prediction from text is very similar to that of Hu et al. [56], who predict sentiment over an aspect using BERT. Specifically, they specify the aspect by concatenating it at the end of the input (after a [SEP] special token). Instead, since the aspects in our case are the categories, we introduce one special token for each category ([CAT 1], ... , [CAT |C|]) and concatenate them at the beginning of the input rather than at the end of it. Also different from our approach, their extra tokens are used in order to select a subsequence of the input (using reinforcement learning), and only that subsequence is used for predicting the sentiment over each aspect. We, on the other hand, utilize the complete input and use the hidden representations of the extra tokens for predicting the user's sentiment over each category.

4.2.4 Dealing With Attention Distraction on BERT

Our aim is to explain to the user what led the model to form a description of her and, subsequently, why each item was recommended. To that end, we utilize parts of the user's utterances as a means to explain why the model finds that some categories characterize her interests. But, as mentioned earlier, result-dependent attention is hard to localize when more than one sentiment is present in the input. Two relevant studies will be presented that try to tackle this problem by forcing the model to focus only on a subset of words that affect the meaning of the target. Xiao et al. [57] utilize BERT for generating sentence-aware word embeddings (which are BERT's hidden representations for each token) and then apply a graph convolutional network on the syntactic dependency parse tree so that only tokens that describe the target are taken into account. Hu et al. [56] use reinforcement learning in order to select a subset of the input tokens and only use these. Subsequently, the corresponding attention is distributed only among the tokens that describe the sentiment polarity over an aspect. We try different approaches to tackle this problem on top of BERT. One of them uses one special token for all categories, and the other two use one special token per category. For the latter case, we further experiment with processing the hidden representations of these tokens either with the same trainable linear layer, or with a different one for each token.

Given the relevant work presented, we argue that our approaches are innovative and grounded in detailed literature research. Furthermore, this study is the first one that opens the way towards explainable conversational recommender systems that would allow feedback handling both on the recommended item and on the prediction of the user's preferences.

4.3 Research Questions

The scope of this study is focused on developing more explainable recommender systems and applying them in a conversational setting, resulting in the following three research questions.

1. RQ1: Can categorical information be used for user modeling and subsequently for item recommendation, making the recommendation more intuitive and explainable?

2. RQ2: What is the influence of extracting the user's categorical preferences from the conversation in different ways: using the predicted sentiments of the mentioned items, or using the linguistic part of the conversation alone? Do these two approaches result in the same user description?

3. RQ3: Which of the three proposed approaches best deals with the attention distraction problem when trying to identify important tokens in the conversation that can support the predicted category preferences?


5 Methodology

In this section, we will present our contribution towards explainable conversational recommender systems. First, we will explain how a user's item ratings are aggregated into descriptive categorical information for that user. Then we will describe the CAAE model and its training procedure. An extension of the CAAE model follows, which is applied in the conversational domain while following the same principle: recommending items using the category preference distribution as a basis. Finally, for the approach where the category preference is predicted directly from the text alone, the different methods for identifying important tokens that can be used for justifying the category prediction will be presented.

The key idea behind CAAE is to first predict the user's preferences over item categories and then provide recommendations on that basis. This is performed by having a common categorical representation space for both the items and the users. We make the assumption that the user's category preferences can be better approximated as the conversation develops and more information about the user's preferences becomes available. We use the categorical information of the items that are mentioned in the conversation, together with the user's sentiment over them, in order to estimate the user's CPD; consequently, the user's true CPD is best approximated only when the conversation has come to an end and as many items as possible have been mentioned. Our proposed approach tries to anticipate the true CPD of the user while the conversation is developing and to recommend items based on that. To that end, the encoder predicts a CPD that tries to anticipate the true CPD of the user, given only the items mentioned in an ongoing conversation, in which more items are expected to be mentioned. Then, the decoder tries to recommend items using the predicted CPD. This approach is intuitive and allows the predicted desired categories of the user to be used for explaining the recommendation.

5.1 Building a User Category Preference Distribution Vector

During this study, we focus on the item recommendation problem in the case where items fall into a predefined set of categories. An item can belong to more than one category. Each user has rated a number of items, which is assumed to be enough for calculating her true CPD. Specifically, let us consider that movie i, denoted as m_i, is described by a binary categorical vector c_{m_i} of size C, describing the categories that this movie belongs to, where C is the number of categories. Furthermore, considering a user u who has rated N movies m_1, m_2, ..., m_N with the ratings r_1, r_2, ..., r_N respectively, where r_i ∈ {−1, 1} (negative ratings are set to −1 and positive ones to 1), we calculate the user's CPD vector as described in Equation 1.

CPD = Softmax( Σ_{i=1}^{N} c_{m_i} · r_i )    (1)
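Equation 1 can be illustrated with a short NumPy sketch; the category vectors and ratings below are made-up toy values, not data from our experiments.

```python
import numpy as np

def user_cpd(category_vectors, ratings):
    """Compute a user's Category Preference Distribution (Equation 1).

    category_vectors: (N, C) binary matrix; row i marks the categories of movie m_i.
    ratings: (N,) vector with entries in {-1, 1}.
    """
    scores = (category_vectors * ratings[:, None]).sum(axis=0)  # signed category counts
    exp = np.exp(scores - scores.max())                         # numerically stable softmax
    return exp / exp.sum()

# Toy example: 3 rated movies, 4 categories.
c = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 1]])
r = np.array([1, -1, 1])
cpd = user_cpd(c, r)
```

Here category 0 appears in two positively rated movies, so it receives the largest preference mass.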

5.2 Category Aware AutoEncoder

Autoencoders can be trained for collaborative filtering, as proposed by [36, 58]. The input (and output) dimensions are set equal to the total number of items, so that each dimension represents the rating of a specific item. A set of a user's item ratings forms a sparse input vector which the model is trained to reconstruct, ignoring the reconstruction loss of the unknown ratings for the remaining items. That being said, an autoencoder is trained in an unsupervised manner. The autoencoder is forced to compress the information that is given as input into a space of lower dimensionality, and then reproduce it from the compressed representation. This way, the hidden representation distils the important information of the given input, which in our case can be perceived as implicit user modeling, as Li et al. [1] mention.

The proposed version of the autoencoder, even though it consists of an encoder and a decoder, is not trained end-to-end but in two steps, and the training is not completely unsupervised. The main principle of CAAE is to anticipate the true CPD of a user, given only a handful of her item ratings, and use that as a basis to recommend items. To that end, we first train the encoder to predict the true CPD of the user (whose ratings were given as input) in a supervised manner, since we provide the true CPD of that user as target. In this work, and differently from past work, the dimensionality of the hidden representation space is equal to the number of categories that describe the items, and each dimension corresponds to a specific category. This way, user modeling takes place in a categorical space. For this reason, the encoder is named the category encoder. After the training of the category encoder is complete, we extend the encoder with a decoder module and train it. The decoder takes as input the predicted user's CPD and is trained on reconstructing the input that was given to the encoder. For this reason, the decoder is called the item rating decoder. During the training of the decoder, the encoder's parameters are kept unchanged, ensuring that the encoder's output represents category preferences. Similarly to a vanilla autoencoder, the ultimate task is an unsupervised one, and the input space is shared with the output space.

Figure 1: A conceptual depiction of the proposed CAAE recommender model. A use case for the domain of movie recommendation is annotated with blue labels.

The complete CAAE is presented in Figure 1, with an example of CPD prediction and item recommendation for the domain of movies.

5.2.1 Category Encoder

The category encoder is a feed-forward network that consists of linear layers with activation functions between them. Its output activation function is the softmax function, ensuring that the output is a normalized distribution. The encoder is described by Equation 2, where one linear layer is used for simplicity.

predicted CPD = Softmax( W_e × item ratings + b_e )    (2)

Where W_e and b_e, of size C × M and C respectively, are the trainable parameters of the linear layer; item ratings, of size M, is the sparse vector of the user's item ratings that is given as input; C is the total number of categories and M is the total number of items. The predicted vector is of size C. The encoder is trained in a supervised manner on the task of predicting the true CPD of the user whose ratings formed the input. The Root Mean Square Error (RMSE) is used as the loss function. The training follows the denoising setting proposed by [58], where only a random subset of the ratings is given as input each time. This way, the encoder tries to foresee the true CPD of the user given only a subset of her ratings.

5.2.2 Item Rating Decoder

The decoder takes as input the output of the encoder: the predicted CPD from Equation 2. It is trained in an unsupervised manner on reconstructing the sparse vector of user ratings which was given as input to the encoder; the item ratings vector from Equation 2. The decoder is a feed-forward network that consists of linear layers with activation functions between them. The last activation function is set to be the sigmoid function, because the ratings are in the range [0, 1]. Finally, the predicted item ratings are used as recommendation scores and the item with the highest score is recommended. The function of the item rating decoder is described in Equation 3. Again, one linear layer is used for simplicity.

predicted item ratings = Sigmoid( W_d × predicted CPD + b_d )    (3)

Where W_d and b_d, of size M × C and M respectively, are the trainable parameters of the one linear layer that is used as an example. The parameter predicted CPD, of size C, is the encoder's output from Equation 2, and the output, of size M, represents the predicted ratings over all items. The RMSE is again used as the loss function. Similarly to the encoder's training, a denoising setting is followed, where a random subset of the ratings is given as input to the encoder, but all known ratings are expected at the output of the decoder and are used for the loss calculation.

5.3 Category Aware Conversational Recommender

The conversational setting introduces a different form of input and a different task to solve. Consider an ongoing conversation between a Seeker and a Recommender, where the former describes her preferences and needs, while the latter recommends items towards the Seeker's taste. In this setting, the user's information is not given explicitly as a sparse vector of item ratings, but has the form of a conversation in which some items are mentioned, usually surrounded by some sentiment information. Furthermore, the task that we are called to perform is to mimic the Recommender's behaviour and predict the next item that he is about to recommend, given an ongoing conversation rather than the complete conversation.

In order to apply our approach in this conversational setting, we use a BERT model [31] as a text comprehension model. Following the same principle as CAAE, we perform item recommendation based on a predicted CPD of the Seeker. Taking into account the conversational aspect of the task, we study whether it is better to predict the CPD based on estimated ratings of mentioned items, or to predict it directly from the linguistic part of the conversation alone, ignoring the items mentioned. The first approach performs Targeted Sentiment Analysis (TSA), where the sentiment towards a targeted term in a sentence is predicted; in our case, a mentioned item. After the TSA prediction model is trained, we extend it with a CAAE, where again we first train the encoder, given the predicted TSA scores, and then the decoder. For the second approach, we perform Category Sentiment Analysis (CSA) for predicting the CPD of the user directly from the text. After the training is finished, we extend the model with an item rating decoder. The text comprehension model is either trained for the TSA task or for the CPD prediction task directly. In both cases, the CPD targets are defined as described in Subsection 5.1, based on all the movies mentioned in the conversation and their ground truth ratings. This way, even though all the mentioned items of the complete conversation are used for forming the CPD target, the model is only given the ongoing part of the conversation while trying to predict that target. Thus the model again tries to foresee the true CPD of the user, given only the current part of an ongoing conversation. This can be perceived as an analogous denoising training setting, where only a subset of the targets is given as input but the complete target needs to be reconstructed, as described at the end of Subsection 5.2.2.

5.3.1 Text Comprehension Model

Our text comprehension model is BERT [31] extended with one linear layer that uses BERT's output (hidden representation). Depending on the task, we either add one linear layer for TSA prediction or one for CPD prediction. Our BERT model has different architectural hyper-parameters (hidden size, number of layers, etc.) than the publicly available pre-trained one, and is trained from scratch. Furthermore, we apply the tokenizer that was used by Li et al. [1] rather than the BERT tokenizer, in order to have comparable results and also to trace BERT's attention to complete tokens, since the BERT tokenizer might split a term into sub-tokens. Additionally, the formation and pre-processing that we apply to the input is an extension of BERT's approach, as described in Subsection 4.2.1. We will now describe how BERT is utilized for the conversational domain and for the TSA and CPD prediction tasks. The conversation consists of utterances from the Seeker and the Recommender, some of which may contain one or more mentioned items. We will first explain how a conversation is pre-processed on the utterance level and then on the conversation level. An example of the complete input pre-processing and use of the text comprehension model is presented in Figure 2.

Utterance Level Pre-processing Each utterance is tokenized regardless of the sender. A special token is added at the beginning, denoting the Start Of the Utterance, and one at the end signaling its ending; tokens [SOU] and [EOU] respectively. The model needs to understand that mentioned items belong to a common set of items, and ignore the name of each item. For this reason, each Mentioned Item is replaced by the special token [MI]. The hidden representation of the [MI] token will be given as input to a classifier that is trained for the TSA task, having as target the sentiment of the Seeker for that mentioned item.

Conversation Level Pre-processing After each utterance has been pre-processed, we concatenate all past utterances in their original order, providing the necessary context and forming one sample for each Recommender response. The special token [SEP] is placed between the utterances. At the same time, the model needs to understand who the sender of each token and utterance is. In order to achieve this, we denote the sender by adjusting the segment embedding of each token. To that end, there are two segment IDs, 0 and 1, which are translated into embeddings using a trainable linear layer. The positional ID of each token is equal to its position in the final form of the processed conversation, which again takes the form of an embedding through a trainable linear layer. Finally, following BERT's technique for extracting meaning on a multi-sentence level, we add a special [CLS] token at the beginning of the input. The hidden representation of this token will be given to a classifier that is trained on predicting the CPD of the Seeker.
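The two pre-processing levels can be sketched as follows. This is a simplified illustration: the tokenizer is a plain whitespace split, the example conversation is made up, and the [SEP] tokens here inherit the segment ID of the preceding utterance.

```python
def preprocess_utterance(text, mentioned_items=()):
    """Utterance level: tokenize, replace mentioned items with [MI], add [SOU]/[EOU]."""
    tokens = ["[MI]" if tok in mentioned_items else tok for tok in text.split()]
    return ["[SOU]"] + tokens + ["[EOU]"]

def preprocess_conversation(utterances):
    """Conversation level: concatenate utterances with [SEP], prepend [CLS],
    and assign segment IDs (0 = Seeker, 1 = Recommender) and position IDs."""
    tokens, segments = ["[CLS]"], [0]
    for i, (sender, text, items) in enumerate(utterances):
        utt = preprocess_utterance(text, items)
        if i > 0:
            utt = ["[SEP]"] + utt          # [SEP] between consecutive utterances
        seg = 0 if sender == "seeker" else 1
        tokens += utt
        segments += [seg] * len(utt)
    positions = list(range(len(tokens)))   # positional ID = index in final sequence
    return tokens, segments, positions

conversation = [
    ("seeker", "I loved Titanic", ("Titanic",)),
    ("recommender", "Have you seen Avatar ?", ("Avatar",)),
]
tokens, segments, positions = preprocess_conversation(conversation)
```

The resulting `tokens`, `segments` and `positions` lists correspond to the three embedding lookups described in Subsection 4.2.1.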


Figure 2: A conceptual depiction of the proposed text comprehension model put into use. An example of an ongoing conversation is given at the bottom of the figure. From bottom to top, the complete pipeline of the text comprehension system is shown. The segment embeddings denote the source of each token: Seeker (orange) or Recommender (grey). The hidden representation of the [CLS] token (blue) is used for Category Preference Distribution prediction, while that of the [MI] token (red) is used for Targeted Sentiment Analysis prediction. Even though the complete input is used by the model, only the hidden representations of the special tokens [CLS] (blue) and [MI] (red) have a training target. The rest of the tokens are only used as context providers. The two tasks are designed so that they can be trained and used individually. Nevertheless, it is possible for the model to be jointly trained on both tasks at the same time and to be used for both as well.

5.3.2 Conversational Recommendation Models

There are three proposed approaches for explainable conversational recommendation: (a) Using Items: recommend based on mentioned items, (b) Using Text: recommend based on the conversation (ignoring the items), and (c) Using Both: combine the previous two. A high-level description of the three proposed approaches is rendered in Figure 3. The first model performs TSA on the mentioned items and uses the predicted sentiment as input for the CAAE model. The second one predicts the CPD of the user directly from the text of the conversation, ignoring mentioned items, and then uses an item rating decoder in order to come up with recommendations. The third approach combines the first two in order to predict the user's CPD, and then recommends accordingly. The user's CPD that is used as the training target is the true CPD calculated from all of the conversation's mentioned items with the Seeker's given sentiment, as described in Subsection 5.1. This way, even though the model is only given an incomplete part of the conversation, because it is an ongoing conversation, it still tries to anticipate the user's ultimate CPD, following CAAE's principle.

The output hidden representations of the text comprehension model are used as input for a recommender module. The hidden representations of the special tokens [MI] and [CLS] are the only ones used; every other token serves as a context provider for those. The first is used for TSA and the second for CPD prediction. In the former setting (Using Items), after the training of the text comprehension model is complete, we add a CAAE on top of it that uses the predicted TSA values as input. The CAAE, again, is trained in two steps, as described in Subsection 5.2. In the latter case, where the base model is trained for the CPD prediction task (Using Text), we extend the model with the Item Rating Decoder module of the CAAE alone, predicting item ratings from the estimated CPD directly. We can train the text comprehension model for either task, or jointly for both. In the joint case, the losses of the two tasks are normalized so that they are in the range [0, 1]. This is performed using min-max normalization, where the min and max values are obtained from the evaluation set before every training epoch. Before each training step, we linearly interleave the two normalized losses using a scalar parameter α ∈ [0, 1]. Regarding the third proposed approach for conversational recommendation (Using Both), a model is trained jointly on the two tasks. Then, the parameters of the text comprehension model are kept unchanged while the TSA prediction is extended by a category encoder that is trained on predicting the user's CPD. Afterwards, the two CPDs are linearly interleaved using the same α parameter that is used for linearly interleaving the losses of the two tasks. The produced CPD is normalized so that it sums up to 1, by dividing each category preference by the sum of the interleaved CPD. Finally, the overall model is extended by an item rating decoder that provides the recommendation scores.
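The interleaving and renormalization of the two predicted CPDs can be sketched in a few lines; the value of α and the toy distributions are illustrative.

```python
import numpy as np

def combine_cpds(cpd_from_items, cpd_from_text, alpha):
    """Linearly interleave two CPDs with scalar alpha in [0, 1],
    then divide by the sum so the result is a distribution again."""
    mixed = alpha * cpd_from_items + (1.0 - alpha) * cpd_from_text
    return mixed / mixed.sum()

cpd_a = np.array([0.5, 0.3, 0.2])   # e.g. CPD predicted via TSA + category encoder
cpd_b = np.array([0.2, 0.2, 0.6])   # e.g. CPD predicted directly from the text
combined = combine_cpds(cpd_a, cpd_b, alpha=0.4)
```

The combined CPD would then be forwarded to the item rating decoder for the final recommendation scores.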

Figure 3: The proposed approaches in which CAAE can be used in a conversational setting with BERT as a text comprehension model. As mentioned in Subsection 5.3.2, BERT can predict the user's sentiment over mentioned items and then be extended with a CAAE trained in two more steps (Top). Otherwise, BERT can directly provide the user's CPD prediction and then forward the prediction to an Item Rating Decoder which produces the item recommendation scores (Middle). Last but not least, BERT can be jointly trained for both the TSA and CSA tasks. Then, the predicted TSA is forwarded to a Category Encoder which is trained for predicting the user's CPD. Finally, the two predicted CPDs are linearly interleaved, normalized and forwarded to an Item Rating Decoder that produces item recommendation scores (Bottom).

5.4 Using Token Attention for Justifying CPD Prediction

Working towards explaining to the user what led the model to its recommendations, we studied whether the user's utterances can also be utilized. To that end, given that the attention of self-attentive models is crucial for the result [35], we use the attention values to point out the tokens that affected the model's result the most. In order to tackle the attention distraction problem [50] we suggest three different approaches. The vanilla approach is to use only one [CLS] token (Vanilla) for predicting the user's CPD, as it appears in Figure 2, and use its attention to point out important tokens. For the other two approaches, we use one [CLS] token for predicting the user's preference over each category separately. Specifically, for the second approach, the hidden representation of each [CLS] token is handled by a different trainable linear function (C L). The underlying assumption is that the attention values of each [CLS] token will only focus on the tokens that affected the prediction of the corresponding category preference. This way, we can point out only the tokens that affected the predicted preferred categories. Regarding the third approach, we studied whether the linear layers on top of each [CLS] token should share parameters (1 L). The motivation behind this setting is that, with different parameters in the linear layers, all [CLS] tokens might produce the same representation and each linear layer would do most of the work of identifying the preference for each category, which would not help with the attention distraction problem. On the other hand, if all linear functions share parameters, then the common linear function can only behave as an on/off switch, forcing each [CLS] token to focus on one category and making the model better suited to solving the attention distraction problem.
The first approach that uses one [CLS] token for predicting the CPD is depicted in Figure 2 and the two approaches that use |C| [CLS] tokens are presented in Figure 4, where |C| is the number of categories.
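The difference between the 1 L and C L variants can be sketched as follows, operating on the |C| per-category [CLS] hidden representations. This is only an interpretation of the two head designs under illustrative assumptions: random stand-in hidden states, a scalar output per category token, and made-up sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
C, hidden = 19, 64
cls_hidden = rng.normal(size=(C, hidden))   # one hidden representation per category [CLS] token

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# 1 L: a single shared linear layer, acting like an on/off switch for every category token.
w_shared, b_shared = rng.normal(size=hidden), 0.0
logits_1l = cls_hidden @ w_shared + b_shared                         # (C,)

# C L: an independent linear layer for each category token.
W_per_cat = rng.normal(size=(C, hidden))
b_per_cat = np.zeros(C)
logits_cl = np.einsum("ch,ch->c", cls_hidden, W_per_cat) + b_per_cat  # (C,)

cpd_1l, cpd_cl = softmax(logits_1l), softmax(logits_cl)
```

In both variants the per-category logits are normalized with a softmax to yield a predicted CPD.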

Figure 4: The two approaches that use in total |C| [CLS] tokens, where |C| is the number of categories. In the middle, the beginning of the input is presented vertically, consisting of |C| [CLS] tokens (one for each category). The rest of the input is omitted for clarity, but is the same as in Figure 2. On the left (green), the hidden representations are linearly projected by the same function (1 L). On the right (red), the hidden representations of tokens that represent different categories are projected by different trainable linear functions (C L).

Aggregating Model's Attention Among Tokens Each attention head of every transformer layer produces two attention values for each pair of input tokens: one from token A towards token B, and one for the reverse direction. We are only interested in the attention values of the [CLS] tokens towards all Seeker tokens, giving in total H × L values for each [CLS] token, assuming H attention heads and L transformer layers. Some parameters need to be set in order to aggregate the attention of the complete model into one value per token. Following the study of Vashishth et al. [35], who find that the attention values of the first layer are the most important ones, we utilize only those. Regarding the attention heads, we keep only the maximum attention value over all heads for each token. We do so because we are interested in attention peaks that show the importance of a token under any attention head; an average over the heads would destroy this information.
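The aggregation rule can be sketched as follows, assuming a hypothetical attention tensor `attn[layer][head][token]` holding the attention from one [CLS] token towards each input token:

```python
# Hypothetical attention values for one [CLS] token:
# L layers x H heads x T tokens.
L_LAYERS, H_HEADS, T_TOKENS = 2, 3, 5
attn = [
    [[0.1, 0.6, 0.1, 0.1, 0.1],
     [0.2, 0.2, 0.4, 0.1, 0.1],
     [0.3, 0.1, 0.1, 0.4, 0.1]],        # layer 0: the only one we keep
    [[0.2] * 5, [0.2] * 5, [0.2] * 5],  # layer 1: ignored
]

first_layer = attn[0]
# One value per token: the peak attention under any head of the first layer.
aggregated = [max(first_layer[h][t] for h in range(H_HEADS))
              for t in range(T_TOKENS)]
print(aggregated)  # [0.3, 0.6, 0.4, 0.4, 0.1]
```

Taking the maximum rather than the mean preserves the attention peak of token 1 (0.6), which an average over heads would have flattened to 0.3.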

Identifying Important Tokens After calculating a single attention value between every [CLS] token (each representing a category) and all Seeker tokens of the conversation, we need to decide which of them are important. We first exclude special tokens and all Recommender tokens. Then the softmax function is applied in order to obtain a normalized distribution over the Seeker's tokens. Finally, important tokens are defined as those that exceed the average attention threshold, which is equal to 1 divided by the number of retained tokens. It should be noted that important tokens are identified over all of the Seeker's utterances so far.
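The selection step can be sketched as follows, with hypothetical aggregated attention values for a few Seeker tokens (special and Recommender tokens are assumed to be filtered out already):

```python
import math

# Hypothetical aggregated attention value per remaining Seeker token.
seeker_attn = {"funny": 2.0, "movies": 1.5, "like": 0.5, "the": 0.1}

# Normalize with a softmax over the remaining Seeker tokens.
exps = {tok: math.exp(v) for tok, v in seeker_attn.items()}
total = sum(exps.values())
probs = {tok: e / total for tok, e in exps.items()}

# A token is important if it exceeds the uniform threshold 1 / N,
# i.e. the attention it would get if all tokens mattered equally.
threshold = 1.0 / len(probs)
important = [tok for tok, p in probs.items() if p > threshold]
print(important)  # ['funny', 'movies']
```

Using the uniform value 1/N as the threshold means a token is flagged exactly when it receives more than its "fair share" of the normalized attention mass.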

Identifying Predicted Categories Since the CPD is a distribution produced by the softmax, all predicted CPD values lie in the range [0, 1]. This does not provide a clear distinction between preferred categories and the rest. For this reason, we set a preference threshold equal to the average category preference value: 1 divided by the number of categories. By applying the same threshold to the true CPD of the user, as described in Subsection 5.1, we are able to define Ground Truth (GT) categories for each user and calculate precision and recall values of the models regarding the category prediction.
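The thresholding and the resulting precision/recall computation can be sketched as follows, with a hypothetical predicted CPD and hypothetical ground-truth categories:

```python
# Hypothetical category set and predicted CPD (softmax output, sums to 1).
categories = ["Comedy", "Romance", "Thriller", "Sci-Fi"]
predicted_cpd = [0.40, 0.30, 0.20, 0.10]

# A category counts as preferred if it exceeds the uniform value 1 / |C|.
threshold = 1.0 / len(categories)
predicted = {c for c, p in zip(categories, predicted_cpd) if p > threshold}

# Hypothetical GT categories, obtained by thresholding the true CPD.
ground_truth = {"Comedy", "Thriller"}

tp = len(predicted & ground_truth)
precision = tp / len(predicted) if predicted else 0.0
recall = tp / len(ground_truth) if ground_truth else 0.0
print(predicted, precision, recall)
```

Here Comedy and Romance exceed the 0.25 threshold, so one of the two predicted categories is correct, giving a precision and a recall of 0.5 each.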


6 Experiments

The experiments are designed to answer the research questions defined in Subsection 4.3, and this section is organized accordingly. We first describe the datasets used for the experiments. After that, there are two subsections for each research question: one that defines the experimental setup that allows us to answer the question, and one that presents the results of the experiments. Starting with the item recommendation experiments, the performance of the CAAE recommender model is compared to its baselines, allowing us to answer the first research question. Then, the experiments that investigate whether it is better to recommend based on the mentioned items or to utilize the linguistic content of the conversations alone are presented and compared to the baseline models. Finally, the three different approaches for identifying important tokens and predicting the user's category preference from the conversation are compared.

6.1 Datasets

In this subsection, the ReDial conversational recommendation dataset is described first. All performance evaluation experiments are executed using this dataset. The MovieLens [59] (movie ratings) dataset is also presented; it is an auxiliary dataset used for further improving the recommender model of one of the item recommendation baselines.

ReDial Dataset As mentioned earlier, the ReDial dataset [1] is the first dataset with real recommendation conversations. This dataset stands out because its examples come from a realistic scenario, both from a conversational and from a recommendation aspect. This is promising for the conversational recommendation research community, because experiments so far were based on synthetic combinations of conversations and recommendations. ReDial represents our desired problem setting very well: it allows each conversational recommendation session to be treated individually, dealing with all users as new users, and it recommends movies that can be described by a common set of categories. Additionally, its conversationally interactive nature allows for future studies of the effectiveness of explanation-based feedback. The dataset was collected with Amazon Mechanical Turk. Each conversation takes place between two workers, the recommendation Seeker and the Recommender. The Seeker explains her movie preferences and asks for suggestions, and the Recommender needs to suggest movies accordingly. All mentioned movies are tagged by the workers, so that they can later be identified. At the end of the conversation, both workers are asked to fill in a movie dialogue form. This form contains the Seeker's sentiment for each mentioned movie ("Liked", "Didn't like", "Didn't say"), and whether the Seeker has seen the movie ("Seen", "Haven't seen", "Didn't say"). Each conversation has a minimum of 10 messages, and at least 4 different movies are mentioned. There are in total 10,006 conversations, 182,150 utterances and 51,699 movie mentions, made by 956 users. The ReDial dataset was used for evaluating the proposed approaches and the baselines on the item recommendation and conversational item recommendation tasks.

MovieLens Dataset GroupLens Research publicly provides large datasets of movie ratings [59]. The largest and latest3 available dataset was used for the information extraction experiments. The extracted information was in the form of categories that describe each item. Additionally, MovieLens was used for pre-training on the item rating prediction task, in order to reproduce the results of one item recommendation baseline. The dataset has 27 million ratings, on 58k movies, by 280k users. The ratings range from 0.5 to 5, with a minimum step of 0.5. There are 19 movie categories4 and an option for "no category information".

Data Preprocessing The data preprocessing follows the guidelines provided by Li et al. [1]. Specifically, we take into account only movie dialogue forms submitted by the Seeker. Movies with an unknown sentiment label ("Didn't say") are ignored. Each conversation is treated individually, introducing a new user and her preferences. Matching movies between the two datasets is based on the movie names, using the script provided by the ReDial authors [1]. Out of the 6,924 movies in ReDial, 5,178 (75%) were matched perfectly, and the remaining 1,746 (25%) were not matched. The ReDial dataset consists of a training set and a test set; we randomly split the training set into training (80%) and validation (20%) sets. MovieLens has only one set of user-item ratings, which is randomly split into training (80%), validation (10%) and test (10%) sets. The MovieLens ratings are binarized using a threshold of 2 (ratings greater than or equal to 2 are considered liked), so that the liked/disliked distribution is as similar as possible to the ReDial liked/disliked distribution. Regarding MovieLens, some users can be found in all three sets.
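The MovieLens side of the preprocessing can be sketched as follows, with hypothetical rating tuples in place of the real dataset (the 2.0 threshold and the 80/10/10 split follow the description above):

```python
import random

random.seed(42)

# Hypothetical (user, movie, rating) tuples standing in for MovieLens data.
ratings = [(u, m, random.choice([0.5, 1.0, 2.0, 3.5, 5.0]))
           for u in range(5) for m in range(4)]

# Binarize: a rating >= 2.0 counts as "liked" (1), otherwise "disliked" (0).
binarized = [(u, m, 1 if r >= 2.0 else 0) for u, m, r in ratings]

# Random 80/10/10 split into training, validation and test sets.
random.shuffle(binarized)
n = len(binarized)
train = binarized[: int(0.8 * n)]
valid = binarized[int(0.8 * n): int(0.9 * n)]
test = binarized[int(0.9 * n):]
assert len(train) + len(valid) + len(test) == n
```

Because the split is over individual ratings rather than over users, the same user can indeed end up in all three sets, as noted above.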

3https://grouplens.org/datasets/movielens/latest/, generated September 2018

4’Comedy’, ’IMAX’, ’Romance’, ’Western’, ’Crime’, ’Sci-Fi’, ’Animation’, ’Thriller’, ’Fantasy’, ’Film-Noir’, ’Mystery’,
