Explaining a deep-learning text-based recommender system:

Effectiveness of presenting explanations on increasing user

trust and model persuasiveness.

Master Thesis

MSc Marketing Intelligence

Dawid Łepkowski

Supervisor: dr. K. Dehmamy

Second Supervisor: prof. dr. J.E. Wieringa

University of Groningen

Department of Marketing


Acknowledgements

I would like to thank my first supervisor, dr. K. Dehmamy, for his guidance throughout the entirety of the Master's program. Your passion for data science is what motivated me to pursue this challenging thesis topic and put hours into learning programming skills. Secondly, I would like to thank my second supervisor, prof. dr. J.E. Wieringa, for introducing me to the fascinating world of modelling. Participating in your lectures and tutorials was truly inspiring.

I wish to thank my family, my parents and grandmother, from the bottom of my heart, for their support throughout my entire studies. They are also the ones who convinced me to pursue this degree in the first place. Thank you for believing in me all this time.


Abstract

Text-based recommendation systems have been gaining traction in recent years, due to the rising popularity of deep-learning algorithms in recommendation tasks. This thesis investigates whether utilizing more complex text-based architectures is beneficial to the model's performance and, secondly, whether providing users with explanations of the model's predictions improves their trust in the system and the system's persuasiveness. After outlining the current literature regarding recommendation systems, deep-learning models, trust, and persuasiveness, the hypotheses are tested by building a neural network on the Yelp dataset and conducting a user survey.


Contents

Introduction
Theoretical Framework
    Recommender Systems
    Neural Networks
    Complexity
    Explainability
    Explanatory criteria of Recommender Systems
    Trust
    Persuasiveness
    Research Question
    Hypotheses
    Conceptual model
Methodology
    Framework
    Dataset
    Model
    Predictive accuracy
    Generating explanations with LIME
    Survey
Results
    Model Comparison
    Survey results
Conclusion
Managerial Implications
Limitations & Further Research Opportunities
References
Appendix A - Python code for data processing
Appendix B - Building the Neural Network and Explanations
Appendix C - Survey results analysis


Introduction

In recent years, Machine-Learning methods have been gaining momentum within the marketing industry. While they offer significant improvements in terms of predictive capabilities, they are notable for their lack of interpretability (Fusco et al. 2019). Lack of interpretability becomes an issue when the model is deployed in a practical setting. For example, in recommendation systems, customers have been shown to be more open towards recommendations which are understandable to them, i.e. which they can trust (Ribeiro, Singh, and Guestrin 2016). This stands in contrast with the potential offered by complex machine-learning models, which are by design not interpretable to humans (black-box models). However, certain solutions have been developed which allow for explaining complex deep-learning algorithms in an interpretable way, summarized by Guidotti et al. (2018). These methods potentially unlock new opportunities for making machine-learning methods more understandable to the average consumer.

More research in the field of customers' perception of models is necessary. In order to bridge the gap between the consumers' need to understand the model and the ever-increasing complexity of applied machine-learning algorithms, some form of explanation could be beneficial, which is where this thesis aims to contribute.

The structure of the study is as follows. Chapter 1 - Theoretical Framework focuses on reviewing the current literature, establishing a knowledge base for conducting research regarding recommendation systems and formulating specific hypotheses. Chapter 2 - Methodology describes the dataset, the model-building process, and the design of the user survey.


Theoretical Framework

Recommender Systems

Recommender systems have been extensively used in recent years by companies to overcome the issue of consumer choice-overload (see Zhang et al. 2019). Customers who do not have to deal with over-choice and are able to easily find a suitable option are more likely to remain customers, as evidenced by the example of Netflix, which utilizes recommender systems to increase customer engagement, reduce customer churn, increase customer lifetime value, and reduce costs related to replacing cancelled subscriptions (Gomez-Uribe and Hunt 2015).

Recommender systems can be classified based on the method of generating recommendations. The three main approaches are collaborative-filtering, content-based, and hybrid methods (Samih, Adadi, and Berrada 2019). Collaborative-filtering methods utilize data regarding user preferences to predict products a user might like (Breese, Heckerman, and Kadie 2013). Content-based algorithms, on the other hand, recommend items that are similar to the items the user enjoyed in the past (Ricci et al. 2011, 11). Hybrid methods try to combine the advantages of the two above methods, for example by mitigating the new-item problem for collaborative-filtering algorithms by using item features to recommend items without a user rating (Ricci et al. 2011, 12).

An example of a content-based recommender system was presented by McAuley and Leskovec (2013), where the authors leveraged text reviews in building a recommender system by applying Latent Dirichlet Allocation to discover post-hoc topics hidden within the reviews. This resulted in highly interpretable text labels, which allowed the authors to provide justification for the rating given by the model. This shows the potential of applying text reviews in recommender systems.

In a paper by Zheng, Noroozi, and Yu (2017), the authors propose a deep-learning model for recommendation which utilizes review text data. The proposed model utilizes two parallel neural network layers, one of which learns user behavior from ratings given by the user, while the other focuses on learning latent factors from reviews written for the item. A shared hidden layer is added on top of the network to enable the latent factors to interact with each other and generate a prediction of a rating for the review. The paper demonstrated that the model outperformed baseline recommender systems on various marketing-related datasets (including the Yelp dataset). However, the architecture requires the pair-wise availability of the target text review written by the target user, which is rarely available in real-life databases. Nonetheless, the architecture was built upon by Catherine and Cohen (2017) and adapted to remove this necessity. This demonstrates that the model could potentially be deployed in real-life scenarios where such data is not available, for example generating recommendations for a website user with only a few reviews. Most recently, a study by Dezfouli, Momtazi, and Dehghan (2020) concluded that using text reviews and a neural network-based architecture can lead to a significant improvement in performance compared to the two previously mentioned methods, DeepCoNN (Zheng, Noroozi, and Yu 2017) and TransNets (Catherine and Cohen 2017). This recent trend of including text data in recommendation systems is clearly increasing in popularity, as the existing subject literature grows every year. Overall, based on the most recent literature, it can be concluded that leveraging text data within the scope of a deep-learning model is the current state-of-the-art method for creating recommendation systems.

The proposed solutions have implications for marketing managers and data scientists who want to utilize the text reviews of their customers. Due to the large size of review-text databases, there exists an opportunity to increase performance compared to predictive models utilizing traditional statistical methods. Additionally, making the model more interpretable makes the implementation process easier and enables managers to gain insights from text reviews written by customers. Text reviews are also familiar to customers, and recommender systems which utilize them could potentially benefit from this familiarity.


Neural Networks

Neural Networks can be described as machine-learning algorithms which were designed to recognize patterns by loosely mimicking the way a human brain works. They have been successfully adopted across a variety of applications, such as image recognition (Sharma, Jain, and Mishra 2018). Additionally, they are becoming increasingly popular within business applications. Sharma and Chopra (2013) summarized the areas of business where neural networks have been implemented. Examples from the marketing department include Marketing Data Mining, Brand Analysis, Storage Layout, Target Marketing, and Sales Forecasting. These are just some of the examples of how utilizing neural networks within a company improves its performance. Another area where neural networks have outperformed existing models is recommendation systems, discussed in detail in the previous section.

One of the arguments for utilizing deep neural networks is the ability of the model to take advantage of multi-modal data, which is common on the web, such as photos, reviews, and star ratings (Zhang et al. 2017). Deep-learning methods are widely adopted within recommendation systems and are currently on the rise (Zhang et al. 2017).

A downside of using deep learning methods is the difficulty in interpreting the results, compared to traditional statistical methods. By not providing any justification for the recommendation, the customer might lose trust and not accept the recommendation (Zhang et al. 2019).


Complexity

The improvement of recommender system algorithms leads to a significant improvement in the aforementioned marketing metrics, such as reduced customer churn and increased customer engagement (Gomez-Uribe and Hunt 2015), which leads to the conclusion that more accurate recommender systems are more efficient. One way of increasing the accuracy of a recommender system based on a neural network is increasing its depth, i.e. the number of layers. According to Bianchini and Scarselli (2014), deeper neural networks are better suited for modern applications dealing with different types of data. Therefore, it could be the case that increasing the depth of the neural network will increase predictive accuracy and therefore the quality of the recommendations. On the other hand, a study by Dhingra (2017) showed that, at least in some cases, the complexity of a neural network, measured by the size of the network, displays diminishing marginal effects on the model's accuracy, meaning that increasing model complexity beyond a certain threshold could cause a significant increase in processing time while barely increasing accuracy. In order to confirm whether this is true, the thesis will aim to fill in the existing research gap by answering the following question:

Does increasing model complexity improve the quality of recommendations?

And the corresponding hypothesis:

H1: Increasing model complexity leads to a higher accuracy of recommendations.

Explainability

Explaining the algorithms behind recommender systems and thereby increasing transparency can lead to a series of benefits for the firm and the customer. An experiment described in Zanker (2012) indicated that explanations are a vital part of the functionality of a recommender system. Results showed that including explanations significantly contributed to increasing users' perception of the utility of the recommender system, the intention to use it repeatedly, and the commitment to recommend it to others.

Transparency is also increasingly demanded by regulators. Recital 58 of the GDPR requires that information addressed to the data subject be concise, easily understandable, and accessible (EU 2016). The same Recital dictates that this is particularly relevant in situations where the “… technological complexity of practice make it difficult for the data subject to know and understand whether, by whom and for what purpose personal data relating to him or her are being collected, such as in the case of online advertising.” (EU 2016). This is true for technologies such as machine-learning algorithms, which are designed as black boxes and are not meant to be easily interpretable.

One recent approach toward transparency and interpretability is designing models which are explainable and transparent on their own. An example of such a model was given by Cheng et al. (2019), where the authors designed a neural network which utilized text reviews and images and was interpretable and faster compared to a neural network that was explained locally. This approach, however, requires building the model in a transparent format from the start, which is an advanced task requiring significant processing power. Additionally, the methods for explaining the predictions are not applicable to other model types.

Another approach is to explain existing black-box models, which are not transparent by design. In the article by Lipton (2018), the author presents several post-hoc interpretation methods, e.g. visualizations of learned representations or models and natural-language explanations. In order to gain some understanding of how a neural network-based model generates a prediction, the structure of the network can be visualized, displaying its connections and layers. The more important connections and activated nodes can be highlighted; however, the output is often difficult to interpret and explain, particularly to customers without technical expertise. Visual explanations have been used to explain image-classification networks (Mahendran and Vedaldi 2015) by reconstructing the image at each layer in reverse and highlighting the changes, which allows humans to understand what the network learns at each step. However, the resulting output is still unclear and difficult to understand, and therefore its uses remain limited for business applications. The main benefit of using text-based explanations is that they are easily understandable to humans and that they can provide additional insights for the user (McAuley and Leskovec 2013).

One such method is LIME (Local Interpretable Model-agnostic Explanations; Ribeiro, Singh, and Guestrin 2016), which aims to explain a classifier in a faithful way by approximating it locally with an interpretable model. The core of this method lies in explaining predictions made by any model by estimating a sparse linear model in a local region near a particular point. The results are very data-sensitive, as with any local explanation method: LIME assumes that every complex model is linear on a local scale and fits a simple model around a single observation that tries to mimic the behavior of the global model. Furthermore, LIME was proven to be able to locally mimic the behavior of a complex recommender system (Nóbrega and Marinho 2019) and therefore should be suitable for a similar task in this analysis.

The general algorithm applied by LIME can be summarized as follows (adapted from Pedersen and Benesty 2020):

1. Permute the observation for each prediction to explain.

2. Allow the black-box model to generate predictions for all permuted observations.

3. Calculate the distance between the permutations and the actual observations to obtain similarity.

4. Select ‘m’ features best describing the complex model outcome from the permuted data.

5. Fit a simple model to the permuted data, explaining the complex model outcome with the ‘m’ features from the permuted data weighted by its similarity to the original observation.

6. Extract the feature weights from the simple model and use these as explanations.
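As an illustration of these steps, below is a minimal sketch of drawing such an explanation in R with the ‘lime’ package for a keras text model. The object names (train_reviews, model) and the vectorize_text() preprocessing function are illustrative placeholders, not the thesis’ actual code.

```r
# Sketch: explaining a keras text classifier with LIME in R.
# 'train_reviews' (character vector), 'model' (trained keras model) and
# 'vectorize_text' (text-to-integer preprocessing) are assumed placeholders.
library(lime)

explainer <- lime(
  x          = train_reviews,
  model      = model,
  preprocess = vectorize_text   # same transformation the model was trained on
)

explanation <- explain(
  x              = "The steak was perfectly cooked and the staff was friendly.",
  explainer      = explainer,
  n_labels       = 1,                    # explain only the top predicted rating
  n_features     = 5,                    # the 'm' most informative words
  feature_select = "forward_selection"   # recommended for text data
)

# Highlighted-words rendering, as used later in the survey materials
plot_text_explanations(explanation)
```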


Explanatory criteria of Recommender Systems

Recommender systems can be assessed across different dimensions, depending on the goal. Table 1.1 below presents explanatory criteria and their definitions.

Table 1.1 Explanatory criteria and their definitions (Ricci et al. 2011, 483)

Aim              Definition
Transparency     Explain how the system works
Scrutability     Allow users to tell the system it is wrong
Trust            Increase users’ confidence in the system
Effectiveness    Help users make good decisions
Persuasiveness   Convince users to try or buy
Efficiency       Help users make decisions faster
Satisfaction     Increase the ease of use or enjoyment

Herlocker et al. (2004) concluded that accurate recommendations do not guarantee that the user will have a satisfying and effective recommendation process, and that additional factors determine whether a user will actually use the system. Ensuring that users use the system and that the recommendations made by the system are actually accepted and followed up on has been proven to be a key goal from a marketing perspective, as higher user satisfaction leads to higher engagement, which in turn results in decreased customer churn (Gomez-Uribe and Hunt 2015). Further marketing consequences of displaying recommendations are discussed in the next sections.

Sinha and Swearingen (2002), on the other hand, discovered that users need to develop trust in a recommender system and that certain interface elements contribute to overall confidence in the system. Two examples of such interface elements are displaying explanations next to recommendations and providing basic information about the recommendation. Both elements increase user confidence in the system and recommendation acceptance.

One aspect which has been explored in academic research is analyzing how retailers can utilize recommender systems to persuade visitors to convert or buy a more expensive product without losing their trust. This highlights how intertwined these two aspects of the recommender system are and how important it is to maintain user trust. Potentially, displaying explanations and thereby increasing persuasiveness while maintaining user trust could generate value for both the user and the firm, for example by increasing customer engagement, which leads to increased customer satisfaction (Value-to-Customer) and increased revenues (Value-to-Firm) (Gomez-Uribe and Hunt 2015). Jannach and Jugovac (2019) additionally suggest that different desired effects should be expected from explanations based on the time frame analyzed. In the short term, they can facilitate the decision-making process (increase effectiveness) or persuade the user to choose a certain option. In the long term, on the other hand, explanations are considered a valuable tool in trust-building. Academic researchers in the field have unanimously concluded that providing explanations can have a major impact on the effectiveness of a recommender system (Herlocker et al. 2004; Tintarev and Masthoff 2012). However, results regarding its effects on persuasiveness are less clear; therefore, the next sections will further explore the available literature on user trust in recommender systems, the persuasiveness of systems, and how the two are linked. In addition, the marketing importance of these two aspects of recommender systems will be discussed.

Trust

User trust is an increasingly important criterion for evaluating recommender systems. Increased trust in the system may lead to higher overall user satisfaction, according to Benbasat and Wang (2005). According to Gruca and Rego (2005), increased customer satisfaction leads to increased cash-flow growth and lower variability, thereby stabilizing the company’s income. Additionally, Leninkumar (2017) found a significant and strong positive correlation between trust and customer satisfaction. All these examples show how important trust is to customer relationship management and therefore to the firm’s long-term performance.

Users rely on recommender systems when choosing products, movies, or restaurants based on the recommendation. The issue of trust becomes more important for decisions and purchases which carry more risk, for example when the decision requires an investment of time or money, such as going to a restaurant. One of the things which could increase a user’s trust in the system is familiarity: customers trust a recommender system more when they recognize the recommendations provided, even though novel recommendations were often labelled as more useful (Swearingen and Sinha 2001). A way to mitigate distrust coming from unfamiliarity with the recommendation is providing detailed information about the recommended item (Swearingen and Sinha 2001; Cooke et al. 2002).

Studies have been conducted regarding the effects of providing explanations on trust, but there is no clear verdict on whether displaying explanations leads to increased trust in the recommender system. According to Benbasat and Wang (2005), consumers perceive recommender agents as their “virtual assistants”, and the perceived usefulness of the recommendations and the user’s trust in the system influence customers’ intentions to adopt the recommender system. On the other hand, another study showed that consumer trust in the recommender system is not improved by transparency (Cramer et al. 2008). Because trust is a complex concept which is built over time, it is difficult to quantify and assess in a single study. Nonetheless, trust is a key aspect of the user experience and the customer relationship, and maximizing it from the start should be the goal.

One of the most important consequences of increased user trust is an increase in the customer’s intention to return to the agent and save cognitive effort (Pu and Chen 2006). This is highly relevant from a marketing perspective, where increasing the longevity of the customer relationship leads to higher customer lifetime value, particularly for subscription-based business models.

This thesis will therefore examine the effect of displaying explanations on user trust. This will generate insights into how model complexity, displaying explanations, and user trust in the system affect one another.

After examining the available literature, and considering the contradictory results of previous studies as well as the potential benefits of increasing user trust in the system, the following question is formulated:

– Does displaying explanations generated by LIME lead to increased user trust in the system?

To answer the above question, the effect of displaying explanations on user trust in the system will be examined in a user survey. It is hypothesized that:

– H2: Displaying LIME-generated explanations alongside recommendations increases user trust in the recommender system.

The existing body of literature seems to support this hypothesis at least partially; however, the results of some studies were inconclusive, even though trust is listed as a criterion for assessing a recommender system. To add to the literature, this thesis will aim to answer the formulated question by testing hypothesis H2.

Persuasiveness

Persuasiveness measures the ability of the recommender system to convince users to try or buy the product (Ricci et al. 2011, 483). It is a key metric from the marketing perspective, as it encapsulates the ability of the recommender system to influence the purchase decision, and it is closely related to the user’s perceived fit of the recommendation and trust in the system.

Persuasiveness becomes more important for purchases carrying higher risk, i.e. more expensive purchases. In the case of digital-media recommender systems, the risk of accepting a recommendation (for example, watching a recommended movie on Netflix) is nearly non-existent, and therefore persuasiveness plays a lesser role than in the case of product or restaurant recommender systems.

Research has also examined the effects of explanations on users’ acceptance of recommended items. A previously mentioned study examined the effects of transparency on trust and acceptance of recommendations and concluded that transparency does not improve trust, but displaying explanations alongside recommendations improved the acceptance of recommendations (Cramer et al. 2008). Herlocker, Konstan, and Riedl (2000) found that most users of MovieLens (86% of 210 respondents) found explanations provided alongside recommendations useful and would like to see them. The study presented evidence that providing explanations can improve acceptance of the automated collaborative-filtering system. Another study determined that the factors which influence user enjoyment of the process, and in turn perceived fit with the user’s preferences, include the effort required for the input process, the relevance of recommendations, and transparency (Gretzel and Fesenmaier 2006). Therefore, more effortless, relevant, and transparent recommender systems could be more persuasive. While more relevant recommendations can be achieved by improving model accuracy and can be measured objectively with performance metrics offline, measuring transparency and decreasing user-side effort usually demand conducting an experiment, as model transparency and perceived effort are subjective metrics, which are more reliably measured via a survey or A/B testing.

One aspect which researchers have not examined is the relationship between recommender system persuasiveness and increasing user trust by displaying explanations. There are no clear results regarding the effects of displaying explanations on the persuasiveness of the system, nor regarding the effects of user trust on system persuasiveness; therefore, the following research question could be formulated to fill in the research gap:

– Does displaying explanations generated by LIME lead to increased persuasiveness of the system?

Which in turn could be hypothesized as:

– H3: Displaying LIME-generated explanations alongside recommendations increases the persuasiveness of the recommender system.

Research Question

“Are text-based recommender systems of increased complexity more accurate, and do they become more persuasive and trusted when explanations are presented alongside recommendations?”

Research subquestions:

• Does increasing model complexity improve the accuracy of recommendations?

• Does displaying explanations generated by LIME lead to increased user trust in the system?

• Does displaying explanations generated by LIME lead to increased system persuasiveness?

Hypotheses

As a result of the literature review, the following hypotheses have been formulated:

• H1: Increasing model complexity leads to improved quality of recommendations.

• H2: Displaying explanations alongside recommendations increases user trust in the recommender system.

• H3: Displaying explanations alongside recommendations increases recommender system persuasiveness.

Conceptual model


Methodology

Framework

To test the hypotheses specified in the previous Chapter, a dual approach has been used. In the first part of the study, a dataset suitable for building a recommender system using text reviews was acquired from Yelp. The data was made available in an unusual format (NDJSON), which required the initial processing and filtering to be done in Python. An extensive motivation for choosing this dataset and a detailed description of the data-cleaning process is presented in the ‘Dataset’ section below.

Secondly, the data was transferred into R-Studio, where the ‘keras’ package was used to build a neural network-based recommendation system on the dataset, given the existing computing-power constraints. Finally, the complexity of the best-performing model was altered in order to test whether increasing model complexity improves the accuracy of recommendations, thereby testing hypothesis H1. Details of the model-building process are described in the ‘Model’ section of this Chapter.

Subsequently, a research survey was conducted to test hypotheses H2 and H3. The survey utilized explanations generated by LIME for selected restaurants, based on text reviews new to the model, and employed a within-subject pre-/post-treatment design, where displaying explanations is the treatment. Respondents were asked several control questions as well as direct questions used to assess the persuasiveness of, and their trust in, the model. The section titled ‘Survey’ contains more details regarding the survey questions and the reasoning behind them.


Fig 1.2 Study Framework

Dataset

Training the model was done on an open dataset provided by Yelp.com, downloaded in JSON format (“Yelp Dataset” 2020). It contains data regarding business attributes and user reviews, including ratings and text reviews, for a variety of businesses from a wide range of industries. The dataset was chosen as it is perfectly suited for text-based model training and for testing the hypothesis related to the model’s accuracy. It was released recently and has not been used in many published studies; however, a previous, largely similar version of this dataset has been widely used in studies regarding text analytics and recommender systems (e.g. Kouvaris, Pirogova, and Asuncion 2018; Zheng, Noroozi, and Yu 2017; Catherine and Cohen 2017). The large size of the dataset and the volume and variety of the data it contains make data processing challenging, while providing an opportunity to gain deeper insights into real-life explainable Machine-Learning applications within the scope of Marketing Science. The dataset is large enough for machine-learning methods to potentially outperform traditional analysis methods, and the data is directly related to marketing, in this case customer ratings of businesses; therefore, it was chosen as the base for building the model in this thesis despite the difficulties related to data processing.

The data was initially processed using Python and saved as a ‘.csv’ file to allow for processing in R. The Python code for the data processing, filtering, and exploratory data analysis is available in Appendix A.

Reading the NDJSON files directly in R proved problematic, even with packages dedicated to this file type, for example the ‘ndjson’ and ‘jsonlite’ packages. However, using the Python libraries ‘pandas’ and ‘numpy’ allowed for processing and filtering the data in the desired way.

The data was made available in six separate dataframes, which contained unique keys for businesses and users, allowing these dataframes to be merged. The complete Yelp database is very large (8 GB) and contains dataframes which were not relevant for this analysis; therefore, to make the analysis more focused, only the files containing reviews and business data were used. The files containing photos, check-ins, tips, and detailed user data were not used, which reduced the size of the dataset to about 5 GB.

Furthermore, to narrow down the analysis, the data was filtered to include only restaurants from Las Vegas. The city was chosen because it has the most observations of all US cities. The restaurant industry was chosen because it is one where text reviews offer particularly valuable insights and are an established factor in affecting customer behavior and determining the success of a restaurant. It is also a heavily customer-oriented and competitive industry, so additional intelligence generated from text reviews is particularly interesting to restaurant stakeholders. Finally, it is the most popular business type on Yelp in terms of the number of businesses observed and reviewed, providing enough datapoints to utilize the potential of Machine-Learning methods.

The filtering resulted in a merged, 1.8 GB ‘.csv’ file containing 13 variables, including 308,565 text reviews for 2,110 restaurants from 12,686 users, summarized in the table below. At this point the dataset was transferred to R-Studio, which was the software used throughout the remaining part of the analysis.

Table 2.1 – Summary of key Yelp dataset descriptives

[Table 2.1 is truncated in the source; its columns included review length in words and the number of reviews per restaurant.]


Model

The model was developed in R-Studio with the help of several notable packages. The tidyverse framework was used for reading and processing the data, as it offers significant processing-time improvements compared to base-R solutions; for example, just reading the csv file was several times faster using the ‘readr’ package, which makes a significant difference at this volume of data. The ‘ggplot2’ package was used for visualizations, as it is streamlined for working within the tidyverse framework and offers flexible and aesthetically pleasing visualization solutions for neural networks and their explanations. Most importantly, however, the ‘keras’ package was used to build and estimate the model. Keras is mostly known as a neural network API developed within Python, but R users can access this infrastructure and utilize its capabilities thanks to the ‘keras’ package.

It is a flexible, state-of-the-art solution which allows users to leverage Python’s libraries while still operating within the scope of the R programming language. There are several benefits to using ‘keras’ for developing models, namely the potential to conduct computation in parallel on both the GPU and CPU, and the ability to deploy models to external processing units such as NVIDIA GPUs or Google’s or Amazon’s cloud-computing services. This significantly improves local processing speeds and enables outsourcing lengthy computing operations, such as training neural networks, to cloud-computing providers, which drastically reduces the hardware requirements for running deep-learning models. In the case of this study, the use of cloud services was not necessary, but developing the model within ‘keras’ ensures scalability and seamless future deployment of the model on any platform. Another benefit of using ‘keras’ is the ability to manipulate the model’s architecture, e.g. concatenating input layers or using advanced functional forms instead of the standard sequential form. The initial setup of ‘keras’ within R can be a major detractor from choosing this framework, as it requires an installation of Python and Anaconda, which need to be configured to work with R-Studio via the ‘reticulate’ package.

The architecture of the model was inspired by Zheng, Noroozi, and Yu (2017), where the authors developed a theoretical recommender system using text reviews given by the user, allowing them to showcase the superiority of deep-learning-based solutions compared to other state-of-the-art solutions within a hypothetical framework. A similar model structure is followed in this thesis, where the model utilizes the text review given by a user to predict the rating the user would give. There is potential for deploying this mainly theoretical model by adding auxiliary layers, as demonstrated by e.g. Catherine and Cohen (2017), which removes the need for the particular user-item review pair to be present in the dataset in order to make a prediction.

It is worth noting here that adding additional variables related to, for example, business or user attributes resulted in the model’s inability to extract LIME explanations using traditional, built-in methods. Explaining only one type of data is a limitation of LIME; however, as this study is more focused on the effect of displaying explanations than on the intricacies of the model powering the recommender system, the decision was made to proceed with only a single type of data: text reviews. A more in-depth discussion of this topic is presented in the Limitations section.

The dependent variable which the model will try to predict is the star rating of the restaurant given by the user. This is a similar approach to Zheng, Noroozi, and Yu (2017), where the authors also use the rating as the predicted class. This makes it a classification task with five output classes; therefore, the dependent variable has to be transformed from a single-column integer into a dataframe of five binary variables, with a value of one in the column corresponding to the original rating and a zero in the other columns.
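A minimal sketch of this transformation in R ‘keras’ (the ‘ratings’ vector is an illustrative name):

```r
# One-hot encode star ratings (1-5) into five binary columns; a rating of 4
# becomes c(0, 0, 0, 1, 0). to_categorical() indexes classes from zero,
# hence the shift by one.
library(keras)
y <- to_categorical(ratings - 1, num_classes = 5)
```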

In order to make the text reviews processable within the model architecture, first a dictionary of the most popular words (in this case 5,000) has to be built, and then every review has to be converted into a sequence of integer word-indices. Scattering this transformation across separate preprocessing steps would compromise the validity of the explanation, and therefore the decision was made to transform the text into a vector of integers using a single dedicated function, which can later be used to draw explanations with LIME.
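A sketch of such a dedicated vectorization function, assuming the dictionary is built with the keras tokenizer; the object names and the maximum review length are illustrative, not the thesis’ exact values.

```r
# Build the dictionary of the 5,000 most popular words and define a single
# function mapping raw review text to padded integer sequences. The same
# function is later passed to LIME so explanations match the model's input.
library(keras)

tokenizer <- text_tokenizer(num_words = 5000) %>%
  fit_text_tokenizer(train_reviews)

vectorize_text <- function(text) {
  texts_to_sequences(tokenizer, text) %>%
    pad_sequences(maxlen = 250)   # illustrative maximum review length
}
```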

Thereafter, the vector of integers is fed into the model as an input layer, and an embedding layer is stacked on top of it, which looks up each word-index in the previously built dictionary of most popular words. The next layer returns an output vector of fixed length for each example by averaging over the sequence dimension, which ensures that the model can handle inputs of initially varied length without causing errors. At this point, the throughput is fed into a dense layer which uses a ReLU activation function. The Rectified Linear Unit is the most popular hidden unit in feed-forward networks (Goodfellow, Bengio, and Courville 2016). Previously, the logistic sigmoid and hyperbolic tangent activation functions were considered state-of-the-art; in modern times, however, the Rectifier is the default choice for hidden units (Goodfellow, Bengio, and Courville 2016).

The aim of the model is to predict which class the review belongs to; therefore, the output will contain five columns indicating the probability of belonging to each of those classes, with each value ranging between zero and one and a total probability of one. This is achieved by implementing the normalized exponential activation function (SoftMax) as the final layer of the model, which returns a probability distribution over the specified classes, summing to a total of one (Bridle 1990). The label with the highest probability is then selected as the predicted class. SoftMax is widely used as the final layer of classification-oriented neural networks (Goodfellow, Bengio, and Courville 2016). The formula for the SoftMax function is presented below in equation 2.1 (adapted from Goodfellow, Bengio, and Courville 2016, 78):

$$\mathrm{softmax}(a_i) = \frac{e^{a_i}}{\sum_{j} e^{a_j}}$$

Eq. 2.1 SoftMax Activation function
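A quick numeric illustration of Eq. 2.1 with hypothetical raw network outputs for the five rating classes:

```r
# The five raw outputs are mapped to probabilities that sum to one;
# the largest raw output receives the largest probability.
softmax <- function(a) exp(a) / sum(exp(a))

a <- c(0.5, 1.0, 0.2, 2.0, 0.1)    # hypothetical raw outputs
round(softmax(a), 3)               # 0.117 0.193 0.087 0.525 0.078 -> class 4 wins
sum(softmax(a))                    # 1
```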

As the model aims to predict the probability of belonging to multiple (five) classes, the Adam optimizer has been chosen (Kingma and Ba 2015). It is the default optimizer of choice for classification tasks and is widely recommended throughout the literature, due to its computational efficiency and low memory requirements.

The Mean Squared Error was specified as the ‘loss’ function, i.e. the metric to minimize when calculated for the validation sample. This metric was chosen in place of the default cross-entropy function, as the main metric used in model comparison is the RMSE, which is directly derived from the MSE by taking the square root. Therefore, in order to automatically optimize the model in terms of the comparison metric, the decision was made to use the MSE as the loss function.

As this thesis aims to answer the question whether increasing model complexity improves the accuracy of predictions, two models of differing complexity have been built and compared on their predictive accuracy. The models share all other specifications; they differ, however, in the total number of nodes within their hidden layers. The simple model was created using one layer of ten hidden nodes, whereas the complex model was built using two layers of ten nodes. A detailed comparison of the models can be found in the Results Chapter.
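A minimal sketch of the two architectures in R ‘keras’, assuming the reviews are already vectorized to integer sequences (x_train) and the ratings one-hot encoded (y_train); the embedding dimension is an illustrative choice, not the thesis’ exact value.

```r
# Simple model: embedding -> average pooling -> one hidden ReLU layer ->
# softmax over the five rating classes. The complex variant adds a second
# hidden layer where indicated.
library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 5000, output_dim = 16) %>%  # 5,000-word dictionary
  layer_global_average_pooling_1d() %>%                   # fixed-length vector
  layer_dense(units = 10, activation = "relu") %>%
  # layer_dense(units = 10, activation = "relu") %>%      # complex variant only
  layer_dense(units = 5, activation = "softmax")

model %>% compile(
  optimizer = "adam",
  loss      = "mse",              # minimizing MSE also minimizes RMSE
  metrics   = "accuracy"
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 10,
  validation_split = 0.2          # 80/20 train/hold-out split
)
```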

The difference between the models lies in the number of hidden layers and hidden nodes within their architecture, and therefore in their size and required training time. It is expected that training the more complex version of the model will take significantly more time while offering improved predictive accuracy. It is also expected that at some point the complexity of the network will cause overfitting and actually decrease the accuracy of the model; therefore, the complexity is increased by a reasonable amount, in line with modern literature. Additionally, even before reaching the point of overfitting, the complexity of the model most likely has a diminishing positive marginal effect on its predictive accuracy, as evidenced by Dhingra (2017), who inspected the accuracy-complexity trade-off by reducing the size of a neural network over 250 times while still achieving satisfactory accuracy.

Predictive accuracy

Since the model predicts the star rating a particular user would give to a restaurant on a 1-5 scale, predictive accuracy will be measured by means of the popular Root Mean Squared Error (RMSE) metric, as it can be used on offline datasets. This metric was used to measure accuracy during the Netflix Prize contest (see Linden, Conover, and Robertson 2009) and is a widely adopted accuracy metric in the recommender systems literature.

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\tau|} \sum_{(u,i) \in \tau} \left(\hat{r}_{ui} - r_{ui}\right)^{2}}$$

Eq. 2.2 Root Mean Squared Error

where $\hat{r}_{ui}$ represents the predicted rating and $r_{ui}$ the actual rating for each user-item pair $(u,i)$ in the test set $\tau$.

Specifying a model which optimizes the RMSE metric for the test dataset is fairly straightforward using the ‘keras’ package in R. In the model-building phase, the Mean Squared Error was specified as the loss function, which is the goal metric to be minimized by training the model. As the MSE is simply the RMSE before taking the square root, minimizing it also minimizes the RMSE. Afterwards, the RMSE is acquired by taking the square root of the MSE and visualized over training time (epochs) to make model comparison easier. The actual ratings given by users are available in the training dataset; however, predictive accuracy will be tested on out-of-sample data by partitioning the dataset into training and hold-out subsamples. The model will be trained on 80% of the observations, while the accuracy and RMSE metrics will be calculated using the 20% of observations not included in the training dataset. This basic cross-validation method of testing accuracy makes it possible to check whether the model is under- or overfitting, which is a prevalent issue with machine-learning models (Cawley and Talbot 2010).
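A sketch of deriving the RMSE curves from the MSE recorded during training, assuming ‘history’ is the object returned by keras’ fit():

```r
# Square root of the stored MSE gives RMSE per epoch for the training and
# hold-out data; plotting both reveals under- or overfitting over time.
rmse_train   <- sqrt(history$metrics$loss)
rmse_holdout <- sqrt(history$metrics$val_loss)

plot(rmse_holdout, type = "l", xlab = "Epoch", ylab = "RMSE")
lines(rmse_train, lty = 2)   # dashed line: training RMSE
```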

Additionally, the classification accuracy of the model will be reported, defined as the percentage of correctly predicted cases. Accuracy will be used as an auxiliary metric to validate the main RMSE comparison.

Generating explanations with LIME

The explanations must be drawn using the same text transformation as in the model-building process, so the decision is made to proceed with the previously specified vectorization function. An example output of the explanation function is presented in Fig. 2.1.

Figure 2.1 Example Explanation given by LIME

The words highlighted in blue support the predicted rating, whereas the words in red contradict it. The number of unique words selected for the local estimation is equal to the number of features specified in the explanation call. The label ‘5 stars’ is predicted with 59.42% confidence, which is high considering the number of labels. The fit of the explainer is 0.93, meaning that the local estimation is a very good fit to the complex neural network estimation. However, these results vary considerably when the number of features is manipulated or when a new review is explained, which is caused by the local nature of the explanation. Optimizing LIME to more closely reflect the predictions given by a complex neural network is possible by altering the parameters of the explanation function. In this case, the “forward feature selection” algorithm was used, as it is recommended for explaining text data. The remaining function parameters were left in their default state, which could be the cause of the high variability in the quality of the explanations. Additionally, the output was obtained using LIME’s built-in “explain text” function, which produces visually pleasing results that should also be clear and understandable to the average consumer, and which are therefore suitable for use within the research survey testing hypotheses H2 and H3.

Survey

The effects of displaying explanations on user trust and system persuasiveness were tested by means of a research survey. The data was collected through a survey built using Qualtrics and distributed via social media; an advantage of respondents recruited this way is that they are more inclined to complete the surveys and answer them honestly.

The computing power required for connecting the previously built recommendation system to the survey and providing explanations to respondents in real time is largely unattainable using local machines at the time of writing this thesis. Cloud-computing solutions exist which would enable a real-world implementation of such a system, but they are mainly oriented towards Big-Data machine-learning pipelines, and building such an implementation for a single survey would not be feasible. As the accuracy of the model had already been tested using separate quantitative metrics (RMSE in this case), and due to the infeasibility of a practical test of the model, a hypothetical setting was proposed. This method allows for testing the previously stated hypotheses related to consumers’ reception of the model, regardless of the quality of the underlying model’s recommendations, while reducing the need for intensive computing. A set of sample recommendations with and without explanations was prepared and used in the survey as recommendations given by the model. Below is a detailed explanation of the study design; an example of the survey is presented in Appendix D.

The participants were greeted with a short introduction in which they were asked to imagine themselves searching online for a restaurant. After the cover story presentation, they answered questions regarding their age group, gender, and education. This data was collected in order to control for the potential effects of these demographic characteristics on the dependent variables during the analysis.

Respondents were then presented with a recommended restaurant alongside a sample text review, without any explanation, and asked to rate the restaurant based solely on the provided information. They were also asked questions regarding trust and persuasiveness. An NPS-inspired question (“On a scale of 1-10 how likely would you be to visit this restaurant?”) was used to measure the persuasiveness of the recommender system, with and without explanations, whereas a direct statement (“I would trust the recommender system which generated this recommendation”), rated on a 7-point Likert scale, was used to measure user trust.


Figure 2.1 - LIME generated explanation

Afterwards, they were shown the same restaurants and reviews, but with the addition of the explanations given by LIME and the rating predicted by the model (see Figure 2.1). In order to test whether displaying explanations has an effect on those two aspects, the same questions were asked before and after displaying the explanations for the recommendation.

Therefore, the study had a pre-post, within-subject design in which all respondents were subject to the treatment - displaying explanations. Introducing a control group would not add relevant insights in the proposed hypothetical setting and was deemed redundant due to pre-treatment observations serving as a baseline comparison.

The respondents answered questions regarding three selected restaurant categories: Steakhouse, Vegan, and Chinese, which resulted in a panel structure of the collected data, since the questions for each restaurant were identical. This structure makes it possible to control for individual differences between respondents and restaurants, while also effectively reducing the number of complete observations required for a valid study. A downside of this design is that certain effects can carry over between measurements; in this study particularly, displaying explanations for the first recommendation can lead to biased responses to the following questions. Additionally, respondents may display practice effects as they progress through the survey, and the observations are not completely independent of one another. Nonetheless, the benefits of the within-subject design were deemed to outweigh these drawbacks.

Finally, the collected data was analyzed using the R package ‘plm’ (Croissant and Millo 2008), which contains dedicated solutions for analyzing panel data. The results of the analysis are presented in the following Chapter.

Results

Model Comparison

As previously discussed, after training the neural networks on the available data, the models were compared on their RMSE. Fig. 3.1 presents how the RMSE of each model changed over training time.

Fig 3.1 - Complex vs Simple model comparison

The results indicate that increasing model complexity did not improve the quality of recommendations, at least using this dataset. It is worth noting that, as a robustness check, multiple versions of the model specifications were tried and tested, displaying nearly identical results. A potential solution to the overtraining problem could be adding a dropout layer; however, even with the dropout rate specified at 0.5, the complex model kept overtraining and underperforming compared to the simple solution. Changing the number of epochs from 10 to 5 improved the simple model while only slightly improving the complex model; as such short training times are uncommon in practice, it can be inferred that the complex model underperforms compared to a simpler neural network with fewer nodes. This is an interesting finding that finds some support in the discussed literature; however, it could also be an indication of potential model shortcomings which were not detected in this thesis. More limitations of this hypothesis test, as well as potential future research ideas, are discussed in the Limitations section.

Even though the complex model initially outperformed the simple model, it suffered from severe overfitting, i.e. it put too much emphasis on fitting the training data, which left it unable to predict out-of-sample observations; the RMSE of the complex model becomes larger at epoch seven. Figure 3.1 clearly shows that the simple model would increase performance given more training time, whereas the complex model would experience a decrease.

In this study, the simpler model performs better, and in line with parsimonious model building, it should be preferred over the complex model. Therefore, H1: ‘Increasing model complexity leads to improved quality of recommendations’ is rejected, as there is no improvement in the quality of recommendations generated by the complex model, as evidenced by the lower RMSE of the simple model. Subsequently, the simple model was used as the model to be explained by LIME, as it allowed for generating fairly accurate predictions and drawing relevant explanations.

Survey results

As discussed in the Methodology section, the goal of the survey was to test how presenting explanations affects user trust and persuasiveness. An exploratory data analysis was performed on the collected data to check for outliers and reveal potential underlying patterns. After inspecting the data for NAs, seven responses with missing observations were removed. There were no outliers in the data, as guaranteed by the design of the survey, which contained only closed-choice questions. After tidying and cleaning the data, a total of 286 observations, collected from 95 respondents, were used in the analysis.
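A sketch of this screening step, with ‘survey_raw’ as an illustrative object name:

```r
# Drop the responses containing missing values; the survey's closed-choice
# design rules out out-of-range answers, so no further outlier handling is needed.
survey <- na.omit(survey_raw)
nrow(survey)   # 286 complete observations remained
```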

Afterwards, the differences between pre- and post-treatment scores were calculated for all respondents. These variables were then used in a regression analysis, which aimed at gaining more insight into the effect of the control variables on the two dependent variables: trust difference and persuasiveness difference. Originally, the data was to be estimated using the ‘plm’ package; however, the acquired dataset did not show enough variation between individuals to allow for estimation using parameters other than restaurant category, and therefore the method was switched to a classical regression analysis, sketched below.
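A sketch of the resulting specification (variable names are illustrative; the exact variables are described below):

```r
# Difference scores per respondent-restaurant observation, then two linear
# models with the same set of controls.
survey$trust_diff <- survey$trust_post - survey$trust_pre
survey$pers_diff  <- survey$pers_post  - survey$pers_pre

trust_model <- lm(trust_diff ~ gender + age_group + category, data = survey)
pers_model  <- lm(pers_diff  ~ gender + age_group + category, data = survey)

summary(trust_model)   # coefficients reported in Table 3.1
```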

The models were specified with the trust and persuasiveness differences as the two DVs, measured by subtracting the pre-treatment from the post-treatment score, and gender, age, and restaurant category as the IVs. Table 3.1 below displays the coefficients of the two estimated models.

Table 3.1 Coefficients of the models

Predictors       Trust                    Persuasiveness
                 Estimate    p            Estimate    p
(Intercept)      0.05        0.869        -0.51       0.129
Gender           -0.04       0.814        0.35        0.053
Age 25-34        0.44        0.040        0.52        0.020
Age 35-44        0.88        0.001        0.56        0.033
Age 45-54        0.89        0.002        0.79        0.009
Age 55-64        0.19        0.749        0.35        0.568
Age 65-74        2.39        0.003        0.83        0.321
Steakhouse       -0.46       0.019        -0.02       0.931
Vegan            -0.64       0.001        -0.03       0.876
Observations     286                      286
R2 / R2 adj.     0.108 / 0.083            0.043 / 0.015

The coefficients of the factor variables are interpreted relative to the baseline level of the factor, in this case age group 18-24. To illustrate, the variable Age 25-34 has a positive significant effect, and it can be inferred that a person belonging to this group will have a 0.44 higher difference between pre- and post-treatment measurements of trust. Positive outcomes in this analysis mean that the effect supports the overall functionality of the recommender system, whereas a negative effect means that the change in trust is moderated by this effect or could even be negative in total. For example, the Steakhouse category had a difference in trust lower by 0.46 compared to the Chinese category, which is the baseline level for this factor. This indicates that the effect of displaying explanations is negatively affected by a restaurant belonging to this category, compared to Chinese restaurants. However, additional factors which were not accounted for in this survey could have an effect on the effectiveness of explanations, such as the photos presented or even the restaurant name. Furthermore, customers may display a personal preference for a certain category of restaurants, while being skeptical towards other categories.

The explanatory variables in the Trust model display face validity, as their effects are largely in line with logic and pre-conceived expectations. For example, younger age groups display weaker effects compared to older age groups (with the sole exception of 55-64, which could be caused by a low sample size), which could stem from the fact that older users are more affected by visual explanations than younger users, who are used to browsing the web for information. Exploring the effects of age on the effectiveness of trust-building attempts such as providing explanations is an interesting subject for future researchers; however, it is not the focus of this thesis and is therefore not examined in depth. The effects of Gender are insignificant in both models and will therefore not be interpreted.

Only the Trust model was statistically significant overall in terms of the F-statistic (Trust: F = 4.205, p < 0.001; Persuasiveness: F = 1.538, p = 0.14); however, the R-squared and adjusted R-squared measures were very low. This is not problematic in this analysis, as the goal is to discover additional effects of control variables on the effectiveness of the explanations, not to make predictions; it is therefore more interesting to focus on the effects of the variables than on the models’ predictive performance.

Conclusions

Overall, all the hypotheses stated in this thesis have been tested using various statistical methods. Firstly, the effect of increasing model complexity on the recommendations given by the model was examined by comparing a simple neural network with a complex one. There was, however, no increase in predictive accuracy as measured by the RMSE, MSE, or accuracy. Moreover, the complex model took more time to train and suffered from overfitting after a few training epochs, even after applying dropout layers to mitigate it. This resulted in rejecting Hypothesis H1. It can therefore be concluded that, for text-based recommendation systems, increasing model complexity does not necessarily improve performance, and a simpler, more parsimonious model architecture can outperform a complex one.
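For reference, the accuracy comparison can be reproduced with a few lines of R; the object names (model_simple, model_complex, x_test, y_test) are assumptions standing in for the two fitted keras models and the held-out test data.

    # Predicted ratings from both architectures on the same test set
    pred_simple  <- predict(model_simple,  x_test)
    pred_complex <- predict(model_complex, x_test)

    # Error metrics used in the comparison
    mse  <- function(y, yhat) mean((y - yhat)^2)
    rmse <- function(y, yhat) sqrt(mse(y, yhat))

    rmse(y_test, pred_simple)   # the simpler model matched or beat
    rmse(y_test, pred_complex)  # the complex one on these metrics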

Secondly, displaying explanations next to the recommendations was found to have a positive effect on both user trust and the persuasiveness of the system. Additional findings regarding the effects of age groups are also worth noting: the control variables appear to moderate the effectiveness of displaying explanations next to recommendations, although more research is needed to properly assess this effect.

Managerial Implications

Recommender systems are widely adopted throughout the digital world, and potential improvements to the algorithms and to customers' perception of their transparency are becoming increasingly important, as mentioned in Chapter 1. The most interesting finding of this thesis from a managerial perspective is that explanations have a positive effect on the user's trust in the model and on its persuasiveness. Marketing managers can utilize this phenomenon to boost the performance of their recommender systems by including a text explanation alongside each recommendation. As discussed earlier, if customers do not trust the model, they are less likely to click through on the recommendations, which decreases the model's performance. Equally important is the fact that the persuasiveness of the model is boosted by displaying explanations next to recommendations. This could have a positive effect on the acceptance of recommendations and therefore increase the model's performance and the client's usage or buying frequency. These two findings are the main contribution of this thesis to existing business and academic knowledge.

Limitations & Further Research Opportunities

A first limitation concerns the range of data types incorporated into the text model. Research in this field shows that implementing visual features extracted from images in a recommender system for restaurants resulted in increased performance of the system (Chu and Tsai 2017). Therefore, the implementation of more data types, for example review data combined with images, could contribute to higher-quality recommendations in the case of this analysis as well. It can be hypothesized that a more elaborate model that uses more data types will require more nodes and layers, and might therefore benefit from increased complexity.
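As a hypothetical sketch of such an extension (not part of the implemented model), the keras functional API in R allows several inputs to be merged; here a bag-of-words text vector is combined with an image-feature vector, for example one extracted by a pretrained CNN. All layer sizes are placeholders.

    library(keras)

    # Two inputs: review text (bag-of-words) and image features
    text_in  <- layer_input(shape = 10000, name = "text_bow")
    image_in <- layer_input(shape = 2048,  name = "image_features")

    # Encode each modality separately, then concatenate
    merged <- layer_concatenate(list(
      text_in  %>% layer_dense(units = 64, activation = "relu"),
      image_in %>% layer_dense(units = 64, activation = "relu")
    ))

    # A single output node for the predicted rating
    rating_out <- merged %>%
      layer_dense(units = 32, activation = "relu") %>%
      layer_dense(units = 1)

    model <- keras_model(inputs = list(text_in, image_in), outputs = rating_out)
    model %>% compile(optimizer = "adam", loss = "mse")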

Another finding of this study concerns the software solutions imported from Python to R, such as ‘lime’ and ‘keras’, which offer substantial benefits to researchers who want to utilize both languages. Although implementing even an advanced model with ‘keras’ in R is very straightforward, the imported packages are not as powerful as their Python counterparts, due to, for example, limitations on specifying custom functions and increased difficulty in troubleshooting the model. Additionally, the Python documentation for ‘keras’ is richer than the R documentation. It can therefore be suggested that future researchers who want to build a machine-learning infrastructure should implement it in Python instead of using the ‘keras’ solution imported to R, as this could increase software stability and reduce the number of errors and bugs. It can also be concluded that while R is a powerful language, it still lags behind Python when it comes to machine-learning applications such as developing a neural-network-based model. While it was possible to generate explanations from text reviews using LIME, incorporating another type of data within the explanation caused it to crash. This could be due to an incompatibility between the processing of tensor layers within keras and the simplified methods used by LIME to estimate the model locally. It can therefore be concluded that while LIME is a fine option for generating explanations from text data, it lacks the flexibility necessary to explain some multi-modal models. Future researchers exploring the subject of explanations could investigate whether LIME's Python implementation offers more flexible solutions, or utilize another local explainer altogether, such as saliency masks.
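For text-only models, the R ‘lime’ workflow is straightforward. The sketch below assumes a fitted keras model named model and a preprocessing function to_bow() that converts raw review strings into the bag-of-words matrix the model expects; both names, and the review vectors, are illustrative.

    library(lime)

    # Build the explainer on the training reviews; `preprocess` tells lime
    # how to turn perturbed text back into model input
    explainer <- lime(train_reviews, model, preprocess = to_bow)

    # Explain individual reviews: n_features = 5 keeps the five words with
    # the largest local weights
    explanation <- explain(test_reviews[1:2], explainer, n_features = 5)

    # Highlight the influential words directly in the review text
    plot_text_explanations(explanation)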

A further limitation concerns the preprocessing of the review text, which relied on the ‘bag-of-words’ method, where no differentiation is made with regard to word order and a word's neighbors in a sentence. This does not allow the model to capture underlying patterns and learn more complicated structures of language, for example negations with the word “not”. Order-aware text representations could be implemented in future research regarding text data. Additionally, future researchers could compare the fit of explanations generated by LIME for models with varying text-preprocessing functions.
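To make the contrast concrete, the sketch below shows both representations in R keras: the bag-of-words matrix used in this thesis, which discards word order, and a padded sequence representation that an order-aware layer such as an LSTM could consume. The vocabulary size, sequence length, and the reviews vector are placeholders.

    library(keras)

    tok <- text_tokenizer(num_words = 10000) %>% fit_text_tokenizer(reviews)

    # Bag-of-words: "not good" and "good, not ..." become identical rows
    x_bow <- texts_to_matrix(tok, reviews, mode = "binary")

    # Sequence encoding preserves word order for an order-aware model
    x_seq <- texts_to_sequences(tok, reviews) %>% pad_sequences(maxlen = 200)

    seq_model <- keras_model_sequential() %>%
      layer_embedding(input_dim = 10000, output_dim = 32, input_length = 200) %>%
      layer_lstm(units = 32) %>%
      layer_dense(units = 1)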

The survey also had some limitations. First and foremost, most of the responses were


References

Benbasat, Izak, and Weiquan Wang. 2005. “Trust In and Adoption of Online Recommendation Agents.” Journal of the Association for Information Systems 6 (3): 72–101. https://doi.org/10.17705/1jais.00065.

Bergmann, Reinhard, and John Ludbrook. 2000. “Different outcomes of the Wilcoxon-Mann-Whitney test from different statistics packages.” American Statistician 54 (1): 72–77. https://doi.org/10.1080/00031305.2000.10474513.

Bianchini, Monica, and Franco Scarselli. 2014. “On the complexity of shallow and deep neural network classifiers.” 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2014 - Proceedings 25 (8): 371–76.

Breese, John S., David Heckerman, and Carl Kadie. 2013. “Empirical Analysis of Predictive Algorithms for Collaborative Filtering,” 43–52. http://arxiv.org/abs/1301.7363.

Bridle, John S. 1990. “Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition.” Neurocomputing, no. C: 227–36. https://doi.org/10.1007/978-3-642-76153-9_28.

Catherine, Rose, and William Cohen. 2017. “TransNets: Learning to transform for recommendation.” RecSys 2017 - Proceedings of the 11th ACM Conference on Recommender Systems, 288–96. https://doi.org/10.1145/3109859.3109878.

Cawley, Gavin C., and Nicola L. C. Talbot. 2010. “On over-fitting in model selection and subsequent selection bias in performance evaluation.” Journal of Machine Learning Research 11: 2079–2107.

Cheng, Heng-Tze, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, and others. 2016. “Wide & deep learning for recommender systems.” In RecSys, 7–10.

Cheng, Zhiyong, Xiaojun Chang, Lei Zhu, Rose C. Kanjirathinkal, and Mohan Kankanhalli. 2019. “MMALFM: Explainable recommendation by leveraging reviews and images.” ACM Transactions on Information Systems 37 (2): 1–28. https://doi.org/10.1145/3291060.

Chu, Wei Ta, and Ya Lun Tsai. 2017. “A hybrid recommendation system considering visual information for predicting favorite restaurants.” World Wide Web 20 (6): 1313–31. https://doi.org/10.1007/s11280-017-0437-1.

Cooke, Alan D. J., Harish Sujan, Mita Sujan, and Barton A. Weitz. 2002. “Marketing the unfamiliar: The role of context and item-specific information in electronic agent recommendations.” Journal of Marketing Research 39 (4): 488–97. https://doi.org/10.1509/jmkr.39.4.488.19121.

Covington, Paul, Jay Adams, and Emre Sargin. 2016. “Deep neural networks for YouTube recommendations.” In RecSys, 191–98.

Cramer, Henriette, Vanessa Evers, Satyan Ramlal, Maarten Van Someren, Lloyd Rutledge, Natalia Stash, Lora Aroyo, and Bob Wielinga. 2008. “The effects of transparency on trust in and acceptance of a content-based art recommender.” User Modeling and User-Adapted Interaction 18 (5): 455–96.

Croissant, Yves, and Giovanni Millo. 2008. “Panel data econometrics in R: The plm package.” Journal of Statistical Software 27 (2): 1–43. https://doi.org/10.18637/jss.v027.i02.

Dezfouli, Parisa Abolfath Beygi, Saeedeh Momtazi, and Mehdi Dehghan. 2020. “Deep Neural Review Text Interaction for Recommendation Systems,” 1–19. http://arxiv.org/abs/2003.07051.

Dhingra, Atul. 2017. “Model Complexity-Accuracy Trade-off for a Convolutional Neural Network,” 3–6. http://arxiv.org/abs/1705.03338.

EU. 2016. “Recital 58 - The Principle of Transparency | General Data Protection Regulation (GDPR).” https://gdpr-info.eu/recitals/no-58/.

Fusco, Francesco, Michalis Vlachos, Vasileios Vasileiadis, Kathrin Wardatzky, and Johannes Schneider. 2019. “Reconet: An interpretable neural architecture for recommender systems.” IJCAI International Joint Conference on Artificial Intelligence 2019-August: 2343–49. https://doi.org/10.24963/ijcai.2019/325.

Gomez-Uribe, Carlos A., and Neil Hunt. 2015. “The Netflix recommender system: Algorithms, business value, and innovation.” ACM Transactions on Management Information Systems 6 (4). https://doi.org/10.1145/2843948.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Gretzel, Ulrike, and Daniel R. Fesenmaier. 2006. “Persuasion in recommender systems.” International Journal of Electronic Commerce 11 (2): 81–100. https://doi.org/10.2753/JEC1086-4415110204.

Gruca, Thomas S., and Lopo L. Rego. 2005. “Customer Satisfaction, Cash Flow, and Shareholder Value.” Journal of Marketing 69 (July): 115–30.

Guidotti, Riccardo, Anna Monreale, Salvatore Ruggieri, Franco Turini, Dino Pedreschi, and Fosca Giannotti. 2018. “A Survey of Methods for Explaining Black Box Models.” http://arxiv.org/abs/1802.01933v3.

Herlocker, J. L., J. A. Konstan, and J. Riedl. 2000. “Explaining collaborative filtering recommendations.” Proceedings of the ACM Conference on Computer Supported Cooperative Work, 241–50. https://doi.org/10.1145/358916.358995.

Herlocker, Jonathan L., Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. “Evaluating collaborative filtering recommender systems.” ACM Transactions on Information Systems 22 (1): 5–53. https://doi.org/10.1145/963770.963772.

Jannach, Dietmar, and Michael Jugovac. 2019. “Measuring the business value of recommender systems.” ACM Transactions on Management Information Systems 10 (4): 1–22.


Kingma, Diederik P., and Jimmy Lei Ba. 2015. “Adam: A method for stochastic optimization.” 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 1–15. http://arxiv.org/abs/1412.6980.

Kouvaris, Peter, Ekaterina Pirogova, and Albert Asuncion. 2018. “Text Enhanced Recommendation System Model Based on Yelp Reviews” 1 (3).

Leninkumar, Vithya. 2017. “The Relationship between Customer Satisfaction and Customer Trust on Customer Loyalty.” International Journal of Academic Research in Business and Social Sciences 7 (4): 450–65. https://doi.org/10.6007/ijarbss/v7-i4/2821.

Linden, Greg, Michael Conover, and Judy Robertson. 2009. “The Netflix prize, computer science outreach, and Japanese mobile phones.” Communications of the ACM 52 (10): 8–9. https://doi.org/10.1145/1562764.1562769.

Lipton, Zachary C. 2018. “The mythos of model interpretability.” Communications of the ACM 61 (10): 35–43. https://doi.org/10.1145/3233231.

Little, John D. C. 1993. “On Model Building.”

Mahendran, Aravindh, and Andrea Vedaldi. 2015. “Understanding deep image representations by inverting them.” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 07-12-June: 5188–96. https://doi.org/10.1109/CVPR.2015.7299155.

McAuley, Julian, and Jure Leskovec. 2013. “Hidden factors and hidden topics: Understanding rating dimensions with review text.” RecSys 2013 - Proceedings of the 7th ACM Conference on Recommender Systems, 165–72. https://doi.org/10.1145/2507157.2507163.

Nóbrega, Caio, and Leandro Marinho. 2019. “Towards explaining recommendations through local surrogate models.” Proceedings of the ACM Symposium on Applied Computing Part F1477: 1671–8. https://doi.org/10.1145/3297280.3297443.

Pedersen, Thomas Lin, and Michaël Benesty. 2020. “Understanding Lime.” https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html.

Pu, Pearl, and Li Chen. 2006. “Trust building with explanation interfaces.” International Conference on Intelligent User Interfaces, Proceedings IUI 2006: 93–100. https://doi.org/10.1145/1111449.1111475.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’ Explaining the predictions of any classifier.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 13-17-August: 1135–44. https://doi.org/10.1145/2939672.2939778.

Ricci, Francesco, Lior Rokach, Bracha Shapira, and Paul B. Kantor. 2011. Recommender Systems Handbook. https://doi.org/10.1007/978-0-387-85820-3.

Ruder, Sebastian. 2016. “An overview of gradient descent optimization algorithms,” 1–14.
