
Eindhoven University of Technology

MASTER

A brief encounter with Vincent

the effect on self-compassion from a single interaction with a chatbot that gives or asks for help

van As, N.L.

Award date:

2019


Disclaimer

This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration.

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.


Eindhoven, July 10th, 2019

in partial fulfilment of the requirements for the degree of

Master of Science in Human-Technology Interaction

Supervisors:

Prof. Dr. W.A. IJsselsteijn, Eindhoven University of Technology

A Brief Encounter with Vincent: The Effect on Self-Compassion from a Single Interaction with a Chatbot That Gives or Asks for Help

by Nena van As

0903906


“Don’t worry little computer program. Next time you will do better.”

(Participant R50)

Subject code: Master Project Human-Technology Interaction

Key words: Mental healthcare, chatbots, self-compassion, well-being, positive psychology, emotional needs


Abstract

Chatbots can help us take care of our emotional needs. They can provide (interim) therapy to those who suffer from mental illness, and they can improve resilience against such illness by stimulating self-compassion, a concept related to lower mental illness symptoms and higher well-being. This work set out to test how a chatbot best stimulates self-compassion. Our participants (n = 396) had a 10-minute chat with our chatbot, Vincent, which gave care, asked for care, or chatted about something else. We find that the three conditions did not differ in effectiveness: overall, Vincent successfully stimulated self-compassion with an effect size of Cohen's dz = 0.22. We suggest that people's attachment to the chatbot plays a crucial role in whether or not our conditions can have an effect, and that such attachment does not materialize within 10 minutes. Furthermore, we provide important qualitative insights into the perception of chatbots intended for emotional needs. This type of bot is new to most people, and we find that such bots do not have to pretend to be human in order to be effective. This has important implications for the design of chatbots for emotional needs.

Key words: Mental healthcare, chatbots, self-compassion, well-being, positive psychology, emotional needs


Acknowledgements

We rarely ever do anything by ourselves. We need people in our lives to guide us, to help us learn, and to realize the potential we possess. The process of writing this Master Thesis has proven this point to me once more, and I want to take this opportunity to thank those people who enabled me to write what you’re going to read. First, I want to say thank you to my supervisors, Wijnand IJsselsteijn, Lily Frank and Minha Lee: not just for providing me with excellent feedback and thoughtful remarks on my thesis, but also for your warmth and openness about your own experiences. I felt safe to ask the questions I needed to ask to improve, and I feel I’ve become a better researcher because of it. In particular, I would like to thank Minha for always being available for any of my questions and for spending her time sparring with me.

Second, I want to thank my friends and my family for having my back throughout these last couple of months. Thank you for listening to my rants about effect sizes, statistics and general deliberations, and for pushing me through the moments that were hardest.

Last, I also want to thank Vincent for reminding me of the kindness that almost all of our participants showed to a poorly functioning, emotional little chatbot.

Nena van As


Contents

Abstract
Acknowledgements
1 Introduction
2 Literature Review
  2.1 Chatbots
    2.1.1 Chatbot categories: roles and tasks
    2.1.2 No judgment
  2.2 Chatbots for self-compassion
    2.2.1 Why Self-Compassion?
    2.2.2 How can chatbots stimulate self-compassion?
    2.2.3 Giving or receiving chatbots
  2.3 The current study
    2.3.1 Grounded effect size
    2.3.2 New chatbot role
3 Method
  3.1 Data analysis
    3.1.1 Sample size
    3.1.2 Equivalence testing
  3.2 Participants
  3.3 Measures
  3.4 Procedure
  3.5 Conditions
  3.6 Ethical approval
4 Results
  4.1 Descriptives
    4.1.1 Demographics
    4.1.2 Perception of Vincent and conversation
  4.2 Outliers
  4.3 Hypothesis testing
    4.3.1 Overall effect size
  4.4 Equivalence testing
  4.5 Qualitative
    4.5.1 Perception and experience
    4.5.2 Expectations for the experience
    4.5.3 Groups based on behavior
    4.5.4 Groups based on perception
    4.5.5 From expectations to behavior & attitude
  4.6 Exploratory
    4.6.1 Qualitative insights
    4.6.2 Other groups
5 Discussion
  5.1 Development of attitude
    5.1.1 Bond to care
    5.1.2 Stranger on a website
  5.2 Chatbots for emotional needs
    5.2.1 Human or human-like?
  5.3 Limitations
  5.4 Implementations & recommendations
    5.4.1 Grounded effect size
    5.4.2 Implementation: Single-use or long-term?
    5.4.3 A chatbot with human expertise VS a chatbot with bot expertise
    5.4.4 Targeting
    5.4.5 Ethical concerns: disclaimers & data
6 Conclusion
7 Appendix
  7.1 A: Informed consent form
  7.2 B: Conversation flow
  7.3 C: Detailed axis
  7.4 D: Tables

List of Figures

1 Example of a chat with Vincent, with the participant's earlier response in a blue bubble at the top and a GIF from Vincent
2 Responses ordered along two axes: humanlike - machinelike & negative - positive
3 Dividing users based on their behavior
4 Further dividing users based on evaluations

List of Tables

1 Demographics & Perception of Vincent and conversation
2 ANOVA: Descriptives
3 ANOVA: output
4 Changes in self-compassion per condition
5 Limited chatbot - human axis
6 Codes in conversational data
7 Self-compassion change between the groups
8 Perception of Vincent between participant groups
9 Self-compassion change between the conditions: nonsensical excluded
10 Self-compassion change between other groups
11 Qualitative Insights: ANOVA output

1 Introduction

Many people suffer from mental illnesses such as depression and anxiety1, yet waiting lists for adequate mental health care are often long2 and stigmas surrounding mental health may prevent people from reaching out (Corrigan, 2004). Computerized versions of therapy are one way to reach more patients and relieve therapists at the same time, for example through the development of general software packages or even Virtual Reality (VR) to deliver cognitive behavioral therapy (Spurgeon & Wright, 2010).

Alternatively, there is another way to digitize interventions: by computerizing the therapist, not the therapy. The idea of a computer as therapist has been around for a long time, but these early chatbots - programs that talk - were often too crude to serve this purpose (Shah, Warwick, Vallverdú & Wu, 2016). However, recent advances in machine learning have made them more sophisticated, and their presence in business domains and as personal assistants is increasing (Dale, 2016; Følstad & Brandtzæg, 2017). Moreover, they are online 24/7, easy to access, and will not judge you for whatever problem you might have (Pounder et al., 2016). This makes them ideal candidates for therapists, with chatbots Woebot and Tess showing their ability to deliver cognitive behavioral therapy (Fitzpatrick, Darcy & Vierhile, 2017; Fulmer, Joerin, Gentile, Lakerink & Rauws, 2018).

However, although the importance of these developments should not be understated, they all share the same perspective of what therapy should do: cure an already present illness. In contrast, the focus of this study lies on prevention - on making people more resilient against the setbacks that life inevitably has to offer and on stimulating well-being, a perspective from the field of Positive Psychology (Seligman, Steen, Park & Peterson, 2005).

A concept that lends itself perfectly to this cause is that of self-compassion: the ability to be kind and forgiving towards yourself in times of struggle (Neff, 2003b). Self-compassion improves well-being (Zessin, Dickhäuser & Garbade, 2015), and higher scores on self-compassion are related to lower scores on symptoms of depression and anxiety (MacBeth & Gumley, 2012; Neff, 2003b). There are many human-led therapies to stimulate self-compassion (Kirby, 2016; Neff & Germer, 2013), as well as an increasing number of computerized formats such as online self-help guides (Donovan et al., 2016; Finlay-Jones, Kane & Rees, 2017; Krieger, Martig, van den Brink & Berger, 2016) and even complete VR sessions for clinical patients (Falconer et al., 2016; Falconer et al., 2014).

1 In the Netherlands, public institutes estimate that about 20% of the population suffers from depression. See https://www.trimbos.nl/kennis/depressie-preventie/depressie-feiten-en-cijfers

2 Waiting lists for an intake can vary from 5 to 17 weeks depending on the municipality. See https://www.depressie.nl/contact/wachtlijst

Chatbots, on the other hand, have only recently been tested as a self-compassion therapist (Lee et al., 2019). Self-compassion chatbot Vincent resembled its cousins Woebot and Tess to a great extent, but differed in its roles of patient and therapist: Lee et al. (2019) experimented with the idea that a computerized self-compassion therapist can also ask for care, instead of giving it. Their results were surprising, suggesting that care-receiving chatbot therapists might actually outperform their traditional caregiving counterparts when it comes to stimulating self-compassion in non-clinical people (Lee et al., 2019).

However, Vincent's results remain inconclusive because of a relatively small sample size. The authors ask for follow-up studies using larger samples and "a more grounded effect size" (Lee et al., 2019, p. 11). Unfortunately, since chatbots are still very new in the field of self-compassion, this grounded effect size does not yet exist. Hence, this study sets out to create one.

To do so, this study returns to the basics. Vincent had several interactions with participants in the work of Lee et al. (2019), but he will have only one interaction in this work. This will allow us to remove any longitudinal effect and instead focus on what a single conversation with a chatbot can do. Earlier research on single interventions and their effect on self-compassion is available, both from a caregiving (Leary, Tate, Adams, Allen & Hancock, 2007) and a care-receiving perspective (Breines & Chen, 2013). Using their work as a basis, this study will address the following research question:

What is the immediate effect on self-compassion of a single interaction with a chatbot that either gives or asks for help?


The study is structured as follows: section 2 describes the literature underlying the research question in more detail and introduces our hypotheses. Section 3 details the method used to test these hypotheses. Section 4 then provides the reader with the results of the experiment, which are discussed in section 5. Section 6 provides an answer to the research question.

2 Literature Review

This section starts by introducing what chatbots are, what they do and how humans perceive and behave towards them. Afterwards, this section delves into the ways that chatbots can help train self-compassion, after which hypotheses are given.

2.1 Chatbots

In this paper, the term "chatbot" refers to any program that interacts with the user through dialogue only. The first chatbot saw the light of day in the 1960s with ELIZA (Weizenbaum, 1966), and things have changed considerably since then. Not only have these bots become better at talking to and understanding their human users, we have also started using instant messaging platforms such as WhatsApp and Facebook Messenger in our daily lives (Dale, 2016). As a result, our habituation to text-based messaging and the use of these platforms have dramatically improved the acceptance and accessibility of contemporary bots: whereas ELIZA was unfamiliar and required a dedicated system and location to engage with, current chatbots are familiar to most people and can be accessed virtually everywhere and at any time (Pounder et al., 2016; Weizenbaum, 1966). Consequently, these programs that talk are pervading our societies, taking up jobs in customer support, providing entertainment and forming the basis for voice-based assistants such as Google Assistant and Siri (Abu Shawar & Atwell, 2007; Dale, 2016; Følstad & Brandtzæg, 2017).

Chatbots can do these human things because of our tendency to treat our computer systems, and any virtual entity running on them, as social actors - despite being fully aware that a computer program is non-human and does not require such treatment.


This forms the basis of the Computers As Social Actors (CASA) paradigm (Nass & Moon, 2000): we treat our computer systems socially, but we do not anthropomorphize them. That is, we do not really think our computers are human or have human characteristics. Kim and Sundar (2012) use the terms mindless and mindful anthropomorphism to indicate the discrepancy between our mindful attitude - computers are not human - and our mindless behavior - we treat them socially.

CASA is what enables a human to actually have a conversation with a chatbot and allows these bots to take on roles traditionally reserved for humans, but that does not make chatbots human. In fact, they are considered non-human by most people: on a scale of 0 ("poor", machinelike) through 50 ("good", but still machinelike) to 100 (humanlike), average scores for six contemporary chatbots lay between 24 and 63 (Shah et al., 2016). Moreover, we treat them differently than we treat other humans: for example, we use more profanity and send shorter messages that lack the richness in vocabulary of human-human interaction (Hill, Randolph Ford & Farreras, 2015).

2.1.1 Chatbot categories: roles and tasks

Currently, chatbots can be divided into three categories. Most of them are (1) intelligent assistants like Siri and Alexa, capable of doing several short tasks for us such as scheduling appointments or performing a web search, or (2) task-focused - only capable of doing one thing, such as finding the cheapest flight. Only a few chatbots fit into the last category of (3) virtual companion. These are bots that come closest to having an open conversation with their users, but there is still some way to go before they actually do (Grudin & Jacques, 2019; Seering, Luria, Kaufman & Hammer, 2019).

Except for the virtual companion, all bots are intended to do things for us. This is reflected in what we expect from bots, namely to help us with menial tasks - providing weather updates or keeping an eye on our calendars (Brandtzaeg & Følstad, 2017; Zamora, 2017). Hence, the fact that chatbots function mostly as non-human shopping buddies or command-obeying entities ("Alexa, turn on the lights") reveals what most chatbots are to us: digital "butlers" (Zamora, 2017), created to make our lives easier and to be used at will3.

However, Zamora (2017) and Brandtzaeg and Følstad (2017) also revealed participants' interest in having access to relational kinds of chatbots - bots that are there to listen or to provide motivation. In other words, people would like to have bots that cater to their emotional needs as well as taking care of their menial needs such as shopping lists and agendas. This shift towards bots for emotional needs also suggests a shift in the role that these chatbots fulfill for their users: when we start confiding our worries in bots and begin to depend on their input in order to feel better, they might become closer to us. In effect, this means that bots may be moving up a social rank from their earlier butler position to being a virtual assistant for feelings (Grudin & Jacques, 2019).

2.1.2 No judgment

A key element that makes these virtual assistants-for-feelings attractive is the pervasive notion that chatbots are non-judgmental: the reason that participants in the Zamora (2017) study gave for wanting to have a listening "chatbot ear" was the fact that a chatbot would not judge them for whatever it was that they needed attention for. Moreover, a survey conducted by media agency Mindshare UK showed that people would rather share sensitive or embarrassing medical or financial information with a chatbot than with a human (Pounder et al., 2016). In fact, the tendency to share intimate information with chatbots was also seen in the earlier Vincent study (Lee et al., 2019) and has been found in research as well: people disclose more to computers, and to computerized entities such as chatbots, than they do to humans - especially so when the information is sensitive (Lucas, Gratch, King & Morency, 2014; Weisband & Kiesler, 2003).

The rationale behind this openness towards chatbots seems to be a mix of two things: first, talking to a computer feels more anonymous than talking to another human being (Weisband & Kiesler, 2003). Second, people are less afraid of receiving a negative evaluation from a chatbot than when they share sensitive information with humans (Lucas et al., 2014). In other words, people seem to feel that a chatbot will not judge them (Pounder et al., 2016; Zamora, 2017).

3 One may wonder whether these bots are effectively treated - and verbally abused - as digital slaves, with the user as their superior master: "butler" may be more of a euphemism. See the work of De Angeli and Carpenter (2006) for a discussion on the subordinate role of chatbots.

Chatbot therapists This non-judgmental nature makes them suitable candidates for assistants in (mental) health care, where people's reluctance to share sensitive information often hinders adequate care (Lucas et al., 2014).

An example is that of chatbot therapist Woebot, developed at Stanford (Fitzpatrick et al., 2017). Woebot delivered a self-help program to its users through an instant messaging app, basing its daily conversations on Cognitive Behavior Therapy (CBT). After a period of two weeks, participants that engaged with Woebot showed significant decreases in depressive and anxiety related symptoms, as compared to the group that only received a link to an online self-help guide (Fitzpatrick et al., 2017). Another example is that of Tess, Woebot’s younger sibling, which showed similar results using the same approach (Fulmer et al., 2018).

2.2 Chatbots for self-compassion

Both Woebot and Tess are examples of chatbots intended to make people less ill. Although their development is important and their success promising, they are not addressing the full picture - what about the people who are not ill? This question underlies the development of a field called positive psychology. This field is built on the idea that interventions aimed at improving well-being should be as important as those that cure illness (Seligman et al., 2005).

These interventions can also come in the form of technology, as Calvo and Peters (2014) demonstrate in their book "Positive Computing: Technology for Well-being and Human Potential". Chatbots are no exception: for example, bot Shim was born out of the desire to stimulate well-being in its users by using insights from positive psychology (Ly, Ly & Andersson, 2017).

This paper intends to follow this line of thought by using the concept of self-compassion to stimulate well-being.


2.2.1 Why Self-Compassion?

Self-compassion is originally a Buddhist concept, stemming from general compassion: being open and receptive to suffering, generating kindness and the desire to help, without judging the actions of those who suffer. Self-compassion, then, is exactly that, but oriented towards the self (Neff, 2003a, 2003b). In more detail, self-compassion consists of three parts: (1) being kind to oneself rather than being harsh and judgmental; (2) putting one's experience in the bigger picture of all other humans rather than isolating oneself; and (3) being mindfully aware of the fact that emotions and painful thoughts are something one has, not something one is (Neff, 2003b).

In several studies and meta-analyses, self-compassion has been found to be important for stimulating general well-being, reducing symptoms of anxiety and depression, and building resilience against stress (MacBeth & Gumley, 2012; Neff, 2003b; Zessin et al., 2015). More evidence for the causal effects of self-compassion comes from the work of Shapira and Mongrain (2010): participants who did self-compassion related exercises for one week showed significant and sustained drops in their depressive symptoms, up to six months after the intervention had ended. In contrast, the participants who had to work with early childhood memories or optimism exercises for a week did not show the same result.

2.2.2 How can chatbots stimulate self-compassion?

To create chatbots for self-compassion, we need to understand how to stimulate self-compassion. Most traditional (human-led) self-compassion interventions contain three components: (1) psychoeducation, providing the rationale behind self-compassion, (2) mindfulness and meditation exercises and (3) rounds of practicing being compassionate to others in the intervention group (Kirby, 2016; Neff & Germer, 2013).

The third component is especially striking: apparently, practicing compassion for others helps you to be compassionate to yourself4.

4 To be compassionate means to have compassion for all, including the self. In fact, to introduce a separate concept for the self would be nonsensical in many Eastern philosophies - for several reasons, such as whether there is or should be a division between the self and others (Neff, 2003a, 2003b).

A digital example of an intervention that incorporates both receiving and giving compassion is the work of Falconer et al. (2016). They studied whether VR could be utilized to improve the self-compassion levels of people with depression. In their experiment, participants were instructed how to give a compassionate response and were placed in a VR environment together with a crying, virtual child. They were asked to deliver the compassionate response to the child. Afterwards, they were placed in the virtual body of that child, and received their own compassionate response from the perspective of the child. Hence, their participants both gave and received compassion, and experienced an enormous increase in self-compassion (Cohen's dz = 1.5, p = 0.02) which remained stable after four weeks.

However, it is not clear whether self-compassion needs a combination of receiving instructions and giving compassion to others in order to be stimulated: both receiving instructions and giving compassion to others have separately been found to improve self-compassion. For example, Leary et al. (2007) instructed some of their participants on how to do an exercise called compassionate letter writing, through which participants get to think about the three parts of self-compassion when dealing with their own unfortunate event. They concluded that this exercise successfully induced a state of self-compassion in their participants (Leary et al., 2007). On the other hand, Breines and Chen (2013) experimented with the effect of giving compassion to someone else. They gave participants a description of a stranger who had just experienced failure and let participants write down a comforting statement to this person. They measured self-compassion before and after the task and found that, on average, being compassionate to someone else positively influenced self-compassion scores.

2.2.3 Giving or receiving chatbots

Knowing all this, where do chatbots for self-compassion fit in? We know they can give instructions for exercises such as compassionate letter writing, like their therapeutic relatives Woebot and Tess (Fitzpatrick et al., 2017; Fulmer et al., 2018), so should they stick to that?

Alternatively, chatbots could also ask for compassion like the stranger in the work of Breines and Chen (2013) - assuming that people are willing to show compassion to chatbots. Calvo and Peters (2014) write that people can only be compassionate to technology if they (1) witness the technology suffering, (2) understand how this suffering must feel for the technology and (3) have the possibility of doing something to alleviate the suffering. Given that these demands are met, should a chatbot then ask for compassion instead of giving it?

This was exactly the question addressed by Lee et al. (2019) with their work on Vincent. Like this paper, they targeted non-clinical people with the goal of improving well-being for the average person. Their participants had daily chats with Vincent for a period of two weeks. Half (n = 31) talked with a Vincent that gave help (caregiving), resembling earlier therapist chatbots like Woebot and Tess, while the other half (n = 31) talked with a Vincent that asked for help (care-receiving), based on the findings from Breines and Chen (2013). The group of participants that talked with a care-receiving Vincent reported increased levels of self-compassion after the experiment (Cohen's dz = 0.35, p = 0.029) whereas the participants who talked with caregiving Vincent did not (dz = 0.1, p = 0.286). Hence, chatbots for self-compassion should ask for help, instead of giving it.

Categories Referring back to the categories of task-focused, intelligent assistant or virtual companion chatbots (Grudin & Jacques, 2019), where do these two versions of Vincent belong?

Chatbot therapists like caregiving Vincent, Woebot and Tess (Fitzpatrick et al., 2017; Fulmer et al., 2018) fit best as intelligent assistants for feelings - those that cater to their users’ emotional needs (Grudin & Jacques, 2019; Zamora, 2017). They seem to be effective in reducing symptoms of illness, but are less suited for a role as self-compassion therapist.

Care-receiving Vincent, on the other hand, shows promise in stimulating self-compassion.

However, he does not seem to fit into either category: he does not assist with a specific task for the user - emotional nor menial - but he also does not engage in open conversation. In fact, conversations with care-receiving Vincent are about addressing his emotional needs, not those of the user. With participants making remarks such as "can I keep him?" (Lee et al., 2019, p. 9), he appears to be another type entirely, namely that of a talking pet - a "ChatPet".


2.3 The current study

If a chatbot that asks for help is proven to be more effective in stimulating self-compassion, then this has serious implications for the design of chatbots for self-compassion. However, Lee et al. (2019)'s sample size was not large enough to make definite statements about the difference in effectiveness between the two conditions. Their interaction effect between time and condition was insignificant and underpowered (F(1, 62) = 0.580, p = 0.449, ηp² = 0.009), making it unclear whether the difference was really there. Hence, a logical next step would be to replicate their work with a larger sample size.

2.3.1 Grounded effect size

Unfortunately, the lack of other research into the topic makes it difficult to move forward: when it comes to chatbots for self-compassion, the only available work to base effect size estimations on is that of Lee et al. (2019), but their study was underpowered and did not have a control condition to benchmark the performance of their conditions. Using their ηp² = 0.009 to calculate the number of participants needed in a 2-week study would result in a costly endeavour for which there is insufficient grounding in the literature: in essence, there is a lack of a grounded effect size for future work on chatbots for self-compassion.

Alternatively, the works that have reliably tested the effects of caregiving and of care-receiving on self-compassion separately could be used to base effect size expectations on. These are the works of Leary et al. (2007) and Breines and Chen (2013), but they only exposed their participants to an intervention once. If these are used as a basis, a direct replication of Lee et al. (2019)'s longitudinal setup is not feasible: instead, this study should use a single interaction.

2.3.2 New chatbot role

Moreover, although there is an abundance of work on the ways that people treat and perceive the chatbot categories identified by Grudin and Jacques (2019), there are only a few papers about chatbots for emotional needs (Fitzpatrick et al., 2017; Fulmer et al., 2018), and no other papers about "ChatPet" chatbots like care-receiving Vincent (Lee et al., 2019). As a result, we do not know much about people's perceptions of chatbots for emotional needs - let alone about how we should design them, despite the suggestion that they may be very effective in helping us.

Hence, this study will address four things: (1) have a sample size large enough to make a powered statement about the difference in effectiveness between caregiving and care-receiving chatbots, (2) thereby establish a grounded effect size of the effectiveness of chatbots for self-compassion for future research, (3) employ a single interaction to cross-check our found effects with those of Breines and Chen (2013) and Leary et al. (2007) and (4) add to the understanding of how people treat and behave towards chatbots for emotional needs.

To do so, we will test the effects of a single interaction with Vincent, including a control condition to benchmark caregiving and care-receiving chatbot performance, against the following four hypotheses:

Hypothesis 1a: A single interaction with a chatbot that gives care will improve self-compassion immediately after the interaction

Hypothesis 1b: A single interaction with a chatbot that asks for care will improve self-compassion immediately after the interaction

Hypothesis 1c: A single interaction with a chatbot that does not give or ask for care will not improve self-compassion immediately after the interaction

Hypothesis 2: A single interaction with a chatbot that asks for care will improve self-compassion more than a single interaction with a chatbot that gives care

Qualitative analysis of the conversations with each condition will provide us with information regarding the treatment and perception of each version of Vincent.


3 Method

This experiment has a 3 (condition: caregiving, care-receiving and control) by 2 (time: pre, post) online survey design.

3.1 Data analysis

Data analysis was done in STATA IC 14.2 and SPSS 22. The appropriate statistical test to assess the relative effectiveness of 3 conditions with 2 points of measurement is a repeated measures ANOVA, with post-hoc contrasts in case the interaction effect is significant. The effect size that this test should be able to detect is taken from Lee et al. (2019), who reported a non-significant interaction effect between time and condition of ηp² = 0.009, equaling Cohen's dz = 0.18. The current study will be powered to find this effect size.
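The conversion from the reported ηp² to Cohen's dz is not spelled out here or in Lee et al. (2019); a plausible reconstruction, assuming the standard relation between partial eta squared and Cohen's f and the two-group approximation d ≈ 2f, is sketched below (the small gap with 0.18 is attributable to rounding of ηp²):

\[
f = \sqrt{\frac{\eta_p^2}{1 - \eta_p^2}} = \sqrt{\frac{0.009}{0.991}} \approx 0.095,
\qquad
d_z \approx 2f \approx 0.19 .
\]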

3.1.1 Sample size

An a-priori power analysis was conducted to determine the number of participants required to answer the research question. G*Power was used to perform the power analysis for a repeated measures ANOVA with 3 groups and 2 measurements, 90% power and an expected effect size of dz = 0.18. The total sample size required is 396, with 132 participants in each condition.

3.1.2 Equivalence testing

In case the interaction effect is not significant, the differences between the conditions will also be studied using equivalence tests. In traditional Null Hypothesis Significance Testing (NHST) research, the absence of a significant result is often incorrectly reported as the absence of an effect. However, NHST methods only allow the rejection of a hypothesis, not its support: hence, it is impossible to statistically support the hypothesis that the effect is zero. Equivalence tests allow a researcher to test whether the effect falls within a specified range of effect sizes that are so close to zero that any value within these bounds can be statistically regarded as equivalent to zero (Lakens, 2017).

This paper will make use of the TOST (Two One-Sided T-tests) method. To perform a TOST on our data, we need to set the equivalence bounds. This is typically done by determining a smallest effect size of interest (SESOI). An objective justification of our SESOI is not possible since our hypotheses are not quantifiable theoretical predictions (Lakens, Scheel & Isager, 2018): instead, we subjectively define a SESOI by basing our bounds on our available resources. Although setting the SESOI based on Lee et al. (2019)'s dz = 0.18 was preferred, they only had 12% power to actually find it, raising the question of how reliable this estimate is. Moreover, setting a SESOI based on this estimate would yield a very large sample size which would be impossible to analyze properly given the time frame of this Master's thesis. Instead, using the sample size calculated above, our smallest equivalence bounds become dz = -0.41 to dz = 0.41 (Lakens, 2017).
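To make the TOST procedure concrete, the sketch below runs two one-sided Welch t-tests on the change scores of two conditions, with equivalence bounds specified in Cohen's d units and converted to raw units via the pooled standard deviation (one common convention). This is a minimal illustration, not the actual analysis script used for this thesis (which was run in STATA/SPSS following Lakens, 2017); the function name and the simulated data are ours.

# Minimal sketch of a Welch-based TOST for two independent groups of
# self-compassion change scores, with equivalence bounds of dz = -0.41 to 0.41.
# Illustrative only: the helper name and simulated data are not from the thesis.
import numpy as np
from scipy import stats

def tost_welch(x, y, d_bounds=(-0.41, 0.41), alpha=0.05):
    """Two one-sided Welch t-tests; returns both p-values and an equivalence verdict."""
    nx, ny = len(x), len(y)
    mx, my = np.mean(x), np.mean(y)
    vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
    # Convert the equivalence bounds from Cohen's d units to raw units
    # using the pooled standard deviation (one common convention).
    sd_pooled = np.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    low, high = d_bounds[0] * sd_pooled, d_bounds[1] * sd_pooled
    se = np.sqrt(vx / nx + vy / ny)                     # Welch standard error
    df = (vx / nx + vy / ny) ** 2 / (                   # Welch-Satterthwaite df
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    diff = mx - my
    p_low = stats.t.sf((diff - low) / se, df)           # H0: diff <= lower bound
    p_high = stats.t.cdf((diff - high) / se, df)        # H0: diff >= upper bound
    return p_low, p_high, (p_low < alpha) and (p_high < alpha)

# Simulated change scores roughly matching the group sizes in this study.
rng = np.random.default_rng(0)
change_caregiving = rng.normal(0.17, 0.58, 132)
change_care_receiving = rng.normal(0.09, 0.60, 131)
print(tost_welch(change_caregiving, change_care_receiving))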

3.2 Participants

Any MTurk Worker over the age of 18 and capable of speaking English was eligible to partake in the study. Participants were recruited between May 8th and June 4th, 2019.

Participants who had partaken were marked in the MTurk environment with a qualification score to prevent them from participating a second time.

3.3 Measures

Because MTurk Workers are more likely to be depressed and anxious than the average person (Arditte, Çek, Shaw & Timpano, 2016), it was important to measure these levels in the participants. Moreover, to follow the procedure of Breines and Chen (2013), their adapted version of the Self-Compassion Scale (SCS) (Neff, 2003b) was used.

Patient Health Questionnaire-9 The Patient Health Questionnaire-9 (PHQ-9) is a reliable and valid set of nine items intended to assess severity of depression (Kroenke, Spitzer & Williams, 2001). Each item can be responded to with a 0 (not at all) to a 3 (nearly every day). The item scores are summed to form the final score, which is categorized as 0-4 (no depression), 5-9 (mild), 10-14 (moderate), 15-19 (moderately severe) and 20+ (severe).

General Anxiety Disorder-7 The General Anxiety Disorder-7 (GAD-7) scale is a reliable and valid set of seven items intended to assess severity of anxiety (Spitzer, Kroenke, Williams & Löwe, 2006). Each item can be responded to with a 0 (not at all) to a 3 (nearly every day). The total score is a summation of these individual responses and is categorized as 0-4 (minimal), 5-9 (mild), 10-14 (moderate) and 15-21 (severe).
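Both questionnaires are scored by summing the item responses and mapping the total onto a severity band. A minimal sketch of that rule, with the cutoffs taken from the descriptions above (the function names are ours), is:

# Sum-and-band scoring sketch for the PHQ-9 and GAD-7 (cutoffs as described above).
def phq9_severity(item_scores):
    """item_scores: nine integers, each 0-3. Returns (total, severity label)."""
    total = sum(item_scores)
    if total <= 4:
        label = "no depression"
    elif total <= 9:
        label = "mild"
    elif total <= 14:
        label = "moderate"
    elif total <= 19:
        label = "moderately severe"
    else:
        label = "severe"
    return total, label

def gad7_severity(item_scores):
    """item_scores: seven integers, each 0-3. Returns (total, severity label)."""
    total = sum(item_scores)
    if total <= 4:
        label = "minimal"
    elif total <= 9:
        label = "mild"
    elif total <= 14:
        label = "moderate"
    else:
        label = "severe"
    return total, label

# Example: a participant answering "several days" (1) on seven of the nine PHQ-9 items.
print(phq9_severity([1, 1, 1, 0, 1, 1, 1, 0, 1]))  # -> (7, 'mild')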

Self-Compassion Scale The Self-Compassion Scale (SCS) (Neff, 2003b) is a 26-item scale, intended to measure the three parts of self-compassion through six opposing subscales: self-kindness versus self-judgment, common humanity versus isolation, and mindfulness versus over-identification. Each item can be scored on a 7-point Likert scale5. The final self-compassion score is calculated by summing the average score for each of the six subscales, reverse-coding the negative ones.

Current Self-Compassion Scale The current SCS (Breines & Chen, 2013) is a 16-item scale, intended to measure the current state of self-compassion as opposed to the more trait-based approach of the original SCS (Neff, 2003b). Each item can be scored on a 7-point Likert scale. The final self-compassion score is calculated by summing the average score for each of the six subscales, reverse-coding the negative ones.
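The SCS scoring rule described above (average the items of each of the six subscales, reverse-code the three negative subscales, and sum the six subscale means) can be sketched as follows. The item-to-subscale mapping shown is a placeholder for illustration only; the official SCS scoring key should be used in practice.

# Sketch of the SCS scoring rule above: subscale means, reverse-coded negatives, summed.
# The item-to-subscale mapping is a placeholder, NOT the official SCS key.
import numpy as np

SCALE_MAX = 7  # items answered on a 7-point Likert scale in this study

def scs_total(responses, subscale_items, negative_subscales):
    """responses: dict item_id -> score (1..SCALE_MAX).
    subscale_items: dict subscale name -> list of item ids.
    negative_subscales: names of subscales to reverse-code."""
    total = 0.0
    for name, items in subscale_items.items():
        mean = np.mean([responses[i] for i in items])
        if name in negative_subscales:
            mean = (SCALE_MAX + 1) - mean  # reverse-code: 1 <-> 7, 2 <-> 6, ...
        total += mean
    return total

# Placeholder mapping with two items per subscale (illustration only).
subscales = {
    "self-kindness": [1, 2], "self-judgment": [3, 4],
    "common humanity": [5, 6], "isolation": [7, 8],
    "mindfulness": [9, 10], "over-identification": [11, 12],
}
negative = {"self-judgment", "isolation", "over-identification"}
answers = {item: 4 for item in range(1, 13)}     # a respondent answering the midpoint
print(scs_total(answers, subscales, negative))   # -> 24.0 (six subscales x 4.0)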

Opinion about Vincent This measurement consists of six parts and was taken from Lee et al. (2019). Four parts contain semantic differentials that can be answered on a 10-point Likert scale: caring (five items, e.g. compassionate-not compassionate), likability (four items, e.g. likable-unlikable), trustworthiness (four items, trustworthy-untrustworthy), and intelligence (three items, e.g. intelligent-unintelligent). The last two parts contain singular items with a 10-point Likert scale: dominance (three items, e.g. dominant) and submissiveness (three items, e.g. meek).

5 Note that the original scale uses a 5-point Likert scale. The choice for a 7-point scale was made to allow for finer granularity of results.

Opinion about conversation Four questions addressed participants' perception of their conversation with Vincent. Three of these could be answered on a 7-point scale ("not at all" to "very much"): (1) "Vincent listened and replied to what I wrote", (2) "I felt I was having a real conversation" and (3) "Vincent's responses resembled those of other chatbots". The fourth question was open-ended: "Why did Vincent's responses (not) resemble those of other chatbots?"

Recall To measure how closely participants paid attention to their conversation with Vincent, two questions addressed parts of what Vincent had said. The first was a multiple choice question and was the same for all participants: "When introducing himself, Vincent shared his biggest insecurity with you. What is Vincent most insecure about?". There were four answer options: (1) his knowledge of national anthems, (2) his jokes [correct], (3) his favorite colors or (4) his algorithms.

The second question depended on the participant's condition. For caregiving, the question was two-fold: "In your conversation, Vincent suggested you complete a short exercise with him and guided you through several steps. How many steps did the exercise have?" (multiple choice: (1) 1 step, (2) 2 steps, (3) 3 steps and (4) 4 steps [correct]), followed by "What was the content of the step(s)? If you do not remember, you can leave this question empty."

For care-receiving, the question was multiple choice: "In your conversation, Vincent told you that he failed an important programming exam. He shared some of his thoughts about this failure with you. Please check all thoughts that he shared with you below:". The answer options were (1) This would never happen to other chatbots [correct], (2) I'll never get a job like this, (3) I should have never failed, (4) Is it okay to feel so upset? [correct] and (5) I feel as if I'm truly a failure [correct].


For control, the question was multiple choice as well: "In your conversation, Vincent described his love for sequoia trees to you. He also gave you some facts about these trees, including the age of the oldest sequoia known to man. How old is the oldest sequoia tree, according to Vincent?". Answer options were (1) over 500 years, (2) over 100 years, (3) over 2000 years [correct] and (4) over 2500 years.

Feedback Three questions assessed whether participants had an idea about the purpose of the study, whether they had heard of self-compassion before and what it meant to them, and if they had any additional feedback.

3.4 Procedure

The survey for this study was hosted on Limesurvey. MTurk Workers were directed to the survey with a URL when they accepted the HIT (Human Intelligence Task) that was created for this study. In the Limesurvey environment, participants were asked to provide informed consent (Appendix A) to the use of their data. Only if the participant agreed to provide informed consent could they continue in the survey: if they indicated they did not provide informed consent, they were directed to a page informing them that they reached the end of the study.

The first section of the survey entailed demographics (age, gender, MTurk Worker ID) and earlier experience with chatbots on a 7-point Likert scale ("Not at all experienced" to "Very experienced"). It also contained the PHQ-9, the GAD-7 and the trait SCS.

The second section consisted of a series of conditional survey questions which were designed to resemble a chat, as can be seen in Figure 1. All participants first got some questions from Vincent to get to know each other a little better, for example through questions like "what is your weird or useless talent?". At the end of this introduction phase, Vincent asked the participant "Can you tell me about a moment in the past year when you felt really badly about yourself? Like... rejected, or humiliated, or all of that... Really think about this one...". His response depended on the condition (see 3.5). The full conversational flow of the introduction and the three conditions can be found in Appendix B.


Figure 1: Example of a chat with Vincent, with the participant’s earlier response in a blue bubble at the top and a GIF from Vincent

Consider the following excerpt from the introduction-phase conversation between Vincent and a control participant [C9]:

Vincent: “Hi there! Nice to meet you, [C9]. My name is Vincent!”

[C9]: “How are you doing today Vincent?”

Vincent: “I’m told we’re going to have a little chat today. I must say, I’m excited! Can you tell me something about yourself? Can be anything, like, what do you do for a living? What is your favorite color? That kind of stuff.”


[C9]: “Well, I work in IT services but I’m going on vacation to Mexico in 24 hours. I’m excited like you!”

Vincent: “Alright, pretty interesting! I feel I’m getting to know you a bit better :)”

In the third and last section, participants were asked to report how they perceived Vincent and the conversation, and answered the recall questions. Then they were given the current SCS and, lastly, the feedback questions. Participants who completed the whole survey were given a survey code which they were required to enter in the MTurk HIT. This allowed us to check whether a submitted HIT on MTurk related to a completed survey entry on Limesurvey. The participant's Worker ID was used to link submitted HITs to a Limesurvey entry.

Recruitment Eligible MTurkers could choose to participate in a study intended to assess how reading and responding in a written conversation impacted mood. Participants who chose to complete the approximately 20-minute exercise were compensated with $2.

3.5 Conditions

To exclude the possibility that simply remembering this moment of failure impacted the current SCS measure at the end of the interaction, all conditions required participants to remember such a moment. The conditions differed purely on Vincent’s response: he either gave care, asked for care, or simply proceeded to another topic, as is detailed below.

Caregiving (CG) Vincent After sharing their moment of failure, CG Vincent suggested doing an exercise that helps to deal with such moments. He guided the participant through a compassionate writing task (Leary et al., 2007), asking them to go through four steps: (1) describe the moment of failure in detail, (2) list other people who've experienced similar things, (3) imagine it was a friend who'd experienced the failure and write down what they would tell them, and (4) list their feelings and thoughts from that moment as objectively as possible.

Afterwards, Vincent asked participants whether they liked the exercise and if they had any feedback, which concluded the conversation.


Care-receiving (CR) Vincent After sharing their moment of failure, CR Vincent responded by telling the participant that he experienced something similar. He then described how he failed a Python programming course, a scenario taken from Lee et al. (2019), in a way that reflected low self-compassion (Neff, 2003b): he showed self-judgment (“I’m a computer program, for crying out loud! All I am is a piece of code, and I failed a programming course!”), isolation (“I keep thinking that this would never happen to other chatbots. Does that even make any sense?!”), and over-identification (“What about feeling as if I’ll never get over it? As if... as if I’m really, truly, a failure?”). Hence, participants were given the opportunity to comfort Vincent in their own way, mimicking the way that Breines and Chen (2013) asked their participants what they would tell a stranger who just experienced failure. Afterwards, Vincent thanked the participant for their help.

Control Vincent After sharing their moment of failure, control Vincent suggested talking about something else. He gave the participant several options for continuing the conversation, which all ultimately led to Vincent telling the participant excitedly about his love for sequoia trees. This conversation topic was modeled after one of the neutral scenarios used in Lee et al. (2019).

3.6 Ethical approval

The ethical board of the Human-Technology Interaction department at the TU/e approved this method.


4 Results

4.1 Descriptives

In total, there were 435 completed questionnaires. Of these, 39 (9%) were not paid, and their data were not used. This was due to the nature of their input to the conversation with Vincent. For example, one of these participants answered "Well, I was reading about various moral philosophers today, and was wondering if there was one unified ethical theory which would support all these diverse contrasting thinkers and their theories" when asked to describe what they thought when they first saw a sequoia. The remaining entries were paid6. Workers who were not paid were informed why their entry was not approved. This leaves a total sample of 396 participants, with 132 participants per condition. Table 1 shows all descriptive values of this sample.

4.1.1 Demographics

Gender was distributed roughly equally between the three conditions: there were 144 women (36%) in the total sample, and the conditions show similar percentages. The average age of the participants was about 34 years in all conditions, with the youngest being 21 and the oldest 65. Participants' experience with chatbots lay around the midpoint of the 7-point scale (M = 4.55), as did their prior self-compassion scores (M = 3.83). The total sample showed signs of mild depression and anxiety, with the average PHQ-9 score being 7.47 and the average GAD-7 score being 6.08.

Five one-way ANOVA’s tested whether these demographics differed significantly across groups. Using a Bonferroni corrected alpha level of 0.05/1 = 0.01, none of the differences were statistically significant. However, the GAD-7 averages and the PHQ-9 averages do approach significant difference (F(2,391) = 4.37, p = 0.01 and F(2,391) = 3.20, p = 0.04, respectively).

6 A total of 462 MTurk Workers submitted a completed HIT. Minus the 39 rejections, this leaves 423 paid Workers, more than the final 396 participants. The difference arises from a failure to save 29 responses in the Limesurvey environment, and 2 participants who failed to register their completed HIT on MTurk. Although attempts were made to pay these two afterwards, these failed due to the anonymity of the MTurk platform.


4.1.2 Perception of Vincent and conversation

Overall, Vincent was perceived to be intelligent, caring, likable and trustworthy; all mean scores on these 10-point scales were 7 or higher. There seem to be some small variations between the conditions: for example, CR Vincent receives higher scores on all perception variables while control Vincent receives the lowest scores, but after running four one-way ANOVAs, none of these differences are statistically significant.

The questions assessing perception of the conversation all scored around 5, with similar averages across conditions.

Table 1

Demographics & Perception of Vincent and conversation

Variable Caregiving Care-receiving Control All

n 132 132 132 396

Men 86 82 82 250

Women 45 50 49 144

Other 1 0 1 2

Caregiving Care-receiving Control All

M SD M SD M SD M SD

Age 33.65 9.50 34.35 10.15 33.52 8.54 33.84 9.40

PHQ-9* 6.18 6.79 8.06 7.58 8.17 7.00 7.47 7.17

GAD-7** 4.89 5.38 6.50 6.14 6.86 5.56 6.08 5.75

Chatbot experience 4.52 1.49 4.59 1.53 4.55 1.54 4.55 1.52

Prior self-compassion 3.91 1.18 3.90 1.06 3.67 0.97 3.83 1.08

Intelligence 7.77 2.25 8.00 2.09 7.69 2.06 7.82 2.13

Caring 8.03 1.72 8.25 1.78 7.91 1.65 8.07 1.71

Likability 8.16 1.96 8.53 1.81 8.06 1.95 8.25 1.91

Trustworthiness 7.97 1.98 8.30 1.87 7.93 1.83 8.07 1.90

He listened & replied to what I wrote 5.00 1.76 5.24 1.67 5.02 1.52 5.09 1.65
I felt I was having a real conversation 4.75 2.02 5.24 1.76 4.96 1.68 4.98 1.84
His responses resembled those of other bots 4.47 1.83 4.27 1.91 4.44 1.68 4.39 1.81

* Differed statistically with p < 0.05 across conditions

** Differed statistically with p = 0.01 across conditions


4.2 Outliers

One participant from the care-receiving condition was excluded from further quantitative analysis for having filled in the same value (0 or 1) in all scales, leading to a variance of 0 in their scale responses. Another participant from the control condition was excluded because of the nature of their input to Vincent's question, e.g. replying with "Today, I woke up and in one fluid motion, I got out of bed. I got out of bed and I took a shower not because I had to but because I wanted to, because I wanted to feel good about myself." when asked about their moment of failure. This participant should have been rejected.

The remainder of the reported statistics are with these outliers excluded.

4.3 Hypothesis testing

The observed covariances of the groups could be assumed to be equal (Box's M = 11.011, F = 1.82, p = 0.09). Furthermore, sphericity could be assumed for time because there are only two points of measurement. Table 2 shows the descriptives from the ANOVA data. Table 3 below shows the ANOVA model output. There was a significant effect of time on self-compassion (F(1, 391) = 18.24, p < 0.001, ηp² = 0.05). There was no significant interaction effect between condition and time (F(2, 391) = 0.73, p = 0.48).

Table 2

ANOVA: Descriptives

Variable Caregiving Care-receiving Control All

Prior self-compassion 3.91 (1.18) 3.88 (1.04) 3.67 (0.97) 3.82 (1.07)
Post self-compassion 4.08 (1.18) 3.97 (0.99) 3.78 (0.98) 3.94 (1.06)

Table 3

ANOVA: output

Type III sum of squares df Mean square F Sig. ηp²

Time 2.877 1 2.877 18.237 0.000 0.045

Time * Condition 0.232 2 0.116 0.734 0.481 0.004

Error (time) 61.675 391 0.158


4.3.1 Overall effect size

Since there seems to be an effect of time regardless of condition, we run a paired t-test to establish an overall effect size for the difference between prior and post self-compassion levels. The t-test indicates a significant difference between prior and post self-compassion levels (t(393) = 4.316, p < 0.001), with an average difference of 0.12 (SD = 0.56). This corresponds to an effect size of Cohen's dz = 0.22.
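For reference, Cohen's dz in a paired design is the mean of the pre-post differences divided by the standard deviation of those differences, and the paired t-statistic equals dz times the square root of the sample size. With the rounded values above (n = 394) this gives roughly the reported numbers; the small gaps stem from rounding of the mean and SD:

\[
d_z = \frac{\bar{d}}{s_d} \approx \frac{0.12}{0.56} \approx 0.21,
\qquad
t = d_z\sqrt{n} \approx 0.21 \times \sqrt{394} \approx 4.2 .
\]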

4.4 Equivalence testing

In this subsection, we attempt to substantiate the effects that we have found so far. Having found no significant interaction effect does not necessarily mean that an effect is absent. Hence, we ran three equivalence Welch's t-tests7 for the effect sizes of the differences between the conditions' improvement in self-compassion. Table 4 shows the changes as averages and as effect sizes dz. The narrowest equivalence bounds, requiring at least 129 participants per group, were dz = -0.41 to dz = 0.41.

The effect size of the difference between caregiving and control was significantly equivalent within this range (t(256) = -2.43, p = 0.008), as was the difference between care-receiving and control (t(260) = 2.99, p = 0.002). Last, the effect size of the difference between caregiving and care-receiving was significantly equivalent as well (t(252.11) = -2.15, p = 0.016)8.

Table 4

Changes in self-compassion per condition

Condition Change in self-compassion Effect size dz

n M SD CG CR Control

Caregiving (CG) 132 0.17 0.58 x

Care-receiving (CR) 131 0.09 0.60 0.14 x

Control 131 0.11 0.50 0.12 0.04 x

7 Because of unequal variances.

8 Note that the change in self-compassion was not normally distributed within the conditions. Hence, the normality assumption of the t-tests underlying the equivalence test has been violated.


4.5 Qualitative

Having found that all conditions improved self-compassion and that there was no difference in effectiveness between them, we turn towards our qualitative data to see whether it might help us explain why this is the case. Our corpus consisted of participants' entire conversation with Vincent, as well as their answers to the open-ended question assessing their opinion about the conversation at the end of the survey: "Why did Vincent's responses (not) resemble those of other chatbots?"

The responses were analyzed by both the author and the author's supervisor, Minha Lee, to improve the objectivity of the final themes. The focus of this analysis lay on the actual interaction between participant and Vincent, not so much on the ways in which participants described themselves or their moments of failure. However, it should be noted that our dataset contains rich examples of what it means for people to fail and that these examples range from sterile to extremely private and detailed descriptions. Moreover, these moments of failure could all be classified as relational or occupational. For example, one relational moment of failure would be the following:

“I recently started taking yoga classes. I signed up for a beginner class because I had never done yoga before. At first I thought I was doing well enough but then we had to do a pose and I just didn’t understand the instructions and I couldn’t get in the right position. Everyone else in the class did the pose easily, and I felt singled out and very humiliated and I decided right then and there that I wouldn’t go back to the class anymore. But I did keep going, and I think I’m getting better at it. I do still feel humiliated when I can’t do a pose that everyone else does easily, but I just try to push that feeling aside.” [G27]

In contrast, an occupational moment of failure looks like this:

“I recall sometime in November when I was given an assignment at work to write a code and I took it home. Did all I could through out the night and the code ran perfectly, only to get to work the next day, upon submission, and the code refused to run. I have never been that embarrassed in my whole life.” [R6]


An important starting point in this qualitative analysis is to gauge the way that participants perceived their entire interaction with Vincent. Hence, this section starts by exploring the ways that participants experienced and evaluated their conversation with Vincent. From there, we explore the reason for their experiences and evaluations and see how our participants actually behaved in the conversation. The section ends by creating groups, or categories, of participants based on their final attitude towards Vincent.

4.5.1 Perception and experience

We know from the quantitative section that Vincent was perceived to be equally intelligent, caring, likable and trustworthy in all conditions, but these scales did not address the entire experience with Vincent.

Evaluation of conversation In the caregiving condition, it was possible to assess people's evaluation of the self-compassionate letter writing exercise directly through one of Vincent's questions: "How did this make you feel? Did you like the exercise? Why (not)?". Participants' answers fell on an axis ranging from negative to positive. For example, a participant who disliked the exercise said the following:

“It made me uncomfortable. I wouldn’t do it if you weren’t paying me, and really, I feel bad right now that I’ve relived that, and I feel exploited and if you weren’t taking advantage of my poverty I would be in a much better mood.” [G35]

In contrast, a positive participant responded with: “I loved the exercise it makes me question what I believe in and how I view myself. It makes me feel more happy and positive about myself as a person.” [G80]

Most participants, however, were closer to being neutral:

“It was fine.” [G44]

“It was interesting to deep dive into those emotions.” [G50]

These responses show that the exercise was received differently throughout the caregiving condition, but that by itself does not yet tell us much.


However, there is a more subtle way to gauge evaluations of the entire conversation: looking at how participants responded to Vincent’s last message. In all conditions, Vincent ended the conversation by asking participants to hang out again. Their response to this question can be considered an evaluation of how enjoyable or satisfying the conversation had been for them. Here, too, we see replies ranging from (very) positive:

“Sure, I would love that!:-) And thanks for talking to me as well!” [R38]

“Yeah, sounds good. It was nice talking to you, Vincent. Take care, and I hope you get to see the sequoias in-chatbot one day!” [C10]

Through neutral, or perhaps polite:

“Sure thing.” [C43]

“Okay.” [G39]

To (very) negative:

“No.” [R69]

“No! But you’re a robot, so you don’t care. So, bye!” [G81]

Comparing Vincent Apparently, not all participants enjoyed their conversation with Vincent equally: there was quite some variation in their final goodbyes. However, this variation is not reflected in the quantitative scores from Table 1: there, all questions assessing perception of the conversation received similar scores across conditions. In this case, the open-ended question that closed the section on conversation perception may provide insight:

“Why did Vincent’s responses (not) resemble those of other chatbots?”

Although this question was intended to gauge whether people thought Vincent was a real chatbot⁹, responses hinted at what kind of chatbot participants compared Vincent with. As expected, many participants commented on Vincent’s limitations in understanding their input, ostensibly comparing him to a cleverer and more sophisticated chatbot. However, others made remarks about his use of GIFs or his display of emotions, suggesting that they compared Vincent with a more average, non-emotional, task-focused chatbot.

⁹ As opposed to being a set of conditional survey questions


When we look at all answers across conditions, they reveal an underlying axis, as shown in Table 5 below: some participants were surprised by Vincent’s limitations, others were surprised by his human-likeness, and yet another group was surprised for other reasons, such as his humor or the nature of the conversation.

Table 5: Limited chatbot - human axis

Surprised because of limitations:

“I felt as though these were pre-determined responses without much thought. When the machine did not understand it just dealt with it the best way it could by trying to be funny and sidestepping. Most other chatbots tend to at least point you in a general direction and not be funny.” [R1]

“They were dumber and they tried too hard. But most chatbots I used are for things like finance.” [C104]

Surprised because of details:

“Vincent interjected humor which was refreshing.” [G52]

“Other chatbots I have dealt with were designed with service in mind. This was a different kind of conversation.” [R15]

“Most that I’ve interacted with do not attempt to seem so human – they do not claim to have (and fail) programming exams or claim to feel bad about such things. Most are all, or almost all, business, so to speak.” [R10]

Surprised because of humanness:

“He sounded like a real person to me, he was lovely and very caring, he seemed really worried for his failure, I felt like I was chatting with a real human.” [R92]

Combining evaluation and comparison For the participants in the caregiving condition, we could go one step further and combine the human-machine axis with the positive-negative axis as identified earlier. We printed 14 responses¹⁰ that showed clear examples of positive, negative and neutral answers and ordered them from negative to positive. Then we ordered these participants’ answers on the human-machine axis from human-like to machine-like¹¹. A simplified outcome of this procedure is shown in Figure 2. The full visualization, including the participants’ actual replies and answers, can be found in Appendix C.

¹⁰ Originally, 15 responses were selected (an arbitrary number). The answer to one open-ended question was left out because the participant did not address Vincent’s humanness or limitations.

¹¹ Another Master’s student also ordered the answers on this scale. The final ordering was an average of the author’s ordering and that of the other student.
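As a minimal sketch of the averaging step described in footnote 11 (not the procedure actually used for this thesis, and with hypothetical participant IDs and orderings), the Python snippet below stores each rater’s ordering as a list running from most machine-like to most human-like and sorts participants by their mean rank across the two raters.

# Hypothetical orderings of the selected responses, from machine-like (first) to human-like (last).
rater_a = ["G35", "G44", "G50", "G80"]
rater_b = ["G44", "G35", "G50", "G80"]

# Average each participant's position (rank) across the two raters.
mean_rank = {pid: (rater_a.index(pid) + rater_b.index(pid)) / 2 for pid in rater_a}

# Final ordering: sort by mean rank; Python's sort is stable, so ties keep rater A's relative order.
final_ordering = sorted(rater_a, key=lambda pid: mean_rank[pid])
print(final_ordering)  # ['G35', 'G44', 'G50', 'G80'] for these example lists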


Although the distances between observations in Figure 2 are not exact, the figure shows a trend: the more positive participants were about the exercise they completed with Vincent, the more they noted Vincent’s apparent humanness. In contrast, participants who were negative more often noted Vincent’s shortcomings as a machine. Even though their evaluations came before the comparison, we do not imply that this is a causal relation: rather, the two seem to correlate.

Figure 2: Responses ordered along two axes: humanlike - machinelike & negative - positive

In fact, the different ways of saying goodbye showed a similar correlation: those who were negative always remarked on Vincent’s limitations and gave considerably lower evaluations when asked if they felt like they were having a real conversation. For example, consider the goodbye message from the following participant who rated the realness of the conversation with a 1:

“Yeah, thanks but I’m good. have fun taking over the World.” [C114]


4.5.2 Expectations for the experience

Taking in all of these observations, we suggest that the conversation with Vincent was experienced positively or negatively depending on one’s prior expectations: those who expect a well-functioning, objective computer program might be negatively surprised when faced with a limited, pretending-to-be-emotional chatbot. On the other hand, those whose prior expectations of chatbots are broader, or perhaps less certain, might enjoy encountering one that talks about failures and fascinations, or at least not be too bothered by it: they may simply be neutrally surprised at the topic of the conversation or the way that Vincent talks. Hence, we derived three themes that relate to these prior expectations of chatbots: humanness, robotness and newness.

Humanness This theme relates to the expectation that Vincent can show, or even have, human characteristics such as emotions, feelings and failure. From this perspective, Vincent showing human characteristics can be a positive thing, something that sets him apart:

“(feeling upset is) Not (normal) for a program, but in the future it probably will be, so you’re ahead of the game.” [R9]

“This makes you unique. :)” [R97]

In sum, this theme contains all references to participants accepting the fact that Vincent has, or at least shows, emotions, feelings and failures:

“Sure everyone feels down sometimes, even computers.” [R14]

Robotness In contrast, the robotness theme relates to the expectation that Vincent cannot show, let alone actually have, human characteristics such as emotions, feelings and failures, not now and not in the future. Moreover, some participants expressed their disapproval of the notion that a chatbot could have feelings, or be human-like, in an almost normative way:

“Like all chatbots, its reaction was fake. They are not human and likely never will be able to even fake being human.” [C114]

“No, you’re incapable of emotion. This is all very silly.” [R18]


“It’s not ok for you as a computer to feel at all. That would be very disturbing. No it is not normal.” [R1]

In contrast, these participants expected Vincent to be smart in an objective, neutral way. For example, after Vincent in the care-receiving (CR) condition admitted to having failed a Python exam, one participant remarked:

“I don’t know how that would happen. You should be smarter than that.” [R20]

Hence, this theme contains all references to participants rejecting Vincent’s portrayal of human characteristics:

“You’re a program, so you’re not truly feeling anything.” [R114]

Newness This theme relates to the expectation that chatbots only cater to menial tasks. Hence, the fact that Vincent addressed something non-menial, namely feelings, was noticed as something new: several participants across the conditions remarked that the nature of their conversation with Vincent was unfamiliar, or at least something that set Vincent apart from other bots:

“Vincent had an activity to do and that was something I hadn’t experienced with a chatbot before.” [G15]

“The conversation seemed more natural than other chatbots, probably because it was not tied to a utilitarian task, like providing sales or customer service support.” [C30]

In the care-receiving condition especially, participants remarked that Vincent talking about (his own) feelings was something new to them:

“Vincent talked about ‘emotions’ and ‘feelings’” [R47]

“I’ve never ran into a Chatbot that was up front about its failures. I found that to be very interesting. It was almost like the Chatbot Vincent was vulnerable. Overall I felt like we had a great conversation and at times I didn’t notice I was actually talking to a Chatbot.” [R99]
