
If a chatbot that asks for help proves more effective at stimulating self-compassion, this has serious implications for the design of chatbots for self-compassion. However, the sample size of Lee et al. (2019) was not large enough to make definitive statements about the difference in effectiveness between the two conditions. Their interaction effect between time and condition was non-significant and underpowered (F(1, 62) = 0.580, p = 0.449, ηp² = 0.009), making it unclear whether the difference truly exists. Hence, a logical next step would be to replicate their work with a larger sample size.

2.3.1 Grounded effect size

Unfortunately, the lack of other research into the topic makes it difficult to move forward: when it comes to chatbots for self-compassion, the only available work to base effect size estimations on is that of Lee et al. (2019), but their study was underpowered and did not have a control condition against which to benchmark the performance of their conditions. Using their ηp² = 0.009 to calculate the number of participants needed in a 2-week study would result in a costly endeavour for which there is insufficient grounding in the literature: in essence, there is no grounded effect size for future work on chatbots for self-compassion.

Alternatively, effect size expectations could be based on the works that have reliably tested the effects of caregiving and of care-receiving on self-compassion separately: those of Leary et al. (2007) and Breines and Chen (2013). However, these studies exposed their participants to an intervention only once. If they are used as a basis, a direct replication of the longitudinal setup of Lee et al. (2019) is not feasible; instead, this study should use a single interaction.

2.3.2 New chatbot role

Moreover, although there is an abundance of work on the ways that people treat and perceive the chatbot categories identified by Grudin and Jacques (2019), there are only a few papers about chatbots for emotional needs (Fitzpatrick et al., 2017; Fulmer et al., 2018), and no other papers about "ChatPet" chatbots like the care-receiving Vincent (Lee et al., 2019).

As a result, we do not know much about people's perceptions of chatbots for emotional needs, let alone about how we should design them, despite the suggestion that they may be very effective in helping us.

Hence, this study will address four things: (1) have a sample size large enough to make a powered statement about the difference in effectiveness between caregiving and care-receiving chatbots, (2) thereby establish a grounded effect size of the effectiveness of chatbots for self-compassion for future research, (3) employ a single interaction to cross-check our effects against those of Breines and Chen (2013) and Leary et al. (2007), and (4) add to the understanding of how people treat and behave towards chatbots for emotional needs.

To do so, we will test the effects of a single interaction with Vincent, including a control condition to benchmark caregiving and care-receiving chatbot performance, against the following four hypotheses:

Hypothesis 1a: A single interaction with a chatbot that gives care will improve self-compassion immediately after the interaction

Hypothesis 1b: A single interaction with a chatbot that asks for care will improve self-compassion immediately after the interaction

Hypothesis 1c: A single interaction with a chatbot that does not give or ask for care will not improve self-compassion immediately after the interaction

Hypothesis 2: A single interaction with a chatbot that asks for care will improve self-compassion more than a single interaction with a chatbot that gives care

Qualitative analysis of the conversations with each condition will provide us with information regarding the treatment and perception of each version of Vincent.

3 Method

This experiment has a 3 (condition: caregiving, care-receiving, control; between-subjects) by 2 (time: pre, post; within-subjects) online survey design.

3.1 Data analysis

Data analysis was conducted in Stata IC 14.2 and SPSS 22. The appropriate statistical test to assess the relative effectiveness of three conditions with two points of measurement is a repeated measures ANOVA, with post-hoc contrasts in case the interaction effect is significant. The effect size that this test should be able to detect is taken from Lee et al. (2019), who reported a non-significant interaction effect between time and condition of ηp² = 0.009, equaling Cohen's dz = 0.18. The current study will be powered to find this effect size.
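As a rough cross-check of that conversion, ηp² can be translated to Cohen's f via f = √(ηp² / (1 − ηp²)), and with two measurement points dz ≈ 2f; a small rounding difference from the reported dz = 0.18 is to be expected. A minimal sketch:

```python
import math

def eta_p_sq_to_f(eta_p_sq: float) -> float:
    """Convert partial eta squared to Cohen's f."""
    return math.sqrt(eta_p_sq / (1.0 - eta_p_sq))

f = eta_p_sq_to_f(0.009)   # ~0.095
dz = 2 * f                 # ~0.19, close to the reported dz = 0.18

print(f"f = {f:.3f}, dz = {dz:.2f}")
```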

3.1.1 Sample size

An a priori power analysis was conducted to determine the number of participants required to answer the research question. G*Power was used to perform the power analysis for a repeated measures ANOVA with 3 groups, 2 measurements, 90% power, and an expected effect size of dz = 0.18. The total sample size required is 396, with 132 participants in each condition.
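This G*Power result can be approximated in Python with statsmodels. The sketch below is an approximation, not a replication of G*Power's exact procedure: it assumes G*Power's default correlation of ρ = 0.5 between the repeated measures, which inflates Cohen's f = dz/2 = 0.09 by √(m / (1 − ρ)) = 2 for m = 2 measurements, and it reuses the one-way ANOVA power machinery because the within-between interaction here happens to share the same numerator degrees of freedom (2):

```python
import math
from statsmodels.stats.power import FTestAnovaPower

dz = 0.18
f = dz / 2                             # Cohen's f for the interaction
m, rho = 2, 0.5                        # measurements; assumed correlation (G*Power default)
f_adj = f * math.sqrt(m / (1 - rho))   # repeated-measures adjustment -> 0.18

# Solve for total N at alpha = .05, power = .90, 3 groups (numerator df = 2)
n_total = FTestAnovaPower().solve_power(
    effect_size=f_adj, alpha=0.05, power=0.90, k_groups=3
)
print(round(n_total))  # in the neighbourhood of the reported 396
```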

3.1.2 Equivalence testing

In case the interaction effect is not significant, the differences between the conditions will also be studied using equivalence tests. In traditional Null Hypothesis Significance Testing (NHST) research, the absence of a significant result is often incorrectly reported as the absence of an effect. However, NHST methods only allow the rejection of a hypothesis, not its support: hence, it is impossible to statistically support the hypothesis that the effect is zero. Equivalence tests allow a researcher to test whether the effect falls within a specified range of effect sizes so close to zero that any value within these bounds can be statistically regarded as equivalent to zero (Lakens, 2017).

This paper will make use of the TOST (Two One-Sided T-tests) procedure. To perform a TOST on our data, we need to set the equivalence bounds, which is typically done by determining a smallest effect size of interest (SESOI). An objective justification of our SESOI is not possible, since our hypotheses are not quantifiable theoretical predictions (Lakens, Scheel & Isager, 2018); instead, we subjectively define a SESOI by basing our bounds on our available resources. Although setting the SESOI based on the dz = 0.18 of Lee et al. (2019) would have been preferable, they only had 12% power to actually find it, raising the question of how reliable this estimate is. Moreover, a SESOI based on this estimate would yield a very large sample size that would be impossible to analyze properly within the time frame of this Master's thesis. Instead, using the sample size calculated above, our smallest equivalence bounds become dz = -0.41 to dz = 0.41 (Lakens, 2017).
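As an illustration of how such a test could be run, the sketch below applies statsmodels' two-sample TOST (`ttost_ind`) to simulated pre-post change scores for two conditions of the planned size (132 each), with the standardized bounds of ±0.41 converted to raw units via the pooled standard deviation. The data here are simulated, so the resulting p-value is illustrative only:

```python
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(42)

# Simulated self-compassion change scores for two conditions (no true difference)
n = 132
x1 = rng.normal(loc=0.0, scale=1.0, size=n)
x2 = rng.normal(loc=0.0, scale=1.0, size=n)

# Convert the standardized bounds (dz = +/- 0.41) to raw-score bounds
sd_pooled = np.sqrt((x1.var(ddof=1) + x2.var(ddof=1)) / 2)
low, upp = -0.41 * sd_pooled, 0.41 * sd_pooled

# TOST: a p-value below .05 supports statistical equivalence within the bounds
p_value, lower_test, upper_test = ttost_ind(x1, x2, low, upp)
print(f"TOST p = {p_value:.4f}")
```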