MASTER THESIS PROJECT

Does recording consumer buying behavior influence future buying behavior?

A study of measurement reactivity effect in mobile-based self-reported measurement towards snacks purchasing behavior

In partial fulfilment of the requirements for the degree of Master of Science in Innovation Management

Mohammad Rizky Nur Iman 1297627

Innovation Management

Supervisor 1: dr. S. (Shantanu) Mullick, TU/e, ITEM
Supervisor 2: dr. N. (Néomie) Raassens, TU/e, ITEM

Eindhoven University of Technology Industrial Engineering & Innovation Sciences

October, 2020

(3)

Eindhoven University of Technology

School of Industrial Engineering and Innovation Sciences
Series Master Theses Innovation Management

Keywords: Measurement Reactivity, Mere-measurement, Question-Behavior Effect, Difference in Differences, Consumer Behavior, Mobile-Based Self-Reported


Preface

All praise belongs to Allah: The Lord of the universe; the Compassionate; and the Merciful.

I would never have thought that I would be conducting my master thesis project in such an unprecedented time. This master thesis project was conducted just as the coronavirus pandemic started to hit the globe. Nevertheless, I have received much support from everyone around me, including my supervisors, the university, LPDP, friends, and family. In the next few paragraphs, I would like to express my gratitude to those without whom I would never have been able to accomplish my master thesis project.

First of all, I would like to thank my kind supervisors, Shantanu Mullick and Néomie Raassens. I wanted to do a master thesis that emphasized a quantitative study, as I wanted to brush up my data science skills. I am glad that Shantanu provided me with that opportunity and agreed to become my first supervisor. I also received much feedback from Néomie that was certainly very useful for writing a better thesis.

Next, I would like to express my gratitude to LPDP (Indonesian Endowment Fund for Education), which supported me financially during the master's program. I am honored to be one of LPDP's scholarship awardees, and I hope that I can give my best contribution to my country in the future.

Last but not least, my friends and family deserve my utmost gratitude. To my beloved parents, brothers, and sisters, who have always supported me throughout my life: I could not thank you enough. To my closest friends, whom I consider my own family, you know who you are; thank you for everything.

Mohammad Rizky Nur Iman, Eindhoven, October 2020


Abstract

Measuring behavior may lead to changes in subsequent behavior. This study explores the reactive effect of measuring consumer behavior through a mobile application. We argued that reporting chocolate purchases through a mobile application would lead to a decrease in subsequent chocolate purchases over the long term. Using online panel data of chocolate purchases, we studied the causal effect of this measurement method using recent difference-in-differences techniques. We found no strong evidence for a reactive effect that leads to behavior changes over the long term. This finding indicates that a mobile-based self-reporting technique is not sufficient to modify behavior on its own.


Executive Summary

Measuring behavior may lead to changes in subsequent behavior; measurement is thus reactive. The literature has shown that measurement reactivity may lead to higher product purchase rates and increased normative behavior. However, one form of measurement reactivity is still lacking in the literature: the reactive effect of mobile-based self-reported measurement. The measurement reactivity literature has been dominated by studies of the implications of asking intention and cognition questions on subsequent behavior. Our study explores the reactive effect of a mobile-based self-reported measurement, which does not involve intention or cognition questions.

We explored another type of reactive effect from a different type of measurement, i.e., mobile-based self-reported measurement, which is still limited in the measurement reactivity literature. Furthermore, we studied the reactive effect in isolation, in the absence of other behavioral modification techniques, to fully understand the effect. We also used a longitudinal design with many observations across multiple time periods and a long time frame, giving us the opportunity to examine behavior trends over many time periods, which is lacking in the literature.

We studied the reactive effects of mobile-based self-reported measurement on consumer chocolate (snack) purchasing behavior. The findings can help market researchers better understand their consumers' future behavior while accounting for any possible reactive effects. Moreover, firms and policymakers can utilize the effect to promote desirable behaviors (e.g., promoting healthy snack purchases) if such a reactive effect is proven to exist.


We deployed a causal inference study to understand the causal effect of the measurement method in changing subsequent chocolate purchase behavior. The literature has shown that the reactive measurement effect may lead to a decrease in undesirable behavior. We therefore hypothesized that reporting chocolate purchases through a mobile application would decrease subsequent purchases over the long term.

We used online panel data consisting of consumers' self-reported chocolate purchases. There were two data collections in the panel data: one group reported their weekly grocery shopping, which we regarded as the control group, while the other group used a mobile application to report chocolate purchases, which we regarded as the treated group.

Subsequently, we implemented a difference-in-differences analysis that controls for variation in treatment timing. Furthermore, we developed several models to increase the robustness of our research.

The results of our analysis provided no substantial evidence that a reactive effect is present over the long term. Of the three main models we developed, only one showed a significant effect; however, that model has low reliability, as our robustness check produced different results.

Our findings indicate that a mobile-based self-reporting technique is not sufficient to modify behavior on its own. This implies that other behavioral modification techniques may be necessary to increase its effectiveness, as shown in several behavioral intervention studies.


Table of Contents

Preface
Abstract
Executive Summary
Table of Contents
List of Tables
List of Figures
1. Introduction
2. Literature Review
2.1 The measurement reactivity literature
2.2 The reactive effect regarding desirable, risky, & vice behavior
2.3 The mechanism underlying the reactive measurement effect
2.4 Theoretical background of causal inferencing
2.5 Theoretical background of difference in differences
2.6 Study design in the reactive measurement literature
3. Methodology
3.1 Data settings
3.2 Empirical framework: Difference in differences with multiple time periods
3.3 Analysis tools
4. Data Analysis and Results
4.1 Data preparation
4.2 Data modelling
4.3 Results
4.4 Robustness checks
4.5 Summary of the results
5. Discussion & Conclusion
5.1 Discussion
5.2 Conclusion
7. Tables
8. Figures
9. Equations
10. References
Appendix


List of Tables

Table 1. Summary of measurement reactivity study design
Table 2. Data descriptions of the selected variables
Table 3. Frequency table of sub-categories (chocolate)
Table 4. Frequency table of brand lines (chocolate)
Table 5. Format of barcode descriptions
Table 6. Summary of treatment starting periods
Table 7. Number of time periods according to level of aggregation
Table 8. Summary of models' specifications
Table 9. List of selected covariates from the dataset
Table 10. Augmented parallel trend testing - Model A
Table 11. Summary of p-values for the pre-test of the parallel trends assumption
Table 12. Group average treatment effects - Model A
Table 13. Group average treatment effects - Model B
Table 14. Group average treatment effects - Model C
Table 15. Aggregated treatment effects of Models A, B, & C
Table 16. Comparison of treated and control groups before and after matching - Model A
Table 17. Sample size after propensity matching - Model A
Table 18. Comparison of treated and control groups before and after matching - Model B
Table 19. Sample size after propensity matching - Model B
Table 20. Comparison of treated and control groups before and after matching - Model C
Table 21. Sample size after propensity matching - Model C
Table 22. Aggregated treatment effects of Models A-P, B-P, & C-P
Table 23. ATT estimates of single sub-category models with bi-weekly time periods
Table 24. Summary of ATT estimates utilizing 'frequency of purchase'


List of Figures

Figure 1. Diagram of the weights extraction process
Figure 2. Diagram of the weights imputation process
Figure 3. Average weights purchased over 260 weeks - (a) treated group (b) control group
Figure 4. Average weights purchased over 10 semesters - (a) treated group (b) control group
Figure 5. Average weights purchased over 5 semesters - (a) treated group (b) control group
Figure 6. Average weights purchased over 30 months - (a) treated group (b) control group
Figure 7. Group average treatment effects - Model A
Figure 8. Group average treatment effects - Model B
Figure 9. Group average treatment effects - Model C


1. Introduction

Today, mobile applications provide a novel way to collect consumer data, for example by letting consumers report their purchasing behavior through an application, which we refer to as mobile-based self-reported measurement (e.g., Drott et al., 2016). However, the reliability of such methods is still in question, as the literature suggests a possible reactive effect from the act of measurement itself. This phenomenon is known as the measurement reactivity effect, where the act of measurement results in changes in the people being measured (French and Sutton, 2010). Measurement reactivity may lead to higher purchase rates (Morwitz et al., 1993) or improved normative behavior, such as increased health club attendance (Spangenberg, 1997). Therefore, the critical question is: does reporting your purchasing behavior through a mobile application influence your subsequent behavior?

This study attempts to identify the reactive effects of mobile-based self-reported measurement in the context of chocolate (snack) purchasing behavior. We examined whether reporting chocolate purchases through a mobile application leads to any change in the amount of chocolate purchased subsequently.

The findings can help market researchers better understand their consumers' future behavior while accounting for any possible reactive effects. Moreover, firms and policymakers can utilize the effect to promote desirable behaviors (e.g., promoting the purchase of fewer high-calorie snacks) if such a reactive effect is proven to exist.

Our work offers several contributions to the existing literature. First, the measurement reactivity literature has been dominated by the question-behavior effect (QBE) (French & Sutton, 2010). Our study contributes by exploring another type of reactive effect from a different type of measurement, i.e., mobile-based self-reported measurement. Second, many studies (e.g., Anderson et al., 2001; Tate et al., 2001; Tate et al., 2006; Wing et al., 2006; Duncan et al., 2014; Elbert et al., 2016) have indicated behavioral changes due to mobile (web)-based self-reported measurement. However, these studies incorporated other behavioral modification techniques in the intervention that may have influenced the outcome. We studied the reactive effect in isolation, in the absence of other behavioral modification techniques, to fully understand the effect. Third, we used a longitudinal design with many observations across multiple time periods and a long time frame. Most similar studies in the literature used either lab-based experiments with a short time frame (e.g., hours or days) or field-based experiments with few time periods. We therefore had the opportunity to examine behavior trends over many time periods and a long time frame, which is lacking in the literature.

Using difference-in-differences (DID) analysis, we tested the hypothesis that consumers exposed to a mobile-based self-reported measurement would decrease their chocolate purchases. Our study used a large set of online panel data with a staggered adoption design, in which the units' starting treatment period varies and units remain exposed to the treatment for the rest of the study. Therefore, we used the techniques proposed by Callaway & Sant'Anna (2019) to control for differences in treatment timing.

However, our results do not provide strong evidence to support our hypothesis. After conducting a robustness check, we failed to find a significant effect of mobile-based self-reported measurement on decreasing (or increasing) chocolate purchases. Our results indicate that the act of self-reporting through an application on its own may not be sufficient to produce behavior changes. As suggested by Michie et al. (2009), a combination with other behavioral modification techniques may be necessary to increase its effectiveness in changing behavior.


The rest of this report is organized into four more chapters. The second chapter provides a literature review and theoretical background related to measurement reactivity, causal inferencing, and DID. The DID techniques and the online panel data used for the analysis are described in chapter three. Chapter four presents the results of the several models we developed, alongside the robustness checks. Finally, the theoretical and managerial implications of our findings are discussed in chapter five, alongside our research limitations, future research recommendations, and conclusions.

2. Literature Review

2.1 The measurement reactivity literature

The literature has shown that psychological measurement can affect people's thoughts, feelings, and behavior; such measurement is referred to as 'reactive' (French & Sutton, 2010). French & Sutton (2010) defined measurement reactivity as present when measurement results in changes in the measured behavior. Previous studies have focused on the effect of questioning intentions and cognitions on subsequent behavior, known as the question-behavior effect (QBE) (Sprott et al., 2006).

There are two main streams of study in the QBE literature. The first focuses on the effect of measuring intentions on the resulting behavior, known as the 'mere-measurement effect' (Morwitz et al., 1993). The mere-measurement literature focuses on the influence of measuring intention and satisfaction on subsequent behavior; the majority of studies relate to product purchases or service adoption. Evidence from these studies indicated an increase in purchase rates after subjects were asked about their future purchase intentions (e.g., Morwitz et al., 1993; Fitzsimons and Morwitz, 1996; Chandon et al., 2004). Meanwhile, Dholakia & Morwitz (2002) found that measuring satisfaction also influences purchase behavior, with increased purchasing by satisfied customers and decreased purchasing by dissatisfied customers.

The second stream, known as the 'self-prophecy effect', focuses on the effect of asking self-prediction questions about future behavior on the resulting behavior (Spangenberg & Obermiller, 1996). The majority of these studies relate to promoting socially desirable behavior, such as increased election voting (Greenwald et al., 1987), reduced student cheating (Spangenberg & Obermiller, 1996), increased health club attendance (Spangenberg, 1997), and increased fund donation (Obermiller & Spangenberg, 2000). Furthermore, Spangenberg et al. (2003) found evidence that the effect still holds even when the self-prediction question is asked through mass-communicated media.

Since the early 1990s, the QBE has dominated the measurement reactivity literature. French & Sutton (2010) drew attention to the reactive measurement effect of self-reported measurement. Here we focus specifically on self-reported measurement in which no intention or cognition questions are prompted and the respondent reports without interacting directly with the researchers.

Several healthy-behavior intervention studies that rely on a self-reporting mechanism have shown a significant positive effect in promoting a range of healthy behaviors, such as increased physical activity and decreased unhealthy food intake (e.g., Burke et al., 2003; Rodearmel et al., 2006). Some of these studies utilized a computer- or web-based self-reporting mechanism and showed significant effects on improving behavior (e.g., Anderson et al., 2001; Tate et al., 2001; Tate et al., 2006; Wing et al., 2006). Recently, mobile phones have become more common as a self-reporting tool in behavioral intervention studies and have provided strong evidence of their effectiveness (e.g., Duncan et al., 2014; Elbert et al., 2016; Kerr et al., 2016; Rabbi et al., 2015). Meanwhile, we found one similar study that did not find strong evidence (Helander et al., 2014).

The majority of the interventions in these studies showed significant effects, and the self-reported measurement mechanism may have played a role in influencing subsequent behavior. However, we cannot draw any conclusions, as the interventions did not use the self-reporting mechanism exclusively; other behavioral modification techniques, such as feedback systems and the provision of educational material, were also used. Therefore, a study of the reactive effects of self-reported measurement in isolation from other potential confounders (from other behavioral modification techniques) is still lacking in the literature.

2.2 The reactive effect regarding desirable, risky, & vice behavior

It is also essential to discuss how measurement reactivity might play out in chocolate purchase behavior. Consuming snacks, especially high-calorie snacks, can be considered undesirable behavior (e.g., Robinson et al., 2013). Studies from the QBE literature provide mixed results regarding the effect of measurement on desirable and undesirable behavior (e.g., Sprott et al., 2003; Williams et al., 2006; Fitzsimons et al., 2007; Koletić et al., 2019).

Several studies showed that the reactive effect can increase desirable behavior, such as the increased choice of healthy snacks (Sprott et al., 2003). Another stream of literature focuses on the effect on risky, non-normative, or undesirable behaviors (Williams et al., 2006; Fitzsimons et al., 2007; Koletić et al., 2019). While Koletić et al. (2019) found no evidence of increased pornography use, other research showed an increased likelihood of engaging in undesirable behaviors. A meta-analysis by Wilding et al. (2016) found a small negative effect in research on undesirable health behaviors. Williams et al. (2006) found that the QBE helps increase health exercise (a socially desirable behavior).


Fitzsimons et al. (2007) studied the effect of the QBE on vice behaviors, which draw both positive implicit and negative explicit attitudes, such as illegal drug consumption: people tend to hold positive implicit attitudes (e.g., drugs are enjoyable) alongside negative explicit attitudes (e.g., drugs are harmful to your health). Their study found that respondents tended to engage in negative behavior (i.e., skipping class) when asked intention questions, and that most respondents likely held conflicting attitudes about skipping class. They argued that the QBE could make people more ready to engage in vice behavior.

In this study, purchasing or consuming chocolate can also be argued to be an undesirable behavior and, perhaps, a vice behavior. In this case, the conflicting attitudes would be the temptation of tasty and sweet chocolate (positive implicit attitude) combined with the belief that chocolate is unhealthy when consumed frequently (negative explicit attitude). While Fitzsimons et al. (2007) used illegal drugs as the behavior of interest, it would be interesting to see how the effect plays out for chocolate purchases. Notably, these studies came from the QBE literature; the reactive effect in the context of undesirable or vice behavior in mobile-based self-reported measurement is still unknown.

As it stands, very little can be said about the effect of a mobile-based self-reported measurement on chocolate purchases. There is evidence of reactive effects from the QBE literature, especially regarding undesirable or vice behaviors. However, knowledge regarding the reactive effect of self-reported measurement is still limited. As novel mobile-based data collection methods emerge that do not explicitly prompt intention or prediction questions as in the QBE, it is becoming more important to study the reactive effect of mobile-based self-reported measurement. Existing studies that involve a mobile-based self-reporting mechanism combined the intervention with other behavioral modification techniques such as information provisioning, goal-setting assistance, gamification, and stress management (Michie et al., 2009; French & Sutton, 2010). Hence, our research question is as follows.

RQ: Does mobile-based self-reported measurement affect consumer purchasing behavior towards chocolate snacks?

2.3 The mechanism underlying the reactive measurement effect

A self-reported measurement may have a reactive effect because it may increase respondents' self-monitoring (Capellan et al., 2017). Increased self-monitoring may lead to behavior change aimed at reducing the discrepancy between desired behavior and actual behavior (Snyder, 1979).

Self-monitoring is related to the self-regulatory techniques (Michie et al., 2009). Self-regulation itself can be considered one of the self-management approaches that can effectively promote healthy-behavior interventions (Michie et al., 2009). Under this approach, the underlying mechanism is the control processes (e.g., monitoring behavior, goal setting, feedback mechanisms, and goal evaluation) that drive behavioral change (Carver & Scheier, 1982). Control processes work to reduce any discrepancy between the present state and a reference value. This mechanism intertwines with the theory of cognitive dissonance that underlies the QBE. Cognitive dissonance (Spangenberg & Greenwald, 1999; Spangenberg & Sprott, 2006; Spangenberg, Sprott, Grohmann, & Smith, 2003) is the mechanism most often described as underlying the question-behavior effect. It refers to how a subject tends to reduce the tension caused by a mismatch between actual conditions and the subject's beliefs, knowledge, or opinions. This tension can be strengthened by prompting intention or behavioral prediction questions, followed by a change in behavior. Hence, when a self-reported measurement stimulates self-monitoring, it may influence respondents' subsequent behavior to reduce cognitive tension.

Capellan et al. (2017) attempted to study the self-monitoring that contributes to measurement reactivity. They conducted a qualitative data analysis based on a randomized clinical trial by Wilde et al. (2015), in which both the treated and control groups showed behavioral improvement. Capellan et al. (2017) found that the respondents' self-monitoring had increased during the experiment, which led to the behavioral improvement found in both groups.

Furthermore, French & Sutton (2010) provided further support for a reactive effect when self-monitoring increases. They argued that using a pedometer to record physical activity may increase self-monitoring, leading to a reactive effect. Their argument is supported by Spence et al. (2009), who found that participants who wore pedometers reported more physical activity than those in the control group.

In conclusion, these studies indicate that increased self-monitoring may lead to a reactive effect. We argued that self-monitoring would also increase under a mobile-based self-reported measurement: consumers would become more aware of their purchasing frequency and attempt to decrease it when it is higher than expected. This provided the basis for our hypothesis.

Hypothesis: Consumers who are exposed to a mobile-based self-reported measurement will decrease their chocolate purchases.


2.4 Theoretical background of causal inferencing

Our study focuses on 'causal inferencing', in which the causal effect of one variable on another is measured. Causation differs from 'association' or correlation analysis (Cox, 1992); it describes how one variable affects another variable (Pearl, 2009).

Causality is described according to the causal effect a variable has on an outcome (Holland, 1986; Pearl, 2009; Morgan & Winship, 2014), which is measured as a treatment effect (Imbens & Rubin, 2015). The treatment effect is the difference in the same unit's (e.g., a person, firm, school, or city) potential outcomes under different treatments (e.g., going to school or taking a medicine). The potential outcomes would have to be compared at the same point in time (Imbens & Rubin, 2015), which leads to the fundamental problem of causal inference (Holland, 1986): it is impossible to observe both potential outcomes, as each unit can only be either treated or untreated at any point in time. Therefore, statistical methods are used to estimate the treatment effect.

Randomized experiments have been widely used to estimate the average treatment effect (ATE) (Rubin, 1974), which estimates the difference in potential outcomes across all samples from the population under study (e.g., Frölich & Sperlich, 2019). A sample is divided randomly into two groups: one receiving the treatment (the treated group) and one not receiving it (the control group).

Despite their effectiveness, randomized experiments may not always be the best option because of practicality and ethical issues (Rosenbaum, 2017). In such cases, another type of research may be more suitable: the observational study (Rosenbaum, 2017). In an observational study, the researcher does not influence the treatment assignment and only observes the outcome and any relevant covariates. It is therefore a nonrandomized study of treatment effects, commonly known as a quasi-experimental study (Rosenbaum, 2017). In this case, instead of estimating the ATE, it is more common to estimate the average treatment effect on the treated (ATT), as shown in Equation 1. The ATT is only concerned with the treatment effect for those who are treated.

[Insert Equation 1]
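For reference, a plausible reconstruction of Equation 1 in the potential-outcomes notation used above (the exact notation in the thesis' Equations chapter may differ) is

$ATT = \mathbb{E}[\, Y(1) - Y(0) \mid D = 1 \,]$,

where $Y(1)$ and $Y(0)$ are the potential outcomes with and without treatment, and $D = 1$ indicates a treated unit.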

A common criticism of quasi-experimental studies is that they have no way to balance unmeasured covariates. Notably, one popular technique often used to handle bias from unmeasured covariates is the 'difference-in-differences' (DID) method (Card & Krueger, 1993). This quasi-experimental design takes advantage of the time dimension of panel data (Angrist & Pischke, 2009). With a DID analysis, researchers do not need to be concerned about time-invariant unmeasured covariates, making it one of the most popular tools in quasi-experimental studies (Goodman-Bacon, 2018).

2.5 Theoretical background of difference in differences

DID analysis estimates an intervention's effect by comparing changes in the treated and control groups' outcomes over the pre-intervention and post-intervention periods. This approach removes bias from measured and unmeasured covariates, provided the parallel trend assumption holds (e.g., Lechner, 2010).

The assumption states that 'the level of covariates in the treated and control group may differ as long as the changes are the same in both groups over time' (Lechner, 2010). This implies that, had the treated group not been treated, it would have followed the same trend as the control group (Lechner, 2010). Therefore, the difference between the pre-intervention and post-intervention outcomes in the control group may be used as the counterfactual for the treated group, and the difference in differences between the treated and control groups estimates the treatment's causal effect.


The ATT is usually estimated through a regression model, as shown in Equation 2 (Lechner, 2010). The model regresses the outcome $Y_{it}$ on time and treatment status, including their interaction term; the parameter $\beta_3$ on the interaction term is the estimate of the ATT. This regression model is prevalent in standard DID analysis. However, the standard model does not account for variation in when the treated units' treatment starts (Goodman-Bacon, 2018). A two-way fixed-effects (TWFE) estimator is commonly used to account for differences in treatment starting points (de Chaisemartin & D'Haultfœuille, 2018), as shown in Equation 3 (Goodman-Bacon, 2018).

[Insert Equation 2 & 3]
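For reference, the canonical forms these two placeholders most likely refer to (notation assumed, following Lechner, 2010, and Goodman-Bacon, 2018) are:

Equation 2 (standard two-group DID regression): $Y_{it} = \beta_0 + \beta_1 \, \text{Post}_t + \beta_2 \, \text{Treat}_i + \beta_3 \, (\text{Post}_t \times \text{Treat}_i) + \varepsilon_{it}$, where $\beta_3$ estimates the ATT.

Equation 3 (two-way fixed effects): $Y_{it} = \alpha_i + \alpha_t + \beta^{DD} D_{it} + \varepsilon_{it}$, where $\alpha_i$ and $\alpha_t$ are unit and time fixed effects and $D_{it}$ is the treatment dummy.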

In the TWFE approach, the regression is estimated with dummies for the cross-sectional units ($\alpha_i$) and time periods ($\alpha_t$), and a treatment dummy ($D_{it}$). The parameter of interest is $\beta^{DD}$, which is interpreted as the value of the treatment effect. Goodman-Bacon (2018) argued that when the treatment effect varies across time, some of the underlying comparisons may receive negative weights, even though the effects themselves may be positive. In this case, estimating $\beta^{DD}$ may be problematic and not relevant for evaluating treatment effects. In conclusion, $\beta^{DD}$ is not a reliable causal parameter when the treatment effect is not constant (de Chaisemartin & D'Haultfœuille, 2018).

Other studies acknowledged this problem and attempted to account for dynamic treatment effects, where the effect differs across time or treatment-timing groups. Athey & Imbens (2018), de Chaisemartin & D'Haultfœuille (2018), and Abraham & Sun (2018) proposed other methods to estimate $\beta^{DD}$. These three studies decompose the analysis by creating groups of units according to their starting treatment time; they estimate the treatment effect of each group and use those effects to estimate $\beta^{DD}$. Abraham & Sun (2018) named the group treatment effects cohort-specific average treatment effects ($CATT_{e,\ell}$), derived from each group's average change in outcome compared to the never-treated groups. Athey & Imbens (2018) used a similar building-block decomposition; however, their treatment effects were derived from group average changes in outcome due to a change in the starting treatment period. de Chaisemartin & D'Haultfœuille (2018) also used a similar decomposition, but with a more complex model that accounts for units that can switch from treated to untreated; their treatment effects were therefore derived from the group average difference in outcome when treated and untreated.

Callaway & Sant'Anna (2019) provided a different alternative to this issue. While the other studies attempted to estimate the two-way fixed-effects parameter, they proposed a general framework that allows the identification and estimation of treatment effect parameters other than $\beta^{DD}$. Their approach is based on what they define as the group-time average treatment effects: the average treatment effect for group $g$ at time $t$. This is similar to the decomposition of Abraham & Sun (2018); however, Callaway & Sant'Anna (2019) do not attempt to estimate $\beta^{DD}$. They argued that estimating $\beta^{DD}$ would require restrictions on treatment effect heterogeneity, and they avoid this obstacle by focusing directly on the causal parameter of interest. Their study established under what conditions this parameter can be non-parametrically identified (Callaway & Sant'Anna, 2019).

Another advantage of their approach is that the group-time average treatment effects may be aggregated into one causal parameter while controlling for several conditions. The first is selective treatment timing, in which units may choose when to become treated. When this is the case, there is the potential for an anticipation effect (Lechner, 2010), in which units anticipate future exposure to treatment; this may lead to bias, as it may influence their behavior during the pre-treatment period. The second is the dynamic treatment effect, in which the effect of a treatment may depend on the length of exposure to the treatment.

These two aggregation options offered extra benefits to this study. As the data came from an online panel, the respondents had some freedom to choose when to start the treatment, so controlling for selective treatment timing is essential. Moreover, the aggregation that controls for dynamic treatment effects helped us understand the effect of length of exposure. Therefore, this study followed the framework of Callaway & Sant'Anna (2019).

2.6 Study design in the reactive measurement literature

We also examined the types of studies that have been conducted in the reactive measurement literature. Table 1 lists the studies and their study designs: experiment or quasi-experiment. The majority were conducted as experiments; out of twenty-one studies, three were quasi-experiments. Morwitz et al. (1993) used a weight-balanced dataset to compare the percentage of households that made purchases between the treated and control groups. Similarly, Fitzsimons & Morwitz (1996) used weight balancing. Meanwhile, Spangenberg et al. (2003) used an interrupted time series design, which is commonly used in the absence of a control group. Our study proposes a DID analysis to control for unmeasurable covariates, which had not been done in previous studies.

[Insert Table 1]

3. Methodology

3.1 Data settings

The data for this study were obtained from a large international market research firm in the United Kingdom; the same data were used in Dubois et al. (2018) and Mullick & Albuquerque (2017). The data were treated as observational data. There were two sub-collections in the dataset. First, 45,041 households reported their weekly grocery shopping. They were given a scanner with which they scanned the barcodes of the groceries they had purchased; the scanner automatically sent the data to the researchers. The time frame of this dataset was December 2007 until September 2012. This dataset is referred to as the 'at-home' dataset (e.g., Mullick & Albuquerque, 2017).

In total, 8,081,877 purchases were recorded. Each record also contained additional information such as the product category, brand, and barcode description. The households' socio-demographic characteristics were also available; the relevant variables we selected are presented in Table 2.

[Insert Table 2]

After one year, 4,182 of those households were given a mobile application to report chocolate purchases in addition to their weekly grocery shopping. These were purchases made from other sources, such as vending machines, restaurants, and convenience stores. The households typed in the snacks they had purchased, and the application sent the data to the researchers. These households also continued to report their weekly grocery shopping using the scanner-based method. This group of households was considered the treated group, and this dataset is referred to as the 'away-from-home' dataset (e.g., Mullick & Albuquerque, 2017). The respondents who were not given the application were considered the control group. Unlike the first data collection, the time frame of this second data collection ran until January 2014.


3.2 Empirical framework: Difference in differences with multiple time periods

3.2.1 Identification strategy

To determine whether a mobile-based self-reported measurement is reactive, we looked for behavior changes through a DID analysis. We measured changes in the weight of the products purchased by the respondents, reasoning that this provides a better measure of consumer behavior than purchasing frequency alone. Arguably, purchasing frequency does not tell the whole story: a respondent may keep the same purchasing frequency but buy a lower quantity or smaller sizes per transaction. For example, if a respondent routinely purchased five chocolate bars per transaction and decided to purchase only three after being exposed to the treatment, purchasing frequency would not capture this change. Thus, we used the total weight of the products purchased by each respondent as our dependent variable.

We attempted to identify a causal relationship between the treatment and the resulting behavior by following the framework of Callaway & Sant'Anna (2019). Their modification of the DID analysis is based on calculating group-time average treatment effects, $ATT(g,t)$. Each unit is placed into a group $g$ according to its starting treatment period, and $ATT(g,t)$ estimates the treatment effect for group $g$ at time $t$. The calculation is done for every group $g$ independently by utilizing the observed outcomes of the control group. Equation 4 (Callaway & Sant'Anna, 2019) shows the formula for $ATT(g,t)$, the average treatment effect on the treated for group $g$ at time $t$.

[Insert Equation 4]
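For reference, the group-time average treatment effect in Callaway & Sant'Anna (2019) is defined as

$ATT(g,t) = \mathbb{E}[\, Y_t(1) - Y_t(0) \mid G_g = 1 \,]$,

where $G_g = 1$ indicates units first treated in period $g$, and $Y_t(1)$ and $Y_t(0)$ are the potential outcomes at time $t$ with and without treatment (notation reconstructed from their paper; the thesis' Equations chapter may differ in detail).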

In most cases, however, we are interested not in the individual $ATT(g,t)$ but in a more generalized ATT. Callaway & Sant'Anna (2019) therefore provided ways to aggregate the $ATT(g,t)$ into more general causal parameters. The first is a simple average of the $ATT(g,t)$, denoted $\theta$, as shown in Equation 5; this can also be done with a weighting procedure according to the group sizes, denoted $\theta_W$, as shown in Equation 6 (Callaway & Sant'Anna, 2019).

[Insert Equation 5 & 6]
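A plausible reconstruction of these two aggregations, following Callaway & Sant'Anna (2019), with $\mathcal{T}$ denoting the number of time periods:

Equation 5 (simple average over all post-treatment pairs $g \le t$): $\theta = \frac{2}{\mathcal{T}(\mathcal{T}-1)} \sum_{g=2}^{\mathcal{T}} \sum_{t=2}^{\mathcal{T}} \mathbf{1}\{g \le t\}\, ATT(g,t)$

Equation 6 (weighted by group size): $\theta_W = \frac{1}{\kappa} \sum_{g=2}^{\mathcal{T}} \sum_{t=2}^{\mathcal{T}} \mathbf{1}\{g \le t\}\, ATT(g,t)\, P(G = g)$, where $\kappa = \sum_{g=2}^{\mathcal{T}} \sum_{t=2}^{\mathcal{T}} \mathbf{1}\{g \le t\}\, P(G = g)$.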

Callaway & Sant'Anna (2019) argued that these two aggregations are appropriate for summarizing the treatment effect only when the effect is homogeneous across groups and time. They therefore proposed other aggregation methods that can be suited to the context of the study.

First, when units can choose to become treated, the 'selective treatment timing' aggregation is beneficial. In this aggregation, the $ATT(g,t)$ are first aggregated within each group, as shown in Equation 7 (Callaway & Sant'Anna, 2019). Subsequently, the group average treatment effects, $\theta_S(g)$, are combined with consideration of the group sizes, as shown in Equation 8 (Callaway & Sant'Anna, 2019). The summary treatment effect from this aggregation method is denoted $\theta_S$.

[Insert Equation 7 & 8]
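Reconstructed in the same notation:

Equation 7 (average post-treatment effect for group $g$): $\theta_S(g) = \frac{1}{\mathcal{T} - g + 1} \sum_{t=g}^{\mathcal{T}} ATT(g,t)$

Equation 8 (combined across groups, weighted by group size): $\theta_S = \sum_{g=2}^{\mathcal{T}} \theta_S(g)\, P(G = g)$.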

Another aggregation method considers whether the treatment effects are dynamic, i.e., dependent on the duration of the treatment; this is the 'dynamic treatment effects' aggregation. First, the $ATT(g,t)$ are aggregated by the length of exposure to treatment, denoted $e$. Equation 9 (Callaway & Sant'Anna, 2019) gives the average effect of the treatment for units that have been exposed to treatment for $e$ periods. The average over all possible lengths of exposure is then taken with Equation 10 (Callaway & Sant'Anna, 2019). The summary treatment effect from this aggregation method is denoted $\theta_D$.


[Insert Equation 9 & 10]
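Reconstructed in the same notation, with $e$ the length of exposure:

Equation 9 (average effect after $e$ periods of exposure): $\theta_D(e) = \sum_{g=2}^{\mathcal{T}} \mathbf{1}\{g + e \le \mathcal{T}\}\, ATT(g, g + e)\, P(G = g \mid g + e \le \mathcal{T})$

Equation 10 (average over exposure lengths): $\theta_D = \frac{1}{\mathcal{T} - 1} \sum_{e=0}^{\mathcal{T} - 2} \theta_D(e)$; the exact indexing of $e$ may differ in the thesis' Equations chapter.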

3.2.2 Assumptions

As part of the ATT identification process, Callaway & Sant'Anna (2019) impose four assumptions. The first concerns the data type, which needs to be panel data: observations on the same group of units/respondents repeated over a series of time periods (e.g., Baltagi, 2008).

The second assumption is the (conditional) parallel trend assumption. As in a typical DID analysis, the parallel trend assumption needs to hold, here between every group $g$ and the control group; additionally, the assumption needs to hold only after conditioning on some covariates $X$. The third assumption states that once a unit is treated, treatment cannot be reversed in the following periods; in other words, the unit stays exposed to the treatment for the rest of the study.

The last assumption imposes overlap: for every value of the covariates $X$ found in a treated group $g$, there must be a positive probability of observing that value among units that are untreated in one or more periods. It can therefore be problematic when an always-treated unit is present in the panel data; Callaway & Sant'Anna (2019) suggest removing all always-treated units from the analysis.

3.2.3 Estimation strategy

Next, we discuss the approach to estimating the causal parameters. Estimation of $ATT(g,t)$ consists of two steps (Callaway & Sant'Anna, 2019). The first step is to estimate the generalized propensity score for each group $g$ and compute the fitted values for the sample; the calculation of the propensity score follows Equation 11 (Callaway & Sant'Anna, 2019). The fitted values are then inserted into the estimator of $ATT(g,t)$, which is derived from Equation 4 and written more concisely in Equation 12 (Callaway & Sant'Anna, 2019).

[Insert Equation 11 & 12]
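Reconstructed from Callaway & Sant'Anna (2019), where $C = 1$ indicates never-treated (control) units and $Y_{g-1}$ is the outcome in the period just before group $g$ is first treated:

Equation 11 (generalized propensity score): $p_g(X) = P(G_g = 1 \mid X, \; G_g + C = 1)$

Equation 12 (inverse-probability-weighted estimand): $ATT(g,t) = \mathbb{E}\left[ \left( \frac{G_g}{\mathbb{E}[G_g]} - \frac{p_g(X)\,C / (1 - p_g(X))}{\mathbb{E}[\, p_g(X)\,C / (1 - p_g(X)) \,]} \right) (Y_t - Y_{g-1}) \right]$

That is, the change in outcome from period $g-1$ to period $t$ for group $g$ is compared with the same change among control units, reweighted by the propensity score.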

Subsequently, to construct valid inference, Callaway & Sant'Anna (2019) proposed a simple multiplier bootstrap procedure. The advantages of this approach are that it is easy to implement and fast to compute, that an observation from each group is present in every iteration, and that it yields simultaneously valid confidence bands. The bootstrap algorithm is shown in Appendix Algorithm 1 (Callaway & Sant'Anna, 2019).

The standard errors can be obtained from these confidence bands. It is also possible to calculate standard errors from conventional analytical confidence bands; comparing the standard errors from these two types of confidence bands can be used to test the robustness of the model.

3.3 Analysis tools

We used the R programming language in RStudio for all data analyses. Notably, for the DID analysis we used the R package 'did', developed by Callaway & Sant'Anna (2019).
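To make the workflow concrete, a minimal sketch of the pipeline with the 'did' package is given below. The data frame and column names (panel, hh_id, semester, g_start, log_wt, and the covariate names) are hypothetical stand-ins for the variables described in chapter 4, and the argument names follow a recent release of the package (older releases used first.treat.name instead of gname):

library(did)

# Group-time average treatment effects ATT(g,t), conditioning on covariates
out <- att_gt(
  yname   = "log_wt",     # log(weight purchased + 1)
  tname   = "semester",   # time period
  idname  = "hh_id",      # household identifier
  gname   = "g_start",    # first treated period (0 = never treated)
  xformla = ~ n_people + region + social_class + working_status,
  data    = panel,
  bstrap  = TRUE,         # multiplier bootstrap
  cband   = TRUE          # uniform 95% confidence bands
)
summary(out)              # also reports the pre-test of the parallel trends assumption

# Aggregations described in section 3.2.1
aggte(out, type = "group")    # selective treatment timing (theta_S)
aggte(out, type = "dynamic")  # dynamic treatment effects (theta_D)
ggdid(out)                    # plot the ATT(g,t) estimates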

4. Data Analysis and Results

4.1 Data preparation

4.1.1 Data cleansing

First, the at-home dataset was prepared for the analysis. Data cleansing was required to avoid incorrect data and to ensure the correct formats. First, the time stamps needed to be converted into a 'date' format. Second, several product descriptions were written in different forms but were intended to be the same value; specifically, they differed in whether they were written in capital or lowercase letters. Descriptions of the same value written in different forms were aggregated into one description so that the program would not mistake them for different descriptions.

The next step was to filter the data of interest to this study. The dataset contained three major product categories: chocolate confectionery, sugar confectionery, and gum confectionery. The dataset was filtered to the chocolate confectionery category. Within the chocolate category, there were 9 sub-categories. We focused on the sub-categories that consumers purchased routinely, rather than anything too niche or purchased seasonally. Therefore, the dataset was further filtered to the top three sub-categories: Countlines, Tablets, and Bagged Selfline. These sub-categories accounted for 92.6% of the purchases made, as shown in Table 3.

[Insert Table 3]

Furthermore, we examined the parent brands (brand lines) of the products. Again, we wanted to make sure we focused on products that were not too niche or purchased only in a particular season. The at-home dataset contained 230 different brand lines. We filtered out the brand lines that contributed less than 1% of the overall purchases, which left twenty-two brand lines. We then removed brand lines that were private labels, i.e., products provided by a particular supermarket and not available in other stores; with this filter, we removed products that were not available in most stores. Finally, sixteen brand lines remained in the dataset, covering 2,638,620 purchases, or 62.99% of the total purchases in the top three sub-categories, as shown in Table 4.

[Insert Table 4]

4.1.2 Extraction of weights information

We were interested in the weights of the chocolates; however, this information was only available in the barcode descriptions and had to be extracted from them. There were 1,089 different barcode descriptions, presented in various formats, as shown in Table 5. The extraction therefore needed a careful approach.

The weight information was presented in the format 'x' grams, with 'x' being the weight value; 'grams' was denoted by the letter 'g' or 'gm', in either capital or lowercase letters. Moreover, several descriptions contained information on the number of units in the product package, as certain products included more than one unit per package. Consequently, we needed to multiply the weight by the number of units. In some cases, however, the result of this multiplication was also presented in the barcode description, so there was more than one weight value in the description, and we needed to make sure we extracted the correct one.
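As an illustration, the core of this extraction logic could be written in R as follows. The patterns and the helper extract_weight are our own illustrative reconstruction, not the thesis' code; descriptions that already contain the multiplied total would be handled separately, as described above:

library(stringr)

extract_weight <- function(desc) {
  d <- toupper(desc)
  # weight value followed by 'g' or 'gm', e.g. "45G" or "32.5GM"
  w <- as.numeric(str_match(d, "(\\d+(?:\\.\\d+)?)\\s*GM?\\b")[, 2])
  # optional number of units in the package, e.g. the "4" in "4X35G"
  n <- as.numeric(str_match(d, "(\\d+)\\s*X\\s*\\d")[, 2])
  w * ifelse(is.na(n), 1, n)  # NA when the description carries no weight
}

extract_weight(c("MARS 4X35G", "TWIRL 43G"))  # returns 140, 43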

Figure 1 shows a summary of the extraction process. For 883 barcode descriptions, the weight was extracted directly without any multiplication or further examination. 168 barcode descriptions contained information on the number of units: multiplication of units and weight was required for 91 of them, while 76 contained the final total weight. 38 barcode descriptions had no weight information at all; we regarded their weights as missing values.

[Insert Figure 1]

4.1.3 Imputation of missing weights value

In addition to the 38 barcode descriptions containing no weight information, several products had missing barcode descriptions. In total, 1,009,648 purchases had missing weight values, or 38.26% of the total purchases in the at-home dataset. The missing values were not ignorable, as they accounted for more than 10% of the dataset (Hair et al., 1998). Imputation of the missing values was therefore required; imputation is the process of estimating missing values based on values from other variables and/or cases in the sample (Hair et al., 1998).

The imputation method was 'mean substitution' (Hair et al., 1998), which uses the variable's mean value calculated from all available data. However, instead of the mean, we used the median, so that the imputed value would represent a weight actually observed for a product. The imputation went through several steps, as shown in Figure 2.

[Insert Figure 2]

First, we grouped the data according to product brand. There were 255 unique brand names. For each brand, the median weight was obtained and imputed into the purchases with missing weight values. However, 34 brand names had no weight values at all, so their medians were not obtainable. This step imputed 999,526 missing values, leaving 10,122 (0.38% of the dataset) still missing. In the next step, we grouped the data according to product brand line and imputed the missing values in the same manner. After this step, all the remaining missing data were imputed.
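As an illustration, this two-step median substitution could be written with dplyr as follows (data frame and column names are hypothetical):

library(dplyr)

purchases <- purchases %>%
  # Step 1: substitute the brand-level median weight
  # (brands with no observed weights at all remain NA after this step)
  group_by(brand) %>%
  mutate(weight = ifelse(is.na(weight), median(weight, na.rm = TRUE), weight)) %>%
  # Step 2: substitute the brand-line-level median for the remainder
  group_by(brand_line) %>%
  mutate(weight = ifelse(is.na(weight), median(weight, na.rm = TRUE), weight)) %>%
  ungroup()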

4.2 Data modelling

The first part of the data modeling was to assign the treatment starting period. We used the away-from-home dataset to determine the treatment starting period of each household, assuming that the treatment started when the household reported its first purchase through the mobile application. The summary of the treatment starting periods is shown in Table 6, and Figure 3 shows the average weights purchased over time for the treated and control groups.

[Insert Table 6 & Figure 3]

There were 31,632 households that were not exposed to the treatment. Between December 2008 and November 2009, 2,634 households started to report purchases, while 1,311 households started later. We removed these 1,311 households because they started the treatment more than a year after the treatment was assigned, and we argued that they would reduce the quality of the model: one potential problem is the build-up of an anticipation effect (e.g., Lechner, 2010), whereby these households may alter their behavior before using the mobile application, creating a bias when included in the model. It is also important to note that we only analyzed the at-home purchases, not the away-from-home purchases, so that we had a fair comparison between all households, as only the treated respondents reported purchases made away from home.

Working with online panel data posed a challenge for our analysis. Households were instructed to report any purchase they made but did not have to report anything if no purchase was made. Hence, it was difficult to determine when a household did not make any purchase in a particular time period: possibly, the household simply did not report the purchase, in which case we could not treat it as a 'zero' purchase. The nature of an online panel allows this kind of situation to occur. Therefore, we proposed a strategy of developing several models with different levels of time-period aggregation, and we assumed households that did not report any purchase for six months to be inactive.

We aggregated the data by months and by semesters; the resulting numbers of time periods are shown in Table 7. For each household, the number of purchases and the weights of the products were aggregated accordingly. When a household did not report any purchase in a particular time period, we assumed that it did not make any purchase, although it was also possible that it simply did not report its purchases. We therefore examined the purchasing frequency of each household and assumed that households that did not report any purchase for six months or longer were no longer active in the survey. Accordingly, we removed households that did not report any purchase for at least one full semester. This caused a significant reduction in the number of households in the dataset: 6,846 households recorded at least one purchase in every semester, of which 1,082 belonged to the treated group and 5,764 to the control group.

[Insert Table 7]

Finally, we applied a logarithmic (log) transformation to the weights. A log transformation is useful when the standard deviation of the treatment effect estimate is high; an analysis with a log transformation is then more supportive of detecting a treatment effect than one without (Knee, 1995). Since zero values were present in the variable, we computed the transformation as $\log(y + 1)$.
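As an illustration, the semester-level aggregation, the zero-filling, the inactivity filter, and the log transformation could look as follows (column names are hypothetical and match the sketch in section 3.3):

library(dplyr)
library(tidyr)

panel <- purchases %>%
  group_by(hh_id, semester) %>%
  summarise(weight = sum(weight), .groups = "drop") %>%
  # semesters without any reported purchase become zero-purchase rows
  complete(hh_id, semester, fill = list(weight = 0)) %>%
  mutate(log_wt = log1p(weight))  # log1p(y) = log(y + 1)

# keep only households with at least one purchase in every semester
panel <- panel %>%
  group_by(hh_id) %>%
  filter(all(weight > 0)) %>%
  ungroup()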

We developed several models to increase the robustness of the analysis. Each model is described in the next sections, and Table 8 summarizes all the models' specifications.

[Insert Table 8]

4.2.1 Model A

The first model aggregated the time periods into semesters. We kept the data for the whole time frame, December 2007 until September 2012, and thus kept all the remaining 6,846 households in the model, with 1,082 and 5,764 households in the treated and control groups, respectively. 946 households started their treatment in semester 3, and 136 households started in semester 4; they remained exposed to the treatment in all subsequent semesters. We refer to this model as model A. Figure 4 shows the average weights purchased over time for the treated and control groups in model A.

4.2.2 Model B

In the second model, the aggregation level was still semesters, but we selected a specific time frame, from semester 1 until semester 5 (December 2007 - May 2010), and only included the treated group that started the treatment in semester 3. With this specification, we had one treated group with one year of pre-treatment and one year of post-treatment observations. Due to the shortened time frame, 3,333 households that had previously been dropped for not reporting any purchase in at least one of the removed semesters (semesters 6-10) were added back into the model, while the 136 households that started the treatment in semester 4 were removed. In total, 10,043 households remained in the model, with 1,292 in the treated group and 8,751 in the control group. This model is denoted model B. Figure 5 shows the average weights purchased over time for the treated and control groups in model B.

4.2.3 Model C

The third model followed the same time frame and included the same set of households as model B, but the level of aggregation was changed to months, giving 30 months in the whole time frame. This model is referred to as model C. Figure 6 shows the average weights purchased over time for the treated and control groups in model C.

4.3 Results

4.3.1 Assumption checks

Four assumptions needed to hold to conduct the DID analysis. The data used in the study allowed us to accept assumptions one, three, and four directly. First, we used panel data, as required by the first assumption. Second, once a respondent adopted the mobile application, they were instructed to continue using it until the end of the panel study, fulfilling the third assumption. Lastly, all treated respondents started using the mobile application, at the earliest, after approximately one year of the panel study; thus, there were no always-treated units, as required by the fourth assumption.

Furthermore, there was one more assumption that we needed to check: the (conditional) parallel trend assumption. As part of fulfilling the (conditional) parallel trend assumption in the Callaway & Sant'Anna (2019) framework, conditioning on some covariates $X$ was essential to the model. The conditioning was done by estimating the generalized propensity score $p_g(X)$ from the covariates $X$. The generalized propensity score was then used to calculate the weights in the treatment effects estimation. Five possible covariates from the dataset could be used as the conditioning covariates, as shown in Table 9.

[Insert Table 9]

Table 10 shows the results of the Wald-type test of the augmented parallel trends assumption for each conditioning covariate using model A. When conditioning on the number of people, region, social class, and working status, the p-values were non-significant (p-Number of people = 0.063, p-Region = 0.063, p-Social class = 0.087, and p-Working status = 0.088). Thus, we did not reject the augmented parallel trend assumption. Similarly, the augmented parallel trend assumption was not rejected when conditioning on these four covariates simultaneously (p = 0.065). Callaway & Sant'Anna (2019) suggested using more than one covariate $X$.

Therefore, all the models in this study were estimated with the conditioning on these four covariates simultaneously.
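Callaway & Sant'Anna's estimator is implemented in their R package `did`. The sketch below shows how the conditioned $ATT_{g,t}$ estimation could be set up; the column names and the data frame `model_a_data` are the hypothetical ones from the earlier sketches, and the default doubly robust estimation method internally estimates the generalized propensity score from `xformla`.

```r
library(did)

# Group-time ATTs, conditioning on the four covariates; gname identifies
# the first treatment period, with 0 marking never-treated households.
attgt <- att_gt(
  yname   = "log_weight",
  tname   = "semester",
  idname  = "household_id",
  gname   = "first_treated",
  xformla = ~ n_people + region + social_class + working_status,
  data    = model_a_data,
  panel   = TRUE
)
summary(attgt)  # ATT(g,t) estimates with uniform 95% confidence bands
```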

[Insert Table 10]

Table 11 shows the augmented parallel trend test for all the models. For model B, the results indicated that we should not reject the parallel trend assumption (p = 0.489). In model C, we likewise did not find enough evidence to reject the conditional parallel trend assumption (p = 0.550). Thus, we could have some confidence in the reliability of the conditional parallel trends assumption in the pre-treatment periods for models A, B, and C, given the non-significant estimates of the augmented parallel trend test for each model.
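In the `did` package, a Wald pre-test of the (conditional) parallel trends assumption in the pre-treatment periods is computed alongside the $ATT_{g,t}$ estimates; a sketch of inspecting it, assuming the `attgt` object from the earlier sketch (field names per the package's fitted-model object):

```r
# Joint test that all pre-treatment ATT(g,t) estimates are zero; a
# non-significant p-value gives no reason to reject parallel trends.
attgt$W      # Wald statistic
attgt$Wpval  # p-value, also printed by summary(attgt)
```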

[Insert Table 11]

4.3.2 Group average treatment effects

The group average treatment effects ($ATT_{g,t}$) were estimated by conditioning on the selected covariates. All the standard errors reported were constructed with uniform 95% confidence bands.

The full results of the $ATT_{g,t}$ estimates for model A are shown in Table 12. The majority of the $ATT_{g,t}$ estimates were negative, ranging from -0.158 to -0.027. However, two estimates were positive, $ATT_{3,3} = 0.035$ (p = 0.236) and $ATT_{3,5} = 0.030$ (p = 0.382); both positive effects were statistically insignificant. Significant effects were found for 6 out of the 13 $ATT_{g,t}$ estimates with negative effects. The group treatment effects in each period are plotted in Figure 7.

[Insert Table 12 & Figure 7]
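For reference, plots like Figures 7-9 can be produced directly from the fitted object with the `did` package's plotting helper; assuming the `attgt` sketch from above:

```r
# Plot the ATT(g,t) estimates per treatment cohort, with pre- and
# post-treatment periods distinguished in the figure.
ggdid(attgt)
```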

In model B, there were only 5 time-periods included; hence the $ATT_{g,t}$ estimates ran from semester 2 until semester 5, as shown in Figure 8. Additionally, there was only one treated group in this model. Of the three $ATT_{g,t}$ estimates, one was negative, $ATT_{3,4} = -0.058$ (p = 0.085), while two were positive, $ATT_{3,3} = 0.051$ (p = 0.066) and $ATT_{3,5} = 0.033$ (p = 0.286); all the effects were statistically insignificant. The full $ATT_{g,t}$ estimates of model B are presented in Table 13.

[Insert Table 13 & Figure 8]

In model C, the level of aggregation was set to months; therefore, this model incorporated a larger number of time-periods, as shown in Figure 9. The $ATT_{g,t}$ estimates are reported in Table 14. Out of the 93 $ATT_{g,t}$ estimates, 48 were negative, and only one of these was significant, $ATT_{15,26} = -0.368$ (p = 0.044). Furthermore, there were four significant positive effects: $ATT_{17,17} = 0.597$ (p = 0.014), $ATT_{18,16} = 1.143$ (p = 0.004), $ATT_{18,23} = 0.924$ (p = 0.038), and $ATT_{18,24} = 0.913$ (p = 0.028).

[Insert Table 14 & Figure 9]

4.3.3 Aggregated treatment effects

Notably, our main interest was not in the $ATT_{g,t}$ estimates themselves; rather, these estimates were used to construct the aggregated treatment effects (ATT). The ATT estimates for all three models are reported in Table 15. The table consists of the simple weighted aggregation effect ($\theta_W$), the selective treatment effects ($\theta_S$), and the dynamic treatment effects ($\theta_D$). Furthermore, we also report the standard errors with analytical confidence bands to check the robustness of the estimation; the results of the uniform and analytical confidence bands were similar.
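Assuming the `attgt` object from the earlier sketch, the three aggregations map onto `aggte()` as follows; the pairing of the thesis notation with the `type` argument is our reading of Callaway & Sant'Anna's terminology.

```r
# Aggregate the group-time ATTs into the three summary effects of Table 15.
theta_w <- aggte(attgt, type = "simple")   # simple weighted aggregation
theta_s <- aggte(attgt, type = "group")    # selective treatment timing
theta_d <- aggte(attgt, type = "dynamic")  # dynamic (event-study) effects
summary(theta_s)
```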

[Insert Table 15]

In model A, all the ATT estimates were significantly negative and close to each other in magnitude: $\theta_W = -0.083$ (p = 0.013), $\theta_S = -0.072$ (p = 0.010), and $\theta_D = -0.084$ (p = 0.012). Overall, the aggregated treatment effects of model A provided support for our hypothesis.

The ATT results of model B contrasted with those of the other models: all the ATT estimates were positive, although insignificant, with $\theta_W = 0.009$ (p = 0.740), $\theta_S = 0.009$ (p = 0.726), and $\theta_D = 0.009$ (p = 0.716). The effect estimates were identical across aggregation methods because this model had only one treated group. Therefore, the aggregated treatment effects of model B did not provide support for our hypothesis.

Similar to model A, all the ATT estimates from model C were negative: $\theta_W = -0.027$ (p = 0.723), $\theta_S = -0.013$ (p = 0.873), and $\theta_D = -0.040$ (p = 0.620). However, the p-values of all the estimates indicated that there was not enough evidence to reject the null hypothesis. Therefore, the ATT estimates of model C did not provide support for our hypothesis.

4.4 Robustness checks

In the next part of the report, we discuss our robustness checks of the models. First, we balanced the number of households in the treated and control groups through propensity score matching. Second, we checked the results when only considering one chocolate sub-category (i.e., Countlines) and when only considering single-member households. Third, we changed the dependent variable of the analysis from the amount of weight purchased to the frequency of purchase.

Our robustness check approach was to use the same set of models developed in the previous sections (i.e., models A, B, and C). We estimated the ATT using only the selective treatment aggregation method. Additionally, we only report the estimation using uniform confidence bands; as shown in the original models' results, there were no significant differences between the estimates under uniform and analytical confidence bands. Lastly, we report the results under the conditional parallel trend assumption, conditioning on the number of people, region, social class, and working status.

4.4.1 Propensity matching

Because the dataset was unbalanced, with more households in the control group than in the treated group, we added another analysis in which the dataset was balanced. We conducted propensity score matching (e.g., Rosenbaum & Rubin, 1983) to pair households from both groups based on the similarity of their propensity scores.

The propensity score was calculated through a logistic regression model (e.g., Ho et al., 2007) incorporating several covariates. In this case, we included the weight and the frequency of purchase, alongside five sociodemographic covariates (i.e., age, number of people in the household, region, social class, and working status).

Furthermore, we only incorporated the weight and frequency of purchase during the period when no household was treated. The underlying assumption was that the behavior of treated and control households would be similar in the absence of treatment; therefore, we matched the households from both groups that were most similar to each other during the pre-treatment period. Finally, we used the R package 'MatchIt' (Stuart et al., 2011) to conduct the propensity matching, as sketched below.
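A minimal sketch of this matching step with MatchIt, assuming a hypothetical one-row-per-household data frame `pre` that holds a treatment indicator, the pre-treatment purchase behavior (`pre_weight`, `pre_frequency`), and the five sociodemographic covariates:

```r
library(MatchIt)

# Nearest-neighbor matching on a logistic-regression propensity score.
m_out <- matchit(
  treated ~ pre_weight + pre_frequency + age + n_people +
            region + social_class + working_status,
  data     = pre,
  method   = "nearest",  # 1:1 nearest-neighbor matching
  distance = "glm"       # propensity score from logistic regression
)
summary(m_out)           # covariate balance before vs. after matching
matched <- match.data(m_out)  # the balanced sample used for re-estimation
```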


The results of the propensity matching for the three models (A, B, and C) are presented in Tables 16-21. We denote the models with the matching procedure as A-P, B-P, and C-P. The matching procedure succeeded in reducing the differences in the age distribution, average weight purchased, and average purchase frequency between the treated and control groups. Prior to matching, there were no significant differences between the two groups in the other covariates.

[Insert Table 16 – 21]

Subsequently, the ATT estimates are presented in Table 22. Unlike model A, the ATT of model A-P ($\theta_S = -0.034$, p = 0.331) did not provide support for Hypothesis 1. However, this result should be interpreted with care, as the augmented parallel trend test indicated that we should reject the parallel trend assumption in the pre-treatment period. Furthermore, the ATT estimates of model B-P ($\theta_S = -0.058$, p = 0.082) and model C-P ($\theta_S = 0.004$, p = 0.962) were similar to the estimates of models B and C respectively. These findings provided stronger evidence against our hypothesis.

[Insert Table 22]

4.4.2 Single sub-category and single member households

All the previous models included purchases from three sub-categories: Countlines, Tablets, and Bagged Selfline. As a further robustness check, we now included only the Countlines sub-category and added the purchases of private labels back into the model. We compared the results against model B-P, although we changed the time aggregation to bi-weekly. We found similar results between the two models. Furthermore, we extended the robustness check by focusing on single-member households only; the results of the single-member household models were also similar. Table 23 shows the results of the models from this section.
