
4. Data Analysis and Results

4.1 Data preparation

First, the at-home dataset was prepared for the analysis. Data cleansing was required to remove incorrect data and to bring the data into the correct format. First, the timestamp needed to be converted into a 'date' format. Second, several product descriptions were written in different forms but intended to be the same value; specifically, the difference was whether they were written in capital or lowercase letters. Descriptions of the same value written in different forms were aggregated into one description, so that the program would not mistake them for different descriptions.
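The cleaning steps above can be sketched as follows; the record layout, field names, and the 'DD/MM/YYYY' timestamp format are hypothetical, since the original cleaning code is not shown.

```python
from datetime import datetime

def clean_record(record):
    """Parse the raw timestamp into a date and normalise the product
    description, so that differently-cased variants aggregate to one value.

    Assumes a hypothetical 'DD/MM/YYYY' timestamp format and a
    'description' field; the real dataset's layout may differ.
    """
    return {
        "date": datetime.strptime(record["timestamp"], "%d/%m/%Y").date(),
        "description": record["description"].strip().upper(),
    }

raw = [
    {"timestamp": "05/12/2007", "description": "KitKat 4 Finger"},
    {"timestamp": "05/12/2007", "description": "KITKAT 4 FINGER"},
]
cleaned = [clean_record(r) for r in raw]
# Both records now share one upper-cased description.
```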

The next step was to filter the data of interest to this study. The dataset contained three major product categories: chocolate confectionery, sugar confectionery, and gum confectionery. The dataset was filtered to the chocolate confectionery category, within which there were nine sub-categories. We were interested in the sub-categories that consumers purchased routinely, rather than those that were too niche or purchased seasonally. Therefore, the dataset was further filtered to the top three sub-categories: Countlines, Tablets, and Bagged Selfline. These sub-categories accounted for 92.6% of the purchases made, as shown in Table 3.

[Insert Table 3]
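The top-three selection can be illustrated with a small sketch; the field name `subcategory` and the toy counts are hypothetical, chosen only to mirror the selection logic.

```python
from collections import Counter

def top_subcategories(purchases, k=3):
    """Return the k sub-categories with the largest purchase counts."""
    counts = Counter(p["subcategory"] for p in purchases)
    return [name for name, _ in counts.most_common(k)]

# Toy data: purchase counts per sub-category are illustrative only.
purchases = (
    [{"subcategory": "Countlines"}] * 50
    + [{"subcategory": "Tablets"}] * 30
    + [{"subcategory": "Bagged Selfline"}] * 13
    + [{"subcategory": "Seasonal"}] * 7
)
keep = top_subcategories(purchases)
filtered = [p for p in purchases if p["subcategory"] in keep]
```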

Furthermore, we examined the parent brands (brand lines) of the products. Similarly, we wanted to make sure that we focused on products that were neither too niche nor purchased only in a particular season. In the at-home dataset, there were 230 different brand lines. We filtered out the brand lines that contributed less than 1% to the overall number of purchases, which left us with twenty-two brand lines. Subsequently, we removed brand lines that were private labels, i.e., products provided by a particular supermarket and not available in other stores; with this filter, we removed products that were not available in most stores. Finally, sixteen brand lines and 2,638,620 purchases remained in the dataset, or 62.99% of the total purchases from the top three sub-categories, as shown in Table 4.

[Insert Table 4]

4.1.2 Extraction of weight information

We were interested in the weights of the chocolates. However, this information was only available in the barcode descriptions, from which it had to be extracted. There were 1,089 different barcode descriptions, presented in various formats, as shown in Table 5. Thus, the extraction required a careful approach.

First, the weight information was presented in the format 'x' gram, with 'x' being the weight value. The 'gram' was denoted by the letters 'g' or 'gm', in either capital or lowercase form. Moreover, several of the descriptions contained information on the number of units in the product package, as certain products included more than one unit per package. Consequently, we needed to multiply the weight by the number of units. However, in some cases the result of this multiplication was also printed in the barcode description, so that more than one weight value appeared in the description, and we needed to make sure that we extracted the correct one.

Figure 1 shows a summary of the extraction process. A total of 883 barcode descriptions were extracted directly, without any multiplication or further examination. There were 168 barcode descriptions that contained information on the number of units: multiplication of units and weight was required for 91 of them, while 76 contained the final total weight. Finally, 38 barcode descriptions did not contain any weight information; we regarded their weights as missing values.

[Insert Figure 1]
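The three extraction cases above (a plain weight, a multipack requiring multiplication, and a multipack whose total is also printed) could be handled with a pattern-matching routine along these lines; the description formats below are hypothetical examples, not actual barcode descriptions from the dataset.

```python
import re

def extract_weight(description):
    """Extract the total weight in grams from a barcode description.

    Handles three hypothetical formats: a plain weight ('MILK BAR 200G'),
    a multipack needing multiplication ('SNACK 4X45G'), and a multipack
    where the total is also printed ('SNACK 4X45G 180G' -> prefer 180).
    Returns None when no weight information is present.
    """
    desc = description.upper()
    # Multipack: '<units>X<weight>G', optionally followed by a total weight.
    pack = re.search(r"(\d+)\s*X\s*(\d+(?:\.\d+)?)\s*GM?\b", desc)
    if pack:
        rest = desc[pack.end():]
        total = re.search(r"(\d+(?:\.\d+)?)\s*GM?\b", rest)
        if total:  # the final total weight is printed: use it directly
            return float(total.group(1))
        return int(pack.group(1)) * float(pack.group(2))
    single = re.search(r"(\d+(?:\.\d+)?)\s*GM?\b", desc)
    return float(single.group(1)) if single else None
```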

4.1.3 Imputation of missing weight values

In addition to the 38 barcode descriptions containing no weight information, several products had missing barcode descriptions altogether. In total, 1,009,648 purchases had a missing weight value, or 38.26% of the total purchases in the at-home dataset. The missing values were not ignorable, as they accounted for more than 10% of the dataset (Hair et al., 1998).

Imputation of the missing values was therefore required; imputation is the process of estimating a missing value based on values from other variables and/or cases in the sample (Hair et al., 1998). The imputation method was 'mean substitution', which uses the variable's mean value calculated from all available data (Hair et al., 1998). However, instead of the mean, we used the median, so that the imputed data would represent a weight that had actually been observed for a product. The imputation process went through several steps, as shown in Figure 2.

[Insert Figure 2]

First, we grouped the data according to the product brands. There were 255 unique brand names. For each brand, the median of the weights was obtained, and this value was imputed to the purchases of that brand with a missing weight value. However, 34 brand names had no weight values at all, so their medians were not obtainable. In this step, 999,526 missing values were imputed and 10,122 remained missing, or 0.38% of the dataset. In the next step, we grouped the data according to the product brand lines and imputed the remaining missing values in the same manner. After this step, all the remaining missing data were successfully imputed.
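The two-step imputation (brand median first, brand-line median as fallback) can be sketched as follows; the field names are hypothetical.

```python
from statistics import median

def impute_weights(purchases):
    """Impute missing weights with the per-brand median, then fall back
    to the per-brand-line median for brands with no observed weight."""
    for key in ("brand", "brand_line"):
        observed = {}
        for p in purchases:
            if p["weight"] is not None:
                observed.setdefault(p[key], []).append(p["weight"])
        medians = {k: median(v) for k, v in observed.items()}
        for p in purchases:
            if p["weight"] is None and p[key] in medians:
                p["weight"] = medians[p[key]]
    return purchases

# Toy data: brand A2 has no observed weight, so it needs the fallback.
data = [
    {"brand": "A1", "brand_line": "A", "weight": 100},
    {"brand": "A1", "brand_line": "A", "weight": None},  # brand median
    {"brand": "A2", "brand_line": "A", "weight": None},  # brand-line fallback
]
impute_weights(data)
```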

4.2 Data modelling

The first part of the data modelling was to assign the treatment starting period. We used the away-from-home dataset to determine the treatment starting period for each household: we assumed that the treatment started when a household reported its first purchase through the mobile application. A summary of the treatment starting periods is shown in Table 6. Figure 3 shows the average weight purchased over time for both the treated and control groups.

[Insert Table 6 & Figure 3]
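Assigning the treatment starting period from the first reported away-from-home purchase can be sketched as follows; the record layout is hypothetical.

```python
def treatment_start(away_purchases):
    """Map each household to its treatment starting period: the period of
    its first reported away-from-home (mobile app) purchase. Households
    absent from the away-from-home data are untreated and get no entry."""
    start = {}
    for p in sorted(away_purchases, key=lambda p: p["period"]):
        start.setdefault(p["household"], p["period"])
    return start

# Toy data: H1 first reports in period 3, H2 in period 4.
away = [
    {"household": "H1", "period": 3},
    {"household": "H1", "period": 4},
    {"household": "H2", "period": 4},
]
starts = treatment_start(away)
```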

There were 31,632 households that were not exposed to the treatment. Between December 2008 and November 2009, 2,634 households started to report purchases, while 1,311 households started later. We removed these 1,311 households, as they started the treatment more than a year after it was assigned and would therefore influence the quality of the model. One potential problem in this situation is the build-up of an anticipation effect (e.g., Lechner, 2010): these households may alter their behavior before using the mobile application, creating a bias when included in the model. It is also important to note that we only analyzed the at-home purchases, not the away-from-home purchases, so that we had a fair comparison between all the households; only the treated respondents reported the away-from-home purchases they made.

Working with online panel data posed a challenge for our analysis. Households were instructed to report any purchase they had made, but did not have to report anything when no purchase was made. Hence, it was difficult to determine whether a household made no purchase in a particular time-period or simply did not actively report one; in the latter case, we could not take the absence of a report as a 'zero' purchase. The nature of the online panel allows this situation to occur. Therefore, we proposed a strategy of developing several models with different levels of time-period aggregation, and we assumed households that did not report any purchase for six months to be inactive.

We decided to aggregate the data by months and by semesters. The numbers of months and semesters are shown in Table 7. For each household, the number of purchases and the weight of the products were aggregated accordingly. When a household did not report any purchase in a particular time-period, we assumed that it made no purchase, although it was also possible that it simply did not report purchases it made. Therefore, we examined the purchasing frequency of each household and, following the inactivity assumption above, removed households that did not report any purchase for a minimum of one semester. This caused a significant reduction in the number of households in the dataset: 6,846 households recorded at least one purchase in each semester, of which 1,082 belonged to the treated group and 5,764 to the control group.

[Insert Table 7]
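The aggregation and activity filter described above can be sketched as follows, here at the semester level; the field names and toy values are hypothetical.

```python
def aggregate_and_filter(purchases, periods):
    """Aggregate purchase weights per household and semester, treating
    unreported semesters as zero, then drop households with at least one
    semester of no reported purchases (assumed inactive)."""
    totals = {}
    for p in purchases:
        key = (p["household"], p["semester"])
        totals[key] = totals.get(key, 0) + p["weight"]
    households = {p["household"] for p in purchases}
    active = {
        h for h in households
        if all(totals.get((h, s), 0) > 0 for s in periods)
    }
    return {k: v for k, v in totals.items() if k[0] in active}

# Toy data: H2 is silent in semester 2, so it is filtered out.
purchases = [
    {"household": "H1", "semester": 1, "weight": 200},
    {"household": "H1", "semester": 2, "weight": 150},
    {"household": "H2", "semester": 1, "weight": 90},
]
panel = aggregate_and_filter(purchases, periods=[1, 2])
```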

Finally, we applied a logarithmic (log) transformation to the weights. Log transformation is useful when the standard deviation of the treatment effect estimate is high; an analysis with log transformation is then more supportive of a treatment effect than one without (Knee, 1995). Since zero values were present in the variable, we computed the transformation as log(x + 1).
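A minimal sketch of the log(x + 1) transform, which keeps zero purchases at zero while damping the right tail of the weight distribution:

```python
import math

def log_transform(weight):
    """log(x + 1) transform; log1p is the numerically stable form,
    and it maps a zero purchase weight to exactly zero."""
    return math.log1p(weight)
```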

We developed several models to increase the robustness of the analysis. Each model is described in the next sections, and Table 8 summarizes all the models' specifications.

[Insert Table 8]

4.2.1 Model A

The first model aggregated the time-periods into semesters. We kept the data of the whole time-frame, from December 2007 until September 2012, and thus all the remaining 6,846 households, with 1,082 and 5,764 households in the treated and control groups respectively. Of the treated households, 946 started their treatment in semester 3 and 136 in semester 4, and they remained exposed to the treatment in all subsequent semesters. We refer to this model as model A. Figure 4 shows the average weight purchased over time for both the treated and control groups in model A.

4.2.2 Model B

In the second model, the aggregation level was still semesters. However, we selected a specific time-frame, from semester 1 until semester 5 (December 2007 - May 2010), and only included the treated households that started the treatment in semester 3. With this specification, we had one treated group with one year of pre-treatment and one year of post-treatment data. Due to the shortened time-frame, 3,333 households that previously had not reported any purchase in at least one of the removed semesters (semesters 6-10) were added back into the model, while the 136 households that started the treatment in semester 4 were removed. Finally, 10,043 households remained in the model, with 1,292 in the treated group and 8,751 in the control group. This model is denoted as model B. Figure 5 shows the average weight purchased over time for both the treated and control groups in model B.

4.2.3 Model C

The third model followed the same time-frame and included the same set of households as model B. However, the level of aggregation was changed to months; in total, there were 30 months in the whole time-frame. This model is referred to as model C. Figure 6 shows the average weight purchased over time for both the treated and control groups in model C.

4.3 Results

4.3.1 Assumption checks

There were four assumptions that needed to hold for the DID analysis. The data used in the study allowed us to accept assumptions one, three, and four directly. First, we used panel data, as required by the first assumption. Furthermore, once a respondent adopted the mobile application, they were instructed to continue using it until the end of the panel study; therefore, the third assumption was fulfilled. Lastly, all treated respondents started using the mobile application, at the earliest, approximately one year into the panel study. Thus, there were no always-treated units in the panel, as imposed by the fourth assumption.

Furthermore, there was one more assumption that we needed to check: the (conditional) parallel trend assumption. As part of fulfilling this assumption in the Callaway and Sant'Anna (2019) framework, conditioning on some covariates X was essential to the model. The conditioning was done by estimating the generalized propensity score p(X) from the covariates X; this score was then used to calculate the weights in the treatment effect estimation. Five covariates from the dataset could be used as conditioning covariates, as shown in Table 9.

[Insert Table 9]

Table 10 shows the results of the Wald-type test of the augmented parallel trends assumption for each conditioning covariate, using model A. When conditioning on the number of people, region, social class, and working status, the p-values were non-significant (p = 0.063, 0.063, 0.087, and 0.088 respectively). Thus, we did not reject the augmented parallel trend assumption. Similarly, the assumption was not rejected when conditioning on these four covariates simultaneously (p = 0.065). Since Callaway and Sant'Anna (2019) suggested using more than one covariate, all the models in this study were set to condition on these four covariates simultaneously.

[Insert Table 10]

Table 11 shows the augmented parallel trend tests for all the models. The results indicated that we should not reject the parallel trend assumption for model B (p = 0.489). In model C, we likewise did not find enough evidence to reject the conditional parallel trend assumption (p = 0.550). Thus, given the non-significant test statistics, we could have some confidence in the reliability of the conditional parallel trends assumption in the pre-treatment periods for models A, B, and C.

[Insert Table 11]

4.3.2 Group average treatment effects

The group-time average treatment effects, ATT(g, t), were estimated conditioning on the selected covariates. All the standard errors reported were constructed with uniform 95% confidence bands.

The full ATT(g, t) estimates for model A are shown in Table 12. The majority of the ATT(g, t) estimates were negative, ranging from -0.027 to -0.158. However, two estimates were positive, ATT(3, 3) = 0.035 (p = 0.236) and ATT(3, 4) = 0.030 (p = 0.382); both positive effects were statistically insignificant. Significant effects were found for 6 of the 13 negative ATT(g, t) estimates. The group treatment effects in each period are plotted in Figure 7.

[Insert Table 12 & Figure 7]

In model B, only 5 time-periods were included, hence the ATT(g, t) estimates ran from semester 2 until semester 5, as shown in Figure 8. Additionally, there was only one treated group in this model. Of the three ATT(g, t) estimates, one was negative, ATT(3, 5) = -0.058 (p = 0.085), while two were positive, ATT(3, 3) = 0.051 (p = 0.066) and ATT(3, 4) = 0.033 (p = 0.286); all the effects were statistically insignificant. The full ATT(g, t) estimates of model B are presented in Table 13.

[Insert Table 13 & Figure 8]

In model C, the level of aggregation was set to months, so this model incorporated a higher number of time-periods, as shown in Figure 9. The ATT(g, t) estimates are reported in Table 14. Of the 93 ATT(g, t) estimates, 48 were negative, but only one had a significant negative effect (ATT = -0.368, p = 0.044). Furthermore, there were four significant positive effects (ATT = 0.597, p = 0.014; 1.143, p = 0.004; 0.924, p = 0.038; and 0.913, p = 0.028); the corresponding groups and periods are given in Table 14.

[Insert Table 14 & Figure 9]

4.3.3 Aggregated treatment effects

Notably, our main interest was not in the ATT(g, t) estimates themselves; rather, these estimates were used to construct the aggregated treatment effects (ATT). The ATT estimates for all three models are reported in Table 15. The table contains the simple weighted aggregation effect (θ_W), the selective treatment effect (θ_sel), and the dynamic treatment effect (θ_dyn). Furthermore, we also report the standard errors with analytical confidence bands to check the robustness of the estimation; the results with uniform and analytical confidence bands were similar.

[Insert Table 15]

In model A, all the ATT estimates were significantly negative and close to one another: θ_W = -0.083 (p = 0.013), θ_sel = -0.072 (p = 0.010), and θ_dyn = -0.084 (p = 0.012). Overall, the aggregated treatment effects of model A provided support for our hypothesis.

The ATT results of model B contrasted with those of the other models: all the ATT estimates were positive, although insignificant, with θ_W = 0.009 (p = 0.740), θ_sel = 0.009 (p = 0.726), and θ_dyn = 0.009 (p = 0.716). The estimates were the same across aggregation methods because this model had only one treated group. The aggregated treatment effects of model B therefore did not provide support for our hypothesis.

Similar to model A, all the ATT estimates from model C were negative: θ_W = -0.027 (p = 0.723), θ_sel = -0.013 (p = 0.873), and θ_dyn = -0.040 (p = 0.620). However, the p-values indicated that there was not enough evidence to reject the null hypothesis, so the ATT estimates of model C did not provide support for our hypothesis.

4.4 Robustness checks

In the next part of the report, we discuss the robustness checks of the models. First, we balanced the numbers of treated and control households through propensity score matching. Second, we checked the results when considering only one chocolate sub-category (i.e., Countlines) and only single-member households. Third, we changed the dependent variable of the analysis from the 'amount of weight purchased' to the 'frequency of purchase'.

Our approach was to use the same set of models developed in the previous sections (i.e., models A, B, and C). We estimated the ATT using only the selective treatment aggregation method. Additionally, we only report the estimates with uniform confidence bands, since the original models showed no significant differences between the uniform and analytical confidence bands. Lastly, we report the results under the conditional parallel trend assumption, conditioning on the number of people, region, social class, and working status.

4.4.1 Propensity matching

Because the dataset was unbalanced, with more households in the control group than in the treated group, we added an analysis in which the dataset was balanced. We conducted propensity score matching (e.g., Rosenbaum and Rubin, 1983) to pair households from both groups based on the similarity of their propensity scores.

The propensity score was calculated through a logistic regression model (e.g., Ho et al., 2007) incorporating several covariates: the weight and the frequency of purchase, alongside five sociodemographic covariates (i.e., age, number of people in the household, region, social class, and working status). Furthermore, we only incorporated the weight and frequency of purchase from the period in which no household was treated, under the assumption that the behavior of treated and control households would be similar in the absence of treatment. We therefore matched the households from both groups that were most similar to each other during the pre-treatment period. The matching was conducted with the R package 'MatchIt' (Stuart et al., 2011).
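As a rough illustration of 1:1 nearest-neighbour matching on propensity scores: the actual analysis used the R package MatchIt, so this simplified pure-Python stand-in assumes the scores have already been estimated, and the household IDs and score values are hypothetical.

```python
def nearest_neighbour_match(treated, control):
    """Greedy 1:1 nearest-neighbour matching on pre-computed propensity
    scores (household id -> score), without replacement. A simplified
    stand-in for the MatchIt procedure used in the analysis."""
    pairs = []
    available = dict(control)
    for t_id, t_score in sorted(treated.items()):
        if not available:
            break
        # Pick the control household with the closest propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        pairs.append((t_id, c_id))
        del available[c_id]  # matching without replacement
    return pairs

# Toy scores: T1 pairs with C1 (0.62 vs 0.60), T2 with C2 (0.35 vs 0.40).
treated = {"T1": 0.62, "T2": 0.35}
control = {"C1": 0.60, "C2": 0.40, "C3": 0.10}
matched = nearest_neighbour_match(treated, control)
```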


The results of the propensity matching for the three models (A, B, and C) are presented in Tables 16-21. We denote the matched models as A-P, B-P, and C-P. The matching procedure succeeded in reducing the differences in the age proportions, average weight, and average frequency of purchase between the treated and control groups. Prior to matching, there were no significant differences between the groups in the proportions of the other covariates.

[Insert Table 16 – 21]

Subsequently, the ATT estimates are presented in Table 22. Unlike model A, the ATT of model A-P (θ_sel = -0.034, p = 0.331) did not provide support for Hypothesis 1. However, this result should be interpreted with care, as the augmented parallel trend test indicated that we should reject the parallel trend assumption in the pre-treatment period. Furthermore, the ATT estimates of models B-P (θ_sel = -0.058, p = 0.082) and C-P (θ_sel = 0.004, p = 0.962) were similar to those of models B and C respectively. These findings provided stronger evidence against our hypothesis.

[Insert Table 22]

4.4.2 Single sub-category and single member households

All the previous models included purchases from three sub-categories: Countlines, Tablets, and Bagged Selfline. For this robustness check, we now only included the Countlines
