Addressing missing data problems in data-rich environments


Chenming Peng

Advisor: Jaap E. Wieringa

July 29, 2018

Abstract

The authors study how to deal with missing data in data-rich environments. Inspired by Kamakura & Wedel (2000), a general framework based on latent variable models is proposed to analyze missing data. With this framework, the authors develop a generalized latent variable model for clustered, mixed, high-dimensional, and missing data. A simulation and evaluation framework for assessing the imputation accuracy of missing data models is constructed. Simulation studies show that 1) our model yields unbiased estimates, while the model proposed by Kamakura & Wedel (2000) produces both statistically and practically significant bias in estimates, and 2) our model is suitable for different types of missingness generation mechanisms. One empirical study also shows that the model can precisely recover estimates of interest (e.g., correlations among variables).

1 Introduction


for DBM [Database Marketing] applications.” The values of marketing mix variables for competing products and sometimes even the purchased products are missing (Sivarajah et al., 2017).

To deal with missing data, a simple way is the complete-case approach, where only the fully observed observations are analyzed. Although complete-case analysis is readily understood and implemented, it can reduce statistical power and even lead to biased estimators, because the remaining complete observations do not represent the population if the missingness pattern is not generated randomly (Qian & Xie, 2011). A more accurate and efficient approach to tackling missing data should take into account the missingness generation mechanism (Little & Rubin, 2014, p.11).

In terms of the missingness generation mechanism, missing data consist of three types: missing completely at random (MCAR) data, missing at random (MAR) data, and missing not at random (MNAR) data (Little & Rubin, 2014, p.11). Data are MCAR if the probability that a value of a certain variable is missing does not depend on the values of any variable in the data. For example, a consumer who skips one question in a survey by accident produces MCAR data. Data are MAR if the missingness probability depends only on the observed values of the data. For instance, older respondents are less willing to report their salary than younger ones, and the age of every respondent is observed. Data are MNAR if the missingness pattern depends on the missing values themselves. For example, consumers with higher salaries are more unwilling to report their salary.
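The three mechanisms can be illustrated with a small simulation. The sketch below is only an illustration under assumed numbers: the age range, the salary equation, and the logistic missingness functions are all hypothetical and do not come from the paper. It deletes salaries under each mechanism and compares the mean of the remaining salaries with the full-sample mean.

```python
import math
import random


def logistic(v):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def observed_mean(rows, rng, miss_prob):
    """Mean salary among respondents whose salary was not deleted."""
    kept = [salary for age, salary in rows if rng.random() >= miss_prob(age, salary)]
    return sum(kept) / len(kept)


rng = random.Random(1)
# Hypothetical population: age in years, salary in thousands (made-up numbers).
rows = []
for _ in range(20000):
    age = rng.uniform(20, 70)
    rows.append((age, 30 + 0.5 * age + rng.gauss(0, 10)))

true_mean = sum(salary for _, salary in rows) / len(rows)
# MCAR: every salary has the same 30% chance of being missing.
mean_mcar = observed_mean(rows, rng, lambda age, salary: 0.3)
# MAR: missingness depends only on age, which is always observed.
mean_mar = observed_mean(rows, rng, lambda age, salary: logistic((age - 45) / 5))
# MNAR: missingness depends on the salary value that goes missing.
mean_mnar = observed_mean(rows, rng, lambda age, salary: logistic((salary - 52.5) / 5))

print(round(true_mean, 2), round(mean_mcar, 2), round(mean_mar, 2), round(mean_mnar, 2))
```

In this toy setting the complete-case mean stays close to the full-sample mean only under MCAR, which is in line with the warning about complete-case analysis above.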

Developing models for addressing missing data in marketing applications in data-rich environments has to consider three aspects. First, models should be able to accurately impute high-dimensional missing data.¹ In data-rich environments, researchers and practitioners collect and exploit various information to advance the personalization of the marketing mix, leading to high-dimensional data (Chintagunta et al., 2016; George et al., 2016; Wedel & Kannan, 2016). Second, models should be applicable to mixed data, since data usually contain different types of variables (e.g., rating, ordering and frequency) from different sources such as text, image, transaction, and survey data (Wedel & Kannan, 2016). Third, models should take into account customer/market segmentation. Customers have different characteristics, leading to heterogeneous data (Kim et al., 2013; Wedel & Kannan, 2016; Moon & Kamakura, 2017). Therefore, if the model incorporates heterogeneity, it can fit data better.

¹ “Accurately impute” in this paper does not mean that the model can impute missing values that are

To accommodate the three features mentioned above, we develop a generalized latent variable model for clustered, mixed, high-dimensional, and MNAR data. Our work is mainly inspired by Kamakura & Wedel (2000), the first and only study in marketing to propose a creative methodology for dealing with high-dimensional missing data based on generalized latent variable models. However, we demonstrate that their model does not impute missing values well when data meet their assumptions (a) and (b) discussed below. Therefore, before presenting our model specification explicitly, we explain through mathematical reasoning how to impute missing values more accurately based on latent variables, and we propose a framework to tackle MNAR data. After developing our model, we design a simulation and evaluation procedure to numerically explore the capability of the model in two respects:

(i) How much can imputation accuracy be improved relative to Kamakura & Wedel (2000)? Is the improvement practically and statistically significant? Note that answers to these questions not only show the superiority of our model over the model in Kamakura & Wedel (2000), but also justify the validity of the proposed general framework for MNAR data based on latent variable models.

(ii) Can our model also precisely impute missing values when data are MCAR or MAR? This question is relevant, since researchers often do not know which type their data are, and the missingness generation mechanism is statistically untestable (Xie & Qian, 2012). This implies that misspecification can happen. If the performance of our model is robust, then we can still make valid inferences.

With this paper, we aim to make the following contributions:

(i) This is the first study to demonstrate how to analyze MNAR data based on latent variable models through mathematical reasoning and simulation studies.


(iii) This is the first study to develop a generalized latent variable model for clustered, mixed, high-dimensional, and MNAR data. Simulation studies show that 1) our model yields unbiased estimates, while the model proposed by Kamakura & Wedel (2000) produces both statistically and practically significant bias in estimates, and 2) our model developed for MNAR data is also suitable for MAR and MCAR data.

The remainder of this paper is organized as follows. In Section 2, we review existing literature on missing data from marketing and other fields, respectively. In Section 3, we present a general framework based on latent variables for dealing with MNAR data. In Section 4, we explicitly describe the model. In Section 5, we discuss estimation and imputation. In Section 6, we provide a simulation and evaluation framework for assessing imputation accuracy and choose valid measures as evaluation criteria. In Section 7, we test the imputation performance of our models through simulation studies and discuss results. In Section 8, we offer one empirical study using the methodology. Finally, Section 9 concludes.

2 Literature review

Many marketing studies have explored how to deal with missing data. Specifically, Kamakura & Wedel (2000) propose a general framework to impute high-dimensional missing data. Ying et al. (2006) discuss how to leverage missing ratings to improve online recommendation systems. Yang et al. (2010) focus on how to resolve estimation bias caused by unobserved consumer behavior in survey data. Feit et al. (2010) illustrate how to impute missing consumers’ characteristics in conjoint analysis. Qian & Xie (2011) propose a distribution-free method to impute missing covariates. Furthermore, some studies analyze data fusion that can be considered as a special case of MCAR data (e.g., Kamakura & Wedel, 1997, 2000; Gilula et al., 2006; Gilula & McCulloch, 2013; Qian & Xie, 2013). However, in the marketing field, no study has proposed a general framework to analyze MNAR data.


2013; Cui & Dunson, 2014; Valera & Ghahramani, 2014; Valera et al., 2017), they assume that data are MAR or MCAR. Our work is most closely related to Song & Lee (2007), who propose a generalized latent variable model to impute non-ignorable missing data. The differences lie in four points. First, we show that a valid framework based on latent variable models for MNAR data should incorporate the missingness pattern. Second, we account for customer heterogeneity. That is, we consider the fact that customer preferences are diversified, giving rise to market segments and thus segment-dependent model parameters (Allenby & Rossi, 1998). Third, we create a simulation and evaluation framework. Fourth, we discuss the performance of our model through numerical experiments.

3 A general framework for MNAR data

Kamakura & Wedel (2000) develop a generalized latent variable model to impute missing data. Their methodology is innovative but can be further improved. In this section, we show how to improve their model and then develop a general framework based on latent variable models to impute MNAR data. For ease of comparison, we use the same notation as Kamakura & Wedel (2000). Suppose data $Y$ consist of missing and observed parts denoted by $Y^m$ and $Y^o$, respectively. We write matrix $X$ for normally distributed latent variables, matrix $\Lambda$ for factor weights, vector $\phi$ for dispersion parameters, matrix $\Sigma$ for the variance matrix of $X$, and $f(\cdot)$ for the probability function. Different from Kamakura & Wedel (2000), we introduce the missingness pattern denoted by matrix $M$ with elements $m_{nj}$, where $m_{nj} = 1$ if $y_{nj}$ is observed, and 0 otherwise. We denote all parameters by $B$.

The cornerstone of their model consists of two assumptions: (a) “the missing data mechanism depends on the latent factors only”; (b) “all information on the factors is contained in the (partially) observed variables”. Mathematically speaking, assumption (a) implies that $f(M \mid X, Y^o, Y^m) = f(M \mid X)$, that is, $M$ and $(Y^o, Y^m)$ are conditionally independent given $X$. Because of the symmetry property of conditional independence, $f(Y^o, Y^m \mid X, M) = f(Y^o, Y^m \mid X)$. Moreover, assumption (a) implicitly supposes that $M$ contains information about $X$. Otherwise, we cannot posit that $M$ depends on $X$. Assumption (b) means that $f(X \mid Y^o, Y^m, M) = f(X \mid Y^o)$. That is, given the observed values,


must be correlated with $Y^o$, and the information on $X$ carried by $M$ is totally embodied by the information on $X$ carried by $Y^o$. Otherwise, the information on $X$ offered by $M$ cannot be ignored when conditioned upon $Y^o$, implying that assumption (b) does not hold.

The likelihood of the observed data $Y^o$ and $M$ is given by

\begin{align}
L(Y^o, M \mid B) &= \iint f(Y^m, Y^o, X, M \mid B)\, dX\, dY^m \tag{1} \\
&= \iint f(Y^m, Y^o \mid X, M, B)\, f(X, M \mid B)\, dX\, dY^m. \tag{2}
\end{align}

Here Equation (1) involves the data augmentation technique, that is, we first introduce extra variables to construct the desired model and then integrate them out. Equation (2) holds based on the basic property of the conditional distribution: $f(X, Y) = f(X \mid Y) f(Y)$ for any random variables $X$ and $Y$.

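Still using this basic property and assumption (a), Equation (2) reduces to

\begin{align}
L(Y^o, M \mid B) &= \iint f(Y^m, Y^o \mid X, B)\, f(X, M \mid B)\, dX\, dY^m \tag{3} \\
&= \iint f(Y^m \mid Y^o, X, B)\, f(Y^o \mid X, B)\, f(X, M \mid B)\, dX\, dY^m \tag{4} \\
&= \iint f(Y^m \mid Y^o, X, B)\, f(Y^o \mid X, B)\, f(M \mid X, B)\, f(X \mid B)\, dX\, dY^m. \tag{5}
\end{align}

It follows that the loglikelihood is

\begin{equation}
l(Y^o, M \mid B) = \log L(Y^o, M \mid B) = \log \iint f(Y^m \mid Y^o, X, B)\, f(Y^o \mid X, B)\, f(M \mid X, B)\, f(X \mid B)\, dX\, dY^m. \tag{6}
\end{equation}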


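Replacing $B$ with its components and denoting the parameters of the model $f(M \mid B)$ by $\Omega$, Equation (6) can be rewritten as

\begin{equation}
l(Y^o, M \mid B) = \log f(M \mid \Omega) + \log \iint f(Y^m \mid Y^o, X, \Lambda, \phi)\, f(Y^o \mid X, \Lambda, \phi)\, f(X \mid \Sigma)\, dX\, dY^m. \tag{7}
\end{equation}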

In fact, $f(M \mid X, B) = f(M \mid B)$ and assumption (a) together signify that the missingness pattern is independent of the latent variables and of the missing and observed parts, and thus the data are MCAR. Now, if we assume $f(M \mid X, B) = f(M \mid B)$, Equation (7) holds. The maximization of the second term on the right-hand side of Equation (7) does not involve $M$ or $\Omega$. This implies that we can obtain unbiased estimators for $\Lambda$, $\phi$ and $\Sigma$ by maximizing only the second term. Therefore, the relevant loglikelihood becomes

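\begin{equation*}
l(Y^o \mid \Lambda, \phi, \Sigma) = \log \iint f(Y^m \mid Y^o, X, \Lambda, \phi)\, f(Y^o \mid X, \Lambda, \phi)\, f(X \mid \Sigma)\, dX\, dY^m.
\end{equation*}

It follows that the relevant likelihood is

\begin{equation}
L(Y^o \mid \Lambda, \phi, \Sigma) = \iint f(Y^m \mid Y^o, X, \Lambda, \phi)\, f(Y^o \mid X, \Lambda, \phi)\, f(X \mid \Sigma)\, dX\, dY^m. \tag{8}
\end{equation}

Equation (8) is exactly the likelihood function expressed by Equation (3) in Kamakura & Wedel (2000).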

However, Kamakura & Wedel (2000) claim that the data they analyzed are not MCAR (see page 491). As a result, they cannot impute missing data very accurately, since they omit the information contained in the missingness pattern $M$ about the latent variables $X$. This conclusion sounds counterintuitive if we recall assumption (b) made by Kamakura & Wedel (2000). One question to ask is why we still have to utilize the information of $M$ on $X$, given that we have already assumed that all information about $X$ is incorporated into the observed part $Y^o$. Reviewing the whole reasoning, we can observe that we use $f(Y^o, Y^m \mid X)$ rather than $f(X \mid Y^o, Y^m)$. In other words, even though we have $f(X \mid Y^o, Y^m, M) = f(X \mid Y^o)$ based on assumption (b), the inference does not involve assumption (b), since we want to derive the latent variable model $f(Y^o, Y^m \mid X)$ rather than the model $f(X \mid Y^o, Y^m)$. When unconditional on $Y^o$, $M$ can provide information about $X$. It follows that the information


Taken together, we expect that in comparison to Equation (8), we can more precisely impute missing data based on the following likelihood:

\begin{equation*}
L(Y^o, M \mid B) = \iint f(Y^m \mid Y^o, X, B)\, f(Y^o \mid X, B)\, f(M \mid X, B)\, f(X \mid B)\, dX\, dY^m,
\end{equation*}

which can be simplified as

\begin{equation}
L(Y^o, M \mid B) = \iint f(Y^m \mid X, B)\, f(Y^o \mid X, B)\, f(M \mid X, B)\, f(X \mid B)\, dX\, dY^m, \tag{9}
\end{equation}

if the dependencies among data are assumed not to exist when conditional on latent variables.

To compute the likelihood, we still have one more unanswered question: how to specify $f(M \mid X, B)$. Actually, we can model $M$ in the same way as the data $Y$, since $M$ also provides information for estimating latent variables. Define matrix $Z = (Y, M)$ such that the missing part $Z^m = Y^m$, and the observed part $Z^o$ comprises $Y^o$ and $M$. Equation (9) then can be rewritten as

\begin{align}
L(Z^o \mid B) &= \iint f(Z^m \mid X, B)\, f(Y^o \mid X, B)\, f(M \mid X, B)\, f(X \mid B)\, dX\, dZ^m \tag{10} \\
&= \iint f(Z^m \mid X, B)\, f(Z^o \mid X, B)\, f(X \mid B)\, dX\, dZ^m \tag{11} \\
&= \int f(Z^o \mid X, B)\, f(X \mid B)\, dX, \tag{12}
\end{align}

where Equation (11) follows from Equation (10) because $Y^o$ and $M$ are conditionally independent given $X$, as implied by assumption (a).
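The construction of $Z = (Y, M)$ can be sketched in a few lines. The snippet below is a minimal illustration, assuming missing entries of $Y$ are coded as `None`; the function name `build_z` and the toy data are hypothetical.

```python
def build_z(y_rows):
    """Return (M, Z) for a data matrix Y given as a list of rows.

    M[n][j] is 1 if y_nj is observed and 0 otherwise.
    Z[n] is row n of Y followed by row n of M, so Z has 2J columns.
    """
    m_rows = [[0 if value is None else 1 for value in row] for row in y_rows]
    z_rows = [list(y_row) + m_row for y_row, m_row in zip(y_rows, m_rows)]
    return m_rows, z_rows


# Toy data with N = 3 individuals and J = 2 variables; None marks a missing value.
Y = [[3.2, None], [None, 1.0], [2.5, 0.7]]
M, Z = build_z(Y)
print(M)
print(Z)
```

The missing part of `Z` is exactly the missing part of `Y`, while the indicator columns are always observed, which matches the definition of $Z^m$ and $Z^o$ above.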

4 Model specification

Following the notation defined in the last section, let $n = 1, \dots, N$ denote customers, $k = 1, \dots, K$ market segments, $j = 1, \dots, 2J$ variables, and $p = 1, \dots, P$ latent variables, where $J$ represents the number of columns of $Y$. It follows that $Z$ is an $N \times 2J$ matrix with elements $z_{nj}$, and $X$ an $N \times P$ matrix with elements $x_{np}$. Assume that conditional on $X$ and segment $k$, $z_{nj}$ are independently generated from a particular distribution in the exponential family


where $\phi_j$ is a dispersion parameter; $a(\cdot)$ and $b(\cdot)$ are known functions particular to different distributions (for more details, please refer to Wedel & Kamakura, 2001); $\theta_{njk}$ is the segment-specific canonical parameter defined as

\begin{equation}
\theta_{njk} = \lambda_{0jk} + \sum_{p=1}^{P} x_{np} \lambda_{jpk}, \tag{14}
\end{equation}

where $\lambda_{0jk}$ and $\lambda_{jpk}$ represent segment-specific intercepts and factor weights, respectively; latent values $x_{np}$ are assumed to follow one distribution in the exponential family.
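As a worked instance of Equation (14), the sketch below evaluates the canonical parameter for one individual, one variable and one segment. The function name and the numerical values are hypothetical and chosen only for illustration.

```python
def canonical_parameter(x_n, intercept_jk, weights_jk):
    """Equation (14): theta_njk = lambda_0jk + sum over p of x_np * lambda_jpk."""
    if len(x_n) != len(weights_jk):
        raise ValueError("x_n and weights_jk must both have length P")
    return intercept_jk + sum(x * lam for x, lam in zip(x_n, weights_jk))


# Hypothetical values with P = 2 latent variables.
x_n = [0.5, -1.0]
theta = canonical_parameter(x_n, intercept_jk=-0.5, weights_jk=[0.4, 0.6])
print(theta)  # -0.5 + 0.5 * 0.4 + (-1.0) * 0.6
```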

Denote the observed values of individual $n$ by $z_n^o = (z_{n1^o}, \dots, z_{nJ^o})$, where $1^o$ and $J^o$ denote indices of observed values, and these indices can be individual-specific. Let $c_n$ represent the segment that individual $n$ belongs to, and $f(c_n = k)$ the prior probability of individual $n$ belonging to segment $k$. Assume that such a prior probability is identical for all individuals, implying that $f(c_n = k) = \pi_k$. Note that $\pi_k > 0$ and $\sum_{k=1}^{K} \pi_k = 1$. Based on the general framework shown in Equation (12), the likelihood of the observed values of individual $n$ is

where $x_n = (x_{n1}, \dots, x_{nP})$, and $f(z_{nj^o} \mid x_n, k, B)$ can be derived from Equations (13) and (14).

The likelihood of all observed values is then:

\begin{equation}
L(Z^o \mid B) = \prod_{n=1}^{N} \left\{ \sum_{k=1}^{K} f(c_n = k \mid B) \int f(z_n^o \mid x_n, k, B)\, f(x_n \mid B)\, dx_n \right\}. \tag{15}
\end{equation}

5 Estimation and imputation

5.1 Estimation


are numerically computed. That is, we approximate them by taking $s = 1, \dots, S$ draws from the distribution of latent variables and substituting a summation over the $S$ draws for the integrals over latent variables. Following Wieringa & Verhoef (2007), we choose the number of segments ($K$), the number of latent variables ($P$) and the types of latent variables such that the model yields the smallest adjusted Akaike information criterion (AIC3)

\begin{equation*}
\mathrm{AIC3} = -2 \ln L(Z^o) + 3 \times K,
\end{equation*}

where $K$ in this formula denotes the effective number of parameters rather than the number of segments.
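Replacing an integral over a latent variable by a sum over draws can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the paper: a single standard normal latent variable, one normally distributed outcome, plain random draws rather than any particular set of integration points, and made-up parameter values. In this special case the exact marginal density is known (normal with variance equal to the squared weight plus the error variance), so the approximation can be checked.

```python
import math
import random


def normal_pdf(value, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((value - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def mc_marginal_density(z, intercept, weight, sd, n_draws, rng):
    """Approximate the integral of f(z | x) f(x) over x by averaging f(z | x_s)
    over n_draws draws x_s from the standard normal latent distribution."""
    total = 0.0
    for _ in range(n_draws):
        x_s = rng.gauss(0.0, 1.0)
        total += normal_pdf(z, intercept + weight * x_s, sd)
    return total / n_draws


rng = random.Random(0)
# Hypothetical parameter values for illustration only.
approx = mc_marginal_density(z=0.3, intercept=-0.5, weight=0.4, sd=1.0, n_draws=20000, rng=rng)
exact = normal_pdf(0.3, -0.5, math.sqrt(0.4 ** 2 + 1.0 ** 2))
print(approx, exact)
```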

In practice, we suggest the following steps to specify the model structure:

• Step 1: Find the best number ($P_0$) of factors based on AIC3 without considering heterogeneity (i.e., assume that $K = 1$).

• Step 2: Given $P_0$, determine the best number ($K_0$) of segments.

Such a procedure may lead to an overfitted model, since we determine components of the model structure successively rather than simultaneously. However, overfitting does not impair imputation performance (Vermunt et al., 2008).
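The two-step search can be sketched as below. The sketch assumes that each candidate model has already been estimated and summarized by its loglikelihood and effective number of parameters; the `fits` dictionary holds made-up numbers for illustration only, and `select_structure` is a hypothetical helper name.

```python
def aic3(log_likelihood, n_params):
    """AIC3 = -2 ln L + 3 * (effective number of parameters)."""
    return -2.0 * log_likelihood + 3.0 * n_params


def select_structure(fits, max_p, max_k):
    """Two-step search. fits maps (P, K) to (loglikelihood, number of parameters)."""
    # Step 1: best number of factors with a single segment (K = 1).
    p0 = min(range(1, max_p + 1), key=lambda p: aic3(*fits[(p, 1)]))
    # Step 2: best number of segments given P0.
    k0 = min(range(1, max_k + 1), key=lambda k: aic3(*fits[(p0, k)]))
    return p0, k0


# Made-up fit summaries, not results from the paper.
fits = {
    (1, 1): (-5200.0, 20),
    (2, 1): (-5100.0, 30),
    (3, 1): (-5095.0, 40),
    (2, 2): (-5040.0, 61),
    (2, 3): (-5030.0, 92),
}
p0, k0 = select_structure(fits, max_p=3, max_k=3)
print(p0, k0)
```

Only the models on the search path need to be estimated, which is what makes the successive procedure cheaper than a full grid over $P$ and $K$.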

Unlike the usual applications of latent variable models, we focus on imputation instead of parameter interpretation. In this case, we do not need to impose identification restrictions. This is because although the model is unidentified (i.e., more than one set of parameters $B$ leads to the same distribution $f(Z \mid B)$), the distribution $f(Z \mid B)$ of interest is uniquely defined (Vermunt et al., 2008; Mohamed, 2011, p. 36).

5.2 Imputation procedure

We have

\begin{equation*}
f(Z^m \mid Z^o) = \iiint f(Z^m \mid X, c, B)\, f(X \mid Z^o, c, B)\, f(c \mid Z^o, B)\, f(B \mid Z^o)\, dX\, dc\, dB,
\end{equation*}

where $c = (c_1, \dots, c_N)'$. This means that imputation can be completed by first drawing $\hat{B}$ from $f(B \mid Z^o)$, then $\hat{c}$ from $f(c \mid Z^o, B)$, then $\hat{X}$ from $f(X \mid Z^o, c, B)$, and subsequently $\hat{Z}^m$ from $f(Z^m \mid X, c, B)$. To obtain $\hat{B}$, two options are available in the frequentist framework. One is


bootstrap procedure to draw samples (Vermunt et al., 2008; Honaker & King, 2010). We prefer the second approach since the asymptotic theory can offer an unsatisfactory approximation even when the sample size is of the order of 1000 in some cases (Menezes, 1999; Bartholomew et al., 2011, p. 165-166).

Specifically, first $T$ bootstrap samples from $Z$ are generated, denoted by $Z_1^*, \dots, Z_t^*, \dots, Z_T^*$. Second, for each sample the model is estimated, leading to $T$ sets of parameters $B_1^*, \dots, B_t^*, \dots, B_T^*$. For imputed data set $t$, we sample from $f(Z^m \mid Z^o, B = B_t^*)$ for $t = 1, \dots, T$. In this way, we take the uncertainty of the parameter estimates into account. The conditional distribution $f(Z^m \mid Z^o, B = B_t^*)$ can be rewritten as:

\begin{equation*}
f(Z^m \mid Z^o, B = B_t^*) = \iint f(Z^m \mid X, c, B_t^*)\, f(X \mid Z^o, c, B_t^*)\, f(c \mid Z^o, B_t^*)\, dX\, dc.
\end{equation*}

Therefore, to impute the missing value $Z^m$, we first draw $c$ from

After assigning an individual randomly to one of the $K$ latent classes, we then draw $X$ from

If $X$ is continuous, with Equation (16), we calculate the posterior probabilities for the points that are used for numerical integration, and then sample these points. After sampling $c$ and $X$, it is straightforward to sample $Z^m$ from $f(Z^m \mid X, c, B_t^*)$.
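The bootstrap logic of the procedure can be sketched with a deliberately simplified stand-in model. The sketch below does not implement the latent variable model of this paper: it assumes a single variable whose observed values are modeled as normal, so that "estimating the model" on a bootstrap sample just means computing a mean and a standard deviation. The function name and the toy data are hypothetical.

```python
import random
import statistics


def bootstrap_impute(values, n_imputations, seed=0):
    """Create n_imputations completed copies of values (None marks a missing entry).

    For each copy: resample the rows with replacement, re-estimate the toy model
    (mean and standard deviation of the observed values) on the bootstrap sample,
    then draw every missing entry from the re-estimated distribution.
    """
    rng = random.Random(seed)
    n = len(values)
    completed_sets = []
    for _ in range(n_imputations):
        boot = [values[rng.randrange(n)] for _ in range(n)]
        boot_observed = [v for v in boot if v is not None]
        mean_t = statistics.mean(boot_observed)
        sd_t = statistics.stdev(boot_observed)
        completed_sets.append(
            [rng.gauss(mean_t, sd_t) if v is None else v for v in values]
        )
    return completed_sets


# Toy data: 40 rows, every fourth value missing (illustrative numbers).
data = [None if i % 4 == 0 else 1.0 + 0.1 * i for i in range(40)]
completed = bootstrap_impute(data, n_imputations=10)
print(len(completed), completed[0][:4])
```

Because the parameters are re-estimated on a fresh bootstrap sample for every imputed data set, the imputations differ across data sets by more than pure sampling noise, which is how parameter uncertainty enters.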

6 A general simulation and evaluation framework


Collins et al. (2001) and Shah et al. (2014), we propose a framework of simulation and evaluation shown in Fig. 1. More specifically:

• Step 1: Specify the data generation model by assigning values to its parameters, including the numbers of segments ($K$), outcome variables ($J$) and latent variables ($P$), and other parameters involved in the prespecified distributions of outcome and latent variables.

• Step 2: Draw $R$ samples of $N$ individuals each by drawing $X$ and $Y$ in sequence.

• Step 3: Obtain estimates of interest in each complete data set, resulting in $R$ estimates. For example, obtain the estimators of regression coefficients using ordinary least squares (OLS).

• Step 4: Generate a missingness pattern $M_r$ ($r = 1, \dots, R$) for each data set based on the chosen missingness generation mechanism. According to $M_r$, make the $r$-th data set incomplete.

• Step 5: Obtain $T$ imputed data sets for each incomplete data set using imputation models such as the model proposed by this paper or the model constructed by Kamakura & Wedel (2000).

• Step 6: Compute estimates of the same parameters as estimated in Step 3 for each incomplete data set using the corresponding $T$ imputed data sets, based on the inference principles proposed by Rubin (2004).

• Step 7: Assess the imputation accuracy in terms of the evaluation criteria detailed below.
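The pooling in Step 6 can be sketched with the standard multiple-imputation combining rules usually attributed to Rubin: the pooled estimate is the average over the $T$ imputed data sets, and the total variance is the average within-imputation variance plus $(1 + 1/T)$ times the between-imputation variance. The function name and the input numbers below are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Combine estimates from T imputed data sets.

    Returns the pooled estimate and its standard error, where
    total variance = within + (1 + 1/T) * between.
    """
    t = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean([se ** 2 for se in std_errors])
    between = statistics.variance(estimates)
    total = within + (1.0 + 1.0 / t) * between
    return pooled, math.sqrt(total)


# Made-up estimates of one regression coefficient from T = 3 imputed data sets.
pooled, pooled_se = pool_rubin([0.10, 0.20, 0.30], [0.05, 0.05, 0.05])
print(pooled, pooled_se)
```

The between-imputation term is why standard errors after imputation are typically larger than those computed on complete data.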

Figure 1: Simulation and evaluation framework

Evaluation criteria


impute missing values. We expect that the results of experiments with $R = 1$ are convincing since the sample is randomly drawn. If $R > 1$, we derive summary statistics from these $R$ sets of estimates. Specifically, we use paired-sample t tests to check the bias between the estimates of interest obtained in the incomplete data sets by the selected missing data model and those calculated in the complete data sets. A p-value below 0.05 implies that the corresponding estimate is statistically significantly biased. Meanwhile, we investigate practical significance using Cohen’s d, also known as standardized bias, which is one of the most prevalent effect size measures (Murphy et al., 2014, p. 15). Cohen’s d is defined as the average bias divided by its standard deviation. The smaller Cohen’s d is, the less biased the estimator is. As a rule of thumb, if the absolute (Abs.) value of Cohen’s d exceeds 0.4, the estimator is economically biased (Collins et al., 2001).
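Both criteria are simple functions of the $R$ paired differences. The sketch below computes Cohen's d as defined above together with the paired-sample t statistic; the p-value additionally requires the t distribution with $R - 1$ degrees of freedom, which the Python standard library does not provide, so it is left out here. The function name and the five pairs of estimates are made up for illustration.

```python
import math
import statistics


def paired_bias_summary(imputed_estimates, complete_estimates):
    """Return (Cohen's d, paired t statistic) for the bias of imputed-data estimates.

    Cohen's d = mean bias / standard deviation of the bias.
    t = mean bias / (standard deviation of the bias / sqrt(R)).
    """
    diffs = [a - b for a, b in zip(imputed_estimates, complete_estimates)]
    mean_bias = statistics.mean(diffs)
    sd_bias = statistics.stdev(diffs)
    cohens_d = mean_bias / sd_bias
    t_stat = mean_bias / (sd_bias / math.sqrt(len(diffs)))
    return cohens_d, t_stat


# Made-up estimates from R = 5 replications.
complete = [0.14, 0.16, 0.15, 0.13, 0.17]
imputed = [0.15, 0.18, 0.15, 0.15, 0.17]
d, t = paired_bias_summary(imputed, complete)
print(d, t)
```

Note that the t statistic equals Cohen's d times the square root of $R$, so with many replications even a practically small bias can become statistically significant, which is one reason to report both criteria.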

To answer question i (i.e., whether our framework can improve the imputation accuracy of the model proposed by Kamakura & Wedel (2000)), for a fair comparison, it is logical to use their evaluation criterion, that is, the correlation between the actual values and the imputed values (abbreviated as CORRAI). However, CORRAI does not necessarily reflect imputation accuracy. Intuitively speaking, the actual values are just one sample from the population $f(Z \mid B)$. Through the estimation procedure, we obtain estimators $\hat{B}$. Even if $\hat{B} = B$ when the sample size is large enough, the imputed values sampled from $f(Z^m \mid \hat{B})$ can only reflect the properties of the population (e.g., regression coefficients) rather than being the same as the actual values. To show this, we implement one simulation study.

Specifically, suppose that there is only $K = 1$ segment. Draw $P = 1$ latent variable from the standard normal distribution. Then, draw $J = 2$ outcome variables from normal distributions. The sample size is $N = 1500$ individuals such that estimators are quite close to the true parameters. We make 10% of the data MNAR. In each bootstrap sample, we assume that we can obtain the best estimators, which are exactly the true parameters. Therefore, in the imputation procedure, we utilize the true parameters and the true distribution of the latent variable to impute missing values. For each missing value, we generate 50 imputations. In such a setting, the CORRAI is only -0.012. However, both actual values and imputed values yield the same density. Taking the variable $z_1$ (i.e., $y_1$) as an example, Fig. 2 shows that the


Figure 2: Kernel Density of y1
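The point that a low CORRAI can coexist with a correct imputation distribution can be reproduced in a simplified setting. The sketch below is not the experiment reported above: it assumes one standard normal latent variable, two normal outcomes with hypothetical weights 0.4 and 0.5 and unit error variances, values of $y_1$ deleted completely at random rather than MNAR, and one imputation per missing value drawn from the true conditional distribution of $y_1$ given $y_2$.

```python
import math
import random
import statistics


def pearson(a, b):
    """Sample Pearson correlation between two equally long lists."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(42)
lam1, lam2 = 0.4, 0.5  # hypothetical factor weights, unit error variances
var_y1 = lam1 ** 2 + 1.0
var_y2 = lam2 ** 2 + 1.0
cov_12 = lam1 * lam2
slope = cov_12 / var_y2  # E[y1 | y2] = slope * y2
cond_sd = math.sqrt(var_y1 - cov_12 ** 2 / var_y2)  # sd of y1 given y2

actual, imputed = [], []
for _ in range(20000):
    x = rng.gauss(0, 1)
    y1 = lam1 * x + rng.gauss(0, 1)
    y2 = lam2 * x + rng.gauss(0, 1)
    if rng.random() < 0.10:  # y1 is deleted for about 10% of individuals
        actual.append(y1)
        imputed.append(rng.gauss(slope * y2, cond_sd))  # draw with true parameters

corrai = pearson(actual, imputed)
sd_actual = statistics.stdev(actual)
sd_imputed = statistics.stdev(imputed)
print(round(corrai, 3), round(sd_actual, 3), round(sd_imputed, 3))
```

Under these assumptions the correlation between actual and imputed values is close to zero even though the imputations are drawn from the correct distribution, while the spread of the imputed values matches that of the actual values.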

7 Simulation studies

In this section, we carry out two simulation studies to assess the imputation performance of our model in two aspects: improvement over Kamakura & Wedel (2000) and applicability to MAR and MCAR data.

7.1 Question i: improvement of Kamakura & Wedel (2000)

7.1.1 Experiment setting

To answer question i, we design simulation experiments. In particular, in Step 1, suppose that there is $K = 1$ segment, $P = 1$ latent variable following the standard normal distribution, and $J = 2$ normally distributed outcome variables (i.e., $y_1$ and $y_2$) generated based on Equation (13). Consequently, we have two variables $m_1$ and $m_2$ to represent the missingness patterns of the two outcome variables, respectively. Note that unlike Kamakura & Wedel (2000), there are fewer latent and outcome variables in our experiment. There is no statistical theory indicating that the values of $J$ and $P$ would affect the imputation performance. This conclusion is also supported by our simulation study detailed in the appendix. In Step 2, draw $R = 100$ sets of complete data samples of size $N = 300$. In Step 3, in each complete


to obtain OLS estimators of $\beta_{12}$ and $\beta_{21}$. We focus on regression coefficients because they are the most interesting estimates in many cases in marketing. We can also pay attention to other estimates, as we do in the following empirical study.

In Step 4, for each data set, similar to Kamakura & Wedel (2000), we create three types of missingness patterns by removing 10% to 50% of the data in 20% increments. The missingness percentage is adjusted through the intercepts and weights of the latent variable on $m_1$ and $m_2$. In that case, data in effect are MNAR as discussed before, although Kamakura & Wedel (2000) do not claim that such simulated data are MNAR. In Step 5, we generate $T = 10$ imputed data sets for each incomplete data set using the proposed model with the correct values for $K$, $J$, and $P$, and the correct distributions for latent and outcome variables. For continuous latent variables, we draw $S = 100$ points. Moreover, with the same specification, we re-impute missing values for each incomplete data set using the model of Kamakura & Wedel (2000).

7.1.2 Results

Table 1 presents the results of the simulation experiments designed for Question i, where Model 1 and Model 2 represent our model and that proposed by Kamakura & Wedel (2000), respectively. Combining the p-value and Abs. Cohen’s d, we can observe that the estimators using our model are not biased, while those derived using the model of Kamakura & Wedel (2000) are significantly biased. This finding is not influenced by changes in the missingness percentage.

Criteria             Model 1                                       Model 2
                     β12                   β21                     β12                   β21
                     10%    30%    50%     10%    30%    50%       10%    30%    50%     10%    30%    50%
P-value of t tests   1.000  0.484  1.000   0.920  0.920  0.271     0.000  0.000  0.035   0.000  0.000  0.002
Abs. Cohen’s d       0.010  0.047  0.007   0.008  0.019  0.054     0.672  0.603  0.385   0.743  0.736  0.437


7.2 Question ii: applicability to MAR and MCAR data

7.2.1 Hypothesis development

Since we discuss two types of latent variables below and ensure that the number of outcome variables is larger than that of latent variables, assume that we now have three outcome variables $y_1$, $y_2$ and $y_3$ and three corresponding variables $m_1$, $m_2$ and $m_3$ representing the missingness pattern. Furthermore, suppose that $y_1$ is incomplete while the rest are fully observed. We choose this setting because if data are MAR, there must be fully observed variables. In this case, $m_2$ and $m_3$ can be removed since they do not provide any information on latent variables, as shown in Fig. 3. If $y_1$ is MAR, then there must be two types of latent variables simultaneously, as illustrated by Fig. 3(a). One type, named type M, determines the missingness pattern and only the outcome variables that are fully observed. By contrast, the other type, named type NM, does not influence the missingness pattern but determines both fully and partially observed outcome variables. It is easy to see that there are restrictions on these two types of latent variables. If $y_1$ is MCAR, then only latent variables of type NM exist, as displayed in Fig. 3(b), since the generation of the missingness pattern does not depend on the data at hand (i.e., $y_1$, $y_2$ and $y_3$) or the corresponding latent variables. If $y_1$ is MNAR, Fig. 3(c) shows that, unlike the former two cases, we denote latent variables by $x_1$ and $x_2$ to indicate that there are no restrictions imposed on latent variables anymore. This implies that the models for MAR and MCAR data are nested in the model for MNAR data.

Figure 3: The sketches of different data generation mechanisms

7.2.2 Experiment setting


where the intercepts $\lambda_{01k} = \lambda_{02k} = \lambda_{03k} = -0.5$, and the weights of $x_M$ and $x_{NM}$ are $\lambda_{1Mk} = 0$, $\lambda_{2Mk} = 0.4$, $\lambda_{3Mk} = 0.5$, and $\lambda_{1NMk} = 0.4$, $\lambda_{2NMk} = 0.5$, $\lambda_{3NMk} = 0.6$. Meanwhile, we utilize one variable $m_1$ to represent the missingness pattern of $y_1$. In Step 2, draw $R = 1$ complete data sample of size $N = 300$. In Step 3, in the complete data set, obtain OLS estimators of $\beta_{12}$ and $\beta_{21}$, where $y_{1n} = \beta_{12} y_{2n} + \epsilon_n$ and $y_{2n} = \beta_{21} y_{1n} + \xi_n$. In Step 4, we create the missingness pattern determined by $x_M$, where $y_1$ is MAR with a missingness percentage of 30%. In Step 5, we generate $T = 10$ imputed data sets for each incomplete data set using the proposed model with the correct values for $K$, $J$, and $P$, and the correct distributions for latent and outcome variables. We set the number of sampling points $S = 100$ for latent variables. For MCAR data, all else being equal, let the weights of $x_{NM}^1$ and $x_{NM}^2$ be $\lambda_{1NM^1k} = 0.3$, $\lambda_{2NM^1k} = 0.4$, $\lambda_{3NM^1k} = 0.5$, and $\lambda_{1NM^2k} = 0.4$, $\lambda_{2NM^2k} = 0.5$, $\lambda_{3NM^2k} = 0.6$ in Step 1, and remove values of $y_1$ completely at random.

Table 2 shows that estimates (Est.) using our model for the incomplete data are close to those derived by OLS in the complete data. Moreover, the standard errors (S.E.) of estimates yielded by our model are always larger than those computed in the complete data. This is reasonable since missing values bring in extra uncertainty, leading to higher standard errors. In sum, these results support our hypothesis that our model can be applied to MAR and MCAR data.

Methods         MAR                              MCAR
                β12            β21               β12            β21
                Est.   S.E.    Est.   S.E.       Est.   S.E.    Est.   S.E.
Complete data   0.143  0.055   0.153  0.059      0.143  0.055   0.153  0.060
Our model       0.165  0.067   0.182  0.072      0.148  0.059   0.162  0.064

Table 2: Results of the simulation experiments designed for Question ii

8 Empirical study


with 17 ordinal variables with different levels (see Table 1 in Wieringa & Verhoef, 2007) and one binary variable named Churn describing whether respondents want to churn or not. Because we now carry out an empirical study instead of a simulation study, the procedure differs from the simulation procedure presented in Fig. 1. Specifically, in Step 1, as with the second empirical study in Kamakura & Wedel (2000), we compute the estimate of interest: the correlation between those ordinal variables and the binary variable. Note that here the correlation describes a population property, that is, the relationship among variables, rather than the invalid evaluation criterion (i.e., the closeness between imputed values and actual values). In Step 2, based on the approaches described in Section 5.1, we specify and estimate a model with three latent variables and two segments. In Step 3, we sample latent variables using Equation (16). In Step 4, we use the sampled latent variables to create $R = 1$ MNAR data set with a missingness percentage of 30% through Equation (13). In Step 5, we generate $T = 10$ imputed data sets using our proposed model with the same structure parameters as specified in Step 2. In Step 6, we estimate the average correlation across these imputed data sets.

Table 3 shows that the maximum difference between correlation derived with complete data and that estimated by our model is 0.140 and the mean difference is 0.053. These results indicate that our model performs well in recovering correlations.

Correlation between variables and Churn

Variable   Complete data   Our model
SWR1        0.135           0.123
SWR2        0.180           0.161
SWR3        0.130           0.066
SWR4        0.060           0.043
SWR5        0.078          -0.002
TR1         0.297           0.235
TR2         0.062           0.096
TR3         0.121           0.137
QP1        -0.237          -0.229
QP2        -0.261          -0.204
QP3        -0.243          -0.211
WOM         0.327           0.199
ASW1       -0.032           0.029
ASW2       -0.04            0.100
ASW3        0.159           0.106
SC          0.240           0.127
PP1        -0.187          -0.126

Table 3: Correlation results of the empirical study

9 Conclusion


Moreover, we construct a simulation and evaluation framework for assessing imputation accuracy of missing data models, where we demonstrate that the correlation between imputed values and actual values is not an appropriate evaluation criterion.

Simulation studies have shown that our model can yield unbiased estimates, while the model proposed by Kamakura & Wedel (2000) outputs practically and statistically biased estimates. Furthermore, in simulated data, we have shown that even though our model is developed for MNAR data, it can impute missing data accurately when data is MCAR or MAR. We also use one empirical data to display that our model can precisely recover estimates of interest (e.g., correlation among variables).

However, there are still several limitations. First, we do not analytically derive how large the biasedness could be if we use the model proposed by Kamakura & Wedel (2000). Second, the performance of our model in imputing missing data is restricted by how well the data can be summarized by latent variables.

Appendices

To test whether the number of outcome variables affects imputation performance, we use the same experimental setting as in Section 6.3.1, except that we assume one extra outcome variable y3 and consider only a missingness percentage of 30%. Table 2 shows that the conclusions drawn in Section 6.3.2 still hold, which suggests that imputation performance does not depend on the number of outcome variables.

Criteria               Model 1              Model 2
                       β12       β21        β12       β21
P-value of t tests     0.484     0.368      0.000     0.000
Abs. Cohen’s d         0.074     0.077      0.557     0.667
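The two criteria in the table can be illustrated with the sketch below. It assumes, as a simplification, that each criterion compares the parameter estimates obtained across simulation replications with the known true value: a one-sample t statistic for statistical significance and the absolute Cohen's d as a standardized measure of practical significance. The function name `bias_diagnostics` and the numbers are hypothetical, and the exact definitions used in the simulation study may differ.

```python
import math
import statistics

def bias_diagnostics(estimates, true_value):
    """Return (t statistic, absolute Cohen's d) of estimates against a true value.

    Assumption: one-sample comparison of replication estimates with true_value.
    """
    n = len(estimates)
    mean = statistics.fmean(estimates)
    sd = statistics.stdev(estimates)              # sample standard deviation
    t_stat = (mean - true_value) / (sd / math.sqrt(n))
    cohens_d = abs(mean - true_value) / sd        # standardized absolute bias
    return t_stat, cohens_d

# Made-up replication estimates centred on the true value 0.5 (no bias).
unbiased = [0.48, 0.52, 0.50, 0.47, 0.53, 0.51, 0.49, 0.50]
t_unbiased, d_unbiased = bias_diagnostics(unbiased, true_value=0.5)

# The same estimates shifted by 0.1 mimic a systematically biased estimator.
biased = [e + 0.1 for e in unbiased]
t_biased, d_biased = bias_diagnostics(biased, true_value=0.5)

print(t_unbiased, d_unbiased)
print(t_biased, d_biased)
```

A p-value would follow from the t distribution with n - 1 degrees of freedom, for example via `scipy.stats.ttest_1samp`, which is left out here to keep the sketch free of external dependencies.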

References

Allenby, G. M., & Rossi, P. E. (1998). Marketing models of consumer heterogeneity. Journal of Econometrics, 89 (1-2), 57–78.

Bartholomew, D. J., Knott, M., & Moustaki, I. (2011). Latent variable models and factor analysis: A unified approach, vol. 904. John Wiley & Sons.

Blattberg, R. C., Kim, B.-D., & Neslin, S. A. (2008). Database Marketing: Analyzing and Managing Customers. Springer.

Bradlow, E. T., Gangwar, M., Kopalle, P., & Voleti, S. (2017). The role of big data and predictive analytics in retailing. Journal of Retailing, 93 (1), 79–95.

Chintagunta, P., Hanssens, D. M., & Hauser, J. R. (2016). Marketing science and big data.

Chung, T. S., Rust, R. T., & Wedel, M. (2009). My mobile music: An adaptive personalization system for digital audio players. Marketing Science, 28 (1), 52–68.

Collins, L. M., Schafer, J. L., & Kam, C.-M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6 (4), 330.

Cui, K., & Dunson, D. B. (2014). Generalized dynamic factor models for mixed-measurement time series. Journal of Computational and Graphical Statistics, 23 (1), 169–191.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), (pp. 1–38).

Feit, E. M., Beltramo, M. A., & Feinberg, F. M. (2010). Reality check: Combining choice experiments with market data to estimate the importance of product attributes. Management Science, 56 (5), 785–800.

George, G., Osinga, E. C., Lavie, D., & Scott, B. A. (2016). Big data and data science methods for management research. Academy of Management Journal, 59 (5), 1493–1507.

Gilula, Z., & McCulloch, R. (2013). Multi level categorical data fusion using partially fused

Gilula, Z., McCulloch, R. E., & Rossi, P. E. (2006). A direct approach to data fusion. Journal of Marketing Research, 43 (1), 73–83.

Harel, O., & Schafer, J. L. (2009). Partial and latent ignorability in missing-data problems. Biometrika, 96 (1), 37–50.

Honaker, J., & King, G. (2010). What to do about missing values in time-series cross-section data. American Journal of Political Science, 54 (2), 561–581.

Jung, H., Schafer, J. L., & Seo, B. (2011). A latent class selection model for nonignorably missing data. Computational Statistics & Data Analysis, 55 (1), 802–812.

Kamakura, W. A., & Wedel, M. (1997). Statistical data fusion for cross-tabulation. Journal of Marketing Research, (pp. 485–498).

Kamakura, W. A., & Wedel, M. (2000). Factor analysis and missing data. Journal of Marketing Research, 37 (4), 490–498.

Kim, S., Blanchard, S. J., DeSarbo, W. S., & Fong, D. K. (2013). Implementing managerial constraints in model-based segmentation: Extensions of Kim, Fong, and DeSarbo (2012) with an application to heterogeneous perceptions of service quality. Journal of Marketing Research, 50 (5), 664–673.

Kuha, J., Katsikatsou, M., & Moustaki, I. (2017). Latent variable modelling with non-ignorable item non-response: Multigroup response propensity models for cross-national analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society).

Lee, S.-Y. (2006). Bayesian analysis of nonlinear structural equation models with nonignorable missing data. Psychometrika, 71 (3), 541.

Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data, vol. 333. John Wiley & Sons.

Liu, C.-W., & Wang, W.-C. (2017). Non-ignorable missingness item response theory models for choice effects in examinee-selected items. British Journal of Mathematical and

Menezes, L. d. (1999). On fitting latent class models for binary data: The estimation of standard errors. British Journal of Mathematical and Statistical Psychology, 52 (2), 149–168.

Mohamed, S. (2011). Generalised Bayesian matrix factorisation models. Ph.D. thesis, University of Cambridge.

Mohamed, S., Ghahramani, Z., & Heller, K. A. (2009). Bayesian exponential family PCA. In Advances in Neural Information Processing Systems, (pp. 1089–1096).

Moon, S., & Kamakura, W. A. (2017). A picture is worth a thousand words: Translating product reviews into a product positioning map. International Journal of Research in Marketing, 34 (1), 265–285.

Murphy, K. R., Myors, B., & Wolach, A. (2014). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests. Routledge.

Murray, J. S., Dunson, D. B., Carin, L., & Lucas, J. E. (2013). Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association, 108 (502), 656–665.

Qian, Y., & Xie, H. (2011). No customer left behind: A distribution-free Bayesian approach to accounting for missing Xs in marketing models. Marketing Science, 30 (4), 717–736.

Qian, Y., & Xie, H. (2013). Which brand purchasers are lost to counterfeiters? An application of new data fusion approaches. Marketing Science, 33 (3), 437–448.

Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys, vol. 81. John Wiley & Sons.

Shah, A. D., Bartlett, J. W., Carpenter, J., Nicholas, O., & Hemingway, H. (2014). Comparison of random forest and parametric imputation models for imputing missing data using MICE: A CALIBER study. American Journal of Epidemiology, 179 (6), 764–774.

Song, X.-Y., & Lee, S.-Y. (2007). Bayesian analysis of latent variable models with non-ignorable missing outcomes from exponential family. Statistics in Medicine, 26 (3), 681–693.

Valera, I., & Ghahramani, Z. (2014). General table completion using a Bayesian nonparametric model. In Advances in Neural Information Processing Systems, (pp. 981–989).

Valera, I., Pradier, M. F., & Ghahramani, Z. (2017). General latent feature models for heterogeneous datasets. arXiv preprint arXiv:1706.03779.

Vermunt, J. K., Van Ginkel, J. R., Van der Ark, L. A., & Sijtsma, K. (2008). Multiple imputation of incomplete categorical data using latent class analysis. Sociological Methodology, 38 (1), 369–397.

Wedel, M., & Kamakura, W. A. (2001). Factor analysis with (mixed) observed and latent variables in the exponential family. Psychometrika, 66 (4), 515–530.

Wedel, M., & Kannan, P. (2016). Marketing analytics for data-rich environments. Journal of Marketing, 80 (6), 97–121.

Wieringa, J. E., & Verhoef, P. C. (2007). Understanding customer switching behavior in a liberalizing service market: an exploratory study. Journal of Service Research, 10 (2), 174–186.

Xie, H., & Qian, Y. (2012). Measuring the impact of nonignorability in panel data with non-monotone nonresponse. Journal of Applied Econometrics, 27 (1), 129–159.

Yang, S., Zhao, Y., & Dhar, R. (2010). Modeling the underreporting bias in panel survey data. Marketing Science, 29 (3), 525–539.