

Faculty of Economics and Business

Amsterdam School of Economics


Counting Your Customers: A Discrete Switching Extension

Eric Dignum

(10246312)

MSc in Econometrics

Track: Big Data Business Analytics
Date of final version: 23-03-2017
Supervisor: Dr. N. van Giersbergen
Second reader: Dr. K. Pak

Abstract

This study uses a discrete Hidden Markov model, where customers buy according to a Poisson distribution in an active state and have an inactive state in which they do not buy. The fitted model shows some evidence for significant correlations between individual parameters and multi-modal heterogeneity instead of commonly used uni-modal distributions.


Statement of Originality

This document is written by Eric Dignum, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
2 Background
   2.1 Buy Till You Die
   2.2 Switching Models
3 Model
   3.1 Likelihood
   3.2 Key Results
   3.3 MCMC Estimation
4 Empirical Analysis
5 Simulation Study
6 Conclusion
Appendix A: Full Conditional Distributions
Appendix B: Markov-Chain Monte Carlo Output


1 Introduction

With the increasing availability of data on an individual level, marketing researchers and professionals have been focusing more on personalised/customised marketing instead of traditional marketing at an aggregated level. Due to this shift, it becomes more important to make accurate predictions about the purchasing decisions of single customers or segments of customers. Hence, in the field of Customer Relationship Marketing (CRM), personalised marketing is becoming the standard as well.

One of the main questions in CRM is: "when will your customers buy next?" Or, stated differently: "how do you predict future purchases of customers?" If it is possible to accurately forecast when a customer will buy (or to obtain the probability that a customer will buy in the near future), customers can be "ranked" and targeted more efficiently, which could reduce costs and increase profits. Furthermore, if it can be deduced which customer characteristics lead to a higher probability of buying, marketers can use this information to segment their customers and allocate marketing expenditures accordingly.

Within this context, it is often assumed that customers are active for a certain period (in which they possibly purchase) and become permanently inactive afterwards (they "die" or "defect"). These models are called "Buy Till You Die" or "Buy Till You Defect" (BTYD) models. Moreover, one can distinguish between two settings: non-contractual and contractual. With a contract, the lifetime is observed, because it ends at the termination of the contract, but in a non-contractual setting there is often no such information available and the lifetime needs to be estimated.

Most models in this framework are built on the foundations of the Pareto/NBD model of Schmittlein, Morrison, and Colombo (1987). However, as stated by Büschken and Ma (2011, p. 244), these models all assume that after defection there is no probability of a repeat purchase. As a consequence, they might underestimate the number of repeat purchases for customers who become inactive for a specific period instead of permanently. For example, inactive consumers could be "activated" by a marketing campaign or other external factors, violating the BTYD assumptions.

This leads Büschken and Ma (2011) to suggest an alternative set of assumptions, in which customers do not defect permanently, but switch between states. In their specific model, people have an active (buying) state and an inactive state in which nothing is bought. They argue that the rate at which customers switch from an inactive to an active state is an important factor in the fit and predictive ability of their model.

Other often-debated assumptions concern which statistical distributions to use for the purchase/lifetime processes and for the heterogeneity between customers. Where the Pareto/NBD model uses continuous distributions for all processes, Fader, Hardie, and Lee (2005) and Jerath, Fader, and Hardie (2011) come up with discrete counterparts of the Pareto/NBD to improve the overall fit and predictive ability. Furthermore, Korkmaz (2014, p. 92) stresses that the chosen heterogeneity distribution can influence the results substantially and often does not fit the real heterogeneity distribution.

This study will use a Discrete Hidden Markov Model (HMM) in a non-contractual setting to incorporate switching behaviour of customers between an active and an inactive state. In the active state, consumers buy according to a Poisson process, and in the other state it is assumed that there is no probability of making purchases. This gives the opportunity to study customer switching in a discrete setting. Furthermore, heterogeneity is modelled using a Multivariate Normal distribution, which allows for correlations between the estimated parameters.

After this section, a summary of existing literature is given, followed by the technical specifications of the proposed model and the used estimation method. Then, an empirical analysis and simulation study are performed and results are discussed. Finally, conclusions and suggestions for further research will be given.


2 Background

2.1 Buy Till You Die

Within customer base analysis one cannot escape the Pareto/NBD model of Schmittlein et al. (1987), which counts as a building block and benchmark for more recent models. They assume that customers buy according to a Poisson process while having an Exponentially distributed lifetime. Both parameters, the purchase rate from the Poisson distribution and the lifetime parameter from the Exponential, follow independent Gamma distributions to model unobserved heterogeneity (differences across customers).

Although an old model, it still performs well when predicting future sales according to Jerath et al. (2011, p. 876), but besides this robustness there are also shortcomings, for example the difficult estimation process (although less relevant nowadays, due to improvements in sampling methods), which Fader et al. (2005) try to avoid by assuming a different lifetime process for customers. More specifically, in their BG/NBD model they assume that customers have a probability of defecting after every purchase they make, which differs from the Exponential lifetime assumption used in the Pareto/NBD.

This leads to a much simpler estimation process, but from a theoretical perspective it implies that people with more purchases have more opportunities to defect and people with no repeat purchases stay active (Korkmaz, Kuik, Fok, et al., 2013, p. 4), which may not correspond to reality, as customers who only bought once could easily have defected. Instead they are assumed active, leading to an overestimation of lifetime. Next to this theoretical critique, they do not show any significant advantages in fit or prediction performance.

More recently, Jerath et al. (2011) incorporated a different "death" process where customers have the possibility to defect at discrete points in calendar time (which they call Periodic Death Opportunity, or PDO for short), independent of the number of purchases made. They show that their model predicts lower purchase rates and longer lifetimes and provides a better in-sample fit on multiple datasets compared to the Pareto/NBD model. However, even though the Pareto/NBD is nested within the PDO model and the in-sample differences are substantial, the out-of-sample performance is similar.

The previously mentioned papers only account for unobserved rather than observed heterogeneity, which makes it impossible to derive which kinds of customers contribute the most and what separates them from other segments or individuals. If observed heterogeneity is incorporated, for example via customer or purchase characteristics, it can provide valuable individual-level information for marketers on which customers stay active longer and/or buy more often.

Abe (2009) extends the Pareto/NBD model with the possibility to include such characteristics. The Poisson purchase and Exponential lifetime assumptions are kept, but heterogeneity is incorporated in the purchase rate and drop-out rate parameters via a multivariate Log-Normal distribution with covariates included. Moreover, because the purchase and drop-out rate are estimated simultaneously, the independence assumption of the two can be relaxed (i.e. they can be correlated).

Despite the added customer characteristics, Abe (2009) does not find any significant correlation and thus no evidence for a violation of the framework of Schmittlein et al. (1987), Fader et al. (2005) or Jerath et al. (2011). Furthermore, his Hierarchical Bayes (HB) model shows a similar predictive performance to the Pareto/NBD in forecasting aggregate purchase frequency.

Using a Poisson or Binomial distribution to describe the purchase process cannot appropriately model two commonly found characteristics of consumer purchase data, namely over-dispersion and regularity.

Data is said to be over-dispersed if the variance is greater than the mean, but the variance of the Poisson distribution is equal to its mean, and for the Binomial it is even smaller than its mean. If these were the true distributions, the variance should be at most equal to the mean, but in most datasets the variance of the purchase rate (or probability) is estimated to be bigger than the mean.

As for regularity: inter-purchase times follow an Exponential distribution for a Poisson purchase process, and the discrete counterpart is the Geometric distribution (in the case of Binomially distributed purchases). A key characteristic of these two distributions is memorylessness. For regular buying customers this property may not hold, as it says that the next purchase does not depend on past purchases, while if someone buys every week at a specific time, this says something about his/her future purchases (i.e. regularity).

Various researchers have tried to model these phenomena in BTYD models, for example using the Gamma or Erlang distribution to model inter-purchase times (both can model regularity). Using these distributions requires more complex derivations, but it can lead to significant improvements in parameter estimates, fit and predictive ability (Platzer, 2008, p. 73).

2.2 Switching Models

A more intuitive approach to model over-dispersion is the use of so-called Markov Switching models (Hidden Markov models) instead of more complex purchase distributions. Often, the assumption here is that a customer has a number of states with different buying behaviour, between which he or she can switch. To exemplify with a simple case: customers could have an active state in which they make their purchases and an inactive state in which the purchase rate is equal to zero. If a customer switches to the non-buying state at any time in the observation period, this leads to a bigger variance than the corresponding variance of the purchase distribution itself. Another advantage of this class of models is that they nest a BTYD variant when the probability of switching from inactive to active is zero.
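This variance-inflation argument can be checked with a small simulation; the parameter values below are arbitrary illustrations, not estimates from any of the cited models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) parameters: purchase rate, staying probabilities,
# number of weekly periods, and number of simulated customers.
lam, p00, p11, T, n_customers = 2.0, 0.8, 0.6, 52, 3_000

purchases = np.zeros(n_customers)
for i in range(n_customers):
    state, total = 1, 0                    # start in the active (buying) state
    for t in range(T):
        if state == 1:
            total += rng.poisson(lam)      # buy only while active
        # weekly switching opportunity with constant transition probabilities
        stay = p11 if state == 1 else p00
        state = state if rng.random() < stay else 1 - state
    purchases[i] = total

# A pure Poisson count over T periods would have variance equal to its mean;
# random switching to a zero-rate state inflates the variance (over-dispersion).
print(purchases.mean(), purchases.var())
```

Even with a homogeneous population (every customer shares the same λ, p00, p11), the simulated variance exceeds the mean, which is the over-dispersion mechanism described above.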

The two-state case described above is used by Büschken and Ma (2011) in their attempt to explain a customer's purchase process. They assume that customers buy according to a Poisson distribution in their active state and after every purchase have a probability of switching to inactivity, while time to recovery (inactive to active) is Exponentially distributed. Unobserved heterogeneity is modelled via the Gamma (for positively valued parameters) or Beta distribution (for (0,1)-bounded parameters).

Their main finding is that recovery rates are rather high, implying that the assumption of "dying" customers in the BTYD models is invalid. Furthermore, their model gives a better fit and forecasting performance than the Pareto/NBD and BG/NBD in terms of forecasting purchase frequency. However, as mentioned before, the assumption of switching only after a purchase may not be realistic.

In a later study, Büschken and Ma (2012) extend their earlier model by using multiple Erlang distributions to model waiting times between purchases in the active state (Erlang Mixture State-Switching, or EMS). They also allow the individual recovery rate to go to zero, implying permanent defection. Thus, customers can be active, inactive or permanently inactive.

They examine five real-life datasets and show that in four of them their EMS model performs better. Furthermore, the dataset where the performance is worse has been used extensively in studies with BTYD models and thus might give a biased view of their general performance. Moreover, the Erlang distributions allow the concept of regularity to be studied. They find that there is a substantial amount of regular buying in their datasets.

To avoid the critique of dying/switching after a purchase, the model proposed in this study assumes a discrete switching process in calendar time, which differs from Büschken and Ma (2011), where a customer has a probability of becoming inactive after every purchase. Moreover, where Büschken and Ma (2011) have a hybrid switching process, continuous from inactive to active and discrete from active to inactive, the switching dynamics used here are entirely discrete. Additionally, the same Poisson purchase distribution is used, but heterogeneity is modelled using a Multivariate Normal density, which allows correlations between the individual parameters to be captured, something not yet demonstrated within the switching models in this field.

When the probability of becoming active is 0 for a customer, the model transforms into a BTYD variant. This constrained model is also an addition to the literature, because it is the PDO model of Jerath et al. (2011) with Multivariate Normal heterogeneity.

3 Model

The Multivariate Normal Hidden Markov Model (MVNHMM) is based on three assumptions, where (1) and (2) are presumed to hold for an individual customer and (3) for the whole population (note that the subscript i is left out to reduce clutter):

1. At any time t, a customer can be either active or inactive. While active, the number of transactions made by the customer follows a Poisson distribution with purchase rate λ; while inactive the customer makes no transactions:

Xt ∼ Poisson(λ).    (1)

2. Let the state at time t be 0 if the customer is inactive (st = 0) and 1 if active (st = 1). It is assumed that at every time step t, the customer has a probability of transitioning to the other state or staying in his/her current state. These transition probabilities are assumed to be constant over time, with pjk the probability for the customer of going to state k at time t + 1, given that he or she is in state j at time t. This leads to the following transition matrix:

Γ = | p00  p01 | = | p00      1 − p00 |
    | p10  p11 |   | 1 − p11  p11     | .    (2)

The simplifications follow because at every period t a customer needs to be in one of the two states, so the rows must sum to one. In mathematical terms, the realised state sequence (although unobserved) is called a Markov chain of order one, hence the name Hidden Markov Model.

3. Heterogeneity between customers is assumed to follow a Multivariate Normal distribution:


(log(λ), logit(p00), logit(p11))′ ∼ N(µ, Σ),

µ = | µ1 |        | σ11  σ12  σ13 |
    | µ2 | ,  Σ = | σ12  σ22  σ23 | .    (3)
    | µ3 |        | σ13  σ23  σ33 |

Figure 1: A schematic representation of a possible purchase pattern; the dotted lines indicate the hidden Markov process the customer passes through, and Xt the number of purchases in time period t, which is what is observed.
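A purchase pattern like the one sketched in Figure 1 can be generated directly from assumptions 1-3; the population parameters µ and Σ below are hypothetical placeholders for illustration, not the estimates reported later:

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical population parameters for (log lam, logit p00, logit p11)
mu = np.array([0.0, 1.5, 0.5])
Sigma = np.array([[0.3,  0.1, -0.1],
                  [0.1,  0.5, -0.2],
                  [-0.1, -0.2, 0.4]])

def simulate_customer(T=39):
    # Assumption 3: draw the individual parameters from the Multivariate Normal
    draw = rng.multivariate_normal(mu, Sigma)
    lam, p00, p11 = np.exp(draw[0]), logistic(draw[1]), logistic(draw[2])
    x, state = [], 1                       # delta = (0, 1)': start in the active state
    for _ in range(T):
        # Assumption 1: Poisson purchases while active, none while inactive
        x.append(rng.poisson(lam) if state == 1 else 0)
        # Assumption 2: constant transition probabilities
        stay = p11 if state == 1 else p00
        state = state if rng.random() < stay else 1 - state
    return np.array(x)

pattern = simulate_customer()
print(pattern)
```

Note how the transformations exp and logistic map the unconstrained Normal draws back to a positive rate and (0,1)-bounded probabilities.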


3.1 Likelihood

Given these assumptions we can write the likelihood of observing a purchase pattern, x, for a specific customer (the subscript i is again left out to reduce clutter). Let x = (x1, x2, ..., xT) be an observed purchase pattern, s = (s1, s2, ..., sT) the hidden state sequence, λ the purchase rate, and p00, p11 the probabilities of staying in state 0 and 1 respectively. Furthermore, let δ = (δ0, δ1)′ = (0, 1)′ be the initial state distribution. Customers are assumed to start in state 1 (the buying state), because in order to appear in most datasets one needs to have shown some sort of active behaviour (i.e. bought a product, browsed a website, etc.).

To give the intuition behind the likelihood computation, consider the following example: suppose we have observed two periods, with one purchase in period 1 and none in period 2 (i.e. x = (1, 0)):

L(λ, p00, p11) = P(X1 = 1, X2 = 0 | λ, p00, p11, δ)
= P(X1 = 1, X2 = 0 | λ, S1 = 0, S2 = 0) P(S1 = 0, S2 = 0 | p00, p11, δ)
+ P(X1 = 1, X2 = 0 | λ, S1 = 0, S2 = 1) P(S1 = 0, S2 = 1 | p00, p11, δ)
+ P(X1 = 1, X2 = 0 | λ, S1 = 1, S2 = 0) P(S1 = 1, S2 = 0 | p00, p11, δ)
+ P(X1 = 1, X2 = 0 | λ, S1 = 1, S2 = 1) P(S1 = 1, S2 = 1 | p00, p11, δ)
= P(X1 = 1 | λ, S1 = 0) P(X2 = 0 | λ, S2 = 0) δ0 p00
+ P(X1 = 1 | λ, S1 = 0) P(X2 = 0 | λ, S2 = 1) δ0 p01
+ P(X1 = 1 | λ, S1 = 1) P(X2 = 0 | λ, S2 = 0) δ1 p10
+ P(X1 = 1 | λ, S1 = 1) P(X2 = 0 | λ, S2 = 1) δ1 p11
= δP(x1)ΓP(x2)ι,    (4)

where

P(xt) = | P(Xt = xt | st = 0)          0          |
        |          0          P(Xt = xt | st = 1) | .

The matrix P(xt) contains the probabilities of the observed purchase amount under the state-dependent purchase distribution with purchase rate λ. Moreover, Γ is the transition matrix as given in the assumptions and ι is a column vector of ones. An intuitive explanation of the formula is that at every time step t the probability of the observed purchase amount in both states is considered, together with the probabilities of staying in or moving to each state. The example uses only two time periods, but already shows that every possible state sequence (m^T = 2^2 = 4) needs to be evaluated. The generalisation follows from Zucchini and MacDonald (2009, p. 37), where for a time series of length T and m states one can write:

L(λ, p00, p11) = δP(x1)ΓP(x2)ΓP(x3)ΓP(x4)...ΓP(xT)ι.    (5)

To avoid the necessity of evaluating every possible state sequence, the Forward algorithm is used, which exploits a recursive relationship in the likelihood formula to decrease the number of computations to the order of m^2 T instead of m^T. The derivation of the Forward procedure is demonstrated in Zucchini and MacDonald (2009, p. 38) and uses additional variables αt, also known as forward probabilities (the joint probability of the observations up to time t and being in state j at time t):

αt = δP(x1)ΓP(x2)ΓP(x3)ΓP(x4)...ΓP(xt) = δP(x1) ∏(l=2..t) ΓP(xl),

so that

LT = αT ι,  α1 = δP(x1)  and  αt = αt−1 ΓP(xt).    (6)
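The recursion in (6), combined with the per-step normalisation from Zucchini and MacDonald used below to avoid underflow, can be sketched in Python; the purchase pattern and parameter values are made up for illustration:

```python
import numpy as np
from math import exp, factorial, log

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def log_likelihood(x, lam, p00, p11):
    """Scaled forward algorithm for the two-state model."""
    Gamma = np.array([[p00, 1 - p00],
                      [1 - p11, p11]])
    delta = np.array([0.0, 1.0])           # delta = (0, 1)': start in the active state
    def P(k):                              # diagonal matrix of state-dependent pmfs
        return np.diag([1.0 if k == 0 else 0.0,    # inactive: no purchases
                        poisson_pmf(k, lam)])       # active: Poisson(lam)
    v = delta @ P(x[0])
    w = v.sum()
    ll = log(w)
    phi = v / w                            # filtered state probabilities
    for k in x[1:]:
        v = phi @ Gamma @ P(k)
        w = v.sum()
        ll += log(w)
        phi = v / w
    return ll, phi

ll, phi = log_likelihood([1, 0, 0, 2, 0], lam=1.2, p00=0.8, p11=0.6)
print(ll)
```

For short patterns the result can be verified against the brute-force sum over all m^T state sequences in (5), which is how the sketch was checked.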

As the likelihood possibly takes on values close to zero, one wants to avoid numerical underflow. Thus, for practical purposes the log-likelihood is calculated using the following scheme (Zucchini & MacDonald, 2009, p. 47):


w1 = δP(x1)ι,  φ1 = δP(x1)/w1  and  ll = log(w1)
for t = 2, ..., T:
    v = φt−1 ΓP(xt)
    wt = vι
    ll = ll + log(wt)
    φt = v/wt
return ll

3.2 Key Results

Eventually, the interest lies in the prediction of future purchases. In this section some managerially relevant results are derived given the individual estimates (the subscript i is left out to reduce clutter):

• E[A(t)], the expected number of visits to the active state in t time steps, is given by the stationary distribution π = (π0, π1), satisfying π = πΓ, of the transition matrix, multiplied by t:

E[A(t)] = π1 t.    (7)

• E[X(t)], the expected number of transactions in t time steps:

E[X(t)] = λ E[A(t)].    (8)

• P(Active at T | x, λ, p00, p11), the probability that a customer with purchase history x is in the active state at time T (φT follows from the Forward algorithm):

P(Active at T | x, λ, p00, p11) = φT (0, 1)′.    (9)


• P(XT+h = k | x, λ, p00, p11), the probability of making k purchases in period T + h (with P(k) the diagonal matrix of state-dependent probabilities of observing k purchases):

P(XT+h = k | x, λ, p00, p11) = φT Γ^h P(k) ι.    (10)

• E[XT+h | x, λ, p00, p11], the expected number of transactions for an individual with purchase history x in period (T, T + h]:

E[XT+h | x, λ, p00, p11] = λ ∑(d=1..h) φT Γ^d (0, 1)′.    (11)

3.3 MCMC Estimation

Standard procedures like Maximum Likelihood can be used when all customers are treated separately, as the likelihood per customer can be maximised numerically. However, because this is a hierarchical model (the three parameters per customer follow a Multivariate Normal density), Maximum Likelihood is not feasible. Instead a Bayesian approach is proposed, using a Markov-Chain Monte-Carlo algorithm with Gibbs and Metropolis-Hastings sampling procedures.

The parameters that need to be estimated are the population parameters µ and Σ, and the individual-specific parameters {λi, p00,i, p11,i} for i ∈ {1, 2, ..., N}. Using Bayes' rule, the full posterior distribution can be written as follows:

P(λ, p00, p11, µ, Σ | X) ∝ P(X | λ, p00, p11, µ, Σ) P(λ, p00, p11, µ, Σ),    (12)

where the last part can be factored in the following way:

P(λ, p00, p11, µ, Σ) = P(λ, p00, p11 | µ, Σ) P(µ, Σ | µ0, Σ0) P(µ0, Σ0).    (13)

With Gibbs and Metropolis-Hastings sampling we can obtain samples from this density by sequentially drawing new values from all conditional distributions (treating the other parameters as given). When sufficiently many samples are obtained, one can calculate sample approximations of the expectation, variance and other measures that characterise the full posterior.

To implement this algorithm, we first need to derive the conditional distributions of all parameters of interest. Because customers are assumed independent, they do not influence each other's conditional distributions and we can sample them simultaneously (i.e. independently from each other). This simplification leads to only three necessary conditional distributions: one for the vector µ, one for the matrix Σ (both conditioned on all individual parameters), and one for the individual parameters conditional on the population parameters. The three conditional densities are:

P(µ | Σ, λ, p00, p11) ∼ N( (n x̄ + m µ0) / (n + m), Σ / (n + m) ),    (14)

P(Σ | λ, p00, p11) ∼ W⁻¹( Ψ0 + nS + (nm / (n + m)) (x̄ − µ0)(x̄ − µ0)′, n + n0 ),
where S = (1/n) ∑(i=1..n) (xi − x̄)(xi − x̄)′,    (15)

P(λ, p00, p11 | Σ, µ, X) ∝ L(λ, p00, p11 | X) exp( −(1/2)(y − µ)′ Σ⁻¹ (y − µ) ),    (16)

where µ0, Ψ0, m and n0 are prior values that need to be determined beforehand and y = (log(λ), logit(p00), logit(p11))′. As the formulas indicate, the conditional densities for µ and Σ are standard densities (Normal and Inverse-Wishart respectively) and are relatively easy to sample from using a Gibbs procedure.
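A minimal sketch of the two Gibbs draws (14) and (15), using synthetic individual parameter values in place of the current MCMC draws (an assumption for illustration) and scipy's Inverse-Wishart sampler:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(2)

# y: n x 3 matrix of individual parameters (log lam, logit p00, logit p11);
# synthetic here, in the full sampler these are the current MCMC draws.
n, d = 500, 3
y = rng.normal(size=(n, d))

# Weakly informative priors: Psi0 = I3, n0 = 1, m = 1, mu0 = 0
Psi0, n0, m, mu0 = np.eye(d), 1, 1, np.zeros(d)

ybar = y.mean(axis=0)
S = (y - ybar).T @ (y - ybar) / n          # sample covariance as in (15)

# (15): draw Sigma | y from the Inverse-Wishart
scale = Psi0 + n * S + (n * m / (n + m)) * np.outer(ybar - mu0, ybar - mu0)
Sigma = invwishart.rvs(df=n + n0, scale=scale, random_state=0)

# (14): draw mu | Sigma, y from the Normal
mean = (n * ybar + m * mu0) / (n + m)
mu = rng.multivariate_normal(mean, Sigma / (n + m))

print(mu.shape, Sigma.shape)
```

Alternating these two draws with the Metropolis-Hastings step for the individual parameters yields the full sampler.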

However, the conditional density for the individual parameters is only known up to a constant. This means the Gibbs algorithm cannot be put into practice; instead, a Metropolis-Hastings procedure needs to be used to sample from the correct conditional density. To draw a value from the distribution in (16), the following scheme is used for every customer i:


1. Propose new values (simultaneously) for λ, p00, p11 by drawing from a Multivariate Normal distribution with the old values as mean and a given variance (tuning parameter).

2. Calculate P(λold, p00,old, p11,old | Σ, µ, X) and P(λnew, p00,new, p11,new | Σ, µ, X).

3. Draw a sample, a, from the Uniform(0, 1) distribution.

4. Accept the new values from step 1 as a draw from the conditional distribution if

a < min{ 1, P(λnew, p00,new, p11,new | Σ, µ, X) / P(λold, p00,old, p11,old | Σ, µ, X) },

otherwise keep the old values as the new draw from the conditional distribution.

Next to the values of the prior parameters, every individual parameter needs a variance for its own proposal distribution. Setting these manually would be time-consuming, so an adaptive Metropolis-Hastings algorithm is used instead. This method exploits the previously generated values in the Markov chain to estimate all proposal variances of the individual parameters. More specifically, every 200 burn-in steps in the Markov chain the variances are updated with the sample variance of the 200 most recently generated values (for further reference see Haario, Saksman, and Tamminen (2001)). Note that after the burn-in period the variances no longer change, to preserve the convergence properties of the MCMC algorithm.
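The Metropolis-Hastings step above can be sketched for one customer as follows; for numerical stability the acceptance ratio is computed on the log scale, and the forward-algorithm log-likelihood is replaced by a simple placeholder (an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_cond_density(theta, mu, Sigma_inv):
    """Log of the target in (16) up to a constant: log-likelihood plus MVN prior term.
    The log-likelihood here is a placeholder; in the full sampler it would be the
    forward-algorithm log-likelihood of the customer's purchase pattern."""
    ll = -0.5 * np.sum(theta ** 2)                     # placeholder log-likelihood
    diff = theta - mu
    return ll - 0.5 * diff @ Sigma_inv @ diff

def mh_step(theta_old, prop_var, mu, Sigma_inv):
    # Step 1: propose all three transformed parameters simultaneously
    theta_new = theta_old + rng.normal(scale=np.sqrt(prop_var), size=3)
    # Steps 2-4: accept with probability min(1, p_new / p_old), on the log scale
    log_a = (log_cond_density(theta_new, mu, Sigma_inv)
             - log_cond_density(theta_old, mu, Sigma_inv))
    if np.log(rng.random()) < log_a:
        return theta_new, True
    return theta_old, False

mu, Sigma_inv = np.zeros(3), np.eye(3)
theta, draws = np.zeros(3), []
for it in range(2000):
    theta, _ = mh_step(theta, prop_var=0.5, mu=mu, Sigma_inv=Sigma_inv)
    draws.append(theta)
draws = np.array(draws)
print(draws.shape)
```

In the adaptive variant, `prop_var` would be re-estimated from the last 200 draws during burn-in and then frozen, as described above.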


4 Empirical Analysis

To compare models on a real-life dataset, the extensively studied CDNOW dataset is used. It consists of 23,570 customers who all became a member of CDNOW in the first 12 weeks of 1997 (i.e. they made their first purchase in one of those 12 weeks). As not all customers made their first purchase in the first week, there are purchase patterns of different lengths.

To avoid long simulation times, the first 68 weeks of all purchase patterns are used (this equalises all lengths so they can be put in a matrix) and 2,357 customers are drawn at random. Note that a significant share of the customers (59%) only made an initial purchase and did not buy anything afterwards.

The model is fitted on the first 39 weeks of the purchase patterns and the estimated parameters are then used to forecast the number of purchases in the next 29 weeks, as an out-of-sample measure of fit. Customers are assumed to have weekly switching opportunities and they buy according to a Poisson process within those weeks. For estimation, 50,000 MCMC iterations are performed, where the last 10,000 are used for analysis of the obtained posterior distributions.

To start the MCMC sampling method, prior and starting values are needed. The priors are chosen such that they are weakly informative (i.e. they do not have a lot of influence on the posterior); specifically, the following values are chosen: Ψ0 = I3, n0 = 1, m = 1 and µ0 = (0, 0, 0)′. All parameters are given a starting value of 0; although the starting value can influence the speed of convergence, the chain will eventually converge regardless of the starting point.

Next to the proposed model, a constrained version (p00 = 1) is fitted for comparison. This constraint induces the BTYD assumption in the model, as there is no returning from the inactive state.

On the individual level this is a simple constraint, but with the Multivariate Normal distribution used here, all customers having p00 = 1 is not feasible. Instead, heterogeneity is modelled for the two remaining parameters of the model: the purchase rate λ and the probability of dying p10. This model is called the MVNPDO model for brevity, as it is the PDO model from Jerath et al. (2011) using Multivariate Normal heterogeneity.

CDNOW Posterior Estimates MVNHMM

          Mean     q2.5%    q97.5%               Mean     q2.5%    q97.5%
µ1      -0.268   -0.306    -0.231    λ          0.811    0.390     1.477
µ2       4.190    4.038     4.347    p00        0.942    0.582     1.000
µ3      -3.349   -3.842    -2.916    p11        0.127    0.000     0.777
σ11      0.117    0.096     0.138    σλ         0.092    0.071     0.114
σ12      0.581    0.497     0.670    σλ,p00     0.017    0.014     0.019
σ13     -0.326   -0.462    -0.190    σλ,p11    -0.017   -0.023    -0.011
σ22      3.914    3.295     4.529    σp00       0.013    0.010     0.015
σ23     -1.849   -2.422    -1.289    σp00,p11  -0.007   -0.008    -0.005
σ33      5.520    3.675     7.286    σp11       0.041    0.033     0.048

Table 1: Posterior estimates for the CDNOW dataset with 10,000 MCMC iterations used for analysis (40,000 discarded as burn-in).

Looking at the posterior estimates in Table 1, we see that the mean purchase rate is 0.811 and the probability of staying inactive is 0.942, while the probability of staying active is quite low (0.127). This makes sense, as the dataset contains a lot of zeros (i.e. transitions from state 0 to state 0 are highly probable) and transitions from state 1 to state 1 are seldom seen. Moreover, this increases the uncertainty in estimating p11, which is reflected in the rather wide 95% credible interval for µ3 and the large estimate for σ33 (also with a wide interval).

The mean probability of recovering (inactive to active) is 1 − 0.942 = 0.058, which suggests that people have long periods of inactivity or may even have defected. Moreover, the high purchase rate means that whenever a customer is in state 1, he or she almost certainly buys; the recovery probabilities therefore determine whether a customer buys or not.

The estimates for the constrained model look rather different (Table 2): where the purchase rate was high in the unconstrained model, at 0.082 it is substantially lower in the MVNPDO. This is reasonable, as this model assumes people are always active before they defect, while the unconstrained model permits periods of inactivity (no buying). Moreover, the probability of dying is relatively high as well (0.132), which means that customers have a short expected lifetime. If this is true, it is evidence against the switching variant, which assumes customers always have a probability of becoming active.

CDNOW Posterior Estimates MVNPDO

          Mean     q2.5%    q97.5%               Mean     q2.5%    q97.5%
µ1      -3.356   -3.657    -3.051    λ          0.082    0.003     0.407
µ2      -3.977   -4.526    -3.411    p10        0.132    0.000     0.907
σ11      1.616    1.273     2.003    σλ         0.074    0.040     0.128
σ12      1.581    0.574     3.121    σλ,p10     0.011    0.003     0.026
σ22      9.763    4.879    16.687    σp10       0.055    0.020     0.094

Table 2: Posterior estimates for the CDNOW dataset with 10,000 MCMC iterations used for analysis (40,000 discarded as burn-in).

Correlation    Mean     q2.5%    q97.5%
ρλ,p00        0.485    0.450     0.518
ρλ,p11       -0.274   -0.345    -0.185
ρp00,p11     -0.299   -0.375    -0.210
MVNPDO
ρλ,p10        0.164    0.058     0.302

Table 3: Posterior estimates for all correlations using 10,000 MCMC iterations (with 40,000 discarded as burn-in).

All correlations within the two models are found to have at least a 95% probability of being different from zero. Looking at Figure 2 and Table 3, we see that the correlation between the probability of defecting and the purchase rate λ in the constrained model is 0.164; from a marketing perspective this means that people who are more likely to defect have a higher purchase rate before they defect. Contrary to Abe (2009) and Fader, Hardie, and Shang (2010), who also use a Multivariate Normal distribution for heterogeneity modelling, the correlation found here is significant.

Figure 2: Correlation plots using 10,000 MCMC iterations (with 40,000 discarded as burn-in): (a) between λ and p00, (b) between λ and p11, (c) between p00 and p11, (d) in the MVNPDO model.

Similar evidence is found in the non-constrained model, where the correlation of 0.485 between λ and p00 indicates that the purchase rate tends to be higher when the customer has longer inactive spells. This is of importance to marketers, as it might be beneficial to try to activate customers who have long inactive spells, since they buy more once they become active.

As a robustness check, the two models are fitted on a new random sample (2357 customers) from the full CDNOW dataset. The correlations found are again significantly different from zero and the estimates are roughly the same.

To gather more evidence on the positive correlation between the purchase rate and the probability of dying in the MVNPDO model, a small logistic regression is employed. The mean purchase rate obtained from the simulation, plus a constant, are used as explanatory variables to explain whether a customer is dead or not. However, one does not observe when a customer dies. Thus, as a proxy, customers who did not purchase any CDs in the test period (the last 29 weeks) are classified as "dead" and customers who did are classified as alive.

Logistic Regression

            Coefficient   Standard error   P-value
Constant     1.132        0.048            0.000
λ̄           -0.039        0.052            0.454

Table 4: Logistic regression of the proxy for being "dead" onto the mean purchase rate of the simulation.

From the logistic regression perspective, the purchase rate does not have a significant influence on the proxy for dead customers (Table 4); in other words, it does not support the correlations found in the constrained model.
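The check in Table 4 can be sketched as follows. The data here are fabricated stand-ins for the CDNOW proxy and posterior mean rates, and the Newton-Raphson fit is a generic logistic regression, not the exact routine used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
lam_bar = rng.lognormal(mean=-1.0, sigma=0.5, size=n)   # fabricated mean purchase rates
dead = (rng.random(n) < 0.75).astype(float)             # fabricated "dead" proxy
X = np.column_stack([np.ones(n), lam_bar])              # constant + mean rate

def fit_logit(X, y, iters=25):
    """Newton-Raphson for logistic regression; returns coefficients and SEs."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        H = X.T @ (X * W[:, None])               # observed information matrix
        beta += np.linalg.solve(H, X.T @ (y - p))
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    return beta, se

beta, se = fit_logit(X, dead)
z = beta / se   # a |z| below ~1.96 corresponds to the insignificant slope in Table 4
```

The same z-to-p-value logic underlies the P-value column of Table 4.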

Looking at out-of-sample prediction, the MVNHMM does not provide an increase in performance: its Mean Absolute Deviation (MAD) is 0.836 compared to 0.595 for the MVNPDO, and a Wilcoxon signed-rank test shows this is a statistically significant difference with a p-value of 0.000. When the mean absolute errors are broken down by the number of purchase occasions a customer had in the 39-week calibration period, we see that both models perform rather well for customers with few purchase occasions, while predictions worsen for frequent buyers.
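The comparison above can be sketched as follows. The holdout counts and forecasts are fabricated; the paired Wilcoxon signed-rank test is applied to per-customer absolute errors, which is one way to operationalise the test described in the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
actual = rng.poisson(0.6, size=300)                 # fabricated holdout purchase counts
pred_a = actual + rng.normal(0.0, 0.6, size=300)    # "MVNPDO"-style forecasts (stand-in)
pred_b = actual + rng.normal(0.2, 0.9, size=300)    # "MVNHMM"-style forecasts (stand-in)

# Mean Absolute Deviation per model
err_a = np.abs(actual - pred_a)
err_b = np.abs(actual - pred_b)
mad_a, mad_b = err_a.mean(), err_b.mean()

# Paired test on per-customer absolute errors
stat, pvalue = stats.wilcoxon(err_a, err_b)
```

Breaking the errors down by purchase occasions, as in Table 5, amounts to computing `err.mean()` within each segment.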

Nonetheless, the BTYD version performs better in all segments, which may have to do with the large group of customers who bought only once. Although some customers in this group made purchases in weeks 40 to 68, they made "only" 205 purchases in the forecast period over 1308 customers. The remaining customers with no purchases in the forecast period can substantially favour the BTYD model, as these customers exhibit typical defecting behaviour.

To study this claim, all customers who bought only once in the first 39 weeks are removed, both models are fitted on the reduced dataset, and out-of-sample performance is again compared.

CDNOW

Purchase occasions       1      2      3      4      5      6      7      8+     Total
Number of customers      1380   451    200    112    73     50     27     64     2357
Actual total purchases   205    233    189    122    104    112    63     382    1410
MAD MVNPDO               0.258  0.671  1.002  1.071  1.153  1.694  2.012  3.117  0.595
MAD MVNHMM               0.393  0.879  1.342  1.728  1.874  2.610  3.033  3.433  0.836

Table 5: Mean Absolute Deviation for the proposed and constrained model, broken down by the number of purchase occasions in the training dataset.

As can be seen in Table 6, the MVNHMM still performs worse in all segments (and significantly worse at the overall level, using the Wilcoxon signed-rank test). This is evidence that the customers in this dataset exhibit "buy until you die" behaviour rather than switching between inactive and active.

However, on the individual level the MVNHMM nests the MVNPDO as a special case and should converge to this solution if the customers in this dataset behave more like the constrained version. A possible explanation is that the Multivariate Normal distribution does not do an adequate job of modelling differences between customers. If the real heterogeneity distribution is, for example, bimodal, with one part showing BTYD behaviour and the other part switching characteristics, the Multivariate Normal density used here will not be able to capture this, as it is unimodal.

Evidence for this case is found in Korkmaz (2014), where Gaussian mixture models are used as heterogeneity distributions in a variety of customer base models. She finds that incorporating multi-modality improves out-of-sample prediction compared to uni-modal distributions. Additionally, Zhang, Bradlow, and Small (2014) describe the class of Hidden Markov Models as the type of technique to model what they call "clumpiness", but they find that the CDNOW dataset does not have a significant number of "clumpy" customers. Also, the switching model of Büschken and Ma (2011) performs worse on the CDNOW dataset compared to existing non-switching models. The last two points might explain the worse performance of the MVNHMM on this dataset.

(25)

CDNOW ("Dead" removed)

Purchase occasions       1   2      3      4      5      6      7      8+     Total
Number of customers      0   451    200    112    73     50     27     64     977
Actual total purchases   0   233    189    122    104    112    63     382    1205
MVNPDO (MAD)             -   0.834  1.117  1.174  1.202  1.723  1.941  2.982  1.175
MVNHMM (MAD)             -   1.586  1.653  1.773  1.734  2.301  2.571  3.202  1.802

Table 6: Mean Absolute Deviation for the proposed and constrained model, with the one-time purchasers removed and broken down by the number of purchase occasions in the training dataset.


5 Simulation Study

To validate the MCMC estimation method, a simulation study is employed. The posterior estimates found in Table 1 are used to artificially generate 100 datasets with 39 weeks of purchases; the MCMC estimation method is then fitted on each of them and the estimates are reported in Table 7. To avoid excessive computational time, the number of MCMC iterations is reduced to 10,000 (with 5,000 discarded as burn-in) instead of 50,000, which causes some loss in accuracy.
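The data-generating step of this study can be sketched as follows. The log link for λ (so that µ1 = 0 gives λ = 1, as noted later in this section), the logit links for the transition probabilities, the mapping of the second and third components to p00 and p11, the Poisson purchase counts while active, and the initial state are all assumptions about the parameterisation; the mean vector and covariance matrix are the true values listed in Table 7.

```python
import numpy as np

rng = np.random.default_rng(4)

# True values from Table 7 (posterior means used to generate the artificial data)
mu = np.array([-0.268, 4.190, -3.349])
Sigma = np.array([[ 0.117,  0.581, -0.326],
                  [ 0.581,  3.914, -1.849],
                  [-0.326, -1.849,  5.520]])

def simulate_dataset(n_customers, n_weeks, mu, Sigma, rng):
    """Draw individual-level parameters from a MVN and simulate weekly purchases."""
    theta = rng.multivariate_normal(mu, Sigma, size=n_customers)
    lam = np.exp(theta[:, 0])                    # assumed log link for the purchase rate
    p00 = 1.0 / (1.0 + np.exp(-theta[:, 1]))     # assumed logit link, stay-inactive prob.
    p11 = 1.0 / (1.0 + np.exp(-theta[:, 2]))     # assumed logit link, stay-active prob.
    data = np.zeros((n_customers, n_weeks), dtype=int)
    for i in range(n_customers):
        state = 0                                # assumed initial state: inactive
        for t in range(n_weeks):
            stay = p11[i] if state == 1 else p00[i]
            if rng.random() >= stay:
                state = 1 - state
            if state == 1:
                data[i, t] = rng.poisson(lam[i])
    return data

data = simulate_dataset(100, 39, mu, Sigma, rng)   # 100 customers, 39 weeks
```

Repeating this 100 times and refitting the sampler on each dataset yields the recovery check summarised in Table 7.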

Posterior Estimates Simulated Data

       True Values   Mean     q2.5%    q97.5%
µ1     -0.268       -0.311   -0.353   -0.269
µ2      4.190        4.134    3.998    4.251
µ3     -3.349       -3.164   -3.506   -2.842
σ11     0.117        0.183    0.157    0.211
σ12     0.581        0.589    0.501    0.679
σ13    -0.326       -0.340   -0.483   -0.204
σ22     3.914        3.934    3.434    4.474
σ23    -1.849       -1.863   -2.322   -1.378
σ33     5.520        5.131    3.871    6.390

Table 7: Posterior estimates for the artificially generated data using 5,000 MCMC iterations with 5,000 used as burn-in samples.

Looking at Table 7, we see that all estimates are relatively close to their true values, except for µ1 and σ11. This bias might come from the number of iterations not being sufficient (i.e. the MCMC chain has not converged entirely), or it could be that there are not a lot of actual purchases present in the dataset. As the estimated purchase rate (µ1) is quite low, there is not a lot of information to estimate the parameters, leading to more uncertainty. To make sure the sampling method works, the simulation study is employed again. The only difference is that the true value of µ1 is set to 0 (λ = 1) and σ11 to 0.2, to slightly increase the number of purchases in the data.


The second simulation is shown in Table 8, where the results look more accurate than in the first simulation. So, probably due to the lack of observed purchases, the MCMC approach has a harder time estimating the true values in the first simulation study.

Posterior Estimates Simulated Data (increased µ1 and σ11)

       True Values   Mean     q2.5%    q97.5%
µ1      0.000        0.000   -0.040    0.040
µ2      4.190        4.196    4.067    4.319
µ3     -3.349       -3.288   -3.660   -2.915
σ11     0.200        0.200    0.164    0.247
σ12     0.581        0.577    0.494    0.673
σ13    -0.326       -0.310   -0.439   -0.185
σ22     3.914        3.965    3.495    4.517
σ23    -1.849       -1.887   -2.432   -1.404
σ33     5.520        5.279    4.225    6.653

Table 8: Posterior estimates for the artificially generated data with modified µ1 and σ11, using 5,000 MCMC iterations with 5,000 used as burn-in samples.


6 Conclusion

This study aimed to contribute to the probabilistic customer base analysis literature by introducing a discrete switching model in calendar time, with the idea of de-linking the switching process from the purchase process. Additionally, a constrained version of this model is similar to the PDO model of Jerath et al. (2011), but with a Multivariate Normal distribution to model heterogeneity. Neither model has been demonstrated in this context before, and both are introduced to shed more light on the BTYD versus switching assumption. Moreover, the Multivariate Normal heterogeneity distribution allows for correlations between parameters.

In terms of out-of-sample prediction, the constrained variant performs significantly better on (part of) the CDNOW dataset, which is reinforced by Zhang et al. (2014), who find that this dataset does not have the characteristics that Hidden Markov Models capture well.

Both models find a significant negative correlation between the purchase rate and active time. In marketing terms, this indicates that people who have a higher probability of defecting (or switching to inactivity) also have a tendency to buy more in the period(s) they are alive or active (i.e. a higher purchase rate). A point of critique is that the mean purchase rate in the MVNPDO model does not seem to have an influence when a proxy for "dead" customers is regressed on this mean rate and a constant.

The fact that a constrained version performs better than the proposed model may have to do with the heterogeneity distribution used, as it only allows for uni-modality. When multiple distinct groups of customers are present in the data, the Multivariate Normal distribution does not do an adequate job of modelling the real differences between customers, because it is uni-modal. In the hypothetical case that one group of customers shows "buy till you defect" behaviour and another shows switching characteristics, the Multivariate Normal distribution will probably settle somewhere in the middle, introducing a bias in the estimates for both groups.


Future research could therefore consider Gaussian mixture distributions to model multi-modality (Korkmaz, 2014), or clustering techniques from the area of Machine Learning to identify distinct groups of customers. Another extension that can introduce multi-modality is incorporating time-variant and/or time-invariant covariates. The Multivariate Normal process used here easily allows for this, but time-variant covariates will severely complicate the likelihood function.
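To illustrate why a unimodal density can miss segment structure, a two-component Gaussian mixture can be fitted with a plain EM algorithm in one dimension. The data and segment interpretation below are fabricated for illustration, and this generic EM is not the mixture model of Korkmaz (2014).

```python
import numpy as np

rng = np.random.default_rng(5)
# Fabricated bimodal "heterogeneity" sample: two hypothetical customer segments
x = np.concatenate([rng.normal(-2.0, 0.5, 300),   # e.g. a "BTYD-like" segment
                    rng.normal(1.5, 0.7, 200)])   # e.g. a "switching" segment

def em_two_gaussians(x, iters=100):
    """EM for a two-component univariate Gaussian mixture."""
    w = 0.5                                        # weight of component 0
    m = np.array([x.min(), x.max()])               # means, initialised far apart
    s = np.array([x.std(), x.std()])               # standard deviations
    for _ in range(iters):
        # E-step: responsibilities of component 0 (normalising constant cancels)
        d0 = np.exp(-0.5 * ((x - m[0]) / s[0])**2) / s[0] * w
        d1 = np.exp(-0.5 * ((x - m[1]) / s[1])**2) / s[1] * (1 - w)
        r = d0 / (d0 + d1)
        # M-step: weighted updates of weight, means, and standard deviations
        w = r.mean()
        m = np.array([np.average(x, weights=r), np.average(x, weights=1 - r)])
        s = np.array([np.sqrt(np.average((x - m[0])**2, weights=r)),
                      np.sqrt(np.average((x - m[1])**2, weights=1 - r))])
    return w, m, s

w, m, s = em_two_gaussians(x)
```

A single Gaussian fitted to the same sample would place its mean between the two modes, which is precisely the "settling in the middle" behaviour the conclusion warns about.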


Appendix A: Conditional Distributions

We first derive the full conditional distributions for the mean and covariance matrix of the Multivariate Normal distribution, where a conjugate prior is used for mathematical convenience. Let us write:

P(µ, Σ) = P(µ|Σ)P(Σ). (17)

We need a Multivariate Normal and an Inverse-Wishart distribution for conjugacy:

P(µ|Σ) ∼ N(µ0, m⁻¹Σ) (18)

P(Σ) ∼ W⁻¹(Ψ0, n0). (19)

Next, the full conditional distributions can be derived analytically:

P(µ|Σ, X) ∝ P(X|µ, Σ)P(µ|Σ) ∼ N( (n x̄ + m µ0)/(n + m), (1/(n + m)) Σ ) (20)

P(Σ|X) ∝ P(X|Σ)P(Σ) ∼ W⁻¹( Ψ0 + nS + (nm/(n + m)) (x̄ − µ0)(x̄ − µ0)′, n + n0 ). (21)
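A minimal sketch of how draws from (20) and (21) might be generated with standard libraries; the data matrix and prior hyperparameters (µ0, m, Ψ0, n0) below are illustrative stand-ins, not the settings used in the thesis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))       # stand-in for the individual-level parameters
n, d = X.shape
xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / n   # so that n * S is the centered scatter matrix

# Illustrative prior hyperparameters
mu0, m = np.zeros(d), 1.0
Psi0, n0 = np.eye(d), d + 2

# (21): draw Sigma | X from the inverse-Wishart full conditional
scale = Psi0 + n * S + (n * m / (n + m)) * np.outer(xbar - mu0, xbar - mu0)
Sigma = stats.invwishart.rvs(df=n + n0, scale=scale, random_state=0)

# (20): draw mu | Sigma, X from the multivariate normal full conditional
post_mean = (n * xbar + m * mu0) / (n + m)
mu = rng.multivariate_normal(post_mean, Sigma / (n + m))
```

One such pair of draws constitutes a single Gibbs sweep over (µ, Σ); iterating this while alternating with updates of the individual-level parameters gives the full sampler.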

The posterior distribution for the individual parameters is only known up to a constant and can be written as follows:

P(λ, p00, p11|Σ, µ, X) ∝ P(X|λ, p00, p11)P(λ, p00, p11|Σ, µ), (22)

where the first term on the right-hand side is the likelihood of the Hidden Markov Model as described in (5) and the last part is the Multivariate Normal density following from the assumptions.
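Because (22) is known only up to a normalising constant, these parameters are typically updated with a Metropolis step inside the Gibbs sampler. A generic random-walk Metropolis sketch on a placeholder target (a standard normal log density, not the HMM likelihood of (5)):

```python
import numpy as np

rng = np.random.default_rng(7)

def log_target(theta):
    """Placeholder log posterior, known only up to a constant."""
    return -0.5 * np.sum(theta**2)

def rw_metropolis(log_target, theta0, n_iter=2000, step=0.5, rng=rng):
    """Random-walk Metropolis: propose theta + step*noise, accept with min(1, ratio)."""
    theta = np.asarray(theta0, dtype=float)
    lp = log_target(theta)
    chain, accepted = [], 0
    for _ in range(n_iter):
        prop = theta + step * rng.normal(size=theta.shape)
        lp_prop = log_target(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept/reject on the log scale
            theta, lp = prop, lp_prop
            accepted += 1
        chain.append(theta.copy())
    return np.array(chain), accepted / n_iter

chain, acc_rate = rw_metropolis(log_target, np.zeros(3))
```

In practice the step size would be tuned (or adapted, as in Haario, Saksman, and Tamminen, 2001) to keep the acceptance rate in a reasonable range.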


Appendix B: Markov-Chain Monte-Carlo Output

(a) Histogram of µ1 (b) Trace plot of µ1

Figure 3: MCMC output of 50,000 iterations (40,000 discarded as burn-in).

(a) Histogram of µ2 (b) Trace plot of µ2

Figure 4: MCMC output of 50,000 iterations (40,000 discarded as burn-in).

(a) Histogram of µ3 (b) Trace plot of µ3

Figure 5: MCMC output of 50,000 iterations (40,000 discarded as burn-in).

(a) Histogram of σ11 (b) Trace plot of σ11

Figure 6: MCMC output of 50,000 iterations (40,000 discarded as burn-in).

(a) Histogram of σ12 (b) Trace plot of σ12

Figure 7: MCMC output of 50,000 iterations (40,000 discarded as burn-in).

(a) Histogram of σ13 (b) Trace plot of σ13

Figure 8: MCMC output of 50,000 iterations (40,000 discarded as burn-in).

(a) Histogram of σ22 (b) Trace plot of σ22

Figure 9: MCMC output of 50,000 iterations (40,000 discarded as burn-in).

(a) Histogram of σ23 (b) Trace plot of σ23

Figure 10: MCMC output of 50,000 iterations (40,000 discarded as burn-in).

(a) Histogram of σ33 (b) Trace plot of σ33

Figure 11: MCMC output of 50,000 iterations (40,000 discarded as burn-in).

References

Abe, M. (2009). "Counting your customers" one by one: A hierarchical Bayes extension to the Pareto/NBD model. Marketing Science, 28(3), 541–553.

Büschken, J., & Ma, S. (2011). Counting your customers from an "always a share" perspective. Marketing Letters, 22(3), 243–257.

Büschken, J., & Ma, S. (2012). When are your customers active and is their buying regular or random? An Erlang mixture state-switching model for customer scoring (March 8, 2012).

Fader, P. S., Hardie, B. G., & Lee, K. L. (2005). "Counting your customers" the easy way: An alternative to the Pareto/NBD model. Marketing Science, 24(2), 275–284.

Fader, P. S., Hardie, B. G., & Shang, J. (2010). Customer-base analysis in a discrete-time noncontractual setting. Marketing Science, 29(6), 1086–1108.

Haario, H., Saksman, E., & Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 223–242.

Jerath, K., Fader, P. S., & Hardie, B. G. (2011). New perspectives on customer "death" using a generalization of the Pareto/NBD model. Marketing Science, 30(5), 866–880.

Korkmaz, E. (2014). Bridging models and business: Understanding heterogeneity in hidden drivers of customer purchase behavior (Tech. Rep.).

Korkmaz, E., Kuik, R., & Fok, D. (2013). "Counting your customers": When will they buy next? An empirical validation of probabilistic customer base analysis models based on purchase timing. ERIM Report Series Research in Management (ERS-2013-001-LIS).

Platzer, M. (2008). Stochastic models of noncontractual consumer relationships.

Schmittlein, D. C., Morrison, D. G., & Colombo, R. (1987). Counting your customers: Who are they and what will they do next? Management Science, 33(1), 1–24.

Zhang, Y., Bradlow, E. T., & Small, D. S. (2014). Predicting customer value using clumpiness: From RFM to RFMC. Marketing Science, 34(2), 195–208.

Zucchini, W., & MacDonald, I. L. (2009). Hidden Markov models for time series: An introduction using R (Vol. 150). CRC Press.
