A Time Varying Approach to Churn Modeling

(1)

A Time Varying Approach to Churn Modeling

Niels Holtrop

September 2, 2011

Abstract

In this paper we introduce a new method to model and predict customer churn. We develop a model that includes both time-varying parameters as well as unobserved heterogeneity between customers. This model combines ideas from Kalman filtering with those of mixture modeling to account for both effects. As an empirical illustration of this model, we use a dataset provided by a large Dutch health care insurer that covers the years 2004-2007. Such a dataset that covers multiple years is ideally suited for the model we develop. First, we show that taking into account changes over time and between customers is relevant, as we observe clear differences in the pa-rameter estimates over time and for different clusters. We compare the predictions generated by our model with commonly used logit and tree models and find that our model performs just as well or better at predicting customer churn. As a measure of predictive ability, we use both the top-decile lift as well as the Gini coefficient. Our model scores better than the common methods on both measures when it comes to long term predictions, which can overcome the problem of low staying power that is often observed in current methods.

Keywords:: Churn modeling, state-space models, Kalman Filter, mixture models

1 Introduction

(2)

Several model approaches have been discussed in the literature. The most common approaches are logit and tree models (Neslin et al., 2006). Other methods discussed are for example discriminant anal-ysis (Hair Jr. et al., 2010), neural networks (Faraway, 2006) and support vector machines (Coussement and Van den Poel, 2008). All these methods focus in principle on estimating a churn model at one point in time and then use this model to predict the churn for future periods. This method has the downside that if the market conditions in which the model was estimated change, predictions for the future can be well off. (Neslin et al., 2006; Risselada et al., 2010) show that the staying power of most models is very low and that most models can only predict reliably one or two periods ahead. One way around this problem is to re-estimate a model; however this can be a costly and time consuming task (Malthouse and Derenthal, 2008). We propose to use a model that incorporates past information on the customers to generate predictions. In this way, more information is included in the model than is included in current, commonly used models. Furthermore, as the method we propose is an on-line estimation method, when new data becomes available the model is easily updated with this information without the need for complete re-estimation, which reduces the costs of the model. We also propose an extension to the model that includes heterogeneity by dividing the customer base into clusters and estimates the model for each cluster, thus including more between customer differences. We believe that such an approach can generate better predictions, as the goal of churn modeling is to

identify differences between customers in the customer base. We will compare our model predictions

to commonly used logit and tree model by means of top-decile lift and Gini coefficient, which makes our results comparable to previous studies in this field. Our findings suggest that our approach, which captures both variation over time as well as between customers, can improve the long term churn predictions and thus has a larger staying power. This can solve one of the largest problems currently observed for some of these models according to some authors (Risselada et al., 2010), although other findings suggest the problem might be less severe (Neslin et al., 2006).

This paper will unfold as follows: In section 2, we discuss the background of this research in more detail. Section 3 discusses the model specification for the model with and without heterogeneity. In section 4 we provide an empirical application of our model using data from a large Dutch health care insurer, and compare our model with other, commonly used techniques. In section 5 we discuss our results, which are followed by the conclusion in section 6.

2 Research background

Modeling and predicting customer churn has become more and more important during the past decade, as the interest of firms turned towards building long lasting customer relationships in combination with a maximization of the value of these customers (Lemon et al., 2002; Bolton et al., 2004). Naturally, customer churn does not lead to long lasting relationships, which raises the issue of retaining current customers. Reasons to aim for retention are the high cost to acquire new customers (Athanassopoulos, 2000), the idea that retention programmers can actually increase customer satisfaction (Colgate and Danaher, 2000) and that retention can be the only way to profitably maintain or grow the customer base, for example in mature markets (Hadden et al., 2005).

(3)

models that underlie the CLV calculation, among which are the churn models.

The importance of the quality models has in the past raised the question which model type is best suited to model and predict churn. (Neslin et al., 2006) discuss a competition among academics and practitioners aimed at surveying a large amount of different models to find out the best models for churn prediction. Their results show that logit and tree models perform best. In particular, tree models with bagging or boosting (Lemmens and Croux, 2006) provide good predictions. All models consid-ered currently in the literature can be seen as static models, i.e. they are time invariant. This implies that as the external circumstances change, the models might loose their validity. While this would not be problematic if the models are still able to provide accurate predictions, previous research has shown that models are only valid for one to three periods ahead (Neslin et al., 2006; Risselada et al., 2010). Hence, if we can develop a model that can provide good predictions for longer periods ahead, we could improve over currently used models. Furthermore, this would raise the opportunity of cost savings, as re-estimating models can be costly and time consuming (Malthouse and Derenthal, 2008). Recent developments in the modeling literature offer a potential solution to the problem of the in-ability of churn models to adapt to changes in the external environment. By allowing the parameters of the model to change over time, we can develop a model that is able to adapt to new situations as they arise. State-space models (Leeflang et al., 2009) offer a way to allow for changing parameters over time. State-space models can be estimated using both a frequentist approach using the Kalman filter (Durbin and Koopman, 2001) as well as through a Bayesian approach (West and Harrison, 1989), which is known as Dynamic Linear Models (DLM). We follow the frequentist approach here, as the Kalman filter offers a computationally light and straightforward to implement way to estimate state-space models. Another advantage of the Kalman filter is that it allows for on-line estimation, which allows the model estimates to be updated as new information becomes available without the need to completely re-estimate the model. This achieves the goal of cost reduction we set out above.

(4)

3 Model specification

In this section we discuss our model approach. First, we introduce the model specification without heterogeneity and discuss the estimation of this model. Next, we extend the model to include hetero-geneity making use of a mixture model type of approach.

We develop a simple state-space model that can be used to model churn. The specification consists of two parts: a state equation and a transition equation. The state equation is given as

yt = Xtβt+ t, (1)

where ytis the binary coded dependent variable representing whether a customer churned in period t or

not, Xt is a matrix containing the customer characteristics corresponding to period t and βt represents

the parameter vector for period t. We assume that E(yt|βt) = µt = g(ηt), with ηt = Xtβt. The function

g(·) is the link function, which in this case is the logit function g(ηt) =

exp(ηt)

1+ exp(ηt)

. This specifies a binomial model for the state equation, as is done in generalized linear models (GLM). For the error

term t, we assume that it is distributed with zero mean and diagonal covariance matrixΣt, where each

diagonal element is the variance given by the logistic variance function V(µt) = µt(1 − µt). Many

references provide (e.g. (Azzalini, 1996, chap. 6)) provide an overview of this link and variance function, as well as for other exponential family models.

To complete the model specification, we need to specify a transition equation for the parameter

vector βt. We shall assume the following:

βt+1= Ftβt+ ξt. (2)

For our application, we shall assume that Ft = I, which in effect specifies a logistic regression model

with time-varying parameters. For the error term ξtwe assume a normal distribution with mean zero

and known covariance matrix Qt. We also assume that the error terms tand ξt are uncorrelated.

To estimate state-space models based on time series data, most applications make use of the Kalman filter (Durbin and Koopman, 2001). However, these models assume a normal distribution for the error terms of the state equation. Clearly, in our case this does not apply, as we assume that the state equa-tion represents a logistic regression model. To estimate the above model, we make use of an extension of the Kalman filter recursions to exponential family data as described in (Fahrmeir and Tutz, 1994, chap. 8). The approach described covers both univariate and multivariate time series, and can thus be broadly applied. For our application we use a multivariate extension of these recursions. Also

note that the filter discussed in (Fahrmeir and Tutz, 1994) is not limited to binary yt, but can be used

to estimate time-varying variants of any exponential family model, for example a count model using the Poisson distribution. Besides a Kalman filter, (Fahrmeir and Tutz, 1994) also derive a Kalman smoother. This fixed interval smoother is similar to the one used in the normal Kalman filter, see (Durbin and Koopman, 2001).

An assumption we did not yet touch upon is that the matrix Qt is assumed to be known.

Further-more, to initialize a Kalman filter, initial state estimates are required, in this case denoted by β0and

Q0. We can estimate these three quantities from the data by using an EM-type of approach (Fahrmeir

and Wagenpfeil, 1997). This approach uses the same Kalman filter as we use to find estimates for β

in a pre-step to the actual model estimation to determine Qt, β0and Q0. We then use these estimates

(5)

3.1 Extension with unobserved heterogeneity

The above model specification pools all observations for a certain year and uses these for estimation. We also consider an extension of this model that uses a mixture model approach to form clusters for each year and then estimates the above model for each cluster. The intuition behind this approach is that besides the observed customer characteristics, we might also expect unobserved factors to influ-ence the churn decision. Furthermore, similarities between customers can also be captured in this way and might improve predictions produced by the model by extrapolating these similarities to future time periods.

To implement this model, we adapt prior work in this direction (Calabrese and Paninski, 2011) to fit exponential family observations. This method is known as the mixture of Kalman filters model, which is a special case of the mixture of Gaussians model (Wu et al., 2004). The idea is to cluster the observations in each time period and also estimate a mixture parameter. The above model is than estimated for each cluster using the modified Kalman filter, and based on the outcomes the cluster memberships are updated again. This is an iterative procedure that is repeated several times until convergence. It results in parameter estimates for each cluster, on which predictions can be based. Clustering takes place on the known customer characteristics. The members in each cluster are sim-ilar with respect to the characteristics we observe and thus we assume that they also are simsim-ilar with respect to some of the unknown characteristics. In this way, we can capture some of the unobserved heterogeneity between customers. Note that we keep the number of clusters fixed over time. This was done because the complete likelihood over all periods is maximized, which does not allow the num-ber of clusters to vary. However, an extension with time-varying clusters could be achieved (Smith and Brown, 2003), but this method only provides approximate results instead of the exact results we achieve here. Therefore, we chose to keep the number of clusters fixed over time and accommodate changes over time through time-varying parameters.

The method used to estimate this cluster model is maximum likelihood, where the likelihood of each Kalman filter is weighted by the mixture parameter. Customers are assigned to the cluster for which they have the highest probability, as indicated by the mixture parameter, as is typical in mixture mod-eling. To estimate all the parameters of the model, an EM algorithm is used that first maximizes the likelihood, then assigns customers to cluster and then maximizes the likelihood again based on the new assignment. Customers are assigned to the cluster for which they have the highest probability of belonging, based on an evaluation of the likelihood using their characteristics and the estimated mix-ture parameter. This is done for a number of iterations, after which the final parameters are smoothed using the same fixed interval smoother as above.

In the next section, we will illustrate both model specifications with an empirical application. We estimate the model with and without unobserved heterogeneity and compare the predictions gener-ated by these models to the commonly used logit and tree models. We also compare the predictions to a boosted tree model, as previous studies (Neslin et al., 2006) found that this model outperformed the other models in terms of predictive power. To compare the model predictions, we make use of the

(6)

4 Empirical application

For our empirical application, we use a dataset provided by a large Dutch health care insurer (see (Dijksterhuis and Velders, 2009; Risselada et al., 2010) for previous applications of this dataset). The data cover the years 2004-2007 and record for each year whether a customer churned or not. The yearly frequency is typical for this type of data, as customers can switch between insurances at most once a year (Dijksterhuis and Velders, 2009). An interesting feature of this dataset is that in 2006, a system change (see (Douven et al., 2007) for some background information) took place, which meant that customers where free to choose their own insurer, whereas before people under a certain income level were restricted to predetermined collective insurers. This change increased the mobility of customers, and also made it more important for insurers to manage customer churn. Another interesting aspect of this dataset is that is an outflow sample, i.e. a customer is followed from 2004 until she churns (or stays with the insurer until 2007). This gives the data a longitudinal dimension, which allows us to follow customers over time. Note that this is not a requirement for our model to be estimated, however.

The dataset contains variables on a variety of characteristics. Included are sociodemographic variables, socioeconomic variables, information on relationship characteristics and insurance details. Most of these variables are available for each year, while some are only recorded at the start of the period under observation (2004). We exploit the changes in these variables with our model, which allows the variables to change over time and adapts the parameters accordingly. Variables that are selected in all models are for example gender, age, whether someone is collectively insured or not, family status (single living, with or without kids), number of phone contacts, automatic bill payment

(yes/no) and the type of insurance package. To estimate the models, we created balanced samples for

each year where the number of churners is equal to the number of non-churners. This approach is commonly used (Neslin et al., 2006; Lemmens and Croux, 2006; Risselada et al., 2010), as the num-ber of churners is often much lower than the numnum-ber of non-churners. In our dataset, the churn rates per year are 8.3%, 7.2%, 32.9% and 4.0% respectively, which warrants this method. For each year, we also created holdout samples that have a number of churners proportional to the actual number of churners for that year in the dataset. These holdout samples were used to generate the predictions with the models for each year.

(7)

To judge the predictive power, we will use two measures that are commonly used for this purpose (Neslin et al., 2006; Lemmens and Croux, 2006; Risselada et al., 2010): the top-decile lift and the

Gini coefficient. The top decile lift is a measure aimed at identifying the high risk churners and is

computed by dividing the fraction of churners predicted to be in the top decile by the fraction of churners in the entire sample. Values in excess of 1 indicate that there are more churners in the top

decile and hence a model predicts better than random assignment would do. The Gini coefficient

con-siders both churners and non-churners and compares these to a random assignment of the customers. It takes values between 0 and 1, values closer to 1 indicating a better predictive fit of the model. Thus,

the Gini coefficient functions as a measure of overall model performance.

5 Results

In this section we present our results. First, we show through some examples that taking into account variation over time is important, as the parameter values change over time. Furthermore, we also show that taking into account heterogeneity leads to quite different parameter estimates between clusters. After that, we compare the predictions between all five models.

5.1 Time-varying parameters

One of the unique characteristics of the state-space models we discuss is the ability to take into changes over time by means of time-varying parameters. Such an approach sets these models apart from models such as the logit and tree model, which only allow for static estimation. That an time-varying parameter approach can be helpful is illustrated by the four examples in figure 2.

<< Figure 2 about here >>

This figure shows that there are substantial changes in the parameters over time, that are not captured by the commonly used models, but are captured by the state-space model. This figure also clearly

reflects the differences in the periods before (2004/2005) and after (2006/2007) the system change.

If we look at gender for example, we find that the parameter values are higher post change than they

were before the change, whereas an opposite effect is found for education. The other two variables

plotted also show similar changes over time. Thus, there seems to be some intuitive appeal to the proposed method.

Besides changes over time, we also extended the model to include heterogeneity between unobserved clusters in the dataset. We estimated models with clusters ranging from 1-10 and found that a two cluster solution showed the best fit, determined by the BIC. In figure 3 we plot the parameters for the same four variables over time as before, but now for two clusters based on the mixture state-space model.

This plot makes clear the differences that exist between the clusters in terms of parameter estimates.

(8)

vein, the other variables we plotted illustrate that the two clusters we find differ with respect to their parameters.

5.2 Prediction results

Now we focus on the predictions generated by the models. The five models we consider (logit model, tree model, boosted tree model, state-space and mixture state-space model) were all estimated using a balanced sample. Next, we used the proportional holdout samples to generate predictions and used

these predictions to compute the top-decile lift and Gini coefficient. We provide predictions for one,

two and three periods ahead. In table 1, we present the top-decile lift and Gini coefficient for all predictions for each model considered.

<< Table 1 about here >>

The top-decile lift and Gini coefficient are computed for each year and depending on the year we can predict up to three years ahead. We will discuss the results for each period separately.

If we look at the one-period ahead forecasts, we find that depending on the year the results differ

substantially. If we predict from 2004 to 2005, we move between two periods with comparable churn rates. The mixture state-space model performs best on both top-decile lift and Gini coefficient, which means that the combination of time-varying parameters and heterogeneity is able to capture most of the characteristics in the data. This is likely due to the heterogeneity part of the model, as the ordinary state-space model performs worst on both measures. In between are the logit and tree models, with the tree model, both with and without boosting, scoring better on both measures than the logit model. The

boosted tree model has some more difficulty with identifying churners than the ordinary tree model,

as indicated by the lower top-decile lift, but it scores slightly better than the ordinary tree model on

the Gini coefficient.

From 2005 to 2006, we observed a large increase in the churn rate due to the system change. This change is also likely to affect the structure of the market. We find that the state-space models have

difficulty coping with this change, as both of them show low scores on both performance measures.

The logit and tree model provide substantially better results, with the ordinary tree model having the highest scores on both measures. We do note, however, that all models show an improvement over random assignment of the customers, which indicates that all of them are robust to large changes in the market and can provide reliable insights even if the external conditions change.

(9)

The predictions for two and three years ahead show that the state-space models perform better than the logit and tree models. Whereas the latter models have problems with identifying both churners and non-churners, the former models score good on both fronts. The predictions in 2004 for 2006 are done best by the logit model on top-decile lift, but the mixture state-space model is a close second. How-ever, the latter model is much better able to classify non-churners, which translates in a higher Gini

coefficient. Looking at the predictions in 2005 for 2007, we find that the ordinary state-space model

has much higher values for both measures than the other models. The three year ahead predictions are again done best by the mixture state-space model, where also the ordinary state-space model is good at identifying churners, but fails to capture the non-churners accurately. Thus, we can conclude that the state-space approach is much better able to predict further ahead than logit and tree models. Furthermore, when the predictions are further ahead in the future, the fact that the mixture state-space model includes heterogeneity seems to benefit the quality of the predictions and also makes them more robust to external changes.

As a final comparison between models, we calculated the average top-decile lifts and Gini coe

ffi-cients averaged over the number of periods predicted ahead (e.g. over period t, over period t+ 1 etc.).

This makes the comparison between models easier and gives a straightforward way to identify the merits of each model. For the top-decile lifts, the plot can be found in figure 4.

These results confirm the finding that as the time horizon of the predictions increases, both state-space models are better able to identify churners than the logit and tree models. The mixture state-space model performs best overall in predictions, and is second best in the same period as the tree model, although the predictive ability of that model diminishes rapidly. The ordinary state-space and logit model perform worse when it comes to identifying churners in all periods, save for a slight outlier for

the ordinary state-space model in period t+ 2, and the same holds for the boosted tree model. For the

Gini coefficients, a similar plot is given in figure 5.

In terms of Gini coefficients, it is more difficult to judge the different models. Clearly, the logit and boosted tree model quickly diminish in terms of predictive power, as we observed before. We also

observe the same outlier for the ordinary state-space model in period t+ 2, but for the other periods

this model performs worst of all in terms of Gini coefficients. Depending on the period, it is either

the tree model or the mixture state-space model that has the highest Gini coefficients, and thus has the

overall best predictive power. A difference that we observe now is that if the horizon of the predictions

increases, it is the tree model that offers the best predictions, where for one period ahead predictions

this holds for the mixture state-space model.

Concluding this section, we can say that overall it are the tree model and the state-space model that incorporates heterogeneity that offer the best predictions. Furthermore, we have shown that there is

some merit in using a time-varying parameter and/or heterogeneity approach to churn modeling, as

(10)

6 Conclusion

In this paper, we formulated, estimated and tested a state-space approach to churn modeling. This ap-proach allows for time-varying parameter, which allows the model to adapt to changes in the external environment over time. Furthermore, we also extended this model to include heterogeneity between customers, which was taught to further increase the predictive power of the model. We compared these two models to a logit model, tree model and a boosted tree model using a dataset obtained from a large Dutch health care insurer. The results showed that out of all these models, the tree model and the state-space model that includes heterogeneity provide the best predictions, based on the top-decile

lift and Gini coefficient. When it comes to predicting churners only, both the state-space model with

and without heterogeneity outperform the other models when the time horizon increases to two or three periods ahead, although the one period ahead predictions of the state-space model with het-erogeneity are also best overall. However, both models seem to have a problem with identifying non-churners, which results in lower Gini coefficient scores for these models, even when the time horizon of the predictions increases. The tree model does better in that respect, which is why we find that this model also performs well next to the state space model with heterogeneity. Overall, we can conclude that our state-space approach works just as well or even better when compared to other mod-els, provided that we include heterogeneity in the model. An important finding is that the predictions of this model are still good after several time periods, which means that this model has a good staying power. The staying power rivals the largest number of three periods ahead found in the literature (Neslin et al., 2006), but at least we can say that our model seems to deliver good predictions for the commonly reported boundary of one or two periods ahead (Risselada et al., 2010). Another finding is

that accounting for changes over time and differences between customers can be very worthwhile for

this type of problems. We found substantial differences on both fronts, which can only be captured by a model such as described in this paper. This shows the promise of this method for future applications. However, there are still some limitations to this study. Firstly, we only use one dataset to illustrate our findings. To validate whether this method works in general, future research should include

mul-tiple datasets to compare differences between them, as this might influence results. In particular, a

(11)

A

Technical Appendix

In this appendix we discuss some of the computational details of our models. First, we introduce the Kalman filter that is used to estimate the ordinary state-space model, alongside the fixed interval smoother that is used for smoothing and the algorithm that is used to find starting values for the Kalman filter. After that, we expand the model with heterogeneity, which leads to the mixture state-space model. Furthermore, we provide a short review of the logit and the tree model as well.

A.1 Kalman Filter

First, recall the model. It consists of two equations, the state equation

yt = Xtβt+ t, (3)

and the observation equation

βt+1= Ftβt+ ξt. (4)

Instead of a normal distribution for yt we assumed that E(yt|βt) = µt = g(ηt), with ηt = Xtβt and g(·)

some given function, known as the link function. For the variances of both equations we assumed

that t ∼ (0,Σt) and that ξt ∼ N(0, Qt). To perform inference, we would like to obtain estimates for

the parameter vector β, given the variancesΣt and Qt. The way to obtain these estimates is using the

Kalman filter. Given the above assumptions we can derive the following recursions (Fahrmeir and

Tutz, 1994) to obtain estimates for βt and their corresponding variance matrix, which is known as

Kalman gain: Kit = Vi−1,tXit0Dit

h D0_itXitVi−1,tXit0Dit+ Σit i−1 , (5) with Dit = δg δηit

and where µit, Σit, Dit are evaluated at βi−1,t. The term Dit is added with respect to

Kalman filters assuming a normally distributed state equation to account for the non-normality in equation 4. For a univariate filter, the correction step is only made once.

Note that the above model and Kalman filter can be used for any situation where the dependent

variable yt comes from an exponential family distribution. For the churn modeling application we

described, we have that yit ∼ Ber(0, 1), which leads to a dynamic logistic regression model if we

assume that link function g(·) is g(ηt)=

exp(ηt)

1+ exp(ηt)

(12)

A.2 Kalman smoother

Once we have obtained parameter estimates using the filtering algorithm, we can also use this infor-mation to go back in time and remove some of the variance to obtain estimates that are smoother than the filtered estimates. This procedure is known as smoothing and can be performed using a smoother. The Kalman smoother we used is a fixed interval smoother (Fahrmeir and Tutz, 1994), which takes the following form:

For t= T, . . . , 1 :

βt−1|T = βt−1|t−1+ Bt(βt|T −βt|t−1)

Vt−1|T = Vt−1|t−1+ Bt(Vt|T− Vt|t−1)B0t

Bt= Vt−1|t−1F0tVt|t−1−1

(6)

A.3 Estimation of hyperparameters

The Kalman filter described in equation 5 requires that we specify initial values Q0, β0and the variance

matrix Qt, which are known as the hyperparameters of the model as we need to specify them before

we can obtain estimates for the model parameters βt. Instead of specifying these ourselves, we would

like to estimate these from the data. We provide an algorithm (Fahrmeir and Wagenpfeil, 1997) that can do this. It is similar to an Expectation Maximization (EM) algorithm and takes the following form:

1. Set starting values Q(0), Q(0)₀ and β(0)₀ and iteration index p= 0.

2. Obtain β_t|Tp and V_t|Tp from (5) and their corresponding smoothed estimates using (6), with

un-known parameters replaced by their current estimates Q(p), Q(p)₀ and β(p)₀ .

3. EM step: Compute Q(p+1), Q(p₀+1)and β(p₀+1)as follows:

4. When a termination criterion is reached, stop. Else, set p= p + 1 and return to step 2.

Using this algorithm, we can obtain initial values for the filter in equation 5. To obtain the estimates

βtwe use these initial values and run the Kalman one last time to obtain the estimates we are looking

(13)

B

Mixture state space model

To extend this model using a mixture model approach, we will first consider the likelihood that is associated with the model in equations 3 and 4. The log-likelihood is given as (Fahrmeir and Tutz, 1994) `(βt)= T X t=1 n X i=1 lit(βt) − 1 2(β0− b0)Q −1 0 (β0− b0) − 1 2 T X t=1 (βt− Ftβt−1)Qt−1(βt− Ftβt−1), (7)

where lit(βt) represents the (conditional) log-likelihood of the observation yit. Note that the Kalman

filter seeks to maximize the penalized likelihood (7), given values for β0, Q0and Qt.

We want to extend the state space model using the mixture model approach to a model that esti-mates a state-space model for J clusters. We do this using an EM approach (Calabrese and Paninski, 2011). To that end, we first define the log-posterior as follows:

`(βt)= T X t=1       log(λ j t)+ n X i=1 lit(βt)       − 1 2(β0− b0)Q −1 0 (β0− b0) − 1 2 T X t=1 (βt− Ftβt−1)Q−1t (βt− Ftβt−1), (8)

which is the loglikelihood (7) with an added term λtj that denotes the mixture probability, i.e. the

probability of belonging to cluster j at time t. We assume thatPJ

j=1λ j

t = 1 and that λ

j

t ≥ 0. For the

EM procedure we first compute the expectation of the above log-posterior, which is given as

J X j=1 T X t=1       p( j)       log(λ j t)+ n X i=1 lit(βt)              − 1 2(β0− b0)Q −1 0 (β0− b0) − 1 2 T X t=1 (βt− Ftβt−1)Q−1t (βt− Ftβt−1) (9) with p( j)= exp log(λtj)+ P n i=1lit(βt) PJ j=1exp log(λ_tj)+ Pn_i₌₁lit(βt) . (10)

With this expectation, we can proceed the algorithm by maximizing it. This involves solving the problem arg max λt,βt J X j=1 T X t=1       p( j)       log(λ j t)+ n X i=1 lit(βt)               −1 2(β0−b0)Q −1 0 (β0−b0)− 1 2 T X t=1 (βt−Ftβt−1)Q−1t (βt−Ftβt−1), (11) which we can actually solve in several steps as we can split up the likelihood. Maximizing

(14)

or the fraction of observations assigned to cluster j. This maximizes the first part of the likelihood. What remains is a likelihood that resembles that in equation (7), which we can maximize separately. To obtain the β parameters we can make use of the above Kalman filter, with a slight modification to the Kalman gain part to account for the presence of p( j):

Kit= Vi−1,tXit0Dit

h

D0_itXitVi−1,tX0itDit+ p( j)Σit

i−1

. (13)

Thus, by a slight modification of the original algorithm we are able to estimate a mixture state-space model. In words, the algorithm reads as follows: Initialize the algorithm by assigning observations to J clusters, e.g. using k-means clustering on the known customer characteristics, and initialize the Kalman filter using initial estimates. Then, repeat the following until some convergence criterion is reached or for a fixed number of steps (e.g. 20 iterations (Calabrese and Paninski, 2011)):

1. Update clusters memberships under current parameter estimates using equation (10)

2. For j=1:J update the mixture probabilities using equation (12)

3. For j=1:J update βt and Vt using the Kalman filter (5), using the modified Kalman gain (13)

C

Logit model

The logit model belongs to the more general class of generalized linear models. This class of mod-els is a generalization of the well-known linear regression modmod-els and allows for modmod-els that follow a non-normal distribution. Examples of often used distributions are the Poisson, gamma and bino-mial distributions. The case of standard linear regression is covered when the normal distribution is used. What all these distributions have in common is that they belong to the exponential family of distributions. These distributions have density function (Azzalini, 1996)

f(y)= exp w

ψyθ − b(θ) + c(y, ψ)

!

, (14)

where θ and ψ are scalar parameters, w is a known constant and b(·), c(·) are knowns functions that de-termine the specific distributions. It can be shown that all of the above distributions fit this description. To introduce a regression structure in this context, we need to define a set of independent variables

to relate to the dependent y above. Let xi denote the independent variables observed for person i. In

a linear regression context we would now specify yi = x0_iβ + i, where i ∼ N(0, σ2), or similarly

Yi∼ (µi, σ2) and µi = x0_iβ. This would formulate the linear relation between yiand xicompletely. We

can generalize this to the exponential family of distributions by assuming the following:

Yi∼ EF(b(θi),

ψ

w), g(µi)= ηi, ηi = x

0 iβ.

That is, we assume that Yiis a member of the exponential family of distribution. Furthermore, there

exists a function g(·), called a link function, that relates the mean value µito the independent variables

xi. Taking this function to be the identity function and assuming a normal distribution for Yi brings

(15)

The expectation of Yiis given by E[Yi]= b0(θi) and its variance is given as var[Yi] = b00(θi)_wψ

(Azza-lini, 1996).

The previous paragraphs were concerned with the general theory of generalized linear models. Now we can use this information to formulate the logit model. We have already seen that the dependent

variable yi only takes on the values 0 and 1. If we generalize this to all n observations and thus

ag-gregate over i, we recognize in this situation the binomial distribution. The density of the binomial distribution in the form (14) can be derived (Faraway, 2006) as follows:

f(y|θ, ψ)= n

y !

µy_{(1 − µ)}n−y

= exp(y log µ + (n − y) log(1 − µ) + log n

y ! ) = exp(y log µ 1 − µ ! + n log(1 − µ) + log n y ! )

We thus see that θ = log µ

1 − µ !

, b(θ) = −n log(1 − µ) = n log(1 + expθ), c(y, ψ) = logn_yand that

w = ψ = 1. The link function can be seen to equal g(µ) = log µ

1 − µ !

, which is the logit function that gives the name to the model. The expectation and the variance can be derived easily from these

principles and are seen to equal E[Y] = exp

θ

1+ exp θ and var[Y] = µ(1 − µ). We can now relate the

independent variables to the dependent variable using a regression structure in the following way:

log µ

1 − µ !

= β0+ β1x1+ β2x2+ . . . βkxk+ = X0β + , (15)

(16)

D

Tree model

Regression trees (Breiman et al., 1984) are a variant of more general non-parametric methods (Wasser-man, 2006) that use binary splits on the independent variables to estimate the model. The resulting splits can be represented in a tree form, from which the model class derives its name. For an example of a simple tree, see figure 1 for an example.

| x2< 0.001342

x4>=0.9323

0.4211 0.7835

0.8227

Figure 1: Example of a tree, based on simulated data. First, a split is made on the variable x2, which results in two groups. Next, at one node of the tree another split is made on the variable x4. This results in a tree with three end nodes (groups of customers), having churn rates 0.4211, 0.7835 and 0.8227 respectively.

We give here a simple exposition of how tree methods work, based on the work of (Wasserman, 2006). However, for a more thorough review of tree methods we refer to (Breiman et al., 1984), who provide more detail and derivations of all the results given here. A tree model is a form of non-parametric regression in which the goal is to find an unknown function r(·), based on independent variables x. We can write the problem as

r(x)=

K

X

k=1

ckI { x ∈ Rk}, (16)

where c1, . . . , ck are unknown constants and R1, . . . , Rk are disjoint rectangles that form a partition

of the space of the covariates. Let x = (x1, . . . , xj, . . . , xd) denote a generic covariate value. For the

ith observation this boils down to x = (xi1, . . . , xi j, . . . , xid). Given a covariate j and split point s

we can now define two rectangles based on the covariate. Let R1 = R1( j, s) =

n x: xj≤ s o and let R2 = R2( j, s) = n x: xj> s o

and xj refers to the jth covariate. Now define c1to be the average of yi

(17)

These values c1 and c2 are chosen in such a way that they minimize the residual sum of squares

P

xi∈R1(yi − c1)

2 _andP

xi∈R2(yi− c2)

2_{. Based on these sum of squares, the algorithm also selects the}

split point s and the variable xjsuch that this sum of squares is minimized. At each node of the tree,

this bivariate problem is solved and in this way the tree is constructed.

In principal, one can continue splitting the variables until each end node consists of one observation. Oftentimes however one grows a large tree and then prunes the tree back to end up with a subtree of the large tree that sufficiently summarises the data. To determine the size of the tree to select, we need

to determine a tuning parameter α. To that end, let Nkdenote the number of points in rectangle Rkof

subtree T and define

ck= 1 Nk X xi∈Rk yi; (17) Qk(T )= 1 Nk X xi∈Rk (yi− ck)2 (18)

Now we define a measure of fit for the subtree T, called the complexity, as follows:

Cα =

|T |

X

k=1

NkQk(T )+ α|T|, (19)

with α > 0 and |T | the number of end nodes in subtree T. If we let Tα be the smallest subtree that

minimizes Cα, then we only need to select ˆα such that this criterion is satisfied. This value of α can

(18)

Figure 2: Plot of the parameters for four variables over time. The variables are gender, education, married and urban living.

(19)

Figure 3: Plot of the parameters for four variables and two clusters over time. The variables are gender, education, married and urban living.

(20)

Table 1: Top decile lift and Gini coefficients for all five models. Forecasts are made for one, two and three periods ahead.

2004→2005 Logit model Tree model Boosted Tree model State Space Model Mixture State Space Model

Top Decile Lift 2.9059 3.6984 3.4342 2.9058 5.2834

Gini coefficient 0.3745 0.4319 0.5022 0.2297 0.6894

Top Decile Lift 2.9806 3.0358 2.4838 1.6973 1.0487

Top Decile Lift 0.0000 0.0000 0.0000 0.0000 1.2275

Top Decile Lift 2.9806 2.2355 2.0699 2.3180 2.8702

Top Decile Lift 0.0000 0.0000 1.1534 2.3068 1.1534

Top Decile Lift 0.0000 1.1894 1.1534 1.3593 2.3068

(21)

Figure 4: Average top decile lifts for each model per period ● ● ● ● 0 1 2 3 4 5 6 Period T op Decile Lift t t+1 t+2 t+3 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Logit Model Tree Model

Boosted Tree Model State Space Model

(22)

Figure 5: Average Gini coefficients for each model per period ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 Period Gini Coefficient t t+1 t+2 t+3 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Logit Model Tree Model

Boosted Tree Model State Space Model

(23)

References

Athanassopoulos, A.D. (2000). Customer satisfaction cues to support market segmentation and ex-plain switching behavior. Journal of Business Research 47, 191–207.

Azzalini, A. (1996). Statistical Inference Based on the likelihood. Boca Raton: Chapman & Hall/CRC.

Bolton, R. N., K.N. Lemon, and P.C. Verhoef (2004). The Theoretical Underpinnings of Customer Asset Management: A Framework and Propositions for Future Research. Journal of the Academy of Marketing Science 3(32), 271–292.

Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone (1984). Classification and Regression Trees. Belmont: Wadsworth International Group.

Calabrese, A and L. Paninski (2011). Kalman Filter Mixture Model for Spike Sorting of

Non-Stationary Data. Journal of Neuroscience Methods 196(1), 159–169.

Colgate, M.R. and P.J. Danaher (2000). Implementing a Customer Relationship Strategy: The Asym-metric Impact of Poor versus Excellent Execution. Journal of the Academy of Marketing Science 28, 375–387.

Coussement, K. and D. Van den Poel (2008). Churn Prediction in Subscription Services: An Appli-cation of Support Vector Machines while Comparing Two Parameter Selection Techniques. Expert Systems with Applications 34, 313327.

Dijksterhuis, M. and S. Velders (2009). Het voorspellen van switchgedrag in een markt met een lage mobiliteit: Een casestudy, Chapter 10, pp. 167–180. Ontwikkelingen in het Marktonderzoek. Spaar en Hout.

Donkers, B., P.H. Fransen, and P.C. Verhoef (2003). Selective Sampling for Binary Choice Models. Journal of Marketing Research XL, 492–497.

Douven, R., E. Mot, and M. Pomp (2007). Health Care Reform in the Netherlands. CPB: Netherlands Bureau for Economic Policy Analysis.

Durbin, J. and S.J. Koopman (2001). Time Series Analysis by State Space Methods. New York: Oxford University Press.

Fahrmeir, L. and G. Tutz (1994). Multivariate Statistical Modelling Based on Generalized Linear Models. New York: Springer-Verlag.

Fahrmeir, L. and S. Wagenpfeil (1997). Penalized likelihood estimation and iterative Kalman smooth-ing for non-Gaussian dynamic regression models.

Faraway, Julian J. (2006). Extending the Linear Model with R: Generalized Linear, Mixed Effects and

Nonparametric Regression Models. Boca Raton: Chapman & Hall/CRC.

Gupta, S., D. Hanssens, B. Hardie, W. Kahn, V. Kumar, N. Lin, and N.R.S. Sriram (2006). Modeling Customer Lifetime Value. Journal of Service Research 9(2), 139–155.

Hadden, J., A. Tiwari, R. Roy, and D. Ruta (2005). Computer Assisted Customer Churn Management:

(24)

Hair Jr., J.F., W.C. Black, B.J. Babin, and R.E. Anderson (2010). Multivariate Data Analysis. Upper Saddle River: Pearson Education Inc.

Leeflang, P.S.H., T.H.A. Bijmolt, J. Van Doorn, D.M. Hanssens, H.J. Van Heerde, P.C Verhoef, and J.E. Wieringa (2009). Creating Lift versus Building The Base: Current Trends in Marketing Dy-namics. International Journal of Research in Marketing 26, 13–20.

Lemmens, A. and C. Croux (2006). Bagging and Boosting Classification Trees to Predict Churn. Journal of Marketing Research 18, 276–286.

Lemon, K.N., T.B. White, and R.S. Winer (2002). Dynamic Customer Relationship Management: In-corporating Future Considerations into the Service Retention Decision. Journal of Marketing 66(1), 1–14.

Lu, J. (2002). Predicting Customer Churn in the Telecommunications Industry: An Application of Survival Analysis Modeling Using SAS.

Malthouse, E C. and K.M. Derenthal (2008). Improving Predictive Scoring Models through Model Aggregation. Journal of Interactive Marketing 3(22), 51–68.

McLachlan, G. and D. Peel (2001). Finite Mixture Models. JohnWiley & Sons.

Neslin, A., S. Gupta, W. Kamakura, J. Lu, and C.H. Mason (2006). Defection Detection. Journal of Marketing Research XLIII, 204–211.

Risselada, H., P.C. Verhoef, and T.H.A. Bijmolt (2010). Staying Power of Churn Prediction Models. Journal of Interactive Marketing 3(24), 198–208.

Rust, R.T., K.N. Lemon, and V.A. Zeithaml (2004). Return on Marketing: Using Customer Equity to Focus Marketing Strategy. Journal of Marketing 68, 109–127.

Schwarz, G.E. (1978). Estimating the Dimension of a Model. Annals of Statistics 6(2), 461–464. Smith, A. and E. Brown (2003). Estimating a State-Space Model from Point Process Observations.

Neural Computation(15), 965–991.

Verhoef, P.C., J. van Doorn, and M. Dorotic (2007). Customer Value Management: An Overview and Research Agenda. Journal of Research in Marketing (2), 51–58.

Vermunt, J. K. and J. Magidson (2002). Latent class cluster analysis. In J. A. Hagenaars and A. L. McCutcheon (Eds.), Applied latent class models, pp. 89–106. Cambridge: Cambridge University Press.

Wasserman, L. (2006). All of Nonparametric Statistics. New York: Springer.

West, M. and J. Harrison (1989). Bayesian Forecasting and Dynamic Models. Springer-Verlag. Wu, W., M.J. Black, D. Mumford, Y. Gao, E. Bienenstock, and J.P. Donoghue (2004). Modeling