
Master’s Thesis Econometrics

Churn prediction: A Comparison of Static and Dynamic Models

N. Holtrop September 20, 2010

Author: Niels Holtrop Student number: s1545590


Co-assessor: dr. C. Praagman


Churn prediction: A Comparison of Static and Dynamic Models

Master’s Thesis Econometrics

N. Holtrop

Abstract

In this thesis we compare the logit and tree models that are often used in churn prediction against two models that take a dynamic approach to churn prediction: a model based on a generalized Kalman filter with time-varying parameters and an Accelerated Failure Time (AFT) model used in survival analysis. We use a dataset provided by a large Dutch health care insurer to calibrate the models, which we discuss first. Next, an extensive theoretical background for all models used is provided. Model forecast performance is judged using top-decile lift and the Gini coefficient, besides some other, more specific measures. Our findings suggest that both dynamic models can provide a robust alternative to logit and tree models when changes in the churn rate can be expected over time. However, when the churn rate stays more or less constant over longer periods of time, logit and tree models seem to perform better than their dynamic counterparts.

Keywords: Churn prediction, Kalman filter, Accelerated Failure Time model, Logit model, Tree model, Top-decile lift, Gini coefficient


Preface

The thesis that lies in front of you is the closing product of my Master Econometrics, Operations Research and Actuarial Sciences. To be precise, it is written as a thesis for the specialization Econometrics, but it is also seasoned with some Marketing elements. While for many students a thesis such as this one marks the end of their Groningen period, I will stay in Groningen next year and continue with another Master here.

Still, this thesis marks the end of the Econometrics program in Groningen and thus this is a good time and place to thank a few people.

First of all, I would like to thank Jaap Wieringa for supervising me during the writing of this thesis. His comments and suggestions have always been helpful in improving the end product. Besides that, his enthusiasm for the combination of marketing and econometrics is what got me interested in the subject of this thesis in the first place. Furthermore, I would like to thank Kees Praagman, who read and commented on my thesis as second supervisor and with whom I have had some fruitful discussions to improve this thesis.

During the writing of this thesis the assistance provided by Remko Amelink and Henny Holtman on R and Latex issues has been helpful as well, for which I would like to thank them.

Furthermore, I would like to thank everyone who made studying in Groningen such a pleasure during the past five years, especially all the friends and fellow students who were always there to enrich the study experience. Without this diverse group of persons it would have been but a dull experience. Finally, I would like to thank my parents for supporting me during all these years and for giving me the opportunity to study here in the first place. Their support was an important factor in the successful completion of the study Econometrics here in Groningen.

Niels Holtrop September 20, 2010


Contents

1 Introduction
1.1 Churn management
1.2 Econometric models for churn prediction
1.3 Research question
1.4 Research outline
1.5 Thesis outline

2 Data description and summary statistics
2.1 Data overview and summary statistics
2.2 Data in duration format
2.2.1 Time varying variables and episode splitting
2.3 Summary

3 Model descriptions
3.1 Logit model
3.2 Tree model
3.2.1 An example of tree methods
3.2.2 Selection of a splitting method
3.2.3 Termination criteria and tree pruning
3.2.4 A Stochastic Gradient Boosting algorithm for tree classifiers
3.3 Kalman filter model
3.3.1 The linear dynamic model
3.3.2 The generalized linear dynamic model
3.3.3 An EM-algorithm to estimate hyperparameters
3.4 Survival model
3.4.1 General likelihood for models with right censoring and left truncation
3.4.2 The AFT model for durations
3.4.3 Unobserved heterogeneity
3.5 Summary

4 Model setup and estimation
4.1 Logit model
4.2 Tree model
4.3 Kalman filter model
4.4 Survival model
4.5 Summary

5 Classification and forecast results
5.1 Model assessment criteria
5.2 Model classification results
5.3 Model forecast results
5.4 Summary

6 Conclusions and recommendations
6.1 Thesis summary and conclusions
6.2 Recommendations for future research

A Overview of different variables
B Absolute classification and forecast numbers


1 Introduction

We start out this thesis with a discussion on a problem that many firms face: that of customers leaving the firm and taking their business elsewhere, an event also known as churn. We discuss some options that firms have when dealing with churn, one of them being trying to retain the customers. Several options for retention are available; we will focus on proactive and targeted retention using a probabilistic approach.

We then discuss several methods that are used for this probabilistic approach, which will lead to the main question to be answered in this thesis. Next, we will discuss the outline of the research performed and end with an outline of the remaining chapters.

1.1 Churn management

Many firms nowadays face the problem of customer churn, which can be defined as ‘the voluntary action taken by customers of leaving a company and taking their business elsewhere or terminating a certain service altogether’. This behaviour arises in many markets, such as those of telephone operators, insurance firms, Internet Service Providers, utilities and banking firms. Churn can be due to both internal and external factors, from the firm’s viewpoint. While external factors are often outside the area of influence of a firm, internal factors such as complaints, service quality and price can be influenced and are often indicative of customer churn (Dijksterhuis and Velders, 2009). When faced with churning behaviour of customers, firms can decide to let the customers go and forego the income they generate, or pursue a strategy aimed at retaining these customers. Several studies have shown that retaining customers can be a low-cost alternative to attracting new customers (Athanassopoulos, 2000; Colgate and Danaher, 2000). Moreover, good retention programs can increase the satisfaction of customers as well, leading to long term benefits for a firm (Colgate and Danaher, 2000). The commitment of enough resources is critical to the success of such a program, a factor that should always be taken into account. Another reason for retention could be that the customer base of a firm has reached its peak (Hadden, Tiwari, Roy, and Ruta, 2005) and retention is the only way to profitably deal with customers, as acquiring new customers is too expensive or new customers are too scarce to find.

Once a company has decided to pursue a strategy of retention, it faces the question of how to implement this strategy. Broadly speaking, two options are again available: an untargeted approach and a targeted approach (Neslin, Gupta, Kamakura, Lu, and Mason, 2006). The untargeted approach uses for example mass advertising to convince customers to stay with the firm. This approach can be effective if customers are satisfied with the product offered by the firm, but it may be costly and ineffective if these conditions are not optimal. The targeted approach can be either reactive or proactive. Reactive targeting takes place when a customer contacts the firm to end his dealings and the firm offers an incentive, for example a discount or rebate, to convince the customer to stay with the firm. The proactive targeting method on the other hand tries to identify those customers in the customer base that are prone to churn and approaches them with an incentive before the churn event happens. Both targeted methods generally require detailed knowledge of individual customers to determine who should be approached.


In this thesis we specifically focus on targeted, proactive retention. The reason for this is that this method is often less costly for a firm (a customer often demands a higher incentive once he has already taken the step of informing the firm that he will end his dealings with it) and that it gives the firm control over whom to target and whom not, as opposed to the reactive method, where it is the customer who makes that decision.

To determine which customers to target and which not, a criterion is required. One such criterion could for example be the Customer Lifetime Value (CLV), which reflects the value a customer has to a firm over the period of its relationship with the firm (Gupta, Hanssens, Hardie, Kahn, Kumar, Lin, and Sriram, 2006).

The focus of a retention program is then usually on the customers with the highest CLV. Another approach is probabilistic in nature, where a churn probability is computed for each customer and those customers with the highest churn probability are the target of the retention program. Both methods lead to the identification of (a) group(s) of customers in the customer base which will be targeted. Often, only the top 10% most important customers (highest value or probability) are selected to reduce costs. Additional analysis, such as clustering, can be done if the groups are too numerous or to better profile the groups. Using this information, a retention program can be developed that is aimed specifically at the customers most important to retain or most likely to churn, which often involves adapting the marketing mix to suit the target group(s) more specifically.

The subject of this thesis will only be the probabilistic approach to churn modeling, as this is probably the most used approach in practice. The use of econometric models has led to an extensive literature and a wide array of possible models to obtain the required probabilities for this method. The next section will provide an overview of the possible models.

1.2 Econometric models for churn prediction

In the econometrics literature, many models are available to tackle the problem of classification and prediction in the presence of binary observations (churn and non-churn) by computing probabilities for each observation. The most well-known methods are probably logit (Azzalini, 1996) and tree models (Breiman, Friedman, Olshen, and Stone, 1984). These models have also found their way into daily practice and are encountered often when firms deal with the problem of churn management. Some other approaches to the problem are for example survival analysis (Lu, 2002; Helsen and Schmittlein, 1993), discriminant analysis (Hair Jr., Black, Babin, and Anderson, 2010), neural networks (Faraway, 2006) and support vector machines (Coussement and Van den Poel, 2008). This wide range of available methods has led to the question which one does the best job of predicting churn. Neslin, Gupta, Kamakura, Lu, and Mason (2006) tried to answer this question by organizing a competition with the goal of comparing as many models as possible. Contestants were provided with the same dataset and challenged to produce the best forecast results they could, using all tools at their disposal. What this study showed was that, out of all entered model types, both the logit and tree models performed better on average at predicting churn than the other models entered.

Besides the good performance of logit and tree models, it was also noted that ‘method matters’, i.e. the choice of model determines the effectiveness of the outcome and can lead to managerially meaningful differences. Managerially meaningful differences in this case means that the number of customers identified by each model, compared to the actual behaviour of customers, can lead to large differences in the resources and costs that are required for effective retention of customers. It is therefore of interest to know which models perform well at prediction and which models do not. This led to the remark that ‘practitioners should continue to search for better models’. Neslin et al. (2006) suggested among other alternatives a dynamic approach to churn prediction as in Lu (2002). All of the models discussed in Neslin et al. (2006), most importantly the logit and tree models, are static in nature, that is, they do not take into account changes in variables over time and require re-estimation if changes occur. The suggestion of a dynamic approach is in line with a study done by Leeflang, Bijmolt, Van Doorn, Hanssens, Van Heerde, Verhoef, and Wieringa (2009), who provide an overview of current trends in the dynamics of marketing.


They remark that models should use appropriate metrics, disentangle temporary from persistent effects, and allow for time-varying parameters and cross-sectional heterogeneity. The importance of a dynamic approach for churn prediction becomes clear from this, but outside the work of Lu (2002) not much is known yet about this subject. This will therefore play an important role in our research question formulated in the next section.

1.3 Research question

In our research we will focus on two aspects mentioned by Leeflang et al. (2009), namely that of time-varying parameters and that of cross-sectional heterogeneity. A time-varying parameter approach that also allows for cross-sectional heterogeneity suggested in the paper is one using Kalman filtering techniques.

Fahrmeir and Tutz (1994) provide a good overview of these techniques and we will use one of the models suggested there. The other model is that of Lu (2002), which was already suggested by Neslin et al. (2006).

It only allows for cross-sectional heterogeneity, but it is worth looking into as survival analysis models are more readily available in most software packages, which could be of practical importance. We also feel that the work of Lu (2002) could be expanded, as it does not provide much detail on the underlying methods.

We will refer to both these models collectively as the dynamic models.

As we are interested in the merits that dynamic models offer over other models, we will seek to compare them to current methods. We already mentioned the widespread use and good performance of logit and tree models, both of which are static in nature. Therefore, it makes sense to compare the dynamic models to both of these models. This leads directly to our main question, which is: how do dynamic models perform compared to static models when it comes to modeling and predicting customer churn? Hence, our main question focuses on comparing the dynamic models to currently used static models in terms of prediction accuracy. The measures to do this will be discussed in the next section, which deals with the outline of the research.

1.4 Research outline

Before we are able to answer the main question of this thesis, we have to work through several steps. These steps are illustrated in figure 1.1. We start out with the data, as these influence both the model structure as well as the estimation of the models. After validating the models, we are in a position to produce forecasts and finally evaluate the forecasts, after which we can return to the main question of how dynamic models perform relative to static models.

The dataset we will use was provided by a health care insurer and was previously used in Dijksterhuis and Velders (2009). The dataset spans the years 2004-2007 and contains yearly observations on 10,000 customers. The presence of multiple years will allow us to make forecasts up to three years ahead. To use the combination of this dataset with the desired dynamic model specifications, we need to both adapt the dataset at points and choose our models carefully. This will be explained in chapters 2 and 3. An additional interesting aspect of this dataset is the fact that during this period a health care system reform took place, leading to a strong increase in the churn rate in 2006. Details on the reform can be found in Douven, Mot, and Pomp (2007a) and Douven, Mot, and Pomp (2007b). This provides an extra challenge for the model specifications, but hopefully also tells us something about the robustness of the models.

After we have selected suitable model specifications to match our dataset, we can estimate and validate the models. We want each model to fit the data as well as possible, but as each model is unique in its own way, validation differs from model to model. This might lead to, for example, different variables being included in each model, different post-hoc tests for each model and a different goodness of fit for each model. This is not a problem for our comparison, however, as the models are ultimately judged on the same forecast metrics.


Figure 1.1: Modeling flow chart

Once the models are estimated, obtaining forecasts is easily done. It is determining the quality of the forecasts that is more difficult, but also the most important step to be able to answer our research question. We will use the same approach as Neslin et al. (2006). They based their comparison of models on measures called top decile lift and the Gini coefficient. These measures are used in other studies as well (Lemmens and Croux, 2006; Burez and Van den Poel, 2009) and are the most important measures used to judge the quality of churn models. Top decile lift compares the churn rate among the top 10% of predicted churners to the total churn rate, which reflects the practice of only targeting the group most at risk of churning described in section 1.1. The Gini coefficient is a measure of overall model performance and takes into account the identification of both churners and non-churners, making it a complementary measure to top decile lift, which only focuses on identified churners. Models that score high on one or both of these measures can be considered to perform better than other models. Besides these main metrics, we will also use some additional measures derived from Burez and Van den Poel (2009) and Medema, Koning, and Lensink (2009). These measures are aimed at identifying the strengths and weaknesses of each model and should help us determine why the models function as they do, and they provide more detailed and specific evaluation possibilities than our main metrics alone would.
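Both main metrics can be computed directly from predicted churn probabilities and observed churn indicators. The following is a minimal sketch in R (the software used in this thesis); the function names are our own and the Gini coefficient is computed here as 2·AUC − 1, which is one common definition of the measure.

# Sketch: top-decile lift and Gini coefficient from predicted churn
# probabilities (p_hat) and observed 0/1 churn indicators (churn).
top_decile_lift <- function(p_hat, churn) {
  n_top <- ceiling(0.10 * length(p_hat))           # size of the top 10%
  top <- order(p_hat, decreasing = TRUE)[1:n_top]  # customers with the highest predicted risk
  mean(churn[top]) / mean(churn)                   # churn rate in top decile vs overall churn rate
}

gini_coefficient <- function(p_hat, churn) {
  # Gini = 2 * AUC - 1, with AUC computed via the rank (Mann-Whitney) formula
  r  <- rank(p_hat)
  n1 <- sum(churn == 1); n0 <- sum(churn == 0)
  auc <- (sum(r[churn == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  2 * auc - 1
}

# Small example with simulated data
set.seed(1)
p_hat <- runif(1000)
churn <- rbinom(1000, 1, p_hat)
top_decile_lift(p_hat, churn)
gini_coefficient(p_hat, churn)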

When we have obtained all the data on these metrics, we are able to answer the question how dynamic models perform when compared to static models. By taking the same approach as in Neslin et al. (2006), we expect that our results are comparable to known results and can serve as an extension to existing literature on this subject.


1.5 Thesis outline

As the structure of the dataset was seen to be an important cornerstone of the research process in the previous section, we will start out with a discussion of the data in chapter 2. We will show that the dataset is suitable for estimating logit, tree and Kalman filter models as well as survival models. Summary statistics will be provided and some exploratory estimates for survival analysis are presented as well. In chapter 3 the theoretical background for the models will be presented. The discussion of logit and tree models in sections 3.1 and 3.2 is less deep than that of the Kalman filter and survival models in sections 3.3 and 3.4, but for all models the necessary theory is discussed. This chapter is the most technical chapter of the thesis and a solid background in mathematics and statistics is recommended. Less mathematically inclined readers can refer to the summary in section 3.5 to get the gist of the chapter and continue reading the next chapter.

Combining the previous two chapters, chapter 4 will discuss the estimation of the models. We will show how the models can be estimated using the dataset and will provide the results of the estimation procedures. We also focus on model validation in this chapter to assess the quality of the models. Chapter 5 will start with a discussion of the metrics used to determine the classification and forecast quality of the models. Subsequently, we will provide the results for each model and discuss them. Based on these results, we will draw our conclusions in chapter 6 and provide recommendations where necessary.


2 Data description and summary statistics

In this chapter we will discuss the dataset we have available for our analysis. The reason for starting out with an overview of the dataset is the special structure of the data we have available. Knowledge of this structure is useful when we develop the models presented in the next chapter, as some issues that will be discussed are driven by data characteristics. We will start this chapter with an overview of the variables available in the data. We will present some summary statistics and compare the values of some variables to those of the total market or population. Next, we will discuss a transformation of the dataset into duration format, which will be necessary to estimate survival models. This also leads to some additional preliminary summary statistics, which will be discussed as well.

2.1 Data overview and summary statistics

The dataset we have available was provided by a large Dutch health care insurer. The data span the years 2004-2007 and consist of data on 10,000 customers. There are a total of 173 variables available in the dataset. Some of these variables correspond to a certain year, others are fixed over the years. That some data vary over the years will be the core of our research, as we are interested in whether taking changes into account will lead to better estimation results as compared to only estimating static models for each year.

We will not use all the variables due to missing values in some of the variables. We restrict ourselves to using 44 variables, where some of the variables are allowed to vary over time. An overview of these variables is presented in table A.1 in the appendix, alongside a description and the scale of the variables. The most important variable is Uitstroom, representing the churn in each year. It is coded in the usual way as 0/1 and it is this fact that determines the model approach to be followed. Models that can handle binary coded data are the previously mentioned logit, tree and Kalman filter models, with only the survival model taking a different approach, as we will see later in this chapter. We will refer to all variables excluding this one as the set of independent variables or covariates; Uitstroom can then be viewed as the dependent variable.

To get an idea of the structure of the data we present some summary statistics in table 2.1. If we look at the sample size, we find that some observations have been deleted before 2004. These were related to missing values in the independent variables or to persons who churned twice (which would lead to problems when considering their duration in the sample) and were therefore deleted. As we deem the number of remaining observations to be sufficient, we stick with this solution to the missing value problem. That the number of observations decreases after 2004 is mainly due to churn and to a far lesser extent to missing values (usually 10-30 observations were deleted due to missing values).

If we look at the churn rates, we find that there is a sharp rise in the churn rate in 2006 due to the system change in the Dutch health care insurance discussed in the previous chapter. If we compare the churn rates with the national averages reported by Laske-Alderhof and Schut (2005), we find that they are higher than average.


Variable 2004 2005 2006 2007

Sample size 8674 7291 6693 5432

% churn 8.3 7.16 32.93 4.01

% male 46 44 44 53

Average age (years) 47.59 48.39 48.73 48.44

Average relation duration (years) 15.35 15.83 16.01 15.60

% Social insured 35.54 37.29 38.59 42.60

% Moved 23.46 24.40 24.14 27.06

% Single living 40.92 40.74 40.58 43.37

% Family with kids 11.16 11.37 11.22 12.68

% Family with no kids 27.27 27.64 28.27 25.55

% Divorced 3.95 4.18 4.43 4.71

% Married 3.97 4.29 4.35 4.01

Average phone contacts 0.67 0.74 0.79 0.95

% Digital portal users 3.7 4.06 4.44 5.69

% Automatic Bill Payment users 57.06 61.88 66.22 84.22

Average revoked payments 0.40 0.38 0.41 0.56

% Moroccan 4.83 4.73 5.01 6.70

% Surinam 1.87 1.80 1.73 2.38

% Turkish 3.45 3.61 3.80 4.25

Average number of complaints 0.0068 0.0075 0.0081 0.0057

Average number of store visits 0.21 0.26 0.22 0.15

Average health costs 2718 1791 1603 1464

Table 2.1: Summary statistics for the dataset

There are some other interesting aspects to the data. According to both Laske-Alderhof and Schut (2005) and Verbond van Verzekeraars (2006), the percentage of socially insured persons prior to 2006 (when this distinction was dropped) was about 66%, while our data show the opposite, with slightly more than 33% being socially insured. We also notice that the ethnicity percentages are slightly different when compared to the Dutch population, which according to CBS (2010) were 5.27% for Moroccan descent, 1.92% for Surinamese descent and 3.72% for Turkish descent in 2004. Furthermore, we notice that a large part of the dataset consists of single persons and that families form a much smaller part of the customer base. The number of registered complaints is extremely low and services like the digital consumer portal and the insurance store are only used by a minority of the customers present in the sample. Use of the more traditional automatic bill payment is remarkably higher, with more than half of the customers using this feature.

The way the summary statistics are presented also gives us a preliminary view of the effects of certain variables over time. We find that belonging to one of the ethnic groups present, using automatic bill payment or the digital consumer portal and being socially insured all seem to reduce churn. This can be inferred from the increasing percentages over time, implying that persons with these characteristics make up a larger part of the dataset as people start to leave the company. A final interesting note is that the health related costs drop drastically over time. This could indicate that people with high costs decide to go elsewhere or that fewer costs are covered by the insurance.


Figure 2.1: Graphical representation of the sample size

A final issue that we will discuss is the splitting of the data. The data were split in two parts for each year. This situation is depicted in figure 2.1. This figure also clearly shows the decrease in data points over time. The models will be estimated using the training sample, which contains approximately 70% of the data. Validation and prediction of the models will be done with the holdout sample, which contains the remaining data points. That we have approximately 70% of the data in the training sample is caused by a rebalancing of these samples. When we take a random sample, we would expect the churn rates in the sample and the population to be equal. However, in practice the churn rate in a sample differs slightly due to sampling error, usually leading to higher churn rates in the sample. This is also the case in this thesis. Burez and Van den Poel (2009) show that undersampling the common cases in the training sample (i.e. non-churners in this case) leads to better classification results for models estimated on the data. The undersampling corrects the training sample churn rate downwards, matching it with the population churn rate again. Other, more advanced methods discussed in that paper did not show better results compared to the simpler undersampling method.
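A minimal sketch of how such an undersampling step could look in R is given below; the column name Uitstroom follows the data description, while the function name and the target rate are our own illustrations.

# Sketch: undersample non-churners so that the training churn rate matches a
# target rate (e.g. the population churn rate). 'train' is assumed to be a
# data.frame with a 0/1 column Uitstroom.
undersample <- function(train, target_rate) {
  churners    <- train[train$Uitstroom == 1, ]
  nonchurners <- train[train$Uitstroom == 0, ]
  # number of non-churners needed so that churners / total = target_rate
  n_keep <- round(nrow(churners) * (1 - target_rate) / target_rate)
  keep <- sample(seq_len(nrow(nonchurners)), size = min(n_keep, nrow(nonchurners)))
  rbind(churners, nonchurners[keep, ])
}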

This short overview of the data should be helpful in getting a basic idea about the structure of the data.

The structural break in the churn rate in 2006 should prove especially challenging to model, as such events are rare and models are expected to perform worse when large deviations from a trend take place. In the next section we will focus on another special feature of the data, namely the possibility to put the data in the so-called duration format.


2.2 Data in duration format

In the previous section we focused on the binary coded variable Uitstroom. The models for this type of dependent variable are well-known and commonly used. We will, however, also consider the data in the so-called duration format, i.e. we will try to model the time (duration) that individuals spent as a customer of the insurer. This approach is taken in what is known as survival analysis, which emerged in, among other fields, the medical sciences, where for example the time until death of a patient was modeled. Survival analysis is possible due to the presence of the variable Relatieduur, which indicates the number of years a customer has spent with the insurer prior to the four-year observation period. The relation time was measured in December 2003, so it measures all the years prior to the observation period (a customer can only switch at the start of a year).

The period before 2004 can be called left truncated. Left truncation in duration analysis is, according to Jenkins (2005), the situation where we know the spell start date (the time at which a person joined the insurer in this case), but start observing the subjects under study at a later time. In the time between the start date and the start of the observation period subjects are at risk of experiencing the event (churn), which leads to an implicit selection in our sample: only persons that survived (stayed with the insurance company) until 2004 are included in our sample. To correct for this bias, left truncation has to be taken into account when estimating a model. However, due to the episode splitting procedure to be explained in the next section, this issue is of lesser importance for our results. The episode splitting also introduces a left truncation structure into our dataset, which is the methodologically relevant left truncation. What we can learn from the discussion here is that we should take the durations before the observation period into account as well, which is easily done in combination with the episode splitting to be discussed later on.

To circumvent the problem of prior durations, many authors, such as Lu (2002) and Gönül, Kim, and Shi (2000), only observe new customers, who have no prior durations.

In the observation period 2004-2007 two things can happen: a customer can churn and leave the company during one of the years, or he can stay for the entire duration of the observation period. If a customer churns, we say he has experienced an event and his duration is set to Relatieduur + T_i, where T_i is the number of years passed since 2003. So a customer churning in 2004 has, for example, T_i = 1. If a customer does not churn during the observation period, we say that this customer is censored (right censored in this case) and T_i = 4 for all these customers. Censoring in this case refers to the situation that we know a customer did not churn before (and hence survived until) 2008, but what happens after this period is unknown to us; the customer might churn immediately in 2008 or stay with the insurer for a long time.

Figure 2.2: Illustration of right censoring and left truncation for the insurance dataset (timelines for persons A, B and C over the years 2004-2007)

Figure 2.2 gives several examples of censoring and truncation for this dataset. Person A churns in 2006 and is only left truncated (indicated by the left vertical line). Person B is both left truncated and right censored, as his duration spans the entire observation period and ends after 2008 (indicated by the right vertical line). Person C joins the insurer in 2006 and stays with the insurer until after 2007, hence he is only right censored.


The dataset for the durations was split in a 75/25 fashion. The training sample has size 6185, while the holdout sample has size 1952. The relatively large training sample is due to missing values that led to more observations being deleted from the holdout sample, shifting the relative sample sizes. The censoring percentage of the total dataset is 56%, which is high. By using the undersampling approach used previously, we made sure that the training sample had this same censoring percentage.

We are of course interested in modeling the durations corresponding to the events to predict when an event will happen. The techniques for this have been developed in the branch of survival analysis. Klein and Moeschberger (2003) and Jenkins (2005) provide a fairly complete overview of the theory underlying survival analysis and most of the exposition in this and future chapters will be based on these references.

In survival analysis there are two basic quantities that are often the most interesting to investigate and model: the survival function, giving the survival probabilities, and the hazard function. The survival function S(·) gives probabilities of the form

S(t)= P(T > t) = 1 − F(t),

that is, the probability that the duration time T exceeds a value t, starting from time period 0. The survival function is the complement of the distribution function. The hazard function θ(·) is formally defined as

θ(t) = f(t) / S(t),

with f(t) the density function corresponding to F(t). It can be shown (Jenkins, 2005) that its interpretation is that of the instantaneous probability of experiencing an event in the interval (t, t + ∆t] with ∆t small. It is not a true probability, however, in the sense that it can be larger than 1 (it is, however, always non-negative).
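For later reference, the hazard and the survival function are linked through the integrated (cumulative) hazard; this standard relation (see Klein and Moeschberger, 2003) is not spelled out above but is what makes the integrated hazard estimate below comparable to the survival curve:

Λ(t) = ∫_0^t θ(u) du,    S(t) = exp{−Λ(t)}.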

Several prior estimates are available for the survival function and the hazard function based on the durations, none of which takes the presence of independent variables into account. These estimates can however help in determining the expected relations between the duration and the churn event. We will present the so-called Kaplan-Meier estimate of the survival function (Kaplan and Meier, 1958) and the Nelson-Aalen estimate of the (integrated) hazard rate (Klein and Moeschberger, 2003). The estimated functions are presented in figure 2.3, alongside 95% confidence bounds for the estimates.
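Such estimates can be obtained with the survival package in R. The sketch below is illustrative only: the vectors dur and event are assumed to be constructed from Relatieduur, T_i and Uitstroom as described above, and left truncation is ignored here for simplicity.

library(survival)

# Sketch: Kaplan-Meier curve and integrated hazard from total durations.
# dur   = Relatieduur + T_i (total duration in years)
# event = 1 if the customer churned, 0 if right censored
km <- survfit(Surv(dur, event) ~ 1)
plot(km, xlab = "Duration", ylab = "Estimated survival probability")               # Kaplan-Meier estimate
plot(km, fun = "cumhaz", xlab = "Duration", ylab = "Estimated integrated hazard")  # cumulative hazard scale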

If we look at the estimated survival function in the left panel of figure 2.3, we find that it is downward sloping, with a slope that decreases toward the end. We can interpret this curve as follows: persons with shorter durations (less than about 20 years) have a higher probability of experiencing an event (churn). Once a person has stayed with the insurer for a longer period, the probability of churn becomes almost 0. After about 8 years the churn probability has already dropped below the 0.5 mark, which could serve as the cut-off point for considering the group of more loyal customers. The hazard rate confirms this observation: in the first 20 years the slope is relatively steep, indicating high instantaneous churn probabilities. After 20 years the slope flattens out and decreases toward the end. Larger jumps in the slope of the hazard rate can be found at 20 and 35 years, indicating time periods that correspond to stronger decreases in the hazard rate.

In this section we have introduced some basic quantities of survival analysis and provided prior estimates for these quantities to serve as summary statistics. We have not yet introduced a relation between the observed durations and the independent variables, which we will do in the next chapter. In the next section, however, we will focus on the format our independent variables need to have to be able to analyze them using survival analysis techniques.


Figure 2.3: Plots of the estimated survival function (Kaplan-Meier, left panel) and the estimated integrated hazard rate (Nelson-Aalen, right panel), both plotted against duration in years


2.2.1 Time varying variables and episode splitting

The dataset consists of variables that are the same in all periods and variables that take on different values in each period. Table A.1 provides an overview of these variables. To be able to analyze the changes over time in some of the variables, the data cannot remain in this format. We will illustrate the procedure required to obtain the right format with an example, which can be found in table 2.2.

Before episode splitting

Person  Churn04  Churn05  Churn06  X1  X2_04  X2_05  X2_06
1       0        1        NA       32  5      4      NA
2       0        0        0        26  7      7      6

After episode splitting

Person  Churn  X1  X2  start  stop
1.1     0      32  5   0      1
1.2     1      32  4   1      2
2.1     0      26  7   0      1
2.2     0      26  7   1      2
2.3     0      26  6   2      3

Table 2.2: Illustration of episode splitting

The situation depicted is that of two persons. Person 1 churns in 2005, while person 2 stays in the dataset for the entire duration. Two variables are observed: Variable X1 is fixed across time, while variable X2 is allowed to change over time. The situation before episode splitting is given in the top part of the table. We have three variables corresponding to churn and three corresponding to X2, one for each year. However, to be able to use survival analysis techniques we require the data to be in a form where one row in the dataset corresponds to one observation on one period in time for one individual (the rationale behind this is that each row presents one fixed contribution to the likelihood, which is used to estimate parameters). As it is now, one row contains observations for three years if we include the time-varying variable X2. The trick we will use is to split each row such that for the duration given in the row, the time varying variables remain constant. This is called episode splitting. The situation after episode splitting is shown in the lower part of the table. The number of rows has increased due to the fact that we observe two time periods. In each period the values for X1 and X2 are given. Two new variables have been created, start and stop. These give the lengths of the interval on which the variables can be considered constant; in this case the interval length is 1. The churn variable has also been split accordingly.
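A minimal sketch of how the splitting in table 2.2 could be carried out in base R is given below. The column names mirror the small example (Churn04, X1, X2_04, ...) rather than the full dataset, and the loop structure is our own illustration of the procedure.

# Sketch: episode splitting of the wide example data into (start, stop] records.
wide <- data.frame(Person  = c(1, 2),
                   Churn04 = c(0, 0), Churn05 = c(1, 0), Churn06 = c(NA, 0),
                   X1      = c(32, 26),
                   X2_04   = c(5, 7), X2_05 = c(4, 7), X2_06 = c(NA, 6))

years <- c("04", "05", "06")
rows <- lapply(seq_len(nrow(wide)), function(i) {
  out <- NULL
  for (j in seq_along(years)) {
    churn <- wide[i, paste0("Churn", years[j])]
    if (is.na(churn)) break                      # person already left: no further records
    out <- rbind(out, data.frame(Person = wide$Person[i], Churn = churn,
                                 X1 = wide$X1[i], X2 = wide[i, paste0("X2_", years[j])],
                                 start = j - 1, stop = j))
    if (churn == 1) break                        # event observed: stop splitting this person
  }
  out
})
long <- do.call(rbind, rows)
long   # reproduces the lower part of table 2.2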

As Jenkins (2005) explains, the situation we have introduces a form of left truncation. The first record for each person (1.1 and 2.1) starts at the beginning of the observation period and is right censored at the duration where the next row begins. All the following records (1.2, 2.2 and 2.3 here) are left truncated from the duration where the previous row stopped and are also right censored. This means that if we use a model specification that allows for left truncation and right censoring in combination with the data in this format, we get correct results. The exposition of episode splitting given here is a simplified adaptation of Jenkins (2005), as our dataset has a structure that makes the steps to be taken easily understandable. We refer to that text for a more general exposition on episode splitting that can be used for a greater variety of problems.

For our dataset the observed period is four years, which would quadruple the size of the dataset. Luckily, if a person churns earlier no data is observed for the periods after the event and these records are not added to the dataset, leading to a smaller overall dataset. This is also seen in the example given above, where person 1 only has two records instead of three. The number of variables to episode split on is 16, which is quite a lot. After episode splitting, the training dataset has size 20995 and the holdout dataset has size 6007.


2.3 Summary

In this chapter we introduced the dataset which we will use to test the models. We showed that the dataset can be used in two ways: one way is based on the churn in each year and indicates churn as a 0/1 variable; the other focuses on the durations of the customers and introduces a separate indicator for the churn event.

Besides discussing the structure of the dataset we also presented some summary statistics to get a basic idea of the variables present in the dataset. For the data in duration format we also presented Kaplan-Meier estimates for the survival rate and Nelson-Aalen estimates for the integrated hazard rate.


3 Model descriptions

In this chapter we will give an in-depth discussion of the models that we will investigate later on. This chapter is aimed at providing the necessary theoretical basis to understand the structure of the models. We will connect the theory discussed here with the dataset we have available in a later chapter. All the work here is due to the authors referenced in the text. However, we did adjust the notation used by the respective authors to facilitate the reading of this chapter. We will start out with a discussion of the static models we will use, the logit model and tree classification models respectively. After that, we focus on the dynamic models, the Kalman filter model and the survival model, in that order.

3.1 Logit model

The logit model belongs to the more general class of generalized linear models. This class of models is a generalization of the well-known linear regression model and allows the dependent variable to follow a non-normal distribution. Examples of often used distributions are the Poisson, gamma and binomial distributions.

The case of standard linear regression is covered when the normal distribution is used. What all these distributions have in common is that they belong to the exponential family of distributions. Following Azzalini (1996), these distributions have density function

f(y) = exp( (w/ψ){yθ − b(θ)} + c(y, ψ) ),   (3.1)

where θ and ψ are scalar parameters, w is a known constant and b(·), c(·) are known functions that determine the specific distribution. It can be shown that all of the above distributions fit this description.

To introduce a regression structure in this context, we need to define a set of independent variables to relate to the dependent y above. Let x_i denote the independent variables observed for person i. In a linear regression context we would now specify y_i = x_i'β + ε_i, where ε_i ∼ N(0, σ²), or similarly Y_i ∼ (µ_i, σ²) and µ_i = x_i'β. This would formulate the linear relation between y_i and x_i completely. We can generalize this to the exponential family of distributions by assuming the following:

Y_i ∼ EF(b(θ_i), ψ/w),    g(µ_i) = η_i,    η_i = x_i'β.

That is, we assume that Y_i is a member of the exponential family of distributions. Furthermore, there exists a function g(·), called a link function, that relates the mean value µ_i to the independent variables x_i. Taking this function to be the identity function and assuming a normal distribution for Y_i brings us back to the case of linear regression. The addition of a link function to relate the mean and the independent variables adds to the flexibility that generalized linear models offer for modeling a variety of problems. The expectation of Y_i is given by E[Y_i] = b'(θ_i) and its variance is given as var[Y_i] = b''(θ_i) ψ/w, as found in Azzalini (1996).


The previous paragraphs were concerned with the general theory of generalized linear models. Now we can use this information to formulate the logit model. We have already seen that the dependent variable y_i only takes on the values 0 and 1. If we generalize this to all n observations and thus aggregate over i, we recognize in this situation the binomial distribution. The density of the binomial distribution in the form (3.1) can be derived, following Faraway (2006), as follows:

f(y | θ, ψ) = C(n, y) µ^y (1 − µ)^{n−y}
            = exp( y log µ + (n − y) log(1 − µ) + log C(n, y) )
            = exp( y log( µ/(1 − µ) ) + n log(1 − µ) + log C(n, y) ),

where C(n, y) denotes the binomial coefficient. We thus see that θ = log( µ/(1 − µ) ), b(θ) = −n log(1 − µ) = n log(1 + exp θ), c(y, ψ) = log C(n, y) and that w = ψ = 1. The link function can be seen to equal g(µ) = log( µ/(1 − µ) ), which is the logit function that gives its name to the model. The expectation and the variance can be derived easily from these principles and are seen to equal E[Y] = exp θ / (1 + exp θ) and var[Y] = µ(1 − µ). We can now relate the independent variables to the dependent variable using a regression structure in the following way:

log( µ/(1 − µ) ) = β_0 + β_1 x_1 + β_2 x_2 + . . . + β_k x_k, or in matrix notation η = Xβ,   (3.2)

where X is an n × k matrix containing the observations for n individuals on k variables (including a constant) and β is the vector of parameters for these variables. The interpretation of this relation is the same as for linear regression, except that the dependent variable is now linearly related to the regressors via a link function transformation.

Estimation can be done through maximum likelihood estimation. Using the Newton-Raphson method in combination with Fisher scoring yields a simple and efficient algorithm known as iteratively reweighted least squares (IRWLS). Following Azzalini (1996) the algorithm reads:

1. Set starting values z = log( (y + 0.5)/(n − y + 0.5) ) and W = I (the identity matrix)

2. Repeat the following steps until convergence is achieved:

• R = (X'WX)^(−1)
• β = R X'Wz
• η = Xβ
• µ = exp(η)/(1 + exp(η))
• ∆ = µ(1 − µ)
• z = η + (y − nµ)/(n∆)
• W = n∆

3. The last value of β gives the parameter estimates, while the last value of R gives the variance matrix

Convergence is achieved when the parameter vector no longer changes up to a certain threshold value. For the purpose of this thesis we made use of the function glm contained in the R software package, which implements this method.
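In R this boils down to a single call to glm with the binomial family; a minimal sketch is given below. The formula is purely illustrative (it uses a few variables mentioned elsewhere in this thesis); the actual specification for our data is discussed in chapter 4.

# Sketch: fitting a logit model with glm, which uses the IRWLS scheme above.
# 'train' and 'holdout' denote the samples from chapter 2.
fit <- glm(Uitstroom ~ Relatieduur + Betaalaut + KostenAVtrend,
           family = binomial(link = "logit"), data = train)
summary(fit)                                                 # coefficients and standard errors
p_hat <- predict(fit, newdata = holdout, type = "response")  # predicted churn probabilities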


3.2 Tree model

Besides the logit model discussed in the previous section, the problem of classifying 0/1 observations has also been solved by applying tree based methods. By applying recursive splits on the set of independent variables, observations are classified as either being 0 or 1. The results can be visualized in a tree structure, which lends its name to the method. Seminal work on tree methods is presented in Breiman et al. (1984), on which the exposition given here is based. A short summary of this book is given in Therneau and Atkinson (1997), which we will use as a secondary source as well.

3.2.1 An example of tree methods

To illustrate the ideas behind tree methods, we will start out with an example related to our problem of identifying churners based on a large set of independent variables. The tree based approach works by subsequent binary splits made on these independent variables in such a way that the difference between churners and non-churners is as large as possible. As a measure for ‘as large as possible’ we will use what is called impurity, which will be explained later. The collection of all these splits can be visualized in a tree-like figure, which is why we call the method a tree.

As we said, splits are done in a binary way. This means that if we have for example a variable called Income, ranging from 1 to 500, say, then possible splits can be for example Income > 2 on the one side of the tree and Income ≤ 2 on the other side of the tree, or Income > 300 and Income ≤ 300. This results in two sets of persons in each splitting step, fulfilling the criteria specified by the split. These two sets can be split further again, on the same or on other variables, until theoretically every observation is assigned its own set. In practice, splitting is stopped earlier to give meaningful interpretations to the final sets.

An illustration using our data from the year 2004 is given in figure 3.1. First, a split is made on the variable Betaalaut, which is coded as 0/1, and the split is made halfway. The two groups formed have churn probability 0.0121 and 1 − 0.0121 = 0.9879, respectively. The latter group is split again, now on KostenAVtrend, giving one group with churn probability 0.0328 and one with probability 0.9551. After that, two more splits are made on the group with the largest churn probability. This results in 5 groups or nodes in the end, each with its own characteristics and churn probability, as can be seen in the figure. By following the splits we can determine the characteristics of a certain group. For example, the group with churn probability 0.0328 uses automatic bill payment (Betaalaut) and has average health care costs (KostenAVtrend) that are below 0.09.

This example illustrates the basic idea of splitting on specific variables to end up with groups that differ in churn probability. With this in mind, we can move somewhat deeper into the theory behind making the splits, determining when to stop splitting and characterizing the final groups.
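A tree like the one in figure 3.1 can be grown with the rpart package in R. The sketch below is illustrative; it uses the variables from the figure and default settings rather than the exact settings used later in chapter 4.

library(rpart)

# Sketch: growing a classification tree on the 2004 training data, as in figure 3.1.
tree <- rpart(factor(Uitstroom) ~ Betaalaut + KostenAVtrend + Verhuisd + Betaalacc,
              data = train, method = "class")
print(tree)              # the splits and the class distribution in each node
plot(tree); text(tree)   # basic plot of the tree structure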

3.2.2 Selection of a splitting method

Breiman et al. (1984) state that solving a classification problem using tree methods requires three things:

1. A method for selecting the splits (on the independent variables)

2. The decision when to declare a node terminal (stop splitting) or to continue splitting it

3. The assignment of each terminal node to a class

In this and the following sections we will discuss each of these aspects, which will lead to an operational algorithm to create trees. We will also present an additional algorithm to further improve the classification made by ordinary tree models, a so-called boosting algorithm.


Figure 3.1: Example of a tree, with splits on the variables Betaalaut, KostenAVtrend, Verhuisd and Betaalacc and node churn probabilities 0.0121, 0.0328, 0.003185, 0.05668 and 0.4274

We will start out with the first criterion, selecting a splitting method. Assume there are two classes (churn and non-churn), i = 1, 2, and that we want to split them into k terminal nodes. We require some notation first, following Therneau and Atkinson (1997): let π_i denote the prior probabilities for each class, L(i, j) denote the loss matrix for incorrectly classifying an i as a j and A denote a node of the tree. A can be viewed both as a set of individuals in the sample and as a classification rule for new data, if we use the tree that produced it. Now let τ(x) denote the true class of an observation x, with x a vector of independent variables, and τ(A) the assigned class of A in a terminal node. The assigned class is the class with the highest probability in that node. Finally, let n_i denote the number of observations in the sample of class i, n_A the number of observations in node A and n_iA the number of observations in the sample that are of class i and in node A.

We can now introduce some probability measures. The probability of node A is given as

P(A) = Σ_{i=1}^{2} π_i P(x ∈ A | τ(x) = i) ≈ Σ_{i=1}^{2} π_i (n_iA / n_i).   (3.3)

The part before the ≈-sign gives the theoretical probability, while the latter part provides the estimate that is used in practical applications. Thus, the probability of a node A is derived from the relative frequencies observed in that node. Also note that the prior probability is often set to equal the observed sample frequencies and we will do so as well in our applications. However, for the purpose of completeness we will leave it in for now.

Associated with this probability is the conditional probability p_iA:

p_iA = P(τ(x) = i | x ∈ A) = π_i P(x ∈ A | τ(x) = i) / P(x ∈ A) ≈ π_i (n_iA / n_i) / Σ_i π_i (n_iA / n_i).


We also introduce two so-called risk measures, which we will minimize in the splitting procedure. Let R(A) denote the risk of node A, with

R(A) = Σ_{i=1}^{2} p_iA L(i, τ(A)),

where τ(A) is chosen to minimize the risk. As τ(A) is deterministic, L(i, τ(A)) corresponds to a fixed misclassification cost, which we will give later on. The total model risk R(T), with T a complete tree, is then given as

R(T) = Σ_{j=1}^{k} P(A_j) R(A_j),   (3.4)

where A_j are the terminal nodes of the tree. If we set L(i, j) = 1 for all i ≠ j and set the prior probabilities to equal the observed class frequencies, we have that p_iA = n_iA / n_A and R(T) gives the proportion of misclassified observations.

If we split a node A into two sons A_l and A_r, we find that P(A_l)R(A_l) + P(A_r)R(A_r) ≤ P(A)R(A) according to Breiman et al. (1984). This would suggest splitting a node so as to maximize the risk reduction ∆R = P(A)R(A) − P(A_l)R(A_l) − P(A_r)R(A_r). This however leads to unfavourable splits, which can be illustrated by an example derived from Therneau and Atkinson (1997): suppose we have a dataset where 80% of the data is of class 1, with class 2 being the other option. The first split produces 54% of class 1 in node A_l and 100% of class 1 in node A_r. This leads to the assignment τ(A_l) = τ(A_r) = 1 and ∆R will be 0. Maximizing ∆R will usually lead to looking for another assignment of classes, but from a practical point of view this split makes sense and can serve as a good basis for further splitting, because node A_l has a good mix of both classes while node A_r only contains one class. This defect often causes splitting procedures to shut down after the first few splits, while further splitting can still be done and makes sense from a practical point of view. Another problem with maximizing ∆R is that linear risk reduction is preferred: when one split leads to groups with unequal risk reduction (say, 75% and 40%), while another leads to groups with equal risk reduction (say, 50% and 50%), the latter is preferred by the algorithm, while the former is often preferred from a practical stance because it makes further splitting easier.

To circumvent these problems, a so-called impurity measure I(·) is introduced with

I(A) = Σ_{i=1}^{2} f(p_iA),

where f(·) is a concave impurity function with f(0) = f(1) = 0. Two often used functions are the information index f(p) = −p log(p) and the Gini index f(p) = p(1 − p). We can then select the split with maximum impurity reduction

∆I = P(A)I(A) − P(A_l)I(A_l) − P(A_r)I(A_r)   (3.5)

instead of focusing on maximizing ∆R.

The method of maximizing the impurity reduction only focuses on selecting a good split, but it does not yet take into account the goodness of the classification, which is taken into account when we maximize ∆R. This is due to the inclusion of a loss matrix in R(A), which in turn influences ∆R. This problem can be solved by including a loss matrix L(i, j), which penalizes for misclassification, into our impurity split procedure as well. Let

L(i, j) = L_i if i ≠ j and L(i, j) = 0 if i = j,

where L_i is a certain loss value chosen by the researcher. The losses L_i are usually taken to be equal, although they can be chosen differently to penalize one type of misclassification more heavily than the other.


Because we don’t have any risk measure R(·) now, we have to find an alternative way of implementing this loss matrix. What we will do is adapt the priors πito new priors as follows:

π˜i= πiLi P

jπjLj

.

These new priors are scaled by the loss L_i and thus take into account a penalty for misclassification. These changes affect P(A) (equation (3.3)) and this in turn influences our impurity split method in equation (3.5). Thus, by correcting our priors we have improved the impurity split criterion in such a way that it works similarly to the risk criterion.

Summarizing the splitting procedure: at each node we evaluate all the splits that are possible. We select the split for which (3.5) achieves a maximum and execute the split. This procedure is repeated at each node until the desired number of terminal nodes k is reached. Although we use impurity instead of risk to split the tree, risk is still important when it comes to terminating the splitting. This will be shown in the next section.
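In rpart these choices are controlled through the parms argument: split selects the impurity function, while prior and loss implement the adjusted priors and the loss matrix described above. The sketch below uses illustrative values only.

library(rpart)

# Sketch: impurity splitting with priors and a loss matrix in rpart.
# The prior and loss values are illustrative; class levels are assumed to be
# ordered as (non-churn, churn).
train$Uitstroom <- factor(train$Uitstroom, levels = c(0, 1))
fit <- rpart(Uitstroom ~ ., data = train, method = "class",
             parms = list(split = "gini",                       # Gini index f(p) = p(1 - p)
                          prior = c(0.9, 0.1),                  # prior class probabilities
                          loss  = matrix(c(0, 5, 1, 0), 2, 2))) # L(2,1) = 5: missing a churner costs 5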

3.2.3 Termination criteria and tree pruning

Using the method described above we are able to split a tree. We could continue splitting until each case is assigned its own node, in which case we would end up with n nodes. However, this would be uninformative as no structure is reflected in the tree then. Therefore, splitting up to k < n nodes is what we would like to do. In this section we will introduce a simple criterion described in Therneau and Atkinson (1997) that helps us determine the number of terminal nodes k to retain.

Let T1, T2, . . . Tk denote the respective terminal nodes of tree T , that is tree T has k nodes. Recall that we define R(T )= Pki=1P(Ti)R(Ti) to be the associated risk of tree T (see equation (3.4). Call 0 ≤ α ≤ ∞ the cost complexity parameter measuring the cost of adding another split to the model and define

R_α(T) = R(T) + αk    (3.6)

to be the total cost of a tree T with k terminal nodes. Here R(T_0) is the risk of the tree without splits. Taking α = ∞ is equivalent to not splitting at all and taking α = 0 is equivalent to splitting until the end. Of course we would like to select α such that we end up with a tree that has minimal risk (3.6). It can be shown (Breiman et al., 1984) that it is possible to select the α that minimizes the risk in an efficient way.

We will use the cross-validation procedure outlined in Therneau and Atkinson (1997) to select the best α. The cross-validation procedure works by first fitting the full tree and determining a set of cost complexity parameters for the dataset. The data is then divided into groups that each exclude a small number of observations (1 or 10, depending on the size of the dataset) and the tree is grown on these groups. The excluded observations are sent through the tree and classified, and their risk (3.6) is subsequently computed using the cost complexity parameters obtained earlier. Summing the risks over all groups yields an estimate of the risk per cost complexity parameter. We select the tree with minimal risk and prune it back to that value of the cost complexity parameter.
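As an illustration, this is the procedure carried out by the rpart package, the R implementation accompanying Therneau and Atkinson (1997); the formula, the data set name churn_data and the control settings below are placeholders, not the specification used in this thesis.

library(rpart)

# Grow the full tree (cp = 0) with 10-fold cross-validation, inspect the
# cross-validated risk per value of the complexity parameter, and prune
# back to the value with minimal cross-validated risk.
fit <- rpart(churn ~ ., data = churn_data, method = "class",   # churn is a factor
             parms = list(split = "gini"),
             control = rpart.control(cp = 0, xval = 10))
printcp(fit)                                   # table of cp, nsplit and xerror
cp_best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = cp_best)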

Having determined the number of terminal nodes of the tree, we still need to assign a class to them. This is easily done by realizing that class probabilities are available for each node A, which includes the terminal nodes as well. Selecting the class with the highest probability can therefore be used to label each terminal node with a class. We have thus shown how to build tree models by minimizing the overall tree risk. Next we will outline a so-called boosting procedure to improve tree classification.



3.2.4 A Stochastic Gradient Boosting algorithm for tree classifiers

Boosting algorithms provide a method to improve the classification of models. The method relies on repeated application of a simple classification rule (a tree model in this case) to achieve an overall improved classification performance. Using weights, the harder-to-classify observations are given more attention in subsequent applications of the classification rule, leading to improved fits. Lemmens and Croux (2006) compare several boosting algorithms and find that the Stochastic Gradient Boosting algorithm described in Friedman (2002) performs best at improving the classification, based on several criteria. We will therefore use this algorithm to improve our classification.

The general algorithm for Stochastic Gradient Boosting is described in Culp, Johnson, and Michailidis (2006), who discuss the R package ada, which was used for this thesis. However, this is a very general exposition, much of which is not applicable to the situation at hand. A special case of the algorithm, Real Adaboost, was shown to perform best in the study by Lemmens and Croux (2006), and this is therefore the variant used in this thesis. This much simpler variant of the general algorithm is discussed in Friedman, Hastie, and Tibshirani (2000), who give the following outline:

1. Set weights w_i = 1/n for i = 1, . . . , n.

2. For m = 1, . . . , M repeat:

   (a) Obtain class probability estimates p_m(x) = P̂_w(y = 1 | x) ∈ [0, 1], where each observation is weighted with w_i.

   (b) Set f_m(x) ← (1/2) log[ p_m(x) / (1 − p_m(x)) ] ∈ R.

   (c) Set w_i ← w_i exp(−y_i f_m(x_i)) and normalize such that Σ_i w_i = 1.

3. The final classifier is given as sign( Σ_{m=1}^{M} f_m(x) ).

In the first step of the algorithm we set initial values for the weights. In the second step we iteratively run through several steps. First, a special tree, called a stump, is fitted. This tree only has two to four nodes, depending on the size of the dataset and the cross-validated model fit. In each node we have a probability distribution over the classes, denoted as τ(A) previously. We call the probability of falling into class 1 p_m in step 2a, where x denotes the vector of independent variables as before. The probabilities are obtained from observations that are multiplied by a weight w_i, which in subsequent runs of the algorithm leads to observations that are harder to classify being given a larger weight. In step 2b we compute the quantity f_m as the half-logit transform of the probability obtained in step 2a. The use of the half-logit transform is motivated in Friedman et al. (2000) by the fact that we can view the problem as additive logistic regression. This quantity then is the minimizer of a criterion similar to minimizing residual error in OLS. In the final step, 2c, the weights are modified for each observation by using the old weights, the class of each observation and the f_m obtained in the previous step. The algorithm then returns to step 2a and is run until M steps are completed. The final classifier is given as the sign of the sum of the half-logit quantities.
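The loop can be made concrete with a short from-scratch sketch in R, using rpart stumps as the weighted probability estimator. The thesis itself relies on the ada package for the actual estimation, so this is only meant to illustrate the steps; the data frame layout (a factor churn with levels "0"/"1" plus predictor columns), the weight column name w and the number of iterations M are assumptions.

library(rpart)

real_adaboost <- function(d, M = 50, eps = 1e-6) {
  n <- nrow(d)
  y <- ifelse(d$churn == "1", 1, -1)    # classes coded -1/+1 for the weight update
  d$w   <- rep(1 / n, n)                # step 1: uniform starting weights
  score <- rep(0, n)                    # running sum of the half-logit scores f_m(x)
  for (m in 1:M) {
    # step 2a: weighted class-probability estimates from a stump (maxdepth = 1)
    stump <- rpart(churn ~ . - w, data = d, weights = w, method = "class",
                   control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
    p <- predict(stump, d, type = "prob")[, "1"]
    p <- pmin(pmax(p, eps), 1 - eps)    # keep the logit finite
    f <- 0.5 * log(p / (1 - p))         # step 2b: half-logit transform
    d$w <- d$w * exp(-y * f)            # step 2c: reweight ...
    d$w <- d$w / sum(d$w)               # ... and renormalize
    score <- score + f
  }
  sign(score)                           # step 3: final classifier on the training data
}

In practice one would also store the stumps and the f_m(·) so that new customers can be scored; here only the in-sample class labels are returned.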

The function g(y, f) = exp(−y f) in step 2c is called the loss function (different from the loss function used to estimate the tree) in Culp et al. (2006) and determines the class of boosting. An often used alternative is the logistic loss function g(y, f) = log(1 + exp(−y f)), which leads to L2-boosting. Instead of the half-logit function in step 2b we could also use the sign function or the identity function, which leads to discrete and gentle boosting, respectively. The combination of the half-logit function and the exponential loss function is the true Real Adaboost algorithm, with the first part of the name referring to the type of boosting (as opposed to discrete or gentle) and the second part referring to the type of loss function used.
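In terms of the ada package this combination corresponds, to the best of our reading of its interface, to selecting the real type together with the exponential loss; the formula, data set names and iteration count below are placeholders.

library(ada)

# type = "real" selects the half-logit (Real Adaboost) variant of step 2b,
# loss = "exponential" the exponential loss g(y, f) = exp(-y f) of step 2c
boosted    <- ada(churn ~ ., data = train, type = "real", loss = "exponential",
                  iter = 100)
pred_class <- predict(boosted, newdata = test)    # predicted class labels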


3.3 Kalman filter model

The previous two models only allowed the parameters to be constant over time, that is, they were static and fixed within a certain time period. As we have seen, however, we have data that spans four years, and we would like to use that aspect of the data in our models. Therefore, the next two models focus on allowing for temporal changes in the parameter vector β. In this section we present a model based on the Kalman filter, which is often used to model time series. The model presented here is based on work by Fahrmeir and Tutz (1994), Fahrmeir and Wagenpfeil (1997) and Fahrmeir (1992). The final model we will use is an adaptation of the standard Kalman filter to panel data (i.e. cross-sectional data collected in several time periods) and to exponential family distribution models (the reasons for this were explained before). For completeness, we start with a review of the traditional Kalman filter and gradually generalize it to the case of an exponential family panel data model.

3.3.1 The linear dynamic model

The model we will use falls in the class of state space models, a very general class of models that offers a great deal of flexibility for modelling time series data. In the context presented here the name linear dynamic models is often used, which stems from the literature on Bayesian statistics. A good outline of linear dynamic models is given in West and Harrison (1989), where a wide array of applications is also discussed. The Bayesian aspect of the method comes from the fact that, given certain prior distributions on the state of the system in combination with prior information available to the researcher, a posterior distribution is formed using a series of updating steps. Using the mean and variance of this posterior distribution, inference can be performed. Below, we start with a general description of the linear dynamic model and gradually work towards a formulation suitable for our problem.

Dynamic linear models consist of two equations that model the state space. The first equation is the observation equation, which takes the form

y_t = X_t β_t + ε_t,    ε_t ∼ N(0, Σ_t),    (3.7)

for t = 1, 2, . . . , T. This equation relates the observations y_t to the unknown state vector β_t. The matrix X_t is called the design matrix. The notation already suggests that it can be seen as a matrix containing independent variables when we are in a regression context, as for example in the logit model. However, in other problems it can take different forms if required. The error process ε_t is a white noise process: we assume that the error variables are mutually uncorrelated and have expectation 0 and variance matrix Σ_t.

The second equation is called the state equation, denoted by

β_t = F_t β_{t−1} + ζ_t,    ζ_t ∼ N(0, Q_t),    (3.8)

for t = 1, 2, . . . , T. This equation describes the transition of the vector β_t over time. The matrix F_t is called the transition matrix, which describes the form of the transition process. It is to a large extent responsible for the flexibility that state space models offer, since it can take different forms for different problems.

The error process ζ_t is again assumed to be a white noise process, independent of ε_t. We assume that X_t, F_t, Σ_t, Q_t, β_0 and Q_0 are known as well, although we will show in a later section that this assumption can be relaxed for the last three of these.
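To make the notation concrete, the following base R sketch simulates a small univariate version of the model in (3.7) and (3.8) with a random-walk state (F_t equal to the identity) and runs the standard Kalman filter prediction/update recursion on it. All dimensions, variance values and variable names are illustrative assumptions; this is not the panel, exponential-family specification developed later in this section.

set.seed(1)
n_t   <- 50                      # number of time periods
p     <- 2                       # dimension of the state vector beta_t
Sigma <- 0.5                     # observation error variance (Sigma_t, constant)
Q     <- diag(0.01, p)           # state error covariance (Q_t, constant, diagonal)

X    <- cbind(1, rnorm(n_t))     # design: one row x_t per period
beta <- matrix(0, n_t, p)
y    <- numeric(n_t)
b    <- c(1, -0.5)               # beta_0
for (tt in 1:n_t) {
  b          <- b + rnorm(p, 0, sqrt(diag(Q)))               # state equation (3.8)
  beta[tt, ] <- b
  y[tt]      <- sum(X[tt, ] * b) + rnorm(1, 0, sqrt(Sigma))  # observation equation (3.7)
}

# One prediction/update step of the Kalman filter for this model
kalman_step <- function(a, P, x, y, Sigma, Q) {
  a_pred <- a                                 # random-walk prediction of the state
  P_pred <- P + Q
  v  <- y - sum(x * a_pred)                   # one-step-ahead prediction error
  Fv <- as.numeric(t(x) %*% P_pred %*% x) + Sigma
  K  <- (P_pred %*% x) / Fv                   # Kalman gain
  list(a = drop(a_pred + K * v),
       P = P_pred - K %*% (t(x) %*% P_pred))
}

a <- c(1, -0.5); P <- diag(1, p)              # prior mean beta_0 and covariance Q_0
filtered <- matrix(0, n_t, p)
for (tt in 1:n_t) {
  step <- kalman_step(a, P, X[tt, ], y[tt], Sigma, Q)
  a <- step$a; P <- step$P
  filtered[tt, ] <- a                         # filtered estimate of beta_t
}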
