
Faculty of Economics and Business

Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis; for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)

(c) Introduction

(d) Theoretical background

(e) Model

(f) Data

(g) Empirical Analysis

(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the heading of the sections. You have a free choice how to list your references but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and year of publication in case of the first reference and use the first name and et al. and year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number

(d) Date of submission final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics

Comparison of probabilistic and machine-learning methods for churn prediction

Jan Hynek

11748494

MSc in Econometrics

Track: Big Data Business Analytics
Date of final version: 11th August 2018
Supervisor: dr. Noud van Giersbergen
Second reader: dr. Kevin Pak

Abstract: We compare probabilistic and machine learning methods for customer churn prediction, with the BG/NBD and AdaBoost models as the main examples. We conduct two different training and testing dataset splits, namely a temporal split and a per-user split, and evaluate the results using bootstrapping. We find that in the per-user setting AdaBoost outperforms the other algorithms. However, in the temporal split setting, which we consider a better reflection of the use case, the BG/NBD model performs comparably, with the benefit of a deeper insight into the customers' future behaviour.


Statement of Originality

This document is written by student Jan Hynek who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Acknowledgment

I would like to express my gratitude to dhr. dr. Noud van Giersbergen, who had enough patience with me and my thesis and showed me the ways of data science and econometrics. The same applies to the team of aLook Analytics, for the provided data along with helpful insights into data science. In addition, I would like to thank my fiancée Kamila, who provided me with extensive moral support throughout the whole year. Moreover, I would like to thank my mother Ilona and my sister Lenka and her family, who were supportive during the stressful times of the studies. Special thanks go to my sister Tamara and her family, who supported me the whole time and funded my studies. Last, but not least, a great recognition goes to my friends Radim, Stepan and Samuel, who showed me that no problem is unsolvable and anything can be achieved if one tries hard enough.


Contents

1 Introduction
2 Literature Review
2.1 Customer Relationship Management
2.2 Customer Lifetime Value
2.3 Churn prediction
2.4 Theory of methods used
3 Methodology
3.1 Dataset description
3.2 Dataset wrangling
3.3 Exploratory dataset analysis
3.4 Test / train split
3.5 Modelling
3.6 Evaluation
3.7 Modelling approach summary
4 Results
4.1 Model parameters on training datasets
4.2 Prediction using temporal split
4.3 Prediction using per-user split
4.4 Comparison of approaches
4.5 Comparison of model performance using bootstrap
5 Conclusion
References

1 Introduction

The share of e-commerce in total global retail revenue is rising¹ and the online market is becoming more and more important. Wang et al. (2016) state that retailers who move to online platforms early can profit from an early mover advantage (EMA), and the authors also emphasise that Customer Relationship Management (CRM) capabilities such as customer retention can strengthen such EMAs. The internet environment nowadays provides us with CRM methods that were not available in the past, which can be seen with the example of churn prediction.

Churn prediction is a term coined by telecommunication companies in the early 1980s. At that time this kind of prediction indicated which customers were likely to cancel a subscription, which allowed for an early solution of the problem. In the e-commerce retail business, where customers usually do not have contracts with companies but do share their transaction history, churn prediction helps to indicate whether a given customer will continue buying at a given company. When an important customer is identified as a possible churner, they can be addressed with an offer to continue the relationship with the company. This can improve a company's profitability, as several studies have stated that customer retention (i.e. using churn prediction) has lower costs than customer attraction (Weinstein, 2002; Rosenberg and Czepiel, 1984).

This thesis performs churn prediction on data from a small fashion-oriented e-commerce company operating in Central Europe. This company was concerned with customer outflow and decided to employ churn prediction to address the problem. While solving the problem, two questions came up.

• The current norm is to employ machine learning (ML) methods for many problems (including churn prediction), but at the same time there is an extensive probabilistic framework developed for the estimation of future purchases and customer lifetime value. Which of these methods is more accurate in the binary classification of future churners?

• While the transactional data have a time dimension, the typical dataset for both ML and probabilistic models reduces this to variable entries. When assessing performance using out-of-sample prediction, what is the difference between a temporal split of the dataset and a typical split on a per-user basis?

These are the questions we tackle in this thesis. To do so, in the next chapter we review the literature on this topic. Afterwards we present what the dataset looks like, how it was modified, which models were applied, and how they were evaluated. Next, we look at the results. We conclude with a discussion about the results and possible future work.

¹ Statista.com: E-commerce share of total global retail sales from 2015 to 2021. Available at https://www.statista.com/statistics/534123/e-commerce-share-of-retail-sales-worldwide/

2 Literature Review

In the following section, we first look at Customer Relationship Management methods and their relationship with churn prediction. Next, we look at Customer Lifetime Value and how it can be applied to this problem, and afterwards examine some of the recent work done in the field of churn prediction. Finally, we look more deeply into the methods used in this thesis.

2.1 Customer Relationship Management

Currently, a plethora of companies providing Customer Relationship Management (CRM) software exists. The market for CRM software is currently the biggest software market, with $39.5 billion of total revenue, and is estimated to be the fastest growing software market as of 2018.²

Even though the topic of CRM has been thoroughly discussed throughout the years, a single universally accepted definition of CRM does not exist. One of the newest definitions is by Kumar and Reinartz (2018), who define CRM as the 'strategic process of selecting customers that a firm can most profitably serve and shaping interactions between a company and these customers.' The authors specify the goal of CRM to be 'optimizing the current and future value of customers for the company.' We consider churn prediction part of CRM for several reasons. Firstly, it helps in both selecting the customers whom a company can profitably serve and shaping these interactions. Secondly, churn prediction helps a company determine whether a customer will be active in the future, thus obtaining an important piece of information for optimising the current and future value of the company's customer base.

In order to explain where exactly in the CRM framework churn prediction is located, we use the division by Schultz (2000). He argues that two main approaches exist:

• Analytical - Mainly developed in North America, a technology-oriented solution aimed especially at customer acquisition

• Operational - Used historically in Northern Europe, company-customer relationship oriented, aimed mainly at customer retention.

We observe that churn prediction fits in both of these definitions, as we can define it as an analytical approach to customer retention. Ngai et al. (2009) redefine operational CRM as the automation of business processes and analytical CRM as the analysis of customer behaviour and characteristics to support company decisions. We see that churn prediction in this context satisfies the definition of both parts. Buttle (2004) supports this claim, stating that in operational CRM, marketing automation using customer data allows us to provide customers with targeted offers. He also states that analytical CRM helps companies determine which customers should be targeted or where the sales effort should be concentrated.

² 'Gartner Says CRM Became the Largest Software Market in 2017 and Will Be the Fastest Growing Software Market in 2018' https://www.gartner.com/newsroom/id/3871105

Churn prediction is also important in another influential CRM framework, presented by Swift (2001), where the author divides CRM into four dimensions, consisting of customer identification, attraction, retention and development. We can think of churn prediction as being part of a one-to-one marketing campaign, used for the identification of 'soon-to-be-dead' customers, thus reducing costs. At the same time, such campaigns could use Customer Lifetime Value (CLV) in the modelling phase to identify customers who are profitable to contact. The first part of the campaign would be considered by Swift to be part of customer retention, the second part of customer development.

2.2 Customer Lifetime Value

CLV has an important role in the CRM framework (Kumar and Petersen, 2005; Gupta et al., 2006). Possible applications in CRM consist of measuring the success of the business, using it as an objective way to make marketing accountable, or exploring customers' data and using their revealed preferences. In this thesis we discuss churn prediction as a possible application of CLV. To evaluate CLV, Gupta et al. (2006) present several ways to model it, but for this thesis only three of them are important, namely RFM models, Probability models, and Computer Science models.

First, the RFM models, which are named after the Recency/Frequency/Monetary values, the variables used for modelling. The authors state that these models have been used for a long time and are easy to implement; however, they are limited in CLV estimation, as they are able to predict only the next period. Another drawback of this method is that the underlying distribution is completely ignored. In contrast, Fader et al. (2005b) state that the RFM variables are sufficient statistics for the CLV model.

Second, the Probability models. Gupta et al. mention the Pareto/Negative Binomial Distribution (NBD) model, which was developed by Schmittlein et al. (1987). In this paper, they assume that each customer's relationship is either alive or dead. If alive, the number of transactions can be modelled using a Poisson distribution and the lifetime duration using an exponential distribution. Moreover, the authors assume that the purchasing and death rates of individual customers follow Gamma distributions, and that these two distributions are independent. One of the main benefits is that these models do not require the monetary value of the transactions. This model became a cornerstone for several other models, for instance the Beta-binomial/Beta-geometric model by Fader et al. (2004). Apart from that, Gupta et al. (2006) state that the Pareto/NBD is a "good benchmark model when considering non-contractual settings where transactions can occur at any point in time".

Last, the Computer Science models. In this part it is apparent that Gupta et al. are not very familiar with machine learning methods and name a great number of possible methods to be used. The authors state that the main approach here is modelling binary outcomes using a large number of variables, with emphasis on predictive ability. In addition, the authors state that these approaches are usually not interpretable, and conclude that they are little known in the marketing literature and need a closer look in the future.

2.3 Churn prediction

Tamaddoni et al. (2016) provide us with an intriguing approach to customer churn and compare the performance of customer churn prediction using probabilistic methods (applying Pareto/NBD models) and computer science modelling methods (employing Support Vector Machines as a single-algorithm technique and Boosting as an ensemble learning method). The authors apply the two approaches to two datasets coming from different industries. First, they use only the RFM variables, as these are the only variables which can be used in Pareto/NBD, to compare the predictive power of the models in settings where there is not enough information. They conclude that in most cases boosting provides the best performance, apart from cases where the sample size is very small.

However, the authors also try to tackle an important question: how do we discover which customer churn prediction technique is the most profitable? This is done through simulation and an innovative lift-based cumulative profitability measure, where the authors utilise the expected profit for a customer in the prediction period, the value of the offered incentive, the probability of churn for a given customer, the probability that a would-be churner would accept the incentive offer, and the probability that a would-not-be churner would accept the incentive offer. A major drawback of this approach is that the distributions for the 'offer acceptance' probabilities are assumed. They are not based on any research, nor are any arguments provided for why they should be Beta distributions.

One of the few companies in the e-commerce industry sharing their churn prediction and CLV estimation system deployed in practice is Groupon (Vanderveld et al., 2016). This company currently serves more than 30 countries and 500 markets and offers discounts on physical merchandise and travel deals. The authors describe the modelling workflow, in which they first divide customers into five different cohorts (Unactivated, New users, One-time buyers, Sporadic buyers, Power users), which is similar to a classical RFM division. Afterwards, the authors state that they use two-stage Random Forests (RF) to evaluate future CLV: in the first stage they use an RF to determine which customers will purchase something, and in the second stage, applied to those customers who will purchase, they use an RF to predict the dollar value of their CLV. Apart from that, the paper describes how these models are retrained both daily and quarterly. Regarding churn, i.e. the first stage of the two-stage RF model, the authors report the Accuracy, Precision, Recall and False-Positive rate for each of the cohorts. However, it is hard to evaluate the overall performance of this approach against other methods, as the authors do not use a benchmark model.

2.4 Theory of methods used

In this part we look at the methods used and explain why they should work for churn prediction. We refer to the original papers for their derivations.

Pareto/NBD and BG/NBD As mentioned before, the Pareto/NBD belongs to the probabilistic models, and in this section we explain the assumptions behind it. The model comes from Schmittlein et al. (1987) and needs only two main pieces of information: recency and frequency. The authors categorise the assumptions into two groups, those aimed at individual customers and those describing customer heterogeneity, and justify them as follows:

1. Poisson purchases - This in turn implies exponentially distributed interpurchase times, which means that there is a constant probability of being stimulated to make a transaction:

\[ P[X = x \mid \lambda, \tau > T] = \frac{e^{-\lambda T} (\lambda T)^x}{x!}, \qquad x \in \{0, 1, 2, \dots\} \tag{1} \]

with λ the purchase rate, X the number of purchases and T a given time period.

2. Exponential lifetime - The authors assume that the events which could trigger customer "death" (e.g. a lifestyle change, a financial setback, a move) follow a Poisson process. They state that even though, e.g., moving might not be Poisson distributed, the set of all events which might cause customer "death" can be considered a single Poisson process. The lifetime is then distributed as

\[ f(\tau \mid \mu) = \mu e^{-\mu \tau}, \qquad \tau > 0. \]

From these assumptions, Schmittlein et al. (1987) derive the desired probability P[Customer Alive | Purchasing Information], which can be written as

\[ P[\tau > T \mid \lambda, \mu, X = x, t, T] = \frac{1}{1 + \frac{\mu}{\lambda + \mu}\left(e^{(\lambda + \mu)(T - t)} - 1\right)} \]

where t (0 < t ≤ T) is the time of the last transaction.
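For intuition, this individual-level expression can be evaluated directly once λ and µ are treated as known; a minimal sketch follows, with purely illustrative parameter values that are not estimates from this thesis.

    import math

    def p_alive(lam, mu, T, t):
        # Schmittlein et al. (1987) individual-level P[tau > T | lambda, mu, x, t, T]
        return 1.0 / (1.0 + (mu / (lam + mu)) * (math.exp((lam + mu) * (T - t)) - 1.0))

    # illustrative values: 0.5 purchases/month, 2% monthly death rate,
    # observed for 24 months with the last purchase in month 18
    print(p_alive(lam=0.5, mu=0.02, T=24, t=18))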

However, this equation is not operational, as we do not know the purchase rate λ and the death rate µ for individual customers. Still, we can estimate their distribution over the customers. The authors assume that they both follow Gamma distributions independently, which leads to the following assumptions:

3. Individuals' purchasing and death rates distributed Gamma - Individuals have different purchasing rates (some purchase often, some less) as well as different death rates (the events causing customer "death" affect some customers more than others):

\[ f(\lambda \mid r, \alpha) = \frac{\alpha^r}{\Gamma(r)} \lambda^{r-1} e^{-\alpha\lambda}; \qquad \lambda, r, \alpha > 0 \tag{2} \]

where r and α are the shape and scale parameter, respectively, and

\[ h(\mu \mid s, \beta) = \frac{\beta^s}{\Gamma(s)} \mu^{s-1} e^{-\beta\mu}; \qquad \mu, s, \beta > 0 \]

with β being the scale parameter and s the shape parameter.

4. Independence of purchasing and death rates - There is no evidence to suggest either a positive or a negative correlation between the purchasing and death rates. A heavy purchaser can be careless with money and therefore 'die', or can be strongly attached to the brand and therefore will probably not 'die'.

Fader et al. (2005a) build upon this work. They suggest that instead of using the Pareto distribution, which assumes that customer "death" can occur at any point in time, we can assume that customer "death" happens immediately after a purchase and thus use the beta-geometric distribution. To formalise this, they maintain the Poisson purchases presented in Equation (1) and the Gamma distributed purchase rates featured in Equation (2). They add the following assumptions:

• A customer becomes inactive after a transaction with probability p. This implies that a customer 'dies' at a point drawn from the geometric distribution:

\[ P(\text{inactive immediately after } j\text{th transaction}) = p(1 - p)^{j-1}; \qquad j = 1, 2, 3, \dots \]

• Heterogeneity in p is distributed according to a beta distribution:

\[ f(p \mid a, b) = \frac{p^{a-1}(1 - p)^{b-1}}{B(a, b)} \tag{3} \]

• The transaction rate λ and the dropout probability p vary independently across customers.

The authors build upon these assumptions and derive the expected number of transactions in a given period (Equation 5 in Fader et al. (2005a)):

\[ E(X(t) \mid \lambda, p) = \frac{1}{p} - \frac{1}{p} e^{-\lambda p t}. \tag{4} \]

To calculate the expected number of purchases in the future, the authors maximise the log-likelihood and estimate the parameters r, α, a, b:

\[ LL(r, \alpha, a, b) = \sum_{i=1}^{N} \log\big[L(r, \alpha, a, b \mid X_i = x_i, t_{x_i}, T_i)\big], \]

where the individual likelihood is

\[ L(r, \alpha, a, b \mid X = x, t_x, T) = \frac{B(a, b + x)}{B(a, b)} \frac{\Gamma(r + x)\alpha^r}{\Gamma(r)(\alpha + T)^{r+x}} + \delta_{x>0} \frac{B(a + 1, b + x - 1)}{B(a, b)} \frac{\Gamma(r + x)\alpha^r}{\Gamma(r)(\alpha + t_x)^{r+x}} \]

where δ_{x>0} is 1 if x is greater than 0 and 0 otherwise.

The obtained parameters allow us to calculate p and λ using Equations (2) and (3), which can afterwards be put into Equation (4), thus obtaining the desired expected number of purchases in the future.
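As a small illustration of Equation (4), the expected number of future purchases for an individual customer can be computed directly once λ and p are given; the values below are purely illustrative, not estimates from this thesis.

    import math

    def expected_purchases(lam, p, t):
        # BG/NBD individual-level E[X(t) | lambda, p] from Equation (4)
        return (1.0 / p) - (1.0 / p) * math.exp(-lam * p * t)

    # e.g. a customer purchasing at rate 0.4/month with a 10% dropout probability,
    # over a 12-month horizon
    print(expected_purchases(lam=0.4, p=0.1, t=12))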

Algorithm 1 AdaBoost according to Hastie et al. (2009).

Initialize the weights $w_i = 1/N$, $i = 1, 2, \dots, N$.
For $m = 1$ to $M$:
  1. Fit a classifier $G_m(x)$ to the training data using the weights $w_i$.
  2. Compute $\text{err}_m = \sum_{i=1}^{N} w_i I(y_i \neq G_m(x_i)) \big/ \sum_{i=1}^{N} w_i$.
  3. Compute $\alpha_m = \log((1 - \text{err}_m)/\text{err}_m)$.
  4. Set $w_i \leftarrow w_i \cdot \exp[\alpha_m \cdot I(y_i \neq G_m(x_i))]$, $i = 1, 2, \dots, N$.
Output $G(x) = \text{sign}\big[\sum_{m=1}^{M} \alpha_m G_m(x)\big]$.

AdaBoost This is short for Adaptive Boosting, a powerful ensemble method which combines several weak learners (algorithms with a lower error rate than random guessing, e.g. decision trees), developed by Freund and Schapire (1997). We decided to use AdaBoost instead of other commonly used ML methods such as Random Forest because of its superior performance, as described e.g. in Chapter 15 of Hastie et al. (2009). The algorithm trains several weak learners, in this case decision trees, which are afterwards combined using weighted majority voting to make a final prediction. This is presented in Algorithm 1.
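A minimal sketch of Algorithm 1 with decision stumps as weak learners is shown below; it is written for illustration only (the modelling in this thesis relies on scikit-learn's AdaBoostClassifier, see Section 3.5), and the labels are assumed to be coded as -1/+1.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, M=100):
        N = len(y)
        w = np.full(N, 1.0 / N)                       # initialise w_i = 1/N
        learners, alphas = [], []
        for _ in range(M):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)          # 1. fit G_m using weights w_i
            miss = stump.predict(X) != y
            err = np.sum(w * miss) / np.sum(w)        # 2. weighted error err_m
            err = np.clip(err, 1e-10, 1 - 1e-10)      # guard against err_m in {0, 1}
            alpha = np.log((1 - err) / err)           # 3. learner weight alpha_m
            w *= np.exp(alpha * miss)                 # 4. re-weight misclassified points
            learners.append(stump)
            alphas.append(alpha)
        return learners, np.array(alphas)

    def adaboost_predict(X, learners, alphas):
        # weighted majority vote: G(x) = sign(sum_m alpha_m G_m(x))
        agg = sum(a * g.predict(X) for a, g in zip(learners, alphas))
        return np.sign(agg)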

To explain the algorithm in layman's terms, Freund et al. (1999) use betting on horse racing as an example. If one would like to start betting on horse races, it would be handy to ask an expert for rules of thumb, as it is unfeasible for the expert to share their whole knowledge of the problem. A single rule, such as 'bet on the horse with the highest odds', might not be enough to predict which horse is going to win. However, if we aggregate several rules of thumb from several different experts, we can get a more qualified prediction of the race winner. Two problems arise: which pieces of information are the most useful, and how to combine these crude rules into a single effective and accurate rule. Boosting solves both of these problems.

3 Methodology

In the following part we describe the dataset used and the format of the data. Afterwards, we describe how the dataset was modified and how the features were engineered. Then we specify the two ways in which the training and testing splits were created, how exactly the modelling was done and which evaluation methods were used. We conclude with a summary of the chapter.

3.1 Dataset description

The dataset comes from a small fashion e-commerce company operating across Central and Eastern Europe and consists of transactional data from 2011 to 2015 for roughly 6500 users. The price data are scaled by an unknown constant, which prevents access to the true overall revenue of the company and other significant performance measures. The overall scaled revenue amounts to 286 920 543 Currency Units (CU).

The dataset consists of five columns:

• ID txn - transaction ID
• ID user - ID of the user
• txn time - time of the transaction
• ID product - ID of the bought product
• price - scaled price of the product (in CU)

There are circa 42 thousand rows; if an individual transaction consisted of multiple items, it spans multiple rows. The distribution of transactions in a sample of 50 customers can be seen in Figure 1. There are 6583 users and 28 713 unique transaction IDs, which indicates that when a user bought several items in one transaction, a single transaction ID was issued. The average price is around 6683 CU with a median of around 3367 CU; the 75th percentile is around 6370 CU, which indicates that the price distribution is skewed.

Figure 1: Sample of 50 customers and the distribution of their purchases through time. Users with single transactions are marked as gray squares, and customers are assigned random animal names. We can observe that active users are usually long-term buyers, and most inactive users are one-time buyers. The active buyers also purchase more frequently.

3.2 Dataset wrangling

In order to be able to perform churn analysis, we transformed the transactional data to user-level data. For each of the customers, we created a binary churn target. The company decided to define a churner as someone who has not bought anything in the last year. Thus we created a script which takes two main input arguments: the transaction dataset and the date from which the binary churn target variable should be calculated. To create a reliable target variable, this date should be at least one year before the end of the dataset. The target is created as follows:


1. For each customer in the dataset before the given date, find their latest transaction.

2. If this transaction is more than a year before the given date, mark the customer as a churner and drop them from the subsequent analysis.

3. Otherwise, look one year into the future from this last transaction.

4. If the customer has not bought anything in that year, mark them as a churner, and vice versa.

This method of customer churn identification marked around 60% of the customers in the dataset as churners. It is important to state the reasoning behind this way of marking churners. If we looked one year ahead from a given date and marked churners accordingly, there would be users who had not bought anything in, e.g., the last 7 months and who would buy nothing in the future either, yet they would only be marked as churners 7 months later. Hence, if a user has not bought anything in the last 7 months, we only need to look 5 months into the future, which is the same as looking one year forward from the last transaction.
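A minimal sketch of such a labelling script is shown below, assuming a pandas DataFrame `txns` with the columns from Section 3.1 (with `txn_time` parsed as a datetime); the column and function names are illustrative and do not reproduce the thesis's actual code.

    import pandas as pd

    def label_churn(txns: pd.DataFrame, reference_date: str) -> pd.DataFrame:
        ref = pd.Timestamp(reference_date)
        past = txns[txns["txn_time"] < ref]
        last_txn = past.groupby("ID_user")["txn_time"].max()

        # customers whose last purchase is more than a year before the reference
        # date are already churned and are dropped from further analysis
        active = last_txn[last_txn >= ref - pd.DateOffset(years=1)]

        # a customer churns if no purchase occurs within one year of the last one
        future = txns[txns["txn_time"] >= ref]
        labels = {}
        for user, last in active.items():
            horizon = last + pd.DateOffset(years=1)
            bought_again = ((future["ID_user"] == user) &
                            (future["txn_time"] <= horizon)).any()
            labels[user] = int(not bought_again)       # 1 = churner, 0 = retained

        return pd.Series(labels, name="churn").rename_axis("ID_user").reset_index()

    # usage: label_churn(txns, "2013-05-04") uses the training-set reference date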


In order to maintain the time dimension of the data, we created an additional set of variables. These indicate whether a customer spent more or less than in the quarter before and whether a customer bought more or fewer items in each of their last four quarters. We also created other variables such as the average transaction value and the number of days since the last transaction. An overview of the variables, along with a short explanation, is given in Table 1.

Variable name             Variable description

'T'                       Customer's time (in days) from the first transaction until the end of the dataset
'recency'                 Time (in days) from the first to the last transaction
'frequency'               How many transactions the given user made
'monetary'                How much the given customer spent overall
'items total'             How many items the customer bought altogether (1 transaction can contain several items)
'in work'                 Does the user usually buy in working hours?
'in evening'              Does the user usually buy in the evening?
'avg price txn'           Average price of the transaction
'avg price thing'         Average price of a bought item
'last txn days'           When was the last transaction?
'overall price' 1-4       How much revenue the customer brought in each of the last 4 quarters
'item count' 1-4          How many transactions the customer made in each of the last 4 quarters
'txn rate month'          Average monthly transaction rate
'revenue rate month'      Average monthly revenue rate
'purchase rate gt 12'     Is the average purchase rate higher than 12 per year?
'monetary rate gt 10k'    Is the average monthly spending higher than 10 000 CU?
'trend revenue' 1-4       Is the trend in revenue increasing, for each of the last 4 quarters?

Table 1: Description of the variables used. All were created from the original dataset. These variables were used in the subsequent modelling.
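A minimal sketch of how the core per-user features in Table 1 could be derived from the transaction table is shown below; it assumes the same illustrative `txns` DataFrame as above, and the quarterly and dummy variables would be built analogously from these quantities.

    import pandas as pd

    def build_features(txns: pd.DataFrame, reference_date: str) -> pd.DataFrame:
        ref = pd.Timestamp(reference_date)
        past = txns[txns["txn_time"] < ref]
        g = past.groupby("ID_user")

        feats = pd.DataFrame({
            "T": (ref - g["txn_time"].min()).dt.days,        # days since first purchase
            "recency": (g["txn_time"].max() - g["txn_time"].min()).dt.days,
            "frequency": g["ID_txn"].nunique(),              # number of transactions
            "monetary": g["price"].sum(),                    # total spend in CU
            "items_total": g.size(),                         # one row per item bought
            "last_txn_days": (ref - g["txn_time"].max()).dt.days,
        })
        feats["avg_price_txn"] = feats["monetary"] / feats["frequency"]
        feats["txn_rate_month"] = feats["frequency"] / (feats["T"] / 30.0).clip(lower=1)
        return feats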

3.3 Exploratory dataset analysis

To explore the relationships in the data, we decided to take the dataset of users active in the period between 2013-05-04 and 2014-05-04, consisting of 2900 users. We used the Time, Recency, Frequency and Monetary variables and compared them with the churn variable to get an estimate of the distribution of churners among the customers. We decided to take these variables as they are the most important for the modelling that followed. We can observe the relationships in Figure 2, namely between the churn variable and the mentioned variables, along with another custom variable, T - recency, which represents the number of days since the last transaction. This is also the only variable that is positively correlated with churn, which is logical: with an increasing number of days since the last transaction it becomes more likely that the customer will churn, whereas for the other variables this relationship is reversed. Another interesting observation from the correlation plot is the positive correlation between monetary and frequency, which confirms the logical conclusion that customers with more transactions spend more.

Figure 2: Correlation plot of the RFM variables and churn. We can observe that the churn variable is negatively correlated with all variables except the time since the last transaction, which is positively correlated. This is in line with our expectations.

First, to find out which customers are the most likely to churn, we split the recency variable into several groups and looked at the distribution of churners in these individual groups. We can observe this distribution in Figure 3, where we can see that more than 25% of customers are single buyers who never return. Similarly, customers with only a few transactions in a short period of time are also likely to churn. On the other hand, customers with a lasting relationship will probably stay loyal in the future.

Figure 3: The length of the customer relationship and the churn of users. The red area corresponds to the total share of churners, which is around 59%. We can also observe that customers with very short lifespans are likely to churn.

Second, we created another variable, namely the yearly frequency, which indicates the number of purchases (frequency) divided by the length of the customer relationship (recency). We can observe the distribution of churners among the different groups in Figure 4. As we divide by recency, we omitted users with a single purchase from further analysis; the share of these users is the same as in Figure 3 and amounts to circa 25% of users. We observe a negative correlation (-0.27) between churning and the yearly purchasing frequency for users with a frequency of less than 12. As the data are skewed by people who made a small number of purchases in a short period of time but did not buy anything afterwards, users with a purchasing frequency higher than 12/year are positively correlated with churning (0.31). To capture this nonlinear relationship, we created an additional dummy variable indicating whether a customer has a purchasing frequency higher than 12/year.

Figure 4: Yearly frequency of purchases and its relationship with customer churn. We can see a negative correlation between purchasing frequency and churn, rising again in the most frequent category. The presented data correspond to the circa 75% of users in the dataset with more than one purchase.

Last, we had a look at the connection between the monetary variable and churn. To examine this relationship in depth, we created a variable representing customers' average monthly expenditure and divided the customer base into several distinct groups. This is shown in Figure 5. We can observe a negative correlation (-0.16) between churning and average monthly customer spending below 10 000 CU/month. In contrast, the correlation is 0.29 in the group with an average monthly spending higher than 10 000 CU/month. To capture this nonlinear relationship we decided to create an additional dummy variable indicating whether a customer spends more than 10 000 CU/month.

Figure 5: Average monthly customer revenue and its relationship with churn. We can observe a negative correlation between churning and average monthly customer spending. This does not apply to the group with high monthly expenses, as these data are partially skewed by users with short lifespans, as in Figure 4. Similarly, this graph represents the circa 75% of users with more than one purchase.

3.4 Test / train split

This thesis evaluates algorithmic performance using out-of-sample validation and therefore, before each modelling step, the dataset is split. On the first part, the so-called training dataset, the model is learned, and on the second part, the testing dataset, the model performance is evaluated. This thesis examines the split in two ways.

In the first method, two datasets are created using the script described in Section 3.2. The testing dataset takes the date of the last transaction minus 1 year as its reference date. The training dataset takes the date of the last transaction minus 2 years as its reference date. This allows us to evaluate the algorithm performance in a realistic way and determines whether models based on the historical dataset can be applied in the future. To be specific, the training dataset is taken as the users with a transaction from 2012-05-04 until 2013-05-04 (altogether 2200 users). Afterwards the model is evaluated on the users with at least one transaction from 2013-05-04 until 2014-05-04 (2900 users). The drawback of this method is that the training and testing samples are not completely independent (Arlot et al., 2010). At the same time, this method better reflects real-life use, as the models would be trained on historical data and applied to current data.

The second way to evaluate the model performance is by taking the part with the reference date set to the date of the last transaction minus 1 year and subsetting a random group consisting of 33% of the users to create a testing dataset. The rest becomes the training dataset. The model is learned using the training dataset and the model performance is evaluated using the testing dataset. This is the traditional way of making an out-of-sample prediction, and we can determine whether a model trained on one group of users can be extended to another group. To clarify: we take the dataset of users with at least one transaction within the period between 2013-05-04 and 2014-05-04 (2900 users). We then randomly chose 67% of the users, who became part of the training dataset. The rest became part of the testing dataset.
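A minimal sketch of the two split strategies is shown below; it assumes helper functions `label_churn` and `build_features` along the lines of the earlier sketches, and only the reference dates follow this section, while everything else is illustrative.

    from sklearn.model_selection import train_test_split

    def make_dataset(txns, ref):
        labels = label_churn(txns, ref).set_index("ID_user")
        # keep only users active in the year before the reference date (the others
        # are already known churners, see Section 3.2)
        return build_features(txns, ref).join(labels, how="inner")

    # temporal split: train on the 2012-2013 cohort, test on the 2013-2014 cohort
    train_temporal = make_dataset(txns, "2013-05-04")   # users active 2012-05-04 .. 2013-05-04
    test_temporal  = make_dataset(txns, "2014-05-04")   # users active 2013-05-04 .. 2014-05-04

    # per-user split: one snapshot at the later reference date, random 67/33 user split
    snapshot = make_dataset(txns, "2014-05-04")
    train_per_user, test_per_user = train_test_split(snapshot, test_size=0.33, random_state=0)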

3.5 Modelling

After preparing the data, we created the models using the Python package scikit-learn developed by Pedregosa et al. (2011). This package implements a wide variety of state-of-the-art algorithms and aims to bring machine-learning methods to an audience without a deep Computer Science background. Another important benefit is that a unified interface throughout the package makes it easier to evaluate the algorithms. We used the AdaBoost algorithm mentioned above, logistic regression as the benchmark, the random training/testing split on a per-user basis, cross-validation using grid search and the evaluation metrics, all available in the sklearn package.

In order to determine the best parameters of the model, we added grid-search cross-validation. Therefore, in every case of the AdaBoost model, we calculated the optimal number of n_estimators (the number of weak learners applied in the AdaBoost algorithm), chosen from a grid of 11 values between 10 and 1000, increasing geometrically. At the same time, we tuned the learning rate parameter, which controls the contribution of the individual weak learners to the final combination. The learning rate was chosen from 10 values between 0.1 and 1, with increments of 0.1. For each of the combinations, the F1-score (see Section 3.6) was calculated and we chose the combination with the highest one. Regarding the BG/NBD algorithm, we used the implementation in the lifetimes package created by Cameron (2018). This uses three variables in the modelling: the time since the first transaction, the span in days between the first and last transactions, and the number of items bought. The prediction of the model returns the expected number of purchases in a given period. To transform this output into a binary variable and make the comparison level for both algorithms, we determined a threshold value of expected purchases: we used the model to predict on the training dataset, took the churn rate of the training dataset, and used this value times 100 as the percentile of the predicted purchases that serves as the threshold. Every prediction below this threshold was classified as one, otherwise zero. This ensured a similar churn rate between both algorithms.
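A minimal sketch of this grid search is shown below, assuming training data like the `train_temporal` frame from Section 3.4 with a `churn` target column; the grids mirror the text (11 geometric steps for n_estimators, 10 steps for the learning rate), while the number of folds and the rest of the setup are illustrative assumptions.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import GridSearchCV

    X_train = train_temporal.drop(columns="churn")
    y_train = train_temporal["churn"]

    param_grid = {
        "n_estimators": np.geomspace(10, 1000, 11).astype(int),   # 10, 15, 25, ..., 1000
        "learning_rate": np.round(np.arange(0.1, 1.01, 0.1), 2),  # 0.1, 0.2, ..., 1.0
    }
    search = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid,
                          scoring="f1", cv=5)                     # cv folds assumed
    search.fit(X_train, y_train)
    ada = search.best_estimator_      # combination with the best cross-validated F1-score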

As a benchmark, we used logistic regression without any regularization. It is often used in probability modelling, which makes it a good fit for this problem. As the sklearn package does not support logistic regression without regularization, we decided to use a very small number ($10^{-42}$) as the regularization parameter.
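A minimal sketch of the BG/NBD fit, its conversion to a binary churn prediction, and the logistic-regression benchmark is shown below. The lifetimes calls are the package's standard fitting and prediction methods; the threshold logic follows the description above, and the variable names and data frames are illustrative, carried over from the earlier sketches.

    import numpy as np
    from lifetimes import BetaGeoFitter
    from sklearn.linear_model import LogisticRegression

    # BG/NBD on frequency / recency / T (in days)
    bgf = BetaGeoFitter(penalizer_coef=0.0)
    bgf.fit(train_temporal["frequency"], train_temporal["recency"], train_temporal["T"])

    horizon = 365   # expected purchases over the next year
    exp_train = bgf.conditional_expected_number_of_purchases_up_to_time(
        horizon, train_temporal["frequency"], train_temporal["recency"], train_temporal["T"])

    # threshold = percentile of predicted purchases equal to the training churn rate
    threshold = np.percentile(exp_train, 100 * train_temporal["churn"].mean())

    exp_test = bgf.conditional_expected_number_of_purchases_up_to_time(
        horizon, test_temporal["frequency"], test_temporal["recency"], test_temporal["T"])
    churn_pred_bgnbd = (exp_test < threshold).astype(int)

    # benchmark: logistic regression with an effectively negligible penalty; note that
    # in scikit-learn C is the inverse regularization strength, so a very large C
    # corresponds to the very small regularization described in the text
    logit = LogisticRegression(C=1e42, max_iter=1000)
    logit.fit(train_temporal.drop(columns="churn"), train_temporal["churn"])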

3.6 Evaluation

We used several metrics to evaluate algorithm performance, namely precision and recall, the F1-score and confusion matrices. We decided to use these methods as they are the gold standard in the evaluation of algorithm performance.

Confusion Matrix When evaluating the result of a classification problem on a test set, a common technique is to use a two-dimensional matrix with a row and a column for each class. Each element of the matrix is the number of test-set cases predicted to belong (displayed in columns) and actually belonging (displayed in rows) to a given class. In this case we have only two classes, and therefore such a matrix has dimensions two by two (Witten et al., 2016). In this matrix, we identify several important quantities, described in Table 2. These are used afterwards to determine the other metrics described below.

                     Identified Negative     Identified Positive
Actual Negative      True Negative (TN)      False Positive (FP)
Actual Positive      False Negative (FN)     True Positive (TP)

Table 2: Confusion matrix overview for a binary target variable. Colour coding represents correctly (green) versus incorrectly (red) identified quantities.

Accuracy The most common classification metric is accuracy. It is natural to ask: what is the percentage of correctly classified elements? It can be calculated as follows:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP} \]

However, accuracy is not an adequate metric when the target class is not balanced, as is the case here.

Precision/Recall The unbalanced target classes are the main reason we decided to use the precision-recall metrics to evaluate classifier output quality. Precision is a measure of result relevancy, while recall tells us how many truly relevant results are returned (Pedregosa et al., 2011).

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \]

To put this in layman's terms, precision tells us the share of truly positive cases among the predicted positives, and recall indicates the share of the positive cases that we were able to find in the dataset.


F1-Score To summarise the previous metrics in a single number, the F1-score is often used. It is defined as:

\[ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

The F1-score is the harmonic mean of precision and recall. The main reason for using this mean is that it penalises extreme values; therefore, to have a high F1-score, both precision and recall have to be high.

ROC curve The Receiver Operating Characteristic (ROC) curve is created by plotting the True Positive Rate (the same as sensitivity) against the False Positive Rate (the same as 1 - specificity) for different probability thresholds. It is used to judge the discrimination ability of various statistical methods (Hanley and McNeil, 1982).
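A minimal sketch of the evaluation step with the scikit-learn metrics described above is shown below, assuming a fitted classifier `ada` and the test data from the earlier sketches; everything not in the text is an illustrative assumption.

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score, roc_auc_score)

    X_test = test_temporal.drop(columns="churn")
    y_test = test_temporal["churn"]
    y_pred = ada.predict(X_test)

    print(confusion_matrix(y_test, y_pred))          # [[TN, FP], [FN, TP]]
    print("precision:", precision_score(y_test, y_pred))
    print("recall:   ", recall_score(y_test, y_pred))
    print("accuracy: ", accuracy_score(y_test, y_pred))
    print("F1:       ", f1_score(y_test, y_pred))
    print("AUC:      ", roc_auc_score(y_test, ada.predict_proba(X_test)[:, 1]))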

Bootstrap To evaluate whether there is a significant difference between the individual approaches, we decided to employ the bootstrap. We bootstrapped the training dataset, created the models on each bootstrap sample and evaluated their performance on the test dataset using precision, recall, F1-score and accuracy. These values were stored and tested for whether they significantly differ from each other. For the hypothesis test, we decided to use the Welch two-sample t-test (Welch, 1947), which tests the equality of means under unequal variances and is defined as follows:

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}} \]
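A minimal sketch of this bootstrap comparison is shown below: two models are refitted on resampled training data, scored on the test set, and their bootstrapped F1-scores are compared with a Welch t-test via scipy; the model objects and data frames follow the earlier sketches and are illustrative.

    import numpy as np
    from scipy import stats
    from sklearn.base import clone
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    scores_a, scores_b = [], []
    for _ in range(1000):
        # resample the training set with replacement
        boot = train_temporal.sample(frac=1.0, replace=True,
                                     random_state=int(rng.integers(10**9)))
        Xb, yb = boot.drop(columns="churn"), boot["churn"]
        for model, store in [(ada, scores_a), (logit, scores_b)]:
            fitted = clone(model).fit(Xb, yb)
            store.append(f1_score(y_test, fitted.predict(X_test)))

    # Welch two-sample t-test (unequal variances)
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)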

3.7 Modelling approach summary

The full specification of the modelling process can be summarised in the following steps:

1. Modify the transaction dataset and create adequate variables for each of the relevant customers. Take only customers active in the given year (as the others are already identified as churners).

2. Create the training and testing split:

• Temporal split: the model is trained on the training data, taken as the users with a transaction until 2013-05-04 (altogether 2200 users). Afterwards the model is evaluated on the users with at least one transaction from 2013-05-04 until 2014-05-04 (2900 users).

• Per-user split: the model is trained on a random 67% of the users with at least one transaction between 2013-05-04 and 2014-05-04 and evaluated on the remaining 33%.

3. For each of the splits, the following models were trained:

• BG/NBD
• AdaBoost, trained on the same variables as BG/NBD
• AdaBoost, trained on the extended dataset, as it can utilise more information about the users
• Logistic regression, trained on the RF variables
• Logistic regression, trained on the extended dataset

4. For each of the models and for each of the splits, evaluate the models using:

• a confusion matrix, to have an overview of the prediction, especially the false positives and false negatives
• precision, recall, accuracy and F1-score
• bootstrapping of the training dataset and calculating precision, recall, accuracy and F1-score

4 Results

We first discuss the parameters obtained for the models created on the training datasets. Afterwards, we look at the prediction performance on the test sets created using the two different techniques, namely the temporal split and the per-user split. Finally, we discuss the differences in the performance of the models between the techniques.

4.1 Model parameters on training datasets

First, we look at the estimated parameters of the BG/NBD models, which are given in Table 3. In the temporal split, the threshold number of purchases is close to one, which is intuitive: when the model predicts less than one purchase, the user is not expected to buy again. The models in both settings also accurately predicted the share of churners in the dataset. We can observe that the shape parameters a, b of the Beta distribution explaining the heterogeneity are smaller in the temporal split than in the per-user split, and so is the ratio a/b.

                          temporal split    per-user split
a                         0.42              0.71
b                         7.78              28.07
α                         9.87              12.76
r                         0.51              0.61
threshold                 0.998             1.24
predicted churn rate      0.583             0.611
test-set churn rate       0.599             0.584

Table 3: Parameters of the BG/NBD model estimated on the training datasets.

The parameters obtained for the AdaBoost algorithm can be observed in Table 4. While using the temporal split, the cross-validation optimum was achieved using a lower number of weak learners than in the per-user split. It is in line with our expectations that with a lower number of variables fewer weak learners are needed; this can be seen in the comparison of the AdaBoost models trained on the dataset with extended features and on the dataset with RF features. Similarly, it is no surprise that with a larger number of weak learners the learning rate decreases, so that more learners contribute to the final result.

4.2 Prediction using temporal split

In this setting, while we can observe that AdaBoost trained on the dataset with extended features performs similarly to the BG/NBD model, AdaBoost trained on RFM features lags behind and performs even worse than the logistic regression, as can be seen in Table 5. The AdaBoost algorithm trained on the extended dataset performs best, with the BG/NBD algorithm matching its performance. The performance of the AdaBoost algorithm using RF variables is worse than that of the logistic regression. We can compare the prediction performance in the confusion matrices presented in Table 6.

AdaBoost setting                     learning rate    # of learners
extended dataset, temporal split     0.2              158
RF variables, temporal split         0.8              15
extended dataset, per-user split     0.3              630
RF variables, per-user split         0.8              63

Table 4: AdaBoost parameters for the different settings, obtained by cross-validation on the training dataset. We can observe the tradeoff between the number of learners and the learning rate.

Algorithm (temporal split)           Precision   Recall   Accuracy   F1-score   AUC
AdaBoost (extended data)             0.86        0.86     0.858      0.86       0.934
AdaBoost (RF var.)                   0.84        0.81     0.810      0.81       0.902
BG/NBD                               0.87        0.87     0.872      0.87       0.941
Logistic Regr. (extended data)       0.80        0.80     0.801      0.80       0.899
Logistic Regr. (RF var.)             0.83        0.83     0.828      0.83       0.931

Table 5: Performance of the algorithms in the temporal split setting. We report average precision, recall and F1-score over both classes. These values are obtained by evaluating the models on the test dataset consisting of 2925 entries.

It is worth noting that while AdaBoost performs better at identifying people not likely to churn and BG/NBD is better at identifying churners, both algorithms have a similar accuracy. Similarly, it is important to note that the logistic regression performs worse with the extended features than with the RF features. Further investigation revealed that this is caused by overfitting.

AdaBoost (extended)              Pred 0    Pred 1
Act 0                            1033      139
Act 1                            275       1478

AdaBoost (RF vars)               Pred 0    Pred 1
Act 0                            1081      91
Act 1                            464       1289

Logistic regression (extended)   Pred 0    Pred 1
Act 0                            750       422
Act 1                            160       1593

Logistic regression (RF vars)    Pred 0    Pred 1
Act 0                            826       346
Act 1                            155       1598

BG/NBD (RF vars)                 Pred 0    Pred 1
Act 0                            1008      164
Act 1                            211       1542

Table 6: Confusion matrices in the temporal split setting.

Figure 6: ROC curves of the algorithms from predictions on the test set using the temporal split. These ROC curves are created from the probability predictions of AdaBoost and the logistic regressions. For BG/NBD, the predicted number of purchases is used.

Algorithm (per-user split)           Precision   Recall   Accuracy   F1-score   AUC
AdaBoost (extended data)             0.90        0.90     0.901      0.90       0.961
AdaBoost (RF var.)                   0.89        0.89     0.886      0.89       0.958
BG/NBD                               0.87        0.87     0.869      0.87       0.945
Logistic Regr. (extended data)       0.85        0.84     0.845      0.84       0.940
Logistic Regr. (RF var.)             0.87        0.87     0.863      0.87       0.947

Table 7: Performance of the models in the per-user split setting. We report average precision, recall and F1-score over both classes. These values are obtained by evaluating the models on the test dataset consisting of 966 entries.

4.3 Prediction using per-user split

We present the out-of-sample prediction performance when conducting the dataset split on a per-user basis in Table 7. The detailed performance, presented as confusion matrices, is in Table 8.

We can observe that AdaBoost performs better on the same set of variables than both the logistic regression and the BG/NBD model. The AdaBoost model trained on the extended dataset performs even better, as it can utilise additional information about the user. By contrast, adding additional features to the model worsened the logit performance. We can observe that the AdaBoost models were the best in every presented metric, and that the performance of the BG/NBD and the logistic regression is similar on the same set of variables. If we observe the ROC curves in Figure 7, it is interesting that the logit has the highest True Positive Rate (or sensitivity) when the False Positive Rate (or 1 - specificity) is low. This might make the logit the preferred model in some settings, where the campaign costs might be too high and addressing users unnecessarily would be undesirable. However, with increasing FPR the logit performance quickly becomes the worst of all the algorithms.

AdaBoost (extended)              Pred 0    Pred 1
Act 0                            354       47
Act 1                            49        516

AdaBoost (RF vars)               Pred 0    Pred 1
Act 0                            347       54
Act 1                            56        509

Logistic regression (extended)   Pred 0    Pred 1
Act 0                            305       96
Act 1                            45        520

Logistic regression (RF vars)    Pred 0    Pred 1
Act 0                            332       69
Act 1                            63        502

BG/NBD (RF vars)                 Pred 0    Pred 1
Act 0                            325       76
Act 1                            50        515

Table 8: Confusion matrices in the per-user split setting.

Figure 7: ROC curves of the algorithms from predictions on the test set using the per-user split. These ROC curves are created from the probability predictions of AdaBoost and the logistic regressions. For BG/NBD, the predicted number of purchases is used.

4.4 Comparison of approaches

We can observe that the BG/NBD model prediction performance remains constant across the out-of-sample split approaches. On the other hand, both the logistic regression and the AdaBoost performance differ between the approaches. This is most evident in the case of the models learned on the RF variables, where almost all metrics differ between the settings by approximately 8 p.p., whereas the differences for BG/NBD are all within 1 p.p. The best performance overall is achieved by the AdaBoost algorithm using the extended dataset in the per-user training and testing split.

4.5 Comparison of model performance using bootstrap

To determine whether the presented results are not coincidental, we decided to use the bootstrap. We ran 1000 bootstrap runs and stored the values of precision, recall, F1-score and accuracy obtained by evaluation on the testing dataset. The observed distributions can be seen in Figures 8, 9, 10 and 11, respectively.

Precision We can observe from Figure 8 that the AdaBoost and BG/NBD scores are similar in the temporal split. To evaluate this, we test the following hypothesis:

$H_0: \mu^{\text{precision}}_{\text{AdaBoost(ext. data)}} = \mu^{\text{precision}}_{\text{BG/NBD}}$ for the temporal split
$H_a: \mu^{\text{precision}}_{\text{AdaBoost(ext. data)}} < \mu^{\text{precision}}_{\text{BG/NBD}}$ for the temporal split

When evaluated using a two-sample t-test, we reject the null hypothesis with p-value < 0.001; therefore we conclude that we found enough evidence that BG/NBD performs better in this setting. However, when performing the analysis on the per-user split, the AdaBoost model performs best, and this is supported by the fact that the two-sample t-test rejects the null hypothesis when comparing $H_0: \mu^{\text{precision}}_{\text{AdaBoost(ext. data)}} = \mu^{\text{precision}}_{\text{other model}}$ against $H_a: \mu^{\text{precision}}_{\text{AdaBoost(ext. data)}} > \mu^{\text{precision}}_{\text{other model}}$.

Figure 8: Bootstrapped precision distributions for individual models. The label ext-True indicates that the extended dataset was used; ext-False indicates the use of the RF variables.


Figure 9: Bootstrapped recall distributions for individual models. The label ext-True indicates that the extended dataset was used; ext-False indicates the use of the RF variables.

Recall First, if we observe the temporal split, we find that the logistic regressions on both the extended dataset and the RF variables have the highest recall. However, the most important observation is that BG/NBD has a higher recall than both AdaBoost models. All statements are supported by the relevant hypothesis tests with p-value < 0.001.

Second, in the per-user split the recall of AdaBoost is almost the same as the recall of the logistic regression trained on the extended dataset; we can reject neither the two-sided two-sample t-test, as the p-value is 0.475, nor the one-sided alternative $H_a: \mu^{\text{recall}}_{\text{LR(ext. data)}} < \mu^{\text{recall}}_{\text{AdaBoost(ext. data)}}$, with p-value = 0.2375. At the same time, these models have a higher recall than the other models with p-value < 0.001.

F1-score As mentioned in the previous paragraphs, in the temporal split the BG/NBD model was second in precision and third in recall, so we can say that it balances the precision/recall tradeoff well, which can be seen in it having the best F1-score, with p-value < 0.001. When conducting the per-user split, AdaBoost was first in both precision and recall and therefore has the highest F1-score, again with p-value < 0.001.

Accuracy Similarly to the previous paragraph, the BG/NBD model has the highest accuracy in the temporal split setting, and the AdaBoost model on the extended dataset in the per-user split setting.


Figure 10: Bootstrapped F1-score distributions for individual models. The label ext-True indicates that the extended dataset was used; ext-False indicates the use of the RF variables.

Figure 11: Bootstrapped accuracy distributions for individual models. The label ext-True indicates that the extended dataset was used; ext-False indicates the use of the RF variables.


5 Conclusion

We have thoroughly compared methods of customer churn prediction and evaluated their predictive performance in two different settings. This analysis was done on a single dataset, representing a fashion-oriented e-commerce company. We observed that the algorithm performance is quite high, but that the machine learning methods perform differently depending on the method of out-of-sample splitting. This is important for two reasons:

• It is crucial to first state what the problem looks like and which method suits it better. In the case of the fashion e-commerce business, the temporal split makes more sense, as it reflects how the data would be used in the future. If the business needs to examine who will churn at time $t_k$, it would use the data from the period $(t_0, t_k)$, $0 < k$, and create a model learned on these data. It would then look at the period $(t_k, t_n)$, $k < n$, and analyse who will churn in this period. The temporal out-of-sample prediction tries to simulate this use case.

• From the results drawn when the per-user split was conducted, we could easily conclude that the performance of the AdaBoost algorithm is superior to the traditional probabilistic methods of customer churn prediction, as is suggested by Tamaddoni et al. (2016).

Thus a researcher could be lured into using AdaBoost (or any other machine learning algorithm) when a well-known probabilistic method for the estimation of customer churn already exists. In addition, a lot of time is saved as there is no need to create additional features. The BG/NBD model provides us not only with a binary classification of whether the user will buy again, but also with the expected number of purchases they will make.
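As a sketch of this last point, the Lifetimes package (Cameron, 2018) cited in the references exposes the BG/NBD model as BetaGeoFitter; the expected number of purchases over a chosen horizon can then be thresholded into a binary churn prediction. The 90-day horizon and the use of the package's bundled CDNOW sample are assumptions made for illustration only.

```python
# Minimal sketch: expected future purchases from the BG/NBD model via the
# Lifetimes package (Cameron, 2018), using its bundled CDNOW sample summary.
# For raw transaction logs, lifetimes.utils.summary_data_from_transaction_data
# builds the same frequency/recency/T summary.
from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary

summary = load_cdnow_summary(index_col=[0])  # columns: frequency, recency, T

bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(summary["frequency"], summary["recency"], summary["T"])

# Expected number of purchases in the next 90 days per customer; thresholding
# this quantity yields the binary churn classification.
horizon = 90  # assumed prediction horizon in days
expected = bgf.conditional_expected_number_of_purchases_up_to_time(
    horizon, summary["frequency"], summary["recency"], summary["T"]
)
print(expected.head())
```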

On the one hand, one could argue that these statements apply only to this individual use case on the presented dataset. We are aware that this analysis might be an outlier, and future work could replicate this thesis on multiple transaction datasets. On the other hand, in order to make the conclusions as broad as possible, we bootstrapped the results, which reduces the risk that the conclusions are driven by the idiosyncrasies of a single test sample.
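A minimal sketch of the bootstrap procedure referred to here is given below, assuming hold-out label and prediction arrays with illustrative names; resampling the test set with replacement yields a distribution of the metric rather than a single point estimate.

```python
# Minimal sketch: bootstrapping a test-set metric by resampling with replacement.
# y_true / y_pred are illustrative placeholders for hold-out labels and predictions.
import numpy as np
from sklearn.metrics import f1_score


def bootstrap_metric(y_true, y_pred, metric=f1_score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resampled indices
        scores[b] = metric(y_true[idx], y_pred[idx])
    return scores


scores = bootstrap_metric([1, 0, 1, 1, 0, 1, 0, 0], [1, 0, 1, 0, 0, 1, 1, 0])
print(scores.mean(), scores.std())
```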

At the same time, we are also aware that no empirical work has been done on which of the presented methods is the most profitable, even though profitability is the most important measure for the business. This stands as a suggestion for future work.

To conclude, it is handy to have a Swiss army knife in your pocket, but one does not see chefs using them in the kitchen. Similarly, as the publicity around machine learning³ is currently strong, one might assume that these methods are the best solution for everything. Even though they are indeed powerful, one should be careful before using them, as other well-known solutions to the problem might exist. We think that the existing probabilistic methods for customer churn prediction are as precise as the machine learning methods for this use case, but have the benefit of not needing as much information to perform as well, and can bring deeper insight into the prediction.

³ Top Trends in the Gartner Hype Cycle for Emerging Technologies, 2017, Gartner: https://www.gartner.com/smarterwithgartner/top-trends-in-the-gartner-hype-cycle-for-emerging-technologies-2017/


References

Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79.

Buttle, F. (2004). Customer relationship management. Routledge.
Cameron, D.-P. (2018). Lifetimes, GitHub repository.

Fader, P., Hardie, B., and Berger, P. D. (2004). Customer-base analysis with discrete-time transaction data.

Fader, P. S., Hardie, B. G., and Lee, K. L. (2005a). "Counting your customers" the easy way: An alternative to the Pareto/NBD model. Marketing Science, 24(2):275–284.
Fader, P. S., Hardie, B. G., and Lee, K. L. (2005b). RFM and CLV: Using iso-value curves for customer base analysis. Journal of Marketing Research, 42(4):415–430.

Freund, Y., Schapire, R., and Abe, N. (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780.

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139.

Gupta, S., Hanssens, D., Hardie, B., Kahn, W., Kumar, V., Lin, N., Ravishanker, N., and Sriram, S. (2006). Modeling customer lifetime value. Journal of service research, 9(2):139–155.

Hanley, J. A. and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer, 2 edition.

Kumar, V. and Petersen, J. A. (2005). Using a customer-level marketing strategy to enhance firm performance: a review of theoretical and empirical evidence. Journal of the Academy of Marketing Science, 33(4):504–519.

Kumar, V. and Reinartz, W. (2018). Customer relationship management: Concept, strategy, and tools. Springer.

Ngai, E. W., Xiu, L., and Chau, D. C. (2009). Application of data mining techniques in customer relationship management: A literature review and classification. Expert systems with applications, 36(2):2592–2602.


Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Rosenberg, L. J. and Czepiel, J. A. (1984). A marketing approach for customer retention. Journal of consumer marketing, 1(2):45–51.

Schmittlein, D. C., Morrison, D. G., and Colombo, R. (1987). Counting your customers: Who are they and what will they do next? Management science, 33(1):1–24.

Schultz, D. E. (2000). Learn to differentiate CRM's two faces. Marketing News, 34(24):11.

Swift, R. S. (2001). Accelerating customer relationships: Using CRM and relationship technologies. Prentice Hall Professional.

Tamaddoni, A., Stakhovych, S., and Ewing, M. (2016). Comparing churn prediction techniques and assessing their performance: a contingent perspective. Journal of service research, 19(2):123–141.

Vanderveld, A., Pandey, A., Han, A., and Parekh, R. (2016). An engagement-based customer lifetime value system for e-commerce. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 293–302. ACM.

Wang, S., Cavusoglu, H., and Deng, Z. (2016). Early mover advantage in e-commerce platforms with low entry barriers: The role of customer relationship management capabilities. Information & Management, 53(2):197–206.

Weinstein, A. (2002). Customer retention: A usage segmentation and customer value approach. Journal of Targeting, Measurement and Analysis for Marketing, 10(3):259–268.

Welch, B. L. (1947). The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34(1/2):28–35.

Witten, I. H., Frank, E., Hall, M. A., and Pal, C. J. (2016). Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.


List of tables and figures

List of Figures

1 Sample of 50 customers and distribution of their purchases through time
2 Correlation plot of RFM variables and churn
3 Recency and relationship to customer churn
4 Yearly frequency of the purchases and churn
5 Average monthly customer revenue and its relationship with churn
6 ROC curves in temporal split setting
7 ROC curves in per-user split setting
8 Bootstrapped precision distributions for individual models
9 Bootstrapped recall distributions for individual models
10 Bootstrapped F1-score distributions for individual models
11 Bootstrapped accuracy distributions for individual models

List of Tables

1 Variable description
2 Confusion matrix overview
3 Parameters of the BG/NBD model
4 AdaBoost parameters for different settings
5 Performance of algorithms in temporal split setting
6 Confusion matrices in temporal split setting
7 Performance of models in per-user split setting
