
Customer Churn Prediction in Fitness Industry with Machine Learning Methods

Submitted in partial fulfillment for the degree of Master of Science

Tom van den Bogaart
10020934

Master Information Studies: Data Science
Faculty of Science, University of Amsterdam

2018-07-27

Internal Supervisor: Tamis van der Laan (UvA, FNWI, IvI)
External Supervisor: Ruben Visser (Virtuagym)

Customer Churn Prediction in Fitness Industry with Machine Learning Methods

Submitted in partial fulfillment for the degree of Master of Science

Author:

Tom van den Bogaart

University of Amsterdam Amsterdam, The Netherlands

info@tombogaart.com

Supervisor:

Tamis Achilles van der Laan

University of Amsterdam Amsterdam, The Netherlands

t.a.vanderlaan@uva.nl

ABSTRACT

It is often more profitable for companies to have long-term relationships with customers than to acquire new customers. This study investigates if we can use machine learning methods to predict whether customers of a fitness gym are likely to churn based on data of membership activities. Churn occurs when customers cancel their membership. This study generated two datasets allowing for a comparison of the effect on the churn prediction performance of classifiers when trained with single versus multiple observations per customer. While single observations per customer look at the data for one single period in time, multiple observations per customer include the data for multiple periods in time, resulting in a larger dataset. The ensemble model XGBoost is found to be the most effective model for the single-record dataset and a highly effective model for the multiple-record dataset, with mean AUROCs of 83% and 80% respectively. We observe that both the XGBoost model and the Logistic Regression model significantly improve their prediction performance when trained on multiple observations per customer compared to single observations per customer. This might be because the multiple-record training set includes a vast amount of non-churn examples across different time periods, allowing models to better generalize periodic patterns with respect to non-churn behavior. This in turn contributes to a more accurate separation between the churn and non-churn classes. Finally, the most useful features for next month churn prediction are found to be a combination of 12 features that together describe the customer's activity, their basic demographics and calendar-based periodic patterns.

1 INTRODUCTION

Customer retention is of increasing importance for companies, as it is often more profitable to have long-term relationships with customers than to acquire new customers in this saturated and competitive market [1]. Machine learning methods can be used to identify customers that are more likely to churn, which occurs when customers cancel their business with a service or company. As such, these methods can help improve a company’s ability to target these customers in order to retain them. This study aims to predict next month customer churn in the fitness industry by applying state-of-the-art machine learning methods based on data of membership activities. As such, the main research question of this study is as follows:

Research Question 1 “What machine learning model allows for effective customer churn prediction in the fitness industry based on their membership data?”

Recent studies which utilized data of historical customer behavior for churn prediction use a single observation per customer that describes the data for one single period in time [2, 3, 4, 5]. As such, a typical churn training dataset includes one record per customer, discarding data of different periods in time that could potentially be used to help train a classifier. Multiple observations per customer, however, include the data for multiple periods in time, resulting in multiple records per customer in a dataset. This study generated two datasets allowing for a comparison of the effect on the churn prediction performance of classifiers when trained with single versus multiple observations per customer.

Research Question 2 “Does using multiple observations per customer significantly increase next month churn predictive accuracy when compared with the traditional single observation per customer?”

In an attempt to effectively predict customer churn in the fitness industry it is useful to find the drivers of churn. In order to identify these drivers, this study conducts several experiments. Firstly, the linear correlation between churn and customer activities along with basic demographics is analyzed. Secondly, the study analyzes which features are most useful for characterizing customer behavior that leads to churn. These findings are used to determine the most informative features that in turn can improve prediction performance. As such, a third research question is addressed:

Research Question 3 “Which features are most useful in predicting customer churn for customers in the fitness industry based on their membership data?”

The following chapter provides background information, whereby it elaborates on previous research on predictive churn modelling. Subsequently, chapter 3 describes the methodology of the study. Chapter 4 presents the results of the experiments. In the final chapter, we present a summary of the findings of this study and aim to answer the research questions. Furthermore, we provide suggestions for future research.

2 THEORETICAL FRAMEWORK

This chapter provides an introduction to some of the relevant terminology and previous research on predictive churn modelling.


2.1 Machine Learning Techniques for Churn Prediction

Little research on churn prediction in the fitness industry exists that uses machine learning methods. However, several studies have investigated the application of machine learning techniques to predict churn in other industries.

2.1.1 Naive Bayes

The Naive Bayes (NB) classifier is a simple probabilistic classifier that is based on applying Bayes’ theorem of probability with strong naive independence assumptions. In particular, it assumes that all features of a class are independent of each other. According to Behara and Nath [6], the NB classifier has proved to be an effective classifier for predicting churn in the wireless telecommunication industry, as their model obtained a 68% predictive accuracy. Although using accuracy as the only evaluation metric might not provide a reliable indication of the performance of a model, Vafeiadis et al. [7] also point out that the NB classifier is not as effective as other classification methods, such as Support Vector Machines, Decision Trees and Artificial Neural Networks, for predicting customer churn of telecommunication customers.

2.1.2 Support Vector Machines

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both regression and classification problems. A SVM classifier finds the hyperplane that maximizes the distance between the nearest examples of different classes that lie in a transformed input space [8]. As reported by Xia and Jin [9], SVM outperforms the NB classifier, Decision Trees, Artificial Neural Networks and Logistic Regression in predicting customer churn in telecommunication based on various evaluation measures. Furthermore, Vafeiadis et al. [7] showed that SVM achieved an accuracy of 93% and F-score of 73% for predicting customer churn of telecommunication customers. Moreover, they found the best classifier to be a boosted SVM using the AdaBoost algorithm, with a 97% accuracy and 84% F-score.

2.1.3 Decision Tree Learning

Decision Tree (DT) learning is a supervised learning method that is used for classification and regression. The motive for using a DT for classification is to build a model that predicts the class of the target variable by learning decision rules that are inferred from features in the training data. Although DTs do not have the ability to capture complex relationships between features, Hadden et al. [10] point out that this technique performed best in a comparative analysis with Artificial Neural Networks and Linear Regression in terms of overall accuracy of predicting customer churn. Furthermore, Vafeiadis et al. [7] report that DT outperforms the NB classifier, SVM and Logistic Regression in predicting customer churn, based on associated testing error.

2.1.4 Artificial Neural Networks

Artificial Neural Networks (ANNs) can be described as computer programs inspired by the biological neural networks of living species [11]. ANN is a broadly used approach to solve complex problems, including churn prediction. It is increasing in popularity due to its proven ability to generalize complex functions, its robustness towards data-preprocessing and the emergence of accessible computation power [12]. However, it can be hard to interpret a trained network with respect to the problem at hand, because it provides little insight into the influence of the variables that were included in the prediction process. According to Vafeiadis et al. [7], ANN was the method with the best performance (without the use of boosting) in terms of corresponding testing error, with an accuracy of 94% and F-score of 77%. Furthermore, Prashanth et al. [1] report that ANN performs better than general linear modelling for customer churn prediction. Finally, Fridrich [12] argues that optimization of the hyper-parameters of ANN classification models improves customer churn prediction performance.

2.1.5 Logistic Regression

Logistic Regression (LR) analysis is a statistical classification model that can be used to estimate the probability of the target variable, such as churn, based on a set of independent predictor variables. As reported by Vafeiadis et al. [7], LR falls short in performance in terms of accuracy and F-score in predicting customer churn when compared to SVM, DT and ANN. This result is in line with the findings of Prashanth et al. [1], who observed that Logistic Regression is outperformed by non-linear models such as Random Forest and ANN when applied for churn prediction.

2.1.6 Ensemble Classification Methods

Ensemble classification methods create one aggregated model by combining a set of classifiers in order to improve classification performance [13]. Generally, there are two types of ensemble methods: bagging and boosting. In bagging methods, each classifier in the ensemble is constructed using bootstrapped training sets (i.e. samples taken with replacement), after which their predictions are combined through a majority vote. Random Forest is a successful ensemble classification method and is a variation upon bagging [14]. In boosting methods, on the other hand, ensembles are built incrementally, where each newly added model aims to minimize the bias of the combined model. Misclassified examples are attributed higher importance in the training data by assigning different weights after every iteration, which forces classifiers to focus on examples that are more difficult to classify correctly. Prominent examples of boosting algorithms are AdaBoost, Gradient Tree Boosting (GTB) and Extreme Gradient Boosting (XGBoost).

XGBoost is an implementation of a decision tree boosting algorithm, which is widely used in the data science community to achieve state-of-the-art results on a wide range of machine learning challenges [15]. Nimmagadda et al. [16] report that XGBoost had the best prediction performance in a comparison with LR and ANN, as it gave the lowest cross-entropy loss (0.117) for the problem of predicting whether users of a music streaming service will cancel their subscription in the next month.
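For illustration, the sketch below trains a gradient-boosted tree classifier with the xgboost Python package; the feature matrix, labels and hyper-parameter values are synthetic placeholders, not this study's data or configuration.

```python
# Minimal sketch: training an XGBoost classifier for churn prediction.
# X and y are synthetic placeholders; hyper-parameters are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))       # 12 behavioral/demographic features
y = rng.integers(0, 2, size=1000)     # 1 = churn, 0 = non-churn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# Probability estimates allow ranking customers by churn risk.
churn_probability = model.predict_proba(X_test)[:, 1]
```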

2.2 Evaluation Criteria

This study uses the measures below in order to evaluate the performance of models in customer churn prediction. These measures are computed from the confusion matrix that is illustrated in Table 1. We represent true positive and false positive outcomes as TP and FP respectively, while we represent true negative and false negative outcomes as TN and FN.

2.2.1 Precision

Precision is the fraction of positive identifications that were correct and is defined as follows:


Precision = TP / (TP + FP)    (1)

2.2.2 Recall

Recall is the fraction of actual positives that were identified correctly and is defined by the following equation:

Recall = TP / (TP + FN)    (2)

2.2.3 Accuracy

Accuracy is the fraction of the total number of predictions that were correctly identified and can be calculated as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (3)

When we use accuracy, however, the formula attributes the same relevance to each class. As such, when dealing with an imbalanced dataset (for instance, one where 99% of the records are in one class and only 1% in the other), a classifier might reach an accuracy of 99% by simply predicting that every instance belongs to the majority class; accuracy thus does not provide a reliable measure of the performance of a model.

2.2.4 F-score

The F-score is a measure that combines precision and recall into a single score. It is defined as the harmonic mean of the precision and recall:

F-score = 2 × (Precision × Recall) / (Precision + Recall)    (4)

2.2.5 ROC Curve and AUROC

The Receiver Operating Characteristics (ROC) curve is a graph that shows the performance of a classification model at all classification thresholds. This curve plots the true positive rate against the false positive rate at different classification thresholds, which are defined as follows:

True Positive Rate = TP / (TP + FN)    (5)

This metric is a synonym for Recall and thus measures the fraction of actual positives that were identified correctly.

False Positive Rate = FP / (FP + TN)    (6)

This metric measures the fraction of actual negatives that were misclassified by the model as positive.

The Area Under the ROC Curve (AUROC) measures how well a classifier can discriminate between two classes. The AUROC is calculated by measuring the area under the ROC curve. A random predictor has a diagonal ROC curve, which has an AUROC of approximately 0.5. The random predictor is commonly used as a baseline to determine how well a given model is performing in comparison to random guessing. Moreover, the AUROC is a useful measure when dealing with an imbalanced dataset, since it allows us to evaluate whether a particular model is able to learn the rare class.

                   Predicted Churn   Predicted Non-Churn
Actual Churn       TP                FN
Actual Non-Churn   FP                TN

Table 1: Confusion matrix for measuring classifier performance
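As a concrete illustration of these definitions, the following sketch computes the ROC curve and AUROC with scikit-learn; the label and score arrays are toy placeholders.

```python
# Sketch: ROC curve and AUROC from predicted churn probabilities.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]                 # actual classes (1 = churn)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted churn probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUROC:", roc_auc_score(y_true, y_score))  # ~0.5 would be random
```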

2.3 Challenge of Class Imbalance

The class imbalance problem typically refers to a classification problem where one class is represented by the majority of examples, while the other is represented by only a small portion of the dataset [17]. For the sake of clarity, we define the class imbalance problem in the context of churn as the situation in which the total number of positive (i.e. churn) examples is far less than the total number of negative (i.e. non-churn) examples. This imbalanced class distribution hinders the performance of some standard classifiers, as it leads the algorithms to ignore the rare positive class and classify all the records as negative while still yielding a high accuracy. In fact, machine learning methods are often applied with the aim to predict the rare positive class, rather than the negative class.

Different studies propose a number of sampling approaches to address the class imbalance problem, which can be summarized into the following four important groups (a usage sketch follows the list):

(1) Random Undersampling, a technique that randomly eliminates majority-class examples from the dataset in order to equalize the class distribution [18].

(2) Random Oversampling, a technique that mitigates the class imbalance problem by randomly replicating the under-represented positive class [18].

(3) SMOTE, a form of oversampling of the minority class that adds synthesized examples as a linear combination of existing minority-class examples [19].

(4) ADASYN, also an oversampling technique, which uses a weighted distribution for the minority-class examples according to their level of difficulty in learning, where more synthetic examples are created for minority-class examples that are difficult to learn [20].
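All four approaches are available in the imbalanced-learn package; the sketch below applies each of them to a synthetic imbalanced dataset (the data and parameters are placeholders, not the study's setup).

```python
# Sketch: applying the four sampling approaches with imbalanced-learn.
from imblearn.over_sampling import ADASYN, RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic dataset with roughly a 96%/4% class distribution.
X, y = make_classification(n_samples=2000, weights=[0.96], random_state=0)

for sampler in (RandomUnderSampler(random_state=0),
                RandomOverSampler(random_state=0),
                SMOTE(random_state=0),
                ADASYN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)   # balanced class distribution
    print(type(sampler).__name__, len(y_res))
```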

However, these sampling techniques have several disadvantages. Undersampling discards potentially useful examples belonging to the majority class and can therefore deteriorate the performance of a classifier [18]. Oversampling techniques introduce additional training examples and can thus increase the amount of time to train a classifier. Furthermore, because oversampling techniques create identical or similar copies of examples, it might lead to overfitting [18].

2.4 Representation of Dynamic Attributes

Predictive models in customer churn generally rely on static and dynamic attributes [21]. The customer attributes that do not change over time are the static attributes. For example, ‘gender’ is considered static during the lifetime of a customer. On the other hand, the dynamic attributes describe the customers’ behavioral changes over time. Dynamic attributes are, however, more difficult to manage, since their values usually change over time. Traditionally, this type of information is summarized into static attributes, which allows them to be represented in a dataset [21]. For this reason, we have to choose a sampling frequency. For instance, if we want to model customers’ check-ins at a gym, we can count their visits for each month or for every week in order to create a static attribute from their dynamic attributes.
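A minimal pandas sketch of this summarization step is given below, assuming a hypothetical raw check-in log with customer_id and timestamp columns; a calendar month is used as the sampling frequency.

```python
# Sketch: summarizing a dynamic attribute (check-ins) into a static
# monthly count with pandas. The raw check-in log is a placeholder.
import pandas as pd

checkins = pd.DataFrame({
    "customer_id": ["001", "001", "001", "002"],
    "timestamp": pd.to_datetime(["2017-01-03 09:00", "2017-01-20 18:30",
                                 "2017-02-05 07:45", "2018-04-11 12:00"]),
})

# Count visits per customer per calendar month.
monthly = (checkins
           .groupby(["customer_id", checkins["timestamp"].dt.to_period("M")])
           .size()
           .rename("checkins_per_month"))
print(monthly)
```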

2.4.1 Dynamic Timeline

In most industries, churn is a rare phenomenon, which results in a small number of churners compared to the number of active customers within a given time window [18, 22]. As mentioned earlier, some standard classifiers have difficulties with learning patterns of examples with an imbalanced class distribution. Weiss et al. argue that the absolute rarity of a class also makes it difficult to find patterns within this rare class and can therefore hinder the classification performance [23]. Absolute rarity occurs when “[. . . ] the number of examples associated with the rare class/case is small in an absolute sense” [23]. Particularly, greedy algorithms suffer from this problem [21]. For instance, the Decision Tree approach starts with all the data and then repeatedly partitions the data into smaller groups, which may lead to data fragmentation when dealing with a rare class [23]. Burez et al. [18] argue that this is a problem, since “regularities can then only be found within each individual partition, which will contain less data”.

As such, a problem arises when we try to analyze the customers’ dynamic attributes for a particular time period. More specifically, when using a fixed time period, the activities in this interval are the only activities analyzed. Consequently, the data of customers who churned before or during the interval will not be fully used to train a model, as there is no data after a customer churns. Related literature suggests using a dynamic time period instead, which differs for each customer (see Figure 1) [24]. This way, we are able to scan behavior for each customer dynamically using a variable time window.

2.5 Dataset Frameworks for Churn Prediction

This section describes two churn prediction frameworks which utilize the dynamic nature of customer data.

2.5.1 Single Period Training Data (SPTD)

The SPTD, as first introduced by Lee et al. [3], is a training dataset that includes customer history data of one continuous period of time. This training dataset consists of one record per customer, whose dynamic attributes (i.e. historical behavior) are summarized into single-valued features by using an aggregation function, e.g. the median or the mean. The target variable is defined as a binary variable: customers who churned in the period under examination are denoted as 1, while the customers who did not churn are characterized by a 0. Typically, the SPTD is produced in such a way that it looks at the most recently available data [21], such that it captures the customers’ behavior at a present time.

2.5.2 Multiple Period Training Data (MPTD)

Ali et al. [21] introduce a churn prediction framework called the MPTD that extends SPTD by taking multiple observations per customer into consideration. They report that MPTD increases the prediction accuracy, as this framework “increases the sampling density in the training data and allows the models to generalize across behaviors in different time periods”. However, using multiple observations per customer introduces “a lack of independence [that] may bias the parameter estimates and/or artificially increase the significance of the parameters with methods that assume independent and identically distributed error terms” [21].

3 METHODOLOGY

Having outlined the theoretical foundations of this research, this chapter provides a description of the raw data and how it was processed into the final datasets that were used as input for the machine learning algorithms. Subsequently, we elaborate on the methods that were used in the course of this study.

3.1 Dataset

3.1.1 Data Description

This study collected data from an online fitness platform that provides membership management, billing and scheduling software for gyms. Each customer is described by a wide variety and a large number of records. For this reason, it is crucial to reduce the information size in order to limit the computational time that is required to analyze the data. As such, we aggregated the customers’ dynamic attributes per calendar month, as this period still allows us to interpolate customers’ behavior while maintaining a dataset that is not too complex to analyze. Furthermore, this period is in line with the membership subscriptions, which are on a monthly basis. The available data spans from January 2015 to May 2018. Therefore, each customer contains a maximum of 28 records, which is the number of calendar months in the considered period. Table 2 shows a simplified example of the aggregated data for an active vis-à-vis a churned customer with only the check-ins as their dynamic attribute.

The data can be categorized into the following four groups: (1) customer data, containing the customers’ basic demographics; (2) check-in data, including all registered physical visits of customers to a gym; (3) classes data, containing all participated group exercise classes of customers at a gym; and (4) portal activity data, comprising the registered customer visits to the online platform.

This work uses the historical data of customers of two selected fitness gyms to train and evaluate models in predicting churn. After cleaning and validation, the data consists of 7734 customers, of which 45% discontinued their membership. These customers have a month-by-month membership at these gyms, which means they have the possibility to cancel the subscription with a month’s notice. We set this criterion deliberately, since it would add an additional level of complexity to model churning behavior when customers are, for example, bound to a 1 or 2 year contract.

3.1.2 Data Noise

The most prevalent source of noise in the data is the missing values in some customer records. Simply removing these records from the dataset, however, would also remove a large proportion of the available training data. Therefore, for some records the missing attributes are inferred. Since the start and end date of a customer’s membership are essential pieces of information to determine when a customer is active or not, this study carefully selected two fitness gyms with a high proportion of customers for whom this data is present. For missing categorical variables such as ‘gender’, a new ‘missing’ category was added to allow for the potential scenario that the reason behind a missing value is related to the likelihood of churning of the customer. For missing continuous variables such as ‘age’, the median of all other records was used as a replacement.

Figure 1: Fixed Timeline (left) and Dynamic Timeline (right)

Customer ID   Month            Check-ins   Churn
001           January, 2017    4           0
001           February, 2017   2           0
001           March, 2017      3           0
001           April, 2017      1           0
001           May, 2017        2           0
001           June, 2017       0           0
001           July, 2017       1           1
002           April, 2018      2           0
002           May, 2018        2           0
002           June, 2018       2           0
002           July, 2018       2           0
002           August, 2018     2           0

Table 2: Example data with monthly aggregated check-ins for two customers

Besides missing data, we extensively verified whether the data that was collected from the database is correct, since databases of such size are prone to errors. Examples of mistakes that were often found are:

• Categorical variables that are described by different values but hold the same meaning, such as ‘Male’ and ‘Man’ for gender. Cases like these were handled by mapping all values with equivalent meaning to one and the same value.
• Customers who have a cancellation date before or on their subscription date. These records were removed.
• Multiple registered check-ins and participated group classes of the same customer within a very short period of time. This was handled by keeping the customer’s first activity record and deleting the subsequent activity records that fall within a time difference of 2 hours, assuming that members of a gym do not have multiple workout sessions within such a short time frame.

It is crucial to carefully account for these mistakes as these errors can lead to biased results in a later phase of the project.
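As an illustration of the last cleaning rule, the sketch below drops activity records that fall within 2 hours of the customer's previous record; it is a simplified approximation based on the gap to the immediately preceding record, and the small check-in log is hypothetical.

```python
# Sketch: simplified de-duplication of activity records, dropping rows
# that fall within 2 hours of the customer's previous record.
import pandas as pd

activities = pd.DataFrame({
    "customer_id": ["001", "001", "001", "002"],
    "timestamp": pd.to_datetime(["2017-01-03 09:00", "2017-01-03 09:30",
                                 "2017-01-03 18:00", "2017-01-04 10:00"]),
}).sort_values(["customer_id", "timestamp"])

# Time gap to the customer's previous record (NaT for the first record).
gap = activities.groupby("customer_id")["timestamp"].diff()
deduplicated = activities[gap.isna() | (gap > pd.Timedelta(hours=2))]
print(deduplicated)
```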

Figure 2: Example timeline of prediction model

3.1.3 Definition Churn Prediction Task

We formally define the churn prediction task as predicting whether a member of a fitness gym is among the ‘subscribed’ or the ‘unsubscribed’ class. Since it is more profitable for fitness gyms to have long-term relationships with their customers, it is essential to detect the customers who are likely to churn with sufficient notice, as this allows fitness gyms to take action to retain these customers. If churn were only detected at the moment of cancellation, the fitness gyms would not be able to take actions to rectify customers’ behavior, as the amount of time left would not be sufficient. Therefore, this study aims to train a classifier that is able to identify customers who are likely to churn with adequate advance notice.

For this study, we aim to predict churn one calendar month ahead of time, based on the input of three continuous months of customer activity data (embedded in features). The output of the model is whether the customer cancels the membership in the subsequent calendar month or remains an active member. Figure 2 shows an example of a data sample where the observed period (input) is from March up to and including May and the prediction period (output) is the month June.

3.1.4 Generating Training Data: Single Versus Multiple Observations per Customer

This study provides a comparison of the effect on the churn prediction performance of classifiers when trained with single versus multiple observations per customer. We generated two datasets that both include three-month customer observations of the period January 2015 to May 2018, following the dataset frameworks described by Ali et al. (see subsection 2.5).

The first dataset that we created is based on the SPTD framework, including one record per customer. Traditionally, a SPTD-inspired dataset uses the most recent customer observations. However, in order to address the absolute rarity problem, the customer observations were sampled for each customer separately using a dynamic timeline (see subsection 2.4.1). Here, the samples are collected by taking the three-month observation period prior to the month of membership cancellation for churned customers and by taking a random three-month observation period for active customers.

Customer ID   Prediction Period   Avg Check-ins per Month   Churn
001           July, 2017          1                         1
002           July, 2018          2                         0

Table 3: Single customer observations training set example

Customer ID   Prediction Period   Avg Check-ins per Month   Churn
001           April, 2017         3                         0
001           May, 2017           2                         0
001           June, 2017          2                         0
001           July, 2017          1                         1
002           July, 2018          2                         0
002           August, 2018        2                         0

Table 4: Multiple customer observations training set example

The second dataset that we created is based on the MPTD framework, including multiple observations per customer over time. Here, we use a sliding-window approach where we collect customers’ behavior data within a window of three continuous calendar months, provided that the customers were active during this time. From there we “slide” our window one month further and repeat the process until we have fully exploited the available data. A customer is considered to be in the ‘churned’ class if he or she cancelled the subscription during the subsequent calendar month with respect to the sliding window period. Similarly, a customer is considered ‘subscribed’ if no cancellation occurred within this interval.
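A sketch of this sliding-window procedure over monthly aggregated data (cf. Table 2) is given below; the column names and helper function are illustrative assumptions, not the study's actual implementation.

```python
# Sketch: building multiple observations per customer with a sliding
# three-month window over monthly aggregated data.
import pandas as pd

def sliding_window_samples(monthly: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Yield one training record per (customer, window position).

    `monthly` is assumed to have columns: customer_id, month, checkins, churn,
    with one row per customer per active calendar month.
    """
    samples = []
    for cid, g in monthly.sort_values("month").groupby("customer_id"):
        for start in range(len(g) - window):
            obs = g.iloc[start:start + window]    # three observed months
            target = g.iloc[start + window]       # subsequent prediction period
            samples.append({
                "customer_id": cid,
                "prediction_period": target["month"],
                "avg_checkins_per_month": obs["checkins"].mean(),
                "churn": int(target["churn"]),
            })
    return pd.DataFrame(samples)
```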

As a result, there are 7,734 observations in the single observations dataset and 89,014 observations in the multiple observations dataset. Note that each customer observation in both the single and the multiple observations dataset uses customer activity data from three-month periods to create the predictor variables which summarize the past customer behavior. An example of both datasets, based on the simplified data from Table 2, is given in Table 3 and Table 4. Here, the ‘average monthly check-ins’ represents the predictor variable which describes the customer behavior of the three calendar months prior to the prediction period, i.e. the considered calendar month in which we aim to predict if a customer churns. The statistics of both datasets are described in Table 5.

3.1.5 Feature Extraction

As mentioned above, customers’ behavior is described by dynamic attributes whose values change over time. As such, we summarize three-month customer activity observations into new features which describe the customer behavior over the corresponding time period. We consider four different periods of time to compute these new features. More precisely, we compute the sum of the dynamic attributes for all three individual months and the mean of these three months. Furthermore, we extracted static features, which are the customer’s age, gender and language preference. We also extracted the considered calendar month in which we aim to predict if a customer cancels their membership, allowing a model to learn potential periodic patterns of churn.

                        Single-record   Multiple-record
Customer Observations   7 734 (100%)    89 014 (100%)
Churned Observations    3 656 (47%)     3 669 (4%)
Active Observations     4 078 (53%)     85 345 (96%)

Table 5: Class distribution of the single-record and multiple-record dataset

During the feature extraction phase, an initial exploratory data analysis was performed on all of the records. Figure 3 shows a visualization of a correlation matrix (using the Pearson correlation coefficient) that was used to find linear relations between features and the churn variable. Simple descriptive statistics of the features in the single-record dataset are given in Figure 8. Besides creating a better understanding of the data, we believe these results are relevant in our aim to answer our third research question.

As a result, we end up with 12 continuous and 3 categorical variables. Categorical variables were binarized by applying one-hot encoding. After this step, the total number of features is 30.
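For illustration, this binarization step can be done with pandas' one-hot encoding; the column names below are hypothetical.

```python
# Sketch: one-hot encoding the categorical variables with pandas.
import pandas as pd

customers = pd.DataFrame({
    "age": [34, 27],                          # continuous, passes through
    "gender": ["Male", "missing"],
    "language": ["nl", "en"],
    "prediction_month": ["February", "April"],
})

encoded = pd.get_dummies(
    customers, columns=["gender", "language", "prediction_month"])
print(encoded.columns.tolist())
```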

3.2 Modelling Phase

We selected several classification models for the churn prediction case study, and compared their performances in terms of standard evaluation metrics. These performance scores also allow us to evaluate the effectiveness of utilizing a single-record versus a multiple-record dataset and of applying sampling techniques.

3.2.1 Baseline Model

We started our modelling phase by establishing a baseline, since it gives a reference point to which we can compare the results of all other models that were constructed. As mentioned earlier, contemporary research into churn prediction in the fitness industry is very limited. As such, we do not have a baseline based on current state-of-the-art approaches. We therefore use two “dummy classifiers” as our baseline, using scikit-learn’s dummy estimators [25]. As a first baseline we use the “most frequent” strategy, which assigns every test example to the majority class. The second baseline is a random classifier, which predicts either class label with equal probability.
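A minimal sketch of the two baselines with scikit-learn's dummy estimators is given below; the tiny training arrays are placeholders.

```python
# Sketch: the two baseline "dummy classifiers" with scikit-learn.
from sklearn.dummy import DummyClassifier

X_train = [[0], [1], [2], [3]]   # placeholder features
y_train = [0, 0, 0, 1]           # placeholder churn labels

# Baseline I: always predict the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Baseline II: predict either class label with equal probability.
random_clf = DummyClassifier(strategy="uniform").fit(X_train, y_train)

print(majority.predict([[4]]), random_clf.predict([[4]]))
```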

3.2.2 Classifier Algorithms

The classifier algorithms that were chosen for the experimental comparison are based on churn prediction literature. Additionally, the classifier algorithms were selected based on the criterion that they should be able to quantify the uncertainty about their prediction by outputting a probability value. This allows for business applications that utilize the ability to sort customers into a ranked list based on their assigned churn probability. For research question 3, we established an additional condition: the algorithms used to answer this question should be able to indicate which features drive the model’s performance. This will allow us to find the most informative features in predicting customer churn in the fitness industry.

Figure 3: Correlation matrix

Based on the criteria mentioned above, we selected the following classical classifier algorithms for research questions 1 and 2:

(1) Naïve Bayes
(2) Decision Tree
(3) Logistic Regression
(4) Artificial Neural Network

Additionally, we selected the following ensembles of decision tree methods for research questions 1, 2 and 3 because of their capability to provide estimates of feature importance from a trained predictive model:

(5) Random Forest
(6) AdaBoost
(7) Gradient Tree Boosting
(8) XGBoost

Appendix C presents an overview of the configuration of each classifier that is used in the performance comparison.

While the literature shows that SVM achieves decent results in predicting churn (section 2.1.2), this study decided not to include this algorithm in the model comparison, since it is very slow to train and tuning its parameters can take enormous amounts of time.

3.2.3 Hyper-Parameter Optimization

Several machine learning classifiers mentioned in the previous section require a set of hyper-parameters to be set in order to configure various aspects of the learning algorithm, which can have different effects on the resulting model and its performance [26]. Hyper-parameter optimization can be used to find the optimal set of parameters for a learning algorithm, such that it maximizes the performance of the model on a validation set.

Traditionally, the way to find the optimal parameters has been grid search, which simply uses brute force to exhaustively search over a manually specified subset of the hyper-parameter space of a learning algorithm [27]. This study instead uses random search, which offers a more time-efficient method for hyper-parameter optimization by randomly selecting combinations of parameter configurations rather than searching exhaustively, while still obtaining performances similar to a full grid search [28].
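The sketch below shows one way to set up such a random search with scikit-learn's RandomizedSearchCV; the estimator, parameter ranges and search budget are illustrative assumptions, not the configuration reported in Appendix C.

```python
# Sketch: random search over an illustrative XGBoost parameter space.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_distributions = {            # assumed ranges, for illustration only
    "n_estimators": randint(50, 500),
    "max_depth": randint(2, 10),
    "learning_rate": uniform(0.01, 0.3),
}

search = RandomizedSearchCV(
    XGBClassifier(), param_distributions,
    n_iter=20, scoring="roc_auc", cv=10, random_state=0)
search.fit(X, y)
print(search.best_params_)   # best configuration found across the folds
```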

The best hyper-parameter configuration for models that were included in the hyper-parameter optimization process can be found in Appendix C.

3.2.4 Evaluation Framework

We formally defined the task at hand as a binary classification problem, where the possible outcomes are 0 if no cancellation happens in the subsequent month or 1 if a cancellation happens.

We used k-fold cross-validation to evaluate the performance of each classifier, which is one of the most common methods for accuracy estimation and model selection [29]. Specifically, we used the stratified variant of k-fold cross-validation, where each fold maintains approximately the same class distribution as that of the entire dataset, which is important to maintain a realistic testing condition for the classification models. In this research, we partitioned the data into ten (k = 10) equally sized folds. Studies show that ten is an optimal number of folds for real-world datasets similar to ours, as it optimizes the necessary time to run the test while minimizing the variance and bias of the comparison results [29, 30]. We iteratively used each individual fold for testing and the remaining nine folds for training. The performance measures were calculated for all ten folds, after which the mean and standard deviation of the performance scores were obtained in order to assess the overall performance of each algorithm.

We use AUROC as our main metric for performance comparison, since it is independent of prior probabilities (or class prevalence) as well as the selected decision threshold. In addition, it offers a single number which allows comparison among models [31]. Furthermore, it “has become the de facto standard metric for evaluating classifiers under imbalance” [31]. Additional evaluation measures that are reported for classifier comparison are F-score, accuracy, precision and recall.
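Combining the two previous points, a minimal sketch of stratified 10-fold cross-validation scored by AUROC could look as follows (synthetic data and a placeholder classifier):

```python
# Sketch: stratified 10-fold cross-validation with AUROC scoring.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X, y, cv=cv, scoring="roc_auc")
print(f"mean AUROC = {scores.mean():.3f} +/- {scores.std():.3f}")
```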

3.2.5 Single-record and Multiple-record

We use the above-mentioned evaluation framework to evaluate the performance of each classifier for both the single-record and multiple-record training sets. However, using a k-fold cross-validation approach does not allow us to use the same test dataset to compare the impact on predictive performance of the single-record versus the multiple-record training dataset. Therefore, based on Ali et al. [21], we decided to create 30 bootstrap samples for both types of training sets and compare the results on the same test dataset. In particular, we use the data starting from January 2015 until January 2018 to generate the training datasets (i.e. single-record and multiple-record), while using the three-month customer observations from January 2018 up to and including March 2018 as the test dataset. The latter is used to evaluate the models in their ability to predict whether the customers will churn in April 2018 or not. We then analyze the mean AUROC values that were obtained by several classifiers for both training sets. Additionally, we investigate whether the use of oversampling improves the performance of these classifiers for both training set generation methods. We selected Decision Tree and Logistic Regression as the classification methods of this experiment, since these methods are often applied as benchmarks in churn related studies [21]. We use SMOTE to create synthetic churn class examples in order to equally represent both classes for the two types of datasets. In order to test for significant differences between performance results of classifiers when trained with the single-record or multiple-record training set, we perform two-tailed t-tests with the significance levels α = 0.01, α = 0.05 and α = 0.1.
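A sketch of such a two-tailed t-test with SciPy is shown below; the two arrays stand in for the 30 bootstrap AUROC scores per training-set type and are randomly generated here.

```python
# Sketch: two-tailed t-test comparing bootstrap AUROC scores of the
# single-record and multiple-record training sets (placeholder data).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
auroc_single = rng.normal(0.68, 0.02, size=30)    # single-record scores
auroc_multiple = rng.normal(0.72, 0.02, size=30)  # multiple-record scores

t_stat, p_value = ttest_ind(auroc_single, auroc_multiple)
print(p_value, p_value < 0.05)  # significant at alpha = 0.05?
```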

3.2.6 Under-sampling and Over-sampling

As can be seen in Table 5, especially the multiple-record dataset’s underlying class distribution is highly unbalanced. We addressed this problem by applying sampling techniques, using the imbalanced-learn python package [32]. In particular, we applied random under-sampling, random over-sampling, SMOTE and ADASYN to determine whether these techniques can improve the prediction accuracy.

For all sampling techniques, the churn and non-churn classes were over-sampled or under-sampled to achieve an equal number of examples. Applying over-sampling techniques before the data is split into train and test might lead to biased results, since there is a higher chance of information “bleeding” from the test set into the training set. In order to avoid this, only the training data is subjected to sampling techniques for all iterations during 10-fold stratified cross-validation.
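One common way to keep resampling restricted to the training folds is an imbalanced-learn pipeline, which applies the sampler only during fitting; the sketch below illustrates this pattern under assumed data and models, and is not necessarily how this study implemented it.

```python
# Sketch: SMOTE applied only to the training folds via an imblearn Pipeline.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.96], random_state=0)

pipeline = Pipeline([("smote", SMOTE(random_state=0)),
                     ("clf", LogisticRegression(max_iter=1000))])

# The sampler resamples each training split; test folds stay untouched.
scores = cross_val_score(
    pipeline, X, y,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="roc_auc")
print(scores.mean())
```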

3.2.7 Feature Importance

Insights into the likelihood of customers of a fitness gym cancelling their membership are useful. However, this study aims to take a step beyond churn prediction by gaining an understanding of the reasons why a customer leaves, since this allows stakeholders of a fitness gym to come up with actions to incentivize the customer to remain a member. Certainly, there are countless reasons for a customer to cancel their membership that we did not capture, but we can analyze which of our introduced features are most useful for making key decisions with tree-based models in order to infer their relative importance.

As such, we closely inspect the following tree-based learned models: AdaBoost, Gradient Tree Boosting and XGBoost. Generally, the more a feature is used to make decisions with decision trees, the higher its relative importance. We calculate this importance explicitly for each feature in the dataset, which allows us to rank and compare them with each other. The tree-based models are developed using the scikit-learn API, which provides an embedded function that outputs the features’ importance based on the Gini Importance. The Gini Importance method calculates the importance of a feature as “[...] the sum over the number of splits (across all trees) that include the feature, proportionally to the number of samples it splits” [33].

The ranking of these feature importance scores can be used for feature selection in order to reduce the dimensions without much loss of the total amount of information. This might be beneficial, since a lower amount of features reduces the training time and reduces the risk of overfitting. As such, we use recursive feature elimination (RFE) [34] to determine the best selection and best number of features based on the feature importance values. Initially, we start with all the features and then recursively remove a single feature with the lowest feature importance score while attempting to eliminate dependencies and collinearity that may exist in the model. To find the optimal number of features, we used this method in combination with cross validation in order to calculate the mean AUROC score on the validation set for each step where a feature is removed. The number of features left at the step which gives the maximum score is considered to be the optimal number of features of our data.
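A compact sketch of RFECV, scikit-learn's recursive feature elimination with cross-validated selection of the number of features, is given below; the estimator, fold count and data are illustrative assumptions.

```python
# Sketch: recursive feature elimination with cross-validated selection
# of the number of features, scored by AUROC. Data is a placeholder.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

selector = RFECV(GradientBoostingClassifier(random_state=0),
                 step=1, cv=5, scoring="roc_auc")
selector.fit(X, y)

print("optimal number of features:", selector.n_features_)
print("selected feature mask:", selector.support_)
```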

4 EVALUATION

This chapter presents and analyzes the results of the experiments that were conducted for each research question. First, we evaluate the performance of several classifiers in predicting customer churn in the fitness industry. Second, this study provides a comparison of the effect on the churn prediction performance of classifiers when trained with single versus multiple observations per customer.


                        AUROC    F1
XGBoost                 0.8386   0.7038
GTB                     0.8324   0.6977
AdaBoost                0.8226   0.6969
Logistic Regression     0.8155   0.7084
Naive Bayes             0.7966   0.7072
ANN                     0.7872   0.7205
Random Forest           0.7803   0.6324
Decision Tree           0.6568   0.6152
Baseline I (majority)   0.5000   0.0000
Baseline II (random)    0.5000   0.4695

Table 6: Single-record model performance in terms of AUROC and F1-score

Finally, we analyze which features are most useful for characterizing customer churn based on linear correlation and feature importance.

4.1 Research Question 1

The main research question of this study is “What machine learning model allows for effective customer churn prediction in the fitness industry based on their membership data?”. In our efforts to answer this question, we trained a combination of classical classifiers and ensemble classifiers for the churn prediction case study and compared their performances using AUROC, supported by the F-score. Additional performance results in terms of Accuracy, Precision and Recall can be found in Appendix B.

By analyzing the results that are obtained by cross-validation on the single-record as well as on the multiple-record dataset, described in Table 6 and Table 7 respectively, we can see that all classifiers outperform the two baseline classifier models. We notice that for both training set generation methods the boosted tree classifiers are the best performing models, as XGBoost, GTB and AdaBoost all obtained a mean AUROC score of approximately 83% on the single-record dataset and approximately 80% on the multiple-record dataset. Interestingly, the Logistic Regression classifier, which is considered to be a conceptually simpler model compared to ensembles of decision tree models, approaches this performance with an average AUROC score of 82% on the single-record dataset and 78% on the multiple-record dataset. Finally, the results show that the DT classifier is the least effective model, with a mean AUROC score of 66% for the single-record dataset and a close to random (53%) performance for the multiple-record dataset.

Interestingly, the obtained F-scores for the multiple-record dataset tend to zero or are at least far lower than the achieved F-scores on the single-record dataset. Since the multiple-record dataset is highly imbalanced and some classifiers have difficulties learning a minority class, we applied several sampling techniques in order to equally represent the classes in the training sets during cross-validation. However, even when subjecting the training sets to sampling techniques, the average performance in predicting churn for highly unbalanced test sets does not improve. This might be explained by the fact that undersampling discards potentially useful non-churn examples, while oversampling techniques introduce a vast amount of exact or similar copies of churn examples, which might lead to overfitted models. Another explanation might be that the existing churn examples in both training sets are highly different from each other. As a result, creating synthetic examples as a linear combination of these churn examples gives unrealistic examples and thus does not contribute to the performance of the model.

                        AUROC    F1
GTB                     0.8079   0.2607
XGBoost                 0.8074   0.2586
Logistic Regression     0.7766   0.0000
AdaBoost                0.7764   0.0000
Naive Bayes             0.7709   0.1401
ANN                     0.7573   0.0005
Random Forest           0.6442   0.1091
Decision Tree           0.5334   0.0942
Baseline I (majority)   0.5000   0.0000
Baseline II (random)    0.5000   0.0729

Table 7: Multiple-record model performance in terms of AUROC and F1-score

Table 8 compares the mean AUROC and F-score values for each combination of the XGBoost and the Logistic Regression classifier and the sampling method that was used for the multiple-record dataset. We can see that the performance in terms of AUROC deteriorates by subjecting the imbalanced multiple-record data to sampling methods. However, the Logistic Regression classifier is found to achieve an approximately 16% higher F-score when it is trained with a balanced dataset.

Classifier            Sampling                AUROC    F1
XGBoost               No sampling             0.8074   0.2586
XGBoost               Random Under-sampling   0.8037   0.1714
XGBoost               Random Over-sampling    0.8048   0.1724
XGBoost               SMOTE                   0.8001   0.2446
XGBoost               ADASYN                  0.7983   0.2409
Logistic Regression   No sampling             0.7766   0.0000
Logistic Regression   Random Over-sampling    0.7724   0.1644
Logistic Regression   Random Under-sampling   0.7737   0.1635
Logistic Regression   SMOTE                   0.7725   0.1617
Logistic Regression   ADASYN                  0.7690   0.1595

Table 8: AUROC and F-score comparison for XGBoost and Logistic Regression when subjecting imbalanced multiple-observation data to different sampling methods

Overall, we can conclude that the XGBoost model seems to be the most effective model for the single-record dataset and is highly effective for the multiple-record dataset in terms of both AUROC and F-score. In accordance with the confusion matrix for classifier evaluation in Table 1, we report the XGBoost model’s confusion matrix in Figure 4, which was calculated with cross-validated estimates for all single-record data points. The cross-validated estimated probabilities for the single-record data points were used to draw the ROC curve of the XGBoost model in Figure 5, which confirms that this model strongly outperforms the baseline model on the single-record dataset with an AUROC that is 34% higher than a random classifier (i.e. AUROC = 0.5).

Figure 4: Confusion Matrix XGBoost

Figure 5: ROC Curves XGBoost and Logistic Regression

While we report a relatively low F-score for the multiple-record dataset, we also observe a relatively high mean AUROC score. As such, the current model performs well in separating the test data into the appropriate classes, with the non-churn cases at one end of a scale and churn cases at the other. Meanwhile, the F-score reflects the performance at the currently selected decision threshold. Therefore, there might be another decision threshold at which the F-score is higher.
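To illustrate this point, the sketch below scans candidate decision thresholds for the one that maximizes the F-score, using toy probability outputs rather than the study's predictions.

```python
# Sketch: finding the decision threshold that maximizes the F-score.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                 # placeholder labels
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.2, 0.7, 0.45, 0.6])  # placeholder probs

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / (precision + recall + 1e-12)

best = np.argmax(f1[:-1])  # the last precision/recall pair has no threshold
print("best threshold:", thresholds[best], "F-score:", f1[best])
```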

4.2 Research Question 2

The second research question that this study aims to answer is “Does using multiple observations per customer significantly increase next month churn predictive accuracy when compared with the traditional single observation per customer?”. As such, we used a single-record and a multiple-record dataset in order to train several classifiers, after which their prediction performance was evaluated using the same test dataset. The results reflect the ability to predict churn in the upcoming calendar month, namely April 2018, based on the customer observations of the prior three months. The test dataset includes records of 4124 customers, of which 33 churned. Table 9 provides a comparison of the single versus the multiple-record training set generation method for predicting churn in the next calendar month in terms of AUROC. Additionally, this table shows the prediction improvement that the multiple-record method provides with respect to the single-record method. Finally, the table presents the corresponding p-values for testing whether there is a significant difference between the mean AUROC scores that were measured using the single-record training dataset and the multiple-record training dataset.

We observe that training both the XGBoost model and the Logistic Regression model with multiple customer observations significantly outperforms the models that are trained with single customer observations. On average, the multiple-record training set provides a 6% improvement for the XGBoost model and a 17% improvement for the Logistic Regression model. The theoretical justification for this prediction improvement might be the fact that the multiple-record method includes a vast amount of non-churn examples across different time periods that for a large part are discarded by the single-record method. We could argue that this allows the model to better generalize periodic patterns with respect to non-churn behavior. This in turn contributes to a more accurate separation between the churn and non-churn class. The Decision Tree classifier’s performance, on the other hand, is approximately 10% lower on the test set when trained with the multiple-record method in comparison to the single-record method. This behavior is in line with our expectations based on the literature, since the performance of a single decision tree is known to suffer from class imbalance and the absolute rarity of a class, as it might lead to data fragmentation [18].

Considering the results, we conclude that subjecting both training sets to synthetic oversampling does not change the outcome as to which training set generation method achieves a better performance per classifier. As already mentioned in section 4.1, this might be because creating synthetic examples as a linear combination of highly varying churn examples gives unrealistic examples that do not contribute to the performance of the model.

4.3 Research Question 3

The final research question of this thesis is “Which features are most useful in predicting customer churn for customers in the fitness industry based on their membership data?”. In order to address this question, we consider both the data exploration phase and the calculated feature importance scores.

In an early stage of the study, we performed an exploratory data analysis on all of the customer records. Supported by the correlation matrix in Figure 3, we determined the features that are linearly correlated with the target variable (i.e. churn).

Classifier            Sampling          Single-record AUROC   Multiple-record AUROC   Improvement (%)   p value (H0: no difference)
XGBoost               No oversampling   0.6795                0.7176                  5.61%             0.0000***
XGBoost               SMOTE             0.6861                0.7072                  3.08%             0.0089***
Logistic Regression   No oversampling   0.6419                0.7493                  16.73%            0.0000***
Logistic Regression   SMOTE             0.6423                0.7459                  16.14%            0.0000***
Decision Tree         No oversampling   0.5747                0.5164                  -10.14%           0.0324**
Decision Tree         SMOTE             0.5715                0.5387                  -5.75%            0.1227

*** p < 0.01, ** p < 0.05, * p < 0.1

Table 9: Comparison of single vs. multiple record for next month churn prediction using the AUROC measure

We found that the following features have the largest negative correlation with churn:

• Average Check-ins per Month and Check-ins Last Month (both -0.3965)
• Average Participated Group Classes per Month (-0.3130)
• Average Portal Visits per Month (-0.2890)
• English Language Preference (-0.2763)
• Customer’s Sex is Missing (-0.2498)
• Observed Month is February (-0.1334)

The features with the largest positive correlation with churn are:

• Days Since Last Check-in (0.4915)

• Dutch Language Preference (0.2769)

• Customer’s Sex is either Female (0.1692) or Male (0.1191)
• Member Lifetime (0.1018)

Note that the Pearson coefficient method assumes a linear relationship between two variables and relies on the means and standard deviations of the variables, which can be seen in Figure 8.

After training a selection of boosted tree models, we extracted the individual relative feature importance scores. Unlike linear models, tree-based models are able to map non-linear relationships. As such, the feature importance scores have the advantage over the correlation results by giving insights into both linear and non-linear patterns that are most useful in making key decisions with tree-based models. Figure 6 illustrates all feature importance scores that are assigned by the boosted tree models ordered by their average score. In general, all models agree on the relative importance of the features, although AdaBoost assigns a notably higher importance score to the feature ‘Days Since Last Check-in’.

Figure 7 visualizes the result of recursive feature elimination in combination with cross-validation to compare the performance for each best selection of n features, where n = 1, 2, ..., 30. We observed that the optimal number of features for our data is n = 12, with an average AUROC score of 0.8430. The following set of features is considered to be the most useful combination of features for predicting next month customer churn:

(1) Days Since Last Check-in
(2) Member Lifetime
(3) Average Portal Visits per Month
(4) Age
(5) Amount of Check-ins Last Month
(6) Observed Month is April
(7) Observed Month is February
(8) Average Check-ins per Month
(9) Customer’s Sex is Missing
(10) Observed Month is January
(11) Participated Group Classes Three Calendar Months Back
(12) Dutch Language Preference

5 CONCLUSION AND FUTURE RESEARCH

The main aim of this study has been to predict next month customer churn in the fitness industry by applying state-of-the-art machine learning methods based on data of membership activities. As such, the main research question of this thesis has been “What machine learning model allows for effective customer churn prediction in the fitness industry based on their membership data?”. The ensemble model XGBoost was found to be the most effective model for the single-record dataset and a highly effective model for the multiple-record dataset, with mean AUROCs of 83% and 80% respectively.

The second research question that this study aimed to answer is “Does using multiple observations per customer significantly increase next month churn predictive accuracy when compared with the traditional single observation per customer?”. We found that both the XGBoost model and the Logistic Regression model significantly improved their predictive performance from training on multiple observations per customer compared to single observations per customer. More specifically, the mean AUROC scores of these classifiers increased by 6% and 17% respectively. The theoretical justification for this improvement might be that training on a vast amount of non-churn examples across different time periods allows the models to better generalize periodic patterns with respect to non-churn behavior. This in turn contributes to a more accurate separation between the churn and non-churn class.

Subjecting both the multiple-record and the single-record training set to sampling techniques in order to equalize the class distribution did not improve the prediction performance in terms of AUROC. Moreover, adding synthetic churn examples to equalize the class distribution of both training sets did not influence which type of training set achieves better performance. One explanation is that undersampling discards potentially useful non-churn examples, while oversampling techniques introduce a vast amount of exact or near copies of churn examples, which may lead to overfitted models. Another explanation might be that the existing churn examples in both training sets are highly different from each other. As a result, creating synthetic examples as a linear combination of these churn examples gives unrealistic examples that do not contribute to the performance of the model.
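For reference, a minimal sketch of how such resampling experiments can be set up with the imbalanced-learn toolbox [32]; the variable names are placeholders, and only the training folds should be resampled so that test folds keep the original class distribution:

# Sketch: equalize the class distribution of a training set with the four
# resampling strategies compared above (imbalanced-learn [32]).
from imblearn.over_sampling import ADASYN, RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

samplers = {
    "Random Oversampling": RandomOverSampler(random_state=0),
    "Random Undersampling": RandomUnderSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
}

for name, sampler in samplers.items():
    # fit_resample is named fit_sample in older imbalanced-learn releases.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, "- churn fraction after resampling:", y_res.mean())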

The final research question that we addressed is “Which features are most useful in predicting customer churn for customers in the fitness industry based on their membership data?”. Based on findings related to linear correlations and the feature importance scores of tree-based models, we found that the most informative features are a combination of 12 features. These features can generally be categorized into three groups: (1) customer activity; (2) basic customer demographics; and (3) calendar-based periodic patterns.

Although this study adds value to contemporary literature, it is not without limitations that can be addressed in future research. First, we used a case study approach in which we collected data of 7734 customers of two fitness gyms and showed that machine learning methods can effectively predict whether these customers are likely to cancel their membership. Further research is needed to investigate the generalizability of our findings by validating them with data from a larger number of customers of different gyms. In addition, these customers were selected on the condition of having a membership that allows termination of the contract at any time with a one-month notice period. Future research could therefore aim at modelling churn behavior for customers that are bound to a longer notice or contract period. Because a large proportion of the available data contains missing values, we inferred some of these variables' missing values. For the same reason, a broad range of variables that could potentially contribute to characterizing the customers' behavior, and hence to predicting customer churn, were omitted. Future research might therefore investigate the predictive power of variables such as those that describe the customer's workout schedule, food intake and retail purchases.

REFERENCES

[1] R Prashanth, K Deepak, and Amit Kumar Meher. 2017. High accuracy predictive modelling for customer churn prediction in telecom industry. In International Conference on Machine Learning and Data Mining in Pattern Recognition. Springer, 391–402.

[2] Sven F Crone, Stefan Lessmann, and Robert Stahlbock. 2006. The impact of preprocessing on data mining: an evaluation of classifier sensitivity in direct marketing. European Journal of Operational Research, 173, 3, 781–800.

[3] Yen-Hsien Lee, Chih-Ping Wei, Tsang-Hsiang Cheng, and Ching-Ting Yang. 2012. Nearest-neighbor-based approach to time-series classification. Decision Support Systems, 53, 1, 207–217.

[4] Carlotta Orsenigo and Carlo Vercellis. 2010. Combining discrete SVM and fixed cardinality warping distances for multivariate time series classification. Pattern Recognition, 43, 11, 3787–3794.

[5] Anita Prinzie and Dirk Van den Poel. 2006. Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM. Decision Support Systems, 42, 2, 508–526.

[6] Shyam V Nath and Ravi S Behara. 2003. Customer churn analysis in the wireless industry: a data mining approach. In Proceedings – Annual Meeting of the Decision Sciences Institute, 505–510.

[7] Thanasis Vafeiadis, Konstantinos I Diamantaras, George Sarigiannidis, and K Ch Chatzisavvas. 2015. A comparison of machine learning techniques for customer churn prediction. Simulation Modelling Practice and Theory, 55, 1–9.

[8] Armin Shmilovici. 2009. Support vector machines. In Data Mining and Knowledge Discovery Handbook. Springer, 231–247.

[9] Guo-en Xia and Wei-dong Jin. 2008. Model of customer churn prediction on support vector machine. Systems Engineering – Theory & Practice, 28, 1, 71–77.

[10] John Hadden, Ashutosh Tiwari, Rajkumar Roy, and Dymitr Ruta. 2006. Churn prediction: does technology matter. International Journal of Intelligent Technology, 1, 2, 104–110.

[11] S Agatonovic-Kustrin and R Beresford. 2000. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. Journal of Pharmaceutical and Biomedical Analysis, 22, 5, 717–727.

[12] Martin Fridrich. 2017. Hyperparameter optimization of artificial neural network in customer churn prediction using genetic algorithm. Trendy Ekonomiky a Managementu, 11, 28, 9.

[13] Koen W De Bock and Dirk Van den Poel. 2011. An empirical evaluation of rotation-based ensemble classifiers for customer churn prediction. Expert Systems with Applications, 38, 10, 12293–12301.

[14] Leo Breiman. 2001. Random forests. Machine Learning, 45, 1, 5–32.

[15] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.

[16] Sravya Nimmagadda, Akshay Subramaniam, and Man Long Wong. 2017. Churn prediction of subscription user for a music streaming service.

[17] Pedro Domingos. 1999. MetaCost: a general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 155–164.

[18] Jonathan Burez and Dirk Van den Poel. 2009. Handling class imbalance in customer churn prediction. Expert Systems with Applications, 36, 3, 4626–4636.

[19] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.

[20] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. 2008. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, 1322–1328.

[21] Özden Gür Ali and Umut Arıtürk. 2014. Dynamic churn prediction framework with more effective use of rare event data: the case of private banking. Expert Systems with Applications, 41, 17, 7889–7903.

[22] Yaya Xie, Xiu Li, EWT Ngai, and Weiyun Ying. 2009. Customer churn prediction using improved balanced random forests. Expert Systems with Applications, 36, 3, 5445–5449.

[23] Gary M Weiss. 2004. Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter, 6, 1, 7–19.

[24] Vivek Bhambri. 2013. Data mining as a tool to predict churn behaviour of customers. International Journal of Management Research, 59–69.

[25] F. Pedregosa et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

[26] Marc Claesen and Bart De Moor. 2015. Hyperparameter search in machine learning. arXiv preprint arXiv:1502.02127.

[27] Guoqi Li, Guangshe Zhao, and Feng Yang. 2014. Towards the online learning with kernels in classification and regression. Evolving Systems, 5, 1, 11–19.

[28] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, Feb, 281–305.

[29] Ron Kohavi et al. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI. Vol. 14, number 2. Montreal, Canada, 1137–1145.

[30] Leo Breiman. 2017. Classification and regression trees. Routledge.

[31] T Ryan Hoens and Nitesh V Chawla. 2013. Imbalanced datasets: from sampling to classifiers. Imbalanced Learning: Foundations, Algorithms, and Applications, 43–59.

[32] Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. 2017. Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18, 17, 1–5. http://jmlr.org/papers/v18/16-365.html.

[33] Haiwei Ma, C Estelle Smith, Lu He, Saumik Narayanan, Robert A Giaquinto, Roni Evans, Linda Hanson, and Svetlana Yarosh. 2017. Write for life: persisting in online health communities with expressive writing and social support.

[34] 2018. RFECV – scikit-learn. (July 2018). http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html.

[35] 2018. XGBoost Python package. (July 2018). https://xgboost.readthedocs.io/en/latest/python/index.html.


APPENDIX

A DESCRIPTIVE ANALYSIS

Figure 8: Descriptive statistics. (a) Churned customer records; (b) Active customer records.

B MODEL PERFORMANCE SCORES

Model        AUROC   F1      Accuracy  Precision  Recall
XGBoost      0.8386  0.7038  0.7411    0.7148     0.7377
GTB          0.8324  0.6977  0.7399    0.7132     0.7254
AdaBoost     0.8226  0.6969  0.7325    0.7046     0.7300
LR           0.8155  0.7083  0.7397    0.7064     0.7426
NB           0.7966  0.7072  0.7021    0.6367     0.8264
ANN          0.7872  0.7205  0.7274    0.6853     0.7690
RF           0.7803  0.6324  0.7074    0.6878     0.6304
DT           0.6568  0.6152  0.6577    0.6117     0.6417
Baseline I   0.5000  0.0000  0.5273    0.0000     0.0000
Baseline II  0.5000  0.4695  0.4833    0.4561     0.4836

Table 10: CV scores with single-record customer data (no sampling)

Model        AUROC   F1      Accuracy  Precision  Recall
GTB          0.8410  0.7056  0.7418    0.7145     0.7415
XGBoost      0.8389  0.7067  0.7406    0.7096     0.7451
AdaBoost     0.8246  0.7018  0.7333    0.7031     0.7424
LR           0.8149  0.7084  0.7324    0.6907     0.7566
NB           0.7964  0.7063  0.7006    0.6357     0.8256
RF           0.7795  0.6449  0.7138    0.6942     0.6449
ANN          0.7776  0.7190  0.7192    0.6678     0.7900
DT           0.6458  0.6047  0.6467    0.6076     0.6294
Baseline I   0.5000  0.0000  0.5273    0.0000     0.0000
Baseline II  0.5000  0.4695  0.4833    0.4561     0.4836

Table 11: CV scores with single-record customer data (SMOTE)


Model        AUROC   F1      Accuracy  Precision  Recall
GTB          0.8079  0.2607  0.9629    0.4089     0.1995
XGBoost      0.8074  0.2586  0.9632    0.4212     0.1951
LR           0.7766  0.0000  0.9588    0.0000     0.0000
AdaBoost     0.7764  0.0000  0.9588    0.0000     0.0000
NB           0.7709  0.1401  0.6099    0.0784     0.7303
ANN          0.7573  0.0005  0.9588    0.1000     0.0003
RF           0.6442  0.1091  0.9511    0.1919     0.0872
DT           0.5334  0.0942  0.9127    0.0854     0.1183
Baseline I   0.5000  0.0000  0.9588    0.0000     0.0000
Baseline II  0.5000  0.0729  0.4932    0.0394     0.4832

Table 12: CV scores with multiple-record customer data (no sampling)

Model        AUROC   F1      Accuracy  Precision  Recall
GTB          0.8056  0.1753  0.6827    0.1013     0.7348
XGBoost      0.8048  0.1724  0.6767    0.0992     0.7364
AdaBoost     0.7756  0.1593  0.6527    0.0909     0.7198
LR           0.7737  0.1635  0.6515    0.0932     0.7410
NB           0.7708  0.1253  0.5138    0.0682     0.8246
ANN          0.7646  0.1617  0.6169    0.0912     0.7983
RF           0.6434  0.1194  0.9432    0.1586     0.1142
DT           0.5312  0.0888  0.9201    0.0833     0.1074
Baseline I   0.5000  0.0000  0.9588    0.0000     0.0000
Baseline II  0.5000  0.0729  0.4932    0.0394     0.4832

Table 13: CV scores with multiple-record customer data and Random Oversampling

Model        AUROC   F1      Accuracy  Precision  Recall
XGBoost      0.8037  0.1714  0.6652    0.0990     0.7383
GTB          0.8033  0.1729  0.6689    0.1000     0.7389
AdaBoost     0.7736  0.1586  0.6489    0.0904     0.7247
LR           0.7724  0.1644  0.6550    0.0940     0.7375
NB           0.7708  0.1253  0.5158    0.0682     0.8197
ANN          0.7662  0.1625  0.6190    0.0917     0.8005
RF           0.7574  0.1742  0.7525    0.1043     0.5909
DT           0.6222  0.1236  0.6536    0.0696     0.5881
Baseline I   0.5000  0.0000  0.9588    0.0000     0.0000
Baseline II  0.5000  0.0729  0.4932    0.0394     0.4832

Table 14: CV scores with multiple-record customer data and Random Undersampling

Model        AUROC   F1      Accuracy  Precision  Recall
XGBoost      0.8001  0.2446  0.9401    0.2483     0.2657
GTB          0.7983  0.2597  0.9395    0.3001     0.2649
LR           0.7725  0.1617  0.6656    0.0927     0.7070
AdaBoost     0.7709  0.1985  0.9111    0.1631     0.2804
ANN          0.7701  0.1622  0.6195    0.0918     0.7907
NB           0.7666  0.1320  0.5652    0.0727     0.7783
RF           0.6348  0.1243  0.9486    0.1974     0.1068
DT           0.5409  0.1004  0.9128    0.0889     0.1294
Baseline I   0.5000  0.0000  0.9588    0.0000     0.0000
Baseline II  0.5000  0.0729  0.4932    0.0394     0.4832

Table 15: CV scores with multiple-record customer data and SMOTE

Model        AUROC   F1      Accuracy  Precision  Recall
XGBoost      0.7983  0.2409  0.9467    0.2766     0.2401
GTB          0.7973  0.2457  0.9490    0.3108     0.2362
AdaBoost     0.7769  0.2082  0.9334    0.2178     0.2270
LR           0.7690  0.1595  0.6463    0.0906     0.7378
NB           0.7660  0.1273  0.5391    0.0697     0.7971
ANN          0.7617  0.1591  0.6000    0.0894     0.8163
RF           0.6332  0.1126  0.9486    0.1743     0.0951
DT           0.5413  0.1017  0.9131    0.0888     0.1308
Baseline I   0.5000  0.0633  0.2247    0.0330     0.8000
Baseline II  0.5000  0.0729  0.4932    0.0394     0.4832

Table 16: CV scores with multiple-record customer data and ADASYN


C MODEL CONFIGURATIONS

This section describes the tuned hyper-parameter values for the models, obtained using random search. Models and remaining parameters that are not mentioned use the default values of the corresponding classifier as implemented in scikit-learn [25] and the XGBoost Python package [35].
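As an illustration, such a random search for the XGBoost model might be set up as follows. This is a sketch assuming scikit-learn's RandomizedSearchCV; the parameter distributions shown are illustrative, not the exact search space used in this study:

# Sketch: random search over an illustrative XGBoost parameter grid,
# scored on AUROC with stratified cross-validation.
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from xgboost import XGBClassifier

param_distributions = {
    "max_depth": [3, 5, 7, 9],
    "min_child_weight": [1, 3, 5],
    "gamma": [0.5, 1.0, 1.5, 2.0],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    estimator=XGBClassifier(),
    param_distributions=param_distributions,
    n_iter=50,                 # number of randomly sampled configurations
    scoring="roc_auc",
    cv=StratifiedKFold(10),
    random_state=0,
)
search.fit(X_train, y_train)   # X_train, y_train are assumed placeholders
print(search.best_params_)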

C.1 Logistic Regression

• penalty = ‘l1’
• C = 1.0

C.2 Random Forest

• min_samples_split = 10
• min_samples_leaf = 1
• max_features = ‘sqrt’
• max_depth = 50
• bootstrap = True

C.3 Extreme Gradient Boosting

• colsample_bytree = 0.8
• gamma = 1.5
• max_depth = 5
• min_child_weight = 1
• subsample = 0.6
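The tuned configurations of C.1–C.3 translate directly into the following constructors. This is a sketch: all parameters not listed keep the library defaults, and the l1 penalty is paired with a solver that supports it (liblinear), which is an assumption about the implementation detail:

# Sketch: the tuned model configurations from C.1-C.3.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

lr = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
rf = RandomForestClassifier(
    min_samples_split=10,
    min_samples_leaf=1,
    max_features="sqrt",
    max_depth=50,
    bootstrap=True,
)
xgb = XGBClassifier(
    colsample_bytree=0.8,
    gamma=1.5,
    max_depth=5,
    min_child_weight=1,
    subsample=0.6,
)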

C.4 ANN

We implemented a simple Artificial Neural Network with a single hidden layer including n = 25 neurons. This number of neurons achieves the best performance balance between AUROC and F-scores when varying n = 5 × i for i = 1, 2, ..., 10 (see Table 17). We used the following parameter configuration:

• activation = ‘relu’
• alpha = 0.0001
• hidden_layer_sizes = (25)
• learning_rate = ‘constant’
• solver = ‘lbfgs’

Neurons  AUROC   F1
5        0.4999  0.0000
10       0.7991  0.7067
15       0.7819  0.7120
20       0.7886  0.7219
25       0.8130  0.7017
30       0.8055  0.7057
35       0.7890  0.7357
40       0.7952  0.7045
45       0.8069  0.7033
50       0.8036  0.7123

Table 17: AUROC and F-score (estimated averages) from 10-fold stratified cross-validation for a varying number of neurons in the single hidden layer
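Expressed as a scikit-learn estimator, the configuration above corresponds to the following sketch:

# Sketch: the ANN configuration from C.4 as a scikit-learn MLPClassifier.
from sklearn.neural_network import MLPClassifier

ann = MLPClassifier(
    hidden_layer_sizes=(25,),    # single hidden layer with 25 neurons
    activation="relu",
    alpha=0.0001,
    learning_rate="constant",
    solver="lbfgs",
)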
