
What Data Retains Customers? A Study on Data Characteristics affecting the Predictive Performance of Supervised Machine Learning methods in

Churn Modeling

by

Etienne Kick

University of Groningen Faculty of Economics and Business

Department of Marketing PO Box 800, 9700 AV Groningen

The Netherlands

Thesis MSc Marketing Intelligence & MSc Marketing Management January 2021

e.kick.2@student.rug.nl

Supervisor: Prof. Dr. J.E. Wieringa

Second assessor: A. Bhattacharya


Abstract

This research aims to identify the circumstances under which the predictive performance of selected Supervised Machine Learning methods in predicting customer churn is amplified and how marketers can benefit from these insights. We simulate churn data and vary both the Machine Learning algorithm and the characteristics of the dataset to investigate how differences in data affect the algorithms' predictive performance. Furthermore, we merge these results into guidelines for using Machine Learning in marketing analytics, validated on real-world churn data across twenty industries. Additionally, we provide insights into the monetary implications of data characteristics for marketing practice. The results show that the presence of interactions and nonlinearity in the data and the dichotomization of numeric variables significantly and positively affect the top decile lift (TDL). In contrast, the number of possible predictor variables and the presence of interactions in the data significantly and positively affect the GINI coefficient. This research adds to the field of explainable artificial intelligence and provides insights for marketing practice and research regarding the implications of data characteristics for the predictability of customer retention.

Keywords: Machine Learning, (explainable) artificial intelligence, customer retention, data characteristics, marketing analytics, data science, Big Data, predictive modeling


1. Introduction

As the amount of data continues to grow exponentially, so does its importance. In today's digitalized world, many researchers point out that using this ever-increasing amount of data, referred to as Big Data, is becoming the primary business challenge of the coming years (e.g., Leeflang et al., 2014; Mela, 2020; Verhoef et al., 2016).

This challenge lies in extracting valuable customer insights from the data and in developing and understanding more extensive ways of analyzing data (Leeflang et al., 2014). In the past, data analysis was not commonly applied in marketing environments.

However, with the increase in complexity and volume of data, new ways to extract more valuable customer insights from those data emerged (Leeflang et al., 2014; Verhoef et al., 2016; Wedel & Kannan, 2016).

These data can be used to provide insights into customer characteristics and behavior.

Research shows that more knowledge and data about customers are valuable in predicting customer churn, which has implications for firm value (Verhoef et al., 2016; Wedel & Kannan, 2016). Gupta et al. (2004) show that an increase in retention has a five and fifty times greater positive effect on firm value than an equal improvement in margin or acquisition cost, respectively. They also find that an increase in retention has approximately five times greater impact on firm value than the same change in the discount rate or cost of capital.

Hence, customer retention is one of the major points of attention within firms, and with good reason.

While the challenge of analyzing the increasing amount of data occurs in many disciplines, marketing might be one of the most affected. Technological developments like the increase in data-storage capacity, analytical capacity, and online usage offer excellent opportunities to capture value for contemporary marketers (Verhoef et al., 2016). Subsequently, more experienced data scientists and improvements in artificial intelligence (AI) and Machine Learning (ML) provide the opportunity to exploit these developments (Leeflang et al., 2017;

Verhoef et al., 2016).

These developments led to the recognition of the superior performance of ML over more traditional analysis techniques. Although researchers and practitioners acknowledge its apparent value, debate remains about whether the additional value is captured by ML itself or by increased data quality (Bradlow et al., 2017). These authors suggest that an increase in the sample size of time series data will only add outdated observations, which has consequences for the model's predictive power.

Moreover, Brei (2020) provides a useful overview of widespread applications of Machine Learning techniques in both marketing research and practice. However, no conclusions have yet been drawn about the circumstances under which different ML techniques perform best.

Besides these technical improvements for marketing, it remains interesting to consider how and when marketing decisions and campaigns are affected by these new methods. Many examples show that the usage of ML techniques in marketing has beneficial effects on marketing efforts in companies compared to traditional forms of marketing. For instance, forecasts of frequencies of social media usage (Ballings & Van Den Poel, 2015), models that clarify responses to advertising efforts (Goldfarb & Tucker, 2011; Gubela et al., 2020), and improvements in customer churn predictions (Verbeke et al., 2012) all contribute to improved marketing operations.

Furthermore, Neslin et al. (2006) found that different modeling approaches yield different managerial implications. Therefore, it is interesting to investigate the consequences of using different predictive modeling techniques in customer relationship management. Previous studies focus solely on the performance of models in certain circumstances compared to other methods. In this study, we therefore investigate the data characteristics under which different ML methods flourish most and how marketing decisions based on computer science can benefit customer relationship management.

We do this by analyzing simulated datasets and subsequently, for generalization purposes, projecting the results on a real dataset containing churn data across twenty industries. To see what consequences an application of our results has for marketers, we calculate the monetary value of an increase in sample size in a similar way as Neslin et al. (2006). As a deliverable, we provide marketing practice and science with guidelines consisting of dataset properties that determine the circumstances under which certain methods are preferred over others. Regarding the simulation, we construct several datasets, each with different properties, to test the formulated hypotheses. For these artificial data, we compare the differences in the performance of selected supervised ML methods. In this way, the characteristics of the data that cause ML methods to perform at their best are explored, and the following research question is answered:


To what extent is the performance of selected supervised Machine Learning methods affected by data characteristics, and how can marketers benefit from these insights?

As Risselada et al. (2010) highlight in their study, more advanced ML methods have not yet been adopted throughout marketing; we will provide insights into these methods' benefits. Hence, in this study, comparisons for predicting customer churn are made between Support Vector Machines, Neural Networks, Random Forests, and Naïve Bayes classifiers, whose performance is judged by predictive validity measures for classification problems: the top decile lift (TDL) and the GINI coefficient. Since data characteristics are investigated, we simulate data to ensure the analyses are free of confounds and to allow the hypotheses to be tested very specifically. To check the generalizability of the results on the simulated data to real marketing practice, we use our guidelines to verify the results of the simulation on real data retrieved from de Haan et al. (2015). To value the importance of our study for marketing practitioners, a similar approach to Neslin et al. (2006) is adopted, in which the authors value the retention of customers at different levels.

We found that the selection of the ML algorithm affects the model's predictive performance, i.e., different ML algorithms perform differently on the same data. Furthermore, the sample size has a nonlinear effect on the predictive performance, with decreasing returns to scale up to an optimal size that this simulation did not yet identify. Next, we found that the number of possible predictors only affects the predictive performance when the number of predictors is small, which is exemplified by an increase from nine to eighteen possible predictors. Moreover, nonlinearity appears to positively affect the predictive performance, albeit merely for predicting the top decile of churners. Furthermore, we found that interactions positively affect the predictive performance of the selected ML methods. Finally, our study shows that the dichotomization of numeric variables positively affects the predictability of the top decile, while it negatively affects the predictability of the entire sample.

The remainder of this study provides more thorough explanations of the applications of analytics in marketing decisions, the concepts and methods used, and the steps taken to conduct the analyses. Finally, the results of the study are presented, along with limitations and suggestions for future research.

By conducting this study, we answer the rising demand for explanations of algorithmic outcomes, better known as the area of explainable artificial intelligence. The importance of this field is increasingly recognized not only in marketing (De Bruyn et al., 2020) but also in the fields of law (Deeks, 2019; Higgins, 2019) and the social sciences (Miller, 2019). This research is distinctive in its attempt to open the black box of ML, providing additional information about the properties of data that cause the performance of ML in predicting customer churn to be amplified. Therefore, this study returns valuable information for marketing practice. It gives insights into which method should be used when and how the marketing decision-making process can benefit from data analytics. Practitioners, as well as researchers, can benefit from the guidelines for using ML in marketing.

2. Literature overview

First, the applications of marketing analytics in practice are discussed, after which customer relationship management is touched upon in more detail. After that, Big Data and Machine Learning are clarified, followed by an explanation of the algorithms we use in this study. The subsequent section shows which performance indicators are used in this study. Finally, the different characteristics of the data used in this study are discussed.

2.1. Marketing Analytics in Marketing Decision Making

Data is becoming an integral part of decision-making processes within organizations, as increasingly valuable data is becoming available about what customers do, think, and even feel about products, brands, and people. This availability contributes to the shift of companies towards a more data-driven business approach. Marketers are using marketing analytics in various ways to improve their business operations.

The rise of possibilities with data causes new forms of marketing to emerge. Improved personalization and recommendations, geo-fencing and hypertargeting, search marketing, and retargeting open up new opportunities for marketers to continually reach their customers with very relevant messages (Bradlow et al., 2017; Wedel & Kannan, 2016).

However, the question remains whether marketers use these data most efficiently and effectively. There is no consensus on what type of analytics is most valuable for which types of problems and data, which methods are most effective for new sources of data, and what type of data leads to the best marketing decision support.

Nevertheless, Wedel & Kannan (2016) provide a framework based on one of the research priorities of the Marketing Science Institute, issued in 2014. They divide the key domains for applications of analytics in support of marketing decisions into four main areas: the marketing mix, personalization, privacy and security, and customer relationship management (CRM). The last concept will be discussed in more detail in section 2.2 because of its higher relevance for this study.

Marketing mix

New (digital) sources of data and the ability to incorporate them in analyses provide the opportunity to gain more detailed explanations of the effects of marketing mix efforts. More extensive internal data (e.g., direct surveys, attitudinal research, or behavior in physical stores) as well as external data (e.g., online word-of-mouth, online reviews, or clickstreams) both allow for these more detailed explanations (Wedel & Kannan, 2016).

Additionally, more specific attribution and allocation of marketing mix effects to new touchpoints give better insights into the effectiveness of marketing actions. While many challenges remain in this field, progress has already been made in modeling different marketing mix efforts and their spill-over effects at an omnichannel level. Online marketing also offers more ways to influence the customer by enabling organizations to generate their own content and to engage with customers on their channels. Consequently, marketing budgets can be better managed due to better insights into the effectiveness of different marketing mix efforts (Wedel & Kannan, 2016).

The incorporation of new data sources makes it possible to assess causality with higher quality. More possibilities for exogenous independent variables related to endogenous control variables yield "cleaner" causality estimates. Furthermore, digital data environments allow for field experiments in the form of, for instance, A/B testing to determine the causality of control variables. Finally, more extensive data volumes with more variety allow for incorporating customers' forward-looking behavior in models (Wedel & Kannan, 2016).

Personalization

Personalization builds upon the marketing mix allocations in that it adjusts organizations' product and service offerings at the individual level. With personalization, the dominant model in marketing shifts from mass broadcasting to more interactive, personal communication with customers (Steckel et al., 2005; Wedel & Kannan, 2016).

The growing amount of data does not necessarily mean that it is desirable to personalize at the individual level. However, it provides companies with the opportunity to customize different elements of the marketing mix at different levels of granularity. Consider McDonald's, which builds the image of the brand at a mass level (big yellow 'M', Ronald McDonald), while its product offerings differ for different segments (McKroket in The Netherlands) and its promotions are done at the individual level (I ordered a Big Mac, so I get a discount for a Big Mac on my next order).

Two different types of personalization can be distinguished: recommendation systems and adaptive personalization systems (Wedel & Kannan, 2016). Recommendation systems consist of two different types of recommendations. The first is content filtering, which involves making recommendations based on past preferences (e.g., Netflix's similar movie/series genres). The second is collaborative filtering, with which recommendations are made based on how users similar to the customer behave and what they like (e.g., Spotify's "People also listened to …"). Model-based systems use these filtering methods to predict the subsequent preferences of customers.

Personalization algorithms are most often used first to learn customers’ preferences, then adapt offerings to those customers, followed by an assessment of the effectiveness of the offer. Digital environments allow this process to be automated in nearly real-time, enabling adaptive personalization (Wedel & Kannan, 2016).

Adaptive personalization systems are concerned with more real-time personalization. For example, personalized offers from Airbnb or Groupon can be made thanks to algorithms that use the available information on, for instance, location, previous interactions with the company, and availability in the market. Furthermore, this type of personalization is used in real-time bidding in online auctions for online advertising. The appropriateness of the ad, along with the fee advertisers are willing to pay, determines which ad the auction shows to which individual (Steckel et al., 2005). However, it should be noted that because these decisions are made in real time, they require more computational power than regular recommendation systems.

Adaptive personalization is expected to grow in the coming years. With the rise of new, more individual-level and user-generated data from a wider variety of sources (e.g., the Internet of Things), the automation and accuracy of online offerings keep improving.

Privacy and Security

Customers' increasing concerns about the personal data companies have available, and supposed violations of privacy regulations by organizations, affect the organizational behavior of online advertisers. The easy availability of data, alongside privacy laws and security technology that lag behind the fast-moving technologies of data collection, storage, and processing, contributes to the rise in data breaches and misuse of private customer information (Wedel & Kannan, 2016).

Consequently, two trends emerge with regard to data compliance. The first trend is increasingly strong governmental regulation regarding the use of private customer data. Governments and unions show their growing concern about customer privacy by introducing new regulations and laws restricting data use and analytics in marketing. Research by Tucker (2014) shows that a second trend is a growing number of companies that proactively strengthen their own privacy policies. The research demonstrates that companies that communicate clearly about their data usage and respect customers' privacy (by giving customers more autonomy over the data used) have better relationships with their customers and perform better.

The growing privacy concerns drive new ways of so-called compliant analytics, in which companies make proper use of private customer data by applying data minimization or anonymization (Verhoef et al., 2016). With data minimization, the use of data is limited to only the data marketers need, while unnecessary data is disposed of. With data anonymization, data is transformed so that it can no longer be traced back to individuals, e.g., by using age groups (18-25) instead of exact age (23).

These advancements show a necessity and an opportunity for companies to develop new ways of analyzing personalized data without violating privacy restrictions while maintaining analytical and predictive power.

2.2. Customer Relationship Management

Although the developments in marketing analytics for the marketing mix, personalization, and privacy and security come with significant challenges and opportunities, the focus of this study is on CRM and customer retention. We do this because of the opportunities for higher revenues by increasing the customer equity of retained customers and the hands-on applications of marketing analytics in CRM, which yield substantially increased profits, as opposed to the less profitable marketing efforts of acquiring new customers (Holtrop et al., 2017). Many other authors also emphasize the usefulness of applications of classification algorithms in CRM (e.g., Miškovic, 2014; Verbeke et al., 2012). Finally, the study of Gupta et al. (2004) shows that increasing customer retention yields high revenues within firms. These opportunities make this area highly relevant both for researchers and practitioners.

CRM is defined as: "the strategic process of selecting customers that a firm can most profitably serve and shaping interactions between a company and these customers. The ultimate goal is to optimize the current and future value of customers for the company." (Kumar & Reinartz, 2018, p. 5). In the past era, CRM shifted from having the marketing concept as a leading principle towards having the customer concept as a leading principle (Kumar & Reinartz, 2018).

The marketing concept was concerned with increasing value through improvements in product or brand performance. The focus was on maximizing product sales, market share, and product profitability in specific segments without considering individual customers' characteristics and preferences (Kumar & Reinartz, 2018). On the contrary, the customer concept is about creating profitable long-term relationships with, preferably known, customers and transforming them into loyalists of the firm (Ramani & Kumar, 2008).

By retaining and profiting from loyal customers, the value of each customer to the firm is maximized, making this strategy more profitable than acquiring new customers, lowering costs, or attempting to increase the average profit margin (Gupta et al., 2004; Shah et al., 2006). Organizations that effectively apply the customer concept and put the customer at the center of everything the company does are often referred to as customer-centric organizations. Table 2.1 shows a comparison of the essential differences between the product-centric and the customer-centric approach, adapted from Shah et al. (2006, p. 115).

This new focus of organizations on putting the customer at the center of whatever they do implies new forms of communication with and towards the customer. Personalization is an excellent way to provide better-personalized messages and offerings to customers and can, therefore, be used in customer relationship management.

With personalization, data about the customer is deemed relevant and offerings are based on customers' preferences. This relates to the philosophy of customer-centricity in that customers, and their preferences, are taken as the starting point instead of products.

Research shows that organizations that employed personalization in e-commerce environments saw increased benefits in both B2C and B2B markets (Jackson, 2007). Although individually targeted advertisements positively affect customers' purchase intentions, organizations should be aware that a high degree of intrusiveness or overly personalized offers (e.g., addressing someone with private information) negatively affect purchase intentions (van Doorn & Hoekstra, 2013).

Table 2.1 Differences in the product-centric approach versus the customer-centric approach (adapted from Shah et al., 2006, p. 115)

Aspect | Product-centric approach | Customer-centric approach
Basic philosophy | Sell products, no distinction in who wants to buy | Serve customers; everything starts at the customer
Orientation | Transaction | Relationship
Positioning | Product features and advantages | Benefits for customers
Organizational focus | Internally focused, new product development, market share growth | Externally focused, profitability through customer loyalty
Performance metrics | Number of new products, market share & profitability per product | Share of wallet, customer satisfaction, customer lifetime value, customer equity
Selling approach | To how many people can we sell our product? | How many products can we sell to our customers?
Customer knowledge | Data is only for control | Customer knowledge is a valuable asset

Customer Engagement

Another way to ensure that customers are at the center of an organization is by positively engaging them with the business. Customer engagement is defined as all the behaviors of customers towards a firm or brand that go beyond purchasing. These behaviors can affect the firm both positively and negatively (van Doorn & Hoekstra, 2013). Kumar et al. (2010) agree, stating that customer value to the firm is not only expressed in financial ways but also in the non-monetary profits or losses a company makes through the engagement of its customers. They capture this value in customer engagement value (CEV), which consists of customer lifetime value (CLV), customer referral value (CRV), customer influencer value (CIV), and customer knowledge value (CKV). These antecedents of CEV show different ways for customers to engage with companies to create value for the firm.


Customer Churn

The management of retaining customers is known as customer churn management, where churn/churners refer to customers leaving the company. More specifically, churn is the ending of contractual relationships between firms and customers (Kumar & Reinartz, 2018). As the importance of retaining customers within firms is increasingly acknowledged because of the possibility of high profits (Gupta et al., 2004), the focus of CRM shifts towards retaining at-risk churners, especially in industries where churn rates can be up to 40% every year, like the telecom industry (Kumar & Reinartz, 2018). Therefore, correctly classifying which customers are at what risk of churning is of significant importance to organizations since this allows them to tailor marketing efforts to different groups of customers.

Interestingly, there is some friction in the literature about which customers should be targeted in retention campaigns. While Holtrop et al. (2017) and Ascarza (2018) show several studies claiming that marketing campaigns should be targeted at at-risk churners, Ascarza (2018) argues that the focus should be more subtle. The customers who are sensitive to interventions, like discounts or special offers, should be targeted proactively, regardless of their risk of churning. In this way, the most effective strategy for retaining customers is applied.

Customer Engagement & Customer Churn

Research shows that engagement and retention are positively related: positively engaged customers are not only unlikely to churn, they are also likely to spread positive word of mouth (WOM) (Malthouse et al., 2013). Conversely, customers with negative experiences with the company are more likely to churn and spread negative WOM. Furthermore, for engaged customers, relationship termination is more challenging. Also, Haenlein (2013) shows that interactions between customers are linked to the likelihood of terminating a contract with a firm.

Malthouse et al. (2013) also show that the effect of customer engagement on customer retention causes organizations to rethink their acquisition and retention strategy, providing opportunities to yield higher profits. This opportunity, together with the higher profitability of retaining customers compared with increasing margin or acquisition cost or changing the discount rate or cost of capital, makes customer engagement and customer churn among the most relevant subjects within customer relationship management (Gupta et al., 2004).


If organizations find ways to engage customers positively by putting them at the core of their business and thereby retain them for a longer period, an effective and profitable marketing strategy is applied. Additionally, it remains interesting to investigate the effect that data characteristics have on the predictability of customer retention. By increasing the predictability, more effective marketing activities can be targeted at customers that might churn and significantly impact the profitability of the organization, yielding higher profits.

2.3. Big Data and Marketing

As mentioned before, marketing is probably one of the disciplines most affected by Big Data. Advertising used to be based on mass marketing and reaching as many people as possible with the same message, but with the rise of Big Data, reaching as many people as possible with individually customized messages became possible (Bradlow et al., 2017; Wedel & Kannan, 2016). However, what exactly do people consider to be Big Data? It is interesting to explore in the literature what constitutes Big Data. Many authors agree that Big Data is characterized by the four V's: volume, velocity, variety, and veracity (e.g., Sagiroglu & Sinanc, 2013; Verhoef et al., 2016; Wedel & Kannan, 2016).

Volume indicates the growth in the size of databases and the increasing availability of data on large numbers of customers with many characteristics. Velocity refers to the speed with which data arrives and how fast it can be analyzed, characterized by the switch from snapshots to streaming data. Variety can be illustrated by the increasing number of different data sources. Data used to come in numerical form but is now joined by more unstructured forms like reviews, videos, or social media. The last V, veracity, is often referred to as the data's reliability, validity, and trustworthiness. With the increase in various data sources, it needs to be checked whether sources provide correct data. Some authors argue for value as a fifth V, indicating the value that businesses capture from analyzing the data (Verhoef et al., 2016; Wedel & Kannan, 2016).

These characteristics of Big Data enable data scientists to combine several sources, obtain richer data, and conduct more in-depth analyses. This new data revolution, partly characterized by combining different sources, has more value and more significant consequences for the field of marketing than previous data revolutions (Sudhir, 2016).

Especially in the field of online marketing, Big Data has a significant impact. Predictive modeling and more specific customer characteristics facilitate improved segmentation. Data on browsing behavior and preferences allow for highly personalized targeting, and profit-optimization algorithms use these data to determine how likely it is that customers will click on an advertisement, which in turn determines the price for that specific advertisement (Leeflang et al., 2017).

Although statistical methods for predictive modeling have been used in marketing over the years, the vast amounts of data require new analysis methods. Traditional statistical models cannot handle the increased volume of data and its higher complexity, in the form of more variables and more complex problems such as interactions and endogeneity, due to a lack of processing power (Leeflang et al., 2017). As a result, these methods lag behind the real-time decision making in online marketing that Machine Learning enables.

Since this study focuses on classification for CRM, only Machine Learning methods for classification purposes will be discussed.

2.4. Machine Learning

To understand what makes the performance of ML seem superior to more traditional statistical analysis, it is essential to understand where it comes from, what the concept entails, and what different forms exist.

ML is a discipline that originates in computer science and now lies at the intersection of computer science and statistics. Thereby, it is at the core of artificial intelligence and data science. Its focus is two-sided: on the one hand, automatically improving the performance of a learning agent and, on the other hand, the statistical and information-theoretic laws that govern learning systems (Jordan & Mitchell, 2015).

Dzyabura & Yoganarasimhan (2018) identify four main differences between ML and statistical modeling. First of all, ML is based more on heuristics and less on theory and hypothesis testing. ML aims at high predictive validity of the predictions and outcomes, rather than explainability of the results, which is why people regularly refer to ML as a black box. Econometric and statistical approaches, in contrast, aim more at obtaining the best unbiased estimators. This trade-off is often referred to as the bias-variance trade-off in the ML literature.

Secondly, ML algorithms are developed to work without a priori theories about what causes the outcome.

On the contrary, statistical methods are often designed to test specific causal theories. Third, ML methods can cope with vast numbers of variables and identify which variables to retain or drop. Finally, ML is very well able to cope with scalability issues. By using more efficient algorithms, the desired real-time decision making is enabled for online marketers.


Supervised, Unsupervised, and Reinforcement Learning

Although Leeflang et al. (2017) classify ML into two major types of algorithms, namely Supervised Machine Learning (SML) and Unsupervised Machine Learning (UML), others add Reinforcement Learning (RL) as a third (Brei, 2020; Jordan & Mitchell, 2015). We briefly touch upon UML and RL and then continue with SML because this form is used in this paper.

Unsupervised Machine Learning is concerned with finding interesting patterns in the data. Since the algorithm is not provided with the correct results or output during training, it is encouraged to find similarities in the inputs and categorize them by their commonalities itself, which makes this form of ML perfectly suitable for segmentation (clustering) or dimensionality reduction (factor analysis) (Jordan & Mitchell, 2015).

Reinforcement Learning involves applying different situations to actions to maximize a reward signal. In advance, the learning agent algorithm is not aware of which situation maximizes the reward, which it eventually identifies by trial and error. It consists of four main elements: a policy (what actions are taken), a reward signal (the goal of the problem), a value function (determination of what is the most effective over long-term), and a model that represents the environment. Within these elements, the agent tries different actions and identifies which actions lead to the best outcome, and builds progressively upon its previous decisions (Brei, 2020). Applications of RL appear in robotics, games, or autonomously driving cars.

However, the most commonly used type of ML in marketing is Supervised Machine Learning. SML is usually concerned with classification or regression problems. In SML, the algorithm is provided with a training set consisting of both the inputs and outputs. The algorithm trains itself to obtain predictions as close as possible to the outputs in the training set. Based on the algorithm's experience, predictions become progressively more accurate, and it tries to generalize so that all possible (new) inputs are predicted correctly. Since SML is concerned with classification or regression problems, it can handle different outcomes. Regression problems deal with real values like the number of sales, while classification problems handle discrete dependent variables like churn but also images, texts, and videos (Brei, 2020; Jordan & Mitchell, 2015; Leeflang et al., 2017).

Since this study focuses on classification issues, Supervised Machine Learning algorithms are used to investigate the effects on marketing decisions. We compare the use of four algorithms originating from computer science: Support Vector Machines, Neural Networks, Naïve Bayes, and Random Forests.

Support Vector Machine (SVM)

Support Vector Machines are classification algorithms that try to identify a function according to which observations can be split into separate classes based on several features from a training set. For every observation in the test set, the SVM can determine to which class the observation belongs based on the set of features. SVMs split observations into homogeneous groups based on linear classifiers (Miškovic, 2014).

However, since there are multiple possible lines separating the data, an SVM tries to find the one line that maximizes the margin between the two groups. Therefore, SVMs are commonly called large margin classifiers. The margin is calculated as the distance between the linear classifier and the support vectors, which are the observations on the borders at the edges of the classifier (Leeflang et al., 2017).

A restriction of this type of SVM is that it can only be used for linearly separable data. Therefore, to deal with data that cannot be separated linearly, extensions to the SVM exist.

First, a more relaxed model can be used in which the misclassification constraint is less strict, so that some observations are allowed to be misclassified; these are often referred to as soft margin machines. Simpler models with some classification errors are often preferred over complex models with no classification error that suffer from overfitting.

Overfitting occurs when the training data is classified too well, causing the model to be trained perfectly for that data. Models that suffer from overfitting usually perform very poorly in generalizing and predicting out-of-sample data. Hence, the model criteria of Little (1970) apply to these models as well. He argued for choosing the simplest model that is as complete as possible, a concept referred to as parsimony (Leeflang et al., 2017).

However, the above-mentioned simplified model is still based on a linear classifier, which might not be sufficient for data that does not allow for this type of classification. Hence, Boser et al. (1992) were the first to use kernel-based support vector machines, which allow for higher-dimensional classifiers, like radial, polynomial, or hyperbolic classifiers, as shown in Figure 2.1. Although kernel functions are often very robust to overfitting and generalize quite well, they require more training effort.

Figure 2.1 Linear & multi-dimensional classifiers
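
To make this concrete, the following minimal sketch fits a soft-margin SVM with a radial (RBF) kernel on simulated data using scikit-learn; the dataset, parameter values, and variable names are illustrative assumptions, not the models or data used in this thesis.

```python
# A minimal sketch (assumed data, not the thesis simulation) of a soft-margin SVM
# with a radial kernel, which can separate classes that are not linearly separable.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))                                  # five numeric predictors
y = (X[:, 0] * X[:, 1] + X[:, 2] ** 2 + rng.normal(size=1000) > 1).astype(int)  # nonlinear churn rule

# C controls how soft the margin is; gamma controls the flexibility of the RBF kernel.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale", probability=True))
svm.fit(X, y)
churn_prob = svm.predict_proba(X)[:, 1]                         # predicted churn probability per customer
```

In this sketch a smaller C corresponds to the softer margin described above (more misclassifications tolerated), while a larger C enforces a stricter separation.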

Classification and Regression Trees (CRT)

Decision trees start with the complete training set and make splits that create groups which become more homogeneous with each split. Every split is made at a so-called node based on a decision rule for a variable (e.g., age > 25; yes or no). The complete training set is at the root node, where the first split is made. The tree stops splitting according to stopping criteria at terminal nodes, which can be the size of the node as a percentage of observations, the maximum number of branches, or the maximum classification error (Leeflang et al., 2017). Figure 2.2 shows a schematic display of a decision tree.

The selection of the variable on which to split at each node is most often based on the reduction in entropy (information gain) for binary outcome variables and on the decrease in impurity or variance for continuous outcome variables (Leeflang et al., 2017). The goal of these split decisions is that the groups are more homogeneous after the split.

A useful feature of decision trees is that they, by default, accommodate interaction effects and nonlinearities (Leeflang et al., 2017). Imagine the split referred to earlier, where the root node is split into a group of people aged above 25 and a group of people aged 25 and lower. Suppose the group of people older than 25 is subsequently split based on their gender, since that split increases the homogeneity within those groups, while that specific split is never made for the younger people. In that case, it shows that gender only plays a role for older people. Trees accommodate nonlinearities since a variable can be used twice to base a split on: first a split on age > 25 and later a split on age > 55, resulting in a group of people older than 55, a group of people between 25 and 55, and a group of people of 25 and younger.
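
As a hedged illustration of this mechanism (with invented data, not the thesis simulation), the sketch below fits a shallow tree in which age can be reused at several thresholds and combined with gender:

```python
# A small illustration (assumed data) of a decision tree reusing a variable
# at more than one threshold, mirroring the age/gender example above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(7)
age = rng.integers(18, 80, size=2000)
gender = rng.integers(0, 2, size=2000)                 # 0 = female, 1 = male (illustrative coding)
# churn depends on gender only for customers older than 25, and rises again above 55
churn = ((age > 25) & (gender == 1) | (age > 55)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50)
tree.fit(np.column_stack([age, gender]), churn)
print(export_text(tree, feature_names=["age", "gender"]))  # the tree typically reuses age at several thresholds
```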


Figure 2.2 Schematic display of a decision tree

Using the decision rules on which splits are made, predictions can easily be made for out-of-sample data. Because tree methods are easy to follow due to the schematic display they provide and are very robust to various data issues, these models are very popular in practice (e.g., Brei, 2020; Leeflang et al., 2017; Miškovic, 2014).

Also for classification and regression trees, overfitting is a crucial issue to consider. While it is tempting to create a big tree with many branches and decision rules, such models are not useful for predicting out-of-sample observations. However, it remains challenging to determine what to use as stopping criteria. A way to deal with overfitting for tree methods is pruning (Breiman et al., 1984). Pruning means checking all the branches of the tree and removing the decision criteria that contribute least to the generalizability of the tree. Although there are several ways to prune decision trees, Esposito & Malerba (1997) showed that no single best pruning approach is available.

Another way for tree methods to deal with overfitting is by combining the outcomes of different trees and using those to aggregate over decision rules and select the best splits. These methods are called ensemble methods. Commonly used ensemble methods are bagging, in which the average classification of several trees determines the decision of the model; boosting, in which each sequential tree is trained on the misclassified observations of the previous one; and Random Forest models (Leeflang et al., 2017).

Bagging and Random Forest

Bagging and Random Forest algorithms work similarly. Bagging is named after bootstrapping and aggregation. Bootstrapping means generating new datasets with the same number of observations as the original dataset, randomly drawn with replacement from the original. This means that bootstrapped datasets can contain an observation multiple times, as long as the complete bootstrapped dataset differs from the others (Leeflang et al., 2017).

With bagging, each bootstrapped dataset trains its own decision tree, which is later used to make predictions for out-of-sample observations. These predictions are aggregated over all the generated trees, resulting in an average prediction. A restriction of these ensemble methods is that they need an unstable classifier as the basis for their model, meaning that small changes in the data have significant consequences, which is the case for decision trees. So, although the individual trees are prone to overfit, aggregation causes the model to generalize well (Miškovic, 2014).
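
The bootstrapping step itself is simple; the following minimal sketch (with stand-in data, not the thesis datasets) shows resampling with replacement, so single observations can appear several times in one resample:

```python
# A minimal sketch of bootstrapping: each resample has the same size as the original
# data and is drawn with replacement, and each would train its own tree under bagging.
import numpy as np

rng = np.random.default_rng(0)
original = np.arange(10)                    # stand-in for the original observations
n_trees = 3
for b in range(n_trees):
    idx = rng.integers(0, len(original), size=len(original))  # indices drawn with replacement
    bootstrap_sample = original[idx]        # same size as the original dataset
    print(f"bootstrap {b}: {bootstrap_sample}")
```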

While bagging determines every individual tree's split based on the entire set of predictors, Random Forests use a random subset of predictors at each split. Consequently, every variable gets the chance to be used in a split. Therefore, Random Forests can generate relative variable importance figures, which is very useful in marketing based on Machine Learning. This method can hence be considered an explainable Machine Learning method, which causes it to be used often in practice (Miškovic, 2014). Also, this method is often one of the best performers when it comes to classification.

The method is preferred over original bagging since bagging does not allow every variable to make a split and thus does not show the relative importance of all the variables. Furthermore, bagging algorithms take longer to train due to the higher number of variables that determine each split, while the performance of Random Forests is at least as good (Leeflang et al., 2017; Miškovic, 2014). Thus, Random Forest algorithms are preferred over boosting algorithms due to their superior robustness and comparable accuracy using fewer variables at each split (Leeflang et al., 2017).
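
A hedged sketch of a Random Forest with relative variable importances is shown below; the simulated data and predictor names are assumptions for illustration only, not the thesis datasets.

```python
# A minimal sketch (assumed simulated data) of a Random Forest classifier
# with out-of-bag evaluation and relative variable importances.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
names = ["tenure", "usage", "complaints", "age", "noise"]        # illustrative predictors
X = rng.normal(size=(2000, len(names)))
y = (0.8 * X[:, 0] - 1.2 * X[:, 2] + rng.normal(size=2000) > 0).astype(int)  # churn indicator

rf = RandomForestClassifier(
    n_estimators=500,      # number of bootstrapped trees
    max_features="sqrt",   # random subset of predictors considered at each split
    oob_score=True,        # out-of-bag estimate of predictive performance
    random_state=42,
)
rf.fit(X, y)

# Relative variable importances are what make the forest more explainable than plain bagging.
for name, imp in sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
print("OOB accuracy:", round(rf.oob_score_, 3))
```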

Neural Network (NN)

Neural Network algorithms are based on how the human brain works. Neural Networks weigh a set of inputs and calculate the sum of those inputs, which may activate an activation function, resulting in a binary outcome. The most basic threshold activation function switches abruptly from a value of 0 to 1 once a certain threshold is reached. However, this is not the best way to predict, since different sets of attributes, which both predict the occurrence of Y, might have very different probabilities above the threshold. Therefore, a logistic transformation of the threshold function is preferred with the development of sigmoid neurons, causing the activation function to switch more gradually (Leeflang et al., 2017).

Neural Network algorithms consist of three types of layers: the input layer with all the attributes; the hidden layers, which process the information from the input layer and transform their input into an output; and the output layer, which makes the final prediction. Since the hidden layers do not display the values of their inputs and outputs, the algorithms are often not preferred for explainable artificial intelligence applications. The more hidden layers a network consists of, the deeper it is. Although deeper networks are often more accurate, they take longer to train (Leeflang et al., 2017).
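
The contrast between the abrupt threshold activation and the gradual sigmoid activation can be sketched as follows (illustrative values only, not thesis results):

```python
# A minimal sketch of the two activation functions described above.
import numpy as np

def threshold(z, t=0.0):
    return (z >= t).astype(float)       # switches suddenly from 0 to 1 at the threshold

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # switches gradually, yielding a probability

weighted_sum = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])    # weighted input of a single neuron
print(threshold(weighted_sum))          # [0. 0. 1. 1. 1.]
print(sigmoid(weighted_sum).round(2))   # [0.12 0.48 0.5  0.52 0.88]
```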

Naïve Bayes

Another Machine Learning algorithm used in classification problems is Naïve Bayes. This method tries to extract information from the training data and relies on building rules of (co-)occurrence of a set of features used to predict class membership. All Naïve Bayes classifiers rely on Bayes' Theorem, shown in equation 2.1, which decomposes the probability p(A|B) of classifying an observation into a class, given the observed behavior, into three parts. This probability is referred to as the posterior in Bayesian language. The first part of the decomposition expresses the probability p(B|A) that a particular behavior occurs, given that the observation belongs to a particular class. The second part is the general probability p(A) of class membership, while the third part is the general probability p(B) of the occurrence of the observed behavior (Leeflang et al., 2017).

p(A|B) = p(B|A) · p(A) / p(B)    (2.1)

Bayes' Theorem applies to every sort of classification problem in which a particular set of features is observed for different items. The Theorem determines to which class every observation is most likely to belong, based on a set of predetermined attributes and the subsequently calculated posterior probability (Leeflang et al., 2017).

Naïve Bayes assumes that the effect of the value of an attribute on a certain class is independent of the values of the other attributes, which is referred to as class conditional independence. Because of this assumption, this classifier is among the most efficient classifiers and can infer predictions from the training data quickly and easily (Leeflang et al., 2017).
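
A small worked sketch of equation 2.1 with invented numbers may help: the posterior probability that a customer churns (class A) given that a complaint was observed (behavior B).

```python
# Illustrative numbers only; none of these probabilities come from the thesis data.
p_complaint_given_churn = 0.60   # p(B|A): complaint rate among churners (assumed)
p_churn = 0.10                   # p(A): overall churn rate (assumed)
p_complaint = 0.15               # p(B): overall complaint rate (assumed)

p_churn_given_complaint = p_complaint_given_churn * p_churn / p_complaint
print(round(p_churn_given_complaint, 2))   # 0.4: churn is four times as likely after a complaint
```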

Performance Assessment

We assess the performance of the different algorithms in predicting customer churn, since improvements in churn predictions offer the greatest profit potential in practice. The performance of the different models in classifying out-of-sample observations under different circumstances is judged according to two metrics: the top decile lift (TDL) and the GINI coefficient.

The TDL uses the predicted churn probabilities of all customers and ranks them from high to low probability. All observations are divided into deciles, so the top decile represents the top 10% of customers who are most likely to churn. Subsequently, the actual churn rate in the top decile is divided by the overall churn rate. A TDL of 2 indicates that the model performs twice as well as a naïve model (Leeflang et al., 2015).

However, as Ascarza (2018) argues, it might not be the most effective approach to target only the high-probability churners. She argues that the targeted customers should also be selected based on their responsiveness to a marketing intervention. In this study, we therefore also use a performance indicator that shows the model's performance across the whole customer base rather than only the top decile: the GINI coefficient. The GINI coefficient is derived from the cumulative lift plot, which graphs the cumulative percentage of customers horizontally against the cumulative percentage of churners vertically. This coefficient is a number between zero and one, and the higher the number, the better the model performs (Leeflang et al., 2015).
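
The following hedged sketch shows how both metrics could be computed on simulated predictions; the data are invented, and computing GINI as 2 * AUC - 1 is a common shortcut we assume here rather than the exact procedure of Leeflang et al. (2015).

```python
# A minimal sketch of the two predictive-validity metrics on assumed data.
import numpy as np
from sklearn.metrics import roc_auc_score

def top_decile_lift(y_true, churn_prob):
    y_true, churn_prob = np.asarray(y_true), np.asarray(churn_prob)
    n_top = int(np.ceil(0.10 * len(y_true)))          # top 10% highest predicted churn
    top = np.argsort(-churn_prob)[:n_top]
    return y_true[top].mean() / y_true.mean()         # decile churn rate / overall churn rate

rng = np.random.default_rng(3)
y = rng.binomial(1, 0.2, size=1000)                                 # simulated churn outcomes
prob = np.clip(0.2 + 0.3 * y + rng.normal(0, 0.2, 1000), 0, 1)      # imperfect churn probabilities

print("TDL :", round(top_decile_lift(y, prob), 2))     # lift of the top decile over the base churn rate
print("GINI:", round(2 * roc_auc_score(y, prob) - 1, 2))
```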

We do not incorporate a proxy for the customers' responsiveness to marketing interventions in the metrics used in this study. Instead, we focus on the predictability of the sample regarding churning probability since the focus of this study is on the predictability of churn and not on the responsiveness of possible churners.

The division of the data into a training and a hold-out set is done according to the rules of Leeflang et al. (2017). They argue that a training set should preferably consist of 75% of the data, while the predictions should be made on a hold-out set of 25%.
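
A minimal sketch of that 75%/25% division, using scikit-learn on stand-in data (the stratification choice is our assumption; it keeps the churn rate comparable in both parts):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))               # stand-in predictors
y = rng.binomial(1, 0.2, size=1000)          # stand-in churn indicator (20% churners)

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_train), len(X_holdout))          # 750 250
```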


Explainable Artificial Intelligence

Many researchers agree that AI will inevitably take over some tasks of practitioners. However, there is also unanimity that probably not every task will be taken over by algorithms. Not only in the field of marketing but also in, for instance, law and the social sciences, researchers argue that attempts should be made to open up the black box of Machine Learning so that inferences can be made based on the outcomes of the algorithms (Deeks, 2019; Higgins, 2019; Miller, 2019).

For instance, in marketing analytical applications of AI, Bradlow et al. (2017) and Wedel & Kannan (2016) argue that outcomes or predictions of algorithms should always be backed up with a theory due to the lack of transparency of the models. De Bruyn et al. (2020) extend this view and claim that predictive tasks requiring little explainability can be automated with AI applications. However, they also point out that the transfer of tacit knowledge is challenging and question whether AI will ever succeed in transferring valuable tacit knowledge to humans. This topic will remain interesting in the coming years, while the focus probably remains on AI as an addition and backup to marketing theory.

Differences in Performance

Different applications of the methods in section 2.4 show differences in predictive performance (e.g., Miškovic, 2014; Neslin et al., 2006). Due to the differences in how the selected supervised Machine Learning methods predict the outcome, we expect the predictive performance of the models in predicting churn to differ. Therefore, we hypothesize that the model's predictive performance is affected by the selection of the ML algorithm.

Hypothesis 1: The selection of the Machine Learning algorithm affects the predictive performance of models in predicting churn.

2.5. Data Characteristics

As differences in data lead to different predictive outcomes of different modeling methods (Fellinghauer et al., 2013), it is interesting to investigate which properties of the data render the best predictions. First, the size of the dataset, both row- and column-wise, is discussed; then different types of effects, in the sense of linearity and interactions; and finally different ways of scaling variables.


Size of Dataset

Since Machine Learning methods are trained on data and subsequently predict the outcome for the holdout set, it is expected that varying sample sizes cause the models to have different predictive performances (Verhoef et al., 2016). Furthermore, since those methods select the explanatory variables according to which the best predictions are made (Leeflang et al., 2017), it is expected that the more variables are available, the better the model performs.

Number of rows

Although some authors suggest that an increase in the size of the dataset with regards to the number of rows does not necessarily cause models to perform better in classification problems (Bradlow et al., 2017; Wedel & Kannan, 2016), others argue that ML methods are better able to deal with large sample sizes, yielding a better performance (Dzyabura & Yoganarasimhan, 2018; Leeflang et al., 2017).

Since computer science-based ML models are used to handle big datasets while focusing on identifying patterns in the data and orienting towards predictive quality rather than testing hypotheses (Leeflang et al., 2017), an increase in the volume of observations is expected to lead to a better predictive performance of ML models. This is due to increased possibilities for the model to identify patterns in the data and select the variables with the highest predictive value (Dzyabura & Yoganarasimhan, 2018; Leeflang et al., 2017).

However, Verhoef et al. (2016) argue that an increase in size is more valuable for small sample sizes than when the sample size is already large. Verhoef et al. (2016, p. 133) provide rules concerning the value of increased sample sizes and argue that it is more relevant to look at whether the sample represents the population. Altogether, we expect that an increase in sample size positively affects the predictive performance and that this effect is amplified for small sample sizes.

Hypothesis 2a: Sample size positively affects the predictive performance of selected supervised Machine Learning methods in predicting churn. This effect is amplified when the sample size is small.

Number of columns

Although Bradlow et al. (2017) and Wedel & Kannan (2016) argue that increased sample size does not affect the predictive performance of models, they do argue that an increase in predictors leads to an increase in predictive performance. Also, Hwang et al. (2004) find different outcomes for different selections of variables.

Additionally, since ML techniques select the most effective predictors for classifying observations (Leeflang et al., 2017), an increase in the number of predictors is likely to affect the performance positively.

New data sources may positively affect the predictive quality of the models since they offer new variables that can be used to predict the outcome, carrying new information not yet present in the model (Bradlow et al., 2017). It should, however, be noted that an increase in possible predictor variables is expected to matter only when the additional predictors are relevant to the dependent variable, since irrelevant predictors provide no additional information to the model (Bradlow et al., 2017).

Hypothesis 2b: The number of relevant possible predictor variables positively affects the predictive performance of selected supervised Machine Learning methods in predicting churn.

Type of effects in the dataset

Another interesting data characteristic that possibly affects the performance of ML methods in predicting churn is the type of effects within the dataset. The literature states that Machine Learning methods perform better with complex effects since they automatically account for such particularities in the data (e.g., Leeflang et al., 2015).

While most models rely on linearity and few interactions within the data for practical convenience, research suggests that including nonlinear variables and interactions increases predictability in certain areas (Ryo & Rillig, 2017). Since these findings have never been validated in a marketing context, the type of effects in the dataset is an interesting aspect to look at as well.

Linearity

Models can differ in functional form concerning linearity for both parameters and variables. The most basic regression type, which is commonly used in practice and research, is the model that is linear in both parameters and variables, also referred to as the linear additive model (Leeflang et al., 2015).


While most statistical models assume linearity, not every predictor variable has a stable linear effect on a dependent variable (Ryo & Rillig, 2017). Some predictors have nonlinear effects, like marginally decreasing/increasing or parabolic effects. These effects are very common in marketing practice, as, for instance, advertising might suffer from supersaturation, when eventually sales drop at a certain level due to excessive marketing efforts (Vakratsas et al., 2004).

Since linearity is assumed in most statistical models and such models are very commonly used, users of nonparametric or nonlinear models often try to transform their model into a linear model because of the ease of estimation and the abundance of information on the topic (Leeflang et al., 2015). When nonlinear effects are modeled more appropriately, for example by including square root or squared transformations of the variables in the model, the estimates and predictions become more reliable, and the model performs better (Leeflang et al., 2015; Ryo & Rillig, 2017).

Research by Ryo & Rillig (2017) found that modeling nonlinearity within variables has a beneficial effect on the predictability of ecological systems. These findings trigger the investigation of modeling different levels of linearity in predicting churn and the applicability of their findings in a marketing context. Subsequently, it is assessed to what extent nonlinearity affects the performance of the selected Machine Learning methods in predicting churn.

Hypothesis 3a: The degree of nonlinearity in the data positively affects the predictive performance of selected supervised Machine Learning methods in predicting churn.

Interactions

As mentioned in the section on linearity, models differ in their functional form concerning both parameters and variables. Since, in marketing, the effects of variables often depend on each other, interactions should be included when modeling churn (Leeflang et al., 2015). The effect of being unsatisfied with a service can, for instance, be stronger for older customers.

One way to include interactions in a model is by switching from a linear additive model to a multiplicative model. Since the effects in a multiplicative model are multiplied with each other, the model accommodates interactions automatically (Leeflang et al., 2015). A multiplicative model, however, accounts for interactions between all variables. Interactions between specific variables can instead be included in a linear additive model by adding multiplications between individual predictors.
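As an illustration (again not the thesis code), the sketch below simulates data in which the effect of satisfaction on churn depends on age and compares a main-effects-only logistic regression with one that adds the interaction term; the variable names, coefficient values, and sample size are assumptions made for this example.

set.seed(2)
df <- data.frame(satisfaction = runif(500, 1, 10), age = runif(500, 18, 80))
# Assumed churn process: the (negative) effect of satisfaction is stronger for older customers
df$churn <- rbinom(500, 1, plogis(1 - 0.2 * df$satisfaction - 0.01 * df$satisfaction * (df$age - 50)))

main_fit     <- glm(churn ~ satisfaction + age, data = df, family = binomial)
interact_fit <- glm(churn ~ satisfaction * age, data = df, family = binomial)  # adds the satisfaction:age term
anova(main_fit, interact_fit, test = "Chisq")   # does the interaction term improve the fit?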

By adding interactions to the model, the amplified effects of predictors that depend on the level of other predictors are taken into account. As a result, the model better represents reality and makes better predictions, and the individual effects can be distinguished more clearly (Leeflang et al., 2015; Ryo & Rillig, 2017).

Since most ML algorithms automatically account for interactions in the data, it is expected that, when interactions are present in the data, the predictive performance of ML methods is better (Ryo & Rillig, 2017). The appearance of first-order interactions increases model performance, while higher-order interactions are often less significant (Leeflang et al., 2015).

Since these findings have never been related to marketing practice, we investigate the extent to which different levels of interaction in the data impact the predictive performance of the selected Machine Learning methods in predicting churn.

Hypothesis 3b: The number of interactions in the data positively affects the predictive performance of selected supervised Machine Learning methods in predicting churn.

Scaling variables

One last interesting aspect to study is the scaling of variables. Research shows that the way variables are scaled affects the predictive performance of models, since the amount of information varies across variable types (e.g., a numeric variable carries more information than a binary one) (MacCallum et al., 2002; Cohen et al., 2003). Since empirical data often contain various types of variables (e.g., binary, categorical, numeric), it is of great value for practitioners to see which type of scaling allows ML models to perform best (Fellinghauer et al., 2013).

Research shows that dichotomization, that is, changing continuous into dichotomous variables (or numeric into categorical data) by, for instance, applying a median split, causes a loss of information (MacCallum et al., 2002). By splitting and categorizing a continuous spectrum, valuable information about the variables’ linearity, normality, and dependence over time vanishes (Cohen et al., 2003).
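A small sketch of this loss of information, under assumed variable names and coefficients, is the following: a continuous predictor drives churn, and a median-split version of the same predictor is compared to the original in a logistic regression.

set.seed(3)
df <- data.frame(x = rnorm(1000, mean = 10, sd = 3))
df$churn <- rbinom(1000, 1, plogis(-5 + 0.45 * df$x))   # churn driven by the continuous x
df$x_split <- as.integer(df$x > median(df$x))            # median-split (dichotomized) version of x

continuous_fit  <- glm(churn ~ x,       data = df, family = binomial)
dichotomous_fit <- glm(churn ~ x_split, data = df, family = binomial)
AIC(continuous_fit, dichotomous_fit)    # the median-split predictor typically fits noticeably worse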

Furthermore, tree-based methods are said to preferably select numeric over categorical over binary predictors, in that order, since numeric variables simply offer more splitting opportunities (Hothorn et al., 2006). Hence, the predictive performance of ML algorithms is expected to differ across datasets containing differently scaled variables, and data containing more numeric predictors are expected to give ML methods a better basis for their predictions than data with fewer numeric predictors.

Hypothesis 4: Dichotomization of continuous predictor variables negatively affects the predictive performance of selected supervised Machine Learning methods in predicting churn.

Following from the hypotheses, the conceptual model shown in figure 2.3 is developed.

Figure 2.3 Conceptual Model

3. Research Design

In this section, we discuss how the hypotheses are tested. We explain what data we use, what methods we use, and how we make inferences for marketers.

For this study, two types of data are used: simulated data and real-life churn data. Because we investigate the effect of data characteristics on the performance of ML methods, simulated data is used; this allows us to isolate the actual effects of the data characteristics without the confounds that might be present in real data. In addition, we use a dataset containing churn information on 613 customers of 107 companies across 20 industries, of which 1375 observations can be used for modeling churn. These data contain information about customer feedback metrics, demographics, and retention. They are used to compare against the performance of the ML methods on the simulated data and to make inferences about the generalizability of the hypothesized effects of data characteristics from simulated data to real churn data of companies.

For the analyses, four packages in the programming language R are used. The R package “e1071” is used for the Support Vector Machine and the Naïve Bayes classifier. For the Random Forest, the R package “randomForest” is used, with three randomly selected predictors considered at each split, from which the model determines the best variable to split on. For the Neural Network, the R packages “nnet” and “caret” are used. For further specifications of the conducted analyses in R, we refer to the Appendix in section 8.
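As a rough sketch of how such models can be fitted with these packages (the exact settings, tuning, and data preparation used in this study may differ and are documented in the Appendix), consider the following; the simulated training data frame and its variable names are assumptions made purely for illustration.

library(e1071)         # svm(), naiveBayes()
library(randomForest)  # randomForest()
library(caret)         # train(), wrapping nnet

set.seed(4)
train_df <- data.frame(x1 = rnorm(300, 10, 3),
                       x2 = runif(300, 0, 10),
                       x3 = rbinom(300, 1, 0.5))
train_df$churn <- factor(rbinom(300, 1, plogis(-6 + 0.3 * train_df$x1 + 0.2 * train_df$x2 + 0.5 * train_df$x3)))

svm_fit <- svm(churn ~ ., data = train_df, probability = TRUE)   # Support Vector Machine
nb_fit  <- naiveBayes(churn ~ ., data = train_df)                 # Naive Bayes
rf_fit  <- randomForest(churn ~ ., data = train_df, mtry = 3)     # Random Forest, 3 candidate predictors per split
nn_fit  <- train(churn ~ ., data = train_df, method = "nnet",     # Neural Network (caret wrapping nnet)
                 trace = FALSE)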

3.1. Simulated Data

For this study, we simulate different datasets in order to test the hypotheses. We adopt a simulation method similar to that of Ryo & Rillig (2017), who create a dataset containing sixteen predictors. Due to the setup of our study, we slightly change the division of the variables and observations.

We simulate a control dataset to use as a reference level to measure the increase or decrease of performance of the ML models by applying different treatments. The control dataset consists of 1375 observations, which we adopt from the real dataset of De Haan et al. (2016).

It contains four binary predictors, generated by random sampling from a binomial distribution with probability 0.5 and standard deviation 0.1; four categorical predictors, randomly sampled from three (“high”, “moderate”, “low”) and four (“A”, “B”, “C”, “D”) categories, respectively; and ten numeric predictors, randomly sampled from a normal distribution (mean = 10, SD = 3) or from a uniform distribution (ranging from 0 to 10). These variables are labeled x1, x2, …, xn. The dataset also includes a dependent churn variable, of which the simulation method is explained next.
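Since the exact simulation code is not reproduced here, the sketch below shows one possible way to generate these predictors in R; the split of the categorical predictors over three and four categories (two each), the split of the numeric predictors over the normal and uniform distributions (five each), and the use of a simple Bernoulli draw with probability 0.5 for the binary predictors are assumptions for illustration.

set.seed(5)
n <- 1375  # number of observations in the control dataset

binary_vars <- replicate(4, rbinom(n, size = 1, prob = 0.5))                   # four binary predictors (assumed p = 0.5)
cat3 <- replicate(2, sample(c("high", "moderate", "low"), n, replace = TRUE))  # assumed: two predictors with three categories
cat4 <- replicate(2, sample(c("A", "B", "C", "D"), n, replace = TRUE))         # assumed: two predictors with four categories
num_norm <- replicate(5, rnorm(n, mean = 10, sd = 3))                          # assumed: five normal numeric predictors
num_unif <- replicate(5, runif(n, min = 0, max = 10))                          # assumed: five uniform numeric predictors

control_data <- data.frame(binary_vars, cat3, cat4, num_norm, num_unif,
                           stringsAsFactors = TRUE)
names(control_data) <- paste0("x", seq_len(ncol(control_data)))                # label variables x1, x2, ..., xn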

Since the real dataset contains a churn percentage of 35%, the goal in the artificial datasets is to reach a churn percentage of 35% as well, to make the results comparable. The effects of the predictor variables on the dependent variable are first determined at random. If the resulting churn percentage (P(y = 1)) in a simulated dataset lies far above or below 35%, the effects are slightly adjusted to make the datasets more comparable in churn percentages. This yields a minimum churn percentage of 34.35% and a maximum of 36.75% across all the datasets. The churn percentages of the datasets are also noted in appendix 8.1.
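The text above does not spell out the functional form used to generate the churn variable; the sketch below, which builds on the control_data object from the previous sketch, assumes a logistic link with randomly drawn coefficients on the numeric predictors and an intercept tuned so that the expected churn rate is close to the 35% target.

num_cols <- sapply(control_data, is.numeric)
X <- scale(as.matrix(control_data[, num_cols]))   # standardized numeric (incl. binary) predictors

set.seed(6)
beta <- runif(ncol(X), min = -0.5, max = 0.5)     # randomly drawn effects (assumption)

# Tune the intercept so that the average churn probability is close to the 35% target
target <- 0.35
intercept <- uniroot(function(b) mean(plogis(b + X %*% beta)) - target,
                     interval = c(-10, 10))$root

control_data$churn <- rbinom(nrow(X), size = 1, prob = plogis(intercept + X %*% beta))
mean(control_data$churn)                           # realized churn percentage, close to 0.35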
