Please do not leave me!


What are the effects of including customer heterogeneity for estimating customer churn probabilities in a non-contractual online retail setting?

University of Groningen

Faculty of Economics & Business

Prof. dr. Jaap Wieringa

July 7, 2017

Koen Schuurman

University of Groningen


Faculty of Economics and Business


MSc Marketing Intelligence

Master Thesis

July 7, 2017

Ossenmarkt 81

8011 MV Zwolle

Student number: S2697998

Email: [email protected]

Tel: +31 6 51 71 86 11

Supervisors

University of Groningen

First supervisor: prof. dr. J.E. Wieringa ([email protected])

Second supervisor: dr. J.T. Bouma ([email protected])

External supervisor XXX

University of Groningen


Management summary

Since the introduction of the Internet, the landscape of shopping has changed. Traditional brick-and-mortar stores encounter increasing competition from online stores, and this competition is expected to keep increasing. Additionally, customers can gather information more and more easily, leading to increased customer awareness and stronger customer preferences. As a consequence, the old-fashioned way of treating every customer in the same manner no longer applies; instead, personalization techniques are used to customize offers.

Higher customer expectations lead to an increasing number of customers switching companies. This means that companies have to invest in retaining their current customers, as opposed to acquiring new ones. Therefore, it is crucial for companies to identify customers who might churn. Prediction modeling could give companies the opportunity to find these customers. Customer churn exists in two different settings: the contractual and the non-contractual setting. Previous studies have mainly focused on the contractual setting, and models predicting non-contractual churn are scarce, as modeling non-contractual churn is more complex. This is because there is no clear definition of customer churn in a non-contractual setting. Therefore, arbitrary rules of thumb are often used. This study uses the interpurchase time between two orders as the indicator of churn. Furthermore, this study investigates whether including customer heterogeneity improves churn predictions. The aim of this paper is to answer the following research question: ‘What are the effects of including customer heterogeneity for estimating customer churn probabilities in a non-contractual online retail setting?’

First, suitable predictors of customer churn are analyzed, after which hypotheses are formulated. The final model includes fifteen variables that are expected to predict customer churn, divided into three sets of factors (RFM, behavioural and socio demographical factors). To test the hypotheses, an ordinary least squares multiple regression model is estimated.

In total, sixteen hypotheses are stated. Eight of them are supported by the results, which means that these findings are in line with the expectations based on prior studies. In addition, the model shows an inverse effect for four hypotheses, and for the remaining four hypotheses the results do not provide enough evidence to support them.

The results show that including customer heterogeneity in the model significantly improves the model’s fit. Further, the model shows a positive effect of order value on churn prediction, meaning that when the order value decreases, the churn probability decreases. This gives new insights, as previous research found the opposite effect. A possible explanation is that loyal customers order more frequently, but with a lower order value, as opposed to placing one large order. The results also show that the older a customer is, the higher the probability of churn. This could potentially be explained by a U-shaped effect of age, but additional research is needed. Furthermore, there could be an interaction effect between age and the length of the relationship with the company, meaning that the older someone is, the higher the chance that they have built a long-term relationship with the company.

and fourth segments are niche-segments. Segment one contains mostly families, who have an average interpurchase time of two months (58 days). Segment two contains mainly middle-aged customers. Segment three contains primarily young couples and singles, with an average interpurchase time of 23 days, and segment four contains older customers with an average interpurchase time twice as high as segment three.

This paper contributes to existing literature by showing that including customer heterogeneity in the model increases the model fit by 18.3%. This paper should be used as a first building block in developing a non-contractual churn prediction model that includes customer heterogeneity. Furthermore, managers can use the insights of this paper to identify possible churners. To conclude, several limitations and suggestions for future research are discussed, including replicating the study with a new dataset and including customer dynamics in the model. Therefore, it is recommended to test a model such as the GMOK model (Holtrop et al., 2017) in order to incorporate both effects.

Key words: Customer heterogeneity, churn prediction, non-contractual setting, latent class

Preface

This master thesis is written as the final part of the master Marketing Intelligence. All the stories told by former students about doing research for the master thesis are true. These students say that doing research is the most challenging part of the master, but also the part where you learn the most about yourself. After almost 5 months of research, I can totally agree with them. Personally, it was much heavier than I expected in advance. However, the great support of many people has ensured that I could do my job and that I can be proud of what I have done.

Therefore, I would like to say a few words of thanks. In the first place, I would like to thank XXX for offering me the opportunity to combine writing my master thesis with an internship. I am very grateful for having had the chance to complete the master Marketing Intelligence in this way. I especially want to thank XXX for all his support, understanding and feedback during the whole period. I also want to thank my two supervisors, Jaap Wieringa and Jelle Bouma, for their input, (mental) support and feedback. And last but not least, I would like to thank my friends and family for their unconditional support, not only during this period of doing research, but during my whole study period and especially during the last three years.

Table of Contents

Preface
1. Introduction
2. Literature review
2.1 Customer churn
2.2 Determinants of customer churn
2.2.1 Recency, Frequency & Monetary Value (RFM)
2.2.1.2 Frequency of purchases (Frequency)
2.2.2 Behavioural factors
2.2.3 Socio demographical factors
2.3 Customer heterogeneity in churn prediction
2.4 Conceptual model
3. Methodology
3.2 Set definition for non-contractual churn
3.3 Measurement of the constructs (Specification variables)
3.3.1 Dependent variable
3.3.2 Predictors
3.3.3 Control variables
3.3.4 Latent segmentation variables
3.4 Plan of analysis
3.4.2 Model specification formula
4. Results
4.1 Descriptive statistics
4.2 Data cleaning
4.3 Statistical Validity
4.3.1 Autocorrelation
4.3.2 Heteroscedasticity
4.3.3 Multicollinearity
4.3.4 Non-normality
4.4 Model estimation
4.4.2 Main effects
4.4.3 Effect control variables
4.5 Latent Class segmentation
4.5.1 Model selection
4.5.3 Profiling the segments
4.6 Summary of the results
5. Discussion
5.1 Conclusions
5.3 Managerial contributions
5.4 Limitations
5.5 Avenues for future research

1. Introduction

In 2017, we live in a world with approximately 3.5 billion Internet users. This means that about 45 percent of the global population can access the Internet (Statista.com, 2017). The Internet is not only an important element of modern life, but also of our economy. In 2015, Europe’s total gross domestic product was around 17.6 trillion euros, of which the European Internet economy accounted for 2.59%. This percentage is expected to double by 2020. Europe Ecommerce (2016) reported that the turnover of the European e-commerce market increased by more than 21 per cent in 2016. In 2014, the total turnover of the e-commerce market was 402 billion euros, which grew to 510 billion euros in 2016. This growth can (partly) be attributed to the increasing number of Internet users and the rise of e-commerce: a revolution called “Big data” has started.

With the Internet, it became easy to obtain information about products and services. This unique characteristic of the Internet not only lowered search costs for consumers, but also increased the amount of information that can be gathered. Markus Tuschi, Global Director Digital Retail at research agency GfK, states: “success in today’s retail world means meeting shoppers’ expectations. As shoppers, in particular the younger ones, increasingly switch between purchase channels, retailers need to meet them where they are: everywhere”. In order to become or remain successful, companies are more and more interested in getting a better understanding of who their customers are. This led to a change in the mindset of companies, which also contributed to the fast revolution of Big Data. Nowadays, companies are highly motivated to store (almost) every possible piece of data in their databases. These huge amounts of data give companies the opportunity to generate insights about their customers at a detailed level. In other words, these data enable companies to shift towards becoming more customer-centric organizations.

Being customer-centric means that companies do not only focus on the acquisition of new customers, but also on retaining current customers by improving satisfaction (Shah, Rust, Parasuraman, Staelin & Day, 2006). An important development in this process is the increased attention for Customer Relationship Management (CRM) by top management. This increased attention is not surprising, since CRM was found to be an effective instrument with which companies can enhance their competitive advantages and improve customer satisfaction and loyalty (Chan & Li, 2006). Another positive effect was found by Blattberg, Kim & Neslin (2008), who state that CRM does not only improve customer satisfaction, but also enhances marketing productivity through more effective acquisition, retention, and development of customers.

Tsai & Lu (2009) further confirmed the importance of being customer-centric: they found that retaining existing customers and decreasing churn are both essential to sustain growth and maximize profits. It is not surprising that churn is an essential element, given that the cost of acquiring new customers exceeds the cost of retaining customers (Kotler, 2001). Prior studies (e.g. Kotler 2001; Chen 2016) indicate that acquiring a new customer is 5 to 25 times more expensive than retaining an existing one. The results of these studies are logical: when retaining a customer, no resources need to be spent on attracting a new client. Rather, the company needs to satisfy the customers it has already attracted.

existing literature (e.g. Buckinx & van den Poel; Neslin et al. 2006; Reinartz & Kumar 2004), a customer ending his/her ‘commercial’ relationship with the company is known as customer churn, defection or customer attrition. In this research the term customer churn is defined as: the probability that a customer leaves the firm in a given period (Blattberg, Kim and Neslin, 2008).

Given the costs of attracting new customers or losing existing customers, it would be beneficial for companies to predict customer churn. In addition, Fader & Hardie (2009) suggest that companies may choose to adopt a targeted strategy that uses customers’ transactional data to predict their future behaviour. This future behaviour can in turn reveal patterns that point to a customer in the process of churning.

Before generating insights into customer churn, an important distinction must be made between contractual churn and non-contractual churn. Contractual churn, which has been investigated most in previous studies (e.g. Fader & Hardie 2010, Ascarza, Bruce & Hardie 2013), concerns customers who end (and do not immediately renew) their relationship with the company after the expiration date, e.g. in the telecom industry. Non-contractual churn occurs in a setting (e.g. the online retailing sector) that suffers from the problem that customers can continuously change their purchase behaviour without informing the company about it (Buckinx & van den Poel, 2005). It is important to make this distinction since it is completely inappropriate to apply a model developed for a contractual setting in a non-contractual setting (Clemente-Císcar et al., 2014). This means that in a non-contractual setting, a churn model must be able to deal with the ever-changing preferences and needs of customers. Models for contractual settings, which can be typified as standard and generally static models, lack the ability to account for these ever-changing preferences and needs, i.e. customer heterogeneity.

The number of Internet users and the accompanying number of online shoppers have grown rapidly in recent years. The wide range of possibilities that the Internet offers ensures that customers have plenty of choice. As a result, customers are much more willing to switch, especially when a company does not meet their specific needs or preferences. In order to deal with this issue, companies are adjusting their business to a more customized setting. The practice of approaching every customer in the same manner is declining, while the use of customized offers to create realistic and relevant offerings is increasing. To elaborate on this approach, the population first needs to be divided into groups based on heterogeneity. In marketing, segmentation is used to deal with heterogeneity among customer needs and preferences: customers are placed into smaller, more homogeneous subgroups. Prior literature (e.g. Schreiber, 2016; Bandeen-Roche, Miglioretti, Zeger & Rathouz, 1997) shows that latent class regression is a successful method to accommodate customer heterogeneity. This type of model is able to identify segments in the sample.

With such segmentation approaches available, including customer heterogeneity is becoming even more relevant for modeling churn in a non-contractual setting. As stated by Holtrop (2011), standard churn models predict churn without identifying the high-risk customers and separating them from the other customers, which means that they do not include heterogeneity. Therefore, this paper tries to fill this gap and determine the extent to which including customer heterogeneity improves customer churn predictions in a non-contractual setting. This leads to the following research question:

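‘What are the effects of including customer heterogeneity for estimating customer churn probabilities in a non-contractual online retail setting?’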

This research generates new insights on the added value of improving customer churn predictions by including customer heterogeneity in the model. Furthermore, the existing literature contains several loose ends regarding customer churn prediction estimations in a non-contractual setting, which will be addressed in this paper by covering these aspects of the drivers of non-contractual churning.

This research has several managerial contributions. First, it will enable companies to predict customer churn in a non-contractual setting. The model increases companies’ chances of identifying, based on heterogeneous characteristics, customers who might leave the organization. With the estimated results, i.e. predicted customer churn probabilities, managers are able to make better and more data-driven policy regarding building sustainable relationships with customers. This will provide managers with an excellent opportunity to maximize profits and provides insights as to whether managers must consider retention programs (customized for the segments) in order to keep customers motivated to stay with the company.

2. Literature review

This section provides a deeper understanding of the existing literature on customer churn in a non-contractual setting and its determinants. First, an overview of the main topic, customer churn, is given. Second, literature on customer churn in a non-contractual setting is analysed, followed by literature on the determinants of customer churn. Throughout this section, hypotheses are formed to answer the research question. Third, an explanation of the use of customer heterogeneity in predicting churn is given. Finally, a graphical illustration of the research in the form of a conceptual model is presented.

2.1 Customer churn

Companies are investing large amounts of money to become data driven: they are investing in the right personnel, external datasets, data storage, software and such. All these investments are made with the purpose of generating insights from data. One of those insights is whether a customer is still actively buying (alive), which is the subject of so-called customer churn prediction. Kamakura et al. (2005) define customer churn as “the tendency for a customer to defect or cease with a company”. According to Blattberg, Kim & Neslin (2008), customer churn is defined as ‘the probability the customer leaves the firm in a given period’. The latter definition of customer churn is used in this research.

In order to fathom the phenomenon of customer churn, this topic has been extensively studied during the last few decades. Churn can be subdivided into contractual churn and non-contractual churn. The distinction between contractual and non-contractual business settings is fundamental in terms of modelling, since it is completely inappropriate to apply a model developed for a contractual setting in a non-contractual setting (Clemente-Císcar et al., 2014). Contractual churn describes a situation in which customers have a ‘contract’ with the company for a certain time period, e.g. within the telecom industry. This type of churn has been investigated most in previous studies. Non-contractual churn occurs in a setting that suffers from the problem that customers can continuously change their purchase behaviour without informing the company about it. Fader & Hardie (2009) conclude that the nature of non-contractual settings imposes ambiguity when defining churn, since there is no contract between the company and the customers, rendering companies incapable of observing the exact time of customer churn. As a result, extra information is needed when customer churn has to be modeled in a non-contractual setting (e.g. (online) retail). This is necessary because in a non-contractual setting, the predominant challenge for companies is to infer whether a customer is still active or not. Since this setting is less stable and depends more on the environment in which a company is active, studies concerning non-contractual churn use different definitions of customer churn, which are mostly based on simple rules of thumb. Buckinx & van den Poel (2005) state that customer churn is a deviation from an established behavioural transaction pattern. But in order to determine whether a customer deviates from his or her established pattern, companies must be able to identify whether a customer shows active behaviour. To do so, most companies define a customer as active based on such simple rules of thumb (Leeflang et al. 2015). Leeflang et al. (2015) state that even a major company such as eBay uses a rule of thumb to define non-contractual customer churn: eBay considers a customer active if he or she bid, bought or listed on the site within a period of 12 months. If a customer does not fulfil these criteria during that period, eBay considers that customer as churned, i.e. the customer changed their transaction pattern from active to passive.
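
The rule of thumb described above can be sketched in a few lines of Python. This is only a minimal illustration, not eBay's actual implementation: the function name `is_churned`, the approximation of twelve months as 365 days, and the sample customers are all assumptions made for the example.

```python
from datetime import date, timedelta


def is_churned(last_activity: date, today: date, window_days: int = 365) -> bool:
    """Flag a customer as churned when the last activity lies outside the window.

    window_days=365 is an assumed approximation of a 12-month period.
    """
    return (today - last_activity) > timedelta(days=window_days)


# Hypothetical customers with the date of their last bid, purchase or listing.
last_seen = {
    "A": date(2017, 5, 1),
    "B": date(2016, 1, 15),
}
today = date(2017, 7, 7)

flags = {customer: is_churned(seen, today) for customer, seen in last_seen.items()}
print(flags)
```

A rule like this is easy to apply, but the window length is arbitrary: a customer who normally orders once every two years would be flagged as churned even though the behaviour has not changed.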

important to clarify which variables determine the definition of customer churn in a non-contractual setting. Clemente-Císcar et al. (2014) claim that they have developed a formula that is the most valuable definition of non-contractual customer churn, based on the expected economic benefits in the form of maximizing both the profit and the return of the retention campaign. In their formula, the researchers use the average and the coefficient of variation of the variables interpurchase time, purchase amount, relationship length and number of purchases. In order to gain insights into customer churn probabilities, customer churn modeling can be applied. Customer churn modeling is a process of calculating the probability of future churning behaviour for each customer in the database, using a predictive model (Coussement and De Bock, 2013). Such models are based on recognizing patterns in past information on, or prior behaviour of, a customer. Using those patterns, the model can estimate how much a customer deviates from his or her normal pattern, i.e. the probability of customer churn. To understand how prior behaviour can lead to accurate churn predictions, several studies (e.g. Bhattacharya 1998, Mittal and Kamakura 2001) have investigated churn or retention drivers. These drivers can give companies the opportunity to improve the effectiveness of retention, through early identification of a specific pattern that could imply potential churn by a customer.
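
As an illustration of the two summary statistics mentioned above, the following Python sketch computes the average and the coefficient of variation of one customer's interpurchase times. It does not reproduce the formula of Clemente-Císcar et al. (2014); the function name `mean_and_cv`, the use of the population standard deviation, and the sample values are assumptions.

```python
from statistics import mean, pstdev


def mean_and_cv(values):
    """Return (mean, coefficient of variation) for a list of numbers.

    The coefficient of variation is the standard deviation divided by the
    mean; the population standard deviation is an assumed choice here.
    """
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return m, pstdev(values) / m


# Hypothetical interpurchase times (in days) of a single customer.
interpurchase_days = [20, 30, 25, 25]
avg, cv = mean_and_cv(interpurchase_days)
print(avg, round(cv, 3))
```

A low coefficient of variation indicates a regular purchase rhythm, so a long gap since the last order is a stronger deviation signal for such a customer than for one with an irregular rhythm.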

2.2 Determinants of customer churn

To date, customer churn prediction modeling is a well-studied topic. However, a suitable model for a non-contractual setting is yet to be found. Buckinx and van den Poel (2005) claim that they have identified the (most) valuable drivers of customer churn. The study by Clemente-Císcar et al. (2014), which is used as inspiration for the definition of non-contractual customer churn in this paper, contains the same variables. Therefore, 15 variables from the same set of variables are used to investigate the determinants of customer churn in this paper. These 15 determinants of churn can be divided into three sets of factors, namely RFM (Recency, Frequency and Monetary Value), behavioural and socio demographical factors. The relationship between customer churn and each variable is explained in the following sections, as well as their effect on predicting churn.

2.2.1 Recency, Frequency & Monetary Value (RFM)

Recency, Frequency and Monetary Value (RFM) metrics are often used for analyzing and determining customer value. Coussement and De Bock (2013) state that RFM variables have been shown to be effective in predicting customer churn. In order to represent the RFM metrics, the following variables are used in this paper: interpurchase time, frequency of purchases and purchase amount.

2.2.1.1 Interpurchase time (recency)

2.2.1.2 Frequency of purchases (Frequency)

Frequency is a measure of the strength of the customer relationship with the company. More specifically, Khajvand, Zolfaghar, Ashoori & Alizadeh (2011) define frequency as the number of purchases made within a certain period. They state in their study that a higher frequency indicates greater loyalty. According to Bolton, Lemon, & Verhoef (2004), the greater the loyalty of a customer, the lower the probability that this particular customer churns. Reinartz and Kumar (2000) state that the purchase frequency of a customer is an appropriate measure to calculate the probability of a customer being active or inactive. This statement is supported by the findings of Tamaddoni, Stakhovych & Ewing (2016). They found that including the frequency of purchases in the model leads to better performance in predicting customer churn. More precisely, the lower the number of purchases of a customer, the higher the churn probability of that customer. A broader perspective has been adopted by Athanassopoulos (2002), who argues that the frequency of purchases is a strong indicator of customer loyalty. He claims that customers who are less loyal are more willing to switch.

In this paper the frequency of purchases is calculated based on the number of orders made by a customer during the observation period. Therefore the following hypothesis is stated:

H1a: If the number of orders decreases, the probability that this customer churns increases.

2.2.1.3 Order value (monetary value)

Monetary value is a measure of the amount spent by a customer at a certain company. According to Rud (2001), monetary value is the least powerful RFM dimension with regard to predictive ability. Rud (2001) underpins this with the fact that the total value of the orders is directly correlated with the frequency, and therefore an average value is required. However, in combination with the other two dimensions (recency and frequency), it is still valuable to use. Reinartz and Kumar (2002) state that customers who spend more are more loyal (i.e. have a longer relationship with the company) and willing to pay higher prices. Two things, trust and consistency, drive this particular spending behaviour. First, Rogers, Dale & Tibben-Lemke (2001) state that customers prefer reduced risk. So, when customers perceive reliability, they do not have to reconsider the other options on a regular basis. Second, a study by Simonson (1993) reports that this also has to do with customers' urge to be consistent in what they are doing. Given this, the following hypothesis about monetary value is stipulated:

H1b: If the order value decreases, the probability that this customer churns increases.
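
The three RFM variables can be derived from a customer's order history along the following lines. This is a sketch under stated assumptions rather than the procedure used in this thesis: the function name `rfm_metrics`, the tuple layout of an order, and the sample orders are hypothetical, and order value is taken as the average per order, in line with the argument of Rud (2001).

```python
from datetime import date


def rfm_metrics(orders):
    """Compute RFM-style metrics for one customer.

    orders: list of (order_date, order_value) tuples within the
    observation period (an assumed input layout).
    """
    orders = sorted(orders)
    dates = [order_date for order_date, _ in orders]
    values = [order_value for _, order_value in orders]
    # Days between each pair of consecutive orders.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return {
        "frequency": len(orders),
        "mean_interpurchase_days": sum(gaps) / len(gaps) if gaps else None,
        "mean_order_value": sum(values) / len(values) if values else None,
    }


# Hypothetical order history of a single customer.
orders = [
    (date(2017, 1, 1), 40.0),
    (date(2017, 1, 31), 60.0),
    (date(2017, 3, 2), 50.0),
]
metrics = rfm_metrics(orders)
print(metrics)
```

Note that a customer with a single order has no interpurchase time at all, which the sketch represents as `None`; how such customers are treated is a modelling decision that the sketch leaves open.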

2.2.2 Behavioural factors

Behavioural factors strengthen the performance measurement of relationship status next to conventional accounting measures such as the recency, frequency & monetary value factors. The behavioural factors reflect the consumers’ actions, preferences and decisions by which the relationship status is influenced. In order to represent the behavioural factors, the following variables are used in this paper: number of product returns, preference of payment, sensitivity to promotions, duration time of a session, number of sessions, length of the relationship and type of device.

2.2.2.1 Number of product returns

by the proven fact of expectancy disconfirmation. According to Fennis and Stroebe (2015), consumers form expectations about products and their performance before buying the product. Rajamma et al. (2007) state that the disadvantage of online retailing compared to offline shopping is that customers lack the ability to feel, taste or smell the product before purchase and therefore have to rely on the information provided by the online retailer. However, those expectations are often higher than the ‘real deal’. As a consequence, consumers are more likely to return their ordered products. Fennis and Stroebe (2015) state that this expectancy disconfirmation results in decreased customer satisfaction. Thus, the return rate affects customers’ satisfaction. Customers with a higher return rate are less satisfied and therefore tend to be less loyal.

H2a: When a customer has an increasing return rate, the probability of a customer to churn increases.

2.2.2.2 Preference of payment

When finalizing a purchase online, a customer can pay either by debit or by credit. In this paper, debit is defined as paying for the purchase immediately, and credit is defined as paying later. Customers paying by credit accept that they have a ‘debt’, meaning they potentially have to pay interest. Kirchler, Hoelzl and Kamleitner (2008) state that customers do not consider buying a product on credit an obstacle. According to Soman (2001), when the payment mechanism requires the consumer to write down the amount paid (rehearsal) and when the consumer's wealth is depleted immediately rather than with a delay (immediacy), the purchase intention of that consumer is reduced. Customers who buy on credit probably do not prefer to have debts; however, those debts offer companies an opportunity to increase the length of the relationship. Finally, a prior study by Mozer, Wolniewicz, Grimes, Johnsen and Kaushansky (2000) shows that credit information is a useful predictor of churn. Therefore, the following hypothesis regarding payment preference is stated:

H2b: If a customer pays immediately, this customer has a higher probability to churn.

2.2.2.3 Sensitivity to promotions

Prior studies confirmed the effect of promotions on shopping behaviour. Fennis and Stroebe (2015) found that (price) promotions are focused on generating an immediate behavioural response from a customer. Tsao, Lin, Pitt & Campbell (2009) describe the promotional effect as the extent to which promotional activities induce customers to switch. More precisely, customers who are sensitive to promotions are less brand loyal and less store loyal. For them, lower prices are the reason for their purchases. These customers typically do not develop a relationship with one specific company (Bawa and Shoemaker, 1987; Buckinx and van den Poel, 2005). In addition, Tsao et al. (2009) state that the greater the promotional effect, the greater the retention. For this paper, this statement can be interpreted from the opposite perspective, namely that promotional activities have a positive effect on churn. Therefore, the following hypothesis is stated:

H2c: If the orders via promotional (marketing) channels decrease, the probability of a customer to churn increases.

2.2.2.4 Duration time of a session and the number of sessions

the other hand it has a negative effect on visit duration on a website. As stated by Bowen & Chen (2001), loyalty is positively related to customer satisfaction. The findings of Gustafsson, Johnson & Roos (2005) are in line with prior studies on satisfaction and loyalty intentions, where it has been established that customer satisfaction has a negative effect on churn. According to Bucklin and Sismeiro (2003), a longer duration of a session helps to maintain users’ interest in a site or a company. Based on this finding, the following hypothesis is stated:

H2d: As the duration time of a session decreases, the probability of a customer to churn increases.

However, duration time does not only have an effect on maintaining an interest in a website or company. Moe and Fader (2004) built upon this finding and found that enhancing user interest has a positive effect on revisit intentions and therefore on long-term sales. This led to the following hypothesis:

H2e: As the number of sessions decreases, the probability of a customer to churn increases.

2.2.2.5 Length of the relationship

The length of the relationship might be a strong indicator of the status of the relationship between a customer and the company. Bolton (1988) found a positive relationship between customer satisfaction and the length of a customer relationship. In addition, customer satisfaction is found to be an important driver of customer retention (Chen, 2016). The findings of Bolton (1988) are supported by the study of Wei, Lin, Weng & Wu (2012). They suggest that the longevity of a relationship with a customer affects customer loyalty: the longer the relationship, the more loyal the customer is. This has also been shown in prior research by Verhoef (2003), who found that customers with longer relationships are less likely to churn. These findings were later endorsed by the study of Risselada et al. (2010). In line with the findings above, the following hypothesis is stated:

H2f: The shorter the relationship between a customer and the company, the higher the probability of customer churn.

2.2.2.6 Type of device

Consumer behaviour has fundamentally changed since smartphones and tablets became prevalent in daily activities and habits. As a result, customers have easy access to user reviews, expert opinions, price comparisons and other sources to find the information they are looking for. According to Papadopoulou (2017), consumers increasingly prefer shopping on a tablet over shopping on a mobile phone. This is mainly because consumers consider shopping on a mobile phone a risk due to the lack of security and privacy. In addition, Dai, Forsythe and Kwon (2014) found that perceptions of risk negatively influence purchase intentions. According to Gustafsson, Johnson & Roos (2006), a low level of satisfaction, by means of lower purchase intention, has a negative effect on customer churn. Given all this information, the following hypothesis is stipulated:


In addition to the findings above, Cozzarin & Dimitrov (2016) state that consumers perceive more risk when shopping online on a mobile device than on a standard desktop/laptop. Consequently, the probability of customers buying on a desktop/laptop is greater than on a mobile device. Therefore, the following hypothesis is stated:

H2h: If a customer uses a tablet over a desktop/laptop as his/her device to shop, the probability of a customer to churn increases.

Looking more closely at the different devices, both Cozzarin & Dimitrov (2016) and Papadopoulou (2017) claim that customers who use more than one device shop more online. Cozzarin & Dimitrov (2016) state that perceived risk holds consumers back less when they use both a mobile device and a desktop/laptop for online shopping. Shopping on more devices also implies that a customer has, relatively speaking, more chances to come into contact with the company. Based on this, the following hypothesis is stated:

H2i: If a customer shops only on one device, the probability of a customer to churn increases.

2.2.2.7 Cross category-buying

Cross category-buying leads (in a causal sense) to, or is an antecedent of, behavioural loyalty, measured as relationship duration, buying frequency and share of wallet. Reinartz et al. (2008) claim that customers who buy across categories are inclined to be more loyal. A data science and prediction-modeling expert from an online retail company supports this statement with findings from a customer base analysis (Vos, 2017). These findings show a higher future value for customers who buy in several categories not only for themselves but also for others, for example a woman who orders products for herself, her husband and her child. Given this increased loyalty, the earlier mentioned findings of Gustafsson, Johnson & Roos (2005) are applicable; they found that customer satisfaction has a negative effect on churn. Therefore, the following hypothesis regarding cross category buying is stated:

H2j: If a customer only shops in one category, the probability of customer churn increases.

2.2.3 Socio demographical factors

Next to behavioural and RFM factors, socio demographical factors are extensively used in predicting churn. Buckinx and van den Poel (2005) summarized the customer socio demographics that are extensively used in other studies of customer defection. Based on this selection and in line with the possibilities of the data, the following three variables are used to represent the socio demographical factors in this model: age, gender and urbanization level.

2.2.3.1 Age


shopping. On top of that, older people have less intention to switch and prefer to remain in the relationship they already have. According to Monschis (2003), this is due to the fact that older people have a stronger need for convenience and therefore prefer offline shopping over online shopping. Based on all of the above, the following hypothesis is stated:

H3a: The probability of churning increases when a customer gets older.

2.2.3.2 Gender

The existing body of literature on gender as a predictor of customer churn is inconclusive. For example, Borle, Siddharth and Dipak (2008) stated that gender does not have a significant effect on whether or not a customer churns. Additionally, Garbarino & Strahilevitz (2004) state that men shop online just as much as women do. According to Naseri and Elliot (2011), an exception to this general pattern is the online purchase of clothing, where buyers are more likely to be women. In addition, Eckel & Grossman (2008) state that women are found to be more risk-averse. In line with the findings mentioned above, women are thus more concerned with the risks that come with shopping online. Following Cant, Hefer & Machado (2013), when the perceived risk of purchasing decreases, the willingness to buy increases. In conclusion, the following hypothesis about gender is stated:

H3b: Females have a higher probability of churn than males.

2.2.3.3 Urbanization level

According to Farag, Krizek & Dijst (2005) customers who live close to shops are more likely to buy online than customers who live far away from those shops. This is supported by the study of Farag, Weltevreden, van Rietbergen, Dijst & van Oort (2006) where the findings indicate that in urban areas, online shopping is more popular in comparison to weakly urbanized and rural areas. Based on these findings the following hypothesis is stated:

H3c: Customers who live in suburban environments have a higher probability to churn than customers who live in cities.

2.3 Customer heterogeneity in churn prediction

The essence of marketing, or the existence of marketing, is devoted to understanding consumer preferences and needs (Allenby & Rossi, 1998). A major challenge in marketing is to understand the diversity of customers' preferences and sensitivities that exist in the market. According to Leeflang et al. (2009), this has not led to a change in modeling, as they state that models often assume consumer homogeneity, i.e. that consumers have the same characteristics. However, goods and services can no longer be offered without considering customers' needs and recognizing the heterogeneity of those needs. As written in the book 'Market segmentation' by Wedel & Kamakura (2000), it was Smith (1956) who recognized the existence of heterogeneity in the demand for goods and services. Smith concludes that market segmentation involves viewing a heterogeneous market as a number of smaller homogeneous markets. These smaller homogeneous markets arise in response to differing preferences, attributable to the desire of consumers for more precise satisfaction of their varying wants.


Nowadays, segmentation is becoming even more important. The wide availability of information online, and the easy access to it, allow customers to gather information easily. This in turn can affect their preferences and needs (Kannan, 2001). According to Lohse, Bellman and Johnson (2000), the consequence is that companies are expected to meet the needs and preferences of the customer by customizing their website in real time, dynamically adjusting the content presented to users as they interact with the Web.

Due to all these changes, a growing need arises for customer churn models that also account for customer heterogeneity. It is almost unthinkable to treat all customers in an equal manner, because the variation among customers can be exposed more easily and clearly. However, these changing needs and preferences cannot be included in standard churn models. This is because standard churn models predict churn without identifying the high-risk customers and separating them from the other customers, which is a form of heterogeneity (Holtrop, 2011). Moreover, according to Allenby & Rossi (1998), standard churn models based on consumer purchase behaviour often do not recognize that preferences and choices are interdependent. For that purpose, churn models that include customer heterogeneity should be developed.

Lim, Currim & Andrews (2005) found that segmentation on customer preferences greatly enhances the usefulness of outcomes for management. In contrast to this emphasis on individual differences, economists are often more interested in aggregate effects and regard heterogeneity as a statistical nuisance parameter problem, which must be addressed but not emphasized. However, emphasizing heterogeneity would improve predictive performance. As stated by Risselada, Verhoef, and Bijmolt (2010), in many applications managers have access to panel data, and each customer can be tracked across multiple time points. Tracking customers across multiple time points improves the predictive performance of customer churn models. Besides exploiting variation over time, this improvement can also be realized by accounting for heterogeneity in customer response. This finding is supported by Holtrop (2011), who stated that incorporating heterogeneity through clusters improves customer churn predictions relative to the predictions made by a model that only relies on the information contained in the variables. Taking all this information into account, this study stipulates the following hypothesis:

H4: Customer heterogeneity moderates the effects of all previously discussed drivers on churn probability.

In essence, including heterogeneity will give researchers the opportunity to model behaviour of a specific individual and therefore produce outcomes (predictions) that are relevant for that specific individual. It is expected that the changing environment will influence marketing modeling in the years ahead. The changes will transform marketing models from static decision aids into dynamic real-time tools embedded in systems that are more useable by managers (Wierenga 2009).


2.4 Conceptual model

The conceptual model is presented below in Figure 1-1, to give a clear overview and to visualize the expected relationships based on the hypotheses proposed in the Literature review section. Appendix A contains a table in which the hypotheses are summarized.

Figure 1-1 Conceptual model

Research question: ’What are the effects of including customer heterogeneity for estimating


3. Methodology

This section aims to provide a deeper understanding of how the stated hypotheses will be measured in practice and how the research question will be answered. First, an overview is given of how the data is collected for this study. After this overview, the method used to define customer churn in this paper is discussed, along with the operationalized determinants of customer churn. Next, the plan of analysis describes the modeling procedure that has to be followed in order to generate insights with the suggested model. Finally, the model is specified and the mathematical formula is given.

3.1 Data collection

A large online retailer has provided access to their database for this study. Using the software SAS Enterprise Guide 7.1, a customized dataset has been created. Since this study focuses on the heterogeneity in customer behaviour, the dataset has to include historical data. Therefore, panel data have been used to analyze the behaviour of individual customers over a period of two years.

The size and richness of these databases offer several opportunities for interesting studies. However, in this study the personnel of the company are excluded. The main reason is that the behaviour of employees is not representative of the behaviour of "normal" customers. As employees receive benefits (extra discounts and special offers that are tailored for personnel only) and browse the website for work-related purposes, their behaviour deviates from that of usual customers.

Despite the richness of the data, it is important to demarcate the scope. Therefore, only data from fashion-buying customers are used. A further decision made to reduce the size of the dataset concerns the length of the observation period stored in the set. As stated in the literature review (see section 2), interpurchase time plays a major role in predicting customer churn. The purchase turnover of fashion is considered short-cyclical, which means that customers make a repurchase relatively quickly (Ecommerce Benchmark & Retail report, 2016). Because of this short cycle, a customer who stops buying exceeds the interpurchase time threshold sooner, which increases the chance of detecting churning customers. Therefore, a shorter period of data suffices for mapping churn behaviour in this study.

The data in this study covers a period of two years, from the 1st of May 2015 until the 1st of May 2017. This period is chosen on two grounds: firstly, it provides enough opportunities to detect churning customers, and secondly, it controls for the effects of specific campaigns held within a period.

After reducing the data by creating a sample tailored for this study, the sample is still too large to be computationally feasible when using all observations. Therefore a subsample is created, containing 5000 randomly selected customers. These 5000 customers have one condition in common: they all made a purchase in the first week of the dataset (1st of May 2015 up to and including the 8th of May 2015). The behaviour of these selected customers is followed day by day, by means of 18 explanatory and 7 control variables, over a period of two years (1st of May 2015 up to and including the 1st of May 2017). The created dataset is a so-called panel dataset. Panel dataset


3.2 Set definition for non-contractual churn

In order to predict churn in a non-contractual online retail setting, a definition of churn has to be set for this study. As stated in the literature review, there is no clear definition of non-contractual churn. The most appropriate way to set a definition is by making use of the recency, frequency and monetary-value variables. The scope of this study is to test how an Ordinary Least Squares (OLS) regression model, which contains one dependent variable, extended with latent class segmentation, performs in a non-contractual environment. It is important to note that the purpose is not to develop a prediction model that can deal with a threshold of three dependent variables instead of one. Firstly, any weighting system between these three variables would be arbitrary. Secondly, such a model becomes complicated to manage properly. For example, has a customer who exceeds the threshold on only one of the three variables churned? And to what extent do the three variables carry equal weight in reflecting churn intentions? In conclusion, as argued in the literature review, interpurchase time is chosen as the main dependent variable.

3.3 Measurement of the constructs (Specification variables)

This section contains a description of the measurement approach for each variable used in the model in this study. In order to create a useful dataset from the accessible database, the data first have to be aggregated to one common level. The aggregation level of the original data in the database varies considerably, from detailed information about customers' behaviour within one session to more general information such as the demographics of the customer. The difficulty of bringing data onto one level of aggregation lies in the information that is lost as a consequence of choosing a particular level. In this study, a dataset is created containing data on a customer-order aggregation level. The variables are described in the same sequence as in section 2 'Literature review'. In addition, some variables contain log in their label, which indicates that these variables have been logarithmically transformed. Logarithmic transformation is required due to the use of a multiplicative model, in which variables cannot take a zero or negative value. This does not apply to dummy variables, which are placed in the exponent. All the variables, including a short description, are elaborated on in section 3.4 'Model specification'.

3.3.1 Dependent variable

Interpurchase time

The time between two orders at time T and T-1 is measured as the number of days between the two orders. To calculate the interpurchase time, the day of the order at time T-1 is subtracted from the day of the order at time T. If only one order is placed, so that T equals T-1, the value is zero. All orders placed on the same day are indicated with one in this study. The variable interpurchase time is a continuous variable and is labeled as logIPT in the dataset.
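The calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to build the dataset: the function name and the example dates are hypothetical, and coding same-day orders as one day is one reading of the rule given in the text.

```python
from datetime import date


def interpurchase_times(order_dates):
    """Days between consecutive orders of one customer.

    Assumptions (hypothetical reading of the rules in the text):
    - a customer with a single order gets the value zero;
    - consecutive orders placed on the same day are coded as one.
    """
    dates = sorted(order_dates)
    if len(dates) == 1:
        return [0]
    ipt = []
    for previous, current in zip(dates, dates[1:]):
        days = (current - previous).days
        ipt.append(max(days, 1))  # same-day orders are coded as one
    return ipt


# Hypothetical order history of one customer
orders = [date(2015, 5, 1), date(2015, 5, 1), date(2015, 5, 15), date(2015, 7, 1)]
print(interpurchase_times(orders))  # [1, 14, 47]
```

A logarithm can only be taken of strictly positive values, so how the zero for single-order customers is handled before creating logIPT would need to be decided separately.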

3.3.2 Predictors

Frequency of purchases


Purchase amount

Each order can contain one or more items. Each item in an order has its own price. Adding up the prices of the items gives the total order value in euros at time T. This variable is labelled in the dataset as logOrder_value.

Number of return products

The ReturnRatio is calculated by dividing the number of returns by the number of products ordered, for the period between T and T-1. The variable is a ratio ranging from 0 to 1, which can be interpreted as a percentage.

Preference of payment

A customer can choose to pay by either debit or credit. While the dataset does not contain detailed information about the financial transactions of customers, it does indicate which form of payment a customer prefers. This information is taken from the choice of payment made before entering the private financial area. The customer's preference is indicated by a dummy variable, where one represents a preference for credit payment and zero for debit payment. The variable is labeled in the dataset as PP_credit.

Sensitivity to promotions

In order to measure the construct of sensitivity to promotions, this study uses the marketing channel through which a customer entered the website. A distinction is made between a customer entering the website through a paid marketing channel, or through an unpaid marketing channel. The dummy variable SenseToPromo has a value of 1 when the customers entered the website through a paid channel, and zero otherwise.

Number of visits

The total number of visits is measured by the sum of all sessions of a customer in the period between T and T-1 during the observation period. This variable is continuous and is presented in the dataset as logNo_Sessions.

Session duration time

Session duration time is measured by the total duration time of all sessions of a customer in the period between T and T-1 during the observation period. This variable is presented in the dataset with logDuration_Sessions in the form of a continuous variable.

Relationship length

The age of the account indicates how long the account of the customer has existed, i.e. how long the customer has had a relationship with the company. The starting date of the account is subtracted from the first order date (1st of May 2015). This number is translated into months. The variable is continuous and is labelled in the dataset as logLoR_Month.

Type of Device


Cross category buying

Each article in the dataset is linked to the category men, women or kids fashion. Products for which the category is unknown or unisex are labelled others. In order to see whether a customer buys across categories, i.e. whether an order of a customer contains fashion products that belong to multiple categories, an ordinal variable with four levels is created.

This variable is labeled in the dataset as CCB and it contains the following four levels:

Table 3-1 Cross category buying levels

Age

In order to calculate the age of the customer, the date of birth is subtracted from the date of order. This number is translated into years. The variable is continuous and is labeled in the dataset as logAge_years.

Gender

For gender, a dummy variable is created, where one indicates a female customer and zero a male customer. The variable is labeled in the dataset as Gender.

Urbanization level

The variable urbanization level is based on external data provided by CBS.nl. The data are connected to the dataset through the four-digit zip code. In this way, 4766 four-digit zip codes are divided into five levels of urbanization, based on the number of inhabitants per km².

Table 3-2 Levels of urbanization

In this study, dummies are made for the levels 1, 2, 4 and 5, where a one indicates that the customer lives in an area that belongs to that level of urbanization and a zero indicates that this is not the case. Urbanization level 3 is set as a benchmark and is therefore not included as a dummy variable. The dummy variables are labeled as follows in the dataset: UL_1, UL_2, UL_4 and UL_5.
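The dummy coding with level 3 as the benchmark can be sketched as below. The function name is hypothetical; only the labels UL_1, UL_2, UL_4 and UL_5 come from the text.

```python
def urbanization_dummies(level):
    """Return the dummies UL_1, UL_2, UL_4 and UL_5 for a level in 1..5.

    Level 3 is the benchmark, so it maps to all zeros.
    """
    if level not in (1, 2, 3, 4, 5):
        raise ValueError("urbanization level must be between 1 and 5")
    return {f"UL_{k}": int(level == k) for k in (1, 2, 4, 5)}


print(urbanization_dummies(2))  # {'UL_1': 0, 'UL_2': 1, 'UL_4': 0, 'UL_5': 0}
print(urbanization_dummies(3))  # {'UL_1': 0, 'UL_2': 0, 'UL_4': 0, 'UL_5': 0}
```

Leaving out one level avoids perfect collinearity with the intercept; each dummy coefficient is then interpreted relative to level 3.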

3.3.3 Control variables

Control variables are also known as "covariates". The most common use of covariates is to remove extraneous variation from the dependent variable, because the effects of the factors could be a concern (Malhotra, 2010). Based on previous literature and the availability of the data, the following control variables are used in this study:


Promotional days

During the observation period, a major promotional activity took place several times: the promotional days. These are a special promotional activity in which products are offered at large discounts for five days. On average, the number of orders placed during the promotional days is 120% higher than in a regular period. This implies that the promotional activity could influence the interpurchase time. For that reason, the effect of this variable is controlled for in the model. Based on the date variable, a dummy variable called C_Promo is created and added to the dataset. A date that falls within the promotional days is indicated by a one, and any other date by a zero.

Seasonal influence

In order to investigate whether the different seasons have an effect on the dependent variable, a new variable is created in the dataset. The variable is based on the date variable, from which the week numbers are extracted. The week numbers are used to create the different seasons: spring (week 10-22), summer (week 23-35), autumn (week 36-48) and winter (week 49-9). Three dummy variables are created in order to reflect the seasons in the dataset. The variables are labelled as C_Spring, C_Summer and C_Autumn.
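A minimal sketch of this week-to-season mapping is given below. Two points are assumptions rather than statements from the text: ISO week numbering is used to extract the week from a date, and winter is treated as the reference season because no dummy is listed for it. The function name is hypothetical.

```python
from datetime import date


def season_dummies(week):
    """Map a week number (1-53) to the dummies C_Spring, C_Summer, C_Autumn.

    Winter (weeks 49-53 and 1-9) is the assumed reference: all dummies are zero.
    """
    if not 1 <= week <= 53:
        raise ValueError("week number must be between 1 and 53")
    return {
        "C_Spring": int(10 <= week <= 22),
        "C_Summer": int(23 <= week <= 35),
        "C_Autumn": int(36 <= week <= 48),
    }


# ISO week number of the first day of the observation period (assumed convention)
week = date(2015, 5, 1).isocalendar()[1]
print(week, season_dummies(week))
```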

Weather

In a study by Weather Unlocked (2014), it is stated that seasonal changes, as well as daily and weekly fluctuations in weather, shape demand for consumer goods and services. During periods of sunny weather, online sales decreased. They reasoned that when the weather is nice, consumers are more likely to go outdoors, and thus less likely to shop online. For this study, this would imply that the interpurchase time in the summer is higher than in the winter. To control for the weather, data is collected from the site of the national weather institute. The weather data contains three variables: average temperature, duration of sunshine and duration of rainfall. These three variables are extracted for all 50 weather stations in the Netherlands, and the average over all weather stations is taken for the dataset. These averaged variables are added as control variables in the dataset and are labelled as follows: C_Sunshine, C_Rainfall and C_logTemp.

3.3.4 Latent segmentation variables

In order to perform a latent class regression, segmentation variables have to be used to detect unobserved segments in the sample. For this study the variables Income, Family composition and Average interpurchase time are used. Besides their function of segmenting the data, the variables are also useful in helping to characterize the sample.

Income

The variable income is based on external information at the zip code level regarding the level of income of a customer.

Family composition

Like the variable income, the variable family composition is also based on external sources. This variable reflects a customer's family composition based on a certain address.

Average interpurchase time


how much time a customer needs on average to place their next order. Table 3.4 shows the five levels possible for this variable.

3.4 Plan of analysis

The first step in testing the hypotheses and answering the research question is to create a dataset containing the relevant variables. For this purpose, the tool SAS Enterprise Guide 7.1 is used. With this tool, the data can be extracted from the databases of a large European online retailer. SAS offers several possibilities to adjust, aggregate, combine or create (implied) variables.

The next phase of the process consists of exploring the data. This provides first basic statistical insights into what the population in the dataset looks like. RStudio version 0.99.90 will be used to create those insights. RStudio is an open-source, enterprise-ready environment for programming in R, a script-based language used for statistical computing and graphics (RStudio, 2017). Besides exploring the data, this tool will also be used for data cleaning and for conducting the ordinary least squares multiple regression.

The process of data cleaning follows the data exploration. In order to do this properly, several tests are conducted within RStudio to check the data for missing values, outliers and oddities. After completing the data cleaning, a regression model will be created. This model serves for testing the statistical validity by means of four model assumptions. When an assumption is violated, adjustments to the data will have to be made in order to solve the issue. The following sequence of assumptions will be followed.

Firstly, the model will be checked for autocorrelation. This check tests whether the residuals from different time periods are correlated. The assumption is violated when the residuals exhibit some systematic pattern over time, so that some or all of the covariances between residuals from different time periods are nonzero (Leeflang et al., 2015). When such correlation exists, a transformation has to be applied in order to take it out. The first step is to calculate the rho (correlation coefficient) of the residuals. Subsequently, for each observation of each variable, rho times the value at time T-1 is subtracted from the value at time T. Equation 3.1 shows the transformation.

$$X'_t = X_t - \rho \, X_{t-1} \qquad \text{(Eq. 3.1)}$$

Where,

$X'_t$ = The independent variables after transformation

$X_t$ = The independent variables in the model at time t

$\rho$ = Rho (correlation coefficient)
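The transformation of Equation 3.1 can be illustrated with a short sketch. The estimator of rho shown here (a regression of each residual on its predecessor, without intercept) is one common choice and is an assumption, as is dropping the first observation, for which no predecessor exists. Function names and numbers are hypothetical.

```python
def estimate_rho(residuals):
    """First-order autocorrelation of the residuals.

    Assumed estimator: slope of e_t on e_{t-1} without an intercept.
    """
    numerator = sum(e_prev * e_t for e_prev, e_t in zip(residuals, residuals[1:]))
    denominator = sum(e_prev ** 2 for e_prev in residuals[:-1])
    return numerator / denominator


def quasi_difference(x, rho):
    """Apply X'_t = X_t - rho * X_{t-1}; the first observation is dropped."""
    return [x_t - rho * x_prev for x_prev, x_t in zip(x, x[1:])]


residuals = [1.0, 0.5, 0.25, 0.125]      # hypothetical residual series
rho = estimate_rho(residuals)
print(rho)                                # 0.5
print(quasi_difference([2.0, 3.0, 5.0], rho))  # [2.0, 3.5]
```

In a panel setting, the transformation would presumably be applied within each customer's series, so that the last order of one customer is not paired with the first order of the next.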

The second assumption is related to heteroskedasticity, where it will be tested whether the error term is homoscedastic, i.e. has the same variance over time. When this is not the case, it will reduce the efficiency of the parameter estimates (Leeflang et al. 2015).


$$F = \frac{SSR_1 / (n - p)}{SSR_2 / (n - p)} \qquad \text{(Eq. 3.2)}$$

Where,

SSR = Sum of Squared Residuals

$n$ = Number of observations

$p$ = Number of parameters

Moreover, another commonly used test for checking this assumption is Levene's test. This statistic tests for equality of variances, and subsequent modifications have improved both the robustness of the test and its statistical performance (Gastwirth, Gel & Miao, 2009).

If the assumption is violated, a GLS transformation has to be applied. The first step is to calculate how large the variance within both groups is, for which the standard deviation of each group has to be taken. Subsequently, all observations in a group are divided by the standard deviation of their own group. Equation 3.3 shows the transformation:

$$X'_t = \frac{X_t}{\sigma_t} \qquad \text{(Eq. 3.3)}$$

Where,

$X'_t$ = The independent variables after transformation

$X_t$ = The independent variable at time t

$\sigma_t$ = The standard deviation of the residuals at time t
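Equations 3.2 and 3.3 can be illustrated together in a small sketch. It assumes two groups of observations with already computed residuals; the function names and all numbers are hypothetical, and the use of the population standard deviation (rather than the sample version) is an assumption.

```python
from statistics import pstdev


def variance_ratio_f(ssr1, ssr2, n, p):
    """F statistic of Eq. 3.2: ratio of the residual variances of two groups."""
    return (ssr1 / (n - p)) / (ssr2 / (n - p))


def scale_by_group_sd(values, group_residuals):
    """Eq. 3.3: divide a group's observations by the SD of its residuals."""
    sd = pstdev(group_residuals)  # assumption: population standard deviation
    return [v / sd for v in values]


# Hypothetical sums of squared residuals for two groups
print(variance_ratio_f(ssr1=40.0, ssr2=10.0, n=50, p=5))

# Hypothetical group whose residuals have a standard deviation of 2
print(scale_by_group_sd([4.0, 6.0], [-2.0, 2.0, -2.0, 2.0]))  # [2.0, 3.0]
```

A ratio far from one suggests unequal error variances between the groups; after the scaling step, the model would be re-estimated on the transformed data.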

Multicollinearity is the third assumption and checks for correlation between two or more independent variables in the model, which is not desired when the correlation is too strong. As a consequence of multicollinearity in the model, parameter estimates become unreliable (Leeflang et al 2015). The Variance Inflation Factor (VIF) score will be used in order to check for the degree of correlation.
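As an illustration of the VIF, the sketch below covers the simplest case of two predictors, where the VIF of either predictor equals 1 / (1 - r²), with r the correlation between the two. With more predictors, r² is replaced by the R² of regressing one predictor on all the others. The function names and data are hypothetical.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical values of two correlated predictors (r = 0.8)
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
print(round(vif_two_predictors(x1, x2), 3))  # 2.778
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values indicate that its coefficient estimate is inflated by collinearity.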

The fourth assumption is called the non-normality assumption. This latter assumption checks whether the disturbances are normally distributed. A violation of this assumption can be due to misspecification of the model. The disturbances have to be normally distributed in order to be useful for the standard test statistics of hypothesis testing. Otherwise the significance of the parameter estimates cannot be trusted (Leeflang et al., 2015).

The Shapiro-Wilk test is a commonly used test for normality. However, this test cannot be used in this situation, because it requires that the dataset contains at least 3 and not more than 5,000 observations (Shapiro & Wilk, 1965). The dataset used in this study exceeds that restriction by containing over 2.2 million observations. However, the Kolmogorov-Smirnov and Jarque-Bera tests are also appropriate to test for normality. Therefore, these two tests will be conducted to further investigate the non-normality assumption.
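To make the Jarque-Bera test concrete, the sketch below computes the statistic from the sample skewness S and kurtosis K as JB = n/6 · (S² + (K - 3)²/4). Under normality, JB is asymptotically chi-square distributed with two degrees of freedom, for which the upper-tail probability is exp(-JB/2). The function name and the input values are hypothetical; in practice the input would be the regression residuals.

```python
import math


def jarque_bera(x):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in x) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in x) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)  # chi-square survival function, 2 df
    return jb, p_value


jb, p = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])  # hypothetical residuals
print(jb, p)
```

With millions of observations, even small departures from normality produce a very large JB value, so a rejection should be read together with a plot of the residual distribution.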


To include customer heterogeneity and to segment the data, the regression analysis is extended with a latent class analysis. With a latent class analysis, the model is able to identify segments in the sample. This approach allows parameters to differ between the latent classes due to unobserved segments. Respondents within the same class are homogeneous on certain variables, and respondents in different classes differ from each other (Vermunt & Magidson, 2005). Because of the categorical nature of the latent classes, latent class analysis is different from other latent approaches such as factor analysis and structural equation modeling. This ensures that latent class analysis can be used for predictive models.

For these other traditional methods, the decision to adopt a particular model is considered arbitrary or subjective. In contrast, latent class analysis as a statistical model allows the comparison to be statistically tested, so that the decision to adopt a particular model is less subjective, or at least has some grounding for comparison (Robertson & Kaptein, 2016).

The latent class analysis is conducted in the program Latent Gold 5.0. The variables explained in 3.3.4 ‘Latent segmentation variables’ are used to segment the data in this study.

3.4.2 Model specification formula

For the model in this study, a multiplicative function is used (Wittink et al., 1988). An advantage of this type of model is that it incorporates interaction effects by default. However, a multiplicative model does not allow negative or zero values. To accommodate this, all the variables, except for the dummy variables, are log-transformed. Dummy variables do contain zero values; to accommodate this, dummies are placed in the exponent and the beta is used as the base number (Wittink et al., 1988). In other words, the indicator variables are used as exponents to facilitate the interpretation of the coefficients associated with these variables. An advantage of using elasticities is that they are dimensionless, i.e. completely independent of the units in which the influence of parameter x on the dependent variable y is measured. The model specification in the form of a mathematical formula is written below:
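The following is not the specification of this study, but a generic sketch of how a multiplicative model with dummies in the exponent becomes linear after taking logarithms. All names and coefficient values are hypothetical. The generic form assumed here is y = α · Π x_k^β_k · Π γ_l^D_l, so that ln y = ln α + Σ β_k ln x_k + Σ D_l ln γ_l.

```python
import math


def multiplicative_model(alpha, betas, xs, gammas, dummies):
    """y = alpha * prod(x_k ** beta_k) * prod(gamma_l ** D_l)."""
    y = alpha
    for beta, x in zip(betas, xs):
        y *= x ** beta        # continuous predictors must be strictly positive
    for gamma, d in zip(gammas, dummies):
        y *= gamma ** d       # a dummy of zero leaves y unchanged (gamma ** 0 == 1)
    return y


def log_linear_form(alpha, betas, xs, gammas, dummies):
    """ln y = ln alpha + sum(beta_k * ln x_k) + sum(D_l * ln gamma_l)."""
    return (
        math.log(alpha)
        + sum(beta * math.log(x) for beta, x in zip(betas, xs))
        + sum(d * math.log(gamma) for gamma, d in zip(gammas, dummies))
    )


# Hypothetical coefficients and one hypothetical observation
args = (2.0, [0.5, -0.3], [4.0, 10.0], [1.2, 0.9], [1, 0])
print(math.log(multiplicative_model(*args)), log_linear_form(*args))
```

Both printed values agree, which is why the log-transformed model can be estimated with ordinary least squares; each β_k is then an elasticity, and each γ_l is a multiplier applied when the corresponding dummy equals one.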


4. Results

This section presents the results of the analyses conducted on the collected data in order to test the hypothesis and to answer the (associated) research question. First, the data are explored by checking for missing values, outliers and oddities. Because the data exploration reveals unusual features, the data cleaning section explains how these features are handled in this paper. In the third part of this chapter, the regression model is presented. This regression model is extended with latent class segmentation in order to identify possible segments in the data.

4.1 Descriptive statistics

The dataset contains panel data of ### customers with ### orders, measured on 25 variables (18 predictors and 7 control variables) over a period of two years (from 1 May 2015 to 1 May 2017), resulting in ### observations.

The customers in this study are on average ## years old (SD = 10.87). The youngest customer was ### years old and the oldest customer in the dataset was ### years old. The data contains ### female customers (90.2%) and ### male customers (9.8%). On average, a customer ordered 24 times (SD = 27.31) during the observation period. The number of orders ranges from a minimum of one to a maximum of ### orders. Furthermore, the length of the customers’ relationship with the company at the moment of ordering was ### months on average, with a minimum of less than one month and a maximum of ### months. A distribution analysis of the dependent variable is given in Appendix B.

4.2 Data cleaning

Before analysing the dataset, the data must be cleaned. It is necessary to check the data for missing values, outliers and oddities in order to obtain accurate predictions and unbiased estimates. A first check detected 48,433 missing values. A major part of these missing values can be attributed to a group of orders that are not automatically redirected to the order system, so that a second line acquisition team has to register these orders by hand. According to XXX (2017), this group of orders represents 3-5% of all orders. As a consequence, the session data of these orders (e.g. the duration of a session and the payment preference) cannot be connected to the corresponding order.

Another cause of the missing values is the procedure used to create the interpurchase time variable. The interpurchase time is calculated as the difference between the order dates at T and T-1, that is, the date at T-1 is subtracted from the date at T. For customers who ordered for the first time within the observed time frame, no value at T-1 exists. Consequently, 5000 missing values were detected, which means that 5000 observations are removed from the dataset due to this procedure. These values are missing completely at random. According to Donders, van der Heijden, Stijnen & Moons (2006), a value is missing completely at random when the probability of that value being missing is unrelated to any other variable in the dataset. Therefore, these missing values are deleted listwise from the dataset.
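The construction of the interpurchase time variable and the resulting listwise deletion can be illustrated with a minimal sketch. The order records, the customer identifiers and the function name `interpurchase_times` are hypothetical and serve only as an illustration; they are not taken from the dataset or the code used in this study.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order records: (customer_id, order_date). Illustrative values only.
orders = [
    ("A", date(2015, 5, 3)),
    ("A", date(2015, 5, 20)),
    ("A", date(2015, 7, 1)),
    ("B", date(2016, 1, 10)),
    ("C", date(2016, 3, 1)),
    ("C", date(2016, 3, 15)),
]


def interpurchase_times(records):
    """Return (customer_id, order_date, days_since_previous_order) rows.

    The first observed order of a customer has no order at T-1, so its
    interpurchase time is None (a missing value).
    """
    by_customer = defaultdict(list)
    for customer_id, order_date in records:
        by_customer[customer_id].append(order_date)

    rows = []
    for customer_id, dates in by_customer.items():
        dates.sort()  # make sure T-1 precedes T
        previous = None
        for current in dates:
            # Interpurchase time = date at T minus date at T-1, in days.
            gap = None if previous is None else (current - previous).days
            rows.append((customer_id, current, gap))
            previous = current
    return rows


rows = interpurchase_times(orders)

# Listwise deletion: drop every observation with a missing interpurchase time.
complete_rows = [row for row in rows if row[2] is not None]
```

In this sketch each customer loses exactly one observation, namely the first order in the time frame. A customer with a single order, such as the hypothetical customer "B", therefore drops out of the estimation sample entirely.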
