(1)

Remaining with cluster heterogeneity: the “dark” side of clustering?

An analysis of online travel data to segment customers and predict conversion probability

Master thesis | Irma Kootstra

(2)

Remaining with cluster heterogeneity: the “dark” side of clustering?

An analysis of online travel data to segment customers and predict conversion probability

Irma Kootstra

University of Groningen

Faculty of Economics and Business

MSc Marketing Intelligence | MSc Marketing Management

Master Thesis

17-06-2019

Irma Kootstra

Eeltsjemar 3

8939CJ, Leeuwarden

(+31)6 34 95 48 37

kootstrairma@gmail.com

S2707098

Supervisor (First):

dr. P.S. (Peter) van Eck

p.s.van.eck@rug.nl

Supervisor (Second):

M.T. van der Heide

m.t.van.der.heide@rug.nl

(3)

Management Summary

The online consumer world is growing (Ellis, 2017; GfK, 2018) and firms rapidly capture more and more information via the internet (Verhoef, Kooge & Walk, 2016). If firms use the data that is available to them effectively, they can develop more effective marketing strategies (Bergemann & Bonatti, 2011). A marketing strategy contains three important processes: segmentation, targeting and positioning (Kotler, 1994; Toften & Hammervoll, 2013). The focus in this research is on the first two. Customer segmentation aims to find groups of customers that are homogeneous within their group and heterogeneous from the other groups (Xu & Wunsch, 2005). Cluster analysis is the most used and effective tool for segmentation, and it can be implemented in different ways (Vidden, Vriens & Chen, 2016): clustering can be based on different types of variables, and different cluster algorithms can be utilized.

However, segmentation by itself does not yet lead to success; effective implementation of segmentation in the targeting step is what has been shown to lead to success (Dibb & Simkin, 2007). Targeting is about finding an effective way to reach the customers that are most interesting for the business. For for-profit businesses, these are the customers that are most likely to purchase a product of the respective business (Personen, 2013). Moreover, predicting in practice which customers are most likely to purchase, so that they can be targeted effectively, is a topic that gains more and more attention in marketing. Therefore, this research investigates when a segmentation is most homogeneous and heterogeneous at the same time (which relates to cluster validity) and, consequently, the influence of different segmentation approaches on conversion prediction models. Hence, this research aims to answer the following research questions:

1) How do different cluster input variables influence the cluster solution validity?
2) How do different cluster algorithms influence the cluster solution validity?
3) How do different cluster solutions influence conversion probability prediction?

(4)

The key findings resulting from this research indicate that:

1. Clustering based on behavioral variables results in the most valid cluster solution, followed by geographic, demographic and psychographic variables, respectively.

2. The cluster algorithm that was especially developed to cluster large applications (CLARA) results in the most valid cluster solution in most cases (3 out of 4). The traditional k-means algorithm still produced the most valid cluster solution once.

3. The cluster solution that scores highest on validity eventually turns out to be the least valuable for conversion prediction, and vice versa.

(5)

Preface

Dear reader,

Thank you for taking the time to read my master thesis. When I arrived in Groningen I did not have a clue that I would end up here, writing a master thesis for Marketing Intelligence and Marketing Management. In 2014 I started with Business Administration and after my first year I chose to move to Technology Management. However, Technology Management is a programme that is mainly driven by internal business processes and supply chain processes, while in my opinion taking the view of the customer has become far more important nowadays. Therefore, I chose to study a master where the importance of the customer is acknowledged and is assigned the central role in businesses. This is my final research project at the University of Groningen before I start a new chapter in life. During these 1.5 years of my masters I really discovered my enthusiasm for data analysis as well as the managerial insights that follow from it, and during my thesis I learned a lot about my ambitions for the future and about myself as a person.

I would like to thank my first supervisor Dr. P.S. van Eck for helping me during the whole process of writing my thesis. He always provided me with very helpful feedback, which to me is a key requisite: in this manner, I was able to carefully reflect on my own work and keep doing so to get the best out of myself. I would also like to thank my second supervisor M.T. van der Heide for critically reviewing my thesis. Besides, I would like to thank my fellow group members for helping each other during this journey. We occasionally had sessions to discuss the subject and the data set together, which was really valuable to me. Finally, I would like to thank my friends and family who also guided me through this journey.

Irma Kootstra

(6)

Table of contents

1. Introduction ... 1

Structure of this research ... 4

2. Theoretical Framework ... 5

2.1. Customer experience and the customer journey ... 5

2.2. Customer segmentation ... 5

2.3. Feature selection or extraction ... 6

2.4. Clustering algorithm design or selection ... 7

2.5. Cluster validation ... 8

2.6. Results interpretation ... 9

3 Research Design ... 11

3.1. Data collection ... 11

3.2. Variables ... 12

3.3. Working of the cluster algorithms ... 13

3.4. Analysis techniques ... 16

3.4.1. Analysis of cluster validity ... 16

3.4.2. Analysis of cluster usability ... 18

3.4.3. Logit model ... 18

3.5. Plan of analysis ... 21

4. Results ... 23

4.1. Preliminary checks ... 23

4.1.1. Missing values ... 23

4.1.2. Outliers ... 24

4.2. Data description ... 24

4.3. Clustering ... 26

4.3.1. Internal validation ... 26

4.3.2. Description of the behavioral clusters ... 28

4.4. Consequences of mean imputation ... 30

4.5. Assumptions for logistic regression ... 31

4.6. Model estimation -selection- ... 32

4.6.1. Estimation results part I ... 34

4.6.2. Estimation results part II ... 34

4.7. Model selection ... 35

5. Discussion and outlook ... 38

5.1. Summarizing and discussing the results ... 38

5.1.1. Different variable types and validity of the cluster solution ... 38

5.1.2. Different algorithms and validity of the cluster solution ... 39

5.1.3. Different cluster solutions and accuracy of conversion prediction ... 39

5.2. Implications ... 40

5.3. Limitations and suggestions for further research ... 41

6. References ... 43

7. Appendix ... 51

(7)

1. Introduction

The number of households that purchase products online is rapidly increasing. According to Ellis (2017), "online sales are growing at about three times the rate of brick-and-mortar stores" (p. 2). According to GfK (2018), Dutch consumers spent 22.5 billion euros in the online environment in 2017. In line with this, online travel shopping is also a dynamic and rapidly growing business sector on the internet; online travel sites have exploded in recent years (Card, Chen & Cole, 2003; Amaro & Duarte, 2013). As a result, firms also rapidly capture more and more information. To not suffer from information overload, it is important that firms make efficient use of the data that is available to them (Verhoef, Kooge & Walk, 2016). Handling the data in a smart way will lead to a more effective marketing strategy, which is key for firms (Bergemann & Bonatti, 2011).

In theory, a marketing strategy consists of three steps: segmentation, targeting and positioning (STP) (Kotler, 1994; Toften & Hammervoll, 2013). During the segmentation step, a firm groups consumers with similar needs and buying behavior together based on one or more variables. Segmentation typically leads to a detailed customer analysis, which allows firms to better align their own behavior with customer behavior (Dibb & Simkin, 2001). In the targeting step, firms aim to allocate resources in line with the priorities and preferences of the various segments. Finally, positioning involves the development of marketing programs that are appropriate for the targeted segments (Venter, Wright & Dibbs, 2015).

(8)

Segmentation will lead to a more effective marketing strategy because it contributes to the synchronization of a marketing strategy (Moutinho & Vargas-Sanchez, 2018). The reason for this is that segmenting a market aims at uncovering the different homogeneous groups that are present in a heterogeneous market, which eventually alleviates the design stage of the targeting step (Wedel & Kamakura, 2012), because it effectively helps to identify the right audiences to focus on in the targeting step (Vidden, Vriens & Chen, 2016; Moutinho & Vargas-Sanchez, 2018). Moreover, by subsequently making resource allocation decisions in the targeting step that are congruent with the preferences of the segments, companies can enlarge their return on investment (Dibb & Simkin, 2001). Moutinho & Vargas-Sanchez (2018) also discuss the effect of segmentation on a marketing strategy. The authors argue that segmentation leads to more specifically directed marketing programs, more effective positioning and a greater opportunity to develop offers that are in line with the preferences of the segments, so that customers are less harassed by marketing efforts and more willing to embrace them.

Although segmentation literature in general is not scarce, various researchers in the marketing field call for more specific research. For example, Lemon and Verhoef (2016) explicitly call for the identification of customer segments based on the customer's utilization of particular touch points in the journey. Besides, Song, Sahoo, Srinivasan & Dellacoras (2017) suggest that one could also try to identify paths to purchase by utilizing demographics of consumers to comprehend how demographics influence consumers' shopping activities. But, surprisingly, no literature in the marketing field explains the importance of including behavioral variables (e.g. touch point usage) relative to demographic variables for segmentation, while at the same time more and more researchers point to the advantages of the behavioral variables that have become available due to the explosion of data. Therefore, this research will add to the literature by providing insights into the influence of including different kinds of variables for segmentation.

(9)

K-means is the oldest and most used cluster algorithm (Gan & Ng, 2017). However, over the last years, different cluster algorithms have evolved, such as partitioning around medoids, clustering for large applications and fuzzy clustering (Brock, Pihur, Datta & Datta, 2008). These algorithms were developed with the purpose of creating better cluster solutions (Goder & Filkov, 2008). Nonetheless, there is no direct comparison available in the marketing literature between the use of k-means, partitioning around medoids, sampling-based clustering and fuzzy clustering to segment customers. Therefore, this research will also add to the marketing literature by providing insights into the performance of different cluster algorithms for segmentation.

However, segmentation itself does not yet lead to success. Effective implementation of market segmentation and subsequent targeting is the key to success (Dibb & Simkin, 2007). As mentioned before, segmentation can provide important knowledge for targeting decisions (Moutinho & Vargas-Sanchez, 2018), and targeting decisions are concerned with focusing on the most interesting customers of a company's complete customer base. Being able to predict the chances of conversion for individuals in the online customer journey is extremely important because for most businesses 80% of their revenues come from only 20% of their customers (Cook & Mindak, 1984). This means that if firms are able to identify the individuals that are most likely to convert, marketing tools can be successfully exploited to stimulate conversion. There is, however, no marketing literature available in which different cluster solutions are compared on their added value for conversion prediction, while prediction models are on the rise in marketing research. Therefore, this research will lastly contribute to the literature by providing insights into the concrete usability of different cluster solutions for the targeting step.

By doing so, this research seeks an answer to the central research question: "How can customer segments best be identified to be able to develop more effective marketing strategies?" by answering the following three underlying research questions:

1) How do different cluster input variables influence the cluster solution validity?
2) How do different cluster algorithms influence the cluster solution validity?
3) How do different cluster solutions influence conversion probability prediction?

(10)

Structure of this research

To find answers to these questions, this research is twofold. First, different cluster algorithms will be applied to cluster customers, based on different types of variables. These methods are then compared on their ability to find valid clusters. Second, a model is developed to predict purchase probability. The optimal cluster solutions are individually included in the model to predict purchase probability in order to assess the actual usability of the different cluster solutions.

(11)

2. Theoretical Framework

2.1. Customer experience and the customer journey

The Marketing Science Institute (2014, 2016) describes the concept of customer experience as one of the most relevant research areas for the upcoming years, mainly because of the growing number and distinctiveness of customer touch points. According to Lemon & Verhoef (2016), customer experience is "the customer's "journey" with a firm over time during the purchase cycle throughout multiple touch points" (p. 74). The customer journey comprises the pre-purchase, purchase and post-purchase phase and is iterative and dynamic (Lemon & Verhoef, 2016). In each of the three stages, customers encounter several touch points. Touch points are the different moments in time during the journey at which the customer has individual contact with the firm (Lemon & Verhoef, 2016). Touch points can be divided into firm-initiated contacts (henceforth: "FIC") and customer-initiated contacts (henceforth: "CIC"). FIC are contacts where firms "push" information towards the customer, while CIC are contacts where customers "pull" information towards themselves (Li & Kannan, 2014). The distinction made by Li & Kannan (2014) will also form the basis for distinguishing touch points in this research. The growing number of marketing channels also gives rise to a larger variety of touch points (Baxendale, Macdonald & Wilson, 2015). According to Halvorsrud, Kvale & Følstad (2016), marketing channels are "the carriers of touch points, and they can be digital (e.g. e-mail), human-served (e.g. a desk in a shop), or a combination of the two" (p. 846).

2.2. Customer segmentation

The most prominent manner to segment customers is clustering (Vidden, Vriens & Chen, 2016). According to Mallika & Krishnan (2014), "the ultimate goal of clustering is to split a finite unlabeled data set into a finite and discrete set of "natural", hidden data structures" (p. 121). In other words, clustering algorithms subdivide the data into a certain number of clusters. Most researchers define clusters as groups with internal homogeneity and external heterogeneity, such that patterns in the same cluster should be alike while patterns in different clusters should not be alike (Xu & Wunsch, 2005). The basic procedure of clustering consists of four steps, as described by Xu & Wunsch (2005), which will form the guideline for the upcoming sections (see Fig. 1).

(12)

2.3. Feature selection or extraction

Following the procedure of Xu & Wunsch (2005), the first step consists of feature selection or extraction. Vidden, Vriens and Chen (2016) point out that this primary step is crucial: cluster algorithms are only useful when irrelevant variables are removed, because otherwise those irrelevant variables will distort the clustering structure, which ultimately leads to useless results (Liu & Ong, 2006). The effectiveness of the clustering solution thus highly depends on the selection of useful variables in this first step. There are different types of variables that can be included as cluster input: geographic variables (e.g. regions, countries), demographic variables (e.g. age, gender), psychographic variables (e.g. lifestyle, social class) and behavioral variables (e.g. usage status, loyalty status) (Armstrong, Adam, Denize & Kotler, 2014).

While demographic variables have played a prominent role in segmentation for a very long time, customer-oriented strategies are growing and those strategies require more than demographics alone (Cleveland, Papadopoulos & Laroche, 2011). Besides, past studies mention that demographic variables are frequently used to profile the segments rather than to shape the segments (Straughan & Roberts, 1999). Also, Fuat Firat & Schultz (1997) argue that the traditional segmentation variables, such as demographics and psychographics, are becoming less useful these days. Moutinho & Vargas-Sanchez (2018) furthermore argue that demographic, psychographic and sociographic variables are basic characteristics of customers but do not provide an understanding of why certain customer segments respond to offerings the way they do.

(13)

According to Lynn (2011), there is no single best way to select variables to include as cluster input. However, the literature discussed above suggests that behavioral variables might be more useful than geographic, demographic or psychographic variables for segmenting customers in a specific product or service domain; the latter are probably more useful for describing clusters.

There is, however, no direct comparison available in the marketing literature yet, which brings us to the first research question:

How do different cluster input variables influence the cluster solution validity?

2.4. Clustering algorithm design or selection

According to Xu & Wunsch (2005), the next step is selecting (or designing) a clustering algorithm. Different clustering algorithms have evolved over the years. A very traditional cluster algorithm is k-means (Jain, Murty & Flynn, 1996). K-means is one of the oldest and most commonly used clustering algorithms (Gan & Ng, 2017). According to Hartigan & Wong (1979), "the k-means algorithm is iterative and minimizes the within-class sum of squares for a given number of clusters" (p. 100). K-means starts with an initial guess for the cluster centers and then places each observation into the cluster whose center is closest, based on the Euclidean distance measure. The cluster centers are subsequently updated and this process repeats itself until the centers no longer change, which explains why the algorithm is considered iterative. The algorithm is easy to understand, fast and robust (Gupta & Panda, 2018). However, despite its popularity in research, it has several drawbacks; it is, for example, very sensitive to outliers and noisy data (Gupta & Panda, 2018).
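To make this concrete, the following is a minimal R sketch of applying k-means (base R's kmeans()) to a scaled set of cluster input variables; the synthetic data, the object names and the choice of four clusters are illustrative assumptions, not taken from the thesis.

```r
# Minimal k-means sketch (synthetic data; object names are assumptions)
set.seed(123)
beha_scaled <- scale(rbind(matrix(rnorm(200 * 5), ncol = 5),
                           matrix(rnorm(200 * 5, mean = 3), ncol = 5)))  # placeholder cluster input

km <- kmeans(beha_scaled, centers = 4, nstart = 25)  # 25 random starts to avoid poor local optima

km$cluster       # cluster assignment per observation
km$centers       # final cluster centers
km$tot.withinss  # total within-cluster sum of squares (the quantity the algorithm minimizes)
```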

(14)

Thereafter, another algorithm evolved: CLustering LARge Applications (henceforth: "CLARA"). CLARA is a sampling-based algorithm, which applies PAM (partitioning around medoids) to numerous sub-datasets (Kaufman & Rousseeuw, 2008; Gupta & Panda, 2018). This way, CLARA achieves faster running times than when PAM is applied to the complete dataset. By drawing multiple samples, it searches for the best k medoids among the selected samples of the original data set and returns the best solution as its output. Since the samples are drawn at random, they should be representative of the original data set, and the chosen medoids should therefore be similar to those that would have been extracted from the original data set (Han, Jian & Michelin, 2006). Lastly, Pham (2001) introduces fuzzy clustering through the fanny algorithm, in which every observation is assigned a partial membership for each cluster (Pham, 2001). So, each observation is assigned a vector that represents its partial membership of the different clusters; ultimately, the observation is assigned to the cluster for which it has the highest membership (Brock et al., 2008). An advantage of fuzzy clustering is that it can identify important zones between clusters, which cannot be detected by k-means, PAM or CLARA (Heil, Häring, Marschner & Stumpe, 2019).
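As an illustration, here is a minimal sketch of the three alternative algorithms using the R cluster package (pam, clara, fanny); the synthetic input and the number of clusters are assumptions for demonstration only.

```r
library(cluster)

set.seed(123)
x <- scale(rbind(matrix(rnorm(150 * 4), ncol = 4),
                 matrix(rnorm(150 * 4, mean = 3), ncol = 4)))  # placeholder scaled cluster input

pam_fit   <- pam(x, k = 2)                  # partitioning around medoids on the full data
clara_fit <- clara(x, k = 2, samples = 50)  # PAM repeated on 50 random sub-samples (faster for large data)
fanny_fit <- fanny(x, k = 2)                # fuzzy clustering: partial memberships per observation

head(fanny_fit$membership)  # membership degrees; each row sums to 1
clara_fit$clustering[1:10]  # hard assignments derived from the best sample's medoids
```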

In conclusion, multiple algorithms exist to perform clustering. Until now, there is no direct comparison available in the marketing literature between the four algorithms mentioned above, which brings us to the second research question:

How do different cluster algorithms influence the cluster solution validity?

2.5. Cluster validation

(15)

In addition, the authors state that "with the huge increase in data size and dimensionality, one can hardly claim that a complete knowledge of the ground truth is available or always valid" (p. 172). Acknowledging this line of reasoning, the focus in this research will be on internal validity. Most internal validity measures are concerned with calculating intra-cluster distance, which reflects the homogeneity within the clusters, and inter-cluster distance, which reflects the heterogeneity among the clusters (Xu & Wunsch, 2005). Many different measures exist to assess internal validity. Well-known measures are, for example, the root-mean-square standard deviation (henceforth: "RMSSD"), connectivity and the silhouette width. The latter two will be applied in this research to assess validity. These measures are elaborated on in section 3.

2.6. Results interpretation

According to Xu & Wunsch (2015), the final goal of clustering is to deliver meaningful insights into the dataset in such a way that the researcher is able to effectively tackle the problem at hand. In marketing segmentation it is relevant that the obtained segmentation solution is associated with the criterion of interest in the research, which is usually an aspect of behavior (Personen, 2013). Given the goal in the main research question to establish more effective marketing strategies, it would be very relevant if the segmentation (S) in this research makes a valuable contribution to the next step, which is targeting (T). The targeting step is concerned with focusing on the right audience. For businesses, the "right" audience consists of customers that are likely to convert at the company (Moschis, Lee & Mathur, 1997). By focusing on this well-defined customer group, a company is able to allocate resources more effectively, which leads to a higher Return On Investment (henceforth: "ROI") (Dibb & Simkin, 2001). The aspect of behavior that is of interest in this research is conversion. Therefore, it would be interesting to predict which customer journeys are most likely to result in conversion and to identify the contribution of our segmentation solutions to this prediction. Conversion is defined by Xu, Duan & Whinston (2014) as "the probability of a customer making a purchase, given the fact that an individual came across a way of online marketing initiated by a firm" (p. 1392). In this research, the term conversion is defined as "the customer making an actual purchase, as a result of online marketing exposure".

(16)

Another example is given by Puneet, Dubé, Goh & Chintagunta (2006), who state that "the number of exposures, number of web sites, and number of pages on which a customer is exposed to advertising will significantly influence the customer's purchase probability" (p. 104). Furthermore, academic research finds that, in general, CICs are more effective than FICs (de Haan, Wiesels & Pauwels, 2016). This can be explained by the fact that CICs are a result of the customers' own interests and are therefore perceived as less obtrusive than FICs (Shankar & Malthouse, 2007).

Limited research has been conducted on the contribution of marketing segments to purchase forecasting in the marketing field. Morwitz & Schmittlein (1992) find that segments can improve the accuracy of sales forecasts, but only if statistical segmentation methods are used. Bucklin & Gupta (1992) find that households that switch brands because of price promotions do not necessarily purchase more. However, no marketing literature has yet discussed the added value of various segmentation solutions at the customer journey level for predicting purchase probability. Moreover, the literature that did research this topic is by now 27 years old, and marketing practice and customer behavior have developed significantly since then. This brings us to the final research question of this paper:

How do different cluster solutions influence conversion probability prediction?

(17)

3 Research Design

3.1. Data collection

Quantitative research is executed in order to answer the research questions. This research is quantitative since it relies on quantitative information (i.e. numbers and figures) (Blumberg, Cooper & Schindler, 2008). It can further be classified as exploratory research, since the area of investigation currently lacks clarity in the marketing field (Blumberg, Cooper & Schindler, 2008).

For this research, event-based online data from a Dutch travel agency is investigated. The data is provided by GfK, a German market research institute, and is collected via the GfK Crossmedia Link from Dutch panelists. The panel is passively measured: all information from the panelists' purchase journeys, from exposure and media consumption to orientation and eventually purchases, is gathered. Passive measurement means that the research institute continuously measures customer behavior via, for example, browser plug-ins.

The data comprises information from a Dutch travel agency over a period of one year and five months (from May 31st 2015 until October 31st 2016). It contains advertisement data, search data, website visits of the focal company, website visits of competitors and purchases. For each customer, time-series data is collected, so the data can be classified as longitudinal panel data (Leeflang, Wieringa, Bijmolt & Pauwels, 2015). For each customer (UserID), every customer journey on their devices is tracked, and each observation corresponds to a specific event, namely a touch point. There are 20 different types of touch points and these can be classified as customer-initiated contact (CIC) or firm-initiated contact (FIC) (see Appendix A). Besides, the date and time of the event are tracked together with the time spent on that specific touch point. Lastly, for every customer journey GfK recorded whether the customer journey ended in a purchase.

(18)

To ultimately perform cluster analysis, four sub-datasets are constructed from the complete dataset according to variable type: one dataset to cluster on behavioral variables, one for demographic variables, one for geographic variables and one for psychographic variables. The dataset containing personal information about the customers, which forms the starting point for clustering based on demographic, geographic and psychographic variables, lists all the variables per UserID. The customer journey dataset, however, can contain several customer journeys (PurchaseIDs) per UserID.

Since this dataset is the starting point for clustering based on behavioral variables, the data here needs to be aggregated to the UserID level to perform cluster analysis, because the ultimate goal of segmentation is to group individual users (Dibb & Simkin, 2001). After performing clustering, the variables containing the most valid cluster solution per variable type are merged with the original dataset again based on the unique UserIDs. UserID is also the unique key on which the segmentation variables can subsequently be linked to the data at the PurchaseID level to predict conversion.
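The following is a minimal R sketch of the kind of aggregation and merging step described above; the toy data frame journeys and all column names are illustrative assumptions, not the thesis's actual GfK variable names.

```r
# Toy touch-point level data (illustrative column names)
journeys <- data.frame(
  UserID     = c(1, 1, 1, 2, 2, 3, 3, 3),
  PurchaseID = c(10, 10, 11, 12, 12, 13, 13, 13),
  is_cic     = c(1, 0, 1, 1, 1, 0, 1, 1),
  duration   = c(30, 45, 10, 60, 20, 90, 5, 40)
)

# Aggregate to the UserID level: total touch points, share of CIC, average duration
user_level <- data.frame(
  UserID        = sort(unique(journeys$UserID)),
  n_touchpoints = as.numeric(table(journeys$UserID)),
  share_cic     = tapply(journeys$is_cic, journeys$UserID, mean),
  avg_duration  = tapply(journeys$duration, journeys$UserID, mean)
)

# ... run the cluster algorithms on scale(user_level[, -1]) here ...
user_level$behacluster <- c(1, 2, 1)   # placeholder for the resulting cluster labels

# Merge the cluster labels back to the journey (PurchaseID) level data via UserID
journeys <- merge(journeys, user_level[, c("UserID", "behacluster")], by = "UserID")
```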

3.2. Variables

(19)

Table 1. Behavioral variables on the user level to create clusters

Variable | Explanation | Academic relevance
Number of touch points | The total number of touch points | More touch points indicate more serious buying behavior (Hansen, Jensen & Solgaard, 2004)
Frequency of each touch point | The sum of occurrence for each touch point | Frequency of usage indicates the preference of the user (Sashi, 2012)
Share of CIC | The number of CICs relative to the total number of touch points | CICs indicate that the customer is interested and involved with the subject at hand (Sarner & Herschel, 2008; Shankar & Malthouse, 2007; de Haan, Wiesels & Pauwels, 2016)
Share of mobile device | The number of touch points reached via mobile relative to the total number of touch points | Mobile devices are associated with customers having a higher income and/or education and often fit the profile of young unmarried office workers or students (Strom, Vendel & Bredican, 2014)
Duration | The (average) duration for the customer journey(s) | More enduring customer journey(s) are related to higher purchase probabilities

3.3. Working of the cluster algorithms

The second objective of this research is to identify how different cluster algorithms affect the cluster solution validity. Recalling Figure 1 and focusing on the cluster algorithm, Jain, Murty & Flynn (2000) present the following stages:

Fig 2. Stages of clustering algorithms (Jain, Murty & Flynn, 2000)

(20)

(21)

Fig. 3. The cluster algorithms (k-means, PAM, CLARA)

(22)

3.4. Analysis techniques

3.4.1. Analysis of cluster validity

As discussed in section 2, cluster validity can be assessed based on internal and external validation. Based on academic literature (Deborah, Baskaran & Kannan, 2010; Hassani & Seidl, 2017), the decision is made to focus on internal validation in this research. More than 30 different measures exist for internal validation, but for convenience only two measures will be used to compare the cluster solutions. According to Brock et al. (2008), it is important that a researcher's internal validity measures reflect the compactness, connectivity and separation of a cluster solution. Connectivity represents the extent to which observations are placed in the same cluster as their nearest neighbors. Compactness refers to how close objects are within the same cluster. Separation refers to how well separated a cluster is from other clusters. Since compactness and separation are two opposing criteria, it is possible to combine them into one measure. Well-known and widely used measures for compactness and separation are the Dunn index and the silhouette width. The Dunn index is the ratio of the smallest distance between observations in different clusters to the largest distance between observations within the same cluster. A drawback of the Dunn index, however, is that no average is used in the calculation; one inadequate cluster can therefore influence the ratio to a great extent even when the other clusters are excellent (Brock et al., 2008). It is for this reason that the measures of interest for internal validation in this research are connectivity and silhouette width, which together also encompass the compactness, separation and connectivity of the solution. These two measures are both available in the R package clValid (Brock et al., 2008) and are elaborated on below.
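As an illustration, here is a minimal sketch of computing both measures with the clValid package; the synthetic input matrix, the cluster range and the method list are illustrative assumptions.

```r
library(clValid)

set.seed(123)
x <- scale(rbind(matrix(rnorm(150 * 4), ncol = 4),
                 matrix(rnorm(150 * 4, mean = 3), ncol = 4)))  # placeholder cluster input
rownames(x) <- seq_len(nrow(x))

# Internal validation (connectivity, Dunn index, silhouette width) for several algorithms
val <- clValid(x, nClust = 2:6,
               clMethods  = c("kmeans", "pam", "clara", "fanny"),
               validation = "internal",
               maxitems   = nrow(x))   # raise the default item limit for larger datasets

summary(val)        # validity scores per method and number of clusters
optimalScores(val)  # best method/k per validity measure
```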

3.4.1.1. Connectivity

The connectivity measure calculates how strongly the different clusters of the cluster solution are connected. Let $nn_{i(j)}$ denote the $j$th nearest neighbor of observation $i$, and let $x_{i,nn_{i(j)}}$ be zero if $i$ and $nn_{i(j)}$ are in the same cluster and $1/j$ if they are not. Then, for a clustering partition $\mathcal{C} = \{C_1, \dots, C_K\}$ of the $N$ observations, the connectivity is defined as

$$\mathrm{Conn}(\mathcal{C}) = \sum_{i=1}^{N} \sum_{j=1}^{L} x_{i,nn_{i(j)}},$$

where $L$ is the number of nearest neighbors taken into account. Connectivity takes values between zero and infinity and should be minimized (Brock et al., 2008).

(23)

3.4.1.2. Silhouette width

The silhouette width is obtained by taking the average of the silhouette values over all observations. The silhouette value of an observation represents the degree of certainty with which that observation was assigned to its cluster, where well-placed observations have a value near 1 and poorly placed observations have a value near -1. For each observation $i$, the silhouette value is calculated as

$$S(i) = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where:

$S(i)$ = the silhouette value of observation $i$;
$a_i$ = the mean of the distance between observation $i$ and all other observations in the same cluster;
$b_i$ = the mean of the distance between observation $i$ and all observations in the "nearest neighboring" cluster,

that is,

$$a_i = \frac{1}{n(C(i)) - 1} \sum_{j \in C(i),\, j \neq i} \mathrm{dist}(i,j), \qquad b_i = \min_{C \neq C(i)} \frac{1}{n(C)} \sum_{j \in C} \mathrm{dist}(i,j),$$

where:

$C(i)$ = the cluster in which observation $i$ is placed;
$\mathrm{dist}(i,j)$ = the distance between observation $i$ and observation $j$;
$n(C)$ = the cardinality of cluster $C$.
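For illustration, a minimal sketch of how the average silhouette width can be computed in R with the cluster package; the data and the clustering carried over here are the same illustrative assumptions as in the earlier sketches.

```r
library(cluster)

set.seed(123)
x  <- scale(rbind(matrix(rnorm(150 * 4), ncol = 4),
                  matrix(rnorm(150 * 4, mean = 3), ncol = 4)))
cl <- clara(x, k = 2)

sil <- silhouette(cl$clustering, dist(x))  # silhouette value S(i) per observation
mean(sil[, "sil_width"])                   # average silhouette width of the solution
```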

(24)

3.4.2. Analysis of cluster usability

However, according to Chowdary, Prasanna & Sudhakar (2014), "one drawback of assessing cluster solutions based on internal validation is that high scores on an internal measure do not necessarily result in effective information retrieval applications" (p. 94). Therefore, the usability of the cluster solutions will be investigated by comparing the most valid (i.e. most homogeneous and heterogeneous) cluster solution per variable type on its added value for predicting conversion in a given customer journey.

3.4.3. Logit model

To investigate the contribution of the cluster solutions to conversion prediction, logit models will be estimated. The dependent variable to be predicted is conversion, which can be categorized as a binomial response variable: the customer either makes a purchase or does not. Leeflang et al. (2015) advise using a logit/probit model for a binomial response variable. In this type of model, the dependent variable can take two outcomes: Y = 1 or Y = 0. The authors state that the logistic regression model (logit model) is preferred for reasons of mathematical convenience, since probabilities are easier to calculate and the parameters are easier to interpret.

First, a basic logit model is established to predict conversion at the customer journey level. Conversion prediction takes place at the customer journey level because keeping the data at this aggregation level, rather than aggregating it further, reduces the risk of information loss. After the cluster validation process, the basic logit model will be expanded with an independent variable that represents the most internally valid cluster segmentation per variable type (i.e. the cluster solutions with the most optimal scores on the two validity measures).

The binary logit model applies a logistic transformation of the linear predictor, which ensures that the predicted probabilities lie between zero and one. To interpret the effects in terms of odds, the estimated coefficients are exponentiated (Leeflang et al., 2015).

(25)

3.4.3.1. Variables

The variables and control variables that are used to predict conversion at the customer journey level are listed in Table 2 and 3 below.

Table 2. Academic grounds for independent variables

Independent variable | Explanation | Academic relevance
Number of touch points | Expresses the number of touch points that were part of the journey | A larger number of touch points indicates more serious buying behavior (Hansen, Jensen & Solgaard, 2004)
Share of customer-initiated touch points (CIC) | Expresses the share of CICs that were part of the journey | CICs are more effective than FICs because they stem from the customer's own interests and are therefore perceived as less intrusive (Shankar & Malthouse, 2007; Sarner & Herschel, 2008; Blattberg, Kim & Neslin, 2008; de Haan, Wiesel & Pauwels, 2016)
Device | Expresses the type of device on which the journey took place | Mobile devices are often used in the search stage while fixed devices are often used in the purchase stage (Lemon & Verhoef, 2016)
Duration | Expresses the duration of the journey in seconds | Longer customer journeys positively affect trust and thus purchase probability (Kim, Ferrin & Rao, 2008; Luhmann, 2000)

Table 3. Academic grounds for control variables

Control variables | Academic relevance
Gender; Age; Income; Education | Brown, Pope & Voges (2003): gender differences are present concerning the propensity to purchase online. Rodgers & Harris (2003): men shop and purchase more online than women do. Li, Kuo & Russel (1999): younger users spend more time online and possess more knowledge about the internet. Bellman, Lohse & Johnson (1999): the likelihood of online purchasing increases as a person's income, education and age go up. Zhou, Dai & Zhang (2007):

(26)

3.4.3.2. Logit model specification

The two models to be estimated are essentially the same, except that the second model includes a segmentation variable. The number of clusters per variable type is not yet known and is therefore set to $n$ in the second equation. Besides, there are four variable types on which clustering will be performed (demographic, geographic, psychographic and behavioral), and therefore $k$ is set to a maximum of 4. Following equation 3.4.2.1, the models can be outlined as follows:

Model 1: Basic model

$$PA_i = \frac{1}{1 + \exp\left(-\left(\beta_0 + \beta_1 TP_i + \beta_2 DE_i + \beta_3 CIC_i + \beta_4 DU_i + \beta_5 G_i + \beta_6 IN_i + \beta_7 A_i + \beta_8 E_i\right)\right)} \qquad \text{(Eq. 3.4.3.2.1)}$$

Model 2: Segmentation model

$$PA_i = \frac{1}{1 + \exp\left(-\left(\beta_0 + \beta_1 TP_i + \beta_2 DE_i + \beta_3 CIC_i + \beta_4 DU_i + \beta_5 G_i + \beta_6 IN_i + \beta_7 A_i + \beta_8 E_i + \beta_{9,\dots,n} S_{ik}\right)\right)} \qquad \text{(Eq. 3.4.3.2.2)}$$

where:

$PA_i$ = Probability that the customer converts at any travel agency in journey $i$;
$TP_i$ = Number of touch points in journey $i$;
$DE_i$ = Device used in journey $i$;
$CIC_i$ = Share of customer-initiated touch points in journey $i$;
$DU_i$ = Duration of journey $i$ in seconds;
$G_i$ = Gender of user in journey $i$;
$IN_i$ = Income of user in journey $i$;
$A_i$ = Age of user in journey $i$;
$E_i$ = Education of user in journey $i$;
$S_{ik}$ = Cluster membership of the user in journey $i$ for variable type $k$.
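A minimal R sketch of estimating these two specifications with glm(); the toy data frame journey_data and its column names are assumptions for illustration, not the thesis's actual variable names.

```r
# Toy journey-level data (illustrative names and distributions)
set.seed(123)
n <- 500
journey_data <- data.frame(
  converted     = rbinom(n, 1, 0.13),
  n_touchpoints = rpois(n, 80),
  device        = factor(sample(c("fixed", "mobile"), n, replace = TRUE)),
  share_cic     = runif(n),
  duration      = rexp(n, 1 / 4500),
  gender        = factor(sample(c("m", "f"), n, replace = TRUE)),
  income        = sample(1:7, n, replace = TRUE),
  age           = sample(18:80, n, replace = TRUE),
  education     = sample(1:8, n, replace = TRUE),
  behacluster   = factor(sample(1:2, n, replace = TRUE))
)

# Basic model (cf. Eq. 3.4.3.2.1): journey characteristics plus controls
m1 <- glm(converted ~ n_touchpoints + device + share_cic + duration +
            gender + income + age + education,
          data = journey_data, family = binomial(link = "logit"))

# Segmentation model (cf. Eq. 3.4.3.2.2): add the cluster membership as a factor
m2 <- update(m1, . ~ . + behacluster)

summary(m2)    # coefficient estimates and significance
exp(coef(m2))  # odds ratios
```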

(27)

3.4.3.3. Quality of the logit models

When estimating the logit models, maximum likelihood estimation is used, which seeks the set of betas that maximizes the (log-)likelihood function (Leeflang et al., 2015). Significant parameter estimates will be interpreted in order to see which model fits best to predict conversion and in what way the different segmentation variables contribute to this. The models will be estimated and tested by stepwise including or excluding control variables and (in)significant variables until the best model is obtained. To compare the models, the Akaike Information Criterion (henceforth: "AIC"), hit rate, Top Decile Lift (henceforth: "TDL") and Pseudo R² for logistic regressions will be examined. The best performing model would ideally have the lowest AIC and the highest hit rate, TDL and Pseudo R². The AIC expresses the relative quality of statistical models for a given set of data and balances the precision of the parameters against parsimony in the model (Leeflang et al., 2015). The hit rate is a measure of the predictive power of a model: it determines the share of correctly predicted outcomes (Peng, Lee & Ingersoll, 2002). The TDL shows how many more conversions can be identified with the model compared to random selection (Cui, Wong, Zhang & Li, 2007). Different Pseudo R²s exist for logistic regression, which test the strength of association between the predictors and the dependent variable. In this research, Nagelkerke R² is employed. Nagelkerke R² cannot be interpreted in the same way as the traditional R², but the larger the Nagelkerke R², the better the model (Leeflang et al, 2016). Lastly, when developing models for prediction, the most critical metric concerns how well the model predicts the target variable on out-of-sample observations. The process comprises using the model estimated on the train dataset to predict values for the test dataset; subsequently, the predicted target variable is compared with the observed value for each observation in the test dataset.
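To illustrate how these criteria can be computed, here is a minimal R sketch continuing from the hypothetical m2 fit above; the 0.5 classification cut-off and all object names are assumptions.

```r
p_hat <- predict(m2, type = "response")   # predicted conversion probabilities
y     <- journey_data$converted

# Hit rate: share of correctly classified journeys at a 0.5 cut-off (assumption)
hit_rate <- mean((p_hat >= 0.5) == y)

# Top decile lift: conversion rate in the 10% highest predicted probabilities vs. overall
top10 <- y[order(p_hat, decreasing = TRUE)][seq_len(ceiling(0.1 * length(y)))]
tdl   <- mean(top10) / mean(y)

# Nagelkerke pseudo R-squared from the log-likelihoods of the fitted and the null model
m0  <- update(m2, . ~ 1)
ll0 <- as.numeric(logLik(m0)); ll1 <- as.numeric(logLik(m2)); nn <- length(y)
r2_cs         <- 1 - exp(2 * (ll0 - ll1) / nn)    # Cox & Snell R-squared
r2_nagelkerke <- r2_cs / (1 - exp(2 * ll0 / nn))  # Nagelkerke rescaling to a 0-1 range

AIC(m2); hit_rate; tdl; r2_nagelkerke
```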

3.5. Plan of analysis

(28)

To get preliminary insights into the data, descriptive statistics and explanatory graphs will be examined once the dataset is prepared and cleaned. After that, the data analysis will commence and all the different cluster solutions (4x4) will be formed and investigated on internal validity. After finding the most internally valid cluster solution per variable type, these four cluster divisions will be added as variables to the dataset and the logit model will be run including the different segmentation variables.

(29)

4. Results

4.1. Preliminary checks

In order to get reliable results, it is necessary to check for missing values and outliers. According to Leeflang et al. (2015), "outliers represent extreme or distant values relative to other observations in the data, which may contribute to biased estimations". Besides, Schafer & Graham (2002) mention that when dealing with missing values (N.A.: Not Available), one must either impute the missing values based on grounded assumptions or delete them from the dataset in order to get reliable results.

4.1.1. Missing values

(30)

In addition, imputation methods were tested to predict the duration N.A.'s based on other available variables. However, these prediction models resulted in extremely low R² scores, which indicates that the variables used for predicting the missing values cannot reasonably explain the missing variable. Hence, they were not applied to impute the missing data points.
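A minimal sketch of the kind of check described here, under the assumption that duration is the variable with missing values, that a simple linear model is the imputation candidate, and that mean imputation is the fallback (as the title of section 4.4 suggests); all object and column names are illustrative.

```r
set.seed(123)
d <- data.frame(
  duration      = c(rexp(180, 1 / 4500), rep(NA, 20)),  # some missing durations
  n_touchpoints = rpois(200, 80),
  share_cic     = runif(200)
)

# Regression-based imputation candidate: check explanatory power on complete cases
fit <- lm(duration ~ n_touchpoints + share_cic, data = d)
summary(fit)$r.squared   # a very low R-squared suggests regression imputation is not appropriate

# Fallback: mean imputation of the missing durations
d$duration[is.na(d$duration)] <- mean(d$duration, na.rm = TRUE)
```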

4.1.2. Outliers

All variables included in the analyses were checked for outliers. The majority of the variables did show some extreme values. This can nevertheless be explained by the fact that for many variables the values lie closely together or are almost equal. As a result, the outliers that are present in the boxplots are not considered to significantly impact the analysis; omitting them could result in a loss of important information, and therefore no outliers have been deleted. Yet, there is one customer journey that contains 64,503 touch points, while all the other observations have a maximum of 8,891 touch points. Given that this observation is about seven times as large as the next largest and is the only one exceeding 8,891 touch points, it is assumed to be unrealistic. As it influences other variable values as well, it would bias the results. Therefore, only this observation was deleted from the dataset. This resulted in a final dataset containing 29,011 customer journeys of 9,677 users.

4.2. Data description

(31)

Fig. 4. Descriptive statistics for demographics

Consequently, some key characteristics of the purchase journeys are discussed. The average number of touch points to which a customer is exposed in a journey is 82.45 and the average duration of a journey is 4,509.10 seconds (approximately 75 minutes), for a total of 29,011 customer journeys. The CICs that occur most frequently in the journeys are accommodations websites (1) and tour operator/travel agent website competitor (7). The CICs that occur least frequently are tour operator/ travel agent search focus brand (12), information/comparison search (6) and tour operator/travel agent search competitor (9). The FIC that occurs most frequently in the journeys is retargeting (22). The FIC that occurs least frequently is affiliates (18). The average share of CICs in a customer journey is relatively high with a value of 0.9908 while the average share of FICs in a customer journey is relatively low with a value of 0.0092. Furthermore, 23,243 journeys occurred via a fixed device and 5,768 journeys via a mobile device.

(32)

4.3. Clustering

4.3.1. Internal validation

To perform clustering with different variable types and different cluster methods, four separate scaled datasets were created. The data was normalized to ensure that all variables have an equal weight in the cluster partition. The partitioning methods (k-means, PAM and CLARA), as well as fuzzy clustering (fanny), require the researcher to specify k, the number of clusters, beforehand. To decide on the number of clusters to feed the algorithms for each variable type, the initial aim was to use the clValid package in RStudio and allow a range of 2 to 10 clusters in order to determine the optimal number of clusters. However, due to excessive running times (more than 60 minutes per variable type per algorithm, which would amount to more than 16 hours in total), it was decided that another method should be adopted to decide on the number of clusters.
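A minimal sketch of a scree-plot (elbow) approach for choosing the number of clusters, of the kind the scree plots in Figs. 6-9 suggest; the data object and the 2-10 range are assumptions consistent with the text.

```r
set.seed(123)
x <- scale(rbind(matrix(rnorm(200 * 4), ncol = 4),
                 matrix(rnorm(200 * 4, mean = 3), ncol = 4)))  # placeholder cluster input

# Total within-cluster sum of squares for k = 2..10
wss <- sapply(2:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

plot(2:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")  # look for the 'elbow' in the curve
```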

(33)

Fig. 6. Scree plot (demographic)

Fig. 7. Scree plot (geographic)

Fig. 8. Scree plot (psychographic)

Fig. 9. Scree plot (behavioral)

Table 4. Connectivity and silhouette width of the cluster solutions

Demographic | Connectivity | Silhouette width
k-means | 87.7456 | 0.3191
PAM | 51.2155 | 0.3795
CLARA | 44.7020 | 0.3891
fanny | 59.5567 | 0.2867

Geographic | Connectivity | Silhouette width
k-means | 0.0000 | 0.4443
PAM | 0.0000 | 0.4395
CLARA | 0.0000 | 0.4395
fanny | 0.0000 | 0.3567

Psychographic | Connectivity | Silhouette width
k-means | 768.5659 | 0.2467
PAM | 1136.3167 | 0.2347
CLARA | 912.2365 | 0.2610
fanny | 1599.3580 | 0.1600

Behavioral | Connectivity | Silhouette width
k-means | 373.8433 | 0.8015
PAM | 504.1885 | 0.3950
CLARA | 31.9349 | 0.8758

(34)

After examining the internal validity results in Table 4, one can clearly conclude for three out of the four variable types which algorithm resulted in the most internally valid cluster solution. For demographic variables, CLARA provides the most internally valid solution (connectivity = 44.7020, silhouette width = 0.3891); for geographic variables, k-means provides the most internally valid solution (connectivity = 0.0000, silhouette width = 0.4443); and for behavioral variables, CLARA again provides the most internally valid solution (connectivity = 31.9349, silhouette width = 0.8758). For psychographic variables, the two internal validity measures are not in harmony: k-means results in the best score for connectivity (768.5659), while CLARA results in the best score for silhouette width (0.2610). Since the objective of the cluster solution in this research is to provide relevant insights for targeting, the decision is made to prioritize silhouette width over connectivity.

A high silhouette width indicates that the clusters are heterogeneous from each other and homogeneous within (Brock et al., 2008), which permits firms to develop distinct yet appropriate marketing programs per cluster (Dibb & Simkin, 2001). Following this approach, the preferred algorithm for psychographic variables is also CLARA. Overall, one can see that behavioral clustering yields a very high silhouette width (0.8758), which is remarkably close to the perfect silhouette width of 1. This indicates that clustering based on behavioral variables provides the most heterogeneous and homogeneous cluster solution. Therefore, the overall preference is for the behavioral solution, which results in a much higher silhouette width than all the other solutions and only a slightly worse connectivity score (31.9349) than geographic clustering (0.0000).

4.3.2. Description of the behavioral clusters

(35)

This appeared to be the number of clusters at which the aforementioned majority of customers was split into two different clusters. On the other hand, this resulted in an extremely low silhouette width (0.0972), indicating that this majority of customers is too similar in behavior to be split into more clusters. Another method that was applied to divide the majority of customers was varying the behavioral variables in the cluster dataset. First, the number of customer journeys per panelist was added as an additional behavioral variable. Second, one behavioral variable was randomly omitted and in this manner different combinations were analyzed. These approaches likewise did not result in a partition in which the majority group was distributed among more than one cluster. Therefore, one can conclude that there is one colossal group of customers that is very similar in behavior.

It was expected to be difficult to make statistical comparisons between cluster 1 (9,656 customers) and cluster 2 (22 customers) due to the considerable difference in size, which affects the homogeneity of variance assumption (Tomarken & Serlin, 1986). Both Bartlett's and the Fligner-Killeen test indeed confirmed that the homogeneity of variances assumption was violated (p = .000). If these tests are significant, it is advised to conduct the non-parametric equivalent of the ANOVA analysis, which is the Kruskal-Wallis test (Vargha & Delaney, 1998). The Kruskal-Wallis test does not assume a normal distribution of the residuals; a significant Kruskal-Wallis test implies that one group stochastically dominates the other group (McKight & Najab, 2010). Inspection of descriptive statistics together with multiple Kruskal-Wallis tests subsequently resulted in the following insights concerning the two behavioral clusters (a sketch of these tests follows the list below):

§ Cluster 1 consists of the majority of the customers (9,656) and 28,916 customer journeys
§ Cluster 1 is less likely to purchase than cluster 2 (p = .000)
§ Cluster 1 mostly uses accommodations websites, information/comparison websites, tour operator/travel agent websites, flight tickets websites and retargeting
§ Cluster 1 prefers to use fixed devices more than cluster 2 (p = .000)
§ Cluster 2 comprises the minority of the customers (21) and 95 customer journeys
§ Cluster 2 predominantly uses accommodations websites, accommodations app, tour operator/travel agent websites and flight tickets app
§ Cluster 2 prefers to use mobile devices more than cluster 1 (p = .000)
§ Cluster 2 consists of younger customers than cluster 1 (p = .000)
§ Cluster 2 consists of higher educated customers than cluster 1 (p = .000)
§ Cluster 2 consists of customers with a higher income than cluster 1 (p = .000)
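As referenced above, a minimal R sketch of the variance-homogeneity checks and the non-parametric comparison; the toy data, the unequal group sizes and the tested variable are assumptions for illustration.

```r
set.seed(123)
d <- data.frame(
  cluster  = factor(c(rep(1, 400), rep(2, 20))),     # two clusters of very unequal size
  duration = c(rexp(400, 1 / 4000), rexp(20, 1 / 8000))
)

bartlett.test(duration ~ cluster, data = d)  # homogeneity of variances (parametric)
fligner.test(duration ~ cluster, data = d)   # Fligner-Killeen: robust variance-homogeneity test

# If variance homogeneity is violated, compare the clusters non-parametrically
kruskal.test(duration ~ cluster, data = d)
```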

(36)

By observing these findings, one could characterize cluster 1 as the traditional customers and cluster 2 as the modern customers. All information concerning the minimum, median, mean and maximum of variables for the two clusters is provided in Appendix D.

4.4. Consequences of mean imputation

(37)

4.5. Assumptions for logistic regression

We also want to assess how the four best scoring cluster solutions help in predicting conversion probability, in order to examine their usability. To predict conversion, a logistic regression model is applied. According to Pituch & Stevens (2015), four assumptions should be met when performing logistic regression:

1. The model needs to be correctly specified in terms of the appropriate analysis technique, the correct independent variables and interaction terms where necessary. In the methodology section it has been explained why logit regression is appropriate. The academic literature ensures that the correct independent variables are included, and the stepwise modeling approach ensures that only independent variables that are statistically appropriate remain in the model. Moreover, the models as specified in section 3.4.3.2 will be adjusted where necessary, and the specified model will be compared to multiple adjusted versions of this model to find the model with the best fit. The academic literature did not point towards interaction terms among the independent variables and therefore interaction terms are not included.

2. The second assumption is independence of observations. This means that the dataset should not, for example, contain numerous measurements of one and the same customer. However, this is actually the case in the dataset used for logistic regression, since one consumer can have several customer journeys. Therefore, this assumption is not completely satisfied. However, as outlined in section 3.4.3.1, multiple demographic variables are added to the model to control for user characteristics influencing the analysis. In this way, measures are taken to approximate this second assumption as closely as possible.

3. The third assumption is that the variables are measured without measurement error. As the data was collected by a professional market research institute, it can be assumed that the data collection minimizes measurement errors and accurately portrays the customers.

(38)

4.6. Model estimation -selection-

The cluster analysis returned the connectivity and silhouette scores for all variable types and all algorithms (see Table 4). The most valid solution for each variable type (see section 4.3.1) was attached as a new variable to the dataset aggregated on PurchaseID. The new variables derived from clustering are democlusterclara for demographic clustering, geoclusterkmeans for geographic clustering, psychoclusterclara for psychographic clustering and behaclusterclara for behavioral clustering. The modeling part is twofold: first, the basic model for conversion prediction, as specified in Eq. 3.4.3.2.1, is estimated; after that, the segmentation model specified in Eq. 3.4.3.2.2 is estimated, consecutively including democlusterclara, geoclusterkmeans, psychoclusterclara and behaclusterclara. The second part thus expands the first part. Stepwise modeling was used to compare multiple models and find the one that performs best (Lani, 2014).

First, the basic model, solely including the predictors and excluding any control variables, was estimated. Subsequently, control variables were added stepwise, which resulted in the complete model as specified in section 3.4.3.2 (model 2). The variables income and education are of an ordinal nature, with seven and eight levels, respectively. For convenience, and as they serve purely as control variables, these variables are treated as interval data in the model, which is considered common practice in research (Long & Freese, 2005; Pasta, 2009). All control variables had significant estimates and were therefore kept in the model. Subsequently, the most insignificant variables were deleted to assess whether they negatively influenced model performance. Share of CICs was the least significant variable (p = .820) and was therefore dropped to create model 3. The variables in model 3 were all significant (p < .05).
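A minimal sketch of this kind of stepwise refinement, continuing from the hypothetical glm fits sketched in section 3.4.3.2; the dropped variable is chosen by way of example and all names remain illustrative.

```r
# Start from the full model with controls (hypothetical m2 from the earlier sketch)
summary(m2)$coefficients          # inspect p-values of all terms

# Drop the least significant predictor (here: share_cic, as an example) and re-estimate
m3 <- update(m2, . ~ . - share_cic)

# Compare the nested models on fit and parsimony
AIC(m2, m3)
anova(m3, m2, test = "Chisq")     # likelihood-ratio test of the dropped term
```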

(39)

Next, the psychographic segmentation variable psychoclusterclara was included (model 7). In model 7, all variables as well as all four levels of psychoclusterclara were significant (p < .01), except for age and education. Age and education were therefore dropped to create model 8, in which all variables were highly significant (p < .01). Lastly, the behavioral segmentation variable behaclusterclara was included (model 9). All variables were significant (p < .05) and age was marginally significant (p < .10). However, the estimate for cluster 2 of the behavioral cluster solution was insignificant (p = .530).

All models were also estimated with share of CIC included, to assess whether the variable became significant under certain conditions, but it did not. Besides, all models were inspected on variance inflation factor (VIF) scores to check for multicollinearity. VIF scores above five indicate serious multicollinearity (Leeflang et al., 2016). In none of the models did the VIF scores exceed five, so one can conclude that multicollinearity was not an issue for the logistic regression modeling in this research.
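A minimal sketch of such a VIF check on a fitted logit model using the car package; the model object is the hypothetical one from the earlier sketches.

```r
library(car)

# (Generalized) variance inflation factors; for factor predictors car reports GVIF values.
# Values above five would signal serious multicollinearity.
vif(m3)
```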

To evaluate the performance of the models, the hit rate, AIC and TDL are examined. When developing models for prediction, the most critical metric is how well the model predicts the target variable on out-of-sample observations. Therefore, separate train and test datasets were prepared to evaluate the performance of the models on new data (test data). The train data contained 60% of the total observations and the test data 40%. However, only 3,674 of the 29,011 customer journeys resulted in conversion (12.66%). This indicates that the majority of the customer journeys did not result in conversion (87.34%), and hence it is relatively easy for the models to predict a large number of cases correctly. Therefore, it was decided to calculate, next to the overall hit rate, separate hit rates for the percentage of correctly predicted zeros (no conversion) and ones (conversion) for both the train data and the test data. Hit rate, AIC and Nagelkerke R² are classified as the more statistical measures, while TDL is a more practical measure. A TDL of 1 corresponds to a random model.
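A minimal sketch of a 60/40 train-test split with class-specific hit rates, continuing from the hypothetical journey_data; the split proportion follows the text, everything else is illustrative.

```r
set.seed(123)
idx   <- sample(seq_len(nrow(journey_data)), size = round(0.6 * nrow(journey_data)))
train <- journey_data[idx, ]
test  <- journey_data[-idx, ]

fit  <- glm(converted ~ n_touchpoints + device + duration + gender + income + age + education,
            data = train, family = binomial)
pred <- as.numeric(predict(fit, newdata = test, type = "response") >= 0.5)

mean(pred == test$converted)           # overall hit rate (test)
mean(pred[test$converted == 0] == 0)   # hit rate for zeros (no conversion)
mean(pred[test$converted == 1] == 1)   # hit rate for ones (conversion)
```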

(40)

4.6.1. Estimation results part I

Table 5. Model overview I

Model | Predictor variables | Description
1 | $TP_i + DE_i + CIC_i + DU_i$ | Control variables excluded
2 | $TP_i + DE_i + CIC_i + DU_i + G_i + IN_i + A_i + E_i$ | Control variables included
3 | $TP_i + DE_i + DU_i + G_i + IN_i + A_i + E_i$ | Insignificant variables excluded

Table 6. Model performance I

Numbers rounded to three decimal places.

4.6.2. Estimation results part II

Table 7. Model overview II

Model | Predictor variables | Description
4 | $TP_j + DE_j + DU_j + G_j + IN_j + A_j + E_j + S_j^1$ | Demographic cluster solution included
5 | $TP_j + DE_j + DU_j + IN_j + A_j + E_j + S_j^1$ | Demographic cluster solution included; insignificant variables excluded
6 | $TP_j + DE_j + DU_j + G_j + IN_j + A_j + E_j + S_j^2$ | Geographic cluster solution included
7 | $TP_j + DE_j + DU_j + G_j + IN_j + A_j + E_j + S_j^3$ | Psychographic cluster solution included
8 | $TP_j + DE_j + DU_j + G_j + IN_j + S_j^3$ | Psychographic cluster solution included; insignificant variables excluded
9 | $TP_j + DE_j + DU_j + G_j + IN_j + A_j + E_j + S_j^4$ | Behavioral cluster solution included

Table 8. Model performance II

Model | Hit rate train | Hit rate 0 train | Hit rate 1 train | Hit rate test | Hit rate 0 test | Hit rate 1 test | AIC | Nagelkerke R² | TDL
4 | 87.312% | 87.844% | 49.123% | 87.331% | 99.112% | 6.059% | 20431.62 | 0.103 | 3.155
5 | 87.315% | 87.847% | 49.250% | 87.348% | 99.132% | 6.059% | 20429.95 | 0.103 | 3.155

Numbers rounded to three decimal places.


4.7. Model selection

As can be derived from Table 6 and Table 8, the models can be compared on two aspects: 1) model fit and 2) model performance. For model fit, one should inspect the Nagelkerke R² and the AIC. The Nagelkerke R² is highest for model 7 (0.105), with models 4, 5 and 6 as close runners-up (0.103). The AIC is lowest for model 8 (20,406.2), closely followed by model 7 (20,409.07); a lower AIC is desirable according to Pituch & Stevens (2016). Moreover, the predictive power of all models appears to be quite good. Model 7 has the highest overall predictive power (87.336%), with model 8 as a close runner-up (87.334%). However, model 6 has the highest predictive power for zeros (87.857%) and model 7 the highest predictive power for ones (50%). When reviewing the TDL in Tables 6 and 8, the values are close: model 1 has the highest TDL and model 7 the third-highest (a difference of only 0.139). TDLs of 3.351 (model 1) and 3.212 (model 7) indicate that these models are respectively 3.351 and 3.212 times better than random selection at predicting conversion. To assess predictive power on unknown data, the overall hit rate, hit rate 0 and hit rate 1 are also reported for the test set. The overall hit rate and the hit rate for zeros remain similar, whereas the hit rate for ones drops sharply to about 6%: models 1-6 correctly predict 6.059% of the converting journeys, models 7 and 8 5.922% and model 9 5.990%.
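For completeness, a small sketch of the fit statistics used in this comparison (AIC and Nagelkerke R²), computed from a fitted statsmodels logit result such as the assumed fit7 object from the earlier sketch; the formula rescales the Cox & Snell pseudo-R² by its maximum attainable value.

```python
# AIC is available directly on the fitted result; Nagelkerke R² is derived
# from the model and null log-likelihoods.
import numpy as np

def nagelkerke_r2(fit):
    n = fit.nobs
    cox_snell = 1 - np.exp((2 / n) * (fit.llnull - fit.llf))
    max_cox_snell = 1 - np.exp((2 / n) * fit.llnull)
    return cox_snell / max_cox_snell

print(round(fit7.aic, 2), round(nagelkerke_r2(fit7), 3))
```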

Overall, one can conclude that the best prediction model is model 7, with the highest overall hit rate, the highest hit rate 1, the third-highest hit rate 0, the second-lowest AIC, the highest Nagelkerke R², the third-highest TDL, the highest overall hit rate for test data, the third-highest hit rate 0 for test data and the third-highest hit rate 1 for test data. Therefore, the results of model 7 are displayed in Table 9. For enhanced interpretability, the odds-ratios (OR), marginal effects (ME) and variable importance (Var. imp.) are presented in the same table. Although output from logistic regression models is known to be difficult to interpret, four means of interpretation are available: the β-coefficients, the odds-ratios, the marginal effects and the variable importance. Note that these measures can only support conclusions if they are statistically significant. To ensure correct interpretation of the parameters, all measures are discussed briefly below.


Table 9. Logistic regression estimation results (Model 7)

Var. | β | Std. Error | z value | Sig. | OR | ME | Var. imp.
(Intercept) | -2.94100 | 0.12900 | -22.794 | <0.001*** | 0.05283 | |
$TP_j$ | 0.00085 | 0.00016 | 5.371 | <0.001*** | 1.00085 | 0.0001 | 5.37121
$DE_j$ (mobile) | -0.61920 | 0.05452 | -11.357 | <0.001*** | 0.53839 | -0.0550 | 11.3570
$DU_j$ | 0.00003 | 0.00000 | 9.459 | <0.001*** | 1.00003 | 0.0000 | 9.45908
$A_j$ | 0.00104 | 0.00156 | 0.670 | 0.50279 | 1.00104 | 0.0001 | 0.67010
$G_j$ (F) | -0.11040 | 0.04021 | -2.745 | <0.01** | 0.89552 | -0.0114 | 2.74467
$IN_j$ | 0.11350 | 0.01750 | 6.483 | <0.001*** | 1.12015 | 0.0115 | 6.48276
$E_j$ | 0.01341 | 0.01561 | 0.859 | 0.39032 | 1.01350 | 0.0014 | 0.85903
$S_{2j}^3$ | 0.52520 | 0.09948 | 5.280 | <0.001*** | 1.69087 | 0.0508 | 5.27989
$S_{3j}^3$ | 0.35700 | 0.06725 | 5.309 | <0.001*** | 1.42908 | 0.0324 | 5.30914
$S_{4j}^3$ | 0.58660 | 0.10120 | 5.799 | <0.001*** | 1.79782 | 0.0580 | 5.79888
$S_{5j}^3$ | 0.33880 | 0.08400 | 4.033 | <0.001*** | 1.40320 | 0.0305 | 4.03265

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Where $S_{2j}^3$ stands for cluster 2 from cluster solution 3 (psychographic), $S_{3j}^3$ for cluster 3, $S_{4j}^3$ for cluster 4 and $S_{5j}^3$ for cluster 5; cluster 1 is the reference level.

When decoding the β-coefficients, positive parameters indicate a positive relationship, whereas negative parameters indicate a negative relationship. A positive relationship means that an increase in the variable increases the probability of observing Y = 1 (i.e. a customer converts), while a negative relationship means that an increase in the variable decreases the probability of observing Y = 1. Similarly, odds-ratios indicate a positive relationship when their values are larger than 1 and a negative relationship when their values are smaller than 1. An odds-ratio is simply the exponent of the β-coefficient; both present the same result in a different format. Another measure to interpret the results is the marginal effect. For a binary independent variable, the marginal effect indicates the change in the probability of observing Y = 1 for a discrete change in that variable (i.e. from 0 to 1). For continuous independent variables, it indicates the change in the probability of observing Y = 1 for an instantaneous change. The marginal effects (ME) reported in Table 9 assume that the other covariates are at their average values. The final measure to interpret the results is the variable importance, which assesses the relative importance of the individual predictors and is given by the absolute value of the t-statistic (z value) of each model parameter.
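A brief sketch of how these four quantities can be obtained from a fitted statsmodels logit result, using the assumed fit7 object from the earlier sketches:

```python
# Odds-ratios, marginal effects at the covariate means, and |z| as variable importance.
import numpy as np
import pandas as pd

summary = pd.concat(
    [fit7.params.rename("beta"),
     np.exp(fit7.params).rename("OR"),        # >1: positive, <1: negative relationship
     fit7.tvalues.abs().rename("var_imp")],   # absolute z value per parameter
    axis=1,
)
print(summary)
# dP(Y = 1)/dx with covariates at their means; dummy=True gives the discrete
# 0-to-1 change for dummy regressors, as described in the text.
print(fit7.get_margeff(at="mean", dummy=True).summary())
```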

As can be derived from Table 9, a positive relationship exists between conversion probability and the number of touch points ($TP_j$), the duration of the customer journey ($DU_j$) and income ($IN_j$), whereas using a mobile device ($DE_j$) and being female ($G_j$) are negatively related to conversion probability.


Furthermore, belonging to psychographic cluster 2, 3, 4 or 5 increases conversion probability (relative to cluster 1, which is the reference level). This is also reflected in odds-ratios above 1 for the positive relationships and below 1 for the negative relationships, as well as in the signs of the marginal effects. The marginal effect of gender ($G_j$), a dummy variable, for example means that for the average observation the probability of observing a purchase (Y = 1) decreases by 0.0114 when going from male to female. The marginal effect of the number of touch points ($TP_j$), a continuous variable, means that for the average observation the probability of observing a purchase (Y = 1) increases by 0.0001 for each additional touch point in the customer journey. In terms of variable importance, the device used is most important, followed by duration and income. The output of the other models is presented in Appendix F.
