Determining the stages of a customer journey using a Hidden Markov Model
Author: Jeanna van Haren
Department: Faculty of Economics and Business (FEB)
Qualification: Master thesis
Completion date: 17-06-2019
Address: Guldenslag 55, 3991WC Houten
Phone: +31623186733
E-mail: [email protected]
Student number: s3845443
First supervisor: Dr. Frank Beke
Management Summary
Table of Contents
Management Summary
1. Introduction.
2. Theoretical background.
2.1. Literature Review.
2.1.1. CUSTOMER JOURNEY.
2.1.2. PREDICTIVE MODELS.
2.2. Conceptual Framework.
3. Research design
3.1. Type of data
3.2. Data preparation
3.2.1. STRUCTURE OF THE DATASET
3.2.2. VARIABLE DESCRIPTION
3.2.3. MISSING VALUES & OUTLIERS
3.3. Model formulation
3.3.1. HIDDEN MARKOV MODEL
3.3.2. ANALYSIS IN LATENT GOLD
3.3.3. PERFORMANCE OF THE MODEL
4. Results
4.1. Descriptives
4.2. Results HMM
4.2.1. NUMBER OF STATES
4.2.2. INTERPRETATION OF THE STATES
4.2.2.1. STATE 1
4.2.2.2. STATE 2
4.2.2.3. STATE 3
4.2.2.4. STATE 4
4.2.2.5. STATE 5
4.2.2.6. STATE 6
4.2.2.7. STATE 7
4.2.3. ESTIMATION RESULTS
4.2.4. MODEL FIT
4.3. Classification ability
5. Discussion
6. Conclusion & Recommendations
6.1. Managerial implications
7. Limitations & Future Research
8. References.
1. Introduction.
Online and offline environments are converging at an ever-increasing pace, making it necessary for companies to reach customers anytime, anywhere and with the right message (Van Bommel, Edelman, & Ungerman, 2014). To do this, marketers need to understand the journey that customers go through when forming purchase decisions (Court, Elzinga, Mulder, & Vetvik, 2009). Understanding the whole customer journey is difficult because of obstacles such as “multiple data sources, data accuracy, time and money & organizational structure” (Lee, 2010). In addition, the digital technologies available to customers have given them more control, so that they now decide where to focus their attention (Edelman & Singer, 2015; Van Bommel et al., 2014). According to Van Bommel et al. (2014), this increased control has generated the need for companies to create a compelling customer experience in which all interactions are tailored to the journey stage a customer is in. However, to create such tailored interactions, the company must know which stage of the decision journey the customer is in at any given moment. The application of this knowledge is illustrated in several research papers that try to determine appropriate (online) advertising strategies. These papers found that advertising formats and marketing channels have different effects depending on the customer’s stage in the decision journey (Abhishek, Fader, & Hosanagar, 2012; Anderl, Becker, Von Wangenheim, & Schumann, 2016; Batra & Keller, 2016). However, before a company can make these kinds of marketing decisions, it needs to know the stages of the journey that its customers go through.
website, instead of presenting one general customer journey on the basis of a combination of individual customers’ behaviors and experiences. The proposed model is important for companies, as they gather more and more data on customer activity that is then used for targeting purposes. However, the customer motivations on which targeting actions are often based usually stem from latent states rather than from the observed activity itself (Ebbes & Netzer, 2018). Therefore, practitioners could implement the proposed model at their company and then use the information about a customer’s latent state for their targeting purposes. This could improve the customer experience, especially by tailoring marketing actions to the stage of the customer journey that a customer is in. The model is also scientifically relevant: according to Kranzbühler, Kleijnen, Morgan, & Teerling (2018), a subsequent step in customer journey research is to go beyond the commonly theorized framing of customer journeys towards more empirically derived models. This paper provides a first attempt at creating such a model.
With the proposed model, this paper tries to retrieve the stages of the customer journey from data. It thereby tries to answer an important question for marketers and academics, namely: “is it possible to determine the stage customers are in, in their decision journey?”. This question has two underlying sub-questions, namely: “what are the stages in a customer journey?” and “what behavior is indicative of which stage?”. To answer these questions, this paper analyses site-centric data of a Dutch home furniture company and uses this data as input to estimate a Hidden Markov Model. Such a model classifies customers into latent states on the basis of their observed behavior (Leeflang, Wieringa, Bijmolt, & Pauwels, 2017). By answering the two sub-questions, this research also illustrates what the stages of the customer journey are and what behavior is typical for each stage. Additionally, it provides recommendations to marketers on how to act upon these insights.
2. Theoretical background.
This section examines prior literature about the customer journey, the different versions of customer journey models and several predictive models that relate to the customer journey. Moreover, definitions of the key terms that are used throughout the paper are given. At the end of this section the conceptual framework is set out.
2.1. Literature Review.
2.1.1. CUSTOMER JOURNEY.
To start, it is important to define what is meant by the customer journey. According to Følstad & Kvale (2018), most papers about the customer journey refer to it as a sequence, process, or path that a customer goes through when using or accessing a service. In this paper, the customer journey is seen in a similar manner and is defined as: the stages that a customer goes through when buying and using a product.
(2010). He proposes a journey that is based on the typical sales funnel (awareness, research, purchase) and adds another step, namely the out-of-box-experience (OOBE) (Richardson, 2010). Moreover, Google has also created a framework for the customer journey that encompasses the following four steps: see, think, do, care. Throughout these four steps, the customers that are considered for targeting in each step become more specific, and the last stage considers the existing customers (Eriksson, 2015). Lastly, Lemon & Verhoef (2016) propose a simplified customer journey that takes into account only three stages: pre-purchase, purchase and post-purchase.
However, the models that have been discussed are based on a theoretically predetermined process, and therefore they will not be able to predict a stage of a customer in his or her journey at a given point in time. A predictive model is needed for this, and the use of such models in regard to the customer journey is discussed further in the next section.
2.1.2. PREDICTIVE MODELS.
To the best of my knowledge, HMMs or similar models have thus far not been used to predict the stage a customer is in during his or her customer journey. However, HMMs or similar models have been used for various other purposes concerning the customer journey.
One of these purposes is the study of customer relationships: Netzer, Lattin, & Srinivasan (2008) used a nonhomogeneous HMM to model the dynamics of the customer-firm relationship, based on several interactions between a firm and its customer.
Furthermore, Montgomery, Li, Srinivasan, & Liechty (2004) create a model to examine the order of pages visited by a customer on a bookseller website. They find that the course a customer takes on the website reflects his or her goals, which could be useful in predicting future actions on the website.
approach as the approach taken in this paper by modelling the customers’ route to purchase. However, there are still substantial differences which will be clarified in the next section where the conceptual framework underlying this study is presented.
2.2. Conceptual Framework.
In this article, a process is introduced where the customer will move through the purchase phase. The purchase phase encompasses all stages prior to the purchase and the purchase stage itself.
The proposed model uses a customer’s observed behavior on the website to infer the state that the customer is in during his or her customer journey. The model is inspired by the model for customer relationship dynamics of Netzer et al. (2008) and the model predicting a person’s job seeking state of Ebbes & Netzer (2018). The model is conceptualized in figure 1. State i refers to the state that a customer is in based on his or her current and past behavior.
customers use the basket as a substitute for a wish list, because this avoids the hassle of moving products while still allowing them to easily thin out their consideration set (Close & Kukar-Kinney, 2010). Moreover, putting a product in the basket can also indicate an intention to purchase, as it is a necessary step before customers are able to enter transaction details and actually buy the product (Sismeiro & Bucklin, 2004).
The main difference with the models presented in the section above is that the interest of this paper lies not in predicting a certain outcome (e.g., a purchase) but in determining the latent state itself. This approach is similar to the work of Ebbes & Netzer (2018), who use a partially Hidden Markov Model to infer and predict the job seeking state of a person. Moreover, Abhishek et al. (2012) analyze the behavior of consumers when they are presented with certain advertisements, modelling the customer journey on the basis of interaction with these advertisements. In this paper, however, the focus is not only on the interaction with an advertisement, measured by the channel through which the customer visits the site, but also on further behavior on a company’s website. Therefore, this model creates a more holistic picture of the online behavior of customers and its relation to the stage of the customer journey they are in.
Figure 1: Conceptual Framework
3. Research design
This section sets out the type of data, the preparation of the data before analysis, and the type of model used to answer the research question. It also describes which software program is used for estimation and how the predictive ability of the model is assessed.
3.1. Type of data
User-centric data tracks the user’s activity across sites, while site-centric data only tracks the activities on one specific website (Mullarkey, 2004). In this paper, site-centric clickstream data is used. It is important to note that this kind of data has limitations for the purpose of this paper, as models built with it have been found to perform worse on prediction tasks than models built with user-centric data (Padmanabhan et al., 2001). However, these findings refer to rather different prediction tasks than the one performed here, and thus the limitations could be less severe for this study. Moreover, the differences in performance are more striking for predictions at the session level than at the user level, and the latter is used in this study. Furthermore, as mentioned before, site-centric clickstream data is usually the only data that is available to companies (Padmanabhan et al., 2001). Therefore, the proposed model is likely to be more useful for marketers than models based on user-centric clickstream data.
3.2. Data preparation
Before the data can be used for estimation it has to be in the right structure and several additional variables have to be created. For the creation of these variables and such a structured dataset the software program R has been used.
3.2.1. STRUCTURE OF THE DATASET
The full dataset was split into 5 different datasets, all of which contain the variable session id, through which they are combined. Before merging the datasets, all sessions without a client id were removed, because the proposed model needs a case id that connects the sessions to a single user. In addition, following the reasoning of Bucklin and Sismeiro (2003), clients that had sessions containing only one URL were removed from the data, as these observations do not contain information about the browsing behavior on the website. Such sessions could also reflect an accidental click (Bucklin & Sismeiro, 2003; Hofgesang, 2006).
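The two removal steps described above can be sketched in a few lines. The thesis used R for data preparation; the Python below is only an illustration, and the record layout (dictionaries with `client_id` and `urls` keys) and the function name `filter_sessions` are assumptions made for this example.

```python
def filter_sessions(sessions):
    """Apply the two removal rules to a list of session records.

    Each record is assumed (hypothetically) to be a dict with the keys
    'client_id' (None when unknown) and 'urls' (list of visited URLs).
    """
    # Rule 1: a session without a client id cannot be linked to a case.
    with_client = [s for s in sessions if s.get("client_id") is not None]

    # Rule 2: remove every client that has at least one session
    # consisting of a single URL (or none at all).
    single_url_clients = {
        s["client_id"] for s in with_client if len(s["urls"]) <= 1
    }
    return [s for s in with_client if s["client_id"] not in single_url_clients]
```

Note that the second rule is applied at the client level, as in the text: one single-URL session removes all sessions of that client.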
10% of the client ids is created to make further analysis less time-intensive. This new subset is used to create the additional variables needed for further analysis, which are described in the next section. However, one created variable is worth mentioning here, namely the time since last purchase. This variable is created from the timestamp variable and is used to determine which sessions belong to the post-purchase phase of a journey and which sessions belong to the stages that occur during the purchase phase. On this basis, all sessions that occurred within 30 days after a purchase were removed. The threshold is set at 30 days because the data comes from a furniture company and some products, especially tailor-made products, can take a long time to deliver (Tammela, Canen, & Helo, 2008). Sessions that occur more than 30 days after the last purchase are considered to be the start of a new journey and are thus included in the dataset that is used for analysis.
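The 30-day rule can be illustrated as follows. This is a sketch under stated assumptions, not the R code used for the thesis: the record layout (a `timestamp` and a boolean `purchase` flag per session) and the function name are hypothetical, and it is assumed that a purchase made inside the window restarts the 30-day count.

```python
from datetime import datetime, timedelta

POST_PURCHASE_WINDOW = timedelta(days=30)  # threshold stated in the text


def drop_post_purchase_sessions(sessions):
    """Filter the sessions of ONE client.

    Each session is assumed to be a dict with 'timestamp' (datetime) and
    'purchase' (bool). Sessions within 30 days after the most recent
    earlier purchase are dropped; later ones start a new journey.
    """
    kept = []
    last_purchase = None
    for session in sorted(sessions, key=lambda s: s["timestamp"]):
        if last_purchase is None:
            kept.append(session)
        elif session["timestamp"] - last_purchase > POST_PURCHASE_WINDOW:
            kept.append(session)
        # Assumption: any purchase, kept or not, restarts the window.
        if session["purchase"]:
            last_purchase = session["timestamp"]
    return kept
```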
After all the necessary variables are created, the dataset is structured in such a way that one row corresponds to one distinct session. This fits the structure required by the proposed model, which assumes that the individual cases, identified by client id, are observed at equally spaced time points, identified by session id. The proposed model allows the total number of time points to differ per case. The dataset only includes clients that have more than one distinct session. The session id for a client starts at one and increases by one for every subsequent session.
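A minimal sketch of this restructuring step is given below. It assumes, for illustration only, that each session record carries a `client_id` and a sortable `timestamp`; the name `session_index` for the per-client counter is hypothetical.

```python
from collections import defaultdict


def number_sessions(sessions):
    """Return one row per session with a per-client session index.

    Clients with fewer than two sessions are dropped. Within a client,
    sessions are ordered by 'timestamp' and numbered 1, 2, 3, ...
    """
    by_client = defaultdict(list)
    for session in sessions:
        by_client[session["client_id"]].append(session)

    rows = []
    for client_sessions in by_client.values():
        if len(client_sessions) < 2:
            continue  # the model is estimated on multi-session clients only
        ordered = sorted(client_sessions, key=lambda s: s["timestamp"])
        for index, session in enumerate(ordered, start=1):
            rows.append({**session, "session_index": index})
    return rows
```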
3.2.2. VARIABLE DESCRIPTION
The variables that are included in the analysis can be classified into two categories, based on the aforementioned aspects that define customer behavior in the conceptual framework.
1. Observed choice – The decision whether or not to purchase a product. It is captured by one of the indicators, a dummy variable for purchase, which takes the value ‘Yes’ if a person has visited the URL “/payments” and ‘No’ otherwise.
framework (e.g., Abhishek et al., 2012; Anderl et al., 2016; Lee, 2010; Wolny & Charoensuksai, 2014; Wooff & Anderson, 2015; Zhang & Duan, 2014). The first variable that was created was the type of page that someone visited. As explained before, the type of page somebody visits is expected to have a big impact on the state(s) that a customer is in, as it shows the way a customer engages with the company (Lemon & Verhoef, 2016). This variable was created by classifying the visited URLs into categories. Afterwards, it was used to create separate variables for the number of pages visited per type of page. These count variables were then converted from continuous to categorical by dividing the number of pages visited into different levels. The types of pages used and their corresponding URLs are summarized in table 1 below. Where a type covers many different URLs, only a few are given as examples. These types of pages differ in how deeply they are built into the website, and in table 1 they are ordered from the shallowest to the deepest pages. It is assumed that the types of pages that are more deeply built into the website are visited more often in later stages of the journey. A similar strategy is used to classify the landing page and exit page. However, for these variables the page types “Account”, “Advice”, “Brand Page (BP)”, “Contact”, “Search”, “Service”, “StorePage” and “Other” are added, again based on the URL of the page visited.
Table 1: Type of pages used as explanatory variables.
Type | URL(s)
Home page | “/”
Inspiration page (Insp) | “trend…”, “/blog…”, “/woonstijlen…”, “/magazine…”, “/folder…”, “/inspiratie…”, “/tv…”
Category | “/wonen”, “/slapen”
Product Category Page (PCP) | For example: “/wonen-bankstellen”, “/wonen-stoelen”, “/slapen-bedden”, “/slapen-boxsprings”
Product Overview Page (POP) | For example: “/wonen-bankstellen-…”, “/wonen-stoelen-…”, “/slapen-bedden-…”, “/slapen-boxsprings-…”, “/slapen-matrassen-…”
Product Detail Page (PDP) | “/p/”
Basket | “/basket/index/”
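The classification in table 1 can be sketched as a small URL-to-type function. The matching rules are assumptions inferred from the examples in the table, not the author's actual R code: patterns are treated as prefixes, a category root followed by one hyphenated segment is taken to be a PCP, and a root followed by two or more segments a POP. The “trend…” pattern is left out because its full form is not given, and the fallback type “Other” is used here for any unmatched URL.

```python
from collections import Counter

# Prefixes taken from table 1 (a subset; matching by prefix is an assumption).
INSPIRATION_PREFIXES = (
    "/blog", "/woonstijlen", "/magazine", "/folder", "/inspiratie", "/tv",
)
CATEGORY_ROOTS = ("/wonen", "/slapen")


def classify_page(url):
    """Map a URL path to one of the page types of table 1."""
    if url == "/":
        return "Home page"
    if url.startswith("/basket/index/"):
        return "Basket"
    if url.startswith("/p/"):
        return "PDP"
    if url.startswith(INSPIRATION_PREFIXES):
        return "Insp"
    for root in CATEGORY_ROOTS:
        if url == root:
            return "Category"
        if url.startswith(root + "-"):
            # Assumed rule: one segment after the root is a PCP,
            # two or more segments is a POP.
            segments = url[len(root) + 1:].split("-")
            return "PCP" if len(segments) == 1 else "POP"
    return "Other"


def page_type_counts(urls):
    """Count the number of pages visited per page type in one session."""
    return Counter(classify_page(url) for url in urls)
```

The per-type counts returned by `page_type_counts` correspond to the count variables that are subsequently divided into levels.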
variable, again dividing the number of pages visited into different levels. As explained before, the number of pages visited relates positively to the purchase probability (Mallapragada et al., 2016). Therefore, this variable could give insights into the stage of the journey that a customer is in and is thus included in the analysis. Furthermore, the variable average time on page has also been changed from a continuous to a categorical variable, where the time in minutes is divided into different levels. This variable is included as it could give insights into how deeply a customer processes pages in general. The last variable that is changed is the session duration. This variable is important for the analysis as it also has a positive relation to the purchase probability (Mallapragada et al., 2016). Moreover, the information in this variable could complement the information from type of page, which together could give insight into how deeply certain pages are processed by a customer (Moe, 2003). The variable session duration contains the total time a person spent on the website during a session. Because the time at which someone leaves the website is not recorded, the time spent on the last page is not known, which means that the recorded session duration is incomplete. Therefore, the value of the average time on page is added once to the session duration, which appeared to give a reasonable approximation of the true session duration. Afterwards, this new variable for session duration was also changed from a continuous to a categorical variable by dividing the total time spent on the website into different levels.
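The two transformations described above, correcting the session duration and dividing a continuous variable into levels, can be sketched as follows. The function names are hypothetical. The level boundaries in the example are those reported for the POP page count in table 2; the boundaries used for the time variables are not stated in this section, so none are assumed here.

```python
from bisect import bisect_left

# Upper bounds of the POP page-count levels reported in table 2:
# 0, 1-5, 6-10, 11-20, 21-40, and 41 or more.
POP_UPPER_BOUNDS = [0, 5, 10, 20, 40]
POP_LABELS = ["0", "1-5", "6-10", "11-20", "21-40", "41+"]


def corrected_session_duration(recorded_minutes, avg_time_on_page):
    """Add the average time on page once, as a stand-in for the
    unrecorded time spent on the last page of the session."""
    return recorded_minutes + avg_time_on_page


def to_level(value, upper_bounds, labels):
    """Map a numeric value to the label of the level it falls in.

    upper_bounds must be sorted and labels must have one more element
    than upper_bounds (the last label covers everything above).
    """
    return labels[bisect_left(upper_bounds, value)]
```

For example, `to_level(7, POP_UPPER_BOUNDS, POP_LABELS)` places a session with seven POP views in the "6-10" level.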
3.2.3. MISSING VALUES & OUTLIERS
the time for the last page another inconsistency was discovered: several sessions did not have a correct value for session duration, because some values for time on the last page turned out to be negative. As this is impossible, the client ids that had sessions where this occurred were removed from the dataset.
The dataset is checked for outliers, as they can disproportionately influence the estimation of the proposed model. Outliers are identified by creating a boxplot of the variable in question, where all values higher than the 75th percentile plus 1.5 times the interquartile range and all values below the 25th percentile minus 1.5 times the interquartile range are seen as outliers. The first variable that is checked for outliers is the session duration. According to the boxplot, this variable has many outliers. However, when the outliers are considered in combination with the total number of pages viewed in a session, many do not appear strange, as observations with a high value for session duration also had a high value for total pages viewed. Therefore, instead of the boxplot rule, a cut-off of 30 minutes is chosen for total session duration, which is in line with the recommended threshold (Das & Turkoglu, 2009; Spiliopoulou, Mobasher, Berendt, & Nakagawa, 2003). In addition, the number of pages viewed is taken into account when determining outliers: client ids that had sessions with a session duration of more than half an hour were removed only when the total number of pages viewed was below 3.
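Both rules, the boxplot fences and the combined duration and page-count rule, can be sketched as below. The field names `duration_min` and `pages` are hypothetical, and the quartiles are computed with the default method of Python's `statistics.quantiles`, which may differ slightly from the quartile definition used by the boxplot function in R.

```python
from statistics import quantiles


def boxplot_fences(values):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def clients_to_remove(sessions, max_minutes=30.0, min_pages=3):
    """Client ids with a session longer than max_minutes in which
    fewer than min_pages pages were viewed."""
    return {
        s["client_id"]
        for s in sessions
        if s["duration_min"] > max_minutes and s["pages"] < min_pages
    }
```

A long session with many page views is therefore kept, while a long session with only one or two page views removes the client.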
Besides the total session duration, the variable average time on page is also checked for outliers. Again, there are many outliers according to the boxplot. However, as the time on page is influenced by several factors, such as the content of a page or the time needed to load a page, determining a cut-off based on the boxplot is not desirable (Spiliopoulou et al., 2003). Therefore, the cut-off is based on the ten-minute threshold proposed by Spiliopoulou et al. (2003): all clients that have a session with an average time on page of more than 10 minutes are removed (approximately 2.3% of the full dataset).
After these steps, the dataset, consisting of 186.327 observations, is ready for analysis. The next section describes the model that is proposed for this purpose.
3.3. Model formulation
journey is hidden (Abhishek et al., 2012). More specifically, this research tries to determine the state in the customer journey that someone belongs to, during each time period, and how the variables as described in the variable description part impact the likelihood to belong to each of the states. Following this, the model that has been used for this purpose is a basic Hidden Markov Model (HMM). This model as well as the approach taken for the data analysis is described in the following subsections.
3.3.1. HIDDEN MARKOV MODEL
Several versions of an HMM are possible. The basic model, however, has three main components, which correspond to three sets of parameters that need to be estimated (Leeflang, Wieringa, Bijmolt, & Pauwels, 2017; Netzer et al., 2008):
1. The initial state distribution – the probability that customer i belongs to state s at time 1: $P(S_{i1} = s) = \pi_{is}$.
2. The transition probability matrix – the likelihood that the customer’s interactions with the company website in the previous time period were strong enough to move the customer to another state: $P(S_{it} = s' \mid S_{it-1} = s) = Q_{itss'}$.
3. The state-dependent distributions – the distribution of the observed activities, dependent on the user’s state $S_{it}$: $P(Y_{it} \mid S_{it} = s) = M_{its}$ (Ebbes & Netzer, 2018). Here, these activities refer to the observed behavior on the website as well as the decision whether or not to purchase a product. The activities are all of a discrete nature, e.g., a consumer bought a product.
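To show how the three components combine, the likelihood of the observed sequence of one customer can be written in the matrix form that is standard for HMMs. The notation below is a sketch: it takes $\pi_i$ to be the vector of initial state probabilities, $Q_{it}$ the matrix of transition probabilities into period $t$, $\tilde{M}_{it}$ a diagonal matrix holding the state-dependent probabilities of the observation at time $t$, $\mathbf{1}$ a vector of ones and $T_i$ the number of sessions of customer $i$. In a basic HMM without covariates the transition matrix does not vary over customers or time, so $Q_{it} = Q$.

```latex
L_i = P(Y_{i1}, \ldots, Y_{iT_i})
    = \pi_i^{\top} \, \tilde{M}_{i1}
      \left( \prod_{t=2}^{T_i} Q_{it} \, \tilde{M}_{it} \right) \mathbf{1},
\qquad
\tilde{M}_{it} = \operatorname{diag}\bigl( P(Y_{it} \mid S_{it} = 1), \ldots, P(Y_{it} \mid S_{it} = S) \bigr)
```

The product sums over all possible state paths, which is why the states themselves never need to be observed to evaluate the likelihood.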
The HMM proposed in this paper is a basic HMM, which does not account for covariate effects. As a consequence, the included variables affect only the state-dependent distributions. This restriction is deliberate: this research is more interested in understanding how the various variables, including the choice variable, relate to the estimated states than in proposing a model that shows the impact of the variables on the probability of observing the choice variable (purchase yes/no).
does usually not follow a linear structure (Batra & Keller, 2016; Wolny & Charoensuksai, 2014).
The objective of this paper is to recover the state membership of a customer in each session (time t), for which two approaches have been suggested (Leeflang et al., 2017):
1. Filtering – here, only the information known up to time t is used to retrieve the state of customer i.
2. Smoothing – here, all available information in the data is used to infer, at any time point, the state of a customer during the period for which data has been observed.
In this paper, the filtering approach is used, because the interest lies in recovering a customer’s state at time t based on earlier observed behavior on the website rather than on future observed behavior. This is also of more value to marketers, as it could help them discover the state of a customer when future behavior is not yet known.
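The filtering approach corresponds to the forward recursion of an HMM. The sketch below is a generic, textbook-style illustration and not the Latent Gold implementation; the function name and the argument layout are assumptions, and the state-dependent probabilities of each observation are taken as given.

```python
def forward_filter(initial, transition, emission_probs):
    """Filtered state probabilities P(S_t = s | Y_1, ..., Y_t) for each t.

    initial:        list of length S with P(S_1 = s)
    transition:     S x S nested list, transition[r][s] = P(S_t = s | S_{t-1} = r)
    emission_probs: one list per time point, emission_probs[t][s] = P(Y_t | S_t = s)
    """
    n_states = len(initial)
    filtered = []
    previous = None
    for probs in emission_probs:
        if previous is None:
            prior = list(initial)
        else:
            # Predict: push the previous filtered distribution through
            # the transition matrix.
            prior = [
                sum(previous[r] * transition[r][s] for r in range(n_states))
                for s in range(n_states)
            ]
        # Update: weight by how well each state explains the observation.
        unnormalised = [prior[s] * probs[s] for s in range(n_states)]
        total = sum(unnormalised)
        # Normalising at every step also guards against numerical underflow.
        previous = [u / total for u in unnormalised]
        filtered.append(previous)
    return filtered
```

Because only observations up to time t enter the recursion, the state estimate for a session never depends on later sessions, which is what distinguishes filtering from smoothing.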
The number of hidden states is selected on the basis of the data. This differs from the usual approach taken in marketing, where the customer journey stages are predefined on the basis of theory (Følstad & Kvale, 2018). The approach taken here optimizes the number of hidden states on the basis of the fit of the model to the data. This model fit is determined by comparing several information criteria, such as the BIC, AIC and CAIC, which balance model fit and model parsimony (Leeflang et al., 2017). Thus, after estimating several models with a different number of hidden states, the model with the lowest values on the information criteria is chosen as the appropriate model.
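The three criteria can be computed from the log-likelihood with their standard definitions, as sketched below. The function names are hypothetical, and which sample size enters the BIC and CAIC (number of cases or number of observations) is not stated in the text, so `n_obs` is left as an argument.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """AIC, BIC and CAIC based on the log-likelihood (lower is better)."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
        "CAIC": deviance + n_params * (math.log(n_obs) + 1),
    }


def best_number_of_states(candidates, n_obs, criterion="BIC"):
    """Pick the number of states with the lowest value on one criterion.

    candidates maps a number of states to a (log_likelihood, n_params) pair.
    """
    return min(
        candidates,
        key=lambda k: information_criteria(*candidates[k], n_obs)[criterion],
    )
```

The BIC and CAIC penalise each extra parameter more heavily than the AIC whenever the sample size exceeds about eight, so they tend to favour models with fewer states.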
3.3.2. ANALYSIS IN LATENT GOLD
The estimation of the model is done using the software program Latent Gold 5.1 (Vermunt & Magidson, 2016). In this program the observed activities on the website are included as the indicators ($Y_{it}$), since they will then only impact the state-dependent distributions. The activities are captured in the various variables created as explained in the variable description part. Moreover, it is important that the data contains a case id that connects the multiple sessions to the same person (Vermunt & Magidson, 2016). In the dataset used in this paper, the case id is the same as the client id.
use of the EM algorithm is a popular method for estimating an HMM; however, it can suffer from local maxima and/or underflow (Leeflang et al., 2017).
3.3.3. PERFORMANCE OF THE MODEL
The performance of the model is assessed by looking at its predictive ability. For this, the dataset is split: approximately the first 70% of the data points are used to estimate the parameters of the model, and approximately the last 30%, which are not used for estimation (the so-called holdout cases), are used to test the ability to classify clients into the right states. These holdout cases are selected by using the select option in Latent Gold. Latent Gold provides several statistics in the output for the holdout sample, including classification statistics. The classification statistics of the proposed model are then compared to those of two heterogeneous HMMs. These models allow for cross-customer heterogeneity in the initial state distribution as well as in the transition matrix. Heterogeneity is captured by using a latent class approach: the proposed model with the determined number of states is also estimated with 2 and with 3 latent classes. These models account for differences among individuals, as customers are likely to differ in their decision journeys. Additionally, they account for different levels of stickiness of the states. However, it is assumed that, given a state, customers behave in a similar manner, as the models do not account for cross-customer heterogeneity in the state-dependent distributions (Leeflang et al., 2017). It is still expected that the estimated journey states are somewhat similar over the customer base, as is assumed in most customer journey research (e.g., Lemon & Verhoef, 2016). Thus, the proposed model is expected to perform somewhat better, especially in terms of simplicity.
4. Results
4.1. Descriptives
Before getting to the estimation of the model it is a good start to explore the dataset that is used for analysis.
The dataset that is used consists of 186.327 observations including 55.646 distinct cases. Of these cases, 41.718 are used to estimate the model and 13.928 cases are used as a test set, to test if the model is doing well in classifying the cases into the states. Descriptive statistics for the data (split into train- and test set) are displayed in table 2 below.
As explained before, only clients with multiple sessions are included in the dataset. As can be seen in table 2, the minimum number of sessions for a client is 2 and the maximum number of sessions for a client is 86 in the train set and 100 in the test set. The distribution of the number of sessions per client is shown in figures 2a and 2b. For both the train set (figure 2a) and the test set (figure 2b), most customers have at most five sessions and only a few have more.
Besides the number of sessions, several other descriptives are shown in table 2. The number of purchases is 407 for the train set and 127 for the test set. Importantly, this is only a small proportion of the full dataset. However, this is not surprising as, considering the typical conversion funnel, the group that goes through to the next stage and ultimately through to purchase gets smaller with every step (Patterson, 2007). Moreover, in both sets clients viewed on average nine pages in total per session. The total number of pages viewed is split into several types of pages, some of which are included as categorical variables in the model. Therefore, it is interesting to look at the descriptive statistics for these variables as well. As is shown in table 2, for most of these variables 0 is the most frequent category in both datasets. This means that in most sessions these pages were not viewed by a client. Only for POP and PDP was this not the case: for these pages, one to five pages were viewed in most sessions, in both the train and the test set. Furthermore, the descriptives show that the average session duration is 6.09 minutes for the train set and 6.05 minutes for the test set. Moreover, in both datasets most sessions are started through organic search, which means that most clients came to the website by clicking on the generic search results. Lastly, the first page visited in most sessions is the POP, for both the train and the test set. The POP is also the most frequent last page of a session in both datasets.
Figure 3a: Distribution of the number of sessions per customer for the train set
Table 2: Descriptive statistics
Statistic | Train set | Test set
# of observations | 139.780 | 46.547
# of cases | 41.718 | 13.928
Min. # of sessions per case | 2 | 2
Max. # of sessions per case | 86 | 100
# of purchases | 407 | 127
Avg. # of page views | 9 | 9
Avg. session duration (in minutes) | 6.09 | 6.05
# of Home pages viewed | |
- 0 | 94986 | 31605
- 1 – 2 | 41963 | 14015
- 3 – 4 | 2441 | 800
- 5 – 73 | 390 | 127
# of Category pages viewed | |
- 0 | 129580 | 43135
- 1 – 16 | 10200 | 3412
# of times Basket viewed | |
- 0 | 134862 | 45067
- 1 – 2 | 4794 | 1430
- 3 – 35 | 124 | 50
# of PCP pages viewed | |
- 0 | 104154 | 34811
- 1 | 21921 | 7206
- 2 – 4 | 12264 | 4050
- 5 – 8 | 1201 | 424
- 9 – 44 | 240 | 56
# of POP pages viewed | |
- 0 | 36428 | 11987
- 1 – 5 | 70665 | 23608
- 6 – 10 | 18811 | 6328
- 11 – 20 | 10193 | 3400
- 21 – 40 | 3110 | 1018
- 41 – 193 | 573 | 206
# of PDP pages viewed | |
- 0 | 50248 | 17030
- 1 – 5 | 70221 | 23190
- 6 – 10 | 11911 | 3885
- 11 – 20 | 5493 | 1836
- 21 – 218 | 1907 | 606
# of Inspiration pages viewed | |
- 0 | 129876 | 43142
- 1 – 2 | 7096 | 2411
- 3 – 5 | 1935 | 676
- 6 – 10 | 640 | 230
- 11 – 147 | 233 | 88
Most occurring channel | Organic Search | Organic Search
Most occurring Landing Page | POP | POP
4.2. Results HMM
This section will run the proposed model on the dataset and set out the results. The variables that are used for estimation are the ones described before, in the section variable description. Moreover, the syntax code used to run the model in Latent Gold is included in appendix A.
4.2.1. NUMBER OF STATES
The first step in estimating an HMM is to determine the number of states in the model. Here, the number of states is determined by comparing several model selection criteria: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) and the Consistent Akaike Information Criterion (CAIC). These criteria are all based on the log-likelihood and penalize for the number of parameters, where the penalty is larger for the BIC and CAIC. Thus, choosing the number of states based on these criteria will lead to a preference for the more parsimonious model (Leeflang, Wieringa, Bijmolt, & Pauwels, 2015).
The model with the lowest value for the AIC, BIC and CAIC is considered to be the better model. As is illustrated in table 3 below, the proposed model is estimated with two till seven states. The bold values indicate which model performs best for the different measures. As the model with 7 states outperforms the other models on all the three different measures the model with 7 states is considered to be the best fitted model.
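The comparison described above can be sketched in a few lines of Python. The formulas are the standard textbook definitions (AIC = -2LL + 2k, BIC = -2LL + k ln N, CAIC = -2LL + k (ln N + 1)). The log-likelihoods and parameter counts are made-up placeholders, not the values of table 3, the function name is hypothetical, and taking N as the number of cases in the train set is an assumption.

```python
import math

def information_criteria(log_lik, n_params, n_cases):
    """Return (AIC, BIC, CAIC) for one fitted model; lower is better."""
    aic = -2.0 * log_lik + 2.0 * n_params
    bic = -2.0 * log_lik + n_params * math.log(n_cases)
    caic = -2.0 * log_lik + n_params * (math.log(n_cases) + 1.0)
    return aic, bic, caic

# Hypothetical (log-likelihood, number of parameters) per number of states.
candidates = {2: (-520000.0, 120), 3: (-510000.0, 190), 4: (-505000.0, 260)}
n_cases = 41718  # assumption: number of cases in the train set (table 2)

scores = {k: information_criteria(ll, p, n_cases)
          for k, (ll, p) in candidates.items()}
# Pick the number of states with the lowest BIC (index 1 of the tuple).
best_by_bic = min(scores, key=lambda k: scores[k][1])
print(best_by_bic)
```

Because the CAIC always exceeds the BIC by exactly the number of parameters, the two criteria tend to agree unless competing models are very close.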
4.2.2. INTERPRETATION OF THE STATES
To give meaning to the seven states that are estimated, the posterior probability means are interpreted. They are shown per indicator variable in tables 4a to 4n. The bold values indicate the highest value per category and the italic values indicate the lowest value per category. The interpretation is carried out per state in the following subsections.
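The reading rule described above (highest and lowest value per category) can be illustrated with a small Python sketch. The values are copied from table 4b (Purchase); the function name is hypothetical and the snippet only mimics the bold and italic markings, it is not part of the Latent Gold output.

```python
# Posterior probability means copied from table 4b (Purchase), states 1..7.
table_4b = {
    "No":  [0.0879, 0.1293, 0.0928, 0.1712, 0.2578, 0.1194, 0.1415],
    "Yes": [0.3817, 0.2003, 0.1740, 0.0423, 0.0201, 0.0029, 0.1787],
}

def extreme_states(row):
    """Return the 1-based states with the highest and lowest value in a row."""
    highest = max(range(len(row)), key=row.__getitem__) + 1
    lowest = min(range(len(row)), key=row.__getitem__) + 1
    return highest, lowest

for category, row in table_4b.items():
    highest, lowest = extreme_states(row)
    print(f"{category}: highest in state {highest}, lowest in state {lowest}")
```

For the category "Yes" this marks state one as highest and state six as lowest, which matches the interpretation of these two states in the subsections below.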
Table 3: Determining the number of states
Number of states Number of parameters AIC BIC CAIC
Table 4a: Posterior probability means Channel
State 1 2 3 4 5 6 7
Other 0.0351 0.2831 0.1171 0.0606 0.1136 0.2522 0.1382
Affiliates 0.1030 0.3190 0.0868 0.2826 0.1374 0.0093 0.0619
Branded Paid Search 0.1527 0.0810 0.0828 0.3082 0.1444 0.1071 0.1239
Direct 0.1580 0.1060 0.1948 0.3307 0.1112 0.0322 0.0671
Display 0.0247 0.1923 0.0873 0.0772 0.3019 0.1703 0.1463
Email 0.0200 0.0695 0.1005 0.0414 0.1942 0.4167 0.1576
Generic Paid Search 0.0343 0.2069 0.0299 0.0626 0.4430 0.0690 0.1515
Organic Search 0.0794 0.1042 0.0839 0.1215 0.2965 0.1295 0.1849
Paid Search 0.0276 0.0717 0.0596 0.0865 0.3527 0.2698 0.1321
Paid Social 0.0252 0.1170 0.0709 0.0777 0.3774 0.1454 0.1864
Referral 0.1024 0.1165 0.2972 0.1752 0.1286 0.0834 0.0967
Social 0.0578 0.2962 0.1115 0.1515 0.1399 0.1303 0.1127
Table 4b: Posterior probability means Purchase
State 1 2 3 4 5 6 7
No 0.0879 0.1293 0.0928 0.1712 0.2578 0.1194 0.1415
Yes 0.3817 0.2003 0.1740 0.0423 0.0201 0.0029 0.1787
Table 4c: Posterior probability means Landing Page
Table 4d: Posterior probability means Exit Page
State 1 2 3 4 5 6 7
Account 0.0258 0.0149 0.9206 0.0170 0.0122 0.0043 0.0052
Advice 0.0709 0.0374 0.4286 0.0718 0.1763 0.0949 0.1200
BP 0.0628 0.1234 0.2190 0.1561 0.1285 0.1962 0.1139
Basket 0.1757 0.2828 0.1303 0.1350 0.0614 0.0365 0.1783
Category 0.0671 0.0008 0.0758 0.0919 0.0000 0.6300 0.1344
Check Out 0.1939 0.3248 0.1172 0.1511 0.0390 0.0324 0.1416
Contact 0.0265 0.0232 0.7966 0.1010 0.0357 0.0000 0.0170
Home Page 0.1797 0.0105 0.3421 0.4657 0.0000 0.0020 0.0000
Insp 0.0729 0.0351 0.4952 0.0472 0.0000 0.2377 0.1118
Other 0.0952 0.0819 0.4353 0.0786 0.1139 0.0987 0.0991
PCP 0.0531 0.0004 0.0000 0.1617 0.0000 0.6718 0.1130
POP 0.0952 0.0000 0.0000 0.1812 0.4320 0.1117 0.1798
PDP 0.0859 0.3508 0.0000 0.1767 0.1700 0.0826 0.1341
Search 0.0546 0.1969 0.2260 0.1850 0.1119 0.0884 0.1371
Service 0.0381 0.0469 0.7958 0.0353 0.0468 0.0136 0.0236
Store Page 0.0498 0.1058 0.3381 0.0796 0.2190 0.0791 0.1286
Table 4e: Posterior probability means Avg. time on page (min)
State 1 2 3 4 5 6 7
0 – 2 0.0919 0.1216 0.0888 0.1731 0.2580 0.1213 0.1453
2 – 4 0.0588 0.2256 0.1334 0.1401 0.2271 0.0912 0.1239
4 – 6 0.0163 0.2600 0.1858 0.1579 0.2771 0.0756 0.0273
6 – 8 0.0031 0.3326 0.1953 0.1119 0.2641 0.0891 0.0039
8 – 10 0.0000 0.3181 0.2512 0.0792 0.2982 0.0533 0.0000
Table 4f: Posterior probability means Total Home Pages
State 1 2 3 4 5 6 7
0 0.0000 0.1871 0.0637 0.0000 0.3765 0.1728 0.2000
1 – 2 0.2623 0.0077 0.1482 0.5538 0.0042 0.0054 0.0185
3 – 4 0.4850 0.0025 0.2425 0.2576 0.0017 0.0021 0.0086
5 – 73 0.5586 0.0000 0.3872 0.0478 0.0044 0.0020 0.0000
Table 4g: Posterior probability means Total Category
State 1 2 3 4 5 6 7
0 0.0815 0.1395 0.0977 0.1768 0.2759 0.0979 0.1307
Table 4h: Posterior probability means Total Basket
State 1 2 3 4 5 6 7
0 0.0836 0.1258 0.0930 0.1729 0.2644 0.1224 0.1379
1 – 2 0.2115 0.2366 0.0957 0.1176 0.0606 0.0289 0.2489
3 – 35 0.9572 0.0000 0.0659 0.0084 0.0000 0.0000 0.0000
Table 4i: Posterior probability means Total PCP
State 1 2 3 4 5 6 7
0 0.0628 0.1715 0.1238 0.1590 0.3311 0.0322 0.1196
1 0.1317 0.0083 0.0046 0.2535 0.0580 0.3779 0.1660
2 – 4 0.1966 0.0049 0.0007 0.1406 0.0155 0.3949 0.2469
5 – 8 0.3760 0.0000 0.0000 0.0358 0.0000 0.1280 0.4603
9 – 44 0.4740 0.0000 0.0000 0.0000 0.0000 0.0225 0.5035
Table 4j: Posterior probability means Total POP
State 1 2 3 4 5 6 7
0 0.0112 0.4339 0.3511 0.1037 0.0000 0.0887 0.0113
1 – 5 0.0360 0.0323 0.0031 0.2559 0.4441 0.1751 0.0536
6 – 10 0.2218 0.0006 0.0001 0.1075 0.2426 0.0549 0.3725
11 – 20 0.3677 0.0000 0.0000 0.0000 0.0000 0.0003 0.6320
21 – 40 0.4144 0.0000 0.0000 0.0000 0.0000 0.0000 0.5856
41 – 193 0.4343 0.0000 0.0000 0.0000 0.0000 0.0000 0.5657
Table 4k: Posterior probability means Total PDP
State 1 2 3 4 5 6 7
0 0.0174 0.0000 0.2544 0.1975 0.3274 0.1725 0.0307
1 – 5 0.0760 0.2257 0.0030 0.1925 0.2714 0.1110 0.1203
6 – 10 0.2805 0.1639 0.0005 0.0371 0.0365 0.0150 0.4665
11 – 20 0.3667 0.0541 0.0006 0.0000 0.0000 0.0000 0.5786
21 – 218 0.2805 0.1639 0.0005 0.0371 0.0365 0.0150 0.4665
Table 4l: Posterior probability means Total Insp
Table 4m: Posterior probability means Total Page Views
State 1 2 3 4 5 6 7
2 – 5 0.0000 0.1992 0.1399 0.1764 0.3483 0.1362 0.0000
6 – 10 0.0000 0.0969 0.0669 0.3028 0.3239 0.1835 0.0260
11 – 15 0.2607 0.0419 0.0373 0.0900 0.0318 0.0590 0.4792
16 – 20 0.3854 0.0116 0.0228 0.0000 0.0000 0.0019 0.5783
21 – 25 0.3910 0.0015 0.0142 0.0000 0.0000 0.0000 0.5932
26 – 30 0.4313 0.0000 0.0090 0.0000 0.0000 0.0000 0.5597
31 – 35 0.4108 0.0000 0.0070 0.0000 0.0000 0.0000 0.5822
36 – 40 0.4357 0.0000 0.0025 0.0000 0.0000 0.0000 0.5618
41 – 45 0.4642 0.0000 0.0083 0.0000 0.0000 0.0000 0.5275
46 – 50 0.4741 0.0000 0.0000 0.0000 0.0000 0.0000 0.5259
51 – 281 0.4963 0.0000 0.0007 0.0000 0.0000 0.0000 0.5031
Table 4n: Posterior probability means Session Duration (min)
4.2.2.1. STATE 1
There are several important observations to note from the tables above that can help in interpreting the state. First of all, people in state one are least likely to come to the website via the channels other, display, email, organic search, paid search, paid social or social compared to the other states (table 4a). Moreover, people in state one are most likely to purchase a product and least likely not to purchase a product compared to the other states (table 4b). Additionally, people in state one are least likely to have a search page or a service page as landing page compared to the other states and have a very low probability to have a contact page as a landing page (table 4c). As illustrated in table 4d, people in state one are least likely to exit the website through a brand page, search page or store page compared to the other states. Furthermore, people in state one are least likely to spend longer [...] and have a very low probability to spend more than eight minutes on average on a page (table 4e). Moreover, people in state one are most likely to visit the home page three times or more compared to the other states and they have a very low probability to visit no home pages at all (table 4f). As shown in table 4g, people in state one are the least likely to visit no category pages at all compared to the other states. Furthermore, people in state one are the least likely not to look at their basket and the most likely to look at their basket three times or more compared to the other states (table 4h). Additionally, people in state one are almost as likely as people in state seven to visit nine or more product category pages (table 4i). Regarding the total product overview pages visited by people in state one, there is not one category that is distinct for state one compared to the other states. However, people in state one are only slightly less likely than people in state seven to visit 21 or more product overview pages (table 4j). As illustrated in table 4k, there is not one category of the variable PDP that is distinct for state one compared to the other states. This is also the case for the categories of the total inspiration pages visited (table 4l). Furthermore, regarding the total number of pages viewed, people in state one have a very low probability to visit fewer than ten pages in total. Interestingly, they are somewhat equally likely as people in state seven to visit 36 pages or more in total (table 4m). Lastly, people in state one are the least likely to spend five minutes or less on the website compared to the other states and are somewhat equally likely as people in state seven to visit the website for more than 30 minutes (table 4n).
journey. Lastly, in this state people are most likely to make a purchase, thus again indicating that this state is at the end of the customer journey.
Following this, state one is called “action” and is considered to be the last stage in the journey.
4.2.2.2. STATE 2
People in state two are least likely to come to the website via the channel branded paid search and most likely to come via the channels other, affiliates and social compared to the other states (table 4a). Moreover, people in state two are, after people in state one, the most likely to purchase a product (table 4b). Furthermore, people in state two have a very low probability to enter the website through a category page, contact page, home page, PCP or POP. They are most likely to start on the basket page, PDP or search page compared to the other states (table 4c). As illustrated in table 4d, people in state two are least likely to exit the website through an advice page compared to the other states and they have a very low probability to exit the website through a POP. They are most likely to exit the website through the basket page, check out or PDP compared to the other states. Furthermore, people in state two are most likely to spend longer than six minutes on average on a page compared to the other states (table 4e). Moreover, people in state two are least likely to visit the home page five or more times compared to the other states (table 4f). As shown in table 4g, people in state two are the least likely to visit any category pages compared to the other states. Furthermore, people in state two have a very low probability to look at their basket three times or more (table 4h) and to visit five or more product category pages (table 4i). Regarding the total product overview pages visited, people in state two are most likely to visit no product overview pages at all compared to the other states, and they have a very low probability to visit 11 or more such pages (table 4j). As illustrated in table 4k, people in state two are the least likely to visit no product detail pages at all compared to the other states. Moreover, people in state two have a very low probability to visit 11 or more inspiration pages, and the probability of belonging to state two decreases as the total number of inspiration pages visited increases (table 4l). Furthermore, regarding the total number of pages viewed, people in state two have a very low probability to visit 26 or more pages in total (table 4m). Lastly, there is not one category of the total time spent on the website that is distinct for people in state two compared to the other states (table 4n).
pages that people are most likely to enter the site through are a basket page and a PDP. Both pages are considered to be visited when someone is somewhat further into the journey. Interestingly, people in this state are also the most likely to have the checkout page as exit page. This indicates that people do not (always) finish their purchases in this state. Moreover, two of the most likely channels through which people in this state enter the website are social and affiliates. Batra & Keller (2016) found that people seek credible evidence to trust the brand’s statements, which affiliates such as comparison sites could provide, and that social has the greatest influence on the stage just before purchase.
Following this, people in this state are considered to be very close to purchasing a product and thus this state is called “evaluation” and is considered to be just before the action stage in the journey.
4.2.2.3. STATE 3
states. Moreover, people that are in state three are the least likely to visit no inspiration pages at all and are most likely to visit three or more of such pages in total compared to the other states (table 4l). Furthermore, regarding the total amount of pages viewed by people in state three, they have a very low probability to visit 46 up to and including 50 pages in total (table 4m). Lastly, people that are in state three are the least likely to spend five to ten minutes on the website compared to the other states (table 4n).
In general, it seems that people in this state are mainly visiting pages that are not product related. For example, people in this state are most likely to enter the site through an account-, advice-, brand-, contact- or service page, among others. They are also most likely to exit the website through such non-product-related pages. The interest in pages that are not product related is further emphasized by the low probabilities for the total product category pages, product overview pages and product detail pages visited. These observations show similar behavior to the knowledge-building cluster indicated by Moe (2003). Therefore, this state is called “information gathering” and is considered to be the fourth stage in the journey.
4.2.2.4. STATE 4
there is not one category of total inspiration pages visited that is distinct for state four compared to the other states (table 4l). Furthermore, regarding the total amount of pages viewed by people in state four, they have very low probabilities to visit 16 or more pages in total (table 4m). Lastly, there is not one category of the total time spent on the website that is distinct for people that are in state four compared to the other states (table 4n).
As explained before, people in this state are most likely to enter the website through branded paid search as well as directly. Branded paid search indicates that the search query contained the brand name of the company. Thus, these channels indicate that the people in this state are aware of the company. Moreover, in this state people generally do not view many pages in total, as the probability of belonging to this state decreases as the total page views increase. Furthermore, the most likely landing page and exit page is the home page, which the people in this state are most likely to visit once or twice in total. The home page is considered not to be a very informative page, emphasizing that information gathering and evaluation come in later states. These observations show similar behavior to the initial consideration set phase of Court et al. (2009). Therefore, this state is called “consideration” and is considered to be the third phase in the journey.
4.2.2.5. STATE 5
five are most likely to visit no product category pages at all and have a very low probability to visit five or more product category pages in total (table 4i). Regarding the total product overview pages visited by people in state five, they have very low probabilities to visit no product overview pages at all as well as 11 or more product overview pages in total. They are most likely to visit one up to and including five product overview pages in total compared to the other states (table 4j). As illustrated in table 4k, people that are in state five have a very low probability to visit 11 up to and including 20 product detail pages in total. They are most likely to either visit no product detail pages at all or one up to and including five of such pages in total compared to the other states. Moreover, people in state five are most likely to visit no inspiration pages at all and least likely to visit one up to and including 10 of such pages compared to the other states. They also have a very low probability to visit 11 or more inspiration pages in total (table 4l). Furthermore, regarding the total amount of pages viewed by people in state five, they are most likely to visit two up to and including ten pages in total and are least likely to visit 11 up to and including 15 pages in total compared to the other states. They also have very low probabilities to visit 16 or more pages in total (table 4m). Lastly, people that are in state five are most likely to spend five minutes or less on the website compared to the other states (table 4n).
As explained before, the people in this state are most likely to come to the website through the channels display, generic paid search, organic search and paid social. Generic paid search indicates that the search query did not contain the brand name, but contained e.g. a relevant product type. These channels indicate that this state is early in the journey and that people in this state know what they want, but do not yet know where to buy it. Moreover, the people in this state are the most likely to have a POP as landing- and exit page and are most likely not to view any category or product category pages, further emphasizing that these people are aware of their needs. Furthermore, people in this state are the most likely to view 10 pages or fewer in total and to have a session duration of five minutes or less. This indicates that people in this state are simply identifying products that they consider investigating in further states. Lastly, the people in this state are most likely not to purchase a product, indicating that these people are only at the beginning of their journey. These observations show similar behavior to the second phase of Batra & Keller (2016). Following this, state five is called “awareness” and is considered to be the second stage in the journey.
4.2.2.6. STATE 6
the other states (table 4a). Moreover, people in state six are the least likely to purchase a product compared to the other states (table 4b). Furthermore, people in state six are the most likely to start on a category page or PCP and the least likely to start on the basket page or account page compared to the other states. They also have very low probabilities to have a contact page or home page as a landing page (table 4c). As illustrated in table 4d, people in state six are most likely to exit the website via a category page or PCP and the least likely to exit the website through an account page, basket page, check out, contact page or service page compared to the other states. Furthermore, there is not one category of the average time on a page that is distinct for state six compared to the other states (table 4e). This is also the case for the total times the home page is visited (table 4f). Regarding the total category pages visited, people in state six are most likely to visit category pages compared to the other states (table 4g). For the total times the basket was viewed, people in state six are least likely to view their basket once or twice in total compared to the other states. Moreover, they have a very low probability to visit the basket three times or more in total (table 4h). People in state six are also most likely to visit one up to and including four product category pages in total and the least likely to visit no product category pages at all compared to the other states (table 4i). Regarding the total product overview pages visited by people in state six, they have very low probabilities to visit 21 or more product overview pages (table 4j). As illustrated in table 4k, people in state six have a very low probability to visit 11 up to and including 20 product detail pages in total. Moreover, people in state six are most likely to visit one or two inspiration pages compared to the other states and have a very low probability to visit 11 or more inspiration pages in total (table 4l). Furthermore, regarding the total number of pages viewed, people in state six have very low probabilities to visit 21 or more pages in total (table 4m). Lastly, people in state six are least likely to spend more than 10 minutes on the website compared to the other states (table 4n).
Keller. Following this, state six is called “desire” and is considered to be the first stage in the journey.
4.2.2.7. STATE 7
People in state seven have somewhat similar probabilities of purchasing and not purchasing a product (table 4b). Furthermore, people in state seven have very low probabilities to have a contact page or home page as a landing page (table 4c). Regarding the exit page, people in state seven have a very low probability to exit the website through the home page (table 4d). Furthermore, people in state seven have a very low probability to spend more than eight minutes on average per page (table 4e). Moreover, people in state seven also have a very low probability to visit the home page five times or more in total (table 4f). Furthermore, there is not one category of the total category pages visited that is distinct for state seven compared to the other states (table 4g). For the total times the basket was viewed, people in state seven are most likely to visit their basket once or twice compared to the other states. They also have a very low probability to visit the basket three times or more in total (table 4h). Moreover, people in state seven are most likely to visit five or more product category pages in total compared to the other states (table 4i). Regarding the total product overview pages visited by people in state seven, they are most likely to visit six or more of such pages in total compared to the other states (table 4j). This is also the case for the total product detail pages visited by people in state seven (table 4k). Moreover, people in state seven are most likely to visit six or more inspiration pages in total compared to the other states (table 4l). Furthermore, regarding the total number of pages viewed, people in state seven have a very low probability to visit five or fewer pages in total. They are most likely to visit 11 or more pages in total compared to the other states (table 4m). Lastly, people in state seven are most likely to spend more than five minutes on the website compared to the other states (table 4n).
the most pages in total and for almost all types of pages. Following this, the observed behavior is considered to be very similar to cluster four of Moe (2003). Therefore, state seven is called “search” and is considered to be the fifth stage in the customer journey.
Thus, following from these subsections it is expected that the general customer journey will start at the desire stage, followed by the awareness-, consideration-, information gathering-, search- and evaluation stages. The journey ends with the action stage.
4.2.3. ESTIMATION RESULTS
As it is now clear which state corresponds to which stage in the customer journey, the initial state distribution and the transition probability matrix can be interpreted. The probabilities are shown in table 5 below, together with the standard errors. Moreover, the regression parameters are included in appendix B. These show that all indicators, except the intercept of total page views (p = 0.088), are significant at the 5% level. Thus, all indicators significantly discriminate between the different states.
As illustrated in table 5 below, people are most likely to start in the awareness stage (27.04%), followed by the consideration stage (16.34%), the search stage (15.08%), the desire stage (12.78%), the evaluation stage (11.43%) and the action stage (9.17%). People are the least likely to start in the information gathering stage (8.17%).
in this stage – to switch to the search stage (14.54%) and the least likely to switch to the information gathering stage (4.65%). For the people that were in the desire stage, they are most likely – after staying in this stage – to switch to the awareness stage (22.43%) and the least likely to switch to the action stage (5.05%). Lastly, people that were in the search stage are most likely to switch to the awareness stage (27.17%) and the least likely to switch to the information gathering stage (4.44%).
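How an initial state distribution and a transition probability matrix work together can be sketched as follows. The initial distribution is copied from table 5; the transition matrix is a hypothetical "sticky" placeholder (a fixed probability of staying, the remainder spread evenly), not the estimated matrix, and the function names are illustrative only.

```python
# Initial state distribution from table 5 (order as in the table).
stages = ["Action", "Evaluation", "Information gathering", "Consideration",
          "Awareness", "Desire", "Search"]
initial = [0.0917, 0.1143, 0.0817, 0.1634, 0.2704, 0.1278, 0.1508]

def sticky_matrix(n, stay):
    """Hypothetical transition matrix: probability `stay` on the diagonal,
    the remainder spread evenly over the other states."""
    move = (1.0 - stay) / (n - 1)
    return [[stay if i == j else move for j in range(n)] for i in range(n)]

def step(dist, matrix):
    """One-step update: next[j] = sum_i dist[i] * matrix[i][j]."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

transition = sticky_matrix(len(stages), stay=0.6)  # placeholder, not table 5
next_dist = step(initial, transition)
for stage, p in zip(stages, next_dist):
    print(f"{stage}: {p:.4f}")
```

Replacing the placeholder with the estimated transition probabilities would give the expected distribution of customers over the stages in their next session.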
4.2.4. MODEL FIT
The bivariate residuals of the indicators are used to determine whether the model fits well. Table 6 displays the bivariate residuals of the model. These residuals indicate whether the model is able to capture the time trend and the first- and second-order autocorrelations for the dependent variables concerned (Vermunt & Magidson, 2016). In table 6 the value for time corresponds to the time trend, lag1 to the first-order autocorrelation and lag2 to the second-order autocorrelation. The model captures these well if the value is lower than the chi-squared critical value at p = 0.05. Therefore, the degrees of freedom on which this value is based, as well as the critical value itself, are also shown in table 6. The bold values in the [...]
Table 5: Estimation results of the proposed model (probabilities and standard errors)
Initial state distribution
State [= 0] Probability (standard error)
Action 0.0917 (0.0015)
Evaluation 0.1143 (0.0016)
Information gathering 0.0817 (0.0014)
Consideration 0.1634 (0.0019)
Awareness 0.2704 (0.0023)
Desire 0.1278 (0.0018)
Search 0.1508 (0.0020)
Transition probability matrix
State [-1]
Action Evaluation Information gathering
Table 6: Bivariate Residuals
Indicator df Critical value Time Lag1 Lag2
Channel 11 19.675 4.8077 1102.3341 805.0145
Purchase 1 3.841 3.1555 83.2934 1.4297
Landing page 15 24.996 1.6272 33.9696 37.2561
Exit page 15 24.996 1.4215 31.6304 28.1456
Avg. time on page (min) 4 9.488 0.5750 12.2029 10.3001
Total Home page 3 7.815 4.6944 45.3777 208.5880
Total Category 1 3.841 0.9460 442.2111 417.5397
Total Basket 2 5.991 6.7841 874.6393 664.1613
Total PCP 4 9.488 0.7726 53.6031 71.1867
Total POP 5 11.070 1.2207 29.0788 87.2817
Total PDP 4 9.488 2.2253 68.7308 103.3287
Total Insp 4 9.488 0.6307 84.3310 77.9202
Total page views 10 18.307 1.5423 6.4406 12.3857
Session duration (min) 10 18.307 1.3696 4.9027 8.0810
Considering the values of the bivariate residuals, the model captures the observed time trend well for all indicators except the total basket variable. It also explains the first- and second-order autocorrelations well for the total page views and session duration variables. For the other variables these autocorrelations remain unexplained, with the exception of the second-order autocorrelation of the purchase variable (1.4297, below the critical value of 3.841).
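The decision rule used here (a residual is captured when it lies below the critical value) can be checked mechanically. The sketch below copies the critical values and residuals from table 6; the dictionary layout and the function name are illustrative assumptions.

```python
# Table 6 values: indicator -> (critical value, time, lag1, lag2).
table_6 = {
    "Channel": (19.675, 4.8077, 1102.3341, 805.0145),
    "Purchase": (3.841, 3.1555, 83.2934, 1.4297),
    "Landing page": (24.996, 1.6272, 33.9696, 37.2561),
    "Exit page": (24.996, 1.4215, 31.6304, 28.1456),
    "Avg. time on page": (9.488, 0.5750, 12.2029, 10.3001),
    "Total Home page": (7.815, 4.6944, 45.3777, 208.5880),
    "Total Category": (3.841, 0.9460, 442.2111, 417.5397),
    "Total Basket": (5.991, 6.7841, 874.6393, 664.1613),
    "Total PCP": (9.488, 0.7726, 53.6031, 71.1867),
    "Total POP": (11.070, 1.2207, 29.0788, 87.2817),
    "Total PDP": (9.488, 2.2253, 68.7308, 103.3287),
    "Total Insp": (9.488, 0.6307, 84.3310, 77.9202),
    "Total page views": (18.307, 1.5423, 6.4406, 12.3857),
    "Session duration": (18.307, 1.3696, 4.9027, 8.0810),
}

def unexplained(table, column):
    """Indicators whose residual in `column` (1=time, 2=lag1, 3=lag2)
    is not below the chi-squared critical value."""
    return [name for name, row in table.items() if row[column] >= row[0]]

print("time:", unexplained(table_6, 1))
print("lag1:", unexplained(table_6, 2))
print("lag2:", unexplained(table_6, 3))
```

Run on the table 6 values, the time trend is flagged only for the total basket variable, while most first- and second-order autocorrelations exceed their critical values.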
4.3. Classification ability
As mentioned in the model formulation section, the performance of the proposed model is compared to the two benchmark models explained before. These benchmark models are the two heterogeneous Hidden Markov Models with 2 and 3 classes. The syntax code used to run these two models in Latent Gold is included in appendix C.
The comparison is done on three different classification statistics, namely the entropy R², the Integrated Classification Likelihood (ICL-BIC) and the Approximate Weight of Evidence (AWE). The entropy R² is a pseudo R² which tells how well the model predicts the cases into the states based on the included variables. As with most R² measures, the closer this value is [...]
Table 7 below shows the values for these classification statistics for the proposed model and the two benchmark models. The bold values indicate which model performs best for the different measures.
Table 7: Classification statistics
Model Entropy R² ICL-BIC AWE
Proposed Model 0.9409 1017155.54 1025570.99
Heterogeneous HMM with 2 classes 0.3661 1027662.74 1036692.74
Heterogeneous HMM with 3 classes 0.3083 1037040.72 1046685.25
As illustrated in table 7 above, the proposed model performs better than both heterogeneous models with regard to classifying the cases into the latent states, as it outperforms them on all three classification measures. This means that the assumption that the estimated journey is somewhat similar over the customer base is justified.
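To make the entropy R² more concrete, the sketch below implements one common definition: the proportional reduction in entropy when moving from the overall state proportions to the posterior state probabilities per observation. Whether Latent Gold computes the statistic in exactly this way is an assumption, and the posterior probabilities used are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy -sum p*log(p), treating 0*log(0) as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_r2(posteriors):
    """Proportional reduction in entropy when moving from the overall state
    proportions to the per-observation posterior state probabilities."""
    n = len(posteriors)
    n_states = len(posteriors[0])
    proportions = [sum(row[k] for row in posteriors) / n
                   for k in range(n_states)]
    total_error = entropy(proportions)
    conditional_error = sum(entropy(row) for row in posteriors) / n
    return 1.0 - conditional_error / total_error

# Made-up posterior probabilities for four observations and two states.
sharp = [[0.99, 0.01], [0.02, 0.98], [0.97, 0.03], [0.01, 0.99]]
fuzzy = [[0.60, 0.40], [0.45, 0.55], [0.55, 0.45], [0.40, 0.60]]
print(round(entropy_r2(sharp), 3), round(entropy_r2(fuzzy), 3))
```

Posterior probabilities close to 0 or 1 give a value near one, whereas posteriors that barely differ from the overall proportions give a value near zero.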
5. Discussion
In this section the findings of this research are discussed and compared with the theory. In particular, the behaviors that deviate from what is expected in certain stages of the customer journey are explained. Additional observations are discussed as well.
spending more time on a page on average, giving them the possibility to process the information on a page more deeply, the probability that they are in the search stage is lower compared to the other stages. Lastly, the total number of pages viewed and the session duration do not seem to relate to purchase probability here, in contrast to what was found by Mallapragada et al. (2016), since the probability of belonging to the search stage is somewhat similar whether or not a product was bought. The evaluation stage is characterized by people who take time to process the information on the pages visited, since people in this stage are most likely to spend the most time on a page on average (Moe, 2003). This stage differs from the information gathering stage in that people are more likely to visit product detail pages. This indicates that people in this stage are evaluating particular products instead of gathering general information. Moreover, the people in this stage are most likely to enter the website through the channels social and affiliates. This indicates that these people seek credible evidence to trust the claims found (Batra & Keller, 2016). Lastly, the action stage is characterized by people who are the most likely to buy a product. Interestingly, for this stage it is indeed the case that higher total page views and a longer session duration seem to positively influence the purchase probability, as found by Mallapragada et al. (2016). Moreover, in this stage people are most likely to visit their basket many times (three times or more). This could indicate that people in this stage use their basket as a sort of wish list: they have likely stored products that they consider buying in the basket in earlier stages of the journey, and in this final stage they come back to it several times to reach their final purchase decision. The use of the basket as a substitute for a wish list was illustrated by Close & Kukar-Kinney (2010).
of other stages to move to the action stage. However, as is illustrated in a typical conversion funnel, only a small group of people moves all the way to conversion (Patterson, 2007). Thus, this small probability could be a consequence of the small group of people going through to conversion in general rather than an indication that evaluation would be earlier in the journey. The main takeaway from the transition probability matrix is that the customer journey does not seem to be linear, since people do not always have the highest probability of switching to the next stage as expected from the order of the stages. Non-linearity of the customer journey has already been recognized by e.g. Batra & Keller (2016), Court et al. (2009) and Edelman & Singer (2015). Moreover, several stages appear rather sticky, which illustrates that people are more likely to stay in these stages than to move back to earlier stages or on to later stages in the journey.
6. Conclusion & Recommendations
The Hidden Markov Model proposed in this paper tries to determine the stage customers are in, in their decision journey. This model is estimated using site-centric data. The model has estimated a customer journey that starts with a desire stage. People in the desire stage are found to have a certain unmet need or want but are not sure which product(s) will satisfy it. The desire stage is followed by an awareness stage. People in the awareness stage are found to know the type of product they would like to buy but do not know where to buy it. The awareness stage is then followed by a consideration stage. People in the consideration stage are considering the company in their purchase process. This stage is followed by an information gathering stage. People in this stage are found to mainly visit pages that are not product related, such as brand- or contact pages. The next stage is the search stage. People in this stage are found to put great effort into the website visit. The search stage is followed by an evaluation stage. In this stage people are found to take the time to process the information on the pages visited, which are mainly product-related pages. The last stage is the action stage. In this stage people are found to be the most likely to buy a product. The journey as estimated by the proposed model is found to be non-linear, and it seems that people do not always go through all stages of the customer journey. Moreover, not all customers start their journey at the same stage. Additionally, the evaluation-, information gathering-, consideration-, awareness- and desire stages all appear to be rather sticky.
the heterogeneous benchmark models as they outperform the benchmark models on all classification measures. This indicates that the estimated customer journey is somewhat similar over the customer base.
6.1. Managerial implications