
Faculty of Economics and Business

Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis; for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)
(c) Introduction
(d) Theoretical background
(e) Model
(f) Data
(g) Empirical Analysis
(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the headings of the sections. You have a free choice of how to list your references, but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and the year of publication for the first reference, and use the first name with et al. and the year of publication for subsequent references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number
(d) Date of submission of the final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics

Master’s Thesis

Measuring Conversion Attribution

A higher-order Markov and Mixture Transition Distribution model approach

Maxime van Leeuwen

Student number: 10354360

Date of final version: January 14, 2018
Master’s programme: Econometrics

Specialisation: Free Track
Supervisor: Dr. K. Pak
Second reader: E. Aristodemou


Abstract

Advertisers aim to encourage the customer to move through the marketing funnel and ultimately to convert. In such journeys the customer is usually exposed to multiple channels, which gives rise to the attribution problem: how to assign conversion credit to those different channels. Anderl et al. (2016a) introduce a novel attribution framework reflecting the sequential nature of customer paths as first- and higher-order Markov walks. This research aims to extend the work of Anderl et al. by increasing the maximum feasible order of the higher-order Markov model using stochastic simulation. In addition, a Mixture Transition Distribution model is introduced to estimate conversion attribution. Both models are applied to a real-life data set, where results similar to those of Anderl et al. (2016a) are found.


Statement of Originality

This document is written by Maxime van Leeuwen who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Acknowledgments

I would like to thank Dr. Kevin Pak for supervising me throughout this thesis project, challenging me and providing me with his insights whenever needed. I would also like to thank Wolter Tjeenk Willink and Tuncay Oner for giving me the opportunity to write my thesis at Traffic Builders and for providing guidance during the thesis project.

Furthermore, I am grateful for the financial and emotional support my parents Philip and Marja and my sister Marlissa gave me throughout my whole life and especially during my education.


Contents

List of Figures

List of Tables

1 Introduction

2 Literature Review
  2.1 Previous Conversion Attribution models
  2.2 Previous Mixture Transition Distribution models

3 Methodology and Techniques
  3.1 Last Touch Attribution
  3.2 Logistic regression
  3.3 Baseline Markov model
  3.4 Removal effect
  3.5 Higher-order model
    3.5.1 Stochastic simulation
  3.6 Mixture Transition Distribution model
  3.7 Estimating the MTD model
    3.7.1 Maximum Likelihood Estimation
    3.7.2 Choosing the initial values
    3.7.3 Computing Conversion Attribution with MTD

4 Data

5 Results and Analysis
  5.1 Model selection
    5.1.1 Predictive accuracy
    5.1.2 Robustness
    5.1.3 Calculation times
  5.2 Attribution results

6 Conclusion and Discussion

Bibliography


List of Figures

3.1 Example Markov graph
3.2 Example Markov graph 2nd order
4.1 Unsuccessful customer journeys per month
4.2 Successful customer journeys per month
4.3 KPI Conversion rate
4.4 Touch-points per channel
5.1 CA per channel for different orders MC


List of Tables

3.1 Removal Effects for Markov graph in Figure 3.1
3.2 Removal Effects for Markov graph in Figure 3.2
3.3 Removal Effects with additional journey
3.4 Number of parameters to be estimated
3.5 MTD Removal Effects
4.1 Data set preview
4.2 Preview transformation to customer journey
4.3 Data overview
4.4 Online marketing channels
5.1 Predictive Accuracy
5.2 Standard Deviation Channel Effects
5.3 Calculation Time
5.4 Attribution Results
5.5 Results Logit Model 1
5.6 Results Logit Model 2
5.7 Removal effects second order Markov model


Chapter 1

Introduction

“Half of the money I spend on advertising is wasted; the trouble is I don’t know which half.” This famous remark by John Wanamaker perfectly describes the problem firms face in their budget allocation decisions. Over the past few years, online marketing activity has increased significantly. Advertisers now use not only offline channels like TV ads, newspapers and radio, but also online channels such as display, search and social media advertising to reach potential customers. Channels can be differentiated into firm-initiated and customer-initiated ones, meaning the interaction is a result of the firm’s advertising effort or of the customer’s own initiative, respectively. The channels influence the movement of the customer through the different stages of the marketing funnel, generally referred to as the awareness, consideration and engagement stages. For instance, in the first stage the customer knows what kind of product he or she is looking for, but not from which brand, so for advertisers it is key to build and increase brand awareness in order to move the customer to the consideration phase. Hence, advertisers classify customers according to the stage they are in and respond accordingly.

Since customers use multiple channels in their decision-making process, called the customer journey, it is difficult to determine the contribution of every channel to a purchase, information request or any other form of conversion. In addition, previously visited channels can influence which channel will be used next. If a customer uses the same channel as before, we call this a carryover effect; if a customer uses a different channel than the previous one, this is referred to as a spillover effect. The process of assigning conversion credit to multiple channels is called the attribution problem and remains challenging for firms. This gives rise to the question of which method is best at attributing conversion credit to the different online channels and how such a method differs from the industry standard.

The default conversion attribution methodology used in the industry is generally referred to as last-touch attribution (LTA), a rule-based methodology that assigns full conversion credit to the channel that last presented an advertisement to a converting user (Dalessandro et al., 2012). This method assumes that all previous clicks in the path are irrelevant to the conversion. Another commonly used method is first-touch attribution (FTA), which corresponds to the idea that the customer would have arrived at the conversion irrespective of the channels following the first channel used. There are several other techniques to assign credit to the channels, such as exponential and linear attribution methods. However, all these methods assign credit to the various channels through fixed ratios and analyze channel performance in isolation. A well-known alternative in conversion attribution is the logistic regression. It predicts a binary outcome of a click path based on prior exposures, where the outcome represents a conversion or non-conversion. Although it is a relatively simple model, the regression can be difficult to interpret in terms of the actual channel effects, since each coefficient represents the rate of change in the log-odds as the explanatory variable changes.

Chierichetti et al. (2012) address the question of whether customers surf the internet according to Markovian behavior. Markovian behavior means that the probability of visiting the next page depends only on the current website and not on how the user arrived there, which is also referred to as a memoryless process. They find significant improvements in predictive accuracy when increasing the order of the Markov chain, which means the online behavior of customers is in fact not Markovian. Consequently, the decision of a customer to convert usually depends on more than one touch-point, that is, more than one interaction with the firm’s brand or product. Anderl et al. (2016a) therefore propose using a kth-order Markov chain, also referred to as a higher-order Markov model, which allows more memory and spillover effects to be included, since the probability of visiting the next channel no longer depends only on the last visited channel but on the previous k visited channels. We use this approach throughout this research.

Although including more memory is desirable, it can also lead to computational intractability of the model. To overcome this limitation, we use stochastic simulation when modeling the higher-order Markov model, and in addition we examine the applicability of a Mixture Transition Distribution (MTD) model in conversion attribution. This model was introduced by Raftery (1985) to approximate higher-order Markov chains with far fewer parameters than the fully parameterized model (Berchtold and Raftery, 2002). The purpose of this thesis is to develop a method that accurately assigns conversion credit to the various advertisement channels and that is able to include as much information from the customer journey as desired. We compare our model to the heuristic last-click model, because this is still the standard method in the industry. Besides that, we compare it to a logit model, since logit models are well suited for calibrating the predictive capabilities of other frameworks (Anderl et al., 2016a).

The remainder of this research is organized as follows. Chapter 2 contains a literature review in which previous research on Conversion Attribution models and Mixture Transition Distribution models is discussed. Chapter 3 gives an extensive description of the higher-order Markov model, together with the removal effect and the MTD model. An outline of the data is given in Chapter 4. In Chapter 5 the predictive performance and robustness of the models are discussed, followed by the attribution results. Finally, Chapter 6 concludes.


Chapter 2

Literature Review

Online attribution modeling has been gaining ground over the past few years, mainly due to the increased availability of online behavior data. Even so, it remains a challenge for attribution analysis that the truth is a quantity that may never be known (Dalessandro et al., 2012). In this chapter we discuss several previously proposed Conversion Attribution and Mixture Transition Distribution methods, covering their general results as well as their limitations.

2.1 Previous Conversion Attribution models

Until recently, user-level data on customer behavior across websites was scarce; therefore some previous studies, like that of Naik and Raman (2003), use aggregated data. They present a dynamic sales response model based on the Kalman filtering methodology to investigate the impact of synergy in multimedia environments. Their key objective is to understand how to allocate marketing investments across channels. Kireyev et al. (2016) likewise use aggregate rather than user-level data, which prevents them from incorporating consumer behavior in the model. To estimate dynamic interactions they use a vector error correction (VEC) model; however, they only investigate the interaction between display and search and do not consider other channels. They find that in an evolving business scenario, advertising becomes more effective over time (Kireyev et al., 2016). Also, the strength of dynamic effects depends on the novelty of the product and may carry over from past advertising investments, which means these past investments should be taken into account when allocating budget.

Montgomery et al. (2004) use a new approach to obtaining data by monitoring online user behavior via a previously installed program, with the permission of the users. They show how path information can be categorized and how their model can be applied to predict purchase conversion. To estimate the transition probabilities between pages, a first-order Markov model is used. They extend the research with a dynamic multinomial probit model to incorporate covariates that possibly explain navigation choices, where a vector autoregressive (VAR) component is included to capture the dynamics in these choices. Their framework falls within a hierarchical Bayesian model to allow for heterogeneity across customers. Finally, they incorporate a mixture process to reflect the possibility that browsing behavior can suddenly change. Hence, they incorporate two dynamic elements in their model for browsing behavior. By comparing models which include memory, such as the multinomial probit, Markov and VAR models, with models which do not, like independent, intercept-only and latent-class models, they find that including memory is crucial for accurately predicting a path. Xu et al. (2014) also account for customer heterogeneity by casting their model in the Bayesian framework. To compute the conversion probability they develop a mutually exciting point process, which considers advertisement clicks and purchases as dependent random events in continuous time (Xu et al., 2014). They find that display advertisements stimulate subsequent visits through other channels rather than resulting in direct conversions; as a consequence, their conversion effect is underestimated by other methods such as last-click.

Not only the effect of advertising is an interesting field for research, but also the influence of the stage of the funnel. De Haan et al. (2016), Wiesel et al. (2011) and Ghose and Todri (2015) introduce models to capture these effects. Wiesel et al. (2011) and de Haan et al. (2016) combine online and offline advertising forms at an aggregated data level. The work of de Haan et al. (2016) differs from Wiesel et al. (2011) in that they include more online channels and control for more offline advertising channels. Also, Wiesel et al. (2011) use a vector autoregressive (VAR) model, while de Haan et al. (2016) examine the long-term effectiveness of the selected channels using a structural vector autoregressive (SVAR) model and restricted impulse responses. Not only the effectiveness of the advertising forms is an objective, but also how long the effect lasts and where in the funnel it occurs. They find that content-integrated advertising is the most effective form, followed by content-separated and firm-initiated advertising (de Haan et al., 2016). Adding a new feature in terms of data availability, Ghose and Todri (2015) introduce information on the viewability of impressions and the duration of exposure to a display advertisement. Using a difference-in-differences method with corresponding matching, as well as instrumental variable techniques to control for observable and unobservable confounders, they investigate the effectiveness of display advertisements. They find that exposure to display advertisements can increase the probability of conversion, and that the earlier in the funnel the consumer is exposed, the higher the effect of the display advertisement.

Similar results are found by Abhishek et al. (2012), who use a dynamic Hidden Markov Model not only to model the consumer path to purchase but also to solve the attribution problem. They present a model that analyzes the effect of advertisement exposures on consumers, both prior to ad exposure and in terms of long-term future impact. Nevertheless, just like Kireyev et al. (2016), they only focus on the interplay of display and search channels, while the actual number of online marketing channels in use is generally much greater than that (Anderl et al., 2016b). Rather than using a Hidden Markov Model, Li and Kannan (2014) use a first-order Markov process, modeling the customer visits in a static setting. The authors propose a conceptual framework and a three-level, nested measurement model of the process through the marketing funnel, which accounts for carryover and spillover effects at both the visit and purchase stages (Kannan et al., 2016). One of their findings is that the stronger the brand, the lower the incremental effect of the paid search channel on conversion, which contrasts with what last-click attribution would suggest. The conversion credits obtained through the Shapley value also show significantly different estimates.

The Shapley value approach is also used by Dalessandro et al. (2012), who present a causally motivated methodology for conversion attribution with the aim of bringing more standardization to the measurement of online advertising campaigns. They introduce three properties an attribution model should fulfill: fairness, being data-driven, and interpretability. Their fully causal attribution model leads to estimation inconveniences, causing them to propose an approximation model that uses the Shapley value to estimate the touch-point attribution. However, the approximated attribution methodology favors channels that systematically appear later in the customer journey. Shao and Li (2011) propose to aggregate the paths more than Dalessandro et al. (2012) to prevent scarcity of the observations. They introduce a bivariate metric, where one component measures the variability of the estimate and the other measures the accuracy of classifying the positive and negative users. They use two models to assign credit to the different advertising channels. The first is a bagged logistic regression, where the bagging leads to a relaxation of the order in which the channels occur; only whether or not they appear in the click-path matters. The second is a simple probabilistic approach, which is somewhat less accurate but more intuitive. Only the first- and second-order conditional probabilities are applied, because the number of observations of third- and higher-order interactions drops significantly. Both models generate consistent general conclusions, which allows for cross-validation. Although their multi-touch attribution model shows results similar to last-click for search, email and social, as previous studies have also found, the last-click model undervalues display advertisements, since these impressions usually occur further away from the conversion.

Both Dalessandro et al. (2012) and Shao and Li (2011) only attribute credit to an ad when it directly increases the conversion probability (Abhishek et al., 2012). In contrast, Anderl et al. (2016a) introduce an approach in which first- and higher-order Markov models are explored with the aim of incorporating more of the browsing history of a user. They include carryover and spillover effects in their Markovian graph-based model to enable insights into the interplay of multiple channels. They use the removal effect to determine the share of conversion probability per channel and apply their model to four different kinds of industries. Nevertheless, as the order of the Markov model increases, the number of estimated parameters increases exponentially and the model becomes computationally intractable, which leads them to go no further than a fourth-order model. However, customer journeys can be much longer than four touch-points.

As mentioned above, the previous studies all have their limitations. Naik and Raman (2003), Kireyev et al. (2016), de Haan et al. (2016) and Wiesel et al. (2011) base their approaches on aggregated data, and Montgomery et al. (2004) can only predict purchase conversion based on page categorizations rather than on channels. Furthermore, Wiesel et al. (2011), Abhishek et al. (2012), Kireyev et al. (2016), Xu et al. (2014) and Ghose and Todri (2015) mainly focus on the effect of display and search advertisements. Finally, Shao and Li (2011) and Dalessandro et al. (2012) do not account for carryover and spillover effects, and Li and Kannan (2014) do not account for consumers’ heterogeneity. This leads us to focus on the model of Anderl et al. (2016a). In order to prevent computational intractability, we propose using an MTD model. Previous research on the MTD model applied to higher-order Markov models is discussed in the next section.

2.2 Previous Mixture Transition Distribution models

As stated before, a high-order Markov model has an exponentially increasing number of parameters to estimate. To overcome this, Raftery (1985) introduced a model for Markov chains of order higher than one which involves only one additional parameter for each extra lag. Rather than letting the probability of observing an event at time $t$ depend on a combination of the lagged events, they consider the effect of each lag upon the present separately, after which the contributions are combined additively. This method is applicable to a conversion attribution model, since the order in which the channels occur is still taken into consideration. It results in $m(m-1) + (k-1)$ parameters to be estimated instead of $m^k(m-1)$ parameters, where $k$ is the order of the model and $m$ the number of different channels; see Section 3.6 for how the parameters are reduced specifically. Although the MTD model is limited by only being defined for a finite state space, in conversion attribution modeling the number of marketing channels (the state space) is in fact finite. However, the MTD model has a large number of non-linear constraints.

Proposing a way to reduce this large number of constraints, Raftery and Tavaré (1994) introduce a computational algorithm for maximum likelihood estimation (MLE), which they use to analyze wind directions. In addition, they use a χ² estimation to obtain approximations of the parameters of the MTD model, because the χ² estimation in some cases results in a lower mean squared error (MSE) than the MLE. Although they propose an effective procedure for reducing the number of constraints, the difficulty of the non-linearity of the objective function remains. Lebré and Bourguignon (2008) introduce a hidden process and derive an Expectation-Maximization (EM) algorithm. The EM algorithm consists of two steps, the Expectation and the Maximization step: first the log-likelihood of the complete model, conditional on the observed sequence and on the current parameters, is computed, after which it is maximized. However, the complexity arising from the counts of the sequence patterns remains unsolved for the EM algorithm (Chen and Lio, 2009). Chen and Lio (2009) propose converting the nonlinear constraints to box constraints through a transformation in order to remove the difficulties in estimating the parameters. Although the transformation method simplifies the MLE process for the MTD model, the large number of constraints is still present. Berchtold (2001) introduces an iterative method which uses the idea of balancing an increase in one of the parameters with an equal decrease in another, using a boundary adjustment in the MLE process that leads to a modification of Newton’s method (Chen and Lio, 2009). Although it is possible that this method does not converge to the global maximum of the log-likelihood, choosing the right starting values leads to very good performance. Therefore, we use this method in our research.

The MTD model has been applied to various subjects, such as DNA sequences, bird songs and wind directions. To our knowledge, this research is novel since MTD models have not been applied to customer journeys in order to attribute conversion credit to various online channels.


Chapter 3

Methodology and Techniques

3.1 Last Touch Attribution

Since the industry standard is the LTA method, we compare our model to it. LTA assigns all conversion credit to the last channel visited prior to conversion, according to the following algorithm (a code sketch follows the list):

1. For each channel: count the number of customer journeys in which the channel was visited last prior to conversion.

2. Divide the count of each channel by the total number of successful customer journeys; this yields the conversion attribution per channel.
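As a minimal sketch (not the thesis’ actual code, and with the four example journeys of Section 3.4 as stand-in data), the LTA algorithm can be written as follows:

```python
# Last-touch attribution: count which channel was last before conversion.
from collections import Counter

journeys = [
    (["C1", "C2", "C3"], True),   # (channels in order, converted?)
    (["C1", "C4"], False),
    (["C2", "C3"], False),
    (["C1", "C4", "C3"], True),
]

last_touch = Counter(path[-1] for path, converted in journeys if converted)
n_conversions = sum(converted for _, converted in journeys)

lta = {ch: cnt / n_conversions for ch, cnt in last_touch.items()}
print(lta)  # {'C3': 1.0} -- both converting journeys end in C3
```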

3.2 Logistic regression

As stated before, the logistic regression is a well-known alternative for modeling conversion attribution. We compare our model to the logistic regression, and therefore give a brief outline of this method. It predicts a binary outcome of a click-path based on prior exposures, where the outcome represents a conversion or non-conversion. We define $X$ as the set of variables representing the journey properties, such as the number of clicks per channel or the average time spent per page, and we define $Y$ as follows:

$$Y = \begin{cases} 1 & \text{if conversion} \\ 0 & \text{else} \end{cases}$$

Then the probability of conversion is assumed to be:

$$P(Y=1 \mid X) = \frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}} = \frac{1}{1+e^{-(\alpha+\beta x)}} \quad (3.1)$$

The odds are:

$$\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = e^{\alpha+\beta x} \quad (3.2)$$


Resulting in the log-odds of:

$$\ln(\text{odds}) = \alpha + \beta x \quad (3.3)$$

where the β’s represent the relative contribution of the channels.

We include the number of clicks per channel as the set of variables. We could include more variables such as time on the website or total number of visits, but that is beyond the scope of this research. This results in the following logistic regression model:

$$\log(\text{odds}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m \quad (3.4)$$

where $x_i$ is the number of clicks on channel $i$ in the journey.

Although the logistic regression described above does not take channel sequences into account, order effects can be included by adding dummies. Therefore, we introduce a second logistic model, in which a new dummy is introduced for every lag. Hence, the second logistic regression model is as follows:

$$\log(\text{odds}) = \alpha + \sum_{i=1}^{m} \sum_{j=1}^{k} \beta_{i,j} d_{ij} \quad (3.5)$$

where $d_{ij}$ is equal to 1 if channel $i$ is present at position $j$, and 0 otherwise. This model is estimated by maximum likelihood. To find the coefficients that maximize the likelihood function, the Fisher scoring method is used, which is an iterative process: it starts with a tentative solution and improves it until no further improvement is possible.
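A sketch of the two specifications on toy journeys follows; the channel names, the choice of $k$, and the use of scikit-learn are illustrative assumptions (scikit-learn regularizes by default, unlike the plain maximum likelihood with Fisher scoring described above):

```python
# Two logit designs: click counts per channel (3.4) and position dummies (3.5).
import numpy as np
from sklearn.linear_model import LogisticRegression

channels = ["C1", "C2", "C3", "C4"]
journeys = [(["C1", "C2", "C3"], 1), (["C1", "C4"], 0),
            (["C2", "C3"], 0), (["C1", "C4", "C3"], 1)]
y = np.array([conv for _, conv in journeys])

# Model 1 (3.4): x_i = number of clicks on channel i, order ignored.
X1 = np.array([[path.count(c) for c in channels] for path, _ in journeys])

# Model 2 (3.5): dummy d_ij = 1 if channel i occurs at position j (j < k).
k = 3
X2 = np.array([[int(j < len(path) and path[j] == c)
                for c in channels for j in range(k)] for path, _ in journeys])

for X in (X1, X2):
    fit = LogisticRegression().fit(X, y)
    print(fit.intercept_, fit.coef_)   # alpha and the beta estimates
```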

3.3 Baseline Markov model

Markov Chains are probabilistic models that can represent dependencies between sequences of observations of a random variable (Anderl et al., 2016a). Every customer journey can be described as a Markov Chain, where the states depict the channels visited by the customers and the edges represent the probabilities of transiting from one state to another. We include three base nodes: START, CONVERSION and NULL. Every click path begins at the START node; when the path ends in a conversion it is connected through the CONVERSION node to the NULL node, otherwise it skips the CONVERSION node and ends in the NULL node. There are no incoming edges in the START node, and cycles are possible, for instance when consecutive channels are the same.

Let $\{X_t\}$ be a sequence of random variables taking values in the finite set $N = \{1, \dots, m\}$.

In the first-order Markov model the current observation solely relies on the previous observation, thus we can write:

$$P(X_t = i_t \mid X_{t-1} = i_{t-1}, \dots, X_0 = i_0) = P(X_t = i_t \mid X_{t-1} = i_{t-1}) \quad (3.6)$$

where $i_0, \dots, i_t \in N$.

Combining all combinations of $i_{t-1}$ and $i_t$ in an $m \times m$ transition matrix $Q$, with rows indexed by $X_{t-1}$ and columns by $X_t$, we get the following:

$$Q = \begin{pmatrix} q_{11} & \dots & q_{1m} \\ \vdots & \ddots & \vdots \\ q_{m1} & \dots & q_{mm} \end{pmatrix} \quad (3.7)$$

where each row represents a probability distribution, meaning each row sums to one and the elements are nonnegative.
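A small sketch of how $Q$ can be estimated from journey data by counting transitions; the journey encoding is an assumption, and the numbers reproduce the example of Section 3.4:

```python
# Estimate the first-order transition matrix by normalizing transition counts.
from collections import defaultdict

journeys = [["start", "C1", "C2", "C3", "conversion", "null"],
            ["start", "C1", "C4", "null"],
            ["start", "C2", "C3", "null"],
            ["start", "C1", "C4", "C3", "conversion", "null"]]

counts = defaultdict(lambda: defaultdict(int))
for path in journeys:
    for a, b in zip(path, path[1:]):
        counts[a][b] += 1            # count observed transitions a -> b

Q = {a: {b: n / sum(row.values()) for b, n in row.items()}
     for a, row in counts.items()}   # each row sums to one
print(Q["C1"])                       # {'C2': 0.333..., 'C4': 0.666...}
```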

3.4 Removal effect

Anderl et al. (2016a) introduce the Removal Effect to calculate the conversion attribution per channel. The Removal Effect reflects the change in the probability of reaching the CONVERSION state from the START state when a certain state is removed from the graph. When a state is removed, all its incoming edges are redirected to the NULL state. They define the Removal Effect as the product of Visits and Eventual Conversion, where Visits is the probability of visiting a particular state and Eventual Conversion is the probability of reaching the CONVERSION state through that particular state. We clarify this with an example.

Example. Suppose we have the following customer journeys:

Journey 1: START - C1 - C2 - C3 - CONVERSION - NULL
Journey 2: START - C1 - C4 - NULL
Journey 3: START - C2 - C3 - NULL
Journey 4: START - C1 - C4 - C3 - CONVERSION - NULL

There are three journeys involving channel C1, of which two lead to a subsequent visit through channel C4 and one through channel C2. This results in 66.7% and 33.3% of the journeys going to C4 and C2, respectively, after visiting C1. Repeating this for all channels gives Figure 3.1.

Figure 3.1 shows that the probability of reaching the CONVERSION node from C2 is 1 · 0.67 = 0.67, meaning the eventual conversion probability is 0.67. In addition, visiting channel C2 is possible in two ways, either directly from the START state or via C1, resulting in a visit probability of 0.25 + 0.75 · 0.33 = 0.5. Multiplying Visits with Eventual Conversion leads to a removal effect of 0.33. The results for all channels can be found in Table 3.1.


Figure 3.1: Example Markov graph

Table 3.1: Removal Effects for Markov graph in Figure 3.1

Channel Visits Eventual Conversion Removal Effect Removal Effect in %

C1 0.75 0.44 0.33 25%

C2 0.5 0.67 0.33 25%

C3 0.75 0.67 0.5 37.5%

C4 0.5 0.33 0.167 12.5%

To allow easy comparison of the findings with other methods, we include the Removal Effect in %, calculated as each channel’s share of the sum of all removal effects.
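The conversion probability implied by a Markov graph can also be obtained from a linear system of absorption probabilities, which gives one way to compute removal effects directly. The sketch below is not the thesis’ implementation; it hard-codes the transition probabilities of Figure 3.1 and reproduces the removal effects of Table 3.1 up to rounding:

```python
# Removal effect as the drop in conversion probability when a channel's
# edges are redirected to NULL (absorption-probability formulation).
import numpy as np

states = ["start", "C1", "C2", "C3", "C4"]          # transient states
P = {("start", "C1"): 0.75, ("start", "C2"): 0.25,
     ("C1", "C2"): 1/3, ("C1", "C4"): 2/3,
     ("C2", "C3"): 1.0,
     ("C3", "conversion"): 2/3, ("C3", "null"): 1/3,
     ("C4", "C3"): 0.5, ("C4", "null"): 0.5}

def p_conversion(removed=None):
    # Solve c(s) = sum_t P(s,t) c(t), with c(CONVERSION)=1 and c(NULL)=0;
    # edges into or out of a removed channel contribute nothing.
    idx = {s: i for i, s in enumerate(states)}
    A, b = np.eye(len(states)), np.zeros(len(states))
    for (s, t), p in P.items():
        if s == removed or t == removed:
            continue
        if t == "conversion":
            b[idx[s]] += p
        elif t in idx:
            A[idx[s], idx[t]] -= p
    return np.linalg.solve(A, b)[idx["start"]]

base = p_conversion()                                # 0.5 for this graph
for c in ["C1", "C2", "C3", "C4"]:
    print(c, round(base - p_conversion(removed=c), 3))
# C1 0.333, C2 0.333, C3 0.5, C4 0.167 -- matching Table 3.1
```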

3.5 Higher-order model

As stated previously in this research, the decision process of a customer prior to conversion usually depends on more than one touch-point. As a result, more memory is included by using a higher-order Markov Model, which means the probability of going to the next state depends not only on the current state but on the previous $k$ states. The higher-order transition probabilities are thus as follows:

$$P(X_t = i_t \mid X_{t-1} = i_{t-1}, \dots, X_0 = i_0) = P(X_t = i_t \mid X_{t-1} = i_{t-1}, \dots, X_{t-k} = i_{t-k}) \quad \text{for } t > k \quad (3.8)$$

However, higher-order Markov Chains can still be interpreted as first-order Markov Chains.

Example. Suppose we have channels (or states) A, B, C, D and the sequence of observations AABCBDDCDAACA?, where the question mark represents the unknown next visit. We model a second-order chain as a first-order chain by enlarging the state space such that each state represents the channel visits on two consecutive positions. This gives the following state space

$$\tilde{S} = \{AA, AB, AC, AD, BA, BB, BC, BD, CA, CB, CC, CD, DA, DB, DC, DD\}$$

which has $4^2 = 16$ states. Now we define $X_0 = AA$, $X_1 = AB$, $X_2 = BC$, $X_3 = CB$ and so forth, where for instance from AA we can only transit to AB and AC in this sequence. Thus, we have converted the second-order chain to a first-order chain.


More generally we get:

$$\begin{aligned} &P(X_t = i_t \mid X_{t-1} = i_{t-1}, \dots, X_{t-k} = i_{t-k}) \\ &= P(X_t = i_t, X_{t-1} = i_{t-1}, \dots, X_{t-k+1} = i_{t-k+1} \mid X_{t-1} = i_{t-1}, X_{t-2} = i_{t-2}, \dots, X_{t-k} = i_{t-k}) \\ &= P(Y_t \mid Y_{t-1}) \end{aligned} \quad (3.9)$$

where $Y_t = (X_t, X_{t-1}, \dots, X_{t-k+1})$.

Hence, we have represented the $k$th-order Markov Chain as a first-order Markov Chain of $k$-tuples, where the number of states increases from $m$ to $m^k$. The next example clarifies how to calculate the removal effect in a higher-order Markov model.

Example. We continue with the Removal Effect example, where we have channels C1, C2, C3 and C4 and the four previously given journeys. As stated before, we enlarge the state space such that the states represent channel visits on two consecutive positions, which leads to the following journeys:

Journey 1: START - C1C2 - C2C3 - CONVERSION - NULL
Journey 2: START - C1C4 - NULL
Journey 3: START - C2C3 - NULL
Journey 4: START - C1C4 - C4C3 - CONVERSION - NULL

Using the same approach as the first-order example this results in Figure 3.2.

Figure 3.2: Example Markov graph 2nd Order

In this graph there are three paths which could lead to conversion, and their conversion probabilities are:

P[START - C1C2 - C2C3 - CONVERSION - NULL] = 0.25 · 1 · 0.5 = 0.125
P[START - C2C3 - CONVERSION - NULL] = 0.25 · 0.5 = 0.125
P[START - C1C4 - C4C3 - CONVERSION - NULL] = 0.5 · 0.5 · 1 = 0.25

Suppose we remove channel C1; then both state C1C2 and C1C4 would be removed, which means only the second possible conversion path remains, and thus the removal effect of C1 is $\frac{0.125+0.25}{0.125+0.125+0.25} = 0.75$. Repeating this for all channels gives the removal effects in Table 3.2.

Table 3.2: Removal Effects for Markov graph in Figure 3.2

Channel Removal Effect Removal Effect in %
C1 0.75 27.27%
C2 0.5 18.18%
C3 1 36.36%
C4 0.5 18.18%

As we can see from the above example, the journeys are rewritten with the states from the enlarged state space for order $k$. In this case the order is 2 and all journeys have 2 or more touch-points (START, CONVERSION and NULL excluded). However, if there are journeys in which the order is larger than the number of touch-points, the higher-order Markov model adds extra states of the maximum length possible, which can very quickly lead to many additional elements in the state space. To prevent this, we complement those journeys with touch-points of the last visited channel up to order $k$. By doing so, the transition probabilities are not affected by the adjustment.

Example. Suppose we have the same journeys as before but we add the following journey:

START - C1 - CONVERSION - NULL

Applying a second-order Markov model leads to the removal effects in the third column of Table 3.3. In this case there are $4^2 + 1 = 17$ elements in the state space. Using the approach of complementing the journeys, the second-order journey becomes:

START - C1C1 - CONVERSION - NULL

The removal effects of this approach under a second-order Markov model are shown in the fifth column of Table 3.3. As can be seen, the removal effects of both approaches are exactly the same, but the second approach prevents the state space from growing even further.

Table 3.3: Removal Effects with additional journey

Channel Removal Effect Removal Effect in % Removal Effect (complemented) Removal Effect (complemented) in %

C1 0.83 38.4% 0.83 38.4%

C2 0.33 15.4% 0.33 15.4%

C3 0.67 30.8% 0.67 30.8%

C4 0.33 15.4% 0.33 15.4%
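A minimal sketch of the state-space expansion with this padding adjustment; the encoding of states as concatenated channel labels is an illustrative assumption:

```python
# Rewrite a journey as order-k states (k-tuples), padding journeys shorter
# than k by repeating the last visited channel, as described above.
def to_higher_order(path, k):
    # path holds only the channel touch-points (no start/conversion/null)
    if len(path) < k:
        path = path + [path[-1]] * (k - len(path))   # pad with last channel
    states = ["".join(path[i:i + k]) for i in range(len(path) - k + 1)]
    return ["start"] + states

print(to_higher_order(["C1", "C2", "C3"], 2))  # ['start', 'C1C2', 'C2C3']
print(to_higher_order(["C1"], 2))              # ['start', 'C1C1']
```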

3.5.1 Stochastic simulation

In modeling the higher-order Markov models as described above, we apply the method of Altomare and Loris (2016), who use stochastic simulation for the approach of Anderl et al. (2016a). A Markov process can be described as a stochastic process, where the state space is the collection of all possible values the random variables can take. In conversion attribution the state space is discrete, which means the stochastic variables are also discrete (Brereton, 2015). First, using the journeys in the data set, the model estimates a Markov graph, similar to the method of Anderl et al. (2016a). Then, the stochastic simulation generates random paths from that graph. Once enough random paths have been generated, that is, once the number of simulations is reached, the removal effects of the channels are estimated. Using stochastic simulation results in approximated removal effects; however, increasing the number of simulations also increases the accuracy of the estimates. In this research we perform 1,000,000 simulations, which is the default number of simulations in the model of Altomare and Loris (2016). In addition, in Chapter 5 we examine the sensitivity of the estimation by repeating it many times to see how the results change.

This approach is very fast in terms of calculation time, which is due to the stochastic simulation. The fact that the code is written in C++ also greatly improves the calculation time. In contrast to Anderl et al. (2016a), this method does not become computationally intractable as the order increases. Nevertheless, it is still interesting to show how the MTD model can be applied to estimate conversion attribution, as it remains an alternative when stochastic simulation is not the desired approach.
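A sketch of the simulation idea, hard-coding the transition probabilities of Figure 3.1; this follows the spirit of, but is not identical to, the ChannelAttribution implementation of Altomare and Loris (2016):

```python
# Estimate removal effects by sampling random paths from the Markov graph
# and counting how often CONVERSION is reached.
import random

Q = {"start": [("C1", 0.75), ("C2", 0.25)],
     "C1": [("C2", 1/3), ("C4", 2/3)],
     "C2": [("C3", 1.0)],
     "C3": [("conversion", 2/3), ("null", 1/3)],
     "C4": [("C3", 0.5), ("C4_unused", 0.0), ("null", 0.5)]}

def simulate(removed=None, n=100_000, seed=0):
    rng, conversions = random.Random(seed), 0
    for _ in range(n):
        state = "start"
        while state not in ("conversion", "null"):
            nxt, = rng.choices([s for s, _ in Q[state]],
                               weights=[p for _, p in Q[state]])
            state = "null" if nxt == removed else nxt  # removal -> NULL
        conversions += state == "conversion"
    return conversions / n

base = simulate()   # approximately 0.5 for this graph
print({c: round(base - simulate(removed=c), 3)
       for c in ["C1", "C2", "C3", "C4"]})
```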

3.6 Mixture Transition Distribution model

As stated in the previous section, the number of independent parameters to be estimated increases exponentially with the order, which can lead to computational intractability. Whatever the order, there are $(m-1)$ independent probabilities in each row of the matrix $Q$, the last one being completely determined by the others since each row sums to one (Berchtold and Raftery, 2002). This means there are in total $m^k(m-1)$ independent parameters to be estimated in a $k$th-order Markov Model.

Raftery (1985) introduced the Mixture Transition Distribution (MTD) model in order to reduce the number of parameters to be estimated in time-homogeneous higher-order Markov chains. As in the case of a higher-order Markov chain, the set of explanatory variables is considered as a whole, but if we assume that the effect of each lag upon the present can be considered separately, the MTD principle can be used to approximate the higher-order Markov model (Berchtold and Raftery, 2002). In this approach a one-step transition matrix is estimated, after which the probabilities are converted into lagged probabilities by multiplying them with the lag parameters $\lambda$. This contrasts with the transition matrix of a higher-order Markov model, which estimates multi-step transition probabilities, and it is what makes the MTD model more parsimonious. With this approach we do not take the dependency between the lagged events into account when calculating the effect on the present. It is important to note that this does not mean we treat the events as independent by definition, only when they are lagged events of the present event. Also, the positions of the lagged events relative to the present are still taken into account. Therefore the MTD model should be a good estimator for attributing conversion credit to the various online marketing channels, since the order in which the channels occur in the customer journey is maintained, which is also the aim of the higher-order Markov models. In order to apply the MTD model, the state space should be finite, which is fulfilled because in our case there are only nine channels; see Chapter 4 for an overview of the channels. As stated previously, the difference between MTD and Markov models lies in the calculation of the contributions. The MTD probabilities are calculated as follows:

$$P(X_t = i_t \mid X_{t-1} = i_{t-1}, \dots, X_{t-k} = i_{t-k}) = \sum_{g=1}^{k} \lambda_g \, P(X_t = i_t \mid X_{t-g} = i_{t-g}) = \sum_{g=1}^{k} \lambda_g \, q_{i_{t-g} i_t} \quad (3.10)$$

where

$$0 \le \sum_{g=1}^{k} \lambda_g \, q_{i_{t-g} i_t} \le 1 \quad (3.11)$$

and where $i_{t-k}, \dots, i_t \in \{1, \dots, m\}$, the probabilities $q_{i_{t-g} i_t}$ are elements of an $m \times m$ transition matrix $Q$, and $\lambda = (\lambda_1, \dots, \lambda_k)$ is a vector of lag parameters subject to:

$$\sum_{g=1}^{k} \lambda_g = 1 \quad (3.12)$$

$$\lambda_g \ge 0 \quad (3.13)$$

Raftery and Tavaré (1994) show that condition (3.13) can be removed to reduce the number of constraints of (3.11) from $m^k(m-1)$ to $m$, but in order to make sure (3.11) still holds, it has to be replaced by the following constraints:

$$T q_i^- + (1 - T) q_i^+ \ge 0 \quad \forall i \in N \quad (3.14)$$

where

$$T = \sum_{g:\ \lambda_g \ge 0} \lambda_g, \qquad q_i^- = \min_{1 \le j \le m} q_{ij}, \qquad q_i^+ = \max_{1 \le j \le m} q_{ij}$$

There are still $(m-1)$ independent probabilities in each of the $m$ rows of the matrix $Q$, but there are only $k$ additional lag parameters, of which $(k-1)$ are independent. Hence the number of independent parameters to be estimated equals $m(m-1) + (k-1)$, which is more parsimonious than the $m^k(m-1)$ parameters mentioned before. See Table 3.4 for an example.


Table 3.4: Number of parameters to be estimated

k  m  Markov Model  MTD Model
2  3     18      7
2  4     48     13
2  5    100     21
4  3    162      9
4  4    768     15
4  5   2500     23
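A sketch of the MTD probability (3.10); the matrix $Q$, the weights $\lambda$ and the sizes $m$ and $k$ are made-up illustrative values:

```python
# MTD: the probability of the next channel is a lambda-weighted mixture of
# one-step transition probabilities from each lagged channel.
import numpy as np

m, k = 4, 2                         # 4 channels, order 2 (toy sizes)
Q = np.array([[0.1, 0.3, 0.4, 0.2],  # one-step m x m transition matrix
              [0.2, 0.2, 0.5, 0.1],  # (made-up numbers; rows sum to 1)
              [0.3, 0.3, 0.2, 0.2],
              [0.4, 0.1, 0.1, 0.4]])
lam = np.array([0.7, 0.3])           # lag weights, sum to one

def mtd_prob(history, nxt):
    # history[-1] is lag 1, history[-2] is lag 2, ...
    return sum(lam[g] * Q[history[-(g + 1)], nxt] for g in range(k))

print(mtd_prob([2, 0], 1))  # 0.7*Q[0,1] + 0.3*Q[2,1] = 0.7*0.3 + 0.3*0.3 = 0.3
```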

3.7 Estimating the MTD model

3.7.1 Maximum Likelihood Estimation

We use Maximum Likelihood Estimation (MLE) to estimate the parameters $\lambda$ and $q$ of the MTD model (3.10). The log-likelihood is

$$\log(L) = \sum_{i_{t-k},\dots,i_t=1}^{m} n_{i_{t-k},\dots,i_t} \log\left( \sum_{g=1}^{k} \lambda_g q_{i_{t-g} i_t} \right) \quad (3.15)$$

where $n_{i_{t-k},\dots,i_t}$ is the number of sequences of the form $X_{t-k} = i_{t-k}, \dots, X_t = i_t$ in the data. In order to maintain a higher-order Markov Model, we should maximize (3.15) subject to constraint (3.12) and either (3.13) or (3.14). However, as Berchtold (2001) states, the nonlinearity of (3.15) and the large number of constraints make it difficult to maximize the likelihood. In addition, there is no algebraic solution to the maximization of the log-likelihood, leading Berchtold (2001) to propose an iterative method. We discuss this approach briefly.

There are $m + 1$ subsets of parameters, namely the $m$ rows of the transition matrix $Q$ and the vector of lag parameters $\lambda$. Berchtold (2001) assumes $Q$ to be fixed and reevaluates the vector $\lambda$. Since the elements of $\lambda$ sum to one, the idea is to balance an increase in one of these parameters with an equal decrease in another. The partial derivatives are used to measure the local impact on the log-likelihood of a change in each parameter:

$$\frac{\partial \log(L)}{\partial \lambda_l} = \sum_{i_{t-k},\dots,i_t=1}^{m} n_{i_{t-k},\dots,i_t} \frac{q_{i_{t-l} i_t}}{\sum_{g=1}^{k} \lambda_g q_{i_{t-g} i_t}} \qquad l = 1, \dots, k \quad (3.16)$$

$$\frac{\partial \log(L)}{\partial q_{i_{t-l} i_t}} = \sum_{i_{t-k},\dots,i_t=1}^{m} n_{i_{t-k},\dots,i_t} \frac{\lambda_l}{\sum_{g=1}^{k} \lambda_g q_{i_{t-g} i_t}} \qquad l = 1, \dots, k \quad (3.17)$$

The best solution is to increase the parameter corresponding to the largest derivative and to decrease the parameter corresponding to the smallest derivative, denoted $\lambda^+$ and $\lambda^-$ respectively, by an amount $\delta$. Then (3.15) is recomputed with the reestimated $\lambda$, where the reestimation replaces $\lambda^+$ with $(\lambda^+ + \delta)$ and $\lambda^-$ with $(\lambda^- - \delta)$. If the constraints (3.13) are active, this imposes four special cases where $\lambda$ or $\delta$ have to be adjusted in order to improve the log-likelihood through a reestimation of $\lambda$. However, when using (3.14) rather than (3.13), these four special cases do not have to be considered; nevertheless, it has to be verified a posteriori that all results of the model are probabilities. The new log-likelihood can then be computed with the reestimated $\lambda$ vector. If the new value is larger than the previous one, the new vector $\lambda$ is accepted and the procedure stops. Otherwise, $\delta$ is divided by 2 and the procedure iterates. When $\delta$ becomes smaller than a fixed threshold, we stop the procedure, even if $\lambda$ was not reevaluated. The log-likelihood achieved through this procedure is higher than or equal to the previous value. The iterative procedure is as follows (a code sketch of the $\lambda$ reestimation step is given after the lists below):

1. Initialization

• Choose initial values for all parameters

• Choose a value for δ and a criterion to stop the algorithm.

2. Iterations

• Reestimate the vector λ by modifying two of its elements.

• Reestimate the transition matrix Q by modifying two elements of each row.

3. End criterion

• If the increase of the log-likelihood since the last iteration is greater than the stop criterion, go back to Step 2.

• Otherwise, end the procedure.

At the beginning of the process, δ is usually chosen large, allowing for significant changes in the lag parameters. However, as we get closer to the optimum, the changes in the lag parameters become smaller and a large δ may result in useless calculations. To prevent this, δ is modified dynamically. A different δ is used for each of the m + 1 sets of parameters, but all are initialized to the same value. Then, after the reestimation of each distribution, the corresponding δ is modified as follows:

• If the distribution was not reevaluated because the parameter to increase was already set to 1, δ is not changed.

• If the distribution was reevaluated with the original value of δ (i.e. the value of δ at the beginning of Step 2), δ is set to 2δ.

• If the distribution was reevaluated with a value of δ smaller than its value at the beginning of Step 2, δ keeps its present value.

• If the algorithm was completed without reestimation, δ is set to twice the value reached at the end of Step 2.
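The following sketch shows one λ-reestimation step in the spirit of Berchtold’s procedure, rather than reproducing his exact implementation; the input format (counts of (k+1)-grams) and the choice to keep constraint (3.13) active are assumptions:

```python
# One lambda step: move weight delta from the lag with the smallest partial
# derivative to the lag with the largest one; accept only if log L improves.
import numpy as np

def log_lik(lam, Q, ngrams):
    # ngrams maps (i_{t-k}, ..., i_{t-1}, i_t) -> count, cf. (3.15)
    ll = 0.0
    for seq, n in ngrams.items():
        *hist, cur = seq
        p = sum(lam[g] * Q[hist[-(g + 1)], cur] for g in range(len(lam)))
        ll += n * np.log(p)
    return ll

def reestimate_lambda(lam, Q, ngrams, delta=0.1, tol=1e-6):
    while delta > tol:
        # partial derivatives of log L w.r.t. each lambda_l, cf. (3.16)
        grad = np.zeros(len(lam))
        for seq, n in ngrams.items():
            *hist, cur = seq
            p = sum(lam[g] * Q[hist[-(g + 1)], cur] for g in range(len(lam)))
            for l in range(len(lam)):
                grad[l] += n * Q[hist[-(l + 1)], cur] / p
        lo, hi = np.argmin(grad), np.argmax(grad)
        step = min(delta, lam[lo])       # keep lambda_lo >= 0, cf. (3.13)
        cand = lam.copy()
        cand[hi] += step
        cand[lo] -= step
        if log_lik(cand, Q, ngrams) > log_lik(lam, Q, ngrams):
            return cand                  # improvement found, accept
        delta /= 2                       # otherwise shrink delta and retry
    return lam                           # no improvement within tolerance
```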


3.7.2 Choosing the initial values

As with most iterative procedures, reaching the global maximum can be difficult. In order to increase the probability that the global maximum is in fact reached, choosing the right initial values for the lag parameters λ and the transition matrix Q is crucial. Berchtold (2001) proposes a measure where the lag parameters are proportional to the strength of the relation between each lag and the present. First, we introduce some new variables.

Let $C_g$ be the contingency table between lag $g$ (rows) and the present (columns):

$$C_g = \begin{pmatrix} C_g(1,1) & \dots & C_g(1,m) \\ \vdots & & \vdots \\ C_g(m,1) & \dots & C_g(m,m) \end{pmatrix}$$

and let $Q_g$ be the corresponding row transition matrix, with elements

$$Q_g(i,j) = P(X_t = j \mid X_{t-g} = i) = \frac{C_g(i,j)}{C_g(i,\cdot)}$$

For the cross-table $C_g$ the measure $u_g$ is defined as

$$u_g = \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} C_g(i,j) \log_2 \left\{ \dfrac{C_g(i,\cdot)\, C_g(\cdot,j)}{C_g(i,j)\, T_{C_g}} \right\}}{\sum_{j=1}^{m} C_g(\cdot,j) \log_2 \left\{ \dfrac{C_g(\cdot,j)}{T_{C_g}} \right\}} \quad (3.18)$$

where

$$C_g(i,\cdot) = \sum_{j=1}^{m} C_g(i,j), \qquad C_g(\cdot,j) = \sum_{i=1}^{m} C_g(i,j), \qquad T_{C_g} = \sum_{i=1}^{m} \sum_{j=1}^{m} C_g(i,j)$$

Finally, the weight $\lambda_g$ associated with the $g$th lag can be defined as:

$$\lambda_g = \frac{u_g}{\sum_{l=1}^{k} u_l} \quad (3.19)$$

According to Berchtold (2001) there are three possible initial values for the transition matrix Q, to be specific:


2. Define $Q$ as a weighted sum of the matrices $Q_1, \dots, Q_k$. Let

$$\tilde{Q} = \sum_{l=1}^{k} \lambda_l C_l \quad (3.20)$$

Then element $(i,j)$ of $Q$ is

$$q_{ij} = \frac{\tilde{q}_{ij}}{\tilde{q}_i} \quad (3.21)$$

where

$$\tilde{q}_i = \sum_{j=1}^{m} \tilde{q}_{ij} \quad (3.22)$$

3. Set $Q = Q_l$, where $l = \arg\max_{g=1,\dots,k} u_g$.

They find that the results achieved with the third method are always equal to or better than those obtained through the first two possibilities. Therefore, we choose to use the third possibility.
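The initialization can be sketched as follows, assuming journeys encoded as integer sequences; the nansum handling of empty cells is a simplification of our own:

```python
# Build lag-g contingency tables C_g, compute u_g from (3.18), set lambda
# proportional to u_g (3.19), and take Q from the strongest lag (option 3).
import numpy as np

def initial_values(sequences, m, k):
    u, Qs = np.zeros(k), []
    for g in range(1, k + 1):
        C = np.zeros((m, m))
        for seq in sequences:               # one sequence = one journey
            for t in range(g, len(seq)):
                C[seq[t - g], seq[t]] += 1  # lag g (rows) vs present (cols)
        T = C.sum()
        ri, cj = C.sum(axis=1), C.sum(axis=0)
        with np.errstate(divide="ignore", invalid="ignore"):
            num = np.nansum(C * np.log2(np.outer(ri, cj) / (C * T)))
            den = np.nansum(cj * np.log2(cj / T))
            Qs.append(C / ri[:, None])      # row-normalized Q_g
        u[g - 1] = num / den
    lam = u / u.sum()                       # equation (3.19)
    Q = Qs[int(np.argmax(u))]               # Q = Q_l with l = argmax_g u_g
    return lam, Q

lam0, Q0 = initial_values([[0, 1, 2, 1, 0], [2, 0, 1, 1]], m=3, k=2)
```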

3.7.3 Computing Conversion Attribution with MTD

Before we are able to estimate the transition matrix $Q$ and the lag parameters $\lambda$, we have to adjust our customer journey data. This is because the MTD model calculates transition probabilities between the elements of the state space, and the START, CONVERSION and NULL states should be included too; hence, these have to be added to the journeys a priori. The same approach as before is then followed: the removal effect of each channel is computed to assign conversion credit.

Example. Once again we use the Removal Effect example, with channels C1, C2, C3 and C4 and the four previously given journeys. Applying a second-order MTD model results in the following estimates, with rows and columns ordered as (null), (start), C1, C2, C3, C4, conversion:

$$Q = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.45 & 0.45 & 0 & 0.1 & 0 \\ 0 & 0 & 0 & 0.33 & 0 & 0.67 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0.33 & 0 & 0 & 0 & 0 & 0 & 0.67 \\ 0.6 & 0 & 0 & 0 & 0.4 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad \lambda = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$$

Just like in the previous example there are three possible converting journeys; applying (3.10) we get the following conversion probabilities:

P[START - C1 - C2 - C3 - CONVERSION - NULL]
= (λ1 · P[(null) | conversion] + λ2 · P[(null) | C3])
· (λ1 · P[conversion | C3] + λ2 · P[conversion | C2])
· (λ1 · P[C3 | C2] + λ2 · P[C3 | C1])
· (λ1 · P[C2 | C1] + λ2 · P[C2 | (start)])
= (1 · 1 + 0 · 0.33) · (1 · 0.67 + 0 · 0) · (1 · 1 + 0 · 0) · (1 · 0.33 + 0 · 0.45) = 0.22

P[START - C2 - C3 - CONVERSION - NULL] = 0.67
P[START - C1 - C4 - C3 - CONVERSION - NULL] = 0.18

Suppose we remove channel C2; then only the last possible conversion path remains, and thus the removal effect of C2 is $\frac{0.22+0.67}{0.22+0.67+0.18} = 0.83$. Repeating this for all channels we get the removal effects in Table 3.5.

Table 3.5: MTD Removal Effects

Channel Removal Effect Removal Effect in %

C1 0.375 15.79%

C2 0.833 35.09%

C3 1 42.11%

C4 0.167 7.02%

Compared to the example of the second-order Markov Model, we see that C3 still receives the highest removal effect; however, C1 and C4 receive less value, and C2 is close to the removal effect of C3. This can be explained by the fact that in this example the first lag receives all of the weight (λ1 = 1), so the estimated MTD model effectively behaves like a first-order chain.


Chapter 4

Data

In this research we extract the data from Google Analytics of an online web-shop that sells computer and ICT products. The data set includes customer journeys from the 1st of June 2017 until the 30th of September 2017, where the source of the click, the timestamp and the total conversions are known. More information is available, such as the keywords used, device and browser. We do not include these variables, as we are interested in the attribution of each channel, but our framework does allow for the inclusion of more variables, and one could extend the research to, for instance, the keyword level. The advertiser that provides the data is mainly an online player, since it does not advertise on, for instance, TV or radio; for this reason we exclude online/offline cross-channel effects. A unique customer is recognized by a VisitorId, which is allocated to the customer the first time he or she visits the website.

Table 4.1: Data set preview

fullVisitorId        timestamp            channel      conversion
1000037927550026421  2017-08-08 09:21:57  Paid Search  NA
1000231304154707563  2017-07-12 12:23:13  Paid Search  NA
1000231304154707563  2017-07-12 13:06:54  Affiliate    1
1000231304154707563  2017-07-12 20:36:25  Paid Search  NA
1000074013863516242  2017-07-19 08:29:05  Direct       NA
1000074013863516242  2017-07-19 18:08:08  Direct       NA

We see that the second customer in Table 4.1 entered the website twice before conversion, using Paid Search and Affiliate, and after conversion returned again through Paid Search. Another customer entered the website through direct type-in twice and eventually did not convert. We have to transform this data into journeys in order to be able to calculate conversion attribution with the Markov model and the MTD model, where the START, NULL and CONVERSION states are also added. A customer journey is defined as a sequence of visits which ends in either a conversion (purchase of a product) or a non-conversion. The session is ended when the user has been inactive for more than 30 days, or when a conversion has occurred. Applying this to Table 4.1 we get Table 4.2.

Table 4.2: Preview transformation to customer journey

fullVisitorId        path                                                 conversion
1000037927550026421  start > Paid Search > null                           0
1000231304154707563  start > Paid Search > Affiliate > conversion > null  1
1000231304154707563  start > Paid Search > null                           0
1000074013863516242  start > Direct > Direct > null                       0
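A sketch of this transformation, assuming records shaped like Table 4.1; the timestamp parsing and field layout are illustrative assumptions:

```python
# Group clicks per visitor, sort by time, and close a journey either at a
# conversion or after more than 30 days of inactivity.
from datetime import datetime, timedelta
from itertools import groupby

clicks = [  # (visitorId, timestamp, channel, converted)
    ("A", "2017-07-12 12:23:13", "Paid Search", False),
    ("A", "2017-07-12 13:06:54", "Affiliate", True),
    ("A", "2017-07-12 20:36:25", "Paid Search", False),
]

def to_journeys(clicks, max_gap=timedelta(days=30)):
    journeys = []
    rows = sorted(clicks, key=lambda r: (r[0], r[1]))  # ISO dates sort lexically
    for _, visits in groupby(rows, key=lambda r: r[0]):
        path, last_ts = [], None
        for _, ts, channel, converted in visits:
            ts = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
            if path and ts - last_ts > max_gap:  # inactivity closes the journey
                journeys.append((path, False))
                path = []
            path.append(channel)
            last_ts = ts
            if converted:                        # conversion closes the journey
                journeys.append((path, True))
                path = []
        if path:
            journeys.append((path, False))
    return journeys

print(to_journeys(clicks))
# [(['Paid Search', 'Affiliate'], True), (['Paid Search'], False)]
```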

There are approximately 1.1 million customer journeys which have an average journey length of 1.63 touch-points. See Figure 4.1 and 4.2 for the successful and unsuccessful journeys per month.

Figure 4.1: Unsuccessful customer journeys per month

In Figure 4.1 we see an increase in the traffic towards the website. Similar movements can be seen in the conversions per month in Figure 4.2. From June until August there is a slight increase in conversions every month, after which there is a large increase in conversions in September. This can be explained by the marketing campaigns run in that period.

Figure 4.2: Successful customer journeys per month

One of the most important performance indicators in online marketing is the conversion rate, which is the number of conversions divided by the total number of journeys. This KPI is shown in Figure 4.3. We see that June has the highest conversion rate, followed by August. Thus, although there was more traffic towards the website in September, this does not necessarily result in a higher KPI.

Figure 4.3: KPI Conversion rate

We include bounces in our data set, as we are interested in which channels are used by a customer and not in whether a customer found what he or she was looking for. Google Analytics not only assigns a bounce when someone leaves immediately after the page is loaded, but also when a customer views one page and then leaves; therefore we do not exclude bounces from our data set. See Table 4.3 for an overview of the whole data set.

Table 4.3: Data overview

Number of different channels    9
Number of clicks                1,834,762
Number of journeys              1,124,582
  with length ≥ 2               267,157
  with length ≥ 4               66,231
Average journey length          1.63
Number of conversions           22,638
Journey conversion rate         2.01%

As mentioned in Table 4.3, there are nine different channels tracked by the advertiser. In Table 4.4 we provide a brief description of each online marketing channel used by the advertiser. Each channel is classified as customer-initiated or firm-initiated, as this is seen as an important differentiator for online marketing channels. However, some channels fall within both scopes.

It is interesting to investigate which channels are used most in the customer journey, as one could expect these channels to have much influence on conversion. In Figure 4.4 we see that Paid Search has the highest share of touch-points per month, followed by Organic Search. Affiliate and Direct alternate, while Display, Email, Referral and Social are the least used channels.


Table 4.4: Online marketing channels

Affiliate (customer-initiated and firm-initiated): Affiliate marketing is a type of performance-based marketing in which a business rewards an affiliate for redirecting a customer to its website. Since Affiliate can be either customer-initiated (for instance coupon websites) or firm-initiated, there is no clear differentiation between the two.

Direct (customer-initiated): When a customer directly types the advertiser’s website into the browser’s address bar, this is classified as Direct.

Display (firm-initiated): Display advertising usually comes in the form of banner ads, which contain a certain message from the advertiser on a website.

Email (firm-initiated): Advertisers may use the email address of a customer (if known) to send marketing messages.

Referral (customer-initiated and firm-initiated): Referral represents all traffic to the website resulting from external content websites; it can therefore be either customer- or firm-initiated.

Search (customer-initiated): When a customer searches for a keyword in a search engine, there are two types of results: Organic and Paid. Organic results are free, and the order in which they appear is based on the relevance of the keywords used. The paid search advertisement space is sold through a second-price auction.

Social (firm-initiated): Social media is used as an advertising platform, for instance Facebook, Twitter and LinkedIn.

Other (customer-initiated and firm-initiated): Other includes all forms of advertising that do not fit into one of the previously mentioned categories.


Chapter 5

Results and Analysis

5.1 Model selection

5.1.1 Predictive accuracy

Although the aim of conversion attribution is to determine the value of each channel, we also investigate the predictive accuracy of the selected models to assess their performance. It is important to note, however, that predictive accuracy is not consistent with attribution accuracy (Wang et al., 2017). Nevertheless, the receiver operating characteristic (ROC) curve indicates to what extent a model correctly predicts the conversion event per journey. The ROC curve describes the relationship between the False Positive Rate and the True Positive Rate: the False Positive Rate is the proportion of non-converting journeys that are predicted to convert, while the True Positive Rate is the proportion of converting journeys that are correctly predicted to convert. Reducing the ROC curve to a single value, the area under the ROC curve (AUC), allows us to compare the attribution models.

Every model except LTA produces a forecast of the conversion probability of each journey. We use these forecasts to compute the AUC between the predicted probabilities and the true conversion behavior. Although LTA is not an estimated model, we still compute its AUC in order to compare it with the other models; in this case the conversion probability of each journey is based on the conversion attribution of the last channel in the journey.

Although the AUC is a good predictive accuracy measure for classification models, we also consider the Logarithmic Loss (Log Loss) function. We include this measure because Log Loss punishes models that are extremely confident while being wrong, which the AUC does not account for, so Log Loss can provide additional insight into the predictive performance of the models. The Log Loss is calculated as follows:

\[
-\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \qquad (5.1)
\]

where $N$ is the number of journeys, $y_i$ is the correct classification for journey $i$ and $p_i$ is the model probability of assigning conversion to journey $i$. The model with the lowest Log Loss is preferred.
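For concreteness, a minimal Python sketch of both evaluation measures, assuming scikit-learn is available; `y_true` and `p_hat` are toy values, not the thesis data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def log_loss(y, p, eps=1e-15):
    """Equation (5.1); p is clipped away from 0 and 1 to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([0, 0, 1, 0, 1])                 # true conversion outcomes
p_hat  = np.array([0.01, 0.10, 0.80, 0.30, 0.55])  # model forecasts

print("AUC:     ", roc_auc_score(y_true, p_hat))
print("Log Loss:", log_loss(y_true, p_hat))
```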

To investigate the predictive performance of all models, the whole data set is randomly split into an estimation sample and an evaluation sample, yielding within-sample and out-of-sample results respectively. The estimation sample consists of 80% of the journeys in the data set, equal to 899,665 journeys; the remaining 20% (224,917 journeys) form the evaluation sample. Table 5.1 reports the AUC and Log Loss (LL) for both samples. Unfortunately, beyond a fifth-order MTD model the parameter estimation becomes too time-consuming, so we limit our analysis to a fifth-order MTD model. This is not what we expected, as the MTD model was introduced precisely to speed up the estimation of conversion attribution. Still, the MTD model allows for one additional order compared to the higher-order Markov model estimates of Anderl et al. (2016a).

Table 5.1: Predictive Accuracy

Model   AUC within sample   AUC out of sample   LL within sample   LL out of sample
LTA     0.5102 (0.0018)     0.5131 (0.0036)     0.4134             0.4141
LR1     0.8267 (0.0017)     0.8319 (0.0035)     0.0900             0.0897
LR2,1   0.6559 (0.0020)     0.6520 (0.0040)     0.0942             0.0937
LR2,2   0.8140 (0.0015)     0.8170 (0.0031)     0.0837             0.0826
LR2,3   0.8341 (0.0016)     0.8391 (0.0031)     0.0808             0.0794
LR2,4   0.8391 (0.0016)     0.8435 (0.0032)     0.0797             0.0784
LR2,5   0.8406 (0.0016)     0.8449 (0.0032)     0.0791             0.0780
LR2,6   0.8411 (0.0016)     0.8454 (0.0032)     0.0788             0.0777
MC1     0.6417 (0.0022)     0.6415 (0.0045)     0.1975             0.1943
MC2     0.6412 (0.0022)     0.6413 (0.0045)     0.1926             0.1894
MC3     0.6410 (0.0022)     0.6412 (0.0045)     0.1954             0.1947
MC4     0.6414 (0.0022)     0.6420 (0.0045)     0.1959             0.1989
MC5     0.6420 (0.0022)     0.6426 (0.0045)     0.1954             0.2040
MC6     0.6422 (0.0022)     0.6429 (0.0045)     0.1940             0.2072
MC8     0.6426 (0.0022)     0.6434 (0.0045)     0.1907             0.2125
MTD2    0.6801 (0.0027)     0.6860 (0.0053)     0.2161             0.2127
MTD3    0.8021 (0.0017)     0.8068 (0.0036)     1.8181             1.8173
MTD4    0.6949 (0.0017)     0.6949 (0.0033)     0.5251             0.5191
MTD5    0.6964 (0.0019)     0.7040 (0.0038)     0.4529             0.4387

Standard deviations in parentheses. LR1 is the first Logit model; LR2,k is the second Logit model of order k; MCk and MTDk represent the Markov and MTD model of order k, respectively.

The AUC values in Table 5.1 lie mainly in the range of 0.64–0.84; only LTA is an outlier with an AUC of 0.51, which means that the LTA model performs like a random classification model. Except for MTD3, MTD4, MTD5 and LTA, the Log Loss lies mainly between 0.079 and 0.220. Table 5.1 shows that LTA is outperformed by all models in terms of AUC, whereas in terms of Log Loss MTD3, MTD4 and MTD5 perform worse than LTA. The Markov model is outperformed by the MTD model in terms of AUC, but the opposite is true for Log Loss. It is interesting to see that the AUC of MTD5 is higher than that of MTD2, while in terms of Log Loss MTD2 outperforms MTD5.

For the reason mentioned earlier, we prefer the model with the lowest Log Loss rather than the highest AUC, and therefore select the second-order MTD model. Overall, both Logistic Regression models achieve the best predictive accuracy of all models; the second logistic regression performs significantly better when more than one lagged dummy variable is included in the model.

The within-sample and out-of-sample AUC and Log Loss vary little for all models, indicating a low risk of overfitting. The AUC suggests that including more memory in the model improves the predictive performance of all order-based models; the Log Loss shows that this does not hold for all models, which underlines the importance of including this measure.

5.1.2 Robustness

We have included the standard deviations of the AUC values in Table 5.1 to investigate robustness across cross-validation samples. The MTD model shows the most variation in its standard deviations, but overall the variation is small, which implies that the predictive accuracy of all models is robust.

As Anderl et al. (2016a) show, there is a trade-off between predictive accuracy and robustness, as the stability of the removal-effect variable decreases when the order increases. Furthermore, the average journey length is 1.63 touch-points, which means that choosing the order as high as possible barely improves the performance of the model. Taking this into account, and considering the AUC and Log Loss values of the models, we recommend the second-order Markov model for attribution modeling. It is important to note that this order choice may vary across data sets and industries. As the second-order Markov model will be used for the rest of the analysis, we also select the second order of the second logistic regression model as a comparison.

Continuing with the variability of the models, we investigate the standard deviations of the selected models from Table 5.1. The standard deviations are computed from the estimated channel effects rather than from the estimated parameters, so that the results are comparable across models; for the second logistic regression, for instance, we use a weighted average of the lagged dummy parameters per channel. Following the approach of Shao and Li (2011), we draw a random subset of the whole data set while keeping the journeys intact. This is repeated 1000 times; in every iteration the estimated channel effects are stored, after which we calculate the standard deviation per channel over those estimates. Unfortunately, this is not feasible for the MTD model, as the calculations are too time-consuming.
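The sketch below illustrates this subsampling procedure; `estimate_channel_effects` is a hypothetical placeholder for any of the attribution estimators, and the subset fraction is an assumption (the thesis does not report it).

```python
import numpy as np

def channel_effect_sd(journeys, estimate_channel_effects,
                      n_iter=1000, frac=0.5, seed=0):
    """Re-estimate channel effects on random subsets of complete journeys
    and return the standard deviation of the estimates per channel."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_iter):
        idx = rng.choice(len(journeys), size=int(frac * len(journeys)),
                         replace=False)            # journeys stay intact
        draws.append(estimate_channel_effects([journeys[i] for i in idx]))
    # draws is a list of dicts: channel -> estimated effect
    return {c: np.std([d[c] for d in draws]) for c in draws[0]}
```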


The results are presented below in Table 5.2.

Table 5.2: Standard Deviation Channel Effects

Channel          LR1      LR2,2    MC2      LTA
Paid Search      0.0041   0.0146   0.0028   0.0831
Organic Search   0.0041   0.0288   0.0020   0.0617
Affiliate        0.0055   0.0446   0.0014   0.0786
Direct           0.0116   0.0262   0.0025   0.1039
Display          0.0425   0.1440   0.0004   0.0097
Email            0.0174   0.0739   0.0009   0.0310
Referral         0.0139   0.0405   0.0017   0.0605
Social           0.0207   0.1101   0.0004   0.0748
Other            0.0027   0.1040   0.0006   0.0627

It can be seen that overall the second-order Markov model achieves the lowest standard deviations. The standard deviations of the higher-order Markov model can increase when the order is increased; however, if a higher-order Markov model shows the highest predictive accuracy, one could obtain more accurate estimates by increasing the number of simulations. Within MC2 the standard deviations are highest for Paid Search, Organic Search and Direct, all customer-initiated channels, while LR1 performs worst for the firm-initiated channels. The second logistic regression and LTA both have high standard deviations compared to the other models, with LR2,2 performing worse than LTA for the firm-initiated channels and Other.

5.1.3 Calculation times

As an additional selection criterion we include the calculation times of the models discussed in this research. Table 5.3 shows that LTA and MTD have the shortest and longest calculation times, respectively. Running an MTD model of order five takes more than eight hours, excluding data preparation time. Unfortunately, Anderl et al. (2016a) do not report their calculation times, which makes a comparison with their approach difficult; it may even be the case that MTD5 is faster than their fourth-order Markov model. The data set in this research is similar to those of Anderl et al. (2016a): their second data set contains around 1.1 million customer journeys and eight different channels, while ours contains around 1.1 million journeys and nine different channels. Hence, the absence of computational intractability in this research is not a consequence of a different data set; if anything, having more channels makes the Markov graph more complex, so computational intractability would have been even more likely here. Instead, the short calculation times result from using stochastic simulation and a C++ implementation of the higher-order Markov model. An eighth-order Markov model is even faster than the first and second Logit models.


Table 5.3: Calculation Time

Model    Time in seconds
LTA            0.84
LR1           13.974
LR2,1         12.833
LR2,2         18.936
LR2,3         28.884
LR2,4         41.463
LR2,5         56.095
LR2,6         72.916
MC1            2.536
MC2            3.541
MC3            3.793
MC4            4.100
MC5            4.251
MC6            6.098
MC8            9.982
MTD2         251.923
MTD5      29,236.122

NB: Calculation time is based on the whole data set.
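To illustrate the stochastic-simulation approach behind these short calculation times, the sketch below estimates conversion probabilities, and hence removal effects, by simulating random walks through the Markov graph. It is a simplified first-order illustration in Python (the thesis implementation is higher-order and written in C++); the state labels and simulation size are assumptions, and `Q` is any row-stochastic transition matrix over `states`.

```python
import numpy as np

def simulated_conversion_prob(Q, states, removed=None,
                              n_sim=100_000, seed=1):
    """Estimate P(conversion) by simulating journeys through the Markov
    graph; visits to a removed channel are redirected to NULL."""
    rng = np.random.default_rng(seed)
    idx = {s: i for i, s in enumerate(states)}
    hits = 0
    for _ in range(n_sim):
        s = "(start)"
        while s not in ("(conversion)", "(null)"):
            s = rng.choice(states, p=Q[idx[s]])
            if s == removed:              # removal effect: channel switched off
                s = "(null)"
        hits += s == "(conversion)"
    return hits / n_sim

# Removal effect of channel c (its conversion contribution):
#   1 - P_conv(graph without c) / P_conv(full graph),
# which is then normalized across channels to obtain attribution shares.
```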

5.2 Attribution results

Now that the order of the models has been established, we examine the attribution results of the models described in Chapter 3 in more detail, using the whole data set. First, the estimated transition matrix Q and the lag parameters λ of the second-order MTD model are discussed; both are shown below. Due to the large number of non-converting journeys, the probability of ending up in the NULL state is higher than the probability of converting for every channel. Furthermore, the probabilities of cycles, i.e. transitions from channel i back to channel i, are significantly higher than the probabilities of moving to a different channel. The vector of lag parameters shows that most weight is allocated to the second lag; according to the MTD model, the second lag is thus more important for conversion than the first lag.

                    

Estimated transition matrix Q (rows: origin state; columns: destination state):

State         conv    null    start   Affil.  Direct  Displ.  Email   Organ.  Other   Paid    Refer.  Social
(conversion)  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
(null)        0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
(start)       0.0000  0.9787  0.0000  0.0044  0.0000  0.0005  0.0000  0.0000  0.0001  0.0086  0.0073  0.0005
Affiliate     0.0065  0.3804  0.0000  0.5448  0.0022  0.0008  0.0011  0.0139  0.0098  0.0310  0.0084  0.0009
Direct        0.0521  0.2641  0.0000  0.0040  0.5453  0.0018  0.0091  0.0358  0.0032  0.0714  0.0117  0.0014
Display       0.0019  0.5609  0.0000  0.0016  0.0043  0.3489  0.0023  0.0103  0.0109  0.0390  0.0177  0.0020
Email         0.0240  0.2803  0.0000  0.0049  0.0216  0.0032  0.5507  0.0237  0.0089  0.0646  0.0162  0.0018
Organic       0.0124  0.3968  0.0000  0.0111  0.0081  0.0013  0.0024  0.3905  0.0022  0.1610  0.0125  0.0017
Other         0.0129  0.1853  0.0000  0.0186  0.0073  0.0057  0.0114  0.0184  0.6733  0.0540  0.0117  0.0014
Paid          0.0174  0.4071  0.0000  0.0090  0.0052  0.0022  0.0022  0.0590  0.0019  0.4765  0.0173  0.0024
Referral      0.0476  0.3006  0.0000  0.0121  0.0085  0.0044  0.0069  0.0229  0.0035  0.0811  0.5050  0.0076
Social        0.0067  0.4793  0.0000  0.0075  0.0043  0.0041  0.0027  0.0210  0.0036  0.0481  0.0432  0.3796

λ = (0.4153, 0.5847)'
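To make explicit how Q and λ enter the transition probabilities, recall the standard second-order MTD form (the lag-indexing convention here is an assumption, with $\lambda_1$ weighting the most recent touch-point, consistent with the interpretation above):

\[
P(X_t = j \mid X_{t-1} = i_1,\; X_{t-2} = i_2) = \lambda_1\, q_{i_1 j} + \lambda_2\, q_{i_2 j}.
\]

For instance, under the estimates above, a journey whose most recent touch-point is Paid Search and whose second-most-recent is Organic Search converts at the next step with probability approximately

\[
0.4153 \times 0.0174 + 0.5847 \times 0.0124 \approx 0.0145.
\]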


Next, we compare the last-click heuristic with the proposed Markov and MTD models. Table 5.4 presents the results. For all models, the customer-initiated channels Paid Search, Organic Search and Direct account for the majority of the contributions. This is in line with our expectations, as these channels are used most by customers. Both LTA and MTD assign less credit to firm-initiated channels than MC2. Like MC2, the MTD2 model assigns less credit to Paid Search, and significantly more to Organic Search, than LTA. The finding that LTA is biased towards Paid Search and undervalues Organic Search, Direct and Display is in line with previous findings of Anderl et al. (2016a), Xu et al. (2014), Shao and Li (2011) and Li and Kannan (2014).

Table 5.4: Attribution Results

Channel          MC2      MTD2     LTA
Paid Search      42.41%   44.26%   45.98%
Organic Search   15.41%   15.26%   12.87%
Affiliate         4.49%    3.39%    4.70%
Direct           24.43%   24.34%   23.77%
Display           0.49%    0.24%    0.26%
Email             2.99%    2.17%    2.25%
Referral          8.35%    8.99%    9.10%
Social            0.47%    0.46%    0.39%
Other             0.95%    0.89%    0.68%

In Figure 5.1 the variation of conversion attribution per channel across different orders of the Markov model is shown. It is clear that increasing the order decreases the amount of credit assigned to Paid Search, while Direct and Organic Search become approximately constant for order three and higher.

Figure 5.1: CA per channel for different orders MC

Taking a closer look at the channels that receive less attribution credit, shown in Figure 5.2, we see slightly more variation than would appear from Figure 5.1. Referral, Display and Social receive an approximately constant attribution value after order four, while Email and Other receive slightly more conversion attribution as the order increases. The conversion attribution varies little beyond order six, since the average journey length is 1.63 touch-points.

Figure 5.2: CA per channel for different orders MC

The results for the first Logistic Regression model are presented in Table 5.5, where we see that Direct is the strongest predictor of conversion, followed by Referral and Email. This partially deviates from the results of the Markov, MTD and LTA models, where Paid Search, Direct and Organic Search are the most important channels for conversion. This difference can be explained by the fact that the logistic regression focuses on the predictive ability of a variable rather than on the contribution of that variable to conversion (Anderl et al., 2016a). In the first Logit model only Social is insignificant, which may be due to the small number of observations for Social.

Table 5.5: Results Logit Model 1

Channel          β           Std. Error   exp(β)
Paid Search       0.163***   0.003        1.177
Organic Search    0.020***   0.004        1.020
Affiliate         0.051***   0.006        1.052
Direct            0.386***   0.005        1.471
Display          -0.258***   0.031        0.773
Email             0.272***   0.011        1.313
Referral          0.343***   0.008        1.409
Social            0.010      0.018        1.010
Other             0.028***   0.007        1.028
(Intercept)      -4.261***   0.008        0.014

N = 1,124,582. * p < 0.05, ** p < 0.01, *** p < 0.001
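As an illustration of how a model of this form can be estimated, the sketch below fits a journey-level logit on simulated channel touch-point counts using statsmodels. The simulation coefficients loosely mimic the magnitudes in Table 5.5 but are hypothetical, not the thesis estimates.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
X = pd.DataFrame({
    "paid_search":    rng.poisson(0.8, n),   # touch-point counts per journey
    "organic_search": rng.poisson(0.5, n),
    "direct":         rng.poisson(0.4, n),
})
# Rare-event intercept so the simulated conversion rate is around 2%.
lin = -4.2 + 0.16 * X["paid_search"] + 0.02 * X["organic_search"] + 0.39 * X["direct"]
y = rng.binomial(1, 1 / (1 + np.exp(-lin)))

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(np.exp(fit.params))  # odds ratios, cf. the exp(β) column in Table 5.5
```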

In Table 5.6 the results of the second Logistic Regression model are shown, with the estimated parameters for the first- and second-order lag effects. All estimated parameters are significant. Referral being present at the second lag is the strongest predictor, followed by
