Multi touch attribution: searching for the best attribution model


Multi Touch Attribution

Searching for the Best Attribution Model

Joël van Kesteren

Student number: 10001962

Date of final version: December 18, 2015

Master’s programme: Econometrics

Supervisor: Dr. N. van Giersbergen

Second reader: Dr. K.J. van Garderen

Facilitator: MIcompany (D. de Bruin, MSc & F. de Jong, MSc)


Abstract

The topic of attribution, which is determining the contribution of marketing channels to a purchase or conversion, is hot. In recent years, a plethora of models and methods to assign attribution have been proposed in the academic and business literature. However, the question of which model functions best according to objective criteria has largely been ignored. This thesis answers this question by developing a standardized methodology to evaluate attribution models on three dimensions: theoretically, empirically and in a simulation study. In the theoretical discussion, seven desirable criteria are formulated. The models are subsequently compared in the light of those theoretical criteria. For the empirical component, website visit data from a large Dutch financial institution are used to produce out-of-sample conversion classifications based on the touched channels. The Area Under the ROC Curve then serves as a measure to compare the attribution models. In the simulation study, data are generated according to a wide range of scenarios in which the true attribution is known. The models are judged on the Mean Absolute Error of their attribution with respect to the true attribution. This thesis finds that the logistic regression model performs best on all three dimensions. Moreover, the performance of this model can be significantly improved by a simple extension that incorporates timing effects. The Markov chain and probabilistic models perform surprisingly badly. As expected, the rule-based methods such as last touch do not perform well either. In conclusion, although the logistic regression model is far from perfect, partly due to endogeneity issues, this thesis recommends that companies employ it for the time being. Above all, however, it encourages econometricians and marketeers to develop new models and methods and to evaluate them with the methodology that this thesis has laid out. Following this route, the author is positive that the perfect attribution model will be found.


Chapter 1 Introduction

At this moment, more than 3 billion people around the world use the internet¹. This number has been increasing at an explosive rate since the introduction of the World Wide Web. With this vast reach, the web offers tremendous opportunities for marketing purposes. It is not surprising that digital marketing has grown likewise, making it a $121 billion industry in 2014 with a year-on-year growth of 16%². In addition to the potential reach digital marketing offers, it has two other significant advantages over traditional media. Firstly, an online advertisement can be uniquely tailored to each individual, providing perfect customization possibilities. Secondly, all visits of internet users can be tracked and stored, enabling perfect tracking of the number of views an advertisement gets and the number of subsequent product purchases or conversions. Theoretically, this gives marketeers the opportunity to accurately evaluate online advertisements or the channels that serve them. Typical online channels are search engine advertising, email, display and social media.

However, in practice this evaluation of channels proves to be rather challenging. Since potential customers or prospects typically ‘touch’ multiple channels before converting, the contribution of each of those channels should be determined. The problem of determining the contribution of the channels a prospect touches before conversion is referred to in the literature as the attribution problem. Traditionally, the full conversion credit is assigned to the last channel a prospect touches prior to conversion, a method called last touch attribution. However, this method is fundamentally flawed, since it completely ignores the contribution of channels in prior touches. Having realized this flaw, both the business world and academia have devoted themselves to finding a solution to the attribution problem. The result is that a plethora of attribution methods and models have been proposed.

¹ http://www.internetlivestats.com/internet-users/

² Report by ZenithOptimedia, 2014


Initial alternatives to last touch attribution that have been proposed are first touch attribution (assigning all conversion credit to the first touch) and linear attribution (assigning equal credit to each touchpoint). However, all three are still rule-based methods that a priori presuppose a certain weight for each touch without actually accounting for the data. In response, Shao and Li (2011) propose two data-driven attribution techniques: a (bagged) logistic regression model and a simple probabilistic model. Dalessandro et al. (2012) further refine this probabilistic model, and demonstrate that it is fundamentally equal to the well-known Shapley Value from cooperative game theory (Shapley, 1953). Anderl et al. (2014) introduce an entirely different solution to the attribution problem, modelling the prospect paths as Markov chains and calculating the attribution through a Removal Effect. Other research has tackled the issue of attribution through Survival Theory (Zhang et al., 2014) or Bayesian Models (Li and Kannan, 2014). In short, it is evident that an almost chaotic abundance of attribution methods and models has been put forward, with each author advocating their own solution. The extant literature fails to create order by addressing the question of which of all those methods is the best solution. Addressing and answering this question is the main goal of this thesis.
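To make the three rule-based methods concrete, the following minimal Python sketch assigns credit within converting journeys according to the last touch, first touch and linear rules. The function names, the representation of a journey as a list of channel names and the example journeys are illustrative assumptions, not taken from the thesis data.

```python
from collections import defaultdict

def rule_based_credits(journey, rule):
    """Per-touch credits for one converting journey (a list of channel names)."""
    n = len(journey)
    if n == 0:
        return []
    if rule == "last_touch":
        return [0.0] * (n - 1) + [1.0]   # all credit to the final touch
    if rule == "first_touch":
        return [1.0] + [0.0] * (n - 1)   # all credit to the first touch
    if rule == "linear":
        return [1.0 / n] * n             # equal credit to every touch
    raise ValueError(f"unknown rule: {rule}")

def channel_credit(journeys, rule):
    """Sum the per-touch credits per channel over all converting journeys."""
    totals = defaultdict(float)
    for journey in journeys:
        for channel, credit in zip(journey, rule_based_credits(journey, rule)):
            totals[channel] += credit
    return dict(totals)

# Hypothetical converting journeys, ordered from first to last touch.
journeys = [["display", "seo", "sea"], ["email", "sea"]]
last = channel_credit(journeys, "last_touch")
first = channel_credit(journeys, "first_touch")
linear = channel_credit(journeys, "linear")
```

In this toy example, last touch assigns both conversions to the final channel, whereas the linear rule spreads each conversion equally over the touched channels. None of the three rules uses any information from the data beyond the order of the touches.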

In order to do so, this thesis develops a methodology to evaluate attribution models on three dimensions: theoretically, empirically and in a simulation study. It considers the rule-based methods (last touch, first touch and linear), the probabilistic model, the logistic regression model and the Markov chain models. In the theoretical analysis, seven criteria are distilled from the extant literature and formulated by examining an abstract conception of the perfect attribution model. The attribution models are then evaluated in the light of these theoretical criteria. Empirically, we examine the performance of the models by their out-of-sample classification accuracy. The data for the empirical study contain all visits to a website of a large Dutch financial institution during ten months, including variables on the timing, the channel and whether a conversion takes place. By producing out-of-sample conversion classifications for this dataset and determining the Area Under the ROC Curve (AUC), the predictive performance of the models can be objectively assessed. Underlying this assessment is the assumption that accurate prediction implies accurate attribution. The third component of our study consists of simulating a wide variety of scenarios in which the true attribution is known, and calculating the Mean Absolute Error of the models with respect to this true attribution. This simulation study is useful in determining which attribution method performs best under which circumstances found in the data, and provides another objective criterion to evaluate the models. In addition to providing an answer to the question of which of the considered attribution models performs best, the most important contribution of this thesis is a standardized methodology for evaluating models.
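The two evaluation measures mentioned above can be sketched in a few lines of Python. The AUC is computed here through its rank interpretation (the probability that a randomly chosen converter receives a higher predicted score than a randomly chosen non-converter, with ties counted as one half). The Mean Absolute Error is assumed, for the purpose of this sketch, to be averaged over channels; the function names and the input format are hypothetical.

```python
def auc(labels, scores):
    """Area Under the ROC Curve via pairwise comparison of converters and non-converters."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one converter and one non-converter")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # converter ranked above non-converter
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

def mean_absolute_error(estimated, true):
    """Average absolute gap between estimated and true attribution (dicts keyed by channel)."""
    return sum(abs(estimated.get(c, 0.0) - true[c]) for c in true) / len(true)
```

The pairwise loop is quadratic in the number of journeys, so for a dataset of realistic size a sorting-based or library implementation would be preferable; the sketch only illustrates what the measure captures.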

The structure of this thesis is as follows. In Chapter 2 the most popular attribution methods and models are introduced, discussed and evaluated in the light of our derived theoretical criteria. Chapter 3 explains the methodology for the empirical and simulation analysis in this thesis. Chapter 4 describes which data are used for the empirical analysis and how they are processed. The results of our research are presented in Chapter 5. Finally, Chapter 6 concludes with the answer to our main question, a brief discussion and potential directions for further research.


Chapter 2 Theory

In this chapter, the theory and literature behind digital marketing and attribution modelling are discussed. First, Section 2.1 discusses digital advertising, its advantages and preconditions. Section 2.2 introduces the most common digital channels. Finally, Section 2.3 reviews the literature on multi touch attribution. Theoretical criteria for a good attribution model are derived and the most important attribution models are introduced, discussed and evaluated in the light of those criteria.

2.1 Digital advertising

Advertisements are used by a brand to communicate a message to an audience, usually in order to persuade it to undertake some action. Such an action can for instance be purchasing a product, signing up for a service or visiting a shop, and will in the remainder of this thesis be referred to as a conversion. Traditionally, advertisements have reached their audience through media such as television, radio, newspapers and outdoor advertising. During the early days of the internet, use of the network for commercial purposes was prohibited. However, this ban was gradually phased out and since the internet became widely popular in the late 1990s, online advertising has become one of the most popular advertising media. In 2013, online advertising revenue was $42.8 billion in the United States alone, thereby surpassing television broadcast spending¹.

We can distinguish three reasons that explain the popularity of advertising on the World Wide Web. Firstly, the network has an immense potential reach, having an estimated number of users that exceeds three billion at the moment of writing. Secondly, the internet has the potential to identify customers at an individual level, enabling advertisers to customize an advertisement, which increases its effectiveness. Finally, the reach of and response to online advertisements can be stored and monitored on an individual level, enabling accurate performance evaluations. In the next paragraphs we will further elaborate on the latter two advantages of online advertising.

¹ Report by ZenithOptimedia, 2014

In the marketing literature, the effect of customization has been studied extensively. Ansari and Mela (2003) for instance argue that customized and targeted advertisements attract customer attention and foster customer loyalty, thereby having a considerably higher probability of persuading customers to a desired end. Customized advertisements are, if targeted appropriately, much more capable of fulfilling a customer’s need than broad and general advertisements. Advertisements can be personalised through their content, message or visual representation. However, customization through traditional mass media such as television or radio is only possible at a collective level.

In contrast, digital advertising has the major advantage that its advertisements can be tailored to each unique individual. An advertiser can decide to change the advertisement based on the past browsing history or collected preference information of a potential customer. This can be done through models and algorithms, making the ‘e-customization’ quick and easy. The research of Ansari and Mela (2003) is one of the first to develop such a model, with the purpose of optimally customizing the content and representation of e-mails. They find that the expected click-through rate of these e-mails can be increased by 62%.

Another major advantage of online marketing as opposed to traditional marketing is that the performance of digital advertisements can be evaluated much more accurately than that of ads in traditional media. Every impression, click and conversion per advertisement is recorded on an individual level. The digital advertising medium is therefore perfectly suited to accurately evaluate how many conversions or how much revenue each advertisement brings in. This enables marketeers to calculate each advertisement’s marketing Return On Investment (ROI). Based on this ROI, budget allocation to the different online advertisements and channels can be improved, eventually resulting in a more profitable firm.

In contrast, analysis of the performance of traditional media such as television and radio can only be done through aggregated data or expensive and untrustworthy surveys. Say, for instance, that we want to evaluate the performance of a large television campaign for a hotel chain. Our best option is to compare the number of bookings during our campaign period with the baseline number of bookings. The additional bookings can be attributed to the television advertisement. However, this method is based on a strong assumption, since all other factors explaining the variation in the number of bookings are ignored. Moreover, this method becomes complex when multiple advertisements are displayed on different channels. Alternatively, some of the customers might be asked to fill out a questionnaire asking them which channel predominantly induced them to book. However, those surveys are generally unreliable for reasons such as the difficulty of acquiring a representative sample or the ignorance and forgetfulness of participants.

A precondition to both advantages of online advertising, that is, the possibility of customization and improved performance evaluation, is the ability to identify unique persons from the data. If this precondition is fulfilled, we can reconstruct full online prospect or customer journeys, containing all visits, touchpoints or touches (all concepts are used interchangeably in this thesis) a person makes prior to converting. This identification of individuals across multiple visits turns out to be non-trivial. In the literature, this is usually done by HTTP cookies. Additionally, this thesis advocates the use of IP-addresses.

Websites send cookies, small pieces of data, to a user’s web browser while the user visits the website for the first time. These cookies are then locally saved on the user’s device. Each subsequent time the user visits the website again with the same device, the browser notifies the website that it concerns the same web browser and device. It is likely that this subsequent visit containing the same cookie pertains to the same person. This is how cookies identify a user across multiple visits.

However, solely using cookies to reconstruct full customer journeys, although common in the literature, has its shortcomings. The predominant reason is that cookies are not able to identify a person across all visits. To illustrate this, suppose a user visits the website multiple times from different browsers or different devices. These visits cannot be related to the same individual by the use of cookies alone. Moreover, cookie tracking can be disabled. Cookies can therefore only relate part of a user’s visits to this same user. A second disadvantage is that different persons can visit the website on the same device, user account and browser, which cookies consider the same individual. However, since people are increasingly using their own devices, we will assume this risk to be small.

The limitation that cookies cannot bundle all visits pertaining to the same user can be partly overcome by complementing information from cookies with information from the public IP-address of a visit. A public IP-address is a numerical label that is unique to an internet connection. Visits with the same IP-address are therefore likely to come from the same household and thus possibly from the same person. However, relating visits to a unique person this way should be done with great caution, since multiple persons may form a household. Moreover, institutions such as offices, libraries or universities generally have a single IP-address to which multiple people connect. Due to these drawbacks, combining visits based on IP-addresses is unusual in the literature. However, with the inclusion of some restraining conditions, we argue in Chapter 4 that IP-bundling can be done to further fine-tune our prospect identification across multiple visits. By using cookie and IP-address information intelligently, a full online customer journey can be reconstructed, containing all relevant touches a person has prior to (non-)conversion.
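The idea of bundling visits by cookie and, cautiously, by IP-address can be illustrated with a small union-find sketch in Python. The actual restraining conditions are discussed in Chapter 4 and are not reproduced here; the cap `max_cookies_per_ip` below is a purely hypothetical stand-in for such a condition, as are the function name, the input format and the example addresses.

```python
from collections import defaultdict

def stitch_visits(visits, max_cookies_per_ip=3):
    """Map each visit to a prospect label.

    visits: list of (visit_id, cookie_id, ip_address) tuples.
    Visits sharing a cookie are always bundled. Cookies sharing an IP-address
    are bundled only if that IP-address has at most `max_cookies_per_ip`
    distinct cookies (a hypothetical guard against offices or libraries).
    """
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    cookies_per_ip = defaultdict(set)
    for _, cookie, ip in visits:
        parent.setdefault(cookie, cookie)
        cookies_per_ip[ip].add(cookie)

    for cookies in cookies_per_ip.values():
        if len(cookies) <= max_cookies_per_ip:
            first, *rest = sorted(cookies)
            for other in rest:
                union(first, other)

    return {visit_id: find(cookie) for visit_id, cookie, _ in visits}

# Hypothetical visits; the IP-addresses are documentation-range examples.
visits = [
    ("v1", "c1", "192.0.2.1"),
    ("v2", "c2", "192.0.2.1"),     # other browser, same connection as v1
    ("v3", "c3", "198.51.100.7"),
    ("v4", "c1", "203.0.113.5"),   # same cookie as v1, different connection
]
prospects = stitch_visits(visits)
```

In this example, v1, v2 and v4 end up with the same prospect label, while v3 remains a separate prospect. Lowering the cap to 1 disables IP-bundling for the shared connection, so that v1 and v2 are treated as different prospects again.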

Figure 2.1: The AIDA funnel.

Such a customer journey is extensively described in the marketing literature. Typically, it contains multiple engagement phases. One of the most influential models to regard this journey is AIDA (Strong, 1925), which is an acronym for Attention, Interest, Desire and Action. Usually, these different engagement phases are illustrated as a funnel, indicating that a certain number of prospects are lost in each consecutive phase (see Figure 2.1). However, modern-day literature considers the AIDA model inappropriate and unrealistic. New engagement phases such as Satisfaction and Confidence have been proposed (Barry, 1987). Moreover, new theories based on empirical research state that prospects do not necessarily engage with each of these phases or may do so in a different order. A customer journey is nowadays seen as a constant process of information-gathering and decision-making (Patricio et al., 2011). Nevertheless, AIDA still tends to serve as the reference point for research on customer journeys.

This thesis focuses on the online component of this customer journey. Hence, all visits to the website prior to (non-)conversion are considered. A prospect can reach a website in multiple ways: by entering the exact web address, by clicking on a link in an email or by using a search engine. These ways of reaching a website are called digital or online channels. Rather than attributing credit to individual advertisements, this thesis aims to investigate the contribution of each channel to the total number of conversions. It is therefore important to become familiar with the different digital channels, which is the topic of the next section.

2.2 Digital advertising channels

A website can be reached through different channels. The online traffic that flows from many of those channels can be influenced by advertising. Evidently, those online advertising channels are of specific interest for marketeers. The main online advertising channels are search engine marketing, affiliate marketing, display, email and social media advertising. We will discuss each of those channels individually in this section, starting with search engine marketing.

Search engine marketing aims to improve the visibility of the advertiser on search engines such as Google, Bing or Yahoo. This can be done through bidding for specific keywords (search engine advertising, SEA) or adjusting the website in a specific way to achieve a higher ranking (search engine optimization, SEO). Both SEA and SEO are important channels for advertisers, since more than 90% of all internet users make use of search engines to acquire information and orient themselves on the products they need or desire². Advertised SEA results are displayed above the organic SEO results.

The position of a specific SEA result depends on the bid of the advertiser, a quality score and the expected impact of possible extensions. The bid of advertisers on keywords is expressed as a paid amount per click (Cost per Click or CPC). As long as the user does not click on the SEA result of their query, the advertiser therefore incurs no costs. This explains why search engines additionally base the SEA result positioning on a quality score, which is a function of the expected number of clicks (click-through rate or CTR), the relevance of an ad and a user’s landing page experience. Finally, the impact of extensions, such as features that show extra business information (e.g. a telephone number or address), is taken into account. The amount an advertiser pays is the minimum it should have bid to beat the advertiser one position below, which is a variant of the second-price auction of Vickrey (1961). In practice, the paid amount is usually significantly lower than the bid, especially when one has a good quality score. Skiera and Nabout (2013) develop a model to find the optimal bidding amount that leads to the highest profit for each keyword. In their model, they presuppose a causal relation between position and the relative number of clicks (CTR) and estimate this relation statistically. They find that a lower rank number, that is, a position closer to the top, gives a higher CTR, which confirms the intuition that users scan their results from top to bottom.
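The pricing mechanism described above can be illustrated with a minimal Python sketch. It rests on a simplifying assumption that is not taken from the text: the ranking score of an advertiser is modelled as bid times quality score, and the impact of extensions is ignored. Under that assumption, the minimum bid needed to beat the advertiser one position below equals that competitor's ranking score divided by one's own quality score. All names and numbers are hypothetical.

```python
def run_auction(advertisers):
    """Rank advertisers and compute the price per click each would pay.

    advertisers: list of (name, bid, quality_score) with quality_score > 0.
    Assumed ranking score: bid * quality_score (extensions are ignored).
    """
    ranked = sorted(advertisers, key=lambda a: a[1] * a[2], reverse=True)
    results = []
    for pos, (name, bid, quality) in enumerate(ranked):
        if pos + 1 < len(ranked):
            _, next_bid, next_quality = ranked[pos + 1]
            # Smallest bid that still matches the score of the advertiser below.
            price = next_bid * next_quality / quality
        else:
            # No advertiser below; a real platform would apply a reserve price.
            price = 0.0
        results.append((name, pos + 1, price))
    return results

# Hypothetical bids (in currency units per click) and quality scores.
outcome = run_auction([("A", 2.0, 0.9), ("B", 3.0, 0.5), ("C", 1.0, 0.8)])
```

In this example, advertiser A obtains the top position despite bidding less than B, and pays roughly 1.67 per click rather than its bid of 2.0, which illustrates why a good quality score lowers the paid amount.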

Within the realm of SEA, two sub-channels are distinguished based on the nature of the keywords: branded and non-branded SEA. Branded keywords specifically refer to the advertised brand. When someone searches for branded keywords, one might assume that (s)he prefers to purchase a product or acquire information from that company. Since keywords that include the advertiser’s own brand are highly relevant to the internet user, the advertiser’s quality score will generally be unbeatable. Therefore, for branded keywords relatively low bids are sufficient to gain a top position. In contrast, the competition for non-branded or generic keywords such as ‘car insurance’ or ‘laptop’ is much higher.

Search engine optimization (SEO) is the process of optimizing the ranking of unpaid or organic results. According to an eye tracking study³, around 70% of search engine users skip the advertisement results. SEO is therefore an extremely valuable marketing channel. The position of a result is determined by the relevance of the content of a website to specific keywords. Strategies that are used to improve the ranking may include increasing the number of backlinks (incoming links to a website), editing content in HTML or removing barriers to the indexing activities of search engines.

² Pew Internet Survey, May 2011

³ Performed by GfK, gfk.com

Another digital advertising channel that causes traffic to flow to a website is called affiliate. An affiliate is a third party that links to the advertiser’s website. Examples of affiliate parties include price comparison websites, web directories or product review sites. The majority of advertisers that engage in affiliate programs reward the affiliate by a certain amount per sale (Pay Per Sale or PPS). Closely related to affiliate is the referral channel. Whereas affiliates are primarily motivated financially, referrers refer prospects to brands they know well and have a good relationship with. The motives and relationship with the advocated brand are therefore a fundamental difference between the affiliate and referral channel.

A very well-known digital advertisement channel is display or banner advertising. Research has shown that internet users find (a large number of) banners annoying (Cho, 2003). However, Kireyev et al. (2013) find that banners have a significant indirect effect. Although prospects do not directly click on banners because they find them annoying or untrustworthy, Kireyev et al. (2013) find that these prospects do have a larger probability of searching for the displayed products through other channels. The most popular compensation scheme for display advertising, which is paying per click on a banner, may therefore not accurately reflect the true influence of the channel. As will become clear later, this problem can be solved by attribution modelling.

Finally, although they play a minor role in this thesis, two other digital advertising media, e-mail and social media advertising, are also worth mentioning. The reach of e-mail advertising is restricted to prospects that have provided their e-mail addresses on the website and to current or past customers. However, the target group of e-mail advertising is more engaged with the firm and is thus more likely to convert. Given the immense popularity of Facebook and Twitter, the social media channel is important for marketeers as well. Social media advertising can be seen as an “effort to create content that attracts attention and encourages readers to share it across their social networks”⁴. Social media marketing is effective because people generally place more trust in the word of mouth of friends in their social network than in firms.

Having briefly introduced the most important digital advertising channels, we now turn to the question of how to evaluate a channel’s performance. Since prospects typically touch multiple channels, this issue is not as straightforward as one would expect. After all, a method must be devised to fairly distribute credit over the channels. These methods or attribution models are the topic of the next section.


2.3 Multi Touch Attribution

This section provides an overview of the existing literature on attribution models. We will first review the criteria that have been formulated for a proper attribution model and subsequently formulate our own seven criteria. Then, mathematical notation around attribution modelling is introduced. Finally, the different attribution models are discussed and evaluated in the light of our criteria.

The topic of attribution modelling has recently gained widespread interest in the marketing literature. The main explanation for this popularity is the growing importance of digital marketing and its potential to track all the online channel touches of internet users. If a company is able to gather and store data concerning the clicks of its visitors, a full online customer journey can be reconstructed. From these journeys, the credit of each visit to a conversion can be attributed. Since a visit always comes from a certain channel, channels can thus be assigned a part of the (total) conversion credit. Attribution can be expressed as the absolute number of conversions or as a percentage of the total amount of conversions driven by a certain channel. In theory, an infinite number of methods to attribute can be thought of, raising the question which method most accurately reflects the ‘true’ contribution of a certain visit. This is a fundamental question in order to evaluate different channels or advertisements based on their true performance.

2.3.1 Criteria

Besides developing a variety of models, the extant literature has been concerned with formulating criteria in order to determine what is a ‘good’ attribution model. This search for universally accepted and standardized attribution criteria is important for two reasons. First and foremost, the true attribution of a certain channel is unobserved, making the topic inevitably subjective to some extent. Secondly, the actual implementation of attribution models by marketeers requires more practical criteria as well.

Shao and Li (2011) propose a bivariate metric to evaluate an attribution model: a metric that evaluates both accuracy and variability. Accuracy means that a proper model must be able to classify prospects as converters or non-converters. They evaluate accuracy by the out-of-sample misclassification error rate. This is mathematically expressed as $(FP + FN)(TP + TN + FP + FN)^{-1}$, with the elements explained in the confusion matrix in Table 2.1. Strangely, Shao and Li (2011) do not report any threshold to classify a probability as a predicted conversion or non-conversion, and it is thus unclear how they produced the exact numbers for their accuracy metric. In addition to predictive power, they state that the variability of the model’s parameter estimates is important. Consequential decisions of marketeers may after all be based on the parameter estimates of the attribution model, such as performance evaluations and subsequent budget allocation of channels. It is therefore desirable to have an attribution model with stable and reliable parameter estimates. They calculate the variability by taking the average standard error of the estimated coefficients of the model, or $n^{-1} \sum_{i=1}^{n} SE(\hat{\beta}_i)$ for a model that has $n$ estimated coefficients $\hat{\beta}_i$.

                    Predicted outcome
Actual outcome      Conv′                   Non-conv′
Conv                True Positive (TP)      False Negative (FN)
Non-conv            False Positive (FP)     True Negative (TN)

Table 2.1: Confusion matrix illustrating the four quadrants into which an out-of-sample prediction can fall
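As an illustration of the two metrics of Shao and Li (2011), the following Python sketch computes the misclassification error rate from predicted conversion probabilities and the average standard error of a set of coefficient estimates. Since no classification threshold is reported in their paper, the default threshold of 0.5 is an assumption made purely for this sketch; the function names and inputs are hypothetical as well.

```python
def misclassification_rate(labels, probabilities, threshold=0.5):
    """(FP + FN) / (TP + TN + FP + FN) for a given classification threshold."""
    tp = tn = fp = fn = 0
    for y, p in zip(labels, probabilities):
        predicted = 1 if p >= threshold else 0   # assumed thresholding rule
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return (fp + fn) / (tp + tn + fp + fn)

def average_standard_error(standard_errors):
    """Variability metric: the mean of the standard errors of the n coefficients."""
    return sum(standard_errors) / len(standard_errors)
```

The sketch makes the dependence on the threshold explicit: the same predicted probabilities yield different error rates for different thresholds, which is one reason why a threshold-free measure such as the AUC is attractive.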

Dalessandro et al. (2012) extend the criteria of Shao and Li (2011) with interpretability, arguing that a proper attribution model should be “generally accepted by all parties with material interest in the system, on the basis of its statistical merit, as well as on the basis of intuitive understanding of the components of the system”. Finally, Anderl et al. (2014) formulate as many as six evaluation criteria for attribution models. In addition to the mentioned criteria, they argue for the importance of versatility and algorithmic efficiency. Versatility is defined as the ability to incorporate new information and fit company-specific requirements, and algorithmic efficiency simply reflects the speed of computing model outputs. These criteria are derived from the more practical aim of the paper of Anderl et al. (2014) to develop a model that is comprehensible for managers and easily implementable for a wide range of companies.

In order to evaluate and compare the different attribution models on the theoretical dimension, this thesis will formulate seven desirable qualities. Due to the academic nature of this thesis, we are less concerned with the business-relevant criteria postulated by Anderl et al. (2014). We look for the best theoretical attribution model rather than the one that can be most straightforwardly explained to a manager. Further, it should be noted that Shao and Li (2011)’s criteria are not included since they are practical performance evaluation metrics rather than theoretical qualities. Our seven theoretical criteria are as follows:

1. Data-driven: first and foremost, a good attribution model should be data-driven. If a model is not data-driven and attributes conversion credit based on some a priori determined distribution, the subsequent attribution is completely biased and unverifiable. Rather, this distribution should be based on information derived from the data.

2. Ability to predict: a good attribution model, one that is able to accurately judge the value of each touchpoint, should be able to estimate the probability of a conversion or a non-conversion given some touchpoints. Moreover, a model’s ability to predict gives us an objective standard to evaluate the empirical performance of each of the models. We will see that models have been proposed that are not predictive, making it much harder to determine whether they attribute correctly.

3. Individual level credit attribution: a desirable quality of a model is its ability to attribute credit on an individual customer journey’s level.

4. Channel contribution heterogeneity: given a prospect and a time, a certain channel can be more contributory to a conversion than another channel. This is a quality that is ideally allowed for in a model’s structure.

5. Differences in contribution over time: the contribution of a channel for a given prospect can differ over time, for instance when a channel is touched closer to the time of conversion. This is a dimension that ideally can be incorporated into a model as well. The timing element can either be accounted for explicitly as a timestamp or relatively as the sequential touchpoint within a journey.

6. Prospect heterogeneity: given a touchpoint with a channel at a specific time, its contribution to a conversion may differ across prospects. Although hard to implement in a model, this is certainly a desirable characteristic from a conceptual standpoint.

7. Intuitive restrictions: intuitively, there are two main additional restrictions that attribution models should account for:

• The conversion credit for a channel must be between 0 and the number of conversions that have touched this channel. The equivalent on an individual level is that a channel's contribution to a conversion must be between 0 and 1.

• A model should be able to incorporate information from all touchpoints in a journey.

In the remaining sections of this chapter the attribution methods and models will be introduced and examined in the light of these seven criteria. Interestingly, we will see that none of the models fulfils all conditions, perhaps indicating that the perfect attribution model does not yet exist.

2.3.2 Models

Before presenting the different attribution models, it is convenient to introduce some mathematical notation. Let there be $i = \{1, 2, \ldots, N\}$ prospects who each have an online journey with $j = \{1, 2, \ldots, J_i\}$ visits. The $j$th visit of prospect $i$ is denoted by $v_{i,j}$. For converting prospects, only visits prior to the conversion are considered. Each visit comes from a channel $C_k$, for $k = \{1, 2, \ldots, K\}$ channels. The function that maps a visit to a channel is $C(v_{i,j})$. A prospect journey can either turn into a conversion or a non-conversion:

$$y_i = \begin{cases} 1, & \text{Conversion} \\ 0, & \text{Otherwise} \end{cases} \tag{2.1}$$

The entire journey of prospect $i$ can then be formally represented by $PJ_i = \{\{v_{i,j}\}_{j=1}^{J_i}, y_i\}$.

In case of individual-level attribution, each visit $v_{i,j}$ has an attribution $a_{i,j}$ given by the function $a_{i,j} = a(v_{i,j})$ under the restrictions that $0 \le a(v_{i,j}) \le 1$ and $\sum_{j=1}^{J_i} a(v_{i,j}) = y_i$.

The restrictions imply that a non-converting journey gives a credit of 0 to all visits. The attribution $A_k$ of a channel $k$ as a percentage of the total number of conversions can then be calculated as follows:

$$A_k = \frac{\sum_{i=1}^{N} \sum_{j: \{C(v_{i,j}) = C_k\}} a_{i,j}}{\sum_{i=1}^{N} y_i} \tag{2.2}$$

As we will see, not all attribution methods are able to attribute individually, so sometimes a model produces estimates for $A_k$ directly.
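As an illustration of Equation (2.2), the following minimal sketch aggregates visit-level credits into channel shares. The data layout (a list of `(visits, y)` pairs), the function name `channel_attribution` and the toy journeys are assumptions made for this example only; they are not taken from the thesis data.

```python
from collections import defaultdict


def channel_attribution(journeys):
    """Aggregate visit-level credits a_ij into channel shares A_k (Equation 2.2).

    journeys: list of (visits, y) pairs, where visits is a list of
    (channel, credit) tuples and y is 1 for a conversion, 0 otherwise.
    Assumes at least one conversion, otherwise the shares are undefined.
    """
    total_conversions = sum(y for _, y in journeys)
    credit = defaultdict(float)
    for visits, y in journeys:
        # Restriction from the text: the credits of a journey sum to y_i.
        assert abs(sum(a for _, a in visits) - y) < 1e-9
        for channel, a in visits:
            credit[channel] += a
    return {k: c / total_conversions for k, c in credit.items()}


# Hypothetical toy data: two conversions and one non-conversion.
toy = [
    ([("affiliate", 0.5), ("seo", 0.5)], 1),
    ([("seo", 1.0)], 1),
    ([("display", 0.0)], 0),
]
shares = channel_attribution(toy)
```

Because every converting journey distributes exactly one unit of credit, the resulting shares sum to one across channels.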

Now that we have formally defined all the elements of the attribution problem, we can turn our attention to the different models to see how they propose to solve the attribution problem (in other words, how they estimate the $A_k$'s). First, we will discuss the simple and mainstream non-statistical rule-based heuristics. Thereafter, the more complex, mathematical or statistical models (probabilistic model/Shapley, logistic regression and Markov chain) are introduced.

Rule-based heuristics

Rule-based heuristics are non-statistical methods to attribute conversion credit. A priori, a distribution of the weights of the touchpoints is established for these methods. We distinguish single touch and multi touch heuristics.

The most frequently applied single touch heuristic is last touch attribution. Last touch attribution assigns all credit to the last visit a prospect touches before conversion or, mathematically expressed:

$$\hat{a}_{i,j,LT} = \begin{cases} 0, & j = \{1, 2, \ldots, (J_i - 1)\} \\ 1, & j = J_i \end{cases} \tag{2.3}$$

The popularity of this method is due to its intuitive and computational simplicity. Only information about the last touch serves as input to the method, making the reconstruction of a full customer journey unnecessary. However, the fact that it completely ignores information about the prior touches makes it a fundamentally flawed heuristic. Suppose a prospect reaches the website of an online vacation retailer through an affiliate party and gathers all the information he needs, but then needs a night's sleep to decide whether he is going to purchase a trip. Waking up the next morning, he decides to buy it, quickly uses a search engine to find the relevant page and instantly converts. Last touch attribution will assign the full credit to organic search (SEO) and no credit to the affiliate party, which intuitively does not make sense. If a marketeer uses last touch attribution, he might unjustly decide to stop allocating funds to the affiliate party, thereby losing many more conversions than he is aware of. In practice, this means that channels that typically appear in the beginning of a journey, while a prospect is still in the orientation phase, are highly undervalued. Examples are banner advertisements or affiliate parties. In contrast, channels that appear later in the journey such as direct (typing the URL of the website in the browser) or organic search are overvalued, even though those channels might predominantly be reached by prospects that have already made up their mind to buy the product and are only looking for the easiest way to reach the website.

To counter this bias that favours later touchpoints, another single touch heuristic named first touch attribution is introduced. Mathematically, this heuristic assigns a weight to each individual visit as follows:

$$\hat{a}_{i,j,FT} = \begin{cases} 0, & j = \{2, 3, \ldots, J_i\} \\ 1, & j = 1 \end{cases} \tag{2.4}$$

However, as one can imagine, first touch attribution is far from perfect either, since a new bias is introduced. Channels that are typically touched later in the journey such as organic or sponsored search are now undervalued, since they are given no conversion credit in journeys with more than one touch. In addition, channels that usually occur in the beginning of a journey such as affiliate or display are overvalued.

Both the last touch and the first touch heuristic fail to take into account the information of customer journeys with multiple touches. However, an advantage of the single touch heuristics for the purposes of this thesis is their potential to be transformed into a predictive model. Empirical conversion probabilities can be calculated for each channel given a certain position (e.g. first or last) and used as probability predictions for out-of-sample observations. To illustrate, for the last touch heuristic the empirical probability of conversion given that the last touched channel is $k$ is as follows:

$$\hat{P}(y_i = 1 \mid C(v_{i,J_i}) = C_k) = \frac{\sum \#\{C(v_{i,J_i}) = C_k,\ y_i = 1\}}{\sum \#\{C(v_{i,J_i}) = C_k\}} \tag{2.5}$$

This equation takes the number of conversions with last touch $k$ divided by the total number of journeys with last touch $k$. The predictive potential of single touch heuristics is an advantage since it enables comparing their performance both with each other and with other models.
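The empirical probability in Equation (2.5) amounts to a pair of counts per channel. The sketch below is one possible way to compute it; the `(channels, y)` journey layout, the function name and the training journeys are hypothetical and serve only to illustrate the calculation.

```python
from collections import Counter


def last_touch_conversion_rates(journeys):
    """Empirical P(y = 1 | last touched channel), as in Equation (2.5).

    journeys: list of (channels, y) pairs, with channels in visit order.
    """
    touched = Counter()    # journeys whose last touch is channel k
    converted = Counter()  # converting journeys whose last touch is channel k
    for channels, y in journeys:
        last = channels[-1]
        touched[last] += 1
        converted[last] += y
    return {k: converted[k] / touched[k] for k in touched}


# Hypothetical training journeys.
train = [
    (["affiliate", "seo"], 1),
    (["seo"], 0),
    (["display", "direct"], 1),
    (["direct"], 1),
    (["seo"], 0),
    (["display"], 0),
]
rates = last_touch_conversion_rates(train)
```

An out-of-sample journey would then be scored with the rate of its own last channel. Channels that never occur as a last touch in the training data have no estimate, which a real implementation would have to handle explicitly.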

A straightforward solution to the bias of single touch heuristics is to assign equal conversion credit to all touchpoints:

$$\hat{a}_{i,j,LIN} = \frac{1}{J_i}, \quad \forall j \tag{2.6}$$

This rule-based method is unsurprisingly called linear touch attribution. Although less fundamentally flawed, linear touch attribution still assigns an arbitrary weight to each touchpoint independent of its true contribution. It ignores potential contribution differences between channels: channel X might be generally more effective in persuading prospects to convert than channel Y. Moreover, it completely disregards differences over time, whereas touches in the beginning or end of a journey may be much more effective and influential than touches in the middle. Linear touch attribution, although in expectation closer to the true attribution than first or last touch, is still not the 'holy grail' of the attribution problem.
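The three heuristics in Equations (2.3), (2.4) and (2.6) can be written as weight vectors over the visits of one converting journey. The function names below are chosen for this sketch only.

```python
def last_touch(n_visits):
    """Equation (2.3): all credit to the final visit."""
    return [0.0] * (n_visits - 1) + [1.0]


def first_touch(n_visits):
    """Equation (2.4): all credit to the first visit."""
    return [1.0] + [0.0] * (n_visits - 1)


def linear_touch(n_visits):
    """Equation (2.6): equal credit 1 / J_i to every visit."""
    return [1.0 / n_visits] * n_visits


# Example journey with three visits (channel names are hypothetical).
path = ["affiliate", "display", "seo"]
credits = {
    "LT": dict(zip(path, last_touch(len(path)))),
    "FT": dict(zip(path, first_touch(len(path)))),
}
```

None of the three functions looks at the channel names or at any data, which is exactly the sense in which these heuristics are not data-driven.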

Wooff and Anderson (2013) decide to employ an attribution method that integrates the knowledge of marketing industry experts. They interview marketeers and conclude that marketeers generally regard the last clicks as most valuable, followed by the first clicks and then the intermediate clicks. Based on this conclusion, they propose to assign conversion credits for each touchpoint on the basis of an asymmetric U-shaped function:

$$\hat{a}_{i,j,WA} = k\,t^{a-1}(1 - t)^{b-1} \tag{2.7}$$

In this expression, $0 < t < 1$ is the relative time in the click path and $a$ and $b$ are parameters fitted to the data. An illustration of such a fitted curve is displayed in Figure 2.2. In this example, you can see that the last click value is larger than the value of the first click.
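To make the shape of Equation (2.7) concrete, the sketch below evaluates it for a short click path. Several choices here are assumptions that the text does not specify: click $j$ of $n$ is placed at relative time $t = j/(n+1)$, the constant $k$ is chosen so that the credits of a journey sum to one, and the values $a = 0.6$, $b = 0.4$ are purely illustrative rather than fitted.

```python
def u_shaped_credits(n_clicks, a=0.6, b=0.4):
    """Credits proportional to t^(a-1) * (1-t)^(b-1), as in Equation (2.7).

    Assumptions of this sketch: t_j = j / (n + 1) keeps t strictly inside
    (0, 1), and k normalizes the credits of one journey to sum to one.
    """
    times = [j / (n_clicks + 1) for j in range(1, n_clicks + 1)]
    raw = [t ** (a - 1) * (1 - t) ** (b - 1) for t in times]
    k = 1.0 / sum(raw)
    return [k * w for w in raw]


credits = u_shaped_credits(3)
```

With both parameters below one the curve is U-shaped, and because $b < a$ the last click receives more credit than the first, in line with the ordering described by the interviewed marketeers.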

Figure 2.2: Source: Wooff and Anderson (2013). The relative value of a click over time.

Although accepted by industry experts, this method is still flawed since it presupposes a functional form. Moreover, it only takes into account attribution variability over time but no intrinsic attribution differences between channels. Another disadvantage of both multi touch heuristics is that there is no method to make them predictive, which makes it impossible to compare their performance with that of other models.

To conclude this subsection about rule-based heuristics, it can be said that the overarching disadvantage of those heuristics is that no method is truly data-driven: each method presupposes the distribution of the attribution weights. Interestingly though, the rule-based heuristics are most commonly used in practice. For instance, the web analytics service Google Analytics only offers attribution analysis based on rule-based methods. In the subsequent subsections statistical models are discussed that base their attributions on parameters derived from the data. It is expected that these models perform much better.

Simple probabilistic model

The simple probabilistic model was first proposed by Shao and Li (2011). This non-parametric model determines attribution by calculating empirical conversion probabilities with one and two channel touches. The empirical probability of a path with a single visit with channel $k$ is as follows:

$$\hat{P}(y_i = 1 \mid C_k) = \frac{\sum \#\{J_i = 1,\ C(v_{i,1}) = C_k,\ y_i = 1\}}{\sum \#\{J_i = 1,\ C(v_{i,1}) = C_k\}} \tag{2.8}$$

This expression divides the number of converting paths with a single channel $k$ touch by the total number of paths with a single channel $k$ touch. Similarly, for paths with two touches the empirical probability is calculated as follows:

$$\hat{P}(y_i = 1 \mid C_k, C_l) = \frac{\sum \#\{J_i = 2,\ C(v_{i,j}) = C_k,\ C(v_{i,r}) = C_l,\ y_i = 1\}}{\sum \#\{J_i = 2,\ C(v_{i,j}) = C_k,\ C(v_{i,r}) = C_l\}} \tag{2.9}$$

for some $j \in \{1, 2\}$ and $r = 3 - j$. Note that the order of touching $C_k$ and $C_l$ is irrelevant.

The attribution of channel k on an aggregate level is then computed as follows:

$$\hat{A}_{k,PROB} = \hat{P}(y_i = 1 \mid C_k) + \frac{1}{2(K - 1)} \sum_{l \neq k} \left\{ \hat{P}(y_i = 1 \mid C_k, C_l) - \hat{P}(y_i = 1 \mid C_k) - \hat{P}(y_i = 1 \mid C_l) \right\} \tag{2.10}$$

The first element of this expression simply measures the conversion probability of prospect journeys that solely contain channel $k$. The more interesting second element computes the interaction effect of channels $k$ and $l$, which is the conversion probability of paths with both channels corrected by the one touch conversion probabilities of both individual channels. Note that the probabilistic model attributes at an aggregate rather than individual level. An important assumption underlying this model is that half of this interaction effect is attributed to each of the involved channels. Dalessandro et al. (2012) arrive at the same model, having defined attribution as a "channel's expected marginal impact on conversion". Moreover, they prove that it is a second-order approximation of the Shapley Value, a way to distribute collective value in Cooperative Game Theory (Shapley, 1953). Berman (2013) makes use of exactly the same formulation of this Shapley attribution model.
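The following sketch computes Equations (2.8) to (2.10) on a small hypothetical data set. Two handling rules are assumptions of this example and not part of the model description: a channel without single-touch paths gets a one-touch probability of zero, and channel pairs that never occur together in a two-touch path are skipped in the interaction sum. The sketch requires at least two channels.

```python
from collections import Counter


def probabilistic_attribution(journeys, channels):
    """Aggregate attribution of the simple probabilistic model, Equation (2.10).

    journeys: list of (path, y) pairs; only paths of length one or two are used.
    """
    one_n, one_y = Counter(), Counter()
    two_n, two_y = Counter(), Counter()
    for path, y in journeys:
        if len(path) == 1:
            one_n[path[0]] += 1
            one_y[path[0]] += y
        elif len(path) == 2 and path[0] != path[1]:
            pair = frozenset(path)  # order of touching is irrelevant
            two_n[pair] += 1
            two_y[pair] += y

    def p1(k):
        # Assumption: zero when channel k has no single-touch paths.
        return one_y[k] / one_n[k] if one_n[k] else 0.0

    result = {}
    for k in channels:
        interaction = 0.0
        for l in channels:
            pair = frozenset((k, l))
            if l == k or not two_n[pair]:
                continue  # assumption: unobserved pairs are skipped
            interaction += two_y[pair] / two_n[pair] - p1(k) - p1(l)
        result[k] = p1(k) + interaction / (2 * (len(channels) - 1))
    return result


# Hypothetical data: P(A) = 1/4, P(B) = 1/2, P(A, B) = 1.
data = (
    [(("A",), 1)] + [(("A",), 0)] * 3
    + [(("B",), 1), (("B",), 0)]
    + [(("A", "B"), 1), (("B", "A"), 1), (("A", "B"), 1)]
)
attribution = probabilistic_attribution(data, ["A", "B"])
```

In this toy case the interaction effect is $1 - 0.25 - 0.5 = 0.25$, and half of it is added to the one-touch probability of each channel.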

The simple probabilistic model has a number of disadvantages. Its attribution methodology solely uses conversion probabilities, thereby ignoring information about the number of conversions. This makes the attribution method unintuitive in some cases. Suppose a channel is only touched in a single customer journey (in a large data set), but this journey is successful and leads to a conversion. Although the channel contributed to just a single conversion, its conversion probability is 100%, causing the probabilistic model to attribute it a disproportionally high share. The conversions attributed to this channel are then likely to exceed one, which is unintuitive. A second disadvantage of the probabilistic model is the possibility of negative attributions. Furthermore, the model is unable to integrate information of paths that contain more than two touchpoints. It is theoretically possible to extend the model for longer paths, but Shao and Li (2011) justly argue that from a practical standpoint this does not make sense. After all, the estimated conversion probabilities for longer journeys become highly inaccurate due to the low number of observations. A final disadvantage is that the model is not predictive, which makes it impossible to evaluate its performance empirically. In conclusion, it is clear that the simple probabilistic model may be intuitive but has many serious drawbacks.

Logistic regression

An alternative attribution model that was also initially proposed by Shao and Li (2011) is a simple logistic regression. This is a specific regression model in which the dependent variable is binary and the functional form is characterized by the non-linear logistic function. In such a model, each customer journey constitutes an observation with the binary conversion indicator $y_i$ as the dependent variable. Two major advantages of this model are that it is predictive and that it takes into account all available touchpoint information. In the form Shao and Li (2011) propose, the explanatory variables are the number of touches of a certain channel $k$ in journey $i$, or $NC_{i,k} = \sum_{j=1}^{J_i} \#\{C(v_{i,j}) = C_k\}$. The logistic regression can then be formulated as follows:

$$P(y_i = 1) = \Lambda\Big(\beta_0 + \sum_{k=1}^{K} \beta_k NC_{i,k}\Big), \tag{2.11}$$

where $\Lambda(x) = (1 + e^{-x})^{-1}$ is the logistic cumulative distribution function. The parameters $\beta_k$ can be estimated by maximum likelihood, although a closed form solution such as in the case of linear regression does not exist. These parameters are then interpreted in order to determine each channel's attribution to the total number of conversions.
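The sketch below shows how the explanatory variables $NC_{i,k}$ are built from a journey and how Equation (2.11) is evaluated once coefficients are available. It does not perform the maximum likelihood estimation itself; the coefficient values and channel names are hypothetical placeholders that would in practice come from a fitted model.

```python
import math


def logistic_cdf(x):
    """Lambda(x) = (1 + exp(-x))^-1."""
    return 1.0 / (1.0 + math.exp(-x))


def touch_counts(path, channels):
    """NC_{i,k}: number of touches of each channel k in one journey."""
    return [path.count(k) for k in channels]


def conversion_probability(path, channels, beta0, beta):
    """Evaluate Equation (2.11) for one journey and given coefficients."""
    nc = touch_counts(path, channels)
    return logistic_cdf(beta0 + sum(b * n for b, n in zip(beta, nc)))


channels = ["display", "seo"]
beta0, beta = -3.0, [0.5, 1.0]  # hypothetical values, not estimates
p_short = conversion_probability(["seo"], channels, beta0, beta)
p_long = conversion_probability(["seo", "display", "seo"], channels, beta0, beta)
```

With positive coefficients every additional touch raises the predicted conversion probability, so `p_long` exceeds `p_short` here.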

However, the extant literature ignores or is particularly vague about the exact method to go from the logistic model parameter estimates to attributing the channels. Theoretically, the most obvious method to do so would be to evaluate the marginal effects $\frac{\partial y_i}{\partial NC_k} = \beta_k \lambda\big(\beta_0 + \sum_{k=1}^{K} \beta_k NC_{i,k}\big)$, where $\lambda(x) = e^{x}(1 + e^{x})^{-2}$ is the logistic probability density. However, since estimated parameters can be negative this would imply the possibility of attribution to be negative, which is not a desirable property.

An alternative, more practical method to attribute is proposed by this thesis. For each visit $v_{i,j}$, consider the estimated conversion probability $\hat{p}_{i,j} = \Lambda(\hat{\beta}_0 + \hat{\beta}_k)$ in case only the channel $C_k = C(v_{i,j})$ of that visit is touched. Use these estimated conversion probabilities as unnormalized attributions, and subsequently normalize them to obtain $\hat{a}_{i,j,LOG}$ for each touchpoint. Note that this method assumes that every touch of channel $k$ has the same effect on attribution, which is compatible with the specification of the basic logistic regression model. Mathematically, the individual attribution according to the logistic model is expressed in (2.12), in which for simplicity $k$ rather than $C_k$ is the subscript of $\hat{\beta}$.

$$\hat{a}_{i,j,LOG} = \frac{\hat{p}_{i,j}}{\sum_{r=1}^{J_i} \hat{p}_{i,r}} = \frac{\Lambda(\hat{\beta}_0 + \hat{\beta}_{C(v_{i,j})})}{\sum_{r=1}^{J_i} \Lambda(\hat{\beta}_0 + \hat{\beta}_{C(v_{i,r})})} \tag{2.12}$$

This method to determine attribution from a logistic regression estimation does not appear in the existing literature. It is important to remark that it is argued for from a practical rather than a theoretical econometric standpoint. There is no proof that the estimator $\hat{a}_{i,j,LOG}$ is statistically unbiased. The advantage of this practical method is that it facilitates heterogeneity in channel contributions without any undesirable quality such as a negative attribution.
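Equation (2.12) can be sketched in a few lines for a single converting journey. The coefficient values below are hypothetical and stand in for estimates $\hat{\beta}_0$ and $\hat{\beta}_k$ from a fitted model.

```python
import math


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def logistic_attribution(path, beta0, beta):
    """Visit-level credits of Equation (2.12) for one converting journey.

    path: channels in visit order; beta: dict mapping channel -> coefficient.
    """
    # Unnormalized credit: conversion probability if only this channel is touched.
    p = [logistic_cdf(beta0 + beta[channel]) for channel in path]
    total = sum(p)
    return [value / total for value in p]


# Hypothetical coefficients; a negative one still yields a positive credit.
beta = {"display": -0.4, "seo": 1.2}
credits = logistic_attribution(["display", "seo", "seo"], -2.0, beta)
```

Because $\Lambda(\cdot)$ is strictly positive, every visit receives a positive credit even when its coefficient is negative, and the credits of a journey sum to one by construction.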

It is important to notice that the attribution method of this logistic regression uses point coefficient estimates $\hat{\beta}_k$, ignoring the standard error $SE(\hat{\beta}_k)$ of these estimates. A precondition for using the logistic regression attribution method is therefore that the number of observations in the data set (i.e. the number of customer journeys) is very large, such that standard errors are negligible. If estimated coefficients happen to be insignificant, this is solved by plugging in $\beta_k = 0$ in (2.12).

Prediction is easy with the logistic model. Having estimated the model in Equation (2.11), one can plug the fitted coefficients $\hat{\beta}_k$ and out-of-sample observations $NC_{i,k}$ into the logistic cumulative distribution function $\Lambda(x)$ to produce out-of-sample conversion probability forecasts.

The logistic regression model displayed in Equation (2.11) is very general, and can be extended in various ways. In the original formulation, it is assumed that every touch with a channel $k$ has the same effect on the conversion probability, regardless of whether it is the first or the tenth touch. If one does not accept this assumption, one can for instance make dummies for $NC_{i,k} = 1$ and $NC_{i,k} > 1$, which assumes that the first touch and later touches have different effects. One can even go further by creating dummies for $NC_{i,k} = 1$, $NC_{i,k} = 2$ and so on, if the effect should vary across a larger number of touches with a channel $k$. However, as the number of prospects that have multiple touches with a single channel rapidly decreases, estimated coefficients are likely to become insignificant.

Alternatively, one can think of an extension to the logistic model that includes information on the relative timing of touchpoints, which is one of the theoretically derived criteria that the standard formulation does not comply with. One can integrate this by, for instance, adding a last touch dummy for every channel. The result is the following model:

$$P(y_i = 1) = \Lambda\Big(\beta_0 + \sum_{k=1}^{K} \beta_k^{LT} d_{i,k}^{LT} + \sum_{k=1}^{K} \beta_k^{NLT} NC_{i,k}^{NLT}\Big) \tag{2.13}$$

In this equation, $d_{i,k}^{LT}$ is a dummy that indicates whether channel $k$ is the last touch for prospect $i$. $NC_{i,k}^{NLT}$ counts the number of touches of channel $k$ that are not last touches. The corresponding coefficients are $\beta_k^{LT}$ and $\beta_k^{NLT}$ respectively. Note that one of the dummies $d_{i,k}^{LT}$ should be eliminated from the formulation to prevent perfect multicollinearity.

A second extension to the logistic regression model can easily be formulated by also including dummies $d_{i,k}^{LT2}$ for the second-to-last touch of a channel $k$. In this model, $NC_{i,k}^{NLT2}$ counts the number of touches of a channel $k$ that are neither the last touch nor the second-to-last touch. Theoretically, one can go further, but it should be checked whether most coefficients are still significant. Another possible direction is creating dummies $d_{i,k}^{FT}$ for the first touch and including the number of non-first touches $NC_{i,k}^{NFT}$. The Akaike Information Criterion can be checked to see which model has the best fit. However, this thesis limits its scope by considering only the last touch and second-to-last touch extensions of the logistic regression model. This decision is based on the research of Wooff and Anderson (2013) and Anderl et al. (2014), who show that the last touch is a more powerful predictor for a conversion than the first touch.

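Attribution in case of this extended logistic regression model can be derived in a similar fashion as described in Equation (2.12). For the first extension with only last touch dummies, the individual attributions for each touchpoint $\hat{a}_{i,j,LOGX1}$ are calculated in Equation (2.14). Again, note that $k = C(v_{i,j})$.

$$\hat{a}_{i,j,LOGX1} = \begin{cases} \dfrac{\Lambda(\hat{\beta}_0 + \hat{\beta}^{LT}_{C(v_{i,j})})}{\sum_{r=1}^{J_i - 1} \Lambda(\hat{\beta}_0 + \hat{\beta}^{NLT}_{C(v_{i,r})}) + \sum_{r=J_i} \Lambda(\hat{\beta}_0 + \hat{\beta}^{LT}_{C(v_{i,r})})}, & j = J_i \\[2ex] \dfrac{\Lambda(\hat{\beta}_0 + \hat{\beta}^{NLT}_{C(v_{i,j})})}{\sum_{r=1}^{J_i - 1} \Lambda(\hat{\beta}_0 + \hat{\beta}^{NLT}_{C(v_{i,r})}) + \sum_{r=J_i} \Lambda(\hat{\beta}_0 + \hat{\beta}^{LT}_{C(v_{i,r})})}, & j \neq J_i \end{cases} \tag{2.14}$$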

Although at first glance much more complex, on closer inspection the only difference with Equation (2.12) is that for the last touchpoint a different estimated coefficient is evaluated than for the other touchpoints. The denominator is identical in both cases; the only difference between the cases $j = J_i$ and $j \neq J_i$ is that a different estimated coefficient is plugged into the logistic cumulative distribution function of the numerator. For the single $\hat{\beta}_k^{LT}$ that is not estimated due to perfect collinearity problems, we plug in 0.
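A minimal sketch of Equation (2.14) follows. The coefficient dictionaries are hypothetical; a channel missing from `beta_lt` is treated as the dropped last touch dummy and therefore receives a coefficient of 0, as described above.

```python
import math


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def extended_logistic_attribution(path, beta0, beta_lt, beta_nlt):
    """Visit-level credits of Equation (2.14) for one converting journey.

    beta_lt / beta_nlt: dicts mapping channel -> last touch / non-last touch
    coefficient. A channel absent from beta_lt is the dropped dummy (0).
    """
    last = len(path) - 1
    p = []
    for j, channel in enumerate(path):
        if j == last:
            coefficient = beta_lt.get(channel, 0.0)
        else:
            coefficient = beta_nlt[channel]
        p.append(logistic_cdf(beta0 + coefficient))
    total = sum(p)
    return [value / total for value in p]


# Hypothetical coefficients: "display" is the dropped last touch dummy.
beta_lt = {"seo": 1.0}
beta_nlt = {"display": 0.5, "seo": 0.2}
credits = extended_logistic_attribution(["seo", "display", "seo"], -2.0, beta_lt, beta_nlt)
```

In this example the same channel receives a different credit depending on whether it is the last touch, which is the timing effect that the basic model cannot express.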

Markov chain models

An entirely different approach to the attribution problem is proposed by Anderl et al. (2014). They state that a customer journey can be modelled as a Markov chain, which is a probabilistic model that represents dependencies between sequences of observations of a random variable. The random variable takes the value of one of the $p$ possible states in the state space $S = \{s_1, s_2, \ldots, s_p\}$. A transition matrix $W$ determines the dependencies between those states over discrete time, with the transition probability of going from state $i$ to state $j$ being $w_{i,j}$, where $0 \le w_{i,j} \le 1$ and $\sum_{j=1}^{p} w_{i,j} = 1, \forall i$. The latter condition means that a chain that is in some state in period $t$ must again be in a state in period $t + 1$. The final element required for a Markov chain is an initial state $Z_0$. Journeys can thus be modelled or simulated by multiplying the initial state $Z_0$ with the transition matrix $W$, resulting in a sequence of states $\{Z_0, Z_1, Z_2, \ldots, Z_{t-1}, Z_t, \ldots\}$ over discrete time.

Markov chains can be of different order, denoting the number of previous observations that influence the current state. Let's first focus on the first-order model. The possible states $s_i \in S$ in the first-order Markov model are all channels $C_k$, a conversion state $Conv$ and a non-conversion state $NonConv$. The transition probabilities are empirically derived from the data, giving the first-order transition matrix $\hat{W}_1$. The estimated first-order initial state is also empirically calculated, as the proportion of first visits each channel has with respect to all journeys, resulting in the vector $\hat{Z}_{0,1}$. Note that the initial states for $s_i = Conv$ and $s_i = NonConv$ in $\hat{Z}_{0,1}$ are zero, since a prospect journey does not start with a conversion or non-conversion.

Once a first-order Markov model is appropriately fitted this way, attribution is determined by a so-called Removal Effect. This is defined as the change in probability of reaching the conversion state in the normal situation compared to the situation where the pertinent channel $s_i = C_k$ is removed from the chain. Although unexplained by Anderl et al. (2014), we assume that removing a channel $k$ means setting all row elements $w_{k,j}$ to zero for all $j$, except for $j = NonConv$, which is set to one. This results in a so-called reduced matrix $W_{(-k),1}$. Although not specifically mentioned by Anderl et al. (2014), we assume that the Removal Effect is considered over the steady state of the Markov chain process. The first-order Removal Effect of a channel $k$, $RE_1(C_k)$, then takes a value between 0 and the original conversion rate. Mathematically, this is expressed in Equation (2.15), where $x[Conv]$ is the $s_i = Conv$ element of vector $x$, $\hat{W}_1^T$ is the matrix product of $T$ times $\hat{W}_1$, and $\hat{Z}_{0,1}'$ is the transpose of $\hat{Z}_{0,1}$.

$$\hat{RE}_1(C_k) = \lim_{T \to \infty} \left\{ \hat{Z}_{0,1}' \hat{W}_1^T[Conv] - \hat{Z}_{0,1}' \hat{W}_{(-k),1}^T[Conv] \right\} \tag{2.15}$$

The aggregate attribution $\hat{A}_{k,MAR1}$ for each channel is subsequently calculated by dividing the Removal Effect by the sum of all Removal Effects:

$$\hat{A}_{k,MAR1} = \frac{\hat{RE}_1(C_k)}{\sum_{l=1}^{K} \hat{RE}_1(C_l)} \tag{2.16}$$

Markov chain models can easily be made predictive. Given a state $s_i = C_k$ at moment $t$, the estimated transition probability to the state $s_i = Conv$ is the estimated conversion probability. Note that this probability is only based on the last touchpoint for a first-order Markov chain model, a finding that can be generalized to the last $r$ touchpoints for $r$th-order models.
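The first-order procedure of Equations (2.15) and (2.16) can be sketched as follows. This is one possible reading under the removal assumption stated above (all mass leaving a removed channel goes to the non-conversion state). The limit is approximated with a fixed number of iterations, and the function names and toy journeys are hypothetical.

```python
from collections import defaultdict

CONV, NONCONV = "Conv", "NonConv"


def fit_first_order(journeys):
    """Empirical initial distribution z0 and transition probabilities w."""
    start = defaultdict(float)
    counts = defaultdict(lambda: defaultdict(float))
    for path, y in journeys:
        start[path[0]] += 1
        states = list(path) + [CONV if y else NONCONV]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    z0 = {s: c / len(journeys) for s, c in start.items()}
    w = {a: {b: c / sum(row.values()) for b, c in row.items()}
         for a, row in counts.items()}
    return z0, w


def conversion_probability(z0, w, removed=None, steps=200):
    """Mass in the Conv state after `steps` transitions (approximates T -> infinity)."""
    dist = dict(z0)
    for _ in range(steps):
        nxt = defaultdict(float)
        for s, mass in dist.items():
            if s in (CONV, NONCONV):
                nxt[s] += mass            # absorbing states keep their mass
            elif s == removed:
                nxt[NONCONV] += mass      # removed channel leads to NonConv
            else:
                for t, p in w[s].items():
                    nxt[t] += mass * p
        dist = nxt
    return dist.get(CONV, 0.0)


def removal_attribution(z0, w):
    """Equation (2.16): Removal Effects normalized to sum to one."""
    base = conversion_probability(z0, w)
    effects = {k: base - conversion_probability(z0, w, removed=k) for k in w}
    total = sum(effects.values())
    return {k: e / total for k, e in effects.items()}


# Hypothetical journeys with two channels.
journeys = (
    [(["C1"], 1)] + [(["C1"], 0)] * 3
    + [(["C2"], 1)] + [(["C2"], 0)] * 2
    + [(["C1", "C2"], 1), (["C1", "C2"], 0), (["C2", "C1"], 1)]
)
z0, w = fit_first_order(journeys)
base = conversion_probability(z0, w)
without_c1 = conversion_probability(z0, w, removed="C1")
shares = removal_attribution(z0, w)
```

A production implementation would typically solve the absorbing chain with linear algebra rather than by iteration; the loop is used here only to stay close to the limit expression in Equation (2.15).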

Figure 2.3: First-order Markov chain graph illustrating the different states and transitions possibilities.

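| State | $Z_0$ | $C_1$ | $C_2$ | $Conv$ | $NonConv$ |
|---|---|---|---|---|---|
| $C_1$ | $\frac{N_1 + N_{1,2}}{\sum N}$ | $0$ | $\frac{N_{1,2}}{N_{1,2} + N_1 + N_{2,1}}$ | $\frac{P_1 + P_{2,1}}{N_{1,2} + N_1 + N_{2,1}}$ | $\frac{N_1 - P_1 + N_{2,1} - P_{2,1}}{N_{1,2} + N_1 + N_{2,1}}$ |
| $C_2$ | $\frac{N_2 + N_{2,1}}{\sum N}$ | $\frac{N_{2,1}}{N_{1,2} + N_2 + N_{2,1}}$ | $0$ | $\frac{P_2 + P_{1,2}}{N_{1,2} + N_2 + N_{2,1}}$ | $\frac{N_2 - P_2 + N_{1,2} - P_{1,2}}{N_{1,2} + N_2 + N_{2,1}}$ |
| $Conv$ | $0$ | $0$ | $0$ | $1$ | $0$ |
| $NonConv$ | $0$ | $0$ | $0$ | $0$ | $1$ |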

Table 2.2: Markov initial state Z0 and transition matrix W

Now we will introduce a very simple analytical example of the first-order Markov model to clarify more intuitively how the model attributes. Suppose there are two channels, $C_1$ and $C_2$, and the maximum number of touches is two. The possible paths are $C_1$, $C_2$, $C_1C_2$ and $C_2C_1$, which occur respectively $N_1$, $N_2$, $N_{1,2}$ and $N_{2,1}$ times with the number of conversions $P_1$, $P_2$, $P_{1,2}$ and $P_{2,1}$. For the ease of this example we omit the paths $C_iC_i$, $i = \{1, 2\}$. The graph of this Markov model is illustrated in Figure 2.3, and the Markov transition matrix and initial state are shown in Table 2.2.

Note that the conversion and non-conversion states are absorbing states: once in such a state, a transition to another state is impossible. Now define $conv^* = Z_{0,1}' W_1^T[Conv]$, where $T$ is the matrix power. The exact value of $conv^*$ is irrelevant for our purposes. The Removal Effect $RE_1(C_1)$ for channel $C_1$ is then expressed as follows, where $\sum N$ is the total number of journeys:

$$RE_1(C_1) = conv^* - \frac{N_2 + N_{2,1}}{\sum N} \cdot \frac{P_2 + P_{1,2}}{N_{1,2} + N_2 + N_{2,1}} \tag{2.17}$$

Since this Removal Effect is proportional to the attribution, we can derive that attribution for $C_1$ is a function of:

1. The relative number of journeys that start with the other channel $C_2$ (negative effect)

2. The number of conversions with last touch $C_2$ relative to the total number of touches of $C_2$ (negative effect)

Note that the attribution for $C_1$ has a negative relation with the relative number of $C_2$ last touch conversions weighted by the relative number of $C_2$ first touches. This weighting by the number of first touches makes it intuitively less accurate than simply taking the negative of the total number of last touch conversions of $C_2$, which is basically last touch attribution. From this stylized two channel example we can therefore conclude that the first-order Markov chain model is not only comparable to last touch, but that it is even likely to attribute worse than this rule-based heuristic due to the unintuitive weighting.
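The subtracted term in Equation (2.17) can be evaluated exactly for given path counts. The counts below are hypothetical and only serve to show how the two factors combine.

```python
from fractions import Fraction


def removed_c1_conversion(n1, n2, n12, n21, p2, p12):
    """Second term of Equation (2.17): conversion probability once C1 is removed."""
    total_journeys = n1 + n2 + n12 + n21
    start_share_c2 = Fraction(n2 + n21, total_journeys)       # journeys starting with C2
    last_touch_rate_c2 = Fraction(p2 + p12, n12 + n2 + n21)   # C2 -> Conv probability
    return start_share_c2 * last_touch_rate_c2


# Hypothetical counts: N1 = 4, N2 = 3, N12 = 2, N21 = 1, P2 = 1, P12 = 1.
term = removed_c1_conversion(n1=4, n2=3, n12=2, n21=1, p2=1, p12=1)

# One extra conversion with last touch C2 increases the subtracted term,
# and therefore lowers the Removal Effect of C1.
term_more_c2_conversions = removed_c1_conversion(n1=4, n2=3, n12=2, n21=1, p2=2, p12=1)
```

The product form makes the weighting visible: the last touch conversion rate of $C_2$ only counts to the extent that journeys start with $C_2$.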

To some extent, the conclusions of our stylized two channel example can be generalized to more channels and touchpoints. However, in case of more than two channels the transition probabilities and interactions between channels influence attribution as well. Most straightforwardly, the transition probabilities to the state whose Removal Effect is estimated have a positive influence on the attribution of this state. High transition probabilities after all imply that the missed conversion of this state will be larger when it is removed. This issue was irrelevant in our analytical example since the system reached stability after a single iteration, meaning that only the direct traffic to conversion (and no transitional traffic) influenced the Removal Effect.

Having explained the intuitive dynamics of first-order Markov chain models, let's now turn our attention to higher-order models. The most distinctive difference for higher-order models is that multiple previous periods are taken into account: for an $r$th-order Markov chain the present state not only depends on the previous state, but on the states in the last $r$ periods. It can be shown that a Markov chain of order $r$ is equivalent to a first-order Markov chain with $r$-tuples representing the states. For instance, in case $r = 2$ a state can be $s_i = (C_k, C_l)$, meaning that the current channel is $C_l$ and the previous channel is $C_k$. We can thus express an $r$th-order Markov chain with a single transition matrix $W_r$ and initial state $Z_{0,r}$, with $r$-tuples representing the different states. The transition probabilities and initial state are again empirically derived from the data, giving $\hat{W}_r$ and $\hat{Z}_{0,r}$. For a second-order Markov model this implies $(K+1)(K+2)+2$ states. The $K+1$ term represents the possibilities of the first element of the 2-tuple, which are all $K$ channels plus a 'none' element in case there is no channel previous to the second channel represented in the 2-tuple. The $K+2$ term represents all channels plus a conversion and a non-conversion possibility. The last 2 comes from the only absorbing states $s_i = (Conv, Conv)$ and $s_i = (NonConv, NonConv)$. In the initial state vector $Z_{0,2}$ empirical first touch probabilities are estimated for the states of the structure $s_i = (None, C_k)$.
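The state count of the second-order model can be checked by enumerating the 2-tuples exactly as described above. The string labels for the 'none', conversion and non-conversion elements are choices made for this sketch.

```python
from itertools import product


def second_order_states(channels):
    """Enumerate the 2-tuple states of a second-order Markov model."""
    first = ["None"] + list(channels)              # K + 1 possible previous elements
    second = list(channels) + ["Conv", "NonConv"]  # K + 2 possible current elements
    states = list(product(first, second))
    # The two absorbing states.
    states += [("Conv", "Conv"), ("NonConv", "NonConv")]
    return states


channels = ["display", "seo", "direct"]  # hypothetical channel names, K = 3
states = second_order_states(channels)
n_states = len(states)
```

For $K = 3$ this gives $4 \cdot 5 + 2 = 22$ states, which illustrates how quickly the state space, and with it the number of transition probabilities to estimate, grows with the number of channels.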

Attribution for the higher-order Markov model is again calculated by the Removal Effect. Anderl et al. (2014) choose to calculate channel attribution by taking the average Removal Effect of each of the states that include the respective channel. This state-based Removal Effect is illustrated in Equation (2.18) for the case of $r = 2$, where $\hat{W}_{(-s_i),2}$ is the reduced second-order transition matrix with state $s_i$ set to zero.

$$\hat{RE}_2(s_i) = \lim_{T \to \infty} \left\{ \hat{Z}_{0,2}' \hat{W}_2^T[Conv, Conv] - \hat{Z}_{0,2}' \hat{W}_{(-s_i),2}^T[Conv, Conv] \right\} \tag{2.18}$$

Subsequently, the attribution $\hat{A}_{k,MAR2M1}$ is calculated in Equation (2.19). In this equation, $S_k \subseteq S$ is the set of all states that contain channel $C_k$, so either in the form $(C_k, \alpha)$ or $(\alpha, C_k)$ for any $\alpha$. $|S_k|$ is the number of such states.

$$\hat{A}_{k,MAR2M1} = \frac{|S_k|^{-1} \sum_{s_i \in S_k} \hat{RE}_2(s_i)}{\sum_{l=1}^{K} |S_l|^{-1} \sum_{s_i \in S_l} \hat{RE}_2(s_i)} \tag{2.19}$$

However, taking the mean of the Removal Effects of all states that include channel $k$ seems inconsistent with Anderl et al. (2014)'s own definition of the Removal Effect, being the "change in probability of reaching the conversion state when we remove a channel from the graph". Therefore, more in line with this definition, we propose an alternative method to determine individual channel attribution. Simply stated, this new method determines the Removal Effect of a channel, $RE_r(C_k)$, rather than of a state, $RE_r(s_i)$. $RE_r(C_k)$ is calculated by removing (setting to zero) all states that include channel $k$, i.e. all states $s_i \in S_k$. The resulting reduced matrix is named $W_{(-S_k),r}$. This Removal Effect for $r = 2$ is calculated in Equation (2.20).

$$\hat{RE}_2(C_k) = \lim_{T \to \infty} \left\{ \hat{Z}_{0,2}' \hat{W}_2^T[Conv, Conv] - \hat{Z}_{0,2}' \hat{W}_{(-S_k),2}^T[Conv, Conv] \right\} \tag{2.20}$$

Attribution $\hat{A}_{k,MAR2M2}$ is then calculated by normalizing this channel-based Removal Effect.

$$\hat{A}_{k,MAR2M2} = \frac{\hat{RE}_2(C_k)}{\sum_{l=1}^{K} \hat{RE}_2(C_l)} \tag{2.21}$$

In the remainder of this thesis we will report the higher-order Markov attribution method of Anderl et al. (2014) as method 1, and the method proposed by this thesis as method 2.
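The averaging step of method 1 in Equation (2.19) can be sketched as follows, taking state-level Removal Effects as given. The numerical effects below are invented for illustration; in practice they would come from Equation (2.18). Method 2 cannot be obtained from these state-level numbers, since it requires removing all states in $S_k$ simultaneously and recomputing the conversion probability.

```python
def method1_attribution(state_effects, channels):
    """Equation (2.19): mean state-level Removal Effect per channel, normalized.

    state_effects: dict mapping a 2-tuple state to its Removal Effect.
    Assumes every channel appears in at least one state.
    """
    mean_effect = {}
    for k in channels:
        # S_k: all states of the form (C_k, a) or (a, C_k).
        s_k = [s for s in state_effects if k in s]
        mean_effect[k] = sum(state_effects[s] for s in s_k) / len(s_k)
    total = sum(mean_effect.values())
    return {k: v / total for k, v in mean_effect.items()}


# Hypothetical state-level Removal Effects for two channels.
effects = {
    ("None", "C1"): 0.10,
    ("None", "C2"): 0.06,
    ("C1", "C2"): 0.04,
    ("C2", "C1"): 0.02,
}
shares = method1_attribution(effects, ["C1", "C2"])
```

Note that the mixed states $(C_1, C_2)$ and $(C_2, C_1)$ enter the average of both channels, which is one reason why the state-based average can differ from a channel-based Removal Effect.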

The higher the order of a Markov chain model, the more accurately it describes data with multiple touch journeys. In contrast to the first-order model, not only the last touch but the last $r$ touches are taken into account for attribution. Based on this, the Markov model should perform better with increasing $r$. However, since the number of parameters to be estimated grows exponentially with $r$, a higher-order Markov chain quickly becomes inefficient to estimate (Berchtold and Raftery, 2002). In this case there are not enough observations to produce accurate estimates for the transition probabilities. For this reason, this thesis follows Anderl et al. (2014) and only estimates the first-, second- and third-order Markov models. Anderl et al. (2014) find that the third-order Markov model performs better than the logistic regression, first touch and last touch heuristics as measured by the area under the ROC curve and the top-decile lift. Unfortunately, it is not reported whether any model performance differences are significant.

Other models

Some more attribution models have been developed that are noteworthy, which we will briefly refer to for the interested reader. Li and Kannan (2014) propose a Bayesian model to measure online channel consideration, visits and purchases. They calculate carryover and spillover effects to attribute conversion credit. Zhang et al. (2014) apply the attribution question to a framework borrowed from survival theory, producing a model that appears quite promising in both conversion prediction and attribution. Finally, Xu et al. (2014) employ a mutually exciting point process model to calculate attribution of online advertising channels. These models aren’t discussed in this thesis because either our dataset is not suitable for the respective model, the model is too complex for the purposes of this thesis or the model is expected to be less effective than our models.

2.3.3 Theoretical evaluation

So far, we have discussed the most common attribution models in the literature. Table 2.3 summarizes in a simplified way our theoretical evaluation in the light of the proposed seven criteria. A '+' indicates that the model complies with the quality or criterion, a '+/−' indicates partial compliance and for a '−' the model is unable to integrate the quality.

All models except for the rule-based methods are fully data-driven. The only models that are not able to predict are linear attribution and the simple probabilistic model. The logistic regression models and rule-based heuristics are able to attribute at an individual level; the others are not. Channel heterogeneity is allowed for in all models except for the rule-based methods. Contribution differences over time are explicitly modelled in the logistic extension and the higher-order Markov models. The first-order Markov model and the first and last touch methods only partly allow for differences over time, such as the simple distinction between last touch and non-last touch. The other models do not take timing differences into account. Prospect heterogeneity is an ambitious criterion that none of the considered models complies with. All intuitive restrictions are satisfied by the linear method, the logistic regression models and the Markov chain models. The first touch method, last touch method and simple probabilistic model do not incorporate all information from all touchpoints. As explained, the simple probabilistic model does not attribute in an intuitive way either, since it is based on conversion probabilities rather than absolute numbers. In conclusion, the logistic extension satisfies most theoretical criteria, closely followed by the basic logistic model and the higher-order Markov chain model. Based on the theoretical criteria, the logistic models perform best.

Criterion                      FT    LT    LIN   PROB  LOG   LOGX  MAR1  MAR2+
Data-driven                    −     −     −     +     +     +     +     +
Ability to predict             +     +     −     −     +     +     +     +
Individual level attribution   +     +     +     −     +     +     −     −
Channel heterogeneity          −     −     −     +     +     +     +     +
Differences over time          +/−   +/−   −     −     −     +     +/−   +
Prospect heterogeneity         −     −     −     −     −     −     −     −
Intuitive restrictions         +/−   +/−   +     −     +     +     +     +

Table 2.3: Summary of theoretical evaluation of attribution models. A ‘+’ indicates a model fully complies with the criterion, a ‘+/−’ implies partial compliance and a ‘−’ no compliance.

Chapter 3 Method

This chapter explains the method that this thesis employs to answer the question of which model performs best empirically and in a simulation. First, Section 3.1 briefly lists the methods and models that are estimated and evaluated in this thesis. Then, Section 3.2 describes how the models are evaluated based on their classification accuracy in an empirical study. Finally, Section 3.3 works out the method that is used for the simulations. All statistical analyses and simulations are performed in the open-source programming language R.

3.1 Models

The models that are tested in the empirical and simulation study are all introduced in Chapter 2. The list below sums them up, where the models that are able to produce predictions are designated with an asterisk (*).

• Last touch attribution (LT*)
• First touch attribution (FT*)
• Linear attribution (LIN)
• Simple probabilistic model (PROB)
• Logistic regression model: basic formulation (LOG*), extension with last touch dummies (LOGX1*) and extension with dummies for the last two touches (LOGX2*)
• Markov chain model: first-order (MAR1*), second-order (MAR2*) and third-order (MAR3*)
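To make the difference between LOG, LOGX1 and LOGX2 more concrete, the sketch below shows one way in which the regressors of the three logistic variants could be constructed from a customer journey. It is written in Python for illustration only and is not the code used for this thesis. The channel names are hypothetical, and both the use of channel indicators in the basic formulation and the use of one set of dummies per position among the last touches are assumptions rather than the exact specification of Chapter 2.

```python
CHANNELS = ["search", "email", "display"]  # hypothetical channel names


def basic_features(path):
    """One indicator per channel: 1 if the channel occurs anywhere in the path.

    Assumption: the basic logistic model uses channel indicators.
    """
    return [1 if channel in path else 0 for channel in CHANNELS]


def last_touch_dummies(path, n_last):
    """One dummy per channel for each of the last n_last positions in the path.

    Assumption: positions that do not exist in a short path get all zeros.
    """
    row = []
    for k in range(1, n_last + 1):
        touch = path[-k] if len(path) >= k else None
        row.extend(1 if channel == touch else 0 for channel in CHANNELS)
    return row


def design_row(path, n_last=0):
    """Regressors for one journey: n_last=0 (LOG), 1 (LOGX1) or 2 (LOGX2)."""
    return basic_features(path) + last_touch_dummies(path, n_last)


if __name__ == "__main__":
    journey = ["email", "search"]  # hypothetical journey, oldest touch first
    for n_last in (0, 1, 2):
        print(n_last, design_row(journey, n_last))
```

Under these assumptions, each extension adds one block of dummies per additional touch position, so the number of regressors grows linearly in the number of positions that are distinguished.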

The models are estimated on the full data set, without conditioning on the number of touchpoints. Conditioning would quickly reduce the number of observations available for journeys with many touchpoints. This would give statistically insignificant parameter estimates, severely complicating the issue of attribution. In addition, all of the models are perfectly able to cope with a data set that contains journeys with different numbers of touchpoints, and it is expected that some of the models even perform better on such a full data set. To see this, first notice that conditioning on the number of touchpoints would not have any implication for the rule-based heuristics: attribution under conditioning on touchpoints would be exactly the same as attribution on the full data set. Since the probabilistic model already conditions on paths with one or two touchpoints and ignores longer paths, conditioning has no effect here either. For the logistic regression and Markov models, it is expected that conditioning produces worse results, simply because less information is available for each estimate. Suppose we condition on the number of touchpoints Nt for Nt = 1 and Nt > 1. It is much harder to determine the contribution of a channel from the subset conditioned on Nt > 1, since information on the single-touch conversion probability of this channel relative to other channels is not taken into account. This makes the estimate of this contribution less accurate than in the case in which all information is included in the data. There is, in conclusion, no good reason to condition on the number of touchpoints. Having established this, we now discuss how to evaluate the empirical performance of the models.
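The claim that conditioning does not affect the rule-based heuristics can be checked on a toy example. The sketch below, written in Python for illustration and not taken from the thesis code, assigns credit per converting journey under the first touch, last touch and linear rules, and compares the attribution on the full set with the sum of the attributions on the subsets with Nt = 1 and Nt > 1. The journeys and channel names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def attribute(paths, rule):
    """Total credit per channel over converting paths under a rule-based heuristic."""
    credit = defaultdict(Fraction)
    for path in paths:
        if rule == "first":
            credit[path[0]] += 1
        elif rule == "last":
            credit[path[-1]] += 1
        elif rule == "linear":
            # Every touch in the path receives an equal share of one conversion.
            for channel in path:
                credit[channel] += Fraction(1, len(path))
        else:
            raise ValueError(f"unknown rule: {rule}")
    return dict(credit)


def merge(credit_a, credit_b):
    """Add two credit tables channel by channel."""
    merged = defaultdict(Fraction, credit_a)
    for channel, value in credit_b.items():
        merged[channel] += value
    return dict(merged)


# Hypothetical converting journeys, oldest touch first.
paths = [["search"], ["email", "search"], ["display", "email", "search"], ["email"]]
single_touch = [p for p in paths if len(p) == 1]
multi_touch = [p for p in paths if len(p) > 1]

for rule in ("first", "last", "linear"):
    full = attribute(paths, rule)
    conditioned = merge(attribute(single_touch, rule), attribute(multi_touch, rule))
    print(rule, full == conditioned)
```

Because each rule assigns credit within a single journey without estimating any parameters, splitting the data by path length and adding up the results reproduces the attribution on the full data set.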

3.2 Classification accuracy

Although channel attribution obviously cannot be observed empirically, it seems plausible that a model that predicts conversions well also attributes well. Under this assumption, we can measure and evaluate the classification accuracy of the different models. In order to do so, the dataset is split into a training set and a test set. Two thirds of the observations are randomly assigned to the training set and the remainder to the test set. This split is arbitrary and could have been chosen differently. Most importantly, the number of observations is large enough for both the parameter estimates from the training set and the performance statistic from the test set to have a small variance.
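A random two-thirds split of this kind can be sketched as follows. The example is written in Python for illustration and is not the code used for this thesis. The function name, the fixed seed and the toy observations are assumptions made for the sake of a reproducible example.

```python
import random


def train_test_split(observations, train_fraction=2 / 3, seed=0):
    """Randomly assign a fraction of the observations to the training set.

    The seed is fixed only to make the example reproducible.
    """
    rng = random.Random(seed)
    indices = list(range(len(observations)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(observations))
    train = [observations[i] for i in indices[:n_train]]
    test = [observations[i] for i in indices[n_train:]]
    return train, test


if __name__ == "__main__":
    journeys = list(range(9))  # hypothetical observation identifiers
    train, test = train_test_split(journeys)
    print(len(train), len(test))
```

Every observation ends up in exactly one of the two sets, so the test set remains out-of-sample for all models estimated on the training set.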

All models are estimated on the training set. For the models marked with an asterisk in the list in Section 3.1, conversion probabilities are determined for the observations in the test set. These probabilities are then turned into a predicted conversion or non-conversion using a certain threshold, resulting in a confusion matrix as shown in Table 2.1. Standard measures of classification accuracy, such as the misclassification error rate used by Shao and Li (2011) or the percentage correctly classified, can then be calculated. However, in our case the class distribution between conversion and non-conversion is highly skewed: the event of non-conversion is far more likely than the event of conversion. In the case of a highly skewed distribution, He et al. (2009) show that the standard measures perform poorly due to limited discriminative power.
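The limited discriminative power of the percentage correctly classified under a skewed class distribution can be illustrated with a toy example. The sketch below is written in Python for illustration and is not the code used for this thesis. The data are hypothetical: 2 conversions among 100 test observations, with made-up predicted probabilities. The area under the ROC curve is computed here as the share of converter and non-converter pairs that are ranked correctly, with ties counted as one half.

```python
def confusion_matrix(y_true, p_hat, threshold):
    """Count TP, FP, TN and FN when p_hat >= threshold is classified as a conversion."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for y, p in zip(y_true, p_hat):
        if p >= threshold:
            counts["TP" if y == 1 else "FP"] += 1
        else:
            counts["FN" if y == 1 else "TN"] += 1
    return counts


def accuracy(counts):
    """Percentage correctly classified, expressed as a fraction."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


def auc(y_true, p_hat):
    """Share of (converter, non-converter) pairs ranked correctly, ties count half."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical test set: 2 conversions among 100 observations.
y_true = [1, 1] + [0] * 98
p_hat = [0.30, 0.10] + [0.20] * 8 + [0.05] * 90

counts = confusion_matrix(y_true, p_hat, threshold=0.5)
print(counts, accuracy(counts), auc(y_true, p_hat))
```

In this example no observation reaches the threshold of 0.5, so every observation is classified as a non-conversion. The percentage correctly classified is nevertheless 98%, although not a single conversion is identified. A ranking-based measure does not depend on the threshold and still reflects how well the predicted probabilities separate the two classes.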
