
Master of Business Administration Thesis

Improving Display Advertising with Probabilistic Cross-Cookie Matching

Probabilistic Cookie Matching using Natural Language Processing and Gradient Boosting Machines

T.G. Wijnsma MSc

31 August 2017


Improving Display Advertising with Probabilistic Cross-Cookie Matching

Probabilistic Cookie Matching using Natural Language Processing and Gradient Boosting Machines

Master of Business Administration Thesis

For obtaining the degree of Master of Business Administration with a specialization in Big Data & Business Analytics at the Amsterdam Business School

T.G. Wijnsma MSc

31 August 2017


Copyright © T.G. Wijnsma MSc. All rights reserved.


Management summary

ORTEC Adscience helps companies improve their display advertising with its demand side platform (DSP). These advertising companies book campaigns in the DSP, specifying budget constraints and uploading their banners. Campaigns are either aimed at finding new consumers (called prospecting) or at getting consumers that visited the advertiser's site back to that site to buy a product or service (called retargeting). The percentage of consumers that click on a banner and then convert on the site of the customer is called the conversion efficiency. For retargeting the conversion efficiency is up to 5 times higher than for prospecting. However, since retargeting works by targeting devices on which a cookie is dropped at the moment of the first site visit, the volume of a retargeting campaign is limited.

When a consumer orients themselves on buying a new TV in the afternoon on their phone, but makes the decision to buy in the evening on their laptop, the laptop is not recognized as belonging to the same person. An advertising opportunity on the laptop is lost, and advertising will continue on the mobile phone even though the consumer has already made a purchase. Matching the cookies on these different devices to learn that they belong together is generally called cross-device matching. However, matching can also be done between old and new cookies on the same device, in case the cookies were wiped. Since this research handles both cases, the term cross-cookie matching (CCM) is coined. In short, this research improves the overall conversion efficiency of display advertising campaigns through probabilistic cross-cookie matching.

The matching is done based on the browsing behavior of the consumer. Both when the consumer browsed and which URLs were visited are used to characterize each user. The model structure consists of three phases: candidate selection, feature generation and pairwise classification. Candidate selection is performed because it is computationally infeasible to compare each cookie to every other cookie. The natural language processing (NLP) techniques Term Frequency - Inverse Document Frequency (TF-IDF) and Doc2Vec are utilized to convert the URL visits to vectors that represent the consumer. One TF-IDF and four different Doc2Vec models are used, and each of those is followed by a k nearest neighbor (kNN) model that selects the most similar consumers as candidates for matching. For TF-IDF the 18 most similar consumers are selected, and for each Doc2Vec model the 15 most similar. This reduces the number of candidate pairs (combinations of two consumers) to less than 1% of the original number. Despite this massive reduction, the remaining selection still contains 90% of the true pairs.

Feature generation produces the input for the classification model. The similarities, expressed as distances, that the 5 NLP models produce are not only used for candidate selection but also provide most of the features. The rankings that the 5 kNN models produce are used as features as well. For instance, for a pair (A, B), cookie B might be the fourth closest to A of all candidates, so the rank is 4. In addition, features are created based on when consumers browsed and with what frequency.

The features are used to estimate for each candidate pair the probability that it is a match. This is called binary pairwise classification and is done with a gradient boosting machine algorithm, which uses an ensemble of decision trees for classification. Where most of the recent research used the XGBoost method, this research employs LightGBM: the latest development in gradient boosting machines and between 11 and 15 times faster.

The results show that the model is able to find 52% of all the true matches (recall), and when the model predicts a match, it is a true match in 76% of the cases (precision). The widely used F1-score, which is the harmonic mean of precision and recall, is 61.5%.

The kNN of the TF-IDF model in particular is very good at candidate selection, and it is also established that the NLP models as a whole have great predictive value. Although NLP stems from a very different domain, the concepts behind the techniques turn out to be very well suited to the interpretation of browsing behavior.

The initial expected performance gain when moving prospecting budget to a campaign that targets matched cookies was an increase of 20% to 32% in conversion efficiency. However, a high precision in combination with enough recall resulted in an estimated performance gain of 41% to 53%. This creates a lot of room for ORTEC Adscience to monetize the cross-cookie matching functionality.

In conclusion, it can be said that the potential business value created by this research has exceeded expectations: not only directly through the performance gain for retargeting campaigns, but also through other identified areas of application for the model. ORTEC Adscience will continue developing the model based on the recommendations of this research and adapt it for operational use in the business.


Contents

Management summary

1 Introduction
  1.1 Display Advertising
  1.2 Cross-cookie matching
  1.3 ORTEC Marketing Optimization
  1.4 CRISP-DM
  1.5 Research in brief
  1.6 Thesis outline

2 Business Understanding
  2.1 Business goals
    2.1.1 Performance gains
    2.1.2 Service gains
  2.2 Research questions
  2.3 Requirements specification
    2.3.1 Design requirements
    2.3.2 Scope
  2.4 Chapter conclusion

3 Literature
  3.1 Retargeting
  3.2 Cross-cookie matching
  3.3 Chapter conclusion

4 Data understanding and preparation
  4.1 Data collection
  4.2 Data description
    4.2.1 Big data Vs
    4.2.2 Variables for use in further research
  4.3 Data cleaning and transformation
    4.3.1 Input data preparation
    4.3.2 Simulating ground truth data
  4.4 Chapter conclusion

5 Modeling
  5.1 Model overview
  5.2 Candidate selection
  5.3 Feature generation
    5.3.1 Time based features (971)
    5.3.2 Rank based features (20)
    5.3.3 Distance based features (10)
  5.4 Pairwise classification
    5.4.1 Decision trees
    5.4.2 Gradient boosting machines
  5.5 Chapter conclusion

6 Results
  6.1 Metrics
  6.2 Performance
  6.3 Performance decomposition
    6.3.1 Feature importance
    6.3.2 Candidate source importance
    6.3.3 Candidate set size importance
  6.4 Value for business
  6.5 Chapter conclusion

7 Conclusion and recommendations
  7.1 Conclusion
  7.2 Recommendations for further research

8 Reflection

References

A CIKM data distributions


Chapter 1

Introduction

This thesis focuses on implementing a model for a demand side platform in online marketing to improve the efficiency of its display advertising system. More specifically, through the use of cross-cookie matching its retargeting campaigns will be made more effective.

The introduction gives a short description of the business project and serves the purpose of providing context to the topic. This chapter starts with explaining what Demand Side Platforms are, followed by an introduction to cross-cookie matching and to ORTEC, the company at which the project is performed. Next, the CRISP-DM process model is described, which provides the methodology of the project. Finally, the chapter conclusion is given and the outline of the thesis is stated.

1.1 Display Advertising

Display advertising is the part of online marketing that comprises the presentation of banners of advertisers on the websites of publishers. Advertisers pay for each consumer view, an impression, separately. The buying of these impressions is increasingly done automatically, through a process called programmatic buying and more specifically real-time bidding. In real-time bidding, the advertiser can bid on inventory (the banner space on offer on the site of a publisher) at the moment a consumer loads that website. Roughly, real-time bidding works as follows: while a consumer loads a website, that website sends a bidrequest to an auction, a Supply Side Platform or SSP. The bidrequest contains the inventory on offer (e.g. URL and supported banner sizes) and specifics about the consumer (e.g. IP address and device type). The auction distributes the bidrequest to all interested bidding parties. These bidding parties are called Demand Side Platforms. DSPs have advertisers as customers, which set up marketing campaigns in the platform of the DSP. These campaigns at the very least contain budget constraints and which banners to use. When a bidrequest is received, the DSP evaluates all the banners of each campaign in the platform and uses the attributes of the bidrequest to estimate the chance that the consumer will click on the banner and/or buy something from the advertiser after clicking the banner. Based on this chance, a price is determined by a machine learning algorithm. The highest internal price is communicated back to the SSP for auction. At the auction the highest bid wins, the URL of the winning banner is communicated back to the website, and the banner is loaded. This entire bidding process takes approximately 200 milliseconds (ms), of which the DSP has 100 ms to do its job, including communication latency.
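As a simplified illustration of this pricing step, consider the toy calculation below. The function names, the margin parameter and the numbers are invented for this example; this is a sketch of the general idea, not the actual ORTEC Adscience pricing algorithm.

```python
# Hypothetical sketch of the DSP's decision for one bid request: estimate the
# expected value of the impression, then translate it into a CPM bid price.

def expected_value(p_click: float, p_conv_given_click: float,
                   profit_per_conversion: float) -> float:
    """Expected revenue of showing one impression to this consumer."""
    return p_click * p_conv_given_click * profit_per_conversion

def bid_cpm(p_click: float, p_conv_given_click: float,
            profit_per_conversion: float, margin: float = 0.3) -> float:
    """Translate expected value per impression into a CPM bid (price per 1000)."""
    value_per_impression = expected_value(p_click, p_conv_given_click,
                                          profit_per_conversion)
    # Keep a profit margin for the advertiser; CPM is priced per mille.
    return 1000 * value_per_impression * (1 - margin)

# A consumer with a 1% click chance, a 5% conversion chance after the click,
# and EUR 40 profit per conversion: 1000 * 0.01 * 0.05 * 40 * 0.7 = 14.0
print(bid_cpm(0.01, 0.05, 40.0))  # 14.0
```

In practice the click and conversion probabilities would come from the DSP's machine learning models rather than being fixed inputs.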

Display advertising campaigns can be roughly divided into two categories: prospecting and retargeting. In prospecting campaigns the general goal is to attract consumers to the website of the advertiser. In retargeting campaigns the goal is to target consumers that have visited the advertiser's website in order to get them back to convert (i.e. buy something). For the purpose of retargeting, the DSP needs to be able to recognize that a consumer has visited the advertiser's website. To achieve this, the advertiser puts some code of the DSP on its website, called a pixel, that signals the DSP at each page load. For every new consumer, the DSP generates a viewer id and drops this id in a DSP-specific tracking cookie, which is stored on the device of the consumer. This id is also synchronized with the SSPs to enable them to add it to all the bidrequests that are generated by this consumer. By checking the presence of a viewer id in a bidrequest, the DSP knows which websites of its advertisers this viewer has visited and can take specific action.

It is important to distinguish between cookies, consumers and customers. Customers are advertisers that do business with ORTEC Adscience and to which added value should be delivered with CCM. Consumers are the people that are targeted by the customers. Consumers own multiple devices, which all have a different identifier in a cookie on that device. Hence, a cookie is not the same as a consumer, but the browsing behavior logged for a cookie is part of the total browsing behavior of the consumer.

1.2 Cross-cookie matching

Display advertising is foremost aimed at tracking users through their desktop device. However, although 97% of online shoppers make purchases via desktop, about 40% also purchase via tablet and more than 40% via mobile (Bannan, 2014). About 60% of consumers use multiple devices, and in general much more browsing is done on mobile than on desktop nowadays. So, device use has become much more diverse, which poses difficulties for traditional targeting. The problem with cookie-based retargeting is that devices are targeted and not consumers. For example, if a consumer orients on buying a new TV in the afternoon on their phone, but buys it in the evening on their laptop, the DSP cannot retarget the consumer on the laptop, because it does not know they visited the advertiser's website on their phone earlier. Hence, a retargeting opportunity is lost. A possible solution is Cross-Device Matching (CDM). CDM, which has been used in marketing since 2014 (Öztürk, 2016), involves methods to determine whether different devices belong to the same consumer. As such, display advertising moves from device retargeting to consumer retargeting.

There are two categories of CDM: deterministic and probabilistic (Öztürk, 2016). In deterministic CDM, one is certain that the matches between devices are correct. For instance, when a consumer has to sign in with a personal account on the website of the advertiser and does so on all their devices, the advertiser knows for sure this is the same consumer. In probabilistic CDM the match between devices is estimated based on non-deterministic attributes. For instance, a phone and a laptop sharing IP address A in the morning and sharing IP address B in the afternoon are very likely to have traveled with the same person. Though this is not certain, the chance of those devices matching is likely to be high; hence the term probabilistic. This research focuses on applying probabilistic CDM. More specifically, it focuses on matching the cookies on different devices. By doing so, not only can the devices of the same consumer be matched, but new cookies can also be matched with old cookies that the consumer wiped. The latter is not cross-device but on the same device (intra-device). To include intra-device matching in the definition, this research coins the term cross-cookie matching (CCM).

1.3 ORTEC Marketing Optimization

This research project is performed at the DSP of ORTEC, which is part of its Marketing Optimization department. ORTEC was founded on April 1st, 1981 by five innovative thinkers. These Econometrics students at Erasmus University Rotterdam believed the mathematical theories and algorithms they worked on could be practically applied to significantly improve business performance. Since then, ORTEC has become one of the world's leaders in optimization software and analytics solutions. They make businesses more efficient, more predictable and more effective through both their products and consultancy. ORTEC serves clients in almost every industry, with 15 offices strategically located across 4 continents housing over 1000 employees. Everything they do is a combination of operations research, business and IT (adapted from ORTEC (2017a)).

The Marketing Optimization department is a startup environment of ORTEC in the marketing domain. Their portfolio contains three products in different stages of the startup life-cycle:

• First of all, they offer a statistical conversion attribution tool called ORTEC Media Optimizer that helps marketing managers determine the optimal distribution of budget over their marketing channels.

• Second, they offer a search bidder called ORTEC AdWords Bid Optimizer that helps companies determine optimal bid prices for search advertising.

• Third, they have a DSP called ORTEC Adscience, which has both advertisers and media agencies as customers (ORTEC, 2017b).

(The author is both manager of the Marketing Optimization department and product owner of ORTEC Adscience, which enabled the execution of this project.)


1.4 CRISP-DM

With the popularity of data mining projects in business increasing, several big enterprises recognized the lack of a good model to guide the data mining process. Together, they created CRISP-DM, the CRoss-Industry Standard Process for Data Mining (Shearer, 2000), which is applied in this research. The CRISP-DM model contains six phases that are ordered sequentially but are meant to be used iteratively, namely: business/research understanding, data understanding, data preparation, modeling, evaluation, and deployment. The business/research understanding phase focuses on defining the business requirements and using those to formulate a data mining goal. The data understanding phase is used to get a grip on the data available and its quality, and to perform initial exploratory data analysis. In the data preparation phase the data is made ready to serve as input for the envisioned modeling technique by, for example, imputing missing data, creating flag variables, and transforming data. In the modeling phase the type of model is chosen and the hyperparameters of the model are tuned. Then, the evaluation phase checks the effectiveness of the model in achieving the business goals. Finally, the deployment phase prepares the model for use in operation and then deploys it. This research contains all phases except deployment, which was not feasible within the time constraint of the project. However, deployment will follow in the business setting as described before. As such, many decisions in the modeling and evaluation phases are made with respect to operational feasibility. As a whole, the CRISP-DM process ensures the focus on business value and ensures that the iterative nature of data mining is taken into account.

1.5 Research in brief

At the start of the research, the expected performance gain when moving prospecting budget to a campaign that targets matched cookies was an improvement of 20% to 32% in conversion efficiency. The literature confirms the business experience at ORTEC Adscience that retargeting is much more effective than prospecting, which underpins the goal of this research. As little research has been done on CCM, papers from two recent cross-device matching competitions are the main sources for model structure and feature engineering.

The model structure consists of three phases: candidate selection, feature generation and pairwise classification (Figure 1.1). It is computationally infeasible to compare each cookie with every other cookie: for a thousand cookies the number of possible pairs is about 500,000, and for two thousand cookies it is already 2 million. The candidate selection phase reduces the number of pairs to be reviewed to less than 1%. This is done based on the URLs visited by the cookie, which are transformed into vectors using the Natural Language Processing (NLP) techniques Term Frequency - Inverse Document Frequency (TF-IDF) and Doc2Vec. Of these two, the Doc2Vec technique is applied in four different ways.
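Candidate selection with TF-IDF and kNN can be sketched in a few lines using scikit-learn. The browsing histories below are invented and the neighbor count is reduced for readability; the actual model uses far more cookies, larger candidate sets and four additional Doc2Vec models.

```python
# Sketch of candidate selection: represent each cookie by its visited URLs,
# vectorize with TF-IDF, and keep only the nearest neighbors as candidates
# instead of forming all n*(n-1)/2 pairs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Each cookie is a "document" made of the URLs it visited (toy data).
histories = {
    "cookie_a": "shop.example/tv shop.example/tv/oled news.example",
    "cookie_b": "shop.example/tv news.example weather.example",
    "cookie_c": "recipes.example recipes.example/pasta",
    "cookie_d": "shop.example/tv/oled shop.example/tv",
}
ids = list(histories)
X = TfidfVectorizer(token_pattern=r"\S+").fit_transform(histories.values())

knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
dist, idx = knn.kneighbors(X)
for i, cid in enumerate(ids):
    # idx[i][0] is the cookie itself (distance 0); idx[i][1] its nearest neighbor.
    print(cid, "->", ids[idx[i][1]], round(float(dist[i][1]), 2))
```

Cookies with overlapping browsing (here cookie_a and cookie_d) end up as each other's candidates, while unrelated cookies never form a pair.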

Figure 1.1: System diagram of the ORTEC Adscience model

The resulting five NLP models produce distances between the cookies to indicate their similarity. After each NLP model is applied, a corresponding kNN model selects the 100 most similar cookies as the candidates to review. For these candidates, interesting variables (features) are created that the classification model uses for prediction. For feature generation, browsing behavior in terms of time and frequency of browsing and the URLs that are visited is used. Additional information available through the bid requests of ORTEC Adscience is identified to be used as features. The classification model is trained by providing a ground truth on which it calibrates itself to predict as accurately as possible. As there is no ground truth data available within ORTEC Adscience, such data had to be simulated based on distributions of a cross-device matching competition. Since this competition only utilizes browsing behavior, it cannot provide distributions for the ORTEC Adscience specific variables, which as a result cannot be used. They are however described for use in further research. Feature generation is hence performed using only browsing behavior, by generating time based features, distance based features and rank based features on all the candidate pairs. Time based features describe when and how frequently the device of the cookie was used for browsing. Distance based features, constructed from the output of the NLP models, describe the similarity of the URLs visited by both cookies. Rank based features also use the NLP models to show how similar the cookies are with respect to all other candidates. For instance, for a pair (A, B), cookie B might be the fourth closest to A of all candidates, so the rank is 4.
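Two of these feature families can be sketched as follows; the visit hours and candidate lists are invented example data, and the real model generates many more features per family.

```python
# Sketch of a time-based feature (hour-of-day activity profile) and a
# rank-based feature (position of cookie B among A's nearest candidates).
from collections import Counter

def hour_profile(visit_hours: list) -> list:
    """Fraction of a cookie's page visits falling in each hour of the day."""
    counts = Counter(visit_hours)
    total = len(visit_hours)
    return [counts.get(h, 0) / total for h in range(24)]

def rank_feature(candidates_sorted_by_distance: list, other: str) -> int:
    """1-based rank of `other` among a cookie's candidates."""
    return candidates_sorted_by_distance.index(other) + 1

# A cookie that browses mostly in the evening: two thirds of visits at 21h.
profile = hour_profile([21, 21, 22])
print(round(profile[21], 2))  # 0.67

# For the pair (A, B): B is the fourth closest candidate to A, so the rank is 4.
print(rank_feature(["C", "D", "E", "B", "F"], "B"))  # 4
```

Two cookies of the same consumer should have similar hour profiles and small mutual ranks, which is exactly what the classifier exploits.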

These features are used to estimate for each pair whether it is a match or not, called binary pairwise classification. As the binary pairwise classifier, a gradient boosting machine (GBM) is used, which is an ensemble of many different decision trees. Where most of the literature used the XGBoost method of applying boosted decision trees, this research employs LightGBM, the latest development in GBMs. LightGBM is between 11 and 15 times faster than XGBoost while achieving the same classification performance.

The results show that the model is able to find 52% of all the true matches (recall) and when the model predicts a match, it is a true match in 76% of the cases (precision).


The widely used F1-score, which is the harmonic mean of precision and recall, is 61.5%.
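As a quick check, the harmonic mean of the rounded precision and recall reproduces the reported score up to rounding (the thesis value is based on the unrounded figures):

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.76, 0.52
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # approximately 0.62, consistent with the reported 61.5%
```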

The kNN of the TF-IDF model in particular is very good at candidate selection, and it is also established that the NLP models as a whole have great predictive value. Although NLP stems from a very different domain, the concepts behind the techniques turn out to be very well suited to the interpretation of browsing behavior.

The business value as estimated at the end of the research has exceeded expectations. The initial expected performance gain when moving prospecting budget to a campaign that targets matched cookies was 20% to 32%. However, the higher than assumed precision in combination with sufficient recall resulted in an updated expected performance gain of 41% to 53%. This creates a lot of room for ORTEC Adscience to monetize the cross-cookie matching functionality.

1.6 Thesis outline

In this chapter an introduction to this research was given. It takes place in a marketing context, more specifically in display advertising. The main challenge is to probabilistically match cookies on the same or different devices to improve the retargeting campaigns of ORTEC Adscience. The research is executed following the CRISP-DM process. In the next chapter the business understanding phase is described. It contains the business goals, research questions, requirements and the scope definition. Then the literature findings related to retargeting and CCM are given in chapter 3. Next, chapter 4 presents the considerations of the data understanding and data preparation phases. Following that, chapter 5 describes the modeling phase by detailing candidate selection, feature generation and pairwise classification. Then chapter 6 analyzes the effectiveness of the model in achieving the expected performance, investigates the contribution of different parts of the model to this performance, and lays out the value for the business. Finally, the conclusion and recommendations for further research are given in chapter 7, followed by the reflection in chapter 8.


Chapter 2

Business Understanding

This chapter describes the requirements for this research and focuses on understanding the business context of the project. It starts with stating the goals of the business, followed by outlining the research questions that will guide the project. Next, the requirements and scope of the project are given and finally the chapter conclusion summarizes the findings of this chapter.

2.1 Business goals

For ORTEC Adscience to increase its competitive advantage, it mainly focuses on delivering the best performing advertisement campaigns to its customers with the best service. The best performance means that customers get the most sales out of the money spent in the advertising campaigns. Cross-cookie matching (CCM) is expected to help ORTEC Adscience improve in this area, more specifically in retargeting. The first section looks at the performance gain that CCM can bring; the second section describes service related improvements.

2.1.1 Performance gains

A breakdown of the profit for customers is given to reach a mathematical formulation of the potential performance gains of CCM.

Profit = Revenue − Cost    (2.1)

Cost = (N / 1000) × CPM    (2.2)

Revenue = Conv × PC    (2.3)


The cost is defined as the number of impressions N divided by a thousand, times the CPM, which stands for cost per mille: the price of buying one thousand impressions. These costs are already optimized by a Naïve Bayes machine learning algorithm in combination with a budget pacing algorithm, so they are assumed fixed in this research. Hence, the expected gain must come from revenue, and with the profit per conversion PC also fixed, it is the number of conversions Conv that CCM needs to lift:

Conv = N × CE    (2.4)

The number of conversions is the result of the number of impressions times the conversion efficiency (CE). Normally, the marketing domain defines a click-through rate CTR (P(click)) and a conversion rate CR (P(conversion|click)) separately, but since these are both assumed fixed in this research, their combined chance is defined as the conversion efficiency CE. Experience at ORTEC Adscience shows that people who have visited a customer's website have a CE about 3 to 5 times higher than non-visitors. In other words, retargeting is much more effective than prospecting. So why do prospecting at all? First of all, because the volume (N) of retargeting is too low to fuel business. Second, to reach new customers. In other words, increasing volume is key. Retargeting is done by dropping a cookie on the devices of users visiting the website of the customer. Hence, we are actually targeting cookies and not users, which means we can increase volume in two ways:

1. Being able to target users on all their devices, not only the device that has the original cookie.

2. Being able to continue targeting after a user has wiped the cookie.

Both ways can be achieved with CCM. Since probabilistic rather than deterministic CCM is applied, it is not entirely known whether matched cookies are true matches or false matches. So next to the CEs of retargeting and prospecting, we have a third CE of falsely matched cookies. To define the performance potential, the following case is used. Consider two campaigns, one prospecting and one retargeting, with identical budgets. They have different performances and together yield the average of those performances. In the new situation, with CCM, the budget of prospecting is put in a new campaign that retargets the matched cookies. Comparing the new average performance with the old yields a percentage of performance increase. Further analysis of the performance potential is based on this case.

N_ccm = N_ort + N_m    (2.5)

Conv_ccm = N_ort × CE_rt + N_m × CE_m    (2.6)

The total number of impressions in the CCM situation N_ccm is defined as the original number of impressions of the retargeting campaign N_ort plus the impressions served to the matched cookies N_m. As such, the conversions in the combined situation Conv_ccm are the original impressions times the retargeting conversion efficiency CE_rt plus the matched impressions times the matched cookies' conversion efficiency CE_m.

In the worst case, CE_m is that of prospecting campaigns. In the best case it is very close to CE_rt. Since the CCM method we will be using selects matches based on similar behavior, we can expect that false matches are still cookies of people that behave similarly (look-a-likes) to the ones we were expecting to retarget, and their CE might hence be very similar too.

N_m = N_nrt + N_lal    (2.7)

N_ccm = N_ort + N_nrt + N_lal    (2.8)

The matched impressions are the sum of the new retargeting impressions N_nrt (the correctly matched cookies) and the impressions of look-a-likes N_lal (the falsely matched cookies). To calculate a conservative and an optimistic scenario of the efficiency gain of CCM from this, we need two other assumptions. First of all, we assume that we can find as many matches as there are currently retargetable cookies; in other words, N_ort = N_m.¹ Second, we assume that the precision of our model, the percentage of all matches that are true matches, is between 20% and 40%.²

G_ccm = (CE_ort + CE_m) / (CE_ort + CE_pr)    (2.9)

G_ccm = (CE_ort + (N_nrt / N_pr) × CE_rt + (N_lal / N_pr) × CE_lal) / (CE_ort + CE_pr)    (2.10)

Variable        Conservative   Optimistic
CE_pr           1              1
CE_ort          3              5
CE_lal          1.5            3
N_nrt / N_pr    0.2            0.4
N_lal / N_pr    0.8            0.6

Table 2.1: Inputs for the achievable gain scenarios

G_ccm,c = (3 + 0.2 × 3 + 0.8 × 1.5) / (3 + 1) = 1.20    (2.11)

G_ccm,o = (5 + 0.4 × 5 + 0.6 × 3) / (5 + 1) = 1.32    (2.12)

Table 2.1 shows the inputs for the conservative and optimistic scenarios. Equation 2.11 gives a lower bound of the gain G_ccm of 1.20, meaning a 20% increase in efficiency relative to spending this money on a prospecting campaign. Equation 2.12 shows the upper bound of 1.32, in other words an increase of 32%. This means that when running one retargeting campaign and one prospecting campaign with the same budget, and changing the prospecting campaign into a campaign that targets matched cookies, we would expect an increase in efficiency between 20% and 32%.

¹ Further down in the research this means that the recall of the model should be high enough. See section 6.1 for further explanation of recall.

² Based on the scores that were achieved in the CIKM Cup, the highest of which were around 40%. The chosen range is a bit cautious, since the competition data was very clean and it is expected that the data is messier in an operational environment, reducing the score.
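The conservative scenario can be checked numerically with the values from Table 2.1 (variable names are chosen for readability):

```python
# Numerical check of the conservative scenario of Equation 2.11.
ce_pr, ce_ort, ce_lal = 1.0, 3.0, 1.5   # conversion efficiencies from Table 2.1
precision = 0.2                          # share of matches that are true matches

# Efficiency of the matched-cookie campaign: true matches convert like
# retargeting cookies, false matches like look-a-likes.
ce_m = precision * ce_ort + (1 - precision) * ce_lal

gain = (ce_ort + ce_m) / (ce_ort + ce_pr)
print(round(gain, 2))  # 1.2 -> a 20% increase in conversion efficiency
```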

2.1.2 Service gains

In addition to performance gains, there are other service related gains to be made with CCM. One argument for ORTEC Adscience to offer CCM is the mere fact that customers are asking for it. It helps create the perception among customers that ORTEC Adscience has a mature platform with advanced functionalities. This, in turn, helps sales, especially at bigger customers that are more demanding.

Also, there are commercial solutions available. These demand vast amounts of data, including ground truth data (which we do not have³), and start at €10,000. Developing our own solution that suits the data available gives ORTEC Adscience the opportunity to both offer this functionality to customers and avoid a fee that a company of its current size is unable to bear.

Brand safety

A final important gain that CCM offers to customers is in the area of brand safety. For example, a user orients herself on the customer's website on her mobile phone. Later, she does some final orientation on her laptop and makes the purchase on that device. We will start targeting both her mobile and her laptop, since she visited the customer's site on both devices. Not only visits are logged, but also purchases. Hence, it is known that a conversion has taken place on the laptop and as a result targeting this device is stopped immediately. This is an important requirement for a lot of customers4, because it can be annoying to users to receive advertisements when they have already made a purchase. This can be experienced as badgering and as such hurt the brand of the customer. Although the targeting of the laptop is stopped, there is no notion that the mobile phone belongs to the same user. CCM solves this and helps reduce unnecessary impressions to users that already converted.

In conclusion, the following potential gains of CCM are identified:

• Improve campaign performance in terms of efficiency between 20% and 32%
• Satisfy customer needs regarding CCM
• Create a low cost alternative to existing commercial methods
• Offer brand safety to customers

Next the research questions are stated.

3 More on how we solve this in chapter 4.




2.2 Research questions

The research questions help structure and focus the research. They are divided into three categories according to their origin: environment related, design related, and evaluation related.

Environment related research questions

This research is done in a business context and should hence be influenced by the business requirements. Also, the academic context needs to be an integral part of the research and is hence reflected in the research questions.

1. How can ORTEC Adscience improve performance through cross-cookie matching?

2. What literature is available on CCM techniques?

Design related research questions

Designing the model will be an iterative process between determining and tuning modeling techniques and preparing the right data. To clearly describe the choices that will be made in this process the following research questions are formulated.

3. What data is available at ORTEC Adscience to use for CCM?
(a) What are the characteristics of the data?
(b) How does the data need to be prepared for modeling?

4. What does the design of the model look like?
(a) How can the complexity of CCM be managed?
(b) What features can be used for classification?
(c) What type of classification model can be used?

Evaluation related research questions

The goal of this research is to implement a big data related technique to create business value. To test whether the developed model achieves these goals and to what extent, the following research questions are formulated.

5. Does the model give reliable outputs?

6. What is the performance of the model?



2.3 Requirements specification

The defined gains are the main objective for the model to achieve. However, it can only be a success in case it adheres to certain requirements, like having a manageable model run time. The requirements specification states these boundary conditions in terms of design requirements and scope of the research.

2.3.1 Design requirements

The design requirements originate from three different sources: customers, business and technology. The customer requirements are needed to make sure the functionality is accepted and used by the customers. These requirements are especially important since ORTEC Adscience will require customers to pay for the use of the functionality. The requirements of ORTEC Adscience are rooted in practicality or cost reduction. Finally, the technical requirements stem from the servers available to perform this research on.

Customer requirements

• The model should produce matches that increase the campaign performance.
• The model should help stop advertising to users that converted.

• The model should be able to match cookies as soon as a campaign starts.

ORTEC Adscience requirements

• The model should have an acceptable run time.

• The model should need the lowest5 amount of available hardware possible for data processing and model running.

• The model should be able to present the confidence of the matches it made.
• The model should be generally explainable to customers.

• The model should not need human intervention or interpretation when running operationally.

Technical requirements

• The model should be able to run on a server with 20 CPUs and 48GB of RAM, since this is what is currently available within ORTEC Adscience.

• The model should only use software available for Linux.

This concludes the requirement specification. In the next section the scope is defined.

5 Actually, it should need the cheapest server specifications. However, for simplicity and manageability, lower specifications are assumed to be cheaper.



2.3.2 Scope

A model is always an abstraction of reality. The goal of the scope definition is to delineate in general terms the part of reality that will be modeled. The scope categories are campaign types, customer, geography, aggregation, and chronology.

Campaign types

CCM can be applied to both retargeting and prospecting. In case of prospecting it helps targeting the consumer through multiple devices and controlling the number of impressions a consumer receives, just as well as with retargeting. However, prospecting is done to unknown or previously unencountered consumers. Hence, matching then needs to be done in real time and the number of cookies is unknown up front. With retargeting, the people that have visited a site are known and their data can be collected for the desired time period. Hence, restricting the scope to retargeting helps to control the magnitude of data and simplifies the modeling process.

Customer

The retargeting cookies of a certain customer are used to further restrict the magnitude of data. With more than 45.000 unique visitors a day, it was certain that enough viable cookies could be tracked, while providing a good estimate of the number of site visits that had to be tracked.

Geographical

Only consumers visiting the Dutch site of the customer were tracked. Also, the cookies of these consumers had to have at least one page visit while being in the Netherlands. This excludes foreigners that visited the site.

Aggregation

The aggregation level of the input data is very low. It is actually the lowest level of aggregation possible, i.e. the highest level of detail, and as such the scope is not restricted on this category.

Chronological

The features express the frequency of page visits by a cookie in terms of the hours in a day and the hours in a week. However, multiple weeks could be accumulated in these features. Fewer weeks means less data and hence less data preparation time. More data means better specificity, i.e. more details about the browsing behavior. However, not only does the data preparation time go up with more data, the data collection time does too. As a balance between preparation time and specificity, two weeks of data are used.

2.4 Chapter conclusion

This chapter presented expected performance gains of 20% to 32% in conversion efficiency. The identified service gains are that CCM is a low cost option versus commercial alternatives, that bigger customers have a demand for CCM, and that it helps brand safety through cross-cookie frequency capping. The research questions are environment related (business and literature), design related and evaluation related. The requirements for the CCM model are sourced from customers, ORTEC Adscience and a technical perspective. Mainly, the model should enhance campaign performance, help frequency capping, and have an acceptable run time using the lowest technical specifications possible within the available servers at ORTEC Adscience. The scope of the research is reduced to retargeting for a single customer, with a geographical scope limited to the Netherlands. This enables the use of the lowest level of aggregation (the most detail) in terms of bid requests as input data. Two weeks of input data is used. With the business understanding worked out, the next



chapter will present the literature related to retargeting performance and the latest research related to CCM.


Chapter 3

Literature

This chapter describes the academic foundation of the research regarding two main topics. The first section reviews the effectiveness of retargeting versus prospecting and the second section discusses the current state of research on CCM.

3.1 Retargeting

As explained in chapter 2, applying CCM is aimed at improving the business by expanding retargeting opportunities. However, the nature of retargeting makes it easy to overestimate the impact. Since retargeting is aimed at a group of consumers who already showed interest in a certain product (e.g. by visiting the website or app), they are already more likely to convert, even without ads. Potentially, retargeting could even reduce their conversion rate if the continuous showing of the ads is regarded as badgering and creates a negative emotion. To investigate the true contribution of retargeting, a literature review is performed.

According to Sahni et al. (2017), retargeting causes almost 15% more consumers to come back to the website within a month’s time. They also oppose the common understanding that retargeting works by informing consumers or reminding them of the advertised product. They propose that it is actually the competition-suppressing effect that adds most to retargeting success. Also, they show that retargeting is more effective right after a consumer visits a site, rather than later. CCM could improve retargeting in line with both of these findings: by enabling advertising among more of the consumer’s devices, which would increase suppression of competitors; and simultaneously, increasing frequency straight after the site visit.

Ghose & Todri (2015) measure an average uplift of 26% in conversions through retargeting. Their measurements state a decrease in site visits, but as Johnson et al. (2015) show, this could be due to the lack of a control group. Johnson et al. (2015) also confirm the effectiveness of retargeting by measuring a 17% uplift in website returns and an 11% uplift in purchases. Through the use of a control group and by



providing measurements on conversions, they directly prove the business value of retargeting. They further note that their results are reduced by users wiping their cookies, which prevents further retargeting. As stated in chapter 2, matching cookies within a device, and hence matching the new cookie to the old cookie, is also within the scope of this research. Thus, overcoming the problem of cookie wiping will further increase performance.

3.2 Cross-cookie matching

Very little general research has been done on CCM. Much of the recent strides were made through competitions. Prominent examples are the Drawbridge Cross-Device Connections challenge1 held for the 2015 International Conference on Data Mining2 and the CIKM Cup3 held for the 2016 Conference on Information and Knowledge Management4. The results of these competitions provide most of the papers referenced below.

Koehler et al. (2013) propose a method, extended in Koehler et al. (2016), to correct campaign performance indicators through cross-device measurements. Using ad server logs, publisher-provided user data, census data and a representative panel (which is deterministic data), they adjust high-level measurements by correcting for cross-device impact. Though they perform extensive cross-device analysis, their approach is aimed at an aggregated level and does not provide estimates for individual users.

Selsaas et al. (2015) formulated the cross-cookie matching problem as a recommendation engine problem. They use field-aware factorization machines, which are good at modeling interactions between prediction variables. Since the performance of field-aware factorization machines declines with the increasing size of input matrices, it does not fit the large-scale application at ORTEC Adscience.

Cao et al. (2015) use confidence-based ensemble learning, which was introduced by Buthpitiya et al. (2014). The performance of this technique was as good as the best Gradient Boosting Machines (Friedman, 2001), specifically AdaBoost, at the time. However, this technique only outperformed AdaBoost with low data availability. In this research, data is not an issue. Moreover, new and much improved Gradient Boosting Machines have been introduced since then; hence, the confidence-based ensemble learning technique is disregarded as well.

The model that does fit this research is described by Tay et al. (2016), which follows much of the main model concepts of Kim et al. (2015), though the latter uses very different features as model input: theirs are heavily dependent on IP addresses, while those of Tay et al. are fully based on browsing times and visited pages. Their methods work well with large data sets, focus on the individual user, and both models use boosting techniques (as do others like Landry et al. (2015), Kejela & Rong (2015) and Díaz-Morales (2015)) with Natural Language Processing (NLP) techniques to create input features. Kim et al. (2015) use the NLP technique Term Frequency-Inverse Document Frequency and Tay et al. (2016) add Doc2Vec models to this approach. These are the techniques selected to be incorporated, and where possible improved, in this research. As such, they are further explained in chapter 5.

1 https://www.kaggle.com/c/icdm-2015-drawbridge-cross-device-connections
2 https://icdm2015.stonybrook.edu/
3 https://competitions.codalab.org/competitions/11171
4 http://cikm2016.org/
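To illustrate the TF-IDF part of the selected approach: each cookie's visited URLs can be treated as one "document", and the cosine similarity between two cookies' TF-IDF vectors then becomes a candidate feature for the pairwise classifier. The sketch below uses scikit-learn and made-up cookie data for illustration; it is not the thesis implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each "document" is the space-separated list of urls a cookie visited
# (hypothetical example data; 56_A and 56_B belong to the same user).
cookie_docs = {
    "56_A": "news.example/sports news.example/sports/football shop.example",
    "56_B": "news.example/sports/football news.example/finance",
    "99_A": "recipes.example/pasta recipes.example/dessert",
}

ids = list(cookie_docs)
# Treat every whitespace-separated url as one token.
tfidf = TfidfVectorizer(token_pattern=r"\S+").fit_transform(cookie_docs.values())

# Pairwise cosine similarities; cookies of the same user should score higher.
sim = cosine_similarity(tfidf)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        print(ids[i], ids[j], round(sim[i][j], 2))
```

Cookies 56_A and 56_B share a visited url and hence get a positive similarity, while 99_A shares nothing with the others and scores zero; the classifier learns a threshold on such features rather than using a fixed cut-off.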

3.3 Chapter conclusion

This chapter showed that retargeting has the potential to raise conversions between a reported 11% and 26%, proving the business relevance of increasing retargeting performance. Furthermore, the current state of research in CCM was described: most of the recent research was performed through competitions. The techniques applied in the best performing models, like natural language processing techniques and gradient boosting machines for classification, are selected for use in this research.


Chapter 4

Data understanding and preparation

What type of techniques can be used for CCM is heavily dependent on the available data and its characteristics. This chapter first describes how the data collection is performed. Next, a description is given of the data characteristics and, specifically, of variables that could add value to the variables identified in the literature research. The third section explains the data transformations used to prepare the data for the model, and follows with a description of the procedure used to simulate ground truth data for the ORTEC Adscience data, based on CIKM data distributions.

4.1 Data collection

The available data within ORTEC Adscience that relates to consumers originates from five different sources: website pixels, impression handlers, bid data, the user interface (dashboard) and bidrequests.

Pixels

Website pixels are pieces of code that are placed on the site of the customer. These allow a cookie to be dropped for every consumer -or on every device- visiting a customer's website. The pixel sends signals to ORTEC Adscience servers to log the number of visits per consumer and which pages they visited. When such a signal is received, the visited page is checked for so-called segments; if it has one or more segments, the consumer is placed in those. Segments are used for retargeting campaigns to group certain consumers. Each segment contains the cookie ids of consumers that visited a certain page. In this research, the segments are used to select the cookie ids we want to find matches for.

Impression handlers

Impression handlers are the nodes that receive a signal from the SSP when ORTEC Adscience wins an auction. This is very important, since in an auction you only pay when you win; so you need to keep track of the amount of money spent. The signal contains the bid id and the winning price. Because of the second price auction, the price depends on the bids of others and is thus unknown until the notification. This signal does not contain any additional user information, so it is not used as a data source in this research.

Dashboard

The dashboard is used by customers to edit the settings of their campaigns, upload banners, define segments and check on statistics to keep track of campaign progress. Any creation of campaigns, uploading of banners or editing of settings generates data which is used for bidding. Apart from the definition of segments, it has no influence on user data and is hence of no importance to this research.

Bid data

Bid data is created at the moment a bid is placed: how much was bid, on what bidrequest and some logging on how the price was determined. Since ORTEC Adscience does about 180 million bids a day, this amounts to a lot of data. The bid data is mostly used to store info on the bids we actually win and is not needed anymore after we have received a signal that we won a bid, or after 10 minutes, when it is quite certain no impression notification will come in anymore. As the bid data contains no additional user info in comparison to bidrequests, it is not used in this research.

Bidrequests

As explained in chapter 1, bidrequests contain the most information. These requests contain a lot of information about the user loading a page, making them the most important data source of this research. They are also the largest data source per document, as well as the most abundant, when compared to the other data types. To give an idea about the size of this data: we receive around 1 billion bidrequests every day, which in compressed form still generate about 8 gigabytes of data every hour. For debugging purposes, this data is already being logged for 24 hours, so it is available. However, one of the main challenges of this research was finding an efficient way to use all this data. For the scope of this research, a customer segment was chosen that tracks a site with approximately 45.000 unique consumers visiting the homepage every day. For every bidrequest coming into the system, a check is done whether the user belongs to this segment and, if it does, the bidrequest is stored in a MySQL database. This leads to the logging of around 5.000.000 bidrequests every day.

4.2 Data description

Laney (2001), while working for Gartner, introduced the 3 Vs of Big Data to describe the difference between normal data and big data. Many have since proposed to add other Vs, of which the 5 V model has emerged as the most widely used. As the previous section showed, there is a lot of data going through the ORTEC Adscience system every day, which can indeed be characterized as big data. The 5 V model is used in the following section to describe this data.

4.2.1 Big data Vs

Volume

Data volume is determined by the amount of storage needed to keep all the incoming and generated data in a system. The processes described above, including the storage of statistics and won impressions, amount to an estimated 10 gigabytes per hour for ORTEC Adscience. The model uses two weeks worth of data. Without any selection, processing and filtering this would lead to a total of around 14 billion rows of data. In terms of disk space that would be around 3.3 terabytes. Most modeling techniques require the data to be loaded in random access memory (RAM). Decent sized servers have about 64 GB of RAM, which is barely 2% of the total volume required. Hence the input data can be regarded as very high volume.

Velocity

The speed with which new data is generated, and how quickly it needs to be available for interpretation, is described as the velocity of the data. The speed of incoming data at DSPs is measured in queries per second (QPS), where 1 QPS equals one bidrequest per second. Currently, ORTEC Adscience receives over 11.500 QPS. All these bidrequests need to be evaluated for all the campaigns in the system and all the corresponding banners of those campaigns. Since the responses to these requests need to be back at the SSP within 100 milliseconds, they are processed in real time. At ORTEC Adscience, this response is given within 0.5 to 1 millisecond. Within that time, each banner is run through a calculation and the overall strategy for campaigns is taken into account. The calculation estimates the probability that a user will click on the banner and/or buy something afterwards. The campaign strategy means considering the influence of budget pacing on the price that is bid. With 11.500 QPS and the very limited time of action for the system, the data can be regarded to have a very high velocity.

Variety

Instead of having a single structured database as a data source, nowadays sources show great variety in both their structure (or lack thereof) and type. The bidrequests are unstructured, which means they have a varying number of attributes that are not always known yet. Also, the requests have a nested JSON structure (JavaScript Object Notation1), which further complicates data analysis methods, since those need structured data. As described above in section 4.1, there are a number of different data sources used and although their structures are completely different, their types are mostly JSON objects and do not have the kind of variety seen in other Big Data projects (like video, social media messages, images, voice). So, the data is unstructured but well formatted and is hence regarded to have medium variety.
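Because the bidrequests are nested JSON, a flattening step is needed before tabular analysis. A minimal sketch of such a step is shown below; the field names are made up for illustration and do not reflect the actual bidrequest schema, which differs per SSP.

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten a nested JSON object into dot-separated keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

# Hypothetical, heavily simplified bidrequest.
raw = '{"id": "abc", "device": {"os": "Android", "geo": {"country": "NL"}}}'
print(flatten(json.loads(raw)))
# {'id': 'abc', 'device.os': 'Android', 'device.geo.country': 'NL'}
```

The flattened keys can then serve directly as column names in a relational table or DataFrame.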

Veracity

The trustworthiness or veracity of the data describes how accurate the information actually is. In the bidrequests many parts of the data are unreliable, e.g. location data and page category data. Location data is contained in almost all bidrequests and is usually derived from the IP address. First of all, there is already uncertainty in the process of linking IP addresses to locations. On top of that, IP addresses themselves can be unreliable; e.g. they might be rerouted through a company server or a VPN proxy. Both create IP addresses that are often unrelated to the current location. Regarding the category of the page the consumer is loading, like sports or finance, uncertainty is also present. As these classifications are often made by automated topic models, their precision will inherently be less than perfect. Knowing that veracity is an important aspect of the bidrequest, it is advised to use as many bidrequests per consumer as are available. This can help cancel out as much of the noise as possible.

Value

The final 'V' is about value. There is no use in collecting and storing massive amounts of data without being able to extract the value from it. This research is all about how to extract predictive value from the available data sources with a collection of the latest techniques in machine learning.

4.2.2 Variables for use in further research

Exploratory data analysis is performed on the bidrequests to get familiar with the data, assess the data quality, detect interesting subsets and identify interesting variables for use in the modeling phase. As indicated in section 1.5 (and further explained in ), the lack of deterministic data means that only the timestamps of page visits and the urls of those pages can be used. However, ORTEC Adscience aims to find a customer with deterministic data in the future, which would enable the inclusion of extra variables. Hence, effort is made to identify variables that have the potential to boost performance in such a situation. The following variables are identified:

• domain category
• country
• region
• city
• gps location
• operating system (OS)
• language
• browser type

Domain category is a classification of the url into a certain category, or topic, like sports or finance. The urls do not necessarily contain the topic of a page and if they do, the currently selected NLP techniques do not interpret the topic. As a consequence, it has the prospect of adding much predictive value. Country, region, city and gps location can be used to introduce a geographical component to the model, which is also likely to enhance performance; for instance, by defining the geographical overlap of two cookies. Language contains the language setting of the OS of the user, which might be the same across devices. The same goes for browser type, which refers to the brand of the browser, such as Firefox or Chrome. A further 20 mobile traffic related variables are available, but their data is too sparse to advise use in further research.



4.3 Data cleaning and transformation

The following data preparation steps are performed to ensure the data is ready to be used by the different modeling techniques in the modeling phase. The first section describes how the ORTEC Adscience data is prepared, and the second section describes how ground truth data is simulated to enable supervised learning.

4.3.1 Input data preparation

The ORTEC Adscience data is unstructured, noisy and contains a lot of information that is not needed or is formatted in a way that the models cannot handle. The following processing steps are performed to prepare the data for the modeling phase.

1. Drop all columns that are not relevant.
2. Replace all indicators of missing values (like 'none') with NaN.
3. Check the type of values and the percentage of missing values in each column.
4. Sequence the data by sorting ascending on the cookie id, datetime and page.
5. Strip the pages of http-like prefixes to the actual domain name.
6. Create url hierarchy columns for input to TF-IDF and Doc2Vec models.
7. Drop all events of page loads that have identical hierarchy 3 urls in a 10-minute window.
8. Add a column to identify the browser sessions of each user.

Most of these steps are trivial, but the url hierarchy (step 6), event dropping (step 7) and browser session identification (step 8) are explained further.

Url hierarchy

Url hierarchy refers to the structure of the page urls. For instance, the hierarchy of a.com/b/c consists of three levels: a.com, b and c. For every url, 4 new columns are created with the url up to level 0, 1, 2 and 3. Level 0 would be a.com, level 1 a.com/b and in this case both level 2 and level 3 are a.com/b/c, as there is no deeper level 3 specified. A combination of these columns is used in the Doc2Vec models described in chapter 5.

Event dropping

The dropping of urls for certain page load events is done for two reasons: increasing performance and reducing data. As will be explained further in the next chapter, the urls play a major role in identifying users. It is especially important to know which different urls are visited and in which order. Hence, having page loads of the exact same page in sequence will not add value. However, more memory and run time are needed to process these rows and they also reduce the strength of the TF-IDF and Doc2Vec models. The data contains many duplicates, mainly due to open browser tabs that aren't actively used by the consumer, but are reloaded by the browser every few minutes.

Browser sessions

Finally, a column is set up to identify the browser sessions of users. This is a processing step that helps create the ground truth data. A browser session is defined as all consecutive events of a user with less than 30 minutes of interval in between. As such, a new browser session is identified when a user has not loaded a page for 30 minutes or longer.
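Steps 6 and 8 can be sketched as follows with pandas. The column names and example events are assumptions for illustration, not the actual thesis schema.

```python
import pandas as pd

def url_levels(url, max_level=3):
    """Split a stripped url like 'a.com/b/c' into hierarchy levels 0..max_level,
    repeating the deepest level when the url has fewer parts (step 6)."""
    parts = url.split("/")
    return ["/".join(parts[: min(i + 1, len(parts))]) for i in range(max_level + 1)]

# Hypothetical example events for one cookie.
events = pd.DataFrame({
    "cookie_id": ["56"] * 4,
    "datetime": pd.to_datetime(["2017-08-01 09:00", "2017-08-01 09:10",
                                "2017-08-01 11:00", "2017-08-01 11:05"]),
    "page": ["a.com", "a.com/b", "a.com/b/c", "a.com"],
}).sort_values(["cookie_id", "datetime"])

# Step 6: one column per hierarchy level.
for lvl in range(4):
    events[f"url_lvl{lvl}"] = events["page"].map(lambda u, i=lvl: url_levels(u)[i])

# Step 8: a new browser session starts after a gap of 30 minutes or more,
# computed per cookie with a grouped cumulative sum over the "new session" flags.
gap = events.groupby("cookie_id")["datetime"].diff()
events["session_id"] = (gap >= pd.Timedelta("30min")).astype(int) \
    .groupby(events["cookie_id"]).cumsum()
```

In the example, the two morning events fall in session 0 and the two late-morning events (after a 110-minute gap) in session 1.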



4.3.2 Simulating ground truth data

Supervised learning

Both the boosted gradient tree model and the four Doc2Vec models are supervised machine learning techniques. Supervised means that the model learns by knowing whether the outcome is positive or negative. It starts with default parameters that transform the input data to outputs.

Ground truth

A loss function then calculates the difference between the calculated outputs and the truth, the so-called ground truth. Then, the model changes the parameters in an attempt to reduce the difference between the current outputs and the ground truth. In other words, it tries to reduce the loss.

Deterministic data

One way to have ground truth is to have a deterministic data set. For instance, a customer might require users to always log in on any device to make use of their services. Since the login account is strictly personal, one knows for sure that the used devices belong to the same person. Only very few companies both require this login and have a lot of users; this is mostly restricted to companies like Google or Facebook. To overcome this deficit, the ground truth data is simulated using distributions learned from the CIKM cup data, which does contain ground truths.

Subcookies

The main concept of the data simulation is to filter the ORTEC Adscience data to cookies with many events and split their page visits into a number of new subcookies. The subcookies belong together, since they originated from the same user. For example, user 56 will receive cookie ids like 56_A, 56_B, etc. The biggest challenge

Data: events, sessions, extracted CIKM distributions
Result: subcookie id for each event
for each cookie id do
    nr_splits = sample_CIKM_split(nr_events);
    overlapping = sample_CIKM_overlap(nr_splits);
    if nr_sessions < nr_splits then
        subcookie id in session = session id
    else
        if not overlapping then
            cuts = random sample session ids at which to cut;
            for i in range(nr_sessions) do
                for cut in cuts do
                    if i <= cut then
                        subcookie id in session = cuts.index(cut);
                    end
                end
            end
        else
            // overlapping
            subcookie id in session = random sample range(0, splits);
        end
    end
end

Algorithm 1: Splitting cookies into subcookies
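A Python rendering of the splitting logic above is sketched here. The two sampling functions are placeholders standing in for the CIKM-derived distributions of Appendix A (the dummy rules inside them are illustrative, not the real distributions), and the function names are assumptions.

```python
import random

def sample_cikm_split(nr_events):
    # Placeholder: the thesis samples this from a CIKM-derived distribution.
    return min(1 + nr_events // 10, 9)

def sample_cikm_overlap(nr_splits):
    # Placeholder: overlap probability also comes from a CIKM-derived distribution.
    return nr_splits > 1 and random.random() < 0.3

def assign_subcookies(session_ids):
    """Map each session of one cookie to a subcookie index."""
    ordered = sorted(set(session_ids))
    nr_sessions = len(ordered)
    nr_splits = sample_cikm_split(len(session_ids))
    if nr_sessions < nr_splits:
        # Not enough sessions to split: one subcookie per session.
        return {s: s for s in ordered}
    if sample_cikm_overlap(nr_splits):
        # Overlapping time spans: sessions are spread randomly over subcookies.
        return {s: random.randrange(nr_splits) for s in ordered}
    # Non-overlapping: cut the ordered sessions into consecutive blocks,
    # forcing the last cut onto the last session so every session is covered.
    cuts = sorted(random.sample(ordered[:-1], nr_splits - 1)) + [ordered[-1]]
    assignment = {}
    for s in ordered:
        for idx, cut in enumerate(cuts):
            if s <= cut:
                assignment[s] = idx
                break
    return assignment

random.seed(42)
print(assign_subcookies(list(range(12)) + [0, 1]))
```

In the non-overlapping branch the subcookie indices are non-decreasing over time, mimicking a user who wipes cookies; in the overlapping branch they interleave, mimicking multiple devices in parallel use.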



of splitting the page visits is to get the resulting data to reflect the characteristics of the real-life data. Hence, discrete distributions are derived from the CIKM data for two characteristics:

• the number of cookies (splits to perform) per user, which depends on the number of events of the user;

• the chance of these cookies having overlapping time spans, based on the number of cookies per user.

Derived CIKM distributions

The derived CIKM distributions can be found in Appendix A and algorithm 1 shows the logic applied to the cookie splitting process. The use of sample distributions based on real-life data ensures that the resulting simulated ground truth data contains a realistic number of subcookies per cookie. Note that in the CIKM data (and hence in the derived sample distributions) there is a maximum of 9 subcookies per cookie, which seems high when considering only devices. However, on these devices people can wipe their cookies and continue with a new cookie; hence the data contains both cross-device cases as well as intra-device cross-cookie cases.

Intra-device cross-cookie matching

In cases of intra-device cross-cookie matching, the time spans of the cookies can never overlap. In the cross-device case, time spans can overlap, but they don't have to. Hence, by sampling from the derived distribution the overlapping is not enforced, but based on chance.

4.4 Chapter conclusion

This chapter has shown that the ORTEC Adscience data is truly Big Data by characterizing it using the 5 Big Data V’s. ORTEC Adscience specific variables for use in further research are identified and the data preparation process is defined. Finally, since the ORTEC Adscience data does not contain a ground truth, an algorithm is defined to simulate ground truth data using distributions from the CIKM cup data to capture characteristics of real-life data.


Chapter 5

Modeling

With the requirements specified, relevant literature researched and data prepared, the model can be designed. This chapter describes the different components of the model, considerations on parameter setting and all the major improvements made. The first section explains the different components of the model and the second section the calibration to optimize performance.

5.1 Model overview

As stated in chapter 3, the model is adapted and improved from Tay et al. (2016). In general, the model creates profiles of the cookies based on which sites are visited and at what times. Pairs of cookies are then compared based on similarities between their profiles, and from ground truth data the model learns how to use these similarities to estimate the probability that two cookies belong to the same consumer. The approach of Tay et al. consists of 6 components: candidate selection, feature generation, pairwise classification, supervised inference, unsupervised inference and final candidate selection. The final candidate selection is performed because the CIKM Cup requires contestants to enter a fixed number of candidates. For ORTEC Adscience the number of matches is not limited, so this step is removed from the model. The inference methods are used to push the performance just enough for the contestants to win the competition. These steps are also removed, because they provide relatively little improvement for the run time they require and the modeling complexity they entail. Implementing these techniques is not deemed worth the effort in an operational business environment.

In Figure 5.1, a system diagram of the model boundary with its inputs and output is shown. The environment variables are inputs that cannot be influenced; they are fixed. For instance, the urls that consumers visit and the times at which they do so are a given. The instrument variables are inputs that can be, and are, changed in the research: the parameters of the modeling techniques and the features generated to help prediction in the decision trees. The variables of interest are the outputs on which the model is optimized and which determine its success; here, the output is the set of matched cookie pairs. Within the model boundary the remaining model components are shown; these are described in more detail in the following sections.

Figure 5.1: System diagram of the ORTEC Adscience model (environment: time of visit, pages visited; instruments: TF-IDF, Doc2Vec (4x), kNN (4x), GBM, # candidates per kNN, cut-off value; variable of interest: matched pairs)

5.2 Candidate selection

The most thorough approach to finding matches is to consider each possible pair. When comparing 3 cookies, there are 3 pairs; for 10 cookies there are already 45, and with 10,000 there are almost 50 million. The group of cookies to find matches for in this research is around 45 thousand, which would require evaluating over 1 billion pairs. This complexity is quadratic, or O(n²) in the mathematical big O notation (Bachmann, 1894), and is one of the computationally most challenging kinds. Considering all these pairs is not achievable within an acceptable run time. Hence, candidate selection is performed to establish a much smaller set of candidate pairs that will be considered for matching by the pairwise classification model. Since the model relies on comparing the browsing behavior of cookies, this same information is used to select candidates. First, two techniques for creating profiles are explained, followed by a description of the selection process.
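The pair counts above follow directly from the binomial coefficient C(n, 2) = n(n-1)/2, which a few lines of code make concrete:

```python
def n_pairs(n: int) -> int:
    """Number of unordered pairs among n cookies: C(n, 2) = n*(n-1)/2."""
    return n * (n - 1) // 2

print(n_pairs(3))       # 3
print(n_pairs(10))      # 45
print(n_pairs(10_000))  # 49,995,000 -> almost 50 million
print(n_pairs(45_000))  # 1,012,477,500 -> over 1 billion
```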

The model applies Natural Language Processing (NLP) techniques to create the profiles. NLP is normally used to interpret natural language by extracting meaning from text, performing translations or generating text. For instance, Google applies NLP for its translation service Google Translate (Google, 2017). The power of these techniques in this domain is to retrieve the meaning of e.g. words or sentences by looking at the surrounding words or sentences. In the ORTEC Adscience model the urls are considered words and the click-stream of a cookie is the sentence. Each cookie has a single click-stream, so the document of each cookie has a single sentence. The so-called corpus is a collection of one document per cookie, containing one sentence with all the visited urls in chronological order.
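Building such a corpus from raw page-visit events can be sketched as follows; the event tuple layout (cookie id, timestamp, url) is an assumption for illustration, not the actual ORTEC Adscience schema.

```python
from collections import defaultdict

def build_corpus(events):
    """events: iterable of (cookie_id, timestamp, url) tuples.
    Returns one 'document' per cookie: its urls in chronological order."""
    streams = defaultdict(list)
    for cookie_id, _ts, url in sorted(events, key=lambda e: e[1]):
        streams[cookie_id].append(url)
    return dict(streams)
```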

The first of the two techniques applied to the corpus is Term Frequency - Inverse Document Frequency (TF-IDF), based on the original work of Sparck Jones (1972). The idea behind TF-IDF is to capture the characteristics of a document by counting the frequency of each word it contains and relating it to the frequency of that word in the entire corpus. A word that appears often in a document can be deemed an important identifier for that document. For instance, a document that contains the word "Nigeria" frequently and the word "oil" a few times is probably mostly about the country; the other way around, the document would probably be about oil and merely mentions "Nigeria" in that relation. Counting the frequency of words in a document thus helps to capture the unique features of the document, although it will not help much for words that are very common in general, like "the" or "not". In TF-IDF the frequency of each word in a document is therefore weighted down by how often that same word occurs in the other documents. This way, words that are typical for this document in relation to the other documents in the corpus score high. In this research it is used to find the urls that are typical for each cookie. Since TF-IDF is all about identifying unique urls, the technique is performed with a corpus containing urls up to the highest hierarchy: hierarchy 3. The advantage of TF-IDF is that it is very good at finding the urls that matter for a certain cookie and is insensitive to urls that are visited a lot by many users, like major news sites. The disadvantage of TF-IDF is that it does not capture information about the sequence of page visits, but only looks at the number of occurrences.
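A minimal sketch of the TF-IDF weighting described above, applied to cookie click-streams; it uses the plain tf(u, d) * log(N / df(u)) formulation without the smoothing or normalisation a production implementation would add, and the urls are illustrative.

```python
import math
from collections import Counter

def tfidf_profiles(corpus):
    """corpus: {cookie_id: [url, url, ...]} -> {cookie_id: {url: weight}}.
    Weight = term frequency in the cookie's stream * log(N / document frequency)."""
    n_docs = len(corpus)
    df = Counter()                        # in how many cookie streams each url appears
    for urls in corpus.values():
        df.update(set(urls))
    profiles = {}
    for cookie, urls in corpus.items():
        tf = Counter(urls)
        profiles[cookie] = {u: c * math.log(n_docs / df[u]) for u, c in tf.items()}
    return profiles
```

Note the behavior this yields: a url visited by every cookie gets weight 0 (log(N/N) = 0), while a url unique to one cookie scores high, which is exactly the insensitivity to popular sites described above.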

The second technique applied is Doc2Vec (Le & Mikolov, 2014), an extension of the commonly used Word2Vec (Mikolov, Chen, et al., 2013; Mikolov, Sutskever, et al., 2013). In Word2Vec each word is represented by a vector based on the words commonly surrounding it. The vector can roughly be interpreted as a series of numbers indicating, for each word in the corpus, how likely it is to be found close to the word the vector belongs to. In fact, a neural network is trained to predict the right word based on the surrounding words; the vector consists of the weights the model assigns to surrounding words when predicting the word the vector belongs to. In Doc2Vec this procedure is extended to create a vector representation of a whole document by processing its sentences and the words in those sentences: a neural network is trained to predict a tag corresponding to the document based on the document contents. In this research the tags are the cookie ids, so the Doc2Vec models learn to predict a cookie id based on the urls visited by that user. The advantage of Doc2Vec is that, since a window of surrounding words is used, the sequence of page visits is captured in the final vector. A disadvantage of Doc2Vec is that it does not correct for urls that are frequent in all documents; users whose page visits are e.g. 95% the same and 5% different get quite similar vector representations. Both TF-IDF and Doc2Vec are used because they counter each other's disadvantages. For each url hierarchy a separate Doc2Vec model is trained, with varying window sizes.

The actual process of candidate selection is based on the premise that cookies of the same consumer show similar browsing behavior. After each of the TF-IDF and
