
A Semantic Approach to Conversion Rate Forecasting

Florian van der Peet

10284966

MSc in Econometrics, Track: Big Data

Date of first submission: December 23, 2016

Supervisor: Dr. N. van Giersbergen

Second reader: Dr. K. Pak

Abstract

The aim of this thesis is to improve the forecast of the conversion rate (CR) for individual keywords per week on Google AdWords. The main challenge in improving the CR forecast is to forecast a CR for sparse data keywords. Three models are introduced that handle this problem. Two of these models each use a different source of semantic knowledge to cluster sparse data keywords: one uses the knowledge of a search engine to determine the similarity between keywords, whereas the other uses Wikipedia. The third model uses the structure within Google AdWords to forecast a CR for sparse data keywords. In addition, an evaluation method is introduced that is able to evaluate the CR forecast for individual keywords per week. The results of this evaluation show that the CR forecast model that uses the knowledge of a search engine outperforms the other models.


STATEMENT OF ORIGINALITY

This document is written by student Florian van der Peet, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


ACKNOWLEDGEMENTS

I would like to thank my supervisor Dr. N.P.A. van Giersbergen for his assistance and guidance throughout this thesis. Furthermore, I would like to thank my colleagues at ORTEC for their support and time. In particular, I would like to thank Dr. M.J. Soomer, L. Voogd and R. Brokkelkamp.


CONTENTS

1 Introduction

2 Google AdWords
2.1 Sponsored search
2.2 Structure

3 AdWords Bid Optimizer
3.1 The model
3.2 CR forecast
3.2.1 Clustering CR forecast
3.2.2 FP model

4 Literature review

5 Methodology and techniques
5.1 Similarity between keywords
5.1.1 Search engine model
5.1.2 Word movers distance model
5.2 CR forecast

6 Evaluation technique
6.1 Individual keyword evaluation method
6.1.1 Evaluation literature
6.1.2 The minimum number of clicks required
6.1.3 True conversion rate estimation
6.1.4 Criteria
6.2 Account level accuracy

7 Data
7.1 Google AdWords structure
7.2 Descriptive statistics
7.3 Train and test data

8 Results
8.1 Evaluation method
8.1.1 Minimum number of clicks required to estimate the conversion rate
8.1.2 Estimation of the true conversion rate
8.2 Model calibration
8.2.1 Individual keyword accuracy
8.2.2 Calibration of the FP model
8.2.3 Calibration of the Search Engine model
8.2.4 Calibration of the Word Movers Distance model
8.3 Model results
8.3.1 Individual level accuracy results
8.3.2 Account level accuracy results
8.3.3 Comparison

9 Conclusion

References


LIST OF FIGURES

2.1 Example of the position of an advertisement on Google
2.2 Account structure within Google AdWords
3.1 Schematic overview of ABO
3.2 CR forecast of the clustering model
4.1 An example of a small semantic network graph
5.1 Semantic model overview
5.2 Example of a short text description on Bing
5.3 Model architecture of the CBOW model and Skip-gram model
5.4 Example of the word movers distance method between two sentences
7.1 Histogram of the group size
7.2 Histogram of the campaign size
7.3 Histogram of the number of clicks
7.4 Split between train and test data
8.1 Absolute size of the CI as a function of the number of clicks
8.2 Density of the true CR estimation
8.3 Optimization level of the true CR estimation
8.4 Optimization level for the FP model and the true CR estimation model
8.5 RWMSE of the SE model for different settings of the similarity parameters
8.6 Computing time of the SE model for different similarity parameter values
8.7 RWMSE of the SE model for different CR forecast parameter values
8.9 Computing time of the WMD model for different similarity parameter values
8.10 RWMSE of the WMD model for different CR forecast parameter values
8.11 Individual keyword accuracy results of the different forecast models on the test data
8.12 RWMSE results of the different forecast models for the training and the test data
8.13 RWMSE results when using a different number of clicks to forecast the true CR
8.14 RWMSE results when using a different number of weeks to forecast the true CR
8.15 Difference in predicted number of conversions of the four forecast models with the realized number of conversions
8.16 Average CR for 26 weeks prior to and during the test data
8.17 Average CR for the whole duration of the data
8.18 Account level accuracy results of the different forecast models on the test data


LIST OF TABLES

2.1 The bidding results between three rivaling firms on Google AdWords
5.1 Term frequency matrix
5.2 TF-IDF vector per document
5.3 Normalized and truncated TF-IDF vector per document
5.4 Vector per keyword
5.5 Three most similar keywords of a keyword
5.6 Unigram noise distribution raised to the power 3/4
6.1 CR and error of an example keyword when forecasting the true CR and a CR of 0%
7.1 Number of occurrences per level
7.2 Descriptive statistics of the database
8.1 RWMSE with clicks for different parameters using the FP model
8.2 Different similarity parameter values for the SE model
8.3 Different similarity parameter settings
8.4 Different similarity parameter values for the WMD model
8.5 Different similarity parameter settings


LIST OF ABBREVIATIONS

ABO AdWords bid optimizer

CBOW Continuous bag of words

CI Confidence interval

CPC Cost per click

CTR Click-through rate

CR Conversion rate

FP Florian van der Peet

MAE Mean absolute error

QVI Query variation index

RMSE Root mean squared error

RWMSE Root weighted mean squared error

SE Search engine

SG Skip-Gram

TF-IDF Term frequency - inverse document frequency


CHAPTER 1

INTRODUCTION

Online advertising is a booming market. The total revenue from online advertising in the United States reached a record of $59.6 billion in 2015, which is 20% higher than in 2014. The biggest player within online advertising is Google, which accounts for roughly 50% of the total revenue in the United States. Worldwide, Google earned a total of $75 billion in 2015 alone, and 77% of this income was earned by Google AdWords. As a consequence, the competition between advertisers on Google AdWords is growing and the demand for more accurate models for Google AdWords is increasing. This thesis aims to improve the accuracy of the forecast of the conversion rate (CR) for an advertiser that uses Google AdWords.

Before the problems regarding conversion rate forecasting can be discussed, we first give a short introduction to Google AdWords. Google shows the websites that are most relevant for each entered search keyword. These results are split into sponsored and non-sponsored search results. Advertisers are able to bid on specific keywords on Google by using Google AdWords. The more an advertiser bids, the better the position the advertisement gets. The question hereby is: what is the optimal bid price for an advertiser to maximize its profit? The optimal bid price depends on three different variables.

1 The position of the advertisement

2 The percentage of potential customers that clicks on the advertisement (CTR)

3 The percentage of potential customers that converts after clicking on the advertisement (CR)

The conversion rate is difficult to forecast, as conversions have both a low quantity and a low probability. This problem can be explained by the marketing funnel. An advertiser targets a certain share of potential customers through Google AdWords. Only a fraction, around 5%, of these potential customers actually click on the advertisement. Afterwards an even smaller fraction, around 1%, also converts. Often a keyword has more than 500 clicks and over 5000 impressions before a customer converts. Most keywords simply do not have that much click activity in a week or even a year. Thus a method is needed that is able to accurately forecast a CR even when there is sparse data. The goal of this thesis is to create an accurate conversion rate forecast model that works both for keywords with a lot of click activity and for keywords with little click activity, from here on called sparse data keywords. The accuracy of the novel forecast models introduced in this thesis is compared to the CR forecast model that is currently used by ORTEC.

Three different CR forecast models are introduced in this thesis. The three models are based on the same idea: a certain number of clicks is needed in order to accurately forecast a CR. The models differ in how they forecast the CR for sparse data keywords. A keyword is considered sparse when it does not have enough clicks to accurately forecast a CR. In such a case additional data is needed. The first model uses the structure within Google AdWords to obtain additional click data: whenever a keyword does not have enough data on keyword level, the required click data is taken from a higher level (group, campaign or account). The other two models estimate the similarity between keywords, each using a different semantic measure. This similarity is then used to borrow additional click data for sparse data keywords.

The working of Google AdWords is explained in more detail in Chapter 2. There, the effects of the bid price on the position, the CTR and the CR are discussed, and the structure within Google AdWords is explained. Readers that are familiar with Google AdWords can skip this chapter. Afterwards, the current model used by ORTEC is discussed in Chapter 3 and the effect of the CR on the optimal bid price is explained. A literature review is given in Chapter 4. Next, two novel CR forecast models are introduced in Chapter 5. Thereafter, two different criteria are given in Chapter 6 in order to evaluate the different forecast models: the individual keyword criterion and the account level criterion. The data used is described in Chapter 7. The results of comparing the four different CR forecast models are given in Chapter 8, where the results regarding the evaluation method are also discussed. Finally, a conclusion is given in Chapter 9.


CHAPTER 2

GOOGLE ADWORDS

This chapter briefly explains how Google AdWords works. First, the different kinds of search results on Google are discussed and how Google uses them to earn money. Next, an explanation is given of why advertisers bid on keywords. Finally, the structure that is used within Google AdWords is introduced.

2.1

Sponsored search

After a keyword is entered on Google, Google looks for relevant search results. These search results are split into two categories: sponsored and non-sponsored search results. Google does not earn any money from the non-sponsored search results, only from the sponsored search results. These sponsored search results are also called advertisements and are shown both before and after the non-sponsored search results, as is shown in Figure 2.1. There is a maximum of eight advertisements per page.

The position that an advertisement gets is determined by Google AdWords. The more an advertiser pays for a keyword, the better the position gets. Advertisers only have to pay their bid price whenever their advertisement is clicked upon. However, Google uses a second-price auction, so an advertiser pays the minimum amount that is required to beat a competitor. An example of the bidding procedure is shown in Example 2.1.


Figure 2.1: Example of the position of an advertisement on Google. Some non sponsored search results are omitted because of space.

Example 2.1: Bidding on Google AdWords between three rivaling firms

Assume that three advertisers A, B and C are bidding on the same keyword on Google AdWords. The position of advertisers A, B and C depends only on the bid price, assuming that the quality score is the same for the three advertisers. The maximum bid prices of advertisers A, B and C are shown in Table 2.1. Advertiser C does not get a position on Google at all, since there is no competitor with a lower bid. Advertiser B is placed on position two, with a cost per click (CPC) of 0.26, so he pays just enough to beat advertiser C. Advertiser A gets position 1 and pays 0.51 for each click.

Table 2.1: The bidding results between three rivaling firms on Google AdWords.

Advertiser Maximum bid price Position

A 1.00 1

B 0.50 2

C 0.25 −

Bidding more money improves the position an advertisement gets on Google. Advertisers generally want a better position, as advertisements with better positions experience a higher CTR (Skiera & Abou Nabout, 2013; Ghose, Yang, et al., 2007). The CTR is defined as

$$CTR = \frac{\#\text{clicks}}{\#\text{impressions}}, \qquad (2.1)$$

with impressions the number of customers that see the advertisement and clicks the number of customers that click on it. A higher bid price leads to an improved position, and a better position experiences a higher CTR. Thus increasing the bid price results in more clicks and potentially in more sales. The percentage of people that click on an advertisement and make a purchase is called the CR, defined as

$$CR = \frac{\#\text{conversions}}{\#\text{clicks}}, \qquad (2.2)$$

with conversions the number of customers that convert.

2.2

Structure

Google AdWords uses a structure where an account of an advertiser is built upon three layers: campaign, group and keyword. A campaign contains groups and a group contains keywords. How this structure is used differs per company; a schematic example is shown in Figure 2.2. For instance, a campaign might be a holiday to Belgium. The groups can then be a city or a region, for instance Antwerp or the Ardennes. Finally, these groups contain relevant keywords for each city or region, like hotel Antwerp.

Figure 2.2: Account structure within Google AdWords.

The structure within Google AdWords provides two main advantages compared to an unstructured group of keywords. First, this structure makes it possible for an advertiser to allocate a maximum daily budget to each group and campaign. Therefore it is possible that all keywords within a campaign or group stop bidding during a day. Second, there needs to be an advertisement text for each advertisement within Google AdWords. The structure allows advertisers to create this advertisement text on the group or campaign level, so that they do not need to create an advertisement text for each individual keyword.


CHAPTER 3

ADWORDS BID OPTIMIZER

ORTEC has created a tool that maximizes the profit an advertiser obtains via Google AdWords by optimizing the bid price of each keyword. This tool is called the AdWords Bid Optimizer (ABO). The derivation of the optimal bid price and the role of the CR forecast within this optimization are given in Section 3.1. Afterwards, the two CR forecast models that are currently used in ABO are discussed.

3.1

The model

The ABO maximizes the profit for an advertiser. A schematic overview of ABO is given in Figure 3.1. There are four forecasts within the ABO, which are numbered in Figure 3.1.

1 The effect of the bid price on the position

2 The effect of the position on the CTR

3 The number of impressions

4 The CR

In ABO the optimal bid price is determined per week for each keyword. Therefore all data is aggregated on a weekly level, instead of the daily level on which it is provided by Google AdWords. This weekly aggregation is used for the remainder of this thesis.
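As an illustration of this weekly aggregation, the short pandas sketch below rolls a daily export up to weekly totals and recomputes the CTR and CR; the column names and sample rows are assumptions, not the actual AdWords report schema.

```python
import pandas as pd

# Toy daily export; in practice this would be the daily Google AdWords report per keyword.
daily = pd.DataFrame({
    "date":        pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-11"]),
    "keyword":     ["hotel antwerp", "hotel antwerp", "hotel antwerp"],
    "impressions": [120, 80, 95],
    "clicks":      [6, 2, 4],
    "conversions": [0, 1, 0],
})

# Aggregate to weekly totals per keyword and recompute the rates on the weekly level.
weekly = (daily
          .groupby(["keyword", pd.Grouper(key="date", freq="W")])
          [["impressions", "clicks", "conversions"]]
          .sum()
          .reset_index())
weekly["ctr"] = weekly["clicks"] / weekly["impressions"]
weekly["cr"] = weekly["conversions"] / weekly["clicks"]
print(weekly)
```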


Figure 3.1: Schematic overview of ABO

The profit of an advertiser on Google AdWords is given by

$$\mathrm{Profit} = \#\mathrm{Conversions}\cdot\mathrm{margin} - \mathrm{cost}_{\mathrm{AdWords}}. \qquad (3.1)$$

The margin is defined as the price of a product minus the (non-Google AdWords) cost of that product. Advertisers only have to pay whenever an advertisement is clicked upon, which gives

$$\mathrm{cost}_{\mathrm{AdWords}} = CPC\cdot\#\mathrm{Clicks}. \qquad (3.2)$$

Substituting gives

$$\mathrm{Profit} = \mathrm{Margin}\cdot\#\mathrm{Conversions} - CPC\cdot\#\mathrm{Clicks}, \qquad (3.3)$$

where the margin is a different constant for each keyword. Therefore the advertiser is only able to directly change the bid price. The maximization of the profit per keyword is given by

$$\max_{Bid_{t+1}} \mathrm{Profit} = \max_{Bid_{t+1}} \mathrm{Margin}\cdot\#\mathrm{Conversions}_{t+1} - Bid_{t+1}\cdot\#\mathrm{Clicks}_{t+1}. \qquad (3.4)$$

Replacing #Clicks with Equation 2.1 and #Conversions with Equation 2.2 gives

$$\max_{Bid_{t+1}} \mathrm{Profit} = (\mathrm{Margin}\cdot CR_{t+1} - Bid_{t+1})\cdot\mathrm{Impressions}_{t+1}\cdot CTR_{t+1}. \qquad (3.5)$$

The number of impressions can be removed, since it acts as a constant:

$$\max_{Bid_{t+1}} \mathrm{Profit} = (\mathrm{Margin}\cdot CR_{t+1} - Bid_{t+1})\cdot CTR_{t+1}. \qquad (3.6)$$


Calculating the first order condition with respect to the bid price gives

$$\frac{\partial\,\mathrm{Profit}}{\partial Bid_{t+1}} = (\mathrm{Margin}\cdot CR_{t+1} - Bid_{t+1})\,\frac{\partial CTR_{t+1}}{\partial Bid_{t+1}} - CTR_{t+1} = 0, \qquad (3.7)$$

where the CR is assumed to be independent of the bid price, the CTR depends on the position and the position depends on the bid price. Next, the forecasts for the position and the CTR are discussed.

Position and CTR forecast

The forecast for the CTR is based on the position, which is based on the bid. First the following two linear regressions are estimated for each keyword:

$$Pos = \alpha_1 + \beta_1\cdot Bid + \varepsilon_1, \qquad (3.8)$$
$$CTR = \alpha_2 + \beta_2\cdot Pos + \varepsilon_2, \qquad (3.9)$$

which often do not have enough data for a reliable linear regression. This is solved by estimating the regressions on the group or campaign level instead of the keyword level. Next the forecasts are made for the position and the CTR. These forecasts are shown as 1 and 2 respectively in Figure 3.1. The forecasts are estimated with

$$Pos_{t+1} = Pos_t + \beta_1\cdot(Bid_{t+1} - Bid_t), \qquad (3.11)$$
$$CTR_{t+1} = CTR_t + \beta_2\cdot(Pos_{t+1} - Pos_t), \qquad (3.12)$$

where β1 is determined in Equation (3.8) and β2 in Equation (3.9). Thus the CTR of the next period is given by

$$CTR_{t+1} = CTR_t + \beta_2\cdot\bigl(Pos_t + \beta_1\cdot(Bid_{t+1} - Bid_t) - Pos_t\bigr). \qquad (3.13)$$
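A minimal sketch of the position and CTR forecasts in Equations (3.8)-(3.13), using ordinary least squares on an invented bid/position/CTR history for a single keyword (the numbers are purely illustrative):

```python
import numpy as np

# Invented weekly history of bids, positions and CTRs for one keyword.
bid = np.array([0.40, 0.45, 0.50, 0.55, 0.60])
pos = np.array([4.1, 3.8, 3.2, 2.9, 2.5])
ctr = np.array([0.030, 0.034, 0.041, 0.044, 0.050])

# OLS estimates of (3.8) Pos = a1 + b1*Bid and (3.9) CTR = a2 + b2*Pos.
b1, a1 = np.polyfit(bid, pos, 1)
b2, a2 = np.polyfit(pos, ctr, 1)

bid_next = 0.65
pos_next = pos[-1] + b1 * (bid_next - bid[-1])   # Equation (3.11)
ctr_next = ctr[-1] + b2 * (pos_next - pos[-1])   # Equations (3.12)/(3.13)
print(b1, b2, pos_next, ctr_next)
```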

The model

Substituting the forecasts for the position (Equation 3.11) and the CTR (Equation 3.13) into Equation 3.7 results in

$$0 = (\mathrm{Margin}\cdot CR_{t+1} - Bid_{t+1})\cdot\beta_1\cdot\beta_2 - CTR_t - \beta_1\cdot\beta_2\cdot Bid_{t+1} + \beta_1\cdot\beta_2\cdot Bid_t. \qquad (3.14)$$

Rewriting gives

$$Bid_{t+1} = \mathrm{Margin}\cdot CR_{t+1} + Bid_t - \frac{CTR_t}{\beta_1\cdot\beta_2}, \qquad (3.15)$$

where CTR_t and Bid_t are obtained using historical data. Furthermore, β1 and β2 are estimated with the regressions in Equations 3.8 and 3.9. Finally, CR_{t+1} is estimated using one of the forecast algorithms defined in Section 3.2. A higher CR leads to a higher bid price, as margin > 0. A higher CTR also leads to a higher bid price, as β1 < 0 and β2 > 0.
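The resulting bid rule (3.15) is a one-line computation; the sketch below evaluates it for illustrative inputs, where the margin, the CR forecast and the regression coefficients β1 and β2 are assumed to come from the steps above (the numbers are not taken from the thesis data).

```python
# Minimal sketch of the bid rule (3.15) as printed above.
def optimal_bid(margin, cr_next, bid_t, ctr_t, beta1, beta2):
    # beta1: effect of the bid on the position (typically negative),
    # beta2: effect of the position on the CTR (typically positive).
    return margin * cr_next + bid_t - ctr_t / (beta1 * beta2)

# Purely illustrative inputs: margin 40, forecasted CR 1%, current bid 0.50,
# current CTR 5%, beta1 = -2.0, beta2 = 0.01.
print(optimal_bid(margin=40.0, cr_next=0.01, bid_t=0.50, ctr_t=0.05,
                  beta1=-2.0, beta2=0.01))
```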

3.2

CR forecast

Recall Figure 3.1, where the CR forecast is numbered 4. This section discusses the two different CR forecast models that are currently being used within ABO. The first one is the clustering forecast model, which was used prior to this thesis. The second forecast model, the Florian van der Peet (FP) model, is introduced in this thesis. The model was implemented for new accounts in ABO after the first tests showed promising results.

3.2.1 Clustering CR forecast

The clustering CR forecast model forecasts the CR in four steps, which are shown in Figure 3.2. First, Y weeks of the data are smoothed by calculating the moving average over X weeks. If there is a conversion in the past Y weeks, then the data is considered sufficient. If there is no conversion, the data on group level is used instead of the data on keyword level. If there is still no conversion, then the data on campaign level is used. If there is still no conversion, then this forecast model cannot forecast a CR for that keyword in that week.

If the data is sufficient, the clustering begins. Each level (keyword, group and campaign) is clustered separately into three buckets using a k-means clustering method, resulting in nine different clusters; e.g. the keywords that have data on group level are clustered without the keywords that have sufficient data on keyword or campaign level. Finally, the forecast is created by calculating the centroid of a cluster four weeks into the past. This is set to four weeks since, prior to that date, it is still possible that Google AdWords attributes conversions to a click.

3.2.2 FP model

The previously discussed clustering forecast model clusters all keywords and not just sparse data keywords. The Florian van der Peet (FP) model solves this problem and only uses additional data for sparse data keywords. The FP model was implemented in the ABO model in July 2016. Algorithm 1 shows the working of the FP model; note that this is a simplification of the real algorithm. First, the minimum number of clicks that is needed to estimate the CR is determined. Second, a maximum number of relevant weeks is chosen. Next, the number of clicks a keyword has in the week prior to the week we want to forecast is calculated. If this number of clicks is less than the previously determined required number of clicks, then another week of data is added.


Figure 3.2: CR forecast of the clustering model

This process is repeated until either there are enough clicks or the maximum number of relevant weeks is reached. In the latter case, data is also used from the level above the current level, with keyword (L=1), group (L=2), campaign (L=3) or account (L=4). Finally, the CR is calculated using a weighted mean, as is explained in Example 3.1.

Example 3.1: Estimation of the true CR

Assume 500 clicks are required to estimate the conversion rate accurately. A keyword has 499 clicks and 0 conversions during the relevant time period, thus 1 click too few. The group level has 10,000 clicks and 25 conversions during the first week. These 10,000 clicks are then weighted as if they were 1 click, since we are only 1 click short of our goal of 500 clicks. This results in

$$CR = \frac{0}{499} + \left(\frac{25}{10{,}000}\cdot\frac{1}{500}\right) = 0.000005 = 0.0005\%. \qquad (3.16)$$

Without the weighting, the CR would be

$$CR = \frac{25}{10{,}499} = 0.24\%. \qquad (3.17)$$

The FP model has two parameters: the minimum number of clicks required and the number of weeks that are taken into account. These parameter values are chosen by minimizing a criterion, which is defined in Chapter 6. Requiring more clicks leads to a CR that is forecasted using more weeks of the past or a higher level. Including more weeks could add bias, e.g. when there is a new website or a new product. In practice it is useful to require fewer weeks in order to forecast the CR, so that less data is needed and the CR can be forecasted sooner for new keywords.


Algorithm 1: Forecast of the CR using the FP model

Data: The number of clicks and conversions for a keyword, and for the group and campaign to which the keyword belongs.
Result: Forecast of the CR

n = 0, the number of weeks we are going into the past
L = 1, the maximum level from which we are using data
while TotalClicks < MinClicksRequired do
    n = n + 1
    if n <= RelevantWeeks then
        TotalClicks = \sum_{l=1}^{L} \sum_{s=t-n}^{t} Clicks_{s,l}, where t is the current week
        TotalConversions = \sum_{l=1}^{L} \sum_{s=t-n}^{t} Conversions_{s,l}
    else
        L = L + 1, the level on which we are forecasting the true CR (keyword, group, campaign or account)
        n = 0
    end
end
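A simplified Python sketch of Algorithm 1 is given below. The data layout (dictionaries of weekly clicks and conversions per level) and the parameter values are assumptions, and the weighted-mean step of Example 3.1 is replaced by a plain pooled CR for brevity.

```python
# Simplified sketch of the FP model (Algorithm 1); data layout and parameters are assumptions.
# clicks[level][week] and conversions[level][week] hold weekly totals, with levels
# 1=keyword, 2=group, 3=campaign, 4=account.
def fp_forecast(clicks, conversions, t, min_clicks=500, relevant_weeks=8):
    total_clicks = total_conversions = 0.0
    n, L = 0, 1
    while total_clicks < min_clicks and L <= 4:
        n += 1
        if n <= relevant_weeks:
            # Pool clicks/conversions over levels 1..L and weeks t-n..t.
            total_clicks = sum(clicks.get(l, {}).get(t - s, 0)
                               for l in range(1, L + 1) for s in range(n + 1))
            total_conversions = sum(conversions.get(l, {}).get(t - s, 0)
                                    for l in range(1, L + 1) for s in range(n + 1))
        else:
            L += 1   # go up one level: keyword -> group -> campaign -> account
            n = 0
    if total_clicks == 0:
        return None  # no click data at all: no forecast
    # Note: the weighted mean of Example 3.1 is omitted; this returns the plain pooled CR.
    return total_conversions / total_clicks
```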


CHAPTER 4

LITERATURE REVIEW

The goal of this thesis is to improve the accuracy of the CR forecast model that is included in the ABO model. First some papers are discussed that estimate the CR. Note that there is little literature about CR forecasting on keyword level, since most studies estimate the CR on account level.

An important study in the CR estimation literature is the study of Ghose et al. (2007). Their study estimates the effects of keyword characteristics on the CR. Keyword characteristics such as the rank, the length and the CTR of the keyword are used. A hierarchical Bayesian framework is used to estimate the effects of these characteristics on the CR; however, this framework is not elaborated in their paper. Ghose et al. (2007) conclude that the rank and the CTR have a significant positive effect on the CR and that the length has no significant effect on the CR. However, the literature does not uniformly agree on the effect of the position on the CR. Agarwal, Hosanagar, and Smith (2011) conclude that the effect of the position on the CR depends on the length of the keyword. Google itself has stated that the position has no effect on the CR (Friedman, 2009).

Agarwal et al. (2011) found that the effect of position on CR depends on the keyword length. They state that longer keywords are more specific. More specific keywords have fewer mismatches between customers and advertisers. However there are more accurate methods to measure how specific a keyword is. Instead of determining the length of the keyword, one could use language knowledge to determine how specific a keyword is. A model that uses natural language knowledge is called a semantic model.

One of the first semantic models to estimate the CR is introduced by Rutz, Bucklin, and Sonnier (2012). Their study examines some semantic properties of keywords. This is based on the assumption that a group of consumers who use a certain keyword have similar objectives and behavior; that is, consumers reveal information about themselves through their choice of search terms. It is not possible to use information about the customers themselves, since Google does not provide it. Therefore the only way to separate customers is through the keywords they enter. The study of Rutz et al. (2012) uses dummies to check whether a keyword has a predefined characteristic, such as a location or a brand. They find significant relations between these predefined characteristics and the CR. A downside of their model is that all characteristics must be chosen by hand. These characteristics differ per account, and which characteristics are chosen affects the accuracy of the model.

Another semantic measure is introduced by Klapdor, Anderl, von Wangenheim, and Schumann (2014). Web search keywords tend to be short and encompass little context information. As a result, there is a certain share of user keywords matched to an advertisement that is irrelevant to the user’s actual informational need. The less specific an advertisement is the more search keywords can be matched to the advertisement, which increases the likelihood of irrelevant matches. The query variation index (QVI) is created to capture how relevant an advertisement is to a user, see Equation (4.1). Klapdor et al. (2014) conclude that the QVI has a significant effect on the CR. Furthermore they find that the length of a keyword has no significant effect on the CR once the QVI is included.

$$QVI_{k,t} = \frac{(\text{Unique successful search queries for keyword } k \text{ in period } t) - 1}{(\text{Clicks for keyword } k \text{ in period } t) - 1} \qquad (4.1)$$

There are many semantic models besides the QVI. The current semantic literature distinguishes two main types of semantic measures: knowledge based and corpus based (Mihalcea, Corley, & Strapparava, 2006). Knowledge based models are built on graph networks that contain information about the relations between terms/concepts. An example of a small semantic graph network in the form of a lexical taxonomy is shown in Figure 4.1. Corpus based models of semantic similarity try to identify the degree of similarity between keywords using information exclusively derived from a large text corpus. In this thesis only corpus based measures are included, as they are easier to use with keywords, which often consist of multiple words instead of a single word.


CHAPTER 5

METHODOLOGY AND TECHNIQUES

Recall the two types of semantic models: corpus based and knowledge based. In this chapter two corpus based semantic measures are discussed. These two measures are defined by Sahami and Heilman (2005) and Kusner, Sun, Kolkin, and Weinberger (2015). Both semantic measures have, to our knowledge, not been used to estimate or forecast the CR. Therefore an extension of the models is required in order to forecast the CR. Sahami and Heilman (2005) used their model to discover new keywords, whereas Kusner et al. (2015) determined the similarity between sentences.

Figure 5.1 shows how the two semantic measures are used to forecast a CR and to evaluate the forecasted CR. The algorithms of Sahami and Heilman (2005) and Kusner et al. (2015) are first explained in Section 5.1. The model defined by Sahami and Heilman (2005) is marked with a 1 in the figure and the model defined by Kusner et al. (2015) with a 2. Both models are split into three phases: keyword enrichment, vectorization method and semantic measure. The working of these three phases is discussed for both models in Section 5.1. After the similarity is defined, the CR forecast is estimated. The algorithm that uses the similarities between keywords to forecast a CR is introduced in Section 5.2. Finally, the CR forecast is evaluated. This evaluation method is discussed in Chapter 6.

5.1

Similarity between keywords

Determining the similarity between keywords works poorly with traditional document similarity measures, since there are often few, if any, terms in common between two keywords (Sahami & Heilman, 2005). Therefore these two models rely on a keyword expansion technique, in which additional information about the keywords is used.


Figure 5.1: Semantic model overview

This step is shown in the keyword enrichment phase of Figure 5.1. The next step is to transform words into vectors, which is the vectorization method phase in the figure. These vectors are then used to calculate the similarity between different keywords using a semantic measure. This is as far as both papers went.

The first corpus model is introduced by Sahami and Heilman (2005). They expand their keywords by using the knowledge of a search engine. Therefore their model is called the search engine (SE) model. The second corpus model is called the word movers distance (WMD) model and is introduced by Kusner et al. (2015). This model uses the knowledge of a large text corpus in order to determine relations between words. The main difference between the two models is that the SE model is based on the knowledge of a search engine, whereas the WMD model is based on general knowledge. First the working of the SE model is discussed, followed by the working of the WMD model.

5.1.1 Search engine model

The algorithm that is introduced in Sahami and Heilman (2005) is shown in Algorithm 2. An example of this algorithm is shown in Example 5.1.

First, every keyword in the data is entered into a search engine. The search engine determines which websites are most relevant for each keyword; the most relevant website is shown first, and so on. Most search engines show a short text description of each website, as is shown in Figure 5.2. The first N short text descriptions are retrieved; these are called documents. Different search engines could be used, however in this thesis only Bing is used. Bing is preferred over Google, since it has a less strict policy with regard to scraping its search results. This benefits both the speed of the algorithm and the certainty of not being blocked. This step corresponds to the data phase of Figure 5.1.

Next is the vectorization phase. First, the frequency of every word in every document is determined. This results in N · #keywords vectors with a length equal to the total number of words in the database.


Algorithm 2: Algorithm to determine the semantic similarity between keywords

Data: All keywords
Result: Semantic similarity measure based on search results

Let x represent a keyword.
1. Issue x as a keyword to a search engine.
2. Let R(x) be the set of (at most) n retrieved documents d_1, d_2, ..., d_n.
3. Compute v_i, the TF-IDF term vector, for each document d_i ∈ R(x).
4. Truncate each vector v_i to include its m highest weighted terms.
5. Let C(x) be the centroid of the L2-normalized vectors v_i:
   C(x) = (1/n) \sum_{i=1}^{n} v_i / ||v_i||_2
6. Let QE(x) be the L2 normalization of the centroid C(x):
   QE(x) = C(x) / ||C(x)||_2
7. The similarity between keyword x and keyword y is then given by the kernel function:
   K(x, y) = QE(x)^T · QE(y)


This is called the term frequency (TF). Some common words, such as 'the', do not help to determine the similarity between keywords. Therefore the frequencies are weighted using an inverse document frequency (IDF) weighting scheme, defined as

$$IDF_i = \log\!\left(\frac{N\cdot\#\text{keywords}}{df_i}\right), \qquad (5.1)$$

where df_i is the total number of documents that contain term t_i. This results in the term frequency - inverse document frequency weighting model

$$\text{TF-IDF}(i,j) = tf(i,j)\cdot IDF(i), \qquad (5.2)$$

where tf(i,j) is the frequency of term t_i in document d_j. The TF-IDF vector of document i is called v_i. Each vector v_i is truncated to only include its m highest terms. Afterwards v_i is normalized. The next step is to calculate the sum of the N vectors v_i that correspond to a keyword, resulting in a total of #keywords vectors. Again the vectors are normalized. These normalized vectors are called QE(x) for keyword x.

Finally the similarity between the different keywords can be determined using the cosine similarity function

$$sim(x,y) = QE(x)^{T}\cdot QE(y), \qquad (5.3)$$

where sim(x,y) is the similarity between keyword x and keyword y. The similarity lies in the range [0, 1], where 0 is totally dissimilar and 1 is totally similar. The three most similar keywords of three example keywords are shown in Table 5.5, where a higher similarity means that the keywords are more similar.

Example 5.1: The working of algorithm 2

This is an example of Algorithm 2, in which all steps of the algorithm are discussed. Assume that in step 1 two keywords are entered into the Bing search engine, called keyword A and keyword B (#keywords = 2). For each keyword, the top two documents are obtained, thus setting N = 2 in step 2. The term frequency matrix of these four documents is given in rows 1-4 of Table 5.1. The fifth row of Table 5.1 shows the IDF weighting vector, which is determined using Equation 5.1. Only the number of documents that contain a certain term affects the IDF vector; the frequency of a word has no effect on it. This can be seen with the words bright and sky: both words have the same IDF value, although bright appears once more than sky.


Table 5.1: Term frequency matrix

Keyword Doc blue bright cloud sky sun the

A 1 1 0 0 1 0 1

A 2 0 1 0 0 1 1

B 1 0 1 1 1 0 2

B 2 0 2 0 1 1 2

IDF vector 1.38 0.29 1.38 0.29 0.70 0

The TF-IDF vector can simply be obtained by multiplying row 1-4 with row 5 of Table 5.1. This results in four vectors, which are shown in Table 5.2.

Table 5.2: TF-IDF vector per document

TF-IDF vector blue bright cloud sky sun the

vA,1 1.38 0 0 0.29 0 0

vA,2 0 0.29 0 0 0.69 0

vB,1 0 0.29 1.38 0.29 0 0

vB,2 0 0.58 0 0.29 0.70 0

The vectors are truncated in step 4 to only include the m highest values; all other values are set to zero. In this example we set m = 2. In addition, the vectors are normalized, which results in Table 5.3.

Table 5.3: Normalized and truncated TF-IDF vector per document

TF-IDF vector blue bright cloud sky sun the

v1 0.98 0 0 0.20 0 0

v2 0 0.39 0 0 0.92 0

v3 0 0.20 0.98 0 0 0

v4 0 0.41 0 0 0.59 0

The sum of the normalized vectors is taken in step 5. Thus we are left with two vectors, C_A and C_B, for keyword A and keyword B respectively. Next, C_A and C_B are normalized, resulting in the vector for each keyword. The resulting vectors are shown in Table 5.4.


Table 5.4: Vector per keyword

blue bright cloud sky sun the

QEA 0.69 0.28 0 0.14 0.65 0

QEB 0 0.47 0.76 0 0.46 0

The similarity between keyword A and keyword B is determined by calculating the inner product between the normalized vectors QE_A and QE_B, resulting in a similarity of 43%.

Algorithm 2 depends on two variable parameters: N, the number of documents per keyword, and m, the number of highest elements kept in v_i. Sahami and Heilman (2005) make no mention of the optimal value of N. Their study sets m = 50, as they found that this gives a good trade-off between representational robustness and efficiency. Another study that uses the same algorithm sets m = 500, since speed is not of importance to them and their database is relatively small. A short text description on Bing is limited to 160 characters; in our case a document has at most 31 words (without stop words). Thus truncating v_i to only include the 500 highest terms is no truncation at all. We will experiment with different settings of N and m. The effect of N and m on both the forecast accuracy and the computing time will be analyzed.
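The sketch below illustrates steps 3-7 of Algorithm 2 on two toy keywords, assuming the short text descriptions have already been retrieved from the search engine. Note that scikit-learn's TF-IDF weighting differs slightly from Equations (5.1)-(5.2), so the numbers will not match Example 5.1 exactly.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy 'documents': the top-N short text descriptions retrieved per keyword (assumed input).
docs = {
    "keyword a": ["blue sky over the sea", "bright sun in the sky"],
    "keyword b": ["bright cloud in the sky", "bright bright sun behind a cloud"],
}
m = 50  # keep only the m highest-weighted terms per document (step 4)

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([d for ds in docs.values() for d in ds]).toarray()

def truncate(v, m):
    # Zero out everything except the m largest TF-IDF weights.
    keep = np.argsort(v)[-m:]
    out = np.zeros_like(v)
    out[keep] = v[keep]
    return out

def query_expansion(doc_vectors, m):
    normed = []
    for v in doc_vectors:
        v = truncate(v, m)
        normed.append(v / np.linalg.norm(v))      # L2-normalize each document vector (step 5)
    centroid = np.mean(normed, axis=0)
    return centroid / np.linalg.norm(centroid)    # L2-normalize the centroid (step 6)

qe, offset = {}, 0
for kw, ds in docs.items():
    qe[kw] = query_expansion(matrix[offset:offset + len(ds)], m)
    offset += len(ds)

# Step 7: cosine-style kernel between the two keyword vectors.
print(np.dot(qe["keyword a"], qe["keyword b"]))
```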

5.1.2 Word movers distance model

Another corpus based similarity measure is introduced by Mikolov, Chen, Corrado, and Dean (2013a), namely the Word2Vec model. A Word2Vec model uses a large corpus to put words into a vector representation. A difficulty that arises with this method is that a keyword can consist of several words. A solution is introduced with the word movers distance (WMD) (Kusner et al., 2015). The WMD is able to determine the semantic similarity between keywords by calculating the "distance" one keyword has to travel in order to become another keyword. First we discuss which data is used as corpus. Then the different Word2Vec models are introduced. Next, the derivation is given of the Word2Vec model that is used in this thesis. Finally, the WMD is discussed.

The Word2Vec models require a large corpus in order to vectorize words. If this corpus is for any reason not sufficiently large, the model performs poorly (Mikolov et al., 2013a). This thesis uses two different corpus sources for the Word2Vec model. The first one is Wikipedia: the entire Wikipedia database will be used to train the Word2Vec model. The second is the corpus that was obtained from Bing in Section 5.1.1. Two different Word2Vec models are introduced by Mikolov et al. (2013a): the continuous bag-of-words (CBOW) model and the Skip-gram (SG) model. The CBOW model predicts a word based on the context, whereas the Skip-gram model predicts the context based on the word. The model architecture is shown in Figure 5.3.

Table 5.5: Three most similar keywords of a keyword

Keyword: Sant Antoni Portmany
  SE model: Sant Joan Labritja (86%), Sant Josep talaia (85%), Huis huren 3 personen Ibiza (74%)
  WMD model: Sant Joan Labritja (0.76), Sant Josep talaia (0.97), Gran Canaria (1.34)

Keyword: bungalow kerst 2016
  SE model: Vakantie kerst 2016 (86%), Vakantie kerst bungalow (84%), Vakantie met kerst 2016 (82%)
  WMD model: kerst bungalow (1.23), vakantie kerst bungalow (1.36), huisje kerst 2016 (1.36)

Keyword: lastminute wintersport Oostenrijk
  SE model: Oostenrijk wintersport last minute (100%), aanbiedingen wintersport Oostenrijk (89%), wintersport goedkoop Oostenrijk (87%)
  WMD model: Oostenrijk wintersport last minute (0.00), aanbiedingen wintersport Oostenrijk (0.56), wintersport goedkoop Oostenrijk (0.69)

(SE model: similarity, higher is more similar; WMD model: distance, lower is more similar.)

In Figure 5.3, w stands for a word. In this study we choose to use a Skip-gram model with negative sampling. Mikolov and Dean (2013b) state that the Skip-gram model with negative sampling outperforms other Word2Vec variants. One important factor was the significant speedup caused by the negative sampling (two to ten times as fast).

Figure 5.3: Model architecture of the CBOW model and Skip-gram model

Mikolov and Dean (2013b) did not explain the SG model with negative sampling very clearly. Fortunately, other papers were written just for this purpose. We follow the derivation of the SG model with negative sampling by Goldberg and Levy (2014). A Skip-gram model maximizes

$$\max_{\theta}\ \prod_{(w,c)\in D} p(c\,|\,w;\theta), \qquad (5.4)$$

where the set of all words w and their contexts c is defined as the corpus D. The conditional probability p(c|w;θ) is modeled using the soft-max

$$p(c\,|\,w;\theta) = \frac{\exp(v_c\cdot v_w)}{\sum_{c'\in C}\exp(v_{c'}\cdot v_w)}, \qquad (5.5)$$

where v_c and v_w are vector representations for c and w respectively, and C is the set of all available contexts. The set of parameters θ is given by v_c and v_w. Substituting Equation 5.5 into Equation 5.4 and taking the logarithm results in

$$\max_{\theta}\ \sum_{(w,c)\in D}\log p(c\,|\,w;\theta) = \sum_{(w,c)\in D}\left(\log\exp(v_c\cdot v_w) - \log\sum_{c'\in C}\exp(v_{c'}\cdot v_w)\right). \qquad (5.6)$$

It is computationally expensive to compute Equation (5.6) due to the summation \sum_{c'\in C}\exp(v_{c'}\cdot v_w) over all contexts c'. This can be solved using negative sampling.

Consider a pair (w, c) of a word and a context. The probability that this pair is in the corpus is given by P(D = 1|w, c). Assume that there are parameters θ controlling the distribution: P(D = 1|w, c, θ). An objective function is built that maximizes the probability of a word and context being in the corpus if it indeed is, and maximizes the probability of a word and context not being in the corpus if it indeed is not:

$$\max_{\theta}\ \prod_{(w,c)\in D} P(D=1\,|\,w,c,\theta)\ \prod_{(w,c)\in D'} P(D=0\,|\,w,c,\theta) \qquad (5.7)$$
$$\max_{\theta}\ \prod_{(w,c)\in D} P(D=1\,|\,w,c,\theta)\ \prod_{(w,c)\in D'}\bigl(1 - P(D=1\,|\,w,c,\theta)\bigr) \qquad (5.8)$$
$$\max_{\theta}\ \sum_{(w,c)\in D}\log P(D=1\,|\,w,c,\theta) + \sum_{(w,c)\in D'}\log\bigl(1 - P(D=1\,|\,w,c,\theta)\bigr) \qquad (5.9)$$
$$\max_{\theta}\ \sum_{(w,c)\in D}\log\frac{1}{1+\exp(-v_c\cdot v_w)} + \sum_{(w,c)\in D'}\log\frac{1}{1+\exp(v_c\cdot v_w)} \qquad (5.10)$$

where D' is the set of non-existing (w, c) combinations. If we let σ(x) = 1/(1 + e^{-x}), we get

$$\max_{\theta}\ \sum_{(w,c)\in D}\log\sigma(v_c\cdot v_w) + \sum_{(w,c)\in D'}\log\sigma(-v_c\cdot v_w). \qquad (5.11)$$

This objective is for the entire corpus. The objective for one example (w, c) ∈ D is given by:

$$\log\sigma(v_c\cdot v_w) + \sum_{k=1}^{K}\log\sigma(-v_{c_k}\cdot v_w), \qquad (5.12)$$

where K is the number of negative samples used. The false corpus D' is generated on the fly using the unigram noise distribution P_n(w). The probabilities of this noise distribution match the ordering of the word frequencies in the corpus; thus more frequent words appear more often. However, it turned out that the model works best when the unigram noise distribution is raised to the power 3/4. Raising the distribution to a power smaller than one makes less frequent words appear relatively more often. This choice of 3/4 was found by trial and error. Example 5.2 shows the effect of raising the unigram noise distribution to the power 3/4. Finally, θ is updated each iteration using stochastic gradient descent.

Example 5.2: Unigram noise distribution

Assume that there is a corpus that consists of three words, with probabilities given in Table 5.6. Raising all probabilities to the power 3/4 causes less frequent words to appear more often. For example, the word brown is five times more likely to be sampled after raising the probability to the power 3/4. Less frequent words are sampled more often at the expense of more frequent words, like is in our example. By raising the probabilities to the power 3/4, the noise distribution thus samples rare words relatively more often.

Table 5.6: Unigram noise distribution raised to the power 3/4

Word   P      P^(3/4)  normalized P^(3/4)
fox    0.099  0.176    0.160
is     0.900  0.924    0.835
brown  0.001  0.006    0.005

The Word2Vec model depends on multiple parameters, all of which affect the estimation of the similarity. Therefore an optimization is needed in order to obtain the best combination of parameters. However, not all parameters need to be changed, as we can simply use the same settings as Mikolov and Dean (2013b). Their study concluded that a context of 5 words gives accurate results; thus there are two words left and two words right of the target word. In addition, the unigram noise distribution is left unchanged. This leaves the dimension of the word vectors and the number of negative samples. The computing time increases as one or both of these variables increase. Mikolov and Dean (2013b) state that 2-5 negative samples are enough for a large database and 6-20 for a small database. The optimal number of negative samples and the optimal dimension of the word vectors are determined by optimization.

The Word2Vec model determines how similar single words are; however, it is unable to compare multiple words. The WMD is used to solve this problem. First all stop words are removed from a keyword. Next, the WMD determines the similarity between keywords by calculating the "distance" keyword A has to travel in order to become keyword B. The cost associated with traveling from word i to word j is defined as

$$c(i,j) = \|x_i - x_j\|. \qquad (5.13)$$

The optimization problem becomes

$$\min_{T\geq 0}\ \sum_{i,j=1}^{n} T_{i,j}\, c(i,j)$$
$$\text{subject to}\quad \sum_{j=1}^{n} T_{i,j} = d_i \quad \forall i\in\{1,\dots,n\}, \qquad (5.14)$$
$$\qquad\qquad\quad\ \sum_{i=1}^{n} T_{i,j} = d'_j \quad \forall j\in\{1,\dots,n\}, \qquad (5.15)$$

where d and d' are the vector representations of two keywords and T_{i,j} denotes how much of word i in d travels to word j in d'. Finally, n is the number of words. A WMD of 0 means that the keywords are exactly the same; the higher the WMD, the more dissimilar the keywords are. An example of the WMD is shown in Figure 5.4.


In Figure 5.4, the dimension of the word vectors is for convenience reduced to two. In practice the number of words in two keywords is often different, causing multiple words of one keyword to be most similar to a single word of the second keyword. For instance, if the right sentence in Figure 5.4 did not contain the word Italian, then the word Sicilian of the left sentence would be most similar to ice-cream. Thus both gelato and Sicilian would in that case be most similar to ice-cream. The three most similar keywords of three example keywords are shown in Table 5.5, where a lower distance means that the keywords are more similar.

Figure 5.4: Example of the word movers distance method between two sentences
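A compact sketch of the WMD pipeline with gensim is shown below. The toy corpus stands in for the Wikipedia and Bing corpora, parameter names follow recent gensim releases (older releases use size= and iter= instead of vector_size= and epochs=), and computing wmdistance additionally requires the POT or pyemd package.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in the thesis this would be Wikipedia or the Bing descriptions.
corpus = [
    ["obama", "speaks", "media", "illinois"],
    ["president", "greets", "press", "chicago"],
    ["hotel", "antwerp", "city", "trip"],
]

# Skip-gram (sg=1) with negative sampling and a context window of 5, as in Section 5.1.2.
model = Word2Vec(corpus, vector_size=50, sg=1, negative=5, window=5,
                 min_count=1, epochs=50)

kw1 = ["obama", "speaks", "media", "illinois"]
kw2 = ["president", "greets", "press", "chicago"]
print(model.wv.wmdistance(kw1, kw2))   # lower distance = more similar keywords
```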

5.2

CR forecast

The semantic similarity between the keywords is used to forecast a CR, as is shown in Algorithm 3. Two parameters need to be determined in order for this model to forecast the CR: first, the minimum number of clicks that is required and, second, the maximum number of weeks. The combination of these two parameters that leads to the most accurate results will be used. Whenever a keyword has fewer clicks within a week than the minimum number of clicks required, it also uses the clicks of the previous weeks. This is repeated until either there are enough clicks or the maximum number of weeks is reached. If the maximum number of weeks is reached, then data from the most similar keyword is also used. This is repeated until there is enough click data. Finally, the CR forecast is made by calculating the current CR. The CR forecast is a weighted mean like in Section 3.2.2. Algorithm 3 is able to forecast a unique CR for each keyword in every week. Keywords with a lot of click activity are unaffected by the similarity measure; thus the measure only affects the CR forecast for sparse data keywords.


Algorithm 3: Algorithm to forecast the CR based on a semantic similarity measure

Result: Conversion rate forecast based on a similarity measure

Let x represent a keyword for which we forecast the CR. Previously we obtained the similarity between all keywords.
n = 0, the number of weeks we are going into the past
l = 1, start with only the keyword itself
Q = x, select the data from keyword x to start the forecast
Clicks from previous keyword(s) = 0
Conversions from previous keyword(s) = 0
while TotalClicks < MinClicksRequired do
    n = n + 1
    if n <= RelevantWeeks then
        TotalClicks = \sum_{s=t-n}^{t} Clicks_{s,Q} + clicks from previous keyword(s), where t is the current week
        TotalConversions = \sum_{s=t-n}^{t} Conversions_{s,Q} + conversions from previous keyword(s)
    else
        l = l + 1
        Q = Similar[l], use the data of the keyword that is the l-th most similar to keyword x
        n = 0
    end
end
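A sketch of Algorithm 3 in the same style as the FP sketch of Chapter 3 is given below; the data layout and the precomputed similarity ranking are assumptions, and the weighted mean of Section 3.2.2 is again replaced by a plain pooled CR.

```python
# Sketch of Algorithm 3: borrow clicks from increasingly less similar keywords.
# clicks[keyword][week] and conversions[keyword][week] hold weekly totals;
# ranked_similar[x] lists keywords from most to least similar to x (assumed precomputed).
def semantic_cr_forecast(x, ranked_similar, clicks, conversions, t,
                         min_clicks=500, relevant_weeks=8):
    pool = [x] + ranked_similar[x]
    borrowed_clicks = borrowed_conv = 0.0
    for q in pool:                                   # l = 1, 2, 3, ... in Algorithm 3
        for n in range(1, relevant_weeks + 1):       # add one week of history at a time
            total_clicks = borrowed_clicks + sum(
                clicks[q].get(t - s, 0) for s in range(n + 1))
            total_conv = borrowed_conv + sum(
                conversions[q].get(t - s, 0) for s in range(n + 1))
            if total_clicks >= min_clicks:
                return total_conv / total_clicks     # unweighted pooled CR forecast
        # Not enough clicks: carry the accumulated data over to the next most similar keyword.
        borrowed_clicks, borrowed_conv = total_clicks, total_conv
    return total_conv / total_clicks if total_clicks else None
```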


CHAPTER 6

EVALUATION TECHNIQUE

An evaluation method is needed in order to evaluate the accuracy of the clustering, FP, SE and WMD models. This evaluation method must be able to determine the accuracy for individual keywords per week. The individual weekly keyword accuracy is important since the bid prices are imposed per keyword per week. However, a problem occurs when evaluating the accuracy of individual keywords on a weekly level, especially for keywords with little click activity. For each keyword the empirical CR is observed. This empirical CR might be different from the true CR (König, Gamon, & Wu, 2009). Comparing the forecast with the empirical CR could result in wrong conclusions about the accuracy of a forecast model. For instance, in Example 6.1 it is four out of five times better to forecast a CR of zero than to forecast the true CR. The difference between the empirical and the true CR is mainly a problem for keywords with little click activity, since their CR is zero in most weeks and then above 10% in one week. This chapter discusses two different evaluation methods. The first evaluation method determines the accuracy of a CR forecast model per keyword per week. The second evaluation method calculates the accuracy of a CR forecast model per account per week.

Example 6.1: Sparse data keywords

A keyword has 10 clicks a week and a true CR of 1%; the empirical CR for the previous five weeks is given in column 2 of Table 6.1. Columns 3 and 4 of this table show the error when comparing the empirical CR with two forecast models: forecasting the true CR and always forecasting a CR of 0%. The error fluctuates over time as a result of the data sparsity of the keyword. The RMSE over five weeks is smaller for the true CR model than for the always-0% model. However, the bid prices are optimized per individual keyword per week, and therefore it is important that the CR forecast is evaluated each week. As a consequence, always forecasting a CR of 0% is more accurate than forecasting the true CR in 4 out of 5 weeks.


Table 6.1: CR and error of an example keyword when forecasting the true CR and a CR of 0%

Week  Empirical CR  Error true CR  Error 0% CR
1     0%            1%             0%
2     0%            1%             0%
3     0%            1%             0%
4     10%           9%             10%
5     0%            1%             0%

6.1

Individual keyword evaluation method

First, the literature is reviewed in order to find an evaluation method that is able to evaluate individual keywords per week. Next, an algorithm is introduced that determines how many clicks are needed in order to estimate the true CR. Finally, the true CR estimation algorithm is discussed.

6.1.1 Evaluation literature

The studies mentioned in Chapter 4 did not forecast the CR and therefore did not encounter a problem with the evaluation method. Therefore additional literature is needed in order to evaluate the four forecast models.

Richardson, Dominowska, and Ragno (2007) try to estimate the true CTR using data provided by a search engine. Although they do not estimate the true CR, it is the closest study we could find to our problem. Richardson et al. (2007) filtered out keywords with less than 100 impressions. Their choice is a balance between wanting less noise in the training process (which requires more impressions) and reducing the bias that occurs when only considering keywords that have been shown many times (which requires fewer impressions). Next they estimate the effect of the quality of a keyword on the CTR. Third, they determine how similar keywords are based on the same set of words. Finally a prior CTR can be estimated. This prior CTR is used to estimate the CTR for keywords with less than 100 clicks:

$$\hat{p} = \frac{\alpha\, p_0 + \text{clicks}}{\alpha + \text{views}}, \qquad (6.1)$$

where p̂ is the estimate of the true CTR and p_0 is the prior CTR predicted by the model for the keyword, based on the ad quality, similar keywords and the average CTR. α sets the strength of the prior, and its strength diminishes as the number of impressions increases.


Richardson et al. (2007) evaluate their estimates by seeing a click as a success in a binomial distribution; thus the CTR is the probability of a success in a binomial distribution. They conclude that the estimation of the true CTR is accurate, especially for keywords with 50 impressions or more. However, their data consists of 10,000 different advertisers, whereas we have data from only one advertiser. In our case it is not possible to estimate effects such as ad quality. As a consequence, their method for the estimation of the true CTR cannot be used to estimate the true CR in our case. Nonetheless there are two useful implications. First, seeing the CR as the probability of a success in a binomial distribution; this is used to construct a confidence interval (CI) for the true CR estimation and to determine the minimum number of clicks that is required to estimate the true CR. Second, the use of a prior CR; this prior CR should only be used for sparse data keywords, i.e. the strength of the prior should diminish as a keyword has more click activity.

6.1.2 The minimum number of clicks required

The CR can be seen as the probability of a success in a binomial distribution. A CI for the CR can therefore be constructed from this binomial distribution. No approximation of the CI is needed, as the number of conversions given the number of clicks follows a binomial distribution exactly.

The next step is to determine the number of clicks that is required in order to have a CI of reasonable length. But what is a reasonable size for a CI? We state that any CI with a length of 0.03 or smaller is reasonable. This value of 0.03 is arbitrarily chosen. Other limits were considered, such as a relative limit. However, the relative length of the CI goes to infinity as the CR approaches zero; thus the length of the CI must be limited in absolute terms. Algorithm 4 shows how the minimum number of clicks is determined. In order to determine the maximum absolute CI length, we only have to consider the keyword with the highest CR, since that keyword has the largest CI in absolute terms.

Algorithm 4: Obtain the minimum number of clicks required in order to estimate the true CR within a predefined limit

Data: All keywords for which the CR is estimated in the current model

Result: The minimum number of clicks required to estimate the CR accurately

X = 0
1. Select all CRs corresponding to keywords with at least X clicks.
2. Compute a 95% exact CI for the highest CR for every sample size from 1 to X.
3. Set X to the first sample size for which the length of the CI is smaller than 0.03.
4. Repeat steps 1 to 3 until X remains constant.

First, Algorithm 4 selects the CRs of all keywords that have at least X clicks over the two years of data. This filtering is necessary: if a keyword does not have X clicks, the empirical CR might differ too much from the true CR, which might result in an unrealistically high CR. For example, a keyword could have two clicks during a two-year time frame with a CR of 50%, causing the highest CR to be unrealistically high. Next, the exact CI is determined for the highest CR for every number of clicks between 1 and X; thus there are X different CIs. Finally, the lowest number of clicks that corresponds to a CI of 0.03 or smaller is selected.


This process is repeated until X does not change. The results of this algorithm are shown in Section 8.1.1.
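The exact CI of Algorithm 4 can be computed with the Clopper-Pearson interval; the sketch below searches for the smallest number of clicks for which the CI length drops below 0.03, using an illustrative "highest CR" of 12% rather than a value from the thesis data.

```python
from scipy.stats import beta

# Exact (Clopper-Pearson) binomial CI for a conversion rate.
def exact_ci(conversions, clicks, alpha=0.05):
    lower = beta.ppf(alpha / 2, conversions, clicks - conversions + 1) if conversions > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, conversions + 1, clicks - conversions) if conversions < clicks else 1.0
    return lower, upper

highest_cr = 0.12   # illustrative highest observed CR
for clicks in range(50, 2001, 50):
    conversions = round(highest_cr * clicks)
    lo, hi = exact_ci(conversions, clicks)
    if hi - lo <= 0.03:
        print("minimum clicks required:", clicks)
        break
```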

6.1.3 True conversion rate estimation

An estimate of the true CR is needed in order to compare the different forecast models. Algorithm 5 estimates the true CR. The algorithm uses future data in order to estimate the true CR, whereas the forecast models only use historical data.

Algorithm 5: Estimation of the true CR

Data: The number of clicks and conversions for a keyword and for the group and the campaign to which the keyword belongs

Result: True CR estimation for a keyword in a given week

n = 1, the number of weeks we are going into the future
L = 1, the level from which we are using data
while TotalClicks < MinClicksRequired do
    if n <= RelevantWeeks then
        TotalClicks = \sum_{l=1}^{L} \sum_{s=t}^{t+n} Clicks_{s,l}, where t is the current week
        TotalConversions = \sum_{l=1}^{L} \sum_{s=t}^{t+n} Conversions_{s,l}
        n = n + 1
    else
        L = L + 1, the maximum level on which we are estimating the true CR (keyword, group, campaign or account)
        n = 1
    end
end
CR_t = TotalConversions / TotalClicks

The MinClicksRequired is obtained using Algorithm 4. First, Algorithm 5 tries to calculate the conversion rate on keyword level (L = 1), but if there are not enough clicks within the relevant weeks, then clicks are added from the group level (L = 2) and, if necessary, from the campaign level (L = 3) or the account level (L = 4). The minimum number of clicks was previously determined with Algorithm 4, but the number of relevant weeks still needs to be decided. There is a trade-off between estimating more keywords on keyword level and reducing the bias that occurs when considering weeks too far into the future. Multiple settings for the number of weeks will be used to analyze both the effect on the percentage of keywords that is estimated on each level and the accuracy of the different forecast models. Finally, the estimation of the true CR is made. This estimate is calculated using the same weighted mean as in Algorithm 1. The reason behind the weighting is that otherwise the CR of the group, campaign or account level could affect the CR of the keyword too much.


6.1.4 Criteria

The individual keyword accuracy will be determined by calculating the root mean squared error (RMSE) between the estimated true CR and the forecasted CR. However, a weighted mean is used instead of an unweighted mean, resulting in the root weighted mean squared error (RWMSE). The weights are given by the number of clicks a keyword has in a week. The weighted mean is chosen because keywords with a lot of click activity are more important than keywords with little click activity. The RMSE is preferred over the mean absolute error (MAE), since the RMSE gives larger weight to larger errors, whereas the MAE gives the same weight to each error. We believe that larger errors are more harmful to the optimal bid price, and therefore the RMSE is preferred.
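Written out, with symbols of my own choosing (the thesis does not fix this notation here): let $w_{k,t}$ be the number of clicks of keyword $k$ in week $t$, $\widehat{CR}_{k,t}$ the forecasted CR and $CR^{\text{true}}_{k,t}$ the estimated true CR. The criterion is then

$$\mathrm{RWMSE} = \sqrt{\frac{\sum_{k,t} w_{k,t}\,\bigl(\widehat{CR}_{k,t} - CR^{\text{true}}_{k,t}\bigr)^{2}}{\sum_{k,t} w_{k,t}}}$$

so keywords with more clicks in a given week contribute more to the criterion.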

6.2 Account level accuracy

Besides the individual keyword criterion, the forecast models are also evaluated on account level. The RMSE between the number of forecasted conversions and the number of realized conversions per week is calculated. This method will be referred to as the account level criterion. The main purpose of this evaluation method is to check whether the individual level criterion is accurate. If the two evaluation methods show different results for a forecast model, then the individual evaluation method does not work as intended, for instance when the individual keyword criterion results in a high RWMSE while the account level criterion results in a low RMSE.
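A minimal sketch of this criterion is given below. It assumes that the forecasted number of conversions of a keyword is the forecasted CR multiplied by the realized number of clicks (an assumption; the thesis does not spell this step out here) and that the weekly data is available as a hypothetical pandas DataFrame with the columns week, clicks, conversions and forecasted_cr.

import numpy as np
import pandas as pd

def account_level_rmse(weekly_data: pd.DataFrame) -> float:
    # RMSE between forecasted and realized conversions, aggregated per week
    # over all keywords of the account.
    df = weekly_data.copy()
    df["forecasted_conversions"] = df["forecasted_cr"] * df["clicks"]
    per_week = df.groupby("week").agg(
        forecast=("forecasted_conversions", "sum"),
        realized=("conversions", "sum"),
    )
    errors = per_week["forecast"] - per_week["realized"]
    return float(np.sqrt((errors ** 2).mean()))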


CHAPTER 7

DATA

The data is provided by ORTEC and comes from a company that is active in the leisure market. The data is obtained via Google AdWords, which provides daily data per keyword on the bid price, the number of impressions, the average position of that day, the click-through rate and the conversion rate. The data contains information about 2,347 keywords over 130 weeks, from 24-04-2014 until 13-10-2016. An observation is defined as the data for a keyword for a single week. There are 2,347 observations each week and 305,110 in total. In addition to the data provided by Google AdWords, there is also data provided by the advertisers, such as the profit margin on a product. However, it is assumed that these variables have no effect on the CR.

First, we discuss how the data is structured, i.e. the number of keywords, groups and campaigns. Next, the descriptive statistics of the most relevant variables provided by Google AdWords are given, and the sparsity problem regarding the number of conversions is discussed using these statistics. Finally, the split between the training and the test data is discussed.

7.1 Google AdWords structure

The number of keywords, groups and campaigns is shown in Table 7.1. There are 2,347 keywords within this Google AdWords account, with on average 3.5 keywords per group. In total there are 680 groups and 94 campaigns, thus on average 7.2 groups per campaign. There are on average more groups per campaign than there are keywords per group. However, these averages are skewed by some outliers, as is shown for the groups in Figure 7.1 and for the campaigns in Figure 7.2. These figures show the density of the number of keywords per group and the number of groups per campaign. 44% of the groups contain a single keyword.


Table 7.1: Number of occurrences per level

Level Occurrences

Keyword 2,347

Group 680

Campaign 94

Figure 7.1: Histogram of the group size


7.2 Descriptive statistics

The descriptive statistics of the data provided by Google AdWords are shown in Table 7.2. Almost every variable has a standard deviation that is larger than its mean. This is because most keywords have little click activity while a few keywords have a lot. This can be seen in the first row: 75% of the observations have 66 impressions or fewer per week, roughly a third of the mean of 187.24. The same applies to the number of clicks and conversions. The CTR and CR have a maximum above 1, which can only occur when a person clicks or converts on the same advertisement multiple times. Such occurrences are rare.

Table 7.2: Descriptive statistics of the database

Variable      Mean     Std deviation   25% quantile   Median   75% quantile   Minimum   Maximum

Impressions   187.24   1,480.96        4              15       66             1         158,790
Position      3.32     2.08            1.00           2.00     3.14           4.56      13
Clicks        8.69     34.06           0              1        5              0         1,809
CTR           0.05     0.15            0.00           0.04     0.12           0.00      2.00
Conversions   0.05     0.35            0              0        0              0         23
CR            0.01     0.01            −              0.00     0.00           0.00      1.50
CPC           0.40     0.48            −              0.31     0.64           0.00      7.84
Costs         6.71     29.48           0.00           0.50     3.00           0         1,482.48

Every observation has at least one impression. However, 38.6% of the observations have no clicks. As a consequence, the CR and the CPC are undefined for 38.6% of the observations. There are even more observations without a conversion: 96.7%. The density of the number of clicks is shown in Figure 7.3. The density graph shows the sparsity of the data: 98.5% of the observations have 100 clicks or fewer. Combining this with an average CR of 0.01 illustrates the challenges of CR forecasting, namely low click volumes combined with a low conversion probability. The density graph for the number of conversions is less interesting, as 96.7% of the observations have no conversion.


7.3 Train and test data

The data covers a period of 130 weeks; however, some weeks are needed to calibrate the different CR forecast models to the data. Therefore the data is split into two parts, training data and test data. Figure 7.4 shows how the data is divided. The first 77 weeks are used as training data. Note that the first X weeks are needed as history in order to forecast a CR; the optimal number of weeks is determined in Section 8.2. This leaves 52 weeks of effective training data. The data from week 78 onward is used as test data. However, the last 26 weeks are needed to estimate the true CR. Therefore the test data runs from week 78 until week 103.
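A small sketch of this split (week indices only; not thesis code) reads:

# Week indices as described above; the exact number of warm-up weeks X is
# a model parameter that is determined in Section 8.2.
training_weeks = range(1, 78)    # weeks 1-77: training data (the first X weeks
                                 # are only used as forecast history)
test_weeks = range(78, 104)      # weeks 78-103: test data; the weeks after
                                 # week 103 are reserved for the true CR estimate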


CHAPTER 8

RESULTS

Before the clustering, FP, SE and WMD models are evaluated, we first need to estimate the true CR. The results of the true CR estimation are discussed in Section 8.1. Next, the parameters of the FP, SE and WMD models are calibrated so that the RWMSE is minimized on the training data. Finally, the results of comparing the different models on the test set are discussed.

8.1 Evaluation method

8.1.1 Minimum number of clicks required to estimate the conversion rate

Using Algorithm 4 we found that the minimum number of clicks needed to estimate the true CR with a maximum CI size of 0.03 is 675. Figure 8.1 shows the absolute size of the CI as a function of the number of clicks. The horizontal red line marks the maximum size of the CI, which is 0.03, and the vertical green line marks the first value for which the absolute size of the CI is below this limit, which is 675. The largest CR among keywords with at least 675 clicks is 3.68%. Thus the minimum number of clicks required to estimate the true CR within our accuracy limit of 0.03 is 675.

8.1.2 Estimation of the true conversion rate

The true CR is estimated using a minimum of 675 clicks and a maximum of 26 weeks. This model is able to estimate the true CR for each keyword in every week. This subsection discusses the true CR estimation results that are obtained with Algorithm 5.


Figure 8.1: Absolute size of the CI as a function of the number of clicks

The density of the estimated true CR is shown in Figure 8.2. In addition, the densities of the CR of keywords with at least 675 clicks and of all observations with at least one conversion are shown. The density of the true CR estimate is unstable: a local maximum appears at every multiple of 1/675 (about 0.15%), which is a consequence of using a minimum of 675 clicks. Furthermore, the global maximum of the true CR estimation graph is closer to zero than that of the other two graphs, and it has fewer outliers. The maximum CR is 2.9%, whereas the maximum CR of keywords with at least 675 clicks is 3.68%.

Figure 8.2: Density of the true CR estimation

Figure 8.3 shows which percentage of the keywords is optimized on which level (keyword, group, campaign or account). The true CR is estimated on keyword level for 7% of the keywords, on group level for 52%, on campaign level for 36% and on account level for 5%.


Figure 8.3: Optimization level of the true CR estimation

8.2 Model calibration

8.2.1 Individual keyword accuracy

The four different CR forecast models depend on different parameters. In this thesis, it is assumed that the clustering forecast model is already optimized on the current database, as it is being used by ORTEC to forecast the CR with the current data. The other three forecast models are introduced in this thesis and therefore their parameters have to be calibrated. The parameter values that minimize the RWMSE on the individual level criterion are selected. A forecast model whose accuracy depends heavily on its parameter values is not preferred, since this creates uncertainty about the accuracy of the model; for example, the optimal parameter value might change over time. Therefore, the sensitivity of the RWMSE to the parameter values is also analyzed. The training data is used to calibrate the different parameters. First, the results of the FP model are discussed, followed by those of the SE model and finally the WMD model.

8.2.2 Calibration of the FP model

The FP model depends on two parameters: the minimum number of clicks to include and the number of weeks into the past. One parameter is changed at a time to minimize the RWMSE, and this is repeated until the RWMSE can no longer be improved. The results of this parameter calibration are shown in Table 8.1.
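A schematic version of this one-parameter-at-a-time search (a simple coordinate descent over two grids, not the thesis implementation) could look as follows; rwmse(clicks, weeks) is a hypothetical function that fits the FP model on the training data and returns its RWMSE.

def calibrate_fp(rwmse, clicks_grid, weeks_grid, start_clicks, start_weeks):
    # Minimize rwmse(clicks, weeks) by changing one parameter at a time,
    # repeating until neither change improves the criterion.
    clicks, weeks = start_clicks, start_weeks
    best = rwmse(clicks, weeks)
    improved = True
    while improved:
        improved = False
        for c in clicks_grid:            # vary the click threshold
            score = rwmse(c, weeks)
            if score < best:
                best, clicks, improved = score, c, True
        for w in weeks_grid:             # vary the number of weeks
            score = rwmse(clicks, w)
            if score < best:
                best, weeks, improved = score, w, True
    return clicks, weeks, best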

The calibration started with 675 clicks and 26 weeks, which are the same settings as used for the estimation of the true CR. These starting values are close to the optimal parameter settings, which are shown in row 2 of Table 8.1. The optimal parameter settings are 680 clicks and 26 weeks, with a RWMSE of 0.00416.


Table 8.1: RWMSE for different parameter settings (clicks and weeks) of the FP model

Setting        Clicks   Weeks   RWMSE     % of optimal RWMSE

Start          675      26      0.00418   100.37%
Optimal        680      25      0.00416   100.00%
Clicks + 10%   748      25      0.00532   127.81%
Clicks - 10%   612      25      0.00418   100.45%
Weeks + 10%    680      28      0.00420   100.88%
Weeks - 10%    680      22      0.00421   101.13%
Weeks - 20%    680      16      0.00445   106.97%
Both - 10%     612      22      0.00421   101.20%
Both - 20%     612      22      0.00441   106.03%

The start value has a RWMSE that is 0.37% higher than that of the optimal parameter settings. In addition, the sensitivity of the RWMSE to changes in the two parameters is analyzed; these results are shown in rows 3-9. Increasing the number of clicks by 10% causes the RWMSE to increase by 27.8%, while decreasing the number of clicks increases the RWMSE by 0.45%. There is little difference in the RWMSE between increasing and decreasing the number of weeks by 10%. However, using fewer weeks has additional advantages, since it becomes possible to forecast a CR with less data; also, if there are time effects, then using fewer weeks might help to capture such patterns better. Therefore additional settings with fewer weeks are used. Reducing the maximum number of weeks by 20% increased the RWMSE by 6.97% compared to the optimal parameter settings. Decreasing both the number of clicks and the maximum number of weeks by 10% and by 20% increased the RWMSE by 1.20% and 6.03% respectively. In practice it might be useful to use fewer weeks, but we choose the parameters that minimize the RWMSE: 680 clicks and 26 weeks.

A potential problem occurs when computing the error between the forecasted CR and the estimated true CR. The true CR estimation algorithm is essentially the same as the FP algorithm; the only difference is that it uses future data, whereas the FP algorithm uses historical data. If a keyword is forecasted on the same level (keyword, group, campaign or account) as the true CR, then we only evaluate whether we are able to forecast the CR on a certain optimization level. Figure 8.4 shows the percentage of keywords that is optimized on each level. The percentages of both models are almost identical. Both optimize 7% of the keywords on keyword level. The FP model optimizes 41% of the keywords on group level, 46% on campaign level and 6% on account level, while the true CR estimation method optimizes 43% on group level, 49% on campaign level and 5% on account level. These percentages are almost identical, as could be expected for such similar methods. In total, 71% of the observations are optimized on the same level. Therefore the RWMSE is expected to give an overly positive view, and we expect the FP model to be less accurate under the account level evaluation method.
