Optimizing Non-Linear Models for Online Learning to Rank


MSc Artificial Intelligence

Track: Information Retrieval

Master Thesis

Optimizing Non-Linear Models for Online Learning to Rank

by

Harrie Oosterhuis

10196129

December 1, 2016

42 EC, 15 January 2016 - 18 November 2016

Supervisor:

Msc. Artem Grotov

Assessors:

Prof. Dr. Maarten de Rijke

Dr. Joris Mooij

Information and Language Processing Systems group

Informatics Institute


Abstract

In recent years the limits of Offline Learning to Rank have become more evident, and as a result interest in Online Learning to Rank has substantially increased. However, there is currently still a large performance gap between models in the Offline and Online setting. A suspected cause for this gap is that the state-of-the-art models in the Offline setting cannot be applied to the Online setting. Moreover, the field of Online Learning to Rank has so far been limited to algorithms that learn linear models. This thesis proposes a novel method of optimizing non-linear models in the Online Learning to Rank setting. Our approach uses distributions of ranker functions and optimizes linear combinations of a sampled set of these functions. The optimization is done using an adaptation of Multileave Gradient Descent that optimizes the model while simultaneously learning what set of rankers to use. We show that this is the first approach that can optimize Regression Forests and Kernel Based models in the Online setting. Our empirical experiments show that both these models can perform better than linear models in situations where a lot of click noise is involved. Furthermore, our results indicate that the user experience during the optimization of a Kernel Based model is always substantially better than for a linear model. This opens the door for future work to explore many different models that are novel to the field of Online Learning to Rank.


List of Figures

4.5.1 The number of features used in the linear model by the subspace selection methods for different Bernoulli distributions under the navigational click model on the Yahoo-US dataset. The selection method was only applied after the first 100 impressions; regardless of increases in the discard probability of the Bernoulli distribution, the method seemed unable to select the desired number of 25 features.

4.5.2 The number of features used in the linear model by different selection algorithms over the number of user impressions under the navigational click model on the Yahoo-US dataset. The selection methods were only applied after the first 100 impressions and stopped when no more than 25 features were used. Similar behaviour was seen under the perfect and informational click models.

4.5.3 The offline performance (NDCG) of each of the algorithms for three different click models on the Yahoo-US dataset. Feature selection methods were only applied after the first 100 impressions and stopped when no more than 25 features were used.

5.1.1 Example of a regression tree T1 and two slight variations T2 and

5.1.2 Example of a regression forest consisting of a regression tree T1 and two variations made by branching out, T2 and T3. Leafs are numbered on the sets of documents they cover, e.g. leaf 1 receives the same documents in all trees, thus it is linked to the same function φ₁ for T1 and T3. In the same vein, dotted lines between leafs and functions describe the relation of the forest to an equivalent linear model.

5.2.1 Example of a regression forest and two possible branchings represented by a linear model. Multileave Gradient Descent can decide to discard any of the φ functions. Because each φ function represents a leaf in a tree, discarding them is analogous to pruning the tree. Over time the branches that provide the best model are selected.

5.4.1 Offline score (NDCG) of regression forests of varying sizes on the Yahoo-US dataset under an informational click model. All forests are static stumps, i.e. only their labels change; each tree consists of two leafs only and their thresholds do not change.

5.4.2 Results of an experiment run on the Yahoo-US dataset under the navigational click model. Top: number of nodes of a regression forest where leafs are discarded using regularization to produce a forest with 50 nodes, and two baselines: one with the starting number of nodes (400), the other with the final number of nodes (50). Bottom: the offline performance of the same models.

5.4.3 Offline performance of the regression forest model compared with P-MGD and a static forest for three click models on the Yahoo-US dataset. The growth of the forest is displayed in Figure 5.4.4.

5.4.4 Average depth of each node in a regression forest model under the navigational click model on the Yahoo-US dataset; the performance of this model is displayed in Figure 5.4.3.

6.3.1 Offline performance (NDCG) of several Kernel Based models with polynomial kernels of varying c values on the Yahoo-US dataset under the informational click model. Increasing c causes the higher-order terms to be weighted more than the lower-order terms of the polynomial.

6.3.2 Offline performance (NDCG) of Kernel Based models with RBF kernels and varying σ values on the Yahoo-US dataset under the informational click model. Increasing σ makes the similarity function less spiked, thus considering documents at a greater distance more similar.

6.3.3 Offline performance (NDCG) on the Yahoo-US dataset under the informational click model. Kernel Based models with an RBF kernel of varying sizes M = |Φ|; the set of ranker functions is static: λ = 0.

6.3.4 Results of the selection experiment for Kernel Based models on the Yahoo-US dataset under the informational click model. Top: the size of the model |Φ| compared to the two baselines. Bottom: the offline performance of the model and the baselines.


Acknowledgments

I would like to thank my daily supervisor Artem Grotov for his help and support over the past year. His supervision kept me from fully investigating every possible research direction and kept the project feasible; without his feedback and encouragement it would not have been possible.

Many thanks to dr. Anne Schuth who supervised me as an honours student at ILPS during my Master. Thanks to him I found my passion for Online Learning to Rank and research in general. He pushed me to publish papers, visit conferences, do international internships and most of all enjoy doing research.

Thanks as well to the entire ILPS group for creating such a nice research environment. My gratitude especially goes to prof. Maarten de Rijke for allowing a mere Master student to join under the honours program and for the many times he supported me behind the scenes. I also want to thank Max Welling and Maarten van Someren for inviting me to join this program in the first place.

Furthermore, I want to thank Joris Mooij for agreeing to be a member of my defence committee.

Finally, I want to thank my family, friends and Yasemin for their endless support and patience.


1

Introduction

Search engines have played an essential role in making the web accessible, and billions of users now rely on them to find the web content they need [4,8,89]. Naturally, their biggest role is to choose what content should be displayed. Typically an engine receives a query from a user and has to respond by displaying a relevant list of documents. Unsurprisingly, the order in which documents are presented heavily affects the user experience [79]. In almost all cases the most relevant documents ought to be displayed first, as this would satisfy the user the quickest. Unfortunately, the relevance of a document cannot be observed directly; therefore it is up to the engine to predict which documents will satisfy the user’s need [12,50,98]. This problem of ordering a large set of documents is called ranking. Over the years many methods that rank based on induced relevance have been introduced [33,62,75,78,90,98,106]. For instance, one may expect that a document containing the search query is probably relevant, thus counting how often words from the query occur in a document may give a good indication of relevance [64,106]. More sophisticated methods like TF-IDF also consider how rare words are in the document set [64,75,90,98]. Additionally, BM25 takes document length into account [77,78], while PageRank estimates document quality by how many other documents link to it [33,62]. The number of methods that provide signals of relevance is enormous and continues to grow [3,7,53,106]. Consequently, all modern search engines now use ranking models that combine hundreds of signals in order to rank documents. Learning to Rank is an active area of research within Machine Learning that concerns optimizing models that rank instances [10,24,54,68,102]. As retrieval systems have become more advanced, the complexity of their ranking models has increased dramatically. Correspondingly, finding the optimal model to interpret these signals is now considered the main task of the Learning to Rank field [11,50,55,101].

Traditionally this is done offline, where queries and documents are automatically gathered and then individually labeled by human assessors. The resulting data consists of pairs of search queries and documents, each with relevance labels received from multiple assessors. The data produced from such hand labelling allows for models to be optimized for ranking documents in line with what humans perceive as relevant for a given query [14,57,66,67]. However, the offline approach also has some major shortcomings. Most obvious are the tremendous costs that come with large amounts of manual labelling [8,57]. Furthermore, it is hard to create a dataset that covers the wide range of user intents [14]. Static datasets also cannot account for future changes in what is considered relevant [66]. Additionally, for certain tasks it may never be possible to create a dataset; for instance, creating a dataset for search in private documents or personal emails would invade the privacy of users [58,96,97]. Most importantly, gathered assessments do not always correspond with the preferences of actual users [82].

As these limitations have become more apparent over time, increased attention is being given to Online Learning to Rank [40,45,74,84,104]. In contrast with the offline setting, no explicit human assessments are used in the online setting. Instead the model is optimized on user interactions only. This setting is very appealing since most user interactions are readily available to search engines and can be used to determine user preferences, e.g. query reformulations, mouse movements, and clicks. Previous research has shown that clicks can reliably be used to find a preference between rankings [74] or documents [45]. Recent work has introduced methods for efficiently inferring preferences between rankers from clicks [37,70,86,87]. In turn this has allowed for more efficient Online Learning to Rank methods that minimize the number of interactions required during optimization, resulting in fast and click-efficient learning [60,88,104]. Despite these recent advancements there is still a substantial gap between the performance of online and offline optimized models; a comparison per dataset is given in Section 2.3.4. This difference can partially be explained by the characteristic difficulties of the online setting. Firstly, since the online setting has no human-verified labels it has to deal with noise, i.e., clicks on non-relevant documents. Secondly, since users do not consider every document available the models also have to deal with bias, i.e., clicks will only be measured for the displayed documents. The former means that single clicks cannot be trusted to give a reliable measure of relevance. The latter creates the danger of self-confirmation, as clicks will never be observed for documents that are not displayed; hence exploration in online algorithms is very important [38].

This thesis addresses another difference between previous work in Online and Offline Learning to Rank, namely that all existing research has only optimized linear models in the online setting. The only exception is a case study that optimized the parameters of BM25 [85]. Nonetheless, BM25 is a single ranking function and not a model that combines multiple signals. Conversely, within Offline Learning to Rank many different kinds of ranking models have been studied. Among them Regression Forests [68,99,100], Support Vector Machines [27,35,45] and Neural Networks [10] have been very successful, with LambdaMart, a Gradient Boosted Regression Trees model, being the current state of the art [14,94]. Therefore, one may expect that such models will also be very effective in the Online Learning to Rank task. Potentially, part of the gap in performance is due to linear models being much less expressive than their non-linear counterparts. Correspondingly, the aim of this study is to find reliable methods of optimizing non-linear models in the online setting and to compare their performance and behaviour with existing linear models.

Firstly, Chapter 2 discusses the fields of Offline and Online Learning to Rank as well as the state-of-the-art algorithms in both fields. Subsequently, Chapter 3 details related work that is not in the Learning to Rank field. Then Chapter 4 proposes an approach that can optimize many non-linear models by describing them as linear combinations of simpler non-linear ranker functions. This allows existing reliable Online Learning to Rank algorithms to be applied in the space of these ranker functions, producing well performing non-linear models. Moreover, this chapter also introduces a method that adapts the set of non-linear ranker functions over which the linear model is applied. This also enables the non-linear part of these models to be learned online, allowing more complex models to be optimized. Chapter 5 shows that Regression Forests neatly fit in this approach and describes how the novel algorithm can explore the structural parts of the trees as well as learn the internal parameters of a forest. In addition, Chapter 6 describes how models similar to Support Vector Machines also fit in the framework, drawing an analogy between selecting the set of functions and learning the set of support vectors. Finally, Chapter 7 concludes the thesis by describing the contributions made in the previous chapters as well as proposing future directions of research.

Due to the scope of this thesis, the research questions will be answered across multiple chapters. All of the questions seek to answer part of the following main research questions, which will be answered in Chapter 7:

RQ1 Can a Non-Linear model be optimized in the Online Learning to Rank setting using a listwise method?

RQ2 Is a Non-Linear model able to converge to a better performance than a linear model in the Online Learning to Rank setting?

RQ3 Is the user experience during the training of a Non-Linear model as good as when training a linear model in the Online Learning to Rank setting?

In Chapter 4 experiments are conducted that investigate optimizing linear models that do not use all available components. In other words, learning models where some signals do not contribute because their weights are zero: $w_n = 0$, a task very similar to feature selection [59,65]. The following question is answered:

RQ4 Can Multileave Gradient Descent be adapted to optimize a linear model that does not use all available components?

Then Chapter 5 seeks to answer the main Research Questions specifically for Regression Forest models; additionally the following question is answered:

RQ5 Can the adapted Multileave Gradient Descent learn to effectively select a branching of nodes in the Online Learning to Rank setting?

Lastly, Chapter 6 also aims to answer the main Research Questions specifically for Kernel Based Models; furthermore it answers:

RQ6 Can the adapted Multileave Gradient Descent learn which documents should be used as support vectors in a Kernel Based Model?


2

Background: Learning to Rank

This chapter describes the current state of Learning to Rank. Section 2.1 covers the Offline Learning to Rank models that served as an inspiration for the methods introduced in this thesis. Then Section 2.2 explains the latest Online Learning to Rank algorithms, which form the basis for the non-linear optimization introduced in later chapters. Lastly, Section 2.3 describes the datasets and experimental setup commonly used in the field of Online Learning to Rank. The experiments throughout the rest of the thesis will use this data and setup; additionally this section also shows a comparison of the performance of models in the Online and Offline setting.

2.1 Offline Learning to Rank

Modern search engines base their rankings on numerous features and signals, and new signals continue to be introduced [3,7,33,75,78,90,98,106]. Therefore it is very important to have reliable methods that can optimize ranking models which take such large sets of signals as input. The field of Offline Learning to Rank focuses on optimizing models using static annotated datasets [11,24,55,68,101,102]. The big advantage of this approach is that the relevance of the documents in the dataset is given, in contrast with approaches that have to infer relevance from other signals [50]. Since relevance is given, the cost of a model and its gradient can be calculated exactly; having the exact gradient in turn allows for the optimization of very complex models [10,27,35,45,99,100]. However, the Offline approach also has some major drawbacks. Firstly, gathering annotations is time-consuming and costly [8,14,57,66]. Secondly, for certain search contexts creating such a dataset would be unethical, for instance in the context of search within personal emails or documents [58,96,97]. Thirdly, since the datasets are static they cannot account for future changes in what is considered relevant. Lastly, models derived from such datasets are not necessarily aligned with user satisfaction [82]. Despite these drawbacks Offline Learning to Rank is still very prevalent and the work in this field has influenced many of the other aspects of Learning to Rank [1,15,16,26]. Research Question RQ1 asks if it is possible to train a non-linear model in an Online Learning to Rank setting. Before non-linear models for the Online setting are introduced, this section will discuss related non-linear models in the Offline setting.

2.1.1 RankSVM

The first relevant model is RankSVM, introduced by Joachims [45]. In this paper a set of linear ranking functions is formalized as:

\[ (d_i, d_j) \in f_{\mathbf{w}}(q) \leftrightarrow \mathbf{w} \cdot \mathbf{d}_i > \mathbf{w} \cdot \mathbf{d}_j \tag{2.1} \]

where $d_i$ and $d_j$ are documents that are ranked according to query $q$ by function $f_{\mathbf{w}}$, and $\mathbf{d}$ is the feature representation of a document-query pair $(d, q)$. Because this approach is based on preferences between pairs of documents it is considered to be a pairwise algorithm [10,11,63]. Conversely, pointwise methods attempt to directly predict the cost of a document [56] and listwise methods are based on metrics over entire rankings [10,68,101]. The weight vector $\mathbf{w}$ is optimized as a Support Vector Machine [47]; this means that the learned function $f_{\mathbf{w}^*}$ can always be represented as a linear combination of feature vectors:

\[ (d_i, d_j) \in f_{\mathbf{w}}(q) \leftrightarrow \mathbf{w} \cdot \mathbf{d}_i > \mathbf{w} \cdot \mathbf{d}_j \tag{2.2} \]

\[ \leftrightarrow \sum_l \alpha^*_l \, \mathbf{d}_l \cdot \mathbf{d}_i > \sum_l \alpha^*_l \, \mathbf{d}_l \cdot \mathbf{d}_j \tag{2.3} \]

where the magnitude of coefficient $\alpha^*_l$ determines the weight of document $d_l$ and its sign determines whether similar documents should be ranked high or low. Thus rankings can be created by sorting documents according to their value of:

\[ R(d) = \mathbf{w} \cdot \mathbf{d} = \sum_l \alpha^*_l \, \mathbf{d}_l \cdot \mathbf{d}. \tag{2.4} \]

Although Joachims does not explore the possibility, he mentions that kernels could be used in this function, thus enabling a non-linear model to be learned [21]. For a kernel $K$ this would give the ranking function:

\[ R(d) = \sum_l \alpha^*_l \, \mathbf{d}_l \cdot \mathbf{d} = \sum_l \alpha^*_l \, K(\mathbf{d}_l, \mathbf{d}). \tag{2.5} \]

Since RankSVM uses a linear kernel, the produced ranking function $R$ can be expressed by a single set of weights $\mathbf{w}$. However, when using non-linear kernels the model ought to be optimized so that the Lagrange multipliers $\alpha$ will be zero except for a subset of documents [5, Chapter 7]. This sparsity is very useful as only the documents with non-zero $\alpha$ values are necessary for the computation of the ranking function. Without this sparsity the usage of this model would not be feasible, as the total number of documents in a dataset easily exceeds tens of thousands [4,8,14,66]. Chapter 6 describes how no Lagrangian can be applied in the Online setting and thus introduces a different method to approximate this process.
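The kernelized ranking function of Equation 2.5 can be illustrated with a minimal Python sketch. It is not taken from the thesis or from Joachims' implementation: the function names, the RBF kernel with its `sigma` parameter, and the toy support documents are assumptions chosen for illustration only.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def linear_kernel(a: Vector, b: Vector) -> float:
    """Dot product; with this kernel the model collapses to one weight vector w."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a: Vector, b: Vector, sigma: float = 1.0) -> float:
    """Gaussian similarity (assumed example of a non-linear kernel)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def ranking_score(d: Vector, support_docs: List[Vector], alphas: List[float],
                  kernel: Kernel = linear_kernel) -> float:
    """R(d) = sum_l alpha_l * K(d_l, d); documents with alpha_l = 0 are skipped."""
    return sum(a * kernel(d_l, d)
               for d_l, a in zip(support_docs, alphas) if a != 0.0)


def rank(docs: List[Vector], support_docs: List[Vector], alphas: List[float],
         kernel: Kernel = linear_kernel) -> List[int]:
    """Return the indices of docs sorted by descending score R(d)."""
    scores = [ranking_score(d, support_docs, alphas, kernel) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: two support documents with opposite-signed coefficients.
support = [[1.0, 0.0], [0.0, 1.0]]
alphas = [1.0, -0.5]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
linear_order = rank(candidates, support, alphas)
rbf_order = rank(candidates, support, alphas, rbf_kernel)
```

With the linear kernel the scores equal $\mathbf{w} \cdot \mathbf{d}$ for $\mathbf{w} = \sum_l \alpha_l \mathbf{d}_l$; with a non-linear kernel no such single weight vector exists, which is why the cost of scoring grows with the number of non-zero coefficients.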

(18)

2.1.2 Ranknet, LambdaRank and LambdaMART

Three other relevant models are Ranknet [10], LambdaRank [68] and LambdaMART [99,100]; an in-depth overview of how they are related has been written by Burges [9]. Ranknet is, just like RankSVM, a pairwise method; thus the cost of its model is based on the number of relevance differences in document pairs it predicts incorrectly. The method used to update Ranknet can be used for any model whose output is a differentiable function of the model parameters, though typically neural networks are used. The cost of the model is computed by taking the cross entropy between the model’s predictions of the pairwise preferences and the known preferences. In this case, $P_{ij}$ is the probability the model assigns to document $d_i$ being preferred over $d_j$ and $\bar{P}_{ij}$ the known probability for this preference (as before $d_i$ and $d_j$ are actually query-document pairs). This gives us the cross entropy cost function:

\[ C = \sum_i \sum_j \left( -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log(1 - P_{ij}) \right) \tag{2.6} \]

The probability of preference $P_{ij}$ is based on the output of the model for documents $d_i$ and $d_j$: $f(d_i) = s_i$ and $f(d_j) = s_j$. A sigmoid function, commonly used in neural networks [29], is applied to the output values to create the probability:

\[ P_{ij} = \frac{1}{1 + e^{-\sigma(s_i - s_j)}} \tag{2.7} \]

The derivative of the cost function with respect to one of the weights of the model $w_k$ can be derived as:

\[ \frac{\partial C}{\partial w_k} = \frac{\partial C}{\partial s_i} \frac{\partial s_i}{\partial w_k} + \frac{\partial C}{\partial s_j} \frac{\partial s_j}{\partial w_k} = \frac{\partial C(s_i - s_j)}{\partial s_i} \left( \frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k} \right) = \lambda_{ij} \left( \frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k} \right) \tag{2.8} \]

The minus sign follows because $C$ depends on the scores only through $s_i - s_j$, so that $\partial C / \partial s_j = -\partial C / \partial s_i$. Here lambda is defined as:

\[ \lambda_{ij} \equiv \frac{\partial C(s_i - s_j)}{\partial s_i} \tag{2.9} \]

When updating $w_k$ we only need the set $I$ containing all $\{i, j\}$ where $d_i$ was labeled as more relevant than $d_j$:

\[ \delta w_k = -\eta \sum_{\{i,j\} \in I} \left( \lambda_{ij} \frac{\partial s_i}{\partial w_k} - \lambda_{ij} \frac{\partial s_j}{\partial w_k} \right) = -\eta \sum_i \lambda_i \frac{\partial s_i}{\partial w_k} \tag{2.10} \]

where $\eta$ determines the learning speed of the algorithm. The key insight here is that the $\lambda_{ij}$ for document pairs can be aggregated into $\lambda_i$ for each document:

\[ \lambda_i = \sum_{j : \{i,j\} \in I} \lambda_{ij} - \sum_{j : \{j,i\} \in I} \lambda_{ji}. \tag{2.11} \]

The sign of $\lambda_i$ indicates in what direction the document $d_i$ should be moved in the ranking and its magnitude by how much. Besides the computational benefits of being able to batch-update the model, the usage of $\lambda_i$ allowed for the insight that led to LambdaRank [9].
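The aggregation of pairwise $\lambda_{ij}$ into per-document $\lambda_i$ (Equations 2.9 to 2.11) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis' code: it assumes a linear scoring model $s_i = \mathbf{w} \cdot \mathbf{x}_i$, a known preference probability of 1 for every pair in $I$, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def preference_probability(s_i: float, s_j: float, sigma: float = 1.0) -> float:
    """P_ij of Equation 2.7: sigmoid of the scaled score difference."""
    return 1.0 / (1.0 + math.exp(-sigma * (s_i - s_j)))


def pair_cost(s_i: float, s_j: float, sigma: float = 1.0) -> float:
    """Cross entropy for one pair, assuming the known preference is certain (1)."""
    return -math.log(preference_probability(s_i, s_j, sigma))


def pair_lambda(s_i: float, s_j: float, sigma: float = 1.0) -> float:
    """lambda_ij = dC/ds_i for a pair where d_i is preferred over d_j."""
    return -sigma / (1.0 + math.exp(sigma * (s_i - s_j)))


def aggregate_lambdas(scores: List[float], pairs: List[Tuple[int, int]],
                      sigma: float = 1.0) -> List[float]:
    """Equation 2.11: add lambda_ij to document i and subtract it from document j."""
    lambdas = [0.0] * len(scores)
    for i, j in pairs:
        l_ij = pair_lambda(scores[i], scores[j], sigma)
        lambdas[i] += l_ij
        lambdas[j] -= l_ij
    return lambdas


def ranknet_linear_step(w: List[float], features: List[List[float]],
                        pairs: List[Tuple[int, int]], eta: float = 0.1,
                        sigma: float = 1.0) -> List[float]:
    """Equation 2.10 for a linear model, where ds_i/dw_k is simply x_ik."""
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in features]
    lambdas = aggregate_lambdas(scores, pairs, sigma)
    new_w = list(w)
    for lam, x in zip(lambdas, features):
        for k, xk in enumerate(x):
            new_w[k] -= eta * lam * xk
    return new_w


# Hypothetical example: document 0 is labeled more relevant than document 1.
updated = ranknet_linear_step([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [(0, 1)], eta=1.0)
```

Because $\lambda_{ij}$ is negative for the preferred document, the step $-\eta \lambda_i \, \partial s_i / \partial w_k$ raises the score of document 0 and lowers that of document 1 in this example.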

Though the pairwise approach of Ranknet allows for efficient optimization and produces well performing ranking models, it is somewhat limited. To see where the pairwise approach is inadequate, consider a model which gets the top ten documents of a ranking correct and in the correct order but has the order of the hundred bottom documents incorrect. The cost of Ranknet would update the model in order to fix the order of the bottom documents, even if this update would compromise the correct top ten documents. This behaviour is extremely undesirable, as in practice users will only consider the top results [20,79,105]. This is taken into account by most of the metrics used in the Learning to Rank field; for instance the commonly used NDCG@10 [44] only considers the top ten documents in a ranking (see Section 2.3.1). However, since NDCG is not differentiable, it cannot be used directly as a cost function for gradient-based optimization. LambdaRank [68] instead multiplies the $\lambda_{ij}$ values by the change in NDCG that the swapping of $d_i$ and $d_j$ would cause: $|\Delta NDCG_{ij}|$. This results in a new cost function $C'$ that is defined by the $\lambda_{ij}$ it produces:

\[ \lambda_{ij} \equiv \frac{\partial C'(s_i - s_j)}{\partial s_i} = \frac{\partial C(s_i - s_j)}{\partial s_i} \, |\Delta NDCG_{ij}| \tag{2.12} \]

where $C$ is the cost function used in Ranknet (2.6). This change makes LambdaRank a listwise approach, as besides the preferences between document pairs it considers their place in the resulting rankings [10,68,101]. Because the magnitude of $\lambda$ in LambdaRank depends on NDCG values, it prioritizes ordering document pairs that affect this metric the most. Empirical work has shown that LambdaRank directly optimizes NDCG [23,103]; consequently, in terms of NDCG LambdaRank performs better than Ranknet. Additionally, LambdaRank can trivially be changed to optimize for other Information Retrieval metrics such as Mean Average Precision or Mean Reciprocal Rank [81]; this can be done by simply scaling $\lambda_{ij}$ according to the desired metric instead of NDCG [23].
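The scaling in Equation 2.12 can be made concrete with a small Python sketch that computes $|\Delta NDCG_{ij}|$ for a swap of two rank positions. The exact NDCG definition is given in Section 2.3.1; the gain $2^{rel} - 1$ and the $\log_2$ discount used here are a common formulation and are an assumption of this sketch, as are all function names.

```python
import math
from typing import List


def dcg(labels_in_rank_order: List[int]) -> float:
    """Discounted cumulative gain with gain 2^rel - 1 and log2 discount (assumed)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels_in_rank_order))


def delta_ndcg(labels_in_rank_order: List[int], i: int, j: int) -> float:
    """|change in NDCG| if the documents at rank positions i and j swap places."""
    ideal = dcg(sorted(labels_in_rank_order, reverse=True))
    if ideal == 0.0:
        return 0.0  # no relevant documents: every ordering is equally good
    swapped = list(labels_in_rank_order)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(labels_in_rank_order)) / ideal


def lambdarank_lambda(ranknet_lambda: float, labels_in_rank_order: List[int],
                      i: int, j: int) -> float:
    """Equation 2.12: the Ranknet lambda_ij scaled by |delta NDCG_ij|."""
    return ranknet_lambda * delta_ndcg(labels_in_rank_order, i, j)


# Hypothetical ranking with relevant documents at positions 0 and 3.
labels = [1, 0, 0, 1]
top_swap = delta_ndcg(labels, 0, 1)      # swap affecting the top of the ranking
bottom_swap = delta_ndcg(labels, 2, 3)   # swap affecting the bottom of the ranking
```

In this toy ranking the swap at the top changes NDCG more than the swap at the bottom, so pairs near the top receive the larger $\lambda_{ij}$, which is the prioritization described above.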

Lastly we have LambdaMart [99,100], a combination of MART [28] and LambdaRank. MART is a boosted tree model that can be seen as performing gradient descent using regression trees. The model is essentially a linear combination of least squares regression trees, as the output of MART can be written as:

\[ R_N(d) = \sum_{i=1}^{N} \alpha_i f_i(d) \tag{2.13} \]

where each $f_i(d) \in \mathbb{R}$ is the output of a single regression tree and $\alpha_i$ is the associated weight for the tree. The number of trees $N$ and the number of leafs per tree $L$ are meta-parameters of MART. Trees are generated in succession: the algorithm starts by creating the first regression tree on least squares cost, thus each node is split greedily to minimize the least squares error. The next tree is generated by modeling the derivative of the cost of the current model, $\sum_{d_i} \partial C(R_n) / \partial R_n(d_i)$; this is repeated $N - 1$ times. Thus each tree added to the ensemble can be seen as a step in gradient descent, where the gradients are the $\lambda_i$ from LambdaRank. The trees are generated using Newton’s approximation since the exact step for each leaf cannot be calculated in this instance. The details of the tree generation process are lengthy and not relevant to the rest of the thesis [9]. Finally, when comparing LambdaRank and LambdaMart it is interesting to see that LambdaRank updates all its weights after a query is examined. LambdaMart on the other hand only updates some weights at a time, i.e. it splits the current node of the tree it is generating. This decision process only affects documents that fall in the current node; however, it follows the gradient based on all the data. This means LambdaMart can explicitly choose splits that increase the cost for some queries if this also decreases the overall cost.

Though Ranknet, LambdaRank and LambdaMart as well as the prevalent Online Learning to Rank algorithms all use gradient descent [60,88], the Offline Learning to Rank algorithms are not designed to deal with the difficulties in the Online setting. As a result, they perform very poorly due to the bias and noise issues inherent in Online Learning to Rank [38]. However, the boosted regression trees and support vector machine models work extremely well in the Offline setting [13,94], and thus there is a high potential for models inspired by this approach working in the Online setting.

2.2 Online Learning to Rank

This thesis introduces models and algorithms for Online Learning to Rank. Although closely related to Offline Learning to Rank, the two differ fundamentally in their setup. Namely, in the Online setting no annotations are available; instead the algorithm learns over time from interacting with its users [38]. Due to this difference, many of the issues with the Offline setting do not appear in the Online setting. Firstly, the Online setting does not require the costly creation of an annotated dataset, making it applicable to all search contexts. Consequently online algorithms can be used in settings where experts cannot annotate documents, for instance cases that are privacy sensitive or where the search context is too specific. Secondly, because the interactions come from users directly they are more representative of their preferences than when inferred by annotators. Additionally this makes the algorithm sensitive to changes in relevance over time, something that cannot be accounted for in static datasets.

Besides these benefits the Online setting also has drawbacks of its own. Most of these stem from the fact that user preferences cannot directly be observed from their interactions [46,74,104]. Moreover, algorithms have to deal with both training their ranking model and gathering the interactions to base their optimizations on. In the standard Online setting queries are issued by users sequentially and the algorithm decides what rankings to display. The user interactions with those rankings are logged and given to the algorithm, which uses them to update its model. Although the Online Learning to Rank task can be modelled as a Reinforcement Learning problem, it differs from a typical Reinforcement Learning setting because there is no observable reward. Instead, since the algorithm has to infer preferences from user interactions it has to deal with noise and bias. The most common form of noise occurs when users click on documents despite not considering them relevant to the query. This can happen for numerous reasons, e.g. the document may coincidentally be interesting while unrelated to the query, the document may be sensational or provocative (also known as clickbait), or all other displayed documents may be even more irrelevant [12,48,49]. The most prevalent form of bias in Online Learning to Rank is position bias [20,105]. This bias originates from the limited number of documents users are willing to consider and the order they will examine them in. This means that the position of a document heavily affects the number of interactions it will receive. Accordingly, the lower a document is in the ranking the less likely it is to be considered by the user and thus be interacted with [96]. Consequently user preferences have to be inferred from interactions with care [25,73]. Moreover, simply following the most clicked documents will not work, as even the worst rankings receive clicks and not all documents will have been represented in the displayed rankings.

Inferring preferences from user interactions in the Online setting is also the aim of the task of Online Evaluation in Information Retrieval [69,72]. Here the aim is to identify the best ranker out of a set of rankers based on user interactions. Relevant Online Evaluation methods use interleaving [17,39,71], where the rankings outputted by two rankers are combined into a single interleaved ranking. The clicks on this resulting ranking can then be used to derive which of the rankers in the set is the most preferred. Correspondingly, the most influential algorithm in Online Learning to Rank, Dueling Bandit Gradient Descent (DBGD) [104], uses interleaving to optimize its model. DBGD estimates the gradient of its model by generating a slight variant of its current model and comparing them using Team Draft Interleaving [74]. If the clicks of the user indicate a preference for the variant, the model is updated accordingly. Recently the introduction of Multileaving [86] allowed for the comparison of more than two rankers from a single user impression, by creating a multileaved ranking instead of an interleaved one. This new evaluation method was applied to DBGD, resulting in Multileave Gradient Descent [88]. The usage of multileaving increased the learning speed of the algorithm as more comparisons can be made per impression. Subsequently the introduction of Probabilistic Multileaving [87] allowed a virtually unlimited number of rankers to be compared per impression; this again led to an improvement in the learning speed of Multileave Gradient Descent [60].
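The interplay between interleaving and the DBGD update described above can be sketched in Python. This is a simplified illustration rather than the implementation used in the thesis: the team labels "a" (current ranker) and "b" (variant), the rule that the variant wins only with strictly more clicked documents, and all function names are assumptions of this sketch.

```python
import math
import random
from typing import Dict, List, Tuple


def team_draft_interleave(ranking_a: List[str], ranking_b: List[str], length: int,
                          rng: random.Random) -> Tuple[List[str], Dict[str, str]]:
    """Combine two rankings; each placed document is assigned to the team that picked it."""
    interleaved: List[str] = []
    teams: Dict[str, str] = {}
    picks = {"a": 0, "b": 0}
    rankings = {"a": ranking_a, "b": ranking_b}
    while len(interleaved) < length:
        remaining = {t: [d for d in r if d not in teams] for t, r in rankings.items()}
        if not remaining["a"] and not remaining["b"]:
            break
        # The team with fewer picks goes first; ties are broken by a coin flip.
        if picks["a"] < picks["b"] or (picks["a"] == picks["b"] and rng.random() < 0.5):
            turn = "a"
        else:
            turn = "b"
        if not remaining[turn]:
            turn = "b" if turn == "a" else "a"
        doc = remaining[turn][0]  # highest ranked document not yet placed
        interleaved.append(doc)
        teams[doc] = turn
        picks[turn] += 1
    return interleaved, teams


def variant_wins(teams: Dict[str, str], clicked_docs: List[str]) -> bool:
    """The variant (team "b") wins if it contributed strictly more clicked documents."""
    credit = {"a": 0, "b": 0}
    for doc in clicked_docs:
        credit[teams[doc]] += 1
    return credit["b"] > credit["a"]


def sample_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Uniform direction on the unit sphere via normalized Gaussian samples."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dbgd_step(w: List[float], u: List[float], eta: float, variant_won: bool) -> List[float]:
    """Move the current weights a step eta along u only if the variant was preferred."""
    if not variant_won:
        return list(w)
    return [wi + eta * ui for wi, ui in zip(w, u)]


rng = random.Random(0)
shown, teams = team_draft_interleave(["d1", "d2", "d3", "d4"],
                                     ["d3", "d1", "d4", "d2"], 4, rng)
```

In a full loop the variant would be $\mathbf{w} + \delta \mathbf{u}$ for a freshly sampled $\mathbf{u}$, both rankers would rank the documents for the incoming query, and `dbgd_step` would be applied with the outcome of `variant_wins` on the observed clicks.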

The main Multileave Gradient Descent algorithm is displayed in Algorithm 1. The algorithm starts with a current best ranker $w^0_0$, which is initialized with all weights set to zero. At each issued query, another n rankers $w^n_t$ are sampled from the unit sphere around the current best ranker (Line 7). Each of these rankers produces a ranking for the query, and these rankings are combined into a single multileaved ranking, e.g. by using Probabilistic Multileaving (Line 9). The resulting ranking is displayed to the user and their clicks are observed (Line 10); from these clicks the rankers preferred over the current best are inferred (Line 11). Naturally, the inference method is defined by the multileaving method used. If none of the other rankers is preferred, the current best is kept as the best ranker; otherwise the model takes an η step in the direction derived by the update method (Line 15). Schuth et al. propose two update methods: Mean-Winner and Winner-Takes-All, shown in Algorithm 2 and 3 respectively. The first updates towards the average of the set of preferred rankers, the latter uniformly selects a ranker from the set and updates towards it. They empirically conclude that the Mean-Winner is more robust than the Winner-Takes-All method [88], however Section 4.3.2 describes a situation

Algorithm 1 Multileave Gradient Descent (MGD).
 1: Input: n, δ, w^0_0, update(w, η, {b}, {u})
 2: for t ← 1..∞ do
 3:     q_t ← receive_query(t)                      // obtain a query from a user
 4:     l_0 ← generate_list(w^0_t, q_t)             // ranking of current best
 5:     for i ← 1...n do
 6:         u^i_t ← sample_unit_vector()
 7:         w^i_t ← w^0_t + δ u^i_t                 // create a candidate ranker
 8:         l_i ← generate_list(w^i_t, q_t)         // exploratory ranking
 9:     m_t, t_t ← multileave(l)                    // multileaving and teams
10:     c_t ← receive_clicks(m_t)                   // show multileaving to the user
11:     b_t ← infer_preferences(t_t, c_t)           // set of winning candidates
12:     if b_t = ∅ then
13:         w^0_{t+1} ← w^0_t                       // if current best among winners, no update
14:     else
15:         w^0_{t+1} ← w^0_t + η(update(b_t, u_t)) // Algorithm 2 or 3

Algorithm 2 MGD update function: mean winner (MGD-M).
 1: Input: b_t, u_t
 2: return (1 / |b_t|) ∑_{j ∈ b_t} u^j_t

where averaging candidates should be avoided, thus making this update method useful. Note that with either update method, setting the number of candidates to n = 1 reduces MGD to DBGD. After the model has been updated the algorithm waits for the next query to repeat the process, yielding a continuously adapting system.
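The update step of Algorithm 1 (Lines 12 to 15) and the two update functions of Algorithms 2 and 3 can be sketched as follows. This is a minimal illustration, not the implementation used for the thesis: the function names, the representation of winners as a list of candidate indices, and the Gaussian-normalisation trick for sampling from the unit sphere are all assumptions made for the example. Multileaving and click inference are deliberately left out.

```python
import random


def sample_unit_vector(dim, rng=random):
    """Draw a direction uniformly from the unit sphere (normalised Gaussian sample)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def mean_winner(winners, directions):
    """Sketch of Algorithm 2 (MGD-M): average the directions of all winners."""
    dim = len(directions[winners[0]])
    return [sum(directions[j][d] for j in winners) / len(winners)
            for d in range(dim)]


def winner_takes_all(winners, directions, rng=random):
    """Sketch of Algorithm 3 (MGD-W): pick one winner uniformly at random."""
    return list(directions[rng.choice(winners)])


def mgd_update(w, winners, directions, eta, update=mean_winner):
    """Keep w when no candidate won, otherwise take an eta step (Lines 12-15)."""
    if not winners:
        return list(w)
    step = update(winners, directions)
    return [wi + eta * si for wi, si in zip(w, step)]
```

With a single candidate per iteration, `winners` is either empty or contains that one candidate, so both update functions return the same direction, which mirrors the remark that n = 1 reduces MGD to DBGD.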

Algorithm 3 MGD update function: winner takes all (MGD-W).
 1: Input: b_t, u_t
 2: j ← pick_random_uniformly(b_t)
 3: return u^j_t

Another extension to DBGD is Candidate Pre-Selection (CPS), which was proposed by Hofmann et al. [41] for reusing historical interactions during gradient descent. This method was made possible by the introduction of Probabilistic Interleaving [37], which allows for the comparison of two rankers on an interleaving that was created using a different ranker pair. CPS changes the candidate generation step in DBGD (Algorithm 1, Line 7). Instead of sampling a candidate uniformly, a set of uniformly generated rankers is created and, based on historical interactions, the single most promising candidate is selected. This vastly increases the likelihood that the candidate lies in the direction of the gradient. The effect of CPS is that the learning speed is tremendously increased; however, after an initial peak the learned model’s performance degrades over time [60]. This seems to be caused by the preselection becoming too biased by its historical interactions and ‘overfitting’ on them. Due to the unreliability of CPS, this thesis will only introduce methods that extend MGD.

2.3 Experimental Setup in Online Learning to Rank

In order to answer the research questions posed in Chapter 1, various experiments were performed. Due to the large number of experiments, their specific setup and results are discussed over several chapters: Section 4.3 details the experiments regarding the overall algorithm and how it learns non-linear parts of models; Section 5.3 describes the experiments specific to the Regression Forest models; finally, Section 6.2 discusses the experiments regarding Kernel Based models. This section serves to describe the part of the experimental setup that all of them have in common.

This thesis follows the standard setup for experiments in Online Learning to Rank [40,41,60,88]. Every experiment is based around a stream of queries coming from users interacting with the system that is being trained. All of the queries are assumed to be independent. The system responds to every query by presenting a list of documents to the user; this is called an impression. In turn, the user may or may not interact with the list by clicking on one or more documents. For this thesis the number of results displayed is limited to k = 10, meaning the user cannot consider or interact with more than ten documents per impression. The main experimental runs all consist of 10,000 queries; this number was chosen as a compromise between allowing the models to fully converge and having experiments finish in a feasible amount of time [60]. The queries and documents come from static datasets, which are described in Section 2.3.4; users are simulated using the click models detailed in Section 2.3.3. Both the model that is trained and the user experience during training are evaluated; Section 2.3.1 describes how. Finally, Section 2.3.2 discusses the baseline that is used to compare to the state-of-the-art algorithm and related parameters that all experiments have in common.

2.3.1 Evaluation Metrics

All performance is assessed using the NDCG [44] metric, namely NDCG@10 is computed meaning that only the top κ = 10 documents of a ranking are considered:

$$\mathrm{NDCG} = \sum_{i=1}^{\kappa} \frac{2^{\mathrm{rel}(r[i])} - 1}{\log_2(i + 1)} \, \mathrm{iDCG}^{-1}. \tag{2.14}$$

This metric calculates the Discounted Cumulative Gain (DCG) over the relevance labels rel(r[i]) for each document in the top κ of a ranking. Subsequently, this is normalized by the maximal DCG possible for a query: the ideal DCG (iDCG) which is the DCG for the perfect ranking. This results in Normalized DCG (NDCG) which measures the quality of a single ranked list of documents with the maximum value of NDCG = 1 for a perfect ranking.
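As an illustration of Equation 2.14, the following sketch computes NDCG@κ from a list of relevance labels given in ranked order. It assumes that the list contains the labels of all candidate documents for the query, so that sorting it yields the ideal ranking; returning 0 for queries without any relevant document is also an assumption of this example, not something the text prescribes.

```python
import math


def dcg_at_k(labels, k=10):
    """Discounted Cumulative Gain over the top-k relevance labels."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(labels[:k], start=1))


def ndcg_at_k(ranked_labels, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    if ideal == 0:
        return 0.0  # assumed convention: no relevant documents for this query
    return dcg_at_k(ranked_labels, k) / ideal
```

For example, `ndcg_at_k([3, 2, 1, 0])` is 1 because the labels are already in ideal order, whereas swapping a relevant and a non-relevant document lowers the score.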

The main task in Online Learning to Rank is to optimize a model; accordingly, the performance of the learned model is measured during and after training. This is called the offline performance and is computed by taking the average NDCG score of the current best ranker over a held-out set. In other words, offline performance is measured over a set of queries that the algorithm has not seen during training. To get an idea of learning efficiency this metric can be computed at different moments during training, which reveals the average number of impressions the algorithm needs to reach a certain level of performance.

deterring users during training would compromise the task of the system. For this reason, the online performance is also assessed by computing the cumulative NDCG of the rankings shown to the users [36,91]. For T successive queries this is the discounted sum:

$$\mathrm{OnlinePerformance} = \sum_{t=1}^{T} \mathrm{NDCG}(m_t) \cdot \gamma^{(t-1)} \tag{2.15}$$

where $m_t$ is the ranking displayed to the user at timestep t. This metric is common in Online learning and can be seen as the expected reward with γ as the probability that another query will be issued. For online performance a discount factor of γ = 0.9995 was chosen so that queries beyond the horizon of 10,000 queries have a less than 1% impact [60].
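Equation 2.15 can be sketched in a few lines. The function name and the input format (one NDCG value per displayed ranking, in the order the queries were issued) are assumptions made for this example.

```python
def online_performance(ndcg_per_impression, gamma=0.9995):
    """Discounted sum of the NDCG of every ranking shown to users.

    enumerate starts at 0, which matches the exponent (t - 1) for t = 1..T.
    """
    return sum(score * gamma ** t
               for t, score in enumerate(ndcg_per_impression))
```

The choice of γ can be checked directly: `0.9995 ** 10000` is roughly 0.0067, so an impression at the 10,000-query horizon is weighted at less than 1% of the first one.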

Finally, all experimental runs were repeated 125 times, spread evenly over the dataset’s folds. Results for each run were averaged and a two-tailed Student’s t-test was used to verify whether differences are statistically significant [107].

2.3.2 Baseline and Shared Parameters

The Research Questions RQ2 and RQ3 directly entail comparisons with the state-of-the-art Online Learning to Rank algorithm. As a baseline for these research questions Probabilistic Multileave Gradient Descent (P-MGD) [60] is used (see Section 2.2). Furthermore, this baseline shares all of its parameters with the proposed extension of this algorithm introduced in Chapter 4. Previous work shows that increasing the number of candidates generated per iteration leads to better performance of P-MGD [60,88]; however, the effect of adding another candidate lessens as more candidates are used, while the computational cost increases linearly. Accordingly, this thesis uses n = 49 candidates for all algorithms in every run; this number is great enough to achieve high learning speed while still keeping the experiments within a reasonable running time. Probabilistic Multileaving inferences were computed using a sample based method [87] where the number of document assignments sampled for every inference is 10,000 [60]. Candidates were sampled from the unit sphere with δ = 1, updates were performed with η = 0.01 and zeros were used for the initialization $w^0_0 = 0$, according to the standard set for Dueling Bandit Gradient Descent and Multileave Gradient Descent [38,60,88,104]. These parameters were the same for every algorithm that was run for this thesis, including the baseline and all novel methods.

Table 2.3.1: Instantiations of Cascading Click Models [32] as used for simulating user behaviour in experiments.

                   P(click = 1|R)                  P(stop = 1|R)
Relevance grade R  0     1     2     3     4       0     1     2     3     4
perfect            0.0   0.2   0.4   0.8   1.0     0.0   0.0   0.0   0.0   0.0
navigational       0.05  0.3   0.5   0.7   0.95    0.2   0.3   0.5   0.7   0.9
informational      0.4   0.6   0.7   0.8   0.9     0.1   0.2   0.3   0.4   0.5

2.3.3 Simulation User Behaviour

To simulate users the standard setup for Online Learning to Rank simulations is used [36,60,88]. First, a user issues a query; this is simulated by uniformly sampling a query from the static dataset, therefore the distribution of queries matches that of the dataset. Subsequently, the algorithm decides the result list of documents to display. The behaviour of the user after receiving this result list is simulated using a cascade click model [32]. This model assumes a user to examine the documents in a result list in their displayed order, starting at the top document. For each document that is considered the user decides whether it warrants a click; this is modelled as the conditional probability P(click = 1|R), where R is the relevance label provided by the dataset. Accordingly, a cascade click model instantiation should increase the probability of a click with the degree of the relevance label. After the user has clicked on a document their information need may be satisfied, otherwise they will continue considering the remaining documents. The probability of the user not examining more documents after clicking is modelled as P(stop = 1|R); again this is conditioned on the relevance label R, since it is more likely that the user is satisfied by a very relevant document.

Table 2.3.1 lists the three instantiations of cascade click models that were used for this thesis. The first models a perfect user that considers every document and clicks on all relevant documents and nothing else. Therefore this model only simulates selection bias and no position bias, as a document’s position within the result list does not affect the probability of it being clicked [20,32,105]. Secondly, the navigational instantiation models a user performing a navigational task and is mostly looking for a single highly relevant document. Finally, the informational instantiation models a user that does not have a very specific information need and typically clicks on multiple documents with less dependence on their relevance. These three models have increasing levels of noise, as the behaviour of each depends less on the relevance labels of the displayed documents.
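The simulated user described above can be sketched as follows, using the probabilities of Table 2.3.1. The function and dictionary names are hypothetical and chosen for this example only; the sketch assumes integer relevance grades from 0 to 4 and a result list cut off at k = 10.

```python
import random

# Probabilities per relevance grade 0..4, copied from Table 2.3.1.
CLICK_MODELS = {
    "perfect": {"click": [0.0, 0.2, 0.4, 0.8, 1.0],
                "stop": [0.0, 0.0, 0.0, 0.0, 0.0]},
    "navigational": {"click": [0.05, 0.3, 0.5, 0.7, 0.95],
                     "stop": [0.2, 0.3, 0.5, 0.7, 0.9]},
    "informational": {"click": [0.4, 0.6, 0.7, 0.8, 0.9],
                      "stop": [0.1, 0.2, 0.3, 0.4, 0.5]},
}


def simulate_clicks(relevance_labels, model, rng=random, k=10):
    """Return one boolean per displayed document, True where the user clicked."""
    probs = CLICK_MODELS[model]
    shown = relevance_labels[:k]
    clicks = []
    for rel in shown:  # the user examines documents top-down
        clicked = rng.random() < probs["click"][rel]
        clicks.append(clicked)
        # A stop decision is only made after a click.
        if clicked and rng.random() < probs["stop"][rel]:
            break
    # Documents below the stopping point are never examined, hence not clicked.
    clicks.extend([False] * (len(shown) - len(clicks)))
    return clicks
```

Under the perfect instantiation the stop probability is always zero, so every displayed document is examined regardless of its position.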

2.3.4 Datasets

Several Online Learning to Rank datasets are publicly available which have varying sizes and represent different search tasks. Each of these datasets consists of several queries and a list of corresponding documents, along with manual relevance assessments for the documents with respect to the queries. The document-query pairs are each represented by a feature vector; there are no representations for the queries or documents independent of each other. Every dataset is divided into training, validation and test partitions. Parameter tuning was done by evaluating performance on validation sets; all reported results are evaluated on the test sets with both the training and validation set used to train the model. For this thesis only a selection of public datasets has been used; this section will briefly describe the most important publicly available datasets and how this decision was made.

The first publicly available Learning to Rank datasets are distributed as LETOR 3.0 and 4.0 [57]. Their datasets use representations of 64 or fewer features encoding ranking models such as TF.IDF, BM25, Language Modelling, PageRank, and HITS on different parts of the documents. The datasets in LETOR are divided by their tasks, most of which come from the TREC Web Tracks between 2003 and 2008 [18,19]: HP2003, HP2004, NP2003, and NP2004 are based on navigational tasks which are homepage finding and named-page finding respectively; both TD2003 and TD2004 implement the informational task of topic distillation. HP2003, HP2004, NP2003, NP2004, TD2003 and TD2004 each contain between 50 and 150 queries and 1,000 judged documents per query. The OHSUMED dataset is based on the query log of the search engine on the MedLine abstract database, and contains 106 queries. Lastly, the two most recent datasets MQ2007 and MQ2008 were based on the Million Query Track [2] and consist of 1700 and 800 queries respectively, but have far fewer assessed documents per query. All of the LETOR distributions have their datasets divided into five folds; for this thesis the folds as given were used.

In 2010 Microsoft released the MSLR-WEB30K and MSLR-WEB10K datasets [66]. The former consists of 30,000 queries obtained from a retired labelling set of a commercial web search engine (Bing); the latter is a subsampling of 10,000 queries from the former dataset. The datasets use 136 features to represent their documents and each query has around 125 assessed documents. Similar to the LETOR datasets, these datasets were distributed in five folds.

Lastly, also in 2010 Yahoo! organised a public Learning to Rank Challenge [14], which consisted of a larger and a smaller dataset. Both datasets consist of documents sampled from the query logs of the Yahoo! search engine; the larger dataset is based on logs originating from the United States whereas the smaller dataset corresponds with an Asian country. The difference in origin was chosen to enable a transfer learning task, where a system could leverage the larger dataset to enhance performance on the smaller. For the purpose of this thesis both datasets are used separately and referred to as Yahoo-US and Yahoo-AS for the larger and smaller dataset respectively. Yahoo-US consists of 29,921 queries and 709,877 documents, the smaller Yahoo-AS contains 6,330 queries and 172,870 documents. Both encode 700 features, however only 519 have more than one unique value in Yahoo-US, whereas Yahoo-AS only has 596 useful features.

Table 2.3.2: The NDCG of LambdaMART and linear models trained using Team-Draft Multileave Gradient Descent (TD-MGD) on a large collection of datasets. The TD-MGD models were trained using different click-models and evaluated after 10,000 impressions. The difference in performance between LambdaMART and TD-MGD under the perfect click model is expected to be caused by: the effect of bias in the online setting, the gradient estimation in the TD-MGD algorithm and the linear model underfitting the data.

                         TD-MGD
dataset     LambdaMART   perfect   navigational   informational
HP2003      0.793        0.804     0.786          0.773
TD2003      0.286        0.334     0.333          0.312
NP2003      0.745        0.737     0.727          0.713
MQ2007      0.425        0.427     0.366          0.350
OHSUMED     0.431        0.462     0.438          0.436
MSLR10k     0.458        0.323     0.311          0.307
Yahoo-US    0.745        0.689     0.679          0.651

Due to time constraints and the extensive time it takes to repeat experiments, not all available datasets have been used for this thesis. The chosen datasets were selected based on two criteria. Firstly, larger datasets are preferred because they better capture the variety of queries and documents; a larger dataset also makes it less likely for a query to appear multiple times during training. Secondly, the difference in performance in the Offline and Online setting was considered. To measure the performance in the Offline setting LambdaMART was applied to all of the datasets; for the Online setting P-MGD was run with 49 candidates over 10,000 iterations under different click-models. For most datasets P-MGD converges in less than 10,000 impressions and therefore we consider it sufficient for getting a sense of the performance gap. The results are displayed in Table 2.3.2; the largest differences in performance are on the MSLR and Yahoo datasets. Because these datasets also contain the most documents, queries and features, we have decided to focus on them.

3

Related Work

In contrast with the previous work described in Chapter 2, this chapter discusses related work that is not related to Learning to Rank.

There are several related algorithms that train non-linear models in an online fashion. Saffari et al. introduced On-line Random Forests [80], a model designed for cases where training data arrives sequentially. Trees are initialized as single nodes; using online bagging [61], incoming datapoints and their labels are distributed over the trees. When a node has received a certain number of datapoints and a substantial information gain can be made, the node is branched accordingly. Thus the forest gradually increases in complexity as more datapoints have been considered and expanding the trees leads to an increased performance. Additionally, the algorithm keeps a set of out-of-bag documents which have not been presented to trees. The error on this out-of-bag set is then used to discard trees. As a result, the On-line Random Forest can be applied to data where the underlying distribution is continuously changing. Previous work has shown that the trained forest converges to the performance of an Offline Random Forest that uses the entire dataset immediately [22,52,80]. Similarly, online algorithms for optimizing Support Vector Machines (SVM) have been introduced [6,93,95]. The Online Support Vector Machine can reach the performance of the Offline Support Vector Machine after a single pass over the data [6]. To allow for efficient learning the SVM update rule is adapted to spend an equal amount of time on datapoints which are support vectors and those that are not. This is not an issue for the offline SVM, as it can initially determine whether a datapoint should be a support vector or not. In contrast, the online SVM can receive a datapoint that makes current support vectors useless or causes a previously seen vector to become a support vector. The time spent considering either has to be balanced, as the support vectors are heavily outnumbered by the other vectors.

Despite being very useful for many applications, the settings where On-line Random Forests and On-line Support Vector Machines are applicable differ from Online Learning to Rank. Firstly, these algorithms require true labels as input and are unable to account for any type of noise. Therefore they are unsuitable for optimizing based on user interactions. Secondly, they do not account for position bias, i.e. highly ranked documents are more likely to be clicked, or selection bias, i.e. only displayed documents receive clicks. Instead, both algorithms assume their model to be independent of what datapoints will be received. In contrast, in Online Learning to Rank the system has to decide what datapoints are gathered, i.e. what documents to display. Therefore, important concepts like exploration [40] are not present in either On-line Random Forests or On-line Support Vector Machines. Conversely, Chapter 5 and Chapter 6 introduce methods that optimize models similar to Random Forests and Support Vector Machines for Online Learning to Rank respectively.

Finally, He et al. introduced a combination of Gradient Boosted Trees and Logistic Regression to create a model for predicting clicks on advertisements [34]. This combination was made by using decision trees as a feature transformation and applying logistic regression on the resulting binary representation. The feature transformation is updated once per day based on a large history of clicks. Data is transformed by taking every leaf as a binary feature. The logistic regression is applied on top of this representation in an online fashion; because it is computationally cheap it can be updated much more frequently than the decision trees. This is very advantageous, as fresh data is considered the most important when predicting clicks [34,76]. Thus the model combines the more complex structure of decision trees with the efficient adaptiveness of logistic regression. Correspondingly, it performs better than either individual method.

The approach of He et al. is very similar to our approach in Chapter 5. Both use decision trees as a feature transformation and apply a linear model on top of it. However, since He et al. are predicting clicks, they can calculate their cost and create decision trees accordingly. In Online Learning to Rank this is not possible, therefore our approach learns tree structures through multileaving instead.

4

Optimization Techniques for Non-Linear Models in the Online Learning to Rank Setting

This chapter proposes a framework for optimizing non-linear models, resembling those described in Section 2.1, in the Online Learning to Rank setting using a listwise approach, building upon the state-of-the-art described in Section 2.2. The following chapters will adapt this framework for specific types of models. First, Section 4.1 describes the difficulties in non-linear optimization in the Online setting. Subsequently, Section 4.2 discusses how many non-linear models can be represented as linear combinations of ranker functions, and how different non-linear structures of a model can be considered by sampling these ranker functions. Then Section 4.3 proposes different ways of limiting the number of ranker functions the linear combination uses during the Online optimization. Furthermore, the experiments to empirically validate these approaches and their results are discussed in Section 4.4 and Section 4.5 respectively. Finally, Section 4.6 concludes the chapter by describing an algorithm that uses the approach that performed best according to our results. This algorithm will be used for the remainder of this thesis.

4.1 Difficulties in Applying Multileave Gradient Descent to Non-linear models

Section 2.1 discussed the state-of-the-art Offline Learning to Rank models and why they cannot directly be applied to the Online setting. Primarily this is because neither their costs nor their gradients can be calculated as is possible in the Offline setting. In contrast, the state-of-the-art Online Learning to Rank algorithms described in Section 2.2 estimate their gradients using multileaving methods. Algorithm 1 reveals three important properties a model should have in order for Multileave Gradient Descent to be applicable. Firstly, the gradient estimation process requires the sampling of slight variants of the model that is being optimized. Subsequently, from user interactions multileaving can estimate which variants are improvements (Algorithm 1, Line 7) and infer the direction of the gradient. Thus, it is important that the sampled variants differ only slightly from the current model, otherwise they will not be representative of the local gradient. Secondly, the variants should cover all “directions” of the gradient evenly, otherwise the estimation process will be biased towards the more sampled regions (Algorithm 1, Line 6). Thirdly, the model must allow for small updates, as this is crucial to deal with the unreliability of multileaving. Since noise is inherent in user interactions, the algorithm has to avoid learning the noise; by using small steps the model will oscillate in the direction of the true gradient over time. Algorithm 1, Line 15 displays how an η step is taken towards the winning variants.

Let’s consider training a single regression tree with Multileave Gradient Descent to see how it does not meet the previously mentioned properties. First off, sampling a variant of a tree involves two aspects of the tree: changing the labels $Y_N$ of its nodes and changing the structure of the tree. It is unclear what constitutes a slight alteration to the structure of the tree, since there is nothing similar to a unit sphere that branchings can be sampled from. Secondly, it is also unclear whether it is possible to evenly sample from all structural changes of a tree. Lastly and most importantly, in contrast with a change in labels, a small update step towards variants with a different structure is not always possible. For instance, take a single node $x_n$ from a given tree and some slight variants of the tree. Let’s say multileaving estimates that two variants are improvements over this tree: the first branches the node into $x_{n+1}$ and $x_{n+2}$, the other into $x_{n+3}$ and $x_{n+4}$. In this case a small update is possible by choosing one of the branchings and setting the labels accordingly, e.g. choose to branch into $x_{n+1}$ and $x_{n+2}$ and set $y_{n+1} = y_n + \eta(y'_{n+1} - y_n)$ etc. However, in subsequent iterations the branching into $x_{n+3}$ and $x_{n+4}$ will no longer be possible, thus this small change has committed to a large limitation in the exploration of trees. Consequently, these issues indicate that the reliable optimization of a single tree seems to be infeasible using standard Multileave Gradient Descent.

A big part of these issues seems to stem from the noise in the multileaving estimation; thus, if the algorithm is more certain when committing to a branching, the optimization may be successful. However, this would mean that the algorithm has to observe more clicks before updating, dramatically slowing it down. Furthermore, the algorithm can no longer be continuously adapting to the user unless there are no bounds on the growth of the tree. In turn, this would be both computationally impractical and encourage overfitting behaviour.

This thesis proposes a different approach. Instead of trying to optimize a single model, Multileave Gradient Descent can optimize a linear combination of models. In the case of optimizing a regression tree, the candidate rankers are not simply variants of the tree but interpolations between the current model and a variant. Making a slight variant is thus no longer an issue since the interpolations explicitly parameterize how similar they are to the original tree. Moreover, updating a linear combination a slight step towards the candidate is trivial, as it is simply the linear η update step. However, over time this would mean the learned combination will

define a framework for optimizing non-linear models using this approach, as well as proposing methods to overcome the unbounded growth of the linear combination.

4.2 Non-Linearity through Generated Ranker Functions

The previous section has made the case for optimizing a linear combination of non-linear models instead of optimizing a single model in the Online setting. One of the foreseeable issues with this approach is the large number of models that the algorithm has to keep track of. Later chapters show that some non-linear models are equivalent to a linear combination of multiple simpler ranker functions (see Chapter 5 and 6). In turn this property can be used to efficiently represent a linear combination of non-linear models, since they can have a large overlap in their simpler sub-models.

Naturally, optimizing a linear combination of models can be done using Multileave Gradient Descent: let Φ be a collection of ranker functions $\Phi \equiv \{\varphi_1, \ldots, \varphi_N\}$ and let the corresponding weights to be learned be $w \equiv \{w_0, \ldots, w_N\}$, then the resulting scoring model R will be:

$$R(d) = \sum_{i=1}^{N} w_i \varphi_i(d) \tag{4.1}$$

where d is the feature representation of a document-query pair. Documents are ranked according to the value of R(d), which is optimized over time according to Algorithm 1. Obviously this is exactly the situation the algorithm was designed for and thus w will be optimized in a fast and reliable manner [60,87,104]. Moreover, note that a large portion of the features used in Learning to Rank datasets are outputs of established ranker functions, e.g. BM25, PageRank, TF-IDF etc. (see Section 2.3.4). Therefore learning a linear combination of Φ or d can both be seen as a linear combination of ranker functions.
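A minimal sketch of the scoring model in Equation 4.1, assuming ranker functions are plain callables that map a feature vector d to a number. The two example functions in the usage note below (a raw feature lookup and a threshold on another feature) are hypothetical and only serve to show that the combination is linear in w even when an individual φ is not linear in d.

```python
def score(weights, rankers, d):
    """R(d): weighted sum of the ranker function outputs for one document."""
    return sum(w * phi(d) for w, phi in zip(weights, rankers))


def rank(weights, rankers, documents):
    """Order documents by descending R(d)."""
    return sorted(documents,
                  key=lambda d: score(weights, rankers, d),
                  reverse=True)
```

For instance, with `rankers = [lambda d: d[0], lambda d: 1.0 if d[1] > 0.5 else 0.0]` and `weights = [1.0, 2.0]`, the document `(0.2, 0.8)` scores 2.2 and is ranked above `(0.9, 0.1)`, which scores 0.9.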

The obvious flaw in this approach is that despite R being a non-linear model, the non-linear part of R is not being optimized. In other words, a model (the weights w) in the space defined by Φ is optimized, but the set of ranker functions Φ, which encompasses the non-linear aspect of the model, remains unaltered. Correspondingly, this thesis introduces an algorithm that updates the set of ranker functions over time, thus producing a $\Phi_t$ for every timestep t. Depending on what model is optimized a distribution of ranker functions is introduced:

$$\Pi(\Phi_t) \sim \varphi. \tag{4.2}$$

Note that Φ has to be established before any interactions have been recorded, thus Π(∅) should be defined to generate the initial set of ranker functions. By conditioning Π on the existing set of functions it is possible to increase the complexity of the generated functions over time. For instance, this behaviour allows the algorithm to mimic the growth of regression trees, which will be further discussed in Chapter 5.

In addition, to prevent the algorithm from accumulating an unmanageable number of models, this thesis also proposes an alteration of the Multileave Gradient Descent method that causes the algorithm to optimize towards a linear model that does not use all available ranker functions. In other words, the algorithm is changed to not only find the optimal weights w but is also incentivized to discard ranker functions φ. As a result, the algorithm aims to make an optimal selection of ranker functions, $\Phi_{t+1} \subset \Phi_t$, while simultaneously optimizing $w_{t+1}$ for that subset. Several such alterations are proposed and evaluated in Section 4.3 and Section 4.5.

By discarding suboptimal ranker functions from Φt, the problem of the growing number of models in the linear combination is resolved. Additionally this means the algorithm performs a search in Φt and therefore it is also adapting the non-linear part of the model. Furthermore, Π can be used to add functions to Φ thus creating a continuously adapting system that optimizes w while simultaneously searching for the optimal set of non-linear functions Φ in Π.

The next section evaluates different methods for selecting a subset of Φ, then Chapter 5 and Chapter 6 show how regression forests and kernel based methods fit in this framework respectively.

4.3 Online Subspace Selection

This section proposes several alterations to the Multileave Gradient Descent algorithm. These changes aim to optimize a linear model that is limited in the number of elements it uses, meaning that the algorithm is guided to reduce the number of non-zero weights in w. Subsequently, the ranker functions in Φ that correspond to zero weights in w no longer affect the ranking model and are discarded. The advantages of this behaviour are threefold. Firstly, it has obvious computational benefits, since fewer rankers have to be kept in memory and fewer φ(d) values have to be calculated. Secondly, the model will be optimized faster, as the learning speed of Gradient Descent methods decreases with the number of parameters they optimize [104]. Thirdly, by discarding unimportant ranker functions, room is made to consider others that are potentially useful. This task is very similar to feature selection, which has been studied in Offline Learning to Rank before [30]. However, to our knowledge this is the first attempt to research it in the Online Learning to Rank setting.
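To make concrete why zero-weight ranker functions can be discarded safely, the following minimal sketch scores a document with a linear combination of ranker functions before and after pruning. The names `score` and `prune_zero_weights`, the toy ranker functions, and the representation of a document as a list of feature values are assumptions made for illustration only; they are not taken from the thesis.

```python
def score(weights, rankers, doc):
    """Linear combination: sum over n of w_n * phi_n(doc)."""
    return sum(w * phi(doc) for w, phi in zip(weights, rankers))


def prune_zero_weights(weights, rankers):
    """Drop every ranker function whose weight is exactly zero."""
    kept = [(w, phi) for w, phi in zip(weights, rankers) if w != 0.0]
    return [w for w, _ in kept], [phi for _, phi in kept]


# Hypothetical toy ranker functions over a two-feature document.
rankers = [
    lambda d: d[0],
    lambda d: d[1] ** 2,
    lambda d: d[0] * d[1],
]
weights = [0.5, 0.0, -1.5]
doc = [2.0, 3.0]

pruned_w, pruned_r = prune_zero_weights(weights, rankers)

# The second ranker contributes 0.0 * phi(doc), so both scores agree.
full_score = score(weights, rankers, doc)
pruned_score = score(pruned_w, pruned_r, doc)
```

After pruning, one fewer φ(d) value has to be computed per document, while every document receives the same score and hence the same position in the ranking.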

4.3.1 Discarding the k Smallest Weights

The first method is based on the assumption that the importance of a function φ_n can be derived from the norm of its weight w_n. Assuming the output of the ranker functions is normalized, i.e. the output of every φ has the same magnitude on average, w_n directly indicates the contribution φ_n makes to the linear combination. Therefore, removing the k functions from Φ corresponding to the smallest weights in w would affect the model the least.
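As an illustration, discarding the k smallest weights could be sketched as below. The function name `discard_k_smallest` and the use of string labels in place of actual ranker functions are assumptions for the example; the absolute value is used as the norm of a scalar weight.

```python
def discard_k_smallest(weights, rankers, k):
    """Remove the k ranker functions whose weights are smallest in absolute value."""
    if k <= 0:
        return list(weights), list(rankers)
    # Indices ordered from smallest to largest absolute weight.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])
    keep = [i for i in range(len(weights)) if i not in dropped]
    return [weights[i] for i in keep], [rankers[i] for i in keep]


# Hypothetical weights; the labels stand in for ranker functions.
weights = [0.8, -0.05, 0.3, 0.01, -0.6]
rankers = ["phi_1", "phi_2", "phi_3", "phi_4", "phi_5"]

new_w, new_r = discard_k_smallest(weights, rankers, k=2)
```

Here the functions with weights 0.01 and -0.05 are removed and the original order of the remaining functions is preserved.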

A big flaw in this approach is that the total norm of the discarded weights may not be negligible. For instance, suppose a large number of replicas of the same ranker function (or very highly correlated functions) are in Φ, e.g. it includes a hundred equivalent functions: φ_n ≡ φ_{n+1} ≡ ... ≡ φ_{n+99}. Their individual weights are then expected to be a hundred times smaller than if φ_n had appeared only once. By dropping multiple functions at the same time, several correlated, highly important features could be discarded. Additionally, the rankers are evaluated on how they rank documents, not directly on their value per document. Therefore, a very small weight may still act as an important tiebreaker in rankings. In Section 4.4 two ways of discarding functions in this manner are evaluated: one minimizes the number of times functions are discarded, thus maximizing k and discarding all k functions in a single iteration; the other minimizes the number of functions discarded per iteration, thus k = 1 and the discarding is spread over many iterations.

Algorithm 4 Generating a single candidate for Candidate Subspace Selection.
 1: Input: δ, w, η, Φ, p, λ
 2: u ← sample_unit_vector(δ)
 3: t ← 0
 4: repeat
 5:     t ← t + 1
 6:     Φ̄ ← Bernoulli(p · λ^t)
 7:     for φ_i ∈ Φ̄ do
 8:         u_i ← −(1/η) · w_i
 9:     a ← Σ_{φ_i ∈ Φ̄} (u_i)²
10: until a < 1
11: c ← √( (1 − a) / Σ_{φ_i ∈ Φ ∧ φ_i ∉ Φ̄} (u_i)² )
12: for φ_i ∈ Φ ∧ φ_i ∉ Φ̄ do
13:     u_i ← c · u_i
14: return u, Φ̄
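The candidate generation of Algorithm 4 can be sketched in Python as follows. This is a minimal sketch under several assumptions that are not stated in the listing: `sample_unit_vector` is taken to draw a uniformly random direction of the same dimension as w (the role of δ is not modelled), each function is included in Φ̄ independently with probability p · λ^t where 0 < λ < 1, the originally sampled direction is restored before every retry, and resampling stops once a < 1 and the remaining components are not all zero, which is what the square root in the scaling factor c requires.

```python
import math
import random


def sample_unit_vector(n, rng):
    """Hypothetical helper: random direction with unit L2 norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def generate_candidate(w, eta, p, lam, rng):
    """Return a unit vector u and the index set whose weights w + eta * u zeroes."""
    n = len(w)
    base = sample_unit_vector(n, rng)
    t = 0
    while True:
        t += 1
        u = list(base)  # assumption: start every retry from the sampled direction
        prob = p * lam ** t  # the inclusion probability decays with every retry
        selected = {i for i in range(n) if rng.random() < prob}
        for i in selected:
            u[i] = -w[i] / eta  # an update of size eta then cancels w_i
        a = sum(u[i] ** 2 for i in selected)
        rest = sum(u[i] ** 2 for i in range(n) if i not in selected)
        if a < 1.0 and rest > 0.0:
            break
    # Rescale the remaining components so that u has unit norm again.
    c = math.sqrt((1.0 - a) / rest)
    for i in range(n):
        if i not in selected:
            u[i] *= c
    return u, selected


rng = random.Random(0)
w = [0.3, -0.2, 0.5, 0.1]
eta = 1.0
u, selected = generate_candidate(w, eta, p=0.5, lam=0.9, rng=rng)
updated = [w_i + eta * u_i for w_i, u_i in zip(w, u)]
```

If the candidate wins and the model is updated to w + η · u, every weight whose index is in `selected` becomes zero, while u keeps unit length so that it remains comparable to candidates sampled in the usual way.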

4.3.2 Sampling Candidates from Unit Subspace Circles

In contrast with the previous method, the second alteration aims to guide Multileave Gradient Descent by changing the candidate sampling method. Sampling is altered to generate candidates which result in some zero weights if the model is updated towards them. The first step when generating a candidate is sampling a set of functions Φ̄ from a Bernoulli distribution over Φ. Subsequently, candidates can be generated so that if the candidate wins, the resulting model will have zero