
Learning to Rank Using Simulated Clicks

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Nout van Deijck
11335262

Master Information Studies, Human-Centered Multimedia
Faculty of Science
University of Amsterdam

September, 2017

1st Supervisor: Dr. Ilya Markov
2nd Supervisor: Artem Grotov MSc


Learning to Rank Using Simulated Clicks

Nout van Deijck

Information Studies, University of Amsterdam

+31(0)6 24567823 nout@noutweb.com

ABSTRACT

User click data is an implicit measure of document relevance. Learning to Rank (LTR) aims to optimize rankings of documents for queries. Previous research showed that click models can be used to simulate user clicks. In this paper, the application of click simulation with three click models to the task of LTR is studied, using a click log dataset from a major search engine. Real-world clicks, random clicks and clicks simulated with the three models are used to calculate features for a pairwise LTR algorithm, LambdaMART. Results measured with NDCG and MAP show that simulated clicks from trained click models perform worse than real-world clicks, but considerably better than random clicks. The relevance, process and conclusions of these experiments, as well as limitations and suggestions for further research, are described in detail below.

Keywords

Learning to Rank (LTR), Click Models, Click Simulation

1. INTRODUCTION

Ranking can be seen as one of the most important aspects of Information Retrieval (IR) systems (search engines), as the growing amount of information continues to make it harder for users to find the information they need. For example, the estimated number of web pages on the world wide web has doubled from 25 billion in October 2008 to almost 50 billion as of August 2017¹ [22].

Within the discipline of Information Retrieval, a growing field deals with Learning to Rank (LTR), which has the objective of developing and improving search engine rankers. Using training and test data, LTR tries to learn from features, usually assigned to query-document pairs. A model is constructed to rank the documents for a particular query, sorted by relevance (or another measure, such as preference or importance). Multiple types of features can be used to train the learning model; a common distinction is between content-based, link-based and user-based features. User-based data in the form of a click log is used in this research.

Since natural user interaction data is noisy [17] and LTR algorithms need lots of training data, large click logs are often gathered before performing LTR. To speed up this process, one would need to use less user data while achieving comparable LTR results, so that the eventual search rankers can be improved just as well. One way to shorten the process of LTR is to use historical interaction data, which has proven successful especially when the feedback was noisy [5]. If not much data is present, another way to generate more, possibly usable, training data is through user simulation. Simulating user behaviour and user interactions already plays a role within Information Retrieval [24] and can be essential both for academic purposes (when access to large-scale logs is limited) and for commercial systems, where experiments with real users cannot be scaled infinitely [24]. Because a search engine click log is the primary training data in this experiment, we simulate clicks.

¹ Using http://www.worldwidewebsize.com/

² https://academy.yandex.ru/events/data_analysis/relpred2011 (2011)

In order to simulate clicks, click models are used. The performance of different click models for click simulation has been researched recently [24]. However, using these simulations directly as training data in an LTR experiment has not been clearly researched before. The purpose of this paper is to compare the performance of different click models for click simulation in a real-world LTR experiment. This can be seen as an application of click simulation.

In order to carry out this research, the following research questions are used:

1. “How does LTR with simulated clicks compare to LTR with real world clicks?”

2. “How do different click models compare, when simulating clicks for an LTR experiment?”

The experiments outlined in this paper show that LTR with simulated clicks yields considerably lower results than LTR with real-world clicks, as one would expect, but also performs reasonably above LTR with randomly generated clicks.

To carry out this research, data from the Yandex Relevance Prediction challenge is used². First, the winning papers of that challenge [11, 18] are used to replicate an LTR algorithm for this specific task. After that, click simulation is applied using the click data from this challenge and three different trained click models. Lastly, the same LTR algorithm is used to experiment with the simulated click data resulting from the different click models.

2. RELATED WORK

2.1 Learning to rank

Learning to Rank is a growing field within the sciences of Machine Learning and Information Retrieval. Many different LTR algorithms have been developed, and they perform differently on different tasks.

Traditional machine learning (ML) techniques try to solve a prediction problem on single instances, for example classifying items or predicting values using regression. The goal of Learning to Rank, however, is to obtain an optimal ordering of items: the ranking, the relative ordering of the items, is more important than the individual predicted values.

LTR can be performed in a variety of ways:

• Pointwise LTR. This can be seen as traditional ML, where query-document pairs are feature vectors for which values, in this case relevance labels, are predicted. The model learns to predict labels, and the labels are used for ordering by relevance.

• Pairwise LTR. This approach corresponds to a binary classification task, where pairs of documents are compared and the model learns which one is more relevant than the other (see the sketch after this list). More of the actual ranking is taken into account, but this still does not optimize the ranking as a whole.

• Listwise LTR. A listwise LTR approach tries to optimize a given ranking evaluation measure, averaged over all queries. This satisfies the need of taking the whole ranking into account, but also brings a high computational complexity.
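To make the pairwise approach concrete, the following minimal Python sketch (purely illustrative; the toy document ids and labels are assumptions, not data from this research) converts pointwise relevance labels for one query into the pairwise preference instances a pairwise learner trains on:

from itertools import combinations

# Toy pointwise data for one query: (document id, relevance label).
# The ids and labels below are made up for illustration.
docs = [("d1", 1), ("d2", 0), ("d3", 1), ("d4", 0)]

def to_pairwise(docs):
    """Yield (preferred_doc, other_doc) pairs derived from pointwise labels."""
    pairs = []
    for (doc_a, rel_a), (doc_b, rel_b) in combinations(docs, 2):
        if rel_a > rel_b:
            pairs.append((doc_a, doc_b))   # doc_a should rank above doc_b
        elif rel_b > rel_a:
            pairs.append((doc_b, doc_a))   # doc_b should rank above doc_a
        # equal labels express no preference and are skipped
    return pairs

print(to_pairwise(docs))
# [('d1', 'd2'), ('d1', 'd4'), ('d3', 'd2'), ('d3', 'd4')]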

As training features, different kinds of data can be used. (i) Content-wise, a document can be analyzed given a query, to see if terms or semantics match. (ii) Search engines such as Google are famous for their PageRank [26] algorithm, which traverses the web of links to find out which documents are most popular. (iii) Another way to gather information about a query-document pair is to measure user behaviour. This user data ranges from (explicit) relevance judgments to (implicit) click data and eye-tracking measurements. When improving search engine rankings, implicit user data is often gathered, because it is low-cost, non-invasive and seen as a measure of genuine user-perceived document relevance.

An important signal of users' interest in documents is clicks. This can, for example, be translated into a Click-Through Rate (CTR) of a query-document pair. Clicks are not only used to improve the quality of search ranking [9, 19] but also to evaluate the quality of search [4, 16].

Using click data as input for Information Retrieval techniques is considered standard practice by now: it is used not only for LTR tasks, but also for query suggestion [8, 25], evaluation of search engines [21, 28], and understanding user behaviour and preference [20, 27].

Specific use of click data for LTR has been carried out by predicting relevance labels [33] and by generating pairwise ranking data [1].

Using search engine click logs, LambdaMART [3] proved to be the winning algorithm in the Yahoo LTR challenge [30], and the two best-performing papers for the Yandex Relevance Prediction challenge, whose data is also used in this research, marked LambdaMART as the best algorithm for their LTR task [11, 18]. This is why LambdaMART is the LTR algorithm chosen to carry out the experiments in this research.

2.2 Simulation and click models

2.2.1 Click simulation

According to the book [6] and paper [24], there are different ways of using click models for simulating user clicks:

1. Data expansion: simulating users in the same sessions, when we need to extrapolate clicks for the same SERPs as we have in our training set.

2. Simulating users in new sessions, when we want to test new SERPs, i.e. new rankings that we do not want to show to users yet.

3. Simulating users on LETOR datasets [22]: applying click models to completely new data sets to simulate user behaviour.

As described in section 2.1, clicks are considered of crucial importance in measuring user interactions, so when deciding to simulate user data in LTR experiments, click simulation is a logical goal. Click simulation has already been used successfully in interleaving experiments [7, 15], and [32] shows that click models are suitable for simulation, using the second approach: simulating users in new sessions.

The use of simulated click data for LTR has not been researched before, but compensating for noisy feedback with historical user interactions proved successful [17]. Still, this gives no precedent for any outcome when using simulated interaction (click) data for LTR.


2.2.2 Click models

Using click models, clicks can be simulated, but when many search sessions have no clicks, or only a click on the first result, click models perform poorly at simulation, as shown in [24]. The authors suggest, in addition to directly evaluating click simulators, to further explore the topic by applying click simulation, for example by creating rankers using simulated clicks, thereby researching the usefulness of simulated clicks. That is exactly what is carried out in this research paper.

In [7] only one click model, created specifically for that purpose, was used; the performance of different click models for simulation (and its application) was not looked into.

3. METHODS

3.1 Features

As described in section 1, the Yandex challenge was used as example data and task, and the winning papers were examined for the winning LTR algorithm. For features, too, the winning paper [18] was taken as an example. Query-URL specific features were calculated, resulting in a shorter list of features per pair than in that paper, because of constraints on computational power and time.

The following features were calculated for every query-URL pair q-u:

Table 1: Features used for training LTR

Impressions: Number of times u is shown in the list of results for q (called Search Engine Results Page, SERP)
Clicks: Number of times u is clicked on SERPs of q
CTR: Click-Through Rate, i.e. clicks of q-u divided by impressions of q-u
First CTR: CTR on u when it is the first clicked result on a SERP of q
Last CTR: CTR on u when it is the last clicked result on a SERP of q
Only CTR: CTR on u when it is the only clicked result on a SERP of q

Average dwell time was not used as a feature, although it is query-URL specific, because timestamps are needed to calculate dwell times on URLs. Timestamps are not generated in the process of simulating clicks, so this feature could not be calculated for simulated click data. When comparing results between real-world clicks and simulated clicks, the calculated features should be the same among all sets.
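As a concrete sketch of how the features in Table 1 can be derived from a click log, the following Python snippet computes them with plain dictionaries. The session format (query id, list of shown URLs, list of clicked URLs in click order) and the normalization of First/Last/Only CTR by impressions are simplifying assumptions, not the exact log format or definitions used in the original code:

from collections import defaultdict

def compute_features(sessions):
    """Compute Table 1 style features per (query, url) pair.

    `sessions` is assumed to be an iterable of (query_id, shown_urls, clicked_urls)
    tuples, where clicked_urls preserves click order; this is a simplification
    of the real click-log format.
    """
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    first_clicks = defaultdict(int)
    last_clicks = defaultdict(int)
    only_clicks = defaultdict(int)

    for query, shown, clicked in sessions:
        for url in shown:
            impressions[(query, url)] += 1
        for url in clicked:
            clicks[(query, url)] += 1
        if clicked:
            first_clicks[(query, clicked[0])] += 1
            last_clicks[(query, clicked[-1])] += 1
            if len(clicked) == 1:
                only_clicks[(query, clicked[0])] += 1

    features = {}
    for pair, imp in impressions.items():
        features[pair] = {
            "impressions": imp,
            "clicks": clicks[pair],
            "ctr": clicks[pair] / imp,
            # Normalizing by impressions here is an assumption.
            "first_ctr": first_clicks[pair] / imp,
            "last_ctr": last_clicks[pair] / imp,
            "only_ctr": only_clicks[pair] / imp,
        }
    return features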

3.2 Click Simulation

For click simulation, code produced by [24] was used, based on the pseudo-code shown in Table 2. The input consists of a query session and one of the trained click models. The same-session click simulation, as a form of data expansion, is used, as described in section 2.2.1.

Table 2: Click simulation pseudo-code (source: [24])

Algorithm 1: Simulating user clicks for a query session.
Input: click model M, query session s
Output: vector of simulated clicks (c_1, ..., c_n)

1: for r ← 1 to |s| do
2:     Compute p = P(C_r = 1 | C_1 = c_1, ..., C_{r-1} = c_{r-1}) using previous clicks c_1, ..., c_{r-1} and the parameters of model M
3:     Generate random value c_r from Bernoulli(p)
4: end for
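A minimal Python rendering of Algorithm 1 could look as follows; the `click_prob` method and the `session.results` attribute are assumed interfaces standing in for the trained model's click probability and the ranked result list, and the random baseline used later in the experiments is included for comparison:

import random

def simulate_clicks(model, session, rng=None):
    """Simulate a click vector for one query session (Algorithm 1, Table 2).

    `model.click_prob(session, rank, previous_clicks)` is an assumed interface:
    it should return P(C_r = 1 | C_1 = c_1, ..., C_{r-1} = c_{r-1}) from the
    trained model parameters. `session.results` is likewise an assumption
    about the session object.
    """
    rng = rng or random.Random()
    clicks = []
    for rank in range(len(session.results)):
        p = model.click_prob(session, rank, clicks)
        clicks.append(1 if rng.random() < p else 0)  # draw c_r ~ Bernoulli(p)
    return clicks

def simulate_random_clicks(session, rng=None):
    """Baseline used in the experiments: each result clicked with probability 0.5."""
    rng = rng or random.Random()
    return [1 if rng.random() < 0.5 else 0 for _ in session.results]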

3.3 LTR

3.3.1 LambdaMART

As described in section 2, LambdaMART is the LTR algorithm used in this research. It is a pairwise LTR algorithm that, at a high level, uses gradient boosting to directly optimize LTR-specific cost functions such as NDCG. It is outlined in detail in [3].

An implementation based on the pyltr (Python LTR) open-source framework [23] was used, which is in turn directly based on the original paper [3].

3.3.2 LTR Experiments

Firstly, different distributions of the data used for click simulation, training LTR and testing LTR were tried, in order to see whether a click model trained on more data would generate better click simulations. For comparison, LambdaMART was trained with feature pairs based on real-world clicks, taken from the previously described publicly available data set, and test results were calculated using pairs and their relevance labels from the same data set. Afterwards, LambdaMART was also trained with simulated clicks: starting with random clicks (a 0.5 click probability for every result on a SERP) to obtain a baseline, and continuing with each of the three click models to simulate new clicks, thereby generating a new data set for which features could be calculated. This process provided new training sets with which LambdaMART was trained, after which it was evaluated on the same test sets that were used for the real-world clicks, making comparison possible.
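The comparison itself can be set up roughly as below with pyltr; the feature matrices are assumed to hold the Table 1 features computed from one click source (real, random, or simulated with UBM, DBN or CCM), grouped by query id as pyltr expects, and the hyperparameters are illustrative rather than the ones used in these experiments:

import pyltr

def evaluate_click_source(train_X, train_y, train_qids, test_X, test_y, test_qids):
    """Train LambdaMART on features from one click source and score it on the
    shared test set. Arrays are assumed to be grouped by query id, as pyltr expects.
    """
    metric = pyltr.metrics.NDCG(k=10)
    model = pyltr.models.LambdaMART(
        metric=metric,
        n_estimators=200,      # illustrative hyperparameters, not the paper's
        learning_rate=0.02,
        max_leaf_nodes=10,
        min_samples_leaf=16,
    )
    model.fit(train_X, train_y, train_qids)
    pred = model.predict(test_X)
    return metric.calc_mean(test_qids, test_y, pred)

# One call per click source (real, random, UBM, DBN, CCM), always scoring
# against the same held-out test pairs so that the NDCG@10 values are comparable.
# MAP was reported alongside NDCG@10 and can be computed in the same manner
# with an average-precision metric.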

4. EXPERIMENTAL SETUP

4.1 Data set

As described earlier, the Yandex Relevance Prediction Challenge is used as the LTR task. The data comprises user click data for 43,977,859 sessions in the Yandex search engine. For privacy reasons, the data is fully anonymized, and only identifiers of queries and URLs are used. Sessions contain query actions with query identifiers, timestamps and a list of returned URLs, and click actions with the corresponding time and the document id that was clicked. The result is a click log that does not contain any content-based or link-based data.

Since we want to test the performance of an LTR algorithm, only queries with relevance labels for each returned document, and thus for each resulting query-document pair, were selected. Due to limited computational power, only the first 1,000,000 records were considered, resulting in 44,870 query-document pairs with relevance labels. Relevance labels were set to 1 or 0, indicating relevance or non-relevance, and were assigned by a group of Yandex judges.

Splitting into training and test data required three sets: one for training the click model and simulating clicks, one for training LTR and one for testing LTR. Multiple different division percentages were used. These, and the number of sessions in each set, are shown in Table 3.

Sessions were distributed over the different sets in such a way that the same queries showed up in each set, making comparison possible. Because of the lack of computational power, and because only a small fraction of the large original data set was judged with relevance labels, only 24 unique queries occurred within the data used.

Table 3: Data set distribution with numbers of sessions

CM / train / test distribution    Number of sessions
                                  Click Model Train   LTR Train   LTR Test   Total
40% / 40% / 20%                   1601                1601        800        4002
35% / 35% / 30%                   1401                1401        1200       4002
50% / 25% / 25%                   2001                1001        1000       4002
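The per-query split described above can be sketched as follows; splitting each query's sessions according to the chosen percentages guarantees that every query occurs in all three sets (the session representation is again an assumed simplification):

from collections import defaultdict

def split_sessions(sessions, fractions=(0.4, 0.4, 0.2)):
    """Split sessions into (click-model, LTR-train, LTR-test) sets per query,
    so that the same queries occur in every set. `sessions` is assumed to be a
    list of objects with a `query` attribute.
    """
    by_query = defaultdict(list)
    for s in sessions:
        by_query[s.query].append(s)

    cm_set, train_set, test_set = [], [], []
    for query_sessions in by_query.values():
        n = len(query_sessions)
        cut1 = int(n * fractions[0])
        cut2 = cut1 + int(n * fractions[1])
        cm_set.extend(query_sessions[:cut1])
        train_set.extend(query_sessions[cut1:cut2])
        test_set.extend(query_sessions[cut2:])
    return cm_set, train_set, test_set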

4.2 Evaluation metrics

The evaluation metrics used in these experiments are standard IR measures, typically used in LTR experiments [12]: Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP).
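For reference, one common formulation of these measures is given below (the exact gain function for DCG varies in the literature; with the binary labels used here the common variants coincide):

DCG@k(q) = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)},
\qquad
NDCG@k(q) = \frac{DCG@k(q)}{IDCG@k(q)}

AP(q) = \frac{1}{|R_q|} \sum_{i=1}^{n_q} \mathrm{rel}_i \cdot P@i(q),
\qquad
MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q)

where rel_i ∈ {0, 1} is the relevance label of the document at rank i for query q, IDCG@k is the DCG@k of the ideal ordering, R_q is the set of relevant documents for q, P@i is precision at rank i, n_q is the number of returned documents, and Q is the set of test queries.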

4.3 Click models

Three different click models were trained and passed on to the click simulation. These are widely used click models, previously tested with click simulation [24]: the User Browsing Model (UBM [10]), the Dynamic Bayesian Network model (DBN [5]) and the Click Chain Model (CCM [13]).

5. RESULTS

In this section, results are shown for the NDCG and MAP performance of LambdaMART on different train and test sets. First, a 40% / 40% / 20% distribution was taken, to have sufficient data to train the click models on, while also having more training data than test data for Learning to Rank.

Table 4: LTR Results of experiment 1

(with 40%-40%-20% distribution of click model, train, test data)

               NDCG@10   MAP
Real clicks    0.7248    0.5581
Random clicks  0.4857    0.3111
UBM            0.6327    0.4327
DBN            0.6087    0.3207
CCM            0.6857    0.4328

As can be seen in Table 4, the Click Chain Model appears to be the best-performing click model when using simulated clicks for this LTR task. It performs better than the randomly simulated clicks and worse than the real-world clicks, as expected. Interesting and important to note is that all three click models used for click simulation perform better than random, but still worse than real-world clicks, on both NDCG and MAP.

The second experiment (Table 5) involved another data set distribution, with somewhat less data for training click models and training LTR, but more for testing LTR. Again, of the click models, CCM performs best in terms of NDCG, but UBM has the best MAP. Again, all click models and their simulated clicks perform better than random clicks (for NDCG, not for MAP), but worse than real-world clicks.

Table 5: LTR Results of experiment 2

(with 35%-35%-30% distribution of click model, train, test data)

               NDCG@10   MAP
Real clicks    0.7818    0.3303
Random clicks  0.4471    0.2873
UBM            0.7071    0.3004
DBN            0.6840    0.2655
CCM            0.7251    0.2825

A third experiment was run (Table 6), using half of the data to train the click models, and dividing the other half into two quarters to train and test LTR. Evaluation measure values were obtained, but the NDCG metric for real-world clicks yielded a value that can be seen as an error (see Table 6). Therefore, for now, the results of this experiment are not discussed further, as comparison is not possible. Rerunning the experiment was not possible in time before completing this research.

Table 6: LTR Results of experiment 3

(with 50%-25%-25% distribution of click model, train, test data)

               NDCG@10      MAP
Real clicks    1.1839e+14   0.3919
Random clicks  0.6020       0.3169
UBM            0.7020       0.3292
DBN            0.6695       0.3016
CCM            0.7020       0.4246

6. DISCUSSION

Perhaps the most important result of these LTR experiments is that, in the two cases where the experiment succeeded (experiments 1 and 2), all click models used to simulate clicks performed better than randomly generated clicks. Moreover, the margin by which they beat random clicks was larger than the margin by which they fell below real-world clicks. Even with only a handful of query-document features based on the anonymized click log, the Learning to Rank results using real-world clicks are already considerably high when compared to the original best performers on the Yandex Relevance Prediction task [11, 18]. This may partly be because clicks are a very good implicit measure of user satisfaction: attributed relevance appears to be reflected well in clicks, and in CTR in particular, as was described in section 2.1 [20, 27, 33].

As is seen in [12], CCM is the best model for ranking documents in that research, and in terms of NDCG it also performs very well in the experiments carried out here. Still, as the authors also point out, their results indicate that different click models can perform better for some tasks while others are better for other tasks. As click simulation for LTR is a different task than in the related work described earlier, it is difficult to compare the performance of specific click models with any precedent. What we can say is that this research again highlights the strong performance of these three click models.

7. CONCLUSION

In this paper, the application of click simulation was investigated by applying the concept to the task of Learning to Rank. Click simulation previously proved to work successfully in interleaving experiments [7, 15], and [32] showed that click models are suitable for simulation. It was suggested specifically by [24] to investigate the application of click simulation in a task such as LTR, thereby evaluating the usefulness of simulated clicks. This highlights the scientific relevance of carrying out such research. In this paper, multiple experiments were run, using real-world click data from the Yandex Relevance Prediction challenge click log, and training and testing an LTR algorithm (LambdaMART), following the example of the winning teams of the above-mentioned challenge in terms of the features used. The same click log was used to simulate new clicks, using different click models (UBM, DBN and CCM). Random clicks were generated as well, and all five resulting click sets were used for feature calculation and then for the LTR experiment. The first two of the three experiments yielded results that could be interpreted and were considered successful. All simulated clicks based on the click models performed notably better than random clicks, while still performing below the LTR results using real clicks, as one would expect. These experiments underline the value of applying click simulation in real-world settings, as it provides promising results.

In terms of research questions (see Introduction):

1. “How does LTR with simulated clicks compare to LTR with real world clicks?”

LTR with simulated clicks performs worse than LTR with real clicks, as one could expect, but considerably better than LTR with random clicks. As mentioned above, the results indicate the relevance of applying click models and click simulation, which is worth looking into further and in greater detail.

2. “How do different click models compare, when simulating clicks for an LTR experiment?”

When considering the NDCG@10 metric, CCM consistently seems to be the best-performing click model when applied to click simulation for LTR. In terms of MAP, no single best model can be identified.

8. LIMITATIONS

Some limitations of this research should be considered. Relevance labels of query-document pairs in the data set used, as in other research involving evaluation and training of ranking functions, are almost always assigned by human judges. Previous research looked into agreement between judgments and the resulting variance in evaluation [2, 14, 31]. Possible disagreement and variance among different judges can influence results that rely on relevance labels, but Voorhees showed for judges from the same or different organizations on TREC-4 and TREC-6 topics that, although remarkable disagreement exists among human judgments, the relative ordering of rankers in terms of performance obtained with the judgment data is actually quite stable [31]. Still, it is important to take this human variance and influence into account.


Another limitation of this research is the small amount of data, which could be one of the explanations for the failure of experiment 3, although this is not certain. The number of query-document pairs (almost 50 thousand) is not that low, but the number of unique queries (24) appearing in the data is very low. Because of the constraint on computational time in this research, a rerun of the experiments with far larger numbers was not possible.

9. FUTURE WORK

Future continuation, reproduction, or variation of the research outlined in this paper should focus first on data. As described in section 8, a far higher number of unique queries should be used, to yield results that allow more grounded conclusions. Another aspect that future research could look into is the diversity of data: it would be interesting to see how click simulation with different click models performs when applied to different LTR tasks.

Apart from data, it is suggested to also vary the type of click models used when simulating clicks for LTR. As [24] also note, recently proposed neural click models [29] could be used, to investigate their application to LTR.

Lastly, different LTR algorithms should be tried when performing the same kind of comparison of real-world user data versus simulated data, to get a more complete picture of the LTR application.

10. ACKNOWLEDGEMENTS

This research could not have been carried out without the support of Ilya Markov, my main supervisor. Special thanks are in order because, when sudden unforeseen changes appeared at the start of the thesis process, Ilya suggested this project and made it possible to switch to a new, challenging thesis subject. I would also like to thank Artem Grotov for being so kind as to be the second reader of this thesis; a fresh viewpoint is something that should be valued greatly in research. Thanks to Jerry Ma for granting permission to use the pyltr framework, and to Yandex for making the data set available.

11. REFERENCES

[1] R. Agrawal, A. Halverson, K. Kenthapadi, and N. Mishra. 2009. Generating Labels from Clicks. In Proceedings of the Second ACM International Conference on Web Search and Data Mining. 172–181.

[2] Peter Bailey, Nick Craswell, and Paul Thomas. 2008. Relevance Assessment: Are Judges Exchangeable and Does it Matter? In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. DOI: http://doi.org/10.1145/1390334.1390447

[3] Christopher J.C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An Overview.

[4] Olivier Chapelle et al. 2009. Expected Reciprocal Rank for Graded Relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM.

[5] Olivier Chapelle and Y. Zhang. 2009. A Dynamic Bayesian Network Click Model for Web Search Ranking. In WWW '09. 1–10.

[6] Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. 2015. Click Models for Web Search. Morgan & Claypool.

[7] Aleksandr Chuklin, Pavel Serdyukov, and Anne Schuth. 2013. Evaluating Aggregated Search Using Interleaving. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 669–678.

[8] Nick Craswell and Martin Szummer. 2007. Random Walks on the Click Graph. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 239–246.

[9] G. Dupret and C. Liao. 2010. A Model to Estimate Intrinsic Document Relevance from the Clickthrough Logs of a Web Search Engine. In Proceedings of the ACM International Conference on Web Search and Data Mining. ACM, 181–190.

[10] Georges Dupret and Benjamin Piwowarski. 2008. A User Browsing Model to Predict Search Engine Click Data from Past Observations. In SIGIR '08. DOI: http://doi.org/10.1145/1390334.1390392

[11] Michael Figurnov and Alexander Kirillov. 2012. Linear Combination of Random Forests for the Relevance Prediction Challenge. In WSCD '12.

[12] Artem Grotov, Aleksandr Chuklin, and Ilya Markov. 2015. A Comparative Study of Click Models for Web Search. In International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham, 78–90.

[13] Fan Guo et al. 2009. Click Chain Model in Web Search. In WSDM '09. 11–20.

[14] Stephen P. Harter. 1996. Variations in Relevance Assessments and the Measurement of Retrieval Effectiveness. JASIS 47, 1 (1996), 37–49.

[15] Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2011. A Probabilistic Method for Inferring Preferences from Clicks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 249–258.

[16] Katja Hofmann, Lihong Li, and Filip Radlinski. 2016. Online Evaluation for Information Retrieval. Foundations and Trends in Information Retrieval (2016).

[17] Katja Hofmann and Anne Schuth. 2013. Reusing Historical Interaction Data for Faster Online Learning to Rank for IR. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM.

[18] Botao Hu, Nathan Liu, and Weizhu Chen. 2012. Learning from Click Model and Latent Factor Model for Relevance Prediction Challenge. In WSCD '12.

[19] T. Joachims. 2002. Optimizing Search Engines Using Clickthrough Data. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD). ACM.

[20] Thorsten Joachims, Laura Granka, Bing Pan, and Helene Hembrooke. 2005. Accurately Interpreting Clickthrough Data as Implicit Feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 154–161.

[21] Marijn Koolen, Andrew Trotman, and Jaap Kamps. 2009. Comparative Analysis of Clicks and Judgments for IR Evaluation. In Proceedings of the 2009 Workshop on Web Search Click Data. ACM. DOI: http://doi.org/10.1145/1507509.1507522

[22] Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Now Publishers Inc.

[23] Jerry Ma. 2015. pyltr, a Python learning-to-rank toolkit. Retrieved June 29, 2017 from https://github.com/jma127/pyltr

[24] S. Malkevich, I. Markov, E. Michailova, and M. de Rijke. 2017. Evaluating and Analyzing Click Simulations in Web Search. In ICTIR '17.

[25] Qiaozhu Mei and Kenneth Church. 2008. Query Suggestion Using Hitting Time. In Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM.

[26] L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web.

[27] Feng Qiu and Junghoo Cho. 2006. Automatic Identification of User Interest for Personalized Search. In Proceedings of the 15th International Conference on World Wide Web. ACM.

[28] Filip Radlinski and Thorsten Joachims. 2008. How Does Clickthrough Data Reflect Retrieval Quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM, 43–52.

[29] Alexey Borisov, Ilya Markov, Maarten de Rijke, and Pavel Serdyukov. 2016. A Neural Click Model for Web Search. In Proceedings of the 25th International Conference on World Wide Web. DOI: http://doi.org/10.1145/2872427.2883033

[30] Krysta M. Svore and Paul N. Bennett. 2011. Learning to Rank Using an Ensemble of Lambda-Gradient Models. In Proceedings of the Learning to Rank Challenge. 25–35.

[31] Ellen M. Voorhees. 2000. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. Information Processing & Management 36 (2000).

[32] Qianli Xing, Yiqun Liu, Jian-Yun Nie, Min Zhang, Shaoping Ma, and Kuo Zhang. 2013. Incorporating User Preferences into Click Models. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM.

[33] J. Xu, C. Chen, G. Xu, H. Li, and E.R.T. Abib. 2010. Improving Quality of Training Data for Learning to Rank Using Click-Through Data. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM.
