
Generating Diverse Query Suggestions Using Co-citation Graphs in Scientific Literature Search

Robbert Kauffman

University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands

robbertkauffman@gmail.com

ABSTRACT

A novel approach for generating diverse query suggestions for use in scientific literature search is proposed. By exploiting the co-citation graph, meta-data present in scientific literature is employed for generating candidate query suggestions. A co-citation graph is constructed using pseudo-relevance feedback. This graph is clustered using the Louvain method. Labels are generated for the graph by extracting the noun-phrases of the titles of the documents in the graph. These labels are scored using tf-idf and the best scoring labels act as candidates for query suggestions. Two variants of the system are created and evaluated: a clustered (CCS) and a non-clustered co-citation suggestor (NCS). Only the NCS, after filtering, scores higher than average on relevance. Both systems score higher than average on diversity. No significant difference was found between the two systems on diversity. The poor performance of the CCS on relevance is likely caused by poorly formed clusters and poor query suggestion candidate selection. In conclusion, the NCS is capable of generating relevant and diverse query suggestions. Future study is needed to evaluate the system in an exploratory search context.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Query formulation; G.2.2 [Graph Theory]: Graph algorithms

General Terms

Experimentation, measurement

Keywords

Query suggestion, graphs, community detection, co-citation


1. INTRODUCTION

Although query suggestions are commonplace for general web search tools like Google or Bing, they are not seen often in literature search environments. A recent study by [17] acknowledges this finding. Only two examples in popular academic search environments are found. PubMed suggests queries to its users, but is domain-specific. Microsoft Academic Search is equipped with query expansion/auto-completion, but the thesaurus is limited to broad and popular queries. Besides these examples, few other query suggestion tools for general literature search exist. A possible explanation could be that existing query suggestion systems are unsuited for the domain of scientific literature. However, this assumption has not been confirmed.

A plethora of content- or log-based query suggestion systems can be found in the literature. None of these systems make use of the meta-data present in scientific literature, such as citation data. However, outside the realm of query suggestion systems, many tools exist that employ these meta-data. [18] created maps of science for analyzing research fields, by clustering co-citation data and labeling the clusters. [9] clustered co-citation data as well, but generated cluster labels automatically for visual co-citation analysis. Co-citation was devised by [27] as a measure of similarity between scientific papers. Two papers are said to be co-cited, and thus semantically similar, if they are cited at least once by the same paper. A graph can be constructed from these data, called a co-citation graph.

This study proposes a novel approach for generating diverse query suggestions for use in scientific literature search, by exploiting a co-citation graph. The contribution of this paper is twofold. First, a content-based approach for generating diverse query suggestions is presented, where formerly only systems using query logs have optimized for diversification of suggestions. Second, the meta-data present in scientific literature are employed, something which has not been done by any query suggestor up until now. Both a non-clustered and a clustered version of the system are evaluated on relevance and diversity, to answer the following research question: Can labels generated from the co-citation graph function as relevant and diverse query suggestions? An experiment is conducted to compare query suggestions of both systems. Although both systems score very high on diversity, only the non-clustered system scores high on relevance. Therefore, the non-clustered system is a suitable approach for generating query suggestions.


2. RELATED WORK

Literature search is a typical example of an exploratory search task. According to [32], exploratory search describes “an information-seeking problem context that is open-ended, persistent, and multifaceted, and information-seeking processes that are opportunistic, iterative, and multitactical”. Exploration is also one of the six stages in Kuhlthau’s model of the information search process [20]. In the exploration stage, the user is investigating a topic or domain on a general level to increase his or her understanding. According to [20], the user has an inability to express what information is needed, which makes formulating queries and judging relevance of the search results more difficult. [29] shows this in a real-world setting: users either do not exactly know what they are looking for; do not know how to describe what they are looking for; or try to lessen the cognitive burden of finding information by orienteering [23], instead of specifying queries.

Based on the theories on exploratory search, one focus area for improving literature search is aiding the query formulation process. This is supported by the findings of [13], which reports that users are more likely to consult a thesaurus for query term suggestion when they are not familiar with the topic. Various tools aimed at supporting the query formulation process already exist, where query suggestion is the most common approach [12]. Suggestions can come in the form of correction, refinement, relaxation, expansion or even complete substitution. A complete overview of tools available for query formulation, as well as an outline of the various types of query suggestion tools, is given in [12] and [32].

Not all types of query suggestion systems are useful in an exploratory search context. [16] showed that users prefer query suggestions over term suggestions and make more use of suggestions for difficult topics or topics that they have little understanding of. [31] found that query expansion can lead to query drift in unfamiliar domains, so one could argue that in these settings less specific queries would perform better. Further research is needed to confirm this. Finally, diversifying suggestions can increase the rate of useful suggestions, especially if the user intent is not clear or ambiguous [1].

Query suggestions can be generated from content or from user query logs. The majority of recent work focuses on using query logs to generate query suggestions. Either the query session is analyzed to look for query reformulations [2], or clickthrough data are analyzed. For the latter, a graph is constructed using queries and URLs. Typical graphs that are used are the query-URL bipartite graph [22, 21] or the query-flow graph [5]. A proposal by [28] analyzes the query-URL bipartite graph in order to generate explicit intentional queries. Although determining intent can be pertinent to exploratory search, it still stays within a single context and thus will not help users with ill-defined information needs. [21] uses the same graph, but diversifies the suggestions using random walks. Finally, [15] generates query substitutions, where candidates replace the entire query, in contrast with the query expansion or refinement used in the previously cited works.

Query logs have certain limitations. First of all, query logs are noisy [11] and thus require data preparation or pruning. Secondly, clickthrough data are based on implicit feedback, which can be inaccurate for relevance assessment [14].

Thirdly, large query logs are required in order for the systems to be effective. Logs can be sparse, especially for queries in the long tail. However, this can be mitigated by using a hybrid approach, as done by [34]. Finally, it remains to be seen whether or not query logs are useful in domain-specific contexts like literature search. None of these query suggestion systems have been evaluated in this context.

Earlier work in the field of query suggestions uses a content-based approach, by analyzing titles, abstracts, anchor texts or full texts of documents in the corpus. [30] uses the statistical co-occurrence of missing terms to generate candidates for query expansion. [33] analyzes the top-ranked results to generate refinements. A similar approach by [31] also uses pseudo-relevance feedback to generate expansion terms from the titles and abstracts, but uses co-occurrence and term frequency. More recent work uses Latent Dirichlet Allocation for suggesting topic-based query terms [10]. Another approach generates phrasal query suggestions by extracting noun-phrases of titles in the corpus [17].

The majority of query suggestion systems perform automatic evaluation using information retrieval metrics like precision and recall, by using annotated datasets such as TREC [33, 3] or by using machine-learned classifiers [15]. Manual evaluation is performed on relevance [21, 7, 25] or on similar constructs like meaningfulness [3] and usefulness [6]. Diversity is also a common metric for evaluation [21, 7]. A framework for user-centric evaluation, called ResQue [24], is specialized in evaluating exploratory search tasks. It assesses perceived system quality, beliefs, attitudes and behavioural intentions. [31] also evaluated their system in an exploratory search context, but did not use the ResQue framework. According to [32], measures like precision and recall are poor indicators in the context of exploratory search.

3. SYSTEM DESIGN

The system design is inspired by the studies of [9, 18]. They cluster the co-citation graph and label these clusters, either manually [18] or automatically [9], for analysis of scientific literature. This can be translated to a query suggestion system by following multiple steps, which are described in detail below.

The initial query, for which to generate suggestions, acts as the starting point of the system. Through pseudo-relevance feedback, a commonly used approach in query suggestion systems [33, 31], the top results that are returned by the search engine (e.g. by Google Scholar) are used. In pseudo-relevance feedback, it is assumed that the top X results are relevant to the original query and thus can be used for generating query suggestions. For each result in the top-10, the co-citations are retrieved and a co-citation graph is constructed. In this graph, each document is represented by a vertex. Edges are created between vertices for papers that are co-cited together. At this stage, the graph G only contains the 10 initial documents I(G) and the papers that have been co-cited with that set of documents, labeled C(G). Because at this point the graph is still too small for use, we extend the graph by retrieving the co-citations of the earlier retrieved co-citations C(G). Alternatively, the top-100 results could be used instead of the top-10, to increase the size of the graph to a usable size. However, this would put too much emphasis on pseudo-relevance feedback and thus on the performance of the search engine instead of the co-citation data.


So the graph is expanded by obtaining co-citations for the set of co-cited papers C(G), which gives us a new set C2(G). Note that C2(G) ∩ C(G) ∩ I(G) = ∅. This process can be repeated n times, until a satisfactory graph is created. An optimal depth of n = 2 was obtained through experimentation, leading to graphs with approximately 800-1000 vertices. Lowering the depth would result in an unusably small graph, and increasing the depth would result in a graph that is too large (>100k vertices) and loses too much information about the original query in the process.
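To make the construction concrete, a rough Python sketch is given below. It is not the author's implementation: search(query, k) and co_cited_with(doc_id) are hypothetical helpers standing in for the search engine API and the co-citation lookup on the dataset, and networkx is used only for convenience.

```python
import networkx as nx

# A rough sketch of the graph-construction step (not the paper's actual code).
# search(query, k) and co_cited_with(doc_id) are assumed, hypothetical helpers.
def build_co_citation_graph(query, search, co_cited_with, depth=2, top_k=10):
    graph = nx.Graph()
    frontier = set(search(query, top_k))    # I(G): top-k results assumed relevant
    graph.add_nodes_from(frontier)
    for _ in range(depth):                  # expand to C(G), then C2(G)
        next_frontier = set()
        for doc in frontier:
            for other in co_cited_with(doc):
                if other not in graph:
                    next_frontier.add(other)
                graph.add_edge(doc, other)  # edge: the two papers are co-cited
        frontier = next_frontier
    return graph
```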

After constructing the co-citation graph, the graph is clustered by grouping together similar documents, creating clusters of distinct topics or research fields. This potentially reduces noise from documents in related or overlapping research fields. Various algorithms exist for clustering, and specifically for clustering graphs, called community detection. These vary in terms of performance (quality of found clusters and computational speed) and optimization. The optimization criterion determines the features of the clusters that are formed by the algorithm. One such criterion is modularity, which measures the strength of division of a graph into components or communities. Highly modular graphs have densely inter-connected nodes within clusters and limited connections between clusters. This results in clusters whose nodes are relevant to each other within a cluster, but diverse between clusters. Therefore, this criterion is suited for the purpose of this paper. One of the best methods that optimizes modularity is the Louvain method [4]: it is fast and performs well on modularity.
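A minimal sketch of the clustering step is shown below, assuming `graph` is the co-citation graph built above. Recent networkx releases (2.8 and later) ship a Louvain implementation; the python-louvain package offers an equivalent best_partition() function.

```python
import networkx as nx

# A minimal sketch of the clustering step, not the author's code.
def cluster_graph(graph, seed=42):
    communities = nx.community.louvain_communities(graph, seed=seed)
    # Map each vertex to its cluster id so labels can later be scored per cluster.
    return {node: idx for idx, nodes in enumerate(communities) for node in nodes}
```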

After obtaining the clustered co-citation graph, labels have to be generated. These labels act as candidate query suggestions. The labels can be generated by extracting noun-phrases from the titles of documents, as done by [9, 17]. Other data can be used as well, like abstracts, citation contexts or even full text. However, [9, 17] argue that document abstracts are less suited for this task. Citation contexts, on the other hand, offer even higher-quality labels than titles [26], but these have to be included in the dataset. Therefore, titles are used for generating the labels. Noun-phrases can be extracted using the NLP techniques of part-of-speech (PoS) tagging and noun-phrase chunking. Because the performance of various PoS algorithms does not differ much from one to another¹, a readily implemented solution is chosen that is available in the Python library TextBlob². This implementation is based on the Brill tagger [8]. The same holds for the noun-phrase chunking techniques³; the CONLL chunker is used [19], which also comes with the TextBlob library.

Finally, the right noun-phrases have to be selected. A commonly used method from the field of information retrieval is tf-idf, which is also used by [9]. Tf-idf selects the most important terms by counting the frequency of terms in a document and multiplying it by the inverse frequency of terms in the corpus. This way, terms that occur less frequently in the corpus are weighted higher than more common terms. The terms with the highest scores are then selected. Because tf-idf is normally used with full texts and titles are very short in comparison, a slight modification is made to the algorithm: the term frequencies are counted in either the cluster or the whole co-citation graph, while the inverse document frequency is kept the same and counts the frequency of terms across the whole corpus.

1. http://wiki.aclweb.org/index.php?title=POS_Tagging_(State_of_the_art)
2. http://www.clips.ua.ac.be/pages/pattern-en#parser
3. http://wiki.aclweb.org/index.php?title=NP_Chunking_(State_of_the_art)

Step  Description
1.    Perform search query
2.    Retrieve top-10 results
3.    Get co-citations for all top-10 results and build graph
4.    Get co-citations for each vertex in the graph until depth = 2
5.    Cluster the graph using the Louvain method
6.    Extract noun-phrases of the titles for each vertex in the graph: first by PoS-tagging the titles using the pattern tagger in TextBlob, and then by chunking the noun-phrases using the CONLL extractor in TextBlob
7.    Score noun-phrases using tf-idf and select the three highest-scoring noun-phrases

Table 1: Step-by-step workings of the system.


A summary of all the steps involved in generating query suggestions by the system is listed in table 1.
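The sketch below illustrates steps 6 and 7 under simplifying assumptions: `titles` maps each vertex of a cluster (or of the whole graph) to its paper title, and `idf` maps a noun-phrase to a precomputed corpus-wide inverse document frequency. Both inputs are hypothetical; only the TextBlob tagger and the CONLL extractor are the components named in the text.

```python
from collections import Counter
from textblob import TextBlob
from textblob.np_extractors import ConllExtractor

# A sketch of steps 6-7; `titles` and `idf` are assumed, precomputed inputs.
def candidate_labels(titles, idf, top_n=3):
    extractor = ConllExtractor()          # CONLL chunker bundled with TextBlob
    counts = Counter()
    for title in titles.values():
        blob = TextBlob(title, np_extractor=extractor)
        counts.update(str(phrase).lower() for phrase in blob.noun_phrases)
    # Term frequency is counted over the cluster/graph; idf stays corpus-wide.
    scores = {phrase: tf * idf.get(phrase, 0.0) for phrase, tf in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```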

4. EXPERIMENTAL SETUP

The system is evaluated on two metrics: diversity and relevance. First, the definitions of the metrics are given, after which the design of the main experiment is described in detail.

4.1 Metrics

As explained earlier, traditional metrics like precision and recall are not suitable for evaluating an exploratory search task [32]. Established metrics from the research field of recommendation systems can also be applied to exploratory search, such as novelty, diversity and serendipity. Novelty is obviously relevant for exploratory search, and is also suggested as an appropriate metric by White & Roth in [32]. However, novelty, as well as serendipity, is context-dependent. This means that the experimental setup needs to support the measurement of these constructs. Unfortunately, for the sake of feasibility of the experiment, a simplified experimental design is chosen which does not include user context. Therefore, novelty and serendipity are not measured. Diversity, on the other hand, is also an important metric for exploratory search, but can be measured and is included in the evaluation.

Another traditional information retrieval metric that is used, is relevance. This construct is used throughout the IR and recommender system research fields for assessing the relevance of retrieved or recommended items compared to the original query or item.

4.1.1 Diversity

For diversity, the intra-list diversity construct is used, which measures diversity between suggestions for a single query. This cannot be measured directly, since only suggestions are returned for a query and just comparing terms of the suggestions would not be sufficient. Therefore, similar to the system design, pseudo-relevance feedback is used: the top-10 results of the suggestions are compared. Through experimentation it was found that comparing titles of the top-10 results was not a sensitive enough measure, resulting in very high scores on diversity.


Precise: No change in meaning. Little or no change in scope.
Approximate: Modest change in intent. The scope may have expanded or narrowed.
Marginal: Shift in user intent to a related, but distinct topic.
Mismatch: The original user intent has been lost or obscured.

Table 2: Relevance scale used by human judges to evaluate the suggestions for each query, from [25].

Therefore, the citation lists of the top-10 results are used instead. As a distance measure, the number of matches between citation lists is used. The score is normalized by dividing the number of matches by the total number of citations. A higher score means more matches and thus less diversity, and vice versa. Although comparing co-citations would seem even better, this would penalize older and highly-cited documents, which are more co-cited than recent and lower-cited papers.
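A sketch of this overlap measure is given below. results(suggestion) and citations(doc) are hypothetical helpers returning the top-10 documents for a suggestion and the citation list of a document; the exact normalization may differ from the implementation used in the experiments.

```python
from itertools import combinations

# A sketch of the citation-list overlap measure; helpers are assumed, not real APIs.
def citation_overlap(sugg_a, sugg_b, results, citations):
    cites_a = {c for doc in results(sugg_a) for c in citations(doc)}
    cites_b = {c for doc in results(sugg_b) for c in citations(doc)}
    union = cites_a | cites_b
    # Higher overlap means more shared citations, i.e. less diverse suggestions.
    return len(cites_a & cites_b) / len(union) if union else 0.0

def intra_list_overlap(suggestions, results, citations):
    pairs = list(combinations(suggestions, 2))
    overlaps = [citation_overlap(a, b, results, citations) for a, b in pairs]
    return sum(overlaps) / len(overlaps) if overlaps else 0.0
```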

4.1.2 Relevance

Relevance can be measured either automatically or manually. For automatic measurement of relevance, a pre-annotated dataset is required. Such datasets are available for general information retrieval use, e.g. TREC, but these do not contain any citation data. Therefore, the construct of relevance needs to be assessed manually. For manual assessment of relevance, the 4-point relevance scale by [25] is used, see table 2. An average score on this scale would amount to 2.5. An additional option, N/A, is added to this scale, in case the assessors are not able to assess the suggestion with the provided information. Domain experts assess each suggestion on relevance to its original query using this scale. They are given the initial query, the suggestion and a list of snippets of the top-5 documents returned by the search engine. To avert any bias, the assessment is performed blinded, meaning that the assessors do not know which system was used to generate the suggestion. The assessment is performed by three domain experts who are associated with the University of Amsterdam. All are in possession of at least an MSc diploma in the field of computer science. This ensures that they have the required knowledge of the computer science domain to assess the query suggestions on relevance.

4.2 Experiment

Two variants of the system are created and evaluated: one with clustering and one without clustering. As a baseline, a content-based suggestor which employs the statistical co-occurrence of missing terms was implemented from the literature [30]. Unfortunately, this system seems unsuited for the domain of scientific literature: exactly 80% of the query suggestions generated by the system were variations on one of the following four suggestions: system, based, using and data. The suggestions are inaccurate and not meaningful, so the system is not included in the formal evaluation. This means only the query suggestions of the two variants of the system are compared on relevance and diversity:

• Non-clustered co-citation suggestor (NCS)
• Clustered co-citation suggestor (CCS)

Based on the reasoning in the previous paragraphs, the experiment will try to answer the following research question:

RQ1: Can labels generated from the co-citation graph function as relevant and diverse query suggestions?

This breaks down into the following three sub-questions:

SQ1: Do both systems (CCS & NCS) score higher than average (µ > 2.5) on relevance?
SQ2: Do both systems (CCS & NCS) score higher than average (µ > 0.5) on diversity?
SQ3: Does clustering the co-citation graph (CCS) lead to more diverse suggestions, compared to the non-clustered variant (NCS)?

Start queries need to be fed to the algorithms, to base suggestions on. These initial queries will be formulated by a team of three domain experts. Each of them will devise twenty queries that fall within their domain of expertise, to ensure eligibility for manual relevance assessment. The experts are instructed to formulate queries based on existing research fields. A random sample of thirty queries will be selected from the set of drafted queries. Each system will then generate three query suggestions based on this subset of queries.

4.2.1 Data

A database dump of CiteseerX is used as the dataset. CiteseerX is a search engine and repository for computer and information science literature. The database is populated by a spider which automatically crawls and indexes scientific literature from various sources. The full dataset contains 2,118,122 documents and dates from 2012. Only 1,152,019 documents (roughly 54%) have inbound citation links and can be used for retrieving co-citations. Logically, CiteseerX is used as the search engine for obtaining the top-10 results for each query, from which the eventual co-citation graph is constructed through pseudo-relevance feedback.

5. EVALUATION

The two systems are evaluated on relevance and diversity for a total of N =90 query suggestions.

5.1 Statistical analysis

5.1.1 Relevance

All query suggestions have been successfully assessed by the domain experts. The option N/A was not used by the assessors. This results in a total of N = 90 scored query suggestions.

The non-clustered co-citation suggestor (NCS) (µ = 1.99, SE = 0.0955) is compared on relevance with the clustered co-citation suggestor (CCS) (µ = 1.67, SE = 0.0652), to discover if there is a significant difference in relevance between the systems. A Wilcoxon signed-rank test shows that the NCS scores significantly higher on relevance than the CCS (Z = -2.998, p < 0.05). Both systems score lower than average on relevance (µ < 2.5), see table 4.
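For reference, the paired comparison can be reproduced with scipy as sketched below; the scores are placeholders, not the study's data. In the experiment each entry would be the averaged 1-4 relevance rating given to one suggestion.

```python
from scipy.stats import wilcoxon

# Placeholder ratings only, used to illustrate the paired test.
ncs_scores = [2.33, 1.67, 3.00, 2.00, 1.33, 2.67]
ccs_scores = [1.67, 1.33, 2.33, 1.67, 1.00, 2.00]
statistic, p_value = wilcoxon(ncs_scores, ccs_scores)
print(f"Wilcoxon signed-rank: W = {statistic}, p = {p_value:.3f}")
```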


                   ICC    95% CI            F test (true value 0)
                          Lower    Upper    Value    df1    df2    Sig.
Single Measures    .600   .522     .671     5.493    179    358    .000
Average Measures   .818   .766     .860     5.493    179    358    .000

Table 3: Intraclass correlation coefficient for inter-annotator agreement.

       N    Min   Max    Mean     Std. Error   Std. Dev.
NCS    90   1     4.33   1.9889   .09546       .90599
CCS    90   1     3.67   1.6667   .06515       .61808

Table 4: Descriptive statistics for relevance for the non-clustered (NCS) and clustered (CCS) co-citation suggestors, before filtering.

       N    Min    Max   Mean     Std. Error   Std. Dev.
NCS    30   0.82   1     0.9053   .00664       .03636
CCS    30   0.60   1     0.8943   .01485       .08135

Table 5: Descriptive statistics for diversity for the non-clustered (NCS) and clustered (CCS) co-citation suggestors, before filtering.

5.1.2 Inter-annotator agreement

To determine the reliability of the relevance assessment, the intraclass correlation coefficient (ICC) is computed to measure the agreement between the domain experts. Unlike Cohen’s Kappa, this test is designed for use with more than two annotators and makes use of the ordering in the data. Since the average score of the assessors is taken per query, the average-measures value of the ICC test is used instead of the single-measures value. The ICC test reveals that the agreement between the annotators is almost perfect (ICC = 0.818), see table 3.
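A sketch of this agreement check is given below, assuming the ratings are stored in long format (one row per suggestion-rater pair) and using the pingouin library; the numbers are placeholders only.

```python
import pandas as pd
import pingouin as pg

# Placeholder ratings in long format: one row per (suggestion, rater) pair.
ratings = pd.DataFrame({
    "suggestion": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "rater":      ["A", "B", "C"] * 3,
    "score":      [2, 3, 2, 1, 1, 2, 4, 3, 3],
})
icc = pg.intraclass_corr(data=ratings, targets="suggestion",
                         raters="rater", ratings="score")
# The average-measures value reported in the paper corresponds to the "k" rows.
print(icc[["Type", "ICC", "CI95%"]])
```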

5.1.3 Diversity

The diversity score is compared between the NCS (µ = 0.91, SE = 0.0066) and the CCS (µ = 0.89, SE = 0.015). Both score higher than average on diversity (µ > 0.5), see table 5. A Wilcoxon signed-rank test shows no significant difference in diversity between the two systems (Z = -0.097, p > 0.05).

5.2 Qualitative analysis

Due to the unexpectedly poor relevance scores of both systems, the input and output of the systems are analyzed qualitatively, step by step. First, the top-10 results of the search engine are analyzed, to see if these results are relevant to the original query. Second, a random sample of the co-citation graphs is inspected to check for any noise in the dataset. Third, the clusters of the co-citation graph are analyzed from another random sample, to confirm the validity of the clusters.

Fourth, the extracted noun-phrases are compared with the original titles of the documents in the co-citation graph, to count the number of missing or incorrectly extracted noun-phrases. Finally, the query suggestion selection is scrutinized, by checking for better-suited candidate suggestions.

The qualitative analysis uncovers faults in both the first and the fifth step of the system design: the use of pseudo-relevance feedback and the clustering of the co-citation graph. The detailed findings on these issues are documented in the next subsections. The other steps did not demonstrate as many complications, as described directly below.

The co-citation graphs did not contain any considerable amount of noise, as long as the initial results on which the graph is based were relevant. Next to that, the noun-phrase extraction scored 79%, which is sufficient although not as high as the 93-95% reported in the literature. Lastly, the query suggestion selection could not yet be analyzed, due to the severe issues with the pseudo-relevance feedback, which negatively impacted the quality of the generated candidate suggestions.

5.2.1 Pseudo-relevance feedback

The top-10 results that are returned by the CiteseerX search engine are analyzed for each query on relevance. The findings are listed below.

• Coverage of the dataset is limited. For example, the search engine returns poor results for queries on user modeling, community detection and scalable graph clustering. This reduces the number of relevant documents returned and increases noise in the co-citation graph.
• Query combinations (e.g. information retrieval user model, which can be interpreted as a combination of information retrieval and user model) return only good results for part of the query. For the earlier example, information retrieval user model returns good results for information retrieval, but mediocre results for user model and the full query.
• Results with few or no inbound citations (< 5) are often returned. This reduces the number of potential co-citations that can be retrieved, and could also be a possible indicator of poor quality or influence of the returned documents.
• Related to the point above, a large variation in the number of inbound citations is observed, causing results with a high number of inbound citations to overshadow the other results.
• Some documents are returned in the results, but are not included in the dataset. The dataset contains data up to 2012.
• Aggressive lemmatization can cause irrelevant matches. E.g. computer vision new algorithms can match a document containing the terms computation and algorithm.


Query                         Suggestions NCS            Suggestions CCS
artificial intelligence       learning                   reinforcement learning
                              reinforcement learning     nonlinear systems
                              neural networks            genetic algorithms
pattern recognition methods   face recognition           image retrieval
                              support vector machines    relevance feedback
                              speech recognition         face recognition
computer vision               computer vision            face recognition
                              tracking                   maximum entropy
                              application                active robot work-cell

Table 6: Sample queries and suggestions.


5.2.2 Cluster validity

30 of a total of 412 clusters were checked for validity. Because there is a considerable amount of overlap between the clusters, it is difficult to establish a ground truth, a well-known issue in graph-clustering studies. However, the clusters can be checked for obvious errors; these amounted to 24% per cluster. Still, this does not say much about whether the optimal clusters are found. The modularity scores of the graphs can give insight into this. If modularity is a good optimization criterion for the used dataset, then the modularity scores should correlate with the relevance scores for the clustered co-citation system, provided that the query suggestions were selected optimally. However, due to the problems with pseudo-relevance feedback, the correlation is calculated after the results have been filtered. Otherwise a potential correlation could be obscured by the negative impact of the pseudo-relevance feedback.

5.2.3 Filtering results

The qualitative analysis unveils many flaws in the workings of the CiteseerX search engine. Since pseudo-relevance feedback is used for generating the suggestions, the system is dependent on the quality of the results provided by the search engine. All start queries that return poor results should be filtered out, as these affect the results negatively and are caused by a fault outside of the control of the system. To measure the quality of the results returned for a query, the results are assessed on relevance to the query using the same scale as in the relevance assessment. Start queries where the majority of the returned results are rated Mismatch or Marginal on relevance are filtered out. Some results contain few or no inbound citations and therefore cannot be used for retrieving co-citations. Any results with fewer than five inbound citation links are pruned from the results before determining the majority vote on relevance. These criteria ensure that the suggestions are predominantly based on relevant results.
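A sketch of these filtering criteria is given below. Each start query is assumed to map to a list of result records with hypothetical fields inbound_citations and relevance (one of the four scale labels); this illustrates the rules above, not the code used in the study.

```python
# Hypothetical result records: {"inbound_citations": int, "relevance": str}.
MIN_INBOUND_CITATIONS = 5
RELEVANT_LABELS = {"Precise", "Approximate"}

def keep_query(results):
    # Prune results with too few inbound citation links before the majority vote.
    usable = [r for r in results if r["inbound_citations"] >= MIN_INBOUND_CITATIONS]
    if not usable:
        return False
    relevant = sum(r["relevance"] in RELEVANT_LABELS for r in usable)
    return relevant > len(usable) / 2

# Usage: filtered = {q: res for q, res in query_results.items() if keep_query(res)}
```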

5.3 Analysis after filtering

Based on the aforementioned criteria, 23 of the 30 original queries are filtered out, leaving only 7 queries with a total of N = 21 query suggestions. A sample of queries and suggestions is given in table 6. Now the correlation between modularity and relevance can be determined, as well as the analysis of the query suggestion candidate selection process. Also, the query suggestions are re-tested on relevance and diversity.

       N    Min   Max    Mean     Std. Error   Std. Dev.
NCS    21   1     4.33   2.8889   .15142       .69389
CCS    21   1     3.67   2.0317   .18098       .82936

Table 7: Descriptive statistics for relevance for the non-clustered (NCS) and clustered (CCS) co-citation suggestors, after filtering.


5.3.1 Correlation between modularity and relevance

A Pearson correlation test finds no correlation (r(5) = -0.124, p > 0.05) between graph modularity and the relevance scores of the query suggestions of the clustered co-citation system.
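A sketch of this correlation check is given below; the values are placeholders standing in for the per-query graph modularity and the mean CCS relevance after filtering.

```python
from scipy.stats import pearsonr

# Placeholder values for the 7 remaining queries, used only to illustrate the test.
modularity = [0.62, 0.71, 0.55, 0.68, 0.74, 0.59, 0.66]
relevance = [2.3, 1.7, 2.0, 2.7, 1.3, 2.3, 2.0]
r, p = pearsonr(modularity, relevance)
print(f"r({len(modularity) - 2}) = {r:.3f}, p = {p:.3f}")
```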

No definitive conclusion can be drawn from this analysis, since the candidate selection has certain limitations, which are described below.

• No cluster selection. Highest scoring candidates are selected across all clusters.

• Overlapping clusters can cause important candidates to be distributed across multiple clusters, lowering their end score.

5.3.2 Relevance

Again, a Wilcoxon signed-rank test is used to test for a significant difference between the non-clustered co-citation suggestor (µ = 2.89, SE = 0.151) and the clustered co-citation suggestor (µ = 2.03, SE = 0.181). The Wilcoxon signed-rank test shows, once more, that the NCS scores significantly higher on relevance (Z = -3.514, p < 0.05) than the CCS, see also table 7. Only the NCS scores higher than average on relevance (µ > 2.5), see table 7.

5.3.3 Diversity

Another Wilcoxon signed-rank test is conducted. Both the NCS (µ = 0.91, SE = 0.0084) and the CCS (µ = 0.88, SE = 0.047) still score higher than average on diversity (µ > 0.5), see table 8. The Wilcoxon signed-rank test shows no significant difference in diversity between the two systems (Z = -0.315, p > 0.05).


       N    Min   Max   Mean     Std. Error   Std. Dev.
NCS    7    .88   1     .9143    .00841       .02225
CCS    7    .60   1     .8814    .04748       .12562

Table 8: Descriptive statistics for diversity for the non-clustered (NCS) and clustered (CCS) co-citation suggestors, after filtering.

6. DISCUSSION

The problems with pseudo-relevance feedback were unanticipated. Although some preliminary testing was done beforehand, the magnitude of the issue was not clear. Filtering out results that were negatively influenced by the poorly performing search engine results in an increase in mean relevance of almost a single point on a 4-point scale. This increased the mean relevance of both systems, but only the NCS scores higher than average. This shows that the NCS is able to generate relevant query suggestions, although it must be noted that the sample size is very small after filtering. The relevance assessment was executed properly, as demonstrated by the high agreement between the annotators.

The CCS performed worse than the NCS. This is likely caused by poorly formed clusters and poor query suggestion candidate selection (only for the clustered system). A different cluster optimization method should be chosen in future work, to examine whether clustering is an appropriate method for generating query suggestions.

The diversity scores of both systems were very high: the systems not only scored above average but nearly perfect. This makes the suggestions suited for an exploratory search task.

The extraction of noun-phrases underperforms compared to the percentages reported in the literature. However, the results in this paper are achieved on a highly domain-specific dataset. Therefore, any further improvement in this area is likely to only be achieved with domain-specific training data, which is not readily available.

Another potential improvement could come from using the citation context of documents instead of the titles. These seem to be more descriptive [26], but unfortunately, were not available in the used dataset.

Future study is required to determine to what extent the suggestions are suited for exploratory search. A user-centric evaluation [24] is deemed a prudent next step, for example by letting users carry out search tasks and showing suggestions in the search engine. Also, multiple types of queries should be evaluated and compared.

All in all, the NCS is capable of generating diverse and relevant query suggestions. There are multiple opportunities for optimizing the system, so further study is warranted. Once the system has been optimized, based on the findings of this study, it would be interesting to compare the query suggestor with the system by [17] or any other system designed for exploratory search. Only then can a conclusive answer be given as to whether co-citation data are suited for generating query suggestions in an exploratory search context.

7. ACKNOWLEDGMENTS

I want to thank my supervisor Artem Grotov and my former supervisor Marc Bron for their helpful feedback and motivating words. You helped me through the journey.

8. REFERENCES

[1] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 5–14. ACM, 2009.
[2] R. Baeza-Yates, C. Hurtado, and M. Mendoza. Query recommendation using query logs in search engines. In Current Trends in Database Technology - EDBT 2004 Workshops, pages 588–596. Springer, 2005.
[3] S. Bhatia, D. Majumdar, and P. Mitra. Query suggestions in the absence of query logs. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 795–804. ACM, 2011.
[4] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[5] P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, and S. Vigna. The query-flow graph: model and applications. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 609–618. ACM, 2008.
[6] P. Boldi, F. Bonchi, C. Castillo, D. Donato, and S. Vigna. Query suggestions using query-flow graphs. In Proceedings of the 2009 Workshop on Web Search Click Data, pages 56–63. ACM, 2009.
[7] I. Bordino, C. Castillo, D. Donato, and A. Gionis. Query similarity by projecting the query-flow graph. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 515–522. ACM, 2010.
[8] E. Brill. A simple rule-based part of speech tagger. In Proceedings of the Workshop on Speech and Natural Language, pages 112–116. Association for Computational Linguistics, 1992.
[9] C. Chen, F. Ibekwe-SanJuan, and J. Hou. The structure and dynamics of cocitation clusters: A multiple-perspective cocitation analysis. Journal of the American Society for Information Science and Technology, 61(7):1386–1409, 2010.
[10] J. Fan, H. Wu, G. Li, and L. Zhou. Suggesting topic-based query terms as you type. In Web Conference (APWEB), 2010 12th International Asia-Pacific, pages 61–67. IEEE, 2010.
[11] C. Grimes, D. Tang, and D. M. Russell. Query logs alone are not enough. In Workshop on Query Log Analysis at WWW. Citeseer, 2007.
[12] M. Hearst. Search User Interfaces. Cambridge University Press, 2009.
[13] I. Hsieh-Yee. Effects of search experience and subject knowledge on the search tactics of novice and experienced searchers. Journal of the American Society for Information Science, 44(3):161–174, 1993.
[14] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 154–161. ACM, 2005.
[15] R. Jones, B. Rey, O. Madani, and W. Greiner. Generating query substitutions. In Proceedings of the 15th International Conference on World Wide Web, pages 387–396. ACM, 2006.
[16] D. Kelly, K. Gyllstrom, and E. W. Bailey. A comparison of query and term suggestion features for interactive searching. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 371–378. ACM, 2009.
[17] Y. Kim, J. Seo, W. B. Croft, and D. A. Smith. Automatic suggestion of phrasal-concept queries for literature search. Information Processing & Management, 50(4):568–583, 2014.
[18] R. Klavans and K. W. Boyack. Using global mapping to create more accurate document-level maps of research fields. Journal of the American Society for Information Science and Technology, 62(1):1–18, 2011.
[19] T. Kudoh and Y. Matsumoto. Use of support vector learning for chunk identification. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning - Volume 7, pages 142–144. Association for Computational Linguistics, 2000.
[20] C. C. Kuhlthau. Inside the search process: Information seeking from the user's perspective. JASIS, 42(5):361–371, 1991.
[21] H. Ma, M. R. Lyu, and I. King. Diversifying query suggestion results. In AAAI, volume 10, 2010.
[22] Q. Mei, D. Zhou, and K. Church. Query suggestion using hitting time. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 469–478. ACM, 2008.
[23] V. L. O'Day and R. Jeffries. Orienteering in an information landscape: how information seekers get from here to there. In Proceedings of the INTERACT '93 and CHI '93 Conference on Human Factors in Computing Systems, pages 438–445. ACM, 1993.
[24] P. Pu, L. Chen, and R. Hu. A user-centric evaluation framework for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, pages 157–164. ACM, 2011.
[25] F. Radlinski, A. Broder, P. Ciccolo, E. Gabrilovich, V. Josifovski, and L. Riedel. Optimizing relevance and revenue in ad search: a query substitution approach. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 403–410. ACM, 2008.
[26] J. W. Schneider. Concept symbols revisited: Naming clusters by parsing and filtering of noun phrases from citation contexts of concept symbols. Scientometrics, 68(3):573–593, 2006.
[27] H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4):265–269, 1973.
[28] M. Strohmaier, M. Kröll, and C. Körner. Intentional query suggestion: making user goals more explicit during search. In Proceedings of the 2009 Workshop on Web Search Click Data, pages 68–74. ACM, 2009.
[29] J. Teevan, C. Alvarado, M. S. Ackerman, and D. R. Karger. The perfect search engine is not enough: a study of orienteering behavior in directed search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 415–422. ACM, 2004.
[30] E. Terra and C. L. Clarke. Scoring missing terms in information retrieval tasks. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 50–58. ACM, 2004.
[31] R. W. White and G. Marchionini. Examining the effectiveness of real-time query expansion. Information Processing & Management, 43(3):685–704, 2007.
[32] R. W. White and R. A. Roth. Exploratory Search: Beyond the Query-Response Paradigm. Synthesis Lectures on Information Concepts, Retrieval, and Services, 1(1):1–98, 2009.
[33] J. Xu and W. B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4–11. ACM, 1996.
[34] Z. Zhang and O. Nasraoui. Mining search engine query logs for query recommendation. In Proceedings of the 15th International Conference on World Wide Web, pages 1039–1040. ACM, 2006.


APPENDIX

A. QUERY GENERATION INSTRUCTIONS

Please generate 5 queries for each of the search tasks listed below. These queries should be regular search engine queries that you would use in a real-life setting to carry out the corresponding search task. All the queries should fit within any of the activities that are regarded as exploratory search according to Marchionini [1]. An image is attached which lists all these activities, as described by Marchionini in [1].

• Which models on information seeking behaviour can be found in the literature?

• What is the best performing clustering algorithm for graphs?

• What are the principles of information visualization?
• How can you evaluate an information retrieval system?
• What is the state-of-the-art on computer vision?
• How does pattern recognition relate to artificial intelligence?

[1] Marchionini, G. (2006). Exploratory search: From finding to understanding. Communications of the ACM, 49(4), pp. 41–46. doi:10.1145/1121949.1121979

B. RELEVANCE ASSESSMENT INSTRUCTIONS

A list of search query-query suggestion pairs will be displayed on the following pages. Each search query is a regular search engine query, coupled with a query suggestion generated by one of two different query suggestion systems. Each page lists one search query-query suggestion pair, along with the top-ranked results of the queries by the search engine, where the original query is displayed on the left and the query suggestion is displayed on the right.

Please rate each suggestion on relevance, using the scale below.

• Precise: No change in meaning. Little or no change in scope

• Approximate: Modest change in intent. The scope may have expanded or narrowed

• Marginal: Shift in user intent to a related, but distinct topic

• Mismatch: The original user intent has been lost or obscured

• N/A: Unable to assess the suggestion with the provided information

A total of 30 queries will be assessed. With six suggestions for each query, this totals 180 pages. The suggestions are displayed in a randomized order.

Before the assessment starts, four example pairs will be shown. Please study these examples and the accompanying explanations carefully. After the examples, the assessment will start. This will be indicated clearly.

Please do not continue if the above instructions are unclear. Otherwise, thanks for taking the time and good luck! Press the Next button below to continue to the example pairs.
