A principled methodology for comparing relatedness measures for clustering publications


Ludo Waltman¹, Kevin W. Boyack², Giovanni Colavizza³, and Nees Jan van Eck¹

¹ Centre for Science and Technology Studies, Leiden University, The Netherlands

² SciTech Strategies, Inc., Albuquerque, NM, USA

³ University of Amsterdam, The Netherlands

Keywords: accuracy, citation relation, clustering, relatedness measure, textual similarity

ABSTRACT

There are many different relatedness measures, based for instance on citation relations or textual similarity, that can be used to cluster scientific publications. We propose a principled methodology for evaluating the accuracy of clustering solutions obtained using these relatedness measures. We formally show that the proposed methodology has an important consistency property. The empirical analyses that we present are based on publications in the fields of cell biology, condensed matter physics, and economics. Using the BM25 text-based relatedness measure as the evaluation criterion, we find that bibliographic coupling relations yield more accurate clustering solutions than direct citation relations and cocitation relations. The so-called extended direct citation approach performs similarly to or slightly better than bibliographic coupling in terms of the accuracy of the resulting clustering solutions. The other way around, using a citation-based relatedness measure as evaluation criterion, BM25 turns out to yield more accurate clustering solutions than other text-based relatedness measures.

1. INTRODUCTION

Clustering of scientific publications is an important problem in the field of bibliometrics. Bibliometricians have employed many different clustering techniques (e.g., Gläser, Scharnhorst, & Glänzel, 2017; Šubelj, Van Eck, & Waltman, 2016). In addition, they have used various relatedness measures to cluster publications. These relatedness measures are typically based on either citation relations (e.g., direct citation relations, bibliographic coupling relations, or cocitation relations) or textual similarity, or sometimes a combination of the two.

Which relatedness measure yields the most accurate clustering of publications? Two perspectives can be taken on this question. One perspective is that there is no absolute notion of accuracy (e.g., Gläser et al., 2017). Following this perspective, each relatedness measure yields clustering solutions that are accurate in their own right, and it is not meaningful to ask whether one clustering solution is more accurate than another one. For instance, different citation-based and text-based relatedness measures each emphasize different aspects of the way in which publications relate to each other, and the corresponding clustering solutions each provide a legitimate viewpoint on the organization of the scientific literature. The other perspective is that for some purposes it is useful, and perhaps even necessary, to assume the existence of an absolute notion of accuracy (e.g., Klavans & Boyack, 2017). When this perspective is taken, it is possible, at least in principle, to say that some relatedness measures yield more accurate clustering solutions than others.

Citation: Waltman, L., Boyack, K. W., Colavizza, G., & van Eck, N. J. (2020). A principled methodology for comparing relatedness measures for clustering publications. Quantitative Science Studies, 1(2), 691–713. https://doi.org/10.1162/qss_a_00035

Received: 21 January 2019. Accepted: 28 August 2019. Corresponding author: Ludo Waltman (waltmanlr@cwts.leidenuniv.nl). Handling editor: Vincent Larivière.

We believe that both perspectives are useful. From a purely conceptual point of view, the first perspective is probably the more satisfactory one. However, from a more applied point of view, the second perspective is highly important. In many practical applications, users expect to be provided with a single clustering of publications. Users typically have some intuitive idea of accuracy and, based on this idea of accuracy, they expect the clustering provided to them to be as accurate as possible. In this paper, we take this applied viewpoint and therefore focus on the second perspective.

Identifying the relatedness measure that yields the most accurate clustering of publications is challenging because of the lack of a ground truth. There is no perfect classification of publications that can be used to evaluate the accuracy of different clustering solutions. For instance, suppose we study the degree to which a clustering solution resembles an existing classification of publications (e.g., Haunschild, Schier, et al., 2018). The difficulty then is that it is not clear how discrepancies between the clustering solution and the existing classification should be interpreted. Such discrepancies could indicate shortcomings of the clustering solution, but they could equally well reflect problems of the existing classification.

As an alternative, the accuracy of clustering solutions can be evaluated by domain experts who assess the quality of different clustering solutions in a specific scientific domain (e.g., Šubelj et al., 2016). This approach has the difficulty that it is hard to find a sufficiently large number of experts who are willing to spend a considerable amount of time making a detailed assessment of the quality of different clustering solutions. Moreover, the knowledge of experts will often be restricted to relatively small domains, and it will be unclear to what extent the conclusions drawn by experts generalize to other domains.

In this paper, we take a large-scale data-driven approach to compare different relatedness measures based on which publications can be clustered. The basic idea is to cluster publications based on a number of different relatedness measures and to use another more or less independent relatedness measure as a benchmark for evaluating the accuracy of the clustering solutions. This approach has already been used extensively in a series of papers by Kevin Boyack, Dick Klavans, and colleagues. They compared different citation-based relatedness measures (Boyack & Klavans, 2010; Klavans & Boyack, 2017), including relatedness measures that take advantage of full-text data (Boyack, Small, & Klavans, 2013), as well as different text-based relatedness measures (Boyack, Newman, et al., 2011). To evaluate the accuracy of clustering solutions, they used grant data, textual similarity (Boyack & Klavans, 2010; Boyack et al., 2011, 2013), and more recently also the reference lists of "authoritative" publications, defined as publications with at least 100 references (Klavans & Boyack, 2017).

Our aim in this paper is to introduce a principled methodology for performing analyses similar to those mentioned above. We restrict ourselves to the use of one specific clustering technique, namely the technique introduced in the bibliometric literature by Waltman and Van Eck (2012), but we allow the use of any measure of the relatedness of publications. For two relatedness measures A and B, our proposed methodology offers a principled way to evaluate the accuracy of clustering solutions obtained using the two measures, where a third relatedness measure C is used as the evaluation criterion. Unlike approaches taken in earlier papers, our methodology has an important consistency property.

This paper is organized as follows. In section 2, we introduce our methodology for evaluating the accuracy of clustering solutions obtained using different relatedness measures. In section 3, we discuss the relatedness measures that we consider in our analyses. We report the results of the analyses in section 4. We present comparisons of different citation-based and text-based relatedness measures that can be used to cluster publications. Our analyses are based on publications in the fields of cell biology, condensed matter physics, and economics. We summarize our conclusions in section 5.

2. METHODOLOGY

To introduce our methodology for evaluating the accuracy of clustering solutions obtained using different relatedness measures, we first discuss the quality function that we use to cluster publications. We then explain how we evaluate the accuracy of a clustering solution and analyze the consistency of our evaluation framework. Finally, we discuss the importance of using an independent evaluation criterion.

2.1. Quality Function for Clustering Publications

Consider a set of N publications. Let $r^X_{ij} \geq 0$ denote the relatedness of publications i and j (with i = 1, …, N and j = 1, …, N) based on relatedness measure X, and let $c^X_i \in \{1, 2, \ldots\}$ denote the cluster to which publication i is assigned when publications are clustered based on relatedness measure X.

Publications are assigned to clusters by maximizing a quality function. We focus on the quality function of Waltman and Van Eck (2012). This quality function is given by

$$Q = \sum_{i,j} I(c^X_i = c^X_j) \left( r^X_{ij} - \gamma \right), \tag{1}$$

where $I(c^X_i = c^X_j)$ equals 1 if $c^X_i = c^X_j$ and 0 otherwise, and where γ ≥ 0 denotes a so-called resolution parameter. The higher the value of this parameter, the larger the number of clusters that will be obtained. Hence, the resolution parameter γ determines the granularity of the clustering. An appropriate value for this parameter can be chosen based on the specific purpose for which a clustering of publications is intended to be used. For some purposes it may be desirable to have a highly granular clustering, while for other purposes a less granular clustering may be preferable. Sjögårde and Ahlgren (2018, 2020) proposed approaches for choosing the value of the resolution parameter γ that allow clusters to be interpreted as research topics or specialties.

The quality function in Eq. (1) can also be written as

$$Q = \sum_{i,j} I(c^X_i = c^X_j) \, r^X_{ij} - \gamma \sum_k \left( s^X_k \right)^2, \tag{2}$$

where $s^X_k$ denotes the number of publications assigned to cluster k; that is,

$$s^X_k = \sum_i I(c^X_i = k). \tag{3}$$

We also refer to $s^X_k$ as the size of cluster k.
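The equivalence of Eqs. (1) and (2) can be checked numerically. The following is a minimal sketch, assuming a small dense relatedness matrix stored as nested lists; the function names and the toy data are illustrative and do not come from the paper.

```python
from collections import Counter


def quality_pairwise(r, clusters, gamma):
    """Eq. (1): sum of (r[i][j] - gamma) over all ordered pairs in the same cluster."""
    n = len(clusters)
    return sum(
        r[i][j] - gamma
        for i in range(n)
        for j in range(n)
        if clusters[i] == clusters[j]
    )


def quality_cluster_sizes(r, clusters, gamma):
    """Eq. (2): within-cluster relatedness minus gamma times the sum of squared sizes."""
    n = len(clusters)
    within = sum(
        r[i][j] for i in range(n) for j in range(n) if clusters[i] == clusters[j]
    )
    sizes = Counter(clusters)  # Eq. (3): cluster sizes s_k
    return within - gamma * sum(s * s for s in sizes.values())


# Hypothetical toy data: 4 publications, symmetric relatedness, zero diagonal.
r = [
    [0.0, 0.8, 0.1, 0.0],
    [0.8, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.7],
    [0.0, 0.1, 0.7, 0.0],
]
clusters = [1, 1, 2, 2]
gamma = 0.25

print(quality_pairwise(r, clusters, gamma))
print(quality_cluster_sizes(r, clusters, gamma))
```

Both functions sum over all ordered pairs, including i = j, which is what makes the penalty term equal to γ times the sum of the squared cluster sizes.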

[...] Girvan (2004) and Newman (2004). However, as shown by Traag et al. (2011), it has the important advantage that it does not suffer from the so-called resolution limit problem (Fortunato & Barthélemy, 2007). Waltman and Van Eck (2012) introduced the above quality function in the bibliometric literature. In the field of bibliometrics, the quality function has been used by, among others, Boyack and Klavans (2014), Klavans and Boyack (2017), Perianes-Rodriguez and Ruiz-Castillo (2017), Ruiz-Castillo and Waltman (2015), Sjögårde and Ahlgren (2018, 2020), Small, Boyack, and Klavans (2014), and Van Eck and Waltman (2014).

2.2. Evaluating the Accuracy of a Clustering Solution

Suppose that we have three relatedness measures A, B, and C, and suppose also that we have used relatedness measures A and B to cluster a set of publications. Furthermore, suppose that we want to use relatedness measure C to evaluate the accuracy of the clustering solutions obtained using relatedness measures A and B. One way in which this could be done is by using relatedness measure C to obtain a third clustering solution and by comparing the clustering solutions obtained using relatedness measures A and B with this third clustering solution. A large number of methods have been proposed for comparing clustering solutions (e.g., Fortunato, 2010). However, we do not take this approach. In order to have a consistent evaluation framework (see section 2.3), we evaluate the accuracy of the clustering solutions obtained using relatedness measures A and B based directly on relatedness measure C, not on a clustering solution obtained using this relatedness measure.

Let $A^{X|C}$ denote the accuracy of a clustering solution obtained using relatedness measure X (with X = A or X = B), where the accuracy is evaluated using relatedness measure C. We define $A^{X|C}$ as

$$A^{X|C} = \frac{1}{N} \sum_{i,j} I(c^X_i = c^X_j) \, r^C_{ij}. \tag{4}$$

The clustering solution obtained using relatedness measure A is considered to be more accurate than the clustering solution obtained using relatedness measure B if $A^{A|C} > A^{B|C}$, and the other way around.
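As an illustration of Eq. (4), the sketch below computes the accuracy of two hypothetical clustering solutions against a toy evaluation measure C. The function name, the cluster assignments, and the relatedness values are assumptions made for this example only.

```python
def accuracy(clusters_x, r_c):
    """Eq. (4): (1/N) times the sum of r^C_ij over pairs assigned to the same X-cluster."""
    n = len(clusters_x)
    total = sum(
        r_c[i][j]
        for i in range(n)
        for j in range(n)
        if clusters_x[i] == clusters_x[j]
    )
    return total / n


# Hypothetical evaluation measure C: publications 0-1 and 2-3 are strongly related,
# publications 0-2 and 1-3 only weakly.
r_c = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.9],
    [0.0, 0.1, 0.9, 0.0],
]
clusters_a = [1, 1, 2, 2]  # solution obtained using measure A (made up)
clusters_b = [1, 2, 1, 2]  # solution obtained using measure B (made up)

print(accuracy(clusters_a, r_c), accuracy(clusters_b, r_c))
```

Both toy solutions consist of two clusters of two publications, so they have the same granularity and their accuracies can be compared directly; here the solution for A groups the pairs that C considers strongly related and therefore scores higher.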

The above approach for evaluating the accuracy of a clustering solution favors less granular solutions over more granular ones. Of all possible clustering solutions, the least granular solution is the one in which all publications belong to the same cluster. According to Eq. (4), this least granular clustering solution always has the highest possible accuracy. There can be no other clustering solution with higher accuracy. In order to perform meaningful comparisons, Eq. (4) should be used only for comparing clustering solutions that have the same granularity.

How do we determine whether two clustering solutions have the same granularity? We could require that both clustering solutions have been obtained using the same value for the resolution parameter γ. Alternatively, we could require that both clustering solutions consist of the same number of clusters. We do not take either of these approaches. Instead, we require that the sum of the squared cluster sizes is the same for two clustering solutions. In other words, two clustering solutions obtained using relatedness measures A and B have the same granularity if

$$\sum_k \left( s^A_k \right)^2 = \sum_l \left( s^B_l \right)^2. \tag{5}$$

In practice, obtaining two clustering solutions that satisfy Eq. (5) typically will not be easy. For both clustering solutions, it may require a significant amount of trial and error with different values of the resolution parameter γ. In the end, it may turn out that Eq. (5) can be satisfied only approximately, not exactly. We will return to this issue in section 4.3.
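The granularity condition in Eq. (5) is straightforward to check for given cluster assignments. The sketch below is illustrative only: the function names and the `rel_tol` parameter, which allows for the approximate matching mentioned above, are assumptions and not part of the methodology as published.

```python
from collections import Counter


def sum_squared_sizes(clusters):
    """Sum of squared cluster sizes, the quantity compared in Eq. (5)."""
    return sum(s * s for s in Counter(clusters).values())


def same_granularity(clusters_a, clusters_b, rel_tol=0.0):
    """True if Eq. (5) holds, exactly by default or within a relative tolerance."""
    ss_a = sum_squared_sizes(clusters_a)
    ss_b = sum_squared_sizes(clusters_b)
    return abs(ss_a - ss_b) <= rel_tol * max(ss_a, ss_b)


balanced = [1, 1, 1, 2, 2, 2]  # sizes 3 and 3: 9 + 9 = 18
skewed = [1, 1, 1, 1, 1, 2]    # sizes 5 and 1: 25 + 1 = 26
uneven = [1, 1, 1, 1, 2, 3]    # sizes 4, 1, and 1: 16 + 1 + 1 = 18

print(same_granularity(balanced, skewed))  # same number of clusters, Eq. (5) fails
print(same_granularity(balanced, uneven))  # different number of clusters, Eq. (5) holds
```

The toy assignments show that requiring an equal number of clusters and requiring equal sums of squared cluster sizes are different conditions.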

A conceptual motivation for the evaluation framework introduced in this subsection is presented in Appendix A.1. This motivation is based on an analogy with the evaluation of the accuracy of different indicators that provide estimates of values drawn from a probability distribution.

2.3. Consistency of the Evaluation Framework

The choice of the accuracy measure defined in Eq. (4) and the granularity condition presented in Eq. (5) may seem quite arbitrary. However, provided we use the quality function defined in Eq. (1), this choice has an important justification. Suppose that the accuracy of clustering solutions is evaluated using some relatedness measure X. Our choice of the accuracy measure in Eq. (4) and the granularity condition in Eq. (5) then guarantees that of all possible clustering solutions of a certain granularity the solution obtained using relatedness measure X will be the most accurate one. In other words, it is guaranteed that $A^{X|X} \geq A^{Y|X}$ for any relatedness measure Y. This is a fundamental consistency property that we believe should be satisfied by any sound framework for evaluating the accuracy of clustering solutions obtained using different relatedness measures.
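The link between the quality function and the accuracy measure can be made explicit in a short derivation. The sketch below assumes that the clustering obtained using relatedness measure X attains the global maximum of the quality function for the chosen value of γ. The notation $Q^X(c)$ and $A^{c|X}$ for an arbitrary clustering c is introduced here only for illustration.

```latex
% Illustrative notation (an assumption of this sketch): Q^X(c) is Eq. (2) evaluated
% for an arbitrary clustering c with cluster sizes s_k, and A^{c|X} is Eq. (4) for c.
\begin{align*}
Q^X(c) &= \sum_{i,j} I(c_i = c_j) \, r^X_{ij} - \gamma \sum_k s_k^2
        = N A^{c|X} - \gamma \sum_k s_k^2 . \\
% Let c^X maximize Q^X, and let c^Y be any clustering of the same granularity,
% so that both have the same sum of squared cluster sizes S (Eq. (5)).
Q^X(c^X) \geq Q^X(c^Y)
  &\iff N A^{X|X} - \gamma S \geq N A^{Y|X} - \gamma S \\
  &\iff A^{X|X} \geq A^{Y|X} .
\end{align*}
```

The term γS cancels only because both solutions have the same sum of squared cluster sizes, which suggests why the granularity condition takes the form of Eq. (5) rather than, for instance, an equal number of clusters.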

Suppose, for instance, that we have three clustering solutions, all of the same granularity: one obtained based on direct citation relations between publications, another obtained based on bibliographic coupling relations, and a third obtained based on cocitation relations. Suppose also that the accuracy of the clustering solutions is evaluated based on direct citation relations. It would then be a rather odd outcome if the clustering solution obtained based on bibliographic coupling or cocitation relations turned out to be more accurate than the solution obtained based on direct citation relations. In our evaluation framework, it is guaranteed that there can be no such inconsistent outcomes. When the accuracy of clustering solutions is evaluated based on direct citation relations, the clustering solution obtained based on direct citation relations will always be the most accurate one. We refer to Appendix B for a formal analysis of this important consistency property. The appendix also provides an example of an inconsistent evaluation framework.

2.4. Independent Evaluation Criterion

[...] affected by similar types of noise: for instance, noise caused by the fact that the authors of a publication cite a specific reference while some other reference would have been more relevant.

In this paper, we use a text-based relatedness measure to evaluate the accuracy of different clustering solutions obtained using citation-based relatedness measures, and conversely we use a citation-based relatedness measure to evaluate the accuracy of different clustering solutions obtained using text-based relatedness measures. Importantly, we are not interested in evaluating citation-based clustering solutions using a citation-based relatedness measure or text-based clustering solutions using a text-based relatedness measure. Such evaluations are of little interest because the relatedness measure used for evaluation is not sufficiently independent of the relatedness measures being evaluated. For instance, when direct citation relations are used to evaluate the accuracy of different clustering solutions obtained using citation-based relatedness measures, the clustering solution obtained based on direct citation relations will be the most accurate one. The evaluation simply shows that the clustering solution obtained based on direct citation relations is best aligned with an evaluation criterion based on direct citation relations, which of course is not surprising. This illustrates the importance of using an independent evaluation criterion. The more the relatedness measure used for evaluation can be considered to be independent of the relatedness measures being evaluated, the more informative the evaluation will be.

In Appendix A.2, we provide a further demonstration of the importance of using an independent evaluation criterion.

3. RELATEDNESS MEASURES

We now discuss the relatedness measures that we consider in this paper. We first discuss relatedness measures based on citation relations, followed by relatedness measures based on textual similarity. We also discuss the so-called top M relatedness approach as well as the idea of normalized relatedness measures.

3.1. Citation-Based Relatedness Measures

Below we discuss a number of citation-based approaches for determining the pairwise relatedness for a set of N publications. We use $c_{ij}$ to indicate whether publication i cites publication j ($c_{ij} = 1$) or not ($c_{ij} = 0$).

The relatedness of publications i and j based on direct citation relations is given by

$$r^{DC}_{ij} = \max(c_{ij}, c_{ji}). \tag{6}$$

Hence, $r^{DC}_{ij} = 1$ if publication i cites publication j or the other way around and $r^{DC}_{ij} = 0$ if neither publication cites the other.

The relatedness of publications i and j based on bibliographic coupling relations equals the number of common references in the two publications. This can be written as

$$r^{BC}_{ij} = \sum_k c_{ik} c_{jk}, \tag{7}$$

where the summation extends over all publications in the database that we use, not only over the N publications for which we aim to determine their pairwise relatedness.

The relatedness of publications i and j based on cocitation relations equals the number of publications in which publications i and j are both cited. In mathematical terms,

$$r^{CC}_{ij} = \sum_k c_{ki} c_{kj}, \tag{8}$$

where the summation again extends over all publications in the database that we use.

The above approaches for determining the relatedness of publications may also be combined. This results in

$$r^{DC\text{-}BC\text{-}CC}_{ij} = \alpha r^{DC}_{ij} + r^{BC}_{ij} + r^{CC}_{ij}, \tag{9}$$

where α denotes a parameter that determines the weight of direct citation relations relative to bibliographic coupling and cocitation relations. A direct citation relation may be considered a stronger signal of the relatedness of two publications than a bibliographic coupling or cocitation relation (Waltman & Van Eck, 2012), and therefore one may want to give more weight to a direct citation relation than to the two other types of relations. This can be achieved by setting α to a value above 1. The idea of combining different types of citation-based relations is not new. This idea was also explored by Small (1997) and Persson (2010).

In addition to the above citation-based approaches for determining the relatedness of publications, we also consider a so-called extended direct citation approach. Like the ordinary direct citation approach, the extended direct citation approach takes into account only direct citation relations between publications. However, direct citation relations are considered not just within the set of N focal publications but within an extended set of publications. In addition to the N focal publications, the extended set of publications includes all publications in our database that have a direct citation relation with at least two focal publications. (Publications that have a direct citation relation with only one focal publication are not considered because they do not contribute to improving the clustering of the focal publications.) The technical details of the extended direct citation approach are somewhat complex. These details are discussed in Appendix C. We note that an approach similar to our extended direct citation approach was also used by Boyack and Klavans (2014) and Klavans and Boyack (2017).
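The direct citation, bibliographic coupling, and cocitation measures and their combination in Eqs. (6)-(9) can be sketched as follows. This is a minimal illustration, assuming that citation links are available as a set of (citing, cited) pairs covering the whole database; the function name, data layout, and toy identifiers are hypothetical, and the extended direct citation approach of Appendix C is not covered.

```python
def citation_relatedness(cites, focal, alpha=1.0):
    """Return DC, BC, CC, and combined relatedness for each unordered focal pair.

    cites: set of (citing, cited) pairs for all publications in the database.
    focal: list of the N focal publications.
    """
    refs = {}    # publication -> set of publications it cites
    citers = {}  # publication -> set of publications citing it
    for citing, cited in cites:
        refs.setdefault(citing, set()).add(cited)
        citers.setdefault(cited, set()).add(citing)

    result = {}
    for a, i in enumerate(focal):
        for j in focal[a + 1:]:
            dc = 1 if (i, j) in cites or (j, i) in cites else 0     # Eq. (6)
            bc = len(refs.get(i, set()) & refs.get(j, set()))       # Eq. (7)
            cc = len(citers.get(i, set()) & citers.get(j, set()))   # Eq. (8)
            result[(i, j)] = {
                "DC": dc,
                "BC": bc,
                "CC": cc,
                "combined": alpha * dc + bc + cc,                   # Eq. (9)
            }
    return result


# Hypothetical toy database: p1, p2, p3 are focal; x, y, z are other publications.
cites = {
    ("p1", "x"), ("p1", "y"),
    ("p2", "x"), ("p2", "y"), ("p2", "p1"),
    ("z", "p1"), ("z", "p2"), ("z", "p3"),
}
relatedness = citation_relatedness(cites, ["p1", "p2", "p3"], alpha=2.0)
print(relatedness[("p1", "p2")])
```

Because all three measures are symmetric, the sketch stores each unordered pair once. Shared references and shared citing publications are looked up in the whole database, not only among the focal publications, in line with the remark following Eq. (7).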

3.2. Text-Based Relatedness Measures

We consider two text-based approaches for determining the relatedness of publications. We use $o_{il}$ to denote the number of occurrences of term l in publication i. To count the number of occurrences of a term in a publication, only the title and abstract of the publication are considered, not the full text. Part-of-speech tagging is applied to the title and abstract of the publication to identify nouns and adjectives. The part-of-speech tagging algorithm provided by the Apache OpenNLP 1.5.2 library is used. A term is defined as a sequence of nouns and adjectives, with the last word in the sequence being a noun. No distinction is made between singular and plural nouns, so "neural network" and "neural networks" are regarded as the same term. Furthermore, shorter terms embedded in longer terms are not counted. For instance, if a publication contains the term "artificial neural network", this is counted as an occurrence of "artificial neural network" but not as an occurrence of "neural network" or "network". Finally, no stop word list is used, so there are no terms that are excluded from being counted.

A straightforward text-based measure of the relatedness of publications i and j is given by

$$r^{CT}_{ij} = \sum_l \frac{o_{il} o_{jl}}{\left( \sum_k o_{kl} \right)^{\beta}}. \tag{10}$$

[...] determines the extent to which the influence of these terms is reduced. If β = 0, no reduction in the influence of frequently occurring terms takes place. On the other hand, if β = 1, the influence of frequently occurring terms is strongly reduced, following a so-called fractional counting approach (Perianes-Rodriguez, Waltman, & Van Eck, 2016).
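A small numerical illustration of Eq. (10) may help to see the effect of β. The sketch below assumes that term occurrences are available as one dictionary per publication; the function name and the toy term counts are made up for this example.

```python
def ct_relatedness(occurrences, beta=1.0):
    """Eq. (10) for all pairs i < j; occurrences[i][term] is the count o_il."""
    totals = {}  # term -> total number of occurrences over all publications
    for doc in occurrences:
        for term, count in doc.items():
            totals[term] = totals.get(term, 0) + count

    result = {}
    n = len(occurrences)
    for i in range(n):
        for j in range(i + 1, n):
            shared = occurrences[i].keys() & occurrences[j].keys()
            value = sum(
                occurrences[i][t] * occurrences[j][t] / totals[t] ** beta
                for t in shared
            )
            if value > 0:
                result[(i, j)] = value
    return result


# Hypothetical term counts for three publications.
docs = [
    {"neural network": 2, "model": 1},
    {"neural network": 1, "model": 3},
    {"model": 1, "market": 2},
]
print(ct_relatedness(docs, beta=0.0))
print(ct_relatedness(docs, beta=1.0))
```

With β = 0 the measure is a plain sum of products of term counts. With β = 1 each product is divided by the total number of occurrences of the term, so a term that occurs often across the toy collection ("model") contributes less.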

Boyack et al. (2011) identified BM25 as one of the most accurate text-based relatedness measures for clustering publications. We therefore also include BM25 in our analysis. BM25 originates from the field of information retrieval, where it is used to determine the relevance of a document for a search query (Sparck Jones, Walker, & Robertson, 2000a, 2000b). Following Boyack et al. (2011), we use BM25 as a text-based measure of the relatedness of publications. The BM25 relatedness measure is defined as

$$r^{BM25}_{ij} = \sum_l I(o_{il} > 0) \, \mathrm{IDF}_l \, \frac{o_{jl} (k_1 + 1)}{o_{jl} + k_1 \left( 1 - b + b \frac{d_j}{\bar{d}} \right)}, \tag{11}$$

where $I(o_{il} > 0)$ equals 1 if $o_{il} > 0$ and 0 otherwise and where $d_j$ and $\bar{d}$ denote, respectively, the length of publication j and the average length of all N publications. We define the length of a publication as the total number of occurrences of terms in the publication. This results in

$$d_i = \sum_l o_{il} \quad \text{and} \quad \bar{d} = \frac{1}{N} \sum_i d_i. \tag{12}$$

$\mathrm{IDF}_l$ in Eq. (11) denotes the inverse document frequency of term l, which we define as

$$\mathrm{IDF}_l = \log \frac{N - n_l + 0.5}{n_l + 0.5}, \tag{13}$$

where $n_l$ denotes the number of publications in which term l occurs, that is,

$$n_l = \sum_i I(o_{il} > 0). \tag{14}$$

The BM25 relatedness measure in Eq. (11) depends on the parameters $k_1$ and b. Following Boyack et al. (2011), we set these parameters to values of 2 and 0.75, respectively. Unlike all other relatedness measures that we consider in this paper, the BM25 relatedness measure is not symmetrical. In other words, $r^{BM25}_{ij}$ does not need to be equal to $r^{BM25}_{ji}$.
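Eqs. (11)-(14) can be turned into a compact sketch. The code below is an illustration under stated assumptions: term occurrences are given as one dictionary per publication, the natural logarithm is used for the IDF, and self-relatedness (i = j) is skipped. The function name and the toy term counts are hypothetical.

```python
import math


def bm25_relatedness(occurrences, k1=2.0, b=0.75):
    """Eq. (11) for all ordered pairs (i, j) with i != j."""
    n = len(occurrences)
    lengths = [sum(doc.values()) for doc in occurrences]  # Eq. (12): d_i
    avg_length = sum(lengths) / n                         # Eq. (12): average length

    doc_freq = {}                                         # Eq. (14): n_l
    for doc in occurrences:
        for term, count in doc.items():
            if count > 0:
                doc_freq[term] = doc_freq.get(term, 0) + 1
    idf = {                                               # Eq. (13)
        term: math.log((n - n_l + 0.5) / (n_l + 0.5))
        for term, n_l in doc_freq.items()
    }

    result = {}
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption of this sketch: no self-relatedness
            length_norm = k1 * (1 - b + b * lengths[j] / avg_length)
            score = 0.0
            for term, o_il in occurrences[i].items():
                o_jl = occurrences[j].get(term, 0)
                if o_il > 0 and o_jl > 0:
                    score += idf[term] * o_jl * (k1 + 1) / (o_jl + length_norm)
            result[(i, j)] = score
    return result


# Hypothetical term counts for five publications.
docs = [
    {"graphene": 1, "lattice": 1},
    {"graphene": 3, "lattice": 1, "spin": 2},
    {"market": 2},
    {"market": 1, "spin": 1},
    {"price": 1},
]
r_bm25 = bm25_relatedness(docs)
print(r_bm25[(0, 1)], r_bm25[(1, 0)])
```

The two printed values differ because publication i only determines which terms are included, whereas the term counts and the length normalization come from publication j. Note also that the IDF in Eq. (13) becomes negative for a term that occurs in more than half of the publications.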

3.3. Top M Relatedness Approach

Our interest focuses on large-scale clustering analyses that may involve hundreds of thousands or even millions of publications. These analyses impose significant challenges in terms of computing time and memory requirements. In particular, in these analyses, it may not be feasible to store all nonzero relatedness values in the main memory of the computer that is used.

To deal with this problem, we use the top M relatedness approach. This approach is quite similar to the idea of similarity filtering typically used by Kevin Boyack and Dick Klavans (e.g., Boyack & Klavans, 2010; Boyack et al., 2011). In the top M relatedness approach, only the top M strongest relations per publication are kept (ties are broken randomly). The remaining relations are discarded. We use $\tilde{r}^X_{ij}$ to denote the relatedness of publications i and j based on relatedness measure X after discarding relations that are not in the top M per publication. This means that $\tilde{r}^X_{ij} = r^X_{ij}$ if publication j is among the M publications that are most strongly related to publication i and $\tilde{r}^X_{ij} = 0$ otherwise.

In most of the analyses presented in this paper, we use a value of 20 for M, although we also explore alternative values. We apply the top M relatedness approach to all our relatedness measures except for the measures based on (extended) direct citation relations. As pointed out by Waltman and Van Eck (2012), the use of direct citation relations has the advantage of requiring only a relatively limited amount of computer memory, and therefore there is no need to use the top M relatedness approach when working with direct citation relations. Applying the top M relatedness approach in the case of direct citation relations would also be problematic, because all relations are equally strong, making it difficult to decide which relations to keep and which ones to discard. Hence, in the case of direct citation relations, we simply have $\tilde{r}^{DC}_{ij} = r^{DC}_{ij}$ for all publications i and j.
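The top M filtering step can be sketched as follows. The code assumes that relatedness values are stored as a dictionary of dictionaries (publication i to publication j to value); the function name, the `seed` parameter, and the toy values are assumptions made for illustration.

```python
import random


def top_m_filter(r, m, seed=0):
    """Keep, for each publication i, only its m strongest relations.

    r: dict mapping publication i to a dict {j: relatedness of i with j}.
    Ties are broken randomly via a random secondary sort key.
    """
    rng = random.Random(seed)
    filtered = {}
    for i, relations in r.items():
        ranked = [
            (value, rng.random(), j) for j, value in relations.items() if value > 0
        ]
        ranked.sort(key=lambda item: (item[0], item[1]), reverse=True)
        filtered[i] = {j: value for value, _, j in ranked[:m]}
    return filtered


# Hypothetical relatedness values for four publications.
r = {
    "a": {"b": 5.0, "c": 3.0, "d": 1.0},
    "b": {"a": 5.0, "c": 2.0},
    "c": {"a": 3.0, "b": 2.0},
    "d": {"a": 1.0},
}
kept = top_m_filter(r, m=2)
print(kept)
```

In the toy example, publication d keeps its relation with a, but a does not keep its relation with d, so the filtered values need not be symmetric even when the input is.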

3.4. Normalization of Relatedness Measures

We also normalize all relatedness measures. The normalized relatedness of publication i with publication j equals the relatedness of publication i with publication j divided by the total relatedness of publication i with all publications. Hence, the normalized relatedness of publication i with publication j based on relatedness measure X is given by

$$\hat{r}^X_{ij} = \frac{\tilde{r}^X_{ij}}{\sum_k \tilde{r}^X_{ik}}. \tag{15}$$

This normalization was also used by Waltman and Van Eck (2012). The idea of the normalization is that the relatedness values of publications in different fields of science should be of the same order of magnitude, so that clusters in different fields will be of similar size. Without the normalization, citation-based relatedness values for instance can be expected to be much higher in the life sciences than in the social sciences. In a clustering analysis that involves both publications in the life sciences and publications in the social sciences, this would result in life sciences clusters being systematically larger than social sciences clusters. The normalization in Eq. (15) can be used to correct for such differences between fields. The normalization also has the advantage that, regardless of the choice of relatedness measure, a specific value of the resolution parameter γ will always yield clustering solutions that have approximately the same granularity.
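The normalization in Eq. (15) amounts to dividing each row of relatedness values by its row total. A minimal sketch, assuming the same dictionary-of-dictionaries layout as a plausible storage format (the function name and toy values are illustrative):

```python
def normalize(r_tilde):
    """Eq. (15): divide the relatedness of i with j by the total relatedness of i."""
    normalized = {}
    for i, relations in r_tilde.items():
        total = sum(relations.values())
        if total > 0:
            normalized[i] = {j: value / total for j, value in relations.items()}
        else:
            normalized[i] = {}  # publications without relations stay unrelated
    return normalized


# Hypothetical filtered relatedness values.
r_tilde = {
    "a": {"b": 2.0, "c": 2.0},
    "b": {"a": 2.0},
    "c": {"a": 2.0},
    "d": {},
}
r_norm = normalize(r_tilde)
print(r_norm)
```

After normalization, the relatedness values of each publication that has at least one relation sum to 1. The normalized values are in general not symmetric: in the toy data the normalized relatedness of a with b is 0.5, whereas that of b with a is 1.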

All results presented in the next section are based on normalized relatedness measures.

4. RESULTS

We start the discussion of the results of our analyses by explaining the data collection and the way in which publications were clustered. We then introduce the idea of granularity-accuracy plots. Next, we present a comparison of different citation-based relatedness measures that can be used to cluster publications. This is followed by a comparison of different text-based relatedness measures.

4.1. Data Collection

Data was collected from the Web of Science (WoS) database. We used the in-house version of the WoS database available at the Centre for Science and Technology Studies at Leiden University. This version of the database includes the Science Citation Index Expanded, the Social Sciences Citation Index, and the Arts & Humanities Citation Index.

[...] this paper manageable, we restricted ourselves to three specific fields. We selected all publications of the document types article and review that appeared in the period 2007–2016 in journals belonging to the WoS subject categories "Cell biology", "Physics, condensed matter", and "Economics". Our aim was to cover three broad scientific domains, namely the life sciences, the physical sciences, and the social sciences. These three subject categories were chosen because they cover these three domains and because they are relatively large in terms of the number of publications they include. The number of publications is 252,954 in cell biology, 272,935 in condensed matter physics, and 172,690 in economics.

The relatedness measures discussed in section 3 were calculated for the selected publications. Two comments need to be made. First, in determining bibliographic coupling relations between publications, only common references to publications indexed in our WoS database were considered. This database includes publications starting from 1980. Common references to nonindexed publications (e.g., books, conference proceedings publications, and PhD theses) were not taken into account. Nonindexed publications were not considered in the extended direct citation approach either. Second, when we collected the data in Spring 2017, our database included a limited number of publications from 2017. These publications were not used in determining cocitation relations between publications. They also were not considered in the extended direct citation approach.

Table 1 reports for each of the three fields of science that we analyze and for each of the relatedness measures that we consider the average number of relations per publication and the percentage of publications that have no relations at all. The average number of relations per publication was calculated after applying the top M relatedness approach (except for DC and EDC; see section 3.3). Table 1 shows that in the case of DC and especially CC a quite high percentage of the publications have no relations. This can be expected to have a negative effect on the accuracy of clustering solutions obtained using these relatedness measures, since publications without relations cannot be properly clustered.

Table 1. The average number of relations per publication (ANR) and the percentage of publications without relations (PWR) for different fields of science and different citation-based and text-based relatedness measures

Cell biology Condensed matter physics Economics

ANR PWR ANR PWR ANR PWR


4.2. Clustering of Publications

For each of our three fields (cell biology, condensed matter physics, and economics), the selected publications were clustered based on each of our relatedness measures. Clustering was performed by maximizing the quality function presented in Eq. (1). To maximize the quality function, we used an iterative variant (Waltman & Van Eck, 2013) of the well-known Louvain algorithm (Blondel, Guillaume, et al., 2008). Five iterations of the algorithm were performed. In addition, to speed up the algorithm, we employed ideas similar to the pruning idea of Ozaki, Tezuka, and Inaba (2016) and the prioritization idea of Bae, Halperin, et al. (2017). Our algorithm is a predecessor of the recently introduced Leiden algorithm (Traag, Waltman, & Van Eck, 2019), which was not yet available when we carried out our analyses. In general, our algorithm will not be able to find the global maximum of the quality function, but it can be expected to get close to the global maximum.

Different levels of granularity were considered. For each relatedness measure, we obtained 10 clustering solutions, each of them for a different value of the resolution parameter γ. The following values of γ were used: 0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, and 0.01. Because of the normalization discussed in section 3.4, the same values of γ could be used for all relatedness measures. Without normalization, different values of γ would need to be used for each of the relatedness measures.

4.3. Granularity-Accuracy Plots

A difficulty of the evaluation framework presented in section 2.2 is the requirement that the clustering solutions being compared have exactly the same granularity. This requirement, which is formalized in the condition in Eq. (5), is hard to meet in practice. Clustering solutions obtained using different relatedness measures but the same value of the resolution parameter γ will approximately satisfy Eq. (5), but the condition normally will not be satisfied exactly.

To deal with this problem, we propose a graphical approach based on the idea of granularity-accuracy (GA) plots. Using a GA plot, relatedness measures can be compared despite differences in granularity between clustering solutions. The horizontal axis in a GA plot represents the granularity of a clustering solution. We define the granularity of a clustering solution obtained using relatedness measure X as

$$\frac{N}{\sum_k \left(s_k^X\right)^2}. \tag{16}$$

Two clustering solutions that have the same granularity according to Eq. (16) indeed satisfy the condition in Eq. (5). The vertical axis in a GA plot represents the accuracy of a clustering solution as defined in Eq. (4). Clustering solutions are plotted in a GA plot based on their granularity and accuracy. Lines are drawn between clustering solutions obtained using the same relatedness measure but different values of the resolution parameter γ. We use a logarithmic scale for both the horizontal and the vertical axis in a GA plot.
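As an illustration, the two coordinates of a point in a GA plot could be computed along the following lines. This is only a minimal sketch: the granularity follows Eq. (16), while the accuracy function assumes that Eq. (4) averages, over the N publications, the evaluation-criterion relatedness of all pairs of publications assigned to the same cluster. The function names and the toy data are hypothetical.

```python
from collections import Counter


def granularity(labels):
    """Eq. (16): N divided by the sum of the squared cluster sizes."""
    sizes = Counter(labels).values()
    return len(labels) / sum(s * s for s in sizes)


def accuracy(labels, relatedness):
    """Assumed form of Eq. (4): (1/N) times the sum of r_ij over all
    pairs (i, j) that share a cluster. `relatedness` maps (i, j) to the
    value r_ij given by the relatedness measure used as evaluation criterion."""
    total = sum(r for (i, j), r in relatedness.items() if labels[i] == labels[j])
    return total / len(labels)


# Toy data (hypothetical): six publications in two clusters of three.
labels = [0, 0, 0, 1, 1, 1]
toy_relatedness = {(0, 1): 1.0, (1, 0): 1.0, (2, 3): 0.5, (3, 2): 0.5}

print(granularity(labels))                # 6 / (9 + 9)
print(accuracy(labels, toy_relatedness))  # only the pair (0, 1) is within a cluster
```

A clustering solution in which every publication forms its own cluster has a granularity of 1, and a solution with a single cluster has a granularity of 1/N.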

on such interpolations, the accuracy of different relatedness measures can be compared at a specific level of granularity. These comparisons can be performed at different levels of granularity. Sometimes different levels of granularity will yield inconsistent results, with, for instance, relatedness measure A outperforming relatedness measure B at one level of granularity and the opposite outcome at another level of granularity. In other cases, consistent results will be obtained at all levels of granularity. For instance, relatedness measure C may consistently outperform relatedness measure D, regardless of the level of granularity.
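A comparison at a fixed level of granularity could be sketched as follows. The sketch assumes that the interpolation is linear along the line segments of the GA plot, that is, linear in the logarithms of granularity and accuracy. The function name and the example points are hypothetical and do not correspond to results reported in this paper.

```python
import math


def accuracy_at(target_granularity, points):
    """Interpolate accuracy at a target granularity on log-log axes.

    `points` is a list of (granularity, accuracy) pairs obtained for one
    relatedness measure using different values of the resolution parameter.
    """
    pts = sorted((math.log10(g), math.log10(a)) for g, a in points)
    x = math.log10(target_granularity)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return 10 ** (y0 + t * (y1 - y0))
    raise ValueError("target granularity is outside the range covered by the points")


# Hypothetical GA points for two relatedness measures.
measure_a = [(1e-3, 10.0), (1e-2, 1.0)]
measure_b = [(2e-3, 4.0), (2e-2, 0.5)]

g = 5e-3
print(accuracy_at(g, measure_a), accuracy_at(g, measure_b))
```

The function deliberately raises an error instead of extrapolating, since a comparison outside the range of granularities actually observed for a relatedness measure would not be supported by any clustering solution.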

In the next two subsections, GA plots will be used to compare different citation-based and text-based relatedness measures.

4.4. Comparison of Citation-Based Relatedness Measures

For each of our three fields (cell biology, condensed matter physics, and economics), Figure 1

These results provide an upper bound for the results that can be obtained using the citation-based relatedness measures. (Recall from section 2.3 that the highest possible accuracy is obtained when publications are clustered based on the same relatedness measure that is also used as the evaluation criterion.) All relatedness measures (except for DC and EDC; see section 3.3) use a value of 20 for the parameter M of the top M relatedness approach.

To interpret the GA plots in Figure 1, it is important to have some understanding of the meaning of the different levels of granularity. For each of our three fields, a clustering solution consists of several hundreds of significant clusters when the granularity is around 0.001, where we define a significant cluster as a cluster that includes at least 10 publications. A granularity around 0.01 corresponds to several thousands of significant clusters.

relatedness measure is again used as the evaluation criterion. Only the field of condensed matter physics is considered. As can be seen in Figure 2, our results are rather insensitive to the value of M.

We also tested the sensitivity of our results to the choice of the text-based relatedness measure that is used as the evaluation criterion. The results turned out to be insensitive to this choice. Replacing BM25 by CT (with β = 0.5) yielded very similar results (not shown).

4.5. Comparison of Text-Based Relatedness Measures

Figure 3 presents GA plots for comparing the BM25 and CT text-based relatedness measures discussed in section 3.2. In the case of the CT relatedness measure, three values of the parameter β are considered: β = 0.0, β = 0.5, and β = 1.0. The DC-BC-CC citation-based relatedness measure discussed in section 3.1 (with α = 1) is used as the evaluation criterion. Results obtained when this relatedness measure is used to cluster publications are also included in the GA plots. These results provide an upper bound for the results that can be obtained using the text-based relatedness measures. All relatedness measures use a value of 20 for the parameter M of the top M relatedness approach.

The results presented in Figure 3 for cell biology, condensed matter physics, and economics are very similar. Using DC-BC-CC as the evaluation criterion, BM25 outperforms CT, regardless of the value of the parameter β. The good performance of BM25 is in agreement with the results of Boyack et al. (2011). By far the worst performance is obtained when CT is used with the parameter value β = 0.0. This confirms the importance of reducing the influence of frequently occurring terms. However, CT with the parameter value β = 0.5 outperforms CT with the parameter value β = 1.0. Hence, the influence of frequently occurring terms should not be reduced too strongly.

To test the sensitivity of our results to the value of the parameter M of the top M relatedness approach, Figure 4 presents a GA plot in which the BM25 text-based relatedness measure is compared for different values of M, using the DC-BC-CC citation-based relatedness measure (with α = 1) as the evaluation criterion. Only the field of condensed matter physics is considered. Interestingly, and perhaps surprisingly, the highest values of M (i.e., M = 50 and M = 100) are outperformed by lower values of M. Hence, while the highest values of M require most computing time and most computer memory, they yield the lowest accuracy. The highest accuracy is obtained for M = 10 or M = 20. In line with the approach taken by Boyack et al. (2011), it therefore seems sufficient to keep only the 10 or 20 strongest relations per publication.
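The pruning step referred to here, keeping only the M strongest relations per publication, could look roughly as follows. This is a sketch under assumptions: the data layout (a dictionary of neighbor dictionaries) and the function name are hypothetical, and the details of the top M relatedness approach given in section 3.3, such as how ties or asymmetric results are handled, are not reproduced.

```python
import heapq


def top_m_relations(relatedness, m):
    """For each publication i, keep only its m strongest relations.

    `relatedness` maps a publication i to a dict {j: r_ij}.
    The result can be asymmetric: j may be kept for i but not i for j.
    """
    return {
        i: dict(heapq.nlargest(m, neighbors.items(), key=lambda kv: kv[1]))
        for i, neighbors in relatedness.items()
    }


# Toy relatedness values (hypothetical).
toy = {
    0: {1: 0.9, 2: 0.1, 3: 0.5},
    1: {0: 0.9, 2: 0.4},
    2: {0: 0.1, 1: 0.4},
    3: {0: 0.5},
}

print(top_m_relations(toy, 2))
```

Because only M values per publication are retained, the memory needed for the pruned relations grows linearly with the number of publications rather than with the number of nonzero relatedness values.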

We also tested the sensitivity of our results to the choice of the citation-based relatedness measure that is used as the evaluation criterion. The results turned out to be insensitive to this choice. Replacing DC-BC-CC (with α = 1) by CC yielded very similar results (not shown).

5. CONCLUSIONS

The problem of clustering scientific publications involves significant conceptual and methodological challenges. We have introduced a principled methodology for evaluating the accuracy of clustering solutions obtained using different relatedness measures. Our methodology can be applied to evaluate the accuracy of clustering solutions obtained using two relatedness measures A and B, where a third relatedness measure C is used as the evaluation criterion. Preferably, relatedness measure C should be as independent as possible from relatedness measures A and B. Relatedness measures A and B, for instance, may be citation-based relatedness measures, and relatedness measure C may be a text-based relatedness measure (or the other way around).

The empirical results that we have presented are based on a large-scale analysis of publications in the fields of cell biology, condensed matter physics, and economics indexed in the WoS database. We have used our proposed methodology, complemented with a graphical approach based on so-called GA plots, to compare different citation-based relatedness measures that can be used to cluster publications. Using the BM25 text-based relatedness measure as the evaluation criterion, we have found that cocitation relations and direct citation relations yield less accurate clustering solutions than a number of other citation-based relatedness measures. Bibliographic coupling relations, possibly combined with direct citation relations and cocitation relations, can be used to obtain more accurate clustering solutions. The so-called extended direct citation approach yields clustering solutions with an accuracy that is similar to or even somewhat higher than the accuracy of clustering solutions obtained using bibliographic coupling relations. We note that our analyses have been restricted to individual fields of science. In an analysis that covers all fields of science and a long period of time, differences between the ordinary direct citation approach and the extended direct citation approach can be expected to be much smaller. We have also compared different text-based relatedness measures using a citation-based relatedness measure (obtained by combining direct citation relations, bibliographic coupling relations, and cocitation relations) as the evaluation criterion. BM25 has turned out to yield more accurate clustering solutions than the other text-based relatedness measures that we have studied.

We have also analyzed the use of the so-called top M relatedness approach. This approach can be used to reduce the amount of computing time and computer memory needed to cluster publications. We have found that the use of the top M relatedness approach does not decrease the accuracy of clustering solutions. In fact, in the case of text-based relatedness measures, the accuracy of clustering solutions may even increase.

different relatedness measures each provide a legitimate viewpoint on the organization of the scientific literature. We fully acknowledge the value of this alternative perspective, and we recognize the need to better understand how clustering solutions obtained using different relatedness measures offer complementary viewpoints. Nevertheless, from an applied point of view, we believe that there is a need to evaluate the accuracy of clustering solutions obtained using different relatedness measures and to identify the relatedness measures that yield the most accurate clustering solutions. This motivates our choice to assume the existence of an absolute notion of accuracy. For those who consider this assumption to be problematic, we would like to suggest that the results provided by our methodology could be given an alternative interpretation that does not depend on this assumption. Instead of interpreting the results in terms of accuracy, they could be interpreted in terms of the degree to which different relatedness measures yield similar clustering solutions.

The most obvious direction for future research is to apply our methodology to a broader set of relatedness measures. Examples include relatedness measures based on full-text data, grant data, and keyword data (e.g., MeSH terms). Some of this work is already ongoing (Boyack & Klavans, 2018).

ACKNOWLEDGMENTS

The authors would like to thank Dick Klavans, Vincent Traag, and two reviewers for their helpful comments.

AUTHOR CONTRIBUTIONS

Ludo Waltman: Conceptualization, Formal analysis, Methodology, Software, Writing—original draft. Kevin W. Boyack: Conceptualization, Methodology, Writing—review & editing. Giovanni Colavizza: Conceptualization, Methodology, Writing—review & editing. Nees Jan van Eck: Conceptualization, Methodology, Writing—review & editing.

COMPETING INTERESTS

The authors use clustering approaches similar to those discussed in this paper in commercial applications.

FUNDING INFORMATION

Part of this research was conducted when Giovanni Colavizza was affiliated with the Digital Humanities Laboratory, École Polytechnique Fédérale de Lausanne, Switzerland. Giovanni Colavizza was in part supported by a Swiss National Fund grant (number P1ELP2_168489).

DATA AVAILABILITY

The data used in this paper were obtained from the WoS database produced by Clarivate Analytics. Due to license restrictions, the data cannot be made openly available. To obtain WoS data, please contact Clarivate Analytics (https://clarivate.com/products/web-of-science).

REFERENCES

Bae, S.-H., Halperin, D., West, J. D., Rosvall, M., & Howe, B. (2017). Scalable and efficient flow-based community detection for large-scale graph analysis. ACM Transactions on Knowledge Discovery from Data, 11(3), 32.

Boyack, K. W., & Klavans, R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404.

Boyack, K. W., & Klavans, R. (2014). Including cited non-source items in a large-scale map of science: What difference does it make? Journal of Informetrics, 8(3), 569–580.

Boyack, K. W., & Klavans, R. (2018). Accurately identifying topics using text: Mapping PubMed. In R. Costas, T. Franssen, & A. Yegros-Yegros (Eds.), Proceedings of the 23rd International Conference on Science and Technology Indicators, pp. 107–115. Leiden, the Netherlands.

Boyack, K. W., Newman, D., Duhon, R. J., Klavans, R., Patek, M., Biberstine, J. R., … Börner, K. (2011). Clustering more than two million biomedical publications: Comparing the accuracies of nine text-based similarity approaches. PLOS ONE, 6(3), e18029.

Boyack, K. W., Small, H., & Klavans, R. (2013). Improving the accuracy of co-citation clustering using full text. Journal of the American Society for Information Science and Technology, 64(9), 1759–1767.

Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3–5), 75–174.

Fortunato, S., & Barthélemy, M. (2007). Resolution limit in community detection. Proceedings of the National Academy of Sciences of the United States of America, 104(1), 36–41.

Gläser, J., Scharnhorst, A., & Glänzel, W. (2017). Same data—different results? Towards a comparative approach to the identification of thematic structures in science. Scientometrics, 111(2), 981–998.

Haunschild, R., Schier, H., Marx, W., & Bornmann, L. (2018). Algorithmically generated subject categories based on citation relations: An empirical micro study using papers on overall water splitting. Journal of Informetrics, 12(2), 436–447.

Klavans, R., & Boyack, K. W. (2017). Which type of citation analysis generates the most accurate taxonomy of scientific and technical knowledge? Journal of the Association for Information Science and Technology, 68(4), 984–998.

Li, Y., & Ruiz-Castillo, J. (2013). The comparison of normalization procedures based on different classification systems. Journal of Informetrics, 7(4), 945–958.

Newman, M. E. J. (2004). Fast algorithm for detecting community structure in networks. Physical Review E, 69(6), 066133.

Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113.

Ozaki, N., Tezuka, H., & Inaba, M. (2016). A simple acceleration method for the Louvain algorithm. International Journal of Computer and Electrical Engineering, 8(3), 207–218.

Perianes-Rodriguez, A., & Ruiz-Castillo, J. (2017). A comparison of the Web of Science and publication-level classification systems of science. Journal of Informetrics, 11(1), 32–45.

Perianes-Rodriguez, A., Waltman, L., & Van Eck, N. J. (2016). Constructing bibliometric networks: A comparison between full and fractional counting. Journal of Informetrics, 10(4), 1178–1195.

Persson, O. (2010). Identifying research themes with weighted direct citation links. Journal of Informetrics, 4(3), 415–422.

Ruiz-Castillo, J., & Waltman, L. (2015). Field-normalized citation impact indicators using algorithmically constructed classification systems of science. Journal of Informetrics, 9(1), 102–117.

Sjögårde, P., & Ahlgren, P. (2018). Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics. Journal of Informetrics, 12(1), 133–152.

Sjögårde, P., & Ahlgren, P. (2020). Granularity of algorithmically constructed publication-level classifications of research publications: Identification of specialties. Quantitative Science Studies, 1(1), 207–238.

Small, H. (1997). Update on science mapping: Creating large document spaces. Scientometrics, 38(2), 275–293.

Small, H., Boyack, K. W., & Klavans, R. (2014). Identifying emerging topics in science and technology. Research Policy, 43(8), 1450–1467.

Sparck Jones, K., Walker, S., & Robertson, S. E. (2000a). A probabilistic model of information retrieval: Development and comparative experiments: Part 1. Information Processing and Management, 36(6), 779–808.

Sparck Jones, K., Walker, S., & Robertson, S. E. (2000b). A probabilistic model of information retrieval: Development and comparative experiments: Part 2. Information Processing and Management, 36(6), 809–840.

Šubelj, L., Van Eck, N. J., & Waltman, L. (2016). Clustering scientific publications based on citation relations: A systematic comparison of different methods. PLOS ONE, 11(4), e0154404.

Traag, V. A., Van Dooren, P., & Nesterov, Y. (2011). Narrow scope for resolution-limit-free community detection. Physical Review E, 84(1), 016114.

Traag, V. A., Waltman, L., & Van Eck, N. J. (2019). From Louvain to Leiden: Guaranteeing well-connected communities. Scientific Reports, 9, 5233.

Van Eck, N. J., & Waltman, L. (2014). CitNetExplorer: A new software tool for analyzing and visualizing citation networks. Journal of Informetrics, 8(4), 802–823.

Waltman, L., & Van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378–2392.

APPENDIX A: MOTIVATION FOR THE EVALUATION FRAMEWORK

In this appendix, we present a conceptual motivation for the framework introduced in section 2 for evaluating the accuracy of clustering solutions obtained using different relatedness measures. The motivation is based on an analogy with the evaluation of the accuracy of different indicators that provide estimates of values drawn from a probability distribution. We use this analogy because the evaluation of the accuracy of different indicators can be analyzed in a more precise way than the evaluation of the accuracy of clustering solutions obtained using different relatedness measures.

A.1. Evaluating Two Indicators Using a Third Indicator

Suppose N values $v_1, \ldots, v_N$ have been drawn from a standard normal distribution. These values cannot be observed directly. However, we have available three indicators A, B, and C that provide estimates of the values $v_1, \ldots, v_N$. Let the estimates provided by the indicators A, B, and C be denoted by $v_1^A, \ldots, v_N^A$, $v_1^B, \ldots, v_N^B$, and $v_1^C, \ldots, v_N^C$, respectively. Suppose we need to choose between the use of indicator A or indicator B. We therefore want to know which of these two indicators is more accurate. Since the values $v_1, \ldots, v_N$ cannot be observed directly, we cannot evaluate the accuracy of indicators A and B by comparing the estimates $v_1^A, \ldots, v_N^A$ and $v_1^B, \ldots, v_N^B$ with the true values $v_1, \ldots, v_N$. However, if indicator C can be assumed to be independent of indicators A and B (see Appendix A.2 for a further discussion of this assumption), it is possible to use indicator C to evaluate the accuracy of indicators A and B. This can be seen as follows.

Suppose the estimates provided by indicators A, B, and C are given by

$$v_i^A = \sqrt{\alpha_A}\, v_i + \sqrt{1 - \alpha_A}\, e_i^A, \tag{A1}$$

$$v_i^B = \sqrt{\alpha_B}\, v_i + \sqrt{1 - \alpha_B}\, e_i^B, \tag{A2}$$

$$v_i^C = \sqrt{\alpha_C}\, v_i + \sqrt{1 - \alpha_C}\, e_i^C, \tag{A3}$$

where $\alpha_A, \alpha_B, \alpha_C \in [0, 1]$ denote the accuracy of indicators A, B, and C and where $e_i^A$, $e_i^B$, and $e_i^C$ have been independently drawn from a standard normal distribution. Eqs. (A1), (A2), and (A3) imply that the estimates provided by indicators A, B, and C follow a standard normal distribution. Because $e_i^A$, $e_i^B$, and $e_i^C$ have been independently drawn, we say that indicators A, B, and C are independent of each other.

We want to know whether $\alpha_A > \alpha_B$ or $\alpha_A < \alpha_B$. To determine this, we calculate the mean squared difference between the estimates provided by indicators A and C and between the estimates provided by indicators B and C. This yields

$$\mathrm{MSD}_{AC} = \frac{1}{N} \sum_i \left(v_i^A - v_i^C\right)^2, \tag{A4}$$

$$\mathrm{MSD}_{BC} = \frac{1}{N} \sum_i \left(v_i^B - v_i^C\right)^2. \tag{A5}$$

If N is infinitely large, standard results from probability theory can be used to show that

$$\mathrm{MSD}_{AC} = 2 - 2\sqrt{\alpha_A \alpha_C}, \tag{A6}$$

$$\mathrm{MSD}_{BC} = 2 - 2\sqrt{\alpha_B \alpha_C}. \tag{A7}$$

Hence, if $\mathrm{MSD}_{AC} < \mathrm{MSD}_{BC}$, then $\alpha_A > \alpha_B$, and if $\mathrm{MSD}_{AC} > \mathrm{MSD}_{BC}$, then $\alpha_A < \alpha_B$. This shows that indicator C can be used to evaluate the accuracy of indicators A and B and to determine which of the two indicators is more accurate. Moreover, this is possible even if indicator C itself has a low (but nonzero) accuracy, perhaps much lower than the accuracy of indicators A and B.
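The argument can be checked numerically with a small simulation. The sketch below draws estimates according to Eqs. (A1) to (A3) and compares the resulting mean squared differences with the limiting values $2 - 2\sqrt{\alpha_A \alpha_C}$ and $2 - 2\sqrt{\alpha_B \alpha_C}$. The accuracy values, the sample size, and the function name are arbitrary choices made for illustration only.

```python
import math
import random


def simulate_msd(alpha_a, alpha_b, alpha_c, n=200_000, seed=0):
    """Monte Carlo estimate of MSD_AC and MSD_BC for independent indicators."""
    rng = random.Random(seed)
    sq_ac = sq_bc = 0.0
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)  # true value, not observable in practice
        v_a = math.sqrt(alpha_a) * v + math.sqrt(1 - alpha_a) * rng.gauss(0.0, 1.0)
        v_b = math.sqrt(alpha_b) * v + math.sqrt(1 - alpha_b) * rng.gauss(0.0, 1.0)
        v_c = math.sqrt(alpha_c) * v + math.sqrt(1 - alpha_c) * rng.gauss(0.0, 1.0)
        sq_ac += (v_a - v_c) ** 2
        sq_bc += (v_b - v_c) ** 2
    return sq_ac / n, sq_bc / n


# Indicator C is far less accurate than A and B, yet it still ranks them correctly.
alpha_a, alpha_b, alpha_c = 0.8, 0.6, 0.1
msd_ac, msd_bc = simulate_msd(alpha_a, alpha_b, alpha_c)

print(msd_ac, 2 - 2 * math.sqrt(alpha_a * alpha_c))
print(msd_bc, 2 - 2 * math.sqrt(alpha_b * alpha_c))
```

With these values, the simulated mean squared difference is smaller for indicator A than for indicator B, in line with $\alpha_A > \alpha_B$.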

We have now demonstrated how an indicator C can be used to evaluate the accuracy of indicators A and B. The idea of the evaluation framework presented in section 2 is similar, but instead of indicators we consider relatedness measures and clustering solutions obtained using these relatedness measures. We use a relatedness measure C to evaluate the accuracy of clustering solutions obtained using relatedness measures A and B. Relatedness measures A and B, for instance, could be two citation-based measures, such as a measure based on direct citation relations and a measure based on bibliographic coupling relations, while relatedness measure C could be a text-based measure, such as a measure based on BM25. If relatedness measure C can be assumed to be (approximately) independent of relatedness measures A and B, it can be used to evaluate the accuracy of clustering solutions obtained using relatedness measures A and B. This is possible even if relatedness measure C itself has a lower accuracy than relatedness measures A and B.

A.2. Independence Assumption

In Appendix A.1, we relied on the assumption that indicator C is independent of indicators A and B. We now demonstrate the importance of this assumption. To do so, we drop the assumption and we allow for a dependence between indicators A and C. Rather than by Eq. (A3), suppose estimates provided by indicator C are given by

$$v_i^C = \sqrt{(1 - d_{AC})\,\alpha_C + d_{AC}\,\alpha_A}\; v_i + \sqrt{(1 - d_{AC})(1 - \alpha_C)}\; e_i^C + \sqrt{d_{AC}(1 - \alpha_A)}\; e_i^A, \tag{A8}$$

where $d_{AC} \in [0, 1]$ denotes the dependence between indicators A and C. If $d_{AC} = 0$, there is no dependence between indicators A and C and Eq. (A8) reduces to Eq. (A3). On the other hand, if $d_{AC} = 1$, there is a full dependence between indicators A and C. The indicators then provide identical estimates, and Eq. (A8) therefore reduces to Eq. (A1). Eq. (A8) implies that the estimates provided by indicator C follow a standard normal distribution and that there is no dependence between indicators B and C.

Based on Eqs. (A1), (A2), and (A8), it can be shown that

$$\mathrm{MSD}_{AC} = 2 - 2\sqrt{\alpha_A \alpha_{AC}} - 2\sqrt{d_{AC}}\,(1 - \alpha_A), \tag{A9}$$

$$\mathrm{MSD}_{BC} = 2 - 2\sqrt{\alpha_B \alpha_{AC}}, \tag{A10}$$

where

$$\alpha_{AC} = (1 - d_{AC})\,\alpha_C + d_{AC}\,\alpha_A. \tag{A11}$$

As expected, if $d_{AC} = 0$, Eqs. (A9) and (A10) reduce to Eqs. (A6) and (A7). It follows from Eqs. (A9) and (A10) that $\mathrm{MSD}_{BC} < \mathrm{MSD}_{AC}$ if and only if

$$\alpha_B > \alpha_A + \frac{d_{AC}}{\alpha_{AC}}\,(1 - \alpha_A)^2 + 2\sqrt{\frac{d_{AC}}{\alpha_{AC}}}\,\sqrt{\alpha_A}\,(1 - \alpha_A). \tag{A12}$$

If $d_{AC} > 0$ and $\alpha_A < 1$, the sum of the second and the third term in the right-hand side of Eq. (A12) is positive. It is then possible that the inequality in Eq. (A12) is not satisfied even though $\alpha_B > \alpha_A$. Hence, it is possible that $\mathrm{MSD}_{BC} > \mathrm{MSD}_{AC}$, even though $\alpha_B > \alpha_A$. Indicator C then

indicator C is to give the incorrect impression that indicator A is more accurate than indicator B. In the extreme case in which $d_{AC} = 1$, it is even impossible for indicator B to be considered more accurate than indicator A.
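The bias caused by a dependence between indicators A and C can also be illustrated by simulation. The sketch below generates the estimates of indicator C according to Eq. (A8), with $\alpha_{AC}$ as in Eq. (A11), and compares the case without dependence with a case with dependence. All parameter values and the function name are arbitrary choices made for illustration.

```python
import math
import random


def simulate_msd(alpha_a, alpha_b, alpha_c, d_ac, n=200_000, seed=1):
    """Monte Carlo estimate of MSD_AC and MSD_BC when C depends on A (Eq. A8)."""
    rng = random.Random(seed)
    alpha_ac = (1 - d_ac) * alpha_c + d_ac * alpha_a  # Eq. (A11)
    sq_ac = sq_bc = 0.0
    for _ in range(n):
        v, e_a, e_b, e_c = (rng.gauss(0.0, 1.0) for _ in range(4))
        v_a = math.sqrt(alpha_a) * v + math.sqrt(1 - alpha_a) * e_a
        v_b = math.sqrt(alpha_b) * v + math.sqrt(1 - alpha_b) * e_b
        # Indicator C shares the error term e_a with indicator A.
        v_c = (math.sqrt(alpha_ac) * v
               + math.sqrt((1 - d_ac) * (1 - alpha_c)) * e_c
               + math.sqrt(d_ac * (1 - alpha_a)) * e_a)
        sq_ac += (v_a - v_c) ** 2
        sq_bc += (v_b - v_c) ** 2
    return sq_ac / n, sq_bc / n


# Indicator B is the more accurate one (0.6 versus 0.5).
independent = simulate_msd(0.5, 0.6, 0.3, d_ac=0.0)
dependent = simulate_msd(0.5, 0.6, 0.3, d_ac=0.5)

print("d_AC = 0.0:", independent)  # B is closer to C, as it should be
print("d_AC = 0.5:", dependent)    # A now appears closer to C
```

In the second case, the closed-form values from Eqs. (A9) and (A10) are approximately 0.40 for $\mathrm{MSD}_{AC}$ and 1.02 for $\mathrm{MSD}_{BC}$, so indicator A appears more accurate although $\alpha_B > \alpha_A$.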

We have now demonstrated the importance of the independence assumption when an indicator C is used to evaluate the accuracy of indicators A and B. In the evaluation framework presented in section 2, the independence assumption has a similar importance. When a relatedness measure C is used to evaluate the accuracy of clustering solutions obtained using relatedness measures A and B, it is important that relatedness measure C is (approximately) independent of relatedness measures A and B. For instance, if there is a dependence between relatedness measures A and C, evaluations performed using relatedness measure C will be biased in favor of clustering solutions obtained using relatedness measure A.

APPENDIX B: CONSISTENT AND INCONSISTENT EVALUATION FRAMEWORKS

In this appendix, we formally show the consistency of the evaluation framework proposed in section 2. We also present an example of an inconsistent evaluation framework.

B.1. Consistency of the Proposed Evaluation Framework

Consider two relatedness measures X and Y. Suppose that we have obtained a clustering solution $c_1^X, \ldots, c_N^X$ for relatedness measure X by maximizing the quality function in Eq. (2) using the resolution parameter $\gamma^X$. In addition, we have obtained a clustering solution $c_1^Y, \ldots, c_N^Y$ for relatedness measure Y by maximizing the same quality function using the resolution parameter $\gamma^Y$. Suppose also that the two clustering solutions satisfy the condition in Eq. (5). Hence, the two clustering solutions have the same granularity. When the accuracy of the two clustering solutions is evaluated using relatedness measure X, it is guaranteed that the clustering solution obtained using relatedness measure X will be more accurate than the clustering solution obtained using relatedness measure Y. More precisely, it is guaranteed that $A^{X|X} \geq A^{Y|X}$, where $A^{X|X}$ and $A^{Y|X}$ denote the accuracy of the two clustering solutions according to the accuracy measure in Eq. (4). This result shows the consistency of our evaluation framework.

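To prove the above result, suppose that $A^{X|X} < A^{Y|X}$. It then follows from Eq. (4) that

$$\sum_{i,j} I\left(c_i^X = c_j^X\right) r_{ij}^X < \sum_{i,j} I\left(c_i^Y = c_j^Y\right) r_{ij}^X. \tag{B1}$$

The granularity condition in Eq. (5) states that

$$\sum_k \left(s_k^X\right)^2 = \sum_l \left(s_l^Y\right)^2. \tag{B2}$$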

Table B.1. Relatedness of publications according to relatedness measure X

Eqs. (B1) and (B2) imply that

$$\sum_{i,j} I\left(c_i^X = c_j^X\right) r_{ij}^X - \gamma^X \sum_k \left(s_k^X\right)^2 < \sum_{i,j} I\left(c_i^Y = c_j^Y\right) r_{ij}^X - \gamma^X \sum_l \left(s_l^Y\right)^2. \tag{B3}$$

It now follows from Eqs. (2) and (B3) that $c_1^Y, \ldots, c_N^Y$ offers a higher quality clustering solution for relatedness measure X and resolution parameter $\gamma^X$ than $c_1^X, \ldots, c_N^X$. However, this is not possible, since $c_1^X, \ldots, c_N^X$ is defined as the clustering solution that maximizes Eq. (2) for relatedness measure X and resolution parameter $\gamma^X$. We therefore have a contradiction. This proves that $A^{X|X} \geq A^{Y|X}$.

A minor qualification needs to be made. In practice, heuristic algorithms are usually used to maximize the quality function in Eq. (2). There is no guarantee that these algorithms are able to find the global maximum of the quality function (see section 4.2). In exceptional cases, this might cause the consistency of our evaluation framework to be violated.
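For very small problems, the consistency property can be verified by exhaustive search, which sidesteps the heuristic-algorithm caveat. The sketch below enumerates all clustering solutions of five publications, maximizes the quality function in the form that appears in Eq. (B3) (the sum of within-cluster relatedness minus γ times the sum of squared cluster sizes), and checks the inequality whenever the condition in Eq. (B2) holds. The random integer relatedness values, the integer resolution parameters, and all function names are hypothetical. The common factor 1/N of the accuracy is omitted because it does not affect the comparison.

```python
import random


def partitions(n):
    """Yield every clustering of n items as a tuple of cluster labels."""
    def extend(prefix, n_clusters):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for label in range(n_clusters + 1):
            yield from extend(prefix + [label], max(n_clusters, label + 1))
    yield from extend([], 0)


def within(labels, r):
    """Sum of r_ij over ordered pairs (i, j), i != j, in the same cluster."""
    n = len(labels)
    return sum(r[i][j] for i in range(n) for j in range(n)
               if i != j and labels[i] == labels[j])


def sum_sq(labels):
    """Sum of squared cluster sizes, as used in Eq. (B2)."""
    return sum(labels.count(k) ** 2 for k in set(labels))


def best_clustering(r, gamma, candidates):
    """Exact maximizer of within(c, r) - gamma * sum_sq(c)."""
    return max(candidates, key=lambda c: within(c, r) - gamma * sum_sq(c))


def random_relatedness(n, rng):
    r = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r[i][j] = r[j][i] = rng.randint(0, 9)
    return r


def count_violations(trials=200, n=5, seed=0):
    rng = random.Random(seed)
    candidates = list(partitions(n))
    compared = violations = 0
    for _ in range(trials):
        rx, ry = random_relatedness(n, rng), random_relatedness(n, rng)
        cx = best_clustering(rx, rng.randint(1, 9), candidates)
        cy = best_clustering(ry, rng.randint(1, 9), candidates)
        if sum_sq(cx) == sum_sq(cy):         # same granularity, Eq. (B2)
            compared += 1
            if within(cx, rx) < within(cy, rx):  # would contradict the result
                violations += 1
    return compared, violations


print(count_violations())
```

Because integer arithmetic is used and the maximization is exact, the number of violations is zero by the argument given above. With a heuristic optimizer in place of `best_clustering`, this guarantee would no longer hold.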

B.2. Example of an Inconsistent Evaluation Framework

Consider an evaluation framework in which clustering solutions are compared using Eq. (4) subject to a granularity condition requiring that clustering solutions consist of the same number of clusters. This granularity condition, which was used by Klavans and Boyack (2017), replaces the granularity condition in Eq. (5). The following example shows that this evaluation framework is inconsistent.

Suppose we have six publications, labeled P1 to P6. Consider two relatedness measures X and Y. Tables B.1 and B.2 show the relatedness of the six publications according to relatedness measures X and Y, respectively. Suppose that the resolution parameter γ is set to a value of 1.1. For relatedness measure X, maximization of the quality function in Eq. (2) then yields two clusters, one consisting of publications P1 to P3 and the other consisting of publications P4 to P6. For relatedness measure Y, we also obtain two clusters, one consisting of publications P1 to P5 and the other consisting only of publication P6. Since the two clustering solutions both consist of two clusters, our granularity condition is satisfied.

Table B.2. Relatedness of publications according to relatedness measure Y

Based on Eq. (4), we now compare the two clustering solutions. Using relatedness measure Y to evaluate the accuracy of the clustering solutions, we obtain $A^{X|Y} = 10/3$ and $A^{Y|Y} = 20/3$. Hence, as we would intuitively expect, according to relatedness measure Y, the clustering solution obtained using relatedness measure Y is more accurate than that obtained using relatedness measure X. Let us now use relatedness measure X to evaluate the accuracy of the clustering solutions. This yields $A^{X|X} = 4$ and $A^{Y|X} = 28/6$. In other words, we obtain the counterintuitive result that, according to relatedness measure X, the clustering solution obtained using relatedness measure X is less accurate than the one obtained using relatedness measure Y. This shows the inconsistency of this evaluation framework.
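The reason why the consistency result of Appendix B.1 does not apply to this example can be made explicit with a short calculation. The two clustering solutions have the same number of clusters, but their cluster sizes (3 and 3 versus 5 and 1) give different sums of squared cluster sizes, so the condition in Eq. (5) is not satisfied. The sketch below only uses the cluster sizes stated in the example. The function name is hypothetical.

```python
def sum_squared_sizes(sizes):
    """Left-hand or right-hand side of the condition in Eq. (5)."""
    return sum(s * s for s in sizes)


n = 6
sizes_x = [3, 3]  # clusters {P1, P2, P3} and {P4, P5, P6}
sizes_y = [5, 1]  # clusters {P1, ..., P5} and {P6}

print(len(sizes_x) == len(sizes_y))                            # same number of clusters
print(sum_squared_sizes(sizes_x), sum_squared_sizes(sizes_y))  # 18 versus 26
print(n / sum_squared_sizes(sizes_x), n / sum_squared_sizes(sizes_y))  # Eq. (16)
```

According to Eq. (16), the solution obtained using relatedness measure Y is therefore less granular than the one obtained using relatedness measure X, which is what allows it to capture more relatedness even when X is the evaluation criterion.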

APPENDIX C: EXTENDED DIRECT CITATION APPROACH

In this appendix, we discuss the technical details of the extended direct citation approach introduced in section 3.1.

Our aim is to cluster publications $1, \ldots, N$. We refer to these publications as our focal publications. To cluster the focal publications, we also consider publications $N + 1, \ldots, N^{\mathrm{EXT}}$. Each of these nonfocal publications has a direct citation relation with at least two focal publications. For $i = 1, \ldots, N$ and $j = 1, \ldots, N^{\mathrm{EXT}}$, the relatedness of publications i and j in the extended direct citation approach is given by

$$r_{ij} = \max\left(c_{ij}, c_{ji}\right), \tag{C1}$$

where $c_{ij}$ indicates whether publication i cites publication j ($c_{ij} = 1$) or not ($c_{ij} = 0$).

Following the ideas presented in section 3.4, for $i = 1, \ldots, N$ and $j = 1, \ldots, N^{\mathrm{EXT}}$, the normalized relatedness of publication i with publication j in the extended direct citation approach equals

$$\hat{r}_{ij} = \frac{r_{ij}}{\sum_k r_{ik}}. \tag{C2}$$

To accommodate the nonfocal publications $N + 1, \ldots, N^{\mathrm{EXT}}$, the quality function in Eq. (1) needs to be adjusted. In the extended direct citation approach, publications $1, \ldots, N^{\mathrm{EXT}}$ are assigned to clusters $c_1, \ldots, c_{N^{\mathrm{EXT}}}$ by maximizing the quality function

$$Q = \sum_{i=1}^{N} \left[ \sum_{j=1}^{N} I\left(c_i = c_j\right)\left(\hat{r}_{ij} - \gamma\right) + \sum_{j=N+1}^{N^{\mathrm{EXT}}} I\left(c_i = c_j\right) \hat{r}_{ij} \right]. \tag{C3}$$

The nonfocal publications are treated in a special way in Eq. (C3). The costs and benefits of assigning a publication to a cluster are different for the nonfocal publications than for the focal ones. On the one hand, there is no cost in assigning a nonfocal publication to a cluster. To see this, notice that there is no subtraction of γ in the second term within the square brackets in Eq. (C3). On the other hand, nonfocal publications do not yield benefits in the same way as focal publications do. To see this, notice that the outer summation in Eq. (C3) extends only over the focal publications. The nonfocal publications are not included in this summation.
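The asymmetric treatment of focal and nonfocal publications can be made concrete with a small sketch that evaluates Eqs. (C1) to (C3) for a toy citation matrix. The data layout (focal publications first, zero-based indices), the function names, and the toy data are hypothetical. The sketch also assumes that publications do not cite themselves and that the sum in Eq. (C2) runs over all publications, focal and nonfocal.

```python
def edc_relatedness(cites, n_focal):
    """Normalized relatedness of each focal publication with all publications.

    cites[i][j] is 1 if publication i cites publication j, else 0.
    Implements r_ij = max(c_ij, c_ji), Eq. (C1), followed by the
    row normalization of Eq. (C2).
    """
    n_ext = len(cites)
    rows = []
    for i in range(n_focal):
        raw = [max(cites[i][j], cites[j][i]) for j in range(n_ext)]
        total = sum(raw)
        rows.append([r / total if total else 0.0 for r in raw])
    return rows


def edc_quality(clusters, rel, n_focal, gamma):
    """Quality function of Eq. (C3); the outer sum runs over focal publications only."""
    q = 0.0
    for i in range(n_focal):
        for j in range(n_focal):                # focal j: relatedness minus gamma
            if clusters[i] == clusters[j]:
                q += rel[i][j] - gamma
        for j in range(n_focal, len(clusters)):  # nonfocal j: no gamma penalty
            if clusters[i] == clusters[j]:
                q += rel[i][j]
    return q


# Toy data: publications 0-2 are focal, publication 3 is nonfocal.
# Publication 1 cites 0, and the nonfocal publication 3 cites 0 and 1.
cites = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
rel = edc_relatedness(cites, n_focal=3)

q_together = edc_quality([0, 0, 1, 0], rel, n_focal=3, gamma=0.1)
q_apart = edc_quality([0, 0, 1, 1], rel, n_focal=3, gamma=0.1)
print(q_together, q_apart)
```

In this toy example, placing the nonfocal publication in the cluster of the two focal publications it cites yields the higher value of the quality function, which illustrates how nonfocal publications can pull related focal publications together.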

After the quality function in Eq. (C3) has been maximized, we discard the cluster assignments $c_{N+1}, \ldots, c_{N^{\mathrm{EXT}}}$ of the nonfocal publications, since we are interested only in the cluster