
The AE-algorithm: Author name disambiguation for large Web of Science datasets

Olmo van den Akker (5843049)

August 21, 2014

Bachelor’s Thesis Psychology

University of Amsterdam

Supervisor: Sacha Epskamp

Number of words: 7,128


Abstract

The creation of co-authorship networks is a valuable way to depict the social structure of scientific fields. However, these co-authorship networks often get distorted because of the problems of author name synonymy and author name homonymy. The practice of author name disambiguation tries to solve these problems by correctly identifying the authors of scientific articles. A multitude of algorithms have been put forward in the context of author name disambiguation, but none of them are suitable for large datasets of the Web of Science database. Therefore, this thesis proposes a new algorithm, the AE-algorithm, which is specifically designed for this purpose. Features of the algorithm include co-authorship, e-mail address, institutional affiliation, cited references, and article keywords. The AE-algorithm will be evaluated using the DBLP-WoS test set, a test set that is based on test sets used in previous studies.


Introduction

Ever since people began to systematically study the physical and natural world, they have been practicing science (Oxford Dictionaries, 2014). Among the first to do so were the Babylonians, who used mathematics to map the motions of the sun, the moon, and the planets (Price, 1961). However, it was not until the Ancient Greeks that science itself became an object of study. An important event to kick-start the study of science was the publication of Plato’s Theaetetus around 360 BC, wherein Socrates raised the question: “what is knowledge?” (Plato, trans. 1871). Aristotle took this question further and sought to find the best method for gaining knowledge (Dooremalen, De Regt, & Schouten, 2007). The questions posed by these Greek philosophers were largely philosophical in nature; the study of knowledge and science was itself not yet scientific.

This would change in the twentieth century with Derek de Solla Price’s seminal publication Little Science, Big Science (Price, 1963). Price emphasized that science itself can be scientifically studied just like other, more mainstream scientific topics such as physics and biology. Moreover, he advocated the use of quantitative methods to do so. Price labeled the systematic study of science ‘the science of science’. Later on, this discipline has come to be known as scientometrics. Formally, the field of scientometrics encompasses a variety of approaches “sharing the general idea that quantifiable aspects of sciences should be extracted and used to measure whatever it is that can be measured with them” (Rip & Courtial, 1984). Most notably, scientometrics is concerned with evaluating the quality of scientific output, and with mapping the specific structure of scientific fields. Influential scientometric measures of research quality include college and university rankings, the journal impact factor, and the h-index for individual researchers. The structure of scientific fields can be mapped by creating citation networks or co-authorship networks. The focus of this thesis lies on the creation of co-authorship networks.

Co-authorship networks are a specific type of social network. A social network can be defined as “a set of people or groups, each of which has connections of some kind to some or all of the others” (Scott, 2000). In social networks, people are represented by so-called nodes, while the connections between these nodes are represented by so-called links. In the case of co-authorship networks, authors of scientific publications are connected to each other when they have published at least one paper together in a set of references. For example, Figure 1 shows a very simple co-authorship network that is based on three references. In general, scientific co-authorship networks are a valuable way to depict the social structure of science. The creation of such networks has taken a great leap forward with the advent of comprehensive online bibliographies because publication information is now much more readily available. Newman (2001) was the first to construct entire co-authorship networks based on bibliographic data. Since then, co-authorship networks have been established for a wide range of scientific fields, including sociology (Moody, 2004), management (Acedo, Barroso, Casanueva, & Galán, 2006), and nanotechnology (Onel, Zeid, & Kamarthi, 2011). In addition, co-authorship networks have been created for scientific journals (Hou, Kretschmer, & Liu, 2008; Fatt, Ujum, & Ratnavelu, 2010), and countries (Gossart & Özman, 2009; Perc, 2010).

Figure 1

A Co-authorship Network Based on Three References

(The depicted network connects the nodes Van Vugt, M.; Spisak, B. R.; Park, J. H.; and Nicholson, N. on the basis of three references: Van Vugt & Spisak, 2008; Van Vugt & Park, 2009; and Spisak, Nicholson, & van Vugt, 2011.)

Smalheiser and Torvik (2009) put forward a couple of reasons why scientific co-authorship networks are valuable to the scientific community. Firstly, co-authorship networks can give information about authors’ publication profiles and can pinpoint key authors in different scientific disciplines. This information is useful for authors who are looking for potential collaborators, for journal editors looking for reviewers, and for conference organizers looking for invitees. Secondly, co-authorship networks are useful for funding agencies that are looking for referees, since identifying the co-authors of these referees can help avoid potential conflicts of interest. In addition to these benefits, co-authorship networks have been used to measure the research quality of individual authors (Abbasi, Altmann, & Hossain, 2011; Liao, 2011; McCarty, Jawitz, Hopkins, & Goldman, 2013).

The problem of author name disambiguation

A major difficulty in the formation of co-authorship networks is the problem of correctly identifying the authors of publications. This problem is called the problem of author name disambiguation (AND) and it consists of two independent sub-problems: the problem of synonymy and the problem of homonymy (Velden, Haque, & Lagoze, 2011). The synonymy problem concerns the possibility that the same author is split into two nodes because his name is spelled differently in different publications. This could be due to variations in spelling, typographical errors, translation errors, and name changes over time (Tang & Walsh, 2010). In addition, encoding might cause synonymy because author names in online bibliographic databases may or may not be parsed (i.e. divided into parts that can be made useful for computer programs) if they include letters like á, ë, and ø. The homonymy problem concerns the possibility that different authors are compounded into one node because they share the same name.

Velden et al. (2011) propose four reasons why the problem of AND has been so hard to overcome. First, different bibliographic databases provide different kinds of information about articles and authors, which makes it difficult to find an AND solution that is effective for every database. Second, there is no standardized set of benchmark data available on which AND solutions can be tested. Third, every AND solution has its own specific characteristics, making it difficult to compare the solutions with each other. Fourth, AND solutions often do not scale to large datasets. In addition to these four reasons, the rapid increase in the number of scientific researchers (UNESCO, 2010) has added to the problem. The increase in researchers is especially prominent in China and Korea, which exacerbates the problem even more since it has been established that AND is especially difficult for names of Chinese and Korean origin (Strotmann & Zhao, 2012). The reason for this is the existence of unique family name patterns in East Asian countries. While Western names are generally diversified through last names, East Asian names are diversified mainly through first names. Strotmann and Zhao illustrate this by pointing out that just two dozen last names account for half of the Chinese population, and that just three last names account for nearly half of the Korean population. This results in an increasingly large number of ambiguous author names, mainly because bibliographic databases often provide only initials and last names.

The problem of AND has long been overlooked. Tang and Walsh (2010) point out that only two out of the 515 articles published in Scientometrics between 2006 and 2009 explored the issue. Since then, the problem has received more attention, but arguably not enough. The lack of awareness is problematic because it has been shown that ignoring the problem of AND can lead to significantly flawed co-authorship networks. For example, author name ambiguity led to substantially distorted author mappings in the fields of stem cell research (Strotmann & Zhao, 2012) and physical chemistry (Velden et al., 2011). Velden et al. found that author name ambiguity mainly distorts the nodes that most crucially determine the mesoscopic structure of co-authorship networks (i.e. nodes that link network clusters to each other).

Solutions to the problem of AND

Solutions to the problem of AND generally involve the development of an algorithm to disambiguate the raw data. These algorithms are based on information that is available in bibliographic databases. The algorithms come in broadly two forms: supervised algorithms and unsupervised algorithms (Smalheiser & Torvik, 2009). For supervised algorithms, a sample of the data of interest is required that has been manually disambiguated. Such a sample is called a training set and is used to ‘train’ the algorithm that will disambiguate the rest of the data. Training the algorithm means that it learns the specific links between bibliographic information and author identities. The training set may, for example, indicate that institutional affiliation and co-authorship are good predictors of author identity, which leads the algorithm to heavily weigh these predictors when disambiguating the author names in the rest of the data. Crucial in this supervised approach is that the training set is representative of the rest of the data. If a sample is not representative, a biased algorithm might result. For example, if the sample mainly involves authors from American universities, who tend to change affiliation more often than, for example, researchers at European universities (Science Europe & Elsevier, 2013), the predictive power of institutional affiliation may be underestimated for more global datasets. In general, the larger and the more varied the targeted dataset, the larger and the more varied the training set should be. Of course, this comes at a cost, since the time and effort it takes to manually assemble such a training set can be quite high. Unsupervised approaches do not rely on training sets, but involve algorithms that are constructed without sample data. The development of these algorithms mainly relies on intuition, previous findings, and laborious testing. In recent years, a multitude of studies have been undertaken using either the supervised or the unsupervised approach. This has led to a wide variety of disambiguation algorithms, all of which claim a decent success rate (for an overview, see Elliot, 2010).
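To make the supervised approach concrete, the following is a minimal sketch, assuming a pairwise formulation in which each manually disambiguated name pair becomes one training example. The feature set, the toy data, and the use of scikit-learn's LogisticRegression are illustrative assumptions, not part of any algorithm discussed in this thesis.

```python
# Sketch: a supervised pairwise disambiguator. Each row describes a pair of
# author name occurrences; the label says whether the pair was manually
# verified to be the same individual.
from sklearn.linear_model import LogisticRegression

# Hypothetical features per pair: [shared co-authors, same affiliation (0/1),
# shared keywords]
X_train = [
    [2, 1, 3],  # verified: same individual
    [0, 0, 1],  # verified: different individuals
    [1, 1, 0],  # verified: same individual
    [0, 1, 0],  # verified: different individuals
]
y_train = [1, 0, 1, 0]  # ground-truth labels from the training set

model = LogisticRegression()
model.fit(X_train, y_train)

# The learned coefficients play the role of the feature weights discussed
# above: strongly predictive features receive large weights.
print(dict(zip(["co_authors", "affiliation", "keywords"], model.coef_[0])))
```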

The vast majority of the algorithms are not employed for the entire set of author names, but only for pairs of author names that are sufficiently similar. This practice is called blocking and can dramatically reduce computational cost. There are several blocking methods, the most popular of which is to only consider pairs of names that match on last name and first initial. In this way, J. S. Park will be compared with Ji-Sung Park and J. Park, but not with Yoo-Chun Park, Y. Park, or less similar names. An added advantage of blocking methods is that they go a long way in addressing the problem of author name synonymy. One of the major causes of author name synonymy is that different journals sometimes employ different spelling conventions. For example, Cota et al. (2010) noticed that the individual Mohammed Zaki is referred to in the DBLP database as Mohammed Zaki, Mohammed J. Zaki, and Mohammed Javeed Zaki. Because these names are so different, they will usually not be compared in an AND algorithm. However, when only last name and first initial are considered, all of these author names are identical. This means that all of them will be compared in an AND algorithm, which makes it possible to see whether they correspond to the same individual. Even so, different spellings are not the only cause of author name synonymy. Author names of the same individual could also differ due to misspellings or mistranslations (Tang & Walsh, 2010). In order to account for this, so-called string similarity metrics can be used that measure the similarity between two text strings. Examples of such metrics are Jaccard similarity, Levenshtein distance, TFIDF, and the Jaro-Winkler distance (Cohen, Ravikumar, & Fienberg, 2003). The Levenshtein distance, for example, is defined as the minimum number of edits (insertions, deletions, or substitutions) required to transform one text string into another. So, if a maximum Levenshtein distance of 2 is chosen, D. Borsboom will be compared to G. Borsbom (a Levenshtein distance of 2), but not to D. Barsbam (a Levenshtein distance of 3). In this way, string similarity metrics can account for misspellings and mistranslations.
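The two ideas above, blocking on last name plus first initial and filtering on Levenshtein distance, can be made concrete with a short sketch. The function names are illustrative; the Levenshtein implementation is the standard dynamic-programming one.

```python
# Sketch: a blocking key and a standard Levenshtein distance.
def blocking_key(last_name, first_name):
    """Names fall in the same block if last name and first initial agree."""
    return (last_name.lower(), first_name[0].lower())

# J. S. Park, Ji-Sung Park and J. Park share a block; Yoo-Chun Park does not.
assert blocking_key("Park", "J. S.") == blocking_key("Park", "Ji-Sung")
assert blocking_key("Park", "J. S.") != blocking_key("Park", "Yoo-Chun")

def levenshtein(s, t):
    """Minimum number of insertions, deletions, or substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

# The examples from the text: a maximum distance of 2 keeps the first pair.
assert levenshtein("D. Borsboom", "G. Borsbom") == 2
assert levenshtein("D. Borsboom", "D. Barsbam") == 3
```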

A crucial element of AND is the selection of bibliographic information that can be used as input for the disambiguation algorithm. Due to the vast increase in the number and size of bibliographic databases, many types of information have become available which can aid in the disambiguation process. For example, Web of Science provides for any given publication the following information that can be useful for AND: co-author names, institutional affiliation of the authors, e-mail address of the authors, article title, journal title, year of publishing, article keywords, abstract, and cited references. All of this information can, in one way or another, be implemented in disambiguation algorithms.

The most commonly included feature in AND algorithms is co-author names. Suppose that one publication is authored by Name A and Name B, and another publication is authored by Name A, Name B, and Name C. In this case, it is highly unlikely that Name A has written articles with two different people that happen to have exactly the same name. Therefore we can conclude that the Name B of the first publication and the Name B of the second publication are the same individual (and, likewise, that the Name A of the first publication and the Name A of the second publication are the same individual). In general, the assumption is made that two author names correspond to the same individual if these names share one or more co-authors. This procedure has consistently been shown to be extremely effective in the process of AND (Torvik, Weeber, Swanson, & Smalheiser, 2005; Wooding, Wilcox-Jay, Lewison, & Grant, 2006; Kang, Na, Lee, Jung, Kim, Sung, & Lee, 2009; Velden et al., 2011).

Although including co-author names as a feature in AND algorithms is effective, there are situations where co-authorship fails to aid in the disambiguation process. Suppose that one publication is authored by Name A and Name B, and another publication is authored by Name A and Name C. In this case we cannot be certain that Name A wrote both articles; there might just as well be two different individuals that have Name A. This is called the problem of transitivity. This problem may occur for several reasons. For instance, if an author is active in two separate research fields, there may be no overlap between his co-authors in the two fields. Because of this, the identity of that author cannot be determined on the basis of co-author names alone. Another situation in which the problem of transitivity occurs is when an author publishes an article with an incidental co-author, an author that has only made a single contribution to science. This co-author cannot be linked to other co-authors, so co-author names are again useless for establishing researcher identity. Torvik and Smalheiser (2009) found that about 46% of the authors in the MEDLINE database have published only one article and are thus incidental co-authors. This situation therefore seems to be quite common. In addition to the problem of transitivity, there is also the problem of single-authored papers. Co-authorship cannot help in the AND of such papers because no co-authorship information is available. To solve both the problem of transitivity and the problem of single-authored papers, additional algorithm features are required.

An additional feature that is especially powerful is e-mail address. The reason for this is that e-mail addresses are a type of personally identifiable information (McCallister, Grance, & Scarfone, 2010). That is, no two individuals have the same e-mail address, which means that e-mail addresses can definitively determine the identity of an individual. Whenever Name A is the author of two single-authored papers, it is possible to identify him as the same individual when the e-mail addresses of both Name A’s match. The problem with e-mail address as an algorithm feature is that it is not often provided in bibliographic databases.

Another feature that can give information about author identities is institutional affiliation. This feature is implemented in many algorithms based on the assumption that it is unlikely that two individuals that work at the same institute have exactly the same name. Although this makes intuitive sense, there are some downsides to using institutional affiliation. Firstly, the names of scientific institutions are sometimes subject to misspellings or mistranslations. Tang and Walsh (2010) point out that the Chinese Academy of Sciences is often abbreviated as CAS, and that Peking University is often translated into Beijing University. Because the institution names do not match, they cannot be used to efficiently identify authors. Secondly, researchers are not passive entities, but often move from institution to institution during their career. This means that Name A early in his career may not be identified as Name A later in his career because the two are linked to different institutes. Thirdly, there is the possibility that family members work at the same institute. These family members often have the same last name, and sometimes even the same initials. In such cases, algorithms that use institutional affiliation may incorrectly determine that both family members are the same individual. All of these disadvantages diminish the effectiveness of institutional affiliation in AND algorithms.

Other commonly used features in disambiguation algorithms pertain to the content of the author’s publications, since it is reasonable to assume that author names that are involved in the same scientific topic correspond to the same individual. The content of an article can be deduced from the title, the abstract, and the keywords of the article, and from the title of the journal. Some of these features are widely used, while others are used infrequently or not at all. Abstracts are very rarely used in AND algorithms, since many bibliographic databases do not provide abstracts. In addition, it is difficult to deduce the article topic from lengthy abstracts. Article titles are provided in bibliographic databases more often and offer efficient summaries of what an article is about, but they can still be hard to clean (i.e. remove words that have no bearing on the article topic) and parse. There is also information on article topic that does not need to be cleaned: article keywords. For many articles, authors put forward the words that are most critical to the subject of their article. These words can be used for AND because author names that have keywords in common are probably involved in the same research topic, and are therefore probably the same individual. Finally, journal title can also shed light on the research topic of authors. The rationale here is that it is unlikely that different authors that have the same name also publish in the same journal. However, this can be debated, as journals publish a lot of articles. Björk, Roos, and Lauri (2009) found that the average number of papers in Web of Science journals was 111.7 in 2006. This large output means that it is actually quite probable that different individuals with the same name publish in the same journal.

Cited references can also be used in AND. Authors often publish multiple articles on the same line of research and tend to cite the same references. Thus, whenever Author A has many references in common with another Author A, it is likely that these author names correspond to the same individual. Another way in which cited references can aid in AND is by looking at cross-citations. Cross-citations occur when an author name A cites an article of another author name A. It is highly likely that these two author names are in fact the same individual, because authors tend to cite themselves much more often than they cite another person with the same name. Finally, when two equal author names cite a reference of that same author name, it is likely that both author names cited a previous article of themselves; therefore, the names probably correspond to the same individual.

AND algorithms vary greatly in their characteristics because most algorithms are created for different purposes. For instance, the algorithm of Torvik and Smalheiser (2009) was developed specifically for the MEDLINE database, while the algorithm of Huang, Ertekin, and Giles (2006) was developed for large-scale databases in general. This study adds to the existing literature by presenting an algorithm, the Akker-Epskamp (AE) algorithm, that is specifically designed for large datasets from the Web of Science database. In summary, the AE-algorithm uses co-author names, e-mail address, institutional affiliation, cited references, and article keywords. The next section describes how these features are implemented in the algorithm.

The AE-algorithm

Figure 2 and Figure 3 show flow charts of the two phases of the AE-algorithm: the blocking phase and the disambiguation phase. The AE-algorithm is an example of an unsupervised AND approach. We have chosen this over a supervised approach because the AE-algorithm is meant for the disambiguation of large datasets, which makes it untenable to create a training set that is sufficiently large and varied to represent the dataset as a whole.

The first step of the blocking phase is to check whether author names have authored a paper together. Obviously, when author names authored a paper together they do not correspond to the same individual. In the next steps, Levenshtein distance and Damerau-Levenshtein distance are used to determine the level of similarity between the remaining author names. Author names are cleaned in the sense that the spaces in last names are removed, and only initials are considered and not full first names. For the initials, the maximum Levenshtein distance is set to 1. In contrast to many other studies (e.g. Han, Zha, & Giles, 2005; Strotmann, Zhao, & Bubela, 2009; Gurney, Horlings, & Van den Besselaar, 2012), we consider 2 author initials instead of 1 because it reduces computational cost. The use of 2 initials, as well as a maximum Levenshtein distance of 1, means that names that differ on two initials will not be compared in the algorithm. This is acceptable since it is highly unlikely that these names correspond to the same individual. The reason we have not used all available initials is that the maximum Levenshtein distance of 1 implies that entirely feasible comparisons will not be made (e.g. J. Tolkien versus J. R. R. Tolkien). For the last names, the Damerau-Levenshtein distance is used, which differs from the Levenshtein distance in the sense that it counts the reversal of two adjacent letters as 1 edit instead of 2. The maximum Damerau-Levenshtein distance for last names is 1. In this way, the algorithm accounts for the most common typographic errors, which are the omission of 1 letter, the addition of 1 letter, the substitution of 1 letter, and the reversal of 2 adjacent letters (Damerau, 1964). The blocking phase of the AE-algorithm is shown in Figure 2.

Figure 2

Flow Chart of the Blocking Phase of the AE-algorithm

(The flow chart poses three questions in sequence: 1. Did the author names (ANs) author a paper together? Yes → Split; No → continue. 2. Is the Levenshtein distance between the two initials of the ANs 2 or higher? Yes → Split; No → continue. 3. Is the Damerau-Levenshtein distance between the author last names (ALNs) 2 or more? Yes → Split; No → go to the disambiguation phase of the AE-algorithm.)
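As a minimal sketch of these three steps, the code below assumes author names have already been cleaned to at most two initials and a last name without spaces, and it reuses the levenshtein() function from the earlier sketch. All names and types are illustrative.

```python
# Sketch: the blocking predicate of Figure 2.
from collections import namedtuple

AuthorName = namedtuple("AuthorName", ["initials", "last"])

def damerau_levenshtein(s, t):
    """Levenshtein distance that counts a swap of two adjacent letters as 1 edit."""
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(t) + 1)]
         for i in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def same_block(a, b, authored_paper_together):
    """The three questions of Figure 2, in order; False means 'Split'."""
    if authored_paper_together:                     # step 1
        return False
    if levenshtein(a.initials, b.initials) >= 2:    # step 2
        return False
    if damerau_levenshtein(a.last, b.last) >= 2:    # step 3
        return False
    return True  # proceed to the disambiguation phase

# Example: a transposed last name still ends up in the same block.
print(same_block(AuthorName("JR", "tolkien"), AuthorName("JR", "tolkein"), False))
```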

In the disambiguation phase of the AE-algorithm, co-author names are the most important feature. The assumption is made that two author names correspond to the same individual when they have one co-author in common. E-mail address and institutional affiliation are also included as features, despite the disadvantages mentioned previously. In addition, cited references are used in the form of cross-citations, co-citations, and self-citations. The cited references are compared on the basis of DOIs and cleaned names (the first author’s last name and first initial, and the year of publishing). The final feature of the algorithm is author keywords. Figure 3 shows the complete disambiguation phase of the AE-algorithm.


Figure 3

Flow Chart of the Disambiguation Phase of the AE-algorithm

(The flow chart poses the following questions in sequence, where a ‘Yes’ merges the author names (ANs), a ‘No’ passes them to the next question, and a ‘No’ on the final question splits them: 1. Do the e-mail addresses of the ANs match? 2. Do the ANs share a co-author? 3. Do the institutional affiliations of the ANs match? 4. Does one AN cite the other AN’s paper? 5. Do the cited references of both ANs have at least 3 references in common? 6. Do the ANs cite a self-citation of the other AN? 7. Do the ANs have at least 2 article keywords in common?)
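A minimal sketch of this cascade is given below, assuming each author name occurrence carries an e-mail address, an affiliation, and sets of co-authors, own papers, cited references, self-citations, and keywords. The field names are illustrative, and the ordering of the questions follows the EAIXC3SK2 naming convention introduced later in the text.

```python
# Sketch: the decision cascade of Figure 3 for a pair that survived blocking.
def should_merge(a, b):
    """Return True (Merge) or False (Split)."""
    if a.email and a.email == b.email:                      # E: e-mail match
        return True
    if a.coauthors & b.coauthors:                           # A: shared co-author
        return True
    if a.affiliation and a.affiliation == b.affiliation:    # I: same institute
        return True
    if a.cited_refs & b.papers or b.cited_refs & a.papers:  # X: cross-citation
        return True
    if len(a.cited_refs & b.cited_refs) >= 3:               # C3: >= 3 shared refs
        return True
    if a.cited_refs & b.self_citations or b.cited_refs & a.self_citations:  # S
        return True
    if len(a.keywords & b.keywords) >= 2:                   # K2: >= 2 keywords
        return True
    return False                                            # Split
```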

The result of the AE-algorithm is a network of author names, which we will call an AND network. In such a network, each author name is represented by a node. Author names are connected if the algorithm claims that the author names correspond to the same individual. An example of an AND network is shown in Figure 4. To get from an AND network to a co-authorship network, a cluster analysis is carried out. In cluster analyses, objects are grouped in such a way that objects in the same cluster are more similar, in one sense or another, to each other than to those in other clusters (Wikipedia, 2014). There are different algorithms that carry out cluster analyses. We use the fast and greedy community detection algorithm because it is designed specifically for very large networks (Clauset, Newman, & Moore, 2004). The output of the cluster analysis is a set of clusters of bibliographic references, where references that are assumed to be written by the same individual are grouped in the same cluster. Based on these clusters of references, a co-authorship network can be created.
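As an illustration of this clustering step, the sketch below uses the python-igraph library, whose community_fastgreedy() method implements the algorithm of Clauset, Newman, and Moore (2004). The toy AND network is hypothetical.

```python
# Sketch: clustering an AND network with fast and greedy community detection.
import igraph as ig

# Nodes are author name occurrences; edges mean 'same individual' according
# to the disambiguation phase.
g = ig.Graph()
g.add_vertices(5)
g.vs["name"] = ["VanVugt_1", "VanVugt_2", "VanVugt_3", "Spisak_1", "Spisak_2"]
g.add_edges([(0, 1), (1, 2), (3, 4)])

# The algorithm returns a dendrogram; cutting it at the level of maximum
# modularity yields the clusters of name occurrences.
clusters = g.community_fastgreedy().as_clustering()
for cluster in clusters:
    print([g.vs[v]["name"] for v in cluster])
```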

Figure 4

An Example of an AND Network Based on Three References

(The AND network derived from the three references of Figure 1 contains one node per author name occurrence: three nodes Van Vugt, M.; two nodes Spisak, B. R.; one node Park, J. H.; and one node Nicholson, N.)

Evaluation of the AE-algorithm

The AE-algorithm outlined above is designed on the basis of intuitive assumptions and previous findings in AND studies. However, some assumptions may be flawed and some previous findings may not be relevant due to the specific purpose of the AE-algorithm. This means that the algorithm shown in Figure 2 and Figure 3 may not yet be optimal for the task of AND. Therefore, several adjustments will be made to the AE-algorithm to see if these adjustments result in improved performance. The original algorithm and its variations will be evaluated on the basis of two criteria: effectiveness (whether the algorithm disambiguates author names well) and computational cost (the time it takes to run the algorithm). The specifics of the evaluation criteria will be described later.

The original algorithm can be adjusted in infinitely many ways. However, we only evaluate the adjustments that we deem to have a reasonable chance of improving the algorithm. Some of these adjustments pertain to the blocking phase and others pertain to the disambiguation phase of the algorithm. In the blocking phase, adjustments are only made in the third step. That third step currently involves checking the Damerau-Levenshtein distance between author last names. However, according to Damerau (1964), this string similarity metric is not effective for names of 4 characters or less. The reason for this is that the Damerau-Levenshtein distance too often assumes that short words are the same. For example, Hill and Hall are very common surnames in the United States (United States Census Bureau, 2000), while Sato and Kato are very common surnames in Japan (Wikipedia, 2014). When a maximum Damerau-Levenshtein distance of 1 is used, all Hills will be compared to all Halls, and all Satos will be compared to all Katos, while many of them are guaranteed to be different individuals. In this way, a lot of unnecessary comparisons will be made in the disambiguation phase of the algorithm. To reduce computational cost, it may be effective to use the Damerau-Levenshtein distance only for long names (defined here as names longer than 4 characters). This variation of the original algorithm will be named DL01 to illustrate that it employs a maximum Damerau-Levenshtein distance of 0 for short last names and 1 for long last names.

In similar fashion, the second variation of the original algorithm is called DL12. In DL12, the maximum Damerau-Levenshtein distance is 1 for short last names and 2 for long last names. In this variation, the number of comparisons in the disambiguation phase is higher than in the original algorithm, which leads to higher computational cost. However, it is probable that the adjustment also leads to a lower number of author name pairs that are incorrectly judged to be different individuals. This benefit may outweigh the higher computational cost, making this variation worthwhile to evaluate. An important point regarding DL12 is that it is not evident how many characters constitute a short name and how many characters constitute a long name. For this reason, several variants of DL12 will be evaluated, all of which use a different definition of long names. In DL125, long names are defined as names of 5 characters or more (as in DL01). In DL126, long names are defined as names of 6 characters or more. In DL127, long names are defined as names of 7 characters or more. All of these variations will be evaluated and compared to DL01, each other, and the original algorithm (which is named DL11 because the maximum Damerau-Levenshtein distance is 1 for both short and long last names).
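These length-dependent thresholds can be sketched as a small variation on the blocking sketch given earlier, reusing damerau_levenshtein() from that sketch. The parameterization is illustrative, and the handling of a pair with one short and one long name (applying the stricter threshold) is our assumption, since the text does not specify it.

```python
# Sketch: length-dependent maximum Damerau-Levenshtein distances.
# Illustrative (short_max, long_max, cutoff) settings: DL01 = (0, 1, 5),
# DL11 = (1, 1, 5), DL125 = (1, 2, 5), DL126 = (1, 2, 6), DL127 = (1, 2, 7).
def max_allowed_distance(last_name, short_max, long_max, cutoff):
    """Long names (cutoff characters or more) get the more lenient threshold."""
    return long_max if len(last_name) >= cutoff else short_max

def last_names_match(last_a, last_b, short_max, long_max, cutoff):
    # Assumption: for a short/long pair, the stricter threshold applies.
    limit = min(max_allowed_distance(last_a, short_max, long_max, cutoff),
                max_allowed_distance(last_b, short_max, long_max, cutoff))
    return damerau_levenshtein(last_a, last_b) <= limit

# Under DL01, the common four-letter names Sato and Kato (distance 1) are no
# longer compared, while Johnson and Jonson (distance 1) still are.
```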


In the disambiguation phase of the AE-algorithm, the adjustments pertain to the choice of algorithm features, and to the way they are ordered. The order is of importance because it partially determines the effectiveness of the algorithm features. To give an extreme example, if feature 1 and feature 2 already disambiguated all of the author names, the rest of the algorithm features do not have any author names left to disambiguate, rendering them completely ineffective. However, when feature 1 and feature 2 were only able to disambiguate a small number of author names, the remaining features can probably disambiguate a significant number of author names, rendering them at least somewhat effective. Since every algorithm feature adds to the computational cost of the algorithm, only features that are sufficiently effective should be included. A priori, it is difficult to decide which algorithm features are effective and which are not. Ideally, all combinations of algorithm features would be evaluated to determine the optimal choice of features. However, this means that 7! combinations need to be tested. This amounts to 5,040 combinations, the testing of which is not feasible. To lower the number of combinations, we employ a hierarchical evaluation of algorithm features. This works in the following way. In the first stage, every feature will be evaluated as if it were the first and only step in the algorithm. The most effective feature will then be definitively included as the first step. After that, the remaining features will be evaluated as if they were the second step of the algorithm. Again, the most effective feature will be definitively included as the second step. This process will continue until the algorithm features do not sufficiently improve algorithm performance any more. Using this method, the maximum number of times the algorithm must be run is 7 + 6 + 5 + 4 + 3 + 2 + 1 = 28, which is perfectly manageable.
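A minimal sketch of this hierarchical (greedy forward) evaluation is given below, assuming a hypothetical evaluate() function that runs the algorithm with a given feature order on the test set and returns an effectiveness score; the stopping threshold is a placeholder.

```python
# Sketch: greedy forward selection of algorithm features.
def greedy_feature_order(candidates, evaluate, min_gain=0.0):
    chosen, best_score = [], 0.0
    while candidates:
        # Try each remaining feature as the next step in the algorithm.
        scores = {f: evaluate(chosen + [f]) for f in candidates}
        best_feature = max(scores, key=scores.get)
        if scores[best_feature] - best_score <= min_gain:
            break  # no feature sufficiently improves performance any more
        chosen.append(best_feature)
        best_score = scores[best_feature]
        candidates = [f for f in candidates if f != best_feature]
    return chosen

# With 7 candidate features this costs at most 7 + 6 + ... + 1 = 28 runs.
```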

However, not only the choice and order of algorithm features may improve the algorithm, there may also be room for improvement in the specifics of the features. For example, the effectiveness of the co-citations feature may improve when not three but two co-citations are sufficient to merge two author names together. Likewise, the effectiveness of the keywords feature may improve when not two but one keyword is sufficient to merge two author names together. We believe that these examples are the most likely candidates to lead to an improvement of the algorithm, so these two variants will also be evaluated. To calculate the number of evaluation runs, we have to take into account that two variants of the same algorithm feature cannot both be implemented in an algorithm. For example, an algorithm cannot employ the heuristic that both 2 and 3 co-citations are sufficient to merge author names. Therefore, the maximum number of times the algorithm must be run to evaluate the disambiguation phase variations becomes 9 + 8 + 7 + 6 + 5 + 4 + 2 = 41, which is still manageable.

We believe that the effectiveness of algorithm features is independent of the choice of the blocking method, so all of the test runs for the disambiguation phase will be carried out with the blocking phase of the original AE-algorithm, DL11. After a choice is made on the algorithm features in the disambiguation phase, the different blocking methods will be compared. Since there are 4 different blocking methods, this amounts to a maximum total of 41 + 4 = 45 runs.

To compare the 45 algorithms in an orderly manner, they will all be named according to the same naming convention. The disambiguation phase of the original algorithm will be named EAIXC3SK2, where the letters stand for the order of the different algorithm features, the first number stands for the number of co-citations sufficient for author names to be merged, and the second number stands for the number of keywords sufficient for author names to be merged. The name of the disambiguation phase of an algorithm will be combined with the name of the blocking phase of that algorithm to arrive at the complete name of the algorithm. So, an algorithm that employs the blocking phase of the original algorithm, and employs e-mail address, institutional affiliation, 2 co-citations, and self-citations in the disambiguation phase, will be named DL11-EIC2S. All 45 algorithms will be evaluated using the same test set and the same measures. The test set and the measures used to evaluate the algorithms will be described in the next sections.

The DBLP-WoS test set

As Ferreira, Gonçalves, and Laender (2012) point out, directly comparing the merits of the different AND approaches is difficult because different algorithms tend to be created with different goals in mind. An algorithm that tries to disambiguate the field of computer science is bound to be very different from an algorithm that tries to disambiguate the field of biomedicine. For example, the average number of co-authors per author is 4 in computer science, while that number is 18 for biomedicine (Newman, 2001). An algorithm to disambiguate the field of biomedicine should therefore probably make extensive use of co-authorship relations, while the algorithm for computer science should focus on other features. In addition, the available information for the algorithms might differ for these fields. Biomedicine algorithms would probably use the MEDLINE database to get information about author identities, while computer science algorithms would probably use the DBLP database. This is of importance because not all databases provide the same types of information. While MEDLINE provides abstracts and institutional affiliations directly, DBLP does not. Therefore, algorithms using abstracts and institutional affiliations as features are not suitable for the disambiguation of DBLP datasets. The essential point here is that algorithms are tailor-made with specific goals in mind, which makes them difficult to compare. Moreover, there is no standardized test set available, although efforts have been undertaken to create test sets for specific algorithms and databases (Han et al., 2005; Cota, Ferreira, Nascimento, Gonçalves, & Laender, 2010; Kang, 2011).

Test sets are created by manually disambiguating a sample of authors in a bibliographic database, leading to a so-called ground truth. This ground truth can be used to evaluate and compare AND algorithms. A commonly used test set is that of Han et al. (2005), who manually disambiguated 14 ambiguous DBLP author name groups by checking the personal websites of authors and sending e-mails to confirm author identities. Other studies have used this test set too, albeit with slight alterations since the initial version was not entirely free of errors (Yang, Peng, Jiang, Lee, & Ho, 2008; Ferreira, Veloso, Gonçalves, & Laender, 2010; Cota et al., 2010; Shin, Kim, Choi, & Kim, 2014). To our knowledge, no test sets have been created for the Web of Science database. For this reason, we will adjust the most recent DBLP test set (Shin et al., 2014) to make it suitable for testing a Web of Science algorithm. The adjusted test set includes all entries of the DBLP test set that can also be found in the Web of Science database. The new test set will be named the DBLP-WoS test set. With the help of this test set, it is possible to evaluate the AE-algorithm and its variations. Furthermore, other algorithms designed for Web of Science datasets can also be evaluated using this benchmark test set.

Evaluation metrics

To evaluate the algorithms used for AND, several metrics can be used. Most of these metrics are in some way or another based on the number of false positives and false negatives. In the context of AND, false positives are cases in which two author names are assumed by the algorithm to be the same individual, while in reality the names belong to different individuals. False negatives are cases in which two author names are assumed by the algorithm to be different individuals, while in reality the names belong to the same individual. Besides false positives and false negatives, there are also true positives (cases in which author names are correctly assumed to be the same individual) and true negatives (cases in which author names are correctly assumed to be different individuals). An overview of the possible outcomes (derived from Kang et al., 2009) is provided in Table 1.

Table 1

Overview of the Different Outcomes of Comparing an AND Algorithm to the Ground Truth

                     Ground truth
AND algorithm        Match                  Nonmatch
Match                True positive (a)      False positive (b)
Nonmatch             False negative (c)     True negative (d)

The most commonly used evaluation metrics are precision and recall. These metrics can be calculated on a pairwise basis and on a cluster basis. Pairwise precision (pP) is defined as the total number of distinct pairs of references correctly labeled as matches divided by the total number of distinct pairs of references labeled as matches (Torvik et al., 2005). Pairwise recall (pR) is defined as the total number of distinct pairs correctly labeled as matches divided by the total number of distinct pairs of true matches (Torvik et al., 2005). The cluster metrics do not compare pairs of references, but compare the clusters that are generated by the AND algorithm to the clusters that are manually disambiguated, the true clusters. Cluster precision (cP) is defined as the number of correctly generated clusters divided by the total number of generated clusters (Huang et al., 2006). Cluster recall (cR) is defined as the number of correctly generated clusters divided by the total number of true clusters (Huang et al., 2006). The harmonic mean of pP (cP) and pR (cR) is called the pF1 (cF1) metric, which is also frequently used. A metric that is used somewhat less frequently is pairwise accuracy (pA), which is defined as the number of distinct pairs correctly labeled (as matches or nonmatches) divided by the total number of distinct pairs (Torvik et al., 2005).
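The pairwise metrics follow directly from the counts a, b, c, and d in Table 1, as in this short sketch with hypothetical counts.

```python
# Sketch: pairwise evaluation metrics from the Table 1 counts.
def pairwise_metrics(a, b, c, d):
    """Return pairwise precision, recall, F1 and accuracy."""
    pP = a / (a + b)   # correct matches / all predicted matches
    pR = a / (a + c)   # correct matches / all true matches
    pF1 = 2 * pP * pR / (pP + pR)
    pA = (a + d) / (a + b + c + d)
    return pP, pR, pF1, pA

# Hypothetical example: 90 true positives, 10 false positives,
# 30 false negatives, 870 true negatives.
print(pairwise_metrics(90, 10, 30, 870))  # pP 0.9, pR 0.75, pF1 ~0.82, pA 0.96
```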

Another important metric is the K metric (Ajmera, Bourlard, Lapidot, & McCowan, 2002). This metric is the geometric mean of two clustering metrics, the average cluster purity (ACP) and the average author purity (AAP). ACP and AAP compare the references in the generated clusters to the references in the true clusters. ACP measures to what extent the references within a generated cluster are also grouped together within a true cluster (i.e. to what extent the references within a generated cluster are written by the same individual). When the references within the generated cluster are indeed written by the same individual, the ACP for that cluster is 1. When the references within the generated cluster are actually written by many different individuals, the ACP for that cluster is low. AAP measures to what extent the references in a true cluster are also grouped together within a generated cluster. When all of the references within the true cluster are also grouped together in a generated cluster, the AAP for that cluster is 1. When the references within the true cluster are spread out over many different generated clusters, the AAP for that cluster is low. The overall ACP (AAP) is derived by the summation of the ACPs (AAPs) of individual clusters. The formal definitions of ACP, AAP, K, and the other evaluation metrics that will be used to evaluate the AE-algorithm and its variations are presented in Table 2. In this table, g is the number of clusters generated by the AND algorithm, t is the number of true clusters, and m is the number of correctly generated clusters. N is the total number of author names, n_ij is the number of references generated into cluster i that in reality belong to the true cluster j, n_i is the number of references generated into cluster i, and n_j is the number of references in the true cluster j.
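A minimal sketch of ACP, AAP, and K is given below, assuming the generated and true clusterings are provided as lists of clusters of reference identifiers covering the same references.

```python
# Sketch: average cluster purity, average author purity, and the K metric.
from math import sqrt

def purity_metrics(generated, true):
    N = sum(len(c) for c in generated)  # total number of references
    acp = sum(len(set(g) & set(t)) ** 2 / len(g)
              for g in generated for t in true) / N
    aap = sum(len(set(g) & set(t)) ** 2 / len(t)
              for g in generated for t in true) / N
    return acp, aap, sqrt(acp * aap)  # K is the geometric mean of ACP and AAP

# Hypothetical example: one true author split over two generated clusters.
generated = [["r1", "r2"], ["r3"]]
true = [["r1", "r2", "r3"]]
print(purity_metrics(generated, true))  # ACP = 1.0, AAP ~ 0.56, K ~ 0.75
```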

Metrics that are only sparsely used in AND studies are the ratio of cluster size (Huang et al., 2006) and the over-/underclustering error (Kang et al., 2009). These metrics will not be further addressed in this thesis.


Table 2

Overview of the Commonly Used Evaluation Metrics for AND Algorithms

Metric                    Abb.   Formula
Pairwise precision        pP     a / (a + b)
Pairwise recall           pR     a / (a + c)
Pairwise F1               pF1    (2 · pP · pR) / (pP + pR)
Pairwise accuracy         pA     (a + d) / (a + b + c + d)
Cluster precision         cP     m / g
Cluster recall            cR     m / t
Cluster F1                cF1    (2 · cP · cR) / (cP + cR)
Average cluster purity    ACP    (1/N) · Σ_{i=1..g} Σ_{j=1..t} n_ij² / n_i
Average author purity     AAP    (1/N) · Σ_{j=1..t} Σ_{i=1..g} n_ij² / n_j
K metric                  K      √(ACP · AAP)

In addition to the statistical evaluation metrics in Table 2, AND algorithms should also be evaluated based on the time it takes to run the algorithm, the so-called computational cost. It is likely that there exists a trade-off between computational cost and algorithm effectiveness. Algorithms that are very fast may forego valuable information that can be used in the disambiguation process, while algorithms that are very effective may take too long to run because they incorporate every tiny bit of available information. Because the AE-algorithm is specifically designed for large datasets, we will weigh computational cost very heavily. However, it is difficult to decide a priori on the exact weights we should attach to algorithm effectiveness and computational cost. Therefore, we will decide on the relative weights after the test results are in. Testing starts in September 2014.

References

Abbasi, A., Altmann, J., & Hossain, L. (2011). Identifying the effects of co-authorship networks on the performance of scholars: A correlation and regression analysis of performance measures and social network analysis measures. Journal of Informetrics, 5, 594-607.


Acedo, F. J., Barroso, C., Casanueva, C., & Galán, J. L. (2006). Co-authorship in management and organizational studies: An empirical and network analysis. Journal of Management Studies, 43(5), 957-983.

Ajmera, J., Bourlard, H., Lapidot, I., & McCowan, I. (2002, September). Unknown-multiple speaker clustering using HMM. In Proceedings of the International Conference on Spoken Language Processing.

Björk, B. C., Roos, A., & Lauri, M. (2009). Scientific journal publishing: Yearly volume and open access availability. Information Research: An International Electronic Journal, 14(1), paper 391.

Clauset, A., Newman, M. E., & Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), article 066111.

Cohen, W., Ravikumar, P., & Fienberg, S. (2003). A comparison of string metrics for matching names and records. In Proceedings of the Workshop on Data Cleaning and Object Consolidation at the International Conference on Knowledge Discovery and Data Mining.

Cota, R. G., Ferreira, A. A., Nascimento, C., Gonçalves, M. A., & Laender, A. H. (2010). An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. Journal of the American Society for Information Science and Technology, 61(9), 1853-1870.

Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), 171-176.

Dooremalen, H., De Regt, H., & Schouten, M. (2007). Exploring humans: An introduction to the philosophy of the social sciences (3rd ed.). Amsterdam: Boom.

Elliot, S. (2010). Survey of author name disambiguation: 2004 to 2010. Library Philosophy and Practice, 473. Retrieved from http://digitalcommons.unl.edu/libphilprac/473

Fatt, C. K., Ujum, E. A., & Ratnavelu, K. (2010). The structure of collaboration in the Journal of Finance. Scientometrics, 85(3), 849-860.

Ferreira, A. A., Veloso, A., Gonçalves, M. A., & Laender, A. H. (2010, June). Effective self-training author name disambiguation in scholarly digital libraries. In Proceedings of the 10th Annual Joint Conference on Digital Libraries (pp. 39-48). ACM.

Ferreira, A. A., Gonçalves, M. A., & Laender, A. H. F. (2012). A brief survey of automatic methods for author name disambiguation. ACM SIGMOD Record, 41(2), 15-26.

Gossart, C., & Özman, M. (2009). Co-authorship networks in social sciences: The case of Turkey. Scientometrics, 78(2), 323-345.


Gurney, T., Horlings, E., & Van Den Besselaar, P. (2012). Author disambiguation using multi-aspect similarity indicators. Scientometrics, 91(2), 435-449.

Han, H., Zha, H., & Giles, C. L. (2005). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 334-343). New York: ACM.

Hou, H., Kretschmer, H., & Liu, Z. (2008). The structure of scientific collaboration networks in Scientometrics. Scientometrics, 75(2), 189-202.

Huang, J., Ertekin, S., & Giles, C. L. (2006). Efficient name disambiguation for large-scale databases. In Knowledge Discovery in Databases: PKDD 2006 (pp. 536-544). Springer Berlin Heidelberg.

Kang, I. S., Na, S. H., Lee, S., Jung, H., Kim, P., Sung, W. K., & Lee, J. H. (2009). On co-authorship for author disambiguation. Information Processing & Management, 45(1), 84-97.

Liao, C. H. (2011). How to improve research quality? Examining the impacts of collaboration intensity and member diversity in collaboration networks. Scientometrics, 86, 747-761.

McCallister, E., Grance, T., & Scarfone, K. A. (2010). Guide to protecting the confidentiality of personally identifiable information (PII). Gaithersburg: National Institute of Standards and Technology.

McCarty, C., Jawitz, J. W., Hopkins, A., & Goldman, A. (2013). Predicting author h-index using characteristics of the co-author network. Scientometrics, 96, 467-483.

Moody, J. (2004). The structure of a social science collaboration network: Disciplinary cohesion from 1963 to 1999. American Sociological Review, 69(2), 213-238.

Newman, M. E. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America, 98(2), 404-409.

Onel, S., Zeid, A., & Kamarthi, S. (2011). The structure and analysis of nanotechnology co-author and citation networks. Scientometrics, 89(1), 119-138.

Oxford Dictionaries. (2014). Science. Retrieved July 7, 2014, from http://www.oxforddictionaries.com/definition/english/science

Perc, M. (2010). Growth and structure of Slovenia’s scientific collaboration network. Journal of Informetrics, 4(4), 475-482.

Price, D. J. S. (1961). Science since Babylon. New Haven: Yale University Press.

Price, D. J. S. (1963). Little science, big science. New York: Columbia University Press.

Rip, A., & Courtial, J. P. (1984). Co-word maps of biotechnology: An example of cognitive scientometrics. Scientometrics, 6(6), 381-400.


Science Europe, & Elsevier. (2013). Comparative benchmarking of European and US research collaboration and researcher mobility. Retrieved from http://www.elsevier.com/__data/assets/pdf_file/0010/171793/Comparative-Benchmarking-of-European-and-US-Research-Collaboration-and-Researcher-Mobility_sept2013.pdf

Scott, J. (2000). Social network analysis: A handbook (2nd ed.). London: Sage Publications.

Shin, D., Kim, T., Choi, J., & Kim, J. (2014). Author name disambiguation using a graph model with node splitting and merging based on bibliographic information. Scientometrics, 100, 15-50.

Smalheiser, N. R., & Torvik, V. I. (2009). Author name disambiguation. Annual Review of Information Science and Technology, 43(1), 1-43.

Strotmann, A., & Zhao, D. (2012). Author name disambiguation: What difference does it make in author-based citation analysis? Journal of the American Society for Information Science and Technology, 63(9), 1820-1833.

Strotmann, A., Zhao, D., & Bubela, T. (2009). Author name disambiguation for collaboration network analysis and visualization. Proceedings of the American Society for Information Science and Technology, 46(1), 1-20.

Tang, L., & Walsh, J. P. (2010). Bibliometric fingerprints: Name disambiguation based on approximate structure equivalence of cognitive maps. Scientometrics, 84(3), 763-784.

Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), article 11.

Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140-158.

UNESCO. (2010). UNESCO Science Report 2010: The current status of science around the world. Retrieved from http://unesdoc.unesco.org/images/0018/001899/189958e.pdf

United States Census Bureau. (2000). Genealogy data: Frequently occurring surnames from census 2000. Retrieved from http://www.census.gov/genealogy/www/data/2000surnames/index.html

Velden, T. A., Haque, A. U., & Lagoze, C. (2011, June). Resolving author name homonymy to improve resolution of structures in co-author networks. In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (pp. 241-250). ACM.

Web of Science (2014). Recorded Training on Author Search. Retrieved July 10, 2014, from http://wokinfo.com/training_support/training/web-of-science.

Wikipedia. (2014). Cluster analysis. Retrieved August 17, 2014, from http://en.wikipedia.org/wiki/Cluster_analysis

Wikipedia. (2014). List of most common surnames in Japan. Retrieved August 20, 2014, from http://en.wikipedia.org/wiki/List_of_most_common_surnames_in_Asia#Japan


Wooding, S., Wilcox-Jay, K., Lewison, G., & Grant, J. (2006). Co-author inclusion: A novel recursive algorithmic method for dealing with homonyms in bibliometric analysis. Scientometrics, 66(1), 11-21.

Yang, K. H., Peng, H. T., Jiang, J. Y., Lee, H. M., & Ho, J. M. (2008). Author name disambiguation for citations using topic and web correlation. In B. Christensen-Dalsgaard, D. Castelli, B. A. Jurik, & J. Lippincott (Eds.), Research and Advanced Technology for Digital Libraries: 12th European Conference on Digital Libraries, Aarhus (pp. 185-196).
