What is the organizing principle for large topics?
STI 2018 Conference Proceedings

Proceedings of the 23rd International Conference on Science and Technology Indicators

All papers published in these conference proceedings have been peer reviewed through a peer review process administered by the proceedings Editors. Reviews were conducted by expert referees to the professional and scientific standards expected of conference proceedings.

Chair of the Conference: Paul Wouters

Scientific Editors: Rodrigo Costas, Thomas Franssen, Alfredo Yegros-Yegros

Layout: Andrea Reyes Elizondo, Suze van der Luijt-Jansen

The articles of this collection can be accessed at https://hdl.handle.net/1887/64521

ISBN: 978-90-9031204-0

© of the text: the authors

© 2018 Centre for Science and Technology Studies (CWTS), Leiden University, The Netherlands

This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


Richard Klavans* and Kevin W. Boyack**

*rklavans@mapofscience.com

SciTech Strategies, Inc., Wayne, PA 19087 (USA)

** kboyack@mapofscience.com

SciTech Strategies, Inc., Albuquerque, NM 87122 (USA)

Introduction

We know a great deal about how to identify the topics that researchers are working on. One can use citations and/or text to identify roughly one hundred thousand document clusters from either the Scopus or the WoS database. For purposes of discussion, we refer to these document clusters as topics. In our models there are about a thousand very large topics and tens of thousands of small topics. But why doesn't topic size follow the expected Zipfian distribution (a straight line on log-log axes)? Is it possible that there is a different organizing principle for large vs. small topics?

In this study, we explore the possibility that the organizing principle for large topics is the continued use of very expensive tools (such as specialized equipment, specialized databases and specialized software). For our initial exploration, we use grant size (NSF or NIH grants in excess of $5 million annually) as a proxy for preferential investment in specialized tools. Using links between 52,097 grants and tens of thousands of topics, we test whether large topics get more funding than expected from large grants and, by inference, whether expensive tools are an organizing principle for large topics.

Background

There has been significant progress in the identification of topics from very large collections of scientific and technical documents. Much of this progress can be attributed to the development of computer memory and software (Klavans & Boyack, 2017b, p. 986).

Standards for measuring the accuracy of topic detection (i.e. document clustering) algorithms are being developed (Sjögårde & Ahlgren, 2018; Waltman, Boyack, Colavizza, & Van Eck, 2017). We are now at the stage where it is possible to predict funding patterns across these topics (Klavans & Boyack, 2017a).

One of the crystallizing motives for this study came from an interesting anomaly in this stream of research. We found that the funding levels per author are an order of magnitude larger for the extremely large topics than for the smallest topics¹. Why might this be so? Building on Price (1963), we believe that there is a different organizing principle behind little science (where most researchers could fund their endeavors through small community efforts) and big science (where very expensive laboratories are built with funding from government agencies and corporations). We are correspondingly exploring the possibility that very expensive tools (sequencing equipment, supercomputers, large-scale epidemiology studies) are one of the organizing principles behind the very large topics that have been identified using the methods described above. And while this concept has been applied to the choices of problems that a researcher or a laboratory works on over time (Latour & Woolgar, 1986), this is, to our knowledge, the first time that this perspective has been used to specifically model an organizing principle for large vs. small topics.

¹ Figure 4 in Klavans & Boyack (2017a) shows that the funding level per person is due to prominence, an indicator based on citation counts, downloads and field-adjusted impact. A similar pattern emerges if one uses

Data and Methods

Data from multiple databases were used to explore these issues. Scopus data (over 60 million indexed and cited non-indexed documents) were used to create a citation-based model (Klavans & Boyack, 2017a). PubMed data from 1975-2017 (23 million documents) were used to create a text-based model (Boyack & Klavans, 2018). Using two models built with different methods for detecting topics allows us to test whether organizing principles are method dependent. Zipfian distributions (log size vs. log rank) for both models (sets of topics) are shown in Figure 1.

Figure 1: Zipf distributions for topics identified from Scopus and PubMed data.
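As a concrete illustration, the rank-size coordinates underlying a plot like Figure 1 can be computed directly from a list of topic sizes. The sketch below is ours, not the authors' code; the toy data follow an idealized Zipf law, under which the log-log points fall on a line of slope -1.

```python
import math

def zipf_points(topic_sizes):
    """Rank topics by size (descending) and return (log10 rank, log10 size)
    pairs -- the coordinates plotted in a Zipf (rank-size) diagram."""
    ranked = sorted(topic_sizes, reverse=True)
    return [(math.log10(rank), math.log10(size))
            for rank, size in enumerate(ranked, start=1)]

# Toy sizes: a perfect Zipf law (size = C / rank) gives a straight line
# of slope -1 on log-log axes; real topic-size data deviate from this.
sizes = [1000 / r for r in range(1, 101)]
pts = zipf_points(sizes)
slope = (pts[-1][1] - pts[0][1]) / (pts[-1][0] - pts[0][0])
print(round(slope, 3))  # → -1.0
```

Deviations from that straight line (an excess of very large topics) are exactly the anomaly that motivates this study.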

The distributions in Figure 1 are based on the maximum number of documents (per year) over the 2010-2016 time period. While this procedure differs somewhat from the standard approach (i.e., showing topic size in a single year), we used it to avoid miscoding topics that grow (or shrink) rapidly as small. A topic with at least 250 publications in any year is coded as large. A topic with fewer than 50 papers in every year is considered small. The overall shape of these graphs would not differ significantly if we used a single-year publication rate, but the proper classification of a topic as 'large' or 'small' would suffer.
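The classification rule just described can be sketched as a small function. The function name and the 'medium' fallback label are our own; the text only defines the 'large' and 'small' thresholds explicitly.

```python
def classify_topic(annual_counts):
    """Classify a topic by its peak annual publication count over the
    observation window: >= 250 papers in some year -> 'large',
    < 50 papers in every year -> 'small', otherwise 'medium'
    (an assumed label for the in-between band)."""
    peak = max(annual_counts)
    if peak >= 250:
        return "large"
    if peak < 50:
        return "small"
    return "medium"

# A topic that shrank rapidly still counts as large, because the rule
# uses the peak year rather than a single (e.g., final) year.
print(classify_topic([400, 310, 180, 90, 40, 20, 10]))  # → large
print(classify_topic([30, 35, 42, 38, 25, 20, 18]))     # → small
```

Using the peak year is what protects rapidly growing or shrinking topics from being miscoded as small.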

The number of large topics is much larger for the model based on the Scopus database (1681 vs. 694 large topics). This is primarily due to coverage. While our PubMed model contains 41% as many large topics as our Scopus model, PubMed comprises only 36% of Scopus over this time period. PubMed does not have much coverage of the literature in Physics, Engineering, Computer Science or Nanoscience (there are 767 large topics in these four fields according to the Scopus model). PubMed has some (small) coverage of Chemistry (161 large Scopus topics) and the Social Sciences (208 large Scopus topics). Overall, the number of large topics in three Scopus fields (Medical Research, Life Science and Infectious Disease) is close to the number reported in the PubMed model. In our opinion, these models are similar in size distribution and differ only in coverage.

Grant data were drawn from the Star Metrics database. Figure 2 shows the Zipfian distribution for 61,699 extramural grants that received funding from NIH or NSF in 2012. The grant amount distribution is more linear than the topic size distributions, with a clear discontinuity at $250,000 below which the grant amounts tail off sharply. Grants have been separated into size groups using thresholds shown in Figure 2.

Figure 2: Zipf distribution of annual grant amounts for NIH and NSF grants (2012).
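Bucketing grants by annual amount can be sketched as follows. Only the $5 million "very large" cutoff is stated in the text; the $1 million and $250K boundaries below are illustrative stand-ins for the Figure 2 thresholds, not values taken from the paper.

```python
# Thresholds in descending order: (minimum annual amount, label).
# Only the $5M cutoff comes from the text; the others are assumed.
BUCKETS = [(5_000_000, "very large"),
           (1_000_000, "large"),
           (250_000, "medium")]

def grant_size_group(annual_amount):
    """Assign a grant to a size group by its annual dollar amount."""
    for threshold, label in BUCKETS:
        if annual_amount >= threshold:
            return label
    return "small"

print(grant_size_group(46_300_000))  # the U54HG003067 example → very large
print(grant_size_group(370_000))     # a median R-type grant → medium
```

The $250K boundary is a natural candidate because, as noted above, the grant-amount distribution shows a clear discontinuity there.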

Grant-article links for NIH grants were obtained from NIH RePORTER, while similar links for NSF grants were obtained by matching grant outputs (from the NSF API) to Scopus articles. These data allow us to fractionally assign grants to topics. Numbers of grants by agency, along with total funding and the percentage of that funding that could be assigned to topics, are shown in Table 1. Most NIH funding could be assigned to topics in both models since most NIH grants active in 2012 have been acknowledged by published papers. Lower fractions of NSF funding are assignable, particularly for the PubMed model. This is partly because NSF only lists grants for the year they are funded (which gives only 4 years for papers to accrue), while NIH lists all grants active in a given year, including very old grants. The following analyses only use the data for those grants that are linked to documents in the Scopus or PubMed models.
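The fractional assignment step can be sketched as follows. The paper does not spell out the exact fractionation rule, so this sketch assumes the simplest one: each linked paper carries an equal share of the grant's dollars, credited to the topic that paper belongs to.

```python
from collections import defaultdict

def assign_grant_to_topics(amount, paper_topics):
    """Fractionally assign a grant's dollars to topics: each linked
    paper gets an equal share, credited to that paper's topic.
    (An assumed equal-split rule, for illustration only.)"""
    share = amount / len(paper_topics)
    totals = defaultdict(float)
    for topic in paper_topics:
        totals[topic] += share
    return dict(totals)

# A $1M grant acknowledged by 4 papers: 3 in topic 2538, 1 in 15433.
print(assign_grant_to_topics(1_000_000, [2538, 2538, 2538, 15433]))
# → {2538: 750000.0, 15433: 250000.0}
```

Summing these per-grant allocations over all linked grants yields the funding-per-topic figures analyzed below.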

Table 1. Grant funding (2012) and assignment to topics.

Agency # Grants $ Billion %$ Scopus topics %$ PubMed topics

NIH 49,937 22.25 94.7% 95.2%

NSF 11,762 3.32 58.9% 33.7%

An NIH Example

To this point we have not yet explored the potential link between large grants and expensive tools. A specific example of an NIH grant may help illustrate this possibility. The largest extramural NIH grant in 2012 was U54HG003067, which received $46.3 million that year for large-scale sequencing and analysis of genomes. This grant had been acknowledged by 102 documents as of 2016. The topics in which the documents acknowledging this grant are located are described in Table 2.

Table 2. Scopus topics linked to NIH grant U54HG003067.

# Links  Topic ID  # Doc (2006)  # Doc (2016)  Representative phrases
25       2538      20            1746          variant calling; sequencing platforms
7        15433     16            348           variant association; causal variants
7        1279      31            2909          microbial composition; gut microbial
7        16440     8             405           cancer genome; genomic alterations
5        4 topics
4        4 topics
2        6 topics

The four topics most associated with this grant are all large. The topic with 25 links (#2538) focuses more on the technology for large-scale sequencing; those with fewer (7) links deal with related topics. One could easily argue that expensive tools play a critical role in the formation and evolution of the large topics associated with next-generation gene sequencing. For example, the Illumina HiSeq 4000, a device that does the type of rapid, high-performance sequencing required by this grant, costs $900K.² In this example, the grant is associated with an investment in expensive tools.

NSF Examples

There were only 18 NSF grants greater than $5 million; seven had literature links. The remaining eleven grants are listed in Table 3, and all but one are for expensive tools for conducting research. The grants supporting the NSCL lab at Michigan State University and the Arecibo Observatory did not have literature links in the NSF API. Nor do seven large grants for oceanographic research – these grants support the research vessels from which data are collected with expensive tools. The grant for cyberinfrastructure is for the development of a specific use (Science Across Virtual Organizations). The only large grant that is clearly not associated with expensive tools is to the New England Aquarium, which is for the explicit purpose of educating people about oceans and climate change.

² http://www.molecularecologist.com/next-gen-table-3a-2016/

Table 3. Very large NSF grants without literature links.

Field             Grant No.  $ Million  Purpose
Physics           1102511    21.5       National Superconducting Cyclotron Lab
Astronomy         1160876    10.6       Arecibo Observatory
Computer Science  1234983    5.7        Cyberinfrastructure for SAVI
Oceanography      1211494    12.0       R/V Langseth
Oceanography      1216590    6.8        R/V Kilo Moana
Oceanography      1214207    5.5        R/V Knorr
Oceanography      1216056    5.1        R/V Thompson
Oceanography      1214235    5.0        R/V Atlantis
Oceanography      1212770    5.0        R/V Revelle
Oceanography      1212771    5.0        R/V Melville
Oceanography      1214207    5.6        Education - Oceans & Climate Change

The NIH example and the ten NSF examples provide concrete evidence that very large grants are reasonable signals of the use of expensive tools. The questions we face, however, are (1) whether funding from thousands of large grants generally represents an investment in expensive tools, and (2) whether there is a pattern of preferential investment when we compare larger vs. smaller topics.

Large Grants and Grant Types

The next piece of evidence that large grants might indicate preferential investment in expensive tools comes from looking at grant types and grant size. Table 4 lists the six major extramural grant types with at least 1000 grants in 2012, ordered by median grant size. At the top of the list are the two grant types (P & U) associated with major research programs. Included in this group are Center Grants (P01, P30, P50 and U54). Center Grants conform to the definition of 'big science' – these are major research programs that require expensive tools. At the bottom of the list in Table 4 are the small grants that support individual researchers. For example, nearly half of all extramural NIH money is spent on R01 grants – these are for individual researchers who may, or may not, be working at a major lab.

Table 4. NIH Extramural grant types and median grant size (grant types with at least 1000 grants in FY 2012).

Type Description # Grants Annual median ($K)

P Research programs 1,750 1,674

U Cooperative agreements 2,708 1,559

R Research projects 35,786 370

T Training programs 2,072 304

K Career grants 3,855 167

F Fellowships 3,019 41

We also asked field experts to evaluate whether large grants play a major part in the acquisition of expensive tools such as specialized equipment. Caleb Smith (currently an intelligence analyst at the University of Michigan Medical School) and Nora Visscher-Simon (formerly the assistant to the Vice Provost of Research at Johns Hopkins University) both agreed with this assumption. They pointed out that Center Grants are often intended to purchase costly equipment, services, or other costly resources shared across multiple related projects. In comparison, R01s, while they dominate the numbers of grants, are specifically aimed at supporting individual researchers working on distinct and singular projects. While some R01 grants might go to researchers who are working with expensive tools, the most common R and T grants (R01, R32 and T32) are rarely considered a primary mechanism by which universities purchase expensive tools.

Our two field experts pointed out, however, that grant size can indicate other phenomena.

Deans and Research Provosts are rewarded if they are able to acquire and retain large P and U grants. Correspondingly, considerable senior-level attention is given to the generation and renewal of research proposals for these grant types. Far less senior-level attention is given to generating R01 and R21 grants, despite the fact that they account for twice as many dollars as P and U grants combined. The responsibility for writing R01 grants lands on individual faculty and serves to augment faculty salaries. In addition, R01 grants are mobile (they can move with the PI to another institution) while P and U grants are stable (they remain with the institution). People move, infrastructure doesn’t. Alternative interpretations of this indicator are possible.

Grant Size and Topic Size

Data on the relationship between grant size and topic size for both models are presented in Table 5. Funding from very large grants (those over $5 million) accounted for only 7.94% of all grant funding assigned to topics in the Scopus model (i.e., the expected value). But the share of funding that large topics received from very large grants was nearly 50% higher than expected (11.72%). The same pattern holds for large grants to large topics (26.78% vs. 24.36% expected), but to a lesser extent. The pattern shifts for medium grants – they are preferentially associated with smaller topics – while small grants are more evenly distributed. Roughly the same pattern is found for assignment of grants to PubMed topics, which are based on text relationships between documents. Overall, the data show that large grants are preferentially associated with large topics regardless of the type of model used.

Table 5: Fraction of grant funding assigned to topics in the Scopus and PubMed models as a function of grant size and topic size. Expected values in italics.

            Scopus model – Grant size              PubMed model – Grant size
Topic Sz    Very Lg  Large   Medium  Small         Very Lg  Large   Medium  Small
Expected    7.94%    24.36%  57.38%  10.31%        8.14%    24.61%  57.55%  9.73%
Large       11.72%   26.78%  51.68%  9.81%         11.62%   27.99%  50.62%  9.77%
Medium      7.00%    23.90%  58.73%  10.37%        7.83%    24.15%  58.13%  9.89%
Small       4.74%    21.88%  62.44%  10.94%        6.08%    23.03%  61.62%  9.27%
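The comparison underlying Table 5 can be sketched as follows: compute, for each topic size group, the fraction of its assigned dollars coming from each grant size group, and compare that to the overall ("expected") fractions. The function and the toy data are ours, for illustration only.

```python
def funding_shares(records):
    """records: iterable of (grant_size, topic_size, dollars) triples.
    Returns per-topic-size funding shares by grant size, plus the
    overall 'expected' shares -- the comparison behind Table 5."""
    by_topic, overall = {}, {}
    for grant_sz, topic_sz, dollars in records:
        row = by_topic.setdefault(topic_sz, {})
        row[grant_sz] = row.get(grant_sz, 0.0) + dollars
        overall[grant_sz] = overall.get(grant_sz, 0.0) + dollars
    grand_total = sum(overall.values())
    shares = {t: {g: d / sum(row.values()) for g, d in row.items()}
              for t, row in by_topic.items()}
    shares["expected"] = {g: d / grand_total for g, d in overall.items()}
    return shares

# Toy data in which very large grants concentrate in large topics.
recs = [("very_lg", "large", 8), ("medium", "large", 2),
        ("very_lg", "small", 1), ("medium", "small", 9)]
s = funding_shares(recs)
print(s["large"]["very_lg"], s["expected"]["very_lg"])  # → 0.8 0.45
```

In the toy data, large topics draw 80% of their funding from very large grants against an expected 45%, the same kind of over-representation Table 5 reports (11.72% observed vs. 7.94% expected).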

Discussion

We now have reasonable evidence that large grants tend to indicate an investment in expensive tools. Large grants are preferentially associated with large topics. The preferential association is not due to the way that topics are detected, since the results hold for citation-based (Scopus) and text-based (PubMed) models.

There are numerous methodological shortcomings in this study. We have not directly measured the role of expensive tools. We do not know if the grant size-topic size relationship is generalizable across different funding bodies (87% of our data are from NIH) or different fields such as physics, engineering, computer science or the social sciences. These methodological shortcomings can be addressed by gaining access to the full text of research grants. We are, in fact, pursuing this line of inquiry with research proposals. A proof-of-principle report on this stream of research has been published (Boyack, Smith, & Klavans, 2018). At that time, our analysis was limited to a sample of 369 R01 proposals. We are now gaining access to a much larger set of proposals that include, among other things, extensive information about the tools to be used and purchased, which will allow us to further test the ideas presented here. Alternative organizing principles for topics of different sizes can be tested with these data.

More importantly, an understanding of the organizing principle of large vs. small topics has broader application. This line of inquiry could open up new possibilities for detecting topic emergence. If expensive tools are an organizing principle for large topics, an indicator showing the involvement of expensive tools might be a signal of potential emergence. On the other hand, without the contextual existence of expensive tools, we may find that small topics are destined to remain small. Overall, the organizing principles of large vs. small topics is an important, but relatively unexamined, research question that deserves more attention.

References

Boyack, K. W., & Klavans, R. (2018). Accurately identifying topics using text: Mapping PubMed. Paper presented at the 23rd International Conference on Science and Technology Indicators (STI 2018).

Boyack, K. W., Smith, C., & Klavans, R. (2018). Toward predicting research proposal success. Scientometrics, 114(2), 449-461.

Klavans, R., & Boyack, K. W. (2017a). Research portfolio analysis and topic prominence. Journal of Informetrics, 11(4), 1158-1174.

Klavans, R., & Boyack, K. W. (2017b). Which type of citation analysis generates the most accurate taxonomy of scientific and technical knowledge? Journal of the Association for Information Science and Technology, 68(4), 984-998.

Latour, B., & Woolgar, S. (1986). Laboratory Life: The Social Construction of Scientific Facts. Princeton: Princeton University Press.

Price, D. J. D. (1963). Little Science, Big Science. New York: Columbia University Press.

Sjögårde, P., & Ahlgren, P. (2018). Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics. Journal of Informetrics, 12(1), 133-152.

Waltman, L., Boyack, K. W., Colavizza, G., & Van Eck, N. J. (2017). A principled approach for comparing relatedness measures for clustering publications. Paper presented at the 16th International Conference of the International Society on Scientometrics and Informetrics.
