Predicting citation impact of
scientific papers using peer reviews
Reinder Gerard van Dalen
Master thesis Information Science
Reinder Gerard van Dalen, s2497867
A B S T R A C T
Sharing scientific knowledge is of great importance in developing our joint knowledge. Researchers contribute to sharing such knowledge by publishing research in various scientific fields. However, it is essential that these studies are read and used in further research in order for them to have an effect on our joint knowledge. The citation impact of a researcher's work reflects whether this is the case. Within the field of exploring factors that influence this impact, studies have examined the predictive power of, for example, the paper title and the number of authors of a paper. Using peer reviews to predict citation impact, however, is still unexplored. This thesis examines whether it is possible to predict the citation impact of scientific papers using their peer reviews. It is hypothesized that scientific papers with a high citation impact contain more positive feedback than papers with a low citation impact. To check this hypothesis, two corpora containing peer review texts were created. Subsequently, the predictive power of peer reviews for citation impact was
explored. Peer reviews of in total 3,315 scientific papers from the NIPS¹ and ICLR² conferences were collected.
conferences were collected. The 2,421 papers from NIPS were labeled with a low, medium or high citation impact. This was based on citation scores collected over two years. The 894 papers from ICLR were labeled with paper acceptance. Both corpora were employed to predict the citation impact of NIPS scientific papers. Sev-eral classifiers were trained on the data using Machine Learning. The best setup used NIPS data only and yielded an f1-score of 0.83, which outperformed the base-line of 0.71. The use of an auxiliary classifier based on ICLR data did not improve the classification of NIPS papers, although several implementation methods were investigated. A feature analysis of the optimal citation impact prediction model did show several differences in topics and sentiment between the low and high impact papers. Additionally, this research yielded two corpora of peer reviews and infor-mation of papers from various editions of NIPS and ICLR. All in all, this thesis is hopefully a step towards further research in this field.
¹ Conference on Neural Information Processing Systems, https://nips.cc/.
² International Conference on Learning Representations, https://iclr.cc/.
C O N T E N T S
Abstract i
Preface iv
1 introduction 1
2 related work 3
3 data and material 7
3.1 Collection . . . 7
3.1.1 NIPS Citation Counts . . . 7
3.1.2 NIPS Reviews . . . 12
3.1.3 ICLR Reviews on OpenReview.net . . . 15
3.2 Annotation . . . 20
3.2.1 NIPS Reviews . . . 20
3.2.2 ICLR Reviews on OpenReview.net . . . 21
3.3 Processing . . . 22
3.3.1 Citation impact . . . 22
Citation growth rates . . . 23
Citation impact . . . 24
3.3.2 Textual information . . . 28
3.3.3 Specific features . . . 30
4 method 32
4.1 Classification of NIPS papers . . . 32
4.1.1 Dataset . . . 32
4.1.2 Development . . . 33
4.1.3 Testing . . . 33
4.1.4 Evaluation . . . 33
4.1.5 Feature analysis . . . 34
4.2 Classification of ICLR papers . . . 34
4.2.1 Dataset . . . 34
4.2.2 Development . . . 35
4.2.3 Testing . . . 35
4.2.4 Evaluation . . . 36
4.3 Expanding the NIPS classifier . . . 36
4.3.1 Expanding the dataset . . . 36
4.3.2 Development . . . 37
4.3.3 Testing . . . 37
4.3.4 Evaluation . . . 37
4.4 Stacking . . . 37
4.4.1 Dataset . . . 37
4.4.2 Experiments . . . 38
4.4.3 Evaluation . . . 38
5 results and discussion 39
5.1 General findings . . . 39
5.2 Classification of NIPS papers . . . 40
5.2.1 Development phase . . . 40
5.2.2 Testing phase . . . 42
5.2.3 Feature analysis . . . 43
Low impact papers . . . 44
High impact papers . . . 45
5.2.4 Discussion . . . 47
5.3 Classification of ICLR papers . . . 48
5.3.1 Development phase . . . 48
5.3.2 Testing phase . . . 49
5.3.3 Discussion . . . 50
5.4 NIPS classification using the extended classifier . . . 51
5.4.1 Development phase . . . 51
5.4.2 Testing phase . . . 52
5.4.3 Discussion . . . 53
Rating prediction . . . 53
5.5 NIPS classification using stacking . . . 53
5.5.1 Discussion . . . 54
6 conclusion and future work 55
Appendix 58
a modifiability of the data processing 59
b classification of nips papers: feature combinations 61
c classification of nips papers: results of the development phase 63
d classification of nips papers: informative features 65
e classification of nips papers: informative features with 10 most
likely collocations 69
f classification of iclr papers: feature combinations 74
g classification of iclr papers: results of the development phase 76
h extended nips classifier: feature combinations 78
P R E F A C E
In 2013 I started studying the Bachelor Information Science at the University of Groningen (UoG). During this Bachelor I became acquainted with, amongst other things, the processing of Big Data, the field of Computational Linguistics and Machine Learning. In my Bachelor thesis I combined these research fields with my interest in politics: the automatic classification of Dutch Twitter users based on political preference was explored. The results of the study were striking. Many prejudices about supporters of certain political parties could be demonstrated using a feature analysis of an implemented machine learning model. Together with my thesis supervisor B. Plank and a fellow student, I submitted a combined Author Profiling paper to the Computational Linguistics in the Netherlands (CLIN) Journal (van Dalen et al., 2017). During the CLIN27 conference, I got acquainted with and became interested in the scientific world. When I finished my Bachelor and started the Master Information Science at the UoG, this interest was further stimulated. In June 2017 my supervisor shared with me her interest in doing meta-research on scientific papers. In this field of research, I could further express my interest in the scientific world. I decided to use the knowledge about Machine Learning I gained during the Master in this meta-research on scientific papers: predicting the impact of scientific papers.
During the creation of this thesis I received the full support of my friends, parents and family. Partly thanks to them, I was able to do this research and produce this document. A special thanks goes to Stijn Eikelboom for his valuable suggestions on improving this study. Besides their support, over the last three years the department of Information Science in Groningen has taught and supported me in a professional and accessible way. I am very thankful to my supervisor during the period of my research and the creation of this thesis. Without her, I would not have been able to make this thesis into what it is now.
I am very pleased with the final product of my thesis. In creating it, I used all the skills I learned during my studies at the UoG. By combining all these skills and the knowledge they entail, I was able to create this thesis, which I hope will contribute to the scientific world.
1
I N T R O D U C T I O N
All around the world, researchers try to solve divergent scientific problems, from difficult mathematical issues to ethical questions. In all of these fields, there are scientific conferences where people of the scientific world contribute to our joint knowledge. Without all this unique research, the world would not be where it is now. From improvements in the agricultural sector to the newest medical techniques, sharing knowledge in all scientific fields is of great importance. Because of the many research possibilities, the world is regularly introduced
to new research through scientific articles in scientific journals such as Science¹ or Nature². However, there are also smaller scientific conferences with their own journals in which interesting studies appear. All scientific articles are of great importance for new scientific research. In these articles researchers, for example, present new techniques and encourage others to conduct further research. This further research can be in the same field of research, but can also use the discovered techniques in a completely different field.
When a discovered technique is so groundbreaking that it can be used in a multidisciplinary way, the accompanying scientific article will receive many citations. The number of citations is important for a researcher in terms of contributing to the field and attracting funding. In addition, a high number of citations from a large audience can serve as a reward for the hard work of the researcher. But probably most important is that a lot of future research will be conducted in a field that is of great importance. It is therefore interesting to investigate which factors influence the citation impact. Are there meta-variables that cause one paper to be cited more often than another?
Typically, papers submitted to a conference are peer reviewed. These reviews are of great importance. Reviewers provide recommendations to conference chairs and journal editors, who decide if a scientific paper is accepted or rejected for a conference or journal. Does the content of these reviews perhaps tell us more about a paper than just whether it should be accepted or not? To investigate this, this study focuses on detecting signals in peer review texts that are predictive of citation impact. It intends to answer the following research question: Is it possible to predict the citation impact of scientific papers using their peer reviews? It is hypothesized that scientific papers with a high citation impact contain more positive feedback than papers with a low citation impact. Therefore, it should be possible to predict the citation impact on the basis of peer reviews.
To answer this research question, two corpora with review texts were created. For one of the corpora, the citation scores of the corresponding papers were collected. Subsequently, the data of both corpora were used to try to predict the citation impact using machine learning.
The first created corpus contains peer reviews scraped from the proceedings website of the Conference on Neural Information Processing Systems (NIPS). In total 2,421 papers with corresponding reviews were collected. The second corpus consists of review texts of 894 scientific papers from the International Conference on Learning Representations (ICLR) that were scraped from OpenReview.net. From April 2016 up to and including May 2018, the citation scores of papers from five NIPS editions were collected from Google Scholar, approximately every six months. The collected citation scores were used to determine the citation impact of NIPS scientific papers. Using supervised Machine Learning, a model was created
¹ http://www.sciencemag.org
² https://www.nature.com
to predict this citation impact. The reviews from ICLR were used to create an auxiliary classifier which predicts the paper acceptance of scientific papers. Two different implementations were explored for combining the auxiliary and the citation impact classifiers to improve the classification of NIPS scientific papers.
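The combination methods themselves are described in Chapter 4. As background, the general stacking pattern, in which the auxiliary model's prediction feeds the main model as an extra feature, can be sketched as follows. This is a minimal illustration with synthetic data and scikit-learn, not the thesis's exact implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: X_iclr with acceptance labels (auxiliary task),
# X_nips with citation impact labels (main task).
X_iclr = rng.normal(size=(100, 5))
y_iclr = (X_iclr[:, 0] > 0).astype(int)                 # accept / reject
X_nips = rng.normal(size=(80, 5))
y_nips = (X_nips[:, 0] + X_nips[:, 1] > 0).astype(int)  # low / high impact

# Train the auxiliary classifier on the acceptance task.
aux = LogisticRegression().fit(X_iclr, y_iclr)

# Stacking: the auxiliary model's predicted acceptance probability
# becomes one extra feature for the citation impact classifier.
extra = aux.predict_proba(X_nips)[:, [1]]
main = LogisticRegression().fit(np.hstack([X_nips, extra]), y_nips)

print(main.score(np.hstack([X_nips, extra]), y_nips))
```

Whether the extra feature helps depends on how related the two tasks are; as reported in the abstract, for this thesis the auxiliary classifier did not improve NIPS classification.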
First, the related work in the field of predicting citation impact is discussed in Chapter 2. Next, the creation of the two corpora is discussed in detail in Chapter 3. The collection of the citation scores, the annotation of the papers in both corpora and the pre-processing of the data for Machine Learning are also discussed in this chapter. The creation of the different classification models is described in Chapter 4. In
2
R E L A T E D W O R K
There are different ways of determining the impact of research. For example, the impact of a specific paper or that of an individual researcher could be investigated. In the latter case, one could for example look at the publication and citation records of an author. For a specific paper, its citations could be explored to determine its impact. In both cases, citation counts are a central measure.
Impact of individual researchers
In a study conducted by Hirsch, citation counts are also of importance. In 2005, Hirsch introduced the h-index, an index defined as the number of papers with citation number ≥ h. Hirsch introduced this index as a useful tool to characterize the scientific output of a researcher. With the h-index, Hirsch wants to solve the question of how to quantify the cumulative impact and relevance of an individual's scientific research output. He defines h as: "A scientist has index h if h of his or her Np papers have at least h citations each and the other (Np − h) papers have ≤ h citations each" (Hirsch, 2005). Hirsch argues that the h-index gives a more realistic view of the scientific impact of an author's publications than other measures: the number of papers (no measure of importance or impact), the number of citations (hard to collect and possibly inflated by a small number of papers with many citations) and citations per paper (rewards low productivity and penalizes high productivity). Hirsch's method is a generally accepted way of defining the impact of an author's research. A range of research has been conducted on predicting the h-index of an author.
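Hirsch's definition translates directly into a computation over an author's per-paper citation counts; a minimal sketch (the function name is my own):

```python
def h_index(citations):
    """Return the largest h such that at least h papers
    have at least h citations each (Hirsch, 2005)."""
    # Rank papers by citation count, highest first, and find the
    # last rank at which the count still reaches the rank itself.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four papers have at least 4 citations
```

An author with many barely-cited papers and an author with one highly cited paper both receive a low h, which is exactly the balance between productivity and impact Hirsch aimed for.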
Dong et al. (2015) conducted a prediction task to determine whether a given paper will, within a pre-defined timeframe, increase the h-index of its primary author. In their study, six categories of factors were employed, comprising author, topic, reference, social, venue and temporal attributes. They found two predictive factors: the researcher's authority on the publication topic (denoted by being highly cited by others in a specific domain of expertise) and the venue in which the paper was published. Surprisingly, the topic popularity and the co-authors' h-index were of little relevance in increasing the author's h-index.
Impact of papers
Besides predicting the impact of an author's research, various ways of predicting the impact of scientific papers have been explored. Mcnamara et al. (2013) researched the usefulness of citation network features in predicting high impact papers. They argue that being able to predict 'the next big thing' allows the allocation of resources to fields where rapid developments are occurring. In the study of Mcnamara et al., network features including the neighborhood in the citation network and measures of interdisciplinarity were explored. Using these features and the Scopus Database of Elsevier, which contains 24,097,496 papers published during the years 1995-2012, Mcnamara et al. tried to predict paper impact. The paper impact was defined by taking citations over several years into account. The algorithms used in predicting the impact were linear regression, decision trees and random forests. Mcnamara et al. found multiple predictors of high impact papers. Papers with early high citation counts, citation counts by a paper and citations of and by highly cited papers proved predictive of high impact. Interdisciplinary citations of a paper, and of the papers that cite it, were also found to be predictors of high impact papers. In
contrast to the research by Mcnamara et al., this thesis explores textual and peer review features instead of network features.
Another approach to exploring the citations of scientific articles was used in the study of Vieira and Gomes (2010). Instead of exploring network features, Vieira and Gomes analyzed the dependence of citation growth on article features. In the study, 226,166 articles published in 2004 in the fields of Biology & Biochemistry, Chemistry, Mathematics and Physics were analyzed. The number of citations and the article features co-authors, institutional addresses, number of pages, number of references, journals and number of citations (approximately a 5-year window) were collected using the Web of Science (WoS) corpus. To analyze the dependence of the citation growth, a mean citation rate was calculated for each paper. Vieira and Gomes found a dependence of the mean citation rate on the number of co-authors, the number of addresses and the number of references. For the relation between the mean impact and the number of pages, the dependence obtained by Vieira and Gomes was very low. In this thesis, a method similar to that of Vieira and Gomes was used to determine the citation growth of a paper: Vieira and Gomes used citation counts in approximately a 5-year window and calculated a mean citation rate for each paper. The citation growth rate as calculated in this thesis is inspired by the technique used by Vieira and Gomes.
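The exact growth rate formula used in this thesis is given in Chapter 3. As a rough illustration of the general idea behind a Vieira-and-Gomes-style mean citation rate, one could average the increase in citations across successive collection moments; the function and variable names below are my own assumptions:

```python
def mean_citation_rate(counts):
    """Average increase in citations per collection interval, given
    cumulative citation counts from successive collection moments
    (e.g. citations1 .. citations5)."""
    if len(counts) < 2:
        raise ValueError("need at least two collection moments")
    # Per-interval increases between consecutive cumulative counts.
    increases = [b - a for a, b in zip(counts, counts[1:])]
    return sum(increases) / len(increases)

# A paper cited 10, 25, 60 and 120 times over four collection moments:
print(mean_citation_rate([10, 25, 60, 120]))  # (15 + 35 + 60) / 3
```

Because the increases telescope, this equals the total growth divided by the number of intervals; a real implementation would additionally normalize for the unequal time spans between collection moments.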
In a study conducted by Haslam et al. (2008), more factors contributing to citation impact were explored. Haslam et al. explored multiple impact predictors of 308 standard articles published in 1996 in three of the primary journals on social-personality psychology. The citation counts were collected in mid-July 2006. A broad range of predictors was explored in this study: author characteristics (number of authors, gender, nationality and eminence), institutional factors (university prestige, journal prestige and grant support), article features (title characteristics, number of studies, figures and tables, number and recency of references) and research approach (topic area and methodology). Using statistical multivariate analysis, Haslam et al. found that the following factors were predictive of a high number of citations: first author eminence, the later author being more senior, journal prestige, article length and the number and recency of references.
Jacques and Sebire (2010) hypothesized that specific features of journal titles may be related to citation rates. They investigated this by reviewing the title characteristics of the 25 most cited and the 25 least cited articles published in general and specialist medical journals. Using a Mann-Whitney U test, Jacques and Sebire found a positive correlation between citation rate and the length of paper titles, the use of colons in paper titles and the presence of an acronym. Because of this positive correlation, Jacques and Sebire suggest that the effect of the construction of an article title on citation impact may be underestimated.
A somewhat similar study was conducted in 2009 by Webster et al. In their research, they wanted to help answer the question of which variables are predictive of high citation counts. Instead of only investigating meta-features of titles, they focused on word frequencies in titles during certain periods. The meta-feature number of references was also explored in this study. Webster et al. analyzed 808 articles containing 8,631 title words from the psychology journal Evolution and Human Behavior between 1979 and 2008. For different periods, they found different words that could be identified as hot topics. They also found that articles that cite more references are in turn cited more themselves.
Other title features were investigated by Jamali and Nikzad (2011). 2,172 papers published in 2007 in six journals of the Public Library of Science were explored. The articles came from the research fields of biology, medicine, computational biology, genetics and pathogens. Jamali and Nikzad categorized all titles as descriptive, indicative or question. Furthermore, the titles were featurized by the number of words they contain. Using statistical difference and correlation tests, Jamali and Nikzad found correlations between the explored features and the number of downloads and citations. Papers with question titles tended to be downloaded more, but cited less, than others. Papers with longer titles were downloaded slightly less often than papers with shorter titles. The study also showed that titles with colons tended to be longer and receive fewer downloads and citations. Jamali and Nikzad also found a positive correlation between the number of downloads and citations.
Another study that shows a positive relation between the number of downloads and citations was conducted by Subotic and Mukherjee (2014). In this study, a unified investigation of article title characteristics in relation to subsequent article citation and download rates was conducted. Subotic and Mukherjee combined article title features that were previously studied mostly individually. They carefully selected 258 articles from 41 different psychology journals. Similar to the findings of Jamali and Nikzad (2011), Subotic and Mukherjee found, using a statistical analysis, that shorter titles were associated with more citations. The study also concluded that the title amusement level was slightly correlated with downloads, but not with the number of citations. Furthermore, Subotic and Mukherjee found that amusing titles tended to be shorter.
Various studies show that there is a relation between the paper title and the number of citations or download rates. However, these studies mainly focused on specific scientific fields. In 2016, Hudson did a broader study of the relation between article features and citation impact. In this study, Hudson analyzed 52,000 articles that were submitted in 2014 to the UK's four-year Research Excellence Framework (REF). The analyzed articles were from 36 different disciplines. Using statistical regression, Hudson found that the title characteristics varied considerably between the different disciplines. However, the research shows four main findings. First, Hudson found that the lengths of the titles increased with the number of authors in almost all disciplines. The second finding was that the use of colons and question marks tended to decline with increasing numbers of authors. Third, papers published later in the 4-year period tended to have more authors than those published earlier. The fourth finding of the study is interesting for the research on citation impact. In some disciplines, the number of subsequent citations to papers was higher when the titles were shorter and when they employed colons. The citations were lower when paper titles employed question marks (Hudson, 2016). Again this shows that, in some fields, there is a relation between the paper title and the number of citations. Some parts of the approach of Hudson are also used to predict the citation impact in this thesis: the use of question marks and colons in paper titles is considered in this prediction task.
The previously described studies show that in various areas of research, different article features are predictors of, or related to, scientific impact. Bergsma et al. (2012) showed that certain article features are also predictors of hidden attributes such as whether the author of a paper is a native English speaker, whether the author is male or female and whether the paper was published in conference or workshop proceedings. Training three linear SVM classifiers with Bag of Words features on three datasets containing approximately 400 papers each, Bergsma et al. were able to outperform a minority class baseline in predicting all three hidden attributes. By adding stylistic and syntactic features to the models, the performance of the classifiers improved even more. In predicting the native language of an author, one of the classifiers reached an f1-score of 91.6, almost 42 percentage points better than the baseline of 49.8. Interesting in this study is the use of features extracted from the article itself. It could be that, for example, the abstract is not only predictive of the hidden attributes, but also of the citation impact. It is worth investigating whether features representing the content of an article are predictive of citation impact.
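A Bergsma-et-al.-style setup, bag-of-words features feeding a linear SVM, can be sketched with scikit-learn; the texts and labels below are toy stand-ins, not data from the study:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for document texts and their class labels.
texts = [
    "novel method with strong results and thorough experiments",
    "unclear writing, weak baselines, missing related work",
    "well written paper, convincing evaluation, solid contribution",
    "poorly motivated, limited novelty, insufficient experiments",
]
labels = ["high", "low", "high", "low"]

# Bag-of-words features feeding a linear SVM.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["convincing experiments and a solid novel method"]))
```

The same pipeline shape carries over to this thesis's task: swap in peer review texts and citation impact labels, and extend the vectorizer with stylistic or hand-crafted features where available.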
A novelty in the field of investigating predictors of citation impact is the use of peer reviews. Kang et al. (2018) present the first public dataset of scientific peer reviews available for research purposes (PeerRead v1). However, before the publication of that study, two corpora with peer reviews from the NIPS and ICLR conferences had already been created for this thesis. The dataset Kang et al. created consists of 14.7K papers with their peer reviews and corresponding accept/reject decisions. For a subset of 3K papers, the dataset contains 10.7K textual reviews. Kang et al. created PeerRead v1 using three strategies. The first was collaborating with conference chairs and conference management systems to allow authors and reviewers to opt in their paper drafts and peer reviews, respectively. Secondly, Kang et al. crawled publicly available peer reviews and labeled these with numerical scores such as clarity and impact. Last, they crawled arXiv submissions which coincide with important conference submission dates and checked whether a similar paper appeared in the proceedings of these conferences at a later date. The dataset
created by Kang et al. differs in size from the corpora created for this thesis. PeerRead v1 contains papers from the 2017 ICLR edition, whereas the ICLR corpus created for this thesis contains peer reviews from the 2013, 2014, 2016 and 2017 ICLR editions. The PeerRead v1 dataset consists of the same NIPS papers and reviews as the NIPS corpus created for this thesis. Furthermore, PeerRead v1 contains reviews from ACL 2017 and CoNLL 2016. The dataset created by Kang et al. also includes another 11,778 papers without reviews, but labeled with their accept/reject decisions. Kang et al. also conducted a classification task with a part of their dataset, in which they tried to predict the paper acceptance of ICLR and arXiv papers. Both models performed better than the majority baseline. Most relevant for this thesis is the classification model built for predicting ICLR acceptance. With the most optimal settings, this model reached an accuracy of 65.3%, outperforming the majority baseline of 57.6%. In creating this optimal model, Kang et al. implemented a feature set containing 22 coarse features and sparse and dense lexical features. A feature analysis showed that the following features were most predictive of paper acceptance: use of an appendix (True/False), number of theorems (Integer), number of equations (Integer), average number of references (Float), use of 'state-of-the-art' in the abstract (True/False) and number of cited papers published in the last five years (Integer). In this thesis, a classification model that predicts ICLR paper acceptance was also created. However, this model differs from the model created by Kang et al. in the implemented feature combinations. In this thesis, limited article features were available and therefore implemented: only lexical features of the abstract were used. This thesis mainly focused on trying to predict the paper acceptance based on the peer reviews of the papers.
As rightly concluded by Kang et al., access to peer reviews is limited. Therefore, both the author of this thesis and Kang et al. chose to collect the peer reviews of NIPS, because NIPS is one of the few conferences that made its peer reviews publicly accessible. As described by Hirsch, citation counts are hard to collect as well. While Mcnamara et al. had access to the Scopus Database of Elsevier, which also contains information about citations, the author of this thesis only had access to public sites like Google Scholar for collecting citations.
In predicting the citation impact of NIPS scientific papers, this thesis explores the relatively novel use of peer reviews as predictors of citation impact. However, predictors that were investigated by the researchers described above were also implemented in the development of the classification model that predicts the citation impact of NIPS papers. In this thesis these predictors are referred to as the specific features derived from previous work. This specific feature set includes the use of question marks in paper titles (Hudson (2016), Jamali and Nikzad (2011)), the use of colons in paper titles (Jacques and Sebire (2010), Jamali and Nikzad (2011), Hudson (2016)), the length of the paper title (Jamali and Nikzad (2011), Subotic and Mukherjee (2014), Hudson (2016)) and the number of authors of a paper (Vieira and Gomes (2010), Hudson (2016)).
3
D A T A A N D M A T E R I A L
3.1 collection
For this research two different types of data needed to be collected: peer reviews and citation counts. Both data types were hard to collect: there are very few conferences that make reviews publicly available, and there is no central place where citation counts can be downloaded easily. For both types of data, the Conference on Neural Information Processing Systems (NIPS)¹ and the International Conference on Learning Representations (ICLR)² play a key role.
NIPS is a machine learning and computational neuroscience conference held every December³. Because NIPS was one of the first conferences that made peer review public and keeps its records in a way that is easy to access⁴, peer reviews (referred to as reviews) of NIPS scientific papers were collected, and their corresponding citation counts were collected from Google Scholar⁵.
Just like NIPS, ICLR is a conference on machine learning. In contrast to NIPS, ICLR is held every April. Reviews of ICLR scientific papers were collected from the website OpenReview.net⁶. This website contains peer reviews of 13 different conferences⁷. The choice to collect reviews of ICLR was made because of the similarities between the subjects of NIPS and ICLR. The collection of the reviews and citation counts is described in the following subsections.
3.1.1 NIPS Citation Counts
In order to get the citation scores of the NIPS scientific papers, initially a list of papers was created for the editions of 2013, 2014, 2015, 2016 and 2017. This was done automatically using a Python script.
In a first attempt to collect the citation counts, the Python module scholar.py⁸ was used to scrape the citation counts from Google Scholar. Unfortunately this attempt failed because Google did not allow the required number of requests to its server. Therefore, a human annotator was needed to record the citation scores from Google Scholar. The annotator was helped by a Python script that automatically opened the Google Scholar page with the search results for a specific paper and allowed the annotator to enter the number of citations. The only thing the annotator had to do was type over the right number, as marked in red in Figure 1. The recording of the citation scores was started in April 2016 (citations1) by the supervisor of this thesis. The citation scores were also recorded in November 2016 (citations2), June 2017 (citations3), March 2018 (citations4) and May 2018 (citations5). Table 1 shows an overview of the recorded citation counts. During these periods, the available citation counts of the NIPS 2013 up to and including 2017 editions were collected.
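The annotation script itself is not reproduced here; a hypothetical reimplementation of its core loop (the URL format, function names and CSV layout are my own assumptions) could look like this:

```python
import csv
import urllib.parse
import webbrowser

def scholar_search_url(title):
    """Build a Google Scholar search URL for a paper title."""
    return ("https://scholar.google.com/scholar?q="
            + urllib.parse.quote_plus(title))

def annotate(titles, out_path):
    """Open each paper's Scholar results page in the browser and
    write the citation count the annotator types in to a CSV file."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "citations"])
        for title in titles:
            webbrowser.open(scholar_search_url(title))
            count = input(f"Citations for '{title}': ")
            writer.writerow([title, count])

print(scholar_search_url("Generative Adversarial Nets"))
```

Keeping the browser in the loop sidesteps Google's rate limiting, since a human issues the requests at human speed; the script only removes the typing and bookkeeping overhead.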
For almost all of the NIPS papers, the citation counts were accurately recorded by the author of this thesis. Unfortunately, some citation counts were not collected due to technical problems. Due to this, no citation counts were recorded for four
¹ https://nips.cc
² https://iclr.cc
³ https://en.wikipedia.org/wiki/Conference_on_Neural_Information_Processing_Systems
⁴ https://papers.nips.cc
⁵ http://scholar.google.com
⁶ https://openreview.net
⁷ https://openreview.net/, on 2018-04-17
⁸ https://github.com/ckreibich/scholar.py
papers of the 2017 edition. All collected citation counts are stored in CSV files; for each year, there is a CSV file containing the citation counts for the different periods.
Figure 1: The annotator had to type over the number of citations.
NIPS Edition   citations1     citations2     citations3     citations4     citations5
               (April 2016)   (Nov. 2016)    (June 2017)    (March 2018)   (May 2018)
2013           recorded       recorded       recorded       recorded       not recorded
2014           recorded       recorded       recorded       recorded       not recorded
2015           recorded       recorded       recorded       recorded       not recorded
2016           not recorded   not recorded   recorded       recorded       not recorded
2017           not recorded   not recorded   not recorded   recorded       recorded

Table 1: Overview of the recorded NIPS citation counts.9
In Figure 2, the absolute increase in citations for each edition of NIPS is plotted against the first collected citations. The figure shows the growth of citations from the first up to the most recent moment the citations were collected. In the plot for NIPS 2013, one paper stands out with a remarkably high growth in citations: its absolute growth between citations1 and citations4 is much bigger than that of the other NIPS 2013 papers. The plots for NIPS 2014, 2015, 2016 and 2017 show a similar picture. In all editions, there is one paper that really stands out in terms of the absolute increase in citations. A list of these papers can be seen in Table 2.
When the paper with the highest increase is removed from the NIPS 2013 plot, the plots of the different NIPS editions look more similar (see Figure 3). This paper is therefore an exception in citation growth. It is clear that for each edition of NIPS, there are papers with a normal growth and papers with a higher growth. The papers with the more normal growth are clustered in the bottom left corner of the plot. Papers with a faster growth float from the bottom left up to the top right corner. In Figure 4, the clustered papers are indicated with a red dotted line and the floating papers with a green dotted line.
The plot of the growth in citations for the NIPS 2017 edition differs somewhat from the plots of the other editions: there is no outstanding paper and there are fewer floating papers. This is probably because of the difference in time between the periods in which the citations were collected. For the NIPS 2017 edition, there are only two months between citations4 and citations5 (March 2018 to May 2018), whereas for the other editions the time between collection periods is seven up to nine months. The plot of NIPS 2017 shows that the development of citation growth is already visible, but not as clearly as in the other plots.

9 In May 2018, only the citation counts for the NIPS 2017 edition were collected, to ensure that at least two consecutive collected citation counts were available for each edition.
The plot in the bottom right corner of Figure 2 shows the citation growth of all the NIPS editions together. Here the most recent absolute number of citations is plotted against the absolute increase in citations. The outstanding paper of NIPS 2013 is clearly visible in this plot. When this paper is removed to zoom in on the plot (Figure 5), the clustered and floating papers are well observable. However, the floating papers are closer to the clustered papers, which is mainly due to another outstanding paper in the top right corner.

The different plots of the collected citation counts show that there is distinguishable variation in the amount of growth between the NIPS scientific papers. The processing and combining of the citation counts with the NIPS reviews is described in Section 3.3.
NIPS Edition   Farthest outlier
2013           Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013)
2014           Sequence to Sequence Learning with Neural Networks (Sutskever et al., 2014)
2015           Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (Ren et al., 2015)
2016           Improved Techniques for Training GANs (Salimans et al., 2016)
2017           Dual Path Networks (Chen et al., 2017)

Table 2: Papers with the highest growth in citations and the highest number of citations in each NIPS edition.
Figure 3: Citations of NIPS 2013 Scientific Papers.
Figure 4: NIPS 2013: Clustered papers with normal growth and floating papers with faster growth.
Figure 5: Citations of NIPS 2013 - 2017.
3.1.2 NIPS Reviews
As mentioned before, NIPS has an easily accessible proceedings website. However, the available information was not homogeneous across the different NIPS editions. The site contains not only the information of the NIPS scientific papers, but also the peer reviews of the conference reviewers. Using Beautiful Soup10, the available information was scraped from the NIPS proceedings website. An overview of the information that was scraped can be seen in Table 3. A description of the information that was scraped for each individual review can be found in Table 4. As that table shows, the available information is not the same for every NIPS edition. An overview of the differences can be seen in Table 5.
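The scraping step can be sketched with Beautiful Soup roughly as follows. The tag and class selectors below are assumptions about the proceedings page layout at collection time, not the exact markup, and would need adjusting against the real pages:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def parse_paper_page(html):
    """Extract paper information from the HTML of one NIPS Proceedings
    paper page. The tag/class names are illustrative assumptions."""
    soup = BeautifulSoup(html, "html.parser")
    links = [a.get("href", "") for a in soup.find_all("a")]
    return {
        "title": soup.find("h2").get_text(strip=True),
        "abstract": soup.find("p", class_="abstract").get_text(strip=True),
        # Author pages are linked with relative /author/... URLs.
        "author_ids": [h for h in links if h.startswith("/author/")],
        # The first link ending in .pdf is taken as the paper PDF.
        "pdf_url": next((h for h in links if h.endswith(".pdf")), None),
    }

def scrape_paper_page(url):
    """Fetch one paper page and parse it."""
    with urlopen(url) as resp:
        return parse_paper_page(resp.read())
```

Separating fetching (`scrape_paper_page`) from parsing (`parse_paper_page`) keeps the HTML-extraction logic testable without network access.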
Information Description
ID The unique paper ID of the NIPS Proceedings website.
Title The organizational title of the scientific paper.
Abstract The abstract of the paper.
URL The URL to the NIPS Proceedings paper page.
BIB URL The URL to the BIB file of the scientific paper.
PDF URL The URL to the PDF file of the scientific paper.
Presentation type The medium used to present the scientific paper. For
example Poster or Oral.
Authors A list of the authors of the paper.
Author IDs The relative URLs of the NIPS Proceedings author pages.
Path to reviews The local path and file name to a copy of the NIPS review HTML file.
Submission paper ID The submission ID of the NIPS paper.
Reviews Peer reviews of the paper. See Table 4 for a detailed description of the scraped information for a single review.
Table 3: An overview of the information that was scraped from the individual web pages of
the NIPS papers.
The reviews and the information about the papers were saved into JSON files: for each edition of the NIPS conference, a separate JSON file was created. The collected citation counts as described in Section 3.1.1 were also included in these JSON files.
Information Description

ID The name or ID of the reviewer. This is always a string that contains a number of the reviewer. For example: Assigned_Reviewer_13 or Reviewer 3.

Review The review text written by the reviewer. For the NIPS editions 2013, 2014 and 2015, the reviewers were asked to write a review in the following way: "First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.". In 2016 the reviewers were asked to write a "Qualitative Assessment". In 2017 there was no heading for the actual review.

Summary The summary of the review. For the NIPS editions 2013, 2014 and 2015 the reviewers were asked to write a summary in the following way: "Please summarize your review in 1-2 sentences". In 2016 the reviewers were simply asked to write a "Summary". In 2017 the reviewers were not asked to write a summary of the review.

Reviewer confidence In 2016 the reviewers were asked to specify their confidence in the review. The reviewers could choose one out of the following three options:
1. Less confident (might not have understood significant parts);
2. Confident (read it all; understood it all reasonably well);
3. Expert (read the paper in detail, know the area, quite certain of my opinion).

Author feedback In 2013, 2014 and 2015 the author was asked to give feedback based on the reviews of the reviewers.

Table 4: An overview of the scraped data for an individual NIPS review.
The original HTML source files of the reviews were subsequently saved into a separate folder for each edition. An example of the JSON structure of a specific paper can be seen in Listing 1.
NIPS Edition   Reviewer ID   Review      Summary         Reviewer confidence   Author feedback
2013           available     available   available       not available         available
2014           available     available   available       not available         available
2015           available     available   available       not available         available
2016           available     available   available       available             not available
2017           available     available   not available   not available         not available

Table 5: Overview of the available review information for papers of the different NIPS editions.
The NIPS reviews dataset contains 9,163 reviews in total. Some statistics about the dataset can be seen in Table 6. On average, four peer reviews were written per paper.
{
  "5138": {
    "paper_id": "5138",
    "paper_title": "The Randomized Dependence Coefficient",
    "abstract": "We introduce the Randomized Dependence Coefficient (RDC), ...",
    "url": "https://papers.nips.cc/paper/5138-the-randomized-dependence-coefficient",
    "bib": "http://papers.nips.cc/paper/5138-the-randomized-dependence-coefficient/...",
    "pdf": "http://papers.nips.cc/paper/5138-the-randomized-dependence-coefficient.pdf",
    "conf_event_type": "Poster",
    "authors": ["David Lopez-Paz", "Philipp Hennig", "Bernhard Schölkopf"],
    "authors_ids": [
      "/author/david-lopez-paz-6694",
      "/author/philipp-hennig-5163",
      "/author/bernhard-scholkopf-1472"
    ],
    "path_to_rev": "reviews_2013/5138.rev",
    "reviews_url": "https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/...",
    "submission_paper_id": "14",
    "reviews": {
      "Assigned_Reviewer_4": {
        "review": "The paper introduces a new method called RDC to measure the ...",
        "summary": "An interesting work combining several known ideas, but the ...",
        "review_confidence": null
      },
      "Assigned_Reviewer_5": {
        "review": "The RDC is a non-linear dependency estimator that satisfies ...",
        "summary": "RDC is a straightforward and computationally efficient ...",
        "review_confidence": null
      },
      "Assigned_Reviewer_6": {
        "review": "The authors propose a non-linear measure of dependence ...",
        "summary": "I think overall the work is extremely interesting and ...",
        "review_confidence": null
      },
      "Assigned_Reviewer_7": {
        "review": "This paper gives a new approach to nonlinear correlation. The ...",
        "summary": "New and interesting approach to the metric of nonlinear ...",
        "review_confidence": null
      }
    },
    "author_feedback": "Dear reviewers, Thank you for the supportive feedback and ...",
    "citations": {
      "citations1": 23,
      "citations2": 33,
      "citations3": 53,
      "citations4": 75
    }
  }
}

Listing 1: Example of the JSON structure of a specific NIPS paper.
Besides that, it includes ≈ 1.2 million words of review summaries and feedback on the reviews by the paper authors. On average, each review contains 312 words. The average number of words in the review summaries is 40. The author feedback contains on average the most words: authors write their feedback about the reviews in on average 644 words.
The created NIPS corpus was published on GitHub11 and is available for research purposes.
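Given the JSON structure of Listing 1, per-edition statistics like those in Table 6 can be recomputed along these lines. The file path and the exact tokenization (whitespace splitting) are illustrative assumptions:

```python
import json

def corpus_stats(path):
    """Load one edition's JSON file (structured as in Listing 1) and
    report the number of papers, the total number of reviews, and the
    average number of words per review."""
    with open(path, encoding="utf-8") as f:
        papers = json.load(f)
    reviews = [rev["review"]
               for paper in papers.values()
               for rev in paper["reviews"].values()]
    words = [len(r.split()) for r in reviews]
    return {
        "papers": len(papers),
        "reviews": len(reviews),
        "avg_words_per_review": round(sum(words) / len(words)) if words else 0,
    }
```

The same pattern extends to summaries and author feedback by swapping in the corresponding JSON fields.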
NIPS Editions                          2013      2014      2015      2016      2017
Number of papers                        360       411       403       568       679
Average number of reviews per paper       3         3         4         6         3
Average confidence score                  –         –         –         2         –

Totals
  Reviews*                            1,132     1,278     1,536     3,240     1,977
  Summaries*                          1,132     1,278     1,536     3,240         –
  Author Feedback                       359       408       403         –         –

Total number of words
  Reviews*                          422,538   423,940   445,629   775,677   645,547
  Summaries*                         40,732    44,625    62,494   291,952         –
  Author Feedback                   225,784   269,379   259,455         –         –

Average number of words
  Reviews*                              373       332       290       239       327
  Summaries*                             36        35        41        90         –
  Author Feedback                       629       660       644         –         –

Table 6: Statistics of the NIPS Reviews dataset. Information marked with a * was used in the experiments of this thesis.
3.1.3 ICLR Reviews on OpenReview.net
OpenReview.net contains peer reviews for 25 different venues of 13 different conferences12. "OpenReview aims to promote openness in scientific communication, particularly the peer review process, by providing a flexible cloud-based web interface and underlying database API."13 Using this API, reviews of scientific papers of the International Conference on Learning Representations (ICLR) were collected. The reviews of the following editions were processed: 2013, 2014, 2016 and 2017. Unfortunately, the reviews of the 2015 ICLR edition were not available on OpenReview.
A separate JSON file was created for each ICLR edition. For each paper, two types of information were processed and saved into a JSON file: information about the paper itself and information about the reviews belonging to the paper. The
information collected about the paper itself is described in Table 7. Information
collected about individual reviews is described in Table 8.
An example of the JSON structure of a specific OpenReview paper can be seen in Listing 2. An overview of review and paper information with differing availability between ICLR editions can be seen in Table 9.
The corpus of ICLR reviews from OpenReview has a total of 2,697 reviews, 4,994 replies and 1,021 questions. In total, the corpus contains 783,848 words of review texts. It also contains ≈ 1 million words of replies to these reviews. The texts labeled as questions contain 98,417 words. On average, a review in the ICLR corpus consists of ≈ 291 words. A reply has an average length of ≈ 268 words. The texts in the ICLR 2017 edition that were marked as questions contain on average 96 words. More details and an overview of the quantity of the corpus can be seen in Table 10.

11 https://github.com/reinardvandalen/citations
12 https://openreview.net (on 2018-04-17)
13 https://openreview.net/about
In comparison to the corpus of NIPS reviews, the ICLR corpus is much smaller. While the NIPS corpus contains peer reviews of 2,421 papers, the ICLR corpus only contains reviews of 894 papers.
The collected ICLR corpus was published on GitHub14 and is available for research purposes.
Information Description
ID The unique OpenReview ID of the scientific paper.
Serial number The serial number of the paper for the particular ICLR edition.
Title The organizational title of the paper.
Abstract The abstract of the paper.
PDF The URL to the PDF file of the scientific paper.
Authors The author(s) of the paper.
Author IDs The unique OpenReview IDs of the author(s). On OpenReview
these IDs are the email addresses of the authors.
Keywords Some papers are labeled with one or more keywords. These keywords are saved into a list.
Track The type of track followed within the conference. For example:
workshop or conference.
Acceptance States if the paper is accepted or not.
Reviews The peer reviews of the paper. See Table 8 for a detailed description of the collected information for a single review.
Table 7: An overview of the information that was collected about the ICLR papers.
Information Description
ID A unique OpenReview review ID
Serial number The serial number of the review.
Parent The review ID of the parent.
When the review is a direct review to the paper, the parent is the OpenReview paper ID. If the review is a reply to an already existing review, the parent is the review ID of the review that was replied to.
Authors The author of the review. Sometimes the author of the paper
itself replies to one of the reviews. This can be checked by comparing the author of the paper with the author of the reply.
Title The title of the review.
Text The contents of the review, reply or question.
Type The type of the review. For the editions 2013, 2014 and 2016
this can be a review or a reply. For the 2017 edition this can also be a question.
Rating The rating a reviewer gives the paper. This information is only
available in the ICLR 2016 and 2017 edition. The reviewers could choose one out of the following scores:
1. Trivial or wrong;
2. Strong rejection;
3. Clear rejection;
4. Ok but not good enough;
5. Marginally below acceptance threshold;
6. Marginally above acceptance threshold;
7. Good paper, accept;
8. Top 50% of accepted papers, clear accept;
9. Top 15% of accepted papers, strong accept;
10. Top 5% of accepted papers, seminal paper.
Confidence The confidence of the reviewer in his or her review. This information is only available in the ICLR 2016 and 2017 editions. The reviewers could choose one out of the following scores:
1. The reviewer’s evaluation is an educated guess;
2. The reviewer is willing to defend the evaluation, but it is quite likely that the reviewer did not understand central parts of the paper;
3. The reviewer is fairly confident that the evaluation is correct;
4. The reviewer is confident but not absolutely certain that
the evaluation is correct;
5. The reviewer is absolutely certain that the evaluation is
correct and very familiar with the relevant literature.
Decision When the review is the final decision of the jury, the decision is:
"Invite to Workshop Track", "Reject", "Accept (Poster)" or "Accept (Oral)". This information is only available for the 2017 edition.
Replies If there are any replies to a review, these replies or reviews are nested here along with all review information as described in this table.
Table 8: An overview of the information that was collected for an individual ICLR review.
{
  "5Qbn4E0Njz4Si": {
    "paper_info": {
      "paper_id": "5Qbn4E0Njz4Si",
      "paper_nbr": 38,
      "title": "Hierarchical Data Representation Model - Multi-layer NMF",
      "abstract": "Understanding and representing the underlying structure of ...",
      "pdf": "https://openreviews.nethttps://arxiv.org/abs/1301.6316",
      "authors": ["Hyun-Ah Song", "Soo-Young Lee"],
      "author_ids": ["hi.hyunah@gmail.com", "longlivelee@gmail.com"],
      "keywords": [],
      "track": "workshop",
      "acceptance": true
    },
    "reviews": {
      "Oel6vaaN-neNQ": {
        "id": "Oel6vaaN-neNQ",
        "number": 2,
        "parent": "5Qbn4E0Njz4Si",
        "authors": ["anonymous reviewer 7984"],
        "title": "review of Hierarchical Data Representation Model - Multi-layer NMF",
        "text": "The paper proposes to stack NMF models on top of each other. ...",
        "type": "review",
        "rating": null,
        "confidence": null,
        "decision": null,
        "replies": {
          "-B7o-Yy0XjB0_": {
            "id": "-B7o-Yy0XjB0_",
            "number": 1,
            "parent": "Oel6vaaN-neNQ",
            "authors": ["Hyun-Ah Song"],
            "title": "",
            "text": "- Points on the con that experimental results are not great: ...",
            "type": "reply",
            "rating": null,
            "confidence": null,
            "decision": null,
            "replies": {}
          }
        }
      }
    }
  }
}

Listing 2: Example of the JSON structure of a specific ICLR OpenReview paper.
ICLR Edition   Rating          Confidence      Decision        Paper acceptance
2013           not available   not available   not available   available
2014           not available   not available   not available   not available
2016           available       available       not available   not available
2017           available       available       available       available

Table 9: An overview of review and paper information with differing availability between ICLR editions.
ICLR Editions                            2013      2014      2016      2017
Number of papers                           67        88       125       614
Average number of reviews per paper         5         5         1         3
Average number of replies per paper         1         2         1         8
Average number of questions per paper       –         –         –         2
Average rating score                        –         –         6         6
Average confidence score                    –         –         4         4
Average acceptance rate*                 0.82      0.42         –      0.52

Totals
  Reviews*                                309       446       183     1,759
  Replies                                  64       205        81     4,644
  Questions                                 –         –         –     1,021

Total number of words
  Reviews*                             96,334   154,086    41,735   491,693
  Replies                              19,427    69,410    17,792   962,716
  Questions                                 –         –         –    98,417

Average number of words
  Reviews*                                312       346       228       280
  Replies                                 304       339       220       207
  Questions                                 –         –         –        96

Table 10: Statistics of the ICLR OpenReview reviews. Information marked with a * was used in the experiments of this thesis.
3.2 annotation
The NIPS Reviews and ICLR Reviews corpora were annotated in different ways. The annotation process for both corpora is described in the following subsections.
3.2.1 NIPS Reviews
For the NIPS Reviews corpus, the collected citation counts as described in Section 3.1.1 were used to annotate the corpus. Every review is annotated with the available citation counts for the associated paper. As described in Section 3.1.1, the citation counts were collected by a human annotator. The supervisor of this thesis, Assoc. Prof. B. Plank15, started collecting the citations in April 2016. The author of this thesis continued to do so in June 2017. The citation counts were recorded in the following periods, with the aim of collecting the citations approximately every six months:

• citations1: April 2016
• citations2: November 2016
• citations3: June 2017
• citations4: March 2018
• citations5: May 2018
For the 2013, 2014 and 2015 editions, the citation counts of citations1, citations2, citations3 and citations4 are available. citations3 and citations4 are available for the 2016 edition. For the 2017 edition, citations4 and citations5 were collected. In May 2018, only the citations of the NIPS 2017 edition were collected, to ensure that at least two consecutive citation counts were available for each edition.
The annotation process started with automatically creating a list of the NIPS scientific papers using the Beautiful Soup Python scraper. At the moments mentioned above, a human annotator recorded the actual Google Scholar citation counts. The annotator was helped by a Python program, which automatically opened the Google Scholar search results page of the paper for which the citation counts needed to be collected. The program the annotator used can be seen in Figure 6. The human annotator only had to type over the Google Scholar citation scores. However, the use of Google Scholar comes with some limitations. The data collected by the annotator could be noisy because of the citation-scraping algorithm of Google Scholar. Furthermore, human errors could be made during the collecting. However, for this research, there was no other method of collecting citation counts than using Google Scholar. This is in contrast to publishers like Elsevier16, which do have exact information about citations, but keep it behind a paywall.
15 http://www.let.rug.nl/bplank/
Figure 6: The Python program the annotator used to record the citation counts from Google Scholar.
Due to technical problems, no citation counts were recorded for four papers of the 2017 edition. Because of this, these four papers were excluded from the research. Besides these technical issues, it could also happen that the annotator typed over the wrong number of citations. Because the citation counts were not checked by a second annotator, this has to be taken into account when using them. In the resulting dataset, however, there are no lower citation counts in consecutive periods. Sometimes Google adjusts the number of citations of a paper on Google Scholar: a paper could have 25 citations in June 2017 and 24 citations in May 2018. If this was the case, the 24 citations in May 2018 were adjusted to 25. This was done because it would have been difficult to decide how many periods had to be adjusted back in time; it was therefore decided to maintain the highest recorded citation count. The adjusting of the citation counts was only required for very few papers.
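Maintaining the highest recorded count amounts to taking a running maximum over the consecutive counts; a minimal sketch:

```python
def enforce_nondecreasing(counts):
    """Apply the correction described above: if Google Scholar later
    reports fewer citations than an earlier period, keep the highest
    count recorded so far (a running maximum)."""
    fixed, highest = [], 0
    for c in counts:
        highest = max(highest, c)
        fixed.append(highest)
    return fixed
```

For example, a recorded series `[23, 33, 25, 75]` becomes `[23, 33, 33, 75]`.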
As described earlier in Section 3.1.1, the citation counts were stored in CSV files. For each NIPS edition, a CSV file was created, with each column containing the counts of one collection period. The citation counts were also stored in the NIPS reviews dataset as described in Section 3.1.2.
Besides the citation counts, a part of the NIPS reviews dataset was annotated by the reviewers with the so-called Reviewer confidence (Table 4). This information is only available for the NIPS 2016 edition. As described in Table 4, each review was annotated by the reviewer with a confidence rating; the reviewers could choose between three options: less confident, confident and expert. In this research, the reviewer confidence scores are not used, but they are available in the NIPS reviews dataset (Section 3.1.2).
3.2.2 ICLR Reviews on OpenReview.net
The research conducted in this thesis tries to predict the citation impact of NIPS scientific papers. In order to improve the classification of these papers, a model was built that tries to predict the paper acceptance of ICLR papers using the reviews from OpenReview.net. It is expected that there are corresponding features in the NIPS and ICLR datasets, because both corpora contain peer reviews. It is therefore hypothesized that the addition of the ICLR classification model will improve the NIPS classification. To be able to build a model that predicts ICLR paper acceptance, every ICLR review was annotated with its paper acceptance, expressed as either true or false. For the 2013 and 2017 ICLR editions, the acceptance information was available on OpenReview.net. For annotating the papers of the 2014 edition, a technique called Distant Supervision was used. This technique was also used by, among others, Mintz et al. (2009), Read (2005) and Go et al. (2009). The acceptance information for the 2014 edition was scraped from the ICLR 2014 website17. Using Distant Supervision, the paper acceptance information was linked to the corresponding ICLR 2014 papers. Because the acceptance information was not available for the 2016 edition, these reviews were not annotated with paper acceptance.
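The distant-supervision step can be sketched as a lookup on normalized titles. The normalization and the field names (`"title"`, `"acceptance"`) are illustrative assumptions, not the original implementation:

```python
import re

def normalize(title):
    """Reduce a title to lowercase alphanumeric tokens so the same paper
    matches across differently formatted sources."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def label_acceptance(papers, accepted_titles):
    """Distant supervision: mark each paper as accepted when its
    normalized title occurs in the list scraped from the ICLR 2014
    website. `papers` maps paper IDs to dicts with a "title" key."""
    accepted = {normalize(t) for t in accepted_titles}
    for info in papers.values():
        info["acceptance"] = normalize(info["title"]) in accepted
    return papers
```

Normalizing both sides makes the match robust against punctuation and capitalization differences between OpenReview and the conference website.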
On OpenReview.net, Rating and Confidence scores were also available for the peer reviews of some ICLR editions. Just like the paper acceptance, these could hold predictive features for citation impact. Therefore, the ICLR reviews of the 2016 and 2017 editions were also annotated with a Rating and Confidence score, as described earlier in Section 3.1.3. Both scores were given by the author of a review. The author could give a scientific paper a rating of 1 to 10, where 1 is the lowest and 10 the highest score. The author's confidence in his or her own review was captured in the Confidence score, on a scale of 1 to 5. An extensive overview of the various scores can be seen in Table 8.
The paper acceptance, paper rating and reviewer confidence were all stored in the ICLR Reviews dataset (Section 3.1.3).
3.3 processing
In order to predict the impact of the NIPS scientific papers, the citation impact has to be defined. The NIPS reviews dataset only contains absolute citation scores, collected at specific points in time. In Section 3.3.1, the process of defining the citation impact is described. Besides annotated labels, the NIPS and ICLR corpora also contain textual information. This information needed to be processed so that it could be used to predict the impact of papers. The processing of this textual information is described in Section 3.3.2. In previous work by, for example, Hudson (2016), other features were indicated as predictive for the impact of papers. The creation of these features for the datasets is described in Section 3.3.3.
Figure 7: Timeline describing the periods in which citation counts were collected.
3.3.1 Citation impact
To predict the citation impact of NIPS scientific papers, citation impact needed to be defined. There were multiple options to do so. In any case, the citation growth needed to be calculated first. This was done by calculating a mean citation rate for each paper, as inspired by Vieira and Gomes (2010). The citation rates could have been interpreted directly as the citation impact, where high numbers represent a high citation impact and low numbers a low citation impact. The impact of NIPS scientific papers could in that case have been predicted using a regression task. However, this approach did not appear to be effective in early experiments: the performance of a regression model in trying to predict the citation rate was poor18.
Because of the poor performance of the regression task, it was decided to use a classification task to predict the citation impact of NIPS scientific papers. Therefore, the different citation rates needed to be classified into impact categories. In order to define these categories, a boxplot of the citation rates was created (Figure 9). The boxplot shows a clear distinction between two types of citation growth. The inliers in the boxplot were labelled as papers with low-med impact and the outliers as high impact. To see if a further classification of the citation rates was possible, a second boxplot was created (Figure 10). This boxplot represents the citation rates classified with a low-med impact, and again shows a clear distinction between two types of growth: a group of inliers and a group of outliers. This means that there were still papers that had about the same growth and papers that had a more extreme growth. The papers with about the same growth were labelled as low impact papers and those with the more extreme growth as medium impact papers.
In the sections below, the calculation of the citation growth and impact are de-scribed in more detail.
Citation growth rates
As a first step, the citation growth of a paper was calculated. Because the periods in between the different collection points differed, one can not directly use the absolute citation growth to indicate the growth over time. The periods in which the citation
counts were collected are plotted on a timeline in Figure 7. The time in between
these periods was indicated using v1, v2, v3 and v4. The first step in processing the citation counts, was calculating the absolute growth in citations for each of the described periods, the so called v_abs. The calculation of v_abs was performed by taking the number of citations of the successive period and subtract this from the
citations of the previous period (Equation1). Second, the growth rate of citations
was calculated, the so called v_rate. This calculation was performed for all periods. For this calculation the v_rate of the period in question was divided by the length
of that period (Equation 2). The last step in calculating the citation growth was to
calculate the average growth rate per month, the v_avg. This was calculated by
summing up all the v_rates and dividing this by the number of v_rates (Equation3).
An example of calculating the citation growth can be seen in Table11.
v_abs_x = citations_z − citations_y    (1)

Equation 1: Calculates the absolute growth in citations between two collection points. Where x is a period between two collection points, citations_z the number of absolute citations at the successive collection point and citations_y the number of absolute citations at the previous collection point.
v_rate_x = v_abs_x / v_length_x    (2)

Equation 2: Calculates the growth rate of citations within a period. Where x is a period between two collection points and v_length_x the length of that period in months.
v_avg = (v_rate_1 + v_rate_2 + ... + v_rate_n) / n    (3)

Equation 3: Calculates the average growth rate per month. Where n is the number of periods for which a growth rate was calculated.
Paper: The Randomized Dependence Coefficient
Conference: NIPS, edition 2013
Citation counts: citations1 = 23, citations2 = 33, citations3 = 53, citations4 = 75

v_abs:
  v_abs_v1 = citations2 − citations1 = 10
  v_abs_v2 = citations3 − citations2 = 20
  v_abs_v3 = citations4 − citations3 = 22

v_rate:
  v_rate_v1 = v_abs_v1 / v_length_v1 = 10 / 7 = 1.43
  v_rate_v2 = v_abs_v2 / v_length_v2 = 20 / 7 = 2.86
  v_rate_v3 = v_abs_v3 / v_length_v3 = 22 / 9 = 2.44

Citation growth:
  v_avg = (v_rate_v1 + v_rate_v2 + v_rate_v3) / 3 = 2.24

Table 11: Calculation example of the citation growth of The Randomized Dependence Coefficient (Lopez-Paz et al., 2013).
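Equations 1-3 can be implemented directly; the sketch below reproduces the figures of Table 11. The period lengths in months are taken as given (here 7, 7 and 9, per the timeline in Figure 7):

```python
def citation_growth(citations, period_lengths):
    """Implements Equations 1-3. `citations` are the absolute counts at
    consecutive collection points; `period_lengths` gives each period's
    length in months. Returns v_abs, v_rate and v_avg."""
    # Equation 1: absolute growth per period.
    v_abs = [later - earlier for earlier, later in zip(citations, citations[1:])]
    # Equation 2: growth rate per month for each period.
    v_rate = [growth / months for growth, months in zip(v_abs, period_lengths)]
    # Equation 3: average monthly growth rate.
    return v_abs, v_rate, sum(v_rate) / len(v_rate)
```

For the paper of Table 11, `citation_growth([23, 33, 53, 75], [7, 7, 9])` gives v_abs = [10, 20, 22], v_rate ≈ [1.43, 2.86, 2.44] and v_avg ≈ 2.24.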
{
  "citations": {
    "citations": {
      "citations1": 23,
      "citations2": 33,
      "citations3": 53,
      "citations4": 75
    },
    "v_rates": {
      "v1": 10,
      "v2": 20,
      "v3": 22
    },
    "v_avg": {
      "v1_avg": 1.43,
      "v2_avg": 2.86,
      "v3_avg": 2.44,
      "v_avg": 2.24
    }
  }
}

Listing 3: Example of the JSON code for the growth figures calculated in Table 11.

The citation growth was calculated for all papers in the NIPS reviews dataset. All v_abs figures, v_rates and the citation growth (v_avg) were added to a copy of the JSON dataset. An example of the JSON code representing the growth figures calculated in Table 11 is displayed in Listing 3. Important to notice is that the names of the growth figures in the JSON code differ from the names described above: the v_rates in the code are the v_abs figures, the vx_avg entries in the code are the v_rates, and the v_avg is documented in the same way as described above.
Citation impact
In Figure 8, the average growth rate per month is plotted against the first collected citations, whereas in the plots of Figure 2 the y-axis contained the absolute growth in citations. Just like in Figure 2, one can see that not all papers have the same citation growth. Some papers in the plots are more clustered in the bottom left corner and some float from the bottom left up to the top right corner. To make the clustered and floating papers more concrete, a boxplot of the average growth rate per month of the complete dataset was created.
This boxplot can be seen in Figure 9. The boxplot itself is hard to read: black circles, the so-called fliers or outliers, dominate the figure. These are data points above the Upper Outlier Threshold (UOT) of the boxplot. The UOT is defined by adding the Inter Quartile Range Rule (IQRR) to the third quartile (Q3) of the boxplot (Equation 4). The IQRR is calculated as shown in Equation 5: the Inter Quartile Range (IQR) is multiplied by 1.5. The IQR is the absolute difference between the first quartile (Q1) and Q3 of the boxplot (Equation 6). Each growth rate above the UOT is an outlier or flier in the data. Papers with such a growth rate are therefore defined as high impact papers; papers with a growth rate below the UOT are defined as low/medium impact papers. The growth rates belonging to the latter class lie within the boxplot of Figure 9, between 0 and the Upper Outlier Threshold. Taking only these data points, the boxplot in Figure 10 was created. Just like in Figure 9, fliers and outliers are visible, but in contrast to Figure 9, the boxplot itself is much more visible. For this boxplot of the low/medium citation growth rates, the UOT was also calculated. All growth rates above this UOT were defined as medium impact papers; papers with a growth rate between 0 and this UOT were defined as low impact papers. An overview of how a paper is defined as a low, medium or high impact paper can be seen in Table 12. In Figure 8, the calculated citation impact is visually indicated.
UOT_z = Q3_z + IQRR_z (4)
Equation 4: Calculates the Upper Outlier Threshold (UOT) of z, above which data points are marked as flier or outlier. Where z is a boxplot of growth rates and IQRR the Inter Quartile Range Rule (see Equation 5).
IQRR_z = 1.5 ∗ IQR_z (5)
Equation 5: Calculates the Inter Quartile Range Rule (IQRR) of z. Where z is a boxplot of growth rates and IQR the Inter Quartile Range (see Equation 6).
IQR_z = Q3_z − Q1_z (6)
Equation 6: Calculates the Inter Quartile Range (IQR): the absolute difference between Q1_z and Q3_z. Where z is a boxplot of growth rates.
high impact paper      v_avg_x > UOT_z
medium impact paper    UOT_i < v_avg_x ≤ UOT_z
low impact paper       v_avg_x ≤ UOT_i
Table 12: Overview of how paper x is defined as a low, medium or high impact paper. Where z is the boxplot of all growth rates and i the boxplot of all growth rates without the high impact rates.
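The two-stage labelling described above can be sketched in Python. This is an illustrative sketch, not the thesis code; `numpy.percentile` with its default linear interpolation is assumed here as an approximation of the quartiles the boxplots are based on:

```python
import numpy as np

def upper_outlier_threshold(rates):
    """UOT = Q3 + 1.5 * IQR (Equations 4-6)."""
    q1, q3 = np.percentile(rates, [25, 75])
    iqr = q3 - q1       # Equation 6
    iqrr = 1.5 * iqr    # Equation 5
    return q3 + iqrr    # Equation 4

def label_papers(v_avgs):
    """Label each growth rate as 'low', 'medium' or 'high' impact."""
    # Boxplot z: all growth rates.
    uot_z = upper_outlier_threshold(v_avgs)
    # Boxplot i: growth rates without the high impact rates.
    rest = [v for v in v_avgs if v <= uot_z]
    uot_i = upper_outlier_threshold(rest)
    labels = []
    for v in v_avgs:
        if v > uot_z:
            labels.append('high')
        elif v > uot_i:
            labels.append('medium')
        else:
            labels.append('low')
    return labels

# Small fabricated example (not thesis data): one clear outlier,
# one rate between the two thresholds, the rest low.
labels = label_papers([0.1, 0.2, 0.3, 0.4, 0.8, 10.0])
# labels = ['low', 'low', 'low', 'low', 'medium', 'high']
```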
Figure 8: Citation Growth (average growth rate per month) of NIPS Scientific Papers. For each
NIPS year, the paper with the highest impact is excluded from the graph (the same papers as in Table2). Legend: low impact papers (•); medium impact papers (×);
Figure 9: Boxplot z of the NIPS citation growth (NIPS 2013 - 2017).
Figure 10: Boxplot i of the low/medium NIPS citation growth (NIPS 2013 - 2017).
Measures                    Boxplot z   Boxplot i
Q1                          0.14        0.11
Q2                          0.50        0.44
Q3                          1.16        0.85
IQR                         1.02        0.74
IQRR                        1.53        1.11
UOT                         2.69        1.96
Total growth rates/papers   2417        2133

Low/medium impact papers    2133
High impact papers          284
Medium impact papers        112
Low impact papers           2021
Table 13: Different measures of boxplot z and boxplot i, and an overview of the low, medium and high impact papers. Where boxplot z is created from all growth rates and boxplot i from all growth rates without the high impact rates.
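The derived measures reported for the two boxplots follow directly from Equations 4-6; a quick arithmetic check, using only the reported quartiles as input:

```python
# Recompute IQR, IQRR and UOT from the reported Q1 and Q3 per boxplot:
# boxplot z (all growth rates) and boxplot i (without high impact rates).
for name, q1, q3 in [("boxplot z", 0.14, 1.16), ("boxplot i", 0.11, 0.85)]:
    iqr = q3 - q1       # Equation 6
    iqrr = 1.5 * iqr    # Equation 5
    uot = q3 + iqrr     # Equation 4
    print(name, round(iqr, 2), round(iqrr, 2), round(uot, 2))
# boxplot z: IQR 1.02, IQRR 1.53, UOT 2.69
# boxplot i: IQR 0.74, IQRR 1.11, UOT 1.96
```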