
Master’s Thesis

A comparison of methods to measure novelty in scientific papers

M.D. Bosma

Student number: 11418591
Date of final version: August 15, 2018
Master's programme: Econometrics
Specialisation: Big Data Business Analytics
Supervisor: Dr. M. J. van der Leij
Second reader: Dr. L. R. Waltman

Faculty of Economics and Business


Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis; for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)
(b) Statement of originality (compulsory, separate page)
(c) Introduction
(d) Theoretical background
(e) Model
(f) Data
(g) Empirical Analysis
(h) Conclusions
(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the heading of the sections. You have a free choice how to list your references, but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and year of publication for the first reference, and use the first name and et al. and year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number
(d) Date of submission of the final version


Statement of Originality

This document is written by Mick Bosma, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


A comparison of methods to measure novelty in scientific papers

M.D. Bosma

Abstract

Scientific research that adds something new to the existing scientific literature is called novel research. Since there is no explicit formula to measure the degree of novelty of an article, multiple methods exist. In this research we introduce a completely new method. The degree of novelty equals a measure of the variety in the referenced knowledge. A machine learning technique called Doc2Vec is used to transform the abstracts of scientific articles into numeric vectors. Based on these vectors representing referenced abstracts, the degree of novelty is calculated. We compare the new method to two benchmark methods used in prior bibliometric literature. We find that the methods identify different articles as novel on a set of Web of Science articles. Finally, we examine the relation between novelty and impact. The results suggest that the new method is a better estimate of scientific novelty.

Keywords: bibliometrics, novelty, impact, paragraph vectors, cosine similarity

Contents

1 Introduction
  1.1 General Introduction
  1.2 Problem Formulation
2 Theoretical Background
  2.1 Theoretical process
  2.2 Referenced journal pairs
  2.3 Keyword combinations
  2.4 Text mining
3 Methodology
  3.1 Novelty: New referenced journal pairs
  3.2 Novelty: Atypicality of keyword combinations
  3.3 Novelty: Text mining - Variety in referenced abstracts
    3.3.1 Paragraph Vector model
    3.3.2 Testing specifications of the model
    3.3.3 Computation of the novelty score
  3.4 Comparison of measures: ROC curves
  3.5 Regression analysis: Novelty and impact
    3.5.1 Regression variables
    3.5.2 Generalized Negative Binomial model
    3.5.3 Ordered Probit model
    3.5.4 Probit model
4 Data Description
  4.1 Data selection
  4.2 Descriptive statistics regression variables
5 Results
  5.1 Novelty: New journal pairs
  5.2 Novelty: Atypicality of keyword combinations
  5.3 Novelty: Variety in referenced abstracts
    5.3.1 Testing specifications of the PV model
    5.3.2 Novelty scores
  5.4 Comparison of novelty methods
    5.4.1 Descriptive statistics
    5.4.2 Examples: Extreme opposites
    5.4.3 ROC curves
  5.5 Impact regressions
6 Conclusion and Discussion
7 Bibliography
8 Appendices
  8.1 Tables


1 Introduction

1.1 General Introduction

Conducting research on the characteristics of research. A science of science. These descriptions may seem rather vague, yet they capture the scientific field called bibliometrics. The field of bibliometrics consists of quantitative methods to provide statistical analysis of scientific publications. For example, based on the number of times journals have been cited together, the size and variety of scientific fields in terms of journals can be represented in maps (van Eck and Waltman, 2009). There exist many such applications of bibliometric measures. According to Glanzel (2003), the term bibliometrics was first used in 1969, explained as "the application of mathematical and statistical methods to books and other media of communication". In the same year the term scientometrics was introduced as well, defined as "the application of those quantitative methods which are dealing with the analysis of science viewed as an information process". Nowadays, both terms are considered nearly synonymous.

The use of bibliometrics is very broad, as the methods can be applied to most scientific fields. An important application of bibliometrics is exploring the evolution of a specific scientific field. One of the aspects contributing to this evolution is the degree of novel research. Novelty in scientific research is hard to define properly, but it can be explained as something never published in a journal before. Hence, an article's degree of novelty indicates the extent to which it adds something new to the existing body of knowledge. Finding breakthrough results may be one of the main incentives to conduct research; however, Rehman (2018) states that the quest for novelty forms an obstacle to the reproducibility of research. By mainly investing time in trying to find novel results instead of ensuring the quality of the research, the results could turn out to be irreproducible. This is a problem, as reproducibility is considered a core foundation of science. Rehman (2018) argues that journals prefer to publish novel findings, as these tend to be more likely to be cited. On the other hand, Wang et al. (2017) state that funding agencies are considered risk-averse and therefore prefer safe over novel projects. Furthermore, the authors mention that these funding agencies base their decisions and selections on bibliometric measures. As novelty in scientific research is hard to define properly, the question that arises is: how to create a bibliometric method to measure a phenomenon with no explicit definition?

1.2 Problem Formulation

In the last decade, research on novelty measurement in scientific papers has resulted in a few bibliometric methods. In this research we examine two of these methods. We discuss the methodology and compute the measures for a set of articles. Both measures are created to indicate a degree of novelty; however, they are based on different characteristics of articles. Therefore, it is interesting to investigate whether the different methods identify the same articles as novel. Furthermore, we examine the relation between the degree of novelty and the impact of a paper. The impact of a paper is commonly defined as the number of times that paper has been cited by other papers. Novel research is assumed to have more potential to achieve high impact, as stated in Wang et al. (2017). Therefore, in order to evaluate the computed novelty scores, the relation between novelty and impact is interesting to investigate. In addition to these two existing methods, we develop a completely new method to measure novelty. The results from this method are compared to the two benchmark methods as well. In short, the goal of this research is to investigate and compare different methods to measure the novelty of scientific papers.


First, in the next section we discuss existing literature on the methodologies of the two proposed measures. Furthermore, we explore literature on methods that form the basis of the new novelty measure we want to develop. After discussing the relevant prior literature, we focus on the detailed methodology behind the three measures we apply in this research. The techniques we use in order to compare the novelty measures are explained as well. Next, we describe the data selection process and the resulting set of articles. At this point, all required information on the theory and data is provided, and we can discuss the results. The discussion of the results includes: novelty scores, testing parameters, limitations and comparison of methods. After presenting the results, we summarize and evaluate the entire research.

2 Theoretical Background

2.1 Theoretical process

In this section, we discuss existing literature on the topic of novelty measurement in scientific papers. First, we elaborate on the selection process of the literature that we discuss in this section. The basis of this research consists of two quite recently published papers: Wang et al. (2017) and Bramoullé and Ductor (2018). After reading these papers, the proposed methods were considered as benchmark methods for this research. In order to explore the origins of these methods, we analyse the referenced articles on these methods. If these referenced papers contain other new references on the proposed methods, then we review these references as well. Hence, by discussing references, we end up with a literature review for the methods from Wang et al. (2017) and Bramoullé and Ductor (2018).

In order to add another perspective on novelty measurement, we use Google Scholar to search for papers on text-based novelty measures. We make use of combinations of the following terms in the search process: novelty detection, novelty measure, scientific literature, text mining, text-based, abstracts, paragraph vector. The resulting papers are examined, and based on their methodology and references, the selection is made.

2.2 Referenced journal pairs

Identifying novel research from a large set of articles is called novelty detection. Since there is no exact definition of novelty, multiple ideas and methods to measure novelty exist in the literature. An often proposed view on novelty is known as 'combinatorial novelty'. This view is based on the assumption that novelty can be defined as recombining existing parts of the literature in a way that has never been proposed before. Following this assumption, Uzzi et al. (2013) propose that the degree of novelty of an article can be measured by the atypicality of referenced journal pairs. Here, journals are considered bodies of knowledge, hence combining knowledge is represented by paired references. For each of these pairs it is calculated how common or conventional that pairing is. This is done by comparing the observed frequency of journal pairs to their expected frequencies, obtained from randomized citation networks. As a result, each paper is assigned a distribution of conventionality scores of its journal pairs. Analysing these distributions by their medians or percentiles results in some interesting findings. The median is used to indicate the level of typical combinations, while the 10th percentile represents the high novelty level. The first result indicates that high tail novelty papers outperform low tail novelty papers in terms of impact. Second, a very high median conventionality results in a peak in impact as well, independently of the tail novelty. Third, the authors find that larger teams of authors result in higher impact of the paper, for all mixes of tail novelty and median conventionality. Their findings suggest that the mix of mostly highly conventional combinations of existing knowledge together with some very rare combinations leads to the highest impact. Papers that combine high median conventionality and high tail novelty turn out to be twice as likely to become highly cited.

The method from Uzzi et al. (2013) is adapted by Lee et al. (2015) in their research on creativity in scientific teams. They state that scientific creativity consists of two components: novelty and impact. In the paper, the relation between the characteristics of the team of authors and the creativity of the paper is examined. The characteristics of the team of authors are defined as the size and variety of the team. Variety of a team is divided into field variety (educational background) and task variety (division of specialties). Lee et al. (2015) find that the size of the team of authors is positively related with novelty up to a certain point, also known as an inverted-U relation. However, they find that the main effect on novelty is the knowledge variety of the team, which usually increases as team size increases. Furthermore, the influence of the team characteristics on the degree of novelty is different from the influence of team characteristics on the impact of the paper. Therefore, Lee et al. (2015) conclude that citations and novelty might not be highly correlated.

Boyack and Klavans (2014) adapt the method from Uzzi et al. (2013) as well; however, they use a different technique for normalizing the conventionality scores. Uzzi et al. (2013) use a Monte Carlo technique to simulate randomized citation networks in order to compare the observed frequencies with the expected frequencies from simulation. Boyack and Klavans (2014) make use of the square co-citation count matrix to calculate the expected values. The resulting scores are quite similar to the scores from Uzzi et al. (2013), while the co-citation matrix technique is much less computationally expensive. Furthermore, Boyack and Klavans (2014) examine the influence of disciplinary effects on the novelty scores. They suggest that ideally measurements of novelty should be relatively independent of discipline and journal effects.

Recent research from Wagner et al. (2018) investigates the effect of international collaboration on the degree of novelty of an article. Here, international collaboration occurs when the list of authors on an article contains authors from two or more different countries. The authors expect international collaboration to result in highly novel and highly conventional research, following the categorization from Uzzi et al. (2013). This hypothesis is based on earlier research that found international collaboration to be more highly cited. However, Wagner et al. (2018) find no evidence for the hypothesis. The analysis indicates that international collaboration results in highly conventional reference pairs but low novelty measures. In order to explain the negative relation between international collaboration and novelty, the authors propose two possible explanations: higher transaction costs and limitations on implicit communication. However, these possible reasons do not explain the high impact of international collaboration. To explain this phenomenon, the authors argue that international co-authorship leads to a larger group of possible readers, as all authors benefit from their separate networks. This is also known as the audience effect.

Wang et al. (2017) conduct extensive research on the relation between the novelty and impact of a paper. First, the research investigates whether the citation profile of novel papers matches the "high risk/high gain" profile associated with breakthrough research. This high risk/high gain profile follows from novel papers having a higher chance of becoming a top cited paper (high gain), while novel papers show a larger variance in their citations as well (high risk). Second, the authors examine whether popular bibliometric measures might be biased against novel research. In order to measure novelty, Wang et al. (2017) adjust the method from Uzzi et al. (2013) such that the focus is on journal pair novelty instead of journal pair atypicality. Therefore, only journal pairs that never appeared in existing publications before are considered. For these new combinations, the difficulty of combining these journals is measured, based on their co-citation profiles. The degree of novelty is then calculated as the number of new journal pairs weighted by the similarity between these journals.

Wang et al. (2017) find that the citation profiles of highly novel papers indeed match the “high risk/high gain” description. Furthermore, the impact of novel papers in foreign fields is larger, as is the distance of the foreign fields they reach. Another interesting result is the existence of delayed recognition of novel papers. Hence, it takes longer for a novel approach to be recognized and used in further research. The authors conclude that some often used bibliometric measures suffer from a bias against novel research.

2.3 Keyword combinations

Similar to the methods based on journal pairs, Sreenivasan (2013) defines novelty as a measure of atypicality. The author investigates the evolution of novelty in cinema films. In order to measure the novelty of a film, the atypicality of keyword combinations is used. Plot keywords describe a large number of aspects of the film and are therefore considered useful in measuring novelty. First, Sreenivasan (2013) calculates the elemental novelty, which is based on the rarity of individual keywords. Next, the same method is used for keyword pairs, called combinatorial novelty. The method starts by calculating the probabilities of two specific keywords occurring together, based on all films that appeared prior to the film under consideration. Then, these probabilities are transformed into the surprises of keyword pairs by taking the negative logarithm. Finally, these values are normalized in order to remove the dependence on the time of release. The author finds a positive relation between combinatorial novelty and the aggregated revenue of the film.

Boudreau et al. (2016) investigate the relation between the novelty of research proposals and the scores awarded by the evaluators. The measure is based on the Medical Subject Headings (MeSH), an equivalent of keywords. Instead of focussing on the rarity of all keyword pairs as Sreenivasan (2013) does, only the fraction of pairs that never occurred in prior literature is calculated. The average relation between novelty and expert scores turns out to be negative; however, the actual relation can be described as an inverted U. Hence, higher novelty results in increasing expert ratings up to a certain degree of novelty. Beyond this point, even higher novelty leads to decreasing expert scores.

Bramoullé and Ductor (2018) analyse the relationship between the title length of a paper and its scientific quality. Three different definitions of scientific quality are proposed: journal quality, citations and novelty. In order to measure novelty, the authors follow Sreenivasan (2013) and apply this method to the atypicality of article keyword combinations. To test the robustness of the results, two alternative novelty methods are used: the first one is based on frequencies of keyword combinations relative to papers released in the same year; the second method measures the atypicality of individual keywords instead of paired keywords. The research indicates that the length of the title is strongly negatively correlated with scientific quality, independent of the quality definition. Bramoullé and Ductor (2018) propose two possible reasons for this result. First, a causal effect of title length on journal quality or number of citations could exist. For example, shorter titles can be easier to remember and therefore more often cited. Second, the quality of an article might be reflected in its title length, as higher-quality articles tend to have shorter titles and atypical keyword combinations. Furthermore, the authors find a negative relationship between novelty and the impact of a paper.

2.4 Text mining

Up to this point, all discussed methods for measuring novelty are based on the combinatorial novelty definition: either the combination of referenced journals or the combination of keywords forms the basis of the method. However, a degree of novelty can be determined by comparing the actual text of papers as well. Deriving information from text is called text mining. Next, we discuss several papers that make use of such text mining techniques in order to measure novelty. The information in these papers constitutes the basis for the completely new novelty measure we develop in this research.

First, Karkali et al. (2013) introduce a novelty scoring method that compares a document to a corpus of text. This research focusses on news articles instead of scientific papers as documents. The method is based on the Inverse Document Frequency (idf), a measure of the commonness of a term across all documents. The idea behind this method is that a novel article uses a different vocabulary than previous articles. The idf of a term is defined as the logarithm of the total number of documents divided by the number of documents containing that term. The novelty score equals a normalized sum of the term frequencies times the idf of every term. A disadvantage of this method is that synonyms are considered totally different words, while their meaning is similar. Karkali et al. (2013) compare the method to baseline methods on two datasets: Google News and Twitter. The articles are divided between clusters on different news events and only the first article in such a cluster is considered novel. The performance is evaluated using a Detection Error Trade-off curve between missed novelty and false assessment. The proposed method outperforms the baseline approaches.

Dasgupta and Dey (2016) propose a method to measure the novelty of textual ideas. The following two main aspects of innovativeness are considered: first, if an idea contains a technique or approach that has never been mentioned before, it should score higher on innovativeness; second, if an idea is not completely new, the text that contains the more extensive explanation should receive the higher score. Therefore, the authors state that a document with high information content has a high probability of being novel. The information content of a document is calculated by its entropy. This entropy is based on the term specificity, which is computed by its Inverse Document Frequency. Based on the novelty score, the ideas are classified as high novelty, average novelty or low novelty. Subsequently, these scores are compared with expert ratings on document novelty. The information content method outperforms baseline methods such as the maximum or minimum cosine similarity or the Kullback-Leibler divergence.

Ghosal et al. (2018) present a benchmark setup for different techniques in document level novelty detection. A Random Forest model is used to classify documents as novel or non-novel, based on a set of features that captures different aspects of the semantics of a document. The first two are semantic features constructed from vector representations of paragraphs or words (Le and Mikolov (2014) and Mikolov et al. (2013)). The next set of features focusses on lexical similarity. The prior use of sets of words or keywords is quantified, as is the use of totally new words. Finally, the Kullback-Leibler divergence is included as a similarity measure between language models. The target documents are labelled by experts for evaluation purposes. The authors find that the proposed method outperforms a set of baseline approaches.


Packalen and Bhattacharya (2015) analyse abstracts and titles of biomedical literature in order to construct a measure of novelty based on the age of mentioned ideas. After preprocessing the text, all words and 2- or 3-word combinations are considered concepts. For each of these concepts, the year in which it first appeared in the literature is determined, called its cohort. Based on the years since this first appearance and the number of times the concept has been mentioned before, the vintage of an idea is computed. Every concept is linked to its cohort and for each cohort a ranking is created of the 100 most referenced concepts. Next, for each paper the set of referenced top-100 concepts is examined and the minimum age of these concepts is selected. Then, for all papers published in a certain year these minimum concept ages are listed. Finally, the top 20th percentile of papers in that year score 1 on a novelty indicator variable, while the rest score 0. Packalen and Bhattacharya (2015) find that the probability of trying out new ideas decreases with the age of the author's career.

He and Chen (2018) investigate the novelty of research topics by measuring semantic changes of terms over time. For different periods in time, contextual embeddings of research concepts are constructed. In order to do this, the Word2Vec method from Mikolov et al. (2013) is adapted. The period-specific concept vectors are compared using cosine similarity. The novelty score in a certain period is the negative of the maximum similarity score. Based on the novelty score of a research topic, the authors find that high novelty predicts growth in publications.

In most of the methods described up to now, the occurrence of specific words is used as an indicator of novelty. However, using a word with exactly the same meaning as a previous word should not be regarded as novelty. A way to include synonyms and other semantic relations is the use of Word2Vec (Mikolov et al., 2013). The goal of Word2Vec is to represent every word in the corpus by a high-dimensional vector. Similar words will be represented by similar vectors. Furthermore, relations between words are captured in the vector representation as well. An example of such a relation is a country and its capital: Netherlands compares to Amsterdam as Germany compares to Berlin. Two different types of the Word2Vec model are proposed in the paper by Mikolov et al. (2013): Skip-gram and Bag-of-Words. The main architecture of both versions is similar: a two-layer neural network. The Skip-gram model is built to predict the context of a certain word, while the Bag-of-Words model uses the context to predict the word related to it.

Le and Mikolov (2014) expand the Word2Vec method in order to represent a piece of text by a numeric vector. The proposed method, Paragraph Vector, is an unsupervised algorithm learning vector representations for larger pieces of text (sentences, paragraphs or documents). The vector is trained to predict the words occurring in the text. Again, two versions are created: Distributed Memory and Distributed Bag-of-Words. The first one combines the paragraph with context words to predict the next word. The second version does not include the context words as input and predicts words based on the paragraph. Le and Mikolov (2014) combine the vectors from both methods to create the paragraph vectors used in experiments. The resulting paragraph vectors are used for sentiment analysis of movie reviews and a test of paragraph similarities. The proposed method outperforms baseline methods.

Dai et al. (2015) use the Paragraph Vector (PV) method from Le and Mikolov (2014) on Wikipedia and arXiv articles. First, the vectors for Wikipedia articles are used to identify the nearest neighbours, based on cosine similarities. Next, these neighbours are compared to the results from Latent Dirichlet Allocation (LDA), a topic modelling technique. The PV model produces better nearest neighbours than LDA. The authors find that the paragraph vectors are suitable for vector operations as well. For example, using the PV model on the Wikipedia article on the famous American singer 'Lady Gaga', together with word vectors for 'American' and 'Japanese', one of the most popular Japanese singers, with an equally named album, is found. PV also outperforms LDA and other baseline methods in identifying the least similar article in triplets of Wikipedia articles. The same results follow from similar experiments conducted on arXiv articles.

3 Methodology

Now that we have discussed the existing literature on the different methods, we elaborate on the computational details used to calculate the degree of novelty. In Section 4 we discuss the data used for this research. However, in order to clarify the upcoming sections, it is useful to state the following: the idea is to apply all three novelty methods that we investigate to a fixed set of articles, such that the results can be compared. For now, let us denote the set of articles for which we want to measure the novelty scores by $A^*$.

3.1 Novelty: New referenced journal pairs

The first method we explore in this research follows Wang et al. (2017) and utilizes the references of a paper. The idea is that referencing a combination of journals that has never been cited together before results in novelty. Therefore, we focus on the referenced journals for all papers in $A^*$. First, let the set of all Web of Science papers since 1980 be denoted by $A$. Hence, the set of all papers used for novelty measurement is a subset, $A^* \subseteq A$.


For a certain paper $i \in A^*$, we denote $J_i$ as the set of unique referenced journals. Moreover, the total number of journals in this set is given by $n_i = |J_i|$. The individual journals in $J_i$ are denoted by $j_{i,m}$, for $m = 1, \ldots, n_i$. As we are interested in journal pairs, we create all possible combinations of journals $(j_{i,m}, j_{i,n})$ with $m \neq n$. At this point, we have all pairs of journals that are cited by paper $i$.

We want to obtain the new journal pairs: combinations of journals that have not occurred in previous literature. Therefore, we compare the referenced journal pairs in paper $i$ to the referenced journal pairs in all previously published papers in $A$. We select only the journal pairs that are cited together for the first time. In order to avoid trivial pairs, we only select new pairs of journals that belong to the top 50% cited journals over the preceding three years. For example, if paper $i$ is published in year $t_i$, we determine the number of times each journal has been cited in the years $t_i - 1$, $t_i - 2$ and $t_i - 3$ combined. Then, only new referenced journal pairs in paper $i$ consisting of two journals in the top 50% most cited journals are selected.

Now, each paper in $A^*$ that has no new journal pair is given a novelty score equal to zero. For the papers that do reference new journal pairs, we want to assess the difficulty of combining these journals. In order to do this, we compare their journal co-citation profiles in the preceding three years. For paper $i$, published in year $t_i$, we denote by $c_{k,l}$ the number of papers that cite both journal $k$ and journal $l$ in the years $t_i - 1$, $t_i - 2$ and $t_i - 3$. Now, the co-citation profile of journal $k$ contains for all other journals $l$ the value $c_{k,l}$. The profiles follow from the co-citation matrix given in Table 1.

Now, in order to determine the ease of combining two journals, we compare the co-citation profiles using the cosine similarity, as displayed in (1). For example, let $j_1$ and $j_2$ represent the row vectors from Table 1.

         j_1       j_2       j_3       j_4       j_5      ...
j_1       /       c_{1,2}   c_{1,3}   c_{1,4}   c_{1,5}   ...
j_2     c_{1,2}     /       c_{2,3}   c_{2,4}   c_{2,5}   ...
j_3     c_{1,3}   c_{2,3}     /       c_{3,4}   c_{3,5}   ...
j_4     c_{1,4}   c_{2,4}   c_{3,4}     /       c_{4,5}   ...
j_5     c_{1,5}   c_{2,5}   c_{3,5}   c_{4,5}     /       ...
...      ...       ...       ...       ...       ...      ...

Table 1: Co-citation matrix of journals

Then the cosine similarity between journal 1 and 2 is defined as,

$$\mathrm{COS}_{1,2} = \frac{j_1 \cdot j_2}{\lVert j_1 \rVert \cdot \lVert j_2 \rVert} \qquad (1)$$

Let $(1 - \mathrm{COS}_{1,2})$ define the novelty score of journal pair $(j_1, j_2)$.

Finally, the degree of novelty of a paper equals the sum of the novelty scores of all its new journal pairs. Hence, the novelty of paper $i$ is calculated as follows,

$$Nov_i^{journal} = \sum_{(j_{i,m},\, j_{i,n \neq m}) \,\mid\, \text{new}} (1 - \mathrm{COS}_{m,n}) \qquad (2)$$

The notation $Nov^{journal}$ is introduced to represent novelty scores following from the method that is based on journal pairs.
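To make the computation in (1)-(2) concrete, the sketch below implements the journal-pair novelty score in Python. It is a minimal illustration, not the thesis code: the co-citation profiles and the list of new journal pairs are toy inputs, and the helper names (cosine, journal_novelty) are our own.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two co-citation profile vectors, eq. (1)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return u @ v / denom if denom > 0 else 0.0

def journal_novelty(new_pairs, cocitation):
    """Novelty score (2): the sum of (1 - cosine similarity) over a paper's
    new journal pairs. `cocitation` maps a journal to its co-citation profile
    over the three preceding years; `new_pairs` holds the journal pairs cited
    together for the first time (both in the top 50% most cited journals)."""
    return sum(1.0 - cosine(cocitation[k], cocitation[l]) for k, l in new_pairs)

# Toy example with three journals and made-up co-citation profiles.
cocitation = {
    "j1": np.array([0.0, 12.0, 3.0]),
    "j2": np.array([12.0, 0.0, 1.0]),
    "j3": np.array([3.0, 1.0, 0.0]),
}
print(journal_novelty([("j1", "j3")], cocitation))
```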

3.2 Novelty: Atypicality of keyword combinations

The second method is introduced by Sreenivasan (2013) for film analysis and subsequently used by Bramoullé and Ductor (2018) on scientific articles. The degree of novelty of an article is measured by the atypicality of its keywords. The keywords are assigned by the author of the article. Now, let $K_i$ denote the set of all keywords of article $i$, which is published in year $t_i$. We denote $N_{t_i;\{k_1,k_2\}}$ as the total number of papers published in or before year $t_i$ containing keyword pair $\{k_1, k_2\}$. The total number of papers published in or before year $t_i$ is denoted by $N_{t_i}$. First, we calculate the probability of keywords $k_1$ and $k_2$ occurring together in or before year $t_i$, as displayed in (3).

$$P_{t_i}(k_1, k_2) = \frac{N_{t_i;\{k_1,k_2\}}}{N_{t_i}} \qquad (3)$$

Next, we transform the probability into a measure of atypicality by taking the negative logarithm. The novelty of article $i$ equals the average atypicality of its keyword combinations, normalized by the number of keyword pairs and by $\ln N_{t_i}$ to ensure values between 0 and 1. Note that a novelty score of 1 corresponds to an article that only contains new keyword combinations. The total number of keyword combinations $\{k_1, k_2\}$ is given by $|K_i|(|K_i| - 1)$. The novelty of article $i$ is calculated as,

$$Nov_i^{keyword} = -\frac{\sum_{k_1 \neq k_2 \in K_i} \ln\left(P_{t_i}(k_1, k_2)\right)}{|K_i|(|K_i| - 1)\,\ln N_{t_i}} \qquad (4)$$

We denote the novelty scores resulting from this method, based on keyword combinations, by $Nov^{keyword}$.
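A minimal sketch of equation (4) follows, assuming hypothetical inputs: pair_counts holds the number of papers up to year $t_i$ containing each keyword pair, and n_papers is $N_{t_i}$. Iterating over unordered pairs and averaging is equivalent to the ordered sum in (4), since that sum counts every pair twice and the denominator $|K_i|(|K_i|-1)$ does as well. How never-before-seen pairs enter the count (here clamped to 1, i.e., counting the paper itself) is our assumption.

```python
import math
from itertools import combinations

def keyword_novelty(keywords, pair_counts, n_papers):
    """Novelty score (4): average -ln(P) of a paper's keyword pairs,
    normalized by ln(N_t) so the score lies between 0 and 1."""
    total, n_pairs = 0.0, 0
    for k1, k2 in combinations(keywords, 2):
        # Clamp to 1 so a brand-new pair contributes -ln(1/N_t) = ln(N_t),
        # which normalizes to exactly 1 (our assumption for unseen pairs).
        count = max(pair_counts.get(frozenset((k1, k2)), 0), 1)
        total += -math.log(count / n_papers)
        n_pairs += 1
    return total / (n_pairs * math.log(n_papers))

pair_counts = {frozenset(("citation", "network")): 500}   # toy history
print(keyword_novelty(["citation", "network", "novelty"],
                      pair_counts, n_papers=10_000))
```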

3.3 Novelty: Text mining - Variety in referenced abstracts

In this research, we introduce a completely new method to measure novelty. This method is based on a similar concept of combinatorial novelty. The idea is that combining references that contain different knowledge results in a more novel paper than a paper containing very similar references. This is quite similar to the idea from Uzzi et al. (2013), since novelty is based on combining prior knowledge. Furthermore, Lee et al. (2015) find that knowledge variety in the team of authors is positively correlated with novelty. Therefore, variety in referenced knowledge might be a proper indicator of novelty as well. For this method, we do not examine the number of co-citations of a paper's references. Instead, we focus on the actual information in the abstracts of the referenced papers. By comparing the abstracts of all referenced papers, a measure of variety in the references can be determined. In order to compare the abstracts, we want to translate the text into numeric vectors. The Doc2Vec method from Le and Mikolov (2014) is appropriate for this task. A different name for this technique is Paragraph Vector, which will be used from now on. Now, we first discuss details of the Paragraph Vector model. Thereafter we explain the calculation of the novelty measure.

3.3.1 Paragraph Vector model

As discussed in Section 2.4, Le and Mikolov (2014) introduce two versions of the Paragraph Vector (PV) model: Distributed Bag-of-Words (DBOW) and Distributed Memory (DM). The idea behind PV-DBOW is that the model is trained to predict the words contained in the paragraph given the paragraph id. PV-DM uses the paragraph id combined with context to predict the word missing from that context. Figures 1 and 2 are from Le and Mikolov (2014) and graphically demonstrate both models, applied to the context 'the cat sat on' in a certain paragraph. In Figure 1, the paragraph id is the only input, trained to predict the words 'the', 'cat', 'sat' and 'on'. In Figure 2, we see that besides the paragraph id, the context words 'the', 'cat' and 'sat' are used as input as well. Now, the model is trained to predict the missing word 'on' based on this input. Dai et al. (2015) state that the PV-DBOW model is more efficient and use this version in their paper on document similarity. Based on this literature, PV-DBOW seems to be the most appropriate model. Therefore, we discuss the PV-DBOW model in more detail. However, we perform some experiments with the other type of model as well. A detailed explanation of the PV-DM model is outside the scope of this research.

Figure 1: PV-DBOW model
Figure 2: PV-DM model

Recall that we use the PV model to transform the abstracts of all referenced papers into numeric vectors. From now on, we call these numeric vectors abstract vectors. Every abstract is labelled with a unique abstract id. The explanation of the PV-DBOW model is based on the extensive discussion from Berendsen (2017). The basis for the Paragraph Vector model is a three-layer neural network: an input layer, a hidden layer and an output layer. A graphical representation of the neural network is displayed in Figure 3. The input layer consists of $N$ units, where $N$ equals the number of documents, or in this case, abstracts. Each unit corresponds to a certain abstract: the unit equals 1 for that abstract and zero otherwise. Hence, the input layer can be represented as the vector $d$ over all abstracts, for which only the unit corresponding to the input abstract equals 1. The size of the hidden layer equals the size of the final abstract vectors we want to obtain, say $M$. All units from the input layer are connected to all hidden units, so the two layers are fully connected. Each connection carries a weight, contained in the weight matrix $W_1$ of size $M \times N$. The resulting hidden layer $h$ can now be computed as $h = W_1 d$. After training, the resulting vector $h$ is the actual abstract vector corresponding to the input vector $d$.

Next, the hidden layer is fully connected with the output layer as well. The output layer of the network consists of all unique words in the vocabulary. The vocabulary is the collection of all words contained in the dataset of abstracts. The output layer has size $Q$, the number of unique words. The output layer is represented as the vector $v$, similar to a term vector, as each unit corresponds to a single word in the vocabulary. Since the hidden and output layers are fully connected, the vector $v$ is computed as $v = W_2 h$. Here, $W_2$ is the $Q \times M$ weight matrix for the connection of the hidden and output layer. Now that we know the layout of the network, we can focus on the training of the model.

Figure 3: Graphical representation of PV-DBOW neural network training

The values in the output layer, also known as activations, are inserted in the Softmax function $s$. The Softmax function is a generalization of the logistic function, used to normalize the entries in a vector. For output unit $v_i$ we have,

$$s(v_i) = \frac{e^{v_i}}{\sum_{k=1}^{Q} e^{v_k}} \qquad (5)$$

This Softmax function is used in classification problems. In this case, during training the network is presented with an abstract id combined with one of the words in that abstract. The task is to correctly classify that corresponding word, called the target word. Now, the output of the Softmax function can be interpreted as a probability distribution over all words in the vocabulary for the input abstract. We call this vector of estimated probabilities $\hat{t}$, since it is an estimate of the target word vector $t$ representing the target word. This target word vector equals one for the target word and zero for every other word in the vocabulary. In order to train the network, a loss function is used to compare the estimated vector $\hat{t}$ to the corresponding target word vector $t$. This loss function is based on the cross entropy,

$$E(\hat{t}, t) = -\sum_{i=1}^{Q} t_i \log \hat{t}_i \qquad (6)$$

Now, the loss function follows by taking the average cross entropy over the training set. For a training set $T$ with $t_j$ as the $j$th target word vector, we have the following loss function,

$$L = \frac{1}{|T|} \sum_{j=1}^{|T|} E(\hat{t}_j, t_j) \qquad (7)$$

The objective of training the neural network is to minimize this loss function. Since $\hat{t}_j$ actually only depends on the values of the weights, the loss function turns out to be a function of the weights from the weight matrices $W_1$ and $W_2$ as well. Now, the algorithm used to minimize the loss function with respect to these weights is called Stochastic Gradient Descent (SGD). While training the network, the gradient of the loss function $L$ for the current weight values is computed. The idea behind SGD is to adjust the weights in the direction opposite to the gradient. Moreover, the weights are modified in that direction by a small correction based on a specified learning rate, denoted by $\eta$. By adjusting the weights following this algorithm, the loss function is minimized and the final values in the weight matrix $W_1$ correspond to the abstract vectors we need.

Finally, we briefly discuss some final notes on the training process. During training, the network is fed with a combination of an abstract id and a word in that abstract. For every combination, the network should compute the Softmax function for all words in the vocabulary. This would be computationally infeasible. In order to overcome this problem, two possible methods exist. The first method is called hierarchical Softmax, which uses a Huffman tree based on word frequencies. In this way, we do not have an output unit for every word in the vocabulary, but a pattern of output nodes that represents a specific word. With more frequent words having shorter paths in the tree, this saves a lot of time in calculating the Softmax function. The second method is called negative sampling. This comes down to sampling only a small number of words besides the target word, called 'negative' words, for which the weights will be updated. The intuition behind this approach is that a proper model should distinguish data from noise. By selecting only a few negative words, the computation time of the Softmax function decreases drastically. After training the Paragraph Vector model, every abstract is transformed into a vector of length $M$.

The PV-DBOW model is conceptually simpler than the PV-DM model. Since we do not use the PV-DM model for novelty calculation, we limit the explanation of details to the PV-DBOW model. We do perform a few experiments including the PV-DM model in Section 5.3.1. However, it suffices to understand that the input consists of context words as well.
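In practice, models of this kind do not have to be implemented from scratch. The sketch below shows how a PV-DBOW model could be trained with the gensim library (version 4 API); the two-document corpus and all hyperparameter values are placeholders, not the settings used in this research.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each abstract gets a unique abstract id as its tag.
abstracts = {
    "a1": "we study citation networks of scientific journals",
    "a2": "a neural model learns vector representations of documents",
}
corpus = [TaggedDocument(words=text.lower().split(), tags=[abs_id])
          for abs_id, text in abstracts.items()]

model = Doc2Vec(corpus,
                dm=0,             # dm=0 selects PV-DBOW; dm=1 would be PV-DM
                vector_size=100,  # dimension M of the abstract vectors
                negative=5,       # negative sampling instead of
                hs=0,             # hierarchical Softmax
                min_count=1,
                epochs=20)

vector_a1 = model.dv["a1"]        # trained abstract vector for abstract "a1"
```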

3.3.2 Testing specifications of the model

Before we compute the novelty scores using this method, we have to decide which specification of the Paragraph Vector model to use. As stated in Section 3.3.1, we can choose between a PV-DBOW and a PV-DM model. The PV-DM model combines the abstract id with context words in order to predict the next word. The context word vectors can be combined in two ways: averaging or concatenation. Furthermore, we consider both hierarchical Softmax and negative sampling for model training. Finally, we set the dimension $M$ of the resulting abstract vectors to different values.

Now, in order to test which specification of the Paragraph Vector model performs best, we need a measure of accuracy. Similar to Dai et al. (2015), we create triplets of articles, where two of the articles are closer to each other than the third. Since we do not know which articles are quite similar, we have to make an assumption. In order to explain that assumption, consider that Web of Science classifies research areas into five categories: Arts & Humanities, Life Sciences & Biomedicine, Physical Sciences, Social Sciences and Technology. Each of these research areas consists of many subject categories. Now, Web of Science assigns at least one of these subject categories to every journal in the database. The selection of the triplets is based on the subject category assigned to the journal the article is published in. Hence, the assumption we make is: articles published in journals with equal subject categories are more similar than articles published in journals with no matching subject category.

Now, we start by randomly selecting 1000 articles published in any journal with only one subject category. We call these articles A. Next, we randomly assign a second article, published in a journal with the same single subject category, to each article. We call these articles B. Finally, we add a third article, published in a journal with subject categories different from the first two articles. We call these articles C. In this way, we have 1000 triplets of articles used to test the accuracy of the Paragraph Vector model specification. For each triplet, we calculate the cosine similarity between each of the three vectors A, B and C. Now, based on our assumption, the cosine similarity between vectors A and B should be the highest. Triplets for which this holds are considered correctly predicted triplets. Therefore, the accuracy of a certain specification of the model is given by,

$$\text{Accuracy} = \frac{\text{Number of correctly predicted triplets}}{\text{Total number of triplets}} \qquad (8)$$

Notice that the total number of triplets is equal to 1000. In Section 5.3 the results of testing multiple specifications are presented. We now elaborate on the further use of the vectors.
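The triplet test of equation (8) reduces to a few lines once the abstract vectors are available. The following sketch assumes hypothetical inputs: each triplet holds the trained vectors of articles A, B and C as numpy arrays.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def triplet_accuracy(triplets):
    """Equation (8): the fraction of triplets (A, B, C) in which A and B,
    the same-subject-category pair, are the most cosine-similar pair."""
    correct = sum(1 for a, b, c in triplets
                  if cosine(a, b) > cosine(a, c) and cosine(a, b) > cosine(b, c))
    return correct / len(triplets)

# Toy check: B is a noisy copy of A, C is unrelated, so accuracy is high.
rng = np.random.default_rng(0)
triplets = []
for _ in range(1000):
    a = rng.normal(size=100)
    triplets.append((a, a + 0.1 * rng.normal(size=100), rng.normal(size=100)))
print(triplet_accuracy(triplets))
```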

3.3.3 Computation of the novelty score

Now, the idea behind this method is that a high variety in the knowledge in referenced articles results in a novel paper. Recall that the set of all articles for which we want to calculate the novelty is denoted by $A^*$. We train abstract vectors for all abstracts of papers referenced by the articles in $A^*$. Consequently, we have a vector of length $M$ for each of these abstracts. We denote the vector corresponding to referenced abstract $j$ in paper $i$ as $a_{i,j}$. Next, for paper $i$, we calculate the cosine similarity between all its referenced vectors. Let $N_i$ denote the number of referenced abstracts for paper $i$. Then, we have for all pairs $j, k = 1, \ldots, N_i$ with $j \neq k$, the cosine similarities between $a_{i,j}$ and $a_{i,k}$,

$$\mathrm{COS}_{a_{i,j}, a_{i,k}} = \frac{a_{i,j} \cdot a_{i,k}}{\lVert a_{i,j} \rVert \cdot \lVert a_{i,k} \rVert} \qquad (9)$$

Hence, for paper $i$ with $N_i$ references we have a total of $\frac{N_i!}{2! \, (N_i - 2)!}$ cosine similarities. The abstract vectors can contain negative and positive entries. Therefore, the cosine similarities can range between -1 and 1. In order to ensure novelty scores between 0 and 1, similar to the keyword method, we normalize the cosine similarities to the range from 0 to 1 as well. This normalization is performed by adding 1 to $\mathrm{COS}_{a_{i,j}, a_{i,k}}$ and dividing by 2. Hence we get,

$$\mathrm{COS}_{a_{i,j}, a_{i,k}} \rightarrow \frac{\mathrm{COS}_{a_{i,j}, a_{i,k}} + 1}{2} \qquad (10)$$

We transform these normalized cosine similarities into dissimilarity scores by taking one minus the value. Finally, the novelty score is given by the 75th percentile of this sequence of dissimilarity scores. Hence, we have for the novelty score of paper $i$,

$$Nov_i^{abstract} = P_{75}\left(\left\{1 - \mathrm{COS}_{a_{i,j}, a_{i,k}} \mid j, k \in N_i,\; j \neq k\right\}\right) \qquad (11)$$

Here, $P_{75}(\cdot)$ denotes the 75th percentile of the collection of values between the brackets. The 75th percentile of the dissimilarities represents the degree of variety among the most dissimilar referenced papers. The percentile is calculated using linear interpolation.

We demonstrate this interpolation in the calculation of $P_{75}(x)$, for a vector $x$ of length 20 containing sorted values $x_1, x_2, \ldots, x_{20}$. The 75th percentile equals the value 0.75 of the way from $x_1$ to $x_{20}$. The distance between $x_1$ and $x_{20}$ equals $20 - 1 = 19$. Therefore, the value that is 0.75 of the way from $x_1$ to $x_{20}$ equals the value at position $(0.75 \times 19) + 1 = 15.25$. However, there is no value at index 15.25, since indexes are integers. This is the situation where linear interpolation is used to calculate the percentile. The integer part of the index 15.25 equals 15. Therefore, the starting point for the percentile is the value at index 15, which is $x_{15}$. The fractional part of the index is 0.25, which is multiplied by the difference between $x_{15}$ and the next value $x_{16}$. The 75th percentile is now calculated as,

$$P_{75}(x) = x_{15} + 0.25 \times (x_{16} - x_{15}) \qquad (12)$$

We denote the novelty scores following from the method based on the variety in the abstracts of referenced papers by $Nov^{abstract}$.
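Putting (9)-(11) together, a sketch of the per-paper computation could look as follows; the random reference vectors stand in for trained abstract vectors. Note that numpy's percentile function applies the same linear interpolation as the worked example above.

```python
import numpy as np
from itertools import combinations

def abstract_novelty(vectors):
    """Novelty score (11): the 75th percentile of the dissimilarities
    1 - (COS + 1)/2 over all pairs of referenced abstract vectors."""
    dissims = []
    for a, b in combinations(vectors, 2):
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # eq. (9)
        dissims.append(1.0 - (cos + 1.0) / 2.0)                 # eqs. (10)-(11)
    return np.percentile(dissims, 75)      # linear interpolation, as in (12)

rng = np.random.default_rng(1)
refs = [rng.normal(size=100) for _ in range(20)]   # 20 referenced abstracts
print(abstract_novelty(refs))                      # one score for this paper
```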

3.4 Comparison of measures: ROC curves

In order to compare the three methods in their ability to identify highly novel articles, we make use of Receiver Operating Characteristic curves, better known as ROC curves. ROC curves are used in binary classification problems and display the trade-off between the true positive and false positive rates. A true positive occurs when a target value of 1 is classified correctly. A false positive is a negative (target value 0 in binary classification) that is classified as a positive. The true positive rate equals the number of true positives divided by the actual number of positive targets. The false positive rate equals the number of false positives divided by the actual number of negative targets. By shifting the threshold, indicating for which value a prediction is classified as positive, the ROC curve can be computed.

In our case, we do not have actual target values, since the articles are not labelled as novel or non-novel. Such labelling can be performed by an expert in the field of that particular article; however, with thousands of articles this is not possible. Therefore, we do not strive to determine whether the methods correctly classify novelty. Instead, we examine whether the three methods identify the same papers as novel. We accomplish this in the following way: we select the novelty scores of one of the three methods as the true values or targets. To illustrate, let us assume $Nov^{journal}$ represents the target novelty scores. Next, we label the top 5% or top 10% of $Nov^{journal}$ as novel (1), and the rest as non-novel (0). Now, let us assume $Nov^{keyword}$ serves as the novelty predictor. The thresholds indicate which values of $Nov^{keyword}$ represent a novel article. At each step in the ROC curve, the threshold shifts one value in $Nov^{keyword}$. For each threshold, the true positive and false positive rates are calculated, in order to plot the ROC curves. A true positive for a certain threshold is an article with a keyword novelty score above the threshold that is contained in the top 5% or top 10% of $Nov^{journal}$ as well. Hence, a false positive indicates an article with a keyword novelty score above the threshold that falls outside the top 5% or top 10% of $Nov^{journal}$. Table 2 displays a confusion matrix representing the idea of true/false positives/negatives corresponding to this example.

                             $Nov^{journal}$ (top 5% or 10%)
                             Novel     Non-novel     Total
$Nov^{keyword}$  Novel        TP          FP          PP
                 Non-novel    FN          TN          PN
                 Total        P           N

Table 2: Confusion matrix with $Nov^{journal}$ as target and $Nov^{keyword}$ as predictor

The columns indicate the target value (novel or non-novel) based on the top 5% or 10% novelty scores $Nov^{journal}$. The rows indicate the predicted value based on $Nov^{keyword}$, computed for the threshold at that time. Hence, the true positives (TP) represent the number of correctly predicted novel articles. The false negatives (FN) represent the number of novel articles predicted as non-novel. Furthermore, the total number of predicted positives (PP) and negatives (PN) and the total number of actual positives (P) and negatives (N) are displayed. Now, we can define the true positive rate as $TPR = \frac{TP}{P}$. The false positive rate is given by $FPR = \frac{FP}{N}$. We examine several cases, with different methods as target values and different top percentages. The Area Under the Curve (AUC) indicates the accuracy of the classification. In our case this can be interpreted as the combined ability to identify highly novel articles.
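The comparison itself is a standard binary-classification exercise, so off-the-shelf tooling applies. Below is a sketch with scikit-learn, using randomly generated scores in place of the actual $Nov^{journal}$ and $Nov^{keyword}$ values; the 90th-percentile cut reproduces the top 10% labelling described above.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)
nov_journal = rng.random(5000)                            # toy target scores
nov_keyword = 0.5 * nov_journal + 0.5 * rng.random(5000)  # toy predictor

# Label the top 10% of Nov^journal as novel (1), the rest as non-novel (0).
targets = (nov_journal > np.percentile(nov_journal, 90)).astype(int)

# Sweep thresholds over Nov^keyword and collect (FPR, TPR) pairs.
fpr, tpr, thresholds = roc_curve(targets, nov_keyword)
print("AUC:", roc_auc_score(targets, nov_keyword))
```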

3.5 Regression analysis: Novelty and impact

A general assumption on novel research is that it will be cited more than non-novel research. On the other hand, novel research is considered to face a higher risk of low impact as well. Wang et al. (2017) investigated this "high risk/high gain" nature of novel research and confirmed the hypothesis. In order to compare the three methods, we run different regressions to examine the relation between the novelty score and impact. The impact of a paper is defined as the number of times the paper has been cited during a fixed time window. We include novelty in two types of variables: the novelty score as a continuous variable, and as a categorical variable representing different classes of novelty.

3.5.1 Regression variables

In order to truly examine the relation between novelty and impact, we add a set of control variables. We include year dummies $\gamma_t$, for each publication year $t$ from 2003 to 2009. Furthermore, we consider the dummy variable $IC_i$, which indicates whether paper $i$ is an international collaboration, and $MI_i$, indicating whether paper $i$ is written by authors from multiple institutions. These variables are related to the size of the network of the paper, which may partly explain the number of times the paper will be cited. For this same reason, the number of authors is included as well, denoted by $authors_i$. The last two control variables contain the number of references in paper $i$ as $references_i$ and the number of keywords as $keywords_i$. As our novelty methods are based on these characteristics, we include them in the regression to identify the effect of the novelty controlled for the underlying variable.

Now, let us elaborate on the variables representing the novelty scores. To be used as continuous variables, the novelty scores are included as $\ln(Nov_i^m + 1)$, where $m$ indicates the novelty method. This approach is similar to Wang et al. (2017). Furthermore, we create two categorical variables for each novelty method. The first division of novelty scores is based on the following assumption: the top 10% of novelty scores are considered highly novel (CAT3) and the bottom 50% lowly novel (CAT1). Moreover, the middle 40% represents medium novelty (CAT2). The categories for the three methods are denoted by $CAT_i^m$, which equal 1, 2 or 3, corresponding to the category.


We use another method to create three classes of novelty as well. This method is called the Jenks natural breaks classification method, a one-dimensional equivalent of K-means clustering. The Jenks method creates three classes of the novelty scores, such that the variance within each class is minimized. The cluster with the highest novelty scores is cluster 3. Consequently, the medium scores are in cluster 2 and the lowest scores in cluster 1. We distinguish the following steps in creating the optimal novelty classes:

• Step 1: Compute the variance around the mean novelty score: $Var_{mean}$
• Step 2: Divide the ordered novelty scores into three initial classes
• Step 3: Compute the variance within each class $j$: $Var_j^{within}$
• Step 4: Calculate the variance between classes for each class $j$: $Var_j^{between} = Var_{mean} - Var_j^{within}$
• Step 5: Compute the sum of all within-class variances: $\sum_j Var_j^{within}$

After step 5, a paper is shifted from the class with the highest $Var^{between}$ towards the class with the lowest value. Steps 3-5 are then performed again. This process is repeated until the sum of within-class variances, $\sum_j Var_j^{within}$, reaches a minimal value. The resulting clusters for each method are represented by the categorical variables $CSR_i^m$.
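As an illustration, the clustering can be reproduced with the third-party jenkspy package rather than implementing the iteration by hand; this is our shortcut, not necessarily how the thesis computed $CSR_i^m$. The break values define the three classes, and np.digitize assigns each score to cluster 1, 2 or 3.

```python
import numpy as np
import jenkspy   # third-party implementation of Jenks natural breaks

rng = np.random.default_rng(3)
scores = rng.random(1000)                    # toy novelty scores

# Four break values (min, two inner breaks, max) define three classes.
breaks = jenkspy.jenks_breaks(scores.tolist(), 3)

# Map each score to cluster 1 (lowest) .. 3 (highest) via the inner breaks.
csr = np.digitize(scores, breaks[1:-1], right=True) + 1
print(np.bincount(csr)[1:])                  # papers per cluster
```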

3.5.2 Generalized Negative Binomial model

The dependent variable in the first set of regressions is the number of times paper i has been cited, denoted by ci. The number of citations is a count

variable. Furthermore, the variance in the number of times papers have been cited can be large. Therefore, we estimate a Generalized Negative Binomial (GNB) model, similar to Wang et al. (2017). For example, a Poisson regression model assumes the variance to be equal to the mean.


The choice for GNB allows us to model the dispersion parameter as a function of a selection of variables. In this way, we can investigate the high-risk assumption of novel research. Let the dispersion parameter be denoted as $\alpha$. The variance of the model can now be computed as $\sigma^2 = \mu + \alpha\mu^2$, where $\mu$ equals the mean. We estimate three GNB models for each novelty method $m$. The first one includes novelty as a continuous variable and the full model specification is given by,

\[
\begin{aligned}
\ln(c_i) = \beta_0 &+ \beta_1 \ln(Nov_i^m + 1) + \beta_2 IC_i + \beta_3 MI_i + \beta_4 \ln(authors_i) \\
&+ \beta_5 \ln(references_i) + \beta_6 \ln(keywords_i) + \sum_t \gamma_t + \epsilon_i
\end{aligned} \tag{13}
\]

Now, the coefficient of interest in (13) is $\beta_1$, as it describes the relation between novelty and impact. The second GNB model contains the categorization based on fixed percentages, represented by $CAT_i^m$. The regression is similar to (13), with the continuous novelty replaced by the dummy variables for categories 2 and 3 of $CAT_i^m$. The final GNB models include the novelty clusters $CSR_i^m$ created with the Jenks method instead of the fixed-percentage categorization. The dispersion parameter, modeled as $\ln(\alpha)$, is estimated by the same set of independent variables.
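A GNB with a covariate-dependent dispersion parameter is directly available in, for example, Stata's gnbreg. In Python it can be sketched by maximizing the NB2 log-likelihood with $\mu_i = \exp(x_i'\beta)$ and $\alpha_i = \exp(z_i'\gamma)$, as below using statsmodels' GenericLikelihoodModel. This is a minimal sketch under those assumptions; the class name and the commented usage are hypothetical, not the thesis implementation.

```python
import numpy as np
from scipy.special import gammaln
from statsmodels.base.model import GenericLikelihoodModel

class GNB(GenericLikelihoodModel):
    """NB2 model where the dispersion is modelled on covariates:
    mu_i = exp(x_i'beta), alpha_i = exp(z_i'gamma),
    so that Var(c_i) = mu_i + alpha_i * mu_i**2."""

    def __init__(self, endog, exog, exog_disp, **kwargs):
        self.exog_disp = np.asarray(exog_disp)
        super().__init__(endog, exog, **kwargs)
        self.k_mean = self.exog.shape[1]
        # register names for the dispersion coefficients (for summaries)
        self.exog_names.extend(
            f"ln_alpha_z{j}" for j in range(self.exog_disp.shape[1]))

    def loglikeobs(self, params):
        beta, gamma = params[:self.k_mean], params[self.k_mean:]
        mu = np.exp(self.exog @ beta)
        alpha = np.exp(self.exog_disp @ gamma)
        r = 1.0 / alpha                     # NB2 "size" parameter
        y = self.endog
        return (gammaln(y + r) - gammaln(r) - gammaln(y + 1)
                + r * np.log(r / (r + mu)) + y * np.log(mu / (r + mu)))

# Hypothetical usage: X carries a constant, ln(Nov+1) and the controls;
# Z carries the variables modelling ln(alpha).
# res = GNB(citations, X, Z).fit(
#     start_params=np.zeros(X.shape[1] + Z.shape[1]), method="bfgs")
# print(res.params)
```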

3.5.3 Ordered Probit model

For the second set of regressions, we transform the dependent variable into a categorical variable. The papers are divided into separate classes based on the number of times they have been cited. First, the top 10% most cited papers form impact class 3. Second, a small percentage representing the least cited papers forms impact class 1; the exact percentage is determined after investigating the data. Consequently, the largest group of papers (between impact classes 1 and 3) is called impact class 2. Now, we can examine the effect of novelty scores on different levels of impact, represented by the categorical variable $I_i$. Low, medium and high impact are indicated by $I_i = 1$, $I_i = 2$ and $I_i = 3$, respectively.

Consequently, we estimate an ordered Probit model to examine the effect of novelty on the different classes of impact. The idea of an ordered Probit model is to estimate an underlying latent variable instead of the dependent variable directly. Based on the value of the latent variable, the probabilities of being assigned to each class can be computed.

We consider the latent variable $I_i^*$ for the impact class variable $I_i$. Now, $I_i^*$ is estimated by a linear combination of the independent variables. We separately include the categorical novelty variables $CAT_i^m$ and $CSR_i^m$ in this regression. The model specification where the latent variable $I_i^*$ is estimated by $CAT_i^m$ is given by,

\[
\begin{aligned}
I_i^* = \beta_1 &(CAT_i^m = 2) + \beta_2 (CAT_i^m = 3) + \beta_3 IC_i + \beta_4 MI_i + \beta_5 \ln(authors_i) \\
&+ \beta_6 \ln(references_i) + \beta_7 \ln(keywords_i) + \sum_t \gamma_t + \epsilon_i
\end{aligned} \tag{14}
\]
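Such a specification can be estimated with, for example, statsmodels' OrderedModel, which estimates the thresholds directly and therefore includes no intercept, in line with (14). The data frame df and its column names below are hypothetical.

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# df is assumed to hold, per paper, the impact class (1/2/3), the novelty
# category dummies, the controls and the publication year.
exog = df[["cat2", "cat3", "ic", "mi",
           "ln_authors", "ln_references", "ln_keywords"]]
exog = exog.join(pd.get_dummies(df["year"], prefix="y", drop_first=True))

impact = df["impact_class"].astype(
    pd.CategoricalDtype(categories=[1, 2, 3], ordered=True))
res = OrderedModel(impact, exog.astype(float), distr="probit").fit(method="bfgs")
print(res.summary())
```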

The error term in the ordered Probit model has a standard Normal distribution, hence $\epsilon_i \sim N(0, 1)$. Now, the value of the ordinal variable $I_i$ is based on the value of the latent variable $I_i^*$ and the corresponding thresholds $\delta_1$ and $\delta_2$. The value of $I_i$ is given by,

\[
\begin{aligned}
I_i = 1 &\iff -\infty < I_i^* \le \delta_1 \\
I_i = 2 &\iff \delta_1 < I_i^* \le \delta_2 \\
I_i = 3 &\iff \delta_2 < I_i^* < \infty
\end{aligned} \tag{15}
\]

Let us abbreviate the RHS of (14) without the error term by $\sum_j \beta_j x_{ij}$, the summation of all independent variables and their coefficients. Now, as the error terms are standard normally distributed, we can compute the probabilities for each ordinal outcome by,


\[
\begin{aligned}
P[I_i = 1] &= P[-\infty < I_i^* \le \delta_1] = \Phi\Big[\delta_1 - \sum_j \beta_j x_{ij}\Big] \\
P[I_i = 2] &= P[\delta_1 < I_i^* \le \delta_2] = \Phi\Big[\delta_2 - \sum_j \beta_j x_{ij}\Big] - \Phi\Big[\delta_1 - \sum_j \beta_j x_{ij}\Big] \\
P[I_i = 3] &= P[\delta_2 < I_i^* < \infty] = 1 - \Phi\Big[\delta_2 - \sum_j \beta_j x_{ij}\Big]
\end{aligned} \tag{16}
\]

3.5.4 Probit model

Finally, we examine the relation between the novelty scores and big hit papers. We consider the top 1% most cited papers to be big hits. We use a Probit regression with the binary variable $Hit_i$ indicating whether paper $i$ is considered to be a big hit (1) or not (0). The methodology for this Probit model is similar to the ordered Probit explained in Section 3.5.3, with only two classes; therefore, we do not elaborate on the idea behind the model. The model specification is similar as well, as we include the same control variables and include novelty as categorical variables. The full model specification including the clustered novelty scores is given by,

\[
\begin{aligned}
Hit_i^* = \beta_1 &(CSR_i^m = 2) + \beta_2 (CSR_i^m = 3) + \beta_3 IC_i + \beta_4 MI_i + \beta_5 \ln(authors_i) \\
&+ \beta_6 \ln(references_i) + \beta_7 \ln(keywords_i) + \sum_t \gamma_t + \epsilon_i
\end{aligned} \tag{17}
\]
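A Probit on the big-hit indicator can be estimated along the same lines; the sketch below uses statsmodels with hypothetical column names, and adds a constant, which statsmodels does not include automatically.

```python
import pandas as pd
import statsmodels.api as sm

# df is assumed to hold a binary big-hit indicator and the regressors.
exog = df[["csr2", "csr3", "ic", "mi",
           "ln_authors", "ln_references", "ln_keywords"]]
exog = exog.join(pd.get_dummies(df["year"], prefix="y", drop_first=True))
exog = sm.add_constant(exog.astype(float))

res = sm.Probit(df["big_hit"], exog).fit()
print(res.summary())
```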

Now that we have discussed the models we use to examine the relation between a paper's degree of novelty and its impact, we turn to the actual data.

4 Data Description

The data used in this research is provided by CWTS, the Centre for Science and Technology Studies at Leiden University. We make use of a large database containing many characteristics of scientific papers. The database corresponds to the Web of Science database, an online scientific citation data tool, currently maintained by the company Clarivate Analytics. Examples of available information in the database are titles, keywords, abstracts and references. We use SQL to query the required data from a relational database.

For this research we focus on articles published in journals considered to be in the fields of economics or finance. Therefore, from the total number of 18,788 journals in the database, we select the 496 journals in these fields. We select all documents with document type ‘article’ that are published in these journals before 2009. From now on, the term ‘articles’ refers to documents with document type ‘article’. The cut-off at 2009 is set so that the impact of these articles in terms of citations can be measured as well; if we were to calculate the novelty for recently published papers, we would have no information on the impact these papers will have. Next, due to the methodology of the different methods, we have to take a couple of limitations into account while selecting the final data for novelty computation. In the end, we want to obtain the set of articles $A^*$ with three novelty scores, one calculated by each method. Therefore, for each article in $A^*$, all necessary information for all three methods must be available. Next, we discuss the process of selecting $A^*$, while examining the different methods.

4.1 Data selection

The new method introduced in this research is based on the abstracts of referenced articles. Therefore, we can only include articles that contain references to articles for which the abstract is available. The database from CWTS contains abstracts of articles in the social science field published in 1992 and later. Hence, for references to articles published before 1992 we do not have the abstract. We examine the age of citations: the difference in publication years between the citing and cited paper. We find that the average age of a citation in our data is 9 years. We set a fixed time window for references, such that more recent papers do not have a larger set of available references due to our data limitations. On the other hand, we do not want to exclude too many references. After examining the age of references, we decided to include all references within 10 years of the publication year. Using the 10-year window, 60.5% of the references are included. Using only the most recent cited literature in the novelty calculation seems justifiable, as combining new ideas probably leads to novel ideas more often than combining old literature.
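In terms of data handling, the window amounts to a simple filter on the citation age; a minimal sketch, with the data frame and column names assumed:

```python
import pandas as pd

# refs is assumed to hold one row per citing-cited pair with both
# publication years.
refs["age"] = refs["citing_year"] - refs["cited_year"]
in_window = refs[refs["age"].between(0, 10)          # 10-year window
                 & (refs["cited_year"] >= 1992)]     # abstracts available
```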

Now, since we include references from the last 10 years and abstracts are available from 1992, we select articles published between 2002 and 2009 for novelty calculation. We have a total of 97,421 articles published in economics/finance journals between 2002 and 2009. For these articles, we need the references. However, we first focus on the novelty measure based on keywords. Since the method from Bramoullé and Ductor (2018) requires an article to contain keywords, we select only the 54,625 articles that contain keywords chosen by the author. Now, for these articles we select the references. For 2,331 articles we do not have any references in the database. Next, we only select the references within the 10-year time window; this costs another 1,274 articles that have no references within this window. Finally, we select only the references that contain an abstract. This step deletes 582 articles from the data, since these do not contain any references with an abstract. Hence, we end up with set $A^*$ that contains a total of 50,438 articles with 433,186 corresponding references. These references consist of 136,060 distinct referenced articles.


The abstracts of these 136,060 distinct referenced articles are the input for the Paragraph Vector model. However, we run a Paragraph Vector model for each publication year separately. If we were to use all references as input for a single model, the novelty scores would be influenced by papers published in future years. Let us explain this in more detail. For example, the novelty score of a paper $i$ published in 2002 depends on the abstract vectors from the Paragraph Vector model. If we use all 136,060 references as input for this model, the referenced abstracts in paper $i$ are trained together with referenced abstracts in papers published in 2003-2009. Now, assume that a paper published in 2009 references a paper from 2006. This would mean that the abstract vectors for the referenced abstracts in paper $i$ are trained together with the abstract of this paper from 2006. In this way, the abstract vectors are influenced by abstracts published in later years, so it would not be possible to calculate the novelty score for a paper in the year that it is published. In order to fix this issue, we run the Paragraph Vector model separately for each publication year in 2002-2009. If we were to include only the referenced abstracts of the papers we use for novelty calculation ($A^*$), the training data for the earlier years would be limited. Therefore, we include referenced abstracts in papers published before 2002 as well. Consequently, the model for year $t$ is trained on all referenced abstracts in articles published in economics/finance journals between 1990 and $t$.
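A per-year training setup could look as follows, using gensim's Doc2Vec implementation of the Paragraph Vector model. The helper referenced_abstracts_until and the hyperparameter values are illustrative assumptions, not the settings used in this research.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

models = {}
for t in range(2002, 2010):
    # referenced_abstracts_until(t) is assumed to yield (id, tokens) for
    # every abstract referenced by economics/finance articles published
    # between 1990 and t, so the model for year t never sees later data.
    corpus = [TaggedDocument(words=tokens, tags=[doc_id])
              for doc_id, tokens in referenced_abstracts_until(t)]
    models[t] = Doc2Vec(corpus, vector_size=100, window=5,
                        min_count=5, epochs=20, workers=4)
```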

The 50,438 articles in set $A^*$ all include keywords, since we already imposed this restriction. For the method from Bramoullé and Ductor (2018), we select all other articles published between 1970 and 2009 that contain keywords as well. The resulting set consists of 7,909,306 articles that contain keywords. The keywords of these articles are used to determine the previous number of occurrences of keyword pairs from papers in $A^*$.


For the method from Wang et al. (2017) we need all new referenced journal pairs in the 50,438 articles in $A^*$. However, we only select the new journal pairs that consist of top 50% journals; Wang et al. (2017) impose this restriction to avoid trivial combinations. For each year in our data (2002 to 2009), we determine the top 50% cited journals in the preceding three years. Next, only new journal pairs for which both journals are in the top 50% are selected. Finally, we need for every year the co-citation matrix regarding the preceding three years. Therefore, for every paper published in the three previous years, we create the co-cited journal pairs; next, for every pair we count the number of times it occurred. This process is repeated for all years from 2002 to 2009, in order to create the co-citation matrices.
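Counting co-cited journal pairs reduces to enumerating unordered pairs of journals per paper; a minimal sketch, with the input layout assumed (the top-50% filter would be applied to the resulting pairs afterwards):

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(papers):
    """Count co-cited journal pairs.

    papers is assumed to be an iterable of reference lists, one per paper
    published in the three-year window, each holding the journals of the
    paper's cited references."""
    counts = Counter()
    for journals in papers:
        for pair in combinations(sorted(set(journals)), 2):
            counts[pair] += 1       # unordered pair of distinct journals
    return counts
```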

We now have all data necessary to compute all three measures of novelty for the 50,438 papers in $A^*$. However, during the computation of the abstract method, we introduce the following restriction: a paper in $A^*$ must reference at least three papers with available abstracts, in order to compute $Nov_{abstract}$ based on multiple cosine similarities. The final set of papers $A^*$ for which we calculate the novelty scores contains 42,668 papers.
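The minimum of three referenced abstracts guarantees at least three pairwise cosine similarities per paper. The sketch below computes only this similarity step from the Doc2Vec vectors; how the similarities are aggregated into $Nov_{abstract}$ follows the definition given in the methodology section and is not repeated here.

```python
import numpy as np

def pairwise_cosine(vectors):
    """Cosine similarities between all pairs of referenced-abstract
    vectors (rows of a 2-D array); a paper needs at least three
    references so that multiple pairwise similarities exist."""
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)   # unit-length rows
    sims = v @ v.T
    iu = np.triu_indices(len(v), k=1)                  # unique pairs only
    return sims[iu]
```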

4.2 Descriptive statistics regression variables

The dependent variables in the regressions are all variations of the impact of a paper. In this research, the impact of a paper is defined as the number of times the paper is cited in the 8 years after publication. Since we calculate novelty scores for papers up to 2009, this is the maximum citation window possible. As discussed in Section 3.5.3, we divide the impact of papers into three categories as well. We find that 5,624 papers are cited at most once; we consider these 13.2% least cited papers to be low impact class 1. The top 10% most cited papers in high impact class 3 are cited at least 36 times in the 8 years after publication. Consequently, the medium class contains papers cited between 2 and 35 times.

Table 3 displays descriptive statistics of the dependent variable and control variables. The correlations between all variables are included as well.

Variable           Mean   Std. Dev.  Min  Max   Correlation
                                                1     2     3     4     5     6
1 Times cited      15.41  25.97      0    941   1.00
2 Big hit          0.01   0.01       0    1     0.65  1.00
3 Int. collab      0.26   0.44       0    1     0.05  0.01  1.00
4 Multi institute  0.52   0.50       0    1     0.10  0.04  0.57  1.00
5 # authors        2.08   1.10       1    44    0.13  0.05  0.30  0.45  1.00
6 # references     12.28  10.13      3    264   0.19  0.08  0.04  0.08  0.08  1.00
7 # keywords       4.36   1.48       2    20    0.05  0.02  0.03  0.01  0.04  0.06

Significant correlations bold at p < 0.05

Table 3: Descriptive statistics regression variables

We observe that the average number of times a paper is cited in the next 8 years equals 15.41, with a standard deviation of almost 26. This finding supports the choice for a Generalized Negative Binomial model, as the variance is much larger than the mean. Obviously, the highest correlation is between the number of times a paper is cited and whether it is a big hit. We also find above-average correlations between international collaboration, multiple institutes and the number of authors. This is easily explained, as a larger number of authors increases the probability that they come from different institutions or countries.

5 Results

In this section, we discuss the resulting novelty scores based on the three methods. First, for each method we elaborate on the novelty calculation for a specific example article. Furthermore, we discuss issues and remarks that arise during the implementation of each method. After discussing the results of each method separately, we focus on comparing the novelty scores with each other.

As our example article, we consider the article by Dyar and Wagner (2003) titled “Uncertainty and species recovery program design”. The paper was published in the Journal of Environmental Economics and Management in 2003.

5.1 Novelty: New journal pairs

The example article contains 35 cited references in the database. Combining the journals in which the referenced articles are published results in 18 journal pairs that have not occurred in prior articles. The journals that form these pairs all belong to the top 50% cited journals in 2000-2002. For the 18 new journal pairs we calculate the ease of combining the journals as the cosine similarity between their co-citation profiles over the years 2000-2002. The length of each co-citation profile is 12,435, which equals the total number of journals in this period. The cosine similarities range from 0.004 to 0.144, and summing one minus the similarity over the 18 new pairs results in a novelty score of 17.60, the 238th highest novelty score for this method.
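Under that reading of the score (18 pairs with similarities between 0.004 and 0.144 indeed give a value close to, but below, 18), the computation is a few lines. The function name and input layout below are ours:

```python
import numpy as np

def journal_pair_novelty(profile_pairs):
    """Novelty as the sum of (1 - cosine similarity) over the co-citation
    profiles of a paper's new journal pairs.

    profile_pairs is assumed to be a list of (u, v) tuples, where u and v
    are the co-citation count vectors of the two journals in a new pair."""
    score = 0.0
    for u, v in profile_pairs:
        u, v = np.asarray(u, float), np.asarray(v, float)
        cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        score += 1.0 - cos      # dissimilar pairs contribute close to 1
    return score
```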

Now, we notice that the cosine similarities for the example paper take quite low values. Due to these low values, the methodology reduces to counting the number of new journal pairs: adjusting the number of new pairs by their dissimilarity has a negligible influence for such low cosine similarities. We experiment with the length of the co-citation profiles to examine the effect on the cosine similarities. If we include all journals in the co-citation profile, we include many journals that are almost never cited, so the profiles contain a large number of small values. We test whether the exclusion of lowly cited journals in the profiles affects the cosine similarities. First, we exclude the 25% least cited
