Characterizing in-text citations in scientific articles: A large-scale analysis.


https://openaccess.leidenuniv.nl

License: Article 25fa pilot End User Agreement

This publication is distributed under the terms of Article 25fa of the Dutch Copyright Act (Auteurswet) with explicit consent by the author. Dutch law entitles the maker of a short scientific work funded either wholly or partially by Dutch public funds to make that work publicly available for no consideration following a reasonable period of time after the work was first published, provided that clear reference is made to the source of the first publication of the work.

This publication is distributed under The Association of Universities in the Netherlands (VSNU) ‘Article 25fa implementation’ pilot project. In this pilot research outputs of researchers employed by Dutch Universities that comply with the legal requirements of Article 25fa of the Dutch Copyright Act are distributed online and free of cost or other barriers in institutional repositories. Research outputs are distributed six months after their first online publication in the original published version and with proper attribution to the source of the original publication.

You are permitted to download and use the publication for personal purposes. All rights remain with the author(s) and/or copyrights owner(s) of this work. Any use of the publication other than authorised under this licence or copyright law is prohibited.

If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please contact the Library through email:

OpenAccess@library.leidenuniv.nl

Article details

Boyack K.W., Eck N.J. van, Colavizza G. & Waltman L. (2018), Characterizing in-text citations in scientific articles: A large-scale analysis. Journal of Informetrics 12(1): 59-73.

DOI: 10.1016/j.joi.2017.11.005


Contents lists available at ScienceDirect

Journal of Informetrics

journal homepage: www.elsevier.com/locate/joi

Regular article

Characterizing in-text citations in scientific articles: A large-scale analysis

Kevin W. Boyack a,∗, Nees Jan van Eck b, Giovanni Colavizza c, Ludo Waltman b

a SciTech Strategies, Inc., Albuquerque, NM, USA

b Centre for Science and Technology Studies (CWTS), Leiden University, The Netherlands

c Digital Humanities Laboratory, École Polytechnique Fédérale de Lausanne, Switzerland

Article info

Article history:
Received 9 October 2017
Received in revised form 21 November 2017
Accepted 21 November 2017
Available online 1 December 2017

Keywords:
In-text citations
Citation position analysis
Field-level analysis
Reference age
Citation counts

Abstract

We report characteristics of in-text citations in over five million full text articles from two large databases – the PubMed Central Open Access subset and Elsevier journals – as functions of time, textual progression, and scientific field. The purpose of this study is to understand the characteristics of in-text citations in a detailed way prior to pursuing other studies focused on answering more substantive research questions. As such, we have analyzed in-text citations in several ways and report many findings here. Perhaps most significantly, we find that there are large field-level differences that are reflected in position within the text, citation interval (or reference age), and citation counts of references. In general, the fields of Biomedical and Health Sciences, Life and Earth Sciences, and Physical Sciences and Engineering have similar reference distributions, although they vary in their specifics. The two remaining fields, Mathematics and Computer Science and Social Sciences and Humanities, have different reference distributions from the other three fields and between themselves. We also show that in all fields the numbers of sentences, references, and in-text mentions per article have increased over time, and that there are field-level and temporal differences in the numbers of in-text mentions per reference. A final finding is that references mentioned only once tend to be much more highly cited than those mentioned multiple times.

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

The increasing availability of full text from scientific articles in machine readable electronic formats is a development with the potential to greatly impact citation analytics and to significantly improve the accuracy of models of the structure of science. Full text contains information not only on the exact locations of in-text citations within articles, but also on the context in which a citation to previous work is made. Specific problems that can be addressed using full text data include classification of in-text citations by type and function, and improving measures of impact by weighting of citations based on polarity, typology, function, citing location, and perhaps other features as well. Weighting of citations also has the potential to impact our knowledge of the structure of science, in that document clustering (and the resulting maps) could be based on a more accurate measure of the relatedness between documents. These applications, although beyond the scope of this paper, motivate the current work, which studies the characteristics of in-text citations (and associated features) in two large full text databases. A solid understanding of the characteristics of in-text citations is required before advanced applications of full text data, such as those mentioned, can be most fruitfully pursued.

☆ The peer review process of this paper was handled by Vincent Larivière, Associate Editor of Journal of Informetrics.

∗ Corresponding author.

E-mail addresses: kboyack@mapofscience.com (K.W. Boyack), ecknjpvan@cwts.leidenuniv.nl (N.J. van Eck), giovanni.colavizza@epfl.ch (G. Colavizza), waltmanlr@cwts.leidenuniv.nl (L. Waltman).

https://doi.org/10.1016/j.joi.2017.11.005
1751-1577/© 2017 Elsevier Ltd. All rights reserved.

Study of in-text citations and related text from scientific documents using full text sources has a long history. Although both positional (the location of references) and semantic (the meaning of references) studies have been pursued, here we focus primarily on the positional aspect. The terminology used in previous studies of in-text citations is not consistent. Thus, to avoid confusion, we define our terminology here. A reference is an item in the bibliography or reference list of a document. An in-text citation is a mention of a reference within the full text of a document. A reference can be mentioned one or more times in a document. Each mention is an in-text citation. We use the terms in-text citation and mention interchangeably in this article.
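The distinction between references and mentions can be illustrated with a small sketch. The function name, the identifiers and the toy article below are hypothetical and serve only to show how the two counts relate; they are not taken from the processing code used in this study.

```python
from collections import Counter


def count_mentions(reference_ids, in_text_citations):
    """Count how often each reference in the reference list is mentioned."""
    counts = Counter(in_text_citations)
    # A reference that never appears in the body text gets a count of 0.
    return {ref_id: counts.get(ref_id, 0) for ref_id in reference_ids}


# Hypothetical toy article: three references and four in-text citations.
references = ["ref1", "ref2", "ref3"]
mentions = ["ref1", "ref2", "ref1", "ref1"]

per_reference = count_mentions(references, mentions)
average_mentions = sum(per_reference.values()) / len(references)
print(per_reference, round(average_mentions, 2))
```

In this toy case one reference is mentioned three times, one is mentioned once, and one is listed but never mentioned in the text.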

Our work examines distributions of in-text citations for two large full text data sets – the PubMed Central (PMC) Open Access subset and a large portion of the Elsevier full text corpus. Using these large and disciplinarily broad data sets, we will show that there are significant variations in the distributions that have not been reported before. We specifically investigate field-level dependencies and report citation count distributions as a function of text progression.

The paper proceeds as follows. We first review relevant literature and then describe our data sets and analysis methods. Results are then reported along with key observations. The paper concludes with a summary, mention of limitations and suggestions for additional work.

2. Background

Over the years, many studies of the location or position of in-text citations have sought to identify the relative value of citations as a function of the position or the number of mentions. Implicit among many of these studies is the assumption that references that are more related to the citing article are the more valuable or essential references for that article. In reviewing prior work, we focus on those studies that explicitly include mention location in their analysis. As will be shown below, regarding distributions of in-text citations, there is a consensus among previous studies that mentions tend to be more concentrated at the beginnings (e.g., introduction and related work) and endings (e.g., discussion and conclusion) of articles than in the middle sections. There is also a rough consensus that references that are mentioned outside the introductory sections tend to be the most valuable.

Early studies were necessarily done by hand with small data sets. In one of the first studies, Voos and Dagaev (1976) examined citations to a set of four highly cited articles, two from biology, one from medicine and one from physics. Despite their very small sample, their findings suggested that a) most mentions come from introduction sections, b) the location and the number of mentions – which early studies often referred to as “op. cit.” – were both important in determining the value of a citation, c) time was important, and d) different disciplines had different citation patterns. Bonzi (1982) used a set of nearly 500 citations from 31 articles and found that the number of times a work is cited in the text “shows promise of predicting relatedness between citing and cited works”.

Cano (1989) sought to study citation function and utility while also examining position. Using 344 references that were coded by function and utility by their authors, they found that references that were mentioned in a perfunctory and negational way were most often peripheral or of low utility. They also found that references classified as organic, conceptual, operational or evolutionary were more typically essential or of higher utility, and that mentions were more concentrated in the first 15% of an article. Hooten (1991) examined 417 citing contexts and found that references with larger numbers of in-text citations seemed more related to the citing paper, and thus more essential, than those with only one in-text citation.

McCain and Turner (1989), using a set of 11 highly cited papers, created an index based on citing location, number of in-text citations, citation utility from citation contexts, and self-citation, finding that papers with a later citation peak (at six years) were more broadly useful than those with an early citation peak. Citations to papers with a later citation peak were more often for methodological advances rather than for experimental results or theoretical concepts. Maričić, Spaventi, Pavičić, and Pifat-Mrzljak (1998) examined citation contexts as well as locations using 11% of the mentions to a set of 357 articles, and suggested that references should be valued differently based on the section of the citing article in which they appear. They found references with relatively low “meaning” (or value) to be mentioned predominantly in the introduction, while those mentioned in other sections had higher meaning.

Bornmann and Daniel (2008) examined a set of 350 in-text citations to a set of articles written by grant applicants. Using the IMRaD (introduction, methods, results and discussion) structure, they found that while more mentions appeared in the introduction and discussion sections of citing articles, the methods and results sections were slightly enriched with mentions to articles with higher citation counts. In perhaps the most detailed comparison with ground truth data available, Tang and Safer (2008) surveyed authors of 49 articles in biology and 50 articles in psychology who assessed the mentions in their articles for importance, reason for citation, and relationship to the cited author. They found that reference importance increased proportionally with numbers of mentions and more detailed discussion of the cited document. In addition, the authors considered references mentioned in the methods and results sections to be most important, while those mentioned in the introduction section only were less important than those mentioned in other sections. Hou, Li, and Niu (2011) studied 651 biochemistry papers and found that references that shared at least 10 references with the citing paper had, on average, twice as many in-text citations as those with fewer than 10 shared references. In other words, references that were more similar to the citing paper had more in-text citations than those that were less similar.

Several more recent studies have also focused on the value of citations. Wan and Liu (2014) hand-coded 820 references in 40 Association for Computational Linguistics (ACL) papers for citation strength, and found that the average density of citation occurrence (i.e., the smallest distance between neighboring citation occurrences) was the best predictor of strength among the six tested. They applied their regression based on all six features to a much larger set of ACL references and found that only 14% were predicted to be very important. Zhu, Turney, Lemire, and Vellino (2015) surveyed authors of 100 papers (mostly in computer science) who coded 10.2% of references in their papers as “influential”. Among the large number of features that were compared with the reference coding, it was determined that the number of in-text mentions was the single feature that best predicted influence. Valenzuela, Ha, and Etzioni (2015) annotated 465 references from ACL papers, coding only 14.6% as “important”. After training a classifier on their annotated data with a number of features, they found that the number of in-text citations and the number of in-text citations per section were the most predictive features. Jones and Hanney (2016) also found that the percentage of cited articles that are central to the citing article increases with the number of in-text mentions. Zhao, Cappello, and Johnston (2017) coded 1473 in-text citations from 14 articles in Journal of the Association for Information Science and Technology (JASIST) by citation function, using three categories that were considered influential and two categories that were considered non-essential. To limit references to those that are influential for the purposes of citation analysis, they suggest that “removing all citation occurrences in the Background and Literature Review sections and uni-citations in the Introduction section appears to provide a good balance between filtration and error rates.”

Several recent works have focused once again on the distribution of in-text citations within full text rather than trying to discern the differing value of references. Hu, Chen, and Liu (2013) counted the number of mentions per section for 350 articles from Journal of Informetrics, finding that half of the mentions are in the first 30% of the text. They graphically represented distributions of mentions for articles with four, five and six sections. The six-section representation showed the greatest differentiation in the number of mentions per section, with counts decreasing (per kilo-words) through the first four sections, then increasing for the fifth section, and decreasing again for the last section. Ding, Liu, Guo, and Cronin (2013) also counted mentions by section for 866 JASIST articles, finding that the literature review section had the largest numbers, and that the most highly cited articles were referenced in the introduction and literature review sections.

The largest study of in-text citation distributions, and the most similar to our work, is the recent study by Bertin, Atanassova, Gingras, and Larivière (2016), who showed distributions of mentions and reference ages as a function of text progression from 45,000 scientific articles published in PLOS journals. Given that the PLOS documents have relatively consistent styles by journal and have well tagged section headers, they were able to characterize these distributions in terms of the IMRaD document style. Specific documents that have sections not in IMRaD order were reordered to IMRaD order, leading to the observation that mentions are most highly concentrated in the introductions of articles, followed by the discussion section. They also showed that reference ages (i.e., time between the publication years of a citing publication and a cited publication) are highest at the beginning of the introduction and in the methods section, decreasing at the end of the introduction and in the results and discussion sections. The fact that these distributions are very similar across PLOS journals led them to characterize these as “invariant” distributions.

Hu, Lin, Sun, and Hou (2017) also characterized properties of in-text citations using 350 papers from Journal of Informetrics, and focusing on multiply mentioned references. They found that 25.7% of the references are mentioned more than once with an average of 1.48 mentions per reference, that self-citations are more likely to be multiply mentioned than non-self-citations, and that the number of mentions per reference declines with reference age. Finally, in a related conference paper, we used the two data sets described in the next section to characterize distributions of mentions at the level of the full databases (Boyack, Van Eck, Colavizza, & Waltman, 2017).

3. Data and methods

We have obtained access to two large sources of full text scientific articles that are available in machine readable form – the PubMed Central Open Access Subset (PMCOA) and full text from Elsevier (ELS) journals. Elsevier content is also available in their ScienceDirect product. PMCOA contains roughly 36% of the articles from PubMed Central, and is comprised of a) articles from open access journals and b) articles that are required to be made publicly available under the National Institutes of Health (NIH) public access policy and legislative mandates. Given publisher embargos, inclusion of the latter type of articles is typically delayed 12 months after publication. Nearly all PMCOA articles are also indexed in PubMed. ELS contains articles from almost 3000 Elsevier journals. Given that Elsevier is the largest publisher of journals in the world, ELS is the single largest source of full text scientific articles currently available, and covers most scientific and technical fields, including the social sciences and humanities.

Each full text source was obtained and processed independently, at different times, by different members of our research team (i.e., PMCOA by SciTech Strategies and ELS by CWTS), and using different code, as will be explained below. Despite these differences, the same basic steps were applied to each source: downloading, filtering, parsing, database creation, reference matching, and analysis.

3.1. PubMed Central Open Access Subset

The PubMed Central Open Access Subset was downloaded in XML format and processed by SciTech Strategies in October 2015. The subset included data through mid-2015 and contained 1,113,891 individual records, of which 945,279 had an associated PubMed ID (PMID) and at least one reference. Most of the records without PMID were conference abstracts that are not indexed in PubMed. We further limited the data to articles that were classified in PubMed as either a ‘journal article’ or a ‘review’, that were published in 1998 or later, and that had at least one reference with a reasonable reference year (defined as being between 1900 and 2015 and where the publication year was no earlier than one year prior to the reference year).
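The reference-year rule can be written as a short predicate. This is a minimal sketch rather than the code used for the study: the function name is hypothetical, and treating the 1900 and 2015 bounds as inclusive is an assumption, since the text only says "between".

```python
def has_reasonable_reference_year(pub_year, ref_year, min_year=1900, max_year=2015):
    """Return True if a reference year passes the filter described in the text.

    Assumption: the bounds min_year and max_year are inclusive.
    """
    if ref_year is None:
        # References with a missing year cannot pass the filter.
        return False
    in_range = min_year <= ref_year <= max_year
    # The citing article may appear at most one year before the cited item.
    not_too_early = pub_year >= ref_year - 1
    return in_range and not_too_early


print(has_reasonable_reference_year(2010, 2011))  # reference dated one year ahead
print(has_reasonable_reference_year(2010, 2012))  # reference dated two years ahead
```

The one-year tolerance allows for cases such as a reference to an in-press item whose final publication year is later than that of the citing article.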

Each XML record was parsed such that individual sections, paragraphs and sentences were identified and numbered. This is important to properly locate each in-text citation. While sections and paragraphs were delimited using XML tags, sentences were delimited and split using the NLTK (http://www.nltk.org/) pre-trained Punkt tokenizer for English. References, along with their bibliographic metadata (including PMID in many cases) were also extracted. In-text citations and their exact positions in the text (in terms of character offsets and text progression centiles within the article) were identified in the sentence level data using reference tags. Multiple in-text citations in the same bracket, whether using author/date or numbered formats, were given the same position in the text, regardless of which reference was listed first. Figure and table captions and footnotes were not considered.
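The positional bookkeeping described above can be sketched as follows. The text does not give the exact centile formula, so the mapping used here (centiles 1 to 100 computed from a 0-based character offset) is an assumption, and both function names are hypothetical.

```python
def progression_centile(char_offset: int, article_length: int) -> int:
    """Map a 0-based character offset to a text progression centile (1 to 100)."""
    if article_length <= 0 or not 0 <= char_offset < article_length:
        raise ValueError("offset must lie inside the article text")
    # Integer arithmetic avoids floating point edge cases at centile borders.
    return (100 * char_offset) // article_length + 1


def locate_bracket(bracket_offset: int, reference_ids, article_length: int):
    """Give every reference cited in one bracket the same position."""
    centile = progression_centile(bracket_offset, article_length)
    return [(ref_id, bracket_offset, centile) for ref_id in reference_ids]


# Hypothetical bracket "[3, 7]" starting at character 2,500 of a 10,000 character article.
bracket_mentions = locate_bracket(2500, ["ref3", "ref7"], 10_000)
print(bracket_mentions)
```

Both references in the bracket receive the same offset and centile, regardless of the order in which they are listed.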

In addition to its centile position within the text, other data were added to each mention including the number of mentions (for the citing article and reference combination), reference age, and the Scopus citation counts to the reference as of the publication year of the citing article. Citation counts were obtained in a two-step process. First, Scopus article IDs were identified for each reference where possible by 1) using the listed PMID for the reference and looking up the corresponding Scopus ID from our matching table, or 2) matching the reference metadata to publication metadata from Scopus. Scopus IDs were identified for nearly 90% of the mentions. Citation counts to each reference were then obtained from our Scopus data tables and added to the data. We note that citation counts are incomplete for references published prior to 1996 since Scopus records only contain references from 1996 onward.

3.2. Elsevier full text

The Elsevier full text data were obtained in January 2017 and thus included a nearly complete 2016 publication year. Data were downloaded and filtered by CWTS using the following steps. First, the CrossRef REST API was used to identify all publications in Elsevier journals, numbering 8,437,487. Other types of publications, for instance publications in book series or conference proceedings, were not considered. Second, the Elsevier ScienceDirect API (Article Retrieval API) was used to download the identified publications in XML format. Downloading was possible only for the 7,862,859 publications to which Leiden University has access via subscription. For some of these publications, the XML included only metadata rather than full text. Publications without full text were discarded, leaving 6,179,750 XML records. In particular, all publications that appeared before 1998 were discarded, because for almost all of these publications XML formatted full text was not available. Finally, the data were limited to those records that were English-language publications specifically labeled as ‘full-length article’, ‘short communication’, or ‘review article’, leaving a total of 4,821,774 full text records for analysis.

Each XML full text record was parsed to create sections, paragraphs, and sentences. Sections and paragraphs were identified using XML tags. Only major sections were taken into account while subsections were ignored. Sentences could not be identified directly using XML tags. To identify sentences, we used a modified version of the sentence splitting algorithm provided in the BreakIterator class in the Java API. In-text citations and their exact positions in the text (in terms of character offsets and text progression centiles within the article) were identified in the sentence level data using the CWTS parsing algorithms. In-text citations occurring at special locations in the full text of a publication, such as in footnotes and in the captions of tables and figures, were not included in the analysis.

Publications in the ELS database were matched with publications in the Web of Science (WoS) database. Publications were first matched based on DOI. If no DOI-based match was obtained, publications were matched based on the combination of the last name with first initial of the first author, publication year, volume number, and first page number, which together form a relatively unique key for each publication. A match was required for all four fields. Of the publications retained for analysis, 4,503,790 could be matched with a publication in the WoS database. References in the full text publications were also matched with publications in the WoS database. In this case, DOI-based matching could not be applied because references in the ELS database do not include a DOI. Instead, matching was performed based on the four fields mentioned above. We note that the WoS database used by CWTS includes the Science Citation Index Expanded, the Social Sciences Citation Index, and the Arts & Humanities Citation Index. Other WoS citation indices are not included. Publications before 1980 are not included either.
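The two-stage matching logic (DOI first, then the four-field key) can be sketched as below. This is an illustration under stated assumptions and not the CWTS implementation: the record layout, the function names and the lower-casing of names and DOIs are all hypothetical choices.

```python
def make_key(record):
    """Four-field key: first author (last name plus first initial), year, volume, first page."""
    author = f"{record['last_name']} {record['first_initial']}".strip().lower()
    return (author, record["year"], str(record["volume"]), str(record["first_page"]))


def build_indexes(wos_records):
    """Index the target database by DOI and by the four-field key."""
    by_doi = {r["doi"].lower(): r["id"] for r in wos_records if r.get("doi")}
    by_key = {make_key(r): r["id"] for r in wos_records}
    return by_doi, by_key


def match(record, by_doi, by_key):
    """Try a DOI match first; fall back to the four-field key."""
    doi = record.get("doi")
    if doi and doi.lower() in by_doi:
        return by_doi[doi.lower()]
    # All four fields must agree for a key-based match.
    return by_key.get(make_key(record))


# Hypothetical toy data.
wos = [
    {"id": "W1", "doi": "10.1000/abc", "last_name": "Smith", "first_initial": "J",
     "year": 2010, "volume": "12", "first_page": "59"},
    {"id": "W2", "doi": None, "last_name": "Jones", "first_initial": "A",
     "year": 2005, "volume": "3", "first_page": "101"},
]
by_doi, by_key = build_indexes(wos)
reference = {"last_name": "JONES", "first_initial": "a", "year": 2005,
             "volume": 3, "first_page": 101}
print(match(reference, by_doi, by_key))
```

For references, which carry no DOI in the ELS data, only the second branch applies.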

Publications in the ELS database were categorized into five broad fields of science. The fields distinguished in the CWTS Leiden Ranking (www.leidenranking.com) were used for this purpose. Furthermore, for references in the full text publications, citation counts in the WoS database were determined. This was done only for references to publications indexed in the WoS database used by CWTS.

Table 1

Characteristics of the two full text data sets.

| | PMCOA | ELS |
|---|---|---|
| Years covered | 1998–2015 (partial) | 1998–2016 |
| # Publications | 884,557 | 4,821,774 |
| # References | 34,746,187 | 175,156,040 |
| # In-text citations (mentions) | 54,649,985 | 275,337,977 |
| # Mentions w/ citation counts | 48,834,690 | 189,482,219 |
| Avg # sections | 12.27 | 5.70 |
| Avg # paragraphs | 39.01 | 32.59 |
| Avg # sentences | 179.19 | 152.45 |
| Avg # sentences w/ mentions | 35.67 | 30.14 |
| Avg # characters | 27,582 | 24,588 |
| Avg # references | 39.28 | 36.33 |
| Avg # locations w/ mentions | 45.04 | 36.72 |
| Avg # in-text citations | 61.78 | 57.10 |

Table 2

Number of Elsevier full text papers by Leiden Ranking field (2000–2015).

| Field | Abbreviation | # Publications |
|---|---|---|
| Biomedical and Health Sciences | BHS | 1,447,377 |
| Life and Earth Sciences | LES | 568,195 |
| Mathematics and Computer Science | MCS | 237,080 |
| Physical Sciences and Engineering | PSE | 1,430,594 |
| Social Sciences and Humanities | SSH | 208,837 |

3.3. Descriptive statistics

Numbers of articles, references and in-text citations for both data sets are given in Table 1, along with other characteristics of the data. Roughly 3% of the references were not included in the PMCOA analysis because they were missing a reference year or did not have a reasonable reference year. In the ELS analysis, all references were included. Although the average article in the ELS data set is somewhat shorter than the average article in the PMCOA data set (e.g., in numbers of paragraphs, sentences, characters, etc.), the average numbers of mentions per reference (1.57 for both PMCOA and ELS) and percentage of sentences with mentions (19.9% and 19.8% for PMCOA and ELS, respectively) are very similar. The most substantial difference shown in Table 1 is in the numbers of sections; this difference is explained by the fact that major sections and subsections were counted for PMCOA while only major sections were counted for ELS.
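The two ratios quoted in this paragraph can be re-derived from the totals and averages in Table 1. The snippet below is only an arithmetic check; the dictionary layout and function names are illustrative.

```python
# Values copied from Table 1.
TABLE_1 = {
    "PMCOA": {"references": 34_746_187, "mentions": 54_649_985,
              "avg_sentences": 179.19, "avg_sentences_with_mentions": 35.67},
    "ELS": {"references": 175_156_040, "mentions": 275_337_977,
            "avg_sentences": 152.45, "avg_sentences_with_mentions": 30.14},
}


def avg_mentions_per_reference(row):
    """Total in-text citations divided by total references."""
    return row["mentions"] / row["references"]


def percent_sentences_with_mentions(row):
    """Share of sentences that contain at least one mention, in percent."""
    return 100 * row["avg_sentences_with_mentions"] / row["avg_sentences"]


for name, row in TABLE_1.items():
    print(name,
          round(avg_mentions_per_reference(row), 2),
          round(percent_sentences_with_mentions(row), 1))
```

Both data sets round to 1.57 mentions per reference, and the sentence shares round to 19.9% and 19.8%.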

Full text publications for which a match was obtained between the ELS database and the WoS database were also linked to the five broad fields of science distinguished in the CWTS Leiden Ranking. Publications from 2016 were excluded in this step since we made use of the 2016 edition of the Leiden Ranking, which includes only publications through 2015. The number of publications from 2000 through 2015 that could be linked to one of the five fields is 3,892,083, with a distribution of publications by field as shown in Table 2. Biomedical and Health Sciences (BHS) and Physical Sciences and Engineering (PSE) are the two largest fields and are roughly the same size, each containing over 35% of the field-linked ELS content. Mathematics and Computer Science (MCS) and Social Sciences and Humanities (SSH) are the two smallest fields, each with around 6% of the field-linked ELS content. Even though the fields differ widely in size, each field is sufficiently large that comparisons between fields should provide robust results.
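The field shares mentioned here follow directly from the counts in Table 2, as the short check below shows. The variable names are illustrative.

```python
# Publication counts copied from Table 2.
TABLE_2 = {
    "BHS": 1_447_377,
    "LES": 568_195,
    "MCS": 237_080,
    "PSE": 1_430_594,
    "SSH": 208_837,
}

total_linked = sum(TABLE_2.values())
# Share of each field among the publications linked to a field, in percent.
field_shares = {field: 100 * count / total_linked for field, count in TABLE_2.items()}

print(total_linked)
for field, share in field_shares.items():
    print(field, round(share, 1))
```

The five counts sum to the 3,892,083 field-linked publications reported in the text.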

4. Results

In-text citations (i.e., mentions) from full text articles can be characterized in a number of ways. In this study, we report statistics and distributions related to mentions as a function of time, number of mentions and text progression.

4.1. Temporal trends

Analyses were carried out to identify changes over time in the PMCOA and ELS data sets and in properties related to sentences, references and mentions.

Fig. 1 shows that the numbers of documents in PMCOA and ELS have very different temporal trends. ELS contains over 100,000 full text articles published in 1998, which was just over 12% of the number of original research documents (articles and reviews) indexed in WoS that year. ELS full text coverage has grown consistently over the years, stabilizing at roughly 22% of WoS content from 2004 to 2015, before growing again to 24% of WoS content in 2016. Note that these percentages are based on articles and reviews, and do not include other document types. In contrast, PMCOA contains less than 1% of the articles indexed in PubMed and published prior to 2004, but has rapidly expanded since then to where it contains 21% of PubMed articles published in 2016. PMCOA is highly skewed to recent content, while ELS contains a roughly consistent slice of the scientific literature since 2003. Note that while we had fewer PMCOA articles in 2015 than in 2014 due to the timing of our data acquisition, full year values are shown in Fig. 1 (top, red dashed line) for 2015 and 2016 using counts from a more recent query. Fig. 1 also shows that the numbers of full text documents in each of the five Leiden Ranking fields have increased at roughly similar rates.

Fig. 1. Full text coverage by year with numbers of documents analyzed by database and field (top) and percentage coverage with reference to a larger database (bottom). (For interpretation of the references to color in the text, the reader is referred to the web version of this article.)

Table 1 showed that the average numbers of sentences per document were different for the two databases, with 179.2 and 152.5 sentences for the PMCOA and ELS documents, respectively. However, these average numbers do not tell the whole story. Fig. 2 shows that the numbers of sentences vary by field, with SSH documents being longer than those in other fields with over 250 sentences on average in 2015. Documents in the medical (BHS) and physical (PSE) sciences are the shortest with only 150 and 165 sentences, respectively, on average in 2015.

Fig. 2 also shows that the average number of sentences per document has been increasing over time for all five Leiden Ranking fields using the ELS data, while the PMCOA data show a roughly constant number of sentences per document. In addition, the percentage of sentences with mentions has been increasing over time for both databases and for all five Leiden Ranking fields. Overall, the increase in numbers of sentences per article (using the larger ELS data set) may suggest that concerns about salami slicing of publications (Schein & Paladugu, 2001) that were indicated by studies in the 1970s showing a trend toward fewer pages per article (Broad, 1981) are perhaps overstated, and that authors have, for the most part, been packaging their results in “sizable reports” (Bornmann & Daniel, 2007) for the past two decades.

Fig. 2. Average numbers of sentences per document (top) and the percentages of sentences containing mentions (bottom) by year.

Increases in the number of sentences have been most dramatic for documents in the MCS and PSE fields at around 35% over the sixteen-year period from 2000 to 2015. Increases for BHS and LES have been only 20% over the same time period. It is interesting that even though the numbers of sentences per document have increased over the years, the percentage of sentences with mentions has also increased, indicating a subtle shift in referencing behavior. This may suggest that authors are feeling an increasing need to fully contextualize their research. Increases in the percentage of sentences containing mentions were most pronounced for SSH (39%), and were much lower for other fields. Note that averages and percentages reported in this work were calculated directly at the aggregate level. For instance, in the bottom plot in Fig. 2, the percentage of sentences containing mentions was obtained by expressing the number of sentences containing mentions across a set of papers (e.g., the set of all papers in a certain field and year) as a percentage of the total number of sentences across the same set of papers. An alternative approach could have been to calculate percentages at the level of individual papers and to then average these percentages. We did not take this approach.
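The difference between the aggregate-level calculation used here and the paper-level alternative can be seen in a toy example. The two papers below are invented for illustration and the function names are hypothetical.

```python
def aggregate_percentage(papers):
    """Pool all sentences across papers, then take one percentage (approach used here)."""
    with_mentions = sum(n_with for n_with, _ in papers)
    total = sum(n_total for _, n_total in papers)
    return 100 * with_mentions / total


def mean_of_paper_percentages(papers):
    """Compute a percentage per paper, then average them (approach not used)."""
    return sum(100 * n_with / n_total for n_with, n_total in papers) / len(papers)


# Hypothetical papers as (sentences with mentions, total sentences).
toy_papers = [(10, 100), (30, 50)]

aggregate_value = aggregate_percentage(toy_papers)
paper_level_value = mean_of_paper_percentages(toy_papers)
print(round(aggregate_value, 2), round(paper_level_value, 2))
```

The aggregate approach weights each paper by its length, so long papers influence the result more than short ones. In the toy data this gives about 26.7% against 35.0% for the paper-level average.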

Fig. 3. Average numbers of references (top left), mentions (top right), references per sentence (bottom left) and mentions per sentence (bottom right) by year.

We had expected the PMCOA and BHS curves to be roughly similar given that the BHS Leiden Ranking field should be similar in scope to what is covered in PubMed. To investigate this further, we identified 11,459 Elsevier documents from 2010 to 2015 that were in both data sets and found that the PMCOA and ELS versions had on average 170.0 and 189.3 sentences, respectively, per document. These numbers in principle should have been the same for both data sets. The difference must be due to differences in the PMCOA and ELS XML document formats and/or in the sentence splitting processes used by SciTech Strategies and CWTS. Since this study is focused on positional rather than semantic features, we are currently unconcerned about these differences. While they affect the numbers of sentences and thus do affect the comparison of the two data sources at the sentence level, they do not affect the temporal trends in the ELS data or the analyses based on text progression or numbers of mentions per reference.

Average numbers of references and mentions have both increased over time at a higher rate than that seen for sentences (Fig. 3). As it was for sentences, field-level differences exist for references and mentions. LES and SSH have the highest numbers of references and mentions per document, while MCS has the smallest numbers of references and mentions per document. MCS also has the smallest numbers of references and mentions per sentence at just over half the average for all fields. BHS has the largest numbers of references per sentence, while LES has recently overtaken BHS for the largest numbers of mentions per sentence. Growth rates in numbers of references and mentions per sentence have been highest for LES and SSH documents, while growth rates for the other three fields have been relatively modest. The similarity we had expected to see between the PMCOA and BHS curves can be seen in their numbers of references and mentions per document. However, their growth rates are different – numbers for PMCOA documents are growing more slowly than those for ELS documents. This may reflect the selective nature of PMCOA publications, which are either open access publications or those funded by NIH and made available by mandate.

Despite the growth rates in numbers of references and mentions, the average numbers of mentions per reference are very similar for the two databases and are nearly constant over time, decreasing only slightly from 1.59 in 2001 to 1.57 in 2015 (Fig. 4). There are, however, quite significant differences by field. The number of mentions per reference is highest for MCS and increasing slightly, while it is lowest for PSE and decreasing at a substantial rate. Mentions per reference are also increasing at a high rate for SSH. BHS and LES have changed the least over time. This suggests that the degree to which authors engage with the references they cite in their articles has not changed significantly for the health and life sciences (BHS and LES) in the past 15 years, but has changed substantially for the physical, engineering and social sciences (PSE, MCS, SSH). More detailed analysis, perhaps also at a semantic level, will be necessary to understand the reasons for those changes. In addition to the analysis shown here, it was previously shown that the average number of mentions per reference decreases with citation interval (Boyack et al., 2017).

Fig. 4. Average numbers of mentions per reference by year.

Table 3
Percentages of sentences as a function of number of mentions contained in a sentence for documents published in 2015.

#Mentions   PMCOA    ELS      BHS      LES      MCS      PSE      SSH
0           79.66%   79.48%   75.75%   75.12%   86.84%   82.20%   80.83%
1           12.56%   12.72%   14.63%   15.22%    9.00%   10.92%   12.37%
2            4.75%    4.01%    4.98%    5.04%    2.24%    3.37%    3.68%
3            1.54%    1.73%    2.19%    2.20%    0.85%    1.47%    1.54%
4            0.70%    0.84%    1.05%    1.04%    0.41%    0.75%    0.71%
5            0.32%    0.44%    0.54%    0.54%    0.22%    0.42%    0.36%
≥6           0.47%    0.77%    0.86%    0.83%    0.43%    0.86%    0.50%

Table 4
Percentages of references as a function of number of mentions for documents published in 2015.

#Mentions   PMCOA    ELS      BHS      LES      MCS      PSE      SSH
0           -         1.40%    1.45%    1.90%    1.14%    1.13%    1.57%
1           71.46%   69.48%   70.31%   67.94%   67.32%   72.78%   67.24%
2           16.69%   16.75%   16.59%   16.67%   16.76%   15.33%   17.08%
3            5.80%    6.04%    5.79%    6.27%    6.48%    5.31%    6.51%
4            2.59%    2.73%    2.56%    2.96%    3.19%    2.36%    3.07%
5            1.31%    1.40%    1.29%    1.58%    1.73%    1.19%    1.66%
6            0.75%    0.79%    0.71%    0.91%    1.04%    0.68%    0.96%
7            0.44%    0.47%    0.43%    0.56%    0.66%    0.41%    0.59%
8            0.28%    0.29%    0.27%    0.36%    0.45%    0.25%    0.38%
9            0.19%    0.19%    0.17%    0.24%    0.30%    0.16%    0.27%
10           0.13%    0.13%    0.12%    0.16%    0.23%    0.11%    0.18%
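As a rough illustration of the two statistics discussed in this section (average mentions per reference, as in Fig. 4, and the percentage of references by number of mentions, as in Table 4), the sketch below computes both from a list of per-reference mention counts. The function names and the toy data are hypothetical and are not taken from the pipeline used in the study.

```python
from collections import Counter


def mentions_per_reference(mention_counts):
    """Average number of in-text mentions per reference."""
    return sum(mention_counts) / len(mention_counts)


def mention_distribution(mention_counts):
    """Percentage of references for each observed number of mentions."""
    total = len(mention_counts)
    counts = Counter(mention_counts)
    return {k: 100.0 * counts[k] / total for k in sorted(counts)}


# Hypothetical toy data: one entry per reference in a reference list,
# giving how often that reference is mentioned in the body text.
toy_counts = [1, 1, 1, 1, 1, 1, 2, 2, 3, 0]

avg = mentions_per_reference(toy_counts)
dist = mention_distribution(toy_counts)
print(avg)
print(dist)
```

A reference with zero mentions corresponds to the "0" row of Table 4, for example a reference cited only in a figure or table caption.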

4.2. Distributions by mentions

Fig. 2 showed that on average about 20% of sentences contain a reference. Table 3 expands on this and shows the percentage of sentences as a function of the number of mentions and how these vary by field. MCS has the highest percentage of sentences with no references, and the lowest percentage of sentences with mentions for each value of the number of mentions. Less than 2% of sentences in MCS papers have three or more mentions. BHS and LES have the highest percentages of sentences with multiple mentions; yet less than 5% of sentences in these two fields have three or more mentions.
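The cumulative shares quoted above can be checked directly against Table 3. The short sketch below sums the relevant rows; the values are copied from Table 3, while the dictionary layout and the helper name are only illustrative.

```python
# Table 3 values (percent of sentences, documents published in 2015).
# List index = number of mentions in the sentence; the last entry is ">=6".
table3 = {
    "PMCOA": [79.66, 12.56, 4.75, 1.54, 0.70, 0.32, 0.47],
    "ELS":   [79.48, 12.72, 4.01, 1.73, 0.84, 0.44, 0.77],
    "BHS":   [75.75, 14.63, 4.98, 2.19, 1.05, 0.54, 0.86],
    "LES":   [75.12, 15.22, 5.04, 2.20, 1.04, 0.54, 0.83],
    "MCS":   [86.84, 9.00, 2.24, 0.85, 0.41, 0.22, 0.43],
    "PSE":   [82.20, 10.92, 3.37, 1.47, 0.75, 0.42, 0.86],
    "SSH":   [80.83, 12.37, 3.68, 1.54, 0.71, 0.36, 0.50],
}


def share_with_at_least(column, k):
    """Percent of sentences containing k or more mentions."""
    return round(sum(column[k:]), 2)


for field in ("MCS", "BHS", "LES"):
    print(field, share_with_at_least(table3[field], 3))

# Share of sentences with at least one mention ("about 20%")
print("PMCOA", share_with_at_least(table3["PMCOA"], 1))
```

For MCS the share of sentences with three or more mentions comes to 1.91%, and for BHS and LES to 4.64% and 4.61%, consistent with the "less than 2%" and "less than 5%" statements.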

It is also useful to know the frequency with which each reference is mentioned. Table 4 shows that 71.5% of references in the PMCOA corpus are mentioned only once, while the number is slightly lower, 69.5%, for the ELS corpus. Given the large sizes of our databases, these numbers are far more definitive than the values of 65.6% and 74.3% reported for JASIST (Zhao et al., 2017) and Journal of Informetrics (Hu et al., 2017) documents, respectively. Of the Leiden Ranking fields, PSE has the highest percentage of references mentioned only once, but the numbers across fields do not differ by much. Note also that for the ELS dataset, 1.4% of references were not mentioned in the text at all. Of these, most were mentioned in figure or table captions.

Fig. 5. Citation intervals (top) and citation counts per reference (bottom) as a function of number of mentions for documents published in 2015.

We also explored citation intervals and numbers of times references had been cited as a function of the number of mentions. Fig. 5 shows these statistics for the references and mentions in documents published in 2015. A single year was chosen for this analysis because reference ages and citation counts have both increased over time, and a single year gives a current (rather than an averaged) picture of these distributions. Fig. 5 shows that citation interval (i.e., reference age) decreases with the number of times a reference is mentioned. This is true for both databases, and for all fields. However, reference age does vary by field; SSH references are the oldest, while BHS references (which are well mirrored by the PMCOA data) are the youngest at roughly 2.5 years younger than the SSH references.
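To make the quantity concrete, the following sketch computes the mean citation interval (citing year minus cited year) grouped by number of mentions. It is a minimal illustration under assumptions: the record layout, the function name, and the choice to pool references with five or more mentions into one group are hypothetical, and the data are invented.

```python
from collections import defaultdict
from statistics import mean


def mean_interval_by_mentions(records, cap=5):
    """Mean citation interval per number of mentions.

    records: iterable of (citing_year, cited_year, n_mentions).
    References with `cap` or more mentions are pooled into one group
    (an assumption made here for illustration).
    """
    groups = defaultdict(list)
    for citing_year, cited_year, n_mentions in records:
        groups[min(n_mentions, cap)].append(citing_year - cited_year)
    return {k: mean(v) for k, v in sorted(groups.items())}


# Hypothetical references from documents published in 2015
toy = [
    (2015, 2005, 1),
    (2015, 2009, 1),
    (2015, 2010, 2),
    (2015, 2012, 2),
    (2015, 2013, 6),
]

result = mean_interval_by_mentions(toy)
print(result)
```

In this invented example the mean interval falls as the number of mentions rises, which is the direction of the trend reported for Fig. 5; the magnitudes carry no meaning.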

The statistics also show that references mentioned only once are typically more highly cited than those that are mentioned multiple times. The distributions vary widely by field. For instance, the decrease in citation counts with increasing mentions is quite drastic for the BHS, LES and PSE fields; for these fields citation counts for references mentioned five or more times are less than half that of references mentioned only once. References mentioned only once are also likely to be accompanied by less explanation than those mentioned multiple times (Zhao et al., 2017). Note also that we have not normalized citation counts with respect to reference age. References mentioned only once are older than references mentioned multiple times, and this may account for part of the difference in citation counts.

We have seen in Fig. 5 that references mentioned only once tend to be relatively old and tend to have substantially higher citation counts than references mentioned multiple times. It could be of significant interest to try to explain these observations based on theoretical ideas, either existing ones (e.g., concept symbols, perfunctory citations, etc.) or new ones, about the citation behavior of researchers. We leave this as a topic for future research.

The MCS and SSH fields show much different behavior. Here, references mentioned twice have slightly higher citation counts than those cited only once, and the decrease with increasing mentions is much more gradual than for the other three fields. Also, it is interesting that SSH references are more highly cited than those in other fields for references mentioned at least twice. That those SSH papers that are cited are on average more highly cited than biomedical papers is perhaps counterintuitive given that lists of the most highly cited papers are typically dominated by biomedicine (Nicholson & Ioannidis, 2012).

One other caveat with respect to citation counts needs to be mentioned. The results in Fig. 5 do not include citation counts for references that cannot be identified in WoS (for the ELS data) or Scopus (for the PMCOA data), and might be different if these references and their times cited were known. This is particularly true for SSH because it has a higher fraction of references that cannot be identified than the other fields (Hicks, 2004).

4.3. Distributions by text progression

While several other studies have aimed to characterize distributions of mentions in terms of the IMRaD structure (Bertin et al., 2016; Bornmann & Daniel, 2008; Hu et al., 2013; Maričić et al., 1998), our study simply characterizes these distributions as a function of text progression. We did not assign mentions to sections due to the lack of uniformity in section naming and ordering across journals. For example, while the ordering implied by IMRaD is likely valid for a significant share of all journals, for other journals it is conventional for the methods section to be at the end of the article rather than after the introduction. In addition, many articles, in particular in SSH journals, have a section structure that differs substantially from IMRaD.

Fig. 6 shows the distribution of mentions for the PMCOA and ELS datasets, and for the Leiden Ranking fields, as a function of text progression. Note that, as already mentioned in the discussion of Fig. 2, all statistics have been calculated at the aggregate level. The PMCOA and ELS curves both show the same general pattern, with a high level of referencing at the beginning of a document which decreases rapidly to the 25th centile where it essentially flattens out, and which is then followed by a secondary peak at around the 80th centile before decreasing again at the end of a document. However, this general pattern does not hold for all fields. For instance, documents in the MCS and PSE fields do not have a secondary peak at all, but rather maintain a relatively constant rate of referencing from the 50th through 90th centiles. Documents in these two fields have the highest fraction of their references (16–18%) within their first five centile portion. In contrast, documents in the SSH field have the lowest fraction of their references (12%) in the first five centiles, and the level of referencing decreases thereafter more gradually than for other fields; the low point in referencing is not reached until the 60th centile, after which there is a small secondary peak at the 85th centile. The BHS field has a distribution that is similar to that of the PMCOA data, which is what we would expect to see given that they are both medicine-centric.
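A distribution of this kind can be sketched as a histogram over five-centile bins, with mentions pooled across all documents (the aggregate level). The code below is a minimal illustration only: it assumes that the position of a mention is given by the 0-based index of its sentence relative to the number of sentences in the document, and the function names and toy documents are hypothetical.

```python
def centile_bin(sentence_index, n_sentences, width=5):
    """Lower edge of the centile bin for a 0-based sentence index."""
    centile = 100.0 * sentence_index / n_sentences  # value in [0, 100)
    return int(centile // width) * width


def mention_profile(documents, width=5):
    """Percent of all mentions falling in each centile bin.

    documents: list of (n_sentences, [sentence indices of mentions]).
    Mentions are pooled over all documents before percentages are taken.
    """
    bins = {b: 0 for b in range(0, 100, width)}
    total = 0
    for n_sentences, mention_positions in documents:
        for idx in mention_positions:
            bins[centile_bin(idx, n_sentences, width)] += 1
            total += 1
    if total == 0:
        return {b: 0.0 for b in bins}
    return {b: 100.0 * c / total for b, c in bins.items()}


# Two hypothetical documents of different lengths
toy_docs = [
    (100, [0, 1, 2, 50, 80]),
    (40, [0, 10, 39]),
]

profile = mention_profile(toy_docs)
print(profile[0], profile[25], profile[95])
```

Because mentions are pooled before percentages are computed, long documents with many mentions weigh more heavily than short ones; averaging per-document profiles would be a different design choice and could give different curves.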

In a recent article, Bertin et al. (2016) analyzed the distribution of in-text citations of 45,000 documents in PLOS journals. They found that when articles were used “as is”, there were some variations between the reference distributions by journal. However, once the sections of all articles were re-ordered to the IMRaD structure, the reference distribution was similar across all PLOS journals, which led them to characterize these as “invariant” distributions. In our recent conference paper, we showed that the PMCOA distributions are similar to PLOS “as is” distributions (Boyack et al., 2017). However, Fig. 6 shows that the PMCOA distribution (and, by extension, the PLOS distribution) is specific to biomedicine, and that there are significant variations in the distributions of references between fields. The differences between fields in Fig. 6 are much larger than any of the differences in the “as is” distributions of PLOS journals from Bertin et al. Thus, when all fields are considered, there is no single “invariant” distribution of references in scientific articles. Furthermore, we suspect that there may be significant variations in the distributions within fields that can only be shown with a more granular analysis.

Fig. 6 shows the distribution of mentions as a function of text progression for two separate years, 2005 and 2015. Comparison of the curves for the two years suggests that citing behavior in terms of where references are cited in the document text has not changed appreciably over this ten-year period. There is a greater difference between the PMCOA and BHS curves in 2005 than in 2015. We suspect that this difference stems from the relatively small number (10,174) of PMCOA documents in 2005, which were also biased to a particular set of open access journals.

We also investigated citation interval as a function of text progression. Fig. 7 shows results for the PMCOA and ELS datasets, and for the Leiden Ranking fields. The PMCOA and ELS curves are similar in nature, with a slight decrease in reference age immediately after the first five-centile bin, followed by an increase to a peak value at the 30th to 35th centile and then a gentle decay to the end of the article. This pattern is also followed by the BHS, LES, and PSE fields and intuitively makes sense for these empirical fields. The introduction of a paper typically starts by referencing established research, which is followed by references to more recent and directly related research at the end of the introduction. This is followed by methods which are often older, and younger literature is then cited throughout the balance of the paper for comparison purposes. However, the pattern differs for the other two fields. For MCS, the peak does not occur until the 50th centile, and for SSH there is no pronounced dip after the first five-centile bin, but rather a gentle and nominal increase to a plateau which extends from the 30th to 60th centile. This is then followed by the type of decrease seen in the other fields. As we also observed in Fig. 5, BHS has the smallest citation interval; references are on average two years older for the other fields. Also, once again, the BHS and PMCOA curves are very similar. Field-level effects are thus seen in citation intervals as well as in the distributions by text progression.

Fig. 6. Percentage of mentions as a function of text progression for documents published in 2005 (top) and 2015 (bottom) using bin widths of five centiles.

Finally, we show the average citation counts for references mentioned as a function of text progression. Fig. 8 shows that the PMCOA and ELS curves are similar. References cited at the beginning of an article are more highly cited than those appearing later in the introduction. There is then a huge increase in citation counts to a peak at the 30th centile, followed by a decrease, and then another slight increase toward the end of the paper. This final increase is more pronounced for PMCOA than for ELS, probably because PMCOA has a higher share of articles in which the methods section is located at the end of the article.

The citation profiles for the BHS, LES and PSE fields follow this same pattern, although their peak points differ. The peak is reached earlier in PSE articles (25th centile) and later in LES articles (35th centile). Nevertheless, in each of these cases, these peaks still roughly correspond with the traditional location of a methods section, and thus suggest that methods papers are overrepresented among the most highly cited papers in these fields (Van Noorden, Maher, & Nuzzo, 2014). Note that for each of these fields, the peak is roughly three times higher than the valley at the 10th centile.

Fig. 7. Citation interval as a function of text progression for documents published in 2015 using bin widths of five centiles.

Fig. 8. Citation counts per reference as a function of text progression for documents published in 2015 using bin widths of five centiles.

The other two fields, MCS and SSH, have citation profiles that vary dramatically from those of the BHS, LES and PSE fields. For example, for the SSH field, there is little variation in the citation counts to references over the first 20 centiles, which is followed by a significant increase to a peak at the 60–65th centile at nearly 500 citations per reference. More detailed analysis is required to understand exactly why this curve differs so much from those of the BHS, LES and PSE fields. Nevertheless, it seems clear from these results, and from those in Fig. 6, that SSH articles have a different cognitive structure than those in other sciences (Fanelli & Glänzel, 2013). MCS has a profile that is similar to the SSH profile with a peak at the 55th centile, but with far lower citation counts.

One other difference of interest from Fig. 8 is that the peak citation counts for references in PMCOA articles are higher than those in BHS articles. While this difference is large at the peak, it is much smaller for references in the first 10 centiles and from the 45th to 80th centiles. Thus, the overall differences are less than what would be suggested by simply comparing peaks. We suspect that an important reason for any differences is that Scopus citation counts are typically higher than WoS citation counts for the same paper since Scopus coverage is broader than WoS coverage. Regarding Fig. 8, we note also that the same caveats mentioned in association with Fig. 5 apply here as well. These results do not include references that could not be identified in WoS or Scopus, and have not been normalized to account for reference age.

5. Conclusions

In this paper, we have analyzed the in-text citations of over five million articles from two large databases: the PubMed Central Open Access Subset and Elsevier journals. Differences between fields of science have been studied using the five fields defined in the CWTS Leiden Ranking. Perhaps most significantly, we find that there are field-level differences that are reflected in position within the text, citation interval (or reference age), and citation counts of references. These results are fundamentally different from the results of Bertin et al. (2016), who observed that the distribution of references as a function of text progression was relatively constant for PLOS journals. In general, the fields of Biomedical and Health Sciences (BHS), Life and Earth Sciences (LES), and Physical Sciences and Engineering (PSE) have similar reference distributions, although they vary in their specifics. For all three fields, citation counts of references peak at around the 30th centile, suggesting that methods papers are more highly cited than other types of papers. This contradicts the finding of Ding et al. (2013) that the most highly cited papers are mentioned in the introduction and literature sections.

The two remaining fields, Mathematics and Computer Science (MCS) and Social Science and Humanities (SSH), have very different reference distributions from the other three fields. References are more evenly distributed with respect to position in articles in these two fields than in the other three fields, with more references in the 15th to 50th centile range, and fewer references in the latter half of an article. These fields also differ in that citation counts to references peak at the 60th centile rather than at the 30th centile. These two observations may suggest that knowledge produced in MCS and SSH relies on previous work in a different way than in the other fields. More detailed work that analyzes semantic as well as positional structure will be needed to explore these conjectures.

We have also shown that numbers of sentences, references, and mentions have all increased over time for all fields. The numbers of mentions per reference have remained nearly constant over the past 15 years when considering the entire corpus. However, once again, there are variations by field. Mentions per reference have increased for MCS and SSH, while they have decreased for PSE. Another noteworthy finding is that references mentioned only once are much more highly cited than those mentioned many times. It could be of significant interest to explore possible explanations for this finding.

There are several limitations to this study that should be noted. First, while the study is large, it still covers only a relatively modest share of the articles published in recent years. Additional data from other sources could show different patterns. However, since coverage of the Elsevier full text data is broad from a disciplinary perspective (Boyack, Small, & Klavans, 2013), we do not expect potential differences to be large. Second, only five high level fields were considered, and it is possible that within-field differences may be just as large as the between-field differences shown here. Third, we did not attempt to normalize documents to the IMRaD structure in terms of section names, but simply used textual position as a basis of analysis. Thus, any variations in structural form are not accounted for in our results. We suspect that such normalization would create slightly sharper features in the distributions, similar to what was noted by Bertin et al. (2016) when they unified PLOS documents to the IMRaD structure. A final limitation in this study results from the fact that different parsing algorithms were used for the PMCOA and ELS datasets. However, this only affects comparison of PMCOA and ELS results at the sentence level, and does not impact any other part of the analysis.

Despite these limitations, we consider the results to be robust. This is by far the largest study of its kind that has been published to date, covering 100 times more articles and references than that of Bertin et al. (2016). Now that distributions have been quantified at a high level, we look forward to pursuing additional studies that explore features of full text articles at more detailed levels, using both positional and semantic analyses. Such studies have the potential to influence our understanding of citation theory and behavior, and to have practical influence on applications such as information search and retrieval and accurate modeling of the structure and dynamics of science.

Author’s contribution

Kevin W. Boyack: Conceived and designed the analysis; Collected the data; Contributed data or analysis tools; Performed the analysis; Wrote the paper.

Nees Jan van Eck: Conceived and designed the analysis; Collected the data; Contributed data or analysis tools; Performed the analysis; Wrote the paper.

Giovanni Colavizza: Conceived and designed the analysis; Wrote the paper.

Ludo Waltman: Conceived and designed the analysis; Wrote the paper.

Acknowledgments

Kevin Boyack and Giovanni Colavizza both thank CWTS for hosting them as visiting scholars, during which time most of this work was performed. We thank Mike Patek of SciTech Strategies, Inc. for extraction and fielding of the full text from PubMed Central, and Richard Klavans and Vincent Traag for helpful discussion on our work. Giovanni Colavizza is funded by Swiss National Fund grant number P1ELP2168489.

References

Bertin, M., Atanassova, I., Gingras, Y., & Larivière, V. (2016). The invariant distribution of references in scientific articles. Journal of the Association for Information Science and Technology, 67(1), 164–177.

Bonzi, S. (1982). Characteristics of a literature as predictors of relatedness between cited and citing works. Journal of the American Society for Information Science, 33(4), 208–216.

Bornmann, L., & Daniel, H.-D. (2007). Multiple publication on a single research study: Does it pay? The influence of number of research articles on total citation counts in biomedicine. Journal of the American Society for Information Science and Technology, 58(8), 1100–1107.

Bornmann, L., & Daniel, H.-D. (2008). Functional use of frequently and infrequently cited articles in citing publications: A content analysis of citations to articles with low and high citation counts. European Science Editing, 34(2), 35–38.

Boyack, K. W., Small, H., & Klavans, R. (2013). Improving the accuracy of co-citation clustering using full text. Journal of the American Society for Information Science and Technology, 64(9), 1759–1767.

Boyack, K. W., Van Eck, N. J., Colavizza, G., & Waltman, L. (2017). Reference behavior in the full text of scientific articles: A large-scale analysis. In 16th international conference of the international society on scientometrics and informetrics.

Broad, W. J. (1981). The publishing game: Getting more for less. Science, 211(4487), 1137–1139.

Cano, V. (1989). Citation behavior: Classification, utility, and location. Journal of the American Society for Information Science, 40(4), 284–290.

Ding, Y., Liu, X., Guo, C., & Cronin, B. (2013). The distribution of references across texts: Some implications for citation analysis. Journal of Informetrics, 7(3), 583–592.

Fanelli, D., & Glänzel, W. (2013). Bibliometric evidence for a hierarchy of the sciences. PLoS One, 8(6), e66938.

Hicks, D. (2004). The four literatures of the social science. In H. F. Moed, W. Glänzel, & U. Schmoch (Eds.), Handbook of quantitative science and technology research: The use of publication and patent statistics in studies of S&T systems (pp. 473–495). Dordrecht: Springer.

Hooten, P. A. (1991). Frequency and functional use of cited documents in information science. Journal of the American Society for Information Science, 42(6), 397–404.

Hou, W.-R., Li, M., & Niu, D.-K. (2011). Counting citations in text rather than reference lists to improve the accuracy of assessing scientific contribution. Bioessays, 33(10), 724–727.

Hu, Z., Chen, C., & Liu, Z. (2013). Where are the citations located in the body of scientific articles? A study of the distributions of citation locations. Journal of Informetrics, 7(4), 887–896.

Hu, Z., Lin, G., Sun, T., & Hou, H. (2017). Understanding multiply mentioned references. Journal of Informetrics, 11(4), 948–958.

Jones, T. H., & Hanney, S. (2016). Tracing the indirect societal impacts of biomedical research: Development and piloting of a technique based on citations. Scientometrics, 107(3), 975–1003.

Maričić, S., Spaventi, J., Pavičić, L., & Pifat-Mrzljak, G. (1998). Citation context versus the frequency counts of citation histories. Journal of the American Society for Information Science, 49(6), 530–540.

McCain, K. W., & Turner, K. (1989). Citation context analysis and aging patterns of journal articles in molecular genetics. Scientometrics, 17(1–2), 127–163.

Nicholson, J. M., & Ioannidis, J. P. A. (2012). Conform and be funded. Nature, 492(7427), 34–36.

Schein, M., & Paladugu, R. (2001). Redundant surgical publications: Tip of the iceberg? Surgery, 129(6), 655–661.

Tang, R., & Safer, M. A. (2008). Author-rated importance of cited references in biology and psychology publications. Journal of Documentation, 64(2), 246–272.

Valenzuela, M., Ha, V., & Etzioni, O. (2015). Identifying meaningful citations. In 29th AAAI conference on artificial intelligence, AAAI 2015 (pp. 21–26).

Van Noorden, R., Maher, B., & Nuzzo, R. (2014). The top 100 papers. Nature, 514(7524), 550–553.

Voos, H., & Dagaev, K. S. (1976). Are all citations equal? Or, did we op. cit. your idem? Journal of Academic Librarianship, 1(6), 19–21.

Wan, X., & Liu, F. (2014). Are all literature citations equally important? Automatic citation strength estimation and its applications. Journal of the Association for Information Science and Technology, 65(9), 1929–1938.

Zhao, D., Cappello, A., & Johnston, L. (2017). Functions of uni- and multi-citations: Implications for weighted citation analysis. Journal of Data and Information Science, 2(1), 51–69.

Zhu, X., Turney, P., Lemire, D., & Vellino, A. (2015). Measuring academic influence: Not all citations are equal. Journal of the Association for Information Science and Technology, 66(2), 408–427.
