Combining Full-text analysis & Bibliometric Indicators

(1)

a pilot study

Patrick Glenisson¹, Wolfgang Glänzel¹,², Olle Persson³

(2)

Introduction

• Goal: mapping of scientific processes
• Map of scientific papers
• Characterization of emerging clusters
• Extraction of new search keys
• Using bibliometric as well as lexical indicators of ‘relatedness’
• Full-text analysis

(3)

Overview

• Data sources and questions asked
• Text mining ingredients
• Text-based relational analysis of documents
• Contrasts with bibliometric analysis
• Term extraction from full-text
• Conclusion

(5)

Data source

• 19 full-text papers from Scientometrics, Vol 60, Issue 3 (2004)
   - special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China)
• Validation setup
• Manual assignment of the papers to various classes

(6)

Data source

Section code | Section name | Papers
I | Advances in Scientometrics | Havemann et al. (2004); Moed and Garfield (2004); Small (2004); Yue and Wilson (2004)
II | Policy relevant issues | Negishi et al. (2004); Shelton and Holdridge (2004); Markusova et al. (2004); Wu et al. (2004)
III | Bibliometric approaches to collaboration in science | Beaver (2004); Kretschmer (2004); Persson et al. (2004); Yoshikane and Kageura (2004)
IV | Advances in Informetrics and Webometrics | Lamirel et al. (2004); Qiu and Chen (2004); Tang and Thelwall (2004); Vaughan and Wu (2004)
V | Mathematical models in Informetrics and Scientometrics | Egghe (2004); Glänzel (2004); Shan et al. (2004)

(7)

Research questions

• Comparison of text-based mapping vs. expert classification
• Extracted keywords
• Comparison with bibliometric mapping

(9)

Methodology

• Given a set of documents,
• compute a representation, called the index:

<1 0 0 1 0 1>
<1 1 0 0 0 1>
<0 0 0 1 1 0>

• to retrieve, summarize, classify or cluster them
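A minimal sketch of such an index, assuming the simplest possible case: each document becomes a 0/1 vector over a shared vocabulary. The function name `build_binary_index`, the whitespace tokenization and the toy documents are illustrative assumptions, not taken from the slides.

```python
def build_binary_index(documents):
    """Return (vocabulary, vectors) with one 0/1 vector per document."""
    # Naive tokenization: lower-case and split on whitespace (an assumption of this sketch).
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in the collection.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        present = set(toks)
        # 1 if the term occurs in the document, 0 otherwise.
        vectors.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, vectors


# Toy documents, invented for illustration only.
docs = ["citation analysis of journals", "citation networks", "web links analysis"]
vocabulary, vectors = build_binary_index(docs)
```

Each row of `vectors` plays the role of one of the bit patterns shown on the slide; richer weightings replace the 0/1 entries with real numbers.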

(12)

Methodology

• Document processing
   - Remove punctuation & grammatical structure (‘bag of words’)
• Define a vocabulary
   - Identify multi-word terms (e.g., tumor suppressor) (phrases)
   - Eliminate low-content words (e.g., and, thus, ...) (stopwords)
   - Map words with the same meaning (synonyms)
   - Strip plurals, conjugations, ... (stemming)
• Define a weighting scheme and/or transformations (TF-IDF, SVD, ...)
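The processing steps listed above could be sketched roughly as follows. Everything here is an assumption made for illustration: the stopword list, the synonym map, the phrase list and the crude suffix-stripping `stem` function are hypothetical stand-ins (a real pipeline would more likely use an established stemmer such as Porter's), and the weighting step is left out of this sketch.

```python
import re

# Illustrative resources; not the lists used in the study.
STOPWORDS = {"and", "thus", "the", "of", "a", "in", "is", "are", "to"}
SYNONYMS = {"tumour": "tumor"}
PHRASES = {("tumor", "suppressor")}


def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Turn raw text into a list of normalized tokens (bag of words)."""
    # Lower-case, drop punctuation and digits, then map synonyms.
    words = [SYNONYMS.get(w, w) for w in re.findall(r"[a-z]+", text.lower())]
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            # Known multi-word term: keep it as one token joined by "_".
            tokens.append("_".join(stem(w) for w in pair))
            i += 2
        elif words[i] in STOPWORDS:
            i += 1  # low-content word, skipped
        else:
            tokens.append(stem(words[i]))
            i += 1
    return tokens


tokens = preprocess("The tumour suppressor genes are mutated, and thus inactive.")
```

Phrases are detected before stopwords are removed so that only words that were adjacent in the original text can be merged into a multi-word term.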

(13)

Methodology

• Compute index of textual resources:

[Figure: documents in a vector space with term axes T1, T2, T3]

(15)

Results – Term statistics

• 19 papers
• 3610 retained terms (including ~400 bigrams)
• Distance matrix (19×19)
• Apply MDS
• Apply clustering
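The slides do not say which distance measure underlies the 19×19 matrix. As a hedged sketch, the snippet below assumes cosine distance between document vectors, a common choice for term-based indexes; `cosine_distance` and `distance_matrix` are illustrative names.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Symmetric n x n matrix of pairwise distances with a zero diagonal."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = cosine_distance(vectors[i], vectors[j])
    return matrix


# Three toy document vectors; the study would use 19 vectors over 3610 terms.
D = distance_matrix([[1, 0], [0, 1], [1, 1]])
```

A matrix of this form is the usual input to both multidimensional scaling and hierarchical clustering.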

(16)

Results – MDS

[Figure: MDS map of the 19 papers, with groups labelled "Policy" and "Mathematical approaches"]

(18)

Results – Clustering

• Hierarchical clustering (Ward's method), cut-off k = 4
• Optimal parameters? ‘Stability-based method’
• Quantified correspondence with expert assignments? ‘Rand index’
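The slide names Ward's method with a cut-off at k clusters but gives no implementation details. The following is a small pure-Python sketch under the assumption that documents are given as numeric vectors and compared by squared Euclidean distance, using the standard Lance-Williams update for Ward's criterion. The function name `ward_clusters` and the toy points are hypothetical.

```python
def ward_clusters(points, k):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters."""
    clusters = {i: [i] for i in range(len(points))}
    # Squared Euclidean distances between all singleton clusters.
    dist = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist[(i, j)] = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    next_id = len(points)
    while len(clusters) > k:
        # Merge the pair with the smallest Ward dissimilarity.
        i, j = min(dist, key=dist.get)
        d_ij = dist.pop((i, j))
        ni, nj = len(clusters[i]), len(clusters[j])
        merged = clusters.pop(i) + clusters.pop(j)
        # Lance-Williams update: dissimilarity of every other cluster to the merged one.
        for m in clusters:
            nm = len(clusters[m])
            d_mi = dist.pop((min(m, i), max(m, i)))
            d_mj = dist.pop((min(m, j), max(m, j)))
            dist[(m, next_id)] = (
                (ni + nm) * d_mi + (nj + nm) * d_mj - nm * d_ij
            ) / (ni + nj + nm)
        clusters[next_id] = merged
        next_id += 1
    # Return clusters as sorted lists of point indices.
    return sorted(sorted(members) for members in clusters.values())


toy_points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
toy_clusters = ward_clusters(toy_points, k=3)
```

Choosing k itself (the "stability-based method" mentioned on the slide) is not covered by this sketch.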

(19)

Results – Peer evaluation

Cluster | Class I | Class II | Class III | Class IV | Class V
1 | 3 | 4 | 1 | 0 | 0
2 | 0 | 0 | 0 | 3 | 0
3 | 0 | 0 | 1 | 0 | 3
4 | 1 | 0 | 2 | 1 | 0

[Slide annotations: Policy, Mathematical approaches, Webometrics]
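As a worked illustration of quantifying the correspondence, the Rand index can be computed directly from the cluster-by-class counts in the table above. This sketch assumes the plain (unadjusted) Rand index; that reading is at least consistent with the abstract-based value of 0.6257 reported later in the deck, which equals 107/171 for 19 papers. The value computed here is derived from the table and is not one stated on the slides.

```python
from math import comb


def rand_index(table):
    """Unadjusted Rand index from a contingency table of counts."""
    n = sum(sum(row) for row in table)
    total_pairs = comb(n, 2)
    # Pairs placed together in both partitions.
    same_both = sum(comb(cell, 2) for row in table for cell in row)
    # Pairs placed together by the clustering (rows) and by the experts (columns).
    same_cluster = sum(comb(sum(row), 2) for row in table)
    same_class = sum(comb(sum(col), 2) for col in zip(*table))
    # Pairs separated in both partitions, by inclusion-exclusion.
    diff_both = total_pairs - same_cluster - same_class + same_both
    return (same_both + diff_both) / total_pairs


# Rows: clusters 1-4; columns: expert classes I-V (counts from the table above).
full_text_table = [
    [3, 4, 1, 0, 0],
    [0, 0, 0, 3, 0],
    [0, 0, 1, 0, 3],
    [1, 0, 2, 1, 0],
]
ri = rand_index(full_text_table)  # 133 agreeing pairs out of 171, about 0.778
```

The significance test mentioned in the deck (a p-value for the index) would need a null model, for example random relabelling, which is outside this sketch.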

(21)

Results – Reference age

[Figure: histograms of reference age, aggregated by expert class]

(23)

Results – Ref Age vs. % Serial

(25)

Results – Term extraction

• Calculation of salient keywords for each article
• Using the TF-IDF weighting scheme
• Normalized to unit (Euclidean) norm to account for document length
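A minimal sketch of this keyword ranking, under stated assumptions: the slides do not give the exact TF-IDF variant, so raw term counts multiplied by log(N / document frequency) are assumed here, and "norm 1" is read as Euclidean length 1. The function names and toy token lists are illustrative.

```python
import math
from collections import Counter


def tfidf_unit_vectors(token_lists):
    """Return one {term: weight} dict per document, scaled to Euclidean norm 1."""
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Assumed variant: raw term frequency times log inverse document frequency.
        weights = {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return vectors


def top_keywords(vector, k=10):
    """The k highest-weighted terms of one document, best first."""
    return sorted(vector.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy token lists, invented for illustration.
toy_docs = [
    ["co_author", "co_author", "collabor", "citat"],
    ["synchronous", "citat", "age"],
]
toy_vectors = tfidf_unit_vectors(toy_docs)
toy_top = top_keywords(toy_vectors[0], k=2)
```

Terms that occur in every document get weight zero under this variant, which is why a shared term such as `citat` in the toy data never ranks as a keyword.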

(26)

Author(s): Persson et al.
Inflationary bibliometric values: the role of scientific collaboration and the need for relative indicators in evaluative studies

Term | Weight
co_author | 0.417794
collabor* | 0.287652
domest* | 0.208460
self_citat* | 0.185298
explan* | 0.170916
Growth | 0.154099
reference_list | 0.151925
intern*_collabor* | 0.151925
reference_behaviour | 0.151468
inflationari | 0.151468

Author(s): Glänzel
Towards a model for diachronous and synchronous citation analyses

Term | Weight
diachronous_prospect | 0.492265
synchronous | 0.377403
synchronous_retrospect | 0.360994
age | 0.250921
diachronous_prospect | 0.238375
technic*_reliabl* | 0.180497
citat*_process | 0.150553
life_time | 0.147679
impact_measur* | 0.125460
random_select* | 0.114862

Author(s): Moed and Garfield
In basic science the percentage of 'authoritative' references decreases as bibliographies become shorter

Term | Weight
research_field | 0.358836
authorit*_docum* | 0.281942
authorit* | 0.241017
docum* | 0.197558
referenc* | 0.179418
percent_most | 0.179418
refer*_list | 0.176746
refer* | 0.165171
frequent*_cite | 0.156779
persuasion | 0.153787

Author(s): Shelton and Holdridge
The US-EU race for leadership of science and technology: Qualitative and quantitative indicators

Term | Weight
EU | 0.638957
WTEC | 0.346503
panel | 0.224208
output_indic* | 0.142678
NAS | 0.142678
leadership | 0.142678
world | 0.119689
input | 0.114998
row | 0.102220
panelist | 0.101913

Author(s): Tang and Thelwall
Class: IV

Term | Weight
department | 0.420497
intern*_inlink | 0.315920
gTLD | 0.273798
public_impact | 0.189552
disciplin* | 0.148494
psychologi | 0.145234
command | 0.145234
region | 0.135706
histori | 0.123676
disciplinari_differ* | 0.105307

(27)

Results – Full-text vs Abstract

• Is a full-text analysis warranted
   - for term extraction?
   - for mapping purposes?

(28)

Results – Full-text vs Abstract

Mapping based on abstracts only:

• Less structure
• Less overlap with expert classes: Rand index = 0.6257, p-value = 0.464 (not significant)

Full-text is an interesting source for additional keywords and improved mapping

(29)

Conclusion

• Keyword approach may be naïve
• But applied in a systematic framework, in combination with the ‘right’ algorithms, it provides interesting clues
• Complementary to bibliometric approaches
• Weak indications of the benefits of using full-text articles
• Future: extension of this pilot to larger samples

(30)

References

• Bibliometrics; homepage Wolfgang Glänzel: http://www.steunpuntoos.be/wg.html
• Bibliometrics; homepage Olle Persson: http://www.umu.se/inforsk/Staff/olle.htm
• Text & Data mining; PhD thesis Patrick Glenisson: ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf
• Optimal k in clustering; Stability method