Combining Full-text Analysis & Bibliometric Indicators
A pilot study
Patrick Glenisson, Wolfgang Glänzel, Olle Persson
Introduction
Goal: mapping of scientific processes
Map of scientific papers
Characterization of emerging clusters
Extraction of new search keys
Using bibliometric as well as lexical indicators of ‘relatedness’
Full-text analysis
Overview
Data sources and Questions asked
Text mining Ingredients
Text-based relational analysis of documents
Contrasts with bibliometric analysis
Term extraction from full-text
Conclusion
Data source
19 full-text papers from:
Scientometrics, Vol 60, Issue 3 (2004)
special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China)
Validation setup
Manual assignment of the papers to various classes
Data source
Section code | Section name | Papers
I | Advances in Scientometrics | Havemann et al. (2004); Moed and Garfield (2004); Small (2004); Yue and Wilson (2004)
II | Policy relevant issues | Negishi et al. (2004); Shelton and Holdrige (2004); Markusova et al. (2004); Wu et al. (2004)
III | Bibliometric approaches to collaboration in science | Beaver (2004); Kretschmer (2004); Persson et al. (2004); Yoshikane and Kageura (2004)
IV | Advances in Informetrics and Webometrics | Lamirel et al. (2004); Qiu and Chen (2004); Tang and Thelwall (2004); Vaughan and Wu (2004)
V | Mathematical models in Informetrics and Scientometrics | Egghe (2004); Glänzel (2004); Shan et al. (2004)
Research questions
Comparison of text-based mapping vs. expert classification
Extracted keywords
Comparison with bibliometric mapping
Methodology
Given a set of documents,
compute a representation, called an index,
to retrieve, summarize, classify or cluster them
<1 0 0 1 0 1>
<1 1 0 0 0 1>
<0 0 0 1 1 0>
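As a minimal sketch of what such an index can look like, the snippet below builds one binary term vector per document over a shared vocabulary. The three toy documents and the function name `build_binary_index` are illustrative assumptions, not data or code from the study.

```python
def build_binary_index(documents):
    """Return (vocabulary, vectors) with one 0/1 vector per document."""
    # Split each document into a set of lower-cased tokens.
    token_sets = [set(doc.lower().split()) for doc in documents]
    # The vocabulary is the sorted union of all tokens.
    vocabulary = sorted(set().union(*token_sets))
    # A 1 marks that the term occurs in the document, a 0 that it does not.
    vectors = [[1 if term in tokens else 0 for term in vocabulary]
               for tokens in token_sets]
    return vocabulary, vectors


# Hypothetical toy documents, only for illustration.
docs = ["citation analysis model", "citation web links", "collaboration model"]
vocab, vectors = build_binary_index(docs)
print(vocab)
for vec in vectors:
    print(vec)
```

Each row plays the role of one of the `<1 0 0 1 0 1>` vectors above; weighted variants replace the 0/1 entries with term weights.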
Methodology
Document processing
Remove punctuation & grammatical structure (‘bag of words’)
Define a vocabulary
Identify multi-word terms, e.g., tumor suppressor (phrases)
Eliminate low-content words, e.g., and, thus, ... (stopwords)
Map words with the same meaning (synonyms)
Strip plurals, conjugations, ... (stemming)
Define weighting scheme and/or transformations (TF-IDF, SVD, ...)
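The processing steps above can be sketched roughly as follows. This is not the pipeline used in the study: the stopword, phrase and synonym lists are tiny hypothetical examples, and `crude_stem` is a deliberately simple stand-in for a real stemmer.

```python
import re

# All three resources below are assumed, illustrative lists.
STOPWORDS = {"and", "thus", "the", "of", "a", "in", "is", "are"}
PHRASES = {("tumor", "suppressor")}   # multi-word terms kept as one unit
SYNONYMS = {"tumour": "tumor"}        # spelling variants mapped to one form


def crude_stem(word):
    """Strip a few common English suffixes (stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Turn raw text into a list of index terms (bag of words)."""
    tokens = re.findall(r"[a-z]+", text.lower())   # drop punctuation and case
    tokens = [SYNONYMS.get(t, t) for t in tokens]  # map synonyms
    terms, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:                        # keep phrase as one term
            terms.append("_".join(pair))
            i += 2
        elif tokens[i] in STOPWORDS:               # skip low-content words
            i += 1
        else:
            terms.append(crude_stem(tokens[i]))    # strip plurals etc.
            i += 1
    return terms


print(preprocess("Tumour suppressor genes and mutations, thus:"))
```

Joining phrase parts with an underscore mirrors the way bigrams such as `co_author` appear in the extracted keyword lists.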
Methodology
Compute index of textual resources:
[Figure: term space spanned by the axes T1, T2, T3]
Results – Term statistics
19 papers
3610 retained terms (including ~400 bigrams)
Distance Matrix (19x19)
Apply MDS
Apply Clustering
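The slides do not say which distance measure fills the 19x19 matrix. As a sketch under the assumption of cosine distance between document term vectors, the matrix that MDS and clustering take as input could be computed like this (function names and toy vectors are illustrative):

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Symmetric matrix of pairwise cosine distances."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


# Toy term vectors for three hypothetical documents.
toy_vectors = [[1, 0], [0, 1], [1, 1]]
for row in distance_matrix(toy_vectors):
    print([round(d, 3) for d in row])
```

For the 19 papers, `vectors` would hold 19 rows of 3610 term weights each, giving the 19x19 matrix.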
Results – MDS
[Figure: MDS map of the papers; labelled groups: Policy, Mathematical approaches]
Results – Clustering
• Hierarchical clustering (Ward's method), cut-off k = 4
• Optimal parameters? ‘Stability-based method’
• Quantified correspondence with expert assignments? ‘Rand index’
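To illustrate the clustering step, here is a small sketch of agglomerative clustering with Ward's criterion, stopped once k clusters remain. It assumes the documents are given as Euclidean coordinate vectors, whereas the study started from a distance matrix; `ward_clusters` and the toy points are hypothetical.

```python
def ward_clusters(points, k):
    """Merge clusters greedily by Ward's criterion until k clusters remain."""
    clusters = {i: [i] for i in range(len(points))}        # id -> member indices
    centroids = {i: list(p) for i, p in enumerate(points)}

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        na, nb = len(clusters[a]), len(clusters[b])
        sq_dist = sum((x - y) ** 2 for x, y in zip(centroids[a], centroids[b]))
        return na * nb / (na + nb) * sq_dist

    while len(clusters) > k:
        pairs = ((a, b) for a in clusters for b in clusters if a < b)
        a, b = min(pairs, key=lambda ab: merge_cost(*ab))
        na, nb = len(clusters[a]), len(clusters[b])
        # The merged centroid is the size-weighted mean of both centroids.
        centroids[a] = [(na * x + nb * y) / (na + nb)
                        for x, y in zip(centroids[a], centroids[b])]
        clusters[a].extend(clusters.pop(b))
        del centroids[b]
    return sorted(sorted(members) for members in clusters.values())


# Toy 2-D points: two tight pairs and one outlier.
toy_points = [[0, 0], [0, 1], [10, 10], [10, 11], [5, 50]]
print(ward_clusters(toy_points, 3))
```

In the setting of the slides, k = 4 would be the cut-off; how well that choice holds up is what the stability-based method is meant to assess.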
Results – Peer evaluation
Cluster | Class I | Class II | Class III | Class IV | Class V
1 | 3 | 4 | 1 | 0 | 0
2 | 0 | 0 | 0 | 3 | 0
3 | 0 | 0 | 1 | 0 | 3
4 | 1 | 0 | 2 | 1 | 0
Annotations: Policy, Mathematical approaches, Webometrics
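One way to quantify the agreement between clusters and expert classes is the Rand index: the share of paper pairs that both partitions treat the same way (together in both, or apart in both). The sketch below computes it from a cluster-by-class table such as the one above. The function name is an assumption, and the value it returns for this table is not a figure reported on the slides.

```python
from math import comb


def rand_index(table):
    """Rand index from a contingency table (rows: clusters, columns: classes)."""
    n = sum(sum(row) for row in table)
    total_pairs = comb(n, 2)
    # Pairs placed together by both partitions.
    together_both = sum(comb(cell, 2) for row in table for cell in row)
    # Pairs placed together by the clustering / by the expert classes.
    together_cluster = sum(comb(sum(row), 2) for row in table)
    together_class = sum(comb(sum(col), 2) for col in zip(*table))
    # Agreements = pairs together in both + pairs apart in both.
    agreements = total_pairs + 2 * together_both - together_cluster - together_class
    return agreements / total_pairs


# Cluster-by-class counts as given in the peer-evaluation table.
peer_table = [
    [3, 4, 1, 0, 0],
    [0, 0, 0, 3, 0],
    [0, 0, 1, 0, 3],
    [1, 0, 2, 1, 0],
]
print(round(rand_index(peer_table), 4))
```

A value of 1 means the two partitions are identical. Judging significance requires a reference distribution, for example from random relabelling, which this sketch does not cover.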
Results – Reference age
Histograms aggregated by expert class
Results – Reference age vs. % serials
Results – Term extraction
Calculation of salient keywords for each article
Using the TF-IDF weighting scheme
Each document vector normalized to unit norm to account for document length
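A rough sketch of this weighting follows. The slides do not specify the TF-IDF variant; the version below assumes raw term frequency times ln(N/df), followed by scaling each document vector to Euclidean norm 1. Function names and toy documents are illustrative.

```python
import math
from collections import Counter


def tfidf_unit_vectors(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: tf * math.log(n_docs / doc_freq[t])
                   for t, tf in counts.items()}
        # Scale to unit length so long documents do not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm else 0.0)
                        for t, w in weights.items()})
    return vectors


def top_keywords(vector, n=10):
    """The n highest-weighted terms of one document."""
    return sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical, already pre-processed toy documents.
toy_docs = [
    ["co_author", "co_author", "citat", "collabor"],
    ["citat", "age", "synchronous"],
    ["citat", "web", "link"],
]
toy_vecs = tfidf_unit_vectors(toy_docs)
print(top_keywords(toy_vecs[0], 2))
```

A term that occurs in every document (here `citat`) gets weight 0, which is why generic vocabulary drops out of the keyword lists.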
Author(s): Persson et al.
Inflationary bibliometric values: the role of scientific collaboration and the need for relative indicators in evaluative studies
co_author 0.417794
collabor* 0.287652
domest* 0.208460
self_citat* 0.185298
explan* 0.170916
Growth 0.154099
reference_list 0.151925
intern*_collabor* 0.151925
reference_behaviour 0.151468
inflationari 0.151468
Author(s): Glänzel
Towards a model for diachronous and synchronous citation analyses
diachronous_prospect 0.492265
synchronous 0.377403
synchronous_retrospect 0.360994
age 0.250921
diachronous_prospect 0.238375
technic*_reliabl* 0.180497
citat*_process 0.150553
life_time 0.147679
impact_measur* 0.125460
random_select* 0.114862
Author(s): Moed and Garfield
In basic science the percentage of 'authoritative' references decreases as bibliographies become shorter
research_field 0.358836
authorit*_docum* 0.281942
authorit* 0.241017
docum* 0.197558
referenc* 0.179418
percent_most 0.179418
refer*_list 0.176746
refer* 0.165171
frequent*_cite 0.156779
persuasion 0.153787
Author(s): Shelton and Holdrige
The US-EU race for leadership of science and technology, Qualitative and quantitative indicators
EU 0.638957
WTEC 0.346503
panel 0.224208
output_indic* 0.142678
NAS 0.142678
leadership 0.142678
world 0.119689
input 0.114998
row 0.102220
panelist 0.101913
Author(s): Tang and Thelwall
Class: IV
department 0.420497
intern*_inlink 0.315920
gTLD 0.273798
public_impact 0.189552
disciplin* 0.148494
psychologi 0.145234
command 0.145234
region 0.135706
histori 0.123676
disciplinari_differ* 0.105307
Results – Full-text vs Abstract
Is a full-text analysis warranted
for term extraction?
for mapping purposes?
Results – Full-text vs Abstract
Abstract-based mapping shows:
Less structure
Less overlap with expert classes: Rand index = 0.6257, p-value = 0.464 (not significant)
Full-text is an interesting source for additional keywords and improved mapping
Conclusion
The keyword approach may be naïve
But applied in a systematic framework, in combination with the ‘right’ algorithms, it provides interesting clues
Complementary to bibliometric approaches
Weak indications of benefits from using full-text articles
Future: extension of this pilot to larger samples
References
• Bibliometrics; homepage Wolfgang Glänzel: http://www.steunpuntoos.be/wg.html
• Bibliometrics; homepage Olle Persson: http://www.umu.se/inforsk/Staff/olle.htm
• Text & Data mining; PhD thesis Patrick Glenisson: ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf
• Optimal k in clustering; Stability method