Combining Full-text analysis & Bibliometric Indicators

a pilot study

Patrick Glenisson¹, Wolfgang Glänzel¹,², Olle Persson³


Introduction

Goal: mapping of scientific processes

Map of scientific papers

Characterization of emerging clusters

Extraction of new search keys

Using bibliometric as well as lexical indicators of ‘relatedness’

Full-text analysis


Overview

Data sources and Questions asked

Text mining Ingredients

Text-based relational analysis of documents

Contrasts with bibliometric analysis

Term extraction from full-text

Conclusion


Data source

19 full-text papers from:

Scientometrics, Vol 60, Issue 3 (2004)

Special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China)

Validation setup

Manual assignment to various classes

Section code  Section name                                             Papers
I             Advances in Scientometrics                               Havemann et al. (2004); Moed and Garfield (2004); Small (2004); Yue and Wilson (2004)
II            Policy relevant issues                                   Negishi et al. (2004); Shelton and Holdrige (2004); Markusova et al. (2004); Wu et al. (2004)
III           Bibliometric approaches to collaboration in science      Beaver (2004); Kretschmer (2004); Persson et al. (2004); Yoshikane and Kageura (2004)
IV            Advances in Informetrics and Webometrics                 Lamirel et al. (2004); Qiu and Chen (2004); Tang and Thelwall (2004); Vaughan and Wu (2004)
V             Mathematical models in Informetrics and Scientometrics   Egghe (2004); Glänzel (2004); Shan et al. (2004)


Research questions

Comparison of text-based mapping vs. expert classification

Extracted keywords

Comparison with bibliometric mapping


Methodology

Given a set of documents, compute a representation, called an index, to retrieve, summarize, classify or cluster them:

<1 0 0 1 0 1>

<1 1 0 0 0 1>

<0 0 0 1 1 0>
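A minimal sketch of such an index, assuming the simplest possible representation: one binary vector per document that marks which vocabulary terms occur in it. The vocabulary and the three documents are invented for illustration and are not taken from the study.

```python
def binary_index(documents, vocabulary):
    """Return one 0/1 vector per document: 1 if the vocabulary term occurs in it."""
    index = []
    for text in documents:
        words = set(text.lower().split())
        index.append([1 if term in words else 0 for term in vocabulary])
    return index

# Hypothetical vocabulary and documents, chosen only for illustration.
vocabulary = ["citation", "web", "model", "author", "policy", "journal"]
documents = [
    "citation counts per author and journal",
    "web links and citation links of a journal",
    "author views on science policy",
]

index = binary_index(documents, vocabulary)
for vector in index:
    print("<" + " ".join(str(v) for v in vector) + ">")
```

Each printed row is one document. Replacing the 0/1 entries by term weights gives the weighted index used for mapping and clustering.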

Document processing

Remove punctuation & grammatical structure (‘Bag of words’)

Define a vocabulary

Identify multi-word terms (e.g., tumor suppressor) (phrases)

Eliminate low-content words (e.g., and, thus, ...) (stopwords)

Map words with the same meaning (synonyms)

Strip plurals, conjugations, ... (stemming)

Define weighting scheme and/or transformations (TF-IDF, SVD, ...)
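The processing steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the pipeline used in the study: the stopword list, synonym map, phrase list and the crude `stem` function are all hypothetical stand-ins for real linguistic resources, and the weighting step is left out.

```python
import re

# Tiny illustrative resources; a real study would use much larger lists.
STOPWORDS = {"and", "thus", "the", "of", "a", "in", "is", "are", "to"}
SYNONYMS = {"tumour": "tumor"}          # map spelling variants to one form
PHRASES = {("tumor", "suppressor")}     # known multi-word terms

def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Turn raw text into a bag of index terms (the order of steps is one possible choice)."""
    # 1. Remove punctuation and case: keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Map synonyms and spelling variants to a single form.
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    # 3. Join known multi-word terms into single tokens.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    # 4. Drop stopwords, then stem what remains (phrases are left untouched).
    return [t if "_" in t else stem(t) for t in merged if t not in STOPWORDS]

terms = preprocess("The tumour suppressor genes are thus silenced, and tumors grow.")
print(terms)
```

The truncated stems this produces are ordinary stemmer output, much like forms such as `inflationari` in the keyword lists.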

Compute index of textual resources:

[Figure: vector space with term axes T1, T2 and T3]


Results – Term statistics

19 papers

3610 retained terms (including ~400 bigrams)

Distance Matrix (19x19)

Apply MDS

Apply Clustering
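The slides do not say which distance was used to build the 19x19 matrix. As a hedged illustration, the sketch below assumes cosine distance between term-weight vectors, a common choice for text; the three toy vectors are invented.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: about 0 for identical directions, 1 for no shared terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for empty vectors
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(vectors):
    """Symmetric matrix of pairwise distances between all documents."""
    n = len(vectors)
    return [[cosine_distance(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Three toy vectors (invented); the study would have 19 vectors of 3610 weights each.
vectors = [
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
D = distance_matrix(vectors)
for row in D:
    print(" ".join(f"{d:.3f}" for d in row))
```

A matrix like `D` is the input for both MDS and hierarchical clustering.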

Results – MDS

[Figure: MDS map of the 19 papers, with groups labelled "Policy" and "Mathematical approaches"]


Results – Clustering

• Hierarchical clustering (Ward's method), cut-off k = 4

• Optimal parameters? ‘Stability-based method’

• Quantified correspondence with expert assignments? ‘Rand index’
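A minimal sketch of Ward's agglomerative clustering with a cut-off at k clusters. It assumes documents are given as Euclidean coordinates (for instance positions on an MDS map), which the slides do not confirm; the points are invented.

```python
def centroid(points):
    """Mean position of a list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def ward_clustering(points, k):
    """Start from singletons and merge the cheapest pair until k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # clusters hold point indices
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost([points[p] for p in clusters[i]],
                                 [points[p] for p in clusters[j]])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Invented 2-D coordinates, not the study's data.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9),
          (9.0, 0.0), (9.1, 0.3), (0.0, 9.0)]
clusters = ward_clustering(points, k=4)
print(sorted(sorted(c) for c in clusters))
```

The brute-force pair search is fine for 19 documents. Larger collections would call for a library implementation.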


Results – Peer evaluation

Number of papers per cluster (rows) and expert class (columns):

Cluster   I   II   III   IV   V
1         3   4    1     0    0
2         0   0    0     3    0
3         0   0    1     0    3
4         1   0    2     1    0

Annotated groups: Policy, Mathematical approaches, Webometrics
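The Rand index can be computed directly from the cluster-by-class table above by counting pairs of papers. The sketch below uses the standard pair-counting definition, which is assumed to match what the authors computed.

```python
from math import comb

def rand_index(table):
    """Rand index from a cluster-by-class contingency table of counts."""
    n = sum(sum(row) for row in table)
    total_pairs = comb(n, 2)
    same_both = sum(comb(cell, 2) for row in table for cell in row)
    same_cluster = sum(comb(sum(row), 2) for row in table)
    same_class = sum(comb(sum(col), 2) for col in zip(*table))
    # Pairs that disagree: together in one partition but apart in the other.
    disagreements = (same_cluster - same_both) + (same_class - same_both)
    return (total_pairs - disagreements) / total_pairs

# Counts copied from the table above (rows: clusters 1-4, columns: classes I-V).
table = [
    [3, 4, 1, 0, 0],
    [0, 0, 0, 3, 0],
    [0, 0, 1, 0, 3],
    [1, 0, 2, 1, 0],
]
print(round(rand_index(table), 4))  # 133 of 171 pairs agree
```

For these counts the function returns 133/171, about 0.7778. That value is recomputed from the table here and is not quoted in the slides.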


Results – Reference age

Histograms aggregated by expert class


Results – Reference age vs. % serials


Results – Term extraction

Calculation of seminal keywords for each article

Using TF-IDF weighting scheme

Normalized to unit norm to account for document length
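A minimal sketch of keyword extraction by TF-IDF with unit-norm scaling. The slides do not give the exact formula, so one common variant is assumed here (raw term frequency times log(N / document frequency)); the token lists are invented.

```python
import math
from collections import Counter

def tfidf_unit_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document, L2 norm 1."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        # Assumed variant: raw term frequency times log(N / document frequency).
        weights = {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

def top_keywords(weights, k=10):
    """The k highest-weighted terms of one document."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]

# Invented, already preprocessed token lists.
docs = [
    ["co_author", "collabor", "co_author", "citat", "indic"],
    ["synchronous", "citat", "age", "synchronous", "model"],
    ["eu", "panel", "indic", "eu", "eu"],
]
vectors = tfidf_unit_vectors(docs)
for term, weight in top_keywords(vectors[0], k=3):
    print(f"{term} {weight:.6f}")
```

Because every vector has length 1, weights are comparable across papers of different length, which is why the per-paper keyword lists all lie between 0 and 1.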

Author(s): Persson et al.
Inflationary bibliometric values: the role of scientific collaboration and the need for relative indicators in evaluative studies

co_author 0.417794
collabor* 0.287652
domest* 0.208460
self_citat* 0.185298
explan* 0.170916
Growth 0.154099
reference_list 0.151925
intern*_collabor* 0.151925
reference_behaviour 0.151468
inflationari 0.151468

Author(s): Glänzel
Towards a model for diachronous and synchronous citation analyses

diachronous_prospect 0.492265
synchronous 0.377403
synchronous_retrospect 0.360994
age 0.250921
diachronous_prospect 0.238375
technic*_reliabl* 0.180497
citat*_process 0.150553
life_time 0.147679
impact_measur* 0.125460
random_select* 0.114862

Author(s): Moed and Garfield
In basic science the percentage of 'authoritative' references decreases as bibliographies become shorter

research_field 0.358836
authorit*_docum* 0.281942
authorit* 0.241017
docum* 0.197558
referenc* 0.179418
percent_most 0.179418
refer*_list 0.176746
refer* 0.165171
frequent*_cite 0.156779
persuasion 0.153787

Author(s): Shelton and Holdrige
The US-EU race for leadership of science and technology: Qualitative and quantitative indicators

EU 0.638957
WTEC 0.346503
panel 0.224208
output_indic* 0.142678
NAS 0.142678
leadership 0.142678
world 0.119689
input 0.114998
row 0.102220
panelist 0.101913

Author(s): Tang and Thelwall
Class: IV

department 0.420497
intern*_inlink 0.315920
gTLD 0.273798
public_impact 0.189552
disciplin* 0.148494
psychologi 0.145234
command 0.145234
region 0.135706
histori 0.123676
disciplinari_differ* 0.105307


Results – Full-text vs Abstract

Is a full-text analysis warranted

for term extraction?

for mapping purposes?

With abstracts only:

Less structure

Less overlap with expert classes: Rand index = 0.6257, p-value = 0.464 (not significant)

Full-text is an interesting source for additional keywords and improved mapping
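The slides do not say how the p-value for the Rand index was obtained. One common option, assumed here purely for illustration, is a permutation test: shuffle the cluster labels many times and count how often a random assignment agrees with the expert classes at least as well as the observed one. The labels below are invented.

```python
import random
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of item pairs on which two partitions agree (together in both or apart in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]) for i, j in pairs
    )
    return agree / len(pairs)

def permutation_p_value(clusters, classes, n_permutations=2000, seed=0):
    """Fraction of random relabelings whose Rand index is at least the observed one."""
    rng = random.Random(seed)
    observed = rand_index(clusters, classes)
    shuffled = list(clusters)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if rand_index(shuffled, classes) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Invented labels for 12 items; not the study's data.
classes = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
clusters = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]  # perfect agreement
p_value = permutation_p_value(clusters, classes)
print(rand_index(clusters, classes), p_value)
```

A large p-value, such as the 0.464 reported for abstracts, means that random assignments often match the expert classes just as well.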


Conclusion

Keyword approach may be naïve

But applied in a systematic framework in combination with the ‘right’ algorithms, it provides interesting clues

Complementary to bibliometric approaches

Weak indications towards benefits of using full-text articles

Future: extension of this pilot to larger samples


References

• Bibliometrics; homepage Wolfgang Glänzel: http://www.steunpuntoos.be/wg.html

• Bibliometrics; homepage Olle Persson: http://www.umu.se/inforsk/Staff/olle.htm

• Text & Data mining; PhD thesis Patrick Glenisson: ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf

• Optimal k in clustering; Stability method
