What's in a word ?

(1)


Term-based approaches across bioinformatics, scientometrics and knowledge management

Patrick Glenisson

Bio-informatics group, Dept. Electrical Engineering, K.U.Leuven, Belgium

Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium

(2)

2 Introduction

(3)

3 Introduction: K.U. Leuven

Faculty of Applied Sciences

Department of Electrical Engineering

Bio-informatics research: clinical bioinformatics, gene regulation bioinformatics

Research on algorithms and software development for:

• Text mining

• Gibbs sampling

• Graphical models

• Classification & clustering

(4)

4 Introduction: K.U. Leuven

Faculty of Applied Sciences

Department of Electrical Engineering

Bio-informatics research

Text mining research

Combine statistical approaches with domain-specific requirements

Knowledge discovery through literature analysis in various domains:

Bio-informatics

Sciento- & Technometrics

Knowledge management

(5)

5 Overview

• Bio-informatics:

– gene profiling

– multi-view learning

• Scientific trend mapping

– clustering and bibliometric indicators

• Innovation & Spillovers

– Tracing of persons in science & technology spaces

25’

5-10’

(6)

6 Overview

[Diagram: positioning of text mining approaches]

• Text mining goals: Information Retrieval, Information Extraction

• Text mining methodology: Shallow statistics, Shallow parsing, Full NLP parsing

• Overall approach: Generic, Domain-specific, Problem-specific

• Document analysis & extraction of tokens

(7)

7 Case 1:

Literature & biological data

(8)

8

(9)

9 protein

(10)

10 ‘Post-genome’ biology

• Focus shift:

– from single gene to gene groups

– complex interactions within cellular environment

• Microarrays measure the simultaneous activity of many genes

[Figure: gene expression measurement matrix with genes G1, G2, G3, ... and conditions C1, C2, C3, ...; sample annotations along the conditions, gene annotations along the genes]

(11)

11 Clustering Interpretation

[Figure: expression data matrix (genes × conditions)]

(12)

12 [Figure: expression data matrix (genes × conditions)]

Gene expression

Databases: annotations and relations encoded as free text (PRIOR INFORMATION)

Integrated analysis

(13)

13 Hence, 2 views:

• Text analysis for interpretation (supportive role)

• Text analytics for ‘inference’ (active role)

(14)

14 A ‘historical’ quote:

‘Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’

(M. Gerstein, 2001)

GeneRIF:

• 12133521: VEGF is associated with the development and prognosis of colorectal cancer.

• 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.

• 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex

GO:

• cell proliferation

• heparin binding

• growth factor activity

(15)

15 Increased awareness

• Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.

• Structured vocabularies are on the rise:

– GO

– MeSH

– eVOC

• Standards are systematically being adopted to store biological concepts or annotations:

– HUGO for gene names

– GOA

– …

(16)

16 (GOF) Vector space model

• Document processing

– Remove punctuation & grammatical structure (‘bag of words’)

– Define a vocabulary

• Identify multi-word terms (e.g., tumor suppressor) (phrases)

• Eliminate low-content words (e.g., and, gene, ...) (stopwords)

• Map words with the same meaning (synonyms)

• Strip plurals, conjugations, ... (stemming)

– Define a weighting scheme and/or transformations (tf-idf, SVD, ...)

• Index

[Figure: gene index as a vector in the vocabulary space spanned by terms T1, T2, T3]
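The processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the stopword list, the phrase table, the plural-stripping stemmer and all function names are hypothetical, and synonym mapping is left out.

```python
import math
import re
from collections import Counter

# Hypothetical resources, for illustration only.
STOPWORDS = {"and", "the", "of", "in", "is", "by", "with", "gene"}
PHRASES = {("tumor", "suppressor"): "tumor_suppressor"}


def stem(word):
    # Toy stemmer: strips a plural "s" only.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text):
    """Bag of words: drop punctuation, merge phrases, stem, remove stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    merged, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    stemmed = [stem(w) for w in merged]
    return [w for w in stemmed if w not in STOPWORDS]


def tfidf_index(docs):
    """Return (vocabulary, {doc_id: unit-length tf-idf vector})."""
    counts = {d: Counter(tokenize(t)) for d, t in docs.items()}
    vocab = sorted({w for c in counts.values() for w in c})
    n = len(docs)
    df = {w: sum(1 for c in counts.values() if w in c) for w in vocab}
    index = {}
    for d, c in counts.items():
        vec = [c[w] * math.log(n / df[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        index[d] = [x / norm for x in vec]
    return vocab, index


def gene_profile(index, doc_ids):
    """Gene index entry: average of the document vectors linked to one gene."""
    dim = len(next(iter(index.values())))
    return [sum(index[d][k] for d in doc_ids) / len(doc_ids) for k in range(dim)]
```

Averaging the document vectors per gene is one simple way to build a gene-level entry; it is assumed here for the sketch.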

(17)

17 Validity of gene index

Genes that are functionally related should be close in text space:

• Modeled with respect to a background distribution

• through random and permuted gene groups

Text-based coherence score
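A coherence score of this kind can be sketched as follows. The score definition (mean pairwise cosine similarity of gene text profiles) and the use of random same-size gene groups as background are assumptions made for the example; the function names are hypothetical.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_pairwise_similarity(profiles, genes):
    """Coherence of a gene group: average cosine over all gene pairs."""
    if len(genes) < 2:
        raise ValueError("need at least two genes")
    pairs = [(g, h) for i, g in enumerate(genes) for h in genes[i + 1:]]
    return sum(cosine(profiles[g], profiles[h]) for g, h in pairs) / len(pairs)


def coherence_p_value(profiles, group, n_random=1000, seed=0):
    """Fraction of random same-size groups at least as coherent as `group`."""
    rng = random.Random(seed)
    observed = mean_pairwise_similarity(profiles, group)
    all_genes = sorted(profiles)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(all_genes, len(group))
        if mean_pairwise_similarity(profiles, sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_random + 1)
```

A small value suggests that the group is closer in text space than expected by chance.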

(18)

18 Validity of gene index

Genes that are functionally related

should be close in text space:

(19)

19 Validity of gene index

Genes that are functionally related

should be close in text space:

(20)

20 • Data-centered statistical scores

– Coherence vs separation of clusters

– Stability of a cluster solution when leaving out data

Define ‘optimal’? Optimal number of clusters?

[Figure: clusters C1, C2, C3]

Text-based scoring

(21)

21 • Data-centered statistical scores

• Knowledge-based scores

– Enrichment of GO annotations in clusters

– Literature-based scoring

Define ‘optimal’? Optimal number of clusters?
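Enrichment of an annotation in a cluster is commonly tested with a hypergeometric tail probability. The sketch below assumes that test; the slides do not state which statistic was used, and the function name is hypothetical.

```python
from math import comb


def enrichment_p_value(n_total, n_annotated, n_cluster, n_overlap):
    """P(X >= n_overlap) when n_cluster genes are drawn without replacement
    from n_total genes, of which n_annotated carry the GO term."""
    upper = min(n_annotated, n_cluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_total - n_annotated, n_cluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_cluster)
```

For example, if all 3 genes of a cluster carry a term held by only 3 of 10 genes, the probability is 1/120.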

(22)

22 Collaborative gene filtering

(23)

23 TXTGate

• a platform that offers multiple ‘views’ on vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications.

• incorporates term-based indices ...

• ... and uses them as a starting point

– to explore the text through the eyes of different domain vocabularies,

– to link out to other resources by query building, or

– to sub-cluster genes based on text.

(24)

24 Term-centric

Gene-centric

Domain vocabularies as ‘views’

(25)

25 Query building to external DB

(26)

26 Features of the approach

• Flexible tool for analyzing gene groups (~100 genes) thanks to various term- and gene-centric vocabularies

• … that allow some level of interoperability with external annotation databases

• Sub-clustering gene groups is useful to detect biological sub-patterns

• Reasonably robust to corrupted groups

• Gene index normalizes for unbalanced references

(27)

27 • Text analysis for interpretation (supportive role)

• Text analytics for ‘inference’ (active role)

(28)

28 Meta-clustering text & data

• As multiple information sources are available when analyzing gene expression data, we pose the question:

“How can we analyze data in an integrated fashion to extract more information than from the expression data alone?”

(29)

29 Mathematical integration

(30)

30 Integration of text & data

• In each information space:

– Appropriate preprocessing

– Choice of distance measures

(31)

31 • Combine data:

– a weighting parameter expresses the confidence attributed to either of the two data types

– in the case of distances, it can be seen as a scaling constant between the norms of the data and text representations.
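One way to read this combination, taken here as an assumption, is a convex combination of the expression and text distance matrices with a single confidence weight. The function and parameter names are hypothetical.

```python
def combine_distances(d_expr, d_text, weight=0.5):
    """Convex combination of two square distance matrices (lists of lists).

    `weight` is the confidence placed in the expression data;
    the text distances receive 1 - weight.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    n = len(d_expr)
    return [
        [weight * d_expr[i][j] + (1.0 - weight) * d_text[i][j] for j in range(n)]
        for i in range(n)
    ]
```

Because the two distance types may live on different scales, the weight also acts as a scaling constant between them.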

(32)

32 • However, the distributions of distances introduce a bias → scaling problem

• Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure)

[Figures: expression distance histogram; text distance histogram]
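A sketch of such a procedure, assuming Fisher's inverse chi-square method as the omnibus statistic (the slide does not name the method): each distance is first replaced by its empirical cumulative probability within its own distance distribution, which removes the scale difference, and the two probabilities are then combined. Function names are hypothetical.

```python
import math


def empirical_p_values(dist):
    """Replace each off-diagonal distance by the fraction of pairwise
    distances in the same (symmetric) matrix that are <= it."""
    n = len(dist)
    pairs = [dist[i][j] for i in range(n) for j in range(i + 1, n)]

    def p_value(x):
        return sum(1 for v in pairs if v <= x) / len(pairs)

    return [[0.0 if i == j else p_value(dist[i][j]) for j in range(n)]
            for i in range(n)]


def fisher_combined_distance(d_expr, d_text):
    """Combine two distance matrices on a common probability scale."""
    p_expr = empirical_p_values(d_expr)
    p_text = empirical_p_values(d_text)
    n = len(d_expr)
    combined = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q = p_expr[i][j] * p_text[i][j]
                # Chi-square survival function with 4 degrees of freedom,
                # evaluated at the Fisher statistic -2 * ln(q).
                combined[i][j] = q * (1.0 - math.log(q))
    return combined
```

The combined value is small only when a gene pair is close in both the expression and the text space.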

(33)

33 [Figure: M-score of clustering on expression data only versus M-score of integrated clustering, for various cutoffs k of the cluster tree]

Optimal k?

(34)

34 A peek inside

(35)

35 A peek inside

[Figure: expression profile next to text profile]

Strong reinforcement

(36)

36 Case 2:

Sciento- & technometrics

(37)

37 Mapping of Science

• Journal ‘Scientometrics’

• Full-text articles

• Document cluster analysis

• Co-word mapping

• Temporal dimension: clusters over time

(38)

38 Mapping of Science

• Coupling with bibliometric indicators:

– Based on reference (hyperlink) information

– Mean reference age

– Number of serials
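As a small worked example, the mean reference age of a paper can be computed as the average difference between its publication year and the years of its cited references. This definition and the function name are assumptions for illustration.

```python
def mean_reference_age(publication_year, cited_years):
    """Average age, in years, of the references cited by one paper."""
    if not cited_years:
        raise ValueError("paper has no dated references")
    ages = [publication_year - year for year in cited_years]
    return sum(ages) / len(ages)
```

For a 2004 paper citing works from 2000, 2002 and 1998, the ages are 4, 2 and 6 years, giving a mean of 4.0.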

(39)

39 Domain studies in Patent space

30 technology classes

‘Seed’ patent

Similarities

(40)

40 User profiling & Author-Inventor linkage

• Name resolution

– Same persons (variants, mistakes): Van Veldhoven; Veldhoven, Van; Vanveldhoven; Van Veldhoven

– Different persons (similar initials, or even full name): Wim Van Veldhoven; Walter Van Veldhoven; Wim Van Veldhoven; Wim Van Veldhoven
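The first half of the problem (spelling variants of one surname) can be sketched as a normalization step that maps variants to a common key. The rules and function names below are hypothetical and only cover the variants shown; telling apart different persons within one key still needs first names or content.

```python
import re
from collections import defaultdict


def surname_key(surname):
    """Map 'Van Veldhoven', 'Veldhoven, Van' and 'Vanveldhoven' to one key."""
    if "," in surname:
        # 'Veldhoven, Van' -> 'Van Veldhoven'
        main, prefix = (part.strip() for part in surname.split(",", 1))
        surname = f"{prefix} {main}"
    # Keeps plain a-z only; accented letters would need extra handling.
    return re.sub(r"[^a-z]", "", surname.lower())


def block_by_surname(records):
    """Group (first_name, surname) records by normalized surname."""
    blocks = defaultdict(set)
    for first_name, surname in records:
        blocks[surname_key(surname)].add(first_name.strip().lower())
    return dict(blocks)
```

A block that contains several first names (here Wim and Walter) flags candidates that are likely different persons.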

(41)

41 Content-based name matching

• Detect spillovers and entrepreneurial activities at (e.g.) university-level

• Matching of ‘inventors’ & ‘authors’ is time-consuming → semi-automated approach:

[Diagram: Patent DB and Publication DB linked through relevance ranking]
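A relevance ranking of this kind can be sketched with plain term-count vectors and cosine similarity: publications of a candidate author are ranked by how closely their text matches a patent. The weighting and the function names are assumptions for the example, not the system described in the talk.

```python
import math
from collections import Counter


def term_vector(text):
    """Term counts over whitespace-separated, lowercased tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(patent_text, publications):
    """publications: {pub_id: text}. Returns (pub_id, score), best first."""
    query = term_vector(patent_text)
    scored = [(pid, cosine(query, term_vector(text)))
              for pid, text in publications.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The top of the ranking is then checked by hand, which is what keeps the approach semi-automated.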

(42)

42 Acknowledgements

Steunpunt O&O Statistieken: Debackere K, Glänzel W

ESAT / BioI / Text Mining: Coessens B, Van Vooren S, Janssens F, Van Dromme D

ESAT / BioI: Moreau Y, De Moor B

(43)