What's in a word?
Term-based approaches across bioinformatics, scientometrics and knowledge management
Patrick Glenisson
Bio-informatics group
Dept Electrical Engineering K.U.Leuven, Belgium
Steunpunt O&O Statistieken Faculty of Economy
K.U.Leuven, Belgium
Introduction
Introduction: K.U. Leuven
Faculty of Applied Sciences
Department of Electrical Engineering
Bio-informatics research
• clinical bioinformatics
• gene regulation bioinformatics
Research on algorithms and software development for:
• Text mining
• Gibbs sampling
• Graphical models
• Classification & clustering
Text mining research
Combine statistical approaches with domain-specific requirements
Knowledge discovery through literature analysis in various domains:
• Bio-informatics
• Sciento- & Technometrics
• Knowledge management
Overview
• Bio-informatics:
– gene profiling
– multi-view learning
• Scientific trend mapping
– clustering and bibliometric indicators
• Innovation & Spillovers
– Tracing of persons in science & technology spaces
25’
5-10’
Overview
[Figure: overall approach to text mining, positioned by goals and methodology]
Text mining goals: Information Retrieval; Information Extraction
Text mining methodology: Shallow Statistics; Shallow Parsing; Full NLP parsing
Overall approach: Generic; Problem-specific; Domain-specific
Document analysis & Extraction of tokens
Case 1:
Literature & biological data
‘Post-genome’ biology
Focus shift:
- from single gene to gene groups
- complex interactions within cellular environment
Microarrays measure the simultaneous activity of many genes:
[Figure: gene expression measurement matrix indexed by G1, G2, G3, .. and C1, C2, C3, .., with sample annotations and gene annotations]
[Figure: expression data (gene × conditions matrix), clustering, and interpretation]
[Figure: expression data (gene × conditions matrix) combined with prior information in an integrated analysis. Labels: gene expression; databases; annotations and relations encoded as free text]
Hence, two views:
• Text analysis for interpretation (supportive role)
• Text analytics for ‘inference’ (active role)
A ‘historical’ quote:
‘Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’
(M. Gerstein, 2001)
GeneRIF
• 12133521: VEGF is associated with the development and prognosis of colorectal cancer.
• 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.
• 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex
GO
• cell proliferation
• heparin binding
• growth factor activity
Increased awareness
• Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.
• Structured vocabularies are on the rise
– GO
– MeSH
– eVOC
• Standards are systematically being adopted to store biological concepts or annotations:
– HUGO for gene names
– GOA
– …
(GOF) Vector space model
• Document processing
– Remove punctuation & grammatical structure (‘Bag of words’)
– Define a vocabulary
   • Identify multi-word terms (e.g., tumor suppressor) (phrases)
   • Eliminate low-content words (e.g., and, gene, ...) (stopwords)
   • Map words with the same meaning (synonyms)
   • Strip plurals, conjugations, ... (stemming)
– Define weighting scheme and/or transformations (tf-idf, SVD, ..)
• Index
[Figure: a gene represented as a vector in the space spanned by vocabulary terms T1, T2, T3]
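The processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the stopword list, the phrase table, the crude plural-stripping `stem` function, the toy gene texts and all function names are hypothetical, synonym mapping is left out, and the weighting shown is one plain tf-idf variant (term count times log of inverse document frequency).

```python
import math
import re
from collections import Counter

# Hypothetical resources, for illustration only.
STOPWORDS = {"and", "the", "of", "in", "is", "by", "with", "gene"}
PHRASES = {("tumor", "suppressor"): "tumor_suppressor"}


def stem(token):
    """Crude plural stripping; a stand-in for a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def tokenize(text):
    """Lowercase, drop punctuation, stem, merge phrases, remove stopwords."""
    words = [stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]
    terms, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:          # two-word phrase becomes one term
            terms.append(PHRASES[pair])
            i += 2
        else:
            terms.append(words[i])
            i += 1
    return [t for t in terms if t not in STOPWORDS]


def tfidf_index(docs):
    """Map each document id to a sparse vector {term: tf-idf weight}."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(counts)
    df = Counter(term for c in counts.values() for term in c)
    return {
        doc_id: {t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
        for doc_id, c in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy texts, one per gene (made up for this sketch).
docs = {
    "geneA": "Tumor suppressor genes and cell proliferation.",
    "geneB": "Growth factor activity; cell proliferation in tumors.",
    "geneC": "Heparin binding and growth factor activity.",
}
index = tfidf_index(docs)
sim_ab = cosine(index["geneA"], index["geneB"])  # share "cell", "proliferation"
sim_ac = cosine(index["geneA"], index["geneC"])  # share no terms
```

Each gene ends up as a sparse vector over the vocabulary, so closeness in text space reduces to a cosine between two dictionaries.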
Validity of gene index
Genes that are functionally related should be close in text space:
Text-based coherence score
Modeled with respect to a background distribution obtained through random and permuted gene groups
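One way to make this concrete is sketched below. The slides do not define the coherence score, so this sketch assumes it is the mean pairwise cosine similarity of the genes' term vectors, and it builds the background from randomly drawn gene groups of the same size only (the permuted variant is not shown). The toy `index`, the function names and the parameters `n_random` and `seed` are all hypothetical.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coherence(genes, index):
    """Assumed score: mean pairwise cosine similarity within the group."""
    pairs = list(combinations(sorted(genes), 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)


def empirical_p_value(group, index, n_random=1000, seed=0):
    """Share of random same-size groups at least as coherent as `group`."""
    rng = random.Random(seed)
    observed = coherence(group, index)
    all_genes = sorted(index)
    hits = sum(
        coherence(rng.sample(all_genes, len(group)), index) >= observed
        for _ in range(n_random)
    )
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_random + 1)


# Toy gene index with hand-made term weights (not real data).
index = {
    "g1": {"angiogenesis": 1.0, "growth": 0.5},
    "g2": {"angiogenesis": 0.8, "growth": 0.7},
    "g3": {"angiogenesis": 0.9, "cancer": 0.2},
    "g4": {"ribosome": 1.0},
    "g5": {"kinase": 1.0, "cancer": 0.3},
    "g6": {"membrane": 1.0},
}
related = ["g1", "g2", "g3"]
unrelated = ["g4", "g5", "g6"]
p_related = empirical_p_value(related, index)
p_unrelated = empirical_p_value(unrelated, index)
```

A functionally related group should score well above most random groups and so receive a small empirical p-value, while a group with no shared vocabulary should not.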
Data-centered statistical scores
• Coherence vs separation of clusters
• Stability of a cluster solution when leaving out data
Define ‘optimal’?
Optimal number of clusters?
[Figure: clusters C1, C2, C3]
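As an illustration of a score that trades coherence against separation, the sketch below computes the mean silhouette width for two candidate labelings of the same points. The slides do not say which score is used, so the silhouette is a swapped-in, commonly used choice, and the points, labelings and function name are made up for this sketch. It assumes at least two clusters and Python 3.8+ (for `math.dist`).

```python
import math


def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points, where a is the mean
    distance to the point's own cluster (coherence) and b is the mean
    distance to the nearest other cluster (separation)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster contributes 0 by convention
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Three well-separated toy groups of points (hypothetical data).
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
labels_k3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # one cluster per group
labels_k2 = [0, 0, 0, 0, 0, 0, 1, 1, 1]  # first two groups merged

score_k3 = mean_silhouette(points, labels_k3)
score_k2 = mean_silhouette(points, labels_k2)
```

Comparing such a score across candidate numbers of clusters is one data-centered way to pick a solution; here the three-cluster labeling scores higher than the merged two-cluster one. The stability criterion (re-clustering after leaving out data) is not covered by this sketch.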