Genes and Microarrays
PhD defense
Patrick Glenisson
Integrating Scientific Literature With Large Scale Gene Expression Analysis
Promoter: Prof. Bart De Moor
June 11th 2004
Overview
Genes & microarrays
Gene expression data analysis
Text mining in biology: principles
Text mining in practice: TXTGate
Combining text and gene expression data
Conclusion
Genes and Microarrays
DNA, genes, proteins and cells
Genes and Microarrays
Genes are expressed and regulated
Genes and Microarrays
Microarrays measure gene expression
[Figure: laser excitation of the array yields gene expression measurements, stored as a matrix of genes (G1, G2, G3, ...) by conditions (C1, C2, C3, ...) together with sample annotations and gene annotations]
Genes and Microarrays
Representing expression information
Gene expression experiments are complex:
Too verbose to include in a scientific publication
Too important to compromise on reproducibility
Too valuable for post-genome research to be scattered across various websites
Hence the need for a standard for reporting on microarray (MA) experiments
As a guideline for databases hosting expression compendia
Conditions in which expression occurs
Genes and Microarrays
MIAME standard
Minimum Information About a MicroArray Experiment
Internationally proposed standard
Published in December 2001 by the international MGED consortium
Some prominent journals (Nature, Lancet, EMBO, Cell) require MIAME-compliant submissions of data
Some hurdles:
Significant overhead in filling out the questionnaire
Scooping of leads (!)
Proprietary information about probe sequences
Gene expression data analysis
Questions asked with microarrays
Fundamental
Functional roles of genes (and transcriptional regulation)
Genetic network reconstruction
Clinical
Correlation of genes with a given disease
Diagnosis of disease stage with patients
Pharmacological
Toxicological drug response assessment
Gene expression data analysis
Clustering
[Figure: expression data (genes by conditions C1, C2, C3) are converted into a gene-by-gene distance matrix, which is then clustered]
Hierarchical clustering, k-means
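The pipeline on this slide (expression data, then a distance matrix, then clustering) can be sketched as follows. This is a minimal illustration, not the method used in the thesis: the Pearson-based distance, the average-linkage rule, the function names and the toy profiles are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """Distance between two expression profiles: 1 - Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def distance_matrix(profiles):
    """Gene-by-gene distance matrix, stored as a dict keyed by gene pairs."""
    genes = list(profiles)
    return {(g, h): pearson_distance(profiles[g], profiles[h])
            for g in genes for h in genes}


def average_linkage(dist, genes, k):
    """Naive agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[g] for g in genes]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            total = sum(dist[a, b] for a in clusters[i] for b in clusters[j])
            d = total / (len(clusters[i]) * len(clusters[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data: four genes measured in three conditions (C1, C2, C3).
profiles = {
    "G1": [1.0, 2.0, 3.0],
    "G2": [2.0, 4.1, 6.0],
    "G3": [3.0, 2.0, 1.0],
    "G4": [6.0, 4.2, 2.0],
}
dist = distance_matrix(profiles)
print(average_linkage(dist, list(profiles), k=2))
```

The number of clusters k is passed in by hand here, which is exactly the parameter the following slides try to choose in a principled way.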
Gene expression data analysis
Cluster validation
Optimal number of clusters? How to define 'optimal'?
Data-centered statistical scores
Coherence vs. separation of clusters
Stability of a cluster solution when leaving out data
Knowledge-based scores
Enrichment of GO annotations in clusters
Literature-based scoring
Motif-based
DNA patterns in regulatory regions of gene groups
[Figure: regulatory DNA patterns (motifs) of a gene]
Gene expression data analysis
DNA patterns in expression clusters
Significant occurrences of known motifs in a cluster
[Figure: cluster-by-motif matrix (motif enrichment matrix) with motifs A, B, C, ... against gene clusters 1, 2, 3, ...; each entry is -log(p-value), and the matrix is summarized in the M-score]
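A cluster-by-motif matrix of -log(p-value) entries could be computed along the following lines. The slides do not state which test produces the p-values; the hypergeometric tail used here is an assumption, as are the base-10 logarithm, the function names and the toy gene sets.

```python
from math import comb, log10


def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of at least k motif carriers among n genes drawn
    from N genes, K of which carry the motif (assumed enrichment test)."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def cluster_by_motif(clusters, motif_hits, all_genes):
    """Matrix of -log10(p-value) for every (motif, cluster) pair."""
    N = len(all_genes)
    matrix = {}
    for motif, carriers in motif_hits.items():
        for name, members in clusters.items():
            k = len(carriers & members)
            p = hypergeom_tail(k, len(members), len(carriers), N)
            matrix[motif, name] = -log10(p)
    return matrix


# Hypothetical toy data: ten genes, two clusters, one motif.
all_genes = {f"g{i}" for i in range(10)}
clusters = {"1": {"g0", "g1", "g2"}, "2": all_genes - {"g0", "g1", "g2"}}
motif_hits = {"A": {"g0", "g1", "g2"}}

for key, value in cluster_by_motif(clusters, motif_hits, all_genes).items():
    print(key, round(value, 3))
```

A high entry marks a motif that is over-represented in a cluster; how these entries are aggregated into the M-score is not reproduced here.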
Gene expression data analysis
Cluster-by-motif matrix
[Figure: cluster-by-motif matrix, with axes labelled motif and cluster]
M-score for the entire clustering solution
One-shot estimate of the 'biological relevance'
Gene expression data analysis
M-score
A motif is less interesting when it (significantly) occurs in many clusters
A cluster that contains a large portion of (significant) motifs is less likely to be biologically relevant.
A 'too large' number of clusters is less likely to reflect the true biological diversity underlying the experiment.
Gene expression data analysis
M-score validation
A simplification of reality
No absolute quantification of biological relevance.
Useful tool when experimenting with
• Multiple clustering methods
• Multiple parameterizations
To economize on biological validations
Optimal k in yeast cell cycle expression data
Original studies by Tavazoie et al. used k=30
Overestimation confirmed by analyses of
• De Smet et al. (AQBC)
• Gibbons et al. (GO-based scoring)
[Figure: M-score as a function of k]
Text Mining: principles
Problem setting
Given a set of documents,
compute a representation, called index
to retrieve, summarize, classify or cluster them
<1 0 0 1 0 1>
<1 1 0 0 0 1>
<0 0 0 1 1 0>
Text Mining: principles
Problem setting
Given a set of genes (and their literature),
compute a representation, called gene index
to retrieve, summarize, classify or cluster them
<1 0 0 1 0 1>
<1 1 0 0 0 1>
<0 0 0 1 1 0>
Text Mining: principles
Vector space model
Document processing
Remove punctuation & grammatical structure ('bag of words')
Define a vocabulary
• Identify multi-word terms (phrases), e.g., tumor suppressor
• Eliminate low-content words (stopwords), e.g., and, thus, gene, ...
• Map words with the same meaning (synonyms)
• Strip plurals, conjugations, ... (stemming)
Define a weighting scheme and/or transformations (tf-idf, SVD, ...)
Compute index of textual resources:
[Figure: a gene represented as a vector over the vocabulary terms T1, T2, T3]
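The document-processing steps above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the stopword list, the phrase table, the plural-stripping rule and the three example sentences are hypothetical, synonym mapping is omitted, and the tf-idf variant (raw term count times log of inverse document frequency) is one common choice among several.

```python
import re
from collections import Counter
from math import log

STOPWORDS = {"and", "thus", "gene", "the", "of", "in"}       # assumed stopword list
PHRASES = {("tumor", "suppressor"): "tumor_suppressor"}      # assumed phrase table


def tokenize(text):
    """Bag of words: drop punctuation, strip plurals, merge phrases, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Crude stemming: strip a trailing plural 's' (not a real stemmer).
    words = [w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
             for w in words]
    merged, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return [w for w in merged if w not in STOPWORDS]


def tf_idf_index(docs):
    """Return the vocabulary and one tf-idf vector per document."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    df = Counter(term for c in counts.values() for term in c)
    n_docs = len(docs)
    index = {name: [c[t] * log(n_docs / df[t]) for t in vocabulary]
             for name, c in counts.items()}
    return vocabulary, index


docs = {
    "d1": "The tumor suppressor gene regulates cell proliferation.",
    "d2": "Growth factors and heparin binding in cell proliferation.",
    "d3": "Tumor suppressors and growth factor activity.",
}
vocabulary, index = tf_idf_index(docs)
print(vocabulary)
print(index["d1"])
```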
Text Mining: principles
Validity of gene index
Genes that are functionally related should be close in text space:
Modeled with respect to a background distribution obtained through random and permuted gene groups
Text-based coherence score
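One way to turn this idea into a number is sketched below: measure how close the genes of a group are in text space, then compare against randomly drawn groups of the same size. The mean pairwise cosine distance, the empirical p-value, the function names and the toy gene vectors are assumptions for illustration; the actual coherence score of the thesis is not reproduced here, and permuted groups are left out.

```python
import random
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity between two (non-zero) text vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(index, group):
    """Average text distance over all gene pairs in the group."""
    pairs = list(combinations(group, 2))
    return sum(cosine_distance(index[g], index[h]) for g, h in pairs) / len(pairs)


def coherence_p_value(index, group, n_random=1000, seed=0):
    """Fraction of random groups that are at least as tight as the given group."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(index, group)
    genes = sorted(index)
    hits = sum(
        mean_pairwise_distance(index, rng.sample(genes, len(group))) <= observed
        for _ in range(n_random)
    )
    return (hits + 1) / (n_random + 1)


# Hypothetical gene index: three pairs of genes with similar text profiles.
index = {
    "a1": [1.0, 0.0, 0.0], "a2": [0.9, 0.1, 0.0],
    "b1": [0.0, 1.0, 0.0], "b2": [0.1, 0.9, 0.0],
    "c1": [0.0, 0.0, 1.0], "c2": [0.0, 0.1, 0.9],
}
print(coherence_p_value(index, ["a1", "a2"]))   # functionally related pair
print(coherence_p_value(index, ["a1", "c1"]))   # unrelated pair
```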
TXTGate
TXTGate - a platform to profile groups of genes
Motivation 1
“Until now it has been largely overlooked that there is little difference between retrieving a MEDLINE abstract and downloading an entry from a biological database” (M. Gerstein, 2001)
GeneRIF
12133521: VEGF is associated with the development and prognosis of colorectal cancer.
12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.
11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex
GO
• cell proliferation
• heparin binding
• growth factor activity
TXTGate - a platform to profile groups of genes
Motivation 2
Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.
A number of structured vocabularies have already arisen:
• Gene Ontology (GO)
• MeSH
• eVOC
Standards are systematically being adopted to store biological concepts or annotations:
• HUGO
• GOA@EBI
TXTGate - a platform to profile groups of genes
Motivation 3
(Figure courtesy: S. Van Vooren)
TXTGate - a platform to profile groups of genes
TXTGate
[Figure: TXTGate workflow with the elements profile, distance matrix & clustering, and other vocabulary]
TXTGate - a platform to profile groups of genes
TXTGate – a case study
Gene modules over various expression data sets
Reported two sub-modules of the TCA cycle
Two ‘new’ genes, ACN9 & CAT8, in module 2
Fusion of text and expression data
Problem setting
“How can we analyze data in an integrated fashion to extract more information than from expression data alone?”
Fusion of text and expression data
In each information space
Appropriate preprocessing
Choice of distance measures
Integration of text and data
Fusion of text and expression data
Integration of text and data
Combine data:
Confidence attributed to either of the two data types
In the case of distances, this can be seen as a scaling constant between the norms of the data and text representations.
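The formula from this slide did not survive extraction. A common way to express such a confidence is a convex combination of the two distances, sketched below; the functional form, the function name and the parameter name `weight` are assumptions, not the formula of the thesis.

```python
def combined_distance(d_expr, d_text, weight):
    """Hypothetical convex combination of an expression and a text distance.

    weight = 1.0 trusts expression data only, weight = 0.0 text only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * d_expr + (1.0 - weight) * d_text


# Two genes that are close in expression space but far apart in text space.
print(combined_distance(d_expr=0.2, d_text=0.8, weight=0.5))
```

With raw distances, the weight also has to absorb any difference in scale between the two spaces.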
Fusion of text and expression data
Integration of text and data
However, the distributions of the distances introduce a bias (scaling problem)
Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure)
[Figure: expression distance histogram and text distance histogram]
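Fisher's method is one well-known omnibus procedure from meta-analysis; whether it matches the exact variant used here is an assumption. The sketch first maps each distance to an empirical p-value within its own histogram, which removes the scale difference, and then combines the two p-values. The function names and the toy distance samples are hypothetical.

```python
from bisect import bisect_right
from math import exp, log


def empirical_p(value, background):
    """Fraction of background distances <= value (small distance, small p).

    The result is floored at 1/len(background) so that log(p) stays finite.
    """
    ordered = sorted(background)
    rank = bisect_right(ordered, value)
    return max(rank, 1) / len(ordered)


def fisher_omnibus(p_expr, p_text):
    """Fisher's combined p-value for two independent p-values.

    The statistic -2 * (ln p1 + ln p2) follows a chi-square distribution with
    4 degrees of freedom, whose survival function is exp(-x/2) * (1 + x/2).
    """
    half = -(log(p_expr) + log(p_text))
    return exp(-half) * (1.0 + half)


# Hypothetical distance histograms on very different scales.
expression_distances = [0.1, 0.4, 0.9, 1.2, 1.5, 1.9]
text_distances = [0.70, 0.80, 0.85, 0.90, 0.95, 0.99]

p_e = empirical_p(0.4, expression_distances)
p_t = empirical_p(0.80, text_distances)
print(p_e, p_t, fisher_omnibus(p_e, p_t))
```

The combined value can then play the role of a distance between two genes in the subsequent clustering.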
Fusion of text and expression data
Overview meta-clustering
[Figure: overview of meta-clustering, showing clustering and M-score]
Fusion of text and expression data
Integration improves M-score
[Figure: M-score of integrated clustering versus M-score of expression data only, at various cutoffs k of the cluster tree]
Optimal k?
Fusion of text and expression data
A look inside the integration
[Figure: expression profile next to text profile, showing strong reinforcement]
Conclusion
Contributions
Representation of a gene expression experiment
MIAME
Laboratory Information Management System v. at the VIB MicroArray Facility
Gene expression analysis
Iterative clustering to determine optimal k
M-score
Text-based gene representation
To represent functional information about genes
To score gene groups based on literature
To cluster genes based on literature
TXTGate text mining application
To profile, in a flexible and interactive manner, gene groups from different ‘views’
Integration of text and expression data in clustering
Conclusion
Future work
Semantically-oriented text mining representations
Algorithm-based:
• Improved phrases (word collocations)
• Latent Semantic Indexing
• Concept clustering, bi-clustering
Knowledge-based:
• Gene Ontology distance in a taxonomy
• Basic natural language processing + statistics = shallow parsing
Advanced ways of integrating data
Combine link information with term information
Ways to determine
Conclusion
Publications
Questions?
TXTGate - a platform to profile groups of genes
TXTGate – final considerations
Flexible tool for analyzing gene groups (~200 genes) thanks to various term- and gene-centric vocabularies
… that allow some level of interoperability with external annotation databases
Sub-clustering gene groups is useful to detect biological sub-patterns or shortcomings of the text representation.
Reasonably robust to corrupted groups
Gene index normalizes for unbalanced references and handles multiple gene functions by ‘overruling’
Genes and Microarrays
Representing expression information
Rationale:
Gene expression experiments are a chain of biotechnological operations, protocols and data processing steps
Too verbose to include in a scientific publication
Too important to compromise on reproducibility
Too valuable for post-genome research to have it scattered around on various websites
Standards for reporting on MA experiments
MIAME-compliant databases hosting expression compendia
Conditions in which expression occurs
Gene expression data analysis
Clustering parameterization
Clustering
Hierarchical clustering, k-means
Optimal number of clusters? How to define 'optimal'?
Data-centered statistical scores exist
(Gap statistic, FOM, Silhouette coefficient, …)
… but they are built on the data that produced the result and are not necessarily biologically relevant
Knowledge-based (GO- or text-based) scores (Neighborhood divergence, Gibbons et al.)
… but these risk cyclic confirmation of truth
(As will be explained later on …)
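As an illustration of a data-centered score, the silhouette coefficient named above weighs the coherence of a cluster against its separation from the other clusters. The sketch below assumes a precomputed distance lookup and at least two clusters; the function name and the one-dimensional toy data are hypothetical.

```python
def silhouette(dist, clusters):
    """Mean silhouette coefficient of a clustering (needs >= 2 clusters).

    For each gene: a = mean distance to its own cluster (coherence),
    b = smallest mean distance to another cluster (separation),
    score = (b - a) / max(a, b). Singletons score 0 by convention.
    """
    scores = []
    for idx, members in enumerate(clusters):
        for g in members:
            if len(members) == 1:
                scores.append(0.0)
                continue
            a = sum(dist[g][h] for h in members if h != g) / (len(members) - 1)
            b = min(
                sum(dist[g][h] for h in other) / len(other)
                for j, other in enumerate(clusters) if j != idx
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: genes placed on a line, distance = absolute difference.
positions = {"G1": 0.0, "G2": 0.1, "G3": 5.0, "G4": 5.2}
dist = {g: {h: abs(x - y) for h, y in positions.items()} for g, x in positions.items()}

print(silhouette(dist, [["G1", "G2"], ["G3", "G4"]]))   # well-separated solution
print(silhouette(dist, [["G1", "G3"], ["G2", "G4"]]))   # poor solution
```

Both numbers are derived from the same data that produced the clustering, which is the limitation noted on the slide.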
Gene expression data analysis
Optimal k by looking at DNA patterns
Evaluation:
We constructed a motif-based heuristic, in terms of upstream regulatory sequence patterns in clusters, to obtain a one-shot estimate of the 'biological relevance' of a clustering result.
TXTGate - a platform to profile groups of genes
TXTGate
Multiple ‘views’ (through the use of different vocabularies) on vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications.
TXTGate - a platform to profile groups of genes
TXTGate
incorporates term-based indices (cf. earlier) ...
... and uses them as a starting point
to explore terms generated through different domain vocabularies
to link out to other resources by query building, or
to sub-cluster genes based on text.
TXTGate - a platform to profile groups of genes
TXTGate – case 2
Text Mining: principles
How to construct a gene index
[Figure: the gene index is built from the document index and the gene-literature associations]
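A gene index can be derived from a document index by pooling the vectors of the documents linked to each gene. The sketch below averages them, so that a gene with many references does not dominate one with few; averaging, the function name and the toy identifiers are assumptions rather than the construction used in the thesis.

```python
def build_gene_index(document_index, gene_to_docs):
    """Average the document vectors associated with each gene.

    document_index: {document id: term vector}
    gene_to_docs:   {gene: list of associated document ids}
    Genes without any indexed document are skipped.
    """
    gene_index = {}
    for gene, doc_ids in gene_to_docs.items():
        vectors = [document_index[d] for d in doc_ids if d in document_index]
        if not vectors:
            continue
        gene_index[gene] = [sum(column) / len(vectors) for column in zip(*vectors)]
    return gene_index


# Hypothetical document index over a two-term vocabulary.
document_index = {"doc1": [1.0, 0.0], "doc2": [0.0, 1.0]}
gene_to_docs = {"geneX": ["doc1", "doc2"], "geneY": ["doc1"], "geneZ": ["doc9"]}

print(build_gene_index(document_index, gene_to_docs))
```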
TXTGate - a platform to profile groups of genes
TXTGate – case 1
Gene clusters from microarray experiment on human immune response
Comparative study with Chaussabel et al.
TXTGate’s disease vocabulary
Fusion of text and expression data