Integrating Scientific Literature With Large Scale Gene Expression Analysis

(1)

Genes and Microarrays



PhD defense

Patrick Glenisson


Promotor: Prof. Bart De Moor

June 11th, 2004

(2)

Overview

• Genes & microarrays
• Gene expression data analysis
• Text mining in biology: principles
• Text mining in practice: TXTGate
• Combining text and gene expression data
• Conclusion

(3)

Overview

[Figure: overview diagram built up over several slides, with the labels "Cluster analysis", "M-score", "Literature analysis", "TXTGate" and "Integrated clustering".]

(8)

Genes & microarrays

(9)



DNA, genes, proteins and cells

[Figure with the label "protein".]

(11)



Genes are expressed and regulated

(12)



Microarrays measure gene expression

[Figure: laser excitation of the array yields a gene expression measurement matrix of genes (G1, G2, G3, ...) by conditions (C1, C2, C3, ...), accompanied by gene annotations and sample annotations.]

(13)



Representing expression information

Gene expression experiments are complex:
• Too verbose to include in a scientific publication
• Too important to compromise on reproducibility
• Too valuable for post-genome research to be scattered across various websites

Hence, a standard for reporting on microarray (MA) experiments
• As a guideline for databases hosting expression compendia

[Figure label: Conditions in which expression occurs]

(14)



MIAME standard

Minimum Information About a MicroArray Experiment
• Internationally proposed standard
• Published in December 2001 by the international MGED consortium
• Some prominent journals (Nature, Lancet, EMBO, Cell) require MIAME-compliant submissions of data

Some hurdles:
• Significant overhead in filling out the questionnaire
• Scooping of leads (!)
• Proprietary information about probe sequences

(15)

Gene expression data analysis

(16)

Gene expression data analysis



Questions asked with microarrays

• Fundamental
    • Functional roles of genes (and transcriptional regulation)
    • Genetic network reconstruction
• Clinical
    • Correlation of genes with a given disease
    • Diagnosis of disease stage in patients
• Pharmacological
    • Toxicological drug response assessment


(18)



Clustering

[Figure: expression data (genes by conditions C1, C2, C3) is converted into a gene-by-gene distance matrix, which is then clustered (hierarchical clustering, k-means).]
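The pipeline on this slide (expression data, distance matrix, clustering) can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the choice of Pearson distance, average linkage, and the toy profiles are all assumptions made for the example, and k-means is not shown.

```python
import math

def pearson_distance(x, y):
    """1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def distance_matrix(profiles):
    """Gene-by-gene distance matrix, stored as a dict keyed by gene pairs."""
    genes = list(profiles)
    return {(g, h): pearson_distance(profiles[g], profiles[h])
            for g in genes for h in genes}

def average_linkage(profiles, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    dist = distance_matrix(profiles)
    clusters = [[g] for g in profiles]

    def linkage(a, b):
        return sum(dist[g, h] for g in a for h in b) / (len(a) * len(b))

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy expression data: genes by conditions (C1, C2, C3); values are invented.
profiles = {
    "G1": [1.0, 2.0, 3.0],
    "G2": [2.0, 4.1, 6.0],
    "G3": [3.0, 2.0, 1.0],
    "G4": [6.0, 3.9, 2.0],
}
clusters = average_linkage(profiles, k=2)
```

Cutting the tree at a different k only changes the stopping condition, which is what makes the choice of k a separate validation question.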

(19)

Cluster validation

Optimal number of clusters? Define 'optimal'?

• Data-centered statistical scores
    • Coherence vs separation of clusters
    • Stability of a cluster solution when leaving out data

[Figure: clusters C1, C2, C3]
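One common data-centered score that trades off coherence against separation is the silhouette coefficient. The sketch below is an illustration under assumptions: Euclidean distance and the toy points are chosen for the example and are not taken from the slides.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def silhouette(points, labels):
    """Mean silhouette width over all points (needs at least two clusters)."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(euclidean(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # coherence
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated pairs of points; values are invented.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the geometry
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the geometry
```

A score like this is computed from the same data that produced the clustering, which is the limitation the knowledge-based and motif-based scores try to address.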

(20)

Cluster validation

• Data-centered statistical scores
• Knowledge-based scores
    • Enrichment of GO annotations in clusters
    • Literature-based scoring

(21)



Cluster validation

• Data-centered statistical scores
• Knowledge-based scores
• Motif-based
    • DNA patterns in regulatory regions of gene groups

[Figure: regulatory DNA patterns (motifs) and gene]

(22)

Gene expression data analysis

DNA patterns in expression clusters

Significant occurrences of known motifs in each cluster

[Figure: gene clusters (1, 2, 3, ...) are scored against motifs (A, B, C, ...) to give a cluster-by-motif matrix (motif enrichment matrix) with entries -log(p-value), which is summarized in the M-score.]
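A cluster-by-motif matrix of -log(p-value) entries could be computed along the following lines. The slides do not state which significance test was used, so the hypergeometric test here is an assumption, as are the function names and the toy data. The aggregation of this matrix into the M-score is not reproduced, since its formula is not given in the slides.

```python
import math

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n genes out of N, of which K carry the motif."""
    total = math.comb(N, n)
    return sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / total

def motif_enrichment_matrix(clusters, motif_genes):
    """Entries are -log10(p-value), keyed by (cluster, motif)."""
    all_genes = set().union(*clusters.values())
    N = len(all_genes)
    matrix = {}
    for cname, genes in clusters.items():
        for mname, carriers in motif_genes.items():
            carriers = carriers & all_genes
            k = len(genes & carriers)
            p = hypergeom_tail(k, len(genes), len(carriers), N)
            matrix[cname, mname] = -math.log10(p)
    return matrix

# Hypothetical clusters and motif occurrences.
clusters = {
    "1": {"g1", "g2", "g3", "g4"},
    "2": {"g5", "g6", "g7", "g8"},
}
motif_genes = {
    "A": {"g1", "g2", "g3"},  # concentrated in cluster 1
    "B": {"g2", "g6"},        # spread over both clusters
}
matrix = motif_enrichment_matrix(clusters, motif_genes)
```

Large entries mark motifs that are over-represented in a cluster; a motif such as "B" that is spread across clusters yields small entries everywhere.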

(23)



Cluster-by-motif matrix

[Figure: cluster-by-motif matrix with axes "cluster" and "motif"]

M-score for the entire clustering solution
• one-shot estimate of the 'biological relevance'

(24)



M-score

• A motif is less interesting when it occurs (significantly) in many clusters.
• A cluster that contains a large proportion of the (significant) motifs is less likely to be biologically relevant.
• A 'too large' number of clusters is less likely to reflect the true biological diversity underlying the experiment.

(25)



M-score validation

• A simplification of reality
• No absolute quantification of biological relevance
• Useful tool when experimenting with
    • Multiple clustering methods
    • Multiple parameterizations
• To economize on biological validations
• Optimal k in yeast cell cycle expression data
• Original studies by Tavazoie et al. used k=30
• Overestimation, confirmed by analyses of
    • De Smet et al. (AQBC)
    • Gibbons et al. (GO-based scoring)

[Figure: M-score as a function of k]

(26)

Text mining in biology: principles

(27)

Text Mining: principles

Problem setting

• Given a set of documents,
• compute a representation, called an index,
• to retrieve, summarize, classify or cluster them

[Figure: documents represented as term vectors, e.g. <1 0 0 1 0 1>, <1 1 0 0 0 1>, <0 0 0 1 1 0>]

(28)



Problem setting

• Given a set of genes (and their literature),
• compute a representation, called a gene index,
• to retrieve, summarize, classify or cluster them

[Figure: genes represented as term vectors, e.g. <1 0 0 1 0 1>, <1 1 0 0 0 1>, <0 0 0 1 1 0>]

(29)



Vector space model

• Document processing
• Remove punctuation & grammatical structure ('bag of words')
• Define a vocabulary
    • Identify multi-word terms (e.g., tumor suppressor) (phrases)
    • Eliminate low-content words (e.g., and, thus, gene, ...) (stopwords)
    • Map words with the same meaning (synonyms)
    • Strip plurals, conjugations, ... (stemming)
• Define a weighting scheme and/or transformations (tf-idf, SVD, ...)
• Compute index of textual resources

[Figure: a gene represented as a vector in the space spanned by vocabulary terms T1, T2, T3]
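The processing steps above can be sketched in a few lines. This is a toy illustration under assumptions: the stopword, synonym and phrase lists, the crude stemmer, and the plain tf-idf weighting are stand-ins chosen for the example, not the resources or weighting used in the thesis.

```python
import math
import re
from collections import Counter

# Illustrative resources; real vocabularies would be far larger.
STOPWORDS = {"and", "thus", "gene", "the", "of", "in", "is", "a", "by"}
SYNONYMS = {"tumour": "tumor"}
PHRASES = {("tumor", "suppressor"): "tumor_suppressor"}

def stem(word):
    """Crude plural stripping; a real system would use a proper stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]
    return word

def tokenize(text):
    """Bag of words: lowercase, drop punctuation, map synonyms, stem,
    merge known phrases, and drop stopwords."""
    words = [stem(SYNONYMS.get(w, w))
             for w in re.findall(r"[a-z0-9]+", text.lower())]
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            tokens.append(PHRASES[pair])
            i += 2
            continue
        if words[i] not in STOPWORDS:
            tokens.append(words[i])
        i += 1
    return tokens

def tfidf_index(docs):
    """Return (vocabulary, {doc_id: tf-idf vector over the vocabulary})."""
    counts = {d: Counter(tokenize(t)) for d, t in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    n = len(docs)
    idf = {t: math.log(n / sum(1 for c in counts.values() if t in c))
           for t in vocab}
    index = {d: [c[t] * idf[t] for t in vocab] for d, c in counts.items()}
    return vocab, index

docs = {
    "doc1": "The tumour suppressors and the cell cycle.",
    "doc2": "Cell cycle genes in yeast.",
}
vocab, index = tfidf_index(docs)
```

Terms shared by every document ("cell", "cycle") receive zero weight, so only the discriminating terms contribute to distances in text space.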

(30)



Validity of gene index

Genes that are functionally related should be close in text space:

• Text-based coherence score
• Modeled with respect to a background distribution obtained through random and permuted gene groups
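A text-based coherence score compared against a random background could look like the sketch below. The exact score used in the thesis is not given on the slide; mean pairwise cosine similarity, the empirical p-value, and the toy gene index are assumptions made for illustration.

```python
import math
import random
from itertools import combinations

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def coherence(index, group):
    """Mean pairwise cosine similarity of the genes in a group."""
    pairs = list(combinations(group, 2))
    return sum(cosine(index[g], index[h]) for g, h in pairs) / len(pairs)

def empirical_p_value(index, group, n_random=1000, seed=0):
    """Fraction of random gene groups that are at least as coherent."""
    rng = random.Random(seed)
    observed = coherence(index, group)
    genes = sorted(index)
    hits = sum(coherence(index, rng.sample(genes, len(group))) >= observed
               for _ in range(n_random))
    return (hits + 1) / (n_random + 1)

# Hypothetical gene index over a three-term vocabulary.
index = {
    "g1": [1.0, 0.1, 0.0], "g2": [0.9, 0.0, 0.1], "g3": [1.0, 0.0, 0.0],
    "g4": [0.0, 1.0, 0.0], "g5": [0.0, 0.0, 1.0], "g6": [0.0, 1.0, 1.0],
}
related = ["g1", "g2", "g3"]
p = empirical_p_value(index, related)
```

A functionally related group should score well above most random groups of the same size, giving a small empirical p-value.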


(33)

Text mining in practice: TXTGate

(34)

TXTGate - a platform to profile groups of genes

Motivation 1

"Until now it has been largely overlooked that there is little difference between retrieving a MEDLINE abstract and downloading an entry from a biological database" (M. Gerstein, 2001)

GeneRIF
• 12133521: VEGF is associated with the development and prognosis of colorectal cancer.
• 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.
• 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex

GO
• cell proliferation
• heparin binding
• growth factor activity

(35)




Motivation 2

Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.

A number of structured vocabularies have already arisen:

• Gene Ontology (GO)

• MeSH

• eVOC

Standards are systematically being adopted to store biological concepts or annotations:

• HUGO

• GOA@EBI

(36)




Motivation 3

(Figure courtesy: S. Van Vooren)

(37)

TXTGate

[Figure: TXTGate workflow with the elements "Profile", "Distance matrix & Clustering" and "Other vocabulary"]

(38)

TXTGate – a case study

• Gene modules over various expression data sets
• Reported two sub-modules of the TCA cycle
• Two 'new' genes, ACN9 & CAT8, in module 2

(39)

Combining text and gene expression data

(40)

Fusion of text and expression data



Problem setting

"How can we analyze data in an integrated fashion to extract more information than from expression data alone?"

(41)





Integration of text and data

In each information space:
• Appropriate preprocessing
• Choice of distance measures

(42)



Integration of text and data



Combine data:
• with a weight expressing the confidence attributed to either of the two data types
• in the case of distances, this weight can be seen as a scaling constant between the norms of the data and text representations
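The formula from this slide did not survive extraction. One plausible reading is a convex combination of the two distances, sketched below; the function name, the parameter `lam` and the example values are assumptions, not the notation of the thesis.

```python
def combined_distance(d_expr, d_text, lam=0.5):
    """Convex combination of an expression distance and a text distance.

    lam is the (assumed) confidence placed in the expression data;
    1 - lam is the confidence placed in the text data.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * d_expr + (1.0 - lam) * d_text

# If expression distances are typically ten times larger than text
# distances, an equal weighting is dominated by the expression term.
d = combined_distance(d_expr=2.0, d_text=0.1, lam=0.5)
```

Because the two distances live on different scales, `lam` acts as much as a scaling constant as it does as a confidence weight.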

(43)



Integration of text and data



However, the distributions of the distances introduce a bias: a scaling problem

Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure)

[Figure: expression distance histogram and text distance histogram]
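One way to sidestep the scaling problem is to map each distance to a p-value under its own empirical distribution and then combine the p-values. The slide only names an "omnibus procedure"; the use of Fisher's method below is an assumption, as are the function names and the toy distance samples.

```python
import math
from bisect import bisect_right

def empirical_cdf(values):
    """Return a function mapping a distance to its empirical p-value."""
    ordered = sorted(values)
    n = len(ordered)

    def p_value(d):
        # Fraction of observed distances <= d (small distance, small p).
        # The +1 terms keep the result strictly positive and at most 1.
        return (bisect_right(ordered, d) + 1) / (n + 1)

    return p_value

def fisher_combine(p1, p2):
    """Fisher's method for two p-values.

    X = -2 (ln p1 + ln p2) follows a chi-square distribution with 4
    degrees of freedom, whose survival function is exp(-x/2) (1 + x/2).
    """
    x = -2.0 * (math.log(p1) + math.log(p2))
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# Toy distance samples on very different scales.
expr_distances = [0.1, 0.5, 1.0, 1.5, 1.9]
text_distances = [0.01, 0.02, 0.05, 0.08, 0.10]
p_expr = empirical_cdf(expr_distances)
p_text = empirical_cdf(text_distances)

close = fisher_combine(p_expr(0.1), p_text(0.01))  # close in both spaces
far = fisher_combine(p_expr(1.9), p_text(0.10))    # far in both spaces
```

The combined value can then serve as an integrated distance, since both inputs have been brought onto the same [0, 1] scale before combination.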

(44)



Overview meta-clustering

[Figure: meta-clustering workflow with the steps "Clustering" and "M-score"]

(45)



Integration improves M-score

[Figure: M-score of the integrated clustering compared with the M-score of expression data only, for various cutoffs k of the cluster tree. Optimal k?]

(46)



A look inside the integration

[Figure: expression profile and text profile side by side, showing strong reinforcement]

(48)

Conclusion

(49)

Conclusion



Contributions

Representation of a gene expression experiment
• MIAME
• Laboratory Information Management System v. at the VIB MicroArray Facility

Gene expression analysis
• Iterative clustering to determine optimal k
• M-score

Text-based gene representation
• To represent functional information about genes
• To score gene groups based on literature
• To cluster genes based on literature

TXTGate text mining application
• To profile, in a flexible and interactive manner, gene groups from different 'views'

Integration of text and expression data in clustering

(50)

Conclusion





Future work

Semantically-oriented text mining representations
• Algorithm-based:
    • Improved phrases (word co-locations)
    • Latent Semantic Indexing
    • Concept clustering, bi-clustering
• Knowledge-based:
    • Gene Ontology: distance in a taxonomy
    • Basic natural language processing + statistics = shallow parsing

Advanced ways of integrating data
• Combine link information with term information
• Ways to determine

(51)

Conclusion



Publications

(52)

Questions



?

(53)

TXTGate – final considerations

• Flexible tool for analyzing gene groups (~200 genes) due to various term- and gene-centric vocabularies
• ... that allow some level of interoperability with external annotation databases
• Sub-clustering gene groups is useful to detect biological sub-patterns, or shortcomings of the text representation
• Reasonably robust to corrupted groups
• Gene index normalizes for unbalanced references and handles multiple gene functions by 'overruling'

(54)



Representing expression information

Rationale:
• Gene expression experiments are a chain of biotechnological operations, protocols and data processing steps
• Too verbose to include in a scientific publication
• Too important to compromise on reproducibility
• Too valuable for post-genome research to be scattered across various websites

Standards for reporting on MA experiments

MIAME-compliant databases hosting expression compendia

[Figure label: Conditions in which expression occurs]

(55)



Clustering parameterization

Clustering: hierarchical clustering, k-means

Optimal number of clusters? Define 'optimal'?

Data-centered statistical scores exist (gap statistic, FOM, silhouette coefficient, ...)
• ... but built on the data that produced the result, not necessarily biologically relevant

Knowledge-based (GO- or text-based) scores (neighborhood divergence, Gibbons et al.)
• ... but cyclic confirmations of truth (as will be explained later on ...)

(56)



Optimal k by looking at DNA patterns

Evaluation:
• we constructed a motif-based heuristic
• in terms of upstream regulatory sequence patterns in clusters,
• to have a one-shot estimate of the 'biological relevance' of a clustering result.

(57)

TXTGate

• multiple 'views' (through use of different vocabularies)
• on vast amounts of (gene-based) free-text information
• available in selected curated database entries & linked scientific publications.

(58)

TXTGate

• incorporates term-based indices (cf. before) ...
• ... and uses them as a starting point
    • to explore terms generated through different domain vocabularies
    • to link out to other resources by query building, or
    • to sub-cluster genes based on text.

(59)




TXTGate – case 2

(60)



How to construct a gene index

[Figure: the gene index is derived from the document index and the gene-literature associations]
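The construction on this slide can be sketched as follows. How the document vectors are aggregated is not recoverable from the slide, so averaging followed by unit-length normalization is an assumption, as are the function name and the toy data.

```python
import math

def gene_index(doc_index, gene_to_docs):
    """Build one vector per gene from the vectors of its associated documents.

    doc_index:    {doc_id: term vector}
    gene_to_docs: {gene: [doc_id, ...]}  (gene-literature associations)
    """
    genes = {}
    for gene, doc_ids in gene_to_docs.items():
        vectors = [doc_index[d] for d in doc_ids if d in doc_index]
        if not vectors:
            continue  # no indexed literature for this gene
        # Averaging keeps heavily cited genes from dominating.
        mean = [sum(col) / len(vectors) for col in zip(*vectors)]
        norm = math.sqrt(sum(v * v for v in mean))
        genes[gene] = [v / norm for v in mean] if norm else mean
    return genes

# Hypothetical document index and associations.
doc_index = {"d1": [1.0, 0.0, 0.0], "d2": [0.0, 1.0, 0.0], "d3": [0.0, 0.0, 2.0]}
gene_to_docs = {"geneA": ["d1", "d2"], "geneB": ["d3"], "geneC": ["d9"]}
g_index = gene_index(doc_index, gene_to_docs)
```

Genes without any indexed document ("geneC" here) are simply left out of the gene index.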

(61)

TXTGate – case 1

• Gene clusters from a microarray experiment on human immune response
• Comparative study with Chaussabel et al.
• TXTGate's disease vocabulary

(62)