Integrating Scientific Literature With Large Scale Gene Expression Analysis

(1)

Genes and Microarrays

PhD defense

Patrick Glenisson

Integrating Scientific Literature With Large Scale Gene Expression Analysis

Promotor

Prof. Bart De Moor June 11th 2004

(2)

Overview

Genes & microarrays

Gene expression data analysis (M-score, cluster analysis)

Text mining in biology: principles (literature analysis)

Text mining in practice: TXTGate

Combining text and gene expression data (integrated clustering)

Conclusion

(9)

Genes and Microarrays

DNA, genes, proteins and cells

(10)

Genes and Microarrays

DNA, genes, proteins and cells

protein

(11)

Genes and Microarrays

Genes are expressed and regulated

(12)

Genes and Microarrays

Microarrays measure gene expression

[Figure: laser excitation of the microarray yields gene expression measurements, stored as a matrix of genes (G1, G2, G3, ...) by conditions (C1, C2, C3, ...), together with sample annotations and gene annotations]

(13)

Genes and Microarrays

Representing expression information

Gene expression experiments are complex:

Too verbose to include in a scientific publication

Too important to compromise on reproducibility

Too valuable for post-genome research to be scattered across various websites

Hence the need for a standard for reporting on microarray (MA) experiments

As a guideline for databases hosting expression compendia

Conditions in which expression occurs

(14)

Genes and Microarrays

MIAME standard

Minimum Information About a MicroArray Experiment

Internationally proposed standard

Published in December 2001 by the international MGED consortium

Some prominent journals (Nature, Lancet, EMBO, Cell) require MIAME-compliant submissions of data

Some hurdles:

Significant overhead in filling out the questionnaire

Scooping of leads (!)

Proprietary information about probe sequences

(16)

Gene expression data analysis

Questions asked with microarrays

Fundamental

Functional roles of genes (and transcriptional regulation)

Genetic network reconstruction

Clinical

Correlation of genes with a given disease

Diagnosis of disease stage in patients

Pharmacological

Toxicological drug response assessment

(17)

Gene expression data analysis

Microarray data analysis

Fundamental

Functional roles of genes (and transcriptional regulation)

Genetic network reconstruction

Clinical

Correlation of genes with a given disease

Diagnosis of disease stage in patients

Pharmacological

Toxicological drug response assessment

(18)

Gene expression data analysis

Clustering

[Figure: expression data (genes by conditions C1, C2, C3) → gene-by-gene distance matrix → clustering]

Hierarchical clustering, k-means
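The pipeline on this slide (expression data, then a distance matrix, then clustering) can be sketched as follows. This is only an illustration under stated assumptions: the Pearson-correlation distance, the average-linkage rule and the toy expression values are choices made here, not details given in the slides.

```python
import math

def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def distance_matrix(profiles):
    """Gene-by-gene distance matrix from a genes x conditions table."""
    n = len(profiles)
    return [[pearson_distance(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

def average_linkage(dist, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average distance over all gene pairs drawn from the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Toy data (assumption): rows are genes G1..G4, columns are conditions C1..C3.
expression = [
    [1.0, 2.0, 3.0],  # G1: rising
    [2.0, 4.1, 6.0],  # G2: rising
    [3.0, 2.0, 1.0],  # G3: falling
    [6.0, 3.9, 2.0],  # G4: falling
]
clusters = average_linkage(distance_matrix(expression), k=2)
print(clusters)
```

Cutting the same merge sequence at different values of k gives the family of cluster solutions whose quality the following slides try to score.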

(19)

Gene expression data analysis

Cluster validation

Optimal number of clusters? Define ‘optimal’?

• Data-centered statistical scores
  • Coherence vs. separation of clusters
  • Stability of a cluster solution when leaving out data

(20)

Gene expression data analysis

Cluster validation

Optimal number of clusters? Define ‘optimal’?

• Data-centered statistical scores
• Knowledge-based scores
  • Enrichment of GO annotations in clusters
  • Literature-based scoring

(21)

Gene expression data analysis

Cluster validation

Optimal number of clusters? Define ‘optimal’?

• Data-centered statistical scores
• Knowledge-based scores
• Motif-based
  • DNA patterns in regulatory regions of gene groups

[Figure: regulatory DNA patterns (motifs) and gene]

(22)

Gene expression data analysis

DNA patterns in expression clusters

Significant occurrences of known motifs in clusters

[Figure: cluster-by-motif matrix (motif enrichment matrix) relating gene clusters to motifs (axis labels 1, 2, 3, ... and A, B, C, ...), with entries -log(p-value), leading to the M-score]
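A cluster-by-motif matrix of -log(p-value) entries could be computed along the following lines. The slides do not say which test produces the p-values; the hypergeometric tail used here is an assumption, as are the toy clusters and motif assignments. The aggregation of the matrix into the M-score is not reproduced, since its formula is not given.

```python
import math

def hypergeom_tail(x, n, K, N):
    """P(X >= x): chance that at least x of n cluster genes carry a motif
    that is present in K of the N genes overall."""
    total = math.comb(N, n)
    upper = min(n, K)
    return sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(x, upper + 1)) / total

def enrichment_matrix(clusters, motif_genes, n_genes):
    """Rows are motifs, columns are clusters, entries are -log10(p-value)."""
    matrix = {}
    for motif, carriers in motif_genes.items():
        row = []
        for members in clusters:
            hits = len(carriers & members)
            p = hypergeom_tail(hits, len(members), len(carriers), n_genes)
            # max() keeps a p-value of exactly 1 from showing up as -0.0.
            row.append(max(0.0, -math.log10(p)))
        matrix[motif] = row
    return matrix

# Toy data (assumption): 10 genes, two clusters, two hypothetical motifs.
clusters = [{0, 1, 2, 3}, {4, 5, 6, 7, 8, 9}]
motif_genes = {"A": {0, 1, 2}, "B": {0, 4, 7}}
matrix = enrichment_matrix(clusters, motif_genes, n_genes=10)
for motif, row in matrix.items():
    print(motif, [round(v, 2) for v in row])
```

In this toy case motif A is concentrated in the first cluster and therefore gets a high entry there, while motif B is spread over both clusters and scores low everywhere.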

(23)

Gene expression data analysis

Cluster-by-motif matrix

[Figure: matrix with axes cluster and motif]

M-score for the entire clustering solution

→ one-shot estimate of the ‘biological relevance’

(24)

Gene expression data analysis

M-score

A motif is less interesting when it (significantly) occurs in many clusters

A cluster that contains a large portion of (significant) motifs is less likely to be biologically relevant.

A `too large' number of clusters is less likely to reflect the true biological diversity underlying the experiment.

(25)

Gene expression data analysis

M-score validation

A simplification of reality

No absolute quantification of biological relevance.

Useful tool when experimenting with

Multiple clustering methods

Multiple parameterizations

To economize on biological validations

Optimal k in yeast cell cycle expression data

Original studies by Tavazoie et al. used k=30

Overestimation → confirmed by analyses of

De Smet et al. (AQBC)

Gibbons et al. (GO-based scoring)

[Figure: M-score as a function of k]

(27)

Text Mining: principles

Problem setting

Given a set of documents,

compute a representation, called index

to retrieve, summarize, classify or cluster them

<1 0 0 1 0 1>

<1 1 0 0 0 1>

<0 0 0 1 1 0>

(28)

Text Mining: principles

Problem setting

Given a set of genes (and their literature),

compute a representation, called gene index

to retrieve, summarize, classify or cluster them

<1 0 0 1 0 1>

<1 1 0 0 0 1>

<0 0 0 1 1 0>

(29)

Text Mining: principles

Vector space model

Document processing

Remove punctuation & grammatical structure (‘bag of words’)

Define a vocabulary

Identify multi-word terms (e.g., tumor suppressor) (phrases)

Eliminate low-content words (e.g., and, thus, gene, ...) (stopwords)

Map words with the same meaning (synonyms)

Strip plurals, conjugations, ... (stemming)

Define weighting scheme and/or transformations (tf-idf, SVD, ...)

Compute index of textual resources:

[Figure: index vectors over vocabulary terms T1, T2, T3; axis labels: vocabulary, gene]
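The document-processing steps listed on this slide can be sketched as a small tf-idf indexer. The stopword list, the phrase table and the three example sentences are invented for illustration; synonym mapping and stemming are left out to keep the sketch short, and the plain tf × log(N/df) weighting is one common tf-idf variant, not necessarily the one used in the thesis.

```python
import math
from collections import Counter

# Illustrative resources (assumptions, not the vocabularies used in the thesis).
STOPWORDS = {"and", "thus", "gene", "the", "of", "in", "is", "a"}
PHRASES = {("tumor", "suppressor"): "tumor_suppressor"}

def tokenize(text):
    """Bag of words: drop punctuation, merge known phrases, remove stopwords."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    words = [w for w in words if w]
    merged, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return [w for w in merged if w not in STOPWORDS]

def tfidf_index(docs):
    """Return the vocabulary and one tf-idf vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    df = Counter(t for doc in tokenized for t in set(doc))
    index = []
    for doc in tokenized:
        tf = Counter(doc)
        index.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, index

docs = [
    "p53 is a tumor suppressor gene.",
    "VEGF regulates angiogenesis.",
    "The tumor suppressor PTEN regulates VEGF.",
]
vocabulary, index = tfidf_index(docs)
print(vocabulary)
```

Terms that occur in every document receive weight zero under this scheme, which is the intended effect of the idf factor.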

(30)

Text Mining: principles

Validity of gene index

Genes that are functionally related should be close in text space:

Text-based coherence score

Modeled with respect to a background distribution obtained through random and permuted gene groups
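One way to make the coherence idea concrete is to compare the mean pairwise text distance within a gene group against the same statistic for randomly drawn groups of equal size. The sketch below assumes cosine distance and random groups only (no permuted groups); the six-gene index is toy data, and the function names are hypothetical.

```python
import math
import random

def cosine_distance(u, v):
    """1 - cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def mean_pairwise_distance(index, group):
    """Average text distance over all gene pairs in the group."""
    pairs = [(g, h) for i, g in enumerate(group) for h in group[i + 1:]]
    return sum(cosine_distance(index[g], index[h]) for g, h in pairs) / len(pairs)

def coherence_p_value(index, group, n_random=1000, seed=0):
    """Fraction of random same-size gene groups at least as tight as `group`."""
    rng = random.Random(seed)
    genes = sorted(index)
    observed = mean_pairwise_distance(index, group)
    hits = sum(
        mean_pairwise_distance(index, rng.sample(genes, len(group))) <= observed
        for _ in range(n_random)
    )
    return (hits + 1) / (n_random + 1)

# Toy gene index (assumption): genes a, b, c share terms; d, e, f do not.
index = {
    "a": [2, 1, 0, 0, 0], "b": [1, 2, 0, 0, 0], "c": [1, 1, 0, 0, 0],
    "d": [0, 0, 1, 0, 0], "e": [0, 0, 0, 1, 0], "f": [0, 0, 0, 0, 1],
}
p_related = coherence_p_value(index, ["a", "b", "c"])
p_unrelated = coherence_p_value(index, ["d", "e", "f"])
print(p_related, p_unrelated)
```

A functionally related group should end up with a small value, while a group of genes with unrelated literature should look no tighter than the random background.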

(34)

TXTGate - a platform to profile groups of genes

Motivation 1

“Until now it has been largely overlooked that there is little difference between retrieving a MEDLINE abstract and downloading an entry from a biological database”

(M. Gerstein, 2001)

GeneRIF

12133521: VEGF is associated with the development and prognosis of colorectal cancer.

12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.

11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex

GO

• cell proliferation

• heparin binding

• growth factor activity

(35)

TXTGate - a platform to profile groups of genes

Motivation 2

Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.

A number of structured vocabularies have already arisen:

Gene Ontology (GO)

MeSH

eVOC

Standards are systematically being adopted to store biological concepts or annotations:

HUGO

GOA@EBI

(36)

TXTGate - a platform to profile groups of genes

Motivation 3

(Figure courtesy: S. Van Vooren)

(37)

TXTGate - a platform to profile groups of genes

TXTGate

[Figure: TXTGate screens: Profile; Distance matrix & Clustering; Other vocabulary]

(38)

TXTGate - a platform to profile groups of genes

TXTGate – a case study

Gene modules over various expression data sets

Reported two sub-modules of the TCA cycle

Two ‘new’ genes, ACN9 & CAT8, in module 2

(40)

Fusion of text and expression data

Problem setting

“How can we analyze data in an integrated fashion to extract more information than solely from expression data?”

(41)

Fusion of text and expression data

In each information space

Appropriate preprocessing

Choice of distance measures

Integration of text and data

(42)

Fusion of text and expression data

Integration of text and data

Combine data:

a weight expresses the confidence attributed to either of the two data types

in the case of distances, it can be seen as a scaling constant between the norms of the data and text representations.

(43)

Fusion of text and expression data

Integration of text and data

However, the distributions of distances introduce a bias → scaling problem

Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure)

[Figure: expression distance histogram and text distance histogram]
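The scaling problem and a possible omnibus-style remedy can be illustrated as follows. The slides only name "the omnibus procedure"; the sketch assumes Fisher's method for combining two p-values, with each distance first replaced by its empirical cumulative probability within its own data type. The gene pairs and distance values are toy data.

```python
import math

def empirical_p_values(distances):
    """Map each pairwise distance to the fraction of pairs at least as close."""
    values = sorted(distances.values())
    n = len(values)
    return {pair: sum(v <= d for v in values) / n
            for pair, d in distances.items()}

def fisher_combine(p_expr, p_text):
    """Fisher's method for two p-values: the chi-square (4 degrees of freedom)
    tail probability of -2*ln(p1*p2), which simplifies to q*(1 - ln q)."""
    q = p_expr * p_text
    return q * (1.0 - math.log(q))

def integrated_distances(expr_dist, text_dist):
    """Combine both data types on a common probability scale."""
    p_expr = empirical_p_values(expr_dist)
    p_text = empirical_p_values(text_dist)
    return {pair: fisher_combine(p_expr[pair], p_text[pair])
            for pair in expr_dist}

# Toy distances (assumption): text distances crowd near 1, expression ones do not.
expr_dist = {("g1", "g2"): 0.10, ("g1", "g3"): 0.90, ("g2", "g3"): 0.80}
text_dist = {("g1", "g2"): 0.95, ("g1", "g3"): 0.99, ("g2", "g3"): 0.97}
combined = integrated_distances(expr_dist, text_dist)
for pair, value in sorted(combined.items(), key=lambda item: item[1]):
    print(pair, round(value, 3))
```

Because both inputs are converted to ranks within their own distribution before being combined, the narrow range of the text distances no longer lets the expression distances dominate the result.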

(44)

Fusion of text and expression data

Overview meta-clustering

[Figure: meta-clustering workflow; labels: Clustering, M-score]

(45)

Fusion of text and expression data

Integration improves M-score

[Figure: M-score of integrated clustering compared with M-score of expression data only, for various cutoffs k of the cluster tree]

Optimal k?

(46)

Fusion of text and expression data

A look inside the integration

(47)

Fusion of text and expression data

A look inside the integration

[Figure: expression profile and text profile side by side]

Strong reinforcement

(49)

Conclusion

Contributions

Representation of a gene expression experiment

MIAME

Laboratory Information Management System v. at the VIB MicroArray Facility

Gene expression analysis

Iterative clustering to determine optimal k

M-score

Text-based gene representation

To represent functional information about genes

To score gene groups based on literature

To cluster genes based on literature

TXTGate text mining application

To profile, in a flexible and interactive manner, gene groups from different ‘views’

Integration of text and expression data in clustering

(50)

Conclusion

Future work

Semantically-oriented text mining representations

Algorithm-based:

Improved phrases (word co-locations)

Latent Semantic Indexing

Concept clustering, bi-clustering

Knowledge-based:

Gene Ontology → distance in a taxonomy

Basic natural language processing + statistics = shallow parsing

Advanced ways of integrating data

Combine link information with term information

Ways to determine

(51)

Conclusion

Publications

(52)

Questions?

(53)

TXTGate - a platform to profile groups of genes

TXTGate – final considerations

Flexible tool for analyzing gene groups (~200 genes) due to various term- and gene-centric vocabularies

… that allow some level of interoperability with external annotation databases

Sub-clustering gene groups useful to detect biological sub-patterns, or, shortcomings of the text representation.

Reasonably robust to corrupted groups

Gene index normalizes for unbalanced references and handles multiple gene functions by ‘overruling’

(54)

Genes and Microarrays

Representing expression information

Rationale:

Gene expression experiments are a chain of biotechnological operations, protocols and data processing steps

Too verbose to include in a scientific publication

Too important to compromise on reproducibility

Too valuable for post-genome research to have it scattered around on various websites

Standards for reporting on MA experiments

MIAME-compliant databases hosting expression compendia

Conditions in which expression occurs

(55)

Gene expression data analysis

Clustering parameterization

Clustering: hierarchical clustering, k-means

Optimal number of clusters? Define ‘optimal’?

Data-centered statistical scores exist (Gap statistic, FOM, Silhouette coefficient, ...)

→ ... but built on the data that produced the result, not necessarily biologically relevant

Knowledge-based (GO- or text-based) scores (neighborhood divergence, Gibbons et al.)

→ ... but cyclic confirmations of truth (as will be explained later on ...)

(56)

Gene expression data analysis

Optimal k by looking at DNA patterns

Evaluation:

We constructed a motif-based heuristic, in terms of upstream regulatory sequence patterns in clusters, to have a one-shot estimate of the ‘biological relevance’ of a clustering result.

(57)

TXTGate - a platform to profile groups of genes

TXTGate

Multiple ‘views’ (through the use of different vocabularies) on vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications.

(58)

TXTGate - a platform to profile groups of genes

TXTGate

incorporates term-based indices (cf. before) and uses them as a starting point

to explore terms generated through different domain vocabularies

to link out to other resources by query building, or

to sub-cluster genes based on text.

(59)

TXTGate - a platform to profile groups of genes

TXTGate – case 2

(60)

Text Mining: principles

How to construct a gene index

Gene index

Document index Gene-literature

associations
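The slide combines a document index with gene-literature associations to obtain a gene index. A minimal sketch of one way to do this is shown below. Averaging the linked document vectors and scaling to unit length are assumptions made for illustration, and the document ids, gene names and vectors are hypothetical.

```python
import math

def gene_index(doc_index, gene_docs):
    """Average the document vectors linked to each gene, then scale to unit length."""
    index = {}
    for gene, doc_ids in gene_docs.items():
        vectors = [doc_index[d] for d in doc_ids]
        mean = [sum(col) / len(vectors) for col in zip(*vectors)]
        norm = math.sqrt(sum(x * x for x in mean))
        # Genes whose documents carry no indexed terms keep an all-zero vector.
        index[gene] = [x / norm for x in mean] if norm else mean
    return index

# Toy inputs (assumption): three indexed documents and two genes.
doc_index = {
    "doc1": [1.0, 0.0, 1.0],
    "doc2": [1.0, 1.0, 0.0],
    "doc3": [0.0, 0.0, 2.0],
}
gene_docs = {"geneA": ["doc1", "doc2"], "geneB": ["doc3"]}
genes = gene_index(doc_index, gene_docs)
print(genes)
```

Averaging keeps a gene with many linked publications from dominating one with only a few, which is one way to handle unbalanced references.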

(61)

TXTGate - a platform to profile groups of genes

TXTGate – case 1

Gene clusters from microarray experiment on human immune response

Comparative study with Chaussabel et al.

TXTGate’s disease vocabulary

(62)

Fusion of text and expression data

Various ways to integrate data
