From Expression to Regulation: the online analysis of microarray data

(1)

From Expression to Regulation:

the online analysis of microarray data

Gert Thijs

K.U.Leuven, Belgium

ESAT-SCD

(2)

http://www.esat.kuleuven.ac.be/~dna/BioI/

K.U.Leuven

• Founded in 1425

• Situated in the center of Belgium

• Some numbers:

– 25,000 students

– 2,500 researchers

– 1,000 professors

– University Hospital with 1,500 beds

(3)

ESAT-SCD



Faculty of Engineering



Mathematical engineering (120)

– Systems and control

– Data mining and Neural Nets

– Biomedical signal processing

– Telecommunications

– Bioinformatics

– Cryptography

(4)

Bioinformatics team

• Research in medical informatics and bioinformatics

• Research on algorithmic methods

• Interdisciplinary team

– 15 researchers (1 full professor, 4 post-docs, 10 Ph.D. students)

– Engineering, physics, mathematics, computer science, biotech, and medicine

• Collaborative research with molecular biologists and clinicians

– VIB MicroArray Facility: primary analysis of microarray data

– University of Gent-VIB, Plant Genetics: motif discovery

– KUL-VIB, Center for Human Genetics

  · Neuronal development in mouse neurons

  · Targets of PLAG1 (pleomorphic adenoma gene 1)

– KUL, Obstetrics and Gynecology

  · Diagnosis of ovarian tumors from ultrasonography (IOTA)

  · Microarray analysis of ovarian tumor biopsies

(5)

Overview

1. Short introduction to microarrays

2. Exploratory analysis of microarray data

3. Clustering gene expression profiles

4. Upstream sequence retrieval

5. Motif finding in sets of co-expressed genes

(6)

cDNA microarrays



Collaboration with VIB microarray facility.



5000 cDNAs (genes, ESTs) spotted on array

– Cy3, Cy5 labeling of samples

– Hybridization (test, control)

– Laser scanning & image analysis

– Arabidopsis, mouse, and human

(7)

Microarray experiment

1. Collecting samples

2. Extracting mRNA

3. Labeling

4. Hybridizing

5. Scanning

6. Visualizing

(8)

Microarray production

Clones

Plasmid preparation

PCR amplification

Reordering

Spotting

Zoom - pins

(9)

From expression to regulation

[Figure: workflow from microarrays through clustering, GenBank and Blast sequence retrieval, to the Gibbs sampler]

(10)

Exploratory data analysis

(11)

Data exploration



Subset selection based on

– Gene Ontology functional classes

– Keywords, gene names



Check the expression profiles of individual genes



Visualize expression profiles of gene families



Link to upstream sequence retrieval

(12)

Gene Ontology

(13)

Subset selection

(14)

Profile inspection

(15)

Profile visualization

(16)

Sequence Retrieval

(17)

Clustering

(18)

Goal of clustering



Exploration of microarray data



Form coherent groups of

– Genes

– Patient samples (e.g., tumors)

– Drug or toxin response



Study these groups to get insight into biological processes

– Genes in the same cluster can have the same function or the same regulation

(19)

K-means



Initialization

– Choose the number of clusters K and start from random positions for the K centers



Iteration

– Assign points to the closest center

– Move each center to the center of mass of the assigned points



Termination

– Stop when the centers have converged or the maximum number of iterations is reached

[Figure: Initialization]
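The three phases above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used in the talk: the function names `kmeans` and `sq_dist`, the `seed` argument, and the choice to start from K randomly picked data points (rather than arbitrary random positions) are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two profiles of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples (expression profiles)."""
    rng = random.Random(seed)
    # Initialization: assumption, K distinct data points serve as starting centers.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):  # Termination: cap on the number of iterations
        # Iteration, part 1: assign every point to its closest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Iteration, part 2: move each center to the center of mass of its points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        if new_centers == centers:  # Termination: centers have converged
            break
        centers = new_centers
    return centers, labels
```

Note that every point receives a label, which is exactly the "all genes forced into a cluster" behaviour discussed later in the talk.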

(20)

K-means

[Figure: Iteration 1]

(21)

K-means

[Figure: Iteration 1]

(22)

K-means

[Figure: Iteration 3]

(23)

Hierarchical clustering

• Construction of a gene tree based on the correlation matrix
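As a rough sketch of how a gene tree can be grown from pairwise correlations, the example below uses the distance 1 - Pearson correlation and average linkage. Both choices, as well as the names `pearson` and `gene_tree`, are assumptions for illustration; the slide does not state which linkage rule was used.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two profiles (fails on constant profiles)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def gene_tree(profiles):
    """Build a tree (nested tuples of gene names) from a dict name -> profile."""
    names = list(profiles)
    # Distance matrix derived from the correlation matrix.
    dist = {frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
    nodes = [(name, [name]) for name in names]  # (subtree, leaf genes below it)

    def linkage(x, y):
        # Average linkage: mean distance over all leaf pairs (an assumption).
        pairs = [dist[frozenset((a, b))] for a in x[1] for b in y[1]]
        return sum(pairs) / len(pairs)

    while len(nodes) > 1:
        # Merge the two closest nodes into a new internal node.
        i, j = min(((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))),
                   key=lambda ij: linkage(nodes[ij[0]], nodes[ij[1]]))
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] + nodes[j][1])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]
```

The naive pair search is cubic in the number of genes, so this form is only meant for small examples.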

(24)

K-means clustering

Need for new clustering algorithms



Noisy genes deteriorate consistency of profiles in cluster



All genes forced into cluster

(25)

Adaptive quality-based clustering



For discovery, biologists are looking for highly coherent, reliable clusters



Other needs for clustering microarray data

– Fast + limited memory (need to analyze thousands of genes)

– No need to specify number of clusters in advance

– Few and intuitive parameters

AQBC = 2-step algorithm

– Cluster center localization

– Cluster radius estimation with EM



Step 1: localization of cluster center

(27)

Step 2: re-estimation of cluster radius



Distance from cluster center randomly distributed except for small group (= cluster elements)



Size of cluster can be estimated automatically by EM



Step 3: remove cluster points and look for new cluster
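To make the idea behind step 2 concrete, the sketch below fits a two-component mixture to the distances from a cluster center with EM and reads off a radius. It is a deliberately simplified stand-in, not the AQBC model itself: the Gaussian components, the function name `fit_two_groups`, and the `min_prob` threshold (playing the role of the quality criterion) are all assumptions made for this example.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)


def fit_two_groups(distances, min_prob=0.95, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture over distances to a center.

    Component 0 starts at the smallest distance (candidate cluster elements),
    component 1 at the largest (background genes). Returns the two means and
    the radius, i.e. the largest distance assigned to component 0 with
    posterior probability >= min_prob (hypothetical quality threshold).
    """
    mu = [min(distances), max(distances)]
    spread = (mu[1] - mu[0]) ** 2 / 4 or 1.0
    var = [spread, spread]
    weight = [0.5, 0.5]
    resp = [0.5] * len(distances)
    for _ in range(n_iter):
        # E-step: posterior probability that each distance is a cluster element.
        resp = []
        for r in distances:
            p0 = weight[0] * normal_pdf(r, mu[0], var[0])
            p1 = weight[1] * normal_pdf(r, mu[1], var[1])
            resp.append(p0 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: re-estimate weight, mean and variance of both components.
        for k, post in enumerate((resp, [1 - g for g in resp])):
            total = sum(post)
            if total == 0:
                continue
            weight[k] = total / len(distances)
            mu[k] = sum(g * r for g, r in zip(post, distances)) / total
            var[k] = max(sum(g * (r - mu[k]) ** 2 for g, r in zip(post, distances)) / total, 1e-4)
    members = [r for r, g in zip(distances, resp) if g >= min_prob]
    radius = max(members) if members else 0.0
    return mu, radius
```

Genes beyond the returned radius would be left out of the cluster, which mirrors the rejection of noisy genes described above.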

(28)

Comparison with K-means

K-means:

User-defined parameters
• Number of clusters
• Number of iterations

Advantages
• Fewer true positives are rejected

Disadvantages
• Outcome sensitive to parameter setting
• Extensive fine-tuning required to find optimal number of clusters
• Separation and merging of clusters based on visual inspection and not on statistical foundation
• No quality criterion: more false positives
• All genes will be clustered (noisy clusters)

A.Q.B.C.:

User-defined parameters
• Quality criterion (QC): a percentage that defines how significantly a cluster should be separated from the background
• Minimal number of genes in a cluster

Advantages
• Outcome not sensitive to parameter setting
• Number of clusters is determined automatically
• Based on QC an optimal radius is calculated for each cluster
• Set of smaller clusters containing genes with highly similar expression profile (fewer false positives)
• Noisy genes are rejected

Disadvantages
• Some information is rejected: clusters too small

(29)

Adaptive Quality-Based Clustering Web Interface

(30)

Cluster results page

Upstream Sequence Retrieval

(31)

Upstream sequence retrieval

(32)

Upstream Sequence Retrieval

1. Identify all genes in cluster based on given accession number and gene name.

2. Delineate upstream region based on sequence annotation.

3. Check for presence of annotated upstream gene.

4. IF upstream gene found THEN select intergenic region

ELSE blast gene to find genomic DNA where gene is annotated.

5. Parse blast reports to find intergenic regions

6. Report results in GFF.
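Steps 2 to 4 and step 6 can be illustrated on a toy annotation. The sketch below is an assumption-laden simplification: genes are plain dictionaries with 1-based inclusive coordinates, the 2000 bp cap and the names `upstream_region` and `gff_line` are invented for the example, and a return value of None stands for the case where no upstream gene is annotated and a Blast search would be needed. The GFF line uses the usual nine tab-separated columns; the source label, feature name, and attribute format are placeholders.

```python
def upstream_region(genes, name, max_len=2000):
    """Return (start, end) of the intergenic region upstream of gene `name`,
    or None when no upstream gene is annotated on the same sequence."""
    target = next(g for g in genes if g["name"] == name)
    if target["strand"] == "+":
        # Upstream lies to the left: look for the nearest gene ending before the target.
        ends = [g["end"] for g in genes if g["end"] < target["start"]]
        if not ends:
            return None
        end = target["start"] - 1
        start = max(max(ends) + 1, end - max_len + 1)
    else:
        # Minus strand: upstream lies to the right of the gene.
        starts = [g["start"] for g in genes if g["start"] > target["end"]]
        if not starts:
            return None
        start = target["end"] + 1
        end = min(min(starts) - 1, start + max_len - 1)
    return (start, end) if start <= end else None


def gff_line(seqname, start, end, strand, gene):
    """Format one region as a nine-column, tab-separated GFF record."""
    fields = [seqname, "sketch", "upstream_region", str(start), str(end),
              ".", strand, ".", f"gene={gene}"]
    return "\t".join(fields)
```

For a gene on the plus strand the region is truncated at the neighbouring gene or at `max_len`, whichever is closer to the gene start.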

(33)

Gene Identification

(34)

Selected sequences & genes to be blasted

(35)

Results blast report parsing

(36)

Selected sequences

(37)

Motif Finding

(38)

Transcriptional regulation



Complex integration of multiple signals determines gene activity



Combinatorial control

(39)

Identifying regulatory elements from expression data



Cluster genes from microarray expression data to build clusters of co-expressed genes



Co-expressed genes may share regulatory mechanisms



Most regulatory sequences are found in the upstream region of the genes (up to 2 kb in A. thaliana)



Motifs that are statistically overrepresented in the upstream regions are candidate regulatory sequences

(40)

Upstream sequence model



Motifs are hidden in noisy background sequence.



Data set contains two types of sequences:

– Sequences with one or more copies of the common motif.

– Sequences with no copy of the common motif.

(41)

Motif Sampler



Algorithm based on the original Gibbs Sampling algorithm (Lawrence et al. 1993, Science 262:208-214)



Probabilistic sequence model



Changes and additions:

– Use of higher-order background model.

– Use of probability distribution to estimate number of copies.

– Different motifs are found and masked in consecutive runs of the algorithm.

– Thijs et al. (2001) Bioinformatics 17(12), 1113-1122

– Thijs et al. (2002) J.Comp.Biol. 9(2), 447-464

(42)

Background model



Representation of DNA sequence by higher-order Markov Chain:

$$P(S \mid B_m) = P(b_1 b_2 \ldots b_m) \prod_{l=m+1}^{L} P(b_l \mid b_{l-1} \ldots b_{l-m})$$

[Figure: intergenic region, core promoter, gene]



A reliable model can be built from selected intergenic DNA sequences.



Intergenic sequence = non-coding region between two consecutive genes.



Only regions that contain core promoter are selected.
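A background model of order m can be estimated by counting how often each base follows each m-base context in the selected intergenic sequences. The sketch below is a minimal illustration under stated assumptions: the names `train_background` and `log_prob` and the pseudocount are made up for the example, and the probability of the first m bases is approximated as uniform rather than estimated separately.

```python
from collections import defaultdict
from math import log


def train_background(sequences, m, pseudo=1.0):
    """Estimate P(base | previous m bases) from a list of DNA strings.

    Returns a function prob(context, base). The pseudocount (an assumption)
    keeps unseen transitions from getting probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for l in range(m, len(seq)):
            counts[seq[l - m:l]][seq[l]] += 1

    def prob(context, base):
        row = counts.get(context, {})
        return (row.get(base, 0.0) + pseudo) / (sum(row.values()) + 4 * pseudo)

    return prob


def log_prob(seq, m, prob):
    """Log-probability of a sequence under the order-m background model."""
    total = m * log(0.25)  # assumption: uniform probability for the first m bases
    for l in range(m, len(seq)):
        total += log(prob(seq[l - m:l], seq[l]))
    return total
```

Working in log space avoids numerical underflow for upstream regions that are hundreds of bases long.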

(43)

Algorithm: Initialization

[Equation: sequence score under the background model B_m and under a motif of width W at position x]

• Calculate background model score

• Start from random set of motif positions

• Create initial motif model

(44)

Algorithm: iterative procedure

1. Score sequences with current motif model

2. Calculate distribution

3. Sample new alignment position

4. Iterate for fixed number of steps
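One pass through steps 1 to 3 for a single sequence might look as follows. This is a simplified sketch rather than the Motif Sampler itself: it uses a zero-order background (a dictionary of base frequencies) instead of the higher-order model, assumes one motif copy per sequence, and the names `motif_model`, `position_distribution`, and `sample_position` are illustrative. Treating the weight at position x as the ratio of motif to background probability is likewise an assumption.

```python
import random

BASES = "ACGT"


def motif_model(sites, pseudo=0.5):
    """Position frequency matrix built from aligned sites of equal width."""
    model = []
    for w in range(len(sites[0])):
        column = [s[w] for s in sites]
        model.append({b: (column.count(b) + pseudo) / (len(sites) + 4 * pseudo)
                      for b in BASES})
    return model


def position_distribution(seq, model, background):
    """Steps 1 and 2: score every start position, then normalise the scores."""
    width = len(model)
    weights = []
    for x in range(len(seq) - width + 1):
        ratio = 1.0
        for w in range(width):
            base = seq[x + w]
            ratio *= model[w][base] / background[base]  # motif vs. background
        weights.append(ratio)
    total = sum(weights)
    return [v / total for v in weights]


def sample_position(seq, model, background, rng):
    """Step 3: draw a new alignment position from the distribution."""
    dist = position_distribution(seq, model, background)
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]
```

In a full sampler this draw would be repeated for each sequence in turn, rebuilding the motif model from the other sequences' current positions, for a fixed number of steps.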

(45)

Algorithm: Convergence

Select best scoring positions from Wx to create motif and alignment

(46)

Motif Sampler

(47)

Motif Sampler results page

(48)

Example: Plant wounding



150 Arabidopsis genes



Mechanical plant wounding



7 (or 8) time points over a 24h period



Adaptive quality-based clustering produces 8 clusters of which 4 contain 5 or more genes.



Search for a motif of length 8 and a motif of length 12 in 4 clusters

Reymond, P. et al. 2000. Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis. Plant Cell 12(5): 707-720.

(49)

Results: Cluster 1

TAArTAAGTCAC   7   TGAGTCA     tissue specific GCN4-motif
                   CGTCA       MeJA-responsive element
ATTCAAATTT     8   ATACAAAT    element associated to GCN4-motif
CTTCTTCGATCT   5   TTCGACC     elicitor responsive element

(50)

Results: Cluster 2

CCCGCGTTTCAA   4   CCCCCG      enhancer like element
TTGACyCGy      5   TGACG       MeJA responsive element
                   (T)TGAC(C)  Box-W1, elicitor responsive element
mACGTCACct     7   CGTCA       MeJA responsive element
                   ACGT        Abscisic acid response element

(51)

Results: Cluster 4

wATATATATmTT   5   TATATA      TATA-box like element
TCTwCnTC       9   TCTCCCT     TCCC-motif, part of light responsive element
ATAAATAkGCnT   7   -           -

(52)

Results: Cluster 8

yTGACCGTCcsa   9   CCGTCC      meristem specific activation of H4 gene
                   CCGTCC      A-box, light or elicitor responsive element
                   TGACG       MeJA responsive element
                   CGTCA       MeJA responsive element
CACGTGG        5   CACGTG      G-box, light responsive element
                   ACGT        Abscisic acid response element
GCCTymTT       8   -           -
AGAATCAAT      6   -           -

(53)

Conclusions



Gene expression data can reveal useful information on transcriptional regulation.



Adaptive quality-based clustering finds coherent groups of co-expressed genes.



Use of higher-order background models improves performance of Motif Sampler.



INCLUSive enables online analysis from clustering to motif finding

(54)

Acknowledgements

ESAT-SCD

• Prof. Bart De Moor

• Dr. Yves Moreau

• Dr. Kathleen Marchal

• Frank De Smet

• Stein Aerts

• all others

STWW Project

• Pierre Rouzé (VIB Gent, INRA)

• Stephane Rombauts (VIB Gent, INRA)

• Magali Lescot (LGPD, Marseille)