From Expression to Regulation:
the online analysis of microarray data
Gert Thijs
K.U.Leuven, Belgium
ESAT-SCD
http://www.esat.kuleuven.ac.be/~dna/BioI/
K.U.Leuven
• Founded in 1425
• Situated in the center of Belgium
• Some numbers:
• 25,000 students
• 2,500 researchers
• 1,000 professors
• University Hospital with 1,500 beds
ESAT-SCD
Faculty of Engineering
Mathematical engineering (120)
– Systems and control
– Data mining and Neural Nets
– Biomedical signal processing
– Telecommunications
– Bioinformatics
– Cryptography
Bioinformatics team
Research in medical informatics and bioinformatics
Research on algorithmic methods
Interdisciplinary team
– 15 researchers (1 full professor, 4 post-docs, 10 Ph.D. students)
– Engineering, physics, mathematics, computer science, biotech, and medicine
Collaborative research with molecular biologists and clinicians
– VIB MicroArray Facility: primary analysis of microarray data
– University of Gent-VIB, Plant Genetics: motif discovery
– KUL-VIB, Center for Human Genetics
Neuronal development in mouse neurons
Targets of PLAG1 (pleiomorphic adenoma gene)
– KUL, Obstetrics and Gynecology
Diagnosis of ovarian tumors from ultrasonography (IOTA)
Microarray analysis of ovarian tumor biopsies
Overview
1. Short introduction to microarrays
2. Exploratory analysis of microarray data
3. Clustering gene expression profiles
4. Upstream sequence retrieval
5. Motif finding in sets of co-expressed genes
cDNA microarrays
Collaboration with VIB microarray facility.
5000 cDNAs (genes, ESTs) spotted on array
– Cy3, Cy5 labeling of samples
– Hybridization (test, control)
– Laser scanning & image analysis
– Arabidopsis, mouse, and human
Microarray experiment
1. Collecting samples
2. Extracting mRNA
3. Labeling
4. Hybridizing
5. Scanning
6. Visualizing
Microarray production
Clones
Plasmid preparation
PCR amplification
Reordering
Spotting
Zoom - pins
From expression to regulation
[Figure: workflow linking Microarrays, Clustering, GenBank, Blast, and Gibbs sampler; example accession numbers A1234 and Z4321]
Exploratory data analysis
Data exploration
Subset selection based on
– Gene Ontology functional classes
– Keywords, gene names
Check the expression profiles of individual genes
Visualization of expression profiles of gene families
Link to upstream sequence retrieval
Gene Ontology
Subset selection
Profile inspection
Profile visualization
Sequence Retrieval
Clustering
Goal of clustering
Exploration of microarray data
Form coherent groups of
– Genes
– Patient samples (e.g., tumors)
– Drug or toxin response
Study these groups to get insight into biological processes
– Genes in the same cluster can have the same function or the same regulation
K-means
Initialization
– Choose the number of clusters K and start from random positions for the K centers
Iteration
– Assign points to the closest center
– Move each center to the center of mass of the assigned points
Termination
– Stop when the centers have converged or the maximum number of iterations is reached
[Figure: K-means shown at initialization, iteration 1, and iteration 3]
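The three K-means phases listed on the slide (initialization, iteration, termination) can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the talk: the function name `kmeans`, its parameters, and the choice to start from K randomly chosen data points (rather than arbitrary random positions) are assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Initialization: assumption, use K randomly chosen data points as centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Iteration, part 1: assign each point to the closest center
        # (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        # Termination: unchanged assignments mean the centers no longer move.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Iteration, part 2: move each center to the center of mass of its points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment
```

Note that every point receives a cluster label, which is the "all genes forced into a cluster" behavior criticized later in the talk.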
Hierarchical clustering
Construction of gene tree based on correlation matrix
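As a rough illustration of building a gene tree from a correlation matrix, the sketch below computes pairwise Pearson correlations between expression profiles and merges the two closest groups repeatedly. The distance 1 - r, the average-linkage rule, and the names `pearson`, `leaves`, and `gene_tree` are assumptions for this example; the slide does not say which linkage or distance was used. Profiles are assumed to be non-constant (a constant profile would give a zero denominator).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def leaves(node):
    """Gene names below a tree node (a name, or a pair of child nodes)."""
    return [node] if isinstance(node, str) else leaves(node[0]) + leaves(node[1])

def gene_tree(profiles):
    """Average-linkage tree over a dict {gene name: profile}; returns nested pairs."""
    genes = list(profiles)
    # Correlation-based distance for every ordered pair of genes.
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in genes for b in genes}
    nodes = list(genes)
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                la, lb = leaves(nodes[i]), leaves(nodes[j])
                # Average distance between all members of the two groups.
                d = sum(dist[(p, q)] for p in la for q in lb) / (len(la) * len(lb))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]
```

The nested pairs returned by `gene_tree` correspond to the branching order of the tree; cutting it at a chosen depth yields clusters.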
K-means clustering
Need for new clustering algorithms
Noisy genes deteriorate the consistency of the profiles within a cluster
All genes are forced into a cluster
Adaptive quality-based clustering
For discovery, biologists are looking for highly coherent, reliable clusters
Other needs for clustering microarray data
– Fast + limited memory (need to analyze thousands of genes)
– No need to specify number of clusters in advance
– Few and intuitive parameters
AQBC = 2 step algorithm
– Cluster center localization
– Cluster radius estimation with EM
Read more:
– De Smet et al. (2002) Bioinformatics, in press.
Step 1: localization of cluster center
Step 2: re-estimation of cluster radius
Distance from cluster center randomly distributed except for small group (= cluster elements)
Size of cluster can be estimated automatically by EM
Step 3: remove cluster points and look for new cluster
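The idea behind step 2 can be illustrated with a deliberately simplified sketch: given the distances of all genes to a candidate cluster center, fit a two-component mixture with EM (one component for the cluster, one for the background) and keep the largest distance whose posterior probability of belonging to the cluster reaches the quality criterion. This is not the model from De Smet et al.; a one-dimensional Gaussian mixture is swapped in purely for illustration, and the names `normal_pdf`, `estimate_radius`, `qc`, and `n_iter` are hypothetical.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def estimate_radius(dists, qc=0.95, n_iter=50):
    """Radius = largest distance whose cluster posterior is at least qc (sketch)."""
    n = len(dists)
    mean = sum(dists) / n
    spread = sqrt(sum((d - mean) ** 2 for d in dists) / n) or 1.0
    # Assumed starting point: cluster component near the smallest distance,
    # background component near the largest one.
    mu_c, mu_b = min(dists), max(dists)
    sd_c = sd_b = spread
    w = 0.5  # mixing weight of the cluster component
    resp = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each distance belongs to the cluster.
        resp = []
        for d in dists:
            pc = w * normal_pdf(d, mu_c, sd_c)
            pb = (1 - w) * normal_pdf(d, mu_b, sd_b)
            resp.append(pc / (pc + pb))
        # M-step: re-estimate both components from the weighted distances.
        rc = sum(resp)
        rb = n - rc
        mu_c = sum(r * d for r, d in zip(resp, dists)) / rc
        mu_b = sum((1 - r) * d for r, d in zip(resp, dists)) / rb
        sd_c = max(sqrt(sum(r * (d - mu_c) ** 2 for r, d in zip(resp, dists)) / rc), 1e-6)
        sd_b = max(sqrt(sum((1 - r) * (d - mu_b) ** 2 for r, d in zip(resp, dists)) / rb), 1e-6)
        w = rc / n
    inside = [d for d, r in zip(dists, resp) if r >= qc]
    return max(inside) if inside else None
```

Genes within the returned radius would form the cluster; in step 3 they are removed and the search restarts on the remaining genes.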
Comparison with K-means

K-means:
User-defined parameters
• Number of clusters
• Number of iterations
Advantages
• Fewer true positives are rejected
Disadvantages
• Outcome sensitive towards parameter setting
• Extensive fine-tuning required to find optimal number of clusters
• Separation and merging of clusters based on visual inspection and not on statistical foundation
• No quality criterion: more false positives
• All genes will be clustered (noisy clusters)

A.Q.B.C.:
User-defined parameters
• Quality criterion (QC): percentage that defines how significantly a cluster should be separated from the background
• Minimal number of genes in a cluster
Advantages
• Outcome not sensitive to parameter setting
• Number of clusters is determined automatically
• Based on QC an optimal radius is calculated for each cluster
• Set of smaller clusters containing genes with highly similar expression profile (fewer false positives)
• Noisy genes are rejected
Disadvantages
• Some information is rejected: clusters too small
Adaptive Quality-Based Clustering Web Interface
Cluster results page
Upstream Sequence Retrieval
Upstream sequence retrieval
Upstream Sequence Retrieval
1. Identify all genes in the cluster based on the given accession number and gene name.
2. Delineate the upstream region based on the sequence annotation.
3. Check for the presence of an annotated upstream gene.
4. IF an upstream gene is found THEN select the intergenic region
ELSE BLAST the gene to find genomic DNA where the gene is annotated.
5. Parse the BLAST reports to find intergenic regions.
6. Report the results in GFF.
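Steps 2 to 4 and step 6 can be sketched for the simple case where the gene and its neighbors are already annotated on the same sequence. The `Gene` record, the function names, the 2000 bp default, and the GFF source and feature labels (`sketch`, `upstream_region`) are all hypothetical choices for this example; the BLAST fallback of steps 4 and 5 is not covered, and the end of the sequence is not checked for genes on the reverse strand.

```python
from collections import namedtuple

# Coordinates are 1-based and inclusive, as in GFF.
Gene = namedtuple("Gene", "name seqid start end strand")

def upstream_region(gene, annotated, max_len=2000):
    """Upstream window of at most max_len bases, cut at the nearest upstream gene."""
    others = [g for g in annotated if g.seqid == gene.seqid and g.name != gene.name]
    if gene.strand == "+":
        lo, hi = max(1, gene.start - max_len), gene.start - 1
        ends = [g.end for g in others if g.end < gene.start]
        if ends:  # annotated upstream gene found: keep only the intergenic part
            lo = max(lo, max(ends) + 1)
    else:
        lo, hi = gene.end + 1, gene.end + max_len
        starts = [g.start for g in others if g.start > gene.end]
        if starts:
            hi = min(hi, min(starts) - 1)
    return (lo, hi) if lo <= hi else None

def gff_line(gene, region):
    """One tab-separated GFF record (nine columns) for an upstream region."""
    lo, hi = region
    fields = [gene.seqid, "sketch", "upstream_region", str(lo), str(hi),
              ".", gene.strand, ".", f"gene={gene.name}"]
    return "\t".join(fields)
```

For example, with gene A at 1000..2000 and gene B at 2500..3500 on the forward strand, the upstream region of B is limited to the intergenic stretch 2001..2499 rather than the full 2000 bases.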
Gene Identification
Selected sequences & genes to be blasted
Results blast report parsing
Selected sequences
Motif Finding
Transcriptional regulation
Complex integration of multiple signals determines gene activity
Combinatorial control
Identifying regulatory elements from expression data
Cluster genes from microarray expression data to build clusters of co-expressed genes
Co-expressed genes may share regulatory mechanisms
Most regulatory sequences are found in the upstream region of the genes (up to 2 kb in A. thaliana)
Motifs that are statistically overrepresented in the upstream regions are candidate regulatory sequences
Upstream sequence model
Motifs are hidden in noisy background sequence.
Data set contains two types of sequences:
– Sequences with one or more copies of the common motif.
– Sequences with no copy of the common motif.
Motif Sampler
Algorithm based on the original Gibbs Sampling algorithm (Lawrence et al. 1993, Science 262:208-214)
Probabilistic sequence model
Changes and additions:
– Use of higher-order background model.
– Use of probability distribution to estimate number of copies.
– Different motifs are found and masked in consecutive runs of the algorithm.
Read more:
– Thijs et al. (2001) Bioinformatics 17(12), 1113-1122
– Thijs et al. (2002) J.Comp.Biol. 9(2), 447-464
Background model
Representation of DNA sequence by higher-order Markov Chain:
$$P(S \mid B_m) = P(b_1, \dots, b_m) \prod_{l=m+1}^{L} P(b_l \mid b_{l-1}, \dots, b_{l-m})$$
[Figure: core promoter, gene, intergenic region]
A reliable model can be built from selected intergenic DNA sequences.
Intergenic sequence = non-coding region between two consecutive genes.
Only regions that contain core promoter are selected.
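A higher-order Markov background model of this kind can be estimated by counting which base follows each context of m bases in the selected intergenic sequences. The sketch below is an illustration under stated assumptions, not the Motif Sampler implementation: the names `train_background` and `log_prob`, the pseudocount, the uniform probability used for the first m bases, and the uniform fallback for unseen contexts are all choices made for the example. Sequences are assumed to contain only uppercase A, C, G, and T.

```python
from collections import defaultdict
from math import log

def train_background(seqs, order, pseudo=1.0):
    """Estimate P(b_l | previous `order` bases) from a list of DNA strings."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudo))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {base: c / total for base, c in row.items()}
    return model

def log_prob(seq, model, order):
    """log P(S | B_m) for one sequence under the trained model."""
    # Assumption: the first m bases are scored with a uniform distribution.
    lp = min(order, len(seq)) * log(0.25)
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        # Assumption: contexts never seen in training fall back to uniform.
        lp += log(probs[seq[i]] if probs else 0.25)
    return lp
```

Working with log probabilities avoids numerical underflow when the product over a long upstream sequence becomes very small.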
Algorithm: Initialization
Calculate background model score
Start from random set of motif positions
Create initial motif model
$$P_0 = P(b_1, \dots, b_m) \prod_{l=m+1}^{L} P(b_l \mid b_{l-1}, \dots, b_{l-m})$$
$$W_x = \prod_{l=x}^{x+W-1} \frac{P(b_l \mid \theta_W)}{P(b_l \mid b_{l-1}, \dots, b_{l-m}, B_m)}$$
Algorithm: iterative procedure
1. Score sequences with current motif model
2. Calculate distribution
3. Sample new alignment position
4. Iterate for fixed number of steps
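The loop above can be sketched as a basic Gibbs sampler in the spirit of Lawrence et al. (1993). To stay short, this illustration departs from the Motif Sampler in two labeled ways: it assumes exactly one motif copy per sequence, and it uses single-base background frequencies instead of a higher-order model. The names `motif_counts` and `gibbs_sampler` and all parameter defaults are hypothetical. Sequences are assumed to contain only uppercase A, C, G, and T and to be at least `width` bases long.

```python
import random

def motif_counts(seqs, positions, width, skip, pseudo=0.5):
    """Per-position base counts of the current alignment, leaving out one sequence."""
    counts = [dict.fromkeys("ACGT", pseudo) for _ in range(width)]
    for i, (seq, pos) in enumerate(zip(seqs, positions)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][seq[pos + j]] += 1
    return counts

def gibbs_sampler(seqs, width, n_steps=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    joined = "".join(seqs)
    # Assumption: zero-order background (plain base frequencies).
    bg = {b: joined.count(b) / len(joined) for b in "ACGT"}
    # Initialization: random set of motif positions.
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_steps):  # iterate for a fixed number of steps
        for i, seq in enumerate(seqs):
            # Motif model built from all other sequences.
            counts = motif_counts(seqs, positions, width, skip=i)
            total = sum(counts[0].values())
            # Score every candidate position: motif probability over background.
            weights = []
            for x in range(len(seq) - width + 1):
                w = 1.0
                for j in range(width):
                    base = seq[x + j]
                    w *= (counts[j][base] / total) / bg[base]
                weights.append(w)
            # Sample a new alignment position from the resulting distribution.
            positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions
```

Because positions are sampled rather than chosen greedily, different seeds can end in different alignments; running the sampler several times and comparing the results is a common way to judge stability.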
Algorithm: Convergence
Select best scoring positions from Wx to create motif and alignment
Motif Sampler
Motif Sampler results page
Example: Plant wounding
150 Arabidopsis genes
Mechanical plant wounding
7 (or 8) time points over a 24h period
Adaptive quality-based clustering produces 8 clusters of which 4 contain 5 or more genes.
Search for a motif of length 8 and a motif of length 12 in 4 clusters
Reymond, P. et al. 2000. Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis. Plant Cell 12(5): 707-20.
Results: Cluster 1
TAArTAAGTCAC   7   TGAGTCA    tissue specific GCN4-motif
                   CGTCA      MeJA-responsive element
ATTCAAATTT     8   ATACAAAT   element associated to GCN4-motif
CTTCTTCGATCT   5   TTCGACC    elicitor responsive element
Results: Cluster 2
CCCGCGTTTCAA   4   CCCCCG       enhancer like element
TTGACyCGy      5   TGACG        MeJA responsive element
                   (T)TGAC(C)   Box-W1, elicitor responsive element
mACGTCACct     7   CGTCA        MeJA responsive element
                   ACGT         Abscisic acid response element
Results: Cluster 4
wATATATATmTT   5   TATATA    TATA-box like element
TCTwCnTC       9   TCTCCCT   TCCC-motif, part of light responsive element
ATAAATAkGCnT   7   -         -
Results: Cluster 8
yTGACCGTCcsa   9   CCGTCC   meristem specific activation of H4 gene
                   CCGTCC   A-box, light or elicitor responsive element
                   TGACG    MeJA responsive element
                   CGTCA    MeJA responsive element
CACGTGG        5   CACGTG   G-box, light responsive element
                   ACGT     Abscisic acid response element
GCCTymTT       8   -        -
AGAATCAAT      6   -        -
Conclusions
Gene expression data can reveal useful information on transcriptional regulation.
Adaptive quality-based clustering finds coherent groups of co-expressed genes.
Use of higher-order background models improves performance of Motif Sampler.
INCLUSive enables online analysis from clustering to motif finding
Acknowledgements
ESAT-SCD
• Prof. Bart De Moor
• Dr. Yves Moreau
• Dr. Kathleen Marchal
• Frank De Smet
• Stein Aerts
• all others
STWW Project
• Pierre Rouzé (VIB Gent, INRA)
• Stephane Rombauts (VIB Gent, INRA)
• Magali Lescot (LGPD, Marseille)