(1)

From Expression to Regulation:

the online analysis of microarray data

Gert Thijs

K.U.Leuven, Belgium

ESAT-SCD

(2)

http://www.esat.kuleuven.ac.be/~dna/BioI/

K.U.Leuven

Founded in 1425

Situated in the center of Belgium

Some numbers:

25,000 students

2,500 researchers

1,000 professors

University Hospital with 1,500 beds

(3)

ESAT-SCD

Faculty of Engineering

Mathematical engineering (120)

Systems and control

Data mining and Neural Nets

Biomedical signal processing

Telecommunications

Bioinformatics

Cryptography

(4)

Bioinformatics team

Research in medical informatics and bioinformatics

Research on algorithmic methods

Interdisciplinary team

15 researchers (1 full professor, 4 post-docs, 10 Ph.D. students)

Engineering, physics, mathematics, computer science, biotech, and medicine

Collaborative research with molecular biologists and clinicians

VIB MicroArray Facility: primary analysis of microarray data

University of Gent-VIB, Plant Genetics: motif discovery

KUL-VIB, Center for Human Genetics

Neuronal development in mouse neurons

Targets of PLAG1 (pleiomorphic adenoma gene)

KUL, Obstetrics and Gynecology

Diagnosis of ovarian tumors from ultrasonography (IOTA)

Microarray analysis of ovarian tumor biopsies

(5)

Overview

1. Short introduction to microarrays

2. Exploratory analysis of microarray data

3. Clustering gene expression profiles

4. Upstream sequence retrieval

5. Motif finding in sets of co-expressed genes

(6)

cDNA microarrays

Collaboration with VIB microarray facility.

5000 cDNAs (genes, ESTs) spotted on array

Cy3, Cy5 labeling of samples

Hybridization (test, control)

Laser scanning & image analysis

Arabidopsis, mouse, and human

(7)

Microarray experiment

1. Collecting samples

2. Extracting mRNA

3. Labeling

4. Hybridizing

5. Scanning

6. Visualizing

(8)

Microarray production

Clones

Plasmid preparation

PCR amplification

Reordering

Spotting

Zoom - pins

(9)

From expression to regulation

[Figure: workflow linking microarrays, clustering, GenBank, Blast, and the Gibbs sampler]

(10)

Exploratory data analysis

(11)

Data exploration

Subset selection based on

Gene Ontology functional classes

Keywords, gene names

Check the expression profiles of individual genes

Visualize the expression profiles of gene families

Link to upstream sequence retrieval

(12)

Gene Ontology

(13)

Subset selection

(14)

Profile inspection

(15)

Profile visualization

(16)

Sequence Retrieval

(17)

Clustering

(18)

Goal of clustering

Exploration of microarray data

Form coherent groups of

Genes

Patient samples (e.g., tumors)

Drug or toxin response

Study these groups to get insight into biological processes

Genes in the same cluster can have the same function or the same regulation

(19)

K-means

Initialization

Choose the number of clusters K and start from random positions for the K centers

Iteration

Assign points to the closest center

Move each center to the center of mass of the assigned points

Termination

Stop when the centers have converged or the maximum number of iterations is reached

[Figure panels: Initialization, Iteration 1, Iteration 3]
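
The three K-means steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the talk: the function name `kmeans`, the toy `profiles`, the squared Euclidean distance, and the choice to start from K randomly picked data points are all assumptions.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (lists of floats) into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    # Initialization (assumption): use k randomly chosen data points as centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Iteration, part 1: assign each point to the closest center.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Iteration, part 2: move each center to the center of mass of its points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        # Termination: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Toy expression profiles (hypothetical data): two clearly separated groups.
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centers, labels = kmeans(profiles, k=2)
print(labels)
```

Note that every point receives a label, which is the "all genes forced into a cluster" behaviour that motivates quality-based alternatives.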

(23)

Hierarchical clustering

Construction of gene tree based on correlation matrix
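
As a rough illustration of building a gene tree from correlations, the sketch below uses 1 minus the Pearson correlation as distance and merges the two closest groups repeatedly. The average-linkage rule, the function names `pearson` and `gene_tree`, and the toy profiles are assumptions; the slide does not state which linkage was used.

```python
from itertools import combinations
from statistics import mean

def pearson(a, b):
    """Pearson correlation between two equally long expression profiles."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def gene_tree(profiles):
    """Average-linkage tree; returns nested tuples of gene names."""
    names = list(profiles)
    # Distance matrix derived from the correlation matrix: d = 1 - r.
    dist = {frozenset(pair): 1.0 - pearson(profiles[pair[0]], profiles[pair[1]])
            for pair in combinations(names, 2)}
    # Each node is (subtree, member gene names).
    nodes = [(name, [name]) for name in names]
    while len(nodes) > 1:
        def linkage(pair):
            (_, m1), (_, m2) = pair
            return mean(dist[frozenset((a, b))] for a in m1 for b in m2)
        # Merge the two nodes with the smallest average pairwise distance.
        left, right = min(combinations(nodes, 2), key=linkage)
        nodes = [n for n in nodes if n is not left and n is not right]
        nodes.append(((left[0], right[0]), left[1] + right[1]))
    return nodes[0][0]

# Hypothetical profiles: g1/g2 rise over time, g3/g4 fall.
tree = gene_tree({
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4.5, 2],
})
print(tree)
```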

(24)

K-means clustering

Need for new clustering algorithms

Noisy genes deteriorate consistency of profiles in cluster

All genes forced into cluster

(25)

Adaptive quality-based clustering

For discovery, biologists are looking for highly coherent, reliable clusters

Other needs for clustering microarray data

Fast + limited memory (need to analyze thousands of genes)

No need to specify number of clusters in advance

Few and intuitive parameters

AQBC = 2 step algorithm

Cluster center localization

Cluster radius estimation with EM

Read more:

De Smet et al. (2002) Bioinformatics, in press.

(26)

Step 1: localization of cluster center

(27)

Step 2: re-estimation of cluster radius

Distance from cluster center randomly distributed except for small group (= cluster elements)

Size of cluster can be estimated automatically by EM

Step 3: remove cluster points and look for new cluster

(28)

Comparison with K-means

A.Q.B.C.

User-defined parameters

• Quality criterion (QC): a percentage that defines how significantly a cluster should be separated from the background

• Minimal number of genes in a cluster

Advantages

• Outcome not sensitive to parameter setting

• Number of clusters is determined automatically

• Based on QC, an optimal radius is calculated for each cluster

• Set of smaller clusters containing genes with highly similar expression profiles (fewer false positives)

• Noisy genes are rejected

Disadvantages

• Some information is rejected: clusters too small

K-means

User-defined parameters

• Number of clusters

• Number of iterations

Disadvantages

• Outcome sensitive to parameter setting

• Extensive fine-tuning required to find the optimal number of clusters

• Separation and merging of clusters based on visual inspection and not on a statistical foundation

• No quality criterion: more false positives

• All genes will be clustered (noisy clusters)

Advantages

• Fewer true positives are rejected

(29)

Adaptive Quality-Based Clustering Web Interface

(30)

Cluster results page

Upstream Sequence Retrieval

(31)

Upstream sequence retrieval

(32)

Upstream Sequence Retrieval

1. Identify all genes in cluster based on given accession number and gene name.

2. Delineate upstream region based on sequence annotation.

3. Check for presence of annotated upstream gene.

4. IF an upstream gene is found THEN select the intergenic region

ELSE BLAST the gene to find genomic DNA where the gene is annotated.

5. Parse the BLAST reports to find intergenic regions.

6. Report results in GFF.
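
The decision in step 4 and the GFF report in step 6 can be sketched as follows. This is a simplified illustration under stated assumptions: genes are plain dictionaries on the forward strand, the names `select_upstream` and `gff_line` are hypothetical, the source and feature labels in the GFF line are placeholders, and the BLAST fallback is only signalled, not performed.

```python
def select_upstream(gene, annotated_genes):
    """Return (start, end) of the intergenic region upstream of `gene`,
    or None when no upstream gene is annotated (the BLAST branch of step 4)."""
    upstream = [g for g in annotated_genes
                if g["seqid"] == gene["seqid"] and g["end"] < gene["start"]]
    if not upstream:
        return None  # the caller would BLAST the gene against genomic DNA
    nearest = max(upstream, key=lambda g: g["end"])
    # Intergenic region: from just after the upstream gene to just before this one.
    return nearest["end"] + 1, gene["start"] - 1

def gff_line(seqid, start, end, gene_name):
    """Format one record with the nine tab-separated GFF columns:
    seqname, source, feature, start, end, score, strand, frame, attributes."""
    return "\t".join([seqid, "sketch", "upstream_region",
                      str(start), str(end), ".", "+", ".", f"gene {gene_name}"])

# Hypothetical annotation of two genes on one genomic sequence.
genes = [
    {"name": "geneA", "seqid": "chr1", "start": 200, "end": 1200},
    {"name": "geneB", "seqid": "chr1", "start": 3000, "end": 4500},
]
region = select_upstream(genes[1], genes)
record = gff_line("chr1", region[0], region[1], "geneB")
needs_blast = select_upstream(genes[0], genes) is None
print(record)
```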

(33)

Gene Identification

(34)

Selected sequences & genes to be blasted

(35)

Results blast report parsing

(36)

Selected sequences

(37)

Motif Finding

(38)

Transcriptional regulation

Complex integration of multiple signals determines gene activity

Combinatorial control

(39)

Identifying regulatory elements from expression data

Cluster genes from microarray expression data to build clusters of co-expressed genes

Co-expressed genes may share regulatory mechanisms

Most regulatory sequences are found in the upstream region of the genes (up to 2 kb in A. thaliana)

Motifs that are statistically overrepresented in the upstream regions are candidate regulatory sequences

(40)

Upstream sequence model

Motifs are hidden in noisy background sequence.

Data set contains two types of sequences:

Sequences with one or more copies of the common motif.

Sequences with no copy of the common motif.

(41)

Motif Sampler

Algorithm based on the original Gibbs Sampling algorithm (Lawrence et al. 1993, Science 262:208-214)

Probabilistic sequence model

Changes and additions:

Use of higher-order background model.

Use of probability distribution to estimate number of copies.

Different motifs are found and masked in consecutive runs of the algorithm.

Read more:

Thijs et al. (2001) Bioinformatics 17(12), 1113-1122

Thijs et al. (2002) J.Comp.Biol. 9(2), 447-464

(42)

Background model

Representation of DNA sequence by higher-order Markov chain:

$$P(S \mid B_m) = P(b_1 b_2 \ldots b_m) \prod_{l=m+1}^{L} P(b_l \mid b_{l-1} \ldots b_{l-m})$$

[Figure: intergenic region, core promoter, gene]

A reliable model can be built from selected intergenic DNA sequences.

Intergenic sequence = non-coding region between two consecutive genes.

Only regions that contain core promoter are selected.
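
To make the higher-order background model concrete, the sketch below estimates the transition probabilities of an order-m Markov chain from a set of sequences and scores a new sequence with them. The function names, the pseudocount, the uniform fallback for unseen contexts, and the uniform treatment of the first m bases are assumptions for illustration; sequences are assumed to contain only A, C, G, and T.

```python
import math
from collections import defaultdict

def train_background(sequences, m, pseudocount=1.0):
    """Estimate P(b_l | previous m bases) from intergenic training sequences."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudocount))
    for seq in sequences:
        for l in range(m, len(seq)):
            counts[seq[l - m:l]][seq[l]] += 1
    # Normalize the counts of each context into a probability distribution.
    return {ctx: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for ctx, nxt in counts.items()}

def log_background_score(seq, model, m):
    """log P(S | B_m), with the first m bases scored as uniform (assumption)."""
    uniform = dict.fromkeys("ACGT", 0.25)  # fallback for contexts never seen
    score = m * math.log(0.25)
    for l in range(m, len(seq)):
        score += math.log(model.get(seq[l - m:l], uniform)[seq[l]])
    return score

# Toy training set (hypothetical): an alternating AC repeat, first-order model.
model = train_background(["ACACACACAC"], m=1)
typical = log_background_score("ACAC", model, m=1)
atypical = log_background_score("AAAA", model, m=1)
print(typical, atypical)
```

A sequence that resembles the training data scores higher than one that does not, which is what lets the sampler separate motif instances from background.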

(43)

Background model score:

$$P(S \mid B_m) = P(b_1 b_2 \ldots b_m) \prod_{l=m+1}^{L} P(b_l \mid b_{l-1} \ldots b_{l-m})$$

[Equation: position weights W_x computed from the motif model and the background model]

Algorithm: Initialization

Calculate background model score

Start from random set of motif positions

Create initial motif model

(44)

Algorithm: iterative procedure

1. Score sequences with current motif model

2. Calculate distribution

3. Sample new alignment position

4. Iterate for fixed number of steps
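
Steps 1 to 3 for a single sequence might look like the sketch below. It is a simplification, not the Motif Sampler itself: the weight of each start position is a plain likelihood ratio against a zero-order background (the talk uses a higher-order model), and the names `position_weights`, `sample_position`, and `column` are hypothetical.

```python
import random

def position_weights(seq, pwm, background):
    """Step 1: likelihood ratio motif/background for each possible start position."""
    w = len(pwm)
    weights = []
    for x in range(len(seq) - w + 1):
        ratio = 1.0
        for j, base in enumerate(seq[x:x + w]):
            ratio *= pwm[j][base] / background[base]
        weights.append(ratio)
    return weights

def sample_position(seq, pwm, background, rng):
    """Steps 2 and 3: normalize the weights and draw a new motif position."""
    weights = position_weights(seq, pwm, background)
    total = sum(weights)
    distribution = [v / total for v in weights]
    x = rng.choices(range(len(distribution)), weights=distribution, k=1)[0]
    return x, distribution

def column(best):
    # Hypothetical motif column that strongly prefers one base.
    return {b: (0.97 if b == best else 0.01) for b in "ACGT"}

pwm = [column(b) for b in "TGAC"]            # toy motif model for TGAC
background = dict.fromkeys("ACGT", 0.25)     # zero-order background (assumption)
x, distribution = sample_position("AAAATGACAAAA", pwm, background, random.Random(0))
best = max(range(len(distribution)), key=distribution.__getitem__)
print(best, round(distribution[best], 3))
```

In a full run this draw is repeated over all sequences for a fixed number of steps, with the motif model re-estimated from the sampled positions.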

(45)

Algorithm: Convergence

Select best scoring positions from Wx to create motif and alignment

(46)

Motif Sampler

(47)

Motif Sampler results page

(48)

Example: Plant wounding

150 Arabidopsis genes

Mechanical plant wounding

7 (or 8) time points over a 24h period

Adaptive quality-based clustering produces 8 clusters of which 4 contain 5 or more genes.

Search for a motif of length 8 and a motif of length 12 in 4 clusters

Reymond, P. et al. (2000) Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis. Plant Cell 12(5), 707-720.

(49)

Results: Cluster 1

TAArTAAGTCAC    7    TGAGTCA     tissue-specific GCN4-motif
                     CGTCA       MeJA-responsive element
ATTCAAATTT      8    ATACAAAT    element associated with GCN4-motif
CTTCTTCGATCT    5    TTCGACC     elicitor-responsive element

(50)

Results: Cluster 2

CCCGCGTTTCAA    4    CCCCCG      enhancer-like element
TTGACyCGy       5    TGACG       MeJA-responsive element
                     (T)TGAC(C)  Box-W1, elicitor-responsive element
mACGTCACct      7    CGTCA       MeJA-responsive element
                     ACGT        Abscisic acid response element

(51)

Results: Cluster 4

wATATATATmTT    5    TATATA      TATA-box-like element
TCTwCnTC        9    TCTCCCT     TCCC-motif, part of light-responsive element
ATAAATAkGCnT    7    -           -

(52)

Results: Cluster 8

yTGACCGTCcsa    9    CCGTCC      meristem-specific activation of H4 gene
                     CCGTCC      A-box, light- or elicitor-responsive element
                     TGACG       MeJA-responsive element
                     CGTCA       MeJA-responsive element
CACGTGG         5    CACGTG      G-box, light-responsive element
                     ACGT        Abscisic acid response element
GCCTymTT        8    -           -
AGAATCAAT       6    -           -

(53)

Conclusions

Gene expression data can reveal useful information on transcriptional regulation.

Adaptive quality-based clustering finds coherent groups of co-expressed genes.

Use of higher-order background models improves performance of Motif Sampler.

INCLUSive enables online analysis from clustering to motif finding.

(54)

Acknowledgements

ESAT-SCD

• Prof. Bart De Moor

• Dr. Yves Moreau

• Dr. Kathleen Marchal

• Frank De Smet

• Stein Aerts

• all others

STWW Project

• Pierre Rouzé (VIB Gent, INRA)

• Stephane Rombauts (VIB Gent, INRA)

• Magali Lescot (LGPD, Marseille)

IWT-Vlaanderen
