Gibbs biclustering of microarray data

(1)


Yves Moreau

(2)

 Microarray cost per expression measurement ↓

 Budgets and expertise ↑

 Publicly available microarray data ↑

 Need for exchange standards & repositories

 Big consortia set up big microarray projects

 Genome projects → “transcriptome” projects (= compendia)

 Change in microarray projects (as in sequence analysis)



Analyze public data first to generate a hypothesis



Design and perform your own microarray experiment

From genome projects to transcriptome projects

(3)

 Data becomes more heterogeneous



Gene clustering



Group genes that behave similarly over all conditions



Gene biclustering



Group genes that behave similarly over a subset of conditions



“Feature selection”



More suitable for a heterogeneous compendium

Why biclustering?

(4)

Distribution of expression values for a given gene

High Medium Low

Bicluster

 Discretized microarray data set

 Discretizing microarray data



Microarray data is continuous



Discretize by equal frequency

genes

conditions
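Equal-frequency discretization can be sketched as follows. This is a minimal illustration, assuming three levels (low, medium, high) assigned per gene; the function name `discretize_equal_frequency` and the parameter `n_bins` are hypothetical, and ties are broken by position rather than handled specially.

```python
import numpy as np

def discretize_equal_frequency(expr, n_bins=3):
    """Discretize each gene (row) into n_bins levels with roughly equal counts.

    Returns an integer array of the same shape: 0 = low, ..., n_bins - 1 = high.
    Assumption of this sketch: ties are broken by position in the row.
    """
    expr = np.asarray(expr, dtype=float)
    # Rank the conditions within each gene (0 = lowest expression).
    ranks = expr.argsort(axis=1).argsort(axis=1)
    # Map ranks to bins so each bin receives about n_conditions / n_bins values.
    return (ranks * n_bins) // expr.shape[1]
```

For a gene measured in six conditions, each of the three levels is then used exactly twice.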

(5)

Bicluster

(6)

Likelihood

0 1

Background Pattern

(7)

Likelihood (scale 0 to 1)

.9   .9   .9   .9   .9
.9   .05  .9   .9   .9
.9   .9   .9   .9   .9
.05  .9   .9   .9   .9
.9   .9   .9   .9   .05

$P(D \mid g, c, \theta)$

(8)

Likelihood (scale 0 to 1)

.9   .05  .05  .05  .9
.05  .9   .9   .05  .05
.05  .05  .05  .05  .05
.05  .05  .9   .9   .05

$P(D \mid g', c, \theta) < P(D \mid g, c, \theta)$



Get the right genes

(9)

Likelihood (scale 0 to 1)

.9   .9   .05  .05  .9
.9   .05  .05  .9   .9
.9   .9   .05  .05  .9
.05  .9   .05  .05  .9
.9   .9   .05  .05  .05

$P(D \mid g, c', \theta) < P(D \mid g, c, \theta)$

Get the right conditions

(10)

Likelihood (scale 0 to 1)

.6   .6   .2   .2   .6
.6   .2   .2   .2   .6
.6   .6   .2   .2   .6
.2   .6   .2   .2   .6
.2   .6   .2   .2   .2

$P(D \mid g, c, \theta') < P(D \mid g, c, \theta)$

Get the right frequency pattern
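The comparison of likelihoods for different genes, conditions, and patterns can be illustrated numerically. The sketch below assumes that cells are independent, that cells inside the bicluster follow a per-condition pattern, and that all other cells follow a single background distribution. The names `log_likelihood`, `theta`, and `background` are hypothetical and are not taken from the slides.

```python
import numpy as np

def log_likelihood(data, g, c, theta, background):
    """Log-likelihood of discretized data under an assumed independent-cell model.

    data:       (n_genes, n_conds) integer levels
    g, c:       boolean masks selecting the bicluster genes and conditions
    theta:      (n_conds, n_levels); theta[j, k] = P(level k) in condition j
    background: (n_levels,) level probabilities outside the bicluster
    """
    data = np.asarray(data)
    theta = np.asarray(theta, dtype=float)
    background = np.asarray(background, dtype=float)
    cols = np.broadcast_to(np.arange(data.shape[1]), data.shape)
    p_pattern = theta[cols, data]        # probability of each cell under the pattern
    p_background = background[data]      # probability of each cell under the background
    in_bicluster = np.outer(np.asarray(g, dtype=bool), np.asarray(c, dtype=bool))
    return float(np.log(np.where(in_bicluster, p_pattern, p_background)).sum())
```

Swapping a gene that follows the pattern for one that does not lowers the value returned, which is the effect the three comparisons above describe.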

(11)

Optimizing the bicluster

 Find the right bicluster



Genes



Conditions



Pattern

 For a given choice of genes and conditions, the “best” pattern is given by the frequencies found in the extracted pattern



No more need to optimize over the pattern

 Maximum likelihood: find genes and conditions that maximize $P(D \mid g, c)$

 Gibbs sampling: find genes and conditions that optimize $P(g, c \mid D)$
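Reading the pattern off the extracted submatrix can be sketched as below. The function name `estimate_pattern` is hypothetical, and the optional `pseudocount` is an assumption of this sketch (it only keeps unseen levels from getting probability zero); setting it to 0 gives the raw frequencies.

```python
import numpy as np

def estimate_pattern(data, g, c, n_levels=3, pseudocount=1.0):
    """Per-condition level frequencies inside the bicluster selected by g and c.

    Returns an array of shape (n_selected_conditions, n_levels); rows sum to 1.
    """
    sub = np.asarray(data)[np.ix_(g, c)]   # extract the bicluster submatrix
    # Count how often each level occurs in each selected condition.
    counts = np.stack([(sub == k).sum(axis=0) for k in range(n_levels)], axis=1)
    counts = counts + pseudocount          # assumed smoothing, not from the slides
    return counts / counts.sum(axis=1, keepdims=True)
```

Because the pattern is a deterministic function of the chosen genes and conditions, only the two membership vectors remain to be searched.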

(12)

Gibbs sampling

Current configuration

$P(g_1 = 1 \mid g_{-1}, c, D)$?

$P(g_2 = 1 \mid g_{-2}, c, D)$?

Next gene configuration

$P(g_3 = 1 \mid g_{-3}, c, D)$?

(13)

Updated gene configuration

Next complete configuration

 iterate many times

(14)

Gibbs biclustering

$P(g, c \mid D) \;\leftarrow\; \prod_i P(g_i \mid g_{-i}, c, D) \, \prod_j P(c_j \mid c_{-j}, g, D)$
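One way to realize these conditional updates is sketched below. It is not the implementation behind the slides. It assumes a uniform background over levels, Bernoulli priors `prior_g` and `prior_c` on membership, and a Dirichlet prior `alpha` on the pattern that is integrated out (a collapsed sampler). All of these names and defaults are hypothetical.

```python
import numpy as np
from math import lgamma

def gibbs_bicluster(data, n_levels=3, n_iter=120, burn_in=60,
                    prior_g=0.2, prior_c=0.5, alpha=1.0, seed=0):
    """Sketch of a collapsed Gibbs sampler for one bicluster in discretized data.

    Returns the fraction of post-burn-in sweeps in which each gene and each
    condition belonged to the bicluster.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n_genes, n_conds = data.shape
    log_bg = np.full(n_levels, -np.log(n_levels))   # assumed uniform background
    onehot = np.eye(n_levels)[data]                 # shape (genes, conds, levels)
    cond_idx = np.arange(n_conds)
    g = rng.random(n_genes) < prior_g               # random initial configuration
    c = rng.random(n_conds) < prior_c
    g_freq = np.zeros(n_genes)
    c_freq = np.zeros(n_conds)

    def sample(log_odds):
        # Bernoulli draw with P(1) = sigmoid(log_odds), written with tanh for stability.
        return rng.random() < 0.5 * (1.0 + np.tanh(0.5 * log_odds))

    for sweep in range(n_iter):
        # Draw each g_i from P(g_i | g_{-i}, c, D).
        for i in range(n_genes):
            g[i] = False
            counts = onehot[g].sum(axis=0) + alpha  # level counts of the other genes
            pred = counts / counts.sum(axis=1, keepdims=True)
            gain = np.log(pred[cond_idx, data[i]]) - log_bg[data[i]]
            g[i] = sample(gain[c].sum() + np.log(prior_g / (1 - prior_g)))
        # Draw each c_j from P(c_j | c_{-j}, g, D).
        for j in range(n_conds):
            n_k = onehot[:, j][g].sum(axis=0)       # level counts in column j
            log_marg = (lgamma(n_levels * alpha)
                        - lgamma(n_levels * alpha + n_k.sum())
                        + sum(lgamma(alpha + n) - lgamma(alpha) for n in n_k))
            c[j] = sample(log_marg - n_k @ log_bg
                          + np.log(prior_c / (1 - prior_c)))
        if sweep >= burn_in:                        # average successive configurations
            g_freq += g
            c_freq += c
    n_kept = n_iter - burn_in
    return g_freq / n_kept, c_freq / n_kept
```

Genes and conditions whose averaged membership is close to 1 form the bicluster. Because membership is sampled rather than fixed, a gene that deviates from the pattern in a few conditions can still be included.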

(15)

Simulated data

(16)

Remarks

 Gibbs biclustering allows noisy patterns

 Optimized configuration is obtained by averaging successive iterated configurations

 Biclustering is oriented



Find subset of samples for which a subset of genes is consistently expressed across genes



Find subset of genes that are consistently expressed across a subset of samples

 Searching for multiple patterns



For gene biclustering, remove the data of the genes from the current bicluster



Search for a new pattern



Stop if only empty pattern repeatedly found
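The search for multiple patterns can be sketched as a loop around any single-bicluster routine. Here `find_bicluster` is a hypothetical callable that returns averaged membership frequencies for genes and conditions; `threshold`, `max_biclusters`, and `max_empty` are likewise assumptions of this sketch.

```python
import numpy as np

def find_multiple_biclusters(data, find_bicluster, threshold=0.5,
                             max_biclusters=10, max_empty=3):
    """Repeatedly search for a bicluster, removing the genes already found."""
    data = np.asarray(data)
    remaining = np.arange(data.shape[0])    # indices of genes not yet assigned
    found, empty_runs = [], 0
    while (len(found) < max_biclusters and empty_runs < max_empty
           and remaining.size > 0):
        g_freq, c_freq = find_bicluster(data[remaining])
        genes = remaining[np.asarray(g_freq) > threshold]
        conds = np.flatnonzero(np.asarray(c_freq) > threshold)
        if genes.size == 0 or conds.size == 0:
            empty_runs += 1                 # empty pattern: retry up to max_empty times
            continue
        empty_runs = 0
        found.append((genes, conds))
        remaining = np.setdiff1d(remaining, genes)  # remove these genes' data
    return found
```

Each entry of the returned list holds the gene indices (in the original numbering) and the condition indices of one bicluster.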

(17)

Multiple biclusters

(18)

Leukemia fingerprints

(19)

Mixed-Lineage Leukemia

 Armstrong et al., Nature Genetics, 2002

 Mixed-Lineage Leukemia (MLL) is a subtype of ALL

 Caused by chromosomal rearrangement in MLL gene

 Poorer prognosis than ALL

 Microarray analysis shows that MLL is distinct from ALL

 FLT3 tyrosine kinase distinguishes most strongly between MLL, ALL, and AML

 Candidate drug target

(20)

 PCA Features

(21)