Gibbs sampling for motif finding

(1)

Gibbs sampling for motif finding

Yves Moreau

(2)

2

Overview



Markov Chain Monte Carlo



Gibbs sampling



Motif finding in cis-regulatory DNA



Biclustering microarray data

(3)

3

Markov Chain Monte-Carlo



Markov chain with transition matrix T

)

|

( X

₁

j X i P

T

_ij



_t_



_t



A C G T A 0.0643 0.8268 0.0659 0.0430 C 0.0598 0.0484 0.8515 0.0403 G 0.1602 0.3407 0.1736 0.3255 T 0.1507 0.1608 0.3654 0.3231

X=A

X=C X=G

X=T

(4)

4

Markov Chain Monte-Carlo



Markov chains can sample from complex distributions

ACGCGGTGTGCGTTTGACGA ACGGTTACGCGACGTTTGGT ACGTGCGGTGTACGTGTACG ACGGAGTTTGCGGGACGCGT ACGCGCGTGACGTACGCGTG AGACGCGTGCGCGCGGACGC ACGGGCGTGCGCGCGTCGCG AACGCGTTTGTGTTCGGTGC ACCGCGTTTGACGTCGGTTC ACGTGACGCGTAGTTCGACG ACGTGACACGGACGTACGCG ACCGTACTCGCGTTGACACG ATACGGCGCGGCGGGCGCGG ACGTACGCGTACACGCGGGA ACGCGCGTGTTTACGACGTG ACGTCGCACGCGTCGGTGTG ACGGCGGTCGGTACACGTCG ACGTTGCGACGTGCGTGCTG ACGGAACGACGACGCGACGC ACGGCGTGTTCGCGGTGCGG

AC G T

%

Position

(5)

5

Markov Chain Monte-Carlo



Let us look at the transition after two steps



Similarly, after n steps

T T T

T T

i X k X

P k X

j X

P

i X k X

P i X k X

j X

P i

X j X

P T

S k

kj ik S k

t t

S k

t t

t ij

.

)

| (

)

| (

)

| (

) ,

| (

)

| (

) 2 (

1 1

2 1

1 1

2 2

) 2 (







   



( )ⁿ

(

_{t n}

|

_t

)

ⁿ

T  P X

_

X  T

(6)

6

Markov Chain Monte-Carlo



Stationary distribution 



If the samples are generated to the distribution , the samples at the next step will also be generated according to 



 is a left eigenvector of T



Equilibrium distribution



Rows of T

^

are stationary distributions



 From an arbitrary initial condition and after a sufficient number of steps (burn-in), the successive states of the Markov chains are

samples from a stationary distribution



 T 

T T

T T T

T T

n n

















lim

1

lim 0.1188 0.0643 0.8268 0.0659 0.0430 0.1188 0.2788 0.0598 0.0484 0.8515 0.0403 0.2788

. =

0.3905 0.1602 0.3407 0.1736 0.3255 0.3905 0.2119 0.1507 0.1608 0.3654 0.3231 0.2119

  

T

  

     

     

(7)

7

Detailed balance



A sufficient condition for the Markov chain to converge to the stationary distribution p is that they satisfy the

condition of detailed balance



Proof:



Problem: disjoint regions in probability space , ,

i ij j ji

p T  p T  i j

 

i j ji i ij i ij i

^,

j j j

pT   p T   p T  p  T  p  i

(8)

8

Gibbs sampling



Markov chain for Gibbs sampling

1

1 1

0 0 0

( | , )

1

( | , )

1 1

( | , )

1 1 1

( , , ) ( | , ) ( | , ) ( | , ) ( , , )

( , , ) ( , , )

( , , )

i i

P A B b C c

i i i i

P B A a C c

i i i i

P C A a B b

i i i i

P A B C P A B C P B A C P C A B a b c

a a b c

b a b c

c a b c



 

 



 

 

 

  

  

 



(9)

9

Gibbs sampling



Detailed balance



Detailed balance for the Gibbs sampler



Prove detailed balance



Bayes’ rule



Q.E.D.

1 1 1 1

1 1 1 1 1 1

( , , ) ( | , , , , , )

( , , , , , , ) ( | , , , , , )

n i i i n

i i i n i i i n

P x x P x x x x x

P x x x x x P x x x x x

 

   



   

    

( ) ( | ) ( ) ( | ), , P x  y x  P y  x y  x y

1 1 1

( | ) y x P x x ( | , ,

_i

x

_i

, x

_i

, , ) x

_n

 

¹

 

_ _



( ) ( , , )

_n

P x  P x  x

1 1 1 1 1 1 1

1 1 1 1 1 1

( , , ) ( , , , , , , ) / ( , , , , , ) ( , , , , , , ) ( , , ) / ( , , , , , )

n i i i n i i n

i i i n i n i i n

P x x P x x x x x P x x x x

P x x x x x P x x P x x x x

   



     

     

(10)

10 Data augmentation Gibbs sampling



Introducing unobserved variables often simplifies the expression of the likelihood



A Gibbs sampler can then be set up



Samples from the Gibbs sampler can be used to estimate parameters

( , | ) ( | , ) ( | , )

( | , ) ( | , )

model parameters, missing data, data

i j

P M D P M D P M D

P M D P M D

M D

  

 



 

  

PME

1

( | ) ( , | ) 1

N k

M k

E D P M D dMd



N

     



     

(11)

11

Pros and cons



Pros



Clear probabilistic interpretation



Bayesian framework



“Global optimization”



Cons



Mathematical details not easy to work out



Relatively slow

(12)

12 Motif finding

(13)

13

Gibbs sampler



Gibbs sampling for motif finding

 Set up a Gibbs sampler for the joint probability of the motif matrix and the alignment given the sequences

 Sequence by sequence



Lawrence et al.

 One motif of fixed length

 One occurrence per sequence

 Background model based on single nucleotides

 Too sensitive to noise

 Lots of parameter tuning

( , | ) ( | , ) ( | , )

motif matrix, alignment, sequences

P A S P A S P A S

A S

  



 

) ,

| ( )

,

|

(

1 i i i

K

i

P a S

S A

P  





(14)



 









 







2 . 0 05 . 0 9

. 0 05 . 0 04 . 0 05 . 0 05 . 0 1 . 0

4 . 0 9

. 0 05 . 0 9

. 0 03 . 0 05 . 0 04 . 0 2 . 0

2 . 0 02 . 0 01 . 0 01 . 0 8

. 0 1

. 0 9

. 0 4 . 0

2 . 0 03 . 0 04 . 0 04 . 0 03 . 0 8

. 0 01 . 0 3 . 0

NCACGTGN

: model Motif

T G C A



 









 







28 . 0

24 . 0

16 . 0

32 . 0

model Background

T G C A

1 2

0 Motif

( | ,

_W

, )

_bg _bg

P S a  B  P P P

Motif , 1

1 ^{x j}

W

j b j

P q

_{ }



 

1

1 0,

1 ^j

a

bg b

j

P

^

q



 

Translation start 500 bp

2 0, _j

L

bg b

j a W

P q

 

 

(15)

15

Gibbs motif finding



Initialization



Sequences



Random motif matrix



Iteration



Sequence scoring



Alignment update



Motif instances



Motif matrix



Termination



Convergence of the alignment

and of the motif matrix

(16)

16

Gibbs motif finding



Initialization



Sequences



Random motif matrix



Iteration



Sequence scoring



Alignment update



Motif instances



Motif matrix



Termination



Convergence of the alignment

and of the motif matrix

(17)

17

Gibbs motif finding



Initialization



Sequences



Random motif matrix



Iteration



Sequence scoring



Alignment update



Motif instances



Motif matrix



Termination



Convergence of the alignment and of the motif matrix

 

1

,

0 1 0,

, ,

( | , ) ( ) ( | , )

l i

l l W

W i b

W

i b

x b b

P x S W x P x S

 

 



  



 

(18)

18

Gibbs motif finding



Initialization



Sequences



Random motif matrix



Iteration



Sequence scoring



Alignment update



Motif instances



Motif matrix



Termination



Convergence of the alignment

and of the motif matrix

(19)

19

Gibbs motif finding



Initialization



Sequences



Random motif matrix



Iteration



Sequence scoring



Alignment update



Motif instances



Motif matrix



Termination



Convergence of the alignment

and of the motif matrix

(20)

20

Gibbs motif finding



Initialization



Sequences



Random motif matrix



Iteration



Sequence scoring



Alignment update



Motif instances



Motif matrix



Termination



Convergence of the alignment

and of the motif matrix

(21)

21

Gibbs motif finding



Initialization



Sequences



Random motif matrix



Iteration



Sequence scoring



Alignment update



Motif instances



Motif matrix



Termination



Convergence of the alignment

and of the motif matrix

(22)

22

Gibbs motif finding



Initialization



Sequences



Random motif matrix



Iteration



Sequence scoring



Alignment update



Motif instances



Motif matrix



Termination



Stabilization of the motif matrix

(not of the alignment)

(23)

23 Motif Sampler (extended Gibbs sampling)



Model



One motif of fixed length per round



Several occurrences per sequence



Sequence have a discrete probability distribution over the number of copies of the motif (under a maximum bound)



Multiple motifs found in successive rounds by masking occurrences of previous motifs



Improved background model based on oligonucleotides



Gapped motifs

(24)



 









 







2 . 0 05 . 0 9

. 0 05 . 0 04 . 0 05 . 0 05 . 0 1 . 0

4 . 0 9

. 0 05 . 0 9

. 0 03 . 0 05 . 0 04 . 0 2 . 0

2 . 0 02 . 0 01 . 0 01 . 0 8

. 0 1

. 0 9

. 0 4 . 0

2 . 0 03 . 0 04 . 0 04 . 0 03 . 0 8

. 0 01 . 0 3 . 0

NCACGTGN

: model Motif

T G C

A ( | ... )

model Background

2

1 j j m

j

b b b

b

P

_ _ _

0

Motif 1

( | , , , )

c i i

m bg bg

i

P S a c  B P P P



 

Motif , 1

1 ^aⁱ ^j i W

j b j

P q

  

 



   

 ^x

m j

m j j

j m

bg P b b P b b b

P

1

1 1

0 ( ,..., ) ( | ... )

1

1 1

( | ... )

i

a i

bg j j j m

j a w

P

^

P b b

_

b

_

  

 

Translation start

500 bp