Gibbs sampling for motif finding
Yves Moreau
2
Overview
Markov Chain Monte Carlo
Gibbs sampling
Motif finding in cis-regulatory DNA
Biclustering microarray data
3
Markov Chain Monte-Carlo
Markov chain with transition matrix T
)
|
( X
1j X i P
T
ij
t
t
A C G T A 0.0643 0.8268 0.0659 0.0430 C 0.0598 0.0484 0.8515 0.0403 G 0.1602 0.3407 0.1736 0.3255 T 0.1507 0.1608 0.3654 0.3231
X=A
X=C X=G
X=T
4
Markov Chain Monte-Carlo
Markov chains can sample from complex distributions
ACGCGGTGTGCGTTTGACGA ACGGTTACGCGACGTTTGGT ACGTGCGGTGTACGTGTACG ACGGAGTTTGCGGGACGCGT ACGCGCGTGACGTACGCGTG AGACGCGTGCGCGCGGACGC ACGGGCGTGCGCGCGTCGCG AACGCGTTTGTGTTCGGTGC ACCGCGTTTGACGTCGGTTC ACGTGACGCGTAGTTCGACG ACGTGACACGGACGTACGCG ACCGTACTCGCGTTGACACG ATACGGCGCGGCGGGCGCGG ACGTACGCGTACACGCGGGA ACGCGCGTGTTTACGACGTG ACGTCGCACGCGTCGGTGTG ACGGCGGTCGGTACACGTCG ACGTTGCGACGTGCGTGCTG ACGGAACGACGACGCGACGC ACGGCGTGTTCGCGGTGCGG
AC G T
%
Position
5
Markov Chain Monte-Carlo
Let us look at the transition after two steps
Similarly, after n steps
T T T
T T
i X k X
P k X
j X
P
i X k X
P i X k X
j X
P i
X j X
P T
S k
kj ik S k
t t
t t
S k
t t
t t
t t
t ij
.
)
| (
)
| (
)
| (
) ,
| (
)
| (
) 2 (
1 1
1 1
2 1
1 1
2 2
) 2 (
( )n
(
t n|
t)
nT P X
X T
6
Markov Chain Monte-Carlo
Stationary distribution
If the samples are generated to the distribution , the samples at the next step will also be generated according to
is a left eigenvector of T
Equilibrium distribution
Rows of T
are stationary distributions
From an arbitrary initial condition and after a sufficient number of steps (burn-in), the successive states of the Markov chains are
samples from a stationary distribution
T
T T
T T T
T T
n n
n n
lim
1lim 0.1188 0.0643 0.8268 0.0659 0.0430 0.1188 0.2788 0.0598 0.0484 0.8515 0.0403 0.2788
. =
0.3905 0.1602 0.3407 0.1736 0.3255 0.3905 0.2119 0.1507 0.1608 0.3654 0.3231 0.2119
T
7
Detailed balance
A sufficient condition for the Markov chain to converge to the stationary distribution p is that they satisfy the
condition of detailed balance
Proof:
Problem: disjoint regions in probability space , ,
i ij j ji
p T p T i j
i j ji i ij i ij i,
j j j
pT p T p T p T p i
8
Gibbs sampling
Markov chain for Gibbs sampling
1
1 1
0 0 0
( | , )
1
( | , )
1 1
( | , )
1 1 1
( , , ) ( | , ) ( | , ) ( | , ) ( , , )
( , , ) ( , , )
( , , )
i i
i i
i i
P A B b C c
i i i i
P B A a C c
i i i i
P C A a B b
i i i i
P A B C P A B C P B A C P C A B a b c
a a b c
b a b c
c a b c
9
Gibbs sampling
Detailed balance
Detailed balance for the Gibbs sampler
Prove detailed balance
Bayes’ rule
Q.E.D.
1 1 1 1
1 1 1 1 1 1
( , , ) ( | , , , , , )
( , , , , , , ) ( | , , , , , )
n i i i n
i i i n i i i n
P x x P x x x x x
P x x x x x P x x x x x
( ) ( | ) ( ) ( | ), , P x y x P y x y x y
1 1 1
( | ) y x P x x ( | , ,
ix
i, x
i, , ) x
n
1
( ) ( , , )
nP x P x x
1 1 1 1 1 1 1
1 1 1 1 1 1
( , , ) ( , , , , , , ) / ( , , , , , ) ( , , , , , , ) ( , , ) / ( , , , , , )
n i i i n i i n
i i i n i n i i n
P x x P x x x x x P x x x x
P x x x x x P x x P x x x x
10
Data augmentation Gibbs sampling
Introducing unobserved variables often simplifies the expression of the likelihood
A Gibbs sampler can then be set up
Samples from the Gibbs sampler can be used to estimate parameters
( , | ) ( | , ) ( | , )
( | , ) ( | , )
model parameters, missing data, data
i j
i j
P M D P M D P M D
P M D P M D
M D
PME
1
( | ) ( , | ) 1
N k
M k
E D P M D dMd
N
11
Pros and cons
Pros
Clear probabilistic interpretation
Bayesian framework
“Global optimization”
Cons
Mathematical details not easy to work out
Relatively slow
12
Motif finding
13
Gibbs sampler
Gibbs sampling for motif finding
Set up a Gibbs sampler for the joint probability of the motif matrix and the alignment given the sequences
Sequence by sequence
Lawrence et al.
One motif of fixed length
One occurrence per sequence
Background model based on single nucleotides
Too sensitive to noise
Lots of parameter tuning
( , | ) ( | , ) ( | , )
motif matrix, alignment, sequences
P A S P A S P A S
A S
) ,
| ( )
,
|
(
1 i i iK
i
P a S
S A
P
2 . 0 05 . 0 9
. 0 05 . 0 04 . 0 05 . 0 05 . 0 1 . 0
4 . 0 9
. 0 05 . 0 9
. 0 03 . 0 05 . 0 04 . 0 2 . 0
2 . 0 02 . 0 01 . 0 01 . 0 8
. 0 1
. 0 9
. 0 4 . 0
2 . 0 03 . 0 04 . 0 04 . 0 03 . 0 8
. 0 01 . 0 3 . 0
NCACGTGN
: model Motif
T G C A
28 . 0
24 . 0
16 . 0
32 . 0
model Background
T G C A
1 2
0 Motif
( | ,
W, )
bg bgP S a B P P P
Motif , 1
1 x j
W
j b j
P q
1
1 0,
1 j
a
bg b
j
P
q
Translation start 500 bp
2 0, j
L
bg b
j a W
P q
15
Gibbs motif finding
Initialization
Sequences
Random motif matrix
Iteration
Sequence scoring
Alignment update
Motif instances
Motif matrix
Termination
Convergence of the alignment
and of the motif matrix
16
Gibbs motif finding
Initialization
Sequences
Random motif matrix
Iteration
Sequence scoring
Alignment update
Motif instances
Motif matrix
Termination
Convergence of the alignment
and of the motif matrix
17
Gibbs motif finding
Initialization
Sequences
Random motif matrix
Iteration
Sequence scoring
Alignment update
Motif instances
Motif matrix
Termination
Convergence of the alignment and of the motif matrix
1
1
1
,
0 1 0,
, ,
( | , ) ( ) ( | , )
l i
l i
l l W
W i b
W
i b
x b b
P x S W x P x S
18
Gibbs motif finding
Initialization
Sequences
Random motif matrix
Iteration
Sequence scoring
Alignment update
Motif instances
Motif matrix
Termination
Convergence of the alignment
and of the motif matrix
19
Gibbs motif finding
Initialization
Sequences
Random motif matrix
Iteration
Sequence scoring
Alignment update
Motif instances
Motif matrix
Termination
Convergence of the alignment
and of the motif matrix
20
Gibbs motif finding
Initialization
Sequences
Random motif matrix
Iteration
Sequence scoring
Alignment update
Motif instances
Motif matrix
Termination
Convergence of the alignment
and of the motif matrix
21
Gibbs motif finding
Initialization
Sequences
Random motif matrix
Iteration
Sequence scoring
Alignment update
Motif instances
Motif matrix
Termination
Convergence of the alignment
and of the motif matrix
22
Gibbs motif finding
Initialization
Sequences
Random motif matrix
Iteration
Sequence scoring
Alignment update
Motif instances
Motif matrix
Termination
Stabilization of the motif matrix
(not of the alignment)
23
Motif Sampler (extended Gibbs sampling)
Model
One motif of fixed length per round
Several occurrences per sequence
Sequence have a discrete probability distribution over the number of copies of the motif (under a maximum bound)
Multiple motifs found in successive rounds by masking occurrences of previous motifs
Improved background model based on oligonucleotides
Gapped motifs
2 . 0 05 . 0 9
. 0 05 . 0 04 . 0 05 . 0 05 . 0 1 . 0
4 . 0 9
. 0 05 . 0 9
. 0 03 . 0 05 . 0 04 . 0 2 . 0
2 . 0 02 . 0 01 . 0 01 . 0 8
. 0 1
. 0 9
. 0 4 . 0
2 . 0 03 . 0 04 . 0 04 . 0 03 . 0 8
. 0 01 . 0 3 . 0
NCACGTGN
: model Motif
T G C
A ( | ... )
model Background
2
1 j j m
j
j
b b b
b
P
0
Motif 1
( | , , , )
c i i
m bg bg
i
P S a c B P P P
Motif , 1
1 ai j i W
j b j
P q
x
m j
m j j
j m
bg P b b P b b b
P
1
1 1
0 ( ,..., ) ( | ... )
1
1 1
( | ... )
i
i
a i
bg j j j m
j a w
P
P b b
b