Expectation-Maximization for HMMs and Motif Discovery
Yves Moreau
Overview
The general Expectation-Maximization algorithm
EM interpretation of the Baum-Welch algorithm for the learning of HMMs
MEME for motif discovery
The general EM algorithm
EM algorithm
Maximum likelihood estimation
Let us assume we have an algorithm that tries to optimize the likelihood
Let us look at the change in likelihood between two iterations of the algorithm
\[
\theta^* = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \ln P(D \mid \theta)
\]
\[
\Delta(\theta^{i+1}, \theta^i) = \ln P(D \mid \theta^{i+1}) - \ln P(D \mid \theta^i) = \ln \frac{P(D \mid \theta^{i+1})}{P(D \mid \theta^i)}
\]
EM algorithm
The likelihood is sometimes difficult to compute
We use a simpler generative model based on unobserved data (data augmentation)
We try to integrate out the unobserved data
The expectation can be an integral or a sum
\[
\Delta(\theta^{i+1}, \theta^i) = \ln \frac{\sum_m P(D, m \mid \theta^{i+1})}{P(D \mid \theta^i)}
\]
\[
P(D \mid \theta^{i+1}) = \int_m P(D, m \mid \theta^{i+1})\, dm = \int_m P(D \mid m, \theta^{i+1})\, P(m \mid \theta^{i+1})\, dm
\]
\[
P(D \mid \theta^{i+1}) = \sum_m P(D, m \mid \theta^{i+1}) = \sum_m P(D \mid m, \theta^{i+1})\, P(m \mid \theta^{i+1})
\]
EM algorithm
Without loss of generality, we work with a sum
Problem: the expression contains the logarithm of a sum
Jensen’s inequality
\[
\Delta(\theta^{i+1}, \theta^i) = \ln \frac{\sum_m P(D, m \mid \theta^{i+1})}{P(D \mid \theta^i)}
\]
For $\lambda_j \ge 0$ with $\sum_j \lambda_j = 1$:
\[
\ln \sum_j \lambda_j y_j \ge \sum_j \lambda_j \ln y_j
\]
EM algorithm
Application of Jensen’s inequality
This gives a lower bound for the variation of the likelihood
Choosing $\lambda_m = P(m \mid D, \theta^i)$:
\[
\Delta(\theta^{i+1}, \theta^i) = \ln \sum_m P(m \mid D, \theta^i)\, \frac{P(D, m \mid \theta^{i+1})}{P(D \mid \theta^i)\, P(m \mid D, \theta^i)}
\]
\[
\Delta(\theta^{i+1}, \theta^i) \ge \delta(\theta^{i+1}, \theta^i) = \sum_m P(m \mid D, \theta^i)\, \ln \frac{P(D, m \mid \theta^{i+1})}{P(D \mid \theta^i)\, P(m \mid D, \theta^i)}
\]
\[
\ln P(D \mid \theta^{i+1}) \ge \ln P(D \mid \theta^i) + \delta(\theta^{i+1}, \theta^i)
\]
EM algorithm
Let us try to maximize (the bound on) the variation
\[
\delta(\theta, \theta^i) = \sum_m P(m \mid D, \theta^i)\, \ln \frac{P(D, m \mid \theta)}{P(D \mid \theta^i)\, P(m \mid D, \theta^i)}
\]
\[
\theta^{i+1}_{\mathrm{EM}} = \arg\max_\theta \delta(\theta, \theta^i_{\mathrm{EM}})
= \arg\max_\theta \sum_m P(m \mid D, \theta^i_{\mathrm{EM}})\, \ln \frac{P(D, m \mid \theta)}{P(D \mid \theta^i_{\mathrm{EM}})\, P(m \mid D, \theta^i_{\mathrm{EM}})}
\]
\[
= \arg\max_\theta \sum_m P(m \mid D, \theta^i_{\mathrm{EM}})\, \ln P(D, m \mid \theta)
= \arg\max_\theta Q(\theta, \theta^i_{\mathrm{EM}})
= \arg\max_\theta E_{m \mid D, \theta^i_{\mathrm{EM}}}\big[\ln P(D, m \mid \theta)\big]
\]
The terms $P(D \mid \theta^i_{\mathrm{EM}})$ and $P(m \mid D, \theta^i_{\mathrm{EM}})$ are independent of $\theta$ and can be dropped from the maximization.
Optimization strategy of EM
[Figure: the log-likelihood $\ln P(D \mid \theta)$ as a function of $\theta$, together with the lower bound constructed at $\theta^i_{\mathrm{EM}}$, namely $\ln P(D \mid \theta^i_{\mathrm{EM}}) + \delta(\theta, \theta^i_{\mathrm{EM}})$. The bound touches the log-likelihood at $\theta^i_{\mathrm{EM}}$; its maximizer is the next iterate $\theta^{i+1}_{\mathrm{EM}}$, at which the log-likelihood has increased.]
EM for independent records
If the data set consists of N independent records, we can introduce independent unobserved data
The expectation step (including the use of Jensen’s inequality) takes place “inside” the summation over all records
\[
\theta^* = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \ln P(D \mid \theta)
= \arg\max_\theta \ln \prod_n P(x_n \mid \theta)
= \arg\max_\theta \sum_n \ln P(x_n \mid \theta)
\]
\[
\theta^{i+1}_{\mathrm{EM}} = \arg\max_\theta \sum_{n=1}^{N} \sum_{m_n} P(m_n \mid x_n, \theta^i_{\mathrm{EM}})\, \ln P(x_n, m_n \mid \theta)
\]
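To make the recipe concrete, here is a minimal EM sketch for independent records (not part of the original slides), assuming a hypothetical two-biased-coins mixture: the unobserved datum $m_n$ is which coin produced record $n$, the E-step computes $P(m_n \mid x_n, \theta^i)$, and the M-step maximizes the resulting Q by weighted maximum likelihood.

```python
import numpy as np

def em_two_coins(heads, tosses, n_iter=50):
    """EM for a mixture of two biased coins (illustrative sketch, hypothetical example).

    heads[n], tosses[n]: number of heads and tosses in record n.
    Unobserved data m_n in {0, 1}: which coin generated record n.
    """
    rng = np.random.default_rng(0)
    p = rng.uniform(0.2, 0.8, size=2)   # coin biases (the parameters theta)
    w = np.array([0.5, 0.5])            # mixture weights

    for _ in range(n_iter):
        # E-step: responsibilities P(m_n = j | x_n, theta^i), computed in log space.
        log_lik = (heads[:, None] * np.log(p) +
                   (tosses - heads)[:, None] * np.log(1 - p) +
                   np.log(w))
        log_lik -= log_lik.max(axis=1, keepdims=True)
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: maximize Q, i.e. weighted maximum-likelihood updates.
        w = resp.mean(axis=0)
        p = (resp * heads[:, None]).sum(axis=0) / (resp * tosses[:, None]).sum(axis=0)
    return w, p

# Toy usage: 10-toss records generated by two coins with different biases.
heads = np.array([9, 8, 2, 1, 8, 3, 9, 2])
tosses = np.full(8, 10)
print(em_two_coins(heads, tosses))
```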
Convergence of the EM algorithm
Likelihood increases monotonically
An equilibrium $\theta^*$ of EM is a maximum of $\ln P(D \mid \theta^*) + \delta(\theta, \theta^*)$
Thus $\theta^*$ must be a stationary point of the likelihood (because the bound is tangent to the log-likelihood at $\theta^*$)
No guarantee of a global optimum (often only a local maximum)
In some cases the stationary point is not even a maximum (e.g., a saddle point)
\[
\delta(\theta^{i+1}_{\mathrm{EM}}, \theta^i_{\mathrm{EM}}) \ge \delta(\theta^i_{\mathrm{EM}}, \theta^i_{\mathrm{EM}}) = 0
\]
\[
\ln P(D \mid \theta^{i+1}_{\mathrm{EM}}) \ge \ln P(D \mid \theta^i_{\mathrm{EM}})
\]
Why EM?
EM serves to find a maximum likelihood solution
This can also be achieved by gradient descent
But the computation of the gradients of the likelihood $P(D \mid \theta)$ is often difficult
By introducing the unobserved data in the EM algorithm, the Expectation step can be computed much more easily
Generalized EM
It is not absolutely necessary to maximize $Q(\theta, \theta^i)$ at the Maximization step
If $Q(\theta^{i+1}, \theta^i) \ge Q(\theta^i, \theta^i)$, convergence can also be achieved
This is the generalized EM algorithm
This algorithm is applied when the results of the
Expectation step are too complex to maximize directly
Baum-Welch algorithm
Hidden Markov Models
[Figure: a small HMM in which each state carries an emission table over {A, C, G, T} — (A .8, C 0, G 0, T .2), (A 0, C .8, G .2, T 0), (A .8, C .2, G 0, T 0), (A 1, C 0, G 0, T 0), (A 0, C 0, G .2, T .8), (A 0, C .8, G .2, T 0), (A .2, C .4, G .2, T .2) — connected by transition probabilities 1.0, 1.0, .6, .4, .6, .4, 1.0, 1.0. The arrows are labeled with the transition probabilities and the state tables with the emission probabilities.]
Sequence score
\[
P(\mathrm{ACACATC}) = 0.8 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \cdot 0.6 \cdot 0.4 \cdot 0.6 \cdot 1 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \approx 0.0472
\]
Hidden Markov Model
In a hidden Markov model, we observe the symbol sequence x but we want to reconstruct the hidden state sequence (the path $\pi$)
Transition probabilities: $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$ (begin: $a_{0l}$, end: $a_{k0}$)
Emission probabilities: $e_k(b) = P(x_i = b \mid \pi_i = k)$
Joint probability of the sequence $x = x_1, \ldots, x_L$ and the path $\pi$:
\[
P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}, \qquad \pi_{L+1} = 0
\]
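As a small illustration (not from the slides), the sketch below scores one sequence/path pair with this formula for a hypothetical two-state HMM; the state names and probabilities are made up for the example.

```python
# Hypothetical two-state HMM (illustrative numbers, not the model from the figure).
a = {('0', 'Y'): 1.0,                      # begin -> Y
     ('Y', 'Y'): 0.7, ('Y', 'Z'): 0.2, ('Y', '0'): 0.1,
     ('Z', 'Y'): 0.3, ('Z', 'Z'): 0.6, ('Z', '0'): 0.1}
e = {'Y': {'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4},
     'Z': {'A': 0.05, 'C': 0.45, 'G': 0.45, 'T': 0.05}}

def joint_probability(x, pi):
    """P(x, pi) = a_{0,pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}, with pi_{L+1} = 0."""
    p = a[('0', pi[0])]
    for i, (state, symbol) in enumerate(zip(pi, x)):
        next_state = pi[i + 1] if i + 1 < len(pi) else '0'
        p *= e[state][symbol] * a[(state, next_state)]
    return p

print(joint_probability("ACGT", ["Y", "Z", "Z", "Y"]))
```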
The forward algorithm
The forward algorithm lets us compute the probability P(x) of a sequence w.r.t. an HMM
This is important for the computation of posterior probabilities and the comparison of HMMs
The sum over all paths (exponentially many) can be computed by dynamic programming
Let us define fk(i) as the probability of the sequence for the paths that end in state k with the emission of symbol xi
Then we can compute this probability as
\[
P(x) = \sum_\pi P(x, \pi)
\]
\[
f_k(i) = P(x_1, \ldots, x_i, \pi_i = k)
\]
\[
f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i)\, a_{kl}
\]
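A minimal forward-algorithm sketch in Python (illustrative only: hypothetical state names and probabilities; initialization and termination follow the recursion above with an explicit begin/end state 0):

```python
import numpy as np

def forward(x, states, a0, a, a_end, e):
    """Forward algorithm: returns P(x) and f[k][i] = P(x_1..x_i, pi_i = k).

    a0[k]: begin->k, a[k][l]: k->l, a_end[k]: k->end, e[k][b]: emission probabilities.
    """
    L = len(x)
    f = {k: np.zeros(L) for k in states}
    for k in states:                                   # initialization (i = 1)
        f[k][0] = a0[k] * e[k][x[0]]
    for i in range(1, L):                              # recursion
        for l in states:
            f[l][i] = e[l][x[i]] * sum(f[k][i - 1] * a[k][l] for k in states)
    p_x = sum(f[k][L - 1] * a_end[k] for k in states)  # termination
    return p_x, f

# Hypothetical two-state example.
states = ["Y", "Z"]
a0 = {"Y": 0.5, "Z": 0.5}
a = {"Y": {"Y": 0.8, "Z": 0.1}, "Z": {"Y": 0.1, "Z": 0.8}}
a_end = {"Y": 0.1, "Z": 0.1}
e = {"Y": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
     "Z": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05}}
print(forward("ACGT", states, a0, a, a_end, e)[0])
```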
The backward algorithm
The backward algorithm lets us compute the probability of the complete sequence together with the condition that symbol xi is emitted from state k
This is important to compute the probability of a given state at symbol xi
P(x1,...,xi, πi = k) can be computed by the forward algorithm as fk(i)
Let us define bk(i) as the probability of the rest of the sequence for the paths that pass through state k at symbol xi
\[
P(x, \pi_i = k) = P(x_1, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_L \mid x_1, \ldots, x_i, \pi_i = k)
= P(x_1, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_L \mid \pi_i = k)
\]
\[
b_k(i) = P(x_{i+1}, \ldots, x_L \mid \pi_i = k)
\]
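A matching backward-algorithm sketch (same hypothetical conventions as the forward sketch above; the recursion $b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$ is the standard one and is assumed here, not shown explicitly on the slide):

```python
def backward(x, states, a, a_end, e):
    """Backward algorithm: b[k][i] = P(x_{i+1}..x_L | pi_i = k)."""
    L = len(x)
    b = {k: [0.0] * L for k in states}
    for k in states:                                 # initialization at i = L
        b[k][L - 1] = a_end[k]
    for i in range(L - 2, -1, -1):                   # recursion
        for k in states:
            b[k][i] = sum(a[k][l] * e[l][x[i + 1]] * b[l][i + 1] for l in states)
    return b

# Combined with the forward values f_k(i), the posterior probability of a state is
#   P(pi_i = k | x) = f_k(i) * b_k(i) / P(x).
```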
EM interpretation of Baum-Welch
We want to estimate the parameters $\theta$ of the hidden Markov model (transition and emission probabilities) that maximize the likelihood of the sequence(s)
Unobserved data = the paths $\pi$:
EM algorithm
\[
\theta^* = \arg\max_\theta P(x \mid \theta)
\]
\[
\theta^{i+1}_{\mathrm{EM}} = \arg\max_\theta Q(\theta, \theta^i_{\mathrm{EM}})
= \arg\max_\theta \sum_\pi P(\pi \mid x, \theta^i_{\mathrm{EM}})\, \ln P(x, \pi \mid \theta)
\]
\[
P(x) = \sum_\pi P(x, \pi)
\]
EM interpretation of Baum-Welch
Let us work out the function Q further
The generative model gives the joint probability of the sequence and the path
Define $A_{kl}(\pi)$, the number of times the transition $k \to l$ gets used in path $\pi$
Define $E_k(b, \pi)$, the number of times the emission of symbol $b$ from state $k$ is observed for the given sequence and path $\pi$
\[
Q(\theta, \theta^i) = \sum_\pi P(\pi \mid x, \theta^i)\, \ln P(x, \pi \mid \theta)
\]
EM interpretation of Baum-Welch
The joint probability of the sequence and the path can be written as
By taking the logarithm, the function Q becomes
\[
P(x, \pi \mid \theta) = \prod_{k=1}^{M} \prod_{b} e_k(b)^{E_k(b, \pi)} \prod_{k=0}^{M} \prod_{l=1}^{M} a_{kl}^{A_{kl}(\pi)}
\]
\[
Q(\theta, \theta^i) = \sum_\pi P(\pi \mid x, \theta^i) \left[ \sum_{k=1}^{M} \sum_{b} E_k(b, \pi) \ln e_k(b) + \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl}(\pi) \ln a_{kl} \right]
\]
EM interpretation of Baum-Welch
Define the expected number of times that a transition gets used (independently of the path)
\[
A_{kl} = \sum_\pi P(\pi \mid x, \theta^i)\, A_{kl}(\pi)
\]
Define the expected number of times that an emission is observed (independently of the path)
\[
E_k(b) = \sum_\pi P(\pi \mid x, \theta^i)\, E_k(b, \pi)
\]
EM interpretation of Baum-Welch
For the function Q, we have
\[
Q(\theta, \theta^i) = \sum_\pi P(\pi \mid x, \theta^i) \left[ \sum_{k=1}^{M} \sum_b E_k(b, \pi) \ln e_k(b) + \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl}(\pi) \ln a_{kl} \right]
\]
Given that $P(\pi \mid x, \theta^i)$ is independent of k, l, and b, we can reorder the sums and use the definitions of $A_{kl}$ and $E_k(b)$
\[
Q(\theta, \theta^i) = \sum_{k=1}^{M} \sum_b E_k(b) \ln e_k(b) + \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl} \ln a_{kl}
\]
Let us now maximize Q with respect to $a_{kl}$ and $e_k(b)$
EM interpretation of Baum-Welch
Let us look at the A term
Let us define the following candidate for the optimum
Compare with other parameter choices
\[
a_{kl}^0 = \frac{A_{kl}}{\sum_m A_{km}}
\]
\[
\sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl} \ln a_{kl}^0 - \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl} \ln a_{kl}
= \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl} \ln \frac{a_{kl}^0}{a_{kl}}
= \sum_{k=0}^{M} \Big( \sum_m A_{km} \Big) \sum_{l=1}^{M} a_{kl}^0 \ln \frac{a_{kl}^0}{a_{kl}}
\]
EM interpretation of Baum-Welch
The previous sum has the form of a relative entropy, which is always non-negative (since $a_{kl}^0$ and $a_{kl}$ are probability distributions over $l$)
Our candidate thus maximizes the A term
Identical procedure for the E term
\[
\sum_{l=1}^{M} a_{kl}^0 \ln \frac{a_{kl}^0}{a_{kl}} \ge 0
\]
\[
e_k^0(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}
\]
EM interpretation of Baum-Welch
Baum-Welch
Expectation step
Compute the expected number of times that a transition gets used
Compute the expected number of times that an emission is observed
Use the forward and backward algorithm for this
Maximization step
Update the parameters with the normalized counts
\[
A_{kl} = \sum_\pi P(\pi \mid x, \theta^i)\, A_{kl}(\pi), \qquad
E_k(b) = \sum_\pi P(\pi \mid x, \theta^i)\, E_k(b, \pi)
\]
\[
a_{kl}^{i+1} = \frac{A_{kl}}{\sum_m A_{km}}, \qquad
e_k^{i+1}(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}
\]
EM interpretation of Baum-Welch
For the transitions
For the emissions
\[
P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x \mid \theta)}
\]
\[
A_{kl} = \sum_j \frac{1}{P(x^j \mid \theta)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x_{i+1}^j)\, b_l^j(i+1)
\]
\[
P(\pi_i = k \mid x, \theta) = \frac{f_k(i)\, b_k(i)}{P(x \mid \theta)}
\]
\[
E_k(b) = \sum_j \frac{1}{P(x^j \mid \theta)} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i)
\]
where $j$ indexes the training sequences $x^j$.
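Putting the pieces together, here is a compact Baum-Welch sketch (illustrative only: a hypothetical two-state model, one training sequence, no explicit end state, fixed initial probabilities, and no pseudocounts or scaling, all of which a real implementation would handle):

```python
import numpy as np

STATES = ["Y", "Z"]
ALPHABET = "ACGT"

def forward_backward(x, a0, a, e):
    """Forward/backward without an explicit end state (simplification)."""
    L = len(x)
    f = {k: np.zeros(L) for k in STATES}
    b = {k: np.ones(L) for k in STATES}
    for k in STATES:
        f[k][0] = a0[k] * e[k][x[0]]
    for i in range(1, L):
        for l in STATES:
            f[l][i] = e[l][x[i]] * sum(f[k][i - 1] * a[k][l] for k in STATES)
    for i in range(L - 2, -1, -1):
        for k in STATES:
            b[k][i] = sum(a[k][l] * e[l][x[i + 1]] * b[l][i + 1] for l in STATES)
    p_x = sum(f[k][L - 1] for k in STATES)
    return f, b, p_x

def baum_welch_step(x, a0, a, e):
    """One EM iteration: expected counts A_kl and E_k(b), then normalized updates."""
    f, b, p_x = forward_backward(x, a0, a, e)
    L = len(x)
    # Expectation: expected transition and emission counts from forward/backward values.
    A = {k: {l: sum(f[k][i] * a[k][l] * e[l][x[i + 1]] * b[l][i + 1]
                    for i in range(L - 1)) / p_x for l in STATES} for k in STATES}
    E = {k: {c: sum(f[k][i] * b[k][i] / p_x for i in range(L) if x[i] == c)
             for c in ALPHABET} for k in STATES}
    # Maximization: normalize the counts.
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in STATES} for k in STATES}
    new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in ALPHABET} for k in STATES}
    return new_a, new_e, p_x

# Toy usage: the likelihood P(x | theta) should increase from iteration to iteration.
a0 = {"Y": 0.5, "Z": 0.5}
a = {"Y": {"Y": 0.9, "Z": 0.1}, "Z": {"Y": 0.1, "Z": 0.9}}
e = {"Y": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
     "Z": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}
x = "AATTAACCGGCCGG"
for _ in range(5):
    a, e, p_x = baum_welch_step(x, a0, a, e)
    print(p_x)
```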
Motif finding
Combinatorial control
Complex integration of multiple cis-regulatory signals controls gene activity
Motif model $\theta_{1..W,\,1..4}$ : NCACGTGN

      1     2     3     4     5     6     7     8
  A  0.3   0.01  0.8   0.03  0.04  0.04  0.03  0.2
  C  0.4   0.9   0.1   0.8   0.01  0.01  0.02  0.2
  G  0.2   0.04  0.05  0.03  0.9   0.05  0.9   0.4
  T  0.1   0.05  0.05  0.04  0.05  0.9   0.05  0.2

Background model $\theta_{0,\,1..4}$ : A 0.32, C 0.16, G 0.24, T 0.28

Sequence model: one occurrence per sequence
\[
P(S \mid a, \theta, \theta_0, W) = P_{bg1}\, P_{\mathrm{Motif}}\, P_{bg2}
\]
\[
P_{\mathrm{Motif}} = \prod_{j=1}^{W} q_{j,\, b_{a+j-1}}, \qquad
P_{bg1} = \prod_{j=1}^{a-1} q_{0,\, b_j}, \qquad
P_{bg2} = \prod_{j=a+W}^{L} q_{0,\, b_j}
\]
where $a$ is the motif start position, $b_j$ the symbol at position $j$ of $S$, $q_{j,\cdot}$ the motif columns, and $q_{0,\cdot}$ the background distribution.
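A small sketch of this one-occurrence-per-sequence likelihood (illustrative code with hypothetical function names; the motif and background numbers are the ones from the slide):

```python
import numpy as np

ALPHABET = "ACGT"

# Motif matrix q[base][j] for NCACGTGN (columns 1..W from the slide) and background q0.
motif = np.array([
    [0.3, 0.01, 0.8,  0.03, 0.04, 0.04, 0.03, 0.2],   # A
    [0.4, 0.9,  0.1,  0.8,  0.01, 0.01, 0.02, 0.2],   # C
    [0.2, 0.04, 0.05, 0.03, 0.9,  0.05, 0.9,  0.4],   # G
    [0.1, 0.05, 0.05, 0.04, 0.05, 0.9,  0.05, 0.2],   # T
])
background = {"A": 0.32, "C": 0.16, "G": 0.24, "T": 0.28}
W = motif.shape[1]

def one_occurrence_likelihood(S, a):
    """P(S | a): motif starts at position a (0-based), background everywhere else."""
    p = 1.0
    for j, base in enumerate(S):
        if a <= j < a + W:
            p *= motif[ALPHABET.index(base), j - a]
        else:
            p *= background[base]
    return p

S = "TTACACGTGATT"
# Posterior over the start position under a uniform prior P(a) = 1/(L - W + 1).
scores = np.array([one_occurrence_likelihood(S, a) for a in range(len(S) - W + 1)])
print(scores / scores.sum())
```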
Iterative motif discovery
Initialization
Sequences
Random motif matrix
Iteration
Sequence scoring
Alignment update
Motif instances
Motif matrix
Termination
Convergence of the alignment and of the motif matrix
Multiple EM for Motif Elicitation
(MEME)
MEME
Expectation-Maximization
Data = set of independent sequences
Likelihood = “one occurrence per sequence” model
Parameters = motif matrix (+ background model)
Missing data = alignment
\[
\theta^{i+1}_{\mathrm{EM}} = \arg\max_\theta \sum_{k=1}^{K} \sum_{a_k=1}^{L_k} P(a_k \mid S_k, \theta^i_{\mathrm{EM}})\, \ln \big[ P(S_k \mid a_k, \theta)\, P(a_k \mid \theta) \big]
\]
\[
\ln P(S_k \mid a_k, \theta) = \sum_{i=1}^{a_k - 1} \ln q_{0,\, b_i^k}
+ \sum_{i=a_k}^{a_k + W - 1} \ln q_{i - a_k + 1,\, b_i^k}
+ \sum_{i=a_k + W}^{L_k} \ln q_{0,\, b_i^k}
\]
MEME
Sequence scoring (per sequence)
Uniform prior
Sequence scoring for uniform prior
\[
P(a_k \mid S_k, \theta) = \frac{P(S_k \mid a_k, \theta)\, P(a_k \mid \theta)}{\sum_{q=1}^{L_k} P(S_k \mid q, \theta)\, P(q \mid \theta)}
\]
\[
P(a_k \mid \theta) = \frac{1}{L_k}
\]
\[
P(a_k \mid S_k, \theta) = \frac{P(S_k \mid a_k, \theta)}{\sum_{q=1}^{L_k} P(S_k \mid q, \theta)} \equiv W_k(a_k)
\]
MEME
Expectation
Maximization - intuitively
If we had only one alignment
Background model: observed frequencies at the background positions
Motif matrix: observed frequencies at the aligned positions
Here: sum over all possible alignments (independently for each sequence)
Weighted sum:
\[
Q(\theta, \theta^i_{\mathrm{EM}}) = \sum_k \sum_{a_k} W_k^{(i)}(a_k)\, \ln P(S_k \mid a_k, \theta)
\]
with weights $W_k^{(i)}(a_k) = P(a_k \mid S_k, \theta^i_{\mathrm{EM}})$.
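A minimal sketch of MEME-style EM under the one-occurrence-per-sequence model (illustrative code with a hypothetical function name; the real MEME adds pseudocounts, multiple starting points, and an erasing step for additional motifs, and often keeps the background fixed rather than re-estimating it as done here):

```python
import numpy as np

ALPHABET = "ACGT"

def meme_em(sequences, W, n_iter=30, seed=0):
    """EM for one motif of width W, one occurrence per sequence, uniform start prior."""
    rng = np.random.default_rng(seed)
    motif = rng.dirichlet(np.ones(4), size=W).T          # 4 x W motif matrix q_{j,b}
    background = np.full(4, 0.25)                        # background q_{0,b}

    seqs = [np.array([ALPHABET.index(b) for b in s]) for s in sequences]
    for _ in range(n_iter):
        counts = np.zeros((4, W))
        bg_counts = np.zeros(4)
        for s in seqs:
            starts = np.arange(len(s) - W + 1)
            # E-step: posterior weight W_k(a) of each start position a (uniform prior).
            scores = np.array([
                np.prod(motif[s[a:a + W], np.arange(W)]) *
                np.prod(background[np.delete(s, np.arange(a, a + W))])
                for a in starts])
            weights = scores / scores.sum()
            # M-step accumulation: expected letter counts at motif / background positions.
            for a, w in zip(starts, weights):
                np.add.at(counts, (s[a:a + W], np.arange(W)), w)
                np.add.at(bg_counts, np.delete(s, np.arange(a, a + W)), w)
        motif = counts / counts.sum(axis=0, keepdims=True)
        background = bg_counts / bg_counts.sum()
    return motif, background

# Toy usage: sequences sharing an ACGTAC-like core at varying positions.
seqs = ["TTACGTACGG", "ACGTACTTTT", "GGGACGTACA", "CCACGTACCC"]
print(np.round(meme_em(seqs, W=6)[0], 2))
```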
Summary
The abstract Expectation-Maximization algorithm
EM interpretation of Baum-Welch training for HMMs
EM for motif finding
MEME