Expectation-Maximization for HMMs and Motif Discovery
Yves Moreau
Overview
The general Expectation-Maximization algorithm
EM interpretation of the Baum-Welch algorithm for the learning of HMMs
MEME for motif discovery
The general EM algorithm
EM algorithm
Maximum likelihood estimation
Let us assume we have an algorithm that tries to optimize the likelihood
Let us look at the change in likelihood between two iterations of the algorithm
\[
\theta^* = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \ln P(D \mid \theta)
\]
\[
\Delta(\theta^{i+1}, \theta^i) = \ln P(D \mid \theta^{i+1}) - \ln P(D \mid \theta^i) = \ln \frac{P(D \mid \theta^{i+1})}{P(D \mid \theta^i)}
\]
EM algorithm
The likelihood is sometimes difficult to compute
We use a simpler generative model based on unobserved data (data augmentation)
We try to integrate out the unobserved data
The expectation can be an integral or a sum
\[
\Delta(\theta^{i+1}, \theta^i) = \ln \frac{\sum_m P(D, m \mid \theta^{i+1})}{P(D \mid \theta^i)}
\]
\[
P(D \mid \theta^{i+1}) = \int_m P(D, m \mid \theta^{i+1})\, dm = \int_m P(D \mid m, \theta^{i+1})\, P(m \mid \theta^{i+1})\, dm
\]
\[
P(D \mid \theta^{i+1}) = \sum_m P(D, m \mid \theta^{i+1}) = \sum_m P(D \mid m, \theta^{i+1})\, P(m \mid \theta^{i+1})
\]
EM algorithm
Without loss of generality, we work with a sum
Problem: the expression contains the logarithm of a sum
Jensen’s inequality
\[
\Delta(\theta^{i+1}, \theta^i) = \ln \frac{\sum_m P(D, m \mid \theta^{i+1})}{P(D \mid \theta^i)}
\]
For $\lambda_j \ge 0$ with $\sum_j \lambda_j = 1$:
\[
\ln \sum_j \lambda_j y_j \ge \sum_j \lambda_j \ln y_j
\]
EM algorithm
Application of Jensen’s inequality
This gives a lower bound for the variation of the likelihood
Choosing $\lambda_m = P(m \mid D, \theta^i)$:
\[
\Delta(\theta^{i+1}, \theta^i) = \ln \sum_m P(m \mid D, \theta^i)\, \frac{P(D, m \mid \theta^{i+1})}{P(D \mid \theta^i)\, P(m \mid D, \theta^i)}
\]
\[
\Delta(\theta^{i+1}, \theta^i) \ge \delta(\theta^{i+1}, \theta^i) = \sum_m P(m \mid D, \theta^i)\, \ln \frac{P(D, m \mid \theta^{i+1})}{P(D \mid \theta^i)\, P(m \mid D, \theta^i)}
\]
\[
\ln P(D \mid \theta^{i+1}) \ge \ln P(D \mid \theta^i) + \delta(\theta^{i+1}, \theta^i)
\]
EM algorithm
Let us try to maximize (the bound on) the variation
\[
\delta(\theta, \theta^i) = \sum_m P(m \mid D, \theta^i)\, \ln \frac{P(D, m \mid \theta)}{P(D \mid \theta^i)\, P(m \mid D, \theta^i)}
\]
\[
\theta^{i+1}_{\mathrm{EM}} = \arg\max_\theta \delta(\theta, \theta^i_{\mathrm{EM}})
= \arg\max_\theta \sum_m P(m \mid D, \theta^i_{\mathrm{EM}})\, \ln \frac{P(D, m \mid \theta)}{P(D \mid \theta^i_{\mathrm{EM}})\, P(m \mid D, \theta^i_{\mathrm{EM}})}
\]
\[
= \arg\max_\theta \sum_m P(m \mid D, \theta^i_{\mathrm{EM}})\, \ln P(D, m \mid \theta)
= \arg\max_\theta Q(\theta, \theta^i_{\mathrm{EM}})
= \arg\max_\theta E_{m \mid D, \theta^i_{\mathrm{EM}}}\big[\ln P(D, m \mid \theta)\big]
\]
The terms $P(D \mid \theta^i_{\mathrm{EM}})$ and $P(m \mid D, \theta^i_{\mathrm{EM}})$ are independent of $\theta$ and can be dropped from the maximization.
Optimization strategy of EM
[Figure: the log-likelihood $\ln P(D \mid \theta)$ as a function of $\theta$, together with the lower bound constructed at $\theta^i_{\mathrm{EM}}$, namely $\ln P(D \mid \theta^i_{\mathrm{EM}}) + \delta(\theta, \theta^i_{\mathrm{EM}})$. The bound touches the log-likelihood at $\theta^i_{\mathrm{EM}}$; its maximizer is the next iterate $\theta^{i+1}_{\mathrm{EM}}$, at which the log-likelihood has increased.]
EM for independent records
If the data set consists of N independent records, we can introduce independent unobserved data
The expectation step (including the use of Jensen’s inequality) takes place “inside” the summation over all records
\[
\theta^* = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \ln P(D \mid \theta)
= \arg\max_\theta \ln \prod_n P(x_n \mid \theta)
= \arg\max_\theta \sum_n \ln P(x_n \mid \theta)
\]
\[
\theta^{i+1}_{\mathrm{EM}} = \arg\max_\theta \sum_{n=1}^{N} \sum_{m_n} P(m_n \mid x_n, \theta^i_{\mathrm{EM}})\, \ln P(x_n, m_n \mid \theta)
\]
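To make the recipe concrete, here is a minimal EM sketch for independent records (not part of the original slides), assuming a hypothetical two-biased-coins mixture: the unobserved datum $m_n$ is which coin produced record $n$, the E-step computes $P(m_n \mid x_n, \theta^i)$, and the M-step maximizes the resulting Q by weighted maximum likelihood.

```python
import numpy as np

def em_two_coins(heads, tosses, n_iter=50):
    """EM for a mixture of two biased coins (illustrative sketch, hypothetical example).

    heads[n], tosses[n]: number of heads and tosses in record n.
    Unobserved data m_n in {0, 1}: which coin generated record n.
    """
    rng = np.random.default_rng(0)
    p = rng.uniform(0.2, 0.8, size=2)   # coin biases (the parameters theta)
    w = np.array([0.5, 0.5])            # mixture weights

    for _ in range(n_iter):
        # E-step: responsibilities P(m_n = j | x_n, theta^i), computed in log space.
        log_lik = (heads[:, None] * np.log(p) +
                   (tosses - heads)[:, None] * np.log(1 - p) +
                   np.log(w))
        log_lik -= log_lik.max(axis=1, keepdims=True)
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: maximize Q, i.e. weighted maximum-likelihood updates.
        w = resp.mean(axis=0)
        p = (resp * heads[:, None]).sum(axis=0) / (resp * tosses[:, None]).sum(axis=0)
    return w, p

# Toy usage: 10-toss records generated by two coins with different biases.
heads = np.array([9, 8, 2, 1, 8, 3, 9, 2])
tosses = np.full(8, 10)
print(em_two_coins(heads, tosses))
```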
Convergence of the EM algorithm
Likelihood increases monotonically
An equilibrium $\theta^*$ of EM is a maximum of $\ln P(D \mid \theta^*) + \delta(\theta, \theta^*)$
Thus $\theta^*$ must be a stationary point of the likelihood (because the bound is tangent to the log-likelihood at $\theta^*$)
No guarantee of a global optimum (often only a local maximum)
In some cases the stationary point is not even a maximum (e.g., a saddle point)
\[
\delta(\theta^{i+1}_{\mathrm{EM}}, \theta^i_{\mathrm{EM}}) \ge \delta(\theta^i_{\mathrm{EM}}, \theta^i_{\mathrm{EM}}) = 0
\]
\[
\ln P(D \mid \theta^{i+1}_{\mathrm{EM}}) \ge \ln P(D \mid \theta^i_{\mathrm{EM}})
\]
Why EM?
EM serves to find a maximum likelihood solution
This can also be achieved by gradient descent
But the computation of the gradients of the likelihood $P(D \mid \theta)$ is often difficult
By introducing the unobserved data in the EM algorithm, the Expectation step can be computed much more easily
Generalized EM
It is not absolutely necessary to maximize $Q(\theta, \theta^i)$ at the Maximization step
If $Q(\theta^{i+1}, \theta^i) \ge Q(\theta^i, \theta^i)$, convergence can also be achieved
This is the generalized EM algorithm
This algorithm is applied when the results of the
Expectation step are too complex to maximize directly
Baum-Welch algorithm
Hidden Markov Models
[Figure: a small HMM in which each state carries an emission table over {A, C, G, T} — (A .8, C 0, G 0, T .2), (A 0, C .8, G .2, T 0), (A .8, C .2, G 0, T 0), (A 1, C 0, G 0, T 0), (A 0, C 0, G .2, T .8), (A 0, C .8, G .2, T 0), (A .2, C .4, G .2, T .2) — connected by transition probabilities 1.0, 1.0, .6, .4, .6, .4, 1.0, 1.0. The arrows are labeled with the transition probabilities and the state tables with the emission probabilities.]
Sequence score
\[
P(\mathrm{ACACATC}) = 0.8 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \cdot 0.6 \cdot 0.4 \cdot 0.6 \cdot 1 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \approx 0.0472
\]
Hidden Markov Model
In a hidden Markov model, we observe the symbol sequence x but we want to reconstruct the hidden state sequence (the path $\pi$)
Transition probabilities: $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$ (begin: $a_{0l}$, end: $a_{k0}$)
Emission probabilities: $e_k(b) = P(x_i = b \mid \pi_i = k)$
Joint probability of the sequence $x = x_1, \ldots, x_L$ and the path $\pi$:
\[
P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}, \qquad \pi_{L+1} = 0
\]
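As a small illustration (not from the slides), the sketch below scores one sequence/path pair with this formula for a hypothetical two-state HMM; the state names and probabilities are made up for the example.

```python
# Hypothetical two-state HMM (illustrative numbers, not the model from the figure).
a = {('0', 'Y'): 1.0,                      # begin -> Y
     ('Y', 'Y'): 0.7, ('Y', 'Z'): 0.2, ('Y', '0'): 0.1,
     ('Z', 'Y'): 0.3, ('Z', 'Z'): 0.6, ('Z', '0'): 0.1}
e = {'Y': {'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4},
     'Z': {'A': 0.05, 'C': 0.45, 'G': 0.45, 'T': 0.05}}

def joint_probability(x, pi):
    """P(x, pi) = a_{0,pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}, with pi_{L+1} = 0."""
    p = a[('0', pi[0])]
    for i, (state, symbol) in enumerate(zip(pi, x)):
        next_state = pi[i + 1] if i + 1 < len(pi) else '0'
        p *= e[state][symbol] * a[(state, next_state)]
    return p

print(joint_probability("ACGT", ["Y", "Z", "Z", "Y"]))
```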
The forward algorithm
The forward algorithm lets us compute the probability P(x) of a sequence w.r.t. an HMM
This is important for the computation of posterior probabilities and the comparison of HMMs
The sum over all paths (exponentially many) can be computed by dynamic programming
Let us define fk(i) as the probability of the sequence for the paths that end in state k with the emission of symbol xi
Then we can compute this probability as
\[
P(x) = \sum_\pi P(x, \pi)
\]
\[
f_k(i) = P(x_1, \ldots, x_i, \pi_i = k)
\]
\[
f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i)\, a_{kl}
\]
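A minimal forward-algorithm sketch in Python (illustrative only: hypothetical state names and probabilities; initialization and termination follow the recursion above with an explicit begin/end state 0):

```python
import numpy as np

def forward(x, states, a0, a, a_end, e):
    """Forward algorithm: returns P(x) and f[k][i] = P(x_1..x_i, pi_i = k).

    a0[k]: begin->k, a[k][l]: k->l, a_end[k]: k->end, e[k][b]: emission probabilities.
    """
    L = len(x)
    f = {k: np.zeros(L) for k in states}
    for k in states:                                   # initialization (i = 1)
        f[k][0] = a0[k] * e[k][x[0]]
    for i in range(1, L):                              # recursion
        for l in states:
            f[l][i] = e[l][x[i]] * sum(f[k][i - 1] * a[k][l] for k in states)
    p_x = sum(f[k][L - 1] * a_end[k] for k in states)  # termination
    return p_x, f

# Hypothetical two-state example.
states = ["Y", "Z"]
a0 = {"Y": 0.5, "Z": 0.5}
a = {"Y": {"Y": 0.8, "Z": 0.1}, "Z": {"Y": 0.1, "Z": 0.8}}
a_end = {"Y": 0.1, "Z": 0.1}
e = {"Y": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
     "Z": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05}}
print(forward("ACGT", states, a0, a, a_end, e)[0])
```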
The backward algorithm
The backward algorithm lets us compute the probability of the complete sequence together with the condition that symbol xi is emitted from state k
This is important to compute the probability of a given state at symbol xi
P(x1,...,xi, πi = k) can be computed by the forward algorithm as fk(i)
Let us define bk(i) as the probability of the rest of the sequence for the paths that pass through state k at symbol xi
\[
P(x, \pi_i = k) = P(x_1, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_L \mid x_1, \ldots, x_i, \pi_i = k)
= P(x_1, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_L \mid \pi_i = k)
\]
\[
b_k(i) = P(x_{i+1}, \ldots, x_L \mid \pi_i = k)
\]
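A matching backward-algorithm sketch (same hypothetical conventions as the forward sketch above; the recursion $b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$ is the standard one and is assumed here, not shown explicitly on the slide):

```python
def backward(x, states, a, a_end, e):
    """Backward algorithm: b[k][i] = P(x_{i+1}..x_L | pi_i = k)."""
    L = len(x)
    b = {k: [0.0] * L for k in states}
    for k in states:                                 # initialization at i = L
        b[k][L - 1] = a_end[k]
    for i in range(L - 2, -1, -1):                   # recursion
        for k in states:
            b[k][i] = sum(a[k][l] * e[l][x[i + 1]] * b[l][i + 1] for l in states)
    return b

# Combined with the forward values f_k(i), the posterior probability of a state is
#   P(pi_i = k | x) = f_k(i) * b_k(i) / P(x).
```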
EM interpretation of Baum-Welch
We want to estimate the parameters $\theta$ of the hidden Markov model (transition and emission probabilities) that maximize the likelihood of the sequence(s)
Unobserved data = the paths $\pi$:
EM algorithm
\[
\theta^* = \arg\max_\theta P(x \mid \theta)
\]
\[
\theta^{i+1}_{\mathrm{EM}} = \arg\max_\theta Q(\theta, \theta^i_{\mathrm{EM}})
= \arg\max_\theta \sum_\pi P(\pi \mid x, \theta^i_{\mathrm{EM}})\, \ln P(x, \pi \mid \theta)
\]
\[
P(x) = \sum_\pi P(x, \pi)
\]
EM interpretation of Baum-Welch
Let us work out the function Q further
The generative model gives the joint probability of the sequence and the path
Define $A_{kl}(\pi)$, the number of times the transition $k \to l$ gets used in path $\pi$
Define $E_k(b, \pi)$, the number of times the emission of symbol $b$ from state $k$ is observed for the given sequence and path $\pi$
\[
Q(\theta, \theta^i) = \sum_\pi P(\pi \mid x, \theta^i)\, \ln P(x, \pi \mid \theta)
\]
EM interpretation of Baum-Welch
The joint probability of the sequence and the path can be written as
By taking the logarithm, the function Q becomes
\[
P(x, \pi \mid \theta) = \prod_{k=1}^{M} \prod_{b} e_k(b)^{E_k(b, \pi)} \prod_{k=0}^{M} \prod_{l=1}^{M} a_{kl}^{A_{kl}(\pi)}
\]
\[
Q(\theta, \theta^i) = \sum_\pi P(\pi \mid x, \theta^i) \left[ \sum_{k=1}^{M} \sum_{b} E_k(b, \pi) \ln e_k(b) + \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl}(\pi) \ln a_{kl} \right]
\]
EM interpretation of Baum-Welch
Define the expected number of times that a transition gets used (independently of the path)
\[
A_{kl} = \sum_\pi P(\pi \mid x, \theta^i)\, A_{kl}(\pi)
\]
Define the expected number of times that an emission is observed (independently of the path)
\[
E_k(b) = \sum_\pi P(\pi \mid x, \theta^i)\, E_k(b, \pi)
\]
EM interpretation of Baum-Welch
For the function Q, we have
\[
Q(\theta, \theta^i) = \sum_\pi P(\pi \mid x, \theta^i) \left[ \sum_{k=1}^{M} \sum_b E_k(b, \pi) \ln e_k(b) + \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl}(\pi) \ln a_{kl} \right]
\]
Given that $P(\pi \mid x, \theta^i)$ is independent of k, l, and b, we can reorder the sums and use the definitions of $A_{kl}$ and $E_k(b)$
\[
Q(\theta, \theta^i) = \sum_{k=1}^{M} \sum_b E_k(b) \ln e_k(b) + \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl} \ln a_{kl}
\]
Let us now maximize Q with respect to $a_{kl}$ and $e_k(b)$
EM interpretation of Baum-Welch
Let us look at the A term
Let us define the following candidate for the optimum
Compare with other parameter choices
\[
a_{kl}^0 = \frac{A_{kl}}{\sum_m A_{km}}
\]
\[
\sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl} \ln a_{kl}^0 - \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl} \ln a_{kl}
= \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl} \ln \frac{a_{kl}^0}{a_{kl}}
= \sum_{k=0}^{M} \Big( \sum_m A_{km} \Big) \sum_{l=1}^{M} a_{kl}^0 \ln \frac{a_{kl}^0}{a_{kl}}
\]
EM interpretation of Baum-Welch
The previous sum has the form of a relative entropy, which is always non-negative (since $a_{kl}^0$ and $a_{kl}$ are probability distributions over $l$)
Our candidate thus maximizes the A term
Identical procedure for the E term
\[
\sum_{l=1}^{M} a_{kl}^0 \ln \frac{a_{kl}^0}{a_{kl}} \ge 0
\]
\[
e_k^0(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}
\]
EM interpretation of Baum-Welch
Baum-Welch
Expectation step
Compute the expected number of times that a transition gets used
Compute the expected number of times that an emission is observed
Use the forward and backward algorithm for this
Maximization step
Update the parameters with the normalized counts
\[
A_{kl} = \sum_\pi P(\pi \mid x, \theta^i)\, A_{kl}(\pi), \qquad
E_k(b) = \sum_\pi P(\pi \mid x, \theta^i)\, E_k(b, \pi)
\]
\[
a_{kl}^{i+1} = \frac{A_{kl}}{\sum_m A_{km}}, \qquad
e_k^{i+1}(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}
\]
EM interpretation of Baum-Welch
For the transitions
For the emissions
\[
P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x \mid \theta)}
\]
\[
A_{kl} = \sum_j \frac{1}{P(x^j \mid \theta)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x_{i+1}^j)\, b_l^j(i+1)
\]
\[
P(\pi_i = k \mid x, \theta) = \frac{f_k(i)\, b_k(i)}{P(x \mid \theta)}
\]
\[
E_k(b) = \sum_j \frac{1}{P(x^j \mid \theta)} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i)
\]
where $j$ indexes the training sequences $x^j$.
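Putting the pieces together, here is a compact Baum-Welch sketch (illustrative only: a hypothetical two-state model, one training sequence, no explicit end state, fixed initial probabilities, and no pseudocounts or scaling, all of which a real implementation would handle):

```python
import numpy as np

STATES = ["Y", "Z"]
ALPHABET = "ACGT"

def forward_backward(x, a0, a, e):
    """Forward/backward without an explicit end state (simplification)."""
    L = len(x)
    f = {k: np.zeros(L) for k in STATES}
    b = {k: np.ones(L) for k in STATES}
    for k in STATES:
        f[k][0] = a0[k] * e[k][x[0]]
    for i in range(1, L):
        for l in STATES:
            f[l][i] = e[l][x[i]] * sum(f[k][i - 1] * a[k][l] for k in STATES)
    for i in range(L - 2, -1, -1):
        for k in STATES:
            b[k][i] = sum(a[k][l] * e[l][x[i + 1]] * b[l][i + 1] for l in STATES)
    p_x = sum(f[k][L - 1] for k in STATES)
    return f, b, p_x

def baum_welch_step(x, a0, a, e):
    """One EM iteration: expected counts A_kl and E_k(b), then normalized updates."""
    f, b, p_x = forward_backward(x, a0, a, e)
    L = len(x)
    # Expectation: expected transition and emission counts from forward/backward values.
    A = {k: {l: sum(f[k][i] * a[k][l] * e[l][x[i + 1]] * b[l][i + 1]
                    for i in range(L - 1)) / p_x for l in STATES} for k in STATES}
    E = {k: {c: sum(f[k][i] * b[k][i] / p_x for i in range(L) if x[i] == c)
             for c in ALPHABET} for k in STATES}
    # Maximization: normalize the counts.
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in STATES} for k in STATES}
    new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in ALPHABET} for k in STATES}
    return new_a, new_e, p_x

# Toy usage: the likelihood P(x | theta) should increase from iteration to iteration.
a0 = {"Y": 0.5, "Z": 0.5}
a = {"Y": {"Y": 0.9, "Z": 0.1}, "Z": {"Y": 0.1, "Z": 0.9}}
e = {"Y": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
     "Z": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}
x = "AATTAACCGGCCGG"
for _ in range(5):
    a, e, p_x = baum_welch_step(x, a0, a, e)
    print(p_x)
```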
Motif finding
Combinatorial control
Complex integration of multiple cis-regulatory signals controls gene activity
Motif model $\theta_{1..W,\,1..4}$ : NCACGTGN

      1     2     3     4     5     6     7     8
  A  0.3   0.01  0.8   0.03  0.04  0.04  0.03  0.2
  C  0.4   0.9   0.1   0.8   0.01  0.01  0.02  0.2
  G  0.2   0.04  0.05  0.03  0.9   0.05  0.9   0.4
  T  0.1   0.05  0.05  0.04  0.05  0.9   0.05  0.2

Background model $\theta_{0,\,1..4}$ : A 0.32, C 0.16, G 0.24, T 0.28

Sequence model: one occurrence per sequence
\[
P(S \mid a, \theta, \theta_0, W) = P_{bg1}\, P_{\mathrm{Motif}}\, P_{bg2}
\]
\[
P_{\mathrm{Motif}} = \prod_{j=1}^{W} q_{j,\, b_{a+j-1}}, \qquad
P_{bg1} = \prod_{j=1}^{a-1} q_{0,\, b_j}, \qquad
P_{bg2} = \prod_{j=a+W}^{L} q_{0,\, b_j}
\]
where $a$ is the motif start position, $b_j$ the symbol at position $j$ of $S$, $q_{j,\cdot}$ the motif columns, and $q_{0,\cdot}$ the background distribution.
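A small sketch of this one-occurrence-per-sequence likelihood (illustrative code with hypothetical function names; the motif and background numbers are the ones from the slide):

```python
import numpy as np

ALPHABET = "ACGT"

# Motif matrix q[base][j] for NCACGTGN (columns 1..W from the slide) and background q0.
motif = np.array([
    [0.3, 0.01, 0.8,  0.03, 0.04, 0.04, 0.03, 0.2],   # A
    [0.4, 0.9,  0.1,  0.8,  0.01, 0.01, 0.02, 0.2],   # C
    [0.2, 0.04, 0.05, 0.03, 0.9,  0.05, 0.9,  0.4],   # G
    [0.1, 0.05, 0.05, 0.04, 0.05, 0.9,  0.05, 0.2],   # T
])
background = {"A": 0.32, "C": 0.16, "G": 0.24, "T": 0.28}
W = motif.shape[1]

def one_occurrence_likelihood(S, a):
    """P(S | a): motif starts at position a (0-based), background everywhere else."""
    p = 1.0
    for j, base in enumerate(S):
        if a <= j < a + W:
            p *= motif[ALPHABET.index(base), j - a]
        else:
            p *= background[base]
    return p

S = "TTACACGTGATT"
# Posterior over the start position under a uniform prior P(a) = 1/(L - W + 1).
scores = np.array([one_occurrence_likelihood(S, a) for a in range(len(S) - W + 1)])
print(scores / scores.sum())
```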
Iterative motif discovery
Initialization
Sequences
Random motif matrix
Iteration
Sequence scoring
Alignment update
Motif instances
Motif matrix
Termination
Convergence of the alignment and of the motif matrix
Multiple EM for Motif Elicitation
(MEME)
MEME
Expectation-Maximization
Data = set of independent sequences
Likelihood = “one occurrence per sequence” model
Parameters = motif matrix (+ background model)
Missing data = alignment
\[
\theta^{i+1}_{\mathrm{EM}} = \arg\max_\theta \sum_{k=1}^{K} \sum_{a_k=1}^{L_k} P(a_k \mid S_k, \theta^i_{\mathrm{EM}})\, \ln \big[ P(S_k \mid a_k, \theta)\, P(a_k \mid \theta) \big]
\]
\[
\ln P(S_k \mid a_k, \theta) = \sum_{i=1}^{a_k - 1} \ln q_{0,\, b_i^k}
+ \sum_{i=a_k}^{a_k + W - 1} \ln q_{i - a_k + 1,\, b_i^k}
+ \sum_{i=a_k + W}^{L_k} \ln q_{0,\, b_i^k}
\]
MEME
Sequence scoring (per sequence)
Uniform prior
Sequence scoring for uniform prior
\[
P(a_k \mid S_k, \theta) = \frac{P(S_k \mid a_k, \theta)\, P(a_k \mid \theta)}{\sum_{q=1}^{L_k} P(S_k \mid q, \theta)\, P(q \mid \theta)}
\]
\[
P(a_k \mid \theta) = \frac{1}{L_k}
\]
\[
P(a_k \mid S_k, \theta) = \frac{P(S_k \mid a_k, \theta)}{\sum_{q=1}^{L_k} P(S_k \mid q, \theta)} \equiv W_k(a_k)
\]
MEME
Expectation
Maximization - intuitively
If we had only one alignment
Background model: observed frequencies at the background positions
Motif matrix: observed frequencies at the aligned positions
Here: sum over all possible alignments (independently for each sequence)
Weighted sum:
\[
Q(\theta, \theta^i_{\mathrm{EM}}) = \sum_k \sum_{a_k} W_k^{(i)}(a_k)\, \ln P(S_k \mid a_k, \theta)
\]
with weights $W_k^{(i)}(a_k) = P(a_k \mid S_k, \theta^i_{\mathrm{EM}})$.
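A minimal sketch of MEME-style EM under the one-occurrence-per-sequence model (illustrative code with a hypothetical function name; the real MEME adds pseudocounts, multiple starting points, and an erasing step for additional motifs, and often keeps the background fixed rather than re-estimating it as done here):

```python
import numpy as np

ALPHABET = "ACGT"

def meme_em(sequences, W, n_iter=30, seed=0):
    """EM for one motif of width W, one occurrence per sequence, uniform start prior."""
    rng = np.random.default_rng(seed)
    motif = rng.dirichlet(np.ones(4), size=W).T          # 4 x W motif matrix q_{j,b}
    background = np.full(4, 0.25)                        # background q_{0,b}

    seqs = [np.array([ALPHABET.index(b) for b in s]) for s in sequences]
    for _ in range(n_iter):
        counts = np.zeros((4, W))
        bg_counts = np.zeros(4)
        for s in seqs:
            starts = np.arange(len(s) - W + 1)
            # E-step: posterior weight W_k(a) of each start position a (uniform prior).
            scores = np.array([
                np.prod(motif[s[a:a + W], np.arange(W)]) *
                np.prod(background[np.delete(s, np.arange(a, a + W))])
                for a in starts])
            weights = scores / scores.sum()
            # M-step accumulation: expected letter counts at motif / background positions.
            for a, w in zip(starts, weights):
                np.add.at(counts, (s[a:a + W], np.arange(W)), w)
                np.add.at(bg_counts, np.delete(s, np.arange(a, a + W)), w)
        motif = counts / counts.sum(axis=0, keepdims=True)
        background = bg_counts / bg_counts.sum()
    return motif, background

# Toy usage: sequences sharing an ACGTAC-like core at varying positions.
seqs = ["TTACGTACGG", "ACGTACTTTT", "GGGACGTACA", "CCACGTACCC"]
print(np.round(meme_em(seqs, W=6)[0], 2))
```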
Summary
The abstract Expectation-Maximization algorithm
EM interpretation of Baum-Welch training for HMMs
EM for motif finding
MEME