
(1)

Expectation-Maximization for HMMs and Motif Discovery

Yves Moreau

(2)

2

Overview

The general Expectation-Maximization algorithm

EM interpretation of the Baum-Welch algorithm for the learning of HMMs

MEME for motif discovery

(3)

3

The general EM algorithm

(4)

4

EM algorithm

Maximum likelihood estimation

Let us assume we have an algorithm that tries to optimize the likelihood

Let us look at the change in likelihood between two iterations of the algorithm

$$\theta^{*} = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} \ln P(D \mid \theta)$$

$$\Delta(\theta_{i+1}, \theta_{i}) = \ln P(D \mid \theta_{i+1}) - \ln P(D \mid \theta_{i}) = \ln \frac{P(D \mid \theta_{i+1})}{P(D \mid \theta_{i})} \;\geq\; 0$$

(for an algorithm that increases the likelihood at each iteration)

(5)

5

EM algorithm

The likelihood is sometimes difficult to compute

We use a simpler generative model based on unobserved data (data augmentation)

We try to integrate out the unobserved data

The expectation can be an integral or a sum

$$\Delta(\theta_{i+1}, \theta_{i}) = \ln \mathbb{E}_{m \mid \theta_{i+1}}\!\left[ P(D \mid m, \theta_{i+1}) \right] - \ln P(D \mid \theta_{i})$$

$$\mathbb{E}_{m \mid \theta_{i+1}}\!\left[ P(D \mid m, \theta_{i+1}) \right] = \int_{m} P(D, m \mid \theta_{i+1}) \, dm = \int_{m} P(D \mid m, \theta_{i+1}) \, P(m \mid \theta_{i+1}) \, dm$$

$$\mathbb{E}_{m \mid \theta_{i+1}}\!\left[ P(D \mid m, \theta_{i+1}) \right] = \sum_{m} P(D, m \mid \theta_{i+1}) = \sum_{m} P(D \mid m, \theta_{i+1}) \, P(m \mid \theta_{i+1})$$

(6)

6

EM algorithm

Without loss of generality, we work with a sum

Problem: the expression contains the logarithm of a sum

Jensen’s inequality

$$\Delta(\theta_{i+1}, \theta_{i}) = \ln \frac{\sum_{m} P(D, m \mid \theta_{i+1})}{P(D \mid \theta_{i})}$$

Jensen's inequality (for weights $\lambda_{j} \geq 0$ with $\sum_{j} \lambda_{j} = 1$):

$$\ln \sum_{j} \lambda_{j} y_{j} \;\geq\; \sum_{j} \lambda_{j} \ln y_{j}$$
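As a quick numerical sanity check (not part of the original slides), Jensen's inequality can be verified directly for arbitrary positive values and convex weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary positive values y_j and convex weights lambda_j (lambda_j >= 0, summing to 1).
y = rng.uniform(0.1, 5.0, size=6)
lam = rng.dirichlet(np.ones(6))

lhs = np.log(np.sum(lam * y))   # ln(sum_j lambda_j * y_j)
rhs = np.sum(lam * np.log(y))   # sum_j lambda_j * ln(y_j)
print(lhs, rhs, lhs >= rhs)     # Jensen: the left-hand side is always >= the right-hand side
```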

(7)

7

EM algorithm

Application of Jensen’s inequality

This gives a lower bound for the variation of the likelihood

$$\Delta(\theta_{i+1}, \theta_{i}) = \ln \sum_{m} P(m \mid D, \theta_{i}) \, \frac{P(D, m \mid \theta_{i+1})}{P(D \mid \theta_{i}) \, P(m \mid D, \theta_{i})}$$

$$\Delta(\theta_{i+1}, \theta_{i}) \;\geq\; \sum_{m} P(m \mid D, \theta_{i}) \ln \frac{P(D, m \mid \theta_{i+1})}{P(D \mid \theta_{i}) \, P(m \mid D, \theta_{i})} \;=\; \delta(\theta_{i+1}, \theta_{i})$$

$$\ln P(D \mid \theta_{i+1}) \;\geq\; \ln P(D \mid \theta_{i}) + \delta(\theta_{i+1}, \theta_{i})$$

(8)

8

EM algorithm

Let us try to maximize (the bound on) the variation

$$\delta(\theta, \theta_{i}^{\mathrm{EM}}) = \sum_{m} P(m \mid D, \theta_{i}^{\mathrm{EM}}) \ln \frac{P(D, m \mid \theta)}{P(D \mid \theta_{i}^{\mathrm{EM}}) \, P(m \mid D, \theta_{i}^{\mathrm{EM}})}$$

$$\begin{aligned}
\theta_{i+1}^{\mathrm{EM}} &= \arg\max_{\theta} \, \delta(\theta, \theta_{i}^{\mathrm{EM}}) \\
&= \arg\max_{\theta} \sum_{m} P(m \mid D, \theta_{i}^{\mathrm{EM}}) \ln \frac{P(D, m \mid \theta)}{P(D \mid \theta_{i}^{\mathrm{EM}}) \, P(m \mid D, \theta_{i}^{\mathrm{EM}})} \\
&= \arg\max_{\theta} \sum_{m} P(m \mid D, \theta_{i}^{\mathrm{EM}}) \ln P(D, m \mid \theta) \qquad \text{(the denominator is independent of } \theta\text{)} \\
&= \arg\max_{\theta} \, \mathbb{E}_{m \mid D, \theta_{i}^{\mathrm{EM}}}\!\left[ \ln P(D, m \mid \theta) \right] \\
&= \arg\max_{\theta} \, Q(\theta \mid \theta_{i}^{\mathrm{EM}})
\end{aligned}$$

(9)

9

Optimization strategy of EM

[Figure: optimization strategy of EM. The lower bound $\ln P(D \mid \theta_{i}^{\mathrm{EM}}) + \delta(\theta, \theta_{i}^{\mathrm{EM}})$ (equivalently, up to a constant, $Q(\theta \mid \theta_{i}^{\mathrm{EM}})$) touches the log-likelihood curve $\ln P(D \mid \theta)$ at $\theta_{i}^{\mathrm{EM}}$; maximizing the bound yields $\theta_{i+1}^{\mathrm{EM}}$, with $\ln P(D \mid \theta_{i+1}^{\mathrm{EM}}) \geq \ln P(D \mid \theta_{i}^{\mathrm{EM}})$.]

(10)

10

EM for independent records

If the data set consists of N independent records, we can introduce independent unobserved data

The expectation step (including the use of Jensen’s inequality) takes place “inside” the summation over all records

$$\theta^{*} = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} \ln P(D \mid \theta) = \arg\max_{\theta} \ln \prod_{n} P(x_{n} \mid \theta) = \arg\max_{\theta} \sum_{n} \ln P(x_{n} \mid \theta)$$

$$\theta_{i+1}^{\mathrm{EM}} = \arg\max_{\theta} \sum_{n} \sum_{m_{n}} P(m_{n} \mid x_{n}, \theta_{i}^{\mathrm{EM}}) \ln P(x_{n}, m_{n} \mid \theta)$$

(11)

11

Convergence of the EM algorithm

Likelihood increases monotonically

An equilibrium θ* of EM is a maximum (over θ) of ln P(D|θ*) + δ(θ, θ*)

Thus θ* must be a stationary point of the likelihood (because the bound is tangent to the log-likelihood)

No guarantee of a global optimum (convergence is often to a local optimum only)

In some cases the stationary point is not even a maximum (e.g., a saddle point)

$$\delta(\theta_{i+1}^{\mathrm{EM}}, \theta_{i}^{\mathrm{EM}}) = \max_{\theta} \, \delta(\theta, \theta_{i}^{\mathrm{EM}}) \;\geq\; \delta(\theta_{i}^{\mathrm{EM}}, \theta_{i}^{\mathrm{EM}}) = 0$$

$$\ln P(D \mid \theta_{i+1}^{\mathrm{EM}}) \;\geq\; \ln P(D \mid \theta_{i}^{\mathrm{EM}})$$

(12)

12

Why EM?

EM serves to find a maximum likelihood solution

This can also be achieved by gradient descent

But the computation of the gradient of the likelihood P(D|θ) is often difficult

By introducing the unobserved data in the EM algorithm, we can compute the Expectation step more easily
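To make the abstract recipe concrete, here is a minimal sketch (not from the slides) of EM for a two-component 1-D Gaussian mixture over independent records: the E-step computes the responsibilities P(m_n | x_n, θ_i) of the unobserved component labels, the M-step maximizes Q(θ|θ_i) in closed form, and the printed log-likelihood increases monotonically, as stated on the convergence slide.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Toy EM for a two-component 1-D Gaussian mixture (unobserved data m_n = component labels)."""
    rng = np.random.default_rng(1)
    w = np.array([0.5, 0.5])                       # mixture weights
    mu = rng.choice(x, size=2, replace=False)      # initial means picked from the data
    var = np.array([x.var(), x.var()])             # initial variances
    for it in range(n_iter):
        # E-step: responsibilities P(m_n = k | x_n, theta_i).
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        joint = w * dens                           # P(x_n, m_n = k | theta_i)
        resp = joint / joint.sum(axis=1, keepdims=True)
        if it % 10 == 0:                           # the log-likelihood never decreases
            print(np.log(joint.sum(axis=1)).sum())
        # M-step: closed-form maximization of Q(theta | theta_i).
        Nk = resp.sum(axis=0)
        w = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return w, mu, var

# Usage with data drawn from two Gaussians.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 100)])
print(em_gmm_1d(data))
```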

(13)

13

Generalized EM

It is not absolutely necessary to maximize Q(θ|θ_i^EM) at the Maximization step

If Q(θ_{i+1}^EM | θ_i^EM) ≥ Q(θ_i^EM | θ_i^EM), convergence can also be achieved

This is the generalized EM algorithm

This algorithm is applied when the results of the

Expectation step are too complex to maximize directly

(14)

14

Baum-Welch algorithm

(15)

15

Hidden Markov Models

[Figure: toy HMM. Each state carries an emission probability table (A .8 C 0 G 0 T .2 | A 0 C .8 G .2 T 0 | A .8 C .2 G 0 T 0 | A 1 C 0 G 0 T 0 | A 0 C 0 G .2 T .8 | A 0 C .8 G .2 T 0 | A .2 C .4 G .2 T .2) and the edges carry transition probabilities (1.0, 1.0, .6, .4, .6, .4, 1.0, 1.0).]

Sequence score

$$P(\texttt{ACACATC}) = 0.8 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \cdot 0.6 \cdot 0.4 \cdot 0.6 \cdot 1 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \approx 0.0472$$

(16)

16

Hidden Markov Model

In a hidden Markov model, we observe the symbol

sequence x but we want to reconstruct the hidden state sequence (path π)

Transition probabilities (begin: $a_{0l}$, end: $a_{k0}$):

$$a_{kl} = P(\pi_{i} = l \mid \pi_{i-1} = k)$$

Emission probabilities:

$$e_{k}(b) = P(x_{i} = b \mid \pi_{i} = k)$$

Joint probability of the sequence $x = x_{1}, \ldots, x_{L}$ and the path $\pi$:

$$P(x, \pi) = a_{0\pi_{1}} \prod_{i=1}^{L} e_{\pi_{i}}(x_{i}) \, a_{\pi_{i}\pi_{i+1}} \qquad (\pi_{L+1} = 0)$$
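As a minimal illustration of this formula (the dictionary layout and the two-state toy parameters are assumptions for the sketch, not taken from the slides):

```python
def joint_probability(x, path, a, e):
    """P(x, pi) = a_{0 pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i pi_{i+1}}, with pi_{L+1} = 0 (end state)."""
    p = a[(0, path[0])]                                # begin -> first state
    for i, (state, symbol) in enumerate(zip(path, x)):
        nxt = path[i + 1] if i + 1 < len(path) else 0  # 0 denotes the silent begin/end state
        p *= e[state][symbol] * a[(state, nxt)]
    return p

# Hypothetical two-state model (parameters are illustrative only).
a = {(0, 1): 1.0, (1, 1): 0.6, (1, 2): 0.4, (2, 2): 0.7, (2, 0): 0.3}
e = {1: {"A": 0.8, "C": 0.2}, 2: {"A": 0.1, "C": 0.9}}
print(joint_probability("AAC", [1, 1, 2], a, e))
```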

(17)

17

The forward algorithm

The forward algorithm lets us compute the probability P(x) of a sequence w.r.t. an HMM

This is important for the computation of posterior probabilities and the comparison of HMMs

The sum over all paths (exponentially many) can be computed by dynamic programming

Let us define f_k(i) as the probability of the sequence for the paths that end in state k with the emission of symbol x_i

Then we can compute this probability as

$$P(x) = \sum_{\pi} P(x, \pi)$$

$$f_{k}(i) = P(x_{1}, \ldots, x_{i},\ \pi_{i} = k)$$

$$f_{l}(i+1) = e_{l}(x_{i+1}) \sum_{k} f_{k}(i) \, a_{kl}$$
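A minimal sketch of the forward recursion, assuming the same hypothetical dict-based parameters as in the previous sketch, with state 0 acting as the silent begin/end state:

```python
# Illustrative two-state parameters (hypothetical, same layout as the joint-probability sketch).
a = {(0, 1): 1.0, (1, 1): 0.6, (1, 2): 0.4, (2, 2): 0.7, (2, 0): 0.3}
e = {1: {"A": 0.8, "C": 0.2}, 2: {"A": 0.1, "C": 0.9}}

def forward(x, states, a, e):
    """Forward algorithm: f[k][i] = P(x_1..x_i, pi_i = k); returns P(x), summed over all paths."""
    f = {k: [0.0] * len(x) for k in states}
    for k in states:                                   # initialization from the begin state 0
        f[k][0] = a.get((0, k), 0.0) * e[k][x[0]]
    for i in range(1, len(x)):                         # recursion: f_l(i+1) = e_l(x_{i+1}) sum_k f_k(i) a_kl
        for l in states:
            f[l][i] = e[l][x[i]] * sum(f[k][i - 1] * a.get((k, l), 0.0) for k in states)
    return sum(f[k][len(x) - 1] * a.get((k, 0), 0.0) for k in states)   # termination via the end state

print(forward("AAC", [1, 2], a, e))
```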

(18)

18

The backward algorithm

The backward algorithm lets us compute the probability of the complete sequence jointly with the condition that symbol x_i is emitted from state k

This is important to compute the probability of a given state at symbol xi

P(x_1,...,x_i, π_i = k) can be computed by the forward algorithm as f_k(i)

Let us define b_k(i) as the probability of the rest of the sequence for the paths that pass through state k at symbol x_i

$$P(x,\ \pi_{i} = k) = P(x_{1}, \ldots, x_{i},\ \pi_{i} = k) \, P(x_{i+1}, \ldots, x_{L} \mid x_{1}, \ldots, x_{i},\ \pi_{i} = k) = P(x_{1}, \ldots, x_{i},\ \pi_{i} = k) \, P(x_{i+1}, \ldots, x_{L} \mid \pi_{i} = k)$$

$$b_{k}(i) = P(x_{i+1}, \ldots, x_{L} \mid \pi_{i} = k)$$
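The slide defines b_k(i) but does not spell out its recursion; here is a minimal sketch using the standard backward recursion b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1), again with the hypothetical toy parameters:

```python
# Illustrative two-state parameters (same hypothetical model as in the forward sketch).
a = {(0, 1): 1.0, (1, 1): 0.6, (1, 2): 0.4, (2, 2): 0.7, (2, 0): 0.3}
e = {1: {"A": 0.8, "C": 0.2}, 2: {"A": 0.1, "C": 0.9}}

def backward(x, states, a, e):
    """Backward algorithm: b[k][i] = P(x_{i+1}..x_L | pi_i = k)."""
    L = len(x)
    b = {k: [0.0] * L for k in states}
    for k in states:                                   # initialization: transition to the end state 0
        b[k][L - 1] = a.get((k, 0), 0.0)
    for i in range(L - 2, -1, -1):                     # recursion: b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1)
        for k in states:
            b[k][i] = sum(a.get((k, l), 0.0) * e[l][x[i + 1]] * b[l][i + 1] for l in states)
    return b

# P(x) can also be recovered from the backward variables: sum_k a_0k * e_k(x_1) * b_k(1).
x = "AAC"
b = backward(x, [1, 2], a, e)
print(sum(a.get((0, k), 0.0) * e[k][x[0]] * b[k][0] for k in [1, 2]))
```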

(19)

19

EM interpretation of Baum-Welch

We want to estimate the parameters of the hidden Markov model (transition probabilities and emission probabilities) that maximize the likelihood of the sequence(s)

Unobserved data = paths π

EM algorithm

$$\theta^{*} = \arg\max_{\theta} P(x \mid \theta)$$

$$\theta_{i+1}^{\mathrm{EM}} = \arg\max_{\theta} Q(\theta \mid \theta_{i}^{\mathrm{EM}}) = \arg\max_{\theta} \sum_{\pi} P(\pi \mid x, \theta_{i}^{\mathrm{EM}}) \ln P(x, \pi \mid \theta)$$

$$P(x \mid \theta) = \sum_{\pi} P(x, \pi \mid \theta)$$

(20)

20

EM interpretation of Baum-Welch

Let us work out the function Q further

The generative model gives the joint probability of the sequence and the path

Define A_kl(π), the number of times that a given transition gets used for a given path

Define E_k(b, π), the number of times that a given emission is observed for a given sequence and a given path

$$Q(\theta \mid \theta_{i}^{\mathrm{EM}}) = \sum_{\pi} P(\pi \mid x, \theta_{i}^{\mathrm{EM}}) \ln P(x, \pi \mid \theta)$$

$$A_{kl}(\pi), \qquad E_{k}(b, \pi)$$
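As a small illustration (not from the slides) of these counts, they can be tallied directly from a sequence and a path; the toy path and state labels are hypothetical:

```python
from collections import Counter

def path_counts(x, path):
    """Tally A_kl(pi) (transition usage, including begin/end state 0) and E_k(b, pi) (emissions)."""
    A = Counter()
    E = Counter()
    full_path = [0] + list(path) + [0]           # add the silent begin/end state
    for k, l in zip(full_path[:-1], full_path[1:]):
        A[(k, l)] += 1
    for k, b in zip(path, x):
        E[(k, b)] += 1
    return A, E

A, E = path_counts("AAC", [1, 1, 2])
print(dict(A))   # {(0, 1): 1, (1, 1): 1, (1, 2): 1, (2, 0): 1}
print(dict(E))   # {(1, 'A'): 2, (2, 'C'): 1}
```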

(21)

21

EM interpretation of Baum-Welch

The joint probability of the sequence and the path can be written as

$$P(x, \pi \mid \theta) = \prod_{k=1}^{M} \prod_{b} e_{k}(b)^{E_{k}(b,\pi)} \; \prod_{k=0}^{M} \prod_{l=1}^{M} a_{kl}^{A_{kl}(\pi)}$$

By taking the logarithm, the function Q becomes

$$Q(\theta \mid \theta_{i}^{\mathrm{EM}}) = \sum_{\pi} P(\pi \mid x, \theta_{i}^{\mathrm{EM}}) \left[ \sum_{k=1}^{M} \sum_{b} E_{k}(b,\pi) \ln e_{k}(b) + \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl}(\pi) \ln a_{kl} \right]$$

(22)

22

EM interpretation of Baum-Welch

Define the expected number of times that a transition gets used (independently of the path)

Define the expected number of times that an emission is observed (independently of the path)

$$A_{kl} = \sum_{\pi} P(\pi \mid x, \theta_{i}^{\mathrm{EM}}) \, A_{kl}(\pi)$$

$$E_{k}(b) = \sum_{\pi} P(\pi \mid x, \theta_{i}^{\mathrm{EM}}) \, E_{k}(b, \pi)$$

(23)

23

EM interpretation of Baum-Welch

For the function Q, we have

Given that P(π | x, θ_i^EM) is independent of k and b, we can reorder the sums and use the definitions of A_kl and E_k(b)

Let us now maximize Q w.r.t. θ: a_kl, e_k(b)

$$Q(\theta \mid \theta_{i}^{\mathrm{EM}}) = \sum_{\pi} P(\pi \mid x, \theta_{i}^{\mathrm{EM}}) \left[ \sum_{k=1}^{M} \sum_{b} E_{k}(b,\pi) \ln e_{k}(b) + \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl}(\pi) \ln a_{kl} \right]$$

$$Q(\theta \mid \theta_{i}^{\mathrm{EM}}) = \sum_{k=1}^{M} \sum_{b} E_{k}(b) \ln e_{k}(b) + \sum_{k=0}^{M} \sum_{l=1}^{M} A_{kl} \ln a_{kl}$$

(24)

24

EM interpretation of Baum-Welch

Let us look at the A term

Let us define the following candidate for the optimum

$$a_{kl}^{0} = \frac{A_{kl}}{\sum_{m} A_{km}}$$

Compare with other parameter choices

$$\sum_{k=0}^{M}\sum_{l=1}^{M} A_{kl} \ln a_{kl}^{0} - \sum_{k=0}^{M}\sum_{l=1}^{M} A_{kl} \ln a_{kl} = \sum_{k=0}^{M}\sum_{l=1}^{M} A_{kl} \ln \frac{a_{kl}^{0}}{a_{kl}} = \sum_{k=0}^{M} \left( \sum_{m} A_{km} \right) \sum_{l=1}^{M} a_{kl}^{0} \ln \frac{a_{kl}^{0}}{a_{kl}}$$

(25)

25

EM interpretation of Baum-Welch

The previous sum has the form of a relative entropy and is therefore nonnegative

Our candidate thus maximizes the A term

Identical procedure for the E term

$$\sum_{l=1}^{M} a_{kl}^{0} \ln \frac{a_{kl}^{0}}{a_{kl}} \;\geq\; 0$$

$$e_{k}^{0}(b) = \frac{E_{k}(b)}{\sum_{b'} E_{k}(b')}$$
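A quick numerical check (not from the slides) of this argument: normalizing the expected counts never scores worse on the A term than any other transition distribution, because the relative-entropy sum above is nonnegative.

```python
import numpy as np

rng = np.random.default_rng(2)

A_k = rng.uniform(0.5, 5.0, size=4)      # expected transition counts A_kl out of one state k
a0 = A_k / A_k.sum()                     # candidate: normalized counts
a_other = rng.dirichlet(np.ones(4))      # any other valid transition distribution

# A-term contribution of Q for both choices: sum_l A_kl * ln a_kl
q_candidate = np.sum(A_k * np.log(a0))
q_other = np.sum(A_k * np.log(a_other))
print(q_candidate >= q_other)            # True: sum_l a0_l * ln(a0_l / a_l) >= 0
```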

(26)

26

EM interpretation of Baum-Welch

Baum-Welch

Expectation step

Compute the expected number of times that a transition gets used

Compute the expected number of times that an emission is observed

Use the forward and backward algorithm for this

Maximization step

Update the parameters with the normalized counts

Expected counts (Expectation step):

$$A_{kl} = \sum_{\pi} P(\pi \mid x, \theta_{i}^{\mathrm{EM}}) \, A_{kl}(\pi), \qquad E_{k}(b) = \sum_{\pi} P(\pi \mid x, \theta_{i}^{\mathrm{EM}}) \, E_{k}(b, \pi)$$

Normalized updates (Maximization step):

$$a_{kl} = \frac{A_{kl}}{\sum_{m} A_{km}}, \qquad e_{k}(b) = \frac{E_{k}(b)}{\sum_{b'} E_{k}(b')}$$
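A minimal sketch of the Maximization step under an assumed dictionary-of-counts layout; it simply renormalizes the expected counts per state:

```python
def m_step(A, E):
    """Maximization step: a_kl = A_kl / sum_m A_km, e_k(b) = E_k(b) / sum_b' E_k(b')."""
    a, e = {}, {}
    for k in {k for (k, _) in A}:
        total = sum(v for (k2, _), v in A.items() if k2 == k)
        for (k2, l), v in A.items():
            if k2 == k:
                a[(k, l)] = v / total
    for k in {k for (k, _) in E}:
        total = sum(v for (k2, _), v in E.items() if k2 == k)
        for (k2, b), v in E.items():
            if k2 == k:
                e.setdefault(k, {})[b] = v / total
    return a, e

# Usage with made-up expected counts.
A = {(1, 1): 3.2, (1, 2): 0.8, (2, 2): 1.5, (2, 0): 0.5}
E = {(1, "A"): 2.9, (1, "C"): 1.1, (2, "A"): 0.2, (2, "C"): 1.8}
print(m_step(A, E))
```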

(27)

27

EM interpretation of Baum-Welch

For the transitions

$$P(\pi_{i} = k,\ \pi_{i+1} = l \mid x, \theta) = \frac{f_{k}(i) \, a_{kl} \, e_{l}(x_{i+1}) \, b_{l}(i+1)}{P(x \mid \theta)}$$

$$A_{kl} = \sum_{j} \sum_{i} P(\pi_{i} = k,\ \pi_{i+1} = l \mid x^{j}, \theta) = \sum_{j} \frac{1}{P(x^{j} \mid \theta)} \sum_{i} f_{k}^{j}(i) \, a_{kl} \, e_{l}(x_{i+1}^{j}) \, b_{l}^{j}(i+1)$$

For the emissions

$$E_{k}(b) = \sum_{j} \sum_{\{i \mid x_{i}^{j} = b\}} P(\pi_{i} = k \mid x^{j}, \theta) = \sum_{j} \frac{1}{P(x^{j} \mid \theta)} \sum_{\{i \mid x_{i}^{j} = b\}} f_{k}^{j}(i) \, b_{k}^{j}(i)$$
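A minimal sketch of the Expectation step for a single sequence, combining the forward and backward sketches from earlier (the toy two-state parameters are assumptions; the f·a·e·b products and the division by P(x|θ) follow the formulas above):

```python
def expected_counts(x, states, a, e):
    """E-step for one sequence: A_kl = (1/P(x)) * sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1),
    E_k(b) = (1/P(x)) * sum_{i: x_i = b} f_k(i) b_k(i)."""
    L = len(x)
    # Forward table f_k(i) and backward table b_k(i), as in the earlier sketches.
    f = {k: [0.0] * L for k in states}
    bw = {k: [0.0] * L for k in states}
    for k in states:
        f[k][0] = a.get((0, k), 0.0) * e[k][x[0]]
        bw[k][L - 1] = a.get((k, 0), 0.0)
    for i in range(1, L):
        for l in states:
            f[l][i] = e[l][x[i]] * sum(f[k][i - 1] * a.get((k, l), 0.0) for k in states)
    for i in range(L - 2, -1, -1):
        for k in states:
            bw[k][i] = sum(a.get((k, l), 0.0) * e[l][x[i + 1]] * bw[l][i + 1] for l in states)
    px = sum(f[k][L - 1] * a.get((k, 0), 0.0) for k in states)
    # Expected transition counts A_kl (including the end transitions) and emission counts E_k(b).
    A, E = {}, {}
    for k in states:
        for l in states:
            A[(k, l)] = sum(f[k][i] * a.get((k, l), 0.0) * e[l][x[i + 1]] * bw[l][i + 1]
                            for i in range(L - 1)) / px
        A[(k, 0)] = f[k][L - 1] * a.get((k, 0), 0.0) / px
        for i, sym in enumerate(x):
            E[(k, sym)] = E.get((k, sym), 0.0) + f[k][i] * bw[k][i] / px
    return A, E

# Same hypothetical two-state parameters as in the forward/backward sketches.
a = {(0, 1): 1.0, (1, 1): 0.6, (1, 2): 0.4, (2, 2): 0.7, (2, 0): 0.3}
e = {1: {"A": 0.8, "C": 0.2}, 2: {"A": 0.1, "C": 0.9}}
print(expected_counts("AAC", [1, 2], a, e))
```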

(28)

28

Motif finding

(29)

29

Combinatorial control

Complex integration of multiple cis-regulatory signals controls gene activity

(30)

30

Motif model θ_{j,b} (j = 1, ..., W; b = 1, ..., 4): NCACGTGN

      1     2     3     4     5     6     7     8
A   0.3   0.01  0.8   0.03  0.04  0.04  0.03  0.2
C   0.4   0.9   0.1   0.8   0.01  0.01  0.02  0.2
G   0.2   0.04  0.05  0.03  0.9   0.05  0.9   0.4
T   0.1   0.05  0.05  0.04  0.05  0.9   0.05  0.2

Background model θ_{0,b} (b = 1, ..., 4):

A 0.32   C 0.16   G 0.24   T 0.28

Sequence model: one occurrence per sequence

$$P(S \mid a, W, \theta) = P_{\mathrm{bg}_{1}} \, P_{\mathrm{Motif}} \, P_{\mathrm{bg}_{2}}$$

$$P_{\mathrm{Motif}} = \prod_{j=1}^{W} q_{j,\,b_{a+j-1}}, \qquad P_{\mathrm{bg}_{1}} = \prod_{j=1}^{a-1} q_{0,\,b_{j}}, \qquad P_{\mathrm{bg}_{2}} = \prod_{j=a+W}^{L} q_{0,\,b_{j}}$$

(31)

31

Iterative motif discovery

Initialization

Sequences

Random motif matrix

Iteration

Sequence scoring

Alignment update

Motif instances

Motif matrix

Termination

Convergence of the alignment and of the motif matrix


(39)

39

Multiple EM for Motif Elicitation

(MEME)

(40)

40

MEME

Expectation-Maximization

Data = set of independent sequences

Likelihood = “one occurrence per sequence” model

Parameters = motif matrix (+ background model)

Missing data = alignment

$$\theta_{i+1}^{\mathrm{EM}} = \arg\max_{\theta} \sum_{k=1}^{K} \sum_{a_{k}=1}^{L_{k}} P(a_{k} \mid S_{k}, \theta_{i}^{\mathrm{EM}}) \ln \left[ P(S_{k} \mid a_{k}, \theta) \, P(a_{k} \mid \theta) \right]$$

$$\ln P(S_{k} \mid a_{k}, \theta) = \sum_{i=1}^{a_{k}-1} \ln q_{0, b_{i}} + \sum_{i=a_{k}}^{a_{k}+W-1} \ln q_{i-a_{k}+1,\, b_{i}} + \sum_{i=a_{k}+W}^{L_{k}} \ln q_{0, b_{i}}$$

(41)

41

MEME

Sequence scoring (per sequence)

Uniform prior

Sequence scoring for uniform prior

$$P(a_{k} \mid S_{k}, \theta) = \frac{P(S_{k} \mid a_{k}, \theta) \, P(a_{k} \mid \theta)}{\sum_{q=1}^{L_{k}} P(S_{k} \mid q, \theta) \, P(q \mid \theta)}$$

$$P(a_{k} \mid \theta) = \frac{1}{L_{k}}$$

$$P(a_{k} \mid S_{k}, \theta) = \frac{P(S_{k} \mid a_{k}, \theta)}{\sum_{q=1}^{L_{k}} P(S_{k} \mid q, \theta)}$$

(42)

42

MEME

Expectation

Maximization - intuitively

If we had only one alignment

Background model: observed frequencies at background positions

Motif matrix: observed frequencies at aligned positions

Here: sum over all possible alignments (independently for each sequence)

Weighted sum:

$$Q(\theta \mid \theta_{i}^{\mathrm{EM}}) = \sum_{k} \sum_{a_{k}} W_{i}^{\mathrm{EM}}(a_{k}) \ln P(S_{k} \mid a_{k}, \theta), \qquad W_{i}^{\mathrm{EM}}(a_{k}) = P(a_{k} \mid S_{k}, \theta_{i}^{\mathrm{EM}})$$
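A minimal sketch (an assumed layout, not the actual MEME implementation) of one EM iteration under the one-occurrence-per-sequence model: the E-step scores every candidate start with the posterior P(a_k | S_k, θ_i), the uniform prior cancelling, and the M-step rebuilds the motif matrix from position-weighted base counts; the pseudocounts are an added stabilization assumption.

```python
import numpy as np

BASES = "ACGT"

def seq_likelihood(s, a, motif, bg):
    """P(S | a, W, theta): motif columns inside the window [a, a+W), background elsewhere."""
    p = 1.0
    for i, ch in enumerate(s):
        b = BASES.index(ch)
        p *= motif[i - a, b] if a <= i < a + motif.shape[0] else bg[b]
    return p

def meme_em_step(seqs, motif, bg):
    """One EM iteration: E-step posteriors over start positions, M-step weighted base counts."""
    W = motif.shape[0]
    counts = np.full((W, 4), 1e-3)                 # small pseudocounts (an added assumption)
    for s in seqs:
        starts = range(len(s) - W + 1)
        lik = np.array([seq_likelihood(s, a, motif, bg) for a in starts])
        post = lik / lik.sum()                     # E-step: P(a | S, theta); uniform prior cancels
        for a, w in zip(starts, post):             # M-step: accumulate position-weighted counts
            for j in range(W):
                counts[j, BASES.index(s[a + j])] += w
    return counts / counts.sum(axis=1, keepdims=True)

# Usage with made-up sequences and a width-3 motif initialized near-uniform.
seqs = ["ACGTACG", "TTACGTT", "GGACGAA"]
rng = np.random.default_rng(0)
motif = np.full((3, 4), 0.25) + rng.uniform(0, 0.05, (3, 4))
motif /= motif.sum(axis=1, keepdims=True)
bg = np.array([0.25, 0.25, 0.25, 0.25])
for _ in range(20):
    motif = meme_em_step(seqs, motif, bg)
print(np.round(motif, 2))    # tends to concentrate on the shared ACG pattern
```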

(43)

43

Summary

The abstract Expectation-Maximization algorithm

EM interpretation of Baum-Welch training for HMMs

EM for motif finding

MEME
