(1)

Hidden Markov Models

Yves Moreau

Katholieke Universiteit Leuven

(2)

Regular expressions

Alignment

  ACA---ATG
  TCAACTATC
  ACAC--AGC
  AGA---ATC
  ACCG--ATC

Regular expression

  [AT][CG][AC][ACGT]*A[TG][GC]

Problem: the regular expression does not distinguish the exceptional sequence TGCTAGG from the consensus ACACATC.

(3)

Hidden Markov Models

(Figure: a small HMM with a main line of match states and one insert state. Emission probabilities per state:

  state 1:  A .8   C 0    G 0    T .2
  state 2:  A 0    C .8   G .2   T 0
  state 3:  A .8   C .2   G 0    T 0
  insert:   A .2   C .4   G .2   T .2
  state 4:  A 1    C 0    G 0    T 0
  state 5:  A 0    C 0    G .2   T .8
  state 6:  A 0    C .8   G .2   T 0

Transition probabilities: 1.0 between the first match states, .6 from state 3 into the insert state and .4 directly to state 4, .4 self-loop and .6 exit for the insert state, 1.0 between the last match states.)

Sequence score for the consensus:

  P(ACACATC) = 0.8 × 1 × 0.8 × 1 × 0.8 × 0.6 × 0.4 × 0.6 × 1 × 1 × 0.8 × 1 × 0.8 ≈ 0.047

(4)

Log odds

Use logarithms for scaling and normalize by a random model (background probability 0.25 per nucleotide).

Log odds for a sequence S of length L:

  log-odds(S) = log ( P(S) / 0.25^L ) = log P(S) − L log 0.25

(Figure: the same HMM with log-odds scores instead of probabilities. Emission scores, e.g. state 1: A 1.16, T −0.22; state 2: C 1.16, G −0.22; state 3: A 1.16, C −0.22; insert: A −0.22, C 0.47, G −0.22, T −0.22; state 4: A 1.39; state 5: T 1.16, G −0.22; state 6: C 1.16, G −0.22. Transition scores: 0 between the first match states, −0.51 and −0.92 around the insert state, 0 between the last match states.)

(5)

Log odds

For the consensus sequence:

  log-odds(ACACATC) = 1.16 + 0 + 1.16 + 0 + 1.16 − 0.51 + 0.47 − 0.51 + 1.39 + 0 + 1.16 + 0 + 1.16 = 6.65

Sequence                 Log odds
ACAC--ATC (consensus)    6.7
ACA---ATG                4.9
TCAACTATC                3.0
ACAC--AGC                5.3
AGA---ATC                4.9
ACCG--ATC                4.6
TGCTAGG (exceptional)   −0.97

(6)

Markov chain

Probabilistic model of a DNA sequence

Sequence: x = (x_1, x_2, ..., x_L), with x_i ∈ {A, C, G, T}

Transition probabilities:

  a_st = P(x_i = t | x_{i−1} = s)

Example of a Markov chain: (figure) four states A, C, G, T, with transitions between every pair of states.

(7)

Markov property

Probability of a sequence through Bayes' rule:

  P(x) = P(x_1, x_2, ..., x_L)
       = P(x_L | x_{L−1}, ..., x_1) P(x_{L−1} | x_{L−2}, ..., x_1) ... P(x_1)

Markov property: "The future is only a function of the present and not of the past"

  P(x_i | x_{i−1}, ..., x_1) = P(x_i | x_{i−1})

So:

  P(x) = P(x_L | x_{L−1}) P(x_{L−1} | x_{L−2}) ... P(x_2 | x_1) P(x_1)
       = P(x_1) ∏_{i=2}^{L} a_{x_{i−1} x_i}
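To make the product concrete, here is a minimal Python sketch that scores a DNA sequence under a first-order Markov chain; the initial distribution and transition matrix below are made-up illustrative values, not parameters from these slides.

```python
import numpy as np

ALPHABET = "ACGT"
init = np.array([0.25, 0.25, 0.25, 0.25])          # P(x_1 = s), assumed uniform
trans = np.array([[0.4, 0.2, 0.2, 0.2],            # a_st = P(x_i = t | x_{i-1} = s)
                  [0.3, 0.3, 0.2, 0.2],            # (illustrative values only)
                  [0.2, 0.3, 0.3, 0.2],
                  [0.2, 0.2, 0.2, 0.4]])

def markov_chain_prob(seq):
    """P(x) = P(x_1) * prod_i a_{x_{i-1} x_i}, using the Markov property."""
    idx = [ALPHABET.index(c) for c in seq]
    p = init[idx[0]]
    for prev, cur in zip(idx[:-1], idx[1:]):
        p *= trans[prev, cur]
    return p

print(markov_chain_prob("ACACATC"))
```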

(8)

Beginning and end of a sequence

Computation of the probability is not homogeneous: the length distribution is not modeled (P(length = L) is unspecified).

Solution: model the beginning and the end of the sequence explicitly.

  Sequence: x_1, ..., x_L together with a begin state and an end state
  a_{0s} = P(x_1 = s)           (transition out of the begin state)
  a_{t0} = P(end | x_L = t)     (transition into the end state)

With an end state, the probability of observing a sequence of a given length decreases with the length of the sequence.

(Figure: the Markov chain over A, C, G, T extended with begin and end states.)

(9)

Hidden Markov Models

(Figure: the same small HMM as before, with a main line of match states and one insert state. Emission probabilities per state:

  state 1:  A .8   C 0    G 0    T .2
  state 2:  A 0    C .8   G .2   T 0
  state 3:  A .8   C .2   G 0    T 0
  insert:   A .2   C .4   G .2   T .2
  state 4:  A 1    C 0    G 0    T 0
  state 5:  A 0    C 0    G .2   T .8
  state 6:  A 0    C .8   G .2   T 0

Transition probabilities: 1.0 between the first match states, .6 from state 3 into the insert state and .4 directly to state 4, .4 self-loop and .6 exit for the insert state, 1.0 between the last match states.)

Sequence score for the consensus:

  P(ACACATC) = 0.8 × 1 × 0.8 × 1 × 0.8 × 0.6 × 0.4 × 0.6 × 1 × 1 × 0.8 × 1 × 0.8 ≈ 0.047

(10)

Hidden Markov Model

In a hidden Markov model, we observe the symbol sequence x but we want to reconstruct the hidden state sequence (the path π).

Transition probabilities (with a silent begin/end state 0; begin: a_{0l}, end: a_{k0}):

  a_kl = P(π_i = l | π_{i−1} = k)

Emission probabilities:

  e_k(b) = P(x_i = b | π_i = k)

Joint probability of the sequence x = x_1, ..., x_L and the path π:

  P(x, π) = a_{0 π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}}     (with π_{L+1} = 0)
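A small Python sketch of this joint probability, assuming the model is stored as numpy arrays `a` (transitions, with index 0 reserved for the silent begin/end state) and `e` (emissions); the function and the array layout are illustrative choices, not part of the slides.

```python
import numpy as np

def joint_prob(x, path, a, e):
    """P(x, pi) = a_{0 pi_1} * prod_i [ e_{pi_i}(x_i) * a_{pi_i pi_{i+1}} ],
    where state 0 is the silent begin/end state, x holds symbol indices
    and path holds state indices (1..K)."""
    p = a[0, path[0]]                                   # a_{0 pi_1}
    for i, (k, b) in enumerate(zip(path, x)):
        nxt = path[i + 1] if i + 1 < len(path) else 0   # last factor is a_{pi_L 0}
        p *= e[k, b] * a[k, nxt]
    return p
```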

(11)

Casino (I) – problem setup

The casino mostly uses a fair die but sometimes switches to a loaded die.

We observe the outcome x of the successive throws but want to know when the die was fair or loaded (the path π).

  Fair die:    1: 1/6   2: 1/6   3: 1/6   4: 1/6   5: 1/6   6: 1/6
  Loaded die:  1: 1/10  2: 1/10  3: 1/10  4: 1/10  5: 1/10  6: 1/2

  Transitions: fair → loaded 0.05, loaded → fair 0.1, fair → fair 0.95, loaded → loaded 0.9
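The casino model translates directly into arrays; a possible Python encoding is sketched below. The slides do not specify begin/end probabilities for this HMM, so the uniform `start` distribution is an assumption, used again in the later code sketches.

```python
import numpy as np

STATES = ["fair", "loaded"]

# Transition probabilities a_kl from the slide (row = from state, column = to state).
a = np.array([[0.95, 0.05],
              [0.10, 0.90]])

# Emission probabilities e_k(b) for die faces 1..6 (columns 0..5).
e = np.array([[1/6] * 6,                         # fair die
              [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])   # loaded die

# Assumed uniform start distribution (no explicit begin/end state modelled here).
start = np.array([0.5, 0.5])
```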

(12)

Estimation of the sequence and state probabilities

(13)

The Viterbi algorithm

We look for the most probable path π*:

  π* = argmax_π P(x, π) = argmax_π P(π | x)

This problem can be tackled by dynamic programming.

Let us define v_k(i) as the probability of the most probable path that ends in state k with the emission of symbol x_i:

  v_k(i) = max_{π_1,...,π_{i−1}} P(x_1, ..., x_i, π_1, ..., π_{i−1}, π_i = k)

Then we can compute this probability recursively as

  v_l(i+1) = e_l(x_{i+1}) max_k ( v_k(i) a_kl )

(14)

The Viterbi algorithm

The Viterbi algorithm grows the best path dynamically.

Initial condition: the sequence starts in the begin state.

Traceback pointers are kept to follow the best path back (= decoding).

Initialization (i = 0):     v_0(0) = 1, v_k(0) = 0 for k > 0
Recursion (i = 1, ..., L):  v_l(i) = e_l(x_i) max_k ( v_k(i−1) a_kl )
                            ptr_i(l) = argmax_k ( v_k(i−1) a_kl )
Termination:                P(x, π*) = max_k ( v_k(L) a_k0 )
                            π*_L = argmax_k ( v_k(L) a_k0 )
Traceback (i = L, ..., 1):  π*_{i−1} = ptr_i(π*_i)
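Below is a compact sketch of the Viterbi recursion and traceback in Python, written for the casino arrays `a`, `e`, `start` defined earlier (so without an explicit end state, an assumed simplification); log probabilities are used to avoid the underflow discussed on the numerical-stability slide.

```python
import numpy as np

def viterbi(obs, a, e, start):
    """Most probable state path for a sequence of symbol indices obs."""
    K, L = a.shape[0], len(obs)
    log_a, log_e, log_s = np.log(a), np.log(e), np.log(start)
    v = np.full((L, K), -np.inf)          # v[i, k]: best log prob. ending in state k at position i
    ptr = np.zeros((L, K), dtype=int)     # traceback pointers
    v[0] = log_s + log_e[:, obs[0]]       # initialization
    for i in range(1, L):
        for l in range(K):
            scores = v[i - 1] + log_a[:, l]        # v_k(i-1) a_kl in log space
            ptr[i, l] = int(np.argmax(scores))
            v[i, l] = log_e[l, obs[i]] + scores[ptr[i, l]]
    path = [int(np.argmax(v[-1]))]        # termination
    for i in range(L - 1, 0, -1):         # traceback
        path.append(int(ptr[i, path[-1]]))
    return path[::-1], float(np.max(v[-1]))

# Example: throws encoded as die face - 1, e.g.
# rolls = np.array([0, 5, 5, 2, 5, 5, 1, 5, 5, 3])
# path, logp = viterbi(rolls, a, e, start)   # path of 0 (fair) / 1 (loaded) indices
```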

(15)

Casino (II) – Viterbi

(16)

The forward algorithm

The forward algorithm lets us compute the probability P(x) of a sequence w.r.t. an HMM.

This is important for the computation of posterior probabilities and for the comparison of HMMs.

The sum over all paths (exponentially many) can be computed by dynamic programming:

  P(x) = Σ_π P(x, π)

Let us define f_k(i) as the probability of the sequence for the paths that end in state k with the emission of symbol x_i:

  f_k(i) = P(x_1, ..., x_i, π_i = k)

Then we can compute this probability recursively (next slide).

(17)

The forward algorithm

The forward algorithm grows the total probability dynamically from the beginning to the end of the sequence.

Initial condition: the sequence starts in the begin state.

End: all states converge to the end state.

Initialization (i = 0):     f_0(0) = 1, f_k(0) = 0 for k > 0
Recursion (i = 1, ..., L):  f_l(i) = e_l(x_i) Σ_k f_k(i−1) a_kl
End:                        P(x) = Σ_k f_k(L) a_k0
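A matching sketch of the forward recursion for the same setup (no explicit end state, uniform `start` assumed). Direct probabilities are used to mirror the slide, so long sequences would need the scaling or log-space tricks mentioned under numerical stability.

```python
import numpy as np

def forward(obs, a, e, start):
    """Forward matrix f[i, k] = P(x_1..x_i, pi_i = k) and the sequence probability P(x)."""
    K, L = a.shape[0], len(obs)
    f = np.zeros((L, K))
    f[0] = start * e[:, obs[0]]                  # initialization (begin state folded into 'start')
    for i in range(1, L):
        f[i] = e[:, obs[i]] * (f[i - 1] @ a)     # f_l(i) = e_l(x_i) * sum_k f_k(i-1) a_kl
    return f, float(f[-1].sum())                 # without an end state, P(x) = sum_k f_k(L)
```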

(18)

The backward algorithm

The backward algorithm lets us compute the probability of the complete sequence together with the condition that symbol x_i is emitted from state k.

This is important to compute the probability of a given state at symbol x_i.

P(x_1, ..., x_i, π_i = k) can be computed by the forward algorithm as f_k(i).

Let us define b_k(i) as the probability of the rest of the sequence for the paths that pass through state k at symbol x_i:

  b_k(i) = P(x_{i+1}, ..., x_L | π_i = k)

Then:

  P(x, π_i = k) = P(x_1, ..., x_i, π_i = k) P(x_{i+1}, ..., x_L | x_1, ..., x_i, π_i = k)
                = P(x_1, ..., x_i, π_i = k) P(x_{i+1}, ..., x_L | π_i = k)
                = f_k(i) b_k(i)

(19)

The backward algorithm

The backward algorithm grows the probability b_k(i) dynamically backwards (from the end to the beginning of the sequence).

Border condition: start in the end state.

Initialization (i = L):       b_k(L) = a_k0 for all k
Recursion (i = L−1, ..., 1):  b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1)
Termination:                  P(x) = Σ_l a_0l e_l(x_1) b_l(1)

Once both forward and backward probabilities are available, we can compute the posterior probability of the state:

  P(π_i = k | x) = f_k(i) b_k(i) / P(x)
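The backward recursion and the posterior state probabilities can be sketched the same way, reusing `forward()` from the sketch above (again without an explicit end state, so b_k(L) = 1 replaces b_k(L) = a_k0).

```python
import numpy as np

def backward(obs, a, e):
    """Backward matrix b[i, k] = P(x_{i+1}..x_L | pi_i = k)."""
    K, L = a.shape[0], len(obs)
    b = np.ones((L, K))                              # b_k(L) = 1 (no explicit end state)
    for i in range(L - 2, -1, -1):
        b[i] = a @ (e[:, obs[i + 1]] * b[i + 1])     # b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1)
    return b

def posterior(obs, a, e, start):
    """Posterior state probabilities P(pi_i = k | x) = f_k(i) b_k(i) / P(x)."""
    f, px = forward(obs, a, e, start)
    b = backward(obs, a, e)
    return f * b / px
```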

(20)

Posterior decoding

Instead of decoding with the most probable path (Viterbi), we can use the path of the most probable states:

  π̂_i = argmax_k P(π_i = k | x)

The path π̂ can be "illegal" (P(π̂ | x) = 0), since consecutive states need not be connected by allowed transitions.

This approach can also be used when we are interested in a function g(k) of the state (e.g., labeling):

  G(i | x) = Σ_k P(π_i = k | x) g(k)
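Posterior decoding then amounts to an argmax over the posterior matrix from the sketch above; `rolls` below is an assumed array of observed die faces (encoded 0..5), and `posterior`, `a`, `e`, `start` come from the earlier sketches.

```python
import numpy as np

# Posterior decoding for the casino example.
rolls = np.array([0, 2, 5, 5, 5, 1, 5, 5, 3, 5])   # assumed throws, faces 1..6 encoded as 0..5
post = posterior(rolls, a, e, start)               # post[i, k] = P(pi_i = k | x)
pi_hat = post.argmax(axis=1)                       # hat(pi)_i = argmax_k P(pi_i = k | x)
p_fair = post[:, 0]                                # posterior probability of "fair" per throw
```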

(21)

Casino (III) – posterior decoding

Posterior probability of the state "fair" along the sequence of die throws.

(22)

Casino (IV) – posterior decoding

New situation: P(π_{i+1} = FAIR | π_i = FAIR) = 0.99

Viterbi decoding cannot detect the cheating from 1000 throws, while posterior decoding does.

  Fair die:    1: 1/6   2: 1/6   3: 1/6   4: 1/6   5: 1/6   6: 1/6
  Loaded die:  1: 1/10  2: 1/10  3: 1/10  4: 1/10  5: 1/10  6: 1/2

  Transitions: fair → loaded 0.01, loaded → fair 0.1, fair → fair 0.99, loaded → loaded 0.9

(23)

Parameter estimation for HMMs

(24)

Choice of the architecture

For the parameter estimation, we assume that the architecture of the HMM is known

The choice of architecture is an essential design decision

Duration modeling

“Silent states” for gaps

(25)

Parameter estimation with known paths

HMM with parameters θ (transition and emission probabilities)

Training set D of N sequences x^1, ..., x^N

The score of the model is the log-likelihood of the parameters given the training data:

  Score(D, θ) = log P(x^1, ..., x^N | θ) = Σ_{j=1}^{N} log P(x^j | θ)

(26)

Parameter estimation with known paths

If the state paths are known, the parameters are estimated through counts (how often is a transition used, how often is a symbol produced by a given state).

Use 'pseudocounts' if necessary:

  A_kl = number of transitions from k to l in the training set + pseudocount r_kl
  E_k(b) = number of emissions of b from k in the training set + pseudocount r_k(b)

  a_kl = A_kl / Σ_{l'} A_{kl'}    and    e_k(b) = E_k(b) / Σ_{b'} E_k(b')
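With labelled data this estimation is just counting; a minimal Python sketch follows (the function name, argument layout and the single shared pseudocount are illustrative choices).

```python
import numpy as np

def estimate_from_known_paths(seqs, paths, K, M, pseudo=1.0):
    """Estimate a_kl and e_k(b) from sequences with known state paths.
    seqs/paths: lists of equally long integer arrays (symbols 0..M-1, states 0..K-1);
    'pseudo' is the pseudocount r added to every count."""
    A = np.full((K, K), pseudo)          # A_kl: transition counts + pseudocounts
    E = np.full((K, M), pseudo)          # E_k(b): emission counts + pseudocounts
    for x, pi in zip(seqs, paths):
        for i in range(len(x)):
            E[pi[i], x[i]] += 1
            if i + 1 < len(x):
                A[pi[i], pi[i + 1]] += 1
    a = A / A.sum(axis=1, keepdims=True)     # a_kl = A_kl / sum_l' A_kl'
    e = E / E.sum(axis=1, keepdims=True)     # e_k(b) = E_k(b) / sum_b' E_k(b')
    return a, e
```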

(27)

Parameter estimation with unknown paths: Viterbi training

Strategy: iterative method

  Suppose that the parameters are known and find the best state paths (Viterbi decoding)
  Re-estimate the parameters from these paths by counting
  Iterate until convergence

Viterbi training does not maximize the likelihood of the parameters; instead it converges to

  θ^Vit = argmax_θ P(x^1, ..., x^N | θ, π*(x^1), ..., π*(x^N))

Viterbi training converges exactly in a finite number of steps.

(28)

Parameter estimation with unknown paths: Baum-Welch training

Strategy: parallel to Viterbi training, but we use the expected transition and emission counts (instead of counts from the best path only).

For the transitions:

  P(π_i = k, π_{i+1} = l | x, θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)

  A_kl = Σ_j 1/P(x^j) Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1)

For the emissions:

  E_k(b) = Σ_j 1/P(x^j) Σ_{i | x_i^j = b} f_k^j(i) b_k^j(i)
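One E-step contribution (for a single sequence) can be sketched directly from these formulas, reusing the `forward()` and `backward()` sketches above; as before, the end state is dropped for simplicity.

```python
import numpy as np

def expected_counts(obs, a, e, start):
    """Expected transition counts A_kl and emission counts E_k(b) for one sequence."""
    f, px = forward(obs, a, e, start)
    b = backward(obs, a, e)
    K, M = e.shape
    A = np.zeros((K, K))
    E = np.zeros((K, M))
    for i in range(len(obs) - 1):
        # A_kl += f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
        A += np.outer(f[i], e[:, obs[i + 1]] * b[i + 1]) * a / px
    for i in range(len(obs)):
        E[:, obs[i]] += f[i] * b[i] / px     # P(pi_i = k | x)
    return A, E
```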

(29)

Parameter estimation with unknown paths: Baum-Welch training

Initialization: choose arbitrary model parameters

Recursion:
  Set all transition and emission variables to their pseudocounts
  For all sequences j = 1, ..., n:
    Compute f_k(i) for sequence j with the forward algorithm
    Compute b_k(i) for sequence j with the backward algorithm
    Add the contributions to A and E
  Compute the new model parameters a_kl = A_kl / Σ_{l'} A_{kl'} and e_k(b) = E_k(b) / Σ_{b'} E_k(b')
  Compute the log-likelihood of the model

End: stop when the change in log-likelihood falls below some threshold or when the maximum number of iterations is exceeded
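Putting the pieces together, the recursion above can be sketched as a short loop around `expected_counts()`; the fixed iteration count stands in for the log-likelihood stopping rule, and the start distribution is not re-estimated in this simplified version.

```python
import numpy as np

def baum_welch(seqs, a, e, start, n_iter=50, pseudo=1e-2):
    """Sketch of Baum-Welch training over a list of symbol-index sequences."""
    for _ in range(n_iter):
        A = np.full_like(a, pseudo)                  # transition variables set to pseudocounts
        E = np.full_like(e, pseudo)                  # emission variables set to pseudocounts
        for x in seqs:
            dA, dE = expected_counts(x, a, e, start)
            A += dA
            E += dE
        a = A / A.sum(axis=1, keepdims=True)         # new a_kl
        e = E / E.sum(axis=1, keepdims=True)         # new e_k(b)
    return a, e
```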

(30)

Casino (V) – Baum-Welch training

Original model:
  Fair:    1: 1/6   2: 1/6   3: 1/6   4: 1/6   5: 1/6   6: 1/6
  Loaded:  1: 1/10  2: 1/10  3: 1/10  4: 1/10  5: 1/10  6: 1/2
  Transitions: fair → loaded 0.05, loaded → fair 0.1, fair → fair 0.95, loaded → loaded 0.9

Estimated from 300 throws:
  Fair:    1: 0.19  2: 0.19  3: 0.23  4: 0.08  5: 0.23  6: 0.08
  Loaded:  1: 0.07  2: 0.10  3: 0.10  4: 0.17  5: 0.05  6: 0.52
  Transitions: fair → loaded 0.27, loaded → fair 0.29, fair → fair 0.73, loaded → loaded 0.71

Estimated from 30000 throws:
  Fair:    1: 0.17  2: 0.17  3: 0.17  4: 0.17  5: 0.17  6: 0.15
  Loaded:  1: 0.10  2: 0.11  3: 0.10  4: 0.11  5: 0.10  6: 0.48
  Transitions: fair → loaded 0.07, loaded → fair 0.12, fair → fair 0.93, loaded → loaded 0.88

(31)

Numerical stability

Many expressions contain products of many probabilities.

This causes underflow when we compute these expressions.

For Viterbi, this can be solved by working with logarithms.

For the forward and backward algorithms, we can work with an approximation to the logarithm (to handle the sums) or with rescaled variables.
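One common way to keep the forward and backward recursions in log space is the log-sum-exp identity sketched below; this is a standard alternative to the scaling approach, not code from the slides.

```python
import numpy as np

def log_sum_exp(log_p):
    """Numerically stable log(sum(exp(log_p)))."""
    m = np.max(log_p)
    if np.isneginf(m):
        return m                                   # all terms have zero probability
    return m + np.log(np.sum(np.exp(log_p - m)))

# Log-space forward step for state l (compare with the forward recursion above):
#   log_f[i, l] = log_e[l, obs[i]] + log_sum_exp(log_f[i - 1] + log_a[:, l])
```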

(32)

Summary

Hidden Markov Models

Computation of sequence and state probabilities

Viterbi computation of the best state path

The forward algorithm for the computation of the probability of a sequence

The backward algorithm for the computation of state probabilities

Parameter estimation for HMMs

Parameter estimation with known paths

Parameter estimation with unknown paths

Viterbi training

Baum-Welch training
