Hidden Markov Models
Yves Moreau
Katholieke Universiteit Leuven
Regular expressions

Alignment:

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

Regular expression: [AT][CG][AC][ACGT]*A[TG][GC]

Problem: the regular expression does not distinguish the exceptional match
TGCTAGG from the consensus ACACATC.
Hidden Markov Models

[Figure: profile HMM built from the alignment above, with six match states
(m1-m6), one insert state between m3 and m4, and the following emission
probabilities]

        m1    m2    m3   insert   m4    m5    m6
A       .8     0    .8     .2      1     0     0
C        0    .8    .2     .4      0     0    .8
G        0    .2     0     .2      0    .2    .2
T       .2     0     0     .2      0    .8     0

Transitions: m1→m2 = m2→m3 = 1.0; m3→insert = .6; m3→m4 = .4;
insert→insert = .4; insert→m4 = .6; m4→m5 = m5→m6 = 1.0
Sequence score

Multiplying the emission and transition probabilities along the path of the
consensus sequence through the model:

$P(\mathrm{ACACATC}) = 0.8 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \cdot 0.6 \cdot 0.4 \cdot 0.6 \cdot 1 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \approx 0.0472$
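As a sanity check, this score is easy to reproduce in code. Below is a minimal
Python sketch (not part of the slides); the state labels m1-m6 and "ins" are
ours, the probabilities are the ones in the figure above, and the begin→m1
transition is implicitly 1.

# Minimal sketch: scoring a sequence against the small profile HMM above.
emissions = {
    "m1": {"A": 0.8, "T": 0.2},
    "m2": {"C": 0.8, "G": 0.2},
    "m3": {"A": 0.8, "C": 0.2},
    "ins": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "m4": {"A": 1.0},
    "m5": {"G": 0.2, "T": 0.8},
    "m6": {"C": 0.8, "G": 0.2},
}
transitions = {
    ("m1", "m2"): 1.0, ("m2", "m3"): 1.0,
    ("m3", "ins"): 0.6, ("m3", "m4"): 0.4,
    ("ins", "ins"): 0.4, ("ins", "m4"): 0.6,
    ("m4", "m5"): 1.0, ("m5", "m6"): 1.0,
}

def path_probability(symbols, path):
    """P(x, path): alternate emission and transition probabilities."""
    p = emissions[path[0]].get(symbols[0], 0.0)
    for i in range(1, len(symbols)):
        p *= transitions.get((path[i - 1], path[i]), 0.0)
        p *= emissions[path[i]].get(symbols[i], 0.0)
    return p

print(path_probability("ACACATC", ["m1", "m2", "m3", "ins", "m4", "m5", "m6"]))
# 0.04718... ≈ 0.0472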
Log odds

Use logarithms for scaling and normalize by a random model (every nucleotide
equally likely with probability 0.25).

Log odds for a sequence S of length L:

$\operatorname{log-odds}(S) = \log \dfrac{P(S)}{0.25^L} = \log P(S) - L \log 0.25$

[Figure: the same profile HMM with log-odds scores (natural logarithms)]

m1: A 1.16, T -0.22          m4: A 1.39
m2: C 1.16, G -0.22          m5: T 1.16, G -0.22
m3: A 1.16, C -0.22          m6: C 1.16, G -0.22
insert: A -0.22, C 0.47, G -0.22, T -0.22
Transitions: log 1.0 = 0, log 0.6 = -0.51, log 0.4 = -0.92
Log odds

For the consensus sequence:

$\operatorname{log-odds}(\mathrm{ACACATC}) = 1.16 + 0 + 1.16 + 0 + 1.16 - 0.51 + 0.47 - 0.51 + 1.39 + 0 + 1.16 + 0 + 1.16 \approx 6.65$
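The same number can be obtained programmatically; a small sketch reusing the
path_probability helper from the earlier sketch:

import math

p = path_probability("ACACATC", ["m1", "m2", "m3", "ins", "m4", "m5", "m6"])
log_odds = math.log(p) - len("ACACATC") * math.log(0.25)  # log P(S) - L log 0.25
print(round(log_odds, 2))  # ≈ 6.65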
Sequence                  Log odds
ACAC--ATC (consensus)        6.7
ACA---ATG                    4.9
TCAACTATC                    3.0
ACAC--AGC                    5.3
AGA---ATC                    4.9
ACCG--ATC                    4.6
TGCT--AGG (exceptional)     -0.97
Markov chain

Example of a Markov chain: a probabilistic model of a DNA sequence

Sequence: $x = x_1, x_2, \ldots, x_L$ with $x_i \in \{A, C, G, T\}$
(e.g., $x_1 = A$, $x_2 = C$, ...)

Transition probabilities: $a_{st} = P(x_i = t \mid x_{i-1} = s)$

[Figure: four-state Markov chain with states A, C, G, T]
Markov property

Probability of a sequence through Bayes' rule:

$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1)$

Markov property: "the future is a function only of the present and not of the
past", i.e., $P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1})$, so that

$P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$
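A short sketch of this product formula, with a made-up transition matrix (the
numbers below are illustrative, not from the slides):

# Hypothetical transition matrix trans[s][t] = P(x_i = t | x_{i-1} = s)
trans = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}
p1 = {s: 0.25 for s in "ACGT"}  # P(x_1), uniform here

def sequence_probability(x):
    """P(x) = P(x_1) * prod_{i=2..L} trans[x_{i-1}][x_i]  (Markov property)."""
    p = p1[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= trans[prev][cur]
    return p

print(sequence_probability("ACGT"))  # 0.25 * 0.2 * 0.2 * 0.2 = 0.002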
Beginning and end of a sequence

Computation of the probability is not homogeneous: $P(x_1)$ plays a different
role than the transition probabilities

The length distribution is not modeled: $P(\mathrm{length} = L)$ is unspecified

Solution: model the beginning and end of the sequence with a begin state s and
an end state t:

Sequence: $s, x_1, \ldots, x_L, t$ with $a_{s x_1} = P(x_1 \mid s)$ and $a_{x_L t} = P(t \mid x_L)$

The probability to observe a sequence of a given length then decreases with
the length of the sequence

[Figure: the four-state Markov chain extended with begin and end states]
Hidden Markov Model

In a hidden Markov model, we observe the symbol sequence x but we want to
reconstruct the hidden state sequence (the path $\pi$)

Transition probabilities: $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
(begin: $a_{0l}$; end: $a_{k0}$)

Emission probabilities: $e_k(b) = P(x_i = b \mid \pi_i = k)$

Joint probability of the sequence $x = x_1, \ldots, x_L$ and the path $\pi$:

$P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$, with the convention $\pi_{L+1} = 0$
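A direct transcription of this joint probability (a sketch: a and e are
dict-of-dict parameter tables, and 0 is used as the label of the shared
begin/end state):

def joint_probability(x, path, a, e):
    """P(x, pi) = a[0][pi_1] * prod_i e[pi_i](x_i) * a[pi_i][pi_{i+1}],
    with the convention pi_{L+1} = 0 (the shared begin/end state)."""
    p = a[0][path[0]]                              # begin -> first state
    for i, (state, symbol) in enumerate(zip(path, x)):
        p *= e[state].get(symbol, 0.0)             # emission e_k(x_i)
        nxt = path[i + 1] if i + 1 < len(path) else 0
        p *= a[state].get(nxt, 0.0)                # transition (to end at i = L)
    return p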
Casino (I) – problem setup

The casino mostly uses a fair die but sometimes switches to a loaded die

We observe the outcomes x of the successive throws but want to know when the
die was fair and when it was loaded (the path $\pi$)

[Figure: two-state HMM]
Fair die:   1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded die: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
Transitions: Fair→Fair 0.95, Fair→Loaded 0.05, Loaded→Loaded 0.9, Loaded→Fair 0.1
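The casino model is small enough to write down directly. The sketch below is
our own encoding (the end state is omitted, so $a_{k0}$ is implicitly 1, and
the uniform start distribution is an assumption, not on the slide); it also
generates throws for the decoding examples that follow.

import random

states = ["F", "L"]                    # fair, loaded
start = {"F": 0.5, "L": 0.5}           # assumed uniform start
a = {"F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.10, "L": 0.90}}
e = {"F": {str(k): 1 / 6 for k in range(1, 7)},
     "L": {**{str(k): 1 / 10 for k in range(1, 6)}, "6": 1 / 2}}

def simulate(n):
    """Generate n throws together with the hidden fair/loaded path."""
    state = random.choices(states, [start[s] for s in states])[0]
    throws, path = [], []
    for _ in range(n):
        path.append(state)
        faces = list(e[state])
        throws.append(random.choices(faces, [e[state][f] for f in faces])[0])
        state = random.choices(states, [a[state][s] for s in states])[0]
    return "".join(throws), "".join(path)

x, true_path = simulate(300)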
Estimation of the sequence and state probabilities

The Viterbi algorithm

We look for the most probable path:

$\pi^* = \arg\max_\pi P(x, \pi) = \arg\max_\pi P(\pi \mid x)$

This problem can be tackled by dynamic programming

Let us define $v_k(i)$ as the probability of the most probable path that ends
in state k with the emission of symbol $x_i$:

$v_k(i) = \max_{\pi_1, \ldots, \pi_{i-1}} P(x_1, \ldots, x_i, \pi_1, \ldots, \pi_{i-1}, \pi_i = k)$

Then we can compute this probability recursively as

$v_l(i+1) = e_l(x_{i+1}) \max_k \left( v_k(i)\, a_{kl} \right)$
The Viterbi algorithm

The Viterbi algorithm grows the best path dynamically

Initial condition: sequence in the beginning state

Traceback pointers to follow the best path back (= decoding)

Initialization ($i = 0$): $v_0(0) = 1$; $v_k(0) = 0$ for $k > 0$
Recursion ($i = 1, \ldots, L$): $v_l(i) = e_l(x_i) \max_k \left( v_k(i-1)\, a_{kl} \right)$; $\mathrm{ptr}_i(l) = \arg\max_k \left( v_k(i-1)\, a_{kl} \right)$
Termination: $P(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right)$; $\pi_L^* = \arg\max_k \left( v_k(L)\, a_{k0} \right)$
Traceback ($i = L, \ldots, 1$): $\pi_{i-1}^* = \mathrm{ptr}_i(\pi_i^*)$
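A probability-space sketch of the algorithm for the casino HMM (reusing
states, a, e, start from the simulation sketch; end transitions $a_{k0}$ are
taken as 1, and in practice one works with logarithms, see the numerical
stability slide):

def viterbi(x):
    """Most probable state path for symbol sequence x (casino HMM)."""
    v = [{k: start[k] * e[k][x[0]] for k in states}]        # initialization
    ptr = []
    for sym in x[1:]:                                       # recursion
        prev, col, back = v[-1], {}, {}
        for l in states:
            best = max(states, key=lambda k: prev[k] * a[k][l])
            back[l] = best
            col[l] = prev[best] * a[best][l] * e[l][sym]
        v.append(col)
        ptr.append(back)
    last = max(states, key=lambda k: v[-1][k])              # termination
    path = [last]
    for back in reversed(ptr):                              # traceback
        path.append(back[path[-1]])
    return "".join(reversed(path))

print(viterbi("266666126"))
# -> 'LLLLLLLLL': the run of sixes makes the loaded die most probable throughout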
Casino (II) – Viterbi

[Figure: Viterbi decoding of a sequence of die throws]
The forward algorithm

The forward algorithm lets us compute the probability P(x) of a sequence
w.r.t. an HMM

This is important for the computation of posterior probabilities and for the
comparison of HMMs

The sum over all paths (exponentially many) can be computed by dynamic
programming:

$P(x) = \sum_\pi P(x, \pi)$

Let us define $f_k(i)$ as the probability of the sequence so far for the paths
that end in state k with the emission of symbol $x_i$:

$f_k(i) = P(x_1, \ldots, x_i, \pi_i = k)$
The forward algorithm

The forward algorithm grows the total probability dynamically from the
beginning to the end of the sequence

Initial condition: sequence in the beginning state
End: all states converge to the end state

Initialization ($i = 0$): $f_0(0) = 1$; $f_k(0) = 0$ for $k > 0$
Recursion ($i = 1, \ldots, L$): $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
End: $P(x) = \sum_k f_k(L)\, a_{k0}$
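A matching sketch of the forward pass for the casino HMM (again with the end
transitions $a_{k0}$ taken as 1):

def forward(x):
    """P(x) for the casino HMM, by dynamic programming over all paths."""
    f = {k: start[k] * e[k][x[0]] for k in states}          # initialization
    for sym in x[1:]:                                       # recursion
        f = {l: e[l][sym] * sum(f[k] * a[k][l] for k in states)
             for l in states}
    return sum(f.values())                                  # end (a_{k0} = 1 here)

print(forward("266666126"))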
The backward algorithm

The backward algorithm lets us compute the probability of the complete
sequence together with the condition that symbol $x_i$ is emitted from state k

This is important to compute the probability of a given state at symbol $x_i$:

$P(x, \pi_i = k) = P(x_1, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_L \mid x_1, \ldots, x_i, \pi_i = k) = P(x_1, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_L \mid \pi_i = k)$

$P(x_1, \ldots, x_i, \pi_i = k)$ can be computed by the forward algorithm: $f_k(i)$

Let us define $b_k(i)$ as the probability of the rest of the sequence for the
paths that pass through state k at symbol $x_i$:

$b_k(i) = P(x_{i+1}, \ldots, x_L \mid \pi_i = k)$
The backward algorithm

The backward algorithm grows the probability $b_k(i)$ dynamically backwards
(from the end to the beginning of the sequence)

Border condition: start in the end state

Initialization ($i = L$): $b_k(L) = a_{k0}$ for all k
Recursion ($i = L-1, \ldots, 1$): $b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$
Termination: $P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$

Once both forward and backward probabilities are available, we can compute
the posterior probability of the state:

$P(\pi_i = k \mid x) = \dfrac{f_k(i)\, b_k(i)}{P(x)}$
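A sketch of the backward pass and the resulting posteriors for the casino HMM
(same conventions as the earlier sketches):

def backward(x):
    """Backward columns b_k(i) for i = 1..L (end transitions a_{k0} = 1)."""
    cols = [{k: 1.0 for k in states}]                       # initialization, i = L
    for sym in reversed(x[1:]):                             # recursion, i = L-1..1
        nxt = cols[0]
        cols.insert(0, {k: sum(a[k][l] * e[l][sym] * nxt[l] for l in states)
                        for k in states})
    return cols

def posterior(x):
    """P(pi_i = k | x) = f_k(i) b_k(i) / P(x) for every position i."""
    fcols = [{k: start[k] * e[k][x[0]] for k in states}]    # forward columns
    for sym in x[1:]:
        f = fcols[-1]
        fcols.append({l: e[l][sym] * sum(f[k] * a[k][l] for k in states)
                      for l in states})
    px = sum(fcols[-1].values())                            # P(x)
    bcols = backward(x)
    return [{k: fcols[i][k] * bcols[i][k] / px for k in states}
            for i in range(len(x))]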
Posterior decoding

Instead of using the most probable path for decoding (Viterbi), we can use
the path of the most probable states:

$\hat{\pi}_i = \arg\max_k P(\pi_i = k \mid x)$

The path $\hat{\pi}$ can be "illegal" ($P(\hat{\pi} \mid x) = 0$)

This approach can also be used when we are interested in a function g(k) of
the state (e.g., labeling):

$G(i \mid x) = \sum_k P(\pi_i = k \mid x)\, g(k)$
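Posterior decoding is then a one-liner on top of the posterior sketch above:

post = posterior("266666126")
decoded = "".join(max(col, key=col.get) for col in post)   # pi-hat_i per position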
Casino (III) – posterior decoding

[Figure: posterior probability of the state "fair" as a function of the die
throws]
Casino (IV) – posterior decoding

New situation: $P(\pi_{i+1} = \mathrm{Fair} \mid \pi_i = \mathrm{Fair}) = 0.99$

Viterbi decoding cannot detect the cheating in 1000 throws, while posterior
decoding can

[Figure: two-state HMM]
Fair die:   1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded die: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
Transitions: Fair→Fair 0.99, Fair→Loaded 0.01, Loaded→Loaded 0.9, Loaded→Fair 0.1
Parameter estimation for HMMs
Choice of the architecture
For parameter estimation, we assume that the architecture of the HMM is known
The choice of architecture is an essential design decision
Duration modeling
“Silent states” for gaps
Parameter estimation with known paths

HMM with parameters $\theta$ (transition and emission probabilities)

Training set D of N sequences $x^1, \ldots, x^N$

The score of the model is the likelihood of the parameters given the training
data:

$\mathrm{Score}(D, \theta) = \log P(x^1, \ldots, x^N \mid \theta) = \sum_{j=1}^{N} \log P(x^j \mid \theta)$
Parameter estimation with known paths

If the state paths are known, the parameters are estimated through counts
(how often is a transition used, how often is a symbol produced by a given
state), with 'pseudocounts' if necessary:

$A_{kl}$ = number of transitions from k to l in the training set + pseudocount $r_{kl}$
$E_k(b)$ = number of emissions of b from k in the training set + pseudocount $r_k(b)$

$a_{kl} = \dfrac{A_{kl}}{\sum_{l'} A_{kl'}}$ and $e_k(b) = \dfrac{E_k(b)}{\sum_{b'} E_k(b')}$
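A sketch of this counting procedure (names are ours; uniform pseudocounts for
simplicity):

def estimate(sequences, paths, states, alphabet, r_trans=1.0, r_emit=1.0):
    """ML estimates from known state paths, with uniform pseudocounts."""
    A = {k: {l: r_trans for l in states} for k in states}
    E = {k: {b: r_emit for b in alphabet} for k in states}
    for x, pi in zip(sequences, paths):
        for i, (sym, k) in enumerate(zip(x, pi)):
            E[k][sym] += 1                                  # emission count
            if i + 1 < len(pi):
                A[k][pi[i + 1]] += 1                        # transition count
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e

# e.g. estimate(["266666"], ["FLLLLL"], "FL", "123456")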
Parameter estimation with unknown paths: Viterbi training

Strategy: iterative method
Suppose the current parameters are known and find the best paths by Viterbi decoding
Treat these paths as known and re-estimate the parameters from the counts
Iterate until convergence

Viterbi training does not maximize the likelihood of the parameters; it
converges to

$\theta^{\mathrm{Vit}} = \arg\max_\theta P(x^1, \ldots, x^N \mid \theta, \pi^*(x^1), \ldots, \pi^*(x^N))$

Viterbi training converges exactly in a finite number of steps
Parameter estimation with unknown paths: Baum-Welch training

Strategy: parallel to Viterbi training, but we use the expected values of the
transition and emission counts (instead of using only the best path)

Probability of a transition from k to l at position i:

$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \dfrac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$

For the transitions:

$A_{kl} = \sum_j \dfrac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x_{i+1}^j)\, b_l^j(i+1)$

For the emissions:

$E_k(b) = \sum_j \dfrac{1}{P(x^j)} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i)$
Parameter estimation with unknown paths: Baum-Welch training

Initialization: choose arbitrary model parameters
Recursion:
Set all transition and emission variables to their pseudocounts
For all sequences j = 1, ..., n:
Compute $f_k(i)$ for sequence j with the forward algorithm
Compute $b_k(i)$ for sequence j with the backward algorithm
Add the contributions to A and E
Compute the new model parameters $a_{kl} = A_{kl} / \sum_{l'} A_{kl'}$ and $e_k(b) = E_k(b) / \sum_{b'} E_k(b')$
Compute the log-likelihood of the model
End: stop when the log-likelihood changes by less than some threshold or when
the maximum number of iterations is exceeded
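A compact, self-contained sketch of one such iteration for a discrete HMM
(the pseudocount r and all names are ours; a production version would add
rescaling, see the numerical stability slide):

def baum_welch_step(seqs, alphabet, a, e, start, r=0.1):
    """One Baum-Welch update: expected counts via forward/backward."""
    sts = list(a)
    A = {k: {l: r for l in sts} for k in sts}
    E = {k: {b: r for b in alphabet} for k in sts}
    for x in seqs:
        fcols = [{k: start[k] * e[k][x[0]] for k in sts}]   # forward pass
        for sym in x[1:]:
            f = fcols[-1]
            fcols.append({l: e[l][sym] * sum(f[k] * a[k][l] for k in sts)
                          for l in sts})
        px = sum(fcols[-1].values())
        bcols = [{k: 1.0 for k in sts}]                     # backward pass
        for sym in reversed(x[1:]):
            nxt = bcols[0]
            bcols.insert(0, {k: sum(a[k][l] * e[l][sym] * nxt[l] for l in sts)
                             for k in sts})
        for i in range(len(x)):
            for k in sts:
                E[k][x[i]] += fcols[i][k] * bcols[i][k] / px    # expected emissions
                if i + 1 < len(x):
                    for l in sts:                               # expected transitions
                        A[k][l] += (fcols[i][k] * a[k][l] * e[l][x[i + 1]]
                                    * bcols[i + 1][l] / px)
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in sts} for k in sts}
    new_e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in sts}
    return new_a, new_e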
Casino (V) – Baum-Welch training

Original model:
Fair:   1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
Transitions: Fair→Fair 0.95, Fair→Loaded 0.05, Loaded→Loaded 0.9, Loaded→Fair 0.1

Estimated from 300 throws:
Fair:   1: 0.19, 2: 0.19, 3: 0.23, 4: 0.08, 5: 0.23, 6: 0.08
Loaded: 1: 0.07, 2: 0.10, 3: 0.10, 4: 0.17, 5: 0.05, 6: 0.52
Transitions: Fair→Fair 0.73, Fair→Loaded 0.27, Loaded→Loaded 0.71, Loaded→Fair 0.29

Estimated from 30000 throws:
Fair:   1: 0.17, 2: 0.17, 3: 0.17, 4: 0.17, 5: 0.17, 6: 0.15
Loaded: 1: 0.10, 2: 0.11, 3: 0.10, 4: 0.11, 5: 0.10, 6: 0.48
Transitions: Fair→Fair 0.93, Fair→Loaded 0.07, Loaded→Loaded 0.88, Loaded→Fair 0.12
Numerical stability

Many expressions contain products of many probabilities, which causes
underflow when we compute them naively

For Viterbi, this can be solved by working with logarithms

For the forward and backward algorithms, we can work with an approximation to
the logarithm of a sum or with rescaled variables
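A sketch of both remedies (helper names are ours): a stable log-sum for
log-space recursions, and a rescaled forward pass that accumulates the scale
factors to return log P(x).

import math

def logsumexp(vals):
    """Stable log(sum(exp(v))): computes the log of a sum of probabilities
    given only their logarithms, without underflow."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def forward_scaled(x, states, a, e, start):
    """Forward algorithm with per-position rescaling; returns log P(x)."""
    col = {k: start[k] * e[k][x[0]] for k in states}
    s = sum(col.values())
    log_px = math.log(s)
    col = {k: v / s for k, v in col.items()}
    for sym in x[1:]:
        col = {l: e[l][sym] * sum(col[k] * a[k][l] for k in states)
               for l in states}
        s = sum(col.values())
        log_px += math.log(s)                      # accumulate the scale factors
        col = {k: v / s for k, v in col.items()}   # rescale to avoid underflow
    return log_px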
Summary
Hidden Markov Models
Computation of sequence and state probabilities
Viterbi computation of the best state path
The forward algorithm for the computation of the probability of a sequence
The backward algorithm for the computation of state probabilities
Parameter estimation for HMMs
Parameter estimation with known paths
Parameter estimation with unknown paths
Viterbi training
Baum-Welch training