Hidden Markov Models
Yves Moreau
Katholieke Universiteit Leuven
Regular expressions

Alignment:

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

Regular expression: [AT][CG][AC][ACGT]*A[TG][GC]

Problem: the regular expression does not distinguish the exceptional match
TGCTAGG from the consensus ACACATC.
Hidden Markov Models

[Figure: profile HMM built from the alignment above, with six match states
(m1-m6), one insert state between m3 and m4, and the following emission
probabilities]

        m1    m2    m3   insert   m4    m5    m6
A       .8     0    .8     .2      1     0     0
C        0    .8    .2     .4      0     0    .8
G        0    .2     0     .2      0    .2    .2
T       .2     0     0     .2      0    .8     0

Transitions: m1→m2 = m2→m3 = 1.0; m3→insert = .6; m3→m4 = .4;
insert→insert = .4; insert→m4 = .6; m4→m5 = m5→m6 = 1.0
Sequence score

Multiplying the emission and transition probabilities along the path of the
consensus sequence through the model:

$P(\mathrm{ACACATC}) = 0.8 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \cdot 0.6 \cdot 0.4 \cdot 0.6 \cdot 1 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \approx 0.0472$
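As a sanity check, this score is easy to reproduce in code. Below is a minimal
Python sketch (not part of the slides); the state labels m1-m6 and "ins" are
ours, the probabilities are the ones in the figure above, and the begin→m1
transition is implicitly 1.

# Minimal sketch: scoring a sequence against the small profile HMM above.
emissions = {
    "m1": {"A": 0.8, "T": 0.2},
    "m2": {"C": 0.8, "G": 0.2},
    "m3": {"A": 0.8, "C": 0.2},
    "ins": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "m4": {"A": 1.0},
    "m5": {"G": 0.2, "T": 0.8},
    "m6": {"C": 0.8, "G": 0.2},
}
transitions = {
    ("m1", "m2"): 1.0, ("m2", "m3"): 1.0,
    ("m3", "ins"): 0.6, ("m3", "m4"): 0.4,
    ("ins", "ins"): 0.4, ("ins", "m4"): 0.6,
    ("m4", "m5"): 1.0, ("m5", "m6"): 1.0,
}

def path_probability(symbols, path):
    """P(x, path): alternate emission and transition probabilities."""
    p = emissions[path[0]].get(symbols[0], 0.0)
    for i in range(1, len(symbols)):
        p *= transitions.get((path[i - 1], path[i]), 0.0)
        p *= emissions[path[i]].get(symbols[i], 0.0)
    return p

print(path_probability("ACACATC", ["m1", "m2", "m3", "ins", "m4", "m5", "m6"]))
# 0.04718... ≈ 0.0472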
Log odds

Use logarithms for scaling and normalize by a random model (every nucleotide
equally likely with probability 0.25).

Log odds for a sequence S of length L:

$\operatorname{log-odds}(S) = \log \dfrac{P(S)}{0.25^L} = \log P(S) - L \log 0.25$

[Figure: the same profile HMM with log-odds scores (natural logarithms)]

m1: A 1.16, T -0.22          m4: A 1.39
m2: C 1.16, G -0.22          m5: T 1.16, G -0.22
m3: A 1.16, C -0.22          m6: C 1.16, G -0.22
insert: A -0.22, C 0.47, G -0.22, T -0.22
Transitions: log 1.0 = 0, log 0.6 = -0.51, log 0.4 = -0.92
Log odds

For the consensus sequence:

$\operatorname{log-odds}(\mathrm{ACACATC}) = 1.16 + 0 + 1.16 + 0 + 1.16 - 0.51 + 0.47 - 0.51 + 1.39 + 0 + 1.16 + 0 + 1.16 \approx 6.65$
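The same number can be obtained programmatically; a small sketch reusing the
path_probability helper from the earlier sketch:

import math

p = path_probability("ACACATC", ["m1", "m2", "m3", "ins", "m4", "m5", "m6"])
log_odds = math.log(p) - len("ACACATC") * math.log(0.25)  # log P(S) - L log 0.25
print(round(log_odds, 2))  # ≈ 6.65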
Sequence                  Log odds
ACAC--ATC (consensus)        6.7
ACA---ATG                    4.9
TCAACTATC                    3.0
ACAC--AGC                    5.3
AGA---ATC                    4.9
ACCG--ATC                    4.6
TGCT--AGG (exceptional)     -0.97
Markov chain

Example of a Markov chain: a probabilistic model of a DNA sequence

Sequence: $x = x_1, x_2, \ldots, x_L$ with $x_i \in \{A, C, G, T\}$
(e.g., $x_1 = A$, $x_2 = C$, ...)

Transition probabilities: $a_{st} = P(x_i = t \mid x_{i-1} = s)$

[Figure: four-state Markov chain with states A, C, G, T]
Markov property

Probability of a sequence through Bayes' rule:

$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1)$

Markov property: "the future is a function only of the present and not of the
past", i.e., $P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1})$, so that

$P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$
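A short sketch of this product formula, with a made-up transition matrix (the
numbers below are illustrative, not from the slides):

# Hypothetical transition matrix trans[s][t] = P(x_i = t | x_{i-1} = s)
trans = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}
p1 = {s: 0.25 for s in "ACGT"}  # P(x_1), uniform here

def sequence_probability(x):
    """P(x) = P(x_1) * prod_{i=2..L} trans[x_{i-1}][x_i]  (Markov property)."""
    p = p1[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= trans[prev][cur]
    return p

print(sequence_probability("ACGT"))  # 0.25 * 0.2 * 0.2 * 0.2 = 0.002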
Beginning and end of a sequence

Computation of the probability is not homogeneous: $P(x_1)$ plays a different
role than the transition probabilities

The length distribution is not modeled: $P(\mathrm{length} = L)$ is unspecified

Solution: model the beginning and end of the sequence with a begin state s and
an end state t:

Sequence: $s, x_1, \ldots, x_L, t$ with $a_{s x_1} = P(x_1 \mid s)$ and $a_{x_L t} = P(t \mid x_L)$

The probability to observe a sequence of a given length then decreases with
the length of the sequence

[Figure: the four-state Markov chain extended with begin and end states]
Hidden Markov Model

In a hidden Markov model, we observe the symbol sequence x but we want to
reconstruct the hidden state sequence (the path $\pi$)

Transition probabilities: $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
(begin: $a_{0l}$; end: $a_{k0}$)

Emission probabilities: $e_k(b) = P(x_i = b \mid \pi_i = k)$

Joint probability of the sequence $x = x_1, \ldots, x_L$ and the path $\pi$:

$P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$, with the convention $\pi_{L+1} = 0$
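A direct transcription of this joint probability (a sketch: a and e are
dict-of-dict parameter tables, and 0 is used as the label of the shared
begin/end state):

def joint_probability(x, path, a, e):
    """P(x, pi) = a[0][pi_1] * prod_i e[pi_i](x_i) * a[pi_i][pi_{i+1}],
    with the convention pi_{L+1} = 0 (the shared begin/end state)."""
    p = a[0][path[0]]                              # begin -> first state
    for i, (state, symbol) in enumerate(zip(path, x)):
        p *= e[state].get(symbol, 0.0)             # emission e_k(x_i)
        nxt = path[i + 1] if i + 1 < len(path) else 0
        p *= a[state].get(nxt, 0.0)                # transition (to end at i = L)
    return p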
Casino (I) – problem setup

The casino mostly uses a fair die but sometimes switches to a loaded die

We observe the outcomes x of the successive throws but want to know when the
die was fair and when it was loaded (the path $\pi$)

[Figure: two-state HMM]
Fair die:   1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded die: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
Transitions: Fair→Fair 0.95, Fair→Loaded 0.05, Loaded→Loaded 0.9, Loaded→Fair 0.1
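The casino model is small enough to write down directly. The sketch below is
our own encoding (the end state is omitted, so $a_{k0}$ is implicitly 1, and
the uniform start distribution is an assumption, not on the slide); it also
generates throws for the decoding examples that follow.

import random

states = ["F", "L"]                    # fair, loaded
start = {"F": 0.5, "L": 0.5}           # assumed uniform start
a = {"F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.10, "L": 0.90}}
e = {"F": {str(k): 1 / 6 for k in range(1, 7)},
     "L": {**{str(k): 1 / 10 for k in range(1, 6)}, "6": 1 / 2}}

def simulate(n):
    """Generate n throws together with the hidden fair/loaded path."""
    state = random.choices(states, [start[s] for s in states])[0]
    throws, path = [], []
    for _ in range(n):
        path.append(state)
        faces = list(e[state])
        throws.append(random.choices(faces, [e[state][f] for f in faces])[0])
        state = random.choices(states, [a[state][s] for s in states])[0]
    return "".join(throws), "".join(path)

x, true_path = simulate(300)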
Estimation of the sequence and state probabilities

The Viterbi algorithm

We look for the most probable path:

$\pi^* = \arg\max_\pi P(x, \pi) = \arg\max_\pi P(\pi \mid x)$

This problem can be tackled by dynamic programming

Let us define $v_k(i)$ as the probability of the most probable path that ends
in state k with the emission of symbol $x_i$:

$v_k(i) = \max_{\pi_1, \ldots, \pi_{i-1}} P(x_1, \ldots, x_i, \pi_1, \ldots, \pi_{i-1}, \pi_i = k)$

Then we can compute this probability recursively as

$v_l(i+1) = e_l(x_{i+1}) \max_k \left( v_k(i)\, a_{kl} \right)$
The Viterbi algorithm

The Viterbi algorithm grows the best path dynamically

Initial condition: sequence in the beginning state

Traceback pointers to follow the best path back (= decoding)

Initialization ($i = 0$): $v_0(0) = 1$; $v_k(0) = 0$ for $k > 0$
Recursion ($i = 1, \ldots, L$): $v_l(i) = e_l(x_i) \max_k \left( v_k(i-1)\, a_{kl} \right)$; $\mathrm{ptr}_i(l) = \arg\max_k \left( v_k(i-1)\, a_{kl} \right)$
Termination: $P(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right)$; $\pi_L^* = \arg\max_k \left( v_k(L)\, a_{k0} \right)$
Traceback ($i = L, \ldots, 1$): $\pi_{i-1}^* = \mathrm{ptr}_i(\pi_i^*)$
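A probability-space sketch of the algorithm for the casino HMM (reusing
states, a, e, start from the simulation sketch; end transitions $a_{k0}$ are
taken as 1, and in practice one works with logarithms, see the numerical
stability slide):

def viterbi(x):
    """Most probable state path for symbol sequence x (casino HMM)."""
    v = [{k: start[k] * e[k][x[0]] for k in states}]        # initialization
    ptr = []
    for sym in x[1:]:                                       # recursion
        prev, col, back = v[-1], {}, {}
        for l in states:
            best = max(states, key=lambda k: prev[k] * a[k][l])
            back[l] = best
            col[l] = prev[best] * a[best][l] * e[l][sym]
        v.append(col)
        ptr.append(back)
    last = max(states, key=lambda k: v[-1][k])              # termination
    path = [last]
    for back in reversed(ptr):                              # traceback
        path.append(back[path[-1]])
    return "".join(reversed(path))

print(viterbi("266666126"))
# -> 'LLLLLLLLL': the run of sixes makes the loaded die most probable throughout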
Casino (II) – Viterbi

[Figure: Viterbi decoding of a sequence of die throws]
The forward algorithm

The forward algorithm lets us compute the probability P(x) of a sequence
w.r.t. an HMM

This is important for the computation of posterior probabilities and for the
comparison of HMMs

The sum over all paths (exponentially many) can be computed by dynamic
programming:

$P(x) = \sum_\pi P(x, \pi)$

Let us define $f_k(i)$ as the probability of the sequence so far for the paths
that end in state k with the emission of symbol $x_i$:

$f_k(i) = P(x_1, \ldots, x_i, \pi_i = k)$
The forward algorithm

The forward algorithm grows the total probability dynamically from the
beginning to the end of the sequence

Initial condition: sequence in the beginning state
End: all states converge to the end state

Initialization ($i = 0$): $f_0(0) = 1$; $f_k(0) = 0$ for $k > 0$
Recursion ($i = 1, \ldots, L$): $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
End: $P(x) = \sum_k f_k(L)\, a_{k0}$
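A matching sketch of the forward pass for the casino HMM (again with the end
transitions $a_{k0}$ taken as 1):

def forward(x):
    """P(x) for the casino HMM, by dynamic programming over all paths."""
    f = {k: start[k] * e[k][x[0]] for k in states}          # initialization
    for sym in x[1:]:                                       # recursion
        f = {l: e[l][sym] * sum(f[k] * a[k][l] for k in states)
             for l in states}
    return sum(f.values())                                  # end (a_{k0} = 1 here)

print(forward("266666126"))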
The backward algorithm

The backward algorithm lets us compute the probability of the complete
sequence together with the condition that symbol $x_i$ is emitted from state k

This is important to compute the probability of a given state at symbol $x_i$:

$P(x, \pi_i = k) = P(x_1, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_L \mid x_1, \ldots, x_i, \pi_i = k) = P(x_1, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_L \mid \pi_i = k)$

$P(x_1, \ldots, x_i, \pi_i = k)$ can be computed by the forward algorithm: $f_k(i)$

Let us define $b_k(i)$ as the probability of the rest of the sequence for the
paths that pass through state k at symbol $x_i$:

$b_k(i) = P(x_{i+1}, \ldots, x_L \mid \pi_i = k)$
The backward algorithm

The backward algorithm grows the probability $b_k(i)$ dynamically backwards
(from the end to the beginning of the sequence)

Border condition: start in the end state

Initialization ($i = L$): $b_k(L) = a_{k0}$ for all k
Recursion ($i = L-1, \ldots, 1$): $b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$
Termination: $P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$

Once both forward and backward probabilities are available, we can compute
the posterior probability of the state:

$P(\pi_i = k \mid x) = \dfrac{f_k(i)\, b_k(i)}{P(x)}$
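A sketch of the backward pass and the resulting posteriors for the casino HMM
(same conventions as the earlier sketches):

def backward(x):
    """Backward columns b_k(i) for i = 1..L (end transitions a_{k0} = 1)."""
    cols = [{k: 1.0 for k in states}]                       # initialization, i = L
    for sym in reversed(x[1:]):                             # recursion, i = L-1..1
        nxt = cols[0]
        cols.insert(0, {k: sum(a[k][l] * e[l][sym] * nxt[l] for l in states)
                        for k in states})
    return cols

def posterior(x):
    """P(pi_i = k | x) = f_k(i) b_k(i) / P(x) for every position i."""
    fcols = [{k: start[k] * e[k][x[0]] for k in states}]    # forward columns
    for sym in x[1:]:
        f = fcols[-1]
        fcols.append({l: e[l][sym] * sum(f[k] * a[k][l] for k in states)
                      for l in states})
    px = sum(fcols[-1].values())                            # P(x)
    bcols = backward(x)
    return [{k: fcols[i][k] * bcols[i][k] / px for k in states}
            for i in range(len(x))]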
Posterior decoding

Instead of using the most probable path for decoding (Viterbi), we can use
the path of the most probable states:

$\hat{\pi}_i = \arg\max_k P(\pi_i = k \mid x)$

The path $\hat{\pi}$ can be "illegal" ($P(\hat{\pi} \mid x) = 0$)

This approach can also be used when we are interested in a function g(k) of
the state (e.g., labeling):

$G(i \mid x) = \sum_k P(\pi_i = k \mid x)\, g(k)$
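Posterior decoding is then a one-liner on top of the posterior sketch above:

post = posterior("266666126")
decoded = "".join(max(col, key=col.get) for col in post)   # pi-hat_i per position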
Casino (III) – posterior decoding

[Figure: posterior probability of the state "fair" as a function of the die
throws]
Casino (IV) – posterior decoding

New situation: $P(\pi_{i+1} = \mathrm{Fair} \mid \pi_i = \mathrm{Fair}) = 0.99$

Viterbi decoding cannot detect the cheating in 1000 throws, while posterior
decoding can

[Figure: two-state HMM]
Fair die:   1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded die: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
Transitions: Fair→Fair 0.99, Fair→Loaded 0.01, Loaded→Loaded 0.9, Loaded→Fair 0.1
Parameter estimation for HMMs
Choice of the architecture
For parameter estimation, we assume that the architecture of the HMM is known
The choice of architecture is an essential design decision
Duration modeling
“Silent states” for gaps
Parameter estimation with known paths

HMM with parameters $\theta$ (transition and emission probabilities)

Training set D of N sequences $x^1, \ldots, x^N$

The score of the model is the likelihood of the parameters given the training
data:

$\mathrm{Score}(D, \theta) = \log P(x^1, \ldots, x^N \mid \theta) = \sum_{j=1}^{N} \log P(x^j \mid \theta)$
Parameter estimation with known paths

If the state paths are known, the parameters are estimated through counts
(how often is a transition used, how often is a symbol produced by a given
state), with 'pseudocounts' if necessary:

$A_{kl}$ = number of transitions from k to l in the training set + pseudocount $r_{kl}$
$E_k(b)$ = number of emissions of b from k in the training set + pseudocount $r_k(b)$

$a_{kl} = \dfrac{A_{kl}}{\sum_{l'} A_{kl'}}$ and $e_k(b) = \dfrac{E_k(b)}{\sum_{b'} E_k(b')}$
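A sketch of this counting procedure (names are ours; uniform pseudocounts for
simplicity):

def estimate(sequences, paths, states, alphabet, r_trans=1.0, r_emit=1.0):
    """ML estimates from known state paths, with uniform pseudocounts."""
    A = {k: {l: r_trans for l in states} for k in states}
    E = {k: {b: r_emit for b in alphabet} for k in states}
    for x, pi in zip(sequences, paths):
        for i, (sym, k) in enumerate(zip(x, pi)):
            E[k][sym] += 1                                  # emission count
            if i + 1 < len(pi):
                A[k][pi[i + 1]] += 1                        # transition count
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e

# e.g. estimate(["266666"], ["FLLLLL"], "FL", "123456")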
Parameter estimation with unknown paths: Viterbi training

Strategy: iterative method
Suppose the current parameters are known and find the best paths by Viterbi decoding
Treat these paths as known and re-estimate the parameters from the counts
Iterate until convergence

Viterbi training does not maximize the likelihood of the parameters; it
converges to

$\theta^{\mathrm{Vit}} = \arg\max_\theta P(x^1, \ldots, x^N \mid \theta, \pi^*(x^1), \ldots, \pi^*(x^N))$

Viterbi training converges exactly in a finite number of steps
Parameter estimation with unknown paths: Baum-Welch training

Strategy: parallel to Viterbi training, but we use the expected values of the
transition and emission counts (instead of using only the best path)

Probability of a transition from k to l at position i:

$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \dfrac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$

For the transitions:

$A_{kl} = \sum_j \dfrac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x_{i+1}^j)\, b_l^j(i+1)$

For the emissions:

$E_k(b) = \sum_j \dfrac{1}{P(x^j)} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i)$
Parameter estimation with unknown paths: Baum-Welch training

Initialization: choose arbitrary model parameters
Recursion:
Set all transition and emission variables to their pseudocounts
For all sequences j = 1, ..., n:
Compute $f_k(i)$ for sequence j with the forward algorithm
Compute $b_k(i)$ for sequence j with the backward algorithm
Add the contributions to A and E
Compute the new model parameters $a_{kl} = A_{kl} / \sum_{l'} A_{kl'}$ and $e_k(b) = E_k(b) / \sum_{b'} E_k(b')$
Compute the log-likelihood of the model
End: stop when the log-likelihood changes by less than some threshold or when
the maximum number of iterations is exceeded
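A compact, self-contained sketch of one such iteration for a discrete HMM
(the pseudocount r and all names are ours; a production version would add
rescaling, see the numerical stability slide):

def baum_welch_step(seqs, alphabet, a, e, start, r=0.1):
    """One Baum-Welch update: expected counts via forward/backward."""
    sts = list(a)
    A = {k: {l: r for l in sts} for k in sts}
    E = {k: {b: r for b in alphabet} for k in sts}
    for x in seqs:
        fcols = [{k: start[k] * e[k][x[0]] for k in sts}]   # forward pass
        for sym in x[1:]:
            f = fcols[-1]
            fcols.append({l: e[l][sym] * sum(f[k] * a[k][l] for k in sts)
                          for l in sts})
        px = sum(fcols[-1].values())
        bcols = [{k: 1.0 for k in sts}]                     # backward pass
        for sym in reversed(x[1:]):
            nxt = bcols[0]
            bcols.insert(0, {k: sum(a[k][l] * e[l][sym] * nxt[l] for l in sts)
                             for k in sts})
        for i in range(len(x)):
            for k in sts:
                E[k][x[i]] += fcols[i][k] * bcols[i][k] / px    # expected emissions
                if i + 1 < len(x):
                    for l in sts:                               # expected transitions
                        A[k][l] += (fcols[i][k] * a[k][l] * e[l][x[i + 1]]
                                    * bcols[i + 1][l] / px)
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in sts} for k in sts}
    new_e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in sts}
    return new_a, new_e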
Casino (V) – Baum-Welch training

Original model:
Fair:   1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
Transitions: Fair→Fair 0.95, Fair→Loaded 0.05, Loaded→Loaded 0.9, Loaded→Fair 0.1

Estimated from 300 throws:
Fair:   1: 0.19, 2: 0.19, 3: 0.23, 4: 0.08, 5: 0.23, 6: 0.08
Loaded: 1: 0.07, 2: 0.10, 3: 0.10, 4: 0.17, 5: 0.05, 6: 0.52
Transitions: Fair→Fair 0.73, Fair→Loaded 0.27, Loaded→Loaded 0.71, Loaded→Fair 0.29

Estimated from 30000 throws:
Fair:   1: 0.17, 2: 0.17, 3: 0.17, 4: 0.17, 5: 0.17, 6: 0.15
Loaded: 1: 0.10, 2: 0.11, 3: 0.10, 4: 0.11, 5: 0.10, 6: 0.48
Transitions: Fair→Fair 0.93, Fair→Loaded 0.07, Loaded→Loaded 0.88, Loaded→Fair 0.12
Numerical stability

Many expressions contain products of many probabilities, which causes
underflow when we compute them naively

For Viterbi, this can be solved by working with logarithms

For the forward and backward algorithms, we can work with an approximation to
the logarithm of a sum or with rescaled variables
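A sketch of both remedies (helper names are ours): a stable log-sum for
log-space recursions, and a rescaled forward pass that accumulates the scale
factors to return log P(x).

import math

def logsumexp(vals):
    """Stable log(sum(exp(v))): computes the log of a sum of probabilities
    given only their logarithms, without underflow."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def forward_scaled(x, states, a, e, start):
    """Forward algorithm with per-position rescaling; returns log P(x)."""
    col = {k: start[k] * e[k][x[0]] for k in states}
    s = sum(col.values())
    log_px = math.log(s)
    col = {k: v / s for k, v in col.items()}
    for sym in x[1:]:
        col = {l: e[l][sym] * sum(col[k] * a[k][l] for k in states)
               for l in states}
        s = sum(col.values())
        log_px += math.log(s)                      # accumulate the scale factors
        col = {k: v / s for k, v in col.items()}   # rescale to avoid underflow
    return log_px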
Summary
Hidden Markov Models
Computation of sequence and state probabilities
Viterbi computation of the best state path
The forward algorithm for the computation of the probability of a sequence
The backward algorithm for the computation of state probabilities
Parameter estimation for HMMs
Parameter estimation with known paths
Parameter estimation with unknown paths
Viterbi training
Baum-Welch training