A (discrete first order) Hidden Markov Model is characterized by a set of states,

S = {s1, s2, s3, ..., sn},

an alphabet,

X = {x1, x2, x3, ..., xm},

and two probability distributions, the state transition probability

P( s[k+1] = sj | s[k] = si )

and the output distribution

P( x[k] = xj | s[k] = si ).

There may be a known initial state s[1], or there may be a probability distribution on initial states. The state transitions are not observed directly, so given a sequence of outputs, we would like to answer three questions:

1. What is the probability that this HMM produced this sequence of outputs?
2. What is the most likely sequence of state transitions to produce this sequence of outputs?
3. Given a sequence of outputs, how should we update the transition and output probabilities?

1. Unless many of the output probabilities are zero, there will be exponentially many state sequences corresponding to any output sequence, so summing over all of them directly is infeasible. To calculate this probability efficiently, we can use a dynamic programming approach. Define the forward variable ai(k) as the joint probability of being in state si at time k and having produced the observed sequence x[1], x[2], ..., x[k]. Then

ai(1) = P( s[1] = i ) * P( x[1] | i ) and

ai(k+1) = sum over j ( aj(k) * P( i | j ) ) * P( x[k+1] | i ).

The answer to question 1 is then the sum of ai(T) over all states i, where T is the length of the output sequence.
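The forward recursion can be sketched as follows (a Python/NumPy sketch; the function name, matrix layout, and example numbers are illustrative assumptions, not from the text):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward pass: alpha[k, i] is the joint probability of state i at
    time k+1 and the outputs obs[0..k].
    pi:  (n,) initial state distribution (assumed layout)
    A:   (n, n) transitions, A[i, j] = P(next state j | current state i)
    B:   (n, m) outputs,     B[i, x] = P(output symbol x | state i)
    obs: list of output symbol indices
    """
    T, n = len(obs), len(pi)
    alpha = np.zeros((T, n))
    alpha[0] = pi * B[:, obs[0]]                  # a_i(1) = P(s1=i) P(x1|i)
    for k in range(1, T):
        # a_i(k+1) = sum_j a_j(k) P(i|j), then multiply by P(x[k+1]|i)
        alpha[k] = (alpha[k - 1] @ A) * B[:, obs[k]]
    return alpha

# The probability of the whole output sequence is alpha[-1].sum().
```

The cost is O(T * n^2), versus O(n^T) for summing over all state sequences explicitly.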

Similarly, we can define the backward variable bi[k] as the conditional probability of producing the observed sequence x[k+1], x[k+2], ..., x[T] given that s[k] was i. These can be calculated by

bi[T] = 1, and

bi[k] = sum over j ( bj[k+1] * P( x[k+1] | j ) * P( j | i ) )
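The backward recursion runs from time T down to 1. A sketch under the same assumed conventions as above (function name and matrix layout are illustrative):

```python
import numpy as np

def backward(A, B, obs):
    """Backward pass: beta[k, i] is the conditional probability of the
    outputs obs[k+1..T-1] given state i at time k+1.
    A: (n, n) transitions, A[i, j] = P(next state j | current state i)
    B: (n, m) outputs,     B[i, x] = P(output symbol x | state i)
    """
    T, n = len(obs), A.shape[0]
    beta = np.zeros((T, n))
    beta[-1] = 1.0                                # b_i(T) = 1
    for k in range(T - 2, -1, -1):
        # b_i(k) = sum_j P(j|i) P(x[k+1]|j) b_j(k+1)
        beta[k] = A @ (B[:, obs[k + 1]] * beta[k + 1])
    return beta
```

As a consistency check, summing pi_i * P(x[1]|i) * b_i(1) over i gives the same sequence probability as the forward pass.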

2. (Viterbi's algorithm) Define di(k) as the probability of the given sequence of outputs along the most likely state sequence ending in state i at time k. We can initialize this to

di(1) = P( s[1] = i ) * P( x[1] | i )

and we can initialize ci(1) (the backpointer to the most likely predecessor state, of which there is none at time 1) to

ci(1) = 0

We can then calculate the next step from the previous:

di(k+1) = P(x[k+1] | i) * max [ dj(k) * P( i | j ) ] and
ci(k+1) = argmax [ dj(k) * P( i | j ) ]

where the max and argmax are taken over all states j. When we reach time T, we pick the state with the largest di(T) and backtrack using the c's to recover the state sequence. As an alternative (to avoid multiplications and the numerical underflow that comes from multiplying many small probabilities), you could define

Vi[k] = log P( x[k] | i) and

Bji[k] = log P(i | j),

and use these values in place of the probabilities, replacing the multiplications with additions.
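The log-space variant can be sketched as follows (a Python/NumPy sketch; the function name and matrix layout are assumptions, and the small example is illustrative):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence, computed in log space to avoid underflow.
    Conventions as in the text: A[i, j] = P(j | i), B[i, x] = P(x | i)."""
    T, n = len(obs), len(pi)
    with np.errstate(divide="ignore"):            # log(0) -> -inf is fine here
        logpi, logA, logB = np.log(pi), np.log(A), np.log(B)
    d = np.zeros((T, n))       # d[k, i]: best log-probability of a path ending in i
    c = np.zeros((T, n), int)  # c[k, i]: predecessor state on that best path
    d[0] = logpi + logB[:, obs[0]]
    for k in range(1, T):
        scores = d[k - 1][:, None] + logA         # scores[j, i] = d_j(k) + log P(i|j)
        c[k] = scores.argmax(axis=0)              # argmax over predecessors j
        d[k] = scores.max(axis=0) + logB[:, obs[k]]
    # pick the best final state, then backtrack through the c's
    path = [int(d[-1].argmax())]
    for k in range(T - 1, 0, -1):
        path.append(int(c[k, path[-1]]))
    return path[::-1]
```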

3. We can estimate the transition probabilities using the forward and backward variables. Define eij[k] as the probability of being in state i at time k and state j at time k+1, given the observed sequence. We can calculate the numerator as

ai[k] * P( j | i ) * P( x[k+1] | j ) * bj[k+1]

and then normalize

eij[k] = ( ai[k] * P( j | i ) * P( x[k+1] | j ) * bj[k+1] ) divided by the same quantity summed over all i and j.

We can also calculate gi[k], the probability of being in state i at time k, by summing eij[k] over j. Summing gi[k] over k gives the expected number of times state i is visited, and summing eij[k] over k gives the expected number of transitions from i to j. Ratios of these expected counts give new initial, transition, and output probabilities for the model.
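One re-estimation step can be sketched as follows (a Python/NumPy sketch for a single observation sequence; the function name, variable names e and g, and matrix layout are assumptions):

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation step; returns updated (pi, A, B).
    A[i, j] = P(j | i), B[i, x] = P(x | i), obs is a list of symbol indices."""
    T, n = len(obs), len(pi)
    # forward and backward variables
    alpha = np.zeros((T, n))
    beta = np.zeros((T, n))
    alpha[0] = pi * B[:, obs[0]]
    for k in range(1, T):
        alpha[k] = (alpha[k - 1] @ A) * B[:, obs[k]]
    beta[-1] = 1.0
    for k in range(T - 2, -1, -1):
        beta[k] = A @ (B[:, obs[k + 1]] * beta[k + 1])
    prob = alpha[-1].sum()                       # P(observed sequence)
    # e[k, i, j]: P(state i at time k, state j at time k+1 | observations)
    e = np.zeros((T - 1, n, n))
    for k in range(T - 1):
        e[k] = alpha[k][:, None] * A * B[:, obs[k + 1]] * beta[k + 1]
        e[k] /= e[k].sum()                       # normalize over i and j
    g = np.zeros((T, n))                         # g[k, i]: P(state i at time k | obs)
    g[:-1] = e.sum(axis=2)
    g[-1] = alpha[-1] / prob
    # re-estimate: ratios of expected counts give the new probabilities
    new_pi = g[0]
    new_A = e.sum(axis=0) / g[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for x in range(B.shape[1]):
        new_B[:, x] = g[np.array(obs) == x].sum(axis=0)
    new_B /= g.sum(axis=0)[:, None]
    return new_pi, new_A, new_B
```

Each step leaves the rows of the new transition and output matrices properly normalized, and the likelihood of the observed sequence does not decrease.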

This procedure finds improved, but not necessarily optimal, probabilities for the model: each iteration is guaranteed not to decrease the likelihood of the observed sequence. It can be repeated until the likelihood converges.

This procedure is known as the Baum-Welch method, and is a special case of Expectation-Maximization (EM).

References:

Charles W. Therrien, Discrete Random Signals and Statistical Signal Processing, Prentice Hall, 1992.

L.R. Rabiner and B.H. Juang, An introduction to hidden Markov models, IEEE ASSP Magazine, vol 3, no. 1, pp. 4-16, January 1986.

Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.