S = {s_{1}, s_{2}, s_{3}, ..., s_{n}},

an alphabet,

X = {x_{1}, x_{2}, x_{3}, ..., x_{m}},

and two probability distributions, the state transition probability

P( s[k+1] = s_{j} | s[k] = s_{i} )

and the output distribution

P( x[k] = x_{j} | s[k] = s_{i} ).

There may be a known initial state s[1], or there may be a probability distribution on initial states. The state transitions are not observed directly, so given a sequence of outputs, we would like to answer three questions:
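As a concrete illustration, such a model might be written down in Python as below; the two states, two-symbol alphabet, and all of the probabilities are made up for this example:

```python
# A hypothetical 2-state HMM over the alphabet {0, 1}.
# trans[i][j] = P( s[k+1] = j | s[k] = i )
# emit[i][x]  = P( x[k] = x  | s[k] = i )
# init[i]     = P( s[1] = i )
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit  = [[0.9, 0.1],
         [0.2, 0.8]]
init  = [0.6, 0.4]

# Each row is a probability distribution, so it must sum to 1.
for row in trans + emit + [init]:
    assert abs(sum(row) - 1.0) < 1e-12
```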

- What is the probability that this HMM produced this sequence of outputs?
- What is the most likely sequence of state transitions to produce this sequence of outputs?
- Given a sequence of outputs, how should we update the transition and output probabilities?

**Answers:**

- Unless many of the output probabilities are zero, there will be exponentially
  many state sequences corresponding to any output sequence. To calculate this
  probability efficiently, we can use a dynamic programming approach. Define
  the **forward variable** a_{i}(k) as the probability of being in state s_{i}
  at time k after producing the observed sequence x[1], x[2], ..., x[k]. Then

  a_{i}(1) = P( s[1] = i ) * P( x[1] | i )

  and

  a_{i}(k+1) = [ sum over j ( a_{j}(k) * P( i | j ) ) ] * P( x[k+1] | i ).

  The probability of the whole output sequence is a_{i}(T) summed over i.
  Similarly, we can define the **backward variable** b_{i}(k) as the
  conditional probability of producing the observed sequence x[k+1], x[k+2],
  ..., x[T] given that s[k] was i. These can be calculated by

  b_{i}(T) = 1

  and

  b_{i}(k) = sum over j ( b_{j}(k+1) * P( x[k+1] | j ) * P( j | i ) ).
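The forward and backward recursions can be sketched in plain Python. The two-state model and the observation sequence below are made-up illustrations; note that both variables yield the same total probability for the output sequence, which is a useful consistency check:

```python
def forward(obs, init, trans, emit):
    """a[k][i] = P(x[1..k+1] and state i at step k); indices are 0-based."""
    n = len(init)
    a = [[init[i] * emit[i][obs[0]] for i in range(n)]]
    for x in obs[1:]:
        prev = a[-1]
        a.append([sum(prev[j] * trans[j][i] for j in range(n)) * emit[i][x]
                  for i in range(n)])
    return a

def backward(obs, trans, emit):
    """b[k][i] = P(x[k+2..T] | state i at step k); b at the last step is 1."""
    n = len(trans)
    b = [[1.0] * n]
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, [sum(trans[i][j] * emit[j][x] * nxt[j] for j in range(n))
                     for i in range(n)])
    return b

# Hypothetical 2-state model over the alphabet {0, 1}.
trans = [[0.7, 0.3], [0.4, 0.6]]
emit  = [[0.9, 0.1], [0.2, 0.8]]
init  = [0.6, 0.4]
obs   = [0, 1, 0]

a = forward(obs, init, trans, emit)
b = backward(obs, trans, emit)
# P(obs) is a_i(T) summed over i; the same value falls out of the
# backward variables at the first step.
p_fwd = sum(a[-1])
p_bwd = sum(init[i] * emit[i][obs[0]] * b[0][i] for i in range(2))
```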

- **(Viterbi's algorithm)** Define d_{i}(k) as the probability of the given
  sequence of outputs and the most likely state sequence up to state i at time
  k, and c_{i}(k) as the most likely predecessor of state i at time k. We can
  initialize these to

  d_{i}(1) = P( s[1] = i ) * P( x[1] | i ) and c_{i}(1) = 0.

  We can then calculate the next step from the previous:

  d_{i}(k+1) = P( x[k+1] | i ) * max [ d_{j}(k) * P( i | j ) ]

  and

  c_{i}(k+1) = argmax [ d_{j}(k) * P( i | j ) ],

  where the max and argmax are taken over all states j. When we reach T, we
  pick the state with the largest d_{i}(T) and backtrack using the c's to find
  the state sequence. As an alternative (to avoid multiplications, and the
  numerical underflow they cause on long sequences), you can define

  V_{i}(k) = log P( x[k] | i ) and B_{ji} = log P( i | j ),

  use these values in place of the probabilities, and replace the
  multiplications by additions.
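A sketch of Viterbi's algorithm in Python, using the logarithm trick described above so the multiplications become additions; the two-state model is again a hypothetical example:

```python
import math

def viterbi(obs, init, trans, emit):
    """Most likely state sequence, computed with log probabilities
    so that long sequences do not underflow."""
    n = len(init)
    d = [[math.log(init[i]) + math.log(emit[i][obs[0]]) for i in range(n)]]
    c = [[0] * n]  # c[k][i]: most likely predecessor of state i at step k
    for x in obs[1:]:
        prev = d[-1]
        row_d, row_c = [], []
        for i in range(n):
            best_j = max(range(n), key=lambda j: prev[j] + math.log(trans[j][i]))
            row_c.append(best_j)
            row_d.append(math.log(emit[i][x]) + prev[best_j]
                         + math.log(trans[best_j][i]))
        d.append(row_d)
        c.append(row_c)
    # Pick the best final state, then backtrack through the c's.
    state = max(range(n), key=lambda i: d[-1][i])
    path = [state]
    for k in range(len(obs) - 1, 0, -1):
        state = c[k][state]
        path.insert(0, state)
    return path

# Hypothetical 2-state model over the alphabet {0, 1}.
trans = [[0.7, 0.3], [0.4, 0.6]]
emit  = [[0.9, 0.1], [0.2, 0.8]]
init  = [0.6, 0.4]
path = viterbi([0, 1, 1, 0], init, trans, emit)
```

For this particular model, state 0 tends to emit symbol 0 and state 1 tends to emit symbol 1, so the decoded path simply tracks the observations.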

- We can estimate the transition probabilities using the forward and backward
  variables. Define e_{ij}(k) as the probability of being in state i at time k
  and in state j at time k+1, given the observed sequence. We can calculate
  the numerator as

  a_{i}(k) * P( j | i ) * P( x[k+1] | j ) * b_{j}(k+1)

  and then normalize:

  e_{ij}(k) = a_{i}(k) * P( j | i ) * P( x[k+1] | j ) * b_{j}(k+1), divided by
  this numerator summed over all i and j.

  We can also calculate g_{i}(k), the probability of being in state i at time
  k, by summing e_{ij}(k) over j. We can then sum these over k to get the
  expected number of times state i is visited and the expected number of
  transitions from i to j, and use these expected values to get new initial,
  transition, and output probabilities for the model. This procedure finds
  improved, but not optimal, probabilities for the model, and it can be
  iterated to further improve the model.
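One re-estimation pass can be sketched as follows: compute the forward and backward variables, then e and g as defined above, and finally the updated probabilities. The two-state model and short observation sequence are made-up examples:

```python
def baum_welch_step(obs, init, trans, emit):
    """One re-estimation pass; returns updated (init, trans, emit)."""
    n, T = len(init), len(obs)
    # Forward and backward variables, as defined earlier (0-based indices).
    a = [[init[i] * emit[i][obs[0]] for i in range(n)]]
    for x in obs[1:]:
        a.append([sum(a[-1][j] * trans[j][i] for j in range(n)) * emit[i][x]
                  for i in range(n)])
    b = [[1.0] * n]
    for x in reversed(obs[1:]):
        b.insert(0, [sum(trans[i][j] * emit[j][x] * b[0][j] for j in range(n))
                     for i in range(n)])
    # e[k][i][j]: P(state i at k and state j at k+1 | obs), normalized
    # by the numerator summed over all i and j.
    e = []
    for k in range(T - 1):
        num = [[a[k][i] * trans[i][j] * emit[j][obs[k + 1]] * b[k + 1][j]
                for j in range(n)] for i in range(n)]
        z = sum(map(sum, num))
        e.append([[v / z for v in row] for row in num])
    # g[k][i]: P(state i at k | obs) = sum_j e[k][i][j]; the last step has
    # no outgoing transition, so get g there from a and b directly.
    g = [[sum(e[k][i]) for i in range(n)] for k in range(T - 1)]
    g.append([a[T - 1][i] * b[T - 1][i] / sum(a[T - 1]) for i in range(n)])
    # Ratios of expected counts give the re-estimated probabilities.
    new_init = g[0][:]
    new_trans = [[sum(e[k][i][j] for k in range(T - 1)) /
                  sum(g[k][i] for k in range(T - 1))
                  for j in range(n)] for i in range(n)]
    new_emit = [[sum(g[k][i] for k in range(T) if obs[k] == x) /
                 sum(g[k][i] for k in range(T))
                 for x in range(len(emit[0]))] for i in range(n)]
    return new_init, new_trans, new_emit

# Hypothetical 2-state model over the alphabet {0, 1}.
trans = [[0.7, 0.3], [0.4, 0.6]]
emit  = [[0.9, 0.1], [0.2, 0.8]]
init  = [0.6, 0.4]
init2, trans2, emit2 = baum_welch_step([0, 1, 1, 0, 0], init, trans, emit)
# The updated rows are again probability distributions.
```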

This procedure is known as the **Baum-Welch** method, and is a special case of **Expectation-Maximization** (EM).

**References:**
Charles W. Therrien, Discrete Random Signals and Statistical Signal Processing, Prentice Hall, 1992.

L.R. Rabiner and B.H. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, January 1986.

Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.