S = {s1, s2, s3, ..., sn},
an alphabet,
X = {x1, x2, x3, ..., xm},
and two probability distributions, the state transition probability
P( s[k+1] = sj | s[k] = si )
and the output distribution
P( x[k+1] = xj | s[k] = si ).
There may be a known initial state s[1], or there may be a probability distribution on initial states. The state transitions are not observed directly, so given a sequence of outputs, we would like to answer three questions:
Answers:
ai(1) = P( s[1] = i ) * P( x[1] | i ) and
ai(k+1) = sum over j ( aj(k) * P( i | j ) ) * P( x[k+1] | i )
Similarly, we can define the backward variable bi[k] as the conditional probability of producing the observed sequence x[k+1], x[k+2], ..., x[T] given that s[k] was i. These can be calculated by
bi[T] = 1, and
bi[k] = sum over j (bj[k+1] * P( x[k+1] | j ) * P(j | i)
di(1) = P( s[1] = i ) * P( x[1] | i )
and we can initialize ci(1) (the most likely state) to
ci(1) = 0
We can then calculate the next step from the previous:
di(k+1) = P(x[k+1] | i) * max [ dj(k) * P( i | j
) ] and
ci(k+1) = argmax [ dj(k) * P( i | j ) ]
where the max and argmax are taken over all states j. When we reach T, we pick the state with the largest di(T), and backtrack using the c's to find the state sequence. As an alternative (to avoid multiplications) you could define
Vi[k] = log P( x[k] | i) and
Bji[k] = log P(i | j),
use these values to replace the probabilities and replace the multiplications by additions.
ai[k] * P( j | i) ) * P( x[k+1] | j ) * bj[k+1]
and then normalize
eij[k] = ai[k] * P( j | i) ) * P( x[k+1] | j ) * bj[k+1] divided by the numerator summed over i and j
We can also calculate gi[k], the probability of being in state i at time k by summing eij[k] over j. We can then sum these over k to get the expected number of times state i is visited and the expected number of transitions from i to j. We can use these expected values to get new initial, transition, and output probabilities for the model.
This procedure finds improved, but not optimal, probabilities for the model. It can be iterated to further improve the model.
This procedure is known as the Baum-Welch method, and is a special case of Expectation-Maximization (EM).
References: Discrete Random Signals and Statistical Signal Processing, by Charles W. Therrien, Prentice Hall, 1992
L.R. Rabiner and B.H. Juang, An introduction to hidden Markov models, IEEE ASSP Magazine, vol 3, no. 1, pp. 4-16, January 1986.
Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
Peter Clote and Rolf Backofen, Computational Molecular Biology: An Introduction, Wiley, 2000.