A (discrete first order) Hidden Markov Model is characterized by a set of states,

S = {s1, s2, s3, ..., sn},

an alphabet,

X = {x1, x2, x3, ..., xm},

and two probability distributions, the state transition probability

P( s[k+1] = sj | s[k] = si )

and the output distribution

P( x[k+1] = xj | s[k] = si ).

There may be a known initial state s[1], or there may be a probability distribution on initial states. The state transitions are not observed directly, so given a sequence of outputs, we would like to answer three questions:

  1. What is the probability that this HMM produced this sequence of outputs?
  2. What is the most likely sequence of state transitions to produce this sequence of outputs?
  3. Given a sequence of outputs, how should we update the transition and output probabilities?


  1. Unless many of the output probabilities are zero, there will be exponentially many state sequences corresponding to any output sequence. To calculate this probability efficiently, we can use a dynamic programming approach. Define the forward variable ai(k) as the probability of being in state si at time k after producing the observed sequence x[1], x[2], ... x[k], then

    ai(1) = P( s[1] = i ) * P( x[1] | i ) and

    ai(k+1) = sum over j ( aj(k) * P( i | j ) ) * P( x[k+1] | i )

    Similarly, we can define the backward variable bi[k] as the conditional probability of producing the observed sequence x[k+1], x[k+2], ..., x[T] given that s[k] was i. These can be calculated by

    bi[T] = 1, and

    bi[k] = sum over j (bj[k+1] * P( x[k+1] | j ) * P(j | i)

    Sample calculation of forward and backward variables

  2. (Viterbi's algorithm) Define di(k) as the probability of the given sequence of outputs and the most likely state sequence up to state i at time k. We can initialize this to

    di(1) = P( s[1] = i ) * P( x[1] | i )

    and we can initialize ci(1) (the most likely state) to

    ci(1) = 0

    We can then calculate the next step from the previous:

    di(k+1) = P(x[k+1] | i) * max [ dj(k) * P( i | j ) ] and
    ci(k+1) = argmax [ dj(k) * P( i | j ) ]

    where the max and argmax are taken over all states j. When we reach T, we pick the state with the largest di(T), and backtrack using the c's to find the state sequence. As an alternative (to avoid multiplications) you could define

    Vi[k] = log P( x[k] | i) and

    Bji[k] = log P(i | j),

    use these values to replace the probabilities and replace the multiplications by additions.

  3. We can estimate the transition probabilities using the forward and backward variables. Define eij[k] as the probability of being in state i at time k and state j at time k+1, given the observed sequence. We can calculate the numerator as

    ai[k] * P( j | i) ) * P( x[k+1] | j ) * bj[k+1]

    and then normalize

    eij[k] = ai[k] * P( j | i) ) * P( x[k+1] | j ) * bj[k+1] divided by the numerator summed over i and j

    We can also calculate gi[k], the probability of being in state i at time k by summing eij[k] over j. We can then sum these over k to get the expected number of times state i is visited and the expected number of transitions from i to j. We can use these expected values to get new initial, transition, and output probabilities for the model.

    This procedure finds improved, but not optimal, probabilities for the model. It can be iterated to further improve the model.

    This procedure is known as the Baum-Welch method, and is a special case of Expectation-Maximization (EM).

References: Discrete Random Signals and Statistical Signal Processing, by Charles W. Therrien, Prentice Hall, 1992

L.R. Rabiner and B.H. Juang, An introduction to hidden Markov models, IEEE ASSP Magazine, vol 3, no. 1, pp. 4-16, January 1986.

Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.

Peter Clote and Rolf Backofen, Computational Molecular Biology: An Introduction, Wiley, 2000.