S = {s_{1}, s_{2}, s_{3}, ..., s_{n}},

an alphabet,

X = {x_{1}, x_{2}, x_{3}, ..., x_{m}},

and two probability distributions, the state transition probability

P( s[k+1] = s_{j} | s[k] = s_{i} )

and the output distribution

P( x[k] = x_{j} | s[k] = s_{i} ).

There may be a known initial state s[1], or there may be a probability distribution on initial states. The state transitions are not observed directly, so given a sequence of outputs, we would like to answer three questions:
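As a concrete illustration, such a model might be written down in Python as below; the two states, two-symbol alphabet, and all of the probabilities are made up for this example:

```python
# A hypothetical 2-state HMM over the alphabet {0, 1}.
# trans[i][j] = P( s[k+1] = j | s[k] = i )
# emit[i][x]  = P( x[k] = x  | s[k] = i )
# init[i]     = P( s[1] = i )
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit  = [[0.9, 0.1],
         [0.2, 0.8]]
init  = [0.6, 0.4]

# Each row is a probability distribution, so it must sum to 1.
for row in trans + emit + [init]:
    assert abs(sum(row) - 1.0) < 1e-12
```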

- What is the probability that this HMM produced this sequence of outputs?
- What is the most likely sequence of state transitions to produce this sequence of outputs?
- Given a sequence of outputs, how should we update the transition and output probabilities?

**Answers:**

- Unless many of the output probabilities are zero, there will be exponentially
  many state sequences corresponding to any output sequence. To calculate this
  probability efficiently, we can use a dynamic programming approach. Define
  the **forward variable** a_{i}(k) as the probability of being in state s_{i}
  at time k after producing the observed sequence x[1], x[2], ..., x[k]. Then

  a_{i}(1) = P( s[1] = i ) * P( x[1] | i )

  and

  a_{i}(k+1) = [ sum over j ( a_{j}(k) * P( i | j ) ) ] * P( x[k+1] | i ).

  The probability of the whole output sequence is a_{i}(T) summed over i.
  Similarly, we can define the **backward variable** b_{i}(k) as the
  conditional probability of producing the observed sequence x[k+1], x[k+2],
  ..., x[T] given that s[k] was i. These can be calculated by

  b_{i}(T) = 1

  and

  b_{i}(k) = sum over j ( b_{j}(k+1) * P( x[k+1] | j ) * P( j | i ) ).
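The forward and backward recursions can be sketched in plain Python. The two-state model and the observation sequence below are made-up illustrations; note that both variables yield the same total probability for the output sequence, which is a useful consistency check:

```python
def forward(obs, init, trans, emit):
    """a[k][i] = P(x[1..k+1] and state i at step k); indices are 0-based."""
    n = len(init)
    a = [[init[i] * emit[i][obs[0]] for i in range(n)]]
    for x in obs[1:]:
        prev = a[-1]
        a.append([sum(prev[j] * trans[j][i] for j in range(n)) * emit[i][x]
                  for i in range(n)])
    return a

def backward(obs, trans, emit):
    """b[k][i] = P(x[k+2..T] | state i at step k); b at the last step is 1."""
    n = len(trans)
    b = [[1.0] * n]
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, [sum(trans[i][j] * emit[j][x] * nxt[j] for j in range(n))
                     for i in range(n)])
    return b

# Hypothetical 2-state model over the alphabet {0, 1}.
trans = [[0.7, 0.3], [0.4, 0.6]]
emit  = [[0.9, 0.1], [0.2, 0.8]]
init  = [0.6, 0.4]
obs   = [0, 1, 0]

a = forward(obs, init, trans, emit)
b = backward(obs, trans, emit)
# P(obs) is a_i(T) summed over i; the same value falls out of the
# backward variables at the first step.
p_fwd = sum(a[-1])
p_bwd = sum(init[i] * emit[i][obs[0]] * b[0][i] for i in range(2))
```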

- **(Viterbi's algorithm)** Define d_{i}(k) as the probability of the given
  sequence of outputs and the most likely state sequence up to state i at time
  k, and c_{i}(k) as the most likely predecessor of state i at time k. We can
  initialize these to

  d_{i}(1) = P( s[1] = i ) * P( x[1] | i ) and c_{i}(1) = 0.

  We can then calculate the next step from the previous:

  d_{i}(k+1) = P( x[k+1] | i ) * max [ d_{j}(k) * P( i | j ) ]

  and

  c_{i}(k+1) = argmax [ d_{j}(k) * P( i | j ) ],

  where the max and argmax are taken over all states j. When we reach T, we
  pick the state with the largest d_{i}(T) and backtrack using the c's to find
  the state sequence. As an alternative (to avoid multiplications, and the
  numerical underflow they cause on long sequences), you can define

  V_{i}(k) = log P( x[k] | i ) and B_{ji} = log P( i | j ),

  use these values in place of the probabilities, and replace the
  multiplications by additions.
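A sketch of Viterbi's algorithm in Python, using the logarithm trick described above so the multiplications become additions; the two-state model is again a hypothetical example:

```python
import math

def viterbi(obs, init, trans, emit):
    """Most likely state sequence, computed with log probabilities
    so that long sequences do not underflow."""
    n = len(init)
    d = [[math.log(init[i]) + math.log(emit[i][obs[0]]) for i in range(n)]]
    c = [[0] * n]  # c[k][i]: most likely predecessor of state i at step k
    for x in obs[1:]:
        prev = d[-1]
        row_d, row_c = [], []
        for i in range(n):
            best_j = max(range(n), key=lambda j: prev[j] + math.log(trans[j][i]))
            row_c.append(best_j)
            row_d.append(math.log(emit[i][x]) + prev[best_j]
                         + math.log(trans[best_j][i]))
        d.append(row_d)
        c.append(row_c)
    # Pick the best final state, then backtrack through the c's.
    state = max(range(n), key=lambda i: d[-1][i])
    path = [state]
    for k in range(len(obs) - 1, 0, -1):
        state = c[k][state]
        path.insert(0, state)
    return path

# Hypothetical 2-state model over the alphabet {0, 1}.
trans = [[0.7, 0.3], [0.4, 0.6]]
emit  = [[0.9, 0.1], [0.2, 0.8]]
init  = [0.6, 0.4]
path = viterbi([0, 1, 1, 0], init, trans, emit)
```

For this particular model, state 0 tends to emit symbol 0 and state 1 tends to emit symbol 1, so the decoded path simply tracks the observations.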

- We can estimate the transition probabilities using the forward and backward
  variables. Define e_{ij}(k) as the probability of being in state i at time k
  and in state j at time k+1, given the observed sequence. We can calculate
  the numerator as

  a_{i}(k) * P( j | i ) * P( x[k+1] | j ) * b_{j}(k+1)

  and then normalize:

  e_{ij}(k) = a_{i}(k) * P( j | i ) * P( x[k+1] | j ) * b_{j}(k+1), divided by
  this numerator summed over all i and j.

  We can also calculate g_{i}(k), the probability of being in state i at time
  k, by summing e_{ij}(k) over j. We can then sum these over k to get the
  expected number of times state i is visited and the expected number of
  transitions from i to j, and use these expected values to get new initial,
  transition, and output probabilities for the model. This procedure finds
  improved, but not optimal, probabilities for the model, and it can be
  iterated to further improve the model.
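One re-estimation pass can be sketched as follows: compute the forward and backward variables, then e and g as defined above, and finally the updated probabilities. The two-state model and short observation sequence are made-up examples:

```python
def baum_welch_step(obs, init, trans, emit):
    """One re-estimation pass; returns updated (init, trans, emit)."""
    n, T = len(init), len(obs)
    # Forward and backward variables, as defined earlier (0-based indices).
    a = [[init[i] * emit[i][obs[0]] for i in range(n)]]
    for x in obs[1:]:
        a.append([sum(a[-1][j] * trans[j][i] for j in range(n)) * emit[i][x]
                  for i in range(n)])
    b = [[1.0] * n]
    for x in reversed(obs[1:]):
        b.insert(0, [sum(trans[i][j] * emit[j][x] * b[0][j] for j in range(n))
                     for i in range(n)])
    # e[k][i][j]: P(state i at k and state j at k+1 | obs), normalized
    # by the numerator summed over all i and j.
    e = []
    for k in range(T - 1):
        num = [[a[k][i] * trans[i][j] * emit[j][obs[k + 1]] * b[k + 1][j]
                for j in range(n)] for i in range(n)]
        z = sum(map(sum, num))
        e.append([[v / z for v in row] for row in num])
    # g[k][i]: P(state i at k | obs) = sum_j e[k][i][j]; the last step has
    # no outgoing transition, so get g there from a and b directly.
    g = [[sum(e[k][i]) for i in range(n)] for k in range(T - 1)]
    g.append([a[T - 1][i] * b[T - 1][i] / sum(a[T - 1]) for i in range(n)])
    # Ratios of expected counts give the re-estimated probabilities.
    new_init = g[0][:]
    new_trans = [[sum(e[k][i][j] for k in range(T - 1)) /
                  sum(g[k][i] for k in range(T - 1))
                  for j in range(n)] for i in range(n)]
    new_emit = [[sum(g[k][i] for k in range(T) if obs[k] == x) /
                 sum(g[k][i] for k in range(T))
                 for x in range(len(emit[0]))] for i in range(n)]
    return new_init, new_trans, new_emit

# Hypothetical 2-state model over the alphabet {0, 1}.
trans = [[0.7, 0.3], [0.4, 0.6]]
emit  = [[0.9, 0.1], [0.2, 0.8]]
init  = [0.6, 0.4]
init2, trans2, emit2 = baum_welch_step([0, 1, 1, 0, 0], init, trans, emit)
# The updated rows are again probability distributions.
```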

This procedure is known as the **Baum-Welch** method, and is a special case of **Expectation-Maximization** (EM).

**References:**
Charles W. Therrien, Discrete Random Signals and Statistical Signal Processing, Prentice Hall, 1992.

L.R. Rabiner and B.H. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, January 1986.

Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.