next up previous
Next: Equivalent HMMs Up: Three Problems for HMMs Previous: Solving Problem 2: The Uncovering Problem

Solving Problem 3: The Training Problem

Sadly, there is no known analytic method for determining a maximum likelihood model adjustment. Therefore, iterative techniques such as Baum-Welch or gradient descent techniques must be used. We examine only the Baum-Welch method.

Training is particularly difficult since we are given only a set of observation sequences that the process actually produced. We are not given the associated state transitions that occurred. Given those actual state transitions, the model adjustment would be simpler. But without those hidden variables, we must make an educated guess as to the state transitions that occurred.

If we sum $ \gamma_{t}^{}$(i) over t, we get the expected number of times that state i is visited, or equivalently, the number of transitions made from state i, if we exclude the last time point. Thus, we get the following.

$ \sum^{T-1}_{t=1}$$ \gamma_{t}^{}$(i) =the expected number of transitions made from state i.

$ \sum^{T-1}_{t=1}$$ \xi_{t}^{}$(i, j) =the expected number of transitions made from state i to state j.

We now have tools for counting state transitions for model adjustment.

The Baum-Welch reestimation formulas are as follows.

  1. $ \widehat{\pi _{i}}$ = $ \gamma_{1}^{}$(i), for j $ \in$ Q.
  2. $ \widehat{a_{i,j}}$ = $ {\frac{\sum ^{T-1}_{t=1}\xi _{t}(i,j)}{\sum ^{T-1}_{t=1}\gamma _{t}(i)}}$
  3. $ \widehat{b_{i}(k)}$ = $ {\frac{\sum ^{T}_{t=1}\left[ \gamma _{t}(i)\times \left\{ \begin{array}{l}
1\,...
...\
0\, otherwise
\end{array}\right\} \right] }{\sum ^{T}_{t=1}\gamma _{t}(i)}}$ In the numerator, we only include those t such that (i.e., s.t.) Ot = k, where k is the observation being examined.
(Yet to write: Adjustments to a and b when there are no observations in certain states.)

When you have a set of L observation sequences, O1,..., OL, we perform a similar estimation but over all the sequences at once. I will add superscripts to all the appropriate symbols to refer to the difference observation sequences.

  1. $ \widehat{\pi _{i}}$ = $ {\frac{\sum _{l=1}^{L}\gamma ^{l}_{1}(i)}{L}}$
  2. $ \widehat{a_{i,j}}$ = $ {\frac{\sum _{l=1}^{L}\sum ^{T-1}_{t=1}\xi ^{l}_{t}(i,j)}{\sum _{l=1}^{L}\sum ^{T-1}_{t=1}\gamma ^{l}_{t}(i)}}$
  3. $ \widehat{b_{i}(k)}$ = $ {\frac{\sum _{l=1}^{L}\sum ^{T}_{t=1}\left[ \gamma ^{l}_{t}(i)\times \left\{ \...
...end{array}\right\} \right] }{\sum _{l=1}^{L}\sum ^{T}_{t=1}\gamma ^{l}_{t}(i)}}$
Here is the algorithm to train a model.

  1. Create an initial model $ \lambda_{0}^{}$. This can be done randomly, or by a mixture of knowledge of the process plus some randomness.
  2. Using the above, create $ \lambda_{1}^{}$ from $ \lambda_{0}^{}$. Then create $ \lambda_{2}^{}$ from $ \lambda_{1}^{}$.
  3. Repeat this until the process converges, or until you exhaust resources, yielding a final estimated model $ \lambda{^\prime}$.
The test for similarity between models is complex, and I know almost nothing about it. For now, consider using a distance metric. For example, you might measure the sum of the squares of the differences between the corresponding parameters (i.e., probabilities) of the two models. Or you might do this, with more (or less) weight on, say, the transition matrix.

(I stopped writing here.)


next up previous
Next: Equivalent HMMs Up: Three Problems for HMMs Previous: Solving Problem 2: The Uncovering Problem
Webmaster: Jim Schmolze, schmolze@eecs.tufts.edu.  [Get postscript version]