∫ Hidden Markov models: baseline algorithms


Kernel:

  • A review of baseline algorithms for hidden Markov model inference and learning.

In this blog post, I will review some of the baseline algorithms for hidden Markov model (HMM) inference and learning. There are tons of explanations and tutorials out there, but I find that most of them use inconsistent terminology and notation. For example, did you know that there are two Viterbi algorithms? And that the forward-backward algorithm is sometimes known as the Baum-Welch algorithm, while on Wikipedia these are two different algorithms? This is wild! Hopefully, I can make things clearer.

∂ Hidden Markov models

[{ "type": "node", "hidden_state": { "name": "X", "subscript": "1" }, "observation": { "name": "Y", "subscript": "1" } }, { "type": "connector" }, { "type": "node", "hidden_state": { "name": "X", "subscript": "t" }, "observation": { "name": "Y", "subscript": "t" } }, { "type": "node", "hidden_state": { "name": "X", "subscript": "t+1" }, "observation": { "name": "Y", "subscript": "t+1" } },{ "type": "connector" },{ "type": "node", "hidden_state": { "name": "X", "subscript": "T" }, "observation": { "name": "Y", "subscript": "T" } }]

Hidden Markov models are a class of probabilistic graphical models widely used in speech recognition, bioinformatics, and other fields. They model sequences of observations that are assumed to be generated by a sequence of hidden states. The hidden states are not observed; each observation depends only on the current state, and the next state depends only on the current state, hence the name Markov.

A hidden Markov model is defined by the following parameters:

  • The initial state distribution \( \Pr(X_1) \).
  • The transition probability \( \Pr(X_{t+1} | X_t) \).
  • The observation (emission) probability \( \Pr(Y_t | X_t) \).

Throughout this post, \( \Pr(X) \) is a short form of \( \Pr(X=x) \), the probability of \( X \) taking the value \( x \), whatever \( x \) is.

In general, there are four types of problems that we want to solve with hidden Markov models:

  1. Given HMM parameters, sample some observations.
  2. Given a sequence of observations and HMM parameters, what is the most likely sequence of hidden states that generated the observations? This is known as the decoding problem.
  3. Given a sequence of observations and HMM parameters, what is the probability of a sequence of observations? This is known as the evaluation problem.
  4. Given a sequence of observations, how can we learn the parameters of the model? This is known as the learning problem.

The sampling problem is relatively straightforward given the model's parameters, but the other problems are not. Fortunately, HMMs have been around for decades, and there are many algorithms to solve these problems:
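As a concrete starting point, here is a minimal sketch of the sampling problem in numpy. The names are my own conventions, not from this post: `pi` is the initial state distribution, `A[i, j] = Pr(X_{t+1}=j | X_t=i)` is the transition matrix, and `B[i, k] = Pr(Y_t=k | X_t=i)` is the observation matrix.

```python
import numpy as np

def sample_hmm(pi, A, B, T, seed=None):
    """Draw a hidden state sequence X_{1:T} and observations Y_{1:T}."""
    rng = np.random.default_rng(seed)
    states, obs = [], []
    x = rng.choice(len(pi), p=pi)                        # X_1 ~ Pr(X_1)
    for _ in range(T):
        states.append(int(x))
        obs.append(int(rng.choice(B.shape[1], p=B[x])))  # Y_t ~ Pr(Y_t | X_t)
        x = rng.choice(len(pi), p=A[x])                  # X_{t+1} ~ Pr(X_{t+1} | X_t)
    return states, obs
```

Only the hidden state sequence carries state from step to step; each observation is drawn independently given the current state.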

∂ Forward algorithm

The forward algorithm is used to solve the evaluation problem for the latest hidden state given the sequence of observations so far:

$$ \begin{align*} \arg\max_{X_t} \, \Pr(X_t | Y_{1:t}) \end{align*} $$

One nice thing about inference algorithms in HMM is that they can usually be arranged in a recursive form. The recursive form of the forward algorithm is as follows:

$$ \begin{align*} f(X_t) &= \Pr(X_t , Y_{1:t}) \\ &= \Pr(X_t, Y_t, Y_{1:t-1}) \\ &= \Pr(Y_t| X_t, Y_{1:t-1}) \Pr(X_t, Y_{1:t-1}) \\ &= \Pr(Y_t| X_t) \Pr(X_t, Y_{1:t-1}) \\ &= \Pr(Y_t | X_t) \sum_{X_{t-1}} \Pr(X_t, X_{t-1}, Y_{1:t-1}) \\ &= \Pr(Y_t | X_t) \sum_{X_{t-1}} \Pr(X_t | X_{t-1}, Y_{1:t-1}) \Pr(X_{t-1}, Y_{1:t-1}) \\ &= \Pr(Y_t | X_t) \sum_{X_{t-1}} \Pr(X_t | X_{t-1}) \Pr(X_{t-1}, Y_{1:t-1}) \\ &= \Pr(Y_t | X_t) \sum_{X_{t-1}} \Pr(X_t | X_{t-1}) f(X_{t-1}) \tag{1} \\ \end{align*} $$
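The recursion (1) translates directly into code. A minimal numpy sketch, under hypothetical conventions of my own (`pi` is the initial distribution, `A[i, j] = Pr(X_{t+1}=j | X_t=i)`, `B[i, k] = Pr(Y_t=k | X_t=i)`):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward table f[t, i] = Pr(X_t = i, Y_{1:t}), per equation (1)."""
    T, S = len(obs), len(pi)
    f = np.zeros((T, S))
    f[0] = pi * B[:, obs[0]]                  # base case: Pr(X_1, Y_1)
    for t in range(1, T):
        # f(X_t) = Pr(Y_t | X_t) * sum_{X_{t-1}} Pr(X_t | X_{t-1}) f(X_{t-1})
        f[t] = B[:, obs[t]] * (f[t - 1] @ A)
    return f
```

Note that `f[-1].sum()` gives the likelihood \( \Pr(Y_{1:T}) \), since \( \sum_{X_T} \Pr(X_T, Y_{1:T}) = \Pr(Y_{1:T}) \).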

Then in order to solve the evaluation problem, we can use the following equation:

$$ \begin{align*} \max_{X_t} \, \Pr(X_t | Y_{1:t}) &= \max_{X_t} \, \frac{\Pr(X_t, Y_{1:t})}{\Pr(Y_{1:t})} \\ &= \eta \max_{X_t} \Pr(X_t, Y_{1:t}) \\ &= \eta \max_{X_t} f(X_t) \\ \end{align*} $$
\( \eta \) is a normalizing constant that makes the probabilities sum to 1. When computing the maximum, you can safely ignore it.

Because of this property, the forward algorithm can be used to solve evaluation problems where we want to know the most likely state at the current time given the sequence of observations so far. We often find uses for this algorithm in realtime inference tasks:

  • Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
  • Welch, G., & Bishop, G. (1995). An introduction to the Kalman filter.
  • Djuric, P. M., Kotecha, J. H., Zhang, J., Huang, Y., Ghirmai, T., Bugallo, M. F., & Miguez, J. (2003). Particle filtering. IEEE Signal Processing Magazine, 20(5), 19-38.

∂ Forward-backward algorithm

What if you want to evaluate the hidden state at any time given the full sequence of observations, with the benefit of hindsight? This is where the forward-backward algorithm comes in.

$$ \begin{align*} \Pr(X_t | Y_{1:T}) &= \frac{\Pr(Y_{t+1:T} | X_t) \Pr(X_t | Y_{1:t})}{\Pr(Y_{t+1:T} | Y_{1:t})} \\ \end{align*} $$

The forward-backward algorithm consists of two passes: the forward pass and the backward pass. The forward pass is the same as the forward algorithm. Let me show you how the backward pass is computed:

$$ \begin{align*} b(X_t) &= \Pr(Y_{t+1:T}| X_t) \\ &= \sum_{X_{t+1}} \Pr(Y_{t+1:T}, X_{t+1}| X_t) \\ &= \sum_{X_{t+1}} \Pr(Y_{t+2:T}, X_{t+1}, Y_{t+1}| X_t) \\ &= \sum_{X_{t+1}} \Pr(Y_{t+2:T}| X_{t+1}, Y_{t+1}, X_t) \Pr(X_{t+1}, Y_{t+1}| X_t) \\ &= \sum_{X_{t+1}} \Pr(Y_{t+2:T}| X_{t+1}) \Pr(X_{t+1}, Y_{t+1}| X_t) \\ &= \sum_{X_{t+1}} b(X_{t+1}) \Pr(X_{t+1}, Y_{t+1}| X_t) \\ &= \sum_{X_{t+1}} b(X_{t+1}) \Pr(Y_{t+1}| X_{t+1}, X_t) \Pr(X_{t+1}| X_t) \\ &= \sum_{X_{t+1}} b(X_{t+1}) \Pr(Y_{t+1}| X_{t+1}) \Pr(X_{t+1}| X_t) \tag{3} \\ \end{align*} $$
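Equation (3) as code. A minimal numpy sketch under the same hypothetical conventions as before (`A[i, j] = Pr(X_{t+1}=j | X_t=i)`, `B[i, k] = Pr(Y_t=k | X_t=i)`):

```python
import numpy as np

def backward(A, B, obs):
    """Backward table b[t, i] = Pr(Y_{t+1:T} | X_t = i), per equation (3)."""
    T, S = len(obs), A.shape[0]
    b = np.ones((T, S))                       # base case: b(X_T) = 1
    for t in range(T - 2, -1, -1):
        # b(X_t) = sum_{X_{t+1}} b(X_{t+1}) Pr(Y_{t+1}|X_{t+1}) Pr(X_{t+1}|X_t)
        b[t] = A @ (B[:, obs[t + 1]] * b[t + 1])
    return b
```

A useful sanity check: \( \sum_{X_1} \Pr(X_1) \Pr(Y_1 | X_1) b(X_1) = \Pr(Y_{1:T}) \), the same likelihood the forward pass produces.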

Now we combine the forward and backward passes to evaluate the hidden state at any time given the full sequence of observations.

$$ \begin{align*} \max_{X_t} \, \Pr( X_t| Y_{1:T}) &= \max_{X_t} \, \frac{\Pr(Y_{t+1:T} | X_t) \Pr(X_t | Y_{1:t})}{\Pr(Y_{t+1:T} | Y_{1:t})} \\ &= \mu \max_{X_t} \, \Pr(Y_{t+1:T} | X_t) \Pr(X_t | Y_{1:t}) \\ &= \mu \max_{X_t} \, \Pr(Y_{t+1:T} | X_t) \eta \Pr(X_t, Y_{1:t}) \\ &= \mu \eta \max_{X_t} \, b(X_t) f(X_t) \\ \end{align*} $$
\( \mu \) and \( \eta \) are normalizing factors as before.

If you want to compute the probability, you can use the following equation:

$$ \begin{align*} \Pr( X_t | Y_{1:T}) &= \frac{\Pr(Y_{t+1:T} | X_t) \Pr(X_t | Y_{1:t})}{\Pr(Y_{t+1:T} | Y_{1:t})} \\ &= \frac{\Pr(Y_{t+1:T} | X_t) \Pr(X_t, Y_{1:t})}{\Pr(Y_{t+1:T}, Y_{1:t})} \\ &= \frac{\Pr(Y_{t+1:T} | X_t) \Pr(X_t, Y_{1:t})}{\sum_{X_t} \Pr(Y_{t+1:T}, X_t, Y_{1:t})} \\ &= \frac{\Pr(Y_{t+1:T} | X_t) \Pr(X_t, Y_{1:t})}{\sum_{X_t} \Pr(Y_{t+1:T} | X_t)\Pr(X_t, Y_{1:t})} \\ &= \frac {b(X_t) f(X_t)}{\sum_{X_t} b(X_t) f(X_t)} \\ \end{align*} $$
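Putting the two passes together gives the smoothed posterior \( \Pr(X_t | Y_{1:T}) \) for every \( t \). A minimal, self-contained numpy sketch (same hypothetical `pi`, `A`, `B` conventions as before):

```python
import numpy as np

def smoothing(pi, A, B, obs):
    """Posterior Pr(X_t | Y_{1:T}) = f(X_t) b(X_t) / sum_x f(x) b(x)."""
    T, S = len(obs), len(pi)
    f = np.zeros((T, S))
    b = np.ones((T, S))
    f[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        f[t] = B[:, obs[t]] * (f[t - 1] @ A)          # forward pass, eq. (1)
    for t in range(T - 2, -1, -1):
        b[t] = A @ (B[:, obs[t + 1]] * b[t + 1])      # backward pass, eq. (3)
    post = f * b
    return post / post.sum(axis=1, keepdims=True)     # normalize per time step
```

Each row of the result sums to 1, and the last row coincides with the filtered estimate from the forward algorithm, since \( b(X_T) = 1 \).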

∂ Viterbi algorithm

Next stop is the Viterbi algorithm. We use it to solve the decoding problem: finding the most likely sequence of hidden states given the sequence of observations.

$$ \begin{align*} \arg\max_{X_{1:t}} \, \Pr( X_{1:t}| Y_{1:t}) \end{align*} $$

Same as the forward algorithm, the Viterbi algorithm can be arranged in a recursive form:

$$ \begin{align*} v(X_t) &= \max_{X_{1:t-1}} \, \Pr(X_{1:t}, Y_{1:t}) \\ &= \max_{X_{1:t-1}} \, \Pr( X_t, Y_t | X_{1:t-1}, Y_{1:t-1}) \Pr( X_{1:t-1}, Y_{1:t-1}) \\ &= \max_{X_{1:t-1}} \, \Pr( Y_t | X_t) \Pr( X_t | X_{1:t-1}, Y_{1:t-1}) \Pr( X_{1:t-1}, Y_{1:t-1}) \\ &= \max_{X_{1:t-1}} \, \Pr( Y_t | X_t) \Pr( X_t | X_{t-1}) \Pr( X_{1:t-1}, Y_{1:t-1}) \\ &= \Pr( Y_t | X_t) \max_{X_{1:t-1}} \, \Pr( X_t | X_{t-1}) \Pr( X_{1:t-1}, Y_{1:t-1}) \\ &= \Pr( Y_t | X_t) \max_{X_{t-1}} \, \Pr( X_t | X_{t-1}) \max_{X_{1:t-2}} \, \Pr( X_{1:t-1}, Y_{1:t-1}) \\ &= \Pr( Y_t | X_t) \max_{X_{t-1}} \, \Pr( X_t | X_{t-1}) v(X_{t-1}) \tag{2} \\ \end{align*} $$

Then in order to solve the decoding problem, we can use the following equation:

$$ \begin{align*} \max_{X_{1:t}} \, \Pr( X_{1:t}| Y_{1:t}) &= \max_{X_{1:t}} \, \frac{\Pr( X_{1:t}, Y_{1:t})}{\Pr(Y_{1:t})} \\ &= \eta \max_{X_{1:t}} \Pr( X_{1:t}, Y_{1:t}) \\ &= \eta \max_{X_t} \, \max_{X_{1:t-1}} \, \Pr(X_{1:t}, Y_{1:t}) \\ &= \eta \max_{X_t} \, v(X_t) \end{align*} $$
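Recursion (2) plus backpointers gives the usual implementation. A minimal numpy sketch (hypothetical `pi`, `A`, `B` conventions as in the earlier sketches):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence argmax Pr(X_{1:T} | Y_{1:T}), per equation (2)."""
    T, S = len(obs), len(pi)
    v = np.zeros((T, S))
    ptr = np.zeros((T, S), dtype=int)         # best predecessor of each state
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # v(X_t) = Pr(Y_t | X_t) * max_{X_{t-1}} Pr(X_t | X_{t-1}) v(X_{t-1})
        scores = v[t - 1][:, None] * A        # scores[i, j] over predecessor i
        ptr[t] = scores.argmax(axis=0)
        v[t] = B[:, obs[t]] * scores.max(axis=0)
    # backtrack the argmax path from the last step
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[t][path[-1]]))
    return path[::-1]
```

The table `v` mirrors the forward table `f` exactly, with the sum over \( X_{t-1} \) replaced by a max; the backpointers are what turn the max value into an actual sequence.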

You may notice that the Viterbi algorithm is very similar to the forward algorithm, but instead of summing over all possible previous states, we take the maximum over them. The similarity is not a coincidence. It comes from the way we choose to arrange the probability terms, pulling out the time \( t \) terms first, just as we did in the forward algorithm. (If you want to challenge yourself, try to solve the decoding problem backward and see what you come up with.)

The Viterbi algorithm is a special case of the max-sum algorithm for graphical models (Koller, D., & Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT Press).

∂ Viterbi training algorithm

Do not confuse this with the Viterbi decoding algorithm. The Viterbi training algorithm is a training algorithm for hidden Markov models, where the goal is to learn the parameters of the model given sequences of observations.

$$ \begin{align*} \arg\max_{\theta} \, \Pr(Y_{1:T} | \theta) \end{align*} $$
\( \theta \) is the set of parameters of the model, which includes the transition probability and the observation probability.

The Viterbi training algorithm uses the Viterbi decoding algorithm to find the most likely sequence of states, and then uses it to update the parameters of the model. It's a special case of the Expectation-Maximization (EM) algorithm (Bishop, C. (2006). Pattern recognition and machine learning. Springer), which has to be run iteratively until the parameters converge.

  1. Compute the most likely sequence of states using the Viterbi algorithm.
  2. Update the observation parameters
    $$ \begin{align*} \Pr(Y=y | X=x) = \frac{\text{frequency of observing both} \, y \, \text{and} \, x \, \text{from all steps}}{\text{frequency of} \, x \, \text{from all steps in the sequence}} \end{align*} $$
  3. Update the transition parameters
    $$ \begin{align*} \forall t \, \Pr(X_{t+1 }= x' | X_{t} = x) = \frac{\text{frequency of} \, x \, \text{then} \, x' \, \text{from all steps}}{\text{frequency of} \, x \, \text{from all steps in the sequence}} \end{align*} $$
  4. Repeat until the parameters converge.
Note that both the observation and the transition parameters are shared across all time steps.
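The counting updates in steps 2 and 3 can be sketched as follows. This is a toy sketch of my own: `path` is the decoded state sequence from step 1, `S` and `K` are the numbers of states and observation symbols, and it assumes every state actually appears in the path (otherwise the normalization divides by zero).

```python
import numpy as np

def count_updates(path, obs, S, K):
    """Re-estimate parameters from a decoded state path (steps 2-3)."""
    B = np.zeros((S, K))
    A = np.zeros((S, S))
    for x, y in zip(path, obs):
        B[x, y] += 1                          # count (x, y) co-occurrences
    for x, x_next in zip(path[:-1], path[1:]):
        A[x, x_next] += 1                     # count x -> x' transitions
    B /= B.sum(axis=1, keepdims=True)         # divide by counts of x
    A /= A.sum(axis=1, keepdims=True)
    return A, B
```

Because the parameters are shared across time steps, a single pass over the decoded path accumulates all the counts.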

∂ Baum-Welch algorithm

This is another training algorithm for hidden Markov models. It uses the forward-backward algorithm to update the parameters of the model instead of relying on statistical counting. In practice, this is more stable than the Viterbi training algorithm (Lam, T. Y., & Meyer, I. M. (2010). Efficient algorithms for training the parameters of hidden Markov models using stochastic expectation maximization (EM) training and Viterbi training. Algorithms for Molecular Biology, 5, 1-16). The observation parameters are updated as:

$$ \begin{align*} \Pr(Y | X) &= \frac{\sum_t \Pr(Y_t, X_t)}{\sum_t \Pr(X_t)} \\ &= \frac{\sum_t \sum_{Y_{1:T} \setminus Y_{t}} \Pr(X_t, Y_{1:T})}{\sum_t \sum_{Y_{1:T}} \Pr(X_t, Y_{1:T})} \\ &= \frac{\sum_t \Pr(Y_t) \sum_{Y_{1:T} \setminus Y_{t}} \Pr(Y_{1:T} \setminus Y_{t} | Y_{t}) \Pr(X_t | Y_{1:T})}{\sum_t \sum_{Y_{1:T}}\Pr(Y_{1:T}) \Pr(X_t| Y_{1:T})} \\ \end{align*} $$
  • \( Y_{1:T} \setminus Y_{t} \) means the sequence of all observations except \( Y_t \).
  • If you see the form \( \sum_{A} \Pr(A) K \), it means that we are computing the expectation of \( K \) under the distribution of \( A \). If \( A = Y_{1:T} \), it's the distribution of the observed sequences.
$$ \begin{align*} \forall t \, \Pr(X_{t+1} | X_{t}) &= \frac{\sum_t \Pr(X_{t+1}, X_t)}{\sum_t \Pr(X_t)} \\ &= \frac{\sum_t \sum_{Y_{1:T}} \Pr(X_{t+1}, X_t, Y_{1:T})}{\sum_t \Pr(X_t)} \end{align*} $$
The term \( \Pr(X_{t+1}, X_t, Y_{1:T}) \), conventionally denoted \( \xi \), can be expanded as follows:
$$ \begin{align*} \Pr(X_{t+1}, X_t, Y_{1:T}) &= \Pr(X_{t+1}, Y_{1:T}| X_t) \Pr(X_t) \\ &= \Pr(X_{t+1}, Y_{t+1:T}, Y_{1:t}| X_t) \Pr(X_t) \\ &= \Pr(X_{t+1}, Y_{t+1:T} | X_t) \Pr(Y_{1:t} | X_t) \Pr(X_t) \\ &= \Pr(X_{t+1}, Y_{t+1:T} | X_t) f(X_t) \\ &= \Pr(Y_{t+1}, X_{t+1}, Y_{t+2:T} | X_t) f(X_t) \\ &= \Pr(Y_{t+1} | X_{t+1}) \Pr(X_{t+1}, Y_{t+2:T} | X_t) f(X_t) \\ &= \Pr(Y_{t+1} | X_{t+1}) \Pr(Y_{t+2:T} | X_{t+1}) \Pr(X_{t+1} | X_t) f(X_t) \\ &= {\color{Rhodamine}\Pr(Y_{t+1} | X_{t+1})} b(X_{t+1}) {\color{Cerulean}\Pr(X_{t+1} | X_t)} f(X_t) \tag{4} \\ \end{align*} $$
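Given the forward and backward tables, equation (4) becomes one outer-product-like expression per time step. A minimal numpy sketch (hypothetical names of my own: `f` and `b` are the forward/backward tables, `A` and `B` the transition/observation matrices):

```python
import numpy as np

def xi_at(t, f, b, A, B, obs):
    """xi[i, j] = Pr(X_t = i, X_{t+1} = j | Y_{1:T}), via equation (4):
    Pr(Y_{t+1}|X_{t+1}) b(X_{t+1}) Pr(X_{t+1}|X_t) f(X_t), then normalized."""
    xi = f[t][:, None] * A * (B[:, obs[t + 1]] * b[t + 1])[None, :]
    return xi / xi.sum()
```

Marginalizing \( \xi_t \) over \( X_{t+1} \) recovers the smoothed posterior \( \Pr(X_t | Y_{1:T}) \), which is a handy consistency check.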

This is the original Baum-Welch algorithm. Notice that equation (4) reuses the current parameters of the model to compute the expected statistics. And we run this iteratively until the parameters converge.

One might wonder: can't we simplify the equation like this?

$$ \begin{align*} \forall t \, \Pr(X_{t+1}| X_{t}) &= \frac{\sum_t \Pr(X_{t+1}, X_t)}{\sum_t \Pr(X_t)} \\ &= \frac{\sum_t \Pr(X_{t+1}| X_t) \sum_{Y_{1:T}} \Pr(X_t, Y_{1:T})}{\sum_t \Pr(X_t)} \\ &= \frac{\sum_t \Pr(X_{t+1}| X_t) \sum_{Y_{1:T}} \Pr(Y_{1:T}) \Pr(X_t| Y_{1:T})}{\sum_t \Pr(X_t)} \end{align*} $$

This simplified version uses only the forward-backward algorithm at each time \(t\), while the original Baum-Welch algorithm uses a pincer estimation of both \(X_{t+1}\) from the backward pass (3) and \(X_t\) from the forward pass (1), bridged by the current transition and observation probabilities (4).

However, this derivation will not work, because it is the exact equation! The principle of Expectation-Maximization is to first compute the expectation of the estimated statistics given the current parameters. The estimated statistics in this case are 1) \( \Pr(Y_{t} | X_{t}) \) and 2) \(\Pr(X_{t+1}| X_t, Y_{1:T})\), which are computed inside the Baum-Welch algorithm but not in the simplified version.

∂ Conclusion

To recap, we have reviewed some of the baseline algorithms for hidden Markov model inference and learning.

Algorithm | Problem | Objective
Forward | Evaluate the most likely state at the current time given the sequence of observations so far. | \( \arg\max_{X_t} \, \Pr(X_t | Y_{1:t}) \)
Forward-backward | Evaluate the most likely state at any time given the full sequence of observations. | \( \arg\max_{X_t} \, \Pr(X_t | Y_{1:T}) \)
Viterbi | Decode the most likely sequence of hidden states given the sequence of observations. | \( \arg\max_{X_{1:t}} \, \Pr( X_{1:t}| Y_{1:t}) \)
Viterbi training | Learn the parameters of the model given sequences of observations. | \( \arg\max_{\theta} \, \Pr(Y_{1:T} | \theta) \)
Baum-Welch | Learn the parameters of the model given sequences of observations. | \( \arg\max_{\theta} \, \Pr(Y_{1:T} | \theta) \)

That is it! These are important algorithms for hidden Markov models that we should know!