HW4: Hidden Markov Models


Last modified: 2020-04-13 09:38

Due date: Wed Apr 15, 2020 at 11:59pm ET

Status: RELEASED

How to turn in: Submit PDF to https://www.gradescope.com/courses/82862/assignments/449882/

Jump to: Problem 1   Problem 2   Problem 3

Questions?: Post to the hw4 topic on the Piazza discussion forums.

Instructions for Preparing your PDF Report

What to turn in: PDF of typeset answers via LaTeX. No handwritten solutions will be accepted, so that grading can be speedy and you get prompt feedback.

Please use the official LaTeX Template: https://raw.githubusercontent.com/tufts-ml-courses/comp136-spr-20s-assignments/master/hw4/hw4_template.tex

Your PDF should include (in order):

  • Your full name
  • About how many hours did you spend (debugging your approach, asking for help, writing it up)?
  • Collaboration statement
  • Problem 1 answer
  • Problem 2 answer
  • Problem 3 answer

For each problem:

Throughout this homework, we are practicing the skills necessary to derive and analyze mathematical expressions involving probability.

Each step of a mathematical derivation that you turn in should be justified by an accompanying short verbal phrase (e.g. "using Bayes rule").

Solutions that lack justifications or skip key steps without showing work will receive poor marks.

Problem 1: Independence Assumptions for HMMs

Background: Assume we have \(T\) timesteps, with each timestep indexed by \(t \in \{1, 2, \ldots T\}\). Consider a Hidden Markov Model with \(K\) discrete states.

This HMM defines a distribution over two sequences of random variables:

  • a discrete state sequence \(z_{1:T} = [z_1, z_2, \ldots z_T]\), where each value is an integer indicator \(z_t \in \{1, 2, \ldots K\}\)
  • an observed data sequence \(x_{1:T} = [x_1, x_2, \ldots x_T]\), where each value is a measured feature \(x_t\) (if you like, this is a scalar real)

In order to specify the joint distribution over \(z_{1:T}, x_{1:T}\), the HMM makes two key conditional independence assumptions:

\begin{align} \text{HMM assumption A:} \quad & p(z_{t+1} | z_t, z_{t-1}, ... z_1) = p(z_{t+1} | z_{t}) \qquad t \in 1, 2, \ldots T-1 \\ \text{HMM assumption B:} \quad & p(x_t | z_t, z_{1:t-1}, z_{t+1:T}, x_{1:t-1}, x_{t+1:T}) = p(x_t | z_t), \qquad t \in 1, 2, \ldots T \end{align}

In words, the first assumption (A) says that, given the state at time \(t\), the state at time \(t+1\) is conditionally independent of all states at earlier times. This is the first-order Markov assumption.

The second assumption (B) says that given the hidden state at time \(t\), the observation at time \(t\) is conditionally independent of all other variables in the model.
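
Taken together, assumptions A and B imply the standard HMM factorization of the joint distribution (the same factorization written out concretely for a specific emission model in Problem 2 below; it is stated here only for intuition, and is not itself one of the transformations you may cite in your proofs):

\begin{align} p(z_{1:T}, x_{1:T}) = p(z_1) \prod_{t=2}^T p(z_t | z_{t-1}) \prod_{t=1}^T p(x_t | z_t) \end{align}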

1a: Prove the following property for all timesteps \(t \geq 1\). Remember to provide a short verbal justification for every step.

\begin{align} p(z_{t+1} | x_{t}, z_{t}) = p(z_{t+1} | z_{t} ) \end{align}

You can only use the following transformations: sum rule, product rule, Bayes rule, property A above, and property B above.

1b: Prove the following property for all timesteps \(t \geq 1\). Remember to provide a short verbal justification for every step.

\begin{align} p(x_{t+1} | x_{1:t}, z_{1:t}) = p(x_{t+1} | z_t) \end{align}

You can only use the following transformations: sum rule, product rule, Bayes rule, property A above, and property B above.

Problem 2: Understanding EM for HMMs with binary observations

Suppose we have a sequence \(x_{1:T} = x_1, x_2, \ldots x_T\) of \(T\) binary vectors, where \(x_t\) is a \(D\)-length binary vector \(x_t = [x_{t1}, x_{t2}, \ldots x_{td} \ldots x_{tD}]\). At each timestep \(t\) and feature dimension \(d\), you have a scalar binary value: \(x_{td} \in \{0, 1\}\).

You wish to model this sequence using a hidden Markov model with \(K\) states. This HMM has parameters \(\theta = \{\pi, A, \phi\}\), where \(\pi\) is a \(K\)-length vector that sums to one, \(A\) is a \(K \times K\) matrix whose rows each sum to one, and \(\phi\) is a \(K \times D\) matrix of Bernoulli parameters (defined below). Your model assumes the following joint distribution:

\begin{align} p(z_{1:T}, x_{1:T} | \theta) &= p(z_{1:T} | \pi, A) p(x_{1:T} | z_{1:T}, \phi) \end{align}

Probabilistic model

The \(z_{1:T}\) are generated by a Markov model:

\begin{align} p(z_{1:T} | \pi, A) &= \text{CatPMF}(z_1 | \pi) \cdot \prod_{t=2}^T \text{CatPMF}( z_t | A_{z_{t-1}} ) \\ &= \prod_{k=1}^K \pi_k^{\delta(z_1, k)} \cdot \prod_{t=2}^T \prod_{j=1}^K \prod_{k=1}^K A_{jk}^{\delta(z_{t-1}, j) \delta(z_{t}, k)} \end{align}

Here, we'll use the notation \(\delta(a, b)\) to be a binary indicator that is 1 if \(a == b\) is true, and 0 otherwise.

Remember that the vector \([ \delta(z_t, 1) ~ \delta(z_t, 2) ~ \ldots \delta(z_t, K) ]\) is one hot, meaning exactly one of the \(K\) entries will be 1.

Given \(z_{1:T}\), each \(x_t\) is drawn independently from a multivariate Bernoulli whose parameters are selected by that timestep's assigned state \(z_{t}\):

\begin{align} p(x_{1:T} | z_{1:T}, \phi) &= \prod_{t=1}^T \prod_{d=1}^D \prod_{k=1}^K \text{BernPMF}( x_{td} | \phi_{kd} )^{\delta(z_t, k)} \end{align}

Here, the parameter \(\phi_{kd}\) is the probability that the binary value of dimension \(d\) of vector \(x_t\) will be "on" (equal to 1) when timestep \(t\) is assigned to state \(k\). The value of \(\phi_{kd}\) must be a valid Bernoulli parameter: \(0 \leq \phi_{kd} \leq 1\).
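
To make the generative process concrete, here is a minimal sketch of how one could sample from this model (assuming numpy; the function name sample_hmm_binary and its interface are illustrative only, not part of the assignment):

    import numpy as np

    def sample_hmm_binary(pi, A, phi, T, rng=None):
        """Draw one (z, x) sequence from the HMM with Bernoulli emissions above.

        pi  : (K,)   initial state distribution (sums to one)
        A   : (K, K) transition matrix (each row sums to one)
        phi : (K, D) Bernoulli parameters, phi[k, d] = p(x_td = 1 | z_t = k)
        """
        rng = np.random.default_rng() if rng is None else rng
        K, D = phi.shape
        z = np.zeros(T, dtype=int)
        x = np.zeros((T, D), dtype=int)
        # First state is drawn from pi; each later state is drawn from the
        # row of A indexed by the previous state.
        z[0] = rng.choice(K, p=pi)
        for t in range(1, T):
            z[t] = rng.choice(K, p=A[z[t - 1]])
        # Each observation is a D-length binary vector whose d-th entry is
        # Bernoulli with parameter phi[z[t], d].
        for t in range(T):
            x[t] = rng.binomial(1, phi[z[t]])
        return z, x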

Approximate posterior

We define an "approximate posterior" distribution \(q(z_{1:T} | s)\) over our state sequence, with learnable parameters \(s\). The parameters \(s = \{s_t \}_{t=1}^{T-1}\) specify a joint distribution over each adjacent pair \(z_{t}, z_{t+1}\), and must satisfy the constraint that the marginals implied by neighboring pairs agree.

For full details, see the notes from day 18: https://www.cs.tufts.edu/comp/136/2020s/notes/day18.pdf#page=4

But for this problem, you just need to be able to use \(s\) to evaluate the following expectations:

\begin{align} \mathbb{E}_{q(z_{1:T}|s)}[ \delta(z_t, k) ] &= r_{tk}(s), \quad r_{tk} = \sum_{\ell=1}^K s_{tk\ell} = \sum_{j=1}^K s_{(t-1)jk} \\ \mathbb{E}_{q(z_{1:T}|s)}[ \delta(z_{t}, j)\delta(z_{t+1}, k) ] &= s_{tjk} \end{align}

where again, the notation \(\delta(a, b)\) is a binary indicator that is 1 if \(a == b\) is true, and 0 otherwise.
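
As a concrete illustration, here is a minimal sketch of how the per-timestep marginals \(r\) could be recovered from \(s\) (assuming numpy and 0-based timestep indexing, with an array s of shape (T-1, K, K) where s[t, j, k] holds the pairwise expectation for the adjacent pair of timesteps t and t+1; the function name is illustrative only):

    import numpy as np

    def unary_marginals_from_pairwise(s):
        """Recover r (shape (T, K)) from pairwise expectations s (shape (T-1, K, K))."""
        T_minus_1, K, _ = s.shape
        r = np.zeros((T_minus_1 + 1, K))
        # For timesteps 0, ..., T-2: marginalize out the later state of the pair (t, t+1).
        r[:-1] = s.sum(axis=2)
        # For the final timestep: marginalize out the earlier state of the pair (T-2, T-1).
        r[-1] = s[-1].sum(axis=0)
        return r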

Problems

2a: Expected log likelihood Write out an expression for the expected complete log likelihood:

\begin{align} \mathbb{E}_{q(z_{1:T} |s)} \left[ \log p(z_{1:T}, x_{1:T} | \theta) \right] \end{align}

Use the HMM probabilistic model \(p(z_{1:T}, x_{1:T} | \theta)\) and the approximate posterior \(q(z_{1:T} | s)\) defined above.

Your answer should be a function of the data \(x\), the local sequence parameters \(s\) and \(r(s)\), as well as the HMM parameters \(\pi, A, \phi\).

2b: M-step for data-per-state parameters Using your objective function from 2a above, show that the optimal M-step update to the Bernoulli parameters \(\phi_{kd}\) is given by:

\begin{align} \phi_{kd} = \frac{ \sum_{t=1}^T r_{tk} x_{td} }{ \sum_{t=1}^T r_{tk} } \end{align}
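
Although 2b asks for a derivation, the resulting update is simple to compute numerically; here is a minimal sketch (assuming numpy arrays r of shape (T, K) and x of shape (T, D); the function name is illustrative only):

    import numpy as np

    def m_step_phi(r, x):
        """M-step update for the Bernoulli parameters: returns phi of shape (K, D)."""
        # Numerator: sum over t of r[t, k] * x[t, d].
        numer = r.T @ x
        # Denominator: sum over t of r[t, k], broadcast across dimensions d.
        denom = r.sum(axis=0)[:, None]
        return numer / denom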

2c: Explaining the M-step for data-per-state parameters Provide a short verbal summary of the update for \(\phi_{kd}\). How should we interpret the numerator? The denominator?

Problem 3: Stationary distributions

Consider a Markov model with \(K=3\) states, and the following initial probability vector and transition probability matrix:

\begin{align} \pi &= \left[ {\begin{array}{ccc} 0.25 & 0.50 & 0.25 \\ \end{array} } \right] \\ A &= \left[ {\begin{array}{ccc} 0.97 & 0.01 & 0.02 \\ 0.10 & 0.80 & 0.10 \\ 0.05 & 0.04 & 0.91 \\ \end{array} } \right] \end{align}

3a: Stationary distribution:

What is the stationary distribution of the Markov chain? In other words, if we sample a sequence \(z_1, z_2, \ldots z_T\), for large \(T \gg 0\), what is the marginal probability \(p(z_T)\)?

Provide the exact distribution (numerical values of its parameters to 3 decimal places).

Hint: See this notebook https://github.com/tufts-ml-courses/comp136-spr-20s-assignments/blob/master/notebooks/MarkovChainsAndStationaryDistributions.ipynb
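
If you want to sanity-check your calculation numerically, here is a minimal sketch of one standard approach, power iteration, applied to an illustrative 2-state chain (deliberately not the 3-state matrix above; assumes numpy):

    import numpy as np

    # Illustrative 2-state transition matrix (each row sums to one).
    A_toy = np.array([
        [0.9, 0.1],
        [0.3, 0.7],
    ])

    # Power iteration: propagate any valid initial distribution forward many steps.
    marginal = np.array([0.5, 0.5])
    for _ in range(1000):
        marginal = marginal @ A_toy

    # The result approximates the stationary distribution: a vector v with v @ A_toy = v.
    print(np.round(marginal, 3))   # -> [0.75 0.25]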