HW3: Mixture Models and Expectation-Maximization


Last modified: 2020-03-30 18:08

Due date: Wed Apr 1, 2020 at 11:59pm ET

Status: RELEASED

How to turn in: Submit PDF to https://www.gradescope.com/courses/82862/assignments/407358/

Jump to: Problem 1   Problem 2   Problem 3

Questions?: Post to the hw3 topic on the Piazza discussion forums.

Instructions for Preparing your PDF Report

What to turn in: PDF of typeset answers via LaTeX. No handwritten solutions will be accepted, so that grading can be speedy and you get prompt feedback.

Please use the official LaTeX Template: https://raw.githubusercontent.com/tufts-ml-courses/comp136-spr-20s-assignments/master/hw3/hw3_template.tex

Your PDF should include (in order):

  • Your full name
  • About how many hours did you spend (debugging your approach, asking for help, writing it up)?
  • Collaboration statement
  • Problem 1 answer
  • Problem 2 answer
  • Problem 3 answer

For each problem:

Throughout this homework, we are practicing the skills necessary to derive and analyze mathematical expressions involving probability.

Each step of a mathematical derivation that you turn in should be justified by an accompanying short verbal phrase (e.g. "using Bayes rule").

Solutions that lack justifications or skip key steps without showing work will receive poor marks.

Problem 1: Means and Covariances under Mixture Models

Background: Consider a continuous random variable \(x\) which is a vector in \(D\)-dimensional space: \(x \in \mathbb{R}^D\).

We assume that \(x\) follows a mixture distribution with PDF \(p^{\text{mix}}\), using \(K\) components indexed by integer \(k\):

\begin{align} p^{\text{mix}}(x | \pi, \mu, \Sigma) = \sum_{k=1}^K \pi_k f_k(x | \mu_k, \Sigma_k) \end{align}

The \(k\)-th component has a mixture "weight" probability of \(\pi_k\). Across all \(K\) components, we have a parameter \(\pi = [ \pi_1 ~ \pi_2 ~ \ldots ~ \pi_K]\), whose entries are non-negative and sum to one.

The \(k\)-th component has a specific data-generating PDF \(f_k\). We don't know the form of this PDF (it could be Gaussian, or something else), and the distribution could be different for every \(k\). However, we do know that this PDF \(f_k\) takes two parameters: a vector \(\mu_k \in \mathbb{R}^D\) and a \(D \times D\) symmetric, positive definite matrix \(\Sigma_k\). We further know that these parameters represent the mean and covariance of vector \(x\) under the PDF \(f_k\):

\begin{align} \mathbb{E}_{f_k(x|\mu_k, \Sigma_k)}[ x ] = \mu_k \\ \text{Cov}_{f_k(x|\mu_k, \Sigma_k)}[ x ] = \Sigma_k \end{align}
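To make the generative story behind \(p^{\text{mix}}\) concrete, here is a minimal NumPy sketch that draws samples from a mixture. It assumes Gaussian components purely for illustration (the assignment does not require \(f_k\) to be Gaussian), and all names (`pi`, `mu`, `Sigma`, `sample_from_mixture`) are just this sketch's choices.

```python
# Minimal sketch (not part of the assignment): sampling from a mixture whose
# components are taken to be Gaussian, just to make the notation concrete.
import numpy as np

rng = np.random.default_rng(0)

D = 2
pi = np.array([0.3, 0.7])                       # mixture weights, sum to one
mu = np.array([[0.0, 0.0], [3.0, -1.0]])        # component means, shape (K, D)
Sigma = np.stack([np.eye(D), 0.5 * np.eye(D)])  # component covariances, (K, D, D)

def sample_from_mixture(n_samples):
    """Draw x ~ p_mix: pick a component k with probability pi_k, then x ~ f_k."""
    ks = rng.choice(len(pi), size=n_samples, p=pi)
    return np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in ks])

x_samples = sample_from_mixture(5)
print(x_samples.shape)   # (5, 2)
```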

Problem 1a: Prove that the mean of vector \(x\) under the mixture distribution is given by:

\begin{align} \mathbb{E}_{p^{\text{mix}}(x)}[x] = \sum_{k=1}^K \pi_k \mu_k \end{align}

Problem 1b: Prove that the covariance of vector \(x\) under the mixture distribution is given by:

\begin{align} \text{Cov}_{p^{\text{mix}}(x)}[x] = \sum_{k=1}^K \pi_k (\Sigma_k + \mu_k \mu_k^T ) - \mathbb{E}_{p^{\text{mix}}(x)}[x] \, \mathbb{E}_{p^{\text{mix}}(x)}[x]^T \end{align}
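If you want a sanity check on your derivations (optional, not a substitute for the proofs), a Monte Carlo comparison works: draw many samples from the mixture and compare the empirical mean and covariance against the two formulas above. This sketch reuses `pi`, `mu`, `Sigma`, and `sample_from_mixture` from the earlier Gaussian-component example; tolerances are loose to absorb sampling noise.

```python
# Optional Monte Carlo sanity check of the stated mean/covariance formulas.
import numpy as np

x_samples = sample_from_mixture(200_000)

mean_formula = np.einsum('k,kd->d', pi, mu)                  # sum_k pi_k mu_k
second_moment = np.einsum('k,kde->de', pi,
                          Sigma + np.einsum('kd,ke->kde', mu, mu))
cov_formula = second_moment - np.outer(mean_formula, mean_formula)

print(np.allclose(x_samples.mean(axis=0), mean_formula, atol=0.05))
print(np.allclose(np.cov(x_samples.T), cov_formula, atol=0.05))
```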

Problem 2: Entropy and KL Divergences

Problem 2a: Consider a discrete random variable \(z\) with \(K\) possible values, represented as a one-hot vector of size \(K\). Let the PMF of this variable be: \(q(z | r) = \prod_{k=1}^K r_k^{z_k}\), where \(r \in \Delta^K\) is a probability vector of size \(K\) that sums to one.

Compute the entropy of this distribution:

\begin{align} \mathbb{H}[q] = \mathbb{E}_q[ - \log q(z) ] \end{align}
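You can check your closed-form answer numerically by evaluating the definition directly: enumerate the \(K\) one-hot outcomes and use the product-form PMF. This is only a sketch under an example choice of \(r\); it does not replace the derivation.

```python
# Numeric check of H[q] = E_q[-log q(z)] by enumerating the K one-hot vectors.
import numpy as np

r = np.array([0.2, 0.5, 0.3])          # example probability vector, sums to one
K = len(r)

def q_pmf(z, r):
    """PMF of a one-hot vector z under q(z | r) = prod_k r_k^{z_k}."""
    return np.prod(r ** z)

entropy = 0.0
for k in range(K):
    z = np.zeros(K)
    z[k] = 1.0                          # the k-th one-hot outcome
    entropy += q_pmf(z, r) * (-np.log(q_pmf(z, r)))

print(entropy)
```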

Problem 2b: What is the largest possible entropy for a discrete r.v. with \(K\) possible values? The smallest? What value of parameter vector \(r\) produces each one?

Problem 2c: Understanding the "Evidence Lower Bound" (ELBo)

Consider a joint probabilistic model for discrete hidden variable \(z\) and observed data vectors \(x\) where you assume \(p(x,z) = f(z) g(x|z)\), where \(f\) is a known PMF and \(g\) is a known PDF.

Show that the following inequality holds for any distribution \(q(z)\) that puts some probability mass everywhere (i.e. \(q(z) > 0\) for all \(z\)):

\begin{align} \log p(x) = \log \sum_z p(x,z) \geq \mathbb{E}_{q} \big[ \log p(x, z) - \log q(z) \big] \end{align}

Hint: You should use Jensen's inequality applied to the natural logarithm (which is a concave function): \(\log \mathbb{E}_{q(z)}[ a(z) ] \geq \mathbb{E}_{q(z)}[ \log a(z) ]\) for any function \(a\) that produces scalar outputs and any valid distribution \(q\) over a random variable \(z\).
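As a purely numerical illustration of the inequality (not a proof), the sketch below builds a toy model with two hidden states and Gaussian likelihoods, computes the exact log evidence \(\log p(x)\), and confirms it is never below the right-hand side for several randomly chosen valid \(q(z)\). All names here are hypothetical choices for this example.

```python
# Toy check: log p(x) >= E_q[log p(x,z) - log q(z)] for any valid q(z) > 0.
import numpy as np
from scipy.stats import norm

f = np.array([0.4, 0.6])               # prior PMF f(z) over z in {0, 1}
means = np.array([-1.0, 2.0])          # g(x | z=k) = Normal(x | means[k], 1)
x = 0.5                                # a single observed scalar

joint = f * norm.pdf(x, loc=means, scale=1.0)       # p(x, z=k) for each k
log_px = np.log(joint.sum())                        # exact log evidence

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet([1.0, 1.0])                   # a random valid q(z) > 0
    elbo = np.sum(q * (np.log(joint) - np.log(q)))  # E_q[log p(x,z) - log q(z)]
    print(f"log p(x) = {log_px:.4f} >= ELBo = {elbo:.4f}: {log_px >= elbo}")
```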

Problem 3: The EM Algorithm and Summary Statistics

Consider a Gaussian mixture model with \(K\) components, where we are performing EM to maximize a lower bound on the marginal likelihood.

We derived in class the following optimization objective that we wish to maximize:

\begin{align} \mathcal{L}(x, \gamma, \pi, \mu, \Sigma) = \sum_{n=1}^N \sum_{k=1}^K \Big[ \gamma_{nk} \log \pi_k + \gamma_{nk} \log \text{NormalPDF}(x_n | \mu_k, \Sigma_k) - \gamma_{nk} \log \gamma_{nk} \Big] \end{align}

Now, consider the following "summary" statistics for component \(k\):

\begin{align} N_k(\gamma,x) &= \sum_{n=1}^N \gamma_{nk} \\ S_k(\gamma,x) &= \sum_{n=1}^N \gamma_{nk} x_n \\ T_k(\gamma,x) &= \sum_{n=1}^N \gamma_{nk} x_n x_n^T \\ H_k(\gamma,x) &= \sum_{n=1}^N \gamma_{nk} \log \gamma_{nk} \end{align}
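For intuition about what these summaries contain, here is a sketch of how they could be computed with NumPy from data `x` of shape \((N, D)\) and responsibilities `gamma` of shape \((N, K)\). The function name and shapes are illustrative choices, not part of the assignment.

```python
# Sketch: computing the per-component summaries N_k, S_k, T_k, H_k.
import numpy as np

def summarize(gamma, x):
    """Return N_k, S_k, T_k, H_k for every component k."""
    N = gamma.sum(axis=0)                              # shape (K,)
    S = gamma.T @ x                                    # shape (K, D)
    T = np.einsum('nk,nd,ne->kde', gamma, x, x)        # shape (K, D, D)
    H = np.sum(gamma * np.log(gamma), axis=0)          # shape (K,)
    return N, S, T, H

# Example usage with random inputs of the right shape:
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
gamma = rng.dirichlet(np.ones(3), size=100)            # each row sums to one
N_k, S_k, T_k, H_k = summarize(gamma, x)
print(N_k.shape, S_k.shape, T_k.shape, H_k.shape)      # (3,) (3, 2) (3, 2, 2) (3,)
```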

Problem 3a: Rewrite the M-step update equation for \(\mu_k\) in (Bishop PRML Eq. 9.24) in terms of the summaries \(N_k, S_k, T_k\) above.

Problem 3b: Rewrite the M-step update equation for \(\Sigma_k\) in (Bishop PRML Eq. 9.25) in terms of \(N_k, S_k, T_k\) above.

Problem 3c: Rewrite the optimization objective above \(\mathcal{L}\) as a different function \(\mathcal{J}\), which takes as input not the data and responsibilities \(x_n\) and \(\gamma_n\) for each example (indexed by \(n\)), but instead the summary statistics \(N, S, T,\) and \(H\) defined above.

Provide a definition of \(\mathcal{J}\) that is exactly equivalent to \(\mathcal{L}\) given the same \(x, \gamma\):

\begin{align} \mathcal{L}(x, \gamma, \pi, \mu, \Sigma) = \mathcal{J}( N(\gamma, x), S(\gamma, x), T(\gamma, x), H(\gamma, x), \pi, \mu, \Sigma) \end{align}

Hint: You can use the following two identities involving the trace operator:

First, for any vector \(v \in \mathbb{R}^D\) and \(D \times D\) square matrix \(S\), we have:

\begin{align} v^T S v = \text{trace}(S v v^T) \qquad \text{hint 3c(i)} \end{align}

Second, the trace is a linear operator. This means the trace of a linear combination equals the linear combination of the traces:

\begin{align} \sum_{n=1}^N a_n \text{trace}(S v_n v_n^T) = \text{trace} \Bigg( S \left( \sum_{n=1}^N a_n v_n v_n^T \right) \Bigg) \qquad \text{hint 3c(ii)} \end{align}

Where \(a_n\) is a scalar real weight, \(v_n\) is a vector, and \(S\) is again a square matrix.
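If you want to convince yourself of the two hint identities before using them, a quick numeric spot-check with random inputs does the job (this verifies the identities only; it is not part of the required answer).

```python
# Numeric spot-check of hint 3c(i) and hint 3c(ii).
import numpy as np

rng = np.random.default_rng(0)
D, N = 4, 7
S = rng.normal(size=(D, D))            # a square matrix
vs = rng.normal(size=(N, D))           # vectors v_1, ..., v_N
a = rng.normal(size=N)                 # scalar weights a_1, ..., a_N

# hint 3c(i): v^T S v == trace(S v v^T)
v = vs[0]
print(np.isclose(v @ S @ v, np.trace(S @ np.outer(v, v))))

# hint 3c(ii): sum_n a_n trace(S v_n v_n^T) == trace(S sum_n a_n v_n v_n^T)
lhs = sum(a[n] * np.trace(S @ np.outer(vs[n], vs[n])) for n in range(N))
rhs = np.trace(S @ np.einsum('n,nd,ne->de', a, vs, vs))
print(np.isclose(lhs, rhs))
```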