HW2: Gaussian Probabilities and (Un)biased Estimators


Last modified: 2021-02-19 21:09

Due date: Wed. Mar. 3, 2021 at 11:59pm AoE (anywhere on Earth)

Status: RELEASED.

How to turn in: Submit PDF to https://www.gradescope.com/courses/239096/assignments/1030404/

Jump to: Problem 1   Problem 2   Problem 3   Problem 4

Questions?: Post to the hw2 topic on the Piazza discussion forums.

Instructions for Preparing your PDF Report

What to turn in: PDF of typeset answers via LaTeX. No handwritten solutions will be accepted, so that grading can be speedy and you get prompt feedback.

Please use the official LaTeX Template: https://raw.githubusercontent.com/tufts-ml-courses/comp136-21s-assignments/master/hw2/hw2_template.tex

Your PDF should include (in order):

For each problem:

Throughout this homework, we are practicing the skills necessary to derive and analyze mathematical expressions involving probability.

Each step of a mathematical derivation that you turn in should be justified by an accompanying short verbal phrase (e.g. "using Bayes rule").

Solutions that lack justifications or skip key steps without showing work will receive poor marks.

Problem 1: Estimators of Variance and Bias

Consider the following estimator for the variance of a univariate Gaussian, given \(N\) observed data points \(\{x_n \}_{n=1}^N\):

\begin{align} \hat{\sigma}^2(x_1, \ldots x_N) = \frac{1}{N} \sum_{n=1}^N (x_n - \mu_{\text{true}})^2 \end{align}

where \(\mu_{\text{true}} \in \mathbb{R}\) is a known "true" mean value (not an estimator).

Note that this is not the maximum likelihood estimator of the variance, because we are assuming the true mean \(\mu_{\text{true}}\) is known rather than estimated from the data.

1a

Compute the expected value of estimator \(\hat{\sigma}^2\), under the sampling distribution of the observations \(x\).

You should assume each \(x_n\) is drawn i.i.d. from \(x_n \sim \mathcal{N}( \mu_{\text{true}}, \sigma^2_{\text{true}})\).
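If you would like to sanity-check your answer numerically, here is a minimal Monte Carlo sketch (assuming NumPy; the particular values of \(\mu_{\text{true}}\), \(\sigma^2_{\text{true}}\), \(N\), and the number of trials are arbitrary choices, not part of the problem):

import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma2_true = 2.0, 3.0      # "true" mean and variance (arbitrary)
N, num_trials = 10, 200_000          # dataset size and number of simulated datasets

# Draw num_trials independent datasets of size N, each i.i.d. N(mu_true, sigma2_true).
x = rng.normal(mu_true, np.sqrt(sigma2_true), size=(num_trials, N))

# Evaluate the estimator \hat{\sigma}^2 (which uses the known mean) on each dataset.
sigma2_hat = np.mean((x - mu_true) ** 2, axis=1)

# The average over many simulated datasets approximates the estimator's expected value.
print("average of estimator over simulated datasets:", sigma2_hat.mean())

Compare the printed average against the expected value you derive analytically.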

1b

Given your result above, explain whether the estimator is biased or unbiased.

1c

How does your answer about bias of this estimator in 1b compare to what we learned about the bias of the maximum likelihood estimator for the variance? Explain why the answers might be the same or different.

Problem 2: Estimators of Gradients for Large Datasets

Consider a model that explains a large dataset of \(N\) observations \(\mathbf{x} = \{x_1, x_2, \ldots x_N\}\). We wish to model these as i.i.d. from the same likelihood distribution, with known PDF function \(f\) and parameters \(\theta \in \mathbb{R}^M\).

\begin{align} p( \{x_n\}_{n=1}^N | \theta) = \prod_{n=1}^N f( x_n | \theta) \end{align}

As usual, we can use the log likelihood for a more numerically stable objective. Our optimization problem becomes

\begin{align} \theta^* &= \arg \max_{\theta} \log p( \{x_n\}_{n=1}^N | \theta) \\ &= \arg \max_{\theta} \sum_{n=1}^N \log f( x_n | \theta) \end{align}

In order to estimate parameters, we can perform gradient ascent (equivalently, gradient descent on the negative log likelihood). The gradient \(G\) of our objective is given by:

\begin{align} G(\theta, \mathbf{x} ) = \sum_{n=1}^N g(\theta, x_n), \quad g(\theta, x_n) = \nabla_{\theta} \log f(x_n | \theta) \end{align}

Dimensionality check: \(G\) is a vector of size \(M\), the same size as \(\theta \in \mathbb{R}^M\).

Now, if the dataset size \(N\) is very large, computing this gradient can be time-consuming (yes, the cost is only linear in \(N\), but touching each of a million examples on every update is expensive).

Thus, the key idea of stochastic gradient descent (SGD) is to compute an estimator of the whole dataset gradient in two steps:

  • First, randomly select one example (identified by index \(i\)) from a distribution \(U\) that puts uniform probability over each example in the dataset. The PMF of \(U\) for each possible selected index \(n\) is:
\begin{align} \text{PMF}_{U}(i = n) = \frac{1}{N}, \forall n \in \{1, \ldots N \} \end{align}
  • Next, compute an estimator of the gradient using just this one sample:
\begin{align} \hat{G}(\theta, x_i) = N g(\theta, x_i), \quad i \sim U \end{align}

2a

Show that the SGD estimator is an unbiased estimator of the whole dataset gradient.

That is, show:

\begin{align} \mathbb{E}_{i \sim U}[ \hat{G}(\theta, x_i) ] = G(\theta, \mathbf{x}) \end{align}

Remember, we denote the whole dataset as \(\mathbf{x}\), and a single selected observation as \(x_i\).
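A minimal numerical sanity check of this claim is sketched below, assuming NumPy and a toy choice of \(f\): a unit-variance Gaussian with unknown mean \(\theta\), so that \(g(\theta, x_n) = x_n - \theta\). That particular choice is an assumption made only for illustration, and the sketch is not a substitute for the derivation you are asked to write.

import numpy as np

rng = np.random.default_rng(0)
N = 1000
x = rng.normal(0.5, 2.0, size=N)   # an arbitrary fixed dataset
theta = -1.0                       # an arbitrary parameter value

g = x - theta                      # per-example gradients g(theta, x_n) for the toy f
G_full = g.sum()                   # whole-dataset gradient G(theta, x)

# Exact expectation under U: average N * g(theta, x_i) over all N equally likely indices.
G_hat_exact = np.mean(N * g)

# Monte Carlo version: draw indices i ~ U and average the resulting SGD estimates.
idx = rng.integers(0, N, size=100_000)
G_hat_mc = np.mean(N * g[idx])

print("whole-dataset gradient   :", G_full)
print("exact E[G_hat] under U   :", G_hat_exact)
print("Monte Carlo E[G_hat]     :", G_hat_mc)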

2b

Given a finite dataset of size \(N\), consider the estimated parameter \(\theta\) produced by one SGD step with known constant step size \(\epsilon > 0\) and known initial value \(\theta_0\):

\begin{align} \theta_1 \gets \theta_{0} + \epsilon \hat{G}(\theta_{0}, x_i), \quad i \sim U \end{align}

Short Answer: How will the updated value \(\theta_1\) produced by a whole-dataset gradient step (which has no randomness) compare to the \(\theta_1\) resulting from SGD? Why is it important that the SGD estimator is unbiased, if we want to find a solution to our original objective?

Problem 3: Recognizing Gaussians and Completing the Square

3a

Suppose you are told that a vector random variable \(x \in \mathbb{R}^M\) has the following log pdf function:

\begin{align} \log p(x) = \text{const} - \frac{1}{2} x^T A x + b^T x \end{align}

where \(A\) is a symmetric positive definite matrix and \(b\) is any vector.

Show that \(x\) has a multivariate Gaussian distribution.

Hint: You can solve this by transforming the log pdf above so that it has the form:

\begin{align} \log p(x | \mu, S^{-1} ) = \text{const} - \frac{1}{2} (x - \mu)^T S (x - \mu) \end{align}

where the constant is with respect to \(x\), and you define a mean vector \(\mu \in \mathbb{R}^M\) and precision matrix \(S\) (symmetric, positive definite) in terms of the original coefficients \(A, b\) above.

Partial Credit Option: If you wish, you can solve this problem for the \(M=1\) univariate case only, for 75% credit.
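As a purely numerical illustration for the \(M = 1\) case (not a substitute for the algebraic argument), the sketch below normalizes \(\exp(-\tfrac{1}{2} a x^2 + b x)\) on a grid and checks that the result matches a Gaussian density. The values of \(a\) and \(b\) are arbitrary assumptions, with \(a > 0\) playing the role of the positive definite matrix \(A\).

import numpy as np

a, b = 2.0, 1.5                            # arbitrary coefficients, with a > 0
grid = np.linspace(-20.0, 20.0, 200_001)   # wide, fine grid over x
dx = grid[1] - grid[0]

log_q = -0.5 * a * grid**2 + b * grid      # the given log pdf, up to a constant
p = np.exp(log_q)
p /= p.sum() * dx                          # normalize numerically (Riemann sum)

mean = (grid * p).sum() * dx               # numerical mean of p
var = ((grid - mean) ** 2 * p).sum() * dx  # numerical variance of p

# A Gaussian with the same mean and variance should match p everywhere on the grid.
gauss = np.exp(-0.5 * (grid - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)
print("max |p - Gaussian(mean, var)| on grid:", np.abs(p - gauss).max())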

Problem 4: Predictive Posteriors for Gaussians

We have generally seen in class and in the textbook that seeing more data leads to a decrease in the uncertainty of distributions that condition on that data. In this problem, we'll prove that for the Gaussian-Gaussian model of linear regression, each additional data point observed cannot make the predictive posterior's variance increase. It must be either equal to what it was before, or smaller.

Let's focus on Bayesian linear regression. Consider a joint probabilistic model of weight vectors \(w \in \mathbb{R}^M\) and observed scalar responses \(t_n \in \mathbb{R}\). This model assumes as fixed all input feature vectors \(\phi(x_n) \in \mathbb{R}^M\).

Prior: \(p(w) = \mathcal{N}( m_0, S_0 )\), where \(m_0 \in \mathbb{R}^M\) and \(S_0\) is an \(M \times M\) symmetric, positive definite covariance matrix.

Likelihood: \(p(t | x, w) = \prod_{n=1}^N \mathcal{N}( t_n | w^T \phi(x_n), \beta^{-1})\), where \(\beta > 0\) is a known precision.

This model has the predictive distribution given in the textbook (Eq. 3.57):

\begin{align} p(t_* | x_*, \{x_n, t_n\}_{n=1}^N ) &= \mathcal{N}( t_* | m^T_N \phi(x_*), \sigma_N^2(x_*) ) \\ \sigma_N^2(x_*) &= \beta^{-1} + \phi(x_*)^T S_N \phi(x_*) \\ S_N^{-1} &= S_0^{-1} + \beta \Phi^T \Phi &&\quad \text{see Eq. 3.51} \\ m_N &= S_N \left( S_0^{-1} m_0 + \beta \Phi^T t \right) &&\quad \text{see Eq. 3.50} \end{align}
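Here is a minimal NumPy sketch of these quantities (Eqs. 3.50, 3.51, and the predictive variance above), using an arbitrary toy design matrix and prior, in case you want to experiment numerically before or after doing the algebra; nothing here is part of the required solution:

import numpy as np

rng = np.random.default_rng(0)
M, N, beta = 3, 10, 4.0                   # arbitrary sizes and noise precision
Phi = rng.normal(size=(N, M))             # design matrix: rows are phi(x_n)^T
t = rng.normal(size=N)                    # observed responses
m0, S0 = np.zeros(M), np.eye(M)           # prior mean and covariance (arbitrary)

S_N_inv = np.linalg.inv(S0) + beta * Phi.T @ Phi            # Eq. 3.51
S_N = np.linalg.inv(S_N_inv)
m_N = S_N @ (np.linalg.inv(S0) @ m0 + beta * Phi.T @ t)     # Eq. 3.50

phi_star = rng.normal(size=M)             # features of a query input x_*
sigma2_N = 1.0 / beta + phi_star @ S_N @ phi_star           # predictive variance
print("predictive mean    :", m_N @ phi_star)
print("predictive variance:", sigma2_N)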

Partial Credit Option: If you wish, you can solve any subproblem for the \(M=1\) univariate case only, for 75% credit. If you do this, please indicate it with one sentence at the top of the relevant subproblem: "I am solving the M=1 case for partial credit".

4a

Show that we can write \(S_{N+1}^{-1} = S_N^{-1} + vv^T\) for some vector \(v\).

4b

Next, consider the following identity (the Sherman-Morrison formula), which holds for any invertible matrix \(A\) whenever \(1 + v^T A^{-1} v \neq 0\):

\begin{align} (A + vv^T)^{-1} = A^{-1} - \frac{ (A^{-1}v)(v^T A^{-1})}{ 1 + v^T A^{-1} v} \end{align}

Substitute \(A = S_N^{-1}\) and \(v\) as defined in 4a into the above. Simplify to write an expression for \(S_{N+1}\) in terms of \(S_{N}\).
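If you would like to convince yourself of the identity numerically before using it, here is a quick check (assuming NumPy; \(A\) and \(v\) are random, and the positive definite construction of \(A\) is just a convenient way to guarantee invertibility):

import numpy as np

rng = np.random.default_rng(0)
M = 4
B = rng.normal(size=(M, M))
A = B @ B.T + M * np.eye(M)      # a well-conditioned invertible matrix
v = rng.normal(size=M)

lhs = np.linalg.inv(A + np.outer(v, v))
A_inv = np.linalg.inv(A)
rhs = A_inv - np.outer(A_inv @ v, v @ A_inv) / (1.0 + v @ A_inv @ v)
print("max abs difference between the two sides:", np.abs(lhs - rhs).max())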

4c

Show that \(\sigma^2_{N+1}(x_*) - \sigma^2_{N}(x_*) = \phi(x_*)^T \left[ S_{N+1} - S_{N} \right] \phi(x_*)\).

4d

Finally, plug your result from 4b (which expresses \(S_{N+1}\) in terms of \(S_N\)) into 4c, and use the fact that \(S_N\) must be positive definite, to show that:

\begin{align} \sigma_{N+1}^2(x_*) \leq \sigma_N^2(x_*) \end{align}

This would prove that the predictive variance cannot increase with each additional data point. In other words, gathering more data will never make us "less certain" about the next observation.
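As an optional numerical illustration of this claim (not a substitute for the proof in 4a-4d), the sketch below refits the posterior covariance with one extra data point using Eq. 3.51 and compares the two predictive variances. The toy prior, features, and precision are arbitrary assumptions, and the responses \(t\) are not needed because they do not affect the variance.

import numpy as np

rng = np.random.default_rng(0)
M, N, beta = 3, 10, 4.0
S0_inv = np.eye(M)                          # prior precision S_0^{-1} (arbitrary)
Phi = rng.normal(size=(N, M))               # feature vectors of the first N points (rows)
phi_new = rng.normal(size=M)                # feature vector of data point N+1
Phi_plus = np.vstack([Phi, phi_new])        # design matrix with N+1 rows

# Posterior covariances via Eq. 3.51, before and after seeing the extra point.
S_N = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
S_N1 = np.linalg.inv(S0_inv + beta * Phi_plus.T @ Phi_plus)

phi_star = rng.normal(size=M)               # features of the query input x_*
sigma2_N = 1.0 / beta + phi_star @ S_N @ phi_star
sigma2_N1 = 1.0 / beta + phi_star @ S_N1 @ phi_star
print("sigma^2_N   (x_*):", sigma2_N)
print("sigma^2_N+1 (x_*):", sigma2_N1)
print("non-increasing?  ", sigma2_N1 <= sigma2_N)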