HW2: Gaussian Probabilities and (Un)biased Estimators


Last modified: 2020-02-25 14:12

Due date: Wed Feb 26, 2020 at 11:59pm

Status: Released

Updates:

  • 2020-02-20 : Updated wording in 5d: "Substitute 5b into 5c"

How to turn in: Submit PDF to https://www.gradescope.com/courses/82862/assignments/359626/

Jump to: Problem 1   Problem 2   Problem 3   Problem 4   Problem 5

Questions?: Post to the hw2 topic on the Piazza discussion forums.

Instructions for Preparing your PDF Report

What to turn in: PDF of typeset answers via LaTeX. No handwritten solutions will be accepted, so that grading can be speedy and you get prompt feedback.

Please use the official LaTeX Template: https://raw.githubusercontent.com/tufts-ml-courses/comp136-spr-20s-assignments/master/hw2/hw2_template.tex

Your PDF should include (in order):

  • Your full name
  • About how many hours did you spend (debugging your approach, asking for help, writing it up)?
  • Collaboration statement
  • Problem 1 answer
  • Problem 2 answer
  • Problem 3 answer
  • Problem 4 answer
  • Problem 5 answer

For each problem:

Throughout this homework, we are practicing the skills necessary to derive and analyze mathematical expressions involving probability.

Each step of a mathematical derivation that you turn in should be justified by an accompanying short verbal phrase (e.g. "using Bayes rule").

Solutions that lack justifications or skip key steps without showing work will receive poor marks.

Problem 1: Estimators of Variance and Bias

Consider the following estimator for the variance of a univariate Gaussian, given \(N\) observed data points \(\{x_n \}_{n=1}^N\):

\begin{align} \hat{\sigma}^2(x_1, \ldots x_N) = \frac{1}{N} \sum_{n=1}^N (x_n - \mu_{\text{true}})^2 \end{align}

where \(\mu_{\text{true}} \in \mathbb{R}\) is a known "true" mean value (not an estimator).

1a

Compute the expected value of estimator \(\hat{\sigma}^2\), under the sampling distribution of the observations \(x\).

You should assume each \(x_n\) is drawn i.i.d. from \(x_n \sim \mathcal{N}( \mu_{\text{true}}, \sigma^2_{\text{true}})\).
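
Optionally, you can sanity-check your derived expectation with a quick Monte Carlo simulation before writing it up. A minimal Python/NumPy sketch is below; the particular values of \(\mu_{\text{true}}\), \(\sigma^2_{\text{true}}\), and \(N\) are arbitrary choices for illustration, not part of the assignment, and the simulation is not a substitute for the derivation you must turn in.

```python
import numpy as np

# Arbitrary illustrative values (not part of the assignment).
mu_true, var_true, N = 2.0, 3.0, 5
rng = np.random.default_rng(0)

num_trials = 200_000
estimates = np.empty(num_trials)
for t in range(num_trials):
    x = rng.normal(mu_true, np.sqrt(var_true), size=N)
    # Estimator from Problem 1: average squared deviation from the KNOWN true mean.
    estimates[t] = np.mean((x - mu_true) ** 2)

print("Monte Carlo estimate of E[sigma_hat^2]:", estimates.mean())
print("sigma^2_true:                          ", var_true)
```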

1b

Given your result above, state whether the estimator is biased or unbiased.

1c

How does your answer about bias in 1b compare to what we learned about whether the maximum likelihood estimator for the variance is biased? Provide some justification.

Problem 2: Estimators of Gradients for Large Datasets

Consider a model that explains a large dataset of \(N\) observations \(\mathbf{x} = \{x_1, x_2, \ldots x_N\}\). We wish to model these as i.i.d. from the same likelihood distribution, with known PDF function \(f\) and parameters \(\theta \in \mathbb{R}^M\).

\begin{align} p( \{x_n\}_{n=1}^N | \theta) = \prod_{n=1}^N f( x_n | \theta) \end{align}

In order to estimate parameters using the maximum likelihood objective, we can perform gradient descent, working as usual with the gradient of the log likelihood (rather than the likelihood itself) for numerical stability:

\begin{align} G(\theta, \mathbf{x} ) = \sum_{n=1}^N g(\theta, x_n), \quad g(\theta, x_n) = \nabla_{\theta} \log f(x_n | \theta) \end{align}

However, if the dataset size \(N\) is too large, computing this gradient might be time-consuming (yes it is linear in \(N\), but having to touch each of a million examples is costly).

Thus, the key idea of stochastic gradient descent (SGD) is to compute an \emph{estimator} of the whole dataset gradient in two steps:

  • First, randomly select one example (identified by index \(i\)) from a distribution \(U\) that puts uniform probability over each example in the dataset. The PMF of \(U\) for each possible selected index \(n\) is:
\begin{align} \text{PMF}_{U}(i = n) = \frac{1}{N}, \forall n \in \{1, \ldots N \} \end{align}
  • Next, compute an estimator of the gradient using just this one sample:
\begin{align} \hat{G}(\theta, x_i) = N g(\theta, x_i), \quad i \sim U \end{align}

2a

Show that the SGD estimator is an unbiased estimator of the whole dataset gradient.

That is, show:

\begin{align} \mathbb{E}_{i \sim U}[ \hat{G}(\theta, x_i) ] = G(\theta, \mathbf{x}) \end{align}
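
Optionally, here is a minimal numerical sanity check of this claim, assuming a toy choice of likelihood \(f(x | \theta) = \mathcal{N}(x | \theta, 1)\) so that \(g(\theta, x_n) = x_n - \theta\); the dataset and \(\theta\) value are arbitrary. This is only a check, not a substitute for the derivation you must turn in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): f(x | theta) = N(x | theta, 1),
# so g(theta, x_n) = d/dtheta log f(x_n | theta) = x_n - theta.
N = 50
x = rng.normal(4.0, 2.0, size=N)   # any fixed dataset
theta = 1.5                        # any fixed parameter value

G_full = np.sum(x - theta)         # whole-dataset gradient G(theta, x)

# Average the single-example estimator N * g(theta, x_i) over many draws i ~ Uniform{1..N}.
num_draws = 500_000
idx = rng.integers(0, N, size=num_draws)
G_hat_mean = np.mean(N * (x[idx] - theta))

print("E[G_hat] (Monte Carlo):", G_hat_mean)
print("G (whole dataset):     ", G_full)
```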

2b

Given a finite dataset of size \(N\), consider the final values of estimated parameters \(\theta\) produced by SGD after many repeated gradient updates (indexed \(t = 1, 2, \ldots\)) with step size \(\epsilon > 0\):

\begin{align} \theta_t \gets \theta_{t-1} + \epsilon \hat{G}(\theta_{t-1}, x_i), \quad i \sim U \end{align}

Short Answer: How will the \(\theta\) value estimated by whole-dataset gradient descent (which has no randomness) typically compare to the resulting \(\theta\) from SGD? Why is it important that the SGD estimator is unbiased?

Problem 3: Recognizing Gaussians and Completing the Square

3a

Suppose you are told that a univariate random variable \(x \in \mathbb{R}\) has the following log pdf function:

\begin{align} \log p(x) = \text{const} - \frac{1}{2} a x^2 + bx \end{align}

where \(a > 0\) and \(b\) is any number.

Show that \(x\) has a univariate Gaussian distribution.

Hint: You can solve this by transforming the log pdf above so that it has the form:

\begin{align} \log p(x | \mu, \beta) = \text{const} - \frac{1}{2} \beta (x - \mu)^2 \end{align}

where the constant is with respect to \(x\), and you define a scalar mean parameter \(\mu \in \mathbb{R}\) and scalar precision parameter \(\beta > 0\) in terms of the original coefficients \(a, b\) above.

Hint 2: You should use a technique called "completing the square". You might find this note Completing the Square, step by step by Prof. Robert Collins helpful.
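
If the technique is unfamiliar, the optional sketch below illustrates completing the square on a generic quadratic (deliberately not the one in this problem): a correct completion differs from the original expression only by a constant in \(x\). The helper name `differs_only_by_constant` is just an illustration, not part of the assignment; you can reuse it to test your own derived \(\mu\) and \(\beta\).

```python
import numpy as np

def differs_only_by_constant(f, g, xs):
    """Return True if f(x) - g(x) is (numerically) the same constant for every x in xs."""
    diff = f(xs) - g(xs)
    return bool(np.allclose(diff, diff[0]))

# Generic example of completing the square:  x^2 + 6x + 2  =  (x + 3)^2 - 7.
# The two sides differ only by the constant -7 once the square is completed.
xs = np.linspace(-4.0, 4.0, 9)
print(differs_only_by_constant(lambda x: x**2 + 6*x + 2,
                               lambda x: (x + 3)**2,
                               xs))   # True

# Once you have derived mu and beta for 3a, you can reuse the same check with
#   f(x) = -0.5*a*x**2 + b*x   and   g(x) = -0.5*beta*(x - mu)**2.
```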

3b

Suppose you are told that a vector random variable \(x \in \mathbb{R}^M\) has the following log pdf function:

\begin{align} \log p(x) = \text{const} - \frac{1}{2} x^T A x + b^T x \end{align}

where \(A\) is a symmetric positive definite matrix and \(b\) is any vector.

Show that \(x\) has a multivariate Gaussian distribution.

Hint: You can solve this by transforming the log pdf above so that it has the form:

\begin{align} \log p(x | \mu, S^{-1} ) = \text{const} - \frac{1}{2} (x - \mu)^T S (x - \mu) \end{align}

where the constant is with respect to \(x\), and you define a mean vector \(\mu \in \mathbb{R}^M\) and precision matrix \(S\) (symmetric, positive definite) in terms of the original coefficients \(A, b\) above.
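
As in 3a, you can optionally verify a candidate \((\mu, S)\) numerically: the two quadratic forms should differ only by a constant in \(x\). The sketch below uses arbitrary test coefficients \(A, b\) and demonstrates the checker on a deliberately wrong candidate; your derived \((\mu, S)\) should make it return True.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 3

# Arbitrary test coefficients: A symmetric positive definite, b any vector.
Q = rng.normal(size=(M, M))
A = Q @ Q.T + M * np.eye(M)
b = rng.normal(size=M)

def differs_only_by_constant(mu, S, num_points=20, tol=1e-8):
    """Check whether -0.5 x^T A x + b^T x and -0.5 (x - mu)^T S (x - mu)
    differ by the same constant at randomly chosen points x."""
    xs = rng.normal(size=(num_points, M))
    orig = np.array([-0.5 * x @ A @ x + b @ x for x in xs])
    target = np.array([-0.5 * (x - mu) @ S @ (x - mu) for x in xs])
    diff = orig - target
    return bool(np.max(np.abs(diff - diff[0])) < tol)

# A deliberately wrong candidate returns False; your derived (mu, S) should return True.
print(differs_only_by_constant(mu=np.zeros(M), S=np.eye(M)))
```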

Problem 4: Predictive Posteriors for 1-dim. Gaussians

We have generally seen in class and in the textbook that observing more data leads to a decrease in the uncertainty of distributions that condition on that data. In this problem, we'll prove that for the Gaussian-Gaussian model of linear regression, each additional observed data point cannot make the predictive posterior's variance increase. It must be either equal to what it was before, or smaller.

Let's focus on Bayesian linear regression. Consider a joint probabilistic model of weight vectors \(w \in \mathbb{R}^M\) and observed scalar responses \(t_n \in \mathbb{R}\). This model treats all input feature vectors \(\phi(x_n) \in \mathbb{R}^M\) as fixed.

Prior: \(p(w) = \mathcal{N}( m_0, S_0 )\), where \(m_0 \in \mathbb{R}^M\) and \(S_0\) is an \(M \times M\) symmetric, positive definite covariance matrix.

Likelihood: \(p(t | x, w) = \prod_{n=1}^N \mathcal{N}( t_n | w^T \phi(x_n), \beta^{-1})\), where \(\beta > 0\) is a known precision.

This model has the predictive distribution given in the textbook (Eq. 3.57):

\begin{align} p(t_* | x_*, \{x_n, t_n\}_{n=1}^N ) &= \mathcal{N}( t_* | m_N^T \phi(x_*), \sigma_N^2(x_*) ) \\ \sigma_N^2(x_*) &= \beta^{-1} + \phi(x_*)^T S_N \phi(x_*) \\ S_N^{-1} &= S_0^{-1} + \beta \Phi^T \Phi &&\quad \text{see Eq. 3.51} \\ m_N &= S_N \left( S_0^{-1} m_0 + \beta \Phi^T t \right) &&\quad \text{see Eq. 3.50} \end{align}

To build up our understanding of this predictive distribution, we'll first focus on the special case that \(M=1\). This means each feature is a scalar, and we can turn all matrices into scalars as well.
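
Before starting the algebra, it may help to see the claim numerically. The optional sketch below evaluates Eq. 3.51 and Eq. 3.57 in the \(M=1\) case as data points arrive one at a time; all numerical values (\(\beta\), \(S_0\), the features) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative values for the M = 1 case.
beta = 2.0        # known likelihood precision
S0 = 1.5          # prior variance (a 1x1 "matrix", i.e. a scalar)
phi_star = 0.7    # feature value phi(x_*) at the query point

S_inv = 1.0 / S0
for N in range(1, 11):
    phi_n = rng.normal()                  # feature of the newly observed point
    S_inv = S_inv + beta * phi_n**2       # Eq. 3.51, applied one point at a time
    sigma2_N = 1.0 / beta + phi_star**2 * (1.0 / S_inv)   # Eq. 3.57
    print(f"N = {N:2d}   sigma_N^2(x_*) = {sigma2_N:.6f}")
```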

4a

For the case \(M=1\), show that we can write \(S_{N+1}^{-1} = S_N^{-1} + v^2\) for some scalar value \(v\).

4b

For the case \(M=1\), write an expression for \(\sigma^2_{N+1} - \sigma^2_{N}\), simplifying as much as possible.

4c

By studying the terms of your expression from 4b, prove the following statement:

$$ \sigma_{N+1}^2(x_*) \leq \sigma_N^2(x_*) $$

Problem 5: Predictive Posteriors for \(M\)-dim. Gaussians

Throughout this problem, consider the setting from Problem 4 but for a multivariate parameter space: \(M > 1\).

5a

Show that we can write \(S_{N+1}^{-1} = S_N^{-1} + vv^T\) for some vector \(v\).

5b

Next, consider the following identity (the Sherman-Morrison formula), which holds for any invertible matrix \(A\) and any vector \(v\) such that \(1 + v^T A^{-1} v \neq 0\):

\begin{align} (A + vv^T)^{-1} = A^{-1} - \frac{ (A^{-1}v)(v^T A^{-1})}{ 1 + v^T A^{-1} v} \end{align}

Substitute \(A = S_N^{-1}\) and \(v\) as defined in 5a into the above. Simplify to write an expression for \(S_{N+1}\) in terms of \(S_{N}\).
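
If you like, you can convince yourself that this identity holds by checking it numerically on a random \(A\) and \(v\) before using it; a minimal optional sketch (with arbitrary sizes and values) is below.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4

# Random symmetric positive definite A (playing the role of S_N^{-1}) and random v.
Q = rng.normal(size=(M, M))
A = Q @ Q.T + M * np.eye(M)
v = rng.normal(size=M)

lhs = np.linalg.inv(A + np.outer(v, v))
A_inv = np.linalg.inv(A)
rhs = A_inv - np.outer(A_inv @ v, v @ A_inv) / (1.0 + v @ A_inv @ v)

print("max abs difference between the two sides:", np.max(np.abs(lhs - rhs)))
```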

5c

Show that \(\sigma^2_{N+1}(x_*) - \sigma^2_{N}(x_*) = \phi(x_*)^T \left[ S_{N+1} - S_{N} \right] \phi(x_*)\)

5d

Finally, substitute your result from 5b (which defines \(S_{N+1}\)) into 5c, and use the fact that \(S_N\) must be positive definite, to show that:

\begin{align} \sigma_{N+1}^2(x_*) \leq \sigma_N^2(x_*) \end{align}

This would prove that the predictive variance cannot increase with each additional data point.
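
As a final optional sanity check, the sketch below repeats the numerical demonstration from Problem 4, now with \(M > 1\), by applying Eq. 3.51 and Eq. 3.57 directly as points are added; all values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 3

# Arbitrary illustrative setup for M > 1.
beta = 2.0
S_inv = np.linalg.inv(1.5 * np.eye(M))   # S_0^{-1}
phi_star = rng.normal(size=M)            # phi(x_*) at the query point

prev = np.inf
for N in range(1, 11):
    phi_n = rng.normal(size=M)                      # feature vector of the new point
    S_inv = S_inv + beta * np.outer(phi_n, phi_n)   # Eq. 3.51, one point at a time
    sigma2 = 1.0 / beta + phi_star @ np.linalg.inv(S_inv) @ phi_star   # Eq. 3.57
    print(f"N = {N:2d}   sigma_N^2(x_*) = {sigma2:.6f}   decreased or equal: {sigma2 <= prev + 1e-12}")
    prev = sigma2
```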