HW1: Gaussian Processes


Last modified: 2019-09-05 11:37

Due date: Monday Sept. 16 2019 at 11:59PM.

Status: RELEASED

What to submit:

Your PDF should include:

  • Your full name
  • Collaboration statement
  • Problem 1 answer (with parts clearly labeled and marked via the gradescope annotation tool)
  • Problem 2 answer (with parts clearly labeled and marked via the gradescope annotation tool)

Your ZIP file should include:

  • Your Python code for Problem 1 and Problem 2

Questions?: Post to the hw1 topic on the discussion forums.


Background

Gaussian processes are flexible probabilistic models for regression problems.

The key idea is that you can specify assumptions about the mean of predictions (what predictions should look like as a function of the input), as well as assumptions about the covariance of predictions (how "similar" the outputs of "similar" inputs should be). To formalize notions of similarity, we use kernel functions.

For a clear introduction to GPs, see Rasmussen & Williams Ch. 2.

Squared exponential kernel

See the squared exponential kernel formula in the Rasmussen & Williams (R&W) textbook Eq. 4.9:

$$ k_{\text{sq-exp}}(x, x') = \exp \left( -\frac{(x - x')^2}{2 \ell^2} \right) $$

Hyperparameters: lengthscale \(\ell > 0\)

Matern kernel

See the Matern kernel formula in the R&W textbook Eq. 4.14:

$$ k_{\text{Matern}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \bigg( \frac {\sqrt{2\nu} |x - x'| } { \ell } \bigg)^\nu K^{\text{bessel}}_\nu\bigg(\frac{\sqrt{2\nu} |x - x'| }{\ell}\bigg) $$

Hyperparameters: lengthscale \(\ell > 0\) and smoothness \(\nu > 0\). Here \(\Gamma\) denotes the gamma function and \(K^{\text{bessel}}_\nu\) the modified Bessel function of the second kind.

Useful hints:

You might wish to define two Python functions, sqexp_kernel_func and matern_kernel_func, to compute these kernels given any two possible inputs, \(x\) and \(x'\). It is probably smart to write these functions in a vectorized form, so that given two vectors of length \(A\) and \(B\), the function returns a kernel matrix of size \(A \times B\).
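
One possible vectorized sketch of such helpers is below, using scipy.special for the gamma and Bessel functions. The function names, argument names, and default hyperparameter values here are just suggestions, not requirements.

import numpy as np
from scipy.special import gamma, kv   # kv = modified Bessel function of the second kind


def sqexp_kernel_func(x_A, x_B, length_scale=1.0):
    """ Squared exponential kernel. Returns an A x B kernel matrix. """
    x_A = np.asarray(x_A, dtype=float)
    x_B = np.asarray(x_B, dtype=float)
    sq_dist_AB = (x_A[:, np.newaxis] - x_B[np.newaxis, :]) ** 2
    return np.exp(-0.5 * sq_dist_AB / length_scale ** 2)


def matern_kernel_func(x_A, x_B, length_scale=1.0, nu=0.5):
    """ Matern kernel. Returns an A x B kernel matrix. """
    x_A = np.asarray(x_A, dtype=float)
    x_B = np.asarray(x_B, dtype=float)
    dist_AB = np.abs(x_A[:, np.newaxis] - x_B[np.newaxis, :])
    scaled_AB = np.sqrt(2.0 * nu) * dist_AB / length_scale
    K_AB = (2.0 ** (1.0 - nu) / gamma(nu)) * scaled_AB ** nu * kv(nu, scaled_AB)
    # kv(nu, 0) diverges, so patch zero-distance entries with the limiting value 1.0
    K_AB[dist_AB == 0.0] = 1.0
    return K_AB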

Problem 1: Sampling from the Prior

Write Python code to sample function values from a Gaussian Process (GP) prior.

You should sample the function values that correspond to a set of at least 200 evenly-spaced test points \(\{x_i\}\) between -20 and 20. One way to generate a 1D array of \(G\) points would be: x_grid_G = np.linspace(-20, 20, G).

Your prior should be specified by:

  • A mean function \(m(x)\)
  • A covariance (kernel) function \(k(x, x')\): one of the kernels defined above (squared exponential or Matern), with the hyperparameter values listed below

Here's an example template (you don't have to use it):

def draw_GP_prior_samples_at_x_grid(
        x_grid_G, mean_func, cov_func,
        random_seed=42,
        n_samples=1):
    """ Draw sample from GP prior given mean/cov functions

    Args
    ----
    x_grid_G : 1D array, size n_grid_pts (G)
        Contains G specific x_i values to evaluate function at
    mean_func : function, maps (1D arr size A) to (1D arr size A)
        Computes mean value $m(x_i)$ at each input x_i value 
    cov_func : function, maps (1D arr size A, 1D arr size B) to (2D AxB)
        Computes covariance (kernel) value at each pair of inputs.
    random_seed : int
        Seed for the random number generator
    n_samples : int
        Number of samples to draw from the prior

    Returns
    -------
    f_SG : 2D array, n_samples (S) x n_grid_pts (G)
        Contains sampled function values at each point of x_grid
    """
    # TODO compute mean at each grid point

    # TODO compute covariance matrix across grid points

    # Use consistent random number generator for reproducibility
    prng = np.random.RandomState(int(random_seed))
    TODO = prng.multivariate_normal(TODO)

    return TODO

To demonstrate your implementation, you'll make plots of sampled function values. Each individual plot should show a line plot of the test grid points \(x_i\) and the corresponding sampled function values \(f_i = f(x_i)\). Use a matplotlib line style '.-' to show both the specific \(\{x_i, f_i\}\) pair values and connecting lines.
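
For instance, assuming f_SG is the samples-by-gridpoints array returned by your sampling function, one panel could be drawn roughly as follows (illustrative only; the zero array is a placeholder for real samples):

import numpy as np
import matplotlib.pyplot as plt

x_grid_G = np.linspace(-20, 20, 200)
f_SG = np.zeros((5, x_grid_G.size))    # placeholder: replace with your sampled function values

fig, ax = plt.subplots()
for ss in range(f_SG.shape[0]):
    ax.plot(x_grid_G, f_SG[ss], '.-')  # '.-' shows each (x_i, f_i) point plus connecting lines
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
plt.show()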

For Problem 1, your report PDF should include:

a. 1 row x 3 column grid of plots, where each panel shows 5 samples from the prior using a squared exponential kernel

  • For the columns, try 3 possible length scale values for \(\ell\): 0.25, 1.0, and 4.0

b. Short text description of what each of the SE kernel's hyperparameters do (in terms of qualitative trends). A few short but complete sentences.

c. 3 row x 3 column grid of plots, where each panel shows 5 samples from the prior using a Matern kernel

  • For the columns, try 3 possible lengthscales for \(\ell\): 0.25, 1.0, 4.0
  • For the rows, try 3 possible \(\nu\) values: 0.5, 2, 8

d. Short text description of what each of the Matern kernel's hyperparameters do (in terms of qualitative trends). A few short but complete sentences.

Problem 2: Sampling from the Posterior

Consider the following training data with \(N=6\) example pairs of \(x\) and \(y\) values:

x_train_N = np.asarray([-2.,    -1.8,   -1.,  1.,  1.8,     2.])
y_train_N = np.asarray([-3.,  0.2224,    3.,  3.,  0.2224, -3.])

As in R&W Ch. 2's section on 'Prediction using Noisy Observations', we assume that the observed \(y_i\) values are themselves 'noisy' perturbations of the true function values \(f(x_i)\) that are modeled by a GP. So the likelihood is: \(y_i | x_i \sim \mathcal{N}(f(x_i), \sigma^2)\). For this problem, you can assume the likelihood noise standard deviation is \(\sigma = 0.1\).

Write Python code to sample the function values \(\{f_i\}_{i=1}^G\) that correspond to a given grid of new \(\{x^{\text{grid}}_i\}_{i=1}^G\) values from a Gaussian Process (GP) posterior given this training data. Use the same GP prior specification as in Problem 1 (same mean function, same possible kernel functions).
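
For reference, the standard predictive equations for noisy observations from R&W Ch. 2, written here assuming a zero-mean prior (add your mean function back in if it is nonzero), are:

$$ \bar{f}_* = K(X_*, X)\left[ K(X, X) + \sigma^2 I \right]^{-1} y $$

$$ \text{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\left[ K(X, X) + \sigma^2 I \right]^{-1} K(X, X_*) $$

where \(X\) collects the training inputs, \(X_*\) collects the test grid points, and \(K(\cdot, \cdot)\) is the kernel matrix between two sets of inputs. Sampling from the posterior then amounts to drawing from \(\mathcal{N}(\bar{f}_*, \text{cov}(f_*))\).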

Use a one-dimensional grid of test x points between -20 and 20, with at least 200 grid points.

def draw_GP_posterior_samples_at_x_grid(
        x_train_N, y_train_N, x_grid_G, mean_func, cov_func,
        sigma=0.1,
        random_seed=42,
        n_samples=1):
    """ Draw sample from GP posterior given training data and mean/cov

    Args
    ----
    x_train_N : 1D array, size n_train_examples (N)
        Each entry i provides the x value observed at training example i 
    y_train_N : 1D array, size n_train_examples (N)
        Each entry i provides the y value observed at training example i 
    sigma : scalar float
        Specifies the standard deviation of the likelihood.
        y_i drawn from a 1D Normal with mean f(x_i) and std. dev. \sigma.
    Other args same as earlier function: draw_GP_prior_samples_at_x_grid

    Returns
    -------
    f_SG : 2D array, n_samples (S) x n_grid_pts (G)
        Contains sampled function values at each point of x_grid
    """
    TODO
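
Once the TODO body is filled in, a call might look like the following sanity check. The mean/kernel helpers shown here (a zero mean via np.zeros_like, plus the matern_kernel_func sketch from the hints above with functools.partial fixing its hyperparameters) are just one possible convention, not a required interface.

import numpy as np
from functools import partial

x_train_N = np.asarray([-2., -1.8, -1., 1., 1.8, 2.])
y_train_N = np.asarray([-3., 0.2224, 3., 3., 0.2224, -3.])
x_grid_G = np.linspace(-20, 20, 200)

mean_func = lambda x: np.zeros_like(x)                            # zero-mean prior
cov_func = partial(matern_kernel_func, length_scale=4.0, nu=8.0)  # kernel sketch from the hints

f_SG = draw_GP_posterior_samples_at_x_grid(
    x_train_N, y_train_N, x_grid_G, mean_func, cov_func,
    sigma=0.1, random_seed=42, n_samples=5)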

For Problem 2, your report PDF should include:

2a. 1 row x 3 col grid of plots, where each panel shows 5 posterior samples using a squared exponential kernel

  • For the columns, try 3 possible length scale values for \(\ell\): 0.25, 1.0, and 4.0

2b. Short answer (1-3 sentences): What happens at \(x=0\) that differs across the three different choices of \(\ell\)? Why?

2c. 3 row x 3 col grid of plots, where each panel shows 5 posterior samples using a Matern kernel

  • For the columns, try 3 possible length scale values for \(\ell\): 0.25, 1.0, and 4.0
  • For the rows, try 3 possible \(\nu\) values: 0.5, 2, 8

For the remaining questions, consider the Matern kernel with \(\ell=4\) and \(\nu=8\). Use the posterior given the training data above.

2d. Calculation: What is the expected value of the random variable \(f(x)\) under the posterior when \(x=+15\)? What formula did you use? How does this calculation compare to your plot from 2c?

2e. Short answer: What do you expect the expected value would be at \(x=+100\)? No calculation needed, just make an argument.

2f. Calculation: What is the variance of the random variable \(f(x)\) under the posterior when \(x=+15\)? What formula did you use? How does this calculation compare to your plot from 2c?

2g. Short answer: What do you expect the variance would be at \(x=+100\)? No calculation needed, just make an argument.

2h. Short answer (provide an equation or two): How would you specify the GP prior so that the expected value of \(f(x)\) under the posterior was equal to 5.0 at all input values \(x\) far from the training data?

2i. Short answer (provide an equation or two): How would you specify the GP prior so that the standard deviation of \(f(x)\) under the posterior was equal to 4.0 at all input values \(x\) far from the training data?

Coding Clarifications

For both problems, your implementation should not use any third-party Gaussian Process libraries. Instead, write it from scratch by turning the math/pseudocode provided in resources like R&W Ch. 2 into working Python code.

You may use existing libraries (e.g. numpy, scipy) for basic functions like:

  • random number generation
  • array/matrix operations (product, inverse, linear solve, cholesky, eigenvalues)
  • special functions (like bessel or gamma)

We are unlikely to run your code, but we may read it, so it must clearly solve the problem at hand.

We reserve the right to have you reproduce key parts of your code at a whiteboard without references if we have questions about integrity. You should be able to explain any code you turn in (because you must have developed it yourself).