HW2: Bayesian Neural Nets + Hamiltonian Monte Carlo


Last modified: 2019-09-24 23:08

Due date:

  • Updated: Mon Sept. 30, 2019 at 11:59 PM
  • Originally: Wed Sept. 25, 2019 at 11:59 PM

Status: RELEASED

What to turn in:

Your PDF should include (in order):

  • Your full name
  • Collaboration statement
  • About how many hours did you spend (coding, asking for help, writing it up)?
  • About how many hours did your CPU spend on Problems 2 and 3?
  • Problem 1 answer (with parts clearly labeled in the PDF itself and marked via the gradescope annotation tool)
  • Problem 2 answer (with parts clearly labeled in the PDF itself and marked via the gradescope annotation tool)
  • Problem 3 answer (with parts clearly labeled in the PDF itself and marked via the gradescope annotation tool)

Your ZIP file should include

  • Any code you developed, as .py or .ipynb (Jupyter notebook) files.

Questions?: Post to the hw2 topic on the discussion forums.

Jump to: Problem 1   Problem 2   Problem 3   Template Code for HMC   Debugging Tips

Background

We want to solve a regression problem, where our observed "inputs" are vectors \(x\) in \(D\)-dimensional space, and the observed "outputs" are scalars \(y \in \mathbb{R}\). We'll have a training set of \(N\) observations: \(\{x_n, y_n\}_{n=1}^N\), where each input vector \(x_n\) has size \(D\) and each output \(y_n\) has size 1.

We consider a general feed-forward neural network (multi-layer perceptron) with \(L\) total layers (including the output layer and \(L-1\) hidden layers). We'll index each layer with an integer identifier \(\ell \in \{1, 2, \ldots, L\}\). When convenient, we'll denote the "input" (the vector \(x_n\)) as a layer with index \(\ell=0\).

To help us keep track of things, we'll define our layers such that each layer \(\ell\) produces a vector \(h^{(\ell)}\) with size given by the integer \(J^\ell\). Let integer \(J^0 (= D)\) define the input dimension. Then let \(J^1, J^2, \ldots J^{L-1}\) define the hidden layer sizes. Finally, \(J^L\) defines the output layer size, which we always fix to \(J^L = 1\).

Hidden layer 1: The 1st layer takes input of size \(J^0\) (which equals \(D\), the input dimension) and produces a vector \(h^{(1)}\) of size \(J^1\). The parameters of this layer are the weights \(w^{(1)} \in \mathbb{R}^{J^1 \times J^0}\) and the bias \(b^{(1)} \in \mathbb{R}^{J^1}\).

Each hidden unit of layer one, indexed by \(j\), produces a scalar value \(h^{(1)}_{j}\) by multiplying the input vector \(x\) by a weight vector, adding a bias, and feeding the resulting scalar through an activation function:

$$ h^{(1)}_{j}(x, w, b) = \text{activation}\left( b^{(1)}_{j} + \sum_{k=1}^{J^0} w^{(1)}_{j,k} x_{k} \right) $$

Hidden layers 2, 3, ... L-1: Each remaining hidden layer, indexed by \(\ell\), takes input of size \(J^{\ell-1}\) and produces output of size \(J^{\ell}\). The parameters of this layer are the weights \(w^{(\ell)} \in \mathbb{R}^{J^{\ell} \times J^{\ell-1}}\) and the bias vector \(b^{(\ell)} \in \mathbb{R}^{J^\ell}\).

Each hidden unit of layer \(\ell\), indexed by \(j\), is produced by weight multiplication, adding a bias, then feeding through an activation function:

$$ h^{(\ell)}_{j}(x, w, b) = \text{activation}\left( b^{(\ell)}_{j} + \sum_{k=1}^{J^{(\ell-1)}} w^{(\ell)}_{j,k} h^{(\ell-1)}_{k}(x,w,b) \right) $$

Final output layer: At the final layer with index \(L\), we take input of size \(J^{L-1}\) and produce a scalar predicted output. The parameters are weights \(w^{(L)} \in \mathbb{R}^{1 \times J^{L-1}}\) and bias \(b^{(L)} \in \mathbb{R}^{1}\).

The scalar output value \(f(x, w, b)\) is given as follows:

$$ f(x, w, b) = b^{(L)}_{1} + \sum_{k=1}^{J^{(L-1)}} w^{(L)}_{1, k} h^{(L-1)}_{k}(x,w,b) $$

Note: there is NOT any activation function applied to the output layer (because this is regression, and the output could be any real value).

Useful special case: If we set \(L=1\), there are no hidden layers, and our proposed prediction model reduces to "linear regression", with a \(D\)-dimensional set of weights and a scalar bias parameter:

$$ f(x, w, b) = b_1 + \sum_{d=1}^{D} w_{1,d} x_{d} \qquad \text{(when no hidden layers)} $$
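
To make these equations concrete, here is a minimal NumPy sketch of the full forward pass. It assumes the parameters are stored as a list of (w, b) pairs, one per layer; this storage format and the name forward are just one possible choice, not a required part of the assignment.

import numpy as np

def forward(x_ND, layer_params, activation=np.tanh):
    """ Compute f(x, w, b) for each of N input rows (a sketch, not required code).

    Args
    ----
    x_ND : 2D array, shape (N, D)
    layer_params : list of (w, b) pairs for layers 1, ..., L
        w has shape (J_out, J_in), b has shape (J_out,)
    activation : element-wise nonlinearity, applied at hidden layers only

    Returns
    -------
    f_N : 1D array, shape (N,), the scalar output f(x_n, w, b) for each row
    """
    h_NJ = x_ND
    for w, b in layer_params[:-1]:
        # Hidden layers: linear transform, then element-wise activation
        h_NJ = activation(np.dot(h_NJ, w.T) + b)
    w_L, b_L = layer_params[-1]
    # Output layer: NO activation (regression outputs can be any real value)
    return (np.dot(h_NJ, w_L.T) + b_L).squeeze(axis=-1)

When layer_params has only one entry (the [] architecture), the loop body never runs and this reduces to the linear regression special case above.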

Possible Activation Functions

Any non-linear element-wise function could be used to produce useful neural network predictions.

Common choices:

  • ReLU: \(\text{relu}(z) = \text{max}(0, z)\)
  • SquaredExponential (aka 'RBF'): \(\text{sqexp}(z) = \exp(-z^2)\)
  • Tanh: Hyperbolic tangent function -- see numpy's np.tanh
  • Sigmoid: \(\text{sigmoid}(z) = 1 / (1 + \exp(-z))\)
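
For reference, each of these can be written as a one-line NumPy function (the names below are our own shorthand):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # max(0, z), element-wise

def sqexp(z):
    return np.exp(-np.square(z))       # squared exponential, aka 'RBF'

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # logistic sigmoid

# For tanh, numpy's built-in np.tanh can be used directly.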

Possible Architectures

A general neural network for regression with \(L\) total layers will have \(L-1\) hidden layers, each of which may have a different number of hidden units. We can specify the hidden layer sizes as a list of integers, like this:

  • [5] means there is one hidden layer with 5 units
  • [5, 3] means there are two hidden layers, the first has 5 units and second has 3 units.
  • [] means there are no hidden layers and no hidden units (equivalent to linear regression)

In general, the number of hidden layers (L-1) is equal to the length of the list, and the size of the \(\ell\)-th layer is given by the integer at position \(\ell\) in the list.
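
As a sanity check on this convention, here is a small sketch (the helper name is our own, not part of the assignment) that turns such a list into the (output size, input size) shape of each layer's weight matrix:

def get_layer_shapes(n_hiddens_per_layer_list, D=1):
    """ Map an architecture list to per-layer weight matrix shapes.

    Examples (with D=1):
    [5, 3] gives [(5, 1), (3, 5), (1, 3)]
    []     gives [(1, 1)], i.e. plain linear regression
    """
    sizes = [D] + list(n_hiddens_per_layer_list) + [1]
    return [(sizes[ell + 1], sizes[ell]) for ell in range(len(sizes) - 1)]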

Prior: Beliefs about the weights

The prior is a distribution over weights (and biases).

Across all layers, we will generally assume a simple independent prior for each weight scalar value and bias scalar value, which is Gaussian with mean zero and variance one:

\begin{align} p(w, b) = \prod_{\ell=1}^L \prod_{j=1}^{J^{\ell}} \prod_{k=1}^{J^{\ell-1}} \mathcal{N}( w^{(\ell)}_{j,k} \mid 0, 1) \cdot \prod_{\ell=1}^L \prod_{j=1}^{J^\ell} \mathcal{N}( b^{(\ell)}_j \mid 0, 1) \end{align}

Likelihood: Beliefs about outputs given inputs and weights

The likelihood is a distribution over the output \(y_n\) (a scalar value) of the regression when given an input \(x_n\) (vector of size D) and some weights and biases.

We will model a dataset of size \(N\) as having drawn each output independently given its input and the weights.

Each independent density will be a Gaussian, where the scalar value produced by the feed-forward neural network -- \(f(x_n, w, b)\) -- is the mean:

\begin{align} p(y|x, w, b) &= \prod_{n=1}^N p(y_n | x_n, w, b) \\ &= \prod_{n=1}^N \mathcal{N}(y_n \mid f(x_n, w, b), \sigma^2) \end{align}

Here \(\sigma\) is a standard deviation hyperparameter controlling the observation noise level. Its value will be assumed "known" throughout.
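
As one concrete example, the log of this likelihood can be evaluated in a few NumPy lines, reusing the (hypothetical) forward helper sketched earlier; treat this as a sketch under those naming assumptions, not required code:

import numpy as np

def calc_log_likelihood(y_N, x_ND, layer_params, sigma=0.1, activation=np.tanh):
    """ Evaluate log p(y | x, w, b) under the Gaussian likelihood above. """
    # Mean of each Gaussian is the BNN's predicted output f(x_n, w, b)
    f_N = forward(x_ND, layer_params, activation=activation)
    # Sum of Gaussian log densities, one per training example
    return np.sum(
        -0.5 * np.log(2.0 * np.pi * sigma ** 2)
        - 0.5 * np.square(y_N - f_N) / (sigma ** 2))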

Problem 1: Sampling from a BNN Prior

Write Python code to sample function values produced by the output of a general BNN regression architecture, where the weight parameters and biases at every layer each have an independent Gaussian prior -- Normal(mean=0, variance=1).

You should sample the function values that correspond to a set of at least 200 evenly-spaced test points \(\{x_i\}\) between -20 and 20. One way to generate a 1D array of \(G\) points would be: x_grid_G = np.linspace(-20, 20, G).

To demonstrate your implementation, you'll make plots of sampled function values (just like in HW1). Each individual plot should show a line plot of the test grid points \(x_i\) and the corresponding sampled function values \(f_i = f(x_i, w, b)\). Use the matplotlib line style '-' to draw the connecting lines between consecutive \((x_i, f_i)\) pairs (do not show the dots themselves -- individual dots can make it hard to observe qualitative trends).

For Problem 1, your report PDF should include:

a. 4-row x 3-column grid of plots, where each panel shows 7 samples from the prior (see the plotting sketch after part b)

  • For the rows, try 4 different architectures: [2], [10], [2,2], [10, 10]
  • For the columns, try 3 different activation functions: ReLU, tanh, and squared exponential (aka 'RBF')

b. Short text description of the qualitative trends you observe. How does a deeper network impact the function shape? How does the activation function impact function shape? A few short but complete sentences.
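
To produce the grid of plots described in part (a), a minimal matplotlib sketch might look like the following. Here draw_prior_function_samples is a placeholder name for the sampling code you write yourself (draw all weights and biases from Normal(0, 1), then run the forward pass on the grid), and relu / sqexp refer to the activation helpers sketched earlier.

import matplotlib.pyplot as plt
import numpy as np

G = 200
x_grid_G = np.linspace(-20, 20, G)
archs = [[2], [10], [2, 2], [10, 10]]
activations = [('relu', relu), ('tanh', np.tanh), ('sqexp', sqexp)]

fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(12, 12), sharex=True, sharey=True)
for row, arch in enumerate(archs):
    for col, (name, act_func) in enumerate(activations):
        for _ in range(7):
            # Placeholder: sample (w, b) from the prior, then evaluate f on the grid
            f_grid_G = draw_prior_function_samples(x_grid_G, arch, act_func)
            axes[row, col].plot(x_grid_G, f_grid_G, '-')
        axes[row, col].set_title('arch=%s, act=%s' % (arch, name))
plt.show()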

Develop your own HMC implementation

Your goal is to write an implementation of HMC that can sample from the posterior over weights and biases given some training data for a regression problem. Hint: the pseudocode algorithm on Page 14 of Neal's Handbook of MCMC Chapter on HMC is a helpful resource: https://arxiv.org/pdf/1206.1901.pdf#page=14

You may use automatic differentiation tools (like autograd, PyTorch, or Tensorflow) as you wish. We recommend autograd as a simple option that has been demonstrated in class. See 'Part 5' of the Jupyter Notebook we introduced in class for examples of useful NN data structures and gradient descent training of NNs with autograd: https://github.com/tufts-ml-courses/comp150-bdl-19f-assignments/blob/master/notebooks/intro_to_autograd_and_neural_net_training.ipynb

You may refer to existing HMC implementations you find online for high-level understanding, but you must write your own code and be able to defend any code you submit as your original work. Working with a small set of fellow students from this class (at most 2 partners other than yourself) is encouraged, provided you abide by the collaboration policy.

You should think carefully about how you set the step_size and the number of leapfrog steps for an HMC proposal. You may need to try several values and find the one that performs best.

Problem 2: Sample from Posterior using HMC for a BNN with 0 hidden layers

Consider the following training data with \(N=6\) example pairs of \(x\) and \(y\) values (not quite the same as HW1, but close):

x_train_N = np.asarray([-2.,    -1.8,   -1.,  1.,  1.8,     2.])
y_train_N = np.asarray([-2.,  1.2224,    4.,  4.,  1.2224, -2.])

Here, we'll first try a simple prediction model: Bayesian linear regression (BayesLR). While BayesLR isn't a "smart" model for this data (clearly y has a non-linear dependency on x), it will help us sanity check our implementation and provide a baseline for later improvements.

In this problem, you'll use your HMC implementation to sample from the posterior.

Then, we'll compare results to the "analytical posterior" (since with Bayesian Linear Regression, the true posterior is Normal with known formulas for the mean and covariance).

If you want the formulas required for the exact posterior for Bayesian linear regression, see here: https://cedar.buffalo.edu/~srihari/CSE574/Chap3/3.4-BayesianRegression.pdf#page=10

Remember that for linear regression with a bias, the feature transform is a 2D vector: \(\phi(x_n) = [x_n ~~ 1]\)
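
Here is a minimal sketch of those formulas, assuming a Normal(0, 1) prior on both parameters (prior precision 1) and likelihood noise level sigma; the function and variable names are just illustrative:

import numpy as np

def calc_exact_posterior(x_N, y_N, sigma=0.1):
    """ Exact posterior over [weight, bias] for Bayesian linear regression.

    Returns the posterior mean (shape (2,)) and covariance (shape (2, 2)).
    """
    phi_N2 = np.stack([x_N, np.ones_like(x_N)], axis=1)  # feature transform [x_n, 1]
    alpha = 1.0                 # prior precision (Normal(0, 1) prior)
    beta = 1.0 / (sigma ** 2)   # likelihood precision
    cov_22 = np.linalg.inv(alpha * np.eye(2) + beta * np.dot(phi_N2.T, phi_N2))
    mean_2 = beta * np.dot(cov_22, np.dot(phi_N2.T, y_N))
    return mean_2, cov_22

Samples from this exact posterior (for the 'Exact Posterior' panels below) can then be drawn with np.random.multivariate_normal(mean_2, cov_22, size=1000).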

Implementation Details:

  • Fix the BNN architecture to [], which means 0 hidden layers. (Note: this is equivalent to Bayesian linear regression, where you learn one weight scalar and one bias scalar.)
  • Use a Normal(0, 1) prior for all weights and biases.
  • Use sigma=0.1 for the likelihood's Gaussian noise level.
  • Run at least 3 chains (runs of HMC started from different random initializations), each for at least 1000 iterations (after burn-in)
  • Remember to make sure each chain has "burned in", i.e., transitioned from the initial parameters to the "typical set".

Instructions: For Problem 2, your report PDF should include:

2a. Diagnostic plot of "potential energy" vs. sampler iteration

  • 2 panels (1 row x 2 column)
  • X-axis: Sampler's iteration number (at least 1000 iterations)
  • Y-axis: Potential energy (aka negative log joint probability)
  • First panel 'HMC': Show one line for each chain (3 total chains)
  • Second panel 'Exact Posterior': Show one line evaluating the potential energy of 1000 samples from the exact posterior

2b. Plot of posterior samples of regression function

Out of all the 1000+ samples of weights/biases from each chain, pick 10 at random to plot.

  • 4 panels (1 row x 4 columns)
  • X-axis: 1D input "x" to the regression problem (use regularly-spaced grid)
  • Y-axis: Predicted output of the BNN function \(f(x, w, b)\)
  • First 3 panels: each should show \(f(x, w^s, b^s)\) for 10 samples (indexed by \(s\)) from the "posterior" of one chain of HMC
  • Last panel: should show 10 samples from the exact posterior

As much as possible, avoid showing samples in the transient burn-in phase of the sampler.

From weight samples to a distribution on \(f(x)\)

Consider the posterior distribution of \(f(x)\) for a given test input \(x\). Using all 1000+ samples from your posterior (indexed by \(s\)), we can estimate the mean and variance of the random variable \(f(x)\) under the posterior as follows:

\begin{align} \mathbb{E}[f(x)] &\approx \texttt{mean}( \{ f(x, w^s, b^s) \}_{s=1}^S ) \\ \mathbb{V}[f(x)] &\approx \texttt{var}( \{ f(x, w^s, b^s) \}_{s=1}^S ) \end{align}

2c. Plot of smoothed mean of regression function

  • 4 panels (1 row x 4 columns)
  • X-axis: 1D input "x" to the regression problem (use regularly-spaced grid)
  • Y-axis: Predicted output of the BNN function \(f(x, w, b)\)
  • First 3 panels: each should show the mean +/- 2 standard deviations of the empirical distribution over samples for one chain of HMC
  • Last panel: show the exact posterior's mean +/- 2 standard deviations (using the analytical formulas)

For showing means and standard deviations, see matplotlib's fill_between function.

As much as possible, avoid showing samples in the transient burn-in phase of the sampler.
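
A minimal sketch of one such panel (the function name is ours; f_SG is assumed to be a 2D array holding \(f(x_g, w^s, b^s)\) with one row per retained posterior sample and one column per grid point):

import matplotlib.pyplot as plt
import numpy as np

def plot_mean_plus_minus_2_stddev(ax, x_grid_G, f_SG):
    """ Plot mean +/- 2 stddev of sampled function values on a given axes. """
    mean_G = np.mean(f_SG, axis=0)     # posterior mean of f at each grid point
    stddev_G = np.std(f_SG, axis=0)    # posterior std. dev. at each grid point
    ax.fill_between(
        x_grid_G, mean_G - 2 * stddev_G, mean_G + 2 * stddev_G, alpha=0.3)
    ax.plot(x_grid_G, mean_G, '-')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')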

Problem 3: Sample from Posterior using HMC for a BNN with 1 hidden layer

Consider the same training data as in Problem 2.

Now try a more interesting model: a BNN with 1 hidden layer of 10 units, plus the output layer. This should produce much more accurate predictions on the training data.

Implementation Details:

  • Fix the BNN architecture to [10], which means 10 hidden units in one hidden layer.
  • Use a Normal(0, 1) prior for all weights and biases.
  • Use sigma=0.1 for the likelihood's Gaussian noise level.
  • Run at least 3 chains each for at least 2000 total iterations (remember to worry about burnin).

Instructions: For Problem 3, your report PDF should include:

3a. Brief comment on what settings worked for you (step size, number of leapfrog steps, number of iterations)

Just a few sentences

3b. Plot of posterior samples of regression function

Out of all the 1000+ samples of weights/biases from each chain, pick 10 at random to plot.

  • 3 panels (1 row x 3 columns)
  • X-axis: 1D input "x" to the regression problem (use regularly-spaced grid)
  • Y-axis: Predicted output of the BNN function \(f(x, w, b)\)
  • First 3 panels: each should show \(f(x, w^s, b^s)\) for 10 samples (indexed by \(s\)) from the "posterior" of one chain of HMC

3c. Plot of smoothed mean of regression function

  • 3 panels (1 row x 3 columns)
  • X-axis: 1D input "x" to the regression problem (use regularly-spaced grid)
  • Y-axis: Predicted output of the BNN function \(f(x, w, b)\)
  • First 3 panels: each should show the mean +/- 2 standard deviations of the empirical distribution over samples for one chain of HMC

Use fill_between, as before in 2c.

Template Python Code

The following two function templates might help you solve Problem 2. You are not required to use either of these, but they may help.

See also the pseudocode found on Page 14 of Neal's paper: https://arxiv.org/pdf/1206.1901.pdf#page=14.

Brief template code intro: we assume we have defined Python functions that can

  • Calculate the kinetic energy, given momentum values
  • Calculate the potential energy of BNN regression, given some bnn parameter values (aka 'position')
  • Calculate the gradient of the potential energy (perhaps via autograd)

(You'll need to write each of these functions).

Given these functions, we can build an HMC sampler using the run_HMC_sampler and make_proposal_via_leapfrog_steps functions defined below.

Template: Run HMC Sampler for many iterations

import time

import numpy as np


def run_HMC_sampler(
        init_bnn_params=None,
        n_hmc_iters=100,
        n_leapfrog_steps=1,
        step_size=1.0,
        random_seed=42,
        calc_potential_energy=None,
        calc_kinetic_energy=None,
        calc_grad_potential_energy=None,
        ):
    """ Run HMC sampler for many iterations (many proposals)

    Returns
    -------
    bnn_samples : list
        List of samples of NN parameters produced by HMC
        Can be viewed as 'approximate' posterior samples if chain runs to convergence.
    info : dict
        Tracks energy values at each iteration and other diagnostics.

    References
    ----------
    See Neal's pseudocode algorithm for a single HMC proposal + acceptance:
    https://arxiv.org/pdf/1206.1901.pdf#page=14

    This function repeats many HMC proposal steps.
    """
    # Create random-number-generator with specific seed for reproducibility
    prng = np.random.RandomState(int(random_seed))

    # Set initial bnn params
    cur_bnn_params = init_bnn_params
    cur_potential_energy = calc_potential_energy(cur_bnn_params)

    bnn_samples = list()
    # TODO make lists to track energies over iterations

    n_accept = 0
    start_time_sec = time.time()  # Used below to report elapsed wall-clock time
    for t in range(n_hmc_iters):
        # Draw momentum for CURRENT configuration
        cur_momentum_vec = None  # TODO draw momentum from a standard normal using prng

        # Create PROPOSED configuration
        prop_bnn_params, prop_momentum_vec = make_proposal_via_leapfrog_steps(
            cur_bnn_params, cur_momentum_vec,
            n_leapfrog_steps=n_leapfrog_steps,
            step_size=step_size,
            calc_grad_potential_energy=calc_grad_potential_energy)

        # TODO Compute probability of accepting the proposal
        # TODO You'll need to use the kinetic and potential energy functions
        accept_proba = 0.0  # (Placeholder: replace with the Metropolis acceptance probability)

        # Draw random value from (0,1) to determine if we accept or not
        if prng.rand() < accept_proba:
            # If here, we accepted the proposal
            n_accept += 1

            # TODO what current state needs to be updated?

        # Update list of samples from "posterior"
        bnn_samples.append(cur_bnn_params)
        # TODO update energy tracking lists

        # Print some diagnostics every 50 iters
        if t < 5 or ((t+1) % 50 == 0) or (t+1) == n_hmc_iters:
            accept_rate = float(n_accept) / float(t+1)
            print("iter %6d/%d after %7.1f sec | accept_rate %.3f" % (
                t+1, n_hmc_iters, time.time() - start_time_sec, accept_rate))

    return (
        bnn_samples,
        dict(
            n_accept=n_accept,
            n_hmc_iters=n_hmc_iters,
            accept_rate=accept_rate),
        )

Template: Construct a Single HMC Proposal

import copy


def make_proposal_via_leapfrog_steps(
        cur_bnn_params, cur_momentum_vec,
        n_leapfrog_steps=1,
        step_size=1.0,
        calc_grad_potential_energy=None):
    """ Construct one HMC proposal via leapfrog integration

    Returns
    -------
    prop_bnn_params : same type/size as cur_bnn_params
    prop_momentum_vec : same type/size as cur_momentum_vec

    """
    # Initialize proposed variables as copies of current values
    prop_bnn_params = copy.deepcopy(cur_bnn_params)
    prop_momentum_vec = copy.deepcopy(cur_momentum_vec)

    # TODO: half step update of momentum
    # This will use the grad of potential energy (use provided function)

    for step_id in range(n_leapfrog_steps):
        # TODO: full step update of 'position' (aka bnn_params)
        # This will use the grad of kinetic energy (has simple closed form)

        if step_id < (n_leapfrog_steps - 1):
            # TODO: full step update of momentum
            pass
        else:
            # Special case for the final leapfrog step
            # TODO: half step update of momentum
            pass


    # TODO: don't forget to flip sign of momentum (ensure symmetry)

    return prop_bnn_params, prop_momentum_vec
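
Once you've written the three energy functions, a call might look like the following (the energy-function names are placeholders for your own code, and the step size / leapfrog settings shown are only starting points to tune):

bnn_samples, info = run_HMC_sampler(
    init_bnn_params=init_bnn_params,                # e.g. a draw from the prior
    n_hmc_iters=2000,
    n_leapfrog_steps=20,                            # tune this
    step_size=0.001,                                # and this
    random_seed=101,
    calc_potential_energy=calc_potential_energy,    # you write this
    calc_kinetic_energy=calc_kinetic_energy,        # and this
    calc_grad_potential_energy=calc_grad_potential_energy,  # e.g. via autograd
    )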


Debugging Tips

Here are some tricks to simplify your life if you are having trouble:

  • Before you try HMC, see if you can get a simpler random-walk Metropolis-Hastings sampler to work. This should be quite easy (just a few lines of code -- see the sketch after these tips), and it lets you verify that you understand how to evaluate acceptance probabilities for the BNN model.

  • For Problem 2, the variables you are sampling (one weight and one bias) are easily plotted in 2D space. Make a scatter plot of samples from your sampler and the "ideal" analytical posterior. Do they look like the same distribution?
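
Here is a minimal sketch of such a random-walk Metropolis sampler. It assumes you can flatten your BNN parameters into a single 1D numpy array and that calc_potential_energy computes the negative log joint probability of that flat vector (both are placeholders for your own code):

import numpy as np

def run_random_walk_MH(init_params_vec, calc_potential_energy,
                       n_iters=1000, proposal_stddev=0.1, random_seed=42):
    """ Random-walk Metropolis sampler over a flat parameter vector.

    Proposes params' = params + Normal(0, proposal_stddev^2) noise and accepts
    with probability min(1, exp(U(params) - U(params'))), where U is the
    potential energy (negative log joint probability).
    """
    prng = np.random.RandomState(random_seed)
    cur_params = init_params_vec
    cur_U = calc_potential_energy(cur_params)
    samples = list()
    for t in range(n_iters):
        prop_params = cur_params + proposal_stddev * prng.randn(*cur_params.shape)
        prop_U = calc_potential_energy(prop_params)
        if prng.rand() < np.exp(cur_U - prop_U):
            cur_params, cur_U = prop_params, prop_U   # accept
        samples.append(cur_params)
    return samples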