Bayesian Deep Learning

Tufts CS Special Topics Course | COMP 150 - 03 BDL | Fall 2018

HW2: Bayesian Neural Nets + Hamiltonian Monte Carlo


PDF writeup due by Thurs Sept 20 at 11:59PM (extended from Tuesday Sept 18). Submit your PDF to Gradescope: https://www.gradescope.com/courses/23483/assignments/98319

Your PDF should include (in order):

  • Your full name
  • Collaboration statement
  • About how many hours did you spend (coding, asking for help, writing it up)?
  • About how many hours did your CPU spend on Problem 2?
  • Problem 1 answer (with parts clearly labeled)
  • Problem 2 answer (with parts clearly labeled)
  • Any code you developed (as an appendix to the PDF)

Jump to: Problem 1   Problem 2   Template Code for Problem 2   Debugging Tips

Background

We consider a general feed-forward BNN with \(L\) total layers (including the output layer and \(L-1\) hidden layers), indexed \(1, 2, \ldots \ell, \ldots L\). It takes as input a vector \(x\) and produces a real-valued scalar output \(f(x) \in \mathbb{R}\).

The 1st layer takes as input a \(D\)-dimensional data vector: \(x = [x_1, x_2, \ldots x_D]\). Each of the \(J^{(1)}\) hidden units, indexed by \(j\), produces a scalar value by multiplying the input vector \(x\) by a weight vector, adding a bias, and feeding the resulting scalar through an activation function:

$$ h^{(1)}_{j}(x, w, b) = \text{activation}(b^{(1)}_{j} + \sum_{d=1}^D w^{(1)}_{j,d} x_{d} ) $$

If there are 2 or more hidden layers, we'll write that layer \(\ell\) has \(J^{(\ell)}\) units. Each unit produces a scalar in the same fashion: taking as input the \(J^{(\ell-1)}\)-length vector \(h^{(\ell-1)}\) produced by the previous layer, multiplying by unit-specific weights, adding a unit-specific bias, and applying an activation function:

$$ h^{(\ell)}_{j}(x, w, b) = \text{activation}(b^{(\ell)}_{j} + \sum_{k=1}^{J^{(\ell-1)}} w^{(\ell)}_{j,k} h^{(\ell-1)}_{k}(x,w,b) ) $$

At the final layer \(L\), we produce a scalar value \(f(x, w, b)\) via:

$$ f(x, w, b) = b^{(L)}_{1} + \sum_{k=1}^{J^{(L-1)}} w^{(L)}_{k} h^{(L-1)}_{k}(x,w,b) $$
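Stacked over a batch of inputs, this forward pass is only a few lines of numpy. The sketch below is one possible layout, not required code; the name bnn_forward and the list-of-arrays parameter format are illustrative choices.

import numpy as np

def bnn_forward(x_ND, weights, biases, activation=np.tanh):
    """ Compute f(x_n, w, b) for each of the N rows of x_ND.

    weights : list of 2D arrays; entry ell has shape (J^(ell-1), J^(ell))
    biases  : list of 1D arrays; entry ell has shape (J^(ell),)
    The final entries correspond to the linear output layer with 1 unit.
    """
    h_NJ = x_ND
    for W_layer, b_layer in zip(weights[:-1], biases[:-1]):
        # Hidden layer: affine transform followed by the activation function
        h_NJ = activation(np.dot(h_NJ, W_layer) + b_layer)
    # Output layer: affine transform only, no activation
    f_N1 = np.dot(h_NJ, weights[-1]) + biases[-1]
    return f_N1[:, 0]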

Possible Activation Functions

  • Tanh: Hyperbolic tangent function -- see numpy's np.tanh
  • ReLU: \(\text{relu}(z) = \text{max}(0, z)\)
  • SquaredExponential (aka 'RBF'): \(\text{sqexp}(z) = \exp(-z^2)\)
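In numpy, each of these is a one-liner (an illustrative sketch; the dict name is arbitrary and numpy is assumed imported as np, as in the sketch above):

# Candidate activation functions as numpy callables
activation_func_by_name = dict(
    relu=lambda z: np.maximum(0.0, z),
    tanh=np.tanh,
    sqexp=lambda z: np.exp(-np.square(z)),
)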

Possible Architectures

A general neural network for regression with \(L\) total layers has \(L-1\) hidden layers, each possibly with a different number of hidden units. We can specify the sizes of the hidden layers as a list of integers, like this:

  • [] means there are no hidden layers and no hidden units (equivalent to linear regression)
  • [5] means there is one hidden layer with 5 units
  • [5, 3] means there are two hidden layers, the first with 5 units and the second with 3 units.

In general, the number of hidden layers (L-1) is equal to the length of the list, and the size of the \(\ell\)-th layer is given by the integer at position \(\ell\) in the list.
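For the 1D regression problems below, the list of hidden sizes maps to per-layer dimensions as in this small hypothetical helper (the name is illustrative):

def get_layer_sizes(hidden_sizes, n_input_dims=1, n_output_dims=1):
    """ e.g. hidden_sizes=[5, 3] with 1D input and scalar output -->  [1, 5, 3, 1] """
    return [n_input_dims] + list(hidden_sizes) + [n_output_dims]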

Likelihood model

We can use the scalar value produced by the feed-forward neural network as the mean of a Gaussian likelihood distribution that explains the observed input/output training data pairs \((x_n, y_n)\):

\begin{align} p(y|x, w, b) &= \prod_{n=1}^N p(y_n | x_n, w, b) \\ &= \prod_{n=1}^N \mathcal{N}(y_n \mid f(x_n, w, b), \sigma^2) \end{align}

where \(\sigma\) is a given standard deviation hyperparameter.
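For reference, here is a sketch of the resulting log likelihood, reusing the hypothetical bnn_forward sketch from above (numpy assumed imported as np):

def calc_log_likelihood(weights, biases, x_ND, y_N, sigma=0.1):
    """ log p(y | x, w, b), summed over the N training pairs. """
    f_N = bnn_forward(x_ND, weights, biases)
    return np.sum(
        -0.5 * np.log(2 * np.pi * np.square(sigma))
        - 0.5 * np.square(y_N - f_N) / np.square(sigma))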

Problem 1: Sampling from a BNN Prior

Write Python code to sample function values produced by the output of a general BNN regression architecture, where the weight parameters and biases at every layer each have an independent Gaussian prior -- Normal(mean=0, variance=1).

You should sample the function values that correspond to a set of at least 200 evenly-spaced test points \(\{x_i\}\) between -20 and 20. One way to generate a 1D array of \(G\) points would be: x_grid_G = np.linspace(-20, 20, G).

To demonstrate your implementation, you'll make plots of sampled function values (just like in HW1). Each individual plot should show a line plot of the test grid points \(x_i\) and the corresponding sampled function values \(f_i = f(x_i)\). Use the matplotlib line style '-' to emphasize the connecting lines between consecutive \((x_i, f_i)\) pairs (showing the individual dots themselves can make it hard to observe qualitative patterns).
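One possible way to draw and plot prior samples, again assuming the hypothetical bnn_forward sketch from the Background section (all other names here are illustrative too):

import matplotlib.pyplot as plt
import numpy as np

def draw_bnn_params_from_prior(prng, layer_sizes):
    """ Draw every weight and bias from Normal(mean=0, variance=1).
    layer_sizes : list like [1, 10, 1] (input dim, hidden sizes, output dim)
    """
    weights = [prng.randn(J_in, J_out)
               for J_in, J_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [prng.randn(J_out) for J_out in layer_sizes[1:]]
    return weights, biases

G = 200
x_grid_G = np.linspace(-20, 20, G)
prng = np.random.RandomState(0)
for _ in range(5):
    weights, biases = draw_bnn_params_from_prior(prng, [1, 10, 1])
    f_G = bnn_forward(x_grid_G.reshape(G, 1), weights, biases, activation=np.tanh)
    plt.plot(x_grid_G, f_G, '-')
plt.show()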

For Problem 1, your report PDF should include:

a. 4 row x 3 column grid of plots, where each panel shows 5 samples from the prior

  • For the rows, try 4 different architectures: [2], [10], [2,2], [10, 10]
  • For the columns, try 3 different activation functions: ReLU, tanh, and squared exponential (aka 'RBF')

b. Short text description of the qualitative trends you observe. How does a deeper network impact the function shape? How does the activation function impact function shape? A few short but complete sentences.

Problem 2: Sample from Posterior using your own HMC implementation

Consider the following training data with \(N=6\) example pairs of \(x\) and \(y\) values (as in HW1):

x_train_N = np.asarray([-2.,    -1.8,   -1.,  1.,  1.8,     2.])
y_train_N = np.asarray([-3.,  0.2224,    3.,  3.,  0.2224, -3.])

Your goal is to write an implementation of HMC that can sample from the posterior given this data. Hint: the pseudocode algorithm on Page 14 of Neal's Handbook of MCMC Chapter on HMC is a helpful resource: https://arxiv.org/pdf/1206.1901.pdf#page=14

You may use automatic differentiation tools (like autograd, PyTorch, or Tensorflow) as you wish. We recommend autograd as a simple option that has been demonstrated in class. See 'Part 5' of the Jupyter Notebook we worked on in class for examples of useful NN data structures and gradient descent training of NNs with autograd: intro_to_autograd_and_neural_net_training.ipynb.
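For example, with autograd the gradient function can be obtained in one line once the potential energy is written with autograd.numpy operations. This is only a sketch; calc_potential_energy here is a stub you would fill in yourself.

import autograd.numpy as np   # drop-in numpy replacement that supports autodiff
from autograd import grad

def calc_potential_energy(bnn_params):
    """ Negative log joint: -log p(w, b) - log p(y | x, w, b).
    TODO: implement using autograd.numpy operations only. """
    raise NotImplementedError

calc_grad_potential_energy = grad(calc_potential_energy)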

You should think carefully about how you set the step_size and the number of leapfrog steps for an HMC proposal. You may need to try several settings and keep the one that performs best.

You may refer to existing HMC implementations you find online for high-level understanding, but you must write your own code and be able to defend any code you submit as your original work. Working with a small set of fellow students from this class (at most 2 partners) is encouraged, provided you abide by the collaboration policy.

Implementation Details:

  • Fix the BNN architecture to one hidden layer of 10 units, with tanh as the activation function.
  • Use a Normal(0, 1) prior for all weights and biases (as in Problem 1).
  • Use sigma=0.1 for the likelihood's Gaussian noise level.
  • Run at least 3 chains, each for at least 2000 total iterations (remember to account for burn-in).
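Putting these settings together, a rough driver loop might look like the sketch below. It uses the run_HMC_sampler template further down; draw_init_bnn_params is a hypothetical initialization helper, and the step_size and n_leapfrog_steps values are only starting guesses you will need to tune.

all_chain_results = list()
for chain_seed in [101, 202, 303]:           # at least 3 chains
    bnn_samples, info = run_HMC_sampler(
        init_bnn_params=draw_init_bnn_params(chain_seed),  # hypothetical helper
        n_hmc_iters=2000,                    # at least 2000 total iterations
        n_leapfrog_steps=20,                 # starting guess only; tune this
        step_size=0.001,                     # starting guess only; tune this
        random_seed=chain_seed,
        calc_potential_energy=calc_potential_energy,
        calc_kinetic_energy=calc_kinetic_energy,
        calc_grad_potential_energy=calc_grad_potential_energy)
    all_chain_results.append((bnn_samples, info))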

For Problem 2, your report PDF should include:

a. Plot of the "potential energy" (aka negative log joint probability) vs. iteration, for each of your chains. This should be one plot with multiple lines.

b. Plot of sampled function values from the "posterior", for multiple chains. Show 10 samples per plot. Avoid showing samples in the transient burn-in phase of the sampler.

c. Plot of the empirical mean of many samples from the "posterior". Also show +/- 2 standard deviations (see matplotlib's fill_between function). Avoid showing samples in the transient burn-in phase of the sampler.
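For part (c), something like the following works once you have stacked the post-burn-in function values on the test grid into a 2D array (called f_SG here, with S retained samples and G grid points; these names are illustrative):

import matplotlib.pyplot as plt
import numpy as np

mean_G = np.mean(f_SG, axis=0)
stddev_G = np.std(f_SG, axis=0)
plt.plot(x_grid_G, mean_G, '-')
plt.fill_between(x_grid_G, mean_G - 2 * stddev_G, mean_G + 2 * stddev_G, alpha=0.3)
plt.show()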

Template Python Code

The following two function templates might help you solve Problem 2. You are not required to use either of these, but they may help.

See also the pseudocode found on Page 14 of Neal's paper: https://arxiv.org/pdf/1206.1901.pdf#page=14.

Brief template code intro: we assume we have defined Python functions that can

  • Calculate the kinetic energy, given momentum values
  • Calculate the potential energy of BNN regression, given some bnn parameter values (aka 'position')
  • Calculate the gradient of the potential energy (perhaps via autograd)

(You'll need to write each of these functions).
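For instance, with the standard unit-variance Gaussian momentum used in Neal's pseudocode, the kinetic energy has the simple closed form sketched below (your momentum representation may differ):

import numpy as np

def calc_kinetic_energy(momentum_vec):
    """ K(p) = 0.5 * sum_i p_i^2 for unit-mass, unit-variance Gaussian momentum. """
    return 0.5 * np.sum(np.square(momentum_vec))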

Given these functions, we can build an HMC sampler using the run_HMC_sampler and make_proposal_via_leapfrog_steps functions defined below.

Template: Run HMC Sampler for many iterations

import time

import numpy as np


def run_HMC_sampler(
        init_bnn_params=None,
        n_hmc_iters=100,
        n_leapfrog_steps=1,
        step_size=1.0,
        random_seed=42,
        calc_potential_energy=None,
        calc_kinetic_energy=None,
        calc_grad_potential_energy=None,
        ):
    """ Run HMC sampler for many iterations (many proposals)

    Returns
    -------
    bnn_samples : list
        List of samples of NN parameters produced by HMC
        Can be viewed as 'approximate' posterior samples if chain runs to convergence.
    info : dict
        Tracks energy values at each iteration and other diagnostics.

    References
    ----------
    See Neal's pseudocode algorithm for a single HMC proposal + acceptance:
    https://arxiv.org/pdf/1206.1901.pdf#page=14

    This function repeats many HMC proposal steps.
    """
    # Create random-number-generator with specific seed for reproducibility
    prng = np.random.RandomState(int(random_seed))
    # Record start time so elapsed time can be reported in diagnostics below
    start_time_sec = time.time()

    # Set initial bnn params
    cur_bnn_params = init_bnn_params
    cur_potential_energy = calc_potential_energy(cur_bnn_params)

    bnn_samples = list()
    # TODO make lists to track energies over iterations

    n_accept = 0
    for t in range(n_hmc_iters):
        # Draw momentum for CURRENT configuration
        cur_momentum_vec = None  # TODO draw momentum using prng (e.g. standard normal draws)

        # Create PROPOSED configuration
        prop_bnn_params, prop_momentum_vec = make_proposal_via_leapfrog_steps(
            cur_bnn_params, cur_momentum_vec,
            n_leapfrog_steps=n_leapfrog_steps,
            step_size=step_size,
            calc_grad_potential_energy=calc_grad_potential_energy)

        # TODO Compute probability of accept/reject for proposal
        # TODO You'll need to use the kinetic and potential energy functions
        accept_proba = 0.0 # (Placeholder)

        # Draw random value from (0,1) to determine if we accept or not
        if prng.rand() < accept_proba:
            # If here, we accepted the proposal
            n_accept += 1

            # TODO what current state needs to be updated?

        # Update list of samples from "posterior"
        bnn_samples.append(cur_bnn_params)
        # TODO update energy tracking lists

        # Print some diagnostics every 50 iters
        if t < 5 or ((t+1) % 50 == 0) or (t+1) == n_hmc_iters:
            accept_rate = float(n_accept) / float(t+1)
            print("iter %6d/%d after %7.1f sec | accept_rate %.3f" % (
                t+1, n_hmc_iters, time.time() - start_time_sec, accept_rate))

    return (
        bnn_samples,
        dict(
            n_accept=n_accept,
            n_hmc_iters=n_hmc_iters,
            accept_rate=accept_rate),
        )

Template: Construct a Single HMC Proposal

import copy


def make_proposal_via_leapfrog_steps(
        cur_bnn_params, cur_momentum_vec,
        n_leapfrog_steps=1,
        step_size=1.0,
        calc_grad_potential_energy=None):
    """ Construct one HMC proposal via leapfrog integration

    Returns
    -------
    prop_bnn_params : same type/size as cur_bnn_params
    prop_momentum_vec : same type/size as cur_momentum_vec

    """
    # Initialize proposed variables as copies of current values
    prop_bnn_params = copy.deepcopy(cur_bnn_params)
    prop_momentum_vec = copy.deepcopy(cur_momentum_vec)

    # TODO: half step update of momentum
    # This will use the grad of potential energy (use provided function)

    for step_id in range(n_leapfrog_steps):
        # TODO: full step update of 'position' (aka bnn_params)
        # This will use the grad of kinetic energy (has simple closed form)

        if step_id < (n_leapfrog_steps - 1):
            # TODO: full step update of momentum
            pass  # placeholder so this template parses; replace with your update
        else:
            # Special case for final step
            # TODO: half step update of momentum
            pass  # placeholder so this template parses; replace with your update

    # TODO: don't forget to flip sign of momentum (ensure symmetry)

    return prop_bnn_params, prop_momentum_vec


Debugging Tips

Here are some tricks to simplify your life if you are having trouble:

  • Before you try HMC, see if you can get a simpler random walk Metropolis-Hastings sampler to work. This should be quite easy (just a few lines of code), but would let you verify that you understand how to evaluate acceptance probabilities for the BNN model (see the sketch after this list).

  • Before you try HMC on a network with hidden units, try it with 0 hidden layers (equivalent to Bayesian linear regression). This should be a much simpler model with just two scalar random variables (weight and bias).
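Here is a rough sketch of such a random-walk Metropolis-Hastings loop, assuming calc_potential_energy (the negative log joint) takes a flat 1D parameter vector; the names and the proposal_stddev value are illustrative only:

import numpy as np

def run_random_walk_MH(init_params_vec, n_iters=2000, proposal_stddev=0.05,
                       calc_potential_energy=None, random_seed=42):
    prng = np.random.RandomState(random_seed)
    cur_vec = init_params_vec
    cur_energy = calc_potential_energy(cur_vec)
    samples = list()
    for t in range(n_iters):
        # Gaussian random-walk proposal centered at the current state
        prop_vec = cur_vec + proposal_stddev * prng.randn(*cur_vec.shape)
        prop_energy = calc_potential_energy(prop_vec)
        # Accept with probability min(1, exp(cur_energy - prop_energy))
        if prng.rand() < np.exp(cur_energy - prop_energy):
            cur_vec, cur_energy = prop_vec, prop_energy
        samples.append(cur_vec)
    return samples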