Bayesian Deep Learning

Tufts CS Special Topics Course | COMP 150 - 03 BDL | Fall 2018

HW3: Bayesian Neural Nets + Variational Inference


PDF writeup due by Tues. Oct 2 at 11:59PM. Submit your PDF to Gradescope: https://www.gradescope.com/courses/23483/assignments/102932/

Your PDF should include:

  • Your full name
  • Collaboration statement
  • About how many hours did you spend (coding, asking for help, writing it up)?
  • About how many hours did your CPU spend on Problem 2?
  • Problem 1 answer (plots and text, with parts clearly labeled)
  • Problem 2 answer (plots and text, with parts clearly labeled)
  • Any code you developed (as an appendix)

Jump to: Problem 1   Problem 2   Debugging Tips

Background

We have now learned about black-box variational inference (see the [Alg. 1 pseudocode on Page 3 of the BBVI arXiv paper]). In this HW, we will implement our own version of BBVI and use it to fit a Bayesian neural network for univariate regression.

We assume the NN has \(L\) total layers (including the output layer and \(L-1\) hidden layers). Each layer \(\ell\) of the network has a set of \(J^{(\ell)}\) scalar weights \(w^{(\ell)}\) and a set of \(K^{(\ell)}\) scalar biases \(b^{(\ell)}\).

As in the previous homework, we assume the network takes an input \(x_n\) (we'll assume it is univariate), and produces a predicted scalar output via a feed-forward function \(f(x_n, w, b) \in \mathbb{R}\).

Prior

Each weight and bias has an independent univariate Gaussian prior:

\begin{align} p(w) &= \prod_{\ell=1}^{L} \prod_{j=1}^{J^{(\ell)}} \mathcal{N}(w^{(\ell)}_j \mid 0, 1) \\ p(b) &= \prod_{\ell=1}^{L} \prod_{k=1}^{K^{(\ell)}} \mathcal{N}(b^{(\ell)}_k \mid 0, 1) \end{align}

Likelihood

We explain the \(N\) total examples we observe -- each indexed by \(n\) and consisting of a pair of output/input \(\{y_n, x_n\}\) values -- via a Gaussian likelihood, with mean given by the scalar network output value \(f(x_n, w, b)\) and standard deviation given by a known hyperparameter \(\sigma > 0\):

$$ p(y| x, w, b) = \prod_{n=1}^N \mathcal{N}(y_n | f(x_n, w, b), \sigma^2) $$
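As a concrete (optional) illustration, here is a minimal sketch of how the log prior and log likelihood above could be evaluated in code. The helper names are placeholders, not provided code, and the feed-forward function f is passed in as an argument (it is whatever network you implement):

import numpy as np

def calc_log_normal_pdf(x, mean, stddev):
    # Sum of independent univariate Gaussian log densities.
    var = stddev ** 2
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var))

def calc_log_prior(w_flat, b_flat):
    # log p(w) + log p(b) under independent Normal(0, 1) priors.
    return calc_log_normal_pdf(w_flat, 0.0, 1.0) + calc_log_normal_pdf(b_flat, 0.0, 1.0)

def calc_log_lik(y_N, x_N, w_flat, b_flat, f, sigma=0.1):
    # log p(y | x, w, b): Gaussian around the network output f(x_n, w, b).
    return calc_log_normal_pdf(y_N, f(x_N, w_flat, b_flat), sigma)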

Target Posterior

Our goal is to estimate the posterior over the weights and biases given the training data:

$$ p(w, b | y, x) $$

Knowing this posterior would let us make useful predictions, such as estimating the value of \(y_*\) at a new test input \(x_*\).

Sadly, we don't have a way to compute this posterior in closed form. It is intractable when the model contains hidden layers with non-linear activations.

Approximate Posterior

To gain tractability, we'll solve a simpler problem: we will search over possible approximate posteriors that have a more manageable form.

Let us assume each scalar weight and bias parameter has an independent Normal approximate posterior, with a mean value \(m_j \in \mathbb{R}\) and a log-standard-deviation value \(s_j \in \mathbb{R}\) that are free parameters.

$$ q(w, b \mid m, s) = \prod_{\ell=1}^L \Big( \prod_{j=1}^{J^{(\ell)}} \mathcal{N}\big( w^{(\ell)}_j \mid \tilde{m}^{(\ell)}_j, (e^{\tilde{s}^{(\ell)}_j})^2 \big) \Big) \Big( \prod_{k=1}^{K^{(\ell)}} \mathcal{N}\big( b^{(\ell)}_k \mid \bar{m}^{(\ell)}_k, (e^{\bar{s}^{(\ell)}_k})^2 \big) \Big) $$

We use tildes to mark variables that belong to weights and bars to mark variables that belong to biases. For convenience, when it is clear from context we'll just use \(m = [\tilde{m}, \bar{m}]\) to denote the means of all weights and biases, and \(s = [\tilde{s}, \bar{s}]\) to denote the LOG standard deviations of all weights and biases.

For any specific values of \(m\) and \(s\), we have a valid distribution over weights and biases. Some settings of \(m, s\) give distributions that are closer to the true posterior than others; our goal is to find the values that bring us as close as possible.
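As an optional illustration, here is a minimal sketch (with placeholder names) of how one could draw a single joint sample from \(q\) and evaluate its log density, assuming the means m and log-std-devs s for all weights and biases are stored as flat 1D arrays:

import numpy as np

def draw_sample_from_q(m, s, random_state=np.random):
    # One joint sample of all weights/biases: entry d is drawn from Normal(m[d], exp(s[d])).
    return m + np.exp(s) * random_state.randn(m.size)

def calc_log_q(theta, m, s):
    # log q(theta | m, s) for the factorized Gaussian with log-std-dev parameters s.
    var = np.exp(2 * s)
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (theta - m) ** 2 / (2 * var))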

VI Objective Function to Maximize

\begin{align} \mathcal{L}(m,s) &= \mathbb{E}_{q(w,b \mid m,s)} \Big[ \log p(y \mid x, w, b) + \log p(w,b) - \log q(w,b \mid m,s) \Big] \end{align}

The function \(\mathcal{L}(m,s)\) takes in possible mean and log-std-dev parameters \(m,s\), and produces a scalar real value.

The readings discuss how this function is a "lower bound" on the log marginal likelihood \(\log p(y \mid x) = \log \int p(y, w, b \mid x) \, dw \, db\). We wish to maximize this lower bound (make it as tight as possible) by finding the best possible values of the parameters \(m,s\) for our approximate posterior.

VI Loss Function (Objective to Minimize)

Often, we are interested in framing inference as a minimization problem (not a maximization problem). Especially in deep learning, minimization is the standard convention of optimization toolboxes.

We thus define our "VI loss function" (what we want to minimize) as just -1.0 times the lower-bound objective above.

"VI loss function": \(-1.0 \cdot \mathcal{L}(m,s)\)

Training Optimization Problem

Our goal is to learn a set of \(m,s\) values which make the approximate posterior as close as possible to our true intractable posterior. Our readings show this is equivalent to solving the following optimization problem:

\begin{align} \min_{m, s} \quad & -\mathcal{L}(m,s), \qquad m, s \in \mathbb{R}^{\sum_{\ell} \left( J^{(\ell)} + K^{(\ell)} \right)} \end{align}

We can then use:

  • Monte Carlo methods to evaluate the expectation
  • Score Function trick to evaluate the gradient of the expectation

These two ideas, together, allow us to do BBVI inference.
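To make these two ideas concrete before you apply them to the BNN, here is a minimal sketch on a toy 1-D problem: estimating \(\mathbb{E}_{q(z|m,s)}[g(z)]\) and its gradient with respect to \(m\) via the score function trick, \(\nabla_m \mathbb{E}_q[g(z)] = \mathbb{E}_q[g(z) \nabla_m \log q(z|m,s)]\). The function g and all names here are illustrative placeholders, not required code:

import numpy as np

def mc_estimate(g, m, s, n_samples=1000, random_state=np.random):
    # Monte Carlo estimate of E_{z ~ Normal(m, exp(s))}[ g(z) ].
    z_S = m + np.exp(s) * random_state.randn(n_samples)
    return np.mean(g(z_S))

def mc_grad_wrt_m(g, m, s, n_samples=1000, random_state=np.random):
    # Score function trick: grad_m E[g(z)] = E[ g(z) * grad_m log q(z|m,s) ],
    # and for a Gaussian, grad_m log q(z|m,s) = (z - m) / exp(s)^2.
    z_S = m + np.exp(s) * random_state.randn(n_samples)
    score_S = (z_S - m) / np.exp(2 * s)
    return np.mean(g(z_S) * score_S)

# Sanity check on g(z) = z^2: the true expectation is m^2 + exp(s)^2,
# so the true gradient w.r.t. m is 2m.
print(mc_estimate(lambda z: z ** 2, m=1.0, s=np.log(0.5)))
print(mc_grad_wrt_m(lambda z: z ** 2, m=1.0, s=np.log(0.5)))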

Problem 1: Estimating expectations and gradients of expectations with Monte Carlo and the Score Function Trick

For this problem only, consider a simple linear regression model, equivalent to a BNN with zero hidden layers and zero hidden units. The random variables are one weight parameter \(w\) and one bias parameter \(b\), both scalars.

$$ f(x_n, w, b) \triangleq w x_n + b $$

Thus, our approximate posterior will have four parameters, a pair \(\tilde{m},\tilde{s}\) for \(q(w)\) and a pair \(\bar{m}, \bar{s}\) for \(q(b)\).

Consider the dataset of \(N=5\) example points:

x_train_N = np.asarray([-5.0,  -2.50, 0.00, 2.50, 5.0])
y_train_N = np.asarray([-4.91, -2.48, 0.05, 2.61, 5.09])

Observe that our proposed linear model works very well for this problem if the slope is near 1 and the bias is near 0. This leads us to the following "near-ideal" values for the approximate posteriors \(q(w)\) and \(q(b)\).

Near-Ideal parameter values for Problem 1

You can assume that \(q(w)\) and \(q(b)\) are close to ideal if we set parameters such that:

  • \(\tilde{m}\), mean of \(w\): 1.0
  • \(\tilde{s}\), log stddev of \(w\): \(\log(0.1)\)
  • \(\bar{m}\), mean of \(b\): 0.0
  • \(\bar{s}\), log stddev of \(b\): \(\log(0.1)\)

Functions needed for Problem 1

Write a function to estimate the VI loss function defined above, \(-1.0 \cdot \mathcal{L}(\tilde{m},\tilde{s}, \bar{m}, \bar{s})\), using a given number of Monte Carlo (MC) samples.

Write another function to estimate the gradient of the VI loss function using the score function trick and a given number of MC samples.

We recommend making these functions general-purpose (not restricted to linear models) so you can reuse them in Problem 2. But if it helps, you can first write them for the linear case.

Instructions for Problem 1

Your PDF report should include the following labeled parts:

a. 1 row x 4 column plot of the estimated VI loss

  • Each column should show results using 1, 10, 100, 1000 samples
  • Each panel should show the estimated value of \(-\mathcal{L}(\tilde{m},\tilde{s},\bar{m}, \bar{s})\) across a list of possible input parameters:
    • Fix \(\tilde{s}, \bar{m}, \bar{s}\) to their Near-Ideal values in the chart above
    • Let \(\tilde{m}\) (the mean of \(w\) under \(q\)) vary across a grid of 20 possible values np.linspace(-3.0, 5.0, 20)

b. Short Answer: Describe any trends between number of samples and accuracy. How trustworthy is the 1-sample estimate?

c. 1 row x 4 column plot of the estimated gradient of the VI loss w.r.t \(\tilde{m}\)

  • Each column should show results using 1, 10, 100, 1000 samples
  • Each panel should show the estimated value of the gradient of \(-\mathcal{L}\) with respect to the scalar \(\tilde{m}\), across the same range of parameters as in Part 1a.

d. Short Answer: Does the plot from Part 1c look plausible as the gradient of the plots in Part 1a? Describe any trends between number of samples and accuracy for the Part 1c plot. How trustworthy is the 1-sample estimate? About how many samples does it take for the gradient to appear accurate (and thus trustworthy)?

Problem 2: Fit an Approximate Posterior using your own BBVI implementation

Consider the following training data with \(N=6\) example pairs of \(x\) and \(y\) values (as in HW2):

x_train_N = np.asarray([-2.,    -1.8,   -1.,  1.,  1.8,     2.])
y_train_N = np.asarray([-3.,  0.2224,    3.,  3.,  0.2224, -3.])

Your goal is to write an implementation of Black-box Variational Inference (BBVI) that fits an approximate posterior over the BNN's weights and biases given this data.

You may use automatic differentiation tools (like autograd, PyTorch, or TensorFlow) as you wish. We recommend autograd as a simple option that has been demonstrated in class.

You should think carefully about how you set hyperparameters such as the step size (aka learning rate) and number of Monte Carlo samples.

You may refer to existing BBVI implementations you find online for high-level understanding, but you must write your own code and be able to defend any code you submit as your original work. Working with a small set of fellow students from this class (at most 2 partners) is encouraged, provided you abide by the collaboration policy.

Model Implementation Details:

  • Fix the BNN architecture to 1 hidden layer of 10 units, with tanh as the activation function (a forward-pass sketch appears after this list).
  • Use a Normal(0, 1) prior for all weights and biases.
  • Use sigma=0.1 as the known standard deviation of the Gaussian likelihood on the final output.
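For reference, here is a minimal sketch (with illustrative names and shapes) of a feed-forward function matching this architecture: univariate input, one hidden layer of 10 tanh units, and a linear scalar output. If you use autograd, importing autograd.numpy as np here keeps this code compatible with it:

import numpy as np

def feed_forward(x_N, w, b):
    # w and b hold per-layer parameters: w[0] has shape (1, 10), b[0] has shape (10,),
    # w[1] has shape (10, 1), b[1] has shape (1,).
    h_NH = np.tanh(x_N[:, np.newaxis] @ w[0] + b[0])   # hidden layer of 10 tanh units
    return (h_NH @ w[1] + b[1]).flatten()              # linear output, one scalar per example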

Approximate Posterior Family Details:

  • As indicated in the 'Background' above, you should learn a mean and log-standard-deviation parameter for each weight and bias.
  • Initialize weight means to draws from a standard normal distribution
  • Initialize bias means to draws from a standard normal distribution
  • Initialize weight standard deviations to 1 (i.e., set the weight log-std-dev parameters to 0)
  • Initialize bias standard deviations to 1 (i.e., set the bias log-std-dev parameters to 0); see the initialization sketch after this list
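The initialization rules above might look like the following sketch in code (the flat-vector layout is just one possible choice). For the required architecture (1 input, 10 hidden units, 1 output), there are 20 weights and 11 biases, so 31 means and 31 log-std-devs in total:

import numpy as np

def init_variational_params(n_weights=20, n_biases=11, random_state=np.random):
    # Means: draws from a standard normal.  Log-std-devs: 0.0, i.e. a std dev of 1.
    m = random_state.randn(n_weights + n_biases)
    s = np.zeros(n_weights + n_biases)
    return m, s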

Algorithm Details:

  • Run at least 3 separate runs from independent random initializations, each for at least 2000 total iterations (or until convergence).
  • You can use a constant step size (aka learning rate). This needs some tuning: something around 1e-5 might be OK, 1e-8 is probably too small, and 0.01 is probably too big. A training-loop skeleton is sketched after this list.
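Putting the pieces together, a plain gradient-descent loop with a constant step size might look like the skeleton below. The helper estimate_vi_loss_and_grad stands in for whatever Monte Carlo loss/gradient estimator you wrote in Problem 1; it is a placeholder name, not provided code:

def run_bbvi(estimate_vi_loss_and_grad, m, s, step_size=1e-5,
             n_iters=2000, n_mc_samples=10):
    # Minimize the VI loss (negative ELBO) by plain gradient descent on (m, s).
    loss_history = []
    for it in range(n_iters):
        loss, grad_m, grad_s = estimate_vi_loss_and_grad(m, s, n_mc_samples)
        m = m - step_size * grad_m
        s = s - step_size * grad_s
        loss_history.append(loss)
    return m, s, loss_history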

Instructions: For Problem 2, your report PDF should include:

a. Plot of the Monte-Carlo-estimated optimization objective \(\mathcal{L}(m,s)\) (aka the evidence lower bound, or 'ELBO') vs. iteration, for each of your runs. This should be one plot with multiple lines. Write a clear caption indicating the number of Monte Carlo samples and the step size used, as well as 1-2 sentences summarizing your strategy for selecting these values.

b. Make a 1x3 grid of plots, each one showing sampled function values \(f\) from one of your independent runs. Show 10 samples per plot.

You can draw samples from your approximate posterior \(w^s, b^s \sim q(w, b \mid m, s)\), and use these to produce function values \(f(x_g, w^s, b^s)\). Use an evaluation grid of \(x_g\) values such that x_grid_G = np.linspace(-20, 20, 200), like in previous homeworks.
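One possible way to produce such a plot, as a rough sketch (draw_sample_from_q, unpack_params, and feed_forward are placeholder names for your own code):

import numpy as np
import matplotlib.pyplot as plt

x_grid_G = np.linspace(-20, 20, 200)
for ss in range(10):
    theta = draw_sample_from_q(m, s)           # m, s: your learned variational parameters
    w_sample, b_sample = unpack_params(theta)  # however you map the flat vector to layers
    plt.plot(x_grid_G, feed_forward(x_grid_G, w_sample, b_sample), 'b-', alpha=0.4)
plt.show()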

c. Make a 1x3 grid of plots, each one showing a filled plot of the posterior over function values (mean +/- 2 std dev) from one of your independent runs. Show one plot per run.

Filled plots have a line for the learned mean, and show a filled region at +/- 2 standard deviations. (See matplotlib's fill_between function.)
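Here is a minimal plotting sketch using fill_between, assuming f_samples_SG is an array (shape S x G) of S sampled function values evaluated on the grid x_grid_G; both names are placeholders for arrays you build yourself:

import numpy as np
import matplotlib.pyplot as plt

mean_G = np.mean(f_samples_SG, axis=0)
std_G = np.std(f_samples_SG, axis=0)
plt.fill_between(x_grid_G, mean_G - 2 * std_G, mean_G + 2 * std_G, color='b', alpha=0.2)
plt.plot(x_grid_G, mean_G, 'b-')
plt.show()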

d. Short answer (1-3 sentences): How does this result compare to the HMC posterior from HW2? What is different? Which one is a better fit? Which method is more reliable?

Debugging Tips

Here are some tricks to simplify your life if you are having trouble:

  • Before you try fitting BBVI on a network with hidden units, try it with 0 hidden layers (equivalent to Bayesian linear regression).

  • Before you optimize all the weight means and weight log-std-devs, consider trying to optimize only the weight means or only the log-std-devs, holding the others fixed. You can easily modify your gradient descent to do this just by masking the appropriate gradient entries with zeros, as in the sketch below.
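For example, if your gradient estimator returns flat arrays grad_m and grad_s (placeholder names), one minimal way to hold every log-std-dev fixed is to zero out its gradient before the update:

import numpy as np

mask_s = np.zeros_like(grad_s)       # all zeros: freeze every log-std-dev
m = m - step_size * grad_m           # means still get updated
s = s - step_size * mask_s * grad_s  # s never changes while the mask is zero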

Bonus: Real-world Datasets

Here are some possible datasets you could play with if you are curious. Each one is a 'real' dataset with a univariate input \(x\) and univariate output \(y\), so you could easily fit a BNN to the dataset using your existing code.

Carbon dioxide over time

We measure CO2 over time at the Mauna Loa Observatory. The input \(x\) is the time since 1958; the output \(y\) is the CO2 concentration in parts per million.

NBA Basketball heights and weights

Height (\(x\)) vs. weight (\(y\)) for Professional Basketball (NBA) Players

  • [Dataset file (.csv)]
  • Key question: Given a new player's height, how accurately can we predict weight?

Revenue of Harry Potter films over time

At each week (\(x\)) since the film's release, we have the total revenue in dollars (\(y\)).

  • [Dataset description]
  • [Dataset file (.csv)]
  • Key question: Given data from weeks 1-15, how accurately can we predict revenue in week 16 or 17? How much worse would we do if we only had weeks 1-10?

Bonus: Models and Algorithms

Here are some ideas you could play with if you have leftover time:

  • Is there a smart way to initialize that might lead to faster convergence / better fits?

  • Try using a Robbins-Monro sequence of step sizes (a sketch of one such schedule appears after this list). Does this help? Do you notice any differences?

  • Could you use some version of line search to pick the step size at each step in a smart way? (Google: Backtracking line search).
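For the Robbins-Monro idea, here is one minimal sketch of a valid step-size schedule (the step sizes sum to infinity while their squares sum to a finite value); the constants are illustrative, not recommendations:

def robbins_monro_step_size(it, base_step_size=1e-4, decay=0.6):
    # Valid for decay in (0.5, 1]: sum of steps diverges, sum of squared steps converges.
    return base_step_size / (it + 1) ** decay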