CP2: Bayesian Linear Regression


Last modified: 2021-03-05 17:22

Due date: Wed. Mar. 11, 2021 at 11:59pm AoE (anywhere on Earth)

Status: RELEASED

Updates:

  • 2021-03-05: Short answers 3e and 3f are no longer required. You can leave them out.

What to turn in:

Your ZIP file should include:

  • All starter code .py files (with your edits) (in the top-level directory)

Your PDF should include (in order):

  • Your full name
  • Collaboration statement
  • Problem 1 figures and short answers
  • Problem 2 figure and short answers
  • Problem 3 figure and short answers

Questions?: Post to the cp2 topic on the discussion forums.

Jump to: Problem 1   Problem 2   Problem 3   Starter Code

Background

Here we will implement the Bayesian linear regression model discussed in class and in your readings.

The problems below will address two key questions:

  • Given training data, how should we estimate the probability of a new observation? How might estimates change if we have very little (or abundant) training data?
  • How can we select hyperparameter values to improve our predictions on heldout data? How can we contrast methods like cross validation with methods that might use our whole training set more effectively?

Probabilistic Model

We are given \(N\) data examples of input/output pairs: \(\{ x_n, t_n \}_{n=1}^N\).

We will form a joint distribution for linear regression tasks that models both the weights and the observed "outputs".

Prior model

We place a multivariate Normal prior over the weight vector \(w \in \mathbb{R}^M\).

$$ p(w) = \mathcal{N}( w | 0, \alpha^{-1} I_M ) $$

Here, the scalar \(\alpha > 0\) is the "prior precision" hyperparameter.

Likelihood model

Next, we model the \(N\) observed outputs \(t = \{t_n \}_{n=1}^N\) as follows:

$$ p(t | x, w) = \prod_{n=1}^N \mathcal{N}( t_n | w^T \phi(x_n), \beta^{-1}) $$

Here, the scalar \(\beta > 0\) is the "likelihood precision" hyperparameter.

We'll assume the raw features \(x_n\) are just one dimensional, and we will use polynomial features for the feature transform:

\begin{align} \phi^{(0)}(x_n) &= [ 1 ]^T \\ \phi^{(1)}(x_n) &= [ 1 ~ x_n ]^T \\ \phi^{(2)}(x_n) &= [ 1 ~ x_n ~ x_n^2]^T \end{align}

where we get to pick the order \(r\) of the polynomial as a hyperparameter (so the feature vector has size \(M = r + 1\)).
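
To make the model concrete, here is a minimal numpy sketch (not part of the starter code; all names here are our own) that builds the order-\(r\) polynomial features and samples a toy dataset from the prior and likelihood above:

import numpy as np

def poly_features(x_N, order):
    """Map 1D inputs x_N (shape (N,)) to features Phi (shape (N, order+1))."""
    return np.vstack([x_N ** p for p in range(order + 1)]).T

# Draw w from the prior and t from the likelihood (std dev = 1/sqrt(precision))
rng = np.random.default_rng(0)
alpha, beta, order = 1.0, 20.0, 2
x_N = rng.uniform(-1.0, 1.0, size=10)
Phi_NM = poly_features(x_N, order)           # shape (10, order+1), so M = order + 1
w_M = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=Phi_NM.shape[1])
t_N = Phi_NM @ w_M + rng.normal(0.0, 1.0 / np.sqrt(beta), size=x_N.size)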

Key Concept 1: How should we estimate the probability of new outputs given new inputs?

Maximum A-Posteriori (MAP) estimator

We can solve the following MAP estimation problem to get one estimate of vector \(w\):

\begin{align} w_{\text{MAP}} &= \arg\max_{w} \log p( w | \{x_n, t_n\}_{n=1}^N, \alpha, \beta) \\ &= \left( \Phi^T \Phi + \frac{\alpha}{\beta} I \right)^{-1} \Phi^T t \end{align}

See discussion around Eq. 3.52 - 3.58 for how to see why this estimate will solve the MAP estimation problem.

(Side note: Recall that we could also motivate this estimate not via prior probabilities, but by a view that just seeks a "penalized" ML estimator, combining the maximum likelihood objective with an additional term that penalizes the sum of squares of "w" with strength \(\lambda > 0\). See Bishop's textbook Eq. 3.28 for the penalized ML estimator of \(w\).)

Given the MAP estimate of \(w\), when presented with a new input/output data pair \(x_*, t_*\), we can compute its likelihood via the following univariate Normal PDF:

$$ p( t_* | x_*, w = w_{\text{MAP}}, \beta ) = \mathcal{N}( t_* | w_{\text{MAP}}^T \phi(x_*), \beta^{-1}) $$

We get this directly from our i.i.d. likelihood assumptions, now applied to the new data point (indexed by \(*\)) rather than to the original \(N\) training points.
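
Putting these two steps together, here is a minimal sketch of the MAP workflow, assuming a feature matrix Phi_NM and targets t_N as in the earlier sketch (the interface you must implement in MAPEstimator.py will differ):

import numpy as np
from scipy.stats import norm

def fit_map_weights(Phi_NM, t_N, alpha, beta):
    """Closed-form MAP estimate: (Phi^T Phi + (alpha/beta) I)^{-1} Phi^T t."""
    M = Phi_NM.shape[1]
    A_MM = Phi_NM.T @ Phi_NM + (alpha / beta) * np.eye(M)
    return np.linalg.solve(A_MM, Phi_NM.T @ t_N)    # solve, don't invert

def map_log_lik(t_star, phi_star_M, w_map_M, beta):
    """log N(t_* | w_MAP^T phi(x_*), 1/beta) for one new input/output pair."""
    mean = w_map_M @ phi_star_M
    return norm.logpdf(t_star, loc=mean, scale=1.0 / np.sqrt(beta))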

Posterior Predictive estimator

The MAP estimator above requires us to commit to one particular weight vector. We might instead want to estimate the entire posterior distribution, and then average over this distribution when estimating what new outputs should look like.

This average-over-the-posterior strategy yields the Posterior Predictive distribution:

$$ p( t_* | x_*, \{x_n, t_n \}_{n=1}^N, \alpha, \beta ) = \mathcal{N}( t_* | m_N^T \phi(x_*), \sigma_N^2(x_*)) $$

where the mean \(m_N\) is given by Bishop PRML Eq. 3.53 and the variance \(\sigma^2_N(x_*)\) by Bishop PRML Eq. 3.59.
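
In code, those two equations look roughly like the sketch below (again with our own names; the interface you must implement in PosteriorPredictiveEstimator.py will differ):

import numpy as np

def fit_posterior(Phi_NM, t_N, alpha, beta):
    """Posterior over weights: mean m_N and covariance S_N (Bishop Eq. 3.53-3.54)."""
    M = Phi_NM.shape[1]
    invS_MM = alpha * np.eye(M) + beta * Phi_NM.T @ Phi_NM    # S_N^{-1}
    mean_M = beta * np.linalg.solve(invS_MM, Phi_NM.T @ t_N)  # m_N = beta S_N Phi^T t
    S_MM = np.linalg.inv(invS_MM)
    return mean_M, S_MM

def predictive_mean_and_var(phi_star_M, mean_M, S_MM, beta):
    """Posterior predictive mean and variance at one new input (Bishop Eq. 3.58-3.59)."""
    return mean_M @ phi_star_M, 1.0 / beta + phi_star_M @ S_MM @ phi_star_M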

Key Concept 2: How to select the hyperparameter settings using training data?

Grid Search with Cross Validation

Using cross validation, we can perform a grid search to select the single best hyperparameter setting from a candidate set of values (a code sketch of this procedure follows the steps below).

  1. Divide dataset into K=5 folds

  2. For each candidate hyperparameter, compute:

      • fold1score = score on fold 1 after training on folds 2,3,4,5
      • fold2score = score on fold 2 after training on folds 1,3,4,5
      • fold3score = score on fold 3 after training on folds 1,2,4,5
      • fold4score = score on fold 4 after training on folds 1,2,3,5
      • fold5score = score on fold 5 after training on folds 1,2,3,4
      • average_score = mean([fold1score, fold2score, fold3score, fold4score, fold5score])
  3. Select hyperparameter with best average_score
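
Here is a rough sketch of that loop, assuming a hypothetical estimator factory make_estimator whose objects expose fit and a per-example heldout log-likelihood score method (the starter code organizes this differently, and may split folds differently):

import itertools
import numpy as np

def select_hypers_by_cv(x_N, t_N, alpha_grid, beta_grid, order, make_estimator, K=5):
    """Return the (alpha, beta) pair with the best average heldout score over K folds."""
    fold_ids = np.arange(x_N.size) % K        # simple deterministic fold assignment
    best_pair, best_score = None, -np.inf
    for alpha, beta in itertools.product(alpha_grid, beta_grid):
        fold_scores = []
        for k in range(K):
            is_va = fold_ids == k
            est = make_estimator(order=order, alpha=alpha, beta=beta)
            est.fit(x_N[~is_va], t_N[~is_va])
            fold_scores.append(est.score(x_N[is_va], t_N[is_va]))
        if np.mean(fold_scores) > best_score:
            best_pair, best_score = (alpha, beta), np.mean(fold_scores)
    return best_pair, best_score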

Grid Search with Evidence

  1. For each candidate hyperparameter, compute the evidence on the full training set (see textbook Eq. 3.86 and the sketch below)
  2. Select hyperparameter with best evidence
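
A rough sketch of the evidence computation is below; it follows Bishop PRML Eqs. 3.81-3.86, but treat it as a guide under our own naming, not as the exact implementation the starter code requires:

import numpy as np

def calc_log_evidence(Phi_NM, t_N, alpha, beta):
    """ln p(t | alpha, beta) for Bayesian linear regression (Bishop PRML Eq. 3.86)."""
    N, M = Phi_NM.shape
    A_MM = alpha * np.eye(M) + beta * Phi_NM.T @ Phi_NM       # Eq. 3.81
    m_N = beta * np.linalg.solve(A_MM, Phi_NM.T @ t_N)        # Eq. 3.84
    E_mN = ((beta / 2.0) * np.sum((t_N - Phi_NM @ m_N) ** 2)
            + (alpha / 2.0) * np.sum(m_N ** 2))               # Eq. 3.82
    _, logdet_A = np.linalg.slogdet(A_MM)                     # stable log |A|
    return ((M / 2.0) * np.log(alpha) + (N / 2.0) * np.log(beta)
            - E_mN - 0.5 * logdet_A - (N / 2.0) * np.log(2.0 * np.pi))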

Hints for numerical stability

When you need to compute \(A^{-1} x\) for some matrix \(A\) and vector \(x\), the best way to do this within numpy is:

np.linalg.solve(A, x)

This computes the desired product in one step, rather than first forming the explicit inverse and then multiplying. Doing the "solve" directly avoids potential numerical inaccuracies (though for the small-scale problems in this CP, this is unlikely to matter much).
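
For example, applied to the MAP linear system above (with made-up toy values):

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(8, 3))     # toy feature matrix
t = rng.normal(size=8)            # toy targets
alpha, beta = 1.0, 20.0

A = Phi.T @ Phi + (alpha / beta) * np.eye(3)
b = Phi.T @ t
w_solve = np.linalg.solve(A, b)   # preferred: one linear solve
w_inv = np.linalg.inv(A) @ b      # works, but forms an explicit inverse first
print(np.allclose(w_solve, w_inv))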

Data and Starter Code

You can find the starter code and datasets in the course Github repository here:

https://github.com/tufts-ml-courses/comp136-21s-assignments/tree/main/cp2

Starter Code Installation Notes:

  • If you have the spr_2021s_env already installed from CP1, you don't need to do anything else.

  • If not, follow directions in the README for how to install the required Python packages

Provided Datasets

Inside the data/ folder of the starter code repo, you will find several CSV files for two datasets:

ToyWave

512 train points, 512 test points.

Made by us to be as similar as possible to the "sine wave" data described in Bishop Fig. 1.4.

ToyLine

512 train points, 512 test points.

Made by us to represent input/output pairs for a line with known slope and intercept, plus additive noise.

Problem 1

Tasks for Code Implementation

(All these steps will be evaluated by Autograder)

CODE 1(i) Implement fit and predict methods of starter code MAPEstimator.py

CODE 1(ii) Implement fit and predict and predict_variance methods of starter code PosteriorPredictiveEstimator.py

CODE 1(iii) Fill in the main method of run_compare_estimators_with_fixed_hypers.py to solve the tasks below.

Tasks for Report PDF

In problems below, be sure to enforce the following settings:

  • Fix prior precision alpha = 1.0 for all estimators
  • Fix likelihood precision beta = 20.0 for all estimators
  • Make sure all your figures have clear axis labels, titles, and legends when needed (as in any good scientific communication).

1a: FIGURE: MAP Predictions for ToyLine dataset In your report PDF, using the run_compare_estimators_with_fixed_hypers.py script, produce a figure with 8 total panels (2 rows, 4 columns). Each column will explore a different polynomial feature transform from order 0 to order 3. Each row will look at a different amount of training data (a small amount using N=8 and the whole dataset using N=512). Each panel should show the MAP estimator's predicted mean for \(t\) as a function of the provided inputs \(x\). Using matplotlib's fill_between (already in the starter code), also show +/- 3 standard deviations to convey the MAP estimator's uncertainty in its predictions.
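
If the starter code's plotting is unclear, here is a rough sketch of one panel, with dummy arrays standing in for your estimator's predicted mean and standard deviation:

import numpy as np
import matplotlib.pyplot as plt

# Dummy placeholders; in run_compare_estimators_with_fixed_hypers.py these come from your estimator
x_grid = np.linspace(-1.0, 1.0, 100)
mean_grid = 0.5 * x_grid
std_grid = 0.2 * np.ones_like(x_grid)

fig, ax = plt.subplots()
ax.plot(x_grid, mean_grid, 'b-', label='predicted mean')
ax.fill_between(x_grid, mean_grid - 3 * std_grid, mean_grid + 3 * std_grid,
                alpha=0.3, label='+/- 3 stddev')
ax.set_xlabel('x')
ax.set_ylabel('t')
ax.set_title('MAP estimator, order 1, N=8')
ax.legend()
plt.show()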

1b: FIGURE: PPE Predictions for ToyLine dataset Repeat 1a for the PPE estimator on the ToyLine dataset. Be sure your estimates for the mean and standard deviation of the prediction distribution change appropriately.

1c: FIGURE: MAP Predictions for ToyWave dataset Repeat 1a for the MAP estimator on the ToyWave dataset.

1d: FIGURE: PPE Predictions for ToyWave dataset Repeat 1b for the PPE estimator on the ToyWave dataset.

1e: SHORT ANSWER Think about the heldout predictions of these estimators of new outputs given new inputs (e.g. as in plots 1a and 1c). How are the PPE mean estimates different from the MAP mean estimates? How are the PPE variance estimates different from the MAP variance estimates? Base your answer on what you expect from the underlying math, as well as any plots you have made.

Problem 2: Model Selection via Heldout Probability with Cross-Validation

In Problem 1, we set \(\alpha\) and \(\beta\) manually to fixed values. To do better, we'll now explore one principled way to select the values of \(\alpha\) and \(\beta\) via grid search to optimize heldout performance: K=5 fold cross validation.

We'll look at the following candidate hyperparameters in our grid search:

  • 7 possible alpha parameters: [0.000, 0.001, 0.010, 0.100, 1.000, 10.000, 100.000]
  • 21 possible beta parameters: [0.010, 0.016, 0.025, 0.040, 0.063, 0.100, 0.158, 0.251, 0.398, 0.631, 1.000, 1.585, 2.512, 3.981, 6.310, 10.000, 15.849, 25.119, 39.811, 63.096, 100.000]

Our goal is to determine the best values of \(\alpha, \beta\) for each polynomial order in [0, 1, 2, 3, 4, 5, 6, 7, 8].

Tasks for Code Implementation

(All code steps will be evaluated by Autograder)

CODE 2(i) Implement the main method of run_grid_search_to_maximize_likelihood_on_5fold_heldout.py, filling in required TODOs

Tasks for Report PDF

2a: FIGURE: Model Score vs. Polynomial Order for ToyWave dataset In your report PDF, produce a figure that shows the computed heldout score (averaged over 5 folds) as a function of the polynomial feature order. You should show two lines, one for N=16 (the first 16 data points in the training set) and one for N=512 (the full training set). (The starter code makes most of this plot already.)

2b: SHORT ANSWER: Best hyperparameters for N=512 training set Report verbatim the selected hyperparameter values as well as the model's score on the test set, following the template below:

order = 0
alpha = 1.00
beta = 1.00
test likelihood score: -0.12345
required time = 5.67 sec

Problem 3: Model Selection via Evidence on the Training Set

In Problem 2, each candidate \(\alpha, \beta\) pair in our grid search required K separate fits of the linear regression model, one per fold. We'll now explore a method that requires only one fit of the model for each \(\alpha, \beta\) value: model selection via the evidence (marginal likelihood).

Tasks for Code Implementation

(All these steps will be evaluated by Autograder)

CODE 3(i) Implement the fit_and_calc_log_evidence method in the starter code LinearRegressionPosteriorPredictiveEstimator.py. Use the formula given in Bishop PRML Eq. 3.86.

CODE 3(ii) Implement the main method of run_grid_search_to_maximize_evidence_on_train.py, filling in required TODOs.

Tasks for Report PDF

3a: FIGURE: Model Score vs. Polynomial Order for ToyWave dataset In your report PDF, produce a figure that shows the computed log evidence as a function of the polynomial feature order. You should show two lines, one for N=16 and one for N=512. To get these two lines on the same scale, please plot the per-example log evidence (i.e., log_ev / N).

3b: SHORT ANSWER: Best hyperparameters for N=512 training set Report verbatim the selected hyperparameter values as well as the model's score on the test set, using the same template as in 2b.

3c: SHORT ANSWER: Reflect on runtime Which grid search was more expensive in terms of run time, 5-fold CV in problem 2 or the evidence in problem 3?

3d: SHORT ANSWER: Reflect on model selection Did the final model chosen in Problem 2 grid search differ substantially from the best model from the Problem 3 grid search? Discuss any differences in selected hyperparameter values as well as test set performance.

3e: SHORT ANSWER (optional, per the 2021-03-05 update above): Reflect on your reading How does the figure you made in 3a compare with Bishop PRML textbook's Figure 3.14? Does your Fig. 3a still favor order 1 over order 2 models? How would you explain other key differences?

3f: SHORT ANSWER (optional, per the 2021-03-05 update above): Summarize your own experiment Repeat 2b and 3b for the ToyLine dataset. This dataset was truly generated by an order 1 polynomial. Does either procedure recover this ground truth?