Last modified: 2021-03-05 17:22
Due date: Wed. Mar. 11, 2021 at 11:59pm AoE (anywhere on Earth)
Status: RELEASED
Updates:
- 2021-03-05: Short answers 3e and 3f are no longer required. You can leave them out.
What to turn in:
- ZIP file of source code: https://www.gradescope.com/courses/239096/assignments/1030441
- PDF report file: https://www.gradescope.com/courses/239096/assignments/1030540
Your ZIP file should include
- All starter code .py files (with your edits) (in the top-level directory)
Your PDF should include (in order):
- Your full name
- Collaboration statement
- Problem 1 figures and short answers
- Problem 2 figure and short answers
- Problem 3 figure and short answers
Questions?: Post to the cp2 topic on the discussion forums.
Jump to: Problem 1 Problem 2 Problem 3 Starter Code
Background
Here we will implement the Bayesian linear regression model discussed in class and in your readings.
The problems below will address two key questions:
- Given training data, how should we estimate the probability of a new observation? How might estimates change if we have very little (or abundant) training data?
- How can we select hyperparameter values to improve our predictions on heldout data? How can we contrast methods like cross validation with methods that might use our whole training set more effectively?
Probabilistic Model
We are given \(N\) data examples of input/output pairs: \(\{ x_n, t_n \}_{n=1}^N\).
We will form a joint distribution for linear regression tasks that models both the weights and the observed "outputs".
Prior model
We place a multivariate Normal prior over the weight vector \(w \in \mathbb{R}^M\).
Here, the scalar \(\alpha > 0\) is the "prior precision" hyperparameter.
Likelihood model
Next, we model the \(N\) observed outputs \(t = \{t_n \}_{n=1}^N\) as follows:
Here, the scalar \(\beta > 0\) is the "likelihood precision" hyperparameter.
We'll assume the raw features \(x_n\) are just one dimensional, and we will use polynomial features for the feature transform:
where we get to pick the order \(r\) of the polynomial as a hyperparameter.
Key Concept 1: How should we estimate the probability of new outputs given new inputs?
Maximum A-Posteriori (MAP) estimator
We can solve the following MAP estimation problem to get one estimate of vector \(w\):
See discussion around Eq. 3.52 - 3.58 for how to see why this estimate will solve the MAP estimation problem.
(Side note: Recall that we could also motivate this estimate not via prior probabilities, but by a view that just seeks a "penalized" ML estimator, combining the maximum likelihood objective with an additional term that penalizes the sum of squares of "w" with strength \(\lambda > 0\). See Bishop's textbook Eq. 3.28 for the penalized ML estimator of \(w\).)
Given the MAP estimate of \(w\), when presented with a new input/output data pair \(t_*, x_*\), we can compute its likelihood via the following univariate Normal PDF:
We get this directly from our i.i.d. likelihood assumptions, now applied to new data point \(*\) rather than the original \(N\) data points.
Posterior Predictive estimator
The MAP estimator above requires us to commit to one particular weight vector. We might instead want to estimate the entire posterior distribution, and then average over this distribution when estimating what new outputs should look like.
This average-over-the-posterior strategy yields the Posterior Predictive distribution:
Where the mean \(m_N\) is from Bishop PRML Eq. 3.53 and the variance \(\sigma^2_N\) is from Bishop PRML Eq. 3.59.
Key Concept 2: How to select the hyperparameter settings using training data?
Grid search with Cross Validation
Using cross validation, we can perform a grid search to select one best hyperparameter setting from a candidate set of values.
-
Divide dataset into K=5 folds
-
For each candidate hyperparameter, compute:
-
-
- fold1score = score on fold 5 after training on folds 1,2,3,4
-
-
-
- fold2score = score on fold 4 after training on folds 1,2,3,5
-
-
-
- fold3score = score on fold 3 after training on folds 1,2,4,5
-
-
-
- fold4score = score on fold 2 after training on folds 1,3,4,5
-
-
-
- fold5score = score on fold 1 after training on folds 2,3,4,5
-
-
-
- average_score = mean([fold1score, fold2score, fold3score, fold4score, fold5score])
-
-
Select hyperparameter with best average_score
Grid Search with Evidence
- For each candidate hyperparameter, compute evidence (see textbook Eq. 3.86)
- Select hyperparameter with best evidence
Hints for numerical stability
When you need to compute \(A^{-1} x\) for some matrix \(A\) and vector \(x\), the best way to do that within numpy is to do:
np.linalg.solve(A, x)
This computes the desired product in one step, rather than first computing the inverse and then later the product. Doing the "solve" directly avoids potential inaccuracies (though for the small scale problems in this CP this is unlikely to matter too much).
Data and Starter Code
You can find the starter code and datasets in the course Github repository here:
https://github.com/tufts-ml-courses/comp136-21s-assignments/tree/main/cp2
Starter Code Installation Notes:
-
If you have the
spr_2021s_env
already installed from CP1, you don't need to do anything else. -
If not, follow directions in the README for how to install the required Python packages
Provided Datasets
Inside the data/
folder of the starter code repo, you will find several CSV files for two datasets
ToyWave
512 train points, 512 test points.
Made by us to be as similar as possible to "sine wave" data described in Bishop Fig. 1.4
ToyLine
512 train points, 512 test points.
Made by us to represent input/output pairs for a line with known slope and intercept, plus additive noise.
Problem 1
Tasks for Code Implementation
(All these steps will be evaluated by Autograder)
CODE 1(i) Implement fit
and predict
methods of starter code MAPEstimator.py
CODE 1(ii) Implement fit
and predict
and predict_variance
methods of starter code PosteriorPredictiveEstimator.py
CODE 1(iii) Fill in the main method of run_compare_estimators_with_fixed_hypers.py
to solve the tasks below.
Tasks for Report PDF
In problems below, be sure to enforce the following settings:
- Fix prior precision alpha = 1.0 for all estimators
- Fix likelihood precision beta = 20.0 for all estimators
- Make sure all your figures have clear axis labels, titles, and legends when needed (As in any good scientific communication).
1a: FIGURE: MAP Predictions for ToyLine dataset
In your report PDF, using the run_compare_estimators_with_fixed_hypers.py
script, produce a figure with 8 total panels (2 rows, 4 columns).
Each column will explore a different polynomial feature transform from order 0 to order 3.
Each row will look at a different amount of training data (a small amount using N=8 and the whole dataset using N=512).
Each panel should show the MAP estimator's predicted mean for \(t\) as a function of the provided inputs \(x\).
Using matplotlib's fill-between (already in starter code), also show +/-3 standard deviations, to understand the MAP estimator's uncertainty in predictions.
1b: FIGURE: PPE Predictions for ToyLine dataset Repeat 1a for the PPE estimator on the ToyLine dataset. Be sure your estimates for the mean and standard deviation of the prediction distribution change appropriately.
1c: FIGURE: MAP Predictions for ToyWave dataset Repeat 1a for the MAP estimator on the ToyWave dataset.
1d: FIGURE: PPE Predictions for ToyWave dataset Repeat 1b for the PPE estimator on the ToyWave dataset.
1e: SHORT ANSWER Think about the heldout predictions of these estimators of new outputs given new inputs (e.g. as in plots 1a and 1c). How are the PPE mean estimates different from the MAP mean estimates? How are the PPE variance estimates different from the MAP variance estimates? Base your answer on what you expect from the underlying math, as well as any plots you have made.
Problem 2: Model Selection via Heldout Probability via Cross-Validation
In problem 1, we set \(\alpha\) and \(\beta\) manually to a single value. To do better, we'll now explore one principled way to select the value of \(\alpha, \beta\) via grid search to optimize performance: K=5 fold cross validation.
We'll look at the following candidate hyperparameters in our grid search:
- 7 possible alpha parameters: [0.000, 0.001, 0.010, 0.100, 1.000, 10.000, 100.000]
- 21 possible beta parameters: [0.010, 0.016, 0.025, 0.040, 0.063, 0.100, 0.158, 0.251, 0.398, 0.631, 1.000, 1.585, 2.512, 3.981, 6.310, 10.000, 15.849, 25.119, 39.811, 63.096, 100.000]
Our goal is to determine the best values of \(\alpha, \beta\) for each polynomial order in [0, 1, 2, 3, 4, 5, 6, 7, 8].
Tasks for Code Implementation
(All code steps will be evaluated by Autograder)
CODE 2(i) Implement the main
method of run_grid_search_to_maximize_likelihood_on_5fold_heldout.py
, filling in required TODOs
Tasks for Report PDF
2a: FIGURE: Model Score vs. Polynomial Order for ToyWave dataset In your report PDF, produce a figure that shows the computed heldout score (averaged over 5 folds) as a function of the polynomial feature order. You should show two lines, one for N=16 (the first 16 data points in training set) and one for N=512 (full training set). (Starter code makes most of this plot already).
2b: SHORT ANSWER: Best hyperparameters for N=512 training set Report verbatim the selected hyperparameter values as well as the model's score on the test set, following the template below:
order = 0
alpha = 1.00
beta = 1.00
test likelihood score: -0.12345
required time = 5.67 sec
Problem 3: Model Selection via Evidence on the Training Set
In problem 2, we needed to perform K separate fits of a linear regression model across K folds to select \(\alpha\) and \(\beta\) for our grid search. We'll now explore a method that requires only one fit of the model for each \(\alpha, \beta\) value: model selection via the evidence (marginal likelihood).
Tasks for Code Implementation
(All these steps will be evaluated by Autograder)
CODE 3(i) Implement the fit_and_calc_log_evidence
method in the starter code LinearRegressionPosteriorPredictiveEstimator.py. Use the formula given in Bishop PRML Eq. 3.86.
CODE 3(ii) Implement the main
method of run_grid_search_to_maximize_evidence_on_train.py
, filling in required TODOs.
Tasks for Report PDF
3a: FIGURE: Model Score vs. Polynomial Order for ToyWave dataset In your report PDF, produce a figure that shows the computed log evidence as a function of the polynomial feature order. You should show two lines, one for N=16 and one for N=512. To get these two lines on the same scale, please plot the per-example log evidence (e.g. log_ev / N
).
3b: SHORT ANSWER: Best hyperparameters for N=512 training set Report verbatim the selected hyperparameter values as well as the model's score on the test set. Like in 2b.
3c: SHORT ANSWER: Reflect on runtime Which grid search was more expensive in terms of run time, 5-fold CV in problem 2 or the evidence in problem 3?
3d: SHORT ANSWER: Reflect on model selection Did the final model chosen in Problem 2 grid search differ substantially from the best model from the Problem 3 grid search? Discuss any differences in selected hyperparameter values as well as test set performance.
3e: SHORT ANSWER: Reflect on your reading How does the figure you made in 3a compare with Bishop PRML textbook's figure 3.14? Does Fig 3a still favor order 1 over order 2 models? How would you explain other key differences?
3f: SHORT ANSWER: Summarize your own experiment Repeat 2b and 3b for the ToyLine dataset. This dataset was truly generated by an order 1 polynomial. Does either procedure recover this ground truth?