CP5: Markov Chain Monte Carlo


Last modified: 2020-04-30 00:05

Due date: Fri. May 1, 2020, free late day extension until Mon. May 4, 2020 at 11:59pm

Status: RELEASED.

Updates:

  • 2020-04-30 00:00: FIX to specification of the prior on \(\log \sigma\) in Problem 3
  • 2020-04-21: Released

What to turn in:

Your ZIP file should include

  • All starter code .py files (with your edits) (in the top-level directory)
  • These will be auto-graded. Be sure to check Gradescope test results for immediate feedback.

Your PDF should include (in order):

  • Your full name
  • Collaboration statement
  • About how many hours did you spend (coding, asking for help, writing it up)?
  • Problem 1 figure and short answers (1a, 1b, 1c)
  • Problem 2 short answers (2a, 2b)
  • Problem 3 figure and short answers (3a, 3b, 3c)

Questions?: Post to the cp5 topic on the discussion forums.

Jump to: Problem 1   Problem 2   Problem 3   Starter Code + Data

Overview and Background

In this coding practical, we will implement two algorithms for sampling from a target distribution:

  • The Metropolis algorithm with Random Walk proposals (for Problem 1 and Problem 3)

  • The Gibbs Sampling algorithm (for Problem 2)

The problems will try to address several key practical questions:

  • How can we use MCMC methods to sample from target distributions?
  • What are the tradeoffs of using Random Walk vs. Gibbs sampling? (runtime, implementation time, reliability, tunability, etc.)
  • How do we effectively implement MCMC methods in practice? (e.g. selecting hyperparameters, deciding how many samples to draw, etc.)

Background on the Metropolis Algorithm with Random Walk proposals (for Problem 1 and Problem 3)

For helpful background, review the lecture notes from day 21 as well as Bishop's PRML textbook Sec. 11.2

Background on the Gibbs Sampling Algorithm (for Problem 2)

For helpful background, review the lecture notes from day 22 as well as Bishop's PRML textbook Sec. 11.3

Starter Code and Dataset

You can find the starter code and datasets in the course Github repository here:

https://github.com/tufts-ml-courses/comp136-spr-20s-assignments/tree/master/cp5

Starter Code Installation Notes:

  • If you have the spr_2020s_env conda environment installed from CP1 or any other previous CP, you don't need to do anything else.

  • If not, follow directions in the starter code README for how to install the required Python environment

Problem 1: Random Walk Algorithm Implementation

Target Distribution

Your goal in Problem 1 is to draw samples from this target distribution:

\begin{align} p( z ) = p( [z_0, z_1] ) = \text{MultivariateNormal} \Big( \left[ \begin{array}{c} -1.0 \\ +1.0 \\ \end{array} \right] , \left[ \begin{array}{c c} 2.00 & 0.95 \\ 0.95 & 1.0 \end{array} \right] \Big) \end{align}

You want to write a Random Walk sampler to draw a Markov chain of samples \(z^{1}, z^{2}, \ldots z^{S}\).

Remember each sample is a 2-dim. vector: \(z^{s} = [z^{s}_0 ~ z^{s}_1]\)

Tasks for Code Implementation

(All code steps will be evaluated primarily by the Autograder, but also read through by the Instructor/TA.)

WRITING CODE 1(i) Implement the calc_target_log_pdf method in run_RW_correlated_2d_normal.py, to calculate the log PDF of our target distribution.

WRITING CODE 1(ii) Implement the draw_samples method of the RandomWalkSampler2D class. This is a generic procedure that implements the Metropolis algorithm correctly (draw a proposal from a Normal, then accept/reject using the Metropolis acceptance rule), using an externally defined function that computes the target log PDF (or its value up to an additive constant).
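For reference, here is a minimal sketch of both pieces, using a plain function in place of the starter code's RandomWalkSampler2D class (the names draw_samples_random_walk, n_samples, and random_state are illustrative, not the starter code's exact API):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Target for Problem 1: the correlated 2D normal defined above.
mean_D = np.asarray([-1.0, 1.0])
cov_DD = np.asarray([[2.00, 0.95],
                     [0.95, 1.00]])

def calc_target_log_pdf(z_D):
    ''' Return log p(z) for the 2D correlated normal target. '''
    return multivariate_normal.logpdf(z_D, mean=mean_D, cov=cov_DD)

def draw_samples_random_walk(z_init_D, n_samples, rw_stddev, random_state=0):
    ''' Metropolis algorithm with symmetric Normal random-walk proposals. '''
    prng = np.random.RandomState(random_state)
    z_D = np.asarray(z_init_D, dtype=float)
    log_pdf = calc_target_log_pdf(z_D)
    z_list = []
    n_accept = 0
    for s in range(n_samples):
        # Propose z' ~ Normal(z, rw_stddev^2 * I)
        zprime_D = z_D + rw_stddev * prng.randn(z_D.size)
        log_pdf_prime = calc_target_log_pdf(zprime_D)
        # Symmetric proposal: accept with prob min(1, p(z')/p(z)),
        # compared in log space for numerical stability.
        if np.log(prng.rand()) < log_pdf_prime - log_pdf:
            z_D = zprime_D
            log_pdf = log_pdf_prime
            n_accept += 1
        z_list.append(z_D.copy())
    return np.vstack(z_list), n_accept / float(n_samples)
```

Because the acceptance rule only uses the difference of log PDF values, this works even when the target log PDF is known only up to an additive constant.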

Tasks for Data Analysis

Now implement the main block of run_RW_correlated_2d_normal.py in order to solve the analysis questions below.

ANALYSIS 1(i): Using the main block of starter code run_RW_correlated_2d_normal.py, run your RandomWalk sampler from the first initialization \(z = [0, 0]\) (labeled "A" in the code). Set the proposal standard deviation to rw_stddev=10.0. Run for 10000 iterations of the sampler.

  • Plot the 2D distribution of all 10000 samples. Does it qualitatively converge to the target distribution? How would you know?
  • Plot the 2D distribution of the last 5000 samples only (so that you only keep samples after the chain might have "converged"). Did this help?
  • Look at the acceptance rate of the last half of your samples. Does this give you a clue about what might be going on?

ANALYSIS 1(ii): Repeat with the second initialization \(z = [1, -1]\) (labeled "B") in run_RW_correlated_2d_normal.py, again setting the proposal standard deviation to rw_stddev=10.0.

  • Plot the 2D distribution of samples. Does it qualitatively converge to the intended distribution? (If it has converged, it should look indistinguishable from the result of any other converged MCMC chain, such as the one in 1(i) above.)

ANALYSIS 1(iii): Repeat the above across a range of rw_stddev hyperparameter values: [0.01, 0.1, 1.0, 10.0]. Make sure you understand (1) what acceptance rates and other metrics look like when this hyperparameter is too small, too big, and about right; and (2) how to use multiple chains from different initializations to sanity-check convergence.
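For example, these runs could be organized as below (building on the illustrative draw_samples_random_walk sketch from the code tasks above; not the starter code's exact API):

```python
for rw_stddev in [0.01, 0.1, 1.0, 10.0]:
    for name, z_init_D in [("A", [0.0, 0.0]), ("B", [1.0, -1.0])]:
        z_SD, accept_rate = draw_samples_random_walk(
            z_init_D, n_samples=10000, rw_stddev=rw_stddev)
        kept_SD = z_SD[5000:]  # keep only the last 5000 samples
        print("init %s  rw_stddev %.2f  accept_rate %.3f" % (
            name, rw_stddev, accept_rate))
```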

Tasks for Report Writing

1a FIGURE: Show the 2D distribution produced from each initialization at each rw_stddev value defined above: [0.01, 0.1, 1.0, 10.0]. (Use the default plot limits in the starter code, and plot the last 5000 of 10000 total samples).

1b SHORT ANSWER: Which rw_stddev value(s) (of the 4 in 1a) would you select to obtain samples from the target distribution? Why?

1c SHORT ANSWER: Suppose for a different problem (a different target distribution over the same 2D space) you implement a RandomWalk sampler. You find that setting rw_stddev=100.0 yields an acceptance rate of around 0.8 after running 1 chain for 100,000 samples. Should you be confident your MCMC chain has converged? Justify your answer.

Problem 2: Gibbs Sampling Algorithm Implementation

In problem 2, your goal is to write a Gibbs sampler for the same target distribution as in Problem 1.

Hint: You might find the day 22 GibbsSampler.ipynb notebook helpful.

Tasks for Code Implementation

(All code steps will be evaluated primarily by the Autograder, but also read through by the Instructor/TA.)

WRITING CODE 2(i): Implement the draw_samples method of the GibbsSampler2D class.

WRITING CODE 2(ii): Implement the draw_z0_given_z1 method in run_Gibbs_correlated_2d_normal.py

WRITING CODE 2(iii): Implement the draw_z1_given_z0 method in run_Gibbs_correlated_2d_normal.py
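For reference, each complete conditional of a bivariate normal is itself a 1D normal, via the standard conditioning formulas. A minimal sketch for the Problem 1 target (the formulas are standard; the function signatures here are illustrative, not necessarily the starter code's):

```python
import numpy as np

mu_D = np.asarray([-1.0, 1.0])          # target mean (Problem 1)
Sigma_DD = np.asarray([[2.00, 0.95],    # target covariance
                       [0.95, 1.00]])

def draw_z0_given_z1(z1, prng):
    ''' Sample z0 ~ p(z0 | z1) via normal conditioning formulas. '''
    cond_mean = mu_D[0] + Sigma_DD[0, 1] / Sigma_DD[1, 1] * (z1 - mu_D[1])
    cond_var = Sigma_DD[0, 0] - Sigma_DD[0, 1] ** 2 / Sigma_DD[1, 1]
    return cond_mean + np.sqrt(cond_var) * prng.randn()

def draw_z1_given_z0(z0, prng):
    ''' Sample z1 ~ p(z1 | z0), same formulas with the roles swapped. '''
    cond_mean = mu_D[1] + Sigma_DD[0, 1] / Sigma_DD[0, 0] * (z0 - mu_D[0])
    cond_var = Sigma_DD[1, 1] - Sigma_DD[0, 1] ** 2 / Sigma_DD[0, 0]
    return cond_mean + np.sqrt(cond_var) * prng.randn()

def draw_samples_gibbs(z_init_D, n_samples, random_state=0):
    ''' One Gibbs sweep per sample: resample z0, then z1. '''
    prng = np.random.RandomState(random_state)
    z0, z1 = z_init_D
    z_list = []
    for s in range(n_samples):
        z0 = draw_z0_given_z1(z1, prng)
        z1 = draw_z1_given_z0(z0, prng)
        z_list.append([z0, z1])
    return np.asarray(z_list)
```

Note that every Gibbs draw is accepted, so there is no acceptance rate or proposal width to tune.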

Tasks for Data Analysis

Now implement the main block of run_Gibbs_correlated_2d_normal.py in order to solve the analysis questions below.

ANALYSIS 2(i): Run a Gibbs sampler from the first initialization (labeled "A"): \(z = [0, 0]\) . Run at least 5000 iterations of the sampler, and be sure to keep only the samples you are confident have "converged" to the target distribution. Plot the 2D distribution of samples. Does it qualitatively converge to the target distribution?

ANALYSIS 2(ii): Repeat with the second initialization (labeled "B"): \(z = [1, -1]\). Plot the 2D distribution of samples. Does it qualitatively converge to the target distribution (i.e., look indistinguishable from the samples in 2(i) above, or from the best run of the RandomWalk sampler in Problem 1)?

Tasks for Report Writing

2a SHORT ANSWER: What are the advantages of the Gibbs sampler over the Random Walk from Problem 1?

2b SHORT ANSWER: What are the disadvantages of the Gibbs sampler compared to the Random Walk from Problem 1? (Hint: think about what made Gibbs possible for this problem).

Problem 3: Estimating posteriors for Linear Regression via Random Walk

We will now use the Random Walk sampler to do Bayesian data analysis for a simple regression problem.

The training data is found in data/toyline_train.csv in the Starter Code.

https://github.com/tufts-ml-courses/comp136-spr-20s-assignments/blob/master/cp5/data/toyline_train.csv

Each row of this file specifies a scalar input \(x_n\) and corresponding scalar output \(t_n\).

We generated this file by drawing \(t_n\) from a Normal, with some mean (a linear function of \(x_n\)) and some standard deviation. Your analysis goal is to figure out what these unknown values are.

Background for Problem 3: A Probabilistic Model for Linear Regression

Consider the following probabilistic model for regression outputs \(t_{1:N} = [t_1, t_2, \ldots t_N]\), given regression inputs \(x_{1:N} = [x_1, x_2, \ldots x_N]\).

The model assumes two parameters: weight \(w \in \mathbb{R}\) and output standard deviation \(\sigma > 0\).

Unlike past attempts at regression (where we modeled \(w\) but kept \(\sigma\) fixed), in this problem we'll treat both \(w\) and \(\sigma\) as random variables.

To handle the fact that \(\sigma\) must be positive (and that all our MCMC methods require unconstrained real-valued variables), we'll write our whole model in terms of \(\log \sigma\), which can take any value on the real line.

Prior on regression weight

We assume the weight scalar \(w \in \mathbb{R}\) has the following PDF function:

\begin{align} f_1(w) = p(w) = 0.80 \cdot \text{NormPDF}( w | 0, 0.01^2) + 0.20 \cdot \text{NormPDF}( w | 0, 1) \qquad \text{(1)} \end{align}

This is a 2-component Gaussian mixture, with \(\pi_1\) = 0.80, \(\pi_2\) = 0.20, \(\sigma_1\) = 0.01, \(\sigma_2\) = 1.0, and \(\mu_1 = \mu_2 = 0.0\).

The big idea here is that we want a high chance that the weight is near zero (that's component 1 with small \(\sigma\)), but some chance that it is spread out between, say, -3 and +3 like a typical standard Normal. This is sometimes called a "spike-and-slab" distribution, and is often used for high-dimensional regression.

Note that the PDF function \(f_1\) is one that we can evaluate: compute the PDF of each separate Gaussian component, then take a weighted sum of the results. In log space, we can use the logsumexp trick to perform the weighted sum in a numerically safe way.
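For example, a minimal sketch of evaluating \(\log f_1\) this way (the function name is illustrative):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def calc_log_pdf_spike_and_slab(w):
    ''' Return log f_1(w) = log[0.80 N(w | 0, 0.01^2) + 0.20 N(w | 0, 1)]. '''
    log_terms = np.asarray([
        np.log(0.80) + norm.logpdf(w, loc=0.0, scale=0.01),
        np.log(0.20) + norm.logpdf(w, loc=0.0, scale=1.0),
    ])
    # logsumexp computes log(sum(exp(...))) without underflow
    return logsumexp(log_terms)
```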

Prior on log standard deviation

We assume the log-standard-deviation scalar \(\log \sigma \in \mathbb{R}\) has the following PDF function:

\begin{align} f_2(\log \sigma) = p(\log \sigma) = \text{NormPDF}( \log \sigma | 0, 2^2) \qquad \text{(2)} \end{align}

UPDATE 2020-04-30 00:00: Note that this standard deviation should be 2, not 1. Using 2 gives much more interesting results. (If you already did this problem, you don't need to redo it; we'll accept either version for full credit.)

Note that the PDF function \(f_2\) is one that we can evaluate (e.g. just call scipy.stats.norm.logpdf)

We can always obtain a standard deviation \(\sigma > 0\) via the deterministic transform: \(\sigma = e^{\log \sigma}\).

Likelihood

Given the parameters \(w, \log \sigma\), we have the following likelihood for producing regression output \(t_n\) given each input \(x_n\):

\begin{align} f_3( t_{1:N} | x_{1:N}, w, \log \sigma) = p(t_{1:N} | x_{1:N}, w, \log \sigma) = \prod_{n=1}^N \text{NormPDF}\left( t_n | w \cdot x_n , (e^{\log \sigma})^2 \right) \qquad \text{(3)} \end{align}

Note that the PDF function \(f_3\) is one that we can evaluate (e.g., we can answer the question: what is the probability density at the specific value \(t_n = 0.2\)?).
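For example, a minimal sketch of evaluating \(\log f_3\) across the whole dataset (the function name is illustrative):

```python
import numpy as np
from scipy.stats import norm

def calc_log_likelihood(t_N, x_N, w, log_sigma):
    ''' Return log f_3 = sum_n log NormPDF(t_n | w * x_n, (e^{log sigma})^2). '''
    return np.sum(norm.logpdf(t_N, loc=w * x_N, scale=np.exp(log_sigma)))
```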

Goal: Estimate the Posterior Distribution over Model Parameters

Your goal is to sample from the posterior:

\begin{align} p( w, \log \sigma | t_{1:N}, x_{1:N} ) &= \frac{1}{ p(t_{1:N} | x_{1:N})} p(t_{1:N} | x_{1:N}, w, \log \sigma) p(w) p(\log \sigma ) \\ &= c \cdot f_3(t_{1:N} | x_{1:N}, w, \log \sigma) \cdot f_2( \log \sigma ) \cdot f_1( w) \end{align}

We can compute the 3 numerator terms \(f_1, f_2, f_3\) on the RHS in terms of equations (1), (2), (3) above.

However, the normalizing denominator term \(c\) cannot be computed tractably. It is formally defined (via the sum rule) as the integral of the numerator, which is not easy to do analytically.

Nevertheless, we can use MCMC to sample from the 2D posterior distribution over \(w, \log \sigma\), because we know it up to a constant (\(c\) does not change value if we vary either \(w\) or \(\log \sigma\)).
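Concretely, taking logs, the quantity our Random Walk sampler needs to evaluate is

\begin{align} \log p( w, \log \sigma | t_{1:N}, x_{1:N} ) = \log f_3(t_{1:N} | x_{1:N}, w, \log \sigma) + \log f_2( \log \sigma ) + \log f_1( w ) + \log c \end{align}

and the additive constant \(\log c\) cancels in the Metropolis acceptance ratio, so we can simply drop it.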

Tasks for Code Implementation

WRITING CODE 3(i): Implement the calc_target_log_pdf method in run_RW_toyline.py

You should do this in two steps:

  • First, define the required calc_model_log_pdf function in the starter code, which is a function of all variables \(w, \log \sigma, x, t\).
  • Then, define either a local function or a lambda expression called calc_target_log_pdf that takes in one 2-dim. vector z_D, where z_D = [w, logsigma], and calls calc_model_log_pdf inside. (A sketch of both steps follows this list.)
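A minimal sketch of both steps, reusing the illustrative helpers calc_log_pdf_spike_and_slab and calc_log_likelihood from the background sections above, and assuming arrays x_N, t_N have already been loaded from data/toyline_train.csv:

```python
from scipy.stats import norm

def calc_model_log_pdf(w, log_sigma, x_N, t_N):
    ''' Return log f_1(w) + log f_2(log sigma) + log f_3(t | x, w, log sigma). '''
    return (calc_log_pdf_spike_and_slab(w)                  # log f_1
            + norm.logpdf(log_sigma, loc=0.0, scale=2.0)    # log f_2
            + calc_log_likelihood(t_N, x_N, w, log_sigma))  # log f_3

# Wrap as a function of a single 2-dim. vector z_D = [w, log_sigma],
# as the generic Random Walk sampler expects.
calc_target_log_pdf = lambda z_D: calc_model_log_pdf(z_D[0], z_D[1], x_N, t_N)
```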

Tasks for Data Analysis

Now implement the main block of run_RW_toyline.py in order to solve the analysis questions below.

ANALYSIS 3(i): Set \(N=512\), the full size of the toy dataset provided. Use the two provided initializations of parameters \(w, \log \sigma\) in the starter code, "A: [0.02, 0.02]" and "B: [1, -1]". Find a setting of the rw_stddev hyperparameter such that the two separate long MCMC chains (one from A with at least 10000 samples, one from B with at least 10000 samples) each converge visually, so the 2D plots of samples are plausibly indistinguishable.

ANALYSIS 3(ii): Repeat for \(N=0\), \(N=1\), and \(N=4\). This means you'll take the first \(N\) examples of the training set as your whole dataset for computing the likelihood \(f_3\). For each value of \(N\), use the two provided initializations of parameters \(w, \log \sigma\) in the starter code, "A: [0.02, 0.02]" and "B: [1, -1]". Find a setting of the rw_stddev hyperparameter such that two separate MCMC chains (initialized from A and from B) each converge so the 2D plots of samples are indistinguishable (again, keep at least 10000 samples from each).

Report Tasks

3a FIGURE: Plot the 2D joint distribution from the A and B chains for each value of \(N = 0, 1, 4, 512\). In your caption, for each \(N\) value, mention which rw_stddev hyperparameter you used and its acceptance rate for both chains (computed after discarding early samples from the burn-in phase). (Please use the default plot limits in the starter code.)

3b SHORT ANSWER: Reflect on the plot in 3a for \(N=1\). Based on your analytical understanding of the model as defined in \(f_1, f_2, f_3\), does the observed distribution of samples make sense? Hint: Look at the specific \(x_1, t_1\) data point used in the \(N=1\) case to make your argument.

3c SHORT ANSWER: What happens to the variance of the target distributions in Figure 3a as \(N\) increases? Why? What do you expect would happen as \(N\) gets even larger: \(N \rightarrow \infty\)? How is this related to an estimator we have discussed throughout this course?