HW4: Multi-layer Perceptrons and Stochastic Gradient Descent


Last modified: 2019-03-12 15:05

Status: RELEASED.

Due date: Fri. Mar 15 at 11:59PM EST (extended for all students, to give more buffer after the midterm exam).

(Original due date: Wed. Mar 13 at 11:59PM EST.)

Turn-in links: submit the ZIP file of source code and the PDF report to their respective HW4 portals on Gradescope.

Files to Turn In:

ZIP file of source code should contain:

  • hw4.ipynb : Jupyter Notebook file containing your code and markup
  • COLLABORATORS.txt : a plain text file [example], containing
    • Your full name
    • An estimate of the hours you spent on Problem 1
    • Names of any people you talked to for help (TAs, students, etc.). If none, write "No external help".
    • Brief description of what content you sought help about (1-3 sentences)

PDF report:

  • Please export your completed hw4.ipynb notebook as a PDF (the easiest way is likely to use your browser's 'Print as PDF' option)
  • This document will be manually graded
  • Please annotate, within Gradescope's normal submission process, which page(s) are relevant for each subproblem. This will save your graders so much time!

Evaluation Rubric:

See the PDF submission portal on Gradescope for the point values of each problem. Generally, tasks with more coding/effort will earn more potential points.

Starter code:

See the hw4 folder of the public assignments repo for this class:

https://github.com/tufts-ml-courses/comp135-19s-assignments/tree/master/hw4

Best practices

Across all the problems here, be sure that every plot includes readable axis labels, and include a legend whenever multiple lines are shown.
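
For example, a minimal matplotlib pattern showing the labeling calls (with placeholder data, purely for illustration) might look like:

    import numpy as np
    import matplotlib.pyplot as plt

    epochs = np.arange(1, 11)                      # placeholder x values
    fig, ax = plt.subplots()
    ax.plot(epochs, 1.0 / epochs, label='run 0')   # placeholder curves
    ax.plot(epochs, 0.5 / epochs, label='run 1')
    ax.set_xlabel('epoch')
    ax.set_ylabel('training loss')
    ax.legend()                                    # needed whenever multiple lines share one axis
    plt.show()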

Problem 1: Binary MLP Classifiers for XOR

We discussed in class how we can train multi-layer perceptrons to solve the "XOR" binary classification problem. No classifier limited to a single linear decision boundary (e.g., logistic regression) can achieve perfect classification on this data, but we saw in class that an MLP with one hidden layer can do well.

In this problem, we'll examine the choice of optimization algorithm (denoted solver in sklearn) and the choice of activation function (denoted activation in sklearn).

Activation functions: We'll compare the following:

  • ReLU : Rectified Linear Unit
  • Logistic : Logistic Sigmoid function

Optimization algorithms: We'll compare the following:

  • SGD : Stochastic Gradient Descent
    • A first-order method: it uses only gradient (first-derivative) information
    • At each step, the gradient is approximated using a subsample ("mini-batch") of examples from the full dataset
  • L-BFGS : limited-memory BFGS (Broyden–Fletcher–Goldfarb–Shanno)
    • A quasi-Newton method: it uses (approximate) second-derivative information as well as first-derivative information

Code for Classification: We'll use the MLPClassifier within scikit-learn. We've also provided a small wrapper version -- the MLPClassifierLBFGS class, defined in the starter code file MLPClassifierWithSolverLBFGS.py -- that's better suited to reporting the results of the L-BFGS optimization algorithm than sklearn's original implementation.

Data: In the starter code, we have provided an existing train/test split of an "XOR" dataset, stored on disk in comma-separated-value (CSV) files: x_train.csv, y_train.csv, x_test.csv, and y_test.csv.

Get the data here: https://github.com/tufts-ml-courses/comp135-19s-assignments/tree/master/hw4/data_xor
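
For reference, one way to load these files is sketched below. This sketch assumes each CSV has a header row (if not, pass header=None to pandas.read_csv):

    import os
    import pandas as pd

    data_dir = 'data_xor'  # adjust to wherever you placed the downloaded folder

    # Load the provided train/test split; .values gives plain numpy arrays
    x_train = pd.read_csv(os.path.join(data_dir, 'x_train.csv')).values
    y_train = pd.read_csv(os.path.join(data_dir, 'y_train.csv')).values.ravel()
    x_test = pd.read_csv(os.path.join(data_dir, 'x_test.csv')).values
    y_test = pd.read_csv(os.path.join(data_dir, 'y_test.csv')).values.ravel()

    print(x_train.shape, y_train.shape)  # expect (N, 2) and (N,)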

We generated this data so that there are 4 well-separated blobs in the 2-dimensional \(x\) feature space, with many points each. These four blobs form the classic "XOR" visual pattern. Each example's \(x_n\) feature vector has 2 dimensions: x1 and x2. Each example's \(y_n\) label is binary, either 1 or 0.
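
For intuition only, data with this structure could be generated along the following lines (a sketch, not necessarily how the provided files were made):

    import numpy as np

    prng = np.random.RandomState(0)
    n_per_blob = 50
    # Four well-separated Gaussian blobs centered at (+/-1, +/-1)
    centers = np.asarray([[+1, +1], [+1, -1], [-1, +1], [-1, -1]])
    x = np.vstack([c + 0.1 * prng.randn(n_per_blob, 2) for c in centers])
    # XOR labeling: y = 1 exactly when the two coordinates have different signs
    y = np.logical_xor(x[:, 0] > 0, x[:, 1] > 0).astype(int)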

1a: MLP with 1 hidden layer of 2 ReLU units trained with L-BFGS

Use the provided class MLPClassifierLBFGS, defined in the starter code.

Train 16 separate models, each one using a different random initialization (random_state should be set to 0, 1, 2, ... 15).
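
One possible setup for these 16 runs is sketched below. It assumes MLPClassifierLBFGS accepts the same hidden_layer_sizes, activation, and random_state keyword arguments as sklearn's MLPClassifier, and that x_train and y_train are the arrays loaded from the provided CSV files; check the starter code for the exact constructor signature.

    from MLPClassifierWithSolverLBFGS import MLPClassifierLBFGS

    models = list()
    for random_state in range(16):
        mlp = MLPClassifierLBFGS(
            hidden_layer_sizes=[2], activation='relu',
            random_state=random_state)
        mlp.fit(x_train, y_train)
        models.append(mlp)
        # Training error rate; 0.000 means this run found a perfect XOR solution
        print("run %2d: train error rate %.3f" % (
            random_state, 1.0 - mlp.score(x_train, y_train)))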

Problem 1a(i): Plot a 2D visualization of learned decision boundaries for each of the 16 training runs.
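
One way to visualize a decision boundary is to evaluate the classifier's predicted probability on a dense 2D grid and draw filled contours, as in the sketch below. This is a generic recipe, not the only acceptable approach; it assumes the classifier exposes predict_proba (as sklearn's MLPClassifier does) and reuses the models list and training arrays from the sketch above.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_decision_boundary(clf, ax, x_tr, y_tr, grid_size=200):
        # Dense grid covering (and slightly padding) the training data
        x1 = np.linspace(x_tr[:, 0].min() - 0.5, x_tr[:, 0].max() + 0.5, grid_size)
        x2 = np.linspace(x_tr[:, 1].min() - 0.5, x_tr[:, 1].max() + 0.5, grid_size)
        g1, g2 = np.meshgrid(x1, x2)
        grid_x = np.column_stack([g1.ravel(), g2.ravel()])
        # Predicted probability of class 1 at every grid point
        proba = clf.predict_proba(grid_x)[:, 1].reshape(g1.shape)
        ax.contourf(g1, g2, proba, levels=np.linspace(0, 1, 11), cmap='RdYlBu_r', alpha=0.6)
        ax.scatter(x_tr[:, 0], x_tr[:, 1], c=y_tr, cmap='RdYlBu_r', edgecolors='k', s=15)

    # 4x4 grid of panels, one per trained model
    fig, axes = plt.subplots(4, 4, figsize=(12, 12), sharex=True, sharey=True)
    for mlp, ax in zip(models, axes.ravel()):
        plot_decision_boundary(mlp, ax, x_train, y_train)
    plt.show()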

Problem 1a(ii): What fraction of the 16 runs finds the 0 error rate solution? What happens to the other runs? Describe how rapidly (or slowly) the runs in 1a converge.

1b: MLP with 1 hidden layer of 2 Logistic units trained with L-BFGS

Same as 1a, but use the logistic sigmoid activation function (activation='logistic' in sklearn).

1b(i): Plot a 2D visualization of learned decision boundaries for each of the 16 training runs.

1b(ii): What fraction of the 16 runs finds the 0 error rate solution? Describe how rapidly (or slowly) the runs in 1b converge.

1c: MLP with 1 hidden layer of 2 ReLU units trained with SGD (batch_size=10)

Use the default MLPClassifier from sklearn and ReLU activation.

Use the following settings to perform Stochastic Gradient Descent (SGD) on batches of 10 examples per update step, completing 400 total passes through the dataset (stopping early if recent loss values do not change by more than tol).

        random_state=random_state,
        solver='sgd', batch_size=10,
        max_iter=400, tol=1e-8,
        learning_rate='adaptive', learning_rate_init=0.1, momentum=0.0,
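
Put together, a full constructor call with these settings might look like the sketch below (hidden_layer_sizes=[2] and activation='relu' follow from the problem description; random_state is the loop variable from the 16-run setup):

    from sklearn.neural_network import MLPClassifier

    mlp = MLPClassifier(
        hidden_layer_sizes=[2], activation='relu',
        random_state=random_state,
        solver='sgd', batch_size=10,
        max_iter=400, tol=1e-8,
        learning_rate='adaptive', learning_rate_init=0.1, momentum=0.0,
        )
    mlp.fit(x_train, y_train)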

1c(i): Plot a 2D visualization of learned decision boundaries for each of the 16 training runs.

1c(ii): What fraction of the 16 runs finds the 0 error rate solution? Describe how rapidly (or slowly) the runs in 1c converge.

1c(iii): What is most noticeably different between this SGD run and the previous L-BFGS run in 1a? What explanation can you provide for why this happens?

1d: MLP with 1 hidden layer of 2 Logistic units trained with SGD (batch_size=10)

Repeat 1c, but with logistic sigmoid activation.

1d(i): Plot a 2D visualization of learned decision boundaries for each of the 16 training runs.

1d(ii): What fraction of the 16 runs finds the 0 error rate solution? Describe how rapidly (or slowly) the runs in 1d converge.

1d(iii): What is most noticeably different between these SGD runs and the previous L-BFGS run with logistic activations in 1b? What explanation can you provide for why this happens?

1e: Bringing it all together (BONUS)

Completing 1e is not required, but will be worth some bonus points.

1e(i): Make a 2x2 grid of trace plots of the training loss (y-axis) vs. the epoch (x-axis) for each of the runs from Problems 1a, 1b, 1c, and 1d above. So you should have 4 subplots, each one with 16 different lines.

Use the loss_curve_ attribute of an MLP classifier to access the training loss at each epoch.

Label each panel with a title like "Sigmoid + SGD" or "ReLU + L-BFGS" to make it easy to understand which plot is which.
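
One possible way to build this 2x2 grid is sketched below. It assumes you have kept the 16 fitted models from each of 1a-1d in four separate lists (hypothetical names models_1a through models_1d), and that the MLPClassifierLBFGS wrapper records a loss_curve_ attribute just as sklearn's SGD-based MLPClassifier does:

    import matplotlib.pyplot as plt

    runs_by_panel = [
        ('ReLU + L-BFGS', models_1a), ('Sigmoid + L-BFGS', models_1b),
        ('ReLU + SGD', models_1c), ('Sigmoid + SGD', models_1d)]

    fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharey=True)
    for (title, models), ax in zip(runs_by_panel, axes.ravel()):
        for mlp in models:
            # loss_curve_ holds the training loss recorded at each epoch
            ax.plot(mlp.loss_curve_, alpha=0.5)
        ax.set_title(title)
        ax.set_xlabel('epoch')
        ax.set_ylabel('training loss')
    plt.show()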

1e(ii): From the loss function trace plots in 1e(i) (plus your detailed plots from 1a-1d), which activation function seems easier to optimize, the ReLU or the Logistic Sigmoid? Which requires more iterations?

1e(iii): Are you convinced that one activation function is always easier to optimize? Suggest 3 additional experimental comparisons that would be informative.

1e(iv): In general, list 2 reasons to prefer L-BFGS over SGD when optimizing neural networks, and 2 reasons to prefer SGD over L-BFGS.

1e(v): In general, list 2 reasons to prefer ReLU over Logistic Sigmoid when selecting an activation function, and 2 reasons to prefer Logistic Sigmoid over ReLU.