HW3: Neural Networks and Stochastic Gradient Descent


Last modified: 2020-10-27 12:50

Status: RELEASED.

Due date: Wed. Oct. 28 at 11:59PM AoE (Anywhere on Earth) (Thu 10/29 at 07:59am in Boston)

Updates:

  • 2020-10-27 : Fixed HW3 reflection link.

Overview

In this HW, you'll complete the following:

  • Complete Problems 1-3 and write a report.
    • You'll submit this PDF report to the Gradescope link below.
    • The goal here is to demonstrate broad understanding of how to use MLPs effectively.
    • Much of your analysis will use library code in sklearn whose functionality is similar to what you implemented yourself earlier.

There is NO autograder for this homework! You should still submit your notebook for completeness.

Turn-in links:

Files to Turn In:

PDF report:

  • Prepare a short PDF report (no more than 4 pages).
  • This document will be manually graded.
  • Use your favorite report-writing tool (Word, Google Docs, LaTeX, ...).
  • Should be human-readable. Do not include code. Do NOT just export a Jupyter notebook to PDF.
  • Should have each subproblem marked via the in-browser Gradescope annotation tool.

ZIP file of source code should contain:

  • hw3.ipynb (just for completeness; it will not be autograded, but it will be manually assessed if necessary).

Evaluation Rubric:

See the PDF submission portal on Gradescope for the point values of each problem. Generally, tasks with more coding/effort will earn more potential points.

Jump to: Background   Starter Code   Problem 1   Problem 2   Problem 3

Starter Code

See the hw3 folder of the public assignments repo for this class:

https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/hw3

This starter code includes a notebook to help you get started on your analysis, plus a helper .py file to help you visualize learned MLPs for debugging.

Background

To complete this HW, you'll need some knowledge from the following sessions of class:

  • Gradient Descent (day06)
  • Logistic Regression (day09)
  • Neural Networks (day10)
  • Backpropagation (day11)
  • SGD and LBFGS (day12)

Optimization Algorithms

We'll compare two popular ways to solve optimization problems in Problems 1-3.

First, we have "SGD" or Stochastic Gradient Descent, which we covered in the day12 readings and lecture.

The most important facts about SGD are:

  • This method uses only first-derivative (aka gradient) information to update parameters
  • Each step, we estimate gradient of our objective by using a randomly chosen set of examples (a "batch" or "minibatch") that is a subset of the full training dataset
  • Each step, the direction of our parameter update involves stepping directly in the downhill direction (in opposite direction of gradient, which points uphill).
  • Each step, the magnitude of our parameter update step (the total distance traveled between old and new parameter values) is determined by the learning rate (aka step size), a scalar hyperparameter which must be set in advance.

Second, we have "L-BFGS" or the Limited-memory BFGS (Broyden–Fletcher–Goldfarb–Shanno) Algorithm, which we briefly covered in the day12 lectures. You can find more detailed technical introduction here http://aria42.com/blog/2014/12/understanding-lbfgs. The most important facts about L-BFGS are:

  • This method uses both first derivative information as well as (approximate) second derivative information to update parameters
    • The way it uses second-order derivatives is inspired by Newton's method, so we call it a "quasi-Newton" optimization algorithm
  • Each step, we exactly compute the gradient of our objective using the entire dataset (all N training examples)
  • Each step, the direction of our parameter update involves stepping in a modified downhill direction (we use second-order information to modify the direction of the step).
  • Each step, the magnitude of our parameter update is determined using a line search, which is a principled way to pick the size of the current step once we've fixed a direction. There is no learning rate hyperparameter.
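
To make the interface concrete, here is a minimal sketch (not part of the assignment) of calling SciPy's L-BFGS implementation on a toy quadratic objective with an exact gradient. In Problems 1-3 you will instead use sklearn's MLPClassifier with solver='lbfgs', which handles all of this internally.

import numpy as np
from scipy.optimize import minimize

def calc_loss(w_arr):                          # toy convex objective: sum of squared distances to 3.0
    return np.sum((w_arr - 3.0) ** 2)

def calc_grad(w_arr):                          # exact first derivative of the toy objective
    return 2.0 * (w_arr - 3.0)

w_init = np.zeros(5)
result = minimize(calc_loss, w_init, jac=calc_grad, method='L-BFGS-B')
print(result.x)                                # should be close to [3. 3. 3. 3. 3.]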

Activation Functions

You also might want to read about different activation functions, especially the Rectified linear unit or "ReLU". Many activation functions are defined and explained here: Stanford's CS231n Notes on Activation Functions
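
For reference, ReLU is just an elementwise threshold at zero. A minimal NumPy sketch (sklearn applies this for you when activation='relu'):

import numpy as np

def relu(z_arr):
    return np.maximum(0.0, z_arr)              # keeps positive values, zeros out negative values

print(relu(np.asarray([-2.0, 0.0, 3.5])))      # prints [0.  0.  3.5]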

Stochastic Gradient Descent in pseudocode

The SGD algorithm has the following pseudocode.

x_NF, y_N = load_training_dataset()                            # N = total number of training examples
N = y_N.shape[0]

model.initialize_parameters(random_state)                      # Initialize weight/bias params to random values

for cur_iter in range(max_iter):

    n_examples_seen_this_iter = 0
    while n_examples_seen_this_iter < N:

        xb_BF, yb_B = draw_random_batch(x_NF, y_N, batch_size) # B = batch_size
                                                               # xb_BF.shape is (B,F), yb_B.shape is (B,)

        grad_arr = calc_grad(xb_BF, yb_B)                      # grad_arr : array with one entry per parameter

        model.update_parameters(grad_arr, lr)                  # take step downhill: param = param - lr * grad

        n_examples_seen_this_iter += batch_size                # increment counter for current iteration

Vocabulary: What is an iteration in SGD?

The word "iteration" can mean different things in different contexts. We'll focus on what it means in sklearn's implementation.

Each iteration (also called an epoch) represents one or more gradient computation and parameter update steps (see pseudocode above).

Each iteration is complete when the number of examples it has "seen" (and used for updates) is equal to (or slightly bigger than) the total number of examples in the dataset, N.

Thus, the number of parameter updates that happen per iteration depends on the batch_size.
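
For example, with the N=10000 training examples used in this HW, the batch sizes studied in Problems 2-3 give very different numbers of updates per iteration. A quick sketch:

import math

N = 10000
for batch_size in [10000, 500, 25]:
    print(batch_size, math.ceil(N / batch_size))   # 10000 -> 1 update/iter, 500 -> 20, 25 -> 400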

Dataset: The flower XOR toy dataset

We consider a "toy" dataset where each example (indexed by \(n\)) has:

  • a 2-dimensional feature vector: \(x_n \in \mathbb{R}^2\)
  • a binary label: \(y_n \in \{0, 1\}\)

We have included in the starter code repo a training set of 10000 feature-label pairs, and a heldout set of 2000 feature-label pairs. (Note: we labeled the heldout set as a "test" set, but in this homework we're really using it like a validation set because we use it to help us select hyperparameters).

The training dataset is shown here (blue points indicate a positive class label (y=1), red points indicate negative class label (y=0)):

Training Dataset for the "Flower XOR" binary classification task.
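
If you want to recreate a plot like this yourself, a minimal sketch follows; the filenames below are hypothetical placeholders, so check the hw3 starter repo and notebook for the actual file names and loading code.

import numpy as np
import matplotlib.pyplot as plt

x_NF = np.loadtxt('x_train.csv', delimiter=',')    # hypothetical filename; shape (N, 2)
y_N = np.loadtxt('y_train.csv', delimiter=',')     # hypothetical filename; shape (N,)

plt.plot(x_NF[y_N == 1, 0], x_NF[y_N == 1, 1], 'b.', label='y=1')
plt.plot(x_NF[y_N == 0, 0], x_NF[y_N == 0, 1], 'r.', label='y=0')
plt.xlabel('x_1')
plt.ylabel('x_2')
plt.legend()
plt.show()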

Clearly, a single linear-boundary classifier would be quite bad here.

Can we train a neural network to solve this classification task well?

We'll see that effectively solving this requires:

  • Picking the right "architecture" for our neural network (Problem 1)
  • Picking the right optimization procedures for training our neural network (Problem 2-3)

Problem 1: MLPs with L-BFGS: What model size is effective?

Basic MLP Settings for Problem 1

We will use sklearn's MLPClassifier implementation: docs for sklearn.neural_network.MLPClassifier.

Throughout Problem 1, use the following fixed settings (already in your starter code; a sketch putting them together appears after this list).

Defining the method and objective function:

  • activation='relu' : Use the ReLU activation function
  • alpha=0.0001 : Add a small penalty term on sum-of-squares of all weights

Defining the optimization procedure:

  • solver='lbfgs' : Iterative gradient method that uses first and second-order gradient information.
  • tol=1e-5 : Defines threshold for deciding training has converged (by comparing change in loss over iterations).
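
Putting these settings together, one Problem 1 run might look roughly like the sketch below, assuming x_NF and y_N hold the training data; the max_iter value shown is hypothetical, so keep whatever the starter code uses.

from sklearn.neural_network import MLPClassifier

mlp_lbfgs = MLPClassifier(
    hidden_layer_sizes=[64],       # one hidden layer; vary over 4, 16, 64, 256
    activation='relu',
    alpha=0.0001,                  # small penalty on sum-of-squares of all weights
    solver='lbfgs',
    tol=1e-5,
    max_iter=1000,                 # hypothetical cap; use the value in the starter code
    random_state=0,                # vary over 4 values to get 4 runs per size
    )
mlp_lbfgs.fit(x_NF, y_N)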

Architecture Search: How many hidden units should we use?

In this problem, you'll fit MLPs using the lbfgs solver with sklearn's MLPClassifier. Each MLP will have one hidden layer, which means two layers of parameters total when we include the output layer.

We'll consider 4 possible sizes for our hidden layer : 4, 16, 64, and 256. This is the number of neurons or "units" in the layer.

Implementation Step 1A : At each size listed above, try 4 different runs with 4 different values of random_state. Multiple "runs" will let us understand how the random initialization of the weight and bias parameters impacts performance.

See the starter code notebook, which sets up most of this for you.

Implementation Step 1B : After each run is finished, record the following so you can plot it later (one way to compute these quantities is sketched just after this list):

  • base-2 log loss on training set and test set
  • error rate on training set and test set
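
One way to compute these quantities is sketched below, assuming mlp_lbfgs has already been fit and x_tr_NF, y_tr_N, x_te_NF, y_te_N hold the train and test splits (these names are placeholders; the starter notebook may organize this differently). Note that sklearn's log_loss uses the natural log, so divide by log(2) to convert to base 2.

import numpy as np
from sklearn.metrics import log_loss, zero_one_loss

yproba_tr_N2 = mlp_lbfgs.predict_proba(x_tr_NF)
tr_log2_loss = log_loss(y_tr_N, yproba_tr_N2) / np.log(2)      # base-2 log loss on training set
tr_err_rate = zero_one_loss(y_tr_N, mlp_lbfgs.predict(x_tr_NF))

yproba_te_N2 = mlp_lbfgs.predict_proba(x_te_NF)
te_log2_loss = log_loss(y_te_N, yproba_te_N2) / np.log(2)      # base-2 log loss on test set
te_err_rate = zero_one_loss(y_te_N, mlp_lbfgs.predict(x_te_NF))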

Figure 1 in Report

Figure 1 should compare the performance of each model you've trained as a function of the hidden layer size.

As in the starter notebook, create two different subplots:

  • on the left, show LOG LOSS (base 2) vs. model size
  • on the right, show ERROR RATE vs. model size

In each plot, show two kinds of points:

  • one color for the training-set performance (use color BLUE ('b') and style 'd')
  • one color for the test-set performance (use color RED ('r') and style 'd')

Each dot in your plot will represent the final result of one "run" of the optimizer. By looking across multiple dots at each size, we'll be able to see how sensitive the model is to its random initialization and to the model size.
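
A minimal plotting sketch in the spirit of the starter notebook, assuming you've stored one entry per run in lists named size_list, tr_loss_list, te_loss_list, tr_err_list, and te_err_list (these names are placeholders):

import matplotlib.pyplot as plt

fig, ax_grid = plt.subplots(nrows=1, ncols=2, figsize=(10, 4), sharex=True)

ax_grid[0].plot(size_list, tr_loss_list, 'bd', label='train')
ax_grid[0].plot(size_list, te_loss_list, 'rd', label='test')
ax_grid[0].set_xscale('log')                   # sizes 4, 16, 64, 256 spread evenly on a log axis
ax_grid[0].set_xlabel('hidden layer size')
ax_grid[0].set_ylabel('log loss (base 2)')
ax_grid[0].legend()

ax_grid[1].plot(size_list, tr_err_list, 'bd', label='train')
ax_grid[1].plot(size_list, te_err_list, 'rd', label='test')
ax_grid[1].set_xscale('log')
ax_grid[1].set_xlabel('hidden layer size')
ax_grid[1].set_ylabel('error rate')
ax_grid[1].legend()

plt.show()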

Short Answer 1a in Report

Based on your Figure 1, what hidden layer size would you recommend to achieve the best log loss on heldout data? Do you see signs of overfitting?

Short Answer 1b in Report

Based on your Figure 1, what hidden layer size would you recommend to achieve the best error rate on heldout data? Do you see signs of overfitting?

Short Answer 1c in Report

Consider a typical L-BFGS run with 64 hidden units. What final log loss on the training set does it reach (round to the nearest 0.01)? About how many seconds does it take to converge or to reach its maximum number of iterations (round to the nearest 0.1 seconds)? Does it converge?

Short Answer 1d in Report

You have fit an MLP with hidden_layer_sizes=[64] to this flower XOR dataset. How many total weight parameters are in each layer? How many total bias or intercept parameters are in each layer? Show your work or justify your answer.
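
To sanity-check your count, you can inspect a fitted model's parameter arrays; coefs_ and intercepts_ are standard MLPClassifier attributes available after calling fit:

for layer_id, (w_arr, b_arr) in enumerate(zip(mlp_lbfgs.coefs_, mlp_lbfgs.intercepts_)):
    print(layer_id, w_arr.shape, b_arr.shape)  # weight matrix shape and bias vector shape per layer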

Problem 2: MLPs with SGD: What batch size and step size?

This problem requires no implementation. To give you a jump start, we already ran a thorough experiment for you, summarized in Figure 2 below.

Your job is to interpret this figure and draw useful conclusions from it.

To make Figure 2, at each possible batch size and learning rate setting, we ran 4 random initializations of an MLP with 64 hidden units on the same training data as in Problem 1 (the flower xor dataset with N=10000 training examples).

Our experiment studied how performance depended on the following settings:

  • Batch size: We tried batch_size in [10000, 500, 25]
  • Learning rate: We tried learning_rate_init in [0.1, 0.3, 0.9, 2.7]

Figure 2: SGD training loss vs. elapsed wallclock time, varying batch size (rows) and learning rate (columns). Click here for a high-res PDF version.

Short Answer 2a in Report

Based on Figure 2, is the training objective function for MLPs convex or not convex? What evidence in Figure 2 supports this answer?

Short Answer 2b in Report

For each batch size in Figure 2, which learning rate(s) do you recommend? You should prioritize learning rates that produce good training loss values quickly and consistently across at least 3 of the 4 runs, without showing severe divergence.

Short Answer 2c in Report

Using the recommended learning rates you picked in 2b, report for each SGD batch size the time taken (in seconds, rounded to the nearest whole number) to deliver a "good" training loss value of 0.1 (e.g., the time when at least 3 out of 4 runs reach a log loss of 0.1). If this good value was never achieved, just write "the method only reached a loss of X consistently after Y seconds".

Short Answer 2d in Report

Based on 2c, which batch size is fastest to deliver a good model? Can you explain why? What tradeoffs are at work?

Short Answer 2e in Report

Compare the speed of the best SGD runs to what you observed for the best size-64 runs of L-BFGS from Problem 1c. Which method is better for this problem? Why?

Short Answer 2f in Report

Think in general about training neural networks on classification tasks like this one. List 2 reasons to prefer L-BFGS over SGD when optimizing your neural network, and 2 reasons to prefer SGD over L-BFGS.

Problem 3: Producing your own figure comparing batch size and learning rate

In this problem, you'll try to replicate Figure 2 above yourself.

The starter notebook provides all the required code. We just want you to run the code on your machine, see it yourself, and create your own figure.

Implementation Step 3A

Using the starter code, for each of the batch sizes and learning rates in Figure 2, you'll run 2 random initializations (you do NOT need all 4 shown in the provided Figure 2), either to convergence or until the maximum number of iterations specified in the starter code is reached. Use the hyperparameter settings for Problems 2-3 detailed below.
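
For reference, timing one run and recording its per-iteration training loss can be sketched as follows, assuming mlp_sgd is an MLPClassifier built with the Problem 2-3 settings listed further below (the starter code already does the equivalent for you):

import time

start_time_sec = time.time()
mlp_sgd.fit(x_NF, y_N)                         # one run at a fixed batch_size, learning_rate_init, random_state
elapsed_time_sec = time.time() - start_time_sec
loss_curve = mlp_sgd.loss_curve_               # training loss after each iteration (natural-log units)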

Warning: Time Consuming. Running SGD on this problem may take a while. Perhaps about 30-45 minutes depending on your machine. Get started early. Maybe let it run overnight. Or see below for instructions on using Google's Colab cloud computing notebook resources so you don't need fast local hardware.

Warning: Unlikely to exactly reproduce. We fully expect that this won't exactly reproduce Figure 2, because:

  • your run times might be quite different, because your hardware is different
  • your random initializations might be different, because numpy's randomness can vary by platform

The point of Problem 3 is just to help you understand where Figure 2 comes from and give you deeper experience in using SGD.

How to Use Google Colab: If your local computer is too slow (e.g. Problem 1 took much longer than 10 minutes), we have made a version of the Problem 3 notebook you can run in your browser via Google Colab, that will use some cloud computing hardware instead of your own.

Follow this link: Problem 3 notebook on Google Colab.

To use this notebook, save a copy under your own Google Drive account. Then run it in your browser (you can skip over Problem 1). It should take around 30 min. Then you can save the figure(s) to your local computer and write your report.

How to Turn In if you use Colab:

You do not need to upload the Colab notebook to Gradescope. Instead, please just modify your local notebook used for Problem 1 to include a link to your Colab notebook under Problem 3. We just want to be able to check that you've run it yourself (so please make sure your shared link works). You should still upload your notebook with your solution to Problem 1.

Figure 3 in Report

On the last page of your report, provide your completed Figure as "Figure 3".

Provide a caption with 2-3 complete sentences noting any significant changes between your figure and our Figure 2. Do the major conclusions from Problem 2 hold?

Basic MLP Settings for Problem 2-3

We will again use sklearn's MLPClassifier implementation: docs for sklearn.neural_network.MLPClassifier.

We'll study the impact of learning rate and batch size. As above in Figure 2, we consider the following settings

  • batch_size in [10000, 500, 25]
  • learning_rate_init in [0.1, 0.3, 0.9, 2.7]

Otherwise, we'll use the following fixed settings (a sketch putting them together appears after this list):

  • Fixed settings defining the method and objective function:

    • hidden_layer_sizes = [64] : Fix model size to 64 units
    • activation='relu' : Use the ReLU activation function
    • alpha=0.0001 : Add a small penalty term on sum-of-squares of all weights
  • Fixed settings defining the optimization procedure:

    • solver='sgd'
    • tol=1e-5
    • learning_rate='adaptive' : Read sklearn docs to understand.
    • momentum=0.0 : Turns off a more advanced feature we don't need.
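
Putting these settings together, a sketch of constructing one run at a particular batch size and learning rate is shown below; the max_iter value is hypothetical, so use whatever the starter code specifies, and wrap this in loops over batch_size, learning_rate_init, and random_state to cover all settings.

from sklearn.neural_network import MLPClassifier

mlp_sgd = MLPClassifier(
    hidden_layer_sizes=[64],
    activation='relu',
    alpha=0.0001,
    solver='sgd',
    batch_size=500,                # one of [10000, 500, 25]
    learning_rate_init=0.3,        # one of [0.1, 0.3, 0.9, 2.7]
    learning_rate='adaptive',
    momentum=0.0,
    tol=1e-5,
    max_iter=400,                  # hypothetical cap; use the value in the starter code
    random_state=0,                # vary to get multiple runs per setting
    )
mlp_sgd.fit(x_NF, y_N)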