HW1: Regression, Cross-Validation, and Regularization


Last modified: 2020-09-19 13:50

Status: RELEASED.

Due date: Wed. 09/30 11:59pm Anywhere on Earth (Thu 10/01 at 07:59am in Boston)

Overview

In this HW, you'll complete two parts in order:

  • First, complete several code tasks, writing Python code based on provided starter code. You'll submit your code as a ZIP to the autograder link below.
  • Second, complete some analysis tasks (which use your code) and write a report. You'll submit this PDF report separately.

As much as possible, we have tried to decouple these parts, so you may successfully complete the report even if some of your code doesn't work.

Files to Turn In:

PDF report:

  • Prepare a short PDF report (no more than 5 pages; ideally 1 page per problem below).
  • This document will be manually graded.
  • You can use your favorite report-writing tool (Word, Google Docs, LaTeX, ...).
  • Should be human-readable. Do not include code. Do NOT just export a Jupyter notebook to PDF.
  • Should have each subproblem marked via the in-browser Gradescope annotation tool.

ZIP file of source code submitted to autograder should contain:

  • cross_validation.py
  • performance_metrics.py
  • LeastSquaresLinearRegression.py
  • hw1.ipynb

Evaluation Rubric:

See the PDF submission portal on Gradescope for the point values of each problem. Generally, tasks with more coding/effort will earn more potential points.

Jump to: Code Tasks   Starter Code   Dataset   Report Problem 1   Report Problem 2   Report Problem 3   Report Problem 4

Background

To complete this HW, you'll need some specific knowledge from the following sessions of class:

  • Training Linear Regression (day03)
  • Polynomial Feature Transformations (day04)
  • Cross Validation (day04)
  • Regularized Linear Regression (day05)

Code Tasks

Starter Code

See the hw1 folder of the public assignments repo for this class:

https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/hw1

This starter code includes a notebook to help you organize your analysis, plus several .py files for core functionality you need to implement yourself.

Code Task 1: Edit performance_metrics.py to implement the calc_mean_squared_error and calc_mean_absolute_error functions

See the starter code here: performance_metrics.py.

Task 1(a) : Implement calc_mean_squared_error

Task 1(b) : Implement calc_mean_absolute_error

See the starter code for example inputs and the expected output for both functions.
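
If it helps to see the intended computations at a glance, here is a minimal sketch (argument names are assumptions; the starter file's docstrings and examples define the exact interface):

import numpy as np

def calc_mean_squared_error(y_N, yhat_N):
    ''' Sketch: mean of squared differences between true and predicted values. '''
    y_N = np.asarray(y_N, dtype=np.float64)
    yhat_N = np.asarray(yhat_N, dtype=np.float64)
    return np.mean(np.square(y_N - yhat_N))

def calc_mean_absolute_error(y_N, yhat_N):
    ''' Sketch: mean of absolute differences between true and predicted values. '''
    y_N = np.asarray(y_N, dtype=np.float64)
    yhat_N = np.asarray(yhat_N, dtype=np.float64)
    return np.mean(np.abs(y_N - yhat_N))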

Code Task 2: Edit LeastSquaresLinearRegression.py to implement fit and predict

See the starter code here: LeastSquaresLinearRegression.py. This file defines a LeastSquaresLinearRegressor class with the two key methods of the usual sklearn regression API: fit and predict.

You will edit this file to complete the fit and the predict methods, which will demonstrate your understanding of what goes on "inside" sklearn-like regressor objects.

Task 2(a) : The fit method should take in a labeled dataset \(\{x_n, y_n\}_{n=1}^N\) and instantiate two instance attributes:

  • w_F : 1D numpy array, shape (n_features = F,). Represents the 'weights'. Contains float64 entries of the weight coefficients.
  • b : scalar float. Represents the 'bias' or 'intercept'.

Nothing should be returned. You're updating the internal state of the object.

These attributes should be set using the formulas discussed in class (day03) for solving the "least squares" optimization problem (finding \(w\) and \(b\) values that minimize squared error on the training set).

Hint: Within a Python class, you can set an attribute like self.b = 1.0.
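
As a rough sketch of one standard closed-form approach (the formulas emphasized in class and the starter file's skeleton take precedence; array names here are illustrative), the bias can be folded into the weights by appending a column of ones and solving the normal equations:

import numpy as np

def fit(self, x_NF, y_N):
    ''' Sketch of a closed-form least-squares fit (lives inside the regressor class).

    Appends a constant column of ones so the bias is estimated jointly with
    the weights, then solves the normal equations for the stacked parameters.
    '''
    N, F = x_NF.shape
    xtilde_NG = np.hstack([x_NF, np.ones((N, 1))])   # shape (N, F+1)
    theta_G = np.linalg.solve(
        np.dot(xtilde_NG.T, xtilde_NG),
        np.dot(xtilde_NG.T, y_N))
    self.w_F = theta_G[:F]        # weight coefficients, shape (F,)
    self.b = float(theta_G[-1])   # bias / intercept, scalar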

Task 2(b) : The predict method should take in an array of feature vectors \(\{x_n\}_{n=1}^N\) and produce (return) the predicted responses \(\{ \hat{y}(x_n) \}_{n=1}^N\).

Recall that for linear regression, we've defined the prediction function as:

$$ \hat{y}(x_n) = b + w^T x_n = b + \sum_{f=1}^F w_f x_{nf} $$
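
A minimal sketch of the corresponding method (intended to live inside the same class as the fit sketch above, assuming fit has already set self.w_F and self.b):

import numpy as np

def predict(self, x_NF):
    ''' Sketch: return predicted responses yhat_N, one per row of x_NF.

    Computes yhat_n = b + w^T x_n for all rows at once via a matrix-vector product.
    '''
    return np.dot(x_NF, self.w_F) + self.b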

Code Task 3: Edit cross_validation.py to randomly divide data into splits and estimate training and heldout error

See the starter code here: cross_validation.py.

Task 3(a) : Implement the make_train_and_test_row_ids_for_n_fold_cv function

This function should consume the number of examples, the desired number of folds, and a pseudo-random number generator. For each of the desired folds, it should produce arrays of integers indicating which rows of the dataset belong to the training set and which belong to the test set.

See the starter code for detailed specification.
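
As an illustrative sketch of one reasonable strategy (the starter code's exact signature and return format take precedence), you might shuffle the row ids once and deal them into nearly equal folds:

import numpy as np

def make_train_and_test_row_ids_for_n_fold_cv(n_examples, n_folds, random_state=None):
    ''' Sketch: shuffle row ids once, then split into n_folds heldout sets. '''
    if not isinstance(random_state, np.random.RandomState):
        random_state = np.random.RandomState(random_state)
    shuffled_ids = random_state.permutation(n_examples)
    fold_ids = np.array_split(shuffled_ids, n_folds)
    train_ids_per_fold = []
    test_ids_per_fold = []
    for f in range(n_folds):
        test_ids_per_fold.append(fold_ids[f])
        train_ids_per_fold.append(
            np.concatenate([fold_ids[g] for g in range(n_folds) if g != f]))
    return train_ids_per_fold, test_ids_per_fold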

Task 3(b) : Implement the train_models_and_calc_scores_for_n_fold_cv function

This function will use the procedure from 3(a) to determine the different "folds", then train a separate model on each fold and return each model's training error and heldout error.

See the starter code for detailed specification.
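
A rough sketch of the overall flow, reusing the split function above and an error function from Task 1 (argument names, the choice of error metric, and the use of deepcopy are assumptions; check the starter code's specification):

import copy
import numpy as np

def train_models_and_calc_scores_for_n_fold_cv(
        estimator, x_NF, y_N, n_folds=3, random_state=0):
    ''' Sketch: fit a fresh copy of the estimator on each fold's training rows,
    then score it on both the training rows and the heldout rows. '''
    train_ids_per_fold, test_ids_per_fold = \
        make_train_and_test_row_ids_for_n_fold_cv(
            x_NF.shape[0], n_folds, random_state)
    train_error_per_fold = np.zeros(n_folds)
    test_error_per_fold = np.zeros(n_folds)
    for f in range(n_folds):
        tr_ids, te_ids = train_ids_per_fold[f], test_ids_per_fold[f]
        model = copy.deepcopy(estimator)   # avoid leaking fitted state across folds
        model.fit(x_NF[tr_ids], y_N[tr_ids])
        train_error_per_fold[f] = calc_mean_squared_error(
            y_N[tr_ids], model.predict(x_NF[tr_ids]))
        test_error_per_fold[f] = calc_mean_squared_error(
            y_N[te_ids], model.predict(x_NF[te_ids]))
    return train_error_per_fold, test_error_per_fold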

Dataset: Miles-per-Gallon efficiency of Vehicles

You have been given a data set containing gas mileage, horsepower, and other information for 395 makes and models of vehicles. For each vehicle, we have the following information:

column name    type     unit        description
horsepower     numeric  hp          engine horsepower
weight         numeric  lb.         vehicle weight
cylinders      numeric  #           number of engine cylinders, from 4 to 8
displacement   numeric  cu. inches  engine displacement
mpg            numeric  mi. / gal   vehicle miles per gallon

You have been asked to build a predictor for vehicle mileage (mpg) as a function of other vehicle characteristics.

In the starter code, we have provided an existing train/validation/test split of this dataset, stored on-disk in comma-separated-value (CSV) files: x_train.csv, y_train.csv, x_valid.csv, y_valid.csv, x_test.csv, and y_test.csv.

Get the data here: https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/hw1/data_auto

Problem 1: Polynomial Regression - Model Selection on a Fixed Validation Set

For this problem, use the provided training set and validation set:

  • x_train.csv and y_train.csv contain features and outcomes for 192 examples
  • x_valid.csv and y_valid.csv contain features and outcomes for 100 examples

Your goal is to determine which polynomial transformation yields the best predictive performance.

Follow the starter notebook. Your code should chain together the PolynomialFeatures and LinearRegression classes provided by sklearn.

Implementation Step 1A: Fit a linear regression model to a polynomial feature transformation of the provided training set at each of these possible degrees: [1, 2, 3, 4, 5, 6, 7]. For each hyperparameter setting, record the training set error and the validation set error.
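
A minimal sketch of this loop, assuming feature/outcome arrays x_tr_NF, y_tr_N, x_va_NF, y_va_N have been loaded from the CSV files, that calc_mean_squared_error comes from Task 1, and that the rescaling step mentioned in Short Answer 1c comes first (the starter notebook's exact pipeline construction takes precedence):

import sklearn.linear_model
import sklearn.pipeline
import sklearn.preprocessing

degree_list = [1, 2, 3, 4, 5, 6, 7]
tr_err_per_degree = []
va_err_per_degree = []
for degree in degree_list:
    pipeline = sklearn.pipeline.Pipeline([
        ('rescaler', sklearn.preprocessing.MinMaxScaler()),
        ('poly', sklearn.preprocessing.PolynomialFeatures(degree=degree, include_bias=False)),
        ('regr', sklearn.linear_model.LinearRegression()),
        ])
    pipeline.fit(x_tr_NF, y_tr_N)
    tr_err_per_degree.append(calc_mean_squared_error(y_tr_N, pipeline.predict(x_tr_NF)))
    va_err_per_degree.append(calc_mean_squared_error(y_va_N, pipeline.predict(x_va_NF)))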

Implementation Step 1B: Select the model hyperparameters that minimize your fixed validation set error. Using your already-trained LinearRegression model with these best hyperparameters, compute error on the test set. Save this test set error value for later.

Figure 1 in Report: Make a line plot of mean-squared error on y-axis vs. polynomial degree on x-axis. Show two lines, one for error on training set (use style 'b:', a dotted blue line) and another line for error on validation set (use style 'rs-', a solid red line with square markers). Set the y-axis limits between [0, 70].
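
A possible matplotlib sketch for this figure, reusing the lists recorded in the Step 1A sketch above:

import matplotlib.pyplot as plt

plt.figure()
plt.plot(degree_list, tr_err_per_degree, 'b:', label='train error')
plt.plot(degree_list, va_err_per_degree, 'rs-', label='validation error')
plt.xlabel('polynomial degree')
plt.ylabel('mean squared error')
plt.ylim([0, 70])
plt.legend()
plt.show()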

Short Answer 1a in Report: If your goal is to select a model that will generalize well to new data from the same distribution, which polynomial degree do you recommend based on this assessment? Are there other degrees that seem to give nearly the same performance?

Short Answer 1b in Report: At some point, the mean squared error on the training set should become very close to zero (say, within 0.5 or so). At what degree value do you observe this happening for this particular training dataset? What technical argument can you make to justify this (e.g. for this dataset, why is training error essentially zero at degree X but not at degree X-1)?

Short Answer 1c in Report: You'll notice that our pipelines include a preprocessing step that rescales each feature column to be in the unit interval from 0 to 1. Why is this necessary for this particular dataset? What happens (in terms of both training error and test error) if this step is omitted?

Short Answer 1d in Report: Consider the model with degree 6. Print out its intercept coefficient value, as well as the minimum and maximum weight coefficient values (out of all the features). What do you notice about these values? How might they be connected to the training and validation set performance you observe in Figure 1?

Problem 2: Polynomial Regression - Model Selection with Cross-Validation

For this problem, you'll again use the provided training and validation sets. However, you'll merge these into a larger "development" set that contains 292 examples total.

We'll then use 10-fold cross validation to obtain good estimates of heldout performance.

Your goal is again to determine which polynomial transformation yields the best predictive performance.

Follow the starter notebook. Your code should chain together the PolynomialFeatures and LinearRegression classes provided by sklearn.

Implementation Step 2A: For each possible polynomial degree used in Problem 1, train and evaluate a linear regression model across the entire train+validation set using 10-fold cross validation. Use the CV methods you implemented in cross_validation.py. Your 10-fold CV procedure will give you an estimate of the training error and heldout validation error (averaged across all folds).
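
A rough sketch of this step, assuming the arrays from the Problem 1 sketch are stacked into a development set and that make_degree_pipeline is a hypothetical helper that rebuilds the Problem 1 pipeline for a given degree:

import numpy as np

x_dev_NF = np.vstack([x_tr_NF, x_va_NF])
y_dev_N = np.concatenate([y_tr_N, y_va_N])

cv_tr_err_per_degree = []
cv_va_err_per_degree = []
for degree in degree_list:
    pipeline = make_degree_pipeline(degree)   # hypothetical helper; see Problem 1 sketch
    tr_err_per_fold, te_err_per_fold = train_models_and_calc_scores_for_n_fold_cv(
        pipeline, x_dev_NF, y_dev_N, n_folds=10, random_state=0)
    cv_tr_err_per_degree.append(np.mean(tr_err_per_fold))
    cv_va_err_per_degree.append(np.mean(te_err_per_fold))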

Implementation Step 2B: Select the model hyperparameters that minimize your estimated cross-validation error. Using these best hyperparameters, retrain the model using the full development set and then compute that (retrained) model's error on the test set. Save this test set error value for later.

Figure 2 in Report: Make a line plot of mean-squared error on y-axis vs. polynomial degree on x-axis. Show two lines, one for error on training set (use style 'bd:', a dotted blue line with diamond markers) and another line for error on validation set (use style 'rs-', a solid red line with square markers). Set the y-axis limits between [0, 70]. Your code should chain together the PolynomialFeatures and LinearRegression classes provided by sklearn, while using your own implementation of cross_validation.py.

Short Answer 2a in Report: If your goal is to select hyperparameters for your pipeline that will generalize well to new data from the same distribution, which polynomial degree do you recommend based on this assessment? Are there other degrees that seem to give nearly the same performance? What (if anything) changed from 1a?

Short Answer 2b in Report: What are two benefits of using cross validation when compared to a fixed validation set (as in Problem 1)?

Short Answer 2c in Report: What are two drawbacks to using cross validation when compared to a fixed validation set (as in Problem 1)?

Short Answer 2d in Report: Remember, your task is to develop models that will accurately predict miles per gallon given some basic features of car engines. Suppose your available data is augmented so that each example is associated with a specific manufacturer (e.g. each row of your development set could be labeled 'Toyota' or 'Ford' or 'Hyundai'). You have data labeled with 10 different manufacturers. You'd like your predictions to be accurate for new manufacturers that are not available in your training set. (Thus, your regression model should not use the manufacturer label as a feature.) How would you suggest we change our cross validation procedure to do better at this task?

Problem 3: Polynomial Regression with L2 Regularization - Model Selection with Cross-Validation

For this problem, you'll again use 10-fold CV to estimate heldout error.

However, now we'll look at a regularized version of linear regression, which may be better at avoiding overfitting.

Follow the starter notebook. Throughout Problem 3, your code should chain together the PolynomialFeatures and Ridge classes provided by sklearn.

Implementation Step 3A: Consider the following set of possible alpha hyperparameters for Ridge:

alpha_grid = np.logspace(-6, 6, 13) # 10^-6, 10^-5, 10^-4, ... 10^-1, 10^0, 10^1, ... 10^6

For each possible alpha value as well as each possible polynomial degree used in Problem 1, train and evaluate a Ridge regression model across the entire train+validation set using 10-fold cross validation. Use the CV methods you implemented in cross_validation.py. For each hyperparameter configuration (alpha value and degree value), your 10-fold CV procedure will give you an estimate of the training error and heldout validation error (averaged across all folds).
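
One way to organize this double loop, sketched under the same assumptions as the Problem 2 sketch (make_degree_ridge_pipeline is a hypothetical helper that chains the rescaler, PolynomialFeatures, and Ridge for a given degree and alpha):

import numpy as np

alpha_grid = np.logspace(-6, 6, 13)
cv_tr_err_grid = np.zeros((len(alpha_grid), len(degree_list)))
cv_va_err_grid = np.zeros((len(alpha_grid), len(degree_list)))
for aa, alpha in enumerate(alpha_grid):
    for dd, degree in enumerate(degree_list):
        pipeline = make_degree_ridge_pipeline(degree, alpha)   # hypothetical helper
        tr_err_per_fold, te_err_per_fold = train_models_and_calc_scores_for_n_fold_cv(
            pipeline, x_dev_NF, y_dev_N, n_folds=10, random_state=0)
        cv_tr_err_grid[aa, dd] = np.mean(tr_err_per_fold)
        cv_va_err_grid[aa, dd] = np.mean(te_err_per_fold)

# The (alpha, degree) pair with the smallest estimated heldout error (for Step 3B).
best_aa, best_dd = np.unravel_index(np.argmin(cv_va_err_grid), cv_va_err_grid.shape)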

Implementation Step 3B: Select the model hyperparameters that minimize your estimated cross-validation error. Using these best hyperparameters, retrain the model using the full development set (concatenating the predefined training and validation sets). Then compute that (retrained) model's error on the test set. Save this test set error value for later.

Figure 3 in Report: Show the training and validation error as a function of degree, for 3 possible alpha values: 1e-5, 0.1, and 1000. Make one figure with 3 subplots, one per alpha value (see example starter notebook). In each subplot, make a line plot with mean-squared error on y-axis vs. polynomial degree on x-axis. Show two lines, one for error on training set (use style 'bd:', a dotted blue line with diamond markers) and another line for error on validation set (use style 'rs-', a solid red line with square markers). Set the y-axis limits between [0, 70].

Short Answer 3a in Report: If your goal is to select hyperparameters for your pipeline that will generalize well to new data from the same distribution, which polynomial degree and alpha values do you recommend based on this assessment? Are there other values that seem to give nearly the same performance?

Short Answer 3b in Report: Your colleague suggests that you can determine the regularization strength alpha by minimizing the following loss on the training set:

$$ \text{min}_{w \in \mathbb{R}^F, b \in \mathbb{R}, \alpha \ge 0} \quad \sum_{n=1}^N (y_n - \hat{y}(x_n, w, b))^2 + \alpha \sum_{f=1}^F w_f^2 $$

What value of \(\alpha\) would you pick if you did this? Why is this problematic if your goal is to generalize to new data well?

Problem 4: Comparison of methods on the test set

Table 4 in Report:

In one neat table, please compare the test set root-mean-squared-error (RMSE) performance for the following regressors:

  • Baseline: A predictor that always guesses the mean \(y\) value of the training set, regardless of the new test input
  • The best LinearRegression pipeline selected to minimize fixed validation set error (from 1B)
  • The best LinearRegression pipeline selected to minimize 10-fold cross validation set error (from 2B)
  • The best Ridge pipeline selected to minimize 10-fold cross validation set error (from 3B)

Include a caption that summarizes the major conclusions that a reader should take away from this comparison.
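
A minimal sketch of the baseline row of this table, assuming y_tr_N and y_test_N are the training and test outcome arrays and that calc_mean_squared_error comes from Task 1:

import numpy as np

# Baseline: always predict the mean y value of the training set.
yhat_test_N = np.full(y_test_N.shape, np.mean(y_tr_N))
baseline_rmse = np.sqrt(calc_mean_squared_error(y_test_N, yhat_test_N))
print("Baseline test RMSE: %.3f" % baseline_rmse)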

Helpful hints and best practices for preparing a report

Across all the problems here, be sure that:

  • All plots include readable axes labels and legends if needed when multiple lines are shown.
  • All figures include captions providing complete sentence summaries of the figure.
  • Generally, all tables should only report floating-point values up to 3 decimal places in precision.
    • That is, if your error is 17.123456789, just display "17.123". Make it easy on your reader's eyes.