HW2: Cross-Validation and Regularization


Last modified: 2019-01-31 00:16

Status: Released.

Due date: Wed. Feb 6 at 11:59PM EST.

Turn-in links:

Files to Turn In:

ZIP file of source code should contain:

  • hw2.ipynb : Jupyter Notebook file containing your code and markup
  • COLLABORATORS.txt : a plain text file [example], containing
    • Your full name
    • An estimate of the hours you spent on each of Problem 1, Problem 2, and Problem 3
    • Names of any people you talked to for help (TAs, students, etc.). If none, write "No external help".
    • Brief description of what content you sought help about (1-3 sentences)

PDF report:

  • Please export your completed hw2.ipynb notebook as a PDF (the easiest way is probably via your browser: just do 'Print as PDF' or similar)
  • This document will be manually graded

Evaluation Rubric:

See the PDF submission portal on Gradescope for the point values of each problem. Generally, tasks with more coding/effort will earn more potential points.

Starter code:

See the hw2 folder of the public assignments repo for this class:

https://github.com/tufts-ml-courses/comp135-19s-assignments/tree/master/hw2

Jump to: Problem 1   Problem 2   Problem 3

Best practices

Across all the problems here, be sure that every plot has readable axis labels, and include a legend whenever multiple lines are shown on the same axes.

Problem 1: Polynomial Basis Model Selection: From Validation Sets to Cross-Validation

You have been given a data set containing gas mileage, horsepower, and other information for 395 makes and models of vehicles. For each vehicle, we have the following information:

column name    type     unit        description
horsepower     numeric  hp          engine horsepower
weight         numeric  lb.         vehicle weight
cylinders      numeric  #           number of engine cylinders, from 4 to 8
displacement   numeric  cu. inches  engine displacement
mpg            numeric  mi. / gal   vehicle miles per gallon

You have been asked to build a predictor for vehicle mileage (mpg) as a function of other vehicle characteristics.

In the starter code, we have provided an existing train/validation/test split of this dataset, stored on-disk in comma-separated-value (CSV) files: x_train.csv, y_train.csv, x_valid.csv, y_valid.csv, x_test.csv, and y_test.csv.

Get the data here: https://github.com/tufts-ml-courses/comp135-19s-assignments/tree/master/hw2/data_auto

We will train linear regression models that minimize mean-squared error throughout Problems 1 and 2.

From some preliminary experiments (like those in the ISL textbook), we see that using higher-order polynomial transformations of the input features results in improved predictive performance.

1a: Polynomial Degree Selection on Fixed Validation Set

Using sklearn, we can perform a polynomial transform via the PolynomialFeatures class.

For example, to make a degree 2 transformation of an input feature array x_tr_MF with \(M\) rows and \(F\) columns, we have:

    >>> poly_transformer = sklearn.preprocessing.PolynomialFeatures(degree=2, include_bias=False)
    >>> x_tr_MG = poly_transformer.fit_transform(x_tr_MF)

This creates a new set of \(G\) features, expanding the original \(F\) features to include all possible polynomial combinations of degree 2 or less. sklearn provides an easy way to see the names of the new features:

    >>> poly_transformer.get_feature_names(['horsepower', 'weight', 'cylinders', 'displacement'])

    ['horsepower',
     'weight',
     'cylinders',
     'displacement',
     'horsepower^2',
     'horsepower weight',
     'horsepower cylinders',
     'horsepower displacement',
     'weight^2',
     'weight cylinders',
     'weight displacement',
     'cylinders^2',
     'cylinders displacement',
     'displacement^2']

1a(i): For this dataset, where the feature size is \(F=4\), make a plot of the total number of polynomial features \(G\) when the degree is in [1, 2, 3, 4, 5, 6, 7, 8]. What kind of trend do you observe?
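If helpful, one possible way to count the expanded features is to fit the transformer on a dummy array and read off the output width. Below is a minimal sketch; the dummy array and the G_list name are just for illustration.

    import numpy as np
    import sklearn.preprocessing

    G_list = []
    for degree in [1, 2, 3, 4, 5, 6, 7, 8]:
        poly = sklearn.preprocessing.PolynomialFeatures(degree=degree, include_bias=False)
        x_dummy_1F = np.zeros((1, 4))  # F = 4 input features; the values don't matter for counting
        G_list.append(poly.fit_transform(x_dummy_1F).shape[1])
    print(G_list)  # number of expanded features G at each degree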

1a(ii): Fit a linear regression model to a polynomial feature transformation of the provided training set of \(x\), \(y\) values at each of these possible degrees: [1, 2, 3, 4, 5, 6]. Make a line plot of mean-squared error vs. polynomial degree on the training set (use style 'b:', a dotted blue line) and the validation set (use style 'rs-', a solid red line with square markers). Set the y-axis limits between [0,70]. Your code should chain together the PolynomialFeatures and LinearRegression classes provided by sklearn.
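A minimal sketch of this chaining, assuming the arrays x_tr_MF, y_tr_M, x_va_NF, and y_va_N have already been loaded from the provided CSV files:

    import sklearn.preprocessing
    import sklearn.linear_model
    import sklearn.metrics

    degree = 2  # repeat for each value in [1, 2, 3, 4, 5, 6]
    poly = sklearn.preprocessing.PolynomialFeatures(degree=degree, include_bias=False)
    x_tr_MG = poly.fit_transform(x_tr_MF)  # fit the transform on the training set only
    x_va_NG = poly.transform(x_va_NF)      # reuse the same transform on the validation set

    regr = sklearn.linear_model.LinearRegression()
    regr.fit(x_tr_MG, y_tr_M)

    tr_mse = sklearn.metrics.mean_squared_error(y_tr_M, regr.predict(x_tr_MG))
    va_mse = sklearn.metrics.mean_squared_error(y_va_N, regr.predict(x_va_NG))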

1a(iii): Based on this plot, which single degree value do you recommend? Why?

1a(iv): Report the numerical values of the 5th percentile and 95th percentile of the coefficients observed in your linear regression model for degrees 3, 4, 5, and 6. What seems to be happening?

1a(v): Comment on the training error observed at degree 6. Based on your plots from 1a(i) and your knowledge of linear regression, what should the training error be at degree 6? What do you think is happening instead?

1b: Rescaling Features

Although all our original features are positive values, they have very different numerical scales and ranges: a typical weight value is in the thousands, while a typical cylinders value is between 4 and 8. Especially when we take high-order polynomial combinations of features, we are likely to see extreme values that can lead to numerical problems.

To counteract this, we will rescale our numerical features in \(x\) to be between 0.0 and 1.0. We can use sklearn's convenient MinMaxScaler preprocessing tool to do this.
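For example, a minimal rescaling sketch (reusing the x_tr_MF and x_va_NF array names from above):

    import sklearn.preprocessing

    scaler = sklearn.preprocessing.MinMaxScaler()
    x_tr_rescaled_MF = scaler.fit_transform(x_tr_MF)  # learn each column's min/max on training data
    x_va_rescaled_NF = scaler.transform(x_va_NF)      # apply the same learned mapping to new data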

For best results, we want to apply rescaling twice, once to our original features, and once to the transformed polynomial features. This will ensure that the features fed into the polynomial featurizer have consistent scale, and the result of the polynomial featurizer also has consistent scale.

Tracking all these steps can be cumbersome, so we suggest using sklearn's Pipeline class. Inside the pipeline, we specify a series of steps to apply, in order, to an original dataset. After creating a pipeline, we can treat it like an encapsulated regression object with fit and predict methods: calling fit trains the whole pipeline, and calling predict applies it to new data.

    pipeline = sklearn.pipeline.Pipeline(
        steps=[
         ('rescaler', sklearn.preprocessing.MinMaxScaler()),
         ('poly_transformer', sklearn.preprocessing.PolynomialFeatures(degree=degree, include_bias=False)),
         ('poly_rescaler', sklearn.preprocessing.MinMaxScaler()),
         ('linear_regr', sklearn.linear_model.LinearRegression()),
        ])
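Once constructed, a fit/predict cycle might look like the following sketch (again assuming the x_tr_MF, y_tr_M, x_va_NF, y_va_N arrays from earlier):

    import sklearn.metrics

    pipeline.fit(x_tr_MF, y_tr_M)          # rescale, expand, rescale again, then fit the weights
    yhat_va_N = pipeline.predict(x_va_NF)  # the same preprocessing steps are replayed on new data
    va_mse = sklearn.metrics.mean_squared_error(y_va_N, yhat_va_N)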

1b(i): Fit a linear regression model to a rescaled polynomial feature transformation of the provided training set of \(x\), \(y\) values at each of these possible degrees: [1, 2, 3, 4, 5, 6]. Make a line plot of mean-squared error vs. polynomial degree on the training set (use style 'b:', a dotted blue line) and the validation set (use style 'rs-', a solid red line with square markers). Set the y-axis limits between [0,70].

1b(ii): Using this new analysis, which degree do you recommend?

1b(iii): Report the numerical values of the 5th percentile and 95th percentile of the coefficients observed in this most recent linear regression model for degrees 3, 4, 5, and 6. What seems to be happening? What's different than in 1a?

1b(iv): Comment on the training error observed at degree 6. Is this what we would expect? Why is this different than 1a?

1c: Tuning with cross validation

Using the same rescaling pipeline from 1b, we will now look at cross validation as a possible way to use our scarce training dataset more effectively.

First, you should stack all training and validation examples together:

    x_trva_LF = np.vstack([x_tr_MF, x_va_NF])
    y_trva_L = np.hstack([y_tr_M, y_va_N])

Coding Step 1/1: Complete the starter code function calc_mean_squared_error_across_k_folds, defined in the starter notebook.
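For orientation only, here is one possible sketch of the K-fold mechanics using sklearn's KFold splitter. The helper name and signature below (mse_per_fold) are hypothetical; follow the exact signature given in the starter notebook.

    import numpy as np
    import sklearn.model_selection
    import sklearn.metrics

    def mse_per_fold(pipeline, x_LF, y_L, n_folds=10, random_state=0):
        ''' Return an array of heldout MSE values, one per fold (hypothetical helper). '''
        kf = sklearn.model_selection.KFold(
            n_splits=n_folds, shuffle=True, random_state=random_state)
        fold_mse_list = []
        for tr_idx, te_idx in kf.split(x_LF):
            pipeline.fit(x_LF[tr_idx], y_L[tr_idx])
            yhat_te = pipeline.predict(x_LF[te_idx])
            fold_mse_list.append(sklearn.metrics.mean_squared_error(y_L[te_idx], yhat_te))
        return np.asarray(fold_mse_list)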

1c(i): Using your calc_mean_squared_error_across_k_folds function with 10 folds, make a line plot of the average mean-squared-error at degrees 1, 2, 3, 4, 5, 6.

1c(ii): Based on this plot, what is your recommended degree? How do your recommendations differ from 1b?

1c(iii): Fix the degree at 3. We might hope that using many folds lets us maximize our chances of fitting a complex model well (by using as much training data as possible in each fold). We'll look at the distribution of single-fold estimates as a function of the number of folds. For each number of folds \(K\) in the grid given below, make a scatter plot of the \(K\) fold-specific estimates of MSE (\(K\) is x-axis, MSE on y-axis). Also draw a line connecting the average MSE across \(K\).

    K_grid = [2, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140]
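One way to build this plot, sketched under the assumption that a fold-MSE helper like the hypothetical mse_per_fold above is available and that pipeline is your degree-3 pipeline:

    import numpy as np
    import matplotlib.pyplot as plt

    avg_mse_list = []
    for K in K_grid:
        fold_mse_K = mse_per_fold(pipeline, x_trva_LF, y_trva_L, n_folds=K)
        plt.plot(K * np.ones(K), fold_mse_K, 'k.', alpha=0.4)  # K fold-specific estimates at this K
        avg_mse_list.append(np.mean(fold_mse_K))
    plt.plot(K_grid, avg_mse_list, 'r.-')  # line connecting the average MSE at each K
    plt.xlabel('number of folds K')
    plt.ylabel('heldout MSE estimate')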

1c(iv): What happens to the distribution of the estimated heldout MSE as you use more and more folds? Is the trend observed in the plot from 1c(iii) what you would expect based on your reading of ISL Chapter 5?



Problem 2: L2 and L1 Regularization for Regression

2a: Grid search for L2 penalty strength

We'll use the same dataset, and now look at L2-penalized least-squares linear regression. In statistics, this is sometimes called "ridge" regression, so the sklearn implementation uses a regression class called Ridge, with the usual fit and predict methods.
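A minimal sketch of swapping Ridge in as the final step of the rescaling pipeline from 1b (the step names here are just illustrative; alpha is the L2 penalty strength):

    import sklearn.pipeline
    import sklearn.preprocessing
    import sklearn.linear_model

    alpha = 1.0  # try each value in alpha_grid
    pipeline = sklearn.pipeline.Pipeline(
        steps=[
         ('rescaler', sklearn.preprocessing.MinMaxScaler()),
         ('poly_transformer', sklearn.preprocessing.PolynomialFeatures(degree=2, include_bias=False)),
         ('poly_rescaler', sklearn.preprocessing.MinMaxScaler()),
         ('ridge_regr', sklearn.linear_model.Ridge(alpha=alpha)),
        ])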

2a(i): Train Ridge regression at a fine grid of 31 possible L2-penalty strengths \(\alpha\): alpha_grid = np.logspace(-9, 6, 31). Using degree 2 polynomial features, plot the MSE vs. regularization strength on both the validation set (use style 'rs-') and the training set (use style 'b:'). Because \(\alpha\) is log-scaled, show the base-10 log of \(\alpha\) on the x-axis.

2a(ii): Repeat the plot from 2a(i) with polynomial degree 6.

2a(iii): Describe how the recommended value of \(\alpha\) changes from degree 2 to 6.

2a(iv): For the grid of \(\alpha\) values we searched in 2a(ii), each one produced a different optimal weight vector. Produce a line plot (style 'k.-') showing the number of strictly non-zero entries in the weight vector on the y-axis vs. the base-10 log of \(\alpha\) on the x-axis. On the same axes, plot the number of weights with absolute value larger than 0.0001 (style 'b.-') and larger than 0.1 (style 'g.-').
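A minimal counting sketch, where w_G is a hypothetical name for the fitted coefficient vector (the model's coef_ attribute) at one \(\alpha\) value:

    import numpy as np

    n_nonzero = np.sum(w_G != 0)                 # strictly non-zero entries
    n_above_tiny = np.sum(np.abs(w_G) > 0.0001)  # entries with |w| > 0.0001
    n_above_big = np.sum(np.abs(w_G) > 0.1)      # entries with |w| > 0.1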

2b: Grid search for L1 penalty strength

Consider the sklearn implementation of L1-penalized linear regression, which is also known as Lasso regression.

2b(i): Train Lasso regression at a fine grid of 31 possible L1-penalty strengths \(\alpha\): alpha_grid = np.logspace(-9, 6, 31). Use degree 3 polynomial features. Plot the MSE vs. regularization strength on both the validation set (style 'rs-') and the training set (style 'b:'). Because \(\alpha\) is log-scaled, show the base-10 log of \(\alpha\) on the x-axis.

2b(ii): For the grid of \(\alpha\) values we searched in 2b(i), each one produced a different optimal weight vector. Produce a line plot (style 'k.-') showing the number of strictly non-zero entries in the weight vector on the y-axis vs. the base-10 log of \(\alpha\) on the x-axis. On the same axes, plot the number of weights with absolute value larger than 0.0001 (style 'b.-') and larger than 0.1 (style 'g.-').

2b(iii): Select the specific 5 models where \(\alpha\) equals \(0.01, 0.0316, 0.1, 0.316, 1.0\) (positions 14, 15, 16, 17, 18 in the log-spaced grid). Print out the non-zero coefficients, including their names and values.
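A minimal printing sketch, assuming poly_transformer and lasso are the fitted transformer and Lasso objects for one \(\alpha\) value (e.g. pulled out of a fitted pipeline's named steps):

    # poly_transformer and lasso are assumed to be the fitted transformer and Lasso model
    feat_names = poly_transformer.get_feature_names(
        ['horsepower', 'weight', 'cylinders', 'displacement'])
    for name, w in zip(feat_names, lasso.coef_):
        if w != 0:
            print("%-30s % .4f" % (name, w))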

Problem 3: Linear Regression Concept Questions

3a: Adding fake observations to a no-bias linear regression

You are given a training dataset for regression \(\{x_n, y_n \}_{n=1}^N\). Each feature vector \(x_n\) has size \(F=4\). Each \(y_n\) is a scalar.

You decide to add in the following 4 additional "fake" observations \(\tilde{x}\):

\begin{align} \tilde{x} = \left[ \begin{array}{c c c c} \sqrt{\lambda} & 0 & 0 & 0 \\ 0 & \sqrt{\lambda} & 0 & 0 \\ 0 & 0 & \sqrt{\lambda} & 0 \\ 0 & 0 & 0 & \sqrt{\lambda} \end{array} \right] \end{align}

You also add in 4 "fake" responses \(\tilde{y}\), all equal to zero.

You then wish to fit a least squares linear regression model with no bias/intercept term to the combined labeled dataset that is the union of \(x\) and \(\tilde{x}\), \(y\) and \(\tilde{y}\).
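For reference, the standard least-squares objective (no bias term) on the original \(N\) examples alone is

\begin{align} J(w) = \sum_{n=1}^{N} \left( y_n - w^T x_n \right)^2 \end{align}

Your answer to 3a(i) should extend this objective so that it also covers the \(F\) fake examples.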

3a(i): Write down the sum-of-squared-error objective that is minimized for this combined dataset of \(N + F\) examples, and simplify it as much as possible.

3a(ii): Inspect your solution. Does this "add fake data" process lead to an objective similar to something we've seen before in class?