Project A: Classifying Images with Feature Transformations


Last modified: 2020-10-27 12:50

Status: RELEASED.

Due date: Mon. Oct. 26 at 11:59PM AoE (Anywhere on Earth) (Tue 10/27 at 07:59am in Boston)

Updates

  • 2020-10-26 : Clarified that the Results section of your report should include discussion of your main findings and lessons learned, as well as the required figures (as the rubric always indicated)
  • 2020-10-20 : Added turn-in link for the Reflection Form
  • 2020-10-07 : Clarification of plots in Figure 1B

Overview

This is a four-week project with lots of open-ended programming. Get started right away!

Team Formation

In this project, you can work in teams of 2 people or (if you prefer) individually. Students working individually still need to complete all the parts below. We want to incentivize you to work in pairs.

If you need help finding teammates, please post to our "Finding a Partner for Project A" post on Piazza.

By the start of the second week (by end of day Mon 10/05), you should have identified your partner and signed up here:

  • ProjectA Team Formation Form
    • Please use your tufts.edu G Suite account. You must provide tufts.edu email addresses.
    • Even if you decide to work alone, you should fill this form out acknowledging that.

Work to Complete

As a team, you will work on one structured problem (Problem 1: 8-vs-9) and then one open-ended problem (Problem 2: Sneaker-vs-Sandal).

  • In Problem 1, you will demonstrate that you know how to train a logistic regression classifier and interpret the results.

  • In Problem 2, you will practice the development cycle of an ML practitioner:

    • Propose a reasonable ML pipeline (feature extraction + classifier)
    • Evaluate it and analyze the results carefully
    • Revise the pipeline and repeat

For Problem 2, we will maintain a leaderboard on Gradescope. You should periodically submit the predictions of your best model on the test set.

What to Turn In

Each team will prepare one PDF report for Problem 1:

  • Prepare a short PDF report (no more than 3 pages).
  • This document will be manually graded according to our rubric
  • Can use your favorite report writing tool (Word or G Docs or LaTeX or ....)
  • Should be human-readable. Do not include code. Do NOT just export a jupyter notebook to PDF.
  • Should have each subproblem marked via the in-browser Gradescope annotation tool

Each team will prepare one PDF report for Sneaker-vs-Sandal (Problem 2):

  • Prepare a short PDF report (no more than 5 pages).
  • This document will be manually graded according to our rubric
  • Can use your favorite report writing tool (Word or G Docs or LaTeX or ....)
  • Should be human-readable. Do not include code. Do NOT just export a jupyter notebook to PDF.
  • Should have each subproblem marked via the in-browser Gradescope annotation tool

Each team should submit one ZIP file of test-set predictions for the Sneaker-vs-Sandal Leaderboard (Problem 2):

  • yproba1_test.txt : plain text file
    • Each line contains the float probability that the corresponding test example should be classified as positive, given its features
    • Should be loadable into NumPy as a 1D array via this snippet: np.loadtxt('yproba1_test.txt')
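
For example, you could write this file with NumPy (a minimal sketch; yproba1_test here is a hypothetical 1D array holding your model's predicted probabilities on the test set):

import numpy as np

# yproba1_test : hypothetical (N,) array of predicted probabilities in [0.0, 1.0]
np.savetxt('yproba1_test.txt', yproba1_test)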

Multiple submissions to the leaderboard are allowed and even encouraged. By the deadline, make sure the "best" pipeline you design has its predictions submitted.

Problem 1: Logistic Regression for Image Classification of Handwritten Digits

Example 28x28 pixel images of "8s" and "9s", with a small amount of added noise. Each image is titled with its associated binary label.

We consider building a classifier to distinguish images of handwritten digits, specifically the digit 8 from the digit 9. You'll use the \(x\) and \(y\) examples provided in CSV files located in the data_digits_8_vs_9_noisy folder of the starter code.

https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/projectA/data_digits_8_vs_9_noisy

We extracted this data from the well-known MNIST dataset by LeCun, Cortes, and Burges. We have preprocessed it lightly for usability, and made the problem slightly more difficult by adding a small amount of random noise to each image.

Your challenge is to build a logistic regression classifier to distinguish '8' from '9'.

Each example image (indexed by integer \(i\)) has features \(x_i \in \mathbb{R}^F\) and label \(y_i \in \{0, 1\}\).

  • Each row of the provided x_train.csv file is the feature vector \(x_i\) for one image.
  • Each row of the provided y_train.csv file is the label \(y_i\) for the corresponding image.

The features \(x_i\) for the \(i\)-th image are the gray-scale pixel intensity values of a 28x28 pixel image. Each pixel's feature value varies between 0.0 (black) and 1.0 (bright white), with shades of gray possible in between. We've reshaped and flattened each 28x28 image so that it has a feature vector \(x_i\) of size 784 (= 28 x 28).

The label \(y_i\) for the \(i\)-th image is a binary value. Here, we've encoded the '8's as a 0 and '9's as a 1.
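
If you want to load this data yourself, here is a minimal sketch (it assumes each CSV stores one example per row; see show_images.py in the starter code for the course's own loading logic):

import numpy as np

# Assumes one example per row; adjust skiprows if the files contain a header row
x_train = np.loadtxt('data_digits_8_vs_9_noisy/x_train.csv', delimiter=',')
y_train = np.loadtxt('data_digits_8_vs_9_noisy/y_train.csv', delimiter=',')
print(x_train.shape)  # expect (9817, 784)
print(y_train.shape)  # expect (9817,)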

Preparation and Starter Code

You are given the following predefined labeled datasets:

  • "training" set of 9817 total examples of handwritten digits and their labels.
  • "validation" set of 1983 total examples, and their labels
  • "test" set of 1983 total examples (but no labels!)

We emphasize that throughout this project, your test set does not have labels available to you. Your goal is to build a classifier that generalizes well.

See the sample code show_images.py to get a sense of how to load these into Python and display the images.

https://github.com/tufts-ml-courses/comp135-20f-assignments/blob/master/projectA/show_images.py

In your PDF report, include the following sections:

1A : Dataset Exploration

Table 1A

In one table, summarize the composition of your training and validation sets.

Include 3 rows:

  • total number of examples
  • number of positive examples
  • fraction of positive examples

You should always make such a table, to understand your data and help explain any trends you are seeing.
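
One way to compute the table entries (a minimal sketch, assuming y_train and y_valid are loaded as 1D arrays of 0s and 1s):

import numpy as np

for name, y in [('train', y_train), ('valid', y_valid)]:
    n_total = y.size
    n_pos = int(np.sum(y == 1))
    # The three required rows: total count, positive count, fraction positive
    print(f"{name}: total={n_total}, positive={n_pos}, frac_positive={n_pos / n_total:.3f}")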

1B : Assess Loss and Error vs. Training Iterations

Using sklearn.linear_model.LogisticRegression, you should fit a logistic regression model to your training split.

Set C=1e6 and solver='lbfgs'. Leave other parameters at their default values. Explore what happens when we limit the number of iterations the solver is allowed to converge on its solution.

For the values i = 1, 2, 3, 4, ..., 39, 40, build a logistic regression model with max_iter=i. Fit each such model to the training data, and keep track of the following performance metrics on both the training and validation sets:

  • binary cross entropy (aka log loss)
  • error rate (can use sklearn.metrics.zero_one_loss).

You may safely ignore any warnings about lack of convergence.
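
A minimal sketch of this loop (it assumes x_train, y_train, x_valid, y_valid are already loaded):

import sklearn.linear_model
import sklearn.metrics

for i in range(1, 41):
    model = sklearn.linear_model.LogisticRegression(C=1e6, solver='lbfgs', max_iter=i)
    model.fit(x_train, y_train)
    # Log loss needs predicted probabilities; error rate needs hard predictions
    tr_loss = sklearn.metrics.log_loss(y_train, model.predict_proba(x_train)[:, 1])
    tr_err = sklearn.metrics.zero_one_loss(y_train, model.predict(x_train))
    # ... repeat for the validation set, and store all four metrics for plotting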

Figure 1B : Produce two plots side by side:

  • Left plot should show log loss (y-axis) vs iteration (x-axis), with two lines (one for training, one for validation)
  • Right plot should show error rate (y-axis) vs iteration (x-axis), with two lines (one for training, one for validation)

Place these plots into your PDF document, with appropriate captions.

Short Answer 1B : Below the plots, discuss the results you are seeing; what do they show, and why?

1C : Hyperparameter Selection

You should now explore a range of values for the penalty strength hyperparameter C (in sklearn, C is the inverse of the regularization strength, so smaller values impose a stronger penalty). You can fix max_iter = 1000.

C_grid = np.logspace(-9, 6, 31)

train_err, valid_err = [], []
for C in C_grid:
    # Build and fit a model for this C value (assumes the train/valid splits are loaded)
    model = sklearn.linear_model.LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    model.fit(x_train, y_train)
    # Record training and validation set error rate
    train_err.append(sklearn.metrics.zero_one_loss(y_train, model.predict(x_train)))
    valid_err.append(sklearn.metrics.zero_one_loss(y_valid, model.predict(x_valid)))

Figure 1C : Produce a plot of the training and validation error rates as a function of C. Which value of C should you select?

1D : Analysis of Mistakes

For the selected model from 1C, we might wonder if there is any pattern to the examples the classifier gets wrong.

Figure 1D : Produce two plots, one consisting of 9 sample images that are false positives on the validation set, and one consisting of 9 false negatives. You can display the images by converting the pixel data using the matplotlib function imshow(), using the Greys colormap, with vmin=0.0 and vmax=1.0. Place each plot into your PDF as a properly captioned figure.
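
A minimal sketch for collecting and displaying false positives (it assumes a fitted model and loaded validation arrays; false negatives are analogous):

import matplotlib.pyplot as plt
import numpy as np

yhat_valid = model.predict(x_valid)
fp_ids = np.flatnonzero((yhat_valid == 1) & (y_valid == 0))  # predicted '9', truly '8'
fig, axes = plt.subplots(3, 3)
for ax, i in zip(axes.flatten(), fp_ids[:9]):
    ax.imshow(x_valid[i].reshape(28, 28), cmap='Greys', vmin=0.0, vmax=1.0)
    ax.set_xticks([]); ax.set_yticks([])
plt.show()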

Short Answer 1D : Discuss the results you are seeing in Figure 1D. What kinds of mistakes is the classifier making?

1E : Interpretation of Learned Weights

For the selected model from 1C, what has it really learned? One way to understand this model is to visualize the learned weight coefficient that will be applied to each pixel in a 28 x 28 image.

Figure 1E :

Reshape the weight coefficients into a (28 × 28) matrix, corresponding to the pixels of the original images, and plot the result using imshow(), with colormap RdYlBu, vmin=-0.5, and vmax=0.5. Place this plot into your PDF as a properly captioned figure.
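
A minimal sketch (it assumes a fitted sklearn LogisticRegression named model; for binary problems, coef_ has shape (1, 784)):

import matplotlib.pyplot as plt

w_image = model.coef_[0].reshape(28, 28)  # one learned weight per pixel
plt.imshow(w_image, cmap='RdYlBu', vmin=-0.5, vmax=0.5)
plt.colorbar()
plt.show()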

Short Answer 1E

  • Which pixels have negative weights, so that high intensity values at those pixels push the prediction toward the negative class ('8')?
  • Which pixels have positive weights, so that high intensity values at those pixels push the prediction toward the positive class ('9')?
  • Why do you think this is the case?

Problem 2: Sneaker vs Sandal Image Classification

Example 28x28 pixel images of "sneakers" and "sandals". Each image is titled with its associated binary label.

In this open-ended problem, you'll take on another image classification problem: sneakers vs. sandals.

Each input image has the same \(x\) feature representation as the MNIST digits problem above (each example is a 28x28 image, which is reshaped into a 784-dimensional vector). However, now we have images of "sneakers" and "sandals". These are from the larger Fashion MNIST dataset, made public originally by Zalando Research.

Get the data here:

https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/projectA/data_sneaker_vs_sandal

Now, we'd like you to build a binary classifier that works on this new task. Can you achieve an error rate that reaches the top of the leaderboard? Can you understand what it takes for an image classifier to be successful on this task?

You'll be evaluated much more on your PROCESS than on your results. A well-designed experiment and careful evaluation will earn more points than getting to a perfect error rate without an ability to justify what you've done.

Your challenge is open-ended: you're not restricted to using only the 784 pixel values as features \(x_n\) for image \(n\).

Instead, you should spend most of your time engineering your own feature transform \(\phi(x_n)\). You are free to use ANY feature transform you can think of to turn the image into a feature vector.

Ideas for better feature transformations

Be creative! If you're stuck, here are a few ideas to consider:

  • does the total number of pixels that are "on" help distinguish sandals from sneakers? (See the sketch after this list.)

  • how can you identify when multiple pixels are on (bright white) simultaneously?

  • how can you capture spatial trends (e.g. across rows or columns) that are relevant to the sneaker vs. sandal problem?

  • can you measure the "holes" (blocks of dark pixels) that are more common within sandals?
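
As one example, the first idea above might look like this simple transform (a sketch; the name frac_pixels_on and the 0.5 brightness threshold are hypothetical choices, not requirements):

import numpy as np

def frac_pixels_on(x_NF, thresh=0.5):
    ''' Return the fraction of pixels brighter than thresh, one value per image. '''
    return np.mean(x_NF > thresh, axis=1, keepdims=True)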

Other ideas for improving your performance

  • You'll need to decide how to do hyperparameter selection. Will you use a fixed validation set? Something else? What candidate values will you choose, and why?

  • You could explore data augmentation (can you augment your existing training set by transforming existing images in helpful ways? For example, if you flip each image horizontally, would that be a useful way to "double" your training set size? See the sketch after this list.)

  • You could explore better regularization (try L1 vs L2 penalties for LogisticRegression, or look at alternatives)

  • You could explore better optimization (does the solver you choose for LogisticRegression matter?)
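
A sketch of the horizontal-flip augmentation idea above (it assumes x_train has shape (N, 784) and y_train has shape (N,); flipping a shoe left-to-right should not change its label):

import numpy as np

x_images = x_train.reshape(-1, 28, 28)
x_flipped = x_images[:, :, ::-1].reshape(-1, 784)  # mirror each image left-right
x_train_aug = np.vstack([x_train, x_flipped])
y_train_aug = np.concatenate([y_train, y_train])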

Required Experiments

Your report should summarize experiments on at least 3 possible classifier pipelines:

  • 0) Baseline: raw pixel features, fed into a Logistic Regression classifier
    • You should use sklearn.linear_model.LogisticRegression
    • You should carefully justify all hyperparameters, and select at least one complexity hyperparameter via grid search
  • 1) A feature transform of your own design, fed into a Logistic Regression classifier

    • You should write your own transform functions, or use sklearn.preprocessing
    • You should use sklearn.linear_model.LogisticRegression
    • You should carefully justify all hyperparameters, and select at least one complexity hyperparameter via grid search
  • 2) Another feature transform of your own design, fed into a Logistic Regression classifier or some other classifier (e.g. KNeighborsClassifier)

    • You should write your own transform functions, or use sklearn.preprocessing
    • You should carefully justify all hyperparameters, and select at least one complexity hyperparameter via grid search
    • If you choose sklearn.neighbors.KNeighborsClassifier, think carefully about the distance function you use
    • You can use any classifier in sklearn, but you must understand it and be able to talk about it professionally in your report

For all classifiers, we strongly recommend using sklearn Pipelines to manage everything. For simple examples of pipelines, see Part 5 of the day04 lab notebook. For examples tailored to the image classification task, see our starter code.
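
A minimal pipeline sketch (the StandardScaler step and the C value here are placeholder choices, not recommendations; swap in your own transformer for Methods 1 and 2):

import sklearn.linear_model
import sklearn.pipeline
import sklearn.preprocessing

pipeline = sklearn.pipeline.Pipeline([
    ('transform', sklearn.preprocessing.StandardScaler()),
    ('classifier', sklearn.linear_model.LogisticRegression(C=1.0, max_iter=1000)),
])
pipeline.fit(x_train, y_train)  # assumes loaded training arrays
yproba_valid = pipeline.predict_proba(x_valid)[:, 1]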

Data Usage Restrictions

You should not use any additional Fashion MNIST data from other sources, only that provided for you in the starter code.

Code Usage Restrictions

For this project, any code in our standard environment comp135_2020f_env is fair game.

If you really want to use some other library, you can ask instructors on Piazza. You'll need permission to proceed.

What goes in your report PDF for Problem 2

Your entire report for Problem 2 should be between 2 and 5 total pages, including all figures and text.

Section 1: Methods for Sneaker-Sandal

First, in your report, provide 1 paragraph describing your experimental design:

  • How did you divide the provided labeled set to develop models?
  • Did you make separate "train" and "validation" sets? How did you select the sizes of these sets?
  • Did you do cross-validation? Why or why not?

Next, in your report, for each of the 3 methods, provide 1 paragraph of concise but complete description of that method. A strong methods paragraph will answer each of these prompts:

  • Describe the feature transformation you tried and why you thought it would work
  • Describe your model fitting process: what were the parameters, and how were they fit? Are there concerns about overfitting? About convergence of the optimization?
  • Describe your hyperparameter selection process: what were hyperparameters, and how were they selected? What candidate values were considered? What performance metric did you try to optimize?

Section 2: Results for Sneaker-Sandal

In the results section of your report, please include the following, with appropriate discussion in the main text:

  • One figure showing evidence of hyperparameter selection to avoid overfitting for your baseline model (you should look at these for all models, but you can just use your baseline model here). Usually, this figure will plot training and heldout error vs one hyperparameter (like we did in HW1).

  • One figure comparing the training-set and heldout-set performance of your 3 models using ROC curves (one plot for training, one plot for heldout; each plot shows an ROC curve for each of the 3 models). See the sketch after this list.

  • One figure showing analysis of mistakes (several example false positives and false negatives on heldout set) for your preferred single model and one other baseline or alternative.
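
A minimal ROC sketch for the comparison figure above (pipe0, pipe1, pipe2, x_heldout, and y_heldout are hypothetical names for your three fitted pipelines and your heldout split):

import matplotlib.pyplot as plt
import sklearn.metrics

fig, ax = plt.subplots()
for name, pipe in [('baseline', pipe0), ('method 1', pipe1), ('method 2', pipe2)]:
    yproba = pipe.predict_proba(x_heldout)[:, 1]
    fpr, tpr, _ = sklearn.metrics.roc_curve(y_heldout, yproba)
    ax.plot(fpr, tpr, label=name)
ax.set_xlabel('false positive rate')
ax.set_ylabel('true positive rate')
ax.legend()
plt.show()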

Rubric for Evaluating PDF Report

This should match the rubric outline on Gradescope for the PDF report. If for any reason there is a conflict, the official problem weights on Gradescope will be used.

  • 25 points for Problem 1 Report
    • 2 for 1A
    • 8 for 1B
    • 5 for 1C
    • 5 for 1D
    • 5 for 1E
  • 60 points for Problem 2 Report

    • 10 points for experimental design paragraph
    • 6 points for description of Method 0
    • 6 points for description of Method 1
    • 6 points for description of Method 2
    • 10 points for figure(s) showing hyperparameter selection, with discussion
    • 12 points for figure(s) showing ROC curves, with discussion
    • 10 points for figure(s) showing analysis of mistakes, with discussion
  • 15 points for Problem 2 Leaderboard

    • 10 points for submission with at least 'bare minimum' satisfactory error rate (as good as our Baseline Method 0)
    • 5 points for submission within 0.02 points of the best error rate