Project C: Recommendation Systems via Matrix Factorization


Last modified: 2020-12-20 16:04

Status: RELEASED.

Updates

  • 2020-12-07 : Updated Bonus problem to include a new short answer question "5c" about responsible AI

Due date

  • Leaderboard closes on Fri. Dec. 18 at 11:59pm AoE (Anywhere on Earth).
  • Report is due Mon. Dec. 21 at 11:59pm AoE (Anywhere on Earth). (Tue 12/22 at 07:59am in Boston).

No late days allowed due to the strict end of the semester.

Jump to:

  Background   Code   Datasets   Problem 1   Problem 2   Problem 3   Problem 4   Bonus Problem   Rubric for Evaluating PDF Report


Starter Code and Dataset Links:

https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/projectC

Overview

This is a four-week project with lots of open-ended programming. Get started right away!

Team Formation

In this project, you can work in teams of 2 people, or (if you prefer) individually. If you work individually, you still need to complete all the parts below; we encourage you to work in pairs.

If you need help finding teammates, please post to our "Finding a Partner for Project C" post on Piazza.

By the start of the second week (by end of day Wed 12/02), you should have identified your partner.

Work to Complete

As a team, you will work on several problems that study different representations of recommendation systems.

  • Problem 1 looks at a baseline factorization model with one scalar parameter.
  • Problem 2 looks at a baseline factorization model with one learned scalar per item.
  • Problem 3 looks at a factorization model with a learned vector embedding per item.

Throughout Problems 1, 2, and 3, you will practice the development cycle of models trained with SGD:

    • Develop a coherent loss function for your application
    • Use automatic differentiation toolboxes to compute the gradient
    • Attempt multiple runs of SGD to train models until you get a model with satisfactory performance
    • Select among several model complexity hyperparameters to get the best generalization performance

Then, for Problem 4, you have much more open-ended freedom.

  • Build the best model you can, and try to get to the top of the leaderboard!

What to Turn In

Each team will prepare one PDF report covering all problems below.

  • Prepare a short PDF report (no more than ~7 pages).
  • This document will be manually graded according to our rubric
  • Can use your favorite report writing tool (Word or G Docs or LaTeX or ....)
  • Should be human-readable. Do not include code. Do NOT just export a jupyter notebook to PDF.
  • Should have each subproblem marked via the in-browser Gradescope annotation tool

Each team will prepare one ZIP file of the leaderboard-test-set predictions as a leaderboard submission for Problem 4. This ZIP file will contain just one file:

  • predicted_ratings_leaderboard.txt : a plain text file
    • One line for each line in the released dataset ratings_masked_leaderboard_set.csv
    • Can be loaded in with np.loadtxt() as a valid 1d array of floats with shape (10000,)

Each individual will turn in a reflection form (after completing the report).

Background

For this project, you are given a large dataset of the ratings that 943 users have given to 1682 movies.

We'd like to build a recommendation system to help guess which movies a user will like.

Input: The input to our prediction system is a pair \(i,j\), which denotes a specific user id (denoted by index \(i\), one of 943 possibilities) and movie id (denoted by index \(j\), one of 1682 possibilities).

Output: The output of our predictor (e.g. the quantity produced by calling model.predict(i,j)) will be a scalar rating \(\hat{y}_{ij} \in \mathbb{R}\). For each possible user-movie pair \(i,j\), we'd like \(\hat{y}_{ij}\) to be as close as possible to the "real" 5-star rating \(y_{ij}\) that the user \(i\) gave the movie \(j\). We can represent any five star rating with an integer in the set \(\mathcal{Y} = \{1, 2, 3, 4, 5\}\): a rating of 1 is the worst possible, a rating of 5 is the best.

We can think of our entire set of observed ratings as a big 2D matrix \(Y\), which has 943 rows and 1682 columns. Formally, we can write \(Y \in \mathcal{Y}^{943 \times 1682}\).

Of course, not all users have seen and rated all movies, so some true ratings are simply unknown to us. Thus, if we examined our true rating matrix \(Y\), most entries would be missing. It is impossible to know if our guesses for missing entries are any good.

We thus concentrate only on the observed entries of \(Y\). We can represent our dataset more compactly as a list \(\mathcal{I}\) of all distinct observed pairs of input (user_id \(i\), item_id \(j\)) and output (rating \(y_{ij}\)) in the observed dataset.

Dataset

We will use the MovieLens 100K dataset. This data set consists of:

  • 100,000 ratings (1-5 stars) from 943 users on 1682 movies.
  • Each user has rated at least 20 movies.
  • Some movies have ratings from only a few users.
  • Ratings were collected on a 5-star scale. Each rating is one of 5 possible integer values -- 1, 2, 3, 4, or 5 -- with 5 being 'best' and 1 being 'worst'.
  • Simple demographic info for the users (age, gender, etc.) is available

For more information about the original dataset, see http://files.grouplens.org/datasets/movielens/ml-100k-README.txt. We are grateful to the dataset creators and the University of Minnesota for making this data publicly available.

We've provided a clean preprocessed version of this dataset here: https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/projectC/data_movie_lens_100k

We've provided one CSV file, ratings_all_development_set.csv, to give you all the data you need to develop and evaluate models. (Another CSV file for the leaderboard will be discussed later in Problem 4).

Each row of this file specifies 3 things:

  • user_id \(i\) : an integer in \(\{0, 1, 2, ... 942\}\)
  • item_id \(j\) : an integer in \(\{0, 1, 2, ... 1681\}\)
  • observed 5-star rating \(y_{ij}\) : an integer in \(\{1, 2, 3, 4, 5\}\)

The first few rows of ratings_all_development_set.csv are:

user_id,item_id,rating
772,36,3
471,228,5
641,401,4
312,98,4
...
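
If you want to inspect this file directly (outside the provided loader in train_valid_test_loader.py), a minimal sketch using pandas could look like the lines below. The relative path is an assumption based on the repository layout; adjust it to wherever you placed the data.

# Load the development ratings into plain numpy arrays (sketch only).
import pandas as pd

ratings_df = pd.read_csv('data_movie_lens_100k/ratings_all_development_set.csv')
user_ids = ratings_df['user_id'].values   # integers in {0, ..., 942}
item_ids = ratings_df['item_id'].values   # integers in {0, ..., 1681}
ratings = ratings_df['rating'].values     # integers in {1, ..., 5}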

Train/validation/test splits: As usual, we can take the large development dataset \(\mathcal{I}\) (which is provided to you in random order) and divide it into training, validation, and test sets: \(\mathcal{I}^{\text{train}}\), \(\mathcal{I}^{\text{valid}}\), and \(\mathcal{I}^{\text{test}}\). See the starter code in train_valid_test_loader.py, which you can use throughout Problems 1-5.

Background: Evaluation Performance Metric

Your goal is to predict the ratings of all user-movie pairs well.

To measure and ultimately rank models by prediction quality, we'll use mean absolute error (MAE). This is useful for our five-star rating task, since it is nicely interpretable (e.g. an MAE of 0.5 means we're within a 1/2-star of the answer on average). Plus, it is not overly sensitive to outliers.

For a given dataset \(\mathcal{I}\) of triples \(i, j, y_{ij}\), recall that MAE is computed as:

$$ \text{MAE}(y, \hat{y}) = \frac{1}{|\mathcal{I}|} \sum_{i,j \in \mathcal{I}} | y_{ij} - \hat{y}_{ij} | $$
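
In code, this metric is a one-liner; here is a small reference sketch in numpy (the helper name is ours, not part of the starter code):

import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Both inputs are 1D arrays of ratings aligned entry-by-entry.
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# Predictions that are each off by half a star give an MAE of 0.5
print(mean_absolute_error([4, 2, 5], [3.5, 2.5, 4.5]))   # prints 0.5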

Background: SGD, Automatic Differentiation and autograd

Across Problem 1, Problem 2, and Problem 3, you'll develop your own Python code to build a series of increasingly more powerful models to perform recommendation. For each model, we will view training as an optimization problem, and we'll solve it with stochastic gradient descent.

Recall that we spent HW3 analyzing an existing sklearn implementation of SGD for MLP classifiers. You'll now get some experience writing your own loss function and using SGD to optimize that loss.

How do we use SGD for our problem? At each update step, you'll grab a minibatch of data (a random subset of the observed entries in our ratings matrix \(Y\)), and compute gradient estimates for parameters with respect to this batch.

How will we compute gradients? We've provided some starter code to help you. To save lots of effort and help us explore many possible models, we'll use the autograd Python package to perform automatic differentiation.

For a solid introduction to using autograd, see the Autograd For Gradient Descent lab from Monday 11/30
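
If autograd is new to you, the tiny toy sketch below (unrelated to the movie data and not part of the starter code) shows the overall pattern you'll reuse: store parameters in a dict of autograd.numpy arrays, ask autograd for a gradient function of your loss, and take repeated gradient steps.

# Toy example: fit the slope w in yhat = w * x by gradient descent.
import autograd.numpy as ag_np
from autograd import grad

x = ag_np.asarray([1.0, 2.0, 3.0, 4.0])
y = ag_np.asarray([2.1, 3.9, 6.2, 8.1])

def calc_loss(param_dict):
    yhat = param_dict['w'] * x
    return ag_np.sum((y - yhat) ** 2)

calc_grad = grad(calc_loss)                # autograd builds the gradient function
param_dict = {'w': ag_np.asarray(0.0)}
for step in range(200):
    grad_dict = calc_grad(param_dict)      # dict with the same keys as param_dict
    param_dict = {'w': param_dict['w'] - 0.01 * grad_dict['w']}
print(param_dict['w'])                     # settles near 2.0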

Starter Code organization

For each model of interest (Problem 1, Problem 2, Problem 3, all defined mathematically below), we have created a separate file defining the Python class for that model. In your starter code, see: CollabFilterMeanOnly.py, CollabFilterOneScalarPerItem.py, and CollabFilterOneVectorPerItem.py.

Your task is to define several methods for each model class, which allow us to make predictions given fixed parameters and to compute the loss used for gradient-based training of those parameters.

All these models will be subclasses of a "base" class AbstractBaseCollabFilterSGD, which contains logic to construct and fit collaborative filtering models to data via stochastic gradient descent. Using this base class, you don't need to write SGD yourself, or even work out how to compute the gradient! Please do read through this base class carefully to understand how it works.

Below, we review the methods/attributes you'll need to write yourself, and the methods/attributes you only need to use that are provided in the base class.

Methods you'll need to write

To complete the implementation of a given model, we'll follow this pattern:

  • Use the instance attribute param_dict to store all model parameters
param_dict : dict
    Keys are string names of parameters
    Values are *autograd.numpy arrays* of parameter values    
  • Define method predict to make predictions:
    • Input: Specific user-movie example pairs, indicated by integer ids
    • Output: Predicted ratings for each example
  • Define method calc_loss_wrt_parameter_dict to compute the loss to minimize with SGD:

    • Input: a minibatch of training data, a parameter dict
    • Output: scalar float, indicating the loss on the batch given the parameters
  • Define method init_parameter_dict to initialize the param_dict attribute to random values:

    • Input: Number of possible users, Number of possible items
    • Output: None, internal attribute param_dict updated

The steps above are all you need to do for each possible model. Each class inherits a complete fit method from the predefined AbstractBaseCollabFilterSGD, which knows how to perform SGD given the pieces above.
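
To make the expected structure concrete, here is an illustrative skeleton (not a solution, and not the exact starter-code signatures, which you should copy from the provided files):

# Illustrative skeleton only -- argument names are placeholders; follow the
# actual method signatures given in the starter code files.
from AbstractBaseCollabFilterSGD import AbstractBaseCollabFilterSGD

class MyCollabFilterModel(AbstractBaseCollabFilterSGD):

    def init_parameter_dict(self, n_users, n_items):
        # Set self.param_dict to a dict of randomly initialized
        # autograd.numpy arrays, one entry per named parameter.
        raise NotImplementedError   # TODO

    def predict(self, user_ids, item_ids, **param_dict):
        # Return a 1D array of predicted ratings, one per (user, movie) pair.
        raise NotImplementedError   # TODO

    def calc_loss_wrt_parameter_dict(self, param_dict, batch_tuple):
        # Return a scalar float: the training loss on this minibatch
        # under the given parameters.
        raise NotImplementedError   # TODO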

Methods you'll need to understand and use (but not edit)

You should read through the provided AbstractBaseCollabFilterSGD.py, to be sure you understand what's going on.

  • __init__ : Constructor

This is where the user defines the batch_size, the step_size (learning rate), and the number of epochs n_epochs to complete during training. This constructor can also define the regularization strength alpha and the number of hidden factors n_factors for Problem 3.

  • fit : Method to fit model parameters to provided data

At the bottom of each starter code file, you can see example code for calling fit for that model.

When the user calls fit, the model parameters are initialized to random values, and then updated iteratively via SGD to improve the loss. These updates proceed until the desired number of epochs are performed.

Epoch - A Definition: An epoch is a unit of training progress in minibatch learning. One epoch is complete when our stochastic gradient descent has processed enough minibatches such that the total number of examples seen is "equivalent" to the size of the entire training dataset. For example, if the training split contained 50,000 ratings and batch_size were 1000, one epoch would correspond to 50 minibatch updates.

Within our fit implementation, we compute and store performance metrics at various checkpoints throughout the training process, including:

  • at the initial parameters (before any updates)
  • every 1/4 epoch for epochs 0 - 2
  • every 1/2 epoch for epochs 2 - 8
  • every epoch for epochs 8 - 32
  • every 2 epochs for epochs 32 - 128
  • every 4 epochs after that

These metrics help us monitor progress as learning proceeds, which is especially rapid in early epochs as the poor random initialization is improved.

Attributes available after calling fit that trace performance

We have recorded and stored useful diagnostic metrics computed at various checkpoints throughout the training procedure. These help us "trace" what happens throughout learning, so we call them trace performance metrics.

After fitting a model, you'll have the following trace attributes available to you:

trace_epoch : 1D array-like
    Contains the epochs (fractional) where model performance was assessed.
    Value of 0.0 indicates the initial model parameters (before any gradient updates).
    Value of 0.1 indicates that the total number of training examples seen so far equals 10% of the training set size.

trace_loss : 1D array-like
    Contains training loss (at current batch only) whenever model was assessed.
    This is reported as an average per example in the current batch.

trace_mae_train : 1D array-like
    MAE assessed on entire training set whenever model assessed.

trace_mae_valid : 1D array-like
    MAE assessed on entire validation set whenever model assessed.

So for example, to plot training MAE vs. epochs completed, you could do:

import matplotlib.pyplot as plt

# After calling fit...
plt.plot(model.trace_epoch, model.trace_mae_train, '.-')

Starter Code and Code Restrictions

For this assignment, you are limited to the following Python packages for performing machine learning related functionality:

  • All Problems: Default packages in your environment: numpy, scipy, sklearn, etc.
  • All Problems: autograd package for automatic differentiation
  • Problem 4 onward: surprise package for recommendation, or any other packages you want

You can INSTALL the surprise package as follows:

conda activate comp135_2020f_env
conda install -c conda-forge scikit-surprise

The course staff found in late November 2020 that scikit-surprise version 1.1.1 worked fine with the existing comp135 environment.

For any package you use, please consult the documentation websites or other external web resources. However, you should understand every line of the code you use and not simply copy-paste without thinking carefully.

Remember to keep the course collaboration policy in mind: do your own work! (with/without a partner as required).

Problem 1: Simple Baseline Model with SGD and Autograd

To get used to developing models using our autograd framework, we'll consider the simplest possible baseline model "M1": a model that makes the same scalar prediction for a movie's rating no matter what user or movie is considered. This model has one scalar parameter \(\mu \in \mathbb{R}\), and the prediction for user \(i\) and movie \(j\) is simply:

$$ \hat{y}_{ij} = \mu $$

Training Model M1

To train model M1 for \(N\) users and \(M\) movies, we wish to optimize this squared error training objective:

$$ \min_{\mu \in \mathbb{R}} \sum_{i,j \in \mathcal{I}^{\text{train}}} (y_{ij} - \mu)^2 $$

In words, this means we want to minimize the squared error on the training set, between the predicted rating (simply \(\mu\) here) and the observed rating \(y_{ij}\).

Problem 1 Code Implementation Tasks

Edit the starter code file: CollabFilterMeanOnly.py. Complete each required method (init_parameter_dict, predict, and calc_loss_wrt_parameter_dict), as described in the background section above.

Problem 1 Analysis Tasks

Using SGD, train M1 with two different settings of batch size: 10000 examples and 100 examples.

For the best of these runs, please record the resulting M1 model's:

  • value of parameter \(\mu\)
  • mean absolute error on the validation set
  • mean absolute error on the test set

Problem 1 Report Tasks

As evidence of successfully completing Problem 1, include the following in your report:

1a: Figure and caption: Trace plots showing mean absolute error vs. epoch completed for your SGD training runs.

In two side-by-side plots, you should compare the two batch sizes:

  • Left plot: 10000 examples per batch
  • Right plot: 100 examples per batch
  • Each plot should have two lines (one for training MAE, one for validation MAE)

Please adjust the y axis to focus on what happens after epoch 2. Avoid showing a plot where you cannot see some difference between the two curves.
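
One possible way to arrange the two panels is sketched below, using matplotlib and the trace attributes described in the background section; the function and argument names are ours, and your two fitted model objects are passed in.

import matplotlib.pyplot as plt

def plot_mae_traces_side_by_side(model_big_batch, model_small_batch):
    # Left panel: batch_size=10000; right panel: batch_size=100.
    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    pairs = [(model_big_batch, 'batch_size=10000'),
             (model_small_batch, 'batch_size=100')]
    for ax, (model, title) in zip(axes, pairs):
        ax.plot(model.trace_epoch, model.trace_mae_train, '.-', label='train MAE')
        ax.plot(model.trace_epoch, model.trace_mae_valid, '.-', label='valid MAE')
        ax.set_title(title)
        ax.set_xlabel('epoch')
    axes[0].set_ylabel('MAE')
    axes[0].legend()
    # Tip: call axes[0].set_ylim(...) so the curves after epoch 2 are distinguishable.
    plt.show()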

1b: Short answer: There is a closed-form operation we could apply to the training set to compute the optimal \(\mu\) value (e.g. a one line computation in numpy involving the observed training ratings \(Y\)). How would you compute this "exact" solution? Report the computed optimal \(\mu\) value. Does this result agree with your SGD solution?

Problem 2: One-Scalar-Per-Item Baseline with SGD and Autograd

We now consider a second model "M2" with three types of parameters:

  • \(\mu\) : scalar mean rating
  • \(b_i\) : scalar bias term for each user \(i\)
  • \(c_j\) : scalar bias term for each movie \(j\)

Prediction under model "M2" becomes:

$$ \hat{y}_{ij} = \mu + b_i + c_j $$

Training Model M2

To train model M2 for \(N\) users and \(M\) movies, we wish to optimize the following squared error objective:

$$ \min_{\mu \in \mathbb{R}, b \in \mathbb{R}^N, c \in \mathbb{R}^M} \sum_{i,j \in \mathcal{I}^{\text{train}}} (y_{ij} - \mu - b_i - c_j)^2 $$

In words, this means we want to minimize the squared error on the training set, between the predicted rating and the observed rating \(y_{ij}\).

Problem 2 Code Implementation Tasks

Edit the starter code file: CollabFilterOneScalarPerItem.py. Complete each required method, as described above in the background section.

Problem 2 Analysis Tasks

Using SGD, train M2 with two different settings of batch size: 10000 examples and 100 examples.

You may need to adjust the SGD step_size or n_epochs hyperparameter here. You should aim to show runs that last until either training error visibly converges or until you see obvious overfitting.

For the best of these runs, please record the resulting M2 model's:

  • mean absolute error on the validation set
  • mean absolute error on the test set
  • value of parameters \(\mu, b, c\) (not needed for your report, but useful so you don't need to retrain the model if you have to remake a figure)

Problem 2: Report Tasks

As evidence of successfully completing Problem 2, include the following in your report:

2a: Figure and caption: Trace plots showing mean absolute error vs. epoch completed for your SGD training runs.

In two side-by-side plots, you should compare the two batch sizes:

  • Left plot: 10000 examples per batch
  • Right plot: 100 examples per batch
  • Each plot should have two lines (one for training MAE, one for validation MAE)

Please adjust the y axis to focus on what happens after epoch 2. Avoid showing a plot where you cannot see some difference between the two curves.

2b: Figure and caption: For the model from 2a with best validation MAE, display the learned per-movie rating adjustment parameters \(c_j\) for each of the movies in the short list in select_movies.csv. You might make a sorted list, showing each movie's title alongside its learned bias parameter.

In your caption, please answer: What kinds of movies have a large positive \(c_j\) or large negative \(c_j\) value? What does it mean for a \(c_j\) to be large and negative or large and positive?
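
One way to build this display is sketched below; note the assumptions: we guess that your param_dict stores the per-movie biases under a key like 'c' and that select_movies.csv has 'item_id' and 'title' columns, so check the actual key and column names in your own code and data.

import pandas as pd

def show_sorted_item_biases(best_model, select_csv_path, c_key='c'):
    # Assumes param_dict[c_key] is a 1D array of per-movie biases indexed by
    # item_id, and that the CSV has 'item_id' and 'title' columns.
    select_df = pd.read_csv(select_csv_path)
    c_per_item = best_model.param_dict[c_key]
    select_df['c_j'] = c_per_item[select_df['item_id'].values]
    print(select_df.sort_values('c_j', ascending=False)[['title', 'c_j']])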

Problem 3: One-Vector-Per-Item Collaborative Filtering with SGD and Autograd

We now consider a full matrix factorization model "M3" with five types of parameters:

  • \(\mu\) : scalar mean rating
  • \(b_i\) : scalar bias term for each user \(i\)
  • \(c_j\) : scalar bias term for each movie \(j\)
  • \(u_i\) : K-dimensional vector for each user \(i\)
  • \(v_j\) : K-dimensional vector for each movie \(j\)

Crucially, you'll now need to think about how to set \(K\), the number of "factors" or "dimensions" to learn when representing each user/movie in a vector space. Within the starter code, you set this with the n_factors keyword argument to the constructor.

Prediction under model "M3" for the rating that user \(i\) will give to movie \(j\) is:

$$ \hat{y}_{ij} = \mu + b_i + c_j + \sum_{k=1}^K u_{ik} v_{jk} $$

M3 is known as a collaborative filtering model for recommendation in the ML research literature.

Training Model M3

To train model M3 for \(N\) users and \(M\) movies, we wish to optimize the following objective:

$$ \min_{\mu, b, c, \{ u_i \}_{i=1}^N, \{ v_j \}_{j=1}^M} \quad \alpha \Big( \sum_{j} \sum_{k} v_{jk}^2 + \sum_{i} \sum_{k} u_{ik}^2 \Big) + \sum_{i,j \in \mathcal{I}^{\text{train}}} (y_{ij} - \mu - b_i - c_j - u_i^T v_j)^2 $$

Again, this is a squared error objective. Note that we have added L2 regularization penalties on the \(u\) and \(v\) vectors with strength \(\alpha \geq 0\). We'll investigate whether this helps us generalize to heldout data better.

In the starter code, the L2 penalty strength hyperparameter can be set via the keyword argument alpha=... in the constructor, and accessed via the attribute alpha.

Problem 3 Code Implementation Tasks

Edit the starter code file: CollabFilterOneVectorPerItem.py. Complete each required method, as described above in the background section.

Problem 3 Analysis Tasks

You should fix batch_size=1000 throughout this problem.

Analysis 3(i): First, with no regularization (\(\alpha=0\)), train M3 using SGD. Try three possible values of \(K\): 2, 10, and 50.

Analysis 3(ii): Second, focusing only on \(K=50\), train M3 with moderate regularization: set the strength \(\alpha > 0\) to try to eliminate the overfitting you saw in 3(i).

For the best run at each setting of \(K\) (with and without \(\alpha > 0\)), please record the resulting M3 model's:

  • mean absolute error on the validation set
  • mean absolute error on the test set
  • value of parameters \(\mu, b, c, u, v\) (not needed for your report, but useful so you don't need to retrain the model if you have to remake a figure)

You may need to adjust the SGD step_size or n_epochs hyperparameter here. You should aim to show runs that last until either training error visibly converges or until you see obvious overfitting.

Hint: You might consider using early stopping, to get the best possible model here.

Problem 3 Report Tasks

As evidence of successfully completing Problem 3, include the following in your report:

3a: Figure and caption: Make trace plots showing mean absolute error vs. epoch completed when \(\alpha=0\).

You should include 3 figures side-by-side, one for \(K=2\) factors, one for \(K=10\), and one for \(K=50\).

In your caption, reflect on what your trace plots suggest. Do you see underfitting? Overfitting? What happens as \(K\) increases?

3b: Figure and caption: Make a trace plot showing mean absolute error vs. epoch completed when \(\alpha > 0\).

Here, please plot one figure, for \(K=50\).

In your caption, please specify (1) which value of \(\alpha\) you ultimately selected (you should try several) as well as the batch size and step size, and (2) whether you ultimately get better heldout error with \(\alpha > 0\) than you did in 3a.

3c: Table and caption: Report the MAE on validation set and test set for the "best version" of each of these models:

  • M1
  • M2
  • M3 with \(K=2\)
  • M3 with \(K=10\)
  • M3 with \(K=50\)

Below the table, please discuss:

  • How you determined the "best" version of each model
  • How many factors \(K\) do you recommend for M3? Should we try even more than 50 factors?
  • Which of the 5 models is "best" overall and why

3d: Figure and caption: For the best M3 model with \(K=2\) factors, consider the learned per-movie vectors \(v_j\) for the short list of movies listed in select_movies.csv. Can you make a scatter plot of the 2-dimensional "embedding" vector \(v_j\) of these movies (labeling each point with its movie title), like we saw in lecture? Do you notice any interpretable trends?

Hint: See the very bottom of the day 24 lab notebook with help plotting a 2D visual of the embedding vectors. Search for the "Make visualization of the learned embeddings of select movies" section.
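
A possible starting point for this figure is sketched below (the lab notebook linked in the hint is the authoritative reference); as in 2b, the param_dict key and CSV column names are guesses, so match them to your own code.

import matplotlib.pyplot as plt
import pandas as pd

def plot_2d_item_embeddings(best_model, select_csv_path, v_key='v'):
    # Assumes K=2, that param_dict[v_key] has shape (n_items, 2), and that
    # the CSV has 'item_id' and 'title' columns.
    select_df = pd.read_csv(select_csv_path)
    V = best_model.param_dict[v_key]
    xy = V[select_df['item_id'].values]
    plt.scatter(xy[:, 0], xy[:, 1])
    for (x0, y0), title in zip(xy, select_df['title']):
        plt.annotate(title, (x0, y0), fontsize=8)
    plt.xlabel('dimension 1 of v_j')
    plt.ylabel('dimension 2 of v_j')
    plt.show()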

Problem 4: Open-Ended Recommendation Challenge

The starter code includes an additional leaderboard heldout dataset -- ratings_masked_leaderboard_set.csv -- which you haven't used yet. This contains an additional 10,000 entries of user-movie pairs for the same set of users and movies as above. We've omitted the ratings here, so you won't be able to access them (they are "masked").

In this problem, your goal is to obtain the best possible prediction results on this heldout test data, in terms of mean absolute error.

You can try any model you want. You can make use of the other ratings you've already observed in the training set, as well as the user-specific info found in user_info.csv and the movie-specific attributes found in item_info.csv. You can use autograd, sklearn, surprise, or any other package. Your goal is to get the best score on our leaderboard.

Example Problem 4 ideas

  • Try one of the implementations of SVD in the surprise package. This package offers models that look like our M3, plus some neat extensions of these. Can you train one of these to perform better? Please take advantage of surprise's extensive sklearn-like grid search tools (a small starting-point sketch appears after this list).

  • Can you somehow use user-specific features (like gender and age, available in user_info.csv) or movie-specific features (like title and year, available in movie_info.csv) to improve your rating predictions?

  • Try one of the k-nearest neighbor approaches to recommendation in surprise

  • Try some other approach to recommendation you have read about.

  • Try a new loss function! Throughout this project, we have used mean-squared error in the calc_loss... method for training the models, but then evaluated with mean-absolute-error. What if we just used mean absolute error in the loss? This should be somewhat easy with autograd.
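
As a concrete starting point for the first idea above, here is a sketch of running surprise's SVD with a small grid search; it assumes scikit-surprise 1.1.1 is installed (as described earlier) and that the development ratings CSV is at the path shown.

import pandas as pd
from surprise import SVD, Dataset, Reader
from surprise.model_selection import GridSearchCV

ratings_df = pd.read_csv('data_movie_lens_100k/ratings_all_development_set.csv')
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_df[['user_id', 'item_id', 'rating']], reader)

# Small illustrative grid; expand it for a serious leaderboard attempt.
param_grid = {'n_factors': [10, 50], 'reg_all': [0.02, 0.1], 'n_epochs': [20, 40]}
grid = GridSearchCV(SVD, param_grid, measures=['mae'], cv=3)
grid.fit(data)
print(grid.best_score['mae'], grid.best_params['mae'])

# Refit the best configuration on all development ratings, then call
# .predict(user_id, item_id).est for each leaderboard pair.
best_algo = grid.best_estimator['mae']
best_algo.fit(data.build_full_trainset())
print(best_algo.predict(772, 36).est)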

Problem 4: Leaderboard Submission Tasks

You should submit your final predictions as a plain text file named predicted_ratings_leaderboard.txt; a minimal sketch of writing and checking this file appears after the list below.

  • predicted_ratings_leaderboard.txt : a plain text file
    • One line for each line in the released dataset ratings_masked_leaderboard_set.csv
    • Can be loaded in with np.loadtxt() as a valid 1d array of floats with shape (10000,)
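
Here is a minimal sketch of writing and sanity-checking this file with numpy; yhat_leaderboard stands in for your model's predictions, one per leaderboard row, in the same order as the released CSV.

import numpy as np

yhat_leaderboard = np.zeros(10000)   # placeholder: replace with your predictions
np.savetxt('predicted_ratings_leaderboard.txt', yhat_leaderboard)

# Verify the file will load the way the graders expect.
reloaded = np.loadtxt('predicted_ratings_leaderboard.txt')
assert reloaded.shape == (10000,)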

Problem 4: Report Tasks

Please include in your report:

4a: 1-2 paragraphs describing your proposed method (how it works, why you chose it, what training and hyperparameter selection is done, etc.). Show that you have mastered the core concepts and best practices of machine learning.

4b: 1 figure (with caption) relevant to reporting how you trained the model or selected model complexity. This could be a trace plot or a hyperparameter selection plot.

4c: 1 table reporting your model's ultimate mean absolute error performance. Be sure to include the measured leaderboard performance (in terms of mean absolute error) as well as some internal performance number (e.g. on the test split of the development set).

4d: 1 paragraph analyzing the table of results from 4c

  • Discuss how your leaderboard number compared to your internal validation efforts
  • Discuss how this method compared to your best model of M1, M2, and M3

4e: 1 paragraph discussing limitations and opportunities for future work

BONUS Problem 5: Predicting Gender from Learned Per-User Embedding Vectors

Consider the best K-factor model you've trained (either with surprise or with your own implementation of M3 in Problem 3). Let \(U\) be the learned matrix of user vectors, with its i-th row as the vector \(u_i\) for user \(i\). Each row of \(U\) can be seen as a "feature vector" for a particular user.

The question we'd like to investigate is this: do the learned per-user features, which were optimized only for predicting movie ratings, contain any information related to gender?

The provided data file user_info.csv contains an is_male column indicating which users in the dataset are male. Can you predict this signal given the features \(U\)?

Implementation Tasks

Train a binary classifier to predict the is_male target variable given the fixed per-user embedding features \(U\). Feel free to use any sklearn binary classifier (logistic regression, SVM, random forest, etc.). You should use best practices for model selection (cross validation, hyperparameter search, etc).

You'll need to set up this classification task from scratch. You have info for all 943 users. Should you include them all in the training process? How will you fairly report the accuracy on heldout data?
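
To show the shape of one reasonable setup (a sketch only, and certainly not the required design), the snippet below uses cross-validated logistic regression on the fixed embedding features; it assumes user_info.csv rows are ordered by user_id and contain the is_male column, which you should verify.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def estimate_gender_prediction_accuracy(U, user_info_csv_path):
    # U : array of shape (943, K) of learned per-user vectors.
    is_male = pd.read_csv(user_info_csv_path)['is_male'].values
    clf = LogisticRegression(C=1.0, max_iter=1000)
    scores = cross_val_score(clf, U, is_male, cv=5, scoring='accuracy')
    return np.mean(scores)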

Bonus Problem Report Tasks

5a: Paragraph: Describe your method. How did you split the data (train/test)? What classifier did you choose and why? How did you tune hyperparameters?

5b: Figure and caption: Show a confusion matrix for your gender-from-user-features classifier. What error rate do you get? Is it significantly better than chance for this dataset?

5c: Paragraph: Is predicting a user's gender "harmless", or are there applications of a recommendation system where this might have real-world consequences? How should a responsible AI practitioner handle this? What questions should we ask to decide if this tool should be "released"?

Rubric for Evaluating PDF Report

Earning full credit on this assignment requires a well-thought-out report that demonstrates you understand the models and training procedures we're studying here.

We're looking for the itemized content outlined above, with professional figures and captions that tell a story, concise but well-thought out short answers, and evidence of best practices throughout.

Points will be allocated across the various parts as follows:

  • 10%: Problem 1
  • 20%: Problem 2
  • 30%: Problem 3
  • 40%: Problem 4

The BONUS problem, if completed successfully, will be worth up to 8% toward your overall project C grade.

Note: You cannot get higher than 105% on project C.