Last modified: 20201220 16:04
Status: RELEASED.
Updates
 20201207 : Updated Bonus problem to include a new short answer question "5c" about responsible AI
Due date
 Leaderboard closes on Fri. Dec. 18 at 11:59pm AoE (Anywhere on Earth).
 Report is due Mon. Dec. 21 at 11:59pm AoE (Anywhere on Earth). (Tue 12/22 at 07:59am in Boston).
No late days allowed due to the strict end of the semester.
Jump to:
Background Code Datasets Problem 1 Problem 2 Problem 3 Problem 4 Bonus Problem Rubric for Evaluating PDF Report
Turnin links:

PDF report: https://www.gradescope.com/courses/173055/assignments/880988

ZIP file containing Problem 4 predictions: https://www.gradescope.com/courses/173055/assignments/880999

Reflection Form: https://forms.gle/r4VFnbgrBa9d7ZDV6
Starter Code and Dataset Links:
https://github.com/tuftsmlcourses/comp13520fassignments/tree/master/projectC
Overview
This is a four week project with lots of openended programming. Get started right away!
Team Formation
In this project, you can work in teams of 2 people, or (if you prefer) individually. Individual teams still need to complete all the parts below. We want to incentivize you to work in pairs.
If you need help finding teammates, please post to our "Finding a Partner for Project C" post on Piazza.
By the start of the second week (by end of day Wed 12/02), you should have identified your partner.
Work to Complete
As a team, you will work on several problems that study different representations of recommendation systems.
 Problem 1 looks at a baseline factorization model with one scalar parameter.
 Problem 2 looks at a baseline factorization model with one learned scalar per item.
 Problem 3 looks at a factorization model with a learned vector embedding per item.
Throughout Problems 1, 2, and 3, you will practice the development cycle of models trained with SGD:

 Develop a coherent loss function for your application

 Use automatic differentiation toolboxes to compute the gradient

 Attempt multiple runs of SGD to train models until you get a satisfactory performing model

 Select among several model complexity hyperparameters to get the best generalization performance
Then, for Problem 4, you have much more openended freedom.
 Build the best model you can, and try to get to the top of the leaderboard!
What to Turn In
Each team will prepare one PDF report covering all problems below.
 Prepare a short PDF report (no more than ~7 pages).
 This document will be manually graded according to our rubric
 Can use your favorite report writing tool (Word or G Docs or LaTeX or ....)
 Should be humanreadable. Do not include code. Do NOT just export a jupyter notebook to PDF.
 Should have each subproblem marked via the inbrowser Gradescope annotation tool)
Each team will prepare one ZIP file of the leaderboardtestset predictions as a leaderboard submission for Problem 4. This ZIP file will contain just one file:
 predicted_ratings_leaderboard.txt : a plain text file

 One line for each line in the released dataset
ratings_masked_leaderboard_set.csv
 One line for each line in the released dataset

 Can be loaded in with
np.loadtxt()
as a valid 1d array of floats with shape (10000,)
 Can be loaded in with
Each individual will turn in a reflection form (after completing the report).
Background
For this project, you are given a large dataset of the ratings that 943 users have given to 1682 movies.
We'd like to build a recommendation system to help guess which movies a user will like.
Input: The input to our prediction system is a pair \(i,j\), which denotes a specific user id (denoted by index \(i\), one of 943 possibilities) and movie id (denoted by index \(j\), one of 1682 possibilities).
Output: The output of our predictor (e.g. the quantity produced by calling model.predict(i,j)
) will be a scalar rating \(\hat{y}_{ij} \in \mathbb{R}\). For each possible usermovie pair \(i,j\), we'd like \(\hat{y}_{ij}\) to be as close as possible to the "real" 5star rating \(y_{ij}\) that the user \(i\) gave the movie \(j\). We can represent any five star rating with an integer in the set \(\mathcal{Y} = \{1, 2, 3, 4, 5\}\): a rating of 1 is the worst possible, a rating of 5 is the best.
We can think of our entire set of observed ratings as a big 2D matrix \(Y\), which has 943 rows and 1682 columns. Formally, we can write \(Y \in \mathcal{Y}^{943 \times 1682}\).
Of course, not all users have seen and rated all movies, so some true ratings are simply unknown to us. Thus, if we examined our true rating matrix \(Y\), most entries would be missing. It is impossible to know if our guesses for missing entries are any good.
We thus concentrate only the observed entries of \(Y\). We can represent our dataset more compactly as a list \(\mathcal{I}\) of all distinct observed pairs of input (user_id \(i\), item_id \(j\)) and output (rating \(y_{ij}\)) in the observed dataset.
Dataset
We will use the MovieLens 100K dataset. This data set consists of:
 100,000 ratings (15 stars) from 943 users on 1682 movies.
 Each user has rated at least 20 movies.
 Some movies have ratings from only a few users.
 Ratings were collected on a 5star scale. Each rating is one of 5 possible integer values  1, 2, 3, 4, or 5  with 5 being 'best' and 1 being 'worst'.
 Simple demographic info for the users (age, gender, etc.) are available
For more information about the original dataset, see http://files.grouplens.org/datasets/movielens/ml100kREADME.txt. We are grateful to the dataset creators and the University of Minnesota for making this data publicly available.
We've provided a clean preprocessed version of this dataset here: https://github.com/tuftsmlcourses/comp13520fassignments/tree/master/projectC/data_movie_lens_100k
We've provided one CSV file, ratings_all_development_set.csv
, to give you all the data you need to develop and evaluate models. (Another CSV file for the leaderboard will be discussed later in Problem 4).
Each row of this file specifies 3 things:
 user_id \(i\) : an integer in \(\{0, 1, 2, ... 942\}\)
 item_id \(j\) : an integer in \(\{0, 1, 2, ... 1681\}\)
 observed 5star rating \(y_{ij}\) : an integer in {1, 2, 3, 4, 5}$
The first few rows of ratings_all_development_set.csv
are:
user_id,item_id,rating
772,36,3
471,228,5
641,401,4
312,98,4
...
Train/validation/test splits:
As usual, we can take the large development dataset \(\mathcal{I}\) (which is provided to you in random order) and divide it into training, validation, and test sets: \(\mathcal{I}^{\text{train}}\), \(\mathcal{I}^{\text{valid}}\), and
\(\mathcal{I}^{\text{test}}\). See the starter code in train_valid_test_loader.py
, which you can use throughout Problems 15.
Background: Evaluation Performance Metric
Your goal is to predict the ratings of all usermovie pairs well.
To measure ultimately rank models by prediction quality, we'll use mean absolute error. This is useful for our fivestar rating task, since it is nicely interpretable (e.g. an MAE of 0.5 means we're within a 1/2star of the answer on average). Plus, it is not overly sensitive to outliers.
For a given dataset \(\mathcal{I}\) of triples \(i, j, y_{ij}\), recall that MAE is computed as:
Background: SGD, Automatic Differentiation and autograd
Across Problem 1, Problem 2, and Problem 3, you'll develop your own Python code to build a series of increasingly more powerful models to perform recommendation. For each model, we will view training as an optimization problem, and we'll solve it with stochastic gradient descent.
Recall that we spent HW3 analyzing an existing sklearn
implementation of SGD for MLP classifiers. You'll now get some experience writing your own loss function and using SGD to optimize that loss.
How do we use SGD for our problem? At each update step, you'll grab a minibatch of data (a random subset of the observed entries in our ratings matrix \(Y\)), and compute gradient estimates for parameters with respect to this batch.
How will we compute gradients? We've provided some starter code to help you. To save lots of effort and help us explore many possible models, we'll use the autograd Python package to perform automatic differentiation.
For a solid introduction to using autograd
, see the Autograd For Gradient Descent lab from Monday 11/30
Starter Code organization
For each model of interest (Problem 1, Problem 2, Problem 3, all defined mathematically below), we have created a separate file defining the python class
for that model. For examples in your starter code, see:
 CollabFilterMeanOnly.py for Problem 1
 CollabFilterOneScalarPerItem.py for Problem 2.
 CollabFilterOneScalarPerVector.py for Problem 3.
Your task is to define several methods for each model class, which allow us to perform prediction given fixed parameters, and compute the loss used for gradientbased training of parameters.
All these models will be subclasses of a "base" class AbstractBaseCollabFilterSGD, which contains logic to construct and fit collaborative filtering models to data via stochastic gradient descent. Using this base class, you don't need to write SGD yourself, or even how to compute the gradient! Please do read through this base class carefully to understand how it works.
Below, we review the methods/attributes you'll need to write yourself, and the methods/attributes you only need to use that are provided in the base class.
Methods you'll need to write
To complete the implementation of a given model, we'll follow the following pattern:
 Use the instance attribute
param_dict
to store all model parameters
param_dict : dict
Keys are string names of parameters
Values are *autograd.numpy arrays* of parameter values
 Define method
predict
to make predictions: 
 Input: Specific usermovie example pairs, indicated by integer ids

 Output: Predicted ratings for each example

Define method
calc_loss_wrt_parameter_dict
to compute the loss to minimize with SGD: 
 Input: a minibatch of training data, a parameter dict

 Output: scalar float, indicating the loss on the batch given the parameters

Define method
init_parameter_dict
to initialize the param_dict attribute to random values: 
 Input: Number of possible users, Number of possible items

 Output: None, internal attribute
param_dict
updated
 Output: None, internal attribute
The steps above are all you need to do for each possible model. Each class inherits a complete fit
method from the predefined AbstractBaseCollabFilterSGD
, which knows how to perform SGD given the pieces above.
Methods you'll need to understand and use (but not edit)
You should read through the provided AbstractBaseCollabFilterSGD.py, to be sure you understand what's going on.
__init__
: Constructor
This is where the user defines the batch_size
, the step_size
(learning rate), and the number of epochs n_epochs
to complete during training. This constructor can also define the regularization strength alpha
and the number of hidden factors n_factors
for Problem 3.
fit
: Method to fit model parameters to provided data
At the bottom of each starter code file, you can see example code for calling fit
for that model.
When the user calls fit
, the model parameters are initialized to random values, and then updated iteratively via SGD to improve the loss. These updates proceed until the desired number of epochs are performed.
Epoch  A Definition: An epoch is a unit of training progress in minibatch learning. One epoch is complete when our stochastic gradient descent has processed enough minibatches such that the total number of examples seen is "equivalent" to the size of the entire training dataset.
Within our fit
implementation, we compute and store performance metrics at various checkpoints throughout the training process, including:
 at the initial parameters (before any updates)
 every 1/4 of an epoch for epochs 0  2
 every 1/2 of an epoch for epochs 2  8
 every 1 of an epoch for epochs 8  32
 every 2 epochs for 32  128
 every 4 epochs after that
These metrics helps us monitor progress as learning progresses, which is especially rapid in early epochs as the poor random initialization is improved.
Attributes available after calling fit
that trace performance
We have recorded and stored useful diagnostic metrics computed at various checkpoints throughout the training procedure. These help us "trace" what happens throughout learning, so we call them trace performance metrics.
After fitting a model, you'll have the following trace attributes available to you:
trace_epoch : 1D arraylike
Contains the epochs (fractional) where model performance was assessed.
Value of 0.0 indicates the initial model parameters (before any gradient updates).
Value of 0.1 indicates that the total training examples seen represents 10% of the size of the training set.
trace_loss : 1D arraylike
Contains training loss (at current batch only) whenever model was assessed.
This is reported as an average per example in the current batch.
trace_mae_train : 1D arraylike
MAE assessed on entire training set whenever model assessed.
trace_mae_valid : 1D arraylike
MAE assessed on entire validation set whenever model assessed.
So for example, to plot training MAE vs. epochs completed, you could do:
# After calling fit...
plt.plot( model.trace_epoch, model.trace_mae_train, '.')
Starter Code and Code Restrictions
For this assignment, you are limited to the following Python packages for performing machine learning related functionality:
 All Problems: Default packages in your environment: numpy, scipy, sklearn, etc.
 All Problems:
autograd
package for automatic differentiation  Problem 4 onward:
surprise
package for recommendation, or any other packages you want
You can INSTALL the surprise
package as follows:
conda activate comp135_2020f_env
conda install c condaforge scikitsurprise
Your staff found in late November 2020 that scikitsurprise version 1.1.1
worked fine with the existing comp135 environment.
For any package you use, please consult the documentation websites or other external web resources. However, you should understand every line of the code you use and not simply copypaste without thinking carefully.
Remember to keep the course collaboration policy in mind: do your own work! (with/without a partner as required).
Problem 1: Simple Baseline Model with SGD and Autograd
To get used to developing models using our autograd framework, we'll consider the simplest possible baseline model "M1": a model that makes the same scalar prediction for a movie's rating no matter what user or movie is considered. This model has one scalar parameter \(\mu \in \mathbb{R}\), and the prediction for user \(i\) and movie \(j\) is simply:
Training Model M1
To train model M1 for \(N\) users and \(M\) movies, we wish to optimize this squared error training objective:
In words, this means we want to minimize the squared error on the training set, between the predicted rating (simply \(\mu\) here) and the observed rating \(y_{ij}\).
Problem 1 Code Implementation Tasks
Edit the starter code file: CollabFilterMeanOnly.py
. Complete each required method (init_parameter_dict
, predict
, and calc_loss_wrt_parameter_dict
), as described above in the background section above.
Problem 1 Analysis Tasks
Using SGD, train M1 with two different settings of batch size: 10000 examples and 100 examples.
For the best run of these, please record that M1 model's:
 value of parameter \(\mu\)
 mean absolute error on the validation set
 mean absolute error on the test set
Problem 1 Report Tasks
As evidence of successfully completing Problem 1, include the following in your report:
1a: Figure and caption: Trace plots showing mean absolute error vs. epoch completed for your SGD training runs.
In two sidebyside plots, you should compare the two batch sizes:
 Left plot: 10000 examples per batch
 Right plot: 100 examples per batch
 Each plot should have two lines (one for training MAE, one for validation MAE)
Please adjust the y axis to focus on what happens after epoch 2. Avoid showing a plot where you cannot see some difference between the two curves.
1b: Short answer: There is a closedform operation we could apply to the training set to compute the optimal \(\mu\) value (e.g. a one line computation in numpy involving the observed training ratings \(Y\)). How would you compute this "exact" solution? Report the computed optimal \(\mu\) value. Does this result agree with your SGD solution?
Problem 2: OneScalarPerItem Baseline with SGD and Autograd
We now consider a second model "M2" with three parameters:
 \(\mu\) : scalar mean rating
 \(b_i\) : scalar bias term for each user \(i\)
 \(c_j\) : scalar bias term for each movie \(j\)
Prediction under model "M2" becomes:
Training Model M2
To train model M2 for \(N\) users and \(M\) movies, we wish to optimize the following squared error objective:
In words, this means we want to minimize the squared error on the training set, between the predicted rating and the observed rating \(y_{ij}\).
Problem 2 Code Implementation Tasks
Edit the starter code file: CollabFilterOneScalarPerItem.py
. Complete each required method, as described above in the background section.
Problem 2 Analysis Tasks
Using SGD, train M2 with two different settings of batch size: 10000 examples and 100 examples.
You may need to adjust the SGD step_size
or n_epochs
hyperparameter here. You should aim to show runs that last until either training error visibly converges or until you see obvious overfitting.
For the best run of these, please record that M2 model's:
 mean absolute error on the validation set
 mean absolute error on the test set
 value of parameters \(\mu, b, c\) (not needed for your report, but useful so you don't need to retrain the model if you have to remake a figure)
Problem 2: Report Tasks
As evidence of successfully completing Problem 2, include the following in your report:
2a: Figure and caption: Trace plots showing mean absolute error vs. epoch completed for your SGD training runs.
In two sidebyside plots, you should compare the two batch sizes:
 Left plot: 10000 examples per batch
 Right plot: 100 examples per batch
 Each plot should have two lines (one for training MAE, one for validation MAE)
Please adjust the y axis to focus on what happens after epoch 2. Avoid showing a plot where you cannot see some difference between the two curves.
2b: Figure and caption: For the model from 2a with best validation MAE, display the learned permovie rating adjustment parameters \(c_j\) for each of the movies in the short list in select_movies.csv
. You might make a sorted list, showing each movie's title alongside its learned bias parameter.
In your caption, please answer: What kinds of movies have a large positive \(c_j\) or large negative \(c_j\) value? What does it mean for a \(c_j\) to be large and negative or large and positive?
Problem 3: OneVectorPerItem Collaborative Filtering with SGD and Autograd
We now consider a full matrix factorization model "M3" with five parameters:
 \(\mu\) : scalar mean rating
 \(b_i\) : scalar bias term for each user \(i\)
 \(c_j\) : scalar bias term for each movie \(j\)
 \(u_i\) : Kdimensional vector for each user \(i\)
 \(v_j\) : Kdimensional vector for each movie \(j\)
Crucially, you'll now need to think about how to set \(K\), the number of "factors" or "dimensions" to learn when representing each user/movie in a vector space. Within the starter code, you set this with the n_factors
keyword argument to the constructor.
Prediction under model "M3" for the rating that user \(i\) will give to movie \(j\) is:
M3 is known as a collaborative filtering model for recommendation in the ML research literature.
Training:
To train model M3 for \(N\) users and \(M\) movies, we wish to optimize the following objective:
Again, this is a squared error objective. Note that we have added L2 regularization penalties on the \(u\) and \(v\) vectors with strength \(\alpha \geq 0\). We'll investigate whether this helps us generalize to heldout data better.
In the starter code, the L2 penalty strength hyperparameter can be set via the keyword argument alpha=...
in the constructor, and accessed via the attribute alpha
.
Problem 3 Code Implementation Tasks
Edit the starter code file: CollabFilterOneVectorPerItem.py
. Complete each required method, as described above in the background section.
Problem 3 Analysis Tasks
You should fix batch_size=1000
throughout this problem.
Analysis 3(i): First, with no regularization (\(\alpha=0\)), train M3 using SGD. Try three possible values of \(K\): 2, 10, and 50.
Analysis 3(ii): Second, train M3 with \(K=50\) now with moderate regularization by setting strength \(\alpha > 0\) to try to eliminate overfitting you saw in 3(i). Focus only on \(K=50\).
For the best run at each \(K\) (with/without \(\alpha > 0\)), please record that M3 model's:
 mean absolute error on the validation set
 mean absolute error on the test set
 value of parameters \(\mu, b, c, u, v\) (not needed for your report, but useful so you don't need to retrain the model if you have to remake a figure)
You may need to adjust the SGD step_size
or n_epochs
hyperparameter here. You should aim to show runs that last until either training error visibly converges or until you see obvious overfitting.
Hint: You might consider using early stopping, to get the best possible model here.
Problem 3 Report Tasks
As evidence of successfully completing Problem 3, include the following in your report:
3a: Figure and caption: Make trace plot showing mean absolute error vs. epoch completed when \(\alpha=0\).
You should include 3 figures sidebyside, one for \(K=2\) factors, one for \(K=10\), and one for \(K=50\).
In your caption, reflect on what your trace plots suggest. Do you see underfitting? Overfitting? What happens as \(K\) increases?
3b: Figure and caption: Make trace plot showing mean absolute error vs. epoch completed when \(\alpha > 0\).
Here, please plot one figure, for \(K=50\).
In your caption, please specify (1) which value of \(\alpha\) you ultimately selected (you should try several) as well as the batchsize and step size, and (2) whether you ultimately get better heldout error with \(\alpha > 0\) than you did in 3a.
3c: Table and caption: Report the MAE on validation set and test set for the "best version" of each of these models:
 M1
 M2
 M3 with \(K=2\)
 M3 with \(K=10\)
 M3 with \(K=50\)
Below the figure, please discuss:
 How you determined the "best" version of each model
 How many factors \(K\) do you recommend for M3? Should we try even more than 50 factors?
 Which of the 5 models is "best" overall and why
3d: Figure and caption: For the best M3 model with \(K=2\) factors, consider the learned permovie vectors \(v_j\) for the short list of movies listed in select_movies.csv
. Can you make a scatter plot of the 2dimensional "embedding" vector \(v_j\) of these movies (labeling each point with its movie title), like we saw in lecture? Do you notice any interpretable trends?
Hint: See the very bottom of the day 24 lab notebook with help plotting a 2D visual of the embedding vectors. Search for the "Make visualization of the learned embeddings of select movies" section.
Problem 4: OpenEnded Recommendation Challenge
The starter code includes an additional leaderboard heldout dataset  ratings_masked_leaderboard_set.csv
 which you haven't used yet. This contains an additional 10,000 entries of usermovie pairs for the same set of users and movies as above. We've omitted the ratings here, so you won't be able to access them (they are "masked").
In this problem, your goal is to obtain the best possible prediction results on this heldout test data, in terms of mean absolute error.
You can try any model you want. You can make use of the other ratings you've already observed in the training set, as well as the userspecific info found in user_info.csv
and the moviespecific attributes found in item_info.csv
. You can use autograd
, sklearn
, surprise
, or any other package. Your goal is to get the best score on our leaderboard.
Example Problem 4 ideas

Try one of the implementations of SVD in the
surprise
package. This package offers models that look like our M3, plus some neat extensions of these. Can you train one of these to perform better? Please take advantage of surprise's extensive sklearnlike grid search tools. 
Can you somehow use userspecific features (like gender and age, available in
user_info.csv
) or moviespecific features (like title and year, available inmovie_info.csv
) to improve your rating predictions? 
Try one of the knearest neighbor approaches to recommendation in
surprise

Try some other approach to recommendation you have read about.

Try a new loss function! Throughout this project, we have used meansquared error in the
calc_loss...
method for training the models, but then evaluated with meanabsoluteerror. What if we just used mean absolute error in the loss? This should be somewhat easy withautograd
.
Problem 4: Leaderboard Submission Tasks
You should submit your final predictions as a plain text file named predicted_ratings_leaderboard.txt
 predicted_ratings_leaderboard.txt : a plain text file

 One line for each line in the released dataset
ratings_masked_leaderboard_set.csv
 One line for each line in the released dataset

 Can be loaded in with
np.loadtxt()
as a valid 1d array of floats with shape (10000,)
 Can be loaded in with
Problem 4: Report Tasks
Please include in your report
4a: 12 paragraphs describing your proposed method (how it works, why you chose it, what training and hyperparameter selection is done, etc.). Show that you have mastered the core concepts and best practices of machine learning.
4b: 1 figure (with caption) relevant to reporting how you trained the model or selected model complexity. This could be a trace plot or a hyperparameter selection plot.
4c: 1 tables reporting your model's ultimate mean absolute error performance. Be sure to include the measured leaderboard performance (in terms of mean absolute error) as well as some internal performance number (e.g. on the test split of the development set).
4d: 1 paragraph analyzing the table of results from 4d
 Discuss how your leaderboard number compared to your internal validation efforts
 Discuss how this method compared to your best model of M1, M2, and M3
4e: 1 paragraph discussing limitations and opportunities for future work
BONUS Problem 5: Predicting Gender from Learned PerUser Embedding Vectors
Consider the best Kfactor model you've trained (either with surprise or with your own implementation of M3 in Problem 3). Let \(U\) be the learned matrix of user vectors, with its ith row as the vector \(u_i\) for user \(i\). Each row of \(U\) can be seen as a "feature vector" for a particular user.
The question we'd like to investigate is this: do our learned peruser features that are optimized for predicting movie ratings contain anything to do with gender?
The provided data file user_info.csv
contains an is_male
column indicating which users in the dataset are male. Can you predict this signal given the features \(U\)?
Implementation Tasks
Train a binary classifier to predict the is_male
target variable given the fixed peruser embedding features \(U\). Feel free to use any sklearn binary classifier (logistic regression, SVM, random forest, etc.). You should use best practices for model selection (cross validation, hyperparameter search, etc).
You'll need to set up this classification task from scratch. You have info for all 943 users. Should you include them all in the training process? How will you fairly report the accuracy on heldout data?
Bonus Problem Report Tasks
5a: Paragraph: Describe your method. How did you split the data (train/test)? What classifier did you choose and why? How did you tune hyperparameters?
5b: Figure and caption: Show a confusion matrix for your genderfromuserfeatures classifier. What error rate do you get? Is it significantly better than chance for this dataset?
5c: Paragraph: Is predicting a user's gender "harmless", or are there applications of a recommendation system where this might have realworld consequences? How should a responsible AI practitioner handle this? What questions should we ask to decide if this tool should be "released"?
Rubric for Evaluating PDF Report
Earning full credit on this assignment requires a wellthoughtout report that demonstrates you understand the models and training procedures we're studying here.
We're looking for the itemized content outlined above, with professional figures and captions that tell a story, concise but wellthought out short answers, and evidence of best practices throughout.
Points will be allocated across the various parts as follows:
 10%: Problem 1
 20%: Problem 2
 30%: Problem 3
 40%: Problem 4
The BONUS problem, if completed successfully, will be worth up to 8% toward your overall project C grade.
Note: You cannot get higher than an 105% on project C.