Last modified: 2019-04-18 15:45
Status: RELEASED.
Due date: Wed. May 8 at 11:59PM EST.
Jump to:
Background Code Datasets Problem 1 Problem 2 Problem 3 Problem 4 Problem 5 Bonus Problem Rubric for Evaluating PDF Report
Turn-in links:
- PDF report: https://www.gradescope.com/courses/33142/assignments/193961
- ZIP file containing Problem 5 predictions, collaborator info, and source code: https://www.gradescope.com/courses/33142/assignments/193962/
Starter Code and Dataset Links:
https://github.com/tufts-ml-courses/comp135-19s-assignments/tree/master/project3
Files to Turn In:
PDF report:
- Human-readable report (NOT a notebook export), answering the prompts for all Problems.
- This document will be manually graded according to our rubric
- Please: within Gradescope via the normal submission process, annotate for each subproblem exactly which page(s) are relevant. This will save your graders much time!
ZIP file should contain:
- predicted_ratings_test.txt : a plain text file for leaderboard performance on Problem 5
-
- One line for each line in
ratings_test_masked.csv
- One line for each line in
-
- Can be loaded in with
np.loadtxt()
as a valid 1d array of floats with shape (10000,)
- Can be loaded in with
- Any .py or .ipynb files
- COLLABORATORS.txt : a plain text file [example], containing
-
- Your full name
-
- Estimate the hours you spent on each Problem
-
- Names of any people you talked to for help (TAs, students, etc.). If none, write "No external help".
-
- Brief description of what content you sought help about (1-3 sentences)
Background
We have given you a dataset of ratings of 1682 movies by a set of 943 possible users.
Ratings were collected on a 5-star scale. Each rating is one of 5 possible integer values -- 1, 2, 3, 4, or 5 -- with 5 being 'best' and 1 being 'worst'.
We'd like to build a recommendation system to help guess which movies a user will like. Our prediction system should be able to take as inpute specific user (denoted by index \(i\), one of 943 possibilities) and movie (denoted by index \(j\), one of 1682 possibilities). The output of our predictor (e.g. the quantity produced by calling model.predict(i,j)
) will be a scalar rating \(\hat{y}_{ij} \in \mathbb{R}\). For each possible user-movie pair \(i,j\), we'd like \(\hat{y}_{ij}\) to be as close as possible to the "real" rating \(y_{ij}\).
Of course, not all users have seen and rated all movies, so some true ratings are simply unknown to us. Thus, some entries \(y_{ij}\) of the true rating matrix \(y\) are simply missing and it's impossible to know if our guesses for these entries are any good.
To sidestep the missing data problem, we can evaluate by simply ignoring missing entries and concentrating on the observed entries of \(y\). We can create a list \(\mathcal{I}\) of all distinct observed pairs of (user_id, item_id) in the complet dataset. As usual, we can then randomly divide these examples into training and test sets: \(\mathcal{I}^{\text{train}}\) and \(\mathcal{I}^{\text{test}}\).
We've provided two CSV files, ratings_train.csv
and ratings_test.csv
, to give you everything you need. Each row of these files specifies a (user_id, item_id) pair and its observed 5-star rating. The first few rows of ratings_train.csv
are:
user_id,item_id,rating
772,36,3
471,228,5
641,401,4
312,98,4
...
Background: Evaluation
You'll use the training set of user-movie ratings to fit a model, and then evaluate on the test set. To measure prediction accuracy, we'll use mean absolute error. On the test set, this is computed as:
Background: SGD, Automatic Differentiation and autograd
Across Problem 1, Problem 2, and Problem 3, you'll develop your own Python code to build a series of increasingly more powerful models to perform recommendation. For each model, we will view training as an optimization problem, and we'll solve it with stochastic gradient descent. This will be similar to the gradient descent we did in Project 1, but using at each step "noisy" gradient estimates computed by visiting a small random subset of data (entries in our ratings matrix) at each update step.
How will we compute gradients? We've provided some starter code to help you. To save lots of effort and help us explore many possible models, we'll use the autograd Python package to perform automatic differentiation.
For each model of interest (defined mathematically below), we'll create a separate class file for that model. For examples in your starter code, see CollabFilterMeanOnly.py for Problem 1 and CollabFilterOneScalarPerItem.py for Problem 2. Your task is to define each model's methods for performing prediction and computing the loss used for gradient-based training.
All these models will be subclasses of a "base" class AbstractBaseCollabFilterSGD, which contains logic to construct and fit collaborative filtering models to data via stochastic gradient descent. Using this base class, you don't need to write SGD yourself, or even how to compute the gradient!
Below, we review the methods/attributes you'll need to write yourself, and the methods/attributes you only need to use that are provided in the base class.
Methods you'll need to write
To complete the implementation of a given model, we'll follow the following pattern:
- Use the instance attribute
param_dict
to store all model parameters
param_dict : dict
Keys are string names of parameters
Values are *numpy arrays* of parameter values
- Define method
predict
to make predictions: -
- Input: Specific user-movie example pairs, indicated by integer ids
-
- Output: Predicted ratings for each example
-
Define method
calc_loss_wrt_parameter_dict
to compute the loss to minimize with SGD: -
- Input: a minibatch of training data, a parameter dict
-
- Output: scalar float, indicating the loss on the batch given the parameters
-
Define method
init_parameter_dict
to initialize the param_dict attribute to random values: -
- Input: Number of possible users, Number of possible items
-
- Output: None, internal attribute
param_dict
updated
- Output: None, internal attribute
The steps above are all you need to do for each possible model. Each class inherits a complete fit
method from the predefined AbstractBaseCollabFilterSGD
, which knows how to perform SGD given the pieces above.
As you get started with autograd, you might find some material from this previous tutorial/recitation helpful: IntroToAutogradAndBackpropForNNets.ipynb
Methods you'll need to use (but not edit)
You should then skim through the provided AbstractBaseCollabFilterSGD.py, to be sure you understand what's going on.
__init__
: Constructor
This is where you define the batch_size
, the step_size
(learning rate), and the number of epochs n_epochs
to complete during training.
This constructor can also define the regularization strength alpha
and the number of hidden factors n_factors
for Problem 3.
fit
: Method to fit model parameters to provided data
At the bottom of each starter code file, you can see example code for calling fit
for that model.
Attributes available after calling fit
After fitting a model, you'll have these attributes available to you. They store useful diagnostic metrics computed at various checkpoints throughout the training procedure.
Epoch - A Definition: An epoch is a unit of training progress. One epoch is complete when our stochastic gradient descent has processed multiple minibatches whose total number of examples are "equivalent" to the size of the entire training dataset.
By default, we compute "report" metrics on the initial parameters (before any updates), and then every 10% of an epoch for the first 5 or so epochs, then every 1 epoch or so after that. This helps us monitor progress, which is especially rapid in early iterations.
trace_epoch : 1D array-like
Contains the epochs (fractional) where model performance was assessed.
Value of 0.0 indicates the initial model parameters (before any gradient updates).
Value of 0.1 indicates that the total training examples seen represents 10% of the size of the training set.
trace_loss : 1D array-like
Contains training loss (at current batch only) whenever model was assessed.
This is reported as an average per example in the current batch.
trace_mae_train : 1D array-like
MAE assessed on entire training set whenever model assessed.
trace_mae_valid : 1D array-like
MAE assessed on entire validation set whenever model assessed.
So for example, to plot training MAE vs. epochs completed, you could do:
# After calling fit...
plt.plot( model.trace_epoch, model.trace_mae_train, 'b.-')
Datasets
We will use the MovieLens 100K dataset. This data set consists of:
- 100,000 ratings (1-5 stars) from 943 users on 1682 movies.
- Each user has rated at least 20 movies.
- Some movies have ratings from only a few users.
- Simple demographic info for the users (age, gender, etc.) are available
For more information about the original dataset, see http://files.grouplens.org/datasets/movielens/ml-100k-README.txt. We are grateful to the dataset creators and the University of Minnesota for making this data publicly available.
We've provided a clean train/test split of this dataset here: https://github.com/tufts-ml-courses/comp135-19s-assignments/tree/master/project3/data_movie_lens_100k
Starter Code and Code Restrictions
For this assignment, you are limited to the following Python packages for performing machine learning related functionality:
- All Problems: Default
comp135_env
packages: numpy, scipy, sklearn, etc. - All Problems:
autograd
package for automatic differentiation - Problem 4 onward:
surprise
package for recommendation - Problem 5 (open-ended): any other packages you want
You are welcome to consult the documentation websites or other external web resources for any of these packages. However, you should understand every line of the code you use and not simply copy-paste without thinking carefully. Remember to keep the course collaboration policy in mind: do your own work!
You can INSTALL the autograd
and surprise
packages as follows:
source activate comp135_env
pip install autograd
conda install -c conda-forge scikit-surprise
Problem 1: Simple Baseline Model with SGD and Autograd
To get used to developing models using our autograd framework, we'll consider the simplest possible baseline model "M1": a model that makes the same scalar prediction for a movie's rating no matter what user or movie is considered. This model has one scalar parameter \(\mu \in \mathbb{R}\), and the prediction for user \(i\) and movie \(j\) is simply:
Training Model M1
To train model M1 for \(N\) users and \(M\) movies, we wish to optimize the following objective:
In words, this means we want to minimize the squared error on the training set, between the predicted rating (simply \(\mu\) here) and the observed rating \(y_{ij}\).
Problem 1 Code Implementation Tasks
Edit the starter code file: CollabFilterMeanOnly.py
. Complete each required method (init_parameter_dict
, predict
, and calc_loss_wrt_parameter_dict
), as described above in the Autograd background section.
Problem 1 Report Tasks
As evidence of successfully completing Problem 1, include the following in your report:
1a: Figure and caption: Trace plot showing mean absolute error vs. epoch completed, for both the training set and validation set.
1b: Short answer: Would adding regularization to our training optimization problem (e.g. an L2 penalty on \(\mu\)) noticeably improve performance with this model on this dataset? Why or why not? What's special about this task that you might want to consider?
1c: Short answer: There is a closed-form operation we could apply to the training set to compute the optimal \(\mu\) value. How would you compute this "exact" solution? Report the computed optimal \(\mu\) value. Does this result agree with your SGD solution?
Problem 2: One-Scalar-Per-Item Baseline with SGD and Autograd
We now consider a model "M2" with three parameters:
- \(\mu\) : scalar mean rating
- \(b_i\) : scalar bias term for each user \(i\)
- \(c_j\) : scalar bias term for each movie \(j\)
Prediction under model "M2" becomes:
Training Model M2
To train model M2 for \(N\) users and \(M\) movies, we wish to optimize the following objective:
In words, this means we want to minimize the squared error on the training set, between the predicted rating and the observed rating \(y_{ij}\).
Problem 2: Code Implementation Tasks
Edit the starter code file: CollabFilterOneScalarPerItem.py
. Complete each required method, as described above in the Autograd background section.
Problem 2: Report Tasks
As evidence of successfully completing Problem 2, include the following in your report:
2a: Figure and caption: Trace plot showing mean absolute error vs. epoch completed, for both the training set and validation set.
2b: Short answer: Taking the best result of model M2 above, how does model M2 compare to M1 in terms of predictive performance?
2c: Figure and caption: Inspect the learned per-movie rating adjustment parameters \(c_j\) for the short list of movies in select_movies.csv
. Do you notice any interpretable trends? What does it mean for a movie to have a large positive \(c_j\) or large negative \(c_j\) value?
Problem 3: Matrix Factorization with Autograd
We now consider a full matrix factorization model "M3" with five parameters:
- \(\mu\) : scalar mean rating
- \(b_i\) : scalar bias term for user \(i\)
- \(c_j\) : scalar bias term for movie \(j\)
- \(u_i\) : K-dimensional vector for user \(i\)
- \(v_j\) : K-dimensional vector for movie \(j\)
Crucially, you'll now need to think about how to set \(K\), the number of "factors" or "dimensions" to learn when representing each user/movie in a vector space. Within the starter code, you set this with the n_factors
keyword argument to the constructor.
Prediction under model "M3" is:
Training:
To train model M3 for \(N\) users and \(M\) movies, we wish to optimize the following objective:
Note that we have added L2 penalties on the \(u\) and \(v\) vectors. In the starter code, the penalty strength hyperparameter can be set via the keyword argument alpha=...
in the constructor, and accessed via the attribute alpha
.
Problem 3: Code Implementation Tasks
Edit the starter code file: CollabFilterOneVectorPerItem.py
. Complete each required method, as described above in the Autograd background section.
Problem 3: Report Tasks
As evidence of successfully completing Problem 3, include the following in your report:
3a: Figure and caption: Using NO regularization (alpha=0.0
), make a trace plot showing mean absolute error vs. epoch completed, for both the training set and validation set. Repeat for several values of the number of factors \(K\): 2, 10, 50.
3b: Figure and caption: Repeat 3a with a moderate regularization strength \(\alpha > 0\), such that any overfitting in 3a is not longer visible.
3c: Short answer: Other than L2/L1 regularization penalties on parameter values, what else could we do to avoid overfitting in this setting when using the same model M3 and still using SGD?
3d: Short answer: Taking the best result of model M3 so far, how does model M3 compare to M2 or M1 in terms of predictive performance? How many factors do you recommend? Should we try even more than 50 factors?
3e: Figure and caption: For the best M3 model with \(K=2\) factors, consider the learned per-movie vectors \(v_j\) for the short list of movies listed in select_movies.csv
. Can you make a scatter plot of the 2-dimensional "embedding" vector \(v_j\) of these movies (labeling each point with its movie title), like we saw in lecture? Do you notice any interpretable trends?
Problem 4: Matrix Factorization using Surprise
Using surprise to tune collaborative filtering models
The surprise
package can fit a collaborative filtering model just like M3 with \(K\) hidden factors to ratings data, using surprise's SVD.
4a: Figure and caption: Show the results of 5-fold cross-validation on the full-training set (no need for a separate validation set here), across a range of possible \(K\) and \(\alpha\) values to get the best possible mean absolute error performance. What is your recommendation for the choice of the hyperparameters \(K\) and \(\alpha\)? How did you set the learning rate?
4b: Short answer: How does this model's performance compare with that from Problem 3, which you trained using your own implementation? If big differences exist, why do you think they occured?
Problem 5: Open-Ended Recommendation Challenge
The starter code includes an additional test dataset -- ratings_test_masked.csv
-- which you haven't used yet. This contains an additional 10,000 entries of user-movie pairs for the same set of users and movies as above. We've omitted the ratings here, so you won't be able to access them. In this problem, your goal is to obtain the best possible prediction results on this heldout test data, in terms of mean absolute error.
You can try any model you want. You can make use of the other ratings you've already observed in the training set, as well as the user-specific info found in user_info.csv
and the movie-specific attributes found in item_info.csv
. You can use autograd
, sklearn
, surprise
, or any other package. Your goal is to get the best score on our leaderboard.
Example Problem 5 ideas
-
Throughout this project, we have used mean-squared error in the
calc_loss...
method for training the models, but then evaluated with mean-absolute-error. What if we just used mean absolute error in the loss? This should be easy withautograd
. -
Can you somehow use user-specific features (like gender and age) as well as the learned features to improve accuracy?
-
Try one of the k-nearest neighbor approaches to recommendation in
surprise
-
Try dropout as an alternative to L2 regularization
Problem 5: Leaderboard Tasks
You should submit your final predictions as a plain text file named predicted_ratings_test.txt
This file should have one line for each non-header line of ratings_test_masked.csv.
Include this file inside the top-level of the ZIP file you submit to the leaderboard, which also contains your COLLABORATORS.txt and a snapshot of your full source code.
Problem 5: Report Tasks
Include 1 or 2 paragraphs describing your method, and 1 or 2 tables/figures relevant to reporting how you trained the model and how it performed on the test set. Be sure to include the measured leaderboard performance (in terms of mean absolute error) and discuss how this number compared to your internal validation efforts.
BONUS Problem: Predicting Gender from Learned Per-User Embedding Vectors
Consider the best M3 model you've trained (either with surprise or with your own implementation in Problem 3). Let \(U\) be the learned matrix of user vectors, with its i-th row as the vector \(u_i\) for user \(i\). Each row of \(U\) can be seen as a "feature vector" for a particular user.
The question we'd like to investigate is this: do our learned per-user features that are optimized for predicting movie ratings contain anything to do with gender?
The provided data file user_info.csv
contains an is_male
column indicating which users in the dataset are male. Can you predict this signal given the features \(U\)?
Implementation Tasks
Train a binary classifier to predict the is_male
target variable given the fixed per-user embedding features \(U\). Feel free to use any sklearn binary classifier (logistic regression, SVM, random forest, etc.). You should use best practices for model selection (cross validation, hyperparameter search, etc).
You'll need to set up this classification task from scratch. You have info for all 943 users. Should you include them all in the training process? How will you fairly report the accuracy on heldout data?
Bonus Problem Report Tasks
a: Paragraph: Describe your analysis. How did you split the data (train/test)? What classifier did you choose and why? How did you tune hyperparameters?
b: Figure and caption: Show a confusion matrix for your gender-from-user-features classifier. What error rate do you get? Is it significantly better than chance for this dataset?
Rubric for Evaluating PDF Report
Earning full credit on this assignment requires a well-thought-out report that demonstrates you understand the models and training procedures we're studying here.
Points will be allocated across the various parts as follows:
- 10%: Problem 1
- 20%: Problem 2
- 30%: Problem 3
- 20%: Problem 4
- 20%: Problem 5
The BONUS problem, if completed successfully, will be worth up to 8%.