Project B: Movie Recommendations


Last modified: 2025-12-04 18:25

Updates

  • 2025-11-17 19:00 : As mentioned in class, there are submission limits on the leaderboard. Submitting 5 times in a 24 hour period, or 25 times in total, will cause the autograder to fail.

Status: RELEASED

Due date

  • Leaderboard closes on Sun Dec 7 at 11:59pm Medford time

  • Report is due Mon Dec 8 at 11:59pm Medford time

Jump to:

  Background   Code   Datasets   Problem 1   Problem 2   Rubric for Evaluating PDF Report

Turn-in links

Overview

This is a multi-week project with lots of open-ended programming. Get started soon! Although you have over 3 weeks to work on this project, there is an extra credit homework and a 3-day break 2 weeks in.

Suggested schedule:

  • Release on Thu 11/13
  • Form partners and complete ProjectB Team Formation Form by Tue 11/18
  • Complete Problem 1 by Tue 11/25
  • Complete Problem 2 by Fri 12/5
  • Problem 2 Leaderboard closes Sun 12/7 11:59pm
  • Final Report due by Mon 12/8

Team Formation

In this project, you can work in teams of 2 people, or (if you prefer) individually. Individual teams still need to complete all the parts below. We want to incentivize you to work in pairs. If you need help finding teammates, feel free to make a post on Piazza!

Work to Complete

As a team, you will work on one scaffolded problem, and then a completely open problem.

The 2 problems look at different representations of text for a common task.

  • Problem 1 looks at a factorization model with a learned vector embedding per item.
  • Problem 2 is an open-ended problem, where any method is allowed.

In Problem 1, you will practice the development cycle of for models trained with SGD:

  • Develop a coherent loss function for your application
  • Use automatic differentiation toolboxes to compute the gradient
  • Attempt multiple runs of SGD, carefully tuning batch size and learning rate, until you get a satisfactory performing model
  • Select among several model complexity hyperparameters to get the best generalization performance

Then, for Problem 2, you have much more open-ended freedom.

  • Build the best model you can, and try to get to the top of the leaderboard!

What to Turn In

Each team will prepare one PDF report covering all problems below. You will not submit any code.

  • Prepare a short PDF report (7 pages or less).
  • This document will be manually graded according to our rubric
  • Can use your favorite report writing tool (Word or G Docs or LaTeX or ....)
  • Should be human-readable. Do not include code. Do NOT just export a jupyter notebook to PDF.
  • Should have each subproblem marked via the in-browser Gradescope annotation tool)

Each team will prepare one file of the leaderboard-test-set predictions as a leaderboard submission for Problem 2.

  • predicted_ratings_leaderboard.txt : a plain text file
  • One line for each line in the released dataset ratings_masked_leaderboard_set.csv
  • Can be loaded in with np.loadtxt() as a valid 1d array of floats with shape (10000,)

Each individual will turn in the reflection form after completing the report.

Starter Code and Code Restrictions

https://github.com/tufts-ml-courses/cs135-25f-assignments/tree/main/projectB/

For problem 1, you must use and complete the provided starter code files, which rely on autograd. You cannot use other Python packages beyond what has already been imported in starter Python files, or basic packages in this list: numpy, scipy, pandas, matplotlib, seaborn.

For problem 2, you can use any Python package you like (surprise, sklearn, nltk, etc). Specifically, you might find the surprise package useful for recommendations.

We thus recommend installing two extra packages to your CS135-specific environment:

micromamba activate cs135_25f_env
pip install autograd          # for prob 1
pip install scikit-surprise   # for prob 2

If you are using a different package manager, you might need to install autograd with a command other than pip.

You are welcome to consult documentation websites or other external web resources for snippets of code to guide your usage of different methods. However, you should understand every line of the code you use and not simply copy-paste without thinking carefully. You should also cite and acknowledge third-party code that made a significant impact on your work in your report.

Remember to keep the course collaboration policy in mind: your team should do your own work!

Starter Code organization

  • data_movie_lens_100k/: contains the data.
  • train_valid_test_loader.py: load predefined train/valid/test splits from CSV file

For Problem 1:

  • CollabFilterOneVectorPerItem.py: File you will edit.
  • AbstractBaseCollabFilterSGD.py: Abstract class for basic collaborative filtering.

For Problem 2:

  • surprise_demo.py: a demo of the surprise package for recommendation tasks.
  • read_svd_vectors.py: A demo file showing how to extract model parameters from suprise SVD models

Background

For this project, you are given a large dataset of the ratings that 943 users have given to 1682 movies.

We'd like to build a recommendation system to help guess which movies a user will like.

Input: The input to our prediction system is a pair \(i,j\), which denotes a specific user id (denoted by index \(i\), one of 943 possibilities) and movie id (denoted by index \(j\), one of 1682 possibilities).

Output: The output of our predictor, produced by calling model.predict(i,j), will be a scalar rating \(\hat{y}_{ij} \in \mathbb{R}\). For each possible user-movie pair \(i,j\), we'd like \(\hat{y}_{ij}\) to be as close as possible to the "real" 5-star rating \(y_{ij}\) that the user \(i\) gave the movie \(j\). We can represent any five star rating with an integer in the set \(\mathcal{Y} = \{1, 2, 3, 4, 5\}\): a rating of 1 is the worst possible, a rating of 5 is the best.

We can think of our entire set of observed ratings as a big 2D matrix \(Y\), which has 943 rows and 1682 columns. Formally, we can write \(Y \in \mathcal{Y}^{943 \times 1682}\).

Of course, not all users have seen and rated all movies, so some true ratings are simply unknown to us. Thus, if we examined our true rating matrix \(Y\), most entries would be missing. It is impossible to know if our guesses for missing entries are any good.

We thus concentrate only the observed entries of \(Y\). We can represent our dataset more compactly as a list \(\mathcal{I}\) of all distinct observed pairs of input (user_id \(i\), item_id \(j\)) and output (rating \(y_{ij}\)) in the observed dataset.

Dataset

We will use the MovieLens 100K dataset. This data set consists of:

  • 100,000 ratings (1-5 stars) from 943 users on 1682 movies.
  • Each user has rated at least 20 movies.
  • Some movies have ratings from only a few users.
  • Ratings were collected on a 5-star scale. Each rating is one of 5 possible integer values -- 1, 2, 3, 4, or 5 -- with 5 being 'best' and 1 being 'worst'.
  • Simple demographic info for the users (age, gender, etc.) are available

For more information about the original dataset, see http://files.grouplens.org/datasets/movielens/ml-100k-README.txt. We are grateful to the dataset creators and the University of Minnesota for making this data publicly available.

We've provided a clean preprocessed version of this dataset here: https://github.com/tufts-ml-courses/cs135-25f-assignments/tree/main/projectB/data_movie_lens_100k

We've provided one CSV file, ratings_all_development_set.csv, to give you all the data you need to develop and evaluate models. (Another CSV file for the leaderboard will be discussed later in Problem 2).

Each row of this file specifies 3 things:

  • user_id \(i\) : an integer in \(\{0, 1, 2, ... 942\}\)
  • item_id \(j\) : an integer in \(\{0, 1, 2, ... 1681\}\)
  • observed 5-star rating \(y_{ij}\) : an integer in {1, 2, 3, 4, 5}$

The first few rows of ratings_all_development_set.csv are:

user_id,item_id,rating
772,36,3
471,228,5
641,401,4
312,98,4
...

Train/validation/test splits: As usual, we can take the large development dataset \(\mathcal{I}\) (which is provided to you in random order) and divide it into training, validation, and test sets: \(\mathcal{I}^{\text{train}}\), \(\mathcal{I}^{\text{valid}}\), and \(\mathcal{I}^{\text{test}}\).

In starter code, we've provided a train_valid_test_loader.py that helps you load train/validation/test splits of this dataset.

Hint: See Part 1 of the Recommender Systems lab notebook, for a walk through of how to load the data and understand what it contains.

Background: Evaluation Performance Metric

Your goal is to predict the ratings of all user-movie pairs well.

To measure and ultimately rank models by prediction quality, we'll use two common metrics.

Root mean squared error (RMSE). For a given dataset \(\mathcal{I}\) of triples \(i, j, y_{ij}\), recall that RMSE is computed as:

$$ \text{RMSE}(y, \hat{y}) = \sqrt{ \frac{1}{|\mathcal{I}|} \sum_{i,j \in \mathcal{I}} ( y_{ij} - \hat{y}(i, j) )^2 } $$

This metric's units are interpretable on the five-star rating scale. Lower is better.

Mean absolute error (MAE). For a given dataset \(\mathcal{I}\) of triples \(i, j, y_{ij}\), recall that MAE is computed as:

$$ \text{MAE}(y, \hat{y}) = \frac{1}{|\mathcal{I}|} \sum_{i,j \in \mathcal{I}} | y_{ij} - \hat{y}(i, j) | $$

This is useful for our five-star rating task, since it is nicely interpretable (e.g. an MAE of 0.5 means we're within a 1/2-star of the answer on average). Plus, it is less sensitive to extreme outliers, unlike RMSE.

Background: SGD, Automatic Differentiation and autograd

In Problem 1, you'll develop your own Python code to build a latent factor model for collaborative filtering, whose resulting optimization problem we'll solve with stochastic gradient descent. A great deal of scaffolding is provided for you. Your task is to complete the implementation (you will not turn in your code) and tune it, describing its performance in your report.

We've already used sklearn to optimize loss functions, as in homework 3, but you'll now get some experience writing a custom loss function and using SGD to optimize that loss.

How do we use SGD for our problem? At each update step, you'll grab a minibatch of data (a random subset of the observed entries in our ratings matrix \(Y\)), and compute gradient estimates for parameters with respect to this batch.

How will we compute gradients? We've provided some starter code to help you. To save lots of effort and avoid having to hand-derive gradients, we'll use the autograd Python package to perform automatic differentiation.

For an introduction to using autograd, see the Autograd For Gradient Descent lab from day20.

When you fit parameters via SGD, your gradient descent code will iterate over many minibatches, and ultimately complete many epochs.

Epoch: An epoch is a unit of training progress in minibatch learning. One epoch is complete when our stochastic gradient descent has processed enough minibatches that it has processed each training data point once.

Implementing the CollabFilter

Your task for problem 1 is to define several methods for CollabFilterOneVectorPerItem.py which will then allow us to

  • perform prediction given fixed parameters
  • compute the loss used for gradient-based training of parameters

CollabFilterOneVectorPerItem inherits from AbstractBaseCollabFilterSGD. This abstract base class contains logic to construct and fit collaborative filtering models to data via stochastic gradient descent. Using this base class, you don’t need to write SGD yourself, or even compute the gradient! Please do read through this base class carefully to understand how it works.

Below, we review the methods/attributes you’ll need to write yourself, and the methods/attributes you only need to use that are provided in the base class.

Methods to Implement

Let instance attribute param_dict store all model parameters in a dictionary

    param_dict : dict
        Keys are string names of parameters
        Values are *autograd.numpy arrays* of parameter values    

You need to...

  • Define method predict to make predictions:

    • Input: Specific user-movie example pairs, indicated by integer ids
    • Output: Predicted ratings for each example
  • Define method calc_loss_wrt_parameter_dict to compute the loss to minimize with SGD:

    • Input: a minibatch of training data, a parameter dict
    • Output: scalar float, indicating the loss on the batch given the parameters
  • Define method init_parameter_dict to initialize the param_dict attribute to random values:

    • Input: Number of possible users, Number of possible items
    • Output: None, internal attribute param_dict updated

The steps above are all you need to do for each possible model. A complete fit method, inherited from the predefined AbstractBaseCollabFilterSGD, knows how to perform SGD given the pieces above.

Methods you'll need to understand and use (but not edit)

You should read through the provided AbstractBaseCollabFilterSGD.py, to be sure you understand what's going on.

  • __init__ : Constructor

This is where the user defines the batch_size, the step_size (learning rate), and the number of epochs n_epochs to complete during training. This constructor can also define the regularization strength alpha and the number of hidden factors n_factors.

  • fit : Method to fit model parameters to provided data

Example code calling fit for the CollabFilterOneVectorPerItem class is provided at the bottom its file.

When the user calls fit, the model parameters are initialized to random values, and then updated iteratively via SGD to improve the loss. These updates proceed until the desired number of epochs are performed.

Within our fit implementation, we compute and store performance metrics at various checkpoints throughout the training process, including:

  • at the initial parameters (before any updates)
  • every 1/4 of an epoch for epochs 0 - 2
  • every 1/2 of an epoch for epochs 2 - 8
  • every 1 of an epoch for epochs 8 - 32
  • every 2 epochs for 32 - 128
  • every 4 epochs after that

These metrics helps us monitor progress as learning progresses, which is especially rapid in early epochs as the poor random initialization is improved.

Attributes available after calling fit that trace performance

The code already records and stores useful diagnostic metrics computed at various checkpoints throughout the training procedure. These help us "trace" what happens throughout learning, so we call them trace performance metrics.

After fitting a model, you'll have the following trace attributes available to you:

trace_epoch : 1D array-like
    Contains the epochs (fractional) where model performance was assessed aka checkpointed.
    Value of 0.0 indicates the initial model parameters (before any gradient updates).
    Value of 0.1 indicates the total training examples seen represents 10% of the size of the training set.
    Value of 2.0 indicates two epochs have been completed; that is, total training examples seen represents 200% of the size of the training set.

trace_loss : 1D array-like
    Contains training loss (at current batch only) at each checkpoint.
    This is reported as an average per example in the current batch.

trace_rmse_train : 1D array-like
    RMSE assessed on entire training set at each checkpoint

trace_rmse_valid : 1D array-like
    RMSE assessed on entire validation set at each checkpoint

So for example, to plot training RMSE vs. epochs completed, you could do:

# After calling fit...
plt.plot( model.trace_epoch, model.trace_rmse_train, '.-')

Problem 1: One-Vector-Per-Item Collaborative Filtering with SGD and Autograd

We consider a full matrix factorization model with five parameters:

  • \(\mu\) : scalar mean rating
  • \(b_i\) : scalar bias term for each user \(i\)
  • \(c_j\) : scalar bias term for each movie \(j\)
  • \(u_i\) : K-dimensional vector for each user \(i\)
  • \(v_j\) : K-dimensional vector for each movie \(j\)

Crucially, you'll now need to think about how to set \(K\), the number of "factors" or "dimensions" to learn when representing each user/movie in a vector space. Within the starter code, you set this with the n_factors keyword argument to the constructor.

Prediction for the rating that user \(i\) will give to movie \(j\) is defined as:

\begin{align} \hat{y}_{ij} &= \mu + b_i + c_j + \sum_{k=1}^K u_{ik} v_{jk} \\ &= \mu + b_i + c_j + u_{i}^T v_{j} \end{align}

Which is known as a latent factor (LF) model, or a collaborative filtering (CF) model for recommendation, in the ML research literature.

Training:

To train a model for \(N\) users and \(M\) movies, we wish to optimize the following objective:

$$ \min_{\mu, b, c, \{ u_i \}_{i=1}^N, \{ v_j \}_{j=1}^M} \quad \alpha \Big( \sum_{j} \sum_{k} v_{jk}^2 + \sum_{i} \sum_{k} u_{ik}^2 \Big) + \sum_{i,j \in \mathcal{I}^{\text{train}}} (y_{ij} - \mu - b_i - c_j - u_i^T v_j)^2 $$

Again, this is a squared error objective. Note that we have added L2 regularization penalties on the \(u\) and \(v\) vectors with strength \(\alpha \geq 0\). We'll investigate whether this helps us generalize to heldout data better.

In the starter code, the L2 penalty strength hyperparameter can be set via the keyword argument alpha=... in the constructor, and accessed via the attribute alpha.

Problem 1 Code Implementation Tasks

Edit the starter code file: CollabFilterOneVectorPerItem.py. Complete each required method:

  • predict
  • calc_loss_wrt_parameter_dict
  • init_parameter_dict

As described above in Methods to Implement.

Hint: See Part 2 of the Recommender Systems lab notebook, for a walk through of how to use this code and what you need to do for Problem 1.

Problem 1 Analysis Tasks

You should fix batch_size=1000 throughout this problem.

Step 3(i): First, with no regularization (\(\alpha=0\)), train a CollabFilterOneVectorPerItem using SGD. Try three possible values of \(K\): 2, 10, and 50.

Step 3(ii): Second, train with \(K=50\) now with moderate regularization by setting strength \(\alpha > 0\) to try to eliminate overfitting you saw in 3(i). Focus only on \(K=50\).

For the best run at each \(K\) (with/without \(\alpha > 0\)), please record that model's:

  • RMSE and MAE on the validation set
  • RMSE and MAE on the test set
  • value of parameters \(\mu, b, c, u, v\) (not needed for your report, but useful so you don't need to retrain the model if you have to remake a figure)

You likely need to adjust the SGD step_size or n_epochs hyperparameter here. You should aim to show runs that last until either training error visibly converges or until you see obvious overfitting.

Hint: If you do see overfitting, you might consider using early stopping (by monitoring the validation set performance) to get the best possible model for each step above. Note that optimization contains randomness, both from the initialization and from the minibatch sampling in SGD, but by controlling the random state you can allow your optimization runs to be reproducible.

Problem 1 Report Tasks

Include the following in your report:

1a: Figure and caption: Make trace plots showing RMSE vs. epoch when \(\alpha=0\). You should include 3 figure panels side-by-side, one for \(K=2\) factors, one for \(K=10\), and one for \(K=50\). Each panel should show two lines, one for train RMSE and one for validation RMSE.

We recommend setting the y-axis range identical in all panels, to improve comparisons across panels.

In your caption, answer these questions:

  • (i) What happens as \(K\) increases in terms of under- or over- fitting?
  • (ii) How does the 'best' validation-set performance change from K=2 to 10 to 50?
  • (iii) What step size? Why did you pick that value?

1b: Figure and caption: Make trace plot showing RMSE vs. epoch completed when \(\alpha > 0\).

Here, please plot one figure, for \(K=50\).

In your caption, answer these questions:

  • (i) what value of alpha? how did you select it? (you should try several values)
  • (ii) what value of step size did you pick?
  • (iii) did you get better validation-set error with this alpha than you did with the K=50, alpha=0 result in 3a?

1c: Table and caption: Report the RMSE and MAE for train, validation, and test sets for the "best version" of each of these latent factor (LF) models (one per row):

  • LF with \(K=2\)
  • LF with \(K=10\)
  • LF with \(K=50\) with \(\alpha = 0\) (from 3a)
  • LF with \(K=50\) with \(\alpha > 0\) (from 3b)

Below the table, please discuss:

  • Focusing on RMSE, how many factors \(K\) do you recommend?
  • Does model ranking change if you were to select via MAE instead of RMSE? (You do not need to retain, this is only about selecting the optimal model after training.)

1d: Figure and caption: For the best LF model with \(K=2\) factors, consider the learned per-movie vectors \(v_j\) for the short list of movies listed in select_movies.csv. Make a scatter plot of the 2-dimensional "embedding" vector \(v_j\) of these movies (labeling each point with its movie title).

In a brief caption: Do you notice any interpretable trends? What makes sense? What does not make sense to you?

Hint: See the very bottom of Part 3 of the Recommender Systems lab notebook with help plotting a 2D visual of the embedding vectors. Search for "Make visualization of the learned embeddings of select movies".

Problem 2: Open-Ended Recommendation Challenge

The starter code includes an additional leaderboard heldout dataset -- ratings_masked_leaderboard_set.csv -- which you haven't used yet. This contains an additional 10,000 entries of user-movie pairs for the same set of users and movies as above. We've omitted the ratings here, so you won't be able to access them (they are "masked").

In this problem 2, your goal is to obtain the best possible prediction results on this heldout test data, in terms of mean absolute error (MAE).

You can try any model you want. You can make use of the other ratings you've already observed in the training set, as well as the user-specific info found in user_info.csv and the movie-specific attributes found in item_info.csv. You can use autograd, sklearn, surprise, or any other package. Your goal is to get the best score on our leaderboard.

Change from Project A: There are limits on your number of submissions. You are allowed a max of 25 submissions, with a max of 5 submissions on a given day.

Example ideas

  • Try one of the implementations of SVD in the surprise package. This package offers models that look like our LF from Problem 1, plus some neat extensions of these. Can you train one of these to perform better? Please take advantage of surprise's extensive sklearn-like grid search tools.

  • Can you somehow use user-specific features (like gender and age, available in user_info.csv) or movie-specific features (like title and year, available in movie_info.csv) to improve your rating predictions?

  • Try one of the k-nearest neighbor approaches to recommendation in surprise

  • Try a new loss function! In Problem 1, we used mean-squared error in the calc_loss... method for training the models. However, here in Problem 2 we are evaluating to focus on mean-absolute-error. What if we just used mean absolute error in the loss? This should be somewhat easy with autograd.

Problem 2: Leaderboard Submission Tasks

You should submit your final predictions as:

predicted_ratings_leaderboard.txt : a plain text file

  • One line for each line in the released dataset ratings_masked_leaderboard_set.csv
  • Can be loaded in with np.loadtxt() as a valid 1d array of floats with shape (10000,)

Problem 2: Report Tasks

Please include in your report

2a

1-2 paragraphs describing your proposed method (how it works, why you chose it, what training and hyperparameter selection is done, etc.). Show that you have mastered the core concepts and best practices of machine learning, and can describe your approach in a concise yet reproducible way.

Be sure you cite references and give credit to any key open-source tools you used.

2b

1 figure (with caption) relevant to reporting how you trained the model or selected model complexity. This could be a trace plot (MAE vs epochs) or a hyperparameter selection plot (heldout error vs hyperparameter value).

This figure should demonstrate your understanding of core concepts and ability to avoid under-/over-fitting.

2c

1 table reporting ultimate mean absolute error (MAE) performance.

Include two columns:

  • MAE on test split of the development set
  • MAE on leaderboard set

Include the following rows:

  • chosen Problem 2 method
  • model from Problem 1c (choose best K in terms of val set MAE)
  • any other methods you tried that you feel worth including

You need only report the leaderboard score of your chosen Problem 2 method.

Below this table, include a caption that answers

  • (i) Discuss how your leaderboard number compared to performances on the provided test split of the dev set.
  • (ii) Compare contrast Problem 1 and Problem 2 solutions

2d: Discussion

1 paragraph discussing the overall pros and cons of your current approach. What other kinds of recommendation problems would it work well for? What are its limitations? What kinds of data/problems would this approach not work well for?

1 paragraph describing opportunities for future work: what would you try if you had another week?

1 paragraph reflecting on this project: What are the key takeaway lessons you learned?

Rubric for Overall Performance

We'll get a final number for this project by averaging:

  • 87% : your report performance, using the rubric below
  • 10% : your leaderboard submissions, using the rubric below
  • 3% : completion of your reflection on the project

Rubric for Evaluating Leaderboard Submissions

For the Problem 2 leaderboard, we'll give you a score between 0.0 and 100.0 where:

  • 85% of points represent if you achieved a "reasonable" score (e.g. a standard pipeline trained using good practices)
  • 15% of points awarded depending on proximity to the top 3 submissions in this class.

Partial credit is possible, we linearly interpolate between the "reasonable" score and the "top" score.

For example, suppose the top MAE was 1.0, while the reasonable MAE was 1.3. We'd give the following credit:

  • 100 points for MAE of 1.0
  • 95 points for MAE of 1.1 (2/3 way from 1.3 to 1.0)
  • 90 points for MAE of 1.2 (1/3 way from 1.3 to 1.0)
  • 85 points for MAE of 1.3

Scores worse than the 'reasonable' will be similarly interpolated, against another reference point defining the D/C grade boundary of 70 points.

Rubric for Evaluating PDF Report

Earning full credit on this assignment requires a well-thought-out report that demonstrates you understand the models and training procedures we're studying here.

We're looking for the itemized content outlined above, with professional figures and captions that tell a story, concise but well-thought out short answers, and evidence of best practices throughout.

Points will be allocated across the various parts as follows:

  • 40%: Problem 1

    • 15%: 1a
    • 8%: 1b
    • 11%: 1c
    • 6%: 1d
  • 60%: Problem 2

    • 15%: 2a
    • 15%: 2b
    • 15%: 2c
    • 15%: 2d