Project Overview


Last modified: 2022-03-11 16:49

Project Goals

The goal of this project is to familiarize yourself with the process of selecting, applying, evaluating, and improving the models we study in this class in the context of a real dataset. These skills are useful in both industry and academic careers, as both involve choosing reasonable models to apply to a problem, evaluating their performance, and iterating until you find something that works. In this project, we will break this process down into a series of concrete steps building on the models and learning methods explored in this course.

Instructions

Your project consists of 4 main components:

  • Choose a dataset and a model+learning method covered in class and describe their properties (you can expect to use and build on code from the coding practicals for the model+learning method)
  • Generate and evaluate specific hypotheses about how your choice of model+learning method will work on your data
  • Propose an upgrade to your model or learning method and generate specific hypotheses about how it will perform on your data
  • Implement your proposed upgrade and evaluate its performance on your dataset

The emphasis in this project is on generating and testing reasonable hypotheses, grounded in specific features of the dataset, the model+learning method, and the proposed upgrade that you choose, about a specific performance property. Your upgrade is not required to be novel, and it is not required to work well (i.e., improve the performance of your model in the ways you hypothesize), but your upgrade and your hypothesis about why it should work must make sense in the context of your dataset. If your hypothesis fails, you should attempt to explain why. You should also be able to accurately describe the properties of your model, your dataset, and your upgrade using the tools from this course. Finally, your upgrade should incorporate something that we have not covered in depth during the course lectures or homeworks/coding practicals.

Project Teams

You should work in teams of 2. We may make exceptions for you to work alone or in a team of 3, but we do not recommend it. Teams of 4 or more are not allowed. If you work alone, you will be required to do the same amount of work as a team of 2. If you work in a team of 3, you will be required to do 2 upgrades instead of 1 to scale the amount of work appropriately.

Choosing a Dataset

You are free to choose whatever real-world dataset you want to work with as long as it is publicly available and meets the constraints described below.

If you are looking for a dataset, I recommend checking out either the UCI repository of machine learning datasets: https://archive.ics.uci.edu/ml/datasets.php, or the NYC Open Data repository: https://opendata.cityofnewyork.us/data/. UCI is easier to search for datasets that meet the project requirements.

Dataset constraints:

  • Publicly available (so we can access the data if needed)
  • At least 500 instances
  • At least 20 feature dimensions
  • Feature and label types match the model you are considering using. For example, if you are thinking of using a Dirichlet-Categorical model, you do not need a label (\(t\) in our notation in class), but you do need to make sure the data is categorical. If you plan to use logistic regression, the dataset should contain a binary label \(t\). A minimal sketch of checking these constraints appears below.
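For example, assuming pandas, a local CSV file, and a single label column named t (both the file name and the column name are placeholders for whatever your dataset actually uses), a quick check of the constraints might look like:

```python
import pandas as pd

# Placeholder path and label column name -- substitute your own dataset's details.
df = pd.read_csv("my_candidate_dataset.csv")
feature_cols = [c for c in df.columns if c != "t"]

print(f"{len(df)} instances, {len(feature_cols)} feature dimensions")
print("label values:", sorted(df["t"].unique()))  # e.g. should be two values for logistic regression

assert len(df) >= 500, "need at least 500 instances"
assert len(feature_cols) >= 20, "need at least 20 feature dimensions"
```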

Choosing a Model

You should work with one of the models we cover in this course. If you choose a model used in one of the coding practical assignments, you are welcome to build on your implementation. You are welcome to choose a model that was not used in a coding practical (for example, logistic regression), but you will be required to build your own implementation.

Additionally, you can use a model from later in the course (for example, a Gaussian Mixture Model or a time series model), but you would need to study ahead to use one of these, since we will cover them after some of the project checkpoints, so I do not recommend it. Finally, if you have an idea for a model you want to use that we do not specifically cover in this class, you are welcome to ask me whether you can work on it and I will decide on a case-by-case basis. The main requirement is that it must use the probabilistic methods we have been learning in this course (so, for example, neural networks are not appropriate).

Suggested models:

  • Beta-Bernoulli
  • Dirichlet-Categorical (implemented in CP1)
  • Linear regression (implemented in CP2)
  • Logistic regression

Choosing an Upgrade

When choosing an upgrade to the model you are working with, the emphasis is on choosing an upgrade that you can describe clearly using the tools from this class, reasoning about the consequences of this upgrade on a specific type of performance, and formulating specific hypotheses based on the projected consequences that justify why this is a reasonable upgrade to try.

Your upgrade does not need to be novel: you are welcome (but not required) to consult your textbook, research papers, etc. to come up with ideas, but you must cite anything you consult. Your hypotheses do not need to be correct. You do, however, need to come up with convincing reasons why the upgrade might work, based on the properties of your dataset, your model+learning method, and your upgrade.

Your upgrade should fall into one of the categories below. If you have something in mind that doesn't, you're welcome to discuss it with me and we'll decide on a case-by-case basis. Finally, if you use something we've already explored in class or in a homework, you should include an insight about it beyond those we've already seen together. For example, if you scale up inference by using stochastic gradient descent instead of gradient descent, tell me something about it beyond what we saw in question 2 on HW2 (whether or not it's unbiased in expectation): for instance, does it take more steps to reach the same solution as gradient descent?

Upgrade Option 1: Changing the Prior

For this option, you can change the PDF/PMF used in the prior of the model you are working with. You should not just change a parameter in the prior (for example, it is insufficient to change the value of \(\alpha\) in the Dirichlet prior of the Dirichlet-Categorical model).

One example of a valid upgrade that falls in this category is the change we made to the prior of the Beta-Bernoulli model in quiz 1. Instead of using a Beta distribution over the prior parameter \(\mu\): \(\mu \sim Beta(a, b)\), we used a Normal distribution over a parameter \(r\): \(r \sim N(0, 1)\), which we transformed to \((0, 1)\) using the sigmoid function: \(\mu = \sigma(r)\).
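As a rough illustration (not the quiz solution), here is one way the two priors could be written down in code, assuming NumPy/SciPy and a dataset of binary outcomes; the function and variable names are placeholders:

```python
import numpy as np
from scipy.stats import beta, bernoulli, norm
from scipy.special import expit  # numerically stable sigmoid

def log_joint_beta_prior(mu, data, a=1.0, b=1.0):
    """Original model: mu ~ Beta(a, b), data_n ~ Bern(mu)."""
    return beta.logpdf(mu, a, b) + bernoulli.logpmf(data, mu).sum()

def log_joint_sigmoid_normal_prior(r, data):
    """Upgraded model: r ~ N(0, 1), mu = sigmoid(r), data_n ~ Bern(mu)."""
    mu = expit(r)  # maps r in (-inf, inf) to mu in (0, 1)
    return norm.logpdf(r, loc=0.0, scale=1.0) + bernoulli.logpmf(data, mu).sum()

# Toy usage: r = 0.4 corresponds to mu = sigmoid(0.4) ~= 0.60
data = np.array([1, 0, 1, 1, 0, 1])
print(log_joint_beta_prior(0.6, data))
print(log_joint_sigmoid_normal_prior(0.4, data))
```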

Upgrade Option 2: Changing the Likelihood

For this option, you can change the PDF/PMF used in the likelihood of the model you are working with.

One example of a valid upgrade that falls into this category is changing the likelihood from the Normal distribution used in linear regression to the Bernoulli distribution used in logistic regression, by applying the sigmoid function: \(p(t \mid x) = Bern(t \mid \sigma(w^T x))\).
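A minimal sketch of this likelihood change, assuming NumPy/SciPy, a feature matrix X of shape N x D, labels t, and weights w (all names illustrative):

```python
import numpy as np
from scipy.stats import norm, bernoulli
from scipy.special import expit  # sigmoid

def log_lik_linear_regression(w, X, t, sigma=1.0):
    """Gaussian likelihood: t_n ~ N(w^T x_n, sigma^2), for real-valued t."""
    return norm.logpdf(t, loc=X @ w, scale=sigma).sum()

def log_lik_logistic_regression(w, X, t):
    """Bernoulli likelihood: t_n ~ Bern(sigmoid(w^T x_n)), for t_n in {0, 1}."""
    return bernoulli.logpmf(t, expit(X @ w)).sum()
```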

For other upgrades of this form, you might consider looking at other generalized linear models.

Upgrade Option 3: Changing the Inference Method

Note: by inference we mean the process of learning the parameters of the model.

For this option, you can change the inference method. There are various ways to do this, many of which involve the different types of approximations used in approximate inference. Changing from an ML estimate of the parameters to a MAP estimate is technically a change in the inference method, but we will not allow it, since we have studied that change extensively in class.

One example of a valid upgrade that falls under this category is changing how you estimate the posterior predictive of logistic regression: from a Monte Carlo estimate to a closed form computed using the probit approximation to the sigmoid function. Another example is using second-order gradient descent instead of first-order gradient descent to compute the parameters of the logistic regression model.
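As a rough sketch of the first example, assuming a Gaussian approximation of the posterior over the weights with mean w_map and covariance S (names are illustrative, not from the course code), using the standard probit-based closed form \(\sigma(\mu_a / \sqrt{1 + \pi \sigma_a^2 / 8})\):

```python
import numpy as np
from scipy.special import expit  # sigmoid

def predictive_monte_carlo(x, w_map, S, n_samples=10_000, seed=0):
    """Monte Carlo estimate of p(t=1 | x): average sigmoid(w^T x) over posterior samples of w."""
    rng = np.random.default_rng(seed)
    w_samples = rng.multivariate_normal(w_map, S, size=n_samples)
    return expit(w_samples @ x).mean()

def predictive_probit_approx(x, w_map, S):
    """Closed-form probit approximation: sigmoid(mu_a / sqrt(1 + pi * var_a / 8)),
    where a = w^T x has mean mu_a = w_map^T x and variance var_a = x^T S x."""
    mu_a = w_map @ x
    var_a = x @ S @ x
    return expit(mu_a / np.sqrt(1.0 + np.pi * var_a / 8.0))
```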

Upgrade Option 4: Scaling up the Inference Method

For this option, you can make a conceptual change to the inference method specifically to allow you to scale up inference to work on a larger dataset, or to work much faster. Straightforward code optimizations (for example, changing a for loop to a matrix operation) do not count as a valid instance of this type of upgrade. However, thinking carefully about how to parallelize inference is a valid upgrade.

One example of a valid upgrade that falls under this category is changing from gradient descent computed on the entire dataset to stochastic gradient descent computed on one instance at each step. This was described in question 2 of HW 2.
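A minimal sketch of that change, assuming NumPy and a generic grad(w, X, t) function that returns the gradient of your training objective (the names and fixed learning rate are illustrative):

```python
import numpy as np

def gradient_descent(grad, w0, X, t, lr=0.01, n_steps=1000):
    """Each step uses the gradient computed on the entire dataset."""
    w = w0.copy()
    for _ in range(n_steps):
        w -= lr * grad(w, X, t)
    return w

def stochastic_gradient_descent(grad, w0, X, t, lr=0.01, n_steps=1000, seed=0):
    """Each step uses the gradient computed on a single randomly chosen instance."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(n_steps):
        n = rng.integers(len(X))
        w -= lr * grad(w, X[n:n+1], t[n:n+1])  # keep 2D shape: a one-instance "batch"
    return w
```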

Format of a Hypothesis

One of the main outcomes of your project will be generating and testing a series of hypotheses: hypotheses about how your chosen model will work on your dataset, and hypotheses about how your upgrade will perform compared to your chosen model on your dataset. We describe the expectations for these hypotheses below, and we will also iterate on them throughout the course of this project. This includes submitting them in different checkpoints and receiving feedback, and one or more structured, in-class exercises to practice and refine the skill of generating hypotheses and ways of evaluating them.

Your hypotheses should have the following structure:

We hypothesize that our model+learning method/our upgrade will have the following performance property on our dataset because this property of our model+learning method/upgrade aligns with this property of our dataset in a way that affects the specific performance property we are studying.

We can instantiate this using an example from CP1, 1e. That hypothesis is:

We hypothesize that the MAP estimator and Posterior predictive estimator will outperform the ML estimator on negative log likelihood when there are few training examples in the dataset, because the ML estimator is known to overfit on small training sets, since it does not incorporate a prior.

Aligning this example to the structure above:

  • The performance property here is the difference in negative log likelihood between the MAP/Posterior predictive estimators and the ML estimator.
  • The property of the dataset is that there are few training instances.
  • The property of the model+learning method is that the ML estimator overfits when given few training examples because it does not incorporate a prior.

We can see that the model+learning method and dataset properties align in the sense that they both relate to training dataset size, and the model+learning method property and performance property align because they both relate to the relative negative log likelihood for the MAP/Posterior predictive estimators and the ML estimator.

We can also see from the example above how the hypothesis tells us what tests we need to run to evaluate our hypothesis. In general, if your hypothesis has a particular performance property, you need to evaluate it by testing that performance property. In our example, the performance property was about the relative performance in terms of negative log likelihood for the MAP/Posterior predictive estimators and the ML estimator, so to evaluate it, we needed to report these 3 numbers and compare them.
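As a rough sketch of what that evaluation might look like in code (the estimator names and the fit()/nll() interface are placeholders, not the actual CP1 code):

```python
def evaluate_nll_vs_train_size(estimators, train_data, test_data, train_sizes):
    """Fit each estimator on the first n training instances and record heldout
    negative log likelihood, for each training-set size n in the hypothesis."""
    results = {}
    for n in train_sizes:
        subset = train_data[:n]
        results[n] = {name: est.fit(subset).nll(test_data)
                      for name, est in estimators.items()}
    return results

# Hypothetical usage, comparing the three estimators from the example hypothesis:
# evaluate_nll_vs_train_size(
#     {"ML": MLEstimator(), "MAP": MAPEstimator(), "PostPred": PosteriorPredictiveEstimator()},
#     train_data, test_data, train_sizes=[10, 50, 500])
```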

In Checkpoint 1, you will choose and describe properties of your dataset and model+learning method and use them as the foundation for generating your performance hypotheses. In Checkpoint 2, you will evaluate these hypotheses with respect to the specific performance property in your hypothesis and draw conclusions about whether or not your hypothesis was correct, and why. You will also propose an upgrade based on this conclusion. In Checkpoint 3, you will describe the properties of your upgrade, use them to build your next round of hypotheses, and demonstrate progress on implementing the upgrade. Finally, in the final report, you will test the performance property in your upgrade hypothesis and report and explain the results.

Due Dates

  • March 17th (Thursday): Checkpoint 0
    • Team Formation
    • Choice of Dataset
    • Initial choice of model
  • March 31st (Thursday): Checkpoint 1
    • Description of dataset properties through exploratory data analysis
    • Final choice of model + learning method
    • Description of model properties
    • Generation of hypotheses about model performance
  • April 14th (Thursday): Checkpoint 2
    • Application of model to dataset + any required scaling up of the CP code
    • Evaluation of hypotheses about model performance
    • Proposal of model/learning method upgrade
  • April 28th (Thursday): Checkpoint 3
    • Description of model/learning method upgrade properties
    • Generation of hypotheses about model/learning method upgrade performance
    • Demonstration of progress on implementation of model/learning method upgrade
  • May 12th (Thursday): Final report
    • Finish implementation of model/learning method upgrade
    • Application of model/learning method upgrade to dataset
    • Evaluation of model/learning method upgrade performance

In addition to these deliverables, your team will be expected to meet with the instructor at least once during office hours to discuss one of your hypotheses (either the initial hypothesis in Checkpoint 1 or the upgrade hypothesis in checkpoint 3).

Grading

Your project grade will be based on:

  • How well your checkpoint progress reports satisfy the checkpoint requirements
  • How well your final report satisfies the project requirements
  • Whether you satisfy the requirement to meet with the instructor to discuss one of your hypotheses

Since the final report builds on your checkpoint progress reports, improvements made between the checkpoint reports and the final report, and how well they are incorporated, will be reflected in your grade. We strongly encourage you to incorporate feedback from the instructors, as well as insights from later in the project, into the parts of the final report that build on the checkpoints.