Comp 136: Statistical Pattern Recognition
Department of Computer Science
Tufts University
Fall 2017

Last updated: 11/7

I encourage you to make up your own project, as this is often the most rewarding experience.
This page lists some potential project topics, of various difficulty levels, that you can pursue.

P1: Online and Batch Count Regression

Prediction of count variables can be handled within the generalized linear model framework covered in class. The starting point for this project is to implement the online/stochastic gradient descent algorithm and the batch IRLS algorithm and compare them. Please add some substance to this comparison: either an application focus, or an exploration of model selection for one or both methods. Model selection for logistic regression is discussed in the RVM section of the textbook. To the extent you can, explore model selection ideas for the online algorithm that can be applied during training.
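
As a hedged starting point, here is a minimal sketch of both algorithms for Poisson regression with a log link. The synthetic data, learning rate, and iteration counts are illustrative assumptions, not part of the project specification.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n=500, d=5):
        # synthetic count data: y ~ Poisson(exp(w_true . x))
        X = rng.normal(size=(n, d))
        w_true = 0.5 * rng.normal(size=d)
        return X, rng.poisson(np.exp(X @ w_true))

    def sgd(X, y, lr=0.01, epochs=20):
        # online / stochastic gradient ascent on the Poisson log-likelihood
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y)):
                mu = np.exp(X[i] @ w)
                w += lr * (y[i] - mu) * X[i]   # per-example gradient
        return w

    def irls(X, y, iters=20):
        # batch IRLS = Newton's method; for the log link the IRLS weights equal mu
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            mu = np.exp(X @ w)                 # Poisson mean (= variance)
            H = X.T @ (X * mu[:, None])        # Hessian of the negative log-likelihood
            g = X.T @ (y - mu)                 # gradient of the log-likelihood
            w += np.linalg.solve(H, g)         # Newton step
        return w

    X, y = make_data()
    print("SGD: ", sgd(X, y))
    print("IRLS:", irls(X, y))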

For comparisons you may collect some regression datasets with count-valued targets from the UCI repository.

P2: Gaussian Processes and Sparse Gaussian Processes

Read the GP sections of the text (some of this material will be covered later in the course) as well as the paper by Titsias and the paper by Sheth. Then write your own implementations of the GP and the sparse GP and evaluate their performance.
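
To fix ideas, here is a minimal sketch of full GP regression with an RBF kernel; the lengthscale, signal, and noise values are fixed assumptions rather than learned hyperparameters. The sparse method in Titsias's paper builds on these same predictive equations.

    import numpy as np

    def rbf(A, B, ls=1.0, sf2=1.0):
        # squared-exponential kernel between the rows of A and B
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sf2 * np.exp(-0.5 * d2 / ls ** 2)

    def gp_predict(Xtr, ytr, Xte, noise=0.1):
        K = rbf(Xtr, Xtr) + noise * np.eye(len(Xtr))   # training covariance C_N
        Ks = rbf(Xtr, Xte)
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, ytr))
        mean = Ks.T @ alpha                            # predictive mean
        v = np.linalg.solve(L, Ks)
        cov = rbf(Xte, Xte) - v.T @ v                  # predictive covariance (noise-free)
        return mean, cov

    rng = np.random.default_rng(1)
    Xtr = rng.uniform(-3, 3, size=(30, 1))
    ytr = np.sin(Xtr[:, 0]) + 0.1 * rng.normal(size=30)
    Xte = np.linspace(-3, 3, 5)[:, None]
    mean, cov = gp_predict(Xtr, ytr, Xte)
    print(mean, np.diag(cov))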

For comparisons you may collect some regression or classification datasets from the UCI repository.

P3: L1 vs. L2 Regularized Regression

L1 regularization is a variant of standard regularized regression (see page 145 of [B]) that is known to provide sparse solutions, in the sense that many of the coefficients w_i are zero.

Please read section 3.4 of Elements of Statistical Learning for background and some intuition on the LAR-Lasso algorithm. The algorithm is described in complete detail in sections 2 and 3 of the Least Angle Regression paper, esp. pages 413-414 and 417.

Implement LAR/Lasso and wrap it with model selection for lambda using cross validation. Compare this to your implementation of Bayesian linear regression with model selection from programming project 2 (but predicting with w_MAP).
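
The sketch below illustrates only the cross-validation wrapper for lambda. For brevity its inner solver is plain coordinate descent with soft-thresholding, not the LAR-Lasso algorithm the project asks for; you would swap in your LARS implementation. Fold count and the assumption of standardized features are illustrative choices.

    import numpy as np

    def soft_threshold(a, t):
        return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

    def lasso_cd(X, y, lam, iters=200):
        # coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1
        # (stand-in for LARS; assumes standardized, non-degenerate features)
        n, d = X.shape
        w = np.zeros(d)
        col_sq = (X ** 2).sum(0)
        for _ in range(iters):
            for j in range(d):
                r = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
                w[j] = soft_threshold(X[:, j] @ r, n * lam) / col_sq[j]
        return w

    def cv_lambda(X, y, lambdas, k=5, seed=0):
        # k-fold cross validation over a grid of lambda values
        n = len(y)
        folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
        errs = []
        for lam in lambdas:
            e = 0.0
            for f in folds:
                tr = np.setdiff1d(np.arange(n), f)
                w = lasso_cd(X[tr], y[tr], lam)
                e += ((y[f] - X[f] @ w) ** 2).sum()
            errs.append(e / n)
        return lambdas[int(np.argmin(errs))]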

Apply the two methods to several regression datasets to test their pros and cons. Compare running time, accuracy, stability, etc. across the datasets.

For datasets, please use the prostate cancer dataset, collect some regression datasets from the UCI repository, and make some artificial datasets with sparse and dense true w vectors.

P4: Topic Models

Read the papers describing the Gibbs sampling and variational solutions for LDA: Steyvers & Griffiths (2007), Probabilistic Topic Models, and Blei, Ng & Jordan (2003), Latent Dirichlet Allocation. Implement the collapsed Gibbs sampling algorithm and the variational algorithm for LDA, and compare their performance on some datasets. Alternatively, choose one of the two algorithms to implement, but add your own application where you collect data to analyze using topic models.
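
A minimal sketch of the collapsed Gibbs sampler follows, assuming documents arrive as lists of integer word ids; the number of topics K, the hyperparameters alpha and beta, and the iteration count are illustrative assumptions.

    import numpy as np

    def lda_gibbs(docs, V, K=10, alpha=0.1, beta=0.01, iters=200, seed=0):
        # docs: list of documents, each a list of integer word ids in [0, V)
        rng = np.random.default_rng(seed)
        ndk = np.zeros((len(docs), K))          # topic counts per document
        nkw = np.zeros((K, V))                  # word counts per topic
        nk = np.zeros(K)                        # total word count per topic
        z = [rng.integers(K, size=len(doc)) for doc in docs]
        for d, doc in enumerate(docs):          # fill count tables from the random init
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]                 # remove the current assignment
                    ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                    # full conditional: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                    p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                    k = rng.choice(K, p=p / p.sum())
                    z[d][i] = k
                    ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        return ndk, nkw                         # estimate theta and phi from these counts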

P5: Topic Models for Folklore Data

This project could be combined with technical material from the previous project. Duchas is a project that digitizes primary folklore material in Ireland. The idea for this project is to collect folktale text from this site and then apply topic models to analyze it. The goal is initial exploratory analysis: identifying types, sub-types, topics, etc. in this text.

P6: Collaborative Filtering

Read the paper by Lim and Teh and implement the variational algorithm for matrix factorization. Consider how matrix factorization is used for collaborative filtering, and use this to evaluate the performance of the algorithm on some existing datasets.
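
For orientation, here is a sketch of the simpler MAP / regularized matrix factorization baseline that the variational treatment builds on, trained by SGD on observed (user, item, rating) triples. This is not the Lim and Teh algorithm itself, and the latent dimension, regularization strength, and step size are assumptions.

    import numpy as np

    def mf_sgd(triples, n_users, n_items, k=10, lam=0.1, lr=0.01, epochs=30, seed=0):
        # triples: list of observed (user, item, rating); missing entries are simply absent
        rng = np.random.default_rng(seed)
        U = 0.1 * rng.normal(size=(n_users, k))    # user factors
        V = 0.1 * rng.normal(size=(n_items, k))    # item factors
        for _ in range(epochs):
            for u, i, r in triples:
                err = r - U[u] @ V[i]
                gu = err * V[i] - lam * U[u]       # gradient with Gaussian (L2) prior
                gv = err * U[u] - lam * V[i]
                U[u] += lr * gu
                V[i] += lr * gv
        return U, V                                # predict a held-out rating with U[u] @ V[i]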

P7: Online Sparse Gaussian Processes

Related to P2. Read the paper by Sheth as well as the paper by Bijl. Then implement an online sparse GP algorithm (the one by Bijl, or one from the cited papers) and evaluate its performance. Please consult Rishit Sheth or Roni Khardon for more ideas and details on this project.
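
As a heavily hedged warm-up, the sketch below shows only that, for a Gaussian likelihood with fixed inducing points and fixed hyperparameters, the sufficient statistics of the optimal q(u) in Titsias's bound are additive over data, so they can be accumulated one mini-batch at a time. The papers by Sheth and by Bijl do considerably more (e.g., adapting the approximation online); treat this as a baseline, not as their method.

    import numpy as np

    def rbf(A, B, ls=1.0, sf2=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sf2 * np.exp(-0.5 * d2 / ls ** 2)

    class OnlineSparseGP:
        def __init__(self, Z, noise=0.1):
            # Z: fixed inducing inputs; kernel hyperparameters held fixed throughout
            self.Z, self.noise = Z, noise
            self.Kmm = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
            self.Psi = np.zeros((len(Z), len(Z)))   # running sum of Kmx Kxm / noise
            self.c = np.zeros(len(Z))               # running sum of Kmx y / noise

        def update(self, Xb, yb):
            # absorb one mini-batch; the statistics are additive over data points
            Kmx = rbf(self.Z, Xb)
            self.Psi += Kmx @ Kmx.T / self.noise
            self.c += Kmx @ yb / self.noise

        def predict_mean(self, Xs):
            # mean of the optimal q(u), projected to the test inputs
            mu_u = self.Kmm @ np.linalg.solve(self.Kmm + self.Psi, self.c)
            return rbf(Xs, self.Z) @ np.linalg.solve(self.Kmm, mu_u)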