Project B: Classifying Sentiment from Text Reviews


Last modified: 2020-11-23 18:42

Status: RELEASED.

Due date: Mon. 11/23 at 11:59pm AoE (Anywhere on Earth). (Tue 11/24 at 07:59am in Boston)

Updates

  • 2020-10-27 7:45am ET : Released. Please get started.

Jump to:

  Background   Code   Datasets   Problem 1   Problem 2   Problem 3   Rubric

Turn-in links: see the What to Turn In section below for details.

Overview

This is a four-week project with lots of open-ended programming. Get started right away!

Team Formation

In this project, you can work in teams of 2 people, or (if you prefer) individually. Individual teams still need to complete all the parts below. We want to incentivize you to work in pairs.

If you need help finding teammates, please post to our "Finding a Partner for Project B" post on Piazza.

By the start of the second week (by end of day Mon 11/02), you should have identified your partner and signed up via the sign-up link provided.

Work to Complete

As a team, you will work on two semi-structured, yet open-ended problems, and then a completely open-ended problem. The 3 problems look at different representations of text for a common task.

  • Problem 1 looks at using bag-of-word feature representations
  • Problem 2 looks at word embedding feature representations, using "GloVe"
  • Problem 3 is an open-ended problem, where any feature representation is allowed

Throughout Problems 1, 2, and 3, you will practice the development cycle of an ML practitioner:

    • Propose a reasonable ML pipeline (feature extraction + classifier)
    • Evaluate it and analyze the results carefully
    • Revise the pipeline and repeat

For all 3 problems, we will maintain a leaderboard on Gradescope. You should periodically submit the predictions of your best model on the test set (we do not release the labels to you in advance).

What to Turn In

  1. Each team will prepare one PDF report covering all 3 problems below.

    • Keep the report short: no more than 7 pages.
    • The report will be manually graded according to our rubric.
    • You can use your favorite report-writing tool (Word or G Docs or LaTeX or ...).
    • The report should be human-readable. Do not include code. Do NOT just export a Jupyter notebook to PDF.
    • Mark each subproblem via the in-browser Gradescope annotation tool.

  2. Each team will prepare one ZIP file of test-set predictions as a leaderboard submission for each of Problem 1, Problem 2, and Problem 3. Each ZIP file will contain just one file:

    • yproba1_test.txt : plain text file
      • Each line contains the float probability that the corresponding test example should be classified as positive, given its features.
      • The file should be loadable into NumPy as a 1D array via np.loadtxt('yproba1_test.txt') (see the example snippet below).
      • Values will be thresholded at 0.5 to produce hard binary predicted labels (either 0 or 1).

  3. Each individual will turn in a reflection form (after completing the report).

    • Link: https://forms.gle/eKWE8fyvNTKQtS196
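For reference, here is a minimal sketch of writing and sanity-checking this prediction file (the yproba1_test array below is only a placeholder for your own model's predicted probabilities):

import numpy as np

# Placeholder: replace with the predicted probabilities from your chosen classifier
yproba1_test = np.random.rand(600)

# Write one float per line, then verify the file loads back as a 1D array of the expected length
np.savetxt('yproba1_test.txt', yproba1_test)
loaded = np.loadtxt('yproba1_test.txt')
assert loaded.ndim == 1 and loaded.size == 600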

Datasets, Starter Code and Code Restrictions

For all required data and code, see the projectB folder of the public assignments repo for this class:

https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/projectB

Our starter code repo provides a few scripts that help you load the data for each problem, but otherwise offers no additional code. This is meant to simulate the "real world", where you would have to build a text sentiment classifier from scratch using your machine learning skills.

For this assignment, you can use any Python package you like (sklearn, nltk, etc). You are welcome to consult the sklearn documentation website or other external web resources for snippets of code to guide your usage of different classifiers. However, you should understand every line of the code you use and not simply copy-paste without thinking carefully.

Remember to keep the course collaboration policy in mind: do your own work!

Background

We have given you a dataset of several thousand single-sentence reviews collected from three domains: imdb.com, amazon.com, yelp.com. Each review consists of a sentence and a binary label indicating the emotional sentiment of the sentence (1 for reviews expressing positive feelings; 0 for reviews expressing negative feelings). All the provided reviews in the training and test set were scraped from websites whose assumed audience is primarily English speakers, but of course may contain slang, misspellings, some foreign characters, and many other properties that make working with natural language data challenging (and fun!).

Your goal is to develop a binary classifier that can correctly identify the sentiment of a new sentence.

Here are some example positive sentences:

imdb          The writers were "smack on" and I think the best actors and actresses were a bonus to the show.These characters were so real.
imdb          The Songs Were The Best And The Muppets Were So Hilarious.  
yelp          Food was so gooodd.
yelp          I could eat their bruschetta all day it is devine.

Here are some example negative sentences:

amazon        It always cuts out and makes a beep beep beep sound then says signal failed.
amazon        the only VERY DISAPPOINTING thing was there was NO SPEAKERPHONE!!!!
yelp          It sure does beat the nachos at the movies but I would expect a little bit more coming from a restaurant.
yelp          I'm not sure how long we stood there but it was long enough for me to begin to feel awkwardly out of place.

Dataset acknowledgment

This dataset comes from research work by D. Kotzias, M. Denil, N. De Freitas, and P. Smyth described in the KDD 2015 paper 'From Group to Individual Labels using Deep Features'. We are grateful to these authors for making the dataset available.

Provided data

You are given the data in CSV file format, with 2400 input-output pairs in the training set, and 600 inputs in the test set.

Training set of 2400 examples

  • x_train.csv : input data
    • Column 1: 'website_name' : one of ['imdb', 'amazon', 'yelp']
    • Column 2: 'text' : string sentence which represents the raw review
  • y_train.csv : binary labels to predict

    • Column 1: 'is_positive_sentiment' : 1 = positive sentiment, 0 = negative

Test set of 600 examples

  • x_test.csv : input data

Suggested Way to Load Data into Python

We suggest loading the sentence data using the read_csv method in Pandas:

import pandas as pd

# Read the CSV of training inputs, then pull out the raw review text as a list of strings
x_train_df = pd.read_csv('x_train.csv')
tr_list_of_sentences = x_train_df['text'].values.tolist()
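
You can load the corresponding binary labels from y_train.csv in the same way (continuing the snippet above; the column name is given in the Provided data section):

y_train_df = pd.read_csv('y_train.csv')
y_train = y_train_df['is_positive_sentiment'].values  # 1D NumPy array of 0/1 labels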

You can see a short example working Python script here: https://github.com/tufts-ml-courses/comp135-20f-assignments/blob/master/projectB/load_train_data.py

We'll often refer to each review or sentence as a single "document". Our goal is to classify each document into positive or negative labels.

Preprocessing

We suggest that you remove all the punctuation and convert upper case to lower case for each example.
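
For example, here is one minimal way to do this suggested cleanup (a sketch only; you may prefer a different scheme, e.g. one that keeps contractions or certain punctuation intact):

import string

def clean_sentence(sentence):
    # Lowercase the text, then strip all punctuation characters
    sentence = sentence.lower()
    return sentence.translate(str.maketrans('', '', string.punctuation))

tr_list_of_clean_sentences = [clean_sentence(s) for s in tr_list_of_sentences]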

As discussed in class, there are many possible approaches to feature representation, the process of transforming any possible natural language document (often represented as an ordered list of words which can be of variable length) into a feature vector \(x_n\) of a standard length.

In this project, we will explore two approaches: bag-of-words vectors (explored in Problem 1) and embedding vectors (Problem 2). Later, you'll be allowed to try any feature representation approach you want (Problem 3).

Problem 1: Bag-of-Words Feature Representation

Background on Bag-of-Words Representations

As discussed in lecture, the "Bag-of-Words" (BoW) representation assumes a fixed, finite-size vocabulary of V possible words is known in advance, with a defined index order (e.g. the first word is "stegosaurus", the second word is "dinosaur", etc.).

Each document is represented as a count vector of length V, where entry at index v gives the number of times that the vocabulary word with index v appears in the document.

You have many design decisions to make when applying a BoW representation:

  • How big is your vocabulary?
  • Do you exclude rare words (e.g. appearing in less than 10 documents)?
  • Do you exclude common words (like 'the' or 'a', or appearing in more than 50% of documents)?
  • Do you only use single words ("unigrams")? Or should you consider some bigrams (e.g. 'New York' or 'not bad')?
  • Do you keep the count values, or only store present/absent binary values?
  • Do you use smart reweighting techniques like term-frequency/inverse-document-frequency?

The key constraint with BoW representations is that each input feature must easily map to one human-readable unigram, bigram, or n-gram in a finite vocabulary.

You should feel free to take advantage of the many tools that sklearn provides for BoW representations (e.g., CountVectorizer and TfidfVectorizer in the sklearn.feature_extraction.text module).
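
For example, one possible starting point is a pipeline like the sketch below (the vectorizer settings shown are arbitrary placeholders, not recommendations, and y_train is the label array loaded earlier):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative BoW pipeline: counts of unigrams and bigrams, then a linear classifier
bow_pipeline = Pipeline([
    ('bow', CountVectorizer(lowercase=True, min_df=2, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])
# bow_pipeline.fit(tr_list_of_clean_sentences, y_train)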

Goals and Tasks for Problem 1

For Problem 1, you will develop an effective BoW representation plus binary classifier pipeline, always aiming to produce the highest balanced accuracy on heldout data.

You should experiment with several possible ways of performing BoW preprocessing, as well as two possible classifiers. You should use best practices in hyperparameter selection to avoid overfitting and generalize well to new data. Within your hyperparameter selection, you should use cross-validation over multiple folds to assess the range of possible performance numbers that might be observed on new data.

Your report should contain the following sections:

1A : Bag-of-Words Design Decision Description

Well-written paragraph describing your chosen BoW feature representation pipeline, with sufficient detail that another student in this class could reproduce it. You are encouraged to use just plain English prose, but you might include a brief, well-written pseudocode block if you think it is helpful.

You should describe and justify all major decisions, such as:

  • how did you "clean" and "standardize" the data? (punctuation, upper vs. lower case, etc)
  • how did you determine the final vocabulary set? did you exclude words, and if so how?
  • what was your final vocabulary size (or ballpark size(s), if size varies across folds because it depends on the training set)?
  • did you use unigrams or bigrams?
  • did you use counts or binary values or something else?
  • how did you handle out-of-vocabulary words in the test set?

1B : Cross Validation and Hyperparameter Selection Design Description

Well-written paragraph describing how you will use cross-validation and hyperparameter selection to assess and refine each classifier pipeline you'll develop in 1C and 1D below.

You should describe and justify all major decisions, such as:

  • What performance metric will your search try to optimize on heldout data?
  • What hyperparameter search strategy will you use?
  • What is the source of your heldout data for performance estimates? (how many folds? how big is each fold? how do you split the folds?).
  • Given a selected hyperparameter configuration created using CV by training models across several folds, how will you then build a "final" model to apply on the test set?
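
As one concrete (but not required) illustration of these choices, here is a sketch that searches a placeholder hyperparameter grid with 5-fold cross-validation and balanced accuracy, reusing the bow_pipeline sketch from the background section above:

from sklearn.model_selection import GridSearchCV

# Placeholder grid over one vectorizer setting and one classifier setting
param_grid = {'bow__min_df': [1, 2, 5], 'clf__C': [0.01, 0.1, 1.0, 10.0]}
searcher = GridSearchCV(
    bow_pipeline, param_grid,
    scoring='balanced_accuracy', cv=5, return_train_score=True)
# searcher.fit(tr_list_of_clean_sentences, y_train)
# searcher.cv_results_ summarizes per-fold scores; with the default refit=True,
# searcher.best_estimator_ is retrained on the full training set as the "final" model.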

1C : Hyperparameter Selection Figure for Classifier #1 (should be a tree-based classifier)

Using your BoW preprocessing plus any tree-based classifier (decision tree or random forest or boosting classifier), your goal is to train a model that achieves the best performance on heldout data.

You should use at least 3-fold cross-validation to perform a hyperparameter search over at least 5 possible configurations to avoid overfitting. Please follow the hyperparameter selection rubric.

Your report should include a figure and paragraph summarizing this search.

1D : Hyperparameter Selection Figure for Classifier #2 (any classifier)

Using your BoW preprocessing plus any classifier of your choice (logistic regression, MLP, nearest neighbor, etc.), your goal is to train a model that achieves the best performance on heldout data.

You should use at least 3-fold cross-validation to perform a hyperparameter search over at least 5 possible configurations to avoid overfitting. Please follow the hyperparameter selection rubric.

Your report should include a figure and paragraph summarizing this search.

1E : Determination of the Best Classifier Pipeline

Summarize which classifier among the two in 1C and 1D above performs best on heldout data according to your experiments, and try to give some verbal explanation about why (more flexible decision boundaries? better at avoiding overfitting? better suited to these features?). If one is not clearly better than the other, try to explain why.

Other than heldout performance, is there some strong reason to prefer one classifier over the other (e.g. runtime cost or complexity of implementation)?

Make sure it is clear which one classifier you select for the next few steps, and what your experiments suggest is a reasonable range for its performance on heldout data.

1F : Analysis of Predictions for the Best Classifier

In a table or figure, show some representative examples of false positives and false negatives for your chosen best classifier. If possible, try to characterize what kinds of mistakes it is making.

To give specific examples, you could look at any of these questions:

  • does it do better on longer sentences or shorter sentences?
  • does it do better on a particular kind of review (amazon or imdb)?
  • does it do better on sentences without negation words ("not", "didn't", "shouldn't", etc.)?

Do you notice anything about these sentences that you could use to improve performance? (You can apply these ideas later in Problem 3).

1G : Report Performance on Test Set via Leaderboard

Apply your best BoW classifier from 1E to the test sentences in x_test.csv. Store your probabilistic predictions into a single-column plain-text file yproba1_test.txt (remember, we'll use a threshold of 0.5 to turn these into binary decisions when computing performance metrics). Upload this file to our bag-of-words leaderboard.
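
A minimal sketch of this step (assuming best_bow_model is your chosen, already-fitted pipeline from 1E, that any cleaning is applied inside the pipeline or repeated here, and that x_test_df was loaded with pd.read_csv('x_test.csv')):

import numpy as np

te_list_of_sentences = x_test_df['text'].values.tolist()
# The predict_proba column with index 1 holds the probability of the positive class (label 1)
yproba1_test = best_bow_model.predict_proba(te_list_of_sentences)[:, 1]
np.savetxt('yproba1_test.txt', yproba1_test)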

In your report, include a summary paragraph stating your ultimate test set performance, compare it to your previous estimates of heldout performance from cross-validation, and reflect on any differences.

Problem 2: Word Embedding Feature Representation

Background on Word Embedding Representations

The basic idea of word embeddings is that each possible vocabulary word, such as "the" or "good" or "stegosaurus", has a specific associated vector with a fixed size (e.g. 5 dimensions).

In the projectB starter code repo, we have made available a large file of pre-trained length-50 embedding vectors for almost 100,000 possible vocabulary words, produced by a specific word embedding method called "GloVe" (short for Global Vectors). See the projectB folder of the assignments repo for this file and the associated loading script.

Given a set of pretrained embedding vectors, we can represent a sentence by averaging or summing together the vectors of words in the sentence.

The key design decisions here are:

  • How do you aggregate vectors for each word to produce one vector for a sentence? Average? Sum? Something else?
  • What size of vector do you use?
  • Do you incorporate all words? Or can you ignore words like 'the' with no anticipated semantic content?
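
To make the aggregation step concrete, here is a minimal sketch that averages vectors, assuming word2vec is a Python dict mapping each GloVe vocabulary word to its length-50 NumPy vector (built from the provided embedding file) and that out-of-vocabulary words are simply skipped:

import numpy as np

def embed_sentence(sentence, word2vec, dim=50):
    # Average the embedding vectors of all in-vocabulary words in the sentence
    vecs = [word2vec[w] for w in sentence.split() if w in word2vec]
    if len(vecs) == 0:
        return np.zeros(dim)  # fallback for sentences with no known words
    return np.mean(vecs, axis=0)

# x_train_NF = np.vstack([embed_sentence(s, word2vec) for s in tr_list_of_clean_sentences])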

Not required, but if you are curious: You can read more about the GloVe method (and find other .zip files of higher-dimensional pretrained vectors) on the Stanford NLP GloVe project page: https://nlp.stanford.edu/projects/glove/

Goals and Tasks for Problem 2

For Problem 2, you will develop an effective word embedding representation plus binary classifier pipeline, always aiming to produce the highest balanced accuracy on heldout data.

You should experiment with possible ways of performing word-embedding preprocessing, as well as two possible classifiers. You should use best practices in hyperparameter selection to avoid overfitting and generalize well to new data. Within your hyperparameter selection, you should use cross-validation over multiple folds to assess the range of possible performance numbers that might be observed on new data.

Your report should contain the following sections:

2A : Word Embedding Preprocessing Description

Well-written paragraph describing your chosen word embedding pipeline in plain English, with sufficient detail that another student in this class could reproduce it.

You should describe and justify all major decisions, such as:

  • how did you "clean" and "standardize" the data? (punctuation, upper vs. lower case, etc)
  • how did you determine the final vocabulary set? did you exclude words, and if so how?
  • what is the size of your final vocabulary (roughly)?
  • how was each vocabulary word represented as an embedding vector?
  • how did you combine the embedding vectors for each word in a sentence to produce one vector representation for your sentence? how large is each sentence's feature vector?
  • how did you handle out of vocabulary words in the test set?

2B : Cross Validation and Hyperparameter Selection Design Description

Describe what (if anything) has changed from your earlier cross-validation and hyperparameter selection strategy in 1B.

2C : Hyperparameter Selection Figure and Paragraph for Classifier #1 (your choice)

Using your word embedding preprocessing plus any classifier of your choice, your goal is to train a model that achieves the best performance on heldout data.

You should use at least 3-fold cross-validation to perform a hyperparameter search over at least 5 possible configurations to avoid overfitting. Please follow the hyperparameter selection rubric.

Your report should include a figure and paragraph summarizing this search.

2D : Hyperparameter Selection Figure and Paragraph for Classifier #2 (your choice)

Using your word embedding preprocessing plus any classifier of your choice (distinct from 2C above), your goal is to train a model that achieves the best performance on heldout data.

You should use at least 3-fold cross-validation to perform a hyperparameter search over at least 5 possible configurations to avoid overfitting. Please follow the hyperparameter selection rubric.

Your report should include a figure and paragraph summarizing this search.

2E : Determination of the Best Classifier Pipeline

Summarize which classifier among the two in 2C and 2D above performs best on heldout data according to your experiments, and try to give some verbal explanation about why (more flexible decision boundaries? better at avoiding overfitting? better suited to these features?). If one is not clearly better than the other, try to explain why.

Other than heldout performance, is there some strong reason to prefer one classifier over the other (e.g. runtime cost or complexity of implementation or guarantees about solution quality)?

Make sure it is clear which one classifier you select for the next few steps, and what your experiments suggest is a reasonable range for its performance on heldout data.

2F : Analysis of Predictions for the Best Classifier

In a table or figure, show some representative examples of false positives and false negatives for your chosen best classifier. If possible, try to characterize what kinds of mistakes it is making.

To give specific examples, you could look at any of these questions:

  • does it do better on longer sentences or shorter sentences?
  • does it do better on a particular kind of review (amazon or imdb)?
  • does it do better on sentences without negation words ("not", "didn't", "shouldn't", etc.)?

Do you notice anything about these sentences that you could use to improve performance? (You can apply these ideas later in Problem 3).

Best practice: It is ideal if the set of examples here in 2F overlaps with the examples in 1F.

2G : Report Performance on Test Set via Leaderboard

Apply your best word embedding classifier from 2E to the test sentences in x_test.csv. Store your probabilistic predictions into a single-column plain-text file yproba1_test.txt (remember, we'll use a threshold of 0.5 to turn these into binary decisions when computing performance metrics). Upload this file to our Word Embeddings leaderboard.

In your report, include a summary paragraph stating your ultimate test set performance, compare it to your previous estimates of heldout performance from cross-validation, and reflect on any differences. Discuss if your performance on word embeddings in Problem 2 is better or worse than in Problem 1, and why you think that might be.

Problem 3: Open-ended challenge

Goals and Tasks for Problem 3

For this problem, your goal is to obtain the best balanced accuracy on heldout data, using any feature representation you want, any classifier you want, and any hyperparameter selection procedure you want.

Here are some concrete examples of methods/ideas you could try:

  • Can you combine BoW and GloVe representations to improve performance? Maybe just concatenate (stack) the two feature vectors together? (See the sketch after this list.)
  • Would it help to build features for the first half and second half of sentences?
  • Would it help to build separate classifiers for amazon, imdb, and yelp reviews?
  • Can you try a kernel specifically designed for comparing BoW histograms, such as the Histogram Intersection kernel (see slide 18)?
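
For the concatenation idea above, here is a minimal sketch (assuming x_train_bow is the sparse count matrix from your BoW vectorizer and x_train_glove is the dense array of averaged GloVe features from Problem 2):

from scipy.sparse import csr_matrix, hstack

# Stack BoW counts and GloVe features side by side into one combined feature matrix
x_train_combined = hstack([x_train_bow, csr_matrix(x_train_glove)]).tocsr()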

3A : Methods Description

Your report should include a concise, well-justified summary of what you've tried and why, that covers:

  • how you obtain a feature representation
  • which classifier you selected and how you train it
  • which model complexity hyperparameters you selected and how you selected them

We just want to see one or two well-written paragraphs, like the ones in 2A (for feature preprocessing) and 2D (for classifier training and hyperparameter selection).

You can include figures or tables if you like, but you do not need to.

3B : Report Performance on Test Set via Leaderboard

Apply your best pipeline from 3A to the test sentences in x_test.csv. Store your probabilistic predictions into a single-column plain-text file yproba1_test.txt (remember, we'll use a threshold of 0.5 to turn these into binary decisions when computing performance metrics). Upload this file to our Open-Ended leaderboard.

In your report, include a summary paragraph stating your ultimate test set performance. Discuss if your performance is better or worse than previous problems, and why you think that might be.

Rubric for Overall Performance

We'll compute your final score for this project as a weighted combination:

  • 85% : your report performance, using the rubric below
  • 10% : your leaderboard submissions, using the rubric below
  • 5% : completion of your reflection on the project

Rubric for Evaluating Leaderboard Submissions

You'll submit 3 sets of predictions to our leaderboard (one each for Problems 1, 2, and 3).

For each one, we'll give you a score between 0.0 and 1.0 where:

  • 90% of the points are awarded for achieving a "reasonable" score (e.g. from a standard pipeline trained using good practices)
  • 10% of the points are awarded if you are within tolerance of the top 5 submissions in this class (partial credit possible, linearly interpolating between the "reasonable" score and the "top" score).

Rubric for Evaluating PDF Report

Earning full credit on this assignment requires a well-thought-out report that demonstrates that you made reasonable design decisions for feature preprocessing and classifiers and followed machine learning best practices throughout, especially for hyperparameter selection. Achieving top scores on the leaderboard is far less important than understanding why some methods and choices outperform others.

Points will be allocated across the various parts as follows:

  • 30%: Feature representation design decisions in 1A and 2A
  • 15%: Cross validation and hyperparameter selection design decisions in 1B and 2B
  • 25%: Training and selection for individual classifiers in 1C, 1D, 2C, 2D
  • 10%: Analysis of classifier mistakes in 1F and 2F
  • 10%: Reflection on test-set vs. heldout performance in 1G and 2G
  • 10%: Open-ended pipeline analysis in 3A and 3B

Hyperparameter Selection Rubric

Figure Requirements:

Your figure should show heldout performance across a range of at least 5 candidate values of a hyperparameter controlling model complexity, covering both underfitting and overfitting. That is, if at all possible, at least one candidate value should show clear underfitting and at least one should show clear overfitting.

Your figure should:

  • Show both training set and validation set performance trends in the same plot
  • Show the typical performance at each hyperparameter via the average over multiple CV folds
  • Communicate uncertainty around this typical value, by exposing the variation across the multiple CV folds

    • A simple way to show uncertainty is to show the empirical range across all folds, or the empirical standard deviation
    • A better way is to show a separate dot for the performance of each individual fold (so 5 dots for 5 folds).

The big idea here is that your figure should help the reader understand if one hyperparameter is definitely better than another (e.g. performance is better on most or all folds) or if there isn't much difference.
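
One possible (not required) way to lay out such a figure with matplotlib, assuming you have already collected per-fold training and validation balanced accuracies for each candidate value of a complexity hyperparameter (here a placeholder grid over C):

import matplotlib.pyplot as plt
import numpy as np

# Placeholders: replace with your real scores, each with shape (n_candidates, n_folds)
C_grid = np.logspace(-2, 2, 5)
tr_scores = np.zeros((5, 5))
va_scores = np.zeros((5, 5))

for ii, C in enumerate(C_grid):
    # One dot per fold exposes the variation across folds
    plt.plot([C] * tr_scores.shape[1], tr_scores[ii], 'b.', alpha=0.4)
    plt.plot([C] * va_scores.shape[1], va_scores[ii], 'r.', alpha=0.4)
plt.plot(C_grid, tr_scores.mean(axis=1), 'b-', label='train (mean over folds)')
plt.plot(C_grid, va_scores.mean(axis=1), 'r-', label='validation (mean over folds)')
plt.xscale('log')
plt.xlabel('hyperparameter value (e.g. C)')
plt.ylabel('balanced accuracy')
plt.legend()
plt.show()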

Paragraph requirements:

In each paragraph where you describe training a classifier and selecting its hyperparameters to avoid overfitting, you should include:

  • 1-2 sentences: describe the potential advantages of the chosen classifier for the task at hand.
  • 1-3 sentences: describe any necessary details about the training process (e.g. are there convergence issues? step-size selection issues? need for early stopping?)
  • 1-2 sentences: describe which model complexity hyperparameter(s) were explored, how these values control model complexity, and why the chosen candidate value grids (or random distributions) are reasonable for exploring the transition between underfitting and overfitting and finding the "sweet spot" in between.
  • 1-2 sentences: describe the results of the experiment: which hyperparameter is preferred? is the evidence decisive, or uncertain?