Project 2: Sentiment Analysis from Text Reviews


Last modified: 2019-04-07 18:15

Status: Released.

Due date: Wed. Apr 10 at 11:59PM EST.

Update 2019-04-07:

  • Clarified that using any external code package is allowed, provided you abide by the collaboration policy (write your code without line-by-line help from others in class or online, and be able to describe what each line of your code is doing to an instructor or TA when asked)
  • Fixed links for where to turn in files on Gradescope
  • Clarified instructions for Problem 1f
  • Added code to help you read in GloVe embeddings for Problem 2 load_word_embeddings.py
  • Added ideas for possible methods to try for Problem 3
  • Added clarification about how to visualize uncertainty across folds in model selection figures

Jump to:

  Background   Code   Datasets   Problem 1   Problem 2   Problem 3   Rubric for Evaluating PDF Report

Turn-in links:

Files to Turn In:

PDF report:

  • Human-readable report (NOT a notebook export), answering the prompts for Problem 1, Problem 2, and Problem 3.
  • This document will be manually graded according to our rubric
  • Please: when submitting within Gradescope via the normal submission process, annotate for each subproblem exactly which page(s) are relevant. This will save your graders much time!

ZIP file of source code should contain:

  • Any .py or .ipynb files
  • COLLABORATORS.txt : a plain text file [example], containing
    • Your full name
    • An estimate of the hours you spent on each of Problem 1, Problem 2, and Problem 3
    • Names of any people you talked to for help (TAs, students, etc.). If none, write "No external help".
    • A brief description of what content you sought help about (1-3 sentences)

ZIP file of test-set predictions should contain one file:

  • yproba1_test.txt : plain text file
    • Each line contains the float probability that the corresponding test example should be classified as positive, given its features (see the sketch below for one way to produce this file)
    • Should be loadable into NumPy as a 1D array via this snippet: np.loadtxt('yproba1_test.txt')
    • Will be thresholded at 0.5 to produce hard binary predicted labels (either 0 or 1)
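
For reference, here is a minimal sketch of one way to produce this file, assuming an sklearn-style classifier; clf and x_test_features are placeholder names for your own fitted model and test feature matrix:

import numpy as np

# Sketch only: `clf` must expose an sklearn-style predict_proba method,
# and `x_test_features` is whatever feature matrix you built from x_test.csv.
yproba1_test = clf.predict_proba(x_test_features)[:, 1]   # P(label=1) for each test example
np.savetxt('yproba1_test.txt', yproba1_test, fmt='%.6f')  # one float per line, loadable via np.loadtxt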

Datasets

See the project2 folder of the public assignments repo for this class:

https://github.com/tufts-ml-courses/comp135-19s-assignments/tree/master/project2

Starter Code and Code Restrictions

There is no starter code for this assignment. This is meant to simulate the lack of code you'd have in the "real world", trying to build a text sentiment classifier from scratch using your machine learning skills.

For this assignment, you can use any Python package you like (sklearn, nltk, etc). You are welcome to consult the sklearn documentation website or other external web resources for snippets of code to guide your usage of different classifiers. However, you should understand every line of the code you use and not simply copy-paste without thinking carefully. Remember to keep the course collaboration policy in mind: do your own work!

Background

We have given you a dataset of several thousand single-sentence reviews collected from three domains: imdb.com, amazon.com, yelp.com. Each review consists of a sentence and a binary label indicating the sentiment (1 for positive and 0 for negative) of the sentence. All the provided reviews in the training and test set were scraped from websites whose assumed audience is primarily English speakers, but of course may contain slang, misspellings, some foreign characters, and many other properties that make working with natural language data challenging (and fun!).

Your goal is to develop a binary classifier that can correctly identify the sentiment of a new sentence.

Here are some example positive sentences:

imdb          The writers were "smack on" and I think the best actors and actresses were a bonus to the show.These characters were so real.
imdb          The Songs Were The Best And The Muppets Were So Hilarious.  
yelp          Food was so gooodd.
yelp          I could eat their bruschetta all day it is devine.

Here are some example negative sentences:

amazon        It always cuts out and makes a beep beep beep sound then says signal failed.
amazon        the only VERY DISAPPOINTING thing was there was NO SPEAKERPHONE!!!!
yelp          It sure does beat the nachos at the movies but I would expect a little bit more coming from a restaurant.
yelp          I'm not sure how long we stood there but it was long enough for me to begin to feel awkwardly out of place.

Dataset acknowledgment

This dataset comes from research work by D. Kotzias, M. Denil, N. De Freitas, and P. Smyth described in the KDD 2015 paper 'From Group to Individual Labels using Deep Features'. We are grateful to these authors for making the dataset available.

Provided data

You are given the data in CSV file format, with 2400 input-output pairs in the training set and 600 inputs in the test set.

Training set of 2400 examples

  • x_train.csv : input data
    • Column 1: 'website_name' : one of ['imdb', 'amazon', 'yelp']
    • Column 2: 'text' : string sentence which represents the raw review
  • y_train.csv : binary labels to predict

    • Column 1: 'is_positive_sentiment' : 1 = positive sentiment, 0 = negative

Test set of 600 examples

  • x_test.csv : input data

Suggested Way to Load Data into Python

We suggest loading the sentence data using the read_csv method in Pandas:

import pandas as pd

x_train_df = pd.read_csv('x_train.csv')
tr_list_of_sentences = x_train_df['text'].values.tolist()

You can see a short example working Python script here: https://github.com/tufts-ml-courses/comp135-19s-assignments/blob/master/project2/load_train_data.py

We'll often refer to each review or sentence as a single "document". Our goal is to classify each document into positive or negative labels.

Preprocessing

We suggest that you remove all punctuation and numeric values, and convert upper case to lower case, for each example.
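
As one possible (not required) way to do this in Python, assuming tr_list_of_sentences was loaded as in the snippet above:

import re

def clean_text(raw_text):
    # Lowercase, then replace anything that is not a letter or whitespace
    # (punctuation, digits, etc.) with a space, and collapse repeated spaces.
    text = raw_text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    return ' '.join(text.split())

tr_list_of_clean_sentences = [clean_text(s) for s in tr_list_of_sentences]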

As discussed in class, there are many possible approaches to feature representation, the process of transforming any possible natural language document (often represented as an ordered list of words which can be of variable length) into a feature vector \(x_n\) of a standard length.

In this project, we will explore two approaches: bag-of-words vectors (explored in Problem 1) and embedding vectors (Problem 2). Later, you'll be allowed to try any feature representation approach you want (Problem 3).

Problem 1: Bag-of-Words Feature Representation

As discussed in lecture, the "Bag-of-Words" (BoW) representation assumes a fixed, finite-size vocabulary of V possible words is known in advance, with a defined index order (e.g. the first word is "stegosaurus", the second word is "dinosaur", etc.).

Each document is represented as a count vector of length V, where entry at index v gives the number of times that the vocabulary word with index v appears in the document.
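
As a tiny illustration (vocabulary and sentence invented here purely for demonstration):

# Toy vocabulary with a fixed index order
vocab = ['bad', 'good', 'movie', 'not', 'the']
doc = "the movie was not good not good at all"
count_vec = [doc.split().count(w) for w in vocab]
# count_vec is [0, 2, 1, 2, 1]: 'good' and 'not' each appear twice, etc.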

You have many design decisions to make when applying a BoW representation:

  • How big is your vocabulary?
  • Do you exclude rare words (e.g. those appearing in fewer than 10 documents)?
  • Do you exclude common words (like 'the' or 'a', or appearing in more than 50% of documents)?
  • Do you only use single words ("unigrams")? Or should you consider some bigrams (e.g. 'New York' or 'not bad')?
  • Do you keep the count values, or only store present/absent binary values?
  • Do you use smart reweighting techniques like term-frequency/inverse-document-frequency?

The key constraint with BoW representations is that each input feature must easily map to a human-readable unigram, bigram, or n-gram in a finite vocabulary.

For Problem 1, you will develop an effective BoW representation and evaluate it with 3 classifiers (one must be Naive Bayes; the other two are your choice).

You should feel free to take advantage of the many tools that sklearn provides related to BoW representations, such as sklearn.feature_extraction.text.CountVectorizer and TfidfVectorizer.
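
For instance, here is a sketch of one possible CountVectorizer configuration touching several of the design decisions above (the particular values are illustrative, not recommendations), assuming tr_list_of_clean_sentences from the preprocessing step:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=10,            # exclude terms appearing in fewer than 10 documents
    max_df=0.5,           # exclude terms appearing in more than 50% of documents
    binary=False,         # keep raw counts rather than present/absent flags
)
x_train_bow = vectorizer.fit_transform(tr_list_of_clean_sentences)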

Your report should contain the following sections:

  • 1(a) : Bag-of-Words Design Decision Description

Well-written paragraph describing your chosen BoW pipeline in plain English, with sufficient detail that another student in this class could reproduce it. You should justify all major decisions.

  • 1(b) : Model Selection Figure for Classifier A (Naive Bayes)

Follow the model selection figure rubric

  • 1(c) : Model Selection Figure for Classifier B (your choice)

Follow the model selection figure rubric

  • 1(d) : Model Selection Figure for Classifier C (your choice)

Follow the model selection figure rubric

  • 1(e) : Analysis of Best Classifier

Summarize which classifier among those tested in 1(b), 1(c), and 1(d) performs best, and give some verbal explanation about why (more flexible decision boundaries? better at avoiding overfitting?).

Look at this best classifier's predictions and try to characterize what kinds of mistakes it is making. What are example sentences that this classifier is consistently wrong about? Do you notice anything about these sentences that you could use to improve performance? (To be clear, save these ideas to perhaps implement in Problem 3; here, just note the problems you observe.)

One advantage of BoW representations is that each entry in the feature vector corresponds to a single human-readable unigram/bigram/n-gram, and thus is easy to "interpret" by a human.

Provide one figure analyzing which features are most important to the best performing classifier. For example, with logistic regression, you can indicate which unigrams/bigrams have largest positive weight and largest negative weight. For random forests and other blackbox classifiers, you can look at feature importance metrics.
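
For example, here is a minimal sketch for a logistic regression classifier, assuming best_clf is your fitted model and vectorizer is your fitted CountVectorizer (both names are placeholders for your own objects):

import numpy as np

vocab = np.asarray(vectorizer.get_feature_names_out())  # get_feature_names() in older sklearn versions
weights = best_clf.coef_[0]
order = np.argsort(weights)
print('most negative terms:', vocab[order[:10]])
print('most positive terms:', vocab[order[-10:]])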

  • 1(f) : Report Performance on Leaderboard Task

Apply your best BoW classifier from 1(e) to the test sentences in x_test.csv. Store your probabilistic predictions into a single-column plain-text file yproba1_test.txt (remember, we'll use a threshold of 0.5 to turn these into binary decisions when computing error rate). Upload this file to our 1f bag-of-words leaderboard.

In your report, include a summary paragraph stating your heldout performance, comparing it to your previous cross-validation estimates, and reflecting on any differences.

Problem 2: Word Embedding Feature Representation

The basic idea of word embeddings is that each possible vocabulary word, such as "the" or "good" or "stegosaurus", has a specific associated vector with a fixed size (e.g. 50 dimensions or 1000 dimensions).

For some background on word embeddings, you might find these articles helpful:

In the project2 starter code repo, we have made available a large file of pre-trained length-50 embedding vectors for 400,000 possible vocabulary words, using a specific word embedding method called "GloVe" (short for Global Vectors). You can find everything you need in the pretrained_embeddings folder in the starter code repo, including a large .zip file of the vectors and example Python code to load the vectors in.

You can read more about the GloVe method (and find other .zip files of higher-dimensional pretrained vectors) on the GloVe project website: https://nlp.stanford.edu/projects/glove/

Given a set of pretrained embedding vectors, we can represent a sentence by averaging or summing together the vectors of words in the sentence.

The key design decisions here are:

  • How do you aggregate word vectors to produce one vector for a sentence? Average? Sum? Something else? (A minimal sketch of averaging follows below.)
  • What size of vector do you use?
  • Do you incorporate all words, or can you ignore words like 'the' with no anticipated semantic content?
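
For instance, here is a minimal sketch of the averaging approach, assuming you have loaded the pretrained vectors into a Python dict word2vec mapping each vocabulary word to a length-50 NumPy array (e.g. using the provided load_word_embeddings.py or your own loader), and that tr_list_of_clean_sentences is your preprocessed training text:

import numpy as np

def embed_sentence(sentence, word2vec, dim=50):
    # Average the vectors of the words we have embeddings for;
    # words missing from the GloVe vocabulary are simply skipped.
    vecs = [word2vec[w] for w in sentence.lower().split() if w in word2vec]
    if len(vecs) == 0:
        return np.zeros(dim)      # fallback for sentences with no known words
    return np.mean(vecs, axis=0)  # could also try np.sum or a weighted average

x_train_embed = np.vstack(
    [embed_sentence(s, word2vec) for s in tr_list_of_clean_sentences])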

Your report should contain the following sections:

  • 2(a) : Word Embedding Preprocessing Description

Well-written paragraph describing your chosen word embedding pipeline in plain English, with sufficient detail that another student in this class could reproduce it. You should justify all major decisions, perhaps using a combination of common-sense reasoning and empirical results.

  • 2(b) : Model Selection Figure for Classifier A (your choice)

Follow the model selection figure rubric

  • 2(c) : Model Selection Figure for Classifier B (your choice)

Follow the model selection figure rubric

  • 2(d) : Model Selection Figure for Classifier C (your choice)

Follow the model selection figure rubric

  • 2(e) : Analysis of Best Classifier

Summarize which classifier among those tested in 2(b), 2(c), and 2(d) performs best, and give some verbal explanation about why this might be the case (more flexible decision boundaries? better at avoiding overfitting?).

Look at this best classifier's predictions and try to characterize what kinds of mistakes it is making. What are example sentences that this classifier is consistently wrong about? Do you notice anything about these sentences that you could use to improve performance?

To give specific examples, you could look at any of these questions:

  • Does it do better on longer sentences or shorter sentences?
  • Does it do better on amazon reviews or imdb reviews? (See the sketch below.)
  • Does it do better on sentences without negation words ("not", "didn't", "shouldn't", etc.)?
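
For instance, here is a rough sketch of comparing error rates by source website on a heldout split, assuming yhat_valid, y_valid, and website_valid are NumPy arrays holding your hard predictions, the true labels, and the corresponding 'website_name' values (all placeholder names for your own variables):

import numpy as np

for site in ['amazon', 'imdb', 'yelp']:
    mask = (website_valid == site)
    err = np.mean(yhat_valid[mask] != y_valid[mask])
    print('%s heldout error rate: %.3f' % (site, err))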

  • 2(f) : Report Performance on Leaderboard Task

Apply your best GloVe-based classifier from 2(e) to the test sentences in x_test.csv. Store your probabilistic predictions into a single-column plain-text file yproba1_test.txt, and upload this file to the GloVe classifier leaderboard.

In your report, include a summary paragraph stating your heldout performance, comparing it to your previous cross-validation estimates, and reflecting on any differences.

Problem 3: Open-ended challenge

For this problem, you can use any feature representation you want and any classifier you want. You just need to describe what you've done and why, produce probability predictions on the test dataset, and submit to the leaderboard. This is your chance to explore some new methods and really dig in to an exciting problem.

Update: Here are some concrete examples of methods/ideas you could try:

  • Can you combine BoW and GloVe representations to improve performance? Maybe just concatenate (stack) the two feature vectors together (see the sketch below)?
  • Can you try a kernel specifically designed for comparing BoW histograms, such as the Histogram Intersection kernel?
  • Would it help to build features for the first half and second half of sentences?
  • Would it help to build separate classifiers for amazon, imdb, and yelp reviews?
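
As one example, here is a minimal sketch of the concatenation idea, assuming x_train_bow is a sparse BoW matrix and x_train_embed is a dense array of averaged GloVe vectors (built as in the earlier sketches):

from scipy.sparse import csr_matrix, hstack

# Stack the two feature blocks side by side into one wider feature matrix.
x_train_combo = hstack([x_train_bow, csr_matrix(x_train_embed)]).tocsr()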

Your report should include a concise, well-justified and reproducible summary of what you've tried, with any needed figures or tables to explain the approach and your conclusions.

Rubric for Evaluating PDF Report

Earning full credit on this assignment requires a well-thought-out report that demonstrates you made reasonable design decisions for feature preprocessing and classifiers and followed machine learning best practices throughout, especially for hyperparameter optimization. Achieving top scores on the leaderboard is far less important than understanding why some methods and choices outperform others.

Points will be allocated across the various parts as follows:

  • 20%: Feature representation design decisions in Problem 1a and 2a
  • 48%: Model selection analysis for individual classifiers in Problems 1b, 1c, 1d, 2b, 2c, 2d
  • 20%: Comparison of classifiers in Problems 1e and 2e
  • 4%: Reflection on heldout performance in Problem 1f and 2f
  • 8%: Problem 3

Model Selection Figure Rubric

Figure Requirements:

  • Show performance across a range of at least 5 possible hyperparameter values that control model complexity and cover both underfitting and overfitting
  • At each hyperparameter value:
    • Show both training error and validation error estimates from multiple folds of cross validation
    • Communicate uncertainty in estimated error from multiple folds, not just the average.

Update: To show uncertainty, you could show a standard-deviation interval, or even just plot the direct performance metrics on each of the 5 or 10 folds (or whatever CV scheme you've decided to try). The big idea is to help the reader understand whether one hyperparameter value is definitely better than another (e.g. performance is better on all folds) or whether any difference in the mean isn't that significant.
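
As a concrete (purely illustrative) sketch of collecting per-fold training and validation errors with sklearn, assuming a feature matrix x_train_feats and label array y_train are already built, and using logistic regression's C as the complexity-controlling hyperparameter:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

C_grid = np.logspace(-3, 3, 7)
for C in C_grid:
    results = cross_validate(
        LogisticRegression(C=C, max_iter=1000),
        x_train_feats, y_train, cv=5, return_train_score=True)
    tr_err = 1.0 - results['train_score']   # error rate on each fold's training split
    va_err = 1.0 - results['test_score']    # error rate on each fold's validation split
    print('C=%8.3g  train: %s  valid: %s' % (
        C, np.round(tr_err, 3), np.round(va_err, 3)))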

Paragraph requirements:

  • 1-2 sentences: describe the potential advantages of the chosen classifier.
  • 1-2 sentences: describe any necessary details about the training process (e.g. are there convergence issues? step-size selection issues?)
  • 1-2 sentences: describe which hyperparameter(s) were explored, how these values control model complexity, and why the chosen values are reasonable
  • 1-2 sentences: describe cross validation experimental procedure so it could be reproduced
  • 1-2 sentences: describe the results of the experiment: which hyperparameter is preferred? is the evidence decisive, or uncertain?