HW4: Trees and Random Forests for Bag of Words


Last modified: 2020-11-11 10:26

Status: RELEASED.

Due date: Wed. Nov. 11 at 11:59PM AoE (Anywhere on Earth) (Thu 11/12 at 07:59am in Boston)

Overview

In this HW, you'll complete the following:

  • Complete Problem 0's coding tasks (worth 50% of your grade)
    • You'll submit a ZIP of your code to the Gradescope link below.
    • The goal of these problems is to:
      • Demonstrate understanding of how to implement a decision tree
  • Complete Problems 0-3 analysis and implementation tasks and write a report (worth 50% of your grade)

    • You'll submit this PDF report to the Gradescope link below.
    • The goal of these problems is to:
      • Demonstrate understanding of how to use decision trees and random forests effectively on real data.
      • Demonstrate understanding of how to "interpret" the learned structure of these models.

Much of your analysis will use library code from sklearn whose functionality is similar to what you implement yourself in Problem 0.

Files to Turn In:

PDF report:

  • Prepare a short PDF report (no more than 4 pages).
  • This document will be manually graded.
  • You can use your favorite report-writing tool (Word, Google Docs, LaTeX, ...).
  • Should be human-readable. Do not include code. Do NOT just export a Jupyter notebook to PDF.
  • Should have each subproblem marked via the in-browser Gradescope annotation tool.

ZIP file of source code should contain:

  • tree_utils.py (will be autograded)
  • train_tree.py (will be autograded)
  • select_best_binary_split.py (will be autograded)
  • hw4.ipynb (just for completeness, will not be autograded but will be manually assessed if necessary.)

Evaluation Rubric:

See the PDF submission portal on Gradescope for the point values of each problem. Generally, tasks with more coding/effort will earn more potential points.

Jump to: Background   Code Tasks   Starter Code   Dataset   Problem 1   Problem 2   Problem 3

Starter Code

See the hw4 folder of the public assignments repo for this class:

https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/hw4

This starter code includes a notebook to help you get started on your analysis, plus a helper file to visualize the internal logic of trees produced by sklearn.

Background

To complete this HW, you'll need some knowledge from the following sessions of class:

  • Logistic Regression (day09)
  • Hyperparameter Search (day13)
  • Decision Trees (day14)
  • Random Forests (day15)

Problem 0: Code Implementation of Decision Tree Regression

In this problem, you'll finish an implementation of Decision Tree Regression, which at the end will be a drop-in replacement for sklearn's DecisionTreeRegressor in terms of underlying functionality.

Notably, we've written our autograder such that each part is assessed independently. If you don't get one part correct, you can still get the other parts correct.

Please look at the starter code README to get organized.

Code Task A (20% of code grade): Implement predict for a LeafNode

See the LeafNode class within starter code file: tree_utils.py.

Code Task B (20% of code grade): Implement predict for an InternalDecisionNode

See the InternalDecisionNode class within starter code file: tree_utils.py.

Code Task C (40% of code grade): Implement select_best_binary_split

See the starter code file: select_best_binary_split.py.

Code Task D (20% of code grade): Implement train_tree

See the starter code file: train_tree.py.

Problem 0 Analysis (40% of report grade)

Short Answer 0a in Report

Consider a decision tree for regression that has already been trained on \(N\) examples with \(F\) features. The depth of the tree is \(D\).

What is the runtime complexity of making a prediction for a single test feature vector \(x \in \mathbb{R}^F\)?

Express your answer in terms of \(N\), \(F\), and \(D\). You should provide a clear statement like \(O(L^2)\) or "quadratic as a function of variable L", as well as a brief explanation.

Short Answer 0b in Report

Your colleague suggests that, given any dataset of \(N\) examples, training a regression tree can be done in runtime complexity \(O(N)\) (i.e. the time cost scales linearly with the number of training examples \(N\)), and thus is "just as scalable" as logistic regression.

Is your colleague correct? Provide a brief explanation. You should focus on supporting or refuting the statement above.

Short Answer 0c in Report

Suppose you fix your max_depth hyperparameter to \(N^2\) and set all other hyperparameters so the tree is as flexible as possible. You then train a decision tree for regression using our greedy training procedure on a training set of \(N\) examples. What can you say in general about the worst-case depth of the tree? Will the worst-case depth as a function of \(N\) grow like \(O(\sqrt{N})\) or \(O(\log N)\) or \(O(N)\) or \(O(N^2)\) or something else?

Dataset: Bag-of-words representations of product reviews on Amazon

Dataset credit: M. Dredze, J. Blitzer, and Fernando Pereira. https://www.cs.jhu.edu/~mdredze/datasets/sentiment/

We consider a classification task where each example is one plain-text online review for a consumer product (either a book, a movie, an electronics item, or a kitchen item).

Here's an example book review with a positive sentiment:

This all-Spanish handbook for parents with new babies will prove essential for any concerned about a child's health.

Here's one with a negative sentiment:

It completely sucked. Very long with nothing to say. Author was proud of his own knowledge of the SCA and Wiccans, and the publishers thought that obviated the need for plot or character development.

At training time, we observe N samples of feature-label pairs, where the features are a bag-of-words representation of the written text and the label is an overall binary rating of the user's feelings about the product (either 'positive' or 'negative').

Our goal is to build a model that can predict the sentiment (positive or negative) from the text alone.

We have preprocessed this data into a standardized format using a bag-of-words representation, using a fixed vocabulary of the 7729 most common words provided by the original dataset creators (with some slight modifications by us). We'll emphasize that the vocabulary includes some bigrams (e.g. "waste_of") in addition to single words.
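
To make this concrete, here is a minimal sketch of how a binary bag-of-words vector could be built with sklearn's CountVectorizer. The tiny vocabulary below is hypothetical; for this assignment the starter repo already provides the preprocessed feature arrays, so you do not need to run this step yourself.

    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical tiny vocabulary; the real one has 7729 terms,
    # including some bigrams such as "waste_of".
    vocab = ["great", "sucked", "plot", "waste_of"]

    # binary=True records presence/absence (1/0) rather than raw counts.
    vectorizer = CountVectorizer(vocabulary=vocab, binary=True)

    reviews = ["It completely sucked. No plot at all.",
               "A great read with a great plot."]
    x_NF = vectorizer.transform(reviews).toarray()
    print(x_NF)   # each row is a binary vector of length len(vocab)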

For the n-th text review, we define our features and labels as follows:

  • Feature vector \(x_n \in \mathbb{R}^F\) is a binary vector of size F=7729, indicating which terms in the vocabulary are present in the review, and which are absent.
    • Words that are not in the vocabulary have no impact on this feature vector.
  • Label \(y_n\) is a binary label (\(y_n \in \{0, 1\}\)), indicating whether that review had a positive or negative star rating.

We have included in the starter code repo:

  • a training set of 6346 documents
  • a validation set of 792 documents
  • a test set of 793 documents

Our analysis goals across Problems 1-3 are:

  • Can we train tree models to solve this task well?
  • Can we inspect these trees to understand how the model is making decisions?
  • How well do tree models compare with others we know about?

Problem 1: Decision Trees for Review Classification (20% of report grade)

We'll now examine simple decision trees for our sentiment classification problem.

Implementation Step 1A : Train a Simple Tree

Train a DecisionTreeClassifier with criterion='gini', max_depth=3, min_samples_leaf=1.
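
A minimal sketch of this step, assuming the preprocessed training data has already been loaded into a binary feature array x_train_NF and a label array y_train_N (these variable names are our own, not required by the starter code):

    from sklearn.tree import DecisionTreeClassifier

    # random_state is optional here; fixing it just keeps the printed tree reproducible.
    simple_tree = DecisionTreeClassifier(
        criterion='gini', max_depth=3, min_samples_leaf=1, random_state=101)
    simple_tree.fit(x_train_NF, y_train_N)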

Figure 1 in Report

Show the ASCII-text representation of your trained decision tree from 1A, by calling the helper function pretty_print_sklearn_tree found in the starter code. (Remember when you are reading the printed statements that "Y" means the above decision question evaluated to "yes" and "N" means "no".)

Short Answer 1a in Report

Reflect on the tree in Figure 1, remembering that we are predicting overall sentiment about a consumer product based on observed words in a written review:

  • Is the internal logic of the tree reasonable?
  • Are there any paths or internal logic that do not make sense to you?

Implementation Step 1B

Perform a grid search for hyperparameters over the following settings:

  • Performance metric to optimize: "balanced_accuracy"
  • Source of heldout data: the provided fixed validation set
  • max_depth in [2, 8, 32, 128]
  • min_samples_leaf in [1, 3, 9]
  • random_state in [101]
  • All other settings at default values

Use the best-ranked configuration to obtain one "best" decision tree (using only the training set).
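
Because we use a single fixed validation set rather than cross-validation, one straightforward option is a manual loop over the grid (sklearn's GridSearchCV assumes cross-validation unless you pass it something like a PredefinedSplit). A minimal sketch, assuming the same array names as above plus x_valid_NF and y_valid_N:

    import itertools
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.tree import DecisionTreeClassifier

    best_score, best_tree = -1.0, None
    for max_depth, min_samples_leaf in itertools.product([2, 8, 32, 128], [1, 3, 9]):
        tree = DecisionTreeClassifier(
            max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=101)
        tree.fit(x_train_NF, y_train_N)
        score = balanced_accuracy_score(y_valid_N, tree.predict(x_valid_NF))
        if score > best_score:
            best_score, best_tree = score, tree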

You'll use this best tree later in Problem 3.

Short Answer 1b in Report

What is the depth of the best tree you found? How many leaves does it have? Can you interpret the internal logic of this tree? Why or why not?

Problem 2: Random Forests for Review Classification (20% of report grade)

We'll now examine whether we can produce an ensemble of decision trees to improve performance on our classification problem.

Implementation Step 2A: Selecting the best random forest

Perform a grid search for hyperparameters over the following settings:

  • Performance metric to optimize: "balanced_accuracy"
  • Source of heldout data: the provided fixed validation set
  • max_features in [3, 10, 33, 100, 333]
  • max_depth in [16, 32]
  • min_samples_leaf in [1]
  • n_estimators in [125]
  • random_state in [101]
  • All other settings at default values

Use the best-ranked configuration to obtain one "best" random forest (using only the training set).
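
The same fixed-validation-set search pattern from Problem 1B works here, swapping in a RandomForestClassifier. A minimal sketch (array names assumed as before):

    import itertools
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import balanced_accuracy_score

    best_score, best_forest = -1.0, None
    for max_features, max_depth in itertools.product([3, 10, 33, 100, 333], [16, 32]):
        forest = RandomForestClassifier(
            n_estimators=125, max_features=max_features, max_depth=max_depth,
            min_samples_leaf=1, random_state=101)
        forest.fit(x_train_NF, y_train_N)
        score = balanced_accuracy_score(y_valid_N, forest.predict(x_valid_NF))
        if score > best_score:
            best_score, best_forest = score, forest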

You'll use this best forest later in Problem 3.

Implementation Step 2B: Feature importances

Access the feature_importances_ attribute of your trained forest to get a score for each term in our vocabulary.

A higher value of this score indicates the feature is "more important".
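
A minimal sketch of pulling out the needed terms, assuming best_forest is your fitted forest and vocab_list holds the 7729 vocabulary terms in the same order as the feature columns (both names are our own):

    import numpy as np

    importances_F = best_forest.feature_importances_

    # Top 10 terms by importance, largest first.
    top10_ids = np.argsort(importances_F)[::-1][:10]
    for fid in top10_ids:
        print("%-20s %.5f" % (vocab_list[fid], importances_F[fid]))

    # Terms eligible for the near-zero-importance panel.
    near_zero_ids = np.flatnonzero(importances_F < 0.00001)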

Figure 2 in Report

On one panel, display a list of the 10 vocabulary terms with the highest feature importance in your best forest.

In another panel, display a list of 10 randomly chosen terms that have close-to-zero feature importance (any feature with importance less than 0.00001 is eligible).

Short Answer 2a in Report

Reflect on the contents of Figure 2. Do the included vocabulary terms that are "most important" appear reasonable? Is there anything confusing or that you have trouble explaining? Is there anything missing you would have expected?

Short Answer 2b in Report

When you fit random forests, you can adjust n_estimators to set the number of trees in your ensemble.

What is the primary tradeoff this hyperparameter controls? Can you overfit by setting this too large?

Problem 3: Comparing Trees to Linear Models for Review Classification (20% of report grade)

In this problem, you'll think about comparing trees to models with linear decision boundaries (logistic regression).

We'll specifically look at an L1-penalized logistic regression (aka "Lasso"), since we want an interpretable model.

Short Answer 3a in Report

Specifically for binary bag-of-words features, can you describe a reason where you'd expect a tree model to outperform a (thresholded) linear model? What kinds of logic can a tree encode about the presence or absence of vocabulary words in text that a logistic regression model cannot?

Implementation Step 3A: Selecting the best logistic regression with L1 penalty

Perform a grid search for a LogisticRegression classifier over the following settings (a minimal sketch follows this list):

  • Performance metric to optimize: "balanced_accuracy"
  • Source of heldout data: the provided fixed validation set
  • C in np.logspace(-4, 4, 9)
  • penalty set to 'l1', since we want a more interpretable model with sparse weights.
  • solver set to 'saga' (needed to handle the L1 penalty)
  • All other settings at default values
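
A minimal sketch of this search, following the same fixed-validation-set pattern as in Problems 1 and 2 (array names assumed as before; if saga reports convergence warnings you may need to allow more iterations):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score

    best_score, best_lr = -1.0, None
    for C in np.logspace(-4, 4, 9):
        lr = LogisticRegression(penalty='l1', solver='saga', C=C)
        lr.fit(x_train_NF, y_train_N)
        score = balanced_accuracy_score(y_valid_N, lr.predict(x_valid_NF))
        if score > best_score:
            best_score, best_lr = score, lr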

Table 3 in Report

In one table, show the train, validation, and test set performance (in terms of balanced accuracy).

  • Include 3 columns, one for each dataset (train/valid/test)
  • Include 4 rows:
    • Simple decision tree
    • Best decision tree
    • Best random forest
    • Best L1-penalized Logistic Regression

Include a brief caption summarizing your conclusions. Which model does best on the test set? Would we have expected that from the validation set performance?
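
One simple way to compute the table entries, assuming the four fitted models and the train/valid/test arrays from earlier steps (all variable names are our own):

    from sklearn.metrics import balanced_accuracy_score

    models = {'Simple decision tree': simple_tree,
              'Best decision tree': best_tree,
              'Best random forest': best_forest,
              'Best L1-penalized logistic regression': best_lr}
    splits = [(x_train_NF, y_train_N), (x_valid_NF, y_valid_N), (x_test_NF, y_test_N)]
    for name, model in models.items():
        scores = [balanced_accuracy_score(y, model.predict(x)) for x, y in splits]
        print("%-40s train %.3f  valid %.3f  test %.3f" % (name, *scores))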