HW2: Evaluating Binary Classifiers and Implementing Logistic Regression


Last modified: 2020-10-21 11:25

Status: RELEASED.

Due date: Wed. Oct. 14 at 11:59PM AoE (Anywhere on Earth) (Thu. Oct. 15 at 07:59am in Boston)

Updates

  • 2020-10-01 : Instructions for Report Short Answer 1a clarified: use the test set.
  • 2020-10-05 : Instructions for Coding Task 2(c) clarified: make sure you use the base-2 logarithm for all returned values.
  • 2020-10-09 : Clarified the Selecting the Decision Threshold step: please consider the candidate thresholds produced by the starter code helper function when you do your search.
  • 2020-10-14 : Clarified for Short Answer 1e what current practice you should compare your method to, and that you can assume the test set represents your population.

Overview

In this HW, you'll complete the following:

  • First, complete Problem 1's code tasks, writing Python code based on provided starter code. You'll submit your code as a ZIP to the autograder link below.
  • Second, for Problem 1, complete some analysis tasks and write a report. You'll submit this PDF report separately.
  • Third, complete Problem 2's code tasks, writing Python code based on provided starter code. Again, you'll submit your code as a ZIP to the autograder link below.

As much as possible, we have tried to decouple these parts, so you may successfully complete the report even if some of your code doesn't work. Much of your analysis will use library code in sklearn with functionality similar to what you implement yourself.

Turn-in links:

Files to Turn In:

PDF report:

  • Prepare a short PDF report (no more than 5 pages; ideally 1 page per problem below).
  • This document will be manually graded.
  • Can use your favorite report writing tool (Word or G Docs or LaTeX or ....)
  • Should be human-readable. Do not include code. Do NOT just export a jupyter notebook to PDF.
  • Should have each subproblem marked via the in-browser Gradescope annotation tool

ZIP file of source code submitted to autograder should contain:

  • performance_metrics_for_binary_predictions.py (will be autograded)
  • performance_metrics_for_proba_predictions.py (will be autograded)
  • logsumexp.py (will be autograded)
  • hw2.ipynb (just for completeness, will not be autograded and will be manually assessed only if necessary.)

Evaluation Rubric:

See the PDF submission portal on Gradescope for the point values of each problem. Generally, tasks with more coding/effort will earn more potential points.

Jump to: Code Tasks   Starter Code   Dataset   Problem 1   Problem 2

Background

To complete this HW, you'll need some specific knowledge from the following sessions of class:

  • Gradient Descent (day06)
  • Evaluation of Binary Classifiers (day07)
  • Logistic Regression (day08)

Starter Code

See the hw2 folder of the public assignments repo for this class:

https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/hw2

This starter code includes several .py files for core functionality you need to implement yourself.

It also includes a notebook to help you organize your analysis.

Problem 1: Binary Classifier for Cancer-Risk Screening

Code Tasks for Problem 1: Implement performance metrics for binary predictions

Here, you'll implement several metrics for comparing provided "true" binary labels with "predicted" binary decisions.

See the starter code file: performance_metrics_for_binary_predictions.py.

Task 1(a) : Implement calc_TP_TN_FP_FN

Task 1(b) : Implement calc_ACC

Task 1(c) : Implement calc_TPR and calc_TNR

Task 1(d) : Implement calc_PPV and calc_NPV

See the starter code for example inputs and the expected output.
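If you are unsure where to start, here is a minimal sketch of the counting involved. The argument names, return order, and zero-denominator behavior shown here are assumptions; the docstrings in performance_metrics_for_binary_predictions.py are authoritative.

import numpy as np

def calc_TP_TN_FP_FN(ytrue_N, yhat_N):
    ''' Count true positives, true negatives, false positives, false negatives. '''
    ytrue_N = np.asarray(ytrue_N)
    yhat_N = np.asarray(yhat_N)
    TP = np.sum(np.logical_and(ytrue_N == 1, yhat_N == 1))
    TN = np.sum(np.logical_and(ytrue_N == 0, yhat_N == 0))
    FP = np.sum(np.logical_and(ytrue_N == 0, yhat_N == 1))
    FN = np.sum(np.logical_and(ytrue_N == 1, yhat_N == 0))
    return TP, TN, FP, FN

def calc_ACC(ytrue_N, yhat_N):
    TP, TN, FP, FN = calc_TP_TN_FP_FN(ytrue_N, yhat_N)
    return (TP + TN) / float(TP + TN + FP + FN)

def calc_TPR(ytrue_N, yhat_N):
    TP, TN, FP, FN = calc_TP_TN_FP_FN(ytrue_N, yhat_N)
    # NOTE: check the starter code docstring for how to handle a zero denominator
    return TP / float(TP + FN) if (TP + FN) > 0 else 0.0

def calc_PPV(ytrue_N, yhat_N):
    TP, TN, FP, FN = calc_TP_TN_FP_FN(ytrue_N, yhat_N)
    return TP / float(TP + FP) if (TP + FP) > 0 else 0.0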

Dataset: Predicting Cancer Risk from Easy-to-measure Facts

You have been given a dataset containing some medical history information for 750 patients that might be at risk of cancer. Dataset credit: A. Vickers, Memorial Sloan Kettering Cancer Center [original link].

Each patient in our dataset has been biopsied (fyi: in this case a biopsy is a short surgical procedure that is painful but with virtually no lasting harmful effects) to obtain a direct "ground truth" label so we know each patient's actual cancer status (binary variable, 1 means "has cancer", 0 means does not, column name is cancer in the \(y\) data files). We want to build classifiers to predict whether a patient likely has cancer from easier-to-get information, so we could avoid painful biopsies unless they are necessary. Of course, if we skip the biopsy, a patient with cancer would be left undiagnosed and therefore untreated. We're told by the doctors this outcome would be life-threatening.

Easiest features: It is known that older patients with a family history of cancer have a higher probability of harboring cancer. So we can use age and famhistory variables in the \(x\) dataset files as inputs to a simple predictor.

Possible new feature: A clinical chemist has recently discovered a real-valued marker (called marker in the \(x\) dataset files) that she believes can distinguish between patients with and without cancer. We wish to assess whether or not the new marker does indeed identify patients with and without cancer well.

To summarize, there are two versions of the features \(x\) we'd like you to examine:

  • 2-feature dataset: 'age' and 'famhistory'
  • 3-feature dataset: 'marker', 'age' and 'famhistory'

Bottom-line: We are building classifiers so that many patients might not need to undergo a painful biopsy if our classifier is reliable enough to be trusted to filter out low-risk patients.

In the starter code, we have provided an existing train/validation/test split of this dataset, stored on-disk in comma-separated-value (CSV) files: x_train.csv, y_train.csv, x_valid.csv, y_valid.csv, x_test.csv, and y_test.csv.

Get the data here: https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/hw2/data_cancer

Understanding the Dataset

Implementation Step 1A

Given the provided datasets (as CSV files), load them and compute the relevant counts needed for Table 1 below (see the sketch after the table description).

Table 1 in Report

Provide a table summarizing some basic properties of the provided training set, validation set, and test set:

  • Row 1 'total count': how many total examples are in each set?
  • Row 2 'positive label count': how many examples have a positive label (means cancer)?
  • Row 3 'fraction positive' : what fraction (between 0 and 1) of the examples have cancer?
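One possible way to complete Implementation Step 1A and fill in these three rows, using pandas (the file path assumes the data_cancer folder sits next to your notebook):

import os
import pandas as pd

data_dir = 'data_cancer'   # adjust to wherever you placed the provided CSV files
for split in ['train', 'valid', 'test']:
    y_df = pd.read_csv(os.path.join(data_dir, 'y_%s.csv' % split))
    total_count = y_df.shape[0]
    pos_count = int(y_df['cancer'].sum())
    print('%s set: total=%d  positive=%d  frac_positive=%.3f' % (
        split, total_count, pos_count, pos_count / float(total_count)))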

Establishing Baseline Prediction Quality

Implementation Step 1B

Given a training set of values \(\{y_n \}_{n=1}^N\), we can always consider a simple baseline for prediction that returns the same constant predicted label regardless of the input \(x_i\) feature vector:

  • predict-0-always : \(\forall i, \quad \hat{y}(x_i) = 0\)

Compute the confusion matrix for this baseline classifier, and save it for later (see Table 2 below).
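A minimal sketch of evaluating this baseline on one split, reusing the counting function from Code Task 1(a) (the file path and variable names are assumptions):

import numpy as np
import pandas as pd
from performance_metrics_for_binary_predictions import calc_TP_TN_FP_FN

y_test_N = pd.read_csv('data_cancer/y_test.csv')['cancer'].values
yhat_N = np.zeros_like(y_test_N)          # predict-0-always baseline
TP, TN, FP, FN = calc_TP_TN_FP_FN(y_test_N, yhat_N)
print('TN=%d FP=%d FN=%d TP=%d' % (TN, FP, FN, TP))
print('accuracy = %.3f' % ((TP + TN) / float(y_test_N.size)))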

Short Answer 1a in Report

What accuracy does the "predict-0-always" classifier get on the test set (report to 3 decimal places)? (You should see a pretty good number.) Does this mean we should use this classifier?

Short Answer 1b in Report

Consider a general binary classifier for this task (not just the baseline above, but any classifier). For the intended application (screening patients before biopsy), describe the possible mistakes the classifier can make in task-specific terms. What costs does each mistake entail (lost time? lost money? life-threatening harm?).

Short Answer 1c in Report (Note: added after the turn-in deadline, so next time this will be clearer.)

How do you recommend evaluating a potential classifier to be mindful of these costs?

Trying Logistic Regression: Training and Hyperparameter Selection

Implementation Step 1C

Consider the 2-feature dataset. Fit a logistic regression model using sklearn's LogisticRegression implementation sklearn.linear_model.LogisticRegression docs.

When you construct your LogisticRegression classifier, please be sure that you:

  • Set solver='lbfgs' (this ensures consistent performance and a coherent penalty)
  • Provide a positive value for the hyperparameter C, an "inverse strength" value for the L2 penalty on coefficient weights
    • Small C (e.g. \(10^{-6}\)) means the weights should be near zero (equivalent to large \(\alpha\))
    • Large C (e.g. \(10^{+6}\)) means the weights should be unpenalized (equivalent to small \(\alpha\))

To avoid overfitting, you should explore a range of C values, using a regularly-spaced grid: C_grid = np.logspace(-9, 6, 31). Among these possible values, select the value that minimizes the mean cross entropy loss on the validation set. The starter code contains a function from sklearn for computing this loss. Later in Problem 2, you'll implement this loss yourself.
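A sketch of this selection loop is below. The column names follow the dataset description above; sklearn.metrics.log_loss is one way to compute the mean cross entropy, but prefer whichever sklearn function the starter code actually imports.

import numpy as np
import pandas as pd
import sklearn.linear_model
import sklearn.metrics

# Load the 2-feature version of the dataset
feat_cols = ['age', 'famhistory']
x_train_NF = pd.read_csv('data_cancer/x_train.csv')[feat_cols].values
y_train_N = pd.read_csv('data_cancer/y_train.csv')['cancer'].values
x_valid_NF = pd.read_csv('data_cancer/x_valid.csv')[feat_cols].values
y_valid_N = pd.read_csv('data_cancer/y_valid.csv')['cancer'].values

best_C, best_loss, best_lr = None, np.inf, None
for C in np.logspace(-9, 6, 31):
    lr = sklearn.linear_model.LogisticRegression(solver='lbfgs', C=C)
    lr.fit(x_train_NF, y_train_N)
    yproba1_valid_N = lr.predict_proba(x_valid_NF)[:, 1]
    loss = sklearn.metrics.log_loss(y_valid_N, yproba1_valid_N)
    if loss < best_loss:
        best_C, best_loss, best_lr = C, loss, lr
print('selected C = %.3g (validation cross entropy %.4f)' % (best_C, best_loss))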

Implementation Step 1D

Repeat 1C, for the 3-feature dataset.

Comparing Models with ROC Analysis

We have trained two possible LR models, one using the 2-feature dataset (\(F=2\)) and the other with the 3-feature dataset (\(F=3\)). Which is better?

Receiver Operating Characteristic ("ROC") curves allow us to compare classifiers across many possible decision thresholds. Each curve shows the tradeoffs a classifier makes between true positive rate (TPR) and false positive rate (FPR) as you vary the decision threshold. Remember that FPR = 1 - TNR.

You can use sklearn.metrics.roc_curve to compute such curves. To understand how to use this function, consult its User Guide and documentation.

Figure 1 in Report

Compare the \(F=2\) and \(F=3\) model's performance on the validation set, using ROC curves.

Create a single plot showing two lines:

  • one line is the validation-set ROC curve for the \(F=2\) model from 1C (use color BLUE ('b') and style '.-')
  • one line is the validation-set ROC curve for the \(F=3\) model from 1D (use color RED ('r') and style '.-')
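A plotting sketch for Figure 1 is below, assuming lr_2feat and lr_3feat are the models selected in Steps 1C and 1D and x2_valid_NF, x3_valid_NF, y_valid_N are the corresponding validation arrays (all of these names are assumptions):

import matplotlib.pyplot as plt
import sklearn.metrics

# lr_2feat, lr_3feat : models selected in Steps 1C and 1D (assumed names)
# x2_valid_NF, x3_valid_NF, y_valid_N : validation arrays for each feature set
fig, ax = plt.subplots(figsize=(5, 5))
for lr, x_valid_NF, style, label in [
        (lr_2feat, x2_valid_NF, 'b.-', 'F=2 (age, famhistory)'),
        (lr_3feat, x3_valid_NF, 'r.-', 'F=3 (marker, age, famhistory)')]:
    yproba1_N = lr.predict_proba(x_valid_NF)[:, 1]
    fpr, tpr, _ = sklearn.metrics.roc_curve(y_valid_N, yproba1_N)
    ax.plot(fpr, tpr, style, label=label)
ax.set_xlabel('false positive rate (1 - TNR)')
ax.set_ylabel('true positive rate (TPR)')
ax.legend(loc='lower right')
plt.show()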

Short Answer 1c in Report

Compare the two models in terms of their ROC curves from Figure 1. Does one dominate the other in terms of overall performance, or are there regions where one model “wins” and other regions where the other does? Which model do you recommend for the task at hand?

Selecting the Decision Threshold

Remember that even after we train an LR model to make probabilistic predictions, if we intend the classifier to ultimately make some yes/no binary decision (e.g. should we give a biopsy or not), we need to select the threshold we use to obtain a binary decision from probabilities.

Of course, we could just use a threshold of 0.5 (which is what sklearn and most implementations will do by default). Below, we consider several potentially smarter strategies for selecting this threshold.

Clarification (10/09): To get candidate threshold values, use the helper function compute_perf_metrics_across_thresholds in the starter code file threshold_selection.py.

Implementation Step 1E

For the classifier from 1D above (LR for 3-feature dataset), compute performance metrics across all candidate thresholds on the validation set (use compute_perf_metrics_across_thresholds), and pick the threshold that maximizes TPR while satisfying PPV >= 0.98 on the validation set.

Remember, you pick this threshold based on the validation set, then later you'll evaluate it on the test set.

Implementation Step 1F

For the classifier from 1D above (LR for 3-feature dataset), compute performance metrics across all candidate thresholds on the validation set (use compute_perf_metrics_across_thresholds), and pick the threshold that maximizes PPV while satisfying TPR >= 0.98 on the validation set.

Remember, you pick this threshold based on the validation set, then later you'll evaluate it on the test set.
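A sketch of both selection rules is below. It assumes compute_perf_metrics_across_thresholds returns parallel arrays of candidate thresholds and the TPR/PPV achieved at each one on the provided data; check the helper's docstring in threshold_selection.py for its actual return format before relying on this.

import numpy as np
from threshold_selection import compute_perf_metrics_across_thresholds

# y_valid_N : true validation labels; yproba1_valid_N : F=3 model's predicted
# probabilities on the validation set (both assumed to exist already)
# ASSUMPTION: helper returns parallel arrays (thresholds, TPR, PPV)
thresh_grid, tpr_grid, ppv_grid = compute_perf_metrics_across_thresholds(
    y_valid_N, yproba1_valid_N)

# Step 1E: maximize TPR subject to PPV >= 0.98 on the validation set
ok = ppv_grid >= 0.98
thresh_1E = thresh_grid[ok][np.argmax(tpr_grid[ok])]

# Step 1F: maximize PPV subject to TPR >= 0.98 on the validation set
ok = tpr_grid >= 0.98
thresh_1F = thresh_grid[ok][np.argmax(ppv_grid[ok])]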

Figure 2 in Report

In one figure with 2 rows and 3 columns, summarize the test-set performance of the \(F=3\) logistic regression model across several thresholds.

Each column should correspond to one of the following model-plus-threshold combinations:

  • \(F=3\) Logistic Regression (from 1D above), using threshold 0.5
  • \(F=3\) Logistic Regression (from 1D above), using threshold from 1E above
  • \(F=3\) Logistic Regression (from 1D above), using threshold from 1F above

Each row should report the model's performance on the test set:

  • Top Row: confusion matrix on the test set
  • Bottom Row: TPR and PPV on the test set

When you make confusion matrices, use the same orientation as in the day07 slides (true label on the vertical axis, predicted label on the horizontal axis, with class 0 before class 1 when reading left-to-right or top-to-bottom).
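One way to produce the numbers behind each column of Figure 2, assuming yproba1_test_N holds the \(F=3\) model's predicted probabilities on the test set and thresh_1E, thresh_1F are the thresholds chosen in Steps 1E and 1F (all names are assumptions):

import numpy as np
from performance_metrics_for_binary_predictions import (
    calc_TP_TN_FP_FN, calc_TPR, calc_PPV)

# yproba1_test_N : predicted P(cancer=1) on the test set from the F=3 model (assumed)
for name, thresh in [('default', 0.5), ('1E', thresh_1E), ('1F', thresh_1F)]:
    yhat_test_N = np.asarray(yproba1_test_N >= thresh, dtype=np.int32)
    TP, TN, FP, FN = calc_TP_TN_FP_FN(y_test_N, yhat_test_N)
    print('%s thresh=%.4f | TN=%3d FP=%3d FN=%3d TP=%3d | TPR=%.3f PPV=%.3f' % (
        name, thresh, TN, FP, FN, TP,
        calc_TPR(y_test_N, yhat_test_N), calc_PPV(y_test_N, yhat_test_N)))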

Short Answer 1d in Report

Compare the 3 confusion matrices in Figure 2. Which model and thresholding strategy best meets our preferences from 1b: avoid life-threatening mistakes at all costs, while also eliminating unnecessary biopsies?

Short Answer 1e in Report

By carefully reading the confusion matrices from Figure 2, estimate how many subjects in the test set are saved from unnecessary biopsies that would be done in current practice using your selected thresholding strategy. What fraction of current biopsies might be avoided if this classifier was adopted by the hospital?

Clarifications (Added 2020-10-14):

  • You can assume the current practice is to biopsy every patient. Your goal is to build a classifier that improves on this practice.
  • You can assume the test set is a reasonable representation of the true population of patients.

Problem 2: Computing the Loss for Logistic Regression without Numerical Issues

Code Task 2: Implement performance metrics for probabilistic classification

Here, you'll implement some metrics for comparing provided "true" binary labels with real-valued predictions, expressed as either probabilities or as scores.

There's nothing to add to your report here. We just want you to complete these 3 functions below. We'll build on these functions in the next HW.

Task 2(a) : Implement function calc_mean_binary_cross_entropy_from_probas

Starter code file: performance_metrics_for_proba_predictions.py.

Given a single binary label \(y_n \in \{0, 1\}\) and a corresponding predicted probability \(p_n \in (0,1)\), we define binary cross entropy as:

$$ \text{BCE}(y_n, p_n) = - y_n \log_2 p_n - (1-y_n) \log_2 (1-p_n) $$

Given the labels for \(N\) examples \(y = \{y_n \}_{n=1}^N\) and their predicted probabilities \(p = \{ p_n \}_{n=1}^N\), we define the mean binary cross entropy for the dataset as:

$$ \text{mean\_BCE}(y, p) = \frac{1}{N} \sum_{n=1}^N \text{BCE}(y_n, p_n) $$

To keep this computation numerically stable, we will require that your code preprocesses the provided probabilities such that:

\(10^{-14} \leq p_n \leq 1 - 10^{-14}\)

If we didn't do this, we might call \(\log_2(0)\), which is not a valid number (NumPy will yield a -np.inf value).

See the starter code for example inputs and the expected output.
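A minimal sketch satisfying the definition and clipping requirement above (the argument names are assumptions; follow the starter file's docstring):

import numpy as np

def calc_mean_binary_cross_entropy_from_probas(ytrue_N, yproba1_N):
    ''' Sketch: mean base-2 BCE with clipping for numerical safety. '''
    ytrue_N = np.asarray(ytrue_N, dtype=np.float64)
    yproba1_N = np.asarray(yproba1_N, dtype=np.float64)
    # Clip so we never take log2 of 0
    yproba1_N = np.clip(yproba1_N, 1e-14, 1.0 - 1e-14)
    bce_N = (-ytrue_N * np.log2(yproba1_N)
             - (1.0 - ytrue_N) * np.log2(1.0 - yproba1_N))
    return np.mean(bce_N)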

Task 2(b) : Implement function my_logsumexp

Starter code file: logsumexp.py.

In several ML tasks, given a vector \(a \in \mathbb{R}^L\) of \(L\) real numbers, we frequently need to evaluate a function like this:

\begin{align} \text{logsumexp}([a_1, a_2, \ldots a_L]) &= \log \left( e^{a_1} + e^{a_2} + \ldots e^{a_L} \right) \end{align}

We call this the "logsumexp" function.

If we just implement this in NumPy code, we'll quickly notice problems for some inputs that should be "easy".

For example, consider the case where our vector \(a\) has two entries: a = [1000, -999].

\begin{align} \text{logsumexp}([1000.0, -999]) &= \log \left( e^{1000} + e^{-999} \right) = 1000 + \epsilon \end{align}

where we use \(\epsilon\) to indicate a very small positive float value. The true result is not quite 1000, but probably so close that our finite computers won't be able to represent the difference.

But what does a quick implementation in NumPy do?

# BAD! Naive computation overflows
>>> import numpy as np
>>> np.log(np.sum(np.exp([1000.0, -999.])))
RuntimeWarning: overflow encountered in exp
inf

# GOOD: a "numerically stable" computation from a good library
>>> from scipy.special import logsumexp
>>> logsumexp([1000., -999.])
1000.0

How do we implement this in a stable way? We can use the "logsumexp trick", which is just a few lines of algebra:

\begin{align} \text{logsumexp}([a_1, a_2, \ldots a_L]) &= \log \left( e^{a_1} + e^{a_2} + \ldots e^{a_L} \right) \\ &= \log \left( e^{m} ( e^{a_1 - m} + e^{a_2 - m} + \ldots e^{a_L - m} ) \right) \\ &= \log \left( e^{m} \right) + \log \left( e^{a_1 - m} + e^{a_2 - m} + \ldots e^{a_L - m} \right) \\ &= m + \log \left( e^{a_1 - m} + e^{a_2 - m} + \ldots e^{a_L - m} \right) \end{align}

The above is true for any scalar real value \(m\). However, if we pick \(m\) smartly, this computation becomes more robust to any possible input values in our vector \(a\).

If we select \(m\) equal to the maximum entry of the vector \(a\) (\(m = \max_{\ell} a_{\ell}\)), we can guarantee that each of the right-hand side terms indexed by \(\ell\) satisfies \(0 \leq e^{a_\ell - m} \leq 1\). Furthermore, at least one of these terms equals 1.0 precisely. Thus, the total sum inside the log must be between 1 and \(L\). We can easily compute np.log(1.0) or np.log(L) and values in between for any reasonable value of \(L\) (even if our vector has a few billion entries).

By implementing the last line above and setting \(m\) to the maximum entry of our vector \(a\), our computation will be correct no matter what the entries of \(a\) are, so long as they are representable as valid float64 values in your computer. The input \(a\) could contain extreme negative values like -9999.999, or extreme positive values like 987654.123.
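A minimal sketch of my_logsumexp that implements the last line of the derivation above (the argument name is an assumption):

import numpy as np

def my_logsumexp(scores_L):
    ''' Sketch of the logsumexp trick for a 1D array of scores. '''
    scores_L = np.asarray(scores_L, dtype=np.float64)
    m = np.max(scores_L)                          # largest entry
    return m + np.log(np.sum(np.exp(scores_L - m)))

# Stable even for extreme inputs that overflow the naive version
print(my_logsumexp([1000.0, -999.0]))             # prints 1000.0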

Task 2(c) : Implement function calc_mean_binary_cross_entropy_from_scores

Starter code file: performance_metrics_for_proba_predictions.py.

Given a single binary label \(y_n \in \{0, 1\}\) and a corresponding real-valued score \(s_n \in (-\infty, \infty)\), we define binary cross entropy as:

\begin{align} \text{scoreBCE}(y_n, s_n) = \text{BCE}(y_n, \sigma(s_n)) &= - y_n \log_2 \frac{1}{1 + e^{-s_n}} - (1-y_n) \log_2 \frac{e^{-s_n}}{1 + e^{-s_n}} \\ &= \begin{cases} \log_2 (1 + e^{-s_n}) \quad \text{if}~ y_n = 1 \\ \log_2 (1 + e^{s_n}) \quad \text{if}~ y_n = 0 \end{cases} \\ &= \log_2 \left( 1 + e^{\text{flip}(y_n) s_n} \right) \\ &= \log_2 \left( e^0 + e^{\text{flip}(y_n) s_n} \right) \\ &= \frac{\text{logsumexp}( [0, \text{flip}(y_n) s_n ] )}{ \log_e(2) } \end{align}

Here, we assume the logsumexp function (which uses the "natural" or base-\(e\) logarithm) is defined above in 2(b). The function flip is defined as:

\begin{align} \text{flip}(y_n) &= \begin{cases} -1 \quad \text{if}~ y_n = 1 \\ +1 \quad \text{if}~ y_n = 0 \end{cases} \end{align}

You are expected to return a base 2 logarithm value here, so the final \( \log_e 2\) term accounts for this.

Remember that if it is easy to compute a logarithm in base "B", but you want a logarithm in base "A", you can convert between bases like so for any input value \(z\):

$$ \log_A(z) = \frac{\log_B(z)}{\log_B(A)} $$

Reference: https://www.khanacademy.org/math/algebra2/x2ec2f6f830c9fb89:logs/x2ec2f6f830c9fb89:change-of-base/a/logarithm-change-of-base-rule-intro
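Putting the my_logsumexp function from 2(b) and the change-of-base rule together, a minimal sketch of Task 2(c) might look like this (the argument names are assumptions; the starter file's docstring is authoritative):

import numpy as np
from logsumexp import my_logsumexp   # from Task 2(b)

def calc_mean_binary_cross_entropy_from_scores(ytrue_N, scores_N):
    ''' Sketch: mean base-2 BCE computed directly from real-valued scores. '''
    ytrue_N = np.asarray(ytrue_N, dtype=np.float64)
    scores_N = np.asarray(scores_N, dtype=np.float64)
    flip_N = np.where(ytrue_N == 1, -1.0, +1.0)           # flip(y_n)
    bce_N = np.asarray([
        my_logsumexp(np.asarray([0.0, f * s]))            # base-e logsumexp
        for f, s in zip(flip_N, scores_N)])
    return np.mean(bce_N) / np.log(2.0)                   # convert base-e to base-2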