CP1: Unigram Probabilities


Last modified: 2020-02-03 16:07

Due date: Wed. Feb 5, 2020 at 11:59pm

Status: Released

Updates:

  • 2020-02-03 04:00pm : Added separate link for PDF turnin

What to turn in:

Your ZIP file should include

  • All starter code .py files (with your edits) (in the top-level directory)

Your PDF should include (in order):

  • Your full name
  • Collaboration statement
  • About how many hours did you spend (coding, asking for help, writing it up)?
  • Problem 1 figure and short answers
  • Problem 2 figure and short answers

Questions?: Post to the cp1 topic on the discussion forums.

Jump to: Problem 1   Problem 2   Starter Code

Background

Recall the unigram model discussed in class and in HW1.

We'll assume throughout that we have a fixed vocabulary of \(V\) distinct words, all known in advance.

The two problems below will address two key questions:

  • Given training data, how should we estimate the probability of each word? How might estimates change if we have very little (or abundant) data?
  • How can we select hyperparameter values to improve our predictions on heldout data, using only the training set?

Probabilistic Model

Likelihood model for one word

Consider a discrete random variable \(X\) whose value indicates one of the \(V\) possible vocabulary words.

Let us define a flexible probability mass function, where each possible vocabulary term \(v \in \{1, 2, \ldots V\}\) can have its own probability value \(\mu_v\), with \(0 \leq \mu_v \leq 1\):

$$ p(X = v | \mu) = \mu_v, \quad \forall v \in \{1, \ldots V \} $$

Thus, our PMF is defined by a parameter vector \(\mu = [ \mu_1, \mu_2, \ldots \mu_V ]\).

To define a valid PMF, the vector \(\mu\) must have \(V\) non-negative entries and sum to one:

$$ \sum_{v=1}^V \mu_v = 1 $$

Likelihood model for many words

We can observe a total list of \(N\) words as training data, \(x_1, x_2, \ldots x_N\), where each symbol \(x_n\) stands for an integer index to our vocabulary \(x_n \in \{1, 2, \ldots V\}\). We can consider these words as the outcome of \(N\) random variables, \(X_1, \ldots X_N\), each one taking \(V\) possible discrete values (each possible vocab term).

We model our list of words by making the assumption that each word is conditionally independent of the other words given the parameter vector \(\mu\):

$$ p(X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N | \mu) = \prod_{n=1}^N p(X_n = x_n | \mu) $$

We can summarize the observed values \(x_1, \ldots x_N\) via a vector of counts \(n_1, \ldots n_V\), each one indicating how many times term \(v\) appears in our list of \(N\) words:

$$ n_v = \sum_{n=1}^N [ x_n = v ] $$

where the bracket expression equals 1 if the condition inside is true, and 0 otherwise. This is commonly called Iverson bracket notation: https://en.wikipedia.org/wiki/Iverson_bracket
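As a concrete illustration of the count definition above, here is a minimal sketch (the function name is ours, not the starter code's API; it assumes words have already been mapped to 0-based integer ids, whereas the math above uses 1-based indexing):

```python
import numpy as np

def count_words(word_ids, vocab_size):
    """Return a length-V array whose entry v is n_v, the count of term v."""
    counts = np.zeros(vocab_size, dtype=int)
    for x_n in word_ids:
        counts[x_n] += 1  # the Iverson bracket [x_n = v], summed over n
    return counts

# Example: N=5 observed words over a V=3 vocabulary
print(count_words([0, 2, 2, 1, 2], 3))  # [1 1 3]
```

In practice `np.bincount(word_ids, minlength=vocab_size)` computes the same vector in one call.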

Prior model

We assume the vector \(\mu\) is drawn from a symmetric Dirichlet with concentration parameter \(\alpha > 0\).

$$ p(\mu | \alpha) = \text{Dirichlet}( \mu_1, \ldots \mu_V | \alpha, \ldots \alpha ) $$

Recall that this is like describing our beliefs about \(\mu\) in terms of "pseudo-counts". That is, we act as if we have observed each vocabulary term \(\alpha\) times before seeing any training data.
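To build intuition for the pseudo-count interpretation, here is a quick (ungraded) demo of drawing \(\mu\) from a symmetric Dirichlet with NumPy; the function name is illustrative:

```python
import numpy as np

def draw_unigram_params(alpha, V, seed=0):
    """Draw one mu vector from a symmetric Dirichlet with concentration alpha."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(alpha * np.ones(V))

# Larger alpha acts like more pseudo-counts per word,
# pulling draws toward the uniform vector [1/V, ..., 1/V].
for alpha in [0.1, 1.0, 100.0]:
    print(alpha, np.round(draw_unigram_params(alpha, V=5), 3))
```

Every draw is a valid PMF: non-negative entries that sum to one.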

Experiment 1: How to estimate the probability of heldout words?

Below, we provide the exact formulas for 3 common estimators for unigram probabilities.

Your task in Problem 1 (below) will be to implement these estimators and apply them to the provided training/test data.

Throughout all the estimators below, it is useful to view \(n_v\) as a function of the training data: \(n_v(x_1, \ldots x_N)\). We will simply write \(n_v\) to avoid verbose notation, but keep in mind we determine the count \(n_v\) by what we observe in our training data.

Maximum Likelihood (ML) estimator (with ad-hoc smoothing for unseen words)

Given a new word \(X_*\), we estimate it takes value \(v \in \{1, \ldots V \}\) with probability:

\begin{align} p( X_* = v | \mu^{\text{ML}}(x_1, \ldots x_N) ) = \begin{cases} (1 - \epsilon) \frac{n_v}{N} &\quad \text{if~} n_v > 0 \\ \epsilon \frac{1}{U} &\quad \text{otherwise} \end{cases} \end{align}

Here, we use a small constant \(\epsilon > 0\) to denote the fraction of all probability mass we will allow to be used for unknown words. The integer \(U\) is the total number of vocabulary words that have zero count.
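The piecewise rule above can be sketched as follows (illustrative only; your graded implementation belongs in the `fit` and `predict_proba` methods of `MLEstimator.py`, and the handling of the corner case where every word is seen is our assumption, since the formula leaves \(\epsilon\) unassigned when \(U = 0\)):

```python
import numpy as np

def ml_proba(counts, epsilon=1e-6):
    """Smoothed ML probability vector from counts n_1, ..., n_V."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    U = int(np.sum(counts == 0))   # number of unseen vocabulary words
    if U == 0:
        return counts / N          # every word seen: plain ML estimate (our choice)
    return np.where(counts > 0,
                    (1.0 - epsilon) * counts / N,  # seen: scaled ML estimate
                    epsilon / U)                   # unseen: split epsilon evenly
```

For example, with counts `[3, 1, 0, 0]` and `epsilon=0.01`, the seen words get \(0.99 \cdot 3/4\) and \(0.99 \cdot 1/4\), and each unseen word gets \(0.01/2\), so the result sums to one.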

Maximum A-Posteriori (MAP) estimator

Given a new word \(X_*\), we estimate it takes value \(v\) with probability:

$$ p( X_* = v | \mu^{\text{MAP}}(x_1, \ldots x_N) ) = \frac{n_v + \alpha - 1}{N + V(\alpha - 1)} $$

Note that this estimator requires that \(\alpha > 1\) unless every vocabulary word is observed at least once.
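A one-function sketch of the MAP formula (names are illustrative; the graded version goes in `MAPEstimator.py`):

```python
import numpy as np

def map_proba(counts, alpha=2.0):
    """MAP probability vector under a symmetric Dirichlet(alpha) prior."""
    counts = np.asarray(counts, dtype=float)
    N, V = counts.sum(), counts.size
    # Yields a valid PMF only when alpha > 1 or every n_v > 0,
    # matching the caveat in the text.
    return (counts + alpha - 1.0) / (N + V * (alpha - 1.0))
```

With counts `[3, 1, 0, 0]` and \(\alpha = 2\), this gives \([4, 2, 1, 1] / 8 = [0.5, 0.25, 0.125, 0.125]\).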

Posterior Predictive estimator

Given a new word \(X_*\), we estimate it takes value \(v\) with probability:

$$ p( X_* = v | X_1=x_1, \ldots X_N=x_N, \alpha ) = \frac{n_v + \alpha}{N + V \alpha} $$
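The corresponding sketch for the posterior predictive rule (again illustrative; the graded version goes in `PosteriorPredictiveEstimator.py`):

```python
import numpy as np

def posterior_predictive_proba(counts, alpha=2.0):
    """Posterior predictive probability vector under a symmetric Dirichlet(alpha) prior."""
    counts = np.asarray(counts, dtype=float)
    # Adds alpha (not alpha - 1) pseudo-counts, so this is a valid PMF
    # for any alpha > 0, even when some counts are zero.
    return (counts + alpha) / (counts.sum() + counts.size * alpha)
```

With counts `[3, 1, 0, 0]` and \(\alpha = 2\), this gives \([5, 3, 2, 2] / 12\); note how it differs from the MAP estimate on the same counts.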

Experiment 2: How to select the hyperparameter settings using training data?

You might expect that the performance of the estimators for our model is rather sensitive to the chosen value of the prior hyperparameter \(\alpha\).

In Problem 2 below, you'll be asked to compute the probability of the observed training words given hyperparameter \(\alpha\), also called the evidence.

As derived in class and in HW1, the evidence PMF is:

\begin{align} p(X_1 = x_1, \ldots X_N = x_N | \alpha) &= \int_{\mu} p(X_1 = x_1, \ldots X_N = x_N | \mu ) p( \mu | \alpha ) d\mu \\ &= \frac { \Gamma(V \alpha) \prod_{v=1}^V \Gamma( n_v + \alpha ) } { \Gamma(N + V \alpha ) \prod_{v=1}^V \Gamma(\alpha) } \end{align}

Again, this formula is specialized to a symmetric Dirichlet prior, where every vocabulary term has the same "pseudocount" of \(\alpha\).

Hints for numerical stability

We suggest computing the log of the above PMF function directly (use SciPy's gammaln function as demonstrated in class). This is more numerically stable because it adds in log space rather than multiplying in probability space, where underflow or overflow are likely.

We further suggest that you divide by the total number of tokens in the training set. This puts the value on an easier scale (your answer should be between -11 and -8, not a huge negative number) and makes results easier to compare.
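Putting both hints together, a sketch of the log-evidence computation might look like this (the function name and per-token normalization here are illustrative; your graded code goes in `calc_log_evidence` in `run_model_selection.py`):

```python
import numpy as np
from scipy.special import gammaln  # log Gamma, as suggested in class

def calc_log_evidence_per_token(counts, alpha):
    """Per-token log evidence of the training counts under a symmetric Dirichlet(alpha)."""
    counts = np.asarray(counts, dtype=float)
    N, V = counts.sum(), counts.size
    # Log of the evidence formula: work entirely in log space via gammaln.
    log_ev = (gammaln(V * alpha) - gammaln(N + V * alpha)
              + np.sum(gammaln(counts + alpha)) - V * gammaln(alpha))
    return log_ev / N  # divide by token count for a comparable scale
```

As a sanity check: with \(V = 2\), \(\alpha = 1\), and counts \([1, 1]\), the evidence is \(\Gamma(2)\Gamma(2)\Gamma(2) / \Gamma(4) = 1/6\), so the function should return \(-\log(6)/2\).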

Data and Starter Code

You can find the starter code and datasets in the course Github repository here:

https://github.com/tufts-ml-courses/comp136-spr-20s-assignments/tree/master/cp1

Follow directions in the README for how to install the required Python packages.

Inside the data/ folder, you will find two plain-text files:

  • training_data.txt
  • test_data.txt

Each contains a list of 640,000 words, separated by spaces. We have already “cleaned” the text content, so it does not require any further preprocessing. You only need to read the content of these files in as a list of strings, using code like that found in the __main__ function of run_estimator_comparison.py.

Problem 1

Tasks for Code Implementation

1a: CODE Implement fit and predict_proba methods of starter code MLEstimator.py

1b: CODE Implement fit and predict_proba methods of starter code MAPEstimator.py

1c: CODE Implement fit and predict_proba methods of starter code PosteriorPredictiveEstimator.py

Tasks for Report PDF

1d: FIGURE In your report PDF, using the starter code of run_estimator_comparison.py, produce 1 figure showing three overlapping line plots, one for each of the estimators you implemented above in 1a - 1c. Each estimator's line should show the estimated per-word log probability of the entire test data on the y-axis, as a function of the fraction of available training data on the x-axis.

You should be sure to enforce the following settings:

  • unseen_proba = 0.000001 for the maximum likelihood estimator
  • alpha = 2.0 for both estimators that require using the Dirichlet prior
  • frac_train_list = [1./128, 1./64, 1./32, 1./16, 1./8, 1./4, 1./2, 1.0]
  • Do not change the plotting limits or tick labels (the starter code defaults are ideal)
  • Report and plot "per-token" log probabilities, as done already in the score methods for each estimator released in the starter code. That is, take the sum of the estimator's log probabilities, and then divide by the number of words in the test set, so they are averaged "per token":
$$ \text{average-score-per-token}(x_1, \ldots x_N) = \frac{1}{N} \sum_{n=1}^N \log p( X_n = x_n | \mu) $$
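The per-token average above is a short computation; a minimal sketch (the starter code's score methods define the actual interface, so these names are illustrative) is:

```python
import numpy as np

def score_per_token(proba, word_ids):
    """Average log probability per observed token, given a probability vector
    over the vocabulary and the observed 0-based word ids."""
    logp = np.log(np.asarray(proba, dtype=float))
    return float(np.mean(logp[np.asarray(word_ids)]))
```

For instance, a uniform distribution over two words scored on any test sequence gives exactly \(\log(1/2)\) per token.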

In your report PDF, provide 1-2 complete sentences to each of the following prompts:

  • 1e: SHORT ANSWER What do you expect to happen to the heldout log likelihood performance of all estimators as the training data gets larger and larger?

  • 1f: SHORT ANSWER What heldout log likelihood performance would you get if you simply estimated a uniform probability distribution over the vocabulary? Does the ML estimator always beat this "dumb" baseline?

Problem 2

In problem 1, we set \(\alpha\) manually to a single value. We'll now explore principled ways to select the value of \(\alpha\) to optimize performance, even if we only have access to our training set.

Tasks for Code Implementation

2a: CODE Implement the calc_log_evidence method in the starter code run_model_selection.py, using the formula given above. Also edit whatever you need in the __main__ section of that script to make the figure below.

Tasks for Report PDF

2b: FIGURE In your report PDF, deliver a figure assessing model selection with 3 panels, one for each of three possible training data sizes: \(N/128\), \(N/16\), and \(N\). For each dataset size, plot the per-token log evidence of the training set (e.g. the value produced by your calc_log_evidence function, divided by the number of tokens in the training set) as a function of \(\alpha\), for the log-spaced grid of alpha values suggested in the starter code. On the same axes, overlay the "test set" per-token log probability computed by your posterior predictive estimator at each value of \(\alpha\). If the evidence is a good indicator of which \(\alpha\) to select, the two curves should have similar trends in terms of peak performance.

Below this figure in your report PDF, answer the following with 1-2 sentences each:

  • 2c: SHORT ANSWER Is maximizing the evidence function on the training set a good strategy for selecting \(\alpha\) on this dataset? Why or why not?

  • 2d: SHORT ANSWER How else could we select \(\alpha\)? (Hint: think of a common way to pick hyperparameter values you might have learned about in an intro ML class). What would be the advantage of using the evidence? What would be an advantage of the other approach?

  • 2e: SHORT ANSWER Think about the \(\epsilon\) hyperparameter of the ML Estimator (unseen_proba in the code). What would happen if we selected the value of \(\epsilon\) by maximizing the probability of the training data? How is this different than selecting \(\alpha\) using the training data?