HW1: Bayes, Beta, Dirichlet, and Unigrams


Last modified: 2020-01-25 19:02

Due date: Wed Feb 5, 2020 at 11:59pm

Status: Released

How to turn in: Submit PDF to https://www.gradescope.com/courses/82862/assignments/337392

Jump to: Problem 1   Problem 2   Problem 3

Questions?: Post to the hw1 topic on the Piazza discussion forums.

Instructions for Preparing your PDF Report

What to turn in: PDF of handwritten or typeset (LaTeX) answers

Your PDF should include (in order):

  • Your full name
  • Collaboration statement
  • About how many hours did you spend (coding, asking for help, writing it up)?
  • Problem 1 answer
  • Problem 2 answer
  • Problem 3 answer

For each problem:

Throughout this homework, we are practicing the skills necessary to derive and analyze mathematical expressions involving probability.

Each step of a mathematical derivation that you turn in should be:

  • legible
  • justified by an accompanying short phrase (e.g. "using Bayes rule" or "by the identity 2.15 in the textbook")

Solutions that lack justifications or skip key steps without showing work will receive poor marks.

Problem 1: Bayes Rule and Medical Diagnosis

You are helping some doctors develop a test for a new dangerous disease.

For brevity, you can use the following throughout this problem:

  • The number 1 to indicate "has disease", and 0 to indicate otherwise.
  • The symbol 'b' to indicate "blood is black", and 'r' to indicate "blood is red".

1a

Across the whole population, the chance that one individual has the disease is 4%.

Formalize this statement by defining a random variable "D". What is the sample space of D? What is the probability of each possible value?

1b

Doctors have developed a lab test that, each time it is run, conclusively either turns an individual's blood black or leaves it red. If the person has the disease, the probability the test turns the blood black is 100%. If the person does not have the disease, the probability the test turns the blood black is 5%.

Write the above statements using conditional probabilities with random variable "D" for disease and "T" for test result.

1c

Compute the joint probability for all possible joint configurations of D and T, writing your answers as a table with rows and columns.

  • Please order the rows as "black" then "red"
  • Please order the columns as "has disease" then "does not"

1d

For a random individual whose disease status is unknown, you run the test and observe a black color. What is the probability they have the disease?
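If you would like to sanity-check your hand-derived answers numerically, a minimal Python sketch like the one below may help. It uses made-up placeholder probabilities, NOT the values given in this problem; swap in the correct numbers to verify your own joint table and posterior.

    # Numerical Bayes-rule sanity check with hypothetical placeholder numbers
    prior_d1 = 0.10              # hypothetical P(D = 1)
    prior_d0 = 1.0 - prior_d1    # P(D = 0)
    p_black_given_d1 = 0.90      # hypothetical P(T = b | D = 1)
    p_black_given_d0 = 0.20      # hypothetical P(T = b | D = 0)

    # Joint probabilities P(T = b, D = d) via the product rule
    joint_b_d1 = p_black_given_d1 * prior_d1
    joint_b_d0 = p_black_given_d0 * prior_d0

    # Posterior P(D = 1 | T = b) via Bayes rule (normalize over D)
    posterior_d1_given_b = joint_b_d1 / (joint_b_d1 + joint_b_d0)
    print(posterior_d1_given_b)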

Problem 2: Mean for the Beta and Dirichlet

2a

Let \(\rho \in (0.0, 1.0)\) be a Beta-distributed random variable: \(\rho \sim \text{Beta}(a, b)\).

Show that \(\mathbb{E}[ \rho ] = \frac{a}{a + b}\).

Hint: You can use these identities, which hold for all \(a > 0\) and \(b > 0\):

\begin{align} \Gamma(a) &= \int_{t=0}^{\infty} e^{-t} t^{a-1} dt \\ \Gamma(a+1) &= a \Gamma(a) \\ \int_{0}^1 \rho^{a-1} (1-\rho)^{b-1} d\rho &= \frac {\Gamma(a)\Gamma(b)} {\Gamma(a+b)} \end{align}

If you are curious, the derivation of the last identity can be found in the textbook in Exercise 2.5. But you can use these identities without understanding the derivation.
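If you want a quick numerical check of the claimed mean before (or after) deriving it, a short Monte Carlo sketch in NumPy, with arbitrary example values of \(a\) and \(b\), can confirm that the sample mean matches \(a/(a+b)\):

    import numpy as np

    # Monte Carlo sanity check of E[rho] = a / (a + b) for rho ~ Beta(a, b),
    # using arbitrary example values of a and b.
    a, b = 3.0, 7.0
    rng = np.random.default_rng(0)
    samples = rng.beta(a, b, size=1_000_000)

    print("sample mean:   ", samples.mean())   # approximately 0.3
    print("claimed E[rho]:", a / (a + b))      # exactly 0.3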

2b

Let \(\mu\) be a Dirichlet-distributed random variable: \(\mu \sim \text{Dir}(a_1, \ldots a_V)\).

Show that \(\mathbb{E}[ \mu_v ] = \frac{a_v}{\sum_{w=1}^V a_w}\).

Hint: You can use the identity:

\begin{align} \int_{\mu \in \Delta^V} \mu_1^{a_1-1} \mu_2^{a_2 - 1} \ldots \mu_V^{a_V-1} d\mu &= \frac {\prod_{v=1}^V \Gamma(a_v)} {\Gamma(a_1 + a_2 + \ldots + a_V)} \end{align}

where the integral is over the \(V\)-dimensional probability simplex (the set of vectors of size \(V\) that are non-negative and sum to one), which we denote \(\Delta^V \subset \mathbb{R}^V\).
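A similar Monte Carlo check works here; the sketch below uses an arbitrary example concentration vector and confirms the sample means match \(a_v / \sum_w a_w\):

    import numpy as np

    # Monte Carlo sanity check of E[mu_v] = a_v / sum_w a_w for mu ~ Dir(a),
    # using an arbitrary example concentration vector.
    a = np.array([1.0, 2.0, 3.0, 4.0])
    rng = np.random.default_rng(0)
    samples = rng.dirichlet(a, size=1_000_000)   # shape (1_000_000, 4)

    print("sample means:  ", samples.mean(axis=0))
    print("claimed E[mu]: ", a / a.sum())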

Problem 3: Bayesian estimation for unigram probabilities

Consider a model for observing \(N\) words, \(x_1, \ldots x_N\), each one a known term in a finite vocabulary of size \(V\). Thus, each observation can be represented as an integer index into the vocabulary: \(x_n \in \{1, 2, \ldots V\}\).

We model any list of words by making the assumption that each word is conditionally independent of the other words given a parameter vector \(\mu\):

$$ p(X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N | \mu) = \prod_{n=1}^N p(X_n = x_n | \mu) $$

Here, we define \(\mu = [ \mu_1 \ldots \mu_V]\) to be a vector of \(V\) non-negative entries, where the sum of this vector equals one.

Given \(\mu\), each word comes from a Discrete distribution with PMF:

$$ p(X_n = v | \mu) = \mu_v $$

We further assume a symmetric Dirichlet prior distribution on the vector \(\mu\): \(\mu \sim \text{Dir}( \alpha, \alpha, \ldots \alpha)\), where \(\alpha > 0\).
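To build intuition for this model, the sketch below simulates the generative process just described: draw \(\mu\) from the symmetric Dirichlet prior, then draw \(N\) words conditionally independently from the Discrete distribution with parameter \(\mu\). The values of \(V\), \(\alpha\), and \(N\) are arbitrary examples, and words are zero-indexed here rather than indexed \(1, \ldots, V\).

    import numpy as np

    # Simulate the generative process: mu ~ Dir(alpha, ..., alpha), then
    # x_1, ..., x_N drawn i.i.d. from Discrete(mu). Example values only.
    V, alpha, N = 5, 0.5, 20
    rng = np.random.default_rng(0)

    mu = rng.dirichlet(alpha * np.ones(V))   # mu lives on the V-dim simplex
    words = rng.choice(V, size=N, p=mu)      # each word is in {0, ..., V-1} here
    print(mu, words)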

As shown in Section 2.2 of the textbook, when we have a Dirichlet prior \(p(\mu)\) and a Discrete likelihood \(p(X_1, \ldots X_N | \mu)\), the posterior over \(\mu\) is ALSO a Dirichlet distribution (see Eq. 2.41).

From the \(N\) observed training words, we wish to predict the identity of the next word, \(X_*\).

3a

Show that the likelihood of all \(N\) observed words can be written as:

$$ p(X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N | \mu) = \prod_{v=1}^V \mu_v^{n_v} $$

where \(n_v\) is the non-negative integer that counts how often vocabulary term \(v\) appears in the training data: \(n_v = \sum_{n=1}^N [ x_n = v ]\). Here the bracket expression is 1 if the statement inside is true, and 0 otherwise; this is commonly called Iverson bracket notation: https://en.wikipedia.org/wiki/Iverson_bracket
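As a concrete illustration of this count notation (not something you need to turn in), the sketch below computes the count vector \(n_v\) from an arbitrary example word list, using zero-indexed words:

    import numpy as np

    # Compute n_v from an observed word list via the Iverson-bracket definition.
    # Words are zero-indexed (0, ..., V-1); the list is an arbitrary example.
    V = 5
    words = np.array([0, 2, 2, 4, 1, 2, 0])   # example x_1, ..., x_N

    n = np.array([np.sum(words == v) for v in range(V)])        # n_v = sum_n [x_n = v]
    assert np.array_equal(n, np.bincount(words, minlength=V))   # equivalent shortcut
    print(n)   # [2 1 3 0 1]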

3b

Derive the probability mass function for the predictive posterior.

That is, show that after seeing the \(N\) training words, the probability of the next word \(X_*\) being vocabulary word \(v\) is:

$$ p( X_* = v | X_1 = x_1 \ldots X_N = x_N, \alpha) = \frac{n_v + \alpha}{N + V\alpha} $$

Hint: Use the definition of the posterior \(p(\mu | x_1, \ldots x_N)\) given in the textbook Eq. 2.41.
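Once you have derived the formula above, one easy sanity check is that it defines a valid PMF, i.e. it sums to one over the vocabulary. A sketch with arbitrary example counts:

    import numpy as np

    # Check that the claimed predictive PMF sums to one over the vocabulary,
    # using arbitrary example counts and alpha.
    alpha = 0.5
    n = np.array([2, 1, 3, 0, 1])               # example counts n_v
    N, V = n.sum(), n.size

    predictive = (n + alpha) / (N + V * alpha)  # claimed p(X_* = v | data, alpha)
    print(predictive, predictive.sum())         # the sum should equal 1.0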

3c

Derive the probability mass function for the joint configuration of the observed training data. That is, show that the probability of the observed \(N\) training words is:

\begin{align} p( X_1 = x_1 \ldots X_N = x_N | \alpha) &= \frac { \Gamma(V \alpha) \prod_{v=1}^V \Gamma( n_v + \alpha ) } { \Gamma(N + V \alpha ) \prod_{v=1}^V \Gamma(\alpha) } \end{align}

Note: We often call this the "evidence", as in the evidence our assumed model provides for the observed data (higher is considered better).
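If you want to evaluate this evidence formula numerically (for example, to compare it against a brute-force computation on a tiny vocabulary), it is best done in log space. A sketch using SciPy's gammaln (the log of the Gamma function) with arbitrary example counts:

    import numpy as np
    from scipy.special import gammaln

    # Evaluate the claimed evidence formula in log space for numerical stability,
    # using arbitrary example counts and alpha. gammaln(x) = log Gamma(x).
    alpha = 0.5
    n = np.array([2, 1, 3, 0, 1])   # example counts n_v
    N, V = n.sum(), n.size

    log_evidence = (gammaln(V * alpha) + gammaln(n + alpha).sum()
                    - gammaln(N + V * alpha) - V * gammaln(alpha))
    print(np.exp(log_evidence))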