HW1: Beta, Dirichlet, and Estimators of Unigram Probability


Last modified: 2021-02-03 18:03

Due date: Wed. Feb. 17, 2021 at 11:59pm AoE (anywhere on Earth)

Status: RELEASED

How to turn in: Submit PDF to https://www.gradescope.com/courses/239096/assignments/972003/


Questions?: Post to the hw1 topic on the Piazza discussion forums.

Instructions for Preparing your PDF Report

What to turn in: PDF of typeset answers via LaTeX. No handwritten solutions will be accepted, so that grading can be speedy and you get prompt feedback.

Please use the official LaTeX Template: https://raw.githubusercontent.com/tufts-ml-courses/comp136-21s-assignments/master/hw1/hw1_template.tex

Your PDF should include, in order, your typeset solution to each problem.

Throughout this homework, we are practicing the skills necessary to derive and analyze mathematical expressions involving probability.

Each step of a mathematical derivation that you turn in should be:

  • legible
  • justified by an accompanying short phrase (e.g. "using Bayes rule" or "by the identity 2.15 in the textbook")

Solutions that lack justifications or skip key steps without showing work will receive poor marks.

Problem 1: Mean for the Beta and Dirichlet

1a

Let \(\rho \in (0.0, 1.0)\) be a Beta-distributed random variable: \(\rho \sim \text{Beta}(a, b)\).

Show that \(\mathbb{E}[ \rho ] = \frac{a}{a + b}\).

Hint: You can use these identities, which hold for all \(a > 0\) and \(b > 0\):

\begin{align} \Gamma(a) &= \int_{t=0}^{\infty} e^{-t} t^{a-1} dt \\ \Gamma(a+1) &= a \Gamma(a) \\ \int_{0}^1 \rho^{a-1} (1-\rho)^{b-1} d\rho &= \frac {\Gamma(a)\Gamma(b)} {\Gamma(a+b)} \end{align}

If you are curious, the derivation of the last identity can be found in the textbook in Exercise 2.5. But you can use these identities without understanding the derivation.
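
Not required for your writeup, but if you want to sanity-check the claimed mean numerically before (or after) deriving it, the following Python sketch compares the closed form against a Monte Carlo estimate. The values of \(a\) and \(b\) are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import beta

# Arbitrary illustrative hyperparameters (any a > 0, b > 0 will do).
a, b = 2.0, 5.0

# Closed-form mean we are asked to derive: a / (a + b)
analytic_mean = a / (a + b)

# Monte Carlo estimate of E[rho] from Beta(a, b) samples
rng = np.random.default_rng(0)
samples = rng.beta(a, b, size=100_000)

print(analytic_mean)      # 0.2857...
print(samples.mean())     # close to 0.2857, up to Monte Carlo noise
print(beta.mean(a, b))    # scipy's built-in mean agrees with a / (a + b)
```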

1b

Let \(\mu\) be a Dirichlet-distributed random variable: \(\mu \sim \text{Dir}(a_1, \ldots a_V)\).

Show that \(\mathbb{E}[ \mu_v ] = \frac{a_v}{\sum_{w=1}^V a_w}\).

Hint: You can use the identity:

\begin{align} \int_{\mu \in \Delta^V} \mu_1^{a_1-1} \mu_2^{a_2 - 1} \ldots \mu_V^{a_V-1} d\mu &= \frac {\prod_{v=1}^V \Gamma(a_v)} {\Gamma(a_1 + a_2 + \ldots + a_V)} \end{align}

Here, the integral is over the \(V\)-dimensional probability simplex (the set of vectors of size \(V\) whose entries are non-negative and sum to one), which we denote \(\Delta^V \subset \mathbb{R}^V\).
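
Again not required, but the same kind of numerical sanity check works here; the concentration vector below is an arbitrary illustration.

```python
import numpy as np

# Arbitrary illustrative concentration parameters (V = 4).
a = np.array([1.0, 2.0, 3.0, 4.0])

# Claimed closed form: E[mu_v] = a_v / sum_w a_w
analytic_mean = a / a.sum()

# Monte Carlo estimate from Dirichlet samples (each row sums to one)
rng = np.random.default_rng(0)
samples = rng.dirichlet(a, size=100_000)

print(analytic_mean)          # [0.1 0.2 0.3 0.4]
print(samples.mean(axis=0))   # close to the analytic values
```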

Problem 2: Bayesian estimation for unigram probabilities

Consider a model for observing \(N\) words, \(x_1, \ldots x_N\), each one a known term in a finite vocabulary of size \(V\). Thus, each observation can be represented as an integer index into the vocabulary: \(x_n \in \{1, 2, \ldots V\}\).

We model any list of words by making the assumption that each word is conditionally independent of the other words given a parameter vector \(\mu\):

$$ p(X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N | \mu) = \prod_{n=1}^N p(X_n = x_n | \mu) $$

Here, we define \(\mu = [ \mu_1 \ldots \mu_V]\) to be a vector of \(V\) non-negative entries, where the sum of this vector equals one.

Given \(\mu\), each word comes from a Discrete distribution with PMF:

$$ p(X_n = v | \mu) = \mu_v $$

We further assume a symmetric Dirichlet prior distribution on the vector \(\mu\): \(p(\mu) = \text{Dir}(\mu \mid \alpha, \alpha, \ldots \alpha)\), where \(\alpha > 0\).

As shown in Section 2.2 of the textbook, when we have a Dirichlet prior \(p(\mu)\) and a Discrete likelihood \(p(X_1, \ldots X_N | \mu)\), the posterior over \(\mu\) is ALSO a Dirichlet distribution (see Eq. 2.41).

From the \(N\) observed training words, we wish to predict the identity of the next word, \(X_*\).
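
To make the generative story concrete, here is a short, optional Python sketch of the model above; the settings \(V = 5\), \(\alpha = 0.5\), \(N = 20\) are arbitrary, and word indices run from 0 to \(V-1\) rather than 1 to \(V\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative settings.
V = 5        # vocabulary size
alpha = 0.5  # symmetric Dirichlet concentration
N = 20       # number of observed words

# Generative story: draw mu from the symmetric Dirichlet prior,
# then draw each word independently from Discrete(mu).
mu = rng.dirichlet(alpha * np.ones(V))
x = rng.choice(V, size=N, p=mu)   # word indices in {0, ..., V-1}

# Likelihood of the observed words given mu: prod_n mu_{x_n}
likelihood = np.prod(mu[x])
print(x)
print(likelihood)
```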

2a

Show that the likelihood of all \(N\) observed words can be written as:

$$ p(X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N | \mu) = \prod_{v=1}^V \mu_v^{n_v} $$

where \(n_v\) is the non-negative integer that counts how often vocabulary term \(v\) appears in the training data: \(n_v = \sum_{n=1}^N [ x_n = v ]\). Here, the bracket expression equals 1 if the statement inside is true and 0 otherwise; this is commonly called Iverson bracket notation: https://en.wikipedia.org/wiki/Iverson_bracket
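
One optional way to convince yourself the two forms of the likelihood agree (before writing the proof) is a quick numerical check; the data below are arbitrary and use 0-based word indices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative data (word indices 0..V-1 instead of 1..V).
V = 4
mu = rng.dirichlet(np.ones(V))
x = rng.choice(V, size=10, p=mu)

# Counts via the Iverson bracket: n_v = sum_n [x_n == v]
counts = np.bincount(x, minlength=V)

# Form 1: product over words of mu_{x_n}
form1 = np.prod(mu[x])
# Form 2: product over vocabulary terms of mu_v ** n_v
form2 = np.prod(mu ** counts)

print(np.isclose(form1, form2))   # True
```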

2b

Derive the probability mass function for the predictive posterior.

That is, show that after seeing the \(N\) training words, the probability of the next word \(X_*\) being vocabulary word \(v\) is:

$$ p( X_* = v | X_1 = x_1 \ldots X_N = x_N, \alpha) = \frac{n_v + \alpha}{N + V\alpha} $$

Hint: Use the definition of the posterior \(p(\mu | x_1, \ldots x_N)\) given in the textbook Eq. 2.41.
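
An optional numerical check of the claimed predictive PMF: since the posterior is Dirichlet with parameters \(n_v + \alpha\), the predictive probability should match the posterior mean of \(\mu_v\) (Problem 1b). The toy data and \(\alpha\) below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary toy training data and hyperparameter (0-based word indices).
V, alpha = 4, 2.0
x = np.array([0, 2, 2, 1, 2, 0])        # N = 6 observed words
N = len(x)
counts = np.bincount(x, minlength=V)    # n_v for each vocabulary term

# Claimed predictive PMF: (n_v + alpha) / (N + V * alpha)
predictive = (counts + alpha) / (N + V * alpha)

# Compare against the posterior mean of mu_v, estimated by sampling
# from the posterior Dir(n_1 + alpha, ..., n_V + alpha).
posterior_samples = rng.dirichlet(counts + alpha, size=100_000)

print(predictive)                        # sums to 1
print(posterior_samples.mean(axis=0))    # close to the values above
```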

2c

Derive the probability mass function for the joint configuration of the observed training data. That is, show that the probability of the observed \(N\) training words is:

\begin{align} p( X_1 = x_1 \ldots X_N = x_N | \alpha) &= \frac { \Gamma(V \alpha) \prod_{v=1}^V \Gamma( n_v + \alpha ) } { \Gamma(N + V \alpha ) \prod_{v=1}^V \Gamma(\alpha) } \end{align}

Note: We often call this the "evidence", as in the evidence our assumed model provides for the observed data (higher is considered better).
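
Optionally, you can evaluate this evidence numerically (in log space, using `scipy.special.gammaln` for stability) and compare it to a crude Monte Carlo estimate of \(\mathbb{E}_{\mu \sim p(\mu)}\big[\prod_v \mu_v^{n_v}\big]\); the toy data and \(\alpha\) below are arbitrary.

```python
import numpy as np
from scipy.special import gammaln

# Arbitrary toy training data and hyperparameter (0-based word indices).
V, alpha = 4, 2.0
x = np.array([0, 2, 2, 1, 2, 0])
N = len(x)
counts = np.bincount(x, minlength=V)

# Log of the evidence formula from 2c:
# log p(x | alpha) = log Gamma(V*alpha) - log Gamma(N + V*alpha)
#                    + sum_v [ log Gamma(n_v + alpha) - log Gamma(alpha) ]
log_evidence = (gammaln(V * alpha) - gammaln(N + V * alpha)
                + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

# Crude Monte Carlo check: evidence = E_{mu ~ prior}[ prod_v mu_v^{n_v} ]
rng = np.random.default_rng(0)
mu_samples = rng.dirichlet(alpha * np.ones(V), size=200_000)
mc_log_evidence = np.log(np.mean(np.prod(mu_samples ** counts, axis=1)))

print(log_evidence)        # exact (up to floating point)
print(mc_log_evidence)     # should be close
```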