Last modified: 2021-02-03 18:03
Due date: Wed. Feb. 17, 2021 at 11:59pm AoE (anywhere on Earth)
Status: RELEASED
How to turn in: Submit PDF to https://www.gradescope.com/courses/239096/assignments/972003/
Questions?: Post to the hw1 topic on the Piazza discussion forums.
Instructions for Preparing your PDF Report
What to turn in: PDF of typeset answers via LaTeX. No handwritten solutions will be accepted, so that grading can be speedy and you get prompt feedback.
Please use the official LaTeX Template: https://raw.githubusercontent.com/tufts-ml-courses/comp136-21s-assignments/master/hw1/hw1_template.tex
Your PDF should include (in order):
- Your full name
- Collaboration statement
- Problem 1 answer
- Problem 2 answer
For each problem:
- Be sure to write a human-readable label for each part (e.g. "1c", "2a", and "3b")
Throughout this homework, we are practicing the skills necessary to derive and analyze mathematical expressions involving probability.
Each step of a mathematical derivation that you turn in should be:
- legible
- justified by an accompanying short phrase (e.g. "using Bayes rule" or "by the identity 2.15 in the textbook")
Solutions that lack justifications or skip key steps without showing work will receive poor marks.
Problem 1: Mean for the Beta and Dirichlet
1a
Let \(\rho \in (0.0, 1.0)\) be a Beta-distributed random variable: \(\rho \sim \text{Beta}(a, b)\).
Show that \(\mathbb{E}[ \rho ] = \frac{a}{a + b}\).
Hint: You can use these identities, which hold for all \(a > 0\) and \(b > 0\):
\[ \Gamma(a + 1) = a \, \Gamma(a) \]
\[ \int_0^1 \rho^{a-1} (1 - \rho)^{b-1} \, d\rho = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} \]
If you are curious, the derivation of the last identity can be found in the textbook in Exercise 2.5. But you can use these identities without understanding the derivation.
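As a quick numerical sanity check of the claim in 1a (not a substitute for the derivation), a Monte Carlo estimate of the Beta mean should approach \(a / (a+b)\). This sketch uses only Python's standard library, which provides `random.betavariate`:

```python
# Monte Carlo sanity check (not a proof): the empirical mean of
# Beta(a, b) samples should approach a / (a + b).
import random

def beta_mean_check(a, b, n_samples=200_000, seed=0):
    rng = random.Random(seed)
    samples = [rng.betavariate(a, b) for _ in range(n_samples)]
    empirical = sum(samples) / n_samples
    return empirical, a / (a + b)

empirical, exact = beta_mean_check(2.0, 5.0)
assert abs(empirical - exact) < 0.01
```

With 200,000 samples the Monte Carlo error is far below the 0.01 tolerance used here.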
1b
Let \(\mu\) be a Dirichlet-distributed random variable: \(\mu \sim \text{Dir}(a_1, \ldots a_V)\).
Show that \(\mathbb{E}[ \mu_v ] = \frac{a_v}{\sum_{w=1}^V a_w}\).
Hint: You can use the identity:
\[ \int_{\Delta^V} \prod_{v=1}^V \mu_v^{a_v - 1} \, d\mu = \frac{\prod_{v=1}^V \Gamma(a_v)}{\Gamma\left(\sum_{v=1}^V a_v\right)} \]
where the integral is over the \(V\)-dimensional probability simplex (the set of vectors of size \(V\) that are non-negative and sum to one). We denote this as \(\Delta^V \subset \mathbb{R}^V\).
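Analogously to 1a, the Dirichlet mean \(\mathbb{E}[\mu_v] = a_v / \sum_w a_w\) can be sanity-checked numerically. This sketch uses the standard construction of a Dirichlet sample as normalized independent Gamma draws (via stdlib `random.gammavariate`); it is a check, not a derivation:

```python
# Monte Carlo sanity check: a Dirichlet(a_1, ..., a_V) sample can be drawn
# by normalizing independent Gamma(a_v, 1) draws; the empirical mean of
# each coordinate mu_v should approach a_v / sum(a).
import random

def dirichlet_sample(a, rng):
    g = [rng.gammavariate(av, 1.0) for av in a]
    total = sum(g)
    return [gi / total for gi in g]

def dirichlet_mean_check(a, n_samples=100_000, seed=0):
    rng = random.Random(seed)
    sums = [0.0] * len(a)
    for _ in range(n_samples):
        mu = dirichlet_sample(a, rng)
        for v, mv in enumerate(mu):
            sums[v] += mv
    empirical = [s / n_samples for s in sums]
    exact = [av / sum(a) for av in a]
    return empirical, exact

emp, exact = dirichlet_mean_check([1.0, 2.0, 3.0])
assert all(abs(e - x) < 0.01 for e, x in zip(emp, exact))
```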
Problem 2: Bayesian estimation for unigram probabilities
Consider a model for observing \(N\) words, \(x_1, \ldots x_N\), each one a known term in a finite vocabulary of size \(V\). Thus, each observation can be represented as an integer index into the vocabulary: \(x_n \in \{1, 2, \ldots V\}\).
We model any list of words by making the assumption that each word is conditionally independent of the other words given a parameter vector \(\mu\):
\[ p(x_1, \ldots x_N \mid \mu) = \prod_{n=1}^N p(x_n \mid \mu) \]
Here, we define \(\mu = [ \mu_1 \ldots \mu_V]\) to be a vector of \(V\) non-negative entries, where the sum of this vector equals one.
Given \(\mu\), each word comes from a Discrete distribution with PMF:
\[ p(x_n = v \mid \mu) = \mu_v, \quad \text{for } v \in \{1, 2, \ldots V\} \]
We further assume a symmetric Dirichlet prior distribution on the vector \(\mu\): \(\mu \sim \text{Dir}( \alpha, \alpha, \ldots \alpha)\), where \(\alpha > 0\).
As shown in Section 2.2 of the textbook, when we have a Dirichlet prior \(p(\mu)\) and a Discrete likelihood \(p(X_1, \ldots X_N | \mu)\), the posterior over \(\mu\) is ALSO a Dirichlet distribution (see Eq. 2.41).
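To make the conjugate update concrete, here is a small sketch of the posterior's parameters under this model: with a symmetric \(\text{Dir}(\alpha)\) prior and observed counts \(n_v\), the posterior parameters are \(\alpha + n_v\) (this follows Eq. 2.41; the function name is just illustrative):

```python
# Sketch: with a symmetric Dir(alpha) prior over mu and observed counts n_v,
# the posterior over mu is Dir(alpha + n_1, ..., alpha + n_V).
from collections import Counter

def posterior_dirichlet_params(x, V, alpha):
    counts = Counter(x)                      # n_v for each observed term
    return [alpha + counts.get(v, 0) for v in range(1, V + 1)]

# Example: V=3 vocabulary, alpha=0.5, words indexed 1..V
params = posterior_dirichlet_params([1, 1, 3], V=3, alpha=0.5)
# params == [2.5, 0.5, 1.5]
```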
From the \(N\) observed training words, we wish to predict the identity of the next word, \(X_*\).
2a
Show that the likelihood of all \(N\) observed words can be written as:
\[ p(x_1, \ldots x_N \mid \mu) = \prod_{v=1}^V \mu_v^{n_v} \]
where \(n_v\) is the non-negative integer that counts how often vocabulary term \(v\) appears in the training data: \(n_v = \sum_{n=1}^N [ x_n = v ]\). Here the bracket expression is 1 if the expression inside is true, and 0 otherwise; this is commonly called Iverson bracket notation: https://en.wikipedia.org/wiki/Iverson_bracket
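A quick numerical check of the equivalence asked for in 2a: the per-word product from the model definition and the count-based product over vocabulary terms should agree (compared in log space for numerical stability):

```python
# Check that prod_n mu_{x_n} equals prod_v mu_v^{n_v}, in log space.
import math
from collections import Counter

def log_lik_per_word(x, mu):
    # sum_n log mu_{x_n}  (words indexed 1..V, mu indexed 0..V-1)
    return sum(math.log(mu[xn - 1]) for xn in x)

def log_lik_counts(x, mu):
    # sum_v n_v log mu_v, with n_v the count of term v in x
    counts = Counter(x)
    return sum(counts.get(v + 1, 0) * math.log(mu[v]) for v in range(len(mu)))

mu = [0.2, 0.5, 0.3]
x = [2, 2, 1, 3, 2]
assert abs(log_lik_per_word(x, mu) - log_lik_counts(x, mu)) < 1e-12
```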
2b
Derive the probability mass function for the predictive posterior.
That is, show that after seeing the \(N\) training words, the probability of the next word \(X_*\) being vocabulary word \(v\) is:
\[ p(X_* = v \mid x_1, \ldots x_N) = \frac{n_v + \alpha}{N + V \alpha} \]
Hint: Use the definition of the posterior \(p(\mu | x_1, \ldots x_N)\) given in the textbook Eq. 2.41.
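The predictive PMF requested in 2b takes the standard Dirichlet–Discrete form \((n_v + \alpha)/(N + V\alpha)\). A quick check (not a derivation) that this is a valid PMF, i.e. non-negative and summing to one over the vocabulary:

```python
# Sanity check: (n_v + alpha) / (N + V * alpha) sums to one over v = 1..V.
from collections import Counter

def predictive_posterior(x, V, alpha):
    counts = Counter(x)
    N = len(x)
    return [(counts.get(v, 0) + alpha) / (N + V * alpha) for v in range(1, V + 1)]

probs = predictive_posterior([1, 1, 3], V=3, alpha=1.0)
assert all(p >= 0 for p in probs)
assert abs(sum(probs) - 1.0) < 1e-12
```

Note that unseen terms (\(n_v = 0\)) still receive probability \(\alpha / (N + V\alpha) > 0\), which is why this is often described as a smoothed estimate.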
2c
Derive the probability mass function for the joint configuration of the observed training data. That is, show that the probability of the observed \(N\) training words is:
\[ p(x_1, \ldots x_N) = \frac{\Gamma(V\alpha)}{\Gamma(N + V\alpha)} \prod_{v=1}^V \frac{\Gamma(n_v + \alpha)}{\Gamma(\alpha)} \]
Note: We often call this the "evidence", as in the evidence our assumed model provides for the observed data (higher is considered better).
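One useful numerical cross-check for 2c (assuming the standard Gamma-function form of the evidence, \(\Gamma(V\alpha)/\Gamma(N+V\alpha) \cdot \prod_v \Gamma(n_v+\alpha)/\Gamma(\alpha)\)): by the chain rule of probability, the evidence must equal the product of sequential predictive posteriors, \(p(x_1, \ldots x_N) = \prod_n p(x_n \mid x_1, \ldots x_{n-1})\). Both are computed in log space via `math.lgamma`:

```python
# Cross-check in log space: the Gamma-function evidence formula should
# equal the chain-rule product of sequential predictive posteriors.
import math
from collections import Counter

def log_evidence_gamma(x, V, alpha):
    counts = Counter(x)
    N = len(x)
    out = math.lgamma(V * alpha) - math.lgamma(N + V * alpha)
    for v in range(1, V + 1):
        out += math.lgamma(counts.get(v, 0) + alpha) - math.lgamma(alpha)
    return out

def log_evidence_chain(x, V, alpha):
    counts = Counter()
    out = 0.0
    for n, xn in enumerate(x):
        # predictive prob of word xn after seeing n previous words
        out += math.log((counts[xn] + alpha) / (n + V * alpha))
        counts[xn] += 1
    return out

x, V, alpha = [2, 1, 2, 3, 2], 3, 0.5
assert abs(log_evidence_gamma(x, V, alpha) - log_evidence_chain(x, V, alpha)) < 1e-10
```

This agreement is also a handy way to debug your derivation: if your closed-form answer disagrees with the chain-rule product on a small example, a step is off.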