The Coupon Collector Problem

This is motiviated by "collect all ten coupons and win" contests. There are n possible outcomes - how long do you have to wait until you've seen them all?

Assume that all n outcomes are equally likely.
After one observation, you've seen one outcome.
After two observations, you've seen the same outcome twice with probability 1/n and two different outcomes with probability 1 - 1/n.
After three observations, you've seen the same outcome three times with probability 1/n². To see three different outcomes, you must have seen two different outcomes on the first two observations. The conditional probability of a new outcome on the third observation is 1 - 2/n, so the probability of three different outcomes is (1 - 1/n.) * (1 - 2/n). The case of two different outcomes can be split into two subcases depending on whether the first two observations were the same or different. The formula is:

1/n * (1 - 1/n) + (1 - 1/n) * 2/n

We could extend this approach, but the formula looks like it's getting messy, so let's try another approach. The probability of seeing a new coupon only changes after each new coupon is seen. While waiting for the ith coupon (after having seen (i - 1) different coupons) the probability of a new coupon is

p_i = 1 - (i - 1)/n

so if we let X_i be the waiting time between coupons i-1 and i we get a random variable with geometric distribution and expectation n/(n - i + 1). Let X be the number of coupons bought until all coupons have been collected. X is the sum of the X_i, and (by linearity of expectation) its expectation is the sum of n/(n - i + 1). This sum is n times the nth Harmonic number, so it is approximately n * ln n

We can easily calculate this value, and we can use Markov's inequality to (for example) bound the probability that we have to wait N times this long by 1/N. Chebyshev's inequality gives a better bound, but to use it we need to calculate the variance of X. The X_i are independent, since the probabilities just depend on i and n. This means that the variance is the sum of the variances of the geometric random variables X_i.

For a geometric random variable, G, with parameter p we have seen that the expectation is 1/p. Now we need to calculate

E[G²] = sum over of k² * p( 1 - p )^k-1.

By differentiating the series used for E[G] term by term, we get (2 - p)/p², so

var( G ) = (2 - p)/p² - 1/p² = (1 - p)/p² < 1/p²

The sum we get for the variance is similar to the one for the expected value, except that each term is squared. In this case, the series converges to pi²/6, so var(X) < (n * pi)²/6