How to Find a Confidence Interval for a Maximum Likelihood Estimate

probability statistics

My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time. Naturally, I started to wonder how one could estimate the total number of books in their library.

Say the true number of books in the library is $N$ and the teacher picks one uniformly at random (with replacement) to give to him each week. If by week $t$ he has received a book he had read before on $x$ occasions, then I can produce a maximum likelihood estimate for the number of books in the library, following the question "How many books are in a library?".


Clarification. If the books he receives are named $A,B,C,B, A, D$ then $x$ will be $0,0,0,1,2,2$ at successive weeks.
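To make the definition of $x$ concrete, here is a minimal Python sketch of the running repeat count. The function name `repeat_counts` and the use of single-letter strings as book labels are illustrative assumptions, not part of the original question.

```python
def repeat_counts(books):
    """Return, for each week, how many times so far a previously read book was received."""
    seen = set()   # titles already read
    x = 0          # running number of repeats
    counts = []
    for book in books:
        if book in seen:
            x += 1          # this week's book was a repeat
        else:
            seen.add(book)  # first time this title appears
        counts.append(x)
    return counts


print(repeat_counts(["A", "B", "C", "B", "A", "D"]))  # [0, 0, 0, 1, 2, 2]
```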


However, is there a mathematical formula as a function of $t$ and $x$ which will give me a 95% confidence interval for this estimate?

Best Answer

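I'll use the framework of the library book problem. Let $K$ be the total sample size (the number of weeks, $t$ in the question), $N$ be the number of different items observed (so $N = t - x$), $N_1$ be the number of items seen exactly once, $N_2$ be the number of items seen exactly twice, $A=N_1\left(1-{N_1 \over K}\right)+2N_2,$ and $\hat Q = {N_1 \over K}.$ Note that $N_1$ and $N_2$ are needed in addition to $t$ and $x$. The quantity $1-\hat Q$ is the Good-Turing estimate of the sample coverage, i.e. the probability that the next draw is an item already seen.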

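Under uniform sampling the coverage equals $N/M$, where $M$ is the total population size, so dividing $N$ by the endpoints of an approximate 95% confidence interval for the coverage gives an approximate 95% confidence interval on $M$:

$$\hat M_{\text{Lower}}={N \over {1-\hat Q+{1.96 \sqrt{A} \over K} }} $$

$$\hat M_{\text{Upper}}={N \over {1-\hat Q-{1.96 \sqrt{A} \over K} }} $$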

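As noted in the discussion of the library problem, the upper bound will at times be infinite, especially for small samples: this happens whenever the denominator $1-\hat Q-{1.96 \sqrt{A} \over K}$ is zero or negative. Similarly, the lower bound can fall below $N$, the number of different items already observed, in which case it should be raised to $N$.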
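The calculation can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `population_interval` is hypothetical, the bounds are taken as $N$ divided by the coverage bounds (on the assumption that coverage equals $N/M$ under uniform sampling), and the handling of the degenerate cases is one possible choice rather than something prescribed by the reference.

```python
import math
from collections import Counter


def population_interval(books, z=1.96):
    """Approximate confidence interval for the population size M.

    `books` is the sequence of items received; z=1.96 gives roughly 95%.
    Returns (lower, upper); upper may be math.inf.
    """
    if not books:
        raise ValueError("need at least one observation")
    counts = Counter(books)
    K = len(books)                                   # total sample size
    N = len(counts)                                  # distinct items observed
    N1 = sum(1 for c in counts.values() if c == 1)   # items seen exactly once
    N2 = sum(1 for c in counts.values() if c == 2)   # items seen exactly twice

    Q = N1 / K
    A = N1 * (1 - N1 / K) + 2 * N2
    margin = z * math.sqrt(A) / K
    coverage = 1 - Q                                 # estimated sample coverage

    # Upper bound: infinite when the denominator is not positive.
    upper_den = coverage - margin
    upper = N / upper_den if upper_den > 0 else math.inf

    # Lower bound: never below N, the number of items already seen.
    # Falling back to N when the denominator is zero is an assumption.
    lower_den = coverage + margin
    lower = max(N, N / lower_den) if lower_den > 0 else float(N)
    return lower, upper


lower, upper = population_interval(["A", "B", "C", "B", "A", "D"])
print(lower, upper)
```

For the six-week sequence from the question the sample is too small to bound the library from above: the upper denominator is negative, so the upper bound is infinite, and the lower bound is raised to the four distinct books already seen.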

This approach is due to Good and Turing. A reference with the confidence interval is Esty, The Annals of Statistics, 1983.
