Bayesian – How and Why Does MAP Converge to MLE?

bayesian, convergence, maximum-likelihood, self-study

In Kevin Murphy's "Machine Learning: A Probabilistic Perspective", chapter 3.2,
the author demonstrates Bayesian concept learning on an example called the "number game": after observing $N$ samples from $\{1,\dots,100\}$, we want to pick the hypothesis $h$ that best describes the rule that generated the samples, for example "even numbers" or "prime numbers".

The maximum a-posteriori and maximum likelihood estimates are defined as:

$$\hat h_\mathrm{MAP}={\arg\max}_h\ p(\mathcal{D}|h)p(h)={\arg\max}_h[\log p(\mathcal{D}|h)+\log p(h)],$$

$$\hat h_\mathrm{MLE}={\arg\max}_h\ p(\mathcal{D}|h)={\arg\max}_h\log p(\mathcal{D}|h),$$

where $p(h)$ is the prior probability of hypothesis $h$ and the likelihood is defined as:

$$p(\mathcal{D}|h)=\Bigg[\frac{1}{|h|}\Bigg]^N,$$

if $\mathcal{D}\subseteq h$ (and zero otherwise), i.e., it is the probability that uniform sampling with replacement from the hypothesis $h$ would yield the set $\mathcal{D}$. Intuitively this means that the likelihood is highest for the "smallest" hypotheses. For example, the hypothesis "powers of 2" explains the observations $\{2,4,8,16,64\}$ better than "even numbers" does.
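
To make these definitions concrete, here is a minimal Python sketch of the number game with a tiny, made-up hypothesis space and prior (Murphy's actual hypothesis space and prior are larger and different); it just computes the size-principle likelihood and then takes the arg max with and without the log-prior term.

```python
import numpy as np

# Toy hypothesis space for the number game: each hypothesis is the set of
# numbers in 1..100 that it allows. (Made up for this sketch; Murphy's
# hypothesis space is much larger.)
hypotheses = {
    "even numbers":   set(range(2, 101, 2)),          # |h| = 50
    "powers of 2":    {2 ** k for k in range(1, 7)},  # {2, 4, ..., 64}, |h| = 6
    "multiples of 4": set(range(4, 101, 4)),          # |h| = 25
}

# Illustrative prior over the hypotheses (not Murphy's numbers).
prior = {"even numbers": 0.6, "powers of 2": 0.2, "multiples of 4": 0.2}

D = [2, 4, 8, 16, 64]   # observed samples
N = len(D)

def log_likelihood(h, data):
    """Size principle: log p(D|h) = -N log|h| if every sample is in h, else -inf."""
    if not all(x in h for x in data):
        return -np.inf
    return -len(data) * np.log(len(h))

log_lik  = {name: log_likelihood(h, D) for name, h in hypotheses.items()}
log_post = {name: log_lik[name] + np.log(prior[name]) for name in hypotheses}

h_mle = max(log_lik,  key=log_lik.get)    # smallest hypothesis consistent with D
h_map = max(log_post, key=log_post.get)   # likelihood plus prior
print("MLE:", h_mle, " MAP:", h_map)
```

With $\mathcal{D}=\{2,4,8,16,64\}$ both estimates already pick "powers of 2" here; the quoted sentence below is about what guarantees this agreement as $N$ grows even when a strong prior initially pulls the MAP estimate elsewhere.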

All of this is clear. However, I am confused about the following sentence (even though intuitively it makes perfect sense):

Since the likelihood term depends exponentially on $N$, and the prior stays constant, as we get more and more data, the MAP estimate converges towards the maximum likelihood estimate.

It is true that the likelihood depends exponentially on $N$; however, the base being exponentiated lies in the interval $(0,1)$, and for any $x\in(0,1)$ we have $x^N \to 0$ as $N \to \infty$, so the likelihood should actually vanish.

Why does MAP converge to MLE in this case?

Best Answer

There are two issues here: first, why the MAP estimate generally (but not always) converges to the MLE, and second, the "vanishing likelihood" problem.

For the first issue, we can appeal to the Bernstein–von Mises theorem. The essence of it is that, as the sample size grows, the relative information contained in the prior and in the data shifts in favor of the data, so the posterior becomes increasingly concentrated around the MLE (the data-only estimate), and its peak actually converges to the MLE (with the usual caveat that certain regularity assumptions have to be met). See the Wikipedia page for a brief overview.
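
As an illustration of that shift in this discrete setting, the following sketch uses two made-up hypotheses and a deliberately lopsided prior (all numbers are chosen purely for illustration): the data are drawn from "multiples of 10", the prior heavily favours "even numbers", and the MAP estimate switches over to the MLE once the log-likelihood gap, which grows linearly in $N$, exceeds the fixed log-prior gap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypotheses over 1..100 (sizes 50 and 10) -- illustrative choices only.
hypotheses = {
    "even numbers":    set(range(2, 101, 2)),
    "multiples of 10": set(range(10, 101, 10)),
}

# A deliberately lopsided prior so that the prior initially dominates.
log_prior = {"even numbers": np.log(0.9999), "multiples of 10": np.log(0.0001)}

# Data genuinely drawn (with replacement) from "multiples of 10".
samples = rng.choice(sorted(hypotheses["multiples of 10"]), size=20, replace=True)

def log_lik(h, data):
    # Size principle: log p(D|h) = -N log|h| if all samples are in h, else -inf.
    return -len(data) * np.log(len(h)) if all(x in h for x in data) else -np.inf

for N in (1, 3, 5, 8, 20):
    D = samples[:N]
    ll = {name: log_lik(h, D) for name, h in hypotheses.items()}
    h_mle = max(ll, key=ll.get)
    h_map = max(ll, key=lambda k: ll[k] + log_prior[k])
    print(f"N={N:2d}  MLE: {h_mle:16s}  MAP: {h_map}")
```

With these particular numbers the crossover happens at $N=6$ (the first $N$ with $N\log 5 > \log(0.9999/0.0001)\approx 9.21$), so the printout shows MAP = "even numbers" up to $N=5$ and MAP = MLE = "multiples of 10" from $N=8$ on.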

For the second issue, the apparent paradox comes about because you have not normalized the posterior density. By Bayes' rule:

$$P(h|D) = \frac{P(D|h)\,p(h)}{P(D)}$$

and, although $P(D|h) \to 0$ as $N \to \infty$, as you observe, so does $P(D)$. For a little more concreteness, if there are only two hypotheses, $h_1$ and $h_2$, the posterior of $h_1$ is:

$$P(h_1|D) = \frac{P(D|h_1)\,p(h_1)}{P(D|h_1)\,p(h_1) + P(D|h_2)\,p(h_2)}$$

Both the numerator and the denominator contain terms raised to the power $N$, so both tend to $0$ as $N \to \infty$, but the required normalization fixes the problem this would otherwise cause: the ratio itself remains perfectly well defined.
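
As a quick numerical check, the sketch below uses two made-up per-sample likelihoods, $1/6$ and $1/50$, equal priors, and assumes the data are consistent with both hypotheses: the numerator and the normalizing constant $P(D)$ both shrink geometrically, but the normalized posterior stays perfectly well behaved.

```python
# Two hypotheses with per-sample likelihoods 1/6 and 1/50 (illustrative sizes),
# equal priors, and D assumed consistent with both hypotheses.
p1, p2 = 1 / 6, 1 / 50
prior1 = prior2 = 0.5

for N in (1, 5, 20, 50, 100):
    num = p1 ** N * prior1                      # P(D|h1) p(h1)
    p_D = p1 ** N * prior1 + p2 ** N * prior2   # normalizing constant P(D)
    print(f"N={N:3d}  numerator={num:.3e}  P(D)={p_D:.3e}  P(h1|D)={num / p_D:.6f}")

# For much larger N the individual terms underflow double precision, which is
# why in practice one works with log-probabilities and normalizes with a
# log-sum-exp (e.g. scipy.special.logsumexp).
```

Here the posterior of $h_1$ tends to $1$ simply because $h_1$ assigns each sample higher probability; the vanishing of the unnormalized terms never enters the picture.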
