Bayesian – Which Estimator is Better for a Random Sample From Bernoulli(p)?

bayesian, estimation, estimators, loss-functions, mathematical-statistics

Let $X_1,\dots, X_n$ be a random sample from $\mathrm{Bernoulli}(p)$. Compare the risks under squared-error loss of two estimators of $p$:

$$
\hat{p}_1=\bar{X}, \, \hat{p}_2=\frac{n\bar{X}+\alpha}{\alpha+\beta+n}
$$

where $\bar{X}$ is the sample mean, and $\alpha, \beta$ are two positive constants. Which estimator is better?


I computed the MSE of these two estimators, but I have no idea how to compare them.

I get
$$
MSE(\hat{p}_1)=Var(\hat{p}_1)+(Bias(\hat{p}_1))^2=\frac{p(1-p)}{n}
$$

and
$$
MSE(\hat{p}_2)=Var(\hat{p}_2)+(Bias(\hat{p}_2))^2=\frac{np(1-p)}{(\alpha+\beta+n)^2}+\frac{(\alpha-p(\alpha+\beta))^2}{(\alpha+\beta+n)^2}
$$

Then I take the difference between them:
$$
MSE(\hat{p}_1)-MSE(\hat{p}_2)=\frac{p(1-p)(\alpha+\beta)(\alpha+\beta+2n)-n(\alpha-p(\alpha+\beta))^2}{n(\alpha+\beta+n)^2}
$$
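(For illustration, a quick numeric check in R, with arbitrary values $\alpha = \beta = 2$ and $n = 20$ that are not part of the question, shows that the sign of this difference changes with $p$, so neither estimator dominates the other for all $p$:

alpha <- 2; beta <- 2; n <- 20   # illustrative values only
p <- seq(0.01, 0.99, by = 0.01)
mse1 <- p * (1 - p) / n
mse2 <- (n * p * (1 - p) + (alpha - p * (alpha + beta))^2) / (alpha + beta + n)^2
range(mse1 - mse2)   # spans negative and positive values: the sign depends on p

)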

Best Answer

You have been given the MLE as well as the posterior mean under a Beta$(\alpha,\beta)$ prior.

As you write (in slightly different notation, with $\theta=p$, $k=n\bar X$, $\alpha_0=\alpha$, $\beta_0=\beta$, $\hat \theta_1=\hat p_1$, $R=MSE$, etc.),
$$
\begin{align*}
R(\theta, \hat{\theta}_2) =& \, V_\theta (\hat{\theta}_2) + ({bias}_\theta (\hat{\theta}_2))^2 \\
=& \, V_\theta \left( \frac{k + \alpha_0}{\alpha_0 + \beta_0 + n} \right) + \left( E_\theta \left( \frac{k + \alpha_0}{\alpha_0 + \beta_0 + n} \right) - \theta \right)^2 \\
=& \, \frac{n\theta(1 - \theta)}{(\alpha_0 + \beta_0 + n)^2} + \left( \frac{n\theta + \alpha_0}{\alpha_0 + \beta_0 + n} - \theta \right)^2.
\end{align*}
$$
Which estimator is "better" therefore depends on the prior parameters, the sample size, and the true $\theta$. This is intuitive: a prior that puts a lot of probability mass on the region of the parameter space where the true value happens to lie can be expected to produce a good Bayesian estimator.
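A quick Monte Carlo check of this risk formula is easy to run (a sketch; the values $\theta = 0.3$, $\alpha_0 = \beta_0 = 1$, $n = 25$ are arbitrary illustrations, not from the answer):

set.seed(1)
theta <- 0.3; alpha0 <- 1; beta0 <- 1; n <- 25    # illustrative values
k <- rbinom(1e5, size = n, prob = theta)          # simulated values of k = n * xbar
theta.hat2 <- (k + alpha0) / (alpha0 + beta0 + n)
mean((theta.hat2 - theta)^2)                      # simulated risk (MSE)
n * theta * (1 - theta) / (alpha0 + beta0 + n)^2 +
  ((n * theta + alpha0) / (alpha0 + beta0 + n) - theta)^2   # analytic risk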

A well-known and interesting choice is to let $\alpha_0 = \beta_0 = \sqrt{0.25n}$. The resulting estimator is
$$
\hat{\theta}_2 = \frac{k + \sqrt{0.25n}}{n + \sqrt{n}},
$$
and its risk function then does not depend on $\theta$:
$$
R(\theta,\hat{\theta}_2) = \frac{n}{4(n + \sqrt{n})^2}.
$$
You can then easily find the region of $\theta$-values in which $\hat \theta_1$ is outperformed (i.e., has the higher risk), and check that this region shrinks with $n$; that it shrinks is intuitive, since the sample average is the asymptotically efficient MLE. A small sketch of that calculation follows below.
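For each $n$, the set of $\theta$ where $\theta(1-\theta)/n$ exceeds the constant risk $n/\{4(n+\sqrt{n})^2\}$ is an interval around $1/2$ whose endpoints solve $\theta(1-\theta) = n^2/\{4(n+\sqrt{n})^2\}$. A minimal R sketch (the vector of sample sizes is only illustrative):

region.mle.worse <- function(n) {
  cval <- n^2 / (4 * (n + sqrt(n))^2)        # threshold for theta * (1 - theta)
  hw <- sqrt(1/4 - cval)                     # half-width: roots of theta^2 - theta + cval = 0
  c(lower = 0.5 - hw, upper = 0.5 + hw)
}
sapply(c(15, 25, 30, 100), region.mle.worse)   # the interval around 1/2 narrows as n grows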

The prior is "well-known" because there is a theorem (which I think I came across in a book by Wasserman, but it is surely also available elsewhere) saying that if $\hat{\theta}$ is a Bayes rule (a decision rule that minimizes the Bayes risk, i.e., for squared loss, the integral of the MSE over $\theta$ with respect to some prior) with respect to some prior $\pi$, and if $\hat{\theta}$ has constant risk \begin{align*} R(\theta, \hat{\theta}) = c \end{align*} for some $c$, then $\hat{\theta}$ is minimax.

A rule is minimax if, among all estimators $\tilde{\theta}$ in a given class, it minimizes the maximum risk.

Formally, $\hat{\theta}$ is minimax if $$ \sup\limits_{\theta} R(\theta, \hat{\theta}) = \, \inf\limits_{\tilde{\theta}} \sup\limits_{\theta} R(\theta, \tilde{\theta}).$$
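To see why the theorem holds (the standard short argument, sketched here for completeness): for any competing rule $\tilde{\theta}$,
$$
\sup\limits_{\theta} R(\theta, \tilde{\theta}) \;\geq\; \int R(\theta, \tilde{\theta})\, \pi(\theta)\, d\theta \;\geq\; \int R(\theta, \hat{\theta})\, \pi(\theta)\, d\theta \;=\; c \;=\; \sup\limits_{\theta} R(\theta, \hat{\theta}),
$$
where the first inequality holds because a supremum dominates an average, the second because $\hat{\theta}$ minimizes the Bayes risk with respect to $\pi$, and the last two equalities because the risk of $\hat{\theta}$ is constant. Hence no rule has smaller maximum risk than $\hat{\theta}$.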

In this example,
$$
\sup\limits_{\theta} R(\theta, \hat{\theta}_1) = \max\limits_{0 \leq \theta \leq 1} \frac{\theta(1-\theta)}{n} = \frac{1}{4n}
$$
and
$$
\sup\limits_{\theta} R(\theta, \hat{\theta}_2) = \max\limits_{\theta} \frac{n}{4(n + \sqrt{n})^2} = \frac{n}{4(n + \sqrt{n})^2},
$$
and we observe that
$$
\sup\limits_{\theta} R(\theta, \hat{\theta}_2) < \sup\limits_{\theta} R(\theta, \hat{\theta}_1),
$$
since $n^2 < (n + \sqrt{n})^2$.

On the other hand, consider a uniform prior $\pi(\theta) = 1$. Then the Bayes risks are
$$
r(\pi, \hat{\theta}_1) = \int R(\theta, \hat{\theta}_1) \, d\theta = \int \frac{\theta(1-\theta)}{n} \, d\theta = \frac{1}{6n}
$$
and
$$
r(\pi, \hat{\theta}_2) = \int R(\theta, \hat{\theta}_2) \, d\theta = \frac{n}{4(n + \sqrt{n})^2}.
$$
For $n \geq 20$, $r(\pi, \hat{\theta}_2) > r(\pi, \hat{\theta}_1)$: the inequality $\frac{n}{4(n + \sqrt{n})^2} > \frac{1}{6n}$ reduces to $\sqrt{n} > 2/(\sqrt{6} - 2) \approx 4.45$, i.e. $n > 19.8$. (In the figure below, only the red dots are above the red dashes, with the ranking reversed for the other $n$.) This suggests that $\hat{\theta}_1$ is the better estimator according to this criterion and this prior.
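As a quick numeric cross-check of that threshold (a sketch using base R's uniroot; the bracketing interval c(2, 100) is arbitrary):

uniroot(function(n) n / (4 * (n + sqrt(n))^2) - 1 / (6 * n), c(2, 100))$root   # about 19.8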

Schematically:

[Figure: risk of the MLE, $\theta(1-\theta)/n$, plotted against $\theta$ for $n = 15$ (red), $25$ (blue), $30$ (green), $100$ (orange); dashed horizontal lines mark the constant (Bayes) risk of the posterior mean and dotted horizontal lines the Bayes risk of the MLE, in matching colors.]

n <- c(15, 25, 30, 100)   # sample sizes to compare

theta <- seq(0.001, 0.999, 0.001)
risk.MLE <- sapply(n, function(n) theta * (1 - theta) / n)   # risk of the MLE as a function of theta, one column per n
risk.posteriormean <- n / (4 * (n + sqrt(n))^2)              # constant risk (= Bayes risk) of the posterior mean
Bayesrisk.MLE <- 1 / (6 * n)                                 # Bayes risk of the MLE under the uniform prior

matplot(theta, risk.MLE, type = "l", col = c("red", "blue", "green", "orange"),
        lty = 1, lwd = 2, xlab = expression(theta), ylab = "risk")
abline(h = risk.posteriormean, lwd = 2, col = c("red", "blue", "green", "orange"), lty = 2)  # long dashes for the (Bayes) risk of the posterior mean
abline(h = Bayesrisk.MLE, lwd = 2, col = c("red", "blue", "green", "orange"), lty = 3)       # dots for the Bayes risk of the MLE

So to make a long answer short: it depends!
