[Math] Showing that a Bayesian Estimator minimizes mean squared error

probability, statistics

Suppose $X \sim \text{Bin}(n, p)$. Using a $\text{Beta}(\frac{1}{2}, \frac{1}{2})$ prior, I can show that the Bayes estimator of $p$, $\Pr(p\mid x)$, is $\frac{x+\frac{1}{2}}{n+1}$ as follows:

$f(p\mid x) \propto f(x\mid p)\, f(p)$

We know that:

$f(x\mid p) = c \, p^{x}(1-p)^{n-x}$

$f(p) = c \, p^{-\frac{1}{2}}(1-p)^{-\frac{1}{2}}$

Multiplying through and combining terms we get:

$f(p\mid x) \propto p^{x-\frac{1}{2}}(1-p)^{n-x-\frac{1}{2}}$, which is the kernel of a beta distribution with parameters $\alpha = x + \frac{1}{2}$, $\beta = n - x + \frac{1}{2}$.

Then, using the formula for the mean of a beta distribution, $\frac{\alpha}{\alpha + \beta}$, we find $P(p\mid x) = \frac{x+\frac{1}{2}}{n+1}$.
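
For a concrete check of the formula, take $n = 10$ and $x = 3$ (values chosen purely for illustration): the estimate is

$\frac{3 + \frac{1}{2}}{10 + 1} = \frac{3.5}{11} \approx 0.318$,

slightly closer to $\frac{1}{2}$ than the maximum-likelihood estimate $x/n = 0.3$, reflecting the pull of the $\text{Beta}(\frac{1}{2}, \frac{1}{2})$ prior toward $\frac{1}{2}$.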

Now, my question:

How can I demonstrate that this Bayes estimator minimizes the mean squared error (i.e., that no other estimator produces a smaller mean squared error)?

Best Answer

To be pedantic, you have found $\mathbb{E}[p\mid x] = \int p \, f(p\mid x) \, dp = \frac{x+\frac12}{n+1}$.

To minimise mean-square error, your aim is to find $\hat{p}$ which minimises $\mathbb{E}[(p-\hat{p})^2 \mid x] = \int (p-\hat{p})^2 \, f(p\mid x) \, dp$. If you take the derivative with respect to $\hat{p}$ then you will find it is zero when $\hat{p}= \mathbb{E}[p\mid x]$; the second derivative is always positive, so this is a minimum.
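
Spelling out that calculation: expanding the quadratic inside the posterior expectation gives

$\mathbb{E}[(p-\hat{p})^2 \mid x] = \mathbb{E}[p^2 \mid x] - 2\hat{p}\,\mathbb{E}[p \mid x] + \hat{p}^2$

so

$\frac{d}{d\hat{p}}\, \mathbb{E}[(p-\hat{p})^2 \mid x] = -2\,\mathbb{E}[p \mid x] + 2\hat{p}$,

which is zero exactly when $\hat{p} = \mathbb{E}[p \mid x]$; the second derivative is the constant $2 > 0$, confirming a minimum.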

If instead you had tried to minimise the absolute error, i.e. $\mathbb{E}[|p-\hat{p}| \mid x]$ then you would have found the median of the posterior distribution.
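
One way to see this (differentiating under the integral sign, assuming the posterior has a density on $[0,1]$): write

$\mathbb{E}[|p-\hat{p}| \mid x] = \int_0^{\hat{p}} (\hat{p}-p)\, f(p\mid x)\, dp + \int_{\hat{p}}^1 (p-\hat{p})\, f(p\mid x)\, dp$,

so that

$\frac{d}{d\hat{p}}\, \mathbb{E}[|p-\hat{p}| \mid x] = \Pr(p < \hat{p} \mid x) - \Pr(p > \hat{p} \mid x)$,

which vanishes when $\hat{p}$ splits the posterior mass in half, i.e. at the posterior median.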
