Solved – Hyper-parameter estimation for Beta-Binomial Empirical Bayes

Tags: beta distribution, beta-binomial distribution, empirical-bayes, method of moments

I am reading a paper, "Illustrating empirical Bayes methods", in which the author uses the method of moments to derive the value of an estimate. In equation 17 the author gives the following marginal distribution

$$m(y_i\vert \lambda) = {n\choose y_i}\frac{\Gamma(2\lambda)}{[\Gamma(\lambda)]^2}\frac{\Gamma(y_i+\lambda)\Gamma(n-y_i+\lambda)}{\Gamma(n+2\lambda)}$$

Using the marginal, he obtains an estimate $\hat{\lambda}$, which appears in the posterior mean

$$E(p_i\vert y_i, \lambda) = \frac{y_i+\hat{\lambda}}{n+2\hat{\lambda}}, \qquad i=1,2$$

When $y_1=35$ and $y_2=27$, the author states that $\hat{\lambda}=15.205$. How do I find this value?

Best Answer

The hierarchical model

You don't actually need the marginal probability mass function $m()$; you only need the marginal moments of $Y$. In this tutorial, Casella (1992) assumes the following hierarchical model for a response count $Y$: $$Y|p\sim\mbox{bin}(n,p)$$ and $$p \sim \mbox{Beta}(\lambda,\lambda)$$ with $n=50$.
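As a small illustrative sketch (mine, not from Casella's tutorial), the two-stage model can be simulated directly. The value of $\lambda$, the random seed and the number of draws below are arbitrary choices.

```python
import numpy as np

# Simulate the hierarchical model: p ~ Beta(lam, lam), then Y | p ~ Bin(n, p).
# lam = 10 is an arbitrary illustrative value, not one estimated from the data.
rng = np.random.default_rng(1)
n, lam, n_sims = 50, 10.0, 200_000

p = rng.beta(lam, lam, size=n_sims)   # draw p from the symmetric Beta prior
y = rng.binomial(n, p)                # draw Y from the binomial, given each p

print(y[:5])                          # a few simulated counts
print(y.mean(), y.var(ddof=1))        # Monte Carlo marginal mean and variance of Y
```

The marginal moments printed here are exactly the quantities whose closed forms are derived below.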

Moments of the prior distribution

The Beta distribution usually has two parameters, $\alpha$ and $\beta$ say, and the mean is $\alpha/(\alpha+\beta)$. The variance is slightly more complicated; see the Wikipedia article on the Beta distribution. In this case, both parameters are the same, $\alpha=\beta=\lambda$, so the prior distribution for $p$ is symmetric with mean 1/2. The hyper-parameter $\lambda$ affects the variance of the prior distribution around 1/2, with larger values of $\lambda$ corresponding to smaller variances. In other words, $\lambda$ determines the precision (and hence the informativeness) of the prior. Specifically, the mean and variance of the prior distribution are $E(p)=1/2$ and $\mbox{var}(p)=1/\{4(2\lambda+1)\}$.

It will also be convenient later to note that $$E[p(1-p)]=E(p)-[E(p)]^2-\mbox{var}(p)=\frac12-\frac14-\frac1{4(2\lambda+1)}=\frac{\lambda}{2(2\lambda+1)}$$
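As a quick numerical check (my own sketch, not part of the derivation), these three prior moments can be verified against scipy's Beta distribution; lam = 10 is again an arbitrary value.

```python
from scipy.stats import beta

# Check the prior moments of a symmetric Beta(lam, lam) distribution
# against the closed forms quoted above.
lam = 10.0
prior = beta(lam, lam)

mean = prior.mean()                      # should equal 1/2
var = prior.var()                        # should equal 1/(4*(2*lam + 1))
e_p_one_minus_p = mean - mean**2 - var   # E[p(1-p)] = E(p) - E(p)^2 - var(p)

print(mean, 0.5)
print(var, 1 / (4 * (2 * lam + 1)))
print(e_p_one_minus_p, lam / (2 * (2 * lam + 1)))
```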

Marginal moments for $Y$

Now we can obtain the marginal moments of $Y$. It is a characteristic of empirical Bayes that the marginal distribution of the data (here, just its moments) is used to estimate the unknown parameters in the prior. The marginal mean of $Y$ is obviously $$E(Y)=E_p[E(Y|p)]=E_p(np)=n/2$$ The marginal variance of $Y$ is most easily obtained by the law of total variance: \begin{eqnarray}\mbox{var}(Y)&=&E_p \mbox{var}(Y|p) + \mbox{var}_p E(Y|p)\\ &=&E_p[ np(1-p)] + \mbox{var}_p[np]\\ &=&\frac{n\lambda}{2(2\lambda+1)}+\frac{n^2}{4(2\lambda+1)}\\ &=&\frac{n}{4}\frac{2\lambda+n}{2\lambda+1} \end{eqnarray} It might not be obvious, but $\mbox{var}(Y)$ is a decreasing function of $\lambda$: it has a maximum value of $n^2/4$ at $\lambda=0$ and decreases to a limiting value of $n/4$ as $\lambda\to\infty$.
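To make the "decreasing in $\lambda$" claim concrete, here is a small sketch (my own) that simply evaluates the marginal-variance formula at a few values of $\lambda$ with $n=50$.

```python
# var(Y) = (n/4) * (2*lam + n) / (2*lam + 1) decreases in lam,
# from n^2/4 at lam = 0 toward n/4 as lam -> infinity.
n = 50

def marginal_var(lam, n=n):
    return (n / 4) * (2 * lam + n) / (2 * lam + 1)

for lam in [0, 1, 10, 15.205, 100, 1e6]:
    print(lam, marginal_var(lam))
# lam = 0 gives 625 = n^2/4; lam = 15.205 gives about 32; very large lam approaches 12.5 = n/4.
```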

Hyper-parameter estimation

We can use the observed variance of $Y$ to estimate $\lambda$. Suppose we observe $y$-values 35 and 27. The sample variance of these two values is 32. Equating $$\mbox{var}(Y)=\frac{n}{4}\frac{2\lambda+n}{2\lambda+1}=32$$ and solving for $\lambda$ gives $\hat\lambda=15.205$.
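A minimal sketch of this moment-matching step (my own code, with a numerical root-finder standing in for the algebra):

```python
from scipy.optimize import brentq

# Solve (n/4)*(2*lam + n)/(2*lam + 1) = s2 for lam,
# where s2 is the sample variance of the two observed counts.
n = 50
y = [35, 27]
ybar = sum(y) / len(y)
s2 = sum((yi - ybar) ** 2 for yi in y) / (len(y) - 1)   # sample variance = 32

lam_hat = brentq(lambda lam: (n / 4) * (2 * lam + n) / (2 * lam + 1) - s2, 1e-9, 1e6)
print(s2, lam_hat)   # 32.0, approximately 15.205
```

Equivalently, rearranging the equation by hand gives the closed form $\hat\lambda=(n^2/4-s^2)/(2s^2-n/2)=593/39\approx 15.205$.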

Posterior Inference

Now that we have the hyper-parameter estimate $\hat\lambda$, we can proceed with Bayesian posterior inference. Given an observation $Y=y_i$, we have two possible estimators of the corresponding $p_i$: the usual maximum likelihood estimator (MLE) $\hat p_i=y_i/n$, and the prior estimate $p_0=1/2$, the mean of the prior distribution.

How should we combine these two estimators? The precision of the MLE is determined by $n$ and the precision of the prior is determined by $2\lambda$, so we weight the two estimators accordingly. The posterior estimator for $p_i$ is the weighted average of the two estimators $$E(p_i|y_i,\lambda)=w_0 p_0 + w_1 \hat{p}_i$$ with weights equal to the relative precisions $$w_0=\frac{2\lambda}{2\lambda+n}$$ and $$w_1=\frac{n}{2\lambda+n}$$ This gives $$E(p_i|y_i,\lambda)=\frac{y_i+\lambda}{n+2 \lambda}$$ and we just plug in $\lambda=\hat\lambda$.
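As a worked check (my own numbers, computed from the formula above with $\hat\lambda=15.205$ and $n=50$):

```python
# Posterior means for the two observed counts, compared with the unshrunk MLEs y/n.
n = 50
lam_hat = 15.205

for y in (35, 27):
    mle = y / n
    post_mean = (y + lam_hat) / (n + 2 * lam_hat)
    print(y, mle, round(post_mean, 4))
# y = 35: MLE 0.70 shrinks to about 0.624; y = 27: MLE 0.54 shrinks to about 0.525.
```

Both estimates are pulled towards the prior mean 1/2, the larger one more noticeably.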

Another way to interpret the prior estimator is this: it is as if we had previously observed another $2\lambda$ cases, of which $\lambda$ were successes (exactly half). We then simply pool the prior pseudo-sample with the observed sample to get $y_i+\lambda$ successes from $n+2\lambda$ cases, and that proportion is the posterior mean.
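For completeness, here is the standard conjugacy step behind this reading (background, not spelled out above): combining the $\mbox{Beta}(\lambda,\lambda)$ prior with the binomial likelihood gives $$p_i\,|\,y_i,\lambda \;\sim\; \mbox{Beta}(y_i+\lambda,\;n-y_i+\lambda),$$ and the mean of this Beta posterior is exactly $(y_i+\lambda)/(n+2\lambda)$, i.e. the pooled proportion just described.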

Interpretation

Notice what is happening here. If the prior distribution were diffuse, then the $p_i$ would vary from one observation to another, and the variance of the $y_i$ would be relatively large. If the prior distribution were very concentrated, then the $p_i$ would be very consistent and the $y_i$ would be less variable. So we can use the variance of the $y_i$ to guess what the prior precision $2\lambda$ might have been.

If the $y_i$ values are very tight, then we conclude that $\lambda$ is large and we give more weight to the prior distribution, moving all the $\hat p_i$ values towards 1/2. If the $y_i$ values are very dispersed, then we conclude that $\lambda$ is small, and we give less weight to the prior, leaving the $\hat p_i$ values more as they are. This is the essential idea of empirical Bayes, common to all empirical Bayes applications.
