OK, I'm assuming you mean that the fellow researcher's opinion is that the mean of the process is $\mu_0$, with the standard deviation of that opinion being $\sigma_0$ (rather than the researcher thinking the standard deviation of the process you are modelling is $\sigma_0$, since according to your question you know the process has standard deviation $\sigma$). In that case, you are specifying a Gamma prior for $\mu$ which should have $E[\mu]=\mu_0$ and $Var[\mu]=\sigma_0^2$.
Now, if $\mu \sim Gamma(\alpha,\beta)$, the mean and variance are given by:
\begin{equation} E[\mu] = \frac{\alpha}{\beta}, \qquad Var[\mu] = \frac{\alpha}{\beta^2} \end{equation}
So you need to set $\alpha/\beta = \mu_0$ and $\alpha/\beta^2 = \sigma_0^2$, and then solve for $\alpha$ and $\beta$ to get an appropriate choice of parameters to reflect the researcher's opinion in your prior; dividing the first equation by the second gives $\beta = \mu_0 / \sigma_0^2$ and hence $\alpha = \mu_0^2 / \sigma_0^2$.
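As a minimal sketch of this moment matching (the values of $\mu_0$ and $\sigma_0$ below are made up for illustration):

```r
# Moment matching: choose Gamma(alpha, beta) so the prior mean and
# variance equal the researcher's opinion. mu0 and sigma0 are made up.
mu0    <- 5      # researcher's best guess at the process mean
sigma0 <- 2      # sd of that opinion

alpha <- mu0^2 / sigma0^2   # shape
beta  <- mu0   / sigma0^2   # rate

alpha / beta     # recovers mu0, the prior mean
alpha / beta^2   # recovers sigma0^2, the prior variance
```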
So the problem looks like this (for prior $\pi(\mu)$, likelihood $f(\mathbf{x}|\mu)$ and posterior $\pi(\mu|\mathbf{x})$, assuming $n$ independent observations):
\begin{eqnarray}
\pi(\mu) &\propto& \mu^{\alpha - 1} \exp\{-\beta \mu\} \mathbb{1}_{\{\mu>0\}} \\
f(\mathbf{x}|\mu) &\propto& \exp\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\} \\
\pi(\mu|\mathbf{x}) &\propto& \mu^{\alpha - 1} \exp\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 -\beta \mu \} \mathbb{1}_{\{\mu>0\}}
\end{eqnarray}
The posterior given here is only known up to a constant of proportionality - the Gamma distribution is not conjugate for the Normal mean. So to perform inference for $\mu$ you would need a sampling-based approach; I would suggest rejection sampling here using a Gamma proposal density (see Section 2.3 of Introducing Monte Carlo Methods with R by Robert & Casella (2009) for a good introduction to rejection sampling).
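As a minimal sketch of that approach: completing the square in the exponent gives a likelihood kernel $\exp\{-\frac{n}{2\sigma^2}(\mu - \bar{x})^2\} \leq 1$, so drawing $\mu$ from the Gamma prior itself and accepting with that probability is a valid rejection sampler for the posterior. All data and hyperparameter values below are made up for illustration:

```r
# Rejection sampler for pi(mu | x) using the Gamma prior as the proposal.
# Since exp{-n (mu - xbar)^2 / (2 sigma^2)} <= 1, accepting a prior draw
# with that probability yields exact draws from the posterior.
set.seed(1)
mu0 <- 5; sigma0 <- 2                     # illustrative prior opinion
alpha <- mu0^2 / sigma0^2                 # Gamma shape
beta  <- mu0   / sigma0^2                 # Gamma rate

sigma <- 1                                # known process sd
x     <- rnorm(30, mean = 5, sd = sigma)  # fake data
n     <- length(x); xbar <- mean(x)

mu_prop <- rgamma(1e5, shape = alpha, rate = beta)   # proposals from the prior
accept  <- runif(1e5) < exp(-n * (mu_prop - xbar)^2 / (2 * sigma^2))
mu_post <- mu_prop[accept]                           # posterior draws

quantile(mu_post, c(0.025, 0.5, 0.975))              # posterior summary
```

Note that the acceptance rate falls as $n$ grows (the likelihood concentrates, so most prior draws are rejected), at which point an MCMC scheme may be preferable.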
To be clear, the conjugate prior for the mean of the Normal distribution is the Normal distribution, so if you had a prior $\mu \sim N(\mu_0,\sigma_0^2)$ you would also get a Normal posterior density that you could find analytically (see any Bayesian inference textbook for the derivation and explanation; I like Peter Hoff's A First Course in Bayesian Statistical Methods (2009)).
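For reference, the standard conjugate update in that case (known $\sigma$, $n$ observations) is:
\begin{equation}
\mu | \mathbf{x} \sim N(\mu_n, \sigma_n^2), \qquad
\sigma_n^2 = \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1}, \qquad
\mu_n = \sigma_n^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{n \bar{x}}{\sigma^2} \right)
\end{equation}
i.e. the posterior mean is a precision-weighted average of the prior mean and the sample mean.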
What has happened is that the estimates of the individual subject means have been shrunk towards the group mean, causing the std. deviation of the subject means (and consequently that of the mean of the subject means) to be "too small". This shrinkage is part and parcel of the hierarchical Bayesian approach. The `group_mu` values are the ones that you want.
You can see in a more empirical manner that the `group_mu` values are correct by comparing the width of the observed 95% credible interval from your quantile summary to the width of the confidence interval you would expect in classical statistics if you had directly observed the 50 true subject means. In this case, that's just $1.96/\sqrt{50}$, or +/- 0.28, which corresponds pretty well to the quantile results you displayed above for `group_mu`: (-0.26, +0.25). Since you can't do better than you could if you'd actually observed the 50 subject means, it follows that the credible interval shouldn't be much smaller than the confidence interval (as you're not adding any substantial prior information to the likelihood). Clearly the (-0.1, 0.09) quantiles of the alternative mean are too narrow by this comparison.
In fact, you can use the ratio of the std. deviation of the mean of the estimated subject means to the std. deviation of `group_mu` as an index of how much shrinkage has taken place; see the sketch below. (It's not the only such metric; I'm just pointing it out since you're calculating them both anyway.) A very small ratio indicates, in a heuristic sense, that the hierarchical part of the model isn't adding much, and assuming all the subject means are the same may be a plausible alternative. Conversely, if the ratio is near one, that also indicates the hierarchical part of the model isn't adding much, and just having a "fixed effects" model for the subject means would likely be almost as good - questions of modeling approaches etc. aside.
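A minimal sketch of that index, using simulated stand-ins for the MCMC output (the draw matrices, variable names, and all numbers below are placeholders, not your model's actual output):

```r
# Shrinkage index: sd of the mean of the estimated subject means,
# divided by the posterior sd of group_mu. All draws are fake stand-ins.
set.seed(2)
n_iter <- 4000; n_subj <- 50
group_mu   <- rnorm(n_iter, 0, 0.13)              # fake draws of the group mean
subject_mu <- sapply(rnorm(n_subj, 0, 0.5),       # fake per-subject offsets
                     function(off) group_mu + off + rnorm(n_iter, 0, 0.2))

subject_means <- colMeans(subject_mu)             # per-subject posterior means
ratio <- (sd(subject_means) / sqrt(n_subj)) / sd(group_mu)
ratio   # near 0: heavy shrinkage; near 1: little shrinkage
```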
Best Answer
I disagree with the way you interpret Gelman concerning the choice of the Gamma for a scale parameter. The basis of hierarchical modeling is to relate individual parameters to a common one through a structure with unknown (typically mean and variance) parameters. In this sense, using a gamma distribution for the individual variance (or a lognormal for heavier tails), conditional on the mean variance and its dispersion, looks valid to me (at least with regard to Gelman's arguments).
Gelman's criticism of the gamma for a scale parameter concerns the fact that the gamma is used to approximate the Jeffreys prior by setting its parameters to extreme values. The problem is that, depending on how extreme these values are (which is quite arbitrary), the posterior may be very different. This observation invalidates the use of this prior, at least when we don't have information to put into the prior. In his discussion, it looks to me like the gamma or inverse-gamma is never calibrated in terms of mean and variance from prior information or from a hierarchical structure. So his recommendation concerns a context quite different from yours, which, if I understand your purpose correctly, consists of using a hierarchical prior structure relating the individual variances through a structure whose mean and variance parameters are also estimated.