How is the inverse gamma distribution related to $n$ and $\sigma$?

bayesian, conjugate-prior, prior

Given that the posterior estimate of $\sigma'^{2}$ under a normal likelihood and an inverse gamma prior on $\sigma^2$ is:

$$\sigma'^{2}\sim\textrm{IG}\left(\alpha + \frac{n}{2}, \beta +\frac{\sum_{i=1}^n{(y_i-\mu)^2}}{2}\right)$$

which, writing $n\sigma^2$ for the sum of squares $\sum_{i=1}^n{(y_i-\mu)^2}$, is equivalent to

$$\sigma'^{2}\sim\textrm{IG}\left( \frac{n}{2}, \frac{n\sigma^2}{2}\right)$$

since a weak $\textrm{IG}(\alpha, \beta)$ prior on $\sigma^2$ (i.e. $\alpha, \beta \to 0$) removes $\alpha$ and $\beta$ from eqn 1:

$$\sigma'^{2}\sim\textrm{IG}\left( \frac{n}{2}, \frac{\sum_{i=1}^n{(y_i-\mu)^2}}{2}\right)$$

It is apparent that the posterior estimate of $\sigma^2$ is a function of the sample size and the sum of squares of the likelihood. But what does this mean? There is a derivation on Wikipedia that I don't quite follow.
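For concreteness, here is a quick numerical sanity check of this reduction (a sketch using numpy; the mean, variance, sample size, and the tiny $\alpha = \beta$ values are purely illustrative):

```python
import numpy as np

# Quick sanity check of the reduction from eqn 1 to eqn 3 (illustrative values only)
rng = np.random.default_rng(0)
mu, sigma_true, n = 0.0, 2.0, 50
y = rng.normal(mu, sigma_true, size=n)
ss = np.sum((y - mu) ** 2)            # sum of squares about the known mean

alpha0, beta0 = 1e-6, 1e-6            # "weak" IG(alpha, beta) prior on sigma^2

# Posterior parameters from eqn 1
a_post, b_post = alpha0 + n / 2, beta0 + ss / 2
# Parameters from eqn 3 (prior dropped)
a_weak, b_weak = n / 2, ss / 2

print(a_post, a_weak)                 # essentially identical
print(b_post, b_weak)                 # essentially identical
```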

I have the following questions:

  1. Can I get to this second equation without invoking Bayes' rule? I am curious if there is something inherent in the parameters of an IG that is related to the mean and variance independent of the normal likelihood.
  2. Can I use the sample size and standard deviation from a previous study to estimate an informed prior on $\sigma^2$, and then update that prior with new data? This seems straightforward, but I cannot find any examples of doing so, or a rationale for why this would be a legitimate approach, other than what can be seen in the posterior.
  3. Is there a popular probability or statistics textbook that I can consult for further explanation?

Best Answer

I think it is more correct to speak of the posterior distribution of your parameter $\sigma'^{2}$ rather than its posterior estimate. For clarity of notation, I will drop the prime in $\sigma'^{2}$ in what follows.

Suppose that, given $\sigma^{-2}$, $X$ is distributed as $\mathcal{N}(0, \sigma^2)$ (I drop $\mu$ for now to make a heuristic example), and that the precision $1/\sigma^2 = \sigma^{-2}$ has a $\Gamma(\alpha, \beta)$ prior.

The pdf of $X$ given $\sigma^{-2}$ is Gaussian, i.e.

$$f(x|\sigma^{-2}) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right).$$

The joint pdf of $(X, \sigma^{-2})$, $f(x,\sigma^{-2})$, is obtained by multiplying $f(x|\sigma^{-2})$ by $g(\sigma^{-2})$, the pdf of $\sigma^{-2}$. This comes out as

$$f(x, \sigma^{-2}) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right) \frac{\beta^{\alpha}}{\Gamma(\alpha)}\exp \left(-\frac{\beta}{ \sigma^2}\right)\frac{1}{\sigma^{2(\alpha-1)}}.$$

We can group similar terms and rewrite this as follows

$$f(x, \sigma^{-2}) \propto \sigma^{-2(\alpha-1/2)} \exp\left(-\sigma^{-2} \left(\beta + x^2/2 \right)\right).$$

The posterior distribution of $\sigma^{-2}$ is by definition the pdf of $\sigma^{-2}$ given $x$, which is $f(x, \sigma^{-2}) / f(x)$ by Bayes' formula. To answer your question 1, I don't think there is a way to express $f(\sigma^{-2}|x)$ from $f(x, \sigma^{-2})$ without using Bayes' formula. Continuing with the computation, we recognize in the formula above the kernel of a $\Gamma$ density, so integrating $\sigma^{-2}$ out to get $f(x)$ is fairly easy:

$$ f(x) \propto (\beta + x^2/2)^{-(\alpha+1/2)}, $$

so by dividing we get

$$ f(\sigma^{-2}|x) \propto \left(\beta + x^2/2 \right) \left( \sigma^{-2} \left(\beta + x^2/2 \right) \right)^{\alpha-1/2} \exp\left(-\sigma^{-2} \left(\beta + x^2/2 \right)\right) \\ \propto \left( \sigma^{-2} \left(\beta + x^2/2 \right) \right)^{\alpha-1/2} \exp\left(-\sigma^{-2} \left(\beta + x^2/2 \right)\right). $$

And here in the last formula we recognize a $\Gamma$ distribution with parameters $(\alpha + 1/2, \beta + x^2/2)$.
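As a small numerical check of this single-observation result (a sketch using numpy/scipy; the particular values of $\alpha$, $\beta$ and $x$ are arbitrary), one can compare the normalized product $f(x|\sigma^{-2})\, g(\sigma^{-2})$ on a grid of precisions with the closed-form $\Gamma(\alpha + 1/2, \beta + x^2/2)$ density:

```python
import numpy as np
from scipy import stats

alpha, beta, x = 2.0, 3.0, 1.7                 # arbitrary illustrative values

# Grid over the precision sigma^{-2}
prec = np.linspace(1e-4, 10.0, 20000)
dprec = prec[1] - prec[0]

# Unnormalized posterior: Gaussian likelihood times Gamma prior on the precision
unnorm = stats.norm.pdf(x, scale=1.0 / np.sqrt(prec)) \
       * stats.gamma.pdf(prec, a=alpha, scale=1.0 / beta)
post_grid = unnorm / (unnorm.sum() * dprec)    # normalize numerically

# Closed-form posterior: shape alpha + 1/2, rate beta + x**2 / 2
post_closed = stats.gamma.pdf(prec, a=alpha + 0.5, scale=1.0 / (beta + x**2 / 2))

print(np.max(np.abs(post_grid - post_closed)))  # ~0 up to discretization error
```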

If you have an IID sample $((x_1, \sigma_1^{-2}), ..., (x_n, \sigma^{-2}_n))$, by integrating out all the $\sigma_i^{-2}$, you would get $f(x_1, ..., x_n)$ and then $f(\sigma_1^{-2}, ..., \sigma_n^{-2}|x_1, ..., x_n)$ as a product of the following terms:

$$ f(\sigma_1^{-2}, ..., \sigma_n^{-2}|x_1, ..., x_n) \propto \prod_{i=1}^n \left( \sigma_i^{-2} \left(\beta + x_i^2/2 \right) \right)^{\alpha-1/2} \exp\left(-\sigma_i^{-2} \left(\beta + x_i^2/2 \right)\right), $$

which is a product of $\Gamma$ densities. We are stuck here because of the multiplicity of the $\sigma_i^{-2}$; besides, the distribution of the mean of those independent $\Gamma$ variables is not straightforward to compute.

However, if we assume that all the observations $x_i$ share the same value of $\sigma^{-2}$ (which seems to be your case), i.e. that the value of $\sigma^{-2}$ was drawn only once from a $\Gamma(\alpha, \beta)$ and that all the $x_i$ were then drawn with that value of $\sigma^{-2}$, we obtain

$$ f(x_1, ..., x_n, \sigma^{-2}) \propto \sigma^{-2 (\alpha + n/2 - 1)} \exp\left(-\sigma^{-2} \left(\beta + \frac{1}{2} \sum_{i=1}^n x_i^2\right) \right), $$

from which we derive the posterior distribution of $\sigma^{-2}$ as your equation 1 by applying Bayes' formula.

The posterior distribution of $\sigma^{-2}$ is a $\Gamma$ that depends on $\alpha$ and $\beta$, your prior parameters, the sample size $n$ and the observed sum of squares. The prior mean of $\sigma^{-2}$ is $\alpha/\beta$ and the variance is $\alpha/\beta^2$, so if $\alpha = \beta$ and the value is very small, the prior carries very little information about $\sigma^{-2}$ because the variance becomes huge. The values being small, you can drop them from the above equations and you end up with your equation 3.
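A minimal sketch of this update (numpy/scipy, with illustrative values and a deliberately weak $\Gamma$ prior on the precision) shows the posterior mean approaching $1/\sigma^2$ and the posterior concentrating as the sample size and the sum of squares grow:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma2_true = 0.0, 4.0
alpha0, beta0 = 1e-3, 1e-3                 # weak Gamma prior on the precision

for n in (10, 100, 10000):
    x = rng.normal(mu, np.sqrt(sigma2_true), size=n)
    a = alpha0 + n / 2                     # posterior shape
    b = beta0 + np.sum((x - mu) ** 2) / 2  # posterior rate
    post = stats.gamma(a=a, scale=1 / b)   # posterior of sigma^{-2}
    print(n, post.mean(), post.std())      # mean -> 1/sigma2_true, std -> 0
```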

In that case the posterior distribution becomes independent of the prior. This formula says that the inverse of the variance has a $\Gamma$ distribution that depends only on the sample size and the sum of squares. You can show that for Gaussian variables of known mean, $S^2$, the estimator of the variance, has the same distribution, except that it is a function of the sample size and the true value of the parameter $\sigma^2$. In the Bayesian case, this is the distribution of the parameter; in the frequentist case, it is the distribution of the estimator.
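To illustrate that parallel, here is a small simulation sketch (known mean, fixed $n$, repeated samples; the numbers are arbitrary): the sampling distribution of $S^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$ matches a $\Gamma(n/2,\, n/(2\sigma^2))$ distribution, the frequentist counterpart of the $\Gamma(n/2,\, n S^2/2)$ posterior for $\sigma^{-2}$ above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 0.0, 4.0, 20, 100_000

# Repeatedly sample n Gaussians with known mean and compute S^2 for each sample
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2 = np.mean((x - mu) ** 2, axis=1)

# Frequentist theory: S^2 ~ Gamma(shape = n/2, rate = n / (2 * sigma2))
theory = stats.gamma(a=n / 2, scale=2 * sigma2 / n)
print(s2.mean(), theory.mean())   # both ~ sigma2
print(s2.var(), theory.var())     # both ~ 2 * sigma2**2 / n
```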

Regarding your question 2, you can of course use the values obtained in a previous experiment as your prior. Because we established a parallel between the Bayesian and frequentist interpretations above, we can elaborate and say that it is like computing a variance from a small sample size and then collecting more data points: you would update your estimate of the variance rather than throw away the first data points.
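In practice, that might look like the sketch below (Python; the previous study's summary and the new observations are hypothetical placeholders, and a known mean is assumed for simplicity): the earlier sample size and variance are converted into $\Gamma$ prior parameters for the precision, which the new data then update.

```python
import numpy as np

# Hypothetical summary from a previous study (known mean assumed for simplicity)
n_prev, s2_prev = 30, 2.5
alpha0 = n_prev / 2                   # prior "pseudo-count" of observations
beta0 = n_prev * s2_prev / 2          # prior sum of squares / 2

# New data from the current study (placeholder values)
mu = 0.0
y_new = np.array([0.4, -1.1, 2.3, 0.7, -0.2])
ss_new = np.sum((y_new - mu) ** 2)

# Conjugate update: posterior for sigma^{-2} is Gamma(alpha_post, rate = beta_post)
alpha_post = alpha0 + len(y_new) / 2
beta_post = beta0 + ss_new / 2
print(alpha_post, beta_post)
print("posterior mean of sigma^2:", beta_post / (alpha_post - 1))  # IG mean
```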

Regarding your question 3, I like Introduction to Mathematical Statistics by Hogg, McKean and Craig, which usually gives the details of how to derive these equations.