Hierarchical Bayesian – Inverse Gamma (0.001,0.001) Prior on Variance in Bayesian Hierarchical Model

hierarchical-bayesian, inverse-gamma-distribution

The eight schools data come from Gelman's 2006 paper: http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf. In Figure 1 (c), the prior density of inverse-gamma(0.001, 0.001) is overlain on the posterior histogram. The author claims that "This prior distribution is even more sharply peaked near zero and further distorts posterior inferences", which I find difficult to understand, because inverse-gamma(0.001, 0.001) puts only a very small probability near 0: e.g., $P(\sigma < 1 | \alpha =\beta=.001) = 0.006$ under the prior, whereas the posterior gives $P(\sigma < 1) = .487$. So how does this prior distort the posterior towards 0?
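A quick simulation check of the prior probability (a sketch of mine, using the standard fact that if $1/\sigma \sim$ Gamma(shape $=\alpha$, rate $=\beta$), then $\sigma \sim$ inv-Gamma($\alpha$, $\beta$)):

set.seed(1)
# Draw from inv-Gamma(0.001, 0.001) as the reciprocal of a Gamma variate
sigma <- 1 / rgamma(1e6, shape = 0.001, rate = 0.001)
mean(sigma < 1)  # roughly 0.006: almost no prior mass below 1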

UPDATE: Thanks so much for your reproducible code; I have accepted it as the answer. It appears that the prior density in the plot has been rescaled over the range (0, 30), which is why we see a spike of inv-Gamma(0.001, 0.001) near 0. But if you draw samples from inv-Gamma(0.001, 0.001), the probability that $\sigma < 1$ is quite small and most draws are very large, so I still do not understand how this prior can distort the posterior into a spike near 0. PS: the CDF of the inverse gamma distribution on the Wikipedia page is defined via the UPPER incomplete gamma function, so you should read the "uppinc" column when you run pracma::gammainc(scale / 1, shape) / gamma(shape).
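Equivalently, in base R, the same number comes from the upper tail of the gamma CDF (a one-line sketch):

# P(sigma < 1) = Q(alpha, beta / 1), the regularized upper incomplete gamma
pgamma(0.001 / 1, shape = 0.001, lower.tail = FALSE)
#> [1] 0.006312353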

Best Answer

Figure 1 in the "Priors for variances" paper compares three prior distributions for the hierarchical standard deviation, $\sigma_\alpha$, in a two-level normal hierarchical model. It illustrates the drawbacks of the supposedly non-informative inverse gamma prior on the eight schools example.

As the answers by @Eoin and @Eli point out, the three priors are ordered so that, from left to right, each prior concentrates more and more of its density near zero. In practice the inverse gamma priors say that we expect the school effects $\alpha_j$ to shrink towards 0 even before looking at the data. In other words, the uniform prior is the least informative and the inverse-gamma($\epsilon$, $\epsilon$) prior is the most informative.
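A small sketch to put numbers on this ordering (mine, not from the paper; pinvgamma is a helper I define here, using the fact that the inverse-gamma CDF is a regularized upper incomplete gamma function): restrict each prior to the range where the likelihood is non-negligible, roughly [0, 30], and ask how much of the remaining mass falls below 1.

# Inverse-gamma CDF: P(X ≤ x) = Q(shape, scale/x), the upper tail
# of the gamma CDF
pinvgamma <- function(x, shape, scale) {
  pgamma(scale / x, shape, lower.tail = FALSE)
}

# P(sigma_a < 1 | sigma_a < 30) under each of the three priors
punif(1, 0, 100) / punif(30, 0, 100)                      # uniform(0,100):         ~0.03
pinvgamma(1, 1, 1) / pinvgamma(30, 1, 1)                  # inv-gamma(1,1):         ~0.38
pinvgamma(1, 0.001, 0.001) / pinvgamma(30, 0.001, 0.001)  # inv-gamma(0.001,0.001): ~0.65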

The informativeness of a prior is to some extent defined relative to the likelihood: a prior is strongly informative if it assigns little probability to values that are consistent with the data. For the eight schools, the likelihood is close to flat over the range [0, 10], so any prior that concentrates on only a part of this range ends up being much more informative than the likelihood, that is, than the data.

This is perhaps easiest to see in the plots below, which show the prior (red curve), the likelihood (blue curve) and the posterior (histogram), because $$ \text{posterior} \propto \text{prior} \times \text{likelihood}. $$ With the uniform prior, the posterior is informed mostly by the likelihood, i.e., by the data. But with the inverse gamma prior, the posterior doesn't look much different from the prior even after observing the data from the eight schools. Why bother collecting data if we are not prepared to update our prior in the face of evidence?
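Here is a rough grid approximation of that product (a sketch, not the brms fit further down: it profiles the likelihood at $\mu = 10$ instead of integrating $\mu$ out, and the lower grid limit of 0.01 is an arbitrary cutoff):

y_j <- c(28, 8, -3, 7, -1, 1, 18, 12)
sigma_j <- c(15, 10, 16, 11, 9, 11, 10, 18)

sigma_a <- seq(0.01, 30, length.out = 3000)
lik <- sapply(sigma_a, function(s) {
  exp(sum(dnorm(y_j, mean = 10, sd = sqrt(sigma_j^2 + s^2), log = TRUE)))
})
prior <- MCMCpack::dinvgamma(sigma_a, 0.001, 0.001)
post <- prior * lik / sum(prior * lik)  # normalize over the grid
sum(post[sigma_a < 1])                  # around 0.7: the posterior piles up below 1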

Note: The likelihood for the eight schools is a function of both the overall effect $\mu$ and the hierarchical standard deviation $\sigma_\alpha$. I picked $\mu = 10$ to plot the likelihood as a function of $\sigma_\alpha$, but the choice doesn't matter much: for any value of $\mu$ in [0, 15] the likelihood supports values of $\sigma_\alpha$ in the range [0, 10].
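A quick check of that claim, reusing y_j and sigma_j from the sketch above:

# Relative likelihood at sigma_a = 0, 5, 10, 20 (rows) for mu = 0, 5, 10, 15
# (columns), each column rescaled so that its largest entry is 1
loglik_mu <- function(mu, s) sum(dnorm(y_j, mu, sqrt(sigma_j^2 + s^2), log = TRUE))
sapply(c(0, 5, 10, 15), function(mu) {
  l <- sapply(c(0, 5, 10, 20), function(s) loglik_mu(mu, s))
  round(exp(l - max(l)), 2)
})
# In every column the sigma_a ≤ 10 entries stay within the same order of
# magnitude, while the sigma_a = 20 entry drops off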

[Figure: the three priors (red), the likelihood (blue), and the posterior histograms of $\sigma_\alpha$ for the eight schools hierarchical model]

Finally, about the tail probabilities: the PS in your update is exactly right, and so is your number $P(\sigma < 1) \approx 0.006$. The inverse-gamma($\epsilon$, $\epsilon$) prior puts almost all of its probability mass at very large values; what sits near zero is its density, which peaks at the mode $\beta/(\alpha+1) \approx 0.001$ and then decays roughly like $1/\sigma_\alpha$. The density is what matters for the posterior: the likelihood is negligible beyond $\sigma_\alpha \approx 30$, so the posterior in effect renormalizes the prior over [0, 30], where the $1/\sigma_\alpha$ shape piles the mass up against zero. Conditional on $\sigma_\alpha < 30$, the prior probability of $\sigma_\alpha < 1$ is about 0.65, as the sketch earlier shows, and that is the spike in Figure 1 (c).

I'm sure you've looked up the definition: the CDF of the inverse gamma distribution with shape $\alpha$ and scale $\beta$ is $F(x;\alpha,\beta) = \Gamma(\alpha,\beta/x) / \Gamma(\alpha)$. The numerator is the upper incomplete gamma function; the denominator is the (complete) gamma function.

Let's do the computation in R with the pracma::gammainc function, which helpfully returns both the lower and the upper incomplete gamma function, so no confusion is possible. Mind the argument order: pracma::gammainc(x, a) takes the integration limit x first and the shape a second.

shape <- scale <- 0.001

# The CDF is F(x) = Γ(shape, scale/x) / Γ(shape), built from the *upper*
# incomplete gamma function, so after dividing by gamma(shape) the "uppinc"
# column is P{X ≤ x} and the "lowinc" column is P{X > x}

round(pracma::gammainc(scale / 0.1, shape) / gamma(shape), 6)
#>   lowinc   uppinc   reginc 
#> 0.995969 0.004031 0.000997
round(pracma::gammainc(scale / 1, shape) / gamma(shape), 6)
#>   lowinc   uppinc   reginc 
#> 0.993688 0.006312 0.000994

So P{X ≤ 0.1} ≈ 0.004 and P{X ≤ 1} ≈ 0.006: the prior has almost no mass below 1, exactly as you computed. It is the density, rescaled to the plotting window, that spikes at zero.

The code to reproduce the figures, in all its gory details:

library("tidyverse")
library("brms")

set.seed(2022)

y_j <- c(28, 8, -3, 7, -1, 1, 18, 12)
sigma_j <- c(15, 10, 16, 11, 9, 11, 10, 18)

schools <- tibble(
  school = letters[1:8], y_j, sigma_j
)

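# Marginal likelihood with the school effects integrated out:
# y_j ~ Normal(mu, sigma_j^2 + sigma_a^2)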
log_likelihood <- function(mu, sigma_a) {
  sum(dnorm(y_j, mean = mu, sd = sqrt(sigma_j^2 + sigma_a^2), log = TRUE))
}
likelihood <- function(mu, sigma_a) {
  exp(log_likelihood(mu, sigma_a))
}

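# Evaluate the expression `func` on a grid over [from, to] and rescale it
# so the curve sits on the same vertical scale as a probability histogram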
scaled_curve <- function(func, from = 0, to = 1, n = 1001, ...) {
  x <- seq(from = from, to = to, length.out = n)
  y <- eval(substitute(func), envir = list(x = x))
  # Scale to match a density histogram
  y <- y / sum(y, na.rm = TRUE) / ((to - from) / n) * .95
  lines(x, y, ...)
}

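# Fit the model in brms with the given prior on the school-level sd, plot
# the posterior histogram of sigma_alpha, and overlay the likelihood
# profiled at mu = 10 (the blue curve)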
schools_posterior <- function(sigma_a_prior) {
  fit <- brm(
    y_j | se(sigma_j) ~ 1 + (1 | school),
    family = gaussian,
    prior = prior_string(sigma_a_prior, class = "sd"),
    data = schools
  )
  hist(
    as_draws_array(fit, variable = "sd_school__Intercept"),
    breaks = 60, probability = TRUE, col = NULL,
    xlim = c(0, 30), xlab = expression(sigma[alpha]),
    yaxt = "n", ylab = NULL,
    main = paste("The prior is", sigma_a_prior)
  )
  scaled_curve(
    sapply(x, function(x) likelihood(10, x)),
    from = 0, to = 30, col = "blue", lwd = 2
  )
}

schools_posterior(
  "uniform(0,100)"
)
scaled_curve(
  dunif(x, min = 0, max = 100),
  from = 0, to = 30, col = "red", lwd = 2
)

schools_posterior(
  "inv_gamma(1,1)"
)
scaled_curve(
  MCMCpack::dinvgamma(x, 1, 1),
  from = 0, to = 30, col = "red", lwd = 2
)

schools_posterior(
  "inv_gamma(0.001,0.001)"
)
scaled_curve(
  MCMCpack::dinvgamma(x, 0.001, 0.001),
  from = 0, to = 30, col = "red", lwd = 2
)