Bayesian Statistics – Example of a Prior That Leads to a Non-Invariant Posterior Unlike Jeffreys Prior

bayesian, fisher-information, invariance, jeffreys-prior, mathematical-statistics

I am reposting an "answer" I gave some two weeks ago to this question: Why is the Jeffreys prior useful? It really was a question in its own right (and I did not have the rights to post comments at the time, either), so I hope it is OK to do this:

In the link above it is discussed that the interesting feature of Jeffreys prior is that, when reparameterizing the model, the resulting posterior distribution gives posterior probabilities that obey the restrictions imposed by the transformation. Say, as discussed there, when moving from the success probability $\theta$ in the Beta-Bernoulli example to the odds $\psi=\theta/(1-\theta)$, the posterior should satisfy $P(1/3\leq\theta\leq 2/3\mid X=x)=P(1/2\leq\psi\leq 2\mid X=x)$.

I wanted to create a numerical example of the invariance of Jeffreys prior under the transformation from $\theta$ to the odds $\psi$ and, more interestingly, of the lack thereof for other priors (say, Haldane, uniform, or arbitrary ones).

Now, if the posterior for the success probability is Beta (for any Beta prior, not only Jeffreys), the posterior of the odds follows a Beta distribution of the second kind (see Wikipedia) with the same parameters. Then, as the numerical example below highlights, it is not too surprising (to me, at least) that there is invariance for any choice of Beta prior (play around with alpha0_U and beta0_U), not only Jeffreys; cf. the output of the program.
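
To spell out the change of variables behind this: if $\theta\sim\mathrm{Beta}(\alpha,\beta)$ and $\psi=\theta/(1-\theta)$, then $\theta=\psi/(1+\psi)$ and $d\theta/d\psi=(1+\psi)^{-2}$, so
$$p_\psi(\psi)=\frac{1}{B(\alpha,\beta)}\left(\frac{\psi}{1+\psi}\right)^{\alpha-1}\left(\frac{1}{1+\psi}\right)^{\beta-1}\frac{1}{(1+\psi)^{2}}=\frac{\psi^{\alpha-1}(1+\psi)^{-(\alpha+\beta)}}{B(\alpha,\beta)},$$
which is the Beta distribution of the second kind with the same parameters $\alpha$ and $\beta$.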

library(GB2) 
# has the Beta density of the 2nd kind, the distribution of theta/(1-theta) if theta~Beta(alpha,beta)

theta_1 = 2/3 # a numerical example as in the above post
theta_2 = 1/3

odds_1 = theta_1/(1-theta_1) # the corresponding odds
odds_2 = theta_2/(1-theta_2)

n = 10 # some data
k = 4

alpha0_J = 1/2 # Jeffreys prior for the Beta-Bernoulli case
beta0_J = 1/2
alpha1_J = alpha0_J + k # the corresponding parameters of the posterior
beta1_J = beta0_J + n - k

alpha0_U = 0 # some other prior (alpha0 = beta0 = 0 is the improper Haldane prior)
beta0_U = 0
alpha1_U = alpha0_U + k # resulting posterior parameters for the other prior
beta1_U = beta0_U + n - k

# posterior probability that theta is between theta_1 and theta_2:
pbeta(theta_1, alpha1_J, beta1_J) - pbeta(theta_2, alpha1_J, beta1_J)
# the same for the corresponding odds, based on the Beta distribution of the second kind
pgb2(odds_1, 1, 1, alpha1_J, beta1_J) - pgb2(odds_2, 1, 1, alpha1_J, beta1_J)

# same for the other prior and resulting posterior
pbeta(theta_1, alpha1_U, beta1_U) - pbeta(theta_2, alpha1_U, beta1_U)
pgb2(odds_1, 1, 1, alpha1_U, beta1_U) - pgb2(odds_2, 1, 1, alpha1_U, beta1_U)

This brings me to the following questions:

  1. Am I making a mistake?
  2. If not, is there a result saying that there is no lack of invariance within conjugate families, or something like that? (A quick inspection suggests that I could not produce a lack of invariance in the normal-normal case either, for instance.)
  3. Do you know a (preferably simple) example in which we do get a lack of invariance?

Best Answer

Your computation seems to be verifying that, when we have a particular prior distribution $p_\theta(\theta)$, the following two procedures

  1. Compute the posterior $p_{\theta \mid D}(\theta \mid D)$
  2. Transform the aforementioned posterior into the other parametrization to obtain $p_{\psi \mid D}(\psi \mid D)$

and

  1. Transform the prior $p_\theta(\theta)$ into the other parametrization to obtain $p_\psi(\psi)$
  2. Using the prior $p_\psi(\psi)$, compute the posterior $p_{\psi \mid D}(\psi \mid D)$

lead to the same posterior for $\psi$. This will indeed always occur (caveat: as long as the transformation is such that a distribution over $\psi$ is determined by a distribution over $\theta$).
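
A one-line check of this, for a one-to-one differentiable transformation $\theta=h(\psi)$: the change of variables commutes with Bayes' rule, since
$$p_{\psi\mid D}(\psi\mid D)\propto p(D\mid\psi)\,p_\psi(\psi)=p\bigl(D\mid h(\psi)\bigr)\,p_\theta\bigl(h(\psi)\bigr)\,\bigl|h'(\psi)\bigr|\propto p_{\theta\mid D}\bigl(h(\psi)\mid D\bigr)\,\bigl|h'(\psi)\bigr|,$$
which is exactly the density obtained by computing the $\theta$-posterior first and then transforming it to the $\psi$-parametrization.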

However, this is not the point of the invariance in question. Instead, the question is whether, when we have a particular Method For Deciding The Prior, the following two procedures:

  1. Use the Method For Deciding The Prior to decide $p_\theta(\theta)$
  2. Convert that distribution into $p_\psi(\psi)$

and

  1. Use the Method For Deciding The Prior to decide $p_\psi(\psi)$

result in the same prior distribution for $\psi$. If they result in the same prior, they will indeed result in the same posterior, too (as you have verified for a couple of cases).

As mentioned in @NeilG's answer, if your Method For Deciding The Prior is 'set a uniform prior on the parameter', you will not get the same prior in the probability/odds case: the uniform prior for $\theta$ over $[0,1]$ transforms into the density $p_\psi(\psi)=(1+\psi)^{-2}$ on $[0,\infty)$, which is not uniform.
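
For a concrete example of the missing invariance (question 3 above), here is a minimal R sketch continuing the numbers from the question; the variable names are new and purely illustrative. A uniform (improper) prior chosen directly for $\psi$ corresponds to the prior $\propto(1-\theta)^{-2}$ on $\theta$ and hence to the posterior Beta$(k+1,\,n-k-1)$, whereas a uniform prior chosen for $\theta$ gives the posterior Beta$(k+1,\,n-k+1)$, so the two posterior probabilities below differ:

n = 10 # same data as in the question
k = 4
theta_1 = 2/3
theta_2 = 1/3

# rule "uniform prior", applied in the theta-parametrization: posterior Beta(k+1, n-k+1)
pbeta(theta_1, k + 1, n - k + 1) - pbeta(theta_2, k + 1, n - k + 1)

# rule "uniform prior", applied in the psi-parametrization: translated back to theta,
# the prior is proportional to (1-theta)^(-2), so the posterior is Beta(k+1, n-k-1)
pbeta(theta_1, k + 1, n - k - 1) - pbeta(theta_2, k + 1, n - k - 1)

# the two numbers disagree: the rule "use a uniform prior" is not invariant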

Instead, if your Method For Deciding The Prior is 'use Jeffreys prior for the parameter', it will not matter whether you use it for $\theta$ and then convert to the $\psi$-parametrization, or apply it to $\psi$ directly. This is the claimed invariance.
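
For the Bernoulli example this can be checked directly. The Fisher information for $\theta$ is $I(\theta)=1/(\theta(1-\theta))$, so applying Jeffreys' rule in the $\psi$-parametrization gives
$$p_\psi(\psi)\propto\sqrt{I\bigl(\theta(\psi)\bigr)\left(\frac{d\theta}{d\psi}\right)^{2}}=\sqrt{\frac{(1+\psi)^{2}}{\psi}\cdot\frac{1}{(1+\psi)^{4}}}=\psi^{-1/2}(1+\psi)^{-1},$$
which is exactly what one obtains by transforming the $\mathrm{Beta}(1/2,1/2)$ prior $p_\theta(\theta)\propto\theta^{-1/2}(1-\theta)^{-1/2}$ from $\theta$ to $\psi$.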