Solved – Jeffreys prior for multiple parameters

Tags: bayesian, distributions, estimation, jeffreys-prior, prior

In certain cases, the Jeffreys prior for a full multidimensional model is generally considered inadequate. This is, for example, the case in:
$$
y_i=\mu + \varepsilon_i \, ,
$$
(where $\varepsilon_i \sim N(0,\sigma^2)$, with $\mu$ and $\sigma$ unknown), where the following prior is preferred over the full Jeffreys prior $\pi(\mu,\sigma)\propto \sigma^{-2}$:
$$
p(\mu,\sigma) = \pi(\mu) \cdot \pi(\sigma) \propto \sigma^{-1}\, ,
$$
where $\pi(\mu)$ is the Jeffreys prior obtained when keeping $\sigma$ fixed (and similarly for $\pi(\sigma)$). This prior coincides with the reference prior obtained when treating $\mu$ and $\sigma$ as separate groups.
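For concreteness, here is the standard Fisher-information computation behind both priors (per observation, for the normal model above):
$$
I(\mu,\sigma)=\begin{pmatrix}\sigma^{-2} & 0\\ 0 & 2\sigma^{-2}\end{pmatrix},
\qquad
\sqrt{\det I(\mu,\sigma)}\propto \sigma^{-2}\, ,
$$
so the full Jeffreys prior is $\pi(\mu,\sigma)\propto\sigma^{-2}$, while taking one parameter at a time gives $\pi(\mu)\propto\sqrt{I_{\mu\mu}}\propto 1$ (constant in $\mu$ for fixed $\sigma$) and $\pi(\sigma)\propto\sqrt{I_{\sigma\sigma}}\propto\sigma^{-1}$, whose product is the prior above.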

Question 1: Why does treating them in separate groups make more sense than treating them in a single group (which, if I am correct, yields the full-dimensional Jeffreys prior; see [1])?


Then consider the following situation:
$$
y_i=g(x_i,\theta) +\varepsilon_i\, ,
$$
where $\theta \in \mathbb{R}^n$ is unknown, $\varepsilon_i \sim N(0,\sigma^2)$, $\sigma$ is unknown, and $g$ is a known non-linear function. In such a case, it is tempting, and in my experience sometimes fruitful, to consider the following decomposition:
$$
p(\sigma,\theta)=\pi(\sigma) \pi(\theta) \, ,
$$
where $\pi(\sigma)$ and $\pi(\theta)$ are the Jeffreys priors for the two submodels, as in the previous location-scale example.
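To make the sub-model prior $\pi(\theta)$ concrete: with $\sigma$ held fixed, the Fisher information of the regression sub-model is $I(\theta)=J(\theta)^\top J(\theta)/\sigma^2$, where $J_{ij}=\partial g(x_i,\theta)/\partial\theta_j$, so $\pi(\theta)\propto\sqrt{\det J(\theta)^\top J(\theta)}$ (the fixed $\sigma$ only contributes a constant factor). Below is a minimal numerical sketch; the decay model `g` and all names are illustrative assumptions, not part of the question:

```python
import numpy as np

def jeffreys_log_prior(theta, x, g, eps=1e-6):
    """Log of the (unnormalized) Jeffreys prior for theta in the sub-model
    y_i = g(x_i, theta) + eps_i with sigma held fixed:
    I(theta) = J(theta)^T J(theta) / sigma^2, hence
    pi(theta) ∝ sqrt(det(J^T J)) up to a factor free of theta."""
    theta = np.asarray(theta, dtype=float)
    J = np.empty((x.shape[0], theta.shape[0]))
    for j in range(theta.shape[0]):      # forward-difference Jacobian of g
        dt = np.zeros_like(theta)
        dt[j] = eps
        J[:, j] = (g(x, theta + dt) - g(x, theta)) / eps
    _, logdet = np.linalg.slogdet(J.T @ J)
    return 0.5 * logdet                  # log sqrt(det(J^T J))

# Hypothetical model for illustration: g(x, theta) = theta_0 * exp(-theta_1 * x)
g = lambda x, th: th[0] * np.exp(-th[1] * x)
x = np.linspace(0.0, 5.0, 50)
print(jeffreys_log_prior(np.array([2.0, 0.5]), x, g))
```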

Question 2: In such a situation, can we say anything about the optimality (from an information-theoretic perspective) of the derived prior $p(\sigma,\theta)$?


[1] From https://theses.lib.vt.edu/theses/available/etd-042299-095037/unrestricted/etd.pdf:

Finally, we note that Jeffreys' prior is a special case of a reference prior. Specifically, Jeffreys' prior corresponds to the reference prior in which all model parameters are treated in a single group.

Best Answer

What is optimal? There is no general and generic "optimality" result for the Jeffreys prior. It all depends on the purpose of the statistical analysis and on the loss function adopted to evaluate and compare procedures. Otherwise, $\pi(\theta,\sigma)\propto \dfrac{1}{\sigma}$ cannot be compared with $\pi(\theta,\sigma)\propto \dfrac{1}{\sigma^2}$. As I wrote in my most popular answer on X validated, there is no such thing as a best non-informative prior.
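To illustrate that a comparison requires picking a criterion first: under $\pi(\mu,\sigma)\propto\sigma^{-k}$ in the location-scale model, the marginal posterior of $\sigma^2$ is inverse-gamma with shape $(n+k-2)/2$ and scale $(n-1)s^2/2$, and the choices $k=1$ and $k=2$ yield different frequentist coverage of credible intervals. A small simulation sketch (my own illustration, not from the original answer):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma, reps = 10, 1.0, 20_000
cover = {1: 0, 2: 0}                      # prior pi(mu, sigma) ∝ sigma^{-k}
for _ in range(reps):
    y = rng.normal(0.0, sigma, n)
    A = 0.5 * (n - 1) * y.var(ddof=1)     # scale parameter: (n-1) s^2 / 2
    for k in (1, 2):
        # marginal posterior of sigma^2: InvGamma((n+k-2)/2, scale=A)
        post = stats.invgamma((n + k - 2) / 2, scale=A)
        lo, hi = post.ppf(0.025), post.ppf(0.975)
        cover[k] += lo <= sigma**2 <= hi
print({k: c / reps for k, c in cover.items()})
# k=1 attains ~0.95 (it is exactly probability matching here); k=2 falls below
```

Under this particular criterion $\sigma^{-1}$ comes out ahead, but a different loss function could rank them differently, which is precisely the point.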