Your computation seems to be verifying that, when we have a particular prior distribution $p(\theta)$, the following two procedures
- Compute the posterior $p_{\theta \mid D}(\theta \mid D)$
- Transform the aforementioned posterior into the other parametrization to obtain $p_{\psi \mid D}(\psi \mid D)$
and
- Transform the prior $p_\theta(\theta)$ into the other parametrization to obtain $p_\psi(\psi)$
- Using the prior $p_\psi(\psi)$, compute the posterior $p_{\psi \mid D}(\psi \mid D)$
lead to the same posterior for $\psi$. This will indeed always occur (caveat: as long as the transformation is such that a distribution over $\psi$ is determined by a distribution over $\theta$).
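This equivalence is easy to check numerically. Here is a sketch for the Bernoulli probability/odds case (the Beta prior parameters, data counts, and grid below are arbitrary illustration choices):

```python
import numpy as np

# Bernoulli model: theta = success probability, psi = theta/(1-theta) = odds.
# Illustrative choices: Beta(a, b) prior on theta, k successes in n trials.
a, b = 2.0, 3.0
k, n = 7, 10

psi = np.linspace(1e-3, 40, 200_000)   # grid over the odds
theta = psi / (1 + psi)                # theta(psi)
jac = 1 / (1 + psi) ** 2               # |d theta / d psi|

# Route 1: posterior in theta is Beta(a+k, b+n-k); push it through the transform
route1 = theta ** (a + k - 1) * (1 - theta) ** (b + n - k - 1) * jac

# Route 2: push the prior through the transform first, then update with the likelihood
prior_psi = theta ** (a - 1) * (1 - theta) ** (b - 1) * jac
route2 = prior_psi * theta ** k * (1 - theta) ** (n - k)

# After normalising on the grid, the two routes give the same density
route1 /= route1.sum()
route2 /= route2.sum()
print(np.max(np.abs(route1 - route2)))  # ~0 up to floating point
```

Both routes end up proportional to the same function of $\psi$, so the difference is pure floating-point noise.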
However, this is not the point of the invariance in question. Instead, the question is whether, when we have a particular Method For Deciding The Prior, the following two procedures:
- Use the Method For Deciding The Prior to decide $p_\theta(\theta)$
- Convert that distribution into $p_\psi(\psi)$
and
- Use the Method For Deciding The Prior to decide $p_\psi(\psi)$
result in the same prior distribution for $\psi$. If they result in the same prior, they will indeed result in the same posterior, too (as you have verified for a couple of cases).
As mentioned in @NeilG's answer, if your Method For Deciding The Prior is 'set uniform prior for the parameter', you will not get the same prior in the probability/odds case, as the uniform prior for $\theta$ over $[0,1]$ is not uniform for $\psi$ over $[0,\infty)$.
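A quick simulation makes this concrete (sample size and bin edges are arbitrary): pushing a uniform $\theta$ through the odds map $\psi=\theta/(1-\theta)$ yields the density $1/(1+\psi)^2$, which is anything but flat:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 1.0, 1_000_000)   # draws from the uniform prior on theta
psi = theta / (1 - theta)                  # corresponding odds

# Change of variables predicts P(psi in [l, u]) = 1/(1+l) - 1/(1+u),
# i.e. density 1/(1+psi)^2 -- clearly not uniform on [0, inf)
edges = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
empirical = np.histogram(psi, bins=edges)[0] / len(psi)
predicted = 1 / (1 + edges[:-1]) - 1 / (1 + edges[1:])
print(empirical)   # ≈ [1/3, 1/6, 1/6, 2/15], matching predicted
print(predicted)
```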
Instead, if your Method For Deciding The Prior is 'use the Jeffreys prior for the parameter', it will not matter whether you use it for $\theta$ and convert the result into the $\psi$-parametrization, or use it for $\psi$ directly. This is the claimed invariance.
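For a single Bernoulli observation this can be verified numerically, computing the Fisher information in the $\psi$-parametrization from scratch (the finite-difference step and grid are arbitrary choices):

```python
import numpy as np

# Jeffreys invariance check for one Bernoulli observation.
# theta = success probability, psi = theta/(1-theta) = odds.

def fisher_info_psi(psi, h=1e-6):
    """Fisher information of the Bernoulli model in the psi parametrization,
    I(psi) = sum_x p(x|psi) (d/dpsi log p(x|psi))^2, derivative taken numerically."""
    def loglik(x, p):
        th = p / (1 + p)
        return x * np.log(th) + (1 - x) * np.log(1 - th)
    info = 0.0
    for x in (0, 1):
        score = (loglik(x, psi + h) - loglik(x, psi - h)) / (2 * h)
        info += np.exp(loglik(x, psi)) * score ** 2
    return info

psi = np.linspace(0.1, 5, 50)
# Route A: Jeffreys prior computed directly in psi, sqrt(I(psi))
direct = np.sqrt([fisher_info_psi(p) for p in psi])
# Route B: Jeffreys prior in theta, 1/sqrt(theta(1-theta)), pushed through the transform
theta = psi / (1 + psi)
transformed = 1 / np.sqrt(theta * (1 - theta)) / (1 + psi) ** 2
print(np.max(np.abs(direct - transformed)))  # agrees up to finite-difference error
```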
I often think of it this way. In the fully Bayesian approach, we find the integral
$$p(x^*|X) = \int p(x^*|\theta) p(\theta|X) \text{ d}\theta$$
as integrating over all possible models (infinitely many, in fact), and we make a prediction taking all of these models "into consideration". As this is often intractable, we use the MAP estimate of the posterior $p(\theta|X)$, which corresponds to evaluating the same integral but this time using only an infinitely narrow slice of $p(\theta|X)$, namely at its maximum. In other words, we multiply $p(x^*|\theta)$ by a "delta distribution" located at the maximum of the posterior and integrate this to obtain the prediction.
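A sketch with made-up Beta–Bernoulli numbers shows the two predictions differing:

```python
import numpy as np

# Illustrative Beta-Bernoulli setup: Beta(a, b) prior, k successes in n trials
a, b = 2.0, 2.0
k, n = 9, 10
A, B = a + k, b + n - k        # posterior is Beta(A, B)

# Fully Bayesian predictive: integrate p(x*=1|theta) over the whole posterior.
# Done here on a grid; in closed form it equals the posterior mean A/(A+B).
grid = np.linspace(1e-6, 1 - 1e-6, 100_001)
post = grid ** (A - 1) * (1 - grid) ** (B - 1)
post /= post.sum()
full_bayes = (grid * post).sum()

# MAP plug-in: evaluate p(x*=1|theta) at the posterior mode (A-1)/(A+B-2)
map_plugin = (A - 1) / (A + B - 2)

print(full_bayes)   # ≈ 11/14 ≈ 0.786
print(map_plugin)   # = 10/12 ≈ 0.833: the single most likely model is more extreme
```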
The difference is therefore rather obvious: a fully Bayesian treatment corresponds to an infinite ensemble of models, where a given prediction $p(x|\textbf{x},\theta)$ is weighted by the model probability $p(\theta|\textbf{x})$, i.e. more likely models contribute more to the prediction. The MAP estimate of the parameters gives you the prediction from one specific model, namely the most likely one according to Bayes' theorem. Ensemble methods often yield better generalization and more accurate predictions, so the fully Bayesian treatment will often be "better" than the MAP.
Hope this helps.
Best Answer
Finding the MAP means solving the optimisation problem $$\hat\theta^\text{MAP}=\arg\max_\theta p(\theta|x)=\arg\max_\theta p(\theta)\mathfrak{f}(x|\theta)$$ Assuming the transform $\phi=f(\theta)$ is bijective with inverse $\theta(\phi)$ and differentiable, the posterior distribution of $\phi$ is $$q(\phi|x)\propto\mathfrak{f}(x|\theta(\phi))\,p(\theta(\phi))\times\left|\dfrac{\text{d}\theta(\phi)}{\text{d}\phi}\right|$$ by the Jacobian formula. Hence, if the Jacobian $$\left|\dfrac{\text{d}\theta(\phi)}{\text{d}\phi}\right|$$ is not constant in $\phi$, there is no reason for $$\hat\phi^\text{MAP}=\arg\max_\phi q(\phi|x)$$ to satisfy $$\hat\phi^\text{MAP}=f(\hat\theta^\text{MAP})$$ For instance, if $\theta$ is a real parameter and $$f(\theta)=1\big/\left(1+\exp\{-\theta\}\right)$$ then $$\theta(\phi)=\log\{\phi/[1-\phi]\}$$ and the Jacobian is $$\left|\dfrac{\text{d}\theta(\phi)}{\text{d}\phi}\right|=1\big/\left(\phi[1-\phi]\right)$$ Due to its explosive behaviour near $0$ and $1$, the posterior distribution in $\phi$ will likely have a MAP more extreme than $f(\hat\theta^\text{MAP})$.
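A numerical sketch of this effect, using an assumed stand-in posterior $p(\theta|x)=\mathcal N(1,1)$: the $\theta$-MAP is $1$, but maximising the transformed density in $\phi$ lands at a different point than $f(1)$:

```python
import numpy as np

# Stand-in posterior (assumed for illustration): p(theta|x) = N(1, 1)
theta = np.linspace(-10, 10, 400_001)
log_post_theta = -0.5 * (theta - 1) ** 2       # log posterior, up to a constant
theta_map = theta[np.argmax(log_post_theta)]   # = 1

phi = np.linspace(1e-4, 1 - 1e-4, 400_001)
theta_of_phi = np.log(phi / (1 - phi))         # inverse of the logistic map
# the transformed log density includes log|d theta/d phi| = -log(phi(1-phi))
log_post_phi = -0.5 * (theta_of_phi - 1) ** 2 - np.log(phi * (1 - phi))
phi_map = phi[np.argmax(log_post_phi)]

print(1 / (1 + np.exp(-theta_map)))   # f(theta-MAP) ≈ 0.731
print(phi_map)                        # ≈ 0.844: the MAP moved under reparametrisation
```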
To borrow an example from my book, The Bayesian Choice, the MAP associated with $$p(\theta)\propto\exp\{-|\theta|\}\qquad\mathfrak{f}(x|\theta)\propto[1+(x-\theta)^2]^{-1}$$ is always $$\hat\theta^\text{MAP}=\arg\max_\theta p(\theta|x)=0$$ whatever the value of the (single) observation $x$. Now, if we switch to the logistic reparameterisation $\phi=f(\theta)=1\big/\left(1+\exp\{-\theta\}\right)$, the MAP estimator in $\phi$ maximises $$-\left|\ln\dfrac{\phi}{1-\phi}\right|-\ln\left[1+\left(x-\ln\frac{\phi}{1-\phi}\right)^2\right]-\ln\left[\phi(1-\phi)\right]$$ which is not systematically maximised at $\phi_0=1/2$, as shown by the following graph, where $x$ ranges from $0.5$ to $2.5$ (and the maxima $\hat\phi^\text{MAP}$ drift rightwards):
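The same drift can be checked with a direct grid search (a sketch; grid sizes are arbitrary):

```python
import numpy as np

# Prior p(theta) ∝ exp(-|theta|), Cauchy likelihood f(x|theta) ∝ 1/(1+(x-theta)^2),
# logistic reparametrisation phi = 1/(1+exp(-theta))
def theta_map(x, grid=np.linspace(-20, 20, 400_001)):
    log_post = -np.abs(grid) - np.log(1 + (x - grid) ** 2)
    return grid[np.argmax(log_post)]

def phi_map(x, grid=np.linspace(1e-4, 1 - 1e-4, 100_001)):
    t = np.log(grid / (1 - grid))   # theta(phi)
    # log posterior in phi, including the -log(phi(1-phi)) Jacobian term
    log_post = -np.abs(t) - np.log(1 + (x - t) ** 2) - np.log(grid * (1 - grid))
    return grid[np.argmax(log_post)]

for x in (0.5, 1.5, 2.5):
    print(x, theta_map(x), phi_map(x))
# theta-MAP stays at 0 for every x, while phi-MAP drifts rightwards with x
```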
[Note: I have written a few blog entries on MAP estimators and their "dangers", in connection with this issue. The first difficulty is the dependence on the dominating measure.]