Maximum a posteriori parameter estimation of normal inverse gamma distribution

maximum likelihood, parameter estimation, partial derivative, statistics

I am trying to estimate the mean and the variance of a normal distribution using maximum a posteriori (MAP) estimation. The parameters $\theta$ are found by maximizing the posterior probability (the evidence $p(X)$ does not depend on $\theta$, so the argmax can be taken over $p(X|\theta)p(\theta)$):
$$
\theta_* = {\rm argmax}_{\theta} p(\theta | X) = {\rm argmax}_{\theta} p(X| \theta) p(\theta)
$$
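For a later sanity check, the same argmax can also be found numerically. A minimal sketch (my own illustration, not part of the derivation; it assumes scipy is available) that minimizes the negative log posterior:

```python
import numpy as np
from scipy.optimize import minimize

def map_estimate(neg_log_posterior, theta0):
    """Generic numerical MAP: minimize -[log p(X|theta) + log p(theta)].
    `neg_log_posterior` maps a parameter vector to a scalar."""
    result = minimize(neg_log_posterior, np.asarray(theta0),
                      method="Nelder-Mead")
    return result.x
```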

In my case, $\theta = (\mu, \sigma^2)$, and both $\mu$ and $\sigma^2$ are unknown. $X = (x_1, \dots, x_N) \in \mathbb{R}^N$ is the vector of observed samples.

However, the derived variance does not fit the ground-truth Gaussian.

In order to obtain an expression for the posterior probability, I use a conjugate prior. Since the likelihood is a normal distribution with unknown mean and variance, the conjugate prior is the normal-inverse-gamma (NIG) distribution:
$$
p(\mu, \sigma^2) = \mbox{NIG}(\mu, \sigma^2 | \mu_0, \nu, \alpha, \beta) = \frac{\sqrt{\nu}}{\sigma \sqrt{2\pi}} \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\frac{1}{\sigma^2}\right)^{\alpha+1} \exp\left[-\frac{2\beta + \nu(\mu_0 - \mu)^2}{2\sigma^2}\right] \,\, (1)
$$
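As a sanity check, (1) factorizes as $\mathcal{N}(\mu \mid \mu_0, \sigma^2/\nu) \times \mathrm{IG}(\sigma^2 \mid \alpha, \beta)$, which is easy to verify numerically. A minimal sketch (the helper name `nig_pdf` is mine; assumes scipy):

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import norm, invgamma

def nig_pdf(mu, sigma2, mu0, nu, alpha, beta):
    """NIG density via its factorization:
    N(mu | mu0, sigma2/nu) * InvGamma(sigma2 | alpha, beta)."""
    return (norm.pdf(mu, loc=mu0, scale=np.sqrt(sigma2 / nu))
            * invgamma.pdf(sigma2, a=alpha, scale=beta))

# Direct evaluation of equation (1) agrees with the factorized form:
mu, sigma2, mu0, nu, alpha, beta = 0.3, 1.7, 0.0, 2.0, 3.0, 1.5
sigma = np.sqrt(sigma2)
direct = (np.sqrt(nu) / (sigma * np.sqrt(2 * np.pi))
          * beta**alpha / gamma(alpha)
          * sigma2**-(alpha + 1)
          * np.exp(-(2 * beta + nu * (mu0 - mu)**2) / (2 * sigma2)))
assert np.isclose(direct, nig_pdf(mu, sigma2, mu0, nu, alpha, beta))
```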

Now I am going to find the argmax of the product of the likelihood and the prior.
Writing out the probability for each measurement $x_i \in X$, I get:
$$
p(X| \theta) p(\theta) = \prod_{i=1}^N\left(\frac{\sqrt{\nu}}{\sigma \sqrt{2\pi}} \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\frac{1}{\sigma^2}\right)^{\alpha+1} \exp\left[-\frac{2\beta + \nu(x_i - \mu)^2}{2\sigma^2}\right]\right) \\
\times \frac{\sqrt{\nu}}{\sigma \sqrt{2\pi}} \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\frac{1}{\sigma^2}\right)^{\alpha+1} \exp\left[-\frac{2\beta + \nu(\mu_0 - \mu)^2}{2\sigma^2}\right] \,\, (2)
$$

Putting things together, (2) can be expressed as:
$$
L = \left(\frac{\sqrt{\nu}}{\sigma \sqrt{2\pi}} \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\frac{1}{\sigma^2}\right)^{\alpha+1}\right)^{N+1}
\exp\left[\sum_{i=1}^{N}\left(\frac{2\beta + \nu(x_i - \mu)^2}{-2\sigma^2}\right)+\frac{2\beta + \nu(\mu_0 - \mu)^2}{-2\sigma^2}\right] \, (3)
$$

To make the estimation easier, I will take the logarithm of (3). Since the logarithm is a monotonic function, it does not change the location of the maximum.
$$
{\rm log}(L) = l = (N+1){\rm log}\left(\frac{\sqrt{\nu}}{\sigma \sqrt{2\pi}} \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\frac{1}{\sigma^2}\right)^{\alpha+1}\right) + \frac{2N\beta}{-2\sigma^2} + \frac{\nu\sum_{i=1}^{N}(x_i-\mu)^2}{-2\sigma^2} + \frac{2\beta+\nu(\mu_0-\mu)^2}{-2\sigma^2} = \\
\frac{(N+1)}{2}{\rm log}\left(\frac{\nu}{2\pi\sigma^2}\right) + (N+1){\rm log}\left(\frac{\beta^\alpha}{\Gamma(\alpha)}\right) + 2(N+1)(\alpha + 1){\rm log}\left(\frac{1}{\sigma}\right) + \frac{2N\beta}{-2\sigma^2} + \frac{\nu\sum_{i=1}^{N}(x_i-\mu)^2}{-2\sigma^2} + \frac{2\beta+\nu(\mu_0-\mu)^2}{-2\sigma^2} \,\,\, (4)
$$

To find the optimal mean and variance, I will look for where their partial derivatives equal zero:
$$
\frac{\partial l}{\partial \mu} = \frac{\nu}{\sigma^2}\sum_{i=1}^{N}(x_i-\mu) + \frac{\nu}{\sigma^2} \nu(\mu_0-\mu) = \frac{\nu}{\sigma^2}\left(\sum_{i=1}^{N}x_i-N\mu+\nu\mu_0-\nu\mu\right) = 0
$$

if
$$
\nu\mu+N\mu = \sum_{i=1}^{N}x_i+\nu\mu_0
$$

thus
$$
\mu = \frac{\sum_{i=1}^{N}x_i+\nu\mu_0}{\nu+N}
$$
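This estimate is a weighted average of the observations and the prior mean, with $\nu$ acting as a pseudo-count of prior observations located at $\mu_0$. A minimal sketch of the closed form (the helper name `map_mean` is mine):

```python
import numpy as np

def map_mean(x, mu0, nu):
    # Weighted average of the data and the prior mean mu0:
    # nu behaves like a pseudo-count of prior observations.
    x = np.asarray(x)
    return (x.sum() + nu * mu0) / (nu + len(x))
```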

$$
\frac{\partial l}{\partial \sigma} = -\frac{N+1}{\sigma}-2(N+1)(\alpha+1)\frac{1}{\sigma}+\frac{2N\beta}{\sigma^3}+\frac{\nu\sum_{i=1}^{N}(x_i-\mu)^2}{\sigma^3}+\frac{2\beta}{\sigma^3}+\frac{\nu(\mu_0-\mu)^2}{\sigma^3} = \\
\frac{1}{\sigma}\left(-2N\alpha-2\alpha-3N-3+\frac{2N\beta}{\sigma^2}+\frac{\nu\sum_{i=1}^{N}(x_i-\mu)^2}{\sigma^2}+\frac{2\beta}{\sigma^2}+\frac{\nu(\mu_0-\mu)^2}{\sigma^2}\right) = 0
$$

if the expression in brackets is zero. Therefore
$$
\sigma^2 = \frac{2N\beta+\nu\sum_{i=1}^{N}(x_i-\mu)^2+2\beta+\nu(\mu_0-\mu)^2}{2N\alpha+2\alpha+3N+3}
$$

On real data, the mean is fine, but the variance does not fit the Gaussian, even though it should fit better with more data samples. See the plot of the estimated parameters from my experiment, where the maximum likelihood (ML) estimate fits but the MAP estimate does not, even though it is expected to converge to the ML estimate.

Maybe I did something wrong in the parameter derivation. Any ideas where I could have made a mistake?

Best Answer

The big bug is in which probabilities I took to express the posterior: the likelihood must be the normal density of the data, not another copy of the NIG density (posterior = likelihood * prior, NOT posterior = prior * prior).

Then the correct (unnormalized) posterior is: $$ p(X| \theta) p(\theta) = \prod_{i=1}^N\left(\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}\right) \frac{\sqrt{\nu}}{\sigma \sqrt{2\pi}} \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\frac{1}{\sigma^2}\right)^{\alpha+1} \exp\left[-\frac{2\beta + \nu(\mu_0 - \mu)^2}{2\sigma^2}\right] $$
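To double-check the algebra that follows, this unnormalized posterior can be written in code and maximized numerically. A minimal sketch (the helper name `log_posterior` is mine; assumes scipy); feeding its negative into the `map_estimate` sketch from the question should reproduce the closed-form estimates below:

```python
import numpy as np
from scipy.stats import norm, invgamma

def log_posterior(mu, sigma2, x, mu0, nu, alpha, beta):
    """log p(X | mu, sigma2) + log NIG(mu, sigma2): normal
    log-likelihood of the data plus the NIG log prior."""
    loglik = norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)).sum()
    logprior = (norm.logpdf(mu, loc=mu0, scale=np.sqrt(sigma2 / nu))
                + invgamma.logpdf(sigma2, a=alpha, scale=beta))
    return loglik + logprior
```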

After taking the logarithm of the posterior, setting each partial derivative to zero gives the parameter estimates:

mean: $$ \frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\left(\sum_{i=1}^{N}x_i-N\mu+\nu\mu_0-\nu\mu\right) = 0 $$ if $$ \mu = \frac{\sum_{i=1}^{N}x_i+\nu\mu_0}{\nu+N} $$

variance: $$ \frac{\partial l}{\partial \sigma} = \frac{1}{\sigma}\left(-N-2\alpha-3+\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{\sigma^2}+\frac{2\beta}{\sigma^2}+\frac{\nu(\mu_0-\mu)^2}{\sigma^2}\right) = 0 $$

if $$ \sigma^2 = \frac{\sum_{i=1}^{N}(x_i-\mu)^2+2\beta+\nu(\mu_0-\mu)^2}{N+2\alpha+3} $$
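Putting the two corrected closed forms together, a quick check (my own helper names and test values) that the MAP estimates approach the ML estimates (sample mean and biased sample variance) as $N$ grows, which is exactly the convergence the question was missing:

```python
import numpy as np

def map_estimates(x, mu0, nu, alpha, beta):
    """Joint MAP of (mu, sigma^2) under the NIG prior,
    per the closed forms derived above."""
    x = np.asarray(x)
    n = len(x)
    mu = (x.sum() + nu * mu0) / (nu + n)
    sigma2 = (((x - mu) ** 2).sum() + 2 * beta
              + nu * (mu0 - mu) ** 2) / (n + 2 * alpha + 3)
    return mu, sigma2

# With more samples, the prior terms are dominated by the data:
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)
print(map_estimates(x, mu0=0.0, nu=1.0, alpha=1.0, beta=1.0))
print(x.mean(), x.var())  # ML estimates for comparison
```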

Note that for both the correct and the incorrect derivation, the estimated mean is the same, which should not be the case, and that confuses me.
