Fisher Information – How to Approximate Fisher Information Matrix of Multivariate Normal Distribution

fisher-information, matrix, multivariate-normal-distribution

For $d\geq 2$, consider the $d$-dimensional multivariate normal distribution $\mathcal N(x|\mu,\Sigma)$ whose log-density is given by
$$
l(x;\mu,\Sigma)=-\frac{d}{2}\log(2\pi)-\frac{1}{2}\log|\Sigma|-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)
$$

Here $\mu=(\mu_1,\ldots,\mu_d)^\top$ and $\Sigma=(\sigma_{ij})_{d\times d}=\Sigma^\top$. I know that, using identities from the Matrix Cookbook,
$$
\nabla_{\mu,\Sigma}\, l =\left(\frac{\partial l}{\partial\mu},\frac{\partial l}{\partial\Sigma}\right)\equiv l_1(x;\mu,\Sigma)=\left( \Sigma^{-1}(x-\mu),\; -\frac{1}{2}\Sigma^{-1}+\frac{1}{2}\Sigma^{-1}(x-\mu)(x-\mu)^\top\Sigma^{-1} \right).
$$

The Fisher information matrix:
The Fisher information is defined as the covariance matrix of the score; since the score has mean zero, in this case it is given by

$$
\begin{align}
F&=\mathbb{E}_{X \sim N(\mu,\Sigma)}[l_1(X;\mu,\Sigma) l^\top_1(X;\mu,\Sigma)]
\\&\approx\frac{1}{S}\sum_{i=1}^Sl_1(x_i;\mu,\Sigma)l^\top_1(x_i;\mu,\Sigma)\in\mathbb{R}^{[d+d(d+1)/2]\times [d+d(d+1)/2]},\quad x_i\sim N(x|\mu,\Sigma)
\end{align}
$$

Note that I am aware there is a closed form for $F$, but I just want to use the approximation $\frac{1}{S}\sum_{i=1}^Sl_1(x_i;\mu,\Sigma)l_1^\top(x_i;\mu,\Sigma)$ to test its accuracy.

My question: Since $l_1(x;\mu,\Sigma)$ mixes a vector and a matrix, how can I compute the sum $\frac{1}{S}\sum_{i=1}^Sl_1(x_i;\mu,\Sigma)l_1^\top(x_i;\mu,\Sigma)$, or in particular the factor $l_1(x_i;\mu,\Sigma)l^\top_1(x_i;\mu,\Sigma)$? Is there a neat way to do it?

Best Answer

Edit: part of this answer refers to an old version of the question, which the OP has since fixed. I keep it nonetheless, since the same point may confuse other users, too.

The Fisher information, as given in the old version of the question: $$ F=\mathbb{E}[l_1^\top l_1]\approx\frac{1}{S}\sum_{i=1}^Sl_1^\top(x_i)l_1(x_i)\in\mathbb{R}^{d\times d},\quad x_i\sim N(x|\mu,\Sigma) $$

Your confusion seems to stem from assuming that the parameter space of a $d$-dimensional normal distribution $N(\mu,\Sigma)$ also has dimension $d$. If you do not put any restrictions on $\mu$ and $\Sigma$, you have more than $d$ free parameters: $d$ parameters specify the mean vector $\mu$ alone, on top of which you have up to $d(d+1)/2$ parameters for the covariance matrix $\Sigma$ (they cannot all vary completely freely, though, since $\Sigma$ must be positive definite).

You need to express all parameters in your model as a single vector, and then write the likelihood as a function of the elements of that parameter vector. When you take the derivative of the log-likelihood, you do this with respect to the $p$ elements of the actual parameter vector of your model. The resulting Fisher information is a $p\times p$ matrix, not a $d\times d$ matrix.
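To make this concrete for the unrestricted model in the question: stack $\mu$ and the $d(d+1)/2$ free entries of $\Sigma$ into one vector, flatten the score accordingly, and average the outer products. Below is a minimal NumPy sketch (the names `score_flat` and `fisher_mc` are my own, not from the post). One subtlety: the Matrix Cookbook derivative quoted in the question treats all $d^2$ entries of $\Sigma$ as free, so by the chain rule the off-diagonal entries of that block must be doubled when $\sigma_{ij}=\sigma_{ji}$ counts as a single parameter.

```python
import numpy as np

def score_flat(x, mu, Sigma_inv):
    """Flatten l_1(x; mu, Sigma) into one vector of length d + d(d+1)/2.
    Off-diagonal entries of the Sigma-block are doubled because
    sigma_ij = sigma_ji counts as a single parameter (chain rule)."""
    d = mu.shape[0]
    r = Sigma_inv @ (x - mu)                     # dl/dmu = Sigma^{-1}(x - mu)
    G = -0.5 * Sigma_inv + 0.5 * np.outer(r, r)  # dl/dSigma, entries treated as free
    i, j = np.triu_indices(d)
    w = np.where(i == j, 1.0, 2.0)
    return np.concatenate([r, w * G[i, j]])

def fisher_mc(mu, Sigma, S=200_000, seed=0):
    """Monte Carlo estimate F ~ (1/S) * sum_i l_1(x_i) l_1(x_i)^T."""
    mu, Sigma = np.asarray(mu, float), np.asarray(Sigma, float)
    rng = np.random.default_rng(seed)
    Sigma_inv = np.linalg.inv(Sigma)
    X = rng.multivariate_normal(mu, Sigma, size=S)
    scores = np.stack([score_flat(x, mu, Sigma_inv) for x in X])
    return scores.T @ scores / S
```

For $d=2$ this produces a $5\times 5$ matrix, matching the parameter count of the first example below.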

Here are two examples for statistical models of two-dimensional normal distributions. Write $$\mathbf{X}_i=(X_{1i}, X_{2i})^\top\sim N(\mu,\Sigma), \quad \mu=(\mu_1,\mu_2)^\top,\ \ \Sigma=\pmatrix{\sigma_1^2 & \gamma\\\gamma&\sigma_2^2}$$ where $\gamma=\mathrm{Cov}(X_1,X_2)$.

1.) Components with different means and variances and unknown covariance $\gamma$.

The parameter vector is $\theta=(\theta_1,\dots,\theta_5)=(\mu_1,\mu_2,\sigma^2_1,\sigma_2^2,\gamma)$. It has the restrictions that $\sigma_1^2>0$, $\sigma_2^2 > 0$, and $\gamma^2<\sigma_1^2\sigma_2^2$ (i.e. $|\gamma|<\sigma_1\sigma_2$), so that $\Sigma$ is positive definite. So $p=5$ here.

In this example, write $$ \begin{multline} l(x;\theta) = -\frac{2}{2}\log(2\pi)-\frac{1}{2}\log\left|\pmatrix{\sigma_1^2 & \gamma\\\gamma&\sigma_2^2}\right|- \\ \frac{1}{2}(x_1-\mu_1, x_2-\mu_2)\pmatrix{\sigma_1^2 & \gamma\\\gamma&\sigma_2^2}^{-1}\pmatrix{x_1-\mu_1\\x_2-\mu_2} \end{multline} $$ which gives a moderately lengthy expression. Take the partial derivatives with respect to $\mu_1,\mu_2,\sigma^2_1,\sigma_2^2,\gamma$.
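If you prefer not to differentiate this expression by hand, the five partial derivatives can be generated symbolically; here is a short SymPy sketch (the symbols `s1`, `s2` are my shorthand for $\sigma_1^2,\sigma_2^2$):

```python
import sympy as sp

# s1, s2 stand for sigma_1^2 and sigma_2^2
x1, x2, m1, m2, s1, s2, g = sp.symbols('x1 x2 mu1 mu2 s1 s2 gamma', real=True)

Sigma = sp.Matrix([[s1, g], [g, s2]])
r = sp.Matrix([x1 - m1, x2 - m2])

# log-density of N(mu, Sigma) for d = 2, matching the displayed formula
l = (-sp.log(2 * sp.pi)
     - sp.Rational(1, 2) * sp.log(Sigma.det())
     - sp.Rational(1, 2) * (r.T * Sigma.inv() * r)[0, 0])

# score l_1(x; theta): one partial derivative per parameter
theta = (m1, m2, s1, s2, g)
score = [sp.simplify(sp.diff(l, t)) for t in theta]
for t, s in zip(theta, score):
    print(t, ':', s)
```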

2.) Independent components with the same variance but different means.

Then $\gamma=0$, $\sigma_1^2=\sigma_2^2=\sigma^2$ and you have a $p=3$ dimensional parameter vector $\theta=(\theta_1,\theta_2,\theta_3)=(\mu_1,\mu_2,\sigma^2)$. It has the restriction that $\sigma^2>0$.

In this example, the log likelihood is $$ \begin{aligned} l(x;\theta) &= -\frac{2}{2}\log(2\pi)-\frac{1}{2}\log\left|\pmatrix{\sigma^2 & 0\\0&\sigma^2}\right|- \\ &\qquad\frac{1}{2}(x_1-\mu_1, x_2-\mu_2)\pmatrix{\sigma^2 & 0\\0&\sigma^2}^{-1}\pmatrix{x_1-\mu_1\\x_2-\mu_2} \\&=-\frac{1}{2}\left(2\log(2\pi)+\log\sigma^4+\frac{(x_1-\mu_1)^2}{\sigma^2}+\frac{(x_2-\mu_2)^2}{\sigma^2} \right) \\&=-\log(2\pi) -\log\sigma^2-\frac{(x_1-\mu_1)^2}{2\sigma^2}-\frac{(x_2-\mu_2)^2}{2\sigma^2} . \end{aligned} $$ Take the partial derivatives with respect to $\mu_1,\mu_2,\sigma^2$: $$ l_1(x;\theta)=\nabla_{\mu_1,\mu_2,\sigma^2}\, l = \left(\frac{x_1-\mu_1}{\sigma^2}, \frac{x_2-\mu_2}{\sigma^2},-\frac{1}{\sigma^2}+\frac{(x_1-\mu_1)^2+(x_2-\mu_2)^2}{2\sigma^4}\right)^\top. $$ This is a three-dimensional vector, and $l_1(x;\theta)l_1(x;\theta)^\top$ is consequently a $3\times 3$ matrix. (Note that $\sigma^2$ counts as a single symbol for $\theta_3$ when taking the derivative.)
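To close the loop with the Monte Carlo estimator from the question: for this model the exact Fisher information is $F=\operatorname{diag}(1/\sigma^2,\,1/\sigma^2,\,1/\sigma^4)$ (each of the two independent components contributes $1/(2\sigma^4)$ to the $\sigma^2$ entry), so the sample average of score outer products can be checked against it directly. A minimal NumPy sketch (function names and parameter values are my own):

```python
import numpy as np

def score(x, mu1, mu2, s2):
    # l_1(x; theta) for theta = (mu1, mu2, sigma^2), from the formula above
    r1, r2 = x[0] - mu1, x[1] - mu2
    return np.array([r1 / s2, r2 / s2,
                     -1.0 / s2 + (r1**2 + r2**2) / (2.0 * s2**2)])

rng = np.random.default_rng(0)
mu1, mu2, s2 = 1.0, -2.0, 0.5
X = rng.normal(loc=[mu1, mu2], scale=np.sqrt(s2), size=(200_000, 2))

G = np.stack([score(x, mu1, mu2, s2) for x in X])
F_hat = G.T @ G / len(G)                    # Monte Carlo estimate of F
F_exact = np.diag([1/s2, 1/s2, 1/s2**2])    # diag(1/sigma^2, 1/sigma^2, 1/sigma^4)
print(np.round(F_hat, 2))                   # should be close to F_exact
```

The estimate converges to $F$ at the usual $O(1/\sqrt{S})$ Monte Carlo rate, which is exactly the accuracy test the question has in mind.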