Solved – What does the inverse of the covariance matrix say about the data? (Intuitively)

bayesian, covariance-matrix, maximum-likelihood

I'm curious about the nature of $\Sigma^{-1}$. Can anybody offer some intuition about what $\Sigma^{-1}$ says about the data?

Edit:

Thanks for the replies.

After taking some great courses, I'd like to add some points:

  1. It is a measure of information, i.e., $x^T\Sigma^{-1}x$ is the amount of information along the direction $x$ (see the numeric sketch after this list).
  2. Duality: Since $\Sigma$ is positive definite, so is $\Sigma^{-1}$, so each induces an inner-product norm; more precisely, the norms $\|x\|_{\Sigma}=\sqrt{x^T\Sigma x}$ and $\|x\|_{\Sigma^{-1}}=\sqrt{x^T\Sigma^{-1}x}$ are dual to each other. We can therefore derive the Fenchel dual of the regularized least-squares problem and maximize with respect to the dual instead, choosing whichever matrix is better conditioned.
  3. Hilbert space: Columns (and rows) of $\Sigma^{-1}$ and $\Sigma$ span the same space, so there is no advantage (other than when one of these matrices is ill-conditioned) to representing the data with $\Sigma^{-1}$ rather than $\Sigma$.
  4. Bayesian Statistics: The norm of $\Sigma^{-1}$ plays an important role in Bayesian statistics: it determines how much information the prior carries. E.g., when the prior precision satisfies $\|\Sigma^{-1}\|\rightarrow 0$, the prior becomes non-informative (a flat prior, which in some models coincides with the Jeffreys prior).
  5. Frequentist Statistics: It is closely related to Fisher information via the Cramér–Rao bound. The Fisher information matrix $\mathcal{F}$ (the expected outer product of the gradient of the log-likelihood with itself) bounds the precision of any unbiased estimator: $\Sigma^{-1}\preceq \mathcal{F}$, an ordering with respect to the positive semi-definite cone, i.e., with respect to concentration ellipsoids. So when $\Sigma^{-1}=\mathcal{F}$, the maximum likelihood estimator is efficient: the data yield the maximum possible information, and the frequentist regime is optimal. In simpler words, for some likelihood functions (note that the functional form of the likelihood depends purely on the probabilistic model that supposedly generated the data, a.k.a. the generative model), maximum likelihood is an efficient and consistent estimator and rules like a boss. (Sorry for overkilling it.)
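Here is a minimal NumPy sketch of points 1, 2, and 5, assuming a toy 2-D Gaussian; `mu`, `Sigma`, and all numeric values are made up for illustration, and the Fisher information is estimated by Monte Carlo rather than derived analytically:

```python
import numpy as np

rng = np.random.default_rng(0)

# An illustrative 2-D Gaussian: mean mu, covariance Sigma.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
Prec = np.linalg.inv(Sigma)   # the precision matrix Sigma^{-1}

# Point 1: x^T Sigma^{-1} x is large along low-variance directions
# (where an observation is very informative) and small along
# high-variance directions.
evals, evecs = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
x_low, x_high = evecs[:, 0], evecs[:, 1]
print(x_low @ Prec @ x_low, x_high @ Prec @ x_high)   # 1/evals[0] > 1/evals[1]

# Point 2: the norms induced by Sigma and Sigma^{-1} are dual:
# max { x^T y : y^T Sigma y <= 1 } = sqrt(x^T Sigma^{-1} x).
x = np.array([1.0, -0.5])
y_star = Prec @ x / np.sqrt(x @ Prec @ x)      # closed-form maximizer
print(x @ y_star, np.sqrt(x @ Prec @ x))       # the two values agree

# Point 5: for a Gaussian with known Sigma, the Fisher information of
# the mean is exactly Sigma^{-1}. Estimate it as the expected outer
# product of the score, d/dmu log p(x) = Sigma^{-1} (x - mu).
X = rng.multivariate_normal(mu, Sigma, size=200_000)
scores = (X - mu) @ Prec
F_hat = scores.T @ scores / len(X)
print(np.round(F_hat, 3))                      # ~ Prec, up to Monte Carlo noise
```

The last printout matching `Prec` is the case where the Cramér–Rao bound is tight, i.e. $\Sigma^{-1}=\mathcal{F}$ and the MLE of the mean is efficient.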

Best Answer

It is a measure of precision just as $\Sigma$ is a measure of dispersion.

More elaborately, $\Sigma$ is a measure of how the variables are dispersed around the mean (the diagonal elements) and how they co-vary with the other variables (the off-diagonal elements). The greater the dispersion, the farther the variables stray from the mean; and the more strongly they co-vary (in absolute value) with the other variables, the stronger their tendency to 'move together' (in the same or opposite direction, depending on the sign of the covariance).

Similarly, $\Sigma^{-1}$ is a measure of how tightly clustered the variables are around the mean (the diagonal elements) and the extent to which they do not co-vary with the other variables (the off-diagonal elements). Thus, the higher the diagonal element, the tighter the variable is clustered around the mean. The interpretation of the off-diagonal elements is more subtle and I refer you to the other answers for that interpretation.
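To make the diagonal reading concrete: a standard identity says that the $i$-th diagonal entry of $\Sigma^{-1}$ is the reciprocal of the conditional variance of variable $i$ given all the others, via the Schur complement. A minimal NumPy check, with a made-up 3-variable covariance (all numbers are illustrative only):

```python
import numpy as np

# Illustrative 3-variable covariance (values made up for the demo).
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 2.0, 0.5],
                  [0.3, 0.5, 1.5]])
Prec = np.linalg.inv(Sigma)

# Diagonal of the precision vs. 1 / Var(X_i | the other variables),
# where the conditional variance comes from the Schur complement.
for i in range(3):
    rest = [j for j in range(3) if j != i]
    S_rr = Sigma[np.ix_(rest, rest)]
    cond_var = Sigma[i, i] - Sigma[i, rest] @ np.linalg.solve(S_rr, Sigma[rest, i])
    print(Prec[i, i], 1.0 / cond_var)   # the two numbers agree for every i
```

In the same spirit, the off-diagonal entries encode conditional relationships: $-P_{ij}/\sqrt{P_{ii}P_{jj}}$ is the partial correlation between variables $i$ and $j$ given the rest, which is the subtlety the other answers unpack.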
