Covariance – Exploring Relationship Between Cholesky Decomposition and Matrix Inversion

Tags: cholesky-decomposition, covariance, gaussian-process, matrix-decomposition, matrix-inverse

I've been reviewing Gaussian processes and, from what I can tell, there's some debate over whether the "covariance matrix" (returned by the kernel), which needs to be inverted, should be handled through explicit matrix inversion (expensive and numerically unstable) or via Cholesky decomposition.

I'm quite new to Cholesky decomposition, and I've come to understand that it's akin to taking the square root of a scalar. Likewise, the inverse of a matrix is akin to division by a scalar (e.g., multiplying $A \cdot A^{-1} = I$ returns the identity matrix, much as $5/5 = 1$).

I'm struggling to make the connection: what's the relationship between the Cholesky decomposition of a covariance matrix and the inverse of that covariance matrix? Are additional steps required to cement the equivalence of the two solutions?

Best Answer

Gaussian process models often involve computing some quadratic form, such as $$ y = x^\top\Sigma^{-1}x $$ where $\Sigma$ is positive definite, $x$ is a vector of appropriate dimension, and we wish to compute the scalar $y$. Typically, you don't want to compute $\Sigma^{-1}$ directly, because of the cost and the loss of precision. By the definition of the Cholesky factor $L$, we have $\Sigma=LL^\top$. Because $\Sigma$ is PD, the diagonal entries of $L$ are positive, which implies $L$ is non-singular. In this exposition, $L$ is lower-triangular.
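These properties are easy to check numerically. Below is a minimal sketch (assuming NumPy, with a small synthetic matrix standing in for a kernel-derived covariance):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a small positive-definite stand-in for a covariance matrix:
# A @ A.T is positive semi-definite; the added diagonal makes it PD.
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4 * np.eye(4)

# Lower-triangular Cholesky factor: Sigma = L @ L.T
L = np.linalg.cholesky(Sigma)

assert np.allclose(L, np.tril(L))   # L is lower-triangular
assert np.all(np.diag(L) > 0)       # positive diagonal => non-singular
assert np.allclose(L @ L.T, Sigma)  # L L^T reconstructs Sigma
```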

We can rewrite $$\begin{align} y &= x^\top(LL^\top)^{-1}x \\ &= x^\top L^{-\top}L^{-1}x \\ &= (L^{-1}x)^\top L^{-1}x \\ &= z^\top z \end{align} $$

The step from the first to the second line uses the elementary identity $(LL^\top)^{-1} = L^{-\top}L^{-1}$. The second to third line just rearranges the transpose. The final line rewrites the expression as a vector dot product, which is convenient because we only need to compute $z$ once.
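As a sanity check, the derivation can be verified numerically; a sketch (assuming NumPy, with an explicit inverse used only as a reference value at this tiny size):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)  # positive definite
x = rng.standard_normal(5)

# Reference value via an explicit inverse (fine at n=5; costly and
# less precise for the large matrices that arise in GP models).
y_direct = x @ np.linalg.inv(Sigma) @ x

# Cholesky route: z = L^{-1} x, then y = z^T z.
L = np.linalg.cholesky(Sigma)
z = np.linalg.solve(L, x)  # solves L z = x without forming L^{-1}
y_chol = z @ z

assert np.isclose(y_direct, y_chol)
```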

The nice thing about triangular matrices is that they're dirt-simple to solve against, so we never actually compute $L^{-1}x$; instead, we use forward substitution to solve $Lz=x$, which is very cheap ($O(n^2)$ once $L$ is known) compared to computing an inverse directly.
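In practice the forward-substitution step might look like the following sketch (assuming SciPy is available; `solve_triangular` with `lower=True` exploits the triangular structure instead of running a general solver):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
Sigma = A @ A.T + 6 * np.eye(6)  # positive-definite stand-in for a kernel matrix
x = rng.standard_normal(6)

L = np.linalg.cholesky(Sigma)

# Forward substitution for L z = x: O(n^2), since each entry of z
# depends only on already-solved entries above it.
z = solve_triangular(L, x, lower=True)

# Quadratic form via the dot product, no inverse ever formed.
y = z @ z
assert np.isclose(y, x @ np.linalg.solve(Sigma, x))
```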