regression – Statistical Interpretation of Diagonal of Cholesky Decomposition

cholesky-decomposition, covariance, regression

Given a set of $m$ examples $x$ arranged as rows of an $m\times n$ data matrix $X$, consider the Cholesky decomposition of the covariance matrix $X'X$.

Is there a statistical interpretation of the diagonal entries of the Cholesky factor, or equivalently of $D$ in the $LDL'$ decomposition of the covariance matrix?

$$X'X=LDL'$$

There's an interpretation of $L$ in terms of the coefficients of a linear model predicting the $i$th coordinate of $x$ from all previous ones, hence we can estimate the entries of $L$ approximately by fitting linear models to subsets of the data. Is there an analogous way to estimate $D$ approximately?

Some related statistical interpretations:

Use least squares to fit a one-sided autoregressive model predicting the $i$th coordinate of $x$ from coordinates $j<i$, and arrange these coefficients as the entries of $\phi$ so that $x\phi$ predicts $x$. We can extract the entries of $L$ from $\phi$ using the following:
$$\phi = I - (L^{-1})'$$
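
As a numerical sanity check, here is a minimal NumPy sketch (random data, arbitrary dimensions) showing that column-by-column least squares recovers the same $\phi$ as $I - (L^{-1})'$ computed from the Cholesky factor of $X'X$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 5
X = rng.standard_normal((m, n))

# phi via n one-sided least-squares fits: column i of phi holds the
# coefficients predicting X[:, i] from the preceding columns X[:, :i]
phi = np.zeros((n, n))
for i in range(1, n):
    coef, *_ = np.linalg.lstsq(X[:, :i], X[:, i], rcond=None)
    phi[:i, i] = coef

# phi via the Cholesky factor of X'X: C = L sqrt(D), so L = C / diag(C)
C = np.linalg.cholesky(X.T @ X)
L = C / np.diag(C)                     # unit-diagonal lower-triangular L
phi_chol = np.eye(n) - np.linalg.inv(L).T

print(np.allclose(phi, phi_chol))      # True
```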

Similarly, we can estimate the entries of $(X'X)^{-1}$ by fitting two-sided autoregressive models of $x$, predicting the $i$th coordinate $x_i$ from the coordinates $x_j$, $j\ne i$.

Arrange the coefficients of these models as the entries of $B$. Let $R=X-XB$ denote the residuals left over after subtracting this model's predictions from $X$. The diagonal entries of the inverse of $X'X$ have an interpretation in terms of squared per-feature residual norms:

$$\text{diag}((X'X)^{-1})=\frac{1}{\text{diag}(R'R)}$$
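A similar sketch (same assumptions: random data, arbitrary dimensions) checking this identity against explicitly fitted two-sided regressions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 5
X = rng.standard_normal((m, n))

# column i of B holds the coefficients predicting X[:, i] from all other columns
B = np.zeros((n, n))
for i in range(n):
    others = np.delete(np.arange(n), i)
    coef, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)
    B[others, i] = coef

R = X - X @ B                          # residuals of the two-sided fits
lhs = np.diag(np.linalg.inv(X.T @ X))
rhs = 1.0 / np.diag(R.T @ R)           # elementwise reciprocal
print(np.allclose(lhs, rhs))           # True
```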

If we let $D_2=(X'X)^{-1}_d$, with the $_d$ notation indicating that all off-diagonal terms are set to zero, then the inverse of $X'X$ has an interpretation in terms of the predictive coefficients $B$:

$$(X'X)^{-1} = (I-B)D_2$$

Plugging in the statistical interpretation of $D_2$, we can express the inverse of $X'X$ purely in terms of the coefficients $B$, which can be estimated from data:

$$(X'X)^{-1} = (I-B)[((X-XB)'(X-XB))_d]^{-1}$$
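
A sketch checking the plugged-in formula end to end (random data again; the fitting loop from the previous snippet is repeated so this stands alone):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 5
X = rng.standard_normal((m, n))

# two-sided regressions: column i of B predicts X[:, i] from the rest
B = np.zeros((n, n))
for i in range(n):
    others = np.delete(np.arange(n), i)
    coef, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)
    B[others, i] = coef

R = X - X @ B                                 # two-sided residuals
D2 = np.diag(1.0 / np.diag(R.T @ R))          # [ ((X-XB)'(X-XB))_d ]^{-1}
P = np.linalg.inv(X.T @ X)                    # reference precision matrix

print(np.allclose(P, (np.eye(n) - B) @ D2))   # True
```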

We can similarly write the inverse of $X'X$ in terms of the coefficients of the one-sided linear models arranged as $\phi$:

$$(X'X)^{-1} = (I-\phi)D^{-1}(I-\phi)'$$
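
A sketch checking this form; since estimating $D$ statistically is exactly the open question, here $D$ is read off the Cholesky factor $C = L\sqrt{D}$ instead:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 5
X = rng.standard_normal((m, n))

# one-sided regressions, as before
phi = np.zeros((n, n))
for i in range(1, n):
    coef, *_ = np.linalg.lstsq(X[:, :i], X[:, i], rcond=None)
    phi[:i, i] = coef

# D from the Cholesky factor C = L sqrt(D) of X'X
C = np.linalg.cholesky(X.T @ X)
D_inv = np.diag(1.0 / np.diag(C) ** 2)

I = np.eye(n)
lhs = (I - phi) @ D_inv @ (I - phi).T
print(np.allclose(lhs, np.linalg.inv(X.T @ X)))   # True
```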

Unlike the inverse in terms of the two-sided autoregressive model $B$, the inverse in terms of the one-sided autoregressive model $\phi$ involves the matrix $D$ from the $LDL'$ decomposition, so it's not directly usable without a statistical way to estimate $D$. Having a way to estimate $D$ from data would give a way to:

  1. Formulate the matrix inverse in terms of one-sided autoregressive model coefficients
  2. Get an iterative algorithm for approximating the Cholesky decomposition of the covariance $X'X$ or the precision matrix $(X'X)^{-1}$

(notebook)

Best Answer

If the $X_i$ variables follow a normal distribution with covariance matrix $\Sigma$ and $$\Sigma = LDL'$$ then the diagonal elements of $D$ are the conditional variances of each $X_i$ conditional on $X_1,\ldots,X_{i-1}$. And, as you have already said, the elements of the $i$th row of $L$ give the regression coefficients of $X_i$ on $X_1,\ldots,X_{i-1}$.

More generally, if the $X_i$ are not normally distributed, then the elements of $L$ define the best linear predictor for $X_i$ based on $X_1,\ldots,X_{i-1}$ and the diagonal elements of $D$ give the variances of the residuals from these linear regressions. In other words, the elements of $D$ give the variance of each $X_i$ not explained by linear regression on $X_1,\ldots,X_{i-1}$.
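
Here is a minimal sketch of both statements, using an arbitrary example covariance matrix and computing the conditional variances via Schur complements:

```python
import numpy as np

A = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 1.5],
              [1.0, 1.5, 2.0]])            # an example covariance matrix

C = np.linalg.cholesky(A)                  # A = C C', with C = L sqrt(D)
D = np.diag(C) ** 2                        # diagonal of D in A = L D L'

# conditional variance of X_i given X_1..X_{i-1}: Schur complement
cond_var = [A[0, 0]]
for i in range(1, 3):
    S = A[:i, :i]
    s = A[:i, i]
    cond_var.append(A[i, i] - s @ np.linalg.solve(S, s))

print(np.allclose(D, cond_var))            # True
```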

In your question, you are computing the LDL decomposition not of a covariance matrix but of the cross-product $X'X$ of a data matrix. If each column of the data matrix $X$ were mean-corrected, then $X'X$ would be a sample covariance matrix (up to the factor $m-1$), and the theoretical interpretation of the Cholesky decomposition would hold in an estimated sense. If the data are not mean-corrected, then $X'X$ is not a covariance matrix and the interpretation does not hold.
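
For concreteness, a two-line mean correction in NumPy (the division by $m-1$ gives the usual unbiased sample covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) + 3.0         # columns with nonzero means
Xc = X - X.mean(axis=0)                          # mean-corrected data matrix
S = Xc.T @ Xc / (len(Xc) - 1)                    # sample covariance
print(np.allclose(S, np.cov(X, rowvar=False)))   # True
```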

The Cholesky decomposition algorithm, when properly coded with forward and backward equations, is a marvelously efficient and stable numerical algorithm that does not require any matrix inversions. So I am a bit surprised that you are trying to approximate it, especially with equations that look to me to be somewhat inefficient. There also exist Cholesky algorithms that produce banded $L$ matrices, i.e., which limit the number of previous variables that each $X_i$ depends on. Banded Cholesky algorithms are appropriate for auto-regressive processes. I wrote Fortran subroutines to do such things when I was a PhD student back in the 1980s! If you are dealing with auto-regressive time series, then you could look into state-space models, which can be viewed as an implementation of an LDL Cholesky decomposition for time series.
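
To illustrate, here is a bare-bones $LDL'$ routine of the kind I mean, using only forward recurrences and no matrix inversions (a textbook sketch, not production code; real implementations add checks and can exploit banding):

```python
import numpy as np

def ldl(A):
    """Return unit lower-triangular L and diagonal d with A = L diag(d) L'."""
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        # d_j = A_jj - sum_{k<j} L_jk^2 d_k
        d[j] = A[j, j] - (L[j, :j] ** 2) @ d[:j]
        for i in range(j + 1, n):
            # L_ij = (A_ij - sum_{k<j} L_ik L_jk d_k) / d_j
            L[i, j] = (A[i, j] - (L[i, :j] * L[j, :j]) @ d[:j]) / d[j]
    return L, d

A = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 1.5],
              [1.0, 1.5, 2.0]])
L, d = ldl(A)
print(np.allclose(L @ np.diag(d) @ L.T, A))   # True
```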
