Solved – Which is more numerically stable for OLS: pinv vs QR

computational-statistics, matrix, matrix-decomposition, matrix-inverse

If I am doing standard OLS and want to calculate beta values (OLS estimators), which of the following is the more numerically stable method? And why?

Assuming that the columns of $X$ are already mean-centered and normalised, to solve $Y = X\beta + \epsilon$ do:

1) $\hat{\beta}_{pinv}=(X'X)^+X'Y$

2) $\hat{\beta}_{QR} = \text{solve}(R,Q'Y)$

where $^+$ represents the Moore-Penrose pseudo-inverse, $Q$ and $R$ come from the QR decomposition of $X$, and solve is a function like the solve routines in Python or R.

I would have thought (2) was better, as $(X'X)^+$ seems to have a higher condition number than $R$, but in practice (in Python at least) I am finding that the beta values derived from (1) minimise the sum of squared residuals better.
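For concreteness, here is a minimal NumPy sketch of the two estimators on synthetic data (the variable names and the simulated design are illustrative, not from the original question):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # mean-centre and normalise columns
beta_true = rng.standard_normal(p)
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

# (1) pseudo-inverse of X'X
beta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ Y

# (2) thin QR factorisation, then solve the triangular system R beta = Q'Y
Q, R = np.linalg.qr(X)                     # reduced QR by default
beta_qr = np.linalg.solve(R, Q.T @ Y)

for name, b in [("pinv", beta_pinv), ("QR", beta_qr)]:
    rss = np.sum((Y - X @ b) ** 2)
    print(f"{name}: RSS = {rss:.12f}")
```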

Best Answer

Using the Moore-Penrose pseudo-inverse $X^{\dagger}$ of a matrix $X$ is more stable in the sense that it can directly account for rank-deficient design matrices $X$. $X^{\dagger}$ allows us to naturally employ the identities $X X^{\dagger} X = X$ and $X^{\dagger} X X^{\dagger} = X^{\dagger}$; the matrix $X^{\dagger}$ can be used as a "surrogate" for the true inverse of the matrix $X$, even if the inverse matrix $X^{-1}$ does not exist. In addition, the "usual" way of computing $X^{\dagger}$ by employing the Singular Value Decomposition of matrix $X$, where $X = USV^T$, is straightforward methodologically and computationally well-studied. We simply take the reciprocal of the non-zero singular values in the diagonal matrix $S$ (giving $X^{\dagger} = V S^+ U^T$), and we are good to go. Moore-Penrose pseudo-inverses are common in many proofs because they "just exist" and greatly simplify many derivations.
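As a small sketch of that SVD route (the truncation rule below mirrors what `numpy.linalg.pinv` does, but the exact cut-off here is an illustrative choice):

```python
import numpy as np

def pinv_via_svd(X, rcond=1e-15):
    # X = U S V', so X^+ = V S^+ U' with reciprocals of the non-zero singular values
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    cutoff = rcond * s.max()
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)
    return Vt.T @ np.diag(s_inv) @ U.T

# Should agree with NumPy's built-in pseudo-inverse
# np.allclose(pinv_via_svd(X), np.linalg.pinv(X))  -> True
```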

That said, in most cases it is not good practice to use the Moore-Penrose pseudo-inverse unless we have a very good reason (e.g. our procedure consistently employs small and potentially rank-degenerate covariance matrices). The reasons are that: 1. it can hide true underlying problems with our data (e.g. duplication of variables) and 2. it is unnecessarily expensive (we have better alternatives). Finally, note that the Moore-Penrose pseudo-inverse of a full-rank $X$ can be directly computed through the QR factorization of $X$, $X = QR$, as $X^{\dagger} = [R_1^{-1} \; 0]\, Q^T$, where $R_1$ is the upper triangular factor from the "thin/reduced/skinny" QR factorization of $X$ (equivalently, $X^{\dagger} = R_1^{-1} Q_1^T$ with $X = Q_1 R_1$). So we do not really gain much if $X$ is full rank anyway. (Gentle's Matrix Algebra: Theory, Computations and Applications in Statistics provides a wealth of information on the matter if one wishes to explore this further - Sect. 3.6 on Generalised Inverses should be a relevant starting point.)
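A minimal sketch of that QR route, assuming $X$ has full column rank (the function name is my own, for illustration):

```python
import numpy as np

def pinv_via_qr(X):
    # Thin QR: X = Q1 R1 with R1 upper triangular, so X^+ = R1^{-1} Q1'
    Q1, R1 = np.linalg.qr(X)           # reduced QR
    return np.linalg.solve(R1, Q1.T)   # solves R1 Z = Q1', i.e. Z = R1^{-1} Q1'
```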

To elaborate on my first point a bit: it is far more natural to use a penalised regression procedure like Ridge or LASSO if we have issues with collinearity or simply a $p\gg n$ setting (i.e. more predictors than data-points) than to hide the problem using $X^\dagger$. The condition of a system of equations solved through Moore-Penrose pseudo-inverses might still be prohibitively bad, resulting in unstable solutions and/or misleading inference. Thus if numerical stability is an issue, I would suggest using regularisation directly instead of Moore-Penrose pseudo-inverses. Note that in terms of speed, computing $X^{\dagger}$ is also problematic; iterative methods based on gradient descent or alternating least squares can be far faster for large systems (e.g. in the Recommender Systems literature, see Paterek (2008), Improving regularized singular value decomposition for collaborative filtering, for something very concise).
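For illustration, here is a minimal ridge sketch (the penalty `lam` is an arbitrary placeholder value; in practice it would be chosen by cross-validation):

```python
import numpy as np

def ridge_beta(X, Y, lam=1.0):
    # Ridge estimator: (X'X + lam*I)^{-1} X'Y, solved without forming an explicit inverse
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```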
