Solved – Is $X^T X$ invertible if $p > n$

linear algebra, mathematical-statistics, regression

I am checking someone else's regression analysis on a data set with $p > n$. I can only see the results, not the process by which he obtained them.

I believe he made a mistake. Since $p > n$, $X^T X$ is not of full rank, so it is not invertible. Therefore we cannot find a unique OLS coefficient estimate.

However, I am not sure my reasoning is correct. I have not used linear algebra for a long time.

Update:

No penalized methods involved.
He used stepwise regression for variable selection. A new question: would such an algorithm stop once the number of variables in the model equals the number of sample points?
His goal is to find out which variables are important. He doesn't care about predictive power.

Best Answer

If matrix $\mathbf X$ is $n \times p$ and $p > n$, then it is fat and, thus,

$$ \mbox{rank} (\mathbf X) = \mbox{rank} \left(\mathbf X^\top \mathbf X \right) \leq n < p $$

Hence, the $p \times p$ matrix $\mathbf X^\top \mathbf X$ has rank at most $n < p$, so it does not have full rank and, thus, it is not invertible.
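A quick numerical check of the rank argument, as a minimal sketch in R. The sizes `n = 5`, `p = 8`, the simulated data and the helper `num_rank` are all made up for illustration; they are not from the question.

```r
set.seed(42)
n <- 5; p <- 8                        # illustrative sizes with p > n
X <- matrix(rnorm(n * p), nrow = n)   # a "fat" n x p design matrix
XtX <- crossprod(X)                   # t(X) %*% X, a p x p matrix

# Hypothetical helper: numerical rank = number of singular values
# that are not negligible relative to the largest one.
num_rank <- function(A, tol = 1e-8) {
  d <- svd(A)$d
  sum(d > tol * max(d))
}

rank_X   <- num_rank(X)     # at most n
rank_XtX <- num_rank(XtX)   # same rank as X, hence smaller than p

# lm() does not fail here; it reports NA for the coefficients it cannot identify.
y <- rnorm(n)
n_na <- sum(is.na(coef(lm(y ~ X - 1))))
c(rank_X = rank_X, rank_XtX = rank_XtX, n_na = n_na)
```

Both ranks come out as $n$, and `lm` leaves $p - n$ coefficients as `NA`, which is how the non-uniqueness of the OLS solution shows up in practice.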

Related Solutions

Solved – Equivalence of the OLS and GLS estimates

Question: In the setup above, are conditions (1) and (2) satisfied?

Answer: No, in general the conditions are not satisfied.

The following example provides a proof of the answer.

\begin{align*} X &= \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ \end{bmatrix}, Y = \begin{bmatrix} 1 \\2 \\3\\4 \end{bmatrix}, \Sigma = \begin{bmatrix} 1 & 0&0&0 \\ 0&5&0&0 \\ 0&0&5&0\\ 0&0&0&5 \end{bmatrix}. \end{align*}

Notice that $\Sigma, X'X$ and $X'\Sigma^{-1}X$ are all diagonal matrices with non-zero, positive, elements on the diagonals. Thus, they are all positive definite and have the standard basis vectors as eigenvectors. That is, they satisfy the setup and condition 1). It is easy to check that the OLS and GLS estimates are different (see code below). Thus, condition 2) must not hold. Let's see why.

In this example, $k=2$ so the columns of $H$ are two eigenvectors of $\Sigma$. Let $A=[a_1, a_2]$. Then $X = HA$ implies that $Ha_1 = x_1 = [1,1,0,0]'$. The eigenvectors of $\Sigma$ are the standard basis vectors, say $e_i$, and, thus, it must be that $H = [e_1, e_2]$ up to reordering of the columns. But then $x_2 =[0,0,1,1]' \notin \mathrm{span}(H)$, i.e. we cannot pick $a_2$ to satisfy the requirement that $X=HA$. We conclude condition 2) is not satisfied.

The following R code snippet shows that the GLS estimates, in this case WLS because of the diagonal covariance matrix, differ from the OLS estimates.

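X <- matrix(c(1, 1, 0, 0, 0, 0, 1, 1), ncol = 2); Y <- 1:4; E <- diag(c(1, 5, 5, 5))
# OLS estimates
coef(lm(Y ~ X - 1))
>X1  X2 
>1.5 3.5

# GLS (here WLS) estimates: each observation is weighted by its inverse variance
coef(lm(Y ~ X - 1, weights = 1/diag(E)))
>X1       X2 
>1.166667 3.500000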
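The same comparison can be done directly from the closed-form estimators $\hat\beta_{OLS} = (X'X)^{-1}X'Y$ and $\hat\beta_{GLS} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y$. This is only a sketch to cross-check the `lm` output; the variable names `Sigma_inv`, `beta_ols` and `beta_gls` are mine.

```r
X <- matrix(c(1, 1, 0, 0, 0, 0, 1, 1), ncol = 2)
Y <- 1:4
Sigma <- diag(c(1, 5, 5, 5))
Sigma_inv <- solve(Sigma)

# OLS: solve (X'X) b = X'Y
beta_ols <- drop(solve(crossprod(X), crossprod(X, Y)))

# GLS: solve (X' Sigma^{-1} X) b = X' Sigma^{-1} Y
beta_gls <- drop(solve(t(X) %*% Sigma_inv %*% X, t(X) %*% Sigma_inv %*% Y))

rbind(OLS = beta_ols, GLS = beta_gls)
```

By hand, the first GLS coefficient is the weighted mean $(1 \cdot 1 + \tfrac{1}{5} \cdot 2)/(1 + \tfrac{1}{5}) = 7/6 \approx 1.1667$, while the OLS value is the plain mean $1.5$. The second coefficient is $3.5$ in both cases because observations 3 and 4 share the same variance.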

Solved – Iteratively Reweighted Least squares for logistic regression when features are dependent

This might come as a slight anti-climax but essentially prior to fitting we remove any dependent columns of $X$ (or $XW$ in the case of a weighted task) so that $X$ is of full rank. For each IRLS iteration we compute a new vector of coefficient estimates, so computationally we just care for this (potentially weighted) fit to be numerically stable. The "iteration" part is somewhat immaterial.

The actual "rank correction" is relatively straightforward using $QR$ decomposition with column pivoting. This is a decomposition such that $X = QRP^T$; $Q$ is a unitary (for real $X$, orthogonal) matrix, $R$ is an upper triangular matrix and $P$ is the permutation matrix chosen such that the diagonal elements of $R$ are non-increasing in magnitude. If $QR$ suggests that the matrix $X$ (or $XW$ if we have weights) is not of full rank $k_f$ but of rank say $k_s$ (where $k_s < k_f$) we keep $k_s$ columns from the design matrix $X$. To pick the columns to retain we use $P$ and keep the columns of $X$ indicated by the first $k_s$ columns of $P$. And then we are done - this "thinner" $X$ should be full rank so the show is over. :) Both R and MATLAB employ QR with pivoting in their glm.fit and glmfit functions respectively to handle this kind of rank deficiency. MATLAB is slightly nicer and removes these "offending columns" prior to the IRLS iteration while R explicitly sets the corresponding coefficients to NA after the IRLS iteration, but the idea and approach are the same.
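The column-dropping step can be sketched in base R as follows. The toy design matrix and the names `dec`, `keep` and `X_thin` are made up for illustration. Note also that base R's `qr()` defaults to a LINPACK routine with a limited pivoting strategy (it moves near-dependent columns to the right-hand edge), so the diagonal of $R$ is not necessarily sorted as in the textbook description above.

```r
set.seed(7)
n <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
# Toy design matrix: the third column is an exact multiple of the second.
X <- cbind(int = 1, x1 = x1, dup = 2 * x1, x2 = x2)

dec <- qr(X)                           # pivoted QR of the design matrix
dec$rank                               # numerical rank k_s (here below ncol(X))
dec$pivot                              # column order chosen by the pivoting

keep <- dec$pivot[seq_len(dec$rank)]   # the first k_s pivoted columns
X_thin <- X[, keep, drop = FALSE]      # the "thinner", full-rank design matrix
colnames(X_thin)
```

For a weighted fit the same idea would be applied to the row-scaled matrix instead of `X`; that variant is not shown here.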

Addition based on comment: In general there is a multitude of approaches for fitting logistic regression models; T. Minka has written a nice overview here, which provides a number of Big O results. As for IRLS in particular, it turns out to be equivalent to the use of Newton's method; the catch is that we use the expected Hessian of the Bernoulli likelihood (i.e. the Fisher information matrix) instead of the actual Hessian; this leads to the name Fisher scoring method. This suggests quadratic convergence with $O(nd^2)$ time complexity per iteration. It goes without saying that this does not guarantee convergence; for example, successive iterations might lead to an ever-increasing MLE solution $\beta$ in cases where the likelihood does not have a finite maximum. Cases of complete separation are prime examples and CV has two very enlightening posts on the matter here and here to get you started. The course notes by C. Shalizi on Logistic Regression and Newton's Method are also quite good regarding the numerical aspects of this (the recommended reading in the notes, Faraway's awesome Extending the Linear Model with R, Chapter 2, unfortunately has no detailed numerics on this). About 5 years ago I read parts of Å. Björck's Numerical Methods for Least Squares Problems - it is not a Stats book but if you care about the Numerical Linear Algebra (i.e. speed and stability) it is a good bet.
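To make the IRLS / Fisher scoring connection concrete, here is a minimal sketch of the iteration for logistic regression in R, compared against `glm`. The function `irls_logistic`, the simulated data and the tolerances are my own assumptions for illustration; this is not how `glm.fit` is actually written, and it has none of its safeguards (no rank check, no step halving, no handling of separation).

```r
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))          # simulated full-rank design matrix
beta_true <- c(-0.5, 1, -1)                # made-up coefficients
y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))

# Hypothetical helper: plain IRLS for the logit link.
irls_logistic <- function(X, y, tol = 1e-10, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)                # linear predictor
    mu  <- plogis(eta)                     # fitted probabilities
    w   <- mu * (1 - mu)                   # Fisher information weights
    z   <- eta + (y - mu) / w              # working response
    # Weighted least squares step, solved through a QR decomposition
    beta_new <- qr.coef(qr(sqrt(w) * X), sqrt(w) * z)
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

beta_irls <- irls_logistic(X, y)
beta_glm  <- unname(coef(glm(y ~ X - 1, family = binomial())))
max(abs(beta_irls - beta_glm))             # should be very small
```

Each pass costs one weighted least-squares solve on an $n \times d$ matrix, which is where the $O(nd^2)$ per-iteration cost comes from. On completely separated data the weights `w` shrink towards zero and the coefficients keep growing, so this bare loop would not settle down.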

For something somewhat more formal than Wikipedia but still accessible I would suggest looking at M. Mueller's Generalized Linear Models contributed chapter for Springer's Handbook of Computational Statistics. You want to focus on Sect. 3.3 (or 24.3.3 if you can afford Springer's Handbook). CV itself already has some excellent posts on the $QR$ decomposition and its use in linear regression problems, see here and here. I also think you will find this thread on "How to correctly implement iteratively reweighted least squares algorithm for multiple logistic regression?" quite helpful.
