Solved – Too many variables and multicollinearity in OLS regression

least squares, multicollinearity, multiple regression, regression

After reading material related to my topic, I understood that exact multicollinearity among the predictors makes the matrix $X'X$ singular, and therefore non-invertible. Thus the least-squares solution is not unique.

Now I am confused after reading that having too many variables (more predictors than observations) also causes $X'X$ to be singular.

Is $X'X$ really singular in both situations? If so, could you explain why?
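For example, with two exactly collinear predictors I can check the first claim numerically (a toy numpy sketch I made up; the names and numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x1 = rng.normal(size=N)
x2 = 3 * x1                                  # exactly collinear with x1
X = np.column_stack([np.ones(N), x1, x2])    # intercept + 2 predictors

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))            # 2, not 3: X'X is singular
```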

Best Answer

Suppose you add predictors to your model one by one, and you have a way to do so that avoids the multicollinearity issue - that is, you somehow make sure that each new predictor is orthogonal to the predictors already in the model. That way the correlation matrix of the predictors stays diagonal, and multicollinearity is not an issue - up to a point.
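To make this concrete, here is a toy numpy sketch (my own illustration, with arbitrary sizes) of the "most independent" set of predictors you could construct, and where the limit kicks in:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
# N orthonormal columns: the least collinear set of predictors possible
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))
print(np.allclose(Q.T @ Q, np.eye(N)))       # True: cross-products are diagonal

# Any additional column is necessarily a linear combination of these N
extra = rng.normal(size=(N, 1))
X = np.hstack([Q, extra])
print(np.linalg.matrix_rank(X))              # still 20, not 21
```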

The thing is that, for a given sample size $N$, it is impossible to find more than $N$ linearly independent predictors (including the column of 1s for the intercept). That is because the rank of the $N \times p$ design matrix $X$ cannot exceed $\min\{N, p\}$. No matter how you pick the predictors, if you use more than $N$ of them, $X'X$ is guaranteed to be non-invertible.

From linear algebra we know that the rank of $X'X$ cannot exceed the rank of $X$. So when $p > N$, the $p \times p$ matrix $X'X$ has rank at most $N$, but for it to be invertible its rank would have to be $p$.
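A quick numerical check of this (again a throwaway numpy sketch with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 10, 15                        # more predictors than observations
X = rng.normal(size=(N, p))
XtX = X.T @ X                        # 15 x 15

print(np.linalg.matrix_rank(X))      # 10 (= N)
print(np.linalg.matrix_rank(XtX))    # 10 (= rank of X, far short of p = 15)
# A 15 x 15 matrix of rank 10 is singular, so the normal equations
# X'X b = X'y have no unique solution.
```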
