[Math] Multicollinearity: Why do highly correlated columns in the design matrix lead to high variance of the regression coefficients?

regression, statistics

I came across the term "multicollinearity" in statistics, particularly in regression. However, I never really understood mathematically why highly correlated (almost linearly dependent) columns in the design matrix $X$ lead to higher variance of the regression coefficients, given that $$\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^TX)^{-1}.$$

Can someone please explain the mathematical idea behind this?

Best Answer

Let's start with your formula:

$\operatorname{Var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}$

First, note that if the columns of $X$ are linearly dependent, then $X^TX$ will not be invertible and will have a determinant of 0. And, I won't prove this next part, but if the columns of $X$ are "close" to being linearly dependent, then the determinant of $X^TX$ will be "close" to 0.
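
As a concrete sketch (my own addition, assuming just two centered predictors, each scaled so its column of $X$ has squared norm $n$, with sample correlation $\rho$):

$$X^TX = n\begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}, \qquad \det(X^TX) = n^2(1-\rho^2) \longrightarrow 0 \quad \text{as } \rho \to \pm 1.$$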

Now, the inverse of a matrix $A$ involves the reciprocal of its determinant:

$$A^{-1} = \frac{1}{\det(A)}\operatorname{adj}(A),$$

where $\operatorname{adj}(A)$ is the adjugate of the matrix $A$.
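
Applying this formula to the two-predictor sketch above (again my own illustration, not part of the original answer):

$$(X^TX)^{-1} = \frac{1}{n^2(1-\rho^2)}\begin{pmatrix} n & -n\rho\\ -n\rho & n\end{pmatrix} = \frac{1}{n(1-\rho^2)}\begin{pmatrix}1 & -\rho\\ -\rho & 1\end{pmatrix},$$

so each coefficient has $\operatorname{Var}(\hat{\beta}_j) = \dfrac{\sigma^2}{n(1-\rho^2)}$, which blows up as $\rho \to \pm 1$; the factor $1/(1-\rho^2)$ is exactly the variance inflation factor for two predictors.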

Therefore, as the columns of $X$ become closer to linearly dependent, the determinant of $X^TX$ becomes closer to 0 while the entries of the adjugate remain bounded, which means the elements of the inverse, and hence the variances of the coefficient estimates, get larger.
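
Here is a quick numerical sketch of the same point (my own illustration; the way the correlated predictors are generated and the variable names are assumptions, not from the original answer):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 200, 1.0
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)

for rho in [0.0, 0.9, 0.99, 0.999]:
    # x2 = rho*z1 + sqrt(1 - rho^2)*z2 has correlation ~rho with x1 = z1
    x1 = z1
    x2 = rho * z1 + np.sqrt(1 - rho**2) * z2
    X = np.column_stack([x1, x2])
    XtX = X.T @ X

    # Var(beta_hat) = sigma^2 * (X^T X)^{-1}; watch the diagonal grow as rho -> 1
    var_beta = sigma2 * np.linalg.inv(XtX)
    print(f"rho={rho:6.3f}  det(X^T X)={np.linalg.det(XtX):14.2f}  "
          f"Var(beta_1)={var_beta[0, 0]:.4f}  Var(beta_2)={var_beta[1, 1]:.4f}")
```

As `rho` approaches 1, `det(X^T X)` shrinks toward 0 and the diagonal entries of $\sigma^2(X^TX)^{-1}$ grow, which is the variance inflation described above.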
