Handling a Singular Matrix with Dummy Variables in Regression


I'm trying to detect correlations between my variables, and I should be able to do this by inverting the correlation matrix and looking at the diagonal values, which are the VIF values. I can't do this because my matrix is singular, which I think means there is correlation between two or more of my variables.
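
Roughly, this is what I'm doing (a minimal NumPy sketch with simulated data; the variable names are just placeholders, not my real ones):

```python
# Minimal sketch: VIFs as the diagonal of the inverse correlation matrix.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.7 * x1 + 0.3 * rng.normal(size=n)   # correlated with x1, but not perfectly
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)           # p x p correlation matrix
print(np.diag(np.linalg.inv(R)))           # diagonal of R^{-1} = the VIFs

# With an exactly redundant column the inversion fails instead:
X_bad = np.column_stack([x1, x2, x1])      # third column duplicates the first
R_bad = np.corrcoef(X_bad, rowvar=False)
try:
    np.linalg.inv(R_bad)
except np.linalg.LinAlgError:
    print("singular correlation matrix -- exact linear dependence somewhere")
```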

However, I've been trying to remove some of the variables, and the only time the matrix becomes nonsingular is when I remove the dummy variables that I have.

From this post it seems that a high VIF is likely to occur when you have dummy variables. When I look at the correlations, there are correlations of around 0.5 between two levels of the same original variable. So what exactly do you do in this situation? I could drop all the variables that I have dummy coded, but I don't think that would be a good model anymore. Do I simply have to find a different diagnostic for multicollinearity, since I won't be able to invert the matrix?

Best Answer

I'm picking up on some confusion about a few things.

  1. Linear dependence in the columns of your design matrix $\mathbf{X}$, or equivalently singularity of $\mathbf{X}^\intercal \mathbf{X}$, is bad because you won't even have (unique) OLS estimators. This doesn't just mean that there is correlation among your predictors--it means that there is perfect correlation among some of your predictors. This (usually) means that you have to remove some of the columns before you can estimate a regression model.

  2. Near linear dependence, or correlation between $-1$ and $1$, not inclusive, is bad because the OLS estimators you do have, $\hat{\boldsymbol{\beta}}$, can have very large variance. However, if you have a lot of predictors, you have to ask yourself which combination of predictors is correlated with another combination of predictors. There are a lot of ways this can happen (e.g. $(x_1 + x_2)/2$ is correlated with $(x_3 + x_4)/2$, etc.). We can't just look at a correlation matrix of your $p$ variables, because that would only alert us to pairwise correlations. If you want a one-number summary of how bad your entire design matrix is, you could use the determinant of $\mathbf{X}^\intercal \mathbf{X}$ (a quick numerical check of this and of point 1 is sketched after this list).

  3. VIF gives you a measure of how bad things are for a specific coefficient estimate. Say you're interested in $\hat{\beta}_3$, the third predictor's coefficient. Then

\begin{align*} V[\hat{\beta}_3] &= \frac{\sigma^2}{\mathbf{x}_3^\intercal\mathbf{x}_3 - \mathbf{x}_3^\intercal\mathbf{X}_{-3} [\mathbf{X}_{-3}^\intercal\mathbf{X}_{-3}]^{-1}\mathbf{X}_{-3}^\intercal\mathbf{x}_3 } \\ &= \frac{\sigma^2}{SS_{T,3} }\left[\frac{1}{1 - R^2_{3} }\right] \end{align*} where $\mathbf{X}_{-3}$ is the design matrix with its third column $\mathbf{x}_3$ removed, $SS_{T,3}$ is the (centered) total sum of squares of $\mathbf{x}_3$, and $R^2_3$ is the $R^2$ from regressing $\mathbf{x}_3$ on the other predictors. We call $\text{VIF}_3 = \frac{1}{1 - R^2_{3}}$ the variance inflation factor for the third coefficient: the variance of your estimate is proportional to it, so bigger is obviously bad (a worked computation is sketched after this list as well).

  4. Dummy variables don't cause multicollinearity in and of themselves. Partially redundant dummy variables do. Say you have two categorical variables in your data set: sex and gender. If your particular data set had only cisgender folks, the numerical dummy variables generated from those two categorical columns would be identical in every row. That would cause perfect collinearity. On the other hand, if most, but not all, people in your data set were cisgender, then you could still end up with non-perfect multicollinearity and high VIFs.
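
To make points 1 and 2 concrete, here is a rough numerical check on simulated data with invented names (the condition number shown is an extra diagnostic beyond the determinant mentioned above):

```python
# Exact vs. near linear dependence in the design matrix X (intercept included).
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Point 1: an exactly redundant column makes rank(X) < number of columns,
# so X'X is singular and the OLS estimator is not unique.
X_exact = np.column_stack([np.ones(n), x1, x2, x1 + x2])
print(np.linalg.matrix_rank(X_exact), X_exact.shape[1])   # e.g. 3 vs 4

# Point 2: near dependence leaves X'X invertible but badly behaved, which
# shows up as a determinant near zero and a huge condition number.
X_near = np.column_stack([np.ones(n), x1, x2, x1 + x2 + 1e-6 * rng.normal(size=n)])
XtX = X_near.T @ X_near
print(np.linalg.det(XtX))      # tiny relative to the scale of the entries
print(np.linalg.cond(XtX))     # enormous
```

One practical caveat: the determinant depends on the units of the columns, which is why the condition number is often the easier number to read.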
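
And here is the $\text{VIF}_3$ computation from point 3 done the long way, via the auxiliary-regression $R^2_3$, again on simulated data with invented names:

```python
# Compute VIF_3: regress x3 on the other predictors, take R^2_3,
# then VIF_3 = 1 / (1 - R^2_3).
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.8 * x1 - 0.5 * x2 + rng.normal(size=n)   # x3 partly explained by x1, x2

# Auxiliary regression of x3 on an intercept plus the other predictors.
X_other = np.column_stack([np.ones(n), x1, x2])
beta_aux, *_ = np.linalg.lstsq(X_other, x3, rcond=None)
resid = x3 - X_other @ beta_aux
r2_3 = 1.0 - (resid @ resid) / ((x3 - x3.mean()) @ (x3 - x3.mean()))
vif_3 = 1.0 / (1.0 - r2_3)
print(r2_3, vif_3)

# Cross-check: the same VIF sits on the diagonal of the inverse correlation
# matrix of (x1, x2, x3) -- exactly the quantity asked about in the question.
R = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(np.diag(np.linalg.inv(R))[2])             # matches vif_3 up to rounding
```

The cross-check at the end ties this back to the question: the diagonal of the inverse correlation matrix and the auxiliary-regression formula give the same number.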
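
Finally, a toy version of the sex/gender example in point 4 (hypothetical data; the column names and the 2% figure are invented for the illustration):

```python
# Partially redundant dummy variables: perfect vs. near-perfect collinearity.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
sex = rng.choice(["female", "male"], size=n)
d_sex = (sex == "female").astype(float)          # dummy for sex

# Case 1: everyone in the sample is cisgender -> the gender dummy is an
# exact copy of the sex dummy, and the design matrix is rank deficient.
d_gender = d_sex.copy()
X = np.column_stack([np.ones(n), d_sex, d_gender])
print(np.linalg.matrix_rank(X), X.shape[1])      # 2 vs 3: perfect collinearity

# Case 2: most, but not all, rows are cisgender -> the dummies differ in a
# few rows, the correlation matrix is invertible, but the VIFs are inflated.
flip = rng.random(n) < 0.02                      # ~2% of rows differ
d_gender_mixed = np.where(flip, 1.0 - d_sex, d_sex)
R = np.corrcoef(np.column_stack([d_sex, d_gender_mixed]), rowvar=False)
print(R[0, 1])                                   # just below 1
print(np.diag(np.linalg.inv(R)))                 # large VIFs for both dummies
```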