OLS Regression – How to Solve Collinearity Problems

least-squares, multicollinearity, regression, variance-inflation-factor

Please see the model output below. The independent variables are properties of 2,500 companies from 32 countries, and the model tries to explain the companies' CSR (corporate social responsibility) scores.

I am worried about the VIF scores of the LAW_, GCI_ and HOF_ variables in particular, but I really need them all included in the model to connect it to the theoretical background it is built upon. All variables are discrete numeric values, except the LAW_ variables: they are dummies indicating which legal system applies in a company's country of origin (English, French, German, or Scandinavian).

[Image: regression output with coefficient estimates and VIF scores]

Among other articles, I have read this article about dealing with collinearity. An often-suggested fix is to remove the variable with the highest VIF score (in my model this would be LAW_ENG), but then the other VIF scores increase as a result. I do not have the knowledge to understand what is going on here or how to solve the problem.
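For concreteness, this is a minimal sketch of how the VIFs can be computed outside SPSS, in Python with statsmodels (the column names and values below are made-up stand-ins for the real data):

```python
# Minimal sketch: computing VIFs with statsmodels. The DataFrame is a
# made-up stand-in for the real predictors (names and values are hypothetical).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({
    "LAW_ENG": [1, 0, 0, 1, 0, 1, 0, 0],
    "LAW_FRE": [0, 1, 0, 0, 1, 0, 0, 1],
    "GCI":     [4.2, 5.1, 4.8, 5.5, 4.0, 5.2, 4.6, 4.9],
    "HOF_PDI": [35, 68, 35, 40, 68, 35, 57, 68],
})

# Add the intercept column; VIFs are conventionally computed with it included,
# but it is skipped when reporting (index 0 is the constant).
X_const = sm.add_constant(X)
vifs = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vifs.sort_values(ascending=False))
```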

I have uploaded the corresponding data here (in SPSS .sav format). I would really appreciate it if somebody with more experience could take a quick look and suggest a way to solve the collinearity problem without removing any (or too many) variables.

Any help is greatly appreciated.

P.S. For reference, I am including a correlation table:

[Image: correlation table of the model variables]

Best Answer

When variables are collinear, you can sometimes think of them as different manifestations of the same thing. Say I had a dataset of cat happiness, a variable for whether each cat was soaking wet, and a variable for whether there were nearby children who thought it was fun to throw cats into water. Clearly cats don't like water, yet sometimes they fall in on their own. More often, however, they are thrown in by malevolent children. And sometimes the malevolent children fail to throw the cats in.

So wet cats and malevolent children are different things, but they can be thought of as a single dynamic. If a researcher were only interested in the effect of wetness on cat happiness and didn't control for malevolent children, the estimates would be biased. Include both, and the VIFs go up. This is because you simply don't have enough independent observations of wetness to separate its effect from the effect of malevolent children.

Shrinkage estimators are one way to go: you accept some bias in your estimator in order to decrease its variance. This is appealing for prediction, but not for inference.
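As an illustration, here is a minimal ridge (shrinkage) sketch in Python on made-up wet-cat data; RidgeCV and the alpha grid are just one reasonable setup, not a prescription:

```python
# Minimal ridge sketch. X and y are simulated stand-ins mimicking the wet-cat
# story; RidgeCV picks the penalty strength by cross-validation over a grid.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
children = rng.binomial(1, 0.3, n)                        # malevolent children nearby
wet = np.clip(children + rng.binomial(1, 0.1, n), 0, 1)   # highly collinear with it
X = np.column_stack([wet, children])
y = 5 - 3 * wet - 1 * children + rng.normal(0, 1, n)      # hypothetical happiness

# Standardize so the ridge penalty treats both predictors on the same scale.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 50)))
model.fit(X, y)
ridge = model.named_steps["ridgecv"]
print("chosen alpha:", ridge.alpha_)
print("shrunken coefficients:", ridge.coef_)
```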

If you're willing to put aside (or think differently about) inference on individual model terms, you could first run a principal components analysis, "interpret" your principal components somehow, and then fit your regression to the rotated dataset. The collinearity will be gone, but you can only conduct inference on the PCs, which may or may not have a convenient interpretation. In the wet-cats example, the first PC would increase with both the probability of wetness and the probability of malevolent children; the second PC would be orthogonal to it, increasing with wetness as the probability of malevolent children decreased. If you simply wanted to know the effect of wetness absent malevolent children, you would look at the coefficient on the second PC. Most PC regressions don't have an interpretation this straightforward, however.
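A minimal sketch of that principal components route, using the same kind of simulated wet-cat data (the scikit-learn/statsmodels calls and the two-component choice are just for illustration):

```python
# Minimal principal components regression sketch: rotate the correlated
# predictors into orthogonal PCs, then run OLS on the PC scores.
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
children = rng.binomial(1, 0.3, n)
wet = np.clip(children + rng.binomial(1, 0.1, n), 0, 1)
X = np.column_stack([wet, children])
y = 5 - 3 * wet - 1 * children + rng.normal(0, 1, n)

Z = StandardScaler().fit_transform(X)    # PCA is sensitive to scale
pca = PCA(n_components=2)
scores = pca.fit_transform(Z)            # PC1 ~ "wet and children", PC2 ~ the contrast

pcr = sm.OLS(y, sm.add_constant(scores)).fit()
print(pcr.summary())
print(pca.components_)  # loadings: how each PC mixes wetness and children
```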

It is also worth emphasizing that prediction from a model with high collinearity is fine. So if your F-stat is good and you don't care about any of the coefficients individually, leave the model as it is.