Multicollinearity Testing – How to Test for Multicollinearity Among Non-linearly Related Independent Variables

multicollinearity nonlinear

I have a health outcome (measured as a rate of cases per 10,000 people in an administrative zone) that I'd like to associate with 15 independent variables (social, economic, and environmental measures of those same administrative zones) through some kind of model (I'm thinking a Poisson GLM or negative binomial if there's overdispersion). From scatterplot investigation I know there is multicollinearity among the variables that I need to investigate and deal with. I am not against removing variables, and within any problematic combinations can justify choosing one variable over the other(s) for reasons of interest or cost/ease of collection.

In the past I used the correlation matrix to detect multicollinearity, but I've been reading around on this site and discovered VIF and the condition index/number, which seem to be generally accepted as better options. My question is how any of these measures work with non-linear correlations, which is what I have between nearly all of my variables (determined graphically). What are my options for evaluating multicollinearity besides Spearman correlations?

Aside: I realize this is probably a separate question, but in case it's relevant here, I also have a lot of non-monotonic correlations among my variables that I don't know what to do with…

Thanks for any help!

Best Answer

Multicollinearity is all about the linear relationships among your independent/explanatory/right-hand-side/x-variables. That you want to use those variables in a non-linear model does not matter. The logic behind that is that if you want to add both variables to your model, then you have to be able to distinguish between a unit change in one variable and a unit change in the other. If the variables are linearly related, then a unit change in one coincides with a change of $k$ units in the other, where $k$ is some constant, so we cannot determine the separate effects of the two variables. If the relationship is non-linear, a unit change in one variable coincides with a varying number of units of change in the other, so we can distinguish between the variables. So if you have graphically determined that there is a relationship among your variables, but that relationship is non-linear, then that fact alone has already solved most of your problem.
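To make this concrete, here is a minimal sketch in Stata (simulated data; the variable names are hypothetical). Note that the VIF looks only at the x-variables, so the outcome below is just a placeholder:

* Simulate one near-linear and one non-linear companion for x1
clear
set obs 500
set seed 12345
generate x1 = runiform(0, 10)
generate x2 = 2*x1 + 3 + rnormal(0, 0.1)    // almost exactly linear in x1
generate x3 = (x1 - 5)^2 + rnormal(0, 0.1)  // strongly related to x1, but non-linearly
generate y  = rnormal()                     // placeholder outcome

regress y x1 x2
estat vif    // huge VIFs: x1 and x2 are nearly linearly dependent

regress y x1 x3
estat vif    // VIFs near 1: the strong relationship is non-linear

If you also want the condition index, the user-written collin command (available via ssc install collin) reports it alongside the VIFs.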

Consider the following example: if we add a quadratic curve, that is, we add both a variable $x$ and a variable $x^2$ to our model, then the relationship between $x$ and $x^2$ is extremely strong. Still, we can estimate that model, precisely because the relationship is non-linear.
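A minimal sketch of that point (again simulated): over the range 0 to 10, $x$ and $x^2$ are very strongly correlated, yet both coefficients come back cleanly:

clear
set obs 200
set seed 2718
generate x  = runiform(0, 10)
generate x2 = x^2
generate y  = 1 + 0.5*x - 0.2*x2 + rnormal()
regress y x x2   // both coefficients are recovered without trouble
estat vif        // clearly elevated VIFs (corr(x, x^2) is about 0.97), yet estimation is fine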

I find it informative to see a situation where this can break down. Suppose we have a study where we want to include year of birth, which ranges between 1950 and 1990. If we just add that variable and its square, then you might get into trouble, as the relationship between birthyear and birthyear$^2$ is almost exactly linear over that range, as you can see below. You can solve this by centering at a meaningful value within the range of your data, e.g. 1960. As you can see in the second graph, the relationship is now clearly non-linear, and that is usually enough to solve the problem.

[Two graphs: birthyear$^2$ over 1950–1990 (uncentered, almost a straight line) and (birthyear − 1960)$^2$ (centered, clearly curved)]

I created those graphs with Stata using the following code:

* Uncentered: x^2 is almost a straight line over 1950-1990
twoway function xsquare = x^2, range(1950 1990) ///
    name(a, replace) title(uncentered) ytitle("x{sup:2}")
* Centered at 1960: the curvature is clearly visible
twoway function xsquare = (x-1960)^2, range(1950 1990) ///
    name(b, replace) title(centered) ytitle("(x-1960){sup:2}")
* Combine the two panels side by side
graph combine a b, ysize(3)
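To tie this back to the diagnostics mentioned in the question, here is a sketch (again with simulated data and hypothetical variable names) of what the VIFs look like before and after centering:

clear
set obs 1000
set seed 42
generate birthyear = floor(runiform(1950, 1991))
generate y = rnormal()    // placeholder outcome

generate by2 = birthyear^2
regress y birthyear by2
estat vif    // enormous VIFs: birthyear and birthyear^2 are nearly collinear

generate cby  = birthyear - 1960
generate cby2 = cby^2
regress y cby cby2
estat vif    // much smaller VIFs after centering at 1960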