Multicollinearity Testing – How to Test for Multicollinearity Among Non-linearly Related Independent Variables

multicollinearity nonlinear

I have a health outcome (measured as a rate of cases per 10,000 people in an administrative zone) that I'd like to associate with 15 independent variables (social, economic, and environmental measures of those same administrative zones) through some kind of model (I'm thinking a Poisson GLM or negative binomial if there's overdispersion). From scatterplot investigation I know there is multicollinearity among the variables that I need to investigate and deal with. I am not against removing variables, and within any problematic combinations can justify choosing one variable over the other(s) for reasons of interest or cost/ease of collection.

In the past I used the correlation matrix to detect multicollinearity, but I've been reading around on this site and discovered VIF and the condition index/number, which seem to be generally accepted as better options. My question is how any of these measures work with non-linear correlations, which is what I have between nearly all of my variables (determined graphically). What are my options for evaluating multicollinearity besides Spearman correlations?

Aside: I realize this is probably a separate question, but in case it's relevant here, I also have a lot of non-monotonic correlations among my variables that I don't know what to do with…

Thanks for any help!

Best Answer

Multicollinearity is all about the linear relationships among your independent/explanatory/right-hand-side/x-variables. That you want to use those variables in a non-linear model does not matter. The logic behind that is that if you want to add both variables to your model, then you have to be able to distinguish between a unit change in one variable and a unit change in the other. If the variables are linearly related, then a unit change in one coincides with a change of $k$ units in the other, where $k$ is some constant, so we cannot determine the separate effects of the two variables. If the relationship is non-linear, a unit change in one variable coincides with a varying number of units of change in the other, so we can distinguish between the variables. So if you have graphically determined that there is a relationship among your variables, but that relationship is non-linear, then that fact alone has already solved most of your problem.
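To make this concrete, here is a minimal sketch in Stata (simulated data; the variable names are hypothetical). Note that the VIF looks only at the x-variables, so the outcome below is just a placeholder:

* Simulate one near-linear and one non-linear companion for x1
clear
set obs 500
set seed 12345
generate x1 = runiform(0, 10)
generate x2 = 2*x1 + 3 + rnormal(0, 0.1)    // almost exactly linear in x1
generate x3 = (x1 - 5)^2 + rnormal(0, 0.1)  // strongly related to x1, but non-linearly
generate y  = rnormal()                     // placeholder outcome

regress y x1 x2
estat vif    // huge VIFs: x1 and x2 are nearly linearly dependent

regress y x1 x3
estat vif    // VIFs near 1: the strong relationship is non-linear

If you also want the condition index, the user-written collin command (available via ssc install collin) reports it alongside the VIFs.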

Consider the following example: if we add a quadratic curve, that is, we add both a variable $x$ and a variable $x^2$ to our model, then the relationship between $x$ and $x^2$ is extremely strong. Still, we can estimate that model, precisely because the relationship is non-linear.
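A minimal sketch of that point (again simulated): over the range 0 to 10, $x$ and $x^2$ are very strongly correlated, yet both coefficients come back cleanly:

clear
set obs 200
set seed 2718
generate x  = runiform(0, 10)
generate x2 = x^2
generate y  = 1 + 0.5*x - 0.2*x2 + rnormal()
regress y x x2   // both coefficients are recovered without trouble
estat vif        // clearly elevated VIFs (corr(x, x^2) is about 0.97), yet estimation is fine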

I find it informative to see a situation where this can break down. Suppose we have a study where we want to include year of birth, which ranges between 1950 and 1990. If we just add that variable and its square, then you might get into trouble, as the relationship between birthyear and birthyear$^2$ is almost exactly linear over that range, as you can see below. You can solve this by centering at a meaningful value within the range of your data, e.g. 1960. As you can see in the second graph, the relationship is now clearly non-linear, and that is usually enough to solve the problem.

[Two graphs: birthyear$^2$ over 1950–1990 (uncentered, almost a straight line) and (birthyear − 1960)$^2$ (centered, clearly curved)]

I created those graphs with Stata using the following code:

* Uncentered: x^2 is almost a straight line over 1950-1990
twoway function xsquare = x^2, range(1950 1990) ///
    name(a, replace) title(uncentered) ytitle("x{sup:2}")
* Centered at 1960: the curvature is clearly visible
twoway function xsquare = (x-1960)^2, range(1950 1990) ///
    name(b, replace) title(centered) ytitle("(x-1960){sup:2}")
* Combine the two panels side by side
graph combine a b, ysize(3)
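To tie this back to the diagnostics mentioned in the question, here is a sketch (again with simulated data and hypothetical variable names) of what the VIFs look like before and after centering:

clear
set obs 1000
set seed 42
generate birthyear = floor(runiform(1950, 1991))
generate y = rnormal()    // placeholder outcome

generate by2 = birthyear^2
regress y birthyear by2
estat vif    // enormous VIFs: birthyear and birthyear^2 are nearly collinear

generate cby  = birthyear - 1960
generate cby2 = cby^2
regress y cby cby2
estat vif    // much smaller VIFs after centering at 1960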