Solved – High correlation between two independent variables, but no multicollinearity

Tags: correlation, multicollinearity, regression, stata

I have two independent variables which have a Pearson correlation coefficient of 0.98.

The two independent variables measure the same underlying construct at two different points in time (one is a forecast, the other the actual realization). The VIF is around 25. However, if I replace one variable with the incremental change relative to the other, I get essentially the same information from the coefficients (a base coefficient and an incremental value), but the VIF is around 1 and the multicollinearity seems to be gone. Does that automatically mean that the initial regression has no multicollinearity problem?

Best Answer

Let XF and XA be respectively the forecast of X and the actual value of X. If the forecast is good enough, the correlation between XF and XA will be high, as it is here. You use the incremental change E, which I assume is defined as E=XF-XA.

E is just the error of your forecast. It can be correlated with the actual value XA (or with XF), but it does not have to be. Here it evidently is not, given that the VIF is very low. So the multicollinearity is gone, but the interpretation of your coefficients changes: before, you had both XA and XF in your regression; now you have XA and E (or XF and E; it is not clear from your question).
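The change in interpretation can be made explicit with a little algebra. This is only a sketch: it assumes a linear model with an outcome $Y$ and takes the case where XA is kept alongside E. The symbols $\beta_0$, $\beta_A$, $\beta_F$ and $\varepsilon$ are introduced here for illustration and do not come from the question. Writing $X_A$ for XA, $X_F$ for XF and $E = X_F - X_A$:

$$
\begin{aligned}
Y &= \beta_0 + \beta_A X_A + \beta_F X_F + \varepsilon \\
  &= \beta_0 + (\beta_A + \beta_F) X_A + \beta_F (X_F - X_A) + \varepsilon \\
  &= \beta_0 + (\beta_A + \beta_F) X_A + \beta_F E + \varepsilon
\end{aligned}
$$

Under these assumptions the two regressions are the same model written in two ways. The fitted values are identical, the coefficient on E equals the coefficient on XF from the first regression, and the coefficient on XA becomes the sum of the two original coefficients. The lower VIF then reflects the new parameterization, not new information in the data.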

But absent more context, neither model makes much sense to me. In the first case, the multicollinearity is high, so it is not clear why you don't just keep one of the two variables. In the second case, the multicollinearity is low, but it is not clear why you would want the forecast error in your model in addition to the forecast itself. So, without more information, I would suggest using either XA or XF.
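The comparison between the two specifications can be reproduced on simulated data. The following do-file is a minimal sketch, not the asker's actual setup: the variable names (y, xa, xf, e), the sample size, the seed and all coefficient values are assumptions chosen so that the correlation between forecast and actual comes out near 0.98.

```stata
* Hypothetical simulation: variable names (y, xa, xf, e) and all
* numeric values below are assumptions chosen for illustration.
clear
set seed 12345
set obs 1000

generate xa = rnormal()               // "actual" value
generate xf = xa + 0.2*rnormal()      // forecast = actual + small error
generate e  = xf - xa                 // forecast error, E = XF - XA
generate y  = 1 + 0.5*xa + 0.3*xf + rnormal()

correlate xa xf                       // correlation close to 0.98 by construction

regress y xa xf                       // model 1: both levels
estat vif                             // large VIFs expected

regress y xa e                        // model 2: level plus forecast error
estat vif                             // VIFs near 1 expected
```

In a run like this, the coefficient on e in the second regression should match the coefficient on xf in the first, which is one way to see that the VIF dropped without the model itself changing.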

Related Solutions

Multicollinearity Testing – How to Test for Multicollinearity Among Non-linearly Related Independent Variables

Multicollinearity is all about the linear relationship among your independent/explanatory/right-hand-side/x-variables. That you want to use those variables in a non-linear model does not matter. The logic is that if you want to add both variables to your model, then you have to be able to distinguish between a unit change in one variable and a unit change in the other. If the variables are linearly related, then a unit change in one coincides with a $k$-unit change in the other variable, where $k$ is some constant, so we cannot determine the separate effects of the two variables. If the relationship is non-linear, a unit change in one variable coincides with a varying number of units of change in the other, so we are able to distinguish between the variables. So if you determined graphically that there is a relationship but that it is non-linear, then that fact alone has already solved most of your problems.

Consider the following example: if we add a quadratic curve, that is, we add a variable $x$ and a variable $x^2$ to our model, then the relationship between the variables $x$ and $x^2$ is extremely strong. Still we can estimate that model. The reason is that that relationship is non-linear.

I find it informative to see a situation where this can break. Suppose we have a study in which we want to include year of birth, which ranges between 1950 and 1990. If we just add that and its square, then we might get into trouble, as the relationship between birthyear and birthyear$^2$ is almost linear, as you can see below. You can solve this by centering at a meaningful value within the range of your data, e.g. 1960. As you can see in the second graph, the relationship is now non-linear, and that is usually enough to solve the problem.

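[Figure: two panels over the range 1950 to 1990. Left panel, "uncentered": x^2 against x. Right panel, "centered": (x-1960)^2 against x.]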

I created that graph with Stata using the following code:

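* Left panel: x^2 over the observed range of birth years
twoway function xsquare = x^2, range(1950 1990) ///
    name(a, replace) title(uncentered) ytitle("x{sup:2}")
* Right panel: the square after centering at 1960
twoway function xsquare = (x-1960)^2, range(1950 1990) ///
    name(b, replace) title(centered) ytitle("(x-1960){sup:2}")
* Put both panels side by side in one figure
graph combine a b, ysize(3)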
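The same point can be checked numerically instead of graphically. The sketch below assumes one observation per birth year from 1950 to 1990; the variable names are made up for this illustration.

```stata
* Sketch: one observation per birth year; variable names are assumptions.
clear
set obs 41
generate birthyear = 1949 + _n          // 1950, 1951, ..., 1990
generate by_sq     = birthyear^2        // uncentered square
generate byear_c   = birthyear - 1960   // centered at 1960
generate byear_csq = byear_c^2          // centered square

correlate birthyear by_sq               // nearly 1: almost perfectly linear
correlate byear_c byear_csq             // lower after centering
```

With evenly spaced years like these, the first correlation should be above 0.9999 and the second roughly 0.88. Centering at the middle of the range (1970) would bring it to zero for this symmetric grid of years.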

Solved – VIF calculation in regression

It is important to assess multicollinearity across all the explanatory variables jointly, because a group of three or more variables can be linearly dependent even when no single pair of them is strongly correlated.
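A small simulation can illustrate this. The sketch below is hypothetical: it builds nine independent predictors and a tenth that is almost exactly their sum, so every pairwise correlation with the tenth variable is only moderate (about 1/3 by construction) while the group as a whole is nearly linearly dependent. All names, the seed and the noise level are assumptions.

```stata
* Hypothetical example: group-level collinearity with modest pairwise correlations.
clear
set seed 2014
set obs 1000

forvalues i = 1/9 {
    generate x`i' = rnormal()           // nine independent predictors
}
* x10 is almost an exact linear combination of x1..x9
generate x10 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + 0.1*rnormal()
generate y = rnormal()                  // outcome is irrelevant for the VIFs

correlate x1-x10                        // no pairwise correlation is close to 1
regress y x1-x10
estat vif                               // yet the VIFs are very large
```

Inspecting only the correlation matrix would miss the problem here, which is why the VIF is computed from a regression of each predictor on all the others.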

The threshold for discarding explanatory variables with the Variance Inflation Factor is subjective. Here is a recommendation from The Pennsylvania State University (2014):

VIF is a measure of how much the variance of the estimated regression coefficient $b_k$ is "inflated" by the existence of correlation among the predictor variables in the model. A VIF of 1 means that there is no correlation among the $k^{\text{th}}$ predictor and the remaining predictor variables, and hence the variance of $b_k$ is not inflated at all. The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.
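For reference, the VIF is usually written in terms of $R_k^2$, the $R^2$ obtained by regressing the $k^{\text{th}}$ predictor on all the other predictors. This is the standard definition and is not spelled out in the quoted passage:

$$
\mathrm{VIF}_k = \frac{1}{1 - R_k^2}
$$

Under that definition, the rule-of-thumb thresholds correspond to $R_k^2 = 0.75$ for a VIF of 4 and $R_k^2 = 0.90$ for a VIF of 10. With only two predictors, $R_k^2$ is the squared Pearson correlation between them, so a correlation of 0.98 gives $1/(1 - 0.98^2) \approx 25$.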

Always stick to the hypotheses you formulated beforehand about the relationships between the variables, and keep the predictors that make the most sense for explaining the response variable.

Multicollinearity matters in logistic regression just as it does in other types of regression. See: Logistic Regression - Multicollinearity Concerns/Pitfalls.
