Choosing predictors in regression analysis and multicollinearity

exploratory-data-analysis, feature-selection, multicollinearity, multiple-regression, regression

I would like to run a linear regression analysis and I'm uncertain about including predictors.

I have three predictor variables available. One is based on a lot of previous research, so I am planning to enter it in the first step of a hierarchical regression. The other two predictors both make a lot of sense theoretically, but there is no previous research on their relationship to the dependent variable. The second step of the regression would therefore be more exploratory, and I thought I might enter both variables using stepwise selection.

When I do this, the first variable is a significant predictor, as is one of the two exploratory variables; the other is excluded.

The problem is:

  • There seems to be multicollinearity between the two exploratory variables: they are negatively correlated ($r = -.7$) and VIF = 2.5 (which is apparently large for a small sample of $N = 24$).
  • Does that mean I can't enter this variable in the regression? But if I can't enter it, how can I show that one of the exploratory variables makes a significant contribution to explaining variance in the DV while the other does not?
  • And if I can only enter one of the exploratory predictors, on what grounds can I decide which one? Both make sense theoretically.
  • The insignificant predictor is also insignificant if I enter it without the other exploratory variable (and it is not even significantly correlated with the DV), so it is not insignificant merely because the other exploratory variable steals its variance.
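As a quick sanity check on that VIF figure: with only two correlated predictors, the VIF follows directly from their correlation, so the reported 2.5 presumably also reflects correlation with the first predictor. A minimal sketch in Python (NumPy only; the simulated data are a hypothetical stand-in for the real measurements):

```python
import numpy as np

# With exactly two predictors, VIF = 1 / (1 - r^2); r = -.7 gives about 1.96,
# so a VIF of 2.5 suggests the other predictor also shares variance with X1.
r = -0.7
vif_pairwise = 1 / (1 - r**2)
print(round(vif_pairwise, 2))  # ≈ 1.96

# General VIF: regress one predictor on the others and use that R^2.
# Hypothetical data standing in for the real eye-tracking measures (N = 24).
rng = np.random.default_rng(0)
n = 24
x2 = rng.normal(size=n)
x3 = -0.7 * x2 + np.sqrt(1 - 0.7**2) * rng.normal(size=n)

def vif(target, others):
    """VIF = 1 / (1 - R^2) from an OLS fit of one predictor on the rest."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

print(round(vif(x2, [x3]), 2))
```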

Thanks a lot for the replies!

Here is some more information on the data. The data is from a group of neurological patients:

  • The dependent variable for the regression is performance on an ‘emotion recognition task’: the patients are asked to identify different emotions from facial expressions; the higher the score, the better the performance.
  • The first predictor X1 is an indicator of disease progression. It is well known that the ability to recognize emotions from faces declines during the course of the disease. However, emotion recognition is already affected in very early stages as well.

  • The two exploratory predictors X2 and X3 are measures obtained with eye-tracking during the presentation of faces with emotional expressions. X2 is the ratio of fixations on the eye region and X3 is the ratio of fixations on the nose/mouth region. We know that these two regions carry the most important information for recognizing emotional expressions. Additionally, in the first step of my data analysis I compared the patient group to a healthy control group and found that the patient group has lower fixation ratios for both regions of interest.
    Now, the aim of the regression is to see whether these reduced fixations on relevant areas of the face might (partly) explain why the patients have difficulty recognizing emotional expressions. I am specifically interested in whether X2 and X3 make additional contributions to explaining emotion recognition, independent of disease progression.

Different scenarios are possible, e.g.:

  • Disease progression and viewing behavior make independent contributions to emotion recognition
  • Disease progression affects viewing behavior and that affects emotion recognition (mediation)
  • Disease progression affects emotion recognition and viewing behavior independently (maybe due to general cognitive decline) – and viewing behavior does not affect emotion recognition.

It seems that X1 and X2 independently contribute to explaining emotion recognition performance, but not X3. The results are always the same no matter which predictors I put in the model. I am just not sure how to present the data, which model to choose and how to explain that choice.
The predictor correlation I am worried about is between X2 and X3 ($r = -.7$): people who spend more time looking at the eyes spend less time looking at the mouth! Can I still use X2 and X3 as predictors in one model? And what if the stepwise procedure excludes X3? How do I show that this is not a result of multicollinearity?

I am also aware that $N = 24$ in the patient group is very small and a regression might not be possible. If that is the case, I can always just report correlations…

I would like to attach the data, but I don’t know if that is possible….

Best Answer

Your approach doesn't have to be hierarchical or stepwise. Let's call your response variable (dependent variable or DV in your terms) $Y$, the apparently important predictor $X_1$ and the others $X_2$ and $X_3$. You can easily look at all the possible models, as there are at most 7 models of interest, namely

  1. $X_1$, $X_2$, $X_3$ alone,

  2. The three possible pairs,

  3. All three predictors.
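The all-subsets comparison above is easy to script. A sketch with NumPy, ranking the seven models by adjusted $R^2$ (the data here are simulated placeholders; swap in the real X1, X2, X3, and Y):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data (N = 24, as in the question); replace with the
# real disease-progression and fixation-ratio measures.
n = 24
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
X3 = -0.7 * X2 + 0.71 * rng.normal(size=n)
Y = 0.5 * X1 + 0.4 * X2 + rng.normal(scale=0.8, size=n)

predictors = {"X1": X1, "X2": X2, "X3": X3}

def adj_r2(y, cols):
    """Adjusted R^2 from an OLS fit of y on the given columns plus intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid.var() / y.var()
    k = len(cols)
    return 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)

# Fit every non-empty subset: 3 single predictors, 3 pairs, 1 triple = 7 models.
results = {}
for k in (1, 2, 3):
    for names in itertools.combinations(predictors, k):
        results[names] = adj_r2(Y, [predictors[m] for m in names])

for names, r2 in sorted(results.items(), key=lambda kv: -kv[1]):
    print("+".join(names), round(r2, 3))
```

With so few models there is no need for stepwise shortcuts: you can report the fit of every candidate model and let readers see the whole picture.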

There is much need for caution, as

  • A sample size of 24 is small for any exercise here, especially fitting a model with more than one predictor.

  • Focusing on whether the coefficient for a particular predictor is or is not significant at some conventional level is less important than understanding why that is so. A scatter plot matrix, drawn as exploratory analysis before your regressions, and residual and added-variable plots, drawn after them, would help signal whether the real problems are, say, nonlinearity, outliers, some other reason to transform, grouping of values, or whatever. Multicollinearity or other structure is not something to be guessed at from the value of some diagnostic, but something that can be explored directly by looking at the data with graphs.

  • Paying attention to previous knowledge or theory is clearly sensible, but I wouldn't pay too much attention to it. Presumably you wouldn't use $X_2$ and $X_3$ if they were not of interest. It's common to find that the well-known predictor is not as crucial as theory implies, say because it doesn't vary enough in a particular dataset; because a theory hinging on dynamics is being tested with cross-sectional data; and so on.
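Added-variable plots are also easy to check numerically: by the Frisch–Waugh–Lovell theorem, the slope in the added-variable plot for a predictor equals that predictor's coefficient in the full multiple regression. A sketch with NumPy (simulated placeholder data; in practice you would scatter-plot `ry` against `rx`):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data in place of the real measurements (N = 24).
n = 24
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 0.6 * X1 + 0.3 * X2 + rng.normal(scale=0.5, size=n)

def residuals(y, cols):
    """Residuals from an OLS fit of y on the given columns plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Added-variable plot for X2: residualize both Y and X2 on X1, then examine
# one against the other (here we just compute the slope).
ry = residuals(Y, [X1])
rx = residuals(X2, [X1])
slope = (rx @ ry) / (rx @ rx)

# The slope reproduces the coefficient of X2 in the full Y ~ X1 + X2 fit.
full = np.column_stack([np.ones(n), X1, X2])
beta_full, *_ = np.linalg.lstsq(full, Y, rcond=None)
print(round(slope, 6), round(beta_full[2], 6))
```

The scatter of `ry` versus `rx`, not just its slope, is what reveals whether that coefficient is driven by one or two outlying patients, which matters a great deal at $N = 24$.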

Can you post the data? Then guesses and prejudices could be checked against the facts.

(I have to guess that you are some kind of economist. Economists are in my experience naturally very well informed about regression, but often most reluctant to draw graphs.)