Choosing predictors in regression analysis and multicollinearity

exploratory-data-analysis, feature-selection, multicollinearity, multiple-regression, regression

I would like to run a linear regression analysis and I'm uncertain about including predictors.

I have three predictor variables available. One is based on a lot of previous research, so I am planning to enter it in the first step of a hierarchical regression. The other two predictors both make a lot of sense theoretically, but there is no previous research on their relationship to the dependent variable. The second step of the regression would therefore be more exploratory, and I thought I might enter both variables using stepwise selection.

When I do this, the first variable is a significant predictor, as is one of the two exploratory variables; the other is excluded.

The problem is:

  • There seems to be multicollinearity between the two exploratory variables: they are negatively correlated ($r = -.7$) and VIF = 2.5 (which is apparently large for a small sample of $N = 24$).
  • Does that mean I can't enter this variable in the regression? But if I can't enter it, how can I show that one of the exploratory variables makes a significant contribution to explaining variance in the DV while the other does not?
  • And if I can only enter one of the exploratory predictors, on what grounds can I decide which one? Both make sense theoretically.
  • The insignificant predictor is also insignificant if I enter it without the other exploratory variable (and it is not even significantly correlated with the DV), so it is not insignificant merely because the other exploratory variable steals its variance.
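As a quick sanity check on that VIF figure: with only two correlated predictors, the VIF follows directly from their correlation, so the reported 2.5 presumably also reflects correlation with the first predictor. A minimal sketch in Python (NumPy only; the simulated data are a hypothetical stand-in for the real measurements):

```python
import numpy as np

# With exactly two predictors, VIF = 1 / (1 - r^2); r = -.7 gives about 1.96,
# so a VIF of 2.5 suggests the other predictor also shares variance with X1.
r = -0.7
vif_pairwise = 1 / (1 - r**2)
print(round(vif_pairwise, 2))  # ≈ 1.96

# General VIF: regress one predictor on the others and use that R^2.
# Hypothetical data standing in for the real eye-tracking measures (N = 24).
rng = np.random.default_rng(0)
n = 24
x2 = rng.normal(size=n)
x3 = -0.7 * x2 + np.sqrt(1 - 0.7**2) * rng.normal(size=n)

def vif(target, others):
    """VIF = 1 / (1 - R^2) from an OLS fit of one predictor on the rest."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

print(round(vif(x2, [x3]), 2))
```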

Thanks a lot for the replies!

Here is some more information on the data. The data is from a group of neurological patients:

  • The dependent variable for the regression is performance on an ‘emotion recognition task’: the patients are asked to identify different emotions from facial expressions; the higher the score, the better the performance.
  • The first predictor X1 is an indicator of disease progression. It is well known that the ability to recognize emotions from faces declines during the course of the disease. However, emotion recognition is already affected in very early stages as well.

  • The two exploratory predictors X2 and X3 are measures obtained with eye-tracking during the presentation of faces with emotional expressions. X2 is the ratio of fixations on the eye region and X3 is the ratio of fixations on the nose/mouth region. We know that these two regions carry the most important information for recognizing emotional expressions. Additionally, in the first step of my data analysis I compared the patient group to a healthy control group and found that the patient group has lower fixation ratios for both regions of interest.
    Now, the aim of the regression is to see whether these reduced fixations on relevant areas of the face might (partly) explain why the patients have difficulty recognizing emotional expressions. I am specifically interested in whether X2 and X3 make additional contributions to explaining emotion recognition, independent of disease progression.

Different scenarios are possible, e.g.:

  • Disease progression and viewing behavior make independent contributions to emotion recognition
  • Disease progression affects viewing behavior and that affects emotion recognition (mediation)
  • Disease progression affects emotion recognition and viewing behavior independently (maybe due to general cognitive decline) – and viewing behavior does not affect emotion recognition.

It seems that X1 and X2 independently contribute to explaining emotion recognition performance, but not X3. The results are always the same no matter which predictors I put in the model. I am just not sure how to present the data, which model to choose and how to explain that choice.
The predictor correlation I am worried about is between X2 and X3 ($r = -.7$): people who spend more time looking at the eyes spend less time looking at the mouth! Can I still use X2 and X3 as predictors in one model? And what if the stepwise procedure excludes X3? How do I show that this is not a result of multicollinearity?

I am also aware that $N = 24$ in the patient group is very small and a regression might not be possible. If that is the case, I can always just report correlations…

I would like to attach the data, but I don’t know if that is possible….

Best Answer

Your approach doesn't have to be hierarchical or stepwise. Let's call your response variable (dependent variable or DV in your terms) $Y$, the apparently important predictor $X_1$ and the others $X_2$ and $X_3$. You can easily look at all the possible models, as there are at most 7 models of interest, namely

  1. $X_1$, $X_2$, $X_3$ alone,

  2. The three possible pairs,

  3. All three predictors.
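The all-subsets comparison above is easy to script. A sketch with NumPy, ranking the seven models by adjusted $R^2$ (the data here are simulated placeholders; swap in the real X1, X2, X3, and Y):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data (N = 24, as in the question); replace with the
# real disease-progression and fixation-ratio measures.
n = 24
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
X3 = -0.7 * X2 + 0.71 * rng.normal(size=n)
Y = 0.5 * X1 + 0.4 * X2 + rng.normal(scale=0.8, size=n)

predictors = {"X1": X1, "X2": X2, "X3": X3}

def adj_r2(y, cols):
    """Adjusted R^2 from an OLS fit of y on the given columns plus intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid.var() / y.var()
    k = len(cols)
    return 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)

# Fit every non-empty subset: 3 single predictors, 3 pairs, 1 triple = 7 models.
results = {}
for k in (1, 2, 3):
    for names in itertools.combinations(predictors, k):
        results[names] = adj_r2(Y, [predictors[m] for m in names])

for names, r2 in sorted(results.items(), key=lambda kv: -kv[1]):
    print("+".join(names), round(r2, 3))
```

With so few models there is no need for stepwise shortcuts: you can report the fit of every candidate model and let readers see the whole picture.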

There is much need for caution, as

  • A sample size of 24 is small for any exercise here, especially fitting a model with more than one predictor.

  • Focusing on whether the coefficient for a particular predictor is or is not significant at some conventional level is less important than understanding why that is so. A scatter plot matrix, drawn as exploratory analysis before your regressions, and residual and added-variable plots, drawn after them, would help signal whether the real problems are, say, nonlinearity, outliers, some other reason to transform, grouping of values, or whatever. Multicollinearity or other structure is not something to be guessed at from the value of some diagnostic, but something that can be explored directly by looking at the data with graphs.

  • Paying attention to previous knowledge or theory is clearly sensible, but I wouldn't pay too much attention to it. Presumably you wouldn't use $X_2$ and $X_3$ if they were not of interest. It's common to find that the well-known predictor is not as crucial as theory implies, say because it doesn't vary enough in a particular dataset; because a theory hinging on dynamics is being tested with cross-sectional data; and so on.
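Added-variable plots are also easy to check numerically: by the Frisch–Waugh–Lovell theorem, the slope in the added-variable plot for a predictor equals that predictor's coefficient in the full multiple regression. A sketch with NumPy (simulated placeholder data; in practice you would scatter-plot `ry` against `rx`):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data in place of the real measurements (N = 24).
n = 24
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 0.6 * X1 + 0.3 * X2 + rng.normal(scale=0.5, size=n)

def residuals(y, cols):
    """Residuals from an OLS fit of y on the given columns plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Added-variable plot for X2: residualize both Y and X2 on X1, then examine
# one against the other (here we just compute the slope).
ry = residuals(Y, [X1])
rx = residuals(X2, [X1])
slope = (rx @ ry) / (rx @ rx)

# The slope reproduces the coefficient of X2 in the full Y ~ X1 + X2 fit.
full = np.column_stack([np.ones(n), X1, X2])
beta_full, *_ = np.linalg.lstsq(full, Y, rcond=None)
print(round(slope, 6), round(beta_full[2], 6))
```

The scatter of `ry` versus `rx`, not just its slope, is what reveals whether that coefficient is driven by one or two outlying patients, which matters a great deal at $N = 24$.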

Can you post the data? Then guesses and prejudices could be checked against the facts.

(I have to guess that you are some kind of economist. Economists are in my experience naturally very well informed about regression, but often most reluctant to draw graphs.)