Regression Variables – What Variables Need to Be Controlled for in Regression Analysis?

correlation, hypothesis testing, psychology, regression

There are numerous discussions on this site concerning how to control for certain variables in regression analysis.

However, there exist unlimited variables in the universe. And in psychology/epidemiology research, there are many demographic variables (e.g., age, gender, income, marital status, number of children). When do we need to control for them? Is there a rule of thumb?

For example, if income is expected to affect my DV but is not significantly correlated with it, shall I control for it? Alternatively, if age is not expected to influence my DV but is significantly correlated with it, shall I control for it?

Best Answer

If there are theoretical grounds for suspecting a variable is a confounder, then it should be included in the model to correct for its effect. On the other hand, mediators should generally not be included in the model. While it might seem like a good idea to correct for as many potential confounders as possible, there are actually a good number of reasons not to.

When to Correct for a Variable

A good, yet not always helpful answer to this question is:

"When you as an expert in your field believe the variable to affect your outcome."

First, let's discuss why this is a good answer. There are many important reasons why it is a bad idea to correct for a large number of variables. Sure, there may be unlimited variables in the universe, but...

  1. ...these do not all have a unique effect on the outcome, and including variables with high pairwise correlation will result in multicollinearity;
  2. ...you don't have unlimited data, and every variable you model costs you degrees of freedom;
  3. ...including (too) many variables (which poorly predict the outcome) results in overfitting.

Multicollinearity occurs when an explanatory variable can itself be explained as a combination of other explanatory variables. In other words, including everything that might affect the outcome means that many of the variables will also have some effect on each other. To make matters worse, there need not even be high pairwise correlation between explanatory variables, as long as one or more can be explained in terms of the others.
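To make this concrete, here is a minimal sketch (my own illustration, not part of the original answer) using statsmodels: a third predictor that is nearly a linear combination of two others inflates the coefficient standard errors, and the variance inflation factor (VIF) flags it even though no single pairwise correlation is extreme.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.5 * x1 + 0.5 * x2 + rng.normal(scale=0.1, size=n)  # nearly a combination of x1 and x2
y = 2 * x1 - x2 + rng.normal(size=n)

# Design matrix with an intercept column at index 0
X = sm.add_constant(np.column_stack([x1, x2, x3]))
fit = sm.OLS(y, X).fit()
print("coefficient standard errors:", np.round(fit.bse[1:], 2))  # inflated by the redundancy

# A VIF well above 10 is a common rough warning sign of multicollinearity
for i in range(1, X.shape[1]):
    print(f"VIF for x{i}: {variance_inflation_factor(X, i):.1f}")
```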

Degrees of freedom are required to estimate every parameter. Including variables that affect the outcome marginally or not at all still costs you degrees of freedom, without a corresponding improvement in model fit. If you want to report the significance of estimates, this also means you will lose power for everything you try to correct for.

An overfitted model is fitting the stochastic part of the process, rather than the systematic part. In other words, a model with too many parameters will tend to explain variance in the outcome that is simply there due to natural random variability in the sample, rather than due to some underlying process. Overfitted models appear to perform really well on the sample, but have poor out-of-sample performance (i.e. generalize very poorly).
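A small sketch of what this looks like in practice (my own simulated data, not from the answer): adding predictors that are pure noise pushes the in-sample R² up while out-of-sample R² deteriorates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, n_noise = 100, 40
x = rng.normal(size=(n, 1))
y = 3 * x[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, n_noise))      # predictors with no real effect on y
X_full = np.hstack([x, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X_full, y, test_size=0.5, random_state=0)

# One real predictor vs. the same predictor plus 40 noise variables
for k in (1, 1 + n_noise):
    fit = LinearRegression().fit(X_tr[:, :k], y_tr)
    print(f"{k:>2} predictors: train R^2 = {fit.score(X_tr[:, :k], y_tr):.2f}, "
          f"test R^2 = {fit.score(X_te[:, :k], y_te):.2f}")
```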

Hence, a theoretical justification for the inclusion of variables is generally preferred over adding more and more variables to correct for.

Another argument in favour of the answer is that there is no simple alternative to choosing the important variables "as an expert in the field". It might seem tempting to include every possibly involved variable and then just narrow it down through some exhaustive search for the most important ones (known as stepwise regression), but this is actually a very bad idea.

Second, let's discuss why this is not always a useful answer. If expert knowledge can decide the inclusion of variables, this is the way to go. However, this approach assumes that the data-generating process is already well understood and that the choice of variables can reasonably be made. Moreover, it assumes this expert knowledge to be correct! In practice, there is often a lot of uncertainty about what can and cannot affect the outcome. Lurking variables that are excluded because they are not known to affect the outcome will never be discovered.

Because of this, there are many proposed alternatives to stepwise regression, most of which are some form of regularization (a short code sketch comparing a few of these follows the list). For example:

  • The LASSO penalty shrinks certain coefficients to zero, essentially selecting the non-zero ones;
  • Ridge regression does this in a way that better respects pairwise correlation among predictors, but it cannot shrink coefficients all the way to zero (i.e. it cannot select variables);
  • Elastic net combines the penalties;
  • Horseshoe is yet another form of shrinkage intended to be a 'best of both';
  • Partial least squares decomposes the explanatory variables into components and weights those components by their correlation with the outcome.
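As a rough illustration of the first three options, here is a sketch with simulated data and arbitrary penalty strengths (in practice `alpha` and `l1_ratio` would be chosen by cross-validation, e.g. with scikit-learn's `LassoCV` or `ElasticNetCV`):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 200, 10
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)     # only the first two predictors matter

models = {
    "LASSO":       Lasso(alpha=0.1),                     # can shrink coefficients exactly to zero
    "ridge":       Ridge(alpha=1.0),                     # shrinks, but never exactly to zero
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # mixes both penalties
}
for name, model in models.items():
    print(f"{name:>11}: {np.round(model.fit(X, y).coef_, 2)}")
```

With settings like these, you should see the L1-penalized models zero out most of the coefficients of the eight irrelevant predictors, while ridge only shrinks them towards zero.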

However, do keep in mind that there is no guarantee that the LASSO or any other method will choose the right variables. It is still better to choose which variables to include based on expert knowledge, if possible. If there are enough observations, out-of-sample predictive accuracy (e.g. estimated by cross-validation) can help decide which model is best.
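For instance, a minimal sketch of that last idea (hypothetical variables named age and income, made-up data), comparing a specification with and without a candidate covariate by cross-validated R²:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 300
age = rng.normal(40, 10, size=n)
income = rng.normal(50, 15, size=n)           # hypothetical covariate with no real effect here
y = 0.05 * age + rng.normal(size=n)

for label, X in [("age only", age.reshape(-1, 1)),
                 ("age + income", np.column_stack([age, income]))]:
    scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    print(f"{label:>13}: mean cross-validated R^2 = {scores.mean():.3f}")
```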

So does that mean we are forever stuck in a loop of deciding which variables to include? I don't think so, and I think this is where exploratory analysis can help out. If you are really clueless about the inclusion of a set of candidate variables, perhaps the first study should merely investigate potential relationships and clearly state in the report that the analysis is exploratory in nature. In a second study, new, independent dataset(s) can be used to verify which of the relationships found are not spurious. This is not too different from my field (biology), where large sets of genes, proteins or metabolites are studied with some 'shotgun' approach, followed by confirmation using a directed approach on new samples.
