Regression Variables – What Variables Need to Be Controlled for in Regression Analysis?

correlation, hypothesis testing, psychology, regression

There are numerous discussions on this site concerning how to control for certain variables in regression analysis.

However, there exist unlimited variables in the universe. And in psychology/epidemiology research, there are many demographic variables (e.g., age, gender, income, marital status, number of children). When do we need to control for them? Is there a rule of thumb?

For example, if income is expected to affect my DV but is not significantly correlated with it, shall I control for it? Alternatively, if age is not expected to influence my DV but is significantly correlated with it, shall I control for it?

Best Answer

If there are theoretical grounds for suspecting a variable is a confounder, then it should be included in the model to correct for its effect. On the other hand, mediators should generally not be included in the model. While it might seem like a good idea to correct for as many potential confounders as possible, there are actually a good number of reasons not to.

When to Correct for a Variable

A good, yet not always helpful answer to this question is:

"When you as an expert in your field believe the variable to affect your outcome."

First, let's discuss why this is a good answer. There are many important reasons why it is a bad idea to correct for a large number of variables. Sure, there may be unlimited variables in the universe, but...

  1. ...these do not all have a unique effect on the outcome, and including variables with high pairwise correlation will result in multicollinearity;
  2. ...you don't have unlimited data, and every variable you model costs you degrees of freedom;
  3. ...including (too) many variables (which poorly predict the outcome) results in overfitting.

Multicollinearity occurs when an explanatory variable can itself be explained as a combination of other explanatory variables. In other words, including everything that might affect the outcome means that many of the variables will also have some effect on each other. To make matters worse, there need not even be high pairwise correlation between explanatory variables, as long as one or more can be explained in terms of the others.
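To make this concrete, here is a minimal sketch (my own illustration, not part of the original answer) using statsmodels: a third predictor that is nearly a linear combination of two others inflates the coefficient standard errors, and the variance inflation factor (VIF) flags it even though no single pairwise correlation is extreme.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.5 * x1 + 0.5 * x2 + rng.normal(scale=0.1, size=n)  # nearly a combination of x1 and x2
y = 2 * x1 - x2 + rng.normal(size=n)

# Design matrix with an intercept column at index 0
X = sm.add_constant(np.column_stack([x1, x2, x3]))
fit = sm.OLS(y, X).fit()
print("coefficient standard errors:", np.round(fit.bse[1:], 2))  # inflated by the redundancy

# A VIF well above 10 is a common rough warning sign of multicollinearity
for i in range(1, X.shape[1]):
    print(f"VIF for x{i}: {variance_inflation_factor(X, i):.1f}")
```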

Degrees of freedom are required to estimate every parameter. Including variables that affect the outcome marginally or not at all still costs you degrees of freedom, without a corresponding improvement in model fit. If you want to report the significance of estimates, this also means you will lose power for everything you try to correct for.

An overfitted model is fitting the stochastic part of the process, rather than the systematic part. In other words, a model with too many parameters will tend to explain variance in the outcome that is simply there due to natural random variability in the sample, rather than due to some underlying process. Overfitted models appear to perform really well on the sample, but have poor out-of-sample performance (i.e. generalize very poorly).
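A small sketch of what this looks like in practice (my own simulated data, not from the answer): adding predictors that are pure noise pushes the in-sample R² up while out-of-sample R² deteriorates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, n_noise = 100, 40
x = rng.normal(size=(n, 1))
y = 3 * x[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, n_noise))      # predictors with no real effect on y
X_full = np.hstack([x, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X_full, y, test_size=0.5, random_state=0)

# One real predictor vs. the same predictor plus 40 noise variables
for k in (1, 1 + n_noise):
    fit = LinearRegression().fit(X_tr[:, :k], y_tr)
    print(f"{k:>2} predictors: train R^2 = {fit.score(X_tr[:, :k], y_tr):.2f}, "
          f"test R^2 = {fit.score(X_te[:, :k], y_te):.2f}")
```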

Hence, a theoretical justification for the inclusion of variables is generally preferred over adding more and more variables to correct for.

Another argument in favour of the answer is that there is no simple alternative to choosing the important variables "as an expert in the field". It might seem tempting to include every possibly involved variable and then just narrow it down through some exhaustive search for the most important ones (known as stepwise regression), but this is actually a very bad idea.

Second, let's discuss why this is not always a useful answer. If expert knowledge can decide the inclusion of variables, this is the way to go. However, this approach assumes that the data-generating process is already well understood and that the choice of variables can reasonably be made. Moreover, it assumes this expert knowledge to be correct! In practice, there is often a lot of uncertainty about what can and cannot affect the outcome. Lurking variables that are excluded because they are not known to affect the outcome will never be discovered.

Because of this, there are many proposed alternatives to stepwise regression, most of which are some form of regularization (a short code sketch comparing a few of these follows the list). For example:

  • The LASSO penalty shrinks certain coefficients to zero, essentially selecting the non-zero ones;
  • Ridge regression does this in a way that better respects pairwise correlation among predictors, but it cannot shrink coefficients all the way to zero (i.e. it cannot select variables);
  • Elastic net combines the penalties;
  • Horseshoe is yet another form of shrinkage intended to be a 'best of both';
  • Partial least squares decomposes the explanatory variables into components and weights those components by their correlation with the outcome.
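As a rough illustration of the first three options, here is a sketch with simulated data and arbitrary penalty strengths (in practice `alpha` and `l1_ratio` would be chosen by cross-validation, e.g. with scikit-learn's `LassoCV` or `ElasticNetCV`):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 200, 10
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)     # only the first two predictors matter

models = {
    "LASSO":       Lasso(alpha=0.1),                     # can shrink coefficients exactly to zero
    "ridge":       Ridge(alpha=1.0),                     # shrinks, but never exactly to zero
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # mixes both penalties
}
for name, model in models.items():
    print(f"{name:>11}: {np.round(model.fit(X, y).coef_, 2)}")
```

With settings like these, you should see the L1-penalized models zero out most of the coefficients of the eight irrelevant predictors, while ridge only shrinks them towards zero.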

However, do keep in mind that there is no guarantee that the LASSO or any other method will choose the right variables. It is still better to choose which variables to include based on expert knowledge, if possible. If there are enough observations, out-of-sample predictive accuracy (e.g. estimated by cross-validation) can help decide which model is best.
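For instance, a minimal sketch of that last idea (hypothetical variables named age and income, made-up data), comparing a specification with and without a candidate covariate by cross-validated R²:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 300
age = rng.normal(40, 10, size=n)
income = rng.normal(50, 15, size=n)           # hypothetical covariate with no real effect here
y = 0.05 * age + rng.normal(size=n)

for label, X in [("age only", age.reshape(-1, 1)),
                 ("age + income", np.column_stack([age, income]))]:
    scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    print(f"{label:>13}: mean cross-validated R^2 = {scores.mean():.3f}")
```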

So does that mean we are forever stuck in a loop of deciding which variables to include? I don't think so, and I think this is where exploratory analysis can help out. If you are really clueless about the inclusion of a set of candidate variables, perhaps the first study should merely investigate potential relationships and clearly state in the report that the analysis is exploratory in nature. In a second study, new, independent dataset(s) can be used to verify which of the relationships found are not spurious. This is not too different from my field (biology), where large sets of genes, proteins or metabolites are studied with some 'shotgun' approach, followed by confirmation using a directed approach on new samples.
