Solved – How exactly does one “control for other variables”

causality, confounding, controlling-for-a-variable, regression, statistics-in-media

Here is the article that motivated this question: Does impatience make us fat?

I liked this article, and it nicely demonstrates the concept of “controlling for other variables” (IQ, career, income, age, etc.) in order to isolate the true relationship between just the two variables in question.

Can you explain to me how you actually control for variables on a typical data set?

E.g., if you have two people with the same impatience level and BMI but different incomes, how do you treat these data? Do you categorize them into different subgroups that do have similar income, patience, and BMI? But eventually there are dozens of variables to control for (IQ, career, income, age, etc.). How do you then aggregate these (potentially) hundreds of subgroups? In fact, I have a feeling this approach is barking up the wrong tree, now that I’ve verbalized it.

Thanks for shedding any light on something I've meant to get to the bottom of for a few years now…!

Best Answer

There are many ways to control for variables.

The easiest, and the one you came up with, is to stratify your data so you have subgroups with similar characteristics - there are then methods to pool those results together to get a single "answer". This works if you have a very small number of variables to control for, but as you've rightly discovered, it rapidly falls apart as you split your data into smaller and smaller chunks.
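Just to make the stratification idea concrete (this is my own toy sketch with simulated data and made-up names, not anything from the article): estimate the exposure-outcome slope separately within each stratum of the variable you are controlling for, then pool the stratum-specific estimates, for instance with a simple weighted average.

# A minimal stratification sketch in R, using simulated data.
set.seed(1)
stratum  <- sample(0:1, 200, replace=TRUE)                      # the variable we control for
exposure <- runif(200, 0, 1) + 0.3*stratum                      # exposure depends on the stratum
outcome  <- 2 + 0.5*exposure + 0.25*stratum + rnorm(200, sd=0.1)

# Slope of outcome on exposure within each stratum...
slopes <- sapply(split(data.frame(exposure, outcome), stratum),
                 function(d) coef(lm(outcome ~ exposure, data=d))["exposure"])

# ...pooled with weights proportional to stratum size.
sum(slopes * as.numeric(table(stratum)/length(stratum)))        # close to the true value of 0.5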

A more common approach is to include the variables you want to control for in a regression model. For example, if you have a regression model that can be conceptually described as:

BMI = Impatience + Race + Gender + Socioeconomic Status + IQ

The estimate you will get for Impatience will be the effect of Impatience within levels of the other covariates - regression allows you to essentially smooth over places where you don't have much data (the problem with the stratification approach), though this should be done with caution.
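In R, and assuming you had a data frame (say, dat) with columns named after those variables, that conceptual model would translate to something like:

# Hypothetical data frame and column names, just to show the syntax;
# "SES" stands in for Socioeconomic Status.
fit <- lm(BMI ~ Impatience + Race + Gender + SES + IQ, data=dat)
summary(fit)   # the Impatience row is the estimate adjusted for the other variables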

There are yet more sophisticated ways of controlling for other variables, but odds are when someone says "controlled for other variables", they mean they were included in a regression model.

Alright, you've asked for an example you can work on, to see how this goes. I'll walk you through it step by step. All you need is a copy of R installed.

First, we need some data. Cut and paste the following chunks of code into R. Keep in mind this is a contrived example I made up on the spot, but it shows the process.

covariate <- sample(0:1, 100, replace=TRUE)        # a binary confounder (e.g., gender or smoking status)
exposure  <- runif(100,0,1)+(0.3*covariate)        # the exposure depends partly on the covariate
outcome   <- 2.0+(0.5*exposure)+(0.25*covariate)   # true effects: 0.5 for exposure, 0.25 for the covariate

That's your data. Note that we already know the relationship between the outcome, the exposure, and the covariate - that's the point of many simulation studies (of which this is an extremely basic example). You start with a structure you know, and you make sure your method can get you the right answer.

Now then, onto the regression model. Type the following:

lm(outcome~exposure)

Did you get an Intercept = 2.0 and an exposure = 0.6766? Or something close to it, given there will be some random variation in the data? Good - this answer is wrong. We know it's wrong. Why is it wrong? We have failed to control for a variable that affects both the outcome and the exposure. It's a binary variable, make it anything you please - gender, smoker/non-smoker, etc.
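As an aside (my own addition, not part of the original walkthrough), you can see exactly where that inflated number comes from: with an omitted confounder, the naive slope equals the true exposure effect plus the covariate's effect times the slope from regressing the covariate on the exposure.

# Omitted-variable bias check: should reproduce the exposure coefficient
# from lm(outcome ~ exposure) above.
gamma <- coef(lm(covariate ~ exposure))["exposure"]
0.5 + 0.25*gamma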

Now run this model:

lm(outcome~exposure+covariate)

This time you should get Intercept = 2.00, exposure = 0.50, and covariate = 0.25. This, as we know, is the right answer. You've controlled for other variables.

Now, what happens when we don't know if we've taken care of all of the variables we need to (we never really do)? This is called residual confounding, and it's a concern in most observational studies - that we have controlled imperfectly, and our answer, while close to right, isn't exact. Does that help more?
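If it helps to see residual confounding in the same toy-simulation style (again my own extension, with an invented "unmeasured" variable): add a second confounder, adjust for only the one you measured, and the estimate lands near 0.5 but not on it.

# Residual confounding sketch: an extra confounder we never adjust for.
unmeasured <- sample(0:1, 100, replace=TRUE)
covariate  <- sample(0:1, 100, replace=TRUE)
exposure   <- runif(100, 0, 1) + 0.3*covariate + 0.2*unmeasured
outcome    <- 2.0 + 0.5*exposure + 0.25*covariate + 0.25*unmeasured

coef(lm(outcome ~ exposure + covariate))["exposure"]   # close to 0.5, but still biased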
