The difference between controlling for a variable in a regression model vs. controlling for a variable in your study design

Tags: controlling-for-a-variable, experiment-design, regression

I imagine that controlling for a variable in your study design is more effective at reducing error than controlling for it post hoc in your regression model.

Would someone mind explaining formally how these two instances of "controlling" differ? How comparatively effective are they at reducing error and yielding more precise predictions?

Best Answer

By "controlling for a variable in your study design", I assume you mean causing a variable to be constant across all study units or manipulating a variable so that the level of that variable is independently set for each study unit. That is, controlling for a variable in your study design means that you are conducting a true experiment. The benefit of this is that it can help with inferring causality.

In theory, controlling for a variable in your regression model can also help with inferring causality. However, this is only the case if you control for every variable that has a direct causal connection to the response. If you omit such a variable (perhaps you didn't know to include it), and it is correlated with any of the other variables, then your causal inferences will be biased and incorrect. In practice, we don't know all the relevant variables, so statistical control is a fairly dicey endeavor that relies on big assumptions you can't check.
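
To see the omitted-variable problem concretely, here is a small simulation in the same spirit (again my own sketch, with invented numbers): a confounder z drives both x and y, and leaving it out of the regression biases the coefficient on x.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# z has a direct causal connection to y and is correlated with x.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 2.0 * x + 3.0 * z + rng.normal(size=n)  # true direct effect of x is 2.0

# Controlling for z recovers ~2.0; omitting z does not.
X_full = np.column_stack([np.ones(n), x, z])
X_omit = np.column_stack([np.ones(n), x])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
beta_omit, *_ = np.linalg.lstsq(X_omit, y, rcond=None)

print(f"controlling for z: beta_x = {beta_full[1]:.2f}")  # ~2.0
print(f"omitting z:        beta_x = {beta_omit[1]:.2f}")  # ~3.5, biased
```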

However, your question asks about "reducing error and yielding more precise predictions", not inferring causality. This is a different issue. If you were to make a given variable constant through your study design, all of the variability in the response due to that variable would be eliminated. On the other hand, if you merely control for a variable statistically, you are estimating its effect, which is subject to sampling error at a minimum. In other words, statistical control wouldn't be quite as good, in the long run, at reducing residual variance in your sample.
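
Here is a rough simulation of that comparison (my own sketch; the coefficients and sample size are arbitrary). In one arm the nuisance variable w is held constant by design, so it contributes no variability at all; in the other it varies and is adjusted for in the regression, so its effect has to be estimated. The design-controlled arm yields a slightly smaller standard error for the treatment effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_sims = 50, 2000
se_design, se_statistical = [], []

def se_of_treatment(X, y):
    """Standard error of the coefficient on the treatment column (column 1)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    return np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

for _ in range(n_sims):
    x = rng.binomial(1, 0.5, size=n).astype(float)  # treatment indicator

    # Design control: the nuisance variable w is fixed (say, at 0),
    # so it adds no variability to the response.
    y_design = 2.0 * x + rng.normal(size=n)
    se_design.append(se_of_treatment(np.column_stack([np.ones(n), x]), y_design))

    # Statistical control: w varies and its effect must be estimated.
    w = rng.normal(size=n)
    y_stat = 2.0 * x + 1.5 * w + rng.normal(size=n)
    se_statistical.append(
        se_of_treatment(np.column_stack([np.ones(n), x, w]), y_stat))

# The gap is small but systematic: statistical control spends data
# (and a degree of freedom) estimating w's effect.
print(f"mean SE, design control:      {np.mean(se_design):.3f}")
print(f"mean SE, statistical control: {np.mean(se_statistical):.3f}")
```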

But if you are interested in reducing error and getting more precise predictions, presumably you primarily care about out-of-sample properties, not the precision within your sample. And therein lies the rub. When you control for a variable by manipulating it in some form (holding it constant, etc.), you create a situation that is more artificial than the original, natural observation. That is, experiments tend to have less external validity / generalizability than observational studies.

In case it's not clear, an example of a true experiment that holds something constant might be assessing a treatment in a mouse model using inbred mice that are all genetically identical. On the other hand, an example of controlling for a variable might be representing family history of disease by a dummy code and including that variable in a multiple regression model (cf., How exactly does one “control for other variables”?, and How can adding a 2nd IV make the 1st IV significant?).
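
As a toy sketch of the dummy-coding version (my illustration; the variable names and numbers are invented), family history simply enters the model as a 0/1 regressor alongside the treatment indicator:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

treatment = rng.binomial(1, 0.5, size=n).astype(float)
family_history = rng.binomial(1, 0.3, size=n).astype(float)  # dummy: 1 = positive history

# Hypothetical outcome depending on both variables.
severity = 1.0 * treatment + 2.0 * family_history + rng.normal(size=n)

# "Controlling for" family history = including its dummy in the regression.
X = np.column_stack([np.ones(n), treatment, family_history])
beta, *_ = np.linalg.lstsq(X, severity, rcond=None)
print(f"treatment effect, adjusted for family history: {beta[1]:.2f}")  # ~1.0
```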
