Solved – Choosing control variables

controlling-for-a-variable, correlation, regression

I am designing a regression following the advice to include a control variable if it could have a causal effect on the dependent variable and if it is correlated with the independent variable of interest.

How strongly does a candidate control variable have to be correlated with the variable of interest to be included? In my current regression I am using all potential controls that are correlated with the variable of interest at .1 or higher, but there are also potential control variables at .02, .04, and .06, and I'm not sure what to do with them. What should I do and why?

Best Answer

In the field I work in (biomedical research), causal effects are deemed very hard to study, and some even say it is only possible to do so with a blinded randomized controlled trial (RCT). Randomization 'balances' the inherent prognosis, any measurement errors and the course of the underlying disease across the groups with and without the variable of interest, so that the only remaining difference (on a group level) can be assumed to be caused by the variable of interest.

I mention this because this idea is the basis of any correction in observational studies (studies that are not RCTs), where said balancing was not part of the data-generating mechanism. To approximate that balance, we need additional information (the other independent variables you mentioned).

However, we do not necessarily need all information; we need the information which could 'confound' the association between the variable of interest and the dependent variable. Confounders are loosely defined as those variables which affect the variable of interest and, through some other known or unknown pathway, affect (the incidence of) the dependent variable as well. To get as close as possible to the aforementioned 'balancing' act, you will need to include as many confounders as you can. The selection of control variables should therefore not be based on correlations (which could occur by chance!) but on sound (bio)logical knowledge of the known or assumed causal pathways.
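To make the idea of confounding concrete, here is a minimal simulation sketch in Python (NumPy). Everything in it is an assumption made up for illustration: the names `z`, `x`, `y`, the effect sizes (0.8, 1.0, 0.5) and the sample size do not come from any real study. It only shows that leaving out a variable that drives both the variable of interest and the outcome distorts the estimated effect, while including it brings the estimate back towards the true value.

```python
import numpy as np

def simulate(n=20000, seed=0, true_effect=0.5):
    """Compare the slope on x with and without adjusting for a confounder z.

    All numbers here are invented for illustration only.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                    # confounder
    x = 0.8 * z + rng.normal(size=n)          # variable of interest, partly driven by z
    y = true_effect * x + 1.0 * z + rng.normal(size=n)  # outcome, driven by x and z

    # Ordinary least squares via lstsq; the first column is the intercept.
    design_naive = np.column_stack([np.ones(n), x])
    design_adjusted = np.column_stack([np.ones(n), x, z])
    b_naive = np.linalg.lstsq(design_naive, y, rcond=None)[0][1]
    b_adjusted = np.linalg.lstsq(design_adjusted, y, rcond=None)[0][1]
    return b_naive, b_adjusted

if __name__ == "__main__":
    b_naive, b_adjusted = simulate()
    print("slope on x without z:", round(b_naive, 3))
    print("slope on x with z:   ", round(b_adjusted, 3))
```

With these made-up numbers the unadjusted slope should land close to 1 instead of the true 0.5, and the adjusted slope close to 0.5. Note that `z` was chosen as a control here because of its assumed causal role, not because of any correlation threshold.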

Moreover, building statistical models by preselecting independent variables based on univariable p-values or correlations has been shown to lead to spurious (biased) results. In prediction modelling research this is made very clear (see for example the TRIPOD guidelines, item 10b, for a good overview).
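One way to get a feel for this kind of problem is a small simulation in which the outcome is unrelated to every candidate variable. The sketch below is only an illustration under invented assumptions (100 observations, 50 pure-noise candidates, a 0.1 correlation cut-off, the hypothetical function name `screen_then_fit`); it is not taken from TRIPOD. It screens candidates on their univariable correlation with the outcome, fits a regression on the survivors, and compares the apparent fit with the fit on fresh data.

```python
import numpy as np

def screen_then_fit(n=100, p=50, threshold=0.1, seed=1):
    """Screen pure-noise predictors by correlation, then fit OLS on the survivors."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                    # unrelated to every column of X

    # Univariable Pearson correlation of each candidate with the outcome.
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    keep = np.abs(r) >= threshold

    # Fit OLS (with intercept) on the selected columns only.
    A = np.column_stack([np.ones(n), X[:, keep]])
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    r2_in = 1 - np.mean((y - A @ beta) ** 2) / y.var()

    # Evaluate the same fitted model on fresh data from the same process.
    X_new = rng.normal(size=(n, p))
    y_new = rng.normal(size=n)
    A_new = np.column_stack([np.ones(n), X_new[:, keep]])
    r2_out = 1 - np.mean((y_new - A_new @ beta) ** 2) / y_new.var()
    return int(keep.sum()), r2_in, r2_out

if __name__ == "__main__":
    kept, r2_in, r2_out = screen_then_fit()
    print("variables kept:", kept)
    print("in-sample R^2: ", round(r2_in, 3))
    print("new-data R^2:  ", round(r2_out, 3))
```

Although no variable has any real relationship with the outcome, several pass the screen by chance and the in-sample fit looks informative, while the fit on new data does not hold up.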
