Regression analysis with confounding variables: how to interpret your main coefficient when controlling for confounders

Tags: confounding, multiple regression, regression

I'm interested in the effect of X on Y and want to adjust for confounding variables in my regression model. If the overall model (the regression F-test) is not significant but the predictor I'm interested in is, can I still report that there is an association between X and Y? I included the other variables only to adjust for confounding; my interest is in the relation between X and Y.
Thank you.

Lauren

Best Answer

One purpose of regression is to control for the effects of covariates. This question is predicated on the (correct) understanding that this purpose should not be confused with testing the significance of those covariates.


In a linear multiple regression model

$$\mathbb{E}(y) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k,$$

the $F$-test compares the null hypothesis

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$

to the alternative

$$H_1: \beta_j \ne 0\text{ for at least one }j.$$

In your case, you're not interested in this hypothesis because most of those coefficients are associated with covariates. Letting $j$ be the index of the single predictor in which you are interested and $n$ be the number of observations, your test should be based on comparing

$$H_0: \beta_j = 0$$

to

$$H_1: \beta_j \ne 0.$$

This is usually done with a t-test in which the estimate $\hat \beta_j$ is divided by its standard error $se(\hat\beta_j)$ and the resulting t-statistic is referred to the Student t distribution with $n-k-1$ degrees of freedom. If you consider that result to be significant, then you will reject this null hypothesis (rather than the omnibus null hypothesis of the F test) and conclude that after controlling for all covariates, variable $x_j$ was found to be significantly associated with $y$.
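As a rough illustration of the difference between the two tests, here is a minimal sketch in Python using NumPy and SciPy. The data are simulated and every name in it (`X`, `y`, `j`, the sample size, the true coefficient of 0.3) is an assumption made up for the example, not something taken from the question. It fits the model by ordinary least squares, then computes both the t-test for the single coefficient of interest and the omnibus F-test.

```python
import numpy as np
from scipy import stats

# Simulated data (hypothetical): column 0 of X is the predictor of interest,
# the remaining columns play the role of covariates.
rng = np.random.default_rng(0)
n, k = 200, 4                                  # n observations, k predictors
X = rng.normal(size=(n, k))
y = 1.0 + 0.3 * X[:, 0] + rng.normal(size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta_hat
df_resid = n - k - 1                           # residual degrees of freedom
sse = resid @ resid
sigma2_hat = sse / df_resid                    # estimated error variance
se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(A.T @ A)))

# t-test of H0: beta_j = 0 for the single predictor of interest.
j = 1                                          # position in A (0 is the intercept)
t_stat = beta_hat[j] / se[j]
p_t = 2 * stats.t.sf(abs(t_stat), df_resid)    # two-sided p-value

# Omnibus F-test of H0: beta_1 = ... = beta_k = 0.
sst = np.sum((y - y.mean()) ** 2)
f_stat = ((sst - sse) / k) / (sse / df_resid)
p_f = stats.f.sf(f_stat, k, df_resid)

print(f"t = {t_stat:.3f}, p = {p_t:.4f} (df = {df_resid})")
print(f"F = {f_stat:.3f}, p = {p_f:.4f} (df = {k}, {df_resid})")
```

The two p-values answer different questions: `p_t` concerns only $\beta_j$ after adjusting for the covariates, while `p_f` concerns all $k$ coefficients jointly.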


Additional considerations

Note that if you intended to conduct several such tests separately, involving several variables, then this procedure would no longer be correct for any one of them. Context matters! You would need first to perform a test to see whether any of that set of variables is significant. The usual procedure is an F test based on the "extra sum of squares" associated with the variables of interest. In the case of a single variable, this F test is mathematically equivalent to the Student t test.
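The extra-sum-of-squares F statistic can be sketched as follows, assuming simulated data and hypothetical helper names (`sse`, `extra_ss_f`) that are not from any library. The sketch compares a full model against a reduced model that omits the variables of interest, and checks numerically that for a single variable the F statistic equals the square of the t statistic.

```python
import numpy as np

def sse(A, y):
    """Residual sum of squares from an OLS fit of y on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def extra_ss_f(A_full, A_reduced, y):
    """Return (F, numerator df, denominator df) for full vs. reduced model."""
    n, p_full = A_full.shape
    q = p_full - A_reduced.shape[1]            # number of coefficients tested
    sse_full = sse(A_full, y)
    sse_reduced = sse(A_reduced, y)
    df_resid = n - p_full
    f = ((sse_reduced - sse_full) / q) / (sse_full / df_resid)
    return f, q, df_resid

# Simulated data (hypothetical): five predictors plus an intercept.
rng = np.random.default_rng(1)
n = 150
X = rng.normal(size=(n, 5))
y = 2.0 + 0.4 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=n)
A_full = np.column_stack([np.ones(n), X])

# Group test: are the first two predictors jointly significant?
A_red_group = np.delete(A_full, [1, 2], axis=1)
f_group, q_group, df_group = extra_ss_f(A_full, A_red_group, y)

# Single-variable test: drop only the first predictor.
A_red_single = np.delete(A_full, 1, axis=1)
f_single, q_single, df_single = extra_ss_f(A_full, A_red_single, y)

# t statistic for the same coefficient in the full model.
beta_hat = np.linalg.lstsq(A_full, y, rcond=None)[0]
sigma2_hat = sse(A_full, y) / df_single
se_1 = np.sqrt(sigma2_hat * np.linalg.inv(A_full.T @ A_full)[1, 1])
t_stat = beta_hat[1] / se_1

print(f"group F = {f_group:.3f} on ({q_group}, {df_group}) df")
print(f"single F = {f_single:.3f}, t^2 = {t_stat**2:.3f}")
```

The group statistic would be referred to an F distribution with `q_group` and `df_group` degrees of freedom; the last line shows the single-variable F and $t^2$ agreeing up to floating-point error.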

More subtly, note that what matters is the number of tests you planned to make before seeing the data. If first you examined the data and then based on that examination you selected $x_j$ as the sole variable of interest, then you would somehow have to figure out how to account for the additional information you used in order to narrow the model down to this single variable. You might, for instance, attempt (as honestly as possible) to enumerate all the variables you could possibly ever have been interested in testing, then treat them as a group as just described.


Reference

Montgomery, Peck, and Vining, Introduction to Linear Regression Analysis. Fifth Edition, 2012. John Wiley & Sons. Section 3.3.
