Regression – Should Covariates with Non-Linear Relationships Be Included in Linear Regression

pre-post-comparisonregression

In my linear regression model, baseline BMI is a significant predictor of BMI change following a weight loss intervention and including it improves fit (e.g. AIC). But it is strongly associated with gender, so the relationship is not linear (see the two clusters in the scatterplot below). Is it therefore a bad idea to include baseline BMI as a covariate, as it violates one of the key assumptions of linear models, even though it improves fit?

I have tried squaring it, and interacting it with gender, but it made little difference to fit (e.g. AIC) or specification tests (e.g. RESET).

enter image description here

Best Answer

It is a bad idea not to include baseline BMI as this implicitly makes a strong assumption about the relationship between pre- and post-treatment BMI.

$$ Y_{change} = Y_{post} - Y_{pre} = \alpha + \beta\text{Female} + \theta\text{Treatment} + \text{Error} $$ is equivalent to $$ \color{white}{Y_{change} = } Y_{post} = \alpha + \beta\text{Female} + Y_{pre} + \theta\text{Treatment} + \text{Error} $$

If the outcome is change from baseline but you don't include pre-treatment BMI in the predictors, you assume that the coefficient for pre-treatment BMI is fixed at 1. The regression can handle estimating one more coefficient so you don't have to make unnecessary assumptions.

Including pre-treatment BMI as a predictor also adjusts for any differences in BMI between the treatment and control group, which can occur by chance even in a randomized study. Since you are working with observational data, it is even more important to adjust for possible confounders. (A confounder is a covariate that is associated with the treatment and/or the outcome. For example, diet and exercise are potential confounders for weight loss.)

After we add the effect of pre-treatment BMI the model becomes:

$$ \text{(1)} \quad Y_{change} = \alpha + \beta\text{Female} + \gamma Y_{pre} + \theta\text{Treatment} + \text{Error} $$

In fact, consider modeling post-treatment BMI rather than the change in BMI:

$$ \text{(2)} \quad Y_{post} = \alpha + \beta\text{Female} + \gamma Y_{pre} + \theta\text{Treatment} + \text{Error} $$

The treatment effect $\theta$ is the same in models (1) and (2). But the interpretation is more straightforward in (2).

Finally, the plots suggest that you should include an interaction between Gender and Treatment. This is the most interesting feature in the data: the response is very different between males and females in the treatment and control groups. (I guess this is what you mean by non-linear relationship.) Re-doing these plots with post-BMI on the y-axis might be even more interesting.


You can read more about adjusting for pre-treatment measurements in Chapter 19, Section 3 of Regression and Other Stories [1] and in the BBR course notes, which argue strongly against modeling change from baseline and in favor of modeling post-treatment outcome [2].

[1] A. Gelman, J. Hill, and A. Vehtari. Regression and Other Stories. Cambridge University Press, 2020. See Chapter 19, Section 3 for a discussionabout pre-treatment predictors.

[2] Biostatistics for Biomedical Research course notes. Available online.


Previous CV posts discuss change from baseline in lots more detail. Thank you to @EdM for the references.

Is it valid to include a baseline measure as control variable when testing the effect of an independent variable on change scores?

Best practice when analysing pre-post treatment-control designs