Beta coefficients from stratified analysis when there are covariates

Tags: interaction, multiple regression, regression, regression coefficients, stratification

Suppose I have the regression model shown below.

Model 1:
$$
Y = \beta_0 + \beta_1 \text{SEX} + \beta_2 \text{ALCOHOL} + \beta_3 \text{SEX} \times \text{ALCOHOL}
$$

The predictors I am interested in are SEX (binary: 0 = female, 1 = male) and ALCOHOL (alcohol consumption, binary: drinker vs. non-drinker). Suppose I find a significant interaction between SEX and ALCOHOL and decide to stratify the data by sex. I would then have two new models:

Model 2a:
$$
\text{Female: }
Y_F = \beta_0 + \beta_2 \text{ALCOHOL}
$$

So for the female subset, the intercept is still $\beta_0$ and the slope for ALCOHOL is $\beta_2$.

Model 2b:
$$
\text{Male: }
Y_M = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) \text{ALCOHOL}
$$

For the male subset, the intercept is now $\beta_0 + \beta_1$ and the slope for ALCOHOL is $\beta_2 + \beta_3$.
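The correspondence between Model 1 and Models 2a/2b can be checked numerically. The following is a minimal sketch using NumPy on simulated data; the data-generating values, the 0/1 coding of ALCOHOL, and the helper `ols` are assumptions made for illustration and are not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Hypothetical data; the 0/1 coding of ALCOHOL is an assumption of this sketch.
sex = rng.integers(0, 2, n)        # 0 = female, 1 = male
alcohol = rng.integers(0, 2, n)    # 0 = non-drinker, 1 = drinker
y = 1.0 + 0.5 * sex + 2.0 * alcohol - 1.5 * sex * alcohol + rng.normal(0, 1, n)


def ols(columns, response):
    """Least-squares coefficients, intercept first (illustrative helper)."""
    X = np.column_stack([np.ones(len(response))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    return beta


# Model 1: pooled fit with the SEX*ALCOHOL interaction.
b0, b1, b2, b3 = ols([sex, alcohol, sex * alcohol], y)

# Models 2a and 2b: separate fits within each sex.
female, male = sex == 0, sex == 1
fit_f = ols([alcohol[female]], y[female])
fit_m = ols([alcohol[male]], y[male])

print("female:", fit_f, "vs pooled:", np.array([b0, b2]))
print("male:  ", fit_m, "vs pooled:", np.array([b0 + b1, b2 + b3]))
```

The two printed pairs should agree up to floating-point error, because with two binary predictors and their interaction both approaches simply reproduce the four cell means.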

This is straightforward: fitting the stratified models in any statistical package reproduces these values. However, suppose the model also includes AGE as a covariate (assumed not to interact with either SEX or ALCOHOL). The original model would then be the one below:

Model 3:
$$
Y = \beta_0 + \beta_1 \text{SEX} + \beta_2 \text{ALCOHOL} + \beta_3 \text{SEX} \times \text{ALCOHOL} + \beta_4 \text{AGE}
$$

Further suppose that the interaction between SEX and ALCOHOL is still significant and I want to stratify the data again. Following the logic above, I would get the two models below:

Model 4a:
$$
\text{Female: }
Y_F = \beta_0 + \beta_2 \text{ALCOHOL} + \beta_4 \text{AGE}
$$

Model 4b:
$$
\text{Male: }
Y_M = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) \text{ALCOHOL} + \beta_4 \text{AGE}
$$

However, the coefficients actually obtained from a statistical package can be very different from those implied by the equations above. The difference in $\beta_0$ makes sense to me, because Models 4a and 4b still assume the pooled mean age: they estimate $\bar{Y}$ at the mean age of all subjects, whereas the intercepts from the stratified analyses run by the package estimate $\bar{Y}$ at the mean age of each subgroup.

However, I wonder why the slopes are different. In other words, why do the slopes in Models 4a and 4b differ from those produced by a statistical package?
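A small simulation reproduces the discrepancy described above. This is only a sketch: the data-generating values, the 0/1 coding of ALCOHOL, and the helper `ols` are assumptions, and the simulated AGE slope is deliberately allowed to differ by sex (with AGE correlated with ALCOHOL) so that the gap is easy to see.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical data (all numeric values are assumptions of this sketch).
sex = rng.integers(0, 2, n)                    # 0 = female, 1 = male
alcohol = rng.integers(0, 2, n)                # 0 = non-drinker, 1 = drinker
age = 40 + 8 * alcohol + rng.normal(0, 10, n)  # drinkers are older on average
# Simulated AGE slope: 0.10 for females, 0.30 for males.
y = (1.0 + 0.5 * sex + 2.0 * alcohol - 1.5 * sex * alcohol
     + (0.10 + 0.20 * sex) * age + rng.normal(0, 1, n))


def ols(columns, response):
    """Least-squares coefficients, intercept first (illustrative helper)."""
    X = np.column_stack([np.ones(len(response))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    return beta


# Model 3: pooled fit with one common AGE slope.
b0, b1, b2, b3, b4 = ols([sex, alcohol, sex * alcohol, age], y)

# Separate fits within each sex, as a statistical package would run them.
female, male = sex == 0, sex == 1
fit_f = ols([alcohol[female], age[female]], y[female])
fit_m = ols([alcohol[male], age[male]], y[male])

print("ALCOHOL slope, females: 4a implies", b2, "separate fit gives", fit_f[1])
print("ALCOHOL slope, males:   4b implies", b2 + b3, "separate fit gives", fit_m[1])
print("AGE slope: pooled", b4, "females", fit_f[2], "males", fit_m[2])
```

In this setup the ALCOHOL slopes from the separate fits differ noticeably from $\beta_2$ and $\beta_2 + \beta_3$ of the pooled fit, and the pooled AGE slope lies between the two sex-specific AGE slopes.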

Best Answer

Here's an intuitive answer: when you stratify those last models, as you note, the intercept changes. However, the observed values do not change, so the mean predicted value should not change either. If you kept the slopes the same, the mean predicted value would change by exactly as much as the intercept changed, so the slopes have to change as well.
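One way to make this concrete: fitting separate models by sex lets every coefficient differ between the strata, including the AGE slope, whereas Model 3 forces a single $\beta_4$ on both sexes. The separate fits therefore correspond to a pooled model that also contains a SEX*AGE term, not to Model 3. The sketch below checks this with NumPy on simulated data; the data-generating values, the 0/1 coding of ALCOHOL, the extra coefficient `b5`, and the helper `ols` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Hypothetical data; here the simulated AGE slope is the same for both sexes.
sex = rng.integers(0, 2, n)                    # 0 = female, 1 = male
alcohol = rng.integers(0, 2, n)                # 0 = non-drinker, 1 = drinker
age = 40 + 8 * alcohol + rng.normal(0, 10, n)
y = (1.0 + 0.5 * sex + 2.0 * alcohol - 1.5 * sex * alcohol
     + 0.2 * age + rng.normal(0, 1, n))


def ols(columns, response):
    """Least-squares coefficients, intercept first (illustrative helper)."""
    X = np.column_stack([np.ones(len(response))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    return beta


# Pooled model in which SEX interacts with both ALCOHOL and AGE.
b0, b1, b2, b3, b4, b5 = ols(
    [sex, alcohol, sex * alcohol, age, sex * age], y
)

# Separate fits within each sex.
female, male = sex == 0, sex == 1
fit_f = ols([alcohol[female], age[female]], y[female])
fit_m = ols([alcohol[male], age[male]], y[male])

print("female:", fit_f, "vs pooled:", np.array([b0, b2, b4]))
print("male:  ", fit_m, "vs pooled:", np.array([b0 + b1, b2 + b3, b4 + b5]))
```

The match holds even though the simulated AGE slope is the same for both sexes, because the estimated AGE slopes within each sex still differ in any finite sample. Model 3 would reproduce the stratified slopes only if those two estimates happened to coincide.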
