In a linear model that predicts birth rate (TFR) per country from per capita GDP, the country is encoded in "treatment coding", and there are several measurements (different years) per country. I would thus have expected that the first level represents the "reference intercept", so that only the predictions for this first level would change when the intercept is removed from the model.
However, the predictions do not change for any country if I remove the intercept:
> fit1 <- lm(TFR ~ logGDPpc + logGDPpc2 + country, data = x)
> fit2 <- lm(TFR ~ logGDPpc + logGDPpc2 + country - 1, data = x)
> max(abs(fit1$fitted.values - fit2$fitted.values))
[1] 1.847411e-13
The same holds for the relative differences:
> max(abs((fit1$fitted.values - fit2$fitted.values)/fit2$fitted.values))
[1] 7.482906e-14
Is this the expected behavior? Why?
Best Answer
As @Russ Lenth points out, these models are equivalent parametrizations of the same model.
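Why equivalent parametrizations produce identical fitted values follows from standard least-squares theory (this is general, not specific to this dataset). The fitted values are a projection of $y$ onto the column space of the design matrix $X$:

$$\hat{y} = X(X^\top X)^{-1}X^\top y = H y,$$

and the hat matrix $H$ depends only on the column space of $X$, not on the particular basis chosen for it. Dropping the intercept makes R code the factor with indicator columns for *all* levels instead of all-but-one; that changes the basis but spans the same space, so $H$, and hence $\hat{y}$, is unchanged.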
Usually (in R) we specify models with a formula such as y ~ x1 + x2. It's very convenient. Under the hood, R uses the formula and the data to construct the design matrix.
It's often helpful to look at the design matrix to figure out how R processed the inputs, especially if the formula includes categorical variables, polynomials, or other variable transformations. Use the model.matrix function to construct the design matrix explicitly.
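To see this concretely, here is a small simulated example (toy data standing in for the original x, which isn't shown; the variable names mirror the question):

```r
# Toy data: 3 "countries" with several observations each
set.seed(1)
d <- data.frame(
  country  = factor(rep(c("A", "B", "C"), each = 4)),
  logGDPpc = rnorm(12)
)
d$logGDPpc2 <- d$logGDPpc^2
d$TFR <- 2 + 0.5 * d$logGDPpc + rnorm(12, sd = 0.1)

# Design matrices for the two formulas
X1 <- model.matrix(~ logGDPpc + logGDPpc2 + country, data = d)
X2 <- model.matrix(~ logGDPpc + logGDPpc2 + country - 1, data = d)

# With the intercept, treatment coding drops the first level:
colnames(X1)  # "(Intercept)" "logGDPpc" "logGDPpc2" "countryB" "countryC"
# Without it, R instead includes an indicator for every level:
colnames(X2)  # "logGDPpc" "logGDPpc2" "countryA" "countryB" "countryC"

# Same number of columns, same column space -> same fitted values
fit1 <- lm(TFR ~ logGDPpc + logGDPpc2 + country, data = d)
fit2 <- lm(TFR ~ logGDPpc + logGDPpc2 + country - 1, data = d)
max(abs(fitted(fit1) - fitted(fit2)))  # numerically zero
```

The two matrices are different bases for the same 5-dimensional column space (the intercept column equals the sum of the three country indicators), which is why the fitted values, and hence your max absolute difference, only differ at the level of floating-point round-off.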