In ordinary least squares regression (OLS), if the plot of the residuals against the fitted values form a horizontal line around 0, then we can say that the dependent variable is linearly related to the independent variable.
I had thought that this is true because $E(y_i – \hat{y}_I)=0$ when the dependent variable is linearly related to the independent variable, see here.
However, suppose:
$y_i = \alpha + \sin(x_i) + \epsilon_i$.
Then $E(y_i – \hat{y}_i)$ is still 0, see here but then the plot of its residuals against its fitted value is no longer a horizontal line around 0, as this R code shows:
n <- 10^3
df <- data.frame(x=runif(n, 1, 10))
df$mean.y.given.x <- sin(df$x)
df$y <- df$mean.y.given.x + rnorm(n)
model <- lm(y ~ x, data=df)
plot(predict(model, newdata=df), residuals(model))
abline(a=0,b=0,col='blue')
So my question is, which assumption(s) of OLS that causes the plot of the residuals and the fitted value to be a horizontal line around 0 and why/how is it true?
Best Answer
Scortchi and Peter Flom have both correctly pointed out that you didn't fit the model you specified.
However, there's no coefficient on $\sin(x)$ in the model, so if you actually want to fit $y_i = \alpha + \sin(x_i) + \epsilon_i$ you should not regress on $\sin(x)$. In that model it's an offset, not a regressor.
The correct way to specify the model
$$y_i = \alpha + \sin(x_i) + \epsilon_i$$
in R is:
which produces the residual vs fitted plot:
or as a residuals vs x plot:
Alternatively, one could fit
which gives the same estimate for $\alpha$. The residual vs fitted plot is of no use in this case (because of the difference in the way the offset was brought into the model by modifying $y$), but the residuals vs x plot is identical to the second plot above.
Gung is right to suggest in comments that it often makes sense to fit the offset as a regressor anyway (for example, to check that the offset-coefficient of 1 is reasonable); this is the model that Scortchi and Peter Flom were discussing in comments.
Here's how you do that:
If we look at the summary (
summary(model3)
) we get:which has coefficients close to what we'd expect.
Finally, you might do this:
but its effect is only to reduce the fitted coefficient of $\sin(x)$ by 1, so we can extract the same information from model3's output.