Background
I'm trying to understand the first example in a course on fitting models (so this may seem ludicrously simple). I've done the calculations by hand and they match the example, but when I repeat them in R, the model coefficients are off. I thought the difference might be due to the textbook using population variance ($\sigma^2$) whereas R may be using sample variance ($S^2$), but I can't see where these are used in the calculations. For example, if lm() uses var() somewhere, the help section on var() notes:

The denominator n - 1 is used which gives an unbiased estimator of the (co)variance for i.i.d. observations.
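A quick way to confirm that behaviour (my own check, not from the course):

x <- c(1, 5, 9)
var(x)                                    # 16, denominator n - 1
sum((x - mean(x))^2) / (length(x) - 1)    # 16, the same
sum((x - mean(x))^2) / length(x)          # 10.66667, population version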
I have looked at the code for both lm() and lm.fit() and neither makes use of var(), but lm.fit() passes the data to compiled C code (z <- .Call(C_Cdqrls, x, y, tol, FALSE)) which I don't have access to.
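For reference, the least-squares coefficients can be reproduced without touching the C code by solving the normal equations $(X^TX)\beta = X^Ty$ directly; this is a standard identity rather than anything specific to lm() (the three data points are the same ones used below):

x <- c(1, 2, 4)                       # grade
y <- c(1, 5, 9)                       # shoe
X <- cbind(1, x)                      # design matrix with an intercept column
solve(crossprod(X), crossprod(X, y))  # (X'X)^{-1} X'y: -1.000000, 2.571429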
Question
Can anyone explain why R is giving different results? Even if there is a difference in using sample vs. population variance, why do the coefficient estimates differ?
Data
Fit a line to predict shoe size from grade in school.
# model data
mod.dat <- read.table(
  text = 'grade shoe
          1 1
          2 5
          4 9',
  header = TRUE)

# mean shoe size
mod.mu <- mean(mod.dat$shoe)

# total variability: sum of squared deviations from the mean
mod.var <- sum((mod.dat$shoe - mod.mu)^2)

# model coefficients from the textbook
mod.m <- 8/3
mod.b <- -1

# predicted values ( 1.666667 4.333333 9.666667 )
mod.man.pred <- mod.dat$grade * mod.m + mod.b

# residuals ( -0.6666667 0.6666667 -0.6666667 )
mod.man.resid <- mod.dat$shoe - mod.man.pred

# residual sum of squares ( 1.333333 )
mod.man.unexpl.var <- sum(mod.man.resid^2)

# r^2 ( 0.9583333 )
mod.man.expl.var <- 1 - mod.man.unexpl.var / mod.var

# but lm() gives different results:
summary(lm(shoe ~ grade, data = mod.dat))
Call:
lm(formula = shoe ~ grade, data = mod.dat)

Residuals:
      1       2       3
-0.5714  0.8571 -0.2857

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.0000     1.3093  -0.764    0.585
grade         2.5714     0.4949   5.196    0.121

Residual standard error: 1.069 on 1 degrees of freedom
Multiple R-squared:  0.9643,    Adjusted R-squared:  0.9286
F-statistic:    27 on 1 and 1 DF,  p-value: 0.121
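Incidentally, the slope can also be written as cov(x, y) / var(x), and the n - 1 denominators cancel in that ratio, so a sample-vs-population difference alone shouldn't be able to move the coefficient estimates (my own check, using the data above):

cov(mod.dat$grade, mod.dat$shoe) / var(mod.dat$grade)
# [1] 2.571429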
Edit
As Ben Bolker has shown, teachers make mistakes sometimes. It seems that the R calculations are correct. Moral of the story: don't believe something just because a teacher says it is true. Verify it for yourself!
Best Answer
It looks like the author made a mathematical error somewhere.
If you expand the sum of squared deviations
$$ S = ((b+m)-1)^2+ ((b+2m)-5)^2 + ((b+4m)-9)^2 $$ you get $$ \begin{split} S = & b^2+2 b m+ m^2 + 1 - 2 b - 2 m \\ + & b^2+4 b m+ 4 m^2 + 25 - 10 b -20 m \\ + & b^2+8 b m+16 m^2 + 81 - 18 b -72 m \end{split} $$
which reduces to $$ 3 b^2 + 14 b m + 21 m^2 + 107 - 30 b - 94 m $$ which is the same as the author's expression, except for the constant term (which doesn't matter anyway).
Now minimize $S$ by setting its derivatives with respect to $b$ and $m$ to zero and solving the resulting system. $$ \partial S/\partial b = 6 b + 14 m - 30 \to 3 b + 7 m - 15 = 0 $$ $$ \partial S/\partial m = 14 b + 42 m - 94 \to 7 b + 21 m - 47 = 0 $$
Solve
$$ \begin{split} b & = (15-7m)/3 \\ 0 & = 7 (15-7m)/3 + 21 m-47 \\ 47 - 35 & = (-49/3 + 21) m \\ m & = (47-35)/(21-49/3) = 18/7 \end{split} $$
R says this is indeed 2.571429 ...
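A minimal check of that system in R (just the two equations above, nothing assumed beyond them):

# 3b + 7m = 15  and  7b + 21m = 47
solve(matrix(c(3, 7, 7, 21), nrow = 2, byrow = TRUE), c(15, 47))
# [1] -1.000000  2.571429    (b, then m)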
Based on this link this seems to be from a Coursera course ... ? Maybe there was a mis-transcription of the data somewhere?
The other, independent way to do this calculation is to know that the estimated regression slope is equal to the sum of cross products ($\sum (y-\bar y) (x-\bar x)$) divided by the sum of squares ($\sum (x-\bar x)^2$).
I think if the shoe sizes were $\{1,11/3,9\}$ instead of $\{1,5,9\}$ then the slope would come out to 8/3 ...
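Checking that guess numerically (these alternative shoe sizes are hypothetical, not the actual course data):

x <- c(1, 2, 4)
slope <- function(x, y) sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
slope(x, c(1, 5, 9))      # 2.571429 = 18/7, what lm() reports
slope(x, c(1, 11/3, 9))   # 2.666667 = 8/3, the textbook's value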