Background
I'm trying to understand the first example in a course on fitting models (so this may seem ludicrously simple). I've done the calculations by hand and they match the example, but when I repeat them in R, the model coefficients are off. I thought the difference might be due to the textbook using population variance ($\sigma^2$) whereas R may be using sample variance ($S^2$), but I can't see where these are used in the calculations. For example, if lm() uses var() somewhere, the help section on var() notes:

The denominator n - 1 is used which gives an unbiased estimator of the (co)variance for i.i.d. observations.
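A quick way to confirm that behaviour (my own check, not from the course):

x <- c(1, 5, 9)
var(x)                                    # 16, denominator n - 1
sum((x - mean(x))^2) / (length(x) - 1)    # 16, the same
sum((x - mean(x))^2) / length(x)          # 10.66667, population version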
I have looked at the code for both lm() and lm.fit() and neither makes use of var(), but lm.fit() passes the data to compiled C code (z <- .Call(C_Cdqrls, x, y, tol, FALSE)) which I don't have access to.
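For reference, the least-squares coefficients can be reproduced without touching the C code by solving the normal equations $(X^TX)\beta = X^Ty$ directly; this is a standard identity rather than anything specific to lm() (the three data points are the same ones used below):

x <- c(1, 2, 4)                       # grade
y <- c(1, 5, 9)                       # shoe
X <- cbind(1, x)                      # design matrix with an intercept column
solve(crossprod(X), crossprod(X, y))  # (X'X)^{-1} X'y: -1.000000, 2.571429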
Question
Can anyone explain why R is giving different results? Even if there is a difference in using sample vs. population variance, why do the coefficient estimates differ?
Data
Fit a line to predict shoe size from grade in school.
# model data
mod.dat <- read.table(
  text = 'grade shoe
          1 1
          2 5
          4 9',
  header = TRUE)

# mean shoe size
mod.mu <- mean(mod.dat$shoe)

# total variability: sum of squared deviations from the mean
mod.var <- sum((mod.dat$shoe - mod.mu)^2)

# model coefficients from the textbook
mod.m <- 8/3
mod.b <- -1

# predicted values ( 1.666667 4.333333 9.666667 )
mod.man.pred <- mod.dat$grade * mod.m + mod.b

# residuals ( -0.6666667 0.6666667 -0.6666667 )
mod.man.resid <- mod.dat$shoe - mod.man.pred

# residual sum of squares ( 1.333333 )
mod.man.unexpl.var <- sum(mod.man.resid^2)

# r^2 ( 0.9583333 )
mod.man.expl.var <- 1 - mod.man.unexpl.var / mod.var

# but lm() gives different results:
summary(lm(shoe ~ grade, data = mod.dat))
Call:
lm(formula = shoe ~ grade, data = mod.dat)

Residuals:
      1       2       3
-0.5714  0.8571 -0.2857

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.0000     1.3093  -0.764    0.585
grade         2.5714     0.4949   5.196    0.121

Residual standard error: 1.069 on 1 degrees of freedom
Multiple R-squared:  0.9643,    Adjusted R-squared:  0.9286
F-statistic:    27 on 1 and 1 DF,  p-value: 0.121
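Incidentally, the slope can also be written as cov(x, y) / var(x), and the n - 1 denominators cancel in that ratio, so a sample-vs-population difference alone shouldn't be able to move the coefficient estimates (my own check, using the data above):

cov(mod.dat$grade, mod.dat$shoe) / var(mod.dat$grade)
# [1] 2.571429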
Edit
As Ben Bolker has shown, teachers make mistakes sometimes. It seems that the R calculations are correct. Moral of the story: don't believe something just because a teacher says it is true. Verify it for yourself!
Best Answer
It looks like the author made a mathematical error somewhere.
If you expand the sum of squared deviations
$$ S = ((b+m)-1)^2+ ((b+2m)-5)^2 + ((b+4m)-9)^2 $$ you get $$ \begin{split} S = & b^2+2 b m+ m^2 + 1 - 2 b - 2 m \\ + & b^2+4 b m+ 4 m^2 + 25 - 10 b -20 m \\ + & b^2+8 b m+16 m^2 + 81 - 18 b -72 m \end{split} $$
which reduces to $$ 3 b^2 + 14 b m + 21 m^2 + 107 - 30 b - 94 m $$ which is the same as the author's expression, except for the constant term (which doesn't matter anyway).
Now minimize $S$ by setting its derivatives with respect to $b$ and $m$ to zero and solving the resulting system. $$ \partial S/\partial b = 6 b + 14 m - 30 \to 3 b + 7 m - 15 = 0 $$ $$ \partial S/\partial m = 14 b + 42 m - 94 \to 7 b + 21 m - 47 = 0 $$
Solve
$$ \begin{split} b & = (15-7m)/3 \\ 0 & = 7 (15-7m)/3 + 21 m-47 \\ 47 - 35 & = (-49/3 + 21) m \\ m & = (47-35)/(21-49/3) = 18/7 \end{split} $$
R says this is indeed 2.571429 ...
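A minimal check of that system in R (just the two equations above, nothing assumed beyond them):

# 3b + 7m = 15  and  7b + 21m = 47
solve(matrix(c(3, 7, 7, 21), nrow = 2, byrow = TRUE), c(15, 47))
# [1] -1.000000  2.571429    (b, then m)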
Based on this link this seems to be from a Coursera course ... ? Maybe there was a mis-transcription of the data somewhere?
The other, independent way to do this calculation is to know that the estimated regression slope is equal to the sum of cross products ($\sum (y-\bar y) (x-\bar x)$) divided by the sum of squares ($\sum (x-\bar x)^2$).
I think if the shoe sizes were $\{1,11/3,9\}$ instead of $\{1,5,9\}$ then the slope would come out to 8/3 ...
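Checking that guess numerically (these alternative shoe sizes are hypothetical, not the actual course data):

x <- c(1, 2, 4)
slope <- function(x, y) sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
slope(x, c(1, 5, 9))      # 2.571429 = 18/7, what lm() reports
slope(x, c(1, 11/3, 9))   # 2.666667 = 8/3, the textbook's value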