Solved – Can’t understand this multiple R-squared value in R lm()

Tags: lm, r, r-squared

> library(data.table)
> dt = data.table(x = rnorm(100))
> dt[, y := 1+0.2*x + rnorm(100)]
> 
> fit = lm(y~x, data = dt)
> summary(fit)

Call:
lm(formula = y ~ x, data = dt)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.19493 -0.75218 -0.03459  0.64181  2.38214 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.8561     0.1036   8.266 6.86e-13 ***
x             0.3025     0.1056   2.865   0.0051 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.022 on 98 degrees of freedom
Multiple R-squared:  0.07731,   Adjusted R-squared:  0.06789 
F-statistic: 8.211 on 1 and 98 DF,  p-value: 0.005095

> dt[, 1 - sum(fit$residuals^2) / sum(y^2)]
[1] 0.419261

Shouldn't R-squared be 0.419? Or at least close to it, not 0.07?

Best Answer

Your formula

1 - sum(fit$residuals^2) / sum(y^2)

is wrong. It should be

$$ R^2 = 1 - \frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2} $$

where $\hat y_i$ is the prediction for observation $i$ and $\bar y$ is the mean of $y$. See the Wikipedia entry on $R^2$ to learn more.
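A minimal check in R, reusing fit and dt from your session (the exact value depends on your random draw, so it won't equal 0.07731 exactly on a rerun):

rss <- sum(fit$residuals^2)          # residual sum of squares
tss <- sum((dt$y - mean(dt$y))^2)    # total sum of squares, centered at the mean
1 - rss / tss                        # matches summary(fit)$r.squared

Your denominator sum(y^2) measures variation around 0 rather than around $\bar y$; since $\sum_i y_i^2 = \sum_i (y_i-\bar y)^2 + n\bar y^2$, it is larger, which shrinks the ratio and inflates the result to 0.419.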

Notice that this generalizes easily: $\bar y$ is exactly the prediction from an intercept-only regression, so the numerator is the residual sum of squares from your model and the denominator is the residual sum of squares from the null model (the most basic model we can imagine).
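A quick sketch of that equivalence in R (again reusing fit and dt from above):

fit0 <- lm(y ~ 1, data = dt)                 # intercept-only (null) model
all.equal(unname(fitted(fit0)),              # its fitted values are all
          rep(mean(dt$y), nrow(dt)))         # equal to the mean of y
1 - sum(resid(fit)^2) / sum(resid(fit0)^2)   # same as summary(fit)$r.squared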