> dt = data.table(x = rnorm(100))
> dt[, y := 1+0.2*x + rnorm(100)]
>
> fit = lm(y~x, data = dt)
> summary(fit)
Call:
lm(formula = y ~ x, data = dt)
Residuals:
     Min       1Q   Median       3Q      Max
-2.19493 -0.75218 -0.03459  0.64181  2.38214
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.8561     0.1036   8.266 6.86e-13 ***
x             0.3025     0.1056   2.865   0.0051 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.022 on 98 degrees of freedom
Multiple R-squared: 0.07731, Adjusted R-squared: 0.06789
F-statistic: 8.211 on 1 and 98 DF, p-value: 0.005095
> dt[, 1 - sum(fit$residuals^2) / sum(y^2)]
[1] 0.419261
Shouldn't R-squared be 0.419? Or at least close to it, not 0.07?
Best Answer
Your formula is wrong: it divides by the raw sum of squares $\sum_i y_i^2$ instead of the centered one. It should be
$$ R^2 = 1 - \frac{\sum_i (y_i-\hat{y_i})^2}{\sum_i (y_i-\bar{y})^2} $$
where $\hat y_i$ is the prediction and $\bar y$ is the mean of $y$. See the Wikipedia entry on $R^2$ to learn more.
Notice that this generalizes easily: $\bar y$ is just the prediction from an intercept-only regression, so the numerator is the residual sum of squares from your model and the denominator is the residual sum of squares from the null model (the most basic model we can imagine).
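A quick check that the corrected formula reproduces R's own value (a sketch of the session above; the seed is an assumption, the original post did not set one):

```r
# Rebuild the toy data and model from the question
set.seed(1)  # assumed seed, not in the original post
x <- rnorm(100)
y <- 1 + 0.2 * x + rnorm(100)
fit <- lm(y ~ x)

# R^2 = 1 - SS_res / SS_tot, with SS_tot centered around mean(y)
r2 <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
all.equal(r2, summary(fit)$r.squared)  # TRUE
```

The only change from the question's one-liner is subtracting `mean(y)` in the denominator; without it the "R-squared" is inflated whenever $\bar y \neq 0$.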
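The null-model view above can be checked directly: fit `lm(y ~ 1)`, whose fitted values are all `mean(y)`, and take the ratio of the two residual sums of squares (same assumed seed as before):

```r
# Same toy data as in the question (assumed seed)
set.seed(1)
x <- rnorm(100)
y <- 1 + 0.2 * x + rnorm(100)
fit <- lm(y ~ x)

# Null model: intercept only, so every fitted value equals mean(y)
null_fit <- lm(y ~ 1)

# R^2 as 1 minus (model RSS / null-model RSS)
r2_null <- 1 - sum(residuals(fit)^2) / sum(residuals(null_fit)^2)
all.equal(r2_null, summary(fit)$r.squared)  # TRUE
```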