Solved – What does R-squared = 81% mean?

machine learning, r-squared, regression

I was studying linear regression and got stuck on r-squared. I know how to calculate r-squared like a machine, but I want to understand it in human language. For example, what is the meaning of r-squared = 81%?
I googled and watched several tutorials and gathered some human-language intuition about r-squared = 81%.

r-squared = 81% means:

  • 81% less variance around the regression line than around the mean line
  • 81% less error between the predicted values and the actual values
  • The actual data lie 81% closer to the regression line than to the mean line
  • Predictions of the actual values are 81% better with the regression line than with the mean line

These are all the human-language interpretations of r-squared = 81% that I have gathered. Please correct me if I am wrong.
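For instance, the toy arithmetic I have in mind looks like this (the numbers are made up, just to illustrate what "81% less" would mean):

ss_around_mean <- 100   # made-up total squared deviation of y around its mean
ss_around_line <- 19    # made-up squared deviation remaining around the regression line
1 - ss_around_line/ss_around_mean
[1] 0.81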
I watched a video [1] and found another explanation of r-squared, which is:
"r-squared is the percentage of variation in 'Y' that is accounted for by its regression on 'X'"

Well, the last explanation is a bit confusing to me. Could anyone help me understand, with a simple example, what this line actually means?

Best Answer

As a matter of fact, this last explanation is the best one:

r-squared is the percentage of variation in 'Y' that is accounted for by its regression on 'X'

Yes, it is quite abstract. Let's try to understand it.

Here is some simulated data.

[Figure: scatterplot of the simulated data]

R code:

set.seed(1)                                  # make the simulation reproducible
xx <- runif(100)                             # 100 uniform predictor values
yy <- 1 - xx^2 + rnorm(length(xx), 0, 0.1)   # quadratic signal plus Gaussian noise
plot(xx, yy, pch = 19)

What we are mainly interested in is the variation in the dependent variable $y$. As a first step, let's disregard the predictor $x$. In this very simple "model", the variation in $y$ is the sum of the squared differences between the entries of $y$ and the mean of $y$, $\overline{y}$:

[Figure: scatterplot with mean line]

abline(h = mean(yy), col = "red", lwd = 2)                       # horizontal line at the mean of y
lines(rbind(xx, xx, NA), rbind(yy, mean(yy), NA), col = "gray")  # vertical segments from each point to the mean line
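In symbols, writing $y_1, \ldots, y_n$ for the observed values, this quantity is the total sum of squares
$$\mathrm{SS}_{\text{tot}} = \sum_{i=1}^{n} (y_i - \overline{y})^2.$$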

This sum of squares turns out to be:

sum((yy-mean(yy))^2)
[1] 8.14846
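As an aside, this "mean model" can itself be written as a regression, namely the intercept-only fit lm(yy ~ 1), whose single coefficient is just the mean of yy. A quick sketch (the name null_model is only illustrative):

null_model <- lm(yy ~ 1)                      # intercept-only model: the "mean line"
all.equal(unname(coef(null_model)), mean(yy)) # its fitted intercept equals mean(yy)
[1] TRUE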

Now, we try a slightly more sophisticated model: we regress $y$ on $x$ and check how much variation remains after that. That is, we now calculate the sum of squared differences between the $y$ values and the regression line:

[Figure: scatterplot with regression line]

plot(xx, yy, pch = 19)                                                 # redraw the scatterplot
model <- lm(yy ~ xx)                                                   # simple linear regression of y on x
abline(model, col = "red", lwd = 2)                                    # fitted regression line
lines(rbind(xx, xx, NA), rbind(yy, predict(model), NA), col = "gray")  # segments from each point to its fitted value

Note how the differences - the gray lines - are much smaller now than before!
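In symbols, with $\hat{y}_i$ denoting the fitted value of the $i$-th observation on the regression line, the remaining variation is the residual sum of squares
$$\mathrm{SS}_{\text{res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$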

And here is the sum of squared differences between the $y$ values and the regression line:

sum(residuals(model)^2)
[1] 1.312477

It turns out that this is only about 16% of the sum of squared deviations from the mean we had above:

sum(residuals(model)^2)/sum((yy-mean(yy))^2)
[1] 0.1610705

Thus, our regression line model reduced the unexplained variation in the observed data $y$ by 100% - 16% = 84%. And this number is precisely the $R^2$ that R will report to us:

summary(model)

Call:
lm(formula = yy ~ xx)
... snip ...    
Multiple R-squared:  0.8389,    Adjusted R-squared:  0.8373 
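As a cross-check, $R^2$ is exactly one minus the ratio we computed above, $R^2 = 1 - \mathrm{SS}_{\text{res}}/\mathrm{SS}_{\text{tot}}$:

1 - sum(residuals(model)^2)/sum((yy-mean(yy))^2)
[1] 0.8389295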

Now, one question you might have is why we calculate variation as a sum of squares. Wouldn't it be easier to just sum up the absolute lengths of the deviations plotted above? The reason is that squares are much easier to handle mathematically, and it turns out that if we work with squares, we can prove all kinds of helpful theorems about $R^2$ and related quantities, such as $F$ tests and ANOVA tables.
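For completeness, the absolute-deviation version of that ratio could be computed just as easily (this is only a sketch of the alternative; it is not what summary() reports and it does not equal $1 - R^2$):

sum(abs(residuals(model)))/sum(abs(yy - mean(yy)))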
