I was studying linear regression and got stuck on r-squared. I know how to calculate r-squared like a machine, but I want to understand r-squared in human language. For example, what is the meaning of r-squared = 81%?
I googled and watched several tutorials and gathered some human-language intuitions for r-squared = 81%.
r-squared = 81% means:
- 81% less variance around the regression line than around the mean line
- 81% less error between the predicted values and the actual values
- The actual data is 81% closer to the regression line than to the mean line
- Predictions of the actual values are 81% better using the regression line than the mean line
These are all the human-language interpretations of r-squared = 81% I have gathered. Please correct me if I am wrong.
I watched a video and found another explanation of r-squared, which is:
"r-squared is the percentage of variation in 'Y' that is accounted for by its regression on 'X'"
Well, this last explanation is a bit confusing to me. Could anyone help me understand, with a simple example, what this line actually means?
Best Answer
As a matter of fact, this last explanation is the best one. Yes, it is quite abstract, so let's try to unpack it.
Here is some simulated data.
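The answer's original R snippet is not preserved here. As a stand-in, here is a small Python sketch that generates comparable data; the seed, sample size, and coefficients are my own assumptions, not the author's actual simulation:

```python
import numpy as np

# Illustrative stand-in for the (unshown) R simulation: a linear
# relationship between x and y with Gaussian noise.
# All parameters below are made up for illustration.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(0, 2, size=50)  # true line plus noise
```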
What we are mainly interested in is the variation in the dependent variable $y$. As a first step, let's disregard the predictor $x$. In this very simple "model", the variation in $y$ is the sum of the squared differences between the entries of $y$ and the mean of $y$, $\overline{y}$:
$$\sum_{i=1}^n (y_i - \overline{y})^2$$
We can compute this sum of squares numerically.
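With illustrative data (again, an assumption on my part rather than the author's actual simulation), the computation looks like this in Python:

```python
import numpy as np

# Made-up illustrative data, not the answer's actual R simulation.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(0, 2, size=50)

# Total variation: squared distances of each y value from the mean of y.
sst = np.sum((y - y.mean()) ** 2)
```

Note that this quantity is just $(n-1)$ times the sample variance of $y$, which is why it is a natural measure of the total variation.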
Now, we try a slightly more sophisticated model: we regress $y$ on $x$ and check how much variation remains after that. That is, we now calculate the sum of squared differences between the $y$ values and the regression line:
$$\sum_{i=1}^n (y_i - \hat{y}_i)^2$$
where the $\hat{y}_i$ are the fitted values on the regression line.
Note how the differences - the gray lines - are much smaller now than before!
We can likewise compute this sum of squared differences between the $y$ values and the regression line.
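Continuing with the illustrative data (my stand-in, not the author's), a least-squares fit and its residual sum of squares might look like:

```python
import numpy as np

# Made-up illustrative data, not the answer's actual R simulation.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(0, 2, size=50)

# Fit y on x by least squares; polyfit returns (slope, intercept)
# for a degree-1 polynomial.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

# Residual sum of squares: variation left over after the regression.
rss = np.sum((y - fitted) ** 2)
```

Because least squares minimizes exactly this quantity, the residual sum of squares can never exceed the total sum of squares around the mean.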
It turns out that this is only about 16% of the total sum of squares around the mean that we computed above.
Thus, our regression model reduced the unexplained variation in the observed data $y$ by 100% − 16% = 84%. And this number is precisely the $R^2$ that R will report for this model.
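Putting the two sums of squares together gives $R^2 = 1 - \text{RSS}/\text{SST}$. A Python sketch of the whole calculation, on the same illustrative (made-up) data:

```python
import numpy as np

# Made-up illustrative data, not the answer's actual R simulation.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(0, 2, size=50)

# Least-squares fit of y on x.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

rss = np.sum((y - fitted) ** 2)    # variation left after the regression
sst = np.sum((y - y.mean()) ** 2)  # total variation around the mean

r_squared = 1 - rss / sst          # fraction of variation explained
```

For simple regression with one predictor, this $R^2$ equals the squared Pearson correlation between $x$ and $y$, which is where the name comes from.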
Now, one question you might have is why we calculate variation as a sum of squares. Wouldn't it be easier to just sum up the absolute lengths of the deviations plotted above? The reason is that squares are much easier to handle mathematically, and it turns out that if we work with squares, we can prove all kinds of helpful theorems about $R^2$ and related quantities, such as those underlying $F$ tests and ANOVA tables.
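One concrete payoff of working with squares: for a least-squares fit with an intercept, the total variation splits exactly into an explained part and a residual part, with no cross-term,
$$\underbrace{\sum_{i}(y_i - \overline{y})^2}_{\text{total}} \;=\; \underbrace{\sum_{i}(\hat{y}_i - \overline{y})^2}_{\text{explained}} \;+\; \underbrace{\sum_{i}(y_i - \hat{y}_i)^2}_{\text{residual}},$$
so that
$$R^2 \;=\; 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \overline{y})^2} \;=\; \frac{\sum_i (\hat{y}_i - \overline{y})^2}{\sum_i (y_i - \overline{y})^2}.$$
No such clean decomposition holds for absolute deviations, which is why the "percentage of variation accounted for" reading of $R^2$ works at all.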