I'm really confused about the difference in meaning, in the context of linear regression, of the following terms:
- F statistic
- R squared
- Residual standard error
I found this website which gave me great insight into the different terms involved in linear regression; however, the terms mentioned above look quite alike to me (as far as I understand). I will cite what I read and what confused me:
The Residual Standard Error is a measure of the quality of a linear regression fit. … The Residual Standard Error is the average amount that the response (dist) will deviate from the true regression line.
1. So this is actually the average distance of the observed values from the fitted lm line?
The R-squared statistic provides a measure of how well the
model is fitting the actual data.
2. Now I'm getting confused: if the RSE tells us how far our observed points deviate from the regression line, then a low RSE actually tells us "your model fits the observed data points well", i.e. how well our model fits. So what is the difference between R squared and the RSE?
F-statistic is a good indicator of whether there is a relationship
between our predictor and the response variables.
3. Is it true that we can have an F value indicating a strong relationship that is NON-LINEAR, so that our RSE is high and our R squared is low?
Best Answer
The best way to understand these terms is to do a regression calculation by hand. I wrote two closely related answers (here and here); however, they may not fully address your particular case. Read through them nonetheless, as they may help you conceptualize these terms better.
In a regression (or ANOVA), we build a model based on a sample dataset which enables us to predict outcomes from a population of interest. To do so, the following three components are calculated in a simple linear regression from which the other components can be calculated, e.g. the mean squares, the F-value, the $R^2$ (also the adjusted $R^2$), and the residual standard error ($RSE$):
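In the standard least-squares decomposition these three components are the total, the residual, and the model sum of squares, which are linked by

$$SS_{total} = SS_{model} + SS_{residual}$$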
Each of them assesses how well the model describes the data, and each is a sum of squared distances from the data points to the fitted model (illustrated as red lines in the plots below).
The $SS_{total}$ assesses how well the mean fits the data. Why the mean? Because the mean is the simplest model we can fit and hence serves as the baseline to which the least-squares regression line is compared. This plot using the cars dataset illustrates that.

The $SS_{residual}$ assesses how well the regression line fits the data.
The $SS_{model}$ measures how much better the regression line fits the data than the mean does (i.e. the difference between the $SS_{total}$ and the $SS_{residual}$).
To answer your questions, let's first calculate the terms you want to understand, starting with the model and its output as a reference:
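The original R model output is not reproduced here, so as a stand-in sketch, here is the same calculation in Python with a small made-up dataset (the names `speed` and `dist` only mimic R's cars data; the numbers are invented for illustration):

```python
import numpy as np

# Made-up stand-in for R's cars data (predictor speed, response dist).
speed = np.array([4., 7., 8., 9., 10., 11., 12., 13., 14., 15.])
dist = np.array([2., 4., 16., 10., 18., 17., 24., 34., 26., 26.])
n = len(speed)

# Least-squares fit of dist = b0 + b1 * speed.
b1 = (np.sum((speed - speed.mean()) * (dist - dist.mean()))
      / np.sum((speed - speed.mean()) ** 2))
b0 = dist.mean() - b1 * speed.mean()
fitted = b0 + b1 * speed

# The three sums of squares from which everything else follows.
ss_total = np.sum((dist - dist.mean()) ** 2)  # distances to the mean
ss_residual = np.sum((dist - fitted) ** 2)    # distances to the line
ss_model = ss_total - ss_residual             # improvement over the mean

# Mean squares, RSE, R^2, and the F-value.
ms_model = ss_model / 1              # df_model = 1 (one predictor)
ms_residual = ss_residual / (n - 2)  # df_residual = n - 2
rse = np.sqrt(ms_residual)
r_squared = ss_model / ss_total
f_value = ms_model / ms_residual

print(f"RSE = {rse:.3f}, R^2 = {r_squared:.3f}, F = {f_value:.3f}")
```

The printed triple corresponds to the "Residual standard error", "Multiple R-squared", and "F-statistic" lines of R's `summary(lm(...))` output.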
The sums of squares are the squared distances of the individual data points to the model:
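In symbols (writing $y_i$ for the observed values, $\hat{y}_i$ for the fitted values, and $\bar{y}$ for their mean):

$$SS_{total}=\sum_i (y_i-\bar{y})^2, \qquad SS_{residual}=\sum_i (y_i-\hat{y}_i)^2, \qquad SS_{model}=\sum_i (\hat{y}_i-\bar{y})^2$$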
The mean squares are the sums of squares averaged by the degrees of freedom:
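For a simple linear regression with $n$ data points and a single predictor, the degrees of freedom are $df_{model} = 1$ and $df_{residual} = n - 2$, so:

$$MS_{model} = \frac{SS_{model}}{1}, \qquad MS_{residual} = \frac{SS_{residual}}{n-2}$$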
My answers to your questions:
Q1:
The residual standard error ($RSE$) is the square root of the residual mean square ($MS_{residual}$):
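In symbols:

$$RSE = \sqrt{MS_{residual}} = \sqrt{\frac{SS_{residual}}{n-2}}$$

(the $n - 2$ again being the residual degrees of freedom of a simple linear regression).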
If you remember that the $SS_{residual}$ was the sum of squared distances between the observed data points and the model (the regression line in the second plot above), and that the $MS_{residual}$ was just the averaged $SS_{residual}$, the answer to your first question is yes: the $RSE$ represents the average distance of the observed data from the model. Intuitively this makes perfect sense, because the smaller that distance is, the better your model fits.
Q2:
Now the $R^2$ is the ratio of the $SS_{model}$ and the $SS_{total}$:
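In symbols:

$$R^2 = \frac{SS_{model}}{SS_{total}} = 1 - \frac{SS_{residual}}{SS_{total}}$$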
The $R^2$ expresses how much of the total variation in the data can be explained by the model (the regression line). Remember that the total variation was the variation in the data when we fitted the simplest model to the data, i.e. the mean. Compare the $SS_{total}$ plot with the $SS_{model}$ plot.
So to answer your second question, the difference between the $RSE$ and the $R^2$ is that the $RSE$ tells you something about the inaccuracy of the model (in this case the regression line) given the observed data.
The $R^2$, on the other hand, tells you how much of the variation is explained by the model (i.e. the regression line) relative to the variation that was explained by the mean alone (i.e. the simplest model).
Q3:
The $F$-value, finally, is calculated as the model mean square $MS_{model}$ (the signal) divided by the $MS_{residual}$ (the noise):
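In symbols:

$$F = \frac{MS_{model}}{MS_{residual}}$$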
In other words, the $F$-value expresses how much the model has improved (compared to the mean) relative to the inaccuracy of the model.
Your third question is a bit difficult to understand, but I agree with the quote you provided. Note that in a simple linear regression $F = (n-2)\,\frac{R^2}{1-R^2}$, so with a large sample a highly significant $F$-value (strong evidence of a relationship) can coexist with a low $R^2$ and a high $RSE$ (a weak or partly non-linear fit).
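As a small illustration of how these quantities can diverge (a made-up simulation, not part of the original answer): a weak but genuine linear signal in a large sample gives a clearly significant $F$-value even though $R^2$ stays low and the $RSE$ stays high.

```python
import numpy as np

# Simulated data: a real but weak linear signal buried in noise.
rng = np.random.default_rng(42)
n = 2000
x = rng.normal(size=n)
y = 0.15 * x + rng.normal(size=n)  # small slope, noise sd of 1

# Same hand calculation as before.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)
ss_residual = np.sum((y - fitted) ** 2)
ss_model = ss_total - ss_residual

r_squared = ss_model / ss_total
rse = np.sqrt(ss_residual / (n - 2))
f_value = (ss_model / 1) / (ss_residual / (n - 2))

# Large F (significant relationship), yet low R^2 and RSE near the
# noise level (poor predictive fit).
print(f"R^2 = {r_squared:.4f}, RSE = {rse:.3f}, F = {f_value:.1f}")
```

The large sample makes the $F$-test sensitive to even a tiny signal, while $R^2$ and the $RSE$ keep reporting that most of the variation remains unexplained.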