Solved – Linear regression: what do the F statistic, R squared and residual standard error tell us?

f statistic, linear, linear model, nonlinear regression, r-squared

I'm really confused about the difference in meaning, in the context of linear regression, of the following terms:

  • F statistic
  • R squared
  • Residual standard error

I found this website, which gave me great insight into the different terms involved in linear regression; however, the terms mentioned above seem quite alike to me (as far as I understand). I will cite what I read and what confused me:

Residual Standard Error is a measure of the quality of a linear
regression fit. … The Residual Standard Error is the average amount that the response (dist) will deviate from the true regression line.

1. So is this actually the average distance of the observed values from the lm line?

The R-squared statistic provides a measure of how well the
model is fitting the actual data.

2. Now I'm getting confused: if the RSE tells us how far our observed points deviate from the regression line, then a low RSE is actually telling us "your model fits the observed data points well", i.e. how well our model fits. So what is the difference between R squared and the RSE?

F-statistic is a good indicator of whether there is a relationship
between our predictor and the response variables.

3. Is it true that we can have an F value indicating a strong relationship that is NON LINEAR, so that our RSE is high and our R squared is low?

Best Answer

The best way to understand these terms is to do a regression calculation by hand. I wrote two closely related answers (here and here); however, they may not fully cover your particular case. But read through them nonetheless, as they may also help you conceptualize these terms better.

In a regression (or ANOVA), we build a model from a sample dataset which enables us to predict outcomes in a population of interest. To do so, the following three components are calculated in a simple linear regression, and from them the other quantities can be derived, e.g. the mean squares, the $F$-value, the $R^2$ (also the adjusted $R^2$), and the residual standard error ($RSE$):

  1. total sums of squares ($SS_{total}$)
  2. residual sums of squares ($SS_{residual}$)
  3. model sums of squares ($SS_{model}$)

Each of them assesses how well a model describes the data and is the sum of the squared distances from the data points to the fitted model (illustrated as red lines in the plots below).

The $SS_{total}$ assesses how well the mean fits the data. Why the mean? Because the mean is the simplest model we can fit and hence serves as the baseline to which the least-squares regression line is compared. This plot, using the cars dataset, illustrates that:

[Plot: observed stopping distances (cars data) with red lines marking each point's distance from the mean of dist]
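Since the original image is not reproduced here, a minimal base-R sketch of my own (using the built-in cars dataset that the answer works with throughout, not the original plotting code) recreates the idea:

# Sketch: observed dist values and their distances from the mean (SS_total)
plot(dist ~ speed, data = cars, pch = 16)
abline(h = mean(cars$dist), lwd = 2)   # the simplest model: the mean of dist
segments(cars$speed, cars$dist, cars$speed, mean(cars$dist), col = "red")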

The $SS_{residual}$ assesses how well the regression line fits the data.

[Plot: observed stopping distances with red lines marking each point's distance from the fitted regression line]
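Again, a sketch of the missing plot (same assumptions as above; the model is refitted here so the snippet runs on its own):

# Sketch: observed dist values and their distances from the regression line (SS_residual)
m1 <- lm(dist ~ speed, data = cars)    # also defined later in the answer
plot(dist ~ speed, data = cars, pch = 16)
abline(m1, lwd = 2)                    # least-squares regression line
segments(cars$speed, cars$dist, cars$speed, fitted(m1), col = "red")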

The $SS_{model}$ measures how much better the regression line fits the data than the mean does (i.e. it is the difference between the $SS_{total}$ and the $SS_{residual}$).

[Plot: regression line and mean line with red lines marking the distances between the fitted values and the mean]
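And a sketch of the third plot, showing the distances between the fitted values and the mean (again my own reconstruction):

# Sketch: distances between the fitted values and the mean (SS_model)
m1 <- lm(dist ~ speed, data = cars)
plot(dist ~ speed, data = cars, pch = 16)
abline(m1, lwd = 2)                             # regression line
abline(h = mean(cars$dist), lwd = 2, lty = 2)   # mean of dist
segments(cars$speed, fitted(m1), cars$speed, mean(cars$dist), col = "red")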

To answer your questions, let's first calculate the terms you want to understand, starting with the model and its output as a reference:

# The model and output as reference
m1 <- lm(dist ~ speed, data = cars)
summary(m1)
summary.aov(m1) # To get the sums of squares and mean squares

The sums of squares are the sums of the squared distances of the individual data points from the model:

# Calculate sums of squares (total, residual and model)
y <- cars$dist
ybar <- mean(y)
ss.total <- sum((y-ybar)^2)
ss.total
ss.residual <- sum((y-m1$fitted)^2)
ss.residual
ss.model <- ss.total-ss.residual
ss.model
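
As a quick sanity check (my addition, not part of the original answer), the hand-calculated values should match the "Sum Sq" column of the ANOVA table, whose rows for this model are "speed" and "Residuals":

# Sanity check against the ANOVA table of m1
anova(m1)
all.equal(ss.model, anova(m1)["speed", "Sum Sq"])
all.equal(ss.residual, anova(m1)["Residuals", "Sum Sq"])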

The mean squares are the sums of squares averaged by the degrees of freedom:

# Calculate degrees of freedom (total, residual and model)
n <- length(cars$speed)
k <- length(m1$coef) # k = number of model parameters: b0, b1
df.total <- n-1
df.residual <- n-k
df.model <- k-1

# Calculate mean squares (note that these are just variances)
ms.residual <- ss.residual/df.residual
ms.residual
ms.model <- ss.model/df.model
ms.model

My answers to your questions:

Q1:

1. So is this actually the average distance of the observed values from the lm line?

The residual standard error ($RSE$) is the square root of the residual mean square ($MS_{residual}$):

# Calculate residual standard error
res.se <- sqrt(ms.residual)
res.se  

If you remember that the $SS_{residual}$ was the sum of the squared distances between the observed data points and the model (the regression line in the second plot above), and that the $MS_{residual}$ was just the averaged $SS_{residual}$, the answer to your first question is yes: the $RSE$ represents the average distance of the observed data from the model. Intuitively, this also makes perfect sense, because a smaller distance means a better model fit.
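As a cross-check (my addition), the hand-calculated $RSE$ should agree with the "Residual standard error" reported by summary(m1), which R stores in the sigma component:

# Cross-check against the value stored by summary.lm
summary(m1)$sigma
all.equal(res.se, summary(m1)$sigma)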

Q2:

2. Now I'm getting confused: if the RSE tells us how far our observed points deviate from the regression line, then a low RSE is actually telling us "your model fits the observed data points well", i.e. how well our model fits. So what is the difference between R squared and the RSE?

Now the $R^2$ is the ratio of the $SS_{model}$ to the $SS_{total}$:

# R squared
r.sq <- ss.model/ss.total
r.sq

The $R^2$ expresses how much of the total variation in the data can be explained by the model (the regression line). Remember that the total variation was the variation in the data when we fitted the simplest model to the data, i.e. the mean. Compare the $SS_{total}$ plot with the $SS_{model}$ plot.
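As before, the hand-calculated value should match the "Multiple R-squared" entry of summary(m1). And since the adjusted $R^2$ was mentioned at the beginning but never computed, here is a sketch of it as well (my addition): it simply replaces the raw sums of squares by their degree-of-freedom-corrected variances:

# Cross-check R squared and compute the adjusted R squared
all.equal(r.sq, summary(m1)$r.squared)
adj.r.sq <- 1 - (ss.residual/df.residual) / (ss.total/df.total)
adj.r.sq
all.equal(adj.r.sq, summary(m1)$adj.r.squared)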

So to answer your second question, the difference between the $RSE$ and the $R^2$ is that the $RSE$ tells you something about the inaccuracy of the model (in this case the regression line) given the observed data.

The $R^2$, on the other hand, tells you how much variation is explained by the model (i.e. the regression line) relative to the variation that was explained by the mean alone (i.e. the simplest model).

Q3:

3. Is it true that we can have an F value indicating a strong relationship that is NON LINEAR, so that our RSE is high and our R squared is low?

The $F$-value, on the other hand, is calculated as the model mean square $MS_{model}$ (the signal) divided by the $MS_{residual}$ (the noise):

# Calculate F-value
F <- ms.model/ms.residual
F
# Calculate P-value
p.F <- 1-pf(F, df.model, df.residual)
p.F 

Or in other words, the $F$-value expresses how much the model has improved (compared to the mean) given the inaccuracy of the model.
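As a last cross-check (my addition), the hand-calculated $F$-value and p-value should match what summary(m1) stores in its fstatistic component (the value plus its numerator and denominator degrees of freedom):

# Cross-check against summary.lm (fstatistic = value, numdf, dendf)
f <- summary(m1)$fstatistic
f
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)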

Your third question is a bit difficult to understand, but I agree with the quote you provided: the $F$-value tells you whether the model explains significantly more variation than the mean alone, and (especially with a large sample) it can be highly significant even when the $R^2$ is low and the $RSE$ is high, for example when the relationship is non-linear and the straight line only captures part of it.
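To make that last point concrete, here is a small simulation of my own (not from the original answer; the names x, y and m2 are mine): a non-linear but monotone relationship buried in a lot of noise gives a highly significant $F$-value even though the $R^2$ stays low and the $RSE$ stays close to the noise level.

# Sketch: significant F despite low R squared and a high RSE (simulated data)
set.seed(1)
x <- runif(1000, 0, 10)
y <- sqrt(x) + rnorm(1000, sd = 2)   # non-linear (square-root) signal, lots of noise
m2 <- lm(y ~ x)
summary(m2)   # expect: small R squared, RSE near 2, but a very small p-value for F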
