Linear Regression: Understanding F-Test for Lack of Fit Using ANOVA

Tags: anova, f-test, least-squares, regression

This is the data set:

https://raw.githubusercontent.com/FlorinAndrei/misc/master/stackexchange_attach/normtemp.csv

I'm looking at a simple linear regression. The data is quite trivial.

import pandas as pd
from statsmodels.formula.api import ols

# normtemp.csv is the file linked above
nt = pd.read_csv('normtemp.csv')
model = ols('hr ~ temp', data=nt).fit()  # simple linear regression: heart rate on temperature
print(nt.head())

Output:

   temp  hr
0  96.3  70
1  96.7  71
2  96.9  74
3  97.0  80
4  97.1  73

To test whether there is lack of fit, I've seen the F-test done this way:

from statsmodels.stats.api import anova_lm

# convert temp to a categorical predictor; everything else stays the same
nt_fact = nt.copy()
nt_fact['temp'] = nt_fact['temp'].astype(object)
model_fact = ols('hr ~ temp', data=nt_fact).fit()  # "full" model: one mean per temperature value

# compare the linear model against the full model
an_out = anova_lm(model, model_fact)
print(an_out)

Output:

   df_resid          ssr  df_diff      ss_diff         F    Pr(>F)
0     126.0  6009.601776      0.0          NaN       NaN       NaN
1      96.0  4177.428932     30.0  1832.172843  1.403484  0.110304
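
The same F statistic and p-value can be reproduced by hand from the two residual sums of squares (a quick sketch, reusing the model and model_fact objects above):

from scipy import stats

# F statistic: (extra sum of squares per extra degree of freedom) divided by
# the residual variance of the full model
num = (model.ssr - model_fact.ssr) / (model.df_resid - model_fact.df_resid)
den = model_fact.ssr / model_fact.df_resid
F = num / den
p = stats.f.sf(F, model.df_resid - model_fact.df_resid, model_fact.df_resid)
print(F, p)  # ~1.4035, ~0.1103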

Same thing in R:

# assuming the data frame has been read in, e.g. nt <- read.csv('normtemp.csv')
linear.model <- lm(hr ~ temp, data = nt)
full.model <- lm(hr ~ factor(temp), data = nt)
anova(linear.model, full.model)

The p-value from the ANOVA is 0.1103, so we do not reject H0: there is no significant evidence that the linear model is inappropriate.

EDIT: This is not about overfitting. This is to check whether the linear model is or is not a good fit for the data. Rejecting H0 from ANOVA would indicate the linear model is not a good fit for the data.

I am blanking completely on the intuitive explanation for why this works. We build two models here – the actual LR model where the predictor is continuous, and another model where the predictor is converted to categorical (and that's the only change) but the data is otherwise the same.

What exactly are we comparing here with ANOVA, and why would rejecting H0 indicate that the fit is not good (or the other way around)?

Best Answer

TL;DR: Basically, this ANOVA test for goodness of fit effectively tests whether the conditional means of the residuals differ from zero. A good model is supposed to capture the conditional mean of the data, reducing the expected conditional mean of the residuals to zero.

I am blanking completely on the intuitive explanation for why this works. We build two models here - the actual LR model where the predictor is continuous, and another model where the predictor is converted to categorical (and that's the only change) but the data is otherwise the same.

What exactly are we comparing here with ANOVA, and why rejecting H0 would indicate the fit is not good? (or the other way around)

The image below might give an intuitive view of what this model comparison is testing.

[Figure: scatter plot of hr vs temp with the linear model (green line) and the full model (red line)]

In the first plot we see the scatter plot along with the two models.

  • The linear model (the green line).

    Is the linear model (the green line) a good model?

    This is not a question about whether the linear model explains every single data point. (It doesn't explain a lot, and the coefficient of determination is small, $R^2 = 0.06$.)

    Instead, it is about whether the linear model explains the expectation value of the heart rate as a function of the temperature.

  • The full model (red line).

    So goodness of fit is not about the individual points (that is what the coefficient of determination expresses), but about the expectation value of the points.

    The full model, the red line, estimates this expectation value separately for each observed temperature value (see the small sketch just after this list).
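
As a small check (a sketch, reusing the nt and model_fact objects from the question), the fitted values of the full model are simply the per-temperature group means of the heart rate:

# the full model's fitted values coincide with the mean hr at each temperature
group_means = nt.groupby('temp')['hr'].transform('mean')
print((model_fact.fittedvalues - group_means).abs().max())  # essentially 0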

Now, the question is whether the full model (the red line) is doing 'better' than the linear model (the green line). Here, 'better' might mean: does it have a lower variance of the residuals? If the full model is doing better, then the linear model does not completely describe the expectation value (this doesn't mean it is a bad model, only that it might not be perfect).

Since the full model (red line) is more flexible, it will fit the random noise as well and will always do somewhat better, even when the linear model (green line) is the true model. So, more precisely, the question is whether the full model is doing statistically significantly better. Is the full model (red line) doing better beyond what might be expected under the null hypothesis that the linear model (green line) is the true model?

To test this statistical significance we use the variance of the noise: we see that, even with the full model (red line), there is variation in the data points. If the null model is true, then this noise variation should account for the variation in the difference between the red line and the green line (think, for instance, of how you estimate the variance of a mean based on the residuals).

So the question becomes: is the variation of the means larger than what we expect based on the variation of the residuals? That is your ANOVA/F-test in a nutshell: a comparison of the variance in the residuals with the variance in the means of the residuals.
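
In formula form, using the residual sums of squares and degrees of freedom from the ANOVA table above, the lack-of-fit F statistic is

$$F = \frac{(\mathrm{SSR}_{\text{linear}} - \mathrm{SSR}_{\text{full}})/(df_{\text{linear}} - df_{\text{full}})}{\mathrm{SSR}_{\text{full}}/df_{\text{full}}} = \frac{(6009.6 - 4177.4)/30}{4177.4/96} \approx 1.40,$$

which is compared against an $F_{30,\,96}$ distribution, giving the p-value of 0.1103.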

So in a certain sense this F-test (which compares variances) is a test for whether the expectation value of the residuals is different from zero. This we see in the second plot.

[Figure: residual plot (hr minus the linear-model fit) vs temp]

In the second plot we see a scatter plot of the residuals: the differences between the observations and the fitted values from the linear model.

Also shown in this plot is the difference between the linear model and the full model. What we see here is that there is not much of a clear trend (this is another way to assess goodness of fit: inspect the residual plot visually).
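
A rough sketch of how such a residual plot could be drawn in Python (assuming the nt and model objects from the question; the per-temperature mean residuals correspond to the difference between the full model and the linear model):

import matplotlib.pyplot as plt

res = model.resid                                     # residuals of the linear model
mean_res = res.groupby(nt['temp']).transform('mean')  # per-temperature mean residual

plt.scatter(nt['temp'], res, alpha=0.5, label='residuals')
plt.scatter(nt['temp'], mean_res, color='red', label='per-temperature mean (full model)')
plt.axhline(0, color='green', label='linear model')
plt.xlabel('temp')
plt.ylabel('residual hr')
plt.legend()
plt.show()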

We can also test this with an ANOVA, but now using a model for the residuals. In R this looks as follows:

 # same data frame as before, e.g. nt <- read.csv('normtemp.csv')
 linear.model <- lm(hr ~ temp, data = nt)
 full.model <- lm(hr ~ factor(temp), data = nt)

 # residuals of the linear model
 res <- residuals(linear.model)

 # model the residuals: intercept only vs. one mean per temperature value
 modf <- lm(res ~ factor(temp), data = nt)
 mod0 <- lm(res ~ 1)

 anova(linear.model, full.model)
 anova(mod0, modf)

which gives nearly the same tables:

 > anova(linear.model, full.model)
 Analysis of Variance Table

 Model 1: hr ~ temp
 Model 2: hr ~ factor(temp)
   Res.Df    RSS Df Sum of Sq      F Pr(>F)
 1    126 6009.6                           
 2     96 4177.4 30    1832.2 1.4035 0.1103

 > anova(mod0,modf)
 Analysis of Variance Table

 Model 1: res ~ 1
 Model 2: res ~ factor(temp)
   Res.Df    RSS Df Sum of Sq      F Pr(>F)
 1    127 6009.6                           
 2     96 4177.4 31    1832.2 1.3582 0.1315

There is only a difference in the degrees of freedom. The reason is that the residuals, if the null hypothesis is correct, are not the same as independent Gaussian noise: the residuals lie in a space with one degree of freedom less (see Why are the residuals in $\mathbb{R}^{n-p}$?).
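
For completeness, a rough Python equivalent of the residual-model comparison (a sketch, reusing the nt and model objects from the question; C(temp) makes the formula treat temperature as categorical):

from statsmodels.formula.api import ols
from statsmodels.stats.api import anova_lm

res_df = nt.copy()
res_df['res'] = model.resid                     # residuals of the linear model

mod0 = ols('res ~ 1', data=res_df).fit()        # intercept-only model for the residuals
modf = ols('res ~ C(temp)', data=res_df).fit()  # one mean per temperature value

print(anova_lm(mod0, modf))                     # should roughly match the second R table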