Solved – Multiple regression, full and restricted model

model selectionmultiple regressionrregressionregression coefficients

So I have a data set that looks like this.

I want to write a full and restricted model which would evaluate the null hypothesis that latitude – controlling for continent and sex – has a significant relationship with wing size. I am currently using R

Is it okay to do this for my full model?

fit <- lm(wingsize ~ continent + sex + latitude, data=wing)

This is my output:

$$\Tiny \text{WINGSIZE} = 836.1648 + (-4.1289)\times \text{CONTINENT} + (-98.8571)\times \text{SEX} + 1.7926\times \text{LATITUDE}$$

And for restricted model, I did this:

res <- lm(wingsize ~ latitude, data=wing)

and have a model

$$\Tiny\text{WINGSIZE} = 780.532 + 1.883\times\text{LATITUDE}$$

Is it supposed to look like a plain simple linear regression for the restricted part? Is that what restricted mean?

Also, how many degrees of freedom would exist for full and restricted models? Are the degree of freedoms always the same? Is this question supposed to refer to the residual error degrees of freedom?

Thank you for any input.

Best Answer

Nice project! As you guessed correctly, in the context of multiple linear regression, with predictors $X_1,\dots,X_p$ and response $Y$, the full (or unrestricted) model is the usual OLS estimate, where we put no restrictions on the regression coefficients of the various predictors. A restricted model is one for which we impose a set of constraints on the regression coefficients $\beta_i$. In the simplest case, we set one or more $\beta_i$ to 0: in general, we can consider a set of linear constraints given in matrix form by $\mathbf{R}\beta=\mathbf{r}$. In your case, you considered the two simple constraints $\beta_{sex}=\beta_{continent}=0$.

Before comparing formally the two models, let's see what the relationship between latitude and wingsize looks by making a simple plot:

with(wing, plot(latitude, wingsize, pch=16))

enter image description here

It seems that data are separated into two groups, and inside each group there's a clear increasing trend, which is what one would expect according to Bergmann's rule. The separation is nearly too good to be true, but at least for some raptors it's well-known that females are bigger than males. We can easily check that the two groups correspond to the two sexes:

with(wing, plot(latitude, wingsize, col = c("red", "blue")[sex+1], pch=16))

enter image description here

Thus we expect latitude to be highly significant for wingsize if we control for sex, but not necessarily otherwise. What about continent? It doesn't seem that including it helps explaining any variance:

with(wing, plot(latitude, wingsize, col = c("red", "blue")[continent+1], pch=16))

enter image description here

Let's verify this formally:

res_model <- lm(wingsize ~ latitude, data = wing)
summary(res_model)
Call:
lm(formula = wingsize ~ latitude, data = wing)

Residuals:
    Min      1Q  Median      3Q     Max 
-72.433 -50.335   5.956  49.651  72.132 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  780.532     64.526  12.096 6.08e-15 ***
latitude       1.883      1.436   1.312    0.197    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 51.82 on 40 degrees of freedom
Multiple R-squared:  0.04125,   Adjusted R-squared:  0.01728 
F-statistic: 1.721 on 1 and 40 DF,  p-value: 0.1971

Not even significant at the 0.05 level. BTW, since you asked about degrees of freedom, note that R reports DF = 40. For linear regression, DF=n-p-1 where n is the sample size (42 in your case), p is the number of predictors in the model (1 since we're considering the model with only latitude) and thus 42-1-1=40.

Let's now consider the full model:

> full_model <- lm(wingsize ~ ., data = wing)
> summary(full_model)

Call:
lm(formula = wingsize ~ ., data = wing)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.7285  -6.5128   0.6035   5.0069  30.7508 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 836.1648    14.8448  56.327  < 2e-16 ***
continent    -4.1289     3.5224  -1.172    0.248    
latitude      1.7926     0.3158   5.676 1.59e-06 ***
sex         -98.8571     3.4114 -28.978  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.05 on 38 degrees of freedom
Multiple R-squared:  0.9586,    Adjusted R-squared:  0.9553 
F-statistic:   293 on 3 and 38 DF,  p-value: < 2.2e-16

As expected, now latitude, in a model which already includes sex (and continent), is very significant, and similarly sex is highly significant in a model which already includes latitude and continent. Instead, once we control for sex and latitude, it looks like (the difference between zero and the regression coefficient of) continent is not statistically significant. We can also check that the difference among the two models is statistically significant by using ANOVA, since the restricted model is always a nested model of the unrestricted model, and thus ANOVA is applicable.

> anova(res_model, full_model)
Analysis of Variance Table

Model 1: wingsize ~ latitude
Model 2: wingsize ~ continent + latitude + sex
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     40 107425                                  
2     38   4643  2    102782 420.56 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As expected, the difference is highly significant!

Finally, we noted that continent didn't seem to explain any residual variance, once latitude and sex were already included in the model. Be warned! Normally this "fishing for significance" is a terrible idea. If you start by removing the variable with the largest p-value, refit the model and iterate the process, by removing at each step the variable with the highest p-value (backward stepwise regression), then the p-values you obtain in the final model are not valid (because they're computed without taking into account this selection process), and inference becomes unreliable. However, just for this special case, we may close an eye, since we are going to remove only one predictor, whose p-value is not just a bit higher than the others, but it's several orders of magnitude larger. Let's build the restricted model which contains only sex and latitude, and see how it compares to the other two.

> res_model_2 <- lm(wingsize ~ latitude + sex, data = wing)
> summary(res_model_2)

Call:
lm(formula = wingsize ~ latitude + sex, data = wing)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.005  -5.681  -0.494   5.197  32.560 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 829.9603    13.9354  59.558  < 2e-16 ***
latitude      1.8832     0.3077   6.121 3.52e-07 ***
sex         -98.8571     3.4277 -28.841  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.11 on 39 degrees of freedom
Multiple R-squared:  0.9571,    Adjusted R-squared:  0.9549 
F-statistic: 434.6 on 2 and 39 DF,  p-value: < 2.2e-16

Very good! It seems that the model is very similar to the full model, in terms of predictive performance (as estimated by adjusted R-squared). As a matter of fact, if we now repeat the ANOVA test, we see that the difference between the first restricted model and this new restricted model is significant, but not the difference between the new restricted model and the full one:

> anova(res_model, res_model_2, full_model)
Analysis of Variance Table

Model 1: wingsize ~ latitude
Model 2: wingsize ~ latitude + sex
Model 3: wingsize ~ continent + latitude + sex
  Res.Df    RSS Df Sum of Sq       F Pr(>F)    
1     40 107425                                
2     39   4811  1    102614 839.752 <2e-16 ***
3     38   4643  1       168   1.374 0.2484    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Let me stress again that in general variable selection based on stepwise regression is BAD! If you need to do variable selection in a reliable way, use LASSO instead.