So I have a data set that looks like this.
I want to write a full and a restricted model to evaluate the null hypothesis that latitude, controlling for continent and sex, has no relationship with wing size. I am currently using R.
Is it okay to do this for my full model?
fit <- lm(wingsize ~ continent + sex + latitude, data=wing)
This is my output:
$$\Tiny \text{WINGSIZE} = 836.1648 + (-4.1289)\times \text{CONTINENT} + (-98.8571)\times \text{SEX} + 1.7926\times \text{LATITUDE}$$
And for the restricted model, I did this:
res <- lm(wingsize ~ latitude, data=wing)
and got the model
$$\Tiny\text{WINGSIZE} = 780.532 + 1.883\times\text{LATITUDE}$$
Is the restricted part supposed to look like a plain simple linear regression? Is that what "restricted" means?
Also, how many degrees of freedom do the full and restricted models have? Are the degrees of freedom always the same? Is this question supposed to refer to the residual (error) degrees of freedom?
Thank you for any input.
Best Answer
Nice project! As you guessed correctly, in the context of multiple linear regression with predictors $X_1,\dots,X_p$ and response $Y$, the full (or unrestricted) model is the usual OLS fit, where we put no restrictions on the regression coefficients of the various predictors. A restricted model is one in which we impose a set of constraints on the regression coefficients $\beta_i$. In the simplest case, we set one or more $\beta_i$ to 0; in general, we can consider a set of linear constraints given in matrix form by $\mathbf{R}\beta=\mathbf{r}$. In your case, you considered the two simple constraints $\beta_{sex}=\beta_{continent}=0$.
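To make the matrix form concrete, here is a sketch for these two constraints. It assumes the coefficient vector is ordered as intercept, continent, sex, latitude; that ordering is only an illustrative choice.

$$\beta = \begin{pmatrix}\beta_0 \\ \beta_{continent} \\ \beta_{sex} \\ \beta_{latitude}\end{pmatrix}, \qquad \mathbf{R} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad \mathbf{r} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

Each row of $\mathbf{R}\beta=\mathbf{r}$ is one constraint: the first row says $\beta_{continent}=0$, the second says $\beta_{sex}=0$. With $q$ constraints (here $q=2$), the usual F statistic for comparing the restricted and full fits is

$$F = \frac{(\text{RSS}_{r} - \text{RSS}_{f})/q}{\text{RSS}_{f}/(n-p-1)}$$

where $\text{RSS}_{r}$ and $\text{RSS}_{f}$ are the residual sums of squares of the restricted and full models, $n$ is the sample size and $p$ is the number of predictors in the full model. Under the null hypothesis and the standard normal-error assumptions, $F$ follows an $F_{q,\,n-p-1}$ distribution.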
Before formally comparing the two models, let's see what the relationship between `latitude` and `wingsize` looks like by making a simple plot:
It seems that the data are separated into two groups, and inside each group there's a clear increasing trend, which is what one would expect according to Bergmann's rule. The separation is nearly too good to be true, but at least for some raptors it's well-known that females are bigger than males. We can easily check that the two groups correspond to the two sexes:
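One way such a check could be done is to colour the points by `sex`. The original data are not reproduced here, so this sketch simulates a hypothetical stand-in for `wing`: the 0/1 coding of `sex` and `continent`, the latitude range, and the coefficients are all assumptions made for illustration.

```r
# Hypothetical stand-in for the asker's data: 42 birds, two sexes.
set.seed(1)
n <- 42
wing <- data.frame(
  sex       = rep(c(0, 1), each = n / 2),   # assumed 0/1 coding
  continent = rep(c(0, 1), times = n / 2),  # assumed 0/1 coding
  latitude  = runif(n, 35, 60)              # made-up latitude range
)
# Made-up coefficients, loosely echoing the fitted values in the question.
wing$wingsize <- 836 - 99 * wing$sex + 1.8 * wing$latitude + rnorm(n, sd = 10)

# Scatter plot of wing size against latitude, one colour per sex.
cols <- ifelse(wing$sex == 1, "blue", "red")
plot(wingsize ~ latitude, data = wing, col = cols, pch = 19)
legend("topleft", legend = c("sex = 0", "sex = 1"),
       col = c("red", "blue"), pch = 19)
```

If the two clouds of points line up with the two colours, the grouping seen in the uncoloured plot is explained by `sex`.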
Thus we expect `latitude` to be highly significant for `wingsize` if we control for `sex`, but not necessarily otherwise. What about `continent`? It doesn't seem that including it helps explain any variance:
Let's verify this formally:
Not even significant at the 0.05 level. BTW, since you asked about degrees of freedom, note that R reports `DF` = 40; these are the residual degrees of freedom. For linear regression, `DF = n - p - 1`, where `n` is the sample size (42 in your case) and `p` is the number of predictors in the model (1, since we're considering the model with only `latitude`), and thus 42 - 1 - 1 = 40.
Let's now consider the full model:
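As a rough sketch, the two fits and their residual degrees of freedom could be inspected as follows. The real data are not available here, so `wing` is simulated: the numeric 0/1 coding of `sex` and `continent` (suggested by the single coefficient each gets in the question) and all coefficients are assumptions. The model formulas and the names `fit` and `res` are taken from the question.

```r
# Hypothetical stand-in for the asker's data (n = 42, as in the question).
set.seed(1)
n <- 42
wing <- data.frame(
  sex       = rep(c(0, 1), each = n / 2),   # assumed 0/1 coding
  continent = rep(c(0, 1), times = n / 2),  # assumed 0/1 coding
  latitude  = runif(n, 35, 60)              # made-up latitude range
)
wing$wingsize <- 836 - 99 * wing$sex + 1.8 * wing$latitude + rnorm(n, sd = 10)

# Restricted model (latitude only) and full model, as in the question.
res <- lm(wingsize ~ latitude, data = wing)
fit <- lm(wingsize ~ continent + sex + latitude, data = wing)

# Residual degrees of freedom are n - p - 1 in both cases.
df.residual(res)  # p = 1 predictor
df.residual(fit)  # p = 3 predictors

# Coefficient table with t-tests for each predictor in the full model.
summary(fit)
```

With 42 observations this gives 42 - 1 - 1 = 40 residual degrees of freedom for the restricted model and 42 - 3 - 1 = 38 for the full one, so the two are not the same: each extra estimated coefficient costs one degree of freedom.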
As expected, `latitude` is now very significant in a model which already includes `sex` (and `continent`), and similarly `sex` is highly significant in a model which already includes `latitude` and `continent`. By contrast, once we control for `sex` and `latitude`, it looks like (the difference between zero and the regression coefficient of) `continent` is not statistically significant. We can also check that the difference between the two models is statistically significant by using ANOVA: the restricted model is always nested in the unrestricted model, so ANOVA is applicable.
As expected, the difference is highly significant!
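A minimal sketch of this nested-model comparison uses `anova()` on the two `lm` fits. The data below are again a simulated, hypothetical stand-in for `wing` (assumed 0/1 coding and made-up coefficients), so the numbers will not match the real output.

```r
# Hypothetical stand-in for the asker's data (n = 42, as in the question).
set.seed(1)
n <- 42
wing <- data.frame(
  sex       = rep(c(0, 1), each = n / 2),   # assumed 0/1 coding
  continent = rep(c(0, 1), times = n / 2),  # assumed 0/1 coding
  latitude  = runif(n, 35, 60)              # made-up latitude range
)
wing$wingsize <- 836 - 99 * wing$sex + 1.8 * wing$latitude + rnorm(n, sd = 10)

res <- lm(wingsize ~ latitude, data = wing)                    # restricted
fit <- lm(wingsize ~ continent + sex + latitude, data = wing)  # full

# F-test of the two constraints beta_sex = beta_continent = 0.
cmp <- anova(res, fit)
print(cmp)
```

The `Df` column in the second row is the number of constraints being tested (2 here), and `Pr(>F)` is the p-value of the F-test of the restricted model against the full one.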
Finally, we noted that `continent` didn't seem to explain any residual variance once `latitude` and `sex` were already included in the model. Be warned! Normally this "fishing for significance" is a terrible idea. If you start by removing the variable with the largest p-value, refit the model, and iterate, removing at each step the variable with the highest p-value (backward stepwise regression), then the p-values you obtain in the final model are not valid (because they're computed without taking this selection process into account), and inference becomes unreliable. However, just for this special case, we may turn a blind eye, since we are going to remove only one predictor, whose p-value is not just a bit higher than the others but several orders of magnitude larger. Let's build the restricted model which contains only `sex` and `latitude`, and see how it compares to the other two.
Very good! It seems that the model is very similar to the full model in terms of predictive performance (as estimated by adjusted R-squared). As a matter of fact, if we now repeat the ANOVA test, we see that the difference between the first restricted model and this new restricted model is significant, but not the difference between the new restricted model and the full one:
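A sketch of how this three-way comparison could be run: passing the three nested fits to `anova()` in order of increasing size tests each model against the previous one. The name `mid` for the intermediate model is hypothetical, and `wing` is once more simulated under assumed 0/1 coding and made-up coefficients, so only the structure of the output carries over to the real data.

```r
# Hypothetical stand-in for the asker's data (n = 42, as in the question).
set.seed(1)
n <- 42
wing <- data.frame(
  sex       = rep(c(0, 1), each = n / 2),   # assumed 0/1 coding
  continent = rep(c(0, 1), times = n / 2),  # assumed 0/1 coding
  latitude  = runif(n, 35, 60)              # made-up latitude range
)
wing$wingsize <- 836 - 99 * wing$sex + 1.8 * wing$latitude + rnorm(n, sd = 10)

res <- lm(wingsize ~ latitude, data = wing)                    # latitude only
mid <- lm(wingsize ~ sex + latitude, data = wing)              # hypothetical name
fit <- lm(wingsize ~ continent + sex + latitude, data = wing)  # full

# Adjusted R-squared of the intermediate and full models.
c(mid = summary(mid)$adj.r.squared, full = summary(fit)$adj.r.squared)

# Sequential F-tests: row 2 is res vs mid, row 3 is mid vs fit.
tab <- anova(res, mid, fit)
print(tab)
```

Each step adds one coefficient, so the `Df` column reads 1 in both rows 2 and 3.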
Let me stress again that in general variable selection based on stepwise regression is BAD! If you need to do variable selection in a reliable way, use LASSO instead.