Solved – Confused about Residual Degrees of Freedom

Tags: degrees-of-freedom, regression

My understanding of degrees of freedom for a regression model is that if you have 5 terms in your model, you have 6 parameters in total (including the constant), and therefore 6 degrees of freedom in your model. So, if you increase the degrees of freedom, you decrease model bias at the risk of increasing model variance.

I'm confused as to what residual degrees of freedom are and what their significance is. What are they used for?

I'd appreciate any help. Thanks guys!

Best Answer

There may be another post on this system addressing this, and if so, I'm hoping someone else will link to it in a comment.

In brief, the residual degrees of freedom are the remaining "dimensions" that you could use to generate a new data set that "looks" like your current data set. As a very simple example: if you have three numbers and you know they have a mean of 10, then you can pick the first two numbers freely (say 8 and 15), but then you have no choice left for the third and last number. It must be 7, or else the sum won't be 30 and the mean won't be 10.
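The "free choices" idea can be sketched in a few lines (using the same hypothetical numbers as above): with 3 values constrained to a mean of 10, only 2 can be chosen freely, and the last one is forced.

```python
# Three values constrained to have mean 10: pick n - 1 freely,
# and the constraint determines the last one.
target_mean = 10
n = 3
free_choices = [8, 15]                       # any n - 1 values you like
last = target_mean * n - sum(free_choices)   # the final value is forced
values = free_choices + [last]

print(last)              # 7
print(sum(values) / n)   # 10.0
```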

Before jumping to regression, let's think about a simpler setting: a two sample $t$-test. Here, you have two groups (sample sizes $n_1$ and $n_2$) and each has a mean. Though it may not seem the most intuitive approach, you could write this as $$y = a + b \cdot x$$ where $x$ is just a dummy variable (coded 0 for one group and 1 for the other group). Going back to the analogy above, the degrees of freedom are the number of values (dimensions) that we can select and still have a data set that "looks" the same. In this context, looking the same would mean the same sample size in each group, and each group would have the same mean. By a similar argument to that above, we would have $n_1 - 1$ choices for the first group (the last value must then make the mean match the first group's mean) and $n_2 - 1$ choices for the second group. Or, $N - 2$, where $N$ is the combined sample size, $n_1 + n_2$.
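A minimal sketch of this setup, with made-up data: fitting $y = a + b \cdot x$ by least squares when $x$ is a 0/1 dummy just reproduces the two group means, and two fitted parameters leave $N - 2$ residual degrees of freedom.

```python
# Two-sample t-test written as a regression y = a + b*x with a 0/1 dummy x.
# Made-up data for illustration:
group0 = [4.0, 6.0, 5.0]   # n1 = 3 (dummy x = 0)
group1 = [9.0, 11.0]       # n2 = 2 (dummy x = 1)

# With a 0/1 dummy, the least-squares fit is just the group means:
# a-hat is group 0's mean, b-hat is the difference of the means.
a_hat = sum(group0) / len(group0)
b_hat = sum(group1) / len(group1) - a_hat

residuals = ([y - a_hat for y in group0]
             + [y - (a_hat + b_hat) for y in group1])

N = len(group0) + len(group1)
residual_df = N - 2        # two fitted parameters: a and b

print(residual_df)         # 3
```

Note that the residuals within each group sum to zero, which is exactly the "the last value is forced" constraint from the mean example, applied once per group.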

Let's bump it up to 5 groups. Now, we only need 4 dummy variables to indicate the group membership ($X_1 = 1$ if the score is from group 1 and 0 if it is from any other group; $X_2 = 1$ if the score is from group 2 and 0 if it is from any other group; etc.). With this coding, we only need 4 variables to distinguish 5 groups (because the values for all variables would be $X_1 = \ldots = X_4 = 0$ if the score is from group 5). The regression formula would be $$y = a + b_1 \cdot X_1 + b_2 \cdot X_2 + b_3 \cdot X_3 + b_4 \cdot X_4$$ But the same rationale holds. If we want all 5 groups to have the same respective means as the original data groupings, then we would have $N - 5$ degrees of freedom.
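The bookkeeping for the 5-group case, with hypothetical group sizes: the model fits an intercept plus 4 dummy coefficients, i.e. 5 parameters, so the residual degrees of freedom are $N - 5$.

```python
# 5 groups coded by 4 dummies plus an intercept: 5 fitted parameters.
# Group sizes are made up for illustration.
group_sizes = [4, 3, 5, 2, 6]      # hypothetical n_i for the 5 groups
N = sum(group_sizes)               # combined sample size: 20

n_params = 1 + 4                   # intercept a, plus b1..b4
residual_df = N - n_params

print(residual_df)                 # 15
```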

The argument extends to the scenario where the $X_i$ are continuous—instead of dummy/categorical—variables. If you have $k$ predictors in the model, then you would have $N-k-1$ degrees of freedom to build a "new" data set that "looks" like your original one.
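The general rule is a one-liner; the helper name below is just for illustration. Note that the dummy-coded cases above are instances of the same formula: the 5-group model is $k = 4$ predictors, giving $N - 4 - 1 = N - 5$.

```python
def residual_df(N, k):
    """Residual degrees of freedom for a linear model with k predictors
    plus an intercept: N - k - 1."""
    return N - k - 1

print(residual_df(30, 3))   # 26: 30 observations, 3 continuous predictors
print(residual_df(20, 4))   # 15: the 5-group example (4 dummies, intercept)
print(residual_df(5, 1))    # 3:  the two-sample t-test example (one dummy)
```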
