Solved – Degrees of freedom in regression

Tags: degrees-of-freedom, regression

I went through the thread How to understand degrees of freedom? and the great answers in it, but then I read the following in the Wikipedia article on regression:

Statistical assumptions

When the number of measurements, $N$, is larger than the number of unknown parameters, $k$, and the measurement errors $\varepsilon_i$ are normally distributed, then the excess of information contained in $(N - k)$ measurements is used to make statistical predictions about the unknown parameters. This excess of information is referred to as the degrees of freedom of the regression.

Given this definition, if $N$ increases, the degrees of freedom increase as well, but intuitively that would make the problem more constrained (we have more information per parameter). Why is $N-k$ then called the degrees of freedom, and why isn't it the other way around, e.g. $(k-N)$?

Best Answer

You may be confusing the degrees of freedom attributed to different things.

We would not use negative numbers to count, but there are two sides to the ledger.

In common situations, the data degrees of freedom will be $N$, say.

The model degrees of freedom -- the degrees of freedom the model has to fit the data -- is $k$, and the residual degrees of freedom is what's left over: $N-k$. That $k$ may often be partitioned into various components of the model.
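To make the ledger concrete, here is a minimal sketch in Python/NumPy (the particular data, $N = 100$, and $k = 3$ are hypothetical): it fits an ordinary least-squares model with $k$ coefficients to $N$ observations, and the leftover $N - k$ residual degrees of freedom are what the variance estimate is divided by, which is the "excess of information" the Wikipedia quote describes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 100, 3                       # data df = N; model df = k

# Hypothetical design: intercept plus two predictors (k = 3 columns)
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=N)

# Ordinary least-squares fit
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Residual df = what's left over after the model uses its k df
df_resid = N - k                    # 97
sigma2_hat = residuals @ residuals / df_resid  # unbiased error-variance estimate
print(df_resid, sigma2_hat)
```

Note how adding observations (larger $N$) increases the residual degrees of freedom: more leftover information goes into estimating the error variance, which is why a larger $N - k$ means a better-constrained problem, not a worse one.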

Any of them might be called "the" degrees of freedom depending on what, exactly, is being discussed.

Indeed, we use 'degrees of freedom' more broadly still, whence the appearance of noninteger degrees of freedom for some kinds of models, and references to things like "researcher degrees of freedom".
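As an aside on the noninteger case: for a linear smoother such as ridge regression, the effective model degrees of freedom are commonly taken to be the trace of the hat matrix, which is generally not an integer. A minimal sketch (the data and the penalty $\lambda = 10$ are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, lam = 50, 4, 10.0

X = rng.normal(size=(N, p))

# Ridge "hat" matrix H maps y to fitted values: H = X (X'X + lam*I)^{-1} X'
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

edf = np.trace(H)   # effective model degrees of freedom, typically noninteger
print(edf)          # strictly between 0 and p for lam > 0
```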
