R’s t.test() unequal variance degrees of freedom

degrees of freedom, heteroscedasticity, satterthwaite, t-test

Can anyone explain why, for an independent two-sample t-test assuming unequal variances, R's t.test function uses the label "Welch Two Sample t-test," when it appears to use Satterthwaite's 1946 formula for the degrees of freedom (d.f.) calculation?

Specifically, I accessed the body of the t.test function via getAnywhere(t.test.default), and the relevant portions of the code are given below (note the else is in response to if(var.equal), hence it's calculating the d.f. when variances are not assumed equal):

vx <- var(x)
nx <- length(x)
vy <- var(y)
ny <- length(y)

else {
    stderrx <- sqrt(vx/nx)
    stderry <- sqrt(vy/ny)
    stderr <- sqrt(stderrx^2 + stderry^2)
    df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 1))
}

I worked out the algebra, and this indeed corresponds to

$$\frac{\left(s_x^2/n_x + s_y^2/n_y\right)^2}{\left(s_x^2/n_x\right)^2/(n_x-1)+\left(s_y^2/n_y\right)^2/(n_y-1)},$$

which was given by Satterthwaite in 1946: "An Approximate Distribution of Estimates of Variance Components". Biometrics Bulletin, 2, 6, pp. 110–114.
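As a quick numerical check of that algebra, the following minimal sketch computes the d.f. directly from the formula above and compares it with what t.test() reports. The data vectors x and y are made-up values used only for illustration; nothing here is taken from the R sources beyond the formula itself.

```r
# Made-up sample data, for illustration only.
x <- c(19.8, 20.4, 19.6, 17.8, 18.5, 18.9, 18.3, 18.9, 19.5, 22.0)
y <- c(28.2, 26.6, 20.1, 23.3, 25.2, 22.1, 17.7, 27.6, 20.6, 13.7, 23.2, 17.5)

# Estimated variance of each sample mean, s^2 / n
wx <- var(x) / length(x)
wy <- var(y) / length(y)

# Degrees of freedom from the displayed formula
df_formula <- (wx + wy)^2 /
  (wx^2 / (length(x) - 1) + wy^2 / (length(y) - 1))

# Degrees of freedom reported by t.test() with its default var.equal = FALSE
df_ttest <- unname(t.test(x, y)$parameter)

# TRUE if the two agree up to floating-point tolerance
all.equal(df_formula, df_ttest)
```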

Further confusing the issue, the t.test documentation contains the following statements:

  • var.equal: a logical variable indicating whether to treat the two variances as being equal. If TRUE then the pooled variance is used to estimate the variance otherwise the Welch (or Satterthwaite) approximation to the degrees of freedom is used.
  • If var.equal is TRUE then the pooled estimate of the variance is used. By default, if var.equal is FALSE then the variance is estimated separately for both groups and the Welch modification to the degrees of freedom is used.
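To see what those two documentation statements mean in practice, here is a small sketch that runs t.test() both ways and inspects the method label and the d.f. it returns. The data are invented for illustration.

```r
# Made-up data, for illustration only.
x <- c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5)
y <- c(7.9, 3.2, 9.4, 2.8, 8.8, 4.1, 6.7, 10.3)

pooled <- t.test(x, y, var.equal = TRUE)  # pooled variance estimate
welch  <- t.test(x, y)                    # default: var.equal = FALSE

# The method label mentions Welch only in the unequal-variance case
pooled$method
welch$method

# The pooled test uses nx + ny - 2 d.f.; the default uses the
# approximate (generally non-integer) d.f. discussed in this question
pooled$parameter
welch$parameter
```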

In summary:

  • R labels its t-test output as if it comes from Welch's theory
  • R's documentation mentions both Welch and Satterthwaite but seems to "lean" toward Welch, yet gives no reference for what it means by "the Welch modification"
  • From my calculation, R actually uses Satterthwaite's suggested d.f. calculation

So, which is it – Welch or Satterthwaite? Is Satterthwaite's result being falsely attributed to Welch? Is there something else I'm missing…?

Best Answer

If we had to award priority to only one person for that formula for degrees of freedom, Welch seems to deserve it. Specifically, Welch (1938) [1], equation (9)

and we can write $v=c t_f$, where $$f=\frac{\left(\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\right)^2}{\frac{\sigma_1^4}{n_1^2(n_1-1)}+\frac{\sigma_2^4}{n_2^2(n_2-1)}};\quad c=1. \tag{9}$$

is the population version of your formula for d.f. (i.e. it is the same formula, with $\sigma_i$ in place of $s_i$). Replacing the $\sigma_i^2$ terms with their sample estimates is a relatively small step from there.
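To make the correspondence explicit, each term in the denominator of equation (9) can be rearranged using nothing beyond the two formulas already shown:

$$\frac{\sigma_i^4}{n_i^2(n_i-1)}=\frac{\left(\sigma_i^2/n_i\right)^2}{n_i-1},\quad i=1,2,\qquad\text{so}\qquad f=\frac{\left(\sigma_1^2/n_1+\sigma_2^2/n_2\right)^2}{\left(\sigma_1^2/n_1\right)^2/(n_1-1)+\left(\sigma_2^2/n_2\right)^2/(n_2-1)}.$$

Substituting $s_x^2, s_y^2$ for $\sigma_1^2, \sigma_2^2$ (and $n_x, n_y$ for $n_1, n_2$) gives the expression in the question.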

Welch explicitly gives the test statistic for what R calls the "Welch t-test" immediately above equation (13) ($v=...$), so it seems reasonable that the test be named for him; the issue of computing the degrees of freedom remains. Around the same point he discusses choices for the degrees of freedom when they have to be approximated from sample quantities, and the 1938 paper treats the choice of parameters in a somewhat more general framework.
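For concreteness, here is a minimal sketch of the unequal-variance statistic as R computes it, checked against t.test(). It follows the stderr calculation quoted in the question rather than Welch's own notation. The data are made up, and the p-value line assumes the default two-sided alternative with a null difference of zero.

```r
# Made-up data, for illustration only.
x <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 12.8)
y <- c(10.2, 14.9, 9.1, 15.3, 11.0, 13.8, 8.7, 12.5, 16.0)

nx <- length(x)
ny <- length(y)

# Standard errors, mirroring the variable names in t.test.default
stderrx <- sqrt(var(x) / nx)
stderry <- sqrt(var(y) / ny)
stderr  <- sqrt(stderrx^2 + stderry^2)

# Test statistic, approximate d.f., and two-sided p-value
tstat <- (mean(x) - mean(y)) / stderr
df    <- stderr^4 / (stderrx^4 / (nx - 1) + stderry^4 / (ny - 1))
pval  <- 2 * pt(-abs(tstat), df)

# Compare with the built-in test
fit <- t.test(x, y)
all.equal(tstat, unname(fit$statistic))
all.equal(df, unname(fit$parameter))
all.equal(pval, fit$p.value)
```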

The only thing really missing from the 1938 paper (that I can see) is direct advocacy of that single choice of d.f. (what is now called Welch-Satterthwaite) from among the possibilities discussed for a good choice of $\nu$; that choice is explicit in both Satterthwaite 1946 and Welch 1947.

Neither Welch nor Satterthwaite appears to have been aware of the other's work: none of the eight references in Welch's 1947 paper is Satterthwaite 1946, nor do they coincide with any of Satterthwaite's three references, while Satterthwaite seems unaware of any of Welch's prior work, most crucially his 1938 paper. It would be hard to regard Satterthwaite's paper as much more than a minor step, in effect advocating a particular choice encompassed by the discussion in Welch's earlier paper, though the development looks to have been separate; joint credit is often given in such situations.

In the light of Welch's (1938) pretty thorough treatment of the topic, I see no problem with referring to the formula you give as "Welch-Satterthwaite", nor with leaning toward giving Welch slightly more credit. Welch 1947 and Satterthwaite 1946 each give essentially the same formula, but Welch was walking all over the territory almost a decade earlier; why the referees / editors of Biometrics Bulletin didn't draw it to Satterthwaite's attention seems a mystery. Even leaving Welch's contribution aside, the lack of references to Fisher and Behrens in Satterthwaite's paper seems odd.

[1]: Welch, B.L. (1938). "The significance of the difference between two means when the population variances are unequal". Biometrika, 29, 3/4, pp. 350–62.