Two-Sample t-Test for Equal Means with unequal variances for large samples

heteroscedasticity, hypothesis testing, r, t-test

How can I perform a two-sample test of means with unequal variances for a very large sample in R?

For large samples, the test statistic asymptotically follows a normal distribution.

Which R function will help me to do this?

Best Answer

While you can compute the z-statistic yourself, an ordinary Welch t-test will do the job just fine; in R that's t.test with all its default options (var.equal = FALSE is the default).

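The form of the test statistic is the same in both cases. The only difference is the reference distribution used to get the p-value (Student's t with the Welch-Satterthwaite degrees of freedom versus the standard normal), and if the size of the smaller group is large enough, the two tests will give almost identical p-values.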
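To make that concrete, here is a minimal sketch that computes the shared statistic by hand and looks it up in both reference distributions. The simulated data, the seed, and the names vx, vy, stat, df, p_t and p_z are assumptions chosen for illustration; only t.test, pt and pnorm are standard R functions.

```r
# Simulated data (an assumption; any two numeric samples would do)
set.seed(1)
x <- rnorm(1e5, mean = 1, sd = 1)
y <- rnorm(1e5, mean = 1, sd = 2)

# Estimated variance of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# The statistic is the same for the Welch test and the z-test
stat <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

p_t <- 2 * pt(-abs(stat), df)   # t reference distribution (Welch)
p_z <- 2 * pnorm(-abs(stat))    # standard normal reference distribution

# Compare the hand computation with t.test's own p-value
welch <- t.test(x, y)
c(p_t = p_t, p_z = p_z, t.test = welch$p.value)
```

With groups this large the degrees of freedom are in the tens of thousands or more, so the two hand-computed p-values should be practically indistinguishable, and p_t should match what t.test reports.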

The Welch test will handle very large sample sizes.

e.g. in R:

> x=rnorm(1e7,1.00001,1)
> y=rnorm(1e7,1.00002,2)
> t.test(x,y)

    Welch Two Sample t-test

data:  x and y
t = 0.9052, df = 14708415, p-value = 0.3654
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0007458214  0.0020259201
sample estimates:
mean of x mean of y 
 0.999757  0.999117 

I don't see a problem.

> # compare:
> 2*pnorm((-abs(mean(y)-mean(x))/sqrt(var(y)/length(y)+var(x)/length(x))))
[1] 0.3653657

The two p-values agree to all the decimal places shown in the t.test output (0.3654 versus 0.3653657 from the normal calculation).

If that's not what you want, you need to more carefully explain what you do want.


Example with very different $n$:

> x=rnorm(1e7,1.00001,1)
> y=rnorm(1e2,1.002,2)
> t.test(x,y)

    Welch Two Sample t-test

data:  x and y
t = 0.7382, df = 99.001, p-value = 0.4622
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2409398  0.5264124
sample estimates:
mean of x mean of y 
0.9998066 0.8570703 

> 2*pnorm((-abs(mean(y)-mean(x))/sqrt(var(y)/length(y)+var(x)/length(x))))
[1] 0.4604087

Once we're down to 99 d.f. for the Welch test, we start to notice a small difference in p-value from the asymptotic result, but at 99 d.f. we're not really in the 'consider it as converged to normal' region anyway.
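As a rough illustration of how quickly that gap closes, the sketch below holds the statistic fixed at the 0.7382 reported above and varies only the degrees of freedom. The particular d.f. values and the names stat, dfs, p_t and p_z are assumptions picked for illustration.

```r
# Statistic taken from the unequal-n example above (rounded as printed)
stat <- 0.7382

# A few illustrative degrees of freedom, small to large (assumed values)
dfs <- c(10, 30, 99, 1000, 1e5)

p_t <- 2 * pt(-abs(stat), dfs)  # two-sided p-value under t with each d.f.
p_z <- 2 * pnorm(-abs(stat))    # two-sided p-value under the standard normal

# Gap between the t-based and normal-based p-values at each d.f.
data.frame(df = dfs, p_t = p_t, diff_from_normal = p_t - p_z)
```

The t-based p-value is always a little larger than the normal one because the t distribution has heavier tails, and the gap shrinks steadily as the d.f. grow.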