Solved – Two-Sample t-Test for Equal Means with unequal variances for large samples

heteroscedasticity, hypothesis testing, r, t-test

How can I perform a two-sample test of means with unequal variances for a very large sample in R?

With large samples, the test statistic asymptotically follows a normal distribution.

Which R function will help me to do this?

Best Answer

While you can compute the z-statistic yourself, an ordinary Welch t-test will do just fine. In R that's t.test with all its default options (var.equal = FALSE is the default).

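The form of the test statistic is the same in both cases. The only difference is in which reference distribution is used (a t distribution with the Welch degrees of freedom versus the standard normal), and if the size of the smaller group is large enough, the tests will give almost identical p-values.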
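For reference, here is a sketch of the two quantities involved. The notation is an assumption introduced here, not taken from the question: $\bar{x}, \bar{y}$ are the sample means, $s_x^2, s_y^2$ the sample variances, and $n_x, n_y$ the sample sizes.

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{s_x^2/n_x + s_y^2/n_y}}, \qquad \nu = \frac{\left(s_x^2/n_x + s_y^2/n_y\right)^2}{\dfrac{(s_x^2/n_x)^2}{n_x - 1} + \dfrac{(s_y^2/n_y)^2}{n_y - 1}}$$

The Welch test refers $t$ to a t distribution with $\nu$ degrees of freedom, while the z-test refers the same $t$ to a standard normal. Because $\nu$ is never smaller than $\min(n_x, n_y) - 1$, a large smaller group guarantees a large $\nu$, and the two reference distributions become practically indistinguishable.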

The Welch test will handle very large sample sizes.

e.g. in R:

> x=rnorm(1e7,1.00001,1)
> y=rnorm(1e7,1.00002,2)
> t.test(x,y)

    Welch Two Sample t-test

data:  x and y
t = 0.9052, df = 14708415, p-value = 0.3654
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0007458214  0.0020259201
sample estimates:
mean of x mean of y 
 0.999757  0.999117

I don't see a problem.

> # compare:
> 2*pnorm((-abs(mean(y)-mean(x))/sqrt(var(y)/length(y)+var(x)/length(x))))
[1] 0.3653657

The two p-values agree to the four decimal places that t.test prints (0.3654).

If that's not what you want, you need to more carefully explain what you do want.

Example with very different $n$:

> x=rnorm(1e7,1.00001,1)
> y=rnorm(1e2,1.002,2)
> t.test(x,y)

    Welch Two Sample t-test

data:  x and y
t = 0.7382, df = 99.001, p-value = 0.4622
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2409398  0.5264124
sample estimates:
mean of x mean of y 
0.9998066 0.8570703 

> 2*pnorm((-abs(mean(y)-mean(x))/sqrt(var(y)/length(y)+var(x)/length(x))))
[1] 0.4604087

Once we're at 99 d.f. for the Welch test, we start to notice a small difference in p-value from the asymptotic result (0.4622 versus 0.4604), but at 99 d.f. we're not really in the 'consider it as converged to normal' region.
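To make that comparison repeatable, it could be wrapped in a small helper. This is only a sketch: the function name welch_vs_z and the simulated data are made up for illustration, and the degrees of freedom are computed with the usual Welch-Satterthwaite formula rather than read from t.test.

```r
# Hypothetical helper: same statistic, two reference distributions.
welch_vs_z <- function(x, y) {
  vx <- var(x) / length(x)            # squared standard error of mean(x)
  vy <- var(y) / length(y)            # squared standard error of mean(y)
  stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximate degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  c(statistic = stat,
    df        = df,
    p_t       = 2 * pt(-abs(stat), df),   # Welch t-test p-value
    p_z       = 2 * pnorm(-abs(stat)))    # large-sample normal p-value
}

# Illustrative data only: one large group, one small group
set.seed(1)
x <- rnorm(1e5, 1, 1)
y <- rnorm(100, 1, 2)
res <- welch_vs_z(x, y)
print(res)
```

The normal p-value can never exceed the t p-value for the same statistic, since the t distribution has heavier tails. The gap between p_t and p_z shrinks as df grows.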

Related Solutions

Correlation – What is the Bayesian Counterpart to a Two-Sample t-Test with Unequal Variances?

While you can do this in a Bayesian way, have you considered whether it would actually be better to estimate the difference in the means rather than test whether they are different? This is what Andrew Gelman frequently recommends. I can imagine some possible reasons for wanting to do hypothesis testing, but I don't think they're that common.

I don't think you need something like a t-test: since you said the groups have very similar standard deviations, you can estimate the standard deviation well.

If that's the case then I think this link should be what you need. It shows how to estimate a difference in means or do a hypothesis test (though I don't recommend this). You could also take a look at the part they reference in Bolstad's book (you can find electronic copies online). It's possible to incorporate estimating the variances as well, but it's more complex, so I suspect you're better off incorporating the prior information you have about the variances in a naive way (for example, using the unbiased standard deviation estimator on each of the sets, then averaging them and pretending the result is your 'known' standard deviation).
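As a rough illustration of that naive approach, here is a minimal R sketch. It assumes flat (improper) priors on both means and treats the averaged sample standard deviation as known, in which case the posterior for the difference in means is normal. The function name posterior_diff and the simulated data are hypothetical, and sd() stands in for the "unbiased" estimator mentioned above.

```r
# Hypothetical helper: posterior for mu_x - mu_y under flat priors on the
# means and a standard deviation treated as known (naive plug-in).
posterior_diff <- function(x, y, level = 0.95) {
  sigma <- mean(c(sd(x), sd(y)))      # averaged sample SDs, "pretend known"
  post_mean <- mean(x) - mean(y)
  post_sd <- sigma * sqrt(1 / length(x) + 1 / length(y))
  alpha <- 1 - level
  c(mean  = post_mean,
    sd    = post_sd,
    lower = qnorm(alpha / 2, post_mean, post_sd),      # credible interval
    upper = qnorm(1 - alpha / 2, post_mean, post_sd))
}

# Illustrative data only
set.seed(7)
a <- rnorm(50, 10, 2)
b <- rnorm(60, 9, 2)
post <- posterior_diff(a, b)
print(post)
```

The interval summarises the estimated difference directly, which is the estimation-first view recommended above. An informative prior would change the posterior mean and standard deviation, but not the overall structure.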

Solved – Can a two-sample t-test be used with data that doesn’t follow a normal distribution

One simple way to convince yourself that the CLT applies or does not apply is with some simulations.

Here is some R code:

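testfun <- function(n1=19, n2=15) {
    x <- rexp(n1, 1/3)     # exponential with mean 3
    y <- rt(n2, 5) + 3     # t with 5 df, shifted to mean 3
    t.test(x,y)$p.value    # Welch test p-value (the null is true here)
}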

out <- replicate(10000, testfun(n1=19, n2=15))
hist(out)
abline(v=0.05, col='red')
mean( out <= 0.05 )

This code defines a function (testfun) that generates data from two different distributions (exponential, and t with 5 df) that have the same mean (3 in this case), runs the built-in t.test function, and returns the p-value.

The replicate call then runs this 10,000 times and we look at the results. The histogram should be close to uniform, but in this case we see an excess of values close to 0. The mean function calculates the type I error rate (since the null is true in the simulations); for my run this was a little over 7% when it should be 5%. Is that far enough off to cause you concern? Or are you happy with that as a "close enough" approximation?

Of course, you should probably run this with data generated from distributions that are more reasonable for your study. For something less skewed than the exponential, the differences may be small enough not to worry about.
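One way to try other distributions is to pass the generators in as arguments. This is a sketch only: sim_rate and its arguments are hypothetical names, and the gamma distribution is just one example of a less skewed alternative with the same mean of 3.

```r
# Hypothetical generalisation of testfun: gen_x and gen_y are functions
# that take a sample size and return a sample with the same true mean.
sim_rate <- function(gen_x, gen_y, n1 = 19, n2 = 15,
                     reps = 2000, alpha = 0.05) {
  p <- replicate(reps, t.test(gen_x(n1), gen_y(n2))$p.value)
  mean(p <= alpha)    # estimated type I error rate (the null is true)
}

shifted_t <- function(n) rt(n, 5) + 3    # t with 5 df, mean 3

set.seed(42)
# Exponential with mean 3 (skewness 2), as in the code above
rate_exp <- sim_rate(function(n) rexp(n, 1/3), shifted_t)
# Gamma with shape 9 and rate 3: also mean 3, but skewness 2/3
rate_gamma <- sim_rate(function(n) rgamma(n, shape = 9, rate = 3), shifted_t)

print(c(exponential = rate_exp, gamma = rate_gamma))
```

Comparing the two estimated rates against the nominal 5% gives a feel for how much the skewness matters at these sample sizes. Increasing reps reduces the simulation noise in each estimate.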
