Solved – Paired two one-sided t-tests (TOST) with unequal sample sizes

Tags: equivalence, paired-comparisons, sample-size, t-test, tost

I have data from two populations (let's call them "before" and "after"), collected in a paired fashion. That is, I take a measurement under one set of conditions before, and then I can take multiple measurements under the same set of conditions after. For example, the "pairs" in this case might consist of $(X^{b,i}, (X^{a,i}_{1}, X^{a,i}_{2}))$, where $X^{b,i}$ is the measurement from the "before" population taken under conditions i, and $X^{a,i}_1$ and $X^{a,i}_2$ are two measurements from the "after" population taken under conditions i. I have this data for $i=1, \ldots, n$ sets of conditions.

I want to test for equivalence of the means of the two populations. A paired two one-sided t-test (TOST) procedure seems appropriate, and I've been using the implementation available in the statsmodels Python package.

How should I handle the unequal sample sizes when performing a paired TOST?

My current approach is to turn each $(X^{b,i}, (X^{a,i}_{1}, X^{a,i}_{2}))$ triple into two pairs: $(X^{b,i}, X^{a,i}_{1})$ and $(X^{b,i}, X^{a,i}_{2})$. Then I run a paired TOST on this new, larger set of $2n$ pairs. Is this appropriate? I worry that the new pairs are not independent (because they share $X^{b,i}$) and that this will invalidate the approach.
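For concreteness, the expansion described above can be sketched with `statsmodels.stats.weightstats.ttost_paired`. The data, the sample size, and the equivalence margin below are all made up for illustration:

```python
import numpy as np
from statsmodels.stats.weightstats import ttost_paired

rng = np.random.default_rng(0)
n = 30  # number of condition sets (illustrative)

# Simulated data: one "before" and two "after" measurements per condition set.
before = rng.normal(10.0, 1.0, size=n)
after = before[:, None] + rng.normal(0.0, 0.5, size=(n, 2))

# Expand each triple into two pairs, duplicating the "before" value.
x_before = np.repeat(before, 2)  # length 2n
x_after = after.ravel()          # length 2n

# Paired TOST with illustrative equivalence bounds of +/- 0.5.
pval, t_lower, t_upper = ttost_paired(x_after, x_before, low=-0.5, upp=0.5)
```

Note that this is exactly the procedure being questioned: the $2n$ pairs are treated as independent even though they are not.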

Best Answer

I don't know any specific references for this case.

In analogy with some of the methods for repeated measures ANOVA, the relevant t-test would use the mean of the two 'after' observations and compare it with the 'before' observation. The variance of the average within-pair difference shrinks as the number of observations per individual grows, so the test still takes the larger sample into account.
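A minimal sketch of this averaging approach, again with simulated data and an illustrative equivalence margin:

```python
import numpy as np
from statsmodels.stats.weightstats import ttost_paired

rng = np.random.default_rng(1)
n = 30  # number of condition sets (illustrative)
before = rng.normal(10.0, 1.0, size=n)
after = before[:, None] + rng.normal(0.0, 0.5, size=(n, 2))

# Average the 'after' replicates within each condition set, then run a
# standard paired TOST on the n (after-mean, before) pairs.
after_mean = after.mean(axis=1)
pval, t_lower, t_upper = ttost_paired(after_mean, before, low=-0.5, upp=0.5)
```

This reduces the problem to an ordinary paired TOST with $n$ independent pairs, so no correction for dependence is needed.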

An alternative approach would be to use cluster-robust standard errors for the two pair differences, as in your approach: fit OLS on the $2n$ pair differences and specify the individual (the condition set $i$) as the grouping variable. OLS in statsmodels only provides a t-test after estimation, but the TOST rejection decision can still be obtained by checking whether the $1 - 2\alpha$ confidence interval lies inside the equivalence boundaries.

About cluster-robust standard errors:

OLS provides a consistent estimator of the parameters of the linear model even if there is correlation across observations or heteroscedasticity. However, the usual estimate of the standard errors, or of the covariance of the parameter estimates, is then incorrect. One solution is to keep the OLS parameter estimates but correct the standard errors using a sandwich form of robust standard errors.

For example, here is the Wikipedia page on heteroscedasticity-robust standard errors: http://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors

For the specific case where we have correlation within small groups or clusters but no correlation across groups, we can use cluster-robust standard errors to correct for the within-cluster correlation. An extensive discussion is available in

Cameron, A. Colin, and Douglas L. Miller. "A Practitioner's Guide to Cluster-Robust Inference." Journal of Human Resources 50.2 (2015): 317–372.

(Aside: statsmodels provides robust covariance matrices for the linear models OLS and WLS, for discrete models such as Logit and Poisson, and for GLM. Cluster-robust standard errors are the default for GEE. The list of available types is here: http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.RegressionResults.get_robustcov_results.html )
