I don't know any specific references for this case.
In analogy to some of the methods for repeated measures ANOVA, the relevant t-test would use the mean of the two 'after' observations and compare it with the 'before' observation.
The variance of the average within difference will be smaller with more observations per individual, so the test still takes the larger sample into account.
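A minimal sketch of this averaging approach in Python, using simulated data (all names and numbers here are illustrative, not from the question):

```python
import numpy as np
from scipy import stats

# Illustrative data: one 'before' and two 'after' measurements per subject
rng = np.random.default_rng(0)
n = 15
subject_level = rng.normal(100, 10, n)          # between-subject variability
before = subject_level + rng.normal(0, 3, n)
after1 = subject_level + 2 + rng.normal(0, 3, n)
after2 = subject_level + 2 + rng.normal(0, 3, n)

# Average the two 'after' observations within each subject,
# then run an ordinary paired t-test against 'before'.
after_mean = (after1 + after2) / 2
t_stat, p_value = stats.ttest_rel(after_mean, before)
print(t_stat, p_value)
```

Averaging within subject first means each subject contributes exactly one difference, so the paired t-test assumptions are unchanged; the extra 'after' observation simply reduces the variance of that difference.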
An alternative approach would be to use cluster robust standard errors for the two pair differences as in your approach.
This would be possible with OLS with the two pairs sample and specify the individual as grouping variable.
OLS in statsmodels only has t-test after estimation, but the TOST rejection decision could still be obtained by comparing the 2 alpha confidence interval with the equivalence boundaries.
About cluster robust standard errors:
OLS provides a consistent estimator of the parameters for the linear model even if there is correlation across observations or heteroscedasticity. However, the usual estimate for the standard errors or covariance of the parameter estimates is incorrect. One possible solution to correlation and heteroscedasticity is to use the OLS parameter estimates, but correct the standard errors by using a sandwich form of robust standard errors.
For example, here is the Wikipedia page for heteroscedasticity robust standard errors http://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors
For the specific case when we have correlation within small groups or clusters but no correlation across groups, we can use cluster robust standard errors to correct for the within cluster correlation. An extensive discussion is available in
Cameron, A. Colin, and Douglas L. Miller. "A practitioner’s guide to cluster-robust inference."
(aside: statsmodels provides robust covariance matrices for the linear model, OLS, WLS, for discrete models like Logit and Poisson, for GLM. Cluster robust standard errors are the default for GEE. The list of provided types is here http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.RegressionResults.get_robustcov_results.html )
A two-sample $t$-test seems more reasonable here, mainly for the following reason.
If you were to run this experiment again and match group(X) and control(Y), it is unlikely that exactly the same participants would be matched. Indeed, if two people in group(X) have exactly the same demographics, both could be matched with the same control. Matching is therefore not helpful here.
If you want to study the effect of the treatment as a function of age, sex, etc, you should consider regression with these demographics as covariates and observed blood pressure as response.
When doing a two-sample $t$-test, you are assuming the two samples are not dependent on each other. This assumption is not violated just because the participants had similar demographics. If you obtained the $Y$ randomly and the $X$ randomly, and assigned treatment randomly, then there should be no problem with the independence assumption.
Best Answer
In a two-sample test, you have two independent samples. Of course, by independence, we expect the two sets of measurements will not be correlated. Sample sizes for the two samples need not be equal. (But it often makes sense for them to be approximately equal.)
In a paired test you have one sample of pairs. Typically, there will be $n$ paired observations $(x_{1i}, x_{2i}).$ These may be two measurements on each individual subject. (For example, Before and After 'treatment' scores on a questionnaire, exam, or lab test.) Alternatively, the pairs may be pairs of subjects. (For example, married couples, twins, or subjects matched according to some criterion. They might also be two devices manufactured at the same time and place.) It is expected that the two measurements on pairs will be correlated. In analysis, one may look at a sample of differences $d_i = x_{1i}-x_{2i}.$
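The reduction to a one-sample problem on the differences can be sketched as follows (the numbers are toy data for illustration):

```python
import numpy as np
from scipy import stats

# Paired measurements on n subjects (toy data)
x1 = np.array([105.0, 98.0, 110.0, 102.0, 97.0, 108.0])
x2 = np.array([107.5, 99.0, 113.0, 104.0, 99.5, 110.0])

# The paired t-test is exactly a one-sample t-test on the
# differences d_i = x1_i - x2_i against mean zero.
d = x1 - x2
t_paired, p_paired = stats.ttest_rel(x1, x2)
t_onesample, p_onesample = stats.ttest_1samp(d, 0.0)
print(t_paired, t_onesample)   # identical by construction
```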
Examples:
Paired. Suppose you have a study in which subjects have diverse skills at a particular task. It is claimed that a training course will increase skills by a small but important amount. You test 20 subjects before (test score $x_1$) and after ($x_2$) the training. The average score increases from 105.91 before training to 109.16 after. Before and After scores are highly correlated.
A scatterplot shows the strong linear correlation. Also, most of the $(X_{1i},X_{2i})$ points lie above the 45-degree line, indicating modest, but mostly positive, results from the training course.
A paired t test in R tests $H_0: \mu_B = \mu_A$ against the one-sided alternative $H_a: \mu_B < \mu_A.$ The P-value (near $0)$ shows that the observed improvement is highly significant.
The pairing has allowed us to detect a small improvement 'above the noise' of great subject variability.
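The one-sided paired test described above (done in R in the original) can be sketched in Python; the scores are simulated here with a small positive training effect, so the exact numbers are illustrative rather than those of the original study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 20
skill = rng.normal(105, 15, n)        # diverse subject skills
x1 = skill + rng.normal(0, 2, n)      # Before scores
x2 = skill + 3 + rng.normal(0, 2, n)  # After scores: small improvement

# H0: mu_B = mu_A  vs  Ha: mu_B < mu_A  (one-sided paired t-test)
t_stat, p_value = stats.ttest_rel(x1, x2, alternative="less")
print(t_stat, p_value)
```

Because the large between-subject variability in `skill` cancels in the within-subject differences, even this small effect is easy to detect.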
Two-samples. By contrast, suppose we gave the training to a group of 20 randomly chosen subjects, obtaining test scores $t$. For comparison we take a group of 20 subjects from the same population who did not take the training course. Also, suppose the true mean score of subjects with the training is 4 points above the true mean score of subjects who did not take the training. Then we would have a one-sided, two-sample t test, and only about a 20% chance of detecting the difference, on account of the diversity of skills in the population.
Note: In the simulation above 10,000 two-sample tests are performed on samples of size $n_1=n_2=20$ from populations $\mathsf{Norm}(100, 15)$ and $\mathsf{Norm}(104, 15),$ respectively. The data summary and the Welch 2-sample t test are shown for the first one of these 10,000 simulated experiments.
Notice that, in this particular case, the sample mean score for the subjects in the training group happens to be smaller than the mean for the control group. With such variable populations this is not a rare occurrence. Generally speaking, much larger sample sizes (about 200 in each sample) would be required for the two-sample t test to reliably detect a 'training' effect of 4 units.
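A simulation along these lines can be sketched as follows, with the numbers from the Note above ($n_1=n_2=20$, populations $\mathsf{Norm}(100,15)$ and $\mathsf{Norm}(104,15)$, one-sided Welch test at the 5% level):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n = 10_000, 20
rejections = 0
for _ in range(n_sims):
    control = rng.normal(100, 15, n)   # no training
    trained = rng.normal(104, 15, n)   # true mean 4 points higher
    # One-sided Welch two-sample t-test, Ha: mean(control) < mean(trained)
    _, p = stats.ttest_ind(control, trained, equal_var=False,
                           alternative="less")
    rejections += p < 0.05
power = rejections / n_sims
print(power)   # roughly 0.2
```

The estimated power of roughly 20% matches the claim in the text: with standard deviations of 15 in each group, a 4-point true difference is usually lost in the noise at these sample sizes.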