The problem is that, as you say, this is a very poorly designed experiment: there is no control group of sick people who received no medication, no group of sick people who got Type 1 but not Type 2, and no group who got Type 2 but not Type 1. I don't think any amount of statistics will let you reliably test your second and third hypotheses. For example, if you find that protein levels changed after the Type 2 treatment, you have no way of deciding whether the change is a delayed effect of Type 1 or simply a natural effect of the passage of time. So I won't offer suggestions for testing those hypotheses, as any result would be misleading.
Your first hypothesis can be tested if and only if you are confident that people do not get better without treatment. You cannot conclude this from your experiment, so you would need to know it from other evidence, e.g. clinical experience with this illness showing that people do not recover naturally. I have no idea whether this is realistic.
Assuming the condition in the above paragraph is correct, I would measure the difference in the sick people's protein levels at the end of the experiment (after they got both treatments) from their protein levels at the beginning (when they turned up sick but before getting any treatment).
First, look for evidence that the protein levels increased over this period. This is a one-sided, paired t test of the differences (hopefully improvements) measured above against zero.
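A minimal sketch of that test in R, using made-up protein levels (the variable names and values are illustrative, not from your data):

```r
# Hypothetical data: protein levels for the sick group before any
# treatment and after both treatments (same subjects, same order).
before <- c(3.1, 2.8, 3.4, 2.9, 3.0, 3.2)
after  <- c(4.0, 3.5, 4.2, 3.6, 3.9, 4.1)

# One-sided paired t test: H1 is that the mean difference (after - before)
# is greater than zero, i.e. protein levels increased.
t.test(after, before, paired = TRUE, alternative = "greater")
```

With paired = TRUE, t.test computes the differences itself, so this is equivalent to t.test(after - before, alternative = "greater", mu = 0).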
The second part of your hypothesis was that the improvement brings the sick people up to the level of the well people. Assume there is no controversy about the fact that the illness reduces protein levels in the first place (as this wasn't one of the hypotheses you wanted to check). In that case, compare the average protein level in the sick group at the end of the experiment with the average protein level in the well group. Again this is a one-sided t test (assuming protein levels are normally distributed), but this time comparing the two group means (rather than, as in the previous paragraph, comparing the mean improvement to zero).
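That comparison might look like this in R; again the data are invented for illustration:

```r
# Hypothetical data: post-treatment levels of the sick group, and levels
# of the well group (values are illustrative).
sick_post <- c(4.0, 3.5, 4.2, 3.6, 3.9, 4.1)
well      <- c(4.1, 3.8, 4.3, 3.9, 4.0, 4.2)

# One-sided two-sample t test: H1 is that the treated sick group's mean
# is still below the well group's mean. A non-significant result is
# consistent with (but does not prove) full recovery.
t.test(sick_post, well, alternative = "less", var.equal = TRUE)
```

Note the usual caveat that failing to reject "sick mean < well mean" is not positive evidence of equality; an equivalence test (e.g. TOST) would be the stricter way to claim the groups match.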
I don't think the set of measurements after treatment 1 but before treatment 2 can tell us anything.
I think you will find it easier to analyse this in R than in Matlab - R has many more statistical functions built in and ready to go. However, if my answer above is right, you only need t-tests, which are pretty straightforward. I would also advocate some graphical data analysis - if only to check plausibility, outliers, and distributions - which will certainly be easier in R.
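A quick sketch of the kind of graphical checks I mean, applied to hypothetical difference scores `d`:

```r
# Hypothetical difference scores (post - pre) for the sick group.
d <- c(0.9, 0.7, 0.8, 0.7, 0.9, 0.9)

hist(d, main = "Differences (post - pre)")  # rough look at the distribution
boxplot(d)                                  # flags outliers
qqnorm(d); qqline(d)                        # normality check relevant to the t test
```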
ANCOVA. I believe the econometric term is difference-in-differences (DID). You may also want to see the Wikipedia pages for ANCOVA and DID, as there may be important differences in assumptions between the two framings. The model can be estimated as a general linear model with ordinary least squares, though whether this is optimal will depend on the specific nature of your data. Here's some code for an OLS GLM anyway:
v3 <- with(data.frame(v2),
           data.frame(pre = c(pre1, pre2), post = c(post1, post2),
                      condition = rep(c(0, 1), c(4, 4))))  # reorganize the data into pre-post form with a dummy variable
summary(lm(post ~ scale(pre, scale = F) * condition, v3))  # centering pre scales out nonessential multicollinearity
Results:
Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      909.6146    40.5553  22.429 2.34e-05 ***
scale(pre, scale = F)              1.0405     0.1096   9.491 0.000688 ***
condition                       -155.9272    57.3534  -2.719 0.053058 .  
scale(pre, scale = F):condition   -0.1447     0.1533  -0.943 0.398892    
Residual standard error: 81.09 on 4 degrees of freedom
Multiple R-squared: 0.9764, Adjusted R-squared: 0.9588
F-statistic: 55.24 on 3 and 4 DF, p-value: 0.001033
A scatterplot using the ggplot2 package (figure omitted) and its code:

ggplot(v3, aes(x = pre, y = post, colour = factor(condition))) + geom_point() +
  stat_smooth(method = 'lm', formula = y ~ scale(x, scale = F))
Looks like the residuals are bigger for your second group. If you like, test the null hypothesis that they aren't, using leveneTest() from the car package:

leveneTest(summary(lm(post ~ pre, v3))$resid ~ factor(condition), v3)

This gives $F_{(1,6)}=6.2$, $p=.05$.
This heteroscedasticity violates an ANCOVA assumption, but that may not matter greatly (Olejnik & Algina, 1984).
If you want, it's easy to repeat the above after transforming your post scores to ranks using rank(). The transformation reduces heteroscedasticity $(F_{(1,6)}=2.5, p=.17)$, though the residuals are distributed a little less normally. The group difference comes out a little clearer, but the within-subjects differences get obscured slightly:
Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.5488993  0.4110978  13.498 0.000174 ***
scale(pre, scale = F)            0.0054333  0.0011114   4.889 0.008109 ** 
condition                       -2.0952137  0.5813763  -3.604 0.022680 *  
scale(pre, scale = F):condition -0.0002872  0.0015544  -0.185 0.862395    
And model fit worsens a little bit:
Residual standard error: 0.822 on 4 degrees of freedom
Multiple R-squared: 0.9357, Adjusted R-squared: 0.8874
F-statistic: 19.39 on 3 and 4 DF, p-value: 0.007594
And here's that scatterplot (figure omitted). You can see why this emphasizes the group effect relative to the within-subjects effect: ranking mostly wipes out the interaction and makes the confidence bands more even. Whether this is actually an improvement may depend on your purposes and, again, on the specific nature of your data. As for why you shouldn't use an independent-samples $t$ test on change scores, see "Best practice when analysing pre-post treatment-control designs". There's quite a lot of literature on the topic, and even some room for debate, but not within this answer.
Conclusion:
Your two groups appear to have been sampled from different populations. The second group scores lower in general, and lower pre-scores relate to lower post-scores. I see that changes are consistently negative in your second group, and changes in your first group are consistently $\ge0$, but you'd probably want to collect more observations of this difference in the relationship of pre-scores to post-scores across conditions before concluding that the difference in change generalizes to your samples' populations.
Reference
Olejnik, S. F., & Algina, J. (1984). Parametric ANCOVA and the rank transform ANCOVA when the data are conditionally non-normal and heteroscedastic. Journal of Educational and Behavioral Statistics, 9(2), 129–149.
Best Answer
Note gung's question; it matters. I will assume that the treatment was the same for every tank in the treatment group.
If you can argue the variance would be equal for the two groups (which you would typically assume for a two sample t-test anyway), you can do a test. You just can't check that assumption, no matter how badly violated it might be.
The concerns expressed in this answer to a related question are even more relevant to your situation, but there's less you can do about it.
[You ask about it being reasonable to assume the variances are equal. We can't answer that for you, that's something you'd have to convince subject matter experts (i.e. ecologists) was a reasonable assumption. Are there other studies where such levels have been measured under both treatment and control? Others where similar tests (t-tests or anova especially - I bet you can find a better precedent) have been done or similar assumptions made? Some form of general reasoning you can see to apply?]
If $\bar{x}$ is the sample mean of the treatment and $\bar{y}$ is the mean of the control, and both are from normal distributions with variance $\sigma^2$, then $\bar{x}-\bar{y}$ will have mean $\mu_x - \mu_y$ and variance $\sigma^2 (1/n_x + 1/n_y)$ irrespective of whether one of the $n$'s is 1.
So when $n_y$ is 1,
$$ \frac{(\bar{x}-\bar{y})}{s_x\sqrt{1/n_x+1}} $$
(where $s_x$ is the standard deviation computed from the treatments) will be $t$-distributed (with $n_x - 1$ degrees of freedom) under the null.
You may notice that, with $s_x$ used for $s_p$ as the best available estimate of $\sigma$, this is exactly the ordinary two-sample t-test formula with $n_y$ set to 1.
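The statistic above is easy to compute directly. This sketch wraps it in a small function (the function name and data are illustrative):

```r
# Two-sided t test of a treatment sample against a single control
# observation, using the statistic (xbar - y1) / (s_x * sqrt(1/n_x + 1))
# with n_x - 1 degrees of freedom, as derived above.
t1_test <- function(x, y1) {
  nx    <- length(x)
  tstat <- (mean(x) - y1) / (sd(x) * sqrt(1 / nx + 1))
  df    <- nx - 1
  p     <- 2 * pt(-abs(tstat), df)  # two-sided p-value under the null
  list(statistic = tstat, df = df, p.value = p)
}

# Hypothetical data: five treatment tanks and one control tank.
t1_test(c(5.2, 4.8, 5.5, 5.0, 5.1), 6.3)
```

For a one-sided alternative, replace the p-value line with pt(tstat, df) or 1 - pt(tstat, df) as appropriate.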
Edit:
Here's a simulated power curve for this test (figure omitted). The sample size at the null was 10000; at the other points it was 1000. The rejection rate at the null was 0.05, and the power curve, while requiring a large difference in population means to achieve decent power, had the right shape. That is, this test does what it is supposed to do.
(End edit)
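A simulation along those lines is easy to reproduce. This sketch checks only the test's size (the rejection rate under the null); the settings are illustrative:

```r
# Monte Carlo check: under the null (treatment and control drawn from the
# same normal population), the two-sided test at alpha = 0.05 should
# reject about 5% of the time.
set.seed(1)
nx   <- 10    # treatment sample size (illustrative)
nsim <- 5000  # number of simulated experiments

rej <- replicate(nsim, {
  x  <- rnorm(nx)  # treatment observations
  y1 <- rnorm(1)   # the single control observation
  tstat <- (mean(x) - y1) / (sd(x) * sqrt(1 / nx + 1))
  abs(tstat) > qt(0.975, nx - 1)  # reject at the 5% level?
})
mean(rej)  # should be near 0.05
```

Sweeping a mean shift into y1 (e.g. y1 <- rnorm(1, mean = delta) over a grid of delta values) reproduces the full power curve.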
With sample sizes so small, this will be somewhat sensitive to distributional assumptions, however.
If you're prepared to make different assumptions, or want to test equality of some other population quantity, some test may still be possible.
So all is not lost... but where possible, it's generally better to have at least some replication in both groups.