F-Test Sample Size Effect – Understanding the Impact

f-test, hypothesis-testing, p-value, sample-size, t-test

Having worked hard to understand the t-test, I'm rapidly falling out of love with it. All that is required in the t-test to gain significance is to increase the sample size, which renders it close to pointless, IMO.

But what about the F-test, as used in ANOVA, linear regression, etc.? Variance is independent of sample size, so am I right in saying that the significance of the p-value in an F-test is unaffected by sample size?

Best Answer

All that is required in the t-test to gain significance is to increase the sample size

The property you mention is effectively consistency (or rather, it's what you'd expect to see, given consistency, under some commonly satisfied conditions). Consistency says that as $n\to\infty$, $P(\text{reject } H_0 \mid H_0 \text{ false})\to 1$.
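To see this concretely, here's a minimal simulation sketch (the function name power_at_n and the particular settings are just for illustration): a two-sample t-test against a fixed small true difference of 0.2 standard deviations, with power estimated by simulation at increasing per-group $n$.

# rough simulated power of a two-sample t-test at a fixed true
# difference of 0.2 sd, for increasing per-group sample size n
power_at_n <- function(n, delta = 0.2, nsim = 2000) {
  mean(replicate(nsim, {
    x <- rnorm(n)
    y <- rnorm(n, mean = delta)
    t.test(x, y)$p.value < 0.05
  }))
}
sapply(c(20, 100, 500, 2000), power_at_n)
# power climbs toward 1 as n grows (roughly 0.10, 0.29, 0.88, 1.00)

The true difference never changes; only the sample size does, and the rejection rate still heads to 1.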

All the tests you mention are consistent. In fact, I think most people* would regard inconsistency as a reason to reject a proposed test.

If you think consistency is a reason not to use a test, that suggests you should probably abandon hypothesis tests altogether, because you'll be hard pressed to find any other kind being regularly used, outside of some very specific situations.

* Not all, however. Some people are happy to use an inconsistent test, as long as its properties are reasonable at the sample sizes they actually work with. However, since they'd generally switch to another test once sample sizes became large enough to make that advantageous, they're not really avoiding power going to 1 as sample size goes to infinity.

--

Your question suggests you're either using hypothesis tests in situations where a different tool would be better (which is quite often the case -- hypothesis tests are vastly overused*), or perhaps that you don't really follow what's going on with significance tests.

You might find that in some situations confidence intervals, or even just the estimates themselves, do what you need. In other cases you may find that equivalence tests come closer to what you want.

* as an example, if you look at questions here related to hypothesis tests of assumptions of other tests, you'll find that where I answer those questions I almost always advise against it -- because it doesn't answer the question of interest in that case.
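As a quick example of the confidence-interval suggestion (a sketch using R's built-in sleep data, which also turns up later in this answer):

# an interval for the difference in means, rather than a bare
# reject/don't-reject decision
t.test(extra ~ group, data = sleep)$conf.int
# its location and width tell you about the plausible size of the
# effect, not merely whether it's distinguishable from zero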

--

Variance is independent of sample size,

This also tends to suggest you don't quite understand what's going on.

The population variances in a t-test don't change with sample size either. What gets smaller is the standard error of the difference in means. The analogous thing happens in regression and ANOVA: while the population variances of the observations don't change with sample size, the variances of the estimated effects decrease as $n$ grows.

The numerator in an F-test contains an estimate of variance that has two components: the variation between the population means (the thing that's zero under the null) and the variability of the sample means about their population means (which is a function of the variance of the error term and the sample sizes). The denominator only has the variance of the error term.
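In symbols, for the fixed-effects one-way layout with $k$ groups of sizes $n_i$, group means $\mu_i$, grand mean $\bar{\mu}$ and error variance $\sigma^2$ (a standard expected-mean-squares result):

$E(\text{MS}_{\text{between}})=\sigma^2+\frac{\sum_{i=1}^{k} n_i(\mu_i-\bar{\mu})^2}{k-1},\qquad E(\text{MS}_{\text{within}})=\sigma^2$

Under $H_0$ the second term in the numerator vanishes and the ratio hovers around 1; when the means differ, that term grows with the $n_i$.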

If the sample sizes in an ANOVA increase, the variation of the sample means about their population means will diminish, but the variation between the population means will not. So if the means are unequal, the F-statistic will tend to become larger and larger as the sample sizes grow.
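A quick simulation sketch of that (one draw per $n$, with the population means fixed at 0 and 0.5; the settings are illustrative only):

# with fixed unequal population means, the F statistic tends to grow
# with the per-group sample size n
set.seed(1)
for (n in c(10, 100, 1000)) {
  g <- factor(rep(1:2, each = n))
  y <- rnorm(2 * n, mean = rep(c(0, 0.5), each = n))
  cat("n per group:", n, " F:", anova(lm(y ~ g))[1, "F value"], "\n")
}
# F grows roughly in proportion to n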

Indeed, it's perfectly possible to cast the t-test you currently reject as an ANOVA F-test, and as a test in regression (either as a t-test of a coefficient or as an F-test based on the change in sums of squares).


Edit:

The equivalence of the t-test, one-way ANOVA and regression on the group indicator is discussed here, but I'll try to motivate it a little further.

The difference is that the t-statistic puts into the denominator a scaling factor that the F-test puts (the reciprocal of) into the numerator. Once you rearrange the t-statistic and square it, it's exactly the formula for the F.

Here's the t-statistic (I'll call this statistic $T$) for a two-sample t-test:

$T=\frac {\bar{x}-\bar{y}} {s_p\sqrt{\frac{1}{n_x}+\frac{1}{n_y}}}$

Now rewrite it so the estimate of the error standard deviation is alone in the denominator:

$=\frac {(\bar{x}-\bar{y})\frac{1}{\sqrt{\frac{1}{n_x}+\frac{1}{n_y}}}} {s_p}$

Nothing is different; it's the same statistic written a different way.

Now square it:

$T^2 =\frac {(\bar{x}-\bar{y})^2\frac{1}{\frac{1}{n_x}+\frac{1}{n_y}}} {s_p^2}$

I don't want to labor the point with a lot of algebra*, but the numerator is now the numerator of the F in a two-group ANOVA, while the denominator is the denominator of that F. In the F, the factor that in the $t$-test scales $\hat{\sigma}$ to give the standard error of the difference in means appears, squared and reciprocated, in the numerator, turning the squared difference in means into a mean square.

* (basically, you rewrite $(\bar{x}-\bar{y})^2$ in terms of a sum of squares of deviations from the overall mean and do a little manipulation to show the numerator there is the same as the treatment mean square; the denominator is more clearly the same. You might like to try doing the algebra for the equal-sample size case.)
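Here's a sketch of that equal-$n$ algebra, in case you want to check your work: with $n_x=n_y=n$ and grand mean $\bar{m}=(\bar{x}+\bar{y})/2$, each group mean sits a distance $|\bar{x}-\bar{y}|/2$ from $\bar{m}$, so the treatment sum of squares (which sits on 1 df, and hence is also the treatment mean square) is

$n(\bar{x}-\bar{m})^2+n(\bar{y}-\bar{m})^2=2n\left(\frac{\bar{x}-\bar{y}}{2}\right)^2=(\bar{x}-\bar{y})^2\frac{1}{\frac{1}{n}+\frac{1}{n}}$

which is exactly the numerator of $T^2$ above.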

At the above link my answer does an example using a t-test and regression on the sleep data in R. I'll include the data for anyone who wants to follow along; the data set is small enough you could even check everything on a calculator if you were so inclined:

> unstack(sleep[,1:2])
     X1   X2
1   0.7  1.9
2  -1.6  0.8
3  -0.2  1.1
4  -1.2  0.1
5  -0.1 -0.1
6   3.4  4.4
7   3.7  5.5
8   0.8  1.6
9   0.0  4.6
10  2.0  3.4

To expand on the equivalence some more, recall that rearranged t-statistic:

$T=\frac {(\bar{x}-\bar{y})\frac{1}{\sqrt{\frac{1}{n_x}+\frac{1}{n_y}}}} {s_p}$

Here are the group means and SDs:

> with(sleep,tapply(extra,group,mean))
   1    2 
0.75 2.33 

> with(sleep,tapply(extra,group,sd))
       1        2 
1.789010 2.002249 

The sample sizes are both 10. So the numerator of the above $T$ is

${(0.75-2.33)\frac{1}{\sqrt{\frac{1}{10}+\frac{1}{10}}}}$

> (num=(0.75-2.33)*1/sqrt(1/10+1/10))
[1] -3.532987

The denominator, $s_p$, is

$\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}=\sqrt{\frac{9s_1^2+9s_2^2}{18}}=\sqrt{\frac{s_1^2+s_2^2}{2}}$

> (denom=sqrt(sum(with(sleep,tapply(extra,group,sd))^2)/2))
[1] 1.898625

Is this rearranged form really the t-statistic? Let's check:

> (T=num/denom)
[1] -1.860813

(t.test gave t = -1.8608 as you can see at the linked post)
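For anyone who doesn't want to chase the link, the call was the pooled-variance two-sample test (var.equal = TRUE is what makes it match the $s_p$ above):

> t.test(extra ~ group, data = sleep, var.equal = TRUE)$statistic
        t 
-1.860813 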

So now for the equivalence to F. Let's square the numerator and denominator:

> num^2;denom^2
[1] 12.482
[1] 3.604778

Now here's the one-way ANOVA. Look at the Mean Sq column:

> summary(aov(extra~group,sleep))
            Df Sum Sq Mean Sq F value Pr(>F)  
group        1  12.48  12.482   3.463 0.0792 
Residuals   18  64.89   3.605                 

Well how about that. Also:

> T^2
[1] 3.462627
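And the regression route on the same data gives the same numbers (a sketch; note lm codes the difference in the other direction, so the coefficient's t flips sign relative to the $T$ above):

> # t for the group coefficient: same magnitude as T
> coef(summary(lm(extra ~ group, data = sleep)))["group2", "t value"]
[1] 1.860813
> # and the regression F matches T^2 and the ANOVA F value
> summary(lm(extra ~ group, data = sleep))$fstatistic["value"]
   value 
3.462627 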