Solved – Why can bigger sample size increase power of a test

hypothesis testing

From Wikipedia

The sample size determines the amount of sampling error inherent in a
test result. Other things being equal, effects are harder to detect in
smaller samples. Increasing sample size is often the easiest way to
boost the statistical power of a test.

I wonder why it is often said that a bigger sample size can increase the power (i.e. true positive rate) of a test in general.
Does bigger sample size always increase testing power?

Added: Suppose that at each sample size $n$ we reject the null iff $T_n(X) \geq c_n$. How power changes with $n$ depends on how $T_n$ and $c_n$ are defined in terms of $n$, doesn't it? Even if $c_n$ is chosen so that the size of the testing rule is a value $\alpha \in [0,1]$ fixed for all $n$, will the power necessarily increase with $n$?
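As a concrete instance of this setup (a minimal sketch; the test, effect size, and $\alpha$ below are all assumed for illustration and not part of the question): a one-sided $z$-test for a normal mean with known variance, where $c_n$ is calibrated so the size equals $\alpha$ at every $n$, and the exact power is then computed as a function of $n$.

```python
# Assumed setup: X_1, ..., X_n iid N(mu, sigma^2) with sigma known,
# H0: mu = 0 vs HA: mu = 0.5, test statistic = sample mean.
import numpy as np
from scipy.stats import norm

alpha, mu_alt, sigma = 0.05, 0.5, 1.0   # assumed size, effect size, known sd

for n in (5, 10, 20, 50, 100):
    # critical value for the sample mean, chosen so the size is exactly alpha under H0
    c_n = norm.ppf(1 - alpha) * sigma / np.sqrt(n)
    # exact power: P(sample mean >= c_n) when the true mean is mu_alt
    power = 1 - norm.cdf((c_n - mu_alt) * np.sqrt(n) / sigma)
    print(f"n = {n:4d}   c_n = {c_n:.3f}   power = {power:.3f}")
```

In this particular case the power does increase monotonically in $n$, even though $c_n$ is recomputed at every $n$ to hold the size at $\alpha$.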

Explanations that are rigorous and intuitive are both welcome.

Thanks!

Best Answer

The power of the test depends on the distribution of the test statistic when the null hypothesis is false. If $R_n$ is the rejection region for the test statistic at sample size $n$ (calibrated under the null hypothesis), the power is $$\beta = \mbox{Prob}(X_n \in R_n \mid H_A)$$ where $H_A$ is the alternative hypothesis and $X_n$ is the test statistic for a sample of size $n$. I am assuming a simple alternative --- although in practice, we usually care about a range of parameter values.
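A rough Monte Carlo check of this formula, under an assumed simple setup (not from the answer itself): $X_n$ is the sample mean of $n$ i.i.d. $N(\mu, 1)$ observations, $H_0: \mu = 0$, $H_A: \mu = 0.5$, and $R_n = [c_n, \infty)$ with $c_n$ chosen to give size $\alpha$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, mu_alt, n, reps = 0.05, 0.5, 30, 100_000   # illustrative values

# rejection region R_n = [c_n, inf), with c_n calibrated under H0: mu = 0
c_n = norm.ppf(1 - alpha) / np.sqrt(n)
# draw the test statistic X_n (the sample mean) repeatedly under H_A
x_n = rng.normal(mu_alt, 1.0, size=(reps, n)).mean(axis=1)
print("estimated power  Prob(X_n in R_n | H_A):", (x_n >= c_n).mean())
```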

Typically, a test statistic is some sort of average whose long-run behaviour is governed by the strong and/or weak law of large numbers. As the sample size gets large, the distribution of the test statistic approaches a point mass --- under the null and under the alternative hypothesis alike.
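A small simulation of this concentration, again assuming the test statistic is the sample mean of i.i.d. $N(\mu, 1)$ data (values chosen only for illustration): its sampling distribution tightens around $\mu$ at rate $1/\sqrt{n}$, whichever hypothesis is true.

```python
import numpy as np

rng = np.random.default_rng(1)
reps = 5_000

# sampling distribution of the statistic under H0 (mu = 0) and HA (mu = 0.5)
for mu, label in ((0.0, "H0"), (0.5, "HA")):
    for n in (10, 100, 1000):
        stat = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
        print(f"{label}: n = {n:5d}   mean = {stat.mean():+.3f}   sd = {stat.std():.4f}")
```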

Thus, as $n$ gets large, the acceptance region (the complement of the rejection region) shrinks toward the value specified by the null. Intuitively, probable outcomes under the null and probable outcomes under the alternative no longer overlap, so the rejection probability approaches 1 under $H_A$ while the size stays fixed at $\alpha$ under $H_0$. Increasing the sample size is like increasing the magnification of a telescope: from a distance, two dots might seem indistinguishably close, but through the telescope you realize there is space between them. Sample size puts "probability space" between the null and alternative.
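One hedged way to quantify the "telescope" picture, in the same assumed normal-mean setup as above: the area shared by the two sampling distributions of the sample mean (under $H_0: \mu = 0$ and $H_A: \mu = 0.5$) shrinks toward zero as $n$ grows. For two equal-variance normal densities separated by $\delta$, that shared area is $2\,\Phi\!\left(-\delta/(2\,\mathrm{sd})\right)$, with $\mathrm{sd} = 1/\sqrt{n}$ here.

```python
import numpy as np
from scipy.stats import norm

delta = 0.5   # assumed separation between the null and alternative means

for n in (5, 20, 80, 320):
    sd = 1.0 / np.sqrt(n)                      # sd of the sample mean at size n
    # area shared by the two equal-variance normal densities: 2 * Phi(-delta / (2*sd))
    overlap = 2 * norm.cdf(-delta / (2 * sd))
    print(f"n = {n:4d}   overlap between H0 and HA sampling distributions = {overlap:.4f}")
```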

I am trying to think of an example where this does not occur --- but it is hard to imagine using a test statistic whose behaviour does not ultimately lead to certainty. I can, however, imagine situations where things break down: if the number of nuisance parameters increases with the sample size, estimates can fail to converge. In time series estimation, if the series is "insufficiently random" and the influence of the past fails to diminish at a reasonable rate, problems can arise as well.
