Solved – Empirical rejection rate of a wrong null hypothesis: is it the “power of the test”?

hypothesis-testing, statistical-power, t-test

I guess this is basic, but it forms part of the holes in my statistics education.
What I know is that the power of a hypothesis test is not a fixed magnitude, but a function of various factors (significance level, distance of the alternative hypothesis from the null…). Assuming that this is correct:

We perform a Monte Carlo study and consider a two-tailed t-test of statistical significance on an unknown parameter, so the null hypothesis is that this parameter is zero and the alternative is that it is not zero.

The data are generated using a non-zero parameter, so our null is a false hypothesis. For a given significance level, we observe that the test rejects the false null XX% of the times it is executed. This looks like an empirically measured "power of the test", but it is just a number, not a function. Granted, the sample size is given, the number of samples generated is given, and the significance level is given. Changing any one of these will, I guess, give us different empirical rejection rates (and so an empirically determined range of the power function w.r.t. these factors), but where and how does the alternative hypothesis come in? Shouldn't it also co-determine the "power of the test"?

In general, how is the empirical rejection rate of a false null hypothesis by a two-tailed test of statistical significance to be interpreted?
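(To make the setup concrete, here is a minimal sketch of the kind of Monte Carlo study described above; the true mean, sample size, significance level and number of replications are all illustrative choices, not values from any particular study.)

```python
# Minimal sketch of the Monte Carlo study described above: data generated under a
# non-zero parameter, tested with a two-sided one-sample t-test of H0: mean = 0.
# All numbers (true_mean, n, alpha, n_sim) are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, n, alpha, n_sim = 0.5, 30, 0.05, 10_000   # the null "mean = 0" is false here

rejections = 0
for _ in range(n_sim):
    x = rng.normal(loc=true_mean, scale=1.0, size=n)
    rejections += stats.ttest_1samp(x, popmean=0.0).pvalue < alpha  # two-sided by default

print(f"empirical rejection rate: {rejections / n_sim:.3f}")
# This single number estimates the power at THIS alternative (true mean 0.5);
# varying true_mean is what turns it back into a function.
```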

Best Answer

The usual way to investigate power properties is via a power curve (or sometimes a power surface, if we want to investigate the response to varying two things at once).

On these curves, the y-variable is the rejection rate and the x-variable is the value of the thing we're varying.

The most common type of power curve is one we produce as we vary the parameter that is the subject of the test (e.g. in a test of means, as the true mean changes from the hypothesized value). Here's an example of a power curve for a two-sample t-test under a particular set of conditions:

[Figure: power curve for a two-sample t-test]

(that one was not generated empirically, but by calling a function)
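As a sketch of that "by calling a function" route, the snippet below produces a similar (not necessarily identical) curve with statsmodels' `TTestIndPower`; the sample size, significance level and effect-size grid are illustrative, since the exact conditions behind the pictured curve aren't restated here.

```python
# Sketch: a power curve for a two-sample t-test obtained by calling a function
# (statsmodels' TTestIndPower); nobs1, alpha and the effect-size grid are illustrative.
import numpy as np
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
effect_sizes = np.linspace(0.0, 2.0, 41)      # true difference in means, in SD units

power = [solver.power(effect_size=es, nobs1=20, alpha=0.05,
                      ratio=1.0, alternative='two-sided')
         for es in effect_sizes]

# Print a few points along the curve (at effect size 0 the "power" is just alpha).
for es, p in zip(effect_sizes[::10], power[::10]):
    print(f"effect size {es:3.1f}: power {p:.3f}")
```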

Here's a power comparison of the paired t-test (curve) and the signed rank test (points) for 4 pairs of normal observations (it's actually two-sided, but the left half isn't shown, as it's a mirror image of the right half):
[Figure: power comparison of paired t-test (curve) and signed rank test (points)]

The t-test was carried out at the exact significance level attained by the signed rank test (since the signed rank test can only attain a few distinct significance levels).
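(A quick sketch of why only a few levels are attainable with 4 pairs: under the null, each of the $2^4 = 16$ sign patterns is equally likely, so attainable one-sided p-values are multiples of 1/16. The enumeration below is illustrative, not taken from the original analysis.)

```python
# Sketch: attainable significance levels of the signed rank test with n = 4 pairs.
# Under H0 each of the 2^4 = 16 sign assignments is equally likely, so the null
# distribution of W+ (sum of ranks of the positive differences) is easy to enumerate.
from itertools import combinations

ranks = [1, 2, 3, 4]
all_sums = [sum(c) for k in range(len(ranks) + 1) for c in combinations(ranks, k)]
n_patterns = len(all_sums)   # 16 equally likely values of W+ under H0

for w in sorted(set(all_sums), reverse=True):
    one_sided = sum(s >= w for s in all_sums) / n_patterns
    print(f"P(W+ >= {w:2d}) = {one_sided:.4f}   (two-sided level {min(1.0, 2 * one_sided):.4f})")
```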

Here's a pair of (one-sided) power curves for a power comparison in a test of normality, where the alternatives are gamma distributed (which include the normal as a limiting case, under appropriate standardization):

[Figure: power curves for a test of normality against gamma alternatives]

(this one was generated empirically in essentially the manner you describe)

As you suggest, at some specified value of the alternative you can compute the power, and then, as you vary that value, you obtain a function of the parameter you change, giving a power curve (or, more strictly, a curve of rejection rates, since at the null that rate is not power but the significance level).
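A minimal sketch of that empirical procedure, using a one-sample t-test for brevity (rather than the normality test pictured above); the sample size, significance level, replication count and grid of true means are all illustrative:

```python
# Sketch: an empirical rejection-rate curve, built by varying the true parameter
# over a grid and recording the rejection rate at each value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, alpha, n_sim = 20, 0.05, 5_000
true_means = np.linspace(0.0, 1.5, 16)          # includes the null value 0

for mu in true_means:
    x = rng.normal(loc=mu, scale=1.0, size=(n_sim, n))
    pvals = stats.ttest_1samp(x, popmean=0.0, axis=1).pvalue
    rate = np.mean(pvals < alpha)
    # At mu = 0 this estimates the significance level; elsewhere it estimates power.
    print(f"true mean {mu:4.2f}: rejection rate {rate:.3f}")
```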

I've generated quite a few of these curves in various of my answers. See here for another example that compares power from an "algebraic" calculation (/function call) and an empirical calculation (i.e. simulation).


Some general advice relating to empirical power:

1) Since these are empirical rejection rates (i.e. binomial proportions), we can compute standard errors, confidence intervals and so on, so you know how accurate they are. If I can spare the time, I usually simulate enough samples that the standard error is on the order of a pixel in my image (or even somewhat less), at least if it's not a large image (if you're doing vector graphics, think maybe half a percent or so of the height of the plot instead).
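A sketch of that accuracy calculation (the 0.005 target below corresponds to the "half a percent of the plot height" rule of thumb and is purely illustrative):

```python
# Sketch: accuracy of an empirical rejection rate, treated as a binomial proportion,
# and the number of simulations needed to reach an illustrative target standard error.
import numpy as np

def rejection_rate_se(p_hat, n_sim):
    """Binomial standard error of an estimated rejection rate."""
    return np.sqrt(p_hat * (1.0 - p_hat) / n_sim)

def n_sim_for_target_se(target_se, p=0.5):
    """Simulations needed for SE <= target_se (worst case is p = 0.5)."""
    return int(np.ceil(p * (1.0 - p) / target_se ** 2))

print(rejection_rate_se(0.80, 10_000))   # about 0.004
print(n_sim_for_target_se(0.005))        # 10000 replications for SE <= 0.005
```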

2) Power curves are typically going to be smooth. Very smooth. As such, we can, with a bit of cleverness, avoid calculating power at a huge number of values (and indeed we can use this fact to reduce the number of simulations required at each point). One thing I do is to take a transformation of the power that will "straighten" the power curve, at least when we're away from 0 (the inverse normal cdf is often a good choice), do cubic spline smoothing, and then transform back (beware doing that anywhere your rejection rates are exactly 0 or 1; you may want to leave those alone). If you do that well, you should be able to get away with 10-20 points or so. (A small sketch of this transform-and-smooth approach is given at the end of this point.)

If you already have so many simulations that your points are accurate to the pixel, then, after transforming to approximate local linearity, linear interpolation will usually be sufficient, and produces smooth, highly accurate curves after you transform back. If in doubt, produce a few more points and see whether the curves are generally within a couple of standard errors of those simulated values (since if those standard errors are only on the order of a pixel, you can't actually see the difference... so the minuscule bias this might introduce really doesn't matter).

You can also exploit obvious symmetries and so on. (In the t-test vs signed rank test power curve above, we exploited a relationship between the within-pair correlation ($\rho$) and the standard error of the difference to give different x-axes (above and below the plot) that then have the same power curves).

Sometimes a little fiddling is required to get it just so, but you should get very smooth, more accurate estimates of the power with such smoothing. (On the other hand, sometimes it's faster just to do more points - but in any case I would rarely do more than about 30 points, because the eye happily fills the rest in.)
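Here is the sketch promised above of the transform-smooth-back-transform idea. The "simulated" power values are made up for illustration, and a plain cubic spline through the transformed points is used (a smoothing spline could be substituted; as noted, points at exactly 0 or 1 need separate handling).

```python
# Sketch: smooth a handful of simulated power points by mapping them to the probit
# scale (inverse normal cdf), fitting a cubic spline there, and mapping back.
# The "simulated" power values below are illustrative, not real simulation output.
import numpy as np
from scipy.stats import norm
from scipy.interpolate import CubicSpline

effect = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5])
power  = np.array([0.07, 0.16, 0.33, 0.56, 0.76, 0.90, 0.965, 0.99])

z = norm.ppf(power)                  # probit transform roughly straightens the curve
spline = CubicSpline(effect, z)      # fit on the transformed scale

grid = np.linspace(effect.min(), effect.max(), 200)
smooth_power = norm.cdf(spline(grid))   # back-transform to the probability scale
print(smooth_power[::40])
```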

3) Since we're doing Monte Carlo simulation, we can exploit various variance reduction techniques (though keep in mind the impact on calculated standard errors; at worst, if you can't calculate them any more, the unreduced variance will be an upper bound). For example, we can use control variates - one thing I did when comparing the power of a nonparametric test with a t-test was to compute the empirical rejection rate for both tests and then use the error in the power for the t-test to help reduce the error in the other test (again, smoothing the result a little) ... but it works better if you do it on the right scale. A number of other variance-reduction techniques can also be used. (If I remember rightly, I might have used a control variate on a transformed scale for the one-sample-t vs signed-rank test comparison above.) A minimal sketch of the control-variate idea is given below, after these points.

But often simple brute force will suffice and requires little brain effort. If all it takes is going for a cup of coffee while the full simulation runs, might as well let it go. (No point spending half an hour working out some clever computation to save 15 minutes of a half hour run time.)
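Here is the sketch of the control-variate idea mentioned in point 3, comparing a signed-rank test with a one-sample t-test whose power can be computed by a function call; all settings are illustrative, and the transformation-of-scale refinement mentioned above is omitted for brevity.

```python
# Sketch: control variates for empirical power. The t-test's power is computable
# (noncentral t, here via statsmodels), so the simulation error in its empirical
# rejection rate is used to adjust the signed-rank test's estimate.
# All settings (n, alpha, n_sim, true_mean) are illustrative.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower

rng = np.random.default_rng(7)
n, alpha, n_sim, true_mean = 15, 0.05, 5_000, 0.6

t_rej = np.empty(n_sim)
w_rej = np.empty(n_sim)
for i in range(n_sim):
    x = rng.normal(loc=true_mean, scale=1.0, size=n)
    t_rej[i] = stats.ttest_1samp(x, 0.0).pvalue < alpha
    w_rej[i] = stats.wilcoxon(x).pvalue < alpha          # signed rank test

true_t_power = TTestPower().power(effect_size=true_mean, nobs=n,
                                  alpha=alpha, alternative='two-sided')

# Optimal control-variate coefficient, estimated from the same simulations.
c = np.cov(w_rej, t_rej, ddof=1)[0, 1] / np.var(t_rej, ddof=1)
power_cv = w_rej.mean() - c * (t_rej.mean() - true_t_power)

print(f"raw signed-rank power estimate:    {w_rej.mean():.4f}")
print(f"control-variate adjusted estimate: {power_cv:.4f}")
```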
