Simulated p-values – Approximate Real Significance Level and Power

p-value, simulation, statistical-significance, statistical-power, t-test

I run $N = 1000$ simulations, in each of which I generate $X\sim N(0,1)$ and $Y\sim N(1,1)$. In each simulation I conduct a t-test of the equality of means ($H_A: \mu_X \neq \mu_Y$) and obtain a p-value, which I store in an array.
After the loop, I have a 1000-element array, $P$, containing p-values. Now I want to find the real power and significance level from these data.

While calculating, I refer to the following picture:
[figure omitted: sketch of the null and alternative distributions]

Am I right that $\text{Power} = \frac 1N \sum_{i=1}^N 1_{P_i < 0.05}$?
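
For concreteness, here is a minimal sketch of the loop I describe (the per-group sample size is not stated above; n = 20 per group is assumed only for illustration):

N = 1000                          # number of simulated tests
n = 20                            # per-group sample size (assumed for illustration)
P = numeric(N)                    # array of p-values
for (i in 1:N) {
  x = rnorm(n, 0, 1)              # X ~ N(0,1)
  y = rnorm(n, 1, 1)              # Y ~ N(1,1)
  P[i] = t.test(x, y)$p.val       # Welch test by default; add var.eq=T for the pooled test
}
mean(P < 0.05)                    # proportion of rejections at the 5% level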

Best Answer

It is often useful to simulate the power of a test. However, your question is unclear. [The figure does not match the distributions you say you simulate. The null and alternative hypotheses are unclear. You do not say whether you are doing a one-sample or a two-sample t test. You do not indicate the sample size(s).]

Suppose you have random samples of sizes $n_1 = n_2 = 20$ from the null distribution $\mathsf{Norm}(0,1)$ and the specific alternative distribution $\mathsf{Norm}(1,1).$ Also, suppose you will do a pooled 2-sample t test at level $\alpha = 0.05$ and wish to know the power of the test (probability of rejecting).

Here is one such 2-sample t test in R.

set.seed(1234)
x = rnorm(20, 0, 1)
y = rnorm(20, 1, 1)
t.test(x,y, var.eq=T)$p.val  # P-value only
[1] 0.02451879

t.test(x,y, var.eq=T)        # complete output

        Two Sample t-test

data:  x and y
t = -2.3421, df = 38, p-value = 0.02452
alternative hypothesis: 
 true difference in means is not equal to 0
95 percent confidence interval:
 -1.25582408 -0.09136416
sample estimates:
 mean of x  mean of y 
-0.2506641  0.4229301 

In this one example, the null hypothesis $H_0: \mu_x=\mu_y$ is rejected in favor of the two-sided alternative $H_a: \mu_x\ne\mu_y$ at the 5% level because the P-value $0.02452 < 0.05 = 5\%.$

Now we show a simulation of 100,000 such pooled 2-sample t tests. [A simulation with only 1000 iterations would give only a very approximate answer.]

set.seed(2021)
pv = replicate(10^5, t.test(rnorm(20,0,1), 
                 rnorm(20,1,1), var.eq=T)$p.val)
mean(pv <= 0.05)   # aprx power of test at 5% level
[1] 0.86787

summary(pv)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000000 0.0003599 0.0029373 0.0284559 0.0180662 0.9986458 

2*sd(pv <= 0.05)/sqrt(10^5)
[1] 0.002141708    # aprx 95% margin of simulation error

So the simulated power of the test at the 5% level is $0.868 \pm 0.002,$ or about 87%.

Notes: (1) The numerical vector pv contains $10^5$ P-values. The logical vector pv <= 0.05 contains $10^5$ TRUEs and FALSEs. Its mean is the proportion of TRUEs. The last line of the R code gives the margin of error of a Wald 95% confidence interval for that proportion.
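
Written out explicitly, a sketch of that interval (re-using the vector pv from the $10^5$-run simulation above):

p.hat = mean(pv <= 0.05)            # estimated power, 0.86787
me = 2*sd(pv <= 0.05)/sqrt(10^5)    # 95% margin of simulation error
p.hat + c(-1, 1)*me                 # Wald 95% CI, roughly (0.866, 0.870)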

(2) For this specific test one can use an exact formula involving the non-central t distribution to find an exact power value. However, the simulation method shown here works in cases where no exact analytic formula is available. (I would use simulation to get the power of a Welch 2-sample t test, which does not assume equal variances. I have never seen an exact formula for that test.)
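
For the pooled test above, a sketch of that exact calculation, using the non-central t distribution directly and the built-in power.t.test as a check:

n = 20;  delta = 1;  sigma = 1
df = 2*n - 2                             # 38 degrees of freedom
ncp = delta/(sigma*sqrt(2/n))            # noncentrality parameter, sqrt(10)
t.crit = qt(0.975, df)                   # two-sided 5% critical value
1 - pt(t.crit, df, ncp) +
    pt(-t.crit, df, ncp)                 # exact power, about 0.869

power.t.test(n = n, delta = delta, sd = sigma,
             sig.level = 0.05)$power     # same figure from the built-in function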

(3) With only the first 1000 iterations, we get power $0.857\pm 0.022,$ which might be good enough.

set.seed(2021)
pv = replicate(1000, t.test(rnorm(20,0,1), 
               rnorm(20,1,1), var.eq=T)$p.val)
mean(pv <= 0.05)
[1] 0.857
2*sd(pv <= 0.05)/sqrt(1000)
[1] 0.02215163

Addendum on simulating significance levels, per the Comments. You can use simulation to check whether a pooled t test designed to run at the 5% level truly has significance level 5%. Why might you need to check? Maybe you doubt the software is programmed correctly. Maybe you doubt the assumptions are met (random normal data? equal population variances?).

Here is a simulation to check that the pooled t test truly has the 5% level (rejection rate when both populations have the same normal distribution):

set.seed(2021)
pv = replicate(10^5, t.test(rnorm(20,0,1), 
                 rnorm(20,0,1), var.eq=T)$p.val)
mean(pv <= 0.05)
[1] 0.04961
2*sd(pv <= 0.05)/sqrt(10^5)
[1] 0.001373307

Significance level is $0.0496\pm 0.0014$ (so very near 5%).

However, if the population variances are not equal, then the pooled 2-sample t test does not have the 'advertised' 5% rejection rate when the population means are equal $(H_0:\mu_x=\mu_y).$ This can lead to 'false discovery', especially if the sample sizes are unequal and the smaller sample comes from the population with the larger variance.

pv = replicate(10^5, t.test(rnorm(40,0,1), 
                rnorm(15,0,2), var.eq=T)$p.val)
mean(pv <= 0.05)
[1] 0.14371

The actual significance level is about 14%, not 5%. So we may reject far too often when the population means are truly equal but the variances are not. That is why the Welch t test is recommended if there is any doubt at all about the equal-variance assumption. In R, the Welch 2-sample t test, which does not assume equal population variances, is the default 2-sample t test: simply omit the argument 'var.eq=T'.

The simulation below shows that a Welch t test at the nominal 5% level gives a true significance level near 5% $(0.0515\pm 0.0014),$ even when the population variances differ (here in a 1:4 ratio).

set.seed(2021)
pv.w = replicate(10^5, t.test(rnorm(40,0,1), 
                 rnorm(15,0,2))$p.val)
mean(pv.w <= 0.05)
[1] 0.0515
2*sd(pv.w <= 0.05)/sqrt(10^5)
[1] 0.00139783