Simulated p-values – Approximate Real Significance Level and Power

p-value, simulation, statistical-significance, statistical-power, t-test

I run $N = 1000$ simulations, in each of which I generate $X\sim N(0,1)$ and $Y\sim N(1,1)$. In each simulation I conduct a t-test of the equality of means ($H_A: \mu_X \neq \mu_Y$) and obtain a p-value, which I store in an array.
After the loop, I have a 1000-element array, $P$, containing p-values. Now I want to find the real power and significance level from these data.

While calculating, I refer to the following picture:
[figure omitted: sketch of the null and alternative distributions]

Am I right that $\text{Power} = \frac 1N \sum_{i=1}^N 1_{P_i < 0.05}$?
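
For concreteness, here is a minimal sketch of the loop I describe (the per-group sample size is not stated above; n = 20 per group is assumed only for illustration):

N = 1000                          # number of simulated tests
n = 20                            # per-group sample size (assumed for illustration)
P = numeric(N)                    # array of p-values
for (i in 1:N) {
  x = rnorm(n, 0, 1)              # X ~ N(0,1)
  y = rnorm(n, 1, 1)              # Y ~ N(1,1)
  P[i] = t.test(x, y)$p.val       # Welch test by default; add var.eq=T for the pooled test
}
mean(P < 0.05)                    # proportion of rejections at the 5% level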

Best Answer

It is often useful to simulate the power of a test. However, your question is unclear. [The figure does not match the distributions you say you simulate. The null and alternative hypotheses are unclear. You do not say whether you are doing a one-sample or a two-sample t test. You do not indicate the sample size(s).]

Suppose you have random samples of sizes $n_1 = n_2 = 20$ from the null distribution $\mathsf{Norm}(0,1)$ and the specific alternative distribution $\mathsf{Norm}(1,1).$ Also, suppose you will do a pooled 2-sample t test at level $\alpha = 0.05$ and wish to know the power of the test (probability of rejecting).

Here is one such 2-sample t test in R.

set.seed(1234)
x = rnorm(20, 0, 1)
y = rnorm(20, 1, 1)
t.test(x,y, var.eq=T)$p.val  # P-value only
[1] 0.02451879

t.test(x,y, var.eq=T)        # complete output

        Two Sample t-test

data:  x and y
t = -2.3421, df = 38, p-value = 0.02452
alternative hypothesis: 
 true difference in means is not equal to 0
95 percent confidence interval:
 -1.25582408 -0.09136416
sample estimates:
 mean of x  mean of y 
-0.2506641  0.4229301 

In this one example, the null hypothesis $H_0: \mu_x=\mu_y$ is rejected in favor of the two-sided alternative $H_a: \mu_x\ne\mu_y$ at the 5% level because the P-value $0.02452 < 0.05 = 5\%.$

Now we show a simulation of 100,000 such pooled 2-sample t tests. [A simulation with only 1000 iterations would give only a very approximate answer.]

set.seed(2021)
pv = replicate(10^5, t.test(rnorm(20,0,1), 
                 rnorm(20,1,1), var.eq=T)$p.val)
mean(pv <= 0.05)   # aprx power of test at 5% level
[1] 0.86787

summary(pv)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000000 0.0003599 0.0029373 0.0284559 0.0180662 0.9986458 

2*sd(pv <= 0.05)/sqrt(10^5)
[1] 0.002141708    # aprx 95% margin of simulation error

So the simulated power of the test at the 5% level is $0.868 \pm 0.002,$ or about 87%.

Notes: (1) The numerical vector pv contains $10^5$ P-values. The logical vector pv <= 0.05 contains $10^5$ TRUEs and FALSEs. Its mean is the proportion of TRUEs. The last line of the R code gives the margin of error of a Wald 95% confidence interval for that proportion.
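
Written out explicitly, a sketch of that interval (re-using the vector pv from the $10^5$-run simulation above):

p.hat = mean(pv <= 0.05)            # estimated power, 0.86787
me = 2*sd(pv <= 0.05)/sqrt(10^5)    # 95% margin of simulation error
p.hat + c(-1, 1)*me                 # Wald 95% CI, roughly (0.866, 0.870)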

(2) For this specific test one can use an exact formula involving the non-central t distribution to find an exact power value. However, the simulation method shown here works in cases where no exact analytic formula is available. (I would use simulation to get the power of a Welch 2-sample t test, which does not assume equal variances. I have never seen an exact formula for that test.)
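
For the pooled test above, a sketch of that exact calculation, using the non-central t distribution directly and the built-in power.t.test as a check:

n = 20;  delta = 1;  sigma = 1
df = 2*n - 2                             # 38 degrees of freedom
ncp = delta/(sigma*sqrt(2/n))            # noncentrality parameter, sqrt(10)
t.crit = qt(0.975, df)                   # two-sided 5% critical value
1 - pt(t.crit, df, ncp) +
    pt(-t.crit, df, ncp)                 # exact power, about 0.869

power.t.test(n = n, delta = delta, sd = sigma,
             sig.level = 0.05)$power     # same figure from the built-in function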

(3) With only the first 1000 iterations, we get power $0.857\pm 0.022,$ which might be good enough.

set.seed(2021)
pv = replicate(1000, t.test(rnorm(20,0,1), 
               rnorm(20,1,1), var.eq=T)$p.val)
mean(pv <= 0.05)
[1] 0.857
2*sd(pv <= 0.05)/sqrt(1000)
[1] 0.02215163

Addendum on simulating significance levels, per the Comments. You can use simulation to check whether a pooled t test designed to run at the 5% level truly has significance level 5%. Why might you need to check? Maybe you doubt the software is programmed correctly. Maybe you doubt the assumptions are met (random normal data? equal population variances?).

Here is a simulation to check that the pooled t test truly has the 5% level (rejection rate when both populations have the same normal distribution):

set.seed(2021)
pv = replicate(10^5, t.test(rnorm(20,0,1), 
                 rnorm(20,0,1), var.eq=T)$p.val)
mean(pv <= 0.05)
[1] 0.04961
2*sd(pv <= 0.05)/sqrt(10^5)
[1] 0.001373307

Significance level is $0.0496\pm 0.0014$ (so very near 5%).

However, if the population variances are not equal, then the pooled 2-sample t test does not have the 'advertised' 5% rejection rate when the population means are equal $(H_0:\mu_x=\mu_y).$ This can lead to 'false discovery', especially if the sample sizes are unequal and the smaller sample comes from the population with the larger variance.

pv = replicate(10^5, t.test(rnorm(40,0,1), 
                rnorm(15,0,2), var.eq=T)$p.val)
mean(pv <= 0.05)
[1] 0.14371

The actual significance level is about 14%, not 5%. So we may reject far too often when the population means are truly equal but the variances are not. That is why the Welch t test is recommended if there is any doubt at all about the equal-variance assumption. In R, the Welch 2-sample t test, which does not assume equal population variances, is the default 2-sample t test: simply omit the argument 'var.eq=T'.

The simulation below shows that a Welch t test at the nominal 5% level gives a true significance level near 5% $(0.0515\pm 0.0014),$ even when the population variances differ (here in a 1:4 ratio).

set.seed(2021)
pv.w = replicate(10^5, t.test(rnorm(40,0,1), 
                 rnorm(15,0,2))$p.val)
mean(pv.w <= 0.05)
[1] 0.0515
2*sd(pv.w <= 0.05)/sqrt(10^5)
[1] 0.00139783