Significance testing is what Fisher devised; hypothesis testing is what Neyman and Pearson devised to replace it. They are not the same, and they are mutually incompatible to an extent that would surprise most users of null hypothesis tests.
Fisher's significance tests yield a p value that represents how extreme the observations are under the null hypothesis. That p value is an index of evidence against the null hypothesis and is the level of significance.
Neyman and Pearson's hypothesis tests set up both a null hypothesis and an alternative hypothesis and work as a decision rule for accepting or rejecting the null hypothesis. Briefly (there is more to it than I can put here): you choose an acceptable rate of false positive inference, alpha (usually 0.05), and either accept or reject the null according to whether the p value is above or below alpha. You have to abide by the test's decision if you wish to keep the false positive error rate at the chosen level.
Fisher's approach allows you to take anything you like into account when interpreting the result; for example, pre-existing evidence can be informally taken into account in the interpretation and presentation of the result. In the N-P approach that can only be done at the experimental design stage, and it seems to be rarely done. In my opinion the Fisherian approach is more useful in basic bioscientific work than the N-P approach.
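As a concrete illustration of the two readings of the same result, here is a small hypothetical sketch in R (the data, effect size, and the 0.05 threshold are made up for the example): Fisher would report and interpret the p value itself, while Neyman-Pearson would only compare it to the pre-chosen alpha and act on the resulting decision.

set.seed(1)
x <- rnorm(20, mean = 0.4) # made-up treatment group
y <- rnorm(20, mean = 0.0) # made-up control group
p <- t.test(x, y)$p.value # two-sample t-test
p # Fisherian reading: the p value itself is a graded index of evidence against the null
p < 0.05 # N-P reading: reject or accept according to the pre-set alpha, nothing more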
There is a substantial literature about inconsistencies between significance testing and hypothesis testing and about the unfortunate hybridisation of the two. You could start with this paper:
Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med. 1999;130(12):995-1004.
https://pubmed.ncbi.nlm.nih.gov/10383371/
It seems like you simply hit a specific R peculiarity: when you fit linear models and a predictor is coded with numerical values, you have to tell R whether it really represents a numerical variable (the default, leading to a regression model) or whether it actually is a factor (leading to an ANOVA).
In your case you just have to change

result <- summary(aov(test_matrix[i,] ~ group))

to

result <- summary(aov(test_matrix[i,] ~ factor(group)))

to get close to correct results.
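If you want to see the difference directly, here is a minimal self-contained sketch (made-up data, not your test_matrix): with a numeric predictor, aov() fits a single regression slope (1 df), whereas factor() gives the intended between-groups test (2 df).

set.seed(1)
group <- rep(1:3, each = 10) # numeric group labels
y <- rnorm(30) # made-up response values
summary(aov(y ~ group)) # group treated as numeric: regression slope, 1 df
summary(aov(y ~ factor(group))) # group treated as categorical: one-way ANOVA, 2 df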
In addition, I don't understand your correction to the standard deviations. The sds are the true standard deviations that are required to simulate data with rnorm(). Leave out your correction, and you get even closer to the correct result.
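If in doubt, a quick (hypothetical) check shows that rnorm() already expects the true standard deviation, so no correction is needed:

set.seed(2)
sd(rnorm(1e6, mean = 0, sd = 1.5)) # close to 1.5: the sd argument is the true sd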
When you have time to explore R some more, you might want to look at some features / strategies that make simulations like yours somewhat easier. E.g.:
- rnorm() is vectorized: supply a vector of $\mu$s and $\sigma$s, each with length = number of simulated values, and you can eliminate the double loop in create_sim_data().
- anova(lm()) returns a data frame that is a lot easier to index than the result of summary(aov()).
- rep() accepts a vector for its times argument, which simplifies its use for your purpose.
Here's your simulation stripped to the bare bones, just for the group size of 40, giving us the p-values.
Nj <- c(40, 40, 40) # group sizes for 3 groups
mu <- c(0.2, 0, -0.2) # expected values in groups
sigma <- c(1, 1, 1) # true standard deviations in groups
mus <- rep(mu, times=Nj) # for use in rnorm(): vector of mus
sigmas <- rep(sigma, times=Nj) # for use in rnorm(): vector of true sds
IV <- factor(rep(1:3, times=Nj)) # factor for ANOVA
nsims <- 1000 # number of simulations
# reference: correct power
power.anova.test(groups=3, n=Nj[1], between.var=var(mu), within.var=sigma[1]^2)$power
doSim <- function() { # function to run one ANOVA on simulated data
    DV <- rnorm(sum(Nj), mus, sigmas) # data from all three groups
    anova(lm(DV ~ IV))["IV", "Pr(>F)"] # p-value from ANOVA
}
pVals <- replicate(nsims, doSim()) # run the simulation nsims times
(power <- sum(pVals < 0.05) / nsims) # fraction of significant ANOVAs
Best Answer
The significance level is the probability that we reject the null hypothesis when it is true; power is the probability of rejecting the null hypothesis when it is false. The significance level is thus the probability of a Type I error, whereas $1 - \text{power}$ is the probability of a Type II error. Mathematically these are not complementary probabilities: the p-value is calculated using the probability distribution under the null hypothesis, while the power is calculated using the probability distribution under the alternative hypothesis.
A frequently used, joking but apt illustration is a pregnancy test, which from time to time may return a positive result for a man (a Type I error, a false positive) and a negative result for a visibly pregnant woman (a Type II error, a false negative).
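One way to see that the two error rates are defined under different distributions is a small simulation sketch (the sample size, effect size, and alpha = 0.05 here are arbitrary choices for illustration): generate t-test data under the null and under an alternative, and count the rejections in each case.

set.seed(42)
nsims <- 10000 # number of simulated experiments
n <- 30 # hypothetical per-group sample size
delta <- 0.5 # hypothetical true effect under the alternative
pNull <- replicate(nsims, t.test(rnorm(n), rnorm(n))$p.value) # data generated under H0
pAlt <- replicate(nsims, t.test(rnorm(n), rnorm(n, mean = delta))$p.value) # data generated under H1
mean(pNull < 0.05) # close to 0.05: Type I error rate at alpha = 0.05
mean(pAlt < 0.05) # estimated power; 1 minus this is the Type II error rate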