The question looks simple, but your reflection on it shows that it is not that simple.
Actually, p-values are a relatively late addition to the theory of statistics. Computing a p-value without a computer is very tedious; this is why the only way to perform a statistical test until recently was to use tables of statistical tests, as I explain in this blog post. Because those tables were computed for fixed $\alpha$ levels (typically 0.05, 0.01 and 0.001) you could only perform a test with those levels.
Computers made those tables useless, but the logic of testing is still the same. You should:
- Formulate a null hypothesis.
- Formulate an alternative hypothesis.
- Decide the maximum type I error (the probability of falsely rejecting the null hypothesis) that you are ready to accept.
- Design a rejection region. The probability that the test statistic falls in the rejection region given that the null hypothesis is true is your level $\alpha$. As @MånsT explains, this should be no larger than your acceptable type I error, and in many cases it is computed using asymptotic approximations.
- Carry out the random experiment, compute the test statistic and see whether it falls in the rejection region.
In theory, there is a strict equivalence between the events "the statistic falls in the rejection region" and "the p-value is less than $\alpha$", which is why it is felt that you can report the p-value instead. In practice, it allows you to skip the third step (fixing $\alpha$ in advance) and evaluate the type I error after the test is done.
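To make these steps and the equivalence concrete, here is a small R sketch for the coin-flipping example below; the sample size of 1000 flips and the quantile-based rejection region are just illustrative choices.

n <- 1000
alpha <- 0.05
# Two-sided rejection region {0, ..., lower} U {upper, ..., n}: its probability
# under H0 (prob = 1/2) is at most alpha.
lower <- qbinom(alpha / 2, size = n, prob = 0.5) - 1
upper <- qbinom(1 - alpha / 2, size = n, prob = 0.5) + 1
# Carry out the experiment and check whether the statistic falls in the region.
heads <- rbinom(1, size = n, prob = 0.5)
in_region <- (heads <= lower) || (heads >= upper)
# Because Binomial(n, 1/2) is symmetric, this decision agrees with comparing
# the binom.test p-value to alpha.
p_below_alpha <- binom.test(heads, n)$p.value < alpha
c(in_region = in_region, p_below_alpha = p_below_alpha)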
To come back to your post, the statement of the null hypothesis is incorrect. The null hypothesis is that the probability of flipping a head is $1/2$ (the null hypothesis cannot pertain to the results of the random experiment).
If you repeat the experiment again and again with a threshold p-value of 0.05, yes, you should reject in approximately 5% of cases. And if you set a p-value cut-off of 0.06, you should end up rejecting roughly 6% of the time. More generally, for continuous tests, by definition of the p-value $p$, under the null hypothesis
$$ \mathrm{Prob}(p < x) = x, \quad 0 < x < 1, $$
which is only approximately true for discrete tests.
Here is some R code that I hope can clarify this a bit. The binomial test is relatively slow, so I do only 10,000 random experiments, each consisting of 1000 coin flips. I perform a binomial test on each and collect the 10,000 p-values.
set.seed(123)
# Generate 10,000 random experiments of 1000 coin flips each
rexperiments <- rbinom(n=10000, size=1000, prob=0.5)
all_p_values <- rep(NA, 10000)
for (i in 1:10000) {
all_p_values[i] <- binom.test(rexperiments[i], 1000)$p.value
}
# Plot the cumulative density of p-values.
plot(ecdf(all_p_values))
# How many are less than 0.05?
mean(all_p_values < 0.05)
# [1] 0.0425
# How many are less than 0.06?
mean(all_p_values < 0.06)
# [1] 0.0491
You can see that the proportions are not exact, because the sample size is not infinite and the test is discrete, but there is still an increase of roughly 1% between the two.
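For comparison, here is a similar simulation with a continuous test (a one-sample t-test on standard normal samples, chosen purely for illustration); under the null hypothesis its p-values are exactly uniform, so the rejection proportions track the cut-offs more closely.

set.seed(123)
# 10,000 experiments, each a t-test of H0: mean = 0 on 50 standard normal draws.
t_p_values <- replicate(10000, t.test(rnorm(50))$p.value)
mean(t_p_values < 0.05)  # should be very close to 0.05
mean(t_p_values < 0.06)  # should be very close to 0.06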
ISTR there is a form of hypothesis testing where the null hypothesis is the thing you want to assert to be true. IIRC this is based on statistical power, which is the probability [in a frequentist sense] that the null hypothesis will be rejected when it is false. Therefore, if the p-value is above the significance level but the test has high statistical power, we would expect the null to be rejected if it were false; the fact that it isn't rejected suggests that it isn't false. Simple! ;o)
I'll see if I can remember what it is called and look it up, until then caveat lector!
Update: I think what I had in mind is "accept-support" hypothesis testing, rather than "reject-support" testing; see e.g. here.
Another (hopefully) illustrative update:
Climate skeptics often claim that there has been no global warming since 1998, frequently citing a BBC interview with Prof. Phil Jones of the Climatic Research Unit at UEA (where I also work). Prof. Jones was asked:
Q: Do you agree that from 1995 to the present there has been no statistically-significant global warming?
and answered:
A: Yes, but only just. I also calculated the trend for the period 1995 to 2009. This trend (0.12C per decade) is positive, but not significant at the 95% significance level. The positive trend is quite close to the significance level. Achieving statistical significance in scientific terms is much more likely for longer periods, and much less likely for shorter periods.
The test Jones is using here is the standard reject-support type of hypothesis test, where the null hypothesis is the opposite of what he would assert to be true:
H0: The rate of warming since 1995 is zero.
H1: The rate of warming since 1995 is greater than zero.
Over the period concerned, the observations are not sufficiently unlikely under the null hypothesis (p > 0.05), which is why Prof. Jones correctly said that there had not been statistically significant warming since 1995.
However, for a skeptic to use this test to support their view that there has been no global warming would not be a good idea, as they are arguing FOR the null hypothesis, and reject-support hypothesis testing is biased in favour of the null hypothesis: we start off by assuming that H0 is true and only proceed to H1 if H0 is inconsistent with the observations.
What a climate skeptic should do is perform an accept-support test: fix a significance level and then check whether there are enough observations for the test to have sufficient power to reject the null hypothesis with confidence if it were actually false. Sadly, computing statistical power is rather tricky (which is presumably why reject-support testing is more popular). It turns out that in this case the test doesn't have sufficient statistical power. Combining the two hypothesis tests, we find that the observations rule out neither the possibility that it hasn't warmed nor the possibility that it has continued to warm at the original rate (which is easily seen by looking at the confidence interval for the trend, without all this hassle).
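To give a flavour of what such a power calculation involves, here is a small simulation sketch in R. The trend of 0.12C per decade comes from the quote above, but the annual noise level, the record lengths, and the one-sided slope test are invented purely for illustration; they are not taken from the actual temperature data.

# Estimate, by simulation, the power of a one-sided test for a positive linear
# trend as a function of the length of the record (in years).
trend_power <- function(years, slope = 0.012, noise_sd = 0.15,
                        alpha = 0.05, nsim = 2000) {
  rejections <- replicate(nsim, {
    yr <- seq_len(years)
    temp <- slope * yr + rnorm(years, sd = noise_sd)   # true trend plus noise
    fit <- summary(lm(temp ~ yr))
    tval <- fit$coefficients["yr", "t value"]
    # one-sided p-value for H1: slope > 0
    pt(tval, df = fit$df[2], lower.tail = FALSE) < alpha
  })
  mean(rejections)
}
set.seed(1)
trend_power(15)   # a short record: low power
trend_power(30)   # a longer record: much higher power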
Note that Prof. Jones points out that the likelihood of finding statistically significant warming depends on the length of the period you look at, which suggests that he does understand the idea of the power of a test.
Hopefully this example illustrates that you can take H0 to be the thing that you want to be true, but doing so is considerably more complicated, so it is worth avoiding if you can. It is also a nice example of how the general public doesn't really understand statistical significance.
Best Answer
If you do a two-tailed test and the computation gives you $p=0.03$, then $p<0.05$ and the result is significant. If you do a one-tailed test instead, you will get a different p-value, depending on which tail you investigate: it will be either a lot larger or only half as big.
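Here is a small R illustration with made-up data (a one-sample t-test on draws whose true mean is 0.4) of how the one-tailed p-value relates to the two-tailed one for a symmetric test:

set.seed(1)
x <- rnorm(30, mean = 0.4)                            # sample mean is above 0
t.test(x, mu = 0)$p.value                             # two-tailed
t.test(x, mu = 0, alternative = "greater")$p.value    # half the two-tailed value
t.test(x, mu = 0, alternative = "less")$p.value       # close to 1: a lot larger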
$\alpha=0.05$ is the usual convention, no matter whether you test one- or two-tailed. You don't halve it (except perhaps for a Bonferroni correction, which is not the topic here). So yes, sometimes a one-tailed test will give you a significant result where the two-tailed test does not. However, this is not how things work: you have to decide upfront whether you consider a one- or a two-tailed test appropriate, just as you have to fix your $\alpha$-level upfront. Then you calculate the $p$-value for that test, and there are no remaining degrees of freedom in how to test or what to compare the $p$-value to. Deciding on the sidedness of your test depending on whether you like the result is not good scientific practice.
That being said, there is hardly ever a situation where it is appropriate to test one-tailed. In the vast majority of circumstances a significant result in either direction would be worth communicating. If you test one-tailed, some of your audience will consider it a trick to hack your $p$-value into being as small as possible.