With 15000 tests, you should indeed see a proportion of rejections close to the nominal error rate and a relatively clean histogram/density plot. If the error level is 1%, 3500 false rejections of the null is definitely a surprising result. My first guess would be that successive replicates are not as independent as you thought; I would therefore first try splitting each set of 6 observations randomly to assess that.
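For instance, something along these lines (a minimal sketch only: I am assuming your 15000 sets sit in a matrix dat with one row per set of 6 observations, and I am using a t-test as a stand-in for whatever test you actually run):
split_test <- function(x) {
  idx <- sample(6, 3)               # random split into two groups of 3
  t.test(x[idx], x[-idx])$p.value   # substitute your actual test here
}
p_split <- apply(dat, 1, split_test)
hist(p_split)   # should look uniform if the 6 observations are exchangeable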
Also inspect some of these p-values directly and look at a stripchart or a density plot, just to make sure the histogram is not misleading you and that your code is fine. You can also try simulated data, again to make sure everything else in your procedure is working as intended.
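Concretely, with pvals standing for your vector of observed p-values:
stripchart(pvals, method = "jitter", pch = ".")
plot(density(pvals, from = 0, to = 1))
head(sort(pvals))   # look at the smallest p-values directly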
One thing you did not specify is the nature of the data and the specific test you are using. With only 3 observations in each group, the sampling distribution of some test statistics (e.g. rank-based tests) can be very unusual (but usually not in the way you describe, I would think).
Just a quick illustration of what I mean, with a simulation (the code is in R).
set.seed(4123412) # Makes the whole thing reproducible
# Some test with 3 observations per group; the null hypothesis is true by construction
rndtest1 <- function() {
  t.test(rnorm(3), rnorm(3))$p.value
}
First, a simulation with 50 tests in total:
dist1 <- replicate(50, rndtest1())
hist(dist1)
As you can see, the histogram is quite bumpy: with only 50 p-values you get just a rough idea of the distribution (or of anything else, really).
Now, a simulation with 15000 tests:
dist2 <- replicate(15000, rndtest1())
hist(dist2)
Here the histogram looks almost the way you would want it to look under the null, i.e. like a uniform distribution.
(There is, however, a little quirk on the left-hand side of the plot. Indeed, the test is a little conservative:
> sum(dist2 < .05)/15000
[1] 0.03526667
> sum(dist2 < .01)/15000
[1] 0.006133333
This is an artifact of the small sample size combined with the correction for unequal variances (the Welch adjustment that t.test applies by default). Without the latter, the histogram would be flat, but that issue is unrelated to the point I am making.)
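If you want to check this, rerunning the simulation with the pooled-variance version of the test (var.equal = TRUE) should bring the rejection rates back close to their nominal levels:
rndtest1b <- function() {
  t.test(rnorm(3), rnorm(3), var.equal = TRUE)$p.value   # classical pooled-variance t-test
}
dist2b <- replicate(15000, rndtest1b())
sum(dist2b < .05)/15000   # should be close to 0.05
sum(dist2b < .01)/15000   # should be close to 0.01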
Finally, my other point, a simulation with a test behaving differently with very small samples:
rndtest2 <- function() {
  wilcox.test(rnorm(3), rnorm(3))$p.value
}
dist3 <- replicate(15000, rndtest2())
hist(dist3)
In fact, the p-value distribution is discrete:
> xtabs(~dist3)
dist3
 0.1  0.2  0.4  0.7    1
 985  989 1976 3039 3011
and the test can therefore never, ever, reject the null hypothesis at the 5% error level: the smallest attainable p-value is 0.1. This is why more information on what the data and the test exactly are could be useful to spot other problems. In any case, 15000 tests should be enough to get a good idea of the p-value distribution and, hopefully, show you uniform-looking p-values under the null.
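(If you want to see where these attainable values come from, you can query the exact null distribution of the rank-sum statistic directly; pwilcox gives its cumulative distribution function:
2 * pwilcox(0, 3, 3)   # = 0.1, the smallest possible two-sided p-value with n = m = 3
With only 20 ways to split 6 ranks into two groups of 3, even the most extreme arrangement has probability 1/20 = 0.05, so the two-sided p-value can never go below 0.1.)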
First, let us formalize the hypotheses that you are testing:
$$H_0 = \{ \text{using condition \#1 gives the same results as using condition \#2} \}$$
versus
$$H_1 = \{ \text{using condition \#1 gives different results than using condition \#2} \}$$
(I corrected your $H_0$, I think you want to formulate it like this)
Then you can formulate this hypothesis as an intersection of smaller hypotheses:
$$H_0 = \bigcap_{i,j \in \{1,2,3,4,5\}} \{ \text{time series } i \text{ from condition 1 and time series } j \text{ from condition 2 are comparable} \}$$
Is that correct?
Further, in this formulation there are only $5 \times 5 = 25$ smaller hypotheses to test, not 120...
With respect to your question, I would say no, because significance implies that you are controlling the probability of a type I error. Instead, you are controlling the FDR, which is a different concept.
Best Answer
This is a good question, but you have several concepts confused.
Firstly, to answer your broader question: yes, splitting p-values into strata and performing the correction on them separately is a well-known and often-used approach when you have prior information about the system you are studying. To see more examples of this, and a proof for independent tests, see Lei Sun et al.$^1$ The idea behind this approach is to maintain the same level of FDR control while increasing the number of true positives that you find, so everyone wins!
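A toy illustration of the idea in R (my own sketch, not the procedure from the Sun paper): simulate two strata, one containing some true signals and one containing only nulls, and apply Benjamini-Hochberg within each stratum versus on the pooled p-values. The stratified analysis will typically reject more tests, because the signal-rich stratum is not diluted by the all-null one.
set.seed(1)
p_a <- c(rbeta(200, 0.1, 1), runif(800))   # stratum A: 20% true signals
p_b <- runif(1000)                         # stratum B: all true nulls
pooled     <- p.adjust(c(p_a, p_b), method = "BH") < 0.05
stratified <- c(p.adjust(p_a, method = "BH"),
                p.adjust(p_b, method = "BH")) < 0.05
c(pooled = sum(pooled), stratified = sum(stratified))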
The answer to your actual question (why not take the trivial case where every test is its own stratum) lies in the behavior of the estimators. As detailed in the Sun paper, to use a fixed-FDR framework (i.e. keep the same FDR and reject as many tests as possible, leading to more true positives being found), one must estimate $\pi_0$, the proportion of true null hypotheses, with an estimator $\hat{\pi}_0$, and this estimation has been a contentious issue in the field for many years now. You can read about this in the Methods section under Estimating $\pi_0$ and FDR, and in the Discussion section. The bias of estimating $\pi_0$ does not increase with the number of strata, but the variance does; so as we increase the number of strata, we get a worse and worse estimator of $\pi_0$, and hence a worse and worse estimator of $\alpha^{(k)}$, the p-value threshold needed to reject the null in stratum $k$ while maintaining overall FDR control at level $\gamma$. If I had to guess (I have no proof of this other than intuition), I would say that in the trivial case where every test is its own stratum, the expected value of $\alpha^{(k)}$ for all $k$ would simply be $\gamma$, which negates the purpose of the exercise.
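For concreteness, a common estimator of this general kind is Storey's (the details in Sun et al. may differ): pick a tuning threshold $\lambda$ and count the p-values above it, a region that should be populated mostly by true nulls,
$$\hat{\pi}_0(\lambda) = \frac{\#\{p_i > \lambda\}}{m(1-\lambda)},$$
where $m$ is the number of tests in the stratum. The numerator is a count over the stratum's own p-values, so as strata shrink the estimator gets noisier; with a single test per stratum it degenerates to a scaled 0/1 variable, which is no estimate at all.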
I would like to know where you got the notion that FDR is overly cautious; that is very much not the impression I have. Indeed, many people have found that the empirical FDR closely matches the nominal control rate, except under certain special kinds of dependence (see a discussion of that here). What you found in your simulation was not the FDR but the false positive rate. You had no true positive associations, so whenever you rejected anything your FDR was 1 by definition: the FDR is the expected proportion of false positives among all positives. You generated random data and computed 1) the p-value and 2) the BH-corrected p-value (the $q$-value is actually a different concept, specific to John Storey's implementation$^2$). You found that 5% of the uncorrected p-values rejected the null even though every null was true, which is exactly what setting $\alpha$ to 0.05 means. You also found that 0% of the FDR-corrected p-values rejected the null, which is entirely to be expected: you had no true positives to identify, and all your results were within the realm of chance. So really, you found that FDR was doing exactly what it was supposed to do!
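You can see this directly in a stripped-down version of your simulation. Under a global null, the raw p-values reject at the nominal rate, while the BH-adjusted p-values almost never reject anything (under the global null, the probability that BH makes any rejection at all is at most $\alpha$):
p <- runif(15000)                         # global null: no true positives anywhere
mean(p < 0.05)                            # close to 0.05, by construction
mean(p.adjust(p, method = "BH") < 0.05)   # almost always exactly 0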
[1] Sun L, Craiu RV, Paterson AD and Bull SB (2006). Stratified false discovery control for large-scale hypothesis testing with application to genome-wide association studies. Genetic Epidemiology 30:519-530.
[2] Storey JD (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics 31:2013-2035. https://projecteuclid.org/euclid.aos/1074290335