With 15000 tests, you should indeed see a proportion of rejections close to the nominal error rate and a relatively clean histogram/density plot. If the error level is 1%, 3500 false rejections of the null (instead of the roughly 150 you would expect) is definitely a surprising result. My first guess would be that successive replicates are not as independent as you thought. I would therefore first try splitting each set of 6 observations randomly to assess that.
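A minimal sketch of such a check (assuming, hypothetically, that each replicate's 6 observations sit in one row of a matrix dat, and using a t-test as a stand-in for whatever test you are running):
split_test <- function(x) {
  g <- sample(rep(1:2, each = 3)) # random split into two groups of 3
  t.test(x[g == 1], x[g == 2])$p.value
}
# pvals_split <- apply(dat, 1, split_test)
# Under random splits the null holds by construction, so these p-values should look uniform.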
Also inspect some of these p-values and look at a stripchart or a density plot, just to make sure the histogram is not misleading you and your code is fine. You can also try simulated data, again to make sure everything else in your procedure is working as intended.
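For example, if your p-values are in a vector pvals:
head(sort(pvals)) # inspect the smallest p-values directly
stripchart(pvals, method = "jitter", pch = 20)
plot(density(pvals, from = 0, to = 1))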
One thing you did not specify is the nature of the data and the specific test you are using. With only 3 observations in each group, the sampling distribution of some test statistics (e.g. rank-based tests) can be very unusual (but usually not in the way you describe, I would think).
Just a quick illustration of what I mean, with a simulation (the code is in R).
set.seed(4123412) # Makes the whole thing reproducible
# Some test with 3 observations per group; the null hypothesis is true by construction
rndtest1 <- function() {
  t.test(rnorm(3), rnorm(3))$p.value
}
First, a simulation with 50 tests in total:
dist1 <- replicate(50, rndtest1())
hist(dist1)
As you can see, the histogram is quite bumpy, because with only 50 values you get a rough idea of the distribution (or of anything else, really).
Now, a simulation with 15000 tests:
dist2 <- replicate(15000, rndtest1())
hist(dist2)
Here the histogram looks almost the way you want it to look under the null, i.e. like a uniform distribution.
(There is however a little quirk on the left-hand side of the plot. Indeed, the test is a little conservative:
> sum(dist2 < .05)/15000
[1] 0.03526667
> sum(dist2 < .01)/15000
[1] 0.006133333
This is an artifact of the small sample size combined with the Welch correction for unequal variances; without the latter the histogram would be flat. Either way, this issue is unrelated to the point I am making.)
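To illustrate that side point, a sketch with the pooled-variance version of the same test (exact proportions will of course vary with the seed):
rndtest1b <- function() {
  t.test(rnorm(3), rnorm(3), var.equal = TRUE)$p.value # pooled variances, no Welch correction
}
dist2b <- replicate(15000, rndtest1b())
sum(dist2b < .05)/15000 # should now be close to 0.05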
Finally, to my other point: a simulation with a test that behaves differently with very small samples:
rndtest2 <- function() {
  wilcox.test(rnorm(3), rnorm(3))$p.value
}
dist3 <- replicate(10000, rndtest2())
hist(dist3)
In fact, the p-value distribution is discrete:
> xtabs(~dist3)
dist3
 0.1  0.2  0.4  0.7    1
 985  989 1976 3039 3011
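Even the most extreme arrangement of the six observations cannot produce anything smaller:
wilcox.test(1:3, 4:6)$p.value # 0.1, the smallest achievable two-sided p-value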
Such a test can therefore never, ever, reject the null hypothesis at the 5% error level. This is why more information on exactly what the data and the test are could be useful for spotting other problems. In any case, 15000 tests should be enough to get a good idea of the p-value distribution and, hopefully, to see uniform-looking p-values under the null.
A point-by-point response to your questions:
You do not say what kind of test statistic your $p$-values come from. If you are talking about continuous distributions, such as those of $t$ or $z$ statistics, then technically all of your $p$-values will be strictly less than 1, although some of them may be very close to 1.
You test a bunch of hypotheses, and some of them are significant (without multiple comparisons adjustments), and some of them are not. Great.
Generally, one does not need to remove any $p$-values prior to conducting multiple comparisons adjustments with step-wise adjustment procedures (the FDR gives the same results for a given level of $\alpha$ either way). All but one adjusted $p$-value (i.e. the $q$-values) will always be at least as large as the corresponding unadjusted $p$-value. Conversely, one can think of multiple comparisons adjustments as adjusting the rejection probability (e.g. $\alpha$) rather than adjusting the $p$-values, in which case all but one of the adjusted rejection probabilities are less than the nominal Type I error rate. One advantage of working the math out this way is that one never has to truncate adjusted $p$-values at the value 1.
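For instance, in R, with made-up $p$-values just to show the behaviour of the two conventions:
p <- c(0.02, 0.3, 0.6)
p.adjust(p, method = "BH") # q-values 0.06 0.45 0.60: each at least as large as p, the largest unchanged
p.adjust(p, method = "bonferroni") # 0.06 0.90 1.00: note the truncation at 1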
It sounds like, after adjustment for multiple comparisons using the FDR, you would not reject any hypotheses. This is a possibility (without seeing your vector of $p$-values it is not possible to show you the math).
The FDR does not assume a uniform distribution of $p$-values.
You are seemingly not making any mistake, other than being surprised that your results do not match your expectations.
Update: have a look at this spreadsheet, which produces both the adjusted alpha (i.e. the FDR) and, alternatively, the adjusted $p$-values, for the 927 $p$-values in the spreadsheet you supplied.
Notice that:
(1) column B contains the $p$-values $<1$, sorted largest to smallest;
(2) column C contains the sorting order ($i$);
(3) the adjusted $\frac{\alpha}{2} = \frac{0.05}{2}\times\frac{927+1-i}{927}$;
(4) the adjusted $p$-values are $q_{i}=\frac{927}{927+1-i}p_{i}$; and
(5) you would reject the hypotheses corresponding to the two smallest $p$-values, because (a) $3.78\times 10^{-5} < 5.39\times 10^{-5}$ (i.e. $p_{926} < \alpha_{926}^{*}$), or, equivalently, (b) $0.0175 < 0.025$ (i.e. $q_{926} < \frac{\alpha}{2}$).
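A sketch of the same computation in R, with a hypothetical vector pvals standing in for the 927 values from the spreadsheet (note that p.adjust additionally enforces monotonicity of the $q$-values, which the raw ratios in step (4) do not):
ps <- sort(pvals) # smallest to largest, so rank k = 927 + 1 - i
k <- seq_along(ps)
alpha_star <- (0.05/2) * k/length(ps) # the adjusted alpha of step (3)
qvals <- p.adjust(ps, method = "BH") # the adjusted p-values of step (4)
sum(qvals <= 0.05/2) # number of rejections; 2 for the spreadsheet data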
Best Answer
The Benjamini-Hochberg procedure is valid as long as the null $p$-values are superuniform, which means:
$$ \Pr[P_i \leq t \mid H_0] \leq t $$
This holds with "$=$" for uniform null $p$-values. It is also true for U-shaped mixture distributions: if the left peak of the U corresponds to alternatives, then a uniform component plus a peak close to 1 will correspond to the null distribution, which is consequently superuniform. Superuniformity also holds for the discrete $p$-value distributions of valid tests (which cannot be uniform precisely because of their discreteness).
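As a quick illustration of the discrete case, the exact null distribution of the $n = m = 3$ Wilcoxon $p$-value from the first answer can be checked for superuniformity directly. Since the CDF is a step function, it suffices to verify $\Pr[P_i \leq t \mid H_0] \leq t$ at the support points, and working in units of $1/20$ (there are $\binom{6}{3} = 20$ equally likely group assignments) keeps the arithmetic exact:
pv20 <- c(2, 4, 8, 14, 20) # the p-values 0.1, 0.2, 0.4, 0.7, 1, times 20
prob20 <- c(2, 2, 4, 6, 6) # their exact null probabilities, times 20
all(cumsum(prob20) <= pv20) # TRUE: Pr[P <= t] <= t everywhere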