"Multiple comparisons" is the name attached to the general problem of making decisions based on the results of more than one test. The nature of the problem is made clear by the famous XKCD "Green jelly bean" cartoon in which investigators performed hypothesis tests of associations between consumption of jelly beans (of 20 different colors) and acne. One test reported a p-value less than $1/20$, leading to the conclusion that "green jelly beans cause acne." The joke is that p-values, by design, have a $1/20$ chance of being less than $1/20$, so intuitively we would expect to see a p-value that low among $20$ different tests.
What the cartoon does not say is whether the $20$ tests were based on separate datasets or one dataset.
With separate datasets, each of the $20$ results has a $1/20$ chance of being "significant." Basic properties of probabilities (of independent events) then imply that the chance all $20$ results are "insignificant" is $(1-0.05)^{20}\approx 0.36$. The remaining chance of $1-0.36 = 0.64$ is large enough to corroborate our intuition that a single "significant" result in this large group of results is no surprise; no cause can validly be assigned to such a result except the operation of chance.
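A quick check of this arithmetic in R:

```r
# Chance that all 20 independent results are "insignificant" when every
# null hypothesis is true, and its complement:
(1 - 0.05)^20      # ~0.36
1 - (1 - 0.05)^20  # ~0.64: one "significant" result among 20 is no surprise
```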
If the $20$ results were based on a common dataset, however, the preceding calculation would be erroneous: it assumes all $20$ outcomes were statistically independent. But why wouldn't they be? Analysis of Variance provides a standard example: when comparing two or more treatment groups against a control group, each comparison involves the same control results, so the comparisons are not independent. A chance fluctuation in the control results alone could, for instance, shift the comparison with every treatment group at once, producing several "significant" differences simultaneously.
(ANOVA handles this problem by means of its overall F-test: one comparison "to rule them all." We do not trust any group-to-group comparison unless that F-test is itself significant.)
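A small simulation can illustrate this dependence; the sample sizes and Normal data below are purely illustrative assumptions, not taken from anything above:

```r
# Two treatment-vs-control t-tests that share the same control sample:
set.seed(1)
p <- replicate(10000, {
  control <- rnorm(10)                   # one control sample, reused twice
  c(t.test(rnorm(10), control)$p.value,  # treatment 1 vs these controls
    t.test(rnorm(10), control)$p.value)  # treatment 2 vs the same controls
})
cor(p[1, ], p[2, ])  # clearly positive: the two p-values are not independent
```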
We can abstract the essence of this situation with the following framework. Multiple comparisons concerns making a decision from the p-values $(p_1, p_2, \ldots, p_n)$ of $n$ distinct tests. Those p-values are random variables. Assuming all the corresponding null hypotheses are logically consistent, each should have a uniform distribution under its null hypothesis. When we know their joint distribution, we can construct reasonable ways to combine all $n$ of them into a single decision. Otherwise, the best we can usually do is rely on approximate bounds (which is the basis of the Bonferroni correction, for instance).
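As a sketch of that bound-based approach, a Bonferroni-style rule needs no knowledge of the joint distribution (the function name here is just illustrative):

```r
# Reject the global null if any p-value clears alpha / n; by Boole's
# inequality this keeps the familywise error rate at most alpha,
# whatever the dependence among the p-values.
bonferroni_reject <- function(p, alpha = 0.05) {
  any(p <= alpha / length(p))
}
bonferroni_reject(c(0.001, 0.20, 0.04))  # TRUE, since 0.001 <= 0.05/3
```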
Joint distributions of independent random variables are easy to compute. The literature therefore distinguishes between this situation and the case of non-independence.
Accordingly, the correct meaning of "independent" in the quotations is in the usual statistical sense of independent random variables.
Note that an assumption was needed to arrive at this conclusion: namely, that all $n$ of the null hypotheses are logically consistent. As an example of what is being avoided, consider conducting two tests with a batch of univariate data $(x_1, \ldots, x_m)$ assumed to be a random sample from a Normal distribution of unknown mean $\mu$. The first is a t-test of $\mu=0$, with p-value $p_1$, and the second is a t-test of $\mu=1$, with p-value $p_2$. Since both cannot logically hold simultaneously, it would be problematic to talk about "the null distribution" of $(p_1, p_2)$. In this case there can be no such thing at all! Thus the very concept of statistical independence sometimes cannot even apply.
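Concretely, the pair of tests described here might look like this (the data are made up for illustration):

```r
# Two t-tests of logically incompatible nulls, computed from one sample:
set.seed(1)
x  <- rnorm(30, mean = 0.5)      # hypothetical data
p1 <- t.test(x, mu = 0)$p.value  # H0: mu = 0
p2 <- t.test(x, mu = 1)$p.value  # H0: mu = 1 -- both nulls cannot be true,
                                 # so (p1, p2) has no joint null distribution
```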
By coincidence, I read this same paper just a couple of weeks ago. Colquhoun mentions multiple comparisons (including Benjamini-Hochberg) in section 4 when posing the problem, but I find that he does not make the issue clear enough -- so I am not surprised to see your confusion.
The important point to realize is that Colquhoun is talking about the situation without any multiple-comparison adjustments. One can understand his paper as adopting a reader's perspective: he essentially asks what false discovery rate (FDR) he can expect when reading the scientific literature, that is, the expected FDR when no multiple-comparison adjustments were made.
Multiple comparisons can be taken into account when running multiple statistical tests in one study, e.g. in one paper. But nobody ever adjusts for multiple comparisons across papers.
If you actually control the FDR, e.g. by following the Benjamini-Hochberg (BH) procedure, then it will be controlled. The problem is that running the BH procedure separately in each study does not guarantee overall FDR control.
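For reference, a minimal sketch of the BH step-up procedure as applied within a single study; base R's `p.adjust(p, method = "BH")` yields the same rejections:

```r
# Reject the k smallest p-values, where k is the largest index with
# p_(k) <= q * k / n; this controls the FDR at level q within this
# one family of tests (not across separately analyzed studies).
bh_reject <- function(p, q = 0.05) {
  n <- length(p)
  ord <- order(p)
  hits <- which(p[ord] <= q * seq_len(n) / n)
  rejected <- rep(FALSE, n)
  if (length(hits)) rejected[ord[seq_len(max(hits))]] <- TRUE
  rejected
}
```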
Can I safely assume that in the long run, if I do such analyses on a regular basis, the FDR is not $30\%$ but below $5\%$, because I used Benjamini-Hochberg?
No. If you use the BH procedure in every paper, but independently in each of your papers, then across papers you can essentially interpret your BH-adjusted $p$-values as ordinary $p$-values, and what Colquhoun says still applies.
General remarks
The answer to Colquhoun's question about the expected FDR is difficult to give because it depends on various assumptions. If, e.g., all the null hypotheses are true, then the FDR will be $100\%$ (i.e. all "significant" findings would be statistical flukes). And if all nulls are in reality false, then the FDR will be zero. So the FDR depends on the proportion of true nulls, and this is something that has to be externally estimated or guessed in order to estimate the FDR. Colquhoun gives some arguments in favor of the $30\%$ number, but this estimate is highly sensitive to the assumptions.
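To make the dependence on assumptions concrete, here is one illustrative calculation in the spirit of Colquhoun's argument; the prevalence and power numbers are assumptions, not estimates:

```r
prev  <- 0.10  # assumed fraction of tested hypotheses with a real effect
power <- 0.80  # assumed power against those real effects
alpha <- 0.05
false_pos <- (1 - prev) * alpha  # significant results among true nulls
true_pos  <- prev * power        # significant results among real effects
false_pos / (false_pos + true_pos)  # 0.36: expected FDR at p < 0.05
```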
I think the paper is mostly reasonable, but I dislike that it makes some claims sound way too bold. E.g. the first sentence of the abstract is:
If you use $p=0.05$ to suggest that you have made a discovery, you will be wrong at least $30\%$ of the time.
This is formulated too strongly and can actually be misleading.
Best Answer
I would not recommend you use that procedure. My recommendation is: Abandon this project. Just give up and walk away. You have no hope of this working.
Setting aside the standard problems with stepwise selection (cf., here), in your case you are very likely to have perfect predictions due to separation in such a high-dimensional space.
I don't have specifics on your situation, but you state that you have "only a few 10s of samples". Let's be charitable and say you have 90. You further say you have "several thousand features". Let's imagine that you 'only' have 2,000. For the sake of simplicity, let's say that all your features are binary. You "believe that the class label can be accurately predicted using only a few features", so let's say you will look only for sets of at most 9 features. Lastly, let's imagine that the relationship is deterministic, so that the true relationship will always be perfectly present in your data. (We can change these numbers and assumptions, but that should only make the problem worse.) Now, how well would you be able to recover that relationship under these (generous) conditions? That is, how often would the correct set be the only set that yields perfect accuracy? Or, put another way, how many sets of nine features will also fit by chance alone?
Some (overly) simple math and simulations should provide some clues to this question. First, with 9 binary variables, the number of patterns an observation could show is $2^9 = 512$, but you will have only 90 observations. Thus it is entirely possible that, for a given set of 9 binary variables, every observation has a different set of predictor values: there are no replicates. Without replicates having the same predictor values but differing y-values (some 0 and some 1), you will have complete separation, and perfect prediction of every observation will be possible.
Below, I have a simulation (coded in R) to see how often you might have no pattern of x-values shared by both 0s and 1s. The way it works is that I draw a set of numbers from 1 to 512, representing the possible patterns, and check whether any of the patterns in the first 45 observations (which might be the 0s) match any of the patterns in the second 45 (which might be the 1s). This assumes that you have perfectly balanced response data, which gives you the best possible protection against this problem. Note that having some replicated x-vectors with differing y-values doesn't really get you out of the woods; it just means you wouldn't be able to perfectly predict every single observation in your dataset, which is the very stringent standard I'm using here.
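A sketch of that simulation, under the assumptions just stated (45 observations per class, patterns drawn uniformly from the 512 possibilities):

```r
set.seed(90)
separated <- replicate(10000, {
  pats <- sample(512, 90, replace = TRUE)  # one pattern index per observation
  !any(pats[1:45] %in% pats[46:90])        # TRUE: no pattern in both classes
})
mean(separated)  # ~0.018: fraction of variable sets giving complete separation
```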
The simulation suggests you would have this issue with approximately 1.8% of the sets of 9 x-variables. Now, how many sets of 9 are there? Strictly, that would be the number of ways to choose 9 of the remaining variables, $\binom{1991}{9} \approx 1.3\times 10^{24}$ (since we've stipulated that the true 9 deterministic causal variables are in your set, $2000 - 9 = 1991$ others remain). However, many of those sets will be overlapping; there will be $1991 / 9 \approx 221$ non-overlapping sets of 9 within a specified partition of your variables (with many such partitions possible). Thus, within some given partition, we might expect there would be $221\times 0.018\approx 4$ sets of 9 x-variables that will perfectly predict every observation in your dataset.
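The counts used above, computed directly:

```r
choose(1991, 9)     # ~1.3e24 possible sets of nine of the remaining variables
(1991 / 9) * 0.018  # ~4 perfectly separating sets within one partition
```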
Note that these results apply only to cases where your dataset is at the large end of the "tens" (90), your number of variables is at the small end of the "thousands" (2,000), and where we look only for sets that predict every single observation perfectly (there will be many more sets that are nearly perfect), etc. Your actual case is unlikely to work out 'this well'. Moreover, we stipulated that the relationship is perfectly deterministic. What would happen if there is some random noise in the relationship? In that case, you will still have ~4 (null) sets that perfectly predict your data, but the right set may well not be among them.
Tl;dr, the basic point here is that your set of variables is way too large / high-dimensional and your amount of data way too small for anything to work. If it's really true that you have "tens" of samples, "thousands" of variables, and absolutely no earthly idea which variables might be right, you have no hope of getting anywhere with any procedure. Go do something else with your time.