Solved – Sane stepwise regression

logistic, multiple-comparisons, regression, stepwise-regression

Suppose I want to build a binary classifier. I have several thousand features and only a few 10s of samples. From domain knowledge, I have a good reason to believe that the class label can be accurately predicted using only a few features, but I have no idea which ones. I also want the final decision rule to be easy to interpret/explain, further necessitating a small number of features. Certain subsets of my features are highly correlated, so selecting the most predictive few independently wouldn't work. I also want to be able to meaningfully do hypothesis testing on my features.

Is the following stepwise regression procedure reasonable under these conditions:

  1. Given the features already in the model (or just the intercept on the first iteration), select the feature that produces the largest log likelihood ratio when added to the model. Use the likelihood ratio chi-square test to calculate a nominal P-value for each hypothesis test performed in this selection. The null here is that adding the extra variable to the model provides no additional predictive ability. The alternative is that it does increase predictive ability.

  2. Treat the hypotheses tested in Step 1 of each iteration as a family and calculate the false discovery rate for the smallest P-value (for the feature selected) using something like Benjamini-Hochberg.

  3. Go to Step 1 unless some stopping criterion is met.

  4. Report the false discovery rates for the individual features, but not the P-value for the model as a whole (since this will be massively inflated). Each of these multiple testing corrected P-values represents the statistical significance of that feature given all of the features previously added to the model.
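In code, the procedure I have in mind would look roughly like the sketch below (purely illustrative; the function name, the FDR cutoff, and the cap on steps are placeholders I made up, and I assume y is a binary outcome and X a data frame of candidate features):

forward_lrt_bh <- function(y, X, fdr_cut = 0.05, max_steps = 10) {
  dat <- data.frame(y = y, X)
  selected <- character(0)
  out <- list()
  repeat {
    remaining <- setdiff(colnames(X), selected)
    if (length(remaining) == 0 || length(selected) >= max_steps) break
    base_form <- paste("y ~", if (length(selected)) paste(selected, collapse = " + ") else "1")
    base_fit  <- glm(as.formula(base_form), family = binomial, data = dat)
    # Step 1: likelihood-ratio chi-square test for each candidate feature
    pvals <- sapply(remaining, function(v) {
      fit <- glm(as.formula(paste(base_form, "+", v)), family = binomial, data = dat)
      anova(base_fit, fit, test = "LRT")[2, "Pr(>Chi)"]
    })
    # Step 2: treat this iteration's tests as one family; BH-adjust
    qvals <- p.adjust(pvals, method = "BH")
    best  <- names(which.min(pvals))
    # Step 3: stop when even the best candidate fails the FDR cutoff
    if (qvals[best] > fdr_cut) break
    selected <- c(selected, best)
    out[[best]] <- c(p.nominal = unname(pvals[best]), fdr = unname(qvals[best]))
  }
  # Step 4: report per-feature FDR values, not a whole-model P-value
  do.call(rbind, out)
}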

Does doing something like this under these circumstances successfully avoid all of the typical criticisms of stepwise regression? Are the false discovery rates calculated in this way reasonable?

Best Answer

I would not recommend you use that procedure. My recommendation is: Abandon this project. Just give up and walk away. You have no hope of this working.

[Image: Gustave Doré illustration of Dante's Inferno, "Abandon all hope"]

Setting aside the standard problems with stepwise selection (cf., here), in your case you are very likely to have perfect predictions due to separation in such a high-dimensional space.
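To see what separation does to a logistic model, here is a toy illustration (made-up data, not yours): a single binary feature that happens to split the classes perfectly is enough to make glm() complain about fitted probabilities of 0 or 1, with a coefficient estimate that runs off toward infinity.

# Toy example of complete separation: one binary predictor that happens
# to split the two classes perfectly.
y <- c(0, 0, 0, 0, 1, 1, 1, 1)
x <- c(0, 0, 0, 0, 1, 1, 1, 1)
fit <- glm(y ~ x, family = binomial)
# glm() warns "fitted probabilities numerically 0 or 1 occurred"; the
# estimate for x is huge with an enormous standard error, so any test of
# that coefficient is meaningless.
summary(fit)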

I don't have specifics on your situation, but you state that you have "only a few 10s of samples". Let's be charitable and say you have 90. You further say you have "several thousand features". Let's imagine that you 'only' have 2,000. For the sake of simplicity, let's say that all your features are binary. You "believe that the class label can be accurately predicted using only a few features"; let's say that you will look for sets of at most 9 features. Lastly, let's imagine that the relationship is deterministic, so that the true relationship will always be perfectly present in your data. (We can change these numbers and assumptions, but that should only make the problem worse.) Now, how well would you be able to recover that relationship under these (generous) conditions? That is, how often would the correct set be the only set that yields perfect accuracy? Or, put another way, how many sets of nine features will also fit by chance alone?

Some (overly) simple math and simulations should provide some clues to this question. First, with 9 variables, each of which could be 0 or 1, the number of patterns an observation could show is $2^9 = 512$, but you will have only 90 observations. Thus it is entirely possible that, for a given set of 9 binary variables, every observation has a different set of predictor values, i.e., there are no replicates. Without replicates that share the same predictor values but have some y=0 and some y=1, you will have complete separation, and perfect prediction of every observation will be possible.

Below, I have a simulation (coded in R) to see how often you might have no patterns of x-values with both 0s and 1s. The way it works is that I get a set of numbers from 1 to 512, which represent the possible patterns, and see if any of the patterns in the first 45 (which might be the 0s) match any of the patterns in the second 45 (which might be the 1s). This assumes that you have perfectly balanced response data, which gives you the best possible protection against this problem. Note that having some replicated x-vectors with differing y-values doesn't really get you out of the woods; it just means you wouldn't be able to perfectly predict every single observation in your dataset, which is the very stringent standard I'm using here.

set.seed(7938)  # this makes the simulation exactly reproducible
my.fun = function(){
  # draw 90 observations' patterns at random from the 512 possible patterns
  x = sample.int(512, size=90, replace=TRUE)
  # TRUE if no pattern in the first 45 (the 0s) also appears in the second 45 (the 1s)
  return(sum(x[1:45] %in% x[46:90])==0)
}
n.unique = replicate(10000, my.fun())  # repeat the check 10,000 times
mean(n.unique)  # [1] 0.0181

The simulation suggests you would have this issue with approximately 1.8% of the sets of 9 x-variables. Now, how many sets of 9 are there? Strictly, that would be $\binom{1991}{9} \approx 1.3\times 10^{24}$ (since we've stipulated that the true 9 deterministic causal variables are in your set). However, many of those sets will be overlapping; there will be $1991 / 9 \approx 221$ non-overlapping sets of 9 within a specified partition of your variables (with many such partitions possible). Thus, within some given partition, we might expect there would be $221\times 0.018\approx 4$ sets of 9 x-variables that will perfectly predict every observation in your dataset.
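Here is that back-of-the-envelope arithmetic in R (the 0.0181 is just the simulated proportion from above):

choose(1991, 9)       # ~1.3e24 possible sets of 9 of the non-causal variables
1991 / 9              # ~221 non-overlapping sets of 9 within one partition
(1991 / 9) * 0.0181   # ~4 such sets expected to separate perfectly by chance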

Note that these results assume a relatively favorable case: a relatively larger dataset (within the "tens"), a relatively smaller number of variables (within the "thousands"), and counting only sets where every single observation can be predicted perfectly (there will be many more sets that are nearly perfect), etc. Your actual case is unlikely to work out 'this well'. Moreover, we stipulated that the relationship is perfectly deterministic. What would happen if there is some random noise in the relationship? In that case, you will still have ~4 (null) sets that perfectly predict your data, but the right set may well not be among them.

Tl;dr, the basic point here is that your set of variables is far too large / high dimensional, and your amount of data is far too small, for anything like this to work. If it's really true that you have "tens" of samples, "thousands" of variables, and absolutely no earthly idea which variables might be right, you have no hope of getting anywhere with any procedure. Go do something else with your time.