If you're going to use a stepwise procedure, don't resample. Create one random subsample once and for all. Perform your analysis on it. Validate the results against the held-out data. It's likely most of the "significant" variables will turn out not to be significant.
(Edit 12/2015: You can indeed go beyond such a simple approach by resampling, repeating the stepwise procedure, and re-validating: this will lead you into a form of cross-validation. But in such a case more sophisticated methods of variable selection, such as ridge regression, the Lasso, and the Elastic Net are likely preferable to stepwise regression.)
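In R, the glmnet package implements all three of those penalized methods; below is a minimal sketch of a cross-validated lasso fit, assuming a numeric predictor matrix X and a 0/1 outcome vector y (both names are illustrative, not from the question).

library(glmnet)
set.seed(1)
# alpha = 1 is the lasso; alpha values between 0 and 1 give the elastic net
cv.fit = cv.glmnet(X, y, family="binomial", alpha=1)
coef(cv.fit, s="lambda.1se")  # coefficients at the 1-SE lambda; many will be exactly zero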
Focus on the variables that make sense, not those that fit the data a little better. If you have more than a handful of variables for 330 records, you're at great risk of overfitting in the first place. Consider using fairly severe entering and leaving criteria for the stepwise regression, and base them on AIC or $C_p$ rather than on thresholds for $F$ tests or $t$ tests.
(I presume you have already carried out the analysis and exploration to identify appropriate re-expressions of the independent variables, that you have identified likely interactions, and that you have established that there really is an approximately linear relationship between the logit of the dependent variable and the regressors. If not, do this essential preliminary work and only then return to the stepwise regression.)
Be cautious about following generic advice like I just gave, by the way :-). Your approach should depend on the purpose of the analysis (prediction? extrapolation? scientific understanding? decision making?) as well as the nature of the data, the number of variables, etc.
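To make the single-split recommendation above concrete, here is a rough sketch in R, assuming a data frame dat with a 0/1 outcome y and the candidate regressors as its other columns (these names are illustrative, not part of the original question).

set.seed(1)
# one random split, made once and for all
train.rows = sample(nrow(dat), size=floor(nrow(dat)/2))
train = dat[train.rows, ]
test  = dat[-train.rows, ]
# stepwise logistic regression on the training half, using AIC (k=2) as the criterion
null.fit = glm(y ~ 1, data=train, family=binomial)
full.fit = glm(y ~ ., data=train, family=binomial)
step.fit = step(null.fit, scope=formula(full.fit), direction="both", k=2)
# validate against the held-out half
pred = predict(step.fit, newdata=test, type="response")
mean((pred > 0.5) == test$y)  # simple held-out accuracy (y assumed coded 0/1)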
I would not recommend you use that procedure. My recommendation is: Abandon this project. Just give up and walk away. You have no hope of this working.
Setting aside the standard problems with stepwise selection (cf., here), in your case you are very likely to have perfect predictions due to separation in such a high-dimensional space.
I don't have specifics on your situation, but you state that you have "only a few 10s of samples". Let's be charitable and say you have 90. You further say you have "several thousand features". Let's imagine that you 'only' have 2,000. For the sake of simplicity, let's say that all your features are binary. You "believe that the class label can be accurately predicted using only a few features", so let's say you will look for sets of at most 9 features. Lastly, let's imagine that the relationship is deterministic, so that the true relationship will always be perfectly present in your data. (We can change these numbers and assumptions, but that should only make the problem worse.) Now, how well would you be able to recover that relationship under these (generous) conditions? That is, how often would the correct set be the only set that yields perfect accuracy? Or, put another way, how many sets of 9 features will also fit by chance alone?
Some (overly) simple math and simulations should provide some clues to this question. First, with 9 variables, each of which could be 0 or 1, the number of patterns an observation could show is $2^9 = 512$, but you will have only 90 observations. Thus it is entirely possible that, for a given set of 9 binary variables, every observation has a different set of predictor values; there are no replicates. Without replicates that share the same predictor values but differ in the response (some y=0 and some y=1), you will have complete separation, and perfect prediction of every observation will be possible.
Below, I have a simulation (coded in R) to see how often you might have no pattern of x-values that occurs with both y=0 and y=1. The way it works is that I draw 90 numbers from 1 to 512, each representing one of the possible patterns, and check whether any of the patterns in the first 45 (which might be the 0s) match any of the patterns in the second 45 (which might be the 1s). This assumes that you have perfectly balanced response data, which gives you the best possible protection against this problem. Note that having some replicated x-vectors with differing y-values doesn't really get you out of the woods; it just means you wouldn't be able to perfectly predict every single observation in your dataset, which is the very stringent standard I'm using here.
set.seed(7938)  # this makes the simulation exactly reproducible
my.fun = function(){
  # draw 90 observations, each labeled by which of the 512 possible
  # patterns of 9 binary predictors it displays
  x = sample.int(512, size=90, replace=TRUE)
  # TRUE if no pattern among the first 45 (the y=0s) also appears among the
  # second 45 (the y=1s), i.e., complete separation / perfect prediction
  return(sum(x[1:45] %in% x[46:90]) == 0)
}
n.unique = replicate(10000, my.fun())  # repeat the experiment 10,000 times
mean(n.unique)  # [1] 0.0181
The simulation suggests you would have this issue with approximately 1.8% of the sets of 9 x-variables. Now, how many sets of 9 are there? Strictly, that would be $\binom{1991}{9} \approx 1.3\times 10^{24}$ (since we've stipulated that the true 9 deterministic causal variables are in your set). However, many of those sets will be overlapping; there will be $1991 / 9 \approx 221$ non-overlapping sets of 9 within a specified partition of your variables (with many such partitions possible). Thus, within some given partition, we might expect there would be $221\times 0.018\approx 4$ sets of 9 x-variables that will perfectly predict every observation in your dataset.
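These back-of-the-envelope figures are easy to check directly in R:

choose(1991, 9)  # ~1.3e24 possible sets of 9 drawn from the remaining 1,991 variables
1991 / 9         # ~221 non-overlapping sets of 9 within one partition
221 * 0.0181     # ~4 sets expected to separate the data perfectly by chance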
Note that these results only consider cases where you have a relatively larger dataset (at the high end of the "tens"), a relatively smaller number of variables (at the low end of the "thousands"), and where every single observation can be predicted perfectly (there will be many more sets that are nearly perfect), etc. Your actual case is unlikely to work out 'this well'. Moreover, we stipulated that the relationship is perfectly deterministic. What would happen if there is some random noise in the relationship? In that case, you will still have ~4 (null) sets that perfectly predict your data, but the right set may well not be among them.
Tl;dr, the basic point here is that your set of variables is way too large / high dimensional, and your amount of data is way too small, for anything to be possible. If it's really true that you have "tens" of samples, "thousands" of variables, and absolutely no earthly idea which variables might be right, you have no hope of getting anywhere with any procedure. Go do something else with your time.
Best Answer
What you observed follows from statistical theory and is completely expected. And "best fitting on the first dataset" is a by-product of overfitting and is not really interesting.
Why the need to reduce the number of features to 10-30? Why is parsimony good? Why not just fit an L2 penalized model with all features?
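For instance, a ridge-penalized logistic regression that keeps every feature can be sketched with glmnet (ridge is alpha = 0), again assuming a predictor matrix X and a 0/1 outcome y (illustrative names only):

library(glmnet)
set.seed(1)
ridge.fit = cv.glmnet(X, y, family="binomial", alpha=0)  # ridge: shrinks coefficients but never drops features
coef(ridge.fit, s="lambda.min")  # all features retained, with shrunken coefficients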