So I have been reading some books (or parts of them) on modeling (F. Harrell's "Regression Modeling Strategies" among others), since I currently need to build a logistic model based on binary response data. I have continuous, categorical, and binary predictors in my data set. Basically I have around 100 predictors right now, which obviously is way too many for a good model. Also, many of these predictors are related, since they are often based on the same metric, only computed slightly differently.

Anyhow, from what I have been reading, univariate screening and step-wise techniques are some of the worst things you can do to reduce the number of predictors. I think the LASSO technique is quite okay (if I understood that correctly), but obviously you can't just use that on 100 predictors and expect any good to come of it.

So what are my options here? Do I really just have to sit down, talk to all my supervisors and the smart people at work, and really think about what the top 5 predictors could/should be (we might be wrong), or which approach(es) should I consider instead?

And yes, I also know that this topic is heavily discussed (online and in books), but it sometimes seems a bit overwhelming when you are kind of new in this modeling field.


First of all, my sample size is more than 1,000 patients (which is a lot in my field), and out of those there are between 70 and 170 positive responses (e.g. 170 yes responses vs. roughly 900 no responses in one of the cases).
Basically the idea is to predict toxicity after radiation treatment. I have some prospective binary response data (i.e. the toxicity: either you have it (1) or you don't (0)), and then I have several types of metrics. Some metrics are patient specific, e.g. age, drugs used, organ and target volume, diabetes etc., and then I have some treatment-specific metrics based on the simulated treatment field for the target. From that I can retrieve several predictors, which are often highly relevant in my field, since most toxicity is highly correlated with the amount of radiation (i.e. dose) received.

So for example, if I treat a lung tumour, there is a risk of hitting the heart with some amount of dose. I can then calculate what dose a given fraction of the heart volume receives, e.g. "how much dose does 50% of the heart volume receive", and then do that in steps, so I check for example 30%, 35%, 40%, 45%, 50%, and so on. In turn I get a lot of similar predictors, but I can't just pick one to start with (although that is of course what past experiments have tried to do, and what I wish to do as well), because I need to know "exactly" at which volume level there actually is a strong association between heart toxicity and dose (again, this is just an example; there are other similar metrics where the same strategy is applied). So yeah, that's pretty much what my data set looks like: some different metrics, and some metrics that are somewhat similar.
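To make the dose-volume predictors concrete, here is a minimal Python sketch of how one such metric could be computed. It assumes the organ is available as a list of per-voxel doses with equal voxel volumes; the function name `dose_at_volume` and the toy doses are made up for illustration and are not taken from any treatment planning system.

```python
import math

def dose_at_volume(voxel_doses, volume_percent):
    """Return D_x: the minimum dose received by the hottest x% of the organ.

    Assumes every voxel has the same volume (hypothetical input format).
    """
    if not 0 < volume_percent <= 100:
        raise ValueError("volume_percent must be in (0, 100]")
    ordered = sorted(voxel_doses, reverse=True)  # hottest voxels first
    # Number of voxels making up x% of the organ, rounded up, at least one.
    n_voxels = max(1, math.ceil(len(ordered) * volume_percent / 100))
    return ordered[n_voxels - 1]

# Toy heart with ten equal-volume voxels, doses in Gy (made-up numbers).
heart_doses = [2.0, 3.5, 5.0, 8.0, 12.0, 15.0, 20.0, 26.0, 31.0, 40.0]

# One predictor per volume level: 30%, 35%, 40%, 45%, 50%.
dvh_predictors = {f"D{v}": dose_at_volume(heart_doses, v) for v in range(30, 55, 5)}
print(dvh_predictors)
```

Because every level is read off the same sorted dose list, neighbouring levels give nearly the same number, which is why this family of predictors ends up so strongly correlated.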

What I then want to do is build a predictive model so I can hopefully predict which patients are at risk of developing some kind of toxicity. Since the response data is binary, my main idea was of course to use a logistic regression model. At least that is what other people have done in my field. However, when going through many of the papers where this has already been done, some of it just seems wrong (at least compared with modeling books like F. Harrell's). Many use univariate regression analysis to pick predictors and then use them in a multivariable analysis (which is advised against, if I'm not mistaken), and many also use step-wise techniques to reduce the number of predictors.
Of course it's not all bad. Many use LASSO, PCA, cross-validation, bootstrapping, etc., but in the papers I have looked at there always seem to be one or two steps (in the beginning, middle, or end) where they use the kind of techniques that I have read are not a good idea.

Concerning feature selection, this is probably where I'm at now. How do I choose/find the right predictors to use in my model? I have tried these univariate/step-wise approaches, but every time I think: "Why even do it, if it's wrong?". But maybe it's a good way to show, at least in the end, how a "good model" done the correct way compares against a "bad model" done the wrong way. So I could probably do it the somewhat wrong way now; what I need help with is getting pointed towards doing it the right way.

Sorry for the edit, and it being so long.

Just a quick example of what my data looks like:

'data.frame':   1151 obs. of  100 variables:
 $ Toxicity              : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
And if I run table(data$Toxicity) the output is:

> table(data$Toxicity)
   0    1 
1088   63 

Again, this is for one type of toxicity. I have 3 others as well.

Some of the answers you have received that push feature selection are off base.

The lasso, or better the elastic net, will do feature selection, but as pointed out above you will be quite disappointed at the volatility of the set of "selected" features. I believe the only real hope in your situation is data reduction, i.e., unsupervised learning (reducing the predictors without looking at the response), as I emphasize in my book. Data reduction brings more interpretability and especially more stability. I very much recommend sparse principal components, or variable clustering followed by regular principal components on clusters.
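As a rough illustration of the second recommendation, here is a small, dependency-free Python sketch. It is a crude stand-in, not the procedure from the book: it groups variables by a squared Pearson correlation threshold (single linkage) instead of a proper hierarchical clustering, and it takes the first principal component of each group by power iteration. The column names, the threshold `min_r2`, and the simulated data are all assumptions made for the example. In R, `Hmisc::varclus` does the clustering step properly.

```python
import math
import random

def standardize(col):
    mean = sum(col) / len(col)
    sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
    return [(x - mean) / sd for x in col]

def corr(a, b):
    # Pearson correlation of two already standardized columns.
    return sum(x * y for x, y in zip(a, b)) / (len(a) - 1)

def cluster_variables(cols, min_r2=0.5):
    """Group columns linked by squared correlation >= min_r2 (single linkage)."""
    names = list(cols)
    cluster_of = {name: i for i, name in enumerate(names)}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if corr(cols[a], cols[b]) ** 2 >= min_r2:
                old, new = cluster_of[b], cluster_of[a]
                for name in names:  # merge b's cluster into a's
                    if cluster_of[name] == old:
                        cluster_of[name] = new
    groups = {}
    for name in names:
        groups.setdefault(cluster_of[name], []).append(name)
    return list(groups.values())

def first_pc_scores(cols, members, n_iter=200):
    """Scores on the first principal component of the cluster's columns."""
    k = len(members)
    r = [[corr(cols[a], cols[b]) for b in members] for a in members]
    w = [1.0] * k  # start vector; fine when members are positively correlated
    for _ in range(n_iter):  # power iteration on the correlation matrix
        w = [sum(r[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    n = len(cols[members[0]])
    return [sum(w[j] * cols[members[j]][i] for j in range(k)) for i in range(n)]

# Simulated predictors only; the toxicity outcome is never used here.
rng = random.Random(0)
n = 300
dose = [rng.gauss(0, 1) for _ in range(n)]
size = [rng.gauss(0, 1) for _ in range(n)]
raw = {
    "heart_D30": [d + rng.gauss(0, 0.3) for d in dose],
    "heart_D40": [d + rng.gauss(0, 0.3) for d in dose],
    "heart_D50": [d + rng.gauss(0, 0.3) for d in dose],
    "heart_volume": [s + rng.gauss(0, 0.3) for s in size],
    "target_volume": [s + rng.gauss(0, 0.3) for s in size],
    "age": [rng.gauss(0, 1) for _ in range(n)],
}
cols = {name: standardize(values) for name, values in raw.items()}

clusters = cluster_variables(cols)
summary = {"+".join(m): first_pc_scores(cols, m) for m in clusters}
print(clusters)
```

Six correlated columns collapse to three summary scores, and those scores, not the raw columns, would then go into the logistic model. Since the grouping never looks at toxicity, it does not use up any of the limited information in the outcome.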

The information content in your dataset is far, far too low for any feature selection algorithm to be reliable.
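To put a rough number on that, the Python sketch below uses the counts from the `table(data$Toxicity)` output in the question and the common rule of thumb of roughly 10 to 20 events per candidate parameter. The rule of thumb is an assumption brought in for illustration, not something stated above, and the function name is made up.

```python
def affordable_parameters(events, non_events, per_parameter):
    """Rough number of parameters the data can support under an m/k rule."""
    limiting_n = min(events, non_events)  # the rarer outcome class limits the fit
    return limiting_n // per_parameter

events, non_events = 63, 1088   # from table(data$Toxicity)
n_columns = 100                 # from the data.frame header, outcome included
candidate_predictors = n_columns - 1

events_per_predictor = min(events, non_events) / candidate_predictors
print(round(events_per_predictor, 2))  # 0.64

for per_parameter in (10, 15, 20):
    print(per_parameter, affordable_parameters(events, non_events, per_parameter))
# 10 6
# 15 4
# 20 3
```

With 63 events the data support something like 3 to 6 parameters in total, against 99 candidate columns, and categorical predictors with several levels cost more than one parameter each.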