Logistic Regression – How to Reduce Predictors the Right Way for a Logistic Regression Model

feature selection, logistic, predictive-models, regression-strategies

So I have been reading some books (or parts of them) on modeling (F. Harrell's "Regression Modeling Strategies" among others), since I need to build a logistic model based on binary response data. My data set contains continuous, categorical, and binary predictors. I have around 100 predictors right now, which is obviously far too many for a good model. Many of these predictors are also related, since they are often based on the same metric with small variations.

Anyhow, from what I have been reading, univariate regression and step-wise techniques are some of the worst things you can do to reduce the number of predictors. I think the LASSO technique is quite okay (if I understood that correctly), but obviously you can't just use it on 100 predictors and expect any good to come of it.

So what are my options here? Do I really just have to sit down, talk to all my supervisors and the smart people at work, and really think about what the top 5 best predictors could/should be (we might be wrong), or which approach(es) should I consider instead?

And yes, I also know that this topic is heavily discussed (online and in books), but it sometimes seems a bit overwhelming when you are kind of new in this modeling field.

EDIT:

First of all, my sample size is 1000+ patients (which is a lot in my field), and out of those there are between 70 and 170 positive responses (i.e. 170 yes responses vs. roughly 900 no responses in one of the cases).
Basically the idea is to predict toxicity after radiation treatment. I have some prospective binary response data (i.e. the toxicity: either you have it (1) or you don't (0)), and then I have several types of metrics. Some metrics are patient specific, e.g. age, drugs used, organ and target volume, diabetes etc., and then I have some treatment-specific metrics based on the simulated treatment field for the target. From that I can retrieve several predictors, which are often highly relevant in my field, since most toxicity is highly correlated with the amount of radiation (i.e. dose) received. So for example, if I treat a lung tumour, there is a risk of hitting the heart with some amount of dose. I can then calculate how much dose a given fraction of the heart volume receives, e.g. "how much dose does 50% of the heart volume receive", and then do that in steps, so I check for example 30%, 35%, 40%, 45%, 50%, and so on. In turn I get a lot of similar predictors, but I can't just pick one to start with (although that is what past experiments have tried to do, of course, and what I wish to do as well), because I need to know "exactly" at which level there actually is a large correlation between heart toxicity and volume dose (again, this is an example; there are other similar metrics where the same strategy is applied). So yeah, that's pretty much what my data set looks like: some different metrics, and some metrics that are somewhat similar.

What I then want to do is make a predictive model so I can hopefully predict which patients are at risk of getting some kind of toxicity. And since the response data is binary, my main idea was of course to use a logistic regression model. At least that is what other people have done in my field. However, when going through many of the papers where this has already been done, some of it just seems wrong (at least when reading these specific types of modeling books like F. Harrell's). Many use univariate regression analysis to pick predictors and then use them in a multivariate analysis (a thing that is advised against, if I'm not mistaken), and many also use step-wise techniques to reduce the number of predictors.
Of course it's not all bad. Many use LASSO, PCA, cross-validation, bootstrapping, etc., but in the ones I have looked at, there always seems to be one or two steps (either in the beginning, middle, or end) where they use the kind of techniques that I have read are not a good idea.

Concerning feature selection, this is probably where I'm at now. How do I choose/find the right predictors to use in my model? I have tried these univariate/step-wise approaches, but every time I think: "Why even do it, if it's wrong?". But maybe it's a good way to show, at least in the end, how a "good model" done the correct way compares with a "bad model" done the wrong way. So I could probably do it the somewhat wrong way now; what I need help with is getting a direction for doing it the right way.

Sorry for the edit, and it being so long.

EDIT 2:
Just a quick example of what my data looks like:

'data.frame':   1151 obs. of  100 variables:
 $ Toxicity              : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
 $ Age                   : num  71.9 64 52.1 65.1 63.2 ...
 $ Diabetes              : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
 $ Risk.Category         : Ord.factor w/ 3 levels "LOW"<"INTERMEDIATE"<..: 1 1 1 1 2 1 1 1 1 3 ...
 $ Organ.Volume.CC       : num  136.1 56.7 66 136.6 72.8 ...
 $ Target.Volume.CC      : num  102.7 44.2 58.8 39.1 56.3 ...
 $ D1perc                : num  7961 7718 7865 7986 7890 ...
 $ D1.5CC                : num  7948 7460 7795 7983 7800 ...
 $ D1CC                  : num  7996 7614 7833 7997 7862 ...
 $ D2perc                : num  7854 7570 7810 7944 7806 ...
 $ D2.5CC                : num  7873 7174 7729 7952 7604 ...
 $ D2CC                  : num  7915 7313 7757 7969 7715 ...
 $ D3perc                : num  7737 7379 7758 7884 7671 ...
 $ D3.5CC                : num  7787 6765 7613 7913 7325 ...
 $ D3CC                  : num  7827 6953 7675 7934 7480 ...
 $ D4perc                : num  7595 7218 7715 7798 7500 ...
 $ D5perc                : num  7428 7030 7638 7676 7257 ...
 $ DMEAN                 : num  1473 1372 1580 1383 1192 ...
 $ V2000CGY              : num  24.8 23.7 25.9 22.3 19.3 ...
 $ V2000CGY_CC           : num  33.7 13.4 17.1 30.4 14 ...
 $ V2500CGY              : num  22.5 21.5 24 20.6 17.5 ...
 $ V2500CGY_CC           : num  30.7 12.2 15.9 28.2 12.7 ...
 $ V3000CGY              : num  20.6 19.6 22.4 19.1 15.9 ...
 $ V3000CGY_CC           : num  28.1 11.1 14.8 26.2 11.6 ...
 $ V3500CGY              : num  18.9 17.8 20.8 17.8 14.6 ...
 $ V3500CGY_CC           : num  25.7 10.1 13.7 24.3 10.6 ...
 $ V3900CGY              : num  17.5 16.5 19.6 16.7 13.6 ...
 $ V3900CGY_CC           : num  23.76 9.36 12.96 22.85 9.91 ...
 $ V4500CGY              : num  15.5 14.4 17.8 15.2 12.2 ...
 $ V4500CGY_CC           : num  21.12 8.18 11.76 20.82 8.88 ...
 $ V5000CGY              : num  13.9 12.8 16.4 14 11 ...
 $ V5000CGY_CC           : num  18.91 7.25 10.79 19.09 8.03 ...
 $ V5500CGY              : num  12.23 11.14 14.84 12.69 9.85 ...
 $ V5500CGY_CC           : num  16.65 6.31 9.79 17.33 7.17 ...
 $ V6000CGY              : num  10.56 9.4 13.19 11.34 8.68 ...
 $ V6000CGY_CC           : num  14.37 5.33 8.7 15.49 6.32 ...
 $ V6500CGY              : num  8.79 7.32 11.35 9.89 7.44 ...
 $ V6500CGY_CC           : num  11.96 4.15 7.49 13.51 5.42 ...
 $ V7000CGY              : num  6.76 5.07 9.25 8.27 5.86 ...
 $ V7000CGY_CC           : num  9.21 2.87 6.1 11.3 4.26 ...
 $ V7500CGY              : num  4.61 2.37 6.22 6.13 4 ...
 $ V7500CGY_CC           : num  6.27 1.34 4.11 8.38 2.91 ...
 $ V8000CGY              : num  0.7114 0.1521 0.0348 0.6731 0.1527 ...
 $ V8000CGY_CC           : num  0.9682 0.0863 0.023 0.9194 0.1112 ...
 $ V8200CGY              : num  0.087 0 0 0 0 ...
 $ V8200CGY_CC           : num  0.118 0 0 0 0 ...
 $ V8500CGY              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ V8500CGY_CC           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ n_0.02                : num  7443 7240 7371 7467 7350 ...
 $ n_0.03                : num  7196 6976 7168 7253 7112 ...
 $ n_0.04                : num  6977 6747 6983 7055 6895 ...
 $ n_0.05                : num  6777 6542 6811 6871 6693 ...
 $ n_0.06                : num  6592 6354 6649 6696 6503 ...
 $ n_0.07                : num  6419 6180 6496 6531 6325 ...
 $ n_0.08                : num  6255 6016 6350 6374 6155 ...
 $ n_0.09                : num  6100 5863 6211 6224 5994 ...
 $ n_0.1                 : num  5953 5717 6078 6080 5840 ...
 $ n_0.11                : num  5813 5579 5950 5942 5692 ...
 $ n_0.12                : num  5679 5447 5828 5809 5551 ...
 $ n_0.13                : num  5551 5321 5709 5681 5416 ...
 $ n_0.14                : num  5428 5201 5595 5558 5285 ...
 $ n_0.15                : num  5310 5086 5485 5439 5160 ...
 $ n_0.16                : num  5197 4975 5378 5324 5039 ...
 $ n_0.17                : num  5088 4868 5275 5213 4923 ...
 $ n_0.18                : num  4982 4765 5176 5106 4811 ...
 $ n_0.19                : num  4881 4666 5079 5002 4702 ...
 $ n_0.2                 : num  4783 4571 4985 4901 4597 ...
 $ n_0.21                : num  4688 4478 4894 4803 4496 ...
 $ n_0.22                : num  4596 4389 4806 4708 4398 ...
 $ n_0.23                : num  4507 4302 4720 4616 4303 ...
 $ n_0.24                : num  4421 4219 4636 4527 4210 ...
 $ n_0.25                : num  4337 4138 4555 4440 4121 ...
 $ n_0.26                : num  4256 4059 4476 4355 4035 ...
 $ n_0.27                : num  4178 3983 4398 4273 3951 ...
 $ n_0.28                : num  4102 3909 4323 4193 3869 ...
 $ n_0.29                : num  4027 3837 4250 4115 3790 ...
 $ n_0.3                 : num  3955 3767 4179 4039 3713 ...
 $ n_0.31                : num  3885 3699 4109 3966 3639 ...
 $ n_0.32                : num  3817 3633 4041 3894 3566 ...
 $ n_0.33                : num  3751 3569 3975 3824 3496 ...
 $ n_0.34                : num  3686 3506 3911 3755 3427 ...
 $ n_0.35                : num  3623 3445 3847 3689 3361 ...
 $ n_0.36                : num  3562 3386 3786 3624 3296 ...
 $ n_0.37                : num  3502 3328 3725 3560 3233 ...
 $ n_0.38                : num  3444 3272 3666 3498 3171 ...
 $ n_0.39                : num  3387 3217 3609 3438 3111 ...
 $ n_0.4                 : num  3332 3163 3553 3379 3053 ...
 $ n_0.41                : num  3278 3111 3498 3321 2996 ...
 $ n_0.42                : num  3225 3060 3444 3265 2941 ...
 $ n_0.43                : num  3173 3010 3391 3210 2887 ...
 $ n_0.44                : num  3123 2961 3339 3156 2834 ...
 $ n_0.45                : num  3074 2914 3289 3103 2783 ...
 $ n_0.46                : num  3026 2867 3239 3052 2733 ...
 $ n_0.47                : num  2979 2822 3191 3002 2684 ...
 $ n_0.48                : num  2933 2778 3144 2953 2637 ...
 $ n_0.49                : num  2889 2734 3097 2905 2590 ...

And if I run table(data$Toxicity) the output is:

> table(data$Toxicity)
   0    1 
1088   63

Again, this is for one type of toxicity. I have 3 others as well.

Best Answer

Some of the answers you have received that push feature selection are off base.

The lasso, or better the elastic net, will do feature selection, but as pointed out above you will be quite disappointed at the volatility of the set of "selected" features. I believe the only real hope in your situation is data reduction, i.e., unsupervised learning, as I emphasize in my book. Data reduction brings more interpretability and especially more stability. I very much recommend sparse principal components, or variable clustering followed by regular principal components on clusters.

The information content in your dataset is far, far too low for any feature selection algorithm to be reliable.
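As a rough illustration of "variable clustering followed by regular principal components on clusters", here is a minimal base-R sketch. It is only a stand-in under stated assumptions, not necessarily the procedure from the book: the data are simulated, the variable names (`dose_*`, `vol_*`, `age`) are hypothetical, and the distance measure (1 minus squared Spearman correlation) and the cut height of 0.5 are arbitrary illustrative choices.

```r
# Sketch: cluster the predictors, then keep the first principal component per cluster.
# All data are simulated; variable names are hypothetical.
set.seed(1)
n <- 500
latent_dose <- rnorm(n)
latent_vol  <- rnorm(n)

# Two blocks of highly correlated predictors plus one unrelated predictor
X <- data.frame(
  dose_1 = latent_dose + rnorm(n, sd = 0.2),
  dose_2 = latent_dose + rnorm(n, sd = 0.2),
  dose_3 = latent_dose + rnorm(n, sd = 0.2),
  vol_1  = latent_vol  + rnorm(n, sd = 0.2),
  vol_2  = latent_vol  + rnorm(n, sd = 0.2),
  age    = rnorm(n)
)
y <- rbinom(n, 1, plogis(-2 + 0.8 * latent_dose))

# Cluster variables (not patients); the outcome y is not used, so this is unsupervised
sim   <- cor(X, method = "spearman")^2
tree  <- hclust(as.dist(1 - sim), method = "average")
group <- cutree(tree, h = 0.5)   # cut height is an illustrative assumption

# Summarise each cluster by the first principal component of its scaled members
scores <- sapply(split(names(X), group), function(vars) {
  prcomp(X[, vars, drop = FALSE], center = TRUE, scale. = TRUE)$x[, 1]
})
colnames(scores) <- paste0("cluster", seq_len(ncol(scores)))

# Fit the logistic model on the few cluster scores instead of all raw predictors
fit <- glm(y ~ ., data = data.frame(y = y, scores), family = binomial)
print(group)
print(coef(fit))
```

Because the outcome plays no part in forming the clusters or the component scores, the reduction step does not spend any of the limited information in the 63 to 170 events; only the final model with a handful of scores does.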

Related Solutions

Solved – Model Selection: Logistic Regression

This is probably not a good thing to do. Looking at all the individual covariates first, and then building a model with those that are significant is logically equivalent to an automatic search procedure. While this approach is intuitive, inferences made from this procedure are not valid (e.g., the true p-values are different from those reported by software). The problem is magnified the larger the size of the initial set of covariates is. If you do this anyway (and, unfortunately, many people do), you cannot take the resulting model seriously. Instead, you must run an entirely new study, gathering an independent sample and fitting the previous model, to test it. However, this requires a lot of resources, and moreover, since the process is flawed and the previous model is likely a poor one, there is a strong chance it will not hold up--meaning that it is likely to waste a lot of resources.
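To make the problem with univariate screening concrete, here is a small simulated sketch. Everything in it is an assumption made for illustration: the sample size, the number of predictors, the event rate, and the 0.05 threshold. None of the predictors is related to the outcome, yet a screen on univariate p-values is still expected to let roughly 5% of them through, and the refitted model then reports p-values that look better than they should.

```r
# Sketch: univariate screening applied to predictors that are pure noise.
# All data are simulated; sizes and the 0.05 cutoff are illustrative assumptions.
set.seed(42)
n <- 1000   # patients
p <- 100    # candidate predictors, none truly related to the outcome
X <- as.data.frame(matrix(rnorm(n * p), nrow = n))
y <- rbinom(n, 1, 0.1)   # about 10% events, generated independently of X

# Wald p-value from a separate univariate logistic regression for each predictor
pvals <- sapply(X, function(x) {
  summary(glm(y ~ x, family = binomial))$coefficients["x", "Pr(>|z|)"]
})

selected <- names(pvals)[pvals < 0.05]
cat("Noise predictors passing the screen:", length(selected), "of", p, "\n")

# Refit using only the "significant" predictors and look at the reported p-values
if (length(selected) > 0) {
  refit <- glm(y ~ ., data = data.frame(y = y, X[selected]), family = binomial)
  print(round(summary(refit)$coefficients[-1, , drop = FALSE], 3))
}
```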

A better way is to evaluate models of substantive interest to you. Then use an information criterion that penalizes model flexibility (such as the AIC) to adjudicate amongst those models. For logistic regression, the AIC is: $$ AIC = -2\times\ln(\text{likelihood}) + 2k $$

where $k$ is the number of estimated parameters in that model (the covariate coefficients, plus the intercept in most software). You want the model with the smallest value for the AIC, all things being equal. However, it is not always so simple; be wary when several models have similar values for the AIC, even though one may be lowest.

I include the complete formula for the AIC here, because different software outputs different information. You may have to calculate it from just the likelihood, or you may get the final AIC, or anything in between.
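As a hedged illustration of calculating the AIC from just the likelihood, the sketch below computes it by hand for two logistic models and compares the result with R's built-in `AIC()`. The `mtcars` data set ships with R; the two candidate models are arbitrary examples, and `manual_aic` is a hypothetical helper name. Note that R counts the intercept in $k$ as well, which shifts every model's AIC by the same constant and so does not change the ranking.

```r
# Sketch: compute the AIC by hand and compare it with R's AIC().
# The models are arbitrary illustrations on the built-in mtcars data.
fit_small <- glm(am ~ wt,      data = mtcars, family = binomial)
fit_large <- glm(am ~ wt + hp, data = mtcars, family = binomial)

manual_aic <- function(fit) {
  ll <- logLik(fit)
  k  <- attr(ll, "df")            # number of estimated parameters, intercept included
  -2 * as.numeric(ll) + 2 * k
}

aic_table <- data.frame(
  model   = c("am ~ wt", "am ~ wt + hp"),
  manual  = c(manual_aic(fit_small), manual_aic(fit_large)),
  builtin = c(AIC(fit_small), AIC(fit_large))
)
print(aic_table)
```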

Solved – How to choose data for training a predictive model for attrition prediction

I would aggregate the data to weekly numbers, assuming that great and bad agents have somewhat consistent call center performance over the six months. Sometimes aggregating erases the effects of outliers before they can be classified as such. This would account for shifts in performance across the total 6 month period as well.

When it comes to sampling, using 80% of the data points to develop the model and 20% to validate it would be a good start. You can adjust those numbers depending on how big a data set you are dealing with.
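A minimal sketch of such an 80/20 split in R follows. The data frame `agents` and its columns are simulated, hypothetical stand-ins for the real call center data, and the logistic model is only there to show where the validation set comes in.

```r
# Sketch: random 80/20 split into development and validation sets.
# `agents` and its columns are simulated, hypothetical stand-ins.
set.seed(123)
agents <- data.frame(
  calls_per_week = rpois(200, 40),
  avg_handle_min = rnorm(200, mean = 6, sd = 1),
  attrited       = rbinom(200, 1, 0.3)
)

n         <- nrow(agents)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train     <- agents[train_idx, ]    # 80% used to develop the model
valid     <- agents[-train_idx, ]   # remaining 20% held out for validation

fit <- glm(attrited ~ calls_per_week + avg_handle_min,
           data = train, family = binomial)

# Predicted probabilities for the held-out rows only
valid$prob <- predict(fit, newdata = valid, type = "response")
print(c(train = nrow(train), valid = nrow(valid)))
```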

I sometimes use Iowa State papers. Here is a good one on the basics (pdf).

Hope you have fun!

Update: Just so we are clear, you are aggregating by week per customer service rep, right?

Neither model fits well. You can judge how each variable fits from the coefficients section of the results. Significant variables have stars next to their p-value (more stars typically means more significant, i.e. a lower p-value). Based on that none of your variables are actually

[Figure: Coefficients]

It's good that you are comparing the model vs actual results. ROC curves capture the model differences pretty well. Try running this and post what you get.

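library(pROC)
# admit: observed 0/1 outcome; prob: assumed to be the model's predicted probability (both columns of mydata)
g <- roc(admit ~ prob, data = mydata)
plot(g)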

Update: It's weekly aggregates, population-wise (i.e. the attrite population and the active population). I didn't do it agent-wise because we will have cases where an agent leaves when he was at his peak performance, but those are exceptional cases, so I thought it would be better to compare the two populations. Please advise if that's not the correct way of thinking.

So AW1 is the first week's performance metric aggregates for attrites; similarly, NAW1 is the first week's performance metric aggregates for non-attrites/active agents.

[Figure: ROC curve for the predictions of the bayesglm model on the full data]

Ran the "step" function (which selects a formula-based model by AIC) on the bayesglm model; the results are below.

[Figure: results after the step function]

Aggregating all the agent results together means you are essentially overfitting to match the total population metrics and not the agent's performance. I recommend that you tie in the agent-level results. You mentioned there being a chance that a great agent leaves unexpectedly, but for a well-run unit that should be a rarity. I also recommend you change the model family parameter to

family = binomial(link = "probit")

This should give you probability of default for each agent. This would
