Solved – Logistic regression with sparse predictor variables

logistic, predictor, regression, sparse

I am currently modeling some data using a binary logistic regression. The dependent variable has a good number of both positive and negative cases – it is not sparse. I also have a large training set (> 100,000 observations), and the number of main effects I'm interested in is about 15, so I'm not worried about a p > n issue.

What I'm concerned about is that many of my predictor variables, if continuous, are zero most of the time, and if nominal, are null most of the time. When these sparse predictor variables take a value > 0 (or are not null), I know from familiarity with the data that they should be important in predicting my positive cases. I have been trying to find information on how the sparseness of these predictors could affect my model.

In particular, I would not want a sparse but important variable to be left out of my model just because another predictor that is not sparse is correlated with it but actually does a worse job of predicting the positive cases.

To give an example: if I were trying to model whether or not someone ended up being accepted at a particular Ivy League university, and my three predictors were SAT score, GPA, and "donation > 1 million dollars" as a binary, I have reason to believe that "donation > 1 million dollars", when true, is going to be very predictive of acceptance – more so than a high GPA or SAT score – but it is also very sparse. How, if at all, is this going to affect my logistic model, and do I need to make adjustments for it? Also, would another type of model (say a decision tree, random forest, etc.) handle this better?
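To make the scenario concrete, here is a toy simulation of that kind of setup (all variable names, rates, and coefficients are invented for illustration): a rare binary predictor with a large effect alongside two dense continuous predictors.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
sat = rng.normal(0, 1, n)             # standardized SAT score
gpa = rng.normal(0, 1, n)             # standardized GPA
donation = rng.binomial(1, 0.001, n)  # sparse binary: ~0.1% of applicants

# The donation flag has a much larger coefficient than SAT/GPA but is rarely "on".
logit = -3 + 0.8 * sat + 0.6 * gpa + 5.0 * donation
accepted = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([sat, gpa, donation]))
fit = sm.Logit(accepted, X).fit(disp=0)
print(fit.summary(xname=["const", "sat", "gpa", "donation"]))
```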

Best Answer

1) Sparsity in the predictors can be handled with L1 regularization, which shrinks the coefficients of uninformative variables to exactly zero while keeping the genuinely predictive ones; see the first sketch after this list.

2) You can also try undersampling or oversampling the data (don't forget to recalibrate the results afterwards based on the sampling ratio you used); see the second sketch after this list.

3) The fitted model will also tell you about the significance of the different variables (e.g., via Wald or likelihood-ratio tests on the coefficients); see the third sketch after this list.
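For point 1), a minimal sketch of L1-penalized logistic regression with scikit-learn (the synthetic data and the tuning choices below are assumptions made for illustration, not part of the question):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(0, 1, (n, 15))
X[:, 0] = rng.binomial(1, 0.01, n)  # one sparse but strongly predictive binary column
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 3.0 * X[:, 0] + 0.5 * X[:, 1]))))

# The L1 penalty can zero out uninformative coefficients; cross-validation picks its strength.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, cv=5,
                         scoring="neg_log_loss", max_iter=5000),
)
model.fit(X, y)
print(model[-1].coef_)  # coefficients shrunk exactly to zero are effectively dropped
```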
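For point 2), one common scheme is to undersample the majority class and then correct the intercept for the sampling ratio (a prior-correction sketch; the simulated data and the 10% keep rate are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(0, 1, (n, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 1.0 * X[:, 0]))))

keep_rate = 0.1                      # keep 10% of the negatives, all of the positives
neg = np.flatnonzero(y == 0)
pos = np.flatnonzero(y == 1)
kept_neg = rng.choice(neg, size=int(keep_rate * neg.size), replace=False)
idx = np.concatenate([pos, kept_neg])

model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

# Prior correction: the intercept is inflated by ln(p_pos / p_neg), where positives
# were sampled with probability 1 and negatives with probability keep_rate.
corrected_intercept = model.intercept_[0] - np.log(1.0 / keep_rate)
print(model.intercept_[0], corrected_intercept)
```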
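For point 3), a sketch of checking whether a sparse predictor adds anything beyond the dense ones via a likelihood-ratio test (toy data; the variable names and effect sizes are made up):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n = 50_000
dense = rng.normal(0, 1, (n, 2))
sparse = rng.binomial(1, 0.005, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.5 * dense[:, 0] + 4.0 * sparse))))

full = sm.Logit(y, sm.add_constant(np.column_stack([dense, sparse]))).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(dense)).fit(disp=0)

lr = 2 * (full.llf - reduced.llf)  # likelihood-ratio statistic, 1 degree of freedom
print("LR stat:", lr, "p-value:", stats.chi2.sf(lr, df=1))
```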