Solved – Use Lasso Logistic Regression to Analyze Binary Data

binary data, feature selection, lasso, logistic

I am involved in medical research analyzing coronary artery disease (CAD). The dataset has a number of predictors such as age, gender, race, certain symptoms, and standard diagnostic procedures. Most of the predictors are binary (e.g., whether the patient smokes) and the rest are continuous (e.g., blood pressure or certain hormone levels). The outcome variable is whether or not the patient has CAD (binary).

The research question is to build a model that identifies the variables of most interest and predicts well. My idea is to fit a lasso logistic regression to select variables and then examine its predictive performance. I did some research online and found a very useful tutorial by Trevor Hastie and Junyang Qian.

However, there are only about 150 valid observations, and at least 4/5 of the patients do not have CAD; in other words, the outcome variable is heavily imbalanced toward "no". I am also not sure the number of observations is large enough to perform the lasso. Under these circumstances, in addition to the general procedure above, do I need to set up anything else for model construction (such as a weight adjustment or a larger penalty for misclassifying "yes")? If so, are there methods to handle such a problem?

Thanks in advance!

Best Answer

Your sense that you are limited by the number of cases is correct. The rule of thumb for standard multiple logistic regression is to allow no more than 1 predictor variable per 15 cases of the less frequent class. In your case, with about 30 CAD cases, that means roughly 2 predictor variables. Even though you might get an apparently good fit with more predictors, such a model would be unlikely to generalize well.

LASSO and other penalized methods like ridge regression let you use more predictors than that. The regression coefficients in penalized models are smaller in magnitude than they would be in a standard model with the same variables. This reduces the "optimism" that comes from fitting a small data set and makes the final model more likely to generalize, provided that the penalty is chosen appropriately, for example by cross-validation.
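As a concrete illustration of choosing the penalty by cross-validation, here is a minimal sketch in Python with scikit-learn (the tutorial the questioner mentions uses glmnet in R, so this is an assumed substitute). The data are synthetic stand-ins with an imbalanced outcome roughly matching the question's setup; `class_weight="balanced"` reweights the rare "yes" class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CAD data: ~150 patients, 10 candidate
# predictors, only ~1/5 positive outcomes (hypothetical, for illustration).
rng = np.random.default_rng(0)
n, p = 150, 10
X = rng.normal(size=(n, p))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 2.0        # true signal in 2 predictors
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Standardize so the L1 penalty treats all predictors comparably.
Xs = StandardScaler().fit_transform(X)

# L1-penalized (lasso) logistic regression; the inverse penalty strength C
# is chosen over a grid by 5-fold cross-validation on the log-loss.
model = LogisticRegressionCV(
    penalty="l1",
    solver="liblinear",          # a solver that supports the L1 penalty
    Cs=20,                       # grid of candidate penalty strengths
    cv=5,
    scoring="neg_log_loss",
    class_weight="balanced",     # upweight the rare "yes" class
).fit(Xs, y)

print("chosen C:", model.C_[0])
print("nonzero coefficient indices:", np.flatnonzero(model.coef_[0]))
```

With so few events, expect most coefficients to be shrunk to exactly zero at the cross-validated penalty.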

Thus you can start with as many predictors as you wish for LASSO, ridge regression, or their hybrid, the elastic net. LASSO selects a subset of predictors and penalizes the coefficients of those it keeps; ridge regression retains all predictors with penalized coefficients.

There are at least two limitations to this approach. First, the particular variables selected by LASSO may differ substantially among data samples, even with a large data set, as you can check by repeating your modeling on multiple bootstrapped samples. Second, with so few cases your coefficients will be heavily penalized toward 0. Also, some care is needed with categorical predictors, as discussed on this page.

Finally, coronary artery disease has been extensively studied in many large-scale data sets for many decades. Please think carefully about what you are likely to add to this body of knowledge with such a small data set.
