R – How to Select the Best Subset of Variables for Parsimonious Binary Logistic Regression Models

Tags: feature-selection, r

I am looking for methods of variable selection, other than stepwise procedures, for building parsimonious binary logistic regression models (containing 8 to 12 variables to predict outcomes such as loan payment/default or current/late payment history) from a wide array of potential predictors (500+ variables, 200k+ records). I have already considered PROC VARCLUS, randomForest, glmnet, and assessing multicollinearity among the candidate predictors (without regard to the outcome of interest).

Below I have included an R script using FSelector to select the 8 highest "ranked" variables:

library(FSelector)

# Rank every candidate predictor by its information gain with the outcome
fit <- information.gain(outcome ~ ., dataset)

# Keep the 8 highest-ranked predictors and turn them into a model formula
fit2 <- cutoff.k(fit, 8)
reducedmodel <- as.simple.formula(fit2, "outcome")
print(reducedmodel)
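
The formula returned by as.simple.formula can then be passed straight to glm to fit the parsimonious model. A minimal sketch, assuming dataset holds the candidate predictors plus a 0/1 (or two-level factor) column named outcome:

# Fit the logistic regression on the 8 selected predictors
reducedfit <- glm(reducedmodel, data = dataset, family = binomial)
summary(reducedfit)

# In-sample predicted probabilities of the outcome
head(predict(reducedfit, type = "response"))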

I have two questions regarding this script and the FSelector algorithm in general:

  1. Is the information.gain criterion in the above script synonymous with Kullback-Leibler divergence?
    If so, can someone explain the concept in more layman's terms than Wikipedia? I am relatively new to this area of statistics, would like to start off with the right idea, and will probably use this approach a great deal in the future. (A small worked example of my current understanding is sketched after this list.)

  2. Is this a valid approach, if there is such a thing as a valid approach, to selecting a desired number of variables for a binary logistic regression model (e.g., selecting the 8 highest-ranked variables for a parsimonious model)? If not, can you suggest an alternative?
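
For context on question 1, my current understanding is that information gain here is the mutual information between a predictor and the outcome, i.e. H(outcome) - H(outcome | predictor), which can also be written as the Kullback-Leibler divergence between the joint distribution of the pair and the product of its marginals. Below is a toy base-R sketch (with made-up vectors x and y) of what I believe the calculation amounts to; FSelector's own implementation details, such as the logarithm base, may differ:

# Shannon entropy of a discrete probability vector, in bits
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

y <- c(0, 0, 0, 1, 1, 1, 1, 1)                   # hypothetical binary outcome
x <- c("a", "a", "a", "a", "b", "b", "b", "b")   # hypothetical discrete predictor

h_y   <- entropy(prop.table(table(y)))           # marginal entropy H(y)
h_y_x <- sum(prop.table(table(x)) *
             apply(prop.table(table(x, y), 1), 1, entropy))  # conditional entropy H(y | x)
info_gain <- h_y - h_y_x                         # information gain = mutual information
info_gain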

Any insight or references regarding this topic and/or these questions will be greatly appreciated!

Best Answer

Variable selection without penalization is invalid.
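
One concrete way to act on this, using the glmnet package the question already mentions (a sketch of one penalized approach, not the only one), is LASSO-penalized logistic regression: the penalty shrinks most coefficients exactly to zero, so selection and estimation happen in a single penalized fit rather than as a separate screening step. A minimal sketch, assuming a data frame dataset with a binary outcome column and the candidate predictors:

library(glmnet)

# Design matrix of all candidate predictors (drop the intercept column)
x <- model.matrix(outcome ~ ., data = dataset)[, -1]
y <- dataset$outcome

# Cross-validated LASSO (alpha = 1) logistic regression
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the "one standard error" lambda; non-zero entries are the retained variables
coef(cvfit, s = "lambda.1se")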