In addition to PROC VARCLUS, randomForest, glmnet, and assessing multicollinearity among potential predictors (without regard to the outcome of interest), I am seeking other variable-selection methods, in lieu of stepwise procedures, for building parsimonious binary logistic regression models (8 to 12 variables predicting outcomes such as loan payment/default or current/late payment status) from a wide array of candidate predictors (500+ variables, 200k+ records).
Below I have included an R script using FSelector to select the 8 highest "ranked" variables:
library(FSelector)

# Rank every candidate predictor by its information gain with the outcome
fit <- information.gain(outcome ~ ., dataset)

# Keep the names of the 8 highest-ranked predictors
fit2 <- cutoff.k(fit, 8)

# Build a formula of the form outcome ~ var1 + ... + var8
reducedmodel <- as.simple.formula(fit2, "outcome")
print(reducedmodel)
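For completeness, the full ranking can also be inspected before cutting off at 8. This is a minimal sketch, assuming fit is the one-column data frame (attr_importance) that information.gain returns:

# Optional: view all candidate predictors sorted by importance, not just the top 8
fit[order(-fit$attr_importance), , drop = FALSE]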
I have two questions regarding this script and the FSelector algorithm in general:
- Is the information.gain criterion in the above script synonymous with Kullback-Leibler divergence? If so, can someone explain the concept in more layman's terms than Wikipedia? I am relatively new to this area of statistics, would like to start off with the right idea, and will likely use this approach a great deal in the future.
- Is this a valid approach (if there is such a thing as a valid approach) for selecting a desired number of variables for a binary logistic regression model, e.g., choosing the 8 highest "ranked" variables for a parsimonious model? If not, can you suggest an alternative?
Any insight or references regarding this topic and/or these questions will be greatly appreciated!
Best Answer
Variable selection without penalization is invalid. Ranking predictors one at a time and then fitting an unpenalized logistic model to the winners reuses the same data for selection and estimation, which inflates the apparent effect sizes and understates the uncertainty. Penalized approaches such as the lasso (already available to you through glmnet) perform selection and shrinkage jointly instead.
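As a minimal sketch of what that looks like in practice (assuming dataset and outcome are the objects from your question; the model-matrix step is illustrative, not part of your script), a cross-validated lasso fit with glmnet selects and shrinks coefficients in one step:

library(glmnet)

# Build a numeric design matrix from the candidate predictors;
# model.matrix expands factors into dummy columns (no intercept column here)
x <- model.matrix(outcome ~ . - 1, data = dataset)
y <- dataset$outcome  # should be a two-level factor or a 0/1 vector

set.seed(1)  # cross-validation folds are random

# Lasso-penalized logistic regression with 10-fold cross-validation
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Predictors with nonzero coefficients at the chosen lambda are "selected";
# lambda.1se gives a sparser model than lambda.min
coef(cvfit, s = "lambda.1se")

Note that, unlike cutoff.k, the number of selected variables is governed by lambda rather than fixed at 8, so hitting your 8-to-12-variable target becomes a matter of where along the lambda path you stop.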