Solved – Variable selection in Logistic Regression

feature selection, logistic, regression

I am fitting a logistic regression model to a telecom dataset with 78 variables.
Which approach should I follow to select the most significant variables?
I have learned methods such as forward selection and backward elimination.
But applying such methods to 78 independent variables would be very time-consuming, as they require selecting or rejecting one variable at a time.
Would it be correct to split the predictors into 8 groups of about 10 variables each, run a logistic regression of the dependent variable on each group, and keep the significant variables from each group?
I would then combine the variables retained from all groups and run logistic regression again to filter them further.

Please help me.

One more question: can factor analysis or PCA be used with logistic regression to select significant variables?

Best Answer

Forward and backward selection are not recommended. See the discussions here.

What are disadvantages of using the lasso for variable selection for regression?

Using PCA for feature selection is also not recommended, since it only looks at the variance of the independent variables and not at their relationship (their "correlation") with the response variable. See the discussion here.
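To make the point about variance concrete, here is a minimal Python sketch on made-up data. The variable names and the simulated data are assumptions for illustration, not taken from the question. One predictor has large variance and no relation to the response; the other has small variance and drives the response. The first principal component ends up aligned almost entirely with the high-variance, uninformative predictor.

```python
import math
import random

random.seed(0)
n = 2000

# Hypothetical data: x_noise has large variance but no link to y;
# x_signal has small variance but drives the binary response.
x_noise = [random.gauss(0.0, 10.0) for _ in range(n)]
x_signal = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1 if random.random() < 1.0 / (1.0 + math.exp(-2.0 * s)) else 0
     for s in x_signal]

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    # Sample covariance of two equal-length lists.
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

def corr(a, b):
    # Pearson correlation.
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))

# Covariance matrix of the two predictors: [[a, b], [b, c]].
a = cov(x_noise, x_noise)
b = cov(x_noise, x_signal)
c = cov(x_signal, x_signal)

# Direction of the leading eigenvector of a symmetric 2x2 matrix,
# i.e. the first principal component. Note that y is never used here.
theta = 0.5 * math.atan2(2.0 * b, a - c)
pc1 = (math.cos(theta), math.sin(theta))

corr_noise_y = corr(x_noise, y)
corr_signal_y = corr(x_signal, y)

print("PC1 loadings (x_noise, x_signal):", pc1)
print("corr(x_noise, y): ", corr_noise_y)
print("corr(x_signal, y):", corr_signal_y)
```

Standardizing the columns first would change which direction PCA picks, but PCA would still never look at y, which is the underlying objection.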

How to decide between PCA and logistic regression?

If you are focusing on getting an accurate model, you can use regularized logistic regression. An example can be found here.

Regularization methods for logistic regression

Of course, you can use a package to do this, for example the R package glmnet.
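As a rough illustration of what L1-regularized (lasso) logistic regression does, here is a self-contained Python sketch that fits it by proximal gradient descent on simulated data. The data, the penalty value lam = 0.1, the step size, and the helper names are all assumptions chosen for illustration. This is not how glmnet is implemented; in practice you would use a tested package and choose the penalty by cross-validation.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrink toward zero, clip at zero.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logistic(X, y, lam, step=0.5, n_iter=300):
    """Minimize mean log-loss + lam * sum(|w_j|); the intercept b is not penalized."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        # Plain gradient step for the intercept.
        b -= step * grad_b / n
        # Gradient step followed by soft-thresholding for the coefficients.
        w = [soft_threshold(wj - step * gj / n, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

# Simulated data (an assumption for illustration): 6 predictors with unit
# variance, only the first two affect the response.
random.seed(1)
n, p = 500, 6
true_w = [2.0, -1.5, 0.0, 0.0, 0.0, 0.0]
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [1 if random.random() < sigmoid(sum(t * x for t, x in zip(true_w, row))) else 0
     for row in X]

w, b = fit_l1_logistic(X, y, lam=0.1)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("coefficients:", [round(wj, 3) for wj in w])
print("selected predictors:", selected)
```

The penalty pushes the coefficients of uninformative predictors to exactly zero, so selection happens as part of fitting a single model on all predictors at once, rather than by adding or dropping one variable at a time. Predictors should be on a comparable scale (here they all have unit variance), since the penalty treats all coefficients alike.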