Solved – glmStepAIC model is doing better than other models

caret, generalized linear model, model, roc

I am training a model on an imbalanced dataset (about 5-20% positive class) and trying out different algorithms in R using the caret package.
I have 57 predictors and around 2000-3000 observations in my training dataset.

So far, I have tried several models and produced ROC and PR AUC plots for them:

[ROC and PR AUC curves for the compared models]

I see a lot of criticism of stepwise logistic regression in R, and I do understand that it has real problems. At the same time, it is doing rather well here and I am not sure how to interpret that. Could it be that I am doing something wrong when training the other models?

I am using repeated 5-fold cross-validation:

library(caret)

# repeated 5-fold cross-validation, evaluating ROC via twoClassSummary
objControl <- trainControl(method = 'repeatedcv',
                           number = 5,
                           repeats = 5,
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE)


# gradient boosting on the same resampling scheme;
# train.fraction is passed through to gbm()
gbm_fit <- train(training[, predictors, drop = FALSE], training[[bm_name]],
                 method = 'gbm',
                 verbose = TRUE,
                 trControl = objControl,
                 metric = "ROC",
                 preProc = c("center", "scale"),
                 train.fraction = 0.5)
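
For reference, a minimal sketch (not my exact call) of how the stepwise logistic regression could be trained under the same resampling setup for a like-for-like comparison; it assumes the same training, predictors and bm_name objects. caret's method = 'glmStepAIC' wraps MASS::stepAIC, and trace = 0 just silences its output:

# stepwise logistic regression under the same trainControl (sketch)
step_fit <- train(training[, predictors, drop = FALSE], training[[bm_name]],
                  method = 'glmStepAIC',
                  trControl = objControl,
                  metric = "ROC",
                  preProc = c("center", "scale"),
                  trace = 0)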

Any guidance is highly appreciated.

Thank you!

Best Answer

Stepwise variable selection tends to produce overly optimistic results (p values that are too low, etc.). The main critique of the method is that researchers often ignore this fact and present the model results without mentioning that bias.

In your comparison, however, the focus is not on how valid those inferential results are but on how the method competes with alternative modelling techniques on a couple of performance metrics. The usual critique does not concern how accurate the resulting models are but how valid the p values etc. are after stepwise variable selection.
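
To see that bias concretely, here is a small simulation (hypothetical variable names, not your data): with predictors that are pure noise, stepAIC still retains some of them, and their reported p values look far too convincing.

library(MASS)

set.seed(1)
n <- 300; p <- 20
x <- as.data.frame(matrix(rnorm(n * p), n, p))    # pure-noise predictors
y <- rbinom(n, 1, 0.2)                            # outcome unrelated to any predictor

full_fit <- glm(y ~ ., data = cbind(y = y, x), family = binomial)
step_sel <- stepAIC(full_fit, direction = "backward", trace = FALSE)

# p values are computed as if the retained variables had been pre-specified,
# so several of these noise variables will typically appear "significant"
summary(step_sel)$coefficients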

In practical situations, it is very difficult to make a fair comparison of the predictive performance of different methods. Why?

1) Each method needs a different preparation of the covariables to let it shine (handling of outliers and missing values, decorrelation, standardization, creation of non-linear terms and interactions, etc.).

2) Besides 1), each method usually has different tuning parameters to optimize: the k in k-NN, the 6-7 parameters of GBM/XGBoost, mtry in random forests, ... Tuning these takes a lot of time and a very solid validation strategy (see the sketch after this list).

3) Some methods are more flexible in choosing an appropriate loss function than others.

4) ...
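
As an illustration of point 2), a rough caret sketch (reusing the objControl, training, predictors and bm_name objects from your question, which are assumptions here): a random forest exposes a single grid dimension to tune, while GBM needs a multi-dimensional grid.

rf_grid  <- expand.grid(mtry = c(5, 10, 20, 40))          # one tuning dimension

gbm_grid <- expand.grid(n.trees = c(200, 500, 1000),      # four tuning dimensions
                        interaction.depth = c(1, 3, 5),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = c(10, 20))

rf_fit <- train(training[, predictors, drop = FALSE], training[[bm_name]],
                method = 'rf',
                tuneGrid = rf_grid,
                trControl = objControl,
                metric = "ROC")

gbm_fit2 <- train(training[, predictors, drop = FALSE], training[[bm_name]],
                  method = 'gbm',
                  tuneGrid = gbm_grid,
                  trControl = objControl,
                  metric = "ROC",
                  verbose = FALSE)

The GBM grid above already contains 36 candidate models per resample, which is why the amount of tuning, and the strength of the validation scheme, matter so much when comparing methods.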
