Solved – Why does forward stepwise selection reduce the AUC of a classifier to values < 0.500

Tags: aic, classification, feature selection, stepwise regression

I've recently been benchmarking different methods for feature selection, and found a weird issue when using forward stepwise regression. Specifically, when I train a sparse logistic regression model using forward stepwise selection (with AIC as my selection criterion), I can obtain a model with an AUC < 0.500.

I've included code below to reproduce this situation using the stepAIC function in the MASS R package and a processed version of the UCI Mammography Dataset (link). I ran this code several times, varying the number of variables added to the model, to show how AUC and AIC change.

As shown, AIC decreases monotonically with each variable, which is expected given that we are adding variables to minimize AIC. However, the AUC falls well below 0.500 after adding the second variable… and then stays below 0.500 thereafter. To be clear, I wouldn't expect forward stepwise selection to monotonically increase AUC, but I also would not expect it to reduce AUC so drastically in a single step, or to produce a classifier that is worse than random guessing.
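(For reference, $\mathrm{AIC} = 2k - 2\log\hat{L}$, where $k$ is the number of fitted parameters, so a forward step is only accepted when the added variable improves the log-likelihood by more than 1; as long as variables keep being added, AIC keeps falling.)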

Can anyone explain this phenomenon? Is there a methodological issue with using AIC as the selection criterion? Or could there be a bug in MASS's stepAIC function?


[Plot: AUC vs. number of variables selected]

[Plot: AIC vs. number of variables selected]


Example (Dataset @ mammography_processed.csv):

require(ROCR)
require(MASS)
require(boot)   # for inv.logit()

target_size = 2
train_data = read.csv(file = "mammography_processed.csv")

#feature matrix (all columns except the label y)
X = as.matrix(train_data[, 1:(ncol(train_data) - 1)])

#fit model using forward stepwise logistic regression
model_initial = glm(y ~ 1, family = binomial, data = train_data)
model_final = glm(formula = y ~ ., family = binomial, data = train_data)

#use the expanded formula (y ~ x1 + x2 + ...) as the upper scope
model = stepAIC(model_initial, scope = formula(terms(model_final)),
                direction = "forward", steps = target_size, trace = FALSE)
coefs = model$coefficients

#store coefficients of all variables + the intercept in a vector
coef_names = c("(Intercept)", colnames(X))
coefficients = setNames(rep(0, length(coef_names)), coef_names)
idx = which(coef_names %in% names(coefs))
coefficients[idx] = coefs

#compute AUC
scores = cbind(1, X) %*% coefficients
probabilities = inv.logit(scores)
prediction_object = prediction(probabilities, labels = train_data$y)
auc = performance(prediction_object, measure = "auc")@y.values[[1]]
aic = model$aic

print(sprintf("auc: %1.3f", auc))
print(sprintf("aic: %1.1f", aic))

Best Answer

I don't see the problem. Check this demo.

library(pROC)  ## simpler alternative package
library(MASS)

target_size <- 2
train_data <- within(read.csv(file = "mammography_processed.csv"), {
  y <- factor(y)
})

## fit model using forward stepwise logistic regression
model_initial <- glm(y ~ 1, family = binomial, data = train_data)

upper_formula <- as.formula(paste("~",
                                  paste(setdiff(colnames(train_data), "y"),
                                        collapse = "+")))
target <- train_data$y

k <- ncol(train_data)   # 1 response column + (k - 1) candidate predictors
AUC <- AIC <- numeric(k)

## intercept-only baseline
AUC[1] <- auc(roc(target, predict(model_initial,
                                  train_data, type = "response")))
AIC[1] <- stats::AIC(model_initial)

## refit with 1, 2, ..., k - 1 forward steps, recording AIC and AUC each time
for(j in 2:k) {
  model <- stepAIC(model_initial,
                   scope = list(lower = ~1, upper = upper_formula),
                   direction = "forward",
                   steps = j - 1, trace = FALSE)
  AUC[j] <- auc(roc(target, predict(model, train_data, type = "response")))
  AIC[j] <- stats::AIC(model)
}

dev.new(height = 6, width = 12)
par(mfrow = c(1,2))
plot(AIC, type = "b")
plot(AUC, type = "b", ylim = c(0, 1))
abline(h = 0.5, col = "red")
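
If the AUC only drops below 0.5 when the model is scored by rebuilding the coefficient vector by hand (as in the question's snippet), but not when using predict(), I would look closely at that manual step: model$coefficients is ordered by the sequence in which variables were added, whereas which(coef_names %in% names(coefs)) returns positions in column order, so the selected coefficients can end up in the wrong slots. Here is a minimal sketch of feeding the fitted model into the question's ROCR pipeline instead, assuming the same train_data and a model returned by stepAIC as above:

library(ROCR)

## score with predict() so coefficient ordering never matters
probabilities <- predict(model, newdata = train_data, type = "response")

pred_obj <- prediction(probabilities, labels = train_data$y)
auc_value <- performance(pred_obj, measure = "auc")@y.values[[1]]
print(sprintf("auc: %1.3f", auc_value))

## if a dense coefficient vector is really needed, assign by name rather than by position:
## coefficients[names(coefs)] <- coefs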