Solved – Why does forward stepwise selection reduce the AUC of a classifier to values < 0.500

Tags: aic, classification, feature selection, stepwise regression

I've recently been benchmarking different methods for feature selection, and found a weird issue when using forward stepwise regression. Specifically, when I train a sparse logistic regression model using forward stepwise selection (with AIC as my selection criterion), I can obtain a model with an AUC < 0.500.

I've included code below to reproduce this situation using the stepAIC function in the MASS R package and a processed version of the UCI Mammography Dataset (link). I ran this code several times, varying the number of variables added to the model, to show how AUC and AIC change.

As shown, AIC decreases monotonically with each variable, which is expected given that we are adding variables to minimize AIC. However, the AUC falls well below 0.500 after adding the second variable… and then stays below 0.500 thereafter. To be clear, I wouldn't expect forward stepwise selection to monotonically increase AUC, but I also would not expect it to reduce AUC so drastically in a single step, or to produce a classifier that is worse than random guessing.
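(For reference, $\mathrm{AIC} = 2k - 2\log\hat{L}$, where $k$ is the number of fitted parameters, so a forward step is only accepted when the added variable improves the log-likelihood by more than 1; as long as variables keep being added, AIC keeps falling.)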

Can anyone explain this phenomenon? Is there a methodological issue with using AIC as the selection criterion? Or could there be a bug in MASS's stepAIC function?


[Plot: AUC vs. number of variables selected]

[Plot: AIC vs. number of variables selected]


Example (Dataset @ mammography_processed.csv):

require(ROCR)
require(MASS)
require(boot)   # for inv.logit()

target_size = 2
train_data = read.csv(file = "mammography_processed.csv")

#feature matrix (all columns except the label y)
X = as.matrix(train_data[, 1:(ncol(train_data) - 1)])

#fit model using forward stepwise logistic regression
model_initial = glm(y ~ 1, family = binomial, data = train_data)
model_final = glm(formula = y ~ ., family = binomial, data = train_data)

#use the expanded formula (y ~ x1 + x2 + ...) as the upper scope
model = stepAIC(model_initial, scope = formula(terms(model_final)),
                direction = "forward", steps = target_size, trace = FALSE)
coefs = model$coefficients

#store coefficients of all variables + the intercept in a vector
coef_names = c("(Intercept)", colnames(X))
coefficients = setNames(rep(0, length(coef_names)), coef_names)
idx = which(coef_names %in% names(coefs))
coefficients[idx] = coefs

#compute AUC
scores = cbind(1, X) %*% coefficients
probabilities = inv.logit(scores)
prediction_object = prediction(probabilities, labels = train_data$y)
auc = performance(prediction_object, measure = "auc")@y.values[[1]]
aic = model$aic

print(sprintf("auc: %1.3f", auc))
print(sprintf("aic: %1.1f", aic))

Best Answer

I don't see the problem. Check this demo.

library(pROC)  ## simpler alternative package
library(MASS)

target_size <- 2
train_data <- within(read.csv(file = "mammography_processed.csv"), {
  y <- factor(y)
})

## fit model using forward stepwise logistic regression
model_initial <- glm(y ~ 1, family = binomial, data = train_data)

upper_formula <- as.formula(paste("~",
                                  paste(setdiff(colnames(train_data), "y"),
                                        collapse = "+")))
target <- train_data$y

k <- ncol(train_data)   # 1 response column + (k - 1) candidate predictors
AUC <- AIC <- numeric(k)

## intercept-only baseline
AUC[1] <- auc(roc(target, predict(model_initial,
                                  train_data, type = "response")))
AIC[1] <- stats::AIC(model_initial)

## refit with 1, 2, ..., k - 1 forward steps, recording AIC and AUC each time
for(j in 2:k) {
  model <- stepAIC(model_initial,
                   scope = list(lower = ~1, upper = upper_formula),
                   direction = "forward",
                   steps = j - 1, trace = FALSE)
  AUC[j] <- auc(roc(target, predict(model, train_data, type = "response")))
  AIC[j] <- stats::AIC(model)
}

dev.new(height = 6, width = 12)
par(mfrow = c(1,2))
plot(AIC, type = "b")
plot(AUC, type = "b", ylim = c(0, 1))
abline(h = 0.5, col = "red")
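
If the AUC only drops below 0.5 when the model is scored by rebuilding the coefficient vector by hand (as in the question's snippet), but not when using predict(), I would look closely at that manual step: model$coefficients is ordered by the sequence in which variables were added, whereas which(coef_names %in% names(coefs)) returns positions in column order, so the selected coefficients can end up in the wrong slots. Here is a minimal sketch of feeding the fitted model into the question's ROCR pipeline instead, assuming the same train_data and a model returned by stepAIC as above:

library(ROCR)

## score with predict() so coefficient ordering never matters
probabilities <- predict(model, newdata = train_data, type = "response")

pred_obj <- prediction(probabilities, labels = train_data$y)
auc_value <- performance(pred_obj, measure = "auc")@y.values[[1]]
print(sprintf("auc: %1.3f", auc_value))

## if a dense coefficient vector is really needed, assign by name rather than by position:
## coefficients[names(coefs)] <- coefs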