Why does a subset of variables produce a higher AUC than all variables in a logistic regression?

Tags: auc, glmnet, logistic, predictive-models, r

I have to predict when the soil dries out. The dependent variable is therefore binary (the soil is wet or dry). I have a lot of variables and have grouped them into three main clusters:

  1. Weather
  2. Vegetation
  3. Soil

When I run a penalised ridge logistic regression (glmnet) with all variables, I get an AUC of around 0.81. When I run it for each individual cluster, weather and vegetation each give an AUC of 0.5, while the soil cluster gives an AUC of 0.84.

  1. How can I get a better prediction of when the soil dries out from one cluster of variables than from all variables combined?
  2. Do the 'non-predictive' variables in the weather and vegetation clusters "drag down" the overall AUC of the full model, and is that why soil alone scores higher?

Here is the script:

library(readr)
library(caret)
library(tidyverse)
library(glmnet)
library(ROCR)
library(pROC)
library(doParallel)
registerDoParallel(cores = 4)  # parallel backend used by cv.glmnet
set.seed(123)

data <- read_csv("path/soildryness.csv")
df   <- data %>% select(V1:V25)   # V1 is the binary response
df.W <- data %>% select(V1:V7)    # weather cluster
df.V <- data %>% select(V8:V18)   # vegetation cluster
df.S <- data %>% select(V19:V25)  # soil cluster

# 80/20 train/test split, stratified on the response V1
training.samples <- df$V1 %>% createDataPartition(p = 0.8, list = FALSE)
train <- df[training.samples, ]
test  <- df[-training.samples, ]

# glmnet needs a numeric predictor matrix and a response vector
x.train <- data.matrix(train[, names(train) != "V1"])
y.train <- train$V1
x.test  <- data.matrix(test[, names(test) != "V1"])
y.test  <- test$V1

# fixed fold assignments so every model is cross-validated on the same folds
foldid <- sample(rep(seq(10), length.out = nrow(train)))

# ridge (alpha = 0) logistic regression, tuned by 10-fold CV on the deviance;
# foldid already defines the folds, so nfolds is not needed
fits <- cv.glmnet(x.train, y.train, type.measure = "deviance", alpha = 0,
                  family = "binomial", foldid = foldid, parallel = TRUE,
                  standardize = TRUE)

# predict probabilities on the held-out test set at the 1-SE lambda
predicted <- predict(fits, s = fits$lambda.1se, newx = x.test, type = "response")

# ROC curve and test-set AUC via ROCR
pred <- prediction(predicted, y.test)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "black")
abline(a = 0, b = 1, lty = 2, col = "red")
auc_ROCR <- performance(pred, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
auc_ROCR
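As an aside, cv.glmnet can also report cross-validated AUC directly, which is a handy cross-check on the ROCR numbers. A sketch reusing the x.train, y.train, and foldid objects defined above (fits.auc is just a new object name here):

fits.auc <- cv.glmnet(x.train, y.train, type.measure = "auc", alpha = 0,
                      family = "binomial", foldid = foldid, parallel = TRUE)
max(fits.auc$cvm)  # cross-validated AUC at the lambda that maximises it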

To sum up, the AUC values:

Weather:    0.5
Vegetation: 0.5
Soil:       0.84
All:        0.81

Best Answer

I do not follow your code 100%, but it looks like you are computing these AUCs on out-of-sample data. In that case, it means you are adding features that do not contribute much: the model overfits to those features during training and is then tricked by them when it comes time to evaluate out-of-sample performance.
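As a minimal sketch of that out-of-sample effect (simulated data with hypothetical variable counts, not the soil data above): fit ridge models with and without a block of pure-noise columns and compare test-set AUC. On many seeds the noise columns pull the test AUC down, though not on every seed.

library(glmnet)
library(pROC)
set.seed(123)
n <- 400
x.signal <- matrix(rnorm(n * 3), n, 3)     # 3 informative predictors
x.noise  <- matrix(rnorm(n * 20), n, 20)   # 20 pure-noise predictors
y <- rbinom(n, 1, plogis(x.signal %*% c(1, -1, 0.5)))
idx <- sample(n, 0.8 * n)                  # 80/20 train/test split
x.all <- cbind(x.signal, x.noise)
fit.all <- cv.glmnet(x.all[idx, ], y[idx], family = "binomial", alpha = 0)
fit.sig <- cv.glmnet(x.signal[idx, ], y[idx], family = "binomial", alpha = 0)
p.all <- predict(fit.all, x.all[-idx, ], s = "lambda.1se", type = "response")
p.sig <- predict(fit.sig, x.signal[-idx, ], s = "lambda.1se", type = "response")
roc(y[-idx], as.vector(p.all), quiet = TRUE)$auc  # all features
roc(y[-idx], as.vector(p.sig), quiet = TRUE)$auc  # signal features only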

However, this can happen with in-sample performance, too!

set.seed(2021)
N <- 1000  # observations per simulated data set
B <- 1000  # number of simulations
x <- runif(N, -3, 3)     # candidate predictor, unrelated to the response
z <- 0
pr <- 1/(1 + exp(-z))    # constant true success probability of 0.5
log_diff <- auc_diff <- rep(NA, B)
for (i in 1:B){
    y <- rbinom(N, 1, pr)
    L1 <- glm(y ~ x, family = binomial)  # model with the useless predictor
    L0 <- glm(y ~ 1, family = binomial)  # intercept-only model
    preds0 <- 1/(1 + exp(-predict(L0)))  # convert link scale to probabilities
    preds1 <- 1/(1 + exp(-predict(L1)))
    auc_diff[i] <- pROC::roc(y, preds0)$auc - pROC::roc(y, preds1)$auc
    log_diff[i] <- (-mean(y*log(preds0) + (1 - y)*log(1 - preds0))) -
                   (-mean(y*log(preds1) + (1 - y)*log(1 - preds1)))
}
summary(auc_diff)
summary(log_diff)

Across the simulations, I get a mix: the intercept-only model sometimes has a higher and sometimes a lower AUC than the model with the predictor (which is not part of the true data-generating process). In contrast, the log loss is always higher for the intercept-only model, as we expect.

What's going on is that the logistic regression fit is not optimizing AUC. The logistic regression fit is optimizing log loss, which is equivalent to maximum likelihood estimation in this case.

$$ \text{Log loss:}\quad L(\hat p, y) = -\frac{1}{n} \sum_{i = 1}^n \Big( y_i\log(\hat p_i) + (1 - y_i)\log(1 - \hat p_i) \Big) $$
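For concreteness, a direct R translation of this formula (log_loss is a name chosen for this sketch; the small eps clips probabilities to avoid log(0)):

log_loss <- function(y, p_hat, eps = 1e-15) {
    p <- pmin(pmax(p_hat, eps), 1 - eps)  # keep probabilities strictly inside (0, 1)
    -mean(y * log(p) + (1 - y) * log(1 - p))
}

For example, log_loss(y, preds1) reproduces the second term inside log_diff in the simulation above.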

When we evaluate nested models on a different metric than the one that was optimized, the more complex model can have inferior performance, even though it is guaranteed to do at least as well as the smaller model on the in-sample metric for which it was optimized.

This is akin to how, with nested OLS models, the more complex one always has the smaller in-sample (training) SSE, but it might not have the smaller in-sample (training) MAE.
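A quick sketch of that OLS analogy (simulated data, with a predictor that is pure noise for the response): the nested comparison on training SSE never favours the smaller model, while on training MAE it sometimes does; the exact fraction varies with the seed.

set.seed(1)
n <- 100
B <- 500
x <- rnorm(n)
sse_diff <- mae_diff <- numeric(B)
for (b in 1:B) {
    y <- rnorm(n)                  # x carries no information about y
    m1 <- lm(y ~ x)                # bigger model
    m0 <- lm(y ~ 1)                # intercept-only model
    sse_diff[b] <- sum(resid(m0)^2) - sum(resid(m1)^2)
    mae_diff[b] <- mean(abs(resid(m0))) - mean(abs(resid(m1)))
}
min(sse_diff)       # never negative: the bigger model always wins on SSE
mean(mae_diff < 0)  # positive fraction: the bigger model sometimes loses on MAE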

Of interest: Does a logistic regression maximizing likelihood necessarily also maximize AUC over linear models?
