Why does a subset of variables produce a higher AUC than all variables in a logistic regression?

Tags: auc, glmnet, logistic, predictive-models, r

I have to predict when the soil dries out. The dependent variable is therefore binary (the soil is wet or dry). I have a lot of variables and have grouped them into three main clusters:

  1. Weather
  2. Vegetation
  3. Soil

When I run a penalised ridge logistic regression (glmnet) with all variables, I get an AUC of around 0.81. When I run it for each individual cluster, weather and vegetation each give an AUC of 0.5, while the soil cluster gives an AUC of 0.84.

  1. How can I get a better prediction of when the soil dries out from one cluster of variables than from all variables combined?
  2. Do the 'non-predictive' variables in the weather and vegetation clusters "drag down" the overall AUC of the full model, and is that why soil alone scores higher?

Here is the script:

library(readr)
library(caret)
library(tidyverse)
library(glmnet)
library(ROCR)
library(pROC)
library(doParallel)
registerDoParallel(cores = 4)  # parallel backend used by cv.glmnet
set.seed(123)

data <- read_csv("path/soildryness.csv")
df   <- data %>% select(V1:V25)   # V1 is the binary response
df.W <- data %>% select(V1:V7)    # weather cluster
df.V <- data %>% select(V8:V18)   # vegetation cluster
df.S <- data %>% select(V19:V25)  # soil cluster

# 80/20 train/test split, stratified on the response V1
training.samples <- df$V1 %>% createDataPartition(p = 0.8, list = FALSE)
train <- df[training.samples, ]
test  <- df[-training.samples, ]

# glmnet needs a numeric predictor matrix and a response vector
x.train <- data.matrix(train[, names(train) != "V1"])
y.train <- train$V1
x.test  <- data.matrix(test[, names(test) != "V1"])
y.test  <- test$V1

# fixed fold assignments so every model is cross-validated on the same folds
foldid <- sample(rep(seq(10), length.out = nrow(train)))

# ridge (alpha = 0) logistic regression, tuned by 10-fold CV on the deviance;
# foldid already defines the folds, so nfolds is not needed
fits <- cv.glmnet(x.train, y.train, type.measure = "deviance", alpha = 0,
                  family = "binomial", foldid = foldid, parallel = TRUE,
                  standardize = TRUE)

# predict probabilities on the held-out test set at the 1-SE lambda
predicted <- predict(fits, s = fits$lambda.1se, newx = x.test, type = "response")

# ROC curve and test-set AUC via ROCR
pred <- prediction(predicted, y.test)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "black")
abline(a = 0, b = 1, lty = 2, col = "red")
auc_ROCR <- performance(pred, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
auc_ROCR
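As an aside, cv.glmnet can also report cross-validated AUC directly, which is a handy cross-check on the ROCR numbers. A sketch reusing the x.train, y.train, and foldid objects defined above (fits.auc is just a new object name here):

fits.auc <- cv.glmnet(x.train, y.train, type.measure = "auc", alpha = 0,
                      family = "binomial", foldid = foldid, parallel = TRUE)
max(fits.auc$cvm)  # cross-validated AUC at the lambda that maximises it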

To sum up, the AUC values:

Weather:    0.5
Vegetation: 0.5
Soil:       0.84
All:        0.81

Best Answer

I do not follow your code 100%, but it looks like you are computing these AUCs on out-of-sample data. In that case, it means you are adding features that do not contribute much: the model overfits to those features during training and is then tricked by them when it comes time to evaluate out-of-sample performance.
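As a minimal sketch of that out-of-sample effect (simulated data with hypothetical variable counts, not the soil data above): fit ridge models with and without a block of pure-noise columns and compare test-set AUC. On many seeds the noise columns pull the test AUC down, though not on every seed.

library(glmnet)
library(pROC)
set.seed(123)
n <- 400
x.signal <- matrix(rnorm(n * 3), n, 3)     # 3 informative predictors
x.noise  <- matrix(rnorm(n * 20), n, 20)   # 20 pure-noise predictors
y <- rbinom(n, 1, plogis(x.signal %*% c(1, -1, 0.5)))
idx <- sample(n, 0.8 * n)                  # 80/20 train/test split
x.all <- cbind(x.signal, x.noise)
fit.all <- cv.glmnet(x.all[idx, ], y[idx], family = "binomial", alpha = 0)
fit.sig <- cv.glmnet(x.signal[idx, ], y[idx], family = "binomial", alpha = 0)
p.all <- predict(fit.all, x.all[-idx, ], s = "lambda.1se", type = "response")
p.sig <- predict(fit.sig, x.signal[-idx, ], s = "lambda.1se", type = "response")
roc(y[-idx], as.vector(p.all), quiet = TRUE)$auc  # all features
roc(y[-idx], as.vector(p.sig), quiet = TRUE)$auc  # signal features only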

However, this can happen with in-sample performance, too!

set.seed(2021)
N <- 1000  # observations per simulated data set
B <- 1000  # number of simulations
x <- runif(N, -3, 3)     # candidate predictor, unrelated to the response
z <- 0
pr <- 1/(1 + exp(-z))    # constant true success probability of 0.5
log_diff <- auc_diff <- rep(NA, B)
for (i in 1:B){
    y <- rbinom(N, 1, pr)
    L1 <- glm(y ~ x, family = binomial)  # model with the useless predictor
    L0 <- glm(y ~ 1, family = binomial)  # intercept-only model
    preds0 <- 1/(1 + exp(-predict(L0)))  # convert link scale to probabilities
    preds1 <- 1/(1 + exp(-predict(L1)))
    auc_diff[i] <- pROC::roc(y, preds0)$auc - pROC::roc(y, preds1)$auc
    log_diff[i] <- (-mean(y*log(preds0) + (1 - y)*log(1 - preds0))) -
                   (-mean(y*log(preds1) + (1 - y)*log(1 - preds1)))
}
summary(auc_diff)
summary(log_diff)

Across the simulations, I get a mix: the intercept-only model sometimes has a higher and sometimes a lower AUC than the model with the predictor (which is not part of the true data-generating process). In contrast, the log loss is always higher for the intercept-only model, as we expect.

What's going on is that the logistic regression fit is not optimizing AUC. The logistic regression fit is optimizing log loss, which is equivalent to maximum likelihood estimation in this case.

$$ \text{Log loss:}\quad L(\hat p, y) = -\frac{1}{n} \sum_{i = 1}^n \Big( y_i\log(\hat p_i) + (1 - y_i)\log(1 - \hat p_i) \Big) $$
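For concreteness, a direct R translation of this formula (log_loss is a name chosen for this sketch; the small eps clips probabilities to avoid log(0)):

log_loss <- function(y, p_hat, eps = 1e-15) {
    p <- pmin(pmax(p_hat, eps), 1 - eps)  # keep probabilities strictly inside (0, 1)
    -mean(y * log(p) + (1 - y) * log(1 - p))
}

For example, log_loss(y, preds1) reproduces the second term inside log_diff in the simulation above.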

When we evaluate nested models on a different metric than the one that was optimized, the more complex model can have inferior performance, even though it is guaranteed to do at least as well as the smaller model on the in-sample metric for which it was optimized.

This is akin to how, with nested OLS models, the more complex one always has the smaller in-sample (training) SSE, but it might not have the smaller in-sample (training) MAE.
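A quick sketch of that OLS analogy (simulated data, with a predictor that is pure noise for the response): the nested comparison on training SSE never favours the smaller model, while on training MAE it sometimes does; the exact fraction varies with the seed.

set.seed(1)
n <- 100
B <- 500
x <- rnorm(n)
sse_diff <- mae_diff <- numeric(B)
for (b in 1:B) {
    y <- rnorm(n)                  # x carries no information about y
    m1 <- lm(y ~ x)                # bigger model
    m0 <- lm(y ~ 1)                # intercept-only model
    sse_diff[b] <- sum(resid(m0)^2) - sum(resid(m1)^2)
    mae_diff[b] <- mean(abs(resid(m0))) - mean(abs(resid(m1)))
}
min(sse_diff)       # never negative: the bigger model always wins on SSE
mean(mae_diff < 0)  # positive fraction: the bigger model sometimes loses on MAE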

Of interest: Does a logistic regression maximizing likelihood necessarily also maximize AUC over linear models?
