Solved – Does the area under the ROC curve penalize models with too many explanatory variables?

logistic · overfitting · regression-strategies · roc

I'm using the area under the ROC curve (AUC) as a performance measure for my classification algorithms (logistic regressions). Since I'm going to choose the model that maximizes the AUC, I would like to know whether the AUC somehow penalizes models with too many regressors (for example, the way an information criterion such as BIC does).

Best Answer

You mention in the comments that you are computing the AUC using a 75-25 train-test split, and you are puzzled why AUC is maximized when training your model on only 8 of your 30 regressors. From this you have gotten the impression that AUC is somehow penalizing complexity in your model.

In reality there is something penalizing complexity in your model, but it is not the AUC metric: it is the train-test split. Train-test splitting is what makes it possible to use pretty much any metric, AUC included, for model selection, even though such metrics have no inherent penalty on model complexity.
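As a concrete illustration, here is a minimal sketch of that kind of model selection: two candidate logistic regressions compared by AUC on a held-out 25% test set. The synthetic dataset and the particular 8-column subset are placeholders, not your data or code.

```python
# Minimal sketch (assumed synthetic data, not the poster's): choose between two
# candidate logistic regressions by AUC on a held-out 25% test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 30 candidate regressors, only a few of them truly informative.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=5,
                           n_redundant=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The "8 regressors" column choice is an arbitrary stand-in for a candidate subset.
candidates = {"8 regressors": list(range(8)), "all 30 regressors": list(range(30))}
for name, cols in candidates.items():
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te[:, cols])[:, 1])
    print(f"{name}: held-out AUC = {auc:.3f}")
```

Whichever subset wins here wins because it predicts better on data the model never saw, not because AUC itself counts parameters.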

As you probably know, we do not measure performance on the same data that we train our models on, because the training data error rate is generally an overly optimistic measure of performance in practice (see Section 7.4 of the ESL book). But this is not the most important reason to use train-test splits. The most important reason is to avoid overfitting with excessively complex models.

Given two models A and B such that B "contains" A (the parameter set of B contains that of A), the training error is mathematically guaranteed to favor model B, or at worst tie, provided you fit by optimizing some criterion and measure error by that same criterion. That's because B can fit the data in every way that A can, plus additional ways that may produce lower error than A's best fit. This is why you were expecting to see lower error as you added more predictors to your model.
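To spell the guarantee out (notation mine): with $\Theta_A \subseteq \Theta_B$ the two parameter sets and $L_{\text{train}}$ the fit criterion evaluated on the training data,

$$\Theta_A \subseteq \Theta_B \;\Longrightarrow\; \min_{\theta \in \Theta_B} L_{\text{train}}(\theta) \;\le\; \min_{\theta \in \Theta_A} L_{\text{train}}(\theta),$$

so the richer model can never look worse than the nested one on the very data it was fit to.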

However, by splitting your data into two reasonably independent sets for training and testing, you guard yourself against this pitfall. When you fit the training data aggressively, with many predictors and parameters, it doesn't necessarily improve the test data fit. In fact, no matter what the model or fit criterion, we can generally expect that a model which has overfit the training data will not do well on an independent set of test data which it has never seen. As model complexity increases into overfitting territory, test set performance will generally worsen as the model picks up on increasingly spurious training data patterns, taking its predictions farther and farther away from the actual trends in the system it is trying to predict. See for example slide 4 of this presentation, and sections 7.10 and 7.12 of ESL.
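If you want to see this numerically, here is a rough sketch under an assumed synthetic setup (again, not your data): a few informative regressors padded with pure-noise columns. As noise columns are added, training AUC tends to creep upward while held-out AUC stalls or slips, which is exactly the implicit penalty described above.

```python
# Rough sketch (assumed synthetic setup): 5 signal columns padded with 200 noise
# columns. Training AUC tends to rise as columns are added; held-out AUC does not.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_info, y = make_classification(n_samples=300, n_features=5, n_informative=5,
                                n_redundant=0, random_state=0)
X = np.hstack([X_info, rng.normal(size=(300, 200))])   # 5 signal + 200 noise columns
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for k in (5, 25, 105, 205):                             # grow the model: first k columns
    m = LogisticRegression(max_iter=5000).fit(X_tr[:, :k], y_tr)
    auc_tr = roc_auc_score(y_tr, m.predict_proba(X_tr[:, :k])[:, 1])
    auc_te = roc_auc_score(y_te, m.predict_proba(X_te[:, :k])[:, 1])
    print(f"{k:>3} regressors: train AUC {auc_tr:.3f}, test AUC {auc_te:.3f}")
```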

If you still need convincing, a simple thought experiment may help. Imagine you have a dataset of 100 points with a simple linear trend plus Gaussian noise, and you want to fit a polynomial model to this data. Now let's say you split the data into training and test sets of 50 points each and you fit a polynomial of degree 50 to the training data. This polynomial will interpolate the training points and give zero training error, but it will exhibit wild oscillatory behavior that carries it far, far away from the simple linear trendline. This will cause extremely large errors on the test set, much larger than you would get from a simple linear model, so the held-out (test-set) error will favor the linear model. The same thing happens if you compare the linear model against a more stable flexible model such as a smoothing spline, although the effect is less dramatic.
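Here is a quick numerical version of that thought experiment (all numbers illustrative): 100 points on a linear trend plus Gaussian noise, a random 50-50 split, and a straight line versus a degree-49 polynomial (the smallest degree that can interpolate the 50 training points), compared by mean squared error on the held-out half.

```python
# Sketch of the thought experiment: linear trend + Gaussian noise, 50-50 split,
# straight line vs. a near-interpolating polynomial, judged by held-out MSE.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + rng.normal(scale=0.3, size=100)

idx = rng.permutation(100)
train, test = idx[:50], idx[50:]

for degree in (1, 49):
    # Note: degree 49 on 50 points is numerically ill-conditioned and NumPy may
    # warn, but the qualitative picture (much lower training error, far higher
    # test error than the straight line) still comes through.
    coefs = np.polyfit(x[train], y[train], degree)
    mse_train = np.mean((np.polyval(coefs, x[train]) - y[train]) ** 2)
    mse_test = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
    print(f"degree {degree:>2}: train MSE {mse_train:.4f}, test MSE {mse_test:.3g}")
```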

In conclusion, by using train-test splitting techniques such as cross-validation, and measuring performance on the held-out data, we get an implicit penalization of model complexity no matter what metric we use, simply because the model has to predict on data it has not seen. This is why train-test splitting is universally used in the modern approach to evaluating performance in regression and classification.