Validity of AUC for binary categorical variables

auc, categorical-data, predictive-models, roc

The scikit-learn function roc_auc_score can be used to get the area under the ROC curve (AUC). This score is generally used to assess how well a numeric predictor's values predict an outcome.

However, this function can also be used with categorical variables. Below is an example (in Python) where the variable sex is used to predict the variable survived, and the AUC is obtained using this function:

import seaborn
from sklearn.metrics import roc_auc_score

# Load the Titanic dataset and inspect the two variables of interest
tdf = seaborn.load_dataset('titanic')
print(tdf[['survived', 'sex']].head(10))

# Encode the categorical predictor as 0/1 (female = 1, male = 0)
x = (tdf['sex'] == 'female').astype(int)
y = tdf['survived']

auc = round(roc_auc_score(y, x), 4)
print()
print("AUC for sex to predict survived:", auc)

Output:

   survived     sex
0         0    male
1         1  female
2         1  female
3         1  female
4         0    male
5         0    male
6         0    male
7         0    male
8         1  female
9         1  female


AUC for sex to predict survived: 0.7669

However, is this technique statistically sound? Is the AUC obtained this way a valid measure of the relation between two categorical variables? Thanks for your help.

Edit: I have reversed the 0/1 coding of sex, so the AUC is now 0.7669.

Edit 2: From the very interesting answers given below, the following points seem important:

  • AUC can be used with categorical variables too, provided it is interpreted correctly.

  • It needs to be emphasized that what matters is how far the AUC is from 0.5, not simply how high it is. Hence, an AUC of 0.1 is more predictive, albeit in the opposite direction, than an AUC of 0.7.

  • One may report an "absolute AUC", given by the following simple Python expression:

    abs_auc = auc if auc > 0.5 else 1 - auc

Hence, for an AUC of 0.1, the absolute AUC is 0.9; this helps when comparing the AUCs of different variables, without missing those that fall on the other side of the diagonal in the ROC curve. Note: this is suggested only when the predicted variable has just 2 categories.
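Applied to the two AUCs obtained above, both encodings then give the same value:

for auc in (0.2331, 0.7669):
    abs_auc = auc if auc > 0.5 else 1 - auc
    print(auc, "->", round(abs_auc, 4))  # both give 0.7669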

Best Answer

The ROC curve is a statistic of ranks, so it's valid as long as the way you're sorting the data is meaningful. In its most common application, we're sorting according to the predicted probabilities produced by a model. This is meaningful, in the sense that we have the most likely events at one extreme and the least likely events at the other extreme. This is useful because each operating point on the curve tells you (1) how much of your outcome you capture at each threshold using the decision rule "alert if $\hat{p} > \text{threshold}$" and (2) how many false positives you capture with that same rule.
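As an illustration of those operating points (my own sketch, not from the original answer), sklearn.metrics.roc_curve returns them directly; with a binary predictor the curve has only a handful of thresholds:

import seaborn
from sklearn.metrics import roc_curve

tdf = seaborn.load_dataset('titanic')
y = tdf['survived']
x = (tdf['sex'] == 'female').astype(int)

# Each (fpr, tpr, threshold) triple is one operating point of a
# thresholded decision rule on the encoded predictor
fpr, tpr, thresholds = roc_curve(y, x)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold {thr}: TPR = {t:.4f}, FPR = {f:.4f}")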

The ROC AUC is the probability a randomly-chosen positive example is ranked more highly than a randomly-chosen negative example. When we're using ROC AUC to assess a machine learning model, we always want a higher AUC value, because we want our model to give positives a higher rank. On the other hand, if we built a model that had an out-of-sample AUC well below 0.5, we'd know that the model was garbage.
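That interpretation is easy to verify numerically. A minimal sketch (using the same Titanic encoding as in the question): compute the fraction of (positive, negative) pairs in which the positive has the higher score, counting ties as 1/2, and compare with roc_auc_score.

import numpy as np
import seaborn
from sklearn.metrics import roc_auc_score

tdf = seaborn.load_dataset('titanic')
y = tdf['survived'].to_numpy()
x = (tdf['sex'] == 'female').astype(int).to_numpy()

pos = x[y == 1]  # scores of positive (survived) examples
neg = x[y == 0]  # scores of negative examples

# P(random positive outranks random negative), ties counted as 1/2
diff = pos[:, None] - neg[None, :]
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(round(pairwise, 4))             # 0.7669
print(round(roc_auc_score(y, x), 4))  # 0.7669, identical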

In the example above, OP demonstrated that the arbitrary choice of encoding for the categorical data can reverse the meaning of the AUC. In the initial post, OP wrote:

AUC for sex to predict survived: 0.2331

but then edited to reverse how genders were sorted and found

Edit: I have reversed the 0/1 coding of sex, so the AUC is now 0.7669.

The results are completely opposite. In the first case, we had an AUC of $c$, but in the second case, we had an AUC of $1-c$. This is an effective demonstration of why the choice of how you sort the categorical data is crucial! For this reason, I wouldn't recommend using AUC to interpret unordered data.
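The $c$ versus $1-c$ relationship is easy to confirm directly: the AUCs under the two opposite encodings always sum to 1.

import seaborn
from sklearn.metrics import roc_auc_score

tdf = seaborn.load_dataset('titanic')
y = tdf['survived']

# The same variable under the two opposite encodings
auc_female_high = roc_auc_score(y, (tdf['sex'] == 'female').astype(int))
auc_male_high = roc_auc_score(y, (tdf['sex'] == 'male').astype(int))

print(round(auc_female_high, 4), round(auc_male_high, 4))  # 0.7669 and 0.2331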

This is usually where people will point out that you can reverse really bad predictions to get a really high AUC. This is true as far as it goes, but "Let's run 2 tests, fiddle with our data, and report the most favorable result" is not sound statistical practice.

Your suggested procedure of reporting the larger of AUC and $1 - \text{AUC}$ gives you a massive optimism bias (see the simulation sketch after this list).

  • If your data has 3 or more categories and you impose an arbitrary order on them, you might need to test all permutations to get the highest AUC, not just reverse the ordering (reporting 1 - AUC is equivalent to reversing the ordering). For example, suppose the categories are "red," "green," and "blue" instead of "male" and "female": there are more than 2 ways to order them, so simply reversing the order doesn't cover all possible permutations.
  • In the extreme, you may encounter categorical variables that uniquely identify each observational unit (e.g. national ID numbers, telephone numbers, geolocation coordinates, or similar information). The optimal sorting of these unique identifiers will have an AUC of 1 (put all the positives at one extreme of the ranking), but it won't generalize, because you won't know where new unique identifiers should be placed.
  • If you’ve badly overfit a classifier, this method cheerfully reports a much higher AUC than you have in reality.
  • Hypothesis tests will be bogus, because you’re choosing the most favorable statistic.
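A quick simulation (my own sketch, not part of the original answer) makes the optimism bias concrete: with a pure-noise binary predictor, the honest expected AUC is 0.5, but reporting $\max(\text{AUC}, 1 - \text{AUC})$ averages noticeably above it.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_obs, n_sims = 100, 2000

plain, flipped = [], []
for _ in range(n_sims):
    y = rng.integers(0, 2, size=n_obs)  # random binary outcome
    x = rng.integers(0, 2, size=n_obs)  # pure-noise binary "predictor"
    auc = roc_auc_score(y, x)
    plain.append(auc)
    flipped.append(max(auc, 1 - auc))   # the "absolute AUC" procedure

print(np.mean(plain))    # ~0.5, as it should be for noise
print(np.mean(flipped))  # noticeably above 0.5: optimism bias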

On the other hand, an order-invariant measure of association, such as a $\chi^2$ test of independence, does not give a different statistic if you change how you order your categories. It also works when you have 3 or more categories.
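For instance, a $\chi^2$ test depends only on the contingency-table counts, so reordering or relabeling the categories leaves it unchanged. A sketch using scipy:

import pandas
import seaborn
from scipy.stats import chi2_contingency

tdf = seaborn.load_dataset('titanic')
table = pandas.crosstab(tdf['sex'], tdf['survived'])

# The statistic depends only on the cell counts, not on row/column order
chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 2), p)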