Scikit-learn function roc_auc_score
can be used to get area under curve (AUC) of ROC curve. This score is generally used for numeric predictors' value in predicting outcomes.
However, this function can also be used for categorical variables also. Following is an example (in Python language) where the variable sex
is used to predict variable survived
and AUC is obtained using this function:
import seaborn, pandas, sklearn
from sklearn.metrics import roc_auc_score
tdf = seaborn.load_dataset('titanic')
print(tdf[['survived','sex']].head(10))
x = tdf['sex'].apply(lambda x: 1 if x=='female' else 0)
y = tdf['survived']
auc = roc_auc_score(y, x)
auc = round(auc, 4)
print()
print("AUC for sex to predict survived:", auc)
Output:
survived sex
0 0 male
1 1 female
2 1 female
3 1 female
4 0 male
5 0 male
6 0 male
7 0 male
8 1 female
9 1 female
AUC for sex to predict survived: 0.7669
However, is this technique statistically sound? Does the AUC obtained using this method a valid value for the relation between 2 categorical variables? Thanks for your help.
Edit: I have reversed the coding of sex to 0 and 1, so that the AUC now is 0.7669
Edit2: From very interesting answers given below, following points seem important:
-
AUC can be used with categorical variables also, provided it is interpreted correctly.
-
It needs to be emphasized that the greater the AUC is away from 0.5, the better it is, and not necessarily higher. Hence, AUC of 0.1 is more predictive, albeit in opposite direction, than AUC of 0.7
-
One may report "Absolute AUC" given by following simple Python code:
Abs_AUC = AUC if (AUC>0.5)
else (1-AUC)
Hence, for an AUC of 0.1, the absolute AUC is 0.9; this will help in comparing AUCs of different variables without missing out ones on the other side of the diagonal in the ROC curve. Note: this is being suggested for predicted variable with only 2 categories.
Best Answer
The ROC curve is a statistic of ranks, so it's valid as long as the way you're sorting the data is meaningful. In its most common application, we're sorting according to the predicted probabilities produced by a model. This is meaningful, in the sense that we have the most likely events at one extreme and the least likely events at the other extreme. This is useful because each operating point on the curve tells you (1) how much of your outcome you capture at each threshold using the decision rule "alert if $\hat{p} > \text{threshold}$" and (2) how many false positives you capture with that same rule.
The ROC AUC is the probability a randomly-chosen positive example is ranked more highly than a randomly-chosen negative example. When we're using ROC AUC to assess a machine learning model, we always want a higher AUC value, because we want our model to give positives a higher rank. On the other hand, if we built a model that had an out-of-sample AUC well below 0.5, we'd know that the model was garbage.
In OP's example, OP demonstrated that the arbitrary choice of how they encoded the categorical data can reverse the meaning of AUC. In the initial post, OP wrote:
but then edited to reverse how genders were sorted and found
The results are completely opposite. In the first case, we had an AUC of $c$, but in the second case, we had an AUC of $1-c$. This is an effective demonstration of why the choice of how you sort the categorical data is crucial! For this reason, I wouldn't recommend using AUC to interpret unordered data.
This is usually where people will point out that you can reverse really bad predictions to get a really high AUC. This is true as far as it goes, but "Let's run 2 tests, fiddle with our data, and report the most favorable result" is not sound statistical practice.
Your suggested procedure of reporting the larger of AUC and 1-AUC gives you a massive optimism bias.
On the other hand, a chi-squared-test does not give a different statistic if you change how you order your categories. It also works when you have 3 or more categories.