Algorithmic Fairness – Why Classifiers Cannot Be Well-Calibrated and Achieve Error Rate Balance Across Groups

algorithmic-fairness · calibration · machine-learning

There are by now several results in the literature stating that a classifier cannot satisfy calibration and error rate balance at the same time if there are actual differences between the groups. To pick one exemplary result, Kleinberg et al. (2016) derive that the following three conditions can only be fulfilled simultaneously if there are no actual differences between the groups (copied verbatim from their paper; I restate them in symbols after the list):

  1. Calibration within groups, i.e., for each group $t$, and each bin $b$ with associated score $v_b$, the expected number of people from group $t$ in $b$ who belong to the positive class should be a $v_b$ fraction of the expected number of people from group $t$ assigned to $b$.
  2. Balance for the negative class, i.e., the average score assigned to people of group 1 who belong to the negative class should be the same as the average score assigned to people of group 2 who belong to the negative class.
  3. Balance for the positive class, i.e., the average score assigned to people of group 1 who belong to the positive class should be the same as the average score assigned to people of group 2 who belong to the positive class.
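
In symbols (my own notation, not the paper's), writing $S$ for the score, $Y \in \{0, 1\}$ for the class, and $G$ for the group, I read these conditions as:

$$\begin{aligned} \text{1. Calibration within groups:} \quad & \mathbb{E}[Y \mid S = v_b, G = t] = v_b \quad \text{for all bins } b, \\ \text{2. Balance for the negative class:} \quad & \mathbb{E}[S \mid Y = 0, G = 1] = \mathbb{E}[S \mid Y = 0, G = 2], \\ \text{3. Balance for the positive class:} \quad & \mathbb{E}[S \mid Y = 1, G = 1] = \mathbb{E}[S \mid Y = 1, G = 2]. \end{aligned}$$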

I can follow their derivation (and other, similar ones), but I am still missing an intuition: why can I not have a well-calibrated classifier that achieves error rate balance in any non-trivial case? Why are these two requirements contradictory? It seems to have something to do with the fact that if the base rates differ between groups, one cannot have the true positive rate, false positive rate, positive predictive value, and negative predictive value all be equal across groups. (See the Fair ML book, pp. 56-57.) But I still can't wrap my head around why, intuitively, that is not possible. Maybe someone has a nice illustrative example or can otherwise provide intuition?

Best Answer

The essential intuition for why calibration by group and separation (= balance for the positive/negative classes) are incompatible is that the average score of a calibrated classifier within each group equals the base rate of that group: by the tower property, $\mathbb{E}[S \mid \text{group} = i] = \mathbb{E}\big[\mathbb{E}[Y \mid S, \text{group} = i] \mid \text{group} = i\big] = P(Y = 1 \mid \text{group} = i)$. From this it is already almost apparent that equal average risk scores in the positive/negative classes of each group cannot be achieved if there are base rate differences (and the classifier is not perfect).
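
To make this concrete, here is a minimal simulation sketch (the Beta score distributions and group names are made up for illustration). Outcomes are drawn as $Y \sim \text{Bernoulli}(S)$, so the scores are calibrated within each group by construction; the mean score then matches each group's base rate, and the class-conditional average scores differ across groups, violating conditions 2 and 3:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibrated scores: within each group, outcomes are drawn as
# Y ~ Bernoulli(S), so E[Y | S = s] = s holds by construction.
scores = {
    "group 1": rng.beta(2, 6, size=100_000),  # base rate 2/(2+6) = 0.25
    "group 2": rng.beta(4, 4, size=100_000),  # base rate 4/(4+4) = 0.50
}

for name, s in scores.items():
    y = rng.binomial(1, s)  # outcomes consistent with the scores
    print(name)
    print("  mean score        :", round(s.mean(), 3))          # = base rate (condition 1)
    print("  base rate         :", round(y.mean(), 3))
    print("  avg score | Y = 1 :", round(s[y == 1].mean(), 3))  # differs across groups
    print("  avg score | Y = 0 :", round(s[y == 0].mean(), 3))  # differs across groups
```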

More formally, from the above insight one can derive that, in the diagram below (average score in the negative class on one axis, average score in the positive class on the other), the calibrated classifiers for each group lie on a straight line, and the conditions can only hold jointly where the lines for the different groups intersect. That can only happen if the classifier is perfect or if there are no base rate differences between the groups.
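
To spell out where the straight lines come from (a short derivation in the notation above): write $p_i$ for the base rate of group $i$, and $\bar{s}_i^{+}$, $\bar{s}_i^{-}$ for the average scores in its positive and negative class. Calibration forces the overall average score in group $i$ to equal $p_i$, and decomposing that average by class gives a linear constraint:

$$p_i\, \bar{s}_i^{+} + (1 - p_i)\, \bar{s}_i^{-} = \mathbb{E}[S \mid G = i] = p_i.$$

In the $(\bar{s}^{-}, \bar{s}^{+})$ plane this is one line per group. Subtracting the constraints of two groups with $p_1 \neq p_2$ yields $\bar{s}^{+} - \bar{s}^{-} = 1$, which within $[0, 1]^2$ leaves only the perfect classifier $(\bar{s}^{-}, \bar{s}^{+}) = (0, 1)$; if $p_1 = p_2$, the two lines coincide.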

[Figure: average score in the negative class vs. average score in the positive class; each group's calibrated classifiers lie on a straight line through the perfect-classifier point $(0, 1)$, with the line determined by the group's base rate]

Crucially, error rate balance (= equal TPR and FPR across groups) is not the same as separation, and it is in principle possible to achieve error rate balance and calibration by group at the same time. (I provide an example of this on my blog, see the link below.) To achieve exact error rate balance, however, the ROC curves of the different groups would have to intersect, which is unlikely to be the case in any practical application.
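
As a quick numerical illustration (a sketch reusing the hypothetical groups from the simulation above; the cutoff 0.4 is arbitrary): thresholding both groups' calibrated scores at a common cutoff generally yields different TPRs and FPRs, so exact error rate balance would require per-group operating points at which the ROC curves cross:

```python
import numpy as np

rng = np.random.default_rng(0)
threshold = 0.4  # arbitrary common decision cutoff

# Same hypothetical calibrated groups as in the simulation above.
for name, (a, b) in {"group 1": (2, 6), "group 2": (4, 4)}.items():
    s = rng.beta(a, b, size=100_000)
    y = rng.binomial(1, s)     # outcomes calibrated to the scores
    yhat = s >= threshold      # predicted class at the common cutoff
    tpr = yhat[y == 1].mean()  # true positive rate
    fpr = yhat[y == 0].mean()  # false positive rate
    print(f"{name}: TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```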

I just summarized all of this in more detail in a post on my personal blog.