Solved – Evaluation of classifier using ROC curve in the presence of rare events

machine learning, rare-events, roc

For binary classification problems where the data are highly imbalanced, i.e. there are many more negative samples than positive samples, it is often recommended to evaluate a classifier using the ROC curve, because the curve does not depend on the ratio between the positive and negative classes (see e.g. He et al.). Yet I recently came across an article stating that rare events usually look "better predicted" when judged by the ROC curve. Unfortunately, I did not save the article and have not been able to find it again.

Therefore, I decided to ask here whether someone can point me to a paper that demonstrates this behaviour, or explain where it comes from. My follow-up question is: what is the preferred way to evaluate a classifier under these circumstances?

Best Answer

Let us try it out. Generate a quantitative classifier variable and a binary state variable (0 = "negative", 1 = "positive") that are positively correlated, and supply three weighting variables. Weight1 makes the 0/1 distribution 45/45. Weight2 makes it 15/75 (i.e. the positive event is frequent). Weight3 makes it 75/15 (i.e. the positive event is rare).

classifier    state  weight1  weight2  weight3
     .801         0        3        1        5
     .270         0        3        1        5
     .253         0        3        1        5
     .220         0        3        1        5
     .142         0        3        1        5
     .229         0        3        1        5
     .352         0        3        1        5
     .341         0        3        1        5
     .198         0        3        1        5
     .169         0        3        1        5
     .525         0        3        1        5
     .533         0        3        1        5
     .395         0        3        1        5
     .586         0        3        1        5
     .072         0        3        1        5
     .776         1        3        5        1
     .772         1        3        5        1
     .813         1        3        5        1
     .507         1        3        5        1
     .112         1        3        5        1
     .664         1        3        5        1
     .979         1        3        5        1
     .877         1        3        5        1
     .414         1        3        5        1
     .887         1        3        5        1
     .675         1        3        5        1
     .514         1        3        5        1
     .793         1        3        5        1
     .622         1        3        5        1
     .468         1        3        5        1

Weight the data by each weight variable in turn and run the ROC analysis (I did it in SPSS). Below are the statistics for the area under the curve.

Weighted by   Area   Std. Error(a)   Asymptotic Sig.(b)   Asymptotic 95% Confidence Interval
                                                          Lower Bound   Upper Bound
weight1       .840   .045            2.76045E-008         .753          .927
weight2       .840   .056            3.45509E-005         .731          .949
weight3       .840   .064            3.45509E-005         .715          .965

(a) Under the nonparametric assumption              
(b) Null hypothesis: true area = 0.5    
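
For readers without SPSS, the equality of the three areas can be checked with a minimal Python sketch. It assumes the weights act as frequency weights (each case counted that many times) and computes the area as the weighted proportion of positive/negative pairs in which the positive case has the higher classifier value, with ties counting half. The function name `weighted_auc` is made up for this illustration and is not an SPSS or library routine.

```python
# Classifier values from the table above, split by state.
neg = [.801, .270, .253, .220, .142, .229, .352, .341, .198, .169,
       .525, .533, .395, .586, .072]   # state = 0
pos = [.776, .772, .813, .507, .112, .664, .979, .877, .414, .887,
       .675, .514, .793, .622, .468]   # state = 1


def weighted_auc(pos, neg, w_pos, w_neg):
    """Weighted share of (positive, negative) pairs ranked correctly.

    Assumes one constant frequency weight per class, as in the table.
    Ties between a positive and a negative count as half a win.
    """
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += w_pos * w_neg
            elif p == n:
                wins += 0.5 * w_pos * w_neg
    total = (w_pos * len(pos)) * (w_neg * len(neg))
    return wins / total


# (weight for positives, weight for negatives) per weighting variable.
weightings = {"weight1": (3, 3), "weight2": (5, 1), "weight3": (1, 5)}

aucs = {name: weighted_auc(pos, neg, wp, wn)
        for name, (wp, wn) in weightings.items()}

for name, auc in aucs.items():
    print(f"{name}: AUC = {auc:.3f}")   # 0.840 for all three
```

Because each class carries a single constant weight, the weights cancel between numerator and denominator, which is why the area cannot change with the class ratio.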

Notice that the area is the same whether the positive event is rare, frequent, or in between. The standard error of the area and the statistics derived from it (significance, confidence interval), however, do depend on how rare the positive event is. The shape of the curve itself (shown below) is not affected. So the background "rareness" of the positive event has no impact on the choice of the optimal classification cut-point on the classifier variable.

[Image: ROC curve]
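
Why the standard error moves while the area stays put can be illustrated with the Hanley and McNeil (1982) approximation for the standard error of the area. This is only a sketch: it is not necessarily the estimator SPSS uses under its "nonparametric assumption", so the numbers will not match the table exactly, and it treats the weighted counts (45/45, 75/15, 15/75) as if they were real sample sizes.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley & McNeil, 1982).

    n_pos and n_neg are the numbers of positive and negative cases;
    here the weighted counts are assumed to play that role.
    """
    q1 = auc / (2 - auc)             # two positives vs. one negative
    q2 = 2 * auc ** 2 / (1 + auc)    # one positive vs. two negatives
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)


# Same area, different class sizes: (positives, negatives).
scenarios = {"weight1": (45, 45), "weight2": (75, 15), "weight3": (15, 75)}

for name, (n_pos, n_neg) in scenarios.items():
    se = hanley_mcneil_se(0.84, n_pos, n_neg)
    print(f"{name}: n_pos={n_pos}, n_neg={n_neg}, approx. SE = {se:.3f}")
```

Under this approximation the balanced case gives the smallest standard error and the rare-positive case the largest, the same ordering as in the SPSS output.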