Solved – Facing unbalanced data: AUC vs. Cohen’s Kappa vs. Balanced Misclassification Rate

auc, cohens-kappa, r, random-forest, unbalanced-classes

As the question title implies, I am dealing with a classification problem on unbalanced data (minority class 2%). As the classification tool I chose Random Forest from the R package "randomForest".

I tried two ways to tackle the imbalance of my data. First, I oversampled the minority class ("1"). Second, I used the data as is but undersampled the majority class ("0") via the sampsize argument, which I call "pseudo-undersampling".

Found the answer: Cohen's kappa is dramatically affected by prevalence and bias, so it is better to choose a different metric.
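
To illustrate the prevalence effect (a minimal sketch with made-up sensitivity and specificity of 0.9, not my model's actual values), the same error rates give a very different kappa once the prevalence drops to my 2%:

# Cohen's kappa for a fixed classifier (given sensitivity and specificity),
# evaluated at different prevalences
kappa_at <- function(prev, sens = 0.9, spec = 0.9) {
  p_pos <- prev * sens + (1 - prev) * (1 - spec)    # P(predicted "1")
  p_obs <- prev * sens + (1 - prev) * spec          # observed agreement
  p_exp <- p_pos * prev + (1 - p_pos) * (1 - prev)  # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

kappa_at(0.50)  # ~0.80 with balanced classes
kappa_at(0.02)  # ~0.24 at 2% prevalence -- same classifier, much lower kappa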

"Pseudo-undersampling":

library(randomForest)

# draw, without replacement, a sample from each stratum equal in size to the
# minority class ("1"), thereby undersampling the majority class per tree
n_min <- sum(train9[, 1] == "1")
rf_under <- randomForest(x = train9[, -1], y = train9[, 1], ntree = 500,
                         mtry = mtry[i], replace = FALSE, strata = train9[, 1],
                         sampsize = c(n_min, n_min), nodesize = 2,
                         importance = FALSE, norm.votes = TRUE, keep.forest = TRUE)

Oversampling (the object NO9 holds the train9 cases with class "0"):

# draw, with replacement, a sample from each stratum equal in size to the
# majority class ("0"), thereby oversampling the minority class per tree
n_maj <- nrow(NO9)
rf_over <- randomForest(x = train9[, -1], y = train9[, 1], ntree = 500,
                        mtry = mtry[i], replace = TRUE, strata = train9[, 1],
                        sampsize = c(n_maj, n_maj), nodesize = 2,
                        importance = FALSE, norm.votes = TRUE, keep.forest = TRUE)

I ran repeated cross-validation on both models (trying different mtry values for each model). Note: I did the oversampling within the cross-validation loop, not prior to it, so the models were always tested on "new" data and the results can be trusted.
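
For clarity, here is a minimal single-repeat sketch of that loop for the undersampled model (the fold construction, the fixed mtry, and the pROC::auc call are my simplifications, not the exact original code):

library(randomForest)
library(pROC)

k <- 10
folds <- sample(rep(1:k, length.out = nrow(train9)))  # random fold labels
auc_under <- numeric(k)

for (f in 1:k) {
  tr <- train9[folds != f, ]   # training folds: resampled below
  te <- train9[folds == f, ]   # test fold: untouched, keeps the 2% prevalence
  n_min <- sum(tr[, 1] == "1")
  fit <- randomForest(x = tr[, -1], y = tr[, 1], ntree = 500,
                      strata = tr[, 1], sampsize = c(n_min, n_min))
  prob <- predict(fit, te[, -1], type = "prob")[, "1"]
  auc_under[f] <- as.numeric(auc(te[, 1], prob, quiet = TRUE))
}
mean(auc_under)  # cross-validated AUC for the "pseudo-undersampling" model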

The results: "pseudo-undersampling" had a higher AUC by 0.01 (0.94 vs. 0.93, though not significantly) and a lower balanced misclassification rate (1 − balanced accuracy) by 0.10 (0.14 vs. 0.24, p < 0.01). However, its Cohen's kappa was lower by 0.36 (0.20 vs. 0.56, p < 0.01).
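
For reference, a sketch of how the balanced misclassification rate can be computed from out-of-fold predictions (pred and truth are placeholder names for the predicted and true class labels):

tab  <- table(pred, truth)                # confusion matrix, rows = predicted
sens <- tab["1", "1"] / sum(tab[, "1"])   # sensitivity (true positive rate)
spec <- tab["0", "0"] / sum(tab[, "0"])   # specificity (true negative rate)
bmr  <- 1 - (sens + spec) / 2             # balanced misclassification rate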

How should I interpret these results and which model is better and more acceptable?

Generalising the question: which metric should one rely on when dealing with unbalanced data?

Best Answer

There are so many problems with your approach that it is difficult to know where to begin. As an aside, the concordance index $c$ (AUROC) is not sensitive to the distribution of $Y$. But you have not used any proper accuracy scoring rules, and you have cast the problem as a classification problem even though you may be more interested in tendencies than in forced choices. Any method that requires you to remove samples or over-sample is highly suspect and not based on good statistical principles. For more information see http://www.fharrell.com/2017/01/classification-vs-prediction.html and http://www.fharrell.com/2017/03/damage-caused-by-classification.html.
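
To illustrate what a proper accuracy scoring rule looks like (a sketch of my own, not something the answer prescribes; prob and truth are placeholder names for predicted probabilities and 0/1 outcomes):

# Brier score: mean squared error of the predicted probabilities;
# a proper scoring rule, so it rewards well-calibrated probability forecasts
brier <- mean((prob - truth)^2)   # lower is better; no threshold or resampling involved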