Solved – Importance of McNemar test in caret::confusionMatrix

classification, mcnemar-test, r

There are many metrics for evaluating the performance of a predictive model. Many of these seem relatively straightforward to me (e.g. Accuracy, Kappa, AUC-ROC, etc.), but I am uncertain about the McNemar test. Could someone kindly help me understand how to interpret the McNemar test on a predictive model's contingency table? Specifically, I mean the test as applied, and the p-value returned, by the R function caret::confusionMatrix. Everything I read about McNemar talks about comparing results before and after a 'treatment'. In this case, I would be comparing predicted classes vs. the known test classes. Am I correct to interpret a significant McNemar test to mean that the proportion of classes differs between the test classes and the predicted classes?

A second, but more general, follow-up question is how this should factor into interpreting the performance of a predictive model. For example, as reflected in the first example below, in some circumstances 75% accuracy may be considered great, yet the proportion of predicted classes may be different (assuming my understanding of a significant McNemar test is correct). How would one approach such a circumstance?

Lastly, does this interpretation change if more classes are involved, for example with a 3×3 or larger contingency matrix?

Here are some reproducible examples, mirrored from here:

# significant p-value
mat <- matrix(c(661,36,246,207), nrow=2)

caret::confusionMatrix(as.table(mat))
Confusion Matrix and Statistics

    A   B
A 661 246
B  36 207

               Accuracy : 0.7548          
                 95% CI : (0.7289, 0.7794)
    No Information Rate : 0.6061          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.4411          
 Mcnemar's Test P-Value : < 2.2e-16    
... truncated

# non-significant p-value
mat <- matrix(c(663,46,34,407), nrow=2)

caret::confusionMatrix(as.table(mat))
Confusion Matrix and Statistics

    A   B
A 663  34
B  46 407

               Accuracy : 0.9304          
                 95% CI : (0.9142, 0.9445)
    No Information Rate : 0.6165          
    P-Value [Acc > NIR] : <2e-16          

                  Kappa : 0.8536          
 Mcnemar's Test P-Value : 0.2188     
... truncated

Best Answer

Interpreting McNemar’s Test for Classifiers

McNemar’s test captures the errors made by both models: specifically, the No/Yes and Yes/No cells (A/B and B/A in your case) of the confusion matrix. The test checks whether there is a significant difference between the counts in these two cells. That is all.

If these cells have counts that are similar, it shows us that both models make errors in much the same proportion, just on different instances of the test set. In this case, the result of the test would not be significant and the null hypothesis would not be rejected.

Fail to Reject Null Hypothesis: Classifiers have a similar proportion of errors on the test set.

Reject Null Hypothesis: Classifiers have a different proportion of errors on the test set.
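
As a minimal sketch of what happens under the hood (assuming caret delegates to stats::mcnemar.test, which applies a continuity correction by default), you can reproduce the p-values from both examples above using only the two discordant cells:

# McNemar's test uses only the off-diagonal (discordant) cells
# First example: discordant counts 36 and 246 -> highly significant
mat1 <- matrix(c(661, 36, 246, 207), nrow = 2)
mcnemar.test(mat1)   # p-value < 2.2e-16, matching caret's output above

# Second example: discordant counts 46 and 34 -> not significant
mat2 <- matrix(c(663, 46, 34, 407), nrow = 2)
mcnemar.test(mat2)   # p-value ~ 0.2188, matching caret's output above

# With the default continuity correction, the statistic is
# (|n12 - n21| - 1)^2 / (n12 + n21), built from the discordant counts only
n12 <- 46; n21 <- 34
pchisq((abs(n12 - n21) - 1)^2 / (n12 + n21), df = 1, lower.tail = FALSE)  # ~ 0.2188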

More information can be found here:

https://machinelearningmastery.com/mcnemars-test-for-machine-learning/