Solved – How to derive confidence intervals from the confusion matrix for a classifier

Tags: classification, confidence interval, confusion matrix, multi-class, unbalanced-classes

I am using k-fold cross-validation to generate a confusion matrix for a classifier. I need to calculate 95% confidence intervals for the number of times each class is predicted when the classifier is run against a set of input data.

So if my output after running 2000 samples through the classifier is:

Class A: 100
Class B: 1400
Class C: 500

I want to be able to report:

Class A: 100   +- (some value for a 95% interval)
Class B: 1400  +- (some value for a 95% interval)
Class C: 500   +- (some value for a 95% interval)

The interval for each class would depend on how good the classifier is for that class as indicated by the confusion matrix.

If this makes sense, please give me some hints. Otherwise, please point me in a better direction. I need something simple to report to unsophisticated users.

Best Answer

The question makes good sense. It is specifically noted that the contingency table is the result of cross-validation. Witten et al.'s Data Mining book (based around Weka) discusses a modified t-test for (repeated) cross-validation, and a t-test implicitly defines a confidence interval. Given that the table comes from cross-validation and each cell is an averaged statistic, confidence intervals do exist per cell, although they will most commonly be calculated for the marginal statistics, and (directly or via those) for whole-of-table statistics.
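
For concreteness, here is a minimal Python sketch of the corrected resampled t-interval of Nadeau and Bengio, which is the flavour of modified t-test that Witten et al. describe; the per-fold scores and the train/test sizes below are hypothetical.

import math
from scipy import stats

def corrected_resampled_t_interval(fold_scores, n_train, n_test, alpha=0.05):
    # CI for a cross-validated statistic using the Nadeau-Bengio
    # variance correction (1/k + n_test/n_train) in place of 1/k.
    k = len(fold_scores)
    mean = sum(fold_scores) / k
    s2 = sum((x - mean) ** 2 for x in fold_scores) / (k - 1)  # sample variance
    se = math.sqrt((1.0 / k + n_test / n_train) * s2)         # corrected standard error
    t_crit = stats.t.ppf(1 - alpha / 2, df=k - 1)
    return mean - t_crit * se, mean + t_crit * se

# Hypothetical per-fold recall for Class A from 10-fold CV on 2000 samples
scores = [0.81, 0.78, 0.85, 0.80, 0.79, 0.83, 0.77, 0.82, 0.84, 0.80]
print(corrected_resampled_t_interval(scores, n_train=1800, n_test=200))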

In the following paper I explore adaptations of various generalizations of the confidence intervals applied to correlation to useful multiclass cases, and validate them with Monte Carlo simulation. It is difficult to give a clear recommendation, as the same measure can be overly conservative in some cases and insufficiently conservative in others; nonetheless, a reasonable choice is suggested and illustrated in simulations across a range of parameterizations:

Powers, D. M. W. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. International Journal of Machine Learning Technology, 2(1), 37-63.

It is possible to calculate recall and precision from a contingency table (divide a diagonal entry by the appropriate marginal sum) and their inverses (via the complement of the diagonal against the margins, or simply by converting to binary tables), and to define confidence intervals based on the Wald or Wilson techniques. A useful rule of thumb, introduced by Agresti et al. for the normal-distribution assumption at alpha = 0.05, is to add 2 positive and 2 negative examples before computing the proportion. Tony Cai shows this is appropriate for the binomial distribution and gives modified versions for the negative binomial (not applicable here) and the Poisson (arguably applicable, and used as an assumption in some of my derivations above).
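
As an illustration, here is a sketch of the Wilson interval and the Agresti-style "add 2 and 2" interval for a per-class proportion taken from such a binarized table; the counts are hypothetical.

import math

Z = 1.96  # normal critical value for a 95% interval

def wilson_interval(successes, n, z=Z):
    # Wilson score interval for a binomial proportion.
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - half, centre + half

def agresti_coull_interval(successes, n, z=Z):
    # Agresti-Coull interval: add 2 successes and 2 failures first.
    x, m = successes + 2, n + 4
    p = x / m
    half = z * math.sqrt(p * (1 - p) / m)
    return p - half, p + half

# Hypothetical: of 100 samples whose true class is A, 82 were predicted A,
# i.e. recall = diagonal entry / marginal sum = 0.82
print(wilson_interval(82, 100))
print(agresti_coull_interval(82, 100))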

The Poisson modification is probably the most applicable here (think in terms of when another PpRr Predicted/Real pair might arrive), as it focusses only on the class of interest and doesn't distribute errors/negatives amongst the other classes. It adds two more arrivals to a cell before calculating the statistics relating to that cell.
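
Read literally, that recipe might be sketched as follows; this is an interpretation of the adjustment using a normal approximation to the Poisson, not the exact formula from the Cai paper, and the count is hypothetical.

import math

def adjusted_poisson_interval(count, z=1.96):
    # Approximate 95% CI for a Poisson cell count, adding two
    # 'arrivals' to the cell before computing the statistics.
    lam = count + 2                # adjusted rate estimate
    half = z * math.sqrt(lam)      # normal approximation: sd = sqrt(lambda)
    return lam - half, lam + half

# Hypothetical: Class A was predicted 100 times out of 2000 samples
low, high = adjusted_poisson_interval(100)
print("Class A: 100, approx 95%% interval (%.1f, %.1f)" % (low, high))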

Wei Pan (2001) derives some other possible measures based around the binomial distribution and the t-test.

The Cai paper is here: http://www-stat.wharton.upenn.edu/~tcai/paper/Plugin-Exp-CI.pdf

My paper is here: http://dspace2.flinders.edu.au/xmlui/bitstream/handle/2328/27165/Powers%20Evaluation.pdf

This is something I'm still exploring - hence I return periodically to see if there are any new/useful contributions on the topic...
