Presumably, you are using these classifiers to help choose one particular class for a given set of feature values (as you said you are creating a multiclass classifier).
So, let's say you have $N$ classes. Then your confusion matrix would be an $N\times N$ matrix, with the left axis showing the true class (as known in the test set) and the top axis showing the class assigned to an item with that true class. Each element $i,j$ of the matrix would be the number of items with true class $i$ that were classified as being in class $j$.
This is just a straightforward extension of the 2-class confusion matrix.
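As a minimal sketch, such a matrix can be tallied directly from paired label lists (the function name and variables here are illustrative, not from any particular library):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """Element (i, j): number of items with true class i predicted as class j."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

# rows index the true class, columns the predicted class
cm = confusion_matrix([2, 0, 2, 2, 0, 1], [0, 0, 2, 1, 0, 2], classes=[0, 1, 2])
```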
First, create the $3\times 3$ confusion matrix and then calculate the statistics. There are two types of averaging (macro and micro) for the overall statistics (overall precision, overall recall, and so on); the formulas are:
Overall Accuracy
$$ACC_{Overall}=\frac{\sum_{i=1}^{|C|}TP_i}{Population}$$
Precision Micro
$$PPV_{Micro}=\frac{\sum_{i=1}^{|C|}TP_i}{\sum_{i=1}^{|C|}(TP_i+FP_i)}$$
Precision Macro
$$PPV_{Macro}=\frac{1}{|C|}\sum_{i=1}^{|C|}\frac{TP_i}{TP_i+FP_i}$$
Recall Micro
$$TPR_{Micro}=\frac{\sum_{i=1}^{|C|}TP_i}{\sum_{i=1}^{|C|}(TP_i+FN_i)}$$
Recall Macro
$$TPR_{Macro}=\frac{1}{|C|}\sum_{i=1}^{|C|}\frac{TP_i}{TP_i+FN_i}$$
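A minimal sketch of these four averages, computed from an $N\times N$ matrix with rows as true classes and columns as predictions (the function name is illustrative):

```python
def micro_macro(cm):
    """Micro/macro precision (PPV) and recall (TPR) from an N x N
    confusion matrix (rows = true class, columns = predicted class)."""
    n = len(cm)
    tp = [cm[i][i] for i in range(n)]                               # diagonal
    fp = [sum(cm[r][i] for r in range(n)) - tp[i] for i in range(n)]  # column sum minus diagonal
    fn = [sum(cm[i]) - tp[i] for i in range(n)]                       # row sum minus diagonal
    return {
        "PPV_Micro": sum(tp) / (sum(tp) + sum(fp)),
        "PPV_Macro": sum(tp[i] / (tp[i] + fp[i]) for i in range(n)) / n,
        "TPR_Micro": sum(tp) / (sum(tp) + sum(fn)),
        "TPR_Macro": sum(tp[i] / (tp[i] + fn[i]) for i in range(n)) / n,
    }

stats = micro_macro([[3, 0, 0], [0, 1, 2], [2, 1, 3]])
```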
I suggest my library, PyCM, for this purpose.
Example usage:
>>> from pycm import *
>>> y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2] # or y_actu = numpy.array([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
>>> y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2] # or y_pred = numpy.array([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])
>>> cm = ConfusionMatrix(actual_vector=y_actu, predict_vector=y_pred) # Create CM From Data
>>> cm.classes
[0, 1, 2]
>>> cm.table
{0: {0: 3, 1: 0, 2: 0}, 1: {0: 0, 1: 1, 2: 2}, 2: {0: 2, 1: 1, 2: 3}}
>>> print(cm)
Predict 0 1 2
Actual
0 3 0 0
1 0 1 2
2 2 1 3
Overall Statistics :
95% CI (0.30439,0.86228)
Bennett_S 0.375
Chi-Squared 6.6
Chi-Squared DF 4
Conditional Entropy 0.95915
Cramer_V 0.5244
Cross Entropy 1.59352
Gwet_AC1 0.38931
Joint Entropy 2.45915
KL Divergence 0.09352
Kappa 0.35484
Kappa 95% CI (-0.07708,0.78675)
Kappa No Prevalence 0.16667
Kappa Standard Error 0.22036
Kappa Unbiased 0.34426
Lambda A 0.16667
Lambda B 0.42857
Mutual Information 0.52421
Overall_ACC 0.58333
Overall_RACC 0.35417
Overall_RACCU 0.36458
PPV_Macro 0.56667
PPV_Micro 0.58333
Phi-Squared 0.55
Reference Entropy 1.5
Response Entropy 1.48336
Scott_PI 0.34426
Standard Error 0.14232
Strength_Of_Agreement(Altman) Fair
Strength_Of_Agreement(Cicchetti) Poor
Strength_Of_Agreement(Fleiss) Poor
Strength_Of_Agreement(Landis and Koch) Fair
TPR_Macro 0.61111
TPR_Micro 0.58333
Class Statistics :
Classes 0 1 2
ACC(Accuracy) 0.83333 0.75 0.58333
BM(Informedness or bookmaker informedness) 0.77778 0.22222 0.16667
DOR(Diagnostic odds ratio) None 4.0 2.0
ERR(Error rate) 0.16667 0.25 0.41667
F0.5(F0.5 score) 0.65217 0.45455 0.57692
F1(F1 score - harmonic mean of precision and sensitivity) 0.75 0.4 0.54545
F2(F2 score) 0.88235 0.35714 0.51724
FDR(False discovery rate) 0.4 0.5 0.4
FN(False negative/miss/type 2 error) 0 2 3
FNR(Miss rate or false negative rate) 0.0 0.66667 0.5
FOR(False omission rate) 0.0 0.2 0.42857
FP(False positive/type 1 error/false alarm) 2 1 2
FPR(Fall-out or false positive rate) 0.22222 0.11111 0.33333
G(G-measure geometric mean of precision and sensitivity) 0.7746 0.40825 0.54772
LR+(Positive likelihood ratio) 4.5 3.0 1.5
LR-(Negative likelihood ratio) 0.0 0.75 0.75
MCC(Matthews correlation coefficient) 0.68313 0.2582 0.16903
MK(Markedness) 0.6 0.3 0.17143
N(Condition negative) 9 9 6
NPV(Negative predictive value) 1.0 0.8 0.57143
P(Condition positive) 3 3 6
POP(Population) 12 12 12
PPV(Precision or positive predictive value) 0.6 0.5 0.6
PRE(Prevalence) 0.25 0.25 0.5
RACC(Random accuracy) 0.10417 0.04167 0.20833
RACCU(Random accuracy unbiased) 0.11111 0.0434 0.21007
TN(True negative/correct rejection) 7 8 4
TNR(Specificity or true negative rate) 0.77778 0.88889 0.66667
TON(Test outcome negative) 7 10 7
TOP(Test outcome positive) 5 2 5
TP(True positive/hit) 3 1 3
TPR(Sensitivity, recall, hit rate, or true positive rate) 1.0 0.33333 0.5
>>> cm.matrix()
Predict 0 1 2
Actual
0 3 0 0
1 0 1 2
2 2 1 3
>>> cm.normalized_matrix()
Predict 0 1 2
Actual
0 1.0 0.0 0.0
1 0.0 0.33333 0.66667
2 0.33333 0.16667 0.5
Best Answer
The question makes good sense. It is specifically noted that the contingency table is the result of cross-validation. Witten et al.'s Data Mining book (based around Weka) discusses a modified t-test for (repeated) cross-validation, and a t-test implicitly defines a confidence interval. Since we have a CV and each cell is an averaged statistic, CIs do exist per cell, although they will most commonly be calculated for the marginal statistics, and (directly or via those) for whole-of-table statistics.
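One common form of the modified test Witten et al. describe is the corrected resampled t-test of Nadeau and Bengio, which inflates the variance estimate because CV folds share training data. A sketch under that assumption, taking per-fold score differences between two classifiers and the train/test split sizes:

```python
from math import sqrt
from statistics import mean, variance

def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t-statistic for per-fold score differences.

    The plain paired t-test underestimates variance because the
    resampled training sets overlap; the n_test/n_train term is the
    Nadeau-Bengio correction for that overlap.
    """
    k = len(diffs)
    return mean(diffs) / sqrt((1 / k + n_test / n_train) * variance(diffs))

# e.g. five folds of accuracy differences, 90/10 train/test split
t = corrected_resampled_t([0.02, 0.01, 0.03, 0.00, 0.04], n_train=90, n_test=10)
```

The resulting statistic is compared against a t-distribution with k - 1 degrees of freedom, from which a confidence interval follows.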
In the following paper I explore adaptations of various generalizations of the confidence intervals applied to correlation to useful multiclass cases, and validate them with Monte Carlo simulation. It is difficult to give a clear recommendation, as the same measure can be overly conservative in some cases and insufficiently conservative in others; nonetheless, a reasonable choice is suggested and illustrated in simulations across a range of parameterizations:
Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. D. M. W. Powers, International Journal of Machine Learning Technology 2(1), 37-63.
It is possible to calculate recall and precision from a contingency table (divide a diagonal entry by the appropriate marginal sum) and their inverses (by the complement in the diagonal versus the margins, or simply by converting to binary tables), and to define confidence intervals based on the Wald or Wilson techniques. A useful rule of thumb, introduced by Agresti et al. for the normal-distribution assumption at alpha = 0.05, is to add 2 positive and 2 negative examples. Tony Cai shows this is appropriate for the binomial distribution and gives modified versions for the negative binomial (not applicable here) and the Poisson (arguably applicable, and used as an assumption in some of my derivations above).
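As a sketch of those two techniques (with z = 1.96 for alpha = 0.05), applied to a count of successes out of n trials, e.g. a diagonal cell over its marginal sum:

```python
from math import sqrt

Z = 1.96  # two-sided normal quantile for alpha = 0.05

def wilson_interval(successes, n, z=Z):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def agresti_coull_interval(successes, n, z=Z):
    """Agresti-Coull interval: 'add 2 successes and 2 failures', then Wald."""
    s, m = successes + 2, n + 4
    p = s / m
    half = z * sqrt(p * (1 - p) / m)
    return p - half, p + half

# e.g. recall estimated from 3 hits out of 6 condition-positive items
lo, hi = wilson_interval(3, 6)
```

Unlike the plain Wald interval, both stay sensible near 0 or 1 and for the small per-cell counts typical of a contingency table.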
The Poisson modification is probably most applicable here (think in terms of when another PpRr Predicted/Real pair might arrive), as it is focussed only on the class of interest and doesn't distribute errors/negativity amongst the other classes. It adds two more arrivals to a cell before calculating the statistics relating to that cell.
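On my reading of that adjustment (an assumption about the exact bookkeeping: the two extra arrivals in the cell also arrive in its margin), the sketch is just:

```python
def poisson_adjusted_rate(cell_count, margin_count, add=2):
    """Add `add` hypothetical arrivals to the cell of interest (and hence
    to its marginal sum) before computing the cell's rate, per the
    Poisson 'add two arrivals' modification described above."""
    return (cell_count + add) / (margin_count + add)

# e.g. recall for a class with 3 hits out of 6 true positives: 5 / 8
rate = poisson_adjusted_rate(3, 6)
```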
Wei Pan (2001) derives some other possible measures based around the binomial distribution and the t-test.
The Cai paper is here: http://www-stat.wharton.upenn.edu/~tcai/paper/Plugin-Exp-CI.pdf
My paper is here: http://dspace2.flinders.edu.au/xmlui/bitstream/handle/2328/27165/Powers%20Evaluation.pdf
This is something I'm still exploring - hence returning periodically to see if there's any new/useful contributions on the topic...