Solved – how to obtain a confusion matrix from a test subset using a cross-validation approach

cross-validation, r

I'm doing data classification using the SGB algorithm. First I divided my dataset into training (80%) and test (20%) subsets. I used the training subset to train and tune the SGB parameters and evaluated performance with 10-fold cross-validation. Based on this approach I selected the best SGB model and used it to predict the test subset, evaluating its accuracy with a confusion matrix. However, I would like to know whether it is possible to obtain a confusion matrix based on cross-validation using the test subset, because some classes have few observations and therefore end up with very few cases in a single hold-out split.

I used the caret R package to perform the SGB classification:

library(caret)

# data partition: 80% training, 20% test
set.seed(3456)
tab_x <- createDataPartition(tab_a$code, p = 0.80, list = FALSE, times = 1)
train_set <- tab_a[tab_x, ]
test_set  <- tab_a[-tab_x, ]

# tuning the SGB (Stochastic Gradient Boosting) model with repeated 10-fold CV

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
sgbGrid_x <- expand.grid(interaction.depth = c(1, 3, 5, 9, 11),
                         n.trees = (1:30) * 50,
                         shrinkage = 0.01)
nrow(sgbGrid_x)

"code" represents 8 categories and "X1, X2, X3, X4, X5,X6,X7" are predictor variables

sgbFit <- train(code ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = train_set,
                method = "gbm", trControl = fitControl, bag.fraction = 0.50,
                verbose = FALSE, tuneGrid = sgbGrid_x)

# fitting the best SGB model (found in the previous step)

sgbGrid_a <- expand.grid(interaction.depth = 5, n.trees = 550, shrinkage = 0.05)
nrow(sgbGrid_a)
sgbFit_a <- train(code ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = train_set,
                  method = "gbm", trControl = fitControl, verbose = FALSE,
                  tuneGrid = sgbGrid_a)

# calculating the confusion matrix on the test subset

classif <- predict(sgbFit_a, newdata = test_set, type = "raw")
cm_sgb <- confusionMatrix(classif, test_set$code)
cm_sgb

Since some categories have few observations in the test subset, I think it would be better to use k-fold cross-validation to produce the confusion matrix, but how can I do this in R? Can anyone please help me with the R code?

Best Answer

First, it is not recommended to rely on a single classifier. Wolpert and Macready's No Free Lunch theorem states that "in the universe of all classifiers and cost (objective) functions, there is no single best classifier," and Watanabe's Ugly Duckling theorem states that "in the universe of all feature sets, there is no single best set of features." Although you are using SGB, did you evaluate other classifiers to determine whether your classes are linearly separable? Try k-nearest neighbors (kNN) and see how it performs in terms of classification accuracy, sensitivity, specificity, etc.; then compare kNN against linear discriminant analysis (LDA), and finally compare both with the SGB performance (boosting is typically used when there are problems in the data). If kNN's performance is comparable, use it, because it is much less expensive computationally. Using multiple classifiers together is called an ensemble, in which each trained classifier casts a vote for the predicted class membership of each test object. Something else that is known is that the best set of classifiers for an ensemble is also not known (see Kuncheva's book and related papers on classifier diversity and oracles).
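Not part of the original answer, but a rough caret sketch of the comparison described above, assuming the train_set, fitControl, and sgbFit_a objects from the question:

# Fit kNN and LDA with the same repeated-CV scheme used for SGB
set.seed(3456)
knnFit <- train(code ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = train_set,
                method = "knn", trControl = fitControl,
                preProcess = c("center", "scale"))   # kNN needs scaled predictors

set.seed(3456)
ldaFit <- train(code ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = train_set,
                method = "lda", trControl = fitControl)

# Compare resampled Accuracy/Kappa of the three models
# (for a strictly paired comparison, pass identical fold indices via
#  trainControl(index = ...) to all three train() calls)
summary(resamples(list(kNN = knnFit, LDA = ldaFit, SGB = sgbFit_a)))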

A confusion matrix for the test objects can be built as a class-by-class matrix in which rows represent the true class and columns the predicted class. Each test object has one true class and one predicted class, so simply increment ("pad") the cell at that row and column by one. When done, divide the sum of the counts on the diagonal (where the true class equals the predicted class) by the total count over all cells of the matrix; this is the classification accuracy. You should also determine sensitivity, specificity, and AUC, which are not distorted by class imbalance the way overall accuracy is.
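As an illustration only (reusing classif and test_set from the question), this is the same "padding" done with R's table():

# rows = true class, columns = predicted class; each test object adds 1 to one cell
cm <- table(True = test_set$code, Predicted = classif)
cm

# accuracy = correct predictions (diagonal) / all predictions
accuracy <- sum(diag(cm)) / sum(cm)
accuracy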

It is also recommended to perform "ten 10-fold" CV runs: after each 10-fold run, the objects are shuffled (permuted) and reassigned to ten new folds, so the data are "re-partitioned" before each of the ten 10-fold CV runs. You can then pad a single confusion matrix with all of the test objects from all ten re-partitions and their 10-fold runs, and use it to determine accuracy.
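A minimal sketch of this pooling in caret (not from the original answer, and assuming the sgbGrid_a and train_set objects from the question; the same call could equally be run on the full tab_a if the concern is small classes): train() can keep every hold-out prediction, and those predictions can be tabulated into one confusion matrix.

# ten repeats of 10-fold CV, saving the held-out predictions of the final model
cvControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                          savePredictions = "final")

set.seed(3456)
sgbFit_cv <- train(code ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = train_set,
                   method = "gbm", trControl = cvControl, verbose = FALSE,
                   tuneGrid = sgbGrid_a)

# each row of $pred is one held-out observation from one fold of one repeat,
# so every observation contributes 10 times (once per repeat) to the pooled matrix
confusionMatrix(sgbFit_cv$pred$pred, sgbFit_cv$pred$obs)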

Overall, never use a classifier just because you like its performance. The art of supervised classification involves knowing why a given classifier should be employed, and evaluating linear separability requires trying other classifiers. A majority-vote ensemble of classifiers is usually the preferred way to go, simply because individual classifiers can break down for a variety of reasons; by including classifiers that perform well when others break down, the bias/variance dilemma can be minimized.
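A hypothetical illustration of such a majority vote, reusing the knnFit and ldaFit objects sketched earlier together with sgbFit_a and test_set from the question:

# each classifier casts one vote per test object
votes <- data.frame(knn = predict(knnFit, newdata = test_set),
                    lda = predict(ldaFit, newdata = test_set),
                    sgb = predict(sgbFit_a, newdata = test_set))

# for each test object, take the class predicted by the most classifiers
# (ties are broken by whichever class appears first)
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
confusionMatrix(factor(majority, levels = levels(test_set$code)), test_set$code)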
