Solved – how to obtain a confusion matrix from a test subset using a cross-validation approach

cross-validation, r

I'm doing data classification using the SGB algorithm. First I divided my dataset into training (80%) and test (20%) subsets. I used the training subset to train and tune the SGB parameters and evaluated performance with 10-fold cross-validation. Based on this approach I selected the best SGB model and used it to predict the test subset, evaluating its accuracy with a confusion matrix. However, I would like to know whether it is possible to obtain a confusion matrix based on cross-validation using the test subset, because some classes have few observations and therefore end up with very few cases in a single hold-out split.

I used the caret R package to perform the SGB classification:

library(caret)

# data partition: 80% training, 20% test
set.seed(3456)
tab_x <- createDataPartition(tab_a$code, p = 0.80, list = FALSE, times = 1)
train_set <- tab_a[tab_x, ]
test_set  <- tab_a[-tab_x, ]

# tuning the SGB (Stochastic Gradient Boosting) model with repeated 10-fold CV

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
sgbGrid_x <- expand.grid(interaction.depth = c(1, 3, 5, 9, 11),
                         n.trees = (1:30) * 50,
                         shrinkage = 0.01)
nrow(sgbGrid_x)

"code" represents 8 categories and "X1, X2, X3, X4, X5,X6,X7" are predictor variables

sgbFit <- train(code ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = train_set,
                method = "gbm", trControl = fitControl, bag.fraction = 0.50,
                verbose = FALSE, tuneGrid = sgbGrid_x)

# fitting the best SGB model (found in the previous step)

sgbGrid_a <- expand.grid(interaction.depth = 5, n.trees = 550, shrinkage = 0.05)
nrow(sgbGrid_a)
sgbFit_a <- train(code ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = train_set,
                  method = "gbm", trControl = fitControl, verbose = FALSE,
                  tuneGrid = sgbGrid_a)

# calculating the confusion matrix on the test subset

classif <- predict(sgbFit_a, newdata = test_set, type = "raw")
cm_sgb <- confusionMatrix(classif, test_set$code)
cm_sgb

Since some categories have few observations in the test subset, I think it would be better to use k-fold cross-validation to produce the confusion matrix, but how can I do this in R? Can anyone please help me with the R code?

Best Answer

First, it is not recommended to rely on a single classifier. Wolpert and Macready's No Free Lunch theorem states that "in the universe of all classifiers and cost (objective) functions, there is no single best classifier," and Watanabe's Ugly Duckling theorem states that "in the universe of all feature sets, there is no single best set of features." Although you are using SGB, did you evaluate other classifiers to determine whether your classes are linearly separable? Try k-nearest neighbors (kNN) and see how it performs in terms of classification accuracy, sensitivity, specificity, etc.; then compare kNN against linear discriminant analysis (LDA), and finally compare both with the SGB performance (boosting is typically used when there are problems in the data). If kNN's performance is comparable, use it, because it is much less expensive computationally. Using multiple classifiers together is called an ensemble, in which each trained classifier casts a vote for the predicted class membership of each test object. Something else that is known is that the best set of classifiers for an ensemble is also not known (see Kuncheva's book and related papers on classifier diversity and oracles).
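Not part of the original answer, but a rough caret sketch of the comparison described above, assuming the train_set, fitControl, and sgbFit_a objects from the question:

# Fit kNN and LDA with the same repeated-CV scheme used for SGB
set.seed(3456)
knnFit <- train(code ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = train_set,
                method = "knn", trControl = fitControl,
                preProcess = c("center", "scale"))   # kNN needs scaled predictors

set.seed(3456)
ldaFit <- train(code ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = train_set,
                method = "lda", trControl = fitControl)

# Compare resampled Accuracy/Kappa of the three models
# (for a strictly paired comparison, pass identical fold indices via
#  trainControl(index = ...) to all three train() calls)
summary(resamples(list(kNN = knnFit, LDA = ldaFit, SGB = sgbFit_a)))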

A confusion matrix for the test objects can be built as a class-by-class matrix in which rows represent the true class and columns the predicted class. Each test object has one true class and one predicted class, so simply increment ("pad") the cell at that row and column by one. When done, divide the sum of the counts on the diagonal (where the true class equals the predicted class) by the total count over all cells of the matrix; this is the classification accuracy. You should also determine sensitivity, specificity, and AUC, which are not distorted by class imbalance the way overall accuracy is.
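As an illustration only (reusing classif and test_set from the question), this is the same "padding" done with R's table():

# rows = true class, columns = predicted class; each test object adds 1 to one cell
cm <- table(True = test_set$code, Predicted = classif)
cm

# accuracy = correct predictions (diagonal) / all predictions
accuracy <- sum(diag(cm)) / sum(cm)
accuracy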

It is also recommended to perform "ten 10-fold" CV runs: after each 10-fold run, the objects are shuffled (permuted) and reassigned to ten new folds, so the data are "re-partitioned" before each of the ten 10-fold CV runs. You can then pad a single confusion matrix with all of the test objects from all ten re-partitions and their 10-fold runs, and use it to determine accuracy.
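A minimal sketch of this pooling in caret (not from the original answer, and assuming the sgbGrid_a and train_set objects from the question; the same call could equally be run on the full tab_a if the concern is small classes): train() can keep every hold-out prediction, and those predictions can be tabulated into one confusion matrix.

# ten repeats of 10-fold CV, saving the held-out predictions of the final model
cvControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                          savePredictions = "final")

set.seed(3456)
sgbFit_cv <- train(code ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = train_set,
                   method = "gbm", trControl = cvControl, verbose = FALSE,
                   tuneGrid = sgbGrid_a)

# each row of $pred is one held-out observation from one fold of one repeat,
# so every observation contributes 10 times (once per repeat) to the pooled matrix
confusionMatrix(sgbFit_cv$pred$pred, sgbFit_cv$pred$obs)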

Overall, never use a classifier just because you like its performance. The art of supervised classification involves knowing why a given classifier should be employed, and evaluating linear separability requires trying other classifiers. A majority-vote ensemble of classifiers is usually the preferred way to go, simply because individual classifiers can break down for a variety of reasons; by including classifiers that perform well when others break down, the bias/variance dilemma can be minimized.
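A hypothetical illustration of such a majority vote, reusing the knnFit and ldaFit objects sketched earlier together with sgbFit_a and test_set from the question:

# each classifier casts one vote per test object
votes <- data.frame(knn = predict(knnFit, newdata = test_set),
                    lda = predict(ldaFit, newdata = test_set),
                    sgb = predict(sgbFit_a, newdata = test_set))

# for each test object, take the class predicted by the most classifiers
# (ties are broken by whichever class appears first)
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
confusionMatrix(factor(majority, levels = levels(test_set$code)), test_set$code)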
