MATLAB: Classification Learner APP. Cross-validation, scatter plot and confusion matrix.

classification learner, confusion matrix

I have a question regarding this app, hopefully some app-experts can help me 🙂

I read from the website: "If you use k-fold cross-validation, then the app computes the accuracy scores using the observations in the k validation folds and reports the average cross-validation error. It also makes predictions on the observations in these validation folds and computes the confusion matrix and ROC curve based on these predictions".

That makes sense for the accuracy, but if you look at the confusion matrix generated after selecting "k-fold validation", it contains integer values. How are they determined? They are not an average of the confusion matrices obtained in each of the k validation folds, nor do they seem to be summed up, since the sum of all the elements equals the number of trials in the learning set. So how are they computed?

The same goes for the scatter plot after training: the figure marks correct and incorrect trials. Are they counted as correct/incorrect on the basis of the average results over all k validation folds, or does the plot show the classification obtained from a single representative fold?

Thanks in advance.

Best Answer

Hi Giansu,

Let's look at the scatter plot and confusion matrix generated by the Classification Learner app for k-fold cross-validation, using the iris dataset (150 samples) and 5-fold cross-validation as an example.

Since we choose 5 folds, the app partitions the data into 5 disjoint sets (folds). For each fold, the app trains a model using the other 4 folds as training data and the remaining fold (i.e. the held-out fold) as validation data.

This means that with k-fold cross-validation, each of the 150 samples is used as validation (held-out) data exactly once. For example, in the first iteration the 1st fold is the validation data and the remaining 4 folds are the training data; in the second iteration the 2nd fold is the validation data and the remaining 4 folds are the training data, and so on.
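The partitioning described above can be sketched outside the app with cvpartition from the Statistics and Machine Learning Toolbox. This is only an illustration, not the app's internal code; the fixed random seed and the variable names timesHeldOut and allHeldOutOnce are assumptions made for the example.

```matlab
% Sketch: 5-fold partition of the 150 iris observations.
load fisheriris                         % provides meas (150x4) and species (150x1 cell)
rng(0);                                 % fixed seed, only so the example is repeatable
c = cvpartition(species, 'KFold', 5);   % stratified 5-fold partition

timesHeldOut = zeros(c.NumObservations, 1);
for k = 1:c.NumTestSets
    heldOut = test(c, k);               % logical mask of the k-th validation fold
    timesHeldOut = timesHeldOut + heldOut;
    fprintf('Fold %d: %d training, %d validation observations\n', ...
        k, c.TrainSize(k), c.TestSize(k));
end

% Each observation should land in a validation fold exactly once.
allHeldOutOnce = all(timesHeldOut == 1);
```

If allHeldOutOnce is true, the five validation folds together cover all 150 observations with no overlap, which is why one prediction per observation is available afterwards.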

Scatter plot: Each prediction shown in the scatter plot is the one made for that observation when it was part of the held-out (validation) fold, i.e. by a model that did not see it during training.

Confusion Matrix: The confusion matrix counts how the model classified each observation when that observation was part of the held-out (validation) fold. Every observation is counted exactly once, which is why the values in the confusion matrix are integers and sum to the number of observations.

Accuracy: The accuracy is computed on each of the k validation folds, and the accuracy reported for the model is the average over the folds.
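As a rough command-line counterpart to the three points above, the following sketch collects the held-out predictions and builds the confusion matrix from them. A linear discriminant (fitcdiscr) is used purely as a stand-in model, and isCorrect and nTotal are names chosen for this example; the app's own implementation may differ.

```matlab
% Sketch: held-out predictions and confusion matrix for 5-fold CV on iris.
load fisheriris
rng(0);                                          % repeatable fold assignment
cvModel = fitcdiscr(meas, species, 'KFold', 5);  % cross-validated model (5 fold models)

% One prediction per observation, made by the fold model that did not train on it.
cvPred = kfoldPredict(cvModel);

% Correct/incorrect flag per observation, as marked in the scatter plot.
isCorrect = strcmp(cvPred, species);

% Rows are true classes, columns are predicted classes.
C = confusionmat(species, cvPred);
nTotal = sum(C(:));                              % one count per observation

disp(C);
fprintf('Entries sum to %d observations\n', nTotal);
```

Because each observation contributes a single count, the entries are integers and add up to the size of the dataset, with no averaging or summing across fold-level matrices.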

Following are the scatter plot and confusion matrix which I got on iris data for 5-fold cross validation:

Hope it helps!

Related Solutions

MATLAB: Is the accuracy reported in the Classification Learner app different from the accuracy of the exported model on the training data set

The Classification Learner app reports the validation accuracy according to the validation scheme chosen when starting a new session in the app. The default setting in MATLAB R2018a is 5-fold cross-validation, so the accuracy reported in the app is the accuracy on each held-out validation fold after training on the other 4 folds.

When the model is exported to the workspace, it is trained using the full data set. As a result, when we predict on that same data set, the accuracy is very high. If we were to predict on unseen data, however, the accuracy would typically be lower.

To verify this, when loading the data into the Classification Learner app, you may set the 'Validation' option in the right-hand pane to 'No Validation'. After training, you should see that the accuracy reported in the app is much higher, often near 100%.
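The same comparison can be sketched at the command line. The example below is an assumption-laden illustration: it uses the iris data and a decision tree (fitctree) as stand-ins for the original data and model, and the variable names are made up for the example.

```matlab
% Sketch: accuracy on the training data vs. 5-fold cross-validated accuracy.
load fisheriris
rng(0);                                        % repeatable fold assignment

fullModel = fitctree(meas, species);           % trained on all observations

% Accuracy when predicting the same data the model was trained on
% (comparable to 'No Validation' in the app).
resubAccuracy = 1 - resubLoss(fullModel, 'LossFun', 'classiferror');

% Accuracy estimated from held-out folds (comparable to 5-fold validation).
cvModel = crossval(fullModel, 'KFold', 5);
cvAccuracy = 1 - kfoldLoss(cvModel, 'LossFun', 'classiferror');

fprintf('Resubstitution accuracy: %.3f\n', resubAccuracy);
fprintf('5-fold CV accuracy:      %.3f\n', cvAccuracy);
```

The resubstitution figure is usually the more optimistic of the two, since the model has already seen every observation it is asked to predict.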

You can also verify this by splitting your data into training and testing sets. Then, you may train the model in the app using only the training set and export the model. The accuracy of the exported model on the test set should be comparable to the accuracy reported in the app. I tried this myself by randomly splitting the data table using the following code:

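>> perm = randperm(2282);                    % random order of the 2282 row indices
>> trainData = dataTable(perm(1:2000),:);    % first 2000 shuffled rows for training
>> testData = dataTable(perm(2001:end),:);   % remaining 282 rows for testing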
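A self-contained variant of this check might look as follows. Since the original dataTable is not available, the sketch builds a stand-in table from the iris data, uses an 80/20 split, and trains a decision tree; all of these choices, and the column names, are assumptions for illustration only.

```matlab
% Sketch: compare cross-validated accuracy with accuracy on a held-out test set.
load fisheriris
rng(0);                                         % repeatable split

% Stand-in for the original dataTable (hypothetical column names).
dataTable = array2table(meas, 'VariableNames', {'SL', 'SW', 'PL', 'PW'});
dataTable.Species = categorical(species);

n = height(dataTable);
perm = randperm(n);                             % shuffle the row indices
nTrain = round(0.8 * n);
trainData = dataTable(perm(1:nTrain), :);
testData  = dataTable(perm(nTrain+1:end), :);

model = fitctree(trainData, 'Species');         % trained on the training rows only

% Validation accuracy, as the app would report for the training set.
cvAccuracy = 1 - kfoldLoss(crossval(model, 'KFold', 5));

% Accuracy of the same model on rows it has never seen.
testAccuracy = mean(predict(model, testData) == testData.Species);

fprintf('CV accuracy: %.3f, test accuracy: %.3f\n', cvAccuracy, testAccuracy);
```

With a small test set the two numbers will not match exactly, but they should be of similar magnitude, unlike the accuracy measured on the training rows themselves.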

Here is a bit more elaboration on how the app deals with the training data compared to how the generated code deals with the data:

When training a model in the app:

1. The model is trained on the full dataset (i.e. all observations). This model is the one that is exported, but it is never visible to the user in any way within the app. There is no time lag when you click the export button because the model object was already created at training time.
2. The app performs validation (cross-validation, holdout, etc.). This is done purely to get the validation accuracy value reported in the list of models, and to generate plots. The results of this validation step are not exported; you can only view them from within the app.

When exporting a model from the app, the user gets the former model, i.e. the one trained on the full dataset. The reasoning is that the validation accuracies (and plots) are there purely for choosing the type of model and its hyperparameters in a statistically valid way.

How the accuracy is computed:

1. For 5-fold cross-validation, let i = 1, …, n be the indices of all observations, and partition these indices into 5 validation sets V_k for k = 1, …, 5. Let T_k = {1, …, n} \ V_k, i.e. the training set consists of all points except those in the validation set V_k. Thus for each fold k we have the training-validation pair (T_k, V_k). We train on T_k and predict on V_k. This way we have predictions for all observations, because we have predictions on each V_k and the V_k together form a partition of all observations.
2. Computing the accuracy is then just a matter of comparing these cross-validation predictions with the observed responses.
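The two steps above can be written out as an explicit loop. This is a sketch under assumptions: it uses the iris data, a stratified partition from cvpartition, and a linear discriminant as the stand-in model; Tk and Vk mirror the notation T_k and V_k, and the remaining names are chosen for the example.

```matlab
% Sketch: manual 5-fold cross-validation with training sets T_k and validation sets V_k.
load fisheriris
rng(0);                                      % repeatable fold assignment
n = numel(species);
c = cvpartition(species, 'KFold', 5);        % folds of 30 observations each

cvPred = cell(n, 1);                         % one held-out prediction per observation
foldAccuracy = zeros(c.NumTestSets, 1);
for k = 1:c.NumTestSets
    Tk = training(c, k);                     % logical mask of T_k
    Vk = test(c, k);                         % logical mask of V_k
    model = fitcdiscr(meas(Tk, :), species(Tk));   % train on T_k
    cvPred(Vk) = predict(model, meas(Vk, :));      % predict on V_k
    foldAccuracy(k) = mean(strcmp(cvPred(Vk), species(Vk)));
end

% Accuracy from comparing all held-out predictions with the observed labels.
pooledAccuracy = mean(strcmp(cvPred, species));

% Average of the per-fold accuracies.
meanFoldAccuracy = mean(foldAccuracy);

fprintf('Pooled: %.4f, mean over folds: %.4f\n', pooledAccuracy, meanFoldAccuracy);
```

When all folds have the same size, as here, the pooled accuracy and the average of the per-fold accuracies coincide; with unequal fold sizes they can differ slightly.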

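How the confusion matrix is computed (and other plots showing prediction information): The same logic applies as above. Since we have a cross-validation prediction for each observation, we essentially call confusionmat(observedResponse, cvPredictedResponse) to get the confusion matrix. Note that confusionmat takes the known labels first and the predicted labels second, so rows correspond to the true class and columns to the predicted class.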
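A short sketch of that call, again on the iris data with a decision tree as an assumed stand-in model, shows how the confusion matrix relates to the accuracy and what happens if the two arguments are swapped. The explicit 'Order' argument is used here so that both matrices list the classes in the same sequence.

```matlab
% Sketch: confusion matrix from cross-validation predictions.
load fisheriris
rng(0);                                          % repeatable fold assignment
cvModel = fitctree(meas, species, 'KFold', 5);
cvPred = kfoldPredict(cvModel);                  % held-out prediction per observation

classNames = {'setosa', 'versicolor', 'virginica'};

% Known labels first, predicted labels second: rows = true class.
C = confusionmat(species, cvPred, 'Order', classNames);

% The diagonal holds the correctly classified observations.
accuracyFromC  = trace(C) / sum(C(:));
accuracyDirect = mean(strcmp(cvPred, species));

% Swapping the arguments yields the transposed matrix (rows = predicted class).
Cswapped = confusionmat(cvPred, species, 'Order', classNames);

disp(C);
fprintf('Accuracy from confusion matrix: %.4f\n', accuracyFromC);
```

The accuracy is unaffected by the argument order, but the off-diagonal entries change places, so the order matters when reading which classes are confused with which.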

Generated code: This emulates what the app does, i.e. you 'import' the dataset, preprocess it, train the model on the full dataset (which is returned), and perform validation (the validation accuracy is returned).

MATLAB: How to do cross-validation and calculate the optimum number of hidden neurons in the Neural Networks Toolbox 7.0.2 (R2011b)

The ability to use k-fold cross-validation instead of random partitioning, or to calculate the optimum number of hidden neurons during training, is not available in NNTOOL. As a workaround for cross-validation, you can use CVPARTITION in the Statistics Toolbox.
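One possible shape of this workaround is sketched below. It is not taken from the original answer: it assumes a recent release with patternnet, ind2vec and vec2ind available, uses the iris data as an example problem, and tries an arbitrary list of hidden-layer sizes (candidates). Each network keeps its default internal data division within the training fold.

```matlab
% Sketch: choose a hidden-layer size by k-fold cross-validation with cvpartition.
load fisheriris
rng(0);                                     % repeatable folds and weight initialisation

X = meas';                                  % 4x150, one column per sample
[~, ~, labelIdx] = unique(species);         % class index 1..3 per sample (150x1)
T = full(ind2vec(labelIdx'));               % 3x150 one-hot target matrix

candidates = [2 5 10];                      % hypothetical hidden-layer sizes to compare
c = cvpartition(species, 'KFold', 5);
cvError = zeros(size(candidates));

for i = 1:numel(candidates)
    nWrong = 0;
    for k = 1:c.NumTestSets
        trainIdx = training(c, k);
        testIdx  = test(c, k);

        net = patternnet(candidates(i));    % one hidden layer of the given size
        net.trainParam.showWindow = false;  % suppress the training GUI
        net = train(net, X(:, trainIdx), T(:, trainIdx));

        predIdx = vec2ind(net(X(:, testIdx)));               % predicted class index
        nWrong = nWrong + sum(predIdx ~= labelIdx(testIdx)');
    end
    cvError(i) = nWrong / c.NumObservations;   % held-out misclassification rate
end

[~, best] = min(cvError);
bestHidden = candidates(best);
fprintf('Hidden size with lowest CV error: %d\n', bestHidden);
```

Because network training starts from random weights, the selected size can vary between runs; repeating the procedure with several seeds gives a more stable picture.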
