Solved – Dealing with imbalanced data-set and cross-validation

classification, cross-validation, dataset, MATLAB, pattern recognition

I have a data set of brain tumours: 700 malignant and 225 benign. I want to build a classification model using an SVM to classify the tumour types based on this data. My first question: is this considered an imbalanced dataset? If so, should I undersample the malignant class?

Also, is it correct to use the code below to do cross-validation on my dataset? NOTE: groups is the vector of instance labels (sorted: malignant 0s, then benign 1s), and data is the feature matrix of the instances.

k = 10;
cp = classperf(groups);                      %# performance tracker, initialised with the true labels
cvFolds = crossvalind('Kfold', groups, k);   %# assign every instance to one of k folds
for i = 1:k
    testIdx = (cvFolds == i);                %# logical index of test instances
    trainIdx = ~testIdx;                     %# logical index of training instances
    svmModel = fitcsvm(data(trainIdx,:), groups(trainIdx), ...
        'Standardize', true, 'KernelFunction', 'RBF', 'KernelScale', 'auto');
    pred = predict(svmModel, data(testIdx,:));   %# predict on the held-out fold
    cp = classperf(cp, pred, testIdx);       %# accumulate this fold's results
end

I still don't understand how crossvalind works. Does it guarantee that each fold contains instances from both classes?
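One way to check the fold composition empirically is to count the labels that land in each fold. The following is a minimal sketch, not part of the original post: it assumes the Bioinformatics Toolbox (for crossvalind), uses synthetic labels with the class sizes given above, and the variable names inFold and counts are made up for illustration.

```matlab
% Synthetic labels matching the question: 700 malignant (0), 225 benign (1).
groups = [zeros(700, 1); ones(225, 1)];
k = 10;

% Pass the label vector itself, as in the question, rather than just the
% number of instances, so that crossvalind can see the class membership.
cvFolds = crossvalind('Kfold', groups, k);

% counts(i, :) = [number of malignant, number of benign] in fold i
counts = zeros(k, 2);
for i = 1:k
    inFold = (cvFolds == i);
    counts(i, 1) = sum(groups(inFold) == 0);
    counts(i, 2) = sum(groups(inFold) == 1);
end
disp(counts)
```

If the folds are split per class, each row should show roughly 70 malignant and 22 or 23 benign instances; a row with a zero in either column would mean that fold is missing a class.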

Best Answer

The fact that you are bringing up the issue of balance means that you have not considered the fact that proportion "classified" "correctly" is a discontinuous improper accuracy scoring rule. If you use a proper scoring rule (e.g., Brier score or pseudo $R^2$) the issue goes away. See this and this for more.
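For reference, the Brier score is the mean squared difference between the predicted probability and the observed outcome. In notation that is not in the original answer, with $N$ tumours, $y_i \in \{0, 1\}$ the observed label (1 = benign, as in the question's coding) and $p_i$ the predicted probability that tumour $i$ is benign:

$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2$$

Lower is better, and 0 is a perfect score. As a point of comparison, always predicting the base rate $p_i = 225/925 \approx 0.243$ gives $\mathrm{BS} = 0.243 \times 0.757 \approx 0.184$ on this data set. Note that this assumes the model outputs probabilities; raw SVM scores would first have to be converted to probabilities.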
