As a beginner at machine learning, I wanted to work on a small project in which the dataset has only 80 rows and 5 columns. The dataset I am working with is related to a medical condition, with 4 columns as biomarkers, and the 5th column indicates whether the row (a patient) has the condition or not. So far, I have fitted the following 5 models (with accuracy and MCC scores):
- KNN (Accuracy: 43.5%, MCC: -0.164)
- Logistic Regression (Accuracy: 65.2%, MCC: 0.312)
- SVM (Accuracy: 60.9%, MCC: 0.214)
- Random Forest (Accuracy: 86.95%, MCC: 0.769)
- Decision Tree (Accuracy: 65.2%, MCC: 0.312)
I have used 5-fold cross-validation to prevent overfitting, and yet most of my models are underperforming. I was also considering ensembling and bootstrapping, but given these lackluster results, I am not sure how effective they would be. Do you have any tips concerning either:
- Better algorithms for small datasets
- Improvements I could make on the algorithms I have so far
- Another method (e.g. regularization)
Any help would be greatly appreciated.
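For reference, here is roughly how I compute the scores above. This is a minimal sketch using scikit-learn, with synthetic data standing in for my real 80-row, 4-biomarker dataset (the generated features are hypothetical placeholders):

```python
# Sketch: 5-fold stratified CV reporting accuracy and MCC.
# Synthetic data stands in for the real 80x4 biomarker dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data: 80 patients, 4 biomarker columns, binary condition label.
X, y = make_classification(n_samples=80, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)

acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mcc = cross_val_score(model, X, y, cv=cv,
                      scoring=make_scorer(matthews_corrcoef))
print(f"Accuracy: {acc.mean():.3f}, MCC: {mcc.mean():.3f}")
```

I report the mean score across the 5 folds for each model.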
Best Answer
You do not mention how the cutoff value was chosen. To compute accuracy and MCC, a cutoff must be applied to label each observation as an event or non-event. Did you use 0.5 as the cutoff for the probabilistic classifiers?
A 0.5 cutoff is not always optimal. There are often asymmetric costs/benefits for true positives, false positives, true negatives, and false negatives. A properly chosen cutoff balances these costs/benefits in the context of the problem. For example, if a routine blood test shows I have a 10% chance of cancer and the cancer is currently curable, perhaps I will choose to take a more advanced test or a second test for confirmation. Or if a bank lends money, perhaps lending $100 to someone with a 10% chance of paying it back is OK, but to lend $100K we want a 60% chance.
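When the costs of the two error types can be quantified, the break-even cutoff follows directly from an expected-cost comparison. A minimal sketch, with the costs as free parameters you would have to supply for your own problem:

```python
def optimal_cutoff(cost_fp, cost_fn):
    """Break-even probability cutoff under asymmetric error costs.

    Predicting "event" incurs cost_fp with probability (1 - p);
    predicting "non-event" incurs cost_fn with probability p.
    Predict "event" whenever (1 - p) * cost_fp <= p * cost_fn,
    i.e. whenever p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

# Equal costs recover the conventional 0.5 cutoff.
print(optimal_cutoff(1, 1))   # 0.5
# If missing an event is 9x as costly as a false alarm,
# the cutoff drops to 0.1 -- flag even low-probability cases.
print(optimal_cutoff(1, 9))   # 0.1
```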
Choosing the cutoff is where a subject-matter expert (SME) comes in. I understand you are experimenting, and hence do not have an SME available. My advice is to plot the predicted probabilities from the probabilistic classifiers, think through (or calculate) the cost/benefit of correct versus incorrect predictions (TP, FP, TN, FN) for your problem, and choose the cutoff that optimizes that calculation. Then compare the algorithms.
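This cutoff search can be sketched as a sweep over candidate thresholds, picking the one that minimizes total cost on held-out predictions. The costs and data below are hypothetical placeholders, assuming scikit-learn; substitute your real dataset and real cost estimates:

```python
# Sketch: sweep cutoffs on held-out predicted probabilities and pick
# the cutoff minimizing total misclassification cost.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical costs: missing the condition (FN) is worse than a false alarm.
COST_FP, COST_FN = 1.0, 5.0

# Placeholder data standing in for the 80x4 biomarker dataset.
X, y = make_classification(n_samples=80, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

cutoffs = np.linspace(0.05, 0.95, 19)
costs = [COST_FP * np.sum((proba >= c) & (y_te == 0))     # false positives
         + COST_FN * np.sum((proba < c) & (y_te == 1))    # false negatives
         for c in cutoffs]
best = cutoffs[int(np.argmin(costs))]
print(f"best cutoff: {best:.2f}, total cost: {min(costs):.1f}")
```

With an 80-row dataset, the held-out set is tiny, so in practice you would repeat this inside cross-validation rather than on a single split.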