Machine learning on small datasets

classification, machine-learning, medicine, small-sample

As a beginner at machine learning, I wanted to work on a small project in which the dataset has only 80 rows and 5 columns. The dataset is related to a medical condition, with 4 columns of biomarkers and a 5th column indicating whether the row (a patient) has the condition or not. So far, I have fitted the following 5 models (with accuracy and MCC scores):

KNN (Accuracy: 43.5%, MCC: -0.164)
Logistic Regression (Accuracy: 65.2%, MCC: 0.312)
SVM (Accuracy: 60.9%, MCC: 0.214)
Random Forest (Accuracy: 86.95%, MCC: 0.769)
Decision Tree (Accuracy: 65.2%, MCC: 0.312)

I have used 5-fold cross-validation to avoid overly optimistic estimates, and yet most of my models are underperforming. I was also considering ensembling and bootstrapping, but with these lacking results, I am not sure how effective they would be. Do you have any tips concerning either:

  1. Better algorithms for small datasets
  2. Improvements I could make on the algorithms I have so far
  3. Another method (e.g., regularization)

Any help would be greatly appreciated.
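
For reference, this is roughly my evaluation setup (a minimal sketch: the file name `biomarkers.csv`, the column order, and the default hyperparameters are placeholders for my actual data and settings):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("biomarkers.csv")      # 80 rows, 5 columns (placeholder path)
X, y = df.iloc[:, :4], df.iloc[:, 4]    # 4 biomarker columns, 1 label column

# Scale-sensitive models get a StandardScaler in front; tree-based ones do not.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Stratified 5-fold CV keeps the class balance in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mcc_scorer = make_scorer(matthews_corrcoef)
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    mcc = cross_val_score(model, X, y, cv=cv, scoring=mcc_scorer).mean()
    print(f"{name}: accuracy={acc:.3f}, MCC={mcc:.3f}")
```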

Best Answer

You did not mention how the cutoff value was chosen. To compute accuracy and MCC, a cutoff must be applied to mark each observation as an event or non-event. Did you use 0.5 as the cutoff for the probabilistic classifiers?

A 0.5 cutoff is not always optimal. There are often asymmetric costs/benefits for true positives, false positives, true negatives, and false negatives. A properly chosen cutoff balances these costs/benefits in the context of the problem. For example, if a routine blood test shows I have a 10% chance of cancer and the cancer is currently curable, perhaps I will choose to take a more advanced test, or a 2nd test for confirmation. Or if a bank lends money, perhaps lending $100 to someone with a 10% chance of paying it back is OK, but to lend $100K we want a 60% chance.

Choosing the cutoff is where a subject-matter expert (SME) comes in. I understand you are experimenting, and hence do not have an SME. My advice is to plot the predicted probabilities for the appropriate classifiers, think through (or calculate) the cost/benefit of correct versus incorrect predictions (TP, FP, TN, FN) for your problem, and choose the cutoff that optimizes that calculation. Then compare the algorithms.
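
As a minimal sketch of that search, assuming a binary 0/1 label and made-up misclassification costs (`COST_FP`, `COST_FN` are illustrative; replace them with your own cost/benefit numbers), using logistic regression as the probabilistic classifier:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# X, y as in the question; y assumed to be coded 0/1.
y_arr = np.asarray(y)

# Out-of-fold probabilities avoid the optimism of scoring on training data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y_arr,
                          cv=cv, method="predict_proba")[:, 1]

COST_FP, COST_FN = 1.0, 5.0   # assumption: a missed case is 5x worse
best_cutoff, best_cost = None, np.inf
for cutoff in np.linspace(0.05, 0.95, 19):
    pred = (proba >= cutoff).astype(int)
    fp = np.sum((pred == 1) & (y_arr == 0))
    fn = np.sum((pred == 0) & (y_arr == 1))
    cost = COST_FP * fp + COST_FN * fn
    if cost < best_cost:
        best_cutoff, best_cost = cutoff, cost
print(f"best cutoff = {best_cutoff:.2f}, total cost = {best_cost:.1f}")
```

With asymmetric costs like these, the optimal cutoff typically lands well below 0.5, which is exactly why comparing classifiers at a fixed 0.5 threshold can be misleading.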
