Solved – multiclass SVM classification (using R)

classification, cross-validation, machine learning, svm

I'm new to supervised classification. Here's my case:

I want to classify subjects into 3 classes: healthy, sick and intermediate. I've been asked to use an SVM to do the classification. I know how it works in principle (you need a training set, a test set, cross-validation, etc.), but I'm confused about which classes I should build the model with.

So, of course, I thought of fitting 3 separate, completely independent SVM classifiers: healthy/sick, healthy/intermediate and sick/intermediate, and then studying accuracy, sensitivity, specificity and AUC for each of the 3 classifiers.

Is that acceptable, or does it not work like that?

PS: I'm using the caret package.

Best Answer

As the very first point, I'd like to challenge your multi-class setup: your description (absence of disease [I guess you're talking about a somewhat specific condition], intermediate, [full] presence of disease) does not look like a classification problem to me at all, but rather like a continuum that is much better described by regression.
There are situations where clinical practice ties you to those convenient groups (even if Frank Harrell will tell you that this is BS and that you should not wantonly throw away the advantages of a regression for arbitrary class cuts, and he is completely right!).

If you decide to stay with classification, you need to answer a few basic questions about your application to decide how the model should be set up.

  • Do you need a discriminative classifier (assign any point of the sample space to one of the classes) or should "unknown" regions be kept as unknown?

  • Do you have a closed- or an open-world problem? I.e. are your classes mutually exclusive or not? Can one sample belong to more than one class? Can a sample belong to none of your classes (that's a variant of the first point)?
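
To make the two questions above concrete, here is a minimal sketch of a non-discriminative, open-world setup: one independent one-class SVM per group, so that a sample can be accepted by several models or by none. It is only an illustration under stated assumptions: it uses the e1071 package rather than caret, the data are synthetic, and the group centers and the `nu` value are arbitrary choices, not recommendations.

```r
library(e1071)  # assumed installed; provides svm() with type = "one-classification"

set.seed(42)
# Synthetic, purely illustrative data: two features, three groups
sim <- function(n, center) cbind(x1 = rnorm(n, center), x2 = rnorm(n, center))
train <- list(healthy      = sim(100, 0),
              intermediate = sim(100, 3),
              sick         = sim(100, 6))

# One independent one-class model per group: each model only describes
# "its" class, so classes need not be mutually exclusive or exhaustive
models <- lapply(train, function(x)
  svm(x, type = "one-classification", kernel = "radial", nu = 0.1))

# Ask every model separately whether it accepts each new sample
membership <- function(newx) {
  sapply(models, function(m) predict(m, newx))
}

new_samples <- rbind(typical_healthy = c(0, 0),
                     far_from_all    = c(50, 50))
colnames(new_samples) <- c("x1", "x2")
m <- membership(new_samples)
print(m)  # logical matrix: rows = samples, columns = group models
```

A row with no TRUE entry corresponds to an "unknown" sample, and a row with several TRUE entries to a sample that more than one group model accepts. A single discriminative multi-class SVM would instead force every sample into exactly one class.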

For medical applications, discriminative closed-world formulations can be sensible for differential diagnostic questions, but are hardly ever appropriate for other questions such as initial diagnosis or screening (e.g. hepatitis and broken bones are not mutually exclusive).

Looking at your data, one further question for model setup is whether you expect "intermediate" to also be in between the two other groups in data space or whether you expect intermediate to be something fundamentally different from sick. Or, the other way round: is sick just more of the same as intermediate, or do completely new things happen? (Or any combination of those).
E.g. in astrocytoma measurements by vibrational spectroscopy, we saw an axis in the data along which tissues are roughly ordered by tumor grade (same project as the papers below, but not directly visible in those figures), plus some additional "peculiarities".

"Intermediate" prediction may better be modeled as a region on an axis from normal to sick rather than as its own proper class.

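The idea of treating "intermediate" as a region on an axis from normal to sick can be sketched as follows. This is not the analysis from the project mentioned above, only an illustration under stated assumptions: it uses the e1071 package rather than caret, the data are synthetic with a hypothetical severity feature `x1`, and the cut points at the SVM margin are arbitrary.

```r
library(e1071)  # assumed installed; the original question uses caret instead

set.seed(1)
n <- 100
# Synthetic, purely illustrative data: x1 is a hypothetical "severity" axis,
# x2 is pure noise
sim <- function(center, label) {
  data.frame(x1 = rnorm(n, mean = center), x2 = rnorm(n), class = label)
}
d <- rbind(sim(0, "healthy"), sim(2, "intermediate"), sim(4, "sick"))

# Fit a binary linear SVM on the two extreme groups only
extremes <- d[d$class != "intermediate", ]
extremes$class <- factor(extremes$class)
fit <- svm(class ~ x1 + x2, data = extremes, kernel = "linear")

# Use the SVM decision value as a continuous score for all samples
pred <- predict(fit, d[, c("x1", "x2")], decision.values = TRUE)
d$score <- as.numeric(attr(pred, "decision.values"))

# The sign convention depends on the internal label order, so orient the
# score such that "sick" lies on the high end
med <- tapply(d$score, d$class, median)
if (med["sick"] < med["healthy"]) {
  d$score <- -d$score
  med <- -med
}
print(round(med, 2))

# Hypothetical cut points at the SVM margin (+/- 1); real thresholds
# would need clinical justification
d$region <- cut(d$score, breaks = c(-Inf, -1, 1, Inf),
                labels = c("healthy", "intermediate", "sick"))
print(table(true = d$class, assigned = d$region))
```

If the intermediate samples really lie between the other two groups in data space, their scores fall between the healthy and sick scores even though the model never saw them during training. If they do not, that suggests "intermediate" is something different rather than just less of "sick".
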
I've been in a similar situation before, and here's what happened: