Solved – multiclass SVM classification (using R)

Tags: classification, cross-validation, machine learning, svm

I'm new to supervised classification. Here's my case:

I want to classify subjects into 3 classes: healthy, sick and intermediate. I've been asked to use SVM to do the classification. I know how it works (you need a training set, a testing set, cross-validation, etc.), but I'm confused about which classes I should build the model with.

So, of course, I thought of doing 3 separate, completely independent SVM classifications: SVM healthy/sick, SVM healthy/intermediate, SVM sick/intermediate. Then I would study accuracy, sensitivity, specificity and AUC for each of the 3 separate SVM classifications.

Is that acceptable, or does it not work like that?

PS: I'm using the caret package

Best Answer

As the very first point, I'd like to challenge your multi-class setup: your description of absence of disease [I guess you're talking about a somewhat specific condition] - intermediate - [full] presence of disease does not look to me like a classification problem at all, but rather like a continuum that is much better described by regression.
There are situations where clinical practice ties you to those convenient groups (even if Frank Harrell will tell you that this is BS, and that you should not wantonly throw away the advantages of regression for arbitrary class cuts - and he is completely right!).

If you decide to stay with classification, you need to answer a few basic questions about your application to decide how the model should be set up.

Do you need a discriminative classifier (assign any point of the sample space to one of the classes) or should "unknown" regions be kept as unknown?
Do you have a closed- or an open-world problem? I.e. are your classes mutually exclusive or not? Can one sample belong to more than one class? Can a sample belong to none of your classes (that's a variant of the first point)?
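
To make the difference between these setups concrete, here is a minimal sketch of the two decision rules. It assumes you already have one independent score per class (e.g. from one classifier per class); the function names, the class names and the 0.5 threshold are purely illustrative assumptions, not anything prescribed by SVM or caret.

```python
def assign_open_world(scores, threshold=0.5):
    """Return every class whose independent score reaches the threshold.

    An empty list means "unknown" (the sample belongs to none of the
    classes); several entries mean the classes are not mutually exclusive.
    The threshold value is an illustrative assumption.
    """
    return sorted(label for label, s in scores.items() if s >= threshold)


def assign_closed_world(scores):
    """Discriminative closed-world rule: always pick exactly one class."""
    return max(scores, key=scores.get)


# Hypothetical scores for one sample that does not clearly fit any class.
sample = {"healthy": 0.2, "intermediate": 0.3, "sick": 0.1}
print(assign_open_world(sample))    # nothing reaches the threshold
print(assign_closed_world(sample))  # still forced into one class
```

The closed-world rule will always produce a label, even for a sample that resembles none of the training classes, which is exactly why you need to decide up front whether that behaviour is what your application wants.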

For medical applications, discriminative closed-world formulations can be sensible for differential diagnostic questions, but are hardly ever appropriate for other questions (like initial diagnosis or screening: e.g. hepatitis and broken bones are not mutually exclusive).

Looking at your data, one further question for model setup is whether you expect "intermediate" to also be in between the two other groups in data space or whether you expect intermediate to be something fundamentally different from sick. Or, the other way round: is sick just more of the same as intermediate, or do completely new things happen? (Or any combination of those).
E.g. in astrocytoma measurements by vibrational spectroscopy, we saw an axis in the data along which tissues are roughly ordered by tumor grade (same project as the papers below, but not directly visible in those figures), plus some additional "peculiarities".

"Intermediate" prediction may better be modeled as a region on an axis from normal to sick rather than as its own proper class.
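
As a rough sketch of that idea: suppose a regression model gives you a continuous severity score per subject. "Intermediate" then becomes the region between two cutoffs on that axis. The function name and the cutoff values below are hypothetical; in practice the cutoffs would have to come from clinical requirements or validation data.

```python
def label_from_score(score, lower, upper):
    """Map a continuous severity score to one of three ordered regions.

    Scores below `lower` are "healthy", scores from `lower` up to (but not
    including) `upper` are "intermediate", and everything else is "sick".
    """
    if lower > upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    if score < lower:
        return "healthy"
    if score < upper:
        return "intermediate"
    return "sick"


# Hypothetical scores and cutoffs, for illustration only.
for s in (0.1, 0.5, 0.9):
    print(s, label_from_score(s, lower=0.3, upper=0.7))
```

The advantage over three unordered classes is that the model keeps the ordering healthy < intermediate < sick, and you can move the cutoffs later without refitting anything.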

I've been in a similar situation before, and here are two points from that experience:

SVMs also exist as one-class classifiers.
Sensitivity, specificity and similar measures apply quite naturally to situations with multiple independent classes. IMHO this should not be a surprise, considering that all but a very few differential diagnostic questions in medicine are naturally open-world multi-class questions, and sensitivity & Co. are successfully used for such diagnostic tools.
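
To illustrate how these measures carry over to several classes, here is a minimal sketch that treats each class in turn as "positive" and everything else as "negative" (one-vs-rest). The labels and predictions are made-up toy data, and the function name is my own; as far as I know, caret's `confusionMatrix()` reports comparable per-class values in R.

```python
def per_class_sens_spec(y_true, y_pred, labels):
    """Compute one-vs-rest sensitivity and specificity for each class."""
    results = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        tn = sum(t != label and p != label for t, p in pairs)
        # Guard against classes that never occur (or always occur).
        sens = tp / (tp + fn) if tp + fn else float("nan")
        spec = tn / (tn + fp) if tn + fp else float("nan")
        results[label] = (sens, spec)
    return results


# Toy data, for illustration only.
y_true = ["healthy", "healthy", "intermediate", "intermediate", "sick", "sick"]
y_pred = ["healthy", "intermediate", "intermediate", "sick", "sick", "sick"]

for label, (sens, spec) in per_class_sens_spec(
    y_true, y_pred, ["healthy", "intermediate", "sick"]
).items():
    print(f"{label}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Each class gets its own sensitivity/specificity pair, so nothing forces you to collapse the problem into a single accuracy number.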
