Solved – Do categorical variables have to be dummy coded in SVM

caret, categorical data, r, svm

I am using R with the packages kernlab / caret and doing some analysis with SVM (ksvm).
I am using a radial basis function (RBF) kernel for classification.

I have a few categorical variables which are set as factors in R, so they are internally represented as distinct integers.

Say I have a categorical variable with 3 levels. Can I just leave it as is (levels 1, 2, 3) and let the SVM handle it automatically, or do I have to dummy code it into two columns like so:

x0     x1
 0      0          = level 1
 0      1          = level 2
 1      0          = level 3

etc?
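To make the two options concrete, here is a small sketch in base R. The factor `f` and its level names are made up for illustration. It puts the integer codes R stores for a factor next to the default treatment (dummy) coding, and computes the Euclidean distances between levels under each representation, since distance is what an RBF kernel works with.

```r
# Hypothetical 3-level factor, for illustration only
f <- factor(c("l1", "l2", "l3"))

# Integer codes that R stores internally for the factor levels
codes <- as.integer(f)

# Default treatment contrasts: two 0/1 columns, first level is the baseline
dummies <- contrasts(f)

# Pairwise Euclidean distances between the three levels
d_codes <- as.matrix(dist(codes))    # based on the raw integer codes
d_dummy <- as.matrix(dist(dummies))  # based on the dummy columns

print(codes)
print(dummies)
print(d_codes)
print(d_dummy)
```

With raw integer codes, level 1 is twice as far from level 3 as from level 2, which implies an ordering the categories may not have. The dummy columns do not impose that ordering.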

I looked in the documentation, and it sounds like this is handled automatically if you use the formula interface (which I do):

"If the predictor variables include factors, the formula interface
must be used to get a correct model matrix."

Does this mean that, as long as I use the formula interface, dummy coding happens for me behind the scenes?

Best Answer

Yes. If you look at the model.matrix documentation, you will see that when a model is specified through a formula, factor variables are dummy coded automatically. If you want a different coding, you can specify explicitly how factor variables are handled through the contrasts options.
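As a rough illustration of what `model.matrix` does with a factor, here is a minimal base R sketch. The data frame `df` and its columns `y`, `f` and `x` are invented for the example and are not from the question.

```r
# Toy data (hypothetical): a class label, one 3-level factor, one numeric predictor
df <- data.frame(
  y = factor(c("a", "b", "a", "b", "a", "b")),
  f = factor(c("l1", "l2", "l3", "l1", "l2", "l3")),
  x = c(0.1, 0.4, 0.3, 0.9, 0.5, 0.7)
)

# Default behaviour: the factor is expanded into 0/1 indicator columns
# using treatment contrasts, with the first level as the baseline
mm <- model.matrix(y ~ f + x, data = df)
print(colnames(mm))

# Explicitly requesting a different coding through contrasts.arg
mm_sum <- model.matrix(y ~ f + x, data = df,
                       contrasts.arg = list(f = "contr.sum"))
print(colnames(mm_sum))
```

With kernlab installed, the same formula could be passed as `ksvm(y ~ f + x, data = df, kernel = "rbfdot")`. The exact columns `ksvm` builds internally may differ from this sketch, for example in how the intercept is handled, so treat the output above as an illustration of the coding rather than a dump of what `ksvm` sees.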

Hope that helped!!