SVM – Mixing Categorical and Continuous Data in Support Vector Machines

categorical data, mixed type data, svm

I have a dataset like

+--------+------+-------------------+
| income | year |        use        |
+--------+------+-------------------+
|  46328 | 1989 | COMMERCIAL EXEMPT |
|  75469 | 1998 | CONDOMINIUM       |
|  49250 | 1950 | SINGLE FAMILY     |
|  82354 | 2001 | SINGLE FAMILY     |
|  88281 | 1985 | SHOP & HOUSE      |
+--------+------+-------------------+

I embed it into a LIBSVM-format vector space, one-hot encoding the use column:

+1 1:46328 2:1989 3:1
-1 1:75469 2:1998 4:1
+1 1:49250 2:1950 5:1
-1 1:82354 2:2001 5:1
+1 1:88281 2:1985 6:1

Feature indices:

  • 1 is "income"
  • 2 is "year"
  • 3 is "use/COMMERCIAL EXEMPT"
  • 4 is "use/CONDOMINIUM"
  • 5 is "use/SINGLE FAMILY"
  • 6 is "use/SHOP & HOUSE"

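Concretely, an encoding like this can be produced with something along the lines of the following sketch (assuming pandas and scikit-learn; note that OneHotEncoder orders categories alphabetically, so the indicator indices may not match the hand-assigned 3–6 above):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.datasets import dump_svmlight_file

    df = pd.DataFrame({
        "income": [46328, 75469, 49250, 82354, 88281],
        "year":   [1989, 1998, 1950, 2001, 1985],
        "use":    ["COMMERCIAL EXEMPT", "CONDOMINIUM", "SINGLE FAMILY",
                   "SINGLE FAMILY", "SHOP & HOUSE"],
    })
    y = np.array([1, -1, 1, -1, 1])  # the +1/-1 labels from the example

    # One-hot encode the categorical column; each category gets its own
    # indicator column. OneHotEncoder sorts categories alphabetically, so the
    # indicator order may differ from the hand-numbered indices above.
    enc = OneHotEncoder()
    use_onehot = enc.fit_transform(df[["use"]]).toarray()

    # Continuous features first, then the indicator block, written out in
    # LIBSVM format with 1-based feature indices.
    X = np.hstack([df[["income", "year"]].to_numpy(dtype=float), use_onehot])
    dump_svmlight_file(X, y, "houses.libsvm", zero_based=False)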
Is it OK to train a support vector machine (SVM) with a mix of continuous (year, income) and categorical (use) data like this?

Best Answer

Yes! But maybe not in the way you mean. In my research I frequently create categorical features from continuously-valued ones using an algorithm like recursive partitioning. I usually use this approach with the SVMLight implementation of support vector machines, but I've used it with LibSVM as well. You do need to make sure each partitioned categorical feature is assigned to the same position in the feature vector during both training and classification; otherwise the indicators won't line up and the model will be meaningless.

Edit: That is to say, when I've done this, I assign the first n elements of the vector to the binary values produced by the recursive partitioning. In binary feature modeling you just have a giant vector of 0s and 1s, so everything looks the same to the model unless you explicitly keep track of which positions correspond to which features. This is probably overly specific, as I imagine most SVM implementations will do this on their own, but if you like to program your own, it might be something to think about!
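Here is a rough sketch of the idea, using scikit-learn's decision tree as a stand-in for the recursive-partitioning step (my actual pipeline is built around SVMLight/LibSVM, and the names below are purely illustrative); the point is that each leaf of the partition gets a fixed position in the binary block:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Toy data: one continuous feature (income) and the +1/-1 labels.
    income = np.array([[46328], [75469], [49250], [82354], [88281]], dtype=float)
    labels = np.array([1, -1, 1, -1, 1])

    # Recursive partitioning: a shallow tree splits income into a few intervals.
    tree = DecisionTreeClassifier(max_depth=2).fit(income, labels)

    # Map each leaf of the tree to a fixed position in the binary feature block,
    # so training and classification vectors always line up.
    leaf_ids = sorted(set(tree.apply(income)))
    leaf_to_pos = {leaf: i for i, leaf in enumerate(leaf_ids)}

    def binarize(x):
        """Indicator block for one income value, at fixed positions."""
        vec = np.zeros(len(leaf_to_pos))
        leaf = tree.apply(np.array([[x]], dtype=float))[0]
        vec[leaf_to_pos[leaf]] = 1.0
        return vec

    print(binarize(50000.0))  # one-hot over the learned income intervals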