SVM Classification – Does a Sparse Training Set Adversely Affect an SVM?

classification, sparse, svm

I'm trying to classify messages into different categories using an SVM. I've compiled a list of desirable words/symbols from the training set.

For each vector, which represents a message, I set the corresponding entry to 1 if the word is present:

"corpus" is: [mary, little, lamb, star, twinkle]

first message: "mary had a little lamb" -> [1 1 1 0 0]

second message: "twinkle little star" -> [0 1 0 1 1]
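This encoding is a binary bag-of-words model. For concreteness, here is a minimal sketch of it, assuming scikit-learn (the question doesn't name a library; any binary term-presence encoding works the same way):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fixed vocabulary matching the toy corpus above
vectorizer = CountVectorizer(
    vocabulary=["mary", "little", "lamb", "star", "twinkle"],
    binary=True,  # entry is 1 if the word is present, 0 otherwise
)

messages = ["mary had a little lamb", "twinkle little star"]
X = vectorizer.fit_transform(messages)  # returns a sparse matrix

print(X.toarray())
# [[1 1 1 0 0]
#  [0 1 0 1 1]]
```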

I think this is a fairly common setup for an SVM, but my question is: with thousands of words in the set, what if only 1-2 words per message actually show up? Is the linear dependence of my set of training vectors going to adversely affect the ability of the algorithm to converge?

Best Answer

Sparsity and linear dependence are two different things. Linear dependence means that some of the feature vectors are linear combinations of other feature vectors (or the same applied to the examples). In the setup you have described, I think exact linear dependence is unlikely: it would require, for instance, two terms to have proportional occurrence patterns across all documents. Simply having sparse features does not present any problem for the SVM. One way to see this is that you could apply a random rotation to the coordinate axes, which would leave the problem unchanged and give the same solution, yet would make the data completely non-sparse (this is in part how random projections work).
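You can check the rotation argument empirically. Below is a small experiment of my own (not from the original setup; all data and names are illustrative): train a linear SVM on very sparse binary features, then rotate the features with a random orthogonal matrix so they become fully dense, and observe that the fit is essentially unchanged:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Very sparse binary features: 1000 messages, 200 "words", ~2 words per message
X = (rng.random((1000, 200)) < 0.01).astype(float)
w_true = rng.normal(size=200)          # hypothetical ground-truth direction
y = (X @ w_true > 0).astype(int)

clf = LinearSVC(max_iter=10000).fit(X, y)
print("sparse data, training accuracy:", clf.score(X, y))

# Rotate with a random orthogonal matrix: the data become completely dense,
# but margins and the max-margin solution are preserved, so the SVM does
# just as well.
Q, _ = np.linalg.qr(rng.normal(size=(200, 200)))
clf_rot = LinearSVC(max_iter=10000).fit(X @ Q, y)
print("rotated dense data, training accuracy:", clf_rot.score(X @ Q, y))
```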

Also, it appears that you are talking about the SVM in the primal. Note that if you use a kernel SVM, a sparse dataset does not mean that the kernel matrix will be sparse. It may, however, be low rank. In that case you can actually take advantage of this fact for more efficient training (see, for example, "Efficient SVM Training Using Low-Rank Kernel Representations").
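One concrete way to exploit an approximately low-rank kernel matrix is a Nyström approximation. A hedged sketch using scikit-learn's Nystroem transformer (my choice of tool, not something the paper or the question prescribes; data are toy):

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = (rng.random((1000, 200)) < 0.01).astype(float)  # sparse binary features
y = (X.sum(axis=1) >= 2).astype(int)                 # toy labels

# Rank-100 Nystroem approximation of an RBF kernel, followed by a linear SVM
# on the approximate feature map: close to a full kernel SVM when the kernel
# matrix is (nearly) low rank, but much cheaper to train.
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0),
    LinearSVC(max_iter=10000),
)
print("training accuracy:", model.fit(X, y).score(X, y))
```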