Solved – Sentence classification with SVM

classification, machine learning, svm, text mining

I'm trying to solve the following problem: I want to classify each row of a particular machine log as relevant or non-relevant, so that only the interesting information is output. I have collected a dataset from such logs and created a bag of words with ~11000 features. I'm trying to figure out the best approach for this classification. I was thinking of using an SVM because, as far as I know, it handles high-dimensional input pretty well, but since each feature vector would have only 3-4 positive values and the remaining ~11000 features set to 0, I doubt it will work well.

Here's an example of 3 rows from my dataset:

Notification Minor FPGA Status hw_node:0 Added Node: 1 // relevant
Notification Minor Battery failed test hw_node:1,hw_battery:0 Auto-resolve NEMOE event by Sysmgr at INIT  //relevant
Notification Minor CLI command error sw_cli {3paradm super all {{0 8}} -1 172.16.30.60 10741} {Command: startprog 0 ifconfig fcnet2 Error: user permission denied} {} // non-relevant

I intend to scan each word in the row and create a feature vector, which I would pass to the SVM for classification.
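As a rough illustration of that step, here is a minimal sketch assuming plain whitespace tokenization. The helper names `build_vocabulary` and `to_sparse_vector` are hypothetical, and the trailing relevant/non-relevant annotations have been stripped from the rows. It stores only the nonzero counts, so the thousands of zero features never need to be materialized.

```python
from collections import Counter

# Two of the example rows, without their trailing labels.
rows = [
    "Notification Minor FPGA Status hw_node:0 Added Node: 1",
    "Notification Minor Battery failed test hw_node:1,hw_battery:0 "
    "Auto-resolve NEMOE event by Sysmgr at INIT",
]


def build_vocabulary(rows):
    """Map each distinct whitespace-separated token to a column index."""
    vocab = {}
    for row in rows:
        for token in row.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_sparse_vector(row, vocab):
    """Return {column index: count}, keeping only the nonzero entries."""
    counts = Counter(token for token in row.split() if token in vocab)
    return {vocab[token]: n for token, n in counts.items()}


vocab = build_vocabulary(rows)
vectors = [to_sparse_vector(row, vocab) for row in rows]

for vector in vectors:
    print(f"{len(vector)} nonzero entries out of {len(vocab)} features")
```

With this representation, the memory needed per row grows with the number of tokens in the row, not with the size of the vocabulary.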

My question is: is this a good way to handle such a problem? I haven't tested anything yet, as I'm quite limited in computational resources at the moment and each test would take a while to complete.

Best Answer

Using a linear SVM for such a task is a sound idea. Linear SVMs are very fast to train, and they give you a first result against which you can check any other approach. As for resources, the cost depends not only on the number of features and the algorithm, but also on the implementation.

scikit-learn provides efficient implementations for sparse features, which suits your problem at hand.

How you train your SVM also plays a role. Stochastic gradient descent needs fewer resources than second-order methods:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
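A minimal sketch of how this could look in scikit-learn, assuming the three example rows from the question stand in for a real training set. The `token_pattern`, `lowercase`, and `random_state` settings are illustrative assumptions rather than tuned choices. `SGDClassifier` with `loss="hinge"` fits a linear SVM by stochastic gradient descent, and `CountVectorizer` produces a sparse matrix, so the zero features are never stored.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# The three example rows from the question, with labels split out.
rows = [
    "Notification Minor FPGA Status hw_node:0 Added Node: 1",
    "Notification Minor Battery failed test hw_node:1,hw_battery:0 "
    "Auto-resolve NEMOE event by Sysmgr at INIT",
    "Notification Minor CLI command error sw_cli {3paradm super all {{0 8}} "
    "-1 172.16.30.60 10741} {Command: startprog 0 ifconfig fcnet2 "
    "Error: user permission denied} {}",
]
labels = ["relevant", "relevant", "non-relevant"]

model = make_pipeline(
    # Assumption: treat every whitespace-separated chunk as one token.
    CountVectorizer(token_pattern=r"\S+", lowercase=False),
    # Hinge loss makes this a linear SVM trained with SGD.
    SGDClassifier(loss="hinge", random_state=0),
)
model.fit(rows, labels)

# Inspect the sparse bag-of-words matrix built by the first pipeline step.
features = model.named_steps["countvectorizer"].transform(rows)
print(features.shape, "with", features.nnz, "stored nonzero values")

predictions = model.predict(rows)
print(list(predictions))
```

On a real log, the fit would use a labeled training split and the predictions would be checked on held-out rows. For data that does not fit in memory, `SGDClassifier.partial_fit` can be fed in batches.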