Solved – Logistic Regression\SVM implementation in Mahout

I am currently working on sentiment analysis of Twitter data for a telecom company. I load the data into HDFS and use Mahout's Naive Bayes classifier to predict each tweet's sentiment as positive, negative, or neutral.

Here is what I am doing:

I provide training data to the machine (key: sentiment, value: text).

Using the Mahout library, I create feature vectors by computing the TF-IDF (term frequency-inverse document frequency) of the text:

mahout seq2sparse -i /user/root/new_model/dataseq --maxDFPercent 1000000 --minSupport 4 --maxNGramSize 2 -a org.apache.lucene.analysis.WhitespaceAnalyzer -o /user/root/new_model/predicted

I split the data into a training set and a testing set.

I pass the training feature vectors to the Naive Bayes algorithm to build a model:

mahout trainnb -i /user/root/new_model/train-vectors -el -li /user/root/new_model/labelindex -o /user/root/new_model/model -ow -c

Using this model, I predict the sentiment of new data.
This is a very simple implementation, and it gives me very low accuracy even though I have a good training set. So I was thinking of switching to logistic regression or SVM, because they give better results for this kind of problem.

So my question is: how can I use these two algorithms to build my model and predict the sentiment of tweets? What steps do I need to follow?

Best Answer

For text it is common, if not mandatory, to do some preprocessing steps that you don't mention: lowercasing, stemming, and stop-word removal. For Twitter, special care should be taken with hashtags (you probably want to keep them as hashtags, not as standard words).
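As a rough illustration of these preprocessing steps, here is a minimal Python sketch (not Mahout code). The function name `preprocess_tweet`, the tiny stop-word list, and the token pattern are all assumptions made for the example; stemming is left out because it would normally come from a dedicated library.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "i", "my", "it"}

# Try to match a hashtag first so "#fail" stays one token, distinct from "fail".
TOKEN_RE = re.compile(r"#\w+|\w+(?:'\w+)?")


def preprocess_tweet(text):
    """Lowercase the text, tokenize it, and drop stop words.

    Hashtags are kept with their leading '#', so they remain separate
    features from the plain word. Stemming is intentionally omitted.
    """
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess_tweet("The network is DOWN again #fail"))
```

The resulting token lists would then be fed to whatever vectorization step you already use, in place of the raw whitespace-split text.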

I am sure that linear SVMs will do better than Naive Bayes; I am not sure about logistic regression. In terms of methodology, you follow the same approach as you did with Naive Bayes: build feature vectors, train a model on the training set, and predict on new data.
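To make the "same approach, different classifier" point concrete, here is a small pure-Python sketch of binary logistic regression trained with stochastic gradient descent on bag-of-words features. It is not Mahout's implementation; the function names, learning rate, epoch count, and L2 strength are assumptions chosen for illustration, and it handles only two classes (a three-way positive/negative/neutral setup would need, for example, one such model per class).

```python
import math
from collections import Counter


def featurize(tokens):
    """Bag-of-words counts; TF-IDF weights could be substituted here."""
    return Counter(tokens)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, bias, feats):
    """Linear score: bias plus the weighted sum of feature values."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())


def train_logistic(examples, epochs=200, lr=0.1, l2=0.001):
    """Train on a list of (tokens, label) pairs with label in {0, 1}.

    Hyperparameter defaults are illustrative assumptions, not tuned values.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for tokens, label in examples:
            feats = featurize(tokens)
            # Gradient of the log-loss with respect to the linear score.
            error = sigmoid(score(weights, bias, feats)) - label
            bias -= lr * error
            for f, v in feats.items():
                w = weights.get(f, 0.0)
                # Gradient step with a small L2 penalty on the weight.
                weights[f] = w - lr * (error * v + l2 * w)
    return weights, bias


def predict_proba(model, tokens):
    """Probability that the tokens belong to class 1."""
    weights, bias = model
    return sigmoid(score(weights, bias, featurize(tokens)))
```

The pipeline is unchanged from the Naive Bayes one: only the training and prediction functions differ. A linear SVM can be trained in the same loop by replacing the log-loss gradient with a hinge-loss gradient.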

You should pay attention to the performance measure you use. Do you really want accuracy? Accuracy can easily be misleading in these problems. It is probably better to use precision and recall for each sentiment of interest. For example, with 95% precision and 50% recall for a given sentiment, I detect 50% of the tweets that carry that sentiment, and 95% of the tweets I flag are correct.
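Per-sentiment precision and recall can be computed directly from the true and predicted labels. The sketch below is plain Python; the function name `precision_recall` and the label strings are assumptions made for the example.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one class label.

    precision = TP / (TP + FP): of the tweets predicted as `label`,
                the fraction that really are `label`.
    recall    = TP / (TP + FN): of the tweets that really are `label`,
                the fraction that were predicted as `label`.
    Returns 0.0 for a measure whose denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == label and t == label)
    fp = sum(1 for t, p in pairs if p == label and t != label)
    fn = sum(1 for t, p in pairs if p != label and t == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    y_true = ["pos", "pos", "neg", "neu", "pos", "neg"]
    y_pred = ["pos", "neu", "neg", "neu", "pos", "pos"]
    for label in ("pos", "neg", "neu"):
        print(label, precision_recall(y_true, y_pred, label))
```

Reporting these numbers per sentiment shows, for instance, whether a model that looks accurate overall is simply predicting the majority class.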
