Solved – How to validate sentiment classification and compare different algorithms

classification, cross-validation, machine learning, sentiment analysis, svm

I need to compare SVM and Naive Bayes (NB) for sentiment classification by evaluating accuracy, precision, and recall.
I have 1500 manually classified documents, and I would like to know the best way to compare these two algorithms while also increasing the training set size from 100 to 1000 documents.
I'm using scikit-learn, which offers several splitting strategies, such as KFold, ShuffleSplit, StratifiedKFold, and StratifiedShuffleSplit. Which is the right one for my needs?

My approach is to use ShuffleSplit with a fixed random seed to first shuffle the documents and then draw 10 iterations of X training documents and 500 test documents, where X varies from 100 to 1000.

So, I have:

  • cycle 1) 10 iterations with a training set of 100 docs and a test set of 500 docs;
  • cycle 2) 10 iterations with a training set of 200 docs and a test set of 500 docs;
  • …
  • cycle 10) 10 iterations with a training set of 1000 docs and a test set of 500 docs;

However, this approach changes the test set every time: is this a correct approach?
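
In code, the idea looks roughly like this (a sketch, assuming a vectorized feature matrix X with labels y and the current sklearn.model_selection API):

# 10 cycles; in each cycle, 10 random train/test splits of the given sizes.
from sklearn.model_selection import ShuffleSplit

for train_size in range(100, 1100, 100):            # cycles 1..10
    splitter = ShuffleSplit(n_splits=10,             # 10 iterations per cycle
                            train_size=train_size,
                            test_size=500,
                            random_state=42)          # fixed random seed
    for train_idx, test_idx in splitter.split(X):
        X_train, y_train = X[train_idx], y[train_idx]
        X_test, y_test = X[test_idx], y[test_idx]
        # fit SVM / NB here and record accuracy, precision, recall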

EDIT:

I found many papers that suggest stratified splitting, so that every training and test split preserves the class proportions of the dataset. I also found that repeated cross-validation can give more accurate estimates.

So the approach I propose is to repeat every cycle of the previous approach 10 times (10×10), using StratifiedShuffleSplit instead of ShuffleSplit.
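
Again as a sketch (same assumptions as above), the 10×10 version only changes the splitter and adds a repetition loop with a different seed each time:

# Each cycle is repeated 10 times; StratifiedShuffleSplit keeps the class
# proportions of y in every train/test split.
from sklearn.model_selection import StratifiedShuffleSplit

for train_size in range(100, 1100, 100):
    for repeat in range(10):
        splitter = StratifiedShuffleSplit(n_splits=10,
                                          train_size=train_size,
                                          test_size=500,
                                          random_state=repeat)  # vary the seed per repetition
        for train_idx, test_idx in splitter.split(X, y):
            ...  # train and evaluate as before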

Do you think I'm doing this correctly?

Best Answer

The best generic way (more on that later) to compare the effectiveness of different classifiers on your data set is with receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC). See this paper. Briefly, these metrics depend on the false positive and true positive rates (FPR/TPR), so you don't need to take special care to build a 50/50 positive/negative data set or deal with the associated problems.

To make ROC curves in sklearn, first have your classifiers return the predicted probabilities:

from sklearn import svm  # SVC lives in the svm module

classifier1 = svm.SVC(kernel='linear', probability=True)
probas_1 = classifier1.fit(X_train, y_train).predict_proba(X_test)

Repeat for classifier2, classifier3, etc.
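
For instance, the two k-NN models labelled in the plot below could serve as classifier2 and classifier3 (illustrative choices; any estimator exposing predict_proba works the same way):

# Hypothetical second and third classifiers for comparison
from sklearn.neighbors import KNeighborsClassifier

classifier2 = KNeighborsClassifier(n_neighbors=3)
probas_2 = classifier2.fit(X_train, y_train).predict_proba(X_test)

classifier3 = KNeighborsClassifier(n_neighbors=21)
probas_3 = classifier3.fit(X_train, y_train).predict_proba(X_test)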

Next, use sklearn functions to find the fpr, tpr, and auc:

# Compute the ROC curve and the area under it
from sklearn.metrics import roc_curve, auc
fpr1, tpr1, thresholds1 = roc_curve(y_test, probas_1[:, 1])
roc_auc1 = auc(fpr1, tpr1)

Again, repeat this for each classifier you are preparing.
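
With the hypothetical classifier2 and classifier3 above, that gives the variables used by the plotting code below:

# Same computation for the remaining classifiers
fpr2, tpr2, thresholds2 = roc_curve(y_test, probas_2[:, 1])
roc_auc2 = auc(fpr2, tpr2)

fpr3, tpr3, thresholds3 = roc_curve(y_test, probas_3[:, 1])
roc_auc3 = auc(fpr3, tpr3)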

Then plot:

# Plot the ROC curves
import matplotlib.pyplot as pl  # the original snippet relied on the pylab alias "pl"

pl.clf()
pl.plot(fpr1, tpr1, label='SVC (area = %0.2f)' % roc_auc1)
pl.plot(fpr2, tpr2, label='KNN(3) (area = %0.2f)' % roc_auc2)
pl.plot(fpr3, tpr3, label='KNN(21) (area = %0.2f)' % roc_auc3)
pl.plot([0, 1], [0, 1], 'k--')
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic example')
pl.legend(loc="lower right")
pl.show()

[Figure: comparing three classifiers with ROC curves and AUC]

Each point in ROC space corresponds to a classifier and a threshold setting (in other words, you might be better off counting > 65% as one class rather than the default > 50%). The perfect classifier is the point (0, 1), the upper left-hand corner. The dashed line represents all the random classifiers, where the probability of predicting positive is assigned at random (lower left is 0%, upper right is 100%). Thus, meaningful classifiers lie above and to the left of the dashed line.

You can also collect an AUC value each time you use a new cross-validation split. The best classifier will then be the one whose AUC distribution has the higher mean or median.
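
A rough sketch of that idea, reusing the stratified splitter from the question (X, y, and the specific classifier choices are assumptions; MultinomialNB expects non-negative features such as term counts):

# Collect one AUC per split for each classifier, then compare the distributions.
import numpy as np
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

aucs = {'SVM': [], 'NB': []}
splitter = StratifiedShuffleSplit(n_splits=10, test_size=500, random_state=0)
for train_idx, test_idx in splitter.split(X, y):
    models = {'SVM': svm.SVC(kernel='linear', probability=True),
              'NB': MultinomialNB()}
    for name, clf in models.items():
        clf.fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]
        aucs[name].append(roc_auc_score(y[test_idx], scores))

# Compare the mean (or median) AUC across the splits
print({name: np.mean(vals) for name, vals in aucs.items()})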

All that being said, you really want to look at your own cost/benefit function to decide which classifier to use. For your particular problem, quantify what you gain from a true positive and a true negative, and quantify whatever bad things happen when you get a false positive or a false negative. The best operating point (which classifier, and where to set the threshold) will depend on the application as seen through your cost/benefit function.
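
As a toy illustration of that last point (the costs below are made up, and y_test/probas_1 are the variables from earlier, with labels assumed to be 0/1), you could scan thresholds on a held-out set and keep the one with the lowest total cost:

# Hypothetical cost/benefit: penalize false positives and false negatives
# differently and pick the probability threshold that minimizes total cost.
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FP, COST_FN = 1.0, 5.0   # illustrative costs, not taken from the question

def total_cost(y_true, scores, threshold):
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return COST_FP * fp + COST_FN * fn

thresholds = np.linspace(0.05, 0.95, 19)
best_threshold = min(thresholds, key=lambda t: total_cost(y_test, probas_1[:, 1], t))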