Accuracy vs F-measure
First of all, when you use a metric, you should know how to game it. Accuracy measures the fraction of correctly classified instances across all classes. This means that if one class occurs far more often than the others, the overall accuracy is dominated by the accuracy on that majority class. In your case, a model M that simply predicts "neutral" for every instance achieves an accuracy of
$acc=\frac{\text{neutral}}{\text{neutral} + \text{positive} + \text{negative}}=0.9188$
Good, but useless.
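To make the point concrete, here is a minimal sketch of such a constant "neutral" predictor, scored with both accuracy and per-class F-measure. The class counts are hypothetical, chosen only so that the ratio reproduces 0.9188; they are not your actual data.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_per_class(y_true, y_pred):
    """F1 for each class, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Hypothetical counts, picked only to reproduce the 0.9188 ratio.
y_true = ["neutral"] * 9188 + ["positive"] * 500 + ["negative"] * 312
# The "model" M: always predict the majority class.
y_pred = ["neutral"] * len(y_true)

acc = accuracy(y_true, y_pred)
f1 = f1_per_class(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)
print(acc, f1, macro_f1)
```

The accuracy looks excellent, while the F-measure for "positive" and "negative" is zero, which is exactly why the F-measure is harder to game with a constant prediction.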
So the addition of features clearly improved NB's ability to differentiate the classes, but by now predicting "positive" and "negative" it misclassifies some neutrals, and hence the accuracy goes down (roughly speaking). This behavior is independent of NB.
More or fewer features?
In general, it is not better to use more features, but to use the right features. More features are better only insofar as a feature selection algorithm then has more candidates from which to find the optimal subset (I suggest exploring the feature-selection tag on Cross Validated). When it comes to NB, a fast and solid (but less than optimal) approach is to use InformationGain(Ratio) to sort the features in decreasing order and select the top k.
Again, this advice (except InformationGain) is independent of the classification algorithm.
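The top-k ranking described above can be sketched as follows. This is an illustrative implementation of plain information gain for discrete features, not of any particular library; the function names and the toy data are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(labels) minus the expected entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def top_k_features(X, y, k):
    """Return the column indices of the k features with the highest gain."""
    n_features = len(X[0])
    gains = [information_gain([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: gains[j], reverse=True)
    return ranked[:k]

# Toy data: feature 0 separates the classes perfectly, features 1 and 2 do not.
X = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
y = ["pos", "pos", "neg", "neg"]
selected = top_k_features(X, y, k=1)
print(selected)
```

Note that this ranks each feature on its own, so redundant features can all end up near the top; that is one reason the approach is fast but less than optimal.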
EDIT 27.11.11
There has been a lot of confusion about how bias and variance bear on selecting the right number of features. I therefore recommend reading the first pages of this tutorial: Bias-Variance tradeoff. The key points are:
- High bias means that the model is less than optimal, i.e. the test error is high (underfitting, as Simone puts it).
- High variance means that the model is very sensitive to the sample used to build it. The error then depends strongly on the training set used, and hence the error (evaluated across different cross-validation folds) will vary widely (overfitting).
The learning curves plotted do indeed indicate the bias, since the error is plotted. What you cannot see, however, is the variance, since the confidence interval of the error is not plotted at all.
Example: When performing a 3-fold cross-validation 6 times (yes, repetition with different data partitioning is recommended; Kohavi suggests 6 repetitions), you get 18 values. I would now expect that ...
- With a small number of features, the average error (bias) will be lower; however, the variance of the error (of the 18 values) will be higher.
- With a high number of features, the average error (bias) will be higher, but the variance of the error (of the 18 values) will be lower.
This behavior of the error/bias is exactly what we see in your plots. We cannot make a statement about the variance. That the curves are close to each other can be an indication that the test set is big enough to show the same characteristics as the training set, and hence that the measured error may be reliable, but this is (at least as far as I understand it) not sufficient to make a statement about the variance (of the error!).
When adding more and more training examples (keeping the size of the test set fixed), I would expect the variance of both approaches (small and high number of features) to decrease.
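A rough sketch of the repeated cross-validation from the example (3 folds, 6 repetitions, 18 error values) and of how one could report both the mean and the variance of the error. Everything here is illustrative: the toy labels and the majority-class "model" are placeholders standing in for your data and for NB.

```python
import random
import statistics

def repeated_kfold_indices(n_samples, n_folds=3, n_repeats=6, seed=0):
    """Yield (train_idx, test_idx) pairs; each repetition reshuffles the data."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, test_idx

def majority_error(train_labels, test_labels):
    """Placeholder model: predict the most frequent training label."""
    majority = max(set(train_labels), key=train_labels.count)
    return sum(label != majority for label in test_labels) / len(test_labels)

# Toy labels, for illustration only.
labels = ["neutral"] * 80 + ["positive"] * 12 + ["negative"] * 8

errors = []
for train_idx, test_idx in repeated_kfold_indices(len(labels)):
    train = [labels[i] for i in train_idx]
    test = [labels[i] for i in test_idx]
    errors.append(majority_error(train, test))

mean_error = statistics.mean(errors)          # what the learning curve shows
error_variance = statistics.variance(errors)  # what the learning curve hides
print(len(errors), mean_error, error_variance)
```

Plotting the mean together with an interval derived from the spread of these 18 values, for each feature-set size, would make the variance visible alongside the bias.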
Oh, and do not forget to calculate the information gain for feature selection using only the data in the training sample! It is tempting to run feature selection on the complete data and only then partition the data and apply cross-validation, but this leaks information from the test folds into the model and leads to overfitting (an overly optimistic error estimate). I do not know what you did; this is just a warning one should never forget.
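The pattern can be sketched like this: the selection step is called inside the loop and sees only the training part of each fold. The `select` and `fit_and_score` callables are hypothetical stand-ins for your information-gain ranking and your NB training/evaluation; the demo stubs below do nothing useful and exist only to make the sketch runnable.

```python
def cross_validate_with_selection(X, y, folds, select, fit_and_score, k):
    """Run cross-validation with feature selection redone inside every fold.

    folds: list of (train_idx, test_idx) pairs.
    select(X_train, y_train, k) -> list of column indices.
    fit_and_score(X_train, y_train, X_test, y_test) -> error on the test part.
    """
    errors = []
    for train_idx, test_idx in folds:
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # The ranking sees the training part of this fold only.
        columns = select(X_train, y_train, k)
        X_train_sel = [[row[j] for j in columns] for row in X_train]
        # The test part is merely projected onto the chosen columns.
        X_test_sel = [[X[i][j] for j in columns] for i in test_idx]
        y_test = [y[i] for i in test_idx]
        errors.append(fit_and_score(X_train_sel, y_train, X_test_sel, y_test))
    return errors

# Demo with trivial stand-ins (hypothetical, for illustration only).
def first_k_columns(X_train, y_train, k):
    return list(range(k))

def always_zero_error(X_train, y_train, X_test, y_test):
    return 0.0

X_demo = [[i, i + 1] for i in range(6)]
y_demo = [0, 1, 0, 1, 0, 1]
folds_demo = [([0, 1, 2, 3], [4, 5]), ([2, 3, 4, 5], [0, 1]), ([0, 1, 4, 5], [2, 3])]
demo_errors = cross_validate_with_selection(
    X_demo, y_demo, folds_demo, first_k_columns, always_zero_error, k=1
)
print(demo_errors)
```

The selected columns may differ from fold to fold; that is expected, since the selection is part of the model-building procedure being evaluated.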
Best Answer
The Python package sdt_metrics by Roger Lew implements several non-parametric response bias measures. Unfortunately, the package is not maintained, but the references are still useful.
One of these references is an empirical study comparing five response bias measures:
Among the non-parametric response bias measures, they recommend $B''_D$: $$B''_D = \frac{(1-H)(1-FA)-(H)(FA)}{(1-H)(1-FA)+(H)(FA)}$$
Note that care must be taken when $H$ (hit rate) or $FA$ (false alarm rate) are at their boundaries:
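As a minimal sketch of the formula and of where it breaks down: the denominator is zero when $H = 1$ and $FA = 0$, or when $H = 0$ and $FA = 1$, so the measure is undefined there. Returning NaN in that case is an assumption made for this illustration, not something taken from sdt_metrics or from the study, and the function name is made up.

```python
import math

def b_double_prime_d(hit_rate, false_alarm_rate):
    """B''_D response bias computed from a hit rate and a false alarm rate."""
    h, fa = hit_rate, false_alarm_rate
    if not (0.0 <= h <= 1.0 and 0.0 <= fa <= 1.0):
        raise ValueError("rates must lie in [0, 1]")
    numerator = (1 - h) * (1 - fa) - h * fa
    denominator = (1 - h) * (1 - fa) + h * fa
    if denominator == 0:
        # Undefined at (H=1, FA=0) and (H=0, FA=1); NaN is an assumed convention.
        return math.nan
    return numerator / denominator

print(b_double_prime_d(0.9, 0.5))
print(b_double_prime_d(1.0, 0.0))
```

How such boundary cases should be corrected before computing the measure is a separate decision; the sketch only flags them.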