Accuracy vs F-measure
First of all, when you use a metric you should know how to game it. Accuracy measures the ratio of correctly classified instances across all classes. That means that if one class occurs far more often than another, the resulting accuracy is dominated by the accuracy on that majority class. In your case, if one constructs a model M which simply predicts "neutral" for every instance, the resulting accuracy will be
$acc=\frac{neutral}{(neutral + positive + negative)}=0.9188$
Good, but useless.
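To make this concrete, here is a minimal sketch of that majority-class baseline. The class counts are made up to give roughly the 92% neutral share above; a macro-averaged F-measure exposes what accuracy hides:

```python
# Toy majority-class baseline: counts are invented to give ~92% "neutral".
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array(["neutral"] * 9188 + ["positive"] * 500 + ["negative"] * 312)
y_pred = np.array(["neutral"] * len(y_true))          # model M: always predict "neutral"

print(accuracy_score(y_true, y_pred))                 # ~0.92 -- looks impressive
print(f1_score(y_true, y_pred, average="macro"))      # ~0.32 -- exposes the useless model
```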
So the addition of features clearly improved the power of NB to differentiate the classes, but by predicting "positive" and "negative" it misclassifies neutrals, and hence the accuracy goes down (roughly speaking). This behavior is independent of NB.
More or fewer features?
In general it is not better to use more features, but to use the right features. More features are better only insofar as a feature-selection algorithm then has more choices to find the optimal subset (I suggest exploring the feature-selection tag on Cross Validated). When it comes to NB, a fast and solid (but less than optimal) approach is to use InformationGain(Ratio) to sort the features in decreasing order and select the top k.
Again, this advice (except InformationGain) is independent of the classification algorithm.
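As a rough sketch of this top-k selection: mutual information is used here as a stand-in for information gain, and the toy data, k, and the choice of MultinomialNB are illustrative assumptions, not your setup.

```python
# Rank features by mutual information (stand-in for information gain) and keep the top k.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 50))             # toy term-count matrix (200 docs, 50 terms)
y = rng.choice(["neg", "neu", "pos"], size=200)    # toy labels

k = 10
selector = SelectKBest(score_func=mutual_info_classif, k=k).fit(X, y)
top_features = selector.get_support(indices=True)  # indices of the k highest-scoring features
clf = MultinomialNB().fit(selector.transform(X), y)
```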
EDIT 27.11.11
There has been a lot of confusion regarding bias and variance when selecting the correct number of features. I therefore recommend reading the first pages of this tutorial: Bias-Variance tradeoff. The key essence is:
- High bias means that the model is less than optimal, i.e. the test error is high (underfitting, as Simone puts it).
- High variance means that the model is very sensitive to the sample used to build it. That means that the error depends strongly on the training set used, and hence the variance of the error (evaluated across different cross-validation folds) will differ greatly (overfitting).
The learning curves you plotted do indeed indicate the bias, since the error is plotted. However, what you cannot see is the variance, since the confidence interval of the error is not plotted at all.
Example: When performing 3-fold cross-validation 6 times (yes, repetition with different data partitioning is recommended; Kohavi suggests 6 repetitions), you get 18 values; a small sketch of this setup follows the list below. I would now expect that ...
- With a small number of features, the average error (bias) will be higher, but the variance of the error (of the 18 values) will be lower.
- With a high number of features, the average error (bias) will be lower, but the variance of the error (of the 18 values) will be higher.
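A minimal sketch of this repeated cross-validation, assuming a feature matrix X and labels y as in the earlier sketch; the 18 per-fold errors give both the average (the bias indicator) and the spread (the variance):

```python
# 6 repetitions of 3-fold cross-validation -> 18 error values.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=6, random_state=0)
scores = cross_val_score(MultinomialNB(), X, y, cv=cv, scoring="accuracy")
errors = 1.0 - scores                              # one error per fold, 18 in total

print("mean error (bias indicator):", errors.mean())
print("variance of the error:      ", errors.var(ddof=1))
```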
This behavior of the error/bias, as described in the list above, is exactly what we see in your plots. We cannot make a statement about the variance. That the curves are close to each other can be an indication that the test set is big enough to show the same characteristics as the training set, and hence that the measured error may be reliable, but this is (at least as far as I understood it) not sufficient to make a statement about the variance (of the error!).
When adding more and more training examples (keeping the size of the test set fixed), I would expect the variance of both approaches (small and high number of features) to decrease.
Oh, and do not forget to calculate the information gain for feature selection using only the data in the training sample! One is tempted to use the complete data for feature selection and only then partition the data and apply cross-validation, but this leads to overfitting. I do not know what you did; this is just a warning one should never forget.
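One way to make the correct procedure hard to get wrong is to put the selector and the classifier into a single pipeline, so selection is re-fitted on each training fold. This is only a sketch, reusing the illustrative X and y from above:

```python
# Keep feature selection inside each training fold by wrapping it in a Pipeline,
# so SelectKBest never sees the held-out part of any fold.
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("select", SelectKBest(score_func=mutual_info_classif, k=10)),
    ("nb", MultinomialNB()),
])

# Correct: the selector is re-fitted on the training portion of every fold.
# (Fitting SelectKBest on all of X first and only then cross-validating the
# classifier is exactly the leak warned about above.)
scores = cross_val_score(pipe, X, y, cv=3)
```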
Assigning all patterns to the negative class certainly is not a "weird result". It could be that the Bayes optimal classifier always assigns every pattern to the majority class, in which case your classifier is doing exactly what it should do. If the density of patterns belonging to the positive class never exceeds the density of patterns belonging to the negative class, then the negative class is more likely regardless of the attribute values.
The thing to do in such circumstances is to consider the relative importance of false-positive and false-negative errors; it is rare in practice that the costs of the two types of error are the same. So determine the loss for false-positive and false-negative errors and take these into account when setting the threshold probability (differing misclassification costs are equivalent to changing the prior probabilities, so this is easy to implement for naive Bayes). I would recommend tuning the priors to minimise the cross-validation estimate of the loss (incorporating your unequal misclassification costs).
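A rough sketch of this prior-tuning idea, assuming a binary 0/1 problem with numpy arrays X and y and made-up misclassification costs; GaussianNB is used only as a placeholder for whatever NB variant you have:

```python
# Tune the class prior of naive Bayes to minimise a cross-validated, cost-weighted loss.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

cost_fp, cost_fn = 1.0, 5.0             # hypothetical costs of the two error types

def cv_loss(prior_pos):
    clf = GaussianNB(priors=[1.0 - prior_pos, prior_pos])
    pred = cross_val_predict(clf, X, y, cv=5)
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    return cost_fp * fp + cost_fn * fn

grid = np.linspace(0.05, 0.95, 19)      # candidate priors for the positive class
best_prior = min(grid, key=cv_loss)
```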
If your misclassification costs are equal, and your training-set priors are representative of operational conditions, then, assuming that your implementation is correct, it is possible that you already have the best NB classifier.
Best Answer
I would have thought the "best" (most intuitive) estimate of error would be the probability that the classification is incorrect, or, alternatively and equivalently, the odds against the class. Using the Wikipedia example, you classify as male or female. I would have thought the estimate of accuracy should be:
$$O(Male|evidence)=\frac{P(Male|evidence)}{P(Female|evidence)}$$
So you would report the data in terms of "the evidence gives odds O:1 in favour of this classification"
In a problem with more than 2 classes, you should report the "worst" odds ratio. That is, given classes $C_{i}\;\;(i=1,\dots,R+1)$, one (and only one) of which is assumed to be true, suppose you classify an observation as $C_{R+1}$; then you would report odds of:
$$O(C_{R+1}|evidence)=\frac{P(C_{R+1}|evidence)}{\max_{i\in\{1,\dots,R\}}P(C_{i}|evidence)}$$
So you would report the data in terms of "the evidence gives odds O:1 in favour of this classification against the next best alternative"
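As a small illustration of reporting these odds, here is a sketch that works with any classifier exposing predict_proba; the helper name is mine, not from any library:

```python
# Report the odds in favour of the chosen class against the next best alternative.
import numpy as np

def odds_vs_next_best(proba):
    """proba: (n_samples, n_classes) array of predicted class probabilities."""
    part = np.sort(proba, axis=1)
    return part[:, -1] / part[:, -2]   # "O : 1 in favour of this classification"

# e.g. odds = odds_vs_next_best(clf.predict_proba(X_test))
```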
EDIT/UPDATE: In response to the new part of the question, let us put some numbers into the calculation, using the first 15 observations to "train" the classifier. Because you are dealing with normal distributions, you only need the sufficient statistics, which in this case are:
$$ \begin{pmatrix} \hat{\mu}_{1} & \hat{\mu}_{2} \\ \hat{\nu}_{1} & \hat{\nu}_{2} \\ \hat{\sigma}_{1} & \hat{\sigma}_{2} \\ \hat{\tau}_{1} & \hat{\tau}_{2} \end{pmatrix} = \begin{pmatrix} 0.3798 & 0.6275 \\ -0.1449 & 1.6748 \\ 0.8367 & 0.8685 \\ 0.8200 & 0.5451 \end{pmatrix} $$
Where the subscript denotes the class, $\mu$ and $\sigma$ denote the mean and standard deviation of the first attribute, and $\nu$ and $\tau$ denote the mean and standard deviation of the second attribute. You could include correlation, but it is "safer" not to unless you know correlations actually exist. As these are just numbers to me, I have no reason to suppose them to be dependent, so I will not constrain them by forcing a dependence assumption.
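If it helps, here is a sketch of how those sufficient statistics could be computed, assuming the first 15 rows are in an array X_train with class labels in labels; using ddof=0 (the MLE standard deviation) is an assumption chosen so that the $\sqrt{(n+1)/(n-1)}$ scale factors used below apply directly to these values:

```python
# Per-class sufficient statistics from the first 15 training rows.
# X_train is assumed to be a (15, 2) array, labels an array of 1s and 2s.
import numpy as np

stats = {}
for c in (1, 2):
    Xc = X_train[labels == c]
    stats[c] = {"mean": Xc.mean(axis=0),          # (mu_hat, nu_hat) for class c
                "sd": Xc.std(axis=0, ddof=0),     # (sigma_hat, tau_hat) for class c
                "n": len(Xc)}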
Now you need a decision rule in order to classify an observation into 1 class or the other. The one I was suggesting was to use the odds ratio. So we need to calculate the probability of belonging to class 1, given the training data (denoted by D), the prior information (denoted by I), and the sample to test (denoted by y):
$$p(C_{1}|y,D,I)=\frac{p(C_{1}|D,I)p(y|C_{1},D,I)}{p(y|D,I)} \rightarrow O(C_{1}|y,D,I)=\frac{p(C_{1}|D,I)p(y|C_{1},D,I)}{p(C_{2}|D,I)p(y|C_{2},D,I)}$$
Where $p(C_{1}|D,I) = \frac{8}{15}$, assuming complete initial ignorance (because I am ignorant prior to seeing the data). If you knew it was possible for both categories to occur prior to observing the training data, then the probability would be given by the rule of succession: $\frac{9}{17}$.
$p(y|C_{1},D,I)$ is the posterior predictive distribution for class 1, given by:
$$p(y|C_{1},D,I)=\int p(y|\mu_{1},\nu_{1},\sigma_{1},\tau_{1},I) p(\mu_{1},\nu_{1},\sigma_{1},\tau_{1}|D,I)d\mu_{1}d\nu_{1}d\sigma_{1}d\tau_{1}$$
Now I will just give the posterior assuming complete ignorance; it is not hard to derive: you use the prior $p(\mu_{1},\nu_{1},\sigma_{1},\tau_{1}|I)\propto\frac{1}{\sigma_{1}\tau_{1}}$ and do the necessary integrals. The result is a product of two Student t densities (denoted by $St(x|\mu,\sigma,df)$):
$$p(y|C_{1},D,I)=St(y_{1}|\hat{\mu}_{1},\hat{\sigma}_{1}\sqrt{\frac{8+1}{8-1}},8-1)St(y_{2}|\hat{\nu}_{1},\hat{\tau}_{1}\sqrt{\frac{8+1}{8-1}},8-1)$$
Where $y_j$ is the value of attribute $j$ for the new data point. This should make it fairly obvious how it would generalise to more than two attributes. Similarly, we have $p(C_{2}|D,I) = \frac{7}{15}$ assuming ignorance, or $\frac{8}{17}$ using the rule of succession. The posterior predictive is:
$$p(y|C_{2},D,I)=St(y_{1}|\hat{\mu}_{2},\hat{\sigma}_{2}\sqrt{\frac{7+1}{7-1}},7-1)St(y_{2}|\hat{\nu}_{2},\hat{\tau}_{2}\sqrt{\frac{7+1}{7-1}},7-1)$$
And so the final odds ratio is given by:
$$O(C_{1}|y,D,I)=\frac{8}{7} \times \frac{St(y_{1}|\hat{\mu}_{1},\hat{\sigma}_{1}\sqrt{\frac{9}{7}},7)}{St(y_{1}|\hat{\mu}_{2},\hat{\sigma}_{2}\sqrt{\frac{8}{6}},6)} \times \frac{St(y_{2}|\hat{\nu}_{1},\hat{\tau}_{1}\sqrt{\frac{9}{7}},7)}{St(y_{2}|\hat{\nu}_{2},\hat{\tau}_{2}\sqrt{\frac{8}{6}},6)}$$
I think you will agree that this number is sensible by any criterion. It "goes in all the right directions", and it appropriately accounts for the uncertainty in estimating the parameters of the model. Inserting these densities gives:
$$\frac{8}{7} \times\frac{\hat{\sigma}_{2}\hat{\tau}_{2}}{\hat{\sigma}_{1}\hat{\tau}_{1}} \times\left[\frac{\frac{\Gamma(4)}{\Gamma(\frac{7}{2})\sqrt{7}}}{\frac{\Gamma(\frac{7}{2})}{\Gamma(3)\sqrt{6}}}\right]^2 \frac{ \left[1+\frac{1}{8}\left(\frac{y_{1}-\hat{\mu}_{2}}{\hat{\sigma}_{2}}\right)^2 \right]^{\frac{7}{2}} \left[1+\frac{1}{8}\left(\frac{y_{2}-\hat{\nu}_{2}}{\hat{\tau}_{2}}\right)^2 \right]^{\frac{7}{2}} }{ \left[1+\frac{1}{9}\left(\frac{y_{1}-\hat{\mu}_{1}}{\hat{\sigma}_{1}}\right)^2 \right]^{\frac{8}{2}} \left[1+\frac{1}{9}\left(\frac{y_{2}-\hat{\nu}_{1}}{\hat{\tau}_{1}}\right)^2 \right]^{\frac{8}{2}} } $$
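For completeness, here is a sketch of evaluating the odds ratio $O(C_{1}|y,D,I)$ numerically with scipy's Student t density, reusing the hypothetical `stats` dictionary from the earlier sketch; it implements the formula above, not any particular library routine:

```python
# Evaluate O(C1 | y, D, I): prior odds times the ratio of posterior predictive densities.
import numpy as np
from scipy.stats import t

def predictive_pdf(y, s):
    """Posterior predictive density of the attribute vector y under one class:
    a product of Student t densities with n-1 df and scale sd*sqrt((n+1)/(n-1))."""
    n = s["n"]
    scale = s["sd"] * np.sqrt((n + 1) / (n - 1))
    return np.prod(t.pdf(y, df=n - 1, loc=s["mean"], scale=scale))

def odds_class1(y):
    prior_odds = stats[1]["n"] / stats[2]["n"]        # 8/7 under the ignorance prior
    return prior_odds * predictive_pdf(y, stats[1]) / predictive_pdf(y, stats[2])

# e.g. odds_class1(np.array([-0.4049, -0.3981]))
```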
Now, in order to make a decision, you need to think about the consequences of making a wrong classification. Is it worse to classify a $1$ as a $2$ than to classify a $2$ as a $1$? If not, then the cut-off is simply $O(C_{1}|y,D,I)>1$: such observations are classed as $1$, otherwise as $2$. The cut-off slides up or down depending on which type of error is more important to get right. Although for this particular example, the classifier is so good that you hardly need to bother about this.
The table below shows the odds for each of the testing observations. Interestingly, attribute 2 is what drives most of the changes in odds - attribute 1 does not appear to be as useful. In fact, you can see that attribute 1 actually introduces more uncertainty in identifying group 2. This is obvious when you note that the mean and variance of attribute 1 are basically the same for each group (whereas the mean and variance of attribute 2 are quite different between the two groups):
$$ \begin{array}{cc|cc|c} y_{1} & y_{2} & O(C_{1}|y_{1},y_{2},D,I) & O(C_{1}|y_{2},D,I) & \text{True Class} \\ \hline -0.4049 & -0.3981 & 47.0 & 37.0 & 1 \\ -0.0913 & 2.2094 & 0.109 & 0.089 & 2 \\ 0.3376 & -1.0467 & 102.4 & 93.5 & 1 \\ 0.3455 & 2.496 & 0.104 & 0.095 & 2 \\ 0.3232 & -0.5614 & 54.6 & 49.7 & 1 \end{array} $$