Accuracy vs F-measure
First of all: when you use a metric, you should know how to game it. Accuracy measures the ratio of correctly classified instances across all classes. That means that if one class occurs much more often than the others, the resulting accuracy is dominated by the accuracy on that class. In your case, a model M that simply predicts "neutral" for every instance achieves an accuracy of
$acc = \frac{\text{neutral}}{\text{neutral} + \text{positive} + \text{negative}} = 0.9188$
Good, but useless.
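To see this baseline in R (the class counts here are made up; only the 0.9188 ratio is taken from above):

# Hypothetical counts reproducing the ~92% neutral share
n <- c(neutral = 9188, positive = 500, negative = 312)
unname(n["neutral"] / sum(n))  # accuracy of always predicting "neutral": 0.9188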
So the added features clearly improved the power of NB to differentiate the classes, but by predicting "positive" and "negative" it now misclassifies some neutrals, and hence the accuracy goes down (roughly speaking). This behavior is independent of NB.
More or fewer features?
In general it is not better to use more features, but to use the right features. More features are better only insofar as a feature-selection algorithm then has more choices for finding the optimal subset (I suggest exploring the feature-selection tag on Cross Validated). When it comes to NB, a fast and solid (but less than optimal) approach is to use InformationGain(Ratio) to sort the features in decreasing order and select the top k.
Again, this advice (except the InformationGain part) is independent of the classification algorithm.
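As a sketch of the InformationGain route in R (assuming the FSelector package; train and class are placeholder names for your data frame and label column):

library(FSelector)
weights <- information.gain(class ~ ., data = train)  # or gain.ratio(class ~ ., train)
topk    <- cutoff.k(weights, k = 50)                  # names of the top-50 features
f       <- as.simple.formula(topk, "class")           # class ~ feat1 + feat2 + ...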
EDIT 27.11.11
There has been a lot of confusion regarding bias and variance when selecting the correct number of features. I therefore recommend reading the first pages of this tutorial: Bias-Variance tradeoff. The key essence is:
- High bias means that the model is less than optimal, i.e. the test error is high (underfitting, as Simone puts it).
- High variance means that the model is very sensitive to the sample used to build it. The error then depends strongly on the particular training set, and hence the variance of the error (evaluated across different crossvalidation folds) will differ extremely (overfitting).
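For squared loss, this is the classical decomposition of the expected test error (a standard identity, stated here for reference):

$E\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(f(x) - E[\hat{f}(x)]\right)^2}_{\text{Bias}^2} + \underbrace{E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]}_{\text{Variance}} + \sigma^2$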
The learning curves you plotted do indeed indicate the bias, since the error is plotted. What you cannot see, however, is the variance, since the confidence interval of the error is not plotted at all.
Example: When performing a 3-fold crossvalidation 6 times (yes, repetition with different data partitioning is recommended; Kohavi suggests 6 repetitions), you get 18 values. I would now expect that:
- with a small number of features, the average error (bias) will be higher, but the variance of the error (across the 18 values) will be lower;
- with a high number of features, the average error (bias) will be lower, but the variance of the error (across the 18 values) will be higher.
This behavior of the error/bias is exactly what we see in your plots. We cannot make a statement about the variance. That the curves lie close to each other can be an indication that the test set is big enough to show the same characteristics as the training set, and hence that the measured error may be reliable; but this is (at least as far as I understand it) not sufficient to make a statement about the variance (of the error!).
When adding more and more training examples (keeping the size of the test set fixed), I would expect the variance of both approaches (small and high number of features) to decrease.
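As a concrete, self-contained illustration of the 18-value setup (6 repetitions of 3-fold CV with naive Bayes from the e1071 package, run on the stock iris data rather than the OP's):

library(e1071)
set.seed(42)
errs <- as.vector(replicate(6, {                 # 6 repetitions ...
  folds <- sample(rep(1:3, length.out = nrow(iris)))
  sapply(1:3, function(k) {                      # ... of 3-fold CV
    fit <- naiveBayes(Species ~ ., data = iris[folds != k, ])
    mean(predict(fit, iris[folds == k, ]) != iris$Species[folds == k])
  })
}))
mean(errs)  # average error: the bias picture
sd(errs)    # spread of the 18 estimates: the variance picture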
Oh, and do not forget to calculate the infogain for feature selection using only the data in the training sample! One is tempted to run feature selection on the complete data, then partition it and apply the crossvalidation, but this leads to overfitting. I do not know what you did; this is just a warning one should never forget.
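A sketch of doing this correctly, with the infogain recomputed inside each fold (again FSelector + e1071 on iris, purely for illustration):

library(FSelector)
library(e1071)
set.seed(1)
folds <- sample(rep(1:3, length.out = nrow(iris)))
errs <- sapply(1:3, function(k) {
  tr <- iris[folds != k, ]
  te <- iris[folds == k, ]
  w     <- information.gain(Species ~ ., data = tr)  # training part of the fold only!
  feats <- cutoff.k(w, 2)                            # keep the top-2 features
  fit   <- naiveBayes(tr[, feats], tr$Species)
  mean(predict(fit, te[, feats]) != te$Species)
})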
I wrote a function for computing these evaluation metrics, based on an exercise in the book Data Mining with R:
# Function: evaluation metrics
## True positives (TP)  - correctly identified as success
## True negatives (TN)  - correctly identified as failure
## False positives (FP) - failures incorrectly identified as success
## False negatives (FN) - successes incorrectly identified as failure
## Precision - P = TP/(TP+FP), how many of the predicted successes are actual successes
## Recall    - R = TP/(TP+FN), how many of the actual successes are correctly identified
## F-score   - F = (2 * P * R)/(P + R), the harmonic mean of precision and recall
prf <- function(predAct){
  ## predAct is a two-column data frame of predicted, actual
  preds <- predAct[, 1]
  trues <- predAct[, 2]
  xTab  <- table(preds, trues)  # rows = predicted, columns = actual
  clss  <- as.character(sort(unique(preds)))
  r <- matrix(NA, ncol = 7, nrow = 1,
              dimnames = list(c(), c('Acc',
                                     paste("P", clss[1], sep = '_'),
                                     paste("R", clss[1], sep = '_'),
                                     paste("F", clss[1], sep = '_'),
                                     paste("P", clss[2], sep = '_'),
                                     paste("R", clss[2], sep = '_'),
                                     paste("F", clss[2], sep = '_'))))
  r[1,1] <- sum(xTab[1,1], xTab[2,2]) / sum(xTab)        # Accuracy
  r[1,2] <- xTab[1,1] / sum(xTab[1,])                    # Miss precision: TP/(TP+FP), predicted row
  r[1,3] <- xTab[1,1] / sum(xTab[,1])                    # Miss recall: TP/(TP+FN), actual column
  r[1,4] <- (2 * r[1,2] * r[1,3]) / sum(r[1,2], r[1,3])  # Miss F
  r[1,5] <- xTab[2,2] / sum(xTab[2,])                    # Hit precision
  r[1,6] <- xTab[2,2] / sum(xTab[,2])                    # Hit recall
  r[1,7] <- (2 * r[1,5] * r[1,6]) / sum(r[1,5], r[1,6])  # Hit F
  r
}
For any binary classification task, this returns the precision, recall, and F-score for each class, plus the overall accuracy, like so:
> pred <- rbinom(100,1,.7)
> act <- rbinom(100,1,.7)
> predAct <- data.frame(pred,act)
> prf(predAct)
     Acc       P_0     R_0       F_0       P_1       R_1       F_1
[1,] 0.63 0.4074074 0.34375 0.3728814 0.7123288 0.7647059 0.7375887
Calculating the P, R, and F for each class like this lets you see whether one or the other is giving you more difficulty, and it is then easy to compute overall P, R, and F stats. I haven't used the ROCR package, but you could derive the same sort of curves by training the classifier over a range of some parameter and calling the function on the resulting classifiers at points along that range.
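For instance, a macro-averaged summary from the prf() output above (the unweighted mean over the two classes; one common convention among several):

res <- prf(predAct)
macroP <- mean(res[1, c("P_0", "P_1")])  # macro precision
macroR <- mean(res[1, c("R_0", "R_1")])  # macro recall
macroF <- mean(res[1, c("F_0", "F_1")])  # macro F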
Best Answer
The "baseline curve" in a PR curve plot is a horizontal line with height equal to the number of positive examples $P$ over the total number of training data $N$, ie. the proportion of positive examples in our data ($\frac{P}{N}$).
OK, why is this the case though? Assume we have a "junk classifier" $C_J$ which assigns a random probability $p_i$ to the $i$-th sample instance $y_i$ being in class $A$. For convenience, say $p_i \sim U[0,1]$. The direct implication of this random class assignment is that $C_J$ will have an (expected) precision equal to the proportion of positive examples in our data. It is only natural: any totally random sub-sample of our data will, in expectation, contain a fraction $\frac{P}{N}$ of class-$A$ examples. This holds for any probability threshold $q$ we might use as a decision boundary for the class-membership probabilities returned by $C_J$. ($q$ denotes a value in $[0,1]$ such that probability values greater than or equal to $q$ are classified as class $A$.) On the other hand, the recall of $C_J$ is (in expectation) equal to $1-q$ if $p_i \sim U[0,1]$: at any given threshold $q$ we pick (approximately) $100(1-q)\%$ of our total data, which subsequently contains (approximately) $100(1-q)\%$ of the total number of class-$A$ instances in the sample. Hence the horizontal line we mentioned at the beginning! For every recall value ($x$-axis in the PR graph) the corresponding precision value ($y$-axis) equals $\frac{P}{N}$.
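To spell out the expectation argument for $p_i \sim U[0,1]$: since $p_i$ is independent of the true class, a fraction $1-q$ of the positives and a fraction $1-q$ of the whole sample land above the threshold, so

$E[\text{recall}(q)] = \Pr(p_i \geq q) = 1-q, \qquad E[\text{precision}(q)] \approx \frac{(1-q)\,P}{(1-q)\,N} = \frac{P}{N}$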
A quick side note: the threshold $q$ is not generally equal to 1 minus the expected recall. This holds for the $C_J$ above only because of the uniform distribution of its scores; for a different distribution (e.g. $p_i \sim B(2,5)$) this approximate identity between $q$ and recall does not hold. $U[0,1]$ was used because it is the easiest to understand and mentally visualise. The PR profile of $C_J$ does not change for a different random distribution on $[0,1]$, though; only the placement of the P-R values for given $q$ values changes.
Now, by a perfect classifier $C_P$ one means a classifier that returns probability $1$ for sample instance $y_i$ being of class $A$ if $y_i$ is indeed in class $A$, and probability $0$ if it is not. This implies that for any threshold $q > 0$ we have $100\%$ precision (i.e. in graph terms we get a line starting at precision $100\%$). The only point where we do not get $100\%$ precision is $q = 0$: there the precision falls to the proportion of positive examples in our data ($\frac{P}{N}$), as (insanely?) we classify even points with $0$ probability of being in class $A$ as class $A$. The PR graph of $C_P$ thus has just two possible precision values, $1$ and $\frac{P}{N}$.
OK, and some R code to see this first-hand, with an example where the positive values correspond to $40\%$ of our sample. Note that we do a "soft assignment" of the class category, in the sense that the probability value associated with each point quantifies our confidence that this point is of class $A$.
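A minimal base-R sketch of such an experiment (the 40% positive rate and the $q = 0.50$ / $q = 0.20$ markers follow the text; the sample size, threshold grid, and plotting details are my own choices, not necessarily the original code):

set.seed(1)
N <- 10000
y <- rbinom(N, 1, 0.4)          # true labels: ~40% positives (class A)
p_junk    <- runif(N)           # junk classifier C_J: p_i ~ U[0,1]
p_perfect <- y                  # perfect classifier C_P: probability 1 iff truly in A

pr_at <- function(p, y, q) {    # precision and recall at threshold q
  pred <- p >= q
  c(recall    = sum(pred & y == 1) / sum(y == 1),
    precision = sum(pred & y == 1) / sum(pred))
}
qs   <- seq(0.01, 0.99, by = 0.01)
pr_j <- t(sapply(qs, pr_at, p = p_junk, y = y))

plot(pr_j[, "recall"], pr_j[, "precision"], type = "l",
     xlim = c(0, 1), ylim = c(0, 1), xlab = "Recall", ylab = "Precision")
abline(h = mean(y), lty = 3)                       # the P/N baseline
q5 <- pr_at(p_junk, y, 0.5)
q2 <- pr_at(p_junk, y, 0.2)
points(q5["recall"], q5["precision"], pch = 16)    # black circle:   q = 0.50
points(q2["recall"], q2["precision"], pch = 17)    # black triangle: q = 0.20
# C_P would sit at (recall = 1, precision = 1) for every q > 0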
The black circles and triangles denote $q = 0.50$ and $q = 0.20$ respectively in the first two plots. We immediately see that the "junk" classifiers quickly go to precision equal to $\frac{P}{N}$; similarly, the perfect classifier has precision $1$ across all recall values. Unsurprisingly, the AUCPR of the "junk" classifier equals the proportion of positive examples in our sample ($\approx 0.40$), and the AUCPR of the perfect classifier is approximately $1$.
Realistically, the PR graph of a perfect classifier is a bit useless, because one can never have $0$ recall (we never predict only the negative class); we just start plotting the line from the upper-left corner as a matter of convention. Strictly speaking it should show just two points, but that would make a horrible curve. :D
For the record, there have already been some very good answers on CV regarding the utility of PR curves: here, here and here. Reading carefully through them should offer a good general understanding of PR curves.