I wrote a function for this purpose, based on the exercise in the book Data Mining with R:
# Function: evaluation metrics
## True positives (TP) - correctly identified as success
## True negatives (TN) - correctly identified as failure
## False positives (FP) - failure incorrectly identified as success
## False negatives (FN) - success incorrectly identified as failure
## Precision - P = TP/(TP+FP): of those identified as successes, how many actually are
## Recall - R = TP/(TP+FN): of the actual successes, how many were correctly identified
## F-score - F = (2 * P * R)/(P + R): harmonic mean of precision and recall
prf <- function(predAct){
  ## predAct is a two-column data frame of predicted, actual
  preds <- predAct[, 1]
  trues <- predAct[, 2]
  xTab  <- table(preds, trues) # rows = predicted class, columns = actual class
  clss  <- as.character(sort(unique(preds)))
  r <- matrix(NA, ncol = 7, nrow = 1,
              dimnames = list(c(), c('Acc',
                                     paste("P", clss[1], sep = '_'),
                                     paste("R", clss[1], sep = '_'),
                                     paste("F", clss[1], sep = '_'),
                                     paste("P", clss[2], sep = '_'),
                                     paste("R", clss[2], sep = '_'),
                                     paste("F", clss[2], sep = '_'))))
  r[1,1] <- sum(xTab[1,1], xTab[2,2])/sum(xTab)   # Accuracy
  r[1,2] <- xTab[1,1]/sum(xTab[1,])               # Miss Precision: TP / all predicted misses
  r[1,3] <- xTab[1,1]/sum(xTab[,1])               # Miss Recall: TP / all actual misses
  r[1,4] <- (2*r[1,2]*r[1,3])/sum(r[1,2], r[1,3]) # Miss F
  r[1,5] <- xTab[2,2]/sum(xTab[2,])               # Hit Precision: TP / all predicted hits
  r[1,6] <- xTab[2,2]/sum(xTab[,2])               # Hit Recall: TP / all actual hits
  r[1,7] <- (2*r[1,5]*r[1,6])/sum(r[1,5], r[1,6]) # Hit F
  r
}
For any binary classification task, this returns the precision, recall, and F-score for each class, plus the overall accuracy, like so:
> pred <- rbinom(100,1,.7)
> act <- rbinom(100,1,.7)
> predAct <- data.frame(pred,act)
> prf(predAct)
Acc P_0 R_0 F_0 P_1 R_1 F_1
[1,] 0.63 0.34375 0.4074074 0.3728814 0.7647059 0.7123288 0.7375887
Calculating the P, R, and F for each class like this lets you see whether one class or the other is giving you more difficulty, and it's then easy to calculate the overall P, R, and F stats, as sketched below. I haven't used the ROCR package, but you could derive ROC curves in the same spirit by training the classifier over the range of some parameter and calling the function on the classifiers at points along that range.
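For instance, one common choice for the overall stats is macro-averaging, i.e. the unweighted mean of the per-class values. A minimal sketch, assuming the 0/1 class labels from the example above (so the columns are named P_0, P_1, and so on):
res <- prf(predAct)
## Macro-averaged precision and recall: the unweighted mean over the two classes
macroP <- mean(c(res[1, "P_0"], res[1, "P_1"]))
macroR <- mean(c(res[1, "R_0"], res[1, "R_1"]))
## F as the harmonic mean of the macro-averaged precision and recall
macroF <- (2 * macroP * macroR) / (macroP + macroR)
c(P = macroP, R = macroR, F = macroF)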
Generating a PR curve is similar to generating an ROC curve: to draw either plot you need a full ranking of the test set. To produce this ranking, you need a classifier which outputs a decision value rather than a binary answer. The decision value is a measure of confidence in a prediction, which we can use to rank all test instances. As an example, the decision values of logistic regression and an SVM are a probability and a (signed) distance to the separating hyperplane, respectively.
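As a sketch of where such decision values come from in R (train and test are hypothetical data frames with a binary y column):
## Logistic regression: the decision value is the fitted probability
fit <- glm(y ~ ., data = train, family = binomial)
dec.vals <- predict(fit, newdata = test, type = "response") # probabilities in (0, 1)
## For an SVM (e.g. the e1071 package), the decision value is the signed distance:
## fit <- svm(y ~ ., data = train)
## dec.vals <- attr(predict(fit, test, decision.values = TRUE), "decision.values")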
If you have decision values at your disposal, you can define a set of thresholds on them. These thresholds correspond to different settings of the classifier: e.g. they let you control its level of conservatism. For logistic regression, the default threshold would be $f(\mathbf{x}) = 0.5$, but you can sweep the entire range $(0, 1)$. Typically, the thresholds are chosen to be the unique decision values your model yielded for the test set.
At each choice of threshold, your model yields different predictions (e.g. a different number of positive and negative predictions). As such, you get a different precision and recall at every threshold, i.e. a set of tuples $( T_i, P_i, R_i )$. The PR curve is then drawn from the $( P_i, R_i )$ pairs.
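A minimal sketch of that procedure, assuming dec.vals holds the decision values from above and trues holds the 0/1 test labels:
## One (T, P, R) tuple per unique decision value used as a threshold
pr.points <- t(sapply(sort(unique(dec.vals)), function(th) {
  preds <- as.numeric(dec.vals >= th) # predict positive at or above the threshold
  TP <- sum(preds == 1 & trues == 1)
  c(T = th,
    P = TP / sum(preds == 1), # precision at this threshold
    R = TP / sum(trues == 1)) # recall at this threshold
}))
## The PR curve is drawn from the (R, P) pairs
plot(pr.points[, "R"], pr.points[, "P"], type = "l",
     xlab = "Recall", ylab = "Precision", xlim = c(0, 1), ylim = c(0, 1))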
If I understood your comment correctly, the total similarity score you compute can be used as a decision value.
Best Answer
Working convention: Point $(0,1)$ is the upper left corner and corresponds to $0$ Recall (i.e. no Recall) and $1$ Precision (i.e. perfect Precision).
Regarding the first question: the starting point can be anywhere on the Recall $= 0$ or Recall $= \frac{1}{n_+}$ line; where the PR curve starts depends on the classifier's performance. While we would hope to start at the point $(\frac{1}{n_+},1)$ and then slowly increase our Recall at little expense to Precision (i.e. we are very precise to begin with and gradually sacrifice Precision for Recall), that is not guaranteed at all. The obvious example is when we misclassify the "most probable" example of our test set. In that case we have both $0$ Recall and $0$ Precision, i.e. we start from the point $(0,0)$. For example, in the left-most graph shown below (red line), we have an artificial example where we start at the point $(0,0.5)$ because the first $\frac{N}{2}$ points are indistinguishable from each other. We "immediately" classify some examples correctly (i.e. we get TPs and thus non-zero Recall), but at the same time we get an equal number of FPs, leaving us at $0.5$ Precision.
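To see the $(0,0)$ start concretely, here is a tiny sketch with hypothetical decision values where the top-ranked example is in fact a negative:
scores <- c(0.9, 0.8, 0.7, 0.6) # hypothetical decision values, sorted descending
labels <- c(0, 1, 1, 0)         # the "most probable" example is actually a negative
preds  <- as.numeric(scores >= 0.9) # strictest threshold: only the top example is positive
c(P = sum(preds & labels) / sum(preds),  # 0/1 = 0 Precision
  R = sum(preds & labels) / sum(labels)) # 0/2 = 0 Recall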
Please note that when no positive predictions have been made (no TPs or FPs), Precision is undefined. There is no general rule for what to do in that case.
sklearn sets this to $1$, strictly for its own convenience, and explicitly says that these points "do not have a corresponding threshold". In that respect, in Davis & Goadrich (2006) the procedure for constructing a PR curve when presented with an algorithm returning probabilities is: "first find the probability that each test set example is positive, next sort this list and then traverse the sorted list in ascending order"; as such, it is implied/suggested that for a probability at which no example is positive, it makes no sense to construct a PR curve. In R, PRROC::pr.curve does a similar thing, where the origin is at $(0,0)$ from the first positive example (example shown in pr3 below). Side note: in Python this leads to the slightly awkward situation of having, at Recall $0$, both Precision $0$ and Precision $1$ at the same time.
Regarding the second question: Yes, the ideal classifier has AUCPR equal to $1$. The only way to touch the ideal point $(1,1)$ and yet have AUCPR less than $1$ is if we somehow moved towards $(1,1)$ without already having perfect Precision (i.e. $y = 1$). On occasion PR curves have a "sawtooth" shape (e.g. the middle graph shown below (dark green)), which suggests a significant jump in performance. That "tooth" though can never reach the point $(1,1)$, because by definition there are already some misclassified points. The "sawtooth" effect is due to us having a batch of correctly classified points, which moves both our Precision and Recall higher, followed by a batch of wrongly classified points that causes a sharp dip in Precision. To get the upward slope we increased our TP count while our FP count remained the same (and our FN count correspondingly dropped), but that does not mean we removed the previously misclassified points; we can therefore never reach perfect Precision at $y = 1$. For example, in the right-most graph shown below (blue), a single point prohibits us from hitting $\text{AUCPR} = 1$; that misclassified FP point actually ranks higher than every point in the positive class and thus forces our PR curve to start at $(0,0)$.
OK, and some R code to see this first-hand: