F-Score Calculation – Identifying Positive Classes in Classification

classification, machine learning

I'm calculating the F-score for a sandbox dataset: 100 medical patients, 20 of whom have cancer. Our classifier misclassifies 20 healthy patients as having cancer and 5 patients with cancer as healthy; it gets the rest right.

We then compute the numbers of true positives, true negatives, false positives, and false negatives.

We ran into a debate about which class should be treated as the "positive" class: the patients who test "Positive" for cancer, or the majority class, i.e. those who are "Healthy".

Explicit Question:
What is the correct true-positive rate in this dataset? Is it:

  1. # of predicted healthy patients over # of actual healthy patients
  2. # of predicted cancer patients over # of actual cancer patients

Bonus points if you can reference some literature that supports one supposition or the other.

Note, I've skimmed through a few texts on f-scores but haven't seen an explicit discussion of this point:

https://en.wikipedia.org/wiki/F1_score
http://rali.iro.umontreal.ca/rali/sites/default/files/publis/SokolovaLapalme-JIPM09.pdf

Wikipedia's text on precision and recall seems to suggest that a "true positive" is defined by whatever "test" is being performed, and thus in this case by the minority class, because the "test" is for cancer. However, I don't find the discussion rigorous enough to convince me. If I simply describe the test as a test for "healthy" patients, I change the F-score, even though the change is purely semantic. I would expect the F-score to have a mathematically rigorous definition.

https://en.wikipedia.org/wiki/Precision_and_recall

Best Answer

I think you've discovered that the F-score is not a very good way to evaluate a classification scheme. The Wikipedia page you linked gives a simplified formula for the F-score:

$$ F_1 = \frac{2\,TP}{2\,TP + FP + FN} $$

where $TP,FP,FN$ are numbers of true positives, false positives, and false negatives, respectively.
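
As a rough illustration, here is a minimal Python sketch that plugs the counts from the question into this formula, once with "cancer" as the positive class and once with "healthy" as the positive class. The function and variable names are made up for this example and are not taken from any library.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts, using 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Counts stated in the question.
n_total, n_cancer = 100, 20
healthy_called_cancer = 20   # healthy patients misclassified as cancer
cancer_called_healthy = 5    # cancer patients misclassified as healthy

# Correct classifications follow from the totals.
cancer_correct = n_cancer - cancer_called_healthy                 # 15
healthy_correct = (n_total - n_cancer) - healthy_called_cancer    # 60

# "Cancer" as the positive class.
f1_cancer = f1_score(tp=cancer_correct,
                     fp=healthy_called_cancer,
                     fn=cancer_called_healthy)

# "Healthy" as the positive class: the two error types swap roles.
f1_healthy = f1_score(tp=healthy_correct,
                      fp=cancer_called_healthy,
                      fn=healthy_called_cancer)

print(f"F1 with cancer as positive:  {f1_cancer:.3f}")   # 30/55
print(f"F1 with healthy as positive: {f1_healthy:.3f}")  # 120/145
```

The same classifier on the same patients gives roughly 0.545 under one labeling and roughly 0.828 under the other, which is the asymmetry the question describes.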

You will note that the number of true negative cases (equivalently, the total number of cases) is not considered at all in the formula. Thus you can have the same F-score whether you have a very high or a very low number of true negatives in your classification results. If you take your case 1, "# of predicted healthy patients over # of actual healthy patients", the "true negatives" are those who were correctly classified as having cancer, yet that success in identifying patients with cancer doesn't enter into the F-score. If you take case 2, "# of predicted cancer patients over # of actual cancer patients," then the number of patients correctly classified as not having cancer is ignored. Neither seems like a good choice in this situation.
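
To make the point about true negatives concrete, here is a small Python sketch. It takes "cancer" as the positive class, so TP = 15, FP = 20, FN = 5 from the question, and compares the question's TN = 60 with a hypothetical TN = 6000 (a value invented purely for illustration). The helper names are also assumptions for this example.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; note that TN is not an argument."""
    return 2 * tp / (2 * tp + fp + fn)

def accuracy(tp, fp, fn, tn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Cancer as the positive class, counts from the question.
tp, fp, fn = 15, 20, 5

results = []
for tn in (60, 6000):  # 6000 is a hypothetical value, not from the question
    results.append((tn, f1_score(tp, fp, fn), accuracy(tp, fp, fn, tn)))

for tn, f1, acc in results:
    print(f"TN={tn:5d}  F1={f1:.3f}  accuracy={acc:.3f}")
```

The F1 value is identical in both rows, while accuracy moves from 0.75 to above 0.99, so F1 alone cannot distinguish these two situations.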

If you look at any of my favorite easily accessible references on classification and regression (An Introduction to Statistical Learning, Elements of Statistical Learning, or Frank Harrell's Regression Modeling Strategies and associated course notes), you won't find much, if any, discussion of F-scores. What you will often find is a caution against evaluating classification procedures based simply on $TP$, $FP$, $FN$, and $TN$ values. You are much better off focusing on an accurate assessment of likely disease status with an approach like logistic regression, which in this case would relate the probability of having cancer to the values of the predictors that you included in your classification scheme. Then, as Harrell says on page 258 of Regression Modeling Strategies, 2nd edition:

If you make a classification rule from a probability model, you are being presumptuous. Suppose that a model is developed to assist physicians in diagnosing a disease. Physicians sometimes profess to desiring a binary decision model, but if given a probability they will rightfully apply different thresholds for treating different patients or for ordering other diagnostic tests.

A good model of the probability of being a member of a class, in this case of having cancer, is thus much more useful than any particular classification scheme.
