Solved – SPSS – Binary logistic regression: classification cutoff

classification, logistic, odds, regression

Let's say I want to evaluate the predictive value of a continuous variable in the prediction of malignancy (event/status) of a tumour.

Malignant = 1
Nonmalignant = 0

In SPSS, I can run a binary logistic regression model to do so, and it allows me to set a cutoff value for classification. My question is this: SPSS assumes equal pretest chances and odds in both groups and proposes a cutoff value of 0.5. However, research has shown that 70% of all tumours are malignant and 30% are nonmalignant. Hence, a priori, there is a 70% chance that a tumour is malignant. There is, however, no way to include an a priori chance in SPSS (at least, not to my knowledge). Am I instead allowed to change the cutoff value to 0.5/(0.7/0.5)?

The rationale is that the probability of malignancy is 0.7/0.5 times larger than 0.5, so that, instead of changing the pretest probabilities, the posttest probability could be reduced by a factor of 0.7/0.5.

Is this correct, or not?

Best Answer

It's easy to get confused between the two different types of probabilities that you face in this type of study. One is the probability, before you've run any tests, that someone with such a tumor has the malignant type: the 70% prevalence of the malignant form for this type of tumor. This is the prior (a priori) probability. The second is the probability, after you've run the test to get the biomarker value, that a tumor with a particular biomarker value is malignant. That depends on the quality of your biomarker and your probability model. This is the predicted probability. There may be no simple relation between these two types of probabilities.

A logistic regression model is a model of probabilities, as this answer among others on this site emphasizes. That's a model of the second type of probability: If I know the value of the biomarker, what's the probability that the tumor is malignant?

As an extreme example, say that all tumors with biomarker values < 9 were benign, all those with values > 11 were malignant, and the few tumors with values between 9 and 11 had a 50/50 chance of being malignant. So a tumor with a value of 10 has a probability of 0.5 of being malignant.

If you wanted to use a cutoff to map probabilities into yes/no predictions, then a biomarker value of 10 could be a reasonable choice of cutoff, even though it corresponds to a probability cutoff of 0.5 in your model of that second type of probability. You would still score about 70% of tumors as malignant, because about 70% of tumors have biomarker values above 10; that reflects the first type of probability. If you have a near-perfect biomarker like this, it's the distribution of biomarker values among tumors, not the choice of cutoff probability itself, that's related to the overall prevalence of malignancy.
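To make the extreme example above concrete, here is a minimal simulation sketch in Python. The biomarker distribution is an assumption chosen only for illustration (28% of tumors below 9, 4% between 9 and 11, 68% above 11, uniform within each band), picked so that the overall prevalence of malignancy comes out near 70%. The function name and all the numbers are hypothetical, not taken from any real data set.

```python
import random


def simulate(n=100_000, seed=0):
    """Simulate the near-perfect biomarker from the example.

    Assumed (hypothetical) biomarker distribution:
      28% of tumors have values in [5, 9)   -> always benign
       4% of tumors have values in [9, 11]  -> 50/50 malignant
      68% of tumors have values in (11, 15] -> always malignant
    """
    rng = random.Random(seed)
    n_malignant = 0
    n_called_malignant = 0
    for _ in range(n):
        # Draw a biomarker value from the assumed mixture.
        u = rng.random()
        if u < 0.28:
            x = rng.uniform(5.0, 9.0)
        elif u < 0.32:
            x = rng.uniform(9.0, 11.0)
        else:
            x = rng.uniform(11.0, 15.0)

        # True status follows the rule stated in the example.
        if x < 9.0:
            malignant = False
        elif x > 11.0:
            malignant = True
        else:
            malignant = rng.random() < 0.5

        n_malignant += malignant
        # Classify with a biomarker cutoff of 10, which corresponds
        # to a predicted probability of 0.5 in this example.
        n_called_malignant += x > 10.0
    return n_malignant / n, n_called_malignant / n


prevalence, called = simulate()
print(f"true prevalence of malignancy:   {prevalence:.3f}")
print(f"fraction classified as malignant: {called:.3f}")
```

Under these assumptions both fractions should land close to 0.70: the 0.5 probability cutoff still labels roughly 70% of tumors as malignant, simply because roughly 70% of tumors have high biomarker values.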

That said, it is seldom wise to reduce your logistic regression probability model directly to an immediate yes/no decision about tumor type. In a clinical context there is always other information available that needs to be taken into account. And you also have to weigh the relative costs of misclassification: what are the risks of treating a benign tumor as if it were malignant, versus the risks of treating a malignant tumor as if it were benign? If you are forced for some reason to make an all-or-none decision about classification, it's those relative risks that should be informing your cutoff.
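One common way to turn relative misclassification costs into a cutoff is the standard decision-theory rule: if correct decisions cost nothing, a false positive costs C_FP and a false negative costs C_FN, then expected cost is minimized by calling a tumor malignant when its predicted probability is at least C_FP / (C_FP + C_FN). The sketch below illustrates that rule; the function names and the cost values are hypothetical, and real costs would have to come from clinical judgment.

```python
def cost_based_cutoff(cost_false_positive, cost_false_negative):
    """Probability cutoff that minimizes expected misclassification cost,
    assuming correct decisions cost nothing."""
    if cost_false_positive <= 0 or cost_false_negative <= 0:
        raise ValueError("costs must be positive")
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def classify(p_malignant, cutoff):
    """Return True (call it malignant) if the predicted probability
    reaches the cutoff."""
    return p_malignant >= cutoff


# Equal costs recover the familiar 0.5 cutoff.
equal_cutoff = cost_based_cutoff(1.0, 1.0)

# Hypothetical: missing a malignant tumor is 4 times as costly as
# treating a benign one as malignant, so the cutoff drops to 0.2.
asymmetric_cutoff = cost_based_cutoff(1.0, 4.0)

p = 0.3  # predicted probability of malignancy for some tumor
print(classify(p, equal_cutoff))       # False: 0.3 is below 0.5
print(classify(p, asymmetric_cutoff))  # True: 0.3 is above 0.2
```

Note that the cutoff here depends only on the costs, not on the 70% prevalence: a model fitted to a representative sample already reflects prevalence in its predicted probabilities.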
