When is a proper scoring rule a better estimate of generalization in a classification setting?

Tags: error, machine-learning, model-selection, scoring-rules

A typical approach to solving a classification problem is to identify a class of candidate models and then perform model selection using a procedure such as cross-validation. Typically one selects the model with the highest accuracy, or some related function that encodes problem-specific information, like $\text{F}_\beta$.

Assuming the end goal is to produce an accurate classifier (where the definition of accuracy is, again, problem-dependent), in what situations is it better to perform model selection using a proper scoring rule rather than something improper, like accuracy, precision, or recall? Furthermore, let's ignore issues of model complexity and assume that a priori we consider all the models equally likely.
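To make the question concrete, here is a minimal sketch (plain Python, with hypothetical classifier outputs) of the situation at stake: two models that induce the same decision boundary, and hence the same accuracy, can nevertheless be distinguished by a proper scoring rule such as the Brier score.

```python
# Two hypothetical classifiers that make the same hard decisions at the
# 0.5 threshold but differ in how well their probabilities are calibrated.
# Accuracy cannot tell them apart; the Brier score (a proper scoring rule) can.

def brier(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def accuracy(probs, labels):
    """Fraction of examples on the correct side of the 0.5 threshold."""
    return sum((p > 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)

labels  = [1, 1, 1, 0, 0, 0, 1, 0]
model_a = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3, 0.4, 0.6]              # confident, well spread
model_b = [0.51, 0.51, 0.51, 0.49, 0.49, 0.49, 0.49, 0.51]       # barely over the threshold

print(accuracy(model_a, labels), accuracy(model_b, labels))  # both 0.75
print(brier(model_a, labels), brier(model_b, labels))        # 0.125 vs about 0.245
```

Both models classify the same six of eight points correctly, so any threshold-based metric is tied, yet the Brier score prefers the better-calibrated `model_a`.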

Previously I would have said never. We know, in a formal sense, that classification is an easier problem than regression [1], [2], and we can derive tighter bounds for the former than for the latter ($*$). Furthermore, there are cases where trying to accurately match probabilities can result in incorrect decision boundaries or overfitting. However, based on the conversation here and the community's voting patterns on such issues, I've been questioning this view.

  1. Devroye, Luc, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Vol. 31. Springer, 1996, Section 6.7.
  2. Kearns, Michael J., and Robert E. Schapire. "Efficient Distribution-Free Learning of Probabilistic Concepts." Proceedings of the 31st Annual Symposium on Foundations of Computer Science, IEEE, 1990.

$(*)$ This statement might be a little sloppy. I specifically mean that given labeled data of the form $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $x_i \in \mathcal{X}$ and $y_i \in \{1, \ldots, K\}$, it seems to be easier to estimate a decision boundary than accurately estimate conditional probabilities.
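As a toy illustration of this footnote (using a hypothetical one-dimensional $\eta$): many different conditional-probability functions induce the same Bayes decision boundary, so recovering the boundary asks for strictly less than recovering the probabilities themselves.

```python
# Any monotone distortion of eta(x) = P(Y=1 | x) that fixes the 0.5 level
# set leaves the optimal classifier unchanged, even though the probability
# estimates can be far apart. Sketch with a hypothetical logistic eta.

import math

def eta(x):
    """A hypothetical true conditional probability P(Y=1 | x)."""
    return 1 / (1 + math.exp(-x))

def eta_distorted(x):
    """Different probabilities, same 0.5 crossing at x = 0."""
    p = eta(x)
    return p ** 3 / (p ** 3 + (1 - p) ** 3)

xs = [i / 10 for i in range(-30, 31)]
same_boundary = all((eta(x) > 0.5) == (eta_distorted(x) > 0.5) for x in xs)
max_prob_gap = max(abs(eta(x) - eta_distorted(x)) for x in xs)
print(same_boundary, round(max_prob_gap, 3))  # same boundary, gap exceeds 0.2
```

A learner judged only on its decision boundary cannot distinguish these two functions, while a proper scoring rule penalizes the distorted probabilities.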

Best Answer

Think of this as a comparison between the $t$-test/Wilcoxon test and the Mood median test. The median test uses optimum classification (above or below the median for a continuous variable), yet it still loses about a third of the information in the sample: its efficiency relative to the $t$-test is $\frac{2}{\pi}$, or roughly $\frac{2}{3}$. Dichotomization at a point different from the median loses much more. Likewise, an improper scoring rule such as the proportion classified "correctly" is at most $\frac{2}{\pi}$-efficient, and using it leads to selection of the wrong features and to a model that is bogus.
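The efficiency figure in this answer can be checked with a quick simulation (a sketch, assuming normally distributed data): reducing each observation to "above or below the center", i.e. estimating location with the sample median instead of the sample mean, yields a relative efficiency close to $2/\pi \approx 0.64$.

```python
# For normal data the sample median has asymptotic variance pi/2 times that
# of the sample mean, i.e. relative efficiency 2/pi. Monte Carlo check of
# the mean-squared errors of both estimators of the true center (0).

import math
import random
import statistics

random.seed(0)
n, reps = 101, 2000
mean_errs, median_errs = [], []
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    mean_errs.append(statistics.fmean(sample) ** 2)
    median_errs.append(statistics.median(sample) ** 2)

efficiency = statistics.fmean(mean_errs) / statistics.fmean(median_errs)
print(round(efficiency, 2), round(2 / math.pi, 2))  # simulated value near 2/pi
```

The same arithmetic is what lies behind the answer's claim: dichotomizing a continuous signal, which is exactly what accuracy-style metrics do, forfeits roughly a third of the available information even in the best case.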
