Solved – Why is the true (test) error rate of any classifier 50%?


In Section 7.10.2 of The Elements of Statistical Learning, it says that the true (test) error rate of any classifier is 50%. I'm having trouble understanding the intuition behind this. Suppose you have a binary class (1 or 0) and your classifier is a die: if you roll 1–5, the classification is 1, and if you roll a 6, the classification is 0. Now suppose the true value of the binary class is always 1. Then I would think that the error rate would converge to 1/6 over time.
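To make the intuition concrete, here is a quick simulation (my own sketch, not from the book) of the die classifier under the assumption that the true class is always 1:

```python
import random

random.seed(0)
trials = 100_000
errors = 0
for _ in range(trials):
    roll = random.randint(1, 6)
    prediction = 1 if roll <= 5 else 0  # the die classifier from above
    errors += prediction != 1           # assumed: the true class is always 1

error_rate = errors / trials
print(error_rate)  # close to 1/6 ≈ 0.167
```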

The excerpt from the text is below.

Consider a classification problem with a large number of predictors, as may arise, for example, in genomic or proteomic applications. A typical strategy for analysis might be as follows:
1. Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels
2. Using just this subset of predictors, build a multivariate classifier.
3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.
Is this a correct application of cross-validation? Consider a scenario with N = 50 samples in two equal-sized classes, and p = 5000 quantitative predictors (standard Gaussian) that are independent of the class labels. The true (test) error rate of any classifier is 50%. We carried out the above recipe, choosing in step (1) the 100 predictors having highest correlation with the class labels, and then using a 1-nearest neighbor classifier, based on just these 100 predictors, in step (2). Over 50 simulations from this setting, the average CV error rate was 3%. This is far lower than the true error rate of 50%.
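The setting in the excerpt is easy to reproduce numerically. Below is a minimal sketch (my own, using only NumPy, not the authors' code) of the flawed recipe: noise predictors are screened on the *full* data set, labels included, and then leave-one-out cross-validation is run on the surviving predictors with a 1-nearest-neighbour classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, k = 50, 5000, 100        # samples, noise predictors, predictors kept after screening

# Predictors are standard Gaussian noise, independent of the class labels
X = rng.standard_normal((N, p))
y = np.repeat([0, 1], N // 2)  # two equal-sized classes

# Step 1 (the flaw): screen predictors by correlation with y using ALL the data
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
keep = np.argsort(corr)[-k:]   # the k predictors most correlated with the labels
Xs = X[:, keep]

# Step 3: leave-one-out CV with a 1-nearest-neighbour classifier
wrong = 0
for i in range(N):
    train = np.delete(np.arange(N), i)
    dists = np.linalg.norm(Xs[train] - Xs[i], axis=1)
    wrong += y[train][np.argmin(dists)] != y[i]

cv_error = wrong / N
print(cv_error)  # far below the true 50% error rate
```

Because the held-out point already influenced which 100 predictors were kept, the CV error comes out far below the true 50%, just as the text describes.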

Best Answer

That's not a general statement about classifiers. In this particular case, where the class frequencies are half & half and none of the predictors are any use, the true error rate of any classifier is 50%. Imagine trying to predict the result of coin tosses from denomination, year of issue, metal content, &c.: in the long run you can't do better than a 50% error rate. The same goes for your die: your 1/6 calculation assumes the true class is always 1, but against labels that are 1 only half the time it errs 0.5 × (1/6) + 0.5 × (5/6) = 50% of the time. The point of the quoted passage is that cross-validation carried out after the model-selection step, rather than around it, gives an optimistic estimate of performance.
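You can check the coin-toss claim directly. In this sketch (mine, for illustration) the labels are a fair coin, independent of everything else, and two label-independent classifiers are scored against them, including the die rule from the question:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y = rng.integers(0, 2, size=n)  # fair 50/50 labels, independent of any predictor

pred_always_1 = np.ones(n, dtype=int)                     # always guess class 1
pred_die = (rng.integers(1, 7, size=n) <= 5).astype(int)  # the die rule: 1-5 -> 1, 6 -> 0

err_always_1 = (pred_always_1 != y).mean()
err_die = (pred_die != y).mean()
print(err_always_1, err_die)  # both close to 0.5
```

Any rule whose output is independent of the labels lands at 50%, because each prediction disagrees with a fair coin half the time.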