You need to specify the purpose of the model before you can say whether Shao's results are applicable. For example, if the purpose is prediction, then LOOCV makes good sense and the inconsistency of variable selection is not a problem. On the other hand, if the purpose is to identify the important variables and explain how they affect the response variable, then Shao's results are obviously important and LOOCV is not appropriate.
AIC is asymptotically equivalent to LOOCV, and BIC is asymptotically equivalent to a leave-$v$-out CV with $v=n[1-1/(\log(n)-1)]$ --- the BIC result holds for linear models only. So BIC gives consistent model selection. A shorthand summary of Shao's result is therefore that AIC is useful for prediction, while BIC is useful for explanation.
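To get a feel for how much data that leave-$v$-out scheme actually holds out, here is a quick sketch (the sample sizes are arbitrary, chosen only for illustration):

```r
# Fraction left out by the leave-v-out CV that matches BIC:
# v/n = 1 - 1/(log(n) - 1); the fraction grows toward 1 as n grows
n <- c(50, 500, 5000, 50000)
frac <- 1 - 1 / (log(n) - 1)
round(frac, 3)  # 0.657 0.808 0.867 0.898
```

So the consistency of BIC-style selection comes from validating on an ever-larger share of the data, not a fixed fraction.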
You can save yourself some coding effort, surprisingly enough, by simply doing the leave-one-out (LWO) cross-validation yourself:
data1$norm.wt <- data1$wt / sum(data1$wt)  # normalize weights to sum to 1
lwo <- rep(0, nrow(data1))
for (j in 1:nrow(data1)) {
  # refit without observation j, then predict its response probability
  fj <- glm(y ~ x1 + x2, family = quasibinomial, weights = norm.wt,
            data = data1[-j, ])
  lwo[j] <- predict(fj, data1[j, ], type = "response")
}
> print(lwo, digits=4)
[1] 2.564e-02 2.220e-16 4.405e-01 2.128e-07 7.360e-01 5.360e-01 2.383e-02
[8] 2.064e-02 1.316e-01 2.174e-02 4.152e-01 1.895e-01 7.162e-01
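If you want to reduce those held-out predicted probabilities to a single CV error number, one option is the cross-validated Brier score: the mean squared difference between the leave-one-out predictions and the 0/1 outcomes. A self-contained sketch on synthetic data (the data frame `sim` is hypothetical, standing in for `data1`):

```r
# Same LOO loop as above, on simulated data, ending in a CV Brier score
set.seed(3)
sim <- data.frame(x1 = rnorm(25), x2 = rnorm(25), wt = runif(25))
sim$y <- rbinom(25, 1, plogis(sim$x1 + sim$x2))
sim$norm.wt <- sim$wt / sum(sim$wt)       # normalized weights, as in the text
loo <- numeric(nrow(sim))
for (j in seq_len(nrow(sim))) {
  fj <- glm(y ~ x1 + x2, family = quasibinomial, weights = norm.wt,
            data = sim[-j, ])
  loo[j] <- predict(fj, sim[j, ], type = "response")
}
cv.brier <- mean((sim$y - loo)^2)         # cross-validated Brier score
```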
Normalizing the weights to sum to one prevents a numerical problem that, in this case, makes the parameter estimates blow up:
> summary(glm(y~x1+x2, family=quasibinomial, weights=wt, data=data1))
Call:
glm(formula = y ~ x1 + x2, family = quasibinomial, data = data1,
weights = wt)
Deviance Residuals:
   Min     1Q Median     3Q    Max
-394.9    0.0    0.0    0.0  164.4
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.790e+15  2.354e+15  -2.885   0.0162 *
x1           2.004e+15  1.630e+15   1.230   0.2470
x2           3.403e+15  1.393e+15   2.443   0.0347 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasibinomial family taken to be 2.030037e+18)
Null deviance: 25460 on 12 degrees of freedom
Residual deviance: 324940 on 10 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 12
versus normalized weights:
> summary(glm(y~x1+x2, family=quasibinomial, weights=data1$wt/sum(data1$wt), data=data1))
Call:
glm(formula = y ~ x1 + x2, family = quasibinomial, data = data1,
weights = data1$wt/sum(data1$wt))
Deviance Residuals:
     Min       1Q   Median       3Q      Max
-0.4273  -0.1004  -0.0444   0.1706   0.3879
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -8.038      6.978  -1.152    0.276
x1             1.631      3.177   0.513    0.619
x2             4.142      3.536   1.171    0.269
(Dispersion parameter for quasibinomial family taken to be 0.1440099)
Null deviance: 1.30513 on 12 degrees of freedom
Residual deviance: 0.84517 on 10 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 6
You don't have to renormalize the weights at every step of the LWO loop; in effect they are renormalized anyway, since glm treats the weights as relative.
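That relative-weights point is easy to check directly: multiplying every weight by the same constant leaves the fit unchanged (synthetic data below; note that very large raw weights can still trigger the numerical blow-up shown earlier, which is why normalizing once up front is worthwhile):

```r
# Scaling all weights by a constant does not change the quasibinomial fit;
# only the relative weights matter (hypothetical synthetic data)
set.seed(1)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30), wt = runif(30, 1, 5))
d$y <- rbinom(30, 1, plogis(d$x1 - d$x2))
f1 <- glm(y ~ x1 + x2, family = quasibinomial, weights = wt, data = d)
f2 <- glm(y ~ x1 + x2, family = quasibinomial, weights = wt / sum(wt), data = d)
all.equal(coef(f1), coef(f2))  # TRUE
```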
This doesn't match the SAS probabilities, admittedly, but it seems to me it's what you're trying to do.
Best Answer
Your conceptual misunderstanding may come from one of the few complaints I have about ISLR: its overemphasis on misclassification rates as a measure of model quality. If you look at page 158 of ISLR you will see that the authors propose the same solution as @BrentFerrier in another answer here, which is to classify using a cutoff of 0.5 on the predicted probability. Although I appreciate the simplicity of this measure for early-stage pedagogical purposes, it carries the implicit assumption that all types of classification errors are equally costly, which is seldom true in practice.
There are, however, much better possibilities for overall evaluation of logistic regression models than misclassification based on a 0.5 probability cutoff. This page provides some entry into this area, and this page briefly compares two simple measures. For example, the Brier score is the mean squared difference between the predicted probabilities and the actual 0/1 outcomes. Measures of the calibration curve, as provided in the rms package in R, can be even better.