[Math] How to use cross-validation to select probability threshold for logistic regression

Tags: data analysis, machine learning, regression, statistics

I have a question about how to use cross-validation to select the probability threshold for logistic regression. Suppose I want to minimize the misclassification rate. Say I use 5-fold CV; is this procedure correct:

1. Fit 5 logistic regression models, each on a different set of 4 folds of the data.

2. For each candidate probability threshold (e.g. from 0.01 to 0.99), apply each of the 5 models to its held-out fold and compute the misclassification rate, then average these 5 error rates.

3. The optimal probability threshold is the one with the smallest average misclassification rate.
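The three steps above can be sketched as follows. This is a minimal illustration with a hand-rolled gradient-descent logistic regression on synthetic 1-D data; the data, learning rate, and epoch count are all illustrative assumptions, not part of the question.

```python
import math
import random

random.seed(0)

def fit_logistic(xs, ys, lr=0.1, epochs=500):
    """Fit w, b for P(y=1|x) = sigmoid(w*x + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += p - y
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def predict_proba(w, b, x):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Illustrative synthetic data: class 1 tends to have larger x.
xs = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(2, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100
idx = list(range(200))
random.shuffle(idx)

k = 5
folds = [idx[i::k] for i in range(k)]
thresholds = [t / 100 for t in range(1, 100)]  # 0.01 .. 0.99

# Step 1: fit one model per training split (4 folds each).
models = []
for i in range(k):
    train = [j for f in range(k) if f != i for j in folds[f]]
    models.append(fit_logistic([xs[j] for j in train], [ys[j] for j in train]))

# Step 2: for each threshold, average the misclassification
# rate over the 5 held-out folds.
avg_err = {}
for t in thresholds:
    errs = []
    for i, (w, b) in enumerate(models):
        held = folds[i]
        wrong = sum((predict_proba(w, b, xs[j]) >= t) != bool(ys[j]) for j in held)
        errs.append(wrong / len(held))
    avg_err[t] = sum(errs) / k

# Step 3: keep the threshold with the smallest average error.
best_t = min(thresholds, key=lambda t: avg_err[t])
print(round(best_t, 2), round(avg_err[best_t], 3))
```

Note that the 5 fitted models are only used to score the thresholds; for the final classifier you would typically refit on all the data and apply the selected threshold.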

And suppose I fit a ridge logistic regression model. To select the tuning parameter $\lambda$, is it okay to first use CV to select an optimal $\lambda$ (e.g. using the cv.glmnet function in the R package glmnet), and then plug that $\lambda$ into the procedure above to find the probability threshold?

Best Answer

Yes, I think you have the right idea.

Just to put it another way, I'd say:

  1. Fix the hyper-parameters that you don't want to search for (e.g. as you mention, this could be a regularization strength $\lambda$)

  2. Choose a set of hyper-parameters $\theta$ you want to "optimize" by CV and split your dataset into folds (note it is standard practice to first remove a portion of your data as a testing set, and then use the remaining part for CV).

  3. Fix a search space (e.g. $\theta_i\in[a_i,b_i]\;\forall\; i$) and then for each $\hat{\theta}$ in the space, learn the model $f(x;\hat{\theta})$ and compute the average error $\mathcal{E}(f)$ over the CV folds. Keep the $f$ with the least error.
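Step 3 can be sketched in the same style: grid-search a single hyper-parameter by k-fold CV and keep the setting with the lowest average held-out error. Here the hyper-parameter is an L2 penalty strength standing in for $\lambda$ (a plain-Python stand-in for what cv.glmnet does); the data, grid, and optimizer settings are illustrative assumptions.

```python
import math
import random

random.seed(1)

def fit_ridge_logistic(xs, ys, lam, lr=0.1, epochs=300):
    """Logistic regression with L2 penalty lam on the weight."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += p - y
        w -= lr * (gw / n + lam * w)  # penalize the weight, not the intercept
        b -= lr * gb / n
    return w, b

# Illustrative synthetic data.
xs = [random.gauss(0, 1) for _ in range(60)] + [random.gauss(1.5, 1) for _ in range(60)]
ys = [0] * 60 + [1] * 60
idx = list(range(120))
random.shuffle(idx)
k = 5
folds = [idx[i::k] for i in range(k)]

def cv_error(lam):
    """Average misclassification rate (at threshold 0.5) over k folds."""
    errs = []
    for i in range(k):
        train = [j for f in range(k) if f != i for j in folds[f]]
        w, b = fit_ridge_logistic([xs[j] for j in train], [ys[j] for j in train], lam)
        held = folds[i]
        wrong = sum(
            ((1.0 / (1.0 + math.exp(-(w * xs[j] + b)))) >= 0.5) != bool(ys[j])
            for j in held
        )
        errs.append(wrong / len(held))
    return sum(errs) / k

# The search space: a small illustrative grid of penalty strengths.
grid = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam = min(grid, key=cv_error)
print(best_lam, round(cv_error(best_lam), 3))
```

The same loop generalizes to several hyper-parameters at once by iterating over the Cartesian product of their grids, at the cost of one full CV run per grid point.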
