Solved – Do I do threshold selection for the logit model on the testing or training subset

auc, cross-validation, predictive-models, roc, threshold

I have data with a binary outcome and I am doing logit model selection using AIC and BIC. I have already withheld 30% of the data as a holdout sample (testing subset) and used the remainder (training subset) to do model selection.

In order to calculate accuracy, sensitivity, specificity, PPV, NPV, and related metrics, I need a classification threshold. I plan on using Youden's index to pick the threshold that maximizes the distance between the ROC curve and the random chance (diagonal) line.
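(For reference, Youden's index at a candidate threshold t is J(t) = sensitivity(t) + specificity(t) − 1; the chosen threshold is the t that maximizes J, i.e., the ROC point farthest above the diagonal.)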

However, in generating the ROC curve, do I use the training data to generate the curve and choose a threshold, and then apply this threshold to the predicted values for the testing data? Or do I generate the ROC curve using the testing dataset and pick the threshold from that?

In the former case, I am generating an ROC curve from data that were used to fit the model, which seems like it would give me a falsely high AUC (since the models are fit to those particular data), and the threshold chosen won't necessarily be the best one. In the latter case, I am generating an ROC curve from the testing data, which the models have not seen, and picking an optimal threshold from it. This seems a little cheaty as well, since I could pick the threshold that gives the highest sensitivity/specificity for the testing subset, even though that threshold might not generalize to the intended population.

TL;DR: do I pick my threshold based on an ROC curve generated with the model testing or model training data?

Thanks.

Best Answer

Generate the ROC curve and choose the threshold within the training data, but then report accuracy, sensitivity, etc. when using this threshold to make predictions in the test data.
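If a concrete sketch helps, here is a minimal scikit-learn version of that workflow; the synthetic data, the 30% split, and the plain (effectively unpenalized) logit call are placeholders standing in for your actual data and selected model:

    import numpy as np
    from sklearn.datasets import make_classification      # stand-in for the real data
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, confusion_matrix
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a dataset with a binary outcome
    X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

    # 30% holdout, mirroring the split described in the question
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=0
    )

    # Fit the (already selected) logit model on the training subset only
    # (a large C makes scikit-learn's logistic regression effectively unpenalized)
    model = LogisticRegression(C=1e6, max_iter=1000).fit(X_train, y_train)

    # ROC curve and Youden-optimal threshold from the TRAINING data
    p_train = model.predict_proba(X_train)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_train, p_train)
    youden_threshold = thresholds[np.argmax(tpr - fpr)]   # J = tpr - fpr

    # Apply that fixed threshold to the TEST data and report metrics there
    p_test = model.predict_proba(X_test)[:, 1]
    y_pred = (p_test >= youden_threshold).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    test_metrics = {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "PPV":         tp / (tp + fp),
        "NPV":         tn / (tn + fn),
    }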

AUC is not a great metric, but if you want it (and you don't want it to be optimistically biased), generate another ROC curve for the test data and report the AUC for that.
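Continuing the sketch above, the unbiased AUC would come from the held-out predictions:

    from sklearn.metrics import roc_auc_score

    # AUC computed on the test data, not the training data
    test_auc = roc_auc_score(y_test, p_test)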