Solved – Cross-validation and logistic regression

cross-validation, logistic, model-selection, r, regression

I'm interested in building a set of candidate models in R for an analysis using logistic regression. Once I build the set of candidate models and evaluate their fit to the data using AICc (aicc = dredge(results, eval=TRUE, rank="AICc")), I would like to use k-fold cross-validation to evaluate the predictive performance of the final model chosen from the analysis; a minimal sketch of this workflow is shown after the questions below. I have a few questions about k-fold cross-validation:

  1. I assume you use your entire data set for initially building your candidate set of models. For example, if I have 20,000 observations, wouldn't I first build my candidate set of models on all 20,000? Then use AICc to rank the models and select the most parsimonious one?

  2. After you select the final model (or model-averaged model), would you then conduct a k-fold cross-validation to evaluate its predictive performance?

  3. What is the easiest way to code a k-fold cross-validation in R?

  4. Does the k-fold cross-validation code break up your entire data set (e.g., 20,000 observations) into training and validation sets automatically, or do you have to subset the data manually?
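For reference, here is a minimal sketch of the workflow I have in mind, using the MuMIn package; the data frame dat, the binary response y, and the predictors x1–x3 are placeholders rather than my actual variables:

    ## Candidate-set step with MuMIn::dredge (placeholder names throughout)
    library(MuMIn)

    dat    <- na.omit(dat)                  # dredge requires na.action = na.fail
    global <- glm(y ~ x1 + x2 + x3, family = binomial,
                  data = dat, na.action = na.fail)

    aicc <- dredge(global, rank = "AICc")   # all-subsets models ranked by AICc
    head(aicc)                              # lowest-AICc models listed first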

Best Answer

Your current strategy will lead to overfitting. Note that dredge is essentially a form of best subsets selection. (The function name is rather evocative.) Such procedures are ill-advised in general (see my answer here: Algorithms for automatic model selection).

In addition to overfitting, cross-validating only the selected model will give you an over-optimistic estimate of its out-of-sample performance. Instead, you could include the entire model-selection process within the cross-validation. For example, imagine you are doing 10-fold cross-validation. On the first iteration, you would use the first nine folds to fit the candidate models and select the best one; the selected model would then be applied to the tenth fold to assess its out-of-sample performance. Note that the models selected in this way may differ from one iteration to the next. This approach tells you the out-of-sample performance of a model selected by this procedure, rather than the out-of-sample performance of one particular model that has already been selected.
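Here is a rough sketch of that nested setup for logistic regression, assuming a hypothetical data frame dat with a 0/1 response y; a fixed list of candidate formulas compared by AIC stands in for whatever selection procedure you actually use:

    ## Cross-validate the whole selection procedure (placeholder names)
    set.seed(1)
    k     <- 10
    folds <- sample(rep(1:k, length.out = nrow(dat)))   # random fold labels

    candidates <- list(y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x3)

    fold_error <- numeric(k)
    for (i in 1:k) {
      train <- dat[folds != i, ]
      test  <- dat[folds == i, ]

      ## Selection happens *inside* the fold: fit every candidate on the
      ## training folds and keep the one with the lowest AIC
      fits <- lapply(candidates, glm, family = binomial, data = train)
      best <- fits[[which.min(sapply(fits, AIC))]]

      ## Assess the selected model on the held-out fold
      p <- predict(best, newdata = test, type = "response")
      fold_error[i] <- mean((p > 0.5) != test$y)         # misclassification rate
    }
    mean(fold_error)   # out-of-sample performance of the selection procedure

The average over the folds estimates how well the select-then-fit procedure predicts new data, which is the relevant quantity when the model itself is chosen from the data.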

Regarding how to do this in R, there are a number of pre-existing functions and packages to help you with cross-validation. There is a helpful overview of several options here (pdf). You may also want to check out the caret package. To do some form of customized cross-validation, you may need to code it up yourself, though.
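As an illustration of the packaged route, here is a hedged sketch using caret and boot::cv.glm; the data frame dat (with a 0/1 response y and predictors x1–x3) is again a placeholder. Both functions create the folds for you, which also answers question 4:

    ## 10-fold CV of a single, pre-specified logistic model (placeholder names)
    library(caret)
    library(boot)

    ## caret creates and manages the folds itself; factor(y) marks this as
    ## a classification problem
    ctrl <- trainControl(method = "cv", number = 10)
    fit  <- train(factor(y) ~ x1 + x2 + x3, data = dat, method = "glm",
                  family = binomial, trControl = ctrl)
    fit$results        # accuracy and kappa averaged over the folds

    ## boot::cv.glm reports a cross-validated prediction error for a fitted glm
    m  <- glm(y ~ x1 + x2 + x3, family = binomial, data = dat)
    cv <- cv.glm(dat, m, K = 10)
    cv$delta           # raw and bias-adjusted CV error estimates

Note that both of these cross-validate a single, already-chosen model; to honestly assess a model chosen by dredge or any other selection procedure, the selection step itself has to sit inside the folds, as in the sketch above.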
