Solved – Model selection and model performance in logistic regression

cross-validation, logistic, model-selection

I have a question about model selection and model performance in logistic regression. I have three models that are based on three different hypotheses. The first two models (let's call them z and x) each have a single explanatory variable, and the third (let's call it w) is more complicated. I'm using AIC for variable selection within the w model, and then AIC again to compare which of the three models explains the dependent variable best. I've found that the w model has the lowest AIC, and I now want to compute some performance statistics for it to get an idea of its predictive power, since all I know so far is that this model is better than the other two, not how good it is in absolute terms.
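For concreteness, this is roughly how I'm doing the AIC comparison (a minimal sketch in Python with statsmodels; the data are simulated stand-ins, and the column names are placeholders for my actual z, x and w variables):

```python
# Rough sketch of the AIC comparison step. The DataFrame below is simulated
# only so the snippet runs; in my case it would hold the real 156 observations.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(156, 5)),
                  columns=["var_z", "var_x", "var_a", "var_b", "var_c"])
df["y"] = rng.integers(0, 2, size=156)

model_z = smf.logit("y ~ var_z", data=df).fit(disp=0)                 # hypothesis z
model_x = smf.logit("y ~ var_x", data=df).fit(disp=0)                 # hypothesis x
model_w = smf.logit("y ~ var_a + var_b + var_c", data=df).fit(disp=0) # hypothesis w

for name, model in [("z", model_z), ("x", model_x), ("w", model_w)]:
    print(f"model {name}: AIC = {model.aic:.1f}")
```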

Since I've used all the data to fit the model (so that I could compare all three models), how do I go about assessing model performance? From what I've gathered, I can't simply run k-fold cross-validation on the final model that came out of the AIC-based selection; instead I need to start from the beginning, with all explanatory variables included, within each fold. Is that correct? I would have thought it is the final model chosen with AIC whose performance I want to know, but I realize that because it was trained on all the data, a performance estimate from that same data might be biased. If I do repeat the selection from scratch in every fold, I will end up with different final models in some folds. Can I then just pick the model from the fold with the best predictive power, fit it to the full data set, and compare its AIC with the other two models (z and x)? Or how does it work? (I've sketched below what I think is meant by repeating the selection in every fold.)
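This is the procedure I think is meant by "starting from the beginning" inside every fold: the AIC-based selection is rerun on each training fold and only the held-out fold is scored, so the estimate describes the whole selection-plus-fitting procedure. A rough Python sketch, with simulated stand-in data and a simple forward-by-AIC rule as a placeholder for whatever selection I'd actually use:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)                       # simulated stand-in data
X_all = pd.DataFrame(rng.normal(size=(156, 5)),
                     columns=[f"v{i}" for i in range(5)])
y_all = pd.Series(rng.integers(0, 2, size=156))


def forward_select_by_aic(X, y):
    """Greedy forward selection on the training fold only, scored by AIC."""
    selected, remaining = [], list(X.columns)
    best_aic = sm.Logit(y, np.ones(len(y))).fit(disp=0).aic   # intercept-only model
    improved = True
    while improved and remaining:
        improved, best_var = False, None
        for var in remaining:
            aic = sm.Logit(y, sm.add_constant(X[selected + [var]])).fit(disp=0).aic
            if aic < best_aic:
                best_aic, best_var, improved = aic, var, True
        if improved:
            selected.append(best_var)
            remaining.remove(best_var)
    return selected


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
aucs = []
for train_idx, test_idx in cv.split(X_all, y_all):
    X_tr, y_tr = X_all.iloc[train_idx], y_all.iloc[train_idx]
    X_te, y_te = X_all.iloc[test_idx], y_all.iloc[test_idx]
    chosen = forward_select_by_aic(X_tr, y_tr)        # selection redone per fold
    fit = sm.Logit(y_tr, sm.add_constant(X_tr[chosen])).fit(disp=0)
    preds = fit.predict(sm.add_constant(X_te[chosen], has_constant="add"))
    aucs.append(roc_auc_score(y_te, preds))

print(f"cross-validated AUC of the selection + fitting procedure: {np.mean(aucs):.2f}")
```

Is that the right idea, and if so, which model do I then report?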

The second part of my question is a basic one about over-parameterization. I have 156 data points, of which 52 are 1's and the rest are 0's. I have 14 explanatory variables to choose from for the w model, and I realize I can't include them all because of over-parameterization. I've read the rule of thumb that you should use no more candidate variables than about 10% of the smaller group of the dependent variable, which for me would be only about 5. I'm trying to answer a question in ecology; is it OK to select the starting variables that I think explain the dependent variable best purely on ecological grounds? Or how should I choose the starting explanatory variables? It doesn't feel right to exclude some variables completely.
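Just to spell out the arithmetic behind that rule of thumb as I understand it (one candidate parameter per roughly ten observations in the smaller outcome class; the cut-off is the heuristic I read about, not something I'm certain of):

```python
# Events-per-variable heuristic as I understand it:
# roughly one candidate parameter per 10 observations in the smaller class.
n_total, n_ones = 156, 52
n_smaller_class = min(n_ones, n_total - n_ones)   # 52 ones vs 104 zeros -> 52
max_candidate_vars = n_smaller_class // 10        # about 5 candidate variables
print(max_candidate_vars)
```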

So I really have three questions:

  • Is it OK to assess performance with cross-validation on a model that was trained on the full data set?
  • If not, how do I choose the final model when doing cross-validation?
  • How do I choose the starting variables so that I don't over-parameterize?

Sorry for my messy questions and my ignorance. I know that similar questions have been asked, but I still feel a little confused. I appreciate any thoughts and suggestions.

Best Answer

It's true that it is better to use a test set of data to validate your model. However, you can still say how well your model performed on your data, as long as you are honest about what you did. What you cannot really do is say that it will do this well on other data: It likely won't. Unfortunately, a lot of published articles at least hint at this incorrect notion.

You ask

is it OK to select the starting variables that I think explain the dependent variable best purely on ecological grounds?

Not only is it OK, it is better than any automated scheme. Indeed, these could also be the final variables. It depends, somewhat, on the extent of knowledge in the field. If not much is known about what you are researching, then a more exploratory approach may be necessary. But if you have good reason to think that certain variables should be in the model, then by all means put them in. And I would argue for leaving them there, even if they are not significant.
