Solved – GLM after model selection or regularization

Tags: model-selection, regression, regularization

I would like to pose this question in two parts. Both concern a generalized linear model: the first deals with model selection and the second with regularization.

Background: I use GLMs (linear, logistic, gamma regression) for both prediction and description. When I refer to the "normal things one does with a regression" I largely mean description with (i) confidence intervals around coefficients, (ii) confidence intervals around predictions, and (iii) hypothesis tests concerning linear combinations of the coefficients, such as "is there a difference between treatment A and treatment B?".
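For concreteness, here is a minimal R sketch of those three operations, assuming a hypothetical data frame `dat` with a binary outcome `y`, a covariate `x`, and a factor `treatment` with levels A and B (all names are illustrative):

```r
## Hypothetical data frame `dat` with binary y, numeric x, and a
## two-level factor treatment (levels "A", "B"); names are illustrative.
fit <- glm(y ~ x + treatment, data = dat, family = binomial)

## (i) Confidence intervals around coefficients (profile likelihood)
confint(fit)

## (ii) Confidence interval around a prediction: build it on the link
## scale, then back-transform with the inverse link
new <- data.frame(x = 1.5, treatment = "B")
pr  <- predict(fit, newdata = new, se.fit = TRUE, type = "link")
plogis(pr$fit + c(-1.96, 1.96) * pr$se.fit)

## (iii) A hypothesis test on a linear combination of coefficients:
## with A as the reference level, "A vs. B" is the Wald test on the
## treatmentB coefficient
summary(fit)$coefficients["treatmentB", ]
```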

Do you legitimately lose the ability to do these things using the normal theory under each of the following? And if so, are such models really only good for pure prediction?

I. When a GLM has been fit via some model selection process (for concreteness, say it's a stepwise procedure based on AIC).

II. When a GLM has been fit via a regularization method (say using glmnet in R).

My sense is that for I., the answer is that, technically, you should use a bootstrap for the "normal things one does with a regression", but no one really abides by that.

Update:
After getting a few responses and reading elsewhere, here is my take on this (for anyone else's benefit, and to receive correction).

I.
A) RE: Generalizing error rates. To generalize error rates to new data when there is no hold-out set, cross validation can work, but you need to repeat the process completely for each fold, using nested loops; thus any feature selection, parameter tuning, etc. must be done independently each time. This idea should hold for any modeling effort (including penalized methods). A sketch is below.
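A minimal R illustration of that idea, assuming a hypothetical data frame `dat` with a binary outcome `y`: the stepwise selection is redone from scratch inside every fold, so the cross-validated error reflects the entire procedure (any additional tuning would need its own inner loop):

```r
library(MASS)   # stepAIC

set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))
err   <- numeric(k)

for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]

  ## Repeat the ENTIRE selection process on the training fold only
  sel <- stepAIC(glm(y ~ ., data = train, family = binomial),
                 trace = FALSE)

  ## Assess on the untouched fold
  p      <- predict(sel, newdata = test, type = "response")
  err[i] <- mean((p > 0.5) != test$y)
}
mean(err)   # error estimate for the whole procedure, not just one model
```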

B) RE: Hypothesis testing and confidence intervals for a GLM. When using model selection (feature selection, parameter tuning, variable selection) for a generalized linear model and a hold-out set exists, it is permissible to perform the selection on one partition and then fit the selected model on the remaining data (or the full data set) and use that model/data to perform hypothesis tests, etc. If a hold-out set does not exist, a bootstrap can be used, as long as the full process is repeated for each bootstrap sample. This limits the hypothesis tests that can be done, though, since a given variable may not be selected in every bootstrap sample, for example.
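A minimal sketch of the bootstrap version, again assuming a hypothetical `dat` with binary outcome `y`. The entire stepwise selection is rerun on every resample, and tracking how often each variable survives makes the limitation explicit: inference about a coefficient is awkward when its variable is selected only some of the time.

```r
library(MASS)   # stepAIC

set.seed(1)
B    <- 200
vars <- setdiff(names(dat), "y")
kept <- matrix(FALSE, B, length(vars), dimnames = list(NULL, vars))

for (b in 1:B) {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  ## Rerun the FULL selection process on each bootstrap sample
  sel <- stepAIC(glm(y ~ ., data = boot, family = binomial),
                 trace = FALSE)
  kept[b, ] <- vars %in% all.vars(formula(sel))[-1]
}
colMeans(kept)   # how often each variable survives selection
```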

C) RE: Not caring about prediction on future data sets. Then fit a purposeful model guided by theory and a few hypothesis tests, and even consider leaving all variables in the model, significant or not (along the lines of Hosmer and Lemeshow). This is classical regression modeling with a small set of variables, and it allows the use of CIs and hypothesis tests.

D) RE: Penalized regression. No advice; perhaps consider it suitable for prediction only (or as a type of feature selection to then apply to another data set, as in B above), since the bias introduced makes CIs and hypothesis tests unwise, even with the bootstrap.
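One hedged way to make D usable beyond prediction is the split-sample route from B: use the penalty purely for selection on one half of the data, then do ordinary GLM inference on the other half, which played no part in the selection. A sketch, assuming a hypothetical predictor matrix `x` and binary response `y`:

```r
library(glmnet)

set.seed(1)
half <- sample(nrow(x), nrow(x) %/% 2)

## Select variables on one half only
cvfit <- cv.glmnet(x[half, ], y[half], family = "binomial")
beta  <- as.matrix(coef(cvfit, s = "lambda.1se"))[-1, 1]  # drop intercept
keep  <- which(beta != 0)

## Refit on the untouched half; since this data played no part in the
## selection, the usual CIs and tests for the refit are legitimate
dat2 <- data.frame(y = y[-half], x[-half, keep, drop = FALSE])
fit2 <- glm(y ~ ., data = dat2, family = binomial)
confint(fit2)
```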

Best Answer

You might check out David Freedman's paper, "A Note on Screening Regression Equations" (ungated).

Using completely uncorrelated data in a simulation, he shows that, if there are many predictors relative to the number of observations, a standard screening procedure will produce a final regression containing many (more than expected by chance) significant predictors and a highly significant F statistic. The final model appears effective at predicting the outcome, but this success is spurious. He also illustrates these results using asymptotic calculations. Suggested solutions include screening on a sample and assessing the model on the full data set, and using at least an order of magnitude more observations than predictors.
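A small R simulation in the spirit of Freedman's example (the sample size, predictor count, and screening threshold below are illustrative choices, not necessarily his exact design): regress pure noise on pure noise, keep the predictors that screen well, and refit.

```r
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                          # outcome unrelated to X

## First pass: screen predictors by p-value
pv   <- summary(lm(y ~ X))$coefficients[-1, 4]
keep <- which(pv < 0.25)

## Second pass: the refit looks spuriously good
s2 <- summary(lm(y ~ X[, keep]))
s2$fstatistic                          # often "highly significant"
sum(s2$coefficients[-1, 4] < 0.05)     # "significant" noise predictors
```

Rerunning this across many seeds shows the pattern is systematic rather than a fluke, which is the point of the paper.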