Solved – When to divide data into training & test set in logistic regression

Tags: cross-validation, logistic, maximum-likelihood, regression, regression-coefficients

I am using Logistic Regression in a low event rate situation.
Overall universe: 46,000
Events: 420

In conventional logistic regression modeling, the data are divided into training and test sets and error rates are computed; the final coefficients and threshold level are then chosen and a model is created.
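For concreteness, here is a minimal sketch of that conventional split in scikit-learn, on synthetic data that mimics the event rate above (the data and all names are illustrative, not from the actual analysis):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the real data: 46,000 rows with ~0.9% events
# (roughly 420 / 46,000).
X, y = make_classification(
    n_samples=46_000, n_features=5, n_informative=5, n_redundant=0,
    weights=[0.991], random_state=0,
)

# Stratifying keeps the event rate the same in both halves, which matters
# when events are this rare.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))
```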

OTOH, I'm just trying to show that a given coefficient is significant and positively associated with the event under study. I'm not developing a model as of now, so I don't focus on error rates (too many true negatives!) and would choose my threshold level based on the hit rate.

Should I consider dividing my universe into 2 samples, the conventional way? With such a low event rate, I'm worried that doing this would bias my coefficient estimates.

Best Answer

I do not think you need to split the data if you are interested in the significance of a coefficient rather than in prediction. Cross-validation is used to judge prediction error outside the sample used to estimate the model; typically, the objective is to tune some parameter that is not itself estimated from the data.

For example, if you were interested in prediction, I would advise you to use regularized logistic regression. This is like ordinary logistic regression, except that the coefficients (as a whole) are biased towards 0. The amount of bias is controlled by a penalty parameter that is typically fine-tuned via cross-validation: you choose the penalty parameter that minimizes the out-of-sample error, which is measured via cross-validation. When building a predictive model, it is acceptable (and desirable) to introduce some bias into the coefficients if that bias buys a much larger drop in the variance of the predictions, resulting in a better model for predictive purposes.
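As a rough illustration of that tuning loop, scikit-learn's LogisticRegressionCV picks the penalty level by cross-validation; the grid size, fold count, and synthetic data below are arbitrary illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Illustrative data with a rare outcome, as in the question.
X, y = make_classification(
    n_samples=46_000, n_features=5, n_informative=5, n_redundant=0,
    weights=[0.991], random_state=0,
)

# L2-penalized logistic regression; the penalty strength is chosen by
# 5-fold cross-validation over 20 candidate values of C (C is the
# inverse penalty, so small C means strong shrinkage towards 0).
clf = LogisticRegressionCV(
    Cs=20,
    cv=5,
    penalty="l2",
    scoring="neg_log_loss",  # out-of-sample log loss as the CV criterion
    max_iter=1000,
).fit(X, y)

print("C chosen by cross-validation:", clf.C_[0])
print("shrunken coefficients:", clf.coef_)
```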

What you are trying to do is inference. You want an unbiased estimate of a coefficient (presumably to judge the effect that changing one variable has on another). The best way to obtain this is to have a well-specified model and as large a sample as possible. Hence, I would not split the sample. If you are interested in sampling variation, you should try a bootstrap or a jackknife procedure instead.
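For instance, here is a minimal sketch of a nonparametric bootstrap for a single coefficient using statsmodels; the synthetic data, the 500 replicates, and the percentile interval are all illustrative choices, not a prescription:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_classification

# Illustrative data with a rare outcome, as in the question.
X, y = make_classification(
    n_samples=46_000, n_features=5, n_informative=5, n_redundant=0,
    weights=[0.991], random_state=0,
)
Xc = sm.add_constant(X)  # plain (unpenalized) maximum-likelihood fit
rng = np.random.default_rng(0)
n = len(y)

# Resample whole rows with replacement and refit; this preserves the
# event rate on average across bootstrap samples.
boot = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)
    fit = sm.Logit(y[idx], Xc[idx]).fit(disp=0)
    boot.append(fit.params[1])  # coefficient of the first predictor

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% percentile bootstrap CI for the coefficient: [{lo:.3f}, {hi:.3f}]")
```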

EDIT:

Short version: you want an unbiased model. Cross-validation can help you find a good predictive model, but predictive models are often biased. Hence, I do not think cross-validation is helpful in this situation.
