Solved – Splitting data for training, validation and testing for regularized logistic regression with limited observations

logistic, machine learning, regression, regularization, supervised learning

I want to run a logistic regression with lasso/ridge regularization on a dataset of 4500 observations. The number of 1s in the data is 802 (18%). I have ~500 predictors, most of which are dummy variables (1 or 0). I am unsure how to split the data for training, validation and testing, since the number of data points is so small.

Should I use 80% of the data for training and the rest for both validation and testing? But then the test results will be biased. If I divide the data into three parts, I am left with an inadequate training set, since the number of predictors is large.

Best Answer

With lasso or ridge regression, you do not need to divide your data into three parts. Once you have decided how best to split your data into two, you can use cross-validation on the training set to choose the shrinkage parameter and then fit the model on that same training data without introducing bias (see the lasso paper by Tibshirani in the Journal of Statistical Software, I believe). The test set plays no part in the tuning, so it still gives an honest estimate of performance. Consequently, your question should be how much data to use to fit the model and how much to hold out for testing. Since your sample size is small, I would recommend either a 70-30 or an 80-20 split. There are no firm rules about the split, but I pay more attention to having enough data to estimate the parameters of my model than to having "sufficient" test data.
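To make the partitioning concrete, here is a minimal sketch in plain Python of one way to do it: a stratified 80-20 hold-out, followed by stratified cross-validation folds drawn only from the training part. The helper names `stratified_holdout` and `stratified_folds`, the choice of 5 folds, and the random seed are assumptions for illustration, not something the answer prescribes. The labels are synthetic and only reproduce the class counts from the question.

```python
import random


def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately
    so both parts keep roughly the same share of 1s."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(labels, indices, k=5, seed=0):
    """Deal the given indices into k folds, class by class, round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels[i] for i in indices)):
        idx = [i for i in indices if labels[i] == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Synthetic labels with the class counts from the question:
# 802 ones among 4500 observations.
labels = [1] * 802 + [0] * 3698

train_idx, test_idx = stratified_holdout(labels, test_fraction=0.2)
folds = stratified_folds(labels, train_idx, k=5)  # built from training rows only

print(len(train_idx), len(test_idx))       # 3600 900
print(sum(labels[i] for i in test_idx))    # 160 ones in the test part
print([len(f) for f in folds])
```

For each candidate shrinkage value you would fit on four of the folds and score on the fifth, rotate through all five, pick the value with the best average score, refit on all of `train_idx`, and only then evaluate once on `test_idx`. Library routines such as `cv.glmnet` in R or `LogisticRegressionCV` in scikit-learn run that inner loop for you, so in practice only the hold-out split needs to be done by hand.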