Solved – Validation: Data splitting into training vs. test datasets

cross-validation

I was naively validating my binomial logit models by evaluating them on a held-out test dataset. I had randomly divided the available data (~2000 rows) into a training set (~1500 rows) and a test set (~500 rows).
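For reference, my split looked roughly like this (a minimal sketch; `d` is my data frame, `outcome` my binary response, and `x1`/`x2` stand in for my predictors):

    set.seed(42)                        # reproducible split
    n     <- nrow(d)                    # ~2000 rows in my case
    train <- sample(n, size = 1500)     # random rows for training
    fit   <- glm(outcome ~ x1 + x2, family = binomial, data = d[train, ])
    pred  <- predict(fit, newdata = d[-train, ], type = "response")
    mean((pred > 0.5) == d$outcome[-train])   # crude accuracy on the held-out ~500 rows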

I recently read a post in another thread (by Frank Harrell) that makes me question my approach:

Data splitting is not very reliable unless you have more than 15,000 observations. In other words, if you split the data again, accuracy indexes will vary too much from what you obtained with the first split.

How serious is this worry, and what are the ways around it? Harrell speaks of "resampling", but I am not sure how that would work for validation here.

Edit: Adding context as per @Bernhard's comment below:

Comparing logistic regression models

Best Answer

The split-sample validation you propose above has become less popular in many fields because of the issue Harrell mentions: the out-of-sample accuracy estimates are unstable. Harrell makes this point in his textbook ("Regression Modeling Strategies"); other references are Steyerberg, "Clinical Prediction Models", p. 301, and James et al., "An Introduction to Statistical Learning", p. 175.

In the biomedical field, bootstrap resampling has thus become the standard. It is implemented in Harrell's rms package and is therefore fairly easy to apply. You could really use any of the other resampling methods; the bootstrap has simply become popular because of a Steyerberg article suggesting it is the most efficient of them ("Internal validation of predictive models: efficiency of some procedures for logistic regression analysis").
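A minimal sketch of what that looks like in rms (assuming a data frame `d` with a binary response `outcome` and placeholder predictors `x1` and `x2`):

    library(rms)
    # x = TRUE, y = TRUE keeps the design matrix and response in the fit,
    # which validate() needs in order to refit on each bootstrap resample
    fit <- lrm(outcome ~ x1 + x2, data = d, x = TRUE, y = TRUE)
    set.seed(42)
    validate(fit, method = "boot", B = 200)   # optimism-corrected indexes (Dxy, slope, ...)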

It is worth mentioning that one benefit of the rms package is that it easily lets you include some of the variable selection in the bootstrap (a built-in stepwise selection option), which can be awkward to achieve with most commercial packages.
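With the same fit as above, that is a single extra argument (again a sketch):

    # bw = TRUE repeats fast backward stepdown inside every bootstrap
    # resample, so the selection step itself is part of what gets validated
    validate(fit, method = "boot", B = 200, bw = TRUE)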

My own sense is that the differences have been overemphasized. I usually get fairly consistent results irrespective of the method used, and with large sample sizes the differences are essentially non-existent.

Bootstrap validation, like the other resampling methods, can also easily be done wrong. Often only some of the model-building stages are included in the bootstrap, which yields inaccurate estimates. On the other hand, it is fairly hard to mess up split-sample validation. Given the face validity of split sampling (you know you didn't muck it up), I prefer it unless the dataset is very small. In many cases the model-building process is also complicated enough that it cannot realistically be included in a resampling method.

If you want to publish in a biomedical journal, though, and you are not using a Medicare-sized database, you will want to use a resampling method, most likely bootstrapping. If the dataset is large, you can probably still get published with k-fold cross-validation and save yourself some processing time.
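In rms that switch is again one line (same `fit` as above; here B is the number of folds):

    validate(fit, method = "crossvalidation", B = 10)   # 10-fold cross-validation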