Solved – Train/test split that resembles original dataset and each other

cross-validation, machine-learning

I'm modelling a continuous variable (say, the average amount of something per client). The variable has an asymmetric distribution: for example Gamma, Tweedie, etc.

Suppose that I'm not able to do cross-validation after building a model: all I can do is select train/test subsets once (80%/20%) from the initial dataset and then train the model on the train set.

The problem is that when selecting the 80% with a pseudo-random draw, my train set might not correctly resemble the original dataset. Likewise, the train and test sets might not resemble each other.

Does anyone know how to split the data into train/test sets so that the two parts resemble each other and the initial distribution?

I understand that I should usually use cross-validation when selecting model parameters to overcome this type of problem, but is there anything one could do without it? I found some information about the KLIEP algorithm, but I'm not sure it is applicable to the case mentioned above.

I would appreciate any comments/links to read.

Best Answer

We can always stratify our sample so that the distribution of the underlying variables is similar between the two sets; stratified sampling is a quite standard approach to ensure that random subgroups have similar statistical properties. If we are using R, there are multiple packages offering stratified sampling; e.g. the packages splitstackshape and stratification have a lot of readily available functionality. Most stratified sampling methodology originates from survey statistics and ecology, so one might want to read a paper like Shao's (2003) "Impact of the Bootstrap on Sample Surveys" to get a better idea of the potential implications of bootstrapping a (survey) sample. I have also found the UN's FAO (Food and Agriculture Organisation of the United Nations) Fisheries Technical Paper 434, Sampling methods applied to fisheries science: a manual, extremely readable and to the point (see in particular section 4, "Stratified random sampling").
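To make this concrete, here is a minimal sketch using splitstackshape's stratified() function. Since the target is continuous, it is first binned into quantile-based strata; the data frame and the column name y are hypothetical stand-ins for your data:

```r
library(splitstackshape)

set.seed(42)
# Toy data standing in for the original dataset: a Gamma-distributed
# target (the column name `y` is hypothetical).
df <- data.frame(y = rgamma(1000, shape = 2, rate = 0.5))

# Bin the continuous target into decile-based strata, so that sampling
# within each bin preserves the overall shape of the distribution.
df$stratum <- cut(df$y,
                  breaks = quantile(df$y, probs = seq(0, 1, 0.1)),
                  include.lowest = TRUE)

# Draw 80% within each stratum; bothSets = TRUE returns both the
# sampled rows and the remainder, i.e. a stratified train/test split.
parts <- stratified(df, group = "stratum", size = 0.8, bothSets = TRUE)
train <- parts[[1]]
test  <- parts[[2]]
```

The number of bins is a tuning knob: more bins match the distribution more closely but leave fewer observations per stratum.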

There are techniques that allow precise covariate balancing between control and treatment groups which could also be applicable, but they are almost certainly overkill for picking a hold-out set. They might nevertheless be useful as diagnostic tools.
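In that diagnostic spirit, a quick sanity check on any split (assuming the train/test objects and the hypothetical y column from the sketch above) could look like this:

```r
# Two-sample Kolmogorov-Smirnov test: a small p-value would suggest
# the train and test distributions of the target differ noticeably.
ks.test(train$y, test$y)

# Side-by-side summary statistics as a cruder comparison.
rbind(train = summary(train$y), test = summary(test$y))
```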