Solved – Sample selection algorithms to ensure that training & validation sets are representative

cross-validation, machine learning, r, reliability

I am currently facing the question of how to select representative samples (a training set and test set, and possibly a validation set) from the whole data set. I would like to build a classification model using the training set, tune its parameters using the validation set, and finally make predictions on an external test set that was not used to build the model (in practice, unseen data without labels).

Common approaches include cross-validation, hold-out splits, and so on, and random splitting can often handle this. However, when new data points fall outside the model's domain, the predictions become unreliable. So instead of splitting randomly into training and test sets, are there split methods based on the independent variables (X), other than the Kennard-Stone algorithm? Are there other algorithms that can do this reasonably? (Such algorithms should take the distribution of samples between training and test set into account.)
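For concreteness, here is a minimal base-R sketch of the Kennard-Stone idea (start from the two most distant points, then repeatedly add the candidate farthest from the already-selected set). The function name `kennard_stone` and the choice of Euclidean distance are my own illustrative assumptions, not a reference implementation:

```r
# Minimal Kennard-Stone sketch (Euclidean distance, base R only).
# X: numeric matrix of predictors; k: number of training samples to pick.
kennard_stone <- function(X, k) {
  D <- as.matrix(dist(X))                  # pairwise distances
  # seed the selection with the two most distant points
  selected <- as.integer(which(D == max(D), arr.ind = TRUE)[1, ])
  remaining <- setdiff(seq_len(nrow(X)), selected)
  while (length(selected) < k) {
    # distance of each remaining point to its nearest selected point
    d_nearest <- apply(D[remaining, selected, drop = FALSE], 1, min)
    pick <- remaining[which.max(d_nearest)]  # farthest from current selection
    selected <- c(selected, pick)
    remaining <- setdiff(remaining, pick)
  }
  selected
}

# Example: split 100 points into 80 training / 20 test based on X alone
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
train_idx <- kennard_stone(X, 80)
test_idx  <- setdiff(seq_len(nrow(X)), train_idx)
```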

Best Answer

If you want to go this route, you will need to account for the structure within your data when deciding which instances to use for a given purpose (e.g., building a model).

One way to do this would be to cluster your data and then draw some instances from every cluster (in proportion to its size); a sketch follows below. Note that if the clustering method and the classification method are related (for example, both kernel-based), you may implicitly introduce information leaks through the sampling approach. Of course, the clustering method you choose imposes a lot of assumptions of its own, so whether this beats a random split is very much an open question.
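As a minimal sketch of that idea in R, assuming k-means with an arbitrary number of clusters and an 80/20 split; nothing here is specific beyond "cluster, then sample per cluster":

```r
# Minimal sketch: cluster with k-means, then sample within each cluster
# in proportion to its size. The number of clusters (5) is arbitrary.
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
cl <- kmeans(X, centers = 5)

train_frac <- 0.8
idx_by_cluster <- split(seq_len(nrow(X)), cl$cluster)
train_idx <- unlist(lapply(idx_by_cluster, function(members) {
  # sample.int avoids R's sample() surprise on length-1 vectors
  members[sample.int(length(members), ceiling(train_frac * length(members)))]
}))
test_idx <- setdiff(seq_len(nrow(X)), train_idx)
```

Any clustering method could stand in for k-means here; the choice just encodes which notion of "structure in X" you want the split to respect.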

Personally, I prefer randomized approaches: they make few assumptions and have never failed me in practice.