Solved – the continuous analog to stratified k-fold cross validation

Tags: cross-validation, regression, sampling

When training machines to do classification, we can use stratified k-fold cross validation to ensure that our training and test folds are representative of the entire dataset (i.e. they contain the same mix of class labels).

Is there an analog when training regression machines that ensures the folds are representative of the continuous distribution of the target variable?

Best Answer

I'm not aware of any approach that has acquired its own name (other than to note that stratification is not per se restricted to classification).


That being said, the building blocks are around, so let's design a cross validation experiment:

  1. Venetian blinds cross validation assigns consecutive samples to consecutive folds: $\text{fold} = \text{case number} \bmod k$.
    If we sort the cases* according to $y$ first, venetian blinds gets us close to stratified folds. This corresponds to assigning $\text{fold} = \operatorname{rank}(y) \bmod k$.

    This approach has a small but systematic built-in difference between the folds, as the difference between any two corresponding cases in two folds always has the same sign.

  2. We can improve our stratification by formulating the cross validation as a randomized blocked experiment (a sketch of both fold-assignment schemes follows this list):

    • block according to $y$ into blocks of $k$ cases each, and then
    • randomly assign folds within each block.
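
Here is a minimal sketch of both fold-assignment schemes, assuming the regression targets sit in a 1-d NumPy array `y`; the function names `venetian_blind_folds` and `blocked_random_folds` are my own labels, not established terminology:

```python
import numpy as np

def venetian_blind_folds(y, k):
    """Approach 1: fold = rank(y) mod k."""
    ranks = np.argsort(np.argsort(y))      # rank of each case within y (0 .. n-1)
    return ranks % k

def blocked_random_folds(y, k, seed=None):
    """Approach 2: block cases by y into blocks of k consecutive ranks,
    then randomly assign fold labels within each block."""
    rng = np.random.default_rng(seed)
    order = np.argsort(y)                  # case indices sorted by target value
    folds = np.empty(len(y), dtype=int)
    for start in range(0, len(y), k):
        block = order[start:start + k]     # next k cases in rank order
        folds[block] = rng.permutation(k)[:len(block)]
    return folds

y = np.random.default_rng(0).normal(size=20)
print(venetian_blind_folds(y, k=5))        # each fold spans the whole range of y
print(blocked_random_folds(y, k=5, seed=1))
```

With either scheme every fold covers the whole range of $y$; the blocked-random version removes the systematic within-block ordering mentioned under approach 1.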

Somewhat related are techniques that sample cases from $\mathbf X$ in order to get uniform coverage of $\mathbf X$ (i.e. input space rather than output space). This is particularly relevant where $\mathbf X$ is available for a large sample size but obtaining reference values $y$ is costly, so the reference cases should be carefully selected*.

  • The Kennard-Stone algorithm selects a subset of a given size. The Duplex algorithm is an extension that selects two subsets (usually a train/test split). It could be extended to produce $k$ groups and would then be a multi-dimensional analog of approach 1 above (a sketch of Kennard-Stone follows this list).
  • Blocking as in approach 2 above can also be done for multidimensional data, e.g. by k-means clustering or Kohonen maps (self-organizing maps).
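
As a rough illustration of the first bullet, here is a minimal sketch of Kennard-Stone selection, assuming the cases sit in the rows of a 2-d NumPy array `X` and using Euclidean distances; the full pairwise distance matrix is computed only for brevity and would not scale to very large $\mathbf X$:

```python
import numpy as np

def kennard_stone(X, n_select):
    """Greedy subset selection for uniform coverage of input space:
    start with the two most distant points, then repeatedly add the point
    whose distance to its nearest already-selected point is largest."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise Euclidean distances
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    min_dist = np.minimum(dist[first], dist[second])  # distance to nearest selected point
    while len(selected) < n_select:
        min_dist[selected] = -np.inf       # exclude already-selected points
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return np.array(selected)

X = np.random.default_rng(0).uniform(size=(200, 3))
chosen = kennard_stone(X, n_select=20)     # indices of the cases to send for reference analysis
```

The greedy update keeps only each candidate's distance to its nearest selected point, so every iteration is a single argmax over that vector.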

* This is a common situation, e.g., in chemical analysis when calibrating spectroscopic data: spectra $\mathbf X$ can often be obtained in a (semi)automated fashion, so lots of cases are measured spectroscopically. However, reference analyses $y$ are often expensive, so the task is to select a subset of $n$ (say, 100) cases from the much larger set of measured spectra $\mathbf X$ to send for reference analysis. The regression model is then trained either in a supervised fashion from that subset of $\mathbf X$ and the corresponding $y$, or in a semi-supervised fashion from the whole $\mathbf X$ and the smaller set of $y$.