Solved – Cross-validation on really small datasets

cross-validation, r, small-sample

Let's say I have a small dataset:

data <- replicate(4, rnorm(13))   # 13 observations of 4 standard normal variables

I want to test the out-of-sample predictions of a regression model as a function of training set size, increasing the training set by 10% in each step.

I use the following procedure:

test.set <- 3

# for each iteration, increase the training set size by 10% of the non-test cases
train.sizes <- round((nrow(data) - test.set) * seq(0.1, 0.9, 0.1))
train.sizes <- train.sizes[train.sizes > 1]
results <- numeric(length(train.sizes))

for (y in seq_along(train.sizes)) {
  # draw disjoint test and training indices
  test  <- sample(1:nrow(data), test.set)
  train <- sample(setdiff(1:nrow(data), test), train.sizes[y])

  test.data  <- as.data.frame(data[test, ])
  train.data <- as.data.frame(data[train, ])

  # fit a linear regression of the first column on the other three
  # (with only 2-4 training cases the fit is rank-deficient and predict() will warn)
  model <- lm(V1 ~ ., data = train.data)

  # get out-of-sample predictions
  predictions <- predict(model, newdata = test.data)

  # out-of-sample R squared (1 - SSE/TSS)
  SSE <- sum((test.data$V1 - predictions)^2)
  TSS <- (nrow(test.data) - 1) * var(test.data$V1)
  results[y] <- 1 - SSE / TSS
}

I repeat this procedure several times and take the average of the repetitions.
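
For instance, one way to organize this (a sketch; the wrapper run_once() and the 100 repetitions are arbitrary illustrative choices, not part of the procedure above):

# wrap the loop above into a function that returns one vector of R squared values ...
run_once <- function(data, test.set = 3) {
  train.sizes <- round((nrow(data) - test.set) * seq(0.1, 0.9, 0.1))
  train.sizes <- train.sizes[train.sizes > 1]
  sapply(train.sizes, function(sz) {
    test  <- sample(1:nrow(data), test.set)
    train <- sample(setdiff(1:nrow(data), test), sz)
    train.data <- as.data.frame(data[train, , drop = FALSE])
    test.data  <- as.data.frame(data[test, , drop = FALSE])
    # note: for the tiniest training sets lm() is rank-deficient and predict() warns
    model <- lm(V1 ~ ., data = train.data)
    pred  <- predict(model, newdata = test.data)
    1 - sum((test.data$V1 - pred)^2) /
      ((nrow(test.data) - 1) * var(test.data$V1))
  })
}

# ... then repeat and average
reps <- replicate(100, run_once(data))   # one column per repetition
rowMeans(reps)                           # mean out-of-sample R squared per training size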

My problem is that I get unreliable results. I am assuming this is because the dataset is really small. What could I do to get more reasonable estimates?

Best Answer

(Disclaimer: I did not check the code, and that would be more appropriately asked on codereview.sx)

My problem is that I get unreliable results. I am assuming this is because the dataset is really small. What could I do to get more reasonable estimates?

There are (at least) two small sample size problems mixed here:

  • because you have too few training cases, your models (including the surrogate models of the cross validation) are not only bad on average, but also unstable (i.e. they vary a lot if a few training instances are exchanged).

  • because you have too few test cases, your test results themselves are uncertain as well. So the results you observe may be (approximately) described as the sum of (written out as a formula below):

    learning curve (= average performance as a function of $n_{train}$) +
    model instability (= variance due to training sample size) +
    test variance (= variance due to test sample size) +
    testing bias (depending on how exactly you formulate your scenario, i.e. with respect to $n_{train}$ or $n$)
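
Informally (my notation, just restating the list above), this could be written as

$$\widehat{\mathrm{perf}} \;\approx\; \underbrace{\overline{\mathrm{perf}}(n_{train})}_{\text{learning curve}} \;+\; \underbrace{\varepsilon_{instability}(n_{train})}_{\text{model instability}} \;+\; \underbrace{\varepsilon_{test}(n_{test})}_{\text{test variance}} \;+\; \text{bias}$$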

For simulations, you can at least partially disentangle the training and test sample size contributions to the total uncertainty by a reference measurement of the model performance on a separate, large, independent test set.
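
For the simulated toy example above, such a reference measurement could look roughly like this (a sketch; the object names, the seed, the reference set size of 1e4 and the training size of 8 are arbitrary illustrative choices):

set.seed(42)
sim       <- as.data.frame(replicate(4, rnorm(13)))    # small data set as in the question
reference <- as.data.frame(replicate(4, rnorm(1e4)))   # large independent test set

# fit on a small training sample ...
train <- sim[sample(1:nrow(sim), 8), ]
model <- lm(V1 ~ ., data = train)

# ... and evaluate on the large reference set: this estimate carries (almost) no
# test-sample-size variance, so the spread you see when repeating this block
# reflects the learning curve and model instability only
pred   <- predict(model, newdata = reference)
R2.ref <- 1 - sum((reference$V1 - pred)^2) /
              ((nrow(reference) - 1) * var(reference$V1))
R2.ref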

If there isn't any further data (i.e. a real application, not a simulation), make sure that the model itself is stable (low complexity / highly regularized, and/or use an aggregated model): the instability contribution to the variance is the one thing here that you can influence, and you need to do this anyway, because an unstable model always means high generalization error. Stability can be checked by iterated/repeated cross validation or by out-of-bootstrap validation (a sketch follows below). In addition, the variance contribution due to model instability cancels out over many runs of the cross/out-of-bootstrap validation.
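
A minimal sketch of such a stability check via repeated cross validation on the simulated toy data (the 5 folds and 50 repetitions are illustrative choices): the spread of the predictions that one and the same case receives across repetitions tells you how unstable the surrogate models are.

set.seed(1)
sim  <- as.data.frame(replicate(4, rnorm(13)))
n    <- nrow(sim)
k    <- 5      # folds
reps <- 50     # repetitions

# out-of-fold predictions: one row per case, one column per repetition
pred.mat <- matrix(NA, nrow = n, ncol = reps)

for (r in 1:reps) {
  folds <- sample(rep(1:k, length.out = n))   # random fold assignment
  for (f in 1:k) {
    fit <- lm(V1 ~ ., data = sim[folds != f, ])
    pred.mat[folds == f, r] <- predict(fit, newdata = sim[folds == f, ])
  }
}

# per-case spread across repetitions: large values indicate unstable models
apply(pred.mat, 1, sd)
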
Once you know that this component of the variance is negligible, you can attribute the variance you observe to the finite test sample size. Note that $n$ here is the number of distinct cases that have been tested, regardless of how many runs you did. With that, you can calculate e.g. confidence intervals. Obviously this will not reduce the uncertainty, as it is fundamentally caused by too few samples, but it will allow you to judge what you can (or cannot) conclude from your data. With that (possibly plus simulations), you may be able to convince your supervisor/boss/customer that more samples are needed.
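
As an illustration of the last point (my sketch, not part of the original procedure): for a regression error measure, one simple possibility is a nonparametric bootstrap interval over the per-case errors of the distinct tested cases. The numbers below are made up, and with only a handful of distinct cases the interval will of course be very wide, which is exactly the information you need.

# hypothetical per-case squared prediction errors, one value per *distinct* tested case
case.sqerr <- c(0.8, 1.3, 0.4, 2.1, 0.9)

# nonparametric bootstrap confidence interval for the mean squared prediction error
set.seed(2)
boot.mse <- replicate(10000, mean(sample(case.sqerr, replace = TRUE)))
quantile(boot.mse, c(0.025, 0.975))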

I only have 3 instances in the test set. Is that large enough to get reliable estimates?

Typically, no.

For classification with a 0/1 loss function, the situation is even worse than for regression, and there we found that with fewer than about 100 test cases the test error estimate is typically so uncertain that not much can be said at all. I'd expect the situation to be somewhat nicer for regression/calibration, but unless you are extremely lucky and have essentially noiseless data, 3 test cases will not be sufficient to draw any kind of practical conclusion.
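
To make that concrete (a quick illustration of my own, not a result from the paper below): binom.test() gives exact Clopper-Pearson confidence intervals for an observed accuracy, and even a perfect 3 out of 3 is still compatible with a true accuracy below 0.3:

binom.test(3, 3)$conf.int     # 3/3 correct:   roughly 0.29 to 1.00
binom.test(80, 100)$conf.int  # 80/100 correct: roughly 0.71 to 0.87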

Here's a paper we wrote about the situation for classification:
Beleites, C. and Neugebauer, U. and Bocklitz, T. and Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33.
DOI: 10.1016/j.aca.2012.11.007

accepted manuscript on arXiv: 1211.1323


If you need to discuss this in more detail, I'd be willing to help, e.g. by writing a shared paper about this. I'm in chemometrics, and the topic would certainly be of practical importance.