Machine Learning – How to Speed Up Optimization in K-Fold Cross Validation

Tags: cross-validation, machine learning, maximum likelihood, optimization

Short version: What ways are there, if any, to speed up the optimization in $k$-fold cross validation $(k \ge 10)$ for generic models in which the data points are statistically independent, by exploiting the fact that the optima of the individual folds are likely to be close to each other?


Long version:
Let us consider $k$-fold cross validation for model evaluation, with $k \ge 10$.

For each fold, I am performing maximum-likelihood estimation (MLE) on $\frac{k-1}{k}$-th of the data points and then computing the (log) likelihood of the test data. The models are all non-linear and fairly complex, but the data points $\textbf{x}_i$, $1 \le i \le N$, are assumed to be independent given the parameters $\theta$, that is:
$$\log f(\textbf{x}|\theta) = \sum_{i = 1}^N \log f_i(\textbf{x}_i|\theta).$$
In practice, $k$-fold cross validation amounts to running at least $k$ independent optimizations (one per fold) to find the MLE; possibly each optimization is repeated from several starting points if the landscape is non-convex, so as to try to avoid local optima.
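For concreteness, here is a minimal toy sketch of that baseline: independent multi-start MLE per fold. The Gaussian `neg_log_lik` and the data array `X` are placeholders standing in for the (much more complex, non-linear) models and data in question; SciPy and scikit-learn are used only for convenience.

```python
# Baseline: k independent multi-start MLE optimizations, one per fold.
# `neg_log_lik` and `X` are toy placeholders, not the actual models/data.
import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=2.0, size=(500, 1))          # placeholder data, N x d

def neg_log_lik(theta, X):
    """Toy example: independent Gaussian points, theta = (mean, log_std)."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * ((X[:, 0] - mu) / sigma) ** 2 + log_sigma
                  + 0.5 * np.log(2 * np.pi))

def fit_fold(X_train, n_restarts=5):
    """Independent multi-start MLE on one training fold."""
    best = None
    for _ in range(n_restarts):
        theta0 = rng.normal(size=2)                         # random restart
        res = minimize(neg_log_lik, theta0, args=(X_train,), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return best

kf = KFold(n_splits=10, shuffle=True, random_state=0)
test_loglik = [-neg_log_lik(fit_fold(X[tr]).x, X[te]) for tr, te in kf.split(X)]
```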

The point is, running $k$ independent optimizations (possibly multiple times) can be quite expensive and seems wasteful. If $k$ is large (and $N$ is large), and the folds are statistically i.i.d. (e.g., because the data approximately are, and we use stratification), the optima of the different folds will (often) be close to each other. So it seems that there should be ways to exploit this information to guide the optimizations and save function evaluations. For example, a naive approach might consist of initializing the optimization of each fold from the (vicinity of the) optimum of the previous fold, or from the optimum of the full function, or somehow exploiting information gained from optimization of other folds (various ideas come to mind for how to do so).
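Continuing the toy sketch above, this is what the naive warm-start idea would look like: each fold adds the previous fold's optimum to its set of starting points and can therefore afford fewer random restarts (same placeholder `neg_log_lik`, `X`, `kf` and `rng` as before).

```python
# Naive warm-start variant: pass each fold's optimum on to the next fold.
def fit_fold_warm(X_train, warm_start=None, n_random_restarts=2):
    starts = [rng.normal(size=2) for _ in range(n_random_restarts)]
    if warm_start is not None:
        starts.append(warm_start)                           # reuse previous fold's optimum
    best = None
    for theta0 in starts:
        res = minimize(neg_log_lik, theta0, args=(X_train,), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return best

prev_opt, test_loglik_warm = None, []
for tr, te in kf.split(X):
    res = fit_fold_warm(X[tr], warm_start=prev_opt)
    prev_opt = res.x                                         # passed on to the next fold
    test_loglik_warm.append(-neg_log_lik(res.x, X[te]))
```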

As an example, Multi-task Bayesian Optimization (Swersky, Snoek and Adams, NIPS 2013) infers a correlation kernel between related functions to speed up simultaneous optimization. Multi-task BO is not built specifically for cross-validation (it is a possible application), and it applies to costly function evaluations, so it is not suitable for what I am currently doing.

What other known techniques are there in the literature [edit: not based on BO, that was just an example] that could be used to speed up the optimization for cross-validation in this case? Specific pointers to papers would be appreciated.

(For the record, I am already aware of Pareto-smoothed importance sampling leave-one-out CV by Vehtari and Gelman, 2015; but that is for MCMC sampling, not MLE, and in my case it fails.)

Best Answer

Warning: this is essentially a non-answer.

If your $k$-fold cross validation is to be used for validation purposes, the independence of the optimizations is actually crucial. If you use model parameters obtained on the full data set for initializing any of the surrogate model trainings, that surrogate model is no longer independent of the full data set. That is, there is dependence also between the surrogate model and its test data! The same is true for any two surrogate models of the cross validation: the test data of one surrogate model is always training data for the other.

It is quite possible that exploiting the assumed (!) similarity in this way will lead to more similar models than you'd have obtained otherwise. I'd therefore say that the set of sped-up optimized surrogate models is actually produced by a (slightly) different training algorithm, and therefore consists of different models.


One of the reasons why I actually do cross validation is to demonstrate that the obtained (surrogate) models are very similar, even if optimized/trained independently, i.e. I use (iterated/repeated) $k$-fold cross validation to measure model stability.

That is, the similarity you want to exploit for speed-up is actually something I consider part of the things to be measured/validated by the cross validation.
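As an illustration only (reusing the toy `fit_fold` and data array `X` from the sketches in the question, so purely a placeholder, not my actual workflow), this is the kind of stability check I mean: refit every surrogate model independently and look at the spread of the fitted parameters across all folds and repetitions.

```python
# Repeated k-fold CV as a stability measurement: the spread of the fitted
# parameters across surrogate models is part of what is being validated.
from sklearn.model_selection import RepeatedKFold

rkf = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)
params = np.array([fit_fold(X[tr]).x for tr, _ in rkf.split(X)])
print("per-parameter std across surrogate models:", params.std(axis=0))
```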


Update to answer @lacerbi's comments

Assuming vs. constructing similarity: we are talking about similarity here on at least 3 different levels; here's how I see this:

  1. We construct very similar training data.
  2. We assume (or hope) that very similar training data will lead to very similar models as well.
    However, this is not guaranteed: we call models that have this characteristic stable (wrt. small differences/perturbations in the training data).
  3. Likewise we assume that the predictions of the models are very similar (for the same test cases). This is an assumption that underlies the whole idea of resampling validation, it allows us to use the predictions of the cross validation models as surrogate for independent test case predictions of the model in question.

(I distinguish between 2 and 3 as I work with vibrational spectra, where the physical data-generation process leads to correlations which can cause model parameters to be less stable than the corresponding predictions.)

it seems unsmart to me to throw away this information

This is one very valid point of view. However, double use of data will cause dependence. Which use you put the data to is up to you to decide. All I'm saying is that if you use the data to generate good starting values for your training, it is used up as far as validation with independent data is concerned.

As far as validating the model you get by using all data for training is concerned: you state that we can assume $k$ to be large, and you explain that surrogate models trained on $\frac{k-1}{k}$ of the data set are actually unstable (see also below). Now I'd argue that the larger $k$, the less plausible the assumption that the model trained on all data is stable (found the global minimum): with large $k$, each surrogate model is trained on almost all of the data, so if (some) surrogate models had problems finding the global minimum, the full-data fit is likely to run into the same problems. If possible, the validation procedure should therefore show the instability of the surrogate models: such instability is no "accident" of the surrogate models; it is a problem that applies to the model trained on all data as well, and there you do not have an initialization similar to the one you propose for the surrogate models.

assume independence which is not there.

The way I see cross validation, it traditionally does not assume independence but rather almost perfect similarity between the models: IMHO it assumes

  1. "surrogate models" trained on $\frac{k - 1}{k}$ of the data are almost the same as the model trained on all the data.
  2. As 1. regularly shows signs of violation (such as the well-known pessimistic bias), the weaker assumption is that at least all $k$ surrogate models are similar enough to each other, so that we are allowed to pool their testing results.

Independence is only assumed between each of the surrogate models and its corresponding test set.

 not exchanging information between training folds might prevent one or more folds from finding the global optimum (this happened to me with real data)

This is a concern about the training algorithm/its resulting models being unstable wrt. initialization and also wrt. the actual training samples. This is an important characteristic of the model. One of the advantages of cross validation is that it can (to some extent) measure such instability. This possibility is of course the more important, the more unstable your models are.

In other words, I'd see saving computational time by such an initialization as a legitimate and rather harmless shortcut as long as there is evidence that stability of the solutions is not a concern. But in your case the evidence wrt. model stability is rather that instability is a concern, and therefore I do not recommend the proposed shortcut.

Note that model aggregation, as one way of dealing with unstable models, does pretty much the opposite of your idea: instead of forcing the solution towards one of the unstable solutions, aggregating tries to cover the actually occurring variation as much as possible, and aggregates (averages) the solutions.
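As a toy contrast (again reusing the placeholder `fit_fold`, `kf` and `X` from the sketches in the question; an illustration only, not a recommendation for arbitrary non-linear models):

```python
# Aggregation sketch: keep all k independently fitted solutions and combine
# them, instead of steering each fold towards one particular solution.
fold_params = np.array([fit_fold(X[tr]).x for tr, _ in kf.split(X)])
theta_avg = fold_params.mean(axis=0)        # simple parameter averaging
# For strongly non-linear models, averaging the k models' predictions is
# usually safer than averaging their parameters directly.
```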

bias-variance-tradeoff

The variance reduction you propose corresponds, in the extreme, to picking one of the solutions, say that of the 1st fold. I don't see how that can help very much. However, I do see that e.g. picking the most frequent solution (or some kind of median or average model) can help. The point here is that we are then actually talking about some kind of aggregated model, which is something different from the individual models that enter this aggregation. You should also be very clear that your proposed scheme invalidates the out-of-bag shortcut for calculating performance estimates, because you have traded the independence of the test data for better training.