Solved – When not to use cross validation

cross-validation, machine learning, self-study

As I read through the site, most answers suggest that cross validation should be used with machine learning algorithms. However, while reading the book "Understanding Machine Learning" I came across an exercise implying that it is sometimes better not to use cross validation. I'm really confused. When is training an algorithm on the whole data set better than cross-validation? Does this happen with real data sets?

Let $H_1,\ldots,H_k$ be $k$ hypothesis classes. Suppose you are given $m$ i.i.d. training examples and you would like to learn the class $H=\cup^k_{i=1}H_i$. Consider two alternative approaches:

  1. Learn $H$ on the $m$ examples using the ERM rule

  2. Divide the $m$ examples into a training set of size $(1-\alpha)m$ and a validation set of size $\alpha m$, for some $\alpha\in(0,1)$. Then, apply the approach of model selection using validation. That is, first train each class $H_i$ on the $(1-\alpha)m$ training examples using the ERM rule with respect to $H_i$, and let $\hat{h}_1,\ldots,\hat{h}_k$ be the resulting hypotheses. Second, apply the ERM rule with respect to the finite class $\{\hat{h}_1,\ldots,\hat{h}_k\}$ on the $\alpha m$ validation examples.

Describe scenarios in which the first method is better than the second and vice versa.
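
To make the two approaches concrete, here is a minimal sketch of my own (not from the book), assuming each hypothesis class $H_i$ is the set of polynomials of degree $i$, the ERM rule is least-squares fitting, and the data are synthetic:

```python
# Minimal sketch of the exercise's two approaches, assuming each hypothesis
# class H_i is the polynomials of degree i and the ERM rule is least squares.
# The synthetic data and all names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
m, k, alpha = 30, 8, 0.3
x = rng.uniform(-1.0, 1.0, m)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=m)   # noisy target values

def erm_fit(deg, xs, ys):
    """ERM within H_deg: least-squares polynomial fit of degree deg."""
    return np.polynomial.Polynomial.fit(xs, ys, deg)

def emp_risk(h, xs, ys):
    """Empirical squared-error risk of hypothesis h on (xs, ys)."""
    return float(np.mean((h(xs) - ys) ** 2))

# Approach 1: ERM on the union H = H_1 ∪ ... ∪ H_k using all m examples.
# Since the classes are nested, this picks the hypothesis with the lowest
# training error, i.e. essentially the most complex class.
fits_on_all = [erm_fit(d, x, y) for d in range(1, k + 1)]
h_approach1 = min(fits_on_all, key=lambda h: emp_risk(h, x, y))

# Approach 2: split into (1-alpha)m training and alpha*m validation examples,
# run ERM per class on the training part, then select on the validation part.
n_val = int(alpha * m)
x_tr, y_tr, x_val, y_val = x[n_val:], y[n_val:], x[:n_val], y[:n_val]
candidates = [erm_fit(d, x_tr, y_tr) for d in range(1, k + 1)]
h_approach2 = min(candidates, key=lambda h: emp_risk(h, x_val, y_val))

print("approach 1 picks degree", h_approach1.degree())
print("approach 2 picks degree", h_approach2.degree())
```

Which of the two selected hypotheses generalizes better depends on the data, which is exactly what the exercise asks about.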


Best Answer

Take-home messages:


Unfortunately, the text you cite changes two things between approaches 1 and 2:

  • Approach 2 performs cross validation and data-driven model selection/tuning/optimization.
  • Approach 1 uses neither cross validation nor data-driven model selection/tuning/optimization.
  • Approach 3, cross validation without data-driven model selection/tuning/optimization, is perfectly feasible (and IMHO would lead to more insight) in the context discussed here.
  • Approach 4, no cross validation but data-driven model selection/tuning/optimization, is possible as well, but more complex to construct.

IMHO, cross validation and data-driven optimization are two totally different (and largely independent) decisions in setting up your modeling strategy. The only connection is that you can use cross validation estimates as the target functional for your optimization. But there are other target functionals ready to be used, and there are other uses of cross validation estimates (importantly, you can use them for verification of your model, aka validation or testing).
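
As a concrete example of the first use, here is a minimal sketch in which the cross validation estimate serves as the target functional of a data-driven hyperparameter optimization (scikit-learn, the synthetic regression data and the candidate grid are my illustrative assumptions, not part of the answer):

```python
# Sketch: the cross validation estimate used as the target functional of a
# data-driven hyperparameter optimization. scikit-learn, make_regression and
# the candidate grid are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# For every candidate hyperparameter, compute the CV estimate; the data then
# decide which candidate wins. This is the "data-driven optimization" part.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_score = {a: cross_val_score(Ridge(alpha=a), X, y, cv=5).mean() for a in candidates}
best_alpha = max(cv_score, key=cv_score.get)

print("CV estimate per candidate alpha:", cv_score)
print("selected alpha:", best_alpha)
```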

Unfortunately, machine learning terminology is IMHO currently a mess that suggests false connections/causes/dependencies here.

  • When you look up approach 3 (cross validation not for optimization but for measuring model performance), you'll find the "decision" cross validation vs. training on the whole data set to be a false dichotomy in this context: when using cross validation to measure classifier performance, the cross validation figure of merit is used as an estimate for a model trained on the whole data set. I.e., approach 3 includes approach 1 (see the sketch after this list).

  • Now, let's look at the 2nd decision: data-driven model optimization or not. This is IMHO the crucial point here. And yes, there are real world situations where not doing data-driven model optimization is better. Data-driven model optimization comes at a cost. You can think of it this way: the information in your data set is used to estimate not only the $p$ parameters/coefficients of the model; the optimization additionally estimates further parameters, the so-called hyperparameters. If you describe the model fitting and optimization/tuning process as a search for the model parameters, then this hyperparameter optimization means that a vastly larger search space is considered. In other words, in approaches 1 (and 3) you restrict the search space by specifying those hyperparameters. Your real world data set may be large enough (contain enough information) to allow fitting within that restricted search space, but not large enough to fix all parameters sufficiently well in the larger search space of approaches 2 (and 4).
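
To illustrate the first point, here is a minimal sketch of approach 3 (scikit-learn, the synthetic data set and the fixed hyperparameters are illustrative assumptions, not part of the original answer): the hyperparameters are fixed beforehand, the cross validation figure is used only as a performance estimate, and the model it refers to is then trained on the whole data set, i.e. approach 3 includes approach 1.

```python
# Sketch of approach 3: cross validation used only to *measure* performance of
# a model whose hyperparameters are fixed beforehand (no data-driven tuning).
# scikit-learn and the synthetic data set are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

model = SVC(C=1.0, kernel="rbf")   # hyperparameters fixed by prior knowledge

# The CV figure of merit: each fold's surrogate model approximates the model
# trained on all data, so the mean score serves as its performance estimate.
cv_estimate = cross_val_score(model, X, y, cv=5).mean()

# Approach 3 therefore includes approach 1: the model actually used is trained
# on the whole data set, and cv_estimate is reported for it.
final_model = model.fit(X, y)
print("estimated performance of the full-data model:", cv_estimate)
```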

In fact, in my field I very often have to deal with data sets that are far too small to allow any thought of data-driven optimization. So what do I do instead? I use my domain knowledge about the data and the data generating processes to decide which model matches the physical nature of the data and the application well. And within these, I still have to restrict my model complexity.
