Solved – Do I cross-validate the entire dataset, even the validation and test set?

cross-validation, generalized linear model

I have the following dataset, where `binary_peak` is a binary response variable, along with 9 explanatory variables (also binary, not all shown).

             binary_peak H3K18Ac H3K27me3 H3K36me3
          1:           0       0        0        0
          2:           0       0        0        0
          3:           0       0        0        0
          4:           0       0        0        0
          5:           0       0        0        0
         ---
    1903462:           0       0        1        0
    1903463:           0       0        1        0
    1903464:           0       0        0        0
    1903465:           0       0        0        0
    1903466:           0       0        1        0

I am a little bit confused about the cross-validation procedure. The way I am currently doing this is to fit a model on all 1.9 million rows:

    r1 <- glm(formula = binary_peak ~ 1 + H3K18Ac + H3K27me3 + H3K36me3,
              family = binomial(link = "logit"),
              data = massive_ds)

After this, I run $K = 10$-fold cross-validation on the entire dataset again:

    library(boot)
    cv.glm(data = massive_ds, glmfit = r1, K = 10)

I believe my approach here is wrong. What I should be doing is splitting the entire dataset into two (or three?) sets:

  1. Training Set
  2. Validation Set
  3. Test Set (??)

Does this mean that when I fit my model and perform K-fold cross-validation, I should ONLY be using the training set? I was under the impression that this is exactly what K-fold cross-validation does: it breaks my entire dataset into groups, uses one group to train a model, and then applies the model to the remaining groups.

Also, how do I then apply this model to the dataset? My goal is to create ROC curves characterizing the model's accuracy, but if I am using the entire thing as a training/validation set (internally), would it be sufficient to just apply the model again to the training set?

Some background: I have data on biologically significant areas of the entire mouse genome. The genome is split into bins of 200 base pairs, and the response variable (binary) indicates whether a bin is of interest or not. Once I get confirmation that I do indeed need to split my entire dataset, I would take chr 1-6 as a training set and use the rest as a validation set.
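
Concretely, I imagine the split would look something like this (a rough sketch; it assumes a chr column in massive_ds recording each bin's chromosome, which is not shown in the excerpt above):

    # hypothetical chr column marking which chromosome each 200 bp bin belongs to
    train_chr <- paste0("chr", 1:6)
    train_ds  <- massive_ds[massive_ds$chr %in% train_chr, ]
    valid_ds  <- massive_ds[!(massive_ds$chr %in% train_chr), ]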

Best Answer

The syntax for cv.glm is clouding the issue here.

In general, one divides the data up into $k$ folds. The first fold is used as the test data, while the remaining $k-1$ folds are used to build the model. We evaluate the model's performance on the first fold and record it. This process is repeated until each fold is used once as test data and $k-1$ times as training data. There's no need to fit a model to the entire data set.
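
To make that concrete, a bare-bones version of the procedure looks roughly like the sketch below, written against the asker's massive_ds and three of its predictors (cv.glm does this bookkeeping for you, so treat this as an illustration rather than a recommendation to hand-roll it):

    set.seed(1)
    k <- 10
    # randomly assign each row to one of k folds
    fold <- sample(rep(1:k, length.out = nrow(massive_ds)))

    err <- numeric(k)
    for (i in 1:k) {
      train <- massive_ds[fold != i, ]   # k - 1 folds for fitting
      test  <- massive_ds[fold == i, ]   # held-out fold for evaluation
      fit <- glm(binary_peak ~ H3K18Ac + H3K27me3 + H3K36me3,
                 family = binomial(link = "logit"), data = train)
      p <- predict(fit, newdata = test, type = "response")
      err[i] <- mean((p > 0.5) != test$binary_peak)   # misclassification rate
    }
    mean(err)   # cross-validated estimate of prediction error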

However, cv.glm is a bit of a special case. If you look at the documentation for cv.glm, you do need to fit an entire model first. Here's the example at the very end of the help text:

require('boot')
data(mammals, package="MASS")
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
(cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)

The 4th line does a leave-one-out cross-validation (each fold contains one example), while the last line performs a 6-fold cross-validation.

This sounds problematic: using the same data for training and testing is a sure-fire way to bias your results, but it is actually okay here. If you look at the source (in bootfuns.q, starting at line 811), the overall model is not used for prediction. The cross-validation code just extracts the formula and other fitting options from the model object and reuses them for the cross-validation, which is fine*; the cross-validation itself is then done in the normal leave-a-fold-out way.
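
You can see what it has to work with by inspecting the fitted object yourself, for example:

    formula(mammals.glm)   # the model formula
    mammals.glm$call       # the original glm() call, with whatever arguments were supplied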

cv.glm outputs a list, and its delta component contains two estimates of the cross-validated prediction error. The first is the raw prediction error (according to your cost function, or the average squared error if you didn't provide one), and the second adjusts it to reduce the bias from not doing leave-one-out cross-validation instead. The help text has a citation if you care about why/how. These are the values I would report in my manuscript/thesis/email-to-the-boss, and what I would use to build an ROC curve.
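
One practical note for a binary response like binary_peak: the default cost is average squared error, so you will probably want to supply an explicit cost function, for example a misclassification rate (a sketch, reusing the r1 fit from the question):

    library(boot)
    # cost(r, pi): r = observed 0/1 response, pi = cross-validated predicted probability
    cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

    cv.out <- cv.glm(data = massive_ds, glmfit = r1, cost = cost, K = 10)
    cv.out$delta   # raw and bias-adjusted cross-validated misclassification rates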


* I say fine, but it is annoying to fit an entire model just to initialize something. You might think you could do something clever like

my.model <- glm(log(brain) ~ log(body), data = mammals[1:5, ])
cv.err <- cv.glm(mammals, my.model)$delta

but it doesn't actually work, because cv.glm uses the $y$ values stored in the overall model instead of those in its data argument, which is silly. The entire function is less than fifty lines, so you could also just roll your own, I guess.