Solved – Do I cross-validate the entire dataset, even the validation and test set?

cross-validation, generalized linear model

I have the following dataset, where `binary_peak` is a binary response variable, along with 9 explanatory variables (also binary, not all shown).

             binary_peak H3K18Ac H3K27me3 H3K36me3
          1:           0       0        0        0
          2:           0       0        0        0
          3:           0       0        0        0
          4:           0       0        0        0
          5:           0       0        0        0
         ---
    1903462:           0       0        1        0
    1903463:           0       0        1        0
    1903464:           0       0        0        0
    1903465:           0       0        0        0
    1903466:           0       0        1        0

I am a little bit confused about the cross-validation procedure. The way I am currently doing this is to fit a model on all 1.9 million rows:

    r1 <- glm(formula = binary_peak ~ 1 + H3K18Ac + H3K27me3 + H3K36me3,
              family = binomial(link = "logit"),
              data = massive_ds)

After this, I run $K = 10$-fold cross-validation on the entire dataset again:

    library(boot)
    cv.glm(data = massive_ds, glmfit = r1, K = 10)

I believe my approach here is wrong. What I should be doing is splitting the entire dataset into two (or three?) sets:

  1. Training Set
  2. Validation Set
  3. Test Set (??)

Does this mean that when I fit my model and perform K-fold cross-validation, I should ONLY be using the training set? I was under the impression that this is exactly what K-fold cross-validation does: it breaks my entire dataset into groups, uses one group to train a model, and then applies the model to the remaining groups.

Also, how do I then apply this model to the dataset? My goal is to create ROC curves characterizing the model's accuracy, but if I am using the entire thing as a training/validation set (internally), would it be sufficient to just apply the model again to the training set?

Some background: I have data on biologically significant areas of the entire mouse genome. The genome is split into bins of 200 base pairs, and the response variable (binary) indicates whether a bin is of interest or not. Once I get confirmation that I do indeed need to split my entire dataset, I would take chr 1-6 as a training set and use the rest as a validation set.
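
Concretely, I imagine the split would look something like this (a rough sketch; it assumes a chr column in massive_ds recording each bin's chromosome, which is not shown in the excerpt above):

    # hypothetical chr column marking which chromosome each 200 bp bin belongs to
    train_chr <- paste0("chr", 1:6)
    train_ds  <- massive_ds[massive_ds$chr %in% train_chr, ]
    valid_ds  <- massive_ds[!(massive_ds$chr %in% train_chr), ]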

Best Answer

The syntax for cv.glm is clouding the issue here.

In general, one divides the data up into $k$ folds. The first fold is used as the test data, while the remaining $k-1$ folds are used to build the model. We evaluate the model's performance on the first fold and record it. This process is repeated until each fold is used once as test data and $k-1$ times as training data. There's no need to fit a model to the entire data set.
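
To make that concrete, a bare-bones version of the procedure looks roughly like the sketch below, written against the asker's massive_ds and three of its predictors (cv.glm does this bookkeeping for you, so treat this as an illustration rather than a recommendation to hand-roll it):

    set.seed(1)
    k <- 10
    # randomly assign each row to one of k folds
    fold <- sample(rep(1:k, length.out = nrow(massive_ds)))

    err <- numeric(k)
    for (i in 1:k) {
      train <- massive_ds[fold != i, ]   # k - 1 folds for fitting
      test  <- massive_ds[fold == i, ]   # held-out fold for evaluation
      fit <- glm(binary_peak ~ H3K18Ac + H3K27me3 + H3K36me3,
                 family = binomial(link = "logit"), data = train)
      p <- predict(fit, newdata = test, type = "response")
      err[i] <- mean((p > 0.5) != test$binary_peak)   # misclassification rate
    }
    mean(err)   # cross-validated estimate of prediction error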

However, cv.glm is a bit of a special case. If you look at the documentation for cv.glm, you do need to fit an entire model first. Here's the example at the very end of the help text:

require('boot')
data(mammals, package="MASS")
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
(cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)

The 4th line does a leave-one-out cross-validation (each fold contains one example), while the last line performs a 6-fold cross-validation.

This sounds problematic: using the same data for training and testing is a sure-fire way to bias your results, but it is actually okay here. If you look at the source (in bootfuns.q, starting at line 811), the overall model is not used for prediction. The cross-validation code just extracts the formula and other fitting options from the model object and reuses them for the cross-validation, which is fine*; the cross-validation itself is then done in the normal leave-a-fold-out way.
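
You can see what it has to work with by inspecting the fitted object yourself, for example:

    formula(mammals.glm)   # the model formula
    mammals.glm$call       # the original glm() call, with whatever arguments were supplied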

cv.glm outputs a list, and its delta component contains two estimates of the cross-validated prediction error. The first is the raw prediction error (according to your cost function, or the average squared error if you didn't provide one), and the second adjusts it to reduce the bias from not doing leave-one-out cross-validation instead. The help text has a citation if you care about why/how. These are the values I would report in my manuscript/thesis/email-to-the-boss, and what I would use to build an ROC curve.
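
One practical note for a binary response like binary_peak: the default cost is average squared error, so you will probably want to supply an explicit cost function, for example a misclassification rate (a sketch, reusing the r1 fit from the question):

    library(boot)
    # cost(r, pi): r = observed 0/1 response, pi = cross-validated predicted probability
    cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

    cv.out <- cv.glm(data = massive_ds, glmfit = r1, cost = cost, K = 10)
    cv.out$delta   # raw and bias-adjusted cross-validated misclassification rates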


* I say fine, but it is annoying to fit an entire model just to initialize something. You might think you could do something clever like

my.model <- glm(log(brain) ~ log(body), data = mammals[1:5, ])
cv.err <- cv.glm(mammals, my.model)$delta

but it doesn't actually work, because cv.glm uses the $y$ values stored in the overall model instead of those in its data argument, which is silly. The entire function is less than fifty lines, so you could also just roll your own, I guess.