The syntax for cv.glm is clouding the issue here.
In general, one divides the data up into $k$ folds. The first fold is used as the test data, while the remaining $k-1$ folds are used to build the model. We evaluate the model's performance on the first fold and record it. This process is repeated until each fold is used once as test data and $k-1$ times as training data. There's no need to fit a model to the entire data set.
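For concreteness, here's a bare-bones version of that generic procedure in R (using the built-in mtcars data and a plain lm fit, purely as a toy example of my own, nothing to do with cv.glm yet):

# plain k-fold CV by hand, just to show the fold mechanics
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # a fold label for each row
errs <- sapply(1:k, function(j) {
  fit <- lm(mpg ~ wt, data = mtcars[folds != j, ])    # train on the other k-1 folds
  test <- mtcars[folds == j, ]
  mean((test$mpg - predict(fit, test))^2)             # evaluate on the held-out fold
})
mean(errs)   # cross-validated estimate of the prediction error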
However, cv.glm is a bit of a special case. If you look at the documentation for cv.glm, you do need to fit an entire model first. Here's the example at the very end of the help text:
require('boot')
data(mammals, package="MASS")
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
(cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
The fourth line does leave-one-out cross-validation (each fold contains a single observation), while the last line performs 6-fold cross-validation.
This sounds problematic: using the same data for training and testing is a sure-fire way to bias your results, but it is actually okay here. If you look at the source (in bootfuns.q, starting at line 811), the overall model is not used for prediction. The cross-validation code just extracts the formula and other fitting options from the model object and reuses those for the cross-validation, which is fine,* and then the cross-validation is done in the normal leave-a-fold-out sort of way.
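If you want to convince yourself, the kind of thing it pulls out of the fitted object is easy to inspect (illustrative only; the actual extraction in the source is a bit more involved):

formula(mammals.glm)   # log(brain) ~ log(body)
family(mammals.glm)    # gaussian family, identity link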
It outputs a list, and the delta component contains two estimates of the cross-validated prediction error. The first is the raw prediction error (according to your cost function, or the average squared error if you didn't provide one); the second is adjusted to reduce the bias from not doing leave-one-out cross-validation. The help text has a citation, if you care about why/how. These are the values I would report in my manuscript/thesis/email-to-the-boss, and what I would use to build an ROC curve.
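For example, if you'd rather score the folds with mean absolute error, you can pass your own cost function (its first argument gets the observed responses, the second the predictions; mae below is just a name I made up):

mae <- function(y, yhat) mean(abs(y - yhat))
(cv.err.mae <- cv.glm(mammals, mammals.glm, cost = mae, K = 6)$delta)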
* I say fine, but it is annoying to fit an entire model just to initialize something. You might think you could do something clever like
my.model <- glm(log(brain) ~ log(body), data = mammals[1:5, ])
cv.err <- cv.glm(mammals, my.model)$delta
but it doesn't actually work, because it uses the $y$ values from the overall model instead of the data argument to cv.glm, which is silly. The entire function is fewer than fifty lines, so you could also just roll your own, I guess.
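In case you do want to roll your own, here's roughly what a minimal version could look like - one that takes the formula directly, so you don't need the throwaway fit. This is my own sketch, not the boot code: it just returns a plain average of the per-fold errors (i.e. the raw, unadjusted estimate).

my.cv.glm <- function(data, formula, K = nrow(data),
                      cost = function(y, yhat) mean((y - yhat)^2)) {
  folds <- sample(rep(1:K, length.out = nrow(data)))   # random fold labels
  errs <- sapply(1:K, function(j) {
    fit <- glm(formula, data = data[folds != j, ])     # fit on the other folds
    test <- data[folds == j, ]
    cost(model.response(model.frame(formula, test)),   # observed responses in fold j
         predict(fit, test))                           # predictions for fold j
  })
  mean(errs)
}
my.cv.glm(mammals, log(brain) ~ log(body), K = 6)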
A similar thing has already been discussed in the question:
Is epoch optimization in CV with constant mini-batch size even possible?
To summarize the result: you should probably keep a few samples aside and use them as a validation set. The benefit of knowing whether your model is still improving and not yet overfitting will probably outweigh the benefit of having a few more samples for training.
Also, don't forget that if you change the size of the training set, using "epoch count" stops making sense (see the above thread).
Alternatively, see also OAA mentioned in this answer.
Best Answer
Yes, this is most probably related: boosting re-weights models based on their performance, so any kind of performance measurement that is done during the weighting process is part of the model training, and not independent of the training data.
That is, I assume the cross validation you refer to is part of the boosting - if you are talking about cross validation of 5 completely trained boosted models, we'll have to dig deeper for the reason.
In general, if cross validation is consistently and significantly (do you have enough test cases to actually distinguish cross validation and test results in a statistically sound fashion?) too optimistic, this is a sign that there is some problem in the cross validation procedure. Maybe the most typical problem is a data leak between testing and training cases.
Such a leak can happen, e.g., if the data are clustered/hierarchical: something like (almost) repeated measurements, or any other confounding factor that links some cases more closely together than others (for my data, usually many measurements of one patient, measurements of solutions from the same stock solution, or measurements taken on the same day), while the data the model will actually be applied to are new clusters.
One way of dealing with that is to make sure the splitting for the cross validation happens at the highest level of this data hierarchy. Many off-the-shelf classifiers do not offer this. In that case, it may be better to stay away from aggressively optimizing methods (such as boosting), as they tend to overfit badly. A symptom of that would be that the internal performance estimate of the boosting algorithm is far too optimistic even compared with the cross validation.
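For illustration, here's one way to do the fold assignment at the cluster level in R (dat and its patient column are hypothetical placeholders for your own data):

# assign whole patients - not individual rows - to folds
set.seed(1)
k <- 5
patients <- unique(dat$patient)
patient.fold <- sample(rep(1:k, length.out = length(patients)))
fold <- patient.fold[match(dat$patient, patients)]  # same fold for all rows of a patient
for (j in 1:k) {
  train <- dat[fold != j, ]
  test  <- dat[fold == j, ]
  # fit the model on 'train' and evaluate on 'test'; no patient appears in both
}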
The other option would be to implement random forest / boosting from CART plus a resampling procedure that obeys the data structure.
Update 2: what to do with clustered data?
So far, I have worked with data where knowledge about the data-generating process and the application makes it possible to identify important potential causes of a clustered data structure (e.g. patient data may be subject to between-patient variance) - you can deal with this type of clustering as described above.
In addition, you may try a cluster analysis to see whether there are groups within the data. Such a finding may influence the choice of model. However, unless you can trace the clusters to some "cause" (e.g. the data turn out to be grouped by day, individual, or the like), I'm not sure how to deal with that in the validation: splitting by groups found within the data is anything but independent. Still, it may be worthwhile to check how the predictive ability deteriorates for (which) out-of-training groups of the data.