I didn't see the lecture, so I can't comment on what was said.
My $0.02: if you want good estimates of performance from resampling, you should really do all of the operations during resampling rather than beforehand. This is true of feature selection [1] as well as of non-trivial pre-processing operations like PCA. If a step adds uncertainty to the results, include it in the resampling.
Think about principal component regression: PCA followed by linear regression on some of the components. PCA estimates its parameters (with noise), and the number of components must also be chosen (different choices give different results => more noise).
Say we used 10-fold CV with scheme 1:
conduct PCA
pick the number of components
for each fold:
    split data
    fit linear regression on the 90% used for training
    predict the 10% held out
end
or scheme 2:
for each fold:
    split data
    conduct PCA on the 90% used for training
    pick the number of components
    fit linear regression
    predict the 10% held out
end
It should be clear that the second approach produces error estimates that reflect the uncertainty from the PCA, the selection of the number of components, and the linear regression. In effect, the CV in the first scheme has no idea what preceded it.
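To make scheme 2 concrete, here is a minimal R sketch on simulated data; the 90% variance threshold for picking the number of components and the 10 folds are arbitrary illustrative choices, not a recommendation:
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- X[, 1] + rnorm(n)
folds <- sample(rep(1:10, length.out = n))
errs <- numeric(10)
for (k in 1:10) {
  train <- folds != k
  pca <- prcomp(X[train, ], center = TRUE, scale. = TRUE)            # PCA on the training 90% only
  ncomp <- which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= 0.9)[1]     # pick the number of components inside the fold
  Z.train <- as.data.frame(pca$x[, 1:ncomp, drop = FALSE])
  fit <- lm(y[train] ~ ., data = Z.train)
  Z.test <- as.data.frame(predict(pca, X[!train, ])[, 1:ncomp, drop = FALSE])
  errs[k] <- mean((y[!train] - predict(fit, newdata = Z.test))^2)    # predict the held-out 10%
}
mean(errs)  # error estimate that reflects the noise from PCA, component selection and the regression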
I'm guilty of not always doing all the operations w/in resampling, but only when I don't really care about performance estimates (which is unusual).
Is there much difference between the two schemes? It depends on the data and the pre-processing. If you are only centering and scaling, probably not. If you have a ton of data, probably not. As the training set size goes down, the risk of getting poor estimates goes up, especially if n is close to p.
I can say with certainty from experience that not including supervised feature selection within resampling is a really bad idea (unless the training set is large). I don't see why pre-processing would be immune to this effect, at least to some degree.
@mchangun: I think that the number of components is a tuning parameter and you would probably want to pick it using performance estimates that are generalizable. You could automatically pick K such that at least X% of the variance is explained and include that process within resampling so we account for the noise in that process.
Max
[1] Ambroise, C., & McLachlan, G. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences, 99(10), 6562–6566.
The syntax for cv.glm is clouding the issue here.
In general, one divides the data up into $k$ folds. The first fold is used as the test data, while the remaining $k-1$ folds are used to build the model. We evaluate the model's performance on the first fold and record it. This process is repeated until each fold is used once as test data and $k-1$ times as training data. There's no need to fit a model to the entire data set.
However, cv.glm is a bit of a special case. If you look at the documentation for cv.glm, you do need to fit an entire model first. Here's the example at the very end of the help text:
require('boot')
data(mammals, package="MASS")
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
(cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
The 4th line does leave-one-out cross-validation (each fold contains one example), while the last line performs a 6-fold cross-validation.
This sounds problematic: using the same data for training and testing is a sure-fire way to bias your results, but it is actually okay. If you look at the source (in bootfuns.q, starting at line 811), the overall model is not used for prediction. The cross-validation code just extracts the formula and other fitting options from the model object and reuses them, which is fine*; the cross-validation itself is then done in the normal leave-a-fold-out way.
It outputs a list, and the delta component contains two estimates of the cross-validated prediction error. The first is the raw prediction error (according to your cost function, or the average squared error if you didn't provide one); the second is adjusted to reduce the bias that comes from using K-fold rather than leave-one-out cross-validation. The help text has a citation if you care about why/how. These are the values I would report in my manuscript/thesis/email-to-the-boss and what I would use to build an ROC curve.
* I say fine, but it is annoying to fit an entire model just to initialize something. You might think you could do something clever like
my.model <- glm(log(brain) ~ log(body), data = mammals[1:5, ])
cv.err <- cv.glm(mammals, my.model)$delta
but it doesn't actually work, because it uses the $y$ values from the overall model instead of the data argument to cv.glm, which is silly. The entire function is less than fifty lines, so you could also just roll your own, I guess.
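For example, a minimal hand-rolled K-fold CV for the same model might look like the sketch below; this is just an illustration (squared-error cost, no bias adjustment), not a drop-in replacement for cv.glm:
data(mammals, package = "MASS")
K <- 6
set.seed(1)
folds <- sample(rep(1:K, length.out = nrow(mammals)))
errs <- numeric(K)
for (k in 1:K) {
  fit <- glm(log(brain) ~ log(body), data = mammals[folds != k, ])      # fit on the other K-1 folds
  test <- mammals[folds == k, ]
  errs[k] <- mean((log(test$brain) - predict(fit, newdata = test))^2)   # squared error on the held-out fold
}
mean(errs)  # raw K-fold estimate of prediction error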
Best Answer
Let's look at three different approaches:
In the simplest scenario, you would collect one dataset and train your model via cross-validation to create your best model. You would then collect another, completely independent dataset and test your model. However, this scenario is not possible for many researchers given time or cost constraints.
If you have a sufficiently large dataset, you would want to split off part of your data and set it aside (completely untouched by the training). This simulates a completely independent dataset: even though the held-out samples come from the same dataset, the model training takes no information from them. You would then build your model on the remaining training samples and test it on the left-out samples.
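As a rough sketch in R (the 80/20 split and the mydata name are placeholders for illustration, not a prescription):
set.seed(1)
n <- nrow(mydata)                             # 'mydata' is a placeholder for your own data frame
test_idx <- sample(n, size = round(0.2 * n))
train <- mydata[-test_idx, ]                  # used for all model building and CV-based tuning
test  <- mydata[test_idx, ]                   # left untouched until the final evaluation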
If you have a smaller dataset, you may not be able to afford to simply ignore a chunk of your data for model building. In that case, validation is performed on every fold (e.g., k-fold CV) and your validation metric is aggregated across the folds.
To more directly answer your question: yes, you can just do cross-validation on your full dataset. You can then use the predicted and actual classes to evaluate your model's performance by whatever metric you prefer (accuracy, AUC, etc.).
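For instance, here is a sketch of pooling the out-of-fold predictions and scoring them; mydata and its outcome column y are placeholders, and the logistic regression with a 0.5 cutoff is an arbitrary illustrative choice:
K <- 10
set.seed(1)
folds <- sample(rep(1:K, length.out = nrow(mydata)))
pred <- factor(rep(NA, nrow(mydata)), levels = levels(mydata$y))
for (k in 1:K) {
  fit <- glm(y ~ ., data = mydata[folds != k, ], family = binomial)     # train on the other K-1 folds
  p <- predict(fit, newdata = mydata[folds == k, ], type = "response")
  pred[folds == k] <- levels(mydata$y)[1 + (p > 0.5)]                   # out-of-fold class predictions
}
mean(pred == mydata$y)  # cross-validated accuracy; swap in AUC or any other metric you prefer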
That said, you still probably want to look into repeated cross-validation to evaluate the stability of your model. Some good answers regarding this are here on internal vs. external CV and here on the number of repeats.