You have indeed correctly described the way to work with cross-validation. In fact, you are 'lucky' to have a reasonable validation set at the end, because often cross-validation is used to optimize a model, but no "real" validation is done afterwards.
As @Simon Stelling said in his comment, cross-validation will lead to lower estimated errors (which makes sense, because you are constantly reusing the data). Fortunately this holds for all models, so, barring catastrophe (i.e. the errors are only reduced slightly for a "bad" model and more for the "good" model), selecting the model that performs best on a cross-validated criterion will typically also be the best "for real".
A method that is sometimes used to correct somewhat for these lower errors, especially if you are looking for parsimonious models, is to select the smallest model/simplest method for which the cross-validated error is within one SD of the (cross-validated) optimum. Like cross-validation itself, this is a heuristic, so it should be used with some care (if this is an option: make a plot of your errors against your tuning parameters; this will give you some idea of whether the results are acceptable).
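If it helps, here is a minimal sketch of that heuristic in Python with scikit-learn (an assumption on my part; `Lasso`, the `alpha` grid, and the synthetic data are only placeholders standing in for "model complexity" in your setting):

```python
# Sketch of the "smallest model within one SD of the CV optimum" heuristic.
# Assumes scikit-learn; Lasso's alpha plays the role of the tuning parameter
# (larger alpha = simpler/sparser model).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=400, n_features=30, noise=10.0, random_state=0)

alphas = np.logspace(-3, 1, 20)
grid = GridSearchCV(Lasso(max_iter=10000), {"alpha": alphas},
                    scoring="neg_mean_squared_error", cv=10)
grid.fit(X, y)

mean_err = -grid.cv_results_["mean_test_score"]   # CV error for each alpha
sd_err = grid.cv_results_["std_test_score"]       # SD over the 10 folds
# (the classical "one standard error" rule would divide this SD by sqrt(10))

best = np.argmin(mean_err)
within_one_sd = np.where(mean_err <= mean_err[best] + sd_err[best])[0]
simplest = within_one_sd[np.argmax(alphas[within_one_sd])]   # simplest acceptable model

print("alpha minimizing CV error:", alphas[best])
print("alpha chosen by the rule :", alphas[simplest])

# plotting the CV error against the tuning parameter, as suggested above:
# import matplotlib.pyplot as plt
# plt.errorbar(alphas, mean_err, yerr=sd_err); plt.xscale("log"); plt.show()
```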
Given the downward bias of the errors, it is important not to publish the errors or other performance measures from the cross-validation without mentioning that they come from cross-validation (although, truth be told, I have seen too many publications that don't mention that the performance measure was obtained by checking the performance on the original dataset either --- so mentioning cross-validation actually makes your results worth more). For you, this will not be an issue, since you have a validation set.
A final warning: if your model fitting results in some close competitors, it is a good idea to look at their performance on your validation set afterwards, but do not base your final model selection on that: at best you can use this to soothe your conscience, but your "final" model must have been picked before you ever look at the validation set.
With respect to your second question: I believe Simon has given you all the answers you need in his comment, but to complete the picture: as so often, it is the bias-variance trade-off that comes into play. If you know that, on average, you will reach the correct result (unbiasedness), the price is typically that each of your individual calculations may lie pretty far from it (high variance). Unbiasedness used to be the nec plus ultra; nowadays a (small) bias is at times accepted (so you no longer know that the average of your calculations equals the correct result) if it results in lower variance. Experience has shown that 10-fold cross-validation strikes an acceptable balance. For you, the bias would only be an issue for your model optimization, since you can afterwards estimate the criterion (unbiasedly) on the validation set. As such, there is little reason not to use cross-validation.
Cross validation with k folds means you have to split your data set into k disjoint groups. In your case, for 10 folds, you split your data set into 10 disjoint groups of 400 samples each ($G_i$ with $i$ from 1 to 10). Usually the groups should have roughly the same size.
Now do the following:
- Train your classifier on $Train_1 = G_2 \cup G_3 \cup \dots \cup G_{10}$ and test it on $Test_1 = G_1$. Save the test results for later use.
- Train your classifier on $Train_2 = G_1 \cup G_3 \cup \dots \cup G_{10}$ and test it on $Test_2 = G_2$, and save the results for later use.
- Repeat for the remaining 8 folds and collect the results (a sketch of this loop follows the list).
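Here is a minimal sketch of that loop in Python with scikit-learn (an assumption on my part; the classifier and the synthetic 20-class data set are only placeholders for your problem):

```python
# Minimal sketch of the 10-fold procedure above, assuming scikit-learn;
# the classifier and the synthetic 20-class data set are only placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           n_classes=20, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)   # the groups G_1, ..., G_10
predictions = np.empty_like(y)

for train_idx, test_idx in kf.split(X):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])               # train on the union of the other 9 groups
    predictions[test_idx] = clf.predict(X[test_idx])  # test on the held-out group, save results

# every instance has been classified exactly once, so the whole set can be scored
print("10-fold CV accuracy:", accuracy_score(y, predictions))
```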
Now you know, for each instance of your data set, how it was classified, since the union of all $Test_i$ is the original data set (each group $G_i$ is tested exactly once). You can then measure the errors however you like.
Now there are a couple of things to which I believe you have to pay some attention. You said you have 20 target classes and 4000 samples. I do not know your specific problem, but that does not seem like plenty of data. So I believe it is better to do multiple cross validations and average the results; this decreases the chance of getting overly biased results.
Another thing to pay attention to is how you build your folds. You might use simple random sampling, but I believe it is better to use a stratified random procedure, which increases the chances of getting a usable CV estimate.
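A rough sketch of both suggestions combined (repeated runs and stratified folds), again assuming scikit-learn; the classifier and data are placeholders:

```python
# Sketch: repeated, stratified 10-fold CV, averaging over the repetitions.
# Assumes scikit-learn; the classifier and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           n_classes=20, random_state=0)

# stratification keeps the 20 class proportions roughly equal in every fold;
# repeating the whole 10-fold procedure 5 times and averaging stabilises the estimate
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("mean accuracy over 5 x 10 folds:", scores.mean())
print("spread across individual folds :", scores.std())
```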
You might also consider bootstrap testing if you do not have enough instances for a 10-fold cross validation with stratified sampling.
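If you go the bootstrap route, one common variant is to train on bootstrap resamples and evaluate on the out-of-bag instances; a rough sketch under the same assumptions (the 200 resamples and the small data set are arbitrary placeholder choices):

```python
# Sketch of out-of-bag bootstrap error estimation as an alternative to k-fold CV
# when data are scarce. Assumes scikit-learn; the classifier, the data, and the
# number of resamples (200) are arbitrary placeholder choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=20, random_state=0)
rng = np.random.default_rng(0)

oob_scores = []
for _ in range(200):
    boot = rng.integers(0, len(y), size=len(y))    # draw a bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(len(y)), boot)    # instances never drawn = out-of-bag test set
    clf = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
    oob_scores.append(accuracy_score(y[oob], clf.predict(X[oob])))

print("out-of-bag bootstrap accuracy estimate:", np.mean(oob_scores))
```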
Best Answer
Background
Consider the following definitions:
- a data set $D$ of $n$ i.i.d. examples $z_i$ drawn from an unknown distribution $P$;
- a learning algorithm $A$, which maps a data set $D$ to a prediction function $f = A(D)$;
- a loss function $L(f, z)$, measuring the loss of $f$ on an example $z$.
We can then define two measures of interest:
- Prediction error: the expected loss on future test examples, $$PE(D) = E[L(f,z)],$$ where the expectation is taken with respect to $z$ sampled from $P$, and $f = A(D)$ is held fixed.
- Expected performance error: a more general measure, the expected loss over training sets of size $n$ sampled from $P$, $$EPE(n) = E[L(A(D),z)],$$ where the expectation is taken with respect to $D$ sampled from $P$ and $z$ independently sampled from $P$ as well.
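To make the distinction concrete, here is a small simulation sketch (the data-generating distribution $P$, the learner $A$ as ordinary least squares, and squared-error loss are my own assumptions, purely for illustration): $PE(D)$ holds one training set fixed and averages the loss over fresh $z \sim P$, whereas $EPE(n)$ additionally averages over training sets of size $n$:

```python
# Sketch: Monte Carlo illustration of PE(D) versus EPE(n).
# The distribution P, the learner A (ordinary least squares) and the
# squared-error loss L are assumptions made purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50  # training-set size

def sample(size):
    """Draw i.i.d. examples z = (x, y) from the assumed distribution P."""
    x = rng.uniform(-1, 1, size=(size, 1))
    y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.3, size=size)
    return x, y

def pe(train_x, train_y, n_test=20_000):
    """PE(D): average loss of f = A(D) over fresh z ~ P, for this fixed D."""
    f = LinearRegression().fit(train_x, train_y)        # f = A(D)
    test_x, test_y = sample(n_test)
    return np.mean((f.predict(test_x) - test_y) ** 2)   # approximates E_z[L(f, z)]

# PE(D) for one particular training set D of size n
D_x, D_y = sample(n)
print("PE(D) for one fixed D:", pe(D_x, D_y))

# EPE(n): additionally average over many training sets of size n drawn from P
print("EPE(n) estimate      :", np.mean([pe(*sample(n)) for _ in range(200)]))
```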
Cross validation estimator
In practice the data set $D$ is chunked into $K$ disjoint subsets of the same size, with $m = n / K$. Let us write $T_k$ for the $k$-th such block and $D_k$ for the training set obtained by removing the elements in $T_k$ from $D$. The cross-validation estimator is then the average, over the blocks, of the average loss on each held-out block: $$CV(D) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{m} \sum_{z_i \in T_k} L(A(D_k), z_i).$$
Once you have this cross-validation estimator $CV(D)$, you can construct bootstrap confidence intervals around it in the same way as for any other estimator: bootstrap your dataset, compute the estimator, and repeat many times...
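A rough sketch of that recipe (assuming scikit-learn and a percentile interval; the model, the data, and the 200 replicates are arbitrary choices, and accuracy, a score rather than a loss, stands in for $L$):

```python
# Sketch: percentile bootstrap confidence interval around the CV estimator CV(D).
# Assumes scikit-learn; the model, the data, and the 200 replicates are placeholders,
# and accuracy (a score rather than a loss) stands in for L.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=0)
rng = np.random.default_rng(0)

def cv_estimate(Xb, yb, K=10):
    """CV(D): average over the K blocks of the average loss on each held-out block."""
    return cross_val_score(LogisticRegression(max_iter=1000), Xb, yb, cv=K).mean()

point_estimate = cv_estimate(X, y)

boot_estimates = []
for _ in range(200):
    idx = rng.integers(0, len(y), size=len(y))     # resample the data set with replacement
    boot_estimates.append(cv_estimate(X[idx], y[idx]))

low, high = np.percentile(boot_estimates, [2.5, 97.5])
print(f"CV(D) = {point_estimate:.3f}, 95% bootstrap interval = ({low:.3f}, {high:.3f})")
```

Note that resampling with replacement puts copies of the same instance into both the training and the held-out blocks of the inner cross-validation, which is one reason the questions below matter.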
The difficulty lies in understanding what you are actually computing and how close it is to the truth... Here are a few questions:
- Is the mean of $CV$ an estimator of $PE$ or of $EPE$?
- What about the variance of $CV$? Does it inform us about the uncertainty of $PE$ or of $EPE$?
- Under what conditions is the difference $|CV(D) - PE(D)|$ bounded?
- And lastly, how does bootstrapping affect all of the above?
This is an active topic of research, and there are different views, conclusions, and proofs to take into account. The paper linked below is a good place to start if you want to go into more depth.
Sources and further reading