Yes, this would be a violation, as the test data for folds 2-10 of the outer cross-validation would have been part of the training data for fold 1, which was used to determine the values of the kernel and regularisation parameters. This means that some information about the test data has potentially leaked into the design of the model, which can give an optimistic bias to the performance evaluation. That bias is largest for models that are very sensitive to the setting of the hyper-parameters (i.e. it most strongly favours models with an undesirable feature).
This bias is likely to be strongest for small datasets, such as this one, because the variance of the model selection criterion is largest for small datasets; that encourages over-fitting of the model selection criterion, so more information about the test data can leak through.
I wrote a paper on this a year or two ago, as I was rather startled by the magnitude of the bias that deviations from full nested cross-validation can introduce, which can easily swamp the difference in performance between classifier systems. The paper is "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation"
Gavin C. Cawley, Nicola L. C. Talbot; JMLR 11(Jul):2079-2107, 2010.
Essentially, tuning the hyper-parameters should be considered an integral part of fitting the model, so each time you train the SVM on a new sample of data, independently re-tune the hyper-parameters for that sample. If you follow that rule, you probably can't go too far wrong. It is well worth the computational expense to get an unbiased performance estimate, as otherwise you run the risk of drawing the wrong conclusions from your experiment.
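For concreteness, here is a minimal sketch of that nested procedure in Python with scikit-learn; the dataset, the SVM parameter grid and the fold counts are illustrative assumptions, not details from the question.

```python
# Nested cross-validation sketch: hyper-parameter tuning is repeated inside
# every outer training fold, so the outer test folds only ever score the model.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # toy data

# Inner loop: re-tune kernel and regularisation parameters on each outer
# training sample, treating tuning as part of fitting the model.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
tuned_svm = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
    cv=inner_cv,
)

# Outer loop: each outer test fold is used only for evaluation, never for
# choosing the hyper-parameters.
outer_cv = KFold(n_splits=10, shuffle=True, random_state=2)
scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(scores.mean(), scores.std())
```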
Question 1: local prediction & cross validation
Looking for nearby cases and up-weighting them for prediction is referred to as local modelling or local prediction.
For the proper way to do cross validation, remember that for each fold you use only the training cases, and then do with the test cases exactly what you would do for the prediction of a new unknown case.
I'd recommend viewing the calculation of $X_{11}$ as part of the prediction. E.g. in a two-level model consisting of an $n$ nearest neighbours step plus a second-level model:
- For each of the training cases, find the $n$ nearest neighbours and calculate $X_{11}$
- Calculate the "2nd level" model based on $X_1, ..., X_{11}$.
So for prediction of a case $X_{new}$, you
- find the $n$ nearest neighbours and calculate the $X_{11}$ for the new case
- then calculate the prediction of the 2nd level model.
You use exactly this prediction procedure to predict the test cases in the cross validation.
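As a rough illustration, here is a sketch of that two-level procedure in Python/scikit-learn. It assumes, purely for illustration, that $X_{11}$ is the mean response of the $n$ nearest neighbours and that the 2nd level model is a logistic regression; neither detail comes from the question.

```python
# Two-level model inside a cross validation: the neighbour-derived feature
# X_11 is computed from training cases only, and test cases are treated
# exactly like new unknown cases.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # toy data
n_neighbours = 5

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    X_tr, y_tr, X_te = X[train_idx], y[train_idx], X[test_idx]

    # Level 1 (training): neighbours are searched among training cases only;
    # each case's own row is dropped, hence n_neighbours + 1.
    nn = NearestNeighbors(n_neighbors=n_neighbours).fit(X_tr)
    idx_tr = nn.kneighbors(X_tr, n_neighbors=n_neighbours + 1,
                           return_distance=False)[:, 1:]
    x11_tr = y_tr[idx_tr].mean(axis=1)

    # Level 2: fit the "2nd level" model on X_1, ..., X_10 plus X_11.
    model2 = LogisticRegression(max_iter=1000)
    model2.fit(np.column_stack([X_tr, x11_tr]), y_tr)

    # Prediction of the held-out cases repeats exactly the same two steps.
    idx_te = nn.kneighbors(X_te, return_distance=False)
    x11_te = y_tr[idx_te].mean(axis=1)
    preds = model2.predict(np.column_stack([X_te, x11_te]))
```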
Question 2: combining predictions
> random forest tends to overfit on training data set
Usually, random forest will overfit only in situations where you have a hierarchical/clustered data structure that creates a dependence between (some) rows of your data.
Boosting is more prone to overfitting because of the iteratively weighted average (as opposed to the simple average of the random forest).
I have not yet completely understood your question (see comment).
But here's my guess:
I assume you want to find the optimal weight for combining the random forest prediction and the boosted prediction, i.e. a linear model of those two models.
(I don't see how you could use the individual trees within those ensemble models because the trees will totally change between the splits). This again amounts to a 2 level model (or 3 levels if combined with the approach of question 1).
The general answer here is that whenever you do a data-driven model or hyperparameter optimization (e.g. optimize the weights for random forest prediction and gradient boosted prediction by test/cross validation results), you need to do an independent validation to assess the real performance of the resulting model. Thus you need either yet another independent test set, or a so-called nested or double cross validation.
- So the 1st approach would not work unless you derive the weights from the training data.
- As you point out for the 2nd approach, having more and more levels of cross validation needs huge sample sizes to start with.
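To make the need for an independent validation concrete, here is a hedged sketch in Python/scikit-learn of the 2-level procedure for the weights: the mixing weight is chosen only from inner, out-of-fold predictions on the training part, and a held-out set is touched once for assessment. The data, the regression setting, the models and the weight grid are all illustrative assumptions.

```python
# Choose the RF/GBM mixing weight on inner (out-of-fold) predictions only,
# then assess the weighted combination once on a held-out set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

rf = RandomForestRegressor(random_state=0)
gbm = GradientBoostingRegressor(random_state=0)

# Inner level: out-of-fold predictions on the training part only.
rf_oof = cross_val_predict(rf, X_tr, y_tr, cv=5)
gbm_oof = cross_val_predict(gbm, X_tr, y_tr, cv=5)
grid = np.linspace(0, 1, 21)
w = grid[np.argmin([mean_squared_error(y_tr, wi * rf_oof + (1 - wi) * gbm_oof)
                    for wi in grid])]

# Outer level: refit both models on all training data and evaluate the
# weighted combination once on the held-out data.
rf.fit(X_tr, y_tr)
gbm.fit(X_tr, y_tr)
mse = mean_squared_error(y_te, w * rf.predict(X_te) + (1 - w) * gbm.predict(X_te))
print(w, mse)
```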
I'd recommend a different approach here: try to cut down the number of splits you need as far as possible, by doing as few data-driven hyperparameter calculations or optimizations as possible. There can be no question about the need to validate the final model. But you may be able to show that no inner splitting is needed if you can show that the models you try to stack are not overfit. In addition, this would remove the need to stack at all:
Ensemble models only help if the underlying individual models suffer from variance, i.e. are unstable. (Or if they are biased in opposing directions, so the ensemble would roughly cancel the individual biases. I suspect that this is not the case here, assuming that your GBM uses trees like the RF.)
As for the instability, you can measure it easily by repeated (aka iterated) cross validation (see e.g. this answer). If this does not point to substantial variance in the prediction of the same case by models built on slightly varying training data (i.e. if your RF and GBM are stable), producing an ensemble of the ensemble models is not going to help.
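A minimal sketch of that stability check, assuming a random forest and per-case predicted probabilities as the quantity whose spread you inspect (toy data, arbitrary number of repeats):

```python
# Repeated (iterated) cross validation: how much does the prediction for the
# same case vary across models built on slightly different training data?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
n_splits, n_repeats = 5, 10
cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=1)

# One predicted probability per case and per repeat.
probs = np.empty((len(y), n_repeats))
for i, (train_idx, test_idx) in enumerate(cv.split(X)):
    model = RandomForestClassifier(random_state=i).fit(X[train_idx], y[train_idx])
    probs[test_idx, i // n_splits] = model.predict_proba(X[test_idx])[:, 1]

# A small per-case standard deviation across repeats means the model is
# stable, and a further ensemble of the ensembles is unlikely to help.
print(probs.std(axis=1).mean())
```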
Best Answer
You're describing a scenario in which you use k-fold cross validation on the training set; the held-out fold plays the role of the validation set and is not the test set.
Use that validation step to choose the hyper-parameters that give the best validation performance, then evaluate the final model once on the test set.
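A minimal sketch of that workflow, using cross validation on the training set as the validation step and touching the test set only once (the data and the SVM grid are illustrative):

```python
# Select the hyper-parameter by cross validation on the training set, then
# report performance once on the untouched test set.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1)

# Validation: k-fold cross validation on the training set picks C.
best_C = max([0.1, 1, 10, 100],
             key=lambda C: cross_val_score(SVC(C=C), X_train, y_train, cv=5).mean())

# Test: refit on all the training data and evaluate the final model once.
final_model = SVC(C=best_C).fit(X_train, y_train)
print(best_C, accuracy_score(y_test, final_model.predict(X_test)))
```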