Yes, this would be a violation: the test data for folds 2-10 of the outer cross-validation would have been part of the training data for fold 1, which was used to determine the values of the kernel and regularisation parameters. This means that some information about the test data has potentially leaked into the design of the model, giving an optimistic bias to the performance evaluation. The bias is most optimistic for models that are very sensitive to the setting of the hyper-parameters (i.e. it most strongly favours models with an undesirable property).
This bias is likely to be strongest for small datasets, such as this one, because the variance of the model selection criterion is largest for small datasets. That encourages over-fitting of the model selection criterion, which in turn means more information about the test data can leak through.
I wrote a paper on this a year or two ago, as I was rather startled by the magnitude of the bias that deviations from full nested cross-validation can introduce, which can easily swamp the difference in performance between classifier systems. The paper is "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation"
Gavin C. Cawley and Nicola L. C. Talbot, JMLR 11(Jul):2079-2107, 2010.
Essentially tuning the hyper-parameters should be considered an integral part of fitting the model, so each time you train the SVM on a new sample of data, independently retune the hyper-parameters for that sample. If you follow that rule, you probably can't go too far wrong. It is well worth the computational expense to get an unbiased performance estimate, as otherwise you run the risk of drawing the wrong conclusions from your experiment.
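Here is a minimal sketch of that rule, assuming a scikit-learn set-up; the data, the parameter grid, and the fold counts are only placeholders. The inner `GridSearchCV` retunes C and gamma from scratch on every outer training fold, so the tuning is treated as part of model fitting and no outer test fold ever influences the chosen hyper-parameters:

```python
# Minimal nested cross-validation sketch (assumes scikit-learn is available).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# Inner loop: hyper-parameter tuning, treated as an integral part of fitting.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
tuned_svm = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)

# Outer loop: performance estimation; tuning is re-run on each training fold.
outer_cv = KFold(n_splits=10, shuffle=True, random_state=2)
scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print("Nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```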
There's nothing wrong with the (nested) algorithm presented, and in fact it would likely perform well, with decent robustness to the bias-variance problem, on different data sets. You never said, however, that the reader should assume the features you were using are optimal, so if that's unknown, there are some feature selection issues that must first be addressed.
FEATURE/PARAMETER SELECTION
A less biased approach is to never let the classifier/model come anywhere near the feature/parameter selection, since you don't want the fox (classifier, model) guarding the chickens (features, parameters). Your feature (parameter) selection method is a *wrapper*, where feature selection is bundled inside iterative learning performed by the classifier/model. In contrast, I always use a feature *filter* that employs a different method, far removed from the classifier/model, in an attempt to minimize feature (parameter) selection bias. Look up wrapping vs. filtering and selection bias in feature selection (G.J. McLachlan).
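As a rough illustration of the filter idea (not a prescription), here is a sketch assuming scikit-learn: a univariate ANOVA-F filter is re-fitted inside each training fold via a Pipeline, so the held-out fold never informs the selection; `SelectKBest` with k=10 and the SVM settings are arbitrary illustrative choices.

```python
# Filter-style feature selection kept separate from the classifier and
# re-fitted inside every training fold (sketch, assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=100, n_informative=5,
                           random_state=0)

filter_then_classify = Pipeline([
    ("filter", SelectKBest(score_func=f_classif, k=10)),  # filter, not wrapper
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
])

scores = cross_val_score(filter_then_classify, X, y, cv=5)
print("Filtered-pipeline CV accuracy: %.3f" % scores.mean())
```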
There is always a major feature selection problem, for which the solution is to invoke a method of object partitioning (folds), in which the objects are partitioned into different sets. For example, simulate a data matrix with 100 rows and 100 columns, and then simulate a binary variate (0,1) in another column -- call this the grouping variable. Next, run t-tests on each column using the binary (0,1) variable as the grouping variable. Several of the 100 t-tests will be significant by chance alone; however, as soon as you split the data matrix into two folds $\mathcal{D}_1$ and $\mathcal{D}_2$, each of which has $n=50$, the number of significant tests drops. Until you can solve this problem with your data by determining the optimal number of folds to use during parameter selection, your results may be suspect. So you'll need to establish some sort of bootstrap-bias method for evaluating predictive accuracy on the hold-out objects as a function of the varying sample sizes used in each training fold, e.g., $\pi=0.1n, 0.2n, 0.3n, 0.4n, 0.5n$ (that is, increasing sample sizes used during learning), combined with a varying number of CV folds, e.g., 2, 5, 10, etc.
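A sketch of that simulation, assuming NumPy/SciPy; the significance level and the simple half split are illustrative choices:

```python
# Pure-noise data, a random binary grouping variable, and one t-test per
# column; then the same tests on one half of the rows (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))      # 100 objects x 100 noise features
group = rng.integers(0, 2, size=100)     # random (0,1) grouping variable

def n_significant(X, group, alpha=0.05):
    """Count columns whose two-sample t-test is 'significant' at alpha."""
    p = np.array([stats.ttest_ind(col[group == 0], col[group == 1]).pvalue
                  for col in X.T])
    return int((p < alpha).sum())

print("Full data (n=100):", n_significant(X, group))  # roughly 5 by chance
half = slice(0, 50)                                    # fold D1 with n=50
print("One fold (n=50): ", n_significant(X[half], group[half]))
```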
OPTIMIZATION/MINIMIZATION
You seem to really be solving an optimization or minimization problem for function approximation, e.g., $y=f(x_1, x_2, \ldots, x_j)$, where, e.g., regression or a predictive model with parameters is used and $y$ is continuously scaled. Given this, and given the need to minimize bias in your predictions (selection bias, bias-variance, information leakage from testing objects into training objects, etc.), you might look into using CV in combination with swarm intelligence methods, such as particle swarm optimization (PSO) or ant colony optimization. PSO (see Kennedy & Eberhart, 1995) adds parameters for social and cultural information exchange among particles as they fly through the parameter space during learning (a minimal sketch follows below). Once you become familiar with swarm intelligence methods, you'll see that you can overcome a lot of biases in parameter determination. Lastly, I don't know if there is a random forest (RF; see Breiman, Machine Learning, 2001) approach for function approximation, but if there is, using RF for function approximation would alleviate 95% of the issues you are facing.
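For concreteness, here is a minimal from-scratch PSO sketch (not a library implementation) that uses cross-validated accuracy as the particle fitness while searching over $\log_{10} C$ and $\log_{10}\gamma$ of an RBF-SVM; the swarm size, iteration count, and the inertia/cognitive/social weights are assumptions chosen only for illustration.

```python
# Toy PSO over SVM hyper-parameters with CV accuracy as the fitness (sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
rng = np.random.default_rng(0)

def fitness(pos):
    """CV accuracy for a particle at pos = (log10 C, log10 gamma)."""
    clf = SVC(kernel="rbf", C=10.0 ** pos[0], gamma=10.0 ** pos[1])
    return cross_val_score(clf, X, y, cv=5).mean()

lo, hi = np.array([-2.0, -4.0]), np.array([3.0, 1.0])   # search box (log10)
n_particles, n_iters = 10, 20
pos = rng.uniform(lo, hi, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

w, c1, c2 = 0.7, 1.5, 1.5          # inertia, cognitive, social weights
for _ in range(n_iters):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("Best (log10 C, log10 gamma):", gbest, "CV accuracy:", pbest_fit.max())
```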
First of all, while I'd usually agree that hold-out does not make efficient use of the available samples and that the typical set-up is prone to the same mistakes as cross validation, repeated set validation / repeated hold-out is a resampling technique that I think is well suited to your learning curve calculation. This way, you can reflect what is going on inside the data set you have, covering the variation due to different splits (but not fully the variation you'd have to expect with a new data set of size $n$). You thus get the fine-grained control over training set size of hold-out together with resampling properties close to those of k-fold cross validation.
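A sketch of what that could look like, assuming scikit-learn: for each training-set size, draw many independent random splits with `ShuffleSplit` and summarise the hold-out performance; the sizes, the 50 repetitions, and the classifier are placeholders for your own set-up.

```python
# Learning curve via repeated hold-out / repeated set validation (sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

for n_train in (20, 40, 80, 120, 160):
    splitter = ShuffleSplit(n_splits=50, train_size=n_train,
                            test_size=len(y) - n_train, random_state=0)
    scores = cross_val_score(SVC(kernel="rbf", C=1.0, gamma="scale"),
                             X, y, cv=splitter)
    # The spread across repetitions reflects split-to-split variation, but
    # not the extra variance you'd see with an entirely new data set.
    print("n_train=%3d  acc=%.3f +/- %.3f"
          % (n_train, scores.mean(), scores.std()))
```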
However, here's a caveat for an informed decision: in the case of small-sample-size classification, the usual figures of merit (sensitivity, specificity, overall accuracy, etc.) are subject to very high testing variance. This testing variance is governed by the number of actual independent cases you have in the denominator of the calculation, and it can easily be so large that you cannot sensibly use such measured learning curves (keep in mind that "use" typically means extrapolation).
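As a rough back-of-the-envelope illustration: a proportion-type figure of merit measured on $n_{test}$ independent cases behaves binomially, so $\operatorname{Var}(\hat{p}) \approx p(1-p)/n_{test}$; with, say, 25 test cases of the relevant class and a true sensitivity of 0.9, the standard deviation of the observed sensitivity is already about $\sqrt{0.9 \cdot 0.1 / 25} \approx 0.06$.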
See our paper for details: Beleites, C.; Neugebauer, U.; Bocklitz, T.; Krafft, C.; Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33. DOI: 10.1016/j.aca.2012.11.007; accepted manuscript on arXiv: 1211.1323.