There's nothing wrong with the (nested) algorithm presented; in fact, it would likely perform with decent robustness against the bias-variance problem on different data sets. You never said, however, that the reader should assume the features you are using are optimal, so if that's unknown, there are some feature selection issues that must first be addressed.
FEATURE/PARAMETER SELECTION
A less biased approach is to never let the classifier/model come close to anything remotely related to feature/parameter selection, since you don't want the fox (classifier, model) to guard the chickens (features, parameters). Your feature (parameter) selection method is a $wrapper$ - feature selection is bundled inside the iterative learning performed by the classifier/model. In contrast, I always use a feature $filter$ that employs a different method, far removed from the classifier/model, in an attempt to minimize feature (parameter) selection bias. Look up wrapping vs. filtering and selection bias during feature selection (G.J. McLachlan).
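As a rough illustration of the distinction, here is a minimal scikit-learn sketch on synthetic data; the choice of 10 features, the univariate F-test filter, and the logistic-regression wrapper are all arbitrary assumptions, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 objects, 100 features, only 5 of them informative.
X, y = make_classification(n_samples=200, n_features=100, n_informative=5, random_state=0)

# Filter: scores each feature with a univariate F-test -- the classifier is never consulted.
filter_idx = np.flatnonzero(SelectKBest(score_func=f_classif, k=10).fit(X, y).get_support())

# Wrapper: recursive feature elimination driven by the classifier's own coefficients.
wrapper_idx = np.flatnonzero(
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y).get_support())

print("filter picked: ", filter_idx)
print("wrapper picked:", wrapper_idx)
```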
There is always a major feature selection problem, for which the solution is to invoke object partitioning (folds), in which the objects are partitioned into different sets. For example, simulate a data matrix with 100 rows and 100 columns, and then simulate a binary variate (0,1) in another column -- call this the grouping variable. Next, run t-tests on each column using the binary (0,1) variable as the grouping variable. Several of the 100 t-tests will be significant by chance alone; however, as soon as you split the data matrix into two folds $\mathcal{D}_1$ and $\mathcal{D}_2$, each of which has $n=50$, the number of significant tests drops. Until you can solve this problem with your data by determining the optimal number of folds to use during parameter selection, your results may be suspect. So you'll need to establish some sort of bootstrap-bias method for evaluating predictive accuracy on the hold-out objects as a function of varying sample sizes used in each training fold, e.g., $\pi=0.1n, 0.2n, 0.3n, 0.4n, 0.5n$ (that is, increasing sample sizes used during learning), combined with a varying number of CV folds, e.g., 2, 5, 10, etc.
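The simulation described above might look roughly like this (a numpy/scipy sketch; the 5% threshold and the 50/50 split are assumptions taken from the description, and the fold comparison here counts tests that replicate in both halves):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))          # 100 objects x 100 pure-noise features
g = rng.integers(0, 2, size=100)             # random binary grouping variable

def significant_columns(X, g, alpha=0.05):
    pvals = np.array([ttest_ind(X[g == 0, j], X[g == 1, j]).pvalue for j in range(X.shape[1])])
    return np.flatnonzero(pvals < alpha)

sig_full = significant_columns(X, g)
fold1, fold2 = np.arange(0, 50), np.arange(50, 100)
sig_both = np.intersect1d(significant_columns(X[fold1], g[fold1]),
                          significant_columns(X[fold2], g[fold2]))

print(f"significant on full data          : {len(sig_full)}  (expect ~5 by chance)")
print(f"significant in both n=50 folds    : {len(sig_both)}")
```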
OPTIMIZATION/MINIMIZATION
You seem to really be solving an optimization or minimization problem for function approximation, e.g., $y=f(x_1, x_2, \ldots, x_j)$, where, say, regression or a predictive model with parameters is used and $y$ is continuously scaled. Given this, and given the need to minimize bias in your predictions (selection bias, bias-variance, information leakage from testing objects into training objects, etc.), you might look into employing CV within swarm intelligence methods, such as particle swarm optimization (PSO), ant colony optimization, etc. PSO (see Kennedy & Eberhart, 1995) adds parameters for social and cultural information exchange among particles as they fly through the parameter space during learning. Once you become familiar with swarm intelligence methods, you'll see that you can overcome a lot of biases in parameter determination. Lastly, I don't know whether there is a random forest (RF, see Breiman, Machine Learning) approach for function approximation, but if there is, using RF for function approximation would alleviate 95% of the issues you are facing.
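For orientation, here is a bare-bones PSO sketch in plain numpy. The inertia and acceleration coefficients, swarm size, and toy objective are illustrative assumptions; in the CV setting the objective would be a cross-validated error evaluated as a function of the hyperparameters:

```python
import numpy as np

def pso(objective, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0)):
    rng = np.random.default_rng(0)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))           # particle positions
    v = np.zeros_like(x)                                       # particle velocities
    pbest, pbest_val = x.copy(), np.apply_along_axis(objective, 1, x)
    gbest = pbest[np.argmin(pbest_val)]                        # best position found by the swarm
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        # Cognitive term pulls toward each particle's own best; social term toward the swarm's best.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.apply_along_axis(objective, 1, x)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)]
    return gbest, pbest_val.min()

# Toy use: minimize a quadratic; swap in a CV-error function of the model's parameters in practice.
best_x, best_val = pso(lambda z: np.sum((z - 1.0) ** 2), dim=5)
print(best_x, best_val)
```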
Your second procedure assumes you have some other feature selection algorithm (for example, stepwise regression with some stopping rule), distinct from the cross-validation. If you don't have this, you'll just have to use the first procedure (where cross-validation is the whole feature-selection algorithm).
Also, even if the second procedure is applicable, the first procedure might do better. In the second procedure, a greedy feature-selection algorithm might always pick models that are overfit to the training data. Then the CV would only let you choose among these bad models. This shouldn't happen in the first procedure.
On the other hand, if your problem does have a specialized feature-selection algorithm which is computationally efficient, then the second procedure may run much faster than the first.
If you do use the second procedure, one way to choose a final feature set is to let CV choose the model size. At every model size, the selected features may differ from split to split, but you average the test errors across all splits; this way, CV tells you which model size gives the best estimated performance. Finally, rerun your feature-selection algorithm on the full dataset, up to the size chosen by CV, and use this as the final feature set.
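A minimal sketch of that recipe (scikit-learn; the simple univariate ranking below is just a stand-in for whatever feature-selection algorithm you actually use, and the candidate sizes are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=150, n_features=40, n_informative=5, noise=5.0, random_state=0)
sizes = range(1, 16)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
errors = np.zeros((cv.get_n_splits(), len(sizes)))

for i, (tr, te) in enumerate(cv.split(X)):
    # Feature selection is redone inside each training fold (never on the test fold).
    ranking = np.argsort(f_regression(X[tr], y[tr])[0])[::-1]
    for j, k in enumerate(sizes):
        cols = ranking[:k]
        model = LinearRegression().fit(X[tr][:, cols], y[tr])
        errors[i, j] = mean_squared_error(y[te], model.predict(X[te][:, cols]))

best_k = list(sizes)[int(np.argmin(errors.mean(axis=0)))]       # size with best average test error
final_cols = np.argsort(f_regression(X, y)[0])[::-1][:best_k]   # rerun selection on the full data
print("chosen size:", best_k, "final features:", final_cols)
```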
Best Answer
Your concern is spot on, and there is a whole lot of literature on this topic.
The problem is that hyperparameter tuning with cross validation is a data-driven optimization process, and will still tend to overfit to your data set (less so than tuning by resubstitution error, but still). Trying to use the tuning cross validation results as an "independent" performance measure is in a way like eating the pie (= tuning) and keeping it (= measuring final model performance).
This does not mean that you shouldn't use cross-validation for hyperparameter tuning. It just means that you can use it for only one purpose: either to optimize, or to measure model performance for validation purposes.
The solution is that you need to do an independent validation for measuring the quality of the model obtained with the tuned hyperparameters. This is called nested or double validation. You'll find a number of questions and answers here on these topics.
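Here is a minimal nested (double) cross-validation sketch with scikit-learn; the ridge model and the grid of λ (alpha) values are placeholders for whatever model and hyperparameters you are actually tuning:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=120, n_features=30, noise=10.0, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=1)   # tuning loop
outer = KFold(n_splits=5, shuffle=True, random_state=2)   # final-validation loop

tuned = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)}, cv=inner,
                     scoring="neg_mean_squared_error")

# The outer CV only ever sees the tuned model as a black box, so its score
# estimates the performance of the whole procedure, tuning included.
outer_scores = cross_val_score(tuned, X, y, cv=outer, scoring="neg_mean_squared_error")
print("outer-CV MSE: %.1f +/- %.1f" % (-outer_scores.mean(), outer_scores.std()))
```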
Conceptually, I like to say that training includes all kinds of fancy steps to fit not only the "usual" model parameters but also to fit (auto-tune) the hyperparameters. So data-driven optimization of λ is clearly part of the model training.
As a rule of thumb you can also say that model training is everything that needs to be done before you have a ready-to-use final black-box function that is able to produce predictions for new cases.
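In code, that view just means the auto-tuning lives inside the training routine. A small sketch; the ridge model and alpha grid are again illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

def train(X_train, y_train):
    # "Training" includes the data-driven choice of alpha (the λ above), so what comes
    # out is the ready-to-use black box for new cases.
    tuned = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)}, cv=5)
    return tuned.fit(X_train, y_train)

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)
black_box = train(X, y)
print(black_box.predict(X[:3]))   # behaves like any other fitted model
```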
PS: I find the testing vs. validation terminology very confusing because in my field "validation" means proving that the final model is fit for purpose, and is therefore what other people call testing rather than validation. I prefer to call the inner test set "tuning test set" and the outer "final validation test set" or the like.
Update:
Typically, such a failure is not something that just happens: there are typical situations that cause it, and all such situations that I'm aware of are overfitting situations. You need to be aware that while regularization helps to reduce the necessary number of training cases, data-driven optimization needs large amounts of data.
My recommendations:
Typically, you (should) already have rough expectations, e.g. what performance should be achievable, or what performance you'd consider suspiciously good-looking. Or you have specs for what performance you need to achieve and a baseline performance. From that and the number of available training cases (for the splitting scheme you decided on), calculate the expected uncertainty for the inner (tuning) tests. If that uncertainty indicates that you would not be able to get meaningful comparisons, don't do data-driven optimization.
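For a classification-type figure of merit, that back-of-the-envelope calculation can be as simple as asking what spread of observed values n tuning-test cases can produce. A scipy sketch; the 0.80 "true" accuracy and the test-set sizes are made-up numbers:

```python
from scipy.stats import binom

def accuracy_interval(p_true, n_test, level=0.95):
    # Central binomial quantile interval for the observed accuracy on n_test cases.
    lo = binom.ppf((1 - level) / 2, n_test, p_true) / n_test
    hi = binom.ppf(1 - (1 - level) / 2, n_test, p_true) / n_test
    return lo, hi

for n in (25, 100, 400):
    lo, hi = accuracy_interval(0.80, n)
    print(f"true accuracy 0.80, n={n:4d}: observed accuracy ranges roughly {lo:.2f}-{hi:.2f}")
```

If candidate hyperparameters differ by less than that spread, the inner CV cannot reliably tell them apart.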
You should check how stable both the optimal λ found by the auto-tuning procedure and the predictions obtained with that λ are. If λ isn't reasonably stable with respect to different splits of your data, the optimization didn't work.
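One way to run that stability check (a sketch with LassoCV as an example auto-tuner; the resampling scheme and number of repeats are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=100, n_features=50, n_informative=5, noise=5.0, random_state=0)
rng = np.random.default_rng(0)

chosen_alphas = []
for _ in range(20):
    idx = rng.choice(len(y), size=len(y), replace=True)       # resample the objects
    chosen_alphas.append(LassoCV(cv=5).fit(X[idx], y[idx]).alpha_)

chosen_alphas = np.array(chosen_alphas)
print("chosen alphas span %.3g to %.3g (median %.3g)" %
      (chosen_alphas.min(), chosen_alphas.max(), np.median(chosen_alphas)))
# If these values (and the resulting predictions) jump around wildly across resamples,
# the optimization did not work and a fixed, expert-chosen λ may be the safer choice.
```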
If you find either that you won't be able to do the data-driven optimization or that it didn't work after all, you may choose λ by your expert knowledge, e.g. from experience with similar data, or by the knowledge that a failed optimization calls for stronger regularization: the overfitting that leads to the failure pushes towards overly complex models.