The argument that the paper makes seems strange to me.
According to the paper, the goal of CV is to estimate $\alpha_2$, the expected predictive performance of the model on new data, given that the model was trained on the observed dataset $S$. When we conduct $k$-fold CV, we obtain an estimate $\hat A$ of this number. Because of the random partitioning of $S$ into $k$ folds, this is a random variable $\hat A \sim f(A)$ with mean $\mu_k$ and variance $\sigma^2_k$. In contrast, $n$-times-repeated CV yields an estimate with the same mean $\mu_k$ but smaller variance $\sigma^2_k/n$.
Obviously, $\alpha_2\ne \mu_k$. This bias is something we have to accept.
However, the expected error $\mathbb E\big[|\alpha_2-\hat A|^2\big] = (\alpha_2-\mu_k)^2 + \operatorname{Var}(\hat A)$ will be larger for smaller $n$, and largest for $n=1$, at least under reasonable assumptions about $f(A)$, e.g. when $\hat A\mathrel{\dot\sim} \mathcal N(\mu_k,\sigma^2_k/n)$: the bias term is fixed, while the variance term shrinks as $\sigma^2_k/n$. In other words, repeated CV gives a more precise estimate of $\mu_k$, and this is a good thing because it translates into a more precise estimate of $\alpha_2$.
Therefore, repeated CV is strictly more precise than non-repeated CV.
The authors do not argue with that! Instead they claim, based on the simulations, that
reducing the variance [by repeating CV] is, in many cases, not very useful, and essentially a waste of computational resources.
This just means that $\sigma^2_k$ in their simulations was pretty low; and indeed, the lowest sample size they used was $200$, which is probably big enough to yield small $\sigma^2_k$. (The difference in estimates obtained with non-repeated CV and 30-times-repeated CV is always small.) With smaller sample sizes one can expect larger between-repetitions variance.
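For intuition, here is a minimal sketch (in Python with scikit-learn, on a synthetic stand-in for the paper's data, so the numbers are purely illustrative) of how one could measure the between-run spread of $\hat A$ directly on a fixed dataset of size 200 and watch it shrink as the number of repeats grows:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# A fixed dataset S of size 200 (synthetic stand-in, *not* the paper's data)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = GaussianNB()

def one_estimate(n_repeats, seed):
    """One draw of A-hat: accuracy averaged over 10 folds, repeated n_repeats times."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=n_repeats, random_state=seed)
    return cross_val_score(clf, X, y, cv=cv).mean()

for n_repeats in (1, 10, 30):
    draws = [one_estimate(n_repeats, seed) for seed in range(50)]
    # The mean stays near mu_k; the sd shrinks roughly like sigma_k / sqrt(n_repeats)
    print(f"n_repeats={n_repeats:2d}: mean={np.mean(draws):.4f}, sd={np.std(draws):.4f}")
```

If the single-run sd already comes out small (as it apparently did in the paper's simulations), repeating buys very little; with smaller samples the picture can look quite different.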
CAVEAT: Confidence intervals!
Another point that the authors are making is that
the reporting of confidence intervals [in repeated cross-validation] is misleading.
It seems that they are referring to confidence intervals for the mean across CV repetitions. I fully agree that this is a meaningless thing to report! The more times CV is repeated, the smaller this CI will be, but nobody is interested in the CI around our estimate of $\mu_k$! We care about the CI around our estimate of $\alpha_2$.
The authors also report CIs for the non-repeated CV, and it's not entirely clear to me how these CIs were constructed. I guess these are the CIs for the means across the $k$ folds. I would argue that these CIs are also pretty much meaningless!
Take a look at one of their examples: the accuracy for the adult dataset with the NB algorithm and a sample size of 200. They get 78.0% with non-repeated CV, CI (72.26, 83.74); 79.0% (77.21, 80.79) with 10-times-repeated CV; and 79.1% (78.07, 80.13) with 30-times-repeated CV. All of these CIs are useless, including the first one. The best estimate of $\mu_k$ is 79.1%, which corresponds to 158 successes out of 200. This yields a 95% binomial confidence interval of (72.8, 84.5) -- even broader than the first one reported. If I wanted to report some CI, this is the one I would report.
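For reference, such a binomial interval is straightforward to compute; here is a minimal sketch using the Clopper-Pearson (exact) interval, which is my assumption about the method and reproduces roughly the numbers quoted above:

```python
from statsmodels.stats.proportion import proportion_confint

# 79.1% accuracy with a sample size of 200 corresponds to ~158 correct predictions
lower, upper = proportion_confint(count=158, nobs=200, alpha=0.05, method="beta")  # Clopper-Pearson
print(f"95% binomial CI: ({100 * lower:.1f}%, {100 * upper:.1f}%)")
# prints an interval close to the (72.8, 84.5) quoted above
```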
MORE GENERAL CAVEAT: variance of CV.
You wrote that repeated CV
has become a popular technique for reducing the variance of cross-validation.
One should be very clear about what one means by the "variance" of CV. Repeated CV reduces the variance of the estimate of $\mu_k$. Note that in the case of leave-one-out CV (LOOCV), when $k=N$, this variance is exactly zero. Nevertheless, it is often said that LOOCV actually has the highest variance of all possible $k$-fold CVs. See e.g. here: Variance and bias in cross-validation: why does leave-one-out CV have higher variance?
Why is that? It is because LOOCV has the highest variance as an estimate of $\alpha_1$, which is the expected predictive performance of the model on new data when it is built on a new dataset of the same size as $S$. This is a completely different issue.
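To see the distinction concretely, the variance "as an estimate of $\alpha_1$" has to be measured across replicate training sets, not across repartitionings of a single $S$. Here is a minimal sketch; the synthetic data and the logistic-regression learner are my own choices, and whether LOOCV or 10-fold comes out more variable depends on the learner and the data, so the point is only what is being measured:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
n, p, n_datasets = 40, 5, 200

loo, tenfold = [], []
for _ in range(n_datasets):                          # many training sets of the same size
    X = rng.normal(size=(n, p))
    y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    clf = LogisticRegression()
    loo.append(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
    tenfold.append(cross_val_score(clf, X, y, cv=KFold(10, shuffle=True, random_state=0)).mean())

# The spread *across datasets* is the variance relevant to alpha_1;
# the between-repetition variance discussed above is zero for LOOCV by construction.
print(f"LOOCV:   mean {np.mean(loo):.3f}, sd across datasets {np.std(loo):.3f}")
print(f"10-fold: mean {np.mean(tenfold):.3f}, sd across datasets {np.std(tenfold):.3f}")
```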
An acceptable amount of error depends on many things, like the total sum of squares in the model. To relate the MSE (mean squared error) to something tangible, it is usually compared to the cross-validated MSE of other versions/variations of the model.
From what I can gather, you first performed your model building, ended up with some model, and cross-validated that. But if your aim is to build a model for prediction, it is customary to actually build your model by cross-validating.
For example, take the full model, with all variables included, and see what MSE that gives. Repeat this for every combination of variables (including the null model, with just an intercept) and cross-validate each of those models. That should give you a list of models and their MSEs. Now you have an idea of what range of MSE you can expect and which model offers the lowest error.
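A minimal sketch of that procedure; the placeholder data and plain linear regression are my assumptions, so substitute your own response, predictors and model:

```python
import itertools
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data; substitute your own predictor matrix X (n x p) and response y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)

cv = KFold(n_splits=10, shuffle=True, random_state=0)

def cv_mse(model, X_):
    """Cross-validated MSE of `model` using the predictors in X_."""
    return -cross_val_score(model, X_, y, cv=cv, scoring="neg_mean_squared_error").mean()

# Null model (intercept only), then every non-empty subset of variables
results = [((), cv_mse(DummyRegressor(strategy="mean"), X))]
for k in range(1, X.shape[1] + 1):
    for subset in itertools.combinations(range(X.shape[1]), k):
        results.append((subset, cv_mse(LinearRegression(), X[:, list(subset)])))

for subset, mse in sorted(results, key=lambda r: r[1]):
    print(f"variables {subset or '(intercept only)'}: CV MSE = {mse:.3f}")
```

With more than a handful of variables, the exhaustive loop becomes expensive and stepwise or penalized alternatives are usually used instead, but the principle of comparing cross-validated MSEs is the same.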
Best Answer
Predictive accuracy always needs to be calculated on unseen data - whether that data is unseen via cross validation splits or via a separate data set.
So often the most important point is to avoid leaks between training and test data. This may be easier to achieve with hold out (e.g. by obtaining test cases only after model training is finished) than with resampling.
But be careful: very often a "hold out" or "independent test" set is used that is in fact just a single random split of the available data set. That procedure is of course prone to the same data leaks as cross validation.
Yes, for simple data, cross validation makes more efficient use of your data. And in small sample size situations, that can be the crucial advantage of resampling. But when you have to deal with multiple confounders and need to split independently for all those confounders, that advantage vanishes very fast because you end up excluding large parts of your data from both test and training set for each surrogate model.
UPDATE: the described scenario is 100k (I assume cases) x an unknown number of variates.
That is certainly not a small sample size situation. In this situation, a random hold out set of 10 % = 10000 cases should show no practically relevant difference from cross validation results. The more so, as a random subset is prone to the same data leaks as cross validation: confounders that lead to clustering in the data. If you have such confounders, your effective sample size may be orders of magnitude below the 100k rows, and any kind of splitting that doesn't take care of those confounders will mean a data leak between training and test and lead to overoptimistic bias in the error estimates.
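If such a confounder can be encoded as a grouping variable, the splitting can be made to respect it. A minimal sketch using scikit-learn's GroupKFold; the data and the strength of the clustering are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

# Made-up data in which a confounder (e.g. patient, batch, site) clusters the rows
rng = np.random.default_rng(0)
n_groups, per_group = 50, 20
groups = np.repeat(np.arange(n_groups), per_group)
group_effect = rng.normal(size=n_groups)[groups]
X = rng.normal(size=(n_groups * per_group, 5)) + group_effect[:, None]
y = (group_effect + 0.3 * rng.normal(size=n_groups * per_group) > 0).astype(int)

clf = LogisticRegression()
naive = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0))
grouped = cross_val_score(clf, X, y, cv=GroupKFold(n_splits=5), groups=groups)

# Plain k-fold lets members of the same group appear in both training and test folds,
# which is exactly the kind of leak described above; GroupKFold holds out whole groups.
print(f"plain 5-fold accuracy:  {naive.mean():.3f}")
print(f"group-wise 5-fold:      {grouped.mean():.3f}")
```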
The more efficient use of cases in cross validation is mostly relevant with small data sets, where cross validation is better because a full run will test each case.
For theory, I recommend reading up on the relevant parts of The Elements of Statistical Learning.
These papers have empirical results on bias and variance of different validation schemes (though they deal explicitly with small sample size situations):