There is definitely a problem with selecting a case where the mean AUC is the best. You should instead report how you set up cross-validation, how many times you ran it, and include some summary statistics of the AUCs you obtained (maybe include a histogram, too).
Cross-validation gives you an estimate of how a model trained on a random sample from your distribution (of a size similar to your training folds) would perform on another random sample from the same distribution. The variability in AUCs you observe, depending on which examples end up in the training/test sets, shows that your model is somewhat sensitive to the particular sample; the variance of the AUCs gives you a sense of how sensitive it is.
To see why selecting the case with the best AUC is wrong, consider a model that is extremely sensitive to its training/test sets. That sounds like a bad model, right? But given the wide variance, on some sample it will work really, really well - by chance. You can then see how reporting just that figure would be seriously misleading.
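For illustration, here is a minimal sketch (using scikit-learn; the data, the classifier and the 10×5-fold scheme are placeholders, not your actual setup) of reporting the whole AUC distribution instead of the best run:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data and model -- substitute your own X, y and classifier.
X, y = make_classification(n_samples=94, n_features=20, random_state=0)
clf = GradientBoostingClassifier(random_state=0)

# 10 repeats of stratified 5-fold CV, scored by ROC AUC.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
aucs = cross_val_score(clf, X, y, scoring="roc_auc", cv=cv)

# Report the whole distribution, not the best value.
print(f"AUC: mean={aucs.mean():.3f}, sd={aucs.std(ddof=1):.3f}, "
      f"min={aucs.min():.3f}, max={aucs.max():.3f}")
# A quick histogram, e.g. np.histogram(aucs) or matplotlib's plt.hist(aucs),
# gives the suggested picture of the spread.
```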
There is no discrepancy here; the uncertainty due to the small number of tested cases is more than sufficient to explain the situation.
As a quick check, we can calculate binomial confidence intervals for accuracy based on the number of tested cases (for the moment ignoring that there may be additional random uncertainty due to model instability):
- training: $63.3 \% = \frac{38}{60}$; 95 % confidence interval: 50 - 74 %
- validation: $70.6 \% = \frac{24}{34}$; 95 % confidence interval: 54 - 84 %
With that overlap in the confidence intervals, there's no way to argue that the validation set accuracy is actually higher than the training set accuracy.
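If you want to reproduce such intervals, here is a small sketch using statsmodels (I use Wilson intervals here; the exact bounds depend slightly on which interval method you choose):

```python
from statsmodels.stats.proportion import proportion_confint

# (name, correctly classified cases, total tested cases)
for name, correct, n in [("training", 38, 60), ("validation", 24, 34)]:
    lo, hi = proportion_confint(correct, n, alpha=0.05, method="wilson")
    print(f"{name}: {correct}/{n} = {correct / n:.1%}, 95% CI {lo:.0%} - {hi:.0%}")
```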
Note that this also means that your sample size is too small to make meaningful comparisons of predictive performance for selecting the best model, unless the difference between the models is huge. To get an idea of the sample sizes needed, have a look at our paper:
Beleites, C. and Neugebauer, U. and Bocklitz, T. and Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33.
DOI: 10.1016/j.aca.2012.11.007
accepted manuscript on arXiv: 1211.1323
Nevertheless, selecting the apparently best model will tend to produce overfitting, as your decision is heavily influenced by the random uncertainty of your test results: if you take many different models that truly have the same predictive performance and measure their performance only roughly, you will still observe better and worse performance estimates. But picking among them cannot produce a truly better model. Almost the same happens if the models differ so little in their predictive performance that you cannot reliably detect the difference given your sample size.
This is overfitting in the very sense of the word - actually, it's what I'd consider one of the textbook situations of overfitting.
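A quick simulation (with made-up numbers, purely to illustrate the selection effect) shows how picking the apparently best of several truly identical models inflates the reported performance:

```python
import numpy as np

rng = np.random.default_rng(0)
true_acc = 0.65   # all candidate models share the same true accuracy
n_test = 34       # number of test cases per evaluation
n_models = 20     # number of candidate models compared on such test results

# Observed accuracy of each model on its noisy test evaluation (10000 trials).
observed = rng.binomial(n_test, true_acc, size=(10_000, n_models)) / n_test

print("mean observed accuracy of one fixed model: ", observed[:, 0].mean())
print("mean observed accuracy of the 'best' model:", observed.max(axis=1).mean())
# The selected 'best' model looks clearly better, although all models are equal.
```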
What you can and should do, though, is measure model stability. That can be done even if the random uncertainty of the overall accuracy etc. is large due to the small sample size:
Beleites, C. and Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations. Anal Bioanal Chem, 2008, 390, 1261-1271.
DOI: 10.1007/s00216-007-1818-6
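As a rough sketch of the general idea (not the exact procedure of the paper), you can check how consistently each case is predicted by the different surrogate models of a repeated cross validation:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import RepeatedStratifiedKFold

def prediction_stability(clf, X, y, n_splits=5, n_repeats=10, random_state=0):
    """Per-case agreement of predictions across the surrogate models of a
    repeated CV. Assumes binary 0/1 labels and numpy arrays X, y."""
    y = np.asarray(y)
    votes = np.zeros((len(y), n_repeats), dtype=int)
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=random_state)
    for i, (train, test) in enumerate(cv.split(X, y)):
        model = clone(clf).fit(X[train], y[train])
        votes[test, i // n_splits] = model.predict(X[test])
    # How often each case's prediction agrees with its own majority vote:
    majority = (votes.mean(axis=1) >= 0.5).astype(int)
    return (votes == majority[:, None]).mean(axis=1)

# e.g. stability = prediction_stability(clf, X, y); values near 1 = stable cases
```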
This would require changing the outer train/test split into a repeated cross validation. As you did the train/test split yourself (as opposed to, say, a clinical study where patients were assigned to training and test arms beforehand), repeated cross validation is much better here anyway, as it leads to each and every sample being tested (and thus helps to narrow the random uncertainty: the same 70.6 % accuracy observed in a cross validation comprising all 94 cases would have a 95 % CI of 61 - 79 %).
Once you know whether your models are stable or not, you can choose a strategy to improve them based on this information. E.g. if they turn out to be unstable, you may consider moving from boosting to more restrictive models or even bagging.
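If instability is the issue, a bagged ensemble of shallow trees is one possible low-variance candidate to compare against the boosted model (just a sketch; the base learner and its settings are arbitrary choices):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging averages over many models fit to resampled training sets, which tends
# to reduce variance, whereas boosting keeps adapting to the training sample.
bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=2),
                           n_estimators=100, random_state=0)
# Compare it with the boosted model using the same repeated CV, e.g.
# cross_val_score(bagged, X, y, scoring="roc_auc", cv=cv)
```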
There are approaches to improve the grid search selection procedure (e.g. a better behaved figure of merit such as a proper scoring rule, and choosing the least complex model whose performance is within the random uncertainty of the apparently best one - see here) - but they cannot work miracles. Still, a back-of-the-envelope calculation from your accuracies suggests that such a least complex model would perform only slightly better than guessing, which is most probably not an acceptable result for your application.
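Here is a rough sketch of that "least complex model within the random uncertainty of the apparently best one" selection, in the spirit of a one-standard-error rule (the complexity measure and all numbers are purely illustrative):

```python
import numpy as np

def select_least_complex(complexities, mean_scores, se_scores):
    """Pick the simplest candidate whose mean CV score is within one standard
    error of the apparently best candidate."""
    complexities = np.asarray(complexities, dtype=float)
    mean_scores = np.asarray(mean_scores, dtype=float)
    se_scores = np.asarray(se_scores, dtype=float)
    best = np.argmax(mean_scores)
    threshold = mean_scores[best] - se_scores[best]
    admissible = mean_scores >= threshold
    return int(np.argmin(np.where(admissible, complexities, np.inf)))

# e.g. tree depth as the complexity axis, with CV mean scores and standard errors:
idx = select_least_complex(complexities=[1, 2, 3, 4, 5],
                           mean_scores=[0.64, 0.65, 0.66, 0.67, 0.66],
                           se_scores=[0.04, 0.04, 0.04, 0.04, 0.04])
print(idx)  # -> 0: depth 1 is already within one SE of the best mean score
```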
I think it is very likely that even with a more stable training algorithm than boosting and a proper scoring rule as your figure of merit, you may find that your data set just doesn't have the information content required for data-driven tuning of model parameters. Thus, you may be better off overall using either a modeling algorithm that doesn't need tuning or one that can be tuned from your expert knowledge about your data and the algorithm.
> How can you select a model if it's the same parameter set but just a different iteration? (high variance)
If you run repeated 10-fold CV ("5 or 10 times") and get different AUC values for the same parameter set, then a fair estimate is the worst outcome. Cross-validation is in itself already a pessimistic estimate of the model trained on the entire data set (the one that was used for CV). Select 0.58 (the lowest test AUC) - selecting the best train or test AUC is probably over-optimistic.
If 0.58 is not good enough, then the model must be made more robust to noise - accepting more bias in exchange for less variance - which makes selecting a model easier. At some point you need a separate hold-out set for testing, since you are to some extent optimizing for your test AUC if you keep iterating.
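A minimal, self-contained sketch of that workflow (the data, model, split sizes and CV scheme are placeholders): lock away a hold-out set first, run the repeated CV on the rest, and use the worst repeat as the conservative figure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (RepeatedStratifiedKFold, cross_val_score,
                                     train_test_split)

# Placeholder data and model -- substitute your own.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = GradientBoostingClassifier(random_state=0)

# Reserve a final hold-out set *before* any tuning or model selection.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Repeated 10-fold CV on the development part only.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
aucs = cross_val_score(clf, X_dev, y_dev, scoring="roc_auc", cv=cv)

# Mean AUC per repeat; report the worst repeat as the conservative figure.
per_repeat_auc = aucs.reshape(5, 10).mean(axis=1)
print("worst repeat AUC:", per_repeat_auc.min())
# Evaluate on (X_holdout, y_holdout) only once, at the very end.
```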