I have learned that I should use cross-validation performance to select the best model. However, when I select models by cross-validation performance, I currently get lower test performance than when I select them by training performance. I have 300-400 features and about 10000 samples, with 12 output classes (multi-class classification).

I've split the dataset 80:20 (training:test). I'm using H2O's deep learning with 5-fold cross-validation.
Is it wrong to use the training log-loss instead of the cross-validation log-loss?
As I've understood it, the held-out folds of the cross-validation are used to score each of the 5 models. The XVAL log-loss is then the average (?) over the 5 models trained.

I have two loops. The outer loop samples network parameters such as epochs (between 1 and 5), number of layers (2-5), and number of nodes in the layers (10-50).
For each parameter configuration from the outer loop, I train 100 models, and for these I check, among other things, the average log-loss for both training and cross-validation, the variance of the log-loss, the average true-rate for training and testing, and the average profit on the test set.

If I selected the 10 best model configurations based on the average training log-loss or average training true-rate, it would give higher true-rate and profit on the test set, compared to using cross-validation metrics.
I think the cross-validation log-loss is good for training (true-rate is probably not good for this), but do I have to also use it for selecting model configurations?

I previously did feature selection, and later parameter tuning, by following the cross-validation metrics. Now that I seem to get better results by following the training log-loss instead, I'm wondering whether I should redo the feature selection and parameter tuning.

There isn't a single "best" answer to your question. There are many threads throughout Cross Validated that discuss these issues. Here's one ...

Key points to note are that training or calibration data provides useful information regarding optimal fit (e.g., Harrell recommends using a nonparametric loess smoother for this), but performance measured on the training data is well known to be optimistically biased, i.e., it overstates how well the model will do on new data:

The calibration curve is both a measure of goodness-of-fit and a great way to check the accuracy of probabilities estimated by the model.
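
As a rough illustration of what a calibration check compares, here is a minimal sketch in plain Python. It is not Harrell's loess approach: it swaps in simple equal-width binning, and the function name `calibration_bins`, the bin count, and the toy data are all made up for the example. For each bin it reports the mean predicted probability next to the observed event rate; a well-calibrated model has the two roughly equal.

```python
def calibration_bins(probs, labels, n_bins=5):
    """Binned calibration table (a simple stand-in for a smoothed curve).

    probs  : predicted probabilities of the event, each in [0, 1]
    labels : observed outcomes, 1 if the event happened, else 0
    Returns one (mean_predicted, observed_rate, count) tuple per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    rows = []
    for members in bins:
        if not members:
            continue  # skip empty bins instead of dividing by zero
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows


# Toy data, invented for illustration only.
probs = [0.1, 0.15, 0.3, 0.35, 0.6, 0.65, 0.9, 0.95]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
for mean_pred, observed, n in calibration_bins(probs, labels, n_bins=4):
    print(f"predicted {mean_pred:.3f}  observed {observed:.3f}  n={n}")
```

Computed on the training data, such a table will tend to look better than it would on held-out data, which is the bias mentioned above.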

Regardless, the emphasis in machine learning is on using validation (out-of-sample) data to evaluate fit. Validation results provide useful information regarding the bias-variance tradeoff that can't be inferred from calibration information.

  • If your validation error and training error are both high, you have underfitting (bias).

  • If your validation error is high but training error is low, you have overfitting (variance).
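
The two rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `log_loss` and `diagnose` and the threshold `high` are hypothetical, and what counts as "high" is a judgment call for your problem. One reference point that may help: guessing uniformly among K classes gives a log-loss of ln(K), so about 2.48 for 12 classes.

```python
import math


def log_loss(y_true, prob_rows, eps=1e-15):
    """Multi-class log-loss: mean negative log of the probability
    the model assigned to the true class of each sample."""
    total = 0.0
    for y, row in zip(y_true, prob_rows):
        p = min(max(row[y], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


def diagnose(train_loss, valid_loss, high):
    """Apply the two rules of thumb; `high` is a user-chosen threshold (assumption)."""
    if train_loss >= high and valid_loss >= high:
        return "underfitting (bias)"
    if valid_loss >= high and train_loss < high:
        return "overfitting (variance)"
    return "no clear problem"


# Uniform guessing over 12 classes as a baseline for what "high" might mean.
baseline = math.log(12)
print(diagnose(train_loss=0.4, valid_loss=2.6, high=baseline))
```

In practice you would feed in the averaged training and cross-validation log-loss of each configuration rather than hand-typed numbers.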

Here's one discussion of this:

