The key thing to remember is that for cross-validation to give an (almost) unbiased performance estimate, every step involved in fitting the model must also be performed independently in each fold of the cross-validation procedure. The best thing to do is to view feature selection, meta/hyper-parameter setting and optimising the parameters as integral parts of model fitting, and never do any one of these steps without doing the other two.
The optimistic bias that can be introduced by departing from that recipe can be surprisingly large, as demonstrated by Cawley and Talbot, where the bias introduced by an apparently benign departure was larger than the difference in performance between competing classifiers. Worse still, biased protocols favour bad models most strongly, as they are more sensitive to the tuning of hyper-parameters and hence are more prone to over-fitting the model selection criterion!
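To make the recipe concrete, here is a minimal sketch in Python with scikit-learn; the dataset, selector and classifier are illustrative assumptions rather than anything from the question. The biased protocol selects features on the whole dataset before cross-validating, while the unbiased protocol wraps selection and fitting in a Pipeline so both are redone in every fold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic high-dimensional data with only a few informative features.
X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           random_state=0)

# Biased protocol: features chosen using *all* the data, then cross-validated.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_scores = cross_val_score(SVC(), X_leaky, y, cv=5)

# Unbiased protocol: selection is part of the model and is refit in every fold.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)), ("clf", SVC())])
honest_scores = cross_val_score(pipe, X, y, cv=5)

print(leaky_scores.mean(), honest_scores.mean())  # the leaky estimate is typically optimistic
```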
Answers to specific questions:
The procedure in step 1 is valid because feature selection is performed separately in each fold, so what you are cross-validating is the whole procedure used to fit the final model. The cross-validation estimate will have a slight pessimistic bias, as the dataset available in each fold is slightly smaller than the whole dataset used for the final model.
For 2, as cross-validation is used to select the model parameters, you need to repeat that procedure independently in each fold of the cross-validation used for performance estimation, so you end up with nested cross-validation (see the sketch after these answers).
For 3, essentially yes, you need to do nested-nested cross-validation: in each fold of the outermost cross-validation (used for performance estimation) you need to repeat everything you intend to do to fit the final model.
For 4 - yes, if you have a separate hold-out set, then that will give an unbiased estimate of performance without needing an additional cross-validation.
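To illustrate answers 1 to 3, here is a minimal nested cross-validation sketch in Python with scikit-learn; the dataset, scaler, estimator and hyper-parameter grid are illustrative assumptions. The inner cross-validation (GridSearchCV) selects the hyper-parameters, and the outer cross-validation estimates the performance of that entire tuning-plus-fitting procedure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Everything needed to fit the final model lives in one estimator.
model = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [1e-3, 1e-2, 1e-1]}

inner = GridSearchCV(model, grid, cv=5)            # model selection (inner folds)
outer_scores = cross_val_score(inner, X, y, cv=5)  # performance estimation (outer folds)
print(outer_scores.mean(), outer_scores.std())

# The final model is then fitted in exactly the same way on all of the data.
final_model = inner.fit(X, y).best_estimator_
```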
The most obvious issue that I see is the size of your hidden layers. Networks like this are generally trained on a GPU, and training can take hours, days, or sometimes weeks.
The "capacity"of your network is not large enough to fully represent the underlying variability in the data (MNIST). Try increasing the size of your hidden layers.
If you read this thread you will find that "validation data is not used in the fitting process" and "this data is not passed to the training process so the optimiser doesn't update the parameters with respect to this data". In an autoencoder, this `validation_data` set is used just for calculating the loss function. So the test error of an autoencoder is nothing but the output of your selected `loss` function on this validation set.
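As a concrete illustration, here is a minimal Keras sketch with stand-in data and an illustrative architecture: the val_loss reported during training is exactly the chosen loss evaluated on validation_data, which the optimiser never uses to update the weights.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in data purely for illustration.
x_train = np.random.rand(1000, 784).astype("float32")
x_val = np.random.rand(200, 784).astype("float32")

autoencoder = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(784, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# validation_data is only evaluated, never used for gradient updates.
history = autoencoder.fit(x_train, x_train, epochs=5, batch_size=64,
                          validation_data=(x_val, x_val), verbose=0)

# "val_loss" is just the selected loss computed on the validation set;
# evaluate() on the same data reproduces the same number.
print(history.history["val_loss"][-1])
print(autoencoder.evaluate(x_val, x_val, verbose=0))
```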