My paper in JMLR addresses this exact question, and demonstrates why the procedure suggested in the question (or at least one very like it) results in optimistically biased performance estimates:
Gavin C. Cawley, Nicola L. C. Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", Journal of Machine Learning Research, 11(Jul):2079–2107, 2010.
The key thing to remember is that cross-validation is a technique for estimating the generalisation performance for a method of generating a model, rather than of the model itself. So if choosing kernel parameters is part of the process of generating the model, you need to cross-validate the model selection process as well, otherwise you will end up with an optimistically biased performance estimate (as will happen with the procedure you propose).
Assume you have a function fit_model, which takes in a dataset consisting of attributes X and desired responses Y, and which returns the fitted model for that dataset, including the tuning of hyper-parameters (in this case kernel and regularisation parameters). This tuning of hyper-parameters can be performed in many ways, for example minimising the cross-validation error over X and Y.
Step 1 - Fit the model to all available data, using the function fit_model. This gives you the model that you will use in operation or deployment.
Step 2 - Performance evaluation. Perform repeated cross-validation using all available data. In each fold, the data are partitioned into a training set and a test set. Fit the model using the training set (record hyper-parameter values for the fitted model) and evaluate performance on the test set. Use the mean over all of the test sets as a performance estimate (and perhaps look at the spread of values as well).
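Steps 1 and 2 can be sketched as a nested cross-validation, with the whole of fit_model (including hyper-parameter tuning) rerun inside every outer fold. This is only an illustrative sketch: the data are synthetic, and the particular grid, kernel, and fold counts are assumptions, not part of the original procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Synthetic data standing in for the attributes X and responses Y.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_model(X, y):
    """Fit a model to (X, y), tuning kernel and regularisation
    parameters by minimising the inner cross-validation error."""
    grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    search.fit(X, y)
    return search  # search.best_params_ records the tuned hyper-parameters

# Step 1: fit to all available data -- this is the deployment model.
final_model = fit_model(X, y)

# Step 2: repeat the *entire* model-generation procedure in each outer fold,
# and evaluate each fitted model only on that fold's held-out test set.
outer = KFold(n_splits=5, shuffle=True, random_state=1)
scores, chosen_params = [], []
for train_idx, test_idx in outer.split(X):
    model = fit_model(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
    chosen_params.append(model.best_params_)  # kept for step 3's analysis

print("estimated generalisation performance:", np.mean(scores))
```

Note that the hyper-parameters chosen in each outer fold are recorded as well, which is the raw material for the variability analysis in step 3.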
Step 3 - Variability of hyper-parameter settings - perform an analysis of the hyper-parameter values collected in step 2. However, I should point out that there is nothing special about hyper-parameters; they are just parameters of the model that have been estimated (indirectly) from the data. They are treated as hyper-parameters rather than parameters for computational/mathematical convenience, but this doesn't have to be the case.
The problem with using cross-validation here is that the training and test data are not independent samples (as they share data) which means that the estimate of the variance of the performance estimate and of the hyper-parameters is likely to be biased (i.e. smaller than it would be for genuinely independent samples of data in each fold). Rather than repeated cross-validation, I would probably use bootstrapping instead and bag the resulting models if this was computationally feasible.
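The bootstrap-and-bag alternative might look roughly like the sketch below. For brevity it refits a model with fixed hyper-parameters on each bootstrap sample; in practice the whole model-generation procedure (tuning included) would be rerun on each resample, and the spread of the resulting hyper-parameter values examined. The data and model settings here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(20):
    # Draw a bootstrap sample (sampling rows with replacement) and refit.
    idx = rng.integers(0, len(X), size=len(X))
    models.append(SVC(kernel="rbf", C=1.0).fit(X[idx], y[idx]))

# Bag the bootstrap models: majority vote over their predictions.
votes = np.stack([m.predict(X) for m in models])
bagged = (votes.mean(axis=0) > 0.5).astype(int)
```

Bagging the bootstrap models in this way often stabilises the final predictor as a side benefit, on top of giving a picture of the variability of the fitted models.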
The key point is that to get an unbiased performance estimate, whatever procedure you use to generate the final model (fit_model) must be repeated in its entirety independently in each fold of the cross-validation procedure.
Stepwise selection is wrong in multilevel models for the same reasons it is wrong in "regular" regression: the p-values will be too low, the standard errors too small, the parameter estimates biased away from zero, and so on. Most importantly, it denies you the opportunity to think.
9 IVs is not so very many. Why did you choose those 9? Surely you had a reason.
One initial thing to do is look at a lot of plots; which precise ones depends a little on whether your data are longitudinal (in which case plots with time on the x-axis are often useful) or clustered. But surely look at relationships between the 9 IVs and your DV (parallel box plots are one simple possibility).
The ideal would be to build a few models based on substantive sense and compare them using AIC, BIC or some other measure. But don't be surprised if no particular model comes forth as clearly best. You don't say what field you work in, but in many (most?) fields, nature is complicated. Several models may fit about equally well and a different model may fit better on a different data set (even if both are random samples from the same population).
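One way to compare a few substantively motivated models by AIC is sketched below, using statsmodels' mixed linear model as a stand-in (the original discussion mentions SAS; this Python equivalent, the synthetic grouped data, and the two candidate formulas are all assumptions for illustration). Note reml=False, since likelihood-based criteria should be compared under maximum likelihood when the fixed effects differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical clustered data: repeated measures within groups.
rng = np.random.default_rng(0)
n_groups, n_per = 30, 8
df = pd.DataFrame({
    "group": np.repeat(np.arange(n_groups), n_per),
    "x1": rng.normal(size=n_groups * n_per),
    "x2": rng.normal(size=n_groups * n_per),
})
group_eff = rng.normal(scale=0.5, size=n_groups)
df["y"] = 1.0 + 0.8 * df["x1"] + group_eff[df["group"]] + rng.normal(size=len(df))

# Two candidate models chosen on substantive grounds, compared by AIC.
m1 = smf.mixedlm("y ~ x1", df, groups=df["group"]).fit(reml=False)
m2 = smf.mixedlm("y ~ x1 + x2", df, groups=df["group"]).fit(reml=False)
print("AIC without x2:", m1.aic, "  AIC with x2:", m2.aic)
```

As the answer warns, don't be surprised if the AIC gap between plausible models is small; that is itself informative.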
As for references - there are lots of good books on nonlinear mixed models. Which one is best for you depends on a) What field you are in b) What the nature of the data is c) What software you use.
Responding to your comment
If all 9 variables are scientifically important, I would at least consider including them all. If a variable that everyone thinks is important winds up having a small effect, that is interesting.
Certainly plot all your variables over time and in various ways.
For general issues about longitudinal multilevel models I like Hedeker and Gibbons; for nonlinear longitudinal models in SAS I like Molenberghs and Verbeke. The SAS documentation itself (for PROC GLIMMIX) also provides guidance.
Rather than using a stepwise procedure, I would fit an L1-regularised model and discard the predictors whose coefficients are shrunk exactly to zero. See [Ng 2004].
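A minimal sketch of this idea, assuming synthetic data and an illustrative penalty strength (the liblinear solver supports the L1 penalty in sklearn; the regularisation constant C would normally be chosen by cross-validation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with 9 predictors, only some of them informative.
X, y = make_classification(n_samples=300, n_features=9,
                           n_informative=4, random_state=0)

# The L1 penalty drives the coefficients of weak predictors exactly to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(clf.coef_[0])
print("retained predictors:", kept)
```

Predictors whose coefficients survive the shrinkage are retained; the rest are discarded in a single fitting step rather than through a sequence of data-dependent tests.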