Stepwise selection is wrong in multilevel models for the same reasons it is wrong in "regular" regression: the p-values will be too low, the standard errors too small, the parameter estimates biased away from 0, and so on. Most importantly, it denies you the opportunity to think.
9 IVs is not so very many. Why did you choose those 9? Surely you had a reason.
One initial thing to do is look at a lot of plots; which precise ones depends a little on whether your data are longitudinal (in which case plots with time on the x-axis are often useful) or clustered. But surely look at relationships between the 9 IVs and your DV (parallel box plots are one simple possibility).
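As a minimal sketch of such exploratory plots (in Python, with a synthetic data frame and made-up column names; substitute your own DV and IVs):

```python
# Minimal sketch of exploratory plots of IV-DV relationships.
# The data frame and column names here are made up; substitute your own data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "y": rng.normal(size=n),                      # continuous DV
    "treatment": rng.choice(["A", "B", "C"], n),  # categorical IV
    "age": rng.uniform(20, 60, n),                # continuous IV
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Parallel box plots: distribution of the DV within each level of a categorical IV.
df.boxplot(column="y", by="treatment", ax=ax1)
ax1.set_title("y by treatment")

# Simple scatter plot for a continuous IV.
ax2.scatter(df["age"], df["y"], alpha=0.5)
ax2.set_xlabel("age")
ax2.set_ylabel("y")

plt.tight_layout()
plt.show()
```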
The ideal would be to build a few models based on substantive sense and compare them using AIC, BIC or some other measure. But don't be surprised if no particular model comes forth as clearly best. You don't say what field you work in, but in many (most?) fields, nature is complicated. Several models may fit about equally well and a different model may fit better on a different data set (even if both are random samples from the same population).
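As a rough sketch of that workflow, the example below fits a few candidate mixed models with statsmodels and compares the AIC/BIC it reports for maximum-likelihood fits. The data, column names, and formulas are invented; the candidates stand in for models you would specify on substantive grounds, not by automated search:

```python
# Sketch: fit a handful of substantively motivated mixed models and compare AIC/BIC.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_groups, n_per = 30, 10
g = np.repeat(np.arange(n_groups), n_per)
x1 = rng.normal(size=n_groups * n_per)
x2 = rng.normal(size=n_groups * n_per)
u = rng.normal(scale=0.8, size=n_groups)[g]          # group-level random intercepts
y = 1.0 + 0.5 * x1 + 0.2 * x2 + u + rng.normal(size=n_groups * n_per)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "group": g})

candidates = {
    "x1 only":     "y ~ x1",
    "x2 only":     "y ~ x2",
    "x1 + x2":     "y ~ x1 + x2",
    "interaction": "y ~ x1 * x2",
}

for name, formula in candidates.items():
    # Fit by ML (reml=False) so AIC/BIC are comparable across fixed-effect structures.
    res = smf.mixedlm(formula, df, groups=df["group"]).fit(reml=False)
    print(f"{name:12s}  AIC = {res.aic:8.1f}   BIC = {res.bic:8.1f}")
```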
As for references: there are lots of good books on nonlinear mixed models. Which one is best for you depends on (a) what field you are in, (b) the nature of the data, and (c) what software you use.
Responding to your comment
If all 9 variables are scientifically important, I would at least consider including them all. If a variable that everyone thinks is important winds up having a small effect, that is interesting.
Certainly plot all your variables over time and in various ways.
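For example, a simple "spaghetti plot" (one line per subject over time) is often a useful first look for longitudinal data. The sketch below uses synthetic data and placeholder column names ("id", "time", "y"):

```python
# Sketch of a spaghetti plot for longitudinal data: one line per subject over time.
# Column names ("id", "time", "y") are placeholders for your own variables.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n_subj, n_times = 25, 6
long = pd.DataFrame({
    "id": np.repeat(np.arange(n_subj), n_times),
    "time": np.tile(np.arange(n_times), n_subj),
})
long["y"] = (rng.normal(scale=2, size=n_subj)[long["id"]]   # subject-specific level
             + 0.5 * long["time"]
             + rng.normal(scale=0.7, size=len(long)))

fig, ax = plt.subplots(figsize=(6, 4))
for _, subj in long.groupby("id"):
    ax.plot(subj["time"], subj["y"], color="grey", alpha=0.5)

# Overlay the mean trajectory across subjects.
mean_traj = long.groupby("time")["y"].mean()
ax.plot(mean_traj.index, mean_traj.values, color="black", linewidth=2)
ax.set_xlabel("time")
ax.set_ylabel("y")
plt.show()
```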
For general issues about longitudinal multilevel models I like Hedeker and Gibbons; for nonlinear longitudinal models in SAS I like Molenberghs and Verbeke. The SAS documentation itself (for PROC GLIMMIX) also provides guidance.
1) The reason you're confused is that the term "stepwise" is used inconsistently. Sometimes it means pretty specific procedures in which $p$-values of regression coefficients, calculated in the ordinary way, are used to determine what covariates are added to or removed from a model, and this process is repeated several times. It may refer to (a) a particular variation of this procedure in which variables can be added or removed at any step (I think this is what SPSS calls "stepwise"), or it may refer to (b) this variation along with other variations such as only adding variables or only removing variables. More broadly, "stepwise" can be used to refer to (c) any procedure in which features are added to or removed from a model according to some value that's computed each time a feature (or set of features) is added or removed.
These different strategies have all been criticized for various reasons. I would say that most of the criticism is about (b); the key part of that criticism is that $p$-values are poorly suited to feature selection (the significance tests here are really testing something quite different from "should I include this variable in the model?"), and most serious statisticians recommend against it in all circumstances. (c) is more controversial.
2) Because statistics education is really bad. To give just one example: so far as I can tell from my own education, it's apparently considered a key part of statistics education for psychology majors to tell students to use Bessel's correction to get unbiased estimates of the population SD. It's true that Bessel's correction makes the estimate of the variance unbiased, but it's easy to prove that the resulting estimate of the SD is still biased. To make matters worse, Bessel's correction can even increase the MSE of these estimates.
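A quick simulation makes the point; the numbers below are illustrative only:

```python
# Small simulation: with Bessel's correction (ddof=1) the sample *variance* is
# unbiased for sigma^2, but the sample *SD* is still biased low for sigma.
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0
n = 5                       # small samples make the bias easy to see
reps = 200_000

samples = rng.normal(loc=0.0, scale=sigma, size=(reps, n))
var_hat = samples.var(axis=1, ddof=1)   # Bessel-corrected variance
sd_hat = np.sqrt(var_hat)               # the usual "corrected" SD estimate

print("mean of variance estimates:", var_hat.mean())  # close to 1.0 (unbiased)
print("mean of SD estimates:      ", sd_hat.mean())   # noticeably below 1.0 (biased)
```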
3) Variable selection is practically a field unto itself. Cross-validation and train–test splits are ways to evaluate a model, possibly after feature selection; they don't themselves provide suggestions for which features to use. The lasso is often a good choice. So is best subsets.
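As an illustration (not a prescription), here is a small sketch of the lasso with a cross-validated penalty using scikit-learn's LassoCV on synthetic data; note that a plain lasso like this ignores any multilevel or grouping structure:

```python
# Sketch: lasso with a cross-validated penalty as one principled alternative to
# p-value-driven stepwise selection. Synthetic data, illustrative only.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p = 200, 9
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.5, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

X_std = StandardScaler().fit_transform(X)   # put coefficients on a common scale
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)

print("chosen penalty (alpha):", lasso.alpha_)
for j, b in enumerate(lasso.coef_):
    print(f"x{j + 1}: coefficient = {b: .3f}")
```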
4) In my mind, there's still no sense in using (b), especially when you could do something else in (c) instead, like using AIC. I have no objections to AIC-based stepwise selection, but be aware that it's going to be sensitive to the sample (in particular, as samples grow arbitrarily large, AIC, like the lasso, tends to favor overly complex models), so don't present the model selection itself as if it were a generalizable conclusion.
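For concreteness, here is a rough sketch of forward selection by AIC using ordinary least squares in statsmodels; the data are synthetic and the code is shown only to illustrate the mechanics, not to endorse the result:

```python
# Sketch of forward selection by AIC with ordinary least squares (statsmodels).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 150
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{i}" for i in range(1, 6)])
df["y"] = 0.8 * df["x1"] + 0.4 * df["x2"] + rng.normal(size=n)

selected, remaining = [], [f"x{i}" for i in range(1, 6)]
current_aic = smf.ols("y ~ 1", df).fit().aic

while remaining:
    # AIC for adding each remaining variable to the current model.
    aics = {v: smf.ols(f"y ~ {' + '.join(selected + [v])}", df).fit().aic
            for v in remaining}
    best_var, best_aic = min(aics.items(), key=lambda kv: kv[1])
    if best_aic >= current_aic:
        break                      # no addition improves AIC; stop
    selected.append(best_var)
    remaining.remove(best_var)
    current_aic = best_aic

print("selected variables:", selected, " AIC:", round(current_aic, 1))
```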
"If we are looking to see which variables seem to explain the response and in what way ..."
Ultimately, if you want to look at the effects of all the variables, you need to include all the variables, and if your sample is too small for that, you need a bigger sample. Remember, null hypotheses are never true in real life. There aren't going to be a bunch of variables that are associated with an outcome and a bunch of other variables that aren't. Every variable will be associated with the outcome—the questions are to what degree, in what direction, in what interactions with other variables, etc.
Best Answer
I think this approach is mistaken, but perhaps it will be more helpful if I explain why. Wanting to know the best model given some information about a large number of variables is quite understandable. Moreover, it is a situation in which people seem to find themselves regularly. In addition, many textbooks (and courses) on regression cover stepwise selection methods, which implies that they must be legitimate. Unfortunately, however, they are not, and this combination of situation and goal is quite difficult to navigate successfully. The following is a list of problems with automated stepwise model selection procedures (attributed to Frank Harrell, and copied from here):

1. R² values are biased to be high.
2. The F and chi-squared test statistics quoted next to each variable do not have the claimed distribution.
3. The method yields confidence intervals for effects and predicted values that are falsely narrow.
4. It yields p-values that do not have the proper meaning, and the proper correction for them is a difficult problem.
5. It gives biased regression coefficients that need shrinkage (the coefficients for the remaining variables are too large).
6. It has severe problems in the presence of collinearity.
7. It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses.
8. Increasing the sample size does not help very much.
9. It allows us to not think about the problem.
10. It uses a lot of paper.
The question is: what's so bad about these procedures, and why do these problems occur? Most people who have taken a basic regression course are familiar with the concept of regression to the mean, so this is what I use to explain these issues. (Although this may seem off-topic at first, bear with me; I promise it's relevant.)
Imagine a high school track coach on the first day of tryouts. Thirty kids show up. These kids have some underlying level of intrinsic ability to which neither the coach, nor anyone else, has direct access. As a result, the coach does the only thing he can do, which is have them all run a 100m dash. The times are presumably a measure of their intrinsic ability and are taken as such. However, they are probabilistic; some proportion of how well someone does is based on their actual ability and some proportion is random. Imagine that the true situation is the following:
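Here is one way such a situation could be simulated; all numbers (abilities in seconds, noise SD) are made up for illustration:

```python
# Minimal sketch of the coach example: each runner has a fixed intrinsic ability,
# and every observed 100m time is that ability plus independent random noise.
import numpy as np

rng = np.random.default_rng(6)
n_kids = 30
ability = rng.uniform(12.0, 15.0, size=n_kids)    # true (unobservable) mean time, in seconds

race1 = ability + rng.normal(scale=0.6, size=n_kids)   # observed times, race 1
race2 = ability + rng.normal(scale=0.6, size=n_kids)   # observed times, race 2

fastest = np.argsort(race1)[:5]    # the kids the coach praises after race 1
slowest = np.argsort(race1)[-5:]   # the kids the coach yells at after race 1

print("praised group:   mean race 1 =", race1[fastest].mean().round(2),
      " mean race 2 =", race2[fastest].mean().round(2))   # tends to get slower
print("yelled-at group: mean race 1 =", race1[slowest].mean().round(2),
      " mean race 2 =", race2[slowest].mean().round(2))   # tends to get faster
```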
The results of the first race are displayed in the following figure along with the coach's comments to the kids.
Note that partitioning the kids by their race times leaves overlap in their intrinsic ability; this fact is crucial. After praising some and yelling at others (as coaches tend to do), he has them run again. Here are the results of the second race, with the coach's reactions (simulated from the same model above):
Notice that their intrinsic ability is identical, but the times bounced around relative to the first race. From the coach's point of view, those he yelled at tended to improve, and those he praised tended to do worse (I adapted this concrete example from the Kahneman quote listed on the wiki page), although in reality regression to the mean is a simple mathematical consequence of the fact that the coach is selecting athletes for the team based on a measurement that is partly random.
Now, what does this have to do with automated (e.g., stepwise) model selection techniques? Developing and confirming a model based on the same dataset is sometimes called data dredging. Although there is some underlying relationship amongst the variables, and stronger relationships are expected to yield stronger scores (e.g., higher t-statistics), these are random variables and the realized values contain error. Thus, when you select variables based on having higher (or lower) realized values, they may be such because of their underlying true value, error, or both. If you proceed in this manner, you will be as surprised as the coach was after the second race. This is true whether you select variables based on having high t-statistics, or low intercorrelations. True, using the AIC is better than using p-values, because it penalizes the model for complexity, but the AIC is itself a random variable (if you run a study several times and fit the same model, the AIC will bounce around just like everything else). Unfortunately, this is just a problem intrinsic to the epistemic nature of reality itself.
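A small simulation illustrates the point: draw repeated samples from the same population, pick the lowest-AIC model each time (all-subsets rather than stepwise selection, but the point is the same), and watch the selected variable set change from replication to replication. Everything below is synthetic and illustrative:

```python
# Repeat the same "study" many times and pick the lowest-AIC model each time.
# The selected variable set bounces around across replications, because AIC is
# itself a random variable.
import itertools
from collections import Counter
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
variables = ["x1", "x2", "x3", "x4"]
chosen = Counter()

for rep in range(200):
    n = 60
    df = pd.DataFrame(rng.normal(size=(n, 4)), columns=variables)
    df["y"] = 0.3 * df["x1"] + 0.2 * df["x2"] + rng.normal(size=n)  # weak true effects

    best_subset, best_aic = (), np.inf
    for k in range(len(variables) + 1):
        for subset in itertools.combinations(variables, k):
            formula = "y ~ 1" if not subset else "y ~ " + " + ".join(subset)
            aic = smf.ols(formula, df).fit().aic
            if aic < best_aic:
                best_subset, best_aic = subset, aic
    chosen[best_subset] += 1

for subset, count in chosen.most_common(5):
    print(count, "replications chose:", subset or "(intercept only)")
```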
I hope this is helpful.