Solved – Appropriate number of explanatory variables in redundancy analysis (RDA)

Tags: multivariate analysis, predictor, r

This question comes from a reviewer's comment on a manuscript I recently submitted. I analyzed a multivariate data set (6 response variables, 21 observations for each) using redundancy analysis (RDA) in R with the vegan package. I wanted to determine which explanatory variables could best explain the variation of my 6 response variables taken together.

After removing highly correlated (>0.85) explanatory variables, I still had 25 possible explanatory variables for 21 multivariate observations.
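For reference, that pruning step can be done along these lines (a sketch rather than my exact code; `expl` is a placeholder for the data frame of candidate explanatory variables, and `findCorrelation()` from the caret package is one way to pick which member of each highly correlated pair to drop):

library(caret)                                        # for findCorrelation()
cor_mat  <- cor(expl)                                 # pairwise correlations among predictors
drop_idx <- findCorrelation(cor_mat, cutoff = 0.85)   # columns flagged for removal
expl_pruned <- expl[, -drop_idx, drop = FALSE]        # reduced set of explanatory variables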

I then standardized my response and explanatory variables and ran the following code:

# std_flx = standardized response variables; std.div.rda.r = standardized explanatory variables
rdax.r <- rda(std_flx ~ ., data = std.div.rda.r)   # full model with all 25 explanatory variables
rday.r <- rda(std_flx ~ 1, data = std.div.rda.r)   # null (intercept-only) model
rdax_select.r <- ordistep(rdax.r, scope = formula(rday.r), direction = "both",
                          Pin = 0.05, Pout = 0.1, perm.max = 9999)

The idea here is to use ordistep to sequentially test and remove non-significant explanatory variables.
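Once the selection has run, the retained terms and the fit of the final model can be inspected along these lines (a sketch using standard vegan functions on the objects defined above):

rdax_select.r$anova                                      # record of terms added/dropped with permutation p-values
anova(rdax_select.r, permutations = 999)                 # overall permutation test of the final model
anova(rdax_select.r, by = "terms", permutations = 999)   # permutation test for each retained term
RsquareAdj(rdax_select.r)                                # adjusted R^2 of the final model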

My final model kept 9 explanatory variables that best explained the variation in my 6 response variables across the 21 observations.

My questions: 1) Is it appropriate to do this, given that my full model has 25 explanatory variables but only 21 observations? 2) Is the ratio of explanatory to response variables too high?

My understanding is that it's OK, since ordistep sequentially tests the significance of each term and drops the non-significant ones. Moreover, this technique is similar to the DistLM analysis in the PRIMER software with a stepwise procedure based on AICc, but in my case I'm basing the selection procedure on p-values.

Best Answer

I would suggest you use LASSO or ridge regression, which are meant to handle collinear variables as well as situations with $n < p$ (fewer observations than predictors). These occur often in biological data. Tibshirani is one of the pioneers of LASSO regression, which is a more capable tool here than ridge regression because it performs variable selection. It uses cross-validation to choose the penalty, which in turn determines which explanatory variables are retained. The beauty of it is that it shrinks the coefficients of uninformative variables exactly to zero through its penalization procedure. Here is a link to guide you on the applications side: https://web.stanford.edu/~vcs/talks/MicrosoftMay082008.pdf.

To run LASSO, use the glmnet package in R. It has all the diagnostic tools you need and is simple to use. Here is the link to guide you through the glmnet package: http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html. Stepwise selection on AICc, BIC, etc. is a "poor man's method" for regression. Oftentimes, in the presence of collinear data, the stepwise procedure overshoots the true value of $R^2$. The tests used to decide whether to include or exclude a variable are also biased, since they are computed on the same data used to fit the model (in effect, the training data double as the test data). And from a methods standpoint, people tend not to think about how, why, or when a variable is or is not included in the model: they tend to accept the result as is and run with it.
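Coming back to glmnet, a minimal sketch of a multi-response lasso (reusing the object names from your question, and assuming both sets of variables are already standardized) could look like this. With family = "mgaussian", each explanatory variable is penalized jointly across all six responses, so a variable is either kept or dropped for the whole multivariate response:

library(glmnet)

x <- as.matrix(std.div.rda.r)   # 21 x 25 matrix of explanatory variables
y <- as.matrix(std_flx)         # 21 x 6 matrix of response variables

# Lasso (alpha = 1) with the penalty chosen by cross-validation;
# with only 21 observations, leave-one-out CV avoids tiny folds
cv_fit <- cv.glmnet(x, y, family = "mgaussian", alpha = 1,
                    nfolds = nrow(x), grouped = FALSE)

coef(cv_fit, s = "lambda.min")  # variables shrunk exactly to zero are dropped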