Solved – Appropriately selecting explanatory (independent) variables

feature selection, generalized linear model, predictor, regression

My aim is to fit a GLM. I have 400 sites for which I have animal count data (response variable) and environmental characteristics (explanatory variables). At the moment I have around 40 explanatory variables.

  • The examples I have seen so far have at most 10 explanatory variables, so I guess I have to choose among mine. How many explanatory variables are normally selected?

  • I read that the number of explanatory variables should be no more than 10% of the number of observations (which in my case would be 40). Is that a general rule?

  • Are there any statistical tests I can use to select the most appropriate explanatory variables? So far little is known about the relationship between the explanatory variables and the animals, so I cannot select my explanatory variables based on theory.

  • Moreover: How can I deal with multicollinearity and interactions of 40 explanatory variables?

Best Answer

How many variables you want to keep depends, at least partly, on what exactly you want to do. In general it is nice to have as few variables as possible, but if your goal is prediction, then a larger number of variables is fine as long as the prediction works well enough for you. You could use the mean squared error of the prediction as a criterion for deciding on the number of variables, preferably computed on held-out data (for example by cross-validation), since the in-sample error tends to decrease as variables are added. If, on the other hand, your goal is to develop a theory (or to extend an existing one), then you would probably want fewer variables. The main reason is that the model becomes less interpretable the more predictors it includes.
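To make the prediction criterion concrete, here is a minimal Python sketch (NumPy only) that compares candidate variable sets by 5-fold cross-validated mean squared error for a Poisson GLM with log link. Everything in it is an assumption for illustration: the data are simulated (400 sites, 40 variables, only the first three truly related to the counts), and `fit_poisson`, `cv_mse`, and the candidate set sizes are hypothetical names and choices, not taken from the question.

```python
import numpy as np

def fit_poisson(X, y, n_iter=25):
    """Fit a log-link Poisson GLM by iteratively reweighted least squares."""
    X1 = np.column_stack([np.ones(len(y)), X])      # add intercept column
    beta = np.zeros(X1.shape[1])
    beta[0] = np.log(y.mean())                      # start at the intercept-only fit
    for _ in range(n_iter):
        eta = np.clip(X1 @ beta, -20, 20)           # linear predictor (clipped for safety)
        mu = np.exp(eta)                            # fitted means; also the IRLS weights
        z = eta + (y - mu) / mu                     # working response
        WX = X1 * mu[:, None]
        beta = np.linalg.solve(X1.T @ WX, WX.T @ z)
    return beta

def cv_mse(X, y, cols, k=5, seed=0):
    """k-fold cross-validated MSE of predicted counts using columns `cols`."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        beta = fit_poisson(X[train][:, cols], y[train])
        pred = np.exp(beta[0] + X[test][:, cols] @ beta[1:])
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))

# Simulated stand-in for the real data (assumption, not the asker's data).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 40))
beta_true = np.zeros(40)
beta_true[:3] = [0.5, -0.4, 0.3]
y = rng.poisson(np.exp(1.0 + X @ beta_true))

# Compare models that use the first 1, 3, 10, and all 40 variables.
results = {k: cv_mse(X, y, list(range(k))) for k in (1, 3, 10, 40)}
for k, mse in results.items():
    print(f"{k:2d} variables: cross-validated MSE = {mse:.3f}")
```

With real data the candidate sets would come from a selection procedure rather than from column order; the point is only that each candidate is scored on sites it was not fitted to.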

In any case, there are statistical methods for this task that fall under the rubric of "model selection". Some of them are best subset selection, stepwise (forward or backward) selection, and lasso regression. You might want to look into these methods.
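As one illustration of these methods, the following is a rough Python sketch (NumPy only) of forward stepwise selection for a Poisson GLM, adding at each step the variable that lowers the AIC most and stopping when no addition helps. The simulated data, the AIC stopping rule, and the function names `poisson_aic` and `forward_select` are assumptions made for this example; in practice a statistics package's GLM routine would replace the hand-written fitter.

```python
import numpy as np

def poisson_aic(X, y, cols, n_iter=25):
    """AIC of a log-link Poisson GLM on columns `cols`, fitted by IRLS.
    The log(y!) term is dropped; it is the same for every model compared."""
    X1 = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta = np.zeros(X1.shape[1])
    beta[0] = np.log(y.mean())
    for _ in range(n_iter):
        eta = np.clip(X1 @ beta, -20, 20)
        mu = np.exp(eta)
        z = eta + (y - mu) / mu                     # working response
        WX = X1 * mu[:, None]                       # weights equal the fitted means
        beta = np.linalg.solve(X1.T @ WX, WX.T @ z)
    mu = np.exp(np.clip(X1 @ beta, -20, 20))
    loglik = np.sum(y * np.log(mu) - mu)
    return -2.0 * loglik + 2.0 * X1.shape[1]

def forward_select(X, y):
    """Greedy forward selection: add the variable with the largest AIC drop."""
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = poisson_aic(X, y, selected)          # intercept-only model
    while remaining:
        aic, j = min((poisson_aic(X, y, selected + [j]), j) for j in remaining)
        if aic >= best_aic:                         # no candidate improves the AIC
            break
        selected.append(j)
        remaining.remove(j)
        best_aic = aic
    return selected, best_aic

# Simulated stand-in data: only variables 0, 1, 2 affect the counts (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 40))
beta_true = np.zeros(40)
beta_true[:3] = [0.5, -0.4, 0.3]
y = rng.poisson(np.exp(1.0 + X @ beta_true))

selected, aic = forward_select(X, y)
print("selected variables, in order of entry:", selected)
```

AIC-based stepwise selection tends to admit a few irrelevant variables as well, and p-values from the final model are optimistic because the same data were used to choose it, so the result is better read as a shortlist than as a confirmed model.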
