Solved – Proper variable selection method for glm

feature selection, generalized linear model

I have a mixed model with a continuous outcome variable and a certain number of predictors. Some need to be included in the model no matter what (sex, age, and a "main factor"), and others must be selected from a list of potential confounders.

I know some software packages have very well developed procedures to do proper variable selection, but I am looking for a simple and "reasonable" method to select the variables manually.

The strategy used until now has been to first run simple linear regressions with each predictor separately, and then fit a multiple regression that includes every potential confounder whose p-value in the simple regression was ≤ .250. I'm not sure whether this threshold is commonly used, and I don't know what threshold to use to "bump out" the variables which don't contribute to the model. I might add that I have a good sample size (500), but some of the variables have missing values which may not be MCAR (missing completely at random), hence the need to be parsimonious.
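For concreteness, here is roughly what that screening step looks like in R. The data frame dat, the outcome y, and the candidate names x1..x3 are placeholders for my actual variables, and the screening assumes numeric or binary candidates (a single coefficient each):

    # Hypothetical data frame `dat` with continuous outcome y, forced-in
    # variables sex, age and main_factor, and candidate confounders x1..x3.
    candidates <- c("x1", "x2", "x3")

    # p-value of each candidate in a simple (one-predictor) regression
    screen_p <- sapply(candidates, function(v) {
      fit <- lm(reformulate(v, response = "y"), data = dat)
      coef(summary(fit))[2, "Pr(>|t|)"]
    })

    # keep the candidates passing the 0.250 screening threshold
    keep <- candidates[screen_p <= 0.250]

    # multiple regression: forced-in variables plus the screened confounders
    full <- lm(reformulate(c("sex", "age", "main_factor", keep), response = "y"),
               data = dat)
    summary(full)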

My efforts to find clear and simple guidelines were not successful. I thank you in advance for sharing your advice.

Best Answer

Slightly better than your current method is forward stepwise regression. Please do read the criticism of stepwise selection, though: it covers some of the many reasons why I don't like it (note that most of those reasons also apply to your current method, which draws even more criticism).

The bottom line is to add one variable at a time (obviously the one with the most evidence for inclusion, i.e. the smallest p-value in a likelihood ratio test or similar) up to some stopping point. Once you stop adding, it is customary to perform a clean-up step, that is: remove from the model any variables whose p-values have risen above some threshold. The advantage of this method is that you can easily guarantee that some variables end up in your model: you simply start with those variables in the model and exclude them from the clean-up. In a similar fashion you can also add interaction terms once you're finished with the main effects.
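A rough sketch of one round of that procedure in R, using add1() and drop1() with F-tests (the lm analogue of a likelihood ratio test); the variable names are the same illustrative placeholders as above:

    # Start from the variables that must be in the model no matter what.
    base  <- lm(y ~ sex + age + main_factor, data = dat)
    scope <- ~ sex + age + main_factor + x1 + x2 + x3

    # One forward step: test every candidate not yet in the model.
    add1(base, scope = scope, test = "F")

    # Add the candidate with the smallest p-value, then repeat add1()
    # until no candidate passes your entry threshold.
    base <- update(base, . ~ . + x2)   # e.g. if x2 had the smallest p-value

    # Clean-up: drop1() shows what each current term contributes; remove
    # candidates (never the forced-in variables) whose p-values are too large.
    drop1(base, test = "F")

If you would rather have R automate the loop, step() with scope = list(lower = ~ sex + age + main_factor, upper = scope) also keeps the forced-in variables in the model, though it selects by AIC rather than by p-values.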

If you're willing to go one step further, you can use any of the modern penalized regression techniques (LASSO, ridge, ...). These cannot be applied manually, but software like R makes them easy to use (see the glmnet package).
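As a sketch of how that could look with glmnet, assuming the forced-in variables should escape the penalty (penalty.factor = 0 exempts a column, so those variables can never be shrunk out of the model); variable names are again placeholders:

    library(glmnet)

    # Design matrix; model.matrix expands factors to dummy columns, and the
    # intercept column is dropped because glmnet adds its own.
    X <- model.matrix(~ sex + age + main_factor + x1 + x2 + x3, data = dat)[, -1]
    y <- dat$y

    # Zero penalty for the forced-in variables; grepl() catches dummy-coded
    # factor columns such as "sexmale".
    pf <- as.numeric(!grepl("^(sex|age|main_factor)", colnames(X)))

    cvfit <- cv.glmnet(X, y, alpha = 1, penalty.factor = pf)  # alpha = 1: LASSO
    coef(cvfit, s = "lambda.1se")   # nonzero rows = selected variables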

With regard to your missing data, especially since you are asking for a 'manual' technique, I doubt you'll find one that properly accounts for it. One of the easiest solutions that is statistically correct is multiple imputation, but that will require a lot of work to do manually.
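In R, the mice package automates most of that work. A minimal sketch, again with placeholder variable names and an illustrative model formula:

    library(mice)

    # Create m completed datasets, fit the model in each, and pool the
    # results with Rubin's rules.
    imp  <- mice(dat, m = 20, seed = 1)                       # 20 imputations
    fits <- with(imp, lm(y ~ sex + age + main_factor + x2))   # fit per dataset
    summary(pool(fits))                                       # pooled estimates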