Solved – Variable Selection for Negative Binomial Regression

Tags: count-data, negative-binomial-distribution, r, regression

First off, I apologize that I cannot share the code or details about the variables for this project. I am new to statistics and am working on a project using count data, so I want to make sure I am going about this correctly and would appreciate any feedback. I am trying to build a predictive model using count data. I have done the following:

1) Looked at the distribution of the dependent variable as well as the independent variables; most if not all follow the negative binomial distribution. Given that the data follows this distribution and I am using count data, I decided to use negative binomial regression. I additionally checked how the mean of the target variable compared to its variance, and since they were not equal, I ruled out Poisson regression.

2) In R I used the MASS package, specifically the glm.nb function with the syntax glm.nb(y ~ ., data = data). For the initial run I included all of the variables.

3) For the remaining runs, I removed one variable at a time (the one with the highest p-value) and re-ran the model until there were no p-values above 0.05. For each iteration I logged the variable that was removed and the AIC of that model.

At this point I am not sure what is the "correct" next step for this model.

My main questions are:
1) What are some of the next steps that I should take?
– I was thinking of plotting the actual values against the predicted values for each model.
– I was thinking of plotting the residuals against each predicted value to see how far off the model was.
– The only reason I was planning to do this was because I did this in the past for OLS; however, I wanted to confirm whether this is applicable to negative binomial regression as well.

2) What are some evaluation techniques I should use? I was thinking that since I have 150K+ rows of data I could use a train/test split rather than cross-validation, but I would also be curious to hear any thoughts on this topic.

Thank you.

Best Answer

1) Looked at the distribution of the dependent variable as well as the independent variables; most if not all follow the negative binomial distribution. Given that the data follows this distribution and I am using count data, I decided to use negative binomial regression. I additionally checked how the mean of the target variable compared to its variance, and since they were not equal, I ruled out Poisson regression.

  • Looking at the distribution of the independent variables is largely irrelevant; linear modeling makes no assumptions about the predictor (independent) variables except that they're measured without error.
  • Looking at the marginal distribution of the response, or comparing its mean and variance is also largely irrelevant; you need to fit the model first, then evaluate whether the residual variance is larger than expected based on a Poisson assumption.
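The second bullet can be sketched in R with simulated data (all variable names here are illustrative, not from the question): fit a Poisson model first, then compare the Pearson residual variance to what the Poisson assumption implies.

```r
# Sketch: check for overdispersion *after* fitting a Poisson model
# (simulated data; deliberately overdispersed so the check has something to find)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rnbinom(n, mu = exp(1 + 0.5 * x), size = 1)  # negative binomial counts
d <- data.frame(y = y, x = x)

pois_fit <- glm(y ~ x, family = poisson, data = d)

# Pearson dispersion statistic: roughly 1 under the Poisson assumption;
# substantially > 1 suggests overdispersion (negative binomial may fit better)
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
dispersion
```

The key point is that the check uses the *residual* variance from a fitted model, not the marginal variance of the raw response.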

2) In R I used the MASS package, specifically the glm.nb function with the syntax glm.nb(y ~ ., data = data). For the initial run I included all of the variables.

OK.

3) For the remaining runs, I removed one variable at a time (the one with the highest p-value) and re-ran the model until there were no p-values above 0.05. For each iteration I logged the variable that was removed and the AIC of that model.

Some of the long-time denizens of this site (including me) feel that stepwise regression is generally a bad idea; it can be OK if all you care about is prediction (not inference or generating confidence intervals), but you should probably select on AIC rather than p-value in that case.
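If you do go the selection route for prediction, a minimal sketch of AIC-based backward selection with MASS::stepAIC on simulated data (variable names are illustrative; only x1 truly drives the response here):

```r
# Sketch: AIC-based backward selection instead of p-value pruning
library(MASS)
set.seed(2)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rnbinom(n, mu = exp(0.5 + 0.7 * d$x1), size = 2)  # only x1 matters

full_fit <- glm.nb(y ~ ., data = d)
step_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)
formula(step_fit)  # predictors retained by AIC
```

stepAIC handles the remove-refit-compare loop automatically, so there is no need to log AIC by hand at each step.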

At this point I am not sure what is the "correct" next step for this model.

My main questions are:
1) What are some of the next steps that I should take?
– I was thinking of plotting the actual values against the predicted values for each model.
– I was thinking of plotting the residuals against each predicted value to see how far off the model was.
– The only reason I was planning to do this was because I did this in the past for OLS; however, I wanted to confirm whether this is applicable to negative binomial regression as well.

You should certainly evaluate diagnostic plots (e.g. plot(your_model)); you should probably do this with your full model, however (whether or not you follow up with model selection). Plotting predicted and actual values against covariates of interest is generally a good idea too.
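A sketch of those diagnostics, again on simulated data (`fit` stands in for your own fitted model):

```r
# Sketch: basic diagnostics for a fitted negative binomial model
library(MASS)
set.seed(3)
n <- 500
d <- data.frame(x = rnorm(n))
d$y <- rnbinom(n, mu = exp(1 + 0.4 * d$x), size = 2)
fit <- glm.nb(y ~ x, data = d)

plot(fit)  # standard GLM diagnostic plots (residuals vs fitted, Q-Q, etc.)

# Predicted vs actual on the response (count) scale
pred <- predict(fit, type = "response")
plot(pred, d$y, xlab = "Predicted count", ylab = "Observed count")
abline(0, 1, lty = 2)  # points should scatter around this line
```

So yes, the OLS habit carries over; the main difference is that predictions live on the count scale (type = "response") rather than the linear-predictor scale.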

2) What are some evaluation techniques I should use? I was thinking that since I have 150K+ rows of data I could use a train/test split rather than cross-validation, but I would also be curious to hear any thoughts on this topic.

The general recommendation is to use k-fold cross-validation rather than a single train/test split whenever computationally feasible. However, arguably you should have done this before you embarked on selecting a model ...
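One way to sketch k-fold cross-validation for glm.nb (simulated data; the fold count and RMSE scoring rule are illustrative choices, not prescriptions):

```r
# Sketch: 5-fold cross-validation for a negative binomial model
library(MASS)
set.seed(4)
n <- 1000
d <- data.frame(x = rnorm(n))
d$y <- rnbinom(n, mu = exp(1 + 0.3 * d$x), size = 2)

k <- 5
folds <- sample(rep(1:k, length.out = n))  # random fold assignment
rmse <- numeric(k)
for (i in 1:k) {
  train <- d[folds != i, ]
  test  <- d[folds == i, ]
  cv_fit <- glm.nb(y ~ x, data = train)
  pred   <- predict(cv_fit, newdata = test, type = "response")
  rmse[i] <- sqrt(mean((test$y - pred)^2))
}
mean(rmse)  # average out-of-fold RMSE across folds
```

Crucially, if you combine this with variable selection, the selection step must happen inside each fold, on the training portion only; otherwise the out-of-fold scores are optimistically biased.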