Solved – Forward and backward stepwise regression (AIC) for negative binomial regression (with real data)

feature selectionnegative-binomial-distributionrregressionstepwise regression

I am doing some count data analysis. The data is in this link.

Column A is the count data, and other columns are the independent variables. At first I used Poisson regression to analyze it:

m0<-glm(A~.,data=d,family="poisson")
summary(m0)

We see that the residual deviance is greater than the degrees of freedom so that we have over-dispersion:

Call:
glm(formula = A ~ ., family = "poisson", data = d)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-28.8979   -4.5110    0.0384    5.4327   20.3809  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  8.7054842  0.9100882   9.566  < 2e-16 ***
B           -0.1173783  0.0172330  -6.811 9.68e-12 ***
C            0.0864118  0.0182549   4.734 2.21e-06 ***
D            0.1169891  0.0301960   3.874 0.000107 ***
E            0.0738377  0.0098131   7.524 5.30e-14 ***
F            0.3814588  0.0093793  40.670  < 2e-16 ***
G           -0.3712263  0.0274347 -13.531  < 2e-16 ***
H           -0.0694672  0.0022137 -31.380  < 2e-16 ***
I           -0.0634488  0.0034316 -18.490  < 2e-16 ***
J           -0.0098852  0.0064538  -1.532 0.125602    
K           -0.1105270  0.0128016  -8.634  < 2e-16 ***
L           -0.3304606  0.0155454 -21.258  < 2e-16 ***
M            0.2274175  0.0259872   8.751  < 2e-16 ***
N            0.2922063  0.0174406  16.754  < 2e-16 ***
O            0.1179708  0.0119332   9.886  < 2e-16 ***
P            0.0618776  0.0260646   2.374 0.017596 *  
Q           -0.0303909  0.0060060  -5.060 4.19e-07 ***
R           -0.0018939  0.0037642  -0.503 0.614864    
S            0.0383040  0.0065841   5.818 5.97e-09 ***
T            0.0318111  0.0116611   2.728 0.006373 ** 
U            0.2421129  0.0145502  16.640  < 2e-16 ***
V            0.1782144  0.0090858  19.615  < 2e-16 ***
W           -0.5105135  0.0258136 -19.777  < 2e-16 ***
X           -0.0583590  0.0043641 -13.373  < 2e-16 ***
Y           -0.1554609  0.0042604 -36.489  < 2e-16 ***
Z            0.0064478  0.0001184  54.459  < 2e-16 ***
AA           0.3880479  0.0164929  23.528  < 2e-16 ***
AB           0.1511362  0.0050471  29.945  < 2e-16 ***
AC           0.0557880  0.0181129   3.080 0.002070 ** 
AD          -0.6569099  0.0368771 -17.813  < 2e-16 ***
AE          -0.0040679  0.0003960 -10.273  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 97109.0  on 56  degrees of freedom
Residual deviance:  5649.7  on 26  degrees of freedom
AIC: 6117.1

Number of Fisher Scoring iterations: 6

Then I think I should use negative binomial regression for the over-dispersion data. Since you can see I have many independent variables, and I wanted to select the important variables. And I decide to use stepwise regression to select the independent variable. At first, I create a full model:

full.model <- glm.nb(A~., data=d,maxit=1000)
# when not indicating maxit, or maxit=100, it shows Warning messages: 1: glm.fit: algorithm did not converge; 2: In glm.nb(A ~ ., data = d, maxit = 100) : alternation limit reached

When indicating maxit=1000, the warning message disappears.

summary(full.model)

Call:
glm.nb(formula = A ~ ., data = d, maxit = 1000, init.theta = 2.730327193, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5816  -0.8893  -0.3177   0.4882   1.9073  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)   
(Intercept) 11.8228596  8.3004322   1.424  0.15434   
B           -0.2592324  0.1732782  -1.496  0.13464   
C            0.2890696  0.1928685   1.499  0.13393   
D            0.3136262  0.3331182   0.941  0.34646   
E            0.3764257  0.1313142   2.867  0.00415 **
F            0.3257785  0.1448082   2.250  0.02447 * 
G           -0.7585881  0.2343529  -3.237  0.00121 **
H           -0.0714660  0.0343683  -2.079  0.03758 * 
I           -0.1050681  0.0357237  -2.941  0.00327 **
J            0.0810292  0.0566905   1.429  0.15291   
K            0.2582978  0.1574582   1.640  0.10092   
L           -0.2009784  0.1543773  -1.302  0.19296   
M           -0.2359658  0.3216941  -0.734  0.46325   
N           -0.0689036  0.1910518  -0.361  0.71836   
O            0.0514983  0.1383610   0.372  0.70974   
P            0.1843138  0.3253483   0.567  0.57105   
Q            0.0198326  0.0509651   0.389  0.69717   
R            0.0892239  0.0459729   1.941  0.05228 . 
S           -0.0430981  0.0856391  -0.503  0.61479   
T            0.2205653  0.1408009   1.567  0.11723   
U            0.2450243  0.1838056   1.333  0.18251   
V            0.1253683  0.0888411   1.411  0.15820   
W           -0.4636739  0.2348172  -1.975  0.04831 * 
X           -0.0623290  0.0508299  -1.226  0.22011   
Y           -0.0939878  0.0606831  -1.549  0.12142   
Z            0.0019530  0.0015143   1.290  0.19716   
AA          -0.2888123  0.2449085  -1.179  0.23829   
AB           0.1185890  0.0696343   1.703  0.08856 . 
AC          -0.3401963  0.2047698  -1.661  0.09664 . 
AD          -1.3409002  0.4858741  -2.760  0.00578 **
AE          -0.0006299  0.0051338  -0.123  0.90234   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(2.7303) family taken to be 1)

    Null deviance: 516.494  on 56  degrees of freedom
Residual deviance:  61.426  on 26  degrees of freedom
AIC: 790.8

Number of Fisher Scoring iterations: 1


              Theta:  2.730 
          Std. Err.:  0.537 

 2 x log-likelihood:  -726.803 

When not indicating maxit, or maxit=100, it shows

Warning messages: 1: glm.fit: algorithm did not converge; 2: In glm.nb(A ~ ., data = d, maxit = 100) : alternation limit reached.

When indicating maxit=1000, the warning message disappears.

Then I create a first model:

first.model <- glm.nb(A ~ 1, data = d)

Then I tried the forward stepwise regression:

step.model <- step(first.model, direction="forward", scope=formula(full.model))

Error in glm.fit(X, y, wt, offset = offset, family = object\$family, control = object$control) :
NA/NaN/Inf in 'x'
In addition: Warning message:
step size truncated due to divergence

What is the problem?

I also tried the backward regression:

step.model2 <- step(full.model,direction="backward")

#the final step
Step:  AIC=770.45
A ~ B + C + E + F + G + H + I + K + L + R + T + V + W + Y + AA + 
    AB + AD

       Df Deviance    AIC
<none>      62.375 770.45
- AB    1   64.859 770.93
- H     1   65.227 771.30
- V     1   65.240 771.31
- L     1   65.291 771.36
- Y     1   65.831 771.90
- B     1   66.051 772.12
- C     1   67.941 774.01
- AA    1   69.877 775.95
- K     1   70.411 776.48
- W     1   71.526 777.60
- I     1   71.863 777.94
- E     1   72.338 778.41
- G     1   73.344 779.42
- F     1   73.510 779.58
- AD    1   79.620 785.69
- R     1   80.358 786.43
- T     1   95.725 801.80
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: algorithm did not converge 
3: glm.fit: algorithm did not converge 
4: glm.fit: algorithm did not converge 

My question is: Why it is different in using forward and backward stepwise regression? And why do I get the error message when performing forward selection? Also, what exactly do these warning messages mean? And how should I deal with it?

I am not a stats person but need to conduct statical analysis for my research data. So I am struggling in learning how to do different regression analyses using real data. I searched online for similar questions but I still could understand … And please let me know if I did anything wrong in my regression analysis. I would really appreciate it if you could help me with these questions!

Best Answer

I have good news and bad news.

good news

  • you can probably more or less disregard the warnings. Where stepwise regression is recommended at all (see below ...), backward regression is probably better than forward regression anyway.
  • you can do forward and backward stepwise regression with MASS::stepAIC() (instead of step).

bad news

  • step probably isn't doing what you think it's doing anyway. Rather than refitting the negative binomial dispersion parameter, it's re-fitting with a fixed overdispersion parameter, which is probably not what you want (there's a classically snarky e-mail from Prof. Brian Ripley from 2006 here that discusses this issue in passing). As mentioned above, stepAIC() works better.
  • if you are only interested in predictive accuracy, and not in anything about confidence intervals or hypothesis tests or measuring variable importance ... then stepwise regression might be OK (Murtaugh 2009) ...
  • but if you care at all about being able to make any inferences about the effects of the parameters, you have too many variables and not enough data. A rule of thumb is that (1) you need at least 10 times as many data points as predictor variables to do reliable inference and (2) doing any inference after selecting variables (via stepwise selection or otherwise) is very wrong [unless you do super-cutting-edge stuff that only works with huge data sets and very strong assumptions].

The big question here is: why do you want to do variable selection in the first place?

  • you're only interested in prediction: OK, but something like penalized regression (Dahlgren 2010) will probably work better
  • you're interested in inference: this is going to be tough; you almost certainly don't have enough data to tell the effects of correlated variables apart. In your situation I would probably compute the principal components (PCA) of the predictor variables and use only the first 5 (which fall within the $n/10$ rule, and explain 99.5% of the variance in the predictors ...)

Murtaugh, Paul A. “Performance of Several Variable-Selection Methods Applied to Real Ecological Data.” Ecology Letters 12, no. 10 (October 2009): 1061–68. https://doi.org/10.1111/j.1461-0248.2009.01361.x.

Dahlgren, Johan P. “Alternative Regression Methods Are Not Considered in Murtaugh (2009) or by Ecologists in General.” Ecology Letters 13, no. 5 (May 1, 2010): E7–9. https://doi.org/10.1111/j.1461-0248.2010.01460.x.