This answer comes in two parts. The first addresses the issue of the standard errors and why that implies the models are not identical, as @ndoogan observed in comments, and the second addresses, partially, the issue of when the coefficient estimates might be substantially different.
Consider, for example, a hypothesis test on the coefficient of log(Vmaj) where the null is that the coefficient equals 0.5. The two sets of model estimates would lead to rejection of the null in the Poisson case (unless testing at a very high confidence level) and failure to reject the null in the NB case.
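As a sketch of such a test (the question's fitted models and the log(Vmaj) variable are not available here, so this simulates a stand-in Poisson fit where `x` plays the role of log(Vmaj)):

```r
# Wald test of H0: beta = 0.5 on a simulated Poisson fit
# (hypothetical stand-in for the question's models)
set.seed(1)
x <- rnorm(200)
y <- rpois(200, exp(1 + 0.6 * x))
fit <- glm(y ~ x, family = poisson)

est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
z   <- (est - 0.5) / se          # Wald statistic against beta0 = 0.5
p   <- 2 * pnorm(-abs(z))        # two-sided p-value
c(z = z, p = p)
```

Whether the null is rejected then hinges entirely on the standard error in the denominator, which is the point of the comparison below.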
More generally, there is more to a set of estimates than just the point estimates. In the presence of (negative-binomial-like) overdispersion, the standard errors produced by the negative binomial model are more accurate estimates of the underlying standard deviation of the coefficient estimates. Thus, the NB model as a whole is more accurate.
For example, a simple model with $\mu_i = \exp\{1 + x_i\}$, and $y_i \sim NB$ with mean $\mu_i$ and variance $5\mu_i$:
library(MASS)  # for glm.nb

N <- 1000
mu <- exp(1 + (x <- rnorm(N)))
# Choose size r and prob p so the NB mean is mu and the variance is mu/p = 5*mu
p <- 0.2
r <- mu * p / (1 - p)
y <- rnbinom(N, r, p)
poisson_summary <- summary(glm(y ~ x, family = "poisson"))
nb_summary <- summary(glm.nb(y ~ x))

# Parametric bootstrap calculation of the s.e. of the coefficient of x
coef_x <- rep(0, 1000)
for (i in 1:length(coef_x)) {
  mu <- exp(1 + (x <- rnorm(N)))
  r <- mu * p / (1 - p)
  y <- rnbinom(N, r, p)
  coef_x[i] <- summary(glm(y ~ x, family = "poisson"))$coefficients["x", "Estimate"]
}
data.frame("Sim. SE" = sd(coef_x - 1),
           "Poisson SE" = poisson_summary$coefficients["x", "Std. Error"],
           "N.B. SE" = nb_summary$coefficients["x", "Std. Error"])
Sim..SE Poisson.SE N.B..SE
1 0.03492467 0.01273354 0.03984743
where you can see that the simulated standard deviation of the coefficient estimate is roughly 3x the Poisson-based standard error, and is estimated much better by the NB-based standard error.
The coefficient estimates are pretty much the same, as one might expect with this sample size and the std. errors above, so I won't take up space by displaying them.
Additionally, although this is often a minor effect when overdispersion is low, by weighting the observations with more accurate weights (i.e., better estimates of observation-specific variances), the parameter estimates themselves will be more accurate, well, asymptotically at any rate. The rule of thumb I learned long ago was that heteroskedasticity adjustments (for that is what they are, in essence) don't buy you much unless the differences between weights are on the order of 5x or more.
Note, however, that in small samples you may well get more accurate (in MSE terms) point estimates with the Poisson model if there isn't much overdispersion, because you are reducing the variability induced by estimating the dispersion parameter. Of course, this is almost certainly more than offset by the loss of accuracy in the standard errors of the coefficient estimates.
If you only have two categorical variables and you include the interaction between the two, then nbreg and anova are equivalent in the sense that they produce exactly the same predicted values. As a consequence, both models are "equally right" or "equally wrong".
sysuse nlsw88, clear
gen byte black = race == 2 if race < 3
anova wage black##union
predict yhat_anova
nbreg wage black##union
predict yhat_nbreg
tab yhat_anova yhat_nbreg
Fitted | Predicted number of events
values | 5.968046 7.5821 8.614504 8.735929 | Total
-----------+--------------------------------------------+----------
5.968046 | 350 0 0 0 | 350
7.5821 | 0 1,051 0 0 | 1,051
8.614504 | 0 0 151 0 | 151
8.735929 | 0 0 0 302 | 302
-----------+--------------------------------------------+----------
Total | 350 1,051 151 302 | 1,854
This is not just true for a model with two categorical variables and their interaction; it is true for any fully saturated model, i.e. any model that includes only categorical variables and all interactions, including all higher-order interactions.
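The same equivalence can be verified in R; here is a minimal sketch on simulated data (not the nlsw88 data) comparing a linear model and a negative binomial model, both fully saturated:

```r
# With two factors and their interaction (a saturated model), the fitted
# values of a linear model and an NB model both equal the cell means.
library(MASS)  # for glm.nb

set.seed(1)
d <- expand.grid(black = factor(0:1), union = factor(0:1))
d <- d[rep(1:4, each = 200), ]
mu <- with(d, exp(2 + 0.1 * (black == "1") + 0.2 * (union == "1")))
d$wage <- rnbinom(nrow(d), size = 2, mu = mu)  # simulated count outcome

m_lin <- lm(wage ~ black * union, data = d)
m_nb  <- glm.nb(wage ~ black * union, data = d)

# Both sets of fitted values coincide (both equal the observed cell means)
all.equal(unname(fitted(m_lin)), unname(fitted(m_nb)), tolerance = 1e-6)
```

The two models agree on the predictions while parameterizing them differently, additively versus multiplicatively, which is the point developed below.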
Edit:
In order to see why these models are just different ways of saying the same thing, it is easiest to go back to basics. Consider the example above: both models model the mean outcome, in this case the mean wage. There are only 4 means to model: black non-union, black union, white non-union, and white union. Both models capture these 4 means with 4 parameters (constant, main effect for union, main effect for black, and the interaction term). As a consequence, neither model imposes any constraint on these means.
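To spell out the mapping for the linear (anova/regression) parameterization, the four coefficients and the four cell means determine each other one-to-one:

$$\begin{aligned}
E[\text{wage} \mid \text{white, nonunion}] &= \beta_0\\
E[\text{wage} \mid \text{black, nonunion}] &= \beta_0 + \beta_{black}\\
E[\text{wage} \mid \text{white, union}] &= \beta_0 + \beta_{union}\\
E[\text{wage} \mid \text{black, union}] &= \beta_0 + \beta_{black} + \beta_{union} + \beta_{black \times union}
\end{aligned}$$

The multiplicative model does exactly the same with sums replaced by products of exponentiated coefficients, so both reproduce the four cell means without constraint.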
Let's start by just looking at these means:
table union black, c(mean wage)
------------------------------
union | black
worker | 0 1
----------+-------------------
nonunion | 7.5821 5.968046
union | 8.735929 8.614504
------------------------------
So, among non-union members there is a noticeable difference in mean wage between black and white persons, while among union members this difference is a lot smaller.
Let's look at the regression output:
. reg wage i.union##i.black
Source | SS df MS Number of obs = 1854
-------------+------------------------------ F( 3, 1850) = 29.83
Model | 1472.83336 3 490.944452 Prob > F = 0.0000
Residual | 30445.6086 1850 16.4570858 R-squared = 0.0461
-------------+------------------------------ Adj R-squared = 0.0446
Total | 31918.442 1853 17.225279 Root MSE = 4.0567
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
union |
union | 1.153829 .2648625 4.36 0.000 .6343677 1.673289
1.black | -1.614053 .2503572 -6.45 0.000 -2.105066 -1.123041
|
union#black |
union#1 | 1.492629 .4755625 3.14 0.002 .5599329 2.425324
|
_cons | 7.5821 .1251339 60.59 0.000 7.336681 7.827518
------------------------------------------------------------------------------
You can see that the mean wage for non-union white persons is the constant. This mean is 1.61 dollars/hour less for black persons. So the wage for black persons is:
di _b[_cons]+_b[1.black]
5.9680463
This corresponds with the mean reported in the table above. The negative effect of being black is reduced by 1.49 dollars/hour for union members, i.e. black union members earn
di _b[1.black] + _b[1.union#1.black]
-.12142489
dollars/hour less than white union members, a number you can also recover from the table of means above.
Now consider a multiplicative model, in this case a Poisson model with robust standard errors (for a justification of that choice, see here).
. poisson wage i.union##i.black, irr vce(robust)
note: you are responsible for interpretation of noncount dep. variable
Iteration 0: log pseudolikelihood = -5239.2159
Iteration 1: log pseudolikelihood = -5239.2159
Poisson regression Number of obs = 1854
Wald chi2(3) = 106.56
Prob > chi2 = 0.0000
Log pseudolikelihood = -5239.2159 Pseudo R2 = 0.0188
------------------------------------------------------------------------------
| Robust
wage | IRR Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
union |
union | 1.152178 .0360531 4.53 0.000 1.083638 1.225053
1.black | .7871232 .0265286 -7.10 0.000 .7368083 .840874
|
union#black |
union#1 | 1.252791 .0759936 3.72 0.000 1.112359 1.410951
|
_cons | 7.5821 .1308473 117.39 0.000 7.329932 7.842942
------------------------------------------------------------------------------
Again we can see that the mean wage of white non-union members is 7.5821. Black non-union persons earn $(0.787 - 1) \times 100\% = -21\%$, i.e. 21% less than white non-union persons, a number you can recover from the table of means: $(5.968046 - 7.5821) / 7.5821 \times 100\% = -21\%$. This negative effect of being black increases, that is, becomes less negative, by 25% if someone is a union member. So the effect of being black for union members is $1.25 \times .787 = .98$, i.e. black union members earn 2% less than white union members, a number you could also recover from the table of means or the regression output, both of which I will leave as an exercise to the reader.
Best Answer
The Poisson and negative binomial (NB) models are nested: the Poisson is the special case with theta = infinity. So a likelihood ratio test comparing the two models tests the null hypothesis "theta = infinity" against the alternative "theta < infinity". Here the fitted log-likelihoods are -1328.642 for the Poisson model and -865.6289 for the NB model.
Thus, the fitted log-likelihood in the NB model is much larger/better using just one additional parameter (4 regression coefficients plus 1 theta). The value of m1$theta is 1.032713, i.e., essentially corresponding to a geometric distribution (theta = 1). So clearly there is overdispersion compared to the Poisson distribution, with a theta much lower than infinity. (My personal rule of thumb is that everything beyond 10 is already rather close to infinity.)

The likelihood ratio test statistic is then twice the absolute difference of log-likelihoods (2 * (1328.642 - 865.6289) = 926.0262) and has to be compared with a chi-squared distribution with degrees of freedom (df) equal to the difference in df's between the two models (5 - 4 = 1). This is what the code above does. However, it may be somewhat confusing that the difference of logLik() objects retains the "logLik" class and hence displays with a label and df. Applying as.numeric() to drop the class clarifies this, and then you can compute the p-value by hand.
There are also functions that carry out such likelihood ratio tests in a generic way, e.g., lrtest() in lmtest (among others).

As a final detail: the likelihood ratio test of Poisson vs. NB is non-standard because the parameter theta = infinity is on the boundary of the parameter space. Therefore, the p-value from the chi-squared distribution has to be halved (similar to a one-sided test). Of course, if you already have a significant result without halving, it's still significant afterwards...