Solved – R/Stata package for zero-truncated negative binomial GEE

count-data, panel-data, r, stata, truncation

This is my first post. I'm truly grateful for this community.

I am trying to analyze longitudinal count data that are zero-truncated (the probability that the response variable equals 0 is 0), and the mean does not equal the variance, so I chose a negative binomial distribution over a Poisson.

Functions/commands I've ruled out:

The gee() function in R accounts for neither zero-truncation nor the negative binomial distribution (not even with the MASS package loaded).
glm.nb() in R doesn't allow for different correlation structures.
vglm() from the VGAM package can make use of the posnegbinomial family, but it has the same problem as Stata's ztnb command (see below): I can't refit the models using a non-independent correlation structure.

Stata

If the data weren't longitudinal, I could just use Stata's ztnb command to run my analysis, but that command assumes that my observations are independent.

I've also ruled out GLMM for various methodological/philosophical reasons.

For now, I've settled on Stata's xtgee command (yes, I know that xtnbreg also does the same thing), which takes into account both non-independent correlation structures and the negative binomial family, but not the zero-truncation. The added benefit of using xtgee is that I can also calculate QIC values (using the qic command) to determine the best-fitting correlation structures for my response variables.

If there is a package/command in R or Stata that can take 1) the negative binomial family, 2) GEE and 3) zero-truncation into account, I'm dying to know.

I'd greatly appreciate any ideas you may have. Thank you.

-Casey

Best Answer

For R two options spring to mind, both of which I am only vaguely familiar with at best.

The first is the pscl package, which can fit zero-inflated and hurdle models in a very nice, flexible manner. The pscl package suggests the use of the sandwich package, which provides "Model-robust standard error estimators for cross-sectional, time series and longitudinal data". So you could fit your count model and then use the sandwich package to estimate an appropriate covariance matrix for the coefficient estimates, taking into account the longitudinal nature of the data.

The second option might be to look at the geepack package, which looks like it can do what you want, but only for a negative binomial model with known theta, as it will fit any type of GLM that R's glm() function can (so use the family function from MASS).

A third option has raised its head: gamlss and its add-on package gamlss.tr. The latter includes a function gen.trun() that can turn any of the distributions supported by gamlss() into a truncated distribution in a flexible way; for example, you can specify a negative binomial distribution left-truncated at 0. gamlss() itself includes support for random effects, which should take care of the longitudinal nature of the data. It isn't immediately clear, however, whether you have to use at least one smooth function of a covariate in the model or can just model everything as linear functions, as in a GLM.
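To make concrete what "left truncated at 0" does to a negative binomial, here is a minimal sketch in plain Python (standard library only). It is not a call into gamlss.tr, VGAM or Stata; the function names and the values of mu and theta are made up for illustration. The truncated probabilities are the ordinary negative binomial probabilities for y = 1, 2, ... divided by 1 - P(Y = 0), so that they sum to one again.

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu and size (dispersion) theta."""
    log_p = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def ztnb_pmf(y, mu, theta):
    """Zero-truncated pmf: renormalise the NB over y = 1, 2, ..."""
    if y < 1:
        return 0.0
    return nb_pmf(y, mu, theta) / (1.0 - nb_pmf(0, mu, theta))

# Illustrative parameter values only (assumptions, not from the question).
mu, theta = 2.0, 1.5
support = range(1, 500)  # the tail beyond 500 is negligible for these values

total = sum(ztnb_pmf(y, mu, theta) for y in support)
trunc_mean = sum(y * ztnb_pmf(y, mu, theta) for y in support)

print("sum of truncated pmf:", total)
print("mean of truncated distribution:", trunc_mean)
print("mu / (1 - P(Y = 0)):", mu / (1.0 - nb_pmf(0, mu, theta)))
```

The last two printed values agree: the mean of the truncated distribution is mu / (1 - P(Y = 0)), which is larger than mu. That is why ignoring the truncation (as with a plain negative binomial GEE) changes what the fitted mean represents.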

Related Solutions

Tobit Regression – Understanding Margins Contrast After Tobit Regression

Here's the explanation of what contrasting margins means.

Let's fit a toy Tobit model (you could also use intreg), where we interact the foreign dummy with weight:

sysuse auto, clear
generate wgt=weight/1000
tobit mpg i.foreign##c.wgt c.headroom, ll(17) ul(30)

This yields:

Tobit regression                                Number of obs     =         74
                                                LR chi2(4)        =      91.39
                                                Prob > chi2       =     0.0000
Log likelihood = -138.22086                     Pseudo R2         =     0.2484

-------------------------------------------------------------------------------
          mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
      foreign |
     Foreign  |   10.16688   5.332589     1.91   0.061     -.468634    20.80239
          wgt |  -6.120729   .8351949    -7.33   0.000    -7.786473   -4.454986
              |
foreign#c.wgt |
     Foreign  |  -5.356987   2.229552    -2.40   0.019    -9.803689   -.9102848
              |
     headroom |  -.5758296    .503259    -1.14   0.256    -1.579548    .4278888
        _cons |   41.58485   2.453002    16.95   0.000     36.69249    46.47721
--------------+----------------------------------------------------------------
       /sigma |   2.945599   .3107564                      2.325815    3.565383
-------------------------------------------------------------------------------
            18  left-censored observations at mpg <= 17
            49     uncensored observations
             7 right-censored observations at mpg >= 30

Now we will take the derivative of mpg with respect to wgt as if all cars were foreign and subtract from that the derivative of mpg with respect to wgt as if all cars were domestic (r. means relative to the base level of foreign):

. margins r.foreign, dydx(wgt) predict(ystar(17,30))

Contrasts of average marginal effects
Model VCE    : OIM

Expression   : E(mpg*|17<mpg<30), predict(ystar(17,30))
dy/dx w.r.t. : wgt

------------------------------------------------
             |         df        chi2     P>chi2
-------------+----------------------------------
wgt          |
     foreign |          1        0.43     0.5134
------------------------------------------------

------------------------------------------------------------------------
                       |   Contrast Delta-method
                       |      dy/dx   Std. Err.     [95% Conf. Interval]
-----------------------+------------------------------------------------
wgt                    |
               foreign |
(Foreign vs Domestic)  |  -.3044572   .4658221     -1.217452    .6085373
------------------------------------------------------------------------

This tells you that the difference in the censored mpg-wgt slope between foreign cars and domestic cars is -.3: a 1000 lbs increase in weight is associated with an additional .3 mpg reduction in efficiency for foreign cars compared to domestic, but that gap is not statistically different from zero. Notice how different that is compared to the effect

We can also do things by hand in two steps (first get the two derivatives, and then take their difference):

. margins, dydx(wgt) at(foreign=(0 1)) predict(ystar(17,30)) post

Average marginal effects                        Number of obs     =         74
Model VCE    : OIM

Expression   : E(mpg*|17<mpg<30), predict(ystar(17,30))
dy/dx w.r.t. : wgt

1._at        : foreign         =           0

2._at        : foreign         =           1

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wgt          |
         _at |
          1  |  -4.237398   .3787365   -11.19   0.000    -4.979708   -3.495088
          2  |  -4.541855   .2690925   -16.88   0.000    -5.069267   -4.014443
------------------------------------------------------------------------------

. lincom _b[2._at]-_b[1._at]

 ( 1)  - [wgt]1bn._at + [wgt]2._at = 0

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.3044572   .4658221    -0.65   0.513    -1.217452    .6085373
------------------------------------------------------------------------------

This gets you the same answer.
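As a quick arithmetic check on the two outputs, the sketch below (plain Python, used here only as a calculator) recomputes the contrast from the two average marginal effects and the test statistics from the reported standard error. The standard error is copied from the output rather than recomputed, because it depends on the covariance between the two marginal effects, which is not shown.

```python
import math

# Values copied from the margins and lincom output above.
slope_domestic = -4.237398   # dy/dx of wgt at foreign = 0
slope_foreign = -4.541855    # dy/dx of wgt at foreign = 1
se_contrast = 0.4658221      # delta-method SE reported for the contrast

contrast = slope_foreign - slope_domestic
z = contrast / se_contrast
chi2 = z ** 2                               # 1-df Wald statistic
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

print(round(contrast, 4), round(z, 2), round(chi2, 2), round(p_value, 3))
```

The squared z statistic from lincom is the chi2 reported by margins r.foreign, which is why both approaches give the same p-value.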

Solved – Zero inflated negative binomial in Stata

The idea of zero-inflated models is not that there are a lot of zeros in the dependent variable. Rather, it is the idea that there are two separate processes in the data which can lead to an observation of zero. In one process, the observations do not participate in the count process, so they could never have observed outcomes $Y_i \ne 0$ (call this the zero-inflation process). In the other, the observations do participate in the count process, but have a count of zero. This, clearly, could lead to an excess of zeros, since there are two distinct processes for observing a zero.

For example, suppose I am interested in the number of times students in a high school who qualify for free lunch actually eat the school lunch. There could be two reasons that a student would have an observation of zero. First, they could have never turned in the form for free lunch, and thus, although they qualify, are never observed eating a free lunch. These students may eat school lunch a lot, but pay for it, so are never observed to eat free school lunch. Basically, they are unable to participate in the count process. Second, a student may qualify, complete the form, and be able every day to get a free lunch. But they have a zero because they bring lunch from home every day. These types of students can participate in the count process, and so the reason they have an observation of zero is totally different from that first group. The first group's observations are zero and cannot be non-zero. In the second group, some are zero, but could have been non-zero. Suppose, further, that we know students are less likely to complete their free lunch form as they get older. Thus, grade level is a good predictor of "zero-inflation" in this case.
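The two-process idea can be written as a mixture, sketched below in plain Python (standard library only, not Stata's zinb). The names and the values of pi_zero, mu and theta are hypothetical: pi_zero stands for the share of students who never turned in the form, and the negative binomial part describes the students who can take part in the count process.

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu and size (dispersion) theta."""
    log_p = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def zinb_pmf(y, pi_zero, mu, theta):
    """Mixture: always zero with probability pi_zero, otherwise an NB count."""
    count_part = (1.0 - pi_zero) * nb_pmf(y, mu, theta)
    return pi_zero + count_part if y == 0 else count_part

# Illustrative values only (assumptions, not estimates from any data).
pi_zero, mu, theta = 0.3, 2.0, 1.5

structural = pi_zero                              # zeros that could never be non-zero
sampling = (1.0 - pi_zero) * nb_pmf(0, mu, theta)  # zeros from the count process
total = sum(zinb_pmf(y, pi_zero, mu, theta) for y in range(0, 500))

print("P(zero, cannot participate):", structural)
print("P(zero, from count process):", sampling)
print("P(Y = 0) overall:", zinb_pmf(0, pi_zero, mu, theta))
print("sum of pmf:", total)
```

Setting pi_zero to 0 recovers the ordinary negative binomial, which is the case where all zeros come from the count process and no zero-inflation part is needed.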

For your data, you need to figure out if there are two processes leading to 0 disease cases by week, one in which only a zero is possible, and one in which zero is possible as part of a count process. I'm not sure what this might be in your case, but you know your data and can explore it to see if this is the case. If the zeros in your data are all a result of a count process (i.e., a case is zero, but could have been non-zero), then a zero inflation model is not appropriate. A regular negative binomial model is fine.

To your second question: From this discussion, it follows that you want to include variables that could predict the first zero process, the zero inflation process that leads to some cases only having 0 as a possible outcome. In the case of my example, I would include grade or age as a predictor of zero-inflation.