Solved – Convert SAS NLMIXED code for zero-inflated gamma regression to R

gamlss, r, sas

I'm trying to run a zero-inflated regression for a continuous response variable in R. I'm aware of a gamlss implementation, but I'd really like to try out this algorithm by Dale McLerran that is conceptually a bit more straightforward. Unfortunately, the code is in SAS and I'm not sure how to re-write it for something like nlme.

The code is as follows:

proc nlmixed data=mydata;
  parms b0_f=0 b1_f=0 
        b0_h=0 b1_h=0 
        log_theta=0;


  eta_f = b0_f + b1_f*x1 ;
  p_yEQ0 = 1 / (1 + exp(-eta_f));


  eta_h = b0_h + b1_h*x1;
  mu    = exp(eta_h);
  theta = exp(log_theta);
  r = mu/theta;


  if y=0 then
     ll = log(p_yEQ0);
  else
     ll = log(1 - p_yEQ0)
          - lgamma(theta) + (theta-1)*log(y) - theta*log(r) - y/r;


  model y ~ general(ll);
  predict (1 - p_yEQ0)*mu out=expect_zig;
  predict r out=shape;
  estimate "scale" theta;
run;

From: http://listserv.uga.edu/cgi-bin/wa?A2=ind0805A&L=sas-l&P=R20779

ADD:

Note: There are no mixed effects present here – only fixed.

The advantage to this fitting is that (even though the coefficients are the same as if you separately fit a logistic regression to P(y=0) and a gamma error regression with log link to E(y | y>0)) you can estimate the combined function E(y) which includes the zeroes. One can predict this value in SAS (with a CI) using the line predict (1 - p_yEQ0)*mu .

Further, one is able to write custom contrast statements to test the significance of predictor variables on E(y). For example, here is another version of the SAS code I have used:

proc nlmixed data=TestZIG;
      parms b0_f=0 b1_f=0 b2_f=0 b3_f=0
            b0_h=0 b1_h=0 b2_h=0 b3_h=0
            log_theta=0;


        if gifts = 1 then x1=1; else x1 =0;
        if gifts = 2 then x2=1; else x2 =0;
        if gifts = 3 then x3=1; else x3 =0;


      eta_f = b0_f + b1_f*x1 + b2_f*x2 + b3_f*x3;
      p_yEQ0 = 1 / (1 + exp(-eta_f));

      eta_h = b0_h + b1_h*x1 + b2_h*x2 + b3_h*x3;
      mu    = exp(eta_h);
      theta = exp(log_theta);
      r = mu/theta;

      if amount=0 then
         ll = log(p_yEQ0);
      else
         ll = log(1 - p_yEQ0)
              - lgamma(theta) + (theta-1)*log(amount) - theta*log(r) - amount/r;

      model amount ~ general(ll);
      predict (1 - p_yEQ0)*mu out=expect_zig;
      estimate "scale" theta;
    run;

Then to estimate "gift1" versus "gift2" (b1 versus b2) we can write this estimate statement:

estimate "gift1 versus gift 2" 
 (1-(1 / (1 + exp(-b0_f -b1_f))))*(exp(b0_h + b1_h)) - (1-(1 / (1 + exp(-b0_f -b2_f))))*(exp(b0_h + b2_h)) ;

Can R do this?

Best Answer

Having spent some time on this code, it appears to me as though it basically:

1) Does a logistic regression with right hand side b0_f + b1_f*x1 and y > 0 as a target variable,

2) For those observations for which y > 0, performs a regression with right hand side b0_h + b1_h*x1, a Gamma likelihood and link=log,

3) Also estimates the shape parameter of the Gamma distribution.

It maximizes the likelihood jointly, which is nice, because you only have to make the one function call. However, the likelihood separates anyway, so you don't get improved parameter estimates as a result.
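If you do want the joint fit spelled out, the NLMIXED log-likelihood can be written as an R function and handed to optim(). What follows is only a sketch under assumptions: the simulated data, the function name zig_negll, the starting values and the choice of BFGS are mine, not part of McLerran's code. The parameter order (b0_f, b1_f, b0_h, b1_h, log_theta) and the P(y = 0) parameterisation follow the SAS program.

```r
# Sketch only: simulated data and all names below are illustrative assumptions.
set.seed(1)
n  <- 500
x1 <- runif(n)
is_zero <- rbinom(n, 1, plogis(-1 + 1.5 * x1)) == 1
y  <- ifelse(is_zero, 0, rgamma(n, shape = 2, scale = exp(0.5 + 0.8 * x1) / 2))

# Negative log-likelihood in the SAS parameterisation:
# par = (b0_f, b1_f, b0_h, b1_h, log_theta)
zig_negll <- function(par, y, x1) {
  eta_f <- par[1] + par[2] * x1          # linear predictor for P(y = 0)
  mu    <- exp(par[3] + par[4] * x1)     # E(y | y > 0)
  theta <- exp(par[5])                   # gamma shape
  r     <- mu / theta                    # gamma scale
  pos   <- y > 0
  ll_zero <- plogis(eta_f[!pos], log.p = TRUE)             # log P(y = 0)
  ll_pos  <- plogis(-eta_f[pos], log.p = TRUE) +           # log(1 - P(y = 0))
             dgamma(y[pos], shape = theta, scale = r[pos], log = TRUE)
  -(sum(ll_zero) + sum(ll_pos))
}

fit <- optim(rep(0, 5), zig_negll, y = y, x1 = x1, method = "BFGS",
             hessian = TRUE, control = list(maxit = 500))
est <- setNames(fit$par, c("b0_f", "b1_f", "b0_h", "b1_h", "log_theta"))
se  <- sqrt(diag(solve(fit$hessian)))    # Wald standard errors from the Hessian
cbind(estimate = est, se = se)

# Analogue of: predict (1 - p_yEQ0)*mu
x_new <- c(0.25, 0.75)
expect_zig <- plogis(-(fit$par[1] + fit$par[2] * x_new)) *
              exp(fit$par[3] + fit$par[4] * x_new)
expect_zig
```

Because the likelihood separates, the first four estimates should agree (up to optimizer tolerance) with a logistic regression on the zero indicator and a Gamma GLM with log link on the positive values.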

Here is some R code which makes use of the glm function to save programming effort. This may not be what you'd like, as it obscures the algorithm itself. The code certainly isn't as clean as it could / should be, either.

McLerran <- function(y, x)
{
  z <- y > 0
  y.gt.0 <- y[y>0]
  x.gt.0 <- x[y>0]

  m1 <- glm(z~x, family=binomial)
  m2 <- glm(y.gt.0~x.gt.0, family=Gamma(link=log))

  list("p.ygt0"=m1,"ygt0"=m2)
}

# Sample data
x <- runif(100)
y <- rgamma(100, 3, 1)      # Not a function of x (coef. of x = 0)
b <- rbinom(100, 1, 0.5*x)  # p(y==0) is a function of x
y[b==1] <- 0

foo <- McLerran(y,x)
summary(foo$ygt0)

Call:
glm(formula = y.gt.0 ~ x.gt.0, family = Gamma(link = log))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.08888  -0.44446  -0.06589   0.28111   1.31066  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.2033     0.1377   8.737 1.44e-12 ***
x.gt.0       -0.2440     0.2352  -1.037    0.303    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for Gamma family taken to be 0.3448334)

    Null deviance: 26.675  on 66  degrees of freedom
Residual deviance: 26.280  on 65  degrees of freedom
AIC: 256.42

Number of Fisher Scoring iterations: 6

The shape parameter for the Gamma distribution can be estimated as 1 / the dispersion parameter reported for the Gamma family. Coefficients and other quantities which you might like to access programmatically are available from the individual elements of the returned list:
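As a small illustration of the shape calculation, here is a sketch on simulated data (the data and variable names are assumptions for the example). Note that summary() reports a Pearson-based dispersion, so 1 / dispersion is a moment-type estimate; gamma.shape() from the MASS package gives a maximum likelihood estimate, which is closer in spirit to what NLMIXED reports for theta.

```r
# Sketch only: simulated data; object names are illustrative assumptions.
set.seed(2)
x <- runif(200)
y <- rgamma(200, shape = 3, scale = exp(1 + 0.5 * x) / 3)

m <- glm(y ~ x, family = Gamma(link = "log"))

shape_mom <- 1 / summary(m)$dispersion     # from the Pearson dispersion estimate
shape_mle <- MASS::gamma.shape(m)$alpha    # maximum likelihood estimate of the shape

c(moment = shape_mom, mle = shape_mle)
```

The two numbers are usually close but not identical.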

> coefficients(foo$p.ygt0)
(Intercept)           x 
   2.140239   -2.393388

Prediction can be done using the output of the routine. Here's some more R code that shows how to generate expected values and some other information:

# Predict expected value
predict.McLerren <- function(model, x.new)
{
  x <- as.data.frame(x.new)
  colnames(x) <- "x"
  x$x.gt.0 <- x$x

  pred.p.ygt0 <- predict(model$p.ygt0, newdata=x, type="response", se.fit=TRUE)
  pred.ygt0 <- predict(model$ygt0, newdata=x, type="response", se.fit=TRUE)  

  p0 <- 1 - pred.p.ygt0$fit
  ev <- (1-p0) * pred.ygt0$fit

  se.p0 <- pred.p.ygt0$se.fit
  se.ev <- pred.ygt0$se.fit

  se.fit <- sqrt(((1-p0)*se.ev)^2 + (ev*se.p0)^2 + (se.p0*se.ev)^2)

  list("fit"=ev, "p0"=p0, "se.fit" = se.fit,
       "pred.p.ygt0"=pred.p.ygt0, "pred.ygt0"=pred.ygt0)
}

And a sample run:

> x.new <- seq(0.05,0.95,length=5)
> 
> foo.pred <- predict.McLerren(foo, x.new)
> foo.pred$fit
       1        2        3        4        5 
2.408946 2.333231 2.201889 2.009979 1.763201 
> foo.pred$se.fit
        1         2         3         4         5 
0.3409576 0.2378386 0.1753987 0.2022401 0.2785045 
> foo.pred$p0
        1         2         3         4         5 
0.1205351 0.1733806 0.2429933 0.3294175 0.4291541
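For reference, the standard error of a product of two independent estimates a and b follows from Var(ab) = a^2 Var(b) + b^2 Var(a) + Var(a) Var(b). The helper below is a hypothetical sketch of that formula (se_product is not part of the code above, and the numbers passed to it are made up). Here a plays the role of P(y > 0) and b the role of the conditional mean E(y | y > 0), so the second term multiplies the probability's standard error by the conditional mean rather than by the combined expected value.

```r
# Sketch only: hypothetical helper, illustrative inputs.
# SE of a product of two independent estimates a and b.
se_product <- function(a, se_a, b, se_b) {
  sqrt((a * se_b)^2 + (b * se_a)^2 + (se_a * se_b)^2)
}

# a = estimated P(y > 0), b = estimated E(y | y > 0)
se_product(a = 0.8, se_a = 0.05, b = 2.5, se_b = 0.2)
```

Treating the two estimates as independent is reasonable here because the two parts of the likelihood separate.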

Now for coefficient extraction and the contrasts:

coef.McLerren <- function(model)
{
  temp1 <- coefficients(model$p.ygt0)
  temp2 <- coefficients(model$ygt0)
  names(temp1) <- NULL
  names(temp2) <- NULL
  retval <- c(temp1, temp2)
  names(retval) <- c("b0.f","b1.f","b0.h","b1.h")
  retval
}

contrast.McLerren <- function(b0_f, b1_f, b2_f, b0_h, b1_h, b2_h)
{
  (1-(1 / (1 + exp(-b0_f -b1_f))))*(exp(b0_h + b1_h)) - (1-(1 / (1 + exp(-b0_f -b2_f))))*(exp(b0_h + b2_h))
}


> coef.McLerren(foo)
      b0.f       b1.f       b0.h       b1.h 
 2.0819321 -1.8911883  1.0009568  0.1334845
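To get something like the SAS estimate statement for "gift1 versus gift 2", including a standard error, one option is the delta method on the two glm fits. The following is a sketch under assumptions: the simulated data, the names ey and contrast, and the numerical gradient are mine. One thing to watch: the contrast formula is written for coefficients of P(y = 0), as in the SAS code, whereas the logistic model in McLerran() is fit to y > 0, so its coefficients would need their signs flipped. The sketch avoids this by fitting the zero indicator directly.

```r
# Sketch only: simulated data and all names below are illustrative assumptions.
set.seed(3)
n     <- 600
gifts <- factor(sample(0:3, n, replace = TRUE))
X     <- model.matrix(~ gifts)
p0    <- plogis(drop(X %*% c(-0.5, 0.3, 0.8, -0.2)))   # P(amount = 0)
mu    <- exp(drop(X %*% c(1, 0.2, 0.5, -0.1)))         # E(amount | amount > 0)
amount <- ifelse(rbinom(n, 1, p0) == 1, 0, rgamma(n, shape = 2, scale = mu / 2))
dat   <- data.frame(amount, gifts, zero = as.numeric(amount == 0))

m_f <- glm(zero ~ gifts, family = binomial, data = dat)            # models P(amount = 0)
m_h <- glm(amount ~ gifts, family = Gamma(link = "log"),
           data = dat[dat$amount > 0, ])

# Stack coefficients; the covariance is block diagonal because the likelihood separates.
b <- c(coef(m_f), coef(m_h))
V <- matrix(0, 8, 8)
V[1:4, 1:4] <- vcov(m_f)
V[5:8, 5:8] <- vcov(m_h)

# E(amount) for gifts = k (k = 1, 2, 3), and the gift1 versus gift2 contrast
ey       <- function(b, k) (1 - plogis(b[1] + b[1 + k])) * exp(b[5] + b[5 + k])
contrast <- function(b) unname(ey(b, 1) - ey(b, 2))

# Central-difference gradient of the contrast, then the delta-method SE
grad <- sapply(seq_along(b), function(j) {
  e <- replace(numeric(length(b)), j, 1e-6)
  (contrast(b + e) - contrast(b - e)) / 2e-6
})
est <- contrast(b)
se  <- sqrt(drop(t(grad) %*% V %*% grad))
c(estimate = est, se = se, z = est / se)
```

The standard error will differ somewhat from NLMIXED's, since vcov() for the Gamma fit uses the Pearson dispersion estimate rather than the maximum likelihood estimate of theta.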

Related Solutions

Solved – Marginal effects of a logistic model, and their standard errors

Some people would find such 'marginal effects' difficult to interpret and non-unique. There are other ways to get 'marginal effects' in binary logistic regression. Because of non-collapsibility of the odds ratio, marginal estimates are not well defined in general, and they can represent quantities that are not weighted averages over the factors you are unconditioning on. Mitch Gail has an example where the partial odds ratio for an exposure x2 is 9 for both x1=0 and x1=1 but is 5.44 when not holding x1 constant.

@ARTICLE{gai84bia,
  author = {Gail, M.H. and Wieand, S. and Piantadosi, S.},
  year = 1984,
  title = {Biased estimates of treatment effect in randomized experiments with
          nonlinear regressions and omitted covariates},
  journal = Biometrika,
  volume = 71,
  pages = {431-444},
  annote = {covariable adjustment;bias if omitted covariables and model is
           nonlinear}
}

I wonder also whether you meant 'marginal effect' or 'effect on the original scale'. That would involve two different considerations. Effects on the log odds scale are easier to deal with, and you can relate odds ratios to absolute risk changes (as a function of starting risk) using a simple chart.

Solved – SAS ARIMA: estimate without identify? or auto.arima in SAS

You are fortunate to ask this question on this site because IrishStat has been automating ARIMA models for over 30 years (sorry to give away your age, Dave). Also, Rob Hyndman wrote the auto.arima procedure in R. I have a connection as I took my first time series course in a short course by Box and Tiao at Carnegie-Mellon University in 1974 (giving away my age now). Also, when I was the Chief of Statistical Research at Risk Data Corporation (in the early 1990s) I hired Terry Woodfield, who authored the ETS software at the SAS Institute just before we were able to draw him away. I am sure PROC ARIMA has gone through many changes, but I am sure that if you make contact with Terry he could probably help you.

Personally, the way I learned it from Box, Tiao and Pack, ARIMA modeling is an iterative process that should be gone through manually in stages, with the user making decisions at each stage. That is not to say that good results cannot be obtained by automated procedures. In fact, I think that Dave Reilly (IrishStat) along with his son Tom have so much experience doing this that they will contend that they could produce a better model with their algorithm than I can do manually, and they may be right. But my point is that for a time series specialist to take that approach takes away some of the steps that help him really get to understand the characteristics of the series very well.

One thing that always troubled me in the early years was that the Box-Jenkins methodology was revered a little too much. Estimation is by conditional least squares and so the normality of the residuals is important and often overlooked (a buried secret). In the late 1970s I worked on the problem of outliers in time series, and Darryl Downing and I published a paper on the topic in JASA in 1982.

Since then others like Doug Martin, George Tiao and Ruey Tsay have made much bigger contributions. IrishStat is aware of that literature and has incorporated their ideas in his software. That is why he emphasizes checking for level shifts and outliers before fixating on an ARIMA model. That aspect of his software makes it somewhat unique. It is different from auto.arima and SAS/ETS. So keep that in mind in your search for other automated procedures using SAS.

I hope you appreciate this as an answer even though it does not directly answer questions 1 or 2. I am sure you can find Terry Woodfield on the internet or go directly to the SAS Institute with your questions which are very specific to SAS and really require someone with intimate knowledge of the SAS algorithms. I don't think you will find anyone on this site who could give you better help.
