I'm trying to run a zero-inflated regression for a continuous response variable in R. I'm aware of a gamlss implementation, but I'd really like to try out this algorithm by Dale McLerran that is conceptually a bit more straightforward. Unfortunately, the code is in SAS and I'm not sure how to re-write it for something like nlme.
The code is as follows:
proc nlmixed data=mydata;
parms b0_f=0 b1_f=0
b0_h=0 b1_h=0
log_theta=0;
eta_f = b0_f + b1_f*x1 ;
p_yEQ0 = 1 / (1 + exp(-eta_f));
eta_h = b0_h + b1_h*x1;
mu = exp(eta_h);
theta = exp(log_theta);
r = mu/theta;
if y=0 then
ll = log(p_yEQ0);
else
ll = log(1 - p_yEQ0)
- lgamma(theta) + (theta-1)*log(y) - theta*log(r) - y/r;
model y ~ general(ll);
predict (1 - p_yEQ0)*mu out=expect_zig;
predict r out=shape;
estimate "scale" theta;
run;
From: http://listserv.uga.edu/cgi-bin/wa?A2=ind0805A&L=sas-l&P=R20779
ADD:
Note: There are no mixed effects present here – only fixed.
The advantage to this fitting is that (even though the coefficients are the same as if you separately fit a logistic regression to P(y=0) and a gamma error regression with log link to E(y | y>0)) you can estimate the combined function E(y) which includes the zeroes. One can predict this value in SAS (with a CI) using the line predict (1 - p_yEQ0)*mu
.
Further, one is able to write custom contrast statements to test the significance of predictor variables on E(y). For example, here is another version of the SAS code I have used:
proc nlmixed data=TestZIG;
parms b0_f=0 b1_f=0 b2_f=0 b3_f=0
b0_h=0 b1_h=0 b2_h=0 b3_h=0
log_theta=0;
if gifts = 1 then x1=1; else x1 =0;
if gifts = 2 then x2=1; else x2 =0;
if gifts = 3 then x3=1; else x3 =0;
eta_f = b0_f + b1_f*x1 + b2_f*x2 + b3_f*x3;
p_yEQ0 = 1 / (1 + exp(-eta_f));
eta_h = b0_h + b1_h*x1 + b2_h*x2 + b3_h*x3;
mu = exp(eta_h);
theta = exp(log_theta);
r = mu/theta;
if amount=0 then
ll = log(p_yEQ0);
else
ll = log(1 - p_yEQ0)
- lgamma(theta) + (theta-1)*log(amount) - theta*log(r) - amount/r;
model amount ~ general(ll);
predict (1 - p_yEQ0)*mu out=expect_zig;
estimate "scale" theta;
run;
Then to estimate "gift1" versus "gift2" (b1 versus b2) we can write this estimate statement:
estimate "gift1 versus gift 2"
(1-(1 / (1 + exp(-b0_f -b1_f))))*(exp(b0_h + b1_h)) - (1-(1 / (1 + exp(-b0_f -b2_f))))*(exp(b0_h + b2_h)) ;
Can R do this?
Best Answer
Having spent some time on this code, it appears to me as though it basically:
1) Does a logistic regression with right hand side
b0_f + b1_f*x1
andy > 0
as a target variable,2) For those observations for which y > 0, performs a regression with right hand side
b0_h + b1_h*x1
, a Gamma likelihood andlink=log
,3) Also estimates the shape parameter of the Gamma distribution.
It maximizes the likelihood jointly, which is nice, because you only have to make the one function call. However, the likelihood separates anyway, so you don't get improved parameter estimates as a result.
Here is some R code which makes use of the
glm
function to save programming effort. This may not be what you'd like, as it obscures the algorithm itself. The code certainly isn't as clean as it could / should be, either.The shape parameter for the Gamma distribution is equal to 1 / the dispersion parameter for the Gamma family. Coefficients and other stuff which you might like to access programatically can be accessed on the individual elements of the return value list:
Prediction can be done using the output of the routine. Here's some more R code that shows how to generate expected values and some other information:
And a sample run:
Now for coefficient extraction and the contrasts: