R – Penalized Regression with Zero-Inflated Models

elastic net, glmnet, lasso, r, zero inflation

I'm currently building zero-inflated Poisson & negative binomial predictive models using the zeroinfl() function from the pscl package in R.

I want to add penalized regression to these models for shrinkage and variable selection. I'd also like the penalty to help avoid convergence problems caused by perfect or quasi-complete separation in my data, which seems better than removing variables by hand.

Question: Given that zero-inflated models $\neq$ hurdle models, will my variable selection be seriously biased if I first run separate lasso (or elastic net) Poisson and logistic regressions with glmnet to select variables for zeroinfl()?

Best Answer

I was looking for a ZIP model as well, and had trouble with mpath. I ended up coding my own solution in Matlab, and it seems to work well; I think it would be very easy to do in R (provided the approach is correct). I may have overlooked something, so let me know if you see any problems. My data has 80 observations and 550 features.

Recall that ZIP (and ZINB) can be solved via the EM algorithm. I am in the process of implementing this for ZINB, but my ZIP solution is as follows.

My understanding of EM:

Until convergence:
   step 1: calculate the expectations given the current parameters
   step 2: maximize with respect to the parameters
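
For ZIP, step 1 has a closed form: the expected value of the latent "structural zero" indicator given the current fitted values. Below is a minimal base-R sketch of that step. The function name `zip_e_step` and the argument names `pi_hat` (current probability of a structural zero) and `mu_hat` (current Poisson mean) are my own labels for illustration, not something from mpath or pscl.

```r
# E-step for a zero-inflated Poisson model (illustrative sketch).
# y      : observed counts
# pi_hat : current probability of a structural zero, one per observation
# mu_hat : current Poisson mean, one per observation
zip_e_step <- function(y, pi_hat, mu_hat) {
  # Posterior probability that an observed zero is a structural zero:
  # pi / (pi + (1 - pi) * P(Poisson count = 0)), with P(count = 0) = exp(-mu)
  zhat <- pi_hat / (pi_hat + (1 - pi_hat) * exp(-mu_hat))
  # A positive count cannot come from the structural-zero component
  zhat[y > 0] <- 0
  zhat
}

# Toy usage with made-up parameter values
y <- c(0, 0, 3, 1)
zhat <- zip_e_step(y, pi_hat = rep(0.3, 4), mu_hat = rep(2, 4))
```

Observations with a positive count always get `zhat = 0`, so only the zeros are ever split between the two components.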

To do this in a lasso framework, set up a lambda sequence for the Poisson model and one for the binomial model; call them I and J respectively:

for i in I:
    for j in J:
        Until convergence:
            step 1: calculate expectations (estimate the latent variable, call it zhat)
            step 2: fit the Poisson regression (penalty i) with 1-zhat as the weight;
                    fit the logistic regression (penalty j) with zhat as the dependent variable
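
Here is a rough R sketch of the inner loop for one fixed (i, j) pair, using glmnet. Treat it as an illustration under assumptions rather than a tested implementation: the function name `zip_lasso_em`, the starting values (half of each observed zero assigned to the structural component), and the convergence check on the change in zhat are all my own choices, and the data are simulated.

```r
library(glmnet)

# Penalized EM for ZIP at one fixed pair of penalties (illustrative sketch).
# lambda_count and lambda_zero are hypothetical names for the values i and j.
zip_lasso_em <- function(x, y, lambda_count, lambda_zero,
                         max_iter = 50, tol = 1e-6) {
  # Assumed starting values: each observed zero is half structural
  zhat <- ifelse(y == 0, 0.5, 0)
  for (iter in seq_len(max_iter)) {
    # Step 2a: lasso Poisson regression, weighted by 1 - zhat
    fit_count <- glmnet(x, y, family = "poisson",
                        weights = 1 - zhat, lambda = lambda_count)
    # Step 2b: lasso logistic regression with zhat as a soft response;
    # glmnet models the second column of a two-column response
    fit_zero <- glmnet(x, cbind(1 - zhat, zhat), family = "binomial",
                       lambda = lambda_zero)
    mu_hat <- as.numeric(predict(fit_count, newx = x, type = "response"))
    pi_hat <- as.numeric(predict(fit_zero, newx = x, type = "response"))
    # Step 1: update the expected latent indicator (zero for positive counts)
    zhat_new <- ifelse(y == 0,
                       pi_hat / (pi_hat + (1 - pi_hat) * exp(-mu_hat)), 0)
    converged <- max(abs(zhat_new - zhat)) < tol
    zhat <- zhat_new
    if (converged) break
  }
  list(count = fit_count, zero = fit_zero, zhat = zhat, iterations = iter)
}

# Simulated zero-inflated counts, for demonstration only
set.seed(1)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
structural_zero <- rbinom(n, 1, plogis(-1 + x[, 1]))
y <- ifelse(structural_zero == 1, 0, rpois(n, exp(0.5 + 0.5 * x[, 2])))

fit <- zip_lasso_em(x, y, lambda_count = 0.05, lambda_zero = 0.05)

# The outer loops over I and J just call the function once per pair
grid <- expand.grid(lambda_count = c(0.1, 0.05), lambda_zero = c(0.1, 0.05))
fits <- Map(function(lc, lz) zip_lasso_em(x, y, lc, lz),
            grid$lambda_count, grid$lambda_zero)
```

`coef(fit$count)` and `coef(fit$zero)` then give the selected variables for the count and zero-inflation parts. How to score each pair on the grid (held-out likelihood, an information criterion, and so on) is left open here.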

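One issue with step 2: for the binomial family, glmnet wants the response as a factor or as a two-column matrix of counts or proportions, not a single vector of probabilities. Passing zhat as a two-column matrix such as cbind(1 - zhat, zhat) should work; alternatively, multiply zhat by 100 and use the result as counts.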

I picked i and j via grid search. I also do not go all the way down either sequence (maybe 30 values on each), and I added a break, similar to the one in the glmnet package, to stop before saturation.

Hope this helps.