Solved – “Zero-inflated continuous covariates”: can they cause problems in logistic regression?

generalized linear model, logistic, regression, zero inflation

My question is very similar to this one, but I felt the advice given there does not apply to my particular situation.

I am using logistic regression models for an animal habitat occupancy study, and all the predictor variables I am interested in contain >50% zeros (although they have a decent range of values in the higher percentiles). Can this cause bias or influence how I should interpret the estimated coefficients?

A 2-stage analysis, as suggested in the linked question, doesn't seem to make sense because all the predictors share this zero-inflated distribution.

Thanks for any insights

EDIT: Clarifications suggested by Peter Flom:

Sample size ~ 500 (300 "0"s, 200 "1"s)

There are 5 IVs; a typical five-number summary looks like this:

min= 0.000 lower= 0.000 median=0.000 upper= 0.289 max= 16.887

Also, Mean= 0.468, SD= 1.467

Correlations between the 5 IVs all have absolute r < 0.3.

The IVs are hectares of specific habitat types. Every sample has >0 hectares for at least 1 of the IVs.

An example run of the model in R:

    Call:
    glm(formula = use ~ x.1 + x.2 + x.3 + x.4, family = binomial, 
        data = mydata)

    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -2.2338  -0.9312  -0.8679   1.3231   1.6432  

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept) -0.78412    0.12814  -6.119 9.41e-10 ***
    x.1          0.19866    0.06366   3.121  0.00181 ** 
    x.2          0.06956    0.02618   2.657  0.00788 ** 
    x.3          0.05238    0.02265   2.313  0.02074 *  
    x.4         -0.09995    0.13814  -0.724  0.46935    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 634.10  on 473  degrees of freedom
    Residual deviance: 611.18  on 469  degrees of freedom
    AIC: 621.18

    Number of Fisher Scoring iterations: 4

Best Answer

Logistic regression does not make any assumptions about the distribution of the independent variables (neither does OLS regression, but that's another post).
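To make the point about the distribution of the predictors concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration (the sample size, the 60% zero fraction, the log-normal shape of the nonzero values, and the "true" coefficients `beta0` and `beta1`); none of it comes from the data in the question. It only shows that, when the model is correctly specified, a covariate with many exact zeros does not by itself pull the estimate away from the true value.

```r
# Illustrative simulation only: all numbers below are made up.
set.seed(1)
n <- 5000

# Zero-inflated covariate: about 60% exact zeros, otherwise right-skewed positive values
x <- ifelse(runif(n) < 0.6, 0, rlnorm(n, meanlog = 0, sdlog = 1))

# Hypothetical "true" logistic model used to generate the outcome
beta0 <- -0.8
beta1 <- 0.2
y <- rbinom(n, size = 1, prob = plogis(beta0 + beta1 * x))

# Fit the same kind of model as in the question
fit <- glm(y ~ x, family = binomial)

coef(fit)             # estimates should land near beta0 and beta1
confint.default(fit)  # Wald intervals based on the reported standard errors
```

Rerunning with a different seed or a smaller `n` gives a feel for how much the estimates move around.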

However, if there are a lot of variables and a lot of zero inflation, then I think the potential for complete or quasi-complete separation increases.
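One simple warning sign for separation with zero-heavy predictors is a 2 x 2 table of "predictor is nonzero" against the outcome that has an empty cell: for example, every sample with a nonzero value falls in the same outcome class. The sketch below checks for that. The helper `zero_cell_check` and the toy data frame are hypothetical and written only for this illustration; the check is a quick screen, not a full separation diagnostic.

```r
# Hypothetical helper (not from any package): flags predictors whose
# zero/nonzero split has an empty cell when crossed with a 0/1 response.
zero_cell_check <- function(data, response, predictors) {
  sapply(predictors, function(p) {
    tab <- table(
      factor(data[[p]] > 0, levels = c(FALSE, TRUE)),
      factor(data[[response]], levels = c(0, 1))
    )
    any(tab == 0)
  })
}

# Toy data: x.a is nonzero only where use == 1 (a separation risk),
# while x.b has zeros and nonzeros in both outcome classes.
toy <- data.frame(
  use = c(0, 0, 0, 0, 1, 1, 1, 1),
  x.a = c(0, 0, 0, 0, 0, 0, 2.5, 1.1),
  x.b = c(0, 1.3, 0, 0.4, 0, 0.7, 0, 2.0)
)

flags <- zero_cell_check(toy, "use", c("x.a", "x.b"))
flags
```

On the question's data the call would look something like `zero_cell_check(mydata, "use", c("x.1", "x.2", "x.3", "x.4"))`, assuming `use` is coded 0/1. Very large coefficients with huge standard errors, or a `glm` warning about fitted probabilities numerically 0 or 1, are further hints of the same problem.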

Another problem may be accuracy of estimates; as far as I know, the computed standard errors etc. will be correct, but I think they could well be large.
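The effect on precision can be sketched with another small simulation. The function `se_for_zero_fraction`, the exponential distribution for the nonzero values, and the coefficients are all assumptions made for illustration. The idea is only that, at a fixed sample size, a covariate that is almost entirely zero carries less information about its slope, so the reported standard error tends to be larger.

```r
# Illustrative only: compare the slope's standard error for two zero fractions.
set.seed(42)

se_for_zero_fraction <- function(p_zero, n = 500) {
  # Covariate with a fraction p_zero of exact zeros, otherwise exponential
  x <- ifelse(runif(n) < p_zero, 0, rexp(n, rate = 1))
  # Hypothetical true model, loosely on the scale of the question's output
  y <- rbinom(n, size = 1, prob = plogis(-0.8 + 0.2 * x))
  fit <- glm(y ~ x, family = binomial)
  summary(fit)$coefficients["x", "Std. Error"]
}

se_half <- se_for_zero_fraction(0.50)  # half the values are zero
se_most <- se_for_zero_fraction(0.95)  # nearly all values are zero

c(half_zeros = se_half, mostly_zeros = se_most)
```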

More details (the number of IVs, the sample size, the nature of the variables, and the degree of correlation among the IVs) will help you get more detailed answers. Actually running the regression and posting the results would also help.