Can “zero-inflated continuous covariates” cause problems in logistic regression?

My question is very similar to this one, but I felt the advice given there does not apply to my particular situation.

I am using logistic regression models for an animal habitat occupancy study, and all the predictor variables I am interested in contain >50% zeros (although they have a decent range of values in the higher percentiles). Can this cause bias or influence how I should interpret the estimated coefficients?

A 2-stage analysis, as suggested in the linked question, doesn't seem to make sense because all the predictors share this zero-inflated distribution.

Thanks for any insights

EDIT: Clarifications suggested by Peter Flom:

Sample size ~ 500 (300 "0"s, 200 "1"s)

There are 5 IVs; a typical five-number summary looks like this:

min = 0.000, lower = 0.000, median = 0.000, upper = 0.289, max = 16.887

Also, mean = 0.468, SD = 1.467

Correlations among the 5 IVs all have absolute r < 0.3.

The IVs are hectares of specific habitat types. Every sample has >0 hectares for at least 1 of the IVs.

An example run of the model in R:

    Call:
    glm(formula = use ~ x.1 + x.2 + x.3 + x.4, family = binomial, 
        data = mydata)

    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -2.2338  -0.9312  -0.8679   1.3231   1.6432  

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept) -0.78412    0.12814  -6.119 9.41e-10 ***
    x.1          0.19866    0.06366   3.121  0.00181 ** 
    x.2          0.06956    0.02618   2.657  0.00788 ** 
    x.3          0.05238    0.02265   2.313  0.02074 *  
    x.4         -0.09995    0.13814  -0.724  0.46935    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 634.10  on 473  degrees of freedom
    Residual deviance: 611.18  on 469  degrees of freedom
    AIC: 621.18

    Number of Fisher Scoring iterations: 4

Best Answer

Logistic regression does not make any assumptions about the distribution of the independent variables (neither does OLS regression, but that's another post).

However, if there are a lot of variables and a lot of zero inflation, then I think the potential for complete or quasi-complete separation increases.
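One rough way to look for separation is sketched below. It uses simulated stand-in data, not the poster's data: the zero proportion, the log-normal positive part, and the coefficients are all assumptions chosen only to resemble the summary in the question. The idea is to cross-tabulate "covariate is zero" against the outcome, look for empty cells, and see whether `glm()` raises any warnings.

```r
# Simulated stand-in data; none of these values come from the original study.
set.seed(1)
n <- 500

# Hypothetical zero-inflated covariate: about 60% exact zeros, positive otherwise.
x <- ifelse(runif(n) < 0.6, 0, rlnorm(n, meanlog = -0.5, sdlog = 1))

# Hypothetical true model used only to generate an outcome.
y <- rbinom(n, 1, plogis(-0.8 + 0.2 * x))

# Cross-tabulate "covariate is zero" against the outcome.
# An empty cell would mean one group contains only 0s or only 1s,
# which is a warning sign for quasi-complete separation.
tab <- table(zero = x == 0, outcome = y)
print(tab)
has_empty_cell <- any(tab == 0)

# Fit the model and collect any warnings glm() raises. Separation often
# shows up as "fitted probabilities numerically 0 or 1 occurred".
warn_msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    warn_msgs <<- c(warn_msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(has_empty_cell)
print(warn_msgs)
```

With several zero-inflated covariates, the same table could be built for each one, or for combinations of them. An empty cell is only a hint, so very large coefficients and standard errors in the fitted model are also worth checking.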

Another problem may be the precision of the estimates: as far as I know, the computed standard errors etc. will be correct, but I think they could well be large.
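A small simulation can make this point concrete. The sketch below is only an illustration under assumed values (60% zeros, log-normal positive part, true slope 0.2, n = 500), none of which come from the original study. It repeatedly generates a zero-inflated covariate, fits the logistic model, and compares the average estimate with the true slope and the average reported standard error with the empirical spread of the estimates.

```r
# Simulation sketch; all data-generating values are assumptions.
set.seed(42)
n_obs <- 500       # sample size per simulated data set
n_rep <- 200       # number of simulated data sets
beta_true <- 0.2   # hypothetical true slope

one_fit <- function() {
  # Zero-inflated covariate: about 60% exact zeros, log-normal otherwise.
  x <- ifelse(runif(n_obs) < 0.6, 0, rlnorm(n_obs, meanlog = -0.5, sdlog = 1))
  y <- rbinom(n_obs, 1, plogis(-0.8 + beta_true * x))
  # Return the slope estimate and its model-based standard error.
  coef(summary(glm(y ~ x, family = binomial)))["x", c("Estimate", "Std. Error")]
}

# One row per replicate, columns "Estimate" and "Std. Error".
sims <- t(replicate(n_rep, one_fit()))

mean_est <- mean(sims[, "Estimate"])   # compare with beta_true
emp_sd   <- sd(sims[, "Estimate"])     # empirical spread of the estimates
mean_se  <- mean(sims[, "Std. Error"]) # average reported standard error

print(c(mean_est = mean_est, emp_sd = emp_sd, mean_se = mean_se))
```

If the average estimate sits near the true slope and the average standard error is close to the empirical spread, the zero inflation is not distorting inference in this setup. Changing the zero proportion or the sample size in the sketch shows how much precision is lost when fewer observations carry non-zero values.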

More details (the number of IVs, the sample size, the nature of the variables, and the degree of correlation among the IVs) will help you get more detailed answers. Actually running the regression and posting the results would also help.
