The best method to analyze these extremely skewed data with many zeros

regression, skewness

I'm working on my bachelor's thesis and have an analysis where the dependent variable (number of months of parental leave taken by fathers) has a very skewed distribution: the value 0 occurs 1089 times, 1 occurs 18 times, 2 occurs 89 times, 3 occurs 29 times, 4 occurs 11 times, and so on, with all further values occurring fewer than 10 times each.

Now, the same variable from the same data set has already been analyzed in several papers that got published in scientific journals, and they all used several variants of linear regression on the untransformed data.

My question: Is this approach really valid? From all I have learned in my introductory statistics classes, you need a normally distributed dependent variable for linear regression. And these data are clearly non-normal and cannot be transformed to be normal either. What other methods could be used instead? Might negative binomial regression be an option? Or is linear regression OK to use after all?

Thanks,
Stefanie

Best Answer

(Zero-inflated) negative binomial regression would seem like a logical regression model to use. With the type of data you describe, linear regression will tend to be problematic in several respects: the error model is simply wrong, so you may get negative months predicted for some records, the confidence intervals do not respect that negative months are impossible, hypothesis tests may not hold their nominal significance level, and so on. By contrast, if the median count were fairly high (say 20 or 40) and only a few zeros occurred, linear regression would often work pretty well.
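As a rough check on why a count model with extra zeros looks plausible here, the sketch below compares the observed share of zeros with what a plain Poisson model with the same mean would predict, and also compares the sample variance with the mean. It uses only the frequencies quoted in the question for the values 0 to 4. The question omits the frequencies of larger values, so the table is truncated and the numbers are indicative only.

```python
import math

# Frequencies quoted in the question. Values above 4 are not reported there
# ("and so on"), so this table is truncated (assumption: ignoring them is
# acceptable for a rough illustration, not for a real analysis).
counts = {0: 1089, 1: 18, 2: 89, 3: 29, 4: 11}

n = sum(counts.values())
mean = sum(k * c for k, c in counts.items()) / n
variance = sum(c * (k - mean) ** 2 for k, c in counts.items()) / (n - 1)

observed_zero_share = counts[0] / n
poisson_zero_share = math.exp(-mean)  # P(Y = 0) under Poisson(mean)

print(f"n = {n}, mean = {mean:.3f}, variance = {variance:.3f}")
print(f"observed share of zeros: {observed_zero_share:.3f}")
print(f"Poisson share of zeros:  {poisson_zero_share:.3f}")
```

A variance well above the mean points to overdispersion (one motivation for the negative binomial), and an observed share of zeros above the Poisson prediction is one informal sign that a zero-inflated component may be worth considering.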

The zero-inflated part would distinguish those taking no "meaningful" leave (i.e. leave recorded as, or rounded to, zero) from those taking at least something (rounded?) to 1 month or more. I am speculating here about the rounding, since I would have assumed that many fathers would take at least a few days and that the real unit of time taken off would be working days or half working days. Or is this specifically parental leave that usually comes in units of months (or weeks), as opposed to taking available vacation time or personal days?
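To make the role of the zero-inflated part concrete, here is a minimal sketch of the probability mass function of a zero-inflated negative binomial distribution. The function names (`nb_pmf`, `zinb_pmf`) and the parameter values are hypothetical and chosen only for illustration; they are not estimates from the parental leave data. The negative binomial is written in the common mean/dispersion form with variance `mu + alpha * mu**2`.

```python
import math

def nb_pmf(k, mu, alpha):
    """Negative binomial pmf with mean mu > 0 and dispersion alpha > 0."""
    r = 1.0 / alpha  # "size" parameter
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu))
             + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

def zinb_pmf(k, pi, mu, alpha):
    """Zero-inflated NB: with probability pi the outcome is a structural
    zero; otherwise the count is drawn from the negative binomial."""
    base = (1.0 - pi) * nb_pmf(k, mu, alpha)
    return pi + base if k == 0 else base

# Hypothetical parameter values, for illustration only.
pi, mu, alpha = 0.8, 2.0, 0.5
for k in range(5):
    print(k, round(zinb_pmf(k, pi, mu, alpha), 4))
```

In a regression setting, both `pi` and `mu` would depend on covariates (typically through a logit and a log link, respectively), so one set of coefficients describes who takes any leave at all and another describes how many months are taken.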
