Solved – the best method to analyze these extremely skewed data with many zeros

regression, skewness

I'm working on my bachelor's thesis and have an analysis where the dependent variable (number of months of parental leave of fathers) has a very skewed distribution, as follows: 1089 times the value 0, 18 times the value 1, 89 times the value 2, 29 times the value 3, 11 times the value 4, and so on, with all further values occurring less than 10 times.

Now, the same variable from the same data set has already been analyzed in several papers that got published in scientific journals, and they all used several variants of linear regression on the untransformed data.

My question: Is this approach really valid? From all I have learned in my introductory statistics classes, you need a normally distributed dependent variable for linear regression. And these data are clearly non-normal and cannot be transformed to be normal either. What other methods could be used instead? Might negative binomial regression be an option? Or is linear regression OK to use after all?

Thanks,
Stefanie

Best Answer

(Zero-inflated) negative binomial regression would seem like a logical regression model to use. With the type of data you describe, linear regression will tend to be problematic in several respects: the error model is simply wrong, so you may get negative months predicted for some records, the confidence intervals do not respect that negative months are impossible, and hypothesis tests may not have their nominal level. By contrast, if the median count were fairly high (say 20 or 40) and only a few zeros occurred, linear regression would often work pretty well.

The zero-inflated part would distinguish those taking no "meaningful" leave (i.e. zero, or rounded to zero?) from those taking at least something (rounded to 1 month?). I am speculating here about the rounding, since I would have assumed many fathers take at least a few days, and that the real unit of time taken off would be working days or half working days. Or is this specifically parental leave that usually comes in units of months (or weeks), as opposed to taking available vacation time or personal days?
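To make the contrast between linear and negative binomial regression concrete, here is a minimal R sketch on simulated data. Everything in it is an assumption for illustration: the predictor `x`, the coefficients, and the dispersion are made up and are not the parental leave data. It uses `glm.nb()` from the MASS package. A zero-inflated variant could be fitted with, for example, `zeroinfl()` from the pscl package, which is not shown here.

```r
library(MASS)  # provides glm.nb()

set.seed(1)
n <- 1200
x <- rnorm(n)                          # hypothetical predictor
mu <- exp(-1 + 0.5 * x)                # assumed true mean (log link)
y <- rnbinom(n, size = 0.5, mu = mu)   # overdispersed counts with many zeros
dat <- data.frame(y = y, x = x)

lm_fit <- lm(y ~ x, data = dat)        # linear regression on the raw counts
nb_fit <- glm.nb(y ~ x, data = dat)    # negative binomial regression

mean(dat$y == 0)        # share of zeros in the simulated data
range(fitted(lm_fit))   # linear predictions are unbounded and can dip below 0
range(fitted(nb_fit))   # count-model predictions are always positive
```

Comparing the two `range()` calls shows whether the linear fit predicts impossible negative counts for some records, which is one of the problems mentioned above.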

Related Solutions

Solved – Best method to analyse whole population data

Sounds like random forests (using regression trees) are the perfect tool for you. You can build an ensemble of regression trees, then check the variable importance.

I don't know much about SPSS, but if you are willing to use R (come on... you know you want to!), the caret package will be able to do this with the train() function (by specifying method = "rf" and importance = TRUE). You can then view the importance of each variable. The varImp() function gives you more control. If you want to see what number of variables gives you the best results, you can use rfe(), passing rfFuncs to its control function, rfeControl().

Caret can be a little difficult to wrap your head around, so if you want to include all your variables, you can use this simpler (but less flexible) code from the randomForest package (which caret itself calls for random forests):

library(randomForest)
df <- read.csv(file.choose())  # assuming your data is in a CSV file
# Replace each ?? with the number of independent variables: they are assumed
# to be in columns 1 to ??, with the response in column ??+1
rf.fit <- randomForest(x = df[, 1:??], y = df[, ??+1], ntree = 500, importance = TRUE)
print(rf.fit$importance)    # importance of each variable
print(tail(rf.fit$rsq, 1))  # pseudo R-squared of the full forest (rsq holds one value per tree)

I would do this year by year, rather than include the year as a variable. Time will no doubt play a role in the regression, but I think it would make more sense to build a separate model for each year and then look for changes in variable importance over time, though 5 isn't many points to build a powerful time series with. Others might disagree and I wouldn't mind hearing others' opinions.

If you don't want to use R, I would update your question to flag it for SPSS users who might know how to implement random forests with variable importance.

Solved – Interpreting significance of the intercept in a regression analysis

Quoting from the answer on the page suggested by mdewey in a comment:

The intercept is the estimated value of the response variable for the first modalities of each factor under the assumption of additivity.

So how does that apply to your data? It depends on what your software deems to be the "first modality," or reference value, of each of your predictors.

When there is a categorical predictor, like gender, some programs choose the first listed category as the reference, others choose the last. You need to know how your statistics program makes the choice, or specify directly which category to use as the reference.
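As an illustration in R (one program among many, and the data here are hypothetical): with its default treatment contrasts, R uses the first factor level as the reference, and levels are sorted alphabetically unless you say otherwise. A minimal sketch of checking and changing the reference category:

```r
# Hypothetical data: gender stored as a factor
gender <- factor(c("male", "female", "female", "male", "female"))

levels(gender)[1]   # current reference level (first level, alphabetical by default)

# Set the reference category explicitly instead of relying on the default
gender_ref_male <- relevel(gender, ref = "male")
levels(gender_ref_male)[1]
```

Any model fitted with `gender_ref_male` would then report its intercept for males and the gender coefficient as the female-minus-male difference.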

For a continuous predictor, like age, a value of 0 is typically the reference. That can lead to some statistically "significant" intercept values (that is, values significantly different from 0) that have limited practical importance. If the age range of your participants is from 25 to 50 years old, does it really make sense to extrapolate your results all the way down to the age 0 of a newborn? That, nevertheless, is what the calculation of the intercept will do unless you take additional precautions.

One way around that problem is to use the difference in ages from the mean age, rather than the absolute age, as the independent variable in your model. That makes the mean age of the participants the reference value for age, which probably makes a lot more sense. The coefficient for age will not change, but the intercept would be more readily interpreted.
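A small R sketch of this centering idea, using simulated data (the variable names, the 25 to 50 age range, and the coefficients are assumptions for illustration, not your data):

```r
set.seed(42)
n <- 100
age <- runif(n, min = 25, max = 50)                  # hypothetical ages
speaking_time <- 300 + 4 * age + rnorm(n, sd = 30)   # made-up response

fit_raw <- lm(speaking_time ~ age)          # intercept = prediction at age 0
age_c <- age - mean(age)                    # age as a difference from the mean age
fit_centered <- lm(speaking_time ~ age_c)   # intercept = prediction at the mean age

coef(fit_raw)
coef(fit_centered)
```

The slope is the same in both fits. Only the intercept moves: in the centered model it is the predicted response for a participant of average age, which in this single-predictor case equals the mean of the response.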

You can see this issue in your tables. I'll assume that your "first modalities," or reference values, are: gender, male; topic, not feminine; group composition, not female dominated; age, 0. Then the intercept for Model 4 for Total Speaking Time, 407 (minutes?), would be that predicted for a newborn male in a group not dominated by females speaking on a non-feminine topic. Does that type of prediction make any sense? That's why you have to think carefully about whether the significance of the intercept (whether it's different from 0) in any particular model really matters; that question is best answered based on your knowledge of the subject matter.

One additional warning: your breaking down the analyses into 4 separate models for each dependent variable is not best practice. Your Model 4 seems to include all the predictors of interest, and appropriate statistical tests on Model 4 alone would address your underlying question about how age, gender, topic, and group composition affect these dependent variables.

Related to that, you are thus doing many more tests of statistical significance than you need, leading to a potentially exacerbated problem with multiple comparisons. Among your 4 models you seem to be examining 16 different individual coefficients (including intercepts) for each of 15 dependent variables, or 240 statistical tests on coefficients. If you accept p < 0.05 (= 1/20) as "significant," then even if there were no truly significant relations you would nevertheless expect to accept 12 (= 240/20) coefficients as "significant." You should see if you can get some local statistical consultation to help address the multiple comparison problem and how to structure your models appropriately.
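The arithmetic can be checked with a short R sketch. The counts come from the paragraph above; the simulated p-values and the Holm adjustment are assumptions added for illustration, showing one possible correction rather than a recommendation for your particular analysis.

```r
n_coefficients <- 16   # coefficients examined per dependent variable
n_outcomes <- 15       # dependent variables
alpha <- 0.05

n_tests <- n_coefficients * n_outcomes        # 240
expected_false_positives <- n_tests * alpha   # 12

# Simulated p-values, assuming every null hypothesis is true
set.seed(123)
p_null <- runif(n_tests)
sum(p_null < alpha)                             # "significant" purely by chance
sum(p.adjust(p_null, method = "holm") < alpha)  # after a Holm correction
```

Under the null, the raw count of "significant" results should land in the neighborhood of 12, while the adjusted count should usually be at or near zero.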
