Solved – extremely left-skewed response variable – how to model this dataset

This is a histogram showing my response variable.

histogram of response variable

The response is # (or proportion? or percent?) of aphids eaten off of cards in fields, to model predation by natural enemies.

Predictors: fixed effects are both categorical (i.e. crop type, season) and continuous (i.e., landscape variables – amount of land in production, mean field size), and the variable 'landscape' is a random effect.

I've been working on this so far with the response variable reflected ('number of aphids remaining' rather than 'number of aphids eaten') so that it's right skewed instead which seemed more possible. But I would rather work with it as left-skewed if possible, the results will be easier to discuss that way.

Transformations do not help the response variable be less skewed.

GLMM with Poisson errors does not work because the models created are too overdispersed. GLMM with negative binomial errors – same problem. Any way I try to model this data that treats the response variable as continuous, gives me problematic things when I'm model checking, like residuals by fitted plots that have clear patterns in them, and overdispersion. I'm starting to wonder if I need to somehow rank the data in the response instead? Or partition it into categories?

The potential solution I've gotten to is hurdle models:

"Hurdle models partition the model into two parts: a binary process generating positive counts vs. zero counts, and a process generating only positive counts. The binary process is modeled using a generalized logistic regression, and the positive count process is modeled using a zero-truncated count model" (paraphrased from Zeilis, Kleiber & Jackman 2008)

Is there a way to do this in R? Or would I just do the two models separately, and discuss them separately? Or is there a way to get an AIC value for a hurdle model?

Does anybody have any other ideas for how to model this dataset?

TIA for any help, so appreciated. This dataset has been a thorn in my side for way too long!!

EDITED TO ADD

I now think that a tobit model (censored regression) is how I need to model this dataset. It worked well for modeling my fixed effects only. But I still cannot figure out how to make such a model with mixed effects. These datapoints are grouped by site (3 points per site), so it needs to be a random effect. Is there a way to do a censored regression with mixed effects?

Best Answer

Hurdle models and zero-inflated models could both work on the inverted variable. If you wanted to keep it as is, you might have to do some programming.

In R the pscl package offers both hurdle and zeroinfl functions. There is a vignette here that also covers some other packages that do some of the same things.

This being R, if you do want to play with the program, you can see the code easily enough:

install.packages("pscl")
library("pscl")
zeroinfl

Best Answer

Related Solutions

Related Question