I am working with a very unbalanced dataset that represents algae response to pollution. The dataset is a combination of data from several other studies. Algae are expressed as cell abundance counts, so the response ranges from 0 to more than 20,000. Pollution is expressed as z-scores, because I needed to standardize many different variables into a single variable for comparison. My random effect encodes the kind of measurement performed in each study to compare treatments and controls: repeated in time (measured on day 1, then on day 2, etc., at the same place), measured in different places (polluted vs. non-polluted), or not specified.
This is what algae abundance looks like; as can be seen, there are many zeros.
This is what the z-score data look like:
Link to dataset: https://drive.google.com/file/d/1lBrEseqDq4K0pNGp0Gvirn3J8lE-oNDu/view?usp=sharing
I need to measure the effect of pollution on algae considering the random effect, so I'm using this model:
model.abund.phyt <- glmer(response ~ z_scores + (1 | random), data = dataset, family = poisson)
I don't know what is going wrong with this analysis but when I run model.abund.phyt this is what I get:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: poisson ( log )
Formula: response ~ z_scores + (1 | random)
Data: dataset
AIC BIC logLik deviance df.resid
Inf Inf -Inf Inf 441
Random effects:
Groups Name Std.Dev.
random (Intercept) 1
Number of obs: 444, groups: random, 3
Fixed Effects:
(Intercept) z_scores
2.333 1.021
optimizer (Nelder_Mead) convergence code: 0 (OK) ; 22820 optimizer warnings; 1 lme4 warnings
When I run summary() on the model, I get many warnings like this:
non-integer x = 0.210000
What do I need to do to get the model to run correctly?
The numerical variables are recognized as numeric when I run str() on the dataset. Is the problem with my data or with the model specification?
Best Answer
We use random effects to encode information about structure in the data that implies that observations are not independent. For example: repeated measurements taken at the same site, or observations nested within the same study, are expected to be more similar to each other than to the rest of the data.
You don't have this level of information, as you seem to be putting together a meta-dataset that combines data collected under different conditions.
It's not meaningful to conceptualize those different conditions as "random effects". As @EdM advises, it would be better to treat them as fixed effects. This simplifies the model and makes it easier to deal with any remaining errors.
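A minimal sketch of the fixed-effects refit, using simulated stand-in data (the column names `response`, `z_scores`, and `random` are taken from the question; the values here are made up, not your data). Note that the Poisson likelihood is only defined for integer counts, which is what the `non-integer x` warnings in your output are complaining about, so it's worth checking that first:

```r
# Simulated stand-in for the real dataset (hypothetical values;
# column names follow the question).
set.seed(1)
n <- 444
dataset <- data.frame(
  z_scores = rnorm(n),
  random   = factor(sample(c("repeated", "different_places", "not_specified"),
                           n, replace = TRUE))
)
dataset$response <- rpois(n, lambda = exp(1 + 0.5 * dataset$z_scores))

# A Poisson model needs integer counts; non-integers trigger
# "non-integer x" warnings and an infinite deviance.
stopifnot(all(dataset$response == round(dataset$response)))

# Treat the measurement type as a fixed effect instead of a random effect.
fit <- glm(response ~ z_scores + random, data = dataset, family = poisson)
summary(fit)
```

With only 3 levels in the grouping variable, a fixed effect costs just two extra parameters and avoids estimating a variance from far too few groups.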
No matter what model you choose, if you don't have information about the structure of the data (which location were measurements collected from? at what time?), you cannot model correlations between observations appropriately. And if you assume that observations are independent when they are in fact correlated, the inference from your model won't be quite right: p-values too small, confidence intervals too narrow. Be aware of this limitation and don't overinterpret the results.
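To see the direction of that bias, here is a toy simulation (not the question's data): counts are generated with a strong group-level effect, and a group-level covariate with no true effect is fit once ignoring the grouping and once accounting for it. The naive model's standard error is much too small:

```r
library(lme4)

set.seed(42)
n_groups <- 20
n_per    <- 25
g <- factor(rep(seq_len(n_groups), each = n_per))
x <- rep(rnorm(n_groups), each = n_per)   # group-level covariate, no true effect
u <- rnorm(n_groups, sd = 1)              # group-level random intercepts
y <- rpois(n_groups * n_per, exp(0.5 + u[g]))

naive   <- glm(y ~ x, family = poisson)            # ignores the clustering
correct <- glmer(y ~ x + (1 | g), family = poisson)

# The naive model treats 500 correlated observations as independent,
# so its standard error for x is far too small (overconfident p-values).
se_naive   <- summary(naive)$coefficients["x", "Std. Error"]
se_correct <- summary(correct)$coefficients["x", "Std. Error"]
```

The gap between the two standard errors is exactly the "p-values too small, confidence intervals too narrow" problem described above.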