My question is close to this one which wasn't really answered: https://stackoverflow.com/questions/66571314/gls-with-arma-terms-not-working-for-one-combination-of-terms
I am running models to understand the trends highlighted by a Principal Component Analysis regarding variability in the morphology of some plants.
I have a set of quantitative (n=1, "PI", a percentage) and mostly qualitative (n=10, eg., Family, Genus, Species, leaf type) variables whose effect I would like to test on a quantitative variable "LMA" (a leaf thickness).
I initially used a GLM (because LMA data are neither normal nor homoskedastic) and on a reduced dataset because of a lot of lines with NAs for LMA.
In order to have a larger dataset, we supplemented LMA measurements with average values for some species (variable "Species") having n>5 LMA measurement. The variance of LMA is now highly heterogeneous and reduced for some of the species.
Thus, I switched to GLS to take into account this particularity with weights = varIdent(form=\~1|Species)
and I made several models with method = REML
to test the effect of the different variables (as some categorical variables were nested, I did not test them at the same time).
Today, I read several times that selecting the best fitting gls using AIC is only possible for gls made using ML (especially since not all my models have the same fixed effects).
A solution seems to be to make anova on the models updated in ML … But … it doesn't work for me … (although I have no NAs in my dataset) …
The problem is that depending on the models, the significant effects are different … I have read that the best model is also the one that contains the most variables with a significant effect. Is this right? Should I just go back to this and choose my model based on this individual anovas?
I hope this post is quite clear and I sincerely thank you in advance!
——-
The dataset I use (all variables but "PI" are qualitative), I specified they were factors, there is no NAs.
x <- c("species_LMA", "Locality", "Family",
"Genus", "Species", "specimen_TCT",
"taxon_TCT", "combined_TCT", "cat_TCT",
"PI", "Pheno", "Growth_form")
noNA4 <- subset(data2, complete.cases(data2[,
x])) # n = 561
Here are are the different models I made:
xm1a <- gls(species_LMA ~ -1 + Family + PI +
Locality + cat_TCT + Pheno ,
weights = varIdent(form=~1|Species),
data=noNA4, na.action=na.omit)
xm1b <- gls(species_LMA ~ -1 + Family + PI +
Locality + specimen_TCT + Pheno ,
weights = varIdent(form=~1|Species),
data=noNA4, na.action=na.omit)
xm1c <- gls(species_LMA ~ -1 + Family + PI +
Locality + combined_TCT + Pheno ,
weights = varIdent(form=~1|Species),
data=noNA4, na.action=na.omit)
xm2a <- gls(species_LMA ~ -1 + Genus + PI +
Locality + cat_TCT + Pheno ,
weights = varIdent(form=~1|Species),
data=noNA4, na.action=na.omit)
xm2b <- gls(species_LMA ~ -1 + Genus + PI +
Locality + specimen_TCT + Pheno ,
weights = varIdent(form=~1|Species),
data=noNA4, na.action=na.omit)
xm2c <- gls(species_LMA ~ -1 + Genus + PI +
Locality + combined_TCT + Pheno ,
weights = varIdent(form=~1|Species),
data=noNA4, na.action=na.omit)
xm3a <- gls(species_LMA ~ -1 + Species + PI +
Locality + cat_TCT + Pheno ,
weights = varIdent(form=~1|Species),
data=noNA4, na.action=na.omit)
#xm3b <- gls(species_LMA ~ -1 + Species + PI +
Locality + specimen_TCT + Pheno ,
weights = varIdent(form=~1|Species),
data=noNA3, na.action=na.omit)
# do not work for some reason ... correlation?`
xm3c <- gls(species_LMA ~ -1 + Species + PI +
Locality + combined_TCT + Pheno ,
weights = varIdent(form=~1|Species),
data=noNA4, na.action=na.omit)
Line to select the best model based on AIC (package MuMIn)
model.sel(xm1a, xm1b, xm1c, xm2a, xm2b, xm2c,
xm3a, xm3c)
Model selection table
cat_TCT Fml Lcl Phn PI spc_TCT cmb_TCT Gns Spc df logLik AICc delta weight
xm3c + + 8.254 + + 60 -2051.166 4237.2 0.00 1
xm3a + + + 8.604 + 56 -2064.354 4253.6 16.37 0
xm2b + + 10.380 + + 52 -2085.641 4286.3 49.10 0
xm2c + + 9.439 + + 49 -2092.155 4292.1 54.84 0
xm2a + + + 10.510 + 45 -2105.152 4308.5 71.26 0
xm1b + + + 2.933 + 47 -2111.535 4326.0 88.79 0
xm1c + + + 5.425 + 44 -2115.838 4327.5 90.26 0
xm1a + + + + 11.680 40 -2138.920 4364.2 127.04 0
Models ranked by AICc(x)
# best option are *3c* then 3a, 2b, 2c, 2a, 1c, 1b, 1a but AIC are highly similar
# xm3c : LMA ~ Species + PI + Locality + combined_TCT + Pheno
but since we took species-mean LMA, it is quite logical … if we consider the relation to species is biased because we used species mean values, then the best model would be xm2b: LMA ~ Genus + PI + Locality + specimen_TCT + Pheno
What I tried to update the model (here an example for two of them) to ML to evaluate their performance using AIC (and resulting error message).
anova(update(xm3c, . ~ ., method = "ML"), update(xm2b, . ~ ., method = "ML"))`
Error in eigen(val, only.values = TRUE) :
infinite or missing values in 'x'
Example of anova results for 3 models. The problem is that depending on the models, the significant effects are different …
anova(xm1b)
Denom. DF: 531
numDF F-value p-value
Family 9 2137401.7 <.0001
PI 1 0.1 0.7331
Locality 1 118.4 <.0001
specimen_TCT 10 5.2 <.0001
Pheno 1 0.1 0.7080
> anova(xm2b) # nothing but Species is significant... marginal corr of PI...
Denom. DF: 526
numDF F-value p-value
Genus 14 247220.46 <.0001
PI 1 2.76 0.0970
Locality 1 0.10 0.7500
specimen_TCT 10 0.56 0.8456
Pheno 1 0.41 0.5226
anova(xm3c) # nothing but Species is
# significant
Denom. DF: 518
numDF F-value p-value
Species 25 195602.70 <.0001
PI 1 2.12 0.1463
Locality 1 0.18 0.6691
combined_TCT 7 0.75 0.6304
Pheno 1 0.17 0.6784
Best Answer
That might not be necessary. The distributional issues have to do with errors around the model predictions, not the raw data.
That's not typically a good way to deal with missing data. See Stef van Buuren's Flexible Imputation of Missing Data for the advantages of creating multiple imputed data sets so that you have the best chance of avoiding bias while including the variance arising from filling in missing data.
Check the documentation for how
anova()
works with those model objects. It's possible that it uses a sequential Type I method so the order of variables in the model might affect the apparent "significance" of the individual predictors.It's not clear why you are doing separate models for each of
cat_TCT
,specimen_TCT
andcombined_TCT
. It's generally better to start with as many predictors in a model as reasonable without overfitting, even if some predictors are somewhat correlated. (If predictors are linearly dependent or close to that, you should choose a linearly-independent set.)I don't know why you are having troubles with ML versus REML. You can't compare AIC values among models having different fixed effects that are fit via REML. Work through one step at a time to see the source of the error, instead of combining updates with
anova()
.