R – Addressing gls with ML Not Working for Best Model Selection

anovabiostatisticslme4-nlmer

My question is close to this one which wasn't really answered: https://stackoverflow.com/questions/66571314/gls-with-arma-terms-not-working-for-one-combination-of-terms

I am running models to understand the trends highlighted by a Principal Component Analysis regarding variability in the morphology of some plants.
I have a set of quantitative (n=1, "PI", a percentage) and mostly qualitative (n=10, eg., Family, Genus, Species, leaf type) variables whose effect I would like to test on a quantitative variable "LMA" (a leaf thickness).

I initially used a GLM (because LMA data are neither normal nor homoskedastic) and on a reduced dataset because of a lot of lines with NAs for LMA.
In order to have a larger dataset, we supplemented LMA measurements with average values for some species (variable "Species") having n>5 LMA measurement. The variance of LMA is now highly heterogeneous and reduced for some of the species.

Thus, I switched to GLS to take into account this particularity with weights = varIdent(form=\~1|Species) and I made several models with method = REML to test the effect of the different variables (as some categorical variables were nested, I did not test them at the same time).

Today, I read several times that selecting the best fitting gls using AIC is only possible for gls made using ML (especially since not all my models have the same fixed effects).
A solution seems to be to make anova on the models updated in ML … But … it doesn't work for me … (although I have no NAs in my dataset) …

The problem is that depending on the models, the significant effects are different … I have read that the best model is also the one that contains the most variables with a significant effect. Is this right? Should I just go back to this and choose my model based on this individual anovas?

I hope this post is quite clear and I sincerely thank you in advance!

——-

The dataset I use (all variables but "PI" are qualitative), I specified they were factors, there is no NAs.

x <- c("species_LMA", "Locality", "Family", 
    "Genus", "Species", "specimen_TCT", 
    "taxon_TCT", "combined_TCT", "cat_TCT", 
    "PI", "Pheno", "Growth_form")  
noNA4 <- subset(data2, complete.cases(data2[, 
                x])) # n = 561

Here are are the different models I made:

xm1a <- gls(species_LMA ~ -1 + Family + PI + 
    Locality + cat_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)  
xm1b <- gls(species_LMA ~ -1 + Family + PI + 
    Locality + specimen_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)  
xm1c <- gls(species_LMA ~ -1 + Family + PI + 
    Locality + combined_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)
xm2a <- gls(species_LMA ~ -1 + Genus + PI + 
    Locality + cat_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)  
xm2b <- gls(species_LMA ~ -1 + Genus + PI + 
    Locality + specimen_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)     
xm2c <- gls(species_LMA ~ -1 + Genus + PI +  
    Locality + combined_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)

xm3a <- gls(species_LMA ~ -1 + Species + PI + 
    Locality + cat_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)    
#xm3b <- gls(species_LMA ~ -1 + Species + PI + 
    Locality + specimen_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA3, na.action=na.omit) 
# do not work for some reason ... correlation?`  
xm3c <- gls(species_LMA ~ -1 + Species + PI + 
    Locality + combined_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)

Line to select the best model based on AIC (package MuMIn)

model.sel(xm1a, xm1b, xm1c, xm2a, xm2b, xm2c, 
          xm3a, xm3c) 

    Model selection table 
         cat_TCT Fml Lcl Phn     PI spc_TCT cmb_TCT Gns Spc df    logLik   AICc  delta weight
    xm3c               +   +  8.254               +       + 60 -2051.166 4237.2   0.00      1
    xm3a       +       +   +  8.604                       + 56 -2064.354 4253.6  16.37      0
    xm2b               +   + 10.380       +           +     52 -2085.641 4286.3  49.10      0
    xm2c               +   +  9.439               +   +     49 -2092.155 4292.1  54.84      0
    xm2a       +       +   + 10.510                   +     45 -2105.152 4308.5  71.26      0
    xm1b           +   +   +  2.933       +                 47 -2111.535 4326.0  88.79      0
    xm1c           +   +   +  5.425               +         44 -2115.838 4327.5  90.26      0
    xm1a       +   +   +   + 11.680                         40 -2138.920 4364.2 127.04      0
    Models ranked by AICc(x) 
# best option are *3c* then 3a, 2b, 2c, 2a, 1c, 1b, 1a but AIC are highly similar  
# xm3c : LMA ~ Species + PI + Locality + combined_TCT + Pheno 

but since we took species-mean LMA, it is quite logical … if we consider the relation to species is biased because we used species mean values, then the best model would be xm2b: LMA ~ Genus + PI + Locality + specimen_TCT + Pheno

What I tried to update the model (here an example for two of them) to ML to evaluate their performance using AIC (and resulting error message).

anova(update(xm3c, . ~ ., method = "ML"), update(xm2b, . ~ ., method = "ML"))`

    Error in eigen(val, only.values = TRUE) : 
      infinite or missing values in 'x'

Example of anova results for 3 models. The problem is that depending on the models, the significant effects are different …

    anova(xm1b) 
    Denom. DF: 531 
                 numDF   F-value p-value
    Family           9 2137401.7  <.0001
    PI               1       0.1  0.7331
    Locality         1     118.4  <.0001
    specimen_TCT    10       5.2  <.0001
    Pheno            1       0.1  0.7080
    
    > anova(xm2b) # nothing but Species is significant... marginal corr of PI... 
    Denom. DF: 526 
                 numDF   F-value p-value
    Genus           14 247220.46  <.0001
    PI               1      2.76  0.0970
    Locality         1      0.10  0.7500
    specimen_TCT    10      0.56  0.8456
    Pheno            1      0.41  0.5226
    
    anova(xm3c) # nothing but Species is 
                # significant
    Denom. DF: 518 
                 numDF   F-value p-value
    Species         25 195602.70  <.0001
    PI               1      2.12  0.1463
    Locality         1      0.18  0.6691
    combined_TCT     7      0.75  0.6304
    Pheno            1      0.17  0.6784

Best Answer

I initially used a GLM (because LMA data are neither normal nor heteroskedastic [maybe you meant "homoskedastic"?])

That might not be necessary. The distributional issues have to do with errors around the model predictions, not the raw data.

we supplemented LMA measurements with average values for some species

That's not typically a good way to deal with missing data. See Stef van Buuren's Flexible Imputation of Missing Data for the advantages of creating multiple imputed data sets so that you have the best chance of avoiding bias while including the variance arising from filling in missing data.

Example of anova results for 3 models. The problem is that depending on the models, the significant effects are different ...

Check the documentation for how anova() works with those model objects. It's possible that it uses a sequential Type I method so the order of variables in the model might affect the apparent "significance" of the individual predictors.

It's not clear why you are doing separate models for each of cat_TCT, specimen_TCT and combined_TCT. It's generally better to start with as many predictors in a model as reasonable without overfitting, even if some predictors are somewhat correlated. (If predictors are linearly dependent or close to that, you should choose a linearly-independent set.)

I don't know why you are having troubles with ML versus REML. You can't compare AIC values among models having different fixed effects that are fit via REML. Work through one step at a time to see the source of the error, instead of combining updates with anova().