R – Addressing gls with ML Not Working for Best Model Selection

anova, biostatistics, lme4-nlmer

My question is close to this one which wasn't really answered: https://stackoverflow.com/questions/66571314/gls-with-arma-terms-not-working-for-one-combination-of-terms

I am running models to understand the trends highlighted by a Principal Component Analysis regarding variability in the morphology of some plants.
I have a set of quantitative (n=1, "PI", a percentage) and mostly qualitative (n=10, e.g., Family, Genus, Species, leaf type) variables whose effect I would like to test on a quantitative variable "LMA" (a leaf thickness).

I initially used a GLM (because LMA data are neither normal nor homoskedastic) on a reduced dataset, because many rows had NAs for LMA.
In order to have a larger dataset, we supplemented LMA measurements with average values for some species (variable "Species") having n>5 LMA measurements. The variance of LMA is now highly heterogeneous and reduced for some of the species.

Thus, I switched to GLS to take this particularity into account with weights = varIdent(form=~1|Species), and I made several models with method = REML to test the effect of the different variables (as some categorical variables were nested, I did not test them at the same time).

Today, I read several times that selecting the best-fitting gls using AIC is only possible for gls models fitted using ML (especially since not all my models have the same fixed effects).
A solution seems to be to run anova on the models updated to ML. But it doesn't work for me (although I have no NAs in my dataset).

The problem is that depending on the models, the significant effects are different. I have read that the best model is also the one that contains the most variables with a significant effect. Is this right? Should I just go back to this and choose my model based on these individual anovas?

I hope this post is quite clear and I sincerely thank you in advance!

---

The dataset I use (all variables except "PI" are qualitative; I specified them as factors, and there are no NAs):

x <- c("species_LMA", "Locality", "Family", 
    "Genus", "Species", "specimen_TCT", 
    "taxon_TCT", "combined_TCT", "cat_TCT", 
    "PI", "Pheno", "Growth_form")  
noNA4 <- subset(data2, complete.cases(data2[, 
                x])) # n = 561

Here are the different models I made:

xm1a <- gls(species_LMA ~ -1 + Family + PI + 
    Locality + cat_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)  
xm1b <- gls(species_LMA ~ -1 + Family + PI + 
    Locality + specimen_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)  
xm1c <- gls(species_LMA ~ -1 + Family + PI + 
    Locality + combined_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)

xm2a <- gls(species_LMA ~ -1 + Genus + PI + 
    Locality + cat_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)  
xm2b <- gls(species_LMA ~ -1 + Genus + PI + 
    Locality + specimen_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)     
xm2c <- gls(species_LMA ~ -1 + Genus + PI +  
    Locality + combined_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)

xm3a <- gls(species_LMA ~ -1 + Species + PI + 
    Locality + cat_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)    
# xm3b <- gls(species_LMA ~ -1 + Species + PI + 
#     Locality + specimen_TCT + Pheno , 
#     weights = varIdent(form=~1|Species), 
#     data=noNA3, na.action=na.omit) 
# does not work for some reason ... correlation?
xm3c <- gls(species_LMA ~ -1 + Species + PI + 
    Locality + combined_TCT + Pheno , 
    weights = varIdent(form=~1|Species), 
    data=noNA4, na.action=na.omit)

Line to select the best model based on AIC (package MuMIn)

model.sel(xm1a, xm1b, xm1c, xm2a, xm2b, xm2c, 
          xm3a, xm3c) 

    Model selection table 
         cat_TCT Fml Lcl Phn     PI spc_TCT cmb_TCT Gns Spc df    logLik   AICc  delta weight
    xm3c               +   +  8.254               +       + 60 -2051.166 4237.2   0.00      1
    xm3a       +       +   +  8.604                       + 56 -2064.354 4253.6  16.37      0
    xm2b               +   + 10.380       +           +     52 -2085.641 4286.3  49.10      0
    xm2c               +   +  9.439               +   +     49 -2092.155 4292.1  54.84      0
    xm2a       +       +   + 10.510                   +     45 -2105.152 4308.5  71.26      0
    xm1b           +   +   +  2.933       +                 47 -2111.535 4326.0  88.79      0
    xm1c           +   +   +  5.425               +         44 -2115.838 4327.5  90.26      0
    xm1a       +   +   +   + 11.680                         40 -2138.920 4364.2 127.04      0
    Models ranked by AICc(x)

# best option is *3c*, then 3a, 2b, 2c, 2a, 1b, 1c, 1a but AIC are highly similar  
# xm3c : LMA ~ Species + PI + Locality + combined_TCT + Pheno

But since we used species-mean LMA values, this is quite logical. If we consider the relation to Species to be biased because we used species mean values, then the best model would be xm2b: LMA ~ Genus + PI + Locality + specimen_TCT + Pheno

Here is how I tried to update the models to ML (shown for two of them) in order to evaluate their performance using AIC, followed by the resulting error message.

anova(update(xm3c, . ~ ., method = "ML"), update(xm2b, . ~ ., method = "ML"))

    Error in eigen(val, only.values = TRUE) : 
      infinite or missing values in 'x'

Example of anova results for 3 models. The problem is that depending on the models, the significant effects are different …

    anova(xm1b) 
    Denom. DF: 531 
                 numDF   F-value p-value
    Family           9 2137401.7  <.0001
    PI               1       0.1  0.7331
    Locality         1     118.4  <.0001
    specimen_TCT    10       5.2  <.0001
    Pheno            1       0.1  0.7080
    
    anova(xm2b) # nothing but Genus is significant... marginal corr of PI... 
    Denom. DF: 526 
                 numDF   F-value p-value
    Genus           14 247220.46  <.0001
    PI               1      2.76  0.0970
    Locality         1      0.10  0.7500
    specimen_TCT    10      0.56  0.8456
    Pheno            1      0.41  0.5226
    
    anova(xm3c) # nothing but Species is 
                # significant
    Denom. DF: 518 
                 numDF   F-value p-value
    Species         25 195602.70  <.0001
    PI               1      2.12  0.1463
    Locality         1      0.18  0.6691
    combined_TCT     7      0.75  0.6304
    Pheno            1      0.17  0.6784

Best Answer

I initially used a GLM (because LMA data are neither normal nor heteroskedastic [maybe you meant "homoskedastic"?])

That might not be necessary. The distributional issues have to do with errors around the model predictions, not the raw data.
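A small simulation can make this point concrete. The sketch below uses invented data (the names `group`, `y` and `fit` are illustrative and not taken from the question): the raw response is clearly non-normal because two groups have very different means, yet the errors around the fitted values were generated as normal.

```r
# Illustrative data only: two groups with well-separated means,
# normal errors within each group.
set.seed(42)
n <- 200
group <- factor(rep(c("thin", "thick"), each = n / 2))
y <- ifelse(group == "thin", 50, 150) + rnorm(n, sd = 5)

fit <- lm(y ~ group)

# Normality test on the raw response: rejected, because y is bimodal
p_raw <- shapiro.test(y)$p.value

# Normality test on the residuals, which is what the model assumes
p_res <- shapiro.test(residuals(fit))$p.value

print(c(raw = p_raw, residuals = p_res))
```

Checking the residuals of a fitted model (for instance with a normal Q-Q plot) is therefore usually more informative than testing the raw LMA values.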

we supplemented LMA measurements with average values for some species

That's not typically a good way to deal with missing data. See Stef van Buuren's Flexible Imputation of Missing Data for the advantages of creating multiple imputed data sets so that you have the best chance of avoiding bias while including the variance arising from filling in missing data.
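As a rough sketch of what multiple imputation looks like in practice, the example below uses the `mice` package and its bundled `nhanes` example data. This is an assumption for illustration: it is not the asker's data, and the model formula here is arbitrary. A real analysis would need an imputation model that includes the grouping variables.

```r
library(mice)  # assumes the mice package is installed

# Create 5 imputed data sets from the small example data shipped with mice
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Fit the same model to each imputed data set
fit <- with(imp, lm(bmi ~ age + chl))

# Pool the estimates so the standard errors reflect imputation uncertainty
pooled <- pool(fit)
summary(pooled)
```

The pooled standard errors include the between-imputation variance, which a single fill-in with species means ignores.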

Example of anova results for 3 models. The problem is that depending on the models, the significant effects are different ...

Check the documentation for how anova() works with those model objects. It's possible that it uses a sequential Type I method so the order of variables in the model might affect the apparent "significance" of the individual predictors.
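For `gls` objects from nlme, `anova()` accepts `type = "sequential"` (the default) or `type = "marginal"`. The sketch below uses simulated, correlated predictors (`x1`, `x2` are made-up names) to show that sequential tests depend on the order of terms while marginal tests do not.

```r
library(nlme)

# Simulated data with two correlated predictors
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
d <- data.frame(y = 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n), x1 = x1, x2 = x2)

# Same model, terms entered in a different order
fit_a <- gls(y ~ x1 + x2, data = d)
fit_b <- gls(y ~ x2 + x1, data = d)

# Sequential tests: each term is adjusted only for the terms before it
seq_a <- anova(fit_a)
seq_b <- anova(fit_b)

# Marginal tests: each term is adjusted for all other terms
mar_a <- anova(fit_a, type = "marginal")
mar_b <- anova(fit_b, type = "marginal")

print(seq_a); print(seq_b)
print(mar_a); print(mar_b)
```

If the conclusions about a predictor change between the two orderings, the sequential tests are the likely reason.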

It's not clear why you are doing separate models for each of cat_TCT, specimen_TCT and combined_TCT. It's generally better to start with as many predictors in a model as reasonable without overfitting, even if some predictors are somewhat correlated. (If predictors are linearly dependent or close to that, you should choose a linearly-independent set.)
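One way to check for linear dependence before fitting is to compare the rank of the model matrix with its number of columns. The sketch below uses a toy nested design (the `species` and `genus` values are invented) in which the genus is fully determined by the species, so the two cannot both be estimated.

```r
# Toy nested design: each species belongs to exactly one genus
set.seed(1)
d <- data.frame(species = factor(rep(c("a", "b", "c", "d"), each = 5)))
d$genus <- factor(ifelse(d$species %in% c("a", "b"), "G1", "G2"))
d$x <- rnorm(nrow(d))

X <- model.matrix(~ species + genus + x, data = d)

n_cols <- ncol(X)      # number of coefficients requested
rank_X <- qr(X)$rank   # number that can actually be estimated

print(c(columns = n_cols, rank = rank_X))

# A rank below the column count signals linearly dependent predictors
rank_X < n_cols
```

Running the same check on the model matrix of a formula that fails to fit may show whether a dependence between factors is the cause.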

I don't know why you are having troubles with ML versus REML. You can't compare AIC values among models having different fixed effects that are fit via REML. Work through one step at a time to see the source of the error, instead of combining updates with anova().
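A step-by-step version might look like the sketch below. It uses simulated data with group-specific variances (the names `sp`, `x`, `y`, `m1`, `m2` are illustrative, not the asker's), refits each model with ML separately, and only then compares them.

```r
library(nlme)

# Simulated data: three groups with different residual standard deviations
set.seed(2)
d <- data.frame(sp = factor(rep(c("A", "B", "C"), each = 40)),
                x  = rnorm(120))
sds <- c(A = 0.5, B = 1, C = 2)
d$y <- 1 + 0.7 * d$x + as.numeric(d$sp) +
  rnorm(120, sd = sds[as.character(d$sp)])

# REML fits (the gls default) with different fixed effects
m1 <- gls(y ~ x,      weights = varIdent(form = ~ 1 | sp), data = d)
m2 <- gls(y ~ x + sp, weights = varIdent(form = ~ 1 | sp), data = d)

# Refit with ML one model at a time; if one of these lines errors,
# that model is the one to investigate
m1_ml <- update(m1, method = "ML")
m2_ml <- update(m2, method = "ML")

# Only now compare the ML fits
AIC(m1_ml, m2_ml)
anova(m1_ml, m2_ml)
```

Separating the steps shows whether the error comes from a particular refit or from the comparison itself.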

Related Solutions

Model Selection in Longitudinal Data – Testing the Need for Random-Effects Terms in Longitudinal Data Analysis

The likelihood ratio test is slightly incorrect (in general, conservative) for testing the significance of a random effect, because the null value ($\sigma^2=0$) is at the boundary of the feasible space, but in this case there is overwhelmingly strong evidence against the null hypothesis. The model with random effects of individual is 15713-6772=8941 log-likelihood units better; twice the log-likelihood difference is (asymptotically) $\chi^2$ distributed, so the direct p-value calculation would give you ...

pchisq(2*8941,df=1,lower.tail=FALSE,log.p=TRUE)/log(10)
## -3885.251

... a p-value of approximately $10^{-3885}$.

You should really consider a random-slope model (random = ~time|id) as well.

Update: relative to the random-intercept model, the random-slopes model is again much better. The improvement is now 935 log-likelihood units, which, by the same calculation as above, corresponds to a rejection of the null hypothesis (among-individual variation in slope is equal to zero) with a p-value of "only" $10^{-408}$.
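For readers who want to see the shape of such a comparison, here is a minimal sketch using the `Orthodont` data that ships with nlme. This is a stand-in, not the data discussed above, so the log-likelihood differences will be far smaller.

```r
library(nlme)

# Random intercept per subject
m_int <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

# Random intercept and random slope for age per subject
m_slope <- lme(distance ~ age, random = ~ age | Subject, data = Orthodont)

# Likelihood ratio comparison; both models share the same fixed effects,
# so the default REML fits can be compared directly
anova(m_int, m_slope)
```

The p-value reported by `anova()` is subject to the same boundary caveat and tends to be conservative.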

Solved – Mixed effects model output – no difference in AIC values

The first three models you've constructed differ in the ways the parameters are defined, but they have the same number of parameters and the fits are equivalent in every way except for the numerical values of the parameters.

We can illustrate this with a plain linear model - mixed models just complicate the issue.

set.seed(101)
dd <- expand.grid(light=c("day","dusk","night"),
                  tide=c("base","Flooding","Ebbing"))
dd$y <- rnorm(nrow(dd))
## add one more row so fit isn't perfect
dd <- rbind(dd,dd[1,])
dd$y[nrow(dd)] <- rnorm(1)

Use model.matrix to see what parameters R will construct when fitting the model (you could also use names(coef(...)) on the output of lm(), or names(fixef(...)) on the output of (g)lmer).

tmpf <- function(f) {
    model.matrix(f,data=dd)
}
colnames(m1 <- tmpf(~light+tide+light:tide))
## [1] "(Intercept)"             "lightdusk"              
## [3] "lightnight"              "tideFlooding"           
## [5] "tideEbbing"              "lightdusk:tideFlooding" 
## [7] "lightnight:tideFlooding" "lightdusk:tideEbbing"   
## [9] "lightnight:tideEbbing"

If we use the * operator, we get the interaction plus the main effects; if we redundantly specify the main effects, R silently drops them.

all.equal(m1,tmpf(~light*tide))  ## TRUE
all.equal(m1,tmpf(~light+light*tide))  ## TRUE
all.equal(m1,tmpf(~light+tide+light*tide))  ## TRUE

If we use : but leave out one of the main effects we get the same number of parameters (9), but they are organized differently:

colnames(m2 <- tmpf(~light+light:tide))
## [1] "(Intercept)"             "lightdusk"              
## [3] "lightnight"              "lightday:tideFlooding"  
## [5] "lightdusk:tideFlooding"  "lightnight:tideFlooding"
## [7] "lightday:tideEbbing"     "lightdusk:tideEbbing"   
## [9] "lightnight:tideEbbing"

As I explain elsewhere, it rarely makes sense to test the model with interactions present but main effects missing; the only ways that I know of to do this are to construct the dummy variables yourself (either by hand, or by constructing the model matrix, dropping the terms you don't want, and using the remaining model matrix columns as (numeric) predictor variables).
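The model-matrix route could be sketched as follows. The data are simulated along the lines of the `light` and `tide` factors used here, but duplicated so that residual degrees of freedom remain; the object names (`X_red`, `fit_red`, `fit_full`) are illustrative.

```r
# Simulated data with the same two factors, each cell observed twice
set.seed(101)
dd <- expand.grid(light = c("day", "dusk", "night"),
                  tide  = c("base", "Flooding", "Ebbing"))
dd <- rbind(dd, dd)
dd$y <- rnorm(nrow(dd))

# Full model matrix: intercept, main effects and interaction columns
X <- model.matrix(~ light * tide, data = dd)

# Drop the tide main-effect columns; also drop the intercept column,
# because lm() will add its own
drop_cols <- c("(Intercept)", "tideFlooding", "tideEbbing")
X_red <- X[, !colnames(X) %in% drop_cols, drop = FALSE]

# Use the remaining columns as plain numeric predictors
fit_red  <- lm(dd$y ~ X_red)
fit_full <- lm(y ~ light * tide, data = dd)

length(coef(fit_red))   # intercept + 2 light + 4 interaction columns
length(coef(fit_full))  # all 9 parameters

# The reduced model is nested in the full one, so an F test is available
anova(fit_red, fit_full)
```

Because the predictors are now numeric columns, R no longer re-expands the formula to restore the missing main effect.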

The MuMIn package tries to do the right thing: from ?dredge,

By default, marginality constraints are respected, so “all possible combinations” include only those containing interactions with their respective main effects and all lower order terms.

library(MuMIn)    
full_model <- lm(y~light*tide,data=dd,na.action="na.fail")    
(dmods <- dredge(full_model))
## Model selection table 
##      (Int) lgh tid lgh:tid df logLik   AICc  delta weight
## 8 -0.27460   +   +       + 10 23.541 -247.1   0.00      1
## 1  0.24500                  2 -8.291   22.3 269.38      0
## 3 -0.16790       +          4 -5.948   27.9 274.98      0
## 2  0.07096   +              4 -7.821   31.6 278.72      0
## 4 -0.25820   +   +          6 -5.543   51.1 298.17      0

As you can see dredge has not tried to fit any models with the interaction but missing some main effects.
