Mathematical notation in heteroscedastic mixed models

Tags: lme4-nlme, mathematical-statistics, mixed-model, r, regression

I'm trying to understand how the equations (with their indices) and the variance-covariance matrices of some mixed models fitted in R with the lme4 and nlme packages are written out.

The first model has just one random effect, on the intercept, and doesn't seem difficult to me.
It is fitted, for example, as follows in lme4 and nlme respectively:

# Var6 is a qualitative variable.
# Var7 is a qualitative variable.
modA_lmer <- lmer(log(Var1)~log(Var2)+log(Var3)+
                          (Var4)+(Var5)+(1|Var6), data)
modA_lme <- lme(log(Var1)~log(Var2)+log(Var3)+
                          (Var4)+(Var5),
                           random = ~1|Var6, data)

modB_lmer <- lmer(log(Var1)~log(Var2)+log(Var3)+
                          (Var4)+(Var5)+(1|Var7), data)
modB_lme <- lme(log(Var1)~log(Var2)+log(Var3)+
                          (Var4)+(Var5),
                           random = ~1|Var7, data)

modC_lmer <- lmer(log(Var1)~log(Var2)+log(Var3)+
                          (Var4)+(Var5)+(1|Var6)+(1|Var7), data)
modC_lme <- lme(log(Var1)~log(Var2)+log(Var3)+
                          (Var4)+(Var5),
                           random= list(Var8= pdIdent(form = ~ 0 + as.factor(Var6)), 
                           Var7 = ~1), data)
 

I'm not sure (could someone check, mainly the indices?), but I think the equation of the three models, respecting the indices, would be given by:

[image: proposed model equations]

I believe the variance-covariance matrices D of the random effects would be expressed by:

[image: proposed variance-covariance matrices D]

OK, assuming that everything described above is correct, now imagine mixed model B in a heteroscedastic case, with a different variance per level of a dummy variable Var8. Fitted with the lme4 and nlme packages, it would look like:

# Var8 is a dummy variable
modD_lmer <- lmer(log(Var1)~log(Var2)+log(Var3)+
                          (Var4)+(Var5)+(Var8-1||Var6), data)
modD_lme <- lme(log(Var1)~log(Var2)+log(Var3)+
                          (Var4)+(Var5),
                           random = list(Var6=pdDiag(~Var8)), data)

In this case, what would the equation of my model look like? I thought of the following situation:

[image: proposed equation for model D]

Is this correct? Would the equation be exactly the same, with the only difference being an extra index on the $\sigma^2$ of the error?

The other question is: in this case, what would the variance-covariance matrix look like? Would I have three sigmas on the diagonal?
I.e.:

[image: proposed diagonal variance-covariance matrix]

I found some information in :

Best Answer

There's a lot going on here.

Models A,B,C

Models A, B, and C are fine in both the R code and the formulas; however, I have some notes.

  1. Var6 and Var7 are crossed random effects, i.e. their groupings are different/not nested (see also: Crossed vs nested random effects: how do they differ and how are they specified correctly in lme4?), which means it is not sensible to speak of their joint covariance matrix, because they are not realized on the same units.
  2. lme does not natively support crossed random effects, but you seem to have used the approach from https://stats.stackexchange.com/a/552180/341520 for modC_lme, which relies on Var8 being a second intercept column, i.e. all 1s.
  3. If $k$ is the index of the individual observation, then $\epsilon_{ijk}$ is redundant, but not wrong. You might be interested in the equatiomatic package, which can create a LaTeX formula from most lmer objects. See the demonstration below.
library(lme4)
library(nlme)
library(equatiomatic)
n <- 700
# only two fixed effects to reduce redundancy
x1 <- rexp(n)
x2 <- runif(n)
var6 <- sample(as.factor(1:40), replace = TRUE, size = n)
ran_eff6 <- rnorm(40, sd = 1.5)
var7 <- sample(as.factor(1:50), replace = TRUE, size = n)
ran_eff7 <- rnorm(50, sd = 2.5)
# dummy for model D, not the right one for modC_lme!
var8 <- sample(as.factor(1:2), replace = TRUE, size = n)
# model C being true
yC <- exp(3*log(x1) + 2*x2 + ran_eff6[as.integer(var6)] + ran_eff7[as.integer(var7)] + rnorm(n, sd = 4))
dat <- data.frame(x1, x2, var6, var7, var8, yC)

modC_lmer <- lmer(log(yC)~log(x1)+(x2)+(1|var6)+(1|var7), data = dat)

extract_eq(modC_lmer)

$$ \begin{aligned} \operatorname{yC}_{i} &\sim N \left(\alpha_{j[i],k[i]} + \beta_{1}(\operatorname{\log(x1)}) + \beta_{2}(\operatorname{x2}), \sigma^2 \right) \\ \alpha_{j} &\sim N \left(\mu_{\alpha_{j}}, \sigma^2_{\alpha_{j}} \right) \text{, for var7 j = 1,} \dots \text{,J} \\ \alpha_{k} &\sim N \left(\mu_{\alpha_{k}}, \sigma^2_{\alpha_{k}} \right) \text{, for var6 k = 1,} \dots \text{,K} \end{aligned} $$
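For completeness, the crossed-effects workaround from note 2 can also be applied to the simulated data. Below is a sketch, not a definitive recipe: I create a constant grouping factor (here called grp, playing the role your all-1s Var8 plays) so that the var6 dummies enter through pdIdent at the top level, with var7 as an ordinary random intercept below it.

```r
library(nlme)
set.seed(1)

# simulate crossed random effects, same structure as above
n <- 700
x1 <- rexp(n); x2 <- runif(n)
var6 <- sample(factor(1:40), replace = TRUE, size = n)
var7 <- sample(factor(1:50), replace = TRUE, size = n)
ran_eff6 <- rnorm(40, sd = 1.5)
ran_eff7 <- rnorm(50, sd = 2.5)
yC <- exp(3*log(x1) + 2*x2 + ran_eff6[as.integer(var6)] +
          ran_eff7[as.integer(var7)] + rnorm(n, sd = 4))
# constant grouping factor: all observations in one group (the "all 1s" column)
dat <- data.frame(x1, x2, var6, var7, yC, grp = factor(1))

# var6 enters as iid random effects through pdIdent on its dummy columns
# at the top (constant) level; var7 is a plain random intercept below it
modC_lme2 <- lme(log(yC) ~ log(x1) + x2,
                 random = list(grp = pdIdent(form = ~ 0 + var6),
                               var7 = ~ 1),
                 data = dat)
VarCorr(modC_lme2)
```

The pdIdent constraint (one shared variance for all var6 dummies) is what makes this equivalent to an ordinary random intercept for var6.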

Model D

Here you seem to have gotten confused reading https://rpubs.com/bbolker/factorvar: your formula describes the simple case of varying variance in $\epsilon_{ijk}$, while your R code tries to solve the harder case of varying variance in $u_i$. Both are correct; they just don't fit together.

Adjusting the formula to the hard case, I'd propose writing $u_{i,l}\sim \mathcal N(0, \sigma_{1,l}^2),\ \epsilon_{i,l,k}\sim \mathcal N(0, \sigma_3^2)$. If instead your formula is correct, we can use the weights argument in lme just as Ben Bolker did in your source:

# generate y with simple heteroscedasticity
dat$yD <- exp(3*log(x1) + 2*x2 + ran_eff6[as.integer(var6)] + rnorm(n, sd = ifelse(var8 == "1", 3, 6)))  
modD_lme <- lme(log(yD)~log(x1)+(x2), random = ~1|var6, weights = varIdent(form=~1|var8), data = dat)
summary(modD_lme)
Linear mixed-effects model fit by REML
  Data: dat 
      AIC      BIC    logLik
  4021.07 4048.351 -2004.535

Random effects:
 Formula: ~1 | var6
        (Intercept) Residual
StdDev:    1.287001 5.716438

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | var8 
 Parameter estimates:
        2         1 
1.0000000 0.5062947 
...more stuff...

In rnorm(n, sd = ifelse(var8 == "1", 3, 6)) we set $\sigma_{3, 1} = 3$ and $\sigma_{3, 2} = 6$. The fitted residual standard deviation $\hat\sigma_{3, 2} = 5.716438$ corresponds to group $l = 2$, and the parameter estimates under "Variance function" show that group $l = 1$ has roughly half that standard deviation. So it fits well. lme4 has no comparable functionality; see https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#spatial-and-temporal-correlation-models-heteroscedasticity-r-side-models. Ben Bolker's GLMM FAQ is frankly always worth reading when dealing with mixed models.
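If you want the two stratum SDs on their natural scale rather than reading the ratio off the summary, you can multiply the residual SD by the varIdent ratios. A sketch (the extraction through modelStruct$varStruct is standard nlme, but treat the surrounding simulation as illustrative):

```r
library(nlme)
set.seed(1)

# simulate the simple heteroscedastic case, as above
n <- 700
x1 <- rexp(n); x2 <- runif(n)
var6 <- sample(factor(1:40), replace = TRUE, size = n)
var8 <- sample(factor(1:2), replace = TRUE, size = n)
ran_eff6 <- rnorm(40, sd = 1.5)
yD <- exp(3*log(x1) + 2*x2 + ran_eff6[as.integer(var6)] +
          rnorm(n, sd = ifelse(var8 == "1", 3, 6)))
dat <- data.frame(x1, x2, var6, var8, yD)

modD <- lme(log(yD) ~ log(x1) + x2, random = ~ 1 | var6,
            weights = varIdent(form = ~ 1 | var8), data = dat)

# per-stratum residual SDs: residual SD times the varIdent ratios
# (allCoef = TRUE also returns the fixed reference coefficient of 1)
ratios <- coef(modD$modelStruct$varStruct,
               unconstrained = FALSE, allCoef = TRUE)
modD$sigma * ratios  # should land near 3 (var8 == "1") and 6 (var8 == "2")
```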

A covariance matrix for the random effects is still not sensible.
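As an addendum, the hard case itself can also be fitted and checked against a simulation. Below is a sketch (variable names and SDs are made up for illustration): a separate var6 random intercept per var8 level, with pdDiag on the var8 dummies giving independent variances per level, which matches the lmer term (Var8 - 1 || Var6) from the question.

```r
library(nlme)
set.seed(1)

n <- 700
x1 <- rexp(n); x2 <- runif(n)
var6 <- sample(factor(1:40), replace = TRUE, size = n)
var8 <- sample(factor(1:2), replace = TRUE, size = n)
# hard case: a separate var6 random intercept per var8 level,
# with different SDs (1.5 for level "1", 3 for level "2")
ran6 <- cbind(rnorm(40, sd = 1.5), rnorm(40, sd = 3))
u <- ran6[cbind(as.integer(var6), as.integer(var8))]
yE <- exp(3*log(x1) + 2*x2 + u + rnorm(n, sd = 4))
dat <- data.frame(x1, x2, var6, var8, yE)

# pdDiag on the var8 dummies: independent variances per var8 level
modE <- lme(log(yE) ~ log(x1) + x2,
            random = list(var6 = pdDiag(~ 0 + var8)),
            data = dat)
sqrt(diag(getVarCov(modE)))  # estimated SDs of u for var8 levels 1 and 2
```

The diagonal of getVarCov(modE) holds the two $\sigma_{1,l}^2$ from the formula above, so its square roots should recover the simulated SDs up to sampling error.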