correlation – Retrieving Correlations Using Linear Mixed-Effects Modeling

Tags: correlation, lme4-nlme

I want to retrieve the correlations in a multivariate dataset. Let me start with a simple case of three variables, of which the first two are correlated. In other words, the three variables are assumed to follow a trivariate Gaussian $(y_{1}, y_{2}, y_{3})' \sim N((0, 0, 0)', S)$:

require(MASS)
r1 <- 0.5               # correlation value to be recovered
ns <- 2000              # number of samples

S  <- matrix(c(1,r1,0,  # correlation structure of trivariate data
               r1,1,0,
               0,0,1), nrow=3, ncol=3)

# simulated trivariate data
dat <- data.frame(f = c(rep(paste0('P',1:ns), 2), paste0('U',1:ns)),
                  y = c(mvrnorm(n=ns, mu=c(0, 0, 0), Sigma=S)))
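A quick look at how the simulated values line up with the grouping labels (the coding is explained in the next paragraph):

head(dat)                           # first rows of the simulated data
subset(dat, f %in% c('P1', 'U1'))   # the two correlated samples share the label P1; U1 stands alone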

In the data frame dat, each pair of samples from the two correlated variables (with correlation value r1 in the variance-covariance matrix S) is coded with the same P label in the factor f, while the samples from the third variable are coded with U labels. Now we can construct the following random-intercept model $y_{ij} = \alpha + \mu_i + \epsilon_{ij}$, with $\mu_i \sim N(0, \sigma_{\mu}^2)$ and $\epsilon_{ij} \sim N(0, \sigma_{\epsilon}^2)$:

require(lme4)
m1 <- lmer(y ~ 1 + (1|f), data=dat)

With the variances from the model m1 output (each simulated dataset may lead to slightly different results):

summary(m1)

Random effects:
 Groups   Name        Variance Std.Dev.
 f        (Intercept) 0.4916   0.7012  
 Residual             0.5006   0.7075  

we can successfully recover the correlation r1 as the ratio of the between-pair variance to the total variance, $\frac{0.4916}{0.4916+0.5006} \approx 0.5$, which is very close to the simulated r1 value of 0.5:

tmp <- unlist(lapply(VarCorr(m1), `[`, 1))
# recover the correlation r1
tmp/(tmp+attr(VarCorr(m1), "sc")^2)
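As a cross-check (not part of the modeling workflow), the plain sample correlation gives essentially the same number; this relies on the column order produced by c(mvrnorm(...)) above:

# reshape the first 2*ns values of y back into an ns x 2 matrix (columns y1, y2)
cor(matrix(dat$y[1:(2*ns)], ncol=2))[1,2]   # should also be close to 0.5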

Now let's switch to a case with 4 variables, in which the first and second as well as the third and fourth are correlated. In other words, the four variables are assumed to follow a quadrivariate Gaussian $(y_{1}, y_{2}, y_{3}, y_{4})' \sim N((0, 0, 0, 0)', S)$:

r1 <- 0.2; r2 <- 0.8        # correlation value to be recovered
ns <- 2000                  # number of samples
S  <- matrix(c(1,r1,0,0,    # correlation structure of quadrivariate data
               r1,1,0,0,    # the first and second variables are correlated
               0,0,1,r2,    # the third and fourth variables are correlated
               0,0,r2,1), nrow=4,ncol=4)

# simulated data
dat <- data.frame(f = c(rep(paste0('P',1:ns), 2), rep(paste0('T',1:ns), 2)),
                  R=c(rep('P',2*ns), rep('T',2*ns)),
                  y = c(mvrnorm(n=ns, mu=rep(0,4), Sigma=S)))

In the data frame dat, the first and second variables are correlated (with coefficient r1), and each such pair of samples is coded with the same P label in the factor f. Similarly, the third and fourth variables are correlated (with coefficient r2), and each such pair is coded with the same T label in f. Because of this correlation structure, all the samples fall into 2 (instead of 4) levels in the factor R. Our goal is to use the simulated data to recover r1 and r2.

With the following dummy coding

dat$R1 <- as.numeric(dat$R=='P')   # dummy code the first and second variables (r1)
dat$R2 <- as.numeric(dat$R=='T')   # dummy code the third and fourth variables (r2)

I have considered the following two possible models

m2 <- lmer(y ~ 1 + (0+R1|f) + (0+R2|f), data=dat)
m3 <- lmer(y ~ 1 + (1|f) + (0+R2|f), data=dat)

To avoid over-parameterization, I don't include a (1|f) term in the model m2, nor a (0+R1|f) term in m3. However, I've been struggling to figure out a way to recover the correlations r1 and r2 from the random-effect variances of the models m2 and m3. In other words, the variances from the random effects in m2 and m3 don't seem to allow me to reconstruct r1 and r2.

summary(m2)
    
Random effects:
 Groups   Name Variance Std.Dev.
 f        R1   0.3835   0.6193  
 f.1      R2   0.6537   0.8085  
 Residual      0.5012   0.7079

and

summary(m3)

Random effects:
 Groups   Name        Variance Std.Dev.
 f        (Intercept) 0.3835   0.6193  
 f.1      R2          0.2702   0.5198  
 Residual             0.5012   0.7079

Of course, we could use a workaround by reducing the situation to two separate two-variable cases:

# workaround solution
m4 <- lmer(y ~ 1 + (1|f), data=dat[dat$R2!=1,])   # recover r1 like model m1 above
m5 <- lmer(y ~ 1 + (1|f), data=dat[dat$R1!=1,])   # recover r2 like model m1 above

Then we can adopt the same strategy as the first example with three variables and recover r1 and r2 separately. However, I would really want to find a way to recover r1 and r2 directly using the full data with models like m2 and m3.
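For completeness, a sketch of the separate recovery from m4 and m5, mirroring the m1 code above:

v4 <- unlist(lapply(VarCorr(m4), `[`, 1))
v4 / (v4 + attr(VarCorr(m4), "sc")^2)   # should be close to r1 = 0.2
v5 <- unlist(lapply(VarCorr(m5), `[`, 1))
v5 / (v5 + attr(VarCorr(m5), "sc")^2)   # should be close to r2 = 0.8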

One noticeable aspect is that in the model m1 the total variance is conserved, in the sense that the sum of the two variances from the modeling result ($0.4916+0.5006\approx 1$) equals the variance (which is 1) of the simulating distribution $N((0, 0, 0)', S)$. In contrast, this is not true for the models m2 and m3: the 3 variances in summary(m2) and summary(m3) do not add up to 1. Does this indicate that there is variance leakage in the two latter models, or that m2 and m3 are not parameterized properly for retrieving r1 and r2?
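This can be checked numerically (a small check, assuming the fitted objects above):

# total of the variance components reported by each fit
sum(as.data.frame(VarCorr(m1))$vcov)   # close to 1
sum(as.data.frame(VarCorr(m2))$vcov)   # noticeably different from 1
sum(as.data.frame(VarCorr(m3))$vcov)   # noticeably different from 1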

So, I'm stuck: Is there a way to recover the two correlations r1 and r2 through variance partitioning or parameterization?

Best Answer

The "problem"

One noticeable aspect is that in the model m1 the total variance is conserved, in the sense that the sum of the two variances from the modeling result ($0.4916+0.5006\approx 1$) equals the variance (which is 1) of the simulating distribution $N((0, 0, 0)', S)$. In contrast, this is not true for the models m2 and m3: the 3 variances in summary(m2) and summary(m3) do not add up to 1. Does this indicate that there is variance leakage in the two latter models, or that m2 and m3 are not parameterized properly for retrieving r1 and r2?

The difference in the variance is expected, and you cannot get rid of it (at least not with a simple solution; see the end for more).

The problem is that a mixed model

$$\begin{aligned} y_{ijk} &\sim N(\mu_i+\mu_j,\,\sigma_\epsilon^2) \\ \mu_i &\sim \begin{cases} N(0,\sigma_{r_1}^2) & \text{if } R_1 = 1 \\ 0 & \text{if } R_1 = 0 \end{cases} \\ \mu_j &\sim \begin{cases} N(0,\sigma_{r_2}^2) & \text{if } R_2 = 1 \\ 0 & \text{if } R_2 = 0 \end{cases} \end{aligned}$$

mixes the noise and the random effects, such that the variance of the measurements $y$ is not the same across groups when $\sigma_{r_1}$ and $\sigma_{r_2}$ differ (and you need them to differ to get the two different correlations).
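A minimal simulation sketch of this point (with made-up variance values, not tied to the question's numbers):

sig_e <- 1; sig_r1 <- 0.5; sig_r2 <- 1.5        # hypothetical values
mu_i  <- rnorm(ns, 0, sig_r1)                   # random effects for the P pairs
mu_j  <- rnorm(ns, 0, sig_r2)                   # random effects for the T pairs
y_P   <- rep(mu_i, 2) + rnorm(2*ns, 0, sig_e)   # two noisy observations per mu_i
y_T   <- rep(mu_j, 2) + rnorm(2*ns, 0, sig_e)   # two noisy observations per mu_j
var(y_P)   # close to sig_e^2 + sig_r1^2 = 1.25
var(y_T)   # close to sig_e^2 + sig_r2^2 = 3.25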

The assumed probability density for observing the stacked vector of samples $\mathbf{y}$ is the zero-mean multivariate normal density

$$f_{\mathbf{Y}}(\mathbf{y}) = \det(2\pi\Sigma)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\mathbf{y}^T\Sigma^{-1}\mathbf{y}}$$

With $\Sigma$ having a block structure like

$$\Sigma = \begin{bmatrix} J_1 & 0 & \dots &0 \\ 0 & J_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & J_n \\ \end{bmatrix}$$

and the blocks are like

$$J_i = \begin{bmatrix} \sigma_\epsilon^2+\sigma_{r_1}^2 & \sigma_{r_1}^2 \\ \sigma_{r_1}^2 & \sigma_\epsilon^2+\sigma_{r_1}^2 \end{bmatrix} $$

or

$$J_i = \begin{bmatrix} \sigma_\epsilon^2+\sigma_{r_2}^2 & \sigma_{r_2}^2 \\ \sigma_{r_2}^2 & \sigma_\epsilon^2+\sigma_{r_2}^2 \end{bmatrix} $$

These blocks correspond to the observations that share the same mean $\mu_i$ or the same $\mu_j$, i.e. the pairs in your data with the same label for the factor f (P1, P2, etc., or T1, T2, etc.).

So on the diagonal you get the values $\sigma_\epsilon^2+\sigma_{r_1}^2$ or $\sigma_\epsilon^2+\sigma_{r_2}^2$. Your example has all values equal to $1$ on the diagonal, and that cannot be reproduced by such a mixed-effects model (where the correlation originates from adding random terms, which also changes the variance).
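Plugging the m2 estimates quoted in the question into these diagonals makes the mismatch concrete:

vc2 <- as.data.frame(lme4::VarCorr(m2))$vcov   # c(var_R1, var_R2, residual)
vc2[1] + vc2[3]   # implied diagonal for the P block: about 0.88, not 1
vc2[2] + vc2[3]   # implied diagonal for the T block: about 1.15, not 1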


Example

lmer can be used to estimate the following setup (a copy of your code, but with the diagonal no longer equal to 1):

r1 <- 0.2; r2 <- 0.8        # random-effect variances (= within-pair covariances)
sig <- 2                    # residual (error) variance
ns <- 20000                 # number of samples
S  <- matrix(c(sig+r1,r1,0,0,    # covariance structure of the quadrivariate data
               r1,sig+r1,0,0,    # the first and second variables are correlated
               0,0,sig+r2,r2,    # the third and fourth variables are correlated
               0,0,r2,sig+r2), nrow=4, ncol=4)
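For a self-contained run, the data can be re-simulated exactly as in the question, just with the new S:

# simulated data (same construction as in the question)
dat <- data.frame(f = c(rep(paste0('P',1:ns), 2), rep(paste0('T',1:ns), 2)),
                  R = c(rep('P',2*ns), rep('T',2*ns)),
                  y = c(MASS::mvrnorm(n=ns, mu=rep(0,4), Sigma=S)))
dat$R1 <- as.numeric(dat$R=='P')   # dummy code for the first pair of variables
dat$R2 <- as.numeric(dat$R=='T')   # dummy code for the second pair of variables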

It does this well

m <- lme4::lmer(y ~ 0 + (0+R1|f) + (0+R2|f), data=dat)
vc = as.data.frame(lme4::VarCorr(m, comp = "Variance"))
vc

Gives

       grp var1 var2      vcov     sdcor
1        f   R1 <NA> 0.1998553 0.4470517
2      f.1   R2 <NA> 0.7980240 0.8933219
3 Residual <NA> <NA> 2.0154196 1.4196547
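In the same spirit as the m1 recovery in the question, the implied within-pair correlations then follow from the variance ratios (a small follow-up, not in the original output):

vc$vcov[1] / (vc$vcov[1] + vc$vcov[3])   # close to r1/(sig+r1) = 0.2/2.2
vc$vcov[2] / (vc$vcov[2] + vc$vcov[3])   # close to r2/(sig+r2) = 0.8/2.8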

So lmer assumes the same residual variance for all groups; adding the group-specific random-effect variance on top then gives a different total variance per group, and that is not your case (your diagonal is constant).

That's why the method does not work with multiple groups/correlations.

Solution

  1. I would have to test it, but I guess that you could apply weights to the observations, which effectively allows a different variance for different points.

    A problem is that you do not know beforehand which weights to apply, because the weights depend on the solution (and the solution depends on the weights). But I imagine that with some sort of iterative reweighting algorithm you could update the weights and approach the solution.

  2. You could solve the estimation manually by finding the MLE, maximizing the likelihood function with an optimizer; a small sketch is given after this list. An example is given here: https://stats.stackexchange.com/a/337348

  3. The point of using a linear mixed model is not very clear to me. You write

    "I would really want to find a way to recover r1 and r2 directly using the full data with models like m2 and m3."

    But I guess that behind all the mathematical layers of complexity of mixed-effects modeling and likelihood functions there is not a much better estimate hidden$^\star$. Effectively it can't be much better than simply computing the correlation in the sample and using that as an estimate of the correlation.

    So a 'solution' to the problem would be to not use a mixed-effects model to estimate the correlation, and instead estimate it directly (also sketched after this list).

    $^\star$ One advantage of using the full data might be that, if your assumption is that the variances on the diagonal must be equal, you could use a pooled (grouped) estimate of the variance.
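A minimal sketch of options 2 and 3, assuming dat still holds the question's original quadrivariate simulation (the one with a unit diagonal). The reshaping relies on the column order produced by c(mvrnorm(...)) in the question's code; the likelihood parameterization (log standard deviations and atanh of the correlation) is my own choice, and mvtnorm is used for the bivariate normal density:

pairs_P <- matrix(dat$y[dat$R=='P'], ncol=2)   # rows are (variable 1, variable 2) pairs
pairs_T <- matrix(dat$y[dat$R=='T'], ncol=2)   # rows are (variable 3, variable 4) pairs

# option 2: direct MLE of a bivariate normal per group, via an optimizer
negll <- function(par, Y) {            # par = c(log sd1, log sd2, atanh(rho))
  s1 <- exp(par[1]); s2 <- exp(par[2]); rho <- tanh(par[3])
  S  <- matrix(c(s1^2, rho*s1*s2, rho*s1*s2, s2^2), 2, 2)
  -sum(mvtnorm::dmvnorm(Y, sigma=S, log=TRUE))
}
fitP <- optim(c(0,0,0), negll, Y=pairs_P)
fitT <- optim(c(0,0,0), negll, Y=pairs_T)
tanh(fitP$par[3])   # MLE of the first correlation, close to r1 = 0.2
tanh(fitT$par[3])   # MLE of the second correlation, close to r2 = 0.8

# option 3: the plain sample correlations
cor(pairs_P)[1,2]
cor(pairs_T)[1,2]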
