I want to retrieve the correlations in a multivariate dataset. Let me first start with a simple case with three variables among which the first two are correlated. In other words, the three variables are assumed to follow a trivariate Gaussian $(y_{1}, y_{2}, y_{3})' \sim N((0, 0, 0)', S)$:
require(MASS)
r1 <- 0.5 # correlation value to be recovered
ns <- 2000 # number of samples
S <- matrix(c(1,r1,0, # correlation structure of trivariate data
r1,1,0,
0,0,1), nrow=3, ncol=3)
# simulated trivariate data
dat <- data.frame(f = c(rep(paste0('P',1:ns), 2), paste0('U',1:ns)),
y = c(mvrnorm(n=ns, mu=c(0, 0, 0), Sigma=S)))
In the data frame dat
, each pair of samples from the two correlated variables (with a correlation value of r1
in the variance-covariance matrix S
) are coded with the same label P
in the factor 'f' while the samples for the third variable are coded with the label U
. Now we can construct the following model $y_{ij} = \alpha + \mu_i + \epsilon_{ij}$,
require(lme4)
m1 <- lmer(y ~ 1 + (1|f), data=dat)
With the variances from the model m1 output (each simulated dataset may lead to slightly different results):
summary(m1)
Random effects:
Groups Name Variance Std.Dev.
f (Intercept) 0.4916 0.7012
Residual 0.5006 0.7075
we can successfully recover the correlation r1
as below (which should be very close to the simulated r1
value, 0.5): $\frac{0.4916}{0.4916+0.5006} \approx 0.5$
tmp <- unlist(lapply(VarCorr(m1), `[`, 1))
# recover the correlation r1
tmp/(tmp+attr(VarCorr(m1), "sc")^2)
Now Let's switch to a case with 4 variables among which the first and second as well as the third and fourth are correlated. In other words, the four variables are assumed to follow a quadrivariate Gaussian $(y_{1}, y_{2}, y_{3}, y_{4})' \sim N((0, 0, 0, 0)', S)$:
r1 <- 0.2; r2 <- 0.8 # correlation value to be recovered
ns <- 2000 # number of samples
S <- matrix(c(1,r1,0,0, # correlation structure of quadrivariate data
r1,1,0,0, # the first and second variables are correlated
0,0,1,r2, # the third and fourth variables are correlated
0,0,r2,1), nrow=4,ncol=4)
# simulated data
dat <- data.frame(f = c(rep(paste0('P',1:ns), 2), rep(paste0('T',1:ns), 2)),
R=c(rep('P',2*ns), rep('T',2*ns)),
y = c(mvrnorm(n=ns, mu=rep(0,4), Sigma=S)))
In the data frame dat
, the first and second variables are correlated (with coefficient r1
); each pair of their samples are coded together with the same label in the factor f
. Similarly, the third and fourth variables are correlated (with coefficient r2
); each pair of their samples are coded together with the same label in the factor f
. Because of the correlation structure, all the samples are categorized into 2 (instead of ) levels in the factor R
. Our goal is to use the simulated data to recover r1
and r2
.
With the following dummy coding
dat$R1 <- as.numeric(dat$R=='P') # dummy code the first and second variables (r1)
dat$R2 <- as.numeric(dat$R=='T') # dummy code the third and fourth variables (r2)
I have considered the following two possible models
m2 <- lmer(y ~ 1 + (0+R1|f) + (0+R2|f), data=dat)
m3 <- lmer(y ~ 1 + (1|f) + (0+R2|f), data=dat)
To avoid over-parameterization, I don't include a term (1|f)
(or (0+R1|f)
) in the model m2
(or m3
). However, I've been struggling to figure out a way to recover the correlations r1
and r2
with the variances from the random effects based on the models m2
and m3
. In other words, the variances from the random effects in m2
and m3
don't seem to allow me to reconstruct r1
and r2
.
summary(m2)
Random effects:
Groups Name Variance Std.Dev.
f R1 0.3835 0.6193
f.1 R2 0.6537 0.8085
Residual 0.5012 0.7079
and
summary(m3)
Random effects:
Groups Name Variance Std.Dev.
f (Intercept) 0.3835 0.6193
f.1 R2 0.2702 0.5198
Residual 0.5012 0.7079
Of course we could use a workaround solution by reducing the situation into two cases with two variables:
# workaround solution
m4 <- lmer(y ~ 1 + (1|f), data=dat[dat$R2!=1,]) # recover r1 like model m1 above
m5 <- lmer(y ~ 1 + (1|f), data=dat[dat$R1!=1,]) # recover r2 like model m1 above
Then we can adopt the same strategy as the first example with three variables and recover r1
and r2
separately. However, I would really want to find a way to recover r1
and r2
directly using the full data with models like m2
and m3
.
One noticeable aspect is that in the model m1
the total variance is conserved in the sense that the sum of the two variances from the modeling result ($0.4916+0.5006\approx 1$) is equal to the the variance (which is 1) of the simulating distribution $N((0, 0, 0)', S)$. In contrast, this is not true for the models m2
and m3
: the 3 variances in summary(m2)
and summary(m3)
do not add up to 1. This may indicate that there is variance leakage happening in the two latter models? Or the models m2
and m3
are not parameterized properly for retrieving r1
and r2
?
So, I'm stuck: Is there a way to recover the two correlations r1
and r2
through variance partitioning or parameterization?
Best Answer
The "problem"
The difference in the variance is expected, and you can not get rid of it (at least not with a simple solution, see at the end for more).
The problem is that a mixed model
$$\begin{array}{} y_{ijk} \sim N(\mu_i+\mu_j,\sigma_\epsilon^2) \\\mu_i \sim \begin{cases} N(0,\sigma_{r_1}^2) \quad \text{if $R_1 = 1$} \\ 0 \quad \text{if $R_1 = 0$} \end{cases}\\ \mu_j \sim \begin{cases} N(0,\sigma_{r_2}^2) \quad \text{if $R_2 = 1$} \\ 0 \quad \text{if $R_2 = 0$} \end{cases}\\\end{array}$$
will be mixing the noise and the random effects, such that the variance of the measurements $y$ is not the same if the $\sigma_{r_1}$ and $\sigma_{r_2}$ are different (and you need them to be different to get the different correlations).
The assumed probability density to observe a sample of $\mathbf{ y_{ijk} }$ is:
$$f_{\mathbf{ Y_{ijk} }}(\mathbf{ y_{ijk} }) =det((2\pi)^k\Sigma)^{-\frac{1}{2}} e^{\mathbf{ y_{ijk}^T\Sigma y_{ijk}}}$$
With $\Sigma$ having a block structure like
$$\Sigma = \begin{bmatrix} J_1 & 0 & \dots &0 \\ 0 & J_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & J_n \\ \end{bmatrix}$$
and the blocks are like
$$J_i = \begin{bmatrix} \sigma_f^2+\sigma_{r_1}^2 & \sigma_{r_1}^2 \\ \sigma_{r_1}^2 & \sigma_f^2+\sigma_{r_1}^2 \end{bmatrix} $$
or
$$J_i = \begin{bmatrix} \sigma_f^2+\sigma_{r_2}^2 & \sigma_{r_2}^2 \\ \sigma_{r_2}^2 & \sigma_f^2+\sigma_{r_2}^2 \end{bmatrix} $$
These blocks will be for the observations with the same means $\mu_i$ or same $\mu_j$ those are the pairs in your data matrix with the same label for the factor
f
(T1, T2, etc. Or P1, P2, etc.).So on the diagonal you get values $\sigma_f^2+\sigma_{r_2}^2$ or $\sigma_f^2+\sigma_{r_1}^2$. Your example has all values of $1$ on the diagonal and can not be modelled as a mixed effects model (where the correlation originates from the addition of random terms resulting in different variance)
Example
lmer can be used to estimate the following (a copy of your code but with the diagonal not equal to 1)
It does this well
Gives
So lmer assumes the same residual for all groups, resulting in different variance for the groups (if you add the varying random effect to it) and this is not your case.
That's why the method does not work with multiple groups/correlations.
Solution
I would have to test it but I guess that you could apply weights to the observations, which is effectively like different variance for the points.
A problem is that beforehand you do not know which weights to apply, because these weights depend on the solution (and the solution depends on the weights). But I imagine that with a sort of iteratively reweighing algorithm you can update the weights and approach the solution.
You could solve the estimation manually by finding the MLE through maximizing the likelihood function with an optimizer. An example is given here: https://stats.stackexchange.com/a/337348
The point of using a linear mixed model is not very clear to me. You write
But I guess that behind all the mathematical layers of complexity of mixed effects modeling and likelihood functions there is not hidden a much better estimation$^\star$. Effectively it can't be much better than simply computing the correlation in the sample and use that as an estimate for the correlation.
So a 'solution' to the problem would be to not use a mixed effects model to estimate the correlation. (And instead estimate it directly)
$^\star$ One advantage of using the full data might be if your assumption is that the variance on the diagonal must be equal, then you could use a grouped estimate for the variance.