Solved – Why can’t I simulate variables with negative correlation? How to fix it

cholesky decomposition, correlation, r

I would like to simulate data with different correlation matrices, with this method:

M = matrix(c(1.0,  0.6,  0.6, 0.6, 
             0.6,  1.0, -0.2, 0.0,
             0.6, -0.2,  1.0, 0.0, 
             0.6,  0.0,  0.0, 1.0 ),
           nrow=4, ncol=4)

Cholesky decomposition:

L = chol(M)
nvars = dim(L)[1]

Random variables:

r = t(L) %*% matrix(rnorm(nvars * megf), nrow=nvars, ncol=megf)
r = t(r)

It worked with positive correlations, but I also need negative ones. Why doesn't it work, and how can I fix it?

Source of the code

Best Answer

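Your correlation matrix is not positive definite (it is not even positive semidefinite), which means that no real dataset could have produced it. The negative correlation itself is not the problem; this particular combination of correlations is mutually inconsistent. The negative determinant shows this, because the determinant is the product of the eigenvalues and a positive definite matrix has only positive eigenvalues: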

> det(M)
[1] -0.2496
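
The determinant is only a quick hint; the eigenvalues give a more direct check. A minimal sketch in base R, assuming the matrix M from the question (the names ev and chol_result are introduced here for illustration):

```r
M <- matrix(c(1.0,  0.6,  0.6, 0.6,
              0.6,  1.0, -0.2, 0.0,
              0.6, -0.2,  1.0, 0.0,
              0.6,  0.0,  0.0, 1.0),
            nrow = 4, ncol = 4)

# Eigenvalues of a symmetric matrix, returned in decreasing order
ev <- eigen(M, symmetric = TRUE, only.values = TRUE)$values
ev
all(ev > 0)   # FALSE: at least one eigenvalue is negative

# chol() needs a positive definite matrix, so it stops with an error here;
# tryCatch() turns that error into its message so the script keeps running
chol_result <- tryCatch(chol(M), error = function(e) conditionMessage(e))
is.character(chol_result)   # TRUE when chol() failed
```

Running the same check on a candidate matrix before simulating is a cheap way to catch an impossible correlation structure early.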

This matrix is positive definite, so it works, and it still contains a negative correlation:

> M=matrix(c(1.0,  0.6,  0.6, 0.6, 
             0.6,  1.0, -0.2, 0.3,
             0.6, -0.2,  1.0, 0.3, 
             0.6,  0.3,  0.3, 1.0)
            ,nrow=4, ncol=4)
> 
> det(M)
[1] 0.0528

Your code also fails as posted because megf, which appears to be the number of observations to draw, is never defined.
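
For reference, here is a sketch of the question's Cholesky approach with megf defined and the positive definite matrix from above. The value megf = 1000 and the seed are assumptions made for this example only:

```r
M <- matrix(c(1.0,  0.6,  0.6, 0.6,
              0.6,  1.0, -0.2, 0.3,
              0.6, -0.2,  1.0, 0.3,
              0.6,  0.3,  0.3, 1.0),
            nrow = 4, ncol = 4)

set.seed(1234)
megf  <- 1000          # assumed number of observations
L     <- chol(M)       # upper-triangular factor, so t(L) %*% L equals M
nvars <- nrow(L)

# nvars x megf matrix of independent standard normal draws
z <- matrix(rnorm(nvars * megf), nrow = nvars, ncol = megf)

# Mix the independent draws, then transpose to one observation per row
r <- t(t(L) %*% z)
round(cor(r), 2)       # should be close to M, up to sampling error
```

Because chol() in R returns the upper-triangular factor, the transpose t(L) is the matrix that multiplies the independent draws.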

You can save a little effort by using the mvrnorm() function from the MASS package:

> library(MASS)
> set.seed(1234)  #Set seed for replicability
> r <- mvrnorm(n=1000, Sigma=M, mu=rep(0, 4) )
> cor(r)
          [,1]       [,2]       [,3]      [,4]
[1,] 1.0000000  0.5748690  0.6330390 0.5950443
[2,] 0.5748690  1.0000000 -0.1879727 0.2915380
[3,] 0.6330390 -0.1879727  1.0000000 0.3048610
[4,] 0.5950443  0.2915380  0.3048610 1.0000000
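
The sample correlations above differ from M only by sampling error. If the simulated data should reproduce M exactly, mvrnorm() has an empirical argument for that. A short sketch, assuming the same positive definite M (the name r_exact is introduced here for illustration):

```r
library(MASS)

M <- matrix(c(1.0,  0.6,  0.6, 0.6,
              0.6,  1.0, -0.2, 0.3,
              0.6, -0.2,  1.0, 0.3,
              0.6,  0.3,  0.3, 1.0),
            nrow = 4, ncol = 4)

set.seed(1234)
# empirical = TRUE makes mu and Sigma the sample mean and covariance,
# rather than the population values
r_exact <- mvrnorm(n = 1000, mu = rep(0, 4), Sigma = M, empirical = TRUE)
max(abs(cor(r_exact) - M))   # zero up to floating-point error
```

Whether exact reproduction is desirable depends on the use case; for studying sampling variability, the default empirical = FALSE is the appropriate choice.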

Related Solutions

Solved – Covariance and correlation matrix comparison

Have you tried using the morphometric approaches of Strauss & Bookstein (1982)? It seems like this may give you a relatively straightforward way to compare your populations. Here's a really brief summary, but there's much more in the paper and other morphometric publications.

1. If necessary, log-transform the 50 measurements ("dimensions").
2. Run a PCA of these dimensions (variables).
   - Note: PC 1 will likely explain almost all of the variance in the dimension data, and it mostly reflects overall size, so...
3. Regress each dimension on PC 1.
4. Use the residuals of each regression to construct a discriminant model for DA based on pre-assigned groups (populations).
5. Use resubstitution error rates to assess morphometric differences between populations.
6. Run MANOVA/ANOVA on the regression residuals, both as an additional assessment of population differences and to identify specific dimensions that differ.
   - Note: you may want to be careful even if the MANOVA results indicate real differences, because of the sheer number of ANOVAs.
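
The PCA and regression steps above can be sketched in R as follows. This is only an illustration on simulated data: the sample size, the five dimensions, and the latent size variable are hypothetical and are not taken from Strauss & Bookstein (1982).

```r
set.seed(42)
n    <- 100                             # hypothetical number of specimens
size <- rnorm(n, mean = 3, sd = 0.3)    # hypothetical latent log-size

# Five simulated log-scale dimensions: shared size plus small independent noise
X <- sapply(1:5, function(j) size + rnorm(n, sd = 0.05))
colnames(X) <- paste0("dim", 1:5)

# PCA of the dimensions; PC 1 should capture most of the variance (overall size)
pca <- prcomp(X)
pc1 <- pca$x[, 1]
summary(pca)$importance["Proportion of Variance", "PC1"]

# Regress each dimension on PC 1 and keep the residuals (size-adjusted values)
res <- apply(X, 2, function(x) resid(lm(x ~ pc1)))
max(abs(cor(res, pc1)))                 # residuals are uncorrelated with PC 1
```

The residual matrix res would then be the input to the discriminant analysis and MANOVA steps. Note that these residuals are linearly dependent by construction, which some procedures will flag.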

Strauss, R. E. and F. L. Bookstein. 1982. The truss: Body form reconstructions in morphometrics. Systematic Biology 31:113–135.

Cholesky Decomposition – How to Use for Correlated Data Simulation

The approach based on the Cholesky decomposition should work; it is described here and is shown in the answer by Mark L. Stone, posted at almost the same time as this answer.

Nevertheless, I have sometimes generated draws from the multivariate Normal distribution $N(\vec\mu, \Sigma)$ as follows:

$$ Y = Q X + \vec\mu \,, \quad \hbox{with}\quad Q=\Phi\Lambda^{1/2} \,, $$

where $Y$ are the final draws, $X$ contains independent draws from the univariate standard Normal distribution, $\Phi$ is a matrix containing the normalized eigenvectors of the target matrix $\Sigma$ in its columns, and $\Lambda$ is a diagonal matrix containing the eigenvalues of $\Sigma$ arranged in the same order as the eigenvectors in the columns of $\Phi$. With this choice, $Q Q^\top = \Phi \Lambda \Phi^\top = \Sigma$, so $Y$ has the target covariance matrix.

Example in R (sorry I'm not using the same software you used in the question):

n <- 10000
corM <- rbind(c(1.0, 0.6, 0.9), c(0.6, 1.0, 0.5), c(0.9, 0.5, 1.0))
set.seed(123)
SigmaEV <- eigen(corM)
eps <- rnorm(n * ncol(SigmaEV$vectors))
Meps <- matrix(eps, ncol = n, byrow = TRUE)    
Meps <- SigmaEV$vectors %*% diag(sqrt(SigmaEV$values)) %*% Meps
Meps <- t(Meps)
# target correlation matrix
corM
#      [,1] [,2] [,3]
# [1,]  1.0  0.6  0.9
# [2,]  0.6  1.0  0.5
# [3,]  0.9  0.5  1.0
# correlation matrix for simulated data
cor(Meps)
#           [,1]      [,2]      [,3]
# [1,] 1.0000000 0.6002078 0.8994329
# [2,] 0.6002078 1.0000000 0.5006346
# [3,] 0.8994329 0.5006346 1.0000000
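
The eigendecomposition and the Cholesky decomposition both work for the same reason: each yields a matrix $Q$ with $Q Q^\top = \Sigma$. A quick check in base R, assuming the corM defined above (the names Q_eig and Q_chol are introduced here for illustration):

```r
corM <- rbind(c(1.0, 0.6, 0.9), c(0.6, 1.0, 0.5), c(0.9, 0.5, 1.0))

# Square root from the eigendecomposition: eigenvectors times sqrt(eigenvalues)
ev    <- eigen(corM, symmetric = TRUE)
Q_eig <- ev$vectors %*% diag(sqrt(ev$values))

# Square root from the Cholesky decomposition: chol() returns the
# upper-triangular factor, so transpose it to get the lower-triangular one
Q_chol <- t(chol(corM))

# Both factors reproduce the target matrix
all.equal(Q_eig %*% t(Q_eig), corM)
all.equal(Q_chol %*% t(Q_chol), corM)
```

The two factors are different matrices, but either one can be used to turn independent standard normal draws into draws with correlation matrix corM.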
