Generate two random correlation matrices which share equal correlations

correlation matrixcovariance-matrixlinear algebramatrixrandom-generation

My setting is, I want to simulate a data set in two conditions, e.g. control and disease. I want them to share mostly the same correlations except some should be different to simulate a "signal" between the two conditions.

So I'm trying to simulate two correlation matrices (i.e. they have to be positive semidefinite, symmetric with ones on the diagonal) which share some equal correlations, e.g. the first 2 rows should have the same correlation between the two, the rest is random/shuffled, like this:

Note how A and B have the same correlation values in both (first two rows and columns are the same), the rest is different.

Is this possible? I can generate a random correlation matrix using the last method described here, but I can't figure out how to change/shuffle a subset of the matrix still keep them positive semidefinite. I tried shuffling the elements of the rows except the diagonal values and make it symmetric again with

cor.mat[lower.tri(cor.mat)] <- t(cor.mat)[lower.tri(cor.mat)]

but then the matrix isn't positive semidefinite anymore.

Any suggestions would be greatly appreciated!

EDIT:

I found this post where the question was to complete a partial matrix to be positive definite one, could this be used in my case? E.g. generate one matrix, delete the rows I want to randomize and fill it up with random values?

Unfortunately I'm no mathematician and I don't really understand the explanations and can't translate the answers into R code, could someone help with this?

Best Answer

In case anyone also has this problem, I think I found a solution that works for me:

I simulate one correlation matrix, duplicate it and then randomize some values in the duplicate, resulting in the second "condition" matrix. To get this new matrix to be positive semi-definite as well, I use the R function nearPD() in the Matrix package. nearPD() finds the nearest positive semi-definite by adjusting all values slightly till a result is found that fulfills the criteria. This doesn't give me exactly what I wanted (since all values are adjusted in the new matrix, they don't match exactly with the values in the original matrix) but it's close enough to work in my case.

Using vine method

In this thread: How to efficiently generate random positive-semidefinite correlation matrices? -- I described and provided the code for two efficient algorithms of generating random correlation matrices. Both come from a paper by Lewandowski, Kurowicka, and Joe (2009).

Please see my answer there for a lot of figures and matlab code. Here I would only like to say that the vine method allows to generate random correlation matrices with any distribution of partial correlations (note the word "partial") and can be used to generate correlation matrices with large off-diagonal values. Here is the relevant figure from that thread:

Vine method

The only thing that changes between subplots, is one parameter that controls how much the distribution of partial correlations is concentrated around $\pm 1$. As OP was asking for an approximately normal distribution off-diagonal, here is the plot with histograms of the off-diagonal elements (for the same matrices as above):

Off-diagonal elements

I think this distributions are reasonably "normal", and one can see how the standard deviation gradually increases. I should add that the algorithm is very fast. See linked thread for the details.

My original answer

A straight-forward modification of your method might do the trick (depending on how close you want the distribution to be to normal). This answer was inspired by @cardinal's comments above and by @psarka's answer to my own question How to generate a large full-rank random correlation matrix with some strong correlations present?

The trick is to make samples of your $\mathbf X$ correlated (not features, but samples). Here is an example: I generate random matrix $\mathbf X$ of $1000 \times 100$ size (all elements from standard normal), and then add a random number from $[-a/2, a/2]$ to each row, for $a=0,1,2,5$. For $a=0$ the correlation matrix $\mathbf X^\top \mathbf X$ (after standardizing the features) will have off-diagonal elements approximately normally distributed with standard deviation $1/\sqrt{1000}$. For $a>0$, I compute correlation matrix without centering the variables (this preserves the inserted correlations), and the standard deviation of the off-diagonal elements grow with $a$ as shown on this figure (rows correspond to $a=0,1,2,5$):

random correlation matrices

All these matrices are of course positive definite. Here is the matlab code:

offsets = [0 1 2 5];
n = 1000;
p = 100;

rng(42) %// random seed

figure
for offset = 1:length(offsets)
    X = randn(n,p);
    for i=1:p
        X(:,i) = X(:,i) + (rand-0.5) * offsets(offset);
    end
    C = 1/(n-1)*transpose(X)*X; %// covariance matrix (non-centred!)

    %// convert to correlation
    d = diag(C);
    C = diag(1./sqrt(d))*C*diag(1./sqrt(d));

    %// displaying C
    subplot(length(offsets),3,(offset-1)*3+1)
    imagesc(C, [-1 1])

    %// histogram of the off-diagonal elements
    subplot(length(offsets),3,(offset-1)*3+2)
    offd = C(logical(ones(size(C))-eye(size(C))));
    hist(offd)
    xlim([-1 1])

    %// QQ-plot to check the normality
    subplot(length(offsets),3,(offset-1)*3+3)
    qqplot(offd)

    %// eigenvalues
    eigv = eig(C);
    display([num2str(min(eigv),2) ' ... ' num2str(max(eigv),2)])
end

The output of this code (minimum and maximum eigenvalues) is:

0.51 ... 1.7
0.44 ... 8.6
0.32 ... 22
0.1 ... 48

Best Answer

Related Solutions

Solved – How to efficiently generate random positive-semidefinite correlation matrices

Solved – How to generate random correlation matrix that has approximately normally distributed off-diagonal entries with given standard deviation

Using vine method

My original answer

Related Question