Solved – Generate three pairwise correlated random variables

correlation, simulation

I need to simulate three variables $A,B,P \sim N(0,1)$ such that the Pearson correlations $r_{AB}=\operatorname{cor}(A,B)$ and $r_{BP}=\operatorname{cor}(B,P)$ are given. I need to repeat the simulation multiple times in order to generate several values of $r_{AP}=\operatorname{cor}(A,P)$ and to observe a sample distribution for it; that is why I am not fixing it. We want to see whether, under the simple assumption that all the variables are $\sim N(0,1)$, we can somehow "estimate" $r_{AP}$ given the other two correlations.

Questions concerning the simulation of variables with given correlations have already been addressed multiple times. But none of the suggested methods works well for me, because:

  1. To use the Cholesky decomposition of the correlation matrix (as explained here) I have to fix $r_{AP}$ as well, which is exactly what I don't want to do (see the sketch right after this list);
  2. If instead I use some pairwise correlation simulation (as in here), you can show analytically that the expected value of the sample distribution of $r_{AP}$ is exactly $r_{AB} \cdot r_{BP}$, which takes away any predictive power of the simulation.
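
To make point 1 concrete, here is a minimal sketch in Python/numpy (the numeric values are purely illustrative): with the Cholesky approach, the full $3 \times 3$ correlation matrix, including $r_{AP}$, has to be written down before anything can be drawn.

```python
import numpy as np

# Illustrative values -- note that r_AP must be chosen up front,
# which is exactly what the question wants to avoid.
r_AB, r_BP, r_AP = 0.6, 0.5, 0.3

R = np.array([[1.0,  r_AB, r_AP],
              [r_AB, 1.0,  r_BP],
              [r_AP, r_BP, 1.0]])

L = np.linalg.cholesky(R)                # fails if R is not positive definite
Z = np.random.standard_normal((10_000, 3))
A, B, P = (Z @ L.T).T                    # N(0,1) margins with correlations R
print(np.corrcoef([A, B, P]))
```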

Does anybody have an idea of how to implement such a simulation? I am not familiar with copulas, but I have the feeling they could be a starting point.

**EDIT**

I have been asked several times to clarify my question. I hope this works.

The question is whether, just by supposing that you have $A,B,P \sim N(0,1)$ random variables with given correlations for two of the three pairs (not all three), and no more information at all, you can derive a distribution for the missing correlation. I do not want a sample distribution for the random variables themselves, but for the missing correlation (if there is one at all; I am not even sure about this point).
Here is the pseudo-code:

1. Generate A,B,P ~ N(0,1) s.t. cor(A,B) and cor(B,P) are given
2. Calculate cor(A,P)
3. Repeat 1. and 2. several times (say, 500) and get a distribution for cor(A,P).
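
For concreteness, here is one way this pseudo-code could be implemented in Python/numpy, using the pairwise regression construction from point 2 above (the function name and parameter values are mine, not canonical):

```python
import numpy as np

def simulate_r_AP(r_AB, r_BP, n=100, reps=500, seed=None):
    """Steps 1-3 of the pseudo-code: build A and P from B by regression,
    so that cor(A,B) ~ r_AB and cor(B,P) ~ r_BP hold in expectation,
    then record the sample cor(A,P) for each repetition."""
    rng = np.random.default_rng(seed)
    r_AP = np.empty(reps)
    for i in range(reps):
        B = rng.standard_normal(n)
        # residual variances chosen so A and P are marginally N(0,1)
        A = r_AB * B + np.sqrt(1 - r_AB**2) * rng.standard_normal(n)
        P = r_BP * B + np.sqrt(1 - r_BP**2) * rng.standard_normal(n)
        r_AP[i] = np.corrcoef(A, P)[0, 1]
    return r_AP

samples = simulate_r_AP(0.6, 0.5)
print(samples.mean())   # concentrates near 0.6 * 0.5 = 0.3
```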

The problem is that the ways I have been simulating until now (see the two links above) introduce some structure into the data (in particular, $P$ and $A$ both end up linearly dependent on $B$) and practically nullify the predictive power I am hoping to get from the simulation. But there might be no way at all to create artificial data with "partial structure" (distribution and two pairwise correlations assigned) while still hoping for the third correlation to be, to a certain extent, independent of this structure.

Best Answer

Maybe this little "proof" will clear some things up for you. I especially think that you do not quite understand the correlation coefficient entirely, but that's just my guess and I don't want to offend you.

First, let me remind you that the correlation coefficient is a measure of the linear dependency between two random variables X and Y.

So let's look at the formulas for the covariance and the correlation coefficient:

$Cov(X,Y) = E[(X-E[X])(Y-E[Y])]$

$r_{XY} = \frac{Cov(X,Y)}{\sqrt{Var(X)}\sqrt{Var(Y)}}$

Since $X,Y \sim N(0,1)$, both variances equal $1$, so we get $Cov(X,Y) = r_{XY}$.

So now let's get to linear regression:

$Y = \beta_{0} + \beta_{1}X + \epsilon$

Notice that the intercept $\beta_{0}$ vanishes because $X,Y \sim N(0,1)$ (both means are zero), so we get:

$Y = \beta_{1}X + \epsilon$

Now note that the least-squares slope is $\beta_{1} = r_{XY} \cdot \frac{\sqrt{Var(Y)}}{\sqrt{Var(X)}}$; since $X,Y \sim N(0,1)$ this gives:

$\beta_{1} = r_{XY} = Cov(X,Y)$
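
As a quick numerical sanity check (a sketch in Python/numpy with an arbitrary $r_{XY} = 0.7$), the fitted slope, the sample covariance and the sample correlation should all agree up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 100_000, 0.7
X = rng.standard_normal(n)
Y = r * X + np.sqrt(1 - r**2) * rng.standard_normal(n)  # Y ~ N(0,1), cor = r

slope = np.polyfit(X, Y, 1)[0]   # beta_1 from least squares
print(slope, np.cov(X, Y)[0, 1], np.corrcoef(X, Y)[0, 1])
# all three are ~0.7 up to sampling error
```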

So now let's move on to your example:

What you want to fix is:

$A = r_{AB}B + \epsilon_{AB}$

$P = r_{BP}B + \epsilon_{BP}$

(with $Var(\epsilon_{AB}) = 1 - r_{AB}^{2}$ and $Var(\epsilon_{BP}) = 1 - r_{BP}^{2}$, so that $A$ and $P$ remain standard normal)

What you want to "estimate" is

$r_{AP}$, which is equal to Cov(A,P) since A,P ~ N(0,1) as shown before.

The Covariance formula can be rewritten as:

$Cov(X,Y) = E[XY] - E[X]E[Y]$

So we get $Cov(A,P) = E[AP] - \underbrace{E[A]}_{0}\underbrace{E[P]}_{0}$, again since $A,P \sim N(0,1)$:

$Cov(A,P) = E[AP]$

Let's plug our two equations for $A$ and $P$ in there:

$E[AP] = E[(r_{AB}B + \epsilon_{AB})(r_{BP}B + \epsilon_{BP})]$

$E[AP] = E[r_{AB}B^{2}r_{BP}] + \underbrace{E[r_{AB}B\epsilon_{BP}]}_{0} + \underbrace{E[\epsilon_{AB}Br_{BP}]}_{0} + \underbrace{E[\epsilon_{AB}\epsilon_{BP}]}_{0}$, since the residuals $\epsilon$ are assumed to have mean zero and to be uncorrelated with $B$ and with each other (a crucial regression assumption)

$E[AP] = r_{AB}r_{BP}E[B^{2}]$, because $r_{AB}$ and $r_{BP}$ are constants and factor out of the expectation

$E[AP] = r_{AB}r_{BP}\underbrace{E[B^{2}]}_{1}$, because $B \sim N(0,1)$ implies $E[B^{2}] = Var(B) = 1$

So we finally get $E[AP] = r_{AB}r_{BP}$, which can be rewritten as:

$r_{AP} = r_{AB}r_{BP}$

So now you should also get an idea of when this does not hold true: for example, if the residuals $\epsilon$ are not uncorrelated (in reality they are almost never entirely uncorrelated).
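
To illustrate, here is a small sketch (Python/numpy, with an arbitrary residual correlation $\rho_{\epsilon} = 0.4$) showing how correlated residuals pull $r_{AP}$ away from the product $r_{AB}r_{BP}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
r_AB, r_BP, rho_eps = 0.6, 0.5, 0.4   # rho_eps is purely illustrative

B = rng.standard_normal(n)
# residuals correlated with each other (but still independent of B)
Z1 = rng.standard_normal(n)
Z2 = rho_eps * Z1 + np.sqrt(1 - rho_eps**2) * rng.standard_normal(n)

A = r_AB * B + np.sqrt(1 - r_AB**2) * Z1
P = r_BP * B + np.sqrt(1 - r_BP**2) * Z2

print(np.corrcoef(A, P)[0, 1])   # observed r_AP, ~0.58 here
print(r_AB * r_BP)               # the product alone: 0.30
# the gap equals Cov(eps_AB, eps_BP)
# = rho_eps * sqrt(1 - r_AB^2) * sqrt(1 - r_BP^2)
```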

Maybe @Glen_b♦ can check this answer for mistakes or elaborate further on situations where this does not hold true, because he has far more knowledge than I have.
