Solved – How to combine 2 variables that are each strongly correlated with a 3rd variable

correlation

I have 2 sets of variables that are weakly correlated with each other but highly correlated with a third variable. Is there any method to combine these 2 variables to achieve a much stronger correlation with the third variable?
For example:
set a (correlation with b: 0.2, correlation with c: 0.8)
set b (correlation with a: 0.3, correlation with c: 0.75)

combination set: m*a + n*b (correlation with c: 0.95)? (Is there a mathematical approach to form the combination set and find the weights?)

Best Answer

The problem can be solved straightforwardly by optimization.

Let $L$ be defined as the linear combination $m a + n b$ of $a, b$ using coefficients $m, n$.

The maximum achievable correlation between $L$ and $c$ depends only on the correlations between $a, b, c$, and is independent of the variances of $a, b, c$ (showing this is left as an exercise to the reader). However, the optimal coefficients $m^*, n^*$ achieving the maximum correlation between $L$ and $c$ do depend on the variances of $a, b, c$. Also note that $m^*, n^*$ are only unique to within a positive scale factor, i.e., $s m^*, s n^*$ for any $s > 0$ is also optimal if $m^*, n^*$ are optimal. So if you only want to know the maximum achievable correlation between $L$ and $c$, you can set variances of $a, b, c$ all $= 1$ and deal entirely with correlations.

Finding the optimal values of $m$ and $n$ for any set of input data can be posed as an optimization problem whose objective is $correlation(L,c)$:

Maximize with respect to $m$ and $n$ (note: the various variances and covariances are input data to the optimization problem):

$\dfrac{m Cov(a,c) + n Cov(b,c)}{ \sqrt{m^2 Var(a) + n^2 Var(b) + 2 m n Cov(a,b)} \sqrt{Var(c)}}$

Note that the factor $\sqrt{Var(c)}$ is a positive constant and is not needed to find the optimal $m, n$, but I included it so that the objective function directly equals the correlation between $L$ and $c$. Also, as mentioned above, if we merely want to know the maximum achievable correlation between $L$ and $c$, but not how to achieve it, we can set $Var(a) = Var(b) = Var(c) = 1$, and then $Cov(a,c) = correlation(a,c)$, $Cov(b,c) = correlation(b,c)$, and $Cov(a,b) = correlation(a,b)$.
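One way to see why the maximum depends only on the correlations is sketched below. The symbols $\sigma_a, \sigma_b, \sigma_c$ (standard deviations), $\rho_{ab}, \rho_{ac}, \rho_{bc}$ (correlations), and the rescaled weights $u, v$ are notation introduced here for illustration; they do not appear in the original answer. Substituting $u = m \sigma_a$ and $v = n \sigma_b$ into the objective above gives

$\dfrac{m Cov(a,c) + n Cov(b,c)}{ \sqrt{m^2 Var(a) + n^2 Var(b) + 2 m n Cov(a,b)} \sqrt{Var(c)}} = \dfrac{u \rho_{ac} + v \rho_{bc}}{\sqrt{u^2 + v^2 + 2 u v \rho_{ab}}}$

The right-hand side contains no variances, and since $\sigma_a, \sigma_b > 0$, maximizing over $(u, v)$ is equivalent to maximizing over $(m, n)$. Assuming $|\rho_{ab}| < 1$, the standard result for this ratio is that the maximum is attained at $u^* \propto \rho_{ac} - \rho_{ab} \rho_{bc}$, $v^* \propto \rho_{bc} - \rho_{ab} \rho_{ac}$, with value

$\sqrt{\dfrac{\rho_{ac}^2 + \rho_{bc}^2 - 2 \rho_{ab} \rho_{ac} \rho_{bc}}{1 - \rho_{ab}^2}}$

The variances re-enter only when converting back to $m^* = u^* / \sigma_a$ and $n^* = v^* / \sigma_b$.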

This can be solved via numerical optimization, which is straightforward regardless of the input values.
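A minimal sketch of one possible numerical approach, using only the Python standard library. It is not necessarily the solver the answer had in mind. Because the objective is unchanged when $(m, n)$ is multiplied by a positive constant, the sketch searches over the direction $\theta$ of $(m, n) = (\cos\theta, \sin\theta)$: a coarse grid brackets the maximum, and a golden-section search refines it. The function names, the `grid` and `iters` parameters, and the assumption that $a$ and $b$ are not perfectly correlated (so the variance of $L$ is never zero) are all introduced here for illustration.

```python
import math


def correlation_L_c(m, n, var_a, var_b, var_c, cov_ab, cov_ac, cov_bc):
    """Correlation between L = m*a + n*b and c, computed from second moments."""
    numerator = m * cov_ac + n * cov_bc
    var_L = m * m * var_a + n * n * var_b + 2.0 * m * n * cov_ab
    return numerator / math.sqrt(var_L * var_c)


def maximize_correlation(var_a, var_b, var_c, cov_ab, cov_ac, cov_bc,
                         grid=3600, iters=60):
    """Return (m, n, correlation) with (m, n) on the unit circle.

    Assumes |correlation(a, b)| < 1 so that Var(L) > 0 for every direction.
    """
    def f(theta):
        return correlation_L_c(math.cos(theta), math.sin(theta),
                               var_a, var_b, var_c, cov_ab, cov_ac, cov_bc)

    # Coarse grid over the full circle to bracket the global maximum.
    step = 2.0 * math.pi / grid
    best = max(range(grid), key=lambda k: f(k * step))
    lo, hi = (best - 1) * step, (best + 1) * step

    # Golden-section search shrinks the bracket around the maximum.
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        x1 = hi - g * (hi - lo)
        x2 = lo + g * (hi - lo)
        if f(x1) < f(x2):
            lo = x1
        else:
            hi = x2
    theta = 0.5 * (lo + hi)
    return math.cos(theta), math.sin(theta), f(theta)
```

For instance, with unit variances and the question's correlations ($0.3$ between $a$ and $b$, $0.8$ between $a$ and $c$, $0.75$ between $b$ and $c$), `maximize_correlation(1, 1, 1, 0.3, 0.8, 0.75)` should return a correlation of about $0.9622$. The returned weights are normalized to unit length, so only their ratio is meaningful.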

Alternatively, I will now show the closed form solution.

$m^* = \dfrac{Cov(b,c) Cov(a,b) - Cov(a,c) Var(b)}{Cov(a,c) Cov(a,b) -Cov(b,c) Var(a)} n^*$

unless the denominator $= 0$, in which case $n^* = 0$ and $m^*$ can be taken as $1$ if $Cov(a,c)$ is non-negative, or $-1$ if it is negative.

If the denominator $\ne 0$, then $n^*$ can be taken as either $1$ or $-1$, whichever choice makes $correlation(L,c)$ non-negative.
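The closed form above translates fairly directly into code. The following is a sketch under the stated formula, with hypothetical function names, and with an exact comparison to zero for the degenerate case (a small tolerance may be preferable with noisy estimates).

```python
import math


def optimal_weights(var_a, var_b, cov_ab, cov_ac, cov_bc):
    """Return (m*, n*) from the closed form, scaled so that n* is +1, -1 or 0."""
    denom = cov_ac * cov_ab - cov_bc * var_a
    if denom == 0.0:
        # Degenerate case: n* = 0, and m* takes the sign of Cov(a, c).
        return (1.0 if cov_ac >= 0.0 else -1.0), 0.0
    ratio = (cov_bc * cov_ab - cov_ac * var_b) / denom
    # With m = ratio * n, the numerator of the correlation is
    # n * (ratio * Cov(a,c) + Cov(b,c)); choose the sign of n that keeps it >= 0.
    n = 1.0 if ratio * cov_ac + cov_bc >= 0.0 else -1.0
    return ratio * n, n


def correlation_L_c(m, n, var_a, var_b, var_c, cov_ab, cov_ac, cov_bc):
    """Correlation between L = m*a + n*b and c, computed from second moments."""
    numerator = m * cov_ac + n * cov_bc
    var_L = m * m * var_a + n * n * var_b + 2.0 * m * n * cov_ab
    return numerator / math.sqrt(var_L * var_c)


# Unit variances, correlations 0.3 (a,b), 0.8 (a,c), 0.75 (b,c).
m, n = optimal_weights(1.0, 1.0, 0.3, 0.8, 0.75)
print(m, n, correlation_L_c(m, n, 1.0, 1.0, 1.0, 0.3, 0.8, 0.75))

# Same correlations with Var(a) = 1, Var(b) = 4, Var(c) = 9,
# so Cov(a,b) = 0.6, Cov(a,c) = 2.4, Cov(b,c) = 4.5.
m, n = optimal_weights(1.0, 4.0, 0.6, 2.4, 4.5)
print(m, n, correlation_L_c(m, n, 1.0, 4.0, 9.0, 0.6, 2.4, 4.5))
```

Both calls should report the same correlation, while the weight $m^*$ changes with the variances.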

Example 1: Numerical results for the example correlation values in the question. First of all, note that $correlation(a,b)$ is listed as $0.2$ and $correlation(b,a)$ is listed as $0.3$, but these must be the same. A value of $0.2$ is incompatible with the combination of $correlation(a,c) = 0.8$ and $correlation(b,c) = 0.75$ because the resulting covariance matrix between $a, b, c$ would not be positive semi-definite (but for the adventurous, if we use it anyway, we can break the unity barrier in correlation, achieving a maximum correlation between $L$ and $c$ of $1.00130$, ha ha). Therefore, I will use the value $correlation(a,b) = 0.3$.
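The positive semi-definiteness claim can be checked by hand. For a $3 \times 3$ correlation matrix with unit diagonal and all pairwise correlations in $[-1, 1]$, it suffices that the determinant is non-negative. A small sketch, where the function names and the tolerance `tol` are illustrative assumptions:

```python
def corr_matrix_determinant(r_ab, r_ac, r_bc):
    """Determinant of the 3x3 correlation matrix of (a, b, c) with unit diagonal."""
    return 1.0 + 2.0 * r_ab * r_ac * r_bc - r_ab**2 - r_ac**2 - r_bc**2


def is_valid_correlation_triple(r_ab, r_ac, r_bc, tol=1e-12):
    """True if the three pairwise correlations form a positive semi-definite matrix."""
    pairwise_ok = all(abs(r) <= 1.0 for r in (r_ab, r_ac, r_bc))
    return pairwise_ok and corr_matrix_determinant(r_ab, r_ac, r_bc) >= -tol


print(corr_matrix_determinant(0.2, 0.8, 0.75))  # negative: not a valid combination
print(corr_matrix_determinant(0.3, 0.8, 0.75))  # positive: valid
```

With $correlation(a,b) = 0.2$ the determinant is $-0.0025$, and with $0.3$ it is $0.0675$.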

The maximum achievable correlation between $L$ and $c$ for this example is $0.96220$.

If we set $Var(a) = Var(b) = Var(c) = 1$, then the optimal values of $m, n$ achieving this maximum correlation are $m^* = 1.12745, n^* = 1$. As stated previously, these are only unique to within a positive scale factor. If we set $Var(a) = 1, Var(b) = 4, Var(c) = 9$, then the optimal values of $m, n$ achieving this maximum correlation are $m^* = 2.25490, n^* = 1$. For the input set of correlations used, regardless of the variances of $a, b, c$, the maximum achievable correlation between $L$ and $c$ is $0.96220$. This compares with a correlation of $0.8$ between $a$ and $c$ and $0.75$ between $b$ and $c$.

Example 2: This example is the same as example 1, except that $correlation(a,b) = 0.99$ rather than $0.3$. $m^*$ is now positive and $n^*$ negative, resulting in maximum achievable $correlation(L,c) = 0.85361$. So in this case, some "fancy maneuvering" is needed to increase $correlation(L,c)$ beyond that achievable with $a$ or $b$ by themselves. The "dividing line" between $n^*$ being positive vs. negative occurs in this case at $correlation(a,b) = 0.9375$ $(= 0.75/0.8)$, which is the value making the denominator in the formula for $m^*$ equal to $0$.
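The sign change can be seen by sweeping $correlation(a,b)$ while holding the other two correlations fixed. The sketch below assumes unit variances and uses the fact that, in that case, the optimal weights are proportional to $(r_{ac} - r_{ab} r_{bc},\ r_{bc} - r_{ab} r_{ac})$, which agrees with the ratio in the closed form. The names `R_AC`, `R_BC`, and `best_direction` are illustrative.

```python
import math

R_AC, R_BC = 0.8, 0.75  # correlations of a and b with c, as in the question


def best_direction(r_ab):
    """Unit-variance optimum: return (m, n, correlation(L, c)) for a given r_ab.

    (m, n) is one representative of the optimal direction; any positive
    multiple gives the same correlation.
    """
    m = R_AC - r_ab * R_BC
    n = R_BC - r_ab * R_AC
    corr = (m * R_AC + n * R_BC) / math.sqrt(m * m + n * n + 2.0 * m * n * r_ab)
    return m, n, corr


for r_ab in (0.3, 0.9, 0.9375, 0.95, 0.99):
    m, n, corr = best_direction(r_ab)
    print(f"r_ab={r_ab:<6} m={m:+.4f} n={n:+.4f} max corr={corr:.5f}")
```

At $r_{ab} = 0.9375$ the weight on $b$ vanishes (up to rounding) and the best achievable correlation is just that of $a$ alone, $0.8$. Beyond that point the weight on $b$ turns negative.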
