Solved – Least correlated subset of random variables from a correlation matrix

correlation

I have a correlation matrix $A$, which I obtained using the Pearson's linear correlation coefficient through Matlab's corrcoef(). The correlation matrix of dimension 100×100, i.e. I computed the correlation matrix on 100 random variables.

Among these 100 random variables, I would like to find the 10 random variables whose correlation matrix contains as "little correlation" as possible (see Quantifying how much "more correlation" a correlation matrix A contains compared to a correlation matrix B regarding metrics to measure the overall correlation in a correlation matrix). I only care about pairwise correlation.

Are there good methods to find those 10 random variables in a reasonable amount of time (e.g. I don't want to try $\binom{100}{10}$ combinations)? Approximation algorithms are OK.

Best Answer

Let us consider the sum of absolute pairwise correlations as our measure of choice. We thus seek a vector $v\in\{0,1\}^N$ with $l_1(v)=n$ which will minimize $v'Qv$ where $Q_{ij}=|A_{ij}|$.

Assume Q is also positive definite as A, the problem is reduced to solving the constrained quadratic optimization problem:

$$v^*=\min\ v'Qv\ s.t.\ l_1(v)=n,\ v_i\in\{0,1\}$$

This suggest the folloing relaxation:

$$v^*=\min\ v'Qv\ s.t.\ l_1(v)=n,\ v_i\in[0,1]$$

which can be easily solved using off-the-shelf solvers; then the result is given by the largest $n$ components in $v^*$.

Sample matlab code:

N=100;
n=10;
% Generate random data
A=rand(N,1000);
C=corrcoef(A');
Q=abs((C+C')/2); % make sure it is symmetric
x = cplexqp(Q,zeros(1,N),[],[], ones(1, N),n, zeros(N,1), ones(N,1));
% If you don't use CPLEX, use matlab's default
% x = quadprog(Q,zeros(1,N),[],[], ones(1, N),n, zeros(N,1), ones(N,1));
assert(abs(sum(x)-n)<1e-10);
% Find the n largest values
I=sort(x); 
v=zeros(size(x)); v(x>I(N-n))=1; 
assert(abs(sum(v)-n)<1e-10);
% Make sure we do better than 10K random trials
for i=1:10000
   vc=zeros(size(x)); vc(randperm(N,n))=1;
   assert(sum(vc)==n, 'Wrong l0 norm');
   assert(vc'*Q*vc>v'*Q*v, 'Improves result');
end
% Show results
J=find(v==1);
fprintf('The optimal solution total off-diagonal correlations are %1.3f\n', v'*Q*v-n);
fprintf('The matrix:\n');
C(J,J)