Solved – Cross correlation for very sparse binary data

binary data, cross correlation, sparse

I have a very large (5271159×60), sparse (~2.5% nonzero) binary matrix, and I'd like to calculate the cross-correlation between each pair of columns (sensors) for a series of lags from -10:10, which would give me a 21×60×60 tensor.

The naive implementation computes the z-score of the matrix at each lag and then takes the dot product between the lagged matrices, divided by the number of rows. However, this is very inefficient, because subtracting the mean to form the z-scores destroys the sparsity of the matrices. For example, in MATLAB, using the matrix SparseEvents:

lag = -10:10;
P = zeros(length(lag), size(SparseEvents,2), size(SparseEvents,2));
for i = 1:length(lag)
    fprintf('lag = %d (%d/%d)\n', lag(i), i, length(lag));
    % Select two row-aligned windows of the matrix, shifted relative to
    % each other according to the current lag.
    M1 = SparseEvents(1 + max(lag) - lag(i) : end + min(lag) - lag(i), :);
    M2 = SparseEvents(1 + lag(i) - min(lag) : end + lag(i) - max(lag), :);
    % zscore() densifies the sparse matrices: this is the bottleneck.
    P(i,:,:) = zscore(M1)' * zscore(M2) / size(M1,1);
end

Am I missing something – could I perform the dot product and subtract the means post-hoc in some manner?

Best Answer

You are right to think of correlation as the mean of the product of the standardized variables: this has great conceptual advantages over other definitions. It also leads to minimal loss of floating point precision. However, it is not necessarily the best approach for all computation, especially when speed is important.

Our goal is to make algebraic manipulations that lead to calculations involving, if possible, dot products of the original variables, for then the inherent sparse matrix operations ought to make short work indeed of those computations. Let, then, $\mathbf u$ and $\mathbf v$ be two of the columns of the original sparse binary matrix $\mathbb Y$ (so that the $n$ entries in each of $\mathbf u$ and $\mathbf v$ are only zeros and ones). Let $\mathbf 1 = (1,1,\ldots, 1)$ be an $n$-vector and write

$$\bar u = (u_1 + u_2 + \cdots + u_n)/n = \frac{1}{n}\mathbf u \cdot \mathbf 1$$

(and likewise $\bar v = \frac{1}{n}\mathbf v \cdot \mathbf 1$) for their means and

$$\mathbf u_0 = (u_1 - \bar u, u_2 - \bar u, \ldots, u_n - \bar u)$$

(and similarly for $\mathbf v_0$) for the centered vectors of their residuals. Then

$$||\mathbf u_0||^2 = \mathbf u_0 \cdot \mathbf u_0$$

shows how to find the lengths of the residual vectors, which are used to standardize them to

$$\mathbf \upsilon = \mathbf u_0 / \sqrt{||\mathbf u_0||^2/n};\quad\mathbf \phi= \mathbf v_0 / \sqrt{||\mathbf v_0||^2/n}$$

whence, by definition,

$$\rho_{\mathbf u, \mathbf v} = \frac{1}{n}\,\mathbf \upsilon \cdot \mathbf \phi.$$

(Apart from choices of when to divide by $n$, this appears to be what the code in the question is doing.)
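To make that concrete, here is a quick numerical check (my sketch, not part of the original answer; it assumes the Statistics Toolbox for zscore and corr). With population normalization, zscore(u,1), the mean of the product of the standardized variables reproduces the built-in correlation:

% Sanity check (illustrative): the mean of the product of standardized
% variables equals the Pearson correlation when zscore normalizes by n.
n = 1000;
u = double(rand(n,1) < 0.025);           % sparse binary test columns
v = double(rand(n,1) < 0.025);
rho = zscore(u,1)' * zscore(v,1) / n;    % (1/n) * upsilon . phi
fprintf('standardized product: %.6f   corr: %.6f\n', rho, corr(u,v))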

Working backwards (by plugging the foregoing definitions into the formula for $\rho_{\mathbf u, \mathbf v}$) easily yields

$$\eqalign{ \rho_{\mathbf u, \mathbf v} &=\frac{1}{n}\,\mathbf u_0 / \sqrt{||\mathbf u_0||^2/n} \cdot \mathbf v_0 / \sqrt{||\mathbf v_0||^2/n} \\ &=\frac{\left(\mathbf u - \bar u \mathbf 1\right)\cdot\left(\mathbf v - \bar v \mathbf 1\right)}{\sqrt{\left(\mathbf u - \bar u \mathbf 1\right)\cdot\left(\mathbf u - \bar u \mathbf 1\right)\,\left(\mathbf v - \bar v \mathbf 1\right)\cdot\left(\mathbf v - \bar v \mathbf 1\right)}}. }$$

Using the distributive law to expand the dot products shows that

$$\eqalign{ \left(\mathbf u - \bar u \mathbf 1\right)\cdot\left(\mathbf v - \bar v \mathbf 1\right) &= \mathbf u \cdot \mathbf v - \frac{2}{n}(\mathbf u \cdot \mathbf 1) (\mathbf 1 \cdot \mathbf v) + \frac{1}{n^2}(\mathbf u \cdot \mathbf 1)(\mathbf v \cdot \mathbf 1)\mathbf 1 \cdot \mathbf 1 \\ &= \mathbf u \cdot \mathbf v -\frac{1}{n}(\mathbf u \cdot \mathbf 1)(\mathbf v \cdot \mathbf 1) }$$

because $\mathbf 1 \cdot \mathbf 1 = n$. Similar formulas hold for the expressions in the denominator. This shows how the correlation coefficient can be computed in terms of dot products of the original (raw, sparse) vectors. Note in particular that, because the entries of $\mathbf u$ are zeros and ones, $\mathbf u \cdot \mathbf u = \mathbf u \cdot \mathbf 1 = n\bar u$, whence the terms in the denominator can be written

$$\mathbf u \cdot \mathbf u -\frac{1}{n}(\mathbf u \cdot \mathbf 1)(\mathbf u \cdot \mathbf 1)=n \bar u - \frac{1}{n}(n \bar u)^2 = n(\bar u - \bar u^2)$$
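In code, this identity needs nothing beyond sums and a single dot product of the raw vectors. A small MATLAB check (again my sketch, using the same kind of random binary test vectors as above):

% Correlation from raw, sparse-friendly quantities only.
n   = 1000;
u   = double(rand(n,1) < 0.025);
v   = double(rand(n,1) < 0.025);
suv = u' * v;                                 % u . v
su  = sum(u);                                 % u . 1  (= n * mean(u))
sv  = sum(v);                                 % v . 1
num = suv - su * sv / n;                      % (u - ubar 1) . (v - vbar 1)
den = sqrt((su - su^2/n) * (sv - sv^2/n));    % the n(ubar - ubar^2) factors
fprintf('dot-product formula: %.6f   corr: %.6f\n', num/den, corr(u,v))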

An efficient implementation will compute the column sums (from which the means $\bar u$ are immediately derived) and obtain the dot products of all pairs of columns at once by means of a single matrix operation, $\mathbb Y^\prime \mathbb Y$. With these in hand it is simple and fast to obtain the correlation coefficients from the preceding formula. To obtain lagged correlations at lag $k$, remove the first $k$ rows of $\mathbb Y$ (call this $\mathbb Y_{(k)}$) and separately remove the last $k$ rows (call this $\mathbb Y_{(-k)}$). The essential material for the computation can be found in the entries of $\mathbb Y_{(k)}^\prime \mathbb Y_{(-k)}$. The new denominators will scarcely differ from the old ones and so, to a high degree of accuracy, need not be recomputed at all; but for perfect accuracy, note that the column sums of $\mathbb Y_{(k)}$ are of the form

$$\sum_{i=k+1}^n u_i = \left(\sum_{i=1}^n u_i\right) - u_k - u_{k-1} - \cdots - u_2 - u_1$$

which are easily obtained from the original column sums by means of just $k$ subtractions (and similarly for the column sums of $\mathbb Y_{(-k)}$).
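Putting the pieces together, here is a minimal MATLAB sketch of the whole computation (my code, following the recipe above rather than code from the original answer; it assumes SparseEvents is stored with sparse() and, for simplicity, recomputes the column sums of each lagged submatrix directly instead of applying the $k$-subtraction update):

lag = -10:10;
[N, ncol] = size(SparseEvents);
P = zeros(length(lag), ncol, ncol);
for i = 1:length(lag)
    k  = lag(i);
    r1 = (1 + max(k,0)) : (N + min(k,0));  % rows kept in the leading copy
    r2 = r1 - k;                           % rows kept in the lagged copy
    Y1 = SparseEvents(r1, :);
    Y2 = SparseEvents(r2, :);
    n  = numel(r1);
    C  = full(Y1' * Y2);     % all pairwise dot products in one sparse product
    s1 = full(sum(Y1, 1))';  % column sums (= n times the column means)
    s2 = full(sum(Y2, 1))';
    num = C - s1 * s2' / n;                        % centered cross products
    den = sqrt((s1 - s1.^2/n) * (s2 - s2.^2/n)');  % binary-variance terms
    P(i,:,:) = num ./ den;
end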

Thus, computing the entire $21\times 60\times 60$ array comes down to performing $21$ multiplications of sparse binary matrices and adjusting the results. The total number of numeric operations (with each dot product requiring about $n$ multiplications and $n$ additions) will be approximately

$$2 \times 5271159 \times 60^2 \times 21 \approx 800 \times 10^9.$$

Unparallelized, but running as native code on a modern PC, this would take about two minutes even without any speedup from sparsity. A test in R (without sparse arithmetic) on a $5271159 \times 6$ matrix took $1.3$ seconds; since the work scales quadratically in the number of columns, R would need about $130$ seconds for all $60$ columns, confirming the two-minute estimate. Given a random binary array of dimensions $5271159\times 60$ with mean $1/40$, Mathematica 9 took one second to compute $\mathbb Y^\prime \mathbb Y$, suggesting the full computation for all $21$ lags should take around $20$ seconds.
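(To try the same scaling experiment in MATLAB, a rough benchmark sketch; note that sprand only approximates the requested density:)

% Time the single sparse product Y'*Y on a random binary matrix, density ~1/40.
Y = double(sprand(5271159, 60, 1/40) > 0);
tic; C = Y' * Y; toc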
