[Math] Use Pearson’s correlation coefficient on a matrix

correlation, linear algebra, statistics

I am having trouble interpreting the following formula, which is said to be Pearson's correlation coefficient:

$$r = \frac{N \left(\sum XY\right) - \left(\sum X\right) \left(\sum Y\right)}{\sqrt{\left[N \left(\sum X^2\right) - \left(\sum X\right)^2\right] \left[N \left(\sum Y^2\right) - \left(\sum Y\right)^2\right]}}$$

It is from Mining a Web Citation Database for author co-citation analysis (p.7). I have trouble interpreting it because the authors say $X$ and $Y$ are vectors of length $N + 1$, and the product of two column vectors is not defined, at least not normally, is it?

I have found a similar notation for this formula in this Wikipedia article. There, the formula takes not vectors as arguments but a series of $n$ measurements $x_i$ and $y_i$, where $i = 1,2,\dots,n$.

I am struggling to reconcile the two formulas and to understand what my calculation should look like when applying them. Maybe an example would help:

Let's take these two vectors:

$$X = (0,0.5,0,0)$$

$$Y = (0.5,0,0,0)$$

with $N = 3$, taken from the first two rows of this matrix:

$$\begin{pmatrix}
0 & 0.5 & 0 & 0\\
0.5 & 0 & 0 & 0\\
0 & 0 & 0 & 0\\
0 & 0 & 0 & 0\\
\end{pmatrix}$$

Best Answer

I am not convinced this expression is correct if these are vectors with length $N+1$ (the implicit means are wrong), so for the rest of this I will assume they are of length $N$.
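To make the remark about the implicit means concrete, here is a short derivation sketch. It assumes the sums run over $n$ entries and writes $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$; this notation is introduced here and is not taken from the paper.

$$n \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = n \sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)$$

The right-hand side matches the numerator of the formula only when the factor in front of $\sum XY$ equals the number of terms in the sums, that is, $n = N$. The same holds for each bracket in the denominator. With $N + 1$ entries but a factor of $N$, the expression no longer reduces to the usual centred form.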

If $\mathbf{X}$ is $(X_1, X_2, \ldots , X_{N})$ then the interpretation of $\sum X$ is clearly $\sum_{i=1}^{N} X_i$, of $\sum X^2$ is $\sum_{i=1}^{N} X_i^2$, and $\sum XY$ is $\sum_{i=1}^{N} X_i Y_i$. You can regard the last of these either as a dot product or a sum over a pointwise product (for matrices this pointwise product is sometimes called a Hadamard product).
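As an illustration, here is a minimal Python sketch of the formula under the reading above, applied to the two example vectors from the question. It assumes the vectors have length $N$, so $N = 4$ here rather than the $N = 3$ stated in the question; the function name `pearson_r` is just a label chosen for this example.

```python
import math

def pearson_r(x, y):
    """Pearson's r from the raw-sum formula, with N taken as len(x)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    sum_x = sum(x)                                  # sum X
    sum_y = sum(y)                                  # sum Y
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # sum XY (a dot product)
    sum_x2 = sum(xi * xi for xi in x)               # sum X^2
    sum_y2 = sum(yi * yi for yi in y)               # sum Y^2
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when either vector is constant")
    return numerator / denominator

# Example vectors from the question, treated as having length N = 4.
X = [0, 0.5, 0, 0]
Y = [0.5, 0, 0, 0]
r = pearson_r(X, Y)
print(r)
```

By hand: $\sum X = \sum Y = 0.5$, $\sum XY = 0$ and $\sum X^2 = \sum Y^2 = 0.25$, so the numerator is $4 \cdot 0 - 0.25 = -0.25$ and the denominator is $\sqrt{0.75 \cdot 0.75} = 0.75$, giving $r = -1/3$.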
