Solved – Normalized correlation with a constant vector

correlation, cross-correlation

I am confused about how to interpret the result of a normalized correlation with a constant vector. Since you have to divide by the standard deviation of both vectors (reference: http://en.wikipedia.org/wiki/Cross-correlation ), if one of them is constant (say a vector of all 5's, which has a standard deviation of zero), then the correlation is infinity, but in fact the correlation should be zero, right? This isn't just a corner case: in general, if the standard deviation of one of the vectors is small, the correlation with any other vector is very high, which obviously doesn't make sense. Can anyone explain what I am misinterpreting?

Best Answer

Let $\boldsymbol{x}$ and $\boldsymbol{y}$ be your two vectors and let $\boldsymbol{\bar{x}} \equiv \bar{x} \boldsymbol{1}$ and $\boldsymbol{\bar{y}} \equiv \bar{y} \boldsymbol{1}$ be constant vectors for the means of the two original vectors. The components of the sample correlation are:

$$\begin{matrix} s_{x,y}^2 = \frac{1}{n-1} (\boldsymbol{x} - \boldsymbol{\bar{x}}) \cdot (\boldsymbol{y} - \boldsymbol{\bar{y}}) & & s_x = \frac{1}{\sqrt{n-1}} ||\boldsymbol{x} - \boldsymbol{\bar{x}}|| & & s_y = \frac{1}{\sqrt{n-1}} ||\boldsymbol{y} - \boldsymbol{\bar{y}}||. \end{matrix}$$

The sample correlation between $\boldsymbol{x}$ and $\boldsymbol{y}$ is just the cosine of the angle between the vectors $\boldsymbol{x} - \boldsymbol{\bar{x}}$ and $\boldsymbol{y} - \boldsymbol{\bar{y}}$. Letting this angle be $\theta$ we have:

$$\rho_{x,y} = \frac{(\boldsymbol{x} - \boldsymbol{\bar{x}}) \cdot (\boldsymbol{y} - \boldsymbol{\bar{y}})}{||\boldsymbol{x} - \boldsymbol{\bar{x}}|| \cdot ||\boldsymbol{y} - \boldsymbol{\bar{y}}||} = \cos \theta.$$

Scaling either vector by a positive constant scales the covariance and the corresponding standard deviation by the same factor, so the correlation is unaffected by scale. It is not correct to say that a low standard deviation gives a high correlation. What matters for correlation is the angle between the vectors, not their lengths.
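As a quick numerical check of the cosine interpretation, here is a minimal R sketch. The vectors `x` and `y` are made-up illustration data, not from the question; any two non-constant vectors of equal length should behave the same way.

```r
# Hypothetical example data (an assumption for illustration only).
x <- c(2, 4, 5, 7, 9)
y <- c(1, 3, 2, 6, 8)

# Centre each vector by subtracting its mean.
xc <- x - mean(x)
yc <- y - mean(y)

# Cosine of the angle between the centred vectors.
cos_theta <- sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))

# The cosine agrees with the built-in Pearson correlation.
all.equal(cos_theta, cor(x, y))

# Shrinking or stretching a vector changes its standard deviation
# but leaves the correlation unchanged.
r_shrunk    <- cor(x, 0.001 * y)
r_stretched <- cor(1000 * x, y)
all.equal(cor(x, y), r_shrunk)
all.equal(cor(x, y), r_stretched)
```

Note that `0.001 * y` has a standard deviation a thousand times smaller than `y`, yet its correlation with `x` is the same.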

In the special case where $\boldsymbol{y} \propto \boldsymbol{1}$ (i.e., $\boldsymbol{y}$ is a constant vector) you have $\boldsymbol{y} - \boldsymbol{\bar{y}} = \boldsymbol{0}$ which then gives $s_{x,y}^2 = 0$ and $s_{y} = 0$. In this case the correlation is undefined. Geometrically this occurs because there is no defined angle with the zero vector.
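The constant-vector case can be tried directly in R. The sketch below uses made-up data (`x`, `d`, and the offset 5 are assumptions for illustration). It relies on the documented behaviour of `cor()`, which returns `NA` with a warning when a standard deviation is zero, and contrasts that with a nearly constant vector, whose correlation is well defined and not inflated.

```r
# Hypothetical example data (an assumption for illustration only).
x <- c(2, 4, 5, 7, 9)

# A constant vector: the centred vector is exactly zero, so the
# correlation is 0/0. cor() reports NA (with a warning), not Inf or 0.
y_const <- rep(5, length(x))
r_const <- suppressWarnings(cor(x, y_const))
is.na(r_const)

# A nearly constant vector: tiny variation d around the value 5.
d <- c(1, -1, 0, 1, -1)
y_small <- 5 + 0.001 * d

# Its correlation with x is the same as the correlation of x with d;
# the small standard deviation does not push it towards +/-1.
r_small <- cor(x, y_small)
r_shape <- cor(x, d)
all.equal(r_small, r_shape)
```

The correlation depends only on the direction of the variation `d`, which is why shrinking that variation towards zero does not make the correlation blow up. It only becomes undefined once the variation is exactly zero.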

Related Solutions

Solved – What’s the formula of normalized correlation

I haven't come across this usage, but it seems easy to decode.

Matters may differ in your field, but within mainstream statistics, and all statistics-using disciplines I know about, correlation is understood as being by definition scaled to fall within [-1, 1]. When calculated along the lines of your formula, correlation is a cosine.

So the term "normalized" is just emphasizing that fact; it is not flagging a special case.

The unnormalized correlation would just be called the covariance.
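To make the covariance versus correlation distinction concrete, here is a small R sketch with made-up data (the vectors and the factor 100 are assumptions for illustration). It shows that covariance changes when one vector is rescaled, while the normalized quantity does not.

```r
# Hypothetical example data (an assumption for illustration only).
x <- c(2, 4, 5, 7, 9)
y <- c(1, 3, 2, 6, 8)

# Covariance carries the units of both vectors, so rescaling x
# (for example, metres to centimetres) rescales the covariance.
cov_xy     <- cov(x, y)
cov_scaled <- cov(100 * x, y)

# Dividing by both standard deviations removes the units.
r_manual <- cov_xy / (sd(x) * sd(y))
r_scaled <- cov_scaled / (sd(100 * x) * sd(y))

# The normalized value matches cor() and is unchanged by the rescaling.
all.equal(r_manual, cor(x, y))
all.equal(r_manual, r_scaled)
```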

So, you can't find this term being used because it is very unusual.

Solved – R Own Implementation of Cross Correlation using Convolution

I think you're missing two things. First, you need to take the complex conjugate of fft(x) before multiplying by fft(y) and taking the inverse FFT. More important, and more subtle, the FFT assumes your data are periodic. As a result, if you calculate the cross-correlation directly from the FFT, you get correlations with wrap-around, which I suspect isn't what you want. For example, for 1:5 and a lag of 1, instead of calculating the correlation between
1 2 3 4 5
0 1 2 3 4
you're calculating it between
1 2 3 4 5
5 1 2 3 4
You can account for this by padding your vectors with zeros up to a size of $2n-1$. There's a nice explanation of this in "Fast variogram computation with FFT" (Marcotte 1996). The paper focuses on 2D spatial data, but I think the idea is the same.
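The wrap-around effect can be seen on the 1:5 example itself. The sketch below is only an illustration, using raw (unnormalized) lag-1 sums of 1:5 against itself, with variable names chosen here rather than taken from the original answer. It compares direct sums with the FFT result, with and without zero padding.

```r
x <- 1:5
n <- length(x)

# Lag-1 sum with a zero shifted in: 1 2 3 4 5 against 0 1 2 3 4.
linear_lag1 <- sum(x * c(0, x[-n]))

# Lag-1 sum with wrap-around: 1 2 3 4 5 against 5 1 2 3 4.
circular_lag1 <- sum(x * c(x[n], x[-n]))

# FFT without padding: element 2 (lag 1) is the wrap-around sum.
# R's inverse fft is unnormalized, hence the division by the length.
f <- fft(x)
fft_unpadded <- Re(fft(Conj(f) * f, inverse = TRUE)) / n

# FFT after padding with zeros to length 2n - 1: element 2 (lag 1)
# is now the sum without wrap-around.
xp <- c(x, rep(0, n - 1))
fp <- fft(xp)
fft_padded <- Re(fft(Conj(fp) * fp, inverse = TRUE)) / length(xp)

c(linear = linear_lag1, circular = circular_lag1,
  unpadded = fft_unpadded[2], padded = fft_padded[2])
```

Here the unpadded FFT reproduces the circular sum (45), while the padded version reproduces the linear sum (40).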

Here's some R code to calculate cross-correlation for lags $-(n-1)$ to $n-1$:

fftXcor <- function(x, y) {
    # Both series must have the same length
    stopifnot(length(x) == length(y))
    n <- length(x)
    # Normalize
    x <- as.numeric(scale(x)) 
    y <- as.numeric(scale(y))
    # Enlarge with 0's to size 2*n-1 to account for periodicity
    x <- c(x, rep(0, length(x) - 1))
    y <- c(y, rep(0, length(y) - 1))
    # FFT
    xfft <- fft(x)
    yfft <- fft(y)
    # Cross-correlation via convolution
    crosscor <- fft(Conj(xfft) * yfft, inverse = TRUE) / length(x)
    crosscor <- Re(crosscor) / (n - 1)
    # Reorder so the lags run -(n-1):(n-1) instead of 0:(2n-2)
    crosscor <- c(crosscor[(n+1):length(crosscor)], crosscor[1:n])
    # Store lag as names attribute of vector
    names(crosscor) <- (1-n):(n-1)
    return(crosscor)
}
x <- 1:9
y  <- 9:1
xc <- fftXcor(x, y)
# Just look at half the vector since it's symmetric
round(xc, 3)[9:17]
#      0      1      2      3      4      5      6      7      8 
# -1.000 -0.667 -0.350 -0.067  0.167  0.333  0.417  0.400  0.267 

# Compare to R's built in cross correlation function
xc_ccf <- ccf(x, y, lag.max = 8, plot = FALSE, type = 'correlation')
as.vector(xc / xc_ccf$acf)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
