Solved – Coefficient for a zero lag cross-correlation different from Pearson correlation r

From what I have read, a cross-correlation at lag zero should give the same result as an ordinary Pearson correlation. That is not the case for me. What could be the reason? Could NA values influence this?

R Code:

> Data.Wide.Weeks$Philadelphia.1.PA
 [1]    NaN    NaN    NaN    NaN    NaN    NaN    NaN  63.00 199.00  81.70 335.00 284.00 455.00 797.00 671.00 368.00 164.00  76.00 150.00  43.00  33.70  18.30  10.00   7.67
[25]   8.33   6.00   5.33   6.67  30.30  21.00  23.30  49.00  80.30  28.00  31.00  18.00

> Data.Wide.Weeks$ArcWd
 [1]  65.3  74.1  77.7  97.4 117.0 138.0 186.0 200.0 204.0 241.0 392.0 337.0 350.0 505.0 380.0 351.0 230.0 242.0 199.0 166.0 100.0  98.1 129.0 113.0 101.0  95.7 101.0 104.0
[29] 121.0 148.0 167.0 182.0 159.0 139.0 137.0 144.0

> cor(Data.Wide.Weeks$ArcWd, Data.Wide.Weeks$Philadelphia.1.PA, use = 'pairwise.complete.obs')
[1] 0.9265837

> ccf(Data.Wide.Weeks$ArcWd, Data.Wide.Weeks$Philadelphia.1.PA, na.action = na.pass, plot =F, lag.max = 5)

Autocorrelations of series ‘X’, by lag

   -5    -4    -3    -2    -1     0     1     2     3     4     5 
0.175 0.349 0.540 0.668 0.835 0.949 0.793 0.635 0.466 0.301 0.176 

The Pearson coefficient is 0.927, while the ccf coefficient at lag zero is 0.949, and this happens for all my series.

Best Answer

A simple simulation without the NA values will show that ccf and cor give the same results. So the issue is with how ccf and cor treat NA values.
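As a minimal sketch of such a simulation (the seed, the series length and the way x and y are generated are arbitrary choices of mine, not taken from the question), one can compare the two functions on complete data:

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 100
x <- cumsum(rnorm(n))            # simulated series with no missing values
y <- 0.5 * x + rnorm(n)          # second series, correlated with the first

r_cor <- cor(x, y)               # ordinary Pearson correlation

cc    <- ccf(x, y, lag.max = 5, plot = FALSE)
r_ccf <- cc$acf[cc$lag == 0]     # pick out the lag-zero coefficient

all.equal(r_cor, r_ccf)
```

With no NA values the two numbers should agree up to floating-point error. cor divides by n - 1 and acf by n, but that factor cancels in the correlation.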

In your code you use na.pass for ccf and pairwise.complete.obs for cor. The documentation for ccf gives the following warning when using na.pass:

This means that the estimate computed may well not be a valid autocorrelation sequence, and may contain missing values.

The pairwise.complete.obs option for cor omits every observation where either series is NA. If you want ccf to behave like cor, you should do the same: a simple simulation shows that with na.omit instead of na.pass, ccf and cor give the same result.
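A sketch of that comparison on simulated data, assuming leading NAs in one series as in the question (the seed, the series and the helper lag0 are mine, for illustration only):

```r
set.seed(2)                      # arbitrary seed, only for reproducibility
n <- 60
x <- cumsum(rnorm(n))
y <- 0.5 * x + rnorm(n)
y[1:10] <- NA                    # leading NAs, mimicking the question's data

# Hypothetical helper: extract the lag-zero value from a ccf result
lag0 <- function(cc) cc$acf[cc$lag == 0]

r_cor  <- cor(x, y, use = "pairwise.complete.obs")
r_omit <- lag0(ccf(x, y, na.action = na.omit, lag.max = 5, plot = FALSE))
r_pass <- lag0(ccf(x, y, na.action = na.pass, lag.max = 5, plot = FALSE))

c(cor = r_cor, ccf_na.omit = r_omit, ccf_na.pass = r_pass)
```

The first two values should coincide, while the na.pass value will generally differ. Note that, as far as I know, na.omit applied to a time series only trims leading and trailing NAs and stops with an error if there are NAs in the middle of the series.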

The remaining question is how exactly ccf calculates the covariances when na.action = na.pass. This can be seen by inspecting the code of acf, which ccf calls. Here is the offending line:

if (demean) 
        x <- sweep(x, 2, colMeans(x, na.rm = TRUE), check.margin = FALSE)

The default for demean is TRUE; if it is FALSE, the means are assumed to be zero. So with na.pass, the mean of the series that has no NA values is calculated from all of its observations, rather than only from the observations where both series are non-NA.
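To make this concrete, here is a sketch that rebuilds the lag-zero value by hand from the data printed in the question. It assumes that, after demeaning, acf averages the cross-products over the complete pairs and scales by each series' own lag-zero variance computed from all of that series' non-NA values; the variable names are mine.

```r
arc <- c(65.3, 74.1, 77.7, 97.4, 117.0, 138.0, 186.0, 200.0, 204.0, 241.0,
         392.0, 337.0, 350.0, 505.0, 380.0, 351.0, 230.0, 242.0, 199.0,
         166.0, 100.0, 98.1, 129.0, 113.0, 101.0, 95.7, 101.0, 104.0,
         121.0, 148.0, 167.0, 182.0, 159.0, 139.0, 137.0, 144.0)
phl <- c(rep(NaN, 7), 63.00, 199.00, 81.70, 335.00, 284.00, 455.00, 797.00,
         671.00, 368.00, 164.00, 76.00, 150.00, 43.00, 33.70, 18.30, 10.00,
         7.67, 8.33, 6.00, 5.33, 6.67, 30.30, 21.00, 23.30, 49.00, 80.30,
         28.00, 31.00, 18.00)

ok <- !is.na(phl)                  # weeks where both series are observed

mean(arc)                          # mean over all 36 weeks (na.pass route)
mean(arc[ok])                      # mean over the 29 complete pairs (cor route)

# Lag-zero coefficient rebuilt under the assumption stated in the lead-in
ac <- arc - mean(arc)              # demeaned with the all-observation mean
pc <- phl - mean(phl, na.rm = TRUE)
cov_xy <- mean(ac[ok] * pc[ok])    # cross-products over complete pairs only
var_x  <- mean(ac^2)               # variance of ArcWd from all 36 weeks
var_y  <- mean(pc[ok]^2)           # variance of Philadelphia from 29 weeks
r_manual <- cov_xy / sqrt(var_x * var_y)

cc     <- ccf(arc, phl, na.action = na.pass, plot = FALSE, lag.max = 5)
r_pass <- cc$acf[cc$lag == 0]
r_cor  <- cor(arc, phl, use = "pairwise.complete.obs")

c(manual = r_manual, ccf_na.pass = r_pass, cor = r_cor)
```

The hand-built value should match the na.pass result (about 0.949), while cor stays near 0.927. The data printed in the question may be rounded, so the last digits can differ slightly from the outputs shown there.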

So in the end the issue is the same as in the question mentioned by @lekshmi dharmarajan: the scaling differs when NA values are treated differently.