Reading around, I understood that a cross-correlation at lag zero should give the same result as an ordinary Pearson correlation. This is not the case for me. What could be the reason? Could NA values influence this?
R Code:
> Data.Wide.Weeks$Philadelphia.1.PA
[1] NaN NaN NaN NaN NaN NaN NaN 63.00 199.00 81.70 335.00 284.00 455.00 797.00 671.00 368.00 164.00 76.00 150.00 43.00 33.70 18.30 10.00 7.67
[25] 8.33 6.00 5.33 6.67 30.30 21.00 23.30 49.00 80.30 28.00 31.00 18.00
> Data.Wide.Weeks$ArcWd
[1] 65.3 74.1 77.7 97.4 117.0 138.0 186.0 200.0 204.0 241.0 392.0 337.0 350.0 505.0 380.0 351.0 230.0 242.0 199.0 166.0 100.0 98.1 129.0 113.0 101.0 95.7 101.0 104.0
[29] 121.0 148.0 167.0 182.0 159.0 139.0 137.0 144.0
> cor(Data.Wide.Weeks$ArcWd, Data.Wide.Weeks$Philadelphia.1.PA, use = 'pairwise.complete.obs')
[1] 0.9265837
> ccf(Data.Wide.Weeks$ArcWd, Data.Wide.Weeks$Philadelphia.1.PA, na.action = na.pass, plot =F, lag.max = 5)
Autocorrelations of series ‘X’, by lag
-5 -4 -3 -2 -1 0 1 2 3 4 5
0.175 0.349 0.540 0.668 0.835 0.949 0.793 0.635 0.466 0.301 0.176
The Pearson coefficient is 0.927, while the ccf coefficient at lag zero is 0.949, and this happens for all my series.
Best Answer
A simple simulation without the NA values will show that ccf and cor give the same results. So the issue is with how ccf and cor treat NA values.

In your code you use na.pass for ccf and pairwise.complete.obs for cor. The documentation for ccf gives a warning about using na.pass.

Now the option pairwise.complete.obs for cor omits the observations where either of the series is NA. So if you want the same behaviour as cor, you should do the same. A simple simulation shows that if you use na.omit instead of na.pass, ccf and cor give the same result.

Now the question is how exactly ccf calculates the covariances when na.action=na.pass. This can be seen by inspecting the code of acf, which ccf calls. The offending line is the one that subtracts the means from the series when demean is TRUE.

The default option for demean is TRUE; if it is FALSE, it is assumed that the means are zero. It is clear then that for na.pass the mean for the series without the NA values will be calculated from all of its observations, rather than only from observations where both of the series are not NA.

So in the end the issue was the same as in the question mentioned by @lekshmi dharmarajan. The scaling is different when NA values are treated differently.