Solved – Preparing data for cross-correlation time series

cross correlation, environmental-data, r, time series

I have several data sets with air quality measurements for 20 locations. The measurements were done per second over a period of two weeks, 5 locations per period (because there were only 5 instruments).

The data looks stationary when plotted (see plot for 1 period). I want to calculate the cross-correlation between the relevant time series, to see if there is a temporal correlation between the measured concentrations at the various sites.
[Image: plot of the measurements for one period]

It does not seem useful to analyze the data at one-second resolution. What is the best way to proceed? I tried using hourly averages. Or should I smooth the data? I read something about modeling the data (How to use Pearson correlation correctly with time series), but that sounds rather complicated.

Edit: I am using R.

Best Answer

When attempting to detect cross-correlation between two time series, the first thing you should do is make sure the time series are stationary (i.e. have a constant mean, variance, and autocorrelation).

This matters because a correlation coefficient measures the linear relationship between two variables. A trend in either series interferes with that measurement: two series that both drift over time can appear strongly correlated even when they are otherwise unrelated, so you cannot tell whether the correlation is genuine or simply due to chance.
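To illustrate the point about trends, here is a minimal sketch using simulated data (not the air quality measurements from the question). The series, the slope of 0.05 and the seed are all made up for the demonstration. The two series share a linear trend but have independent noise, so any correlation between the raw series comes from the trend alone.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 1000
t = np.arange(n)

# Two simulated series that share only a linear trend; the noise terms are independent.
x = 0.05 * t + rng.standard_normal(n)
y = 0.05 * t + rng.standard_normal(n)

# Correlation of the raw (trending) series versus the first-differenced series.
raw_corr = np.corrcoef(x, y)[0, 1]
diff_corr = np.corrcoef(np.diff(x), np.diff(y))[0, 1]

print(f"correlation of raw series:      {raw_corr:.3f}")
print(f"correlation after differencing: {diff_corr:.3f}")
```

The raw correlation should come out close to 1, while the correlation of the differenced series should be close to 0, which is why the trend has to be dealt with before interpreting a cross-correlation.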

To check this, first use the augmented Dickey-Fuller test to screen for stationarity (it would help if you specified the software package you are using; I am using Python with statsmodels in this instance). Suppose you have two time series x and y:

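import statsmodels.tsa.stattools as ts  # ts is assumed to be this module
xdf = ts.adfuller(x, 1)  # second argument is maxlag
ydf = ts.adfuller(y, 1)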

Here's some sample output:

xdf
(-3.0704779047168596, 0.028816508715839483, 0, 106, {'1%': -3.4936021509366793, '5%': -2.8892174239808703, '10%': -2.58153320754717}, -723.247574137278)
ydf
(-2.949959856756157, 0.03983919029636401, 1, 105, {'1%': -3.4942202045135513, '5%': -2.889485291005291, '10%': -2.5816762131519275}, -815.3639322514784)

In this instance, both p-values (the second element of each result) are below 0.05, so we reject the null hypothesis of a unit root and the series do not need to be differenced for stationarity. Had either p-value been above 0.05, it would be necessary to difference that series first. The following tutorial might help you.

Now, it is a matter of calculating the cross-correlation between x and y, and generating the lags:

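import numpy as np

# Calculate correlations (x and y are equal-length NumPy arrays)
# With the default mode='valid' and equal lengths, np.correlate returns a single value: the lag-0 sum
cc1 = np.correlate(x - x.mean(), y - y.mean())[0]  # Remove means
cc1 /= (len(x) * x.std() * y.std())  # Normalise by number of points and product of standard deviations
cc2 = np.corrcoef(x, y)[0, 1]  # Pearson correlation, for comparison
print(cc1, cc2)  # The two values should agree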

Upon obtaining the cross-correlation coefficient, the lags can be generated and the autocorrelations calculated:

# Generating lags
lg = 108
x = np.random.randn(lg)
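One way the lagged cross-correlations might be computed is sketched below with np.correlate in 'full' mode. This is an illustration rather than the original answer's code: the function name lagged_xcorr and the max_lag parameter are hypothetical, the data is simulated, and the sketch assumes two equal-length, evenly sampled, stationary series.

```python
import numpy as np

def lagged_xcorr(x, y, max_lag):
    """Normalised cross-correlation of x and y for lags -max_lag..max_lag.

    The value at lag k is the correlation between x[t + k] and y[t],
    so a peak at a negative lag means x leads y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    xc = x - x.mean()
    yc = y - y.mean()
    # mode='full' returns every lag from -(n - 1) to n - 1.
    full = np.correlate(xc, yc, mode="full") / (n * x.std() * y.std())
    lags = np.arange(-(n - 1), n)
    keep = np.abs(lags) <= max_lag
    return lags[keep], full[keep]

# Simulated example: y is a copy of x delayed by 5 samples.
rng = np.random.default_rng(2)  # arbitrary seed
x = rng.standard_normal(500)
y = np.roll(x, 5)  # y[t] = x[t - 5] (with wrap-around at the start)

lags, r = lagged_xcorr(x, y, max_lag=20)
best_lag = lags[np.argmax(r)]
print("lag with the highest correlation:", best_lag)
```

Note that this normalisation divides every lag by the full series length, so values at large lags are shrunk towards zero; the value at lag 0 matches np.corrcoef(x, y)[0, 1].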