Solved – Testing significance of cross-correlated series

correlation, cross-correlation, signal processing, time series

I want to prove that, overall, signal B is correlated to signal A. I was thinking of using cross-correlation (in R) to measure this.

Essentially I have two kinds of signals: signal A is a series of single-valued data describing a particular song; signal B is a series of single-valued data for a user. There are many songs and many users per song, but I do not have the same number of users for every song.

For example:

Signal A (song data), for song 1
0.994
0.986
0.955
0.890
0.795
0.650
...

Signal A (song data), for song 2
0.763
0.788
0.787
0.908
0.854
0.901
...

Signal B (user data), for user 1 listening to song 1
75
74.4
73.7
73
72.3
72
...

Signal B (user data), for user 1 listening to song 2
71
72.3
74.9
73
72.5
72.9

Signal B (user data), for user 2 listening to song 2
60.6
60.2
61
60.7
61
59.3
...

Etc.

The series are obviously truncated for this illustration. Again, there are many songs, and not every user listened to every song.

I am interested in whether I can draw conclusions about how well all song data (signal A) can predict all user response (signal B).

Ideally, I would like to capture the cross-correlation in one number (one test statistic for each song), so that I may easily quantify whether there is an overall correlation between the two signals.
Using ccf (in R) gives me a value for each lag. For example:

> print(ccf(x,y))
Autocorrelations of series ‘X’, by lag                                 
-6     -5     -4     -3     -2     -1      0      1      2      3      4                                                                    
-0.242 -0.090  0.057  0.197  0.466  0.699  0.896  0.436  0.221 -0.018 -0.116

(Are these values the cross-correlation coefficients?)
Also, my data are not stationary. Is there any way (another function?) to test whether signals A and B are correlated across users and songs?
One approach would be to average signal B (take the mean user response) for each song, but because there are a different number of users for each song, working with means might be problematic.

So, my main questions again are:

If I perform a cross-correlation for one user data/song data pair, how do I test for significance? Will R give me a correlation coefficient at each lag, or does it only tell me which lag is significant (but not provide any test statistic)? If the latter is the case, will I need to adjust one series of data (to account for the lag) before running a normal Pearson's correlation?
What test may I use when the data are not stationary?
There are a different number of users for each song. For this reason, I can't simply take the average of all users' data for each song (to correlate the mean user data with the song data) – is that correct? Is there a way to test the correlation between signals A and B for each song (across existing users), or must I try to calculate the correlation for each user/song pair individually?

I hope my intent is clear. Thanks for any insight.

Best Answer

I want to prove that, overall, signal B is correlated to signal A.

If you want to prove that, you could calculate the empirical correlation and estimate its statistical significance under the assumption of $i.i.d.$ observations. However, time series data is notorious for not satisfying the $i.i.d.$ assumption; the conditional means and/or variances of time series usually change with time. Hence, you need some model to describe the relation between A and B and their time development (including possibly the time development of the relationship itself). Once you have built a model and validated its assumptions, you may proceed to model-based inference. For example, you may test the model's overall significance or significance of particular coefficients or their combinations. That way you may establish (or fail to establish) significant relationships between A and B. (You may think of the $i.i.d.$ case as being a very simple model that reflects constancy of means and variances (and higher order moments) and also constancy of the relationship between A and B.)
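To illustrate why the i.i.d. assumption matters, here is a small simulation sketch. It is not part of the original answer, and the series lengths and replication counts are arbitrary choices. It counts how often `cor.test` declares a significant correlation between two series that are independent by construction, first for i.i.d. noise and then for random walks (a simple non-stationary case).

```r
# Illustrative simulation (not from the original answer): how often does
# cor.test() reject at the 5% level when the two series are truly unrelated?
set.seed(1)
n_obs  <- 100   # length of each simulated series (arbitrary choice)
n_reps <- 500   # number of simulated pairs (arbitrary choice)

reject_rate <- function(make_series) {
  # p-value of the i.i.d.-based Pearson test for each independent pair
  p <- replicate(n_reps, cor.test(make_series(), make_series())$p.value)
  mean(p < 0.05)
}

rate_iid  <- reject_rate(function() rnorm(n_obs))          # i.i.d. noise
rate_walk <- reject_rate(function() cumsum(rnorm(n_obs)))  # random walks

rate_iid   # should be near the nominal 0.05
rate_walk  # far above 0.05, although the walks are independent
```

The inflated rejection rate for the random walks is the reason a plain correlation test is not trustworthy for non-stationary series.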

This may be too general to be directly useful, but it should provide a framework for thinking about the problem and for further discussion. Unfortunately, I do not yet understand your problem well enough to suggest a concrete model to work with.
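Purely to show what "model-based inference" can look like in R, here is a minimal sketch on simulated data. The model (a regression of B on A with AR(1) errors), the variable names, and the simulated coefficients are all assumptions made for illustration; it is not a recommendation for the data in the question, and the model's assumptions would have to be validated first.

```r
# Hypothetical example: regression of signal B on signal A with AR(1) errors.
set.seed(1)
n <- 200
a <- as.numeric(arima.sim(list(ar = 0.7), n = n))                # stand-in for song data
b <- 70 + 2 * a + as.numeric(arima.sim(list(ar = 0.5), n = n))  # stand-in for user data

# arima() with xreg fits a linear regression whose errors follow the ARMA part
fit <- arima(b, order = c(1, 0, 0), xreg = cbind(song = a))

k    <- which(names(coef(fit)) == "song")
beta <- unname(coef(fit)[k])                 # estimated effect of A on B
se   <- unname(sqrt(diag(vcov(fit)))[k])     # its standard error
z    <- beta / se
p    <- 2 * pnorm(-abs(z))                   # approximate two-sided Wald test

c(estimate = beta, std.error = se, p.value = p)
```

A small p-value here supports a relationship between A and B only to the extent that the fitted model is adequate, for example if its residuals look like white noise.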

Related Solutions

Solved – Coefficient for a zero lag cross-correlation different from Pearson correlation r

A simple simulation without the NA values will show that ccf and cor give the same results. So the issue is with how ccf and cor treat NA values.

In your code you use na.pass for ccf and pairwise.complete.obs for cor. The documentation for ccf gives the following warning when using na.pass:

This means that the estimate computed may well not be a valid autocorrelation sequence, and may contain missing values.

Now the option pairwise.complete.obs for cor omits the observations where either series is NA. So if you want the same behaviour as cor, you should do the same. A simple simulation shows that if you use na.omit instead of na.pass, ccf and cor give the same result.
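The simulation referred to above is not shown in the answer; the following is one way it might look, on made-up data. Note one assumption: because `na.omit` on a time series can fail when the NA values are internal, this sketch drops incomplete pairs by hand with `complete.cases` before calling `ccf`. Dropping rows changes the time spacing, so only the lag-0 value is meaningfully comparable.

```r
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 0.6 * x + rnorm(n)

# Helper (hypothetical name): extract the lag-0 coefficient from a ccf object
lag0 <- function(cc) cc$acf[cc$lag == 0]

# 1) No missing values: lag-0 ccf equals Pearson's r
r_ccf <- lag0(ccf(x, y, plot = FALSE))
r_cor <- cor(x, y)

# 2) With missing values
x_na <- x; x_na[c(5, 40)] <- NA
y_na <- y; y_na[17] <- NA
ok   <- complete.cases(x_na, y_na)

r_ccf_omit <- lag0(ccf(x_na[ok], y_na[ok], plot = FALSE))        # pairs dropped first
r_cor_pair <- cor(x_na, y_na, use = "pairwise.complete.obs")
r_ccf_pass <- lag0(ccf(x_na, y_na, na.action = na.pass, plot = FALSE))

all.equal(r_ccf, r_cor)            # TRUE
all.equal(r_ccf_omit, r_cor_pair)  # TRUE
c(r_ccf_pass, r_cor_pair)          # na.pass generally gives a slightly different value
```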

Now the question is how exactly ccf calculates the covariances when na.action=na.pass? This can be seen by inspecting the code of acf, which ccf calls. Here is the offending line:

if (demean) 
        x <- sweep(x, 2, colMeans(x, na.rm = TRUE), check.margin = FALSE)

The default for demean is TRUE; if it is FALSE, the means are assumed to be zero. With na.pass, therefore, each series' mean is computed from all of that series' non-missing observations, rather than only from the observations where both series are non-missing.
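A tiny made-up example shows the difference between the two ways of computing the means. The data values are invented for illustration only.

```r
x <- c(1, 2, NA, 4, 5, 6)
y <- c(2, NA, 3, 5, 4, 7)
m <- cbind(x, y)

# What the demeaning line does under na.pass: each column uses its own
# non-missing values
means_each <- colMeans(m, na.rm = TRUE)

# What pairwise-complete handling implies: only rows where both are observed
ok         <- complete.cases(m)
means_pair <- colMeans(m[ok, , drop = FALSE])

means_each  # x = 3.6, y = 4.2
means_pair  # x = 4.0, y = 4.5
```

Because the series are centred on different means, the resulting covariances and their scaling differ as well.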

So in the end the issue is the same as in the question mentioned by @lekshmi dharmarajan: the scaling differs when NA values are treated differently.

Solved – Cross correlation influenced by self auto correlation

Pre-whitening is definitely the way to go. It does not change the relationship but enables identification of the relationship between the original series. Care should be taken to identify any deterministic structure in the original series and to develop the pre-whitening filters in conjunction with it. See http://viewer.zmags.com/publication/9d4dc62a#/9d4dc62a/66 for a review which highlights Transfer Function identification. If you wish, you can post your data in Excel format and I will try to explain each step.
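As a rough sketch of the idea in base R (this is not the AUTOBOX procedure, and the simulated data, the AR(1) filter, and the helper name are assumptions for illustration): difference both series, fit a filter to the input X, apply the same filter to Y, and then inspect the cross-correlations of the two filtered series.

```r
set.seed(123)
n  <- 300
dx <- as.numeric(arima.sim(list(ar = 0.6), n = n))  # autocorrelated increments of X
dy <- 0.8 * c(0, 0, dx[1:(n - 2)]) + rnorm(n)       # Y responds to X two steps later
x  <- cumsum(dx)                                    # both levels are non-stationary
y  <- cumsum(dy)

# Step 1: difference each series once to obtain stationarity
dx_obs <- diff(x)
dy_obs <- diff(y)

# Step 2: fit an AR(1) filter to the differenced input series
fit <- arima(dx_obs, order = c(1, 0, 0), include.mean = FALSE)
phi <- unname(coef(fit)["ar1"])

# Step 3: apply the SAME filter to both series (hypothetical helper)
prewhiten <- function(z, phi) z[-1] - phi * z[-length(z)]
pw_x <- prewhiten(dx_obs, phi)
pw_y <- prewhiten(dy_obs, phi)

# Step 4: cross-correlate the pre-whitened series
cc       <- ccf(pw_x, pw_y, lag.max = 10, plot = FALSE)
best_lag <- cc$lag[which.max(abs(cc$acf))]
best_lag  # negative lags in ccf(x, y) mean that x leads y
```

In this simulation the dominant cross-correlation sits at the lag where X leads Y by two steps, which is much harder to read off the cross-correlations of the raw, autocorrelated series.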

EDITED AFTER RECEIPT OF DATA:

120 values for Y (STOCK1) and X (STOCK2) were analyzed following the guidelines at https://onlinecourses.science.psu.edu/stat510/node/75, using an automatic option available in AUTOBOX (http://www.autobox.com/cms/), a commercially available system which I have helped develop. Modelling is an iterative, self-checking process, which extracts structure from the data (with possible model pre-specification) and culminates in a parsimonious equation. I will try to walk through the steps, showing details from the automatic process, which is faithful to the PSU reference.

The initial pre-whitening filters for X and Y are shown here [image]. Each of the two series is non-stationary, and each one required one order of differencing to obtain stationarity.

The pre-whitened cross-correlations and proportional Impulse Response Weights are [image]. AUTOBOX, in a conservative mode, initially suggests 1 lag in the difference of X. Estimation and diagnostic checking suggest the need to add a second lag to the model. Intervention detection examines the need to accommodate unspecified deterministic structure and suggests a pulse at period 8, which is not significant. Step-down leads to the final model [image] and here [image]. The model's residuals are plotted here [image]. The Actual/Fit and Forecast (based upon future expectations of X and the model) are here [image].