Solved – Autocorrelation of a sine wave

autocorrelation, correlation, time series

I would like to understand the autocorrelation graph of a sine wave. At a time lag of 0, the autocorrelation should take its highest value of 1, since a signal is completely correlated with itself. By the same logic, at a lag equal to the period of the signal the correlation should again be at its maximum, since the shifted signal is again the signal itself. However, when I plot the correlation in Python, I get a function that keeps increasing, which goes against my intuition that the correlation function should be periodic. Could anyone please explain why the autocorrelation graph shows this trend?

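import matplotlib.pyplot as plt
import numpy as np

time = np.arange(0, 10, 0.1)
y = np.sin(time)
# 'full' returns the sliding dot product at every lag (length 2*len(y) - 1)
result = np.correlate(y, y, mode='full')
# Plot only the first half of the output (the negative lags)
plt.plot(result[:result.size // 2])
plt.show()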


Best Answer

This is not a real statistical effect, but rather it is due to the way that the numpy.correlate function works, which isn't really suited for what you want to do (at least not without applying some correction afterwards).

First of all, it doesn't compute a correlation coefficient in the typical statistical sense. It starts off the same way, computing a dot product between the two input vectors, but it never normalizes that dot product (as, e.g., Pearson's correlation coefficient does), so the result isn't confined to [-1, 1]. The result you get is simply the sum of the pairwise products of the entries in the two vectors.

If you use numpy.correlate with the 'full' setting, it computes this product for every possible 'lag' between the two vectors. In other words, it 'slides' one vector over the other and takes the dot product over only those positions where both vectors have an entry. The example given in the NumPy documentation is helpful:

>>> np.correlate([1, 2, 3], [0, 1, 0.5], "full")
array([ 0.5,  2. ,  3.5,  3. ,  0. ])

If we examine the output on the second line, we see that the first entry is simply the product of 1 and 0.5, i.e. the first entry of the first vector and the last entry of the second. The second entry is the dot product between the first two entries of the first vector and the last two entries of the second, i.e. 1*1 + 2*0.5 = 2, and so on.
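To make the 'sliding' concrete, here is a minimal sketch that reproduces the 'full' output with explicit loops. The helper name sliding_dot is hypothetical (it is not part of NumPy), and the sketch assumes real-valued 1-D inputs, so it ignores the complex conjugate that numpy.correlate applies to its second argument.

```python
import numpy as np

def sliding_dot(a, v):
    """Hypothetical helper: mimic np.correlate(a, v, mode='full') for real 1-D inputs."""
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    n, m = len(a), len(v)
    out = []
    # Lag k pairs a[j + k] with v[j]; k runs from -(m - 1) to n - 1.
    for k in range(-(m - 1), n):
        total = 0.0
        for j in range(m):
            i = j + k
            if 0 <= i < n:          # only positions where both vectors have an entry
                total += a[i] * v[j]
        out.append(total)
    return np.array(out)

print(sliding_dot([1, 2, 3], [0, 1, 0.5]))
```

At the two ends of the output only a single pair overlaps, which is why those entries are made of one product each.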

So in your case, the first value in your graph is just the product of the first and last values of the sine wave, which is obviously rather small. As you slide the vectors over each other, more and more pairs overlap, so more products are summed and the totals grow. What you're effectively looking at is the true (periodic) autocorrelation function multiplied by a line with positive slope, where the line reflects the increasing number of entries contributing to the dot product. Note that the half of the output you plot runs from the most negative lag up to just before zero lag, so the 'blow up' toward the right of the graph is the overlap growing as the lag approaches zero, not the correlation growing with lag.
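One way to see the 'line' directly is to count how many pairs overlap at each lag and divide it out. This is only a sketch: the variable names (lags, overlap, per_pair) are my own, and dividing by the overlap count gives an average product per pair, not yet a correlation coefficient.

```python
import numpy as np

time = np.arange(0, 10, 0.1)
y = np.sin(time)
n = y.size

raw = np.correlate(y, y, mode="full")   # unnormalized sums, length 2n - 1
lags = np.arange(-(n - 1), n)           # lag belonging to each output entry
overlap = n - np.abs(lags)              # number of products summed at each lag

# Average product per overlapping pair: removes the triangular envelope
per_pair = raw / overlap
```

Plotting per_pair against lags should look much closer to an oscillation of roughly constant amplitude, although the values at large lags are averages over only a few samples and are correspondingly unreliable.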

This really isn't what you want for an autocorrelation function, which should be properly normalized. So the upshot is: you should probably use a different function that matches the usual statistical definition (e.g. numpy.corrcoef, wrapped in a function that handles the lags for which you want to compute the autocorrelation), or work out how to normalize the output you get from numpy.correlate.
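A minimal sketch of the first option, assuming a small wrapper around numpy.corrcoef: the function name autocorr and its max_lag parameter are hypothetical, and the wrapper simply computes the Pearson correlation between the series and a shifted copy of itself, using only the overlapping samples at each lag.

```python
import numpy as np

def autocorr(x, max_lag):
    """Hypothetical wrapper: Pearson correlation of x with itself at lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    out = np.empty(max_lag + 1)
    out[0] = 1.0                      # a series is perfectly correlated with itself
    for k in range(1, max_lag + 1):
        # Correlate the overlapping parts: x without its last k samples
        # against x without its first k samples.
        out[k] = np.corrcoef(x[:-k], x[k:])[0, 1]
    return out

time = np.arange(0, 10, 0.1)
y = np.sin(time)
acf = autocorr(y, max_lag=70)
```

With a step of 0.1 the period of the sine wave is about 63 samples, so acf should dip close to -1 near lag 31 and return close to 1 near lag 63, which is the periodic behaviour the question expected. Keep max_lag well below the length of the series, since each coefficient is estimated from fewer samples as the lag grows.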