Solved – Why would correlation for daily data be higher than weekly data

multiple regression, seasonality, time series

I'm running a multiple linear regression on 2 years of daily sales data. When running against the daily data, I get a correlation of 84%, but when summarized to weekly totals, the correlation over the same period drops to 68% with the same underlying inputs/output. How/why would this be?

Further, hold-out testing shows this daily regression to be more predictive at the weekly level than the weekly data regression. How could this be?

Best Answer

This happens all the time in financial time series. I would invert the question: why do you think the series must be correlated to the same degree at different sampling frequencies?

Say you have two stationary time series, maybe autocorrelated. In the long run they must not be correlated much. They're stationary after all, right? However, in the short run they might move together.

To make it even more obvious, imagine two perfect sine waves with a period of one week. If you sample them, or average them, at weekly frequency, each will look like a constant line. Measure the correlation and it will be essentially zero (strictly speaking, it is undefined for two constant series). However, if you sample at any higher frequency you'll see the sine waves, and the correlation will be 1.

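Here's a numerical example. Note that the correlation doesn't go exactly to zero because of rounding errors, but it changes significantly when I change the sampling frequency:

[Figure: top panel shows the two series at daily frequency; bottom panel shows the same series at the lower sampling frequency. Each panel title reports the correlation.]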

This is a stylized version of a scenario where daily traffic (blue dashes) leads daily sales (red). The correlation is perfect in magnitude. (With the half-period phase shift used in the code below, the coefficient comes out as -1 rather than +1.) You have a lot of traffic, and then 3.5 days later sales spike. That's the top plot.

Now you decide to sample at weekly frequency, and maybe sum up all the traffic during the week, and you get the bottom plot. Since sales and traffic repeat the same pattern every week, you don't see any change from week to week. All the dynamics are at the daily level. Not surprisingly, the correlation here drops toward 0. As I wrote, because of rounding errors it's not exactly 0, but it is still much smaller than perfect.

The MATLAB code:

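n = 101;                      % number of observations
x = (1:n)';                   % time index as a column vector
y = sin(2*pi/10*x);           % first series: sine with a period of 10 samples
y2 = sin(2*pi/10*x+pi);       % second series: same sine shifted by half a period
figure
subplot(2,1,1)                % top panel: every observation
plot([y y2],'-.')
% corr is part of the Statistics and Machine Learning Toolbox
title(sprintf('\\rho = %f',corr(y,y2)))

subplot(2,1,2)                % bottom panel: one observation per period
plot(1:10:n,[y(1:10:n) y2(1:10:n)],'+')
title(sprintf('\\rho = %f',corr(y(1:10:n), y2(1:10:n))))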
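The MATLAB example samples one point per period. The other case mentioned above, summing each week into a total, can be sketched in a similar way. This is only a rough illustration in plain Python (standard library only), not the original author's code. It assumes that both series share the same within-week cycle and that each has its own small, independent week-level shift. The names `pearson` and `weekly_totals`, the noise level 0.1, and the seed are all made up for the example.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def weekly_totals(daily):
    """Sum consecutive 7-day blocks of a daily series."""
    return [sum(daily[i:i + 7]) for i in range(0, len(daily), 7)]


rng = random.Random(0)  # fixed seed so the run is repeatable
weeks = 104             # roughly two years of daily data

traffic, sales = [], []
for _ in range(weeks):
    # Assumption: week-level shifts are small and independent
    # between the two series.
    traffic_level = rng.gauss(0, 0.1)
    sales_level = rng.gauss(0, 0.1)
    for day in range(7):
        # Shared within-week cycle; it sums to (numerically) zero
        # over each full week.
        cycle = math.sin(2 * math.pi * day / 7)
        traffic.append(traffic_level + cycle)
        sales.append(sales_level + cycle)

daily_corr = pearson(traffic, sales)
weekly_corr = pearson(weekly_totals(traffic), weekly_totals(sales))

print(f"daily correlation:  {daily_corr:.2f}")
print(f"weekly correlation: {weekly_corr:.2f}")
```

Under these assumptions the daily correlation is driven by the shared weekly cycle, while the weekly totals cancel that cycle out and leave only the unrelated week-level shifts, so the weekly figure should come out much lower. Unlike the stylized plot, the two series here move in phase rather than with a lag, which keeps the sketch short.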