If there is weekly seasonality, set the seasonal period to 7.
salests <- ts(data,start=2010,frequency=7)
modArima <- auto.arima(salests)
Note that auto.arima()'s automatic selection of seasonal differencing was not very good until fairly recently. If you are using v2.xx of the forecast package, set D=1 in the call to auto.arima() to force seasonal differencing. If you are using v3.xx of the forecast package, the automatic selection of D works much better (it uses an OCSB test instead of a CH test).
Don't try to compare AIC values for models with different orders of differencing; they are not directly comparable. AIC comparisons are only reliable between models with the same orders of differencing.
You don't need to re-fit the model after calling auto.arima(). It returns an Arima object, just as if you had called arima() with the selected model orders.
There are a number of possible models, at a variety of levels of complexity. These include (some are very closely related):
Time series regression with lagged variables
Lagged regression models. See also distributed lag models
Regression with autocorrelated errors
Transfer function modelling / lagged regression with autocorrelated errors
ARMAX models
Vector autoregressive models
State-space/dynamic linear models can incorporate both autocorrelated and regression components
Because your input series is 0/1 you may want to look at lagged regression with autocorrelated errors, but watch for seasonal and calendar effects (like holidays).
So simple-ish models might perhaps look something like
$\qquad\text{Sales}_t = \phi_0+\phi_1\,\text{Sales}_{t-1} +\beta_3\,\text{job}_{t-3}+\beta_4\,\text{job}_{t-4}+\epsilon_t$
or perhaps something like
$\qquad\text{Sales}_t = \alpha +\beta_3\,\text{job}_{t-3}+\beta_{12}\,\text{job}_{t-12}+\text{seasonal}_{t}+\eta_t$
where $\eta_t$ is in turn some ARMA model for the noise term (though you may well want more lags in there than just one), or a variety of other possibilities. [The seasonal term above doesn't have a parameter because it's likely to have several components, and so several parameters; consider it a placeholder for a model of that component of the data. Neither of those models is likely to be sufficient; they're just to give a general sense of what a simple model might look like.]
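To make the first sketch concrete, here is a minimal simulation of such a model in Python (a translation of the idea, not code from the answer; all coefficient values and the 0/1 input probability are arbitrary illustration choices, not estimates from any data):

```python
import random

# Simulate Sales_t = phi0 + phi1*Sales_{t-1} + b3*job_{t-3} + b4*job_{t-4} + eps_t
# Coefficients are arbitrary illustration values, not fitted estimates.
random.seed(42)
phi0, phi1, b3, b4 = 10.0, 0.5, 4.0, 2.0
n = 200
job = [1 if random.random() < 0.3 else 0 for _ in range(n)]  # 0/1 input series
sales = [phi0 / (1 - phi1)]  # start the series at its no-input baseline mean
for t in range(1, n):
    x3 = job[t - 3] if t >= 3 else 0  # lagged binary regressors
    x4 = job[t - 4] if t >= 4 else 0
    eps = random.gauss(0, 1)
    sales.append(phi0 + phi1 * sales[-1] + b3 * x3 + b4 * x4 + eps)
```

With $|\phi_1|<1$ the simulated series is stationary around its baseline, shifting upward for a few periods after each job event.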
You may also want to consider whether the binary job-status variable needs a model itself. If you want to forecast further ahead than the smallest lag involving it, it may well be essential to at least consider whether there are any such effects; see transfer function models, but keep in mind the special nature of the binary variable.
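If the job-status series does need its own model, one simple possibility that respects its 0/1 nature (my illustration, not something prescribed in the answer) is a two-state Markov chain:

```python
import random

# Two-state Markov chain for a 0/1 status series:
# p01 = P(switch 0 -> 1), p10 = P(switch 1 -> 0).
# Transition probabilities here are arbitrary illustration values.
random.seed(1)
p01, p10 = 0.1, 0.3
state, path = 0, []
for _ in range(500):
    path.append(state)
    if state == 0:
        state = 1 if random.random() < p01 else 0
    else:
        state = 0 if random.random() < p10 else 1

# Long-run fraction of 1s should be near p01 / (p01 + p10) = 0.25
frac_on = sum(path) / len(path)
```

The chain's estimated transition probabilities could then feed scenario forecasts of the input series beyond its smallest lag.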
Once you have an appropriate model for sales that captures the main features well, you can look at testing. You should have enough data (it looks like several years) to hold some out for out-of-sample model testing and validation. I'd start by considering the features of sales alone: is it stationary? Autocorrelated? Does it have seasonal, cyclical, or calendar components? Are there other major drivers to consider?
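For time series, the hold-out must be a time-ordered split (the final stretch of the series), never a random subset. A minimal sketch, where the 4-week horizon h=28 is an arbitrary choice:

```python
# Time-ordered train/test split: hold out the last h observations
# for out-of-sample validation. h = 28 (four weeks) is arbitrary here.
def holdout_split(series, h=28):
    if h >= len(series):
        raise ValueError("holdout longer than series")
    return series[:-h], series[-h:]

train, test = holdout_split(list(range(100)), h=28)  # train: 0..71, test: 72..99
```

Fit on `train`, forecast h steps ahead, and compare against `test`; rolling-origin evaluation is a natural extension of the same idea.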
Since you mention R, note that the function tslm in the forecast package can be handy for including seasonal or trend components in regression models.
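Under the hood, trend and season terms in such a regression amount to a design matrix with a linear trend column plus seasonal dummy variables. A rough Python sketch of that construction (an illustration of the idea only, not the forecast package's actual code; frequency 7 is assumed to match the weekly-seasonality example above):

```python
# Build the regressors implied by a trend-plus-season regression:
# one linear trend column and (period - 1) seasonal dummy columns,
# with season 1 left out as the baseline level.
def seasonal_design(n, period=7):
    rows = []
    for t in range(n):
        trend = t + 1
        dummies = [1 if (t % period) == s else 0 for s in range(1, period)]
        rows.append([trend] + dummies)
    return rows

X = seasonal_design(14)  # two weeks of rows: trend 1..14 plus 6 dummies each
```

Regressing sales on these columns (plus lagged inputs) gives the kind of model the second sketch above describes.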
A book that discusses nearly all of those topics is Shumway and Stoffer, Time Series Analysis and Its Applications (the 3rd edition is available at Stoffer's page here). Another highly recommended text is Forecasting: Principles and Practice by Hyndman and Athanasopoulos, here, which covers some (but not as many) of the things I mentioned.
Best Answer
This happens all the time in financial time series. I would invert the question: why do you think the series must be correlated to the same degree at different sampling frequencies?
Say you have two stationary time series, possibly autocorrelated. In the long run they need not be much correlated; they're stationary, after all. However, in the short run they might move together.
To make it even more obvious, imagine two perfect sine waves with a period of one week. If you sample them (or average them) at weekly frequency, they'll look like a constant line; you measure the correlation, and it will be zero. However, if you sample at any higher frequency, you'll see the sine waves, and the correlation will be 1.
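The sine thought experiment is easy to check numerically. Here is a quick Python sketch (separate from this answer's MATLAB example), using two identical 7-day sine waves so the daily correlation is exactly 1, while every weekly average collapses to (numerically) the same constant:

```python
import math

# Daily samples of two identical sine waves with a 7-day period.
n_weeks = 8
x = [math.sin(2 * math.pi * t / 7) for t in range(7 * n_weeks)]
y = list(x)  # a second series tracking the first

def corr(a, b):
    # Pearson correlation, computed directly from the definition.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

daily_corr = corr(x, y)  # the series are identical, so this is 1

# Weekly averages: the mean of a full sine cycle is 0, so every
# weekly value is the same constant up to floating-point rounding,
# and a weekly correlation is degenerate (zero variance).
weekly_x = [sum(x[7 * w:7 * w + 7]) / 7 for w in range(n_weeks)]
weekly_spread = max(weekly_x) - min(weekly_x)  # essentially zero
```

With zero variance at the weekly scale, the weekly correlation is strictly undefined; in practice rounding noise makes it a meaningless number near zero, which is the "rounding errors" caveat below.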
Here's a numerical example. Note that the correlation doesn't go exactly to zero (because of rounding errors), but it changes significantly when I change the sampling frequency:
What this demonstrates is a stylized version of a scenario where daily traffic (blue dashes) leads daily sales (red). The correlation is a perfect 1: you get a lot of traffic, and then 3.5 days later the sales spike. That's the top plot.
Now you decide to sample at weekly frequency, perhaps summing all the traffic during each week, and you get the bottom plot. Since the weekly totals of sales and traffic barely change, you don't really see the dynamics at the weekly scale; all the dynamics are at the daily level. Not surprisingly, the correlation here drops to 0. As I wrote, due to rounding errors it isn't exactly 0, but it's still much smaller than the perfect daily correlation.
The MATLAB code: