Spurious Correlation – Does It Really Matter in Regression and Time Series Analysis?

autocorrelation, correlation, regression, spurious-correlation, time series

Let’s say you are trying to determine whether there is a correlation between two stock prices, both of which are likely non-stationary series. You are not concerned with a potentially causal relationship…

You run a simple correlation analysis against all the rules: both series are autocorrelated and non-stationary. You find a 98% correlation, so you conclude that the two series depend on each other.

This is the conversation I just had with a colleague… but I think they are 100% wrong and I’d like some confirmation.

If you find two autocorrelated and non-stationary series to be 98% correlated, then the correlation is likely spurious. What this means to me is that the observed correlation is likely due to pure chance, or else it is a result of the two series’ mutual dependence on something outside the two series themselves. So if our goal is to identify the extent to which these two series “depend” on each other, finding a valid correlation coefficient is necessary first. Is this correct?
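For concreteness, here is a minimal sketch of the “pure chance” case (the seed and sample size are arbitrary choices, not real data): two independent random walks will frequently show a large sample correlation even though, by construction, neither depends on the other.

. clear
. set seed 1
. set obs 200
. gen x = sum(rnormal())   // random walk 1: running sum of iid shocks
. gen y = sum(rnormal())   // random walk 2: independent of x by construction
. corr x y                 // |r| is often large purely by chance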

Best Answer

Here's a simulated example of two prices that are very highly correlated (sample correlation $r = 0.9875$). When you attempt to predict the price change in one using the lagged level of the other, very little of the variation in that price change is explainable:

. clear

. set seed 12092021

. set obs 102
Number of observations (_N) was 0, now 102.

. gen t = _n

. tsset t

Time variable: t, 1 to 102
        Delta: 1 unit

. gen p1 = 1 + 3*t + rnormal(0,5) 

. gen p2 = 3 + 2*t + rnormal(0,10)

. corr p1 p2
(obs=102)

             |       p1       p2
-------------+------------------
          p1 |   1.0000
          p2 |   0.9875   1.0000


. reg FD.p2 p1

      Source |       SS           df       MS      Number of obs   =       101
-------------+----------------------------------   F(1, 99)        =      0.01
       Model |  .727541841         1  .727541841   Prob > F        =    0.9436
    Residual |  14322.4337        99  144.671048   R-squared       =    0.0001
-------------+----------------------------------   Adj R-squared   =   -0.0100
       Total |  14323.1613       100  143.231613   Root MSE        =    12.028

------------------------------------------------------------------------------
       FD.p2 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          p1 |   .0009672   .0136392     0.07   0.944    -.0260959    .0280303
       _cons |   1.665843   2.420693     0.69   0.493    -3.137338    6.469024
------------------------------------------------------------------------------

. reg FD.p1 p2

      Source |       SS           df       MS      Number of obs   =       101
-------------+----------------------------------   F(1, 99)        =      0.01
       Model |  .683934381         1  .683934381   Prob > F        =    0.9171
    Residual |  6210.52068        99  62.7325321   R-squared       =    0.0001
-------------+----------------------------------   Adj R-squared   =   -0.0100
       Total |  6211.20461       100  62.1120461   Root MSE        =    7.9204

------------------------------------------------------------------------------
       FD.p1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          p2 |  -.0013704   .0131245    -0.10   0.917    -.0274123    .0246715
       _cons |   3.260085   1.574913     2.07   0.041     .1351165    6.385054
------------------------------------------------------------------------------

Here FD. is Stata's forward-difference operator, the first difference of the subsequent value, so $FD.p_t = p_{t+1} - p_t$ is the next period's price change.
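A quick way to convince yourself of the operator arithmetic (a check of my own, not part of the original run):

. gen double check = F.p2 - p2   // next period's price minus today's price
. assert check == FD.p2          // the FD. operator computes exactly this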

The $R^2$ of both models is approximately zero, so very little of the variation in tomorrow's price change can be explained by today's price. This illustrates the intuition that, knowing what you know today, you cannot act on this correlation to make money tomorrow.

You can play around with variations on this approach (using the lagged price change as a predictor, non-linear models, adding more data, more noise, or adding trends), with essentially the same results.
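For instance, here is a minimal sketch of the lagged-price-change variation, continuing with the simulated data above (output omitted; one would again expect a near-zero $R^2$):

. reg FD.p2 D.p1   // today's change in p1 predicting tomorrow's change in p2
. reg FD.p1 D.p2   // and symmetrically in the other direction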

You might object that my toy example is flawed because the high correlation is contemporaneous, so if you knew p1 today, you could predict p2 today. I think that is wrong, for the following reason. Suppose the DGP is as above, but unknown to you. You are an executive at company 1, and you learn that your CEO has been falsifying earnings and pinching bottoms. The news will become public shortly and will lower p1. You can’t short your own stock without a vacation at Club Fed. Should you short the stock of company 2, knowing that the correlation between p1 and p2 is ~1? I think that would be a terrible idea. This is what makes the correlation spurious, and it is why that matters.

You could also have a causal relationship but no correlation. When a house has air-conditioning set to a preset desired temperature, there will be a strong positive, non-spurious correlation between the amount of electricity used by the AC and the temperature outside. But there will be no correlation between the amount of electricity consumed and the inside temperature, and the outside and inside temperatures will also be uncorrelated. The last two are spurious non-correlations, to my mind. Yet all three correlations are valid (though “valid” has no formal definition in statistics), since a correlation is just a transformation of the data.
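Here is a minimal simulation of that story (every number is hypothetical, chosen only so that a thermostat holds the inside temperature near a 21-degree setpoint):

. clear
. set seed 3
. set obs 365
. gen outside = 25 + 8*rnormal()                      // daily outside temperature
. gen inside  = 21 + 0.2*rnormal()                    // thermostat keeps inside temp nearly constant
. gen kwh     = max(0, 2*(outside - 21)) + rnormal()  // AC draws more power on hotter days
. corr kwh outside inside                             // kwh-outside strong; the other two near zero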

This is all to say that a strong correlation is not necessary for a causal dependence to exist, and it is certainly not sufficient. Even the sign of the causal relationship can differ from the sign of the correlation. This matters when using correlations to do things out in the real world (i.e., interventions). Nor is this only an issue with time series data; it can happen with cross-sectional observational data as well.
