Solved – Why are VAR models working better with nonstationary data than with stationary data

forecasting, r-squared, stationarity, time series, vector-autoregression

I'm using Python's statsmodels VAR library to model financial time series data, and some results have me puzzled. I know that VAR models assume the time series data is stationary. I inadvertently fit a non-stationary series of log prices for two different securities, and surprisingly the fitted values and in-sample forecasts were very accurate, with relatively insignificant, stationary residuals. The $R^2$ on the in-sample forecast was 99%, and the standard deviation of the forecast residual series was about 10% of the forecast values.

However, when I difference the log prices and fit that series to the VAR model, the fitted and forecast values are far off the mark, bouncing in a tight range around the mean. As a result, the residuals track the log returns better than the fitted values do: the standard deviation of the forecast residuals is 15x larger than that of the fitted series, and the $R^2$ for the forecast series is 0.007.
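For reference, here is a rough, self-contained sketch of the two fits, using simulated random walks as stand-ins for my actual log-price data (the numbers quoted above come from the real data, not from this simulation):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
n = 1000

# Two independent random walks standing in for the two log-price series.
log_prices = pd.DataFrame(
    np.cumsum(rng.normal(scale=0.01, size=(n, 2)), axis=0),
    columns=["asset_a", "asset_b"],
)

def in_sample_r2(data, lags=1):
    """Fit a VAR(lags) and return the in-sample R^2 for each column."""
    res = VAR(data).fit(lags)
    fitted = np.asarray(res.fittedvalues)   # the first `lags` observations are dropped
    actual = data.to_numpy()[lags:]
    ss_res = ((actual - fitted) ** 2).sum(axis=0)
    ss_tot = ((actual - actual.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

# Levels (non-stationary log prices): R^2 close to 1, as described above.
print(in_sample_r2(log_prices))

# Differenced series (log returns, stationary): R^2 close to 0.
print(in_sample_r2(log_prices.diff().dropna()))
```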

Am I misinterpreting fitted values vs. residuals in the VAR model, or making some other error? Why would a non-stationary time series produce more accurate predictions than a stationary one based on the same underlying data? I've worked a good bit with ARMA models from the same Python library and saw nothing like this when modeling single-series data.

Best Answer

Two facts:

  1. When you regress one random walk on another random walk and incorrectly assume stationarity, your software will generally spit back statistically significant results, even if the two are independent processes! For example, see these lecture notes. (Google for "spurious regression random walk" and numerous links will come up.) What's going wrong? The usual OLS estimates and standard errors are based on assumptions that aren't true in the case of random walks.

    Pretending the usual OLS assumptions apply and regressing two independent random walks on each other will generally lead to regressions with huge $R^2$ and highly significant coefficients, and it's all entirely bogus! When there's a random walk and you run a regression in levels, the usual OLS assumptions are violated: your estimate does not converge to a constant as $t \rightarrow \infty$, the usual central limit theorem does not apply, and the t-stats and p-values your regression spits out are all wrong. (See the simulation sketch after this list.)

  2. If two variables are cointegrated, you can regress one on the other and your estimator will converge faster than in the usual stationary regression, a result known as super-consistency. E.g., check out John Cochrane's Time Series book online and search for "superconsistent."
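Both facts are easy to see in a quick simulation (my own illustration, not taken from the lecture notes or from Cochrane's book). Regressing one independent random walk on another, OLS typically reports a sizable $R^2$ and an enormous t-statistic even though the series are unrelated; regressing two genuinely cointegrated series on each other, the slope estimate pins down the true coefficient very precisely:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000

# Fact 1: two *independent* random walks -- a spurious regression.
x = np.cumsum(rng.normal(size=n))
y = np.cumsum(rng.normal(size=n))
spurious = sm.OLS(y, sm.add_constant(x)).fit()
print(f"independent walks:  R^2 = {spurious.rsquared:.2f}, "
      f"slope t-stat = {spurious.tvalues[1]:.1f}")   # typically "significant", all bogus

# Fact 2: a cointegrated pair, y = 2*x + stationary noise -- superconsistency.
x2 = np.cumsum(rng.normal(size=n))
y2 = 2.0 * x2 + rng.normal(size=n)
coint = sm.OLS(y2, sm.add_constant(x2)).fit()
print(f"cointegrated pair:  slope estimate = {coint.params[1]:.4f}  (true value 2)")
```

The t-statistic in the first regression is meaningless because its distribution diverges rather than settling down to the usual Student-t; the precision of the slope in the second regression is super-consistency at work.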