Solved – How many endogenous variables in a VAR model with 120 observations

r, time series, vector-autoregression

I am doing time series forecasting in R using the VAR implementation in the "vars" package. My model uses a default of six lags, and each of my time series has 120+ observations. I am testing different combinations of series and back-testing them with VAR in order to decide which set of variables to choose. How many variables can I expect to fit into my VAR model (including the variable I am interested in forecasting)?

Best Answer

Since vars uses (equation-by-equation) OLS estimation, the number of parameters in one equation cannot be greater than the number of data points used in the estimation, which is the sample size $T$ minus the lag length $p$.

The number of parameters per equation is $p \times K + 1$ where $K$ is the number of endogenous variables and 1 stands for the intercept.

Then the condition is

$$ Kp+1 \leqslant T-p, $$

which gives you

$$ K \leqslant \frac{T-p-1}{p} = \frac{T-1}{p}-1. $$

Take the largest integer $K$ that satisfies the inequality.
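As a quick check with the numbers from the question ($T = 120$, $p = 6$), the bound can be evaluated directly. This is only a sketch: the helper name `max_endogenous` is made up for illustration and is not part of vars, and it assumes an intercept but no trend, seasonal dummies, or exogenous regressors.

```r
# Largest K with K * p + 1 <= T - p (intercept only, no trend or exogenous terms).
# `max_endogenous` is a hypothetical helper, not a function from the vars package.
max_endogenous <- function(n_obs, p) {
  (n_obs - p - 1) %/% p   # integer division gives the largest feasible integer K
}

k_max <- max_endogenous(n_obs = 120, p = 6)
k_max                                  # 18

# Boundary check against the original inequality.
fits_18 <- 18 * 6 + 1 <= 120 - 6       # TRUE:  109 <= 114
fits_19 <- 19 * 6 + 1 <= 120 - 6       # FALSE: 115 >  114
```

At $K = 18$ each equation has 109 parameters estimated from 114 usable observations, so only 5 residual degrees of freedom remain.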

Now this is only the technical limit due to the OLS mechanics. But it may not be the limit you are looking for. For a very large (but still feasible) $K$ you may expect your estimates to have very high variance and thus be of limited practical use.

A more sensible question to ask, then, is which $K$ gives a model that still generalizes fairly well out of sample. Here you are essentially facing the well-known bias-variance trade-off: including more variables may reduce omitted-variable bias, but it also raises the estimation variance. The bias-variance trade-off has been discussed extensively before; you may review older posts for that. (So in the end, the general answer on the optimal choice of $K$ is: it depends...)
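One way to look at this trade-off empirically is a rolling-origin backtest that compares a small and a larger variable set on the same hold-out period. The following is a minimal sketch, not a definitive procedure: `backtest_var` is a hypothetical helper, the `Canada` data shipped with vars stands in for the asker's series, and the choices `p = 2` and `n_test = 20` are arbitrary assumptions.

```r
library(vars)

# Rolling-origin backtest: refit on an expanding window, forecast one step
# ahead, and return the out-of-sample RMSE for the target series.
# `backtest_var` is a hypothetical helper, not part of vars.
backtest_var <- function(y, target, p, n_test) {
  y <- as.matrix(y)
  n <- nrow(y)
  errors <- numeric(n_test)
  for (i in seq_len(n_test)) {
    end <- n - n_test + i - 1                    # last training observation
    train <- y[seq_len(end), , drop = FALSE]
    fit <- VAR(train, p = p, type = "const")     # equation-by-equation OLS
    fc <- predict(fit, n.ahead = 1)$fcst[[target]][1, "fcst"]
    errors[i] <- y[end + 1, target] - fc
  }
  sqrt(mean(errors^2))
}

data("Canada")   # example data shipped with vars (series e, prod, rw, U)
rmse_small <- backtest_var(Canada[, c("e", "U")], target = "U", p = 2, n_test = 20)
rmse_full  <- backtest_var(Canada, target = "U", p = 2, n_test = 20)
c(small = rmse_small, full = rmse_full)
```

Whichever variable set gives the lower hold-out RMSE is the better candidate on this criterion; with many candidate sets, the winner's RMSE is itself optimistically biased, so a final untouched test period is worth keeping aside.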

Related Solutions

Solved – How many lags should I include in a VAR model

Regarding the first question, different equations of a VAR model need not have the same lag order. Each equation is meaningful by itself and can be treated separately (as regards estimation). If you find that one of the equations may benefit from including some more regressors, you may as well do that.

Regarding the picture, I can understand why you have one full row in the lag 2 matrix, but why do you also have one full column? Based on what you have described, that seems unnecessary.

Regarding lag 5, is it plausible that there could be an effect with lag 5? (This is a subject-matter question.) If yes, then consider including just lag 5; including all the lags in between 1 and 5 would not be a parsimonious solution. And you should care about parsimony since your sample is quite small. If lag 5 is quite implausible, maybe the significant autocorrelation at that lag is a false positive that is due to chance?

Keep in mind that trying to fit the data very well may lead to overfitting. Using information criteria such as AIC or BIC could help decide between a few sensible candidate models. That means that you would deliberately accept ill-behaved model errors when including extra parameters is too costly due to increased estimation uncertainty. That should give some overall guidance as well as address the questions in the last paragraph.
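For a VAR with a common lag order, the information-criterion comparison can be sketched with `VARselect()` from the vars package. The example below uses the `Canada` data shipped with vars rather than the asker's data, and the maximum lag of 8 is an arbitrary assumption.

```r
library(vars)

data("Canada")   # example data shipped with vars, not the asker's series

# Compare lag orders 1..8 by information criteria for a VAR with a constant.
sel <- VARselect(Canada, lag.max = 8, type = "const")
sel$selection    # lag order preferred by AIC, HQ, SC (i.e. BIC) and FPE
sel$criteria     # criterion values for every candidate lag order
```

SC (BIC) penalizes extra parameters more heavily than AIC, so in small samples it tends to pick the more parsimonious model.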

Solved – the best model for time series data with independent and dependent variables

The vignette for the vars package describes impulse response analysis as the last step. This sounds like what you are looking for. They end by computing the response of unemployment (U) to shocks in the model's variables with the command:

svec.irf <- irf(svec, response = "U", n.ahead = 48, boot = TRUE)

"irf" is an acronym for Impulse Response Function.

There is also a wiki page with a general introduction for several disciplines. My guess is that the one on Economics might be closest to what you want.
