How many endogenous variables in a VAR model with 120 observations

r, time-series, vector-autoregression

I am doing time series forecasting in R using the VAR model implementation in the "vars" package. I use six lags by default, and each of my time series has at least 120 observations. To decide which set of variables to use, I fit VAR models on different combinations of series and back-test them. How many variables can I expect to fit into my VAR model (including the variable I am interested in forecasting)?

Best Answer

Since vars uses (equation-by-equation) OLS estimation, the number of parameters in one equation cannot be greater than the number of data points used in the estimation, which is the sample size $T$ minus the lag length $p$.

The number of parameters per equation is $p \times K + 1$ where $K$ is the number of endogenous variables and 1 stands for the intercept.

Then the condition is

$$ Kp+1 \leqslant T-p, $$

which gives you

$$ K \leqslant \frac{T-p-1}{p} = \frac{T-1}{p}-1. $$

Take the largest integer $K$ that satisfies the inequality.
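
As a quick check of the arithmetic for the numbers in the question ($T = 120$, $p = 6$), here is a minimal sketch in base R. The function name `max_endogenous` and its argument `n_det` (the number of deterministic terms per equation, 1 for an intercept only) are made up for illustration and are not part of the "vars" package.

```r
# Largest K satisfying K * p + n_det <= n_obs - p.
# n_det: deterministic terms per equation (assumed 1 = intercept only).
max_endogenous <- function(n_obs, p, n_det = 1) {
  floor((n_obs - p - n_det) / p)
}

K_max <- max_endogenous(n_obs = 120, p = 6)
K_max                               # 18

# Verify the bound: K_max fits, K_max + 1 does not.
K_max * 6 + 1 <= 120 - 6            # TRUE  (109 <= 114)
(K_max + 1) * 6 + 1 <= 120 - 6      # FALSE (115 >  114)
```

With more than 120 observations or a shorter lag length the bound rises accordingly; adding a trend term would correspond to `n_det = 2` in this sketch.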

Now this is only the technical limit due to the OLS mechanics. But it may not be the limit you are looking for. For a very large (but still feasible) $K$ you may expect your estimates to have very high variance and thus be of limited practical use.
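
To make the variance concern concrete, one can count the residual degrees of freedom left in each equation, using the numbers from the question ($T = 120$, $p = 6$) as a worked illustration:

$$ \text{df} = (T - p) - (Kp + 1). $$

At the technical limit $K = 18$ this gives

$$ \text{df} = 114 - 109 = 5, $$

so each equation fits 109 coefficients from 114 observations, and the whole system contains $18 \times 109 = 1962$ estimated coefficients. With a smaller system, say $K = 3$, each equation has $3 \times 6 + 1 = 19$ coefficients and $114 - 19 = 95$ degrees of freedom.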

A more sensible question is then which $K$ gives a model that still generalizes fairly well out of sample. Here you are essentially facing the well-known bias-variance trade-off: including more variables may reduce the omitted variable bias, but it also raises the estimation variance. The bias-variance trade-off has been discussed extensively before; you may review older posts for that. (So in the end, the general answer on the optimal choice of $K$ is: it depends...)
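
One way to explore this trade-off empirically is the kind of back-test described in the question. The following is only a rough sketch under stated assumptions: it uses base R to fit a VAR($p$) with intercept by OLS as a stand-in for what the "vars" package estimates, the helper names (`fit_var_ols`, `forecast_one`, `backtest_rmse`) are hypothetical, and the data are simulated white noise used purely as a placeholder for real series.

```r
# Hypothetical helpers for illustration; not part of the "vars" package.

# Fit a VAR(p) with intercept by OLS (every equation shares the same regressors).
fit_var_ols <- function(y, p) {
  y <- as.matrix(y)
  K <- ncol(y)
  emb <- embed(y, p + 1)                           # each row: y_t, y_{t-1}, ..., y_{t-p}
  Y <- emb[, seq_len(K), drop = FALSE]             # left-hand side: y_t
  X <- cbind(1, emb[, -seq_len(K), drop = FALSE])  # intercept + K * p lagged values
  list(B = qr.solve(X, Y), p = p)                  # least-squares coefficients
}

# One-step-ahead forecast from the last p rows of y.
forecast_one <- function(fit, y) {
  y <- as.matrix(y)
  n <- nrow(y)
  lags <- as.vector(t(y[n:(n - fit$p + 1), , drop = FALSE]))  # y_n, y_{n-1}, ...
  drop(c(1, lags) %*% fit$B)
}

# Expanding-window back-test: RMSE of one-step forecasts for the first column.
backtest_rmse <- function(y, p, n_test) {
  y <- as.matrix(y)
  n <- nrow(y)
  errs <- vapply(seq(n - n_test, n - 1), function(t_end) {
    train <- y[seq_len(t_end), , drop = FALSE]
    forecast_one(fit_var_ols(train, p), train)[1] - y[t_end + 1, 1]
  }, numeric(1))
  sqrt(mean(errs^2))
}

set.seed(1)
y <- matrix(rnorm(120 * 8), nrow = 120)  # placeholder data: 120 observations, 8 series

# Column 1 is the target; compare systems of increasing size K.
K_values <- c(1, 2, 4, 8)
rmse_by_K <- sapply(K_values, function(K) {
  backtest_rmse(y[, seq_len(K), drop = FALSE], p = 6, n_test = 24)
})
names(rmse_by_K) <- paste0("K=", K_values)
print(rmse_by_K)
```

Because the simulated series are independent noise, the extra variables carry no signal here, so any differences in RMSE across $K$ come from estimation error alone. With real data you would replace `y` with your own series and compare the candidate sets of variables in the same way.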
