Solved – How to ensure PROC ARIMA is performing the correct parameterization of input variables

dynamic-regression, sas, time series

I'm trying to forecast using ARIMAX with two exogenous (input) variables. I'm using PROC ARIMA, but I can't figure out from the SAS documentation whether my code is producing the parameterization I want.

I want to extend an ARI(12,1) model so that it also includes the last 12 terms of each of the two exogenous variables in my forecast. So, using VariableX with the two exogenous variables VariableY and VariableZ, my best attempt at the code is:

proc arima;
  identify var=VariableY(1) nlag=24;
  estimate p=12;
  identify var=VariableZ(1) nlag=24;
  estimate p=12;
  identify var=VariableX(1) nlag=24 crosscorr=( VariableY(1) VariableZ(1) );
  estimate p=12 input=( VariableY VariableZ );
  forecast id=MonthNumber interval=month alpha=.05 lead=24;
  run;
quit;

The documentation leads me to believe the first four lines of the procedure are required for setting up the forecast at the end. But when I run the procedure, the output appears to show a forecast using only the last term of each of the two exogenous variables.

In summary, I'd like to be sure where each of the following are controlled:

  • The $p$ of $AR(p)$, and similarly for each of the exogenous variables
  • The $d$ of $I(d)$, and similarly for each of the exogenous variables
  • The $q$ of $MA(q)$, and similarly for each of the exogenous variables

Best Answer

Specifying the Input Variables' ARIMA Models

The ARIMA Procedure uses the results of the first pair(s) of identify and estimate statements (i.e., the identify and estimate statements for the input variables) to create models that forecast the values of the input variable(s) (also called exogenous variable(s)) beyond the last point in time at which each input variable is observed. In other words, those statements specify the models that are used whenever values for the input variables are needed for periods not yet observed.

Thus, the model for VariableY is specified as

identify var=VariableY(PeriodsOfDifferencing);
estimate p=OrderOfAutoregression q=OrderOfMovingAverage;

where VariableY is modeled as $ARIMA(p,d,q)$ with $p$ = OrderOfAutoregression, $d$ = the order of differencing (determined from PeriodsOfDifferencing), and $q$ = OrderOfMovingAverage.
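As a sketch of how the differencing list in parentheses maps to $d$, here are a few variations on the question's VariableY. The specific orders are illustrative assumptions rather than recommendations, and, as in the question's code, no data= option is given, so the most recently created data set is used:

```sas
proc arima;
  /* (1): a single lag-1 difference, so d = 1 */
  identify var=VariableY(1);
  /* p=12 fits AR lags 1 through 12; with no q= there are no MA terms,
     so this is ARIMA(12,1,0) */
  estimate p=12;

  /* (1,1): differenced twice at lag 1, so d = 2 */
  identify var=VariableY(1,1);

  /* (1,12): a lag-1 difference followed by a lag-12 (seasonal) difference */
  identify var=VariableY(1,12);
run;
quit;
```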

Specifying Differencing for the Main and Input Series in the ARIMAX Model

The order(s) of differencing to apply to the input variables are specified in the crosscorr option; for modeling VariableX with inputs VariableY and VariableZ, the SAS code is:

identify var=VariableX(DifferencingX) crosscorr=( VariableY(DifferencingY) VariableZ(DifferencingZ) );

where DifferencingX, DifferencingY, and DifferencingZ are the period(s) of differencing for VariableX, VariableY, and VariableZ, respectively.

Specifying the Order of Autoregression and the Order of Moving Average for the Main and Input Series in the ARIMAX Model

The number of input variable lags to include in the model is specified in the transfer function (in the input option). The beginning of the estimate line sets the orders of autoregression and moving average for the main series (i.e., the series for which a model or forecasts are ultimately being sought):

estimate p=AutoregressionX q=MovingAverageX

where VariableX is modeled as $ARIMAX(p,d,q,b)$ with $p$ = AutoregressionX and $q$ = MovingAverageX.
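A minimal sketch of how the p= and q= values select lags for the main series. The particular orders are assumptions chosen only to show the syntax:

```sas
proc arima;
  identify var=VariableX(1) nlag=24;

  /* p=12 requests AR lags 1 through 12; q=1 adds one MA term */
  estimate p=12 q=1;

  /* p=(12) requests only the lag-12 AR term */
  estimate p=(12);

  /* p=(1)(12) requests the product of a lag-1 factor and a lag-12 factor */
  estimate p=(1)(12);
run;
quit;
```

A plain integer therefore means "all lags up to this order", while a parenthesized list names individual lags.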

The input option in the same estimate statement specifies the transfer function for each input series, that is, which lags of the input enter the model for VariableX. The numerator factors of a transfer function are analogous to the MA part of the ARMA model for the noise series, and the denominator factors are analogous to the AR part. (The examples below use a single input series, VariableY, instead of showing both VariableY and VariableZ.)

When specified without any numerator or denominator terms, the input variable is treated as a pure regression term (i.e., the value of the input variable in the current period is used without any lags, whether it is forecast by the input variable's ARIMA model or already present as an observed value in the input series): estimate...input=( VariableY );.

Numerator terms are represented in parentheses before the input variable. estimate...input=( (1 2 3) VariableY ); produces a regression on VariableY, LAG(VariableY), LAG2(VariableY), and LAG3(VariableY).
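Written out, and assuming the sign convention of the SAS documentation (numerator coefficients after the first enter with a minus sign), the numerator example above corresponds to:

$$\left(\omega_0 - \omega_1 B - \omega_2 B^2 - \omega_3 B^3\right)\text{VariableY}_t = \omega_0\,\text{VariableY}_t - \omega_1\,\text{VariableY}_{t-1} - \omega_2\,\text{VariableY}_{t-2} - \omega_3\,\text{VariableY}_{t-3}$$

Four coefficients are estimated: the lag-0 term is always present, and each number listed in the parentheses adds one lag.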

Denominator terms are represented in parentheses after a slash and before the input variable. estimate...input=( / (1) VariableY ); estimates the effect of VariableY as an infinite distributed lag model with exponentially declining weights.
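To see where the exponentially declining weights come from, the single denominator term can be expanded as a geometric series (valid when $|\delta_1| < 1$):

$$\frac{\omega_0}{1 - \delta_1 B}\,\text{VariableY}_t = \omega_0 \sum_{j=0}^{\infty} \delta_1^{\,j}\,\text{VariableY}_{t-j}$$

Only two parameters, $\omega_0$ and $\delta_1$, are estimated, yet every past value of VariableY receives a weight.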

Initial shift is represented before a dollar sign; estimate...input=( k $ ( $\omega$-lags ) / ( $\delta$-lags ) VariableY ); represents the form $B^k \cdot \left(\frac{\omega (B)}{\delta (B)}\right) \cdot \text{VariableY}_t$. Because $B^k$ multiplies the whole transfer function, the shift k is effectively added to every lag the transfer function produces. To lag the input variable without including the un-shifted (i.e., un-lagged or pure regression) term, use this shift rather than numerator terms alone. For example, to set a 6, 12, and 18 month shift in the input series VariableY without the un-shifted term, the statement would be estimate...input=( 6 $ (6 12) VariableY ); (this results in shifts of 6, 6 + 6 (i.e., 12), and 6 + 12 (i.e., 18)).
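The 6, 12, and 18 month example can be checked by multiplying the shift through the numerator (again assuming the SAS sign convention for numerator coefficients):

$$B^{6}\left(\omega_0 - \omega_1 B^{6} - \omega_2 B^{12}\right)\text{VariableY}_t = \omega_0\,\text{VariableY}_{t-6} - \omega_1\,\text{VariableY}_{t-12} - \omega_2\,\text{VariableY}_{t-18}$$

No $\text{VariableY}_t$ term appears on the right-hand side, which is why the shift is needed to drop the un-lagged term.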

Summary

The first pair(s) of identify and estimate statements are used to prepare any necessary forecasted values for the input variable(s).

The last pair of identify and estimate statements run the actual ARIMAX model, and use forecasted values for the input variable(s) (generated from the first pair(s) of identify and estimate statements) when necessary.

The relationship between the main variable and the input variable(s) is specified in the crosscorr option of the identify statement and the input option of the estimate statement. The relationship between the main variable and the input variable(s) can be defined as a run-of-the-mill regression relationship; or it can be defined with differencing, AR term(s), and/or MA term(s).
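Putting the pieces together for the model in the question, one possible sketch follows. It assumes that "the last 12 terms" means lags 1 through 12 of each input (excluding the current period), and that the two transfer functions can simply be listed one after the other inside input=( ); both points are assumptions to check against the procedure's output:

```sas
proc arima;
  /* Models used only to forecast the inputs beyond their last observation */
  identify var=VariableY(1) nlag=24;
  estimate p=12;
  identify var=VariableZ(1) nlag=24;
  estimate p=12;

  /* Main series: differencing for the inputs is set in crosscorr= */
  identify var=VariableX(1) nlag=24
           crosscorr=( VariableY(1) VariableZ(1) );

  /* 1 $ (1 ... 11): shift of 1 plus numerator lags 0..11 gives lags 1..12.
     Dropping the "1 $" would give lags 0..11 instead. */
  estimate p=12
           input=( 1 $ (1 2 3 4 5 6 7 8 9 10 11) VariableY
                   1 $ (1 2 3 4 5 6 7 8 9 10 11) VariableZ );

  forecast id=MonthNumber interval=month alpha=.05 lead=24;
run;
quit;
```

The only change from the question's code is the input= option: listing the variable names alone requests a pure regression on the current value, which matches the single-term behavior described in the question.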

Attribution

Although this answer is my own, I was able to come up with the answer based on substantial help (and some quotations) from the official SAS documentation ("The ARIMA Procedure: Rational Transfer Functions and Distributed Lag Models", "The ARIMA Procedure: Specifying Inputs and Transfer Functions", "The ARIMA Procedure: Input Variables and Regression with ARMA Errors", and "The ARIMA Procedure: Differencing"), and from direction found in this answer and comments by IrishStat.
