Specifying the Input Variables' ARIMA Models
The ARIMA Procedure uses the results of the first pair(s) of identify
and estimate
statements (i.e., the identify
and estimate
statements for the input variables) to create models to forecast the values of the input variable(s) (also called exogenous variable(s)) after the last point in time at which each of those input variables is observed. In other words, those statements specify the models that are used whenever values for the input variables are needed for periods not yet observed.
Thus, the model for VariableY
is specified as
identify var=VariableY(PeriodsOfDifferencing);
estimate p=OrderOfAutoregression q=OrderOfMovingAverage;
where VariableY
is modeled as $ARIMA(p,d,q)$ with $p$ = OrderOfAutoregression
, $d$ = the order of differencing (determined from PeriodsOfDifferencing
), and $q$ = OrderOfMovingAverage
.
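As a concrete instance of the template above, here is a minimal sketch of an input-series model. The data set name work.mydata and the specific orders (differencing at lags 1 and 12, p=1, q=1) are assumptions chosen purely for illustration; they do not come from the question.

```sas
/* Sketch only: work.mydata and the chosen orders are hypothetical. */
proc arima data=work.mydata;
   /* Difference VariableY at lag 1 and then at lag 12: (1-B)(1-B**12) */
   identify var=VariableY(1,12);
   /* Fit one autoregressive and one moving-average term to the differenced series */
   estimate p=1 q=1;
run;
quit;
```

Listing more than one period in the parentheses applies each difference in turn, which is how a combined regular and seasonal difference would typically be requested.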
Specifying Differencing for the Main and Input Series in the ARIMAX Model
The order(s) of differencing to apply to the input variables are specified in the crosscorr
option; for modeling VariableX
with inputs VariableY
and VariableZ
, the SAS code is:
identify var=VariableX(DifferencingX) crosscorr=( VariableY(DifferencingY) VariableZ(DifferencingZ) );
where DifferencingX
, DifferencingY
, and DifferencingZ
are the period(s) of differencing for VariableX
, VariableY
, and VariableZ
, respectively.
Specifying the Order of Autoregression and the Order of Moving Average for the Main and Input Series in the ARIMAX Model
The number of input variable lags to include in the model is specified in the transfer function (in the input
option). The beginning of the estimate
line sets the orders of autoregression and moving average for the main series (i.e., the series for which a model or forecasts are ultimately being sought):
estimate p=AutoregressionX q=MovingAverageX
where VariableX
is modeled as $ARIMAX(p,d,q,b)$ with $p$ = AutoregressionX
and $q$ = MovingAverageX
.
The input
option in the same estimate
statement specifies the transfer function for each input series, that is, the numerator and denominator lag terms that relate the input series to the main series. The numerator factors for a transfer function for an input series are like the MA part of the ARMA model for the noise series. The denominator factors for a transfer function for an input series are like the AR part of the ARMA model for the noise series. (All examples below use a single input series VariableY
instead of showing both VariableY
and VariableZ
.)
When specified without any numerator or denominator terms, the input variable is treated as a pure regression term (i.e., the value of the input variable in the current period is used without any lags, whether it is forecast by the input variable's ARIMA model or already present as an observed value in the input series): estimate
...input=( VariableY );
.
Numerator terms are represented in parentheses before the input variable. estimate
...input=( (1 2 3) VariableY );
produces a regression on VariableY
, LAG(VariableY)
, LAG2(VariableY)
, and LAG3(VariableY)
.
Denominator terms are represented in parentheses after a slash and before the input variable. estimate
...input=( / (1) VariableY );
estimates the effect of VariableY
as an infinite distributed lag model with exponentially declining weights.
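The exponentially declining weights can be seen from a geometric-series expansion. The sketch below assumes the usual convention that a single denominator term (1) corresponds to the polynomial $\delta(B) = 1 - \delta_1 B$; the symbols $\omega_0$ and $\delta_1$ are labels for the estimated parameters, introduced here for illustration.

$$\frac{\omega_0}{1 - \delta_1 B}\,\text{VariableY}_t = \omega_0 \sum_{j=0}^{\infty} \delta_1^{\,j}\,\text{VariableY}_{t-j}, \qquad |\delta_1| < 1$$

The weight on the value from $j$ periods back is $\omega_0 \delta_1^{\,j}$, which shrinks geometrically as $j$ grows.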
Initial shift is represented before a dollar sign; estimate
...input=( k $ (
$\omega$-lags ) / (
$\delta$-lags ) VariableY );
represents the form $B^k \cdot \left(\frac{\omega (B)}{\delta (B)}\right) \cdot \text{VariableY}_t$. The value of k
delays the whole transfer function by k periods, which is equivalent to adding k to the exponent of $B$ in every numerator term. To include only lagged values of the input variable, without the un-shifted (i.e., un-lagged or pure regression) term, use this shift operator rather than numerator terms in parentheses alone. For example, to set 6-, 12-, and 18-month shifts in the input series VariableY
without the un-shifted term, the statement would be estimate
...input=( 6 $ (6 12) VariableY );
(this results in shifts of 6, 6 + 6 (i.e., 12), and 6 + 12 (i.e., 18)).
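Written out term by term, and assuming the sign convention in which the numerator list (6 12) stands for $\omega(B) = \omega_0 - \omega_1 B^{6} - \omega_2 B^{12}$ (the parameter names $\omega_0$, $\omega_1$, $\omega_2$ are labels introduced for illustration), the shifted transfer function expands as:

$$B^{6}\left(\omega_0 - \omega_1 B^{6} - \omega_2 B^{12}\right)\text{VariableY}_t = \omega_0\,\text{VariableY}_{t-6} - \omega_1\,\text{VariableY}_{t-12} - \omega_2\,\text{VariableY}_{t-18}$$

No term in $\text{VariableY}_t$ itself appears, which is the point of using the shift.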
Summary
The first pair(s) of identify
and estimate
statements are used to prepare any necessary forecasted values for the input variable(s).
The last pair of identify
and estimate
statements run the actual ARIMAX model, and use forecasted values for the input variable(s) (generated from the first pair(s) of identify
and estimate
statements) when necessary.
The relationship between the main variable and the input variable(s) is specified in the crosscorr
option of the identify
statement and the input
option of the estimate
statement. The relationship between the main variable and the input variable(s) can be defined as a run-of-the-mill regression relationship; or it can be defined with differencing, AR term(s), and/or MA term(s).
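Putting the pieces of this summary together, a complete run might look like the sketch below. The data set work.mydata, the date variable, the output data set work.forecasts, the monthly interval, the forecast horizon, and all differencing, AR, and MA orders are assumptions made for illustration only.

```sas
/* Sketch only: data sets, the date variable, orders, and horizon are hypothetical. */
proc arima data=work.mydata;
   /* First pair: model the input series so its future values can be forecast */
   identify var=VariableY(1);
   estimate p=1 q=1;

   /* Last pair: model the main series with VariableY as an input */
   identify var=VariableX(1) crosscorr=( VariableY(1) );
   estimate p=1 q=1 input=( 6 $ (6 12) VariableY );

   /* Forecast 12 months ahead; future VariableY values come from the first pair */
   forecast lead=12 id=date interval=month out=work.forecasts;
run;
quit;
```

With several input series, each one would get its own identify and estimate pair before the final pair, and each would be listed in both the crosscorr and input options.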
Attribution
Although this answer is my own, I was able to come up with the answer based on substantial help (and some quotations) from the official SAS documentation ("The ARIMA Procedure: Rational Transfer Functions and Distributed Lag Models", "The ARIMA Procedure: Specifying Inputs and Transfer Functions", "The ARIMA Procedure: Input Variables and Regression with ARMA Errors", and "The ARIMA Procedure: Differencing"), and from direction found in this answer and comments by IrishStat.
Best Answer
You are fortunate to ask this question on this site because IrishStat has been automating ARIMA models for over 30 years (sorry to give away your age, Dave). Also, Rob Hyndman wrote the auto.arima procedure in R. I have a connection as well: I took my first time series course, a short course by Box and Tiao, at Carnegie-Mellon University in 1974 (giving away my age now). Also, when I was the Chief of Statistical Research at Risk Data Corporation (in the early 1990s), I hired Terry Woodfield, who authored the ETS software at the SAS Institute just before we were able to draw him away. I am sure PROC ARIMA has gone through many changes, but I am sure that if you make contact with Terry he could probably help you.
Personally, the way I learned it from Box, Tiao, and Pack, ARIMA modeling is an iterative process that should be gone through manually, with the user making decisions at each stage. That is not to say that good results cannot be obtained by automated procedures. In fact, I think that Dave Reilly (IrishStat), along with his son Tom, has so much experience doing this that they will contend that they could produce a better model with their algorithm than I can manually, and they may be right. But my point is that for a time series specialist, taking that approach removes some of the steps that help him really get to understand the characteristics of the series.
One thing that always troubled me in the early years was that the Box-Jenkins methodology was revered a little too much. Estimation is by conditional least squares, and so the normality of the residuals is important and often overlooked (a buried secret). In the late 1970s I worked on the problem of outliers in time series, and Darryl Downing and I published a paper on the topic in JASA in 1982.
Since then, others such as Doug Martin, George Tiao, and Ruey Tsay have made much bigger contributions. IrishStat is aware of that literature and has incorporated their ideas in his software. That is why he emphasizes checking for level shifts and outliers before fixating on an ARIMA model. That aspect of his software makes it somewhat unique. It is different from auto.arima and SAS/ETS. So keep that in mind in your search for other automated procedures using SAS.
I hope you appreciate this as an answer even though it does not directly answer questions 1 or 2. I am sure you can find Terry Woodfield on the internet or go directly to the SAS Institute with your questions which are very specific to SAS and really require someone with intimate knowledge of the SAS algorithms. I don't think you will find anyone on this site who could give you better help.