Solved – The ARIMA(1,1,2) model for log(dataset) seems insignificant compared with the ARIMA(1,0,2) model for diff(log(dataset))

arima, stationarity, time series, white noise

I am trying to fit an ARIMA model to my dataset. I did the following steps:

Here are plots of my dataset and of the series after log transformation (ds_log).

To achieve stationarity, I transformed the data by taking logs and then differencing. I tested for stationarity with the Ljung-Box and KPSS tests. We can see that the fitted model is quite good for diff(ds_log). The black line is the original data and the red line is the fitted values.

> Box.test(diff(ds_log), type = "Ljung-Box")
    Box-Ljung test

data:  diff(ds_log)
X-squared = 48.939, df = 1, p-value = 2.64e-12

> kpss.test(diff(ds_log)) # p-value > 0.05, so H0 (level stationarity) is not rejected

    KPSS Test for Level Stationarity

data:  diff(ds_log)
KPSS Level = 0.072684, Truncation lag parameter = 3, p-value = 0.1

I chose an ARIMA(1,0,2) model for diff(ds_log) based on the ACF and PACF. Here is the plot of diff(ds_log) with the fitted values overlaid.
```
fit <- Arima(diff(ds_log), order=c(1,0,2))
plot(diff(ds_log))
lines(fitted(fit), col="red", lty=2)
```
Residual diagnostics plot.

Testing the residuals for white noise with Ljung-Box:

    > Box.test(fit$residuals, lag=20, type = "Ljung-Box")

        Box-Ljung test

    data:  fit$residuals
    X-squared = 15.252, df = 20, p-value = 0.7618

However, when I fitted an ARIMA(1,1,2) model to ds_log, the fit does not look good, even though the residuals are white noise.

    > fit2 <- Arima(ds_log, order=c(1,1,2))
    > plot(ds_log)
    > lines(fitted(fit2), col="red",lty=2)

We can see that the fitted values are delayed compared to the original values. Here are the residual diagnostics and the Ljung-Box test.

    > tsdiag(fit2)
    > Box.test(fit2$residuals, lag=20, type = "Ljung-Box")

        Box-Ljung test

    data:  fit2$residuals
    X-squared = 14.98, df = 20, p-value = 0.7776

Why does the ARIMA(1,1,2) model on ds_log seem to fit poorly compared with the ARIMA(1,0,2) model on diff(ds_log)?
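One thing worth checking before comparing the two plots is that they are drawn on different scales. The sketch below uses simulated data (not the asker's series; `y` only plays the role of ds_log, and the ARMA coefficients are arbitrary assumptions) and base R's `arima()` instead of `forecast::Arima()`. It shows that the one-step fitted values on the level scale are the previous observation plus the predicted change, so they can look delayed even when the underlying model is the same one that looks fine on the differenced scale.

```r
# Simulated stand-in for ds_log: an integrated ARMA(1,2) series (assumed parameters)
set.seed(1)
n  <- 300
y  <- ts(cumsum(arima.sim(model = list(ar = 0.5, ma = c(0.3, 0.2)), n = n)))
dy <- diff(y)

# The same model written two ways: ARMA(1,2) on the differences,
# and ARIMA(1,1,2) on the levels (neither includes a mean/drift term here)
fit_diff  <- arima(dy, order = c(1, 0, 2), include.mean = FALSE, method = "ML")
fit_level <- arima(y,  order = c(1, 1, 2), method = "ML")

# Coefficient estimates should be close to each other
print(round(rbind(diff = coef(fit_diff), level = coef(fit_level)), 3))

# One-step fitted values are "observed minus residual"
fitted_diff <- dy - residuals(fit_diff)

# Map the difference-scale fit back to the level scale:
# fitted level = previous level + predicted change
y_prev           <- y[-length(y)]
fitted_from_diff <- y_prev + as.numeric(fitted_diff)

# Overlay on the level scale: the red line hugs the previous observation
plot(y)
lines(ts(fitted_from_diff, start = 2), col = "red", lty = 2)
```

On the level scale the fitted line is dominated by the lagged observation, which is why the residual diagnostics of the two fits can agree while the level plot looks shifted.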

Best Answer

ARIMA model identification/estimation can be seriously flawed by the presence of deterministic structure in the data. Deterministic structure can include pulses, level/step shifts, seasonal pulses and/or time trends (see "Statistics for time series trend in R"). Your residuals suggest that a hybrid approach might be needed. Variance heterogeneity can often be dealt with using GLS (weighted estimation) rather than a power transform; see "When (and why) should you take the log of a distribution (of numbers)?" for a discussion of this. I suggest that you post your data and I will try to help further.

EDITED AFTER RECEIPT OF DATA:

I took your 211 monthly values and used AUTOBOX (a piece of software that I have helped to develop) and requested a totally automatic analysis (complete with step-by-step details). For the original data (before you tortured it with differencing, which injects structure, see "What are the consequences of not meeting the assumptions for the residuals of ARIMA model?", AND with unwarranted logs), the ACF suggested a possible stationary ARMA structure without the need for differencing. Note that level/step shifts are often misrepresented by taking differences when a simple de-meaning might be more appropriate. Unnecessary differencing injects structure into the residuals, which then necessitates ARMA structure to remedy/reverse the incorrect differencing. See "Variance of difference of $x_{i,t}$ and $x_{i,t+1}$" to examine the impact of differencing a white noise series.

To illustrate this, consider the ACF of the first differences here, which reflects the unfortunate/unintended/incorrect injection of structure.
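The effect of differencing a series that did not need it can be reproduced with a small simulation. This is a minimal sketch on synthetic Gaussian white noise (the sample size and seed are arbitrary assumptions, and this is not the data from the question): differencing white noise yields a non-invertible MA(1) process whose variance is twice the original and whose lag-1 autocorrelation is -0.5 in theory.

```r
# White noise that needs no differencing at all
set.seed(42)
e  <- rnorm(5000)
de <- diff(e)          # unnecessary first difference

# Variance roughly doubles: Var(e_t - e_{t-1}) = 2 * sigma^2
var_ratio <- var(de) / var(e)

# Lag-1 autocorrelation of the differenced series (theory: -0.5)
r1_diff <- acf(de, plot = FALSE)$acf[2]

cat("variance ratio:", round(var_ratio, 3), "\n")
cat("lag-1 ACF after differencing:", round(r1_diff, 3), "\n")

# Ljung-Box on the original vs. the over-differenced series
lb_orig <- Box.test(e,  lag = 10, type = "Ljung-Box")
lb_diff <- Box.test(de, lag = 10, type = "Ljung-Box")
print(lb_orig)
print(lb_diff)
```

The differenced series now shows strong autocorrelation that an MA term would have to absorb, which is the "injected structure" described above.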

The model, containing three step shifts and an ARMA structure reflecting one-period, three-period and annual structure (plus a seasonal pulse at period 11 which started 3 years ago; this phenomenon should be investigated and confirmed), is here and here, with the following statistics. A number of pulses were found, suggesting unusual activity, and are clearly presented here. They should be investigated for possible causes arising from unspecified variables.

The plot of the residuals is here, with an ACF suggesting approximate sufficiency. The forecast plot is here and the Actual/Fit and Forecast plot is here.

Notice that I added a spurious value at time period 212 just to show how this user-caused anomaly was effectively discarded thus suggesting robustness of approach.

The whole approach that you followed, taking two unnecessary drugs/transformations and using inadequate analytical tools, has created a powerful example of what can and did go wrong. The first step in building an ARIMA model is to examine the ACF/PACF of the original series and, when conducting a non-automatic analysis, to review a plot of the original data.

You are not alone in trying to form useful models with complicated data and basic tools while attempting to follow a script that might have been useful for a simple textbook example. The mistakes you made are not unusual at all. Assuming that you need to difference (a form of transform) and take logs (another form of transform) can lead to the "muddle" you found yourself in, as you were "hoist by your own petard", so to speak, i.e. you fell into your own trap.

Finally we often see quarterly effects when dealing with monthly data particularly in the drug business due to the way they normally do business.

In summary, your analysis exhibited two kinds of statistical errors, viz. commission and omission, and motivated my response, which is intended to teach good practice.

1) Errors of commission

a) unnecessary differencing
b) unnecessary power transformation

2) Errors of omission

c) no treatment of anomalies (one-time pulses, some very large and some not so large, but all significant)
d) no recognition of level shifts in the data
e) no identification of the month 11 effect arising in the last three years
f) no identification of the quarterly effect

You asked for details/criteria regarding strategy for Intervention Detection:

The criteria used are based on the seminal work of I. Chang, G. Tiao and, importantly, R. Tsay. The question "time-series-ls-ao-tc-using-tsoutliers-package-in-r-how" discusses the Tsay procedure. This discussion might also help: "How to interpret and do forecasting using tsoutliers package and auto.arima". The major problem with the tsoutliers package is that it requires you to pre-specify an ARIMA model rather than integrating ARIMA model identification, outlier identification, variance transformation identification, time-varying parameter identification and dynamic structure (PDL) for user-suggested causal series, whereas AUTOBOX (available in R) does all of this.

Related Solutions

Solved – Ljung-Box Statistics for ARIMA residuals in R: confusing test results

You've interpreted the test wrong. If the p-value is greater than 0.05, the test does not reject the null hypothesis of no autocorrelation, so the residuals are consistent with independence, which is what we want if the model is correct. If you simulate a white noise time series using the code below and apply the same test to it, the p-value will usually be greater than 0.05.

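# Simulate 120 observations of Gaussian white noise (an ARMA(0,0) process)
w <- arima.sim(model = list(order = c(0, 0, 0)), n = 120)
plot(w)
# A large p-value here means no evidence of autocorrelation
Box.test(w, type = "Ljung-Box")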
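When the same test is applied to residuals from a fitted ARMA model rather than to raw data, `Box.test()` accepts a `fitdf` argument that subtracts the number of estimated ARMA coefficients from the degrees of freedom. The following is a minimal sketch on a simulated AR(1) series (the coefficient 0.7, the sample size and the lag are arbitrary assumptions) contrasting the test on the raw series with the test on the residuals.

```r
# Simulated AR(1) series: clearly autocorrelated
set.seed(123)
x <- arima.sim(model = list(ar = 0.7), n = 200)

# Ljung-Box on the raw series: small p-value, autocorrelation is present
lb_raw <- Box.test(x, lag = 20, type = "Ljung-Box")

# Fit the AR(1) model and test its residuals instead
fit <- arima(x, order = c(1, 0, 0))

# fitdf = 1 because one ARMA coefficient (ar1) was estimated,
# so the reference distribution has 20 - 1 = 19 degrees of freedom
lb_res <- Box.test(residuals(fit), lag = 20, type = "Ljung-Box", fitdf = 1)

print(lb_raw)
print(lb_res)
```

A large p-value for `lb_res` would indicate that no autocorrelation is left in the residuals, which is the desired outcome for an adequate model.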

Solved – Ljung-Box always significant for ARIMA models – what now

A note on terminology: commonly we fit a model to the data rather than fit the data to a model.

I can do step 1, but don't know how to relate that to step 2. Am I using the remainder from stl analysis for ARIMA modeling? If not, what's the point of step 1?

From STL you obtain three components: trend, seasonal and remainder. You could remove the seasonal component and use the sum of trend and remainder for further modelling with ARIMA.

But I can't get past the diagnostics. My Ljung-Box values are ALWAYS significant for ALL lags. Okay, so that means my residuals are correlated (I think). And since I want to use the residuals for cross-correlation, I assume that's bad.

Yes, having significant autocorrelations for ALL lags is clearly a problem. I would generally agree with the comment by @Glen_b, but in a case where all lags are significant the problem seems hard to deny. Curiously, the ACF plot does not immediately suggest that the autocorrelations are a really big problem (only a few lags stick outside the confidence interval by much); this only becomes evident from the Ljung-Box test. I would not stop there and I would not accept a model with such a terrible Ljung-Box picture. Instead, I would look for other models.

One caveat: if you use STL and remove the seasonal component before estimating ARIMA models on trend+remainder, you should not allow for a seasonal component in the ARIMA model (which would make it a SARIMA model); use the option seasonal=FALSE in the function auto.arima. Perhaps making this change will help you find better models.
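The STL-then-ARIMA workflow can be sketched as follows. This is an illustration under stated assumptions: it uses R's built-in monthly `co2` dataset rather than the data from the question, base `arima()` with a hand-picked non-seasonal order (1,1,1) in place of `auto.arima(..., seasonal = FALSE)` so that no extra package is needed, and an arbitrary lag for the residual check.

```r
# Decompose a seasonal series into trend, seasonal and remainder
dec   <- stl(co2, s.window = "periodic")
comps <- dec$time.series

# Remove the seasonal component; what is left is trend + remainder
deseason <- co2 - comps[, "seasonal"]

# Fit a NON-seasonal ARIMA to the deseasonalised series
# (order chosen by hand here purely for illustration)
fit_ns <- arima(deseason, order = c(1, 1, 1), method = "ML")
print(fit_ns)

# Residual check; fitdf = 2 for the two estimated ARMA coefficients
lb <- Box.test(residuals(fit_ns), lag = 24, type = "Ljung-Box", fitdf = 2)
print(lb)
```

If the Ljung-Box result is still poor, that is a cue to try other orders or a different treatment of the seasonality, not evidence that the residuals can be used as they are.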

Note also that after taking the 24-hour difference, the ACF and PACF still have significant 24-hour lags. This may indicate that taking the 24-hour difference was not such a good idea. Normally you would expect the lag at which you have differenced the data to not have significant ACF or PACF value.
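One way a significant autocorrelation can remain at the differencing lag is when the seasonal pattern is deterministic, so that seasonal differencing over-differences the noise. The sketch below compares two synthetic hourly series (both are assumptions for illustration, not the data from the question): a fixed 24-hour cycle plus white noise, and a seasonal random walk for which the 24-hour difference is appropriate.

```r
set.seed(7)
n <- 24 * 100                      # 100 days of hourly observations
e <- rnorm(n)

# Case 1: fixed daily pattern plus white noise (no seasonal unit root)
season <- rep(sin(2 * pi * (1:24) / 24), times = 100)
x_det  <- 3 * season + e

# Case 2: seasonal random walk, x_t = x_{t-24} + e_t
x_srw <- as.numeric(stats::filter(e, filter = c(rep(0, 23), 1),
                                  method = "recursive"))

# Autocorrelation at lag 24 after taking the 24-hour difference
acf_at_24 <- function(x) {
  acf(diff(x, lag = 24), lag.max = 24, plot = FALSE)$acf[25]
}

r_det <- acf_at_24(x_det)   # theory: -0.5, a sign of over-differencing
r_srw <- acf_at_24(x_srw)   # theory: 0, differencing was appropriate

cat("lag-24 ACF, deterministic season:", round(r_det, 3), "\n")
cat("lag-24 ACF, seasonal random walk:", round(r_srw, 3), "\n")
```

A strongly negative value at the seasonal lag after differencing, as in the first case, suggests modelling the season in another way (for example by removing it with STL) instead of differencing it.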

Does this mean my time series doesn't fit an ARIMA model?

The model you showed us indeed does not seem to fit the data well as evidenced by the poor Ljung-Box statistics. If I were you, I would try some other model instead.
