Solved – The ARIMA(1,1,2) model for log(dataset) seems to fit poorly compared with the ARIMA(1,0,2) model for diff(log(dataset))

arima, stationarity, time series, white noise

I am trying to fit ARIMA model for my dataset. I did the following steps:

  1. Here is my dataset plot and log plot after log transformation (ds_log).
    [image]

  2. To achieve stationarity, I applied a log transform and first differencing. I tested stationarity with the Ljung-Box and KPSS tests. We can see that the fitted model is quite good for diff(ds_log). The black line is the original data; the red line is the fitted values.

    > Box.test(diff(ds_log), type = "Ljung-Box")
        Box-Ljung test
    
    data:  diff(ds_log)
    X-squared = 48.939, df = 1, p-value = 2.64e-12
    
    > kpss.test(diff(ds_log)) # p-value > 0.05, so H0 (level stationarity) is not rejected
    
        KPSS Test for Level Stationarity
    
    data:  diff(ds_log)
    KPSS Level = 0.072684, Truncation lag parameter = 3, p-value = 0.1
    
  3. I chose an ARIMA(1,0,2) model for diff(ds_log) based on the ACF and PACF. Here is the plot of diff(ds_log) with the fitted values overlaid.

    library(forecast)  # provides Arima()
    fit <- Arima(diff(ds_log), order=c(1,0,2))
    plot(diff(ds_log))
    lines(fitted(fit), col="red", lty=2)
    

    [image]

  4. Residual diagnostics plot: [image]

White-noise test of the residuals with Ljung-Box:

    > Box.test(fit$residuals, lag=20, type = "Ljung-Box")

        Box-Ljung test

    data:  fit$residuals
    X-squared = 15.252, df = 20, p-value = 0.7618

However, when I fitted an ARIMA(1,1,2) model to ds_log, the fit does not look good, even though the residuals are white noise.

    > fit2 <- Arima(ds_log, order=c(1,1,2))
    > plot(ds_log)
    > lines(fitted(fit2), col="red",lty=2)

[image]

We can see that the fitted values are delayed compared to the original values. Here are the residual diagnostics and the Ljung-Box test.

    > tsdiag(fit2)
    > Box.test(fit2$residuals, lag=20, type = "Ljung-Box")

        Box-Ljung test

    data:  fit2$residuals
    X-squared = 14.98, df = 20, p-value = 0.7776

fit2 diag

Why does the ARIMA(1,1,2) model on ds_log seem to fit poorly compared with the ARIMA(1,0,2) model on diff(ds_log)?

Best Answer

ARIMA model identification/estimation can be seriously flawed by the presence of deterministic structure in the data. Deterministic structure can include pulses, level/step shifts, seasonal pulses and/or time trends; see "Statistics for time series trend in R". Your residuals suggest that a hybrid approach might be needed. Variance heterogeneity can often be dealt with using GLS (weighted estimation) rather than a power transform. See "When (and why) should you take the log of a distribution (of numbers)?" for a discussion of this. I suggest that you post your data and I will try to help further.
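As a rough illustration of how deterministic structure can mislead identification, the sketch below simulates pure white noise with a single level shift. The shift size, its position and the series length are arbitrary assumptions chosen for the demonstration; they are not taken from the poster's data.

```r
set.seed(1)
n     <- 200
noise <- rnorm(n)                        # white noise: no ARMA structure at all
step  <- as.numeric(seq_len(n) > 100)    # hypothetical level shift after t = 100
y     <- noise + 5 * step                # hypothetical shift size of 5

# ACF of the raw series: large, slowly decaying values caused only by the step
acf_raw <- acf(y, lag.max = 20, plot = FALSE)$acf[-1]

# De-mean each regime separately instead of differencing, then recheck
y_adj   <- y - ave(y, step)
acf_adj <- acf(y_adj, lag.max = 20, plot = FALSE)$acf[-1]

round(acf_raw[1:5], 2)   # looks "non-stationary"
round(acf_adj[1:5], 2)   # close to zero once the shift is removed
```

The slowly decaying ACF of the raw series is the pattern that usually prompts differencing, yet here it comes entirely from the step, and removing the step leaves nothing to model.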

EDITED AFTER RECEIPT OF DATA:

I took your 211 monthly values and used AUTOBOX (a piece of software that I have helped to develop), requesting a totally automatic analysis (complete with step-by-step details). The original data, before you tortured it with differencing (which injects structure; see "What are the consequences of not meeting the assumptions for the residuals of ARIMA model?") AND unwarranted logs, is shown here: [image]. The ACF suggested a possible stationary ARMA structure without the need for differencing: [image]. Note that the presence of level/step shifts is often misrepresented by taking differences when a simple de-meaning might be more appropriate. Unnecessary differencing injects structure into the residuals, which then necessitates ARMA structure to remedy/reverse the incorrect differencing. See "Variance of difference of $x_{i,t}$ and $x_{i,t+1}$" to examine the impact of differencing a white noise series.

To illustrate this, consider the ACF of the first differences here: [image]. It reflects the unfortunate, unintended and incorrect injection of structure.
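The same effect can be reproduced on simulated data. The following minimal sketch differences a white noise series; the seed and sample size are arbitrary assumptions. In theory the variance doubles and the lag-1 autocorrelation becomes -0.5, because differencing white noise creates a non-invertible MA(1).

```r
set.seed(42)
e  <- rnorm(5000)   # white noise: uncorrelated by construction
de <- diff(e)       # an unnecessary first difference

# Var(e_t - e_{t-1}) = 2 * sigma^2, so this ratio should be close to 2
var_ratio <- var(de) / var(e)

# Theoretical ACF of the differenced series: -0.5 at lag 1, 0 at higher lags
rho <- acf(de, lag.max = 5, plot = FALSE)$acf[-1]

round(var_ratio, 2)
round(rho, 2)
```

An analyst who sees that lag-1 spike will typically add an MA term to absorb it, which is exactly the "remedy" for structure that the differencing itself introduced.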

The model containing three step shifts and an ARMA structure reflecting one-period, three-period and annual structure (a seasonal pulse at period 11 which started 3 years ago; this phenomenon should be investigated and confirmed) is here [image] and here [image], with the following statistics: [image]. A number of pulses were found, suggesting unusual activity; they are clearly presented here: [image]. They should be investigated for possible causal effects from unspecified variables.
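For readers without AUTOBOX, the general idea of combining step shifts with ARMA structure can be sketched in base R as a regression with ARMA errors. This is not the AUTOBOX procedure: the shift date is assumed to be known in advance, and the data, the shift size and the AR(1) order are all hypothetical.

```r
set.seed(7)
n    <- 240
step <- as.numeric(seq_len(n) > 120)    # hypothetical level shift after t = 120
y    <- 10 + 3 * step + arima.sim(list(ar = 0.5), n = n)

# Step dummy as a regressor, AR(1) errors, no differencing (d = 0)
fit_step <- arima(y, order = c(1, 0, 0), xreg = cbind(step = step))

coef(fit_step)   # ar1, intercept and the estimated size of the shift

# Residual check; fitdf = 1 accounts for the one estimated AR parameter
Box.test(residuals(fit_step), lag = 20, type = "Ljung-Box", fitdf = 1)
```

With the shift modelled explicitly, the series is treated as stationary around a changing mean, so no differencing is needed to handle it.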

The plot of the residuals is here [image], with an ACF suggesting approximate sufficiency: [image]. The forecast plot is here [image] and the Actual/Fit and Forecast plot is here: [image]

Notice that I added a spurious value at time period 212 just to show how this user-caused anomaly was effectively discarded, which suggests that the approach is robust. [image]

The whole approach that you followed, taking two unnecessary drugs/transformations and using inadequate analytical tools, has created a powerful example of what can and did go wrong. The first step in building an ARIMA model is to examine the ACF/PACF of the original series and, when conducting a non-automatic analysis, to review a plot of the original data.

You are not alone in trying to form useful models with complicated data and basic tools while attempting to follow a script that might have been useful for a simple textbook example. The mistakes you made are not unusual at all. Assuming that you need to difference (a form of transform) and take logs (another form of transform) can lead to the "muddle" you found yourself in, as you were "hoisted on your own petard", so to speak, i.e. "to fall into one's own trap".

Finally, we often see quarterly effects when dealing with monthly data, particularly in the drug business, due to the way they normally do business.

In summary, your analysis exhibited two kinds of statistical errors, viz. commission and omission, and motivated my response, which is intended to teach good practice.

1) Errors of commission

a) unnecessary differencing
b) unnecessary power transformation

2) Errors of omission

c) no treatment of anomalies (one-time pulses, some very large and some not so large, but all significant)
d) no recognition of level shifts in the data
e) no identification of the month 11 effect arising in the last three years
f) no identification of the quarterly effect

You asked for details/criteria regarding strategy for Intervention Detection:

The criteria used are based on the seminal work of I. Chang, G. Tiao and, importantly, R. Tsay. The thread "time-series-ls-ao-tc-using-tsoutliers-package-in-r-how" discusses the Tsay procedure. This discussion might also help: "How to interpret and do forecasting using tsoutliers package and auto.arima". The major problem with the tsoutliers package is that it requires you to pre-specify an ARIMA model rather than integrating ARIMA model identification, outlier identification, variance transformation identification, time-varying parameter identification and dynamic structure (PDL) for user-suggested causal series, whereas AUTOBOX (available in R) does all of this.
