Solved – Autocorrelated residuals from `auto.arima`

arima, autocorrelation

I'm having issues with the residuals of my ARIMA models in R for two time series. When I run the Ljung-Box test on the residuals, I reject the null hypothesis of no autocorrelation (i.e. my residuals still appear correlated), and I don't know what to do next. My end goal is to show that the steel time series can be used to predict car production.

The steel and cars time series data was extracted from these sources: steel and cars.

The following is my code:

library(forecast)   # auto.arima
library(imputeTS)   # na.interpolation

steel <- read.csv("~/stat248/monthly-production-of-raw-steel-.csv")
cars <- read.csv("~/stat248/australia-monthly-production-of-.csv")
colnames(cars)[2] <- "cars"
colnames(steel)[2] <- "steel"

# Monthly series, January 1956 to November 1993
cars <- ts(cars$cars, start = c(1956, 1), end = c(1993, 11), frequency = 12)
steel <- ts(steel$steel, start = c(1956, 1), end = c(1993, 11), frequency = 12)
plot(cbind(cars, steel), main = "Production of Cars and Steel in Australia")

# Fill missing values, log-transform, and strip seasonal/trend components with STL
cars <- na.interpolation(cars)
logcars <- log(cars)
logsteel <- log(steel)
logcars_stl <- stl(logcars, s.window = "periodic")
logsteel_stl <- stl(logsteel, s.window = "periodic")

# Fit ARIMA models to the STL remainders
logsteel_arima <- auto.arima(logsteel_stl$time.series[, "remainder"], approximation = FALSE, trace = FALSE)
logcars_arima <- auto.arima(logcars_stl$time.series[, "remainder"], approximation = FALSE, trace = FALSE)

> Box.test(logcars_arima$residuals,lag=20,type="Ljung-Box")

    Box-Ljung test

data:  logcars_arima$residuals
X-squared = 61.454, df = 20, p-value = 4.231e-06

> Box.test(logsteel_arima$residuals,lag=20,type="Ljung-Box")

    Box-Ljung test

data:  logsteel_arima$residuals
X-squared = 56.109, df = 20, p-value = 2.799e-05

Here I get tiny $p$-values even after using `auto.arima`. Manually selecting ARIMA orders by comparing AICs didn't fare any better. Any advice?

Best Answer

The Ljung-Box test is inappropriate for testing residuals from an ARIMA model; the Breusch-Godfrey test should be used instead. See Testing for autocorrelation: Ljung-Box versus Breusch-Godfrey.
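
As a minimal sketch of one way to run a Breusch-Godfrey-type check on the residual series, you could use `bgtest` from the lmtest package; regressing the residuals on an intercept only is my own assumption here, not something prescribed by the linked answer, and the object names are the ones from the question:

library(lmtest)

# Breusch-Godfrey test for serial correlation up to lag 20,
# applied to an intercept-only regression of the ARIMA residuals
bgtest(lm(residuals(logcars_arima) ~ 1), order = 20)
bgtest(lm(residuals(logsteel_arima) ~ 1), order = 20)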

(Even if you do use the Ljung-Box test, standard practice is to adjust the degrees of freedom of the null distribution for the fact that you are supplying residuals rather than raw data. This can be done via the `fitdf` argument of `Box.test`; `fitdf` should equal $p+q$, where $p$ is the autoregressive order and $q$ is the moving average order of the fitted model.)
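
For illustration, here is a minimal sketch of the adjusted test, using `arimaorder()` from the forecast package to pull $p$ and $q$ out of the fitted model (object names as in the question):

ord <- arimaorder(logcars_arima)      # named vector c(p, d, q)
Box.test(residuals(logcars_arima),
         lag = 20,
         type = "Ljung-Box",
         fitdf = ord["p"] + ord["q"]) # null df becomes 20 - (p + q)

The `checkresiduals()` function in the forecast package performs essentially this adjustment for you.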

Also, the absence of residual autocorrelation is not necessarily a sign that the model will generalize well out of sample; insisting that the residuals show no autocorrelation is likely to lead to overfitting. Meanwhile, the AIC-based model selection used by `auto.arima` strikes a sound balance between underfitting and overfitting.
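
To see that trade-off in action, you can re-run the search with `trace = TRUE` (the only change from the code in the question), which prints the AICc of every candidate model `auto.arima` considers before settling on the best one:

# Same call as in the question, but with trace = TRUE so each
# candidate model and its AICc are printed as the search runs
auto.arima(logcars_stl$time.series[, "remainder"],
           approximation = FALSE, trace = TRUE)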