Solved – Autocorrelation and heteroskedasticity in time series data

autocorrelation, heteroscedasticity, regression, time series

I have several time series of two variables over the course of one year (approx. 2.5k observations). I hypothesize that one variable (x) acts as a potential predictor of the other variable (y). I looked for the period in which y is best described by x ("best" meaning the strongest Pearson correlation), took the observations from just that optimal period, fitted a simple linear regression model, and used it to predict y from all x. Hence, I have two y series over the same one-year span – the observed and the predicted, the latter being the output of the regression model fitted on the optimal period.

Now, I have detected autocorrelation and heteroskedasticity in the data from the optimal period. For one example time series, see below the regression diagnostic plots, with the statistical test results printed inside them.

[Figure: regression diagnostic plots for one example time series]

Top left: raw data in a scatterplot; top right: residuals vs. the independent variable (DW = Durbin-Watson test and BG = Breusch-Godfrey test for autocorrelation); middle left: residuals vs. fitted plot (BP = Breusch-Pagan test for heteroskedasticity); middle right: normal Q-Q plot (W = Shapiro-Wilk test and A = Anderson-Darling test for normality of the residuals); bottom left: scale-location plot; bottom right: residuals vs. leverage plot to detect outliers.
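For reference, these diagnostics can be reproduced in R along roughly the following lines. This is only a minimal sketch: `optimal_period` is a hypothetical data frame holding the observations from the optimal period, and the Breusch-Godfrey lag order of 5 is an arbitrary choice.

    # packages assumed: lmtest (dwtest, bgtest, bptest) and nortest (ad.test)
    library(lmtest)
    library(nortest)

    fit <- lm(y ~ x, data = optimal_period)   # simple linear regression on the optimal period

    dwtest(fit)                    # Durbin-Watson: first-order autocorrelation
    bgtest(fit, order = 5)         # Breusch-Godfrey: autocorrelation up to lag 5
    bptest(fit)                    # Breusch-Pagan: heteroskedasticity
    shapiro.test(residuals(fit))   # Shapiro-Wilk: normality of residuals
    ad.test(residuals(fit))        # Anderson-Darling: normality of residuals

    par(mfrow = c(2, 2))
    plot(fit)                      # the standard lm() diagnostic plots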

From the tests and the visual inspection I infer that autocorrelation and heteroskedasticity are present in my data. I am somewhat stuck on how to proceed from here. In particular, I would be glad for help with the following:

  1. Autocorrelation is common in time series data, but that does not mean I can skip correcting for it, right?
  2. Normality of the residuals is not THAT important, especially for larger sample sizes. My optimal periods contain at least 240 observations. Is that enough to set this issue aside?
  3. For pure prediction purposes, which is what I am mainly interested in, I have heard that regression diagnostics and their remedies do not improve the predicted values much but matter chiefly for the t- and F-statistics. Should I even worry about rectifying the issues revealed by the diagnostics if it is the forecasted y I am after?
  4. A Box-Cox transformation could help rectify the heteroskedasticity, and the Cochrane–Orcutt, Hildreth-Lu, or first-differences procedures can tackle the autocorrelation, at least in theory. How do I know which problem to address first? Does remedying heteroskedasticity improve or worsen the autocorrelation situation, and vice versa? Is there a procedure that alleviates both problems in one go (e.g. the Newey-West estimator)? (See the sketch after this list.)
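As a rough illustration of point 4, the remedies named there could be tried in R as below. This is a sketch, not a recommendation; it assumes the `sandwich`, `lmtest`, `orcutt`, and `MASS` packages and reuses the hypothetical `optimal_period` data frame from above.

    library(sandwich)   # NeweyWest
    library(lmtest)     # coeftest
    library(orcutt)     # cochrane.orcutt
    library(MASS)       # boxcox

    fit <- lm(y ~ x, data = optimal_period)

    # Newey-West (HAC) standard errors: the coefficients stay the same,
    # but the t-statistics become robust to autocorrelation and heteroskedasticity
    coeftest(fit, vcov = NeweyWest(fit))

    # Cochrane-Orcutt: re-estimates the regression assuming AR(1) errors
    cochrane.orcutt(fit)

    # Box-Cox: profile likelihood for a power transformation of y (y must be positive)
    boxcox(fit)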

The scatterplot indicates that the relationship is not described very well by a linear regression model and might be better served by a GLS model in the first place. However, I would like to apply a linear model consistently to all my time series, even if it does not explain the distribution of the data very well. That in itself would be an interesting outcome.
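If a GLS model is on the table, both issues can in principle be modelled jointly. A minimal sketch with `nlme` follows, assuming AR(1) errors and an error variance that scales with the fitted mean; whether those assumptions suit the data is a separate question.

    library(nlme)

    gls_fit <- gls(y ~ x,
                   data        = optimal_period,
                   correlation = corAR1(),    # AR(1) structure for the errors
                   weights     = varPower())  # error SD modelled as a power of the fitted values
    summary(gls_fit)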

I am using the R environment for my analyses.

Best Answer

We have seen residual plots such as yours when untreated deterministic effects are present. These might include hourly or daily effects. Care should be taken to identify and incorporate any needed effects such as pulses, level shifts, seasonal pulses, and/or local time trends. Any ARIMA structure suggested by the model diagnostics should also be included. At that point, consider testing for constancy of the parameters over time, as the parameters may have changed, and testing for constancy of the error variance over time. Non-constant error variance over time can be diagnosed and dealt with via the Tsay procedure, as described at http://docplayer.net/12080848-Outliers-level-shifts-and-variance-changes-in-time-series.html, or via the classic Box-Cox approach. The Tsay approach should be tried first, as it leads to the less intrusive transformation.

The correct transfer-function identification procedure, using prewhitened cross-correlations in conjunction with the above, should lead to a useful model. The process I have described here is fundamentally what I incorporated into AUTOBOX, a piece of software that I have helped to develop. You could attempt to program this yourself, but it might be time-consuming.

EDITED AFTER COMMENT BY OP

The ACF of the residuals from a tentative model suggests whether your current model needs to be augmented with ARIMA structure. The CCF of the residuals from a tentative model against a prewhitened X variable suggests possible improvements to the transfer-function structure. Yes, to "see the initial lags" you use the CCF (see https://onlinecourses.science.psu.edu/stat510/node/75 and "Why is prewhitening important?"). It is literally a minefield of possible ways to do it wrong, which is why I automated the process. The automation doesn't guarantee optimality, but it does clear the path. Advice from non-time-series experts in this area should be studiously avoided, as your problem can be daunting.
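For concreteness, prewhitening and the CCF can be sketched in R roughly as follows. This is an illustration only, using the `forecast` package and assuming `x` and `y` are the two aligned series.

    library(forecast)

    fit_x <- auto.arima(x)                        # ARIMA model for the input series x
    x_pw  <- residuals(fit_x)                     # prewhitened x
    y_pw  <- residuals(Arima(y, model = fit_x))   # y filtered with the same model
    ccf(x_pw, y_pw)                               # cross-correlations suggest the lag structure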

http://autobox.com/cms/images/dllupdate/TFFLOW.png is a start, to which one would add, before forecasting, checks for 1) constancy of the parameters and 2) constancy of the variance of the error process. Now that you have a work statement (flow diagram), the next part is to implement it, either manually or with creative productivity aids.

This is (somewhat) repetitive, but what needs to be identified is:

  1. Stationarity conditions, i.e. the order of differencing for both X and Y
  2. The form of the X component in terms of needed lags, i.e. numerator and denominator structure
  3. The required ARMA structure
  4. The need for intervention-detected variables, viz. pulses, level shifts, seasonal pulses, local time trends
  5. The need to deal with evident parameter changes over time
  6. The need to deal with evident deterministic variance changes, requiring weighted least squares
  7. The need to deal with evident variance changes that are level-dependent, requiring a power transform
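A few of these steps have readily available, if partial, R counterparts. The following is only a sketch under the assumption that `x` and `y` are ts objects; it does not reproduce the full procedure described above.

    library(forecast)     # ndiffs, auto.arima
    library(tsoutliers)   # tso: automatic detection of pulses, level shifts, etc.

    ndiffs(y); ndiffs(x)                      # suggested orders of differencing
    fit <- auto.arima(y, xreg = x)            # regression on x with ARIMA errors
    tso(y, types = c("AO", "LS", "TC"))       # additive outliers, level shifts, temporary changes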