Solved – Residuals Interpretation: Time Series Data

r, regression, residuals, time series

I am trying to use multiple regression on a time series dataset. I have values of a variable measured every hour, 24 hours a day, for 4 months. Since there was a pattern that repeated every 24 hours, I used 23 dummy variables for the hourly variation in the values.

I used log transformation of the dependent variable before performing multiple regression. The fitted coefficients were highly significant and the R-squared was around 0.99.
However, when I look at the residuals vs fitted plot, it looks odd. According to the plots here, my plot is neither biased nor heteroskedastic, but it also doesn't look like random noise. Can someone help me find the issue here?

[Figure: residuals vs fitted plot]

Also, please find below a plot of the observed values and the fitted model for the first 500 hours.

[Figure: observed values vs time in hours, overlaid by the fitted model in red]

Best Answer

Residual plots are excellent, but the first and most basic step is to plot the original data where possible.

You should look at and show us the raw time series. It seems that you have three large negative residuals for 330, 331, 332. You don't tell us what the labels mean, but perhaps they are observation numbers.

A plot of observed and fitted versus time of day might be as useful as a plot versus time sequence, or even more so.
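A minimal sketch of such a plot in R, using simulated data as a stand-in for the real series. The variable names (`dat`, `y`, `hour`), the 8-hour/16-hour split, and all numeric values are assumptions for illustration only.

```r
# Simulated stand-in for the hourly series (all values here are made up)
set.seed(1)
n_days <- 120
hour <- rep(0:23, times = n_days)
# Assumed pattern: low activity for hours 0-7, higher activity for hours 8-23
level <- ifelse(hour < 8, 2, 50)
y <- level * exp(rnorm(length(hour), sd = 0.3))
dat <- data.frame(y = y, hour = factor(hour))

# Regression of log response on hour-of-day dummies (intercept + 23 dummies)
fit <- lm(log(y) ~ hour, data = dat)

# Fitted value for each hour of the day, on the log scale
hourly_fit <- predict(fit, newdata = data.frame(hour = factor(0:23, levels = 0:23)))

# Observed versus time of day, with the fitted values overlaid in red
plot(as.integer(as.character(dat$hour)), log(dat$y),
     xlab = "Hour of day", ylab = "log(y)", pch = 16, col = "grey60")
lines(0:23, hourly_fit, col = "red", lwd = 2)
```

With only hour-of-day dummies as predictors, the fitted value for each hour is simply the mean of the (log) response at that hour, so the vertical spread within each hour in this plot is exactly what ends up in the residuals.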

As you report that you used logarithms, it is puzzling how values can be, say, 5 lower than is typical on your logarithmic scale. You don't tell us the base you used. Even for base e, those points are a lot lower than fitted: a difference of 5 corresponds to a factor of about 148 ($e^5$).

It is also far from obvious that the logarithmic transformation was a good idea anyway: the distribution of your fitted values is very left-skewed.

Assuming that each vertical stripe corresponds to a separate hour, the pattern seems to be less activity for about 8 hours (night?) and more for about 16 hours (day?). Your high $R^2$ is probably higher than deserved because the transformation is spreading the lower values out. An observed versus fitted plot would show that more dramatically.

EDIT: Thanks for showing the plot. The very large negative residuals now appear to be a side effect of using an inappropriate logarithmic transformation. Plot log response versus response for the range of your data to see how the values are stretched out at the lower end.
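As a rough illustration of that stretching, here is a small R sketch. The range 0.01 to 100 is hypothetical, since the actual range of the response is not stated in the question.

```r
# Hypothetical range of a positive response; the real range is not given
y <- seq(0.01, 100, length.out = 1000)

# Log response versus response: steep near zero, nearly flat for large values
plot(y, log(y), type = "l", xlab = "response", ylab = "log(response)")

# Equal ratios become equal differences on the log scale, so the narrow
# interval [0.01, 1] takes up as much log-scale room as [1, 100]
low_span  <- log(1) - log(0.01)
high_span <- log(100) - log(1)
c(low_span = low_span, high_span = high_span)

# A residual of -5 on the natural-log scale means the observed value is
# exp(-5) times the back-transformed fitted value
exp(-5)
```

A handful of observations close to zero can therefore produce very large negative residuals on the log scale even when they differ little from the other low values in absolute terms.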

I'd repeat the suggestion to plot observed versus time of day. That's what the regression "sees". There is no time series analysis here, but just time series data treated with regression.

Related Solutions

Residuals vs Fitted Plot – Interpreting a Residuals vs Fitted Plot and Extracting Points

Well done for looking at the diagnostic plots for your regression. In this case, they have revealed that your model is inappropriate, as @Glen_b says in the comments. Sometimes you can get away with modelling count data with a Gaussian "ordinary" regression, but in this case the violations of the standard assumptions are clearly too strong. There are too many actual values at zero where the model predicts negative values; this skews the whole result and leaves a lot of structure in the residuals. You need to move to a Poisson GLM.
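A minimal sketch of the difference in R, on simulated counts rather than the asker's data. The predictor `x`, the coefficients used to generate the counts, and the object names are all assumptions.

```r
# Simulated counts with many zeros (made-up data, not the asker's)
set.seed(42)
x <- runif(300, 0, 4)
counts <- rpois(300, lambda = exp(-2 + 1.2 * x))
dat <- data.frame(x = x, counts = counts)

# Ordinary (Gaussian) regression: nothing stops it predicting negative counts
fit_lm <- lm(counts ~ x, data = dat)

# Poisson GLM with the default log link: fitted means are always positive
fit_pois <- glm(counts ~ x, family = poisson, data = dat)

# Compare the range of fitted values from the two models
range(fitted(fit_lm))
range(fitted(fit_pois))
```

Whether a plain Poisson model is adequate for a given dataset is a separate question (overdispersion or excess zeros may call for something else), but it removes the negative predictions.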

On the second part of your question: for future reference, the identify() function is a good way to identify a few points in a plot, e.g.

plot(predict(v.lm), residuals(v.lm))
identify(predict(v.lm), residuals(v.lm))

Another good trick, when you suspect something about those points, is to create a dummy variable for your candidate explanation (e.g. 1 when the response is 0, and 0 otherwise) and map that to a colour aesthetic. ggplot2 is a great package to use for this sort of thing.
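A sketch of the colouring trick in base R graphics, so that it runs without extra packages. The data are simulated, and `v.lm` here is a stand-in fitted on them, not the asker's model.

```r
# Made-up count data with a linear fit, standing in for the asker's v.lm
set.seed(7)
x <- runif(200, 0, 4)
y <- rpois(200, lambda = exp(-2 + 1.2 * x))
v.lm <- lm(y ~ x)

# Dummy variable for the candidate explanation: 1 when the response is 0
is_zero <- as.integer(y == 0)

# Colour the points by the dummy: red for zero responses, black otherwise
plot(predict(v.lm), residuals(v.lm),
     col = ifelse(is_zero == 1, "red", "black"), pch = 16,
     xlab = "Fitted", ylab = "Residuals")
```

In ggplot2 the same idea would be to map the dummy to the colour aesthetic, for example `aes(colour = factor(is_zero))`. In this simulation the red points fall on a straight line of slope -1, because a zero response gives a residual equal to minus the fitted value.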

Regression – Trend in Residuals vs Dependent But Not in Residuals vs Fitted

1) The residuals and the fitted values are uncorrelated by construction. In fact, if there were any correlation between them, there would be an uncaptured linear trend in the data: we could get a closer fit by changing the coefficients until they were uncorrelated.

2) The residuals and the y-variable are always positively correlated. This is a necessary consequence of (1).

$$\operatorname{cov}(e,y) = \operatorname{cov}(e,e+\hat y) = \operatorname{var}(e) + \operatorname{cov}(e,\hat y) = \sigma^2+0 = \sigma^2$$

So it would be surprising if there wasn't a trend in that first plot.

Consider a simulated example -
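A minimal sketch of such a simulation in R. The data-generating model and all names are assumptions for illustration, not the original example.

```r
# Simulated data (made up for illustration)
set.seed(123)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
e <- residuals(fit)
yhat <- fitted(fit)

# (1) Residuals and fitted values are uncorrelated by construction
cor(e, yhat)

# (2) cov(e, y) equals var(e), so residuals and observed y correlate positively
cov(e, y)
var(e)
cor(e, y)

# Side by side: a trend shows against observed y, but not against fitted
op <- par(mfrow = c(1, 2))
plot(y, e, xlab = "Observed", ylab = "Residuals")
plot(yhat, e, xlab = "Fitted", ylab = "Residuals")
par(op)
```

The correlation in (1) is zero up to floating-point error, while the two quantities in (2) agree, matching the covariance identity above.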

Note that plotting residuals against observed values is equivalent to using slanted axes in the residuals vs fitted plot:

[Figure: residuals vs fitted plot with grey slanted lines marking constant observed values]

The reason for the high observed values (the grey slanted lines mark constant observed values; the ones to the far right are high) being associated with high residuals is clear here, as is the reason for them being only positive near the end.
