Solved – Time Series Forecasting vs Linear Regression Extrapolation

Tags: regression, time series

I'm working on some problems involving prediction of future values. I need to get an aggregated total at some point in the future.

My question is: what is the best way to predict the future values?

First, I want to mention that there are tens of thousands of different points, so I'm planning on using either the plyr or dplyr packages.
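(For context, the per-series rollup I have in mind would look roughly like this with dplyr — dat, series_id, and value are placeholder names of mine, and the forecast-based total is the one I describe below:)

library(dplyr)
library(forecast)

# dat: hypothetical long-format data, one row per series and month,
# with columns series_id and value (names are placeholders)
totals = dat %>%
  group_by(series_id) %>%
  summarise(
    fcast_total = max(cumsum(c(value,
                               forecast(ts(value, frequency = 12, start = 1),
                                        h = 60)$mean)))
  )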

There are two ways I've gone about doing this:

  1. Time Series Forecasting
  2. Linear Regression

Also, before anyone jumps in and says "You can't do extrapolation with linear regression!" — I understand that, and the reasons why it's a problem. But in the example below, why would it be a problem?

Here's an example of some data I made up that mimics some of the data. The x values are months.

library(forecast)

# Made-up monthly data (x is the month index)
y = c(100,90,70,20,15,11,19,17,10,10,12,14,13,14,11,10,9,7,5,1,0,1,0,0)
x = 1:length(y)

# Monthly ts objects for the raw series and its cumulative total
# (these need to exist before the forecast() calls below)
y.ts   = ts(y, frequency = 12, start = 1)
csy.ts = ts(cumsum(y), frequency = 12, start = 1)

plot(y ~ x)
plot(cumsum(y) ~ x)
plot(forecast(y.ts, h = 60))
plot(forecast(csy.ts, h = 60))
plot(cumsum(y) ~ log(x))

# I arbitrarily chose 5 additional years (h = 60 months) for the forecasting

# Approach 1a: forecast the raw series, then cumulate observed + forecast values
y.ts_fcast = forecast(y.ts, h = 60)$mean
fcast_total = cumsum(c(y, y.ts_fcast))
max(fcast_total)
# Gives 459

# Approach 1b: forecast the cumulative series directly
csy.ts_fcast = forecast(csy.ts, h = 60)$mean
csy.ts_fcast[60]
# Gives 457

# Approach 2: regress the cumulative total on log(month), then evaluate
# at month 84 (24 observed + 60 forecast months)
fit = lm(cumsum(y) ~ log(x))
fit$coef[2] * log(84) + fit$coef[1]
# Gives 616

Intuitively, 459 seems like about the right number. So why is the regression "wrong"? On the earlier point about extrapolation being a problem for regression, my understanding is that it is usually risky because there can be a surprise 'dip' in the extrapolated range that the training data never saw. Here, though, the cumsum(y) data looks like a really nice logarithmic curve (and it is always increasing), so I am not worried about dips. Even a spike seems unlikely, and if one did occur, it would be a small percentage of the cumulative total and would not stand out as an outlier.
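(One quick check, reusing the lm fit from above: since log(x) is unbounded and the fitted slope is positive, the log-curve fit implies the cumulative total never levels off, even though the raw increments in y have already dropped to roughly zero.)

# Evaluate the fitted log curve at a few far-future months (fit is the lm above)
predict(fit, newdata = data.frame(x = c(84, 120, 240)))
# the predictions keep climbing, whereas the raw increments are already ~0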

While typing this I happened to see another post mentioning independence being violated, which I feel is most likely one of the issues here. I'm just not sure why.

Best Answer

The main reason ordinary least squares regression is frowned upon for modeling time series data is that the error terms are correlated with each other (this is called autocorrelation). When that is the case, the standard errors from OLS are incorrect, which affects hypothesis testing.
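For example, a minimal way to check for this in your fit (the lmtest package is my choice here, not something from the question) is to inspect the residual ACF and run a Durbin-Watson test:

library(lmtest)   # for dwtest(); package choice is an assumption

fit = lm(cumsum(y) ~ log(x))
acf(resid(fit))   # spikes well outside the bands suggest autocorrelated errors
dwtest(fit)       # a Durbin-Watson statistic far from 2 indicates autocorrelation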

There are several specialized techniques you can use to account for autocorrelation, such as autoregressive models with lagged terms, generalized least squares, and HAC (heteroskedasticity and autocorrelation consistent) standard errors.
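As a rough sketch of two of those remedies applied to this regression (the nlme, sandwich, and lmtest packages are my choices, not something from the original post):

library(nlme)      # gls() with an AR(1) error structure
library(sandwich)  # NeweyWest() HAC covariance matrix
library(lmtest)    # coeftest() to pair OLS estimates with HAC standard errors

d = data.frame(csy = cumsum(y), lx = log(x))

# Generalized least squares with AR(1)-correlated errors
gls_fit = gls(csy ~ lx, data = d, correlation = corAR1())
summary(gls_fit)

# OLS point estimates reported with Newey-West (HAC) standard errors
ols_fit = lm(csy ~ lx, data = d)
coeftest(ols_fit, vcov = NeweyWest(ols_fit))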
