Solved – modeling time series data with lm()

Tags: r, regression, time series

After you decompose a univariate time series with the stl() function in R, you are left with the seasonal, trend and remainder (random) components of the series. Is it valid to then use those components, together with some additional variables, to model the original time series?

For example:

> tsData
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2012  22  26  34  33  40  39  39  45  50  58  64  78
2013  51  60  80  80  93 100  96 108 111 119 140 164
2014 103 112 154 135 156 170 146 156 166 176 193 204

> stl(tsData, s.window = "periodic")
 Call:
 stl(x = tsData, s.window = "periodic")

Components
            seasonal     trend   remainder
Jan 2012 -24.0219753  36.19189   9.8300831
Feb 2012 -20.2516062  37.82808   8.4235219
Mar 2012  -0.4812396  39.46428  -4.9830367
Apr 2012 -10.1034302  41.32047   1.7829612
...
Sep 2014   2.2193527 165.55136  -1.7707170
Oct 2014   7.3239448 169.33893  -0.6628760
Nov 2014  18.4285405 173.12650   1.4449614
Dec 2014  30.5244146 176.84390  -3.3683103

Now, if I wanted to model the time series with a linear model that includes some other variables, would that be valid?

lm(index ~ trend + seasonal + s1 + s2, data = data)

When I run that model I get an R-squared of 0.98, which makes sense considering that the original time series index is just the sum of trend + seasonal + remainder. What I'm concerned about is using a linear model with time series data: I want to make sure I'm not violating any major assumptions of linear regression. I figure that since I include the seasonal variable I'm essentially controlling for that element and hopefully reducing the autocorrelation, or am I, given that the R-squared is so high? Any help is appreciated!
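
For reference, here is a minimal sketch of what I mean; tsData is the series shown above, and s1 and s2 are just random placeholders standing in for my actual extra variables:

set.seed(1)
tsData <- ts(c( 22,  26,  34,  33,  40,  39,  39,  45,  50,  58,  64,  78,
                51,  60,  80,  80,  93, 100,  96, 108, 111, 119, 140, 164,
               103, 112, 154, 135, 156, 170, 146, 156, 166, 176, 193, 204),
             start = c(2012, 1), frequency = 12)

fit_stl <- stl(tsData, s.window = "periodic")

# assemble the stl components and the extra regressors into one data frame
data <- data.frame(
  index    = as.numeric(tsData),
  seasonal = as.numeric(fit_stl$time.series[, "seasonal"]),
  trend    = as.numeric(fit_stl$time.series[, "trend"]),
  s1       = rnorm(length(tsData)),   # placeholder regressor
  s2       = rnorm(length(tsData))    # placeholder regressor
)

fit_lm <- lm(index ~ trend + seasonal + s1 + s2, data = data)
summary(fit_lm)$r.squared             # very close to 1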

Best Answer

Once you have decomposed your original $index$ series into $seasonal$, $trend$ and $remainder$, you know that

$$index=seasonal+trend+remainder$$

holds exactly with unit coefficients in front of the three components.
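
You can check this directly, reusing tsData and the fit_stl object from the sketch in the question:

# the additive stl decomposition reproduces the original series exactly
all.equal(as.numeric(tsData),
          as.numeric(rowSums(fit_stl$time.series)))   # TRUE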

You then remove the last component $remainder$ and put in two regressors $s_1$ and $s_2$ instead.
If you kept the coefficients in front of $seasonal$ and $trend$ fixed at 1 and ran a regression

$$index=\beta_0+1 \cdot seasonal+1 \cdot trend+\beta_1 s_1+\beta_2 s_2+\varepsilon$$

then it would be equivalent to running the following regression

$$remainder=\beta_0+\beta_1 s_1+\beta_2 s_2+\varepsilon$$

This could very well make sense if you were interested in explaining the $remainder$ component using regressors $s_1$ and $s_2$ (and did not care about explaining $seasonal$ and $trend$ components).
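
If you wanted to impose those unit coefficients in practice, one way to do it (a sketch, reusing the data frame from the question's snippet and adding the $remainder$ column) is an offset() term; the estimates coincide with those from regressing $remainder$ on $s_1$ and $s_2$ directly:

data$remainder <- as.numeric(fit_stl$time.series[, "remainder"])

# coefficients on seasonal and trend fixed at 1 via offset()
fit_restricted <- lm(index ~ offset(seasonal + trend) + s1 + s2, data = data)
# the equivalent regression of the remainder alone
fit_remainder  <- lm(remainder ~ s1 + s2, data = data)

all.equal(coef(fit_restricted), coef(fit_remainder))   # TRUE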

What you actually do is leave the coefficients in front of $seasonal$ and $trend$ unrestricted. This implies that you do not completely "agree" with the stl decomposition, as you allow $seasonal$ and $trend$ to be multiplied by some coefficients.

I wonder how you would interpret that. If you got an OLS estimate of 1.2 for the $seasonal$ coefficient, would you say that seasonality is 1.2 times more variable than stl suggests? I am not sure this makes much sense, especially since the three components are not observed but derived using stl. So first you "agree" with the stl assumptions to derive the components, and later you start to "disagree" and try to refit those components using OLS.

Regarding the high $R^2$ value, it need not surprise you. The three stl components give a perfect fit for $index$. If the variability in $seasonal$ and $trend$ is large compared with that of $remainder$, you should expect a high $R^2$ in the OLS regression of $index$ on just $seasonal$ and $trend$ (excluding $remainder$), even without the extra regressors $s_1$ and $s_2$.
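
You can see this with the example data above: even without $s_1$ and $s_2$, the two components alone already give a near-perfect fit.

summary(lm(index ~ seasonal + trend, data = data))$r.squared   # close to 1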

Regarding autocorrelation and possible violations of OLS assumptions, you can simply test for those once you have estimated your regression.
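
For instance (a sketch assuming the fit_lm object from the earlier snippet; dwtest() and bgtest() come from the lmtest package, which you may need to install):

acf(residuals(fit_lm))        # visual check of residual autocorrelation

library(lmtest)               # install.packages("lmtest") if needed
dwtest(fit_lm)                # Durbin-Watson test for first-order autocorrelation
bgtest(fit_lm, order = 12)    # Breusch-Godfrey test up to lag 12 (one seasonal cycle)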