After you decompose a univariate time series with the stl() function in R, you are left with the trend, seasonal, and remainder components of the time series. Is it valid to use those components to then model the original time series along with additional variables?
For example:
> tsData
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2012 22 26 34 33 40 39 39 45 50 58 64 78
2013 51 60 80 80 93 100 96 108 111 119 140 164
2014 103 112 154 135 156 170 146 156 166 176 193 204
> stl(tsData, s.window = "periodic")
Call:
stl(x = tsData, s.window = "periodic")
Components
seasonal trend remainder
Jan 2012 -24.0219753 36.19189 9.8300831
Feb 2012 -20.2516062 37.82808 8.4235219
Mar 2012 -0.4812396 39.46428 -4.9830367
Apr 2012 -10.1034302 41.32047 1.7829612
...
Sep 2014 2.2193527 165.55136 -1.7707170
Oct 2014 7.3239448 169.33893 -0.6628760
Nov 2014 18.4285405 173.12650 1.4449614
Dec 2014 30.5244146 176.84390 -3.3683103
Now, if I wanted to model the time series with a linear model that includes some other variables, is it valid to do so?
lm(index ~ trend + seasonal + s1 + s2, data)
When I run that model I get an R-squared of 0.98, which makes sense considering that the original time series index is just the sum of trend + seasonal + remainder. What I'm concerned about is using a linear model with time series data: I want to make sure I'm not violating any major assumptions of linear regression. I figure that since I include the seasonal variable I'm essentially controlling for that element and hopefully reducing the autocorrelation, or am I, given that the R-squared is so high? Any help is appreciated!
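For reference, here is a minimal, self-contained sketch of this kind of regression, using the built-in AirPassengers series in place of my data and random placeholder regressors standing in for s1 and s2:

```r
# Sketch: decompose a series with stl(), then regress the original series
# on the extracted components plus assumed extra regressors s1 and s2.
set.seed(123)
fit <- stl(log(AirPassengers), s.window = "periodic")
d <- data.frame(
  index    = as.numeric(log(AirPassengers)),
  seasonal = as.numeric(fit$time.series[, "seasonal"]),
  trend    = as.numeric(fit$time.series[, "trend"]),
  s1       = rnorm(length(AirPassengers)),  # placeholder regressor
  s2       = rnorm(length(AirPassengers))   # placeholder regressor
)
m <- lm(index ~ trend + seasonal + s1 + s2, data = d)
summary(m)$r.squared  # very high, since trend + seasonal nearly reproduce index
```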
Best Answer
Once you have decomposed your original $index$ series into $seasonal$, $trend$ and $remainder$, you know that
$$index=seasonal+trend+remainder$$
holds exactly with unit coefficients in front of the three components.
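You can verify this identity directly in R (sketched here on the built-in AirPassengers series rather than your data):

```r
# Sketch: the stl components sum exactly back to the original series
fit  <- stl(log(AirPassengers), s.window = "periodic")
comp <- fit$time.series                 # columns: seasonal, trend, remainder
recovered <- rowSums(comp)              # seasonal + trend + remainder
all.equal(as.numeric(recovered), as.numeric(log(AirPassengers)))  # TRUE
```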
You then remove the last component $remainder$ and put in two regressors $s_1$ and $s_2$ instead.
If you kept the coefficients in front of $seasonal$ and $trend$ fixed at 1 and ran a regression
$$index=\beta_0+1 \cdot seasonal+1 \cdot trend+\beta_1 s_1+\beta_2 s_2+\varepsilon$$
then it would be equivalent to running the following regression
$$remainder=\beta_0+\beta_1 s_1+\beta_2 s_2+\varepsilon$$
This could very well make sense if you were interested in explaining the $remainder$ component using regressors $s_1$ and $s_2$ (and did not care about explaining $seasonal$ and $trend$ components).
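This equivalence is easy to check in R: offset() fixes a term's coefficient at 1, so the restricted regression and the remainder regression yield identical estimates for $s_1$ and $s_2$. A sketch on simulated data (all series here are assumed placeholders, not your actual components):

```r
# Sketch: fixing the seasonal and trend coefficients at 1 via offset()
# is equivalent to regressing the remainder on s1 and s2.
set.seed(1)
n <- 36
trend    <- seq(30, 180, length.out = n)
seasonal <- rep(sin(2 * pi * (1:12) / 12) * 20, 3)
s1 <- rnorm(n)
s2 <- rnorm(n)
remainder <- 0.5 * s1 - 0.3 * s2 + rnorm(n, sd = 2)
index <- seasonal + trend + remainder

fit_restricted <- lm(index ~ offset(seasonal + trend) + s1 + s2)
fit_remainder  <- lm(remainder ~ s1 + s2)

# identical coefficients for the intercept, s1, and s2
all.equal(coef(fit_restricted), coef(fit_remainder))  # TRUE
```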
What you actually do is leave the coefficients in front of $seasonal$ and $trend$ unrestricted. This implies that you do not completely "agree" with the `stl` decomposition, since you allow $seasonal$ and $trend$ to be multiplied by some coefficients. I wonder how you would interpret that. If you got an OLS estimate of the $seasonal$ coefficient of 1.2, would you say that seasonality is 1.2 times more variable than `stl` suggests? I am not sure this makes much sense, especially since the three components are not observed but rather derived using `stl`. So first you "agree" with the `stl` assumptions to derive the components, and later you start to "disagree" and fit those components using OLS.

Regarding the high $R^2$ value, it need not surprise you. The three `stl` components give a perfect fit for $index$. If the variability in $seasonal$ and $trend$ is large compared with that of $remainder$, you should expect a high $R^2$ in the OLS regression of $index$ on just $seasonal$ and $trend$ (excluding $remainder$), even without the extra regressors $s_1$ and $s_2$.

Regarding autocorrelation and possible violations of OLS assumptions, you can simply test for those once you have estimated your regression.