Solved – Detect trend in time series


Hypothesis: the time series has an inverted-U shape.

How do we test this numerically?

My idea is to take the first difference of the variable and fit a linear
model using the differenced variable as the endogenous variable and the time
variable as the exogenous variable:

$\Delta y_t = \beta_1 + \beta_2 t + \epsilon_t$

If the hypothesis is true, $\beta_2$ should be significantly less than zero.
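
In R, the test could be sketched like this (a minimal sketch, assuming `y` holds the observed series; the full data-generating code is in the Edit below):

# regress the first differences of y on time and look at the sign of the slope
dy <- diff(y)            # first differences, length n - 1
dt <- seq_along(dy) + 1  # matching time index t = 2, ..., n
summary(lm(dy ~ dt))     # inverted U  =>  coefficient on dt < 0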

If we try this approach on noise-free, computer-generated data, it works
well:

[Plot: the noise-free series y and its first difference]

Call:
lm(formula = dy ~ dt)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.219e-15 -2.520e-16 -2.218e-17  1.827e-16  1.241e-15 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept)  1.210e-01  1.118e-16  1.082e+15   <2e-16 ***
dt          -2.000e-03  2.245e-18 -8.910e+14   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 4.988e-16 on 82 degrees of freedom
Multiple R-squared:     1,  Adjusted R-squared:     1 
F-statistic: 7.939e+29 on 1 and 82 DF,  p-value: < 2.2e-16 

However, if a slight amount of noise is added to the data, this method falls
apart catastrophically:

[Plot: the noisy series y and its first difference]

Call:
lm(formula = dy ~ dt)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96480 -0.21802  0.00826  0.24701  0.93200 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.114305   0.087548   1.306    0.195
dt          -0.001922   0.001758  -1.093    0.277

Residual standard error: 0.3907 on 82 degrees of freedom
Multiple R-squared: 0.01437,    Adjusted R-squared: 0.002345 
F-statistic: 1.195 on 1 and 82 DF,  p-value: 0.2775 

So, what is the alternative?

Edit

R code to generate the series and the plots:

# time index and inverted-U (quadratic) trend plus Gaussian noise;
# drop the rnorm() term to reproduce the noise-free example above
t <- 1:85
y <- 0.12 * t - 0.001 * t^2 + rnorm(length(t), sd=0.25)

# first differences of y, aligned with the time points t = 2, ..., 85
dt <- tail(t, -1)
dy <- tail(y, -1) - head(y, -1)

plot(t, y, ylim=c(-0.5, 4), pch=19, col='navy')
points(dt, dy, pch=19, col='purple')
legend(x=3, y=3.5, c('y','first difference'), pch=19, col=c('navy','purple'))

# regress the differences on time
summary(lm(dy ~ dt))

Best Answer

If you use lm, then you should check the residuals to see whether they are autocorrelated. I suspect they are, and hence your t-tests are not valid (this is true also for summary(lm(y ~ t + I(t^2)))). This is basically because there is a time variable involved in your lm.
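
For instance, a quick check could look like this (a sketch; `fit` is just a placeholder name for the quadratic fit discussed above):

fit <- lm(y ~ t + I(t^2))
acf(resid(fit))    # spikes outside the confidence bands suggest autocorrelation
pacf(resid(fit))   # the PACF helps suggest an AR order p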

I recommend a generalized least squares (GLS) approach to test the quadratic effect while taking the autocorrelation into account. For example, if you assume an autoregressive process of order two for the residuals of your lm (i.e. $e_t=\phi_1 e_{t-1}+\phi_2 e_{t-2}+\nu_t$, where $\nu_t$ is white noise), then the code would look like

library(nlme)
# quadratic trend with AR(2) errors; corARMA(p = 2) models the
# residual autocorrelation e_t = phi_1 e_{t-1} + phi_2 e_{t-2} + nu_t
m1 <- gls(y ~ t + I(t^2), correlation = corARMA(p = 2))
summary(m1)

Note: You should model the error terms correctly first (i.e. find the orders $p$ and $q$), perhaps by checking the ACF and PACF of the residuals of your lm. Above, I assumed AR(2). More complicated ARMA models can be considered and tested.
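
For example, candidate error structures could be compared by refitting with maximum likelihood (a sketch; method = "ML" makes the likelihoods, and hence AIC and the likelihood-ratio test, comparable across models):

m_ar1 <- gls(y ~ t + I(t^2), correlation = corARMA(p = 1), method = "ML")
m_ar2 <- gls(y ~ t + I(t^2), correlation = corARMA(p = 2), method = "ML")
anova(m_ar1, m_ar2)   # likelihood-ratio test / AIC comparison of the two fits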