This is what I get using the R packages forecast (for ARIMA model selection) and tsoutliers (for detection of possible additive outliers, level shifts, or temporary changes).
These are the data:
x <- structure(c(1291, 2497, 2643, 2633, 2637, 3246, 2764, 2552, 2190, 2423, 2393, 2278, 1573, 792, 1826, 1819, 1610, 1344, 1762, 1917, 1920, 1804, 1936, 1912, 2000, 1742, 2120, 2483, 2475, 2252, 2061, 2145, 1901, 2223, 1934, 1439, 1963, 1879, 1587, 2467, 2852, 3811, 3154, 2428, 2881, 2625, 2416, 2538, 1947, 2526, 2421, 1277, 1334,
2009, 2693, 2624, 2714, 2882, 3044, 2847, 2439, 2690, 2677, 2461, 1495, 2501, 2614, 2486, 2390, 1901, 2315, 2710, 2436, 2184, 2079, 2234, 2382, 2213, 2506, 2673, 2579, 2341, 2246, 2331, 1670, 2263, 2453, 2411, 2653, 2754, 2818, 2635, 2577, 2897, 2796, 2039, 2188,
2749, 2790, 2510, 2128, 2839, 2270, 1791, 1373, 2078, 2878, 3198, 3095, 2964, 3230, 3122, 2809, 2902, 3048, 2984, 2927, 2740, 2732, 1572, 2829, 1831, 2821, 2907, 2973, 2690, 2511, 2632, 2502, 2590, 2907, 3046, 2953, 2663, 2613, 2637, 2061, 2812, 2660, 2653, 2708, 2921, 2862, 2526, 2863, 3002, 3227, 2254, 2606, 2959, 2963, 3302),
.Tsp = c(2012, 2014.90384615385, 52), class = "ts")
Fit an ARIMA model:
require(tsoutliers)
res <- tso(x, args.tsmethod = list(ic="bic"))
res
# output
# Series: x
# ARIMA(2,0,0) with non-zero mean
# Coefficients:
# ar1 ar2 intercept TC6 AO14 LS40 AO65
# 0.6172 -0.2038 1960.4325 1024.713 -885.4593 494.7448 -932.6690
# s.e. 0.0836 0.0846 83.7532 296.303 252.3412 102.8566 250.3566
# LS107 AO120 AO122
# 370.7175 -1219.8245 -987.1700
# s.e. 95.7657 252.4097 252.4363
# sigma^2 estimated as 88411: log likelihood=-1081.49
# AIC=2184.99 AICc=2186.87 BIC=2218.25
# Outliers:
# type ind time coefhat tstat
# 1 TC 6 2012:06 1024.7 3.458
# 2 AO 14 2012:14 -885.5 -3.509
# 3 LS 40 2012:40 494.7 4.810
# 4 AO 65 2013:13 -932.7 -3.725
# 5 LS 107 2014:03 370.7 3.871
# 6 AO 120 2014:16 -1219.8 -4.833
# 7 AO 122 2014:18 -987.2 -3.911
The data are fitted by a stationary AR(2) model with some outliers summarized in the output. According to this model, a cyclic behaviour rather than seasonality is observed in the data.
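As a quick check of that cyclic interpretation, the roots of the estimated AR(2) polynomial can be inspected in base R. Using the coefficients reported in the output above, the roots turn out to be complex and outside the unit circle, which corresponds to a stationary stochastic cycle; the standard AR(2) formula then gives the implied period (a sketch, using only the point estimates):

```r
# Estimated AR(2) coefficients from the output above
phi1 <- 0.6172
phi2 <- -0.2038

# Roots of the AR polynomial 1 - phi1*z - phi2*z^2
r <- polyroot(c(1, -phi1, -phi2))
Mod(r)  # both moduli > 1: the fitted model is stationary
Im(r)   # nonzero imaginary parts: complex roots, i.e. a stochastic cycle

# Period of the implied cycle: 2*pi / theta, with cos(theta) = phi1 / (2*sqrt(-phi2))
theta <- acos(phi1 / (2 * sqrt(-phi2)))
2 * pi / theta  # roughly an 8-week cycle
```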
These are the residuals from the fitted model.
-562.694; 861.332; 215.01; 360.635; 400.555; -19.663; 63.132; 89.367; -159.542; 310.001; 101.865;
80.454; -540.365; -10.707; -60.819; -131.584; -301.359; -436.549; 105.306; -50.281; -56.658; -142.147;
62.602; -66.119; 63.857; -253.158; 302.135; 379.365; 224.416; 80.363; 25.394; 181.858; -152.887;
336.83; -200.61; -451.631; 318.987; -189.269; -322.656; 225.7; 313.434; 1113.322; -57.1; -182.213;
584.986; -98.521; -57.223; 141.604; -567.276; 401.332; -181.432; -1142.653; -401.002; 5.72; 284.743;
-68.867; 203.088; 301.483; 378.136; 115.386; -138.022; 324.644; 73.6; -83.233; 14.096; 63.987; 124.937;
-57.862; -51.839; -507.671; 188.566; 228.418; -205.009; -207.419; -212.721; -44.265; -13.322; -242.081;
185.378; 137.111; -0.256; -146.214; -113.48; -18.343; -751.16; 267.111; -43.557; -81.991; 224.644;
167.73; 218.705; 16.786; 84.769; 403.277; 92.964; -536.499; 59.121; 373.918; 99.044; -91.952; -292.79;
596.918; -488.727; -471.684; -709.996; 155.382; 64.387; 263.095; 50.069; 47.84; 373.703; 74.842;
-117.303; 146.866; 171.693; 36.535; 48.783; -116.079; -20.282; 6.377; 5.001; -16.583; 0.506; 82.552;
96.052; -210.158; -201.05; -27.24; -268.39; -75.503; 160.697; 121.984; 7.788; -196.493; -86.462;
-90.693; -691.693; 419.689; -313.173; -73.34; -44.991; 132.638; -46.613; -302.799; 229.549; 92.099;
299.978; -783.563; 214.792; 152.29; 10.151; 418.609
plot(residuals(res$fit))
Autocorrelations of the residuals:
par(mfrow = c(2,1), mar = c(5,4,4,3))
tmp <- acf(residuals(res$fit), lag.max = 53, plot = FALSE)
tmp$acf[1] <- NA
plot(tmp, ylim = c(-0.4, 0.4))
pacf(residuals(res$fit), lag.max = 53, ylim = c(-0.4, 0.4))
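As a numeric complement to these plots, base R's Box.test gives a portmanteau check of residual autocorrelation. The sketch below uses a simulated white-noise series of the same length as a stand-in so it runs on its own; in practice you would pass residuals(res$fit) and set fitdf = 2 to account for the two estimated AR coefficients:

```r
# Ljung-Box test up to one seasonal lag (52 weeks);
# resid_demo is a white-noise stand-in for residuals(res$fit)
set.seed(1)
resid_demo <- rnorm(155)
lb <- Box.test(resid_demo, lag = 52, type = "Ljung-Box")
lb
```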
The autocorrelations of the residuals reveal the significance of the lag of order 52 (one year lag). One way to incorporate this lag is to define it as follows:
xlag52 <- ts(c(rep(NA, 52), head(x, -52)))  # x[t-52], padded with leading NAs
tsp(xlag52) <- tsp(x)
and pass xlag52 through the argument xreg in the function tso or arima. However, probably because the sample is too small for such a long lag, the algorithm runs into difficulties when fitting an ARIMA model.
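For completeness, here is a sketch of what such a call could look like. The series is simulated here so the snippet runs standalone (with the real data, use x and xlag52 as defined above); fitting on the window from 2013 onwards keeps the leading NAs of the lagged regressor out of arima:

```r
# Simulated stand-in for x: a stationary weekly series of the same length
set.seed(123)
x_demo <- ts(2000 + 300 * as.numeric(arima.sim(list(ar = 0.5), 155)),
             start = 2012, frequency = 52)
xlag52_demo <- ts(c(rep(NA, 52), head(x_demo, -52)))
tsp(xlag52_demo) <- tsp(x_demo)

# Fit on the subsample where the lagged regressor is available,
# so the leading NAs never reach arima()
fit52 <- arima(window(x_demo, start = c(2013, 1)),
               order = c(2, 0, 0),
               xreg = window(xlag52_demo, start = c(2013, 1)))
coef(fit52)
```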
Edit
The idea of seasonal dummies mentioned by @IrishStat is much more appropriate than trying to add a lag of order 52. You can use these functions from the forecast package: seasonaldummy, to generate seasonal dummies, or fourier, to generate the trigonometric representation of seasonal dummies. For illustration, and as a complement to the results already shown by @IrishStat, below I show what I get using the first 10 harmonics of the seasonal cycle.
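For reference, fourier(x, K = 10) returns a matrix with 2*K = 20 columns: a sine and a cosine at each of the first 10 harmonics of the 52-week cycle. The same regressors can be built by hand in base R (a sketch; forecast's column names and ordering will differ):

```r
# Hand-built Fourier terms for a series of length n with period 52
n <- 155   # length(x)
period <- 52
K <- 10
t <- seq_len(n)
seas_manual <- do.call(cbind, lapply(seq_len(K), function(k) {
  cbind(sin(2 * pi * k * t / period),
        cos(2 * pi * k * t / period))
}))
dim(seas_manual)  # 155 x 20
```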
seas <- fourier(x, K=10)
trend <- seq_along(x)
mo <- outliers(c("TC", rep("AO", 5)), c(2, 14, 42, 65, 120, 148))
outl <- outliers.effects(mo, length(x))
res2 <- arima(x, order = c(1,0,0), xreg = cbind(seas, trend, outl))
This yields the following residuals:
plot(residuals(res2))
par(mfrow = c(2,1), mar = c(5,4,4,3))
acf(residuals(res2), lag.max = 52)
pacf(residuals(res2), lag.max = 52)
Now, we can see that the autocorrelation at lag 52 is much less prominent in the residuals.
The following estimate of the seasonal pattern is obtained:
fitted.seas <- ts(seas %*% coef(res2)[3:22])
tsp(fitted.seas) <- tsp(x)
plot(fitted.seas, type = "l", main = "estimated seasonal component")
Finally, as you used R for the results shown in your question, if you want to work with the model proposed by @IrishStat, you can do the following:
mo <- outliers(rep("AO", 5), c(14, 65, 42, 39, 120))
outl <- outliers.effects(mo, length(x))
seas <- matrix(0, nrow = length(x), ncol = 52)
for (i in seq(52))
seas[seq(i, length(x), 52),i] <- 1
trend <- seq_along(x)
res3 <- arima(x, order = c(1,0,0), xreg = cbind(seas[,-c(1,18,33)], trend, outl))
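A quick sanity check on the hand-built dummy matrix: each observation should be assigned to exactly one of the 52 weekly dummies. Columns 1, 18 and 33 are dropped in the arima call above; at least one column must be dropped to avoid exact collinearity with the intercept. A standalone sketch (the dummies are rebuilt here so the check runs on its own):

```r
# Rebuild the weekly dummies (as above) and verify that each
# observation belongs to exactly one week-of-year column
n <- 155  # length(x)
seas_check <- matrix(0, nrow = n, ncol = 52)
for (i in seq(52))
  seas_check[seq(i, n, 52), i] <- 1
all(rowSums(seas_check) == 1)  # should be TRUE
```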
Best Answer
See How to detect seasonality from plotted data without using tools or libraries and the linked discussion of when and why you should take logs; both might be of help to you. Untreated deterministic structure (pulses, level shifts, time trends) often incorrectly suggests transforms. Visually, your series might have changes in trend and/or changes in intercept, in addition to some form of ARIMA structure. If you post your data I will try to help further.
Detrending, power transformations, differencing, and ARMA are all forms of transformation. Determining the minimally sufficient (parsimonious) combination requires skillful technique. Simple scripts, i.e. hard-and-fast rules, are to be studiously avoided, as they limit the scope of the solution and often obfuscate.
I should also add that there are two other forms of transformation often suggested by the data ...
The opportunity space, correctly evaluated, can often lead to a "useful model" ... à la GEPB.
EDITED AFTER RECEIPT OF DATA.
Using AUTOBOX, a piece of software that I have helped to develop, the following model was identified automatically: first differences with an AR(2) without lag 1 and three pulses, two of them at the end of the series, using the most recent 29 values. Note that an AR(2) model does not necessarily include both lags, whereas some solutions require that to be true (auto.arima, for example).
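For readers who want to express that model form in base R's arima: "AR(2) without lag 1" can be written by constraining the lag-1 coefficient to zero through the fixed argument (transform.pars must then be FALSE). The sketch below uses a simulated series and omits the pulses, which would enter through xreg; it illustrates the constraint only and does not reproduce AUTOBOX's estimates:

```r
# Simulated integrated series whose differences follow an AR(2)
# with a zero lag-1 coefficient
set.seed(42)
y <- cumsum(as.numeric(arima.sim(list(ar = c(0, 0.4)), 150)))

# First differences, AR(2) with ar1 fixed at 0 (NA = freely estimated);
# pulse dummies would additionally be supplied through xreg
fit_ar2 <- arima(y, order = c(2, 1, 0),
                 fixed = c(0, NA),
                 transform.pars = FALSE)
coef(fit_ar2)
```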
The model statistics and the ACF of the residuals are shown in the linked output. The Actual and Adjusted plot highlights the "unusual values". Of particular interest are the two anomalies at the tail end of the series (information extracted from the data). If one were to believe these two values, rather than adjusting for them, the subsequent forecast would be higher than the one shown.
A possibly useful model would then have no power transform and no weighted least squares, but would include differences, ARMA structure, and three pulses, after segmenting the data into two distinct time ranges: 1-44 and 45-73.