Solved – How to model and make predictions on time series data in R

r, time-series

I have a data frame in R with two columns; each row holds a ratio for one year. There are 130 rows in total. It looks like this:

[image: preview of the data frame]

When plotted out, it's like this:
[image: plot of the series]

I am currently using the code below in R to model this time series:

myts <- ts(df_130$sb_ab_ratio, 
           start = min(df_130$yearID), 
           end = max(df_130$yearID),
           frequency = 20)
plot(myts)
fit <- HoltWinters(myts)
accuracy(fit)

Basically, from eyeballing the plot, I thought there was a cyclical effect roughly every 20 years, so I simply set frequency to 20. I am wondering:

  • Is this the right way to do it? If not, what's the right approach?
  • Does it make more sense just to use a nonlinear regression model to represent it? If yes, which one?

Thanks!

EDIT — adding full dataset in csv format:

"","yearID","sb_ab_ratio"
"1",1886,0.0444691266609075
"2",1887,0.0849011159579308
"3",1888,0.0697252084465595
"4",1889,0.0629723580637569
"5",1890,0.0606468167942309
"6",1891,0.0536301933360757
"7",1892,0.0500970630596781
"8",1893,0.0483672536820275
"9",1894,0.0544773638065905
"10",1895,0.0510463800904977
"11",1896,0.0550107070234475
"12",1897,0.0476847430047048
"13",1898,0.0330189432023109
"14",1899,0.0425104442389259
"15",1900,0.0430849432689359
"16",1901,0.036962583490046
"17",1902,0.0351642002704938
"18",1903,0.0362544572436008
"19",1904,0.0337018717874115
"20",1905,0.0358242510141244
"21",1906,0.0374214661320743
"22",1907,0.0345810918509663
"23",1908,0.033874566187407
"24",1909,0.037860368183397
"25",1910,0.0400402241762015
"26",1911,0.0413440147822176
"27",1912,0.0412608637355404
"28",1913,0.0400167454688731
"29",1914,0.0372878037638575
"30",1915,0.0337375928482219
"31",1916,0.0336047263894145
"32",1917,0.0294071049905551
"33",1918,0.0295476491123821
"34",1919,0.0278153500020088
"35",1920,0.020421012616123
"36",1921,0.0174587892734702
"37",1922,0.0170013826072692
"38",1923,0.0183857607152495
"39",1924,0.0177284860510713
"40",1925,0.0162468395917221
"41",1926,0.0152235170503391
"42",1927,0.0170137696688412
"43",1928,0.0150213068181818
"44",1929,0.0155967609435512
"45",1930,0.0124753092837093
"46",1931,0.0125600295530107
"47",1932,0.0113555550457663
"48",1933,0.010091290981202
"49",1934,0.0105807172539023
"50",1935,0.0102155327001168
"51",1936,0.010988384034323
"52",1937,0.0118772029826786
"53",1938,0.0105395645371884
"54",1939,0.0112748736436574
"55",1940,0.0110810088020185
"56",1941,0.0102616069140634
"57",1942,0.0114772306553128
"58",1943,0.0119242984144225
"59",1944,0.01073950855075
"60",1945,0.0115812284628228
"61",1946,0.0104822548705726
"62",1947,0.009000900090009
"63",1948,0.00961959934131807
"64",1949,0.00865133917990045
"65",1950,0.00766301592728387
"66",1951,0.0101686945277141
"67",1952,0.00915731337965437
"68",1953,0.00785910090944386
"69",1954,0.00828011818528403
"70",1955,0.00830242852015791
"71",1956,0.0085742224766266
"72",1957,0.00903351942147787
"73",1958,0.0088396340081358
"74",1959,0.0101193442000617
"75",1960,0.0109862641940629
"76",1961,0.0107799488828428
"77",1962,0.0121783752529633
"78",1963,0.0112553954869142
"79",1964,0.0106460023174971
"80",1965,0.0132040568986413
"81",1966,0.0132916769437365
"82",1967,0.0125726844008974
"83",1968,0.0139474507926571
"84",1969,0.0140912657003359
"85",1970,0.0144392311185107
"86",1971,0.013520345630592
"87",1972,0.0144759188643574
"88",1973,0.01531394725112
"89",1974,0.0185927292523591
"90",1975,0.0191826458664517
"91",1976,0.0232199201672686
"92",1977,0.0209551724616945
"93",1978,0.0211843155537661
"94",1979,0.0208695164995238
"95",1980,0.0228496115427303
"96",1981,0.0213937142070776
"97",1982,0.0220327577714726
"98",1983,0.0231506639356825
"99",1984,0.0210805887547018
"100",1985,0.0216459898654552
"101",1986,0.0231436837029894
"102",1987,0.0248794198271973
"103",1988,0.023146849222827
"104",1989,0.0218175198325176
"105",1990,0.0230443796929284
"106",1991,0.0218230653013262
"107",1992,0.0228419468840757
"108",1993,0.0210522920094197
"109",1994,0.0204777537953676
"110",1995,0.0211660448434377
"111",1996,0.0206567560155866
"112",1997,0.0212817972439172
"113",1998,0.0196510208477943
"114",1999,0.020468361095156
"115",2000,0.0174786299240839
"116",2001,0.0186664581252933
"117",2002,0.0166080854199128
"118",2003,0.015425490443033
"119",2004,0.0154643179387283
"120",2005,0.0154206871674633
"121",2006,0.0165291231676636
"122",2007,0.0173915116549352
"123",2008,0.0167892318581523
"124",2009,0.0179078559412478
"125",2010,0.0178829534390062
"126",2011,0.0197821429649075
"127",2012,0.0195399725266413
"128",2013,0.016210031914253
"129",2014,0.016689410315553
"130",2015,0.0151370492120275

Best Answer

Analyzing ratios is never a good idea unless that is all you have. It is preferable to model Y (the numerator) as a function of X (the denominator) and, upon arriving at a useful model, use a predicted X to obtain a predicted Y, then convert the two predictions to a ratio.

Your series appears to show a significant change (reduction) in error variance, so you will probably have to use some form of weighted least squares to compensate for this inequality. Alternatively, there could be a change in parameters over time. It is nearly impossible to decide between the two simply based on the graph.
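One rough way to check the apparent reduction in variance, sketched here in base R, is to compare the spread of year-to-year changes at the start and at the end of the series. The two vectors are the first 16 and last 16 values copied from the CSV in the question; the 16-year window and the names `early`, `late`, and `var_ratio` are arbitrary choices for illustration, not part of the original analysis.

```r
# First 16 values (1886-1901) and last 16 values (2000-2015) of sb_ab_ratio,
# copied from the CSV in the question.
early <- c(0.0444691266609075, 0.0849011159579308, 0.0697252084465595,
           0.0629723580637569, 0.0606468167942309, 0.0536301933360757,
           0.0500970630596781, 0.0483672536820275, 0.0544773638065905,
           0.0510463800904977, 0.0550107070234475, 0.0476847430047048,
           0.0330189432023109, 0.0425104442389259, 0.0430849432689359,
           0.036962583490046)
late <- c(0.0174786299240839, 0.0186664581252933, 0.0166080854199128,
          0.015425490443033, 0.0154643179387283, 0.0154206871674633,
          0.0165291231676636, 0.0173915116549352, 0.0167892318581523,
          0.0179078559412478, 0.0178829534390062, 0.0197821429649075,
          0.0195399725266413, 0.016210031914253, 0.016689410315553,
          0.0151370492120275)

# Year-to-year changes remove most of the slow-moving level,
# so their spread is a crude proxy for the error variance.
sd_early <- sd(diff(early))
sd_late  <- sd(diff(late))

# A ratio far from 1 points to non-constant variance.
var_ratio <- (sd_early / sd_late)^2
print(c(sd_early = sd_early, sd_late = sd_late, var_ratio = var_ratio))

# Formal (but rough) F test for equal variances of the changes.
print(var.test(diff(early), diff(late)))
```

A ratio well above 1 would be consistent with the early years being much noisier than the recent ones. It does not by itself distinguish a variance change from a change in model parameters.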

Simple AIC/BIC search procedures that attempt to use only an ARIMA model may lead to unexpected forecasts as a result of ignoring the anomalies in the data. Your data plot suggests a number of possible deterministic effects, such as pulses and either one or more level shifts or a localized time trend. My suggestion is to combine some deterministic structure with some ARIMA structure (memory), resulting in an error process free of structure. Of course, all of this could change when you examine how Y responds to X.
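A minimal base-R sketch of combining deterministic structure with ARIMA memory, using `arima()` with an `xreg` matrix of dummy variables. Everything here is illustrative: the series is simulated (in units of ratio × 100), and the level-shift year, pulse year, and AR(1) order are assumptions rather than values identified from the real data.

```r
set.seed(1)

# Simulated stand-in for an annual series (NOT the real data):
# level 4 before the assumed shift year, 2 after, one pulse, plus AR(1) noise.
years <- 1886:2015
n     <- length(years)
shift <- as.numeric(years >= 1920)   # hypothetical level shift
pulse <- as.numeric(years == 1887)   # hypothetical one-off pulse
noise <- as.numeric(arima.sim(list(ar = 0.6), n = n, sd = 0.2))
y     <- ts(4 - 2 * shift + 1.5 * pulse + noise,
            start = min(years), frequency = 1)  # annual data: frequency = 1

# Deterministic effects enter as regressors; the ARIMA part models the memory.
xreg <- cbind(shift = shift, pulse = pulse)
fit  <- arima(y, order = c(1, 0, 0), xreg = xreg)
print(fit)

# The residuals should be free of structure if the model is adequate.
print(Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1))

# Forecasting needs future values of the regressors:
# the shift stays on, the pulse does not recur.
h       <- 5
newxreg <- cbind(shift = rep(1, h), pulse = rep(0, h))
fc      <- predict(fit, n.ahead = h, newxreg = newxreg)
print(fc$pred)
```

On real data the dummy variables would have to be chosen by inspecting the series or by an outlier and level-shift detection procedure; the dates above are placeholders.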

Hope this helps. Other readers may have specific suggestions as to the freely available tools in R that can help you, or at least enlighten you as to their capabilities when faced with a challenging series like this one.

EDITED AFTER RECEIPT OF DATA:

Here is the plot of the ACF for the data: [image: ACF plot]

AUTOBOX found a structural breakpoint in the parameters starting at 1948: [image]. The equation is here: [image]. It suggests a random walk plus three anomalies. The actual/fit and forecast plot is here: [image], with the residual plot here: [image] and the residual ACF here: [image].

Here is a plot of the most recent 68 values: [image]

Hope this helps. Just because a piece of software has powerful analytics to detect structure doesn't mean that all models should be complex, just complex enough! In this case the most recent set of values is essentially informationless (i.e., a random walk) except for the most recent value. This model is superior to a mean model or an ARMA model in the sense that the most recent value is the best forecast.
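Under a random walk without drift, the forecast for every horizon is simply the last observed value, with uncertainty widening in proportion to the square root of the horizon. A short base-R sketch follows, using the last 16 values (2000-2015) from the CSV in the question. The window length, the horizon `h`, and the 1.96 multiplier for approximate 95% limits are illustrative choices, not output from AUTOBOX.

```r
# Last 16 values of sb_ab_ratio (2000-2015), copied from the question.
y <- c(0.0174786299240839, 0.0186664581252933, 0.0166080854199128,
       0.015425490443033, 0.0154643179387283, 0.0154206871674633,
       0.0165291231676636, 0.0173915116549352, 0.0167892318581523,
       0.0179078559412478, 0.0178829534390062, 0.0197821429649075,
       0.0195399725266413, 0.016210031914253, 0.016689410315553,
       0.0151370492120275)
h <- 5

# Point forecast: repeat the most recent value at every horizon.
point_fc <- rep(tail(y, 1), h)

# Innovation sd estimated from first differences
# (no drift, so the mean of the differences is not removed).
sigma <- sqrt(mean(diff(y)^2))

# The h-step forecast error sd grows like sigma * sqrt(h).
se <- sigma * sqrt(seq_len(h))
fc <- data.frame(year     = 2015 + seq_len(h),
                 forecast = point_fc,
                 lower    = point_fc - 1.96 * se,
                 upper    = point_fc + 1.96 * se)
print(fc)
```

The flat point forecast with widening limits is what "the most recent value is the best forecast" looks like in practice.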

The statistics of this model are here: [image]

Note that a (1,0,0) model whose AR(1) parameter is close to 1 is approximately equal to a first-difference model without drift. Here are the results of specifying an AR(1) model: [image]. The estimated coefficient is .947.
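To see why an AR(1) with a coefficient of .947 behaves much like a first-difference model at short horizons, compare the two forecast functions: the AR(1) forecast is mu + phi^h (y_T - mu), while the random walk stays at y_T. In the sketch below, `phi` is the coefficient quoted above and `y_T` is the 2015 value from the question's CSV; the mean `mu` is a made-up placeholder, since the fitted mean is not given in the text.

```r
phi <- 0.947                # AR(1) coefficient reported in the answer
y_T <- 0.0151370492120275   # last observation (2015) from the question's CSV
mu  <- 0.018                # hypothetical series mean, for illustration only
h   <- 1:10

ar1_fc <- mu + phi^h * (y_T - mu)   # AR(1) forecasts decay slowly toward mu
rw_fc  <- rep(y_T, length(h))       # random-walk forecasts stay at y_T

cmp <- data.frame(h = h, ar1 = ar1_fc, random_walk = rw_fc,
                  share_of_gap_kept = phi^h)
print(cmp)
```

With `phi` close to 1, the AR(1) forecasts leave `y_T` only gradually, so over a few steps they are hard to distinguish from the random-walk forecast.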