This question is an extension of an earlier question, Linear regression – date as dummy variable. I still have some doubts about selecting the best modeling approach.
An autoregressive model would probably be better, but I am using regression as an alternative.
The initial training data are shown below (in R):
library(dplyr)
library(lubridate)
library(zoo)
library(forecast)
data <- structure(list(Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L), .Label = c("2016",
"2017", "2018", "2019"), class = "factor"), Month = structure(c(2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 1L), .Label = c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12"), class = "factor"), Qty = c(8286,
14804, 8540, 8150, 7410, 7940, 10337, 15306, 7554, 15778, 22091,
28390, 17278, 42589, 11393, 14011, 6726, 27269, 16008, 42521,
17043, 23212, 13752, 22412, 45143, 22428, 16398, 30901, 15760,
23674, 9625, 25319, 34241, 42536, 32043, 62265)), row.names = c(NA,
-36L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
))
test <- structure(list(Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
.Label = "2019", class = "factor"),
Month = structure(1:6, .Label = c("2", "3", "4", "5", "6",
"7"), class = "factor")), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
qtytest <- c(13943, 50692, 19538, 54282, 4986, 12872)
In the first case, 'Year' and 'Month' are extracted from the 'Date' column and converted to factors to build the model:
model <- lm(Qty ~ Year + Month, data = data)
summary(model)$adj.r.squared
[1] 0.5208049
forcastdata <- predict(model, newdata = test)
accuracy(forcastdata, x = qtytest)[, "MAPE"]
[1] 312.0962
In the second case, the original dates (monthly data) are scaled within each year by taking the percentage difference from that year's minimum date value:
data2 <- structure(list(Date = structure(c(16832, 16861, 16892, 16922,
16953, 16983, 17014, 17045, 17075, 17106, 17136, 17167, 17198,
17226, 17257, 17287, 17318, 17348, 17379, 17410, 17440, 17471,
17501, 17532, 17563, 17591, 17622, 17652, 17683, 17713, 17744,
17775, 17805, 17836, 17866, 17897), class = "Date"), Qty = c(8286,
14804, 8540, 8150, 7410, 7940, 10337, 15306, 7554, 15778, 22091,
28390, 17278, 42589, 11393, 14011, 6726, 27269, 16008, 42521,
17043, 23212, 13752, 22412, 45143, 22428, 16398, 30901, 15760,
23674, 9625, 25319, 34241, 42536, 32043, 62265), Month = structure(c(2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 1L), .Label = c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12"), class = "factor")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -36L))
test2 <- structure(list(Date = structure(c(17928, 17956, 17987, 18017,
18048, 18078), class = "Date"), Month = structure(1:6, .Label = c("2",
"3", "4", "5", "6", "7"), class = "factor")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
data2 <- data2 %>%
  group_by(Year = year(Date)) %>%
  mutate(dateInterval = 100 * (as.integer(Date) - first(as.integer(Date))) /
           first(as.integer(Date))) %>%
  ungroup() %>%
  select(dateInterval, Month, Qty)
test2 <- test2 %>%
  group_by(Year = year(Date)) %>%
  mutate(dateInterval = 100 * (as.integer(Date) - first(as.integer(Date))) /
           first(as.integer(Date))) %>%
  ungroup() %>%
  select(dateInterval, Month)
model2 <- lm(Qty ~ dateInterval + Month, data = data2)
summary(model2)$adj.r.squared
[1] 0.2633624
forcastdata2 <- predict(model2, newdata = test2)
accuracy(forcastdata2, x = qtytest)[, "MAPE"]
[1] 72.56004
The accuracy improved with the second model; however, the adjusted R-squared dropped by half. Do you have any thoughts on which way to proceed?
I am also looking for any other approaches that would be well suited to this situation.
Best Answer
Regression against time can often be a very poor alternative to model-identification strategies that incorporate an ARIMA model with possible level shifts and possible time-trend predictors. The reason is that you are specifying the form of the model (no pulses, no level shifts, no ARIMA structure, no changes in trend) rather than determining the form of the model via analytics, as illustrated here: https://autobox.com/pdfs/ARIMA%20FLOW%20CHART.pdf. That process CAN culminate in the regression-against-time model you are specifying. Note that this approach requires equally spaced temporal data, while your regression against time does not, so you may have no choice.
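As a minimal sketch of that analytics-driven route (using `auto.arima()` from the forecast package, which you already load above, to let the data choose the ARIMA form by AICc; the `start = c(2016, 2)` is an assumption matching your data, which begin in February 2016):

```r
library(forecast)

# Same Qty series as in the question, as a monthly ts starting February 2016
qty <- c(8286, 14804, 8540, 8150, 7410, 7940, 10337, 15306, 7554, 15778,
         22091, 28390, 17278, 42589, 11393, 14011, 6726, 27269, 16008,
         42521, 17043, 23212, 13752, 22412, 45143, 22428, 16398, 30901,
         15760, 23674, 9625, 25319, 34241, 42536, 32043, 62265)
y <- ts(qty, start = c(2016, 2), frequency = 12)

# Let auto.arima() search for the model form instead of imposing regression on time
fit <- auto.arima(y)

# Forecast the next six months and compare against the held-out values
qtytest <- c(13943, 50692, 19538, 54282, 4986, 12872)
fc <- forecast(fit, h = 6)
accuracy(fc, qtytest)[, "MAPE"]
```

This is only a starting point; the full procedure in the flow chart also screens for pulses, level shifts, and trend changes before settling on a model.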
An initial glance suggests a few pulses (one-time unusual values) which, if left untreated, can affect the estimated parameters.
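One quick way to screen for such pulses is `forecast::tsoutliers()`, which flags one-off unusual observations and suggests replacement values. This is only a rough sketch, not the full intervention-detection procedure the flow chart describes:

```r
library(forecast)

# Same Qty series as in the question, as a monthly ts starting February 2016
qty <- c(8286, 14804, 8540, 8150, 7410, 7940, 10337, 15306, 7554, 15778,
         22091, 28390, 17278, 42589, 11393, 14011, 6726, 27269, 16008,
         42521, 17043, 23212, 13752, 22412, 45143, 22428, 16398, 30901,
         15760, 23674, 9625, 25319, 34241, 42536, 32043, 62265)
y <- ts(qty, start = c(2016, 2), frequency = 12)

# Flag candidate pulse-like observations and their suggested replacements
out <- tsoutliers(y)
out$index         # positions of flagged points
out$replacements  # suggested cleaned values

# tsclean() applies those replacements (and fills any missing values)
y_clean <- tsclean(y)
```

Refitting any of the models on the cleaned series (or adding pulse dummies for the flagged points) would show how much those observations are driving the estimates.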