Solved – Linear Regression – approaching models with Date as variable

regression

This question is an extension of an earlier question, Linear regression – date as dummy variable. I still have some doubts about selecting the best modeling approach.

An autoregressive model would likely be better, but I am using regression as an alternative.

The initial training data (in R) are shown below:

library(dplyr)
library(lubridate)
library(zoo)
library(forecast)
data <- structure(list(Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L), .Label = c("2016", 
"2017", "2018", "2019"), class = "factor"), Month = structure(c(2L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 10L, 11L, 12L, 1L), .Label = c("1", "2", "3", "4", "5", "6", 
"7", "8", "9", "10", "11", "12"), class = "factor"), Qty = c(8286, 
14804, 8540, 8150, 7410, 7940, 10337, 15306, 7554, 15778, 22091, 
28390, 17278, 42589, 11393, 14011, 6726, 27269, 16008, 42521, 
17043, 23212, 13752, 22412, 45143, 22428, 16398, 30901, 15760, 
23674, 9625, 25319, 34241, 42536, 32043, 62265)), row.names = c(NA, 
-36L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
))



test <- structure(list(Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L), 
   .Label = "2019", class = "factor"), 
    Month = structure(1:6, .Label = c("2", "3", "4", "5", "6", 
    "7"), class = "factor")), class = c("spec_tbl_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L))
qtytest <- c(13943, 50692, 19538, 54282, 4986, 12872)

In the first case, 'Year' and 'Month' are extracted from the 'Date' column and converted to factors to build the model:

model <- lm(Qty ~ Year + Month, data = data)
summary(model)$adj.r.squared
[1] 0.5208049
forcastdata <- predict(model, newdata = test)
accuracy(forcastdata, x = qtytest)[, "MAPE"]
[1] 312.0962
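Since these are regularly spaced monthly observations, a related formulation (an assumption on my part, not from the original question) is `forecast::tslm()` with a numeric trend plus monthly seasonal dummies. Unlike the `Year` factor, a numeric trend extrapolates naturally into future years:

```r
library(forecast)

# Hedged sketch: trend + seasonal-dummy regression via tslm(), as an
# alternative to the Year-factor + Month-factor model above.
# 'qty' is the training Qty column from the question, Feb 2016 - Jan 2019.
qty <- c(8286, 14804, 8540, 8150, 7410, 7940, 10337, 15306, 7554, 15778,
         22091, 28390, 17278, 42589, 11393, 14011, 6726, 27269, 16008,
         42521, 17043, 23212, 13752, 22412, 45143, 22428, 16398, 30901,
         15760, 23674, 9625, 25319, 34241, 42536, 32043, 62265)
qty_ts <- ts(qty, start = c(2016, 2), frequency = 12)

fit <- tslm(qty_ts ~ trend + season)  # linear trend + 11 monthly dummies
fc  <- forecast(fit, h = 6)           # Feb-Jul 2019

qtytest <- c(13943, 50692, 19538, 54282, 4986, 12872)
accuracy(fc, qtytest)[, "MAPE"]
```

Whether this beats the dummy-variable models depends on how linear the year-over-year growth actually is.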

In the second case, the original dates (monthly data) are rescaled within each year, taking the percentage difference from that year's first date value:

 data2 <- structure(list(Date = structure(c(16832, 16861, 16892, 16922, 
16953, 16983, 17014, 17045, 17075, 17106, 17136, 17167, 17198, 
17226, 17257, 17287, 17318, 17348, 17379, 17410, 17440, 17471, 
17501, 17532, 17563, 17591, 17622, 17652, 17683, 17713, 17744, 
17775, 17805, 17836, 17866, 17897), class = "Date"), Qty = c(8286, 
14804, 8540, 8150, 7410, 7940, 10337, 15306, 7554, 15778, 22091, 
28390, 17278, 42589, 11393, 14011, 6726, 27269, 16008, 42521, 
17043, 23212, 13752, 22412, 45143, 22428, 16398, 30901, 15760, 
23674, 9625, 25319, 34241, 42536, 32043, 62265), Month = structure(c(2L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 10L, 11L, 12L, 1L), .Label = c("1", "2", "3", "4", "5", "6", 
"7", "8", "9", "10", "11", "12"), class = "factor")), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -36L))



 test2 <- structure(list(Date = structure(c(17928, 17956, 17987, 18017, 
18048, 18078), class = "Date"), Month = structure(1:6, .Label = c("2", 
"3", "4", "5", "6", "7"), class = "factor")), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

data2 <- data2 %>%
  group_by(Year = year(Date)) %>%
  mutate(dateInterval = 100 * (as.integer(Date) - first(as.integer(Date))) /
           first(as.integer(Date))) %>%
  ungroup() %>%
  select(dateInterval, Month, Qty)


test2 <- test2 %>%
  group_by(Year = year(Date)) %>%
  mutate(dateInterval = 100 * (as.integer(Date) - first(as.integer(Date))) /
           first(as.integer(Date))) %>%
  ungroup() %>%
  select(dateInterval, Month)


model2 <- lm(Qty ~ dateInterval + Month, data = data2)
summary(model2)$adj.r.squared
[1] 0.2633624
forcastdata2 <- predict(model2, newdata = test2)
accuracy(forcastdata2, x = qtytest)[, "MAPE"]
[1] 72.56004

The accuracy improved with the second model; however, the adjusted R-squared dropped by half. Do you have any thoughts on which approach to proceed with?

Also, I am looking for any other approaches that would be better suited to this situation.

Best Answer

Regression against time can often be a very poor alternative to model-identification strategies that incorporate an ARIMA model with possible level shifts and possible time-trend predictors. The reason is that you are specifying the form of the model in advance (no pulses, no level shifts, no ARIMA structure, no changes in trend) rather than determining the form of the model via analytics, as illustrated here: https://autobox.com/pdfs/ARIMA%20FLOW%20CHART.pdf. That process CAN culminate in the regression-against-time model you are specifying. Note that this approach requires equally spaced temporal data, while your regression against time does not, so you may have no choice.
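One way to let the data choose the ARIMA form, rather than fixing the model by hand, is `forecast::auto.arima()`, which searches over (seasonal) ARIMA orders by information criterion. A minimal sketch on the questioner's series (the `ts` construction and start date are my assumptions from the data shown):

```r
library(forecast)

# Training Qty from the question, Feb 2016 - Jan 2019, as a monthly ts.
qty <- c(8286, 14804, 8540, 8150, 7410, 7940, 10337, 15306, 7554, 15778,
         22091, 28390, 17278, 42589, 11393, 14011, 6726, 27269, 16008,
         42521, 17043, 23212, 13752, 22412, 45143, 22428, 16398, 30901,
         15760, 23674, 9625, 25319, 34241, 42536, 32043, 62265)
qty_ts <- ts(qty, start = c(2016, 2), frequency = 12)

# auto.arima searches (p,d,q)(P,D,Q) orders and drift by AICc,
# instead of assuming the model form up front.
fit <- auto.arima(qty_ts)
fc  <- forecast(fit, h = 6)   # next 6 months: Feb-Jul 2019

qtytest <- c(13943, 50692, 19538, 54282, 4986, 12872)
accuracy(fc, qtytest)[, "MAPE"]
```

This does not handle pulses or level shifts by itself, but it is a reasonable baseline before reaching for intervention-detection tools.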

An initial glance suggests a few pulses (one-time unusual values) which, if left untreated, can affect the estimated parameters.
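As a quick, hedged way to screen for such pulses (this is my suggestion, not part of the original answer), the forecast package offers `tsoutliers()` to flag suspected outliers and `tsclean()` to replace them:

```r
library(forecast)

# Same monthly training series as in the question, Feb 2016 - Jan 2019.
qty <- c(8286, 14804, 8540, 8150, 7410, 7940, 10337, 15306, 7554, 15778,
         22091, 28390, 17278, 42589, 11393, 14011, 6726, 27269, 16008,
         42521, 17043, 23212, 13752, 22412, 45143, 22428, 16398, 30901,
         15760, 23674, 9625, 25319, 34241, 42536, 32043, 62265)
qty_ts <- ts(qty, start = c(2016, 2), frequency = 12)

# tsoutliers() returns the indices of suspected one-time unusual values
# together with suggested replacement values.
out <- tsoutliers(qty_ts)
out$index
out$replacements

# tsclean() applies those replacements (and interpolates missing values),
# giving a series on which parameter estimates are less distorted.
cleaned <- tsclean(qty_ts)
```

More formal intervention detection (pulses, level shifts, trend changes estimated jointly with the ARIMA model) requires specialized software such as the autobox approach linked above.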