Solved – Which to compromise on in multiple regression: MAPE or adjusted R-squared?

forecasting, multiple regression, prediction, r-squared, time series

I'm trying to forecast sales of a product based on other variables such as competitor sales, fuel price and CPI (Consumer Price Index).

The output below (from a model fitted on months 1 to 44) gives me the lowest MAPE, 11.62, when I validate against actual sales for months 45 to 48:

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)           -2320.6320   496.3898  -4.675 3.83e-05 ***
Sales lag_1               0.2124     0.1119   1.898 0.065515 .
Competi_sales(1) lag1    -1.6535     0.8875  -1.863 0.070404 .
Competi_Sales(1)_lag3    -5.4108     0.8352  -6.478 1.42e-07 ***
Competi sales(2)_lag1     2.3004     0.5726   4.017 0.000277 ***
Fuel price              -48.3714    17.5225  -2.761 0.008926 **
CPI                      22.2696     3.4485   6.458 1.51e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 212.7 on 37 degrees of freedom
Multiple R-squared:  0.7252,    Adjusted R-squared:  0.6806
F-statistic: 16.27 on 6 and 37 DF,  p-value: 4.58e-09

I understand that by removing Sales lag_1 and Competi_sales(1) lag1 from the model (since neither is significant at alpha = 0.05), the adjusted R-squared could be improved from 0.6806, but when I do that the MAPE increases. For business use, MAPE is often preferred because managers apparently understand percentages better than other accuracy measures.

Should I go ahead and forecast the sales using this model or should I remove the insignificant variables?
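
For reference, the fit-and-validate step looks roughly like the sketch below; sales_df and its column names are illustrative stand-ins, not the actual data or code behind the output above.

# Illustrative only: sales_df is assumed to hold 48 monthly rows with the
# response and the already-constructed lagged predictors.
train <- sales_df[1:44, ]
test  <- sales_df[45:48, ]

fit_full <- lm(Sales ~ Sales_lag1 + Competi_sales1_lag1 + Competi_sales1_lag3 +
                 Competi_sales2_lag1 + Fuel_price + CPI, data = train)

pred <- predict(fit_full, newdata = test)
mape <- mean(abs((test$Sales - pred) / test$Sales)) * 100
mape   # compare with the 11.62 reported above for the full model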

Best Answer

+1 to @RichardHardy's comment. In-sample fit is not a good guide to out-of-sample forecast accuracy. Relying on in-sample fits can/will lead to overfitting and poor out-of-sample performance. Instead, use a holdout sample and check accuracy on that.
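
As a rough sketch of such a holdout comparison, reusing the illustrative sales_df, train and test objects from the question above (accuracy() in the forecast package reports MAPE alongside ME, RMSE, MAE and MPE):

library(forecast)

# Both candidate models are fitted on months 1-44 only.
fit_full    <- lm(Sales ~ Sales_lag1 + Competi_sales1_lag1 + Competi_sales1_lag3 +
                    Competi_sales2_lag1 + Fuel_price + CPI, data = train)
fit_reduced <- update(fit_full, . ~ . - Sales_lag1 - Competi_sales1_lag1)

# Out-of-sample accuracy on the holdout months 45-48
accuracy(predict(fit_full,    newdata = test), test$Sales)
accuracy(predict(fit_reduced, newdata = test), test$Sales)

# In-sample adjusted R^2, shown for comparison only -- do not select on it
c(full    = summary(fit_full)$adj.r.squared,
  reduced = summary(fit_reduced)$adj.r.squared)

Whichever candidate does better on the holdout (or, better still, across several rolling forecast origins) is the one to prefer, regardless of which has the higher adjusted $R^2$.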

I heartily recommend the free open-source online textbook Forecasting: Principles and Practice by Hyndman and Athanasopoulos, especially the section on evaluating forecast accuracy.

In addition, it is not automatically the case that removing insignificant predictors will improve your adjusted $R^2$; it may or may not. Dropping a single regressor raises the adjusted $R^2$ only if the absolute value of its $t$ statistic is below 1, and both of your candidate predictors have $|t|$ around 1.9, so dropping either one on its own will actually lower it.

Finally, you are including lagged sales. You may want to look at ARIMA models, e.g., auto.arima() in the forecast package, where you can include additional eXplanatory or eXternal variables such as lagged competitor sales or CPI via the xreg parameter - note that you will then need to forecast these regressors out of sample as well.
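
A rough sketch of what that could look like, again reusing the illustrative train/test split from above; xreg must be a numeric matrix, and frequency = 12 simply assumes monthly data:

library(forecast)

# The AR part of the ARIMA model takes over the role of lagged sales, so only
# the external regressors go into xreg. For genuine future forecasts, these
# regressors (lagged competitor sales, fuel price, CPI) would themselves have
# to be forecast first.
xreg_cols   <- c("Competi_sales1_lag1", "Competi_sales1_lag3",
                 "Competi_sales2_lag1", "Fuel_price", "CPI")
xreg_train  <- as.matrix(train[, xreg_cols])
xreg_future <- as.matrix(test[, xreg_cols])

sales_ts   <- ts(train$Sales, frequency = 12)   # monthly data assumed
fit_arimax <- auto.arima(sales_ts, xreg = xreg_train)

fc <- forecast(fit_arimax, xreg = xreg_future)
accuracy(fc, test$Sales)   # holdout accuracy for months 45-48, including MAPE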