Solved – Step-by-step process for forecasting time series in R

Tags: forecasting, r, time series

I have to work with 1000 time series of food retail products (with weekly data).
Each of these time series corresponds to the sales of one product.
I need to obtain forecasts for each of these time series, and I would like to know if I'm going about it the right way.

STEP 1: Data Adjustment
With the group_by function (dplyr package), for each product/time series I add rows for the weeks that are missing and I put zero values for the sales in those new dates that I've just created. For the other variables, like prices, I roll the values forward with the na.locf function to avoid NA values in the newly added dates.
My goal is to clean every time series of bad data and obtain time series with no NA weekly observations from 2014 to today.
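A minimal sketch of this step, assuming hypothetical columns `product`, `week`, `sales` and `price` (tidyr's `complete()` adds the missing weeks; zoo's `na.locf()` carries prices forward):

```r
library(dplyr)
library(tidyr)
library(zoo)

# Toy data: one product with a missing week between two observed ones.
sales_df <- tibble(
  product = "A",
  week    = as.Date(c("2014-01-06", "2014-01-20")),  # 2014-01-13 is missing
  sales   = c(10, 12),
  price   = c(1.99, 2.49)
)

filled <- sales_df %>%
  group_by(product) %>%
  complete(week = seq(min(week), max(week), by = "week")) %>%  # add missing weeks
  mutate(
    sales = replace_na(sales, 0),          # zero sales on the created dates
    price = na.locf(price, na.rm = FALSE)  # carry last known price forward
  ) %>%
  ungroup()
```

The same pattern extends to 1000 products because both `complete()` and `na.locf()` operate within each group.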

STEP 2: Splitting Time Series
With the group_by function, for each product/time series I divide my sample into two groups: a Training Set (80%) and a Test Set (20%). My goal is to find the best possible model that fits the time series of the Training Set and then use that model to produce forecasts for the Test Set. In this way I can compare, even in a plot, the forecasts with the real data of the Test Set.
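For a single series, the split could look like this sketch (cutting chronologically at one time point, since sampling at random would leak future information into the training set):

```r
set.seed(1)
y <- ts(rpois(100, 20), frequency = 52)  # toy weekly sales series

n_train <- floor(0.8 * length(y))
train <- window(y, end   = time(y)[n_train])      # first 80%
test  <- window(y, start = time(y)[n_train + 1])  # last 20%
```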

STEP 3: Creation of time series object and some subset of regressors
With the group_by function, for each product/time series I create the time series object (with the ts function) for the Training Set, then I create several subsets of regressors. I do this because I don't know in advance which subset of regressors I should use to fit a model to my time series.
For example, one subset of regressors could consist of two variables, another subset could consist of only one other variable, and so on.
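A sketch of this step, assuming hypothetical regressors `price` and `promo`; keeping each candidate subset as a matrix makes it easy to pass to `xreg` later:

```r
set.seed(1)
n <- 104  # two years of weekly training data
train_df <- data.frame(
  sales = rpois(n, 20),
  price = runif(n, 1, 3),
  promo = rbinom(n, 1, 0.1)
)

# Weekly ts object for the training sales
y_train <- ts(train_df$sales, start = c(2014, 1), frequency = 52)

# Candidate regressor subsets, each as a matrix for xreg
xreg_sets <- list(
  both       = cbind(price = train_df$price, promo = train_df$promo),
  price_only = cbind(price = train_df$price)
)
```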

STEP 4: Fit a linear model for each of the subsets of regressors that I've just created
With the lm function I try to find the relationship between my dependent variable (the one I need to forecast) and each of the subsets of regressors that I created before.
I don't know if it's correct to use lm; maybe someone could help me with this issue 🙂

STEP 5: Evaluating the best model from step 4 (AIC, BIC, adjusted R^2)
With the CV function I can obtain the criteria (AIC, BIC, adjusted R^2, etc.) that I need to evaluate which subset I should use in the next steps to fit a model to my time series as well as possible.
For now I choose the subset of regressors with the minimum AIC, but that's only my initial choice.
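Steps 4 and 5 together could be sketched like this; `CV()` from the forecast package returns AIC, AICc, BIC and adjusted R^2 for an `lm` fit (the column names and data are illustrative):

```r
library(forecast)

set.seed(1)
train_df <- data.frame(
  sales = rpois(104, 20),
  price = runif(104, 1, 3),
  promo = rbinom(104, 1, 0.1)
)

# One lm per candidate regressor subset
fits <- list(
  both       = lm(sales ~ price + promo, data = train_df),
  price_only = lm(sales ~ price,         data = train_df)
)

scores <- t(sapply(fits, CV))  # one row of criteria per subset
best   <- rownames(scores)[which.min(scores[, "AIC"])]
```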

STEP 6: Transformation to log(x+1) (because of zero values in the time series)
With the group_by function, for each product/time series I transform the Training Set series to log(ts + 1). In that way, I can take the logarithm of every observation of my time series, even those observations where I put zeros in STEP 1. (Can someone say whether this is mathematically correct?)
N.B. I transform only the time series of my dependent variable, not the regressors; could that be a problem?
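In R, `log1p()` computes log(x + 1) directly (and accurately near zero), with `expm1()` as its exact inverse, so zero-sales weeks map to 0 rather than -Inf:

```r
y_train <- ts(c(0, 3, 10, 0, 7), frequency = 52)  # zeros from STEP 1 are fine

y_log  <- log1p(y_train)  # log(x + 1); zeros become 0, not -Inf
y_back <- expm1(y_log)    # exact inverse, recovers the original scale
```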

STEP 7: Handling outliers
With the group_by function, for each product/time series I use the tso function (tsoutliers package) to find whether there are outliers and, if so, to handle those values. This is necessary to obtain reliable forecasts in the next steps.
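A sketch of the outlier step; the `types` argument restricts `tso()` to additive outliers, level shifts and temporary changes, and `$yadj` holds the outlier-adjusted series:

```r
library(tsoutliers)

set.seed(1)
y_log <- ts(log1p(rpois(104, 20)), frequency = 52)
y_log[50] <- y_log[50] + 3  # inject an artificial spike

fit <- tso(y_log, types = c("AO", "LS", "TC"))
fit$outliers         # inspect what was found before accepting the adjustment
y_adj <- fit$yadj    # outlier-adjusted series for the next steps
```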

STEP 8: Fit an auto.arima model with xreg = the best subset of regressors found in STEP 5
With the group_by function, for each product/time series of the Training Set I use the auto.arima function (forecast package) with the xreg parameter and then, if that's not possible, I fall back to the simple ets function with no regressors (especially in those cases where the time series is too short; is that correct?
Often, if I don't use the ets model, I get the error "No suitable ARIMA model found" and I don't know why).
Then, with the forecast function with xreg = the best subset of regressors (with the same length as the Test Set) I obtain my forecasts for the Test Set period.
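One way to sketch the fit-with-fallback and the forecast call; the `tryCatch()` fallback to `ets()` is an assumption about how this might be wired up, and `price` is a hypothetical regressor:

```r
library(forecast)

set.seed(1)
n <- 104; h <- 26
price <- runif(n + h, 1, 3)                       # hypothetical regressor
y     <- ts(log1p(rpois(n, 20)), frequency = 52)  # training series, log scale

x_train <- cbind(price = price[1:n])
x_test  <- cbind(price = price[(n + 1):(n + h)])  # same length as the Test Set

fit <- tryCatch(
  auto.arima(y, xreg = x_train),
  error = function(e) ets(y)  # fallback: ets with no regressors
)

fc <- if (inherits(fit, "ets")) {
  forecast(fit, h = h)            # ets cannot use regressors
} else {
  forecast(fit, xreg = x_test)    # regression with ARIMA errors
}
```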

STEP 9: Back transformation from log to original scale
With the group_by function, for each product/time series I transform the forecasts back to the original scale and plot these values against the real data of the Test Set.
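The back-transformation is just `expm1()`, the exact inverse of `log1p()` (note that exponentiating a point forecast gives roughly the median, not the mean, of the forecast distribution on the original scale):

```r
fc_log  <- c(2.31, 2.45, 2.20)  # toy point forecasts on the log(x + 1) scale
fc_orig <- expm1(fc_log)        # back to the original sales scale

# Round-trip check: log1p() recovers the log-scale forecasts exactly
all.equal(log1p(fc_orig), fc_log)
```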

STEP 10: Evaluate accuracy of the forecasts
With the accuracy function (forecast package) I obtain several measures, like MASE, MAPE, MAE, etc.; considering that my time series all have the same scale, I think it is fine to use the MAE as the accuracy measure; is that true?
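On a toy series, `accuracy()` compares the forecasts against the held-out actuals and returns ME, RMSE, MAE, MPE, MAPE and MASE in one matrix:

```r
library(forecast)

set.seed(1)
y <- ts(rpois(100, 20), frequency = 52)
train <- window(y, end   = time(y)[80])
test  <- window(y, start = time(y)[81])

fc  <- forecast(ets(train), h = length(test))
acc <- accuracy(fc, test)   # rows: "Training set" and "Test set"
acc["Test set", "MAE"]      # the measure proposed above
```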

Can someone help me with this work? Is everything I've just written all right, or am I missing something very important? What do you think?
Are my steps correct? Would anyone add something else?

Thanks in advance guys!

Best Answer

This is a very broad question. I'll comment on a few points.

  1. What do you mean by "I put zero values for the sales in those new dates that I've just created"? This sounds like you fill in zeros at the end of series that ended before a common end date, and then you'd attempt to forecast zero sales for products that have been delisted. That wouldn't make sense, so I'm sure I'm misunderstanding something.

    You should think long and hard about whether filling in NAs with zeros or using na.locf (as below) makes sense. Maybe some of your products are seasonal and are simply not sold in summer or in winter? If so, filling in makes no sense. Yes, in such a case ARIMA will have problems, although AFAIK auto.arima() uses some kind of state-space approach that can deal with NAs. You may want to look at fitting simple linear models with a trend regressor and multiple seasonal dummies.
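The linear model with trend and seasonal dummies suggested above can be sketched with the forecast package's tslm(), where `trend` and `season` are recognized specials:

```r
library(forecast)

set.seed(1)
y <- ts(rpois(156, 20), frequency = 52)  # three years of weekly data

fit <- tslm(y ~ trend + season)  # linear trend + 51 weekly seasonal dummies
fc  <- forecast(fit, h = 52)     # one year ahead
```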

  2. This looks fine to me.

  3. I assume you are using the frequency parameter to tell R that you have a 52-period seasonality?

    It's unclear to me what kind of regressors you are setting up here. Prices? Promotions?

  4. I don't think this makes sense. In step 8 below, you fit a regression with ARIMA errors, to take seasonality, autoregression and MA behavior into account, so you already believe that an OLS model is wrongly specified. So why fit it? Better to skip steps 4 and 5 and evaluate the regressions with ARIMA models directly on the holdout sample.

  5. See above. Information criteria have only a tenuous relationship to forecasting accuracy, if that's what you are after.

  6. Think about where your zeros come from. (See above.) Are your retail sales per store, or aggregated across multiple stores? If the latter, you shouldn't have too many zeros... except for delistings, again see above. If you have really low volume data, because of extreme slow movers, then regression/ARIMA doesn't make a lot of sense, because it inherently presupposes continuous data. I recently wrote a little article on forecasting count data in retail (Kolassa, 2016, International Journal of Forecasting), which may be interesting to you.

    Taking logs may or may not be worthwhile.

  7. I'd encourage you to look at each outlier separately and think about what may have caused that outlier. Data errors? Someone buying a heap of product? (That should only show up in a single store and even out if you are looking at aggregate data.) Plus, are you considering your regressors in assessing outliers? An "outlier" may simply be the effect of a successful promotion. Needless to say, you don't want to remove or "correct" these data, but capture and forecast the promotion effect.

  8. See above. This should really be the core of your process. You can iterate the fitting over different subsets of regressors and evaluate each model. If you can't fit an ARIMA model, it makes sense to look at the time series, and the residuals from the regression, and try to understand what's happening. Sometimes you do run into numerical difficulties in fitting. In such a case ets() is a good fallback solution, or you could first run a regression on your regressors, then feed the residuals into ets().
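The two-stage fallback mentioned above could look like the following sketch: regress out the regressor effects, hand the residuals to `ets()`, and add the two forecasts back together (`price` is again a hypothetical regressor):

```r
library(forecast)

set.seed(1)
n <- 104; h <- 8
price <- runif(n + h, 1, 3)
y <- ts(20 - 3 * price[1:n] + rnorm(n), frequency = 52)

df_train <- data.frame(price = price[1:n])
reg_fit  <- tslm(y ~ price, data = df_train)  # regression on the regressors
res_fit  <- ets(residuals(reg_fit))           # ets on what's left over

# Combined forecast: regression part (needs future prices) + residual part
df_future <- data.frame(price = price[(n + 1):(n + h)])
fc <- as.numeric(forecast(reg_fit, newdata = df_future)$mean) +
      as.numeric(forecast(res_fit, h = h)$mean)
```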

  9. See above. Logs may or may not improve your forecasts.

  10. This makes sense, of course. Given that all your series share the same scale, the MAE is usually good. Also look at the distribution of residuals, especially if you have low volume count data (see the paper I referenced for details), and at whether your forecasts are systematically biased.

You may want to browse through some previous questions on forecasting in retail.